Awesome-omni-skill mojo
Mojo development patterns for high-performance computing: SIMD vectorization, zero-copy Python interop, GIL-free parallelism, C FFI (external_call, DLHandle), Python C-API extensions, and Hatch build integration. Use when writing Mojo kernels or hybrid Python-Mojo projects.
```sh
# Clone the repository
git clone https://github.com/diegosouzapw/awesome-omni-skill

# Or install the skill directly into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/development/mojo" ~/.claude/skills/diegosouzapw-awesome-omni-skill-mojo && rm -rf "$T"
```
skills/development/mojo/SKILL.md

Mojo (High-Performance Computing)
Scope
- Work in `src/mo/` (Mojo source) and `src/py/` (Python wrapper).
- Build high-performance kernels for numeric, AI, and data-intensive workloads.
- Develop Python extensions using Mojo's native C-API features.
- Call C/C++ libraries via C FFI (`external_call`, `DLHandle`).
- Zero-copy data exchange between Python and Mojo.
Core Rules
Strict Typing
- Prefer `fn` over `def`: strict type checking, predictable performance, deterministic behavior.
- Explicit types for all function arguments and return values.
- Use `alias` for compile-time constants.

```mojo
fn dot_product(a: SIMD[DType.float32, 8], b: SIMD[DType.float32, 8]) -> Float32:
    return (a * b).reduce_add()
```
Memory Safety
- Leverage ownership semantics: `owned`, `borrowed`, `inout`.
- Document every use of `UnsafePointer` with a safety comment.
- Prefer stack allocation over heap when size is known at compile time.

```mojo
fn process(borrowed data: Tensor[DType.float32]) -> Tensor[DType.float32]:
    # borrowed: read-only, no copy, caller retains ownership
    var result = Tensor[DType.float32](data.shape())
    # ... transform ...
    return result^  # Transfer ownership to caller
```
Tooling
- Use `uv` exclusively for Python environment management. Never `pixi` or `conda`.
- Use the `mojo` CLI for compilation and testing.
Docs Retrieval Policy (Modular)
Use Modular docs as the primary source of truth for Mojo behavior, MAX integration details, and assistant workflows.
Resolution order for reference context:
- Use scoped reference files first:
- https://docs.modular.com/llms-mojo.txt
- https://docs.modular.com/llms-mojo-python.txt
- https://docs.modular.com/llms-mojo-kernel.txt
- https://docs.modular.com/llms-max.txt
- Use https://docs.modular.com/llms.txt for broad discovery.
- Use https://docs.modular.com/llms-full.txt only when scoped files are insufficient.
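The resolution order above can be sketched as a small lookup helper. The helper name and scope keys are hypothetical; only the URLs come from the list above:

```python
# Hypothetical helper illustrating the resolution order; only the URLs
# are taken from the docs list above.
SCOPED_REFS = {
    "mojo": "https://docs.modular.com/llms-mojo.txt",
    "mojo-python": "https://docs.modular.com/llms-mojo-python.txt",
    "mojo-kernel": "https://docs.modular.com/llms-mojo-kernel.txt",
    "max": "https://docs.modular.com/llms-max.txt",
}
DISCOVERY = "https://docs.modular.com/llms.txt"
FULL = "https://docs.modular.com/llms-full.txt"

def resolve_reference(scope: str, scoped_was_sufficient: bool = True) -> str:
    """Scoped file first, then broad discovery, then the full dump."""
    if scope in SCOPED_REFS:
        return SCOPED_REFS[scope]
    return DISCOVERY if scoped_was_sufficient else FULL
```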
Doc-linking requirements:
- Prefer deep links to exact manual/API pages instead of only top-level doc hubs.
- Include source links when behavior depends on implementation details.
- Avoid version-pinned changelog anchors; link to the latest changelog root.
Assistant Integration
Place shared Modular guidance in the assistant root context file for your environment:
- Codex / Gemini CLI: `AGENTS.md`
- Claude Code / Cline / Roo: `CLAUDE.md`
- Cursor: `.cursorrules` or `.cursor/rules`
- Windsurf: `.windsurfrules` or `.windsurf/rules`
For Codex-style instructions, require links under:
- MAX docs: `https://docs.modular.com/max/...`
- Mojo docs: `https://docs.modular.com/mojo/...`
- Source code: `https://github.com/modular/modular/tree/main/...`
Performance: SIMD + Parallelize
SIMD-First Vectorization
Replace scalar loops with SIMD operations:
```mojo
from algorithm import vectorize
from sys.info import simdwidthof

fn relu_simd(inout tensor: Tensor[DType.float32]):
    alias simd_width = simdwidthof[DType.float32]()

    @parameter
    fn _relu[width: Int](idx: Int):
        var val = tensor.load[width=width](idx)
        # Zero vector sized to the current (possibly narrower, tail) width.
        tensor.store(idx, val.max(SIMD[DType.float32, width](0)))

    vectorize[_relu, simd_width](tensor.num_elements())
```
Rules:
- Use `simdwidthof` to auto-detect the hardware SIMD width.
- Use `@parameter` for compile-time loop specialization.
- Benchmark SIMD against scalar code to verify vectorization gives a real speedup.
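The benchmarking rule applies on the Python side of a hybrid project too. A minimal, stdlib-only best-of-N harness (no Mojo or NumPy required) might look like:

```python
import time

def bench(fn, *args, repeats: int = 5) -> float:
    """Best-of-N wall-clock time for fn(*args), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Compare a scalar Python loop against the C-backed sum() builtin,
# standing in for a scalar-vs-vectorized comparison.
data = list(range(100_000))

def scalar_sum(xs):
    total = 0
    for x in xs:
        total += x
    return total

assert scalar_sum(data) == sum(data)  # same answer before comparing speed
t_scalar, t_builtin = bench(scalar_sum, data), bench(sum, data)
```

Only claim a speedup when the faster path also produces matching results.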
GIL-Free Parallelism
True multi-core scaling without Python's GIL:
```mojo
from algorithm import parallelize

fn parallel_transform(inout data: Tensor[DType.float32], num_workers: Int):
    # Ceiling division so the last worker also covers any remainder.
    var chunk_size = (data.num_elements() + num_workers - 1) // num_workers

    @parameter
    fn _worker(worker_id: Int):
        var start = worker_id * chunk_size
        var end = min(start + chunk_size, data.num_elements())
        for i in range(start, end):
            data[i] = expensive_compute(data[i])

    parallelize[_worker](num_workers)
```
Combined SIMD + Parallel
```mojo
fn process_large_dataset(inout data: Tensor[DType.float32]):
    alias simd_width = simdwidthof[DType.float32]()
    alias CHUNK_SIZE = 4096  # elements per parallel chunk; assumes the
                             # element count divides evenly by CHUNK_SIZE

    @parameter
    fn _chunk(chunk_id: Int):
        var offset = chunk_id * CHUNK_SIZE

        @parameter
        fn _vectorized[width: Int](idx: Int):
            var val = data.load[width=width](offset + idx)
            data.store(offset + idx, transform(val))

        vectorize[_vectorized, simd_width](CHUNK_SIZE)

    parallelize[_chunk](data.num_elements() // CHUNK_SIZE)
```
Zero-Copy Python Interop
Via `__array_interface__`
Exchange data with NumPy without copying:
```mojo
fn from_numpy(np_array: PythonObject) -> Tensor[DType.float32]:
    # Extract the raw pointer from NumPy's array interface
    var interface = np_array.__array_interface__
    var data_ptr = interface["data"][0].to_int()
    var shape = interface["shape"]
    # SAFETY: np_array must remain alive while this tensor exists.
    # The pointer is valid for shape[0] * sizeof(float32) bytes.
    var ptr = UnsafePointer[Float32](address=data_ptr)
    return Tensor[DType.float32](ptr, shape[0].to_int())
```
Returning Data to Python
```mojo
fn to_numpy(tensor: Tensor[DType.float32]) -> PythonObject:
    var np = Python.import_module("numpy")
    # Create a NumPy array from the Mojo tensor's data
    return np.frombuffer(
        tensor.data().as_bytes(),
        dtype=np.float32
    ).reshape(tensor.shape())
```
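The zero-copy contract (shared memory, and the owner must outlive the view) can be demonstrated without Mojo using Python's stdlib `memoryview` — this is an analogy, not the Modular API:

```python
import array

# A float32 buffer owned by Python, analogous to the NumPy array above.
buf = array.array("f", [1.0, 2.0, 3.0, 4.0])

# A zero-copy view: no bytes are duplicated, and writes through the view
# are visible to the owner, the same contract the Mojo tensor relies on.
view = memoryview(buf)
view[0] = 42.0
assert buf[0] == 42.0

# buffer_info() exposes (address, length), much like __array_interface__
# exposes the data pointer and shape that the Mojo side reads.
addr, length = buf.buffer_info()
assert length == 4 and view.itemsize == 4
```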
Python Extensions (C-API)
Module Entry Point
```mojo
from python.module import PythonModuleBuilder

@export
fn PyInit_my_module() -> PythonObject:
    var builder = PythonModuleBuilder("my_module")
    builder.add_function("dot_product", dot_product_wrapper)
    builder.add_function("relu", relu_wrapper)
    return builder.build()
```
Function Wrappers
```mojo
fn dot_product_wrapper(args: PythonObject) -> PythonObject:
    var a = args[0]  # NumPy array
    var b = args[1]  # NumPy array
    var result = dot_product_simd(
        from_numpy(a),
        from_numpy(b)
    )
    return to_numpy(result)
```
C FFI (Calling C Libraries from Mojo)
Static External Calls
Use `external_call` for compile-time linked C functions:

```mojo
from sys.ffi import external_call

fn get_time() -> Int64:
    # Calls C's clock() via libc (linked at compile time).
    # clock_t is a signed integer type on mainstream platforms.
    return external_call["clock", Int64]()

fn allocate_aligned(size: Int, alignment: Int) -> UnsafePointer[UInt8]:
    # SAFETY: Caller must release the allocation with free().
    return external_call["aligned_alloc", UnsafePointer[UInt8]](alignment, size)
```
Rules:
- First parameter is the C function name as a string literal.
- Second parameter is the Mojo return type.
- Remaining parameters are the arguments, which must map to C-compatible types.
- Only works for functions available at link time (libc, system libraries).
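For comparison, Python's `ctypes` expresses the same three ingredients — symbol name, return type, argument types — when binding a libc function at runtime (assumes a Unix-like system where `CDLL(None)` resolves symbols from the running process):

```python
import ctypes

# Load the C library. On Linux, CDLL(None) resolves symbols from the
# running process, which is linked against libc.
libc = ctypes.CDLL(None)

# Declare the signature explicitly, as external_call does via its parameters.
libc.abs.restype = ctypes.c_int
libc.abs.argtypes = [ctypes.c_int]

assert libc.abs(-7) == 7
```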
Dynamic Library Loading (DLHandle)
Load shared libraries at runtime for plugin-style integration:
```mojo
from sys.ffi import DLHandle

fn load_custom_library():
    # Open the shared library
    var lib = DLHandle("./libcustom_ops.so")

    # Get a function pointer by name
    var compute_fn = lib.get_function[
        fn (UnsafePointer[Float32], Int) -> Float32
    ]("custom_compute")

    # Call the C function
    var data = UnsafePointer[Float32].alloc(1024)
    var result = compute_fn(data, 1024)
    data.free()
    # lib closes on drop (RAII)
```
When to use DLHandle:
- Loading vendor-specific acceleration libraries (CUDA, oneDNN, MKL).
- Plugin architectures where the library isn't known at compile time.
- Wrapping existing C/C++ libraries without recompilation.
C Struct Mapping
Map C structs for FFI interop:
```mojo
@register_passable("trivial")
struct CTimeSpec:
    var tv_sec: Int64
    var tv_nsec: Int64

    fn __init__(out self):
        self.tv_sec = 0
        self.tv_nsec = 0

fn get_monotonic_time() -> CTimeSpec:
    var ts = CTimeSpec()
    # SAFETY: CTimeSpec layout matches C's struct timespec on LP64 platforms.
    # CLOCK_MONOTONIC is 1 on Linux; the constant differs on other systems.
    _ = external_call["clock_gettime", Int32](Int32(1), UnsafePointer.address_of(ts))
    return ts
```
Rules:
- Use `@register_passable("trivial")` for C-compatible structs (no destructor, copyable by memcpy).
- Ensure the field layout matches the C struct exactly (same types, same order).
- Use `Int32`, `Int64`, `Float32`, `Float64` for exact-width C type mapping.
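A `ctypes` analogue of the same struct shows the layout rule in practice. This sketch assumes a Linux LP64 platform, where `CLOCK_MONOTONIC` is 1 and `struct timespec` holds two 64-bit signed fields:

```python
import ctypes

class TimeSpec(ctypes.Structure):
    # Must mirror C's struct timespec: same field types, same order.
    _fields_ = [("tv_sec", ctypes.c_int64),
                ("tv_nsec", ctypes.c_int64)]

libc = ctypes.CDLL(None)
ts = TimeSpec()
CLOCK_MONOTONIC = 1  # Linux value; other platforms may differ
rc = libc.clock_gettime(CLOCK_MONOTONIC, ctypes.byref(ts))
assert rc == 0 and ts.tv_sec >= 0
```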
C String Handling
```mojo
from sys.ffi import c_char

fn call_c_with_string(name: String):
    # Convert Mojo String to a null-terminated C string
    var c_str = name.unsafe_cstr_ptr()
    # SAFETY: c_str is valid for the lifetime of `name`.
    _ = external_call["puts", Int32](c_str)

fn read_c_string(ptr: UnsafePointer[c_char]) -> String:
    # SAFETY: ptr must point to a valid null-terminated C string.
    return String(ptr)
```
Callback Patterns (C Calling Mojo)
Pass Mojo functions as C callbacks:
```mojo
alias CCallback = fn (UnsafePointer[Float32], Int) -> None

fn my_callback(data: UnsafePointer[Float32], len: Int) -> None:
    # Process data handed over by the C callback
    for i in range(len):
        data[i] = data[i] * 2.0

fn register_with_c_library():
    # Pass a Mojo function as a C function pointer
    external_call["register_callback", None](my_callback)
```
Rules:
- Callback functions must use C-compatible types only.
- Use `@always_inline("never")` if the callback needs a stable address.
- Document lifetime requirements: the callback must not outlive the data it touches.
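The same pattern can be observed from Python with `ctypes`, where libc's `qsort` calls back into interpreter code. This is a stdlib-only illustration of C-calling-managed-code, not the Mojo API; note the rule about keeping the callback alive:

```python
import ctypes

libc = ctypes.CDLL(None)

# C-compatible callback type: int (*)(const void*, const void*)
CMPFUNC = ctypes.CFUNCTYPE(ctypes.c_int,
                           ctypes.POINTER(ctypes.c_int),
                           ctypes.POINTER(ctypes.c_int))

def py_cmp(a, b):
    # Dereference the element pointers qsort hands us.
    return a.contents.value - b.contents.value

cmp_cb = CMPFUNC(py_cmp)  # keep a reference: must outlive the qsort call

arr = (ctypes.c_int * 5)(5, 1, 4, 2, 3)
libc.qsort(arr, len(arr), ctypes.sizeof(ctypes.c_int), cmp_cb)
assert list(arr) == [1, 2, 3, 4, 5]
```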
Linking C Libraries
```sh
# Link a system library at compile time
mojo build --link-lib "m" src/main.mojo    # libm (math)
mojo build --link-lib "ssl" src/main.mojo  # libssl

# Link a custom shared library
mojo build --link-lib "custom_ops" --link-path "./lib" src/main.mojo
```
C FFI Type Mapping
| C Type | Mojo Type |
|---|---|
| `char` / `int8_t` | `Int8` |
| `short` / `int16_t` | `Int16` |
| `int` | `Int32` |
| `long` (LP64) | `Int64` |
| `float` | `Float32` |
| `double` | `Float64` |
| `size_t` | `Int` |
| `void*` | `UnsafePointer[UInt8]` |
| `char*` | `UnsafePointer[c_char]` |
| `const char*` | `UnsafePointer[c_char]` (enforce read-only by convention) |
Build System (hatch-mojo)
Use hatch-mojo — a Hatch build hook plugin for compiling Mojo sources into Python extensions, shared libraries, executables, and other artifacts.
Minimal Setup
```toml
# pyproject.toml
[build-system]
build-backend = "hatchling.build"
requires = ["hatchling", "hatch-mojo"]

[project]
name = "my-package"
version = "0.1.0"
requires-python = ">=3.11"

[[tool.hatch.build.targets.wheel.hooks.mojo.jobs]]
name = "core"
input = "src/mo/my_pkg/core.mojo"
emit = "python-extension"
module = "my_pkg._core"
include-dirs = ["src/mo"]
```
```sh
# Build a wheel with the compiled Mojo extension
hatch build -t wheel
```
Job Types (emit)
| Value | Output | Use Case |
|---|---|---|
| `python-extension` | `.so` / `.pyd` importable by Python | Primary: Mojo kernels callable from Python |
| `shared-lib` | `.so` / `.dylib` | Shared library for FFI consumers |
| `static-lib` | `.a` | Static linking into other builds |
| `executable` | Binary | CLI tools written in Mojo |
| `object` | `.o` | Object file for custom linking |
Multiple Jobs with Dependencies
```toml
[tool.hatch.build.targets.wheel.hooks.mojo]
parallel = true    # Compile independent jobs concurrently
fail-fast = true   # Stop on first failure

# Reusable config profiles
[tool.hatch.build.targets.wheel.hooks.mojo.profiles.common]
include-dirs = ["src/mo"]

[[tool.hatch.build.targets.wheel.hooks.mojo.jobs]]
name = "core"
input = "src/mo/my_pkg/core.mojo"
emit = "python-extension"
module = "my_pkg._core"
profiles = ["common"]

[[tool.hatch.build.targets.wheel.hooks.mojo.jobs]]
name = "ops"
input = "src/mo/my_pkg/ops.mojo"
emit = "python-extension"
module = "my_pkg._ops"
profiles = ["common"]
depends-on = ["core"]  # Compiles after core

[[tool.hatch.build.targets.wheel.hooks.mojo.jobs]]
name = "cli"
input = "src/mo/cli.mojo"
emit = "executable"
depends-on = ["core"]
install = { kind = "scripts", path = "my-cli" }
```
Platform-Conditional Compilation
```toml
[[tool.hatch.build.targets.wheel.hooks.mojo.jobs]]
name = "unix-accel"
input = "src/mo/accel_unix.mojo"
emit = "shared-lib"
platforms = ["linux", "darwin"]  # Skip on Windows
install = { kind = "package", path = "my_pkg/lib" }

[[tool.hatch.build.targets.wheel.hooks.mojo.jobs]]
name = "cuda-kernels"
input = "src/mo/cuda_*.mojo"  # Glob expansion
emit = "shared-lib"
marker = "platform_machine == 'x86_64'"  # PEP 508 marker
install = { kind = "package", path = "my_pkg/cuda" }
```
Global Configuration Options
| Key | Default | Description |
|---|---|---|
| `mojo-bin` | auto-detect | Path to the `mojo` binary |
| `parallel` | | Compile independent jobs concurrently |
| `fail-fast` | | Stop on first failure (parallel mode) |
| | | Skip compilation for editable installs |
| | | Remove the build dir before compiling |
| | | Remove the build dir after compiling |
| | | Working directory for artifacts |
| `include` / `exclude` | | Global git-style glob filters |
Mojo Binary Discovery
Priority order:
1. `mojo-bin` config value
2. `HATCH_MOJO_BIN` environment variable
3. `mojo` on `PATH`
4. `.venv/bin/mojo` relative to the project root
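A hypothetical re-implementation of that priority chain (function and parameter names invented for illustration):

```python
import os
import shutil
from pathlib import Path
from typing import Optional

def find_mojo(config_value: Optional[str] = None,
              project_root: str = ".") -> Optional[str]:
    """Resolve the mojo binary using the priority order above."""
    if config_value:                      # 1. mojo-bin config value
        return config_value
    env = os.environ.get("HATCH_MOJO_BIN")
    if env:                               # 2. environment variable
        return env
    on_path = shutil.which("mojo")
    if on_path:                           # 3. mojo on PATH
        return on_path
    venv = Path(project_root) / ".venv" / "bin" / "mojo"
    return str(venv) if venv.exists() else None  # 4. project venv

assert find_mojo(config_value="/opt/modular/bin/mojo") == "/opt/modular/bin/mojo"
```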
Install Mappings (non-Python artifacts)
For jobs with `emit` other than `python-extension`, specify where to install the artifact:

| Kind | Description | Example |
|---|---|---|
| `package` | Inside the Python package dir | `install = { kind = "package", path = "my_pkg/lib" }` |
| `scripts` | As an executable script | `install = { kind = "scripts", path = "my-cli" }` |
| | As package data | |
| | At the package root | |
| | Arbitrary path | |
Manual Compilation (without hatch-mojo)
```sh
# Build a shared library directly
mojo build --emit shared-lib src/mo/module.mojo -o src/py/package/_module.so

# Build a standalone binary
mojo build src/mo/main.mojo -o dist/main

# Run directly
mojo run src/mo/main.mojo
```
Testing
Mojo Tests
```mojo
from testing import assert_almost_equal

fn test_dot_product() raises:
    var a = SIMD[DType.float32, 4](1.0, 2.0, 3.0, 4.0)
    var b = SIMD[DType.float32, 4](5.0, 6.0, 7.0, 8.0)
    var result = dot_product(a, b)
    assert_almost_equal(result, 70.0)
```
```sh
mojo test src/mo/tests/
```
Python-Mojo Boundary Tests
```python
# tests/test_module.py
import time

import numpy as np

from my_package._my_module import dot_product


def test_dot_product():
    a = np.array([1.0, 2.0, 3.0, 4.0], dtype=np.float32)
    b = np.array([5.0, 6.0, 7.0, 8.0], dtype=np.float32)
    result = dot_product(a, b)
    np.testing.assert_almost_equal(result, 70.0)


def test_large_array_performance():
    """Verify the Mojo kernel is competitive with NumPy for large arrays."""
    a = np.random.randn(1_000_000).astype(np.float32)
    b = np.random.randn(1_000_000).astype(np.float32)

    start = time.perf_counter()
    result_mojo = dot_product(a, b)
    mojo_time = time.perf_counter() - start

    start = time.perf_counter()
    result_numpy = np.dot(a, b)
    numpy_time = time.perf_counter() - start

    np.testing.assert_almost_equal(result_mojo, result_numpy, decimal=1)
    # Mojo should be competitive or faster
    print(f"Mojo: {mojo_time:.6f}s, NumPy: {numpy_time:.6f}s")
```
```sh
uv run pytest tests/
```
Project Structure
```
project/
├── pyproject.toml          # hatch-mojo config lives here
├── src/
│   ├── mo/                 # Mojo source
│   │   ├── my_pkg/
│   │   │   ├── core.mojo   # Python extension (emit: python-extension)
│   │   │   ├── ops.mojo    # SIMD/parallel kernels
│   │   │   └── layers.mojo # Compute layers
│   │   └── tests/          # Mojo unit tests
│   └── py/                 # Python package
│       └── my_package/
│           ├── __init__.py # Re-exports from _core
│           └── py.typed    # PEP 561
├── tests/                  # Python integration tests
├── benchmarks/             # Performance comparisons
└── .hatch_mojo/            # Build artifacts (gitignored)
```
Conventions
- Prefer `fn` over `def` for all Mojo functions unless Python interop requires `def`.
- Explicit types everywhere: no type inference in function signatures.
- Document `UnsafePointer` usage with safety comments.
- Benchmark against NumPy/Python baselines and document the results.
- Use `@parameter` for compile-time specialization.
- Keep Python wrappers thin: compute logic stays in Mojo.
- Use `uv` for all Python tooling (never `pip`, `pixi`, or `conda`).
Official References
- https://docs.modular.com/max/coding-assistants
- https://docs.modular.com/max/coding-assistants/codex/
- https://docs.modular.com/llms-mojo.txt
- https://docs.modular.com/llms-mojo-python.txt
- https://docs.modular.com/llms-mojo-kernel.txt
- https://docs.modular.com/llms-max.txt
- https://docs.modular.com/llms.txt
- https://docs.modular.com/mojo/
- https://docs.modular.com/mojo/changelog/
- https://docs.modular.com/mojo/manual/testing/
- https://docs.modular.com/mojo/manual/python/mojo-from-python/
- https://docs.modular.com/mojo/stdlib/ffi/external_call/
- https://github.com/modular/modular/blob/main/.github/agents/AGENTS.md
- https://github.com/modular/modular/blob/main/.github/agents/CLAUDE.md
- https://pypi.org/project/hatch-mojo/
Shared Styleguide Baseline
- Use shared styleguides for generic language/framework rules to reduce duplication in this skill.
- General Principles
- Python
- Keep this skill focused on tool-specific workflows, edge cases, and integration details.