Trending-skills flash-moe-inference
Run 397B parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming
git clone https://github.com/Aradotso/trending-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aradotso/trending-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/flash-moe-inference" ~/.claude/skills/aradotso-trending-skills-flash-moe-inference && rm -rf "$T"
skills/flash-moe-inference/SKILL.mdFlash-MoE Inference Engine
Skill by ara.so — Daily 2026 Skills collection.
Flash-MoE is a pure C/Objective-C/Metal inference engine that runs Qwen3.5-397B-A17B (397B parameter Mixture-of-Experts) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. It streams 209GB of expert weights from NVMe SSD on demand — no Python, no ML frameworks, just C, Objective-C, and hand-tuned Metal shaders.
Requirements
- Hardware: Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free
- OS: macOS 26+ (Darwin 25+)
- Tools: Xcode Command Line Tools, Python 3.x (for weight extraction only)
- Model: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace)
Installation & Build
# Clone the repo git clone https://github.com/danveloper/flash-moe cd flash-moe/metal_infer # Build everything make # Verify build artifacts ls infer chat main
The Makefile compiles
infer.m, chat.m, main.m with Metal shader compilation for shaders.metal.
Weight Preparation
Step 1: Extract non-expert weights
# From the metal_infer/ directory # Point to your downloaded Qwen3.5-397B safetensors directory python3 extract_weights.py /path/to/Qwen3.5-397B-A17B-Instruct/ # Produces: # model_weights.bin (~5.5GB, mmap'd at runtime) # model_weights.json (tensor manifest) # vocab.bin (vocabulary) # tokenizer.bin (BPE tokenizer data)
Step 2: Pack expert weights (4-bit, production)
# From repo root python3 repack_experts.py /path/to/Qwen3.5-397B-A17B-Instruct/ metal_infer/packed_experts/ # Produces packed_experts/ directory (~209GB) # Each expert is a separate file: layer_XX_expert_YYYY.bin
Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling)
# Convert 4-bit experts to 2-bit (saves ~89GB, 120GB total) python3 metal_infer/repack_experts_2bit.py \ metal_infer/packed_experts/ \ metal_infer/packed_experts_2bit/
Key Commands
Basic inference
cd metal_infer # 4-bit inference (production quality, tool calling works) ./infer --prompt "Explain quantum computing" --tokens 100 # 2-bit inference (faster, breaks JSON/tool calling) ./infer --prompt "Explain quantum computing" --tokens 100 --2bit # Per-layer timing breakdown ./infer --prompt "Hello" --tokens 20 --timing
Interactive chat with tool calling
./chat # Opens TUI with full tool calling support # Uses 4-bit experts by default
MoE-only benchmark (measures expert throughput)
./main # Runs pure expert forward-pass benchmark # Reports tokens/sec without attention overhead
Project Structure
flash-moe/ ├── paper/ │ └── flash_moe.pdf # Full technical paper ├── metal_infer/ │ ├── infer.m # Complete inference engine (~7000 lines) │ ├── shaders.metal # Metal compute kernels (~1200 lines) │ ├── chat.m # Interactive chat TUI │ ├── tokenizer.h # Single-header C BPE tokenizer (449 lines) │ ├── main.m # MoE-only benchmark │ ├── Makefile │ ├── extract_weights.py # Safetensors → model_weights.bin │ ├── repack_experts_2bit.py # 4-bit → 2-bit requantization │ ├── train_predictor.py # Expert routing prediction analysis │ ├── model_weights.bin # Non-expert weights (mmap'd) │ ├── model_weights.json # Tensor manifest │ ├── vocab.bin │ ├── tokenizer.bin │ ├── packed_experts/ # 4-bit expert files (209GB) │ └── packed_experts_2bit/ # 2-bit expert files (120GB, optional) ├── repack_experts.py # 4-bit expert packing from safetensors ├── progress.py # Results visualization └── results.tsv # Experiment log
Architecture Overview
The model has 60 transformer layers:
- 45 GatedDeltaNet (linear attention) layers
- 15 standard full attention layers
- Each layer: 512 experts, K=4 activated per token + 1 shared expert
- Hidden dimension: 4096
Per-layer pipeline (4.28ms average at 4-bit)
CMD3(prev) → CMD1: attention projections + delta-net [1.22ms GPU] → CPU: flush results [0.01ms CPU] → CMD2: o_proj + norm + routing + shared [0.55ms GPU] → CPU: softmax + topK routing [0.003ms] → I/O: parallel pread K=4 experts [2.41ms SSD] → CMD3: expert forward + combine + norm [0.04ms encode, DEFERRED]
Metal Shader Kernels
The
shaders.metal file contains hand-written kernels. Key kernels:
// 4-bit dequantized matrix-vector multiply (FMA-optimized) // Key insight: fma(nibble, scale*x, bias*x) instead of (nibble*scale + bias)*x // Pre-compute scale*x and bias*x to fuse dequant+multiply in one FMA instruction kernel void matvec_4bit_fma( device const uint8_t* weights [[buffer(0)]], device const float* scales [[buffer(1)]], device const float* biases [[buffer(2)]], device const float* x [[buffer(3)]], device float* out [[buffer(4)]], uint tid [[thread_position_in_threadgroup]], uint gid [[threadgroup_position_in_grid]]) { // ... tiled SIMD-reduced FMA kernel // 12% faster than naive (nibble * scale + bias) * x } // Fused SwiGLU activation kernel void swiglu(device float* gate [[buffer(0)]], device const float* up [[buffer(1)]], uint gid [[thread_position_in_grid]]) { float g = gate[gid]; gate[gid] = (g / (1.0f + exp(-g))) * up[gid]; } // RMS normalization (two-pass) kernel void rms_norm_pass1(...) // sum of squares reduction kernel void rms_norm_pass2(...) // apply normalization // GPU RoPE (fused with Q deinterleave and K normalization) kernel void rope_qk(...) // MoE combine + residual + sigmoid gate (fused) kernel void moe_combine_residual(...)
SSD Expert Streaming Pattern
The core innovation — loading only K=4 active experts per layer from SSD:
// Parallel expert loading using GCD dispatch groups // From infer.m (conceptual pattern) dispatch_group_t group = dispatch_group_create(); dispatch_queue_t ioQueue = dispatch_get_global_queue(QOS_CLASS_USER_INITIATED, 0); for (int k = 0; k < K_EXPERTS; k++) { int expert_id = top_k_indices[k]; dispatch_group_async(group, ioQueue, ^{ // Each expert: ~6.75MB at 4-bit char path[256]; snprintf(path, sizeof(path), "packed_experts/layer_%02d_expert_%04d.bin", layer, expert_id); int fd = open(path, O_RDONLY); // pread() — non-blocking, OS page cache handles LRU pread(fd, expert_buffer[k], expert_size, 0); close(fd); }); } dispatch_group_wait(group, DISPATCH_TIME_FOREVER); // GPU compute follows — serial pipeline is hardware-optimal on Apple Silicon
Why
not pread()
: mmap incurs per-page fault overhead on cold data (~5x slower). Direct mmap()
pread() with OS page cache achieves ~71% hit rate naturally.
GatedDeltaNet Linear Attention (BLAS)
The recurrence update uses Accelerate BLAS — 64% faster than scalar:
// GatedDeltaNet state update per head (conceptual pattern) // state: 128×128 float matrix, 64 heads // From infer.m #import <Accelerate/Accelerate.h> for (int h = 0; h < 64; h++) { float* S = state + h * 128 * 128; // 128×128 state matrix float* q = Q + h * 128; float* k = K + h * 128; float* v = V + h * 128; // β·(k⊗v) outer product update // cblas_sger: S += beta * (k ⊗ v) cblas_sger(CblasRowMajor, 128, 128, beta[h], k, 1, v, 1, S, 128); // Decay: S = alpha * S cblas_sscal(128 * 128, alpha[h], S, 1); // Output: o = S @ q cblas_sgemv(CblasRowMajor, CblasNoTrans, 128, 128, 1.0f, S, 128, q, 1, 0.0f, output + h * 128, 1); }
Performance Configuration
4-bit (production default)
- Quality: Excellent — full tool calling, correct JSON
- Speed: 4.36 tok/s
- Disk: 209GB
2-bit (speed testing only)
- Quality: Good — but breaks JSON/tool calling (
instead of\name\
)"name" - Speed: 5.74 tok/s (7.05 peak single token with warm cache)
- Disk: 120GB
- Uses
flag to avoid page cache thrashingF_NOCACHE
What NOT to Try (Learned from 58 Experiments)
| Approach | Why it fails |
|---|---|
expert files | Per-page fault overhead: 5x slower than |
| management overhead: -70% |
prefetch | SSD DMA + GPU share memory controller — concurrent access: -73% GPU speed |
| Custom Metal LRU cache | GPU memory pressure: -38% vs OS page cache |
| LZ4 expert compression | Decompress overhead > warm cache savings: -13% |
| Temporal expert prediction | 25% hit rate, wastes SSD bandwidth: -18% |
| Speculative early routing | Cache pollution: -38% |
| MTP speculative decoding | MoE I/O scales per-token (unlike dense models): break-even |
| Spin-poll GPU wait | CPU thermal throttle competes with GPU: -23% |
| Parallel SSD + GPU overlap | Unified memory controller arbitration: net negative |
Key principle: On Apple Silicon, GPU DMA and SSD DMA share the same memory controller. The serial pipeline (GPU → SSD → GPU) is hardware-optimal.
Troubleshooting
Build fails
# Ensure Xcode CLI tools are installed xcode-select --install # Check Metal compiler is available xcrun -sdk macosx metal --version
Out of memory
The engine is designed to use ~6GB active:
- 5.5GB:
(mmap'd, read-only)model_weights.bin - ~200MB: Metal scratch buffers
- Remaining ~42GB: OS page cache for expert data
If you see OOM, check for other processes consuming unified memory:
sudo memory_pressure vm_stat
Slow performance
# Check SSD speed — needs ~17GB/s for target performance # Run with timing to identify bottleneck ./infer --prompt "Hello" --tokens 5 --timing # Verify packed_experts/ is on internal SSD, not external drive diskutil info /
Wrong expert directory
# Default paths expected by infer.m: # metal_infer/packed_experts/ (4-bit) # metal_infer/packed_experts_2bit/ (2-bit) # Ensure you're running from metal_infer/ directory cd metal_infer ./infer --prompt "test"
Tool calling broken
Use 4-bit, not 2-bit. The 2-bit quantization corrupts quote characters in JSON output, making tool calling unreliable. Always use the default 4-bit configuration for agentic workloads.
Memory Safety
The engine explicitly manages all allocations:
- No unbounded caches
- Expert data never accumulates in GPU memory
is mmap'd read-only — kernel manages pagesmodel_weights.bin- Expert files are opened/read/closed per inference step