# mlx-local-inference

## Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/openclaw/skills
```

Claude Code · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bendusy/mlx-local-inference" ~/.claude/skills/openclaw-skills-mlx-local-inference && rm -rf "$T"
```

OpenClaw · Install into `~/.openclaw/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/bendusy/mlx-local-inference" ~/.openclaw/skills/openclaw-skills-mlx-local-inference && rm -rf "$T"
```

Manifest: `skills/bendusy/mlx-local-inference/SKILL.md` (source content follows)
# MLX Local Inference Stack
Local AI inference on Apple Silicon. oMLX handles LLM/VLM with continuous batching. Python libraries handle Embedding/ASR/OCR directly via `uv`.
## Architecture

```
┌─────────────────────────────────────┐
│ oMLX (localhost:8000/v1)            │
│ - LLM (Qwen3.5-35B, etc.)           │
│ - VLM (vision-language models)      │
│ - Continuous batching + SSD cache   │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│ Python Libraries (via uv run)       │
│ - mlx-lm: Embedding                 │
│ - mlx-vlm: OCR (PaddleOCR-VL)       │
│ - mlx-audio: ASR (Qwen3-ASR)        │
└─────────────────────────────────────┘
```
## Models

| Capability | Implementation | Model | Size |
|---|---|---|---|
| 💬 LLM | oMLX API | Qwen3.5-35B-A3B-4bit | ~20 GB |
| 👁️ VLM | oMLX API | Any mlx-vlm model | varies |
| 📐 Embed | mlx-lm (uv) | Qwen3-Embedding-0.6B-4bit-DWQ | ~1 GB |
| 🎤 ASR | mlx-audio (uv) | Qwen3-ASR-1.7B-8bit | ~1.5 GB |
| 👁️ OCR | mlx-vlm (uv) | PaddleOCR-VL-1.5-6bit | ~3.3 GB |
## Usage

### LLM / Vision-Language (via oMLX API)
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

# Text generation
resp = client.chat.completions.create(
    model="Qwen3.5-35B-A3B-4bit",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```
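The same endpoint serves the VLM path. A minimal sketch of an image request, assuming oMLX accepts OpenAI-style multimodal content parts (that, and the placeholder model name, are assumptions not confirmed by this skill):

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

# Encode a local image as a data URL (OpenAI-style multimodal payload).
with open("document.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="your-vlm-model",  # placeholder: whichever mlx-vlm model oMLX has loaded
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```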
### Embeddings (via mlx-lm + uv)
```bash
uv run --with mlx-lm python -c "
from mlx_lm import load
model, tokenizer = load('~/models/Qwen3-Embedding-0.6B-4bit-DWQ')
text = 'text to embed'
inputs = tokenizer(text, return_tensors='np')
embeddings = model(**inputs).last_hidden_state.mean(axis=1)
print(embeddings.shape)
"
```
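Downstream, pooled vectors are typically compared by cosine similarity. A minimal NumPy sketch; the two random vectors are placeholders standing in for outputs of the mean-pooling call above (the 1024 dimension is likewise a placeholder):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two 1-D embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholders: replace with the pooled embeddings produced above.
query_vec = np.random.rand(1024)
doc_vec = np.random.rand(1024)
print(cosine_similarity(query_vec, doc_vec))
```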
### ASR — Speech-to-Text (via mlx-audio + uv)

Important: must run with `--python 3.11` to avoid OpenMP threading issues (SIGSEGV).
```bash
uv run --python 3.11 --with mlx-audio python -m mlx_audio.stt.generate \
  --model ~/models/Qwen3-ASR-1.7B-8bit \
  --audio "audio.wav" \
  --output-path /tmp/asr_result \
  --format txt \
  --language zh \
  --verbose
```
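A sketch of driving that same command from Python and collecting the transcript, assuming the txt result lands at `/tmp/asr_result.txt` (the exact output filename is an assumption; verify against your mlx-audio version):

```python
import subprocess
from pathlib import Path

def transcribe(audio_path: str, language: str = "zh") -> str:
    # Shell out to mlx-audio under Python 3.11, per the SIGSEGV note above.
    subprocess.run(
        [
            "uv", "run", "--python", "3.11", "--with", "mlx-audio",
            "python", "-m", "mlx_audio.stt.generate",
            "--model", str(Path("~/models/Qwen3-ASR-1.7B-8bit").expanduser()),
            "--audio", audio_path,
            "--output-path", "/tmp/asr_result",
            "--format", "txt",
            "--language", language,
        ],
        check=True,
    )
    # Assumption: the txt transcript is written to /tmp/asr_result.txt.
    return Path("/tmp/asr_result.txt").read_text()

print(transcribe("audio.wav"))
```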
### OCR (via mlx-vlm + uv)

Important: the `generate()` function's parameter order must be `(model, processor, prompt, image)`.
```bash
cat << 'PY_EOF' > run_ocr.py
import os
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = os.path.expanduser("~/models/PaddleOCR-VL-1.5-6bit")
model, processor = load(model_path)
prompt = apply_chat_template(processor, config=model.config, prompt="OCR:", num_images=1)
output = generate(model, processor, prompt, "document.jpg", max_tokens=512, temp=0.0)
print(output.text)
PY_EOF

uv run --python 3.11 --with mlx-vlm python run_ocr.py
```
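The same four-argument `generate` call extends naturally to a batch: load the model once, then loop over files. A sketch assuming a `scans/` directory of JPEGs (the directory and output naming are placeholders):

```python
import os
from pathlib import Path

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = os.path.expanduser("~/models/PaddleOCR-VL-1.5-6bit")
model, processor = load(model_path)
prompt = apply_chat_template(processor, config=model.config, prompt="OCR:", num_images=1)

# Load the model once, then OCR every image in the directory.
for image in sorted(Path("scans").glob("*.jpg")):
    output = generate(model, processor, prompt, str(image), max_tokens=512, temp=0.0)
    Path(f"{image.stem}.txt").write_text(output.text)
    print(f"{image.name}: {len(output.text)} chars")
```

Run it the same way as the single-file script: `uv run --python 3.11 --with mlx-vlm python batch_ocr.py`.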
## Service Management (oMLX only)

```bash
# Check running models
curl http://localhost:8000/v1/models

# Restart oMLX
launchctl kickstart -k gui/$(id -u)/com.omlx-server
```
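A sketch of a liveness check that restarts the service when the endpoint stops answering, built from the two commands above (the launchd label `com.omlx-server` comes from this doc; the timeout value is an assumption):

```python
import os
import subprocess
import urllib.request

def omlx_alive(timeout: float = 2.0) -> bool:
    # Probe the models endpoint; any HTTP 200 counts as alive.
    try:
        with urllib.request.urlopen("http://localhost:8000/v1/models", timeout=timeout) as r:
            return r.status == 200
    except OSError:
        return False

if not omlx_alive():
    # Same restart command as above, driven from Python.
    subprocess.run(
        ["launchctl", "kickstart", "-k", f"gui/{os.getuid()}/com.omlx-server"],
        check=True,
    )
```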
## Model Storage Strategy

All models are stored in `~/models/` using an oMLX-compatible structure:
```
~/models/
├── Qwen3-Embedding-0.6B-4bit-DWQ/
├── Qwen3-ASR-1.7B-8bit/
├── PaddleOCR-VL-1.5-6bit/
└── Qwen3.5-35B-A3B-4bit/
```
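To populate the tree, one option is `huggingface_hub.snapshot_download`. A sketch, assuming each local folder mirrors an upstream repo ID (the `mlx-community/...` ID below is a guess, not something this doc specifies):

```python
from pathlib import Path

from huggingface_hub import snapshot_download

models = {
    # local folder name -> assumed Hugging Face repo ID
    "Qwen3-ASR-1.7B-8bit": "mlx-community/Qwen3-ASR-1.7B-8bit",
}

for folder, repo_id in models.items():
    # Download the full snapshot directly into the oMLX-compatible layout.
    snapshot_download(repo_id=repo_id, local_dir=Path.home() / "models" / folder)
```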
## Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- `uv` installed (`curl -LsSf https://astral.sh/uv/install.sh | sh`)