Skills mlx-vlm
install
source · Clone the upstream repo
git clone https://github.com/TerminalSkills/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/TerminalSkills/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/mlx-vlm" ~/.claude/skills/terminalskills-skills-mlx-vlm && rm -rf "$T"
manifest:
skills/mlx-vlm/SKILL.md
safety · automated scan (low risk)
This is a pattern-based risk scan, not a security review. Our crawler flagged:
- pip install
Always read a skill's source content before installing. Patterns alone don't mean the skill is malicious — but they warrant attention.
source content
MLX-VLM — Vision Language Models on Apple Silicon
Overview
mlx-vlm runs vision-language models natively on Apple Silicon using the MLX framework. It supports inference and fine-tuning with unified memory — no GPU server needed.
Repo: Blaizzy/mlx-vlm
Requirements: macOS 14+, Apple Silicon (M1/M2/M3/M4), Python 3.10+
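Before installing, it can be worth confirming the machine actually meets these requirements. A minimal stdlib-only check; the assertions simply mirror the stated requirements:

```python
import platform
import sys

assert platform.system() == "Darwin", "macOS required"
assert platform.machine() == "arm64", "Apple Silicon required"
assert int(platform.mac_ver()[0].split(".")[0]) >= 14, "macOS 14+ required"
assert sys.version_info >= (3, 10), "Python 3.10+ required"
print("Environment meets the stated mlx-vlm requirements")
```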
Installation
```bash
# Create virtual environment (recommended)
python3 -m venv ~/.venvs/mlx-vlm
source ~/.venvs/mlx-vlm/bin/activate

# Install
pip install mlx-vlm
```
For development:
```bash
git clone https://github.com/Blaizzy/mlx-vlm.git
cd mlx-vlm && pip install -e .
```
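Either install path can be sanity-checked with a standard-library version lookup:

```python
from importlib.metadata import version

# Raises PackageNotFoundError if mlx-vlm didn't install correctly.
print(version("mlx-vlm"))
```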
Supported Models
| Model | HuggingFace ID | Best For |
|---|---|---|
| Pixtral | mlx-community/pixtral-12b-240910-4bit | General vision, multi-image |
| Qwen2-VL | | OCR, document understanding |
| Phi-3-Vision | | Lightweight, fast inference |
| LLaVA-1.6 | | Conversation about images |
| Llama-3.2-Vision | | Strong general reasoning |
Inference
CLI
```bash
# Single image analysis
python -m mlx_vlm.generate \
  --model mlx-community/pixtral-12b-240910-4bit \
  --image path/to/image.jpg \
  --prompt "Describe this image in detail" \
  --max-tokens 512

# Multi-image comparison
python -m mlx_vlm.generate \
  --model mlx-community/pixtral-12b-240910-4bit \
  --image img1.jpg img2.jpg \
  --prompt "Compare these two images"
```
Python API
```python
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model_path = "mlx-community/pixtral-12b-240910-4bit"
model, processor = load(model_path)

prompt = apply_chat_template(
    processor,
    config=model.config,
    prompt="What objects are in this image?",
    images=["product.jpg"],
)

output = generate(
    model,
    processor,
    prompt,
    images=["product.jpg"],
    max_tokens=512,
    temperature=0.7,
)
print(output)
```
Batch Processing
```python
import os, csv

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("mlx-community/pixtral-12b-240910-4bit")

image_dir = "images/"
results = []
for filename in os.listdir(image_dir):
    if not filename.lower().endswith((".jpg", ".png", ".webp")):
        continue
    path = os.path.join(image_dir, filename)
    prompt = apply_chat_template(
        processor,
        config=model.config,
        prompt="Describe this product photo. Include: category, color, condition, key features.",
        images=[path],
    )
    desc = generate(model, processor, prompt, images=[path], max_tokens=256)
    results.append({"file": filename, "description": desc})

with open("descriptions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["file", "description"])
    writer.writeheader()
    writer.writerows(results)
```
Fine-Tuning
Prepare Dataset
Create JSONL with image paths and conversations:
{"image": "train/001.jpg", "conversations": [{"role": "user", "content": "Classify this product"}, {"role": "assistant", "content": "Category: Electronics, Subcategory: Headphones, Condition: New"}]} {"image": "train/002.jpg", "conversations": [{"role": "user", "content": "Classify this product"}, {"role": "assistant", "content": "Category: Clothing, Subcategory: T-Shirt, Condition: Used - Good"}]}
Run Fine-Tuning (LoRA)
```bash
python -m mlx_vlm.lora \
  --model mlx-community/pixtral-12b-240910-4bit \
  --data ./dataset \
  --train-file train.jsonl \
  --valid-file val.jsonl \
  --num-layers 8 \
  --batch-size 1 \
  --epochs 3 \
  --lr 1e-5 \
  --adapter-path ./adapters
```
Inference with Fine-Tuned Adapter
```bash
python -m mlx_vlm.generate \
  --model mlx-community/pixtral-12b-240910-4bit \
  --adapter-path ./adapters \
  --image test.jpg \
  --prompt "Classify this product"
```
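To spot-check the adapter, one option is to loop this same CLI over a held-out labeled set and count matches. A rough sketch using only the flags shown above; val_labels.csv (file and label columns) is a hypothetical export, and the loose substring match accounts for the CLI printing generation stats alongside the answer:

```python
import csv
import subprocess

MODEL = "mlx-community/pixtral-12b-240910-4bit"

hits = total = 0
with open("val_labels.csv") as f:  # ASSUMPTION: columns "file" and "label"
    for row in csv.DictReader(f):
        result = subprocess.run(
            ["python", "-m", "mlx_vlm.generate",
             "--model", MODEL,
             "--adapter-path", "./adapters",
             "--image", f"val/{row['file']}",
             "--prompt", "Classify this product",
             "--max-tokens", "64"],
            capture_output=True, text=True, check=True,
        )
        hits += row["label"].strip() in result.stdout
        total += 1

print(f"Exact-label matches: {hits}/{total}")
```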
Cloud API Comparison
| Factor | mlx-vlm (Local) | Cloud APIs (GPT-4V, Claude) |
|---|---|---|
| Cost | $0 after hardware | $0.01-0.04 per image |
| Privacy | Data stays local | Data sent to provider |
| Speed | ~2-8s per image (M3 Max) | ~1-3s per image |
| Offline | Yes | No |
| Custom models | LoRA fine-tuning | Limited / expensive |
| Quality | Good (7-12B models) | Excellent (frontier models) |
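The cost row implies a simple break-even point: divide the hardware outlay by the per-image cloud price. A quick illustration; the $4,000 hardware figure is an assumption, not a quote:

```python
# Break-even volume vs. the cloud price range from the table above.
hardware_cost = 4000.0  # ASSUMPTION: illustrative price for a capable Mac
for per_image in (0.01, 0.04):
    images = hardware_cost / per_image
    print(f"At ${per_image:.2f}/image, local breaks even after {images:,.0f} images")
```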
Performance Tips
- Use 4-bit quantized models (`4bit` in name) for 2-3x speedup with minimal quality loss
- M3 Max / M4 Pro with 36GB+ RAM can run 12B models comfortably (see the sizing sketch below)
- For M1/M2 with 16GB, stick to 7B 4-bit models
- Set `MLX_METAL_JIT=1` for potential speedup on first run
- Close memory-heavy apps before inference — unified memory is shared with system
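The RAM guidance in the first three bullets follows from weight size: a 4-bit quantized model stores about 0.5 bytes per parameter, plus activation and KV-cache overhead. A rough estimator; the 1.3x overhead factor is an assumption, not a measured value:

```python
def weight_footprint_gb(params_billions: float, bits: int = 4, overhead: float = 1.3) -> float:
    """Approximate unified memory needed for model weights at a given quantization."""
    weights_gb = params_billions * bits / 8  # 4-bit => ~0.5 GB per billion params
    return weights_gb * overhead             # ASSUMPTION: 1.3x for activations/KV cache

for name, size_b in [("7B 4-bit", 7.0), ("12B 4-bit", 12.0)]:
    print(f"{name}: ~{weight_footprint_gb(size_b):.1f} GB (plus OS and other apps)")
```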