Claude-code-minoan smolvlm
Local vision-language model for image analysis using SmolVLM-2B
install
source · Clone the upstream repo
git clone https://github.com/tdimino/claude-code-minoan
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/tdimino/claude-code-minoan "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/design-media/smolvlm" ~/.claude/skills/tdimino-claude-code-minoan-smolvlm && rm -rf "$T"
manifest:
skills/design-media/smolvlm/SKILL.mdsource content
SmolVLM - Local Image Analysis
Analyze images locally using SmolVLM-2B, a state-of-the-art compact vision-language model optimized for Apple Silicon via mlx-vlm.
Quick Usage
Describe an Image
python ~/.claude/skills/smolvlm/scripts/view_image.py /path/to/image.png
Ask a Question About an Image
python ~/.claude/skills/smolvlm/scripts/view_image.py /path/to/image.png "What text is visible?"
Specific Tasks
# Extract text (OCR) python ~/.claude/skills/smolvlm/scripts/view_image.py screenshot.png "Extract all text" # UI analysis python ~/.claude/skills/smolvlm/scripts/view_image.py ui.png "Describe the UI elements" # Detailed description python ~/.claude/skills/smolvlm/scripts/view_image.py photo.jpg --detailed
Effective Prompts
General Description
- Basic description"Describe this image"
- Comprehensive"Describe this image in detail, including colors, composition, and any text"
Text Extraction (OCR)
"Extract all visible text from this image""What text appears in this screenshot?""Read the text in this document"
UI/Screenshot Analysis
"Describe the user interface elements""What buttons and controls are visible?""Identify the application and its current state"
Visual Question Answering
"How many [objects] are in this image?""What color is the [object]?""Is there a [object] in this image?"
Code/Technical
"What programming language is shown?""Describe what this code does""Identify any errors in this code screenshot"
Model Details
| Spec | Value |
|---|---|
| Model | SmolVLM-2B-Instruct |
| Size | ~4GB |
| Peak Memory | 5.8GB |
| Speed | ~94 tok/s (M-series) |
| Supported Formats | PNG, JPG, JPEG, GIF, WebP |
Requirements
- macOS with Apple Silicon (M1/M2/M3)
- Python 3.10+
- mlx-vlm package:
uv pip install mlx-vlm --system
Troubleshooting
"Model not found": First run downloads the model (~4GB). Wait for completion.
Out of memory: Close other applications. Model needs ~6GB free RAM.
Slow first inference: Model loading takes 10-15s on first use, subsequent calls are faster.