Claude-code-minoan smolvlm

Local vision-language model for image analysis using SmolVLM-2B

install
source · Clone the upstream repo
git clone https://github.com/tdimino/claude-code-minoan
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/tdimino/claude-code-minoan "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/design-media/smolvlm" ~/.claude/skills/tdimino-claude-code-minoan-smolvlm && rm -rf "$T"
manifest: skills/design-media/smolvlm/SKILL.md

SmolVLM - Local Image Analysis

Analyze images locally with SmolVLM-2B, a compact vision-language model that runs on Apple Silicon via mlx-vlm.

Quick Usage

Describe an Image

python ~/.claude/skills/smolvlm/scripts/view_image.py /path/to/image.png

Ask a Question About an Image

python ~/.claude/skills/smolvlm/scripts/view_image.py /path/to/image.png "What text is visible?"

Specific Tasks

# Extract text (OCR)
python ~/.claude/skills/smolvlm/scripts/view_image.py screenshot.png "Extract all text"

# UI analysis
python ~/.claude/skills/smolvlm/scripts/view_image.py ui.png "Describe the UI elements"

# Detailed description
python ~/.claude/skills/smolvlm/scripts/view_image.py photo.jpg --detailed
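
When several images need the same question, the commands above can be wrapped in a small Python helper. The following is a minimal sketch, not part of the skill itself: `build_command` and `describe_all` are hypothetical names, the script path is copied from the examples above (adjust it if the skill was installed under a different directory name), and it assumes the script prints its answer to stdout and exits non-zero on failure.

```python
import subprocess
import sys
from pathlib import Path

# Path copied from the usage examples; an assumption about where the skill lives.
SCRIPT = Path.home() / ".claude" / "skills" / "smolvlm" / "scripts" / "view_image.py"


def build_command(image, prompt=None, detailed=False):
    """Return the argv list for a single view_image.py call."""
    cmd = [sys.executable, str(SCRIPT), str(image)]
    if prompt:
        cmd.append(prompt)        # optional question, passed as one argument
    if detailed:
        cmd.append("--detailed")  # flag shown in the "Detailed description" example
    return cmd


def describe_all(images, prompt=None):
    """Run the script once per image; map each path to its output, or None on failure."""
    results = {}
    for image in images:
        proc = subprocess.run(build_command(image, prompt), capture_output=True, text=True)
        # Assumes the answer is written to stdout and a non-zero exit code means failure.
        results[str(image)] = proc.stdout.strip() if proc.returncode == 0 else None
    return results
```

For example, `describe_all(sorted(Path("screenshots").glob("*.png")), "Extract all text")` would run the OCR prompt over every PNG in a (hypothetical) `screenshots` directory, one process per image.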

Effective Prompts

General Description

  • "Describe this image"
    - Basic description
  • "Describe this image in detail, including colors, composition, and any text"
    - Comprehensive

Text Extraction (OCR)

  • "Extract all visible text from this image"
  • "What text appears in this screenshot?"
  • "Read the text in this document"

UI/Screenshot Analysis

  • "Describe the user interface elements"
  • "What buttons and controls are visible?"
  • "Identify the application and its current state"

Visual Question Answering

  • "How many [objects] are in this image?"
  • "What color is the [object]?"
  • "Is there a [object] in this image?"

Code/Technical

  • "What programming language is shown?"
  • "Describe what this code does"
  • "Identify any errors in this code screenshot"
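
The bracketed placeholders in the prompts above can be filled in programmatically. This is an illustrative sketch only: the `PROMPTS` table and `make_prompt` helper are hypothetical names, and the task keys are made up for the example. The prompt strings themselves are taken from the lists above, with `[object]` rewritten as `{obj}`.

```python
# Prompt strings from the lists above; keys are illustrative labels.
PROMPTS = {
    "describe": "Describe this image",
    "ocr": "Extract all visible text from this image",
    "ui": "Describe the user interface elements",
    "count": "How many {obj} are in this image?",
    "color": "What color is the {obj}?",
    "presence": "Is there a {obj} in this image?",
    "code": "Describe what this code does",
}


def make_prompt(task, obj=None):
    """Return the prompt for a task, filling in the object name where one is required."""
    template = PROMPTS[task]
    if "{obj}" in template:
        if not obj:
            raise ValueError(f"task {task!r} needs an object name")
        return template.format(obj=obj)
    return template


print(make_prompt("count", "buttons"))  # How many buttons are in this image?
```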

Model Details

Spec               Value
Model              SmolVLM-2B-Instruct
Size               ~4GB
Peak Memory        5.8GB
Speed              ~94 tok/s (M-series)
Supported Formats  PNG, JPG, JPEG, GIF, WebP
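
A caller can filter out unsupported files before invoking the model. The sketch below checks only the file extension against the formats listed above. `is_supported` is a hypothetical helper, and it does not inspect the file contents, so a mislabeled file would still pass.

```python
from pathlib import Path

# Extensions corresponding to the supported formats listed above.
SUPPORTED_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".webp"}


def is_supported(path):
    """Return True if the file extension matches a supported format (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES
```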

Requirements

  • macOS with Apple Silicon (M1/M2/M3)
  • Python 3.10+
  • mlx-vlm package:
    uv pip install mlx-vlm --system
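
The three requirements above can be checked from Python before the first run. This is a rough sketch under stated assumptions: `check_requirements` is a hypothetical helper, and it assumes the mlx-vlm package is importable as `mlx_vlm`. A Python interpreter running under Rosetta reports `x86_64` and would fail the architecture check even on Apple Silicon hardware.

```python
import importlib.util
import platform
import sys


def check_requirements():
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    # Apple Silicon Macs report Darwin / arm64 for a native interpreter.
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        problems.append("needs macOS on Apple Silicon (arm64)")
    if sys.version_info < (3, 10):
        problems.append("needs Python 3.10 or newer")
    # Assumes the package's import name is mlx_vlm.
    if importlib.util.find_spec("mlx_vlm") is None:
        problems.append("mlx-vlm is not installed (uv pip install mlx-vlm --system)")
    return problems


for problem in check_requirements():
    print("Requirement not met:", problem)
```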

Troubleshooting

"Model not found": The first run downloads the model (~4GB). Wait for the download to complete.

Out of memory: Close other applications. The model needs about 6GB of free RAM (peak usage is 5.8GB).

Slow first inference: Loading the model takes 10-15s on first use; subsequent calls are faster.