Trending-skills llmfit-hardware-model-matcher
Terminal tool that detects your hardware and recommends which LLM models will actually run well on your system
git clone https://github.com/Aradotso/trending-skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/Aradotso/trending-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/llmfit-hardware-model-matcher" ~/.claude/skills/aradotso-trending-skills-llmfit-hardware-model-matcher && rm -rf "$T"
skills/llmfit-hardware-model-matcher/SKILL.md

llmfit Hardware Model Matcher
Skill by ara.so — Daily 2026 Skills collection.
llmfit detects your system's RAM, CPU, and GPU, then scores hundreds of LLM models across quality, speed, fit, and context dimensions to tell you exactly which models will run well on your hardware. It ships with an interactive TUI and a CLI, and supports multi-GPU setups, MoE architectures, dynamic quantization, and local runtime providers (Ollama, llama.cpp, MLX, Docker Model Runner).
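Once installed (see the Installation section below), a typical first pass is just two commands, both documented later in this skill: detect the hardware, then ask for recommendations.

```bash
# Detect RAM, CPU, and GPU/VRAM
llmfit system

# Top 5 models that should run well on this machine, as JSON
llmfit recommend --json --limit 5
```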
Installation
macOS / Linux (Homebrew)
brew install llmfit
Quick install script
```bash
curl -fsSL https://llmfit.axjns.dev/install.sh | sh

# Without sudo, installs to ~/.local/bin
curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local
```
Windows (Scoop)
scoop install llmfit
Docker / Podman
```bash
docker run ghcr.io/alexsjones/llmfit

# With jq for scripting
podman run ghcr.io/alexsjones/llmfit recommend --use-case coding | jq '.models[].name'
```
From source (Rust)
```bash
git clone https://github.com/AlexsJones/llmfit.git
cd llmfit
cargo build --release
# binary at target/release/llmfit
```
Core Concepts
- Fit tiers: perfect (runs great), good (runs well), marginal (runs but tight), too_tight (won't run) (see the counting sketch after this list)
- Scoring dimensions: quality, speed (tok/s estimate), fit (memory headroom), context capacity
- Run modes: GPU, CPU+GPU offload, CPU-only, MoE
- Quantization: automatically selects best quant (e.g. Q4_K_M, Q5_K_S, mlx-4bit) for your hardware
- Providers: Ollama, llama.cpp, MLX, Docker Model Runner
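To see how the fit tiers show up in practice, here is a minimal sketch that counts runnable models per tier from the JSON fit analysis. It assumes the output of `llmfit --json fit` exposes a top-level `models` array with a per-model `fit` field; those field names are borrowed from the scripting examples later in this document, so verify them against your actual output.

```bash
#!/bin/bash
# Count models per fit tier (assumes a "models" array with a "fit" field per entry)
llmfit --json fit | jq -r '.models[].fit' | sort | uniq -c | sort -rn
```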
Key Commands
Launch Interactive TUI
llmfit
CLI Table Output
llmfit --cli
Show System Hardware Detection
```bash
llmfit system
llmfit --json system   # JSON output
```
List All Models
llmfit list
Search Models
```bash
llmfit search "llama 8b"
llmfit search "mistral"
llmfit search "qwen coding"
```
Fit Analysis
```bash
# All runnable models ranked by fit
llmfit fit

# Only perfect fits, top 5
llmfit fit --perfect -n 5

# JSON output
llmfit --json fit -n 10
```
Model Detail
llmfit info "Mistral-7B" llmfit info "Llama-3.1-70B"
Recommendations
```bash
# Top 5 recommendations (JSON default)
llmfit recommend --json --limit 5

# Filter by use case: general, coding, reasoning, chat, multimodal, embedding
llmfit recommend --json --use-case coding --limit 3
llmfit recommend --json --use-case reasoning --limit 5
```
Hardware Planning (invert: what hardware do I need?)
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --quant mlx-4bit llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --target-tps 25 --json llmfit plan "Qwen/Qwen2.5-Coder-0.5B-Instruct" --context 8192 --json
REST API Server (for cluster scheduling)
```bash
llmfit serve
llmfit serve --host 0.0.0.0 --port 8787
```
Hardware Overrides
When autodetection fails (VMs, broken nvidia-smi, passthrough setups):
```bash
# Override GPU VRAM
llmfit --memory=32G
llmfit --memory=24G --cli
llmfit --memory=24G fit --perfect -n 5
llmfit --memory=24G recommend --json

# Megabytes
llmfit --memory=32000M

# Works with any subcommand
llmfit --memory=16G info "Llama-3.1-70B"
```
Accepted suffixes: G/GB/GiB, M/MB/MiB, T/TB/TiB (case-insensitive).
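If you only want to apply the override when detection actually looks wrong, one approach is to inspect `llmfit --json system` first and fall back to a manual value. The `vram_gb` field name is an assumption borrowed from the REST API example later in this document; the CLI JSON may use a different key.

```bash
#!/bin/bash
# Use detected VRAM if it looks sane, otherwise override manually (e.g. inside a VM).
VRAM=$(llmfit --json system | jq -r '.vram_gb // 0' || echo 0)
if [ "${VRAM%%.*}" -gt 0 ]; then
  llmfit fit --perfect -n 5
else
  llmfit --memory=24G fit --perfect -n 5
fi
```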
Context Length Cap
```bash
# Estimate memory fit at 4K context
llmfit --max-context 4096 --cli

# With subcommands
llmfit --max-context 8192 fit --perfect -n 5
llmfit --max-context 16384 recommend --json --limit 5

# Environment variable alternative
export OLLAMA_CONTEXT_LENGTH=8192
llmfit recommend --json
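```

To see how much the context cap changes what fits, you can compare the number of perfect fits at two caps. As above, the `models`/`fit` JSON field names are assumptions taken from the scripting examples below.

```bash
#!/bin/bash
# Compare how many models fit "perfect" at two context caps.
for ctx in 4096 16384; do
  count=$(llmfit --max-context "$ctx" --json fit | jq '[.models[] | select(.fit == "perfect")] | length')
  echo "context ${ctx}: ${count} perfect fits"
done
```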
REST API Reference
Start the server:
llmfit serve --host 0.0.0.0 --port 8787
Endpoints
```bash
# Health check
curl http://localhost:8787/health

# Node hardware info
curl http://localhost:8787/api/v1/system

# Full model list with filters
curl "http://localhost:8787/api/v1/models?min_fit=marginal&runtime=llamacpp&sort=score&limit=20"

# Top runnable models for this node (key scheduling endpoint)
curl "http://localhost:8787/api/v1/models/top?limit=5&min_fit=good&use_case=coding"

# Search by model name/provider
curl "http://localhost:8787/api/v1/models/Mistral?runtime=any"
```
Query Parameters for /models and /models/top
| Param | Values | Description |
|---|---|---|
| limit | integer | Max rows returned |
| min_fit | perfect / good / marginal / too_tight | Minimum fit tier |
| | | Force perfect-only |
| runtime | e.g. llamacpp, any | Filter by runtime |
| use_case | general / coding / reasoning / chat / multimodal / embedding | Use case filter |
| | string | Substring match on provider |
| | string | Free-text across name/provider/size/use-case |
| sort | e.g. score | Sort column |
| | | Include non-runnable models |
| | integer | Per-request context cap |
Scripting & Automation Examples
Bash: Get top coding models as JSON
```bash
#!/bin/bash
# Get top 3 coding models that fit perfectly
llmfit recommend --json --use-case coding --limit 3 | \
  jq -r '.models[] | "\(.name) (\(.score)) - \(.quantization)"'
```
Bash: Check if a specific model fits
```bash
#!/bin/bash
MODEL="Mistral-7B"
RESULT=$(llmfit info "$MODEL" --json 2>/dev/null)
FIT=$(echo "$RESULT" | jq -r '.fit')
if [[ "$FIT" == "perfect" || "$FIT" == "good" ]]; then
  echo "$MODEL will run well (fit: $FIT)"
else
  echo "$MODEL may not run well (fit: $FIT)"
fi
```
Bash: Auto-pull top Ollama model
```bash
#!/bin/bash
# Get the top fitting model name and pull it with Ollama
TOP_MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
echo "Pulling: $TOP_MODEL"
ollama pull "$TOP_MODEL"
```
Python: Query the REST API
```python
import requests

BASE_URL = "http://localhost:8787"

def get_system_info():
    resp = requests.get(f"{BASE_URL}/api/v1/system")
    return resp.json()

def get_top_models(use_case="coding", limit=5, min_fit="good"):
    params = {
        "use_case": use_case,
        "limit": limit,
        "min_fit": min_fit,
        "sort": "score",
    }
    resp = requests.get(f"{BASE_URL}/api/v1/models/top", params=params)
    return resp.json()

def search_models(query, runtime="any"):
    resp = requests.get(
        f"{BASE_URL}/api/v1/models/{query}",
        params={"runtime": runtime},
    )
    return resp.json()

# Example usage
system = get_system_info()
print(f"GPU: {system.get('gpu_name')} | VRAM: {system.get('vram_gb')}GB")

models = get_top_models(use_case="reasoning", limit=3)
for m in models.get("models", []):
    print(f"{m['name']}: score={m['score']}, fit={m['fit']}, quant={m['quantization']}")
```
Python: Hardware-aware model selector for agents
```python
import subprocess
import json

def get_best_model_for_task(use_case: str, min_fit: str = "good") -> dict:
    """Use llmfit to select the best model for a given task."""
    result = subprocess.run(
        ["llmfit", "recommend", "--json", "--use-case", use_case, "--limit", "1"],
        capture_output=True, text=True
    )
    data = json.loads(result.stdout)
    models = data.get("models", [])
    return models[0] if models else None

def plan_hardware_requirements(model_name: str, context: int = 4096) -> dict:
    """Get hardware requirements for running a specific model."""
    result = subprocess.run(
        ["llmfit", "plan", model_name, "--context", str(context), "--json"],
        capture_output=True, text=True
    )
    return json.loads(result.stdout)

# Select best coding model
best = get_best_model_for_task("coding")
if best:
    print(f"Best coding model: {best['name']}")
    print(f"  Quantization: {best['quantization']}")
    print(f"  Estimated tok/s: {best['tps']}")
    print(f"  Memory usage: {best['mem_pct']}%")

# Plan hardware for a specific model
plan = plan_hardware_requirements("Qwen/Qwen3-4B-MLX-4bit", context=8192)
print(f"Min VRAM needed: {plan['hardware']['min_vram_gb']}GB")
print(f"Recommended VRAM: {plan['hardware']['recommended_vram_gb']}GB")
```
Docker Compose: Node scheduler pattern
version: "3.8" services: llmfit-api: image: ghcr.io/alexsjones/llmfit command: serve --host 0.0.0.0 --port 8787 ports: - "8787:8787" environment: - OLLAMA_CONTEXT_LENGTH=8192 devices: - /dev/nvidia0:/dev/nvidia0 # pass GPU through
TUI Key Reference
| Key | Action |
|---|---|
| | Navigate models |
| | Search (name, provider, params, use case) |
| | Exit search |
| | Clear search |
| f | Cycle fit filter: All → Runnable → Perfect → Good → Marginal |
| a | Cycle availability: All → GGUF Avail → Installed |
| | Cycle sort: Score → Params → Mem% → Ctx → Date → Use Case |
| t | Cycle color theme (auto-saved) |
| | Visual mode (multi-select for comparison) |
| | Select mode (column-based filtering) |
| | Plan mode (what hardware needed for this model?) |
| | Provider filter popup |
| | Use-case filter popup |
| | Capability filter popup |
| | Mark model for comparison |
| | Compare view (marked vs selected) |
| | Download model (via detected runtime) |
| | Refresh installed models from runtimes |
| | Toggle detail view |
| | Jump to top/bottom |
| | Quit |
Themes
t cycles: Default → Dracula → Solarized → Nord → Monokai → Gruvbox
Theme saved to ~/.config/llmfit/theme
GPU Detection Details
| GPU Vendor | Detection Method |
|---|---|
| NVIDIA | nvidia-smi (multi-GPU, aggregates VRAM) |
| AMD | |
| Intel Arc | sysfs (discrete) / (integrated) |
| Apple Silicon | (unified memory = VRAM) |
| Ascend | |
Common Patterns
"What can I run on my 16GB M2 Mac?"
```bash
llmfit fit --perfect -n 10

# or interactively
llmfit   # press 'f' to filter to Perfect fit
```
"I have a 3090 (24GB VRAM), what coding models fit?"
```bash
llmfit recommend --json --use-case coding | jq '.models[]'

# or with manual override if detection fails
llmfit --memory=24G recommend --json --use-case coding
```
"Can Llama 70B run on my machine?"
llmfit info "Llama-3.1-70B" # Plan what hardware you'd need llmfit plan "Llama-3.1-70B" --context 4096 --json
"Show me only models already installed in Ollama"
```bash
llmfit   # press 'a' to cycle to Installed filter

# or
llmfit fit -n 20   # run, press 'i' in TUI for installed-first
```
"Script: find best model and start Ollama"
```bash
MODEL=$(llmfit recommend --json --limit 1 | jq -r '.models[0].name')
ollama serve &
ollama run "$MODEL"
```
"API: poll node capabilities for cluster scheduler"
```bash
# Check node, get top 3 good+ models for reasoning
curl -s "http://node1:8787/api/v1/models/top?limit=3&min_fit=good&use_case=reasoning" | \
  jq '.models[].name'
```
Troubleshooting
GPU not detected / wrong VRAM reported
```bash
# Verify detection
llmfit system

# Manual override
llmfit --memory=24G --cli
```
nvidia-smi not found but you have an NVIDIA GPU
```bash
# Install CUDA toolkit or nvidia-utils, then retry
# Or override manually:
llmfit --memory=8G fit --perfect
```
Models show as too_tight but you have enough RAM
```bash
# llmfit may be using context-inflated estimates; cap context
llmfit --max-context 2048 fit --perfect -n 10
```
REST API: test endpoints
```bash
# Spawn server and run validation suite
python3 scripts/test_api.py --spawn

# Test already-running server
python3 scripts/test_api.py --base-url http://127.0.0.1:8787
```
Apple Silicon: VRAM shows as system RAM (expected)
```bash
# This is correct — Apple Silicon uses unified memory
# llmfit accounts for this automatically
llmfit system   # should show backend: Metal
```
Context length environment variable
```bash
export OLLAMA_CONTEXT_LENGTH=4096
llmfit recommend --json   # uses 4096 as context cap
```