Claude-skill-registry llamafile
When setting up local LLM inference without cloud APIs. When running GGUF models locally. When needing OpenAI-compatible API from a local model. When building offline/air-gapped AI tools. When troubleshooting local LLM server connections.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llamafile" ~/.claude/skills/majiayu000-claude-skill-registry-llamafile && rm -rf "$T"
skills/data/llamafile/SKILL.md
Llamafile
Configure and manage Mozilla Llamafile - a cross-platform executable distribution format that runs LLMs locally with an OpenAI-compatible API.
When to Use This Skill
Use this skill when:
- Installing llamafile binary and GGUF model files
- Starting llamafile server with optimal configuration
- Integrating llamafile with LiteLLM or OpenAI SDK
- Configuring llamafile for different performance profiles (GPU, CPU, network access)
- Troubleshooting llamafile server startup or API connection issues
- Building applications requiring local LLM inference
- Setting up commit message tools, code review systems, or other developer tools with local AI
- Managing llamafile as a background service
- Selecting and downloading appropriate GGUF models
- Validating OpenAI-compatible API responses
Core Capabilities
What Llamafile Provides
Llamafile combines llama.cpp with Cosmopolitan Libc to create single-file executables that:
- Run on macOS, Windows, Linux, FreeBSD, OpenBSD, NetBSD
- Support AMD64 and ARM64 architectures
- Serve OpenAI-compatible HTTP API on localhost
- Load GGUF model files for inference
- Provide a /health endpoint for monitoring
- Support GPU acceleration (CUDA, Metal, Vulkan)
- Enable embeddings generation with the --embedding flag
API Compatibility
Llamafile exposes these OpenAI-compatible endpoints when running with --server:
| Endpoint | Description | Requirements |
|---|---|---|
| /v1/chat/completions | Chat completions (primary) | Server mode |
| /v1/completions | Text completions | Server mode |
| /v1/embeddings | Generate embeddings | --embedding flag |
| /health | Health check | Server mode |
Critical Detail: All OpenAI-compatible endpoints require the /v1 prefix in the URL path.
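To make the path rule concrete, here is a minimal sketch (the host and port are assumptions matching the defaults used throughout this skill) showing which endpoints take the /v1 prefix:

```python
# Minimal sketch of llamafile endpoint URLs (host/port are assumed defaults).
BASE = "http://127.0.0.1:8080"

health_url = f"{BASE}/health"              # health check: no /v1 prefix
chat_url = f"{BASE}/v1/chat/completions"   # OpenAI-compatible: /v1 required
embeddings_url = f"{BASE}/v1/embeddings"   # also needs --embedding on the server

print(health_url, chat_url, embeddings_url, sep="\n")
```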
Installation
Download Llamafile Binary
```bash
# Download llamafile v0.9.3 binary
curl -L -o llamafile https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3

# Make executable
chmod 755 llamafile

# Verify version
./llamafile --version
```
Alternative download sources:
- GitHub Release: https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3
- SourceForge Mirror: https://sourceforge.net/projects/llamafile.mirror/files/0.9.3/
Download GGUF Model
Llamafile requires GGUF format models. Download from Hugging Face:
```bash
# Recommended: Gemma 3 3B (balanced speed/quality, ~2GB)
curl -L -o gemma-3-3b.gguf \
  https://huggingface.co/Mozilla/gemma-3-3b-it-gguf/resolve/main/gemma-3-3b-it-Q4_K_M.gguf

# Alternative: Pre-packaged llamafile with embedded model
curl -LO https://huggingface.co/Mozilla/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile
chmod +x llava-v1.5-7b-q4.llamafile
```
Recommended models by use case:
| Model | Size | Use Case | Download |
|---|---|---|---|
| Gemma 3 3B | ~2GB | Balanced speed/quality | Mozilla/gemma-3-3b-it-gguf |
| Qwen3-0.6B | ~500MB | Fast, lower quality | Mozilla/Qwen3-0.6B-gguf |
| Mistral 7B | ~4GB | Higher quality, slower | Mozilla/Mistral-7B-gguf |
| Llama 3.1 8B | ~5GB | Best quality, slowest | Mozilla/Llama-3.1-8B-gguf |
Quantization recommendation: Use Q4_K_M quantized models for optimal balance of quality and performance.
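If you prefer scripting the download, here is a minimal sketch using httpx to stream a GGUF file to disk; the URL and filename simply mirror the Gemma 3 3B example above and should be adjusted for your chosen model.

```python
# Hedged sketch: stream-download a GGUF model with httpx.
# URL and filename mirror the Gemma 3 3B example above; adjust for your model.
import httpx

url = (
    "https://huggingface.co/Mozilla/gemma-3-3b-it-gguf/"
    "resolve/main/gemma-3-3b-it-Q4_K_M.gguf"
)
dest = "gemma-3-3b.gguf"

with httpx.stream("GET", url, follow_redirects=True, timeout=None) as response:
    response.raise_for_status()
    with open(dest, "wb") as f:
        for chunk in response.iter_bytes(chunk_size=1 << 20):  # 1 MiB chunks
            f.write(chunk)

print(f"Saved {dest}")
```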
Server Configuration
Basic Server Command
Start llamafile server for local API access:
```bash
./llamafile --server \
  -m /path/to/model.gguf \
  --nobrowser \
  --port 8080 \
  --host 127.0.0.1
```
Critical flags explained:
- --server: Required to enable the HTTP API endpoints
- -m: Path to the GGUF model file (required)
- --nobrowser: Prevents auto-opening a browser on startup
- --port 8080: Default port (note: NOT 8000)
- --host 127.0.0.1: Localhost only (secure default)
Performance-Optimized Configuration
For GPU-accelerated inference with higher throughput:
```bash
./llamafile --server \
  -m /path/to/model.gguf \
  --nobrowser \
  --port 8080 \
  --host 127.0.0.1 \
  --ctx-size 4096 \
  --n-gpu-layers 99 \
  --threads 8 \
  --cont-batching \
  --parallel 4
```
Advanced flags:
| Flag | Purpose | Default | When to Use |
|---|---|---|---|
| --ctx-size | Prompt context window size | 512 | Increase for longer conversations |
| --n-gpu-layers | GPU offload layer count | 0 | Set to 99 to offload all layers to GPU |
| --threads | CPU threads for generation | Auto | Set explicitly for consistent performance |
| --threads-batch | Threads for batch processing | Same as --threads | Tune separately for prompt vs generation |
| --cont-batching | Continuous batching | Off | Enable for multiple concurrent requests |
| --parallel | Parallel sequence count | 1 | Increase for concurrent request handling |
| --mlock | Lock model in memory | Off | Prevent swapping on systems with sufficient RAM |
| --embedding | Enable embeddings endpoint | Off | Required for the /v1/embeddings API |

A small helper that assembles these flags into a command line is sketched below.
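This is a sketch only; the flag values are illustrative and should be tuned for your hardware.

```python
# Sketch: build the performance-optimized server command from the flags above.
def build_server_cmd(llamafile: str, model: str, gpu: bool = True) -> list[str]:
    cmd = [
        llamafile, "--server",
        "-m", model,
        "--nobrowser",
        "--port", "8080",
        "--host", "127.0.0.1",
        "--ctx-size", "4096",
        "--threads", "8",
        "--cont-batching",
        "--parallel", "4",
    ]
    if gpu:
        # Omit on CPU-only systems (see Common Pitfalls below).
        cmd += ["--n-gpu-layers", "99"]
    return cmd


print(" ".join(build_server_cmd("./llamafile", "/path/to/model.gguf")))
```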
Network-Accessible Configuration
To allow connections from other machines (development/testing only):
```bash
./llamafile --server \
  -m /path/to/model.gguf \
  --nobrowser \
  --host 0.0.0.0 \
  --port 8080
```
Security warning: Binding to 0.0.0.0 exposes the API to network access. Use only in trusted environments.
API Integration
Using LiteLLM (Recommended)
LiteLLM provides a unified interface for llamafile and cloud LLM providers.
```python
import litellm

response = litellm.completion(
    model="llamafile/gemma-3-3b",         # MUST use llamafile/ prefix
    messages=[{"role": "user", "content": "Hello, world!"}],
    api_base="http://localhost:8080/v1",  # MUST include /v1 suffix
    temperature=0.3,
    max_tokens=200,
)
print(response.choices[0].message.content)
```
Critical requirements for LiteLLM:
- Model name MUST use the llamafile/ prefix for routing
- api_base MUST include the /v1 suffix
- No API key is required (any placeholder value works)

A quick validation of the first two requirements is sketched below.
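The helper name here is hypothetical; it simply restates the two requirements as checks.

```python
# Hypothetical helper: validate the two LiteLLM requirements before calling.
def check_litellm_config(model: str, api_base: str) -> None:
    if not model.startswith("llamafile/"):
        raise ValueError("model must use the llamafile/ prefix, e.g. llamafile/gemma-3-3b")
    if not api_base.rstrip("/").endswith("/v1"):
        raise ValueError("api_base must end with /v1, e.g. http://localhost:8080/v1")


check_litellm_config("llamafile/gemma-3-3b", "http://localhost:8080/v1")
```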
Related skill: For comprehensive LiteLLM configuration, activate the litellm skill:
Skill(command: "litellm")
Using OpenAI Python SDK
Direct integration with OpenAI SDK for llamafile endpoints:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # MUST include /v1
    api_key="sk-no-key-required",         # Any value works
)

response = client.chat.completions.create(
    model="local-model",  # Model name is flexible
    messages=[
        {"role": "user", "content": "Hello, world!"},
    ],
    temperature=0.3,
    max_tokens=200,
)
print(response.choices[0].message.content)
```
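For incremental output, a streaming variant of the same request looks like the sketch below; this assumes the server honors stream=True, which the underlying llama.cpp server generally does.

```python
# Hedged sketch: streaming variant (assumes the server supports stream=True).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello, world!"}],
    temperature=0.3,
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```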
Using curl for Testing
Verify llamafile server is responding correctly:
```bash
# Health check
curl http://localhost:8080/health

# Chat completions
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.3,
    "max_tokens": 200
  }'

# Embeddings (requires --embedding flag on server)
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "input": ["Hello world"]
  }'
```
Server Management
Process Management Script
Python script to start llamafile as a background process with health checking:
```python
import subprocess
import time

import httpx


def start_llamafile(
    llamafile_path: str,
    model_path: str,
    port: int = 8080,
    host: str = "127.0.0.1",
) -> subprocess.Popen:
    """Start llamafile server as a background process."""
    cmd = [
        llamafile_path,
        "--server",
        "-m", model_path,
        "--nobrowser",
        "--port", str(port),
        "--host", host,
    ]
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    _wait_for_server(host, port)
    return process


def _wait_for_server(host: str, port: int, timeout: int = 30) -> None:
    """Wait for the server to respond to health checks."""
    url = f"http://{host}:{port}/health"
    start = time.time()
    while time.time() - start < timeout:
        try:
            response = httpx.get(url, timeout=2)
            if response.status_code == 200:
                return
        except httpx.RequestError:
            pass
        time.sleep(0.5)
    raise TimeoutError(f"Server did not start within {timeout} seconds")
```
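A usage sketch for the helper above; the paths are placeholders, and any OpenAI-compatible client could replace LiteLLM here.

```python
# Usage sketch: start the server, send one request, then shut it down.
# Paths below are placeholders.
import litellm

process = start_llamafile(
    llamafile_path="/home/user/.local/bin/llamafile",
    model_path="/home/user/models/gemma-3-3b.gguf",
)
try:
    response = litellm.completion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "ping"}],
        api_base="http://127.0.0.1:8080/v1",
    )
    print(response.choices[0].message.content)
finally:
    process.terminate()       # SIGTERM the server
    process.wait(timeout=10)  # reap the process
```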
Configuration File Pattern
Example TOML configuration for applications using llamafile:
```toml
# ~/.config/app-name/config.toml

[ai]
model = "llamafile/gemma-3-3b"  # Must use llamafile/ prefix
temperature = 0.3
max_tokens = 200

[llamafile]
path = "/home/user/.local/bin/llamafile"
model_path = "/home/user/.local/share/app-name/models/gemma-3-3b.gguf"
api_base = "http://127.0.0.1:8080/v1"  # Include /v1 suffix
```
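Loading that file in Python might look like the following sketch, using the standard-library tomllib (Python 3.11+):

```python
# Sketch: load the TOML config above with tomllib (Python 3.11+).
import tomllib
from pathlib import Path

config_path = Path.home() / ".config" / "app-name" / "config.toml"
with config_path.open("rb") as f:
    config = tomllib.load(f)

model = config["ai"]["model"]               # e.g. "llamafile/gemma-3-3b"
api_base = config["llamafile"]["api_base"]  # must end with /v1
assert model.startswith("llamafile/") and api_base.endswith("/v1")
```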
Troubleshooting
Server Fails to Start
Check if port is already in use:
```bash
# Find process using port 8080
lsof -i :8080

# Kill existing process
kill $(lsof -t -i :8080)
```
Verify model file exists and is readable:
ls -lh /path/to/model.gguf
Check llamafile binary permissions:
```bash
ls -la /path/to/llamafile
# Should show: -rwxr-xr-x (executable)

# Fix permissions if needed
chmod 755 /path/to/llamafile
```
Connection Refused Errors
Verify server is running:
```bash
# Check health endpoint
curl http://localhost:8080/health

# Check server is listening
netstat -tlnp | grep 8080
# or
lsof -i :8080
```
Common causes:
- Server not started with the --server flag
- Wrong port number (8080 vs 8000)
- Missing /v1 in the API URL path
- Server bound to 127.0.0.1 but accessed from another machine

The sketch below automates these checks.
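Host, port, and API base in this sketch are assumptions matching the defaults used elsewhere in this skill.

```python
# Sketch: check port, health endpoint, and /v1 suffix in one pass.
import socket

import httpx

HOST, PORT = "127.0.0.1", 8080
API_BASE = "http://127.0.0.1:8080/v1"


def port_open(host: str, port: int) -> bool:
    """Return True if something is listening on host:port."""
    with socket.socket() as s:
        s.settimeout(2)
        return s.connect_ex((host, port)) == 0


print("port listening:", port_open(HOST, PORT))
try:
    print("health status:", httpx.get(f"http://{HOST}:{PORT}/health", timeout=2).status_code)
except httpx.RequestError as exc:
    print("health check failed:", exc)
print("api_base ends with /v1:", API_BASE.rstrip("/").endswith("/v1"))
```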
API Errors
Test basic connectivity:
```bash
# Verbose health check
curl -v http://localhost:8080/health

# Test chat completions with verbose output
curl -v http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"test","messages":[{"role":"user","content":"Hi"}]}'
```
Common API issues:
| Error | Cause | Solution |
|---|---|---|
| 404 Not Found | Missing /v1 in URL | Add /v1 before the endpoint path |
| Connection refused | Server not running | Start server with the --server flag |
| Timeout | Model loading slowly | Wait longer or use smaller model |
| Invalid model | Wrong model path | Verify path to GGUF file |
Performance Issues
Optimize inference speed:
- Use quantized models (Q4_K_M recommended)
- Enable GPU acceleration: --n-gpu-layers 99
- Increase threads: --threads 8
- Enable continuous batching: --cont-batching
- Reduce context size if not needed: --ctx-size 2048

A rough throughput check is sketched below.
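This is a hedged sketch: it assumes the response carries a usage.completion_tokens count, which the llama.cpp-based server normally reports in its OpenAI-compatible responses.

```python
# Hedged sketch: time one completion and estimate tokens/second.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

start = time.time()
response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Write one paragraph about llamas."}],
    max_tokens=200,
)
elapsed = time.time() - start

tokens = response.usage.completion_tokens if response.usage else 0
print(f"{tokens} tokens in {elapsed:.1f}s ({tokens / max(elapsed, 1e-6):.1f} tok/s)")
```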
Check GPU availability:
```bash
# NVIDIA GPU
nvidia-smi

# AMD GPU
rocm-smi

# Apple Metal (check Activity Monitor)
```
Common Pitfalls
Avoid these frequent errors when using llamafile:
- Port 8000 vs 8080: Llamafile defaults to port 8080, not 8000
- Missing /v1 in API URL: Always include the /v1 suffix for OpenAI-compatible endpoints
- LiteLLM prefix: Must use the llamafile/ prefix in the model name for proper routing
- API key confusion: No real API key is needed, but some clients require a placeholder value
- Starting server from hooks: Application hooks should check whether the server is running, not start it
- Model path issues: Ensure the GGUF file exists and is readable before starting the server
- Binary permissions: The llamafile binary must be executable (chmod 755)
- GPU layers on CPU: Setting --n-gpu-layers on CPU-only systems causes errors
Version Information
Current stable version: 0.9.3 (May 14, 2025)
Version constants:
```
LLAMAFILE_MAJOR = 0
LLAMAFILE_MINOR = 9
LLAMAFILE_PATCH = 3
```
Recent changes in 0.9.3:
- Added Phi4 model support
- Added Qwen3 model support
- Respects NO_COLOR environment variable
- Fixed URL handling in JavaScript (preserves path when building relative URLs)
- Added Plaintext output option to LocalScore
Related Skills and Tools
Skills to activate:
- litellm: unified LLM provider interface and routing; activate with Skill(command: "litellm")
External tools:
- LiteLLM - Unified interface for multiple LLM providers
- OpenAI Python SDK - Direct OpenAI-compatible API access
- llama.cpp - Underlying inference engine
- GGUF format - Model format specification
References
Official Documentation
- Mozilla llamafile GitHub - Primary repository and source code
- Mozilla llamafile Documentation - Official documentation site
- LiteLLM llamafile Provider - LiteLLM integration guide
- llama.cpp Server Documentation - Underlying server implementation
- Releases Page - Binary downloads and changelog
Model Resources
- Hugging Face Mozilla Models - Official Mozilla GGUF models
- GGUF Format Specification - Model file format details
Related Technologies
- Cosmopolitan Libc - Cross-platform binary format
- llama.cpp - LLM inference engine
- OpenAI API Reference - API compatibility reference