install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/Orchestra-Research/AI-Research-SKILLs/llama-cpp" ~/.claude/skills/comeonoliver-skillshub-llama-cpp && rm -rf "$T"
manifest:
skills/Orchestra-Research/AI-Research-SKILLs/llama-cpp/SKILL.md
llama.cpp
Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
When to use llama.cpp
Use llama.cpp when:
- Running on CPU-only machines
- Deploying on Apple Silicon (M1/M2/M3/M4)
- Using AMD or Intel GPUs (no CUDA)
- Edge deployment (Raspberry Pi, embedded systems)
- Needing simple deployment without Docker or Python
Use TensorRT-LLM instead when:
- Have NVIDIA GPUs (A100/H100)
- Need maximum throughput (100K+ tok/s)
- Running in datacenter with CUDA
Use vLLM instead when:
- Have NVIDIA GPUs
- Need Python-first API
- Want PagedAttention
Quick start
Installation
```bash
# macOS/Linux
brew install llama.cpp

# Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# With Metal (Apple Silicon)
make LLAMA_METAL=1

# With CUDA (NVIDIA)
make LLAMA_CUDA=1

# With ROCm (AMD)
make LLAMA_HIP=1
```
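For the Makefile build, the binaries typically land in the repository root. A quick sanity check, assuming you are still in the llama.cpp checkout, is to print the CLI's help text:

```bash
# Sanity-check the build by printing the first lines of llama-cli's help output
./llama-cli --help | head -n 20
```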
Download model
```bash
# Download from HuggingFace (GGUF format)
huggingface-cli download \
  TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/

# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
```
Run inference
```bash
# Simple chat
./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -p "Explain quantum computing" \
  -n 256  # Max tokens

# Interactive chat
./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --interactive
```
Server mode
```bash
# Start OpenAI-compatible server
./llama-server \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 32  # Offload 32 layers to GPU

# Client request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
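Before pointing clients at the server, it can help to confirm it has finished loading the model. A minimal sketch, assuming the server above is listening on port 8080 and using llama-server's OpenAI-compatible API (the `/health` endpoint and `"stream": true` for server-sent events):

```bash
# Liveness check: returns a small JSON status once the model is loaded
curl http://localhost:8080/health

# Same chat request as above, but streamed token-by-token (SSE)
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```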
Quantization formats
GGUF format overview
| Format | Bits | Size (7B) | Speed | Quality | Use Case |
|---|---|---|---|---|---|
| Q4_K_M | 4.5 | 4.1 GB | Fast | Good | Recommended default |
| Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical |
| Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical |
| Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality |
| Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation |
| Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
Choosing quantization
```
# General use (balanced)
Q4_K_M              # 4-bit, medium quality

# Maximum speed (more degradation)
Q2_K or Q3_K_M

# Maximum quality (slower)
Q6_K or Q8_0

# Very large models (70B, 405B)
Q3_K_M or Q4_K_S    # Lower bits to fit in memory
```
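If the quantization level you want is not already published on the Hub, you can produce it yourself. A minimal sketch, assuming the conversion script from the quick start and the `llama-quantize` binary built alongside `llama-cli` (named `quantize` in older builds); the file names are illustrative:

```bash
# Convert the HF checkpoint to an fp16 GGUF first (see "Download model" above)
python convert_hf_to_gguf.py models/llama-2-7b-chat/ \
  --outtype f16 \
  --outfile models/llama-2-7b-chat.f16.gguf

# Re-quantize the fp16 GGUF to the format chosen from the table
./llama-quantize models/llama-2-7b-chat.f16.gguf models/llama-2-7b-chat.Q4_K_M.gguf Q4_K_M
```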
Hardware acceleration
Apple Silicon (Metal)
```bash
# Build with Metal
make LLAMA_METAL=1

# Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999  # Offload all layers

# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
```
NVIDIA GPUs (CUDA)
```bash
# Build with CUDA
make LLAMA_CUDA=1

# Offload layers to GPU
./llama-cli -m model.gguf -ngl 35  # Offload 35/40 layers

# Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20  # GPU: 20 layers, CPU: rest
```
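To check whether the offloaded layers actually fit in VRAM, one simple approach is to watch GPU memory with the standard `nvidia-smi` tool while the model loads:

```bash
# Poll GPU memory usage once per second during model load
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```

If used memory approaches the card's total, lower `-ngl` until the load succeeds; the remaining layers stay on the CPU.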
AMD GPUs (ROCm)
```bash
# Build with ROCm
make LLAMA_HIP=1

# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
```
Common patterns
Batch processing
```bash
# Process multiple prompts from file
cat prompts.txt | ./llama-cli \
  -m model.gguf \
  --batch-size 512 \
  -n 100
```
Constrained generation
```bash
# JSON output with grammar
./llama-cli \
  -m model.gguf \
  -p "Generate a person: " \
  --grammar-file grammars/json.gbnf

# Outputs valid JSON only
```
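Grammars use llama.cpp's GBNF syntax; `grammars/json.gbnf` ships with the repository. As an illustration of how constraints are expressed, a tiny hypothetical grammar (file name is illustrative) that only permits a yes/no answer might look like this:

```bash
# Write a minimal GBNF grammar that only allows "yes" or "no"
cat > yes-no.gbnf <<'EOF'
root ::= "yes" | "no"
EOF

# Constrain generation to that grammar
./llama-cli -m model.gguf -p "Is the sky blue? " --grammar-file yes-no.gbnf
```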
Context size
```bash
# Increase context (default 512)
./llama-cli \
  -m model.gguf \
  -c 4096  # 4K context window

# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768  # 32K context
```
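Larger contexts mostly cost memory for the KV cache, which grows linearly with `-c`. A rough back-of-the-envelope sketch, assuming Llama 2-7B's architecture (32 layers, 32 KV heads, head dimension 128) and an fp16 cache:

```bash
# KV cache ≈ 2 (K and V) × layers × context × kv_heads × head_dim × 2 bytes (fp16)
echo $(( 2 * 32 * 4096 * 32 * 128 * 2 / 1024 / 1024 )) MiB  # ≈ 2048 MiB at -c 4096
```

At `-c 32768` the same model needs roughly 16 GiB of cache on top of the weights, which is why very long contexts often force smaller quantizations or fewer offloaded layers.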
Performance benchmarks
CPU performance (Llama 2-7B Q4_K_M)
| CPU | Threads | Speed | Cost |
|---|---|---|---|
| Apple M3 Max | 16 | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour |
| Intel i9-13900K | 32 | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
GPU acceleration (Llama 2-7B Q4_K_M)
| GPU | Speed | vs CPU | Cost |
|---|---|---|---|
| NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) |
| NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour |
| AMD MI250 | 70 tok/s | 2× | $2.00/hour |
| Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
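These numbers vary with build flags, thread count, and quantization; to measure your own hardware, llama.cpp ships a `llama-bench` tool. A minimal invocation, assuming the model path from the quick start:

```bash
# Benchmark prompt processing and token generation speed for a given model
./llama-bench -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 99
```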
Supported models
LLaMA family:
- Llama 2 (7B, 13B, 70B)
- Llama 3 (8B, 70B, 405B)
- Code Llama
Mistral family:
- Mistral 7B
- Mixtral 8x7B, 8x22B
Other:
- Falcon, BLOOM, GPT-J
- Phi-3, Gemma, Qwen
- LLaVA (vision), Whisper (audio)
Find models: https://huggingface.co/models?library=gguf
References
- Quantization Guide - GGUF formats, conversion, quality comparison
- Server Deployment - API endpoints, Docker, monitoring
- Optimization - Performance tuning, hybrid CPU+GPU
Resources
- GitHub: https://github.com/ggerganov/llama.cpp
- Models: https://huggingface.co/models?library=gguf
- Discord: https://discord.gg/llama-cpp