# Hermes-agent llama-cpp
Run LLM inference with llama.cpp on CPU, Apple Silicon, AMD/Intel GPUs, or NVIDIA GPUs — plus GGUF model conversion and quantization (2–8 bit with K-quants and imatrix). Covers the CLI, Python bindings, the OpenAI-compatible server, and Ollama/LM Studio integration. Use it for edge deployment, M1/M2/M3/M4 Macs, CUDA-less environments, or flexible local quantization.
```bash
# Clone the full repo
git clone https://github.com/NousResearch/hermes-agent

# Or copy just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/NousResearch/hermes-agent "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/mlops/inference/llama-cpp" ~/.claude/skills/nousresearch-hermes-agent-llama-cpp-d41f5b && rm -rf "$T"
```
# llama.cpp + GGUF
Pure C/C++ LLM inference with minimal dependencies, plus the GGUF (GPT-Generated Unified Format) standard used for quantized weights. One toolchain covers conversion, quantization, and serving.
## When to use
Use llama.cpp + GGUF when:
- Running on CPU-only machines or Apple Silicon (M1/M2/M3/M4) with Metal acceleration
- Using AMD (ROCm) or Intel GPUs where CUDA isn't available
- Edge deployment (Raspberry Pi, embedded systems, consumer laptops)
- Need flexible quantization (2–8 bit with K-quants)
- Want local AI tools (LM Studio, Ollama, text-generation-webui, koboldcpp)
- Want a single binary deploy without Docker/Python
Key advantages:
- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD, Intel
- No Python runtime required (pure C/C++)
- K-quants + imatrix for better low-bit quality
- OpenAI-compatible server built in
- Rich ecosystem (Ollama, LM Studio, llama-cpp-python)
Use alternatives instead:
- vLLM — NVIDIA GPUs, PagedAttention, Python-first, max throughput
- TensorRT-LLM — Production NVIDIA (A100/H100), maximum speed
- AWQ/GPTQ — Calibrated quantization for NVIDIA-only deployments
- bitsandbytes — Simple HuggingFace transformers integration
- HQQ — Fast calibration-free quantization
## Quick start

### Install

```bash
# macOS / Linux (simplest)
brew install llama.cpp

# Or build from source
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
make                 # CPU
make GGML_METAL=1    # Apple Silicon
make GGML_CUDA=1     # NVIDIA CUDA
make GGML_HIPBLAS=1  # AMD ROCm

# Python bindings (optional)
pip install llama-cpp-python

# With CUDA:
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# With Metal:
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```
### Download a pre-quantized GGUF

```bash
# TheBloke hosts most popular models pre-quantized
huggingface-cli download \
  TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/
```
### Or convert a HuggingFace model to GGUF

```bash
# 1. Download the HF model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to FP16 GGUF
python convert_hf_to_gguf.py ./llama-3.1-8b \
  --outfile llama-3.1-8b-f16.gguf \
  --outtype f16

# 3. Quantize to Q4_K_M
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M
```
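Before quantizing, it's worth a quick sanity check that conversion produced a complete, readable file. A minimal sketch, assuming the `gguf` Python package (maintained in the llama.cpp repo, published on PyPI) and its `GGUFReader` API:

```python
# Sketch: inspect a converted GGUF file's metadata before quantizing.
# Assumes the `gguf` package (pip install gguf) and its GGUFReader API.
from gguf import GGUFReader

reader = GGUFReader("llama-3.1-8b-f16.gguf")

# List a few metadata keys (architecture, context length, tokenizer, ...)
for key in list(reader.fields)[:10]:
    print(key)

# A truncated conversion typically shows up as missing tensors
print(f"Tensor count: {len(reader.tensors)}")
```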
### Run inference

```bash
# One-shot prompt
./llama-cli -m model.Q4_K_M.gguf -p "Explain quantum computing" -n 256

# Interactive chat
./llama-cli -m model.Q4_K_M.gguf --interactive

# With GPU offload
./llama-cli -m model.Q4_K_M.gguf -ngl 35 -p "Hello!"
```
### Serve an OpenAI-compatible API

```bash
./llama-server \
  -m model.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 35 \
  -c 4096 \
  --parallel 4 \
  --cont-batching
```

Test it with curl:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
```
## Quantization formats (GGUF)

### K-quant methods (recommended)
| Type | Bits | Size (7B) | Quality | Use Case |
|---|---|---|---|---|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression (testing only) |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Fits small devices |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Speed critical |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality, minimal degradation |
Variant suffixes — `_S` (small: faster, lower quality), `_M` (medium: balanced), `_L` (large: better quality).

Legacy formats (Q4_0/Q4_1/Q5_0/Q5_1) still exist, but prefer K-quants for their better quality/size ratio.
IQ quantization — ultra-low-bit, importance-aware methods: IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_XS, IQ3_S, IQ4_XS. These require an importance matrix (`--imatrix`).
Task-specific defaults:
- General chat / assistants: Q4_K_M, or Q5_K_M if RAM allows
- Code generation: Q5_K_M or Q6_K (higher precision helps)
- Technical / medical: Q6_K or Q8_0
- Very large (70B, 405B) on consumer hardware: Q3_K_M or Q4_K_S
- Raspberry Pi / edge: Q2_K or Q3_K_S
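The sizes in the K-quant table above follow from simple arithmetic: file size is roughly parameter count × bits per weight / 8, plus some overhead for embeddings and metadata. A back-of-envelope helper, where the 1.1 overhead factor is a loose assumption rather than exact GGUF accounting:

```python
# Rough GGUF size estimate: params * bits-per-weight / 8 bytes, padded by
# a ~10% overhead guess (embeddings, metadata, higher-precision tensors).
BPW = {"Q2_K": 2.5, "Q3_K_M": 3.3, "Q4_K_M": 4.5, "Q5_K_M": 5.5,
       "Q6_K": 6.0, "Q8_0": 8.0}

def estimate_gb(params_billion: float, quant: str) -> float:
    return params_billion * 1e9 * BPW[quant] / 8 * 1.1 / 1e9

print(f"7B @ Q4_K_M: ~{estimate_gb(7, 'Q4_K_M'):.1f} GB")    # ~4.3 GB
print(f"70B @ Q4_K_M: ~{estimate_gb(70, 'Q4_K_M'):.1f} GB")  # ~43 GB
```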
## Conversion workflows

### Basic: HF → GGUF → quantized

```bash
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf --outtype f16
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M
./llama-cli -m model-q4_k_m.gguf -p "Hello!" -n 50   # smoke test
```
### With importance matrix (imatrix) — better low-bit quality

An imatrix typically gives a 10–20% perplexity improvement at Q4 and is essential at Q3 and below.

```bash
# 1. Convert to FP16 GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Prepare calibration data (diverse text; ~100 MB is ideal)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
EOF
# Add more diverse text samples to calibration.txt...

# 3. Generate the importance matrix
./llama-imatrix -m model-f16.gguf \
  -f calibration.txt \
  --chunk 512 \
  -o model.imatrix \
  -ngl 35

# 4. Quantize with the imatrix
./llama-quantize --imatrix model.imatrix \
  model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
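If you prefer to assemble calibration data programmatically, a sketch like this works; the source filenames are hypothetical placeholders, and the right mix is whatever best resembles your deployment traffic:

```python
# Sketch: concatenate diverse text samples into a calibration file for
# llama-imatrix. The input filenames are placeholders (assumptions).
from pathlib import Path

sources = ["wiki_sample.txt", "code_sample.txt", "chat_sample.txt"]

with open("calibration.txt", "w", encoding="utf-8") as out:
    for src in sources:
        out.write(Path(src).read_text(encoding="utf-8").strip() + "\n\n")

size_mb = Path("calibration.txt").stat().st_size / 1e6
print(f"calibration.txt: {size_mb:.1f} MB")  # more varied text = better imatrix
```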
### Multi-quant batch

```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

./llama-imatrix -m "$MODEL" -f wiki.txt -o "$IMATRIX" -ngl 35

for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
  OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
  ./llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUTPUT" "$QUANT"
  echo "Created: $OUTPUT ($(du -h "$OUTPUT" | cut -f1))"
done
```
### Quality testing (perplexity)

```bash
./llama-perplexity -m model.gguf -f wikitext-2-raw/wiki.test.raw -c 512
# Baseline FP16: ~5.96 | Q4_K_M: ~6.06 (+1.7%) | Q2_K: ~6.87 (+15.3%)
```
## Python bindings (llama-cpp-python)

### Basic generation

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,  # 0 for CPU only, 99 to offload everything
    n_threads=8,
)

output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"],
)
print(output["choices"][0]["text"])
```
### Chat completion + streaming

```python
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3",  # Or "chatml", "mistral", etc.
)

# Non-streaming
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])

# Streaming
for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)
```
### Embeddings

```python
llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)

vec = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(vec)}")
```
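A typical use for these vectors is semantic similarity. A minimal sketch with plain-Python cosine similarity, reusing the `llm` loaded above:

```python
# Rank candidate sentences by cosine similarity to a query embedding.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = llm.embed("How do I quantize a model?")
for text in ["GGUF supports 2-8 bit quantization.", "The weather is sunny."]:
    print(f"{cosine(query, llm.embed(text)):.3f}  {text}")
```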
## Hardware acceleration

### Apple Silicon (Metal)

```bash
make clean && make GGML_METAL=1
./llama-cli -m model.gguf -ngl 99 -p "Hello"   # offload all layers
```

```python
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload everything
    n_threads=1,      # Metal handles the parallelism
)
```
Performance: M3 Max ~40–60 tok/s on Llama 2-7B Q4_K_M.
### NVIDIA (CUDA)

```bash
make clean && make GGML_CUDA=1
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Hybrid for large models (20 layers on GPU, the rest on CPU)
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20

# Multi-GPU split
./llama-cli -m large-model.gguf --tensor-split 0.5,0.5 -ngl 60
```
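Choosing `-ngl` for hybrid offload is mostly arithmetic: per-layer weight size is roughly file size divided by layer count, and you offload however many layers fit in VRAM after reserving room for the KV cache and CUDA overhead. A rough sketch; the 2 GB reserve is an assumption, so start high and back off on OOM:

```python
# Rough -ngl heuristic. The reserve for KV cache + CUDA overhead is a
# guess (larger -c needs more); treat the result as a starting point.
def suggest_ngl(file_size_gb: float, n_layers: int, vram_gb: float,
                reserve_gb: float = 2.0) -> int:
    per_layer_gb = file_size_gb / n_layers
    fit = int((vram_gb - reserve_gb) / per_layer_gb)
    return max(0, min(fit, n_layers))

# e.g. a ~40 GB 70B Q4_K_M (80 layers) on a 24 GB RTX 4090
print(suggest_ngl(40.0, 80, 24.0))  # ~44; reduce if you hit OOM
```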
### AMD (ROCm)

```bash
make GGML_HIPBLAS=1
./llama-cli -m model.gguf -ngl 999
```
### CPU

```bash
# Match PHYSICAL cores, not logical
./llama-cli -m model.gguf -t 8 -p "Hello"

# BLAS acceleration (2–3× speedup for prompt processing)
make GGML_OPENBLAS=1
```

```python
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,
    n_threads=8,
    n_batch=512,  # Larger batch = faster prompt processing
)
```
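To set the thread count programmatically, `psutil` (a third-party package, assumed here) reports physical cores; `os.cpu_count()` only gives logical ones:

```python
# Match n_threads to physical cores. Requires: pip install psutil
import psutil
from llama_cpp import Llama

physical_cores = psutil.cpu_count(logical=False) or 4  # fallback if unknown

llm = Llama(model_path="model.gguf", n_gpu_layers=0, n_threads=physical_cores)
```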
## Performance benchmarks

### CPU and Apple Silicon (Llama 2-7B Q4_K_M)
| CPU | Threads | Speed |
|---|---|---|
| Apple M3 Max (Metal) | 16 | 50 tok/s |
| AMD Ryzen 9 7950X | 32 | 35 tok/s |
| Intel i9-13900K | 32 | 30 tok/s |
### GPU offloading on RTX 4090
| Layers GPU | Speed | VRAM |
|---|---|---|
| 0 (CPU only) | 30 tok/s | 0 GB |
| 20 (hybrid) | 80 tok/s | 8 GB |
| 35 (all) | 120 tok/s | 12 GB |
## Supported models

- LLaMA family: Llama 2 (7B/13B/70B), Llama 3/3.1 (8B/70B/405B), Code Llama
- Mistral family: Mistral 7B, Mixtral 8x7B/8x22B
- Other: Falcon, BLOOM, GPT-J, Phi-3, Gemma, Qwen, LLaVA (vision), Whisper (audio)
Find GGUF models: https://huggingface.co/models?library=gguf
## Ecosystem integrations

### Ollama

```bash
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}

{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

ollama create mymodel -f Modelfile
ollama run mymodel "Hello!"
```
### LM Studio

- Place the GGUF file in `~/.cache/lm-studio/models/`
- Open LM Studio and select the model
- Configure context length and GPU offload, then start inference
### text-generation-webui

```bash
cp model-q4_k_m.gguf text-generation-webui/models/
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```
### OpenAI client → llama-server

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```
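Streaming works the same way as against the hosted API; a minimal sketch with the same client:

```python
# Stream tokens from llama-server: chunks carry incremental deltas.
stream = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Write a haiku about GGUF."}],
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```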
## Best practices

- Use K-quants — Q4_K_M is the recommended default
- Use an imatrix for Q4 and below (calibration improves quality substantially)
- Offload as many layers as VRAM allows — start high, reduce by 5 on OOM
- Thread count — match physical cores, not logical
- Batch size — increase `n_batch` (e.g. 512) for faster prompt processing
- Context — start at 4096 and grow only as needed (memory scales with context)
- Flash Attention — add `--flash-attn` if your build supports it (see the sketch after this list)
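Most of these knobs map directly onto llama-cpp-python constructor arguments. A sketch pulling them together; `flash_attn` only helps if the wheel was built with Flash Attention support, which is an assumption here:

```python
# Sketch: the best-practice settings above as llama-cpp-python arguments.
from llama_cpp import Llama

llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_gpu_layers=35,   # offload as much as VRAM allows; reduce on OOM
    n_threads=8,       # physical core count
    n_batch=512,       # larger batches speed up prompt processing
    n_ctx=4096,        # KV-cache memory scales with this
    flash_attn=True,   # assumption: requires a Flash Attention build
)
```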
## Common issues (quick fixes)

**Model loads slowly** — use `--mmap` for memory-mapped loading.

**Out of memory (GPU)** — reduce `-ngl`, use a smaller quant (Q4_K_S / Q3_K_M), or quantize the KV cache:

```python
Llama(model_path="...", type_k=2, type_v=2, n_gpu_layers=35)  # Q4_0 KV cache
```

**Garbage output** — wrong `chat_format`, temperature too high, or a corrupted model file. Test with `temperature=0.1` and verify the FP16 baseline works.

**Connection refused (server)** — bind to `--host 0.0.0.0` and check the port with `lsof -i :8080`.

See `references/troubleshooting.md` for the full playbook.
## References
- advanced-usage.md — speculative decoding, batched inference, grammar-constrained generation, LoRA, multi-GPU, custom builds, benchmark scripts
- quantization.md — perplexity tables, use-case guide, model size scaling (7B/13B/70B RAM needs), imatrix deep dive
- server.md — OpenAI API endpoints, Docker deployment, NGINX load balancing, monitoring
- optimization.md — CPU threading, BLAS, GPU offload heuristics, batch tuning, benchmarks
- troubleshooting.md — install/convert/quantize/inference/server issues, Apple Silicon, debugging
## Resources
- GitHub: https://github.com/ggml-org/llama.cpp
- Python bindings: https://github.com/abetlen/llama-cpp-python
- Pre-quantized models: https://huggingface.co/TheBloke
- GGUF converter Space: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- License: MIT