Claude-code-minoan llama-cpp
Run GGUF models directly, load LoRA adapters, benchmark inference speed, and serve models via llama-server using llama.cpp. Includes Qwen 3.5 serve scripts (9B dense + F16, 35B MoE) with asymmetric KV cache and thinking mode. Secondary to Ollama; use when you need direct model control or LoRA hot-loading. Triggers on 'llama.cpp', 'GGUF', 'LoRA adapter', 'benchmark inference', 'llama-server'.
git clone https://github.com/tdimino/claude-code-minoan
T=$(mktemp -d) && git clone --depth=1 https://github.com/tdimino/claude-code-minoan "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/integration-automation/llama-cpp" ~/.claude/skills/tdimino-claude-code-minoan-llama-cpp && rm -rf "$T"
skills/integration-automation/llama-cpp/SKILL.md

llama.cpp - Secondary Inference Engine
Direct access to llama.cpp for faster inference, LoRA adapter loading, and benchmarking on Apple Silicon. Ollama remains primary for RLAMA and general use; llama.cpp is the power tool.
Prerequisites
brew install llama.cpp
Binaries:
llama-cli, llama-server, llama-embedding, llama-quantize
Quick Reference
Resolve Ollama Model to GGUF Path
To avoid duplicating model files, resolve an Ollama model name to its GGUF blob path:
~/.claude/skills/llama-cpp/scripts/ollama_model_path.sh qwen2.5:7b
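The script itself isn't reproduced here; a minimal sketch of the resolution it performs, assuming Ollama's standard on-disk layout (JSON manifests plus content-addressed blobs, default registry path), might look like this:

```bash
#!/usr/bin/env bash
# Sketch of an Ollama name -> GGUF blob resolver (assumes the default registry path).
set -euo pipefail
model="${1:?usage: $0 <name:tag>}"      # e.g. qwen2.5:7b
name="${model%%:*}"; tag="${model##*:}"
manifest=~/.ollama/models/manifests/registry.ollama.ai/library/"$name"/"$tag"

# The weights layer is tagged "application/vnd.ollama.image.model";
# its sha256 digest names the blob file on disk.
digest=$(jq -r '.layers[] | select(.mediaType == "application/vnd.ollama.image.model") | .digest' "$manifest")
echo ~/.ollama/models/blobs/"${digest/:/-}"   # "sha256:<hash>" -> "sha256-<hash>"
```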
Run Inference
```bash
GGUF=$(~/.claude/skills/llama-cpp/scripts/ollama_model_path.sh qwen2.5:7b)
llama-cli -m "$GGUF" -p "Your prompt here" -n 128 --n-gpu-layers all --single-turn --simple-io --no-display-prompt
```
Start API Server
To start an OpenAI-compatible server (port 8081, avoids Ollama's 11434):
```bash
~/.claude/skills/llama-cpp/scripts/llama_serve.sh <model.gguf>

# Or with options:
PORT=8082 CTX=8192 ~/.claude/skills/llama-cpp/scripts/llama_serve.sh <model.gguf>
```
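The serve script isn't shown here, but the invocation it wraps is presumably along these lines (flag names are standard llama-server options; PORT and CTX are the script's env knobs):

```bash
# A sketch, not the script's actual text; the CTX default here is a guess.
# PORT defaults to 8081, clear of Ollama's 11434.
MODEL_GGUF="$1"
llama-server -m "$MODEL_GGUF" \
  --port "${PORT:-8081}" \
  --ctx-size "${CTX:-4096}" \
  --n-gpu-layers all
```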
Test the server:
```bash
curl http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}'
```
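For scripting, pipe the response through jq to extract just the completion text:

```bash
curl -s http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hello"}]}' \
  | jq -r '.choices[0].message.content'
```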
Serve Qwen3.5
Dedicated servers for Qwen3.5 models with asymmetric KV cache, jinja templates, and thinking mode.
9B Dense (recommended for 24-36GB systems):
```bash
# Default: Qwen3.5-9B, thinking mode, 32K context
~/.claude/skills/llama-cpp/scripts/llama_serve_qwen35_9b.sh

# Full precision F16 (~17.9 GB, zero quantization loss)
~/.claude/skills/llama-cpp/scripts/llama_serve_qwen35_9b.sh ~/models/Qwen3.5-9B-BF16.gguf

# Non-thinking mode, shorter context
THINK=0 CTX=8192 ~/.claude/skills/llama-cpp/scripts/llama_serve_qwen35_9b.sh
```
35B MoE (for 64+ GB systems):
~/.claude/skills/llama-cpp/scripts/llama_serve_qwen35.sh # defaults to qwen3.5:35b-a3b
9B Q4 uses ~6.6 GB (ample headroom); F16 uses ~17.9 GB (fits with 32K context on 36GB). Asymmetric KV cache (q8_0 keys + q4_0 values) saves ~60% KV memory vs FP16 cache.
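In llama-server terms, an asymmetric cache maps onto the standard `--cache-type-k`/`--cache-type-v` flags (quantized V caches require flash attention). A sketch of the relevant flags, plus the arithmetic behind the ~60% figure, assuming q8_0 at roughly 8.5 and q4_0 at roughly 4.5 bits per element:

```bash
# 8-bit keys + 4-bit values vs 16+16 bits for an FP16 cache:
# (8.5 + 4.5) / (16 + 16) ≈ 0.41, i.e. roughly 59% less KV memory.
llama-server -m "$GGUF" \
  --cache-type-k q8_0 \
  --cache-type-v q4_0 \
  --flash-attn on
```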
F16 (Full Precision) Mode
For maximum quality (zero quantization loss), download and serve the BF16 GGUF:
```bash
# Download once (~17.9 GB)
huggingface-cli download unsloth/Qwen3.5-9B-GGUF "Qwen3.5-9B-BF16.gguf" --local-dir ~/models

# Serve F16
~/.claude/skills/llama-cpp/scripts/llama_serve_qwen35_9b.sh ~/models/Qwen3.5-9B-BF16.gguf
```
F16 vs Q4 on M4 Max 36GB:
| | Q4_K_M (default) | BF16 (F16) |
|---|---|---|
| Size | 6.6 GB | 17.9 GB |
| Speed | ~38 tok/s | ~8-12 tok/s |
| Quality | ~99.5% | 100% (reference) |
| Max context | 262K | ~32K comfortable |
Benchmark (llama.cpp vs Ollama)
~/.claude/skills/llama-cpp/scripts/llama_bench.sh qwen2.5:7b
Reports prompt processing and generation tok/s for both engines side by side.
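For a llama.cpp-only number without the side-by-side comparison, the llama-bench binary (also installed by the brew formula) measures both phases directly:

```bash
# pp = prompt processing, tg = token generation
GGUF=$(~/.claude/skills/llama-cpp/scripts/ollama_model_path.sh qwen2.5:7b)
llama-bench -m "$GGUF" -p 512 -n 128
```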
LoRA Adapter Inference
Load a LoRA adapter dynamically on top of a base GGUF model (no merge required):
~/.claude/skills/llama-cpp/scripts/llama_lora.sh <base.gguf> <lora.gguf> "Your prompt"
This is the key advantage over Ollama: hot-swap LoRA adapters without rebuilding models.
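The script's plumbing isn't shown here, but the mechanism underneath is llama.cpp's standard `--lora` flag, which llama-server also accepts; the server additionally exposes a `/lora-adapters` endpoint for rescaling a loaded adapter at runtime. A sketch:

```bash
# CLI: apply the adapter at load time (equivalent in spirit to llama_lora.sh)
llama-cli -m base.gguf --lora adapter.gguf -p "Your prompt" -n 128 \
  --single-turn --simple-io --no-display-prompt

# Server: attach the adapter, then change its scale without restarting
llama-server -m base.gguf --lora adapter.gguf --port 8081 &
curl -s -X POST http://localhost:8081/lora-adapters \
  -H "Content-Type: application/json" \
  -d '[{"id": 0, "scale": 0.5}]'
```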
Convert Kothar LoRA to GGUF
Convert HuggingFace LoRA adapters from the Kothar training pipeline into a merged GGUF model:
```bash
python3 ~/.claude/skills/llama-cpp/scripts/convert_lora_to_gguf.py \
  --base NousResearch/Hermes-2-Mistral-7B-DPO \
  --lora <path-or-hf-id> \
  --output kothar-q4_k_m.gguf \
  --quantize q4_k_m
```
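The script isn't reproduced here; done by hand, the usual pipeline is merge, convert, quantize. A sketch assuming a PEFT-format adapter and the convert_hf_to_gguf.py script that ships with the llama.cpp sources (paths hypothetical):

```bash
# 1. Merge the LoRA into the HF base model with PEFT
python3 -c '
from peft import AutoPeftModelForCausalLM
m = AutoPeftModelForCausalLM.from_pretrained("path/to/lora")   # hypothetical path
m.merge_and_unload().save_pretrained("merged-hf")
'

# 2. Convert merged HF weights to an F16 GGUF
python3 convert_hf_to_gguf.py merged-hf --outfile kothar-f16.gguf --outtype f16

# 3. Quantize with the brew-installed binary
llama-quantize kothar-f16.gguf kothar-q4_k_m.gguf q4_k_m
```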
When to Use llama.cpp vs Ollama
| Task | Use |
|---|---|
| RLAMA queries | Ollama (native integration) |
| Quick model chat | Ollama |
| LoRA adapter testing | llama.cpp |
| Benchmarking tok/s | llama.cpp |
| Maximum inference speed | llama.cpp (10-20% faster) |
| Custom server config | llama.cpp |
| Embedding generation | Either (Ollama simpler, llama-embedding more control) |
| Kothar GGUF conversion | llama.cpp |
Architecture
```
Ollama (primary, port 11434)         llama.cpp (secondary, port 8081)
├── RLAMA RAG queries                ├── LoRA adapter hot-loading
├── Model management (pull/list)     ├── Benchmarking
├── General chat                     ├── Custom server configs
└── Embeddings (nomic-embed-text)    └── Kothar GGUF conversion

Both share the same GGUF model files (~/.ollama/models/blobs/)
```
Subprocess Best Practices (Build 8180+)
When calling llama-cli from scripts or subprocesses:
- Always use `--single-turn` — generates one response then exits (prevents interactive chat mode hang)
- Always use `--simple-io` — suppresses the ANSI spinner that floods redirected output
- Always use `--no-display-prompt` — suppresses prompt echo
- Use `--n-gpu-layers all` instead of legacy `-ngl 999`
- Use `--flash-attn on` (not bare `--flash-attn`) — it now takes an argument
- Timing stats appear in stdout as `[ Prompt: X t/s | Generation: Y t/s ]` (via `--show-timings`, default: on)
- Redirect stderr to a file, not a variable — spinner output can overflow bash variables