skilllibrary · llama-cpp
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/11-ai-llm-runtime-and-integration/llama-cpp" ~/.claude/skills/merceralex397-collab-skilllibrary-llama-cpp && rm -rf "$T"
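If the copy succeeded, the skill manifest should be visible under the target directory. A quick check (a sketch, assuming the default install path used above):
ls ~/.claude/skills/merceralex397-collab-skilllibrary-llama-cpp/SKILL.md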
manifest:
11-ai-llm-runtime-and-integration/llama-cpp/SKILL.md
Purpose
Compile, quantize, and run models with llama.cpp for efficient local inference on CPU, GPU, or hybrid setups.
When to use this skill
- building llama.cpp from source with CUDA/Metal/OpenBLAS support
- converting HuggingFace models to GGUF format
- configuring GPU layer offloading (n_gpu_layers) for hybrid CPU/GPU
- tuning context size, batch size, and thread count for performance
Do not use this skill when
- deploying production GPU inference — prefer inference-serving (vLLM)
- choosing which model to use — prefer model-selection
- building agent memory — prefer agent-memory
Procedure
- Clone and build — git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp.
- Compile with backend — cmake -B build -DGGML_CUDA=ON (NVIDIA), -DGGML_METAL=ON (Apple), -DGGML_BLAS=ON (CPU).
- Convert model — python convert_hf_to_gguf.py <model-dir> --outfile model.gguf --outtype f16.
- Quantize — ./llama-quantize model.gguf model-Q4_K_M.gguf Q4_K_M. Choose quant level by quality/size tradeoff.
- Run inference — ./llama-cli -m model-Q4_K_M.gguf -p "prompt" -n 256 -ngl 35 --ctx-size 4096.
- Start server — ./llama-server -m model.gguf --port 8080 -ngl 35 --ctx-size 8192 for an OpenAI-compatible API.
- Tune performance — adjust -t (threads), -b (batch size), -ngl (GPU layers), --ctx-size.
- Benchmark — ./llama-bench -m model.gguf -ngl 35 to measure tokens/sec. An end-to-end sketch of these steps follows the list.
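A minimal end-to-end sketch of the procedure above, assuming a CUDA-capable GPU, a local HuggingFace model checkout at ../my-hf-model (the name is illustrative), and the binary layout produced by current CMake builds (build/bin/); adjust paths if your build places binaries elsewhere:
# 1. Clone and build with CUDA (swap -DGGML_CUDA=ON for -DGGML_METAL=ON on Apple Silicon)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# 2. Convert a local HuggingFace checkout to GGUF at f16
#    (needs the Python dependencies from llama.cpp's requirements.txt)
python convert_hf_to_gguf.py ../my-hf-model --outfile model-f16.gguf --outtype f16
# 3. Quantize to the default Q4_K_M level
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
# 4. Smoke-test with partial GPU offload
./build/bin/llama-cli -m model-Q4_K_M.gguf -p "Explain GGUF in one sentence." -n 128 -ngl 35 --ctx-size 4096
# 5. Serve an OpenAI-compatible API and query it once the model has loaded
./build/bin/llama-server -m model-Q4_K_M.gguf --port 8080 -ngl 35 --ctx-size 8192 &
sleep 10
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'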
Quantization levels
| Quant | Bits | Size (7B) | Quality | Use case |
|---|---|---|---|---|
| Q2_K | 2-3 | ~2.7 GB | Low | Extreme compression |
| Q4_K_M | 4 | ~4.1 GB | Good | Best balance |
| Q5_K_M | 5 | ~4.8 GB | Very good | Quality priority |
| Q6_K | 6 | ~5.5 GB | Near-fp16 | Max local quality |
| Q8_0 | 8 | ~7.2 GB | Excellent | If VRAM allows |
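The table values are approximate. To see the tradeoff on your own model and hardware, one approach (a sketch, assuming an f16 GGUF named model-f16.gguf and the llama-quantize / llama-bench binaries built as in the procedure above) is to produce several levels and compare size and speed:
# Produce several quantization levels from the same f16 GGUF and list their sizes
for q in Q2_K Q4_K_M Q5_K_M Q6_K Q8_0; do
  ./build/bin/llama-quantize model-f16.gguf "model-$q.gguf" "$q"
done
ls -lh model-Q*.gguf
# Benchmark each level to see the speed side of the tradeoff
for q in Q2_K Q4_K_M Q5_K_M Q6_K Q8_0; do
  ./build/bin/llama-bench -m "model-$q.gguf" -ngl 35
done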
GPU offloading
# Full GPU offload (all layers on GPU)
./llama-cli -m model.gguf -ngl 99 --ctx-size 4096
# Partial offload (first 20 layers on GPU, rest on CPU)
./llama-cli -m model.gguf -ngl 20 --ctx-size 4096
# CPU only
./llama-cli -m model.gguf -ngl 0 -t 8 --ctx-size 2048
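When it is not obvious how many layers fit in VRAM, a simple sweep works (a sketch; the layer counts are illustrative, and llama-bench is assumed to exit non-zero when it cannot allocate the requested layers):
# Sweep -ngl upward and note where throughput peaks or allocation fails
for ngl in 0 10 20 30 99; do
  echo "=== -ngl $ngl ==="
  ./llama-bench -m model-Q4_K_M.gguf -ngl "$ngl" || echo "failed at -ngl $ngl (likely out of VRAM)"
done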
Decision rules
- Use Q4_K_M as the default quantization — best quality/size tradeoff.
- Set -ngl to as many layers as GPU VRAM allows — even partial offload helps significantly.
- Larger context sizes consume more RAM — --ctx-size 4096 needs ~2 GB extra for a 7B model.
- Use -t equal to the physical core count (not hyperthreads) for CPU inference; a sketch for finding that count follows this list.
- Always benchmark after changing parameters — small changes can have large throughput effects.
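As a concrete way to apply the thread-count rule, on Linux the physical core count can be derived from lscpu (a sketch; lscpu availability and the llama-cli path are assumptions):
# Count physical cores as unique (core, socket) pairs, ignoring hyperthread siblings
PHYS_CORES=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
# CPU-only inference with one thread per physical core
./llama-cli -m model-Q4_K_M.gguf -ngl 0 -t "$PHYS_CORES" -p "prompt" -n 256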
References
Related skills
inference-serving — production GPU serving with vLLM
offline-cpu-inference — CPU-only optimization strategies
model-selection — choosing which model to quantize