skilllibrary · llama-cpp
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/11-ai-llm-runtime-and-integration/llama-cpp" ~/.claude/skills/merceralex397-collab-skilllibrary-llama-cpp && rm -rf "$T"
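If the copy succeeded, the skill manifest should be visible under the target directory. A quick check (a sketch, assuming the default install path used above):
ls ~/.claude/skills/merceralex397-collab-skilllibrary-llama-cpp/SKILL.md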
manifest:
11-ai-llm-runtime-and-integration/llama-cpp/SKILL.md
Purpose
Compile, quantize, and run models with llama.cpp for efficient local inference on CPU, GPU, or hybrid setups.
When to use this skill
- building llama.cpp from source with CUDA/Metal/OpenBLAS support
- converting HuggingFace models to GGUF format
- configuring GPU layer offloading (n_gpu_layers) for hybrid CPU/GPU
- tuning context size, batch size, and thread count for performance
Do not use this skill when
- deploying production GPU inference — prefer inference-serving (vLLM)
- choosing which model to use — prefer model-selection
- building agent memory — prefer agent-memory
Procedure
- Clone and build — git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp.
- Compile with backend — cmake -B build -DGGML_CUDA=ON (NVIDIA), -DGGML_METAL=ON (Apple), -DGGML_BLAS=ON (CPU).
- Convert model — python convert_hf_to_gguf.py <model-dir> --outfile model.gguf --outtype f16.
- Quantize — ./llama-quantize model.gguf model-Q4_K_M.gguf Q4_K_M. Choose quant level by quality/size tradeoff.
- Run inference — ./llama-cli -m model-Q4_K_M.gguf -p "prompt" -n 256 -ngl 35 --ctx-size 4096.
- Start server — ./llama-server -m model.gguf --port 8080 -ngl 35 --ctx-size 8192 for an OpenAI-compatible API.
- Tune performance — adjust -t (threads), -b (batch size), -ngl (GPU layers), --ctx-size.
- Benchmark — ./llama-bench -m model.gguf -ngl 35 to measure tokens/sec. An end-to-end sketch of these steps follows the list.
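A minimal end-to-end sketch of the procedure above, assuming a CUDA-capable GPU, a local HuggingFace model checkout at ../my-hf-model (the name is illustrative), and the binary layout produced by current CMake builds (build/bin/); adjust paths if your build places binaries elsewhere:
# 1. Clone and build with CUDA (swap -DGGML_CUDA=ON for -DGGML_METAL=ON on Apple Silicon)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# 2. Convert a local HuggingFace checkout to GGUF at f16
#    (needs the Python dependencies from llama.cpp's requirements.txt)
python convert_hf_to_gguf.py ../my-hf-model --outfile model-f16.gguf --outtype f16
# 3. Quantize to the default Q4_K_M level
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
# 4. Smoke-test with partial GPU offload
./build/bin/llama-cli -m model-Q4_K_M.gguf -p "Explain GGUF in one sentence." -n 128 -ngl 35 --ctx-size 4096
# 5. Serve an OpenAI-compatible API and query it once the model has loaded
./build/bin/llama-server -m model-Q4_K_M.gguf --port 8080 -ngl 35 --ctx-size 8192 &
sleep 10
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'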
Quantization levels
| Quant | Bits | Size (7B) | Quality | Use case |
|---|---|---|---|---|
| Q2_K | 2-3 | ~2.7 GB | Low | Extreme compression |
| Q4_K_M | 4 | ~4.1 GB | Good | Best balance |
| Q5_K_M | 5 | ~4.8 GB | Very good | Quality priority |
| Q6_K | 6 | ~5.5 GB | Near-fp16 | Max local quality |
| Q8_0 | 8 | ~7.2 GB | Excellent | If VRAM allows |
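The table values are approximate. To see the tradeoff on your own model and hardware, one approach (a sketch, assuming an f16 GGUF named model-f16.gguf and the llama-quantize / llama-bench binaries built as in the procedure above) is to produce several levels and compare size and speed:
# Produce several quantization levels from the same f16 GGUF and list their sizes
for q in Q2_K Q4_K_M Q5_K_M Q6_K Q8_0; do
  ./build/bin/llama-quantize model-f16.gguf "model-$q.gguf" "$q"
done
ls -lh model-Q*.gguf
# Benchmark each level to see the speed side of the tradeoff
for q in Q2_K Q4_K_M Q5_K_M Q6_K Q8_0; do
  ./build/bin/llama-bench -m "model-$q.gguf" -ngl 35
done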
GPU offloading
# Full GPU offload (all layers on GPU)
./llama-cli -m model.gguf -ngl 99 --ctx-size 4096
# Partial offload (first 20 layers on GPU, rest on CPU)
./llama-cli -m model.gguf -ngl 20 --ctx-size 4096
# CPU only
./llama-cli -m model.gguf -ngl 0 -t 8 --ctx-size 2048
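When it is not obvious how many layers fit in VRAM, a simple sweep works (a sketch; the layer counts are illustrative, and llama-bench is assumed to exit non-zero when it cannot allocate the requested layers):
# Sweep -ngl upward and note where throughput peaks or allocation fails
for ngl in 0 10 20 30 99; do
  echo "=== -ngl $ngl ==="
  ./llama-bench -m model-Q4_K_M.gguf -ngl "$ngl" || echo "failed at -ngl $ngl (likely out of VRAM)"
done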
Decision rules
- Use Q4_K_M as the default quantization — best quality/size tradeoff.
- Set -ngl to as many layers as GPU VRAM allows — even partial offload helps significantly.
- Larger context sizes consume more RAM — --ctx-size 4096 needs ~2 GB extra for a 7B model.
- Use -t equal to the physical core count (not hyperthreads) for CPU inference; a sketch for finding that count follows this list.
- Always benchmark after changing parameters — small changes can have large throughput effects.
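As a concrete way to apply the thread-count rule, on Linux the physical core count can be derived from lscpu (a sketch; lscpu availability and the llama-cli path are assumptions):
# Count physical cores as unique (core, socket) pairs, ignoring hyperthread siblings
PHYS_CORES=$(lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l)
# CPU-only inference with one thread per physical core
./llama-cli -m model-Q4_K_M.gguf -ngl 0 -t "$PHYS_CORES" -p "prompt" -n 256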
References
Related skills
inference-serving — production GPU serving with vLLM
offline-cpu-inference — CPU-only optimization strategies
model-selection — choosing which model to quantize