skilllibrary · llama-cpp

Install

source · Clone the upstream repo:
git clone https://github.com/merceralex397-collab/skilllibrary

Claude Code · Install into ~/.claude/skills/:
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/11-ai-llm-runtime-and-integration/llama-cpp" ~/.claude/skills/merceralex397-collab-skilllibrary-llama-cpp && rm -rf "$T"

Manifest: 11-ai-llm-runtime-and-integration/llama-cpp/SKILL.md

Source content

Purpose

Compile, quantize, and run models with llama.cpp for efficient local inference on CPU, GPU, or hybrid setups.

When to use this skill

  • building llama.cpp from source with CUDA/Metal/OpenBLAS support
  • converting HuggingFace models to GGUF format
  • configuring GPU layer offloading (n_gpu_layers) for hybrid CPU/GPU
  • tuning context size, batch size, and thread count for performance

Do not use this skill when

  • deploying production GPU inference: prefer inference-serving (vLLM)
  • choosing which model to use: prefer model-selection
  • building agent memory: prefer agent-memory

Procedure

  1. Clone and build: git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
  2. Compile with a backend: cmake -B build -DGGML_CUDA=ON (NVIDIA), -DGGML_METAL=ON (Apple), or -DGGML_BLAS=ON (CPU), then cmake --build build --config Release.
  3. Convert the model: python convert_hf_to_gguf.py <model-dir> --outfile model.gguf --outtype f16
  4. Quantize: ./llama-quantize model.gguf model-Q4_K_M.gguf Q4_K_M. Choose the quant level by quality/size tradeoff.
  5. Run inference: ./llama-cli -m model-Q4_K_M.gguf -p "prompt" -n 256 -ngl 35 --ctx-size 4096
  6. Start the server: ./llama-server -m model.gguf --port 8080 -ngl 35 --ctx-size 8192 for an OpenAI-compatible API.
  7. Tune performance: adjust -t (threads), -b (batch size), -ngl (GPU layers), and --ctx-size.
  8. Benchmark: ./llama-bench -m model.gguf -ngl 35 to measure tokens/sec. An end-to-end sketch of these steps follows below.
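
Putting the procedure together, the sketch below runs steps 1-6 end to end on a CUDA machine. The model directory, output file names, and -ngl 35 are placeholders to adjust for your hardware; binaries land in build/bin/ after the CMake build, and convert_hf_to_gguf.py needs the repo's Python requirements installed first.

# End-to-end sketch (CUDA build; paths and layer counts are placeholders)
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Convert a local HuggingFace checkout to f16 GGUF (pip install -r requirements.txt first)
python convert_hf_to_gguf.py /path/to/hf-model --outfile model-f16.gguf --outtype f16

# Quantize to Q4_K_M, then run a quick generation
./build/bin/llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M
./build/bin/llama-cli -m model-Q4_K_M.gguf -p "Explain the KV cache in one sentence." -n 128 -ngl 35 --ctx-size 4096

# Serve an OpenAI-compatible API and query it
./build/bin/llama-server -m model-Q4_K_M.gguf --port 8080 -ngl 35 --ctx-size 8192 &
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'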

Quantization levels

Quant     Bits   Size (7B)   Quality     Use case
Q2_K      2-3    ~2.7 GB     Low         Extreme compression
Q4_K_M    4      ~4.1 GB     Good        Best balance
Q5_K_M    5      ~4.8 GB     Very good   Quality priority
Q6_K      6      ~5.5 GB     Near-fp16   Max local quality
Q8_0      8      ~7.2 GB     Excellent   If VRAM allows
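
To pick a level empirically rather than from the table alone, one option is to quantize the same f16 GGUF at several levels and benchmark each. The sketch below assumes binaries from a completed build and a placeholder model-f16.gguf; drop -ngl for a CPU-only build.

# Compare quant levels on the same model (placeholder paths)
for q in Q4_K_M Q5_K_M Q6_K; do
  ./build/bin/llama-quantize model-f16.gguf "model-$q.gguf" "$q"
  ./build/bin/llama-bench -m "model-$q.gguf" -ngl 35
done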

GPU offloading

# Full GPU offload (all layers on GPU)
./llama-cli -m model.gguf -ngl 99 --ctx-size 4096

# Partial offload (first 20 layers on GPU, rest on CPU)
./llama-cli -m model.gguf -ngl 20 --ctx-size 4096

# CPU only
./llama-cli -m model.gguf -ngl 0 -t 8 --ctx-size 2048
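
To confirm how many layers actually landed on the GPU, inspect the load log: llama.cpp prints an "offloaded N/M layers to GPU" line at startup, though the exact wording varies between versions, so treat the grep pattern below as an assumption.

# Run a one-token generation and grep the load log for the offload count
./llama-cli -m model.gguf -ngl 99 -p "hi" -n 1 2>&1 | grep -i "offloaded"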

Decision rules

  • Use Q4_K_M as the default quantization; it has the best quality/size tradeoff.
  • Set -ngl to as many layers as GPU VRAM allows; even partial offload helps significantly.
  • Larger context sizes consume more RAM; --ctx-size 4096 needs roughly 2 GB extra for a 7B model.
  • Use -t equal to the physical core count (not hyperthreads) for CPU inference (see the sketch after this list).
  • Always benchmark after changing parameters; small changes can have large throughput effects.
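
As a sketch of the thread rule, on Linux the physical core count can be derived from lscpu (this assumes lscpu is available; other platforms need a different probe).

# Count physical cores (unique core/socket pairs), then pin llama-cli threads to that number
PHYS_CORES=$(lscpu -b -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
./llama-cli -m model-Q4_K_M.gguf -p "prompt" -n 256 -ngl 0 -t "$PHYS_CORES" --ctx-size 2048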

References

Related skills

  • inference-serving: production GPU serving with vLLM
  • offline-cpu-inference: CPU-only optimization strategies
  • model-selection: choosing which model to quantize