skilllibrary · offline-cpu-inference
install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/11-ai-llm-runtime-and-integration/offline-cpu-inference" ~/.claude/skills/merceralex397-collab-skilllibrary-offline-cpu-inference && rm -rf "$T"
manifest:
11-ai-llm-runtime-and-integration/offline-cpu-inference/SKILL.md
Purpose
Optimize LLM inference for CPU-only environments using quantization, memory mapping, thread tuning, and efficient model selection.
When to use this skill
- running LLM inference on machines without GPUs
- optimizing llama.cpp CPU performance (threads, batch size, memory mapping)
- choosing quantization levels for RAM-constrained systems
- deploying inference on edge devices or commodity servers
Do not use this skill when
- GPU hardware is available — prefer `inference-serving`
- choosing which model to use — prefer `model-selection`
- building vector search — prefer `embeddings-indexing`
Procedure
- Assess hardware — check available RAM (`free -h`), CPU cores (`nproc`), and instruction set support (AVX2, AVX-512). See the sketch after this list.
- Choose model size — RAM budget: model file + ~2GB overhead + context KV cache. 8GB RAM = 7B Q4, 16GB = 13B Q4 or 7B Q6, 32GB = 30B Q4.
- Select quantization — `Q4_K_M` for best quality/size balance, `Q3_K_S` for extreme compression, `Q5_K_M` if RAM allows.
- Enable memory mapping — llama.cpp uses `mmap` by default. Ensure sufficient virtual memory. Use `--mlock` to pin the model in RAM for consistent performance.
- Tune threads — set `-t` to the physical core count (not hyperthreads). On NUMA systems, use `numactl --cpunodebind=0`.
- Set batch size — larger batches improve throughput: `-b 512` for prompt processing. Reduce for interactive use: `-b 128`.
- Use prompt caching — enable `--prompt-cache` to avoid re-processing repeated system prompts.
- Benchmark — run `./llama-bench -m model.gguf -t <threads>` to measure prompt eval and token generation speed.
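For the first step, the hardware check can be scripted. A minimal sketch, assuming a Linux host with `procps` and `util-linux` installed and `/proc/cpuinfo` available:

```bash
# Assess hardware before picking a model size and quantization.
free -h                                            # total and available RAM
nproc                                              # logical CPUs (counts hyperthreads)
lscpu -p=CORE | grep -v '^#' | sort -u | wc -l     # physical core count, for -t
grep -m1 -oE 'avx512f|avx2' /proc/cpuinfo          # SIMD levels present, if any
```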
Hardware sizing guide
| RAM | Model size | Quantization (file size) | Context (tokens) | Speed (est.) |
|---|---|---|---|---|
| 8 GB | 7B | Q4_K_M (4.1GB) | 2048 | ~10 tok/s |
| 16 GB | 7B | Q6_K (5.5GB) | 8192 | ~15 tok/s |
| 16 GB | 13B | Q4_K_M (7.4GB) | 4096 | ~6 tok/s |
| 32 GB | 30B | Q4_K_M (17GB) | 4096 | ~3 tok/s |
| 64 GB | 70B | Q4_K_M (38GB) | 4096 | ~2 tok/s |
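To connect this table to the RAM budget from the Procedure, here is a minimal fit-check sketch. The model path is a placeholder, the ~2GB overhead figure comes from the budgeting step above, and the KV cache (which grows with `--ctx-size`) is deliberately not counted:

```bash
MODEL=./model-Q4_K_M.gguf                        # placeholder path to your GGUF
model_mb=$(du -m "$MODEL" | cut -f1)             # model file size in MB
avail_mb=$(free -m | awk '/^Mem:/ {print $7}')   # "available" column of free
need_mb=$((model_mb + 2048))                     # file + ~2GB overhead, no KV cache
if [ "$avail_mb" -ge "$need_mb" ]; then
  echo "fits: need ${need_mb}MB, have ${avail_mb}MB"
else
  echo "too big: drop to a smaller quant or model"
fi
```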
Performance tuning
```bash
# Optimal CPU inference (adjust -t to your physical core count)
#   -t 8      physical cores
#   -b 512    batch size for prompt processing
#   --mlock   pin model in RAM
./llama-cli -m model-Q4_K_M.gguf \
  -t 8 \
  -b 512 \
  --ctx-size 4096 \
  --mlock \
  -p "Your prompt here"

# Server mode with prompt caching
./llama-server -m model-Q4_K_M.gguf \
  -t 8 \
  --ctx-size 4096 \
  --port 8080 \
  --prompt-cache prompt-cache.bin
```
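For the NUMA case mentioned in the Procedure, a hedged sketch; it assumes `numactl` is installed and that node 0 has enough memory to hold the whole model:

```bash
# Pin both threads and memory allocations to NUMA node 0 so model weights
# are not fetched across the interconnect from a remote node.
numactl --cpunodebind=0 --membind=0 \
  ./llama-cli -m model-Q4_K_M.gguf -t 8 --ctx-size 4096 -p "Your prompt here"
```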
Decision rules
- Physical cores, not hyperthreads — HT adds ~10% throughput but increases latency variance.
- `Q4_K_M` is the sweet spot — measurably better than Q4_0/Q4_1 with negligible size increase.
- Memory-map large models — keeps RSS low and lets the OS manage paging, but performance is worse than when the model fits entirely in RAM.
- Use `--mlock` only if the entire model fits in RAM — partial mlock causes OOM kills.
- Prompt caching saves 50-90% of prompt eval time for repeated system prompts; see the sketch below.
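As an illustration of the last rule, a sketch of prompt-cache reuse with `llama-cli`; file names and prompt text are placeholders, and it assumes a build that supports `--prompt-cache` and `--prompt-cache-ro`:

```bash
# First run: evaluate the system prompt once and save its KV state to disk.
./llama-cli -m model-Q4_K_M.gguf -t 8 \
  --prompt-cache sys.bin \
  -p "You are a helpful assistant."

# Later runs: reload the cached state read-only; only the new suffix is evaluated.
# The new prompt must start with the exact cached prefix for the cache to apply.
./llama-cli -m model-Q4_K_M.gguf -t 8 \
  --prompt-cache sys.bin --prompt-cache-ro \
  -p "You are a helpful assistant. Summarize the release notes."
```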
References
Related skills
- `llama-cpp` — building and running llama.cpp
- `inference-serving` — GPU-based serving when hardware is available
- `model-selection` — choosing appropriately sized models