Skilllibrary offline-cpu-inference

install
source · Clone the upstream repo
git clone https://github.com/merceralex397-collab/skilllibrary
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/merceralex397-collab/skilllibrary "$T" && mkdir -p ~/.claude/skills && cp -r "$T/11-ai-llm-runtime-and-integration/offline-cpu-inference" ~/.claude/skills/merceralex397-collab-skilllibrary-offline-cpu-inference && rm -rf "$T"
manifest: 11-ai-llm-runtime-and-integration/offline-cpu-inference/SKILL.md
source content

Purpose

Optimize LLM inference for CPU-only environments using quantization, memory mapping, thread tuning, and efficient model selection.

When to use this skill

  • running LLM inference on machines without GPUs
  • optimizing llama.cpp CPU performance (threads, batch size, memory mapping)
  • choosing quantization levels for RAM-constrained systems
  • deploying inference on edge devices or commodity servers

Do not use this skill when

  • GPU hardware is available — prefer inference-serving
  • choosing which model to use — prefer model-selection
  • building vector search — prefer embeddings-indexing

Procedure

  1. Assess hardware — check available RAM (free -h), CPU cores (nproc), and instruction set support (AVX2, AVX-512); a quick check is sketched just after this list.
  2. Choose model size — RAM budget: model file + ~2GB overhead + context KV cache. 8GB RAM = 7B Q4, 16GB = 13B Q4 or 7B Q6, 32GB = 30B Q4.
  3. Select quantization — Q4_K_M for best quality/size balance. Q3_K_S for extreme compression. Q5_K_M if RAM allows.
  4. Enable memory mapping — llama.cpp uses mmap by default. Ensure sufficient virtual memory. Use --mlock to pin the model in RAM for consistent performance.
  5. Tune threads — set -t to the physical core count (not hyperthreads). On NUMA systems, use numactl --cpunodebind=0.
  6. Set batch size — larger batches improve throughput: -b 512 for prompt processing. Reduce for interactive use: -b 128.
  7. Use prompt caching — enable --prompt-cache to avoid re-processing repeated system prompts.
  8. Benchmark — run ./llama-bench -m model.gguf -t <threads> to measure prompt eval and token generation speed.
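
A minimal hardware-check sketch for step 1 (Linux tools assumed; the AVX grep is only a heuristic), plus the memlock limit that matters if you later use --mlock:

# Quick hardware check (Linux; a rough sketch, not exhaustive)
free -h                                            # available RAM
nproc                                              # logical CPUs
lscpu | grep -E 'Core\(s\) per socket|Socket\(s\)' # physical core count
grep -m1 -oEw 'avx2|avx512f' /proc/cpuinfo         # instruction set support (AVX2 / AVX-512)
ulimit -l                                          # max locked memory (kB); matters for --mlock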

Hardware sizing guide

RAM     Model size   Quantization       Context   Speed (est.)
8 GB    7B           Q4_K_M (4.1 GB)    2048      ~10 tok/s
16 GB   7B           Q6_K (5.5 GB)      8192      ~15 tok/s
16 GB   13B          Q4_K_M (7.4 GB)    4096      ~6 tok/s
32 GB   30B          Q4_K_M (17 GB)     4096      ~3 tok/s
64 GB   70B          Q4_K_M (38 GB)     4096      ~2 tok/s
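
The table's rule of thumb (model file + ~2 GB overhead + KV cache) can be checked on a given machine with a rough sketch; the model path is a placeholder, the flat 2 GB overhead is the assumption from step 2, and GNU coreutils on Linux is assumed:

# Rough fit check: model file size + ~2 GB overhead vs. available RAM
MODEL=model-Q4_K_M.gguf                      # placeholder; point at your model file
MODEL_MB=$(( $(stat -c %s "$MODEL") / 1024 / 1024 ))
AVAIL_MB=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
NEED_MB=$(( MODEL_MB + 2048 ))               # add more for a large --ctx-size (KV cache)
echo "need ~${NEED_MB} MB, available ${AVAIL_MB} MB"
[ "$AVAIL_MB" -gt "$NEED_MB" ] && echo "fits in RAM" || echo "consider a smaller quant"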

Performance tuning

# Optimal CPU inference (adjust -t to your physical core count)
#   -t      physical cores
#   -b      batch size for prompt processing
#   --mlock pin model in RAM
./llama-cli -m model-Q4_K_M.gguf \
  -t 8 \
  -b 512 \
  --ctx-size 4096 \
  --mlock \
  -p "Your prompt here"

# Server mode with prompt caching
./llama-server -m model-Q4_K_M.gguf \
  -t 8 \
  --ctx-size 4096 \
  --port 8080 \
  --prompt-cache prompt-cache.bin
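
Once the server is up, it can be queried over HTTP; this assumes the OpenAI-compatible /v1/chat/completions endpoint that recent llama-server builds expose, with a placeholder prompt:

# Query the running server (endpoint and prompt are illustrative)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Summarize mmap in one sentence."}],
        "max_tokens": 128
      }'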

Decision rules

  • Physical cores, not hyperthreads — HT adds ~10% throughput but increases latency variance. A thread sweep (sketched after this list) confirms the best count on a given machine.
  • Q4_K_M is the sweet spot — measurably better than Q4_0/Q4_1 with negligible size increase.
  • Memory-map large models — keeps RSS low and lets the OS manage paging, but performance is worse than holding the whole model in RAM.
  • Use --mlock only if the entire model fits in RAM — partial mlock causes OOM kills.
  • Prompt caching saves 50-90% of prompt eval time for repeated system prompts.
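
A quick way to apply the physical-cores rule on a specific machine is to sweep thread counts with llama-bench; the candidate counts and model file below are illustrative:

# Sweep thread counts to find the throughput sweet spot
for t in 4 6 8 12; do
  echo "== threads: $t =="
  ./llama-bench -m model-Q4_K_M.gguf -t "$t"
done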

References

Related skills

  • llama-cpp — building and running llama.cpp
  • inference-serving — GPU-based serving when hardware is available
  • model-selection — choosing appropriately sized models