llamacpp-bench
Run llama.cpp benchmarks on GGUF models to measure prompt processing (pp) and token generation (tg) performance. Use when the user wants to benchmark LLMs, compare model performance, test inference speed, or run llama-bench on GGUF files. Supports Vulkan, CUDA, ROCm, and CPU backends.
Install:

```
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/openclaw/skills "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/alexhegit/llamacpp-bench" ~/.claude/skills/clawdbot-skills-llamacpp-bench \
  && rm -rf "$T"
```
Source: skills/alexhegit/llamacpp-bench/SKILL.md
Run standardized benchmarks on GGUF models using llama.cpp's llama-bench tool.
Quick Start
```
# Basic benchmark
llama-bench -m model.gguf -p 512,1024,2048 -n 128,256 -ngl 99

# With specific backend
LLAMA_BACKEND=vulkan llama-bench -m model.gguf -p 512,1024,2048 -n 128,256 -ngl 99
```
Benchmark Parameters
| Parameter | Description | Default |
|---|---|---|
| `-m` | Model path (GGUF file) | required |
| `-p` | Prompt sizes to test | 512 |
| `-n` | Generation lengths to test | 128 |
| `-ngl` | GPU layers to offload | 99 |
| `-t` | CPU threads | auto |
| `-dev` (recent builds) | Device selection | auto |
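Putting these together, a typical run might look like the following; the model path and thread count are placeholders for your setup:

```
# Hypothetical 8B quantized model; adjust the path and -t to your machine
llama-bench -m ~/models/llama-3-8b-q4_k_m.gguf \
    -p 512,1024,2048 \
    -n 128,256 \
    -ngl 99 \
    -t 8
```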
Standard Test Suite
For consistent comparisons across models, use:
```
-p 512,1024,2048 -n 128,256 -ngl 99
```
This tests:
- Prompt processing: 512, 1024, 2048 tokens
- Token generation: 128, 256 tokens
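llama-bench averages several runs per test (the `-r` flag controls repetitions; the default is 5). On a noisy system you can raise it for more stable numbers:

```
# Standard suite with 10 repetitions per test for more stable averages
llama-bench -m model.gguf -p 512,1024,2048 -n 128,256 -ngl 99 -r 10
```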
Interpreting Results
| Metric | Meaning | Good Performance |
|---|---|---|
| pp512 | Prompt processing speed at 512 tokens | >1000 t/s |
| pp1024 | Prompt processing speed at 1024 tokens | >1000 t/s |
| pp2048 | Prompt processing speed at 2048 tokens | >1000 t/s |
| tg128 | Token generation speed (128 tokens) | >50 t/s |
| tg256 | Token generation speed (256 tokens) | >50 t/s |
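These thresholds are rough guides for a mid-range discrete GPU; absolute numbers vary with model size, quantization, and backend. For scripted comparisons, llama-bench can also emit machine-readable output via `-o` (recent builds accept formats such as csv, json, and md):

```
# Emit CSV instead of the default markdown table
llama-bench -m model.gguf -p 512,1024,2048 -n 128,256 -ngl 99 -o csv > results.csv
```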
Backend Selection
llama-bench auto-detects available backends. Priority order:
1. CUDA (NVIDIA GPUs)
2. ROCm (AMD GPUs)
3. Vulkan (cross-platform GPU)
4. CPU (fallback)
To force a backend, set the corresponding environment variable (as in the Quick Start example), or inspect which backends your build includes:

```
# Check available backends
llama-bench --help | grep -i "backend\|cuda\|rocm\|vulkan"
```
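If you are unsure which backend your machine can use at all, probing the vendor tooling is a quick first check. nvidia-smi, rocminfo, and vulkaninfo are system tools, not part of llama.cpp:

```
# Rough backend availability check via vendor/Khronos tools
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "NVIDIA GPU detected: prefer the CUDA backend"
elif command -v rocminfo >/dev/null 2>&1; then
    echo "AMD GPU detected: prefer the ROCm backend"
elif command -v vulkaninfo >/dev/null 2>&1; then
    echo "Vulkan runtime detected: prefer the Vulkan backend"
else
    echo "No GPU tooling found: fall back to CPU"
fi
```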
Batch Benchmarking
Use the provided script for benchmarking multiple models:
```
./scripts/benchmark_models.sh /path/to/models/*.gguf
```
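For reference, a wrapper in that spirit can be quite small. This is a sketch of the idea, not the actual benchmark_models.sh that ships with the skill:

```
#!/usr/bin/env bash
set -euo pipefail

# Timestamped output directory, one result file per model
outdir="results_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$outdir"

for model in "$@"; do
    name=$(basename "$model" .gguf)
    echo "=== Benchmarking $name ==="
    llama-bench -m "$model" -p 512,1024,2048 -n 128,256 -ngl 99 \
        | tee "$outdir/$name.txt"
done
```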
Saving Results
Output can be redirected to a file:
```
llama-bench -m model.gguf -p 512,1024,2048 -n 128,256 -ngl 99 > results.txt
```
Or use the benchmark script which auto-saves to timestamped files.
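Without the script, a timestamped filename is easy to generate inline; this is a plain shell idiom, not a llama-bench feature:

```
# Print to the console and save a timestamped copy
llama-bench -m model.gguf -p 512,1024,2048 -n 128,256 -ngl 99 \
    | tee "results_$(date +%Y%m%d_%H%M%S).txt"
```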
Common Issues
- Out of memory: Reduce `-ngl` (GPU layers) or test smaller prompt sizes; the sketch after this list shows one way to step `-ngl` down automatically
- Slow CPU performance: Ensure `-t` matches the CPU core count
- Backend not found: Check that llama.cpp was built with the desired backend
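For the out-of-memory case, a small loop can step `-ngl` down until a run fits. A minimal sketch, assuming llama-bench exits nonzero when it fails to allocate:

```
# Hypothetical helper: find the highest offload level that fits in VRAM
for ngl in 99 48 32 24 16 8 0; do
    if llama-bench -m model.gguf -p 512 -n 128 -ngl "$ngl"; then
        echo "Succeeded with -ngl $ngl"
        break
    fi
done
```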
Building / Updating llama.cpp
Check Current Version
```
./scripts/build_llamacpp.sh -v
```
Shows:
- Current Git commit and branch
- Build date
- Whether behind upstream
- Available backends
Build or Update
```
# Interactive mode (prompts for backend selection)
./scripts/build_llamacpp.sh -u

# Specify backend directly
./scripts/build_llamacpp.sh -u -b vulkan   # Vulkan (AMD/Intel GPUs)
./scripts/build_llamacpp.sh -u -b cuda     # CUDA (NVIDIA GPUs)
./scripts/build_llamacpp.sh -u -b rocm     # ROCm (AMD GPUs)
./scripts/build_llamacpp.sh -u -b cpu      # CPU only

# Clean rebuild
./scripts/build_llamacpp.sh -c -b vulkan

# Custom build directory
./scripts/build_llamacpp.sh -u -b cuda -d /custom/path
```
Build Options
| Flag | Description |
|---|---|
| `-v` | Show version info and exit |
| `-u` | Update to latest from GitHub |
| `-c` | Clean build (remove existing) |
| `-b` | Backend: vulkan, cuda, rocm, cpu |
| `-d` | Build directory path |
| `-j` | Parallel jobs (default: CPU count) |
Finding llama-bench
The benchmark script auto-detects llama-bench in these locations:
- `/DATA/Benchmark/llama.cpp/build/bin/llama-bench`
- `~/Repo/llama.cpp/build/bin/llama-bench`
- `~/lab/build/bin/llama-bench`
If none of these exist, the script searches your home directory; alternatively, build llama-bench with the build script above.
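To locate an existing binary manually, a filesystem search works too:

```
# Search the home directory for any built llama-bench binary
find ~ -type f -name llama-bench 2>/dev/null
```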