Claude-skill-registry llamacpp

Complete llama.cpp C/C++ API reference covering model loading, inference, text generation, embeddings, chat, tokenization, sampling, batching, KV cache, LoRA adapters, and state management. Triggers on: llama.cpp questions, LLM inference code, GGUF models, local AI/ML inference, C/C++ LLM integration, "how do I use llama.cpp", API function lookups, implementation questions, troubleshooting llama.cpp issues, and any llama-cpp or ggerganov/llama.cpp mentions.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llamacpp" ~/.claude/skills/majiayu000-claude-skill-registry-llamacpp && rm -rf "$T"
manifest: skills/data/llamacpp/SKILL.md
source content

llama.cpp C API Guide

Comprehensive reference for the llama.cpp C API, documenting all non-deprecated functions and common usage patterns.

Overview

llama.cpp is a C/C++ implementation for LLM inference with minimal dependencies and state-of-the-art performance. This skill provides:

  • Complete API Reference: All non-deprecated functions organized by category
  • Common Workflows: Working examples for typical use cases
  • Best Practices: Patterns for efficient and correct API usage

Quick Start

See references/workflows.md for complete working examples. Basic workflow:

  1. llama_backend_init() - Initialize backend
  2. llama_model_load_from_file() - Load model
  3. llama_init_from_model() - Create context
  4. llama_tokenize() - Convert text to tokens
  5. llama_decode() - Process tokens
  6. llama_sampler_sample() - Sample next token
  7. Cleanup in reverse order
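As a minimal sketch, the steps above translate to the C program below. The model path, buffer sizes, and the 16-token cap are placeholders, error handling is abbreviated, and a greedy sampler stands in for a real sampling chain:

```c
#include <stdio.h>
#include <string.h>
#include "llama.h"

int main(void) {
    llama_backend_init();                                                 // 1. initialize backend

    struct llama_model_params mparams = llama_model_default_params();
    struct llama_model * model = llama_model_load_from_file("model.gguf", mparams); // 2. load model
    if (!model) { fprintf(stderr, "load failed\n"); return 1; }

    struct llama_context_params cparams = llama_context_default_params();
    struct llama_context * ctx = llama_init_from_model(model, cparams);   // 3. create context

    const struct llama_vocab * vocab = llama_model_get_vocab(model);

    const char * prompt = "Hello";                                        // 4. tokenize
    llama_token tokens[64];
    int32_t n = llama_tokenize(vocab, prompt, (int32_t) strlen(prompt),
                               tokens, 64, /*add_special=*/true, /*parse_special=*/false);

    struct llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    struct llama_batch batch = llama_batch_get_one(tokens, n);
    for (int i = 0; i < 16; i++) {                                        // 5.-6. decode + sample, 7. repeat
        if (llama_decode(ctx, batch) != 0) break;
        llama_token tok = llama_sampler_sample(smpl, ctx, -1);
        if (llama_vocab_is_eog(vocab, tok)) break;                        // end of generation
        char buf[128];
        int32_t len = llama_token_to_piece(vocab, tok, buf, sizeof(buf), 0, false);
        if (len > 0) fwrite(buf, 1, (size_t) len, stdout);
        batch = llama_batch_get_one(&tok, 1);                             // feed token back in
    }

    llama_sampler_free(smpl);                                             // cleanup in reverse order
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

Build against the llama.cpp headers and library; the cleanup order mirrors creation order in reverse, which is the pattern the Best Practices section calls out.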

When to Use This Skill

Use this skill when:

  1. API Lookup: You need to find a specific function (e.g., "How do I load a model?", "What function creates a context?")
  2. Code Generation: You're writing C code that uses llama.cpp
  3. Workflow Guidance: You need to understand the steps for a task (e.g., text generation, embeddings, chat)
  4. Advanced Features: You're working with batches, sequences, LoRA adapters, state management, or custom sampling
  5. Migration: You're updating code from deprecated functions to current API

Core Concepts

Key Objects

  • llama_model: Loaded model weights and architecture
  • llama_context: Inference state (KV cache, compute buffers)
  • llama_batch: Input tokens and positions for processing
  • llama_sampler: Token sampling configuration
  • llama_vocab: Vocabulary and tokenizer
  • llama_memory_t: KV cache memory handle

Typical Flow

  1. Initialize: llama_backend_init()
  2. Load Model: llama_model_load_from_file()
  3. Create Context: llama_init_from_model()
  4. Tokenize: llama_tokenize()
  5. Process: llama_encode() or llama_decode()
  6. Sample: llama_sampler_sample()
  7. Generate: Repeat steps 5-6
  8. Cleanup: Free in reverse order

API Reference

For detailed API documentation, the complete API is split across 6 files for efficient targeted loading. Start with references/api-core.md which links to all other sections.

API Files:

  • api-core.md (220 lines) - Initialization, parameters, model loading
  • api-model-info.md (193 lines) - Model properties, architecture detection [NEW]
  • api-context.md (412 lines) - Context, memory (KV cache), state management
  • api-inference.md (417 lines) - Batch operations, inference, tokenization, chat
  • api-sampling.md (467 lines) - All 25+ sampling strategies + backend sampling API [NEW]
  • api-advanced.md (359 lines) - LoRA adapters, performance, training

Total: 172 active, non-deprecated functions (b7631) across 6 organized files

Quick Function Lookup

Most common:

llama_backend_init(), llama_model_load_from_file(), llama_init_from_model(), llama_tokenize(), llama_decode(), llama_sampler_sample(), llama_vocab_is_eog(), llama_memory_clear()
See the six references/api-*.md files (starting with references/api-core.md) for all 172 function signatures and detailed usage.

Common Workflows

See references/workflows.md for complete working examples: basic text generation, chat, embeddings, batch processing, multi-sequence decoding, LoRA, state save/load, custom sampling (XTC/DRY), encoder-decoder models, model detection, and memory management patterns.

Best Practices

See references/workflows.md for detailed best practices. Key points:

  • Always use the default-parameter functions (llama_model_default_params(), etc.)
  • Check return values for errors
  • Free resources in reverse order of creation
  • Handle dynamic buffer sizes for tokenization
  • Query the actual context size after creation (llama_n_ctx())
  • Check for end-of-generation with llama_vocab_is_eog()
Common Patterns

End-of-generation check (llama_vocab_is_eog()), logits retrieval (llama_get_logits_ith()), batch creation (llama_batch_get_one()), and tokenization buffer handling. See references/workflows.md for complete code examples.

Troubleshooting

Common Issues

Model loading fails:

  • Verify the file path and GGUF format validity
  • Check available RAM/VRAM against the model size
  • Reduce n_gpu_layers if GPU memory is insufficient

Tokenization returns negative value:

  • Buffer too small: llama_tokenize() returns the negative of the required token count; reallocate to -n entries and retry
  • See the tokenization pattern in Common Patterns
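The reallocate-and-retry pattern can be sketched as a small helper (the name `tokenize_prompt` and the initial capacity of 32 are illustrative choices, not part of the API):

```c
#include <stdlib.h>
#include <string.h>
#include "llama.h"

// Tokenize with a dynamically sized buffer: a negative return from
// llama_tokenize() is the required token count, negated.
static llama_token * tokenize_prompt(const struct llama_vocab * vocab,
                                     const char * text, int32_t * n_out) {
    int32_t cap = 32;                                  // initial guess
    llama_token * toks = malloc((size_t) cap * sizeof(llama_token));
    int32_t n = llama_tokenize(vocab, text, (int32_t) strlen(text),
                               toks, cap, true, false);
    if (n < 0) {                                       // buffer too small: -n is the real size
        cap = -n;
        toks = realloc(toks, (size_t) cap * sizeof(llama_token));
        n = llama_tokenize(vocab, text, (int32_t) strlen(text),
                           toks, cap, true, false);
    }
    *n_out = n;
    return toks;                                       // caller frees
}
```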

Decode/encode returns non-zero:

  • Verify batch initialization (llama_batch_get_one() or llama_batch_init())
  • Check context capacity (llama_n_ctx())
  • Ensure token positions are within the context window

Silent failures / no output:

  • Check whether llama_vocab_is_eog() returns true immediately
  • Verify sampler initialization
  • Enable logging with llama_log_set()
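Enabling logging is a one-call setup. A sketch of a callback that keeps only warnings and errors (the function name `my_log` is illustrative; the callback signature comes from ggml):

```c
#include <stdio.h>
#include "llama.h"

// Route llama.cpp log output to stderr, dropping debug/info noise.
static void my_log(enum ggml_log_level level, const char * text, void * user_data) {
    (void) user_data;
    if (level >= GGML_LOG_LEVEL_WARN) {
        fputs(text, stderr);   // messages may span several callback invocations
    }
}

// Install before llama_backend_init() so early messages are captured:
//     llama_log_set(my_log, NULL);
```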

Performance issues:

  • Increase n_threads for CPU inference
  • Set n_gpu_layers for GPU offloading
  • Use a larger n_batch for prompt processing
  • See Performance & Utilities

Sliding Window Attention (SWA) issues:

  • For Mistral-style models with SWA, set ctx_params.swa_full = true to access positions beyond the attention window
  • Use llama_model_n_swa(model) to detect the SWA window size and configuration needs
  • Symptom: token positions beyond the window size cause decode errors
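A context-creation sketch that applies this check (assuming a loaded model; swa_full is a llama_context_params field in recent llama.cpp builds):

```c
#include "llama.h"

// Create a context that can address positions beyond the SWA window.
static struct llama_context * ctx_for_model(struct llama_model * model) {
    struct llama_context_params cparams = llama_context_default_params();
    if (llama_model_n_swa(model) > 0) { // > 0 means the model uses sliding-window attention
        cparams.swa_full = true;        // keep the full KV cache, not just the window
    }
    return llama_init_from_model(model, cparams);
}
```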

Per-sequence state errors:

  • Ensure the sequence ID matches when loading: llama_state_seq_load_file(ctx, "file", dest_seq_id, ...)
  • Verify the token buffer is large enough for the loaded tokens
  • Check that the sequence wasn't cleared or removed before loading state
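A restore sketch showing the token-buffer and failure checks (the filename "seq0.bin", sequence id 0, and the 4096-token capacity are placeholders):

```c
#include "llama.h"

// Restore one sequence's state from disk; returns the number of tokens
// restored, or 0 on failure.
static size_t restore_seq0(struct llama_context * ctx) {
    llama_token toks[4096];            // must be large enough for the saved token count
    size_t n_loaded = 0;
    size_t rc = llama_state_seq_load_file(ctx, "seq0.bin", /*dest_seq_id=*/0,
                                          toks, sizeof(toks)/sizeof(toks[0]), &n_loaded);
    // rc == 0 signals failure: wrong file, undersized token buffer, or model mismatch
    return rc == 0 ? 0 : n_loaded;
}
```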

Model type detection:

  • Use llama_model_has_encoder() before assuming a decoder-only architecture
  • For recurrent models (Mamba/RWKV), KV cache behavior differs from standard transformers
  • Encoder-decoder models require the llama_encode() then llama_decode() workflow

For advanced issues: https://github.com/ggerganov/llama.cpp/discussions

Resources

  • API Reference (6 files, 2,086 lines total) - Complete API reference split by category for targeted loading
  • references/workflows.md (1,616 lines) - 15 complete working examples: basic workflows (text generation, chat, embeddings, batching, sequences), intermediate (LoRA, state, sampling, encoder-decoder, memory), advanced features (XTC/DRY, per-sequence state, model detection), and production applications (interactive chat, streaming).

Key Differences from Deprecated API

If you're updating old code:

  • Use llama_model_load_from_file() instead of llama_load_model_from_file()
  • Use llama_model_free() instead of llama_free_model()
  • Use llama_init_from_model() instead of llama_new_context_with_model()
  • Use llama_vocab_*() functions instead of llama_token_*()
  • Use llama_state_*() functions instead of deprecated state functions

See the API reference for complete mappings.