Claude-skill-registry llamacpp
Complete llama.cpp C/C++ API reference covering model loading, inference, text generation, embeddings, chat, tokenization, sampling, batching, KV cache, LoRA adapters, and state management. Triggers on: llama.cpp questions, LLM inference code, GGUF models, local AI/ML inference, C/C++ LLM integration, "how do I use llama.cpp", API function lookups, implementation questions, troubleshooting llama.cpp issues, and any llama-cpp or ggerganov/llama.cpp mentions.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llamacpp" ~/.claude/skills/majiayu000-claude-skill-registry-llamacpp && rm -rf "$T"
llama.cpp C API Guide
Comprehensive reference for the llama.cpp C API, documenting all non-deprecated functions and common usage patterns.
Overview
llama.cpp is a C/C++ implementation for LLM inference with minimal dependencies and state-of-the-art performance. This skill provides:
- Complete API Reference: All non-deprecated functions organized by category
- Common Workflows: Working examples for typical use cases
- Best Practices: Patterns for efficient and correct API usage
Quick Start
See references/workflows.md for complete working examples. Basic workflow:
- Initialize backend: llama_backend_init()
- Load model: llama_model_load_from_file()
- Create context: llama_init_from_model()
- Convert text to tokens: llama_tokenize()
- Process tokens: llama_decode()
- Sample next token: llama_sampler_sample()
- Clean up in reverse order
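A minimal sketch of the setup and teardown steps, assuming a recent llama.cpp build; the model path is a placeholder:

```c
#include "llama.h"
#include <stdio.h>

int main(void) {
    llama_backend_init();

    // "model.gguf" is a placeholder; point this at your GGUF file
    struct llama_model_params mparams = llama_model_default_params();
    struct llama_model * model = llama_model_load_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 2048; // requested size; query llama_n_ctx(ctx) for the actual value
    struct llama_context * ctx = llama_init_from_model(model, cparams);
    if (ctx == NULL) {
        fprintf(stderr, "failed to create context\n");
        llama_model_free(model);
        return 1;
    }

    // ... tokenize / decode / sample here ...

    // Cleanup in reverse order of creation
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```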
When to Use This Skill
Use this skill when:
- API Lookup: You need to find a specific function (e.g., "How do I load a model?", "What function creates a context?")
- Code Generation: You're writing C code that uses llama.cpp
- Workflow Guidance: You need to understand the steps for a task (e.g., text generation, embeddings, chat)
- Advanced Features: You're working with batches, sequences, LoRA adapters, state management, or custom sampling
- Migration: You're updating code from deprecated functions to current API
Core Concepts
Key Objects
- llama_model: Loaded model weights and architecture
- llama_context: Inference state (KV cache, compute buffers)
- llama_batch: Input tokens and positions for processing
- llama_sampler: Token sampling configuration
- llama_vocab: Vocabulary and tokenizer
- llama_memory_t: KV cache memory handle
Typical Flow
1. Initialize: llama_backend_init()
2. Load model: llama_model_load_from_file()
3. Create context: llama_init_from_model()
4. Tokenize: llama_tokenize()
5. Process: llama_encode() or llama_decode()
6. Sample: llama_sampler_sample()
7. Generate: repeat steps 5-6
8. Cleanup: free in reverse order
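The flow above can be sketched as a greedy generation loop. This is a non-authoritative sketch against the current C API; the model path, prompt, and 32-token limit are placeholders:

```c
#include "llama.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    const char * prompt = "Hello"; // placeholder prompt

    // Steps 1-3: backend, model, context
    llama_backend_init();
    struct llama_model * model = llama_model_load_from_file("model.gguf", llama_model_default_params());
    if (!model) return 1;
    struct llama_context * ctx = llama_init_from_model(model, llama_context_default_params());
    if (!ctx) { llama_model_free(model); return 1; }
    const struct llama_vocab * vocab = llama_model_get_vocab(model);

    // Step 4: probe for the token count (negative return is the size, negated), then tokenize
    int32_t n = -llama_tokenize(vocab, prompt, (int32_t) strlen(prompt), NULL, 0, true, true);
    llama_token * tokens = malloc(n * sizeof(llama_token));
    llama_tokenize(vocab, prompt, (int32_t) strlen(prompt), tokens, n, true, true);

    // Step 6 setup: a greedy sampler chain for deterministic output
    struct llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    // Steps 5-7: decode the prompt, then feed each sampled token back in
    llama_token tok;
    struct llama_batch batch = llama_batch_get_one(tokens, n);
    for (int i = 0; i < 32; i++) {
        if (llama_decode(ctx, batch) != 0) break;
        tok = llama_sampler_sample(smpl, ctx, -1); // -1 = logits of the last token
        if (llama_vocab_is_eog(vocab, tok)) break;

        char piece[128];
        int32_t len = llama_token_to_piece(vocab, tok, piece, sizeof(piece), 0, true);
        if (len > 0) fwrite(piece, 1, (size_t) len, stdout);

        batch = llama_batch_get_one(&tok, 1);
    }

    // Step 8: cleanup in reverse order
    llama_sampler_free(smpl);
    free(tokens);
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```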
API Reference
For detailed API documentation, the complete API is split across 6 files for efficient targeted loading. Start with references/api-core.md which links to all other sections.
API Files:
- api-core.md (220 lines) - Initialization, parameters, model loading
- api-model-info.md (193 lines) - Model properties, architecture detection [NEW]
- api-context.md (412 lines) - Context, memory (KV cache), state management
- api-inference.md (417 lines) - Batch operations, inference, tokenization, chat
- api-sampling.md (467 lines) - All 25+ sampling strategies + backend sampling API [NEW]
- api-advanced.md (359 lines) - LoRA adapters, performance, training
Total: 172 active, non-deprecated functions (b7631) across 6 organized files
Quick Function Lookup
Most common:
llama_backend_init(), llama_model_load_from_file(), llama_init_from_model(), llama_tokenize(), llama_decode(), llama_sampler_sample(), llama_vocab_is_eog(), llama_memory_clear()
See references/api-core.md (and the sections it links) for all 172 function signatures and detailed usage.
Common Workflows
See references/workflows.md for complete working examples: basic text generation, chat, embeddings, batch processing, multi-sequence, LoRA, state save/load, custom sampling (XTC/DRY), encoder-decoder models, model detection, and memory management patterns.
Best Practices
See references/workflows.md for detailed best practices. Key points:
- Always use default parameter functions (llama_model_default_params(), etc.)
- Check return values for errors
- Free resources in reverse order of creation
- Handle dynamic buffer sizes for tokenization
- Query actual context size after creation (llama_n_ctx())
- Check for end-of-generation with llama_vocab_is_eog()
Common Patterns
End-of-generation check (llama_vocab_is_eog()), logits retrieval (llama_get_logits_ith()), batch creation (llama_batch_get_one()), and tokenization buffer handling. See references/workflows.md for complete code examples.
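The tokenization buffer pattern can be sketched as a two-pass helper: probe with a null buffer, then allocate and retry. The helper name is illustrative, not part of the API; `vocab` comes from llama_model_get_vocab():

```c
#include "llama.h"
#include <stdlib.h>
#include <string.h>

// Two-pass tokenization: a negative return from llama_tokenize()
// is the required token count, negated.
static llama_token * tokenize_text(const struct llama_vocab * vocab,
                                   const char * text, int32_t * n_out) {
    int32_t len = (int32_t) strlen(text);
    int32_t n = llama_tokenize(vocab, text, len, NULL, 0, true, true);
    if (n < 0) n = -n;                         // negated required size
    llama_token * tokens = malloc((size_t) n * sizeof(llama_token));
    if (tokens == NULL) return NULL;
    n = llama_tokenize(vocab, text, len, tokens, n, true, true);
    if (n < 0) { free(tokens); return NULL; }  // should not happen after resizing
    *n_out = n;
    return tokens;
}
```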
Troubleshooting
Common Issues
Model loading fails:
- Verify file path and GGUF format validity
- Check available RAM/VRAM for model size
- Reduce n_gpu_layers if GPU memory is insufficient
Tokenization returns negative value:
- Buffer too small; the negative return is the required size negated, so reallocate with -n entries and retry
- See tokenization pattern in Common Patterns
Decode/encode returns non-zero:
- Verify batch initialization (llama_batch_get_one() or llama_batch_init())
- Check context capacity (llama_n_ctx())
- Ensure positions are within the context window
Silent failures / no output:
- Check if llama_vocab_is_eog() immediately returns true
- Verify sampler initialization
- Enable logging: llama_log_set()
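A small logging sketch, assuming the ggml log-level enum shipped with llama.h; the callback name and the warning-level filter are illustrative choices:

```c
#include "llama.h"
#include <stdio.h>

// Route llama.cpp log output to stderr, dropping messages below warning level.
static void my_log(enum ggml_log_level level, const char * text, void * user_data) {
    (void) user_data;
    if (level >= GGML_LOG_LEVEL_WARN) {
        fputs(text, stderr);
    }
}

// Install once, before other llama.cpp calls:
//   llama_log_set(my_log, NULL);
```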
Performance issues:
- Increase n_threads for CPU
- Set n_gpu_layers for GPU offloading
- Use a larger n_batch for prompts
- See Performance & Utilities
Sliding Window Attention (SWA) issues:
- If using Mistral-style models with SWA, set ctx_params.swa_full = true to access beyond the attention window
- Check llama_model_n_swa(model) to detect the SWA size and configuration needs
- Symptoms: token positions beyond the window size cause decode errors
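A sketch of the SWA workaround, assuming the swa_full field and llama_model_n_swa() available in recent llama.h builds; the helper name is illustrative:

```c
#include "llama.h"

// Create a context that keeps the full KV cache even when the model uses
// sliding window attention, so older positions stay addressable.
// Assumes `model` was loaded with llama_model_load_from_file().
static struct llama_context * make_ctx_for_swa(struct llama_model * model) {
    struct llama_context_params cparams = llama_context_default_params();
    if (llama_model_n_swa(model) > 0) { // > 0 means the model uses SWA
        cparams.swa_full = true;        // keep the full cache despite the window
    }
    return llama_init_from_model(model, cparams);
}
```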
Per-sequence state errors:
- Ensure the sequence ID matches when loading: llama_state_seq_load_file(ctx, "file", dest_seq_id, ...)
- Verify the token buffer is large enough for the loaded tokens
- Check the sequence wasn't cleared or removed before loading state
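A round-trip sketch of per-sequence state, assuming both llama_state_seq_* file functions return 0 on failure; the function name and file name are illustrative:

```c
#include "llama.h"
#include <stdlib.h>

// Save sequence `seq` (its KV cache slice plus tokens), then restore it into
// `dest_seq`. `tokens`/`n_tokens` are the tokens already decoded into `seq`;
// the load-side buffer must be at least that large.
static int roundtrip_seq_state(struct llama_context * ctx,
                               llama_seq_id seq, llama_seq_id dest_seq,
                               llama_token * tokens, size_t n_tokens) {
    if (llama_state_seq_save_file(ctx, "seq_state.bin", seq, tokens, n_tokens) == 0) {
        return -1; // save failed
    }
    llama_token * buf = malloc(n_tokens * sizeof(llama_token));
    if (buf == NULL) return -1;
    size_t n_loaded = 0;
    if (llama_state_seq_load_file(ctx, "seq_state.bin", dest_seq,
                                  buf, n_tokens, &n_loaded) == 0) {
        free(buf);
        return -1; // load failed: wrong file, too-small buffer, missing sequence
    }
    free(buf);
    return (int) n_loaded;
}
```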
Model type detection:
- Use llama_model_has_encoder() before assuming a decoder-only architecture
- For recurrent models (Mamba/RWKV), KV cache behavior differs from standard transformers
- Encoder-decoder models require the llama_encode() then llama_decode() workflow
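The architecture check can be sketched as a dispatch helper; the function name is illustrative, and it assumes llama_model_decoder_start_token() returns the decoder start token for encoder-decoder models:

```c
#include "llama.h"

// Encoder-decoder models (e.g. T5) need llama_encode() on the prompt,
// then llama_decode() seeded with the decoder start token.
static int32_t process_prompt(struct llama_context * ctx,
                              struct llama_model * model,
                              llama_token * tokens, int32_t n_tokens) {
    struct llama_batch batch = llama_batch_get_one(tokens, n_tokens);
    if (llama_model_has_encoder(model)) {
        if (llama_encode(ctx, batch) != 0) return -1;
        llama_token dec_start = llama_model_decoder_start_token(model);
        struct llama_batch dec = llama_batch_get_one(&dec_start, 1);
        return llama_decode(ctx, dec);
    }
    return llama_decode(ctx, batch); // decoder-only path
}
```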
For advanced issues: https://github.com/ggerganov/llama.cpp/discussions
Resources
- API Reference (6 files, 2,086 lines total) - Complete API reference split by category for targeted loading:
- api-core.md - Initialization, parameters, model loading
- api-model-info.md - Model properties, architecture detection
- api-context.md - Context, memory, state management
- api-inference.md - Batch, inference, tokenization, chat
- api-sampling.md - All 25+ sampling strategies + backend sampling API
- api-advanced.md - LoRA, performance, training
- references/workflows.md (1,616 lines) - 15 complete working examples: basic workflows (text generation, chat, embeddings, batching, sequences), intermediate (LoRA, state, sampling, encoder-decoder, memory), advanced features (XTC/DRY, per-sequence state, model detection), and production applications (interactive chat, streaming).
Key Differences from Deprecated API
If you're updating old code:
- Use llama_model_load_from_file() instead of llama_load_model_from_file()
- Use llama_model_free() instead of llama_free_model()
- Use llama_init_from_model() instead of llama_new_context_with_model()
- Use llama_vocab_*() functions instead of llama_token_*()
- Use llama_state_*() functions instead of deprecated state functions
See the API reference for complete mappings.