Claude-skill-registry llamacpp

Complete llama.cpp C/C++ API reference covering model loading, inference, text generation, embeddings, chat, tokenization, sampling, batching, KV cache, LoRA adapters, and state management. Triggers on: llama.cpp questions, LLM inference code, GGUF models, local AI/ML inference, C/C++ LLM integration, "how do I use llama.cpp", API function lookups, implementation questions, troubleshooting llama.cpp issues, and any llama-cpp or ggerganov/llama.cpp mentions.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/llamacpp" ~/.claude/skills/majiayu000-claude-skill-registry-llamacpp && rm -rf "$T"
manifest: skills/data/llamacpp/SKILL.md
source content

llama.cpp C API Guide

Comprehensive reference for the llama.cpp C API, documenting all non-deprecated functions and common usage patterns.

Overview

llama.cpp is a C/C++ implementation for LLM inference with minimal dependencies and state-of-the-art performance. This skill provides:

  • Complete API Reference: All non-deprecated functions organized by category
  • Common Workflows: Working examples for typical use cases
  • Best Practices: Patterns for efficient and correct API usage

Quick Start

See references/workflows.md for complete working examples. Basic workflow:

  1. llama_backend_init() - Initialize backend
  2. llama_model_load_from_file() - Load model
  3. llama_init_from_model() - Create context
  4. llama_tokenize() - Convert text to tokens
  5. llama_decode() - Process tokens
  6. llama_sampler_sample() - Sample next token
  7. Cleanup in reverse order
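As a minimal sketch, the steps above translate to the C program below. The model path, buffer sizes, and the 16-token cap are placeholders, error handling is abbreviated, and a greedy sampler stands in for a real sampling chain:

```c
#include <stdio.h>
#include <string.h>
#include "llama.h"

int main(void) {
    llama_backend_init();                                                 // 1. initialize backend

    struct llama_model_params mparams = llama_model_default_params();
    struct llama_model * model = llama_model_load_from_file("model.gguf", mparams); // 2. load model
    if (!model) { fprintf(stderr, "load failed\n"); return 1; }

    struct llama_context_params cparams = llama_context_default_params();
    struct llama_context * ctx = llama_init_from_model(model, cparams);   // 3. create context

    const struct llama_vocab * vocab = llama_model_get_vocab(model);

    const char * prompt = "Hello";                                        // 4. tokenize
    llama_token tokens[64];
    int32_t n = llama_tokenize(vocab, prompt, (int32_t) strlen(prompt),
                               tokens, 64, /*add_special=*/true, /*parse_special=*/false);

    struct llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    struct llama_batch batch = llama_batch_get_one(tokens, n);
    for (int i = 0; i < 16; i++) {                                        // 5.-6. decode + sample, 7. repeat
        if (llama_decode(ctx, batch) != 0) break;
        llama_token tok = llama_sampler_sample(smpl, ctx, -1);
        if (llama_vocab_is_eog(vocab, tok)) break;                        // end of generation
        char buf[128];
        int32_t len = llama_token_to_piece(vocab, tok, buf, sizeof(buf), 0, false);
        if (len > 0) fwrite(buf, 1, (size_t) len, stdout);
        batch = llama_batch_get_one(&tok, 1);                             // feed token back in
    }

    llama_sampler_free(smpl);                                             // cleanup in reverse order
    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

Build against the llama.cpp headers and library; the cleanup order mirrors creation order in reverse, which is the pattern the Best Practices section calls out.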

When to Use This Skill

Use this skill when:

  1. API Lookup: You need to find a specific function (e.g., "How do I load a model?", "What function creates a context?")
  2. Code Generation: You're writing C code that uses llama.cpp
  3. Workflow Guidance: You need to understand the steps for a task (e.g., text generation, embeddings, chat)
  4. Advanced Features: You're working with batches, sequences, LoRA adapters, state management, or custom sampling
  5. Migration: You're updating code from deprecated functions to current API

Core Concepts

Key Objects

  • llama_model: Loaded model weights and architecture
  • llama_context: Inference state (KV cache, compute buffers)
  • llama_batch: Input tokens and positions for processing
  • llama_sampler: Token sampling configuration
  • llama_vocab: Vocabulary and tokenizer
  • llama_memory_t: KV cache memory handle

Typical Flow

  1. Initialize: llama_backend_init()
  2. Load Model: llama_model_load_from_file()
  3. Create Context: llama_init_from_model()
  4. Tokenize: llama_tokenize()
  5. Process: llama_encode() or llama_decode()
  6. Sample: llama_sampler_sample()
  7. Generate: Repeat steps 5-6
  8. Cleanup: Free in reverse order

API Reference

For detailed API documentation, the complete API is split across 6 files for efficient targeted loading. Start with references/api-core.md which links to all other sections.

API Files:

  • api-core.md (220 lines) - Initialization, parameters, model loading
  • api-model-info.md (193 lines) - Model properties, architecture detection [NEW]
  • api-context.md (412 lines) - Context, memory (KV cache), state management
  • api-inference.md (417 lines) - Batch operations, inference, tokenization, chat
  • api-sampling.md (467 lines) - All 25+ sampling strategies + backend sampling API [NEW]
  • api-advanced.md (359 lines) - LoRA adapters, performance, training

Total: 172 active, non-deprecated functions (b7631) across 6 organized files

Quick Function Lookup

Most common:

llama_backend_init(), llama_model_load_from_file(), llama_init_from_model(), llama_tokenize(), llama_decode(), llama_sampler_sample(), llama_vocab_is_eog(), llama_memory_clear()
See the six references/api-*.md files (starting with references/api-core.md) for all 172 function signatures and detailed usage.

Common Workflows

See references/workflows.md for complete working examples: basic text generation, chat, embeddings, batch processing, multi-sequence decoding, LoRA, state save/load, custom sampling (XTC/DRY), encoder-decoder models, model detection, and memory management patterns.

Best Practices

See references/workflows.md for detailed best practices. Key points:

  • Always use the default-parameter functions (llama_model_default_params(), etc.)
  • Check return values for errors
  • Free resources in reverse order of creation
  • Handle dynamic buffer sizes for tokenization
  • Query the actual context size after creation (llama_n_ctx())
  • Check for end-of-generation with llama_vocab_is_eog()
Common Patterns

End-of-generation check (llama_vocab_is_eog()), logits retrieval (llama_get_logits_ith()), batch creation (llama_batch_get_one()), and tokenization buffer handling. See references/workflows.md for complete code examples.

Troubleshooting

Common Issues

Model loading fails:

  • Verify the file path and GGUF format validity
  • Check available RAM/VRAM against the model size
  • Reduce n_gpu_layers if GPU memory is insufficient

Tokenization returns negative value:

  • Buffer too small: llama_tokenize() returns the negative of the required token count; reallocate to -n entries and retry
  • See the tokenization pattern in Common Patterns
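The reallocate-and-retry pattern can be sketched as a small helper (the name `tokenize_prompt` and the initial capacity of 32 are illustrative choices, not part of the API):

```c
#include <stdlib.h>
#include <string.h>
#include "llama.h"

// Tokenize with a dynamically sized buffer: a negative return from
// llama_tokenize() is the required token count, negated.
static llama_token * tokenize_prompt(const struct llama_vocab * vocab,
                                     const char * text, int32_t * n_out) {
    int32_t cap = 32;                                  // initial guess
    llama_token * toks = malloc((size_t) cap * sizeof(llama_token));
    int32_t n = llama_tokenize(vocab, text, (int32_t) strlen(text),
                               toks, cap, true, false);
    if (n < 0) {                                       // buffer too small: -n is the real size
        cap = -n;
        toks = realloc(toks, (size_t) cap * sizeof(llama_token));
        n = llama_tokenize(vocab, text, (int32_t) strlen(text),
                           toks, cap, true, false);
    }
    *n_out = n;
    return toks;                                       // caller frees
}
```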

Decode/encode returns non-zero:

  • Verify batch initialization (llama_batch_get_one() or llama_batch_init())
  • Check context capacity (llama_n_ctx())
  • Ensure token positions are within the context window

Silent failures / no output:

  • Check whether llama_vocab_is_eog() returns true immediately
  • Verify sampler initialization
  • Enable logging with llama_log_set()
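Enabling logging is a one-call setup. A sketch of a callback that keeps only warnings and errors (the function name `my_log` is illustrative; the callback signature comes from ggml):

```c
#include <stdio.h>
#include "llama.h"

// Route llama.cpp log output to stderr, dropping debug/info noise.
static void my_log(enum ggml_log_level level, const char * text, void * user_data) {
    (void) user_data;
    if (level >= GGML_LOG_LEVEL_WARN) {
        fputs(text, stderr);   // messages may span several callback invocations
    }
}

// Install before llama_backend_init() so early messages are captured:
//     llama_log_set(my_log, NULL);
```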

Performance issues:

  • Increase n_threads for CPU inference
  • Set n_gpu_layers for GPU offloading
  • Use a larger n_batch for prompt processing
  • See Performance & Utilities

Sliding Window Attention (SWA) issues:

  • For Mistral-style models with SWA, set ctx_params.swa_full = true to access positions beyond the attention window
  • Use llama_model_n_swa(model) to detect the SWA window size and configuration needs
  • Symptom: token positions beyond the window size cause decode errors
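A context-creation sketch that applies this check (assuming a loaded model; swa_full is a llama_context_params field in recent llama.cpp builds):

```c
#include "llama.h"

// Create a context that can address positions beyond the SWA window.
static struct llama_context * ctx_for_model(struct llama_model * model) {
    struct llama_context_params cparams = llama_context_default_params();
    if (llama_model_n_swa(model) > 0) { // > 0 means the model uses sliding-window attention
        cparams.swa_full = true;        // keep the full KV cache, not just the window
    }
    return llama_init_from_model(model, cparams);
}
```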

Per-sequence state errors:

  • Ensure the sequence ID matches when loading: llama_state_seq_load_file(ctx, "file", dest_seq_id, ...)
  • Verify the token buffer is large enough for the loaded tokens
  • Check that the sequence wasn't cleared or removed before loading state
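A restore sketch showing the token-buffer and failure checks (the filename "seq0.bin", sequence id 0, and the 4096-token capacity are placeholders):

```c
#include "llama.h"

// Restore one sequence's state from disk; returns the number of tokens
// restored, or 0 on failure.
static size_t restore_seq0(struct llama_context * ctx) {
    llama_token toks[4096];            // must be large enough for the saved token count
    size_t n_loaded = 0;
    size_t rc = llama_state_seq_load_file(ctx, "seq0.bin", /*dest_seq_id=*/0,
                                          toks, sizeof(toks)/sizeof(toks[0]), &n_loaded);
    // rc == 0 signals failure: wrong file, undersized token buffer, or model mismatch
    return rc == 0 ? 0 : n_loaded;
}
```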

Model type detection:

  • Use llama_model_has_encoder() before assuming a decoder-only architecture
  • For recurrent models (Mamba/RWKV), KV cache behavior differs from standard transformers
  • Encoder-decoder models require the llama_encode() then llama_decode() workflow

For advanced issues: https://github.com/ggerganov/llama.cpp/discussions

Resources

  • API Reference (6 files, 2,086 lines total) - Complete API reference split by category for targeted loading
  • references/workflows.md (1,616 lines) - 15 complete working examples: basic workflows (text generation, chat, embeddings, batching, sequences), intermediate (LoRA, state, sampling, encoder-decoder, memory), advanced features (XTC/DRY, per-sequence state, model detection), and production applications (interactive chat, streaming).

Key Differences from Deprecated API

If you're updating old code:

  • Use llama_model_load_from_file() instead of llama_load_model_from_file()
  • Use llama_model_free() instead of llama_free_model()
  • Use llama_init_from_model() instead of llama_new_context_with_model()
  • Use llama_vocab_*() functions instead of llama_token_*()
  • Use llama_state_*() functions instead of deprecated state functions

See the API reference for complete mappings.