Kreuzberg chunking-embeddings

chunking embeddings

install
source · Clone the upstream repo
git clone https://github.com/kreuzberg-dev/kreuzberg
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/kreuzberg-dev/kreuzberg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.ai-rulez/skills/chunking-embeddings" ~/.claude/skills/kreuzberg-dev-kreuzberg-chunking-embeddings && rm -rf "$T"
manifest: .ai-rulez/skills/chunking-embeddings/SKILL.md
source content

Chunking & Embeddings

Text splitting strategies, embedding generation with FastEmbed, RAG pipeline integration

Chunking Architecture Overview

Location: `crates/kreuzberg/src/chunking/`, `crates/kreuzberg/src/embeddings.rs`

```
Extracted Text
    |
[1. Normalization] -> Clean whitespace, remove control chars
    |
[2. Chunk Strategy Selection] -> Fixed-size, semantic, syntax-aware, recursive
    |
[3. Overlap Management] -> Control context window overlap
    |
[4. Optional Embedding] -> Generate vectors with FastEmbed
    |
Output: Vec<Chunk> with text, vectors, metadata
```
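
Stage 1 is simple enough to sketch. A minimal normalization pass, assuming it collapses whitespace runs and strips control characters (the crate's actual pass may do more):

```rust
// Minimal sketch of stage 1, assuming normalization collapses whitespace runs
// and strips control characters; the crate's actual implementation may differ.

fn normalize(raw: &str) -> String {
    raw.chars()
        // Drop control characters, but keep \n and \t so they can be
        // collapsed as whitespace in the next step.
        .filter(|c| !c.is_control() || c.is_whitespace())
        .collect::<String>()
        .split_whitespace() // collapse runs of spaces, tabs, and newlines
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    assert_eq!(
        normalize("Extracted\u{0}  text\t\twith   debris"),
        "Extracted text with debris"
    );
    println!("normalized OK");
}
```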

Chunking Strategies

Location: `crates/kreuzberg/src/chunking/mod.rs`

| Strategy | Pattern | Best For |
|---|---|---|
| Fixed-Size | Sliding window with configurable overlap | Uniform chunks for embedding models with fixed token limits |
| Semantic | Split by sentences, then merge/split by similarity threshold | Smart context preservation for LLM consumption and semantic search |
| Syntax-Aware | Split by paragraph/section/heading/code-block structure | Preserving document structure (sections, code blocks) in RAG |
| Recursive (LangChain pattern) | Try separators in order: `\n\n`, `\n`, ... | Best general-purpose chunking; auto-finds optimal split points |

Key config fields per strategy (see struct definitions in `chunking/mod.rs`; a runnable sliding-window sketch follows this list):

  • Fixed-Size: `chunk_size`, `overlap`, `trim_whitespace`
  • Semantic: `target_chunk_size`, `min/max_chunk_size`, `semantic_threshold`, `use_sentence_boundaries`
  • Syntax-Aware: `chunk_by` (Paragraph/Section/Heading/Sentence/CodeBlock), `max_chunk_size`, `respect_code_blocks`
  • Recursive: `separators[]`, `chunk_size`, `overlap`
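
As a concrete example of the `chunk_size`/`overlap` semantics, here is a runnable sliding-window sketch of the Fixed-Size strategy, reusing the field names above; the real struct in `chunking/mod.rs` may differ:

```rust
// Runnable sketch of the Fixed-Size strategy's sliding window, reusing the
// field names listed above; the real struct in chunking/mod.rs may differ.

struct FixedSizeConfig {
    chunk_size: usize, // window length (characters here; tokens in practice)
    overlap: usize,    // how much consecutive windows share
    trim_whitespace: bool,
}

fn chunk_fixed(text: &str, cfg: &FixedSizeConfig) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    // Each window starts (chunk_size - overlap) characters after the last one.
    let step = cfg.chunk_size.saturating_sub(cfg.overlap).max(1);
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + cfg.chunk_size).min(chars.len());
        let mut piece: String = chars[start..end].iter().collect();
        if cfg.trim_whitespace {
            piece = piece.trim().to_string();
        }
        chunks.push(piece);
        if end == chars.len() {
            break;
        }
        start += step;
    }
    chunks
}

fn main() {
    let cfg = FixedSizeConfig { chunk_size: 16, overlap: 4, trim_whitespace: true };
    for chunk in chunk_fixed("The quick brown fox jumps over the lazy dog", &cfg) {
        println!("{chunk:?}");
    }
}
```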

Chunking Configuration Presets

Location: `crates/kreuzberg/src/chunking/mod.rs`

| Preset | Chunk Size | Overlap | Strategy | Use Case |
|---|---|---|---|---|
| Balanced | 512 tokens | 50 | Semantic | RAG sweet spot |
| Compact | 256 tokens | 32 | Fixed-Size | Dense vectors |
| Extended | 1024 tokens | 100 | Recursive | Full context |
| Minimal | 128 tokens | 16 | (default) | Lightweight embeddings |

Usage: set `config.chunking.preset = Some("balanced")` in `ExtractionConfig`.
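
A minimal sketch of preset selection. The struct shapes below are inferred from the usage line; only the `preset` assignment itself comes from the docs:

```rust
// Hypothetical shapes for ExtractionConfig/ChunkingConfig, inferred from the
// usage line above; the real definitions may differ.

#[derive(Default)]
struct ChunkingConfig {
    preset: Option<String>, // "balanced" | "compact" | "extended" | "minimal"
}

#[derive(Default)]
struct ExtractionConfig {
    chunking: ChunkingConfig,
}

fn main() {
    let mut config = ExtractionConfig::default();
    // Balanced preset: 512-token semantic chunks with overlap 50 (see table).
    config.chunking.preset = Some("balanced".to_string());
    println!("preset = {:?}", config.chunking.preset);
}
```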

Embedding Generation with FastEmbed

Location: `crates/kreuzberg/src/embeddings.rs`

Model Selection

| Model | Dimensions | Notes |
|---|---|---|
| `BAAI/bge-small-en-v1.5` (default) | 384 | Fast, excellent for RAG |
| `BAAI/bge-small-zh-v1.5` | 384 | Chinese-optimized |
| `BAAI/bge-base-en-v1.5` | 768 | Better quality, slower |
| `jinaai/jina-embeddings-v2-base-en` | 768 | Long context (up to 8192 tokens) |
| `Custom(path)` | varies | Custom ONNX model path |

Embedding Pattern

`TextEmbeddingManager` provides singleton-cached models per config. Pattern:

  1. `get_or_init_model()` lazy-loads the ONNX model (downloading it if needed) and caches it in an `Arc<RwLock<HashMap>>`
  2. `embed_chunks()` collects the chunk texts, calls `model.embed(texts, batch_size)`, and zips the results back into `ChunkWithEmbedding`

Default config: `batch_size=256`, `device=CPU`, `parallel_requests=4`.
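
A sketch of this lazy-load-and-cache pattern. The `Model` type is a stand-in for a FastEmbed ONNX model, and the method names mirror the two steps above, not the crate's exact signatures:

```rust
// Sketch of the lazy-load-and-cache pattern. `Model` is a stand-in for a
// FastEmbed ONNX model; method names mirror the two steps above, not the
// crate's exact signatures.

use std::collections::HashMap;
use std::sync::{Arc, RwLock};

struct Model; // stand-in for a loaded ONNX embedding model

impl Model {
    fn embed(&self, texts: &[String], _batch_size: usize) -> Vec<Vec<f32>> {
        // Real code runs the ONNX session; emit dummy 384-dim vectors here.
        texts.iter().map(|_| vec![0.0f32; 384]).collect()
    }
}

#[derive(Default)]
struct TextEmbeddingManager {
    // Singleton cache: one loaded model per model name/config.
    models: Arc<RwLock<HashMap<String, Arc<Model>>>>,
}

impl TextEmbeddingManager {
    /// Step 1: lazy-load on first use, then serve from the cache.
    fn get_or_init_model(&self, name: &str) -> Arc<Model> {
        if let Some(m) = self.models.read().unwrap().get(name) {
            return Arc::clone(m);
        }
        let m = Arc::new(Model); // real code downloads/loads the ONNX model here
        self.models.write().unwrap().insert(name.to_string(), Arc::clone(&m));
        m
    }

    /// Step 2: batch-embed chunk texts and zip vectors back onto the chunks.
    fn embed_chunks(&self, name: &str, chunks: Vec<String>) -> Vec<(String, Vec<f32>)> {
        let model = self.get_or_init_model(name);
        let vectors = model.embed(&chunks, 256); // default batch_size = 256
        chunks.into_iter().zip(vectors).collect()
    }
}

fn main() {
    let mgr = TextEmbeddingManager::default();
    let out = mgr.embed_chunks("BAAI/bge-small-en-v1.5", vec!["hello".into(), "world".into()]);
    println!("{} chunks embedded at {} dims", out.len(), out[0].1.len());
}
```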

ONNX Runtime Requirement

Embeddings require ONNX Runtime. Feature-gated via:

```toml
[features]
embeddings = ["dep:fastembed", "dep:ort"]
```

Install: `brew install onnxruntime` (macOS) or `apt install libonnxruntime libonnxruntime-dev` (Linux). Verify the runtime is discoverable with `echo $ORT_DYLIB_PATH`.
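
When the `embeddings` feature (and therefore ONNX Runtime) is absent, the intent is graceful degradation: chunking still works, embedding is skipped (Critical Rule 4 below). A sketch with illustrative function names:

```rust
// Sketch of graceful degradation when the `embeddings` feature (and ONNX
// Runtime) is absent: chunking still works, embedding is skipped. Function
// names are illustrative.

fn chunk(text: &str) -> Vec<String> {
    text.split("\n\n").map(str::to_string).collect()
}

#[cfg(feature = "embeddings")]
fn maybe_embed(chunks: &[String]) -> Option<Vec<Vec<f32>>> {
    // Real code would call into FastEmbed here.
    Some(chunks.iter().map(|_| vec![0.0f32; 384]).collect())
}

#[cfg(not(feature = "embeddings"))]
fn maybe_embed(_chunks: &[String]) -> Option<Vec<Vec<f32>>> {
    None // ONNX Runtime not linked: skip embeddings rather than fail
}

fn main() {
    let chunks = chunk("first paragraph\n\nsecond paragraph");
    match maybe_embed(&chunks) {
        Some(vectors) => println!("embedded {} chunks", vectors.len()),
        None => println!("embeddings feature off; returning {} plain chunks", chunks.len()),
    }
}
```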

RAG Integration Pattern

The full extraction-to-RAG pipeline (a runnable sketch follows the list):

  1. Extract: `extract_file(path, config)` -> `ExtractionResult`
  2. Chunk: apply the preset strategy to `result.content` -> `Vec<Chunk>`
  3. Embed: if an embedding config is present, `TextEmbeddingManager::embed_chunks()` -> `Vec<ChunkWithEmbedding>`
  4. Output: `RagDocument { file_path, metadata, chunks }`, ready for vector DB ingestion
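
A compact, runnable sketch of those four steps, with stub types standing in for the crate's real `ExtractionResult`, chunker, and embedding manager; it shows the shape of the pipeline, not its exact API:

```rust
// Runnable sketch of the four steps above with stub types; the signatures do
// not claim to match the crate's real API.

use std::collections::HashMap;

struct ExtractionResult {
    content: String,
}

struct RagDocument {
    file_path: String,
    metadata: HashMap<String, String>,
    chunks: Vec<(String, Option<Vec<f32>>)>, // (text, optional embedding)
}

fn extract_file(path: &str) -> ExtractionResult {
    // Stub: real extraction parses the file format.
    ExtractionResult { content: format!("contents of {path}\n\nsecond paragraph") }
}

fn chunk(content: &str) -> Vec<String> {
    content.split("\n\n").map(str::to_string).collect()
}

fn embed(texts: &[String]) -> Vec<Vec<f32>> {
    // Stub for TextEmbeddingManager::embed_chunks(): dummy 384-dim vectors.
    texts.iter().map(|_| vec![0.0f32; 384]).collect()
}

fn main() {
    let result = extract_file("report.pdf");        // 1. Extract
    let texts = chunk(&result.content);             // 2. Chunk
    let vectors = embed(&texts);                    // 3. Embed (if configured)
    let doc = RagDocument {                         // 4. Output
        file_path: "report.pdf".into(),
        metadata: HashMap::new(),
        chunks: texts.into_iter().zip(vectors.into_iter().map(Some)).collect(),
    };
    println!("{}: {} chunks ready for ingestion", doc.file_path, doc.chunks.len());
}
```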

See the `ChunkWithEmbedding` struct in `types.rs`: it contains `text`, `embedding: Vec<f32>`, `dimensions`, `norm`, and `metadata`.
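
A sketch of that struct from the field names above (the real definition in `types.rs` may differ), showing how `norm` relates to `embedding`:

```rust
// Sketch of ChunkWithEmbedding based on the field names above; the real
// definition in types.rs may differ in details.
use std::collections::HashMap;

struct ChunkWithEmbedding {
    text: String,
    embedding: Vec<f32>,
    dimensions: usize, // embedding.len(), e.g. 384 for bge-small
    norm: f32,         // L2 norm of `embedding`, precomputed for similarity search
    metadata: HashMap<String, String>,
}

fn main() {
    let embedding = vec![3.0f32, 4.0];
    let chunk = ChunkWithEmbedding {
        text: "example chunk".into(),
        dimensions: embedding.len(),
        norm: embedding.iter().map(|x| x * x).sum::<f32>().sqrt(), // = 5.0
        embedding,
        metadata: HashMap::new(),
    };
    println!("{:?}: {} dims, norm {}", chunk.text, chunk.dimensions, chunk.norm);
}
```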

Critical Rules

  1. Chunking is preprocessing - Always apply before embedding to ensure consistent vector sizes
  2. Overlap prevents information loss - Set overlap to 15-20% of chunk size
  3. Embedding models are stateful - Lazy load and cache to avoid repeated initialization
  4. ONNX Runtime is required - Gracefully degrade if not available (skip embeddings)
  5. Batch embedding for performance - Never embed single chunks; batch 50-1000 chunks
  6. Normalize embeddings for search - Use the L2 norm so a dot product equals cosine similarity (see the sketch after this list)
  7. Cache embedding results - Don't re-embed identical text chunks
  8. Model selection impacts quality - bge-small (384) for speed, bge-base (768) for quality
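
Rule 6 in practice: a dependency-free sketch where L2 normalization makes a plain dot product equal cosine similarity:

```rust
// Minimal sketch of rule 6: L2-normalize vectors so that a dot product
// equals cosine similarity. Pure Rust, no crate-specific API assumed.

fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let mut a = vec![3.0f32, 4.0];
    let mut b = vec![4.0f32, 3.0];
    l2_normalize(&mut a);
    l2_normalize(&mut b);
    // After normalization, dot(a, b) is the cosine similarity (here 24/25 = 0.96).
    println!("cosine = {}", dot(&a, &b));
}
```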

Related Skills

  • extraction-pipeline-patterns - Text extraction preceding chunking
  • api-server-mcp - Endpoint for chunking + embedding operations
  • ocr-backend-management - OCR text quality affects chunking success