Kreuzberg extraction-pipeline-patterns

install

source · Clone the upstream repo

git clone https://github.com/kreuzberg-dev/kreuzberg

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/kreuzberg-dev/kreuzberg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.ai-rulez/skills/extraction-pipeline-patterns" ~/.claude/skills/kreuzberg-dev-kreuzberg-extraction-pipeline-patterns && rm -rf "$T"

manifest: .ai-rulez/skills/extraction-pipeline-patterns/SKILL.md

Extraction Pipeline Patterns

Kreuzberg's format detection -> extraction -> fallback orchestration for 75+ file formats

Core Pipeline Architecture

The extraction pipeline (`crates/kreuzberg/src/core/pipeline.rs`, `crates/kreuzberg/src/extraction/`) orchestrates:

Format Detection - MIME type inference + extension validation -> select appropriate extractor
Intelligent Extraction - Route to format-specific extractors (PDF, DOCX, Excel, HTML, images, archives, etc.)
Fallback Strategies - Password-protected PDFs, OCR for images, nested archive handling, corrupted file recovery
Post-Processing Pipeline - Validators, quality processing, chunking, custom hooks (see `core/pipeline.rs`)

Format Detection Strategy

Location: `crates/kreuzberg/src/core/mime.rs`, `crates/kreuzberg/src/core/formats.rs`

Pattern: detect the format from magic bytes, check that the file extension agrees with it (prevents spoofing), then route to the matching extractor. When multiple extractors handle the same format, choose the one with the highest confidence/specificity.

// Pseudocode: core/mime.rs
match (magic_bytes(content), extension) {
    (Some(fmt), Some(ext)) if aligned(fmt, ext) => Ok(fmt),
    (Some(_), Some(_)) => Err(FormatMismatch),  // extension disagrees with magic bytes
    (Some(fmt), None) => Ok(fmt),  // magic bytes only
    (None, Some(ext)) => Ok(from_extension(ext)),  // extension only
    (None, None) => Err(UnknownFormat),
}
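
The match above can be made concrete with a small, self-contained sketch. Everything here is illustrative rather than taken from `core/mime.rs`: the `Format` and `DetectError` types, the three signatures checked, and the strict equality test for alignment are assumptions. A real implementation would need a looser alignment rule, since container formats such as DOCX share the ZIP signature.

```rust
/// Hypothetical format enum; the real crate covers far more formats.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Png,
    Zip,
    PlainText,
}

/// Hypothetical error type mirroring the two failure arms of the pseudocode.
#[derive(Debug, PartialEq, Eq)]
pub enum DetectError {
    FormatMismatch { detected: Format, claimed: Format },
    UnknownFormat,
}

/// Identify a format from its well-known leading signature bytes.
fn magic_bytes(content: &[u8]) -> Option<Format> {
    if content.starts_with(b"%PDF-") {
        Some(Format::Pdf)
    } else if content.starts_with(&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]) {
        Some(Format::Png)
    } else if content.starts_with(b"PK\x03\x04") {
        Some(Format::Zip)
    } else {
        None
    }
}

/// Map a file extension (case-insensitive) to the format it claims to be.
fn from_extension(ext: &str) -> Option<Format> {
    match ext.to_ascii_lowercase().as_str() {
        "pdf" => Some(Format::Pdf),
        "png" => Some(Format::Png),
        "zip" => Some(Format::Zip),
        "txt" => Some(Format::PlainText),
        _ => None,
    }
}

/// Combine both signals: content wins, but a contradicting extension is an error.
pub fn detect_format(content: &[u8], extension: Option<&str>) -> Result<Format, DetectError> {
    let claimed = extension.and_then(from_extension);
    match (magic_bytes(content), claimed) {
        (Some(detected), Some(claimed)) if detected == claimed => Ok(detected),
        (Some(detected), Some(claimed)) => Err(DetectError::FormatMismatch { detected, claimed }),
        (Some(detected), None) => Ok(detected),
        (None, Some(claimed)) => Ok(claimed),
        (None, None) => Err(DetectError::UnknownFormat),
    }
}
```

In this sketch a PDF payload renamed to `.png` is rejected with `FormatMismatch` instead of being routed to the image extractor, which is the spoofing case the pattern is meant to catch.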

Extraction Modules (75+ Formats)

| Category | Extractors | Key Modules |
|---|---|---|
| Office | DOCX, XLSX, XLSM, XLSB, XLS, PPTX, ODP, ODS | `extraction/{docx,excel,pptx}.rs` |
| PDF | Standard + encrypted, password attempts | `pdf/` subdirectory (13 files) |
| Images | PNG, JPG, TIFF, WebP, JP2, SVG (OCR-enabled) | `extraction/image.rs` + `ocr/` |
| Web | HTML, XHTML, XML, SVG (DOM parsing) | `extraction/html.rs` (67KB - complex table handling) |
| Email | EML, MSG (headers, body, attachments, threading) | `extraction/email.rs` |
| Archives | ZIP, TAR, GZ, 7Z (recursive extraction) | `extraction/archive.rs` (31KB) |
| Markdown | MD, TXT, RST, Org Mode, RTF | `extraction/markdown.rs` |
| Academic | LaTeX, BibTeX, JATS, Jupyter, DocBook | `extraction/{structured,xml}.rs` |

Extraction Dispatcher

// Pseudocode: extraction/mod.rs
let format = detect_format(source.bytes, source.extension)?;
let result = match format {
    Pdf => extract_pdf(source, config),
    Docx => extract_docx(source, config),
    Image => extract_image_with_ocr_fallback(source, config),
    Archive => extract_archive_recursive(source, config),
    _ => extract_with_plugin(format, source, config),  // anything without a built-in extractor
};
run_pipeline(result, config)  // post-processing always runs

Fallback Strategies

Password-Protected PDFs: Try primary password -> secondary password list -> return `is_encrypted=true` in metadata on failure
OCR Fallback: If image text extraction confidence < threshold, trigger OCR backend; return both results with scores
Nested Archives: Recursive extraction with configurable depth limit; flatten or preserve hierarchy
Corrupted File Recovery: Stream-based parsing, emit content up to error point, include error location in metadata
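
As one illustration of the strategies above, the password chain could be sketched as follows. The `Metadata` and `Extraction` types, the `fallbacks_applied` labels, and the `try_open` closure (a stand-in for the actual PDF decryption call) are all hypothetical. Only the `is_encrypted` flag comes from the text, and setting it on success as well as on failure is an assumption of this sketch.

```rust
/// Hypothetical metadata; only `is_encrypted` is named in the text above.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Metadata {
    pub is_encrypted: bool,
    pub fallbacks_applied: Vec<String>,
}

/// Hypothetical result type: `text` is `None` when no password worked.
#[derive(Debug, PartialEq, Eq)]
pub struct Extraction {
    pub text: Option<String>,
    pub metadata: Metadata,
}

/// Try the primary password first, then each secondary one in order.
/// `try_open` returns the decrypted text for a correct password, else `None`.
pub fn extract_encrypted<F>(primary: Option<&str>, secondary: &[&str], mut try_open: F) -> Extraction
where
    F: FnMut(&str) -> Option<String>,
{
    // Step 1: primary password, if the caller supplied one.
    if let Some(password) = primary {
        if let Some(text) = try_open(password) {
            let metadata = Metadata { is_encrypted: true, ..Default::default() };
            return Extraction { text: Some(text), metadata };
        }
    }

    // Step 2: secondary list; record that a fallback was needed.
    for password in secondary {
        if let Some(text) = try_open(password) {
            let metadata = Metadata {
                is_encrypted: true,
                fallbacks_applied: vec!["secondary_password".to_string()],
            };
            return Extraction { text: Some(text), metadata };
        }
    }

    // Step 3: nothing worked; report the encryption in metadata instead of failing hard.
    Extraction {
        text: None,
        metadata: Metadata {
            is_encrypted: true,
            fallbacks_applied: vec!["password_attempts_exhausted".to_string()],
        },
    }
}
```

Returning a result with metadata instead of an error lets the post-processing pipeline still run and lets callers distinguish "encrypted, no password matched" from a parse failure.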

Configuration Integration

Location: `crates/kreuzberg/src/core/config.rs`, `crates/kreuzberg/src/core/config_validation.rs`

`ExtractionConfig` holds format-specific configs (`pdf`, `image`, `html`, `office`), fallback orchestration (`fallback`), and post-processing (`postprocessor`, `chunking`, `keywords`). See struct definition in `config.rs`.
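
To make the shape of such a config concrete, here is a reduced sketch. It is not the struct from `config.rs`: only the field names `pdf`, `fallback`, and `chunking` come from the text. The sub-config fields, their defaults, and the `validate_ranges` helper are assumptions made for illustration, and the remaining sections (`image`, `html`, `office`, `postprocessor`, `keywords`) are omitted for brevity.

```rust
/// Hypothetical PDF section; the password list is an assumed field.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct PdfConfig {
    pub passwords: Vec<String>,
}

/// Hypothetical fallback section covering archive depth and OCR thresholds.
#[derive(Debug, Clone, PartialEq)]
pub struct FallbackConfig {
    pub max_archive_depth: usize,
    pub ocr_confidence_threshold: f32,
}

impl Default for FallbackConfig {
    fn default() -> Self {
        // Assumed defaults, chosen only to make the sketch usable.
        Self { max_archive_depth: 3, ocr_confidence_threshold: 0.5 }
    }
}

/// Hypothetical chunking section.
#[derive(Debug, Clone, PartialEq)]
pub struct ChunkingConfig {
    pub max_chars: usize,
}

/// Reduced stand-in for `ExtractionConfig`: optional sections are `Option`s,
/// so an absent section means "use the extractor's built-in behaviour".
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ExtractionConfig {
    pub pdf: Option<PdfConfig>,
    pub fallback: FallbackConfig,
    pub chunking: Option<ChunkingConfig>,
}

/// Reject values that no extractor could act on sensibly.
pub fn validate_ranges(config: &ExtractionConfig) -> Result<(), String> {
    let threshold = config.fallback.ocr_confidence_threshold;
    if !(0.0..=1.0).contains(&threshold) {
        return Err(format!("ocr_confidence_threshold must be in [0, 1], got {threshold}"));
    }
    if let Some(chunking) = &config.chunking {
        if chunking.max_chars == 0 {
            return Err("chunking.max_chars must be greater than zero".to_string());
        }
    }
    Ok(())
}
```

Keeping validation in a separate function mirrors the split between `config.rs` and `config_validation.rs`, so a config can be built or deserialized first and checked once before extraction starts.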

Plugin System Integration

Location: `crates/kreuzberg/src/plugins/`

CustomExtractor: Override built-in format extractors
PostProcessor: Modify results after extraction (Early/Middle/Late stages)
Validator: Fail-fast validation (e.g., minimum text length)
OCRBackend: Swap OCR engine

The plugin registry is loaded once at startup and cached, so lookups during extraction do not reload plugins.
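
The interaction between validators and staged post-processors might look like the sketch below. The trait shapes, the example plugins (`MinLength`, `TrimWhitespace`, `AppendFooter`), and the signature of `run_pipeline` are assumptions and not the traits defined in `plugins/`. Running validators before post-processors is also an assumption of this sketch.

```rust
/// Stage ordering follows declaration order: Early < Middle < Late.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Stage {
    Early,
    Middle,
    Late,
}

/// Hypothetical post-processor trait: transforms text at a given stage.
pub trait PostProcessor {
    fn stage(&self) -> Stage;
    fn process(&self, text: String) -> String;
}

/// Hypothetical validator trait: an `Err` aborts the pipeline immediately.
pub trait Validator {
    fn validate(&self, text: &str) -> Result<(), String>;
}

/// Example validator: require a minimum number of characters.
pub struct MinLength(pub usize);

impl Validator for MinLength {
    fn validate(&self, text: &str) -> Result<(), String> {
        let len = text.chars().count();
        if len >= self.0 {
            Ok(())
        } else {
            Err(format!("text has {len} chars, expected at least {}", self.0))
        }
    }
}

/// Example early-stage processor: strip surrounding whitespace.
pub struct TrimWhitespace;

impl PostProcessor for TrimWhitespace {
    fn stage(&self) -> Stage {
        Stage::Early
    }
    fn process(&self, text: String) -> String {
        text.trim().to_string()
    }
}

/// Example late-stage processor: append a marker after everything else ran.
pub struct AppendFooter;

impl PostProcessor for AppendFooter {
    fn stage(&self) -> Stage {
        Stage::Late
    }
    fn process(&self, text: String) -> String {
        format!("{text} [end]")
    }
}

/// Fail fast on validators, then apply processors in stage order.
pub fn run_pipeline(
    text: String,
    validators: &[Box<dyn Validator>],
    processors: &mut [Box<dyn PostProcessor>],
) -> Result<String, String> {
    for validator in validators {
        validator.validate(&text)?;
    }
    // Stable sort: processors within the same stage keep registration order.
    processors.sort_by_key(|p| p.stage());
    Ok(processors.iter().fold(text, |acc, p| p.process(acc)))
}
```

Because the sort is stable, registering `AppendFooter` before `TrimWhitespace` still runs the trim first, so stage decides ordering across stages and registration order only breaks ties within one.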

Feature Flag Strategy

Location: `Cargo.toml` (workspace), `crates/kreuzberg/Cargo.toml`, `FEATURE_MATRIX.md`

20+ features across 9 language bindings. Key feature groups:

| Group | Features | Notes |
|---|---|---|
| OCR | `tesseract` (default), `tesseract-static`, `ocr-minimal` | Mutually exclusive recommendation |
| Formats | `pdf`, `pdf-minimal`, `office`, `office-minimal` | |
| AI/ML | `embeddings` (requires ONNX), `keywords-yake`, `keywords-rake`, `language-detection` | |
| Server | `api` (Axum), `mcp`, `tokio-runtime`, `lite-runtime` | |
| Bindings | `python-bindings`, `ruby-bindings`, `php-bindings`, `node-bindings`, `wasm` | |

Conditional compilation: modules are gated with `#[cfg(feature = "...")]`. At runtime, `validate_config()` warns if a requested feature is not compiled in.
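
A runtime diagnostic of this kind can be built on the `cfg!` macro, which evaluates to a compile-time boolean. The sketch below is a stand-in for the feature check only: the `Feature` enum and the `feature_warnings` function are hypothetical names, and the actual signature of `validate_config()` is not shown in this document.

```rust
/// Features a config might request; names mirror flags from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Feature {
    Pdf,
    Embeddings,
    LanguageDetection,
}

impl Feature {
    /// Cargo feature name as it would appear in `Cargo.toml`.
    fn name(self) -> &'static str {
        match self {
            Feature::Pdf => "pdf",
            Feature::Embeddings => "embeddings",
            Feature::LanguageDetection => "language-detection",
        }
    }

    /// `cfg!` is resolved at compile time, so each arm is a constant
    /// `true` or `false` in the final binary.
    #[allow(unexpected_cfgs)]
    fn is_compiled_in(self) -> bool {
        match self {
            Feature::Pdf => cfg!(feature = "pdf"),
            Feature::Embeddings => cfg!(feature = "embeddings"),
            Feature::LanguageDetection => cfg!(feature = "language-detection"),
        }
    }
}

/// Return one warning per requested feature that this build lacks.
pub fn feature_warnings(requested: &[Feature]) -> Vec<String> {
    requested
        .iter()
        .copied()
        .filter(|feature| !feature.is_compiled_in())
        .map(|feature| {
            format!(
                "feature `{}` was requested but is not compiled in; rebuild with --features {}",
                feature.name(),
                feature.name()
            )
        })
        .collect()
}
```

Returning the warnings as data, instead of logging them directly, leaves the caller free to log, surface them through a binding, or turn them into hard errors.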

Feature Flag Critical Rules

Never mix conflicting features - e.g., `ocr-minimal` + `tesseract` should error at compile time
Always provide feature diagnostics - Config validation must warn if feature unavailable
Default to maximum feature set - Unless embedded/minimal specifically requested
Test all feature combinations - Matrix testing in CI catches regressions
WASM is incompatible with embeddings, keywords, and OCR
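
The first rule can be enforced with the standard `compile_error!` macro behind a `cfg` guard. The fragment below is a minimal sketch of that technique using the two feature names from the example above. Where such a guard lives in the crate, and the `ocr_backend` helper, are assumptions.

```rust
// Refuse to build when two OCR feature sets that should not coexist are both enabled.
#[cfg(all(feature = "ocr-minimal", feature = "tesseract"))]
compile_error!("features `ocr-minimal` and `tesseract` are mutually exclusive; enable only one");

/// Report which OCR backend this build selected (hypothetical helper).
#[allow(unexpected_cfgs)]
pub fn ocr_backend() -> &'static str {
    if cfg!(feature = "tesseract") {
        "tesseract"
    } else if cfg!(feature = "ocr-minimal") {
        "ocr-minimal"
    } else {
        "none"
    }
}
```

Cargo features are additive across a dependency graph, so a guard like this turns an accidental combination into a clear build failure instead of undefined precedence between backends.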

Critical Rules

Always use format detection before routing to extractors (prevent confusion attacks)
Stream-based parsing for PDFs/archives to handle multi-GB files
Post-pipeline is mandatory: All extraction results flow through `run_pipeline()` for validators/hooks
Plugin overrides are order-dependent: Plugins registered first take priority
Fallback timeouts: Set reasonable OCR/archive extraction timeouts (config-driven)
Metadata preservation: Include format detection confidence, extraction method used, any fallbacks applied
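
The order-dependence rule can be illustrated with a registry that resolves to the first registered match. This is a sketch under assumptions: the `CustomExtractor` trait shape, `ExtractorRegistry`, and the `Named` test plugin are hypothetical and do not reflect the types in `plugins/`.

```rust
/// Hypothetical extractor plugin interface keyed by MIME type.
pub trait CustomExtractor {
    fn name(&self) -> &str;
    fn supports(&self, mime: &str) -> bool;
}

/// Registry that preserves registration order.
#[derive(Default)]
pub struct ExtractorRegistry {
    extractors: Vec<Box<dyn CustomExtractor>>,
}

impl ExtractorRegistry {
    /// Append a plugin; earlier registrations keep priority over later ones.
    pub fn register(&mut self, extractor: Box<dyn CustomExtractor>) {
        self.extractors.push(extractor);
    }

    /// Return the name of the first registered plugin that supports `mime`.
    pub fn resolve(&self, mime: &str) -> Option<&str> {
        self.extractors
            .iter()
            .find(|extractor| extractor.supports(mime))
            .map(|extractor| extractor.name())
    }
}

/// Minimal plugin used to demonstrate priority.
pub struct Named {
    pub name: &'static str,
    pub mime: &'static str,
}

impl CustomExtractor for Named {
    fn name(&self) -> &str {
        self.name
    }
    fn supports(&self, mime: &str) -> bool {
        self.mime == mime
    }
}
```

With two plugins registered for `application/pdf`, `resolve` returns the one registered first, so the registration sequence is effectively part of the configuration and should be deterministic.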

Related Skills

ocr-backend-management - OCR engine selection and image preprocessing
chunking-embeddings - Post-extraction text splitting with FastEmbed
api-server-mcp - Axum endpoint for extraction pipeline exposure and MCP server