install
source · Clone the upstream repo
git clone https://github.com/kreuzberg-dev/kreuzberg
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/kreuzberg-dev/kreuzberg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.ai-rulez/skills/extraction-pipeline-patterns" ~/.claude/skills/kreuzberg-dev-kreuzberg-extraction-pipeline-patterns && rm -rf "$T"
manifest: .ai-rulez/skills/extraction-pipeline-patterns/SKILL.md
Extraction Pipeline Patterns
Kreuzberg's format detection -> extraction -> fallback orchestration for 75+ file formats
Core Pipeline Architecture
The extraction pipeline (crates/kreuzberg/src/core/pipeline.rs, crates/kreuzberg/src/extraction/) orchestrates:
- Format Detection - MIME type inference + extension validation -> select appropriate extractor
- Intelligent Extraction - Route to format-specific extractors (PDF, DOCX, Excel, HTML, images, archives, etc.)
- Fallback Strategies - Password-protected PDFs, OCR for images, nested archive handling, corrupted file recovery
- Post-Processing Pipeline - Validators, quality processing, chunking, custom hooks (see core/pipeline.rs)
Format Detection Strategy
Location: crates/kreuzberg/src/core/mime.rs, crates/kreuzberg/src/core/formats.rs
Pattern: detect via magic bytes, validate extension alignment (prevent spoofing), route to extractor. Multiple extractors for same format -> choose highest confidence/specificity.
// Pseudocode: core/mime.rs
match (magic_bytes(content), extension) {
    (Some(fmt), Some(ext)) if aligned -> Ok(fmt),
    (Some(fmt), Some(ext)) if misaligned -> Err(FormatMismatch),
    (Some(fmt), None) -> Ok(fmt), // magic bytes only
    (None, Some(ext)) -> Ok(from_extension(ext)),
    _ -> Err(UnknownFormat),
}
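The pseudocode above can be turned into a small compilable sketch. Everything here is illustrative rather than Kreuzberg's actual API: the `Format` enum, the `DetectError` type, and the helpers `from_magic_bytes` and `from_extension` are hypothetical names, and only three well-known signatures (PDF, PNG, ZIP) are sniffed. The sketch also assumes an unrecognized extension is treated the same as a missing one.

```rust
/// Formats known to this sketch (hypothetical; the real crate covers far more).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Png,
    Zip,
    PlainText,
}

/// Hypothetical error type mirroring the two failure arms of the pseudocode.
#[derive(Debug, PartialEq, Eq)]
pub enum DetectError {
    FormatMismatch { detected: Format, claimed: Format },
    UnknownFormat,
}

/// Sniff the leading magic bytes of the content.
fn from_magic_bytes(content: &[u8]) -> Option<Format> {
    if content.starts_with(b"%PDF-") {
        Some(Format::Pdf)
    } else if content.starts_with(&[0x89, b'P', b'N', b'G', 0x0D, 0x0A, 0x1A, 0x0A]) {
        Some(Format::Png)
    } else if content.starts_with(b"PK\x03\x04") {
        Some(Format::Zip)
    } else {
        None
    }
}

/// Map a file extension (case-insensitive) to a format.
fn from_extension(ext: &str) -> Option<Format> {
    match ext.to_ascii_lowercase().as_str() {
        "pdf" => Some(Format::Pdf),
        "png" => Some(Format::Png),
        "zip" => Some(Format::Zip),
        "txt" => Some(Format::PlainText),
        _ => None,
    }
}

/// Magic bytes win when present; a disagreeing extension is rejected as spoofing.
pub fn detect_format(content: &[u8], extension: Option<&str>) -> Result<Format, DetectError> {
    let detected = from_magic_bytes(content);
    // Assumption: an extension this sketch does not recognize counts as absent.
    let claimed = extension.and_then(from_extension);
    match (detected, claimed) {
        (Some(d), Some(c)) if d == c => Ok(d),
        (Some(d), Some(c)) => Err(DetectError::FormatMismatch { detected: d, claimed: c }),
        (Some(d), None) => Ok(d), // magic bytes only
        (None, Some(c)) => Ok(c), // extension only
        (None, None) => Err(DetectError::UnknownFormat),
    }
}
```

A real implementation would also need container awareness (for example, DOCX files begin with the ZIP signature), which this sketch leaves out.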
Extraction Modules (75 Formats)
| Category | Extractors | Key Modules |
|---|---|---|
| Office | DOCX, XLSX, XLSM, XLSB, XLS, PPTX, ODP, ODS | |
| PDF | Standard + encrypted, password attempts | subdirectory (13 files) |
| Images | PNG, JPG, TIFF, WebP, JP2, SVG (OCR-enabled) | + |
| Web | HTML, XHTML, XML, SVG (DOM parsing) | (67KB - complex table handling) |
| Email | EML, MSG (headers, body, attachments, threading) | |
| Archives | ZIP, TAR, GZ, 7Z (recursive extraction) | (31KB) |
| Markdown | MD, TXT, RST, Org Mode, RTF | |
| Academic | LaTeX, BibTeX, JATS, Jupyter, DocBook | |
Extraction Dispatcher
// Pseudocode: extraction/mod.rs
let format = detect_format(source.bytes, source.extension);
let result = match format {
    Pdf -> extract_pdf(source, config),
    Docx -> extract_docx(source, config),
    Image -> extract_image_with_ocr_fallback(source, config),
    Archive -> extract_archive_recursive(source, config),
    _ -> extract_with_plugin(format, source, config),
};
run_pipeline(result, config) // post-processing always runs
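To make the "post-processing always runs" property concrete, here is a minimal sketch of the dispatcher shape. The extractors are stubs that simply decode bytes, and `Format`, `Config`, `ExtractionResult`, `stub`, and `extract` are hypothetical stand-ins rather than the crate's real types. The point is structural: every match arm produces a raw result, and a single call to `run_pipeline` sits after the match so no arm can skip it.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Pdf,
    Docx,
    Image,
    Archive,
    Other,
}

/// Hypothetical result type; `post_processed` exists only to make the flow visible.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ExtractionResult {
    pub text: String,
    pub method: &'static str,
    pub post_processed: bool,
}

/// Hypothetical config with a single post-processing switch.
pub struct Config {
    pub trim_whitespace: bool,
}

/// Stand-in for a format-specific extractor: lossily decodes the bytes as UTF-8.
fn stub(method: &'static str, bytes: &[u8]) -> ExtractionResult {
    ExtractionResult {
        text: String::from_utf8_lossy(bytes).into_owned(),
        method,
        post_processed: false,
    }
}

/// Post-processing stage applied to every result.
pub fn run_pipeline(mut result: ExtractionResult, config: &Config) -> ExtractionResult {
    if config.trim_whitespace {
        result.text = result.text.trim().to_string();
    }
    result.post_processed = true;
    result
}

/// Route to an extractor, then unconditionally run the post-processing pipeline.
pub fn extract(format: Format, bytes: &[u8], config: &Config) -> ExtractionResult {
    let raw = match format {
        Format::Pdf => stub("pdf", bytes),
        Format::Docx => stub("docx", bytes),
        Format::Image => stub("image+ocr-fallback", bytes),
        Format::Archive => stub("archive-recursive", bytes),
        Format::Other => stub("plugin", bytes),
    };
    // Single exit point: no arm can bypass validators and hooks.
    run_pipeline(raw, config)
}
```

Keeping `run_pipeline` outside the match, rather than calling it inside each arm, is what makes the rule hard to violate when a new format is added.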
Fallback Strategies
- Password-Protected PDFs: Try primary password -> secondary password list -> return is_encrypted=true in metadata on failure
- OCR Fallback: If image text extraction confidence < threshold, trigger OCR backend; return both results with scores
- Nested Archives: Recursive extraction with configurable depth limit; flatten or preserve hierarchy
- Corrupted File Recovery: Stream-based parsing, emit content up to error point, include error location in metadata
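Two of the strategies above, the password cascade and the depth-limited archive walk, can be sketched in a few lines. This is an illustration under assumptions, not the crate's implementation: `PdfOutcome`, `extract_with_passwords`, `Entry`, and `flatten` are hypothetical names, the `try_open` closure stands in for a real PDF decryption call, and archives are modelled as an in-memory tree.

```rust
/// Outcome of a password cascade. `is_encrypted` mirrors the metadata flag
/// named in the list above; the other fields are illustrative.
#[derive(Debug, PartialEq, Eq)]
pub struct PdfOutcome {
    pub text: Option<String>,
    pub is_encrypted: bool,
    pub passwords_tried: usize,
}

/// Try the primary password, then each fallback in order.
/// `try_open` is a stand-in for the real decryption routine.
pub fn extract_with_passwords<F>(
    primary: Option<&str>,
    fallbacks: &[&str],
    mut try_open: F,
) -> PdfOutcome
where
    F: FnMut(&str) -> Option<String>,
{
    let mut tried = 0;
    for password in primary.into_iter().chain(fallbacks.iter().copied()) {
        tried += 1;
        if let Some(text) = try_open(password) {
            return PdfOutcome { text: Some(text), is_encrypted: true, passwords_tried: tried };
        }
    }
    // No password worked: report the encryption in metadata instead of failing hard.
    PdfOutcome { text: None, is_encrypted: true, passwords_tried: tried }
}

/// In-memory model of an archive tree (hypothetical).
pub enum Entry {
    File(String),
    Archive(Vec<Entry>),
}

/// Flatten nested archives, refusing to descend past `max_depth`.
pub fn flatten(entries: &[Entry], depth: usize, max_depth: usize, out: &mut Vec<String>) {
    for entry in entries {
        match entry {
            Entry::File(name) => out.push(name.clone()),
            Entry::Archive(inner) if depth < max_depth => {
                flatten(inner, depth + 1, max_depth, out)
            }
            // Depth limit reached: record a marker rather than recursing further.
            Entry::Archive(_) => out.push("<archive skipped: depth limit>".to_string()),
        }
    }
}
```

The depth limit is what bounds work on maliciously nested archives; a production version would presumably also cap total extracted size.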
Configuration Integration
Location: crates/kreuzberg/src/core/config.rs, crates/kreuzberg/src/core/config_validation.rs
ExtractionConfig holds format-specific configs (pdf, image, html, office), fallback orchestration (fallback), and post-processing (postprocessor, chunking, keywords). See struct definition in config.rs.
Plugin System Integration
Location: crates/kreuzberg/src/plugins/
- CustomExtractor: Override built-in format extractors
- PostProcessor: Modify results after extraction (Early/Middle/Late stages)
- Validator: Fail-fast validation (e.g., minimum text length)
- OCRBackend: Swap OCR engine
Plugin registry loaded at startup, cached for zero-cost lookup.
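The plugin roles listed above can be pictured with a small registry sketch. The trait and type names below (`PostProcessor`, `Validator`, `Stage`, `Registry`) echo the list but their signatures are assumptions, as are the example plugins `Trim`, `Exclaim`, and `MinLength`. The sketch also assumes validators run before post-processors; the actual ordering in Kreuzberg may differ.

```rust
/// Processing stages; deriving `Ord` gives Early < Middle < Late.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Stage {
    Early,
    Middle,
    Late,
}

pub trait PostProcessor {
    fn stage(&self) -> Stage;
    fn process(&self, text: String) -> String;
}

pub trait Validator {
    fn validate(&self, text: &str) -> Result<(), String>;
}

/// Hypothetical registry holding plugins in registration order.
#[derive(Default)]
pub struct Registry {
    post_processors: Vec<Box<dyn PostProcessor>>,
    validators: Vec<Box<dyn Validator>>,
}

impl Registry {
    pub fn register_post_processor(&mut self, p: Box<dyn PostProcessor>) {
        self.post_processors.push(p);
    }

    pub fn register_validator(&mut self, v: Box<dyn Validator>) {
        self.validators.push(v);
    }

    /// Fail fast on validators, then apply post-processors stage by stage.
    pub fn run(&self, text: String) -> Result<String, String> {
        for v in &self.validators {
            v.validate(&text)?;
        }
        let mut ordered: Vec<&Box<dyn PostProcessor>> = self.post_processors.iter().collect();
        // Stable sort: within one stage, registration order is preserved.
        ordered.sort_by_key(|p| p.stage());
        Ok(ordered.into_iter().fold(text, |acc, p| p.process(acc)))
    }
}

/// Example plugins (hypothetical).
pub struct Trim;
impl PostProcessor for Trim {
    fn stage(&self) -> Stage { Stage::Early }
    fn process(&self, text: String) -> String { text.trim().to_string() }
}

pub struct Exclaim;
impl PostProcessor for Exclaim {
    fn stage(&self) -> Stage { Stage::Late }
    fn process(&self, text: String) -> String { format!("{text}!") }
}

pub struct MinLength(pub usize);
impl Validator for MinLength {
    fn validate(&self, text: &str) -> Result<(), String> {
        if text.trim().len() >= self.0 {
            Ok(())
        } else {
            Err(format!("text shorter than {} bytes", self.0))
        }
    }
}
```

Because the sort is stable, two processors in the same stage run in the order they were registered, which matches the order-dependence noted under Critical Rules.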
Feature Flag Strategy
Location: Cargo.toml (workspace), crates/kreuzberg/Cargo.toml, FEATURE_MATRIX.md
20+ features across 9 language bindings. Key feature groups:
| Group | Features | Notes |
|---|---|---|
| OCR | (default), , | Mutually exclusive recommendation |
| Formats | , , , | |
| AI/ML | (requires ONNX), , , | |
| Server | (Axum), , , | |
| Bindings | , , , , | |
Conditional compilation: modules are gated with #[cfg(feature = "...")]. At runtime, validate_config() warns if a requested feature was not compiled in.
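As a rough illustration of how compile-time gating and runtime diagnostics fit together, consider the sketch below. The feature names `ocr` and `embeddings`, the `RequestedFeatures` struct, and this `validate_config` signature are placeholders chosen for the example; the real feature names and the real `validate_config()` signature are not shown in this document.

```rust
/// Module that only exists when the (placeholder) `ocr` feature is enabled.
#[cfg(feature = "ocr")]
pub mod ocr {
    pub fn backend_name() -> &'static str {
        "compiled-in"
    }
}

/// Capabilities a caller asked for in their config (hypothetical shape).
#[derive(Debug, Default)]
pub struct RequestedFeatures {
    pub ocr: bool,
    pub embeddings: bool,
}

/// Return one warning per requested capability that was not compiled in.
/// `cfg!` evaluates to a compile-time boolean, so both branches still type-check.
pub fn validate_config(requested: &RequestedFeatures) -> Vec<String> {
    let mut warnings = Vec::new();
    if requested.ocr && !cfg!(feature = "ocr") {
        warnings.push("config requests OCR, but the `ocr` feature is not compiled in".to_string());
    }
    if requested.embeddings && !cfg!(feature = "embeddings") {
        warnings.push(
            "config requests embeddings, but the `embeddings` feature is not compiled in"
                .to_string(),
        );
    }
    warnings
}
```

In a build without either feature, requesting OCR yields a single warning and an empty request yields none, so the caller learns about the gap at startup rather than at the first extraction.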
Feature Flag Critical Rules
- Never mix conflicting features - e.g., tesseract + ocr-minimal should error at compile time
- Always provide feature diagnostics - Config validation must warn if feature unavailable
- Default to maximum feature set - Unless embedded/minimal specifically requested
- Test all feature combinations - Matrix testing in CI catches regressions
- WASM incompatible with embeddings, keywords, OCR
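One common way to enforce the first rule is Rust's built-in `compile_error!` macro guarded by a `cfg` predicate. The sketch below uses the feature names `tesseract` and `ocr-minimal` from the example in that rule; whether Kreuzberg uses exactly this guard is an assumption, and `ocr_flavor` is a hypothetical helper added only to show the selected flavor at runtime.

```rust
// Refuse to build when both (assumed mutually exclusive) features are enabled.
#[cfg(all(feature = "tesseract", feature = "ocr-minimal"))]
compile_error!("features `tesseract` and `ocr-minimal` are mutually exclusive; enable only one");

/// Report which OCR flavor this build was compiled with (hypothetical helper).
pub fn ocr_flavor() -> &'static str {
    if cfg!(feature = "tesseract") {
        "tesseract"
    } else if cfg!(feature = "ocr-minimal") {
        "ocr-minimal"
    } else {
        "none"
    }
}
```

Note that Cargo features are additive by design, so a guard like this can surprise downstream crates whose dependency graph enables both features; that trade-off is worth weighing before adopting it.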
Critical Rules
- Always use format detection before routing to extractors (prevent confusion attacks)
- Stream-based parsing for PDFs/archives to handle multi-GB files
- Post-pipeline is mandatory: All extraction results flow through run_pipeline() for validators/hooks
- Plugin overrides are order-dependent: Plugins registered first take priority
- Fallback timeouts: Set reasonable OCR/archive extraction timeouts (config-driven)
- Metadata preservation: Include format detection confidence, extraction method used, any fallbacks applied
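The timeout and metadata rules can be combined in a short sketch using only the standard library. `ExtractionMetadata`, `run_with_timeout`, and `ocr_or_skip` are hypothetical names, and the thread-plus-channel approach is one possible technique, not necessarily what the crate does. One caveat is visible in the code: on timeout the worker thread is not cancelled, it is merely abandoned.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical metadata record noting which fallbacks were attempted.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ExtractionMetadata {
    pub method: String,
    pub fallbacks_applied: Vec<String>,
}

/// Run `work` on a helper thread and wait at most `timeout` for its result.
/// On timeout the thread keeps running detached; its late result is dropped.
pub fn run_with_timeout<T, F>(timeout: Duration, work: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // Sending fails only if the receiver already gave up; ignore that case.
        let _ = tx.send(work());
    });
    rx.recv_timeout(timeout).ok()
}

/// Attempt OCR under a time budget and record the outcome in the metadata.
pub fn ocr_or_skip<F>(timeout: Duration, ocr: F, meta: &mut ExtractionMetadata) -> Option<String>
where
    F: FnOnce() -> String + Send + 'static,
{
    let text = run_with_timeout(timeout, ocr);
    let label = if text.is_some() { "ocr" } else { "ocr-timeout" };
    meta.fallbacks_applied.push(label.to_string());
    text
}
```

Recording the timeout in `fallbacks_applied` rather than returning an error keeps the partial result usable while still telling the caller what was skipped.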
Related Skills
- ocr-backend-management - OCR engine selection and image preprocessing
- chunking-embeddings - Post-extraction text splitting with FastEmbed
- api-server-mcp - Axum endpoint for extraction pipeline exposure and MCP server