Kreuzberg kreuzberg
git clone https://github.com/kreuzberg-dev/kreuzberg
T=$(mktemp -d) && git clone --depth=1 https://github.com/kreuzberg-dev/kreuzberg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/kreuzberg" ~/.claude/skills/kreuzberg-dev-kreuzberg-kreuzberg && rm -rf "$T"
skills/kreuzberg/SKILL.md
Kreuzberg Document Extraction
Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.
Use this skill when writing code that:
- Extracts text or metadata from documents
- Performs OCR on scanned documents or images
- Batch-processes multiple files
- Configures extraction options (output format, chunking, OCR, language detection)
- Implements custom plugins (post-processors, validators, OCR backends)
Installation
Python
pip install kreuzberg

# Optional OCR backends:
pip install kreuzberg[easyocr]  # EasyOCR
Node.js
npm install @kreuzberg/node
Rust
# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
# embeddings, language-detection, keywords-yake, keywords-rake
CLI
# Download from GitHub releases, or:
cargo install kreuzberg-cli
Quick Start
Python (Async)
import asyncio
from kreuzberg import extract_file

async def main():
    result = await extract_file("document.pdf")
    print(result.content)   # extracted text
    print(result.metadata)  # document metadata
    print(result.tables)    # extracted tables

asyncio.run(main())
Python (Sync)
from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)
Node.js
import { extractFile } from '@kreuzberg/node';

const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);
Node.js (Sync)
import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('document.pdf');
Rust (Async)
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
Rust (Sync): requires the tokio-runtime feature
use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
CLI
kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown
Configuration
All languages use the same configuration structure with language-appropriate naming conventions.
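The naming relationship can be illustrated with a small sketch. It assumes the Node.js keys are a mechanical camelCase rendering of the snake_case keys, which holds for the fields shown in this document (pdf_options/pdfOptions, max_chars/maxChars, output_format/outputFormat). `to_camel` and `camelize_keys` are hypothetical helpers written for this illustration, not part of Kreuzberg.

```python
def to_camel(name: str) -> str:
    """Convert a snake_case key such as 'max_chars' to camelCase ('maxChars')."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camelize_keys(value):
    """Recursively rename dict keys to camelCase; other values pass through."""
    if isinstance(value, dict):
        return {to_camel(k): camelize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [camelize_keys(item) for item in value]
    return value


# Plain dict using the snake_case field names from the Python and TOML examples.
python_style = {
    "ocr": {"backend": "tesseract", "language": "eng"},
    "pdf_options": {"passwords": ["secret123"]},
    "chunking": {"max_chars": 1000, "max_overlap": 200},
    "output_format": "markdown",
}
node_style = camelize_keys(python_style)
print(node_style)
```

The Rust examples in this document use different chunking field names, so a purely mechanical mapping like this one should not be relied on there.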
Python (snake_case)
from kreuzberg import (
    extract_file,
    ExtractionConfig,
    OcrConfig,
    TesseractConfig,
    PdfConfig,
    ChunkingConfig,
)

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    ),
    pdf_options=PdfConfig(passwords=["secret123"]),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    output_format="markdown",
)

# Run inside an async function
result = await extract_file("document.pdf", config=config)
Node.js (camelCase)
import { extractFile, type ExtractionConfig } from '@kreuzberg/node';

const config: ExtractionConfig = {
  ocr: { backend: 'tesseract', language: 'eng' },
  pdfOptions: { passwords: ['secret123'] },
  chunking: { maxChars: 1000, maxOverlap: 200 },
  outputFormat: 'markdown',
};

const result = await extractFile('document.pdf', null, config);
Rust (snake_case)
use kreuzberg::{extract_file, ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".into(),
        language: "eng".into(),
        ..Default::default()
    }),
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};

// Run inside an async function that returns kreuzberg::Result
let result = extract_file("document.pdf", None, &config).await?;
Config File (TOML)
output_format = "markdown"

[ocr]
backend = "tesseract"
language = "eng"

[chunking]
max_chars = 1000
max_overlap = 200

[pdf_options]
passwords = ["secret123"]
# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf

# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
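When the inline JSON is generated by a script rather than typed by hand, quoting is the usual source of mistakes. The sketch below builds the argument list for the `--config-json` form shown above using only the standard library. `build_extract_command` is a hypothetical helper, and the sketch only constructs the command; it does not run the CLI.

```python
import json
import shlex


def build_extract_command(path: str, config: dict) -> list:
    """Return argv for `kreuzberg extract` with an inline JSON config."""
    # separators=(",", ":") yields compact JSON without spaces.
    config_json = json.dumps(config, separators=(",", ":"))
    return ["kreuzberg", "extract", path, "--config-json", config_json]


argv = build_extract_command(
    "doc.pdf", {"ocr": {"backend": "tesseract", "language": "deu"}}
)
# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
```

Passing `argv` as a list to `subprocess.run` avoids shell quoting entirely, which is usually the safer choice.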
Batch Processing
Python
from kreuzberg import batch_extract_files, batch_extract_files_sync

# Async (inside an async function)
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])

# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])

for result in results:
    print(f"{len(result.content)} chars extracted")
Node.js
import { batchExtractFiles } from '@kreuzberg/node';

const results = await batchExtractFiles(['doc1.pdf', 'doc2.docx']);
Rust: requires the tokio-runtime feature
use kreuzberg::{batch_extract_file, ExtractionConfig};

let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;
CLI
kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown
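For very long file lists it can be convenient to hand the batch function a bounded slice at a time, for example to report progress between slices. Whether this is needed depends on the workload; the following is a minimal sketch, not a Kreuzberg feature. `extract_in_batches` is a hypothetical helper, and `fake_extract` stands in for `batch_extract_files_sync` so the example runs without Kreuzberg installed.

```python
from typing import Callable, List, Sequence


def extract_in_batches(
    paths: Sequence[str],
    batch_size: int,
    extract_batch: Callable[[List[str]], list],
) -> list:
    """Call `extract_batch` on consecutive slices of `paths` and merge the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    results = []
    for start in range(0, len(paths), batch_size):
        results.extend(extract_batch(list(paths[start:start + batch_size])))
    return results


# Stand-in for batch_extract_files_sync (assumption: one result per input path).
def fake_extract(batch):
    return [f"extracted:{p}" for p in batch]


print(extract_in_batches(["a.pdf", "b.docx", "c.xlsx"], 2, fake_extract))
```

With Kreuzberg installed, `batch_extract_files_sync` could be passed as `extract_batch` in place of the stand-in.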
OCR
OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).
Backends
- Tesseract (default): Built-in native binding. All Tesseract languages supported.
- EasyOCR (Python only): pip install kreuzberg[easyocr]. Pass easyocr_kwargs={"gpu": True}.
- PaddleOCR (Python only): Bundled since 4.8.5, no extra install needed. Pass paddleocr_kwargs={"use_angle_cls": True}.
- Guten (Node.js only): Built-in OCR backend via GutenOcrBackend.
Language Codes
config = ExtractionConfig(ocr=OcrConfig(language="eng"))      # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))  # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all"))      # All installed
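The multi-language form is just the individual codes joined with `+`. A minimal sketch of building that string from a list follows; `ocr_language` is a hypothetical helper, and it does not check that the codes are installed.

```python
def ocr_language(*codes: str) -> str:
    """Join language codes with '+', e.g. ('eng', 'deu') -> 'eng+deu'."""
    cleaned = [c.strip() for c in codes if c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    # dict.fromkeys drops duplicates while keeping the caller's order.
    return "+".join(dict.fromkeys(cleaned))


print(ocr_language("eng", "deu"))
```

The result can be passed as the `language` value of `OcrConfig`, as in the lines above.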
Force OCR
config = ExtractionConfig(force_ocr=True) # OCR even if text is extractable
ExtractionResult Fields
| Field | Python | Node.js | Rust | Description |
|---|---|---|---|---|
| Text content | | | | Extracted text (str/String) |
| MIME type | | | | Input document MIME type |
| Metadata | | | | Document metadata (dict/object/HashMap) |
| Tables | | | | Extracted tables with cells + markdown |
| Languages | | | | Detected languages (if enabled) |
| Chunks | | | | Text chunks (if chunking enabled) |
| Images | | | | Extracted images (if enabled) |
| Elements | | | | Semantic elements (if element_based format) |
| Pages | | | | Per-page content (if page extraction enabled) |
| Keywords | | | | Extracted keywords (if enabled) |
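Several of these fields are only populated when the matching feature is enabled. This document does not say whether a disabled field is empty or absent, so the sketch below treats both cases the same way. It reads only `content`, `metadata`, and `tables`, the attribute names shown in the Quick Start. `summarize` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in for a real result.

```python
from types import SimpleNamespace


def summarize(result) -> dict:
    """Collect basic counts from a result without assuming optional fields are set."""
    content = getattr(result, "content", "") or ""
    tables = getattr(result, "tables", None) or []
    metadata = getattr(result, "metadata", None) or {}
    return {
        "chars": len(content),
        "tables": len(tables),
        "metadata_keys": sorted(metadata),
    }


# Stand-in object; a real result would come from extract_file_sync().
fake = SimpleNamespace(content="Hello", tables=None, metadata={"title": "Demo"})
print(summarize(fake))
```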
Error Handling
Python
from kreuzberg import (
    extract_file_sync,
    KreuzbergError,
    ParsingError,
    OCRError,
    ValidationError,
    MissingDependencyError,
)

try:
    result = extract_file_sync("file.pdf")
except ParsingError as e:
    print(f"Failed to parse: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")
Node.js
import {
  extractFile,
  KreuzbergError,
  ParsingError,
  OcrError,
  ValidationError,
  MissingDependencyError,
} from '@kreuzberg/node';

try {
  const result = await extractFile('file.pdf');
} catch (e) {
  if (e instanceof ParsingError) { /* ... */ }
  else if (e instanceof OcrError) { /* ... */ }
  else if (e instanceof ValidationError) { /* ... */ }
  else if (e instanceof KreuzbergError) { /* ... */ }
}
Rust
use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};

let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
    Ok(result) => println!("{}", result.content),
    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
    Err(e) => eprintln!("Error: {e}"),
}
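In the Python and Node.js handlers above, KreuzbergError is checked last, which suggests it is the common base class of the more specific errors. That hierarchy is an assumption here, not something this document states. The sketch below uses stand-in classes with the same names to show why the order matters: a base-class check placed first would swallow every specific error.

```python
# Stand-in classes mirroring the names above; the real ones live in `kreuzberg`.
# Assumption: the specific errors derive from KreuzbergError.
class KreuzbergError(Exception):
    pass


class ParsingError(KreuzbergError):
    pass


class OCRError(KreuzbergError):
    pass


def describe(exc: Exception) -> str:
    """Map an exception to a label, checking specific types before the base."""
    handlers = [
        (ParsingError, "parse"),
        (OCRError, "ocr"),
        (KreuzbergError, "other-kreuzberg"),  # base class must come last
    ]
    for exc_type, label in handlers:
        if isinstance(exc, exc_type):
            return label
    return "unrelated"


print(describe(OCRError("no text layer")))
```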
Common Pitfalls
- Python ChunkingConfig fields: Use max_chars and max_overlap, NOT max_characters or overlap.
- Rust extract_file signature: Third argument is &ExtractionConfig (a reference), not Option. Use &ExtractionConfig::default() for defaults.
- Rust feature gates: extract_file_sync, batch_extract_file, and batch_extract_file_sync all require features = ["tokio-runtime"] in Cargo.toml.
- Rust async context: extract_file is async. Use #[tokio::main] or call from an async context.
- CLI --format vs --output-format: --format controls CLI output (text/json). --output-format controls content format (plain/markdown/djot/html).
- Node.js extractFile signature: extractFile(path, mimeType?, config?). mimeType is the second arg (pass null to skip).
- Python detect_mime_type: The function for detecting from bytes is detect_mime_type(data). For paths use detect_mime_type_from_path(path).
- Config file field names: Use snake_case in TOML/YAML/JSON config files (e.g., max_chars, max_overlap, pdf_options).
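The first pitfall is easy to hit when porting a Rust configuration to Python or to a config file. A minimal sketch of a key check that renames the Rust-style chunking fields is shown below. `RENAMES` and `fix_chunking_keys` are hypothetical names for this illustration and operate on a plain dict, not on Kreuzberg's own config classes.

```python
# Rust-style name -> Python/config-file name, per the first pitfall above.
RENAMES = {"max_characters": "max_chars", "overlap": "max_overlap"}


def fix_chunking_keys(chunking: dict) -> dict:
    """Return a copy of `chunking` with Rust-style keys renamed for Python/TOML."""
    fixed = {}
    for key, value in chunking.items():
        new_key = RENAMES.get(key, key)
        if new_key in fixed:
            # Both spellings present: refuse to guess which value wins.
            raise ValueError(f"both spellings of {new_key!r} were given")
        fixed[new_key] = value
    return fixed


print(fix_chunking_keys({"max_characters": 1000, "overlap": 200}))
```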
Supported Formats (Summary)
| Category | Extensions |
|---|---|
| |
| Word | , |
| Spreadsheets | , , , , , , , |
| Presentations | , , |
| eBooks | , |
| Images | , , , , , , , , , , , , , , , , , , |
| Markup | , , , |
| Data | , , , , , |
| Text | , , , , , , |
, | |
| Archives | , , , , |
| Academic | , , , , , , , , , , , , , , , |
See references/supported-formats.md for the complete format reference with MIME types.
Additional Resources
Detailed reference files for specific topics:
- Python API Reference — All functions, config classes, plugin protocols, exact signatures
- Node.js API Reference — All functions, TypeScript interfaces, worker pool APIs
- Rust API Reference — All functions with feature gates, structs, Cargo.toml examples
- CLI Reference — All commands, flags, config precedence, exit codes
- Configuration Reference — TOML/YAML/JSON formats, auto-discovery, env vars, full schema
- Supported Formats — All 85+ formats with file extensions and MIME types
- Advanced Features — Plugins, embeddings, MCP server, API server, security limits
- Other Language Bindings — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker
Full documentation: https://docs.kreuzberg.dev
GitHub: https://github.com/kreuzberg-dev/kreuzberg