Kreuzberg kreuzberg

Name: kreuzberg
Author: kreuzberg-dev

install

source · Clone the upstream repo

git clone https://github.com/kreuzberg-dev/kreuzberg

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/kreuzberg-dev/kreuzberg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/kreuzberg" ~/.claude/skills/kreuzberg-dev-kreuzberg-kreuzberg && rm -rf "$T"

manifest: skills/kreuzberg/SKILL.md

source content

Kreuzberg Document Extraction

Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.

Use this skill when writing code that:

Extracts text or metadata from documents
Performs OCR on scanned documents or images
Batch-processes multiple files
Configures extraction options (output format, chunking, OCR, language detection)
Implements custom plugins (post-processors, validators, OCR backends)

Installation

Python

pip install kreuzberg
# Optional OCR backends:
pip install kreuzberg[easyocr]    # EasyOCR

Node.js

npm install @kreuzberg/node

Rust

# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
#           embeddings, language-detection, keywords-yake, keywords-rake

CLI

# Download from GitHub releases, or:
cargo install kreuzberg-cli

Quick Start

Python (Async)

from kreuzberg import extract_file

result = await extract_file("document.pdf")
print(result.content)       # extracted text
print(result.metadata)      # document metadata
print(result.tables)        # extracted tables

Python (Sync)

from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)

Node.js

import { extractFile } from '@kreuzberg/node';

const result = await extractFile('document.pdf');
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);

Node.js (Sync)

import { extractFileSync } from '@kreuzberg/node';

const result = extractFileSync('document.pdf');

Rust (Async)

use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}

Rust (Sync) — requires

tokio-runtime

feature

use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}

CLI

kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown

Configuration

All languages use the same configuration structure with language-appropriate naming conventions.

Python (snake_case)

from kreuzberg import (
    ExtractionConfig, OcrConfig, TesseractConfig,
    PdfConfig, ChunkingConfig,
)

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    ),
    pdf_options=PdfConfig(passwords=["secret123"]),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    output_format="markdown",
)

result = await extract_file("document.pdf", config=config)

Node.js (camelCase)

import { extractFile, type ExtractionConfig } from '@kreuzberg/node';

const config: ExtractionConfig = {
    ocr: { backend: 'tesseract', language: 'eng' },
    pdfOptions: { passwords: ['secret123'] },
    chunking: { maxChars: 1000, maxOverlap: 200 },
    outputFormat: 'markdown',
};

const result = await extractFile('document.pdf', null, config);

Rust (snake_case)

use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".into(),
        language: "eng".into(),
        ..Default::default()
    }),
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};

let result = extract_file("document.pdf", None, &config).await?;

Config File (TOML)

output_format = "markdown"

[ocr]
backend = "tesseract"
language = "eng"

[chunking]
max_chars = 1000
max_overlap = 200

[pdf_options]
passwords = ["secret123"]

# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'

Batch Processing

Python

from kreuzberg import batch_extract_files, batch_extract_files_sync

# Async
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])

# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])

for result in results:
    print(f"{len(result.content)} chars extracted")

Node.js

import { batchExtractFiles } from '@kreuzberg/node';

const results = await batchExtractFiles(['doc1.pdf', 'doc2.docx']);

Rust — requires

tokio-runtime

feature

use kreuzberg::{batch_extract_file, ExtractionConfig};

let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;

CLI

kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown

OCR

OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).

Backends

Tesseract (default): Built-in native binding. All Tesseract languages supported.

EasyOCR (Python only):

pip install kreuzberg[easyocr]

. Pass

easyocr_kwargs={"gpu": True}

PaddleOCR (Python only): Bundled since 4.8.5, no extra install needed. Pass
```
paddleocr_kwargs={"use_angle_cls": True}
```
.
Guten (Node.js only): Built-in OCR backend via
```
GutenOcrBackend
```
.

Language Codes

config = ExtractionConfig(ocr=OcrConfig(language="eng"))       # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))   # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all"))       # All installed

Force OCR

config = ExtractionConfig(force_ocr=True)  # OCR even if text is extractable

ExtractionResult Fields

Field	Python	Node.js	Rust	Description
Text content	`result.content`	`result.content`	`result.content`	Extracted text (str/String)
MIME type	`result.mime_type`	`result.mimeType`	`result.mime_type`	Input document MIME type
Metadata	`result.metadata`	`result.metadata`	`result.metadata`	Document metadata (dict/object/HashMap)
Tables	`result.tables`	`result.tables`	`result.tables`	Extracted tables with cells + markdown
Languages	`result.detected_languages`	`result.detectedLanguages`	`result.detected_languages`	Detected languages (if enabled)
Chunks	`result.chunks`	`result.chunks`	`result.chunks`	Text chunks (if chunking enabled)
Images	`result.images`	`result.images`	`result.images`	Extracted images (if enabled)
Elements	`result.elements`	`result.elements`	`result.elements`	Semantic elements (if element_based format)
Pages	`result.pages`	`result.pages`	`result.pages`	Per-page content (if page extraction enabled)
Keywords	`result.keywords`	`result.keywords`	`result.keywords`	Extracted keywords (if enabled)

Error Handling

Python

from kreuzberg import (
    extract_file_sync, KreuzbergError, ParsingError,
    OCRError, ValidationError, MissingDependencyError,
)

try:
    result = extract_file_sync("file.pdf")
except ParsingError as e:
    print(f"Failed to parse: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")

Node.js

import {
    extractFile, KreuzbergError, ParsingError,
    OcrError, ValidationError, MissingDependencyError,
} from '@kreuzberg/node';

try {
    const result = await extractFile('file.pdf');
} catch (e) {
    if (e instanceof ParsingError) { /* ... */ }
    else if (e instanceof OcrError) { /* ... */ }
    else if (e instanceof ValidationError) { /* ... */ }
    else if (e instanceof KreuzbergError) { /* ... */ }
}

Rust

use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};

let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
    Ok(result) => println!("{}", result.content),
    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
    Err(e) => eprintln!("Error: {e}"),
}

Common Pitfalls

Python ChunkingConfig fields: Use
```
max_chars
```
and
```
max_overlap
```
, NOT
```
max_characters
```
or
```
overlap
```
.
Rust extract_file signature: Third argument is
```
&ExtractionConfig
```
(a reference), not
```
Option
```
. Use
```
&ExtractionConfig::default()
```
for defaults.

Rust feature gates:

extract_file_sync

batch_extract_file

, and

batch_extract_file_sync

all require

features = ["tokio-runtime"]

in Cargo.toml.

Rust async context:
```
extract_file
```
is async. Use
```
#[tokio::main]
```
or call from an async context.
CLI --format vs --output-format:
```
--format
```
controls CLI output (text/json).
```
--output-format
```
controls content format (plain/markdown/djot/html).
Node.js extractFile signature:
```
extractFile(path, mimeType?, config?)
```
— mimeType is the second arg (pass
```
null
```
to skip).
Python detect_mime_type: The function for detecting from bytes is
```
detect_mime_type(data)
```
. For paths use
```
detect_mime_type_from_path(path)
```
.
Config file field names: Use snake_case in TOML/YAML/JSON config files (e.g.,
```
max_chars
```
,
```
max_overlap
```
,
```
pdf_options
```
).

Supported Formats (Summary)

Category	Extensions
PDF	`.pdf`
Word	`.docx` , `.odt`
Spreadsheets	`.xlsx` , `.xlsm` , `.xlsb` , `.xls` , `.xla` , `.xlam` , `.xltm` , `.ods`
Presentations	`.pptx` , `.ppt` , `.ppsx`
eBooks	`.epub` , `.fb2`
Images	`.png` , `.jpg` , `.jpeg` , `.gif` , `.webp` , `.bmp` , `.tiff` , `.tif` , `.jp2` , `.jpx` , `.jpm` , `.mj2` , `.jbig2` , `.jb2` , `.pnm` , `.pbm` , `.pgm` , `.ppm` , `.svg`
Markup	`.html` , `.htm` , `.xhtml` , `.xml`
Data	`.json` , `.yaml` , `.yml` , `.toml` , `.csv` , `.tsv`
Text	`.txt` , `.md` , `.markdown` , `.djot` , `.rst` , `.org` , `.rtf`
Email	`.eml` , `.msg`
Archives	`.zip` , `.tar` , `.tgz` , `.gz` , `.7z`
Academic	`.bib` , `.biblatex` , `.ris` , `.nbib` , `.enw` , `.csl` , `.tex` , `.latex` , `.typ` , `.jats` , `.ipynb` , `.docbook` , `.opml` , `.pod` , `.mdoc` , `.troff`

See references/supported-formats.md for the complete format reference with MIME types.

Additional Resources

Detailed reference files for specific topics:

Python API Reference — All functions, config classes, plugin protocols, exact signatures
Node.js API Reference — All functions, TypeScript interfaces, worker pool APIs
Rust API Reference — All functions with feature gates, structs, Cargo.toml examples
CLI Reference — All commands, flags, config precedence, exit codes
Configuration Reference — TOML/YAML/JSON formats, auto-discovery, env vars, full schema
Supported Formats — All 85+ formats with file extensions and MIME types
Advanced Features — Plugins, embeddings, MCP server, API server, security limits
Other Language Bindings — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker

Full documentation: https://docs.kreuzberg.dev GitHub: https://github.com/kreuzberg-dev/kreuzberg

Kreuzberg kreuzberg

Kreuzberg Document Extraction

Installation

Python

Node.js

Rust

CLI

Quick Start

Python (Async)

Python (Sync)

Node.js

Node.js (Sync)

Rust (Async)

Rust (Sync) — requires
`tokio-runtime`
feature

CLI

Configuration

Python (snake_case)

Node.js (camelCase)

Rust (snake_case)

Config File (TOML)

Batch Processing

Python

Node.js

Rust — requires
`tokio-runtime`
feature

CLI

OCR

Backends

Language Codes

Force OCR

ExtractionResult Fields

Error Handling

Python

Node.js

Rust

Common Pitfalls

Supported Formats (Summary)

Additional Resources