install
source · Clone the upstream repo
git clone https://github.com/kreuzberg-dev/kreuzberg
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/kreuzberg-dev/kreuzberg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.ai-rulez/skills/format-specific-extraction" ~/.claude/skills/kreuzberg-dev-kreuzberg-format-specific-extraction && rm -rf "$T"
manifest: .ai-rulez/skills/format-specific-extraction/SKILL.md
priority: high
Format-Specific Extraction Workflows
Office XML (DOCX/PPTX/ODT)
ZIP archive → Security validation → XML parsing → Text + tables + metadata
- ZipBombValidator::new(limits).validate(&mut archive)?
- Extract XML files from archive (word/document.xml, ppt/slides/*.xml, content.xml)
- Parse with quick-xml::Reader (streaming) + DepthValidator + StringGrowthValidator
- Extract metadata via crate::extraction::office_metadata::extract_metadata()
- See: extractors/docx.rs, extractors/pptx.rs, extractors/odt.rs
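The depth and string-growth checks applied during streaming parsing can be sketched roughly as follows. This is a minimal, std-only illustration: `StreamGuard`, `Limits`, `Event`, and `SecurityError` are hypothetical names invented for the example and are not the actual `DepthValidator` / `StringGrowthValidator` API, whose signatures are not shown in this document.

```rust
/// Hypothetical limits; the real validators may be configured differently.
#[derive(Debug, Clone, Copy)]
pub struct Limits {
    pub max_depth: usize,
    pub max_text_bytes: usize,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SecurityError {
    DepthExceeded(usize),
    TextTooLarge(usize),
    UnbalancedEnd,
}

/// Simplified stand-in for the events a streaming XML reader yields.
pub enum Event<'a> {
    Start,
    End,
    Text(&'a str),
}

/// Tracks nesting depth and accumulated text size while events stream by.
pub struct StreamGuard {
    limits: Limits,
    depth: usize,
    text_bytes: usize,
}

impl StreamGuard {
    pub fn new(limits: Limits) -> Self {
        Self { limits, depth: 0, text_bytes: 0 }
    }

    /// Call once per event; fails as soon as a limit is crossed.
    pub fn observe(&mut self, event: &Event<'_>) -> Result<(), SecurityError> {
        match event {
            Event::Start => {
                self.depth += 1;
                if self.depth > self.limits.max_depth {
                    return Err(SecurityError::DepthExceeded(self.depth));
                }
            }
            Event::End => {
                // A closing tag without a matching opening tag is malformed input.
                self.depth = self
                    .depth
                    .checked_sub(1)
                    .ok_or(SecurityError::UnbalancedEnd)?;
            }
            Event::Text(text) => {
                self.text_bytes = self.text_bytes.saturating_add(text.len());
                if self.text_bytes > self.limits.max_text_bytes {
                    return Err(SecurityError::TextTooLarge(self.text_bytes));
                }
            }
        }
        Ok(())
    }
}
```

Checking on every event, rather than after parsing, means a hostile document is rejected before its full text is buffered.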
PDF
Bytes → pdfium-render → Per-page text + OCR fallback → Tables → Metadata
- pdfium.create_document_from_bytes(content, None)?
- Check if needs OCR: config.force_ocr || !has_searchable_text()
- Extract text per page, tables if config.pages enabled
- Feature-gated: #[cfg(feature = "pdf")]
- See: extractors/pdf/mod.rs
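The OCR decision above reduces to a small predicate. The sketch below assumes a hypothetical `PdfConfig` struct and a `has_searchable_text` helper that receives the already-extracted per-page text; the "any page has non-whitespace text" heuristic is an assumption made for illustration, not necessarily how the real check works.

```rust
/// Hypothetical subset of the extraction config; only the flag used here.
pub struct PdfConfig {
    pub force_ocr: bool,
}

/// Assumed heuristic: the document counts as searchable if any page
/// yielded non-whitespace text from the PDF text layer.
pub fn has_searchable_text(page_texts: &[String]) -> bool {
    page_texts.iter().any(|text| !text.trim().is_empty())
}

/// Mirrors the condition `config.force_ocr || !has_searchable_text()`.
pub fn needs_ocr(config: &PdfConfig, page_texts: &[String]) -> bool {
    config.force_ocr || !has_searchable_text(page_texts)
}
```

With this shape, a scanned PDF whose pages yield only whitespace falls through to OCR, while `force_ocr` overrides the text layer even when one exists.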
Archives (ZIP/TAR/7z/GZIP)
Validate → Extract metadata → Extract plaintext files only
- ZipBombValidator BEFORE any extraction
- Extract metadata (file list, sizes)
- Extract text content from plaintext files
- Use build_archive_result() helper
- See: extractors/archive.rs, extraction/archive/*.rs
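To illustrate why validation runs before any extraction, here is a rough sketch of a pre-extraction check over entry metadata. `EntryInfo`, `ArchiveLimits`, `ArchiveError`, and `validate_entries` are hypothetical names for this example; the actual `ZipBombValidator` limits and checks are not documented here and may differ.

```rust
/// Metadata for one archive entry, read from the archive index
/// without decompressing anything.
#[derive(Debug, Clone)]
pub struct EntryInfo {
    pub name: String,
    pub compressed_size: u64,
    pub uncompressed_size: u64,
}

/// Hypothetical limits for this sketch.
#[derive(Debug, Clone, Copy)]
pub struct ArchiveLimits {
    pub max_entries: usize,
    pub max_total_uncompressed: u64,
    pub max_ratio: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ArchiveError {
    TooManyEntries(usize),
    TotalSizeExceeded(u64),
    SuspiciousRatio { name: String, ratio: u64 },
}

/// Reject archives whose declared sizes already look like a bomb.
/// Declared sizes can be forged, so a production validator would also
/// need to count bytes while decompressing.
pub fn validate_entries(
    entries: &[EntryInfo],
    limits: &ArchiveLimits,
) -> Result<(), ArchiveError> {
    if entries.len() > limits.max_entries {
        return Err(ArchiveError::TooManyEntries(entries.len()));
    }
    let mut total: u64 = 0;
    for entry in entries {
        // `max(1)` avoids division by zero for empty entries.
        let ratio = entry.uncompressed_size / entry.compressed_size.max(1);
        if ratio > limits.max_ratio {
            return Err(ArchiveError::SuspiciousRatio {
                name: entry.name.clone(),
                ratio,
            });
        }
        total = total.saturating_add(entry.uncompressed_size);
        if total > limits.max_total_uncompressed {
            return Err(ArchiveError::TotalSizeExceeded(total));
        }
    }
    Ok(())
}
```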
Structured Text (JSON/YAML/TOML/XML)
Detect format from MIME → Parse → Pretty-print → Metadata
A single StructuredExtractor handles multiple MIME types. It parses the input with a format-specific library and pretty-prints the result to text.
See: extractors/structured.rs
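The "detect format from MIME" step could look roughly like the dispatch below. `StructuredFormat` and `detect_format` are hypothetical names, and the list of accepted MIME strings is an assumption for illustration; the set actually registered by StructuredExtractor is not listed in this document.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StructuredFormat {
    Json,
    Yaml,
    Toml,
    Xml,
}

/// Map a MIME type to the parser that should handle it.
/// The accepted strings below are assumed, not taken from the real extractor.
pub fn detect_format(mime: &str) -> Option<StructuredFormat> {
    // Drop parameters such as "; charset=utf-8" and normalise case.
    let essence = mime
        .split(';')
        .next()
        .unwrap_or("")
        .trim()
        .to_ascii_lowercase();

    match essence.as_str() {
        "application/json" | "text/json" => Some(StructuredFormat::Json),
        "application/yaml" | "application/x-yaml" | "text/yaml" | "text/x-yaml" => {
            Some(StructuredFormat::Yaml)
        }
        "application/toml" | "text/toml" => Some(StructuredFormat::Toml),
        "application/xml" | "text/xml" => Some(StructuredFormat::Xml),
        _ => None,
    }
}
```

Each variant would then select the format-specific parser before the shared pretty-print and metadata steps.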
Email (EML/MSG)
Parse headers → Extract body (text/html) → Process attachments
See: extraction/email.rs, extractors/email.rs
Common Helpers
| Helper | Location | Purpose |
|---|---|---|
| extract_metadata() | crate::extraction::office_metadata | Office XML metadata |
| | | Convert cell grid to GFM table |
| build_archive_result() | | Standard archive result |
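As an illustration of the cell-grid-to-GFM conversion listed in the table, here is a minimal sketch. The function name `cells_to_gfm` and its behavior (first row as header, short rows padded, pipes escaped) are assumptions for the example, not the actual helper.

```rust
/// Render one row, padding missing cells so every row has `cols` columns.
fn render_row(row: &[String], cols: usize) -> String {
    let mut line = String::from("|");
    for i in 0..cols {
        let raw = row.get(i).map(String::as_str).unwrap_or("");
        // Escape pipes and flatten newlines so a cell stays on one table row.
        let cell = raw.replace('|', "\\|").replace('\n', " ");
        line.push(' ');
        line.push_str(cell.trim());
        line.push_str(" |");
    }
    line
}

/// Convert a grid of cells into a GFM table, treating the first row as the header.
pub fn cells_to_gfm(rows: &[Vec<String>]) -> String {
    let cols = rows.iter().map(Vec::len).max().unwrap_or(0);
    if cols == 0 {
        return String::new();
    }
    let mut lines = Vec::with_capacity(rows.len() + 1);
    lines.push(render_row(&rows[0], cols));
    // Delimiter row that GFM requires between header and body.
    lines.push(format!("|{}", "---|".repeat(cols)));
    for row in &rows[1..] {
        lines.push(render_row(row, cols));
    }
    lines.join("\n")
}
```

Padding short rows matters because extracted tables often have ragged rows, and a GFM table whose header and delimiter rows disagree in column count is not rendered as a table.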
Adding a New Format
- Add MIME type to EXT_TO_MIME in core/mime.rs
- Create extractor implementing DocumentExtractor trait
- Set supported_mime_types() and priority() (default: 50)
- Register in extractors/mod.rs → register_default_extractors()
- Feature-gate if optional: #[cfg(feature = "my-format")]
- Apply security validators for user content
- Add tests with fixture files
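The steps above can be sketched with a simplified, std-only stand-in for the trait and registry. Everything below is hypothetical: the real `DocumentExtractor` trait almost certainly has different method signatures, and `Registry`, `MyFormatExtractor`, `name`, and the `application/x-my-format` MIME type are invented for illustration. Only `supported_mime_types()`, `priority()`, and the default of 50 come from the list above; preferring the highest priority is an assumption.

```rust
/// Simplified stand-in for the extractor trait (hypothetical signatures).
pub trait DocumentExtractor {
    fn name(&self) -> &'static str;
    fn supported_mime_types(&self) -> &'static [&'static str];
    /// Default priority of 50, as noted in the steps above.
    fn priority(&self) -> i32 {
        50
    }
    fn extract(&self, content: &[u8]) -> Result<String, String>;
}

/// In a real crate this type would sit behind `#[cfg(feature = "my-format")]`.
pub struct MyFormatExtractor;

impl DocumentExtractor for MyFormatExtractor {
    fn name(&self) -> &'static str {
        "my-format"
    }
    fn supported_mime_types(&self) -> &'static [&'static str] {
        &["application/x-my-format"]
    }
    fn extract(&self, content: &[u8]) -> Result<String, String> {
        // Placeholder logic: treat the payload as UTF-8 text.
        String::from_utf8(content.to_vec()).map_err(|e| e.to_string())
    }
}

/// Hypothetical registry standing in for `register_default_extractors()`.
#[derive(Default)]
pub struct Registry {
    extractors: Vec<Box<dyn DocumentExtractor>>,
}

impl Registry {
    pub fn register(&mut self, extractor: Box<dyn DocumentExtractor>) {
        self.extractors.push(extractor);
    }

    /// Pick the highest-priority extractor that claims the MIME type
    /// (assumed selection rule; the first registered wins on ties).
    pub fn find(&self, mime: &str) -> Option<&dyn DocumentExtractor> {
        let mut best: Option<&dyn DocumentExtractor> = None;
        for boxed in &self.extractors {
            let candidate: &dyn DocumentExtractor = boxed.as_ref();
            if !candidate.supported_mime_types().iter().any(|m| *m == mime) {
                continue;
            }
            if best.map_or(true, |b| candidate.priority() > b.priority()) {
                best = Some(candidate);
            }
        }
        best
    }
}
```

A fixture-based test would then register the extractor, look it up by MIME type, and compare `extract()` output against the expected text for each fixture file.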