Claude-skill-registry extractor

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/extractor" ~/.claude/skills/majiayu000-claude-skill-registry-extractor && rm -rf "$T"
manifest: skills/data/extractor/SKILL.md
source content

Extractor

Self-correcting agentic document extraction using a Preset-First Methodology. Auto-detects document type and applies calibrated extraction settings.

Quick Start

# Auto mode (recommended) - detects document type automatically
.pi/skills/extractor/run.sh paper.pdf

# Specify output directory
.pi/skills/extractor/run.sh paper.pdf --out ./results

# Get markdown output directly
.pi/skills/extractor/run.sh paper.pdf --markdown

# OCR scanned PDFs (lazy-loads OCRmyPDF docker image if needed)
.pi/skills/extractor/run.sh scanned.pdf --auto-ocr

Extraction Modes

Mode      Flag        Description
Auto      (default)   Profile detector picks best settings
Fast      --fast      PyMuPDF only, no ML/LLM (fastest)
Accurate  --accurate  Full pipeline with LLM enhancements
Offline   --offline   Deterministic, no network calls

# Fast mode - quick extraction, no LLM
.pi/skills/extractor/run.sh report.pdf --fast

# Accurate mode - full pipeline with LLM for tables/math
.pi/skills/extractor/run.sh paper.pdf --accurate

# Offline smoke test (deterministic)
.pi/skills/extractor/run.sh doc.pdf --offline

Collaboration Flow

For PDFs without --preset, the skill runs an intelligent collaboration flow:

  1. Profile Detection: Analyzes document (layout, tables, formulas, requirements)
  2. High Confidence Match: If confidence >= 8, auto-extracts with detected preset
  3. Low Confidence / Unknown:
    • Interactive (TTY): Prompts user to select preset
    • Non-interactive: Uses auto mode with warning
# See what the detector finds (no extraction)
.pi/skills/extractor/run.sh paper.pdf --profile-only

# Output:
# {
#   "preset": "arxiv",
#   "confidence": 12,
#   "tables": true,
#   "figures": true,
#   "formulas": true,
#   "recommended_mode": "accurate"
# }

# Interactive prompt (in terminal)
.pi/skills/extractor/run.sh unknown_paper.pdf
# Analyzing: unknown_paper.pdf
# Detected: multi-column layout, 12 pages
# Contains: tables, figures, formulas
#
# Select extraction preset:
#   [1] arxiv - Academic papers [RECOMMENDED]
#   [2] requirements_spec - Engineering specs
#   [3] auto - Let pipeline decide
#   [4] fast - Quick extraction, no LLM
# Enter choice [1-4]:

# Non-interactive (batch/CI) - auto-selects
echo | .pi/skills/extractor/run.sh paper.pdf --no-interactive

Preset Selection

The pipeline auto-detects document type via s00_profile_detector:

Preset             Detected When                                            Confidence Points
arxiv              Academic papers (2-column, math, "Abstract/References")  +5 filename, +4 sections, +3 layout
requirements_spec  Engineering specs (REQ-xxx, "Shall", nested sections)    +5 filename, +4 REQ pattern
auto               Unknown documents                                        Fallback when confidence < 8

# Force a specific preset (skip detection)
.pi/skills/extractor/run.sh paper.pdf --preset arxiv
.pi/skills/extractor/run.sh spec.pdf --preset requirements_spec

# Let collaboration flow decide
.pi/skills/extractor/run.sh paper.pdf
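
In scripts, the two steps can be chained: detect first with --profile-only, then extract with the detected preset. A minimal sketch; detected_preset is an illustrative helper (not part of the skill) that assumes the profile JSON shape shown earlier, and parses with python3 rather than jq:

```shell
# detected_preset: illustrative helper that pulls "preset" out of the
# --profile-only JSON; uses python3 so there is no jq dependency.
detected_preset() {
  printf '%s' "$1" | python3 -c 'import json, sys; print(json.load(sys.stdin)["preset"])'
}

# Sketch of a scripted flow:
#   profile=$(.pi/skills/extractor/run.sh paper.pdf --profile-only)
#   .pi/skills/extractor/run.sh paper.pdf --preset "$(detected_preset "$profile")" --no-interactive
```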

Output Options

# JSON output (default) - full structured data
.pi/skills/extractor/run.sh doc.pdf --json

# Markdown output - human-readable text
.pi/skills/extractor/run.sh doc.pdf --markdown

# Sections only (skip tables/figures)
.pi/skills/extractor/run.sh doc.pdf --sections-only

Supported Formats

Cross-format parity measured against HTML reference (2026-01-17):

Format    Method                    Parity     Notes
Markdown  Direct parse              100%       Perfect structural match
DOCX      Native XML (python-docx)  100%       Perfect structural match
HTML      BeautifulSoup             Reference  Baseline for comparison
XML       defusedxml                90%        Structure preserved, markdown differs
PDF       14-stage pipeline         87%        Varies by document complexity
RST       docutils                  85%        Section structure varies
EPUB      ebooklib                  82%        Chapter structure varies
PPTX      python-pptx               81%        Slide-based structure
XLSX      openpyxl                  16%        Expected (spreadsheet format)
Images    OCR/VLM                   16%        Requires VLM for text extraction

Pipeline Stages

The full pipeline runs 14+ stages:

00_profile_detector     Detect document type, select preset
01_annotation_processor Strip PDF annotations
02_marker_extractor     Extract blocks (text, tables, figures)
03_suspicious_headers   Verify header classifications with VLM
04_section_builder      Build document sections
05_table_extractor      Extract and describe tables
06_figure_extractor     Extract and describe figures
07_duckdb_ingest        Assemble into queryable DB
08_extract_requirements Mine requirements (if detected)
08b_lean4_theorem_prover Formal proofs (scientific only)
09_section_summarizer   Generate section summaries
10_markdown_exporter    Export to Markdown
14_report_generator     Generate extraction report

Output Structure

{
  "success": true,
  "preset": "arxiv",
  "outputs": {
    "markdown": "results/10_markdown_exporter/document.md",
    "sections": "results/04_section_builder/json_output/04_sections.json",
    "tables": "results/05_table_extractor/json_output/05_tables.json",
    "figures": "results/06_figure_extractor/json_output/06_figures.json",
    "report": "results/14_report_generator/json_output/final_report.json"
  },
  "counts": {
    "sections": 12,
    "tables": 5,
    "figures": 8
  }
}
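
A batch wrapper can gate on the "success" field before reading the output paths. A minimal sketch; result_ok is an illustrative helper (not part of the skill) that assumes the JSON shape above:

```shell
# result_ok: illustrative helper; exits 0 when the result JSON has
# "success": true, non-zero otherwise. Parsed with python3, no jq needed.
result_ok() {
  printf '%s' "$1" | python3 -c 'import json, sys; sys.exit(0 if json.load(sys.stdin).get("success") else 1)'
}

# Sketch (result path is illustrative):
#   if result_ok "$(cat results/result.json)"; then echo "extraction succeeded"; fi
```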

Batch Processing

# Process all PDFs in a directory
.pi/skills/extractor/run.sh ./documents/ --out ./results

# With glob pattern
.pi/skills/extractor/run.sh ./documents/ --glob "**/*.pdf"

# Non-interactive batch (CI/scripts)
.pi/skills/extractor/run.sh ./documents/ --no-interactive

# Force preset for entire batch
.pi/skills/extractor/run.sh ./documents/ --preset arxiv --out ./results

Agent-Friendly Flags

Flag                   Purpose
--profile-only         Return profile JSON without extraction
--no-interactive       Skip prompts, use auto mode
--preset <name>        Force preset (skip detection)
--fast                 No LLM, quick extraction
--toc-check            Check TOC integrity against extracted sections
--auto-ocr             OCR scanned PDFs with OCRmyPDF (lazy-loads docker image)
--no-auto-ocr          Disable OCRmyPDF preprocessing for scanned PDFs
--skip-scanned         Skip scanned PDFs and write a skip manifest
--ocr-lang <langs>     OCR language(s), e.g. eng or eng+deu
--ocr-deskew           Deskew scanned pages during OCR
--ocr-force            Force OCR even if text exists
--ocr-timeout <sec>    OCR timeout in seconds
--continue-on-error    Continue pipeline on step failures (batch-friendly)

TOC Integrity Check

Verify that extracted sections match the PDF's Table of Contents (bookmarks):

# Check integrity on pipeline output directory
.pi/skills/extractor/run.sh ./results/ --toc-check

# Check specific DuckDB file
.pi/skills/extractor/run.sh ./results/corpus.duckdb --toc-check

Output:

{
  "success": true,
  "has_toc": true,
  "integrity_score": 0.85,
  "status": "GOOD",
  "toc_entries_count": 20,
  "sections_count": 18,
  "matched_count": 17,
  "missing_count": 3,
  "matched": [
    { "toc_title": "1. Introduction", "section_id": "sec_001", "score": 0.95 }
  ],
  "missing": [{ "toc_title": "Appendix A", "toc_page": 45 }]
}

Status levels:

  • EXCELLENT: >= 90% match
  • GOOD: >= 70% match
  • FAIR: >= 50% match
  • POOR: < 50% match
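
These thresholds are easy to apply in a wrapper script. A minimal sketch; toc_status is an illustrative helper, not a flag of run.sh:

```shell
# toc_status: map an integrity_score (0.0-1.0) to the status levels above.
# Illustrative helper only; the skill reports "status" in its own JSON.
toc_status() {
  awk -v s="$1" 'BEGIN {
    if      (s >= 0.90) print "EXCELLENT"
    else if (s >= 0.70) print "GOOD"
    else if (s >= 0.50) print "FAIR"
    else                print "POOR"
  }'
}

toc_status 0.85   # GOOD
```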

Environment

Requires the extractor project with its virtual environment:

  • Project: /home/graham/workspace/experiments/extractor
  • Venv: .venv/bin/python
  • Dependencies: scillm, fetcher (local paths)

Set EXTRACTOR_ROOT to override the project location.

Sanity Check

# Verify skill works across all formats
.pi/skills/extractor/sanity.sh

Tests: HTML, MD, XML, RST, DOCX, PPTX, EPUB, XLSX, PDF, PNG

LLM Requirements

For accurate mode (VLM/table descriptions):

  • CHUTES_API_BASE - Chutes API endpoint
  • CHUTES_API_KEY - API key
  • CHUTES_VLM_MODEL - Vision model (default: Qwen/Qwen3-VL-235B-A22B-Instruct)
  • CHUTES_TEXT_MODEL - Text model (default: moonshotai/Kimi-K2-Instruct-0905)
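
A minimal environment for accurate mode might look like the following (the endpoint and key are placeholders; the model variables can stay unset to use the defaults above):

```shell
# Placeholder values - substitute your real Chutes endpoint and key.
export CHUTES_API_BASE="https://chutes.example/v1"
export CHUTES_API_KEY="sk-..."
# CHUTES_VLM_MODEL / CHUTES_TEXT_MODEL fall back to the defaults listed above.
.pi/skills/extractor/run.sh paper.pdf --accurate
```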

For Lean4 proving (arxiv preset):

  • lean_runner container running
  • OPENROUTER_API_KEY set