# claude-skill-registry · extractor

## Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/majiayu000/claude-skill-registry
```

Claude Code · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/extractor" ~/.claude/skills/majiayu000-claude-skill-registry-extractor && rm -rf "$T"
```

Manifest: `skills/data/extractor/SKILL.md`
# Extractor
Self-correcting agentic document extraction using a Preset-First Methodology. Auto-detects document type and applies calibrated extraction settings.
## Quick Start

```bash
# Auto mode (recommended) - detects document type automatically
.pi/skills/extractor/run.sh paper.pdf

# Specify output directory
.pi/skills/extractor/run.sh paper.pdf --out ./results

# Get markdown output directly
.pi/skills/extractor/run.sh paper.pdf --markdown

# OCR scanned PDFs (lazy-loads OCRmyPDF docker image if needed)
.pi/skills/extractor/run.sh scanned.pdf --auto-ocr
```
## Extraction Modes

| Mode | Flag | Description |
|---|---|---|
| Auto | (default) | Profile detector picks best settings |
| Fast | `--fast` | PyMuPDF only, no ML/LLM (fastest) |
| Accurate | `--accurate` | Full pipeline with LLM enhancements |
| Offline | `--offline` | Deterministic, no network calls |
```bash
# Fast mode - quick extraction, no LLM
.pi/skills/extractor/run.sh report.pdf --fast

# Accurate mode - full pipeline with LLM for tables/math
.pi/skills/extractor/run.sh paper.pdf --accurate

# Offline smoke test (deterministic)
.pi/skills/extractor/run.sh doc.pdf --offline
```
## Collaboration Flow

For PDFs run without `--preset`, the skill uses an intelligent collaboration flow:
1. Profile Detection: analyzes the document (layout, tables, formulas, requirements)
2. High Confidence Match: if confidence >= 8, auto-extracts with the detected preset
3. Low Confidence / Unknown:
   - Interactive (TTY): prompts the user to select a preset
   - Non-interactive: uses auto mode with a warning
```bash
# See what the detector finds (no extraction)
.pi/skills/extractor/run.sh paper.pdf --profile-only
# Output:
# {
#   "preset": "arxiv",
#   "confidence": 12,
#   "tables": true,
#   "figures": true,
#   "formulas": true,
#   "recommended_mode": "accurate"
# }

# Interactive prompt (in terminal)
.pi/skills/extractor/run.sh unknown_paper.pdf
# Analyzing: unknown_paper.pdf
# Detected: multi-column layout, 12 pages
# Contains: tables, figures, formulas
#
# Select extraction preset:
#   [1] arxiv - Academic papers [RECOMMENDED]
#   [2] requirements_spec - Engineering specs
#   [3] auto - Let pipeline decide
#   [4] fast - Quick extraction, no LLM
# Enter choice [1-4]:

# Non-interactive (batch/CI) - auto-selects
echo | .pi/skills/extractor/run.sh paper.pdf --no-interactive
```
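In a wrapper script, the `--profile-only` JSON can be consumed with standard tools. A minimal sketch, assuming the JSON shape shown above (`sed` is used here to avoid a `jq` dependency; prefer `jq` for real parsing):

```shell
# Sketch: route a PDF by its detected preset, without extra dependencies.
# The real call would be: profile=$(.pi/skills/extractor/run.sh paper.pdf --profile-only)
profile='{"preset": "arxiv", "confidence": 12, "recommended_mode": "accurate"}'

preset=$(printf '%s' "$profile" | sed -n 's/.*"preset": *"\([^"]*\)".*/\1/p')
confidence=$(printf '%s' "$profile" | sed -n 's/.*"confidence": *\([0-9][0-9]*\).*/\1/p')

# Mirror the collaboration flow: trust the detected preset only at confidence >= 8.
if [ "${confidence:-0}" -ge 8 ]; then
  chosen="$preset"
else
  chosen="auto"
fi
echo "chosen preset: $chosen"
```

This reproduces the confidence >= 8 rule from the flow above on the caller's side, which is useful when batching PDFs with mixed document types.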
## Preset Selection

The pipeline auto-detects the document type via `s00_profile_detector`:
| Preset | Detected When | Confidence Points |
|---|---|---|
| arxiv | Academic papers (2-column, math, "Abstract/References") | +5 filename, +4 sections, +3 layout |
| requirements_spec | Engineering specs (REQ-xxx, "Shall", nested sections) | +5 filename, +4 REQ pattern |
| auto | Unknown documents | Fallback when confidence < 8 |
```bash
# Force a specific preset (skip detection)
.pi/skills/extractor/run.sh paper.pdf --preset arxiv
.pi/skills/extractor/run.sh spec.pdf --preset requirements_spec

# Let collaboration flow decide
.pi/skills/extractor/run.sh paper.pdf
```
## Output Options

```bash
# JSON output (default) - full structured data
.pi/skills/extractor/run.sh doc.pdf --json

# Markdown output - human-readable text
.pi/skills/extractor/run.sh doc.pdf --markdown

# Sections only (skip tables/figures)
.pi/skills/extractor/run.sh doc.pdf --sections-only
```
## Supported Formats
Cross-format parity measured against HTML reference (2026-01-17):
| Format | Method | Parity | Notes |
|---|---|---|---|
| Markdown | Direct parse | 100% | Perfect structural match |
| DOCX | Native XML (python-docx) | 100% | Perfect structural match |
| HTML | BeautifulSoup | Reference | Baseline for comparison |
| XML | defusedxml | 90% | Structure preserved, markdown differs |
| PDF | 14-stage pipeline | 87% | Varies by document complexity |
| RST | docutils | 85% | Section structure varies |
| EPUB | ebooklib | 82% | Chapter structure varies |
| PPTX | python-pptx | 81% | Slide-based structure |
| XLSX | openpyxl | 16% | Expected (spreadsheet format) |
| Images | OCR/VLM | 16% | Requires VLM for text extraction |
## Pipeline Stages

The full pipeline runs 14+ stages:

```
00_profile_detector       Detect document type, select preset
01_annotation_processor   Strip PDF annotations
02_marker_extractor       Extract blocks (text, tables, figures)
03_suspicious_headers     Verify header classifications with VLM
04_section_builder        Build document sections
05_table_extractor        Extract and describe tables
06_figure_extractor       Extract and describe figures
07_duckdb_ingest          Assemble into queryable DB
08_extract_requirements   Mine requirements (if detected)
08b_lean4_theorem_prover  Formal proofs (scientific only)
09_section_summarizer     Generate section summaries
10_markdown_exporter      Export to Markdown
14_report_generator       Generate extraction report
```
## Output Structure

```json
{
  "success": true,
  "preset": "arxiv",
  "outputs": {
    "markdown": "results/10_markdown_exporter/document.md",
    "sections": "results/04_section_builder/json_output/04_sections.json",
    "tables": "results/05_table_extractor/json_output/05_tables.json",
    "figures": "results/06_figure_extractor/json_output/06_figures.json",
    "report": "results/14_report_generator/json_output/final_report.json"
  },
  "counts": { "sections": 12, "tables": 5, "figures": 8 }
}
```
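A caller can pull the exported Markdown path out of this result without extra dependencies. A minimal sketch, assuming the result JSON shape above (simplified sample inlined; `sed` stands in for `jq`):

```shell
# Sketch: read the markdown output path from the pipeline's result JSON.
# The real call would be: result=$(.pi/skills/extractor/run.sh paper.pdf)
result='{"success": true, "outputs": {"markdown": "results/10_markdown_exporter/document.md"}, "counts": {"sections": 12}}'

md_path=$(printf '%s' "$result" | sed -n 's/.*"markdown": *"\([^"]*\)".*/\1/p')
echo "$md_path"
```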
## Batch Processing

```bash
# Process all PDFs in a directory
.pi/skills/extractor/run.sh ./documents/ --out ./results

# With glob pattern
.pi/skills/extractor/run.sh ./documents/ --glob "**/*.pdf"

# Non-interactive batch (CI/scripts)
.pi/skills/extractor/run.sh ./documents/ --no-interactive

# Force preset for entire batch
.pi/skills/extractor/run.sh ./documents/ --preset arxiv --out ./results
```
## Agent-Friendly Flags

| Flag | Purpose |
|---|---|
| `--profile-only` | Return profile JSON without extraction |
| `--no-interactive` | Skip prompts, use auto mode |
| `--preset` | Force preset (skip detection) |
| `--fast` | No LLM, quick extraction |
| `--toc-check` | Check TOC integrity against extracted sections |
| `--auto-ocr` | OCR scanned PDFs with OCRmyPDF (lazy-loads docker image) |
| | Disable OCRmyPDF preprocessing for scanned PDFs |
| | Skip scanned PDFs and write a skip manifest |
| | OCR language(s) |
| | Deskew scanned pages during OCR |
| | Force OCR even if text exists |
| | OCR timeout in seconds |
| | Continue pipeline on step failures (batch-friendly) |
## TOC Integrity Check

Verify that extracted sections match the PDF's Table of Contents (bookmarks):

```bash
# Check integrity on pipeline output directory
.pi/skills/extractor/run.sh ./results/ --toc-check

# Check specific DuckDB file
.pi/skills/extractor/run.sh ./results/corpus.duckdb --toc-check
```
Output:

```json
{
  "success": true,
  "has_toc": true,
  "integrity_score": 0.85,
  "status": "GOOD",
  "toc_entries_count": 20,
  "sections_count": 18,
  "matched_count": 17,
  "missing_count": 3,
  "matched": [
    { "toc_title": "1. Introduction", "section_id": "sec_001", "score": 0.95 }
  ],
  "missing": [{ "toc_title": "Appendix A", "toc_page": 45 }]
}
```
Status levels:
- EXCELLENT: >= 90% match
- GOOD: >= 70% match
- FAIR: >= 50% match
- POOR: < 50% match
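The status is a direct function of `integrity_score`, so the thresholds above can be reproduced in a wrapper when only the numeric score is available. A minimal sketch (a local helper function, not part of the skill):

```shell
# Map an integrity_score (0.0-1.0) to the status levels documented above.
toc_status() {
  awk -v s="$1" 'BEGIN {
    if      (s >= 0.90) print "EXCELLENT"
    else if (s >= 0.70) print "GOOD"
    else if (s >= 0.50) print "FAIR"
    else                print "POOR"
  }'
}

toc_status 0.85
```

For example, a score of 0.85 (as in the sample output above) maps to GOOD.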
## Environment

Requires the extractor project with its virtual environment:

- Project: `/home/graham/workspace/experiments/extractor`
- Venv: `.venv/bin/python`
- Dependencies: `scillm`, `fetcher` (local paths)

Set `EXTRACTOR_ROOT` to override the project location.
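For example, to point the skill at a checkout outside the default location (the path below is hypothetical):

```shell
# Hypothetical checkout path; adjust to your machine.
export EXTRACTOR_ROOT="$HOME/workspace/extractor"
echo "$EXTRACTOR_ROOT"
```

Set this before invoking `run.sh` so the skill resolves the venv and dependencies from that checkout.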
## Sanity Check

```bash
# Verify skill works across all formats
.pi/skills/extractor/sanity.sh
```
Tests: HTML, MD, XML, RST, DOCX, PPTX, EPUB, XLSX, PDF, PNG
## LLM Requirements

For accurate mode (VLM/table descriptions):

- `CHUTES_API_BASE` - Chutes API endpoint
- `CHUTES_API_KEY` - API key
- `CHUTES_VLM_MODEL` - Vision model (default: Qwen/Qwen3-VL-235B-A22B-Instruct)
- `CHUTES_TEXT_MODEL` - Text model (default: moonshotai/Kimi-K2-Instruct-0905)

For Lean4 proving (arxiv preset):

- `lean_runner` container running
- `OPENROUTER_API_KEY` set
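A typical shell setup for accurate mode might look like the following. The endpoint and key values are placeholders (the real ones come from your Chutes account); the model names are the defaults listed above, so those two exports are only needed to override them:

```shell
# Placeholder credentials; substitute your real Chutes endpoint and key.
export CHUTES_API_BASE="https://chutes.example/v1"
export CHUTES_API_KEY="sk-placeholder"
# Defaults per the list above; export only to override.
export CHUTES_VLM_MODEL="Qwen/Qwen3-VL-235B-A22B-Instruct"
export CHUTES_TEXT_MODEL="moonshotai/Kimi-K2-Instruct-0905"
```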