Claude-skill-registry extractor

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/extractor" ~/.claude/skills/majiayu000-claude-skill-registry-extractor && rm -rf "$T"
manifest: skills/data/extractor/SKILL.md
source content

Extractor

Self-correcting agentic document extraction using a Preset-First Methodology. Auto-detects document type and applies calibrated extraction settings.

Quick Start

# Auto mode (recommended) - detects document type automatically
.pi/skills/extractor/run.sh paper.pdf

# Specify output directory
.pi/skills/extractor/run.sh paper.pdf --out ./results

# Get markdown output directly
.pi/skills/extractor/run.sh paper.pdf --markdown

# OCR scanned PDFs (lazy-loads OCRmyPDF docker image if needed)
.pi/skills/extractor/run.sh scanned.pdf --auto-ocr

Extraction Modes

Mode      Flag        Description
Auto      (default)   Profile detector picks best settings
Fast      --fast      PyMuPDF only, no ML/LLM (fastest)
Accurate  --accurate  Full pipeline with LLM enhancements
Offline   --offline   Deterministic, no network calls

# Fast mode - quick extraction, no LLM
.pi/skills/extractor/run.sh report.pdf --fast

# Accurate mode - full pipeline with LLM for tables/math
.pi/skills/extractor/run.sh paper.pdf --accurate

# Offline smoke test (deterministic)
.pi/skills/extractor/run.sh doc.pdf --offline

Collaboration Flow

For PDFs without --preset, the skill runs an intelligent collaboration flow:

  1. Profile Detection: Analyzes document (layout, tables, formulas, requirements)
  2. High Confidence Match: If confidence >= 8, auto-extracts with detected preset
  3. Low Confidence / Unknown:
    • Interactive (TTY): Prompts user to select preset
    • Non-interactive: Uses auto mode with warning
# See what the detector finds (no extraction)
.pi/skills/extractor/run.sh paper.pdf --profile-only

# Output:
# {
#   "preset": "arxiv",
#   "confidence": 12,
#   "tables": true,
#   "figures": true,
#   "formulas": true,
#   "recommended_mode": "accurate"
# }

# Interactive prompt (in terminal)
.pi/skills/extractor/run.sh unknown_paper.pdf
# Analyzing: unknown_paper.pdf
# Detected: multi-column layout, 12 pages
# Contains: tables, figures, formulas
#
# Select extraction preset:
#   [1] arxiv - Academic papers [RECOMMENDED]
#   [2] requirements_spec - Engineering specs
#   [3] auto - Let pipeline decide
#   [4] fast - Quick extraction, no LLM
# Enter choice [1-4]:

# Non-interactive (batch/CI) - auto-selects
echo | .pi/skills/extractor/run.sh paper.pdf --no-interactive

Preset Selection

The pipeline auto-detects document type via s00_profile_detector:

Preset             Detected When                                            Confidence Points
arxiv              Academic papers (2-column, math, "Abstract/References")  +5 filename, +4 sections, +3 layout
requirements_spec  Engineering specs (REQ-xxx, "Shall", nested sections)    +5 filename, +4 REQ pattern
auto               Unknown documents                                        Fallback when confidence < 8

# Force a specific preset (skip detection)
.pi/skills/extractor/run.sh paper.pdf --preset arxiv
.pi/skills/extractor/run.sh spec.pdf --preset requirements_spec

# Let collaboration flow decide
.pi/skills/extractor/run.sh paper.pdf
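
In scripts, the two steps can be chained: detect first with --profile-only, then extract with the detected preset. A minimal sketch; detected_preset is an illustrative helper (not part of the skill) that assumes the profile JSON shape shown earlier, and parses with python3 rather than jq:

```shell
# detected_preset: illustrative helper that pulls "preset" out of the
# --profile-only JSON; uses python3 so there is no jq dependency.
detected_preset() {
  printf '%s' "$1" | python3 -c 'import json, sys; print(json.load(sys.stdin)["preset"])'
}

# Sketch of a scripted flow:
#   profile=$(.pi/skills/extractor/run.sh paper.pdf --profile-only)
#   .pi/skills/extractor/run.sh paper.pdf --preset "$(detected_preset "$profile")" --no-interactive
```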

Output Options

# JSON output (default) - full structured data
.pi/skills/extractor/run.sh doc.pdf --json

# Markdown output - human-readable text
.pi/skills/extractor/run.sh doc.pdf --markdown

# Sections only (skip tables/figures)
.pi/skills/extractor/run.sh doc.pdf --sections-only

Supported Formats

Cross-format parity measured against HTML reference (2026-01-17):

Format    Method                    Parity     Notes
Markdown  Direct parse              100%       Perfect structural match
DOCX      Native XML (python-docx)  100%       Perfect structural match
HTML      BeautifulSoup             Reference  Baseline for comparison
XML       defusedxml                90%        Structure preserved, markdown differs
PDF       14-stage pipeline         87%        Varies by document complexity
RST       docutils                  85%        Section structure varies
EPUB      ebooklib                  82%        Chapter structure varies
PPTX      python-pptx               81%        Slide-based structure
XLSX      openpyxl                  16%        Expected (spreadsheet format)
Images    OCR/VLM                   16%        Requires VLM for text extraction

Pipeline Stages

The full pipeline runs 14+ stages:

00_profile_detector     Detect document type, select preset
01_annotation_processor Strip PDF annotations
02_marker_extractor     Extract blocks (text, tables, figures)
03_suspicious_headers   Verify header classifications with VLM
04_section_builder      Build document sections
05_table_extractor      Extract and describe tables
06_figure_extractor     Extract and describe figures
07_duckdb_ingest        Assemble into queryable DB
08_extract_requirements Mine requirements (if detected)
08b_lean4_theorem_prover Formal proofs (scientific only)
09_section_summarizer   Generate section summaries
10_markdown_exporter    Export to Markdown
14_report_generator     Generate extraction report

Output Structure

{
  "success": true,
  "preset": "arxiv",
  "outputs": {
    "markdown": "results/10_markdown_exporter/document.md",
    "sections": "results/04_section_builder/json_output/04_sections.json",
    "tables": "results/05_table_extractor/json_output/05_tables.json",
    "figures": "results/06_figure_extractor/json_output/06_figures.json",
    "report": "results/14_report_generator/json_output/final_report.json"
  },
  "counts": {
    "sections": 12,
    "tables": 5,
    "figures": 8
  }
}
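
A batch wrapper can gate on the "success" field before reading the output paths. A minimal sketch; result_ok is an illustrative helper (not part of the skill) that assumes the JSON shape above:

```shell
# result_ok: illustrative helper; exits 0 when the result JSON has
# "success": true, non-zero otherwise. Parsed with python3, no jq needed.
result_ok() {
  printf '%s' "$1" | python3 -c 'import json, sys; sys.exit(0 if json.load(sys.stdin).get("success") else 1)'
}

# Sketch (result path is illustrative):
#   if result_ok "$(cat results/result.json)"; then echo "extraction succeeded"; fi
```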

Batch Processing

# Process all PDFs in a directory
.pi/skills/extractor/run.sh ./documents/ --out ./results

# With glob pattern
.pi/skills/extractor/run.sh ./documents/ --glob "**/*.pdf"

# Non-interactive batch (CI/scripts)
.pi/skills/extractor/run.sh ./documents/ --no-interactive

# Force preset for entire batch
.pi/skills/extractor/run.sh ./documents/ --preset arxiv --out ./results

Agent-Friendly Flags

Flag                   Purpose
--profile-only         Return profile JSON without extraction
--no-interactive       Skip prompts, use auto mode
--preset <name>        Force preset (skip detection)
--fast                 No LLM, quick extraction
--toc-check            Check TOC integrity against extracted sections
--auto-ocr             OCR scanned PDFs with OCRmyPDF (lazy-loads docker image)
--no-auto-ocr          Disable OCRmyPDF preprocessing for scanned PDFs
--skip-scanned         Skip scanned PDFs and write a skip manifest
--ocr-lang <langs>     OCR language(s), e.g. eng or eng+deu
--ocr-deskew           Deskew scanned pages during OCR
--ocr-force            Force OCR even if text exists
--ocr-timeout <sec>    OCR timeout in seconds
--continue-on-error    Continue pipeline on step failures (batch-friendly)

TOC Integrity Check

Verify that extracted sections match the PDF's Table of Contents (bookmarks):

# Check integrity on pipeline output directory
.pi/skills/extractor/run.sh ./results/ --toc-check

# Check specific DuckDB file
.pi/skills/extractor/run.sh ./results/corpus.duckdb --toc-check

Output:

{
  "success": true,
  "has_toc": true,
  "integrity_score": 0.85,
  "status": "GOOD",
  "toc_entries_count": 20,
  "sections_count": 18,
  "matched_count": 17,
  "missing_count": 3,
  "matched": [
    { "toc_title": "1. Introduction", "section_id": "sec_001", "score": 0.95 }
  ],
  "missing": [{ "toc_title": "Appendix A", "toc_page": 45 }]
}

Status levels:

  • EXCELLENT: >= 90% match
  • GOOD: >= 70% match
  • FAIR: >= 50% match
  • POOR: < 50% match
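
These thresholds are easy to apply in a wrapper script. A minimal sketch; toc_status is an illustrative helper, not a flag of run.sh:

```shell
# toc_status: map an integrity_score (0.0-1.0) to the status levels above.
# Illustrative helper only; the skill reports "status" in its own JSON.
toc_status() {
  awk -v s="$1" 'BEGIN {
    if      (s >= 0.90) print "EXCELLENT"
    else if (s >= 0.70) print "GOOD"
    else if (s >= 0.50) print "FAIR"
    else                print "POOR"
  }'
}

toc_status 0.85   # GOOD
```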

Environment

Requires the extractor project with its virtual environment:

  • Project: /home/graham/workspace/experiments/extractor
  • Venv: .venv/bin/python
  • Dependencies: scillm, fetcher (local paths)

Set EXTRACTOR_ROOT to override the project location.

Sanity Check

# Verify skill works across all formats
.pi/skills/extractor/sanity.sh

Tests: HTML, MD, XML, RST, DOCX, PPTX, EPUB, XLSX, PDF, PNG

LLM Requirements

For accurate mode (VLM/table descriptions):

  • CHUTES_API_BASE - Chutes API endpoint
  • CHUTES_API_KEY - API key
  • CHUTES_VLM_MODEL - Vision model (default: Qwen/Qwen3-VL-235B-A22B-Instruct)
  • CHUTES_TEXT_MODEL - Text model (default: moonshotai/Kimi-K2-Instruct-0905)
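
A minimal environment for accurate mode might look like the following (the endpoint and key are placeholders; the model variables can stay unset to use the defaults above):

```shell
# Placeholder values - substitute your real Chutes endpoint and key.
export CHUTES_API_BASE="https://chutes.example/v1"
export CHUTES_API_KEY="sk-..."
# CHUTES_VLM_MODEL / CHUTES_TEXT_MODEL fall back to the defaults listed above.
.pi/skills/extractor/run.sh paper.pdf --accurate
```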

For Lean4 proving (arxiv preset):

  • lean_runner container running
  • OPENROUTER_API_KEY set