Claude-skill-registry debug-pdf
install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/debug-pdf" ~/.claude/skills/majiayu000-claude-skill-registry-debug-pdf && rm -rf "$T"
manifest: skills/data/debug-pdf/SKILL.md
Debug PDF Skill
Automate the lifecycle of an extraction failure: Failure -> Analysis -> Fixture -> Test
Why This Exists
Extractors (Marker, Surya, Camelot) break on specific PDF patterns (scanned pages, TOC dots, cursed fonts, watermarks). Manually reproducing these bugs is slow.
/debug-pdf fast-tracks this by:
- Downloading the failed artifact
- Identifying structural "traps" (TOC dots, watermarks, ligatures, etc.)
- Generating minimal reproduction fixtures using fixture-tricky
- Combining multiple failures into a single stress test PDF
Quick Start
# Analyze a single failed URL
./run.sh analyze "https://example.com/broken.pdf"
# Process multiple failures in batch
./run.sh batch failed_urls.txt --output report.json
# Combine all fixtures into one stress test PDF
./run.sh combine stress_test.pdf --max-pages 20
# List known failure patterns
./run.sh list-patterns
# Check session status
./run.sh status
Commands
analyze <url>
Analyze a single PDF URL and optionally generate a reproduction fixture.
./run.sh analyze "https://example.com/broken.pdf"
./run.sh analyze "https://example.com/broken.pdf" --no-repro
./run.sh analyze "https://example.com/broken.pdf" --send-inbox
batch <url-file>
Process multiple URLs from a file (one URL per line).
# Create URL file
echo "https://example.com/doc1.pdf" > failed.txt
echo "https://example.com/doc2.pdf" >> failed.txt
# Run batch analysis
./run.sh batch failed.txt --output analysis.json --send-inbox
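The one-URL-per-line format can be pre-checked before a batch run. The following is a minimal sketch, not the skill's own loader: the helper name `read_url_file`, the skipping of blank lines and `#` comments, and the http/https-only rule are all assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import urlparse


def read_url_file(path: str) -> list[str]:
    """Return de-duplicated http(s) URLs from a one-URL-per-line file.

    Hypothetical helper: skipping blank lines and '#' comments is an
    assumption, not documented behavior of ./run.sh batch.
    """
    urls: list[str] = []
    seen: set[str] = set()
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parsed = urlparse(line)
        # Keep only plain web URLs; file://, ftp:// and bare paths are dropped.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        if line not in seen:
            seen.add(line)
            urls.append(line)
    return urls
```

Running such a check first keeps malformed or duplicate entries out of the analysis report.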
combine [output.pdf]
Merge all generated fixtures into a single stress test PDF.
./run.sh combine stress_test.pdf --max-pages 15
list-patterns
Display all known failure patterns and their descriptions.
status
Show current debug session status and fixture count.
Detected Patterns (14/16 = 88%)
Structural (4/4 detected):
- scanned_no_ocr - Scanned image PDF without text layer
- sparse_content_slides - Slide deck with minimal text per page
- multi_column - Complex multi-column layouts (via text block analysis)
- watermarks - Text obscured by watermark overlays
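As an illustration of what "text block analysis" for multi_column could look like, here is a rough sketch that works on plain bounding boxes. It is an assumed heuristic, not the skill's actual detector: the function name `looks_multi_column`, the midline split, and the `min_per_side` threshold are hypothetical. Boxes in `(x0, y0, x1, y1)` form could come from, for example, the first four fields of PyMuPDF's `page.get_text("blocks")` tuples.

```python
def looks_multi_column(blocks, page_width, min_per_side=2):
    """Guess whether a page is laid out in two side-by-side columns.

    blocks: iterable of (x0, y0, x1, y1) text-block boxes.
    Hypothetical heuristic: blocks that cross the page midline
    (titles, full-width paragraphs) are ignored.
    """
    mid = page_width / 2
    left = [b for b in blocks if b[2] <= mid]   # entirely left of midline
    right = [b for b in blocks if b[0] >= mid]  # entirely right of midline
    if len(left) < min_per_side or len(right) < min_per_side:
        return False
    # True columns sit side by side, so at least one left block must
    # overlap vertically with a right block.
    return any(l[1] < r[3] and r[1] < l[3] for l in left for r in right)
```

A fixed midline is a simplification; pages with three columns or an off-center gutter would need clustering of the x-coordinates instead.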
Encoding (5/5 detected):
- toc_noise - Table of contents with dotted leaders
- metadata_artifacts - Print metadata (Jkt/PO/Frm) in content
- invisible_chars - Zero-width spaces, direction markers
- curly_quotes - Windows-1252 encoded smart quotes
- ligatures - fi/fl/ff ligature characters
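Several of the encoding patterns above can be approximated with regular expressions over extracted text. The sketch below is illustrative only: the regexes, the function name `detect_encoding_traps`, and the choice to match Unicode curly quotes (rather than raw Windows-1252 bytes) are assumptions, and metadata_artifacts is left out because its exact form is not specified here.

```python
import re

# Dotted leader followed by a page number, e.g. "Introduction ........ 12".
TOC_LEADER = re.compile(r"(?:\.[ \t]?){4,}[ \t]*\d+[ \t]*$", re.MULTILINE)
# Zero-width spaces/joiners, direction marks, word joiner, BOM.
INVISIBLE = re.compile("[\u200b\u200c\u200d\u200e\u200f\u2060\ufeff]")
# Unicode curly single and double quotes.
CURLY = re.compile("[\u2018\u2019\u201c\u201d]")
# Latin ligatures in the Alphabetic Presentation Forms block (ff, fi, fl, ...).
LIGATURES = re.compile("[\ufb00-\ufb06]")

CHECKS = {
    "toc_noise": TOC_LEADER,
    "invisible_chars": INVISIBLE,
    "curly_quotes": CURLY,
    "ligatures": LIGATURES,
}


def detect_encoding_traps(text: str) -> dict[str, int]:
    """Return {pattern_name: match_count} for patterns found in text."""
    counts = {name: len(rx.findall(text)) for name, rx in CHECKS.items()}
    return {name: n for name, n in counts.items() if n}
```

Returning counts rather than booleans makes it easier to tell a single stray ligature from a page full of them.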
Layout (4/4 detected):
- footnotes_inline - Footnotes merged into body text (via font size/position heuristics)
- split_tables - Tables spanning multiple pages (flag only, no merging)
- header_footer_bleed - Headers/footers mixed into content (via PyMuPDF4LLM Layout)
- diagram_heavy - Many embedded diagrams/charts
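The font size/position heuristic mentioned for footnotes_inline might be sketched as follows. This is an assumed approach, not the skill's code: the function name and both thresholds are hypothetical. Each span is a dict with `text`, `size`, and `bbox` keys, which is the shape of the spans returned by PyMuPDF's `page.get_text("dict")`.

```python
from statistics import median


def find_footnote_candidates(spans, page_height,
                             size_ratio=0.85, bottom_fraction=0.8):
    """Return text of spans that are small AND near the page bottom.

    spans: list of {"text": str, "size": float, "bbox": (x0, y0, x1, y1)}.
    size_ratio and bottom_fraction are illustrative thresholds.
    """
    if not spans:
        return []
    # The median font size approximates the body text size.
    body_size = median(s["size"] for s in spans)
    return [
        s["text"]
        for s in spans
        if s["size"] <= body_size * size_ratio
        and s["bbox"][1] >= page_height * bottom_fraction
    ]
```

Requiring both conditions keeps small superscript markers in the body from being flagged as footnote text.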
Network (1/3 detected locally):
- archive_org_wrap - Wayback Machine URL wrapper (detected via URL pattern)
- auth_required - Marketing platform cookie gates (network-level, not detectable locally)
- access_restricted - Government/defense access controls (network-level, not detectable locally)
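Because archive_org_wrap is detected from the URL alone, it can be illustrated without any network access. The sketch below assumes the common `web.archive.org/web/<timestamp>/<original-url>` form, with an optional two-letter modifier such as `id_` after the timestamp; the regex and the function name `unwrap_wayback` are illustrative, not the skill's implementation.

```python
import re

# Timestamp is up to 14 digits (YYYYMMDDhhmmss), optionally followed by a
# two-letter modifier plus underscore (for example "id_" or "if_").
WAYBACK = re.compile(
    r"^https?://web\.archive\.org/web/(\d{1,14})(?:[a-z]{2}_)?/(.+)$",
    re.IGNORECASE,
)


def unwrap_wayback(url: str):
    """Return (timestamp, original_url) for a Wayback URL, else None."""
    m = WAYBACK.match(url.strip())
    if not m:
        return None
    return m.group(1), m.group(2)
```

Extracting the original URL lets a report record both the archived snapshot and the source the PDF came from.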
Workflow Integration
When the memory or extractor agent reports failures:
- Collect failed URLs in a text file
- Run batch analysis: ./run.sh batch failed_urls.txt
- Review pattern distribution in output
- Generate combined stress test: ./run.sh combine stress_test.pdf
- Add stress test to extractor's regression suite
- New patterns get added to fixture-tricky for future testing
Data Storage
All data is stored in ~/.pi/debug-pdf/:
- sessions/ - Individual analysis session JSON files
- fixtures/ - Generated reproduction PDFs
- last_analysis.json - Quick reference to most recent analysis
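The layout above can be inspected directly from Python. This is a small sketch under stated assumptions: session files end in `.json` and fixtures end in `.pdf` (inferred from the descriptions, not confirmed), and the names `DATA_DIR` and `summarize_store` are hypothetical.

```python
from pathlib import Path

# Default location as documented; the file extensions below are assumptions.
DATA_DIR = Path.home() / ".pi" / "debug-pdf"


def _count(directory: Path, pattern: str) -> int:
    """Count files matching pattern, treating a missing directory as empty."""
    return sum(1 for _ in directory.glob(pattern)) if directory.is_dir() else 0


def summarize_store(root: Path = DATA_DIR) -> dict:
    """Return session and fixture counts for a debug-pdf data directory."""
    return {
        "sessions": _count(root / "sessions", "*.json"),
        "fixtures": _count(root / "fixtures", "*.pdf"),
        "has_last_analysis": (root / "last_analysis.json").is_file(),
    }
```

Calling `summarize_store()` with no argument reads the default directory; passing a different `root` makes the helper usable against a temporary directory in tests.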
Dependencies
- pymupdf (fitz) - PDF structure analysis
- pymupdf4llm - ML-based layout detection for header/footer bleed
- httpx - HTTP downloads with redirect handling
- typer - CLI interface
- loguru - Logging
Sibling skills used:
- fetcher - Robust URL downloading with Playwright support
- fixture-tricky - Adversarial PDF generation
- extractor - Verification of generated fixtures
- agent-inbox - Cross-agent notifications
Testing
# Run test suite (24 tests)
python -m pytest tests/test_debug_pdf.py -v
# Generate test fixtures only
python tests/test_debug_pdf.py
Test coverage includes:
- URL validation (security hardening)
- Wayback URL detection and extraction
- Multi-column layout detection
- Header/footer bleed detection
- Split table detection
- Footnote detection
- Full PDF analysis integration
Sanity Check
./sanity.sh
Verifies:
- Python dependencies installed
- Sibling skills available
- Data directory accessible
- CLI commands functional