Aiwg pdf-extractor
Extract text, tables, and images from PDF files. Use when converting PDF documentation, manuals, or reports to searchable text.
install
source · Clone the upstream repo
git clone https://github.com/jmagly/aiwg
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/pdf-extractor" ~/.claude/skills/jmagly-aiwg-pdf-extractor && rm -rf "$T"
manifest:
.agents/skills/pdf-extractor/SKILL.md
source content
PDF Extractor Skill
Purpose
Single responsibility: Extract structured content (text, tables, images) from PDF files into organized, searchable formats. (BP-4)
Grounding Checkpoint (Archetype 1 Mitigation)
Before executing, VERIFY:
- PDF file exists and is readable (`file <path>` confirms PDF format)
- PDF is not corrupted (`pdfinfo <path>` returns metadata)
- Password known if encrypted
- Output directory is writable
- Required tools available (pdfplumber, pytesseract for OCR)
DO NOT proceed without verification. Inspect PDF metadata first.
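The checklist above could be automated roughly as follows. This is a minimal sketch, not part of the skill itself: the function name `preflight` and its parameters are hypothetical, and the signature check only looks for the `%PDF-` marker at the very start of the file, which is a lighter test than running `pdfinfo`.

```python
import os
import shutil
from pathlib import Path


def preflight(pdf_path, output_dir, tools=("pdfinfo",)):
    """Return a list of problems; an empty list means every check passed.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    problems = []
    pdf = Path(pdf_path)
    if not pdf.is_file() or not os.access(pdf, os.R_OK):
        problems.append(f"not a readable file: {pdf}")
    else:
        # Well-formed PDF files start with the "%PDF-" signature.
        with pdf.open("rb") as fh:
            if fh.read(5) != b"%PDF-":
                problems.append(f"missing %PDF- signature: {pdf}")
    out = Path(output_dir)
    if not out.is_dir() or not os.access(out, os.W_OK):
        problems.append(f"output directory not writable: {out}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"tool not found on PATH: {tool}")
    return problems
```

A caller would stop and report the returned list instead of starting extraction whenever it is non-empty.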
Uncertainty Escalation (Archetype 2 Mitigation)
ASK USER instead of guessing when:
- PDF appears to be scanned (needs OCR) but OCR tools unavailable
- Multiple table formats detected - unclear which parser to use
- Password-protected but no password provided
- Image extraction quality unclear (resolution, format preferences)
- Language detection needed for OCR
NEVER assume PDF structure without inspection.
Context Scope (Archetype 3 Mitigation)
| Context Type | Included | Excluded |
|---|---|---|
| RELEVANT | Target PDF, extraction options, output path | Other PDF files |
| PERIPHERAL | Similar PDF structure examples | Unrelated documents |
| DISTRACTOR | Previous extraction attempts | Other file formats |
Workflow Steps
Step 1: Inspect PDF (Grounding)
```bash
# Check file type
file document.pdf

# Get PDF metadata
pdfinfo document.pdf

# Check page count
pdfinfo document.pdf | grep Pages

# Check if encrypted
pdfinfo document.pdf | grep Encrypted
```
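When the inspection runs from Python rather than the shell, the `pdfinfo` output can be parsed into a dictionary. The sketch below assumes the `Key:   value` line layout that `pdfinfo` prints; the helper name `parse_pdfinfo` and the sample text are made up for illustration.

```python
def parse_pdfinfo(output):
    """Parse pdfinfo text output ("Key:   value" per line) into a dict."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as dates keep theirs.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Illustrative sample, not output from a real document.
sample = "Title:          Manual\nPages:          12\nEncrypted:      no\n"
info = parse_pdfinfo(sample)
page_count = int(info["Pages"])
# For encrypted files pdfinfo prints "yes" followed by permission details.
is_encrypted = info["Encrypted"].startswith("yes")
```

In practice the input string would come from something like `subprocess.run(["pdfinfo", path], capture_output=True, text=True).stdout`.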
Step 2: Determine Extraction Strategy
| PDF Type | Detection | Strategy |
|---|---|---|
| Text-based | Text extraction produces readable text | Direct extraction |
| Scanned/Image | Text extraction produces empty or garbled output | OCR required |
| Mixed | Some pages text, some images | Hybrid approach |
| Tables | Visual grid patterns | Table extraction mode |
| Forms | Interactive fields | Form field extraction |
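The first three rows of the table can be approximated by measuring how much text each page yields. This is a rough sketch under stated assumptions: `choose_strategy` and the `min_chars` threshold are hypothetical, and the check detects empty pages only, not garbled text.

```python
def choose_strategy(page_texts, min_chars=20):
    """Pick "direct", "ocr", or "hybrid" from per-page extracted text.

    min_chars is an assumed threshold: pages with fewer non-whitespace
    characters are treated as image-only.
    """
    if not page_texts:
        raise ValueError("no pages to classify")
    # Count characters after removing all whitespace.
    has_text = [len("".join(t.split())) >= min_chars for t in page_texts]
    if all(has_text):
        return "direct"
    if not any(has_text):
        return "ocr"
    return "hybrid"
```

The input would be one string per page, for example from running a text extractor page by page.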
Step 3: Execute Extraction
Option A: With skill-seekers (if installed)
```bash
# Basic extraction
skill-seekers pdf --pdf document.pdf --name myskill

# With table extraction
skill-seekers pdf --pdf document.pdf --name myskill --extract-tables

# With OCR for scanned docs
skill-seekers pdf --pdf document.pdf --name myskill --ocr

# With parallel processing (large PDFs)
skill-seekers pdf --pdf document.pdf --name myskill --parallel --workers 8

# Password-protected
skill-seekers pdf --pdf document.pdf --name myskill --password "secret"
```
Option B: Manual extraction guidance
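```bash
# Basic text extraction
pdftotext -layout document.pdf output.txt

# Extract with Unix line endings (pages are separated by form feed characters)
pdftotext -layout -eol unix document.pdf output.txt

# Extract images (the output directory must exist; "img" is the file prefix)
mkdir -p images
pdfimages -all document.pdf images/img

# OCR scanned PDF (requires tesseract)
pdftoppm -png document.pdf page
# tesseract takes one input image per call, so loop over the pages
for f in page-*.png; do
  tesseract "$f" "${f%.png}" -l eng
done
```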
Step 4: Validate Output
```bash
# Check extraction quality
head -100 output/<skill-name>/references/content.md

# Verify table extraction
grep -A 10 "| " output/<skill-name>/references/*.md

# Check image extraction
ls -la output/<skill-name>/assets/images/
```
Recovery Protocol (Archetype 4 Mitigation)
On error:
- PAUSE - Stop extraction, preserve partial output
- DIAGNOSE - Check error type:
  - `File not found` → Verify path
  - `Password required` → Ask user for password
  - `Corrupt PDF` → Check with `qpdf --check`, then attempt repair
  - `OCR failed` → Check tesseract installation, language packs
  - `Memory error` → Process in chunks, reduce workers
- ADAPT - Switch strategy based on diagnosis
- RETRY - Resume with adapted approach (max 3 attempts)
- ESCALATE - Ask user for guidance
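The protocol can be pictured as a bounded retry loop. The sketch below is illustrative only: `ExtractionError`, `Escalate`, `run_with_recovery`, and the idea of passing an adaptation hint back into the extractor are all assumptions, while the error kinds and the three-attempt limit come from the list above.

```python
class ExtractionError(Exception):
    """Hypothetical error carrying one of the error kinds listed above."""

    def __init__(self, kind):
        super().__init__(kind)
        self.kind = kind


class Escalate(Exception):
    """Raised when the user has to be asked for guidance."""


# DIAGNOSE: error kind -> adaptation, following the list above.
ADAPTATIONS = {
    "File not found": "verify path",
    "Password required": "ask user for password",
    "Corrupt PDF": "check with qpdf --check",
    "OCR failed": "check tesseract installation and language packs",
    "Memory error": "process in chunks, reduce workers",
}


def run_with_recovery(extract, max_attempts=3):
    """Call extract(hint) until it succeeds or max_attempts is reached."""
    hint = None
    for _ in range(max_attempts):
        try:
            return extract(hint)
        except ExtractionError as exc:
            # PAUSE happens implicitly: the failed call has returned control.
            if exc.kind not in ADAPTATIONS:
                raise Escalate(f"unknown error: {exc.kind}") from exc
            hint = ADAPTATIONS[exc.kind]  # ADAPT, then RETRY on the next pass
    raise Escalate(f"gave up after {max_attempts} attempts; last hint: {hint}")
```

Unknown error kinds escalate immediately in this sketch, on the reasoning that retrying without a diagnosis would amount to guessing.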
Checkpoint Support
State saved to:
.aiwg/working/checkpoints/pdf-extractor/
For large PDFs, extraction saves progress per chunk:
```text
checkpoints/pdf-extractor/
├── document_metadata.json
├── pages_1-50.json
├── pages_51-100.json
└── current_position.json
```
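A per-chunk checkpoint writer and a resume helper might look like the sketch below. The file names follow the layout above, but the JSON contents are an assumption: the source does not specify the schema, so the `last_page` key and both function names are hypothetical.

```python
import json
from pathlib import Path


def save_chunk(ckpt_dir, first, last, pages):
    """Write one chunk of page texts, then record the last finished page."""
    ckpt_dir = Path(ckpt_dir)
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    (ckpt_dir / f"pages_{first}-{last}.json").write_text(json.dumps(pages))
    # Written second, so the position never points at a missing chunk file.
    position = {"last_page": last}  # assumed schema
    (ckpt_dir / "current_position.json").write_text(json.dumps(position))


def next_page(ckpt_dir):
    """Return the 1-based page to resume from (1 if nothing was saved)."""
    pos = Path(ckpt_dir) / "current_position.json"
    if not pos.exists():
        return 1
    return json.loads(pos.read_text())["last_page"] + 1
```

After an interruption, extraction would restart at `next_page(...)` rather than at page 1.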
Output Structure
```text
output/<skill-name>/
├── SKILL.md              # Skill description with PDF summary
├── references/
│   ├── index.md          # Table of contents
│   ├── chapter_1.md      # Content by section
│   ├── chapter_2.md
│   └── tables.md         # Extracted tables
└── assets/
    └── images/           # Extracted images (if enabled)
        ├── page_1_fig_1.png
        └── page_5_chart_1.png
```
Configuration Options
```json
{
  "name": "mymanual",
  "description": "Product manual documentation",
  "pdf_path": "docs/manual.pdf",
  "extract_options": {
    "chunk_size": 10,
    "min_quality": 6.0,
    "extract_images": true,
    "min_image_size": 150,
    "ocr_enabled": false,
    "ocr_language": "eng",
    "table_extraction": true
  },
  "categories": {
    "getting_started": ["introduction", "setup", "installation"],
    "usage": ["using", "operation", "guide"],
    "reference": ["appendix", "specifications", "api"]
  }
}
```
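Two of these options lend themselves to a short illustration. The sketch below assumes that `chunk_size` counts pages per chunk and that `categories` maps a category name to keywords matched against section titles; the source states neither, and both function names are hypothetical.

```python
def page_chunks(page_count, chunk_size):
    """Split pages 1..page_count into inclusive (first, last) ranges."""
    return [
        (start, min(start + chunk_size - 1, page_count))
        for start in range(1, page_count + 1, chunk_size)
    ]


def categorize(title, categories, default="other"):
    """Return the first category with a keyword contained in the title."""
    lowered = title.lower()
    for name, keywords in categories.items():
        # Plain substring match; categories are tried in config order.
        if any(keyword in lowered for keyword in keywords):
            return name
    return default
```

With the configuration above, a 25-page document would be split into pages 1-10, 11-20, and 21-25, and a section titled "Installation Guide" would land in `getting_started` because that category is checked before `usage`.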
Extraction Quality Metrics
| Metric | Good | Acceptable | Poor |
|---|---|---|---|
| Text extraction rate | >95% | 80-95% | <80% |
| Table accuracy | >90% | 70-90% | <70% |
| Image quality | >300 DPI | 150-300 DPI | <150 DPI |
| OCR confidence | >90% | 70-90% | <70% |
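The bands in the table translate directly into a small grading helper. This is a sketch only: the metric keys and the `grade` function are hypothetical names, and values are taken to be percentages, or DPI for image quality.

```python
# (good_above, acceptable_from) per metric, copied from the table above.
THRESHOLDS = {
    "text_extraction_rate": (95, 80),
    "table_accuracy": (90, 70),
    "image_dpi": (300, 150),
    "ocr_confidence": (90, 70),
}


def grade(metric, value):
    """Map a measured value to "good", "acceptable", or "poor"."""
    good_above, acceptable_from = THRESHOLDS[metric]
    if value > good_above:
        return "good"
    if value >= acceptable_from:
        return "acceptable"
    return "poor"
```

Boundary values fall into the middle band, so a text extraction rate of exactly 95% grades as acceptable.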
Troubleshooting
| Issue | Diagnosis | Solution |
|---|---|---|
| Garbled text | Scanned PDF | Enable OCR mode |
| Missing tables | Complex layout | Use table extraction mode with pdfplumber |
| Poor OCR | Low resolution | Increase DPI, check language pack |
| Memory error | Large PDF | Use chunked extraction, reduce workers |
| Corrupt PDF | File damaged | Check with `qpdf --check`, then attempt repair |
Dependencies
Required:
- Python 3.10+
- pdfplumber or pypdf
Optional (for advanced features):
- pytesseract + tesseract-ocr (for OCR)
- Pillow (for image processing)
- camelot-py (for complex tables)
References
- Skill Seekers PDF Support: https://github.com/jmagly/Skill_Seekers/blob/main/docs/PDF_MCP_TOOL.md
- REF-001: Production-Grade Agentic Workflows (BP-1, BP-4)
- REF-002: LLM Failure Modes (Archetype 1-4 mitigations)