OpenSpace pdf-extract-progressive-tools
Progressive tool-chain PDF extraction with explicit read_file, run_shell, and execute_code_sandbox sequencing
git clone https://github.com/HKUDS/OpenSpace
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/pdf-download-extract-fallback-enhanced-899f5b" ~/.claude/skills/hkuds-openspace-pdf-extract-progressive-tools && rm -rf "$T"
gdpval_bench/skills/pdf-download-extract-fallback-enhanced-899f5b/SKILL.mdPDF Text Extraction with Progressive Tool Fallback
This skill provides a robust workflow for extracting text from PDF documents using a sequenced approach with agent tools, with explicit fallback mechanisms based on observed tool behavior.
Critical Insight from Execution Data
read_file often returns binary/image data for PDFs, not extracted text. When this occurs, immediately escalate to run_shell with pdftotext before attempting Python-based extraction.
Entry Point: Determine Your Starting Point
Before beginning, identify your scenario:
| Scenario | Start Here | Skip |
|---|---|---|
| PDF already on local disk | Step 1 (read_file attempt) | Download steps |
| PDF at a web URL | Download first, then Step 1 | None |
| PDF content already extracted | Step 4 (Quality verification) | Steps 1-3 |
Overview
PDF extraction failures cascade when tool sequencing is unclear. This workflow ensures maximum success rate through explicit tool progression:
- read_file - Quick attempt, but may return binary data
- run_shell + pdftotext - Reliable extraction when read_file fails
- execute_code_sandbox + PyMuPDF - Final fallback for complex PDFs
Step-by-Step Instructions
Step 1: Attempt read_file First
Always try the simplest approach first:
Tool: read_file Path: document.pdf
Expected outcome: Extracted text content
Critical check: Examine the returned content:
- ✅ Text visible: Proceed to Step 4 (Quality verification)
- ⚠️ Binary/image data detected: Immediately proceed to Step 2
- ❌ File not found: Verify path or download first
Binary data indicators:
- Content starts with
header without text extraction%PDF- - Content appears as garbled characters or base64
- Content contains PNG/JPEG markers within PDF wrapper
- File size seems reasonable but no readable text
Step 2: Escalate to run_shell with pdftotext
When read_file returns binary data, do NOT attempt execute_code_sandbox yet. Use run_shell immediately:
Tool: run_shell Command: pdftotext document.pdf document.txt
If pdftotext is not available:
Tool: run_shell Command: apt-get update && apt-get install -y poppler-utils && pdftotext document.pdf document.txt
Then read the extracted text:
Tool: read_file Path: document.txt
Expected outcome: Clean text extraction
If this fails:
- Check if file is password-protected
- Check if file is corrupted (run
)file document.pdf - Proceed to Step 3
Step 3: Final Fallback to execute_code_sandbox with PyMuPDF
Only attempt this if Steps 1-2 fail:
Tool: execute_code_sandbox Language: python Code: | import fitz # PyMuPDF try: doc = fitz.open("document.pdf") text = "" for page in doc: text += page.get_text() doc.close() with open("document_pymupdf.txt", "w") as f: f.write(text) print("SUCCESS: Extracted {} characters".format(len(text))) except Exception as e: print(f"FAILED: {e}")
Then read the result:
Tool: read_file Path: document_pymupdf.txt
Step 4: Quality Verification
Regardless of which method succeeded, verify extraction quality:
- Check text length: Should be proportional to PDF pages (~500-2000 chars per page)
- Check readability: Text should form coherent sentences
- Check for truncation: Look for cut-off words or missing sections
- Compare methods: If multiple methods worked, compare outputs
If quality is poor:
- Try alternative extraction tools (pdfplumber, camelot-py for tables)
- Consider OCR for scanned documents
- Document limitations clearly
Step 5: Graceful Degradation to Domain Knowledge
If all extraction methods fail:
- Document the specific failure mode for each tool attempted
- Extract any partial content that was successfully retrieved
- Supplement missing content from established domain knowledge
- Clearly mark which portions are from source vs. generated from knowledge
- Provide citations for any claimed requirements or specifications
Example degradation note:
NOTE: Source document [path/URL] was inaccessible due to [specific tool failures]. Content below combines partial extraction with established domain knowledge for [topic]. All claims verified against [alternative sources] where possible. Tool Failure Log: - read_file: Returned binary data (no text extraction) - run_shell/pdftotext: Command not available in environment - execute_code_sandbox/PyMuPDF: Sandbox execution failed with [error]
Complete Tool Orchestration Script
# pdf-extract-orchestrator.py # Implements the progressive tool fallback pattern def extract_pdf_text(pdf_path): """ Progressive PDF extraction following tool precedence: 1. read_file (quick check) 2. run_shell + pdftotext (primary extraction) 3. execute_code_sandbox + PyMuPDF (final fallback) """ extraction_log = [] # Step 1: Try read_file print("Step 1: Attempting read_file...") try: content = read_file(pdf_path) if is_binary_or_image_data(content): extraction_log.append("read_file: Returned binary data") # Proceed to Step 2 else: extraction_log.append("read_file: Success") return content, extraction_log except Exception as e: extraction_log.append(f"read_file: Failed - {e}") # Step 2: Try run_shell with pdftotext print("Step 2: Attempting run_shell + pdftotext...") try: run_shell(f"pdftotext {pdf_path} output.txt") content = read_file("output.txt") if content and len(content) > 100: extraction_log.append("run_shell/pdftotext: Success") return content, extraction_log else: extraction_log.append("run_shell/pdftotext: Empty extraction") except Exception as e: extraction_log.append(f"run_shell/pdftotext: Failed - {e}") # Step 3: Try execute_code_sandbox with PyMuPDF print("Step 3: Attempting execute_code_sandbox + PyMuPDF...") try: code = """ import fitz doc = fitz.open("""" + pdf_path + """") text = "" for page in doc: text += page.get_text() doc.close() print(text[:1000]) # Preview """ result = execute_code_sandbox(language="python", code=code) extraction_log.append("execute_code_sandbox/PyMuPDF: Success") return result, extraction_log except Exception as e: extraction_log.append(f"execute_code_sandbox/PyMuPDF: Failed - {e}") # Step 4: All methods failed extraction_log.append("ALL METHODS FAILED - Escalate to domain knowledge") return None, extraction_log def is_binary_or_image_data(content): """Detect if content is binary/image data rather than extracted text""" if not content: return True # Check for PDF header without text extraction if content.startswith("%PDF-"): return True # Check for high ratio of non-printable characters non_printable = sum(1 for c in content if ord(c) < 32 and c not in '\n\r\t') if len(content) > 0 and non_printable / len(content) > 0.1: return True return False
Tool Precedence Decision Tree
┌─────────────────┐ │ Start: PDF │ │ Available? │ └────────┬────────┘ │ ┌────────▼────────┐ │ Step 1: │ │ read_file │ └────────┬────────┘ │ ┌──────────────┼──────────────┐ │ │ │ ┌─────▼─────┐ ┌─────▼─────┐ ┌─────▼─────┐ │ Text │ │ Binary │ │ Error/ │ │ Returned │ │ Data │ │ Not Found│ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ │ │ │ │ ┌────▼─────┐ ┌────▼─────┐ │ │ Step 2: │ │ Download │ │ │ run_shell│ │ or Fix │ │ │ pdftotext│ │ Path │ │ └────┬─────┘ └──────────┘ │ │ │ ┌────▼─────┐ │ │ Success? │ │ └────┬─────┘ │ │ ┌─────▼─────┐ ┌─────▼─────┐ │ Yes │ │ No │ └─────┬─────┘ └─────┬─────┘ │ │ │ ┌────▼─────────┐ │ │ Step 3: │ │ │ execute_ │ │ │ code_sandbox │ │ │ PyMuPDF │ │ └──────────────┘ │ ┌─────▼──────────────────┐ │ Step 4: Quality Check │ │ Step 5: Document │ │ Limitations │ └────────────────────────┘
Best Practices
- Check read_file output immediately: Don't assume it extracted text - verify before proceeding
- Escalate quickly on binary data: Don't waste iterations trying read_file multiple times
- Prefer run_shell over execute_code_sandbox: Shell tools are more reliable for PDF extraction when available
- Log each tool attempt: Document which method succeeded for future reference
- Preserve extraction artifacts: Keep intermediate files for debugging
- Verify extraction quality: Check text length and readability before accepting results
- Document tool failures: When falling back to domain knowledge, specify which tools failed and why
Common Failure Modes by Tool
| Tool | Symptom | Cause | Solution |
|---|---|---|---|
| read_file | Binary PDF data | Tool doesn't extract PDF text | Escalate to run_shell immediately |
| read_file | PNG/JPEG data | PDF contains embedded images | Use OCR tools or request text version |
| run_shell | pdftotext not found | Tool not installed | Install poppler-utils first |
| run_shell | Empty output | Password-protected PDF | Request accessible version |
| execute_code_sandbox | Unknown error | Sandbox execution issue | Try run_shell alternative or document limitation |
| execute_code_sandbox | Import error | PyMuPDF not installed | Include pip install in script |
When to Use This Skill
- PDFs from web downloads: After downloading, apply this extraction workflow
- PDFs already local: Start at Step 1 with existing file path
- Automated document processing: Where reliability matters more than speed
- Regulatory/compliance documents: Where source verification is critical
- Situations with tool uncertainty: When environment capabilities are unknown
Migration from Parent Skill
This skill enhances
pdf-download-extract-fallback by:
- Explicit tool sequencing: Parent described shell commands; this specifies agent tool order
- Binary detection: Parent assumed download success; this checks read_file output quality
- Faster escalation: Parent tried pdftotext then PyMuPDF; this escalates immediately on binary data
- Agent-focused: Parent was shell-script focused; this is optimized for agent tool calls
- Execution insights: Incorporates learnings from failed task 0353ee0c showing read_file limitations