OpenSpace pdf-text-extraction-fallback-85d5ca
Fallback workflow for extracting text from PDFs when read_file returns binary data
install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/pdf-text-extraction-fallback-85d5ca" ~/.claude/skills/hkuds-openspace-pdf-text-extraction-fallback-85d5ca && rm -rf "$T"
manifest:
gdpval_bench/skills/pdf-text-extraction-fallback-85d5ca/SKILL.md
source content
PDF Text Extraction Fallback
Use this skill when read_file returns binary data or garbled content for PDF files instead of readable text. This workflow provides a reliable fallback using command-line PDF tools.
When to Use
- read_file with filetype: pdf returns binary data, unreadable characters, or errors
- You need to extract text from a PDF to process its contents
- Standard file reading methods fail to extract usable text
Step-by-Step Instructions
Step 1: Detect Binary/Unreadable PDF Output
After attempting to read a PDF with read_file, check whether the output is:
- Binary data (contains null bytes, non-printable characters)
- Garbled text with many special characters
- Empty or truncated content
# Example of problematic output from read_file
%PDF-1.4
1 0 obj
<< /Type /Catalog ...
If the output looks like raw PDF structure or binary, proceed to Step 2.
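The checks in this step can be sketched as a small classifier. This is only an illustration: the function name classify_read_output and its return labels are hypothetical, and the 0.3 ratio is borrowed from the Code Example section rather than from any tool's documentation.

```python
def classify_read_output(text: str) -> str:
    """Label read_file output as 'empty', 'raw_pdf', 'binary', 'garbled', or 'ok'."""
    if not text or not text.strip():
        return "empty"
    # PDF files begin with the "%PDF-" header, so seeing it means raw file content.
    if text.lstrip().startswith("%PDF-"):
        return "raw_pdf"
    # Null bytes do not occur in ordinary extracted text.
    if "\x00" in text:
        return "binary"
    # Count control characters and U+FFFD replacement characters;
    # newlines, carriage returns, and tabs are normal in text.
    bad = sum(
        1
        for c in text
        if c == "\ufffd" or (not c.isprintable() and c not in "\n\r\t")
    )
    if bad / len(text) > 0.3:  # assumed threshold
        return "garbled"
    return "ok"
```

Any label other than "ok" would mean proceeding to Step 2.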
Step 2: Use shell_agent with PDF Tools
Invoke shell_agent to extract text using pdftotext (preferred) or pdfplumber (Python fallback):
Task: Extract all text content from <filename.pdf> using pdftotext or pdfplumber. Output the extracted text in readable format. If pdftotext is not available, use Python with pdfplumber library.
Example shell_agent invocation:
shell_agent task="Extract text from Move_Out_Inspection_Tracker.pdf using pdftotext. Save output to a .txt file and return the content."
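For reference, here is a minimal sketch of what the shell_agent task might run under the hood, assuming a Python environment. The helper names are hypothetical. It relies on pdftotext's -layout option and on "-" as the output file (which writes to stdout), and on pdfplumber's open() and Page.extract_text(). pdfplumber is a third-party package and may not be installed.

```python
import shutil
import subprocess
from typing import List, Optional


def build_pdftotext_command(pdf_path: str) -> List[str]:
    # "-layout" preserves the physical layout; "-" sends the text to stdout.
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_with_pdftotext(pdf_path: str) -> Optional[str]:
    """Return the extracted text, or None if pdftotext is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None
    proc = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    return proc.stdout if proc.returncode == 0 else None


def extract_with_pdfplumber(pdf_path: str) -> Optional[str]:
    """Return the extracted text, or None if pdfplumber is not installed."""
    try:
        import pdfplumber  # third-party: pip install pdfplumber
    except ImportError:
        return None
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def extract_text(pdf_path: str) -> Optional[str]:
    """Try pdftotext first, then fall back to pdfplumber."""
    text = extract_with_pdftotext(pdf_path)
    if text is None or not text.strip():
        text = extract_with_pdfplumber(pdf_path)
    return text
```

Returning None instead of raising when a tool is missing keeps the fallback order explicit in extract_text.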
Step 3: Validate Extracted Content
After extraction, validate that the content contains expected text patterns:
# Validation checklist
def validate_pdf_extraction(text, expected_patterns=None):
    checks = [
        bool(text.strip()),            # Not empty
        len(text) > 50,                # Has substantial content
        not text.startswith('%PDF'),   # Not raw PDF structure
    ]
    if expected_patterns:
        for pattern in expected_patterns:
            checks.append(pattern.lower() in text.lower())
    return all(checks)
Common expected patterns to check:
- Document-specific keywords (e.g., "inspection", "resident", "date")
- Expected data formats (dates, names, IDs)
- Minimum word count threshold
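The pattern checks listed above could be combined into a report that also records which keywords are missing, which is useful for Step 4. This is a sketch under stated assumptions: check_extraction, DATE_RE, and the default of 20 words are hypothetical, and the date pattern only covers 03/14/2024 and 2024-03-14 style dates.

```python
import re
from typing import Dict, Iterable

# Assumed date formats: M/D/YYYY (or 2-digit year) and ISO YYYY-MM-DD.
DATE_RE = re.compile(r"\b(?:\d{1,2}/\d{1,2}/\d{2,4}|\d{4}-\d{2}-\d{2})\b")


def check_extraction(
    text: str,
    keywords: Iterable[str] = (),
    min_words: int = 20,
    require_date: bool = False,
) -> Dict[str, object]:
    """Report which expected patterns are present in the extracted text."""
    lowered = text.lower()
    # Case-insensitive keyword check, as in validate_pdf_extraction.
    missing = [k for k in keywords if k.lower() not in lowered]
    enough_words = len(text.split()) >= min_words
    has_date = bool(DATE_RE.search(text))
    return {
        "missing_keywords": missing,
        "enough_words": enough_words,
        "has_date": has_date,
        "ok": not missing and enough_words and (has_date or not require_date),
    }
```

Unlike a single boolean, the report shows why validation failed, for example which document-specific keywords were not found.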
Step 4: Handle Extraction Failures
If validation fails:
- Try alternative tool: If pdftotext failed, try pdfplumber:
  shell_agent task="Extract text from <file.pdf> using Python pdfplumber library. Handle any encoding issues."
- Try OCR fallback: For scanned PDFs:
  shell_agent task="This PDF may be scanned. Use pytesseract or similar OCR tool to extract text from <file.pdf>."
- Report specific error: Document what patterns were expected but not found.
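The failure handling above amounts to trying extractors in order and recording why each one failed. A minimal sketch follows. The names run_fallback_chain and ocr_extract are hypothetical. The OCR helper assumes the third-party pdf2image and pytesseract packages (which in turn need Poppler and the tesseract binary), and pytesseract's default English language model.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Extractor = Callable[[str], Optional[str]]


def ocr_extract(pdf_path: str) -> Optional[str]:
    """OCR fallback for scanned PDFs; returns None if the OCR stack is missing."""
    try:
        from pdf2image import convert_from_path  # third-party, needs Poppler
        import pytesseract  # third-party, needs the tesseract binary
    except ImportError:
        return None
    pages = convert_from_path(pdf_path)  # one PIL image per page
    return "\n".join(pytesseract.image_to_string(img) for img in pages)


def run_fallback_chain(
    pdf_path: str,
    extractors: Sequence[Tuple[str, Extractor]],
    is_valid: Callable[[str], bool],
) -> Tuple[Optional[str], List[str]]:
    """Try each extractor in order; return (text, failure notes)."""
    failures: List[str] = []
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception as exc:  # a tool crashed or is not usable
            failures.append(f"{name}: {exc}")
            continue
        if text and is_valid(text):
            return text, failures
        failures.append(f"{name}: output failed validation")
    # Every extractor failed; the notes say what to report.
    return None, failures
```

The returned failure notes cover the "Report specific error" item: they name each tool that was tried and why its output was rejected.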
Step 5: Proceed with Data Processing
Once validated text is obtained:
- Parse the extracted text for required data
- Store or process the content as needed
- Continue with the original task workflow
Code Example
# Complete extraction workflow.
# read_file and shell_agent are the agent's tools, written here as if they
# were Python callables; validate_pdf_extraction is defined in Step 3.
def extract_pdf_text_fallback(pdf_path, expected_patterns=None):
    """Extract text from PDF with fallback handling."""
    # Step 1: Try read_file first
    content = read_file(filetype="pdf", file_path=pdf_path)

    # Step 2: Check if binary/unreadable
    if is_binary_or_garbled(content):
        # Step 3: Use shell_agent fallback
        result = shell_agent(
            task=f"Extract all text from {pdf_path} using pdftotext. Return the text content."
        )
        content = result.stdout

    # Step 4: Validate
    if not validate_pdf_extraction(content, expected_patterns):
        # Try pdfplumber as secondary fallback
        result = shell_agent(
            task=f"Extract text from {pdf_path} using Python pdfplumber library."
        )
        content = result.stdout

    return content


def is_binary_or_garbled(text):
    """Check if text appears to be binary or unreadable."""
    if not text:
        return True
    if text.startswith('%PDF'):
        return True
    # Check for a high ratio of non-printable characters. Newlines, carriage
    # returns, and tabs are normal in text, and non-ASCII characters are not
    # treated as garbage on their own.
    non_printable = sum(
        1 for c in text if not c.isprintable() and c not in '\n\r\t'
    )
    return non_printable / len(text) > 0.3
Tips
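- pdftotext ships with Poppler (the poppler-utils package on Debian and Ubuntu); it is typically faster and is often pre-installed on Linux systems, but may need to be installed elsewhere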
- pdfplumber handles complex layouts better but requires Python
- For scanned PDFs, you'll need OCR tools (tesseract, pytesseract); pytesseract works on images, so render the PDF pages to images first
- Always validate extracted content matches your expected data patterns
- Save intermediate extraction results for debugging if needed
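Saving intermediate results can be as simple as writing each tool's output to its own text file. A small sketch; the function name save_intermediate, the extraction_debug directory, and the file naming scheme are all hypothetical choices.

```python
from pathlib import Path


def save_intermediate(
    text: str, pdf_path: str, stage: str, out_dir: str = "extraction_debug"
) -> Path:
    """Write one extraction attempt to <out_dir>/<pdf stem>.<stage>.txt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    target = out / f"{Path(pdf_path).stem}.{stage}.txt"
    target.write_text(text, encoding="utf-8")
    return target
```

Calling it once per tool (for example with stage set to "pdftotext" or "pdfplumber") leaves one file per attempt to compare when validation fails.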