OpenSpace pdf-text-extraction
Extract text from PDFs using shell tools when read_file fails
install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/pdf-text-extraction" ~/.claude/skills/hkuds-openspace-pdf-text-extraction && rm -rf "$T"
manifest:
gdpval_bench/skills/pdf-text-extraction/SKILL.md
PDF Text Extraction (Fallback Method)
When to Use This Skill
Use this skill when read_file with filetype='pdf':
- Returns binary image data instead of text
- Produces errors or incomplete content
- Fails to extract structured data reliably
The built-in PDF handler is unreliable for structured text extraction. Shell-based tools provide more robust alternatives.
Available Methods
Method 1: pdftotext (Recommended)
# Extract text from PDF to stdout
pdftotext /path/to/file.pdf -
# Extract text to a file
pdftotext /path/to/file.pdf output.txt
# Preserve layout (maintains spacing/structure)
pdftotext -layout /path/to/file.pdf output.txt
Usage in agent:
run_shell command="pdftotext -layout /path/to/document.pdf -"
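When the extraction is driven from a Python script rather than a single run_shell call, the same pdftotext invocation can be wrapped as below. This is a minimal sketch: the function names `build_pdftotext_command` and `extract_with_pdftotext` are hypothetical, and it assumes that returning None is an acceptable signal for "tool missing or extraction failed".

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=True):
    """Build the argv list for pdftotext, writing text to stdout ('-')."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the original spacing/structure
    cmd += [pdf_path, "-"]
    return cmd


def extract_with_pdftotext(pdf_path, layout=True):
    """Return extracted text, or None if pdftotext is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None  # tool not installed; the caller can try another method
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # do not crash on unexpected bytes in the output
    )
    return result.stdout if result.returncode == 0 else None
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.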
Method 2: pdfinfo (Metadata)
# Get PDF metadata (pages, author, creation date, etc.)
pdfinfo /path/to/file.pdf
Usage in agent:
run_shell command="pdfinfo /path/to/document.pdf"
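pdfinfo prints one "Key: value" pair per line, so its output can be turned into a dictionary before deciding how to process the file (for example, by page count). The sketch below is illustrative only: `parse_pdfinfo` is a hypothetical helper and the sample string is made up, not captured from a real PDF.

```python
def parse_pdfinfo(output):
    """Turn 'Key:   value' lines from pdfinfo into a dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so timestamps keep their colons.
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            info[key.strip()] = value.strip()
    return info


# Illustrative sample, not output from a real file.
sample = (
    "Title:          Example Form\n"
    "Pages:          3\n"
    "CreationDate:   Mon Jan  1 10:00:00 2024\n"
)
info = parse_pdfinfo(sample)
page_count = int(info.get("Pages", 0))
```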
Method 3: Python with PyMuPDF (fitz)
import fitz  # PyMuPDF
doc = fitz.open("/path/to/file.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()
print(text)
Usage in agent:
run_shell command="python3 -c \"import fitz; doc=fitz.open('file.pdf'); print(''.join(p.get_text() for p in doc))\""
Method 4: Python with pdfplumber (Tables)
import pdfplumber
with pdfplumber.open("/path/to/file.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        tables = page.extract_tables()  # For tabular data
Usage in agent:
run_shell command="python3 -c \"import pdfplumber; pdf=pdfplumber.open('file.pdf'); print(''.join(p.extract_text() or '' for p in pdf.pages))\""
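Each table returned by page.extract_tables() is a list of rows, and each row is a list of cell strings in which empty cells may be None. One way to hand such a table to downstream tools is to serialize it as CSV. The helper name `table_to_csv` and the sample table are hypothetical; the sketch needs only the standard library.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (list of rows, cells may be None) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Replace missing cells with empty strings so columns stay aligned.
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hypothetical data in the shape extract_tables() yields for one table.
example_table = [["Item", "Qty"], ["Widget", "2"], ["Gadget", None]]
csv_text = table_to_csv(example_table)
```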
Workflow
- Try pdftotext first - Fastest and most reliable for plain text
  run_shell command="pdftotext -layout /path/to/file.pdf -"
- If pdftotext unavailable, check for Python libraries
  run_shell command="python3 -c \"import fitz; print('PyMuPDF available')\""
- For table/structured data, use pdfplumber
  run_shell command="python3 -c \"import pdfplumber; ...\""
- Verify extraction succeeded - Check output contains readable text, not binary data
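The verification step can be approximated with a simple heuristic: reject output that is nearly empty or that consists mostly of non-printable characters. The function name `looks_like_text` and both thresholds are assumptions chosen for illustration, not values from the skill itself.

```python
def looks_like_text(data, min_chars=20, min_printable_ratio=0.9):
    """Heuristic check that extracted output is readable text, not binary."""
    if isinstance(data, bytes):
        data = data.decode("utf-8", errors="replace")
    stripped = data.strip()
    if len(stripped) < min_chars:
        return False  # empty or near-empty output
    # Count printable characters plus common layout whitespace; pdftotext
    # separates pages with form feeds. U+FFFD marks undecodable bytes.
    printable = sum(
        ch != "\ufffd" and (ch.isprintable() or ch in "\n\t\r\f")
        for ch in stripped
    )
    return printable / len(stripped) >= min_printable_ratio
```

A scanned PDF with no text layer typically fails the length check, which is a cue to switch to OCR rather than retry the same tool.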
Installation Notes
If tools are not available:
# Ubuntu/Debian
apt-get install poppler-utils  # pdftotext, pdfinfo
# Install Python libraries
pip install pymupdf pdfplumber
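Before installing anything, it can help to check which backends are already present. The sketch below uses only the standard library. `available_extractors` is a hypothetical name, and the module name `fitz` is taken from the import used in this document's PyMuPDF examples.

```python
import importlib.util
import shutil


def available_extractors():
    """Report which extraction backends are present on this machine."""
    return {
        "pdftotext": shutil.which("pdftotext") is not None,
        "pdfinfo": shutil.which("pdfinfo") is not None,
        # PyMuPDF is imported as 'fitz' in this document's examples.
        "pymupdf": importlib.util.find_spec("fitz") is not None,
        "pdfplumber": importlib.util.find_spec("pdfplumber") is not None,
    }


status = available_extractors()
```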
Anti-Patterns to Avoid
- Do NOT rely solely on read_file with filetype='pdf' for critical text extraction
- Do NOT assume PDF text is in any particular order - verify extracted content
- Do NOT use image-based extraction unless the PDF is scanned; for scanned PDFs, use OCR
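To illustrate the ordering caveat: PyMuPDF's page.get_text("blocks") returns tuples that begin with (x0, y0, x1, y1, text), and the blocks are not guaranteed to arrive in reading order. A rough way to reorder them is to sort top-to-bottom, then left-to-right. The helper name and sample coordinates below are hypothetical, and a plain sort like this does not handle multi-column layouts.

```python
def sort_blocks_reading_order(blocks):
    """Order text blocks top-to-bottom, then left-to-right.

    Each block is a tuple whose first five items are (x0, y0, x1, y1, text).
    """
    # Round y0 so blocks on nearly the same line sort by their x position.
    return sorted(blocks, key=lambda b: (round(b[1]), b[0]))


# Hypothetical blocks, deliberately out of reading order.
sample_blocks = [
    (50.0, 200.0, 300.0, 220.0, "Second paragraph"),
    (50.0, 100.0, 300.0, 120.0, "Title"),
    (320.0, 100.0, 500.0, 120.0, "Date"),
]
ordered_text = [b[4] for b in sort_blocks_reading_order(sample_blocks)]
```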
Example: Complete Extraction Pattern
# Step 1: Try pdftotext
RESULT=$(pdftotext -layout /path/to/form.pdf - 2>/dev/null)
# Step 2: Verify we got text, not error
if [ -z "$RESULT" ]; then
    # Fallback to Python
    RESULT=$(python3 -c "import fitz; doc=fitz.open('/path/to/form.pdf'); print(''.join(p.get_text() for p in doc))")
fi
# Step 3: Use extracted text
echo "$RESULT"