OpenSpace reliable-pdf-extraction-ac5f89
Extract PDF text content using shell tools or Python libraries when the read_file PDF handler fails
install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/reliable-pdf-extraction-ac5f89" ~/.claude/skills/hkuds-openspace-reliable-pdf-extraction-ac5f89 && rm -rf "$T"
manifest: gdpval_bench/skills/reliable-pdf-extraction-ac5f89/SKILL.md
Reliable PDF Text Extraction
Problem
The read_file tool with filetype='pdf' can be unreliable for PDF text extraction. It may:
- Return binary image data instead of text
- Fail with errors on certain PDF structures
- Lose formatting or structured content
Solution
Use run_shell with dedicated PDF extraction tools instead of relying on read_file for PDFs.
Methods
Method 1: pdftotext (Recommended)
pdftotext input.pdf output.txt
Or to extract to stdout:
pdftotext input.pdf -
With layout preservation:
pdftotext -layout input.pdf output.txt
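The pdftotext invocations above can also be driven from Python. The following is a minimal sketch, not part of the original skill: the helper names `build_pdftotext_cmd` and `run_pdftotext` are hypothetical, and it assumes pdftotext (from poppler-utils) is on the PATH and emits UTF-8 text. The `-f` and `-l` options select the first and last page to convert.

```python
import subprocess
from typing import List, Optional


def build_pdftotext_cmd(pdf_path: str, layout: bool = False,
                        first_page: Optional[int] = None,
                        last_page: Optional[int] = None) -> List[str]:
    """Build an argument list for pdftotext that writes to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # "-" as the output file sends the extracted text to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def run_pdftotext(pdf_path: str, layout: bool = False, timeout: int = 30) -> str:
    """Run pdftotext and return its stdout; raises if the tool exits non-zero."""
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, layout=layout),
        capture_output=True, text=True, encoding="utf-8", errors="replace",
        timeout=timeout, check=True,
    )
    return result.stdout
```

Passing the command as a list (rather than a shell string) avoids quoting problems with file names that contain spaces.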
Method 2: pdfinfo (Metadata)
pdfinfo input.pdf
Useful for checking page count, dimensions, and PDF properties before extraction.
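To use the pdfinfo output programmatically, its `Key: value` lines can be parsed into a dictionary. This is a rough sketch: `parse_pdfinfo` is a hypothetical helper, and the sample string below is hand-written to resemble pdfinfo output rather than captured from a real file.

```python
from typing import Dict


def parse_pdfinfo(output: str) -> Dict[str, str]:
    """Turn pdfinfo's 'Key:   value' lines into a dictionary."""
    info: Dict[str, str] = {}
    for line in output.splitlines():
        # Split on the first colon only; values such as dates contain colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


# Illustrative text shaped like pdfinfo output (values are made up).
sample = (
    "Title:          Example\n"
    "Pages:          3\n"
    "Page size:      612 x 792 pts (letter)\n"
)
info = parse_pdfinfo(sample)
page_count = int(info.get("Pages", "0"))
```

A page count of 0 or a missing `Pages` key is a reasonable signal to stop before attempting extraction.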
Method 3: Python with PyMuPDF (fitz)
import fitz  # PyMuPDF
doc = fitz.open("input.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()
Method 4: Python with pdfplumber (Tables)
import pdfplumber
with pdfplumber.open("input.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        tables = page.extract_tables()
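Each table returned by `page.extract_tables()` is a list of rows, and cells can be `None` where nothing was detected. One possible way to hand such a table to downstream tools is to serialize it as CSV. In this sketch `table_to_csv` is a hypothetical helper, and `example_table` is a hand-written stand-in for a real extraction result.

```python
import csv
import io
from typing import List, Optional

Table = List[List[Optional[str]]]


def table_to_csv(table: Table) -> str:
    """Serialize one extracted table to CSV text, mapping None cells to ''."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell.strip() for cell in row])
    return buffer.getvalue()


# Stand-in for one entry of page.extract_tables(); not real extracted data.
example_table = [["Item", "Qty"], ["Widget", "2"], ["Gadget", None]]
csv_text = table_to_csv(example_table)
```

Using the `csv` module rather than joining on commas keeps cells that contain commas or line breaks correctly quoted.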
Workflow
- Check PDF exists and is readable:
  pdfinfo input.pdf 2>/dev/null || echo "PDF not accessible"
- Extract text using pdftotext:
  pdftotext -layout input.pdf - > extracted_text.txt
- If pdftotext fails, try Python fallback:
  import fitz
  doc = fitz.open("input.pdf")
  for i, page in enumerate(doc):
      print(f"--- Page {i+1} ---")
      print(page.get_text())
  doc.close()
- Verify extraction succeeded:
  - Check output is non-empty
  - Verify text is readable (not binary/garbled)
  - Confirm expected content is present
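The verification step can be sketched as a small function that applies the three checks in order. The name `validate_extraction`, the 0.9 printable-character threshold, and the case-insensitive phrase match are all assumptions chosen for illustration, not values taken from the skill.

```python
from typing import Iterable, List


def validate_extraction(text: str, expected: Iterable[str] = (),
                        min_printable_ratio: float = 0.9) -> List[str]:
    """Return a list of problems found; an empty list means the text passed."""
    stripped = text.strip()
    if not stripped:
        return ["output is empty"]
    problems: List[str] = []
    # Count printable characters and ordinary whitespace; the Unicode
    # replacement character is treated as a sign of a decoding failure.
    readable = sum(
        1 for ch in stripped
        if (ch.isprintable() and ch != "\ufffd") or ch in "\n\r\t\f"
    )
    if readable / len(stripped) < min_printable_ratio:
        problems.append("output looks binary or garbled")
    lowered = stripped.lower()
    for phrase in expected:
        if phrase.lower() not in lowered:
            problems.append(f"missing expected content: {phrase!r}")
    return problems
```

For example, `validate_extraction(text, expected=["Invoice"])` returns an empty list only when the text is non-empty, mostly printable, and mentions "invoice" somewhere.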
When to Use
| Tool | Best For |
|---|---|
| pdftotext | Fast, simple text extraction |
| pdftotext -layout | Preserving spacing/formatting |
| PyMuPDF | Complex PDFs, programmatic access |
| pdfplumber | Tables and structured data |
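Which tool to reach for also depends on what is installed. The sketch below probes for the command-line tools and Python packages named in the Methods section and then picks one. `available_backends` and `choose_backend` are hypothetical helpers, and the preference order (pdfplumber for tables, then pdftotext, then PyMuPDF) is one reasonable reading of the guidance above rather than a rule from the skill.

```python
import importlib.util
import shutil
from typing import Dict, Optional


def available_backends() -> Dict[str, bool]:
    """Report which extraction tools appear to be installed."""
    return {
        "pdftotext": shutil.which("pdftotext") is not None,
        "pdfinfo": shutil.which("pdfinfo") is not None,
        # PyMuPDF is importable as "fitz", as in the examples above.
        "pymupdf": importlib.util.find_spec("fitz") is not None,
        "pdfplumber": importlib.util.find_spec("pdfplumber") is not None,
    }


def choose_backend(need_tables: bool, available: Dict[str, bool]) -> Optional[str]:
    """Pick a backend name, or None if nothing usable is installed."""
    if need_tables and available.get("pdfplumber"):
        return "pdfplumber"
    # Assumed preference order for plain text extraction.
    for name in ("pdftotext", "pymupdf", "pdfplumber"):
        if available.get(name):
            return name
    return None
```

Calling `choose_backend(need_tables=True, available=available_backends())` returns a name the caller can dispatch on, or `None` when the install step in the Notes is still needed.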
Example Integration
# In your agent workflow, prefer this pattern:
result = run_shell(command="pdftotext document.pdf -", timeout=30)
if result.stdout and len(result.stdout.strip()) > 0:
    content = result.stdout
else:
    # Fallback to Python
    content = execute_python_to_extract_pdf("document.pdf")
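The same pattern generalizes to more than two extractors. Below is a minimal sketch of a fallback chain that tries each extractor in turn and records why earlier ones were skipped. `extract_with_fallback` is a hypothetical helper, and the three stub extractors exist only to make the example runnable without any PDF tools installed; in practice each would wrap pdftotext, PyMuPDF, or pdfplumber.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Extractor = Callable[[str], str]


def extract_with_fallback(
    pdf_path: str, extractors: Sequence[Tuple[str, Extractor]]
) -> Tuple[Optional[str], str, List[str]]:
    """Try each extractor in order; return (winning name, text, errors)."""
    errors: List[str] = []
    for name, extractor in extractors:
        try:
            text = extractor(pdf_path)
        except Exception as exc:  # e.g. missing tool or unreadable PDF
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text, errors
        errors.append(f"{name}: empty output")
    return None, "", errors


# Stub extractors standing in for real tools (hypothetical behaviour).
def failing(_path: str) -> str:
    raise RuntimeError("tool not installed")


def empty(_path: str) -> str:
    return "  \n"


def working(_path: str) -> str:
    return "page one text"


name, text, errors = extract_with_fallback(
    "document.pdf", [("first", failing), ("second", empty), ("third", working)]
)
```

Keeping the error list alongside the result makes it easier to report why the preferred tool was not used.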
Notes
- Install tools if needed: apt-get install poppler-utils (for pdftotext/pdfinfo)
- Python libraries: pip install pymupdf pdfplumber
- Some PDFs are image-scanned and require OCR (Tesseract) instead
- Always validate extracted content before proceeding with downstream tasks
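One rough way to spot image-scanned pages is to look for pages that yield almost no text. The sketch below assumes the text came from pdftotext with its default page breaks, where pages are separated by form feed characters (`\f`). The helper name `pages_needing_ocr` and the 20-character threshold are assumptions for illustration.

```python
from typing import List


def pages_needing_ocr(pdftotext_output: str, min_chars: int = 20) -> List[int]:
    """Return 1-based page numbers whose extracted text is nearly empty."""
    pages = pdftotext_output.split("\f")
    # A form feed after the final page leaves an empty trailing chunk; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return [
        number
        for number, page in enumerate(pages, start=1)
        if len(page.strip()) < min_chars
    ]
```

Pages flagged this way are candidates for OCR; a page that is intentionally blank will be flagged too, so the result is a hint rather than a verdict.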