OpenSpace reliable-pdf-extraction
Use shell commands or Python libraries to extract PDF text when the read_file PDF handler fails
install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/reliable-pdf-extraction" ~/.claude/skills/hkuds-openspace-reliable-pdf-extraction && rm -rf "$T"
manifest:
gdpval_bench/skills/reliable-pdf-extraction/SKILL.md
source content
Reliable PDF Text Extraction
Problem
The read_file tool with filetype='pdf' often returns binary image data, errors, or unusable output when attempting to extract text from PDF documents. This makes it unreliable for structured data extraction tasks.
Solution
Use run_shell with command-line tools (pdftotext, pdfinfo) or execute_code_sandbox with Python libraries (PyMuPDF, pdfplumber) to extract PDF text content reliably.
Methods
Method 1: pdftotext (Recommended for simple extraction)
# Extract all text to stdout
pdftotext input.pdf -

# Or extract to file
pdftotext input.pdf output.txt
cat output.txt
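When the extraction is driven from Python rather than a shell tool, the same pdftotext call can be wrapped with subprocess. The following is a minimal sketch, assuming the poppler-utils pdftotext binary is on PATH. The helper names build_pdftotext_cmd and pdftotext_to_str are hypothetical and not part of this skill; -f, -l, and -layout are pdftotext's first-page, last-page, and layout-preserving options.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, first_page=None, last_page=None, layout=False):
    """Build the argv list for pdftotext, writing text to stdout ("-")."""
    cmd = ["pdftotext"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def pdftotext_to_str(pdf_path, **options):
    """Run pdftotext and return the extracted text (hypothetical helper)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, **options),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout
```

Passing the command as a list avoids shell quoting problems with file names that contain spaces.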
Method 2: pdfinfo (For metadata)
pdfinfo input.pdf
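pdfinfo prints one "Key: value" pair per line, so its output is easy to turn into a dictionary, for example to read the page count before extracting. The following is a sketch only; parse_pdfinfo is a hypothetical helper and the sample text is made-up illustrative output, not taken from a real file.

```python
def parse_pdfinfo(output):
    """Turn pdfinfo's "Key:   value" lines into a dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as timestamps
        # keep their own colons.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            info[key.strip()] = value.strip()
    return info


# Illustrative output in the shape pdfinfo prints (assumed sample).
sample = """Title:          Quarterly Report
Pages:          12
Encrypted:      no
CreationDate:   Mon Jan  1 10:00:00 2024
"""
info = parse_pdfinfo(sample)
page_count = int(info["Pages"])
```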
Method 3: Python with PyMuPDF (fitz)
import fitz  # PyMuPDF

doc = fitz.open("input.pdf")
text = ""
for page in doc:
    text += page.get_text()
print(text)
doc.close()
Method 4: Python with pdfplumber (Better for tables/structured data)
import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)
        # For tables:
        # tables = page.extract_tables()
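page.extract_tables() returns each table as a list of rows, where every row is a list of cell strings (or None for empty cells). One way to make that easier to work with is to key each row by the header. The following is a sketch that assumes the first row of the table is the header row, which will not hold for every PDF; table_to_records is a hypothetical helper and the example data is invented.

```python
def table_to_records(table):
    """Convert one table (list of rows) into dicts keyed by the header row."""
    if not table:
        return []
    # Assumption: the first row holds the column names.
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


# Illustrative data in the shape pdfplumber returns; not from a real PDF.
example_table = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "9.50"],
    ["Gadget", None, "4.00"],
]
records = table_to_records(example_table)
```

Empty cells are mapped to empty strings so that later code does not need to handle None.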
Workflow
- Attempt read_file with filetype='pdf' first (in case it works)
- Check the output. Treat the attempt as failed if you receive any of the following:
  - Binary/garbage data
  - Error messages
  - Empty or truncated content
  - Image data instead of text
- Fall back to one of the extraction methods above:
  - Use pdftotext via run_shell for quick text extraction
  - Use pdfplumber via execute_code_sandbox for structured data/tables
  - Use PyMuPDF for complex layouts or when you need more control
- Process the extracted text for your task
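The check-then-fall-back steps above can be sketched as a small helper. This is a rough heuristic under stated assumptions, not a definitive test: looks_unusable and extract_with_fallback are hypothetical names, the thresholds are arbitrary, and each extractor is assumed to be a function that takes a path and returns text.

```python
def looks_unusable(text, min_chars=20, max_bad_ratio=0.3):
    """Heuristic check for empty, truncated, or binary-looking output."""
    if text is None:
        return True
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True
    # Count characters that rarely appear in real text: the Unicode
    # replacement character and control characters other than common
    # whitespace (form feed is allowed because pdftotext emits it
    # between pages).
    bad = sum(
        1 for ch in stripped
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\t\n\r\f")
    )
    return bad / len(stripped) > max_bad_ratio


def extract_with_fallback(pdf_path, extractors):
    """Try each extractor in order and return the first usable text."""
    for extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception:
            continue  # treat an error as a failed attempt and move on
        if not looks_unusable(text):
            return text
    raise RuntimeError(f"no extractor produced usable text for {pdf_path}")
```

The extractors list would hold wrappers around whichever methods are available, ordered from cheapest to most involved.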
Example Usage
# Via run_shell
result = run_shell(command="pdftotext document.pdf -")
pdf_text = result.stdout

# Via execute_code_sandbox
code = """
import pdfplumber
with pdfplumber.open("/path/to/document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())
"""
result = execute_code_sandbox(code=code)
pdf_text = result.stdout
Tips
- pdftotext is usually the fastest and most reliable option for plain text extraction
- pdfplumber excels at extracting tables and preserving layout
- PyMuPDF offers the most control for complex PDF structures
- Always check if the PDF is scanned/image-based (may need OCR tools like tesseract)
- Some PDFs have copy protection that may prevent text extraction
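One rough way to check for a scanned or image-based PDF is to look at how much text each page yields: pages that are only images produce little or none. The sketch below assumes you already have a list of per-page strings from any of the methods above. The probably_scanned name and both thresholds are illustrative assumptions and should be tuned for your documents.

```python
def probably_scanned(page_texts, min_chars_per_page=25):
    """Guess whether a PDF is image-based from its per-page text lengths."""
    if not page_texts:
        return True
    empty_pages = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    # Flag the document when most pages carry almost no text layer.
    return empty_pages / len(page_texts) > 0.5


# Per-page text could come from PyMuPDF, for example:
#   page_texts = [page.get_text() for page in fitz.open("input.pdf")]
```

If the check returns True, plain text extraction is unlikely to help and an OCR step is the more promising route.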