OpenSpace reliable-pdf-extraction

Use shell commands or Python libraries to extract PDF text when the read_file PDF handler fails

install

source · Clone the upstream repo

git clone https://github.com/HKUDS/OpenSpace

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/reliable-pdf-extraction" ~/.claude/skills/hkuds-openspace-reliable-pdf-extraction && rm -rf "$T"

manifest: gdpval_bench/skills/reliable-pdf-extraction/SKILL.md

source content

Reliable PDF Text Extraction

Problem

The `read_file` tool with `filetype='pdf'` often returns binary image data, errors, or unusable output when attempting to extract text from PDF documents. This makes it unreliable for structured data extraction tasks.

Solution

Use `run_shell` with command-line tools (`pdftotext`, `pdfinfo`) or `execute_code_sandbox` with Python libraries (PyMuPDF, pdfplumber) to extract PDF text content reliably.

Methods

Method 1: pdftotext (Recommended for simple extraction)

# Extract all text to stdout
pdftotext input.pdf -

# Or extract to file
pdftotext input.pdf output.txt
cat output.txt
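
When the same extraction has to be driven from Python rather than typed into a shell, a thin wrapper around the command can help. The sketch below is not part of the skill itself: `build_pdftotext_cmd` and `pdftotext` are hypothetical helper names, and it assumes the Poppler build of `pdftotext`, whose `-layout`, `-f` and `-l` options preserve the physical layout and select the first and last page.

```python
import shutil
import subprocess


def build_pdftotext_cmd(path, first_page=None, last_page=None, layout=False):
    """Build the pdftotext argument list (hypothetical helper, Poppler flags assumed)."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # keep the physical layout of the page
    if first_page is not None:
        cmd += ["-f", str(first_page)]  # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]  # last page to convert
    cmd += [path, "-"]  # "-" sends the text to stdout
    return cmd


def pdftotext(path, **options):
    """Run pdftotext on `path` and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        build_pdftotext_cmd(path, **options),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.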

Method 2: pdfinfo (For metadata)

pdfinfo input.pdf
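
The output of `pdfinfo` is a list of `Key: value` lines, so it can be turned into a dictionary for later checks such as the page count. This is a minimal sketch under that assumption; `parse_pdfinfo` and `pdf_metadata` are hypothetical names, and the exact set of fields depends on the PDF and on the installed Poppler version.

```python
import subprocess


def parse_pdfinfo(output):
    """Parse `Key: value` lines from pdfinfo output into a dict of strings."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only: values such as dates contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def pdf_metadata(path):
    """Run pdfinfo on `path` and return its parsed fields."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)
```

A field such as `Pages` can then be read with `int(pdf_metadata("input.pdf")["Pages"])`, assuming the field is present.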

Method 3: Python with PyMuPDF (fitz)

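import fitz  # PyMuPDF

# The context manager closes the document even if extraction fails
with fitz.open("input.pdf") as doc:
    # Join page texts once instead of growing a string inside the loop
    text = "".join(page.get_text() for page in doc)
print(text)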
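
For pages where plain `get_text()` returns text in an awkward order, PyMuPDF can also return positioned blocks via `page.get_text("blocks")`, each a tuple of the form `(x0, y0, x1, y1, text, block_no, block_type)`. The sketch below sorts those blocks top to bottom and then left to right. The helper names are hypothetical, and the simple sort is an assumption that suits single-column pages; it will not reconstruct multi-column reading order.

```python
def blocks_to_text(blocks):
    """Join text blocks top-to-bottom, then left-to-right."""
    # Block layout: (x0, y0, x1, y1, text, block_no, block_type); type 0 is text.
    text_blocks = [b for b in blocks if b[6] == 0 and b[4].strip()]
    text_blocks.sort(key=lambda b: (round(b[1]), round(b[0])))
    return "\n".join(b[4].strip() for b in text_blocks)


def extract_in_reading_order(path):
    """Extract every page of `path`, ordering blocks by position."""
    import fitz  # PyMuPDF, imported lazily so blocks_to_text works without it

    with fitz.open(path) as doc:
        return "\n\n".join(
            blocks_to_text(page.get_text("blocks")) for page in doc
        )
```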

Method 4: Python with pdfplumber (Better for tables/structured data)

import pdfplumber

with pdfplumber.open("input.pdf") as pdf:
    for page in pdf.pages:
        # Pages without a text layer can yield None or an empty string
        text = page.extract_text() or ""
        print(text)
        # For tables:
        # tables = page.extract_tables()
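
The commented-out `extract_tables()` call returns, for each page, a list of tables, where each table is a list of rows and each row a list of cell strings (or `None` for empty cells). One way to hand that data to later steps is to serialize it as CSV. The following is a sketch with hypothetical helper names, not a definitive implementation.

```python
import csv
import io


def tables_to_csv(tables):
    """Serialize tables (lists of rows of cells) as CSV text.

    Tables are separated by one blank line.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for index, table in enumerate(tables):
        if index:
            buffer.write("\n")
        for row in table:
            # Empty cells arrive as None; write them as empty strings.
            writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def extract_tables_as_csv(path):
    """Collect every table in the PDF at `path` and return it as CSV text."""
    import pdfplumber  # imported lazily so tables_to_csv works without it

    with pdfplumber.open(path) as pdf:
        tables = [t for page in pdf.pages for t in page.extract_tables()]
    return tables_to_csv(tables)
```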

Workflow

1. Attempt `read_file` with `filetype='pdf'` first (in case it works).
2. Check the output. Treat any of the following as a failure:
   - Binary/garbage data
   - Error messages
   - Empty or truncated content
   - Image data instead of text
3. Fall back to one of the extraction methods above:
   - Use `pdftotext` via `run_shell` for quick text extraction
   - Use `pdfplumber` via `execute_code_sandbox` for structured data/tables
   - Use `PyMuPDF` for complex layouts or when you need more control
4. Process the extracted text for your task.
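
The check-then-fall-back steps above can be sketched as a small loop over candidate extractors. Everything here is illustrative: `looks_usable`, `extract_with_fallback` and the thresholds are assumptions, and the heuristic only catches empty or mostly non-printable output, not truncated content.

```python
def looks_usable(text, min_chars=20, min_printable_ratio=0.9):
    """Heuristic: enough characters, and most of them printable."""
    if not text or len(text.strip()) < min_chars:
        return False
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return printable / len(text) >= min_printable_ratio


def extract_with_fallback(path, extractors):
    """Try each (name, callable) pair in order; return the first usable result."""
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # any failure just moves on to the next method
            errors.append(f"{name}: {exc}")
            continue
        if looks_usable(text):
            return name, text
        errors.append(f"{name}: output empty or not text")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))
```

Each extractor is any callable that takes a path and returns a string, so the `pdftotext`, pdfplumber and PyMuPDF methods can be plugged in in whatever order suits the task.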

Example Usage

# Via run_shell
result = run_shell(command="pdftotext document.pdf -")
pdf_text = result.stdout

# Via execute_code_sandbox
code = """
import pdfplumber
with pdfplumber.open("/path/to/document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())
"""
result = execute_code_sandbox(code=code)
pdf_text = result.stdout

Tips

- pdftotext is usually the fastest option and is reliable for plain text extraction
- pdfplumber excels at extracting tables and preserving layout
- PyMuPDF offers the most control for complex PDF structures
- Always check whether the PDF is scanned/image-based (it may need OCR tools like `tesseract`)
- Some PDFs have copy protection that may prevent text extraction
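
One rough way to act on the scanned-PDF tip is to look at how much text each page yields: if most pages produce almost nothing, the file is probably image-based and needs OCR. The function name and thresholds below are assumptions chosen for illustration, not values from the skill.

```python
def likely_scanned(page_texts, min_chars_per_page=25, max_empty_ratio=0.5):
    """Guess whether a PDF is image-based from its per-page extracted text.

    `page_texts` holds one string (or None) per page, for example the
    results of pdfplumber's page.extract_text().
    """
    if not page_texts:
        return True  # nothing extracted at all
    empty = sum(
        len((text or "").strip()) < min_chars_per_page for text in page_texts
    )
    return empty / len(page_texts) > max_empty_ratio
```

A `True` result is only a hint that OCR may be needed; a short cover page or a page of figures can trip the heuristic as well.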