OpenSpace pdf-read-file-fallback
Extract text from PDFs using pdftotext when read_file returns binary data
install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/pdf-read-file-fallback" ~/.claude/skills/hkuds-openspace-pdf-read-file-fallback && rm -rf "$T"
manifest:
gdpval_bench/skills/pdf-read-file-fallback/SKILL.mdsource content
PDF Text Extraction Fallback
When to Use
Use this pattern when
read_file with filetype="pdf" returns binary image data instead of extractable text content. This commonly occurs with PDFs that contain scanned images or complex formatting.
Steps
1. Attempt Primary Extraction
First, try using
read_file:
result = read_file(file_path="document.pdf", filetype="pdf")
Important: Use
filetype (not file_type) - incorrect parameter naming will cause execution failures.
2. Detect Binary/Image Data
Check if the result contains unusable content:
# Indicators of binary/image data: # - Contains null bytes: '\x00' # - Very short or empty # - Contains image markers (PNG/JPEG headers) # - Unreadable character sequences if not result or len(result) < 50 or '\x00' in str(result): # Proceed to fallback
3. Use pdftotext via run_shell
Extract text using the
pdftotext command-line tool:
shell_result = run_shell(command="pdftotext -layout document.pdf -") text_content = shell_result.stdout
The
- flag outputs to stdout for easy capture. The -layout flag preserves original formatting.
4. Handle pdftotext Unavailable
If
pdftotext is not installed, try Python-based extraction:
result = execute_code_sandbox(code=""" import pdfplumber text = '' with pdfplumber.open('document.pdf') as pdf: for page in pdf.pages: extracted = page.extract_text() if extracted: text += extracted + '\\n' print(text) """)
Complete Example
file_path = "report.pdf" # Primary attempt result = read_file(file_path=file_path, filetype="pdf") # Validate and fallback if needed if not result or len(str(result)) < 100 or '\x00' in str(result): # Fallback to pdftotext shell_result = run_shell(command=f"pdftotext -layout {file_path} -") text_content = shell_result.stdout # If pdftotext fails, try Python extraction if not text_content or len(text_content) < 50: code_result = execute_code_sandbox(code=f""" import pdfplumber text = '' with pdfplumber.open('{file_path}') as pdf: for page in pdf.pages: extracted = page.extract_text() if extracted: text += extracted + '\\n' print(text) """) text_content = code_result
Notes
is part of thepdftotext
package on most Linux systemspoppler-utils- For scanned/image-only PDFs, consider OCR tools (tesseract) instead
- Always validate parameter names against tool documentation (
vsfiletype
)file_type