OpenSpace pdf-text-extraction-fallback
Extract text from PDFs using pdftotext when read_file returns binary data
install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/pdf-text-extraction-fallback" ~/.claude/skills/hkuds-openspace-pdf-text-extraction-fallback && rm -rf "$T"
manifest: gdpval_bench/skills/pdf-text-extraction-fallback/SKILL.md
PDF Text Extraction Fallback
When to Use This Skill
Use this skill when read_file with filetype="pdf" returns binary/image data instead of readable text content. This is a common issue with PDF files that contain embedded images or complex formatting.
Steps
1. Validate Parameters First
Before attempting extraction, ensure you're using the correct parameter name:
- Use filetype (not file_type) for the read_file function
- Incorrect parameter names can cause silent failures
    # Correct
    read_file(filetype="pdf", file_path="document.pdf")

    # Incorrect - may fail silently
    read_file(file_type="pdf", file_path="document.pdf")
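The parameter check above can be automated before the call is made. The following is a minimal sketch, not part of the skill itself: check_param_names is a hypothetical helper, and the allowed names are assumed from the read_file examples in this document rather than from any published signature.

```python
import difflib

# Assumption: these are the only parameter names read_file accepts,
# taken from the examples in this skill.
READ_FILE_PARAMS = ("filetype", "file_path")


def check_param_names(kwargs, allowed=READ_FILE_PARAMS):
    """Return a list of problems found in the keyword names of a planned call."""
    problems = []
    for name in kwargs:
        if name not in allowed:
            # Suggest the closest known name, if there is one.
            close = difflib.get_close_matches(name, allowed, n=1)
            hint = f" (did you mean {close[0]!r}?)" if close else ""
            problems.append(f"unknown parameter {name!r}{hint}")
    return problems


if __name__ == "__main__":
    for problem in check_param_names({"file_type": "pdf", "file_path": "document.pdf"}):
        print(problem)
```

An empty list means every name matched, so a binary-looking result is more likely a real extraction problem than a misspelled parameter.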
2. Detect Binary Data Issue
After calling read_file, check if the result contains:
- Garbled/binary characters
- Image data representations (e.g., b'...' byte strings with non-text content)
- Unreadable or corrupted-looking content
If yes, proceed with the pdftotext workaround.
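One way to make this check concrete is a small heuristic. This is only a sketch: looks_like_binary is a hypothetical helper, and the 10% threshold is an arbitrary assumption that may need tuning for your documents.

```python
def looks_like_binary(content, threshold=0.10):
    """Heuristically decide whether content is binary rather than readable text."""
    if isinstance(content, (bytes, bytearray)):
        content = bytes(content).decode("utf-8", errors="replace")
    if not content:
        return False
    # A raw PDF header means the file bytes came back instead of extracted text.
    if content.lstrip().startswith("%PDF-"):
        return True
    allowed_controls = "\n\r\t\f"
    # Count control characters and U+FFFD replacement characters.
    suspicious = sum(
        1
        for ch in content
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in allowed_controls)
    )
    return suspicious / len(content) > threshold
```

The heuristic deliberately treats form feeds as normal text, since pdftotext output uses them as page separators.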
3. Extract Text via pdftotext
Use run_shell to call pdftotext, which extracts text directly from PDF files:
    # Extract text to stdout
    result = run_shell(command="pdftotext /path/to/document.pdf -")
    text_content = result.stdout
The trailing - argument, given in place of an output filename, tells pdftotext to write to stdout instead of creating a file.
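Outside an agent environment, the same call can be sketched with Python's standard subprocess module. This is an illustration under assumptions, not the skill's own tooling: build_pdftotext_command and extract_text are hypothetical names, and the code assumes pdftotext is on the PATH.

```python
import subprocess


def build_pdftotext_command(pdf_path, layout=False):
    """Return the argument list for writing extracted text to stdout."""
    command = ["pdftotext"]
    if layout:
        # -layout asks pdftotext to keep the physical page layout.
        command.append("-layout")
    # The final "-" stands in for the output file and selects stdout.
    command += [pdf_path, "-"]
    return command


def extract_text(pdf_path, layout=False):
    """Run pdftotext and return its stdout as a string."""
    completed = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout
```

Passing the command as a list avoids shell quoting problems when the PDF path contains spaces.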
4. Handle Output and Errors
    result = run_shell(command="pdftotext /path/to/document.pdf -")

    if result.stderr:
        # Check for errors like "pdftotext not found"
        # May need to install poppler-utils
        pass

    text_content = result.stdout
    # text_content now contains the extracted text
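Non-empty stderr does not always mean failure, because pdftotext can print warnings while still producing text. If the exit status is available, it is a more reliable signal. The sketch below assumes the caller can obtain a return code (the run_shell result may or may not expose one); check_pdftotext_result is a hypothetical helper, the status meanings are those listed in the pdftotext man page, and 127 is the status POSIX shells report when a command is not found.

```python
# Exit statuses as listed in the pdftotext man page.
PDFTOTEXT_EXIT_MEANINGS = {
    1: "error opening the PDF file",
    2: "error opening the output file",
    3: "error related to PDF permissions",
    99: "other error",
}


def check_pdftotext_result(returncode, stdout, stderr=""):
    """Return stdout on success; raise RuntimeError with a readable reason otherwise."""
    if returncode == 0:
        # Warnings on stderr are tolerated when the exit status is 0.
        return stdout
    if returncode == 127:
        reason = "pdftotext not found; install poppler-utils"
    else:
        reason = PDFTOTEXT_EXIT_MEANINGS.get(returncode, "unknown error")
    message = f"pdftotext failed ({returncode}): {reason}"
    detail = stderr.strip()
    if detail:
        message += f" [{detail}]"
    raise RuntimeError(message)
```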
Example Workflow
    # Step 1: Try a normal read with the correct parameters
    content = read_file(filetype="pdf", file_path="reference.pdf")
    text_content = content

    # Step 2: Check if content is readable
    # (looks_like_binary is a placeholder for your own check)
    if not content or looks_like_binary(content):
        # Step 3: Fall back to pdftotext
        result = run_shell(command="pdftotext reference.pdf -")
        text_content = result.stdout

        # Step 4: Verify extraction succeeded
        if result.stderr:
            # Handle error (e.g., install pdftotext)
            pass
Installation Notes
pdftotext is part of the poppler-utils package:
- Debian/Ubuntu: apt-get install poppler-utils
- macOS: brew install poppler
- Many Linux environments: Pre-installed
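Before falling back, it can help to confirm that pdftotext is actually installed. A minimal sketch using the standard library follows; pdftotext_path and install_hint are hypothetical helpers, and the Linux hint assumes a Debian/Ubuntu system.

```python
import shutil
import sys


def pdftotext_path():
    """Return the full path to pdftotext, or None if it is not on the PATH."""
    return shutil.which("pdftotext")


def install_hint(platform=sys.platform):
    """Return the install command from the notes above for the given platform."""
    if platform == "darwin":
        return "brew install poppler"
    # Assumption: any other platform is treated as Debian/Ubuntu.
    return "apt-get install poppler-utils"


if __name__ == "__main__":
    path = pdftotext_path()
    print(path if path else f"pdftotext missing; try: {install_hint()}")
```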
Alternative: Output to File
If the stdout approach has issues, write the output to a temporary file:
    run_shell(command="pdftotext /path/to/document.pdf /tmp/output.txt")
    text_content = read_file(filetype="txt", file_path="/tmp/output.txt")
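A fixed name such as /tmp/output.txt can collide when several extractions run at once. The following sketch shows one way to use a unique temporary file instead; extract_via_tempfile is a hypothetical helper, and the run parameter exists only so the subprocess call can be swapped out for testing.

```python
import os
import subprocess
import tempfile


def extract_via_tempfile(pdf_path, run=subprocess.run):
    """Write pdftotext output to a unique temporary file, then read it back."""
    # mkstemp creates the file securely and avoids a fixed /tmp name.
    fd, out_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        run(["pdftotext", pdf_path, out_path], check=True)
        with open(out_path, encoding="utf-8", errors="replace") as handle:
            return handle.read()
    finally:
        # Clean up whether or not extraction succeeded.
        os.remove(out_path)
```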
Best Practices
- Always validate the filetype parameter spelling before troubleshooting
- Check both stdout and stderr from pdftotext
- For multi-page PDFs, pdftotext preserves page breaks with form feeds
- This method works better for text-based PDFs than image-scanned PDFs; scanned pages with no text layer produce little or no text and need OCR instead
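Because pages are separated by form feeds, the extracted text can be split back into pages. A minimal sketch follows; split_pages is a hypothetical helper, and it assumes pdftotext was run without the -nopgbrk option, which suppresses the form feeds.

```python
def split_pages(text):
    """Split pdftotext output into a list of per-page strings."""
    pages = text.split("\f")
    # A form feed after the last page leaves a trailing empty entry; drop it.
    if pages and pages[-1].strip() == "":
        pages.pop()
    return pages


if __name__ == "__main__":
    sample = "Page one\n\fPage two\n\f"
    for number, page in enumerate(split_pages(sample), start=1):
        print(number, page.strip())
```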