OpenSpace robust-pdf-read
Reliably extract text from PDFs using pdftotext when standard file reading fails.
install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/robust-pdf-read" ~/.claude/skills/hkuds-openspace-robust-pdf-read && rm -rf "$T"
manifest:
gdpval_bench/skills/robust-pdf-read/SKILL.md
source content
Robust PDF Text Extraction
Problem
Standard file reading tools (e.g., read_file) often fail to extract text from PDF documents. Instead of returning parsed text, they may return:
- Raw binary data
- Base64 encoded images
- Garbled characters or null bytes
This occurs because PDFs are complex binary formats, not plain text files. Parsing them with Python PDF libraries (like PyMuPDF) in sandboxed environments may also fail due to missing dependencies or environment restrictions.
Solution
Use the pdftotext command-line utility (part of poppler-utils) via run_shell. This tool is commonly pre-installed in Linux environments and reliably extracts text content from PDFs.
Procedure
1. Detect Extraction Failure
When attempting to read a PDF:
- Check the content returned by read_file.
- If the content contains null bytes (\x00), appears as base64, or is clearly binary/garbled, assume standard reading has failed.
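The checks above can be sketched as a small heuristic. This is a minimal illustration, not part of the skill itself: the function names, the 4096-character sample size, the 10% non-printable threshold, and the 64-character minimum for the base64 test are all assumptions chosen for the example.

```python
import base64
import binascii


def _looks_like_base64(text: str, min_length: int = 64) -> bool:
    """Return True if text is a long, space-free run that decodes as base64."""
    if " " in text:
        return False  # ordinary prose contains spaces; base64 does not
    compact = "".join(text.split())  # drop line wraps
    if len(compact) < min_length:
        return False
    compact = compact[: len(compact) - len(compact) % 4]  # align to 4 chars
    try:
        base64.b64decode(compact, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True


def looks_like_failed_read(content: str, sample_size: int = 4096) -> bool:
    """Heuristically decide whether a PDF read returned unusable content."""
    sample = content[:sample_size]
    stripped = sample.strip()
    if not stripped:
        return True  # nothing usable came back
    if "\x00" in sample:
        return True  # null bytes indicate raw binary data
    # Assumed threshold: more than 10% non-printable characters is "garbled".
    odd = sum(1 for ch in sample if not ch.isprintable() and ch not in "\n\r\t")
    if odd / len(sample) > 0.1:
        return True
    return _looks_like_base64(stripped)
```

If the function returns True, move on to the pdftotext fallback in the next step.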
2. Execute pdftotext
Run the following shell command using run_shell:
pdftotext -layout -nopgbrk <file_path> -
- -layout: Maintains the physical layout of the text (optional but recommended).
- -nopgbrk: Prevents inserting form feed characters between pages.
- -: Outputs content to stdout instead of creating a new file.
3. Parse Output
Capture the stdout from the shell command. This string is the extracted text.
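Where run_shell is not available, the same call can be approximated with Python's standard subprocess module. The sketch below is an assumption about how one might wrap the command; build_pdftotext_command, extract_pdf_text, and the 60-second timeout are hypothetical names and values, not part of the skill.

```python
import subprocess


def build_pdftotext_command(pdf_path: str) -> list:
    """Return the argument list for pdftotext writing text to stdout."""
    # Passing a list (no shell) avoids quoting problems with spaces in paths.
    return ["pdftotext", "-layout", "-nopgbrk", pdf_path, "-"]


def extract_pdf_text(pdf_path: str, timeout: float = 60.0) -> str:
    """Run pdftotext and return its stdout as the extracted text."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        encoding="utf-8",   # pdftotext emits UTF-8 by default
        errors="replace",   # keep going if a stray byte cannot be decoded
        timeout=timeout,
        check=True,         # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

A caller would use it as text = extract_pdf_text("document.pdf") and should be prepared to catch FileNotFoundError (pdftotext missing) and subprocess.CalledProcessError (extraction failed).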
Example Usage
Scenario: You need to read document.pdf.
Step 1: Attempt standard read
content = read_file("document.pdf")
if "\x00" in content or not content.strip():
    # Fallback needed
    pass
Step 2: Fallback to shell
result = run_shell("pdftotext -layout -nopgbrk document.pdf -")
text = result.stdout
Prerequisites
- The environment must have pdftotext installed (usually via poppler-utils).
- If pdftotext is not found, attempt to install it (apt-get install poppler-utils) if permissions allow, or notify the user.
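The prerequisite check can be done before attempting extraction. A minimal sketch using the standard library's shutil.which follows; find_tool and pdftotext_status are hypothetical helper names introduced for illustration.

```python
import shutil


def find_tool(name: str = "pdftotext"):
    """Return the full path to the named executable on PATH, or None."""
    return shutil.which(name)


def pdftotext_status() -> str:
    """Describe whether pdftotext is available and what to do if it is not."""
    path = find_tool()
    if path:
        return f"pdftotext found at {path}"
    return (
        "pdftotext not found; try `apt-get install poppler-utils` "
        "if permissions allow, otherwise notify the user"
    )
```

Checking up front gives a clearer message than letting the shell command fail with "command not found".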
Benefits
- Reliability: Bypasses Python library dependency issues in sandboxes.
- Speed: Command-line tools are often faster than loading heavy Python libraries.
- Compatibility: Works consistently across most Linux-based agent environments.