OpenSpace robust-pdf-read

Reliably extract text from PDFs using pdftotext when standard file reading fails.

install

source · Clone the upstream repo

git clone https://github.com/HKUDS/OpenSpace

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/robust-pdf-read" ~/.claude/skills/hkuds-openspace-robust-pdf-read && rm -rf "$T"

manifest: gdpval_bench/skills/robust-pdf-read/SKILL.md

source content

Robust PDF Text Extraction

Problem

Standard file reading tools (e.g., `read_file`) often fail to extract text from PDF documents. Instead of returning parsed text, they may return:

Raw binary data
Base64 encoded images
Garbled characters or null bytes

This occurs because PDFs are complex binary formats, not plain text files. Attempts to parse them with Python PDF libraries (like PyMuPDF) in sandboxed environments may also fail due to missing dependencies or environment restrictions.

Solution

Use the `pdftotext` command-line utility (part of `poppler-utils`) via `run_shell`. This tool is commonly pre-installed in Linux environments and reliably extracts text content from PDFs.

Procedure

1. Detect Extraction Failure

When attempting to read a PDF:

Check the content returned by `read_file`.
If the content contains null bytes (`\x00`), appears as base64, or is clearly binary/garbled, assume standard reading has failed.
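The checks above can be sketched as a single heuristic. This is a minimal illustration, not part of the skill itself: the function name `looks_like_failed_read`, the 64-character minimum for base64 detection, and the 10% control-character threshold are all assumptions chosen for the example. The `%PDF-` check is an extra signal, based on the fact that raw PDF files start with that header.

```python
import base64
import binascii
import re

# Base64 alphabet with optional padding (line breaks are removed first).
_BASE64_RE = re.compile(r"[A-Za-z0-9+/]+={0,2}")


def looks_like_failed_read(content: str, max_bad_ratio: float = 0.1) -> bool:
    """Heuristically decide whether `content` is unparsed PDF data, not text."""
    stripped = content.strip()
    if not stripped:
        return True  # nothing usable came back
    if "\x00" in content or stripped.startswith("%PDF-"):
        return True  # null bytes or a raw PDF header
    # A long run of base64 characters that also decodes cleanly.
    compact = stripped.replace("\r", "").replace("\n", "")
    if len(compact) >= 64 and _BASE64_RE.fullmatch(compact):
        try:
            base64.b64decode(compact, validate=True)
            return True
        except (binascii.Error, ValueError):
            pass
    # Share of control or replacement characters (assumed threshold).
    bad = sum(
        1
        for ch in content
        if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\t\n\r\f")
    )
    return bad / len(content) > max_bad_ratio
```

Ordinary prose contains spaces and punctuation, so it does not match the base64 pattern and falls through to the control-character check.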

2. Execute pdftotext

Run the following shell command using `run_shell`:

pdftotext -layout -nopgbrk <file_path> -

`-layout`: Maintains the physical layout of the text (optional but recommended).
`-nopgbrk`: Prevents inserting form feed characters between pages.
`-`: Outputs content to stdout instead of creating a new file.
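Where a Python interpreter is available instead of a `run_shell` tool, the same command could be invoked with the standard `subprocess` module. The sketch below is one way to do that under stated assumptions: `build_pdftotext_command` and `extract_pdf_text` are hypothetical helper names, the 60-second timeout is arbitrary, and the output is decoded as UTF-8 on the assumption that pdftotext's default output encoding is in use.

```python
import subprocess


def build_pdftotext_command(pdf_path: str, layout: bool = True) -> list[str]:
    """Build the argument list for `pdftotext ... <file_path> -`."""
    command = ["pdftotext"]
    if layout:
        command.append("-layout")  # keep the physical layout of the text
    command.append("-nopgbrk")  # no form feed characters between pages
    command += [pdf_path, "-"]  # "-" sends the text to stdout
    return command


def extract_pdf_text(pdf_path: str, timeout_s: float = 60.0) -> str:
    """Run pdftotext and return its stdout; raises if the command fails."""
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,  # raise CalledProcessError on a non-zero exit code
        timeout=timeout_s,
    )
    return result.stdout
```

Passing the arguments as a list rather than a single shell string means a path containing spaces needs no extra quoting.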

3. Parse Output

Capture the stdout from the shell command. This string is the extracted text.
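If per-page text is needed, one option that goes beyond the procedure here is to omit `-nopgbrk` and split the captured stdout on the form feed character that `pdftotext` then emits at page boundaries. A minimal sketch, with `split_pages` as a hypothetical helper name:

```python
def split_pages(stdout_text: str) -> list[str]:
    """Split pdftotext output (run without -nopgbrk) into per-page strings."""
    pages = stdout_text.split("\f")
    # A form feed after the last page leaves an empty trailing chunk.
    if pages and not pages[-1].strip():
        pages.pop()
    # Trim trailing whitespace; leading spaces from -layout are preserved.
    return [page.rstrip() for page in pages]
```

With `-nopgbrk` in place, as in the command above, the output has no form feeds and this helper returns the whole text as a single page.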

Example Usage

Scenario: You need to read `document.pdf`.

Step 1: Attempt standard read

content = read_file("document.pdf")
if "\x00" in content or not content.strip():
    # Fallback needed
    pass

Step 2: Fallback to shell

result = run_shell("pdftotext -layout -nopgbrk document.pdf -")
text = result.stdout
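The two steps can be combined into one function. The sketch below assumes that `read_file` and `run_shell` behave as in the example (the first returns a string, the second returns an object with a `stdout` attribute) and takes them as parameters so the logic can be exercised without the real tools. The name `read_pdf_with_fallback` is hypothetical.

```python
import shlex
from typing import Any, Callable


def read_pdf_with_fallback(
    path: str,
    read_file: Callable[[str], str],
    run_shell: Callable[[str], Any],
) -> str:
    """Try `read_file` first, then fall back to pdftotext via `run_shell`."""
    content = read_file(path)
    if "\x00" not in content and content.strip():
        return content  # the standard read looks usable
    # Quote the path so spaces or shell metacharacters are passed safely.
    command = f"pdftotext -layout -nopgbrk {shlex.quote(path)} -"
    result = run_shell(command)
    return result.stdout
```

Quoting the path with `shlex.quote` matters once file names contain spaces, which the unquoted `document.pdf` example does not exercise.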

Prerequisites

The environment must have `pdftotext` installed (usually via `poppler-utils`).
If `pdftotext` is not found, attempt to install it (`apt-get install poppler-utils`) if permissions allow, or notify the user.
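A quick way to test this prerequisite from Python is `shutil.which`, which looks a command up on `PATH`. The sketch below only reports the result and does not attempt the installation; `pdftotext_status` is a hypothetical helper name, and the `which` parameter exists only so the lookup can be replaced in tests.

```python
import shutil
from typing import Callable, Optional


def pdftotext_status(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return a short message saying whether pdftotext is on PATH."""
    location = which("pdftotext")
    if location:
        return f"pdftotext found at {location}"
    return (
        "pdftotext not found; install poppler-utils "
        "(e.g. apt-get install poppler-utils) or notify the user"
    )
```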

Benefits

Reliability: Bypasses Python library dependency issues in sandboxes.
Speed: Command-line tools are often faster than loading heavy Python libraries.
Compatibility: Works consistently across most Linux-based agent environments.