OpenSpace pdf-extract-shell-first

PDF text extraction with tool cascade prioritizing shell pdftotext before Python fallback

install

source · Clone the upstream repo

git clone https://github.com/HKUDS/OpenSpace

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/pdf-download-extract-fallback-enhanced-e27e0c" ~/.claude/skills/hkuds-openspace-pdf-extract-shell-first && rm -rf "$T"

manifest: gdpval_bench/skills/pdf-download-extract-fallback-enhanced-e27e0c/SKILL.md

source content

PDF Extract with Shell-First Tool Cascade

This skill provides an optimized workflow for extracting text content from PDF documents (local files or downloaded URLs) using a prioritized tool cascade that favors shell-based extraction before falling back to Python libraries.

Why Shell-First?

Analysis of execution patterns shows:

`read_file` on PDFs sometimes returns binary/image data instead of text
`run_shell` with `pdftotext` has a higher success rate and fewer sandbox errors
`execute_code_sandbox` can fail with "unknown error" in constrained environments
Shell tools are more reliable for PDF text extraction when available

Entry Point: Determine Your Starting Point

Before beginning, identify your scenario:

Scenario	Start Here	Skip
PDF already on local disk	Step 1 (Try read_file)	Step 0 (download)
PDF at a web URL	Step 0 (download), then Step 1	None
Need maximum reliability	Full cascade (all 3 tools)	None

Complete Workflow

Step 0: Download PDF (URL Only)

If your PDF is at a web URL, download it first using a browser user-agent:

curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" -o target.pdf "URL_HERE"

Key flags:

`-L`: Follow redirects
`-A`: Set the user-agent header to mimic a real browser
`-o`: Specify the output filename

If you already have the PDF locally, skip to Step 1.
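A download can succeed at the HTTP level and still deliver an HTML error or login page instead of a PDF. As a minimal sketch, assuming only that a well-formed PDF carries the `%PDF-` marker near the start of the file, a quick check before Step 1 might look like this. The helper name `looks_like_pdf` and the 1024-byte probe size are illustrative choices, not part of the skill.

```python
from pathlib import Path


def looks_like_pdf(path: str, probe_bytes: int = 1024) -> bool:
    """Return True if the first bytes of the file contain the %PDF- marker."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        head = f.read(probe_bytes)
    return b"%PDF-" in head
```

If the check fails, inspecting the first few lines of the file usually shows whether the server returned HTML.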

Step 1: Try read_file (Primary Attempt)

First, attempt to extract text using the `read_file` tool:

read_file(filetype="pdf", file_path="target.pdf")

Evaluate the response:

Response Type	Interpretation	Next Action
Clean readable text	Success	Proceed to content analysis
Binary data / PNG image / garbled	`read_file` returned raw data	Go to Step 2 immediately
Error / timeout	Tool failure	Go to Step 2 immediately

Critical: If `read_file` returns binary image data or garbled content, do not retry `read_file`. Immediately proceed to Step 2.
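The evaluation table above can be sketched as a small classifier. This is an illustration only: the exact shape of a `read_file` response is not specified here, so the sketch assumes it arrives as `str`, `bytes`, or `None` (for an error or timeout). The function name `next_action`, the 2000-character sample, and the 0.9 printable-ratio threshold are all hypothetical.

```python
PNG_SIGNATURE = "\x89PNG"


def next_action(response) -> str:
    """Map a read_file response to "analyze" or "step_2" (assumed response shapes)."""
    if response is None or isinstance(response, (bytes, bytearray)):
        return "step_2"  # error, timeout, or raw binary payload
    text = str(response)
    if not text.strip():
        return "step_2"  # nothing usable came back
    sample = text[:2000]
    if sample.startswith(PNG_SIGNATURE) or "\x00" in sample:
        return "step_2"  # image data or NUL bytes decoded as text
    # Treat the text as garbled when too few characters are printable.
    printable = sum(ch.isprintable() or ch in "\n\r\t" for ch in sample)
    if printable / len(sample) < 0.9:
        return "step_2"
    return "analyze"
```

The function never returns a "retry" outcome, which mirrors the rule that a binary response should not trigger another `read_file` call.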

Step 2: Use run_shell with pdftotext (Preferred Fallback)

When `read_file` fails or returns binary data, use `run_shell` with `pdftotext`:

run_shell(command="pdftotext target.pdf output.txt")

Then read the extracted text:

read_file(filetype="txt", file_path="output.txt")

If pdftotext is not found, install it first:

run_shell(command="apt-get update && apt-get install -y poppler-utils")
# Or for macOS:
run_shell(command="brew install poppler")

Then retry:

run_shell(command="pdftotext target.pdf output.txt")
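Where plain Python is available instead of the `run_shell` tool, the same "check, then run" sequence can be sketched with the standard library. This is a hedged illustration rather than part of the skill: the wrapper name `run_pdftotext` and its `tool` parameter are hypothetical, and only the basic `pdftotext input.pdf output.txt` invocation shown above is used.

```python
import shutil
import subprocess


def run_pdftotext(pdf_path: str, txt_path: str, tool: str = "pdftotext") -> bool:
    """Run pdftotext if it is on PATH; return True only on a zero exit code."""
    if shutil.which(tool) is None:
        return False  # not installed: install poppler or move on to Step 3
    result = subprocess.run(
        [tool, pdf_path, txt_path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(f"{tool} failed: {result.stderr.strip()}")
        return False
    return True
```

A `True` result only means the command exited cleanly; the output still needs the quality check described next.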

Verify extraction quality:

Check that `output.txt` exists and has content
Sample the text to ensure it's readable (not garbled)
If extraction looks corrupted, proceed to Step 3
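The three checks above might be automated along these lines. This is a rough heuristic, not a definitive test: the function name `extraction_looks_ok`, the 2000-character sample, and the 0.6 ratio of letters, digits, and whitespace are assumptions chosen for illustration and may need tuning for tables or non-Latin scripts.

```python
import os


def extraction_looks_ok(txt_path: str, sample_chars: int = 2000,
                        min_ratio: float = 0.6) -> bool:
    """Check that the output exists, is non-empty, and is mostly readable characters."""
    if not os.path.isfile(txt_path) or os.path.getsize(txt_path) == 0:
        return False
    with open(txt_path, "r", encoding="utf-8", errors="replace") as f:
        sample = f.read(sample_chars)
    if not sample.strip():
        return False  # only whitespace or page breaks
    readable = sum(ch.isalnum() or ch.isspace() for ch in sample)
    return readable / len(sample) >= min_ratio
```

A `False` result corresponds to "extraction looks corrupted" and sends the workflow to Step 3.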

Step 3: Use execute_code_sandbox with PyMuPDF (Last Resort)

If `pdftotext` is unavailable or produces poor results, use Python's PyMuPDF via `execute_code_sandbox`:

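import fitz  # PyMuPDF

doc = fitz.open("target.pdf")
page_count = len(doc)  # read before close(); a closed document cannot be queried
# Join the page texts once instead of concatenating strings in a loop
text = "".join(page.get_text() for page in doc)
doc.close()

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)
print(f"Extracted {len(text)} characters from {page_count} pages")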

Execute via:

execute_code_sandbox(code="<python code above>")

Then read the result:

read_file(filetype="txt", file_path="output.txt")

Note: `execute_code_sandbox` may fail with "unknown error" in some environments. If this occurs, document the failure and proceed to Step 4.

Step 4: Graceful Degradation to Domain Knowledge

If all extraction methods fail:

Document the specific failure mode for each tool attempted
Extract any partial content that was successfully retrieved
Supplement missing content from established domain knowledge
Clearly mark which portions are from source vs. generated from knowledge
Provide citations for any claimed requirements or specifications

Example degradation documentation:

EXTRACTION FAILURE REPORT:
- Source: [URL or file path]
- read_file: Returned binary/image data (no text extraction)
- run_shell/pdftotext: [Tool not available / produced garbled output / succeeded]
- execute_code_sandbox/PyMuPDF: [Failed with unknown error / succeeded]

NOTE: Content below combines partial extraction with established domain 
knowledge for [topic]. Verify against official sources when available.
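A report in the format above can be assembled from whatever outcomes were recorded during the cascade. The sketch below is illustrative: the function name `failure_report` and its parameters are hypothetical, and the outcome strings are supplied by the caller rather than detected automatically.

```python
def failure_report(source: str, outcomes: dict, topic: str) -> str:
    """Render one line per attempted tool, in the order the tools were tried."""
    lines = ["EXTRACTION FAILURE REPORT:", f"- Source: {source}"]
    for tool, outcome in outcomes.items():
        lines.append(f"- {tool}: {outcome}")
    lines.append("")
    lines.append(
        "NOTE: Content below combines partial extraction with established "
        f"domain knowledge for {topic}. Verify against official sources when available."
    )
    return "\n".join(lines)
```

Because Python dictionaries preserve insertion order, the tool lines appear in the order the attempts were recorded.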

Tool Selection Decision Tree

                    PDF to Extract
                          │
                          ▼
                  ┌───────────────┐
                  │  read_file    │
                  │  (primary)    │
                  └───────┬───────┘
                          │
            ┌─────────────┼─────────────┐
            │             │             │
     Returns text   Returns binary   Error/timeout
        (✓)         / image data         │
            │             │             │
            ▼             ▼             ▼
       SUCCESS    ┌───────────────┐
                  │ run_shell     │
                  │ pdftotext     │
                  └───────┬───────┘
                          │
                  ┌───────┼───────┐
                  │       │       │
             Succeeds  Not      Garbled
                (✓)   avail.    output
                  │       │       │
                  ▼       ▼       ▼
             SUCCESS ┌───────────────┐
                     │ execute_code  │
                     │ _sandbox      │
                     │ PyMuPDF       │
                     └───────┬───────┘
                             │
                     ┌───────┼───────┐
                     │       │       │
                Succeeds   Fails   Error
                   (✓)      │       │
                     │      ▼       │
                     ▼   Domain     │
                 SUCCESS  Knowledge │
                             │      │
                             └──────┘
                              FAILURE
                              DOCUMENTED
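The decision tree reduces to "try each tool in order, record why it failed, stop at the first usable text". A minimal sketch of that control flow follows, assuming each tool is wrapped in a callable that takes a path and returns text or `None`. The names `run_cascade` and `Extractor` are hypothetical, and the extractors themselves are left to the caller.

```python
from typing import Callable, List, Optional, Tuple

Extractor = Callable[[str], Optional[str]]


def run_cascade(pdf_path: str, extractors: List[Tuple[str, Extractor]]):
    """Try each extractor in order; return (text, tool_name, failures)."""
    failures = {}
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception as exc:  # record the failure mode and keep going
            failures[name] = f"error: {exc}"
            continue
        if text and text.strip():
            return text, name, failures
        failures[name] = "no usable text"
    # Every tool failed: the caller falls back to Step 4 with the failures dict.
    return None, None, failures
```

The returned `failures` mapping is the raw material for the failure report in Step 4.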

Complete Automated Script

#!/bin/bash
# pdf-extract-cascade.sh
# Implements the full tool cascade for PDF extraction

INPUT="$1"
OUTPUT_PDF="target.pdf"
OUTPUT_TXT="output.txt"

# Step 0: Handle URL vs local file
if [[ "$INPUT" =~ ^https?:// ]]; then
    echo "Downloading PDF from URL..."
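    # -f makes curl exit non-zero on HTTP errors instead of saving the error page
    curl -fL -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" -o "$OUTPUT_PDF" "$INPUT" || { echo "ERROR: Download failed: $INPUT"; exit 1; }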
else
    if [ ! -f "$INPUT" ]; then
        echo "ERROR: Local file not found: $INPUT"
        exit 1
    fi
    OUTPUT_PDF="$INPUT"
fi

# Step 1: Verify file type
echo "Verifying file type..."
if ! file "$OUTPUT_PDF" | grep -q "PDF document"; then
    echo "WARNING: File is not a valid PDF"
    file "$OUTPUT_PDF"
fi

# Step 2: Try pdftotext (shell-first approach)
echo "Attempting pdftotext extraction..."
if command -v pdftotext &> /dev/null; then
    if pdftotext "$OUTPUT_PDF" "$OUTPUT_TXT" 2>/dev/null; then
        if [ -s "$OUTPUT_TXT" ]; then
            echo "SUCCESS: Extraction completed with pdftotext"
            wc -l "$OUTPUT_TXT"
            exit 0
        fi
    fi
fi

# Step 3: Fallback to PyMuPDF
echo "Falling back to PyMuPDF..."
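python3 - "$OUTPUT_PDF" "$OUTPUT_TXT" << 'PYTHON_SCRIPT'
import fitz
import sys

# Paths are passed in from the shell so local inputs are not hard-coded to target.pdf
pdf_path, txt_path = sys.argv[1], sys.argv[2]

try:
    doc = fitz.open(pdf_path)
    page_count = len(doc)  # read before close(); a closed document cannot be queried
    text = "".join(page.get_text() for page in doc)
    doc.close()
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(f"SUCCESS: Extracted {len(text)} characters from {page_count} pages")
    sys.exit(0)
except Exception as e:
    print(f"PyMuPDF failed: {e}")
    sys.exit(1)
PYTHON_SCRIPT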

# Step 4: Handle complete failure
if [ $? -ne 0 ]; then
    echo "FAILURE: All extraction methods failed"
    echo "ACTION: Generate content from domain knowledge"
    echo "Document each tool's failure mode for future reference"
    exit 1
fi

Best Practices

Never retry read_file on binary response: If `read_file` returns image/binary data, immediately switch to `run_shell`
Prefer run_shell over execute_code_sandbox: Shell tools have higher reliability and fewer sandbox-related errors
Verify before trusting: Always check extracted text is readable, not just that the command succeeded
Document failures: Record which tools failed and how, to inform future extraction attempts
Preserve originals: Keep the source PDF for debugging and re-extraction if needed
Check tool availability early: Test for `pdftotext` before starting complex workflows
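For the last practice, an up-front probe could report which back ends exist before any extraction is attempted. This is a sketch using only the standard library; the function name `probe_tools` is hypothetical, and it assumes PyMuPDF is importable under the module name `fitz`, as in the examples above.

```python
import importlib.util
import shutil


def probe_tools() -> dict:
    """Report which extraction back ends are present in this environment."""
    return {
        "pdftotext": shutil.which("pdftotext") is not None,
        # find_spec returns None when the module cannot be located.
        "pymupdf": importlib.util.find_spec("fitz") is not None,
    }
```

Running the probe once at the start lets the workflow skip straight to an available tool, or to installation.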

Common Failure Modes and Responses

Symptom	Likely Cause	Recommended Action
read_file returns PNG/binary	PDF rendered as image, not parsed	Immediately use run_shell with pdftotext
pdftotext: command not found	poppler-utils not installed	Run `apt-get install poppler-utils` first
pdftotext produces empty file	Password-protected or scanned PDF	Try PyMuPDF, or use OCR tools
execute_code_sandbox "unknown error"	Sandbox execution issue	Document failure, use domain knowledge fallback
Garbled text output	Encoding issues	Try PyMuPDF with `page.get_text("text")`
All tools fail	Severely corrupted or encrypted PDF	Document limitation, use knowledge-based content

When to Use This Skill

Extracting text from local PDF files where `read_file` may return binary data
Processing PDFs in automated pipelines requiring high reliability
Situations where `execute_code_sandbox` has shown instability
Working with PDFs from sources that may deliver rendered images instead of parseable text
Any workflow where shell tools can be assumed available or are easily installed

Migration Notes from pdf-download-extract-fallback

This skill (`pdf-extract-shell-first`) differs from the parent in these key ways:

Explicit tool sequencing: Clearly prioritizes `read_file` → `run_shell` → `execute_code_sandbox`
No retry on binary read_file: Instructs immediate fallback when binary data is detected
Shell-first philosophy: Emphasizes `pdftotext` via `run_shell` as preferred over Python
Reduced download focus: Assumes the PDF is already available or is downloaded in a pre-step; focuses on the extraction cascade
Decision tree visualization: Provides a clear flowchart for tool selection