OpenSpace pdf-extract-shell-first

PDF text extraction with tool cascade prioritizing shell pdftotext before Python fallback

install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/pdf-download-extract-fallback-enhanced-e27e0c" ~/.claude/skills/hkuds-openspace-pdf-extract-shell-first && rm -rf "$T"
manifest: gdpval_bench/skills/pdf-download-extract-fallback-enhanced-e27e0c/SKILL.md
source content

PDF Extract with Shell-First Tool Cascade

This skill provides an optimized workflow for extracting text content from PDF documents (local files or downloaded URLs) using a prioritized tool cascade that favors shell-based extraction before falling back to Python libraries.

Why Shell-First?

Analysis of execution patterns shows:

  • read_file on PDFs sometimes returns binary/image data instead of text
  • run_shell with pdftotext has a higher success rate and fewer sandbox errors
  • execute_code_sandbox can fail with "unknown error" in constrained environments
  • Shell tools are more reliable for PDF text extraction when available

Entry Point: Determine Your Starting Point

Before beginning, identify your scenario:

| Scenario | Start Here | Skip |
| --- | --- | --- |
| PDF already on local disk | Step 1 (Try read_file) | Step 0 (shell download) |
| PDF at a web URL | Step 0 (shell download), then Step 1 | None |
| Need maximum reliability | Full cascade (all 3 tools) | None |

Complete Workflow

Step 0: Download PDF (URL Only)

If your PDF is at a web URL, download it first using a browser user-agent:

curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" -o target.pdf "URL_HERE"

Key flags:

  • -L: Follow redirects
  • -A: Set user-agent header to mimic a real browser
  • -o: Specify output filename

If you already have the PDF locally, skip to Step 1.
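Where curl is not available, the same download can be sketched with Python's standard library. This is a minimal sketch, not part of the original skill: the helper names (build_request, looks_like_pdf, download_pdf) and the 30-second timeout are assumptions. urllib follows redirects by default, which mirrors -L, and the check for the %PDF- header guards against saving an HTML error page as target.pdf.

```python
import urllib.request

# Same browser-like user-agent string as the curl command above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)


def build_request(url: str) -> urllib.request.Request:
    """Build a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def looks_like_pdf(data: bytes) -> bool:
    """Strict check: a PDF payload is expected to start with b'%PDF-'."""
    return data[:5] == b"%PDF-"


def download_pdf(url: str, dest: str = "target.pdf", timeout: float = 30.0) -> bool:
    """Download url to dest; return True only if the payload looks like a PDF.

    Hypothetical helper: the name, default path, and timeout are assumptions.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        data = resp.read()
    if not looks_like_pdf(data):
        return False  # likely an HTML error page or a login wall
    with open(dest, "wb") as f:
        f.write(data)
    return True
```

A call such as download_pdf("URL_HERE") would then stand in for the curl command before continuing with Step 1.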

Step 1: Try read_file (Primary Attempt)

First, attempt to extract text using the read_file tool:

read_file(filetype="pdf", file_path="target.pdf")

Evaluate the response:

| Response Type | Interpretation | Next Action |
| --- | --- | --- |
| Clean readable text | Success | Proceed to content analysis |
| Binary data / PNG image / garbled | read_file returned raw data | Go to Step 2 immediately |
| Error / timeout | Tool failure | Go to Step 2 immediately |

Critical: If read_file returns binary image data or garbled content, do not retry read_file. Immediately proceed to Step 2.
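The evaluation in the table above can be approximated in code. The following is a rough sketch under stated assumptions: the shape of the read_file response (None for an error or timeout, otherwise str or bytes) is hypothetical, the function name next_action is invented for illustration, and the 0.9 printable-character threshold is an arbitrary starting point rather than a tuned value. An empty response is also treated as a failure, which the table does not state explicitly.

```python
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"  # 8-byte PNG file signature
PDF_MAGIC = b"%PDF-"              # raw, unparsed PDF bytes


def next_action(response) -> str:
    """Return "analyze" for clean text, otherwise "step2".

    Assumption: response is None on error/timeout, else str or bytes.
    """
    if response is None:
        return "step2"  # error or timeout
    if isinstance(response, bytes):
        if response.startswith(PNG_MAGIC) or response.startswith(PDF_MAGIC):
            return "step2"  # image or raw PDF data, not extracted text
        try:
            response = response.decode("utf-8")
        except UnicodeDecodeError:
            return "step2"  # binary payload
    if not response.strip():
        return "step2"  # assumption: empty output counts as failure
    if "\ufffd" in response:
        return "step2"  # replacement characters signal a decoding problem
    printable = sum(ch.isprintable() or ch in "\n\r\t" for ch in response)
    if printable / len(response) < 0.9:  # threshold is an assumption
        return "step2"  # garbled content
    return "analyze"
```

Because every failure branch returns "step2", the sketch never retries read_file, which matches the rule stated above.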

Step 2: Use run_shell with pdftotext (Preferred Fallback)

When read_file fails or returns binary data, use run_shell with pdftotext:

run_shell(command="pdftotext target.pdf output.txt")

Then read the extracted text:

read_file(filetype="txt", file_path="output.txt")

If pdftotext is not found, install it first:

run_shell(command="apt-get update && apt-get install -y poppler-utils")
# Or for macOS:
run_shell(command="brew install poppler")

Then retry:

run_shell(command="pdftotext target.pdf output.txt")
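Outside the agent tooling, the availability check and the pdftotext call can be combined in one place. The sketch below uses only the standard library; the function names are hypothetical, and it does not attempt the installation step, it only reports that the binary is missing.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path: str, txt_path: str) -> list[str]:
    """Argument list equivalent to: pdftotext <pdf_path> <txt_path>."""
    return ["pdftotext", pdf_path, txt_path]


def run_pdftotext(pdf_path: str = "target.pdf",
                  txt_path: str = "output.txt") -> tuple[bool, str]:
    """Return (ok, message); ok is False if the tool is missing or fails."""
    if shutil.which("pdftotext") is None:
        return False, "pdftotext not found on PATH (install poppler-utils)"
    proc = subprocess.run(
        pdftotext_command(pdf_path, txt_path),
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # Keep stderr so the failure mode can be documented later.
        return False, proc.stderr.strip() or f"exit code {proc.returncode}"
    return True, "ok"
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.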

Verify extraction quality:

  • Check that output.txt exists and has content
  • Sample the text to ensure it's readable (not garbled)
  • If extraction looks corrupted, proceed to Step 3
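These checks can be automated with a crude heuristic. The sketch below is an assumption-laden illustration: the function name, the 4000-character sample, and both thresholds (min_chars, min_word_ratio) are made up for the example, and a token counts as "word-like" when at least half of its characters are alphanumeric. It will misjudge some documents, for example tables of symbols, so a manual look at the sample is still worthwhile.

```python
import os


def extraction_looks_ok(txt_path: str = "output.txt",
                        min_chars: int = 50,
                        min_word_ratio: float = 0.6) -> bool:
    """Heuristic: file exists, has content, and most tokens look like words."""
    if not os.path.isfile(txt_path) or os.path.getsize(txt_path) == 0:
        return False
    with open(txt_path, encoding="utf-8", errors="replace") as f:
        sample = f.read(4000)  # sample the start of the file only
    tokens = sample.split()
    if len(sample.strip()) < min_chars or not tokens:
        return False

    def wordlike(token: str) -> bool:
        # At least half of the characters are letters or digits.
        return sum(c.isalnum() for c in token) * 2 >= len(token)

    ratio = sum(wordlike(t) for t in tokens) / len(tokens)
    return ratio >= min_word_ratio
```

A False result corresponds to the last bullet above: treat the extraction as corrupted and move on to Step 3.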

Step 3: Use execute_code_sandbox with PyMuPDF (Last Resort)

If pdftotext is unavailable or produces poor results, use Python's PyMuPDF via execute_code_sandbox:

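import fitz  # PyMuPDF

doc = fitz.open("target.pdf")
page_count = len(doc)  # read before close(); len(doc) fails on a closed document
# Collect the text of every page in reading order
text = "".join(page.get_text() for page in doc)
doc.close()

with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)
print(f"Extracted {len(text)} characters from {page_count} pages")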

Execute via:

execute_code_sandbox(code="<python code above>")

Then read the result:

read_file(filetype="txt", file_path="output.txt")

Note: execute_code_sandbox may fail with "unknown error" in some environments. If this occurs, document the failure and proceed to Step 4.

Step 4: Graceful Degradation to Domain Knowledge

If all extraction methods fail:

  1. Document the specific failure mode for each tool attempted
  2. Extract any partial content that was successfully retrieved
  3. Supplement missing content from established domain knowledge
  4. Clearly mark which portions are from source vs. generated from knowledge
  5. Provide citations for any claimed requirements or specifications

Example degradation documentation:

EXTRACTION FAILURE REPORT:
- Source: [URL or file path]
- read_file: Returned binary/image data (no text extraction)
- run_shell/pdftotext: [Tool not available / produced garbled output / succeeded]
- execute_code_sandbox/PyMuPDF: [Failed with unknown error / succeeded]

NOTE: Content below combines partial extraction with established domain 
knowledge for [topic]. Verify against official sources when available.
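A report in this shape can be generated from a simple mapping of tool name to outcome. The helper below is a hypothetical sketch: the function name and its parameters are not part of the skill, and the outcome strings are whatever the caller recorded during the earlier steps.

```python
def failure_report(source: str, outcomes: dict[str, str], topic: str) -> str:
    """Render the extraction failure report as plain text.

    outcomes maps a tool label (e.g. "read_file") to a short outcome note.
    """
    lines = ["EXTRACTION FAILURE REPORT:", f"- Source: {source}"]
    lines += [f"- {tool}: {outcome}" for tool, outcome in outcomes.items()]
    lines += [
        "",
        "NOTE: Content below combines partial extraction with established domain",
        f"knowledge for {topic}. Verify against official sources when available.",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(failure_report(
        "target.pdf",
        {
            "read_file": "Returned binary/image data (no text extraction)",
            "run_shell/pdftotext": "Tool not available",
            "execute_code_sandbox/PyMuPDF": "Failed with unknown error",
        },
        "[topic]",
    ))
```

Dictionaries preserve insertion order, so the tools appear in the order they were attempted.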

Tool Selection Decision Tree

                    PDF to Extract
                          │
                          ▼
                  ┌───────────────┐
                  │  read_file    │
                  │  (primary)    │
                  └───────┬───────┘
                          │
            ┌─────────────┼─────────────┐
            │             │             │
     Returns text   Returns binary   Error/timeout
        (✓)         / image data         │
            │             │             │
            ▼             ▼             ▼
       SUCCESS    ┌───────────────┐
                  │ run_shell     │
                  │ pdftotext     │
                  └───────┬───────┘
                          │
                  ┌───────┼───────┐
                  │       │       │
             Succeeds  Not      Garbled
                (✓)   avail.    output
                  │       │       │
                  ▼       ▼       ▼
             SUCCESS ┌───────────────┐
                     │ execute_code  │
                     │ _sandbox      │
                     │ PyMuPDF       │
                     └───────┬───────┘
                             │
                     ┌───────┼───────┐
                     │       │       │
                Succeeds   Fails   Error
                   (✓)      │       │
                     │      ▼       │
                     ▼   Domain     │
                 SUCCESS  Knowledge │
                             │      │
                             └──────┘
                              FAILURE
                              DOCUMENTED
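The decision tree reduces to "try each tool in order, stop at the first usable text, and record what happened along the way". A minimal sketch of that control flow follows; the extractor callables are placeholders that the caller would supply (for example wrappers around read_file, pdftotext, and PyMuPDF), and the names run_cascade and Extractor are invented for illustration.

```python
from typing import Callable, Optional

# An extractor takes a PDF path and returns text, or None/"" when unusable.
Extractor = Callable[[str], Optional[str]]


def run_cascade(pdf_path: str, extractors: list[tuple[str, Extractor]]):
    """Try each (name, extractor) in order; return (text or None, log)."""
    log: dict[str, str] = {}
    for name, extract in extractors:
        try:
            text = extract(pdf_path)
        except Exception as exc:  # tool error or timeout
            log[name] = f"error: {exc}"
            continue
        if text and text.strip():
            log[name] = "succeeded"
            return text, log  # first success ends the cascade
        log[name] = "no usable text"
    return None, log  # all tools failed; log documents each failure
```

When the function returns None, the log holds one entry per attempted tool, which is the information Step 4 asks to be documented.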

Complete Automated Script

#!/bin/bash
# pdf-extract-cascade.sh
# Implements the full tool cascade for PDF extraction

INPUT="$1"
OUTPUT_PDF="target.pdf"
OUTPUT_TXT="output.txt"

# Step 0: Handle URL vs local file
if [[ "$INPUT" =~ ^https?:// ]]; then
    echo "Downloading PDF from URL..."
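    # -f makes curl exit non-zero on HTTP errors instead of saving the error page
    curl -fL -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" -o "$OUTPUT_PDF" "$INPUT" || { echo "ERROR: Download failed: $INPUT"; exit 1; }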
else
    if [ ! -f "$INPUT" ]; then
        echo "ERROR: Local file not found: $INPUT"
        exit 1
    fi
    OUTPUT_PDF="$INPUT"
fi

# Step 1: Verify file type
echo "Verifying file type..."
if ! file "$OUTPUT_PDF" | grep -q "PDF document"; then
    echo "WARNING: File is not a valid PDF"
    file "$OUTPUT_PDF"
fi

# Step 2: Try pdftotext (shell-first approach)
echo "Attempting pdftotext extraction..."
if command -v pdftotext &> /dev/null; then
    if pdftotext "$OUTPUT_PDF" "$OUTPUT_TXT" 2>/dev/null; then
        if [ -s "$OUTPUT_TXT" ]; then
            echo "SUCCESS: Extraction completed with pdftotext"
            wc -l "$OUTPUT_TXT"
            exit 0
        fi
    fi
fi

# Step 3: Fallback to PyMuPDF
echo "Falling back to PyMuPDF..."
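python3 - "$OUTPUT_PDF" "$OUTPUT_TXT" << 'PYTHON_SCRIPT'
import sys

import fitz  # PyMuPDF

# Paths come from the shell variables so a local input file is honored
pdf_path, txt_path = sys.argv[1], sys.argv[2]

try:
    doc = fitz.open(pdf_path)
    page_count = len(doc)  # read before close()
    text = "".join(page.get_text() for page in doc)
    doc.close()
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(f"SUCCESS: Extracted {len(text)} characters from {page_count} pages")
    sys.exit(0)
except Exception as e:
    print(f"PyMuPDF failed: {e}")
    sys.exit(1)
PYTHON_SCRIPT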

# Step 4: Handle complete failure
if [ $? -ne 0 ]; then
    echo "FAILURE: All extraction methods failed"
    echo "ACTION: Generate content from domain knowledge"
    echo "Document each tool's failure mode for future reference"
    exit 1
fi

Best Practices

  1. Never retry read_file on binary response: If read_file returns image/binary data, immediately switch to run_shell
  2. Prefer run_shell over execute_code_sandbox: Shell tools have higher reliability and fewer sandbox-related errors
  3. Verify before trusting: Always check extracted text is readable, not just that the command succeeded
  4. Document failures: Record which tools failed and how, to inform future extraction attempts
  5. Preserve originals: Keep the source PDF for debugging and re-extraction if needed
  6. Check tool availability early: Test for pdftotext before starting complex workflows

Common Failure Modes and Responses

| Symptom | Likely Cause | Recommended Action |
| --- | --- | --- |
| read_file returns PNG/binary | PDF rendered as image, not parsed | Immediately use run_shell with pdftotext |
| pdftotext: command not found | poppler-utils not installed | Run apt-get install poppler-utils first |
| pdftotext produces empty file | Password-protected or scanned PDF | Try PyMuPDF, or use OCR tools |
| execute_code_sandbox "unknown error" | Sandbox execution issue | Document failure, use domain knowledge fallback |
| Garbled text output | Encoding issues | Try PyMuPDF with page.get_text("text") |
| All tools fail | Severely corrupted or encrypted PDF | Document limitation, use knowledge-based content |

When to Use This Skill

  • Extracting text from local PDF files where read_file may return binary data
  • Processing PDFs in automated pipelines requiring high reliability
  • Situations where execute_code_sandbox has shown instability
  • Working with PDFs from sources that may deliver rendered images instead of parseable text
  • Any workflow where shell tool availability can be assumed or easily installed

Migration Notes from pdf-download-extract-fallback

This skill (pdf-extract-shell-first) differs from the parent in these key ways:

  1. Explicit tool sequencing: Clearly prioritizes read_file → run_shell → execute_code_sandbox
  2. No retry on binary read_file: Instructs immediate fallback when binary data is detected
  3. Shell-first philosophy: Emphasizes pdftotext via run_shell as preferred over Python
  4. Reduced download focus: Assumes the PDF is available or is downloaded in a pre-step; focuses on the extraction cascade
  5. Decision tree visualization: Provides a clear flowchart for tool selection