OpenSpace pdf-text-extraction-9424c5

Extract text from PDF files using pdftotext when read_file returns binary data

install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/pdf-text-extraction-9424c5" ~/.claude/skills/hkuds-openspace-pdf-text-extraction-9424c5 && rm -rf "$T"
manifest: gdpval_bench/skills/pdf-text-extraction-9424c5/SKILL.md
source content

PDF Text Extraction via pdftotext

Problem

When using

read_file
on PDF documents, the function may return binary image data or garbled content instead of readable text. This occurs because PDFs can contain scanned images or complex binary structures that
read_file
cannot properly parse as text.

Solution

Use the

pdftotext
command-line utility via
run_shell
to extract clean text content from PDF files.

Steps

1. Verify PDF file exists

import os

pdf_path = "path/to/document.pdf"
if not os.path.exists(pdf_path):
    raise FileNotFoundError(f"PDF not found: {pdf_path}")

2. Extract text using pdftotext

from tools import run_shell

# Extract text to stdout
result = run_shell(command=f"pdftotext '{pdf_path}' -", timeout=60)
pdf_text = result.stdout

# Alternative: extract to a temporary file
temp_txt = "/tmp/extracted.txt"
run_shell(command=f"pdftotext '{pdf_path}' '{temp_txt}'", timeout=60)
with open(temp_txt, 'r') as f:
    pdf_text = f.read()

3. Handle parameter naming carefully

When calling

read_file
, be aware of the parameter name:

  • Use
    filetype="pdf"
    (not
    file_type
    )
  • Some tool implementations may use different parameter names
# Correct parameter usage
content = read_file(file_path="doc.pdf", filetype="pdf")

# If this returns binary/garbled data, fall back to pdftotext

Common pdftotext Options

OptionDescription
-
Output to stdout
-layout
Maintain original layout
-f <n>
Start from page n
-l <n>
End at page n
-q
Quiet mode

Example with options:

result = run_shell(command=f"pdftotext -layout -q '{pdf_path}' -", timeout=60)

Error Handling

from tools import run_shell

def extract_pdf_text(pdf_path):
    """Extract text from PDF using pdftotext with error handling."""
    import os
    
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF not found: {pdf_path}")
    
    result = run_shell(command=f"pdftotext '{pdf_path}' -", timeout=60)
    
    if result.returncode != 0:
        raise RuntimeError(f"pdftotext failed: {result.stderr}")
    
    return result.stdout.strip()

When to Use This Pattern

  • read_file
    returns binary data, garbled text, or image content for a PDF
  • You need searchable/processable text from PDF documents
  • The PDF contains text (not just scanned images - for those, consider OCR tools)

Prerequisites

  • pdftotext
    must be installed (part of
    poppler-utils
    on Debian/Ubuntu,
    poppler
    on macOS via Homebrew)
  • Verify availability:
    run_shell(command="which pdftotext")