OpenSpace pdf-text-extraction-9424c5
Extract text from PDF files using pdftotext when read_file returns binary data
install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/pdf-text-extraction-9424c5" ~/.claude/skills/hkuds-openspace-pdf-text-extraction-9424c5 && rm -rf "$T"
manifest: gdpval_bench/skills/pdf-text-extraction-9424c5/SKILL.md
PDF Text Extraction via pdftotext
Problem
When using `read_file` on PDF documents, the function may return binary image data or garbled content instead of readable text. This occurs because PDFs can contain scanned images or complex binary structures that `read_file` cannot properly parse as text.
Solution
Use the `pdftotext` command-line utility via `run_shell` to extract clean text content from PDF files.
Steps
1. Verify PDF file exists
import os

pdf_path = "path/to/document.pdf"
if not os.path.exists(pdf_path):
    raise FileNotFoundError(f"PDF not found: {pdf_path}")
2. Extract text using pdftotext
import shlex
from tools import run_shell

# Quote the path so spaces or quotes in it cannot break the shell command
quoted_pdf = shlex.quote(pdf_path)

# Extract text to stdout
result = run_shell(command=f"pdftotext {quoted_pdf} -", timeout=60)
pdf_text = result.stdout

# Alternative: extract to a temporary file
temp_txt = "/tmp/extracted.txt"
run_shell(command=f"pdftotext {quoted_pdf} {shlex.quote(temp_txt)}", timeout=60)
with open(temp_txt, "r", encoding="utf-8") as f:
    pdf_text = f.read()
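pdftotext separates pages in its output with form feed characters (`\f`) unless page breaks are switched off, so the extracted string can be split back into pages when page-level processing is needed. A minimal sketch; `split_pages` is a hypothetical helper name and not part of this skill:

```python
def split_pages(pdf_text: str) -> list:
    """Split pdftotext output into per-page strings (hypothetical helper)."""
    pages = pdf_text.split("\f")
    # pdftotext normally ends the last page with a form feed as well,
    # which leaves one empty trailing element after the split.
    if pages and not pages[-1].strip():
        pages.pop()
    return pages


# Sample string standing in for real pdftotext output
sample = "Page one text\n\fPage two text\n\f"
for number, page in enumerate(split_pages(sample), start=1):
    print(number, page.strip())
```

Page numbers produced this way are 1-based positions within the extracted range, which matters if extraction did not start at the first page.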
3. Handle parameter naming carefully
When calling `read_file`, be aware of the parameter name:
- Use `filetype="pdf"` (not `file_type`)
- Some tool implementations may use different parameter names

# Correct parameter usage
content = read_file(file_path="doc.pdf", filetype="pdf")
# If this returns binary/garbled data, fall back to pdftotext
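Deciding whether the returned content counts as "binary/garbled" can be automated with a rough heuristic. The sketch below is an assumption about how such a check might look: the function name, the 0.9 threshold, and the idea that `read_file` returns either `str` or `bytes` are all illustrative, not taken from the tool's documentation. The only hard fact it relies on is that raw PDF files begin with the `%PDF-` header.

```python
def looks_like_extracted_text(content, min_printable_ratio=0.9):
    """Return True if content looks like readable text (hypothetical heuristic)."""
    if isinstance(content, bytes):
        # Raw PDF bytes start with the "%PDF-" header
        if content.startswith(b"%PDF-"):
            return False
        try:
            content = content.decode("utf-8")
        except UnicodeDecodeError:
            return False
    if not content or not content.strip():
        return False
    if content.lstrip().startswith("%PDF-"):
        return False
    # Count characters that are printable or whitespace; the Unicode
    # replacement character signals a failed decode, so it counts as bad.
    good = sum(
        (ch.isprintable() or ch.isspace()) and ch != "\ufffd"
        for ch in content
    )
    return good / len(content) >= min_printable_ratio


print(looks_like_extracted_text("Quarterly revenue rose 4%.\n"))
print(looks_like_extracted_text(b"%PDF-1.7\n\x00\x01\x02"))
```

When the check returns False, the caller would switch to the pdftotext path. The threshold is a guess and may need tuning for documents with many control characters.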
Common pdftotext Options
| Option | Description |
|---|---|
| `-` | Output to stdout |
| `-layout` | Maintain original layout |
| `-f n` | Start from page n |
| `-l n` | End at page n |
| `-q` | Quiet mode |
Example with options:
result = run_shell(command=f"pdftotext -layout -q '{pdf_path}' -", timeout=60)
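Combining several options by hand makes the command string easy to get wrong. As a sketch, the options can be assembled as a list and joined with `shlex.join` (Python 3.8+), which also quotes the path. The helper name `build_pdftotext_command` and its parameters are hypothetical; the flags themselves (`-layout`, `-q`, `-f`, `-l`, and `-` for stdout) are standard pdftotext options.

```python
import shlex


def build_pdftotext_command(pdf_path, first_page=None, last_page=None,
                            layout=False, quiet=True):
    """Build a shell-safe pdftotext command that writes to stdout (hypothetical helper)."""
    if first_page is not None and first_page < 1:
        raise ValueError("first_page must be >= 1")
    if (first_page is not None and last_page is not None
            and last_page < first_page):
        raise ValueError("last_page must not be smaller than first_page")

    args = ["pdftotext"]
    if layout:
        args.append("-layout")
    if quiet:
        args.append("-q")
    if first_page is not None:
        args += ["-f", str(first_page)]
    if last_page is not None:
        args += ["-l", str(last_page)]
    # Input file, then "-" so the text goes to stdout
    args += [pdf_path, "-"]
    return shlex.join(args)


print(build_pdftotext_command("my report.pdf", first_page=2, last_page=5, layout=True))
```

The resulting string can be passed as the `command` argument of `run_shell`.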
Error Handling
import os
import shlex
from tools import run_shell

def extract_pdf_text(pdf_path):
    """Extract text from PDF using pdftotext with error handling."""
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"PDF not found: {pdf_path}")
    # Quote the path so spaces or quotes in it cannot break the shell command
    result = run_shell(command=f"pdftotext {shlex.quote(pdf_path)} -", timeout=60)
    if result.returncode != 0:
        raise RuntimeError(f"pdftotext failed: {result.stderr}")
    return result.stdout.strip()
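A zero exit code does not guarantee useful output: a PDF made only of scanned images typically yields nothing but whitespace and form feeds. The sketch below separates the result check from the shell call so it can be reused and tested in isolation. `check_pdftotext_result` is a hypothetical helper; the exit code meanings follow the pdftotext man page.

```python
# Exit codes documented in the pdftotext man page
PDFTOTEXT_EXIT_CODES = {
    1: "error opening the PDF file",
    2: "error opening the output file",
    3: "error related to PDF permissions",
    99: "other error",
}


def check_pdftotext_result(returncode, stdout, stderr=""):
    """Validate a pdftotext result and return the stripped text (hypothetical helper)."""
    if returncode != 0:
        reason = PDFTOTEXT_EXIT_CODES.get(returncode, "unknown error")
        raise RuntimeError(
            f"pdftotext failed ({returncode}: {reason}): {stderr.strip()}"
        )
    # strip() also removes the form feed characters pdftotext emits per page
    text = stdout.strip()
    if not text:
        raise ValueError(
            "pdftotext produced no text; the PDF may contain only scanned images"
        )
    return text


print(check_pdftotext_result(0, "Hello from page one\n\f"))
```

With the `run_shell` result object used in this document, the call would look like `check_pdftotext_result(result.returncode, result.stdout, result.stderr)`.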
When to Use This Pattern
- `read_file` returns binary data, garbled text, or image content for a PDF
- You need searchable/processable text from PDF documents
- The PDF contains text (not just scanned images; for those, consider OCR tools)
Prerequisites
- `pdftotext` must be installed (part of `poppler-utils` on Debian/Ubuntu, `poppler` on macOS via Homebrew)
- Verify availability:

run_shell(command="which pdftotext")
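If Python runs in the same environment as the shell, the same check can be made without spawning a process by using `shutil.which` from the standard library. This is a sketch under that assumption; `require_tool` and its install hint text are hypothetical, and when `run_shell` executes in a separate sandbox the `which pdftotext` call remains the authoritative check.

```python
import shutil


def require_tool(name, install_hint):
    """Return the full path of an executable or raise with an install hint (hypothetical helper)."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"{name} not found on PATH; {install_hint}")
    return path


# Example call, commented out so the snippet runs even where pdftotext is missing:
# require_tool(
#     "pdftotext",
#     "install poppler-utils (Debian/Ubuntu) or poppler (Homebrew)",
# )
```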