OpenSpace local-pdf-extraction

Extract text from local PDFs using pdftotext or PyMuPDF via run_shell

install

source · Clone the upstream repo

git clone https://github.com/HKUDS/OpenSpace

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/local-pdf-extraction" ~/.claude/skills/hkuds-openspace-local-pdf-extraction && rm -rf "$T"

manifest: gdpval_bench/skills/local-pdf-extraction/SKILL.md

source content

Local PDF Extraction Workflow

Use this skill when you need to extract text from PDF files that exist locally on the filesystem, and

read_file

returns binary data instead of readable text.

When to Use

PDF files exist in the local workspace or known directories
```
read_file
```
on PDFs returns binary/garbled data instead of text
You need to process PDF content for analysis, summarization, or data extraction

Step-by-Step Instructions

Step 1: Locate PDF Files

First, list directory contents to find all PDF files:

ls -la *.pdf
# or for recursive search
find . -name "*.pdf" -type f

Step 2: Extract PDFs to Text

Choose one of these methods based on available tools:

Method A: Using pdftotext (poppler-utils)

# Extract single PDF
pdftotext input.pdf output.txt

# Batch extract all PDFs in directory
for pdf in *.pdf; do
    pdftotext "$pdf" "${pdf%.pdf}.txt"
done

Method B: Using PyMuPDF (fitz) via Python

python3 << 'EOF'
import fitz  # PyMuPDF
import glob
import os

for pdf_path in glob.glob("*.pdf"):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    
    txt_path = pdf_path.replace(".pdf", ".txt")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Extracted: {pdf_path} -> {txt_path}")
EOF

Step 3: Read Extracted Text Files

Once extracted, use

read_file

to read the

.txt

files:

# Now you can read the text files normally
content = read_file(filetype="txt", file_path="document.txt")

Step 4: Process Content

Proceed with your analysis, summarization, or data extraction on the text content.

Complete Workflow Example

# Step 1: Find PDFs
ls -la *.pdf

# Step 2: Extract all PDFs to text
for pdf in *.pdf; do
    pdftotext "$pdf" "${pdf%.pdf}.txt"
done

# Step 3: Verify extraction
ls -la *.txt

Or as a Python script via

run_shell

python3 << 'SCRIPT'
import fitz, glob
for pdf in glob.glob("*.pdf"):
    doc = fitz.open(pdf)
    text = "".join(page.get_text() for page in doc)
    with open(pdf.replace(".pdf", ".txt"), "w") as f:
        f.write(text)
    print(f"Done: {pdf}")
SCRIPT

Troubleshooting

pdftotext not found: Install with
```
apt-get install poppler-utils
```
or use PyMuPDF method
Empty text output: PDF may be image-based; consider OCR tools like
```
pdftoppm
```
+
```
tesseract
```
Encoding issues: Ensure output files use UTF-8 encoding

Key Takeaways

Never use
```
read_file
```
directly on PDFs for text extraction
Always list directory first to confirm PDF locations
Use
```
run_shell
```
with pdftotext or PyMuPDF for reliable extraction
Batch process when multiple PDFs exist
Read the resulting
```
.txt
```
files for further processing