OpenSpace local-pdf-extraction
Extract text from local PDFs using pdftotext or PyMuPDF via run_shell
install
source · Clone the upstream repo
```
git clone https://github.com/HKUDS/OpenSpace
```
Claude Code · Install into ~/.claude/skills/
```
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/gdpval_bench/skills/local-pdf-extraction" ~/.claude/skills/hkuds-openspace-local-pdf-extraction \
  && rm -rf "$T"
```
manifest: gdpval_bench/skills/local-pdf-extraction/SKILL.md
Local PDF Extraction Workflow
Use this skill when you need to extract text from PDF files that exist locally on the filesystem and read_file returns binary data instead of readable text.
When to Use
- PDF files exist in the local workspace or known directories
- read_file on PDFs returns binary/garbled data instead of text
- You need to process PDF content for analysis, summarization, or data extraction
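Before extracting, it can help to confirm a file really is a PDF: PDFs start with the magic bytes `%PDF-`, which is also why read_file sees binary data. A minimal stdlib-only sketch (the helper name `is_pdf` is illustrative, not part of this skill):

```python
def is_pdf(path: str) -> bool:
    """Return True if the file begins with the PDF magic bytes %PDF-."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"
```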
Step-by-Step Instructions
Step 1: Locate PDF Files
First, list directory contents to find all PDF files:
```
ls -la *.pdf

# or for recursive search
find . -name "*.pdf" -type f
```
Step 2: Extract PDFs to Text
Choose one of these methods based on available tools:
Method A: Using pdftotext (poppler-utils)
```
# Extract single PDF
pdftotext input.pdf output.txt

# Batch extract all PDFs in directory
for pdf in *.pdf; do
  pdftotext "$pdf" "${pdf%.pdf}.txt"
done
```
Method B: Using PyMuPDF (fitz) via Python
```
python3 << 'EOF'
import fitz  # PyMuPDF
import glob

for pdf_path in glob.glob("*.pdf"):
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    txt_path = pdf_path.replace(".pdf", ".txt")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Extracted: {pdf_path} -> {txt_path}")
EOF
```
Step 3: Read Extracted Text Files
Once extracted, use read_file to read the .txt files:
```
# Now you can read the text files normally
content = read_file(filetype="txt", file_path="document.txt")
```
Step 4: Process Content
Proceed with your analysis, summarization, or data extraction on the text content.
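As one example of this processing step, a small stdlib-only sketch that maps each extracted .txt file in the current directory to its word count (the helper name `word_counts` is illustrative, not part of this skill):

```python
import glob

def word_counts(pattern: str = "*.txt") -> dict:
    """Map each matching text file to its whitespace-delimited word count."""
    counts = {}
    for txt in glob.glob(pattern):
        with open(txt, encoding="utf-8") as f:
            counts[txt] = len(f.read().split())
    return counts
```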
Complete Workflow Example
```
# Step 1: Find PDFs
ls -la *.pdf

# Step 2: Extract all PDFs to text
for pdf in *.pdf; do
  pdftotext "$pdf" "${pdf%.pdf}.txt"
done

# Step 3: Verify extraction
ls -la *.txt
```
Or as a Python script via run_shell:
```
python3 << 'SCRIPT'
import fitz, glob

for pdf in glob.glob("*.pdf"):
    doc = fitz.open(pdf)
    text = "".join(page.get_text() for page in doc)
    with open(pdf.replace(".pdf", ".txt"), "w", encoding="utf-8") as f:
        f.write(text)
    print(f"Done: {pdf}")
SCRIPT
```
Troubleshooting
- pdftotext not found: Install with apt-get install poppler-utils, or use the PyMuPDF method
- Empty text output: PDF may be image-based; consider OCR tools like tesseract + pdftoppm
- Encoding issues: Ensure output files use UTF-8 encoding
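To spot the image-based case automatically, you can check whether the extracted text is effectively empty before reaching for OCR. A hedged stdlib-only sketch (the helper name `needs_ocr` and the 10-character threshold are arbitrary assumptions, not part of this skill):

```python
def needs_ocr(txt_path: str, min_chars: int = 10) -> bool:
    """Heuristic: (nearly) empty extracted text suggests an image-based PDF."""
    with open(txt_path, encoding="utf-8") as f:
        return len(f.read().strip()) < min_chars
```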
Key Takeaways
- Never use read_file directly on PDFs for text extraction
- Always list directory first to confirm PDF locations
- Use run_shell with pdftotext or PyMuPDF for reliable extraction
- Batch process when multiple PDFs exist
- Read the resulting .txt files for further processing