OpenSpace robust-pdf-extraction

Multi-method PDF extraction with sequential fallback and OCR for scanned documents

install

source · Clone the upstream repo

git clone https://github.com/HKUDS/OpenSpace

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/robust-pdf-extraction" ~/.claude/skills/hkuds-openspace-robust-pdf-extraction && rm -rf "$T"

manifest: gdpval_bench/skills/robust-pdf-extraction/SKILL.md

source content

Robust PDF Extraction Workflow

This skill provides a systematic approach to extracting text from PDF files, handling both text-based and scanned/image-based documents through progressive fallback methods.

When to Use

Processing PDFs of unknown or mixed types (text vs. scanned images)
Critical document processing where extraction failure is not acceptable
Batch processing multiple PDFs with varying formats

Workflow Steps

Step 1: Verify File Accessibility

Before attempting extraction, confirm the PDF exists and is readable:

# Check file exists and get basic info
ls -la /path/to/document.pdf

# Or search for files if location uncertain
find /path -name "*.pdf" -type f 2>/dev/null
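The same pre-flight check can be done from Python before any extractor runs. The sketch below is a minimal illustration; the function name `check_pdf_readable` and the choice to look for the `%PDF-` marker within the first 1 KiB are assumptions, not part of the original skill.

```python
import os

def check_pdf_readable(pdf_path):
    """Return (ok, reason) after basic pre-flight checks on a PDF path."""
    if not os.path.isfile(pdf_path):
        return False, "not a regular file"
    if not os.access(pdf_path, os.R_OK):
        return False, "not readable"
    if os.path.getsize(pdf_path) == 0:
        return False, "empty file"
    # PDF files normally begin with the marker "%PDF-". Some tools tolerate
    # a few junk bytes before it, so search the first 1 KiB (an assumption).
    with open(pdf_path, "rb") as fh:
        head = fh.read(1024)
    if b"%PDF-" not in head:
        return False, "missing %PDF- header"
    return True, "ok"
```

A failed check here is a reason to stop early rather than walk through every fallback, since none of the extractors can recover from a missing or non-PDF file.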

Step 2: Attempt Primary Extraction (pdfplumber)

Start with pdfplumber for best text structure preservation:

import pdfplumber

def extract_with_pdfplumber(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
    return text.strip()

Step 3: Fallback to Secondary Method (pypdfium2)

If pdfplumber returns empty or incomplete text:

# The package is installed as "pypdfium2"; "pdfium" is the conventional alias
import pypdfium2 as pdfium

def extract_with_pypdfium2(pdf_path):
    pdf = pdfium.PdfDocument(pdf_path)
    text = ""
    for page in pdf:
        text_page = page.get_textpage()
        page_text = text_page.get_text_bounded()
        if page_text:
            text += page_text + "\n"
    return text.strip()

Step 4: Fallback to Tertiary Method (pdftotext)

If pypdfium2 also fails, use command-line pdftotext:

pdftotext /path/to/document.pdf - 2>/dev/null

Or in Python:

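import subprocess

def extract_with_pdftotext(pdf_path):
    # check=True raises CalledProcessError on a non-zero exit status,
    # so a failed run is reported instead of silently returning "".
    result = subprocess.run(
        ['pdftotext', pdf_path, '-'],
        capture_output=True,
        text=True,
        check=True
    )
    return result.stdout.strip()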

Step 5: Detect Scanned/Image-Based PDFs

After each extraction attempt, verify text was actually extracted:

def is_meaningful_text(text, min_chars=50):
    """Check if extracted text is meaningful (not empty or just whitespace)"""
    if not text:
        return False
    # Remove whitespace and check length
    cleaned = ''.join(text.split())
    return len(cleaned) >= min_chars
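A length check alone can be fooled when a PDF with a broken font encoding yields a long run of symbols. One possible refinement, sketched below, also requires that most non-whitespace characters are letters or digits. The name `looks_like_real_text` and the `min_alnum_ratio=0.6` default are illustrative assumptions and would need tuning for documents that are heavy on punctuation or numeric tables.

```python
def looks_like_real_text(text, min_chars=50, min_alnum_ratio=0.6):
    """Length check plus a rough 'mostly letters and digits' check."""
    if not text:
        return False
    # Drop all whitespace, as is_meaningful_text does
    cleaned = "".join(text.split())
    if len(cleaned) < min_chars:
        return False
    # Fraction of remaining characters that are alphanumeric
    alnum = sum(ch.isalnum() for ch in cleaned)
    return alnum / len(cleaned) >= min_alnum_ratio
```

Because `str.isalnum` is Unicode-aware, this heuristic should not penalize non-Latin scripts, though that has not been verified against real documents here.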

Step 6: OCR Fallback for Scanned Documents

If all text extraction methods return empty/insufficient text, the PDF is likely scanned. Use OCR:

import pdf2image
import pytesseract

def extract_with_ocr(pdf_path, dpi=300):
    """Extract text from scanned PDFs using OCR"""
    text = ""
    images = pdf2image.convert_from_path(pdf_path, dpi=dpi)
    for image in images:
        page_text = pytesseract.image_to_string(image)
        text += page_text + "\n"
    return text.strip()
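For non-English scans or large files, a variant of `extract_with_ocr` can pass a language to Tesseract and render one page at a time so that only a single page image is held in memory. The sketch below is an assumption-laden illustration: the names `tesseract_lang` and `extract_with_ocr_lang` are hypothetical, and any language code other than `eng` requires the matching Tesseract language data to be installed.

```python
def tesseract_lang(langs):
    """Join Tesseract language codes into the 'eng+deu' form it accepts."""
    codes = [code.strip() for code in langs if code and code.strip()]
    # Fall back to English when no usable code is given (an assumption)
    return "+".join(codes) if codes else "eng"

def extract_with_ocr_lang(pdf_path, langs=("eng",), dpi=300):
    """OCR a PDF page by page with an explicit Tesseract language."""
    # Imported lazily so tesseract_lang works without the OCR stack installed
    import pdf2image
    import pytesseract

    lang = tesseract_lang(langs)
    page_count = pdf2image.pdfinfo_from_path(pdf_path)["Pages"]
    parts = []
    for page_no in range(1, page_count + 1):
        # Render a single page per call to keep memory use bounded
        images = pdf2image.convert_from_path(
            pdf_path, dpi=dpi, first_page=page_no, last_page=page_no
        )
        for image in images:
            parts.append(pytesseract.image_to_string(image, lang=lang))
    return "\n".join(parts).strip()
```

For example, `extract_with_ocr_lang("scan.pdf", langs=("eng", "deu"))` would ask Tesseract to recognize English and German together.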

Complete Workflow Function

def robust_pdf_extract(pdf_path):
    """
    Extract text from PDF using progressive fallback methods.
    Returns (text, method_used) tuple.
    """
    methods = [
        ("pdfplumber", extract_with_pdfplumber),
        ("pypdfium2", extract_with_pypdfium2),
        ("pdftotext", extract_with_pdftotext),
    ]
    
    for method_name, extract_func in methods:
        try:
            text = extract_func(pdf_path)
            if is_meaningful_text(text):
                return text, method_name
        except Exception as e:
            print(f"{method_name} failed: {e}")
            continue
    
    # All text methods failed - try OCR
    try:
        text = extract_with_ocr(pdf_path)
        if is_meaningful_text(text):
            return text, "ocr"
    except Exception as e:
        print(f"OCR failed: {e}")
    
    return "", "failed"
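For batch runs, it may help to wrap the workflow so that one bad file does not stop the batch and the successful method is tallied per run. The sketch below is one way to do that; `batch_extract` is a hypothetical helper, and it takes the extractor as a parameter (for example `robust_pdf_extract`) so it stays independent of the PDF libraries.

```python
from collections import Counter

def batch_extract(pdf_paths, extract):
    """Run `extract` on each path; return (results, method_counts).

    `extract` must return a (text, method_used) tuple.
    """
    results = {}
    counts = Counter()
    for path in pdf_paths:
        try:
            text, method = extract(path)
        except Exception as exc:
            # Keep the batch going on unexpected errors
            text, method = "", "failed"
            print(f"{path}: unexpected error: {exc}")
        results[path] = (text, method)
        counts[method] += 1
    return results, counts
```

Calling `batch_extract(paths, robust_pdf_extract)` gives a per-file result plus a count such as how many files needed OCR, which is the information the "log which method succeeded" practice is after.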

Dependencies

Install required packages:

pip install pdfplumber pypdfium2 pdf2image pytesseract pillow
# Also need system packages:
# apt-get install poppler-utils tesseract-ocr  # Debian/Ubuntu
# brew install poppler tesseract  # macOS
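Missing dependencies tend to surface only when a fallback is finally reached, so a quick check up front can save a confusing failure later. The following sketch uses only the standard library; the function name `missing_dependencies` and the default lists are assumptions (`pdftoppm` is included because pdf2image relies on poppler's command-line tools).

```python
import importlib.util
import shutil

# Import names, which can differ from pip package names (pillow -> PIL)
PYTHON_MODULES = ["pdfplumber", "pypdfium2", "pdf2image", "pytesseract", "PIL"]
# Executables expected on PATH
SYSTEM_TOOLS = ["pdftotext", "pdftoppm", "tesseract"]

def missing_dependencies(modules=PYTHON_MODULES, tools=SYSTEM_TOOLS):
    """Return (missing_modules, missing_tools) without importing anything."""
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_tools = [t for t in tools if shutil.which(t) is None]
    return missing_modules, missing_tools

if __name__ == "__main__":
    mods, tools = missing_dependencies()
    print("Missing Python modules:", mods or "none")
    print("Missing system tools:", tools or "none")
```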

Best Practices

Log which method succeeded - helps identify document types for future optimization
Set reasonable character thresholds - adjust `min_chars` based on expected document content
Handle exceptions gracefully - each method may fail for different reasons
Consider DPI for OCR - higher DPI (300+) improves accuracy but increases processing time
Cache results - if processing same PDFs repeatedly, store successful extraction method per file
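The caching practice above could be implemented in several ways. The sketch below is a minimal version that keys a JSON file by the SHA-256 of the PDF's contents, so a renamed or moved file still hits the cache. All names here (`file_digest`, `remember_method`, `recall_method`) and the JSON layout are assumptions for illustration.

```python
import hashlib
import json
import os

def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def load_cache(cache_path):
    """Load the digest -> method mapping, or an empty dict if absent."""
    if not os.path.exists(cache_path):
        return {}
    with open(cache_path, "r", encoding="utf-8") as fh:
        return json.load(fh)

def remember_method(cache_path, pdf_path, method):
    """Record which extraction method worked for this file."""
    cache = load_cache(cache_path)
    cache[file_digest(pdf_path)] = method
    with open(cache_path, "w", encoding="utf-8") as fh:
        json.dump(cache, fh, indent=2)

def recall_method(cache_path, pdf_path):
    """Return the cached method name, or None if the file is unknown."""
    return load_cache(cache_path).get(file_digest(pdf_path))
```

A caller could try the recalled method first and fall back to the full workflow only when it is `None` or no longer produces meaningful text.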

Troubleshooting

Symptom	Likely Cause	Solution
All methods return empty	Scanned PDF	OCR fallback should handle this
pdfplumber fails with permission error	File locked or permissions issue	Check file permissions with `ls -la`
OCR returns gibberish	Low quality scan or wrong language	Increase DPI, specify language in pytesseract
pdftotext not found	Missing poppler-utils	Install system package