OpenSpace pdf-extraction-fallback-7db3aa

Resilient multi-tier PDF extraction with sequential fallback strategies when initial reading fails

install

source · Clone the upstream repo

git clone https://github.com/HKUDS/OpenSpace

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/pdf-extraction-fallback-7db3aa" ~/.claude/skills/hkuds-openspace-pdf-extraction-fallback-7db3aa && rm -rf "$T"

manifest: gdpval_bench/skills/pdf-extraction-fallback-7db3aa/SKILL.md

source content

PDF Extraction Fallback Strategy

Purpose

When processing complex documents (tax forms, legal documents, scanned materials), PDF extraction often fails on the first attempt. This skill provides a systematic fallback approach that tries multiple extraction methods in sequence until one succeeds.

When to Use

Initial PDF reading tools return errors or empty content
Document appears to be scanned/image-based rather than text-based
Previous extraction attempts produced incomplete or garbled output
Working with forms, tables, or structured documents that need reliable extraction
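
As a rough illustration of the second condition above, a document can be flagged as probably image-based when most of its pages yield almost no text. This is a minimal sketch, not part of the original skill: the function name `looks_scanned` and both thresholds are assumptions chosen for illustration.

```python
def looks_scanned(page_texts, min_chars_per_page=20, max_empty_ratio=0.5):
    """Guess whether a PDF is image-based from the text extracted per page.

    page_texts: list of strings (or None), one entry per page, as returned
    by whichever text extractor was tried first.
    """
    if not page_texts:
        return True  # nothing extracted at all: treat as scanned
    # Count pages whose extracted text is missing or nearly empty
    nearly_empty = sum(
        1 for text in page_texts
        if len((text or "").strip()) < min_chars_per_page
    )
    return nearly_empty / len(page_texts) > max_empty_ratio
```

A result of `True` suggests skipping straight to the OCR tier; the thresholds would need tuning for documents that legitimately have sparse pages.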

Fallback Sequence

Tier 1: Shell-Based Extraction (pdftotext)

Start with command-line tools that often handle edge cases better:

# Extract text maintaining layout
pdftotext -layout input.pdf output.txt

# Extract text in reading order (the default mode; no layout preservation)
pdftotext input.pdf output.txt

# Extract specific page range
pdftotext -f 1 -l 3 input.pdf output.txt

Check if output contains meaningful content before proceeding.
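
One way to automate that check from Python is sketched below. It assumes `pdftotext` (from Poppler) may or may not be on the PATH; the helper names `pdftotext_layout` and `has_meaningful_content` and the 100-character threshold are illustrative assumptions, not part of the original skill.

```python
import shutil
import subprocess

def pdftotext_layout(pdf_path, timeout=60):
    """Return the output of `pdftotext -layout`, or None if the tool is missing or fails."""
    if shutil.which("pdftotext") is None:
        return None  # tool not installed: the caller should move to the next tier
    try:
        result = subprocess.run(
            ["pdftotext", "-layout", pdf_path, "-"],  # "-" writes the text to stdout
            capture_output=True,
            encoding="utf-8",
            errors="replace",
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None

def has_meaningful_content(text, min_chars=100):
    """Hypothetical check: enough non-whitespace characters to be worth keeping."""
    if not text:
        return False
    return sum(1 for ch in text if not ch.isspace()) >= min_chars
```

Counting non-whitespace characters, rather than total length, avoids accepting `-layout` output that consists mostly of padding spaces.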

Tier 2: Python-Based Parsing

If shell tools fail, use Python libraries with different extraction approaches:

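# Using PyPDF2 for basic text extraction (legacy; PyPDF2 is deprecated in favor of pypdf)
import PyPDF2
with open('document.pdf', 'rb') as f:
    reader = PyPDF2.PdfReader(f)
    # Guard against pages with no text layer, and keep page boundaries
    text = '\n'.join(page.extract_text() or '' for page in reader.pages)

# Using pdfplumber for tables and structured content
import pdfplumber
page_texts, tables = [], []
with pdfplumber.open('document.pdf') as pdf:
    for page in pdf.pages:
        # Accumulate per page instead of overwriting on each iteration
        page_texts.append(page.extract_text() or '')
        tables.extend(page.extract_tables())
text = '\n'.join(page_texts)

# Using pypdf (the maintained successor to PyPDF2)
from pypdf import PdfReader
reader = PdfReader('document.pdf')
text = '\n'.join(page.extract_text() or '' for page in reader.pages)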

Tier 3: OCR Tools (for Scanned Documents)

If the PDF contains images or scanned content, use OCR:

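# Using tesseract via command line. tesseract reads images, not PDFs,
# so rasterize the pages first (pdftoppm ships with Poppler).
pdftoppm -r 300 -png input.pdf page
for img in page-*.png; do tesseract "$img" "${img%.png}" --psm 6; done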

# Using Python with pytesseract
import pytesseract
from pdf2image import convert_from_path

images = convert_from_path('document.pdf')
text = ''.join(pytesseract.image_to_string(img) for img in images)

Implementation Pattern

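import subprocess

class ExtractionError(Exception):
    """Raised when every extraction tier fails."""

def extract_pdf_resilient(pdf_path):
    """Try multiple extraction methods until one succeeds."""

    # Tier 1: Shell extraction ("-" sends the text to stdout).
    # Passing an argument list avoids shell quoting problems in file names.
    try:
        result = subprocess.run(
            ['pdftotext', '-layout', pdf_path, '-'],
            capture_output=True, text=True, errors='replace',
        )
        if result.returncode == 0 and len(result.stdout.strip()) > 100:
            return result.stdout, 'pdftotext'
    except OSError:
        pass  # pdftotext is not installed or not executable

    # Tier 2: Python libraries
    try:
        import pdfplumber
        with pdfplumber.open(pdf_path) as pdf:
            text = '\n'.join(page.extract_text() or '' for page in pdf.pages)
        if text.strip():
            return text, 'pdfplumber'
    except Exception:
        pass  # library missing or PDF unreadable; fall through to OCR

    # Tier 3: OCR fallback
    try:
        from pdf2image import convert_from_path
        import pytesseract
        images = convert_from_path(pdf_path)
        text = '\n'.join(pytesseract.image_to_string(img) for img in images)
        if text.strip():
            return text, 'tesseract-ocr'
    except Exception:
        pass  # OCR stack missing or conversion failed

    raise ExtractionError("All extraction methods failed")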

Decision Criteria

Indicator	Action
Empty output	Proceed to next tier
Garbled/special characters	Try next tier
Partial content	Accept if meets minimum threshold
Tool not available	Skip to next tier
Format-specific errors	Try alternative library
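
The table can be read as a small decision function. The sketch below is one possible encoding, not the skill's own implementation: the `Action` enum, the `decide` function, and both thresholds are hypothetical names and values. Format-specific errors are assumed to surface as exceptions in the caller, so they are not modeled here.

```python
from enum import Enum

class Action(Enum):
    ACCEPT = "accept"
    NEXT_TIER = "next_tier"

def decide(text, tool_available=True, min_chars=50, max_bad_ratio=0.1):
    """Map the result of one extraction attempt to an action from the table."""
    if not tool_available:
        return Action.NEXT_TIER  # tool not available: skip to the next tier
    stripped = (text or "").strip()
    if not stripped:
        return Action.NEXT_TIER  # empty output
    # U+FFFD is the Unicode replacement character, a common sign of garbling
    if stripped.count("\ufffd") > len(stripped) * max_bad_ratio:
        return Action.NEXT_TIER  # garbled output
    if len(stripped) >= min_chars:
        return Action.ACCEPT  # partial or full content that meets the threshold
    return Action.NEXT_TIER  # too short to be useful
```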

Best Practices

Validate each attempt - Check output length and quality before accepting
Log which method succeeded - Track which tier worked for future reference
Set minimum content thresholds - Don't accept trivial results (e.g., <50 chars)
Combine methods if needed - Some documents need multiple approaches for different sections
Preserve original file - Never modify the source PDF during extraction attempts
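
Combining methods can be done page by page: keep the primary extractor's text where it is good enough and call a fallback only for the weak pages. The following is a minimal sketch under that assumption; `combine_per_page`, the `fallback` callable, and the 50-character threshold are illustrative names and values, not part of the original skill.

```python
def combine_per_page(primary_pages, fallback, min_chars=50):
    """Merge two extraction methods page by page.

    primary_pages: list of str or None, one entry per page from the first method.
    fallback: callable taking a zero-based page index and returning text,
              for example an OCR routine for that single page.
    Returns (texts, methods), recording which method supplied each page.
    """
    texts, methods = [], []
    for index, text in enumerate(primary_pages):
        text = text or ""
        if len(text.strip()) >= min_chars:
            texts.append(text)
            methods.append("primary")
        else:
            # Only pay the cost of the fallback for pages that need it
            texts.append(fallback(index) or "")
            methods.append("fallback")
    return texts, methods
```

Returning the per-page method list also covers the logging practice above, since it records which approach produced each page.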

Error Handling

Catch exceptions at each tier instead of failing on the first error
Log detailed error messages for debugging
Move on to the next tier when the current tier returns output that fails the quality check, even if it partially succeeded
After all tiers fail, provide clear summary of what was tried
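
These rules can be sketched as a generic tier runner that records every attempt and reports all of them if nothing works. The names `run_tiers`, `AllTiersFailed`, and the logger name are assumptions for illustration; each tier is assumed to be a callable that takes the PDF path and returns text.

```python
import logging

logger = logging.getLogger("pdf_fallback")  # hypothetical logger name

class AllTiersFailed(Exception):
    """Raised with a summary of every attempt when no tier succeeds."""

def run_tiers(pdf_path, tiers, is_valid):
    """Run (name, extractor) pairs in order until one yields valid text."""
    attempts = []
    for name, extract in tiers:
        try:
            text = extract(pdf_path)
        except Exception as exc:  # catch per tier so one failure does not stop the run
            logger.warning("tier %s failed on %s: %r", name, pdf_path, exc)
            attempts.append((name, f"error: {exc!r}"))
            continue
        if is_valid(text):
            logger.info("tier %s succeeded on %s", name, pdf_path)
            return text, name
        attempts.append((name, "rejected: output failed validation"))
    summary = "; ".join(f"{name} -> {outcome}" for name, outcome in attempts)
    raise AllTiersFailed(f"No tier produced usable text for {pdf_path}: {summary}")
```

Keeping the attempts in a list, rather than only logging them, lets the final exception message state exactly what was tried and why each tier was rejected.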

Output Quality Check

Before declaring extraction complete:

def validate_extraction(text):
    if not text or len(text.strip()) < 50:
        return False
    if text.count('\ufffd') > len(text) * 0.1:  # Too many U+FFFD replacement chars
        return False
    if len(set(text)) < 10:  # Too little character variety
        return False
    return True