OpenSpace pdf-extraction-fallbacks-7d54a9

Multi-fallback PDF extraction with sequential approaches and early failure detection

install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/pdf-extraction-fallbacks-7d54a9" ~/.claude/skills/hkuds-openspace-pdf-extraction-fallbacks-7d54a9 && rm -rf "$T"
manifest: gdpval_bench/skills/pdf-extraction-fallbacks-7d54a9/SKILL.md
source content

PDF Extraction Fallbacks

This skill provides a robust workflow for extracting text from PDFs when source documents may fail to download or extract due to JavaScript protection, CORS restrictions, or encoding issues.

When to Use

  • Downloading regulatory documents from government/agency websites
  • Extracting text from PDFs that may be JavaScript-protected
  • Handling PDFs with potential CORS or encoding issues
  • When you need reliable text extraction backed by multiple fallbacks

Core Workflow

Step 1: Download with Validation

# Download PDF and immediately validate
curl -L -o document.pdf "URL_HERE"

# Check file size (reject if < 1KB - likely error page)
FILE_SIZE=$(stat -f%z document.pdf 2>/dev/null || stat -c%s document.pdf 2>/dev/null)
if [ "$FILE_SIZE" -lt 1024 ]; then
    echo "FAIL: File too small ($FILE_SIZE bytes) - likely error page"
    # Log the actual content to diagnose
    head -c 500 document.pdf
    exit 1
fi

# Check for HTML/error content instead of PDF
if head -c 500 document.pdf | grep -qi "<!DOCTYPE html\|<html\|error\|access denied"; then
    echo "FAIL: Downloaded HTML/error page instead of PDF"
    exit 1
fi

Step 2: Sequential Extraction Fallbacks

Try extraction methods in order of reliability:

Fallback 1: pdftotext (poppler-utils)

if command -v pdftotext &> /dev/null; then
    pdftotext -layout document.pdf output.txt 2>/dev/null
    if [ -s output.txt ] && [ $(wc -c < output.txt) -gt 100 ]; then
        echo "SUCCESS: pdftotext extraction"
        exit 0
    fi
fi

Fallback 2: PyMuPDF (fitz)

import fitz  # PyMuPDF

def extract_with_pymupdf(pdf_path):
    try:
        doc = fitz.open(pdf_path)
        text = ""
        for page in doc:
            text += page.get_text()
        doc.close()
        if len(text.strip()) > 100:
            return text
        return None
    except Exception as e:
        print(f"PyMuPDF failed: {e}")
        return None

Fallback 3: pdfplumber

import pdfplumber

def extract_with_pdfplumber(pdf_path):
    try:
        text = ""
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n"
        if len(text.strip()) > 100:
            return text
        return None
    except Exception as e:
        print(f"pdfplumber failed: {e}")
        return None

Step 3: Content Sanity Check

After any extraction, validate the output:

def validate_extraction(text, min_chars=100, min_words=20):
    """Check if extracted text is meaningful content."""
    if not text:
        return False, "Empty extraction"
    
    text = text.strip()
    if len(text) < min_chars:
        return False, f"Too short: {len(text)} chars"
    
    words = text.split()
    if len(words) < min_words:
        return False, f"Too few words: {len(words)}"
    
    # Check for error patterns
    error_patterns = [
        "access denied", "permission denied", "javascript required",
        "failed to load", "cannot display", "corrupted"
    ]
    text_lower = text.lower()
    for pattern in error_patterns:
        if pattern in text_lower[:500]:  # Check beginning
            return False, f"Error pattern detected: {pattern}"
    
    return True, "Valid extraction"

Complete Python Implementation

import requests
import subprocess
import os
from pathlib import Path

def robust_pdf_extraction(url, output_path="extracted.txt", temp_pdf="temp.pdf"):
    """
    Multi-fallback PDF extraction with validation at each step.
    Returns (success, text_or_error)
    """
    
    # Step 1: Download with validation
    try:
        response = requests.get(url, timeout=30, headers={
            'User-Agent': 'Mozilla/5.0 (compatible; DocumentExtractor/1.0)'
        })
        response.raise_for_status()
    except Exception as e:
        return False, f"Download failed: {e}"
    
    # Check response size
    if len(response.content) < 1024:
        return False, f"Downloaded content too small: {len(response.content)} bytes"
    
    # Check for HTML/error pages instead of a PDF (mirrors the shell check above)
    head = response.content[:500].lower()
    if b'<!doctype html' in head or b'<html' in head:
        return False, "Downloaded HTML page instead of PDF"
    
    # Save PDF
    Path(temp_pdf).write_bytes(response.content)
    
    # Step 2: Try extraction methods in order
    extraction_methods = [
        ("pdftotext", extract_pdftotext),
        ("PyMuPDF", extract_pymupdf),
        ("pdfplumber", extract_pdfplumber),
    ]
    
    for method_name, extract_func in extraction_methods:
        try:
            text = extract_func(temp_pdf)
            valid, msg = validate_extraction(text)
            if valid:
                Path(output_path).write_text(text, encoding="utf-8")
                os.remove(temp_pdf)  # clean up the temp copy once extraction succeeds
                return True, text
            print(f"{method_name}: {msg}")
        except Exception as e:
            print(f"{method_name} exception: {e}")
    
    # Keep the temp PDF on failure so the original can be inspected (see Best Practices)
    return False, "All extraction methods failed"


def extract_pdftotext(pdf_path):
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, timeout=60
    )
    return result.stdout if result.returncode == 0 else None


def extract_pymupdf(pdf_path):
    import fitz
    doc = fitz.open(pdf_path)
    text = "".join(page.get_text() for page in doc)
    doc.close()
    return text


def extract_pdfplumber(pdf_path):
    import pdfplumber
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
    return text

Failure Detection Checklist

| Check | Threshold | Action |
| --- | --- | --- |
| File size | < 1KB | Reject - likely error page |
| Content type | HTML detected | Reject - not a PDF |
| Extracted text | < 100 chars | Try next fallback |
| Word count | < 20 words | Try next fallback |
| Error patterns | Found in first 500 chars | Reject extraction |

Best Practices

  1. Always validate immediately after download - Don't wait until extraction to discover the PDF is invalid
  2. Log each fallback attempt - Helps diagnose which sites need special handling
  3. Set reasonable timeouts - PDF processing can hang on corrupted files
  4. Clean up temp files - Especially important in automated workflows
  5. Preserve original PDFs - Keep copies for debugging extraction failures
  6. Check for JavaScript protection - Some sites require headless browser rendering first (see the sketch below)
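
Practice 6 often means fetching the file through a real browser before running the extraction chain. A minimal sketch of that step, assuming Playwright is installed (pip install playwright, then playwright install chromium); the download_with_browser helper and its download-event flow are illustrative, not part of this skill:

from playwright.sync_api import sync_playwright

def download_with_browser(url, output_pdf="document.pdf"):
    """Let a real browser resolve JavaScript challenges, then save the PDF."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Assumes navigating to the URL triggers a file download; sites that
        # render the PDF inline need a different approach.
        with page.expect_download() as download_info:
            try:
                page.goto(url)
            except Exception:
                # Navigation is aborted when it turns into a download;
                # the download event itself still fires.
                pass
        download_info.value.save_as(output_pdf)
        browser.close()
    return output_pdf

The saved file can then be fed straight into the Step 2 fallbacks above.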

Common Failure Modes

| Symptom | Likely Cause | Solution |
| --- | --- | --- |
| 92-byte "PDF" | JavaScript error page | Use headless browser (Playwright/Selenium) |
| HTML content | Redirect to login/error | Check authentication requirements |
| Empty extraction | Scan-only PDF | Use OCR (pytesseract) as additional fallback (see sketch below) |
| Garbled text | Encoding issues | Try different PDF libraries |
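
For the scan-only case in the table above, OCR can be appended as one more fallback after pdfplumber. A minimal sketch, assuming pytesseract, pdf2image, and the underlying tesseract and poppler binaries are installed; the extract_with_ocr name and the 100-character threshold mirror the helpers above but are otherwise illustrative:

import pytesseract
from pdf2image import convert_from_path

def extract_with_ocr(pdf_path, dpi=300):
    """Rasterize each page and OCR it; slow, so keep it as the last resort."""
    text = ""
    for image in convert_from_path(pdf_path, dpi=dpi):
        text += pytesseract.image_to_string(image) + "\n"
    return text if len(text.strip()) > 100 else None

If used, add it to extraction_methods after pdfplumber so it only runs when the text-based extractors come back empty.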

Integration Example

# For regulatory document retrieval workflows
import os

def retrieve_regulatory_doc(doc_url, output_dir="docs"):
    os.makedirs(output_dir, exist_ok=True)  # ensure the output directory exists
    success, result = robust_pdf_extraction(
        doc_url,
        output_path=f"{output_dir}/content.txt",
        temp_pdf=f"{output_dir}/temp.pdf"
    )
    
    if success:
        print(f"✓ Extracted {len(result)} characters")
        return result
    else:
        print(f"✗ Failed: {result}")
        # Log URL for manual review
        with open("failed_urls.log", "a") as f:
            f.write(f"{doc_url}: {result}\n")
        return None