OpenSpace robust-pdf-extraction
Multi-method PDF extraction with sequential fallback and OCR for scanned documents
install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/robust-pdf-extraction" ~/.claude/skills/hkuds-openspace-robust-pdf-extraction && rm -rf "$T"
manifest: gdpval_bench/skills/robust-pdf-extraction/SKILL.md
Robust PDF Extraction Workflow
This skill provides a systematic approach to extracting text from PDF files, handling both text-based and scanned/image-based documents through progressive fallback methods.
When to Use
- Processing PDFs of unknown or mixed types (text vs. scanned images)
- Critical document processing where extraction failure is not acceptable
- Batch processing multiple PDFs with varying formats
Workflow Steps
Step 1: Verify File Accessibility
Before attempting extraction, confirm the PDF exists and is readable:
```bash
# Check file exists and get basic info
ls -la /path/to/document.pdf

# Or search for files if location uncertain
find /path -name "*.pdf" -type f 2>/dev/null
```
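The same check can be done from Python before any extraction library is loaded. The sketch below is an illustration, not part of the original skill: the helper name `check_pdf_accessible` is hypothetical, and the header test assumes a file that begins with the `%PDF-` signature (a PDF with leading junk bytes would be rejected by it even if some readers accept it).

```python
import os
from pathlib import Path


def check_pdf_accessible(pdf_path):
    """Return (ok, reason) for a candidate PDF path.

    Hypothetical helper: checks existence, read permission, non-zero size,
    and the "%PDF-" signature at the start of the file.
    """
    path = Path(pdf_path)
    if not path.is_file():
        return False, "not a file or does not exist"
    if not os.access(path, os.R_OK):
        return False, "not readable by the current user"
    if path.stat().st_size == 0:
        return False, "file is empty"
    with path.open("rb") as handle:
        header = handle.read(5)
    if header != b"%PDF-":
        return False, "missing %PDF- header"
    return True, "ok"
```

Calling `check_pdf_accessible("/path/to/document.pdf")` before Step 2 keeps missing or truncated files from being reported as extraction failures.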
Step 2: Attempt Primary Extraction (pdfplumber)
Start with pdfplumber for best text structure preservation:
```python
import pdfplumber

def extract_with_pdfplumber(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
    return text.strip()
```
Step 3: Fallback to Secondary Method (pypdfium2)
If pdfplumber returns empty or incomplete text:
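```python
import pypdfium2 as pdfium  # the module is named pypdfium2, not pdfium2

def extract_with_pypdfium2(pdf_path):
    pdf = pdfium.PdfDocument(pdf_path)
    text = ""
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            text_page = page.get_textpage()
            page_text = text_page.get_text_bounded()
            if page_text:
                text += page_text + "\n"
    finally:
        # Release the underlying PDFium document handle
        pdf.close()
    return text.strip()
```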
Step 4: Fallback to Tertiary Method (pdftotext)
If pypdfium2 also fails, use command-line pdftotext:
```bash
pdftotext /path/to/document.pdf - 2>/dev/null
```
Or in Python:
```python
import subprocess

def extract_with_pdftotext(pdf_path):
    result = subprocess.run(
        ['pdftotext', pdf_path, '-'],
        capture_output=True,
        text=True
    )
    return result.stdout.strip()
```
Step 5: Detect Scanned/Image-Based PDFs
After each extraction attempt, verify text was actually extracted:
```python
def is_meaningful_text(text, min_chars=50):
    """Check if extracted text is meaningful (not empty or just whitespace)"""
    if not text:
        return False
    # Remove whitespace and check length
    cleaned = ''.join(text.split())
    return len(cleaned) >= min_chars
```
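A whole-document check can hide a mixed PDF in which a few pages are scanned and the rest carry a text layer. As a sketch under that assumption, the same threshold can be applied page by page. `find_sparse_pages` is a hypothetical helper that is not part of the skill, and it expects a list with one extracted string (or `None`) per page.

```python
def find_sparse_pages(page_texts, min_chars=50):
    """Return zero-based indices of pages whose text falls below min_chars.

    page_texts: list with one entry per page (str or None), for example
    collected from page.extract_text() in the pdfplumber step.
    """
    sparse = []
    for index, page_text in enumerate(page_texts):
        # Count non-whitespace characters, mirroring is_meaningful_text.
        cleaned = "".join((page_text or "").split())
        if len(cleaned) < min_chars:
            sparse.append(index)
    return sparse
```

Pages reported here are candidates for OCR even when the document as a whole passes the threshold.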
Step 6: OCR Fallback for Scanned Documents
If all text extraction methods return empty/insufficient text, the PDF is likely scanned. Use OCR:
```python
import pdf2image
import pytesseract

def extract_with_ocr(pdf_path, dpi=300):
    """Extract text from scanned PDFs using OCR"""
    text = ""
    # Render each page to a PIL image (pdf2image relies on pillow and poppler)
    images = pdf2image.convert_from_path(pdf_path, dpi=dpi)
    for image in images:
        page_text = pytesseract.image_to_string(image)
        text += page_text + "\n"
    return text.strip()
```
Complete Workflow Function
```python
def robust_pdf_extract(pdf_path):
    """
    Extract text from PDF using progressive fallback methods.
    Returns (text, method_used) tuple.
    """
    methods = [
        ("pdfplumber", extract_with_pdfplumber),
        ("pypdfium2", extract_with_pypdfium2),
        ("pdftotext", extract_with_pdftotext),
    ]

    for method_name, extract_func in methods:
        try:
            text = extract_func(pdf_path)
            if is_meaningful_text(text):
                return text, method_name
        except Exception as e:
            print(f"{method_name} failed: {e}")
            continue

    # All text methods failed - try OCR
    try:
        text = extract_with_ocr(pdf_path)
        if is_meaningful_text(text):
            return text, "ocr"
    except Exception as e:
        print(f"OCR failed: {e}")

    return "", "failed"
```
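For batch runs, the `(text, method_used)` return value makes it easy to tally which method handled each file. The sketch below is illustrative only: `extract_batch` is a hypothetical name, and the extractor is passed in as a parameter (in practice `robust_pdf_extract`) so the snippet stays self-contained.

```python
from collections import Counter


def extract_batch(pdf_paths, extract_func):
    """Run extract_func on each path and tally the methods that succeeded.

    extract_func must return a (text, method_used) tuple, as
    robust_pdf_extract does. Returns (results, method_counts).
    """
    results = {}
    method_counts = Counter()
    for pdf_path in pdf_paths:
        text, method = extract_func(pdf_path)
        results[str(pdf_path)] = text
        method_counts[method] += 1
    return results, method_counts
```

A call such as `extract_batch(paths, robust_pdf_extract)` yields the texts keyed by path plus a count per method. A large `"ocr"` or `"failed"` count suggests the batch is dominated by scanned or damaged files.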
Dependencies
Install required packages:
```bash
pip install pdfplumber pypdfium2 pdf2image pytesseract pillow

# Also need system packages:
# apt-get install poppler-utils tesseract-ocr   # Debian/Ubuntu
# brew install poppler tesseract                # macOS
```
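Because a missing dependency only surfaces when its fallback step is reached, it can help to check everything up front. The following is a minimal sketch, not part of the skill: `check_dependencies` is a hypothetical helper, and the command list assumes the default poppler and Tesseract binary names (`pdftotext`, `pdftoppm`, `tesseract`) are on `PATH`.

```python
import importlib.util
import shutil

# Import names can differ from pip names (pillow is imported as PIL).
PYTHON_MODULES = ["pdfplumber", "pypdfium2", "pdf2image", "pytesseract", "PIL"]
# pdftotext and pdftoppm ship with poppler; pdf2image calls pdftoppm.
SYSTEM_COMMANDS = ["pdftotext", "pdftoppm", "tesseract"]


def check_dependencies():
    """Return {name: bool} for each Python module and system command."""
    status = {}
    for module in PYTHON_MODULES:
        status[module] = importlib.util.find_spec(module) is not None
    for command in SYSTEM_COMMANDS:
        status[command] = shutil.which(command) is not None
    return status
```

Any entry that maps to `False` points at the package to install before running the workflow.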
Best Practices
- Log which method succeeded - helps identify document types for future optimization
- Set reasonable character thresholds - adjust `min_chars` based on expected document content
- Handle exceptions gracefully - each method may fail for different reasons
- Consider DPI for OCR - higher DPI (300+) improves accuracy but increases processing time
- Cache results - if processing the same PDFs repeatedly, store the successful extraction method per file
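The caching advice can be sketched with a small JSON file keyed by a hash of the file contents, so a renamed or moved PDF still hits the cache. Everything here is an assumption for illustration: the helper names (`file_digest`, `load_method_cache`, `remember_method`) and the JSON layout are not part of the skill.

```python
import hashlib
import json
from pathlib import Path


def file_digest(pdf_path):
    """SHA-256 of the file contents, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(pdf_path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_method_cache(cache_path):
    """Load the {digest: method} mapping, or an empty dict if absent."""
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def remember_method(cache_path, pdf_path, method):
    """Record the method that succeeded for this file's contents."""
    cache = load_method_cache(cache_path)
    cache[file_digest(pdf_path)] = method
    Path(cache_path).write_text(json.dumps(cache, indent=2), encoding="utf-8")
    return cache
```

Before extraction, looking up `file_digest(pdf_path)` in the loaded cache tells the caller which method to try first. After a successful run, `remember_method` stores the `method_used` value.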
Troubleshooting
| Symptom | Likely Cause | Solution |
|---|---|---|
| All methods return empty | Scanned PDF | OCR fallback should handle this |
| pdfplumber fails with permission error | File locked or permissions issue | Check file permissions, e.g. with `ls -la` as in Step 1 |
| OCR returns gibberish | Low quality scan or wrong language | Increase DPI, specify language in pytesseract |
| pdftotext not found | Missing poppler-utils | Install system package |
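
For the "OCR returns gibberish" row, one way to apply both suggestions is to pass an explicit Tesseract language code and retry at increasing DPI. This is a rough sketch rather than part of the skill: `ocr_pdf` and `ocr_with_rising_dpi` are hypothetical names, the DPI ladder is arbitrary, and the character count is only a crude stop condition, since gibberish can still pass it. The matching Tesseract language data (for example `deu`) is assumed to be installed.

```python
def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """OCR every page with an explicit Tesseract language code."""
    # Imported lazily so this module loads without the OCR stack installed.
    import pdf2image
    import pytesseract

    images = pdf2image.convert_from_path(pdf_path, dpi=dpi)
    pages = [pytesseract.image_to_string(image, lang=lang) for image in images]
    return "\n".join(pages).strip()


def ocr_with_rising_dpi(pdf_path, dpis=(200, 300, 400), lang="eng",
                        min_chars=50, ocr_func=ocr_pdf):
    """Retry OCR at increasing DPI until enough text is recovered.

    Returns (text, dpi_used); dpi_used is None if every attempt fell short,
    in which case text is the longest result seen.
    """
    best = ""
    for dpi in dpis:
        text = ocr_func(pdf_path, dpi=dpi, lang=lang)
        # Same non-whitespace count as is_meaningful_text.
        if len("".join(text.split())) >= min_chars:
            return text, dpi
        if len(text) > len(best):
            best = text
    return best, None
```

Several languages can be combined in one code with `+`, for example `lang="eng+deu"`. The `ocr_func` parameter exists so that a different OCR backend or a test double can be swapped in.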