OpenSpace pdf-extraction-fallback-80956b
Multi-fallback PDF download and text extraction with early failure detection
install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/pdf-extraction-fallback-80956b" ~/.claude/skills/hkuds-openspace-pdf-extraction-fallback-80956b && rm -rf "$T"
manifest:
gdpval_bench/skills/pdf-extraction-fallback-80956b/SKILL.md
PDF Extraction Fallback Workflow
This skill provides a robust, multi-layered approach to downloading and extracting text from PDF documents when sources may be protected, corrupted, or inaccessible via standard methods.
When to Use
- Downloading regulatory documents, handbooks, or official PDFs from government/enterprise websites
- Sources that may have JavaScript protection, CORS restrictions, or dynamic content
- When initial PDF downloads produce suspiciously small files or error content
- Any scenario requiring reliable text extraction from potentially problematic PDF sources
Core Workflow
Step 1: Download with Validation
```bash
# Download PDF with size check
curl -L -o output.pdf "https://example.com/document.pdf"

# Early failure detection: check file size (GNU stat first, then BSD stat)
file_size=$(stat -c%s output.pdf 2>/dev/null || stat -f%z output.pdf)
if [ "$file_size" -lt 1000 ]; then
    echo "WARNING: File size ($file_size bytes) suggests failed download or error page"
    # Check for HTML/JavaScript error content
    if head -c 500 output.pdf | grep -qi "<html\|<script\|error\|access denied"; then
        echo "FAILURE: File contains error message, not PDF content"
        rm output.pdf
        # Proceed to an alternative download method
    fi
fi
```
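If these checks fail, one option before escalating further is to retry with browser-like request headers, since some servers reject curl's default User-Agent. A minimal sketch using the `requests` library (an assumption here; it is not among this skill's listed dependencies) that also verifies the `%PDF` magic bytes:

```python
import requests

def download_with_headers(url, output_path, min_size=1000):
    """Retry a failed download with browser-like headers."""
    headers = {
        # Illustrative browser-style headers, not a required exact value
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
        "Accept": "application/pdf,*/*",
    }
    resp = requests.get(url, headers=headers, timeout=60)
    resp.raise_for_status()
    # Same early-failure logic as above: size check plus PDF magic bytes
    if len(resp.content) < min_size or not resp.content.startswith(b"%PDF"):
        return False
    with open(output_path, "wb") as f:
        f.write(resp.content)
    return True
```

Checking the magic bytes catches error pages served with a 200 status, which a size check alone can miss.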
Step 2: Sequential Extraction Fallbacks
Try the extraction methods in order, moving to the next on failure:
Fallback 1: pdftotext (command-line)
```bash
if command -v pdftotext &> /dev/null; then
    pdftotext output.pdf output.txt
    if [ -s output.txt ] && [ "$(wc -c < output.txt)" -gt 100 ]; then
        echo "SUCCESS: pdftotext extraction"
        exit 0
    fi
fi
```
Fallback 2: PyMuPDF (fitz)
```python
import fitz  # PyMuPDF

def extract_with_pymupdf(pdf_path):
    try:
        doc = fitz.open(pdf_path)
        text = ""
        for page in doc:
            text += page.get_text()
        doc.close()
        if len(text.strip()) > 100:
            return text
        return None
    except Exception as e:
        print(f"PyMuPDF failed: {e}")
        return None
```
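When PyMuPDF reports a document as encrypted (as in row 3 of the failure-documentation table below), one extra attempt is sometimes worthwhile: many "protected" PDFs set only an owner password and still open after authenticating with an empty user password. A hedged sketch using PyMuPDF's `is_encrypted` and `authenticate`:

```python
import fitz  # PyMuPDF

def extract_despite_encryption(pdf_path):
    """Try an empty password before treating an encrypted PDF as a failure."""
    doc = fitz.open(pdf_path)
    if doc.is_encrypted and not doc.authenticate(""):
        doc.close()
        return None  # genuinely password-protected; move to the next fallback
    text = "".join(page.get_text() for page in doc)
    doc.close()
    return text if len(text.strip()) > 100 else None
```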
Fallback 3: pdfplumber
```python
import pdfplumber

def extract_with_pdfplumber(pdf_path):
    try:
        text = ""
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n"
        if len(text.strip()) > 100:
            return text
        return None
    except Exception as e:
        print(f"pdfplumber failed: {e}")
        return None
```
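To make the move-to-the-next-on-failure rule concrete, the fallbacks can be chained in a single function. A minimal sketch, assuming the two extractor functions defined above plus a small pdftotext wrapper:

```python
import subprocess

def pdftotext_extract(pdf_path):
    """Wrap the pdftotext CLI so it can sit in the same fallback chain."""
    try:
        result = subprocess.run(["pdftotext", pdf_path, "-"],
                                capture_output=True, text=True, timeout=60)
        return result.stdout if len(result.stdout.strip()) > 100 else None
    except Exception:
        return None

def extract_with_fallbacks(pdf_path):
    """Return the first extraction result that passes the length check."""
    for extractor in (pdftotext_extract, extract_with_pymupdf,
                      extract_with_pdfplumber):
        text = extractor(pdf_path)
        if text:
            return text
    return None
```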
Step 3: Content Sanity Validation
After any extraction method succeeds:
```python
def validate_extracted_text(text, min_length=100):
    """Validate that extracted content is meaningful"""
    if not text or len(text.strip()) < min_length:
        return False
    # Check for common error patterns
    error_patterns = [
        "access denied", "permission denied", "error",
        "javascript", "<html", "<script", "404", "403",
    ]
    text_lower = text.lower()[:500]  # Check first 500 chars only
    for pattern in error_patterns:
        if pattern in text_lower:
            return False
    return True
```
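A short usage example combining an extractor with the validator (file names are illustrative):

```python
text = extract_with_pymupdf("output.pdf")
if text and validate_extracted_text(text):
    with open("output.txt", "w") as f:
        f.write(text)
else:
    print("Extracted content failed validation; trying next fallback")
```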
Complete Workflow Script
```python
#!/usr/bin/env python3
"""
Robust PDF extraction with multiple fallbacks
"""
import subprocess
import os
import sys


def download_pdf(url, output_path):
    """Download PDF with validation"""
    subprocess.run(["curl", "-L", "-o", output_path, url], check=True)
    # Validate download
    if not os.path.exists(output_path):
        return False
    file_size = os.path.getsize(output_path)
    if file_size < 1000:
        # Small files are usually error pages; inspect their content
        with open(output_path, "r", errors="ignore") as f:
            content = f.read(500).lower()
        if any(x in content for x in ["<html", "<script", "error", "denied"]):
            os.remove(output_path)
            return False
    return True


def extract_text(pdf_path):
    """Try multiple extraction methods"""
    # Method 1: pdftotext
    try:
        result = subprocess.run(
            ["pdftotext", pdf_path, "-"],
            capture_output=True, text=True, timeout=60,
        )
        if result.stdout and len(result.stdout.strip()) > 100:
            return result.stdout
    except Exception:
        pass

    # Method 2: PyMuPDF
    try:
        import fitz
        doc = fitz.open(pdf_path)
        text = "".join(page.get_text() for page in doc)
        doc.close()
        if len(text.strip()) > 100:
            return text
    except Exception:
        pass

    # Method 3: pdfplumber
    try:
        import pdfplumber
        text = ""
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n"
        if len(text.strip()) > 100:
            return text
    except Exception:
        pass

    return None


def main():
    url = sys.argv[1]
    pdf_path = "document.pdf"
    if not download_pdf(url, pdf_path):
        print("ERROR: Download failed or invalid content")
        sys.exit(1)
    text = extract_text(pdf_path)
    if text:
        with open("extracted.txt", "w") as f:
            f.write(text)
        print(f"SUCCESS: Extracted {len(text)} characters")
    else:
        print("ERROR: All extraction methods failed")
        sys.exit(1)


if __name__ == "__main__":
    main()
```
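To run the script, save it under a name of your choice (e.g. `extract_pdf.py`) and pass the source URL as the only argument: `python3 extract_pdf.py "https://example.com/document.pdf"`. On success the text is written to `extracted.txt`; on failure the script exits non-zero, so it can drive an outer retry loop.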
Failure Documentation
For each attempt, log the method, the outcome, the file size, and a short preview of the content:

| Attempt | Method | Result | File Size | Content Preview |
|---|---|---|---|---|
| 1 | Direct download | JavaScript error page | 92 bytes | - |
| 2 | pdftotext | File not a valid PDF | - | - |
| 3 | PyMuPDF | Encrypted/protected | - | - |
| 4 | pdfplumber | Success | 2.4 MB | "VA Handbook Chapter 1..." |
Best Practices
- **Always validate downloads immediately** - don't assume a successful HTTP 200 means valid content
- **Check file size thresholds** - files under 1 KB are almost always error pages
- **Scan for error patterns** - HTML tags, JavaScript, and error messages indicate failed downloads
- **Try multiple extractors** - different PDFs work better with different libraries
- **Set minimum content thresholds** - extracted text under 100 characters usually indicates failure
- **Clean up failed artifacts** - remove invalid files before retrying
- **Document each failure** - this helps diagnose patterns in source protection mechanisms (see the sketch after this list)
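A minimal sketch of such a failure log, appending rows in the same markdown format as the table above (the helper and file name are assumptions, not part of the skill):

```python
def log_attempt(attempt, method, result, file_size="-", preview="-",
                log_path="extraction_log.md"):
    """Append one attempt as a markdown table row for later diagnosis."""
    with open(log_path, "a") as f:
        f.write(f"| {attempt} | {method} | {result} | {file_size} | {preview} |\n")
```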
Dependencies
Install required tools:
```bash
# Command-line tool (provides pdftotext)
apt-get install poppler-utils

# Python libraries
pip install PyMuPDF pdfplumber
```
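Because any of these tools may be missing in a given environment, a quick availability check up front avoids discovering a gap mid-workflow. A sketch using only the Python standard library:

```python
import importlib.util
import shutil

def available_extractors():
    """Report which extraction fallbacks are usable right now."""
    return {
        "pdftotext": shutil.which("pdftotext") is not None,
        "PyMuPDF": importlib.util.find_spec("fitz") is not None,
        "pdfplumber": importlib.util.find_spec("pdfplumber") is not None,
    }

print(available_extractors())
```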
Notes for Regulatory Documents
Government and regulatory websites often:
- Use JavaScript-based PDF viewers instead of direct links
- Implement session-based access requiring authentication
- Serve error pages with 200 status codes
- Have CORS restrictions on direct downloads
When encountering these, consider:
- Using browser automation (Selenium/Playwright) as an additional fallback (see the sketch below)
- Checking for alternative document repositories
- Looking for cached versions via search engines
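For the browser-automation fallback, here is a hedged sketch using Playwright's sync API (install with `pip install playwright && playwright install chromium`). The `try/except` around `goto` reflects that Playwright aborts navigation when the response turns into a download:

```python
from playwright.sync_api import sync_playwright

def download_with_browser(url, output_path):
    """Fetch a PDF through a real browser session, which satisfies
    JavaScript checks and session cookies that defeat plain curl."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        with page.expect_download() as download_info:
            try:
                page.goto(url)
            except Exception:
                pass  # goto raises when navigation becomes a download
        download_info.value.save_as(output_path)
        browser.close()
```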