OpenSpace pdf-verification-cli
Verify PDF page count and content using command-line tools when Python libraries are unavailable
install
source · Clone the upstream repo
git clone https://github.com/HKUDS/OpenSpace
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/pdf-verification-cli" ~/.claude/skills/hkuds-openspace-pdf-verification-cli && rm -rf "$T"
manifest: gdpval_bench/skills/pdf-verification-cli/SKILL.md
PDF Verification with Command-Line Tools
When verifying PDF files during task execution, Python libraries like PyPDF2 may not be available in the environment. This skill provides a reliable alternative using standard command-line tools from the `poppler-utils` package.
When to Use This Skill
- Need to verify PDF page count
- Need to inspect PDF text content
- PyPDF2 or similar Python PDF libraries are unavailable
- Working in minimal/containerized environments
Core Tools
1. `pdfinfo` - Extract PDF Metadata
Use `pdfinfo` to get page count and other metadata:

    # Get full PDF info
    pdfinfo document.pdf

    # Get only the page count line
    pdfinfo document.pdf | grep '^Pages:'

    # Extract page count as a number
    # (anchoring on '^Pages:' avoids matching a title that contains "Pages")
    pdfinfo document.pdf | grep '^Pages:' | awk '{print $2}'
Key metadata fields:
- `Pages`: Number of pages in the PDF
- `Title`: Document title
- `Author`: Document author
- `Creator`: Application that created the original document
- `Producer`: Application that converted the document to PDF
- `CreationDate`: When the PDF was created
- `ModDate`: Last modification date
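The fields above can also be read programmatically. The following is a minimal sketch, assuming `pdfinfo` prints one `Key: value` pair per line; the function name `parse_pdfinfo` and the sample text are illustrative and not part of the skill or real tool output.

```python
def parse_pdfinfo(output):
    """Parse pdfinfo-style text into a dict of field name -> value."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only: date values contain colons too.
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


# Hand-written sample in the shape pdfinfo prints (not real tool output).
sample = (
    "Title:          Quarterly Report\n"
    "Creator:        Writer\n"
    "CreationDate:   Mon Jan  1 10:30:00 2024\n"
    "Pages:          4\n"
)
info = parse_pdfinfo(sample)
page_count = int(info["Pages"])
```

Parsing into a dict keeps the lookup independent of field order, which may differ between PDFs.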
2. `pdftotext` - Extract Text Content
Use `pdftotext` to inspect the actual content of the PDF:

    # Extract all text to stdout
    pdftotext document.pdf -

    # Extract text to a file
    pdftotext document.pdf output.txt

    # Extract text from specific page range
    pdftotext -f 1 -l 3 document.pdf output.txt

    # Preserve layout (rough formatting)
    pdftotext -layout document.pdf output.txt
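When these invocations are driven from a script, it can help to assemble the argument list in one place. The sketch below uses only the `-f`, `-l`, and `-layout` options shown above; the helper name `build_pdftotext_cmd` and its parameters are hypothetical.

```python
def build_pdftotext_cmd(pdf_path, first=None, last=None, layout=False, output="-"):
    """Build an argument list for pdftotext; output "-" means stdout."""
    cmd = ["pdftotext"]
    if first is not None:
        cmd += ["-f", str(first)]  # first page to extract
    if last is not None:
        cmd += ["-l", str(last)]   # last page to extract
    if layout:
        cmd.append("-layout")      # keep the rough physical layout
    cmd += [pdf_path, output]
    return cmd


cmd = build_pdftotext_cmd("document.pdf", first=1, last=3, layout=True)
# To execute: subprocess.run(cmd, capture_output=True, text=True)
```

Passing a list rather than a shell string avoids quoting problems with file names that contain spaces.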
Verification Workflow
Step 1: Check Tool Availability

    # Check if tools are installed
    which pdfinfo
    which pdftotext

    # Or test with --help
    pdfinfo --help 2>&1 | head -1
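The same availability check can be made from Python with the standard library. This is a small sketch; the function name `missing_tools` is illustrative.

```python
import shutil


def missing_tools(tools=("pdfinfo", "pdftotext")):
    """Return the names of tools that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("pdfinfo and pdftotext are available")
```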
Step 2: Install if Needed

    # Debian/Ubuntu
    apt-get update && apt-get install -y poppler-utils

    # RHEL/CentOS/Fedora
    yum install -y poppler-utils
    # or
    dnf install -y poppler-utils

    # macOS (with Homebrew)
    brew install poppler
Step 3: Verify PDF Properties

    # Verify page count matches expected
    EXPECTED_PAGES=4
    ACTUAL_PAGES=$(pdfinfo document.pdf | grep '^Pages:' | awk '{print $2}')

    if [ "$ACTUAL_PAGES" -eq "$EXPECTED_PAGES" ]; then
        echo "✓ Page count verified: $ACTUAL_PAGES pages"
    else
        echo "✗ Page count mismatch: expected $EXPECTED_PAGES, got $ACTUAL_PAGES"
    fi
Step 4: Verify PDF Content

    # Check for required sections/content
    pdftotext document.pdf - | grep -i "checklist" && echo "✓ Contains checklist section"
    pdftotext document.pdf - | grep -i "references" && echo "✓ Contains references section"

    # Count lines that mention a key term (case-insensitive)
    pdftotext document.pdf - | grep -ci "assessment"
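Note that `grep -c` reports the number of matching lines, not the number of individual occurrences. The sketch below contrasts the two counts on extracted text; both function names and the sample string are illustrative.

```python
def count_matching_lines(text, term):
    """Count lines containing term, case-insensitively (like grep -ci)."""
    needle = term.lower()
    return sum(1 for line in text.splitlines() if needle in line.lower())


def count_occurrences(text, term):
    """Count non-overlapping occurrences of term, case-insensitively."""
    return text.lower().count(term.lower())


# Hand-written sample standing in for pdftotext output.
sample = (
    "Risk Assessment\n"
    "The assessment covers two areas; each assessment is scored.\n"
    "Summary\n"
)
lines_hit = count_matching_lines(sample, "assessment")
total_hits = count_occurrences(sample, "assessment")
```

Which count is appropriate depends on the check: line counts suit "does this section exist" tests, while occurrence counts suit term-frequency requirements.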
Python Integration Example

    import subprocess

    def get_pdf_page_count(pdf_path):
        """Get page count using pdfinfo"""
        result = subprocess.run(
            ['pdfinfo', pdf_path],
            capture_output=True,
            text=True
        )
        for line in result.stdout.split('\n'):
            if line.startswith('Pages:'):
                return int(line.split(':')[1].strip())
        return None

    def extract_pdf_text(pdf_path):
        """Extract all text from PDF using pdftotext"""
        result = subprocess.run(
            ['pdftotext', pdf_path, '-'],
            capture_output=True,
            text=True
        )
        return result.stdout

    def verify_pdf(pdf_path, expected_pages, required_terms):
        """Verify PDF has expected page count and contains required terms"""
        # Check page count
        pages = get_pdf_page_count(pdf_path)
        if pages != expected_pages:
            return False, f"Expected {expected_pages} pages, got {pages}"

        # Check content
        text = extract_pdf_text(pdf_path).lower()
        missing = [term for term in required_terms if term.lower() not in text]
        if missing:
            return False, f"Missing terms: {missing}"

        return True, "PDF verification passed"
Common Use Cases
| Task | Command |
|---|---|
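| Count pages | `pdfinfo document.pdf \| grep Pages` |
| Check if PDF has text | `pdftotext document.pdf - \| head` |
| Search for keyword | `pdftotext document.pdf - \| grep -i "keyword"` |
| Extract first page | `pdftotext -f 1 -l 1 document.pdf -` |
| Get PDF title | `pdfinfo document.pdf \| grep Title` |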
Troubleshooting
pdfinfo: command not found
- Install poppler-utils (see Step 2 above)
- Ensure PATH includes the installation directory
`pdftotext` returns empty output
- PDF may be image-only (scanned) - requires OCR
- PDF may be encrypted/password-protected
- Try `pdftotext -layout` for better text extraction
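The causes listed above can be narrowed down from tool output. The sketch below assumes `pdfinfo` reports an `Encrypted:` field whose value starts with `yes` or `no`; that assumption, the function name `diagnose_empty_text`, and the sample strings are illustrative and should be checked against the installed poppler version.

```python
def diagnose_empty_text(pdfinfo_output, extracted_text):
    """Suggest why pdftotext produced no text, based on pdfinfo output."""
    # Form feeds and newlines count as whitespace, so strip() removes them.
    if extracted_text.strip():
        return "text present"
    for line in pdfinfo_output.splitlines():
        # Assumption: pdfinfo prints a line such as "Encrypted:  yes (...)".
        if line.startswith("Encrypted:"):
            value = line.split(":", 1)[1].strip().lower()
            if value.startswith("yes"):
                return "encrypted: text extraction may be restricted"
    return "no text layer found: possibly a scanned PDF that needs OCR"


# Hand-written samples (not real tool output).
scanned = diagnose_empty_text("Pages: 2\nEncrypted: no\n", "\f\f")
locked = diagnose_empty_text("Pages: 2\nEncrypted: yes (copy:no)\n", "")
```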
Page count seems wrong
- Some PDFs have blank pages counted
- Verify with `pdftotext` to see actual content per page
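One way to spot blank pages is to split the extracted text on form feed characters, which `pdftotext` inserts at page breaks by default. The following is a sketch under that assumption; `blank_pages` is a hypothetical helper and the sample string is hand-written.

```python
def blank_pages(text):
    """Return 1-based numbers of pages with no extractable text."""
    pages = text.split("\f")
    # Each page ends with a form feed, so drop the trailing empty piece.
    if pages and pages[-1] == "":
        pages.pop()
    return [number for number, page in enumerate(pages, start=1) if not page.strip()]


# Three pages: text, blank, text (stand-in for `pdftotext document.pdf -` output).
sample = "Intro text\n\f\fClosing text\n\f"
blank = blank_pages(sample)
```

A page flagged here may still contain images, so a blank result indicates missing text rather than a visually empty page.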
Best Practices
- Always verify both structure and content - Page count alone doesn't guarantee content quality
- Use case-insensitive searches - Content may vary in capitalization
- Handle errors gracefully - Tools may fail on corrupted or encrypted PDFs
- Combine with file existence checks - Verify PDF exists before running tools
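The last two practices can be combined into a guarded page-count lookup. This is a minimal sketch, not the skill's own implementation; the function name `safe_page_count` and the demo file name are illustrative.

```python
import os
import subprocess


def safe_page_count(pdf_path):
    """Return (page_count, error); exactly one of the two is None."""
    # Check the file first so a missing PDF is reported clearly.
    if not os.path.isfile(pdf_path):
        return None, f"File not found: {pdf_path}"
    try:
        result = subprocess.run(
            ["pdfinfo", pdf_path], capture_output=True, text=True
        )
    except FileNotFoundError:
        # Raised when the pdfinfo executable itself cannot be found.
        return None, "pdfinfo is not installed or not on PATH"
    if result.returncode != 0:
        # Corrupted or unreadable PDFs typically end up here.
        return None, f"pdfinfo failed: {result.stderr.strip()}"
    for line in result.stdout.splitlines():
        if line.startswith("Pages:"):
            return int(line.split(":", 1)[1].strip()), None
    return None, "pdfinfo output had no Pages field"


count, error = safe_page_count("no-such-file-for-demo.pdf")
```

Returning an error message instead of raising lets a verification script report every failed check in one pass.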