Skillsbench pdf-reading
Extract text, tables, and structured information from PDF documents using pdfplumber, PyPDF2, or pdftotext command-line tools.
install
source · Clone the upstream repo
git clone https://github.com/benchflow-ai/skillsbench
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/benchflow-ai/skillsbench "$T" && mkdir -p ~/.claude/skills && cp -r "$T/libs/artifact-runner/tasks/nodemedic-demo/environment/skills/pdf-reading" ~/.claude/skills/benchflow-ai-skillsbench-pdf-reading && rm -rf "$T"
manifest:
libs/artifact-runner/tasks/nodemedic-demo/environment/skills/pdf-reading/SKILL.mdsource content
PDF Reading Skill
This skill helps agents extract information from PDF documents.
Tools Available
The environment has these PDF tools pre-installed:
- Python library for precise text/table extractionpdfplumber
- Python library for PDF manipulationPyPDF2
- Command-line tool from poppler-utilspdftotext
Quick Extraction
Command Line (Fast)
# Extract all text from a PDF pdftotext /root/artifacts/paper.pdf - # Extract specific pages pdftotext -f 1 -l 3 /root/artifacts/paper.pdf - # Extract to a file pdftotext /root/artifacts/paper.pdf /tmp/paper.txt
Python (More Control)
import pdfplumber from pathlib import Path def extract_pdf_text(pdf_path: str) -> str: """Extract all text from a PDF file.""" text_parts = [] with pdfplumber.open(pdf_path) as pdf: for page in pdf.pages: text = page.extract_text() if text: text_parts.append(text) return "\n\n".join(text_parts) # Usage text = extract_pdf_text("/root/artifacts/paper.pdf") print(text)
Extracting Specific Information
Find Commands in Text
import re def find_commands(text: str) -> list: """Extract shell commands from text.""" # Look for common command patterns patterns = [ r'docker run[^\n]+', r'\$[^\n]+', r'--package=[^\s]+\s+--version=[^\s]+', ] commands = [] for pattern in patterns: commands.extend(re.findall(pattern, text)) return commands text = extract_pdf_text("/root/artifacts/paper.pdf") commands = find_commands(text)
Extract Tables
import pdfplumber def extract_tables(pdf_path: str) -> list: """Extract all tables from a PDF.""" tables = [] with pdfplumber.open(pdf_path) as pdf: for i, page in enumerate(pdf.pages): page_tables = page.extract_tables() for table in page_tables: tables.append({ "page": i + 1, "data": table }) return tables
Find Package Information
import re def find_package_info(text: str) -> list: """Find npm package references (name@version).""" # Match patterns like node-rsync@1.0.3 pattern = r'([a-z0-9-]+)@(\d+\.\d+\.\d+)' matches = re.findall(pattern, text.lower()) return [{"name": m[0], "version": m[1]} for m in matches]
Tips
-
Check available PDFs first:
ls -la /root/artifacts/ -
Preview before full extraction:
pdftotext /root/artifacts/paper.pdf - | head -100 -
Handle multi-column layouts: pdfplumber handles them better than pdftotext
-
For structured data: Look for JSON blocks in the text:
import json import re json_blocks = re.findall(r'\{[^{}]*\}', text) for block in json_blocks: try: data = json.loads(block) print(data) except json.JSONDecodeError: pass