# PDF Miner Skill

`skills/baichenwzj/pdf-miner/SKILL.md`

Extract text and tables from PDF files using pdfplumber (global market formats).

## Install

```bash
# Clone the repository
git clone https://github.com/openclaw/skills

# Or install directly into Claude Code's skills directory
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/baichenwzj/pdf-miner" ~/.claude/skills/openclaw-skills-pdf-miner && rm -rf "$T"

# Or into OpenClaw's skills directory
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/baichenwzj/pdf-miner" ~/.openclaw/skills/openclaw-skills-pdf-miner && rm -rf "$T"
```
## Prerequisites

```bash
python -m pip install pdfplumber
```

For OCR capabilities (scanned/image PDFs), also install:

```bash
python -m pip install pymupdf openai
```
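Under the hood, extraction follows the usual pdfplumber pattern: open the PDF, walk the pages, and pull text and tables from each. A minimal sketch of that pattern (the helper names here are illustrative, not the script's actual functions):

```python
def table_to_markdown(rows):
    """Render a pdfplumber table (a list of row lists) as a Markdown table."""
    if not rows:
        return ""
    header, *body = rows
    out = ["| " + " | ".join(str(c or "") for c in header) + " |"]
    out.append("|" + "---|" * len(header))
    for row in body:
        out.append("| " + " | ".join(str(c or "") for c in row) + " |")
    return "\n".join(out)

def extract(path):
    """Extract text and tables page by page into one Markdown string."""
    import pdfplumber  # third-party: python -m pip install pdfplumber
    parts = []
    with pdfplumber.open(path) as pdf:
        for i, page in enumerate(pdf.pages, 1):
            parts.append(f"## Page {i}\n\n{page.extract_text() or ''}")
            for rows in page.extract_tables():
                parts.append(table_to_markdown(rows))
    return "\n\n".join(parts)
```

Usage: `markdown = extract("input.pdf")`.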
## Initial Setup for OCR

Before using `--ocr`, you must provide a vision API credential. There are three ways:

1. **Environment variables** (recommended for temporary use):

   ```bash
   export OCR_API_KEY="your-openrouter-api-key"
   export OCR_MODEL="qwen/qwen3.6-plus:free"          # optional
   export OCR_BASE_URL="https://openrouter.ai/api/v1" # optional
   ```

2. **Config file** (persistent, skill-specific): create `skills/skills/pdf-miner/config.json` with:

   ```json
   {
     "vision_api_key": "your-openrouter-api-key",
     "vision_model": "qwen/qwen3.6-plus:free",
     "vision_base_url": "https://openrouter.ai/api/v1"
   }
   ```

3. **Command-line arguments** (override per invocation):

   ```bash
   python scripts/extract_pdf.py scanned.pdf --ocr --ocr-api-key "sk-..." --ocr-model "stepfun/step-3.5-flash:free"
   ```
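The three credential sources above are resolved in a fixed order: CLI argument first, then environment variable, then config file, then a hardcoded default. A sketch of how such a resolution chain might look; `resolve_ocr_setting` is a hypothetical helper, not a function the script exposes:

```python
import json
import os
from pathlib import Path

def resolve_ocr_setting(cli_value, env_var, config_key, default=None,
                        config_path=Path("config.json")):
    """Resolve one OCR setting: CLI arg > env var > config.json > default."""
    if cli_value is not None:
        return cli_value
    if os.environ.get(env_var):
        return os.environ[env_var]
    if config_path.is_file():
        value = json.loads(config_path.read_text()).get(config_key)
        if value:
            return value
    return default
```

For example, the API key would resolve as `resolve_ocr_setting(args.ocr_api_key, "OCR_API_KEY", "vision_api_key")`.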
## Usage

Run commands from this skill directory.
### Basic Extraction

```bash
# Full extraction (text + tables)
python scripts/extract_pdf.py input.pdf

# Output to custom path
python scripts/extract_pdf.py input.pdf output.md

# Specific pages
python scripts/extract_pdf.py input.pdf --pages 1-5,10,15-20

# Text or tables only
python scripts/extract_pdf.py input.pdf --text-only
python scripts/extract_pdf.py input.pdf --tables-only
python scripts/extract_pdf.py input.pdf --tables-only --json
```
### Advanced Modes

```bash
# Search: find pages containing keywords with context
python scripts/extract_pdf.py report.pdf --search "Vietnam export penetration"

# Metrics: extract lines with keywords + numeric values
python scripts/extract_pdf.py report.pdf --metrics "market size growth export penetration"

# TOC: extract table of contents / chapter structure (robust, multi-format)
python scripts/extract_pdf.py report.pdf --toc
# Optionally adjust sensitivity (default: 3 entries per page required)
python scripts/extract_pdf.py report.pdf --toc --toc-min-entries 2

# Diff: compare two PDFs, show pages unique to each
python scripts/extract_pdf.py old_version.pdf new_version.pdf --diff

# Chunk: split output into LLM-friendly chunks
python scripts/extract_pdf.py report.pdf --chunk                        # single file, 8000 chars each
python scripts/extract_pdf.py report.pdf --chunk --max-chars 4000
python scripts/extract_pdf.py report.pdf --chunk --output-dir ./chunks  # separate files

# Clean headers/footers
python scripts/extract_pdf.py report.pdf --clean-headers

# Batch: process multiple PDFs
python scripts/extract_pdf.py file1.pdf file2.pdf file3.pdf --output-dir ./extracted
```
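For illustration, the chunk mode's job (splitting extracted text into pieces of at most `--max-chars` characters, preferring paragraph boundaries) can be sketched roughly as follows; `chunk_text` is a hypothetical helper, not the script's actual implementation:

```python
def chunk_text(text, max_chars=8000):
    """Split text into chunks of at most max_chars characters,
    breaking on blank-line paragraph boundaries where possible."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single paragraph longer than max_chars is hard-split.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks
```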
### OCR for Scanned/Image PDFs (Automatic by Default)

OCR is automatically triggered for pages with very little extractable text (default threshold: 100 characters). This helps handle scanned or image-based PDFs without requiring the `--ocr` flag.
#### Usage Examples

```bash
# Automatic OCR (default behavior)
python scripts/extract_pdf.py scanned.pdf

# Force OCR on all pages (ignore text length)
python scripts/extract_pdf.py scanned.pdf --ocr

# Force OCR only on specific pages
python scripts/extract_pdf.py scanned.pdf --ocr --ocr-pages 1-5,10

# Adjust OCR quality (DPI)
python scripts/extract_pdf.py scanned.pdf --ocr --ocr-dpi 300

# Use a different vision model
python scripts/extract_pdf.py scanned.pdf --ocr --ocr-model "stepfun/step-3.5-flash:free"

# Disable automatic OCR detection (if you want pure extraction only)
python scripts/extract_pdf.py file.pdf --no-auto-ocr

# Change the low-text threshold (default 100 chars)
python scripts/extract_pdf.py file.pdf --ocr-threshold 200
```
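The auto-detection behind these flags amounts to a simple per-page check: if a page's extracted text falls below `--ocr-threshold`, it is queued for OCR. A sketch of that selection logic, with `pages_needing_ocr` as a hypothetical helper:

```python
def pages_needing_ocr(page_texts, threshold=100, force=False):
    """Return the 1-based numbers of pages whose extracted text is
    below the low-text threshold (or all pages when force=True,
    mirroring the --ocr flag)."""
    if force:
        return list(range(1, len(page_texts) + 1))
    return [i for i, text in enumerate(page_texts, 1)
            if len((text or "").strip()) < threshold]
```

Each selected page would then be rendered to an image at `--ocr-dpi` and sent to the configured vision model.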
#### Configuration

OCR requires a vision API key. See Initial Setup for OCR.

| Option | Default | Description |
|---|---|---|
| `--ocr` | off | Force OCR on all pages (or on specific pages with `--ocr-pages`) |
| auto-OCR | on | Automatically OCR low-text pages (use `--no-auto-ocr` to disable) |
| `--no-auto-ocr` | - | Disable automatic OCR detection |
| `--ocr-pages` | - | Comma-separated pages/ranges to OCR (requires `--ocr`) |
| `--ocr-threshold` | 100 | Minimum text length (characters) to consider a page "sufficient" |
| `--ocr-dpi` | 200 | Image DPI for OCR rendering |
| `--ocr-api-key` | from env/config | Override API key |
| `--ocr-base-url` | from env/config | Override API base URL |
| `--ocr-model` | from env/config | Override vision model |
## Troubleshooting

**OCR fails with "No API key"** → Configure your API key in `config.json` or via the `OCR_API_KEY` env var.

**OCR model rejects images** → The configured model might not support vision. Choose a vision-capable model (e.g., `qwen/qwen3.6-plus:free`, `stepfun/step-3.5-flash:free`). The script will attempt to fall back automatically to a known-good model if the configured one lacks vision support.

**Too many pages being OCR'd** → Increase the threshold (`--ocr-threshold 300`), or pass `--no-auto-ocr` and selectively use `--ocr-pages`.

**Rate limit errors** → Reduce concurrent OCR calls, switch to a paid model tier, or try a different provider.
## Configuration Reference

| Option | Default | Source |
|---|---|---|
| `--ocr-api-key` | (none) | `OCR_API_KEY` env or `config.json` (`vision_api_key`) |
| `--ocr-model` | `qwen/qwen3.6-plus:free` | `OCR_MODEL` env or `config.json` (`vision_model`) |
| `--ocr-base-url` | `https://openrouter.ai/api/v1` | `OCR_BASE_URL` env or `config.json` (`vision_base_url`) |

Precedence: CLI argument > environment variable > `config.json` > hardcoded default.
## Tool Comparison

| Tool | Global text | Tables | Search | Metrics | TOC | Diff | Chunk |
|---|---|---|---|---|---|---|---|
| web_fetch | ❌ | - | ❌ | ❌ | ❌ | ❌ | ❌ |
| scrapling | ❌ | - | ❌ | ❌ | ❌ | ❌ | ❌ |
| pypdf | ⚠️ garbled | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
| pdfplumber | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
## Modes Reference

| Mode | Flag | What it does |
|---|---|---|
| Full | (default) | Extract all text + tables, page by page |
| Search | `--search` | Find pages with keywords, show ±N lines context (default 5) |
| Metrics | `--metrics` | Extract lines with keywords AND numeric data |
| TOC | `--toc` | Detect table of contents / chapter structure (robust multi-format) |
| | `--toc-min-entries` | Minimum TOC entries per page to trust detection (default: 3) |
| Diff | `--diff` | Compare two PDFs, show matched vs unique pages |
| Chunk | `--chunk` | Split into LLM-friendly pieces (`--max-chars`) |
| Clean | `--clean-headers` | Auto-detect and remove repeated header/footer lines |
| Batch | multiple inputs | Process multiple PDFs, output to `--output-dir` |
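The Clean mode's header/footer detection can be approximated by counting how often each page's first and last lines repeat across the document: lines that recur on most pages are almost certainly running headers or footers. A sketch of that idea (`detect_repeated_lines` is a hypothetical helper, and the 60% cutoff is an assumption, not the script's actual tuning):

```python
from collections import Counter

def detect_repeated_lines(pages, min_fraction=0.6):
    """Return first/last lines that repeat on at least min_fraction of
    pages; these are likely running headers or footers to strip."""
    counts = Counter()
    for text in pages:
        lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
        if not lines:
            continue
        counts.update({lines[0], lines[-1]})  # set: count once per page
    cutoff = max(2, int(len(pages) * min_fraction))
    return {line for line, n in counts.items() if n >= cutoff}
```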
## Output Options

| Flag | Effect |
|---|---|
| `--output-dir` | Output to specified directory |
| `--output-dir` (with `--chunk`) | Each chunk as separate file |
| | Context lines around search matches (default 5) |
| `--max-chars` | Chunk size (default 8000) |
| | Manually specify header/footer lines to remove |
## Workflow

1. **Download PDF** (if URL):

   ```python
   import urllib.request
   urllib.request.urlretrieve(url, "report.pdf")
   ```

2. **Extract.** Run from this skill directory:

   ```bash
   cd <skill-directory>
   python scripts/extract_pdf.py /path/to/report.pdf [options]
   ```

3. **Read & Answer.** Read the output `.md` file and answer based on the extracted content.

4. **Clean Up.** Delete temporary PDF and `.md` files when done.
## Limitations

- **Scanned/image-based PDFs:** Cannot extract text without OCR. Install OCR dependencies and configure an API key.
- **Embedded charts/graphs:** Only text labels are extracted, not chart data.
- **Multi-column layouts:** Use the `--layout` flag for improved reading order via `x_tolerance`.
- **TOC detection:** Robust multi-format matching with validation; very non-standard layouts may still require manual extraction.
- **Diff:** Uses text similarity (Jaccard on normalized lines), not page numbers. Threshold adjustable via `--diff-threshold N` (default 0.8).
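The diff similarity described above (Jaccard on normalized lines) can be sketched as follows; `jaccard` and `unique_pages` are illustrative helpers, and the whitespace/case normalization shown is an assumption about what "normalized" means here:

```python
def jaccard(page_a, page_b):
    """Jaccard similarity between two pages' normalized line sets."""
    def norm(text):
        return {" ".join(ln.split()).lower()
                for ln in text.splitlines() if ln.strip()}
    a, b = norm(page_a), norm(page_b)
    if not a and not b:
        return 1.0  # two blank pages are considered identical
    return len(a & b) / len(a | b)

def unique_pages(pages_a, pages_b, threshold=0.8):
    """1-based pages of A with no sufficiently similar page in B."""
    return [i for i, pa in enumerate(pages_a, 1)
            if all(jaccard(pa, pb) < threshold for pb in pages_b)]
```

A lower `threshold` treats loosely similar pages as matches; a higher one reports more pages as unique.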