mineru-pdf-extractor
Extract PDF content to Markdown using MinerU API. Supports formulas, tables, OCR. Provides both local file and online URL parsing methods.
Install by cloning the skills repository:

```
git clone https://github.com/openclaw/skills
```

One-liner install into the Claude skills directory:

```
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/a-i-r/mineru-pdf-extractor" ~/.claude/skills/openclaw-skills-mineru-pdf-extractor && rm -rf "$T"
```

One-liner install into the OpenClaw skills directory:

```
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/a-i-r/mineru-pdf-extractor" ~/.openclaw/skills/openclaw-skills-mineru-pdf-extractor && rm -rf "$T"
```
skills/a-i-r/mineru-pdf-extractor/SKILL.md

MinerU PDF Extractor
Extract PDF documents to structured Markdown using the MinerU API. Supports formula recognition, table extraction, and OCR.
Note: This is a community skill, not an official MinerU product. You need to obtain your own API key from MinerU.
📁 Skill Structure
```
mineru-pdf-extractor/
├── SKILL.md                                  # English documentation
├── SKILL_zh.md                               # Chinese documentation
├── docs/                                     # Documentation
│   ├── Local_File_Parsing_Guide.md           # Local PDF parsing detailed guide (English)
│   ├── Online_URL_Parsing_Guide.md           # Online PDF parsing detailed guide (English)
│   ├── MinerU_本地文档解析完整流程.md          # Local parsing complete guide (Chinese)
│   └── MinerU_在线文档解析完整流程.md          # Online parsing complete guide (Chinese)
└── scripts/                                  # Executable scripts
    ├── local_file_step1_apply_upload_url.sh  # Local parsing Step 1
    ├── local_file_step2_upload_file.sh       # Local parsing Step 2
    ├── local_file_step3_poll_result.sh       # Local parsing Step 3
    ├── local_file_step4_download.sh          # Local parsing Step 4
    ├── online_file_step1_submit_task.sh      # Online parsing Step 1
    └── online_file_step2_poll_result.sh      # Online parsing Step 2
```
🔧 Requirements
Required Environment Variables
Scripts automatically read MinerU Token from environment variables (choose one):
```
# Option 1: Set MINERU_TOKEN
export MINERU_TOKEN="your_api_token_here"

# Option 2: Set MINERU_API_KEY
export MINERU_API_KEY="your_api_token_here"
```
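Either variable works because the scripts resolve the token with a single fallback expansion, preferring `MINERU_TOKEN`. A minimal sketch of that lookup (the `resolve_mineru_token` helper name and the values are illustrative, not part of the skill):

```shell
# resolve_mineru_token: print the API token, preferring MINERU_TOKEN
# over MINERU_API_KEY -- the same precedence the skill's scripts use.
# ${VAR:-fallback} treats an empty variable the same as an unset one.
resolve_mineru_token() {
  printf '%s\n' "${MINERU_TOKEN:-${MINERU_API_KEY:-}}"
}

# Illustrative values: only the fallback variable is set.
MINERU_TOKEN=""
MINERU_API_KEY="key-from-fallback"
resolve_mineru_token   # prints: key-from-fallback
```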
Required Command-Line Tools
- `curl` - for HTTP requests (usually pre-installed)
- `unzip` - for extracting results (usually pre-installed)
Optional Tools
- `jq` - for enhanced JSON parsing and security (recommended but not required). If not installed, scripts will use fallback methods.
- Install: `apt-get install jq` (Debian/Ubuntu) or `brew install jq` (macOS)
Optional Configuration
```
# Set API base URL (default is pre-configured)
export MINERU_BASE_URL="https://mineru.net/api/v4"
```
💡 Get Token: Visit https://mineru.net/apiManage/docs to register and obtain an API Key
📄 Feature 1: Parse Local PDF Documents
For locally stored PDF files. Requires 4 steps.
Quick Start
```
cd scripts/

# Step 1: Apply for upload URL
./local_file_step1_apply_upload_url.sh /path/to/your.pdf
# Output: BATCH_ID=xxx UPLOAD_URL=xxx

# Step 2: Upload file
./local_file_step2_upload_file.sh "$UPLOAD_URL" /path/to/your.pdf

# Step 3: Poll for results
./local_file_step3_poll_result.sh "$BATCH_ID"
# Output: FULL_ZIP_URL=xxx

# Step 4: Download results
./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/
```
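The quick start references `$BATCH_ID` and `$UPLOAD_URL` but does not show the capture step. Since each script prints `KEY=VALUE` lines, the values can be pulled out with `grep`/`cut`; a sketch, with the step-1 call simulated by `printf` so the snippet runs standalone:

```shell
# Simulated output of local_file_step1_apply_upload_url.sh; in real use:
#   step1_output=$(./local_file_step1_apply_upload_url.sh /path/to/your.pdf)
step1_output=$(printf 'BATCH_ID=abc123\nUPLOAD_URL=https://example.com/put?sig=x\n')

# -f2- (not -f2) keeps everything after the first '=', so URLs that
# themselves contain '=' (query strings) survive intact.
BATCH_ID=$(printf '%s\n' "$step1_output" | grep '^BATCH_ID=' | cut -d= -f2-)
UPLOAD_URL=$(printf '%s\n' "$step1_output" | grep '^UPLOAD_URL=' | cut -d= -f2-)

echo "$BATCH_ID"     # abc123
echo "$UPLOAD_URL"   # https://example.com/put?sig=x
```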
Script Descriptions
local_file_step1_apply_upload_url.sh
Apply for upload URL and batch_id.
Usage:
```
./local_file_step1_apply_upload_url.sh <pdf_file_path> [language] [layout_model]
```
Parameters:
- `language`: `ch` (Chinese), `en` (English), `auto` (auto-detect); default: `ch`
- `layout_model`: `doclayout_yolo` (fast), `layoutlmv3` (accurate); default: `doclayout_yolo`
Output:
```
BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...
```
local_file_step2_upload_file.sh
Upload PDF file to the presigned URL.
Usage:
```
./local_file_step2_upload_file.sh <upload_url> <pdf_file_path>
```
local_file_step3_poll_result.sh
Poll extraction results until completion or failure.
Usage:
```
./local_file_step3_poll_result.sh <batch_id> [max_retries] [retry_interval_seconds]
```
Output:
```
FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip
```
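Step 3 is essentially a bounded retry loop: check status, sleep, try again, give up after `max_retries`. The pattern can be sketched as follows, with a hypothetical `check_batch_status` function standing in for the real call to the MinerU API:

```shell
# Bounded polling pattern: retry a status check up to max_retries times,
# sleeping retry_interval seconds between attempts.
poll_until_done() {
  max_retries=$1
  retry_interval=$2
  i=0
  while [ "$i" -lt "$max_retries" ]; do
    status=$(check_batch_status)
    case "$status" in
      done)   echo "done";   return 0 ;;
      failed) echo "failed"; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$retry_interval"
  done
  echo "timeout"
  return 1
}

# Simulated status source standing in for the real API call: reports
# "running" twice, then "done". A counter file is used because command
# substitution runs the function in a subshell.
count_file=$(mktemp)
echo 0 > "$count_file"
check_batch_status() {
  n=$(($(cat "$count_file") + 1))
  echo "$n" > "$count_file"
  if [ "$n" -ge 3 ]; then echo "done"; else echo "running"; fi
}

RESULT=$(poll_until_done 10 0)
echo "$RESULT"   # done
```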
local_file_step4_download.sh
Download result ZIP and extract.
Usage:
```
./local_file_step4_download.sh <zip_url> [output_zip_filename] [extract_directory_name]
```
Output Structure:
```
extracted/
├── full.md              # 📄 Markdown document (main result)
├── images/              # 🖼️ Extracted images
├── content_list.json    # Structured content
└── layout.json          # Layout analysis data
```
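After step 4 it can be worth sanity-checking the output directory before using it downstream. This sketch is not part of the skill (`check_extraction` is a hypothetical helper); it verifies `full.md` exists and counts extracted images, demonstrated against a mock directory:

```shell
# Sanity-check an extraction directory: confirm full.md exists and
# report how many images were extracted.
check_extraction() {
  dir=$1
  if [ ! -f "$dir/full.md" ]; then
    echo "missing full.md"
    return 1
  fi
  lines=$(wc -l < "$dir/full.md" | tr -d ' ')
  images=$(find "$dir/images" -type f 2>/dev/null | wc -l | tr -d ' ')
  echo "ok: $lines lines of Markdown, $images image(s)"
}

# Demo against a mock extraction directory:
demo=$(mktemp -d)
mkdir -p "$demo/images"
printf '# Title\n\nBody text.\n' > "$demo/full.md"
: > "$demo/images/fig1.png"
check_extraction "$demo"   # ok: 3 lines of Markdown, 1 image(s)
```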
Detailed Documentation
📚 Complete Guide: See `docs/Local_File_Parsing_Guide.md`
🌐 Feature 2: Parse Online PDF Documents (URL Method)
For PDF files already available online (e.g., on arXiv or other websites). This method needs only two steps, making it simpler and faster.
Quick Start
```
cd scripts/

# Step 1: Submit parsing task (provide URL directly)
./online_file_step1_submit_task.sh "https://arxiv.org/pdf/2410.17247.pdf"
# Output: TASK_ID=xxx

# Step 2: Poll results and auto-download/extract
./online_file_step2_poll_result.sh "$TASK_ID" extracted/
```
Script Descriptions
online_file_step1_submit_task.sh
Submit parsing task for online PDF.
Usage:
```
./online_file_step1_submit_task.sh <pdf_url> [language] [layout_model]
```
Parameters:
- `pdf_url`: complete URL of the online PDF (required)
- `language`: `ch` (Chinese), `en` (English), `auto` (auto-detect); default: `ch`
- `layout_model`: `doclayout_yolo` (fast), `layoutlmv3` (accurate); default: `doclayout_yolo`
Output:
```
TASK_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```
online_file_step2_poll_result.sh
Poll extraction results, automatically download and extract when complete.
Usage:
```
./online_file_step2_poll_result.sh <task_id> [output_directory] [max_retries] [retry_interval_seconds]
```
Output Structure:
```
extracted/
├── full.md              # 📄 Markdown document (main result)
├── images/              # 🖼️ Extracted images
├── content_list.json    # Structured content
└── layout.json          # Layout analysis data
```
Detailed Documentation
📚 Complete Guide: See `docs/Online_URL_Parsing_Guide.md`
📊 Comparison of Two Parsing Methods
| Feature | Local PDF Parsing | Online PDF Parsing |
|---|---|---|
| Steps | 4 steps | 2 steps |
| Upload Required | ✅ Yes | ❌ No |
| Average Time | 30-60 seconds | 10-20 seconds |
| Use Case | Local files | Files already online (arXiv, websites, etc.) |
| File Size Limit | 200MB | Limited by source server |
⚙️ Advanced Usage
Batch Process Local Files
```
for pdf in /path/to/pdfs/*.pdf; do
  echo "Processing: $pdf"

  # Step 1
  result=$(./local_file_step1_apply_upload_url.sh "$pdf" 2>&1)
  batch_id=$(echo "$result" | grep BATCH_ID | cut -d= -f2)
  upload_url=$(echo "$result" | grep UPLOAD_URL | cut -d= -f2-)

  # Step 2
  ./local_file_step2_upload_file.sh "$upload_url" "$pdf"

  # Step 3
  zip_url=$(./local_file_step3_poll_result.sh "$batch_id" | grep FULL_ZIP_URL | cut -d= -f2-)

  # Step 4
  filename=$(basename "$pdf" .pdf)
  ./local_file_step4_download.sh "$zip_url" "${filename}.zip" "${filename}_extracted"
done
```
Batch Process Online Files
```
for url in \
  "https://arxiv.org/pdf/2410.17247.pdf" \
  "https://arxiv.org/pdf/2409.12345.pdf"; do
  echo "Processing: $url"

  # Step 1
  result=$(./online_file_step1_submit_task.sh "$url" 2>&1)
  task_id=$(echo "$result" | grep TASK_ID | cut -d= -f2)

  # Step 2
  filename=$(basename "$url" .pdf)
  ./online_file_step2_poll_result.sh "$task_id" "${filename}_extracted"
done
```
⚠️ Notes
- Token Configuration: Scripts prioritize `MINERU_TOKEN` and fall back to `MINERU_API_KEY` if it is not found
- Token Security: Do not hard-code tokens in scripts; use environment variables
- URL Accessibility: For online parsing, ensure the provided URL is publicly accessible
- File Limits: A single file should not exceed 200MB or 600 pages
- Network Stability: Ensure stable network when uploading large files
- Security: This skill includes input validation and sanitization to prevent JSON injection and directory traversal attacks
- Optional jq: Installing `jq` provides enhanced JSON parsing and additional security checks
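The 200MB recommendation above can be enforced before step 1 with a small preflight check. A sketch (the `within_size_limit` helper is hypothetical; the skill's own scripts may validate differently), using `wc -c` for portability since GNU and BSD `stat` take different flags:

```shell
# Preflight check against the 200MB recommendation: succeeds (exit 0)
# when the file is within the limit.
MAX_BYTES=$((200 * 1024 * 1024))

within_size_limit() {
  size=$(wc -c < "$1" | tr -d ' ')
  [ "$size" -le "$MAX_BYTES" ]
}

# Demo with a tiny temporary stand-in for a PDF:
tmp_pdf=$(mktemp)
printf '%%PDF-1.4 stub' > "$tmp_pdf"
if within_size_limit "$tmp_pdf"; then
  echo "within limit"
else
  echo "too large"
fi
```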
📚 Reference Documentation
| Document | Description |
|---|---|
| `docs/Local_File_Parsing_Guide.md` | Detailed curl commands and parameters for local PDF parsing |
| `docs/Online_URL_Parsing_Guide.md` | Detailed curl commands and parameters for online PDF parsing |
External Resources:
- 🏠 MinerU Official: https://mineru.net/
- 📖 API Documentation: https://mineru.net/apiManage/docs
- 💻 GitHub Repository: https://github.com/opendatalab/MinerU
Skill Version: 1.0.0
Release Date: 2026-02-18
Community Skill - Not affiliated with MinerU official