# Skillshub mineru-extract

Use the official MinerU (mineru.net) parsing API to convert a URL (HTML pages like WeChat articles, or direct PDF/Office/image links) into clean Markdown plus structured outputs. Use it when `web_fetch`/browser can't access the page or extracts messy content, and you want higher-fidelity parsing (layout/table/formula/OCR).
Clone the repo:

```sh
git clone https://github.com/ComeOnOliver/skillshub
```

Install into `~/.claude/skills`:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/blessonism/openclaw-search-skills/mineru-extract" ~/.claude/skills/comeonoliver-skillshub-mineru-extract && rm -rf "$T"
```

Install into `~/.openclaw/skills`:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/blessonism/openclaw-search-skills/mineru-extract" ~/.openclaw/skills/comeonoliver-skillshub-mineru-extract && rm -rf "$T"
```
`skills/blessonism/openclaw-search-skills/mineru-extract/SKILL.md`

## MinerU Extract (official API)
Use MinerU as an upstream “content normalizer”: submit a URL to MinerU, poll for completion, download the result zip, and extract the main Markdown.
## Quick start (MCP-aligned)
We align to the MinerU MCP mental model, but we do not run an MCP server.
- Primary script (MCP-style): `scripts/mineru_parse_documents.py`
  - Input: `--file-sources` (comma/newline-separated)
  - Output: JSON contract on stdout: `{ ok, items, errors }`
- Low-level script (single URL): `scripts/mineru_extract.py`
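A minimal sketch of consuming the wrapper's stdout contract. Only the top-level keys (`ok`, `items`, `errors`) come from this doc; the per-item fields in the sample are assumptions for illustration:

```python
import json

# Hypothetical consumer of the {ok, items, errors} stdout contract.
# The top-level keys are documented; per-item fields are assumptions.
def summarize(report: dict) -> str:
    if not report.get("ok"):
        return f"failed with {len(report.get('errors', []))} error(s)"
    return f"parsed {len(report.get('items', []))} document(s)"

sample = json.loads('{"ok": true, "items": [{"url": "https://example.com"}], "errors": []}')
print(summarize(sample))  # parsed 1 document(s)
```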
Auth:

- Set `MINERU_TOKEN` (Bearer token from mineru.net)
Default model heuristic:

- URLs ending with `.pdf`/`.doc`/`.ppt`/`.png`/`.jpg` → `pipeline`
- Otherwise → `MinerU-HTML` (best for HTML pages like WeChat articles)
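The heuristic above can be sketched as a small function. The extension list mirrors this doc; the real scripts may check more types or inspect the response `Content-Type` as well:

```python
from urllib.parse import urlparse

# Extensions from the documented heuristic; a sketch, not the scripts' actual list.
BINARY_EXTS = (".pdf", ".doc", ".ppt", ".png", ".jpg")

def pick_model(url: str) -> str:
    # Compare against the URL path only, so query strings don't confuse the match.
    path = urlparse(url).path.lower()
    return "pipeline" if path.endswith(BINARY_EXTS) else "MinerU-HTML"

print(pick_model("https://example.com/paper.pdf"))      # pipeline
print(pick_model("https://mp.weixin.qq.com/s/abc123"))  # MinerU-HTML
```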
## 1) Configure token (skill-local)
Put secrets in the skill root `.env` (do not paste into chat outputs):

```sh
# In the mineru-extract skill directory: .env
MINERU_TOKEN=your_token_here
MINERU_API_BASE=https://mineru.net
```
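Loading a skill-local `.env` can be sketched without external dependencies. This is an assumption about the approach — the actual scripts may use python-dotenv or their own parser:

```python
# Minimal .env parser (sketch only): skips blanks, comments, and malformed lines,
# and splits each remaining line at the first "=".
def parse_env(text: str) -> dict:
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# Usage: read the skill-local .env, fall back to the process environment.
# import os
# from pathlib import Path
# env = parse_env(Path("mineru-extract/.env").read_text())
# token = env.get("MINERU_TOKEN") or os.environ.get("MINERU_TOKEN", "")
```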
## 2) Parse URL(s) → Markdown (recommended)
MCP-style wrapper (returns JSON, optionally includes markdown text):

```sh
python3 mineru-extract/scripts/mineru_parse_documents.py \
  --file-sources "<URL1>\n<URL2>" \
  --language ch \
  --enable-ocr \
  --model-version MinerU-HTML
```
If you want the markdown content inline in the JSON (can be large):

```sh
python3 mineru-extract/scripts/mineru_parse_documents.py \
  --file-sources "<URL>" \
  --model-version MinerU-HTML \
  --emit-markdown --max-chars 20000
```
Low-level (single URL, print markdown to stdout):

```sh
python3 mineru-extract/scripts/mineru_extract.py "<URL>" --model MinerU-HTML --print > /tmp/out.md
```
## Output
The script always downloads and extracts the MinerU result zip to `~/.openclaw/workspace/mineru/<task_id>/`.
It writes:

- `result.zip`
- extracted files (Markdown + JSON + assets)
It prints a JSON summary to stderr with paths: `task_id`, `full_zip_url`, `out_dir`, `markdown_path`.
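A hypothetical wrapper can capture that stderr summary and read the extracted Markdown. Treating the last stderr line as the JSON summary is an assumption — the doc only says the summary goes to stderr:

```python
import json
import subprocess
import sys

def parse_summary(stderr_text: str) -> dict:
    # Assumption: the JSON summary is the last non-empty line on stderr.
    return json.loads(stderr_text.strip().splitlines()[-1])

def extract_markdown(url: str) -> str:
    # Run the low-level script and read markdown_path from its summary.
    proc = subprocess.run(
        [sys.executable, "mineru-extract/scripts/mineru_extract.py",
         url, "--model", "MinerU-HTML"],
        capture_output=True, text=True, check=True,
    )
    summary = parse_summary(proc.stderr)
    with open(summary["markdown_path"], encoding="utf-8") as f:
        return f.read()
```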
## Parameters (common)
- `--model`: `pipeline | vlm | MinerU-HTML` (HTML requires `MinerU-HTML`)
- `--ocr`/`--no-ocr`: enable OCR (effective for `pipeline`/`vlm`)
- `--table`/`--no-table`: table recognition
- `--formula`/`--no-formula`: formula recognition
- `--language ch|en|...`
- `--page-ranges "2,4-6"` (non-HTML)
- `--timeout 600` / `--poll-interval 2`
## Failure modes &amp; fallbacks
- MinerU may fail to fetch some URLs (anti-bot / geo / login walls).
- Fallback: provide an HTML file or a PDF/long screenshot, then implement the "upload + parse" flow with the MinerU batch-upload endpoints.
- Always report the failing URL plus the MinerU `err_msg`, and keep an original-source link in outputs.
## References
- MinerU API docs: https://mineru.net/apiManage/docs
- MinerU output files: https://opendatalab.github.io/MinerU/reference/output_files/