Openclaw-skills content-extract

Robust URL-to-Markdown extraction for OpenClaw workflows. Use when the user wants to "extract/summarize/convert a webpage to markdown" (especially WeChat mp.weixin.qq.com) and web_fetch/browser is blocked or messy. Uses a cheap probe via web_fetch first, then falls back to the official MinerU API (via the local mineru-extract skill) and returns a traceable result contract with source links.

install

source · Clone the upstream repo

git clone https://github.com/blessonism/openclaw-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/blessonism/openclaw-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/content-extract" ~/.claude/skills/blessonism-openclaw-skills-content-extract && rm -rf "$T"

OpenClaw · Install into ~/.openclaw/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/blessonism/openclaw-skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/content-extract" ~/.openclaw/skills/blessonism-openclaw-skills-content-extract && rm -rf "$T"

manifest: content-extract/SKILL.md

source content

content-extract: the top-level content extraction entry point (aligned with MCP semantics, but without running an MCP server)

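Goal: turn "give me a URL → get readable Markdown plus a traceable entry point" into a single unified entry point that all downstream business skills (github-explorer, writing skills, daily reports, etc.) can reuse.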

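Core principles (inspired by the Excel Skill teardown article you shared):

Behavioral contract layer: always provide traceable entry points (the original URL plus the path or link to the parsed artifacts); never fabricate sources.
Token probe: run a low-cost probe first to decide whether the page can be fetched directly; only if it cannot, fall back to heavy parsing (MinerU).
Rebound mechanism: on failure, return a suggested next action rather than a pile of exception stack traces.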

Workflow (Decision Tree)

Input:

url

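Domain Whitelist (skip the probe): if the URL belongs to a site that is likely to use anti-scraping measures or dynamic rendering (WeChat, Zhihu, etc.), go straight to MinerU.

Whitelist file: `references/domain-whitelist.md`
For URLs that match the whitelist: force `model_version=MinerU-HTML`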
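The whitelist check can be sketched as a hostname comparison. This is a minimal illustration, not the skill's actual code: the function name `is_whitelisted` and the two sample domains are assumptions (the real list lives in `references/domain-whitelist.md`, whose format is not shown here).

```python
from urllib.parse import urlparse

# Hypothetical sample entries, taken from the "WeChat, Zhihu" examples above.
# The authoritative list is references/domain-whitelist.md.
WHITELIST = frozenset({"mp.weixin.qq.com", "zhihu.com"})


def is_whitelisted(url, whitelist=WHITELIST):
    """Return True if the URL's host equals, or is a subdomain of, a listed domain."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in whitelist)
```

Matching on the parsed hostname rather than on the raw string avoids false positives when a whitelisted domain merely appears in the path or query of an unrelated URL.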

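0.5) GitHub Fast Path (skip both the probe and MinerU): if the URL matches the pattern `github.com/{owner}/{repo}`, go straight to the GitHub API.

GitHub repo pages are SPAs (client-side rendered), so web_fetch only retrieves the navigation shell.
Use the GitHub API to fetch the README, repo metadata, file tree, Issues, and so on.
See the GitHub fast path section in `references/heuristics.md` for details.
Auth: `Authorization: token {GITHUB_PAT}` (see TOOLS.md)
The returned result still follows the unified Result Contract (`engine: "github-api"`).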
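As a rough sketch of the fast path, the snippet below recognizes a repo root URL and prepares the GitHub REST API requests without sending them. The function name `github_api_plan`, the regular expression, and the choice of endpoints are assumptions for illustration; the skill's real rules are in `references/heuristics.md`.

```python
import re

# Matches only repo root pages such as https://github.com/{owner}/{repo};
# deeper paths (issues, blobs) are deliberately left to the normal workflow.
_REPO_RE = re.compile(
    r"^https?://(?:www\.)?github\.com/([^/\s]+)/([^/\s#?]+?)(?:\.git)?/?(?:[#?].*)?$"
)


def github_api_plan(url, pat=None):
    """Return the API endpoints and headers for a repo URL, or None if it does not match."""
    match = _REPO_RE.match(url.strip())
    if match is None:
        return None
    owner, repo = match.groups()
    base = f"https://api.github.com/repos/{owner}/{repo}"
    headers = {"Accept": "application/vnd.github+json"}
    if pat:
        # Same header form as the Auth line above.
        headers["Authorization"] = f"token {pat}"
    return {
        "engine": "github-api",
        "endpoints": {
            "metadata": base,
            "readme": f"{base}/readme",
            "contents": f"{base}/contents",  # top-level file listing
            "issues": f"{base}/issues",
        },
        "headers": headers,
    }
```

Returning a plan instead of performing the requests keeps the matching logic testable offline; the caller decides which endpoints to actually fetch.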

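Probe (low cost): try `web_fetch(url)` first.

Goal: get the body text as Markdown (cheap and fast).
Conditions that count as "failed or unacceptable" (see `references/heuristics.md`) include:
- 403/401 or anti-scraping responses
- Only notices such as "abnormal environment", "CAPTCHA", or "please open in WeChat"
- Extremely short content, an obvious navigation page, or missing body text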
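The failure conditions listed above could be checked roughly as follows. This is a sketch under stated assumptions: the marker strings, both character thresholds, and the function name `probe_failure_reason` are illustrative placeholders, and the real heuristics are defined in `references/heuristics.md`.

```python
from typing import Optional

# Illustrative block-page phrases (assumed, not taken from heuristics.md).
BLOCK_MARKERS = ("环境异常", "验证码", "请在微信", "captcha")
NOTICE_MAX_CHARS = 1500  # hypothetical: block notices are short pages
MIN_CHARS = 300          # hypothetical: anything shorter is not a real article


def probe_failure_reason(status_code: int, text: str) -> Optional[str]:
    """Return a short reason if the probe result is unusable, else None."""
    if status_code in (401, 403):
        return f"http_{status_code}"
    body = text.strip()
    # Only treat markers as a block when the page is short, so that a long
    # article that merely mentions a CAPTCHA is not rejected.
    if len(body) < NOTICE_MAX_CHARS:
        lowered = body.lower()
        for marker in BLOCK_MARKERS:
            if marker in lowered:
                return f"blocked_notice:{marker}"
    if len(body) < MIN_CHARS:
        return "too_short"
    return None
```

A `None` return means the probe output can be used as is; any other value is a candidate entry for the result's `notes` before falling back to MinerU.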

Fallback (high fidelity): use the official MinerU API.

Call the downstream driver:

skills/mineru-extract/scripts/mineru_parse_documents.py

For HTML pages (WeChat, etc.): force `model_version=MinerU-HTML`
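Taken together, the steps of the decision tree amount to a fixed routing order. The sketch below shows only that order; `choose_engine` and its parameters are hypothetical names, and the probe is passed as a callable so that it is never invoked for whitelisted or GitHub URLs.

```python
from typing import Callable


def choose_engine(whitelisted: bool, is_github_repo: bool,
                  probe_ok: Callable[[], bool]) -> str:
    """Pick the engine in the order: whitelist, GitHub fast path, probe, fallback."""
    if whitelisted:
        return "mineru"        # skip the probe entirely
    if is_github_repo:
        return "github-api"    # skip both the probe and MinerU
    # The cheap probe runs only when neither shortcut applies.
    return "web_fetch" if probe_ok() else "mineru"
```

For example, `choose_engine(False, False, lambda: False)` yields `"mineru"`, the high-fidelity fallback after a failed probe.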

Output the unified Result Contract

Whether the probe or MinerU is used, the same structure is returned:

{
  "ok": true,
  "source_url": "...",
  "engine": "web_fetch",
  "markdown": "...",
  "artifacts": {
    "out_dir": "...",
    "markdown_path": "...",
    "zip_path": "..."
  },
  "sources": [
    "original URL",
    "(if MinerU was used) MinerU full_zip_url",
    "(if MinerU was used) local markdown_path"
  ],
  "notes": ["any important limitations, failure reasons, or suggested next steps"]
}

Note: `engine` may be `web_fetch`, `mineru`, or `github-api`.
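One way to keep every code path on the same structure is to build results through a single helper. The following is a minimal sketch, assuming a hypothetical `build_result` function and an `extra_sources` parameter for the MinerU entries; only the field names come from the contract above.

```python
ENGINES = {"web_fetch", "mineru", "github-api"}


def build_result(ok, source_url, engine, markdown="",
                 artifacts=None, extra_sources=None, notes=None):
    """Assemble a Result Contract dict with every field always present."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine!r}")
    filled = {"out_dir": None, "markdown_path": None, "zip_path": None}
    filled.update(artifacts or {})
    return {
        "ok": bool(ok),
        "source_url": source_url,
        "engine": engine,
        "markdown": markdown,
        "artifacts": filled,
        # The original URL always comes first; parsed-artifact entries follow.
        "sources": [source_url] + list(extra_sources or []),
        "notes": list(notes or []),
    }
```

Because absent artifacts are filled with `None` rather than omitted, downstream skills can read any field without checking for its existence first.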

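MinerU invocation (a deterministic script for the agent)

When MinerU is needed, use this command (it returns JSON and can inline the Markdown into that JSON, which makes downstream summarization easier):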

python3 /home/node/.openclaw/workspace/skills/mineru-extract/scripts/mineru_parse_documents.py \
  --file-sources "<URL>" \
  --model-version MinerU-HTML \
  --emit-markdown --max-chars 20000
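If the command is launched from Python rather than a shell, a thin wrapper might look like this. It is a sketch, not the skill's code: `mineru_argv` and `run_mineru` are hypothetical names, and it assumes the driver prints its JSON to stdout and signals failure through a non-zero exit code, which the text above does not confirm.

```python
import json
import subprocess

SCRIPT = ("/home/node/.openclaw/workspace/skills/mineru-extract/"
          "scripts/mineru_parse_documents.py")


def mineru_argv(url, max_chars=20000, script=SCRIPT):
    """Build the same argument list as the shell command above."""
    return ["python3", script,
            "--file-sources", url,
            "--model-version", "MinerU-HTML",
            "--emit-markdown", "--max-chars", str(max_chars)]


def run_mineru(url, timeout=600):
    """Run the driver and parse its output (assumed: JSON on stdout)."""
    proc = subprocess.run(mineru_argv(url), capture_output=True,
                          text=True, timeout=timeout)
    if proc.returncode != 0:
        reason = proc.stderr.strip()[-500:] or f"exit code {proc.returncode}"
        return {"ok": False, "notes": [reason]}
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError:
        return {"ok": False, "notes": ["driver output was not valid JSON"]}
```

Passing the URL as a separate list element, rather than interpolating it into a shell string, avoids quoting problems with URLs that contain `&` or `?`.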

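Delivery requirements (mandatory)

The output must include `sources` (the original entry point plus the parsed-artifact entry points).
If MinerU succeeds: `markdown_path` (the local path) must be written into `sources` so the result can be re-checked.
If both paths fail: state the failure reason explicitly and give a next step (for example: ask Boss for an accessible mirror link, allow me to export the HTML through a browser relay, or fall back to parsing an uploaded HTML file).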
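These requirements are mechanical enough to check before a result is handed back. The sketch below is one possible checker; `delivery_problems` is a hypothetical name, and treating a non-empty `notes` list as "reason plus next step" is a simplifying assumption.

```python
def delivery_problems(result):
    """Return a list of violated delivery rules (empty if the result is compliant)."""
    problems = []
    sources = result.get("sources") or []
    if result.get("source_url") not in sources:
        problems.append("sources must include the original URL")
    if result.get("ok") and result.get("engine") == "mineru":
        md_path = (result.get("artifacts") or {}).get("markdown_path")
        if not md_path or md_path not in sources:
            problems.append("MinerU success must list markdown_path in sources")
    if not result.get("ok") and not result.get("notes"):
        problems.append("failure must state a reason and a next step in notes")
    return problems
```

An agent could run this as a final gate and, when the returned list is non-empty, repair the result instead of delivering it.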

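What this skill does not do

It does not run an MCP server (avoiding a resident service and its operational burden).
It does not try to bypass logins or CAPTCHAs (that is an access-layer problem; this skill handles only the parsing layer and workflow routing).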

References

MinerU API docs: https://mineru.net/apiManage/docs
MinerU output files: https://opendatalab.github.io/MinerU/reference/output_files/