xcrawl-skills · xcrawl-crawl
Use this skill for XCrawl crawl tasks, including bulk site crawling, crawler rule design, async status polling, and delivery of crawl output for downstream scrape and search workflows.
Install

Source · Clone the upstream repo:

```
git clone https://github.com/xcrawl-api/xcrawl-skills
```

Claude Code · Install into `~/.claude/skills/`:

```
T=$(mktemp -d) && git clone --depth=1 https://github.com/xcrawl-api/xcrawl-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/xcrawl-crawl" ~/.claude/skills/xcrawl-api-xcrawl-skills-xcrawl-crawl && rm -rf "$T"
```

Manifest: `skills/xcrawl-crawl/SKILL.md`
XCrawl Crawl
Overview
This skill orchestrates full-site or scoped crawling with XCrawl Crawl APIs. Default behavior is raw passthrough: return upstream API response bodies as-is.
Required Local Config
Before using this skill, the user must create a local config file and write
XCRAWL_API_KEY into it.
Path: `~/.xcrawl/config.json`

```json
{ "XCRAWL_API_KEY": "<your_api_key>" }
```
Read API key from local config file only. Do not require global environment variables.
Credits and Account Setup
Using XCrawl APIs consumes credits. If the user does not have an account or available credits, guide them to register at
https://dash.xcrawl.com/.
After registration, they can activate the free 1000 credits plan before running requests.
Tool Permission Policy
Request runtime permissions for
curl and node only.
Do not request Python, shell helper scripts, or other runtime permissions.
API Surface
- Start crawl: `POST /v1/crawl`
- Read result: `GET /v1/crawl/{crawl_id}`
- Base URL: `https://run.xcrawl.com`
- Required header: `Authorization: Bearer <XCRAWL_API_KEY>`
Usage Examples
cURL (create + result)
```sh
API_KEY="$(node -e "const fs=require('fs');const p=process.env.HOME+'/.xcrawl/config.json';const k=JSON.parse(fs.readFileSync(p,'utf8')).XCRAWL_API_KEY||'';process.stdout.write(k)")"

CREATE_RESP="$(curl -sS -X POST "https://run.xcrawl.com/v1/crawl" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${API_KEY}" \
  -d '{"url":"https://example.com","crawler":{"limit":100,"max_depth":2},"output":{"formats":["markdown","links"]}}')"
echo "$CREATE_RESP"

CRAWL_ID="$(node -e 'const s=process.argv[1];const j=JSON.parse(s);process.stdout.write(j.crawl_id||"")' "$CREATE_RESP")"

curl -sS -X GET "https://run.xcrawl.com/v1/crawl/${CRAWL_ID}" \
  -H "Authorization: Bearer ${API_KEY}"
```
Node
```sh
node -e '
const fs = require("fs");
const apiKey = JSON.parse(fs.readFileSync(process.env.HOME + "/.xcrawl/config.json", "utf8")).XCRAWL_API_KEY;
const body = {
  url: "https://example.com",
  crawler: { limit: 300, max_depth: 3, include: ["/docs/.*"], exclude: ["/blog/.*"] },
  request: { locale: "ja-JP" },
  output: { formats: ["markdown", "links", "json"] }
};
fetch("https://run.xcrawl.com/v1/crawl", {
  method: "POST",
  headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
  body: JSON.stringify(body)
}).then(async r => { console.log(await r.text()); });
'
```
Request Parameters
Request endpoint and headers
- Endpoint: `POST https://run.xcrawl.com/v1/crawl`
- Headers: `Content-Type: application/json`, `Authorization: Bearer <api_key>`
Request body: top-level fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `url` | string | Yes | - | Site entry URL |
| `crawler` | object | No | - | Crawler config |
| `proxy` | object | No | - | Proxy config |
| `request` | object | No | - | Request config |
| `js_render` | object | No | - | JS rendering config |
| `output` | object | No | - | Output config |
| `webhook` | object | No | - | Async callback config |
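Putting the top-level fields together, a minimal bounded request body might look like the sketch below. The field names (`url`, `crawler.limit`, `crawler.max_depth`, `output.formats`) come from the usage examples above; the specific values are illustrative only.

```javascript
// A minimal, explicitly bounded crawl request body.
// Only `url` is required; `crawler` and `output` narrow the scope
// and choose per-page output formats, mirroring the cURL example.
const body = {
  url: "https://example.com",                 // site entry URL (required)
  crawler: { limit: 100, max_depth: 2 },      // hard page and depth bounds
  output: { formats: ["markdown", "links"] }  // per-page outputs
};
console.log(JSON.stringify(body));
```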
crawler

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `limit` | integer | No | - | Max pages |
| `include` | string[] | No | - | Include only matching URLs (regex supported) |
| `exclude` | string[] | No | - | Exclude matching URLs (regex supported) |
| `max_depth` | integer | No | - | Max depth from entry URL |
| - | boolean | No | - | Crawl full site instead of only subpaths |
| - | boolean | No | - | Include subdomains |
| - | boolean | No | - | Include external links |
| - | boolean | No | - | Use site sitemap |
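The `include`/`exclude` patterns are regexes matched against candidate URLs. As a rough illustration of the scoping semantics, a URL stays in scope when it matches at least one include pattern and no exclude pattern; the exact server-side matching rules are an assumption here, so treat this as a mental model, not the implementation.

```javascript
// Client-side sketch of include/exclude scoping:
// keep a path if it matches any include regex and no exclude regex.
const include = ["/docs/.*"];
const exclude = ["/docs/internal/.*"];

const inScope = (path) =>
  include.some(r => new RegExp(r).test(path)) &&
  !exclude.some(r => new RegExp(r).test(path));

console.log(inScope("/docs/api"));        // true
console.log(inScope("/docs/internal/x")); // false (excluded)
console.log(inScope("/blog/post"));       // false (not included)
```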
proxy

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| - | string | No | - | ISO-3166-1 alpha-2 country code |
| - | string | No | Auto-generated | Sticky session ID; the same ID attempts to reuse the same exit |
request

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `locale` | string | No | - | Locale, e.g. `ja-JP` |
| - | string | No | - | Device type; affects UA and viewport |
| - | object map | No | - | Cookie key/value pairs |
| - | object map | No | - | Header key/value pairs |
| - | boolean | No | - | Return main content only |
| - | boolean | No | - | Attempt to block ad resources |
| - | boolean | No | - | Skip TLS verification |
js_render

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| - | boolean | No | - | Enable browser rendering |
| - | string | No | - | - |
| - | integer | No | - | Viewport width (device-specific default) |
| - | integer | No | - | Viewport height (device-specific default) |
output

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `formats` | string[] | No | - | Output formats |
| - | string | No | - | Only applies when `formats` includes the matching format |
| - | string | No | - | Extraction prompt |
| - | object | No | - | JSON Schema |
`output.formats` enum: `html`, `raw_html`, `markdown`, `links`, `summary`, `screenshot`, `json`
webhook

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| - | string | No | - | Callback URL |
| - | object map | No | - | Custom callback headers |
| - | string[] | No | - | Subscribed event types |
Response Parameters
Create response (`POST /v1/crawl`)

| Field | Type | Description |
|---|---|---|
| `crawl_id` | string | Task ID |
| - | string | Constant value |
| - | string | Version |
| - | string | Constant value |
Result response (`GET /v1/crawl/{crawl_id}`)

| Field | Type | Description |
|---|---|---|
| `crawl_id` | string | Task ID |
| - | string | Constant value |
| - | string | Version |
| - | string | `pending` / `crawling` / `completed` / `failed` |
| - | string | Entry URL |
| `data` | object[] | Per-page result array |
| - | string | Start time (ISO 8601) |
| - | string | End time (ISO 8601) |
| - | integer | Total credits used |
`data[]` fields follow `output.formats`: `html`, `raw_html`, `markdown`, `links`, `summary`, `screenshot`, `json`, plus `metadata` (page metadata), `traffic_bytes`, `credits_used`, and `credits_detail`.
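As a sketch of consuming `data[]`, the snippet below collects the per-page `markdown` output from a completed result. The `status` and `markdown` field names are assumptions recovered from the tables above (the source elides exact field names); real entries carry only the formats that were requested.

```javascript
// Collect per-page markdown strings from a completed result body.
// Pages without a markdown field (e.g. links-only entries) are skipped.
function collectMarkdown(result) {
  if (result.status !== "completed") return [];
  return (result.data || [])
    .filter(page => typeof page.markdown === "string")
    .map(page => page.markdown);
}

// Hypothetical result body, for illustration only:
const result = {
  status: "completed",
  data: [{ markdown: "# Home" }, { markdown: "# Docs" }, { links: [] }]
};
console.log(collectMarkdown(result).length); // 2
```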
Workflow
- Confirm the business objective and crawl boundary: what content is required, what must be excluded, and what the completion signal is.
- Draft a bounded crawl request. Prefer explicit limits and path constraints.
- Start the crawl and capture task metadata. Record `crawl_id`, the initial status, and the request payload.
- Poll `GET /v1/crawl/{crawl_id}` until a terminal state. Track `pending`, `crawling`, `completed`, or `failed`.
- Return the raw create/result responses. Do not synthesize derived summaries unless explicitly requested.
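The polling step above can be sketched as a small loop. The terminal statuses come from the result table; the `fetchStatus` callback, interval, and retry cap are illustrative assumptions, not part of the API.

```javascript
// Poll until the crawl reaches a terminal state (completed or failed).
// `fetchStatus` is an injected async function returning the parsed
// result body, so the loop itself stays network-agnostic.
async function pollUntilDone(fetchStatus, intervalMs = 5000, maxTries = 120) {
  for (let i = 0; i < maxTries; i++) {
    const body = await fetchStatus();
    if (body.status === "completed" || body.status === "failed") return body;
    // Still pending/crawling: wait before the next read.
    await new Promise(res => setTimeout(res, intervalMs));
  }
  throw new Error("crawl did not reach a terminal state in time");
}
```

In practice `fetchStatus` would wrap the `GET /v1/crawl/{crawl_id}` call from the usage examples, sending the `Authorization` header built from `~/.xcrawl/config.json`.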
Output Contract
Return:
- Endpoint flow (`POST /v1/crawl` + `GET /v1/crawl/{crawl_id}`)
- `request_payload` used for the create request
- Raw response body from the create call
- Raw response body from the result call
- Error details when a request fails
Do not generate summaries unless the user explicitly requests a summary.
Guardrails
- Never run an unbounded crawl without explicit constraints.
- Do not present speculative page counts as final coverage.
- Do not hardcode provider-specific tool schemas in core logic.
- Highlight policy, legal, or website-usage risks when relevant.