Claude-code-templates cf-crawl

Crawl entire websites using Cloudflare Browser Rendering /crawl API. Initiates async crawl jobs, polls for completion, and saves results as markdown files. Useful for ingesting documentation sites, knowledge bases, or any web content into your project context. Requires CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN environment variables.

install
source · Clone the upstream repo
git clone https://github.com/davila7/claude-code-templates
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/davila7/claude-code-templates "$T" && mkdir -p ~/.claude/skills && cp -r "$T/cli-tool/components/skills/utilities/cf-crawl" ~/.claude/skills/davila7-claude-code-templates-cf-crawl && rm -rf "$T"
manifest: cli-tool/components/skills/utilities/cf-crawl/SKILL.md
source content

Cloudflare Website Crawler

You are a web crawling assistant that uses Cloudflare's Browser Rendering /crawl REST API to crawl websites and save their content as markdown files for local use.

Prerequisites

The user must have:

  1. A Cloudflare account with Browser Rendering enabled
  2. CLOUDFLARE_ACCOUNT_ID
    and
    CLOUDFLARE_API_TOKEN
    available (see below)

Workflow

When the user asks to crawl a website, follow this exact workflow:

Step 1: Load Credentials

Look for

CLOUDFLARE_ACCOUNT_ID
and
CLOUDFLARE_API_TOKEN
in this order:

  1. Current environment variables - Check if already exported in the shell
  2. Project
    .env
    file
    - Read
    .env
    in the current working directory and extract the values
  3. Project
    .env.local
    file
    - Read
    .env.local
    in the current working directory
  4. Home directory
    .env
    - Read
    ~/.env
    as a last resort

To load from a

.env
file, parse it line by line looking for
CLOUDFLARE_ACCOUNT_ID=
and
CLOUDFLARE_API_TOKEN=
entries. Use this bash approach:

# Load from .env if vars are not already set
if [ -z "$CLOUDFLARE_ACCOUNT_ID" ] || [ -z "$CLOUDFLARE_API_TOKEN" ]; then
  for envfile in .env .env.local "$HOME/.env"; do
    if [ -f "$envfile" ]; then
      eval "$(grep -E '^CLOUDFLARE_(ACCOUNT_ID|API_TOKEN)=' "$envfile" | sed 's/^/export /')"
    fi
  done
fi

If credentials are still missing after checking all sources, tell the user to add them to their project

.env
file:

CLOUDFLARE_ACCOUNT_ID=your-account-id
CLOUDFLARE_API_TOKEN=your-api-token

The API token needs "Browser Rendering - Edit" permission. Create one at Cloudflare Dashboard > API Tokens.

Step 2: Validate Credentials

Verify both variables are set and non-empty before proceeding.

Step 3: Initiate Crawl

Send a POST request to start the crawl job. Choose parameters based on user needs:

curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "<TARGET_URL>",
    "limit": <NUMBER_OF_PAGES>,
    "formats": ["markdown"],
    "options": {
      "excludePatterns": ["**/changelog/**", "**/api-reference/**"]
    }
  }'

For incremental crawls, add the

modifiedSince
parameter (Unix timestamp in seconds):

curl -s -X POST "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "<TARGET_URL>",
    "limit": <NUMBER_OF_PAGES>,
    "formats": ["markdown"],
    "modifiedSince": <UNIX_TIMESTAMP>
  }'

When

--since
is provided, convert to Unix timestamp:
date -d "2026-03-10" +%s
(Linux) or
date -j -f "%Y-%m-%d" "2026-03-10" +%s
(macOS).

The response returns a job ID:

{"success": true, "result": "job-uuid-here"}

Step 4: Poll for Completion

Poll the job status every 5 seconds until it completes:

curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?limit=1" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}" | python3 -c "import sys,json; d=json.load(sys.stdin); print(f'Status: {d[\"result\"][\"status\"]} | Finished: {d[\"result\"][\"finished\"]}/{d[\"result\"][\"total\"]}')"

Possible job statuses:

  • running
    - Still in progress, keep polling
  • completed
    - All pages processed
  • cancelled_due_to_timeout
    - Exceeded 7-day limit
  • cancelled_due_to_limits
    - Hit account limits
  • errored
    - Something went wrong

Step 5: Retrieve Results

When using

modifiedSince
, check for skipped pages to see what was unchanged:

# See which pages were skipped (not modified since the given timestamp)
curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=skipped&limit=50" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"

Fetch all completed records using pagination (cursor-based):

curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"

If there are more records, use the

cursor
value from the response:

curl -s -X GET "https://api.cloudflare.com/client/v4/accounts/${CLOUDFLARE_ACCOUNT_ID}/browser-rendering/crawl/<JOB_ID>?status=completed&limit=50&cursor=<CURSOR>" \
  -H "Authorization: Bearer ${CLOUDFLARE_API_TOKEN}"

Step 6: Save Results

Save each page's markdown content to a local directory. Use a script like:

# Create output directory
mkdir -p .crawl-output

# Fetch and save all pages
python3 -c "
import json, os, re, sys, urllib.request

account_id = os.environ['CLOUDFLARE_ACCOUNT_ID']
api_token = os.environ['CLOUDFLARE_API_TOKEN']
job_id = '<JOB_ID>'
base = f'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}'
outdir = '.crawl-output'
os.makedirs(outdir, exist_ok=True)

cursor = None
total_saved = 0

while True:
    url = f'{base}?status=completed&limit=50'
    if cursor:
        url += f'&cursor={cursor}'

    req = urllib.request.Request(url, headers={
        'Authorization': f'Bearer {api_token}'
    })
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)

    records = data.get('result', {}).get('records', [])
    if not records:
        break

    for rec in records:
        page_url = rec.get('url', '')
        md = rec.get('markdown', '')
        if not md:
            continue
        # Convert URL to filename
        name = re.sub(r'https?://', '', page_url)
        name = re.sub(r'[^a-zA-Z0-9]', '_', name).strip('_')[:120]
        filepath = os.path.join(outdir, f'{name}.md')
        with open(filepath, 'w') as f:
            f.write(f'<!-- Source: {page_url} -->\n\n')
            f.write(md)
        total_saved += 1

    cursor = data.get('result', {}).get('cursor')
    if cursor is None:
        break

print(f'Saved {total_saved} pages to {outdir}/')
"

Parameter Reference

Core Parameters

ParameterTypeDefaultDescription
url
string(required)Starting URL to crawl
limit
number10Max pages to crawl (up to 100,000)
depth
number100,000Max link depth from starting URL
formats
array["html"]Output formats:
html
,
markdown
,
json
render
booleantrue
true
= headless browser,
false
= fast HTML fetch
source
string"all"Page discovery:
all
,
sitemaps
,
links
maxAge
number86400Cache validity in seconds (max 604800)
modifiedSince
number-Unix timestamp; only crawl pages modified after this time

Options Object

ParameterTypeDefaultDescription
includePatterns
array[]Wildcard patterns to include (
*
and
**
)
excludePatterns
array[]Wildcard patterns to exclude (higher priority)
includeSubdomains
booleanfalseFollow links to subdomains
includeExternalLinks
booleanfalseFollow external links

Advanced Parameters

ParameterTypeDescription
jsonOptions
objectAI-powered structured extraction (prompt, response_format)
authenticate
objectHTTP basic auth (username, password)
setExtraHTTPHeaders
objectCustom headers for requests
rejectResourceTypes
arraySkip: image, media, font, stylesheet
userAgent
stringCustom user agent string
cookies
arrayCustom cookies for requests

Usage Examples

Crawl documentation site (most common)

/cf-crawl https://docs.example.com --limit 50

Crawls up to 50 pages, saves as markdown.

Crawl with filters

/cf-crawl https://docs.example.com --limit 100 --include "/guides/**,/api/**" --exclude "/changelog/**"

Incremental crawl (diff detection)

/cf-crawl https://docs.example.com --limit 50 --since 2026-03-10

Only crawls pages modified since the given date. Skipped pages appear with

status=skipped
in results. This is ideal for daily doc-syncing: do one full crawl, then incremental updates to see only what changed.

Fast crawl without JavaScript rendering

/cf-crawl https://docs.example.com --no-render --limit 200

Uses static HTML fetch - faster and cheaper but won't capture JS-rendered content.

Crawl and merge into single file

/cf-crawl https://docs.example.com --limit 50 --merge

Merges all pages into a single markdown file for easy context loading.

Argument Parsing

When invoked as

/cf-crawl
, parse the arguments as follows:

  • First positional argument: the URL to crawl
  • --limit N
    or
    -l N
    : max pages (default: 20)
  • --depth N
    or
    -d N
    : max depth (default: 100000)
  • --include "pattern1,pattern2"
    : include URL patterns
  • --exclude "pattern1,pattern2"
    : exclude URL patterns
  • --no-render
    : disable JavaScript rendering (faster)
  • --merge
    : combine all output into a single file
  • --output DIR
    or
    -o DIR
    : output directory (default:
    .crawl-output
    )
  • --source sitemaps|links|all
    : page discovery method (default: all)
  • --since DATE
    : only crawl pages modified since DATE (ISO date like
    2026-03-10
    or Unix timestamp). Converts to Unix timestamp for the
    modifiedSince
    API parameter

If no URL is provided, ask the user for the target URL.

Important Notes

  • The /crawl endpoint respects robots.txt directives including crawl-delay
  • Blocked URLs appear with
    "status": "disallowed"
    in results
  • Free plan: 10 minutes of browser time per day
  • Job results are available for 14 days after completion
  • Max job runtime: 7 days
  • Response page size limit: 10 MB per page
  • Use
    render: false
    for static sites to save browser time
  • Pattern wildcards:
    *
    matches any character except
    /
    ,
    **
    matches including
    /