# Learn-skills.dev llms-txt-crawler

Fetch and crawl `llms.txt` files from websites. Parses the `llms.txt` format to extract page URLs and downloads all listed content. Use when you need to gather documentation or content from a website that provides an `llms.txt` file.

## Install

Source · clone the upstream repo:

```bash
git clone https://github.com/NeverSight/learn-skills.dev
```

Claude Code · install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/NeverSight/learn-skills.dev "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/skills-md/agykit/agykit/llms-txt-crawler" ~/.claude/skills/neversight-learn-skills-dev-llms-txt-crawler && rm -rf "$T"
```

Manifest: `data/skills-md/agykit/agykit/llms-txt-crawler/SKILL.md`

## Source content

# llms.txt Crawler Skill

This skill enables you to fetch `llms.txt` files from websites and crawl all pages listed within them. The `llms.txt` format is a standard way for websites to provide LLM-friendly content listings.

## Overview

The `llms.txt` file typically follows this format:

```markdown
# Site Name

## Section Name

- [Page Title](https://example.com/page.md): Description of the page
- [Another Page](https://example.com/another.md): Another description
```

This skill parses these files and downloads all linked content.
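
Because each entry is a single markdown list line, extraction can be done with a line-oriented regex. The sketch below is an illustrative assumption about how such a parser might look, not the skill's actual implementation (which lives in `scripts/crawl.js`):

```js
// Illustrative llms.txt link extractor -- a sketch, not the skill's real parser.
// Matches list entries of the form: - [Title](URL): optional description
function parseLlmsTxt(text) {
  const entry = /^-\s*\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$/;
  const pages = [];
  for (const line of text.split('\n')) {
    const m = line.trim().match(entry);
    if (m) pages.push({ title: m[1], url: m[2], description: m[3] ?? '' });
  }
  return pages;
}
```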

## Usage

### Basic Usage

Run the crawl script with a target URL:

```bash
cd /path/to/skills/llms-txt-crawler/scripts
npm install  # First time only
node crawl.js --url https://example.com
```

Command Line Options

OptionShortDescriptionDefault
--url
-u
Base URL of the site with llms.txtRequired
--output
-o
Output directory for crawled files
./output
--format
-f
Output format:
md
,
json
, or
txt
md
--delay
-d
Delay between requests in milliseconds
500
--concurrent
-c
Maximum concurrent requests
3
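
Taken together, `--delay` and `--concurrent` bound the request rate: at most `concurrent` requests are in flight, and each worker pauses `delay` milliseconds between fetches. A rough sketch of that scheduling pattern, as an assumed illustration rather than the script's actual code:

```js
// Bounded-concurrency worker pool with a per-request delay -- an assumed
// illustration of how --concurrent and --delay might interact.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function crawlAll(urls, { delay = 500, concurrent = 3 } = {}) {
  const queue = [...urls];
  async function worker() {
    while (queue.length > 0) {
      const url = queue.shift();
      const res = await fetch(url); // global fetch, Node 18+
      console.log(`${res.status} ${url}`);
      await sleep(delay); // rate limit: pause between this worker's requests
    }
  }
  // Spawn `concurrent` workers that drain the shared queue.
  await Promise.all(Array.from({ length: concurrent }, worker));
}
```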

## Examples

Crawl agentskills.io documentation:

```bash
node crawl.js --url https://agentskills.io --output ./agentskills-docs
```

Crawl with custom rate limiting:

```bash
node crawl.js --url https://example.com --delay 1000 --concurrent 2
```

Output as JSON:

```bash
node crawl.js --url https://example.com --format json
```

## Output Structure

The script creates the following output structure:

```
output/
├── llms.txt              # Original llms.txt file
├── index.json            # Metadata about all crawled pages
└── pages/
    ├── page-1.md
    ├── page-2.md
    └── ...
```
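
The exact schema of `index.json` is not documented on this page; a plausible entry shape, offered purely as an assumption, would pair each source URL with its local file:

```js
// Hypothetical index.json entry (assumed schema -- not documented by the skill):
const exampleEntry = {
  title: 'Page Title',
  url: 'https://example.com/page.md',
  file: 'pages/page-1.md',
};
```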

## Error Handling

- **Network errors**: retries up to 3 times with exponential backoff (see the sketch below)
- **Rate limiting**: respects delay settings between requests
- **Missing pages**: logs warnings but continues crawling other pages
- **Invalid URLs**: skips and logs invalid URLs
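
Exponential backoff on network errors typically follows this pattern; the code below illustrates the stated retry behavior and is not the skill's actual handler:

```js
// Retry a fetch up to `retries` times, doubling the wait each attempt
// (500ms, 1s, 2s). A sketch of the described behavior, not the skill's code.
async function fetchWithRetry(url, retries = 3, baseDelay = 500) {
  for (let attempt = 0; ; attempt++) {
    try {
      const res = await fetch(url);
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return await res.text();
    } catch (err) {
      if (attempt >= retries) throw err;
      await new Promise((resolve) => setTimeout(resolve, baseDelay * 2 ** attempt));
    }
  }
}
```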

## Integration Tips

When using this skill in an agent workflow:

1. First run the crawler to download content.
2. The `index.json` file contains metadata about all pages.
3. Use the downloaded markdown files for context or analysis (see the sketch below).
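
For example, an agent step might load every crawled page through the index. This sketch assumes `index.json` is an array of entries with a `file` field relative to the output directory, which is an assumption rather than documented behavior:

```js
// Read crawled pages via index.json (assumes entries carry a `file` field
// relative to the output directory -- the real schema may differ).
const fs = require('node:fs');
const path = require('node:path');

const outputDir = './output';
const index = JSON.parse(fs.readFileSync(path.join(outputDir, 'index.json'), 'utf8'));
for (const entry of index) {
  const content = fs.readFileSync(path.join(outputDir, entry.file), 'utf8');
  console.log(`Loaded ${entry.url}: ${content.length} chars`);
}
```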

## See Also