Aiwg llms-txt-support
Detect and use llms.txt files for LLM-optimized documentation. Use when checking if a site has LLM-ready docs before scraping.
git clone https://github.com/jmagly/aiwg
T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/llms-txt-support" ~/.claude/skills/jmagly-aiwg-llms-txt-support && rm -rf "$T"
.agents/skills/llms-txt-support/SKILL.md
llms.txt Support Skill
Purpose
Single responsibility: Detect, fetch, and utilize llms.txt files that provide LLM-optimized documentation, enabling 10x faster documentation ingestion. (BP-4)
Background
The llms.txt standard (https://llmstxt.org/) provides a convention for websites to expose LLM-friendly documentation. Instead of scraping entire sites, check for llms.txt first.
File hierarchy (check in order):
- llms-full.txt: Complete documentation (largest)
- llms.txt: Standard documentation
- llms-small.txt: Condensed documentation (smallest)
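The check order above can be sketched as a small helper that expands a base URL into the three candidate URLs, largest variant first. The function name `llms_candidates` is hypothetical and used only for illustration; it is not part of any tool mentioned in this skill.

```shell
# llms_candidates BASE_URL
# Print the three llms.txt variant URLs for BASE_URL, largest variant first.
# (Hypothetical helper name, for illustration only.)
llms_candidates() {
    base=${1%/}   # drop one trailing slash so the result has no "//"
    for name in llms-full.txt llms.txt llms-small.txt; do
        printf '%s/%s\n' "$base" "$name"
    done
}
```

Each printed URL still has to be probed; the helper only fixes the order in which the variants are tried.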
Grounding Checkpoint (Archetype 1 Mitigation)
Before executing, VERIFY:
- Base URL is accessible
- All three llms.txt variants have been checked, in order
- Validate file content is actual documentation (not error page)
- Confirm file size is reasonable for the documentation scope
DO NOT assume llms.txt exists. Always probe first.
Uncertainty Escalation (Archetype 2 Mitigation)
ASK USER instead of guessing when:
- Multiple llms.txt variants found - which size to use?
- llms.txt content appears partial or outdated
- A file is returned but its content looks like an error page
- Site has llms.txt but content doesn't match expected documentation
NEVER assume llms.txt quality without verification.
Context Scope (Archetype 3 Mitigation)
| Context Type | Included | Excluded |
|---|---|---|
| RELEVANT | Target base URL, llms.txt content | Full site scraping |
| PERIPHERAL | llms.txt spec reference | Other sites' llms.txt |
| DISTRACTOR | Previous scraping attempts | Unrelated documentation |
Workflow Steps
Step 1: Detect llms.txt (Grounding)
    # Check for llms.txt variants (in order of preference)
    curl -I https://example.com/llms-full.txt
    curl -I https://example.com/llms.txt
    curl -I https://example.com/llms-small.txt

    # Check common alternate locations
    curl -I https://example.com/.well-known/llms.txt
    curl -I https://docs.example.com/llms.txt
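The individual probes above can be wrapped in a loop that records one status code per location. This is a minimal sketch: `http_status` and `probe_llms` are hypothetical names, the 10-second timeout is an arbitrary assumption, and some servers answer HEAD requests differently from GET, so a 200 here is a hint rather than proof.

```shell
# http_status URL
# Print the final HTTP status code for URL after following redirects.
# curl prints 000 when no response was received.
http_status() {
    curl -sIL -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}

# probe_llms BASE_URL
# Print "<status> <url>" for each candidate location under BASE_URL.
probe_llms() {
    base=${1%/}
    for path in llms-full.txt llms.txt llms-small.txt .well-known/llms.txt; do
        url="$base/$path"
        code=$(http_status "$url") || code=000
        printf '%s %s\n' "$code" "$url"
    done
}

# Usage (performs network requests):
#   probe_llms https://example.com | grep '^200 '
```

A docs subdomain can be covered by calling `probe_llms` a second time with that base URL.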
Step 2: Validate Content
    # Fetch and inspect first 100 lines
    curl -s https://example.com/llms.txt | head -100

    # Check file size
    curl -sI https://example.com/llms.txt | grep -i content-length

    # Verify it's not an error page
    curl -s https://example.com/llms.txt | grep -i "not found\|error\|404" && echo "WARNING: May be error page"
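The keyword check above can also fire on legitimate documentation that merely discusses errors or status codes. A complementary heuristic, sketched here under the assumption that error pages are usually served as HTML, is to look for an HTML preamble in the first few lines of a downloaded file. The name `looks_like_docs` is hypothetical.

```shell
# looks_like_docs FILE
# Succeed only if FILE is non-empty and does not start like an HTML page.
# Heuristic only: a pass does not prove the content is good documentation.
looks_like_docs() {
    file=$1
    [ -s "$file" ] || return 1   # missing or empty file
    # An HTML preamble near the top suggests an error or landing page.
    if head -n 5 "$file" | grep -qiE '<!doctype html|<html'; then
        return 1
    fi
    return 0
}

# Usage:
#   looks_like_docs docs/llms.txt || echo "WARNING: May be error page"
```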
Step 3: Choose Variant
| Variant | Size | Use Case |
|---|---|---|
| llms-full.txt | Large (1MB+) | Complete documentation, full API reference |
| llms.txt | Medium | Standard use, balanced coverage |
| llms-small.txt | Small (<100KB) | Quick reference, limited context windows |
Decision tree:
- If context window is limited → llms-small.txt
- If need complete coverage → llms-full.txt
- Default → llms.txt
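The decision tree can be sketched as a function that takes the need and the list of variants actually detected. The function name `choose_variant`, the need labels (`small`, `full`, `default`), and the fallback order used when the preferred variant is missing are all assumptions made for this example; the skill itself says to ask the user when the choice is unclear.

```shell
# choose_variant NEED AVAILABLE...
#   NEED:      small (limited context), full (complete coverage), or default
#   AVAILABLE: variant file names that were detected
# Print the most preferred available variant, or fail if none is available.
choose_variant() {
    need=$1
    shift
    case "$need" in
        small) order="llms-small.txt llms.txt llms-full.txt" ;;
        full)  order="llms-full.txt llms.txt llms-small.txt" ;;
        *)     order="llms.txt llms-small.txt llms-full.txt" ;;
    esac
    for want in $order; do
        for have in "$@"; do
            if [ "$want" = "$have" ]; then
                printf '%s\n' "$want"
                return 0
            fi
        done
    done
    return 1
}
```

For example, `choose_variant small llms.txt llms-full.txt` prints `llms.txt`, because the small variant was not detected and the standard file is the next preference.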
Step 4: Fetch and Process
    # Download llms.txt (-f fails on HTTP errors instead of saving the error page)
    mkdir -p docs
    curl -fL -o docs/llms.txt https://example.com/llms.txt

    # Convert to skill format (if using skill-seekers)
    skill-seekers scrape --llms-txt docs/llms.txt --name myskill

    # Or process manually
    # llms.txt is already LLM-optimized markdown
    mkdir -p output/myskill/references
    cp docs/llms.txt output/myskill/references/complete.md
Step 5: Validate Output
    # Check content structure
    head -50 output/myskill/references/complete.md

    # Verify sections
    grep "^#" output/myskill/references/complete.md | head -20

    # Check for code examples
    grep -c '```' output/myskill/references/complete.md
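A plain `grep "^#"` also matches comment lines inside fenced code blocks, which can overstate the number of sections. A minimal sketch of a stricter count, assuming fences start at the beginning of a line, tracks whether the current line is inside a fence. The name `count_headings` is hypothetical.

```shell
# count_headings FILE
# Count Markdown headings in FILE, ignoring lines inside fenced code blocks.
count_headings() {
    awk '
        BEGIN { fence = sprintf("%c%c%c", 96, 96, 96) }      # three backticks
        index($0, fence) == 1 { in_code = !in_code; next }   # fence line toggles state
        !in_code && /^#+ / { n++ }                           # heading outside code
        END { print n + 0 }
    ' "$1"
}

# Usage:
#   count_headings output/myskill/references/complete.md
```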
Recovery Protocol (Archetype 4 Mitigation)
On error:
- PAUSE - Note which variant failed
- DIAGNOSE - Check error type:
  - 404 Not Found → Try next variant or alternate location
  - 403 Forbidden → May need authentication or user-agent
  - Timeout → Retry with longer timeout
  - Invalid content → Fall back to traditional scraping
- ADAPT - Try alternate approach
- RETRY - Next variant (max 3 attempts per variant)
- ESCALATE - Inform user llms.txt unavailable, suggest scraping
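The protocol above can be sketched as two small helpers: one that maps a status code to the next step, and one that caps retries. Both names (`recovery_action`, `retry_up_to`) and the action labels are hypothetical. Treating 000 (curl's code when no response arrived), 408, and 504 as timeouts is an assumption of this example.

```shell
# recovery_action STATUS
# Print the next step for an HTTP status code, following the DIAGNOSE list.
recovery_action() {
    case "$1" in
        200)         echo "validate-content" ;;
        404)         echo "try-next-variant" ;;
        403)         echo "retry-with-auth-or-user-agent" ;;
        000|408|504) echo "retry-with-longer-timeout" ;;
        *)           echo "escalate" ;;
    esac
}

# retry_up_to MAX COMMAND [ARGS...]
# Run COMMAND until it succeeds, at most MAX times; fail if it never succeeds.
retry_up_to() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        "$@" && return 0
        attempt=$((attempt + 1))
    done
    return 1
}

# Usage (performs a network request):
#   retry_up_to 3 curl -fsS -o docs/llms.txt https://example.com/llms.txt
```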
Checkpoint Support
State saved to:
.aiwg/working/checkpoints/llms-txt-support/
    checkpoints/llms-txt-support/
    ├── detection_results.json   # Which variants found
    ├── selected_variant.txt     # Which was chosen
    └── content_hash.txt         # For cache validation
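How these files are written is not specified here, so the following is only a sketch of one way to record the chosen variant and a content hash, then check later whether a cached copy is still current. The function names are hypothetical, and `cksum` (a POSIX CRC checksum) stands in for whatever hash the skill actually uses; `sha256sum` could be substituted where a cryptographic hash is wanted.

```shell
# save_checkpoint DIR VARIANT FILE
# Record the chosen variant name and a checksum of FILE under DIR.
save_checkpoint() {
    dir=$1
    variant=$2
    file=$3
    mkdir -p "$dir" || return 1
    printf '%s\n' "$variant" > "$dir/selected_variant.txt"
    # cksum prints "<crc> <size>"; keep only the CRC field.
    cksum < "$file" | cut -d ' ' -f 1 > "$dir/content_hash.txt"
}

# checkpoint_is_fresh DIR FILE
# Succeed if FILE still matches the checksum saved under DIR.
checkpoint_is_fresh() {
    [ -f "$1/content_hash.txt" ] || return 1
    [ "$(cksum < "$2" | cut -d ' ' -f 1)" = "$(cat "$1/content_hash.txt")" ]
}

# Usage:
#   save_checkpoint .aiwg/working/checkpoints/llms-txt-support llms.txt docs/llms.txt
```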
llms.txt Format Reference
Standard llms.txt structure:
    # Project Name

    > Brief description of the project

    ## Overview
    [High-level explanation]

    ## Installation
    [Setup instructions]

    ## Quick Start
    [Getting started guide]

    ## API Reference
    [Detailed API documentation]

    ## Examples
    [Code examples]

    ## FAQ
    [Common questions]
Detection Results Output
    {
      "base_url": "https://example.com",
      "detected": {
        "llms-full.txt": {
          "found": true,
          "url": "https://example.com/llms-full.txt",
          "size": 1523456,
          "last_modified": "2025-01-15T10:30:00Z"
        },
        "llms.txt": {
          "found": true,
          "url": "https://example.com/llms.txt",
          "size": 245678,
          "last_modified": "2025-01-15T10:30:00Z"
        },
        "llms-small.txt": {
          "found": false
        }
      },
      "recommended": "llms.txt",
      "reason": "Standard size, good for most use cases"
    }
Known Sites with llms.txt
Sites known to support llms.txt (verify before use):
- Anthropic documentation
- Many modern API documentation sites
- Framework documentation following the standard
Always verify - this list may be outdated.
Troubleshooting
| Issue | Diagnosis | Solution |
|---|---|---|
| No llms.txt found | Site doesn't support llms.txt | Fall back to doc-scraper |
| Content seems wrong | Error page or redirect | Check actual content, verify URL |
| File too large | llms-full.txt overwhelming | Use llms.txt or llms-small.txt |
| Outdated content | llms.txt not maintained | Consider scraping + llms.txt merge |
Integration with doc-scraper
If llms.txt is incomplete or outdated, combine approaches:
    # 1. Fetch llms.txt as base
    curl -o base.md https://example.com/llms.txt

    # 2. Scrape for additional/updated content
    skill-seekers scrape --config config.json --skip-covered-by base.md

    # 3. Merge results
    # llms.txt provides structure, scraping fills gaps
References
- llms.txt Standard: https://llmstxt.org/
- Skill Seekers llms.txt Detection: https://github.com/jmagly/Skill_Seekers/blob/main/docs/LLMS_TXT_SUPPORT.md
- REF-001: Production-Grade Agentic Workflows (BP-4, BP-9)
- REF-002: LLM Failure Modes (Archetype 1-4 mitigations)