# claude-seo-skills: seo-robots-ai
Clone the full repository, or install just this skill into `~/.claude/skills`:

```bash
git clone https://github.com/lionkiii/claude-seo-skills
```

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/lionkiii/claude-seo-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/seo-robots-ai" ~/.claude/skills/lionkiii-claude-seo-skills-seo-robots-ai && rm -rf "$T"
```
**`skills/seo-robots-ai/SKILL.md`**

## AI Crawler Robots.txt Audit
Analyzes a site's robots.txt specifically for AI crawler access policies. Complements `/seo-technical` (which does a broad robots.txt check) with deep AI-specific analysis.
@skills/seo/references/ai-crawlers-guide.md
### AI Crawler Registry
| Bot Name | Owner | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data + ChatGPT web search |
| OAI-SearchBot | OpenAI | ChatGPT search only (not training) |
| ChatGPT-User | OpenAI | ChatGPT browsing (real-time) |
| ClaudeBot | Anthropic | Training data collection |
| anthropic-ai | Anthropic | Anthropic web crawler |
| PerplexityBot | Perplexity | AI search engine |
| Google-Extended | Google | Gemini / AI training (not Search) |
| Bytespider | ByteDance | TikTok / AI training |
| CCBot | Common Crawl | Open dataset used by many AI models |
| Applebot-Extended | Apple | Apple Intelligence training |
| cohere-ai | Cohere | AI model training |
| FacebookBot | Meta | Meta AI training |
| Meta-ExternalAgent | Meta | Meta AI browsing agent |
| Amazonbot | Amazon | Alexa / AI training |
| Diffbot | Diffbot | AI knowledge graph |
| ImagesiftBot | Hive | AI image training |
| Omgili | Webz.io | AI data feeds |
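Where the skill is driven programmatically, the registry above maps naturally onto a small lookup table. A minimal sketch in Python; the structure and names are illustrative, not part of the skill's interface:

```python
# AI crawler registry (subset): user-agent token -> (owner, purpose)
AI_CRAWLER_REGISTRY = {
    "GPTBot":          ("OpenAI", "Training data + ChatGPT web search"),
    "OAI-SearchBot":   ("OpenAI", "ChatGPT search only (not training)"),
    "ChatGPT-User":    ("OpenAI", "ChatGPT browsing (real-time)"),
    "ClaudeBot":       ("Anthropic", "Training data collection"),
    "PerplexityBot":   ("Perplexity", "AI search engine"),
    "Google-Extended": ("Google", "Gemini / AI training (not Search)"),
    "Bytespider":      ("ByteDance", "TikTok / AI training"),
    "CCBot":           ("Common Crawl", "Open dataset used by many AI models"),
}
```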
### Inputs

- `url`: The website URL to audit (will fetch `/robots.txt` from the site root)
  - Normalize to domain root: `example.com/page` → `https://example.com/robots.txt` (a normalization sketch follows this list)
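A minimal normalization sketch in Python using only the standard library; defaulting scheme-less input to `https://` is an assumption, not something the skill spec mandates:

```python
from urllib.parse import urlparse

def robots_url(url: str) -> str:
    """Normalize any page URL to its domain-root robots.txt URL."""
    # urlparse only finds the hostname when a scheme is present,
    # so assume https for bare input like "example.com/page"
    if "//" not in url:
        url = "https://" + url
    parts = urlparse(url)
    return f"{parts.scheme or 'https'}://{parts.netloc}/robots.txt"

print(robots_url("example.com/page"))  # https://example.com/robots.txt
```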
### Execution
1. **Fetch robots.txt**: WebFetch `<domain>/robots.txt`
   - If 404 → report "No robots.txt found — all crawlers allowed by default"
   - If 200 → proceed to parse
2. **Parse User-agent blocks**: Extract all `User-agent` directives and their associated `Allow`/`Disallow` rules.
3. **Check each AI crawler**: For each bot in the registry, determine access (a programmatic sketch follows this list):
   - **Allowed** — no specific block, or explicit `Allow: /` for this User-agent
   - **Blocked** — `Disallow: /` for this User-agent
   - **Partial** — some paths blocked, others allowed (list specifics)
   - **Inherited** — falls under `User-agent: *` rules (note this)
4. **Check wildcard rules**: If `User-agent: *` has `Disallow: /`, note that ALL bots (including AI) are blocked unless explicitly allowed.
5. **Check for ai.txt**: WebFetch `<domain>/ai.txt` — an emerging standard for AI-specific crawler policies. Report if found and summarize contents.
6. **Check for llms.txt**: WebFetch `<domain>/llms.txt` — report if found (cross-reference with `/seo llms-txt` for a full audit).
7. **Analyze crawl-delay**: Note any `Crawl-delay` directives that affect AI bots specifically or via wildcard.
8. **Check sitemap declaration**: Note if a `Sitemap:` directive is present (helps AI crawlers discover content).
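Steps 1, 3, 7, and 8 can be approximated with Python's standard-library `urllib.robotparser`. A sketch under that assumption; note that `RobotFileParser` only reports allowed/blocked per path and does not say whether a verdict came from an explicit rule or from wildcard inheritance, so the Partial and Inherited statuses still require reading the raw file. The bot list is a subset of the registry above:

```python
from urllib.robotparser import RobotFileParser

# Subset of the AI crawler registry above
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
           "Google-Extended", "CCBot", "Bytespider"]

def audit(domain: str) -> None:
    rp = RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    rp.read()  # a 404 is treated as "everything allowed", matching step 1

    for bot in AI_BOTS:
        allowed = rp.can_fetch(bot, f"https://{domain}/")
        delay = rp.crawl_delay(bot)  # step 7: Crawl-delay, if declared
        line = f"{bot:16} {'Allowed' if allowed else 'Blocked'}"
        if delay:
            line += f"  crawl-delay={delay}s"
        print(line)

    # Step 8: Sitemap: directives, if any (site_maps() needs Python 3.8+)
    print("Sitemaps:", rp.site_maps())

audit("example.com")
```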
### Output Format
## AI Crawler Audit: [domain]

### Crawler Access Matrix

| Crawler | Owner | Status | Rule Source | Details |
|---|---|---|---|---|
| GPTBot | OpenAI | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| ClaudeBot | Anthropic | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| PerplexityBot | Perplexity | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| Google-Extended | Google | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| ... | ... | ... | ... | ... |

### AI Openness Score: X/10

Scoring:
- 10/10 = All AI crawlers allowed, ai.txt present, llms.txt present
- 7-9 = Most crawlers allowed, some minor gaps
- 4-6 = Mixed policy — some allowed, some blocked
- 1-3 = Most AI crawlers blocked
- 0/10 = All AI crawlers blocked (or blanket `Disallow: /`)

### Key Findings

- **AI crawlers explicitly blocked**: [count] of [total]
- **AI crawlers explicitly allowed**: [count]
- **Falling under wildcard rules**: [count]
- **ai.txt present**: Yes/No
- **llms.txt present**: Yes/No
- **Sitemap declared**: Yes/No

### Recommendations

Based on the site's apparent goals:

**If goal is maximum AI visibility:**
- [Specific recommendations to allow AI crawlers]
- [Suggest llms.txt creation if missing]

**If goal is AI protection:**
- [Note any crawlers not yet blocked]
- [Suggest ai.txt adoption]

**If goal is selective access:**
- [Recommend allowing search-focused bots: OAI-SearchBot, PerplexityBot]
- [Block training-only bots: CCBot, Bytespider]
- [Distinguish training vs search crawlers]

### Industry Context

Note how the site's policy compares to common patterns:
- Most major publishers block training bots but allow search bots
- Most SaaS companies allow all AI crawlers for visibility
- E-commerce sites typically allow all crawlers
- Media/news sites increasingly block training-only bots

### robots.txt Snippets

If the user wants to implement changes, provide ready-to-paste robots.txt blocks for their chosen strategy:

**Allow all AI crawlers:**
```
# AI Crawlers — Allowed
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /
```
**Block training, allow search:**
```
# AI Search — Allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# AI Training — Blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /
```
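Either snippet can be sanity-checked offline before deployment with the same standard-library `urllib.robotparser` module; a sketch using a trimmed version of the "block training, allow search" policy:

```python
from urllib.robotparser import RobotFileParser

# Trimmed "block training, allow search" snippet from above
POLICY = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(POLICY.splitlines())

# The search bot should get through; the training bot should not
assert rp.can_fetch("OAI-SearchBot", "https://example.com/") is True
assert rp.can_fetch("GPTBot", "https://example.com/") is False
print("policy behaves as intended")
```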