# claude-seo-skills: seo-robots-ai
Clone the full repository, or install just this skill into `~/.claude/skills`:

```bash
git clone https://github.com/lionkiii/claude-seo-skills
```

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/lionkiii/claude-seo-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/seo-robots-ai" ~/.claude/skills/lionkiii-claude-seo-skills-seo-robots-ai && rm -rf "$T"
```
**`skills/seo-robots-ai/SKILL.md`**

## AI Crawler Robots.txt Audit
Analyzes a site's robots.txt specifically for AI crawler access policies. Complements `/seo-technical` (which does a broad robots.txt check) with deep AI-specific analysis.
@skills/seo/references/ai-crawlers-guide.md
### AI Crawler Registry
| Bot Name | Owner | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data + ChatGPT web search |
| OAI-SearchBot | OpenAI | ChatGPT search only (not training) |
| ChatGPT-User | OpenAI | ChatGPT browsing (real-time) |
| ClaudeBot | Anthropic | Training data collection |
| anthropic-ai | Anthropic | Anthropic web crawler |
| PerplexityBot | Perplexity | AI search engine |
| Google-Extended | Google | Gemini / AI training (not Search) |
| Bytespider | ByteDance | TikTok / AI training |
| CCBot | Common Crawl | Open dataset used by many AI models |
| Applebot-Extended | Apple | Apple Intelligence training |
| cohere-ai | Cohere | AI model training |
| FacebookBot | Meta | Meta AI training |
| Meta-ExternalAgent | Meta | Meta AI browsing agent |
| Amazonbot | Amazon | Alexa / AI training |
| Diffbot | Diffbot | AI knowledge graph |
| ImagesiftBot | Hive | AI image training |
| Omgili | Webz.io | AI data feeds |
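Where the skill is driven programmatically, the registry above maps naturally onto a small lookup table. A minimal sketch in Python; the structure and names are illustrative, not part of the skill's interface:

```python
# AI crawler registry (subset): user-agent token -> (owner, purpose)
AI_CRAWLER_REGISTRY = {
    "GPTBot":          ("OpenAI", "Training data + ChatGPT web search"),
    "OAI-SearchBot":   ("OpenAI", "ChatGPT search only (not training)"),
    "ChatGPT-User":    ("OpenAI", "ChatGPT browsing (real-time)"),
    "ClaudeBot":       ("Anthropic", "Training data collection"),
    "PerplexityBot":   ("Perplexity", "AI search engine"),
    "Google-Extended": ("Google", "Gemini / AI training (not Search)"),
    "Bytespider":      ("ByteDance", "TikTok / AI training"),
    "CCBot":           ("Common Crawl", "Open dataset used by many AI models"),
}
```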
### Inputs

- `url`: The website URL to audit (will fetch `/robots.txt` from the site root)
  - Normalize to domain root: `example.com/page` → `https://example.com/robots.txt` (a normalization sketch follows this list)
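A minimal normalization sketch in Python using only the standard library; defaulting scheme-less input to `https://` is an assumption, not something the skill spec mandates:

```python
from urllib.parse import urlparse

def robots_url(url: str) -> str:
    """Normalize any page URL to its domain-root robots.txt URL."""
    # urlparse only finds the hostname when a scheme is present,
    # so assume https for bare input like "example.com/page"
    if "//" not in url:
        url = "https://" + url
    parts = urlparse(url)
    return f"{parts.scheme or 'https'}://{parts.netloc}/robots.txt"

print(robots_url("example.com/page"))  # https://example.com/robots.txt
```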
### Execution
1. **Fetch robots.txt**: WebFetch `<domain>/robots.txt`
   - If 404 → report "No robots.txt found — all crawlers allowed by default"
   - If 200 → proceed to parse
2. **Parse User-agent blocks**: Extract all `User-agent` directives and their associated `Allow`/`Disallow` rules.
3. **Check each AI crawler**: For each bot in the registry, determine access (a programmatic sketch follows this list):
   - **Allowed** — no specific block, or explicit `Allow: /` for this User-agent
   - **Blocked** — `Disallow: /` for this User-agent
   - **Partial** — some paths blocked, others allowed (list specifics)
   - **Inherited** — falls under `User-agent: *` rules (note this)
4. **Check wildcard rules**: If `User-agent: *` has `Disallow: /`, note that ALL bots (including AI) are blocked unless explicitly allowed.
5. **Check for ai.txt**: WebFetch `<domain>/ai.txt` — an emerging standard for AI-specific crawler policies. Report if found and summarize contents.
6. **Check for llms.txt**: WebFetch `<domain>/llms.txt` — report if found (cross-reference with `/seo llms-txt` for a full audit).
7. **Analyze crawl-delay**: Note any `Crawl-delay` directives that affect AI bots specifically or via wildcard.
8. **Check sitemap declaration**: Note if a `Sitemap:` directive is present (helps AI crawlers discover content).
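Steps 1, 3, 7, and 8 can be approximated with Python's standard-library `urllib.robotparser`. A sketch under that assumption; note that `RobotFileParser` only reports allowed/blocked per path and does not say whether a verdict came from an explicit rule or from wildcard inheritance, so the Partial and Inherited statuses still require reading the raw file. The bot list is a subset of the registry above:

```python
from urllib.robotparser import RobotFileParser

# Subset of the AI crawler registry above
AI_BOTS = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot",
           "Google-Extended", "CCBot", "Bytespider"]

def audit(domain: str) -> None:
    rp = RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    rp.read()  # a 404 is treated as "everything allowed", matching step 1

    for bot in AI_BOTS:
        allowed = rp.can_fetch(bot, f"https://{domain}/")
        delay = rp.crawl_delay(bot)  # step 7: Crawl-delay, if declared
        line = f"{bot:16} {'Allowed' if allowed else 'Blocked'}"
        if delay:
            line += f"  crawl-delay={delay}s"
        print(line)

    # Step 8: Sitemap: directives, if any (site_maps() needs Python 3.8+)
    print("Sitemaps:", rp.site_maps())

audit("example.com")
```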
### Output Format
## AI Crawler Audit: [domain]

### Crawler Access Matrix

| Crawler | Owner | Status | Rule Source | Details |
|---|---|---|---|---|
| GPTBot | OpenAI | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| ClaudeBot | Anthropic | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| PerplexityBot | Perplexity | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| Google-Extended | Google | Allowed/Blocked/Partial | Line [#] | [specific rules] |
| ... | ... | ... | ... | ... |

### AI Openness Score: X/10

Scoring:
- 10/10 = All AI crawlers allowed, ai.txt present, llms.txt present
- 7-9 = Most crawlers allowed, some minor gaps
- 4-6 = Mixed policy — some allowed, some blocked
- 1-3 = Most AI crawlers blocked
- 0/10 = All AI crawlers blocked (or blanket `Disallow: /`)

### Key Findings

- **AI crawlers explicitly blocked**: [count] of [total]
- **AI crawlers explicitly allowed**: [count]
- **Falling under wildcard rules**: [count]
- **ai.txt present**: Yes/No
- **llms.txt present**: Yes/No
- **Sitemap declared**: Yes/No

### Recommendations

Based on the site's apparent goals:

**If goal is maximum AI visibility:**
- [Specific recommendations to allow AI crawlers]
- [Suggest llms.txt creation if missing]

**If goal is AI protection:**
- [Note any crawlers not yet blocked]
- [Suggest ai.txt adoption]

**If goal is selective access:**
- [Recommend allowing search-focused bots: OAI-SearchBot, PerplexityBot]
- [Block training-only bots: CCBot, Bytespider]
- [Distinguish training vs search crawlers]

### Industry Context

Note how the site's policy compares to common patterns:
- Most major publishers block training bots but allow search bots
- Most SaaS companies allow all AI crawlers for visibility
- E-commerce sites typically allow all crawlers
- Media/news sites increasingly block training-only bots

### robots.txt Snippets

If the user wants to implement changes, provide ready-to-paste robots.txt blocks for their chosen strategy:

**Allow all AI crawlers:**
```
# AI Crawlers — Allowed
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /
```
**Block training, allow search:**
```
# AI Search — Allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# AI Training — Blocked
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /
```
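Either snippet can be sanity-checked offline before deployment with the same standard-library `urllib.robotparser` module; a sketch using a trimmed version of the "block training, allow search" policy:

```python
from urllib.robotparser import RobotFileParser

# Trimmed "block training, allow search" snippet from above
POLICY = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(POLICY.splitlines())

# The search bot should get through; the training bot should not
assert rp.can_fetch("OAI-SearchBot", "https://example.com/") is True
assert rp.can_fetch("GPTBot", "https://example.com/") is False
print("policy behaves as intended")
```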