# goose-skills · site-content-catalog

## Install

**Source** · Clone the upstream repo:

```bash
git clone https://github.com/gooseworks-ai/goose-skills
```

**Claude Code** · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/gooseworks-ai/goose-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/capabilities/site-content-catalog" ~/.claude/skills/gooseworks-ai-goose-skills-site-content-catalog && rm -rf "$T"
```

**Manifest:** `skills/capabilities/site-content-catalog/SKILL.md`
# Site Content Catalog

Crawl a website's sitemap and blog to build a complete content inventory — every page cataloged with URL, title, date, content type, and topic cluster. Groups content by category, identifies publishing patterns, and optionally deep-analyzes top pages.

## Quick Start

```bash
# Basic content inventory
python3 scripts/catalog_content.py --domain "example.com"

# With deep analysis of the top 20 pages
python3 scripts/catalog_content.py --domain "example.com" --deep-analyze 20

# Output to a specific file
python3 scripts/catalog_content.py --domain "example.com" --output content-inventory.json
```
## Inputs

The script takes the following inputs (a CLI wiring sketch follows the table):

| Parameter | Required | Default | Description |
|---|---|---|---|
| `domain` | Yes | — | Domain to catalog (e.g., "example.com") |
| `deep-analyze` | No | 0 | Number of top pages to deep-read for content analysis |
| `output` | No | stdout | Path to save JSON output |
| `include-non-blog` | No | true | Also catalog landing pages, docs, etc. (not just blog) |
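A rough sketch of how these flags could be wired with argparse; the internals of `catalog_content.py` aren't shown in this document, so everything beyond the documented flag names and defaults is an assumption:

```python
import argparse

def parse_args():
    """CLI wiring that mirrors the Inputs table; defaults match the documented ones."""
    parser = argparse.ArgumentParser(
        description="Build a content inventory for a domain")
    parser.add_argument("--domain", required=True,
                        help='Domain to catalog, e.g. "example.com"')
    parser.add_argument("--deep-analyze", type=int, default=0, metavar="N",
                        help="Number of top pages to deep-read (0 = skip)")
    parser.add_argument("--output", default=None,
                        help="Path to save JSON output (default: stdout)")
    # Python 3.8-compatible boolean flag: pass "--include-non-blog false" to disable.
    parser.add_argument("--include-non-blog", default=True,
                        type=lambda v: v.lower() != "false",
                        help="Also catalog landing pages, docs, etc. (true/false)")
    return parser.parse_args()
```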
## Cost
- Sitemap/RSS crawling: Free (direct HTTP requests)
- Apify sitemap extractor (fallback): ~$0.50 per site
- Deep analysis: Free (WebFetch on individual pages)
## Process

### Phase 1: Discover All Pages
The script attempts multiple methods to find all pages on a site, in order:
A) Sitemap.xml
- Fetch `https://[domain]/sitemap.xml`
- If it's a sitemap index, recursively fetch all child sitemaps (sketched below)
- Common alternate locations: `/sitemap_index.xml`, `/sitemap-index.xml`, `/wp-sitemap.xml`
- Check `robots.txt` for `Sitemap:` directives
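A minimal sketch of the recursive sitemap walk, assuming the standard sitemap XML namespace; retries, the alternate locations, and `robots.txt` parsing are omitted:

```python
import requests
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def fetch_sitemap_urls(sitemap_url, session=None, seen=None):
    """Return all page URLs from a sitemap, recursing into sitemap indexes."""
    session = session or requests.Session()
    seen = seen if seen is not None else set()
    if sitemap_url in seen:          # guard against sitemap cycles
        return []
    seen.add(sitemap_url)
    resp = session.get(sitemap_url, timeout=15)
    resp.raise_for_status()
    root = ET.fromstring(resp.content)
    urls = []
    if root.tag.endswith("sitemapindex"):
        # A sitemap index lists child sitemaps; recurse into each one.
        for loc in root.findall(".//sm:sitemap/sm:loc", NS):
            urls.extend(fetch_sitemap_urls(loc.text.strip(), session, seen))
    else:
        for loc in root.findall(".//sm:url/sm:loc", NS):
            urls.append(loc.text.strip())
    return urls
```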
B) RSS/Atom Feeds
- Check `/feed`, `/rss`, `/atom.xml`, `/blog/feed`, etc. (probing sketch below)
- Extract posts with titles, dates, and URLs
- RSS typically only surfaces recent content (last 10-50 posts)
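One way the feed probing could look; the path list mirrors the bullets above, and the `(title, link, date)` tuple shape is an assumption:

```python
import requests
import xml.etree.ElementTree as ET

FEED_PATHS = ["/feed", "/rss", "/atom.xml", "/blog/feed", "/feed.xml", "/rss.xml"]
ATOM = "{http://www.w3.org/2005/Atom}"

def discover_feed_posts(domain):
    """Probe common feed paths; return (title, link, date) tuples from the first feed found."""
    for path in FEED_PATHS:
        try:
            resp = requests.get(f"https://{domain}{path}", timeout=10)
            resp.raise_for_status()
            root = ET.fromstring(resp.content)
        except (requests.RequestException, ET.ParseError):
            continue  # unreachable, or not XML (e.g. an HTML 404 page)
        posts = [(i.findtext("title"), i.findtext("link"), i.findtext("pubDate"))
                 for i in root.iter("item")]                     # RSS 2.0 items
        for entry in root.iter(f"{ATOM}entry"):                  # Atom entries
            link = entry.find(f"{ATOM}link")
            posts.append((entry.findtext(f"{ATOM}title"),
                          link.get("href") if link is not None else None,
                          entry.findtext(f"{ATOM}updated")))
        if posts:
            return posts
    return []
```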
C) Blog Index Crawl
- Fetch `/blog`, `/resources`, `/insights`, `/news`, `/articles`
- Extract links from the page
- Follow pagination if present (`/blog/page/2`, `?page=2`, etc.; see the sketch below)
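A crude sketch of the index crawl with WordPress-style `/page/N` pagination; real link extraction would want an HTML parser rather than a regex, and the `?page=2` variant would need its own branch:

```python
import re
import requests
from urllib.parse import urljoin

HREF_RE = re.compile(r'href="([^"#]+)"')

def crawl_blog_index(domain, index_path="/blog", max_pages=20):
    """Walk /blog, /blog/page/2, ... and collect same-site links."""
    found = set()
    for page in range(1, max_pages + 1):
        path = index_path if page == 1 else f"{index_path}/page/{page}"
        resp = requests.get(f"https://{domain}{path}", timeout=10)
        if resp.status_code != 200:
            break                        # pagination exhausted
        before = len(found)
        for href in HREF_RE.findall(resp.text):
            url = urljoin(f"https://{domain}{path}", href)
            if f"//{domain}" in url:     # naive same-host check
                found.add(url.split("?")[0])
        if len(found) == before:
            break                        # page added nothing new; stop early
    return sorted(found)
```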
D) `site:` Search (fallback)
- WebSearch: `site:[domain]` to estimate total indexed pages
- WebSearch: `site:[domain]/blog` to find blog content
- WebSearch: `site:[domain] intitle:` to discover page title patterns
E) Apify Sitemap Extractor (fallback for JS-heavy sites)
- Actor: `onescales/sitemap-url-extractor`
- Use when sitemap.xml is missing and the site is JS-rendered (API sketch below)
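For the fallback, Apify's v2 REST API can run an actor synchronously and return its dataset; the input payload shape below is an assumption, so check the actor's input schema before relying on it:

```python
import os
import requests

def apify_sitemap_fallback(domain):
    """Run the actor synchronously and return its dataset items."""
    # Actor IDs use "~" between user and actor name in Apify's REST API.
    endpoint = ("https://api.apify.com/v2/acts/onescales~sitemap-url-extractor"
                "/run-sync-get-dataset-items")
    resp = requests.post(
        endpoint,
        params={"token": os.environ["APIFY_API_TOKEN"]},
        json={"url": f"https://{domain}"},   # input field name is an assumption
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()   # list of dataset items, one per discovered URL
```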
### Phase 2: Classify Each Page

For each discovered URL, classify by:

#### Content Type

Classify based on URL patterns and page titles (a first-match classifier sketch follows the table):
| Type | URL Patterns | Examples |
|---|---|---|
| blog-post | `/blog/`, `/articles/`, `/posts/` | How-to guides, opinion pieces |
| case-study | `/case-studies/`, `/customers/`, `/success-stories/` | Customer stories |
| comparison | `/vs/`, `/compare/`, `-vs-` | X vs Y pages |
| landing-page | `/product/`, `/features/`, `/solutions/` | Product marketing pages |
| docs | `/docs/`, `/documentation/`, `/help/`, `/guides/` | Technical documentation |
| changelog | `/changelog`, `/releases`, `/updates` | Product updates |
| pricing | `/pricing` | Pricing page |
| about | `/about`, `/team`, `/careers` | Company pages |
| legal | `/privacy`, `/terms`, `/legal` | Legal/compliance |
| resource | `/ebooks/`, `/whitepapers/`, `/webinars/`, `/downloads/` | Gated/downloadable content |
| glossary | `/glossary/`, `/dictionary/`, `/what-is-` | SEO glossary pages |
| integration | `/integrations/`, `/apps/`, `/connect/` | Integration pages |
| other | — | Anything else |
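A first-match classifier over the patterns above; both the pattern lists and their ordering are illustrative, not exhaustive:

```python
from urllib.parse import urlparse

# Ordered (type, patterns) pairs -- first match wins, so more specific
# sections (case studies, docs) are checked before broad ones (blog).
TYPE_PATTERNS = [
    ("case-study",   ["/case-studies/", "/customers/", "/success-stories/"]),
    ("comparison",   ["/vs/", "/compare/", "-vs-"]),
    ("docs",         ["/docs/", "/documentation/", "/help/", "/guides/"]),
    ("changelog",    ["/changelog", "/releases", "/updates"]),
    ("pricing",      ["/pricing"]),
    ("glossary",     ["/glossary/", "/dictionary/", "/what-is-"]),
    ("integration",  ["/integrations/", "/apps/", "/connect/"]),
    ("resource",     ["/ebooks/", "/whitepapers/", "/webinars/", "/downloads/"]),
    ("legal",        ["/privacy", "/terms", "/legal"]),
    ("about",        ["/about", "/team", "/careers"]),
    ("blog-post",    ["/blog/", "/articles/", "/posts/"]),
    ("landing-page", ["/product/", "/features/", "/solutions/"]),
]

def classify_url(url):
    """Map a URL to a content type by first-match substring rules."""
    path = urlparse(url).path.lower()
    for content_type, patterns in TYPE_PATTERNS:
        if any(p in path for p in patterns):
            return content_type
    return "other"
```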
#### Topic Cluster
Group by extracting topic signals from URL slugs and titles:
- Extract keywords from URL path segments
- Group similar keywords into clusters (e.g., "aws-cost", "cloud-spending", "finops" → "Cloud Cost Management")
- Use simple keyword co-occurrence for clustering (a one-pass sketch follows this list)
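One simple realization is a greedy single-pass grouping on shared slug keywords, sketched here; a real co-occurrence clustering could be considerably more sophisticated:

```python
import re
from urllib.parse import urlparse

STOPWORDS = {"the", "a", "an", "and", "for", "with", "how", "to", "of", "in", "your"}

def slug_keywords(url):
    """Keywords from the last path segment: /blog/reduce-aws-costs -> {reduce, aws, costs}."""
    slug = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return {w for w in re.split(r"[-_]", slug.lower()) if w and w not in STOPWORDS}

def cluster_by_keyword(urls):
    """Greedy co-occurrence clustering: a page sharing any keyword with an
    existing cluster joins it; otherwise it seeds a new cluster."""
    clusters = []  # list of (keyword_set, member_urls)
    for url in urls:
        kws = slug_keywords(url)
        for cluster_kws, members in clusters:
            if kws & cluster_kws:        # shared keyword -> same cluster
                cluster_kws |= kws       # grow the cluster's vocabulary
                members.append(url)
                break
        else:
            clusters.append((set(kws), [url]))
    return clusters
```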
### Phase 3: Analyze Publishing Patterns

From the dated content (primarily blog posts), compute the following (sketched after this list):
- Total content pieces by type
- Publishing frequency: Posts per month over last 12 months
- Trend: Increasing, decreasing, or stable output
- Recency: Date of most recent publish
- Author diversity: Unique authors (if extractable from RSS)
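The cadence numbers could be derived like this, assuming ISO `YYYY-MM-DD` post dates; the ±20% trend thresholds are arbitrary choices for the sketch:

```python
from collections import Counter
from datetime import datetime

def publishing_cadence(dates):
    """Summarize cadence from a list of ISO 'YYYY-MM-DD' post dates."""
    parsed = sorted(datetime.strptime(d, "%Y-%m-%d").date() for d in dates)
    by_month = Counter(d.strftime("%Y-%m") for d in parsed)
    months = sorted(by_month)[-12:]                 # last 12 observed months
    counts = [by_month[m] for m in months]
    avg = sum(counts) / max(len(counts), 1)
    # Compare the two halves of the window to call a trend.
    half = len(counts) // 2
    first, second = sum(counts[:half]), sum(counts[half:])
    if half == 0:
        trend = "stable"                 # not enough months to call a trend
    elif second > first * 1.2:
        trend = "increasing"
    elif second < first * 0.8:
        trend = "decreasing"
    else:
        trend = "stable"
    return {
        "posts_per_month_avg": round(avg, 1),
        "trend": trend,
        "most_recent": parsed[-1].isoformat() if parsed else None,
    }
```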
### Phase 4: Deep Analysis (Optional)

If `--deep-analyze N` is specified, fetch the top N pages (prioritizing blog posts) and extract the signals below (a heuristic sketch follows the list):
- Word count (approximate)
- Target keyword (inferred from title + H1 + URL)
- Funnel stage: TOFU (awareness), MOFU (consideration), BOFU (decision)
- Content depth: Shallow (<500 words), Medium (500-1500), Deep (1500+)
- Has images/video: Boolean
- Has CTA: Boolean (detected by common CTA patterns)
- Internal links count
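The mechanical signals can be approximated from raw HTML as below; target keyword and funnel stage are better left to model inference, and the CTA phrase list and regex-based tag stripping are rough assumptions:

```python
import re

# Common CTA phrases; the list is illustrative, not exhaustive.
CTA_RE = re.compile(r"book a demo|get started|start (your )?free|sign up|request a demo",
                    re.IGNORECASE)

def deep_analyze(html, domain):
    """Heuristic page metrics from raw HTML; thresholds follow the list above."""
    # Drop scripts/styles, then strip remaining tags to approximate visible text.
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", html,
                  flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", text)
    words = len(text.split())
    # Count links that are relative or point back at the same domain.
    internal = re.findall(
        rf'href="(?:/|https?://(?:www\.)?{re.escape(domain)})', html, re.IGNORECASE)
    return {
        "word_count": words,
        "content_depth": ("shallow" if words < 500
                          else "medium" if words <= 1500 else "deep"),
        "has_images": "<img" in html.lower(),
        "has_video": bool(re.search(r"<video|youtube\.com/embed|player\.vimeo",
                                    html, re.IGNORECASE)),
        "has_cta": bool(CTA_RE.search(text)),
        "internal_links": len(internal),
    }
```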
### Phase 5: Output

#### JSON Output (default)
{ "domain": "example.com", "crawl_date": "2026-02-25", "total_pages": 347, "discovery_methods": ["sitemap.xml", "rss"], "pages": [ { "url": "https://example.com/blog/reduce-aws-costs", "title": "How to Reduce Your AWS Bill by 40%", "date": "2025-11-15", "type": "blog-post", "topic_cluster": "Cloud Cost Optimization", "deep_analysis": { "word_count": 2100, "target_keyword": "reduce aws costs", "funnel_stage": "TOFU", "content_depth": "deep", "has_images": true, "has_cta": true } } ], "summary": { "by_type": {"blog-post": 89, "landing-page": 23, "case-study": 12, ...}, "by_topic": {"Cloud Cost Optimization": 34, "FinOps": 18, ...}, "publishing_cadence": { "posts_per_month_avg": 4.2, "trend": "increasing", "most_recent": "2026-02-20" } } }
#### Markdown Summary (also generated)

```markdown
# Content Inventory: example.com
**Crawled:** 2026-02-25 | **Total pages:** 347

## Content by Type
| Type | Count | % |
|------|-------|---|
| Blog Posts | 89 | 25.6% |
| Landing Pages | 23 | 6.6% |
| ... |

## Content by Topic Cluster
| Topic | Posts | Most Recent |
|-------|-------|-------------|
| Cloud Cost Optimization | 34 | 2026-02-20 |
| ... |

## Publishing Cadence
- Average: 4.2 posts/month
- Trend: Increasing (3.1 → 5.4 over last 6 months)
- Most recent: 2026-02-20

## Full Catalog
| # | Date | Type | Topic | Title | URL |
|---|------|------|-------|-------|-----|
| 1 | 2026-02-20 | blog-post | Cloud Cost | How to Reduce... | https://... |
```
## Tips

- Sitemap.xml is the best source. Most well-maintained sites have one; a missing sitemap is itself a negative SEO signal.
- RSS only shows recent content. If you need the full catalog, the sitemap is essential; RSS is supplementary.
- Deep analysis is optional but valuable. Use it when feeding into `brand-voice-extractor` or when you need funnel-stage mapping.
- JS-rendered sites may need the Apify fallback. Signs: sitemap.xml returns HTML, or the blog page returns mostly JavaScript.
- Combine with `seo-domain-analyzer` to overlay traffic data on the content inventory — see which content actually performs.
## Dependencies

- Python 3.8+
- `requests` library (`pip install requests`)
- `APIFY_API_TOKEN` env var (only for Apify fallback mode)