Goose-skills site-content-catalog

install
source · Clone the upstream repo
git clone https://github.com/gooseworks-ai/goose-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/gooseworks-ai/goose-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/capabilities/site-content-catalog" ~/.claude/skills/gooseworks-ai-goose-skills-site-content-catalog && rm -rf "$T"
manifest: skills/capabilities/site-content-catalog/SKILL.md
source content

Site Content Catalog

Crawl a website's sitemap and blog to build a complete content inventory, with every page cataloged by URL, title, date, content type, and topic cluster. The skill groups content by category, identifies publishing patterns, and optionally deep-analyzes top pages.

Quick Start

# Basic content inventory
python3 scripts/catalog_content.py --domain "example.com"

# With deep analysis of top 20 pages
python3 scripts/catalog_content.py --domain "example.com" --deep-analyze 20

# Output to specific file
python3 scripts/catalog_content.py --domain "example.com" --output content-inventory.json

Inputs

| Parameter | Required | Default | Description |
|-----------|----------|---------|-------------|
| domain | Yes | | Domain to catalog (e.g., "example.com") |
| deep-analyze | No | 0 | Number of top pages to deep-read for content analysis |
| output | No | stdout | Path to save JSON output |
| include-non-blog | No | true | Also catalog landing pages, docs, etc. (not just blog) |

Cost

  • Sitemap/RSS crawling: Free (direct HTTP requests)
  • Apify sitemap extractor (fallback): ~$0.50 per site
  • Deep analysis: Free (WebFetch on individual pages)

Process

Phase 1: Discover All Pages

The script attempts multiple methods to find all pages on a site, in order:

A) Sitemap.xml

  1. Fetch https://[domain]/sitemap.xml
  2. If it's a sitemap index, recursively fetch all child sitemaps
  3. Common alternate locations: /sitemap_index.xml, /sitemap-index.xml, /wp-sitemap.xml
  4. Check robots.txt for Sitemap: directives
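
The sitemap steps above can be sketched with the standard library alone. This is a minimal illustration, not the code in scripts/catalog_content.py: the function names parse_sitemap, sitemaps_from_robots, and collect_urls are hypothetical, and fetch is a caller-supplied function that returns the body of a URL as text.

```python
import xml.etree.ElementTree as ET


def _local(tag):
    """Strip the XML namespace: '{ns}loc' becomes 'loc'."""
    return tag.rsplit("}", 1)[-1]


def parse_sitemap(xml_text):
    """Return (kind, locs); kind is 'index' for a sitemap index, else 'urlset'."""
    root = ET.fromstring(xml_text)
    kind = "index" if _local(root.tag) == "sitemapindex" else "urlset"
    locs = []
    for entry in root:            # <sitemap> or <url> elements
        for child in entry:       # only direct children, so nested image locs are skipped
            if _local(child.tag) == "loc" and child.text:
                locs.append(child.text.strip())
    return kind, locs


def sitemaps_from_robots(robots_txt):
    """Collect the URLs named in 'Sitemap:' directives (case-insensitive)."""
    found = []
    for line in robots_txt.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            found.append(value.strip())
    return found


def collect_urls(start_url, fetch, seen=None):
    """Walk a sitemap tree, following child sitemaps of an index recursively."""
    seen = set() if seen is None else seen
    if start_url in seen:         # guard against sitemaps that reference each other
        return []
    seen.add(start_url)
    kind, locs = parse_sitemap(fetch(start_url))
    if kind == "urlset":
        return locs
    urls = []
    for child in locs:
        urls.extend(collect_urls(child, fetch, seen))
    return urls
```

With the requests library from the Dependencies section, fetch could be something like lambda u: requests.get(u, timeout=10).text. Error handling for missing or malformed sitemaps is left out of the sketch.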

B) RSS/Atom Feeds

  1. Check /feed, /rss, /atom.xml, /blog/feed, etc.
  2. Extract posts with titles, dates, and URLs
  3. RSS typically only surfaces recent content (last 10-50 posts)
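
As a rough sketch of step 2, the function below pulls title, URL, and date out of an RSS 2.0 or Atom document using only the standard library. The name parse_feed and the output keys are assumptions chosen to match the JSON fields shown later in this document; the real script may extract more (authors, for example).

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

ATOM = "{http://www.w3.org/2005/Atom}"


def parse_feed(xml_text):
    """Return a list of {'title', 'url', 'date'} dicts from RSS 2.0 or Atom text."""
    root = ET.fromstring(xml_text)
    posts = []

    # RSS 2.0: <item> elements with <title>, <link>, and an RFC 822 <pubDate>.
    for item in root.iter("item"):
        raw = item.findtext("pubDate")
        posts.append({
            "title": (item.findtext("title") or "").strip(),
            "url": (item.findtext("link") or "").strip(),
            "date": parsedate_to_datetime(raw).date().isoformat() if raw else None,
        })

    # Atom: <entry> elements with <title>, <link href="...">, and an ISO 8601 <updated>.
    for entry in root.iter(ATOM + "entry"):
        link = entry.find(ATOM + "link")
        updated = entry.findtext(ATOM + "updated") or ""
        posts.append({
            "title": (entry.findtext(ATOM + "title") or "").strip(),
            "url": link.get("href", "") if link is not None else "",
            "date": updated[:10] or None,   # keep only the YYYY-MM-DD part
        })
    return posts
```

Malformed dates raise an exception here; a production crawler would likely catch that and fall back to a missing date.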

C) Blog Index Crawl

  1. Fetch /blog, /resources, /insights, /news, /articles
  2. Extract links from the page
  3. Follow pagination if present (/blog/page/2, ?page=2, etc.)
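
One way to approach steps 2 and 3 is to collect every same-host link on the index page and separate out the ones that look like pagination. The sketch below assumes the two pagination shapes named above (/page/N and ?page=N); LinkCollector, PAGINATION, and extract_links are illustrative names, not part of the skill's script.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Record the href of every <a> tag encountered."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


# Matches .../page/2 (optionally with a trailing slash) or ?page=2 / &page=2.
PAGINATION = re.compile(r"/page/\d+/?$|[?&]page=\d+")


def extract_links(html, base_url):
    """Split same-host links on an index page into (content_links, pagination_links)."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    content, pagination = [], []
    for href in parser.hrefs:
        url = urljoin(base_url, href)          # resolve relative links
        if urlparse(url).netloc != host:       # drop off-site links
            continue
        bucket = pagination if PAGINATION.search(url) else content
        if url not in bucket:                  # de-duplicate, keep first-seen order
            bucket.append(url)
    return content, pagination
```

The content list still includes navigation links such as /about or /pricing, so it would normally be filtered by path prefix or passed through the classification step before being treated as blog posts.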

D) Site: Search (fallback)

  1. WebSearch: site:[domain] to estimate total indexed pages
  2. WebSearch: site:[domain]/blog to find blog content
  3. WebSearch: site:[domain] intitle: to discover page title patterns

E) Apify Sitemap Extractor (fallback for JS-heavy sites)

  • Actor: onescales/sitemap-url-extractor
  • Use when sitemap.xml is missing and the site is JS-rendered

Phase 2: Classify Each Page

For each discovered URL, classify by:

Content Type

Classify based on URL patterns and page titles:

| Type | URL Patterns | Examples |
|------|--------------|----------|
| blog-post | /blog/, /posts/, /articles/ | How-to guides, opinion pieces |
| case-study | /case-study/, /customers/, /success-stories/ | Customer stories |
| comparison | /vs/, /compare/, /alternative/ | X vs Y pages |
| landing-page | /solutions/, /use-cases/, /for-/ | Product marketing pages |
| docs | /docs/, /help/, /documentation/, /api/ | Technical documentation |
| changelog | /changelog/, /releases/, /whats-new/ | Product updates |
| pricing | /pricing/ | Pricing page |
| about | /about/, /team/, /careers/ | Company pages |
| legal | /privacy/, /terms/, /security/ | Legal/compliance |
| resource | /resources/, /guides/, /ebooks/, /webinars/ | Gated/downloadable content |
| glossary | /glossary/, /dictionary/, /terms/ | SEO glossary pages |
| integration | /integrations/, /apps/, /marketplace/ | Integration pages |
| other | Anything else | |
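
A minimal sketch of URL-based classification, assuming the patterns are checked in the order the types are listed and the first match wins. That ordering is an assumption: /terms/ is listed under both legal and glossary, so here it resolves to legal. The /for-/ pattern is interpreted as "a path segment starting with for-". Title-based signals are not covered, and RULES and classify are hypothetical names.

```python
from urllib.parse import urlparse

# (type, URL path fragments), checked top to bottom; first match wins.
RULES = [
    ("blog-post", ["/blog/", "/posts/", "/articles/"]),
    ("case-study", ["/case-study/", "/customers/", "/success-stories/"]),
    ("comparison", ["/vs/", "/compare/", "/alternative/"]),
    ("landing-page", ["/solutions/", "/use-cases/", "/for-"]),
    ("docs", ["/docs/", "/help/", "/documentation/", "/api/"]),
    ("changelog", ["/changelog/", "/releases/", "/whats-new/"]),
    ("pricing", ["/pricing/"]),
    ("about", ["/about/", "/team/", "/careers/"]),
    ("legal", ["/privacy/", "/terms/", "/security/"]),
    ("resource", ["/resources/", "/guides/", "/ebooks/", "/webinars/"]),
    ("glossary", ["/glossary/", "/dictionary/"]),  # "/terms/" already claimed by legal
    ("integration", ["/integrations/", "/apps/", "/marketplace/"]),
]


def classify(url):
    """Return the content type for a URL, or 'other' if no pattern matches."""
    path = urlparse(url).path.lower()
    if not path.endswith("/"):
        path += "/"               # so "/pricing" matches the "/pricing/" pattern
    for content_type, fragments in RULES:
        if any(fragment in path for fragment in fragments):
            return content_type
    return "other"
```

Note that the index page itself (for example /blog) is classified the same way as the posts beneath it; a caller that cares about the distinction would need an extra check.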

Topic Cluster

Group by extracting topic signals from URL slugs and titles:

  • Extract keywords from URL path segments
  • Group similar keywords into clusters (e.g., "aws-cost", "cloud-spending", "finops" → "Cloud Cost Management")
  • Use simple keyword co-occurrence for clustering
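
The bullets above leave the clustering method open, so the following is only one possible reading of "simple keyword co-occurrence": keywords that appear in the same slug are linked, and each connected group of keywords becomes a cluster labeled by its most frequent keyword. The stopword list and the function names are assumptions. Producing a human-readable label like "Cloud Cost Management" from "aws-cost" and "finops" would need a synonym map or manual naming on top of this.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical stopword list: filler words plus section names like "blog".
STOPWORDS = {"a", "an", "and", "the", "to", "of", "for", "in", "on", "how",
             "your", "with", "by", "vs", "blog", "posts", "articles",
             "resources", "guides"}


def slug_keywords(url):
    """Extract lowercase keywords from the URL path segments."""
    words = re.findall(r"[a-z0-9]+", urlparse(url).path.lower())
    return [w for w in words if w not in STOPWORDS and not w.isdigit()]


def cluster_by_cooccurrence(urls):
    """Group URLs whose slugs are linked by shared keywords (union-find)."""
    parent = {}

    def find(word):
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]   # path compression
            word = parent[word]
        return word

    keywords = {url: slug_keywords(url) for url in urls}
    for words in keywords.values():
        for other in words[1:]:                   # link every keyword in a slug
            parent[find(other)] = find(words[0])

    groups = defaultdict(list)
    for url, words in keywords.items():
        groups[find(words[0]) if words else None].append(url)

    clusters = {}
    for members in groups.values():
        counts = Counter(w for url in members for w in keywords[url])
        label = counts.most_common(1)[0][0] if counts else "uncategorized"
        clusters[label] = members
    return clusters
```

On a large site, a single common word can chain unrelated slugs into one oversized cluster, so a real implementation would probably require a minimum co-occurrence count before linking two keywords.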

Phase 3: Analyze Publishing Patterns

From the dated content (primarily blog posts):

  • Total content pieces by type
  • Publishing frequency: Posts per month over last 12 months
  • Trend: Increasing, decreasing, or stable output
  • Recency: Date of most recent publish
  • Author diversity: Unique authors (if extractable from RSS)
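
The frequency, trend, and recency figures can be derived from the list of publish dates alone. The sketch below makes two assumptions that the source does not specify: the trend compares the older half of the 12-month window with the newer half, and a change of more than 20% in either direction counts as increasing or decreasing. The function name and the output keys (borrowed from the JSON example in this document) are illustrative.

```python
from collections import Counter
from datetime import date


def publishing_cadence(dates, today, window=12):
    """Summarize ISO 'YYYY-MM-DD' publish dates over the last `window` months."""
    # Build the list of 'YYYY-MM' keys ending with the current month.
    months = []
    year, month = today.year, today.month
    for _ in range(window):
        months.append(f"{year:04d}-{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    months.reverse()

    counts = Counter(d[:7] for d in dates)
    per_month = [counts.get(m, 0) for m in months]

    half = window // 2
    older = sum(per_month[:half]) / half
    recent = sum(per_month[half:]) / (window - half)
    if recent > older * 1.2:          # assumed threshold: +20%
        trend = "increasing"
    elif recent < older * 0.8:        # assumed threshold: -20%
        trend = "decreasing"
    else:
        trend = "stable"

    return {
        "posts_per_month_avg": round(sum(per_month) / window, 1),
        "trend": trend,
        "most_recent": max(dates) if dates else None,  # ISO strings sort by date
    }
```

Passing today explicitly, rather than calling date.today() inside the function, keeps the result reproducible for a given crawl date.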

Phase 4: Deep Analysis (Optional)

If --deep-analyze N is specified, fetch the top N pages (prioritizing blog posts) and extract:

  • Word count (approximate)
  • Target keyword (inferred from title + H1 + URL)
  • Funnel stage: TOFU (awareness), MOFU (consideration), BOFU (decision)
  • Content depth: Shallow (<500 words), Medium (500-1500), Deep (1500+)
  • Has images/video: Boolean
  • Has CTA: Boolean (detected by common CTA patterns)
  • Internal links count
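
Several of these fields can be approximated from raw HTML with the standard library. The sketch below covers word count, content depth, images/video, and internal link count; target keyword, funnel stage, and CTA detection are left out because the source does not describe how they are inferred. The depth ranges overlap at exactly 1500 words, so this sketch treats 1500 and above as deep, which is an assumption. PageStats, content_depth, and analyze are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageStats(HTMLParser):
    """Accumulate rough content metrics while parsing a page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.words = 0
        self.has_images = False
        self.internal_links = 0
        self._skip = 0                      # depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in ("img", "video"):
            self.has_images = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href and not href.startswith("#"):   # ignore same-page anchors
                if urlparse(urljoin(self.page_url, href)).netloc == self.host:
                    self.internal_links += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:                  # count only visible text
            self.words += len(data.split())


def content_depth(word_count):
    """Shallow (<500), medium (500-1499), deep (1500+); boundary is an assumption."""
    if word_count < 500:
        return "shallow"
    if word_count < 1500:
        return "medium"
    return "deep"


def analyze(html, page_url):
    stats = PageStats(page_url)
    stats.feed(html)
    return {
        "word_count": stats.words,
        "content_depth": content_depth(stats.words),
        "has_images": stats.has_images,
        "internal_links": stats.internal_links,
    }
```

The word count includes navigation and footer text, which is why the source describes it as approximate.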

Phase 5: Output

JSON Output (default)

{
  "domain": "example.com",
  "crawl_date": "2026-02-25",
  "total_pages": 347,
  "discovery_methods": ["sitemap.xml", "rss"],
  "pages": [
    {
      "url": "https://example.com/blog/reduce-aws-costs",
      "title": "How to Reduce Your AWS Bill by 40%",
      "date": "2025-11-15",
      "type": "blog-post",
      "topic_cluster": "Cloud Cost Optimization",
      "deep_analysis": {
        "word_count": 2100,
        "target_keyword": "reduce aws costs",
        "funnel_stage": "TOFU",
        "content_depth": "deep",
        "has_images": true,
        "has_cta": true
      }
    }
  ],
  "summary": {
    "by_type": {"blog-post": 89, "landing-page": 23, "case-study": 12, ...},
    "by_topic": {"Cloud Cost Optimization": 34, "FinOps": 18, ...},
    "publishing_cadence": {
      "posts_per_month_avg": 4.2,
      "trend": "increasing",
      "most_recent": "2026-02-20"
    }
  }
}
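
The by_type and by_topic counts in the summary block follow directly from the pages array. A minimal sketch, assuming each page dict carries the type and topic_cluster keys shown above (summarize is a hypothetical helper name):

```python
from collections import Counter


def summarize(pages):
    """Count pages per content type and per topic cluster, largest first."""
    by_type = Counter(page["type"] for page in pages)
    by_topic = Counter(
        page["topic_cluster"] for page in pages if page.get("topic_cluster")
    )
    return {
        "by_type": dict(by_type.most_common()),
        "by_topic": dict(by_topic.most_common()),
    }
```

Pages without a topic cluster are simply left out of by_topic rather than counted under a placeholder.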

Markdown Summary (also generated)

# Content Inventory: example.com
**Crawled:** 2026-02-25 | **Total pages:** 347

## Content by Type
| Type | Count | % |
|------|-------|---|
| Blog Posts | 89 | 25.6% |
| Landing Pages | 23 | 6.6% |
| ...

## Content by Topic Cluster
| Topic | Posts | Most Recent |
|-------|-------|-------------|
| Cloud Cost Optimization | 34 | 2026-02-20 |
| ...

## Publishing Cadence
- Average: 4.2 posts/month
- Trend: Increasing (3.1 → 5.4 over last 6 months)
- Most recent: 2026-02-20

## Full Catalog
| # | Date | Type | Topic | Title | URL |
|---|------|------|-------|-------|-----|
| 1 | 2026-02-20 | blog-post | Cloud Cost | How to Reduce... | https://... |
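
As an illustration of how the "Content by Type" table relates to the JSON summary, the sketch below renders it from by_type counts, with each percentage taken against total_pages. The LABELS map and the type_table name are hypothetical; only the two display labels that appear in the sample are included, and other types fall back to their raw identifier.

```python
# Hypothetical display labels; unknown types are shown as-is.
LABELS = {"blog-post": "Blog Posts", "landing-page": "Landing Pages"}


def type_table(by_type, total_pages):
    """Render a markdown table of content types, largest count first."""
    lines = ["| Type | Count | % |", "|------|-------|---|"]
    for content_type, count in sorted(by_type.items(), key=lambda kv: -kv[1]):
        label = LABELS.get(content_type, content_type)
        percent = 100 * count / total_pages
        lines.append(f"| {label} | {count} | {percent:.1f}% |")
    return "\n".join(lines)
```

With the sample figures (89 blog posts and 23 landing pages out of 347 pages) this reproduces the 25.6% and 6.6% rows shown above.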

Tips

  • Sitemap.xml is the best source. Most well-maintained sites have one; a missing sitemap is itself a negative SEO signal.
  • RSS only shows recent content. If you need the full catalog, sitemap is essential. RSS is supplementary.
  • Deep analysis is optional but valuable. Use it when feeding into brand-voice-extractor or when you need funnel stage mapping.
  • JS-rendered sites may need the Apify fallback. Signs: sitemap.xml returns HTML, or blog page returns mostly JavaScript.
  • Combine with seo-domain-analyzer to overlay traffic data on the content inventory and see which content actually performs.
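
The two signs of a JS-rendered site mentioned in the tips can be checked with simple heuristics. This is a sketch under stated assumptions, not the skill's actual detection logic: the function names are hypothetical, and the 50-word threshold for "mostly JavaScript" is an arbitrary starting point.

```python
import re


def sitemap_is_html(body, content_type=""):
    """True if a /sitemap.xml response looks like an HTML page instead of XML."""
    head = body.lstrip()[:200].lower()
    return (
        "text/html" in content_type.lower()
        or head.startswith("<!doctype html")
        or head.startswith("<html")
    )


def looks_js_rendered(html, min_words=50):
    """True if the page has scripts but almost no visible text in the raw HTML."""
    has_script = re.search(r"<script\b", html, flags=re.I) is not None
    # Drop script/style bodies, then all remaining tags, and count what is left.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"<[^>]+>", " ", text)
    return has_script and len(text.split()) < min_words
```

A positive result from either check would be a reasonable trigger for trying the Apify fallback described in Phase 1.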

Dependencies

  • Python 3.8+
  • requests library (pip install requests)
  • APIFY_API_TOKEN env var (only for Apify fallback mode)