# claude-skill-registry · data-extraction-patterns

## Install

**Source** · Clone the upstream repo:

```shell
git clone https://github.com/majiayu000/claude-skill-registry
```

**Claude Code** · Install into `~/.claude/skills/`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-extraction-patterns" ~/.claude/skills/majiayu000-claude-skill-registry-data-extraction-patterns && rm -rf "$T"
```

**Manifest**: `skills/data/data-extraction-patterns/SKILL.md`
# Data Extraction Patterns

## When to Use

- Setting up analytics data pipelines
- Combining data from multiple sources
- Handling API rate limits and errors
- Caching frequently accessed data
- Building data collection workflows
## API Reference

### Google Analytics 4 (GA4)

**MCP Server**: `mcp-server-google-analytics`

**Key Operations**:

```javascript
get_report({
  propertyId: "properties/123456789",
  dateRange: { startDate: "30daysAgo", endDate: "today" },
  dimensions: ["pagePath", "date"],
  metrics: ["screenPageViews", "averageSessionDuration", "bounceRate"]
})
```
**Useful Metrics**:

| Metric | Description | Use Case |
|---|---|---|
| screenPageViews | Total page views | Traffic volume |
| sessions | User sessions | Visitor count |
| averageSessionDuration | Avg time in session | Engagement |
| bounceRate | % of non-engaged sessions | Content quality |
| engagementRate | Engaged sessions % | True engagement |
| scrolledUsers | Users who scrolled | Content consumption |
**Useful Dimensions**:

| Dimension | Description |
|---|---|
| pagePath | URL path |
| date | Date (for trending) |
| sessionSource | Traffic source |
| deviceCategory | Desktop/mobile/tablet |
### Google Search Console (GSC)

**MCP Server**: `mcp-server-gsc`

**Key Operations**:

```javascript
search_analytics({
  siteUrl: "https://example.com",
  startDate: "2025-11-27",
  endDate: "2025-12-27",
  dimensions: ["query", "page"],
  rowLimit: 1000
})

get_url_inspection({
  siteUrl: "https://example.com",
  inspectionUrl: "https://example.com/page"
})
```
**Available Metrics**:

| Metric | Description | Use Case |
|---|---|---|
| clicks | Total clicks from search | Traffic from Google |
| impressions | Times shown in results | Visibility |
| ctr | Click-through rate | Snippet effectiveness |
| position | Average ranking | SEO success |
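CTR is just clicks divided by impressions, so exported numbers can be sanity-checked directly. A small sketch (the clicks/impressions figures are illustrative, matching the example data used later in this document):

```shell
# Recompute CTR from raw clicks and impressions to sanity-check exported data.
# 428 clicks / 15,200 impressions are illustrative example figures.
clicks=428
impressions=15200
ctr=$(awk -v c="$clicks" -v i="$impressions" 'BEGIN { printf "%.2f", (c / i) * 100 }')
echo "CTR: ${ctr}%"   # → CTR: 2.82%
```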
**Dimensions**:

| Dimension | Description |
|---|---|
| query | Search query |
| page | Landing page URL |
| country | User country |
| device | Desktop/mobile/tablet |
| date | Date (for trending) |
### SE Ranking (Official MCP Server)

**MCP Server**: `seo-data-api-mcp` (official SE Ranking MCP)

**Repository**: https://github.com/seranking/seo-data-api-mcp-server

**Installation (via claudeup TUI, recommended)**:

```shell
npx claudeup
# Navigate to: MCP Server Setup → SEO & Analytics → se-ranking
```

**Manual Installation**:

```shell
git clone https://github.com/seranking/seo-data-api-mcp-server.git
cd seo-data-api-mcp-server
docker compose build
```

**Environment Variable**: `SERANKING_API_TOKEN`
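One way to supply the token is a `.env` file next to the compose file, which `docker compose` reads by default. A sketch, with a placeholder value:

```shell
# Write the API token into a .env file that docker compose picks up
# automatically. The value shown is a placeholder, not a real token.
cat > .env <<'EOF'
SERANKING_API_TOKEN=your-api-token-here
EOF

# Confirm the variable is present before building/starting the server
grep -q '^SERANKING_API_TOKEN=' .env && echo "token configured"
```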
**Available MCP Tools**:

| Description | Use Case |
|---|---|
| Domain performance metrics | Overall domain health |
| Keyword rankings for domain | Track ranking positions |
| Identify competitors | Competitive analysis |
| Compare keywords across domains | Gap analysis |
| Retrieve backlink data | Link profile audit |
| Related keyword discovery | Content expansion |
| Similar keyword suggestions | Keyword clustering |
**Example MCP Calls**:

```
MCP: seo-data-api-mcp.domainOverview({ domain: "example.com" })
MCP: seo-data-api-mcp.domainKeywords({ domain: "example.com", limit: 100 })
MCP: seo-data-api-mcp.backlinksAll({ domain: "example.com" })
```
## Parallel Execution Pattern

### Optimal Data Fetch (All Sources)

When fetching from multiple sources, issue all requests in a SINGLE message so they execute in parallel:

- [MCP Call 1]: `google-analytics.get_report(...)`
- [MCP Call 2]: `google-search-console.search_analytics(...)`
- [WebFetch 3]: SE Ranking API endpoint

All requests execute simultaneously and results return when all complete, roughly 3x faster than fetching sequentially.
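Outside of MCP, the same idea can be sketched in plain shell with background jobs, where total wall time is roughly that of the slowest fetch rather than the sum. The `fetch_*` functions below are illustrative stand-ins, not real commands:

```shell
# Run independent fetches concurrently; wall time ≈ the slowest single fetch.
# fetch_ga4 / fetch_gsc / fetch_seranking are illustrative stand-ins.
fetch_ga4()       { sleep 1; echo "ga4 done"; }
fetch_gsc()       { sleep 1; echo "gsc done"; }
fetch_seranking() { sleep 1; echo "seranking done"; }

start=$(date +%s)
fetch_ga4 > ga4.out &
fetch_gsc > gsc.out &
fetch_seranking > seranking.out &
wait   # block until all three background jobs finish
end=$(date +%s)

echo "elapsed: $((end - start))s"   # ~1s in parallel, not ~3s sequentially
```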
### Sequential (When Needed)

Some operations require sequential execution, namely when one request depends on another's result:

1. **Message 1**: Get the list of pages (returns e.g. `["/page1", "/page2", "/page3"]`)
2. **Message 2**: Get details for each page, using the list from Message 1. Requests within this message can still be parallelized.
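The same dependency chain in shell form. `list_pages` and `fetch_page_details` are hypothetical stand-ins, not real MCP tools:

```shell
# Step 2 consumes step 1's output, so the two steps cannot run in parallel
# with each other. list_pages / fetch_page_details are stand-ins.
list_pages() { printf '%s\n' "/page1" "/page2" "/page3"; }
fetch_page_details() { echo "details for $1"; }

# Step 1: get the page list (must complete first)
pages=$(list_pages)

# Step 2: fetch details for each page returned by step 1
results=""
while IFS= read -r page; do
  results="${results}$(fetch_page_details "$page")
"
done <<< "$pages"

printf '%s' "$results"
```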
## Rate Limiting

### API Rate Limits

| API | Limit | Strategy |
|---|---|---|
| GA4 | 10 QPS per property | Batch dimensions |
| GSC | 1,200 requests/min | Paginate large exports |
| SE Ranking | 100 requests/min | Queue long operations |
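A simple client-side guard is to space requests by a minimum interval derived from the limit (e.g. 100 requests/min allows one request every 0.6 s; the sketch below uses whole seconds for portability):

```shell
# Minimal client-side throttle: ensure at least MIN_INTERVAL seconds between
# consecutive requests. Derive MIN_INTERVAL from the API's limit; 1s keeps
# you well under 100 requests/min.
MIN_INTERVAL=1
last_request=0

throttle() {
  local now elapsed
  now=$(date +%s)
  elapsed=$((now - last_request))
  if [ "$elapsed" -lt "$MIN_INTERVAL" ]; then
    sleep $((MIN_INTERVAL - elapsed))
  fi
  last_request=$(date +%s)
}

start=$(date +%s)
throttle   # first call: no wait (last_request is far in the past)
throttle   # second call: sleeps until the interval has elapsed
end=$(date +%s)
echo "two throttled calls took $((end - start))s"
```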
### Retry Pattern

```shell
#!/bin/bash
# Retry with exponential backoff

MAX_RETRIES=3
RETRY_DELAY=5

fetch_with_retry() {
  local url="$1"
  local attempt=1

  while [ $attempt -le $MAX_RETRIES ]; do
    # -o captures the body; -w appends the HTTP status code to stdout
    response=$(curl -s -w "%{http_code}" -o /tmp/response.json "$url")
    http_code="${response: -3}"

    if [ "$http_code" = "200" ]; then
      cat /tmp/response.json
      return 0
    elif [ "$http_code" = "429" ]; then
      echo "Rate limited, waiting ${RETRY_DELAY}s..." >&2
      sleep $RETRY_DELAY
      RETRY_DELAY=$((RETRY_DELAY * 2))
    else
      echo "Error: HTTP $http_code" >&2
      return 1
    fi

    attempt=$((attempt + 1))
  done

  echo "Max retries exceeded" >&2
  return 1
}
```
## Caching Pattern

### Session-Based Cache

```shell
# Cache structure
SESSION_PATH="/tmp/seo-performance-20251227-143000-example"
CACHE_DIR="${SESSION_PATH}/cache"
CACHE_TTL=3600  # 1 hour in seconds

mkdir -p "$CACHE_DIR"

# Cache key generation
cache_key() {
  echo "$1" | md5sum | cut -d' ' -f1
}

# Check cache: print cached value and return 0 if fresh, else return 1
get_cached() {
  local key=$(cache_key "$1")
  local cache_file="${CACHE_DIR}/${key}.json"
  if [ -f "$cache_file" ]; then
    # BSD stat first, GNU stat fallback
    local age=$(($(date +%s) - $(stat -f%m "$cache_file" 2>/dev/null || stat -c%Y "$cache_file")))
    if [ $age -lt $CACHE_TTL ]; then
      cat "$cache_file"
      return 0
    fi
  fi
  return 1
}

# Save stdin to cache under the given key
save_cache() {
  local key=$(cache_key "$1")
  local cache_file="${CACHE_DIR}/${key}.json"
  cat > "$cache_file"
}

# Usage
CACHE_KEY="ga4_${URL}_${DATE_RANGE}"
if ! RESULT=$(get_cached "$CACHE_KEY"); then
  RESULT=$(fetch_from_api)
  echo "$RESULT" | save_cache "$CACHE_KEY"
fi
```
## Date Range Standardization

### Common Date Ranges

```shell
# Standard date range calculations (BSD date syntax first, GNU date fallback)
TODAY=$(date +%Y-%m-%d)

case "$RANGE" in
  "7d")  START_DATE=$(date -v-7d +%Y-%m-%d 2>/dev/null || date -d "7 days ago" +%Y-%m-%d) ;;
  "30d") START_DATE=$(date -v-30d +%Y-%m-%d 2>/dev/null || date -d "30 days ago" +%Y-%m-%d) ;;
  "90d") START_DATE=$(date -v-90d +%Y-%m-%d 2>/dev/null || date -d "90 days ago" +%Y-%m-%d) ;;
  "mtd") START_DATE=$(date +%Y-%m-01) ;;
  "ytd") START_DATE=$(date +%Y-01-01) ;;
esac

END_DATE="$TODAY"
```
### API-Specific Formats

| API | Format | Example |
|---|---|---|
| GA4 | Relative or ISO | "30daysAgo", "2025-12-01" |
| GSC | ISO 8601 | "2025-12-01" |
| SE Ranking | ISO 8601 or Unix | "2025-12-01", 1764547200 |
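Where an API accepts either form, a small converter lets you keep one canonical ISO date internally. A sketch (BSD/macOS `date` syntax first, GNU fallback; midnight UTC assumed):

```shell
# Convert an ISO 8601 date to a Unix timestamp at midnight UTC.
# Tries BSD/macOS date syntax first, then falls back to GNU date.
iso_to_unix() {
  date -j -u -f "%Y-%m-%d %H:%M:%S" "$1 00:00:00" +%s 2>/dev/null \
    || date -u -d "$1" +%s
}

iso_to_unix "2025-12-01"   # → 1764547200
```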
## Graceful Degradation

### Data Source Fallback

When a data source is unavailable, fall back in priority order:

| Primary Source | Fallback | Last Resort |
|---|---|---|
| GA4 traffic data | GSC clicks | Estimate from GSC |
| GSC search perf | SE Ranking queries | WebSearch SERP |
| SE Ranking ranks | GSC avg position | Manual SERP check |
| CWV (CrUX) | PageSpeed API | Lighthouse CLI |
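A fallback chain like this is a natural fit for shell's `||` operator: try sources in priority order and take the first that answers. The functions below are illustrative stand-ins; here GA4 "fails", so the GSC fallback supplies the value:

```shell
# Sketch of a fallback chain: the first source that succeeds wins.
# get_ga4_traffic / get_gsc_clicks / estimate_from_gsc are stand-ins.
get_ga4_traffic()   { return 1; }      # simulate GA4 unavailable
get_gsc_clicks()    { echo "428"; }    # fallback answers
estimate_from_gsc() { echo "~400"; }   # last resort

traffic=$(get_ga4_traffic) || traffic=$(get_gsc_clicks) || traffic=$(estimate_from_gsc)
echo "traffic: $traffic"   # → traffic: 428
```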
### Partial Data Output

```markdown
## Analysis Report (Partial Data)

### Data Availability

| Source | Status | Impact |
|--------|--------|--------|
| GA4 | NOT CONFIGURED | Missing engagement metrics |
| GSC | AVAILABLE | Full search data |
| SE Ranking | ERROR (rate limit) | Using cached rankings |

### Analysis Notes

This analysis is based on limited data sources:

- Search performance metrics are complete (GSC)
- Engagement metrics unavailable (no GA4)
- Ranking data may be 24h stale (cached)

**Recommendation**: Configure GA4 for complete analysis. Run `/setup-analytics` to add Google Analytics.
```
## Unified Data Model

### Combined Output Structure

```json
{
  "metadata": {
    "url": "https://example.com/page",
    "fetchedAt": "2025-12-27T14:30:00Z",
    "dateRange": { "start": "2025-11-27", "end": "2025-12-27" }
  },
  "sources": {
    "ga4": {
      "available": true,
      "metrics": {
        "pageViews": 2450,
        "avgTimeOnPage": 222,
        "bounceRate": 38.2,
        "engagementRate": 64.5
      }
    },
    "gsc": {
      "available": true,
      "metrics": {
        "impressions": 15200,
        "clicks": 428,
        "ctr": 2.82,
        "avgPosition": 4.2
      },
      "topQueries": [
        { "query": "seo guide", "clicks": 156, "position": 4 }
      ]
    },
    "seRanking": {
      "available": true,
      "rankings": [
        { "keyword": "seo guide", "position": 4, "volume": 12100 }
      ],
      "visibility": 42
    }
  },
  "computed": {
    "healthScore": 72,
    "status": "GOOD"
  }
}
```
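Downstream steps should check each source's `available` flag before reading its metrics. A small sketch using `jq` (assumed installed), against a trimmed-down sample of the structure:

```shell
# List which sources in the unified model are actually available,
# so downstream analysis can skip the missing ones. Assumes jq is installed.
cat > unified.json <<'EOF'
{
  "sources": {
    "ga4": { "available": true },
    "gsc": { "available": true },
    "seRanking": { "available": false }
  }
}
EOF

available_sources() {
  jq -r '.sources | to_entries[] | select(.value.available) | .key' "$1"
}

available_sources unified.json
```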
## Error Handling

### Common Errors

| Error | Cause | Resolution |
|---|---|---|
| 401 Unauthorized | Invalid/expired credentials | Re-run /setup-analytics |
| 403 Forbidden | Missing permissions | Check API access in console |
| 429 Too Many Requests | Rate limit | Wait and retry with backoff |
| 404 Not Found | Invalid property/site | Verify IDs in configuration |
| 500 Server Error | API issue | Retry later, check status page |
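The table above can be encoded directly as a dispatch on the status code. The strategy labels below are illustrative, not a fixed API:

```shell
# Map an HTTP status code to a handling strategy, mirroring the table above.
# The strategy names are illustrative labels.
handle_http_error() {
  case "$1" in
    401) echo "reauthenticate" ;;      # re-run /setup-analytics
    403) echo "check-permissions" ;;   # verify API access in console
    429) echo "retry-with-backoff" ;;  # rate limited
    404) echo "verify-config" ;;       # bad property/site IDs
    5??) echo "retry-later" ;;         # transient server-side issue
    *)   echo "unknown" ;;
  esac
}

handle_http_error 429   # → retry-with-backoff
handle_http_error 503   # → retry-later
```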
### Error Output Pattern

```markdown
## Data Fetch Error

**Source**: Google Analytics 4
**Error**: 403 Forbidden
**Message**: "User does not have sufficient permissions for this property"

### Troubleshooting Steps

1. Verify Service Account email in GA4 Admin
2. Ensure "Viewer" role is granted
3. Check Analytics Data API is enabled
4. Wait 5 minutes for permission propagation

### Workaround

Proceeding with available data sources (GSC, SE Ranking). GA4 engagement metrics will not be included in this analysis.
```