# claude-skill-registry · data-extraction-patterns

## Install

**Source** · Clone the upstream repo:

```shell
git clone https://github.com/majiayu000/claude-skill-registry
```

**Claude Code** · Install into `~/.claude/skills/`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/data-extraction-patterns" ~/.claude/skills/majiayu000-claude-skill-registry-data-extraction-patterns && rm -rf "$T"
```

**Manifest**: `skills/data/data-extraction-patterns/SKILL.md`
# Data Extraction Patterns

## When to Use

- Setting up analytics data pipelines
- Combining data from multiple sources
- Handling API rate limits and errors
- Caching frequently accessed data
- Building data collection workflows
## API Reference

### Google Analytics 4 (GA4)

**MCP Server**: `mcp-server-google-analytics`

**Key Operations**:

```javascript
get_report({
  propertyId: "properties/123456789",
  dateRange: { startDate: "30daysAgo", endDate: "today" },
  dimensions: ["pagePath", "date"],
  metrics: ["screenPageViews", "averageSessionDuration", "bounceRate"]
})
```
**Useful Metrics**:

| Metric | Description | Use Case |
|---|---|---|
| screenPageViews | Total page views | Traffic volume |
| sessions | User sessions | Visitor count |
| averageSessionDuration | Avg time in session | Engagement |
| bounceRate | % of non-engaged sessions | Content quality |
| engagementRate | Engaged sessions % | True engagement |
| scrolledUsers | Users who scrolled | Content consumption |
**Useful Dimensions**:

| Dimension | Description |
|---|---|
| pagePath | URL path |
| date | Date (for trending) |
| sessionSource | Traffic source |
| deviceCategory | Desktop/mobile/tablet |
### Google Search Console (GSC)

**MCP Server**: `mcp-server-gsc`

**Key Operations**:

```javascript
search_analytics({
  siteUrl: "https://example.com",
  startDate: "2025-11-27",
  endDate: "2025-12-27",
  dimensions: ["query", "page"],
  rowLimit: 1000
})

get_url_inspection({
  siteUrl: "https://example.com",
  inspectionUrl: "https://example.com/page"
})
```
**Available Metrics**:

| Metric | Description | Use Case |
|---|---|---|
| clicks | Total clicks from search | Traffic from Google |
| impressions | Times shown in results | Visibility |
| ctr | Click-through rate | Snippet effectiveness |
| position | Average ranking | SEO success |
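CTR is just clicks divided by impressions, so exported numbers can be sanity-checked directly. A small sketch (the clicks/impressions figures are illustrative, matching the example data used later in this document):

```shell
# Recompute CTR from raw clicks and impressions to sanity-check exported data.
# 428 clicks / 15,200 impressions are illustrative example figures.
clicks=428
impressions=15200
ctr=$(awk -v c="$clicks" -v i="$impressions" 'BEGIN { printf "%.2f", (c / i) * 100 }')
echo "CTR: ${ctr}%"   # → CTR: 2.82%
```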
**Dimensions**:

| Dimension | Description |
|---|---|
| query | Search query |
| page | Landing page URL |
| country | User country |
| device | Desktop/mobile/tablet |
| date | Date (for trending) |
### SE Ranking (Official MCP Server)

**MCP Server**: `seo-data-api-mcp` (official SE Ranking MCP)

**Repository**: https://github.com/seranking/seo-data-api-mcp-server

**Installation (via claudeup TUI, recommended)**:

```shell
npx claudeup
# Navigate to: MCP Server Setup → SEO & Analytics → se-ranking
```

**Manual Installation**:

```shell
git clone https://github.com/seranking/seo-data-api-mcp-server.git
cd seo-data-api-mcp-server
docker compose build
```

**Environment Variable**: `SERANKING_API_TOKEN`
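One way to supply the token is a `.env` file next to the compose file, which `docker compose` reads by default. A sketch, with a placeholder value:

```shell
# Write the API token into a .env file that docker compose picks up
# automatically. The value shown is a placeholder, not a real token.
cat > .env <<'EOF'
SERANKING_API_TOKEN=your-api-token-here
EOF

# Confirm the variable is present before building/starting the server
grep -q '^SERANKING_API_TOKEN=' .env && echo "token configured"
```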
**Available MCP Tools**:

| Description | Use Case |
|---|---|
| Domain performance metrics | Overall domain health |
| Keyword rankings for domain | Track ranking positions |
| Identify competitors | Competitive analysis |
| Compare keywords across domains | Gap analysis |
| Retrieve backlink data | Link profile audit |
| Related keyword discovery | Content expansion |
| Similar keyword suggestions | Keyword clustering |
**Example MCP Calls**:

```
MCP: seo-data-api-mcp.domainOverview({ domain: "example.com" })
MCP: seo-data-api-mcp.domainKeywords({ domain: "example.com", limit: 100 })
MCP: seo-data-api-mcp.backlinksAll({ domain: "example.com" })
```
## Parallel Execution Pattern

### Optimal Data Fetch (All Sources)

When fetching from multiple sources, issue all requests in a SINGLE message so they execute in parallel:

- [MCP Call 1]: `google-analytics.get_report(...)`
- [MCP Call 2]: `google-search-console.search_analytics(...)`
- [WebFetch 3]: SE Ranking API endpoint

All requests execute simultaneously and results return when all complete, roughly 3x faster than fetching sequentially.
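Outside of MCP, the same idea can be sketched in plain shell with background jobs, where total wall time is roughly that of the slowest fetch rather than the sum. The `fetch_*` functions below are illustrative stand-ins, not real commands:

```shell
# Run independent fetches concurrently; wall time ≈ the slowest single fetch.
# fetch_ga4 / fetch_gsc / fetch_seranking are illustrative stand-ins.
fetch_ga4()       { sleep 1; echo "ga4 done"; }
fetch_gsc()       { sleep 1; echo "gsc done"; }
fetch_seranking() { sleep 1; echo "seranking done"; }

start=$(date +%s)
fetch_ga4 > ga4.out &
fetch_gsc > gsc.out &
fetch_seranking > seranking.out &
wait   # block until all three background jobs finish
end=$(date +%s)

echo "elapsed: $((end - start))s"   # ~1s in parallel, not ~3s sequentially
```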
### Sequential (When Needed)

Some operations require sequential execution, namely when one request depends on another's result:

1. **Message 1**: Get the list of pages (returns e.g. `["/page1", "/page2", "/page3"]`)
2. **Message 2**: Get details for each page, using the list from Message 1. Requests within this message can still be parallelized.
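The same dependency chain in shell form. `list_pages` and `fetch_page_details` are hypothetical stand-ins, not real MCP tools:

```shell
# Step 2 consumes step 1's output, so the two steps cannot run in parallel
# with each other. list_pages / fetch_page_details are stand-ins.
list_pages() { printf '%s\n' "/page1" "/page2" "/page3"; }
fetch_page_details() { echo "details for $1"; }

# Step 1: get the page list (must complete first)
pages=$(list_pages)

# Step 2: fetch details for each page returned by step 1
results=""
while IFS= read -r page; do
  results="${results}$(fetch_page_details "$page")
"
done <<< "$pages"

printf '%s' "$results"
```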
## Rate Limiting

### API Rate Limits

| API | Limit | Strategy |
|---|---|---|
| GA4 | 10 QPS per property | Batch dimensions |
| GSC | 1,200 requests/min | Paginate large exports |
| SE Ranking | 100 requests/min | Queue long operations |
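A simple client-side guard is to space requests by a minimum interval derived from the limit (e.g. 100 requests/min allows one request every 0.6 s; the sketch below uses whole seconds for portability):

```shell
# Minimal client-side throttle: ensure at least MIN_INTERVAL seconds between
# consecutive requests. Derive MIN_INTERVAL from the API's limit; 1s keeps
# you well under 100 requests/min.
MIN_INTERVAL=1
last_request=0

throttle() {
  local now elapsed
  now=$(date +%s)
  elapsed=$((now - last_request))
  if [ "$elapsed" -lt "$MIN_INTERVAL" ]; then
    sleep $((MIN_INTERVAL - elapsed))
  fi
  last_request=$(date +%s)
}

start=$(date +%s)
throttle   # first call: no wait (last_request is far in the past)
throttle   # second call: sleeps until the interval has elapsed
end=$(date +%s)
echo "two throttled calls took $((end - start))s"
```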
### Retry Pattern

```shell
#!/bin/bash
# Retry with exponential backoff

MAX_RETRIES=3
RETRY_DELAY=5

fetch_with_retry() {
  local url="$1"
  local attempt=1

  while [ $attempt -le $MAX_RETRIES ]; do
    # -o captures the body; -w appends the HTTP status code to stdout
    response=$(curl -s -w "%{http_code}" -o /tmp/response.json "$url")
    http_code="${response: -3}"

    if [ "$http_code" = "200" ]; then
      cat /tmp/response.json
      return 0
    elif [ "$http_code" = "429" ]; then
      echo "Rate limited, waiting ${RETRY_DELAY}s..." >&2
      sleep $RETRY_DELAY
      RETRY_DELAY=$((RETRY_DELAY * 2))
    else
      echo "Error: HTTP $http_code" >&2
      return 1
    fi

    attempt=$((attempt + 1))
  done

  echo "Max retries exceeded" >&2
  return 1
}
```
## Caching Pattern

### Session-Based Cache

```shell
# Cache structure
SESSION_PATH="/tmp/seo-performance-20251227-143000-example"
CACHE_DIR="${SESSION_PATH}/cache"
CACHE_TTL=3600  # 1 hour in seconds

mkdir -p "$CACHE_DIR"

# Cache key generation
cache_key() {
  echo "$1" | md5sum | cut -d' ' -f1
}

# Check cache: print cached value and return 0 if fresh, else return 1
get_cached() {
  local key=$(cache_key "$1")
  local cache_file="${CACHE_DIR}/${key}.json"
  if [ -f "$cache_file" ]; then
    # BSD stat first, GNU stat fallback
    local age=$(($(date +%s) - $(stat -f%m "$cache_file" 2>/dev/null || stat -c%Y "$cache_file")))
    if [ $age -lt $CACHE_TTL ]; then
      cat "$cache_file"
      return 0
    fi
  fi
  return 1
}

# Save stdin to cache under the given key
save_cache() {
  local key=$(cache_key "$1")
  local cache_file="${CACHE_DIR}/${key}.json"
  cat > "$cache_file"
}

# Usage
CACHE_KEY="ga4_${URL}_${DATE_RANGE}"
if ! RESULT=$(get_cached "$CACHE_KEY"); then
  RESULT=$(fetch_from_api)
  echo "$RESULT" | save_cache "$CACHE_KEY"
fi
```
## Date Range Standardization

### Common Date Ranges

```shell
# Standard date range calculations (BSD date syntax first, GNU date fallback)
TODAY=$(date +%Y-%m-%d)

case "$RANGE" in
  "7d")  START_DATE=$(date -v-7d +%Y-%m-%d 2>/dev/null || date -d "7 days ago" +%Y-%m-%d) ;;
  "30d") START_DATE=$(date -v-30d +%Y-%m-%d 2>/dev/null || date -d "30 days ago" +%Y-%m-%d) ;;
  "90d") START_DATE=$(date -v-90d +%Y-%m-%d 2>/dev/null || date -d "90 days ago" +%Y-%m-%d) ;;
  "mtd") START_DATE=$(date +%Y-%m-01) ;;
  "ytd") START_DATE=$(date +%Y-01-01) ;;
esac

END_DATE="$TODAY"
```
### API-Specific Formats

| API | Format | Example |
|---|---|---|
| GA4 | Relative or ISO | "30daysAgo", "2025-12-01" |
| GSC | ISO 8601 | "2025-12-01" |
| SE Ranking | ISO 8601 or Unix | "2025-12-01", 1764547200 |
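Where an API accepts either form, a small converter lets you keep one canonical ISO date internally. A sketch (BSD/macOS `date` syntax first, GNU fallback; midnight UTC assumed):

```shell
# Convert an ISO 8601 date to a Unix timestamp at midnight UTC.
# Tries BSD/macOS date syntax first, then falls back to GNU date.
iso_to_unix() {
  date -j -u -f "%Y-%m-%d %H:%M:%S" "$1 00:00:00" +%s 2>/dev/null \
    || date -u -d "$1" +%s
}

iso_to_unix "2025-12-01"   # → 1764547200
```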
## Graceful Degradation

### Data Source Fallback

When a data source is unavailable, fall back in priority order:

| Primary Source | Fallback | Last Resort |
|---|---|---|
| GA4 traffic data | GSC clicks | Estimate from GSC |
| GSC search perf | SE Ranking queries | WebSearch SERP |
| SE Ranking ranks | GSC avg position | Manual SERP check |
| CWV (CrUX) | PageSpeed API | Lighthouse CLI |
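A fallback chain like this is a natural fit for shell's `||` operator: try sources in priority order and take the first that answers. The functions below are illustrative stand-ins; here GA4 "fails", so the GSC fallback supplies the value:

```shell
# Sketch of a fallback chain: the first source that succeeds wins.
# get_ga4_traffic / get_gsc_clicks / estimate_from_gsc are stand-ins.
get_ga4_traffic()   { return 1; }      # simulate GA4 unavailable
get_gsc_clicks()    { echo "428"; }    # fallback answers
estimate_from_gsc() { echo "~400"; }   # last resort

traffic=$(get_ga4_traffic) || traffic=$(get_gsc_clicks) || traffic=$(estimate_from_gsc)
echo "traffic: $traffic"   # → traffic: 428
```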
### Partial Data Output

```markdown
## Analysis Report (Partial Data)

### Data Availability

| Source | Status | Impact |
|--------|--------|--------|
| GA4 | NOT CONFIGURED | Missing engagement metrics |
| GSC | AVAILABLE | Full search data |
| SE Ranking | ERROR (rate limit) | Using cached rankings |

### Analysis Notes

This analysis is based on limited data sources:

- Search performance metrics are complete (GSC)
- Engagement metrics unavailable (no GA4)
- Ranking data may be 24h stale (cached)

**Recommendation**: Configure GA4 for complete analysis. Run `/setup-analytics` to add Google Analytics.
```
## Unified Data Model

### Combined Output Structure

```json
{
  "metadata": {
    "url": "https://example.com/page",
    "fetchedAt": "2025-12-27T14:30:00Z",
    "dateRange": { "start": "2025-11-27", "end": "2025-12-27" }
  },
  "sources": {
    "ga4": {
      "available": true,
      "metrics": {
        "pageViews": 2450,
        "avgTimeOnPage": 222,
        "bounceRate": 38.2,
        "engagementRate": 64.5
      }
    },
    "gsc": {
      "available": true,
      "metrics": {
        "impressions": 15200,
        "clicks": 428,
        "ctr": 2.82,
        "avgPosition": 4.2
      },
      "topQueries": [
        { "query": "seo guide", "clicks": 156, "position": 4 }
      ]
    },
    "seRanking": {
      "available": true,
      "rankings": [
        { "keyword": "seo guide", "position": 4, "volume": 12100 }
      ],
      "visibility": 42
    }
  },
  "computed": {
    "healthScore": 72,
    "status": "GOOD"
  }
}
```
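Downstream steps should check each source's `available` flag before reading its metrics. A small sketch using `jq` (assumed installed), against a trimmed-down sample of the structure:

```shell
# List which sources in the unified model are actually available,
# so downstream analysis can skip the missing ones. Assumes jq is installed.
cat > unified.json <<'EOF'
{
  "sources": {
    "ga4": { "available": true },
    "gsc": { "available": true },
    "seRanking": { "available": false }
  }
}
EOF

available_sources() {
  jq -r '.sources | to_entries[] | select(.value.available) | .key' "$1"
}

available_sources unified.json
```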
## Error Handling

### Common Errors

| Error | Cause | Resolution |
|---|---|---|
| 401 Unauthorized | Invalid/expired credentials | Re-run /setup-analytics |
| 403 Forbidden | Missing permissions | Check API access in console |
| 429 Too Many Requests | Rate limit | Wait and retry with backoff |
| 404 Not Found | Invalid property/site | Verify IDs in configuration |
| 500 Server Error | API issue | Retry later, check status page |
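The table above can be encoded directly as a dispatch on the status code. The strategy labels below are illustrative, not a fixed API:

```shell
# Map an HTTP status code to a handling strategy, mirroring the table above.
# The strategy names are illustrative labels.
handle_http_error() {
  case "$1" in
    401) echo "reauthenticate" ;;      # re-run /setup-analytics
    403) echo "check-permissions" ;;   # verify API access in console
    429) echo "retry-with-backoff" ;;  # rate limited
    404) echo "verify-config" ;;       # bad property/site IDs
    5??) echo "retry-later" ;;         # transient server-side issue
    *)   echo "unknown" ;;
  esac
}

handle_http_error 429   # → retry-with-backoff
handle_http_error 503   # → retry-later
```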
### Error Output Pattern

```markdown
## Data Fetch Error

**Source**: Google Analytics 4
**Error**: 403 Forbidden
**Message**: "User does not have sufficient permissions for this property"

### Troubleshooting Steps

1. Verify Service Account email in GA4 Admin
2. Ensure "Viewer" role is granted
3. Check Analytics Data API is enabled
4. Wait 5 minutes for permission propagation

### Workaround

Proceeding with available data sources (GSC, SE Ranking). GA4 engagement metrics will not be included in this analysis.
```