Claude-skill-registry event-scraper

Create new event scraping scripts for websites. Use when adding a new event source to the Asheville Event Feed. ALWAYS start by detecting the CMS/platform and trying known API endpoints first. Browser scraping is NOT supported (Vercel limitation). Handles API-based, HTML/JSON-LD, and hybrid patterns with comprehensive testing workflows.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/event-scraper" ~/.claude/skills/majiayu000-claude-skill-registry-event-scraper && rm -rf "$T"
manifest: skills/data/event-scraper/SKILL.md

Event Scraper Skill

Create new event scrapers that integrate with the Asheville Event Feed codebase. This skill provides patterns and guidance for the full lifecycle: exploration, development, testing, and production integration.


⚠️ CRITICAL: API-First Approach

Scrapers run automatically on Vercel, which does NOT support browser automation.

You MUST find the site's API before considering any other approach. Modern websites almost always fetch event data from a backend API - your job is to find and use that same API.

Priority Order (STRICTLY follow this order):

  1. 🥇 Known CMS API - Check the Quick API Lookup table below FIRST
  2. 🥈 Internal JSON API - Site's own API endpoints (found via page analysis)
  3. 🥉 Public API - Official documented API (Ticketmaster, Eventbrite, etc.)
  4. 🏅 HTML with JSON-LD - Structured data embedded in HTML pages
  5. ❌ Browser scraping - NOT SUPPORTED on Vercel!

🚀 Quick API Endpoint Lookup (TRY THESE FIRST!)

Before doing any exploration, check if the site uses a known CMS/platform and try these endpoints directly:

| CMS/Plugin | Detection Signs | API Endpoint | Key Parameters |
|---|---|---|---|
| WordPress + Tribe Events | /wp-content/, "The Events Calendar" | /wp-json/tribe/events/v1/events | start_date, per_page, page |
| WordPress + All Events | "All-in-One Event Calendar" | /wp-json/osec/v1/events | start, end |
| WordPress REST | /wp-content/, /wp-admin/ | /wp-json/wp/v2/posts?type=event | per_page, page |
| Squarespace | squarespace.com, static1.squarespace.com | {any-page}?format=json | Append to URL |
| Next.js | /_next/, __NEXT_DATA__ | /_next/data/{buildId}/{page}.json | Check page source |
| Eventbrite | eventbrite.com | Internal API (see eventbrite.ts) | Complex - see example |
| Ticketmaster Venues | Venue ticket sales | Discovery API | venueId, apikey |
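
The detection signs in the lookup table can be checked mechanically once the page HTML has been fetched. The sketch below is one possible way to do that; `guessPlatform` and its return shape are hypothetical names introduced here, not part of the codebase, and the string checks are deliberately simple.

```typescript
// Hypothetical helper: maps detection signs from the lookup table to an endpoint worth trying.
type Platform = 'tribe-events' | 'wordpress' | 'squarespace' | 'nextjs' | 'unknown';

interface PlatformGuess {
  platform: Platform;
  // Path or suffix to try first; undefined when no fixed endpoint is known.
  endpoint?: string;
}

function guessPlatform(html: string): PlatformGuess {
  const has = (needle: string): boolean => html.includes(needle);

  // Most specific signal first: Tribe Events runs on top of WordPress.
  if (has('/wp-content/') && has('The Events Calendar')) {
    return { platform: 'tribe-events', endpoint: '/wp-json/tribe/events/v1/events' };
  }
  if (has('/wp-content/') || has('/wp-admin/')) {
    return { platform: 'wordpress', endpoint: '/wp-json/wp/v2/posts?type=event' };
  }
  if (has('squarespace.com')) {
    // Suffix appended to any page URL rather than a standalone path.
    return { platform: 'squarespace', endpoint: '?format=json' };
  }
  if (has('__NEXT_DATA__') || has('/_next/')) {
    // The buildId has to be read from the page source, so no fixed endpoint here.
    return { platform: 'nextjs' };
  }
  return { platform: 'unknown' };
}
```

A result of `unknown` does not mean there is no API; it only means none of the listed signs matched, and the exploration steps in Phase 1 still apply.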

Example: Detecting and Using Tribe Events API

If you detect WordPress + Tribe Events, immediately try:

GET https://example.com/wp-json/tribe/events/v1/events?start_date=2025-01-01&per_page=50&page=1

This often returns rich JSON with all event data, proper timezone handling, and pagination.


Required Output Format

Every scraper MUST return ScrapedEvent[]:

interface ScrapedEvent {
  sourceId: string;      // Unique ID from source platform (prefix with source, e.g., "mx-123")
  source: EventSource;   // Add to types.ts if new source
  title: string;
  description?: string;
  startDate: Date;       // UTC Date object - see Timezone Decision Tree
  location?: string;     // Format: "Venue, Address, City, State"
  zip?: string;          // Zip code (from API or fallback utilities)
  organizer?: string;
  price?: string;        // "Free", "$20", "$15 - $30", "Unknown"
  url: string;           // Unique event URL (used for deduplication)
  imageUrl?: string;
  interestedCount?: number;
  goingCount?: number;
  timeUnknown?: boolean; // True if source only provided date, no time
}
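
A small runtime check can catch events that satisfy the TypeScript type but are still unusable (empty IDs, invalid dates, relative URLs). The following is a minimal sketch; `RequiredEventFields` and `findEventProblems` are hypothetical names, and only the required fields of ScrapedEvent are covered.

```typescript
// Trimmed copy of the required ScrapedEvent fields so the sketch is self-contained.
interface RequiredEventFields {
  sourceId: string;
  title: string;
  startDate: Date;
  url: string;
}

// Returns human-readable problems; an empty array means the event passes.
function findEventProblems(ev: RequiredEventFields): string[] {
  const problems: string[] = [];
  if (!ev.sourceId.trim()) problems.push('sourceId is empty');
  if (!ev.title.trim()) problems.push('title is empty');
  if (Number.isNaN(ev.startDate.getTime())) problems.push('startDate is an invalid Date');
  try {
    new URL(ev.url); // throws on relative or malformed URLs
  } catch {
    problems.push('url is not an absolute URL');
  }
  return problems;
}
```

Running this over the scraper output before returning it gives an early signal that a field mapping is wrong.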

PHASE 1: EXPLORATION

Step 1.1: Detect CMS/Platform

Use WebFetch to analyze the target site:

WebFetch URL: https://example.com/events/
Prompt: "Analyze this page:
1. What CMS/platform is it? (WordPress, Squarespace, Next.js, custom)
2. Look for: wp-content, wp-json, squarespace, _next, __NEXT_DATA__
3. Is there JSON-LD structured data in script tags?
4. What event plugin is used? (Tribe Events, All Events Calendar, etc.)
5. Any hints about API endpoints in the HTML?"

Step 1.2: Try Known API Endpoints

Based on CMS detection, immediately try the known API endpoints from the Quick Lookup table:

WebFetch URL: https://example.com/wp-json/tribe/events/v1/events?per_page=5
Prompt: "Analyze this API response:
1. Is it returning JSON event data?
2. What fields are available? (title, start_date, venue, cost, etc.)
3. Is there timezone information?
4. What pagination mechanism is used?
5. List all available fields for each event"

Step 1.3: Test API Parameters

Once you find a working API, test common parameters:

| Parameter | Common Names | Purpose |
|---|---|---|
| Future filter | start_date, after, from, startDate | Only get future events |
| Page size | per_page, limit, count, pageSize | Control results per page |
| Pagination | page, offset, cursor, skip | Navigate pages |
| Sort | orderby, sort, sortValue | Order results |

WebFetch URL: https://example.com/wp-json/tribe/events/v1/events?start_date=2025-01-01&per_page=50
Prompt: "Does this API support:
1. start_date parameter for filtering future events?
2. per_page parameter for controlling page size?
3. What's the maximum per_page allowed?
4. How does pagination work (page number, next_url, etc.)?"
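
When the parameter name for the future filter is unknown, it can help to generate one probe URL per common name and try them in turn. This is a sketch under that assumption; `buildFilterProbes` is a hypothetical helper, and the name list is simply the "Future filter" row from the parameter table.

```typescript
// Common names for the "future events only" filter, taken from the parameter table.
const FUTURE_FILTER_NAMES = ['start_date', 'after', 'from', 'startDate'];

// Builds one probe URL per candidate parameter name, leaving existing query params intact.
function buildFilterProbes(apiUrl: string, date: string): string[] {
  return FUTURE_FILTER_NAMES.map((name) => {
    const url = new URL(apiUrl);
    url.searchParams.set(name, date);
    return url.toString();
  });
}
```

A parameter is probably supported if the response changes (for example, past events disappear) compared with the unfiltered request; an unchanged response usually means the API ignored it.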

Step 1.4: Document Field Mapping

Create a mental map of API fields to ScrapedEvent fields:

| API Field | ScrapedEvent Field | Transform Needed |
|---|---|---|
| id | sourceId | Prefix: "mx-${id}" |
| title | title | decodeHtmlEntities() |
| utc_start_date | startDate | new Date(utc.replace(' ', 'T') + 'Z') |
| cost | price | Use directly or "Unknown" |
| venue.venue | location | Build string, decode entities |
| venue.zip | zip | Use directly or fallback |
| url | url | Use directly |

⏰ Timezone Decision Tree (CRITICAL!)

Getting timezone right is crucial. Follow this decision tree:

Does the API provide a UTC field (utc_start_date, utc_time, etc.)?
├─ YES → Use directly: new Date(utcField.replace(' ', 'T') + 'Z')
│        This is the SIMPLEST and most reliable approach.
│
└─ NO → Does the API provide ISO 8601 with offset? (e.g., "2025-12-16T19:00:00-05:00")
        ├─ YES → Use directly: new Date(isoString)
        │
        └─ NO → Does the API provide local time + timezone name? (e.g., "America/New_York")
                ├─ YES → Use parseAsEastern(dateStr, timeStr)
                │
                └─ NO → DANGER! Ambiguous local time.
                        - Assume Eastern for NC events
                        - Use parseAsEastern(dateStr, timeStr)
                        - Verify with test insertion!
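
The decision tree can be expressed as a single function that tries each branch in order. The sketch below is illustrative only: `RawTimes`, `resolveStartDate`, and `parseLocalAsEastern` are hypothetical names, and `parseLocalAsEastern` is a stand-in written for this example, since the project's own parseAsEastern implementation is not shown in this guide.

```typescript
// Offset of America/New_York from UTC in minutes at the given instant (-300 in EST, -240 in EDT).
function easternOffsetMinutes(instant: Date): number {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone: 'America/New_York',
    hourCycle: 'h23',
    year: 'numeric',
    month: '2-digit',
    day: '2-digit',
    hour: '2-digit',
    minute: '2-digit',
    second: '2-digit',
  }).formatToParts(instant);
  const get = (type: string): number => Number(parts.find((p) => p.type === type)?.value);
  const wallAsUtc = Date.UTC(get('year'), get('month') - 1, get('day'), get('hour'), get('minute'), get('second'));
  return Math.round((wallAsUtc - instant.getTime()) / 60000);
}

// Stand-in for the project's parseAsEastern: interprets a wall-clock time as Eastern.
function parseLocalAsEastern(dateStr: string, timeStr: string): Date {
  const naive = new Date(`${dateStr}T${timeStr}Z`).getTime(); // wall time misread as UTC
  const firstGuess = naive - easternOffsetMinutes(new Date(naive)) * 60000;
  // Second pass re-reads the offset at the guessed instant, which matters near DST changes.
  return new Date(naive - easternOffsetMinutes(new Date(firstGuess)) * 60000);
}

interface RawTimes {
  utc?: string;           // e.g. "2025-12-17 00:00:00"
  isoWithOffset?: string; // e.g. "2025-12-16T19:00:00-05:00"
  localDate?: string;     // e.g. "2025-12-16"
  localTime?: string;     // e.g. "19:00:00"
}

// Walks the decision tree top to bottom and returns null when no branch applies.
function resolveStartDate(raw: RawTimes): Date | null {
  if (raw.utc) return new Date(raw.utc.replace(' ', 'T') + 'Z');
  if (raw.isoWithOffset) return new Date(raw.isoWithOffset);
  if (raw.localDate && raw.localTime) return parseLocalAsEastern(raw.localDate, raw.localTime);
  return null;
}
```

All three branches should resolve the same 7 PM Eastern event to the same instant, which makes this a convenient cross-check when an API exposes more than one time field.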

Timezone Verification

ALWAYS verify timezone handling by comparing:

  1. API's local time field (e.g., start_date: "2025-12-16 19:00:00")
  2. API's UTC field (e.g., utc_start_date: "2025-12-17 00:00:00")
  3. Your parsed Date displayed in Eastern (should match #1)

Example verification:

API local:  19:00:00 (7 PM Eastern)
API UTC:    00:00:00 next day (midnight UTC = 7 PM EST, correct!)
Our parsed: 7:00:00 PM Eastern ✓
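
The comparison above can be automated for APIs that expose both a local and a UTC field. This is a minimal sketch, assuming both fields use the "YYYY-MM-DD HH:mm:ss" format shown in the example; `toEasternWallTime` and `matchesApiLocal` are hypothetical helper names.

```typescript
// Formats an instant as "YYYY-MM-DD HH:mm:ss" wall-clock time in America/New_York.
function toEasternWallTime(date: Date): string {
  const parts = new Intl.DateTimeFormat('en-US', {
    timeZone: 'America/New_York',
    hourCycle: 'h23',
    year: 'numeric',
    month: '2-digit',
    day: '2-digit',
    hour: '2-digit',
    minute: '2-digit',
    second: '2-digit',
  }).formatToParts(date);
  const get = (type: string): string => parts.find((p) => p.type === type)?.value ?? '';
  return `${get('year')}-${get('month')}-${get('day')} ${get('hour')}:${get('minute')}:${get('second')}`;
}

// True when the UTC field, rendered in Eastern, equals the API's own local time field.
function matchesApiLocal(apiLocal: string, apiUtc: string): boolean {
  const parsed = new Date(apiUtc.replace(' ', 'T') + 'Z');
  return toEasternWallTime(parsed) === apiLocal;
}
```

A mismatch usually means either the venue is not in the Eastern timezone or the field assumed to be UTC is actually local time.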

📍 Location String Best Practices

Location strings often have issues. Follow these rules:

1. Always Decode HTML Entities

const venueName = decodeHtmlEntities(venue.venue); // "Rock & Roll" not "Rock &amp; Roll"
const address = decodeHtmlEntities(venue.address);

2. Avoid Duplicate City Names

APIs often include the city in both the address and the city field:

// BAD: "Turgua Brewing, Fairview, Fairview, NC"
// GOOD: "Turgua Brewing, 123 Main St, Fairview, NC"

if (venue.city && !venue.address?.includes(venue.city)) {
  parts.push(venue.city);
}

3. Standard Format

// Format: "Venue, Address, City, State"
const parts = [venueName];
if (venue.address) parts.push(decodeHtmlEntities(venue.address));
if (venue.city && !venue.address?.includes(venue.city)) {
  parts.push(venue.city);
}
if (venue.state) parts.push(venue.state);
location = parts.join(', ');

4. Zip Code Fallbacks

let zip = venue?.zip || undefined;
if (!zip && venue?.geo_lat && venue?.geo_lng) {
  zip = getZipFromCoords(venue.geo_lat, venue.geo_lng);
}
if (!zip && venue?.city) {
  zip = getZipFromCity(venue.city);
}

PHASE 2: DEVELOPMENT

Step 2.1: Add Source Type

Add to lib/scrapers/types.ts:

export type EventSource = 'AVL_TODAY' | ... | 'YOUR_SOURCE';

Step 2.2: Create Scraper

Create lib/scrapers/yoursource.ts:

import { ScrapedEvent } from './types';
import { fetchWithRetry } from '@/lib/utils/retry';
import { isNonNCEvent, getZipFromCoords, getZipFromCity } from '@/lib/utils/geo';
import { decodeHtmlEntities } from '@/lib/utils/parsers';
import { getTodayStringEastern } from '@/lib/utils/timezone';

const API_BASE = 'https://example.com/wp-json/tribe/events/v1/events';
const PER_PAGE = 50;
const MAX_PAGES = 40;
const DELAY_MS = 200;

const API_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Accept': 'application/json',
};

export async function scrapeYourSource(): Promise<ScrapedEvent[]> {
  console.log('[YourSource] Starting scrape...');

  const allEvents: ScrapedEvent[] = [];
  const today = getTodayStringEastern();
  let page = 1;
  let hasMore = true;

  while (hasMore && page <= MAX_PAGES) {
    try {
      const url = new URL(API_BASE);
      url.searchParams.set('start_date', today);
      url.searchParams.set('per_page', PER_PAGE.toString());
      url.searchParams.set('page', page.toString());

      console.log(`[YourSource] Fetching page ${page}...`);

      const response = await fetchWithRetry(
        url.toString(),
        { headers: API_HEADERS, cache: 'no-store' },
        { maxRetries: 3, baseDelay: 1000 }
      );

      const data = await response.json();
      const events = data.events || [];

      console.log(`[YourSource] Page ${page}: ${events.length} events`);

      for (const event of events) {
        const formatted = formatEvent(event);
        if (formatted) allEvents.push(formatted);
      }

      hasMore = !!data.next_rest_url && page < data.total_pages;
      page++;

      if (hasMore) await new Promise(r => setTimeout(r, DELAY_MS));
    } catch (error) {
      console.error(`[YourSource] Error on page ${page}:`, error);
      break;
    }
  }

  // Filter non-NC events
  const ncEvents = allEvents.filter(ev => !isNonNCEvent(ev.title, ev.location));
  console.log(`[YourSource] Found ${ncEvents.length} NC events`);

  return ncEvents;
}

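// Shape of one event in the API response, inferred from the fields used below.
// Adjust it to match the actual payload of your source.
interface ApiEvent {
  id: number | string;
  title: string;
  description?: string;
  utc_start_date: string; // "YYYY-MM-DD HH:mm:ss" in UTC
  url: string;
  cost?: string;
  all_day?: boolean;
  image?: { url?: string };
  organizer?: Array<{ organizer?: string }>;
  venue?: {
    venue?: string;
    address?: string;
    city?: string;
    state?: string;
    zip?: string;
    geo_lat?: number;
    geo_lng?: number;
  };
}

function formatEvent(event: ApiEvent): ScrapedEvent | null {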
  // Parse UTC date (see Timezone Decision Tree)
  const startDate = new Date(event.utc_start_date.replace(' ', 'T') + 'Z');

  if (isNaN(startDate.getTime()) || startDate < new Date()) {
    return null;
  }

  // Build location (see Location Best Practices)
  const venue = event.venue;
  let location: string | undefined;
  if (venue?.venue) {
    const parts = [decodeHtmlEntities(venue.venue)];
    if (venue.address) parts.push(decodeHtmlEntities(venue.address));
    if (venue.city && !venue.address?.includes(venue.city)) parts.push(venue.city);
    if (venue.state) parts.push(venue.state);
    location = parts.join(', ');
  }

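  // Zip with fallbacks: API value, then coordinates, then city lookup
  let zip = venue?.zip || undefined;
  if (!zip && venue?.geo_lat && venue?.geo_lng) {
    zip = getZipFromCoords(venue.geo_lat, venue.geo_lng);
  }
  if (!zip && venue?.city) {
    zip = getZipFromCity(venue.city);
  }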

  return {
    sourceId: `ys-${event.id}`,
    source: 'YOUR_SOURCE',
    title: decodeHtmlEntities(event.title),
    description: event.description ? decodeHtmlEntities(event.description) : undefined,
    startDate,
    location,
    zip,
    organizer: event.organizer?.[0]?.organizer,
    price: event.cost || 'Unknown',
    url: event.url,
    imageUrl: event.image?.url,
    timeUnknown: event.all_day || false,
  };
}

Step 2.3: Create Test Script

Create scripts/scrapers/test-yoursource.ts:

import 'dotenv/config';
import * as fs from 'fs';
import * as path from 'path';

const DEBUG_DIR = path.join(process.cwd(), 'debug-scraper-yoursource');
if (!fs.existsSync(DEBUG_DIR)) {
  fs.mkdirSync(DEBUG_DIR, { recursive: true });
}

async function main() {
  console.log('='.repeat(60));
  console.log('SCRAPER TEST - YourSource');
  console.log('='.repeat(60));

  // Import scraper
  const { scrapeYourSource } = await import('../../lib/scrapers/yoursource');

  // Run scraper
  const startTime = Date.now();
  const events = await scrapeYourSource();
  const duration = Date.now() - startTime;

  // Save results
  fs.writeFileSync(
    path.join(DEBUG_DIR, 'events.json'),
    JSON.stringify(events, null, 2)
  );

  // Display summary
  console.log(`\nCompleted in ${(duration / 1000).toFixed(1)}s`);
  console.log(`Found ${events.length} events`);

  // Field completeness (pct() avoids NaN% when no events were found)
  const withImages = events.filter(e => e.imageUrl).length;
  const withPrices = events.filter(e => e.price && e.price !== 'Unknown').length;
  const withZips = events.filter(e => e.zip).length;
  const pct = (n: number) => (events.length ? Math.round((n / events.length) * 100) : 0);

  console.log(`\nField Completeness:`);
  console.log(`  Images: ${withImages}/${events.length} (${pct(withImages)}%)`);
  console.log(`  Prices: ${withPrices}/${events.length} (${pct(withPrices)}%)`);
  console.log(`  Zips: ${withZips}/${events.length} (${pct(withZips)}%)`);

  // Sample events with timezone verification
  console.log(`\nSample Events (verify timezone!):`);
  for (const e of events.slice(0, 5)) {
    console.log(`\n${e.title}`);
    console.log(`  UTC:     ${e.startDate.toISOString()}`);
    console.log(`  Eastern: ${e.startDate.toLocaleString('en-US', { timeZone: 'America/New_York' })}`);
    console.log(`  Location: ${e.location || 'N/A'}`);
    console.log(`  Price: ${e.price}`);
  }

  console.log(`\nDebug files saved to: ${DEBUG_DIR}`);
}

main().catch(console.error);

Step 2.4: Add to package.json

"test:yoursource": "npx tsx scripts/scrapers/test-yoursource.ts"

PHASE 3: VALIDATION

Run the test script and verify output:

npm run test:yoursource

Validation Checklist

  • Timezone correct: Eastern times match expected (7 PM event shows as 7 PM ET)
  • No HTML entities: Titles/locations decoded (& not &amp;)
  • No duplicate cities: Location format is clean
  • Prices reasonable: Mix of Free, $X, Unknown
  • Zip codes populated: Most events have zips
  • URLs unique: No duplicates
  • Future events only: No past dates
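
Several items on this checklist can be verified in code rather than by eye. The sketch below covers duplicate URLs, past dates, and leftover HTML entities; `CheckableEvent` and `runChecklist` are hypothetical names, and the entity pattern is a simple heuristic rather than a full HTML parser.

```typescript
// Subset of ScrapedEvent fields needed for the automated checks.
interface CheckableEvent {
  title: string;
  location?: string;
  url: string;
  startDate: Date;
}

// Returns one message per failed check; an empty array means all automated checks passed.
function runChecklist(events: CheckableEvent[], now: Date = new Date()): string[] {
  const failures: string[] = [];
  const seenUrls = new Set<string>();
  // Matches named (&amp;), decimal (&#8211;), and hex (&#x2013;) entities.
  const entityPattern = /&(?:[a-z]+|#\d+|#x[0-9a-f]+);/i;

  for (const ev of events) {
    if (seenUrls.has(ev.url)) failures.push(`Duplicate URL: ${ev.url}`);
    seenUrls.add(ev.url);

    if (ev.startDate.getTime() < now.getTime()) failures.push(`Past date: ${ev.title}`);

    if (entityPattern.test(ev.title) || entityPattern.test(ev.location ?? '')) {
      failures.push(`Undecoded HTML entity: ${ev.title}`);
    }
  }
  return failures;
}
```

Timezone correctness and price plausibility still need a manual look, since they depend on knowing what the source site actually shows.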

PHASE 4: DATABASE TESTING

⚠️ MANDATORY: You MUST Complete This Phase

DO NOT declare production-ready until you have inserted test events into the real database and verified they display correctly.

Scraper output validation alone is NOT sufficient. Database insertion can reveal:

  • Timezone conversion issues
  • Field truncation
  • Constraint violations
  • Display problems

Step 4.1: Insert Test Events

// scripts/scrapers/test-yoursource-db.ts
import 'dotenv/config';
import { db } from '../../lib/db';
import { events } from '../../lib/db/schema';
import { eq } from 'drizzle-orm';
import { scrapeYourSource } from '../../lib/scrapers/yoursource';

async function main() {
  // Check existing
  const existing = await db.select().from(events).where(eq(events.source, 'YOUR_SOURCE'));
  console.log(`Existing YOUR_SOURCE events: ${existing.length}`);

  // Scrape a few events
  const scraped = await scrapeYourSource();
  const testEvents = scraped.slice(0, 5);

  // Insert
  for (const event of testEvents) {
    await db.insert(events).values({
      ...event,
      tags: [],
      lastSeenAt: new Date(),
    }).onConflictDoUpdate({
      target: events.url,
      set: { lastSeenAt: new Date() },
    });
    console.log(`Inserted: ${event.title}`);
  }

  // Verify - THIS IS THE CRITICAL CHECK
  console.log('\n=== VERIFICATION ===\n');
  const inserted = await db.select().from(events).where(eq(events.source, 'YOUR_SOURCE'));

  for (const e of inserted) {
    console.log(`${e.title}`);
    console.log(`  DB Date:  ${e.startDate}`);
    console.log(`  Eastern:  ${e.startDate.toLocaleString('en-US', { timeZone: 'America/New_York' })}`);
    console.log(`  Location: ${e.location}`);
    console.log(`  Zip:      ${e.zip}`);
    console.log(`  Price:    ${e.price}`);
    console.log('');
  }

  console.log('To cleanup: DELETE FROM events WHERE source = \'YOUR_SOURCE\';');
}

main().catch(console.error);

Step 4.2: Verify Checklist

  • Events inserted without errors
  • Dates display correctly in Eastern time
  • All fields populated as expected
  • No HTML entities in text
  • Zip codes present

Step 4.3: Cleanup Test Data

npx tsx -e "
import 'dotenv/config';
import { db } from './lib/db';
import { events } from './lib/db/schema';
import { eq } from 'drizzle-orm';
db.delete(events).where(eq(events.source, 'YOUR_SOURCE')).then(() => console.log('Cleaned up'));
"

PHASE 5: PRODUCTION INTEGRATION

Step 5.1: Update Cron Route

Edit app/api/cron/scrape/route.ts:

// Add import
import { scrapeYourSource } from '@/lib/scrapers/yoursource';

// Add to Promise.allSettled array
const [..., yourSourceResult] = await Promise.allSettled([
  ...,
  scrapeYourSource(),
]);

// Extract results
const yourSourceEvents = yourSourceResult.status === 'fulfilled' ? yourSourceResult.value : [];

// Log failures
if (yourSourceResult.status === 'rejected')
  console.error('[Scrape] YourSource failed:', yourSourceResult.reason);

// Add to stats
stats.scraping.total = ... + yourSourceEvents.length;

// Add to allEvents
const allEvents = [..., ...yourSourceEvents];

// Update log message
console.log(`... YourSource: ${yourSourceEvents.length} ...`);
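
The reason the route uses Promise.allSettled is that one failing scraper should not discard the results of the others. The self-contained sketch below illustrates that behavior in isolation; `runAll` and the `Scraper` type are hypothetical, and `string[]` stands in for ScrapedEvent[] to keep the example short.

```typescript
// string[] stands in for ScrapedEvent[] in this illustration.
type Scraper = () => Promise<string[]>;

interface RunSummary {
  events: string[];
  failed: string[]; // names of scrapers whose promise rejected
}

// Runs every scraper concurrently and keeps whatever succeeded.
async function runAll(scrapers: Record<string, Scraper>): Promise<RunSummary> {
  const names = Object.keys(scrapers);
  const results = await Promise.allSettled(names.map((name) => scrapers[name]()));

  const summary: RunSummary = { events: [], failed: [] };
  results.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      summary.events.push(...result.value);
    } else {
      summary.failed.push(names[i]);
    }
  });
  return summary;
}
```

With Promise.all, the rejected scraper in such a run would have caused the whole batch to reject, which is why the new scraper should be added to the existing allSettled array rather than awaited separately.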

Step 5.2: Verify TypeScript Compiles

npx tsc --noEmit

PHASE 6: CLEANUP

# Remove debug folder
rm -rf debug-scraper-yoursource

# Remove test DB script if created
rm scripts/scrapers/test-yoursource-db.ts

Integration Checklist

  • Exploration

    • Detected CMS/platform
    • Tried known API endpoints
    • Tested API parameters (start_date, per_page, page)
    • Documented field mapping
    • Identified timezone handling approach
  • Development

    • Added source to types.ts
    • Created scraper file
    • Created test script
    • Added npm script
  • Validation

    • Timezone verified (Eastern times correct)
    • HTML entities decoded
    • Location strings clean (no duplicates)
    • Field completeness acceptable
  • Database Testing (MANDATORY)

    • Inserted test events
    • Verified dates in database
    • Confirmed all fields correct
    • Cleaned up test data
  • Production

    • Added to cron route
    • TypeScript compiles
    • Ready for deployment

Common Utilities Reference

Timezone

import { getTodayStringEastern, parseAsEastern } from '@/lib/utils/timezone';

// Get today's date in Eastern (for API start_date param)
const today = getTodayStringEastern(); // "2025-12-16"

// Parse ambiguous local time as Eastern
const date = parseAsEastern('2025-12-25', '19:00:00');

Price Formatting

import { formatPrice } from '@/lib/utils/parsers';

formatPrice(0);        // "Free"
formatPrice(25.50);    // "$26"
formatPrice(null);     // "Unknown"

HTML Entities

import { decodeHtmlEntities } from '@/lib/utils/parsers';

decodeHtmlEntities('Rock &amp; Roll &#8211; Live');
// "Rock & Roll – Live"

Location Filtering

import { isNonNCEvent } from '@/lib/utils/geo';

// Returns true if event should be EXCLUDED (not in NC)
if (isNonNCEvent(event.title, event.location)) continue;

Zip Code Fallbacks

import { getZipFromCoords, getZipFromCity } from '@/lib/utils/geo';

let zip = venue.zip || getZipFromCoords(lat, lng) || getZipFromCity(city);

Troubleshooting

API Returns 403/429

  • Add realistic headers (User-Agent, Accept, Referer)
  • Increase delays between requests (200-500ms)
  • Some APIs require Referer header matching the site
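
The points above can be sketched as two small helpers: one that builds browser-like headers including a Referer, and one that computes a growing delay between retries. Both names are hypothetical, the `/events/` Referer path is an assumption about the target site, and exponential backoff is an assumption made for this example rather than a description of how the project's fetchWithRetry works.

```typescript
// Browser-like headers; the Referer points back at the site being scraped.
function buildApiHeaders(siteOrigin: string): Record<string, string> {
  return {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    Accept: 'application/json',
    Referer: `${siteOrigin}/events/`, // assumed events page path
  };
}

// Exponential backoff: baseDelay, 2x, 4x, ... capped at maxDelay (all in milliseconds).
function backoffDelayMs(attempt: number, baseDelay = 1000, maxDelay = 30000): number {
  return Math.min(baseDelay * 2 ** attempt, maxDelay);
}

// 429 and 5xx are usually worth retrying; a 403 normally needs different headers instead.
function isRetryableStatus(status: number): boolean {
  return status === 429 || status >= 500;
}
```

If a 403 persists with realistic headers, the endpoint may require a session cookie or token, in which case a different source from the priority list is the safer choice.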

Dates Off by Hours

  • Check Timezone Decision Tree above
  • Verify API returns UTC vs local time
  • Compare API local time with your parsed Eastern time

Duplicate Events

  • Ensure url is unique per event
  • For recurring events, append date to URL: ${url}#${date}
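
For the recurring-event case, a minimal sketch of the date suffix might look like this. `occurrenceUrl` is a hypothetical helper, and using the UTC calendar date as the suffix is an assumption; if a source can have two occurrences on the same UTC day, the full ISO timestamp would be a safer suffix.

```typescript
// Appends the occurrence's UTC calendar date as a URL fragment so each occurrence is unique.
function occurrenceUrl(url: string, startDate: Date): string {
  const day = startDate.toISOString().slice(0, 10); // "YYYY-MM-DD"
  return `${url}#${day}`;
}
```

Because the fragment is never sent to the server, the link still opens the same event page for every occurrence.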

Missing Events

  • Check pagination (off-by-one errors)
  • Verify start_date parameter format
  • API may have max page limit
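
Off-by-one pagination bugs usually come from stopping before the last page is consumed. The sketch below shows one loop shape that avoids this; `collectAllPages` and `PageResult` are hypothetical names, and the page fetcher is injected so the loop can be exercised without a network call.

```typescript
interface PageResult<T> {
  items: T[];
  totalPages: number;
}

// Collects items from page 1 through the last page, bounded by maxPages as a safety limit.
async function collectAllPages<T>(
  fetchPage: (page: number) => Promise<PageResult<T>>,
  maxPages = 40,
): Promise<T[]> {
  const all: T[] = [];
  for (let page = 1; page <= maxPages; page++) {
    const { items, totalPages } = await fetchPage(page);
    all.push(...items);
    // Stop only after the current page has been added, never before.
    if (items.length === 0 || page >= totalPages) break;
  }
  return all;
}
```

If the collected count is lower than the total the API reports, the usual suspects are the maxPages cap or an API-side limit on how deep pagination can go.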

HTML in Titles/Locations

  • Apply decodeHtmlEntities() to ALL text fields
  • Check for <br>, <p> tags that need stripping
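
For the tag-stripping step, a regex-based helper is often enough for short fields such as titles and venue names. This is a rough sketch, not a full HTML parser; `stripTags` is a hypothetical name, and entity decoding is still expected to happen separately via decodeHtmlEntities().

```typescript
// Replaces line-breaking tags with spaces, drops all other tags, and collapses whitespace.
function stripTags(html: string): string {
  return html
    .replace(/<br\s*\/?>/gi, ' ') // <br>, <br/>, <br />
    .replace(/<\/p>/gi, ' ')      // paragraph ends become word breaks
    .replace(/<[^>]+>/g, '')      // any remaining tag
    .replace(/\s+/g, ' ')
    .trim();
}
```

For long descriptions with nested markup, a real HTML parser is the more robust option.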

Duplicate City in Location

  • Check if city already in address before appending
  • Common with APIs that include full address + separate city field