Claude-skill-registry event-scraper

Create new event scraping scripts for websites. Use when adding a new event source to the Asheville Event Feed. ALWAYS start by detecting the CMS/platform and trying known API endpoints first. Browser scraping is NOT supported (Vercel limitation). Handles API-based, HTML/JSON-LD, and hybrid patterns with comprehensive testing workflows.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/event-scraper" ~/.claude/skills/majiayu000-claude-skill-registry-event-scraper && rm -rf "$T"

manifest: skills/data/event-scraper/SKILL.md

source content

Event Scraper Skill

Create new event scrapers that integrate with the Asheville Event Feed codebase. This skill provides patterns and guidance for the full lifecycle: exploration, development, testing, and production integration.

⚠️ CRITICAL: API-First Approach

Scrapers run automatically on Vercel which does NOT support browser automation.

You MUST find the site's API before considering any other approach. Modern websites almost always fetch event data from a backend API - your job is to find and use that same API.

Priority Order (STRICTLY follow this order):

🥇 Known CMS API - Check the Quick API Lookup table below FIRST
🥈 Internal JSON API - Site's own API endpoints (found via page analysis)
🥉 Public API - Official documented API (Ticketmaster, Eventbrite, etc.)
🏅 HTML with JSON-LD - Structured data embedded in HTML pages
❌ Browser scraping - NOT SUPPORTED on Vercel!

🚀 Quick API Endpoint Lookup (TRY THESE FIRST!)

Before doing any exploration, check if the site uses a known CMS/platform and try these endpoints directly:

CMS/Plugin	Detection Signs	API Endpoint	Key Parameters
WordPress + Tribe Events	`/wp-content/` , "The Events Calendar"	`/wp-json/tribe/events/v1/events`	`start_date` , `per_page` , `page`
WordPress + All Events	"All-in-One Event Calendar"	`/wp-json/osec/v1/events`	`start` , `end`
WordPress REST	`/wp-content/` , `/wp-admin/`	`/wp-json/wp/v2/posts?type=event`	`per_page` , `page`
Squarespace	`squarespace.com` , `static1.squarespace.com`	`{any-page}?format=json`	Append to URL
Next.js	`/_next/` , `__NEXT_DATA__`	`/_next/data/{buildId}/{page}.json`	Check page source
Eventbrite	`eventbrite.com`	Internal API (see eventbrite.ts)	Complex - see example
Ticketmaster Venues	Venue ticket sales	Discovery API	`venueId` , `apikey`

Example: Detecting and Using Tribe Events API

If you detect WordPress + Tribe Events, immediately try:

GET https://example.com/wp-json/tribe/events/v1/events?start_date=2025-01-01&per_page=50&page=1

This often returns rich JSON with all event data, proper timezone handling, and pagination.

Required Output Format

Every scraper MUST return

ScrapedEvent[]

interface ScrapedEvent {
  sourceId: string;      // Unique ID from source platform (prefix with source, e.g., "mx-123")
  source: EventSource;   // Add to types.ts if new source
  title: string;
  description?: string;
  startDate: Date;       // UTC Date object - see Timezone Decision Tree
  location?: string;     // Format: "Venue, Address, City, State"
  zip?: string;          // Zip code (from API or fallback utilities)
  organizer?: string;
  price?: string;        // "Free", "$20", "$15 - $30", "Unknown"
  url: string;           // Unique event URL (used for deduplication)
  imageUrl?: string;
  interestedCount?: number;
  goingCount?: number;
  timeUnknown?: boolean; // True if source only provided date, no time
}

PHASE 1: EXPLORATION

Step 1.1: Detect CMS/Platform

Use WebFetch to analyze the target site:

WebFetch URL: https://example.com/events/
Prompt: "Analyze this page:
1. What CMS/platform is it? (WordPress, Squarespace, Next.js, custom)
2. Look for: wp-content, wp-json, squarespace, _next, __NEXT_DATA__
3. Is there JSON-LD structured data in script tags?
4. What event plugin is used? (Tribe Events, All Events Calendar, etc.)
5. Any hints about API endpoints in the HTML?"

Step 1.2: Try Known API Endpoints

Based on CMS detection, immediately try the known API endpoints from the Quick Lookup table:

WebFetch URL: https://example.com/wp-json/tribe/events/v1/events?per_page=5
Prompt: "Analyze this API response:
1. Is it returning JSON event data?
2. What fields are available? (title, start_date, venue, cost, etc.)
3. Is there timezone information?
4. What pagination mechanism is used?
5. List all available fields for each event"

Step 1.3: Test API Parameters

Once you find a working API, test common parameters:

Parameter	Common Names	Purpose
Future filter	`start_date` , `after` , `from` , `startDate`	Only get future events
Page size	`per_page` , `limit` , `count` , `pageSize`	Control results per page
Pagination	`page` , `offset` , `cursor` , `skip`	Navigate pages
Sort	`orderby` , `sort` , `sortValue`	Order results

WebFetch URL: https://example.com/wp-json/tribe/events/v1/events?start_date=2025-01-01&per_page=50
Prompt: "Does this API support:
1. start_date parameter for filtering future events?
2. per_page parameter for controlling page size?
3. What's the maximum per_page allowed?
4. How does pagination work (page number, next_url, etc.)?"

Step 1.4: Document Field Mapping

Create a mental map of API fields to ScrapedEvent fields:

API Field	ScrapedEvent Field	Transform Needed
`id`	`sourceId`	Prefix: `"mx-${id}"`
`title`	`title`	`decodeHtmlEntities()`
`utc_start_date`	`startDate`	`new Date(utc + 'Z')`
`cost`	`price`	Use directly or "Unknown"
`venue.venue`	`location`	Build string, decode entities
`venue.zip`	`zip`	Use directly or fallback
`url`	`url`	Use directly

⏰ Timezone Decision Tree (CRITICAL!)

Getting timezone right is crucial. Follow this decision tree:

Does the API provide a UTC field (utc_start_date, utc_time, etc.)?
├─ YES → Use directly: new Date(utcField.replace(' ', 'T') + 'Z')
│        This is the SIMPLEST and most reliable approach.
│
└─ NO → Does the API provide ISO 8601 with offset? (e.g., "2025-12-16T19:00:00-05:00")
        ├─ YES → Use directly: new Date(isoString)
        │
        └─ NO → Does the API provide local time + timezone name? (e.g., "America/New_York")
                ├─ YES → Use parseAsEastern(dateStr, timeStr)
                │
                └─ NO → DANGER! Ambiguous local time.
                        - Assume Eastern for NC events
                        - Use parseAsEastern(dateStr, timeStr)
                        - Verify with test insertion!

Timezone Verification

ALWAYS verify timezone handling by comparing:

API's local time field (e.g.,
```
start_date: "2025-12-16 19:00:00"
```
)
API's UTC field (e.g.,
```
utc_start_date: "2025-12-17 00:00:00"
```
)
Your parsed Date displayed in Eastern (should match #1)

Example verification:

API local:  19:00:00 (7 PM Eastern)
API UTC:    00:00:00 next day (midnight UTC = 7 PM EST, correct!)
Our parsed: 7:00:00 PM Eastern ✓

📍 Location String Best Practices

Location strings often have issues. Follow these rules:

1. Always Decode HTML Entities

const venueName = decodeHtmlEntities(venue.venue); // "Rock & Roll" not "Rock &amp; Roll"
const address = decodeHtmlEntities(venue.address);

2. Avoid Duplicate City Names

APIs often include city in both venue name and city field:

// BAD: "Turgua Brewing, Fairview, Fairview, NC"
// GOOD: "Turgua Brewing, 123 Main St, Fairview, NC"

if (venue.city && !venue.address?.includes(venue.city)) {
  parts.push(venue.city);
}

3. Standard Format

// Format: "Venue, Address, City, State"
const parts = [venueName];
if (venue.address) parts.push(decodeHtmlEntities(venue.address));
if (venue.city && !venue.address?.includes(venue.city)) {
  parts.push(venue.city);
}
if (venue.state) parts.push(venue.state);
location = parts.join(', ');

4. Zip Code Fallbacks

let zip = venue?.zip || undefined;
if (!zip && venue?.geo_lat && venue?.geo_lng) {
  zip = getZipFromCoords(venue.geo_lat, venue.geo_lng);
}
if (!zip && venue?.city) {
  zip = getZipFromCity(venue.city);
}

PHASE 2: DEVELOPMENT

Step 2.1: Add Source Type

Add to

lib/scrapers/types.ts

export type EventSource = 'AVL_TODAY' | ... | 'YOUR_SOURCE';

Step 2.2: Create Scraper

Create

lib/scrapers/yoursource.ts

import { ScrapedEvent } from './types';
import { fetchWithRetry } from '@/lib/utils/retry';
import { isNonNCEvent } from '@/lib/utils/geo';
import { decodeHtmlEntities } from '@/lib/utils/parsers';
import { getZipFromCoords, getZipFromCity } from '@/lib/utils/geo';
import { getTodayStringEastern } from '@/lib/utils/timezone';

const API_BASE = 'https://example.com/wp-json/tribe/events/v1/events';
const PER_PAGE = 50;
const MAX_PAGES = 40;
const DELAY_MS = 200;

const API_HEADERS = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
  'Accept': 'application/json',
};

export async function scrapeYourSource(): Promise<ScrapedEvent[]> {
  console.log('[YourSource] Starting scrape...');

  const allEvents: ScrapedEvent[] = [];
  const today = getTodayStringEastern();
  let page = 1;
  let hasMore = true;

  while (hasMore && page <= MAX_PAGES) {
    try {
      const url = new URL(API_BASE);
      url.searchParams.set('start_date', today);
      url.searchParams.set('per_page', PER_PAGE.toString());
      url.searchParams.set('page', page.toString());

      console.log(`[YourSource] Fetching page ${page}...`);

      const response = await fetchWithRetry(
        url.toString(),
        { headers: API_HEADERS, cache: 'no-store' },
        { maxRetries: 3, baseDelay: 1000 }
      );

      const data = await response.json();
      const events = data.events || [];

      console.log(`[YourSource] Page ${page}: ${events.length} events`);

      for (const event of events) {
        const formatted = formatEvent(event);
        if (formatted) allEvents.push(formatted);
      }

      hasMore = !!data.next_rest_url && page < data.total_pages;
      page++;

      if (hasMore) await new Promise(r => setTimeout(r, DELAY_MS));
    } catch (error) {
      console.error(`[YourSource] Error on page ${page}:`, error);
      break;
    }
  }

  // Filter non-NC events
  const ncEvents = allEvents.filter(ev => !isNonNCEvent(ev.title, ev.location));
  console.log(`[YourSource] Found ${ncEvents.length} NC events`);

  return ncEvents;
}

function formatEvent(event: ApiEvent): ScrapedEvent | null {
  // Parse UTC date (see Timezone Decision Tree)
  const startDate = new Date(event.utc_start_date.replace(' ', 'T') + 'Z');

  if (isNaN(startDate.getTime()) || startDate < new Date()) {
    return null;
  }

  // Build location (see Location Best Practices)
  const venue = event.venue;
  let location: string | undefined;
  if (venue?.venue) {
    const parts = [decodeHtmlEntities(venue.venue)];
    if (venue.address) parts.push(decodeHtmlEntities(venue.address));
    if (venue.city && !venue.address?.includes(venue.city)) parts.push(venue.city);
    if (venue.state) parts.push(venue.state);
    location = parts.join(', ');
  }

  // Zip with fallbacks
  let zip = venue?.zip || undefined;
  if (!zip && venue?.geo_lat && venue?.geo_lng) {
    zip = getZipFromCoords(venue.geo_lat, venue.geo_lng);
  }

  return {
    sourceId: `ys-${event.id}`,
    source: 'YOUR_SOURCE',
    title: decodeHtmlEntities(event.title),
    description: event.description ? decodeHtmlEntities(event.description) : undefined,
    startDate,
    location,
    zip,
    organizer: event.organizer?.[0]?.organizer,
    price: event.cost || 'Unknown',
    url: event.url,
    imageUrl: event.image?.url,
    timeUnknown: event.all_day || false,
  };
}

Step 2.3: Create Test Script

Create

scripts/scrapers/test-yoursource.ts

import 'dotenv/config';
import * as fs from 'fs';
import * as path from 'path';

const DEBUG_DIR = path.join(process.cwd(), 'debug-scraper-yoursource');
if (!fs.existsSync(DEBUG_DIR)) {
  fs.mkdirSync(DEBUG_DIR, { recursive: true });
}

async function main() {
  console.log('='.repeat(60));
  console.log('SCRAPER TEST - YourSource');
  console.log('='.repeat(60));

  // Import scraper
  const { scrapeYourSource } = await import('../lib/scrapers/yoursource');

  // Run scraper
  const startTime = Date.now();
  const events = await scrapeYourSource();
  const duration = Date.now() - startTime;

  // Save results
  fs.writeFileSync(
    path.join(DEBUG_DIR, 'events.json'),
    JSON.stringify(events, null, 2)
  );

  // Display summary
  console.log(`\nCompleted in ${(duration / 1000).toFixed(1)}s`);
  console.log(`Found ${events.length} events`);

  // Field completeness
  const withImages = events.filter(e => e.imageUrl).length;
  const withPrices = events.filter(e => e.price && e.price !== 'Unknown').length;
  const withZips = events.filter(e => e.zip).length;

  console.log(`\nField Completeness:`);
  console.log(`  Images: ${withImages}/${events.length} (${Math.round(withImages/events.length*100)}%)`);
  console.log(`  Prices: ${withPrices}/${events.length} (${Math.round(withPrices/events.length*100)}%)`);
  console.log(`  Zips: ${withZips}/${events.length} (${Math.round(withZips/events.length*100)}%)`);

  // Sample events with timezone verification
  console.log(`\nSample Events (verify timezone!):`);
  for (const e of events.slice(0, 5)) {
    console.log(`\n${e.title}`);
    console.log(`  UTC:     ${e.startDate.toISOString()}`);
    console.log(`  Eastern: ${e.startDate.toLocaleString('en-US', { timeZone: 'America/New_York' })}`);
    console.log(`  Location: ${e.location || 'N/A'}`);
    console.log(`  Price: ${e.price}`);
  }

  console.log(`\nDebug files saved to: ${DEBUG_DIR}`);
}

main().catch(console.error);

Step 2.4: Add to package.json

"test:yoursource": "npx tsx scripts/scrapers/test-yoursource.ts"

PHASE 3: VALIDATION

Run the test script and verify output:

npm run test:yoursource

Validation Checklist

Timezone correct: Eastern times match expected (7 PM event shows as 7 PM ET)
No HTML entities: Titles/locations decoded (
```
&
```
not
```
&amp;
```
)
No duplicate cities: Location format is clean
Prices reasonable: Mix of Free, $X, Unknown
Zip codes populated: Most events have zips
URLs unique: No duplicates
Future events only: No past dates

PHASE 4: DATABASE TESTING

⚠️ MANDATORY: You MUST Complete This Phase

DO NOT declare production-ready until you have inserted test events into the real database and verified they display correctly.

Scraper output validation alone is NOT sufficient. Database insertion can reveal:

Timezone conversion issues
Field truncation
Constraint violations
Display problems

Step 4.1: Insert Test Events

// scripts/scrapers/test-yoursource-db.ts
import 'dotenv/config';
import { db } from '../lib/db';
import { events } from '../lib/db/schema';
import { eq } from 'drizzle-orm';
import { scrapeYourSource } from '../lib/scrapers/yoursource';

async function main() {
  // Check existing
  const existing = await db.select().from(events).where(eq(events.source, 'YOUR_SOURCE'));
  console.log(`Existing YOUR_SOURCE events: ${existing.length}`);

  // Scrape a few events
  const scraped = await scrapeYourSource();
  const testEvents = scraped.slice(0, 5);

  // Insert
  for (const event of testEvents) {
    await db.insert(events).values({
      ...event,
      tags: [],
      lastSeenAt: new Date(),
    }).onConflictDoUpdate({
      target: events.url,
      set: { lastSeenAt: new Date() },
    });
    console.log(`Inserted: ${event.title}`);
  }

  // Verify - THIS IS THE CRITICAL CHECK
  console.log('\n=== VERIFICATION ===\n');
  const inserted = await db.select().from(events).where(eq(events.source, 'YOUR_SOURCE'));

  for (const e of inserted) {
    console.log(`${e.title}`);
    console.log(`  DB Date:  ${e.startDate}`);
    console.log(`  Eastern:  ${e.startDate.toLocaleString('en-US', { timeZone: 'America/New_York' })}`);
    console.log(`  Location: ${e.location}`);
    console.log(`  Zip:      ${e.zip}`);
    console.log(`  Price:    ${e.price}`);
    console.log('');
  }

  console.log('To cleanup: DELETE FROM events WHERE source = \'YOUR_SOURCE\';');
}

main().catch(console.error);

Step 4.2: Verify Checklist

Events inserted without errors
Dates display correctly in Eastern time
All fields populated as expected
No HTML entities in text
Zip codes present

Step 4.3: Cleanup Test Data

npx tsx -e "
import 'dotenv/config';
import { db } from './lib/db';
import { events } from './lib/db/schema';
import { eq } from 'drizzle-orm';
db.delete(events).where(eq(events.source, 'YOUR_SOURCE')).then(() => console.log('Cleaned up'));
"

PHASE 5: PRODUCTION INTEGRATION

Step 5.1: Update Cron Route

Edit

app/api/cron/scrape/route.ts

// Add import
import { scrapeYourSource } from '@/lib/scrapers/yoursource';

// Add to Promise.allSettled array
const [..., yourSourceResult] = await Promise.allSettled([
  ...,
  scrapeYourSource(),
]);

// Extract results
const yourSourceEvents = yourSourceResult.status === 'fulfilled' ? yourSourceResult.value : [];

// Log failures
if (yourSourceResult.status === 'rejected')
  console.error('[Scrape] YourSource failed:', yourSourceResult.reason);

// Add to stats
stats.scraping.total = ... + yourSourceEvents.length;

// Add to allEvents
const allEvents = [..., ...yourSourceEvents];

// Update log message
console.log(`... YourSource: ${yourSourceEvents.length} ...`);

Step 5.2: Verify TypeScript Compiles

npx tsc --noEmit

PHASE 6: CLEANUP

# Remove debug folder
rm -rf debug-scraper-yoursource

# Remove test DB script if created
rm scripts/scrapers/test-yoursource-db.ts

Integration Checklist

Exploration
- Detected CMS/platform
- Tried known API endpoints
- Tested API parameters (start_date, per_page, page)
- Documented field mapping
- Identified timezone handling approach
Development
- Added source to
```
types.ts
```
- Created scraper file
- Created test script
- Added npm script
Validation
- Timezone verified (Eastern times correct)
- HTML entities decoded
- Location strings clean (no duplicates)
- Field completeness acceptable
Database Testing (MANDATORY)
- Inserted test events
- Verified dates in database
- Confirmed all fields correct
- Cleaned up test data
Production
- Added to cron route
- TypeScript compiles
- Ready for deployment

Common Utilities Reference

Timezone

import { getTodayStringEastern, parseAsEastern } from '@/lib/utils/timezone';

// Get today's date in Eastern (for API start_date param)
const today = getTodayStringEastern(); // "2025-12-16"

// Parse ambiguous local time as Eastern
const date = parseAsEastern('2025-12-25', '19:00:00');

Price Formatting

import { formatPrice } from '@/lib/utils/parsers';

formatPrice(0);        // "Free"
formatPrice(25.50);    // "$26"
formatPrice(null);     // "Unknown"

HTML Entities

import { decodeHtmlEntities } from '@/lib/utils/parsers';

decodeHtmlEntities('Rock &amp; Roll &#8211; Live');
// "Rock & Roll – Live"

Location Filtering

import { isNonNCEvent } from '@/lib/utils/geo';

// Returns true if event should be EXCLUDED (not in NC)
if (isNonNCEvent(event.title, event.location)) continue;

Zip Code Fallbacks

import { getZipFromCoords, getZipFromCity } from '@/lib/utils/geo';

let zip = venue.zip || getZipFromCoords(lat, lng) || getZipFromCity(city);

Troubleshooting

API Returns 403/429

Add realistic headers (User-Agent, Accept, Referer)
Increase delays between requests (200-500ms)
Some APIs require
```
Referer
```
header matching the site

Dates Off by Hours

Check Timezone Decision Tree above
Verify API returns UTC vs local time
Compare API local time with your parsed Eastern time

Duplicate Events

Ensure
```
url
```
is unique per event
For recurring events, append date to URL:
```
${url}#${date}
```

Missing Events

Check pagination (off-by-one errors)
Verify
```
start_date
```
parameter format
API may have max page limit

HTML in Titles/Locations

Apply
```
decodeHtmlEntities()
```
to ALL text fields
Check for
```
<br>
```
,
```
<p>
```
tags that need stripping

Duplicate City in Location

Check if city already in address before appending
Common with APIs that include full address + separate city field