# Goose-skills · conference-speaker-scraper

## Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/gooseworks-ai/goose-skills
```

Claude Code · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/gooseworks-ai/goose-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/capabilities/conference-speaker-scraper" ~/.claude/skills/gooseworks-ai-goose-skills-conference-speaker-scraper && rm -rf "$T"
```

Manifest: `skills/capabilities/conference-speaker-scraper/SKILL.md`
# Conference Speaker Scraper
Extract speaker names, titles, companies, and bios from conference website /speakers pages. Supports direct HTML scraping with multiple extraction strategies, plus Apify fallback for JS-heavy sites.
## Quick Start

No API key needed for direct scraping mode.

```bash
# Scrape speakers from a conference page
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers"

# Use Apify for JS-heavy sites
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers" --mode apify

# Custom conference name (otherwise inferred from URL)
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers" --conference "Sage Future 2026"

# Output formats
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output json  # default
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output csv
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output summary
```
## How It Works

### Direct Mode (default)

Fetches the page HTML and tries multiple extraction strategies in order, using whichever returns the most results:
- Strategy A -- CSS class hints: Looks for speaker cards with class names containing "speaker", "presenter", "faculty", "panelist", "team-member"
- Strategy B -- Heading + paragraph patterns: Looks for repeated `<h2>`/`<h3>` + `<p>` structures
- Strategy C -- JSON-LD structured data: Checks for `<script type="application/ld+json">` blocks with speaker data
- Strategy D -- Platform embeds: Detects Sched.com/Sessionize patterns used by many conferences
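Strategy A can be sketched with the standard library's `HTMLParser`. The class-name hints below mirror the list above; the parser itself (`SpeakerCardParser`) is an illustrative simplification, not the skill's actual implementation:

```python
from html.parser import HTMLParser

# Class-name hints from Strategy A.
SPEAKER_CLASS_HINTS = ("speaker", "presenter", "faculty", "panelist", "team-member")

class SpeakerCardParser(HTMLParser):
    """Collects the text of any element whose class matches a hint."""

    def __init__(self):
        super().__init__()
        self.cards = []     # one text blob per detected speaker card
        self._depth = 0     # >0 while inside a card
        self._buf = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        if self._depth == 0 and any(h in classes for h in SPEAKER_CLASS_HINTS):
            self._depth = 1     # entered a speaker card
            self._buf = []
        elif self._depth:
            self._depth += 1    # nested element inside the card

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.cards.append(" ".join(self._buf))

    def handle_data(self, data):
        if self._depth and data.strip():
            self._buf.append(data.strip())

sample_html = """
<div class="speaker-card"><h3>Jane Smith</h3><p>VP of Finance, Acme Corp</p></div>
<div class="speaker-card"><h3>Bob Lee</h3><p>CTO, Widgets Inc</p></div>
"""
parser = SpeakerCardParser()
parser.feed(sample_html)
print(parser.cards)
```

Real pages need more care than this sketch (void tags like `<img>` never emit an end tag and would skew the depth counter), which is part of why the skill layers several strategies.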
### Apify Mode

Uses the `apify/cheerio-scraper` actor with a custom page function that targets common speaker card selectors. Standard POST/poll/GET dataset pattern.
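The POST/poll/GET pattern might look like the following stdlib-only sketch. The endpoint shapes follow Apify's public v2 API, but the input payload is simplified (a real `cheerio-scraper` run also needs a `pageFunction`), and `scrape_with_apify` is a hypothetical helper, not the skill's code:

```python
import json
import time
import urllib.request

API = "https://api.apify.com/v2"
ACTOR = "apify~cheerio-scraper"   # "/" becomes "~" in Apify actor URLs

def run_url(token: str) -> str:
    return f"{API}/acts/{ACTOR}/runs?token={token}"

def scrape_with_apify(start_url: str, token: str, timeout: int = 300) -> list:
    """POST a run, poll until it finishes, then GET the default dataset."""
    # Simplified input: the skill's real run also supplies a pageFunction.
    payload = json.dumps({"startUrls": [{"url": start_url}]}).encode()
    req = urllib.request.Request(run_url(token), data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:       # POST: start the run
        run = json.load(resp)["data"]

    deadline = time.time() + timeout
    while run["status"] not in ("SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"):
        if time.time() > deadline:
            raise TimeoutError("Apify run exceeded timeout")
        time.sleep(5)                               # poll: re-read run status
        with urllib.request.urlopen(f"{API}/actor-runs/{run['id']}?token={token}") as resp:
            run = json.load(resp)["data"]

    # GET: fetch the run's default dataset items.
    items = f"{API}/datasets/{run['defaultDatasetId']}/items?token={token}"
    with urllib.request.urlopen(items) as resp:
        return json.load(resp)
```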
## CLI Reference

| Flag | Default | Description |
|---|---|---|
| `--url` | required | Conference speakers page URL |
| `--conference` | inferred | Conference name (otherwise inferred from URL domain) |
| `--mode` | `direct` | `direct` (HTML scraping) or `apify` (Apify cheerio scraper) |
| `--output` | `json` | Output format: `json`, `csv`, or `summary` |
| | env var | Apify token (only needed for `apify` mode) |
| | 300 | Max seconds for Apify run |
## Output Schema

```json
{
  "name": "Jane Smith",
  "title": "VP of Finance",
  "company": "Acme Corp",
  "bio": "Jane leads the finance transformation at...",
  "linkedin_url": "https://linkedin.com/in/janesmith",
  "image_url": "https://...",
  "conference": "Sage Future 2026",
  "source_url": "https://sagefuture2026.com/speakers"
}
```
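Records with this schema are flat, so downstream conversion is trivial. As a sketch, a CSV conversion over the documented fields (`to_csv` is illustrative, not the skill's `--output csv` implementation):

```python
import csv
import io

# Field order matches the documented output schema.
FIELDS = ["name", "title", "company", "bio", "linkedin_url",
          "image_url", "conference", "source_url"]

def to_csv(speakers: list[dict]) -> str:
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(speakers)
    return buf.getvalue()

record = {
    "name": "Jane Smith", "title": "VP of Finance", "company": "Acme Corp",
    "bio": "Jane leads the finance transformation at...",
    "linkedin_url": "https://linkedin.com/in/janesmith", "image_url": "https://...",
    "conference": "Sage Future 2026",
    "source_url": "https://sagefuture2026.com/speakers",
}
print(to_csv([record]))
```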
## Cost

- Direct mode: Free (no API, no tokens)
- Apify mode: Uses `apify/cheerio-scraper` -- minimal Apify credits
## Testing Notes

HTML scraping is inherently fragile across conference sites. The multi-strategy approach maximizes coverage, but JS-heavy sites will require Apify mode. When direct scraping returns 0 results, try `--mode apify`.
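A wrapper that automates that fallback might look like this. `build_cmd` and `scrape` are hypothetical helpers around the documented CLI, assuming the script prints a JSON list on stdout:

```python
import json
import subprocess

SCRIPT = "skills/conference-speaker-scraper/scripts/scrape_speakers.py"

def build_cmd(url: str, mode: str = "direct") -> list[str]:
    # Mirrors the documented flags: --url, --mode, --output.
    return ["python3", SCRIPT, "--url", url, "--mode", mode, "--output", "json"]

def scrape(url: str) -> list:
    """Try direct mode first; fall back to --mode apify on zero results."""
    for mode in ("direct", "apify"):
        result = subprocess.run(build_cmd(url, mode),
                                capture_output=True, text=True, check=True)
        speakers = json.loads(result.stdout)
        if speakers:            # non-empty list -> done
            return speakers
    return []
```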