Goose-skills conference-speaker-scraper

Install

Source · Clone the upstream repo
git clone https://github.com/gooseworks-ai/goose-skills

Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/gooseworks-ai/goose-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/capabilities/conference-speaker-scraper" ~/.claude/skills/gooseworks-ai-goose-skills-conference-speaker-scraper && rm -rf "$T"

Manifest: skills/capabilities/conference-speaker-scraper/SKILL.md

Conference Speaker Scraper

Extract speaker names, titles, companies, and bios from conference website /speakers pages. Supports direct HTML scraping with multiple extraction strategies, plus Apify fallback for JS-heavy sites.

Quick Start

No API key needed for direct scraping mode.

# Scrape speakers from a conference page
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers"

# Use Apify for JS-heavy sites
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers" --mode apify

# Custom conference name (otherwise inferred from URL)
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py \
  --url "https://example.com/speakers" --conference "Sage Future 2026"

# Output formats
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output json     # default
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output csv
python3 skills/conference-speaker-scraper/scripts/scrape_speakers.py --url URL --output summary

How It Works

Direct Mode (default)

Fetches the page HTML and tries multiple extraction strategies in order, using whichever returns the most results:

  1. Strategy A -- CSS class hints: looks for speaker cards with class names containing "speaker", "presenter", "faculty", "panelist", or "team-member"
  2. Strategy B -- Heading + paragraph patterns: looks for repeated <h2>/<h3> + <p> structures
  3. Strategy C -- JSON-LD structured data: checks for <script type="application/ld+json"> blocks with speaker data
  4. Strategy D -- Platform embeds: detects Sched.com/Sessionize patterns used by many conferences
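Strategy C, for example, can be implemented with the standard library alone. This is a minimal illustrative sketch, not the skill's actual code, and it assumes speakers are published as schema.org Person objects:

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the text content of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_ld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_ld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_ld = False

    def handle_data(self, data):
        if self._in_ld:
            self.blocks[-1] += data


def extract_speakers(html):
    """Return speaker dicts for every JSON-LD Person entry found in the page."""
    parser = JsonLdExtractor()
    parser.feed(html)
    speakers = []
    for block in parser.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # malformed JSON-LD: skip, other strategies may still hit
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and item.get("@type") == "Person":
                works_for = item.get("worksFor") or {}
                speakers.append({
                    "name": item.get("name"),
                    "title": item.get("jobTitle"),
                    "company": works_for.get("name") if isinstance(works_for, dict) else works_for,
                })
    return speakers
```

Real pages nest Person objects under Event or @graph wrappers, so a production version needs deeper traversal; this only shows the core idea.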

Apify Mode

Uses the apify/cheerio-scraper actor with a custom page function that targets common speaker card selectors, following the standard POST/poll/GET dataset pattern.
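The POST/poll/GET dataset pattern can be sketched as below. The endpoint paths follow Apify's v2 REST API as publicly documented, but the skill's actual script may construct them differently; the injected `http` callable is an illustrative device so the flow can be shown (and tested) without network access:

```python
import time

API = "https://api.apify.com/v2"  # Apify v2 REST API base (assumption: paths below match it)


def run_actor(actor, payload, token, http, timeout=300, interval=5):
    """Start an actor run, poll until it reaches a terminal state, fetch its dataset.

    `http(method, url, json=None)` is an injected callable returning parsed JSON.
    `actor` uses Apify's URL form, e.g. "apify~cheerio-scraper".
    """
    # POST: start the run
    run = http("POST", f"{API}/acts/{actor}/runs?token={token}", json=payload)
    run_id = run["data"]["id"]

    # Poll: wait for a terminal status, up to `timeout` seconds
    deadline = time.time() + timeout
    status = None
    while time.time() < deadline:
        status = http("GET", f"{API}/actor-runs/{run_id}?token={token}")
        if status["data"]["status"] in ("SUCCEEDED", "FAILED", "ABORTED"):
            break
        time.sleep(interval)

    # GET: download the run's default dataset items
    dataset_id = status["data"]["defaultDatasetId"]
    return http("GET", f"{API}/datasets/{dataset_id}/items?token={token}")
```

A real caller would pass an `http` built on urllib or requests; error handling (FAILED runs, timeouts) is omitted for brevity.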

CLI Reference

Flag          Default   Description
--url         required  Conference speakers page URL
--conference  inferred  Conference name (otherwise inferred from the URL domain)
--mode        direct    direct (HTML scraping) or apify (Apify cheerio scraper)
--output      json      Output format: json, csv, or summary
--token       env var   Apify token (only needed for apify mode)
--timeout     300       Max seconds for the Apify run

Output Schema

{
  "name": "Jane Smith",
  "title": "VP of Finance",
  "company": "Acme Corp",
  "bio": "Jane leads the finance transformation at...",
  "linkedin_url": "https://linkedin.com/in/janesmith",
  "image_url": "https://...",
  "conference": "Sage Future 2026",
  "source_url": "https://sagefuture2026.com/speakers"
}
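Records in this schema map directly onto the --output csv format. A minimal sketch of that conversion (illustrative only, not the skill's code; the field order is taken from the schema above):

```python
import csv
import io

# Field order mirrors the output schema above
FIELDS = ["name", "title", "company", "bio", "linkedin_url",
          "image_url", "conference", "source_url"]


def to_csv(speakers):
    """Serialize speaker dicts to CSV text; missing fields become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(speakers)
    return buf.getvalue()
```

DictWriter handles quoting for commas and newlines in bios, which is why a hand-rolled join is not enough here.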

Cost

  • Direct mode: Free (no API, no tokens)
  • Apify mode: Uses apify/cheerio-scraper -- minimal Apify credits

Testing Notes

HTML scraping is inherently fragile across conference sites. The multi-strategy approach maximizes coverage, but JS-heavy sites will require Apify mode. When direct scraping returns 0 results, retry with --mode apify.
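That retry advice can be automated in a small wrapper. A hypothetical sketch with injected scraper callables (the function names are illustrative, not part of the skill):

```python
def scrape_with_fallback(url, direct, apify):
    """Try direct HTML scraping first; fall back to Apify when it finds nothing.

    `direct` and `apify` are callables (url -> list of speaker dicts), injected
    so the fallback logic itself stays testable offline.
    """
    speakers = direct(url)
    if speakers:
        return speakers, "direct"
    # 0 results from direct mode usually means a JS-rendered page
    return apify(url), "apify"
```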