install
source · Clone the upstream repo
git clone https://github.com/ComeOnOliver/skillshub
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/ComeOnOliver/skillshub "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/Harmeet10000/skills/scrape-leads" ~/.claude/skills/comeonoliver-skillshub-scrape-leads && rm -rf "$T"
manifest: skills/Harmeet10000/skills/scrape-leads/SKILL.md
source content
Lead Scraping & Verification
Goal
Scrape leads using Apify (code_crafter/leads-finder), verify their relevance (industry match > 80%), and save them to a Google Sheet. For large scrapes (1000+ leads), use parallel scraping for 3-5x faster performance.
Inputs
- Industry: The target industry (e.g., "Plumbers", "Software Agencies").
- Location: The target location (e.g., "New York", "United States").
- Total Count: The total number of leads desired.
Tools/Scripts
- Script: execution/scrape_apify.py (single scrape, for <1000 leads)
- Script: execution/scrape_apify_parallel.py (parallel scraping, for 1000+ leads)
- Script: execution/update_sheet.py (batch sheet updates, optimized for large datasets)
- Dependencies: Apify API Token, Google Service Account Credentials
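The size threshold in the script list amounts to a simple dispatch. A minimal sketch, assuming a hypothetical helper named `pick_scraper`; only the two script paths and the 1000-lead cutoff come from the text:

```python
def pick_scraper(total_count: int) -> str:
    """Return the scraper script suited to the requested lead count.

    `pick_scraper` is a hypothetical helper, not part of the skill itself.
    """
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    # Below 1000 leads the single scraper is enough; at 1000+ use the parallel one.
    if total_count < 1000:
        return "execution/scrape_apify.py"
    return "execution/scrape_apify_parallel.py"
```

For example, `pick_scraper(250)` selects the single scraper, while `pick_scraper(4000)` selects the parallel one.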
Process
Small Scrapes (<1000 leads)
1. Test Scrape
   - Run execution/scrape_apify.py with max_items=25 and --no-email-filter.
   - Output: .tmp/test_leads.json (temporary file).
2. Verification
   - Agent (You) reads .tmp/test_leads.json.
   - Check if at least 20/25 (80%) leads match the Industry.
   - Decision:
     - Pass: Proceed to step 3.
     - Fail: Stop. Ask user to refine Industry or Location keywords.
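The 20-of-25 rule can be expressed as a small check. This is only a rough sketch: in the workflow the agent reads the file and judges relevance itself, so keyword matching here is a stand-in. It assumes the JSON file is a list of objects and that each lead has an `industry` field; both the field name and the helper names are hypothetical.

```python
import json
from pathlib import Path


def industry_match_rate(leads: list[dict], keywords: list[str]) -> float:
    """Fraction of leads whose (assumed) 'industry' field contains any keyword."""
    if not leads:
        return 0.0
    lowered = [k.lower() for k in keywords]
    hits = sum(
        1
        for lead in leads
        if any(k in str(lead.get("industry", "")).lower() for k in lowered)
    )
    return hits / len(leads)


def passes_verification(
    leads: list[dict], keywords: list[str], threshold: float = 0.8
) -> bool:
    """Pass when at least `threshold` of the sample matches (20 of 25 at 0.8)."""
    return industry_match_rate(leads, keywords) >= threshold


if __name__ == "__main__":
    path = Path(".tmp/test_leads.json")
    if path.exists():
        sample = json.loads(path.read_text())  # assumed: a JSON list of lead objects
        print("pass" if passes_verification(sample, ["plumb"]) else "fail")
```

A failing result corresponds to the "Stop" branch: refine the Industry or Location keywords before running the full scrape.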
3. Full Scrape
   - Run execution/scrape_apify.py with full Total Count and --no-email-filter.
   - Output: .tmp/leads_[timestamp].json (temporary file).
4. [OPTIONAL] LLM Classification for Harder Niches
   - When to use: For complex distinctions (e.g., "product SaaS vs agencies")
   - Command:
     python3 execution/classify_leads_llm.py .tmp/leads_[timestamp].json \
       --classification_type product_saas \
       --output .tmp/classified_leads.json
   - Performance: ~2 minutes for 3,000 leads
   - See classify_leads_llm.md for details
5. Upload to Google Sheet (DELIVERABLE)
   - Run execution/update_sheet.py with the final JSON file (classified or original).
   - Output: Google Sheet URL (this is the actual deliverable the user receives).
6. Enrich Missing Emails
   - Run execution/enrich_emails.py with the Google Sheet URL.
   - Script auto-detects dataset size and uses appropriate API strategy.
   - Output: Updated Google Sheet URL (final deliverable with enriched emails).
Large Scrapes (1000+ leads) - FASTER with Parallel Processing
1. Test Scrape (same as above)
   - Run execution/scrape_apify.py with max_items=25 and --no-email-filter.
   - Verify industry match > 80%.
2. Parallel Full Scrape
   - Run execution/scrape_apify_parallel.py with:
     - --total_count (e.g., 4000)
     - --location (e.g., "United States", "EU", "UK", "Canada", "Australia")
     - --strategy regions (auto-detects based on location)
     - --no-email-filter (scrape without email requirement, enrich after)
   - Geographic Partitioning (Cost-Neutral):
     - Auto-detects region based on location:
       - United States: 4-way (Northeast, Southeast, Midwest, West)
       - EU/Europe: 4-way (Western, Southern, Northern, Eastern)
       - UK: 4-way (SE England, N England, Scotland/Wales, SW England)
       - Canada: 4-way (Ontario, Quebec, West, Atlantic)
       - Australia: 4-way (NSW, VIC/TAS, QLD, WA/SA)
     - Alternative strategies:
       - --strategy metros: 8-way US metro areas
       - --strategy apac: 8-way Asia-Pacific split
       - --strategy global: 8-way worldwide continental split
       - Custom: Comma-separated cities/states (e.g., --location "London,Paris,Berlin,Madrid")
   - Cost: SAME as sequential (4 partitions × 1000 = 4000 total leads)
   - Automatic Deduplication: Handles leads appearing in multiple regions
   - Output: .tmp/leads_[timestamp].json (deduplicated, temporary file).
   - Time Savings: 3-4x faster than sequential, no extra cost.
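The cost-neutral arithmetic and the deduplication step can be illustrated with a short sketch. It is not the logic of scrape_apify_parallel.py, which is not shown here; `split_count`, `dedupe`, and the `email` dedup key are all hypothetical choices made for illustration.

```python
def split_count(total_count: int, partitions: list[str]) -> dict[str, int]:
    """Divide total_count across partitions so the quotas sum to the total."""
    if not partitions:
        raise ValueError("at least one partition is required")
    base, extra = divmod(total_count, len(partitions))
    # The first `extra` partitions absorb the remainder, one lead each.
    return {name: base + (1 if i < extra else 0) for i, name in enumerate(partitions)}


def dedupe(leads: list[dict], key: str = "email") -> list[dict]:
    """Keep the first lead seen per key value; leads lacking the key are kept."""
    seen = set()
    unique = []
    for lead in leads:
        value = lead.get(key)
        if value is not None:
            if value in seen:
                continue  # same lead surfaced in more than one region
            seen.add(value)
        unique.append(lead)
    return unique
```

With the four US regions and a total of 4000, each partition is asked for 1000 leads, so the total requested is the same as in a sequential run.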
3. [OPTIONAL] LLM Classification for Harder Niches
   - When to use: For complex distinctions that keywords can't capture:
     - ✅ "Product SaaS vs IT consulting agencies" (use LLM)
     - ✅ "High-ticket vs low-ticket businesses" (use LLM)
     - ✅ "Subscription vs one-time payment models" (use LLM)
     - ❌ "Dentists" or "Realtors" (simple keyword matching works)
   - Command:
     python3 execution/classify_leads_llm.py .tmp/leads_[timestamp].json \
       --classification_type product_saas \
       --output .tmp/classified_leads.json
   - Performance: ~2 minutes for 3,000 leads, ~$0.30 per 1,000 leads
   - Default behavior: Includes "unclear" classifications (medium confidence)
   - Output: .tmp/classified_leads.json (use this instead of original file for next step)
   - See classify_leads_llm.md for full details
4. Upload to Google Sheet (DELIVERABLE)
   - Run execution/update_sheet.py with the final JSON file (classified or original).
   - Script automatically uses chunked batch updates for datasets >1000 rows.
   - Output: Google Sheet URL (this is the actual deliverable the user receives).
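Chunked batch updates generally mean slicing the rows into fixed-size groups and writing one group per request. A minimal sketch of the slicing only, assuming a chunk size of 1000; the text says chunking applies above 1000 rows but does not state the chunk size update_sheet.py uses, and `chunk_rows` is a hypothetical name.

```python
from collections.abc import Iterator


def chunk_rows(
    rows: list[list[str]], chunk_size: int = 1000  # assumed size, not from the source
) -> Iterator[list[list[str]]]:
    """Yield consecutive slices of `rows`, each at most `chunk_size` long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]
```

Each yielded chunk would then be sent as one batch write, which keeps individual requests small for large datasets.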
5. Enrich Missing Emails (ALWAYS USE BULK API)
   - IMPORTANT: Always run execution/enrich_emails.py in the foreground and wait for completion before notifying the user.
   - Run: python3 execution/enrich_emails.py <SHEET_URL>
   - Bulk API Strategy (200+ rows, PREFERRED):
     - Creates a single AnyMailFinder bulk job for all missing emails
     - Processes ~1000 rows in 5 minutes (much faster than individual calls)
     - Automatically polls until complete
     - Agent must wait until enrichment finishes and sheet is updated
   - Concurrent API Fallback (<200 rows or if bulk fails):
     - Makes up to 20 concurrent individual API calls
     - Automatically used if bulk API fails
   - Output: Updated Google Sheet URL (final deliverable with enriched emails).
   - Workflow: DO NOT notify user until enrichment completes and sheet is updated.
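The strategy selection described above (bulk at 200+ rows, concurrent calls otherwise or on bulk failure) could be sketched as follows. The `bulk` and `concurrent` callables are hypothetical placeholders for whatever enrich_emails.py does internally; no AnyMailFinder API calls are shown or implied.

```python
from typing import Callable

BULK_THRESHOLD = 200  # row count taken from the text above

Enricher = Callable[[list[dict]], list[dict]]


def choose_strategy(missing_rows: int) -> str:
    """Pick 'bulk' for 200+ rows with missing emails, else 'concurrent'."""
    return "bulk" if missing_rows >= BULK_THRESHOLD else "concurrent"


def enrich(rows: list[dict], bulk: Enricher, concurrent: Enricher) -> list[dict]:
    """Run the bulk path when eligible, falling back to concurrent on failure."""
    if choose_strategy(len(rows)) == "bulk":
        try:
            return bulk(rows)
        except Exception:
            # Bulk job failed: fall back to individual concurrent calls.
            return concurrent(rows)
    return concurrent(rows)
```

Because `enrich` returns only after one of the two paths finishes, calling it in the foreground matches the requirement to wait before notifying the user.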
Outputs (Deliverables)
The ONLY deliverable is the Google Sheet URL. This sheet contains all verified leads with company info, contact details, etc.
Important: Local JSON files (.tmp/test_leads.json, .tmp/leads_*.json, .tmp/classified_leads.json) are temporary intermediates used for processing. They are NOT deliverables and should never be presented to the user as final outputs.
Edge Cases
- No leads found: Apify returns empty list. -> Ask user to broaden search.
- API Error: Apify or Google API fails. -> Check credentials in .env.
- Low quality classifications: If >80% classified as "unclear", consider improving scrape keywords or using custom classification prompt.
Error Handling
- Authentication Error: Ensure APIFY_API_TOKEN and GOOGLE_APPLICATION_CREDENTIALS are set.
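A quick pre-flight check for the two variables can catch authentication errors before any scrape starts. A minimal sketch, assuming both values are read from the process environment; `missing_credentials` is a hypothetical helper name.

```python
import os
from collections.abc import Mapping

REQUIRED_VARS = ("APIFY_API_TOKEN", "GOOGLE_APPLICATION_CREDENTIALS")


def missing_credentials(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing credentials: " + ", ".join(missing))
```

An empty list means both variables are present; it does not confirm that the token or credentials file is actually valid.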