# Awesome-omni-skill: shot-scraper
Automated web scraping with shot-scraper (a Playwright-based CLI) via GitHub Actions: extract structured data from websites and export it to JSON/SQLite for Datasette. Use when you need to periodically scrape web data, set up automated data-collection workflows, extract structured information from pages by executing JavaScript, or create data pipelines that run on schedules.
```bash
# Clone the whole repository
git clone https://github.com/diegosouzapw/awesome-omni-skill

# Or copy just this skill into ~/.claude/skills
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/development/shot-scraper" \
       ~/.claude/skills/diegosouzapw-awesome-omni-skill-shot-scraper \
  && rm -rf "$T"
```
Skill file: `skills/development/shot-scraper/SKILL.md`
## Quick Start

Extract data by executing JavaScript on a webpage and save the result as JSON:
```bash
# Install
pip install shot-scraper sqlite-utils
shot-scraper install

# Scrape data
shot-scraper javascript https://example.com/news "({
  articles: Array.from(document.querySelectorAll('article')).map(a => ({
    title: a.querySelector('h2')?.innerText,
    link: a.querySelector('a')?.href,
    date: a.querySelector('.date')?.innerText
  })),
  timestamp: new Date().toISOString()
})" -o data.json

# Import to SQLite (Datasette format)
sqlite-utils insert data.db articles data.json --pk=link --alter
```
## Core Workflow
- **Write JavaScript extractor** - Returns JSON object or array
- **Run via GitHub Actions** - Scheduled or on-demand
- **Import to SQLite** - Use `sqlite-utils` for Datasette compatibility
- **Commit results** - Track data changes over time
## JavaScript Execution

### Basic extraction

```bash
shot-scraper javascript URL "document.title"
shot-scraper javascript URL -i script.js -o output.json
```
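With `-i script.js`, shot-scraper evaluates the file's contents in the page and writes the result as JSON. A minimal sketch of such a file (the `.card` selectors are hypothetical placeholders, not from any real page):

```javascript
// script.js - a single expression; its value becomes the JSON output.
// All selectors here are hypothetical; adapt them to the target page.
({
  scrapedAt: new Date().toISOString(),
  cards: Array.from(document.querySelectorAll('.card')).map(card => ({
    heading: card.querySelector('h3')?.innerText?.trim(),
    url: card.querySelector('a')?.href
  }))
})
```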
### Return JSON objects

```javascript
// Wrap object literals in parentheses
({
  title: document.title,
  items: Array.from(document.querySelectorAll('.item')).map(i => i.innerText)
})
```
### Handle dynamic content

```javascript
new Promise(done => {
  setTimeout(() => {
    done({ data: document.querySelector('.dynamic')?.innerText });
  }, 2000);
});
```
### Common options

- `--wait MILLISECONDS` - Wait before executing
- `--timeout MILLISECONDS` - Max execution time
- `--auth FILE` - Authentication context
- `--user-agent STRING` - Custom user agent
- `--log-console` - Show `console.log` output (see the sketch after this list)
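For example, `--log-console` pairs well with `console.log` calls when debugging selectors, since the logged output shows up in the terminal and in CI logs. A minimal sketch (`debug.js` is a hypothetical file name), using an IIFE so the whole script stays a single expression:

```javascript
// Run with: shot-scraper javascript URL -i debug.js --log-console
(() => {
  // Log an intermediate count so broken selectors are easy to spot.
  const rows = document.querySelectorAll('table tr');
  console.log(`matched ${rows.length} rows`);
  return { rowCount: rows.length };
})()
```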
## GitHub Actions Integration
Basic scheduled scraper workflow:
```yaml
name: Scrape Website

on:
  schedule:
    - cron: '0 6 * * *'  # Daily at 6 AM
  workflow_dispatch:

jobs:
  scrape:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install tools
        run: |
          pip install shot-scraper sqlite-utils
          shot-scraper install
      - name: Scrape data
        run: |
          shot-scraper javascript https://example.com \
            -i scrape.js \
            -o data/output-$(date +%Y%m%d).json
      - name: Import to database
        run: |
          sqlite-utils insert data/scraper.db records \
            data/output-*.json \
            --pk=id --alter
      - name: Commit results
        run: |
          git config user.name "Bot"
          git config user.email "bot@example.com"
          git add data/
          git commit -m "Update $(date +%Y-%m-%d)" || exit 0
          git push
```
## Data Export Patterns

### Single record per run

```javascript
({
  timestamp: new Date().toISOString(),
  total: document.querySelectorAll('.item').length
})
```

```bash
sqlite-utils insert data.db snapshots output.json --alter
```
### Multiple records per run

```javascript
Array.from(document.querySelectorAll('.product')).map(p => ({
  id: p.dataset.id,
  name: p.querySelector('.name')?.innerText,
  price: parseFloat(p.querySelector('.price')?.innerText.replace(/[^0-9.]/g, '')),
  timestamp: new Date().toISOString()
}))
```

```bash
sqlite-utils insert data.db products output.json --pk=id --replace
```
### With nested data

```bash
# Flatten nested objects
sqlite-utils insert data.db items output.json --flatten --alter
```
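To make `--flatten` concrete: an extractor returning nested objects like the sketch below (hypothetical selectors) yields `seller_name` and `seller_rating` columns after import, since `--flatten` joins nested keys with underscores:

```javascript
// Nested `seller` objects become seller_name / seller_rating columns
// after `sqlite-utils insert ... --flatten`. Selectors are hypothetical.
Array.from(document.querySelectorAll('.listing')).map(el => ({
  id: el.dataset.id,
  title: el.querySelector('.title')?.innerText,
  seller: {
    name: el.querySelector('.seller-name')?.innerText,
    rating: parseFloat(el.querySelector('.rating')?.innerText ?? '')
  }
}))
```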
## Common Patterns
Extract tables:
```javascript
Array.from(document.querySelectorAll('table tr')).map(row => {
  const cells = row.querySelectorAll('td');
  return { col1: cells[0]?.innerText, col2: cells[1]?.innerText };
})
```
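A variant that derives column names from the header row instead of hard-coding `col1`/`col2` (assumes the table's first row uses `<th>` cells):

```javascript
// Build one object per body row, keyed by the trimmed header-cell text.
(() => {
  const headers = Array.from(document.querySelectorAll('table th'))
    .map(th => th.innerText.trim());
  return Array.from(document.querySelectorAll('table tbody tr')).map(row =>
    Object.fromEntries(
      Array.from(row.querySelectorAll('td'))
        .map((td, i) => [headers[i] ?? `col${i + 1}`, td.innerText.trim()])
    )
  );
})()
```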
Parse prices/numbers:
```javascript
parseFloat(text.replace(/[^0-9.]/g, ''))
```
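This strips currency symbols and thousands separators (so `$1,234.56` parses to `1234.56`), but it also drops minus signs and yields `NaN` when no digits remain. A slightly more defensive sketch (`parsePrice` is a hypothetical helper; locale formats like `1.234,56` would still need special handling):

```javascript
// Hypothetical helper: keeps digits, dots, and minus signs, and maps
// unparseable input to null instead of NaN.
const parsePrice = (text) => {
  const value = parseFloat((text ?? '').replace(/[^0-9.\-]/g, ''));
  return Number.isFinite(value) ? value : null;
};

parsePrice('-$1,234.56')  // => -1234.56
parsePrice('sold out')    // => null
```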
Wait for element:
```javascript
new Promise(done => {
  const check = setInterval(() => {
    const el = document.querySelector('.target');
    if (el) {
      clearInterval(check);
      done({ data: el.innerText });
    }
  }, 100);
  setTimeout(() => {
    clearInterval(check);
    done({ error: 'timeout' });
  }, 10000);
});
```
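An alternative to polling is a `MutationObserver`, which resolves as soon as the element is added rather than on the next 100 ms tick; a sketch with the same result shape and 10-second timeout:

```javascript
new Promise(done => {
  const finish = result => { observer.disconnect(); done(result); };
  const observer = new MutationObserver(() => {
    const el = document.querySelector('.target');
    if (el) finish({ data: el.innerText });
  });
  observer.observe(document.body, { childList: true, subtree: true });
  // Handle the case where the element is already present.
  const el = document.querySelector('.target');
  if (el) finish({ data: el.innerText });
  // Extra done() calls after resolution are ignored by the Promise.
  setTimeout(() => finish({ error: 'timeout' }), 10000);
});
```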
## sqlite-utils Commands

```bash
# Insert with primary key
sqlite-utils insert db.db table data.json --pk=id

# Replace existing records
sqlite-utils insert db.db table data.json --pk=id --replace

# Auto-alter schema to fit new data
sqlite-utils insert db.db table data.json --alter

# Create indexes for Datasette
sqlite-utils create-index db.db table column_name

# Query data
sqlite-utils query db.db "SELECT * FROM table" --csv
```
## Advanced Topics
**Authentication:** Store `auth.json` as a GitHub secret and write it to a file in the workflow:

```yaml
# Fragment of a workflow step
env:
  AUTH_JSON: ${{ secrets.AUTH_JSON }}
run: |
  echo "$AUTH_JSON" > auth.json
  shot-scraper javascript URL -a auth.json -i script.js -o data.json
```
**Error handling:** Add retry logic to the workflow (see workflows.md)

**Multiple pages:** Process a URL list (see examples.md)

**Screenshots:** Take screenshots alongside the data for verification:

```bash
shot-scraper URL -o screenshot.png
shot-scraper javascript URL -i script.js -o data.json
```
## References
- examples.md - Complete workflow examples (price tracking, monitoring, error handling)
- scraping-patterns.md - JavaScript patterns for common scraping scenarios
- workflows.md - Advanced GitHub Actions patterns
## External Resources
- shot-scraper: https://github.com/simonw/shot-scraper
- sqlite-utils: https://sqlite-utils.datasette.io/
- Datasette: https://datasette.io/