Goose-skills orthogonal-extract-webpage-data
Extract structured data from web pages using AI
install
source · Clone the upstream repo
git clone https://github.com/gooseworks-ai/goose-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/gooseworks-ai/goose-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/capabilities/orthogonal-extract-webpage-data" ~/.claude/skills/gooseworks-ai-goose-skills-orthogonal-extract-webpage-data && rm -rf "$T"
manifest: skills/capabilities/orthogonal-extract-webpage-data/SKILL.md
Extract Webpage Data
Setup
Read your credentials from ~/.gooseworks/credentials.json:
export GOOSEWORKS_API_KEY=$(python3 -c "import json;print(json.load(open('$HOME/.gooseworks/credentials.json'))['api_key'])")
export GOOSEWORKS_API_BASE=$(python3 -c "import json;print(json.load(open('$HOME/.gooseworks/credentials.json')).get('api_base','https://api.gooseworks.ai'))")
If ~/.gooseworks/credentials.json does not exist, tell the user to run:
npx gooseworks login
All endpoints use Bearer auth:
-H "Authorization: Bearer $GOOSEWORKS_API_KEY"
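When the calls are made from a script rather than a shell, the same setup can be done in Python. This is a minimal sketch, assuming the credentials file holds the api_key and optional api_base keys read by the shell commands above. The names load_credentials and auth_headers are hypothetical helpers, not part of any Gooseworks package.

```python
import json
from pathlib import Path

# Fallback base URL, the same default the shell snippet above uses.
DEFAULT_API_BASE = "https://api.gooseworks.ai"


def load_credentials(path=None):
    """Return (api_key, api_base) from a Gooseworks credentials file."""
    path = Path(path) if path else Path.home() / ".gooseworks" / "credentials.json"
    if not path.exists():
        # Mirrors the guidance above: the user has to log in first.
        raise FileNotFoundError(f"{path} not found; run: npx gooseworks login")
    data = json.loads(path.read_text())
    return data["api_key"], data.get("api_base", DEFAULT_API_BASE)


def auth_headers(api_key):
    """Headers sent with every request: Bearer auth plus a JSON content type."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Raising FileNotFoundError with the login hint keeps the "tell the user to run npx gooseworks login" step visible to whatever calls the helper.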
Extract structured data from any web page using AI. Turn messy HTML into clean, organized data.
When to Use
- User wants to extract specific data from a website
- User asks to scrape information from a page
- User needs structured data from unstructured content
- User wants to pull product info, contact details, etc.
- Converting web content to usable data
How It Works
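Every request is a POST to $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run. The JSON body names the upstream api (olostep, scrapegraph, or riveter), the upstream path, and the body to forward, and the chosen API performs the AI-powered extraction.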
Usage
Simple Scrape with Olostep
curl -s -X POST "$GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run" \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"olostep","path":"/v1/scrapes","body":{"url_to_scrape":"https://example.com/products"}}'
AI-Powered Extraction with Scrapegraph
curl -s -X POST "$GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run" \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"scrapegraph","path":"/v1/smartscraper","body":{"website_url":"https://example.com/team","user_prompt":"Extract all team members with their names, titles, and LinkedIn URLs"}}'
Schema-Based Extraction with Riveter
curl -s -X POST "$GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run" \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"riveter","path":"/v1/scrape","body":{"url":"https://example.com","schema":{"name":"string","price":"number","description":"string"}}}'
Get AI Answer from Web
curl -s -X POST "$GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run" \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"olostep","path":"/v1/answers","body":{"task":"Find the pricing for Notion Teams plan from their website"}}'
Crawl Multiple Pages
curl -s -X POST "$GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run" \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"olostep","path":"/v1/crawls","body":{"start_url":"https://example.com","max_pages":10}}'
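All of the calls above share one envelope: an api name, an upstream path, and a body. The sketch below shows that shape in Python using only the standard library. The function names and the 60-second timeout are assumptions made for illustration; only the endpoint path, headers, and envelope keys come from the curl commands.

```python
import json
import urllib.request

PROXY_PATH = "/v1/proxy/orthogonal/run"  # endpoint used by every curl example


def build_proxy_payload(api, path, body):
    """Wrap an upstream call in the envelope used by the curl examples."""
    return {"api": api, "path": path, "body": body}


def build_request(api_base, api_key, payload):
    """Create (but do not send) the POST request for the proxy endpoint."""
    return urllib.request.Request(
        api_base.rstrip("/") + PROXY_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run(api_base, api_key, payload, timeout=60):
    """Send the request and decode the JSON response body.

    The timeout value is an assumption, not a documented limit.
    """
    request = build_request(api_base, api_key, payload)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, build_proxy_payload("olostep", "/v1/scrapes", {"url_to_scrape": "https://example.com/products"}) reproduces the body of the simple scrape call.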
Parameters
Olostep Scrape
- url_to_scrape (required) - URL to scrape
- formats - Output formats (markdown, html, text)
Scrapegraph
- website_url (required) - URL to scrape
- user_prompt (required) - Natural language description of what to extract
Riveter
- url (required) - URL to scrape
- schema - JSON schema defining the data structure to extract
Olostep Answer
- task (required) - Natural language task/question
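Because a missing required field comes back as a 400, it can be worth checking the body before sending it. The following is a small sketch that encodes only the required fields listed above. The REQUIRED table and missing_params are hypothetical names, and paths not listed here (such as /v1/crawls, whose required fields are not documented in this section) are simply not checked.

```python
# Required body fields per (api, path), taken from the Parameters list above.
REQUIRED = {
    ("olostep", "/v1/scrapes"): ("url_to_scrape",),
    ("olostep", "/v1/answers"): ("task",),
    ("scrapegraph", "/v1/smartscraper"): ("website_url", "user_prompt"),
    ("riveter", "/v1/scrape"): ("url",),
}


def missing_params(api, path, body):
    """Return the required fields that are absent or empty in body."""
    required = REQUIRED.get((api, path), ())  # unknown routes: nothing to check
    return [name for name in required if not body.get(name)]
```

An empty list means the body has every field this section marks as required.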
Response
Olostep Response
Returns a scrape object:
- id (string) - Scrape ID (e.g., scrape_z926lxxon3)
- result.markdown_content (string|null) - Page content as markdown
- result.html_content (string|null) - Raw HTML (if requested via formats)
- result.text_content (string|null) - Plain text (if requested)
- result.markdown_hosted_url (string|null) - S3 URL for large content
- result.links_on_page (array) - Links found on the page
- result.screenshot_hosted_url (string|null) - Screenshot URL (if requested)
- result.page_metadata (object) - status_code of the page
- credits_consumed (integer) - Credits used for this scrape
Async crawls: POST /v1/crawls returns an id. Poll with GET /v1/crawls/{id} until complete.
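The polling step can be sketched as a small generic loop. This section does not say how the GET is routed through the proxy or which response field marks a crawl as complete, so the helper below takes both as caller-supplied functions. The name poll_until, the 2-second interval, and the 30-attempt cap are all assumptions.

```python
import time


def poll_until(fetch, is_done, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Call fetch() until is_done(result) is true, then return that result.

    fetch and is_done are supplied by the caller, so this helper makes no
    assumption about how the GET is issued or which field signals completion.
    """
    for attempt in range(max_attempts):
        result = fetch()
        if is_done(result):
            return result
        if attempt < max_attempts - 1:
            sleep(interval)  # wait before the next poll, but not after the last
    raise TimeoutError(f"not complete after {max_attempts} attempts")
```

The sleep parameter exists so the loop can be exercised without real delays, for instance by passing a function that records the requested waits.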
Scrapegraph Response
Returns structured extraction result:
- request_id (string) - Unique request identifier
- status (string) - completed or pending
- result (object) - AI-extracted data matching your prompt (dynamic keys)
- error (string) - Empty on success, error message on failure
Note: For large pages, the POST may return status: "pending". Poll with GET /v1/smartscraper/{request_id} until status is completed.
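Deciding whether to keep polling comes down to reading two fields, status and error. Here is a minimal sketch of that decision, assuming the response has already been decoded into a dict. The function name and the "failed" and "unknown" labels are illustrative, not values returned by Scrapegraph.

```python
def scrapegraph_state(response):
    """Classify a Scrapegraph response.

    Returns 'failed', 'pending', 'completed', or 'unknown' (for any
    status value this section does not describe).
    """
    if response.get("error"):
        # error is empty on success, so any non-empty value means failure.
        return "failed"
    status = response.get("status")
    if status in ("pending", "completed"):
        return status
    return "unknown"
```

A caller would keep polling while the state is "pending" and read the result field once it is "completed".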
Riveter Response
Returns scrape result:
- request_status (string) - success or error
- message (string) - Human-readable status
- text (string) - Extracted page text content
- url (string) - URL that was scraped
- status_code (integer) - HTTP status of the page
- run_key (string) - Unique run identifier
- base_url_for_links (string) - Base URL for resolving relative links
- riveter_app_link (string) - Link to view run in Riveter dashboard
- credit_used (integer) - Credits consumed
Examples
User: "Get all the product names and prices from this page"
curl -s -X POST "$GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run" \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"scrapegraph","path":"/v1/smartscraper","body":{"website_url":"https://example.com/products","user_prompt":"Extract all products with name, price, and description"}}'
User: "Scrape the team page and get everyone's info"
curl -s -X POST "$GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run" \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"scrapegraph","path":"/v1/smartscraper","body":{"website_url":"https://example.com/about/team","user_prompt":"Extract team members: name, role, bio, photo URL, LinkedIn"}}'
User: "What are Stripe's API pricing details?"
curl -s -X POST "$GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run" \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"olostep","path":"/v1/answers","body":{"task":"Find Stripe API pricing breakdown from stripe.com/pricing"}}'
User: "Get all blog post titles and dates from this blog"
curl -s -X POST "$GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run" \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"riveter","path":"/v1/scrape","body":{"url":"https://blog.example.com","schema":{"posts":[{"title":"string","date":"string","url":"string"}]}}}'
Error Handling
- 504 - Olostep timeout on slow pages; retry or try a simpler URL
- 400 - Missing required parameters (url_to_scrape for Olostep, website_url + user_prompt for Scrapegraph, url for Riveter)
- Scrapegraph returns an error field in the response body; check it even on a 200 status
- Riveter returns request_status: "error" with details in message
- Some sites block automated scraping; try a different API if one fails
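Two of these failures are reported inside the response body, not through the HTTP status, so a client has to check both. The sketch below covers only the cases listed above. The function names are hypothetical, and the fallback string returned when Riveter gives no message is an assumption.

```python
def body_error(api, payload):
    """Return an error message found in a response body, or None."""
    if api == "scrapegraph":
        # error is empty on success, even when the HTTP status is 200.
        return payload.get("error") or None
    if api == "riveter" and payload.get("request_status") == "error":
        # Details live in message; the fallback text is illustrative only.
        return payload.get("message") or "request_status is error"
    return None


def is_retryable(http_status):
    """504 is the only status the list above suggests retrying."""
    return http_status == 504
```

When body_error returns a message or a retry still fails, switching to one of the other two APIs is the fallback this section recommends.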
Tips
- Scrapegraph is best for natural language extraction
- Riveter is best when you know the exact schema you want
- Olostep is great for general scraping and AI answers
- For dynamic sites (JavaScript-heavy), these tools handle rendering
- Be specific in your prompts for better extraction results
- Some sites may block automated access