Goose-skills orthogonal-extract-webpage-data

Extract structured data from web pages using AI

install

source · Clone the upstream repo

git clone https://github.com/gooseworks-ai/goose-skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/gooseworks-ai/goose-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/capabilities/orthogonal-extract-webpage-data" ~/.claude/skills/gooseworks-ai-goose-skills-orthogonal-extract-webpage-data && rm -rf "$T"

manifest: skills/capabilities/orthogonal-extract-webpage-data/SKILL.md

Extract Webpage Data

Setup

Read your credentials from ~/.gooseworks/credentials.json:

export GOOSEWORKS_API_KEY=$(python3 -c "import json;print(json.load(open('$HOME/.gooseworks/credentials.json'))['api_key'])")
export GOOSEWORKS_API_BASE=$(python3 -c "import json;print(json.load(open('$HOME/.gooseworks/credentials.json')).get('api_base','https://api.gooseworks.ai'))")

If ~/.gooseworks/credentials.json does not exist, tell the user to run:

npx gooseworks login

All endpoints use Bearer auth:

-H "Authorization: Bearer $GOOSEWORKS_API_KEY"
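
The same lookup can be done once from Python rather than per shell session. This is a minimal sketch, assuming the credentials file carries the api_key and optional api_base fields read above. The helper names load_credentials and auth_headers are illustrative and not part of the skill.

```python
import json
from pathlib import Path

# Default taken from the export line above.
DEFAULT_API_BASE = "https://api.gooseworks.ai"


def load_credentials(path=None):
    """Return (api_key, api_base) from the Gooseworks credentials file."""
    path = Path(path) if path else Path.home() / ".gooseworks" / "credentials.json"
    if not path.exists():
        # Mirrors the instruction above: the user has to log in first.
        raise FileNotFoundError(f"{path} not found; run: npx gooseworks login")
    data = json.loads(path.read_text())
    return data["api_key"], data.get("api_base", DEFAULT_API_BASE)


def auth_headers(api_key):
    """Headers sent with every request in this document."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Raising when the file is missing keeps the "run npx gooseworks login" hint next to the failure instead of surfacing later as a 401.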

Extract structured data from any web page using AI. Turn messy HTML into clean, organized data.

When to Use

User wants to extract specific data from a website
User asks to scrape information from a page
User needs structured data from unstructured content
User wants to pull product info, contact details, etc.
User needs web content converted to usable data

How It Works

Uses Olostep, Scrapegraph, or Riveter APIs for AI-powered data extraction.

Usage

Simple Scrape with Olostep

curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"olostep","path":"/v1/scrapes","body":{"url_to_scrape":"https://example.com/products"}}'

AI-Powered Extraction with Scrapegraph

curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"scrapegraph","path":"/v1/smartscraper","body":{"website_url":"https://example.com/team","user_prompt":"Extract all team members with their names, titles, and LinkedIn URLs"}}'

Schema-Based Extraction with Riveter

curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"riveter","path":"/v1/scrape","body":{"url":"https://example.com","schema":{"name":"string","price":"number","description":"string"}}}'

Get AI Answer from Web

curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"olostep","path":"/v1/answers","body":{"task":"Find the pricing for Notion Teams plan from their website"}}'

Crawl Multiple Pages

curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"olostep","path":"/v1/crawls","body":{"start_url":"https://example.com","max_pages":10}}'
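
Every call above posts the same envelope (api, path, body) to one proxy endpoint, so the curl commands can be folded into a small helper. The sketch below uses only the Python standard library. The function names and the 120-second timeout are assumptions made for illustration, not values documented by the skill.

```python
import json
import urllib.request


def build_envelope(api, path, body):
    """Wrap a provider request in the api/path/body envelope used above."""
    return {"api": api, "path": path, "body": body}


def build_request(api_base, api_key, envelope):
    """Build (but do not send) the POST to the proxy endpoint."""
    return urllib.request.Request(
        url=f"{api_base.rstrip('/')}/v1/proxy/orthogonal/run",
        data=json.dumps(envelope).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run(api_base, api_key, envelope, timeout=120):
    """Send the request and return the decoded JSON response.

    The timeout is an assumed value; slow pages may need more.
    """
    req = build_request(api_base, api_key, envelope)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

For example, run(base, key, build_envelope("olostep", "/v1/crawls", {"start_url": "https://example.com", "max_pages": 10})) corresponds to the crawl command above.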

Parameters

Olostep Scrape

url_to_scrape (required) - URL to scrape
formats - Output formats (markdown, html, text)

Scrapegraph

website_url (required) - URL to scrape
user_prompt (required) - Natural language description of what to extract

Riveter

url (required) - URL to scrape
schema - JSON schema defining the data structure to extract

Olostep Answer

task (required) - Natural language task/question
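
Checking the required fields before sending avoids a round trip that would end in a 400. The table below is a sketch built only from the parameter lists above; REQUIRED_FIELDS and missing_fields are hypothetical names. The crawl endpoint is left out because its required fields are not listed in this document.

```python
# Required body fields per (api, path), taken from the Parameters lists above.
REQUIRED_FIELDS = {
    ("olostep", "/v1/scrapes"): ["url_to_scrape"],
    ("scrapegraph", "/v1/smartscraper"): ["website_url", "user_prompt"],
    ("riveter", "/v1/scrape"): ["url"],
    ("olostep", "/v1/answers"): ["task"],
}


def missing_fields(api, path, body):
    """Return the required fields that are absent or empty in body.

    Endpoints not present in REQUIRED_FIELDS are treated as having no
    known required fields, so the result is an empty list for them.
    """
    required = REQUIRED_FIELDS.get((api, path), [])
    return [name for name in required if not body.get(name)]
```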

Response

Olostep Response

Returns a scrape object:

id (string) - Scrape ID (e.g., scrape_z926lxxon3)
result.markdown_content (string|null) - Page content as markdown
result.html_content (string|null) - Raw HTML (if requested via formats)
result.text_content (string|null) - Plain text (if requested)
result.markdown_hosted_url (string|null) - S3 URL for large content
result.links_on_page (array) - Links found on the page
result.screenshot_hosted_url (string|null) - Screenshot URL (if requested)
result.page_metadata (object) - Contains the status_code of the page
credits_consumed (integer) - Credits used for this scrape
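
Because each content field may be null, a caller usually needs to pick whichever one is populated. The sketch below assumes the scrape object has the shape listed above. The preference order (markdown, then text, then HTML, then the hosted URL) is an assumption for illustration, and pick_content is a hypothetical helper.

```python
def pick_content(scrape):
    """Return (field_name, value) for the first populated content field.

    Returns (None, None) when the scrape object carries no content.
    """
    result = scrape.get("result") or {}
    # Prefer inline content; the order here is an assumed preference.
    for field in ("markdown_content", "text_content", "html_content"):
        if result.get(field):
            return field, result[field]
    # Large pages may only carry a hosted URL that must be fetched separately.
    if result.get("markdown_hosted_url"):
        return "markdown_hosted_url", result["markdown_hosted_url"]
    return None, None
```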

Async crawls: POST /v1/crawls returns an id. Poll with GET /v1/crawls/{id} until complete.

Scrapegraph Response

Returns structured extraction result:

request_id (string) - Unique request identifier
status (string) - completed or pending
result (object) - AI-extracted data matching your prompt (dynamic keys)
error (string) - Empty on success, error message on failure

Note: For large pages, the POST may return status: "pending". Poll with GET /v1/smartscraper/{request_id} until status is completed.
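
The polling described in the note can be sketched as a small loop. This document does not show how the GET is sent through the proxy, so the sketch takes a caller-supplied fetch_status function that performs GET /v1/smartscraper/{request_id} and returns decoded JSON. The interval and attempt limit are assumed values, and wait_for_scrapegraph is a hypothetical name.

```python
import time


def wait_for_scrapegraph(first_response, fetch_status, interval=2.0,
                         max_attempts=30, sleep=time.sleep):
    """Poll until status is "completed" and return the result object.

    fetch_status(request_id) is supplied by the caller (assumption).
    Raises RuntimeError on a non-empty error field and TimeoutError
    when the request is still pending after max_attempts polls.
    """
    request_id = first_response.get("request_id")
    response = first_response
    attempts = 0
    while True:
        # The error field is empty on success, so any truthy value is a failure.
        if response.get("error"):
            raise RuntimeError(f"Scrapegraph error: {response['error']}")
        if response.get("status") == "completed":
            return response.get("result")
        if attempts >= max_attempts:
            raise TimeoutError("Scrapegraph request still pending after polling")
        sleep(interval)
        response = fetch_status(request_id)
        attempts += 1
```

Passing sleep as a parameter keeps the loop testable without real delays.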

Riveter Response

Returns scrape result:

request_status (string) - success or error
message (string) - Human-readable status
text (string) - Extracted page text content
url (string) - URL that was scraped
status_code (integer) - HTTP status of the page
run_key (string) - Unique run identifier
base_url_for_links (string) - Base URL for resolving relative links
riveter_app_link (string) - Link to view run in Riveter dashboard
credit_used (integer) - Credits consumed

Examples

User: "Get all the product names and prices from this page"

curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"scrapegraph","path":"/v1/smartscraper","body":{"website_url":"https://example.com/products","user_prompt":"Extract all products with name, price, and description"}}'

User: "Scrape the team page and get everyone's info"

curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"scrapegraph","path":"/v1/smartscraper","body":{"website_url":"https://example.com/about/team","user_prompt":"Extract team members: name, role, bio, photo URL, LinkedIn"}}'

User: "What are Stripe's API pricing details?"

curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"olostep","path":"/v1/answers","body":{"task":"Find Stripe API pricing breakdown from stripe.com/pricing"}}'

User: "Get all blog post titles and dates from this blog"

curl -s -X POST $GOOSEWORKS_API_BASE/v1/proxy/orthogonal/run \
  -H "Authorization: Bearer $GOOSEWORKS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"api":"riveter","path":"/v1/scrape","body":{"url":"https://blog.example.com","schema":{"posts":[{"title":"string","date":"string","url":"string"}]}}}'

Error Handling

504 - Olostep timeout on slow pages — retry or try a simpler URL
400 - Missing required parameters (url_to_scrape for Olostep, website_url + user_prompt for Scrapegraph, url for Riveter)
Scrapegraph returns an error field in the response body; check it even on a 200 status
Riveter returns request_status: "error" with details in message
Some sites block automated scraping — try a different API if one fails
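
The rules above can be combined into one check plus a fallback across providers. This is a sketch under two assumptions: the response body reaches the caller in the provider's own shape as described in the Response section, and run(api, path, body) is a caller-supplied function that raises on HTTP errors such as 504. Both function names are hypothetical.

```python
def extraction_error(api, body):
    """Return an error message if a 200 response body reports failure, else None."""
    if api == "scrapegraph" and body.get("error"):
        return body["error"]
    if api == "riveter" and body.get("request_status") == "error":
        return body.get("message") or "riveter reported an error"
    return None


def first_success(attempts, run):
    """Try (api, path, body) tuples in order; return (api, response) for the first success.

    run(api, path, body) is supplied by the caller (assumption) and is
    expected to raise on HTTP-level failures.
    """
    failures = []
    for api, path, body in attempts:
        try:
            response = run(api, path, body)
        except Exception as exc:  # e.g. a 504 timeout or a blocked site
            failures.append((api, str(exc)))
            continue
        problem = extraction_error(api, response)
        if problem is None:
            return api, response
        failures.append((api, problem))
    raise RuntimeError(f"all attempts failed: {failures}")
```

Collecting the failures keeps the reason for each skipped provider available when every attempt fails.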

Tips

Scrapegraph is best for natural language extraction
Riveter is best when you know the exact schema you want
Olostep is great for general scraping and AI answers
For dynamic sites (JavaScript-heavy), these tools handle rendering
Be specific in your prompts for better extraction results
Some sites may block automated access