Awesome-omni-skills web-scraper

Web Scraper workflow skill. Use this skill when the user needs intelligent multi-strategy web scraping: extracting structured data from web pages (tables, lists, prices), with pagination, monitoring, and CSV/JSON export. The operator should preserve the upstream workflow, copied support files, and provenance before merging or handing off.

install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/web-scraper" ~/.claude/skills/diegosouzapw-awesome-omni-skills-web-scraper && rm -rf "$T"
manifest: skills/web-scraper/SKILL.md
source content

Web Scraper

Overview

This public intake copy packages `plugins/antigravity-awesome-skills-claude/skills/web-scraper` from https://github.com/sickn33/antigravity-awesome-skills into the native Omni Skills editorial shape without hiding its origin.

Use it when the operator needs the upstream workflow, support files, and repository context to stay intact while the public validator and private enhancer continue their normal downstream flow.

This intake keeps the copied upstream files intact and uses `metadata.json` plus `ORIGIN.md` as the provenance anchor for review.

Web Scraper

Imported source sections that did not map cleanly to the public headings are still preserved below or in the support files. Notable imported sections: How It Works, Capabilities, Web Scraper, Phase 1: Clarify, Required Parameters, Optional Parameters.

When to Use This Skill

Use this section as the trigger filter. It should make the activation boundary explicit before the operator loads files, runs commands, or opens a pull request.

  • When the user mentions "scraper" or related topics
  • When the user mentions "scraping" or related topics
  • When the user mentions "extrair dados web" or related topics
  • When the user mentions "web scraping" or related topics
  • When the user mentions "raspar dados" or related topics
  • When the user mentions "coletar dados site" or related topics

Operating Table

| Situation | Start here | Why it matters |
| :--- | :--- | :--- |
| First-time use | `metadata.json` | Confirms repository, branch, commit, and imported path before touching the copied workflow |
| Provenance review | `ORIGIN.md` | Gives reviewers a plain-language audit trail for the imported source |
| Workflow execution | `references/data-transforms.md` | Starts with the smallest copied file that materially changes execution |
| Supporting context | `references/extraction-patterns.md` | Adds the next most relevant copied source file without loading the entire package |
| Handoff decision | ## Related Skills | Helps the operator switch to a stronger native skill when the task drifts |

Workflow

This workflow is intentionally editorial and operational at the same time. It keeps the imported source useful to the operator while still satisfying the public intake standards that feed the downstream enhancer flow.

  1. Clarify: establish extraction parameters (URL, data target, format, scope) before touching any URL
  2. Recon: analyze the target page structure before extraction
  3. Strategy: choose the extraction approach based on recon results
  4. Extract: execute the selected strategy using mode-specific patterns
  5. Transform: clean, normalize, and enrich the extracted data
  6. Validate: verify extraction quality and assign a confidence rating
  7. Format: structure and deliver results according to user preference

Imported Workflow Notes

Imported: Step 2.1: Initial Fetch

Use WebFetch to retrieve and analyze the page structure:

WebFetch(
  url = TARGET_URL,
  prompt = "Analyze this page structure and report:
    1. Page type: article, product listing, search results, data table,
       directory, dashboard, API docs, FAQ, pricing page, job board, events, or other
    2. Main content structure: tables, ordered/unordered lists, card grid, free-form text,
       accordion/collapsible sections, tabs
    3. Approximate number of distinct data items visible
    4. JavaScript rendering indicators: empty containers, loading spinners,
       SPA framework markers (React root, Vue app, Angular), minimal HTML with heavy JS
    5. Pagination: next/prev links, page numbers, load-more buttons,
       infinite scroll indicators, total results count
    6. Data density: how much structured, extractable data exists
    7. List the main data fields/columns available for extraction
    8. Embedded structured data: JSON-LD, microdata, OpenGraph tags
    9. Available download links: CSV, Excel, PDF, API endpoints"
)

Imported: Step 2.2: Evaluate Fetch Quality

| Signal | Interpretation | Action |
| :--- | :--- | :--- |
| Rich content with data clearly visible | Static page | Strategy A (WebFetch) |
| Empty containers, "loading...", minimal text | JS-rendered | Strategy B (Browser) |
| Login wall, CAPTCHA, 403/401 response | Blocked | Report to user |
| Content present but poorly structured | Needs precision | Strategy B (Browser) |
| JSON or XML response body | API endpoint | Strategy C (Bash/curl) |
| Download links for CSV/Excel available | Direct data file | Strategy C (download) |

Imported: Step 2.3: Content Classification

Classify into an extraction mode:

| Mode | Indicators | Examples |
| :--- | :--- | :--- |
| table | HTML `<table>`, grid layout with headers | Price comparison, statistics, specs |
| list | Repeated similar elements, card grids | Search results, product listings |
| article | Long-form text with headings/paragraphs | Blog post, news article, docs |
| product | Product name, price, specs, images, rating | E-commerce product page |
| contact | Names, emails, phones, addresses, roles | Team page, staff directory |
| faq | Question-answer pairs, accordions | FAQ page, help center |
| pricing | Plan names, prices, features, tiers | SaaS pricing page |
| events | Dates, locations, titles, descriptions | Event listings, conferences |
| jobs | Titles, companies, locations, salaries | Job boards, career pages |
| custom | User specified CSS selectors or fields | Anything not matching above |

Record: page type, extraction mode, JS rendering needed (yes/no), available fields, structured data present (JSON-LD etc.).

If user asked for "everything", present the available fields and let them choose.


Imported: Overview

Intelligent multi-strategy web scraping. Extracts structured data from web pages (tables, lists, prices). Supports pagination, monitoring, and CSV/JSON export.

Imported: How It Works

Execute phases in strict order. Each phase feeds the next.

1. CLARIFY  ->  2. RECON  ->  3. STRATEGY  ->  4. EXTRACT  ->  5. TRANSFORM  ->  6. VALIDATE  ->  7. FORMAT

Never skip Phase 1 or Phase 2. They prevent wasted effort and failed extractions.

Fast path: If user provides URL + clear data target + the request is simple (single page, one data type), compress Phases 1-3 into a single action: fetch, classify, and extract in one WebFetch call. Still validate and format.


Examples

Example 1: Ask for the upstream workflow directly

Use @web-scraper to handle <task>. Start from the copied upstream workflow, load only the files that change the outcome, and keep provenance visible in the answer.

Explanation: This is the safest starting point when the operator needs the imported workflow, but not the entire repository.

Example 2: Ask for a provenance-grounded review

Review @web-scraper against metadata.json and ORIGIN.md, then explain which copied upstream files you would load first and why.

Explanation: Use this before review or troubleshooting when you need a precise, auditable explanation of origin and file selection.

Example 3: Narrow the copied support files before execution

Use @web-scraper for <task>. Load only the copied references, examples, or scripts that change the outcome, and name the files explicitly before proceeding.

Explanation: This keeps the skill aligned with progressive disclosure instead of loading the whole copied package by default.

Example 4: Build a reviewer packet

Review @web-scraper using the copied upstream files plus provenance, then summarize any gaps before merge.

Explanation: This is useful when the PR is waiting for human review and you want a repeatable audit packet.

Best Practices

Treat the generated public skill as a reviewable packaging layer around the upstream repository. The goal is to keep provenance explicit and load only the copied source material that materially improves execution.


Imported Operating Notes

Imported: Clarification Rules

  • If user provides a URL and clear data target, proceed directly to Phase 2. Do NOT ask unnecessary questions.
  • If request is ambiguous (e.g. "scrape this site"), ask ONLY: "What specific data do you want me to extract from this page?"
  • Default to Markdown table output. Mention alternatives only if relevant.
  • Accept requests in any language. Always respond in the user's language.
  • If user says "everything" or "all data", perform recon first, then present what's available and let user choose.

Imported: Markdown Table Rules

  • Left-align text columns (`:---`), right-align numbers (`---:`)
  • Consistent column widths for readability
  • Include summary row for numeric data when useful (totals, averages)
  • Maximum 10 columns per table; split wider data into multiple tables or suggest JSON format
  • Truncate long cell values to 60 chars with `...` indicator
  • Use `N/A` for missing values, never leave cells empty
  • For multi-page results, show combined table (not per-page)
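A minimal table following these rules (the products and prices are hypothetical):

```markdown
| Product  | Notes    | Price |
| :---     | :---     |  ---: |
| Widget A | In stock | 12.50 |
| Widget B | N/A      |  9.99 |
| Total    | N/A      | 22.49 |
```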

Imported: Json Rules

  • Use camelCase for keys (e.g. `productName`, `unitPrice`)
  • Wrap in metadata envelope:
    {
      "metadata": {
        "source": "URL",
        "title": "Page Title",
        "extractedAt": "ISO-8601",
        "itemCount": 47,
        "fieldCount": 6,
        "confidence": "HIGH",
        "strategy": "A",
        "transforms": ["deduplication", "priceNormalization"],
        "notes": []
      },
      "data": [ ... ]
    }
    
  • Pretty-print with 2-space indentation
  • Numbers as numbers (not strings), booleans as booleans
  • null for missing values (not empty strings)

Imported: Csv Rules

  • First row is always headers
  • Quote any field containing commas, quotes, or newlines
  • UTF-8 encoding with BOM for Excel compatibility
  • Use `,` as delimiter (standard)
  • Include metadata as comments: `# Source: URL`
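A minimal Python sketch of these rules using only the standard library; the output path and source URL are placeholders:

```python
import csv

def write_csv(rows, fieldnames, path):
    # "utf-8-sig" prepends the BOM that Excel uses to detect UTF-8;
    # csv.QUOTE_MINIMAL quotes any field containing commas, quotes, or newlines.
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        f.write("# Source: https://example.com\n")  # metadata comment line
        writer = csv.DictWriter(f, fieldnames=fieldnames, quoting=csv.QUOTE_MINIMAL)
        writer.writeheader()
        for row in rows:
            # Use N/A for missing values rather than leaving cells empty
            writer.writerow({k: row.get(k, "N/A") for k in fieldnames})
```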

Imported: Best Practices

  • Provide clear, specific context about your project and requirements
  • Review all suggestions before applying them to production code
  • Combine with other complementary skills for comprehensive analysis

Troubleshooting

Problem: The operator skipped the imported context and answered too generically

Symptoms: The result ignores the upstream workflow in `plugins/antigravity-awesome-skills-claude/skills/web-scraper`, fails to mention provenance, or does not use any copied source files at all.

Solution: Re-open `metadata.json`, `ORIGIN.md`, and the most relevant copied upstream files. Load only the files that materially change the answer, then restate the provenance before continuing.

Problem: The imported workflow feels incomplete during review

Symptoms: Reviewers can see the generated `SKILL.md`, but they cannot quickly tell which references, examples, or scripts matter for the current task.

Solution: Point at the exact copied references, examples, scripts, or assets that justify the path you took. If the gap is still real, record it in the PR instead of hiding it.

Problem: The task drifted into a different specialization

Symptoms: The imported skill starts in the right place, but the work turns into debugging, architecture, design, security, or release orchestration that a native skill handles better. Solution: Use the related skills section to hand off deliberately. Keep the imported provenance visible so the next skill inherits the right context instead of starting blind.

Imported Troubleshooting Notes

Imported: Failure Protocol

When extraction fails or is blocked:

  1. Explain the specific reason (JS rendering, bot detection, login, etc.)
  2. Suggest alternatives (different URL, API if available, manual approach)
  3. Never retry aggressively or escalate access attempts

Related Skills

Use the following native specializations when the work is better handled there after this imported skill establishes context:

  • @00-andruia-consultant-v2
  • @10-andruia-skill-smith-v2
  • @20-andruia-niche-intelligence-v2
  • @3d-web-experience-v2

Additional Resources

Use this support matrix and the linked files below as the operator packet for this imported skill. They should reflect real copied source material, not generic scaffolding.

| Resource family | What it gives the reviewer | Example path |
| :--- | :--- | :--- |
| references | copied reference notes, guides, or background material from upstream | references/data-transforms.md |
| examples | worked examples or reusable prompts copied from upstream | examples/n/a |
| scripts | upstream helper scripts that change execution or validation | scripts/n/a |
| agents | routing or delegation notes that are genuinely part of the imported package | agents/n/a |
| assets | supporting assets or schemas copied from the source package | assets/n/a |

Imported Reference Notes

Imported: Quick Reference: Mode Cheat Sheet

| User Says... | Mode | Strategy | Output Default |
| :--- | :--- | :--- | :--- |
| "extract the table" | table | A or B | Markdown table |
| "get all products/prices" | product | E then A | Markdown table |
| "scrape the listings" | list | A or B | Markdown table |
| "extract contact info / team page" | contact | A | Markdown table |
| "get the article data" | article | A | Markdown text |
| "extract the FAQ" | faq | A or B | JSON |
| "get pricing plans" | pricing | A or B | Markdown table |
| "scrape job listings" | jobs | A or B | Markdown table |
| "get event schedule" | events | A or B | Markdown table |
| "find and extract [topic]" | discovery | WebSearch | Markdown table |
| "compare prices across sites" | multi-URL | A or B | Comparison table |
| "what changed since last time" | diff | any | Diff format |

Imported: Capabilities

  • Multi-strategy: WebFetch (static), Browser automation (JS-rendered), Bash/curl (APIs), WebSearch (discovery)
  • Extraction modes: table, list, article, product, contact, FAQ, pricing, events, jobs, custom
  • Output formats: Markdown tables (default), JSON, CSV
  • Pagination: auto-detect and follow (page numbers, infinite scroll, load-more)
  • Multi-URL: extract same structure across sources with comparison and diff
  • Validation: confidence ratings (HIGH/MEDIUM/LOW) on every extraction
  • Auto-escalation: WebFetch fails silently -> automatic Browser fallback
  • Data transforms: cleaning, normalization, deduplication, enrichment
  • Differential mode: detect changes between scraping runs

Imported: Web Scraper

Multi-strategy web data extraction with intelligent approach selection, automatic fallback escalation, data transformation, and structured output.

Imported: Phase 1: Clarify

Establish extraction parameters before touching any URL.

Imported: Required Parameters

| Parameter | Resolve | Default |
| :--- | :--- | :--- |
| Target URL(s) | Which page(s) to scrape? | (required) |
| Data Target | What specific data to extract? | (required) |
| Output Format | Markdown table, JSON, CSV, or text? | Markdown table |
| Scope | Single page, paginated, or multi-URL? | Single page |

Imported: Optional Parameters

| Parameter | Resolve | Default |
| :--- | :--- | :--- |
| Pagination | Follow pagination? Max pages? | No, 1 page |
| Max Items | Maximum number of items to collect? | Unlimited |
| Filters | Data to exclude or include? | None |
| Sort Order | How to sort results? | Source order |
| Save Path | Save to file? Which path? | Display only |
| Language | Respond in which language? | User's lang |
| Diff Mode | Compare with previous run? | No |

Imported: Discovery Mode

When user has a topic but no specific URL:

  1. Use WebSearch to find the most relevant pages
  2. Present top 3-5 URLs with descriptions
  3. Let user choose which to scrape, or scrape all
  4. Proceed to Phase 2 with selected URL(s)

Example: "find and extract pricing data for CRM tools" -> WebSearch("CRM tools pricing comparison 2026") -> Present top results -> User selects -> Extract


Imported: Phase 2: Reconnaissance

Analyze the target page before extraction.

Imported: Phase 3: Strategy Selection

Choose the extraction approach based on recon results.

Imported: Decision Tree

Structured data (JSON-LD, microdata) has what we need?
 |
 +-- YES --> STRATEGY E: Extract structured data directly
 |
 +-- NO: Content fully visible in WebFetch?
      |
      +-- YES: Need precise element targeting?
      |    |
      |    +-- NO  --> STRATEGY A: WebFetch + AI extraction
      |    +-- YES --> STRATEGY B: Browser automation
      |
      +-- NO: JavaScript rendering detected?
           |
           +-- YES --> STRATEGY B: Browser automation
           +-- NO:  API/JSON/XML endpoint or download link?
                |
                +-- YES --> STRATEGY C: Bash (curl + jq)
                +-- NO  --> Report access issue to user
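The same decision tree can be encoded as a small helper. This is a sketch; the recon flag names are assumptions for illustration, not part of the upstream skill:

```python
def choose_strategy(recon):
    # recon: dict of booleans produced by Phase 2 reconnaissance.
    if recon.get("structured_data_sufficient"):
        return "E"  # extract JSON-LD/microdata directly
    if recon.get("content_visible_in_webfetch"):
        # Precise element targeting forces Browser automation.
        return "B" if recon.get("needs_precise_targeting") else "A"
    if recon.get("js_rendering_detected"):
        return "B"
    if recon.get("api_or_download_available"):
        return "C"
    return "report access issue to user"
```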

Imported: Strategy A: Webfetch With Ai Extraction

Best for: Static pages, articles, simple tables, well-structured HTML.

Use WebFetch with a targeted extraction prompt tailored to the mode:

WebFetch(
  url = URL,
  prompt = "Extract [DATA_TARGET] from this page.
    Return ONLY the extracted data as [FORMAT] with these columns/fields: [FIELDS].
    Rules:
    - If a value is missing or unclear, use 'N/A'
    - Do not include navigation, ads, footers, or unrelated content
    - Preserve original values exactly (numbers, currencies, dates)
    - Include ALL matching items, not just the first few
    - For each item, also extract the URL/link if available"
)

Auto-escalation: If WebFetch returns suspiciously few items (less than 50% of expected from recon), or mostly empty fields, automatically escalate to Strategy B without asking user. Log the escalation in notes.

Imported: Strategy B: Browser Automation

Best for: JS-rendered pages, SPAs, interactive content, lazy-loaded data.

Sequence:

  1. Get tab context: `tabs_context_mcp(createIfEmpty=true)` -> get tabId
  2. Navigate to URL: `navigate(url=TARGET_URL, tabId=TAB)`
  3. Wait for content to load: `computer(action="wait", duration=3, tabId=TAB)`
  4. Check for cookie/consent banners: `find(query="cookie consent or accept button", tabId=TAB)`
     • If found, dismiss it (prefer privacy-preserving option)
  5. Read page structure: `read_page(tabId=TAB)` or `get_page_text(tabId=TAB)`
  6. Locate target elements: `find(query="[DESCRIPTION]", tabId=TAB)`
  7. Extract with JavaScript for precise data via `javascript_tool`:
// Table extraction
const rows = document.querySelectorAll('TABLE_SELECTOR tr');
const data = Array.from(rows).map(row => {
  const cells = row.querySelectorAll('td, th');
  return Array.from(cells).map(c => c.textContent.trim());
});
JSON.stringify(data);
// List/card extraction
const items = document.querySelectorAll('ITEM_SELECTOR');
const data = Array.from(items).map(item => ({
  field1: item.querySelector('FIELD1_SELECTOR')?.textContent?.trim() || null,
  field2: item.querySelector('FIELD2_SELECTOR')?.textContent?.trim() || null,
  link: item.querySelector('a')?.href || null,
}));
JSON.stringify(data);
  8. For lazy-loaded content, scroll and re-extract: `computer(action="scroll", scroll_direction="down", tabId=TAB)` then `computer(action="wait", duration=2, tabId=TAB)`

Imported: Strategy C: Bash (Curl + Jq)

Best for: REST APIs, JSON endpoints, XML feeds, CSV/Excel downloads.


#### Imported: Json Api

curl -s "API_URL" | jq '[.items[] | {field1: .key1, field2: .key2}]'

#### Imported: Csv Download

curl -s "CSV_URL" -o /tmp/scraped_data.csv

#### Imported: Xml Parsing

curl -s "XML_URL" | python3 -c "
import xml.etree.ElementTree as ET, json, sys
tree = ET.parse(sys.stdin)
# ... parse and output JSON
"

Imported: Strategy D: Hybrid

When a single strategy is insufficient, combine:

  1. WebSearch to discover relevant URLs
  2. WebFetch for initial content assessment
  3. Browser automation for JS-heavy sections
  4. Bash for post-processing (jq, python for data cleaning)

Imported: Strategy E: Structured Data Extraction

When JSON-LD, microdata, or OpenGraph is present:

  1. Use Browser `javascript_tool` to extract structured data:
const scripts = document.querySelectorAll('script[type="application/ld+json"]');
const data = Array.from(scripts).map(s => {
  try { return JSON.parse(s.textContent); } catch { return null; }
}).filter(Boolean);
JSON.stringify(data);
  2. This often provides cleaner, more reliable data than DOM scraping
  3. Fall back to DOM extraction only for fields not in structured data

Imported: Pagination Handling

When pagination is detected and user wants multiple pages:

Page-number pagination (any strategy):

  1. Extract data from current page
  2. Identify URL pattern (e.g. `?page=N`, `/page/N`, `&offset=N`)
  3. Iterate through pages up to user's max (default: 5 pages)
  4. Show progress: "Extracting page 2/5..."
  5. Concatenate all results, deduplicate if needed
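The page-number loop can be sketched as follows, assuming a `?page=N` URL pattern and a hypothetical `extract_page` helper that returns a list of record dicts:

```python
def scrape_pages(base_url, extract_page, max_pages=5):
    # Iterate ?page=N URLs up to the user's max (default 5), concatenating results.
    all_items = []
    for page in range(1, max_pages + 1):
        print(f"Extracting page {page}/{max_pages}...")
        items = extract_page(f"{base_url}?page={page}")
        if not items:  # an empty page usually means we ran past the end
            break
        all_items.extend(items)
    # Deduplicate while preserving order
    seen, deduped = set(), []
    for item in all_items:
        key = tuple(sorted(item.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(item)
    return deduped
```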

Infinite scroll (Browser only):

  1. Extract currently visible data
  2. Record item count
  3. Scroll down: `computer(action="scroll", scroll_direction="down", tabId=TAB)`
  4. Wait: `computer(action="wait", duration=2, tabId=TAB)`
  5. Extract newly loaded data
  6. Compare count - if no new items after 2 scrolls, stop
  7. Repeat until no new content or max iterations (default: 5)

"Load More" button (Browser only):

  1. Extract currently visible data
  2. Find button: `find(query="load more button", tabId=TAB)`
  3. Click it: `computer(action="left_click", ref=REF, tabId=TAB)`
  4. Wait and extract new content
  5. Repeat until button disappears or max iterations reached

Imported: Phase 4: Extract

Execute the selected strategy using mode-specific patterns. See references/extraction-patterns.md for CSS selectors and JavaScript snippets.

Imported: Table Mode

WebFetch prompt:

"Extract ALL rows from the table(s) on this page.
Return as a markdown table with exact column headers.
Include every row - do not truncate or summarize.
Preserve numeric precision, currencies, and units."

Imported: List Mode

WebFetch prompt:

"Extract each [ITEM_TYPE] from this page.
For each item, extract: [FIELD_LIST].
Return as a JSON array of objects with these keys: [KEY_LIST].
Include ALL items, not just the first few. Include link/URL for each item if available."

Imported: Article Mode

WebFetch prompt:

"Extract article metadata:
- title, author, date, tags/categories, word count estimate
- Key factual data points, statistics, and named entities
Return as structured markdown. Summarize the content; do not reproduce full text."

Imported: Product Mode

WebFetch prompt:

"Extract product data with these exact fields:
- name, brand, price, currency, originalPrice (if discounted),
  availability, description (first 200 chars), rating, reviewCount,
  specifications (as key-value pairs), productUrl, imageUrl
Return as JSON. Use null for missing fields."

Also check for a JSON-LD `Product` schema (Strategy E) first.

Imported: Contact Mode

WebFetch prompt:

"Extract contact information for each person/entity:
- name, title, role, email, phone, address, organization, website, linkedinUrl
Return as a markdown table. Only extract real contacts visible on the page."

Imported: Faq Mode

WebFetch prompt:

"Extract all question-answer pairs from this page.
For each FAQ item extract:
- question: the exact question text
- answer: the answer text (first 300 chars if long)
- category: the section/category if grouped
Return as a JSON array of objects."

Imported: Pricing Mode

WebFetch prompt:

"Extract all pricing plans/tiers from this page.
For each plan extract:
- planName, monthlyPrice, annualPrice, currency
- features (array of included features)
- limitations (array of limits or excluded features)
- ctaText (call-to-action button text)
- highlighted (true if marked as recommended/popular)
Return as JSON. Use null for missing fields."

Imported: Events Mode

WebFetch prompt:

"Extract all events/sessions from this page.
For each event extract:
- title, date, time, endTime, location, description (first 200 chars)
- speakers (array of names), category, registrationUrl
Return as JSON. Use null for missing fields."

Imported: Jobs Mode

WebFetch prompt:

"Extract all job listings from this page.
For each job extract:
- title, company, location, salary, salaryRange, type (full-time/part-time/contract)
- postedDate, description (first 200 chars), applyUrl, tags
Return as JSON. Use null for missing fields."

Imported: Custom Mode

When user provides specific selectors or field descriptions:

  • Use Browser automation with `javascript_tool` and user's CSS selectors
  • Or use WebFetch with a prompt built from user's field descriptions
  • Always confirm extracted schema with user before proceeding to multi-URL

Imported: Multi-Url Extraction

When extracting from multiple URLs:

  1. Extract from the first URL to establish the data schema
  2. Show user the first results and confirm the schema is correct
  3. Extract from remaining URLs using the same schema
  4. Add a `source` column/field to every record with the origin URL
  5. Combine all results into a single output
  6. Show progress: "Extracting 3/7 URLs..."

Imported: Phase 5: Transform

Clean, normalize, and enrich extracted data before validation. See references/data-transforms.md for patterns.

Imported: Automatic Transforms (Always Apply)

| Transform | Action |
| :--- | :--- |
| Whitespace cleanup | Trim, collapse multiple spaces, remove `\n` in cells |
| HTML entity decode | `&amp;` -> `&`, `&lt;` -> `<`, `&#39;` -> `'` |
| Unicode normalization | NFKC normalization for consistent characters |
| Empty string to null | `""` -> `null` (for JSON), `""` -> `N/A` (for tables) |
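These automatic transforms can be sketched as a single cell-cleaning helper, using only the standard library:

```python
import html
import re
import unicodedata

def clean_cell(value, for_json=True):
    # Automatic transforms: HTML entity decode, NFKC unicode normalization,
    # whitespace cleanup, and empty-value handling (null for JSON, N/A for tables).
    if value is None:
        return None if for_json else "N/A"
    text = html.unescape(str(value))            # &amp; -> &, &#39; -> '
    text = unicodedata.normalize("NFKC", text)  # consistent characters
    text = re.sub(r"\s+", " ", text).strip()    # trim, collapse spaces and newlines
    if text == "":
        return None if for_json else "N/A"
    return text
```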

Imported: Deduplication Strategy

When combining data from multiple pages or URLs:

  1. Exact match: rows with identical values in all fields -> keep first
  2. Near match: rows with same key fields (name+source) but different details -> keep most complete (fewer nulls), flag in notes
  3. Report: "Removed N duplicate rows" in delivery notes
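A sketch of this strategy, assuming records are dicts and using `name` plus `source` as the key fields (a hypothetical choice; pick the key fields that identify an item on your target site):

```python
def dedupe(records, key_fields=("name", "source")):
    # Exact duplicates: keep the first occurrence.
    # Near duplicates (same key fields): keep the most complete record,
    # i.e. the one with the fewest null values.
    best, order = {}, []
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        nulls = sum(1 for v in rec.values() if v is None)
        if key not in best:
            best[key] = (nulls, rec)
            order.append(key)
        elif nulls < best[key][0]:
            best[key] = (nulls, rec)
    print(f"Removed {len(records) - len(order)} duplicate rows")
    return [best[k][1] for k in order]
```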

Imported: Phase 6: Validate

Verify extraction quality before delivering results.

Imported: Validation Checks

| Check | Action |
| :--- | :--- |
| Item count | Compare extracted count to expected count from recon |
| Empty fields | Count N/A or null values per field |
| Data type consistency | Numbers should be numeric, dates parseable |
| Duplicates | Flag exact duplicate rows (post-dedup) |
| Encoding | Check for HTML entities, garbled characters |
| Completeness | All user-requested fields present in output |
| Truncation | Verify data wasn't cut off (check last items) |
| Outliers | Flag values that seem anomalous (e.g. $0.00 price) |

Imported: Confidence Rating

Assign to every extraction:

| Rating | Criteria |
| :--- | :--- |
| HIGH | All fields populated, count matches expected, no anomalies |
| MEDIUM | Minor gaps (<10% empty fields) or count slightly differs |
| LOW | Significant gaps (>10% empty), structural issues, partial data |

Always report confidence with specifics:

Confidence: HIGH - 47 items extracted, all 6 fields populated, matches expected count from page analysis.

Imported: Auto-Recovery (Try Before Reporting Issues)

| Issue | Auto-Recovery Action |
| :--- | :--- |
| Missing data | Re-attempt with Browser if WebFetch was used |
| Encoding problems | Apply HTML entity decode + unicode normalization |
| Incomplete results | Check for pagination or lazy-loading, fetch more |
| Count mismatch | Scroll/paginate to find remaining items |
| All fields empty | Page likely JS-rendered, switch to Browser strategy |
| Partial fields | Try JSON-LD extraction as supplement |

Log all recovery attempts in delivery notes. Inform user of any irrecoverable gaps with specific details.


Imported: Phase 7: Format And Deliver

Structure results according to user preference. See references/output-templates.md for complete formatting templates.

Imported: Delivery Envelope

ALWAYS wrap results with this metadata header:


#### Imported: Extraction Results

**Source:** [Page Title](http://example.com)
**Date:** YYYY-MM-DD HH:MM UTC
**Items:** N records (M fields each)
**Confidence:** HIGH | MEDIUM | LOW
**Strategy:** A (WebFetch) | B (Browser) | C (API) | E (Structured Data)
**Format:** Markdown Table | JSON | CSV

---

[DATA HERE]

---

**Notes:**
- [Any gaps, issues, or observations]
- [Transforms applied: deduplication, normalization, etc.]
- [Pages scraped if paginated: "Pages 1-5 of 12"]
- [Auto-escalation if it occurred: "Escalated from WebFetch to Browser"]

Imported: File Output

When user requests file save:

  • Markdown: `.md` extension
  • JSON: `.json` extension
  • CSV: `.csv` extension
  • Confirm path before writing
  • Report full file path and item count after saving

Imported: Multi-Url Comparison Format

When comparing data across multiple sources:

  • Add `Source` as the first column/field
  • Use short identifiers for sources (domain name or user label)
  • Group by source or interleave based on user preference
  • Highlight differences if user asks for comparison
  • Include summary: "Best price: $X at store-b.com"

Imported: Differential Output

When user requests change detection (diff mode):

  • Compare current extraction with previous run
  • Mark new items with `[NEW]`
  • Mark removed items with `[REMOVED]`
  • Mark changed values with `[WAS: old_value]`
  • Include summary: "Changes since last run: +5 new, -2 removed, 3 modified"
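A minimal diff sketch, assuming both runs are lists of dicts keyed on a `name` field (the key field is an assumption; use whatever uniquely identifies an item):

```python
def diff_runs(previous, current, key="name"):
    # Compare two extraction runs keyed on one field.
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    added = [k for k in curr if k not in prev]
    removed = [k for k in prev if k not in curr]
    modified = [k for k in curr if k in prev and curr[k] != prev[k]]
    return {
        "summary": f"Changes since last run: +{len(added)} new, "
                   f"-{len(removed)} removed, {len(modified)} modified",
        "new": added, "removed": removed, "modified": modified,
    }
```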

Imported: Rate Limiting

  • Maximum 1 request per 2 seconds for sequential page fetches
  • For multi-URL jobs, process sequentially with pauses
  • If a site returns 429 (Too Many Requests), stop and report to user
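A sketch of this pacing rule using only the standard library; the `fetch` callable returning `(status, body)` is a hypothetical stand-in for whichever strategy performs the request:

```python
import time

def fetch_sequentially(urls, fetch, min_interval=2.0):
    # At most one request per min_interval seconds; stop and report on HTTP 429.
    results = []
    last = 0.0
    for url in urls:
        wait = min_interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        status, body = fetch(url)
        if status == 429:
            print(f"429 Too Many Requests at {url}; stopping")
            break
        results.append(body)
    return results
```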

Imported: Access Respect

  • If a page blocks access (403, CAPTCHA, login wall), report to user
  • Do NOT attempt to bypass bot detection, CAPTCHAs, or access controls
  • Do NOT scrape behind authentication unless user explicitly provides access
  • Respect robots.txt directives when known

Imported: Copyright

  • Do NOT reproduce large blocks of copyrighted article text
  • For articles: extract factual data, statistics, and structured info; summarize narrative content
  • Always include source attribution (http://example.com) in output

Imported: Data Scope

  • Extract ONLY what the user explicitly requested
  • Warn user before collecting potentially sensitive data at scale (emails, phone numbers, personal information)
  • Do not store or transmit extracted data beyond what the user sees

Imported: Common Pitfalls

  • Using this skill for tasks outside its domain expertise
  • Applying recommendations without understanding your specific context
  • Not providing enough project context for accurate analysis

Imported: Limitations

  • Use this skill only when the task clearly matches the scope described above.
  • Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
  • Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.