# BrowserOS extract-data
Extract structured data from web pages — tables, lists, product info, pricing — into clean CSV, JSON, or markdown tables. Parallelizes across hidden tabs for multi-source extraction and saves results to disk incrementally. Use when the user asks to scrape, extract, or pull data from a page.
```sh
git clone https://github.com/browseros-ai/BrowserOS
```

Or install the skill directly into `~/.claude/skills`:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/browseros-ai/BrowserOS "$T" && mkdir -p ~/.claude/skills && cp -r "$T/packages/browseros-agent/apps/server/src/skills/defaults/extract-data" ~/.claude/skills/browseros-ai-browseros-extract-data && rm -rf "$T"
```
`packages/browseros-agent/apps/server/src/skills/defaults/extract-data/SKILL.md`

# Extract Data
End-to-end data extraction workflow that pulls structured content from one or many web pages, saves results to disk incrementally (never accumulating everything in memory), and delivers clean output in the user's preferred format.
## When to Apply
Activate when the user asks to extract, scrape, pull, or collect structured data from web pages — tables, product listings, pricing, contact info, search results, leaderboards, or any repeating data pattern.
## Workflow
### Phase 1 — Clarify & Plan

Clarify the request. Before extracting, confirm with the user:

- Source(s): single page, list of URLs, or search-then-extract?
- Output format: CSV, JSON, or Markdown table? Default to CSV if not specified.
- Output location: where to save files. Default: `extract-<topic-slug>/` in your working directory.
- What data to extract: column names, specific fields, or "everything in the table."

Then create the output directory. Use `evaluate_script` to create the target folder:

```
extract-<topic-slug>/
├── raw/              ← per-page extracted content
├── merged.<format>   ← final combined output (csv / json)
└── extraction.log    ← progress log with source URLs
```
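The `<topic-slug>` above is just a URL-safe form of the user's topic. A minimal sketch of how it might be derived (the `slugify` and `outputDir` helpers are illustrative, not part of the skill or a BrowserOS API):

```typescript
// Turn a free-form topic into a directory name like "extract-gpu-price-comparison/".
// slugify() is a hypothetical helper, not a BrowserOS API.
function slugify(topic: string): string {
  return topic
    .toLowerCase()
    .normalize("NFKD")                 // split accented chars into base + combining mark
    .replace(/[\u0300-\u036f]/g, "")   // drop the combining marks
    .replace(/[^a-z0-9]+/g, "-")       // runs of non-alphanumerics become single dashes
    .replace(/^-+|-+$/g, "");          // trim leading/trailing dashes
}

function outputDir(topic: string): string {
  return `extract-${slugify(topic)}/`;
}
```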
### Phase 2 — Single-Page Extraction
For a single page (or each individual page in a batch):
| Step | Tool | Detail |
|---|---|---|
| Navigate | `navigate_page` | Go to the target URL (skip if already on the page) |
| Read content |  | Extract the page as markdown — this captures tables, lists, and text in a structured format |
| Identify structure | — | Determine the data pattern: HTML table, repeated cards, key-value pairs, etc. |
| Extract data | `evaluate_script` | For complex structures (e.g., product grids, nested cards), run JavaScript to query elements and return a JSON array. For clean markdown tables from the content read, parse directly. |
| Save immediately | `evaluate_script` | Write the extracted data to its own file under `raw/` with a header comment containing the source URL and timestamp |
| Log progress | `evaluate_script` | Append the source URL, row count, and status to `extraction.log` |
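When the content read comes back as a clean markdown table, turning it into row objects is mechanical. A sketch, assuming a well-formed table (this parser is illustrative, not the skill's actual code):

```typescript
// Parse a markdown table into an array of row objects keyed by header.
function parseMarkdownTable(md: string): Record<string, string>[] {
  const lines = md.trim().split("\n").filter((l) => l.trim().startsWith("|"));
  const toCells = (line: string) =>
    line.trim().replace(/^\|/, "").replace(/\|$/, "").split("|").map((c) => c.trim());
  const headers = toCells(lines[0]);
  // lines[1] is the |---|---| separator row; data rows start at index 2
  return lines.slice(2).map((line) => {
    const cells = toCells(line);
    return Object.fromEntries(headers.map((h, i) => [h, cells[i] ?? ""]));
  });
}
```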
#### Handling Pagination

If the page has pagination (next buttons, page numbers, infinite scroll):

- Extract the current page's data and save it to `raw/<n>-page-<p>.<format>`
- Use `click` or `navigate_page` to go to the next page
- Repeat until all pages are processed or a user-specified limit is reached
- Each page's data is saved to its own file immediately — never accumulate across pages in memory
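The loop above can be sketched as follows. `extractPage`, `goToNextPage`, and `saveRaw` stand in for the tool calls and are assumptions, not real BrowserOS APIs (the real calls are asynchronous; this sketch is synchronous for clarity):

```typescript
type Page = { rows: object[]; hasNext: boolean };

// Stand-ins for the navigate_page / evaluate_script tool calls.
function paginate(
  extractPage: () => Page,
  goToNextPage: () => void,
  saveRaw: (pageNo: number, rows: object[]) => void,
  maxPages = 50, // user-specified limit
): number {
  let total = 0;
  for (let p = 1; p <= maxPages; p++) {
    const page = extractPage();
    saveRaw(p, page.rows); // save immediately — nothing accumulates across pages
    total += page.rows.length;
    if (!page.hasNext) break;
    goToNextPage();
  }
  return total;
}
```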
### Phase 3 — Multi-Source Parallel Extraction

When extracting from multiple URLs or sources, parallelize using a hidden window:

| Step | Tool | Detail |
|---|---|---|
| Create workspace |  | Open a dedicated hidden window for extraction work — keeps the user's browsing undisturbed |
| Open batch of tabs |  | Open up to 10 tabs concurrently within the hidden window, one per source URL |
| Extract per tab |  | For each tab: navigate, extract content, parse structured data |
| Save per tab | `evaluate_script` | Write each tab's results to its own file under `raw/` immediately after extraction |
| Close tab |  | Free the tab after its data is saved |
| Next batch | — | Once a batch of 10 completes, open the next batch. Continue until all sources are processed. |
| Close workspace |  | Close the hidden window after all extraction is done |

**Concurrency rule:** Never exceed 10 open tabs at a time. Process in batches of 10, saving and closing before opening the next batch.
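The batching rule is plain windowing over the URL list; a minimal sketch (the helper name is ours, not BrowserOS's):

```typescript
const MAX_TABS = 10;

// Split source URLs into batches that never exceed the 10-tab limit.
function batchUrls(urls: string[], size = MAX_TABS): string[][] {
  const batches: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    batches.push(urls.slice(i, i + size));
  }
  return batches;
}
```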
### Phase 4 — Merge & Format

After all raw files are saved:

- Read each raw file from `raw/` using `evaluate_script`.
- Merge into a single output file (`merged.csv`, `merged.json`, or `merged.md`) with:
  - Consistent column headers / keys across all sources
  - A `source_url` column so every row is traceable to its origin
  - Deduplication if the same record appears in multiple sources
- Write the merged file to the output directory.
- For large datasets, provide a summary: total rows, sources processed, any errors.
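The merge step can be sketched as follows; the types and the dedup key (the record's own fields, excluding `source_url`) are our assumptions, and in the skill itself this logic would run via `evaluate_script`:

```typescript
type Row = Record<string, string>;

// Merge per-source row arrays into one dataset: unify headers, tag each row
// with its source_url, and drop duplicate records across sources.
function mergeSources(sources: { url: string; rows: Row[] }[]): Row[] {
  const headers = new Set<string>();
  for (const s of sources) for (const r of s.rows) Object.keys(r).forEach((k) => headers.add(k));
  const keys = Array.from(headers);
  const seen = new Set<string>();
  const merged: Row[] = [];
  for (const { url, rows } of sources) {
    for (const row of rows) {
      // Key on the record's own fields so the same item from two sources dedupes.
      const key = keys.map((h) => row[h] ?? "").join("\u0000");
      if (seen.has(key)) continue;
      seen.add(key);
      const full: Row = {};
      for (const h of keys) full[h] = row[h] ?? ""; // consistent keys on every row
      full.source_url = url;
      merged.push(full);
    }
  }
  return merged;
}
```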
### Output Formats

| Format | File | Notes |
|---|---|---|
| CSV | `merged.csv` | Header row, comma-separated, properly escaped. Include `source_url` as the last column. |
| JSON | `merged.json` | Array of objects with consistent keys. Each object includes a `source_url` field. |
| Markdown | `merged.md` | Aligned table with headers. Source URL in the last column. |
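"Properly escaped" for CSV means quoting any field that contains a comma, quote, or newline, and doubling embedded quotes (RFC 4180 style). A minimal sketch:

```typescript
// RFC 4180-style CSV escaping: quote fields containing , " or newline,
// and double any embedded quotes.
function csvField(value: string): string {
  return /[",\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function toCsv(headers: string[], rows: Record<string, string>[]): string {
  const line = (cells: string[]) => cells.map(csvField).join(",");
  return [line(headers), ...rows.map((r) => line(headers.map((h) => r[h] ?? "")))].join("\n");
}
```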
### Phase 5 — HTML Report

Generate a self-contained `report.html` in the output directory that serves as an index for the entire extraction.
| Requirement | Detail |
|---|---|
| Theme | Light background, clean sans-serif typography, generous whitespace |
| Header | Title, date, total rows extracted, number of sources processed |
| What was done | Brief description of the extraction: source URLs, data fields extracted, format used |
| File index | Table listing every file in the output directory (merged output, `raw/` files, `extraction.log`) with file paths as clickable links so the user can open them directly |
| Data preview | First 20 rows of the merged dataset rendered as an HTML table |
| Source list | All source URLs as clickable hyperlinks with the row count extracted from each |
| Self-contained | All styles inline or in a `<style>` block — no external dependencies |
| Footer | "Generated by BrowserOS Extract Data" with the current date |
Use `evaluate_script` to write the HTML file to the output directory.
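A minimal sketch of the report skeleton the requirements above describe; the function signature and field names are illustrative, not the skill's actual template:

```typescript
// Build a self-contained report.html string: inline <style>, no external assets.
function buildReport(opts: {
  title: string;
  totalRows: number;
  sources: { url: string; rows: number }[];
}): string {
  const sourceList = opts.sources
    .map((s) => `<li><a href="${s.url}">${s.url}</a>: ${s.rows} rows</li>`)
    .join("\n");
  return `<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>${opts.title}</title>
<style>body { background: #ffffff; font-family: sans-serif; margin: 2rem; }</style>
</head>
<body>
<h1>${opts.title}</h1>
<p>${opts.totalRows} rows from ${opts.sources.length} source(s)</p>
<ul>
${sourceList}
</ul>
<footer>Generated by BrowserOS Extract Data, ${new Date().toISOString().slice(0, 10)}</footer>
</body>
</html>`;
}
```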
### Phase 6 — Open & Notify

| Step | Tool | Detail |
|---|---|---|
| Open report |  | Open `report.html` so the user sees the extraction summary |
| Notify user | — | Tell the user: extraction is complete, total rows, source count, and paths to the merged file and `report.html` |
## Tool Reference

| Category | Tools Used |
|---|---|
| Window management |  |
| Tab management |  |
| Navigation | `navigate_page` |
| Content extraction |  |
| Data parsing & file I/O | `evaluate_script` |
| Interaction | `click` (for pagination) |
## Tips
- Always ask the format first. CSV, JSON, and Markdown have different strengths — let the user decide.
- Save after every page. Never hold more than one page's worth of data in memory at a time.
- 10 tabs max. More tabs degrade performance and risk timeouts. Batch in groups of 10.
- Record the source URL on every row and in every raw file so data is fully traceable.
- Clean up extracted data: trim whitespace, normalize currency symbols, remove hidden characters.
- For paginated sites, check for a total count or "showing X of Y" to estimate progress.
- If a page requires login or blocks extraction, report it to the user rather than retrying silently.
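For the progress-estimation tip, a tiny sketch of pulling the counts out of a "showing X of Y" label (the regex and phrasing it matches are an assumption; real sites vary):

```typescript
// Extract shown/total counts from text like "Showing 1–20 of 1,347 results".
function parseProgress(text: string): { shown: number; total: number } | null {
  const m = text.match(/showing\s+(?:\d+\s*[-–]\s*)?(\d+)\s+of\s+([\d,]+)/i);
  if (!m) return null;
  return { shown: Number(m[1]), total: Number(m[2].replace(/,/g, "")) };
}
```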