Fetch-everything shopify-product-scraper

Install

Source · Clone the upstream repo:

git clone https://github.com/liangdabiao/fetch-everything

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/liangdabiao/fetch-everything "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/shopify-product-scraper" ~/.claude/skills/liangdabiao-fetch-everything-shopify-product-scraper && rm -rf "$T"

Manifest: .claude/skills/shopify-product-scraper/SKILL.md

Source content

Shopify Product Scraper

Scrape product data from brand websites and generate Shopify CSV with remote CDN image URLs (no image downloading needed).

CRITICAL: Windows Python Rule

ALWAYS write Python code to .py files and run it with python script.py. NEVER use python -c "..." with multi-line code on Windows; it causes IndentationError: unexpected indent or || goto :error. Only a single-line python -c "import json; print(1)" is safe.
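
A minimal sketch of the safe pattern (the helper file name _task.py is an arbitrary assumption):

# Write multi-line code to a file, then run it; never pass it to python -c.
from pathlib import Path
import subprocess

code = "import json\nprint(json.dumps({'ok': True}))\n"
Path('_task.py').write_text(code, encoding='utf-8')
subprocess.run(['python', '_task.py'], check=True)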

Workflow

User provides URL → Assess site → Extract products → Fetch galleries → Generate CSV

Step 0: Assess Target Site

curl -sL -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" \
  "TARGET_URL" -o _page.html --connect-timeout 15 --max-time 30
wc -c _page.html

Check which pattern the site uses (write it to an _assess.py file):

import re, json
with open('_page.html', encoding='utf-8') as f:
    html = f.read()

print('data-product:', html.count('data-product'))
print('product-card:', html.count('product-card'))
print('aria-label="Product":', len(re.findall(r'aria-label="[^"]*"[^>]*class="[^"]*product', html)))

jsonlds = re.findall(r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL)
types = []
for j in jsonlds:
    try:
        d = json.loads(j)
    except json.JSONDecodeError:
        continue  # skip malformed JSON-LD blocks
    if isinstance(d, list):
        types.extend([x.get('@type', '?') for x in d])
    else:
        types.append(d.get('@type', '?'))
print('JSON-LD types:', types)  # BreadcrumbList ≠ Product!
print('Has Product/ItemList:', any(t in ('Product','ItemList') for t in types))

print('__NEXT_DATA__:', html.count('__NEXT_DATA__'))

Pattern decision:

| Pattern | Indicators | Method |
|---|---|---|
| A: data-* attrs | data-product > 0 | curl + regex on data attributes |
| A2: Semantic cards | product-card class + aria-label | curl + regex on article/div |
| B: JSON-LD Product | @type: Product or ItemList | curl + regex on ld+json |
| C: Next.js | __NEXT_DATA__ > 0 | curl + parse JSON state |
| D: JS-rendered | None of the above | Playwright MCP |

Important: JSON-LD may exist but only contain BreadcrumbList, not product data. Always check the actual @type.
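
A minimal sketch that turns the Step 0 signals into a pattern choice (the helper name and thresholds are assumptions, not part of the skill):

# Hypothetical helper: map _assess.py signals to a scraping pattern.
def choose_pattern(html, jsonld_types):
    if html.count('data-product') > 0:
        return 'A'    # data-* attributes
    if html.count('product-card') > 0:
        return 'A2'   # semantic product cards
    if any(t in ('Product', 'ItemList') for t in jsonld_types):
        return 'B'    # JSON-LD with real product data
    if html.count('__NEXT_DATA__') > 0:
        return 'C'    # Next.js serialized state
    return 'D'        # JS-rendered: fall back to Playwright MCP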

Step 1: Extract Product List

Pattern A: HTML data-* attributes

import re

# Assumes `html` was loaded as in Step 0.
products = []
products_raw = re.findall(
    r'<div[^>]*class="[^"]*product[^"]*"[^>]+(data-product-name="[^"]*"[^>]+)>',
    html
)
for attrs_str in products_raw:
    data = dict(re.findall(r'data-([\w-]+)="([^"]*)"', attrs_str))
    # data = {'product-name': '...', 'msrp-price': '...', 'default-id': '...'}
    products.append(data)

Pattern A2: Semantic HTML cards (Logitech, modern SSR sites)

Many sites use <article> or <div> with a class containing "product" and an aria-label holding the product name:

# Example: <article aria-label="MX Keys S" class="product-card ...">
articles = re.findall(
    r'<article[^>]*aria-label="([^"]*)"[^>]*class="[^"]*product-card[^"]*"[^>]*>(.*?)(?=<article[^>]*class="[^"]*product-card|<footer|$)',
    html, re.DOTALL
)
# If aria-label comes after class:
# articles = re.findall(r'<article[^>]*class="[^"]*product-card[^"]*"[^>]*aria-label="([^"]*)"[^>]*>(.*?)...', ...)

for title, card_html in articles:
    href = re.search(r'href="(/[^"]*)"', card_html)
    img = re.search(r'<img[^>]+src="(https://[^"]+)"', card_html)
    if not href or not img:
        continue  # skip cards missing a link or image
    url = 'https://example.com' + href.group(1)
    image = img.group(1)
    prices = re.findall(r'\$(\d+\.?\d*)', card_html)  # [sale, msrp] or [price]
    slug = url.split('/')[-1].replace('.html', '')

Key tip: Don't rely on the first aria-label found inside the card; buttons like "Add to wishlist" also have aria-labels. Extract the product name from the <article> tag's own aria-label, not from child elements.

Pattern B: JSON-LD

import re, json

jsonlds = re.findall(r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL)
products = []
for j in jsonlds:
    d = json.loads(j)
    if isinstance(d, list):
        d = d[0]
    if d.get('@type') in ('Product', 'ItemList'):
        # Fields of interest: d['name'], d['offers']['price'], d['image'], d['sku']
        products.append(d)

Pattern C: Next.js __NEXT_DATA__

import re, json

m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
if m:
    state = json.loads(m.group(1))
    products = state['props']['pageProps'].get('products', [])
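
The exact location of the product array varies between sites. When props.pageProps.products is missing, a small recursive search can surface candidate lists; this helper is an assumption, not part of the skill:

# Hypothetical fallback: walk the __NEXT_DATA__ tree for product-like lists.
def find_product_lists(node, path='$'):
    hits = []
    if isinstance(node, dict):
        for k, v in node.items():
            hits.extend(find_product_lists(v, f'{path}.{k}'))
    elif isinstance(node, list):
        if node and isinstance(node[0], dict) and ({'name', 'price'} & set(node[0]) or {'title', 'url'} & set(node[0])):
            hits.append((path, len(node)))
        for item in node:
            hits.extend(find_product_lists(item, path + '[]'))
    return hits

# find_product_lists(state) -> e.g. [('$.props.pageProps.products', 24)]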

Pattern D: JS-rendered (use Playwright MCP)

await (async (page) => {
    await page.goto('TARGET_URL', {timeout: 30000});
    await page.waitForTimeout(3000);
    const products = await page.evaluate(() => {
        const cards = document.querySelectorAll('[class*="product"]');
        return [...cards].map(c => ({
            title: c.querySelector('[class*="title"]')?.textContent?.trim(),
            price: c.querySelector('[class*="price"]')?.textContent?.trim(),
            url: c.querySelector('a')?.href,
            image: c.querySelector('img')?.src,
        })).filter(p => p.title);
    });
    return products;
})(page);

Normalize all extracted products to this intermediate format:

[
  {
    "title": "Product Name",
    "description": "Short description",
    "url": "https://example.com/product/slug",
    "category": "category-slug",
    "msrp": "299.99",
    "sale_price": "249.99",
    "compare_price": "299.99",
    "sku": "SKU-001",
    "image": "https://cdn.example.com/main.png",
    "product_slug": "product-slug"
  }
]
  • msrp: the regular/original price; used as a fallback.
  • sale_price: the current selling price (for deals/discount pages).
  • compare_price: the "compare at" / "was" price shown to the customer.
  • If there is no sale, set sale_price = msrp and leave compare_price empty.

Save to {brand}-products.json.
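
A minimal sketch normalizing Pattern A2 tuples into this format (the price ordering and field mapping are assumptions; adapt per site):

# Hypothetical normalizer for (title, url, image, prices, slug) tuples.
import json

def normalize(title, url, image, prices, slug, category=''):
    msrp = prices[-1] if prices else ''   # when two prices show, the last is the original
    sale = prices[0] if prices else ''    # the first is the current/sale price
    return {
        'title': title, 'description': '', 'url': url, 'category': category,
        'msrp': msrp, 'sale_price': sale,
        'compare_price': msrp if sale != msrp else '',
        'sku': '', 'image': image, 'product_slug': slug,
    }

# products = [normalize(*t) for t in extracted]
# with open('brand-products.json', 'w', encoding='utf-8') as f:
#     json.dump(products, f, indent=2, ensure_ascii=False)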

Step 2: Fetch Product Gallery Images

For each product, fetch its detail page to get the full image gallery.

IMPORTANT: Use curl via subprocess, not Python urllib. On Windows, Python's urllib has frequent SSL handshake timeouts; curl is more reliable.

Write a _fetch_galleries.py file with this pattern:

import re, json, subprocess, time

with open('brand-products.json', encoding='utf-8') as f:
    products = json.load(f)

def fetch_url(url):
    """Use curl subprocess — more reliable than urllib on Windows."""
    try:
        result = subprocess.run(
            ['curl', '-sL', '-A', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
             '--connect-timeout', '15', '--max-time', '30', url],
            capture_output=True, text=True, encoding='utf-8', errors='replace'
        )
        if result.returncode == 0 and len(result.stdout) > 500:
            return result.stdout
    except Exception:
        pass
    return ''

def filter_product_images(urls, product_slug):
    """Keep only images belonging to this specific product."""
    slug_norm = product_slug.replace('-', '').replace('.', '').lower()
    skip = ['swatch', 'icon', 'logo', '/3d/', '-3d', '/ar/', 'diagram',
            'color-ring', 'callout', 'award', 'badge', 'comparison',
            'color-guide', 'video-thumbnail', 'thumbnail', 'icon-']
    seen, result = set(), []
    for url in urls:
        lower = url.lower()
        if not re.search(r'\.(?:png|jpg|jpeg)', lower):
            continue
        if any(s in lower for s in skip):
            continue
        # Cross-product filter: keep only URLs that mention this product's slug
        if slug_norm and slug_norm not in lower.replace('-', '').replace('.', ''):
            continue
        fname = re.search(r'/([^/?]+\.(?:png|jpg|jpeg))', url)
        if fname:
            base = re.sub(r'-\d+x\d+', '', fname.group(1)).split('?')[0].replace('-2x', '')
            if base in seen:
                continue
            seen.add(base)
        # Upgrade resolution
        url = re.sub(r'w_\d+', 'w_1200', url)
        url = re.sub(r'h_\d+', 'h_900', url)
        result.append(url)
    return result[:8]

results = []
for i, p in enumerate(products):
    print(f'[{i+1}/{len(products)}] {p["title"]}')
    html = fetch_url(p['url'])

    if not html:
        # Fallback to listing image only
        results.append({**p, 'gallery': [p['image']] if p.get('image') else []})
        continue

    # Extract all CDN image URLs
    all_imgs = set()
    for m in re.finditer(r'(https://[^"\'>\s]+?\.(?:png|jpg|jpeg))', html):
        all_imgs.add(m.group(1))
    # Also check srcset
    for m in re.finditer(r'(?:srcset|data-src)="([^"]+)"', html):
        for part in m.group(1).split(','):
            u = part.strip().split()[0]
            if re.search(r'\.(?:png|jpg|jpeg)', u):
                all_imgs.add(u)

    gallery = filter_product_images(sorted(all_imgs), p['product_slug'])  # sort the set for stable output
    if not gallery and p.get('image'):
        gallery = [p['image']]

    results.append({**p, 'gallery': gallery})
    print(f'  Found {len(gallery)} images')

    # Checkpoint every 10 products
    if (i + 1) % 10 == 0:
        with open('brand-products-gallery.json', 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=2, ensure_ascii=False)

    time.sleep(0.5)

with open('brand-products-gallery.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

Key points for gallery extraction:

  • Filter cross-product images: "YOU MAY ALSO LIKE" sections include other products' images. Filter by product slug.
  • Upgrade image resolution: replace w_300 with w_1200 etc. in CDN URLs.
  • Deduplicate by filename: the same image at different sizes → keep one.
  • Max 8 images per product.
  • Checkpoint saves: save progress every 10 products in case of interruption.
  • Slug matching: use slug_norm with dots and hyphens removed. Some slugs contain version numbers (e.g., mx-keys-s.920-011558) that should be stripped for matching; see the sketch after this list.
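
A minimal sketch of stripping a trailing part number before matching (the regex is an assumption derived from the example above):

import re

# Hypothetical: 'mx-keys-s.920-011558' -> 'mxkeyss' for substring matching
slug = 'mx-keys-s.920-011558'
slug = re.sub(r'\.\d[\d-]*$', '', slug)  # strip the trailing '.920-011558'
slug_norm = slug.replace('-', '').replace('.', '').lower()
print(slug_norm)  # mxkeyss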

Save enriched data as {brand}-products-gallery.json.

Step 3: Generate Shopify CSV

Use the bundled script:

python scripts/gen_shopify_csv.py products-gallery.json --vendor "Brand Name" --output brand_shopify.csv

The script supports these fields:

  • gallery or gallery_images: array of image URLs
  • sale_price: selling price (used as Variant Price)
  • compare_price or msrp: original/compare-at price (used as Variant Compare At Price)
  • image: fallback main image if gallery is empty

Or generate inline following the Shopify CSV format:

Shopify CSV structure (UTF-8 BOM, all fields double-quoted):

| Row Type | Handle | Title | Body (HTML) | Image Src | Image Position |
|---|---|---|---|---|---|
| Product | product-slug | Product Name | <p>Description</p> | URL1 | 1 |
| Image 2 | product-slug | | | URL2 | 2 |
| Image 3 | product-slug | | | URL3 | 3 |

Key rules:

  • Handle: lowercase + hyphens only (re.sub(r'[^a-z0-9]+', '-', name.lower()))
  • First row: product info + first image
  • Image rows: same Handle, only fill Image Src/Position/Alt Text
  • Images: use remote CDN URLs (Shopify downloads them on import)
  • Max 8 images per product
  • Encoding: utf-8-sig (BOM for Excel compatibility)
  • Quoting: csv.QUOTE_ALL

Required Shopify CSV headers:

Handle, Title, Body (HTML), Vendor, Type, Tags, Published,
Option1 Name, Option1 Value, Variant SKU, Variant Grams,
Variant Inventory Tracker, Variant Inventory Qty,
Variant Inventory Policy, Variant Fulfillment Service,
Variant Price, Variant Compare At Price,
Variant Requires Shipping, Variant Taxable, Variant Barcode,
Image Src, Image Position, Image Alt Text, Gift Card, Variant Image
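
A minimal sketch of the inline approach, assuming the enriched gallery JSON from Step 2 (a simplification, not the bundled script; real output should carry every required header above):

# Hypothetical inline generator: one product row, then image-only rows.
import csv, json, re

HEADERS = ['Handle', 'Title', 'Body (HTML)', 'Vendor', 'Variant Price',
           'Variant Compare At Price', 'Image Src', 'Image Position', 'Image Alt Text']

with open('brand-products-gallery.json', encoding='utf-8') as f:
    products = json.load(f)

with open('brand_shopify.csv', 'w', newline='', encoding='utf-8-sig') as f:
    w = csv.DictWriter(f, fieldnames=HEADERS, quoting=csv.QUOTE_ALL)
    w.writeheader()
    for p in products:
        handle = re.sub(r'[^a-z0-9]+', '-', p['title'].lower()).strip('-')
        gallery = [u for u in (p.get('gallery') or [p.get('image')]) if u][:8]
        w.writerow({'Handle': handle, 'Title': p['title'],
                    'Body (HTML)': f"<p>{p.get('description', '')}</p>",
                    'Vendor': 'Brand Name',
                    'Variant Price': p.get('sale_price', ''),
                    'Variant Compare At Price': p.get('compare_price', ''),
                    'Image Src': gallery[0] if gallery else '',
                    'Image Position': '1', 'Image Alt Text': p['title']})
        for i, img in enumerate(gallery[1:], start=2):
            w.writerow({'Handle': handle, 'Image Src': img,
                        'Image Position': str(i), 'Image Alt Text': p['title']})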

Step 4: Verify Output

Write a _verify.py file (not inline python):

import csv
with open('output.csv', encoding='utf-8-sig') as f:
    rows = list(csv.DictReader(f))
product_rows = [r for r in rows if r['Title']]
image_rows = [r for r in rows if not r['Title']]
print(f'Total rows: {len(rows)}')
print(f'Product rows: {len(product_rows)}')
print(f'Image-only rows: {len(image_rows)}')
for r in product_rows:
    print(f"  {r['Title'][:45]:45s}  ${r['Variant Price']:>8s}  {r['Variant Compare At Price'] and '$'+r['Variant Compare At Price'] or '':>8s}")

Report to user: product count, image count, categories, file location.

Anti-Bot Mitigation

| Strategy | Implementation |
|---|---|
| User-Agent | curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" |
| Rate limit | time.sleep(0.5) between requests; sleep 0.5 in bash |
| Batch size | Process max 20-30 URLs per batch |
| Timeout | curl --connect-timeout 15 --max-time 30 |
| SSL issues | Use curl subprocess, not Python urllib |
| Access Denied | Fall back to Playwright MCP with longer delays (3-5s) |
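
A minimal retry wrapper around the Step 2 fetch_url helper; the attempt count and backoff schedule are assumptions:

# Hypothetical retry wrapper: back off and retry before giving up on a URL.
import time

def fetch_with_retry(url, attempts=3, base_delay=1.0):
    for attempt in range(attempts):
        html = fetch_url(url)  # curl-based helper defined in Step 2
        if html:
            return html
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s between tries
    return ''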

Windows Development Notes

  • ALWAYS write Python to .py files; python -c with multi-line code fails with IndentationError on Windows
  • Use curl subprocess for HTTP fetches, not urllib; avoids SSL handshake timeouts
  • No grep -P on Windows; use grep -o or grep -c (no Perl regex)
  • Use / not \ in Python paths
  • Use the project directory for temp files (not /tmp/)
  • CSV encoding: utf-8-sig
  • curl is available natively on Windows 10/11

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| Empty product list | JS-rendered page | Use Playwright MCP (Step 1 Pattern D) |
| Cross-product images | "Recommended" section | Filter by product slug (Step 2) |
| 404/BlobNotFound images | CDN variant doesn't exist | Only use URLs found in DOM, don't guess |
| CSV encoding issues | Missing BOM | Use utf-8-sig encoding |
| Price shows as 0 | Price in JS variable | Extract from rendered DOM with Playwright |
| Duplicate images | CDN serves same image at different sizes | Deduplicate by base filename |
| SSL timeout on fetch | Python urllib unreliable on Windows | Use curl via subprocess (Step 2) |
| JSON-LD exists but no products | Only BreadcrumbList, not Product | Check @type value, try Pattern A2 |
| Wrong aria-label parsed | Button aria-labels inside cards | Use article-level aria-label, not child elements |
| python -c IndentationError | Windows bash handling of multi-line | Write to .py file, run with python file.py |
| grep -oP not working | Windows grep lacks Perl regex | Use grep -o or Python regex |
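
For the last row, a minimal Python stand-in for grep -oP (the pattern shown is a hypothetical example):

# _grep.py: portable replacement for `grep -oP` on Windows
import re

with open('_page.html', encoding='utf-8', errors='replace') as f:
    text = f.read()
for match in re.findall(r'data-product-name="([^"]*)"', text):
    print(match)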
