Fetch-everything shopify-product-scraper

Install

Source · Clone the upstream repo:

git clone https://github.com/liangdabiao/fetch-everything

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/liangdabiao/fetch-everything "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/shopify-product-scraper" ~/.claude/skills/liangdabiao-fetch-everything-shopify-product-scraper && rm -rf "$T"

Manifest: .claude/skills/shopify-product-scraper/SKILL.md

Source content

Shopify Product Scraper

Scrape product data from brand websites and generate Shopify CSV with remote CDN image URLs (no image downloading needed).

CRITICAL: Windows Python Rule

ALWAYS write Python code to .py files and run it with python script.py. NEVER use python -c "..." with multi-line code on Windows; it causes IndentationError: unexpected indent or || goto :error. Only a single-line python -c "import json; print(1)" is safe.
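
A minimal sketch of the safe pattern (the helper file name _task.py is an arbitrary assumption):

# Write multi-line code to a file, then run it; never pass it to python -c.
from pathlib import Path
import subprocess

code = "import json\nprint(json.dumps({'ok': True}))\n"
Path('_task.py').write_text(code, encoding='utf-8')
subprocess.run(['python', '_task.py'], check=True)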

Workflow

User provides URL → Assess site → Extract products → Fetch galleries → Generate CSV

Step 0: Assess Target Site

curl -sL -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" \
  "TARGET_URL" -o _page.html --connect-timeout 15 --max-time 30
wc -c _page.html

Check which pattern the site uses (write it to an _assess.py file):

import re, json
with open('_page.html', encoding='utf-8') as f:
    html = f.read()

print('data-product:', html.count('data-product'))
print('product-card:', html.count('product-card'))
print('aria-label="Product":', len(re.findall(r'aria-label="[^"]*"[^>]*class="[^"]*product', html)))

jsonlds = re.findall(r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL)
types = []
for j in jsonlds:
    try:
        d = json.loads(j)
    except json.JSONDecodeError:
        continue  # skip malformed JSON-LD blocks
    if isinstance(d, list):
        types.extend([x.get('@type', '?') for x in d])
    else:
        types.append(d.get('@type', '?'))
print('JSON-LD types:', types)  # BreadcrumbList ≠ Product!
print('Has Product/ItemList:', any(t in ('Product','ItemList') for t in types))

print('__NEXT_DATA__:', html.count('__NEXT_DATA__'))

Pattern decision:

| Pattern | Indicators | Method |
|---|---|---|
| A: data-* attrs | data-product > 0 | curl + regex on data attributes |
| A2: Semantic cards | product-card class + aria-label | curl + regex on article/div |
| B: JSON-LD Product | @type: Product or ItemList | curl + regex on ld+json |
| C: Next.js | __NEXT_DATA__ > 0 | curl + parse JSON state |
| D: JS-rendered | None of the above | Playwright MCP |

Important: JSON-LD may exist but only contain BreadcrumbList, not product data. Always check the actual @type.
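
A minimal sketch that turns the Step 0 signals into a pattern choice (the helper name and thresholds are assumptions, not part of the skill):

# Hypothetical helper: map _assess.py signals to a scraping pattern.
def choose_pattern(html, jsonld_types):
    if html.count('data-product') > 0:
        return 'A'    # data-* attributes
    if html.count('product-card') > 0:
        return 'A2'   # semantic product cards
    if any(t in ('Product', 'ItemList') for t in jsonld_types):
        return 'B'    # JSON-LD with real product data
    if html.count('__NEXT_DATA__') > 0:
        return 'C'    # Next.js serialized state
    return 'D'        # JS-rendered: fall back to Playwright MCP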

Step 1: Extract Product List

Pattern A: HTML data-* attributes

import re

# Assumes `html` was loaded as in Step 0.
products = []
products_raw = re.findall(
    r'<div[^>]*class="[^"]*product[^"]*"[^>]+(data-product-name="[^"]*"[^>]+)>',
    html
)
for attrs_str in products_raw:
    data = dict(re.findall(r'data-([\w-]+)="([^"]*)"', attrs_str))
    # data = {'product-name': '...', 'msrp-price': '...', 'default-id': '...'}
    products.append(data)

Pattern A2: Semantic HTML cards (Logitech, modern SSR sites)

Many sites use <article> or <div> with a class containing "product" and an aria-label holding the product name:

# Example: <article aria-label="MX Keys S" class="product-card ...">
articles = re.findall(
    r'<article[^>]*aria-label="([^"]*)"[^>]*class="[^"]*product-card[^"]*"[^>]*>(.*?)(?=<article[^>]*class="[^"]*product-card|<footer|$)',
    html, re.DOTALL
)
# If aria-label comes after class:
# articles = re.findall(r'<article[^>]*class="[^"]*product-card[^"]*"[^>]*aria-label="([^"]*)"[^>]*>(.*?)...', ...)

for title, card_html in articles:
    href = re.search(r'href="(/[^"]*)"', card_html)
    img = re.search(r'<img[^>]+src="(https://[^"]+)"', card_html)
    if not href or not img:
        continue  # skip cards missing a link or image
    url = 'https://example.com' + href.group(1)
    image = img.group(1)
    prices = re.findall(r'\$(\d+\.?\d*)', card_html)  # [sale, msrp] or [price]
    slug = url.split('/')[-1].replace('.html', '')

Key tip: Don't rely on the first aria-label found inside the card; buttons like "Add to wishlist" also have aria-labels. Extract the product name from the <article> tag's own aria-label, not from child elements.

Pattern B: JSON-LD

import re, json

jsonlds = re.findall(r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL)
products = []
for j in jsonlds:
    d = json.loads(j)
    if isinstance(d, list):
        d = d[0]
    if d.get('@type') in ('Product', 'ItemList'):
        # Fields of interest: d['name'], d['offers']['price'], d['image'], d['sku']
        products.append(d)

Pattern C: Next.js __NEXT_DATA__

import re, json

m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
if m:
    state = json.loads(m.group(1))
    products = state['props']['pageProps'].get('products', [])
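
The exact location of the product array varies between sites. When props.pageProps.products is missing, a small recursive search can surface candidate lists; this helper is an assumption, not part of the skill:

# Hypothetical fallback: walk the __NEXT_DATA__ tree for product-like lists.
def find_product_lists(node, path='$'):
    hits = []
    if isinstance(node, dict):
        for k, v in node.items():
            hits.extend(find_product_lists(v, f'{path}.{k}'))
    elif isinstance(node, list):
        if node and isinstance(node[0], dict) and ({'name', 'price'} & set(node[0]) or {'title', 'url'} & set(node[0])):
            hits.append((path, len(node)))
        for item in node:
            hits.extend(find_product_lists(item, path + '[]'))
    return hits

# find_product_lists(state) -> e.g. [('$.props.pageProps.products', 24)]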

Pattern D: JS-rendered (use Playwright MCP)

await (async (page) => {
    await page.goto('TARGET_URL', {timeout: 30000});
    await page.waitForTimeout(3000);
    const products = await page.evaluate(() => {
        const cards = document.querySelectorAll('[class*="product"]');
        return [...cards].map(c => ({
            title: c.querySelector('[class*="title"]')?.textContent?.trim(),
            price: c.querySelector('[class*="price"]')?.textContent?.trim(),
            url: c.querySelector('a')?.href,
            image: c.querySelector('img')?.src,
        })).filter(p => p.title);
    });
    return products;
})(page);

Normalize all extracted products to this intermediate format:

[
  {
    "title": "Product Name",
    "description": "Short description",
    "url": "https://example.com/product/slug",
    "category": "category-slug",
    "msrp": "299.99",
    "sale_price": "249.99",
    "compare_price": "299.99",
    "sku": "SKU-001",
    "image": "https://cdn.example.com/main.png",
    "product_slug": "product-slug"
  }
]
  • msrp: the regular/original price; used as a fallback.
  • sale_price: the current selling price (for deals/discount pages).
  • compare_price: the "compare at" / "was" price shown to the customer.
  • If there is no sale, set sale_price = msrp and leave compare_price empty.

Save to {brand}-products.json.
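
A minimal sketch normalizing Pattern A2 tuples into this format (the price ordering and field mapping are assumptions; adapt per site):

# Hypothetical normalizer for (title, url, image, prices, slug) tuples.
import json

def normalize(title, url, image, prices, slug, category=''):
    msrp = prices[-1] if prices else ''   # when two prices show, the last is the original
    sale = prices[0] if prices else ''    # the first is the current/sale price
    return {
        'title': title, 'description': '', 'url': url, 'category': category,
        'msrp': msrp, 'sale_price': sale,
        'compare_price': msrp if sale != msrp else '',
        'sku': '', 'image': image, 'product_slug': slug,
    }

# products = [normalize(*t) for t in extracted]
# with open('brand-products.json', 'w', encoding='utf-8') as f:
#     json.dump(products, f, indent=2, ensure_ascii=False)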

Step 2: Fetch Product Gallery Images

For each product, fetch its detail page to get the full image gallery.

IMPORTANT: Use curl via subprocess, not Python urllib. On Windows, Python's urllib has frequent SSL handshake timeouts; curl is more reliable.

Write a _fetch_galleries.py file with this pattern:

import re, json, subprocess, time

with open('brand-products.json', encoding='utf-8') as f:
    products = json.load(f)

def fetch_url(url):
    """Use curl subprocess — more reliable than urllib on Windows."""
    try:
        result = subprocess.run(
            ['curl', '-sL', '-A', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
             '--connect-timeout', '15', '--max-time', '30', url],
            capture_output=True, text=True, encoding='utf-8', errors='replace'
        )
        if result.returncode == 0 and len(result.stdout) > 500:
            return result.stdout
    except Exception:
        pass
    return ''

def filter_product_images(urls, product_slug):
    """Keep only images belonging to this specific product."""
    slug_norm = product_slug.replace('-', '').replace('.', '').lower()
    skip = ['swatch', 'icon', 'logo', '/3d/', '-3d', '/ar/', 'diagram',
            'color-ring', 'callout', 'award', 'badge', 'comparison',
            'color-guide', 'video-thumbnail', 'thumbnail', 'icon-']
    seen, result = set(), []
    for url in urls:
        lower = url.lower()
        if not re.search(r'\.(?:png|jpg|jpeg)', lower):
            continue
        if any(s in lower for s in skip):
            continue
        # Cross-product filter: keep only URLs that mention this product's slug
        if slug_norm and slug_norm not in lower.replace('-', '').replace('.', ''):
            continue
        fname = re.search(r'/([^/?]+\.(?:png|jpg|jpeg))', url)
        if fname:
            base = re.sub(r'-\d+x\d+', '', fname.group(1)).split('?')[0].replace('-2x', '')
            if base in seen:
                continue
            seen.add(base)
        # Upgrade resolution
        url = re.sub(r'w_\d+', 'w_1200', url)
        url = re.sub(r'h_\d+', 'h_900', url)
        result.append(url)
    return result[:8]

results = []
for i, p in enumerate(products):
    print(f'[{i+1}/{len(products)}] {p["title"]}')
    html = fetch_url(p['url'])

    if not html:
        # Fallback to listing image only
        results.append({**p, 'gallery': [p['image']] if p.get('image') else []})
        continue

    # Extract all CDN image URLs
    all_imgs = set()
    for m in re.finditer(r'(https://[^"\'>\s]+?\.(?:png|jpg|jpeg))', html):
        all_imgs.add(m.group(1))
    # Also check srcset
    for m in re.finditer(r'(?:srcset|data-src)="([^"]+)"', html):
        for part in m.group(1).split(','):
            u = part.strip().split()[0]
            if re.search(r'\.(?:png|jpg|jpeg)', u):
                all_imgs.add(u)

    gallery = filter_product_images(sorted(all_imgs), p['product_slug'])  # sort the set for stable output
    if not gallery and p.get('image'):
        gallery = [p['image']]

    results.append({**p, 'gallery': gallery})
    print(f'  Found {len(gallery)} images')

    # Checkpoint every 10 products
    if (i + 1) % 10 == 0:
        with open('brand-products-gallery.json', 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=2, ensure_ascii=False)

    time.sleep(0.5)

with open('brand-products-gallery.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2, ensure_ascii=False)

Key points for gallery extraction:

  • Filter cross-product images: "YOU MAY ALSO LIKE" sections include other products' images. Filter by product slug.
  • Upgrade image resolution: replace w_300 with w_1200 etc. in CDN URLs.
  • Deduplicate by filename: the same image at different sizes → keep one.
  • Max 8 images per product.
  • Checkpoint saves: save progress every 10 products in case of interruption.
  • Slug matching: use slug_norm with dots and hyphens removed. Some slugs contain version numbers (e.g., mx-keys-s.920-011558) that should be stripped for matching; see the sketch after this list.
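
A minimal sketch of stripping a trailing part number before matching (the regex is an assumption derived from the example above):

import re

# Hypothetical: 'mx-keys-s.920-011558' -> 'mxkeyss' for substring matching
slug = 'mx-keys-s.920-011558'
slug = re.sub(r'\.\d[\d-]*$', '', slug)  # strip the trailing '.920-011558'
slug_norm = slug.replace('-', '').replace('.', '').lower()
print(slug_norm)  # mxkeyss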

Save enriched data as {brand}-products-gallery.json.

Step 3: Generate Shopify CSV

Use the bundled script:

python scripts/gen_shopify_csv.py products-gallery.json --vendor "Brand Name" --output brand_shopify.csv

The script supports these fields:

  • gallery or gallery_images: array of image URLs
  • sale_price: selling price (used as Variant Price)
  • compare_price or msrp: original/compare-at price (used as Variant Compare At Price)
  • image: fallback main image if gallery is empty

Or generate inline following the Shopify CSV format:

Shopify CSV structure (UTF-8 BOM, all fields double-quoted):

| Row Type | Handle | Title | Body (HTML) | Image Src | Image Position |
|---|---|---|---|---|---|
| Product | product-slug | Product Name | <p>Description</p> | URL1 | 1 |
| Image 2 | product-slug | | | URL2 | 2 |
| Image 3 | product-slug | | | URL3 | 3 |

Key rules:

  • Handle: lowercase + hyphens only (re.sub(r'[^a-z0-9]+', '-', name.lower()))
  • First row: product info + first image
  • Image rows: same Handle, only fill Image Src/Position/Alt Text
  • Images: use remote CDN URLs (Shopify downloads them on import)
  • Max 8 images per product
  • Encoding: utf-8-sig (BOM for Excel compatibility)
  • Quoting: csv.QUOTE_ALL

Required Shopify CSV headers:

Handle, Title, Body (HTML), Vendor, Type, Tags, Published,
Option1 Name, Option1 Value, Variant SKU, Variant Grams,
Variant Inventory Tracker, Variant Inventory Qty,
Variant Inventory Policy, Variant Fulfillment Service,
Variant Price, Variant Compare At Price,
Variant Requires Shipping, Variant Taxable, Variant Barcode,
Image Src, Image Position, Image Alt Text, Gift Card, Variant Image
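
A minimal sketch of the inline approach, assuming the enriched gallery JSON from Step 2 (a simplification, not the bundled script; real output should carry every required header above):

# Hypothetical inline generator: one product row, then image-only rows.
import csv, json, re

HEADERS = ['Handle', 'Title', 'Body (HTML)', 'Vendor', 'Variant Price',
           'Variant Compare At Price', 'Image Src', 'Image Position', 'Image Alt Text']

with open('brand-products-gallery.json', encoding='utf-8') as f:
    products = json.load(f)

with open('brand_shopify.csv', 'w', newline='', encoding='utf-8-sig') as f:
    w = csv.DictWriter(f, fieldnames=HEADERS, quoting=csv.QUOTE_ALL)
    w.writeheader()
    for p in products:
        handle = re.sub(r'[^a-z0-9]+', '-', p['title'].lower()).strip('-')
        gallery = [u for u in (p.get('gallery') or [p.get('image')]) if u][:8]
        w.writerow({'Handle': handle, 'Title': p['title'],
                    'Body (HTML)': f"<p>{p.get('description', '')}</p>",
                    'Vendor': 'Brand Name',
                    'Variant Price': p.get('sale_price', ''),
                    'Variant Compare At Price': p.get('compare_price', ''),
                    'Image Src': gallery[0] if gallery else '',
                    'Image Position': '1', 'Image Alt Text': p['title']})
        for i, img in enumerate(gallery[1:], start=2):
            w.writerow({'Handle': handle, 'Image Src': img,
                        'Image Position': str(i), 'Image Alt Text': p['title']})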

Step 4: Verify Output

Write a _verify.py file (not inline python):

import csv
with open('output.csv', encoding='utf-8-sig') as f:
    rows = list(csv.DictReader(f))
product_rows = [r for r in rows if r['Title']]
image_rows = [r for r in rows if not r['Title']]
print(f'Total rows: {len(rows)}')
print(f'Product rows: {len(product_rows)}')
print(f'Image-only rows: {len(image_rows)}')
for r in product_rows:
    print(f"  {r['Title'][:45]:45s}  ${r['Variant Price']:>8s}  {r['Variant Compare At Price'] and '$'+r['Variant Compare At Price'] or '':>8s}")

Report to user: product count, image count, categories, file location.

Anti-Bot Mitigation

| Strategy | Implementation |
|---|---|
| User-Agent | curl -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" |
| Rate limit | time.sleep(0.5) between requests; sleep 0.5 in bash |
| Batch size | Process max 20-30 URLs per batch |
| Timeout | curl --connect-timeout 15 --max-time 30 |
| SSL issues | Use curl subprocess, not Python urllib |
| Access Denied | Fall back to Playwright MCP with longer delays (3-5s) |
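
A minimal retry wrapper around the Step 2 fetch_url helper; the attempt count and backoff schedule are assumptions:

# Hypothetical retry wrapper: back off and retry before giving up on a URL.
import time

def fetch_with_retry(url, attempts=3, base_delay=1.0):
    for attempt in range(attempts):
        html = fetch_url(url)  # curl-based helper defined in Step 2
        if html:
            return html
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s between tries
    return ''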

Windows Development Notes

  • ALWAYS write Python to .py files; python -c with multi-line code fails with IndentationError on Windows
  • Use curl subprocess for HTTP fetches, not urllib; avoids SSL handshake timeouts
  • No grep -P on Windows; use grep -o or grep -c (no Perl regex)
  • Use / not \ in Python paths
  • Use the project directory for temp files (not /tmp/)
  • CSV encoding: utf-8-sig
  • curl is available natively on Windows 10/11

Troubleshooting

| Issue | Cause | Solution |
|---|---|---|
| Empty product list | JS-rendered page | Use Playwright MCP (Step 1 Pattern D) |
| Cross-product images | "Recommended" section | Filter by product slug (Step 2) |
| 404/BlobNotFound images | CDN variant doesn't exist | Only use URLs found in DOM, don't guess |
| CSV encoding issues | Missing BOM | Use utf-8-sig encoding |
| Price shows as 0 | Price in JS variable | Extract from rendered DOM with Playwright |
| Duplicate images | CDN serves same image at different sizes | Deduplicate by base filename |
| SSL timeout on fetch | Python urllib unreliable on Windows | Use curl via subprocess (Step 2) |
| JSON-LD exists but no products | Only BreadcrumbList, not Product | Check @type value, try Pattern A2 |
| Wrong aria-label parsed | Button aria-labels inside cards | Use article-level aria-label, not child elements |
| python -c IndentationError | Windows bash handling of multi-line | Write to .py file, run with python file.py |
| grep -oP not working | Windows grep lacks Perl regex | Use grep -o or Python regex |
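
For the last row, a minimal Python stand-in for grep -oP (the pattern shown is a hypothetical example):

# _grep.py: portable replacement for `grep -oP` on Windows
import re

with open('_page.html', encoding='utf-8', errors='replace') as f:
    text = f.read()
for match in re.findall(r'data-product-name="([^"]*)"', text):
    print(match)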
