Fetch-everything shopify-product-scraper
```bash
git clone https://github.com/liangdabiao/fetch-everything
```

Or copy just this skill into your skills directory:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/liangdabiao/fetch-everything "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/shopify-product-scraper" ~/.claude/skills/liangdabiao-fetch-everything-shopify-product-scraper && rm -rf "$T"
```

`.claude/skills/shopify-product-scraper/SKILL.md`

Shopify Product Scraper
Scrape product data from brand websites and generate Shopify CSV with remote CDN image URLs (no image downloading needed).
CRITICAL: Windows Python Rule
ALWAYS write Python code to `.py` files and run with `python script.py`. NEVER use `python -c "..."` with multi-line code on Windows — it causes `IndentationError: unexpected indent` or `|| goto :error`. Only single-line `python -c "import json; print(1)"` is safe.
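The write-then-run pattern can be sketched in Python itself — the filename `_task.py` is a hypothetical example:

```python
import pathlib
import subprocess
import sys

code = '''
import json
print(json.dumps({"ok": 1}))
'''

# Write the multi-line code to a .py file instead of passing it to `python -c`
pathlib.Path('_task.py').write_text(code, encoding='utf-8')

# Run the file — this avoids the Windows multi-line IndentationError failure mode
result = subprocess.run([sys.executable, '_task.py'],
                        capture_output=True, text=True)
print(result.stdout.strip())  # {"ok": 1}
```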
Workflow
User provides URL → Assess site → Extract products → Fetch galleries → Generate CSV
Step 0: Assess Target Site
```bash
curl -sL -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" \
  "TARGET_URL" -o _page.html --connect-timeout 15 --max-time 30
wc -c _page.html
```
Check which pattern the site uses (write to a `_assess.py` file):

```python
import json
import re

with open('_page.html', encoding='utf-8') as f:
    html = f.read()

print('data-product:', html.count('data-product'))
print('product-card:', html.count('product-card'))
print('aria-label="Product":', len(re.findall(r'aria-label="[^"]*"[^>]*class="[^"]*product', html)))

jsonlds = re.findall(r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL)
types = []
for j in jsonlds:
    d = json.loads(j)
    if isinstance(d, list):
        types.extend([x.get('@type', '?') for x in d])
    else:
        types.append(d.get('@type', '?'))
print('JSON-LD types:', types)  # BreadcrumbList ≠ Product!
print('Has Product/ItemList:', any(t in ('Product', 'ItemList') for t in types))
print('__NEXT_DATA__:', html.count('__NEXT_DATA__'))
```
Pattern decision:
| Pattern | Indicators | Method |
|---|---|---|
| A: data-* attrs | `data-product` count > 0 | curl + regex on data attributes |
| A2: Semantic cards | `product-card` class + `aria-label` | curl + regex on article/div |
| B: JSON-LD Product | `Product` or `ItemList` in JSON-LD types | curl + regex on ld+json |
| C: Next.js | `__NEXT_DATA__` count > 0 | curl + parse JSON state |
| D: JS-rendered | None of the above | Playwright MCP |
Important: JSON-LD may exist but only contain `BreadcrumbList`, not product data. Always check the actual `@type`.
Step 1: Extract Product List
Pattern A: HTML data-* attributes
```python
import re

products_raw = re.findall(
    r'<div[^>]*class="[^"]*product[^"]*"[^>]+(data-product-name="[^"]*"[^>]+)>',
    html
)
for attrs_str in products_raw:
    data = dict(re.findall(r'data-([\w-]+)="([^"]*)"', attrs_str))
    # data = {'product-name': '...', 'msrp-price': '...', 'default-id': '...'}
```
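The Pattern A regexes can be exercised on a small hand-written fragment as a sanity check (the tag, class, and attribute names below are hypothetical, not from a real site):

```python
import re

# Hypothetical listing-page fragment using data-* attributes
html = ('<div class="grid product-tile" data-product-name="Widget Pro" '
        'data-msrp-price="99.99" data-default-id="SKU-42">')

products_raw = re.findall(
    r'<div[^>]*class="[^"]*product[^"]*"[^>]+(data-product-name="[^"]*"[^>]+)>',
    html
)
for attrs_str in products_raw:
    # Keys keep the hyphens from the HTML attribute names
    data = dict(re.findall(r'data-([\w-]+)="([^"]*)"', attrs_str))
    print(data)
    # {'product-name': 'Widget Pro', 'msrp-price': '99.99', 'default-id': 'SKU-42'}
```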
Pattern A2: Semantic HTML cards (Logitech, modern SSR sites)
Many sites use `<article>` or `<div>` with a class containing "product" and an `aria-label` for the product name:

```python
# Example: <article aria-label="MX Keys S" class="product-card ...">
articles = re.findall(
    r'<article[^>]*aria-label="([^"]*)"[^>]*class="[^"]*product-card[^"]*"[^>]*>'
    r'(.*?)(?=<article[^>]*class="[^"]*product-card|<footer|$)',
    html, re.DOTALL
)
# If aria-label comes after class:
# articles = re.findall(r'<article[^>]*class="[^"]*product-card[^"]*"[^>]*aria-label="([^"]*)"[^>]*>(.*?)...', ...)
for title, card_html in articles:
    url = 'https://example.com' + re.search(r'href="(/[^"]*)"', card_html).group(1)
    image = re.search(r'<img[^>]+src="(https://[^"]+)"', card_html).group(1)
    prices = re.findall(r'\$(\d+\.?\d*)', card_html)  # [sale, msrp] or [price]
    slug = url.split('/')[-1].replace('.html', '')
```
Key tip: Don't rely on the first `aria-label` found inside the card — buttons like "Add to wishlist" also have aria-labels. Extract the product name from the `<article>` tag's own `aria-label`, not from child elements.
Pattern B: JSON-LD
```python
jsonlds = re.findall(r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL)
for j in jsonlds:
    d = json.loads(j)
    if isinstance(d, list):
        d = d[0]
    if d.get('@type') in ('Product', 'ItemList'):
        ...  # d['name'], d['offers']['price'], d['image'], d['sku']
```
Pattern C: Next.js `__NEXT_DATA__`

```python
m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
state = json.loads(m.group(1))
products = state['props']['pageProps'].get('products', [])
```
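The `products` key is often nested somewhere else under `pageProps`, and the exact path varies per site. A hedged fallback, assuming nothing about the page's schema, is to walk the state tree and collect any list of dicts that looks product-shaped:

```python
def find_product_lists(node, results=None):
    """Recursively collect lists of dicts that carry a name/title key
    plus a price-like key — a heuristic, not a guaranteed schema."""
    if results is None:
        results = []
    if isinstance(node, dict):
        for v in node.values():
            find_product_lists(v, results)
    elif isinstance(node, list):
        if node and all(isinstance(x, dict) for x in node):
            keys = set().union(*(x.keys() for x in node))
            if ({'name', 'title'} & keys) and ({'price', 'msrp', 'salePrice'} & keys):
                results.append(node)
        for v in node:
            find_product_lists(v, results)
    return results

# Hypothetical __NEXT_DATA__ payload with products nested under 'catalog'
state = {'props': {'pageProps': {'catalog': {'items': [
    {'title': 'Widget', 'price': '9.99'},
    {'title': 'Gadget', 'price': '19.99'},
]}}}}
print(len(find_product_lists(state)))  # 1
```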
Pattern D: JS-rendered (use Playwright MCP)
```javascript
await (async (page) => {
  await page.goto('TARGET_URL', {timeout: 30000});
  await page.waitForTimeout(3000);
  const products = await page.evaluate(() => {
    const cards = document.querySelectorAll('[class*="product"]');
    return [...cards].map(c => ({
      title: c.querySelector('[class*="title"]')?.textContent?.trim(),
      price: c.querySelector('[class*="price"]')?.textContent?.trim(),
      url: c.querySelector('a')?.href,
      image: c.querySelector('img')?.src,
    })).filter(p => p.title);
  });
  return products;
})(page);
```
Normalize all extracted products to this intermediate format:
```json
[
  {
    "title": "Product Name",
    "description": "Short description",
    "url": "https://example.com/product/slug",
    "category": "category-slug",
    "msrp": "299.99",
    "sale_price": "249.99",
    "compare_price": "299.99",
    "sku": "SKU-001",
    "image": "https://cdn.example.com/main.png",
    "product_slug": "product-slug"
  }
]
```
- `msrp`: The regular/original price. Used as fallback.
- `sale_price`: The current selling price (for deals/discount pages).
- `compare_price`: The "compare at" / "was" price shown to the customer.
- If no sale, set `sale_price = msrp` and leave `compare_price` empty.
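These fallback rules can be captured in a small helper (a sketch; the function name is an assumption, the field names follow the intermediate format above):

```python
def normalize_prices(msrp, sale_price=None):
    """If there's no sale, sale_price falls back to msrp
    and compare_price stays empty."""
    if sale_price and sale_price != msrp:
        return {'msrp': msrp, 'sale_price': sale_price, 'compare_price': msrp}
    return {'msrp': msrp, 'sale_price': msrp, 'compare_price': ''}

print(normalize_prices('299.99', '249.99'))
# {'msrp': '299.99', 'sale_price': '249.99', 'compare_price': '299.99'}
print(normalize_prices('299.99'))
# {'msrp': '299.99', 'sale_price': '299.99', 'compare_price': ''}
```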
Save to `{brand}-products.json`.
Step 2: Fetch Product Gallery Images
For each product, fetch its detail page to get the full image gallery.
IMPORTANT: Use `curl` via subprocess, not Python `urllib`. On Windows, `urllib` has frequent SSL handshake timeouts; curl is more reliable.

Write a `_fetch_galleries.py` file with this pattern:
```python
import json
import re
import subprocess
import time

with open('brand-products.json', encoding='utf-8') as f:
    products = json.load(f)

def fetch_url(url):
    """Use curl subprocess — more reliable than urllib on Windows."""
    try:
        result = subprocess.run(
            ['curl', '-sL', '-A',
             'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
             '--connect-timeout', '15', '--max-time', '30', url],
            capture_output=True, text=True, encoding='utf-8', errors='replace'
        )
        if result.returncode == 0 and len(result.stdout) > 500:
            return result.stdout
    except Exception:
        pass
    return ''

def filter_product_images(urls, product_slug):
    """Keep only images belonging to this specific product."""
    slug_norm = product_slug.replace('-', '').replace('.', '').lower()
    skip = ['swatch', 'icon', 'logo', '/3d/', '-3d', '/ar/', 'diagram',
            'color-ring', 'callout', 'award', 'badge', 'comparison',
            'color-guide', 'video-thumbnail', 'thumbnail', 'icon-']
    seen, result = set(), []
    for url in urls:
        lower = url.lower()
        if not re.search(r'\.(?:png|jpg|jpeg)', lower):
            continue
        if any(s in lower for s in skip):
            continue
        fname = re.search(r'/([^/?]+\.(?:png|jpg|jpeg))', url)
        if fname:
            base = re.sub(r'-\d+x\d+', '', fname.group(1)).split('?')[0].replace('-2x', '')
            if base in seen:
                continue
            seen.add(base)
        # Upgrade resolution
        url = re.sub(r'w_\d+', 'w_1200', url)
        url = re.sub(r'h_\d+', 'h_900', url)
        result.append(url)
    return result[:8]

results = []
for i, p in enumerate(products):
    print(f'[{i+1}/{len(products)}] {p["title"]}')
    html = fetch_url(p['url'])
    if not html:
        # Fallback to listing image only
        results.append({**p, 'gallery': [p['image']] if p.get('image') else []})
        continue
    # Extract all CDN image URLs
    all_imgs = set()
    for m in re.finditer(r'(https://[^"\'>\s]+?\.(?:png|jpg|jpeg))', html):
        all_imgs.add(m.group(1))
    # Also check srcset
    for m in re.finditer(r'(?:srcset|data-src)="([^"]+)"', html):
        for part in m.group(1).split(','):
            u = part.strip().split()[0]
            if re.search(r'\.(?:png|jpg|jpeg)', u):
                all_imgs.add(u)
    gallery = filter_product_images(all_imgs, p['product_slug'])
    if not gallery and p.get('image'):
        gallery = [p['image']]
    results.append({**p, 'gallery': gallery})
    print(f'  Found {len(gallery)} images')
    # Checkpoint every 10 products
    if (i + 1) % 10 == 0:
        with open('brand-products-gallery.json', 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=2, ensure_ascii=False)
    time.sleep(0.5)

with open('brand-products-gallery.json', 'w', encoding='utf-8') as f:
    json.dump(results, f, indent=2, ensure_ascii=False)
```
Key points for gallery extraction:
- Filter cross-product images: "YOU MAY ALSO LIKE" sections include other products' images. Filter by product slug.
- Upgrade image resolution: Replace `w_300` → `w_1200` etc. in CDN URLs.
- Deduplicate by filename: Same image at different sizes → keep one.
- Max 8 images per product.
- Checkpoint saves: Save progress every 10 products in case of interruption.
- Slug matching: Use `slug_norm` with dots and hyphens removed. Some slugs contain version numbers (e.g., `mx-keys-s.920-011558`) that should be stripped for matching.
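The slug-matching rule can be sketched as follows (`slug_matches` and the version-suffix pattern are assumptions to adapt per site; the URLs are hypothetical):

```python
import re

def slug_matches(url, product_slug):
    """Normalize the slug (strip dots/hyphens and a trailing version
    number like .920-011558) and check it appears in the URL."""
    slug = re.sub(r'\.[\d-]+$', '', product_slug)   # drop version suffix
    slug_norm = slug.replace('-', '').replace('.', '').lower()
    url_norm = re.sub(r'[-.]', '', url.lower())
    return slug_norm in url_norm

print(slug_matches('https://cdn.example.com/mx-keys-s/gallery-1.png',
                   'mx-keys-s.920-011558'))   # True
print(slug_matches('https://cdn.example.com/mx-master-3/gallery-1.png',
                   'mx-keys-s.920-011558'))   # False
```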
Save enriched data as `{brand}-products-gallery.json`.
Step 3: Generate Shopify CSV
Use the bundled script:
```bash
python scripts/gen_shopify_csv.py products-gallery.json --vendor "Brand Name" --output brand_shopify.csv
```
The script supports these fields:
- `gallery` or `gallery_images` — array of image URLs
- `sale_price` — selling price (used as Variant Price)
- `compare_price` or `msrp` — original/compare-at price (used as Variant Compare At Price)
- `image` — fallback main image if gallery is empty
Or generate inline following the Shopify CSV format:
Shopify CSV structure (UTF-8 BOM, all fields double-quoted):
| Row Type | Handle | Title | Body (HTML) | Image Src | Image Position |
|---|---|---|---|---|---|
| Product | product-handle | Product Name | description | URL | 1 |
| Image 2 | product-handle | | | URL | 2 |
| Image 3 | product-handle | | | URL | 3 |
Key rules:
- Handle: lowercase + hyphens only (`re.sub(r'[^a-z0-9]+', '-', name.lower())`)
- First row: product info + first image
- Image rows: same Handle, only fill Image Src/Position/Alt Text
- Images: use remote CDN URLs (Shopify downloads them on import)
- Max 8 images per product
- Encoding: `utf-8-sig` (BOM for Excel compatibility)
- Quoting: `csv.QUOTE_ALL`
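Putting the handle and row rules together, an inline generator might be sketched like this (the header list is abridged — a real import needs the full required set; `make_handle`, the filename, and the product data are hypothetical):

```python
import csv
import re

# Abridged header list — a real Shopify import needs the full required set
HEADERS = ['Handle', 'Title', 'Body (HTML)', 'Variant Price',
           'Variant Compare At Price', 'Image Src', 'Image Position']

def make_handle(name):
    # Handle rule: lowercase + hyphens only
    return re.sub(r'[^a-z0-9]+', '-', name.lower()).strip('-')

product = {'title': 'MX Keys S', 'description': 'A keyboard',
           'sale_price': '99.99', 'msrp': '119.99',
           'gallery': ['https://cdn.example.com/a.png',
                       'https://cdn.example.com/b.png']}

with open('demo.csv', 'w', newline='', encoding='utf-8-sig') as f:
    w = csv.writer(f, quoting=csv.QUOTE_ALL)   # all fields double-quoted
    w.writerow(HEADERS)
    handle = make_handle(product['title'])
    # First row: product info + first image
    w.writerow([handle, product['title'], product['description'],
                product['sale_price'], product['msrp'],
                product['gallery'][0], 1])
    # Image rows: same handle, only the image columns filled (max 8 images)
    for pos, img in enumerate(product['gallery'][1:8], start=2):
        w.writerow([handle, '', '', '', '', img, pos])

print(make_handle('MX Keys S'))  # mx-keys-s
```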
Required Shopify CSV headers:
```
Handle, Title, Body (HTML), Vendor, Type, Tags, Published, Option1 Name, Option1 Value, Variant SKU, Variant Grams, Variant Inventory Tracker, Variant Inventory Qty, Variant Inventory Policy, Variant Fulfillment Service, Variant Price, Variant Compare At Price, Variant Requires Shipping, Variant Taxable, Variant Barcode, Image Src, Image Position, Image Alt Text, Gift Card, Variant Image
```
Step 4: Verify Output
Write a `_verify.py` file (not inline python):

```python
import csv

with open('output.csv', encoding='utf-8-sig') as f:
    rows = list(csv.DictReader(f))

product_rows = [r for r in rows if r['Title']]
image_rows = [r for r in rows if not r['Title']]
print(f'Total rows: {len(rows)}')
print(f'Product rows: {len(product_rows)}')
print(f'Image-only rows: {len(image_rows)}')
for r in product_rows:
    compare = r['Variant Compare At Price']
    print(f"  {r['Title'][:45]:45s} ${r['Variant Price']:>8s} "
          f"{('$' + compare) if compare else '':>8s}")
```
Report to user: product count, image count, categories, file location.
Anti-Bot Mitigation
| Strategy | Implementation |
|---|---|
| User-Agent | Real browser UA: `Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36` |
| Rate limit | `time.sleep(0.5)` between requests; `sleep 1` in bash |
| Batch size | Process max 20-30 URLs per batch |
| Timeout | curl `--connect-timeout 15 --max-time 30` |
| SSL issues | Use `curl` subprocess, not Python `urllib` |
| Access Denied | Fall back to Playwright MCP with longer delays (3-5s) |
Windows Development Notes
- ALWAYS write Python to `.py` files — multi-line `python -c` fails with IndentationError on Windows
- Use `curl` subprocess for HTTP fetches, not `urllib` — avoids SSL handshake timeouts
- No `grep -P` on Windows — use `grep -o` or `grep -c` (no Perl regex)
- Use `/` not `\` in Python paths
- Use the project directory for temp files (not `/tmp/`)
- CSV encoding: `utf-8-sig`
- curl is available natively on Windows 10/11
Troubleshooting
| Issue | Cause | Solution |
|---|---|---|
| Empty product list | JS-rendered page | Use Playwright MCP (Step 1 Pattern D) |
| Cross-product images | "Recommended" section | Filter by product slug (Step 2) |
| 404/BlobNotFound images | CDN variant doesn't exist | Only use URLs found in DOM, don't guess |
| CSV encoding issues | Missing BOM | Use `utf-8-sig` encoding |
| Price shows as 0 | Price in JS variable | Extract from rendered DOM with Playwright |
| Duplicate images | CDN serves same image at different sizes | Deduplicate by base filename |
| SSL timeout on fetch | Python urllib unreliable on Windows | Use `curl` via subprocess (Step 2) |
| JSON-LD exists but no products | Only BreadcrumbList, not Product | Check `@type` value, try Pattern A2 |
| Wrong aria-label parsed | Button aria-labels inside cards | Use article-level aria-label, not child elements |
| IndentationError | Windows bash handling of multi-line `python -c` | Write to a file, run with `python script.py` |
| `grep -P` not working | Windows grep lacks Perl regex | Use `grep -o` or Python regex |
References
- See references/scraping-patterns.md for detailed CDN URL patterns, image filtering code, and site-specific techniques