Agent-skills gallery-scraper

Bulk download images from login-protected gallery websites using an attached browser session. Use when asked to scrape, download, or save images from authenticated gallery pages, extract full-size images from thumbnails, or batch download from multi-page galleries.

install
source · Clone the upstream repo
git clone https://github.com/jdrhyne/agent-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jdrhyne/agent-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/clawdbot/gallery-scraper" ~/.claude/skills/jdrhyne-agent-skills-gallery-scraper && rm -rf "$T"
manifest: clawdbot/gallery-scraper/SKILL.md
source content

Gallery Scraper

Bulk download images from authenticated gallery websites via browser relay.

Safety Boundaries

  • Do not access gallery sites or user accounts that the user has not explicitly attached and authorized.
  • Do not download beyond the selected gallery, profile, or page range without confirmation.
  • Do not store cookies, tokens, or hidden form values in local output files.
  • Do not keep retrying blocked downloads indefinitely; surface rate limits or auth failures instead.

Prerequisites

  • User must have Chrome with the OpenClaw Browser Relay extension installed
  • User must be logged into the target site
  • User must attach the browser tab (click relay toolbar button, badge ON)

Workflow

1. Attach Browser Tab

Ask the user to:

  1. Log into the gallery site in Chrome
  2. Navigate to the target gallery/profile page
  3. Click the OpenClaw Browser Relay toolbar button (badge shows ON)

2. Discover Image URL Pattern

Most gallery sites store full-size URLs in data attributes. Common patterns:

// Extract via browser evaluate
() => {
  // Try common patterns
  const patterns = [
    'img[data-max]',           // data-max attribute
    'img[data-src]',           // lazy-load pattern
    'img[data-full]',          // full-size pattern
    'a[data-lightbox] img',    // lightbox galleries
    '.gallery-item img'        // generic gallery
  ];
  
  for (const sel of patterns) {
    const imgs = document.querySelectorAll(sel);
    if (imgs.length > 0) {
      return {
        selector: sel,
        count: imgs.length,
        sample: imgs[0].outerHTML.substring(0, 200)
      };
    }
  }
  return null;
}

3. Extract Full-Size URLs

Once the pattern is identified, extract all URLs:

// For data-max pattern (common)
() => Array.from(document.querySelectorAll('img[data-max]'))
  .map(img => img.dataset.max)

// For thumbnail→full conversion (replace path segment)
() => Array.from(document.querySelectorAll('.gallery img'))
  .map(img => img.src.replace('/thumb/', '/full/'))
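Before saving the list, it is worth deduplicating and sanity-checking the extracted URLs; thumbnails often repeat across a page, and extraction can pick up stray non-HTTP values. A minimal, order-preserving sketch (pure Python, no site-specific assumptions):

```python
def clean_urls(urls):
    """Strip whitespace, drop non-HTTP(S) entries, and dedupe while keeping order."""
    seen = set()
    cleaned = []
    for raw in urls:
        url = raw.strip()
        if not url.startswith(("http://", "https://")):
            continue  # skip data: URIs, relative paths, empty lines
        if url in seen:
            continue
        seen.add(url)
        cleaned.append(url)
    return cleaned
```

Run this over the evaluate output before writing urls.txt for the bulk-download step.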

4. Handle Pagination

Check for multiple pages:

() => {
  const pagination = document.querySelectorAll('.pagination a, [class*="page"] a');
  return Array.from(pagination).map(a => ({text: a.textContent, href: a.href}));
}

Navigate to each page and collect URLs.
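If the pagination hrefs follow a predictable query parameter, the page URLs can be generated up front instead of clicking through each page. A sketch assuming a hypothetical ?page=N scheme (adjust to whatever the extracted hrefs actually show):

```python
def page_urls(base_url, last_page, param="page"):
    """Build one URL per page, 1-indexed, using a query parameter."""
    sep = "&" if "?" in base_url else "?"
    return [f"{base_url}{sep}{param}={n}" for n in range(1, last_page + 1)]
```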

4b. Batch scrape multiple galleries (iframe trick)

When you need multiple galleries quickly and can't automate CDP, you can load each gallery in a hidden iframe and extract its data-max URLs:

async () => {
  const urls = [
    'https://site.example/galleries/view/123',
    'https://site.example/galleries/view/456'
  ];
  const results = [];
  for (const url of urls) {
    const iframe = document.createElement('iframe');
    iframe.style.position = 'fixed';
    iframe.style.left = '-9999px';
    iframe.style.width = '800px';
    iframe.style.height = '600px';
    iframe.src = url;
    document.body.appendChild(iframe);
    await new Promise((resolve, reject) => {
      const t = setTimeout(() => reject(new Error('timeout load')), 20000);
      iframe.onload = () => { clearTimeout(t); resolve(); };
    });
    const doc = iframe.contentDocument;
    if (!doc) {
      // Cross-origin page: contentDocument is null, so this trick only works same-origin
      results.push({ id: url.split('/').pop(), urls: [], error: 'cross-origin' });
      iframe.remove();
      continue;
    }
    const start = Date.now();
    let imgs = [];
    while (Date.now() - start < 20000) {
      imgs = Array.from(doc.querySelectorAll('img[data-max]')).map(i => i.dataset.max);
      if (imgs.length) break;
      await new Promise(r => setTimeout(r, 500));
    }
    results.push({ id: url.split('/').pop(), urls: imgs });
    iframe.remove();
  }
  return results;
}

5. Check CDN Access

Test if CDN requires authentication or just Referer:

# Test direct access
curl -I "CDN_URL" 2>/dev/null | head -3

# Test with Referer
curl -I -H "Referer: https://SITE_DOMAIN/" "CDN_URL" 2>/dev/null | head -3
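To interpret the two checks, compare the status codes: if the direct request is blocked but the Referer request succeeds, a Referer header alone is enough for the bulk download. A small helper sketch (assumes the conventional 401/403/200 codes; verify against the actual responses):

```python
def referer_sufficient(status_direct, status_with_referer):
    """True when the CDN rejects anonymous requests but accepts one with a Referer."""
    return status_direct in (401, 403) and status_with_referer == 200
```

If both requests fail, the CDN likely also wants session cookies (see Troubleshooting).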

6. Bulk Download

Collect the URLs into a text file, then download in parallel:

# Create output directory
mkdir -p ~/Downloads/gallery_name

# Download with Referer header (parallel)
cd ~/Downloads/gallery_name
while IFS= read -r url; do
  filename=$(basename "$url")
  curl -sf -H "Referer: https://SITE_DOMAIN/" -o "$filename" "$url" &
  [ "$(jobs -r | wc -l)" -ge 8 ] && wait -n  # cap at 8 parallel jobs (wait -n needs bash 4.3+)
done < urls.txt
wait

Python ThreadPool fallback (avoids shell quoting + wait -n issues):

import os
import requests
from concurrent.futures import ThreadPoolExecutor

outdir = os.path.expanduser('~/Downloads/gallery_name')
os.makedirs(outdir, exist_ok=True)
headers = {'Referer': 'https://SITE_DOMAIN/', 'User-Agent': 'Mozilla/5.0'}

with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

def download(url):
    filename = os.path.join(outdir, os.path.basename(url))
    if os.path.exists(filename) and os.path.getsize(filename) > 0:
        return
    r = requests.get(url, headers=headers, timeout=60)
    r.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(r.content)

with ThreadPoolExecutor(max_workers=8) as ex:
    futures = {ex.submit(download, url): url for url in urls}
    for future, url in futures.items():
        try:
            future.result()  # surface download errors instead of silently swallowing them
        except Exception as exc:
            print(f'failed: {url}: {exc}')

Handling Lock Buttons

Some galleries have "lock" buttons to reveal hidden content. Look for:

// Find lock/unlock buttons
() => {
  const locks = document.querySelectorAll(
    '[class*="lock"], [class*="unlock"], ' +
    'button[title*="lock"], .premium-unlock'
  );
  return Array.from(locks).map(el => ({
    tag: el.tagName,
    class: el.className,
    text: el.innerText?.substring(0, 30)
  }));
}

Click each lock button and wait for the revealed content to load before extracting URLs.

Output Organization

Optionally organize by gallery:

# Derive a gallery-specific folder name from the selected URL
mkdir -p "gallery_<id>"
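One way to derive that folder name is to slug the last path segment of the gallery URL. A sketch:

```python
import re

def gallery_dirname(url):
    """Use the last non-empty path segment as the gallery id, sanitized for the filesystem."""
    path = url.split("?", 1)[0]
    segment = [s for s in path.split("/") if s][-1]
    return "gallery_" + re.sub(r"[^A-Za-z0-9_-]", "_", segment)
```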

Troubleshooting

  • 403 Forbidden: Add Referer header or extract cookies from browser
  • Rate limited: Reduce parallel downloads, add delays
  • Missing images: Check for JavaScript-loaded content; may require scroll injection
  • Login required for CDN: Extract session cookies via document.cookie
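If the CDN does need cookies, the document.cookie string can be parsed into a dict and passed to requests. A sketch (the cookie names shown in the test are hypothetical):

```python
def parse_cookie_string(cookie_str):
    """Turn a document.cookie string ('a=1; b=2') into a dict for requests' cookies= kwarg."""
    cookies = {}
    for part in cookie_str.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep:
            cookies[name] = value
    return cookies
```

Usage: requests.get(url, headers=headers, cookies=parse_cookie_string(raw)). Note that HttpOnly cookies are not visible to document.cookie, so some sessions cannot be extracted this way.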