OpenSpace http-response-handling

Handle websites requiring JavaScript by using curl with browser headers and validating file types.

install

source · Clone the upstream repo

git clone https://github.com/HKUDS/OpenSpace

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/HKUDS/OpenSpace "$T" && mkdir -p ~/.claude/skills && cp -r "$T/gdpval_bench/skills/http-response-handling" ~/.claude/skills/hkuds-openspace-http-response-handling && rm -rf "$T"

manifest: gdpval_bench/skills/http-response-handling/SKILL.md

source content

HTTP Response Handling for JavaScript-Dependent Sites

When to Use This Skill

Use this technique when you need to fetch content from websites that:

Render content dynamically with JavaScript
Return placeholder HTML when accessed by non-browser clients
Deliver different content based on User-Agent headers

Core Technique

Step 1: Fetch Content with Browser-Like Headers

Use

curl

with a realistic User-Agent header to mimic a real browser:

curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" -o output.html "https://example.com"

Key flags:

```
-L
```
— Follow redirects
```
-A
```
— Set User-Agent header to mimic a real browser
```
-o
```
— Save output to file for inspection

Step 2: Detect Placeholder HTML Responses

After fetching, check if you received a placeholder response instead of actual content:

# Check file size (placeholder responses are often very small)
wc -c output.html

# Check for common placeholder indicators
grep -i "javascript" output.html | head -5
grep -i "loading" output.html | head -5
grep -i "noscript" output.html | head -5

Signs of a placeholder response:

File size is suspiciously small (<5KB for content pages)
Contains大量 JavaScript but minimal actual content
Has "loading", "spinner", or "noscript" tags
Missing expected text/data from the page

Step 3: Validate File Type Before Parsing

Before attempting format-specific parsing, validate the file type:

# Check the file type
file output.html

# Check the actual content type (if you have the headers)
curl -I -A "Mozilla/5.0 ..." "https://example.com" | grep -i content-type

# Inspect first few lines
head -50 output.html

Common checks:

HTML files: Should start with
```
<!DOCTYPE
```
or
```
<html
```
JSON files: Should start with
```
{
```
or
```
[
```
PDF files: Should start with
```
%PDF
```
Empty/error pages: May contain error messages or generic HTML

Step 4: Handle Different Scenarios

If you got valid HTML content:

# Proceed with HTML parsing or extraction
grep -oP '(?<=<title>).*?(?=</title>)' output.html

If you got a placeholder/JS-dependent response:

Option A: Use a headless browser (Playwright, Selenium)
Option B: Look for an API endpoint that returns JSON directly
Option C: Check if the site has a mobile/API version with simpler responses

If you got an unexpected file type:

# Check what was actually returned
file output.html

# Adjust your approach based on actual content
case $(file -b --mime-type output.html) in
  "text/html")
    # Parse as HTML
    ;;
  "application/json")
    # Parse as JSON
    ;;
  "application/pdf")
    # Handle as PDF
    ;;
  *)
    echo "Unexpected file type: $(file -b --mime-type output.html)"
    ;;
esac

Quick Reference Script

#!/bin/bash
# fetch-with-validation.sh

URL="$1"
OUTPUT="${2:-output.html}"

# Fetch with browser headers
curl -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36" -o "$OUTPUT" "$URL"

# Get file info
SIZE=$(wc -c < "$OUTPUT")
TYPE=$(file -b --mime-type "$OUTPUT")

echo "Downloaded: $OUTPUT"
echo "Size: $SIZE bytes"
echo "Type: $TYPE"

# Warn about potential issues
if [ "$SIZE" -lt 1000 ]; then
    echo "WARNING: File is very small - may be a placeholder response"
fi

if [ "$TYPE" = "text/html" ]; then
    if grep -qi "javascript\|loading\|spinner" "$OUTPUT"; then
        echo "WARNING: Content may be JavaScript-dependent"
    fi
fi

Common Pitfalls

Not checking file type: Assuming HTML when you got JSON or an error page
Using default curl User-Agent: Many sites block or serve minimal content to bots
Ignoring file size: Very small files are often error pages or placeholders
Not following redirects: Some sites redirect based on User-Agent