Skills web-reader-pro

Advanced web content extraction skill for OpenClaw using multi-tier fallback strategy (Jina → Scrapling → WebFetch) with intelligent routing, caching, quality scoring, and domain learning. Use when: reading article content, extracting web page text, scraping dynamic JS-heavy pages, or fetching WeChat official account articles.

install
source · Clone the upstream repo
git clone https://github.com/openclaw/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/0xcjl/web-reader-pro" ~/.claude/skills/openclaw-skills-web-reader-pro && rm -rf "$T"
OpenClaw · Install into ~/.openclaw/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/0xcjl/web-reader-pro" ~/.openclaw/skills/openclaw-skills-web-reader-pro && rm -rf "$T"
manifest: skills/0xcjl/web-reader-pro/SKILL.md
source content

Web Reader Pro - OpenClaw Skill

Overview

Web Reader Pro is an advanced web content extraction skill for OpenClaw that uses a multi-tier fallback strategy with intelligent routing, caching, and quality assessment.

Features

1. Three-Tier Fallback Strategy

  • Tier 1: Jina Reader API - Fast, reliable, best for most websites
  • Tier 2: Scrapling + Playwright - Dynamic content rendering for JS-heavy sites
  • Tier 3: WebFetch Fallback - Basic extraction for simple pages

2. Jina Quota Monitoring

  • Tracks API call count with persistent counter
  • Warning alerts when approaching quota limits
  • Automatic fallback to lower-tier methods when quota exhausted

3. Smart Cache Layer

  • Short-term caching (configurable TTL, default 1 hour)
  • Cache key based on URL hash
  • Reduces redundant API calls

4. Extraction Quality Scoring

  • Scores based on: word count, title detection, content density
  • Minimum quality threshold (default: 200 words + valid title)
  • Auto-escalation to next tier if quality below threshold

5. Domain-Level Routing Learning

  • Learns optimal extraction tier per domain
  • Persists learned routes in local JSON database
  • Adapts based on historical success rates

6. Retry with Exponential Backoff

  • Configurable max retries per tier (default: 3)
  • Exponential backoff: 1s, 2s, 4s, 8s...
  • Respects rate limits and transient failures

Installation

# Install dependencies
pip install -r requirements.txt

# Install Scrapling (requires Node.js)
./scripts/install_scrapling.sh

# Or install Scrapling manually
npm install -g @scrapinghub/scrapling

Usage

Basic Usage

from scripts.web_reader_pro import WebReaderPro

reader = WebReaderPro()
result = reader.fetch("https://example.com")
print(result['title'])
print(result['content'])

Advanced Configuration

reader = WebReaderPro(
    jina_api_key="your-jina-key",      # Optional: set via env JINA_API_KEY
    cache_ttl=3600,                      # Cache TTL in seconds (default: 3600)
    quality_threshold=200,               # Min word count for quality (default: 200)
    max_retries=3,                       # Max retries per tier (default: 3)
    enable_learning=True,                # Enable domain learning (default: True)
    scrapling_path="/usr/local/bin/scrapling"  # Path to scrapling binary
)

Result Format

{
    "title": "Page Title",
    "content": "Extracted content in markdown...",
    "url": "https://example.com",
    "tier_used": "jina|scrapling|webfetch",
    "quality_score": 85,
    "cached": False,
    "domain_learned_tier": "jina",
    "extracted_at": "2024-01-01T00:00:00Z"
}

Environment Variables

VariableDescriptionDefault
JINA_API_KEY
Jina Reader API keyRequired for Tier 1
WEB_READER_CACHE_DIR
Cache directory path
~/.openclaw/cache/web-reader-pro/
WEB_READER_LEARNING_DB
Learning database path
~/.openclaw/data/web-reader-pro/routes.json
WEB_READER_JINA_QUOTA
Jina quota limit
100000

API Reference

WebReaderPro.fetch(url, force_refresh=False)

Fetch and extract content from a URL.

Parameters:

  • url
    (str): Target URL
  • force_refresh
    (bool): Bypass cache if True

Returns: Dict with title, content, metadata

WebReaderPro.fetch_with_tier(url, preferred_tier)

Fetch using a specific tier (bypassing automatic selection).

Parameters:

  • url
    (str): Target URL
  • preferred_tier
    (str): "jina", "scrapling", or "webfetch"

WebReaderPro.get_jina_status()

Get current Jina API quota usage.

Returns: Dict with count, limit, percentage, warnings

WebReaderPro.clear_cache(url=None)

Clear cache for specific URL or all URLs.

Parameters:

  • url
    (str, optional): Specific URL to clear, or None for all

WebReaderPro.get_domain_routes()

Get learned domain-to-tier mappings.

Returns: Dict of domain -> preferred tier

Tier Comparison

TierSpeedJS RenderingBest ForCost
JinaFastNoStatic pages, articlesAPI calls
ScraplingMediumYesSPAs, dynamic contentCPU
WebFetchFastestNoSimple pages, fallbacksFree

License

MIT