# claude-skill-registry: handler-http

Clone the full registry:

git clone https://github.com/majiayu000/claude-skill-registry

Or install just this skill:

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/handler-http" ~/.claude/skills/majiayu000-claude-skill-registry-handler-http && rm -rf "$T"
skills/data/handler-http/SKILL.md

<CONTEXT> Your responsibility is to fetch content from HTTP/HTTPS URLs with proper safety checks, timeout handling, and metadata extraction. You implement the handler interface for external URL sources.
You are part of the multi-source architecture (Phase 2 - SPEC-00030-03). </CONTEXT>
<CRITICAL_RULES> URL Validation:
- ONLY allow http:// and https:// protocols
- NEVER execute arbitrary protocols (file://, ftp://, javascript:, etc.)
- ALWAYS validate URL format before fetching
- TIMEOUT after 30 seconds (configurable)
Content Safety:
- Set reasonable size limits (10MB default, configurable)
- Handle HTTP redirects (max 5 redirects via curl -L)
- Validate content types
- Log all fetches for audit
Error Handling:
- Clear error messages for all HTTP status codes
- Retry with exponential backoff (3 attempts) for server errors (5xx)
- DO NOT retry for client errors (4xx)
- Cache 404s temporarily to avoid repeated failures
Security:
- Never execute fetched content
- Sanitize URLs before logging
- Respect robots.txt (future enhancement)
- Rate limiting per domain (future enhancement) </CRITICAL_RULES>
<WORKFLOW>
Step 1: Validate URL
IF reference starts with @codex/external/:
- Extract source name and path
- Look up source in configuration
- Construct URL from url_pattern
- FUTURE: Not implemented in Phase 2; return an error
ELSE:
- Use reference as direct URL
- Validate protocol (http:// or https:// only)
IF protocol validation fails:
- Error: "Invalid protocol: only http:// and https:// allowed"
- STOP
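
A minimal shell sketch of this protocol check; `validate_url` is a hypothetical helper name, not one of the shipped scripts:

```bash
# Hypothetical helper: accept only http:// and https:// URLs.
validate_url() {
  local url="$1"
  case "$url" in
    http://*|https://*)
      return 0 ;;
    *)
      echo "Invalid protocol: only http:// and https:// allowed" >&2
      return 1 ;;
  esac
}

validate_url "https://docs.example.com/guide.md"        # accepted
validate_url "file:///etc/passwd" || echo "rejected"    # security violation, rejected
```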
Step 2: Fetch Content
USE SCRIPT: ./scripts/fetch-url.sh
Arguments: {
  url: validated URL
  timeout: source_config.handler_config.timeout || 30
  max_size_mb: source_config.handler_config.max_size_mb || 10
}
OUTPUT to stdout: Content
OUTPUT to stderr: Metadata JSON
IF fetch fails:
- Check HTTP status code
- Return appropriate error message
- Log failure
- STOP
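
fetch-url.sh itself is not reproduced here; the sketch below shows one plausible shape for it, assuming plain curl. The flags (-L, --max-redirs, --max-time, --max-filesize, -D) are standard curl options, and the output contract matches Step 2: content on stdout, metadata on stderr.

```bash
#!/usr/bin/env bash
# Sketch only: one plausible shape for fetch-url.sh.
set -euo pipefail

url="$1"
timeout="${2:-30}"       # seconds (CRITICAL_RULES default)
max_size_mb="${3:-10}"   # megabytes (CRITICAL_RULES default)
max_bytes=$(( max_size_mb * 1024 * 1024 ))

headers=$(mktemp)
body=$(mktemp)
trap 'rm -f "$headers" "$body"' EXIT

# -L follows redirects, capped at 5 per the safety rules.
# --max-filesize aborts oversized downloads (curl exit 63); note it is
# only enforced up front when the server sends Content-Length.
# -D captures response headers for the metadata step.
http_code=$(curl -sS -L --max-redirs 5 \
  --max-time "$timeout" \
  --max-filesize "$max_bytes" \
  -D "$headers" -o "$body" \
  -w '%{http_code}' \
  "$url")

cat "$body"                                    # content -> stdout
printf '{"http_code": %s}\n' "$http_code" >&2  # metadata JSON -> stderr (abridged)
```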
Step 3: Parse Metadata
Extract from HTTP headers (captured by fetch-url.sh):
- content_type: MIME type
- content_length: Size in bytes
- last_modified: Last-Modified header
- etag: ETag header
- final_url: URL after redirects
IF content_type is text/markdown or contains YAML frontmatter:
  USE SCRIPT: ../document-fetcher/scripts/parse-frontmatter.sh
  Arguments: {content}
  OUTPUT: Frontmatter JSON
ELSE:
  frontmatter = {}
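
Assuming the headers were captured to a file with curl -D as sketched above, field extraction could look like this; `get_header` is a hypothetical helper and the awk parsing is illustrative only:

```bash
# Sketch: pull metadata fields from the header dump written by `curl -D`.
headers="headers.txt"   # file captured in the fetch step

get_header() {
  # Last occurrence wins: redirect chains write one header block per hop.
  awk -F': ' -v name="$1" \
    'tolower($1) == tolower(name) { val = $2 } END { if (val != "") print val }' \
    "$headers" | tr -d '\r'
}

content_type=$(get_header "Content-Type")
content_length=$(get_header "Content-Length")
last_modified=$(get_header "Last-Modified")
etag=$(get_header "ETag")
# final_url is easiest captured at fetch time via curl -w '%{url_effective}'.
```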
Step 4: Return Result
Return structured response:
{
  "success": true,
  "content": "<fetched content>",
  "metadata": {
    "content_type": "text/html",
    "content_length": 12543,
    "last_modified": "2025-01-15T10:00:00Z",
    "etag": "\"abc123\"",
    "final_url": "https://...",
    "frontmatter": {...}
  }
}
</WORKFLOW>
<COMPLETION_CRITERIA> Operation is complete when:
- ✅ URL validated and fetched successfully
- ✅ Content returned
- ✅ Metadata extracted and structured
- ✅ All errors logged
- ✅ No security violations </COMPLETION_CRITERIA>
<OUTPUTS>
Success Response:
{ "success": true, "content": "document content...", "metadata": { "content_type": "text/markdown", "content_length": 8192, "last_modified": "2025-01-15T10:00:00Z", "etag": "\"def456\"", "final_url": "https://docs.example.com/guide.md", "frontmatter": { "title": "API Guide", "codex_sync_include": ["*"] } } }
Error Response:
{
  "success": false,
  "error": "HTTP 404: Not found",
  "url": "https://docs.example.com/missing.md",
  "http_code": 404
}
</OUTPUTS>
<ERROR_HANDLING>
<INVALID_PROTOCOL> If URL uses non-HTTP(S) protocol:
- Error: "Invalid protocol: only http:// and https:// allowed"
- Example: file:///etc/passwd → REJECT
- Example: javascript:alert(1) → REJECT
- Log security violation
- STOP </INVALID_PROTOCOL>
<HTTP_ERROR_404> If HTTP status 404:
- Error: "HTTP 404: Not found"
- Suggest: Check URL spelling, verify document exists
- Cache 404 status for 1 hour (prevent repeated failures)
- STOP </HTTP_ERROR_404>
<HTTP_ERROR_403> If HTTP status 403:
- Error: "HTTP 403: Access denied"
- Suggest: Check authentication, verify permissions
- STOP </HTTP_ERROR_403>
<HTTP_ERROR_5XX> If HTTP status 500-599:
- Retry with exponential backoff (3 attempts)
- Delays: 1s, 2s, 4s
- If all retries fail:
- Error: "HTTP {code}: Server error (tried 3 times)"
- Log failure with timestamp
- STOP </HTTP_ERROR_5XX>
<SIZE_LIMIT> If content exceeds max_size_mb:
- Error: "Content too large: {size}MB exceeds limit of {max_size_mb}MB"
- Suggest: Increase max_size_mb in configuration
- STOP </SIZE_LIMIT>
</ERROR_HANDLING>
<DOCUMENTATION> Upon completion, output:

🎯 STARTING: handler-http
URL: https://docs.example.com/guide.md
Max size: 10MB | Timeout: 30s
───────────────────────────────────────
✓ URL validated
✓ Content fetched (8.2 KB)
✓ Metadata extracted
✓ Frontmatter parsed

✅ COMPLETED: handler-http
Source: External URL
Content-Type: text/markdown
Size: 8.2 KB
Fetch time: 1.2s
───────────────────────────────────────
Ready for caching and permission check
</DOCUMENTATION>

<NOTES>
## Retry Logic
Server errors (5xx) are retried with exponential backoff, i.e. the initial attempt plus up to three retries:
- Attempt 1: Immediate
- Attempt 2: After 1s delay
- Attempt 3: After 2s delay
- Attempt 4: After 4s delay
Client errors (4xx) are NOT retried (they won't succeed).
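
A shell sketch of that schedule; `fetch_once` is a hypothetical wrapper around the curl call above that prints the HTTP status code and succeeds only on 2xx:

```bash
# Retry 5xx responses with exponential backoff: 1s, 2s, 4s delays.
fetch_with_retry() {
  local url="$1" delay=1 attempt http_code
  for attempt in 1 2 3 4; do
    if http_code=$(fetch_once "$url"); then
      return 0                       # success
    fi
    case "$http_code" in
      4??) return 1 ;;               # client error: retrying won't help
    esac
    if [ "$attempt" -lt 4 ]; then
      sleep "$delay"
      delay=$(( delay * 2 ))         # 1s -> 2s -> 4s
    fi
  done
  echo "HTTP ${http_code}: Server error (tried 3 times)" >&2
  return 1
}
```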
## Caching 404s
To prevent repeated fetches of non-existent URLs:
- Cache 404 responses for 1 hour
- Store in cache index with special marker
- Future fetches within 1 hour return cached 404
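
One plausible file-based realization of this negative cache; the cache directory path is an assumption, not the registry's actual layout:

```bash
# Sketch: marker files keyed by URL hash, considered fresh for one hour.
CACHE_DIR="${HOME}/.cache/handler-http/404"   # assumed location
mkdir -p "$CACHE_DIR"

url_key() { printf '%s' "$1" | sha256sum | cut -d' ' -f1; }  # shasum -a 256 on macOS

is_cached_404() {
  local marker="$CACHE_DIR/$(url_key "$1")"
  # `find -mmin -60` prints the marker only if it is younger than one hour.
  [ -f "$marker" ] && [ -n "$(find "$marker" -mmin -60 2>/dev/null)" ]
}

record_404() { touch "$CACHE_DIR/$(url_key "$1")"; }

# Usage: skip the fetch entirely when a fresh 404 marker exists.
is_cached_404 "$url" && { echo "HTTP 404: Not found (cached)" >&2; exit 1; }
```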
## Content Type Handling
Supported content types:
- text/markdown → Parse frontmatter
- text/html → Extract metadata from HTML meta tags (future)
- text/plain → Plain text, no frontmatter
- application/json → JSON documents (future)
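
The dispatch itself can be a simple case statement; in this sketch, `parse_frontmatter` stands in for the ../document-fetcher/scripts/parse-frontmatter.sh call:

```bash
# Route the fetched body based on the Content-Type header.
case "${content_type%%;*}" in                     # strip "; charset=..." suffix
  text/markdown)
    frontmatter=$(parse_frontmatter "$body") ;;   # delegate frontmatter parsing
  text/plain)
    frontmatter='{}' ;;                           # plain text carries no frontmatter
  text/html|application/json)
    frontmatter='{}' ;;                           # richer handling is a future enhancement
  *)
    frontmatter='{}' ;;
esac
```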
## Future Enhancements
Phase 3 and beyond:
- robots.txt respect
- Rate limiting per domain
- Conditional GET (If-Modified-Since, If-None-Match)
- Compressed response handling (gzip, brotli)
- HTML→Markdown conversion
- PDF fetching and parsing </NOTES>