Claude-skill-registry e2e-testing-agent
Build autonomous end-to-end full-stack testing agents using Claude's Computer Use API, LangGraph orchestration, and hybrid Playwright automation. Use this skill when building testing infrastructure, test automation, CI/CD test integration, or self-healing test systems.
```bash
# Clone the full registry
git clone https://github.com/majiayu000/claude-skill-registry

# Or copy just this skill into ~/.claude/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/e2e-testing-agent" ~/.claude/skills/majiayu000-claude-skill-registry-e2e-testing-agent && rm -rf "$T"
```
Autonomous E2E Testing Agent Skill
This skill provides comprehensive guidance for building fully autonomous end-to-end testing systems using Claude's capabilities.
Quick Start
```bash
# Install dependencies
pip install anthropic langgraph playwright httpx pydantic

# Install Playwright browsers
playwright install chromium

# Set up environment
export ANTHROPIC_API_KEY=sk-ant-...

# Run the agent
e2e-agent --codebase /path/to/app --app-url http://localhost:3000
```
Example Invocations
```python
# Basic usage
from e2e_testing_agent import TestingOrchestrator

orchestrator = TestingOrchestrator(
    codebase_path="/path/to/app",
    app_url="http://localhost:3000"
)
results = await orchestrator.run()

# Single test execution
result = await orchestrator.run_single_test({
    "id": "login-test",
    "name": "User Login Flow",
    "type": "ui",
    "steps": [
        {"action": "goto", "target": "/login"},
        {"action": "fill", "target": "#email", "value": "test@example.com"},
        {"action": "click", "target": "button[type=submit]"}
    ]
})
```
Requirements
Dependencies
```toml
anthropic = ">=0.40.0"
langgraph = ">=0.2.0"
playwright = ">=1.48.0"
httpx = ">=0.27.0"
pydantic = ">=2.9.0"
```
API Compatibility
| Component | Version |
|---|---|
| Computer Use API | `computer-use-2025-01-24` (beta flag) |
| Computer Tool | `computer_20250124` |
| Bash Tool | `bash_20250124` |
| Text Editor | `text_editor_20250728` |
Supported Models
- Claude Sonnet 4.5 - Primary testing (best cost/capability)
- Claude Haiku 4.5 - Quick verifications
- Claude Opus 4.5 - Complex debugging only
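This tiering can be made explicit in the orchestrator. A minimal sketch, where the task categories and helper name are illustrative choices, not part of any SDK:

```python
# Hedged sketch: route each task type to the cheapest capable model.
# The category names below are illustrative, not a fixed taxonomy.
MODEL_BY_TASK = {
    "verify": "claude-haiku-4-5",    # quick screenshot/assertion checks
    "execute": "claude-sonnet-4-5",  # primary test planning and execution
    "debug": "claude-opus-4-5",      # escalate only for complex failures
}

def select_model(task_type: str) -> str:
    """Return the model ID for a task, defaulting to Sonnet."""
    return MODEL_BY_TASK.get(task_type, "claude-sonnet-4-5")
```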
When to Use This Skill
Use this skill when the user wants to:
- Build automated testing systems with AI capabilities
- Implement Computer Use API for browser automation
- Create self-healing test infrastructure
- Integrate AI-powered testing into CI/CD pipelines
- Build multi-agent testing orchestration with LangGraph
Core Concepts
1. Computer Use API
Claude's Computer Use API enables visual interaction with desktop/browser environments:
```python
import anthropic

client = anthropic.Anthropic()

# Basic Computer Use call
response = client.beta.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=4096,
    tools=[
        {
            "type": "computer_20250124",
            "name": "computer",
            "display_width_px": 1920,
            "display_height_px": 1080,
            "display_number": 1,
        },
        {"type": "bash_20250124", "name": "bash"},
        {"type": "text_editor_20250728", "name": "str_replace_based_edit_tool"}
    ],
    messages=[{"role": "user", "content": "Navigate to example.com and click Login"}],
    betas=["computer-use-2025-01-24"]
)
```
Available Actions:
- `screenshot` - Capture current screen state
- `mouse_move` - Move cursor to coordinates
- `left_click`, `right_click`, `double_click`, `triple_click` - Mouse click variants
- `type` - Enter text
- `key` - Press keyboard shortcuts
- `scroll` - Scroll in a direction
- `hold_key` - Hold a key down
- `wait` - Pause for element loading
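The agent loop in the next section calls an `execute_tool` helper that is otherwise left undefined. Here is a minimal sketch of that dispatcher, assuming `pyautogui` as the desktop-automation backend (an illustrative choice; any library that can move the mouse and capture the screen works):

```python
import base64
import io

import pyautogui  # assumed backend for desktop input; swap in your own

async def execute_tool(name: str, tool_input: dict):
    """Minimal dispatcher for the 'computer' tool actions listed above."""
    if name != "computer":
        return f"Unsupported tool: {name}"
    action = tool_input["action"]
    if action == "screenshot":
        buf = io.BytesIO()
        pyautogui.screenshot().save(buf, format="PNG")
        # Return an image block so Claude can see the current screen
        return [{
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64.b64encode(buf.getvalue()).decode(),
            },
        }]
    if action == "left_click":
        x, y = tool_input["coordinate"]
        pyautogui.click(x, y)
    elif action == "type":
        pyautogui.write(tool_input["text"])
    elif action == "key":
        pyautogui.hotkey(*tool_input["text"].split("+"))
    # Remaining actions (scroll, hold_key, wait, ...) follow the same pattern
    return f"Executed {action}"
```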
2. Agent Loop Pattern
Always implement Computer Use as an agent loop:
```python
async def run_computer_use_task(task: str, max_iterations: int = 30):
    messages = [{"role": "user", "content": task}]

    for i in range(max_iterations):
        response = client.beta.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=4096,
            tools=COMPUTER_USE_TOOLS,
            messages=messages,
            betas=["computer-use-2025-01-24"]
        )

        # Check if done
        if response.stop_reason == "end_turn":
            return extract_result(response)

        # Process tool calls
        tool_results = []
        for block in response.content:
            if block.type == "tool_use":
                result = await execute_tool(block.name, block.input)
                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result
                })

        # Continue conversation
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

    raise MaxIterationsExceeded()
```
3. Hybrid Testing Strategy
Combine Playwright (fast, reliable) with Computer Use (visual verification):
```python
import base64
import json

import anthropic
from playwright.async_api import async_playwright

class HybridTester:
    async def start(self):
        # Use the async Playwright API so the awaits below actually work
        self.playwright = await async_playwright().start()
        self.browser = await self.playwright.chromium.launch()
        self.page = await self.browser.new_page()
        self.claude_client = anthropic.Anthropic()

    async def test_login_flow(self):
        # Use Playwright for fast actions
        await self.page.goto("https://app.example.com/login")
        await self.page.fill("#email", "test@example.com")
        await self.page.fill("#password", "password123")
        await self.page.click("button[type=submit]")

        # Use Claude for visual verification
        screenshot = await self.page.screenshot()
        verification = await self.verify_with_claude(
            screenshot,
            "Verify the user is logged in and sees the dashboard"
        )
        return verification

    async def verify_with_claude(self, screenshot: bytes, assertion: str):
        response = self.claude_client.messages.create(
            model="claude-haiku-4-5",  # Fast model for verification
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(screenshot).decode()
                        }
                    },
                    {
                        "type": "text",
                        "text": f"Verify: {assertion}\nRespond with JSON: {{\"passed\": boolean, \"reason\": string}}"
                    }
                ]
            }]
        )
        return json.loads(response.content[0].text)
```
4. LangGraph Orchestration
Structure the testing system as a state machine:
```python
from langgraph.graph import StateGraph, END
from typing import TypedDict, Annotated
from langgraph.graph.message import add_messages

class TestState(TypedDict):
    messages: Annotated[list, add_messages]
    codebase_context: str
    test_plan: list[dict]
    current_test: int
    results: list[dict]
    failures: list[dict]
    next_step: str

def create_testing_graph():
    graph = StateGraph(TestState)

    # Add nodes
    graph.add_node("analyze_code", analyze_codebase)
    graph.add_node("plan_tests", create_test_plan)
    graph.add_node("execute_ui_test", run_ui_test)
    graph.add_node("execute_api_test", run_api_test)
    graph.add_node("self_heal", heal_failed_test)
    graph.add_node("report", generate_report)

    # Define edges
    graph.add_edge("analyze_code", "plan_tests")
    graph.add_conditional_edges(
        "plan_tests",
        route_to_test_type,
        {
            "ui": "execute_ui_test",
            "api": "execute_api_test",
            "done": "report"
        }
    )
    graph.add_conditional_edges(
        "execute_ui_test",
        check_test_result,
        {
            "pass": "plan_tests",  # Next test
            "fail": "self_heal",
            "done": "report"
        }
    )
    graph.add_edge("self_heal", "execute_ui_test")  # Retry after healing
    graph.add_edge("report", END)

    graph.set_entry_point("analyze_code")
    return graph.compile()
```
5. Self-Healing Tests
Implement automatic test repair:
```python
class SelfHealingAgent:
    def __init__(self):
        self.client = anthropic.Anthropic()

    async def heal_test(self, test_spec: dict, failure: dict, screenshot: bytes):
        prompt = f"""You are a test self-healing agent. A test has failed.

TEST SPECIFICATION:
{json.dumps(test_spec, indent=2)}

FAILURE DETAILS:
{json.dumps(failure, indent=2)}

CURRENT SCREENSHOT: [Attached]

Analyze the failure and determine the fix:
1. If selector changed - provide new selector
2. If timing issue - suggest wait strategy
3. If UI changed intentionally - update assertion
4. If real bug - mark as actual failure

Respond with JSON:
{{
  "diagnosis": "selector_changed|timing_issue|ui_change|real_bug",
  "fix": {{
    "type": "update_selector|add_wait|update_assertion|none",
    "old_value": "...",
    "new_value": "..."
  }},
  "confidence": 0.0-1.0,
  "explanation": "..."
}}
"""
        response = self.client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": base64.b64encode(screenshot).decode()}},
                    {"type": "text", "text": prompt}
                ]
            }]
        )
        return json.loads(response.content[0].text)
```
6. Cost Optimization
Monitor and control API costs:
```python
class CostTracker:
    # Pricing per million tokens (December 2025)
    PRICING = {
        "claude-opus-4-5": {"input": 5.00, "output": 25.00},
        "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
        "claude-haiku-4-5": {"input": 0.25, "output": 1.25},
    }

    # Screenshot token estimates by resolution
    SCREENSHOT_TOKENS = {
        (1024, 768): 1500,
        (1920, 1080): 2500,
        (2560, 1440): 4000,
    }

    def __init__(self, budget_limit: float = 10.0):
        self.budget_limit = budget_limit
        self.total_cost = 0.0
        self.usage_log = []

    def track_usage(self, model: str, input_tokens: int, output_tokens: int):
        pricing = self.PRICING[model]
        cost = (input_tokens * pricing["input"] + output_tokens * pricing["output"]) / 1_000_000

        self.total_cost += cost
        self.usage_log.append({
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost": cost,
            "cumulative": self.total_cost
        })

        if self.total_cost > self.budget_limit:
            raise BudgetExceeded(f"Budget of ${self.budget_limit} exceeded")

        return cost
```
7. Test Specification Format
Use structured test specifications:
```python
TEST_SPEC_SCHEMA = {
    "id": "string",
    "name": "string",
    "type": "ui|api|db",
    "priority": "critical|high|medium|low",
    "preconditions": ["list of setup steps"],
    "steps": [
        {
            "action": "goto|click|fill|assert|wait|screenshot",
            "target": "selector or url",
            "value": "optional value",
            "timeout": 5000
        }
    ],
    "assertions": [
        {
            "type": "element_visible|text_contains|url_matches|visual_match",
            "target": "...",
            "expected": "..."
        }
    ],
    "cleanup": ["list of teardown steps"],
    "tags": ["regression", "smoke", "critical-path"]
}
```
Best Practices
DO:
- ✅ Always set max_iterations to prevent runaway costs
- ✅ Use lower resolution screenshots when possible (1024x768)
- ✅ Implement exponential backoff for retries
- ✅ Cache codebase analysis between test runs (see the sketch after this list)
- ✅ Use Haiku for quick verifications, Sonnet for complex reasoning
- ✅ Log all Claude API calls with full context for debugging
- ✅ Run Computer Use in Docker sandboxes only
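One way to implement the caching point above is to key the analysis on a fingerprint of the codebase. A sketch; the cache location and helper names are illustrative:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".e2e-agent-cache")  # illustrative location

def codebase_fingerprint(root: str) -> str:
    """Hash source file paths and mtimes so the cache invalidates on change."""
    h = hashlib.sha256()
    for path in sorted(Path(root).rglob("*.py")):
        h.update(str(path).encode())
        h.update(str(path.stat().st_mtime_ns).encode())
    return h.hexdigest()

def cached_analysis(root: str, analyze) -> dict:
    """Reuse a prior analysis when the codebase is unchanged, else recompute."""
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file = CACHE_DIR / f"{codebase_fingerprint(root)}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = analyze(root)  # e.g. a Claude-backed analyze_codebase() call
    cache_file.write_text(json.dumps(result))
    return result
```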
DON'T:
- ❌ Never run Computer Use against production systems
- ❌ Don't use Opus unless debugging complex failures
- ❌ Avoid taking screenshots after every action (batch them)
- ❌ Don't store API keys in code - use environment variables
- ❌ Never assume a single screenshot is enough - verify state
Prompting Guidelines
For Code Analysis:
```
Analyze this codebase to identify testable surfaces:

1. List all user-facing pages/routes
2. Identify API endpoints
3. Find database models/tables
4. Note authentication flows
5. Map critical user journeys

Focus on areas with: high user traffic, payment/sensitive data, recent changes.
```
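A hedged sketch of feeding this prompt into the `analyze_code` node used by the LangGraph graph above; the file sampling, truncation limits, and the `codebase_path` state key are illustrative assumptions, not part of the skill:

```python
from pathlib import Path

import anthropic

ANALYSIS_PROMPT = "Analyze this codebase to identify testable surfaces: ..."  # the prompt above

def analyze_codebase(state: dict) -> dict:
    """LangGraph node: summarize testable surfaces from source files."""
    client = anthropic.Anthropic()
    # Illustrative: send a bounded sample of source files as context
    sources = []
    for path in sorted(Path(state["codebase_path"]).rglob("*.py"))[:50]:
        sources.append(f"# {path}\n{path.read_text()[:2000]}")
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"{ANALYSIS_PROMPT}\n\nCODEBASE:\n" + "\n\n".join(sources),
        }],
    )
    return {**state, "codebase_context": response.content[0].text}
```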
For Test Generation:
```
Generate comprehensive E2E tests for this feature:

- Cover happy path and edge cases
- Include error handling scenarios
- Add visual regression checkpoints
- Consider mobile/responsive views
- Include accessibility checks

Format as structured JSON following the TEST_SPEC_SCHEMA.
```
For Computer Use Tasks:
```
Execute this test autonomously:

{test_spec}

CRITICAL INSTRUCTIONS:
1. After each action, take a screenshot and verify the result
2. If an element is not found, wait up to 10 seconds before failing
3. If something unexpected happens, document it and continue if possible
4. Take a final screenshot showing the end state
5. Report exactly what you observed vs what was expected
```
For Self-Healing:
```
Analyze this test failure and determine the root cause:

- Was this a selector change? (element moved/renamed)
- Was this a timing issue? (element not loaded in time)
- Was this an intentional UI change? (expected behavior changed)
- Was this an actual bug? (unexpected behavior)

Provide a specific fix with high confidence, or flag for human review.
```
Integration Examples
GitHub Actions Integration:
```yaml
name: AI E2E Tests

on:
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Start test server
        run: docker-compose up -d

      - name: Run AI E2E Tests
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python -m e2e_testing_agent \
            --codebase . \
            --app-url http://localhost:3000 \
            --output results/

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: test-results
          path: results/
```
n8n Webhook Integration:
```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class TestRequest(BaseModel):
    repo_url: str
    branch: str
    preview_url: str
    pr_number: int

@app.post("/webhook/run-tests")
async def run_tests(request: TestRequest):
    orchestrator = TestingOrchestrator(
        codebase_url=request.repo_url,
        branch=request.branch,
        app_url=request.preview_url
    )

    results = await orchestrator.run_changed_file_tests()

    # Post results back to GitHub PR
    await post_github_check(
        pr_number=request.pr_number,
        results=results
    )

    return {"status": "completed", "summary": results.summary}
```
Troubleshooting
Computer Use Not Working
- Verify beta header: `betas=["computer-use-2025-01-24"]`
- Check tool type version: `computer_20250124`
- Ensure sandbox environment is running
- Verify screenshot is being captured correctly
High Token Costs
- Reduce screenshot resolution
- Use Haiku for simple verifications
- Implement prompt caching for system prompts (see the sketch after this list)
- Batch related test assertions
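For the prompt-caching point, Anthropic's Messages API accepts `cache_control` markers on system blocks. A minimal sketch, where `LONG_TESTING_SYSTEM_PROMPT` is a placeholder for your own large, reusable instructions:

```python
import anthropic

client = anthropic.Anthropic()

# Mark the large, stable system prompt as cacheable so repeated test runs
# pay the full input price only on the first call in each cache window.
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1000,
    system=[{
        "type": "text",
        "text": LONG_TESTING_SYSTEM_PROMPT,  # placeholder: your reusable instructions
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Verify the dashboard loaded."}],
)
```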
Flaky Tests
- Add explicit waits before assertions
- Use visual verification instead of selectors when possible
- Implement retry logic with exponential backoff
- Consider test isolation (fresh browser context)
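A sketch of the isolation point: give each test a fresh browser context so cookies and storage never leak between runs (the wrapper name is illustrative):

```python
from playwright.async_api import async_playwright

async def run_isolated(test_fn):
    """Run a test callback against a page in a fresh, disposable context."""
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        context = await browser.new_context()  # isolated cookies/storage
        page = await context.new_page()
        try:
            return await test_fn(page)
        finally:
            await context.close()
            await browser.close()
```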
Self-Healing Not Working
- Ensure screenshot is taken at failure point
- Provide full error context to Claude
- Set a confidence threshold for auto-fixes (see the sketch after this list)
- Review and validate fixes before committing
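A sketch of gating auto-fixes on the `confidence` field returned by the self-healing prompt in section 5; the threshold and patching logic here are illustrative:

```python
CONFIDENCE_THRESHOLD = 0.8  # illustrative cutoff; tune to your risk tolerance

def apply_or_escalate(test_spec: dict, healing: dict) -> dict:
    """Apply a proposed fix only when Claude's confidence clears the bar."""
    if healing["diagnosis"] == "real_bug":
        return {"status": "failed", "reason": healing["explanation"]}
    if healing["confidence"] >= CONFIDENCE_THRESHOLD:
        fix = healing["fix"]
        if fix["type"] == "update_selector":
            # Patch any step that still uses the stale selector
            for step in test_spec["steps"]:
                if step.get("target") == fix["old_value"]:
                    step["target"] = fix["new_value"]
        return {"status": "healed", "spec": test_spec}
    return {"status": "needs_review", "reason": "low-confidence fix"}
```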
MCP (Model Context Protocol) Integration
Playwright MCP Server
The recommended way to use Playwright with Claude is through MCP. This provides a standardized interface for browser automation.
Official Package: `@playwright/mcp` (Microsoft)
- Source: https://www.npmjs.com/package/@playwright/mcp
- GitHub: https://github.com/microsoft/playwright-mcp
Installation
```bash
# Run with npx (recommended)
npx @playwright/mcp@latest

# Or install globally
npm install -g @playwright/mcp
```
Claude Code Configuration
Add to your `~/.claude/mcp_servers.json`:
{ "mcpServers": { "playwright": { "command": "npx", "args": ["-y", "@playwright/mcp@latest"] } } }
Key Features
- Fast and lightweight - Uses Playwright's accessibility tree, not pixel-based input
- LLM-friendly - No vision models needed, operates purely on structured data
- Deterministic - Avoids ambiguity common with screenshot-based approaches
Available MCP Tools
| Tool | Description |
|---|---|
| `browser_navigate` | Navigate to a URL |
| `browser_snapshot` | Get accessibility snapshot |
| `browser_click` | Click an element |
| `browser_type` | Type text into element |
| `browser_select_option` | Select from dropdown |
| `browser_hover` | Hover over element |
| `browser_drag` | Drag element to target |
| `browser_press_key` | Press keyboard key |
| `browser_take_screenshot` | Take a screenshot |
Using MCP in Code
```python
from src.mcp import PlaywrightMCPClient

async def run_test_with_mcp():
    async with PlaywrightMCPClient() as client:
        # Navigate
        await client.navigate("https://example.com/login")

        # Fill form
        await client.fill("#email", "test@example.com")
        await client.fill("#password", "password123")

        # Click submit
        await client.click("button[type=submit]")

        # Take screenshot for verification
        result = await client.screenshot()
        return result
```
MCP vs Direct Playwright
| Aspect | MCP | Direct Playwright |
|---|---|---|
| Setup | Requires npm server | Python only |
| Standardization | Protocol-based | Library-specific |
| Claude Integration | Native support | Requires wrapper |
| Performance | Slightly slower (IPC) | Faster (in-process) |
| Use Case | Claude Code integration | Standalone scripts |
Recommendation: Use MCP for Claude Code integrations, direct Playwright for performance-critical testing.
Other MCP Servers
Filesystem MCP Server
Package: `@modelcontextprotocol/server-filesystem`
- Source: https://www.npmjs.com/package/@modelcontextprotocol/server-filesystem
- Provides secure file operations with configurable access controls
{ "mcpServers": { "filesystem": { "command": "npx", "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/allowed/dir"] } } }
GitHub MCP Server
Package: `@modelcontextprotocol/server-github`
- Source: https://www.npmjs.com/package/@modelcontextprotocol/server-github
- Note: Development moved to https://github.com/github/github-mcp-server
{ "mcpServers": { "github": { "command": "npx", "args": ["-y", "@modelcontextprotocol/server-github"], "env": { "GITHUB_PERSONAL_ACCESS_TOKEN": "<YOUR_TOKEN>" } } } }
Using LangGraph with MCP (langchain-mcp-adapters)
The official way to use MCP with LangGraph is via `langchain-mcp-adapters`:
```python
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent
from langchain_anthropic import ChatAnthropic

async with MultiServerMCPClient({
    "playwright": {
        "transport": "stdio",
        "command": "npx",
        "args": ["-y", "@playwright/mcp@latest"],
    }
}) as client:
    tools = await client.get_tools()
    llm = ChatAnthropic(model="claude-sonnet-4-5")
    agent = create_react_agent(llm, tools)

    result = await agent.ainvoke({
        "messages": [("user", "Navigate to example.com")]
    })
```
Docker Sandbox Setup
CRITICAL: Always run Computer Use in an isolated Docker container.
Dockerfile for Computer Use Sandbox
```dockerfile
FROM python:3.11-slim

# Install system dependencies
RUN apt-get update && apt-get install -y \
    xvfb \
    x11vnc \
    fluxbox \
    wget \
    gnupg \
    && rm -rf /var/lib/apt/lists/*

# Install Chrome
RUN wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | apt-key add - \
    && echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" >> /etc/apt/sources.list.d/google.list \
    && apt-get update \
    && apt-get install -y google-chrome-stable \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt
RUN playwright install chromium

# Set up virtual display
ENV DISPLAY=:99
ENV RESOLUTION=1920x1080x24

# Start script
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```
entrypoint.sh
```bash
#!/bin/bash

# Start virtual display
Xvfb :99 -screen 0 $RESOLUTION &
sleep 1

# Start window manager
fluxbox &
sleep 1

# Start VNC server (optional, for debugging)
x11vnc -display :99 -forever -nopw -quiet &

# Run the testing agent
exec python -m e2e_testing_agent "$@"
```
Docker Compose
```yaml
version: '3.8'

services:
  e2e-agent:
    build: .
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - DISPLAY=:99
    volumes:
      - ./test-results:/app/test-results
      - ./codebase:/app/codebase:ro
    ports:
      - "5900:5900"  # VNC for debugging
    security_opt:
      - seccomp:unconfined
    shm_size: 2gb
```
Sandbox Manager Implementation
```python
import asyncio
import os
from pathlib import Path

import docker

class SandboxManager:
    """Manages Docker containers for safe Computer Use execution."""

    def __init__(self, image: str = "e2e-agent:latest"):
        self.client = docker.from_env()
        self.image = image
        self.container = None

    async def start(self, codebase_path: str) -> str:
        """Start sandbox container, return container ID."""
        self.container = self.client.containers.run(
            self.image,
            detach=True,
            environment={
                "ANTHROPIC_API_KEY": os.environ["ANTHROPIC_API_KEY"],
                "DISPLAY": ":99",
            },
            volumes={
                str(Path(codebase_path).absolute()): {
                    "bind": "/app/codebase",
                    "mode": "ro"
                }
            },
            shm_size="2g",
            security_opt=["seccomp:unconfined"],
        )
        # Wait for display to be ready
        await asyncio.sleep(2)
        return self.container.id

    async def execute(self, command: str) -> str:
        """Execute command in sandbox."""
        if not self.container:
            raise RuntimeError("Sandbox not started")
        exit_code, output = self.container.exec_run(command)
        return output.decode()

    async def screenshot(self) -> bytes:
        """Capture screenshot from sandbox display."""
        exit_code, output = self.container.exec_run(
            "import -window root -display :99 png:-"
        )
        return output

    async def stop(self):
        """Stop and remove sandbox container."""
        if self.container:
            self.container.stop()
            self.container.remove()
            self.container = None
```
Error Handling Patterns
Retry with Exponential Backoff
```python
import asyncio
from functools import wraps

def retry_with_backoff(max_retries: int = 3, base_delay: float = 1.0):
    """Decorator for retrying failed operations."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_retries):
                try:
                    return await func(*args, **kwargs)
                except (TimeoutError, ConnectionError) as e:
                    last_exception = e
                    delay = base_delay * (2 ** attempt)
                    await asyncio.sleep(delay)
                except Exception:
                    # Don't retry on non-transient errors
                    raise
            raise last_exception
        return wrapper
    return decorator

# Usage
@retry_with_backoff(max_retries=3)
async def click_element(page, selector: str):
    await page.click(selector, timeout=5000)
```
Graceful Degradation
```python
class TestExecutor:
    """Execute tests with fallback strategies."""

    async def execute_with_fallback(self, test_spec: dict) -> TestResult:
        """Try Computer Use first, fall back to Playwright."""
        # Try Computer Use (visual, but slower)
        try:
            return await self.execute_with_computer_use(test_spec)
        except ComputerUseError as e:
            logger.warning(f"Computer Use failed: {e}, falling back to Playwright")

        # Fall back to Playwright (faster, but less visual)
        try:
            return await self.execute_with_playwright(test_spec)
        except PlaywrightError as e:
            logger.error(f"Playwright also failed: {e}")
            return TestResult(
                status=TestStatus.FAILED,
                error_message=f"Both execution methods failed: {e}"
            )
```
Error Classification
```python
from enum import Enum

class ErrorType(Enum):
    TRANSIENT = "transient"        # Retry
    SELECTOR = "selector"          # Self-heal
    TIMEOUT = "timeout"            # Increase wait
    ASSERTION = "assertion"        # Check logic
    INFRASTRUCTURE = "infra"       # Alert ops
    UNKNOWN = "unknown"            # Manual review

def classify_error(error: Exception, screenshot: bytes = None) -> ErrorType:
    """Classify error for appropriate handling."""
    error_str = str(error).lower()

    if "timeout" in error_str:
        return ErrorType.TIMEOUT
    elif "selector" in error_str or "element" in error_str:
        return ErrorType.SELECTOR
    elif "connection" in error_str or "network" in error_str:
        return ErrorType.TRANSIENT
    elif "assert" in error_str:
        return ErrorType.ASSERTION
    elif "docker" in error_str or "container" in error_str:
        return ErrorType.INFRASTRUCTURE

    return ErrorType.UNKNOWN

def handle_error(error_type: ErrorType, test_spec: dict, error: Exception):
    """Route error to appropriate handler."""
    handlers = {
        ErrorType.TRANSIENT: retry_test,
        ErrorType.SELECTOR: queue_for_healing,
        ErrorType.TIMEOUT: increase_timeout_and_retry,
        ErrorType.ASSERTION: mark_as_failed,
        ErrorType.INFRASTRUCTURE: alert_and_abort,
        ErrorType.UNKNOWN: queue_for_review,
    }
    return handlers[error_type](test_spec, error)
```
Screenshot on Failure
```python
async def execute_test_with_evidence(test_spec: dict) -> TestResult:
    """Always capture screenshot on failure for debugging."""
    screenshots = []
    try:
        for step in test_spec["steps"]:
            # Take before screenshot
            screenshots.append(await capture_screenshot())

            # Execute step
            await execute_step(step)

            # Take after screenshot
            screenshots.append(await capture_screenshot())

        return TestResult(status=TestStatus.PASSED, screenshots=screenshots)

    except Exception as e:
        # Capture failure state
        failure_screenshot = await capture_screenshot()
        screenshots.append(failure_screenshot)

        return TestResult(
            status=TestStatus.FAILED,
            error_message=str(e),
            screenshots=screenshots,
            screenshot_at_failure=failure_screenshot,
        )
```
Action Mapping Reference
Map test specification actions to Playwright and Computer Use:
| Test Action | Playwright | Computer Use |
|---|---|---|
| goto | `page.goto(url)` | `screenshot`, then type URL |
| click | `page.click(selector)` | `screenshot`, then `left_click` |
| fill | `page.fill(selector, value)` | `left_click`, then `type` |
| wait | `page.wait_for_selector(selector)` | `wait` |
| assert | `expect(locator).to_be_visible()` | `screenshot` + visual check |
| hover | `page.hover(selector)` | `mouse_move` |
| select | `page.select_option(selector, value)` | Click dropdown + click option |
| key | `page.keyboard.press(key)` | `key` |
| screenshot | `page.screenshot()` | `screenshot` |
```python
class ActionMapper:
    """Map test actions to execution methods."""

    async def execute_action(
        self,
        action: dict,
        page: Page,
        computer_use_client: ComputerUseClient = None
    ):
        """Execute action using Playwright, with Computer Use fallback."""
        action_type = action["action"]
        target = action.get("target")
        value = action.get("value")

        try:
            # Try Playwright first (faster)
            if action_type == "goto":
                await page.goto(target)
            elif action_type == "click":
                await page.click(target, timeout=5000)
            elif action_type == "fill":
                await page.fill(target, value)
            elif action_type == "wait":
                await page.wait_for_selector(target, timeout=int(value or 5000))
            elif action_type == "screenshot":
                return await page.screenshot()
            else:
                raise ValueError(f"Unknown action: {action_type}")

        except Exception as e:
            if computer_use_client:
                # Fall back to Computer Use for visual interaction
                return await self._execute_via_computer_use(
                    computer_use_client, action
                )
            raise
```