# Run Test Plan

Skill `run-test-plan` (`skills/anderskev/run-test-plan/SKILL.md`) from https://github.com/openclaw/skills.

Execute a YAML test plan: run setup commands, health checks, and each test sequentially. Stop on the first failure and output a rich debug prompt.

Install by cloning the skills repository:

```bash
git clone https://github.com/openclaw/skills
```

Or copy just this skill into `~/.claude/skills` in one step:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/anderskev/run-test-plan" ~/.claude/skills/clawdbot-skills-run-test-plan && rm -rf "$T"
```
## Prerequisites

- **agent-browser skill**: Browser tests require the `agent-browser` skill to be available
## Arguments

- `--plan <path>`: Path to the test plan (default: `docs/testing/test-plan.yaml`)
- `--skip-setup`: Skip setup commands and health checks (for re-running after a failure)
## Step 1: Parse Test Plan

Read and validate the test plan:

```bash
# Check file exists
ls docs/testing/test-plan.yaml || { echo "Error: Test plan not found"; exit 1; }

# Validate YAML
python3 -c "import yaml; yaml.safe_load(open('docs/testing/test-plan.yaml'))" || { echo "Error: Invalid YAML"; exit 1; }
```
Extract from the YAML:

- `setup.commands`: List of setup commands
- `setup.health_checks`: List of URLs to poll
- `tests`: Array of test cases
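For orientation, the extraction can be sketched with the same python3 + PyYAML dependency used for validation above (field names follow the list here; adjust to the actual plan):

```bash
# Sketch only: summarize the plan fields used by later steps.
# Assumes docs/testing/test-plan.yaml and PyYAML, as in the validation command above.
python3 - <<'EOF'
import json
import yaml

plan = yaml.safe_load(open("docs/testing/test-plan.yaml"))
setup = plan.get("setup") or {}

summary = {
    "setup_commands": setup.get("commands", []),
    "health_checks": setup.get("health_checks", []),
    "test_ids": [t.get("id") for t in plan.get("tests", [])],
}
print(json.dumps(summary, indent=2))
EOF
```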
## Step 2: Run Setup (unless `--skip-setup`)

### 2a. Check Prerequisites

If `setup.prerequisites` exists, verify each one:

```bash
# For each prerequisite in setup.prerequisites
<prerequisite.check> || { echo "Prerequisite not met: <prerequisite.name>"; exit 1; }
```
### 2b. Set Environment Variables

If `setup.env` exists, export each variable. Values using `${VAR}` syntax should be resolved from the current environment:

```bash
# For each key/value in setup.env
export <key>="<value>"
```
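One way to perform that resolution, sketched with `envsubst` (from GNU gettext) and hypothetical `DB_HOST`/`DATABASE_URL` names:

```bash
# Sketch: expand ${VAR} references in a plan value against the current environment.
# DATABASE_URL and DB_HOST are hypothetical names used only for illustration.
raw_value='postgres://app@${DB_HOST}:5432/app'
export DATABASE_URL="$(printf '%s' "$raw_value" | envsubst)"
echo "DATABASE_URL=$DATABASE_URL"
```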
### 2c. Build

If `setup.build` exists, execute build commands sequentially:

```bash
# For each command in setup.build
<command> || { echo "Build failed: <command>"; exit 1; }
```
### 2d. Start Services

If `setup.services` exists, start long-running processes and wait for health checks:

```bash
# Ensure the log/pid directory exists
mkdir -p .beagle

# For each service in setup.services
nohup <service.command> > .beagle/service-<index>.log 2>&1 &
echo $! > .beagle/service-<index>.pid
```

For each service with a `health_check`, poll until ready:

```bash
timeout=<service.health_check.timeout or 30>
url=<service.health_check.url>
elapsed=0
while [ $elapsed -lt $timeout ]; do
  if curl -s -o /dev/null -w "%{http_code}" "$url" | grep -qE "^(200|301|302)"; then
    echo "✓ Health check passed: $url"
    break
  fi
  sleep 2
  elapsed=$((elapsed + 2))
done
if [ $elapsed -ge $timeout ]; then
  echo "✗ Health check timeout: $url"
  exit 1
fi
```
### 2e. Legacy Setup Format

If the plan uses the older flat format (`setup.commands` + `setup.health_checks` instead of prerequisites/build/services), fall back to executing `setup.commands` sequentially and polling `setup.health_checks` as before, as sketched below.
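A minimal sketch of that fallback, assuming the commands and URLs have already been read into hypothetical `SETUP_COMMANDS` and `HEALTH_CHECK_URLS` arrays:

```bash
# Legacy flat-format fallback (sketch). SETUP_COMMANDS and HEALTH_CHECK_URLS are
# hypothetical arrays populated from setup.commands and setup.health_checks.
for cmd in "${SETUP_COMMANDS[@]}"; do
  bash -c "$cmd" || { echo "Setup command failed: $cmd"; exit 1; }
done

for url in "${HEALTH_CHECK_URLS[@]}"; do
  elapsed=0
  until curl -s -o /dev/null -w "%{http_code}" "$url" | grep -qE "^(200|301|302)"; do
    sleep 2
    elapsed=$((elapsed + 2))
    if [ $elapsed -ge 30 ]; then
      echo "✗ Health check timeout: $url"
      exit 1
    fi
  done
  echo "✓ Health check passed: $url"
done
```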
## Step 4: Execute Tests Sequentially

For each test in the plan:

### 4a. Log Test Start

```
## Running: TC-XX - <test.name>
Context: <test.context>
```
### 4b. Execute Steps

For each step in `test.steps`, determine the step type and execute accordingly:

**Shell commands (`run:` steps):**

The most common step type. Execute the command via Bash and capture stdout, stderr, and exit code:

```bash
# Execute the command, capture output and exit code
<command> 2>&1
echo "EXIT_CODE: $?"
```
Capture all output for evaluation in step 4c. Shell steps cover:

- CLI binary invocations (e.g., `./target/debug/myapp status --all`)
- Database queries (e.g., `psql "${DATABASE_URL}" -c "SELECT ..."`)
- File inspection (e.g., `ls -la /path/to/expected/output`)
- Process lifecycle checks (e.g., `timeout 5 ./myapp 2>&1 || true`)
- Any other command a human would type in a terminal
**curl actions (`action: curl` steps):**

```bash
curl -X <method> \
  -H "Content-Type: application/json" \
  <additional headers> \
  -d '<body>' \
  "<url>" \
  -o response.json \
  -w "%{http_code}" > status_code.txt

# Capture response for evaluation
cat response.json
cat status_code.txt
```
**agent-browser CLI actions:**

Steps starting with `agent-browser` are browser automation commands:

```bash
# Navigate
agent-browser open <url>

# Snapshot interactive elements (always do before interacting)
agent-browser snapshot -i

# Interact using refs from snapshot output (@e1, @e2, etc.)
agent-browser fill @<ref> "<value>"
agent-browser click @<ref>

# Wait for conditions
agent-browser wait --url "<pattern>"
agent-browser wait --text "<text>"
agent-browser wait --load networkidle

# Capture evidence
agent-browser screenshot docs/testing/evidence/<test.id>.png
```
**Important:** Always run `agent-browser snapshot -i` before interacting with elements to get valid refs, and re-snapshot after navigation or significant DOM changes.

Save screenshots to `docs/testing/evidence/<test.id>.png`.
### 4c. Evaluate Result

Using agent reasoning, compare the actual outcome against `test.expected`:

- Read the expected behavior description
- Compare with the actual response/screenshot
- Determine PASS or FAIL
### 4d. On PASS

```
✓ TC-XX PASSED: <test.name>
```

Continue to the next test.

### 4e. On FAIL

Stop immediately. Go to Step 6.
## Step 5: On All Tests Pass

```
## Test Results: ALL PASSED

| ID | Name | Result |
|----|------|--------|
| TC-01 | <name> | ✓ PASS |
| TC-02 | <name> | ✓ PASS |
| ... | ... | ... |

**Total:** N/N tests passed

### Evidence
Screenshots saved to `docs/testing/evidence/`

### Cleanup
Stopping background services...
```

Clean up:

```bash
# Kill background services
for pidfile in .beagle/service-*.pid .beagle/dev-server.pid; do
  if [ -f "$pidfile" ]; then
    kill $(cat "$pidfile") 2>/dev/null
    rm "$pidfile"
  fi
done
```
## Step 6: On Failure - Generate Debug Prompt

When a test fails, generate rich debug output:

### 6a. Gather Context

```bash
# Get changed files relevant to the failure
git diff --name-only $(git merge-base HEAD origin/main)..HEAD

# Get recent changes in files mentioned in test.context
git diff $(git merge-base HEAD origin/main)..HEAD -- <relevant_files>
```
### 6b. Output Debug Report

````
## Test Failure: TC-XX - <test.name>

### What Failed
**Test:** <test.name>
**Expected:** <test.expected>
**Actual:** <Describe what actually happened - response code, error message, screenshot description>

### Relevant Changes in This PR
<For each file mentioned in test.context or related to the failure:>
- `<file>` (lines X-Y) - <brief description of changes>

### Evidence
<If screenshot exists:>
- Screenshot: `docs/testing/evidence/<test.id>.png`
<If API response:>
- Status code: <code>
- Response body:
```json
<response>
```

### Error Details
<If error message in response or logs:>
```
<error message>
```

### Suggested Investigation
<Based on the error, suggest 2-3 specific things to check:>
- <First thing to check based on error type>
- <Second thing related to changed files>
- <Third thing about environment/setup>

### Debug Session Prompt
Copy this to start a new Claude session:

```
I'm debugging a test failure in branch <branch>.

Test: <test.name>
Error: <brief error description>

<Summarize what the test was checking and what went wrong>

Relevant files:
<List changed files related to this test>

Help me investigate why <specific failure reason>.
```
````
### 6c. Preserve Evidence

```bash
# Ensure evidence directory exists
mkdir -p docs/testing/evidence

# Save failure context
cat > docs/testing/evidence/<test.id>-failure.md << 'EOF'
# Failure Report: <test.id>
<Full debug report content>
EOF
```
### 6d. Cleanup and Exit

```bash
# Kill background services
for pidfile in .beagle/service-*.pid .beagle/dev-server.pid; do
  if [ -f "$pidfile" ]; then
    kill $(cat "$pidfile") 2>/dev/null
    rm "$pidfile"
  fi
done
```
## Test Results Summary Table

Always output a summary table showing progress:

```
## Test Results

| ID | Name | Result |
|----|------|--------|
| TC-01 | <name> | ✓ PASS |
| TC-02 | <name> | ✗ FAIL |
| TC-03 | <name> | - SKIP |

**Passed:** 1/3
**Failed:** TC-02
```
Tests after a failure are marked as SKIP (not executed).
## Verification

Before completing:

```bash
# Verify evidence directory exists
ls -la docs/testing/evidence/

# List captured evidence
ls docs/testing/evidence/*.png docs/testing/evidence/*.md 2>/dev/null
```
Verification Checklist:
- Setup commands executed successfully
- Health checks passed before test execution
- Each executed test has a recorded result
- Evidence captured in `docs/testing/evidence/`
- On failure: debug prompt includes expected vs actual
- On failure: relevant PR changes listed
- Background processes cleaned up
## Rules
- Stop on first test failure (do not continue to other tests)
- Always capture evidence (screenshots, responses)
- Include file:line references in debug prompts when possible
- Use the `--skip-setup` flag to re-run after fixing issues
- Never hardcode secrets - use environment variables
- Clean up background processes even on failure
- Preserve failure evidence for debugging
- Make debug prompts copy-paste ready for new sessions