git clone https://github.com/openclaw/skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/openclaw/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/1kalin/afrexai-qa-testing-engine" ~/.claude/skills/clawdbot-skills-afrexai-qa-testing-engine && rm -rf "$T"
skills/1kalin/afrexai-qa-testing-engine/SKILL.md

QA & Testing Engine — Complete Software Quality System
The definitive testing methodology for AI agents. From test strategy to execution, coverage to reporting — everything you need to ship quality software.
Phase 1: Test Strategy Design
Before writing a single test, design the strategy.
Strategy Brief Template
```yaml
project:
  name: ""
  type: web-app | api | mobile | library | cli | data-pipeline
  languages: [typescript, python, go, java]
  frameworks: [react, express, django, spring]

risk_profile:
  data_sensitivity: low | medium | high | critical   # PII, financial, health
  user_impact: internal | b2b | b2c | life-safety
  deployment_frequency: daily | weekly | monthly
  regulatory: [none, SOC2, HIPAA, PCI-DSS, GDPR]

test_scope:
  in_scope: []       # Features, services, components
  out_of_scope: []   # Explicitly excluded (with reason)

environments:
  dev: { url: "", db: "local" }
  staging: { url: "", db: "seeded" }
  prod: { url: "", smoke_only: true }
```
Test Type Decision Matrix
| Risk Profile | Unit | Integration | E2E | Performance | Security | Accessibility |
|---|---|---|---|---|---|---|
| Internal tool | ✅ Core | ✅ API | ⚠️ Happy path | ❌ | ⚠️ Basic | ❌ |
| B2B SaaS | ✅ Full | ✅ Full | ✅ Critical flows | ✅ Load | ✅ OWASP Top 10 | ✅ WCAG AA |
| B2C high-traffic | ✅ Full | ✅ Full | ✅ Full | ✅ Stress + soak | ✅ Full | ✅ WCAG AA |
| Financial/Health | ✅ Full + mutation | ✅ Full + contract | ✅ Full + chaos | ✅ Full suite | ✅ Pen test | ✅ WCAG AAA |
Test Pyramid Architecture
```
        /   E2E    \        5-10%  — Critical user journeys only
      / Integration  \      20-30% — API contracts, service boundaries
    /    Unit Tests    \    60-70% — Business logic, pure functions
```
Anti-pattern: Ice cream cone — More E2E than unit tests. Slow, flaky, expensive. Fix by pushing test coverage DOWN the pyramid.
Anti-pattern: Hourglass — Lots of unit + E2E, no integration. Misses contract bugs between services.
Phase 2: Unit Testing Mastery
The AAA Pattern (Arrange-Act-Assert)
Every unit test follows this structure:
```typescript
describe('PricingCalculator', () => {
  // Group by behavior, not by method
  describe('when customer has volume discount', () => {
    it('applies tiered pricing above threshold', () => {
      // ARRANGE — Set up the scenario
      const calculator = new PricingCalculator();
      const customer = createCustomer({ tier: 'enterprise', units: 150 });

      // ACT — Execute the behavior under test
      const price = calculator.calculate(customer);

      // ASSERT — Verify the outcome (ONE logical assertion)
      expect(price).toEqual({
        subtotal: 12000,
        discount: 1800, // 15% volume discount
        total: 10200,
      });
    });
  });
});
```
Test Naming Convention
Format:
[unit] [scenario] [expected behavior]
✅ Good:
- `PricingCalculator applies 15% discount when units exceed 100`
- `UserService throws NotFoundError when user ID is invalid`
- `parseDate returns null for malformed ISO strings`
❌ Bad:
- `test1`
- `should work`
- `calculates price`
What to Unit Test (Priority Order)
- Business logic — Pricing, rules, calculations, state machines
- Data transformations — Parsers, formatters, serializers, mappers
- Edge cases — Boundaries, null/undefined, empty collections, overflow (see the sketch after this list)
- Error handling — Every `catch` block, every validation path
- Pure functions — Easiest to test, highest ROI
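A minimal sketch of the edge-case bucket, using Vitest's `it.each` (the Jest API is identical) against the hypothetical `parseDate` from the naming examples above:

```typescript
import { describe, it, expect } from 'vitest';
import { parseDate } from '../src/parse-date'; // hypothetical parser from the naming examples

describe('parseDate', () => {
  // One table-driven test covers boundaries, malformed input, and empty strings
  it.each([
    ['2024-02-29', new Date('2024-02-29T00:00:00Z')], // valid leap day
    ['2023-02-29', null],                             // invalid leap day
    ['', null],                                       // empty string
    ['not-a-date', null],                             // malformed input
  ])('parseDate(%s) handles the boundary case', (input, expected) => {
    expect(parseDate(input)).toEqual(expected);
  });
});
```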
What NOT to Unit Test
- Framework internals (React rendering, Express routing)
- Simple getters/setters with no logic
- Third-party library behavior
- Implementation details (private methods, internal state)
Mocking Rules
| Dependency Type | Strategy |
|---|---|
| Database | Mock the repository/DAO |
| HTTP API | Mock the client or use MSW |
| File system | Mock fs or use temp dirs |
| Time/Date | Fake timers |
| Randomness | Seed or mock |
| Environment | Override env vars |
Rule: Mock at boundaries, not internals. If you're mocking a class you own, your design might need refactoring.
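A minimal Vitest sketch of these rules, assuming a hypothetical `OrderService` that takes a repository at its boundary and uses the current time internally:

```typescript
import { describe, it, expect, vi, afterEach } from 'vitest';
import { OrderService } from '../src/order-service'; // hypothetical service under test

afterEach(() => {
  vi.useRealTimers();
  vi.restoreAllMocks();
});

describe('OrderService.expireUnpaidOrders', () => {
  it('expires orders older than 24 hours', async () => {
    // Mock at the boundary: a fake repository, not OrderService internals
    const repo = {
      findUnpaidBefore: vi.fn().mockResolvedValue([{ id: 'ord_1' }]),
      markExpired: vi.fn().mockResolvedValue(undefined),
    };
    // Fake timers make "24 hours ago" deterministic
    vi.useFakeTimers();
    vi.setSystemTime(new Date('2024-06-01T12:00:00Z'));

    const service = new OrderService(repo);
    await service.expireUnpaidOrders();

    expect(repo.findUnpaidBefore).toHaveBeenCalledWith(new Date('2024-05-31T12:00:00Z'));
    expect(repo.markExpired).toHaveBeenCalledWith(['ord_1']);
  });
});
```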
Coverage Targets
| Metric | Minimum | Good | Excellent |
|---|---|---|---|
| Line coverage | 70% | 85% | 95%+ |
| Branch coverage | 60% | 80% | 90%+ |
| Function coverage | 75% | 90% | 95%+ |
| Critical path coverage | 100% | 100% | 100% |
Warning: 100% coverage ≠ quality. Coverage measures what code ran, not what was verified. A test with no assertions has coverage but no value.
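For example (hypothetical `calculateTotal` and `cart` fixture), both of these tests execute the same lines and report identical coverage, but only the second verifies behavior:

```typescript
it('runs calculateTotal', () => {
  calculateTotal(cart); // covers every line, asserts nothing — zero value
});

it('calculates the total including 8% tax', () => {
  expect(calculateTotal(cart)).toEqual({ subtotal: 100, tax: 8, total: 108 });
});
```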
Phase 3: Integration Testing
API Testing Checklist
For every API endpoint, test:
```yaml
endpoint: POST /api/orders
tests:
  happy_path:
    - Valid request returns 201 with order ID
    - Response matches schema
    - Database record created correctly
    - Events/webhooks fired
  validation:
    - Missing required fields → 400 with field errors
    - Invalid data types → 400 with type errors
    - Business rule violations → 422 with explanation
  authentication:
    - No token → 401
    - Expired token → 401
    - Wrong role → 403
    - Valid token → proceeds
  edge_cases:
    - Duplicate request (idempotency) → same response
    - Concurrent requests → no race condition
    - Maximum payload size → 413 or graceful handling
    - Special characters in input → no injection
  error_handling:
    - Database down → 503 with retry hint
    - External service timeout → 504 or fallback
    - Rate limit exceeded → 429 with retry-after
```
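A partial sketch of the checklist above using Supertest (from the framework table in Phase 7), assuming a hypothetical Express `app` export and a `validToken` helper:

```typescript
import request from 'supertest';
import { app } from '../src/app';            // hypothetical Express app
import { validToken } from './auth-helpers'; // hypothetical token factory

describe('POST /api/orders', () => {
  it('happy path: valid request returns 201 with an order ID', async () => {
    const res = await request(app)
      .post('/api/orders')
      .set('Authorization', `Bearer ${validToken()}`)
      .send({ sku: 'SKU-1', quantity: 2 });

    expect(res.status).toBe(201);
    expect(res.body.order_id).toEqual(expect.any(String));
  });

  it('validation: missing required fields returns 400 with field errors', async () => {
    const res = await request(app)
      .post('/api/orders')
      .set('Authorization', `Bearer ${validToken()}`)
      .send({});

    expect(res.status).toBe(400);
    expect(res.body.errors).toBeDefined();
  });

  it('authentication: no token returns 401', async () => {
    const res = await request(app)
      .post('/api/orders')
      .send({ sku: 'SKU-1', quantity: 2 });

    expect(res.status).toBe(401);
  });
});
```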
Contract Testing
When services communicate, test the contract:
```yaml
contract:
  consumer: order-service
  provider: payment-service
  interactions:
    - description: "Process payment"
      request:
        method: POST
        path: /payments
        body:
          amount: 99.99
          currency: USD
          order_id: "ord_123"
      response:
        status: 200
        body:
          payment_id: "pay_xxx"   # string, not null
          status: "completed"     # enum: completed|pending|failed

breaking_changes:   # NEVER do these without versioning
  - Remove a field from response
  - Change a field's type
  - Add a required field to request
  - Change the URL path
  - Change error response format
```
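The same interaction as a consumer-side contract test — a sketch based on the pact-js V3 DSL (`@pact-foundation/pact` v10+); method names may differ slightly across versions:

```typescript
import { PactV3, MatchersV3 } from '@pact-foundation/pact';
const { like } = MatchersV3;

const provider = new PactV3({ consumer: 'order-service', provider: 'payment-service' });

it('processes a payment', () => {
  provider
    .given('an order exists')
    .uponReceiving('a request to process payment')
    .withRequest({
      method: 'POST',
      path: '/payments',
      body: { amount: 99.99, currency: 'USD', order_id: 'ord_123' },
    })
    .willRespondWith({
      status: 200,
      body: { payment_id: like('pay_xxx'), status: like('completed') },
    });

  // Pact spins up a mock provider; the consumer code is exercised against it
  return provider.executeTest(async (mockServer) => {
    const res = await fetch(`${mockServer.url}/payments`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ amount: 99.99, currency: 'USD', order_id: 'ord_123' }),
    });
    expect(res.status).toBe(200);
  });
});
```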
Database Testing Rules
- Each test gets a clean state — Use transactions that roll back, or truncate between tests (see the sketch after this list)
- Use factories, not fixtures — `createUser({ role: 'admin' })` > hardcoded SQL dumps
- Test migrations — Run migrate-up, migrate-down, migrate-up (roundtrip)
- Test constraints — Unique violations, FK cascades, NOT NULL
- Test queries — Especially complex JOINs, aggregations, window functions
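A minimal sketch of rules 1 and 4 (clean state via rollback, constraint testing), assuming a Knex-style `db` client; adapt the transaction API to your driver:

```typescript
import { beforeEach, afterEach, it, expect } from 'vitest';
import type { Knex } from 'knex';
import { db } from '../src/db'; // hypothetical Knex instance

let trx: Knex.Transaction;

beforeEach(async () => {
  trx = await db.transaction(); // every test runs inside its own transaction
});

afterEach(async () => {
  await trx.rollback(); // undo everything the test wrote — clean state for the next test
});

it('rejects duplicate emails (unique constraint)', async () => {
  await trx('users').insert({ email: 'a@example.com' });
  await expect(trx('users').insert({ email: 'a@example.com' })).rejects.toThrow();
});
```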
Phase 4: End-to-End Testing
Critical User Journey Mapping
Identify and test the flows that generate revenue or block users:
```yaml
critical_journeys:
  - name: "Sign up → First value"
    steps:
      - Visit landing page
      - Click sign up
      - Fill registration form
      - Verify email
      - Complete onboarding
      - Perform first key action
    max_duration: 3 minutes

  - name: "Purchase flow"
    steps:
      - Browse products
      - Add to cart
      - Enter shipping
      - Enter payment
      - Confirm order
      - Receive confirmation email
    max_duration: 2 minutes

  - name: "Login → Core task → Logout"
    steps:
      - Login (password + SSO + MFA variants)
      - Navigate to core feature
      - Complete primary workflow
      - Verify result
      - Logout
    max_duration: 1 minute
```
E2E Best Practices
- Test user behavior, not implementation — Click buttons by text/role, not by CSS class
- Use data-testid sparingly — Only when no accessible selector exists
- Wait for state, not time — `waitFor(element)`, not `sleep(3000)` (see the Playwright sketch after this list)
- Isolate test data — Each test creates its own users/data
- Run in CI with retries — 1 retry for flaky network, investigate if >5% flake rate
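A sketch of these practices with Playwright — role/label selectors, auto-retrying assertions instead of sleeps, and per-test data (hypothetical signup flow):

```typescript
import { test, expect } from '@playwright/test';

test('sign up → first value', async ({ page }) => {
  // Isolated test data: unique email per run so parallel workers never collide
  const email = `qa+${Date.now()}@example.com`;

  await page.goto('/signup');
  await page.getByLabel('Email').fill(email);
  await page.getByLabel('Password').fill('S3cure!pass');
  await page.getByRole('button', { name: 'Sign up' }).click();

  // Wait for state, not time: the assertion retries until the heading appears or times out
  await expect(page.getByRole('heading', { name: 'Welcome' })).toBeVisible();
});
```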
Selector Priority (Best → Worst)
1. `getByRole('button', { name: 'Submit' })` — Accessible, resilient
2. `getByLabelText('Email')` — Form-specific, accessible
3. `getByText('Welcome back')` — Content-based
4. `getByTestId('submit-btn')` — Explicit test hook
5. `querySelector('.btn-primary')` — ❌ Fragile, breaks on CSS changes
Flaky Test Triage
| Symptom | Likely Cause | Fix |
|---|---|---|
| Passes locally, fails in CI | Timing/race condition | Add explicit waits, check CI resource limits |
| Fails intermittently | Shared state between tests | Isolate test data, reset state |
| Fails after deploy | Environment difference | Check env vars, API versions, feature flags |
| Fails at specific time | Time-dependent logic | Mock dates/times, avoid time-sensitive assertions |
| Fails in parallel | Resource contention | Use unique ports/DBs per worker |
Rule: Quarantine flaky tests within 24 hours. A flaky test suite that everyone ignores is worse than no tests.
Phase 5: Performance Testing
Load Test Design
```yaml
performance_tests:
  smoke:
    vus: 5
    duration: 1m
    purpose: "Verify test works"
  load:
    vus: 100          # Expected concurrent users
    duration: 10m
    ramp_up: 2m
    purpose: "Normal traffic behavior"
    thresholds:
      p95_response: <500ms
      error_rate: <1%
  stress:
    vus: 300          # 3x expected load
    duration: 15m
    ramp_up: 5m
    purpose: "Find breaking point"
  soak:
    vus: 80
    duration: 2h
    purpose: "Memory leaks, connection exhaustion"
  spike:
    stages:
      - { vus: 50, duration: 2m }
      - { vus: 500, duration: 30s }   # Sudden spike
      - { vus: 50, duration: 2m }
    purpose: "Recovery behavior"
```
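A sketch of the load profile as a k6 script (k6 executes JavaScript; run TypeScript through a bundler or k6's TypeScript support). The URL is a placeholder:

```typescript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 100 },  // ramp up to expected concurrency
    { duration: '10m', target: 100 }, // hold normal load
    { duration: '1m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // p95 under 500ms
    http_req_failed: ['rate<0.01'],   // error rate under 1%
  },
};

export default function () {
  const res = http.get('https://staging.example.com/api/orders'); // placeholder endpoint
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```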
Performance Budgets
| Metric | Web App | API | Background Job |
|---|---|---|---|
| Response time (p50) | <200ms | <100ms | N/A |
| Response time (p95) | <1s | <500ms | N/A |
| Response time (p99) | <3s | <1s | N/A |
| Throughput | >100 rps | >500 rps | >1000/min |
| Error rate | <0.1% | <0.1% | <0.5% |
| CPU usage | <70% | <70% | <90% |
| Memory growth | <5%/hr | <2%/hr | <10%/hr |
Database Performance Testing
```yaml
db_performance:
  query_tests:
    - name: "Dashboard aggregate query"
      baseline: 50ms
      max_acceptable: 200ms
      with_1M_rows: measure
      with_10M_rows: measure
  index_verification:
    - Run EXPLAIN ANALYZE on all critical queries
    - Verify no sequential scans on tables >10K rows
    - Check index usage statistics weekly
  connection_pool:
    - Test at max connections
    - Verify graceful handling when pool exhausted
    - Monitor connection wait time
```
Phase 6: Security Testing
OWASP Top 10 Test Checklist
```yaml
security_tests:
  A01_broken_access_control:
    - [ ] Horizontal privilege escalation (access other user's data)
    - [ ] Vertical privilege escalation (access admin functions)
    - [ ] IDOR (Insecure Direct Object References)
    - [ ] Missing function-level access control
    - [ ] CORS misconfiguration
  A02_cryptographic_failures:
    - [ ] Sensitive data in transit (TLS 1.2+)
    - [ ] Sensitive data at rest (encryption)
    - [ ] Password hashing (bcrypt/argon2, not MD5/SHA)
    - [ ] No secrets in code/logs/URLs
  A03_injection:
    - [ ] SQL injection (parameterized queries)
    - [ ] NoSQL injection
    - [ ] Command injection (OS commands)
    - [ ] XSS (stored, reflected, DOM-based)
    - [ ] Template injection (SSTI)
  A04_insecure_design:
    - [ ] Rate limiting on auth endpoints
    - [ ] Account lockout after N failures
    - [ ] CAPTCHA on public forms
    - [ ] Business logic abuse scenarios
  A05_security_misconfiguration:
    - [ ] Default credentials removed
    - [ ] Error messages don't leak stack traces
    - [ ] Security headers set (CSP, HSTS, X-Frame-Options)
    - [ ] Directory listing disabled
    - [ ] Unnecessary HTTP methods disabled
  A07_auth_failures:
    - [ ] Brute force protection
    - [ ] Session fixation
    - [ ] Session timeout
    - [ ] JWT validation (signature, expiry, issuer)
    - [ ] MFA bypass attempts
```
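A sketch of the first A01 check (horizontal privilege escalation / IDOR) in Supertest, assuming hypothetical token helpers for two distinct users:

```typescript
import request from 'supertest';
import { app } from '../src/app';                      // hypothetical app
import { tokenForUserA, tokenForUserB } from './auth'; // hypothetical token helpers

it('blocks horizontal privilege escalation (IDOR)', async () => {
  // User A creates an order...
  const created = await request(app)
    .post('/api/orders')
    .set('Authorization', `Bearer ${tokenForUserA}`)
    .send({ sku: 'SKU-1', quantity: 1 });

  // ...and user B must not be able to read it
  const res = await request(app)
    .get(`/api/orders/${created.body.order_id}`)
    .set('Authorization', `Bearer ${tokenForUserB}`);

  expect([403, 404]).toContain(res.status); // 404 is also acceptable — never 200
});
```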
Input Validation Test Payloads
Test every user input with:
```yaml
injection_payloads:
  sql: ["' OR 1=1--", "'; DROP TABLE users;--", "1 UNION SELECT * FROM users"]
  xss: ["<script>alert(1)</script>", "<img onerror=alert(1) src=x>", "javascript:alert(1)"]
  path_traversal: ["../../etc/passwd", "..\\..\\windows\\system32", "%2e%2e%2f"]
  command: ["; ls -la", "| cat /etc/passwd", "$(whoami)", "`id`"]

boundary_values:
  strings: ["", " ", "a"*10000, null, undefined, "emoji: 🎯", "unicode: é à ü", "rtl: مرحبا"]
  numbers: [0, -1, 2147483647, -2147483648, NaN, Infinity, 0.1+0.2]
  arrays: [[], [null], Array(10000)]
  dates: ["1970-01-01", "2099-12-31", "invalid-date", "2024-02-29", "2023-02-29"]
```
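One way to drive these payloads through a test — a table-driven Supertest sketch against a hypothetical search endpoint; assert that the app rejects or neutralizes the input and never leaks an error:

```typescript
import request from 'supertest';
import { app } from '../src/app'; // hypothetical app

const sqlPayloads = ["' OR 1=1--", "'; DROP TABLE users;--", '1 UNION SELECT * FROM users'];

it.each(sqlPayloads)('search endpoint survives SQL payload: %s', async (payload) => {
  const res = await request(app).get('/api/search').query({ q: payload });

  // Never a 500, never a leaked stack trace or database error string
  expect(res.status).toBeLessThan(500);
  expect(JSON.stringify(res.body)).not.toMatch(/syntax error|sqlite|stack trace/i);
});
```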
Phase 7: Test Automation Architecture
Framework Selection Guide
| Need | JavaScript/TS | Python | Go | Java |
|---|---|---|---|---|
| Unit | Vitest / Jest | pytest | testing + testify | JUnit 5 |
| API | Supertest | httpx + pytest | net/http/httptest | RestAssured |
| E2E (browser) | Playwright | Playwright | chromedp | Selenium |
| Performance | k6 | Locust | vegeta | Gatling |
| Contract | Pact | Pact | Pact | Pact |
| Security | ZAP + custom | Bandit + custom | gosec | SpotBugs |
CI Pipeline Test Stages
```yaml
pipeline:
  stage_1_fast:          # <2 min, blocks PR
    - Lint + type check
    - Unit tests
    - Security: dependency scan (npm audit / safety)
  stage_2_thorough:      # <10 min, blocks merge
    - Integration tests
    - Contract tests
    - Security: SAST scan
    - Coverage report + threshold check
  stage_3_confidence:    # <30 min, blocks deploy
    - E2E critical journeys
    - Visual regression (if applicable)
    - Security: container scan
  stage_4_post_deploy:   # After deploy to staging
    - Smoke tests against staging
    - Performance baseline check
    - Security: DAST scan (ZAP)
  stage_5_production:    # After prod deploy
    - Smoke tests (critical paths only)
    - Synthetic monitoring enabled
    - Canary metrics watching
```
Test Data Management
```yaml
test_data_strategy:
  unit_tests:
    approach: factories              # Builder pattern, create exactly what you need
    example: "createUser({ role: 'admin', plan: 'enterprise' })"
  integration_tests:
    approach: seeded_database
    reset: per_test_suite            # Transaction rollback or truncate
    sensitive_data: anonymized       # Never use real PII
  e2e_tests:
    approach: api_setup              # Create data via API before test
    cleanup: after_each              # Delete created data
    isolation: unique_identifiers    # Timestamp or UUID in test data
  performance_tests:
    approach: representative_dataset
    volume: 10x_production           # Test with more data than prod
    generation: faker_libraries      # Realistic but synthetic
```
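A minimal factory sketch matching the `createUser` example above, using `@faker-js/faker` for realistic synthetic values (field names are illustrative):

```typescript
import { faker } from '@faker-js/faker';

interface User {
  id: string;
  email: string;
  role: 'user' | 'admin';
  plan: 'free' | 'enterprise';
}

// Sensible defaults; each test overrides only the fields it actually cares about
export function createUser(overrides: Partial<User> = {}): User {
  return {
    id: faker.string.uuid(),
    email: faker.internet.email(),
    role: 'user',
    plan: 'free',
    ...overrides,
  };
}

// Usage: const admin = createUser({ role: 'admin', plan: 'enterprise' });
```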
Phase 8: Quality Metrics & Reporting
Test Health Dashboard
```yaml
metrics:
  test_suite_health:
    total_tests: 0
    passing: 0
    failing: 0
    skipped: 0          # >5% skipped = tech debt alarm
    flaky: 0            # >2% flaky = quarantine immediately
  coverage:
    line: "0%"
    branch: "0%"
    critical_paths: "0%"        # Must be 100%
  execution:
    unit_duration: "0s"         # Target: <30s
    integration_duration: "0s"  # Target: <5m
    e2e_duration: "0s"          # Target: <15m
    total_ci_time: "0s"         # Target: <20m
  defect_metrics:
    bugs_found_in_test: 0
    bugs_escaped_to_prod: 0
    escape_rate: "0%"   # Target: <5%
    mttr: "0h"          # Mean time to resolve
  trends:               # Track weekly
    new_tests_added: 0
    tests_deleted: 0    # Healthy deletion = removing redundant tests
    coverage_delta: "+0%"
    flake_rate_delta: "+0%"
```
Test Report Template
```markdown
# Test Report — [Feature/Sprint/Release]

## Summary
- **Status:** ✅ PASS / ⚠️ PASS WITH RISKS / ❌ FAIL
- **Tests Run:** X | **Passed:** X | **Failed:** X | **Skipped:** X
- **Coverage:** Line X% | Branch X% | Critical 100%
- **Duration:** Xm Xs

## Key Findings

### 🔴 Critical (Block Release)
1. [Finding] — [Impact] — [Fix recommendation]

### 🟡 High (Fix Before Next Release)
1. [Finding] — [Impact] — [Fix recommendation]

### 🟢 Medium/Low (Backlog)
1. [Finding] — [Impact]

## Risk Assessment
- **Untested areas:** [list]
- **Known flaky tests:** [list with ticket IDs]
- **Performance concerns:** [if any]

## Recommendation
[Ship / Ship with monitoring / Hold for fixes]
```
Quality Score (0-100)
| Dimension | Weight | Scoring |
|---|---|---|
| Test coverage | 20% | <60%=0, 60-70%=5, 70-80%=10, 80-90%=15, 90%+=20 |
| Critical path coverage | 20% | <100%=0, 100%=20 |
| Defect escape rate | 15% | >10%=0, 5-10%=5, 2-5%=10, <2%=15 |
| Test suite speed | 10% | >30m=0, 20-30m=3, 10-20m=7, <10m=10 |
| Flake rate | 10% | >5%=0, 2-5%=3, 1-2%=7, <1%=10 |
| Security test coverage | 10% | None=0, Basic=3, OWASP Top 10=7, Full=10 |
| Documentation | 5% | None=0, Basic=2, Complete=5 |
| Automation ratio | 10% | <50%=0, 50-70%=3, 70-90%=7, 90%+=10 |
Scoring: 0-40 = 🔴 Critical | 41-60 = 🟡 Needs Work | 61-80 = 🟢 Good | 81-100 = 💎 Excellent
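Worked example (hypothetical project): 85% coverage (15) + 100% critical-path coverage (20) + 3% escape rate (10) + 15-minute suite (7) + 1.5% flake rate (7) + OWASP Top 10 security tests (7) + basic documentation (2) + 80% automation (7) = 75 → 🟢 Good.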
Phase 9: Specialized Testing
Accessibility Testing (WCAG 2.1)
```yaml
accessibility_checklist:
  level_a:    # Minimum compliance
    - [ ] All images have alt text
    - [ ] All form inputs have labels
    - [ ] Color is not the only visual indicator
    - [ ] Page has proper heading hierarchy (h1→h2→h3)
    - [ ] All functionality available via keyboard
    - [ ] Focus is visible and logical
    - [ ] No content flashes >3 times/second
  level_aa:   # Standard compliance (recommended)
    - [ ] Color contrast ratio ≥4.5:1 (normal text)
    - [ ] Color contrast ratio ≥3:1 (large text)
    - [ ] Text resizable to 200% without loss
    - [ ] Skip navigation links
    - [ ] Consistent navigation across pages
    - [ ] Error suggestions provided
    - [ ] ARIA landmarks for page regions

tools:
  - axe-core (automated, catches ~30% of issues)
  - Lighthouse accessibility audit
  - Manual keyboard navigation test
  - Screen reader testing (VoiceOver/NVDA)
```
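An automated pass with axe-core via Playwright — a sketch using `@axe-core/playwright` (remember it catches only ~30% of issues; keyboard and screen-reader checks stay manual):

```typescript
import { test, expect } from '@playwright/test';
import AxeBuilder from '@axe-core/playwright';

test('checkout page has no Level A/AA violations', async ({ page }) => {
  await page.goto('/checkout'); // hypothetical page under test

  const results = await new AxeBuilder({ page })
    .withTags(['wcag2a', 'wcag2aa']) // limit the scan to Level A and AA rules
    .analyze();

  expect(results.violations).toEqual([]);
});
```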
API Backward Compatibility Testing
```yaml
compatibility_tests:
  when_updating_api:
    - [ ] All existing fields still present in response
    - [ ] No field type changes (string→number)
    - [ ] New required request fields have defaults
    - [ ] Deprecated fields still work (with warning header)
    - [ ] Error format unchanged
    - [ ] Pagination behavior unchanged
    - [ ] Rate limits not reduced

  versioning_strategy:
    - URL versioning: /v1/users, /v2/users
    - Header versioning: Accept: application/vnd.api+json;version=2
    - Sunset header for deprecated versions
    - Minimum 6-month deprecation notice
```
Chaos Engineering Principles
```yaml
chaos_tests:
  network:
    - Service dependency goes down → graceful degradation?
    - Network latency increases 10x → timeout handling?
    - DNS resolution fails → fallback behavior?
  infrastructure:
    - Database primary fails → replica promotion?
    - Cache (Redis) goes down → DB fallback works?
    - Disk fills up → alerting + graceful failure?
  application:
    - Memory pressure → OOM handling?
    - CPU saturation → request queuing?
    - Certificate expiry → monitoring alert?
  data:
    - Corrupt message in queue → dead letter + alert?
    - Schema migration fails mid-way → rollback works?
    - Clock skew between services → idempotency holds?
```
Phase 10: Daily QA Workflow
For New Features
- Review requirements — Identify test scenarios before code is written (shift-left)
- Write test cases — Cover happy path, edge cases, error cases, security
- Review PR tests — Are tests meaningful? Do they test behavior, not implementation?
- Run full suite — Unit + integration + E2E for affected areas
- Report findings — Use the test report template above
For Bug Fixes
- Write failing test first — Reproduce the bug as a test
- Verify fix makes test pass — The test IS the proof
- Check for regression — Run related test suites
- Add to regression suite — Bug tests prevent re-introduction
Weekly QA Review
```yaml
weekly_review:
  monday:
    - Review flaky test quarantine — fix or delete
    - Check coverage trends — declining = tech debt
    - Review escaped defects — update test strategy
  friday:
    - Update test health dashboard
    - Clean up obsolete tests
    - Document new testing patterns discovered
    - Plan next week's testing focus
```
Natural Language Commands
- "Create test strategy for [project/feature]" → Full strategy brief
- "Write unit tests for [function/class]" → AAA pattern tests with edge cases
- "Test this API endpoint: [method] [path]" → Full API test checklist
- "Review these tests for quality" → Test code review with scoring
- "Generate performance test plan" → k6/Locust test design
- "Security test [feature/endpoint]" → OWASP-based test checklist
- "Create test report for [release]" → Formatted test report
- "What's our test health?" → Dashboard with metrics and recommendations
- "Find gaps in our test coverage" → Analysis with prioritized recommendations
- "Help debug this flaky test" → Root cause analysis with fix suggestions
- "Set up CI test pipeline" → Stage-by-stage pipeline config
- "Accessibility audit [page/component]" → WCAG checklist with findings