# Flaky Detect Skill

Identify flaky tests from CI history and test execution patterns. Use when debugging intermittent test failures, auditing test reliability, or improving CI stability.

Source: `skills/data/flaky-detect/SKILL.md` in [claude-skill-registry](https://github.com/majiayu000/claude-skill-registry).

## Installation

```bash
# Clone the full registry
git clone https://github.com/majiayu000/claude-skill-registry

# Or copy just this skill into your local skills directory
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/flaky-detect" ~/.claude/skills/majiayu000-claude-skill-registry-flaky-detect && rm -rf "$T"
```
## Purpose
Identify flaky tests (tests that pass and fail non-deterministically) by analyzing CI history, execution patterns, and test characteristics. Google research shows 4.56% of tests are flaky, costing millions in developer productivity.
## Research Foundation
| Finding | Source | Reference |
|---|---|---|
| 4.56% flaky rate | Google (2016) | Flaky Tests at Google |
| ML Classification | FlaKat (2024) | arXiv:2403.01003 - 85%+ accuracy |
| LLM Auto-repair | FlakyFix (2023) | arXiv:2307.00012 |
| Flaky Taxonomy | Luo et al. (2014) | "An Empirical Analysis of Flaky Tests" |
## When This Skill Applies
- User reports "tests sometimes fail" or "intermittent failures"
- CI has been unstable or unreliable
- User wants to audit test suite reliability
- Pre-release quality assessment
- Debugging non-deterministic behavior
## Trigger Phrases
| Natural Language | Action |
|---|---|
| "Find flaky tests" | Analyze CI history for flaky patterns |
| "Why does CI keep failing?" | Identify flaky tests causing failures |
| "Test suite is unreliable" | Full flaky test audit |
| "This test sometimes passes" | Analyze specific test for flakiness |
| "Audit test reliability" | Comprehensive flaky detection |
| "Quarantine flaky tests" | Identify and isolate flaky tests |
## Flaky Test Taxonomy (Google Research)
| Category | Percentage | Root Causes |
|---|---|---|
| Async/Timing | 45% | Race conditions, insufficient waits, timeouts |
| Test Order | 20% | Shared state, execution order dependencies |
| Environment | 15% | File system, network, configuration differences |
| Resource Limits | 10% | Memory, threads, connection pools |
| Non-deterministic | 10% | Random values, timestamps, UUIDs |
## Detection Methods

### 1. CI History Analysis
Parse GitHub Actions / CI logs to find inconsistent results:
```python
def analyze_ci_history(repo, days=30):
    """Analyze CI runs for flaky patterns."""
    runs = get_ci_runs(repo, days)
    test_results = {}
    for run in runs:
        for test in run.tests:
            if test.name not in test_results:
                test_results[test.name] = {"pass": 0, "fail": 0}
            if test.passed:
                test_results[test.name]["pass"] += 1
            else:
                test_results[test.name]["fail"] += 1

    # Identify flaky tests (pass rate between 5% and 95%)
    flaky = []
    for test, results in test_results.items():
        total = results["pass"] + results["fail"]
        if total >= 5:  # enough data points to judge
            pass_rate = results["pass"] / total
            if 0.05 < pass_rate < 0.95:
                flaky.append({
                    "test": test,
                    "pass_rate": pass_rate,
                    "total_runs": total,
                })
    return sorted(flaky, key=lambda x: x["pass_rate"])
```
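`get_ci_runs` is left undefined above. As a rough sketch of one way to implement it against GitHub Actions: listing workflow runs is a real API endpoint, but per-test results are not included in that response, so the `parse_test_artifacts` helper below is a hypothetical stand-in for downloading and parsing whatever test-report artifact (e.g. JUnit XML) your workflow uploads.

```python
import os
from datetime import date, timedelta

import requests


def get_ci_runs(repo, days=30):
    """List recent GitHub Actions runs for `repo` ("owner/name")."""
    since = (date.today() - timedelta(days=days)).isoformat()
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/actions/runs",
        params={"created": f">={since}", "per_page": 100},
        headers={"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"},
    )
    resp.raise_for_status()
    # parse_test_artifacts() is hypothetical: it would download the run's
    # test-report artifact and return an object with a .tests list of
    # per-test results, since the runs API only exposes run-level status.
    return [
        parse_test_artifacts(repo, run["id"])
        for run in resp.json()["workflow_runs"]
    ]
```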
### 2. Code Pattern Analysis
Scan test code for flaky patterns:
```python
import re

FLAKY_PATTERNS = [
    # Timing issues
    (r'setTimeout|sleep|delay', "timing", "Uses explicit delays"),
    (r'Date\.now\(\)|new Date\(\)', "timing", "Uses current time"),
    # Async issues
    (r'\.then\([^)]*\)(?!.*await)', "async", "Promise without await"),
    (r'\basync\b(?![^\n]*await)', "async", "Async without await on the same line"),
    # Non-deterministic values
    (r'Math\.random\(\)', "random", "Uses random values"),
    (r'uuid|nanoid', "random", "Uses generated IDs"),
    # Environment
    (r'process\.env', "environment", "Environment-dependent"),
    (r'fs\.(read|write)', "environment", "File system access"),
    (r'fetch\(|axios\.|http\.', "network", "Network calls"),
]


def scan_for_flaky_patterns(test_file):
    """Scan a test file for flaky patterns."""
    with open(test_file, encoding="utf-8") as f:
        content = f.read()
    matches = []
    for pattern, category, description in FLAKY_PATTERNS:
        if re.search(pattern, content):
            matches.append({
                "category": category,
                "description": description,
                "pattern": pattern,
            })
    return matches
```
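A minimal driver for the scanner, assuming tests live under `test/` with a `.test.ts` suffix (adjust the glob to your layout):

```python
from pathlib import Path

for test_file in Path("test").rglob("*.test.ts"):
    for match in scan_for_flaky_patterns(test_file):
        print(f"{test_file}: [{match['category']}] {match['description']}")
```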
### 3. Re-run Analysis
Run tests multiple times to detect flakiness:
```bash
# Run tests 10 times, track results
for i in {1..10}; do
  npm test -- --reporter=json >> test-results.jsonl
done

# Analyze for inconsistency
python analyze_reruns.py test-results.jsonl
```
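`analyze_reruns.py` is referenced but not shown. A minimal sketch follows, assuming each JSONL line is one run's report in roughly the shape Jest's `--json` reporter emits (a top-level `testResults` array of suites with `assertionResults`); adapt the key names to your runner's actual output.

```python
# analyze_reruns.py -- count per-test outcomes across re-runs
import json
import sys
from collections import defaultdict

counts = defaultdict(lambda: {"pass": 0, "fail": 0})
with open(sys.argv[1], encoding="utf-8") as f:
    for line in f:
        report = json.loads(line)
        for suite in report.get("testResults", []):
            for test in suite.get("assertionResults", []):
                outcome = "pass" if test["status"] == "passed" else "fail"
                counts[test["fullName"]][outcome] += 1

# Any test that both passed and failed across runs is flaky
for name, c in sorted(counts.items()):
    total = c["pass"] + c["fail"]
    if 0 < c["pass"] < total:
        print(f"FLAKY {name}: {c['pass']}/{total} passed")
```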
## Output Format
````markdown
## Flaky Test Report

**Analysis Period**: Last 30 days
**Total Tests**: 450
**Flaky Tests Found**: 12 (2.7%)

### Critical Flaky Tests (< 50% pass rate)

#### 1. `test/api/login.test.ts:45`

**Pass Rate**: 42% (21/50 runs)
**Category**: Timing
**Pattern**: Uses `Date.now()` for token expiry

```typescript
// Flaky code
it('should expire token after 1 hour', () => {
  const token = createToken();
  const expiry = Date.now() + 3600000; // Flaky!
  expect(token.expiresAt).toBe(expiry);
});
```
**Root Cause**: The test computes the expected expiry with a second `Date.now()` call, which sometimes lands in the same millisecond as token creation and sometimes does not.
**Recommended Fix**: Use mocked time:
```typescript
it('should expire token after 1 hour', () => {
  vi.useFakeTimers();
  vi.setSystemTime(new Date('2024-01-01T00:00:00Z'));
  const token = createToken();
  expect(token.expiresAt).toBe(new Date('2024-01-01T01:00:00Z').getTime());
  vi.useRealTimers();
});
```
### High-Severity Flaky Tests (50-80% pass rate)
#### 2. `test/db/connection.test.ts:23`

**Pass Rate**: 68% (34/50 runs)
**Category**: Resource
**Pattern**: Connection pool exhaustion
[... more tests ...]
### Summary by Category
| Category | Count | Impact |
|---|---|---|
| Timing | 5 | HIGH |
| Async | 3 | HIGH |
| Environment | 2 | MEDIUM |
| Order | 1 | MEDIUM |
| Network | 1 | LOW |
### Recommendations

- **Quick Win**: Fix 5 timing tests with `vi.setSystemTime()` (+0.5% stability)
- **Medium Effort**: Add proper async handling (+0.3% stability)
- **Infrastructure**: Add test isolation for DB tests (+0.2% stability)
### Quarantine Candidates

These tests should be skipped in CI until fixed:

```typescript
// vitest.config.ts
export default {
  test: {
    exclude: [
      'test/api/login.test.ts',     // Timing flaky
      'test/db/connection.test.ts', // Resource flaky
    ]
  }
}
```

**Note**: Track quarantined tests in `.aiwg/testing/flaky-quarantine.md`
````
## Quarantine Process

### 1. Identify

```bash
# Run flaky detection
python scripts/flaky_detect.py --ci-history 30 --threshold 95
```
### 2. Quarantine

```typescript
// Mark the test as flaky
describe.skip('flaky: login expiry', () => {
  // FLAKY: https://github.com/org/repo/issues/123
  // Root cause: timing-dependent
  // Fix in progress: PR #456
});
```
### 3. Track

Create a tracking issue:

```markdown
## Flaky Test: test/api/login.test.ts:45

- **Pass Rate**: 42%
- **Category**: Timing
- **Root Cause**: Uses real system time
- **Quarantined**: 2024-12-12
- **Fix PR**: #456
- **Target Unquarantine**: 2024-12-15
```
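These stubs can also be generated from the detector's output rather than written by hand. A small formatter, assuming a record shaped like the dicts returned by `analyze_ci_history()` plus a `category` key from the pattern scanner (that extra key is an assumption of this sketch):

```python
from datetime import date


def format_tracking_issue(record):
    """Render a quarantine tracking stub for one flaky test.

    Assumes "test" and "pass_rate" keys (from analyze_ci_history())
    and an optional "category" key (from the pattern scanner).
    """
    return "\n".join([
        f"## Flaky Test: {record['test']}",
        "",
        f"- **Pass Rate**: {record['pass_rate']:.0%}",
        f"- **Category**: {record.get('category', 'unknown')}",
        f"- **Quarantined**: {date.today().isoformat()}",
    ])
```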
### 4. Fix and Unquarantine

After the fix:

```bash
# Verify the fix with multiple runs
for i in {1..20}; do
  npm test -- test/api/login.test.ts || echo "FAIL on run $i"
done

# Remove from quarantine only if all 20 runs pass
```
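If you'd rather get an explicit pass count (and a usable exit code) than eyeball the loop output, the re-run check can be scripted; a minimal sketch, assuming `npm test -- <file>` exits non-zero on failure:

```python
import subprocess
import sys

RUNS = 20
test_file = sys.argv[1]
failures = sum(
    subprocess.run(["npm", "test", "--", test_file],
                   capture_output=True).returncode != 0
    for _ in range(RUNS)
)
print(f"{RUNS - failures}/{RUNS} runs passed")
sys.exit(1 if failures else 0)  # non-zero exit keeps the test quarantined
```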
## Integration Points

- Works with the `flaky-fix` skill for automated repairs
- Reports to the CI dashboard
- Feeds into `/flow-gate-check` for release decisions
- Tracks flaky tests in `.aiwg/testing/flaky-registry.md`
## Script Reference

### flaky_detect.py

Analyze CI history for flaky tests:

```bash
python scripts/flaky_detect.py --repo owner/repo --days 30
```

### flaky_scanner.py

Scan test code for flaky patterns:

```bash
python scripts/flaky_scanner.py --target test/
```