Aiwg regression-performance
Detect performance regressions by comparing benchmarks across versions with latency, throughput, and statistical significance analysis
git clone https://github.com/jmagly/aiwg
T=$(mktemp -d) && git clone --depth=1 https://github.com/jmagly/aiwg "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.agents/skills/regression-performance" ~/.claude/skills/jmagly-aiwg-regression-performance && rm -rf "$T"
.agents/skills/regression-performance/SKILL.mdregression-performance
Detect performance regressions by comparing benchmarks across versions, analyzing latency/throughput degradation, and providing statistical significance testing.
Triggers
Alternate expressions and non-obvious activations (primary phrases are matched automatically from the skill description):
- "latency regression" → performance benchmark comparison
- "p99" / "p95" → percentile-based performance metrics
- "benchmark diff" → performance baseline comparison
Purpose
This skill detects performance regressions across software versions by:
- Comparing latency metrics (p50, p95, p99) between baseline and current versions
- Detecting throughput regressions (requests/sec, transactions/sec)
- Identifying memory regressions (heap growth, memory leaks)
- Analyzing resource utilization (CPU, disk I/O, network)
- Running benchmark comparisons with statistical significance testing
- Generating performance regression reports with visualizations
Behavior
When triggered, this skill:
-
Identifies baseline version:
- Detect last known good version from git tags
- Load baseline benchmark results
- Extract performance metrics from monitoring
-
Runs performance benchmarks:
- Execute load tests using k6, Artillery, or wrk
- Capture latency distributions (p50, p95, p99)
- Measure throughput (req/s, TPS)
- Profile memory usage and heap growth
- Monitor CPU and I/O utilization
-
Performs statistical comparison:
- Calculate delta and percentage change
- Apply statistical significance tests (t-test, Mann-Whitney U)
- Determine if degradation exceeds threshold
- Account for variance and noise
-
Detects regression patterns:
- Latency spikes at specific percentiles
- Throughput capacity reduction
- Memory leak indicators (growing heap)
- CPU saturation points
- I/O bottlenecks
-
Generates regression report:
- Performance comparison tables
- Percentile distribution graphs
- Time-series trend analysis
- Root cause indicators
- Recommendations
-
Logs regression findings:
- Create regression register entry
- Tag commits with performance impact
- Alert on threshold violations
Performance Metrics Model
┌─────────────────────┐ │ BASELINE v2.3.0 │ ├─────────────────────┤ │ p50: 45ms │ │ p95: 120ms │ │ p99: 180ms │ │ RPS: 2500 │ │ Mem: 256MB │ └─────────────────────┘ │ ▼ Compare ┌─────────────────────┐ │ CURRENT v2.4.0 │ ├─────────────────────┤ │ p50: 52ms (+15%) │ ⚠️ REGRESSION │ p95: 145ms (+21%) │ ⚠️ REGRESSION │ p99: 220ms (+22%) │ ⚠️ REGRESSION │ RPS: 2100 (-16%) │ ⚠️ REGRESSION │ Mem: 312MB (+22%) │ ⚠️ REGRESSION └─────────────────────┘ │ ▼ ┌─────────────────────┐ │ REGRESSION REPORT │ │ │ │ Type: Latency │ │ Severity: HIGH │ │ Confidence: 99.5% │ │ Root Cause: TBD │ └─────────────────────┘
Metric Categories
Latency Metrics
| Metric | Description | Threshold | Tool |
|---|---|---|---|
| p50 (median) | 50th percentile latency | +10% | k6, Artillery, wrk |
| p95 | 95th percentile latency | +15% | k6, Artillery, wrk |
| p99 | 99th percentile latency | +20% | k6, Artillery, wrk |
| max | Maximum observed latency | +30% | k6, Artillery, wrk |
Throughput Metrics
| Metric | Description | Threshold | Tool |
|---|---|---|---|
| Requests/sec | HTTP requests per second | -10% | k6, wrk, ab |
| Transactions/sec | Business transactions per second | -10% | Custom |
| Bytes/sec | Network throughput | -15% | iperf3, iftop |
| Queries/sec | Database query throughput | -10% | pgbench, sysbench |
Memory Metrics
| Metric | Description | Threshold | Tool |
|---|---|---|---|
| Heap size | JavaScript heap usage | +20% | Node.js heap snapshot |
| RSS | Resident set size | +20% | ps, top |
| Memory growth rate | MB/hour increase | >10 MB/hour | Continuous profiling |
| GC pressure | Garbage collection frequency | +30% | Node.js --trace-gc |
Resource Metrics
| Metric | Description | Threshold | Tool |
|---|---|---|---|
| CPU utilization | Average CPU usage | +20% | mpstat, top |
| Disk I/O wait | I/O wait percentage | +25% | iostat |
| Network bandwidth | Network utilization | +15% | iftop, nethogs |
| File descriptors | Open file handles | +30% | lsof |
Benchmark Tools Integration
k6 Load Testing
// benchmark.k6.js import http from 'k6/http'; import { check, sleep } from 'k6'; export let options = { stages: [ { duration: '2m', target: 100 }, // Ramp up { duration: '5m', target: 100 }, // Steady state { duration: '2m', target: 0 }, // Ramp down ], thresholds: { 'http_req_duration': ['p(50)<100', 'p(95)<200', 'p(99)<300'], 'http_req_failed': ['rate<0.01'], }, }; export default function () { const res = http.get('https://api.example.com/endpoint'); check(res, { 'status is 200': (r) => r.status === 200, 'response time < 200ms': (r) => r.timings.duration < 200, }); sleep(1); }
Running comparison:
# Baseline k6 run --out json=baseline-results.json benchmark.k6.js # Current version k6 run --out json=current-results.json benchmark.k6.js # Compare ./compare-k6-results.sh baseline-results.json current-results.json
Artillery Load Testing
# artillery-config.yml config: target: 'https://api.example.com' phases: - duration: 120 arrivalRate: 10 rampTo: 50 - duration: 300 arrivalRate: 50 - duration: 120 arrivalRate: 50 rampTo: 0 plugins: metrics-by-endpoint: stripQueryString: true scenarios: - name: "API Performance Test" flow: - get: url: "/api/users" - get: url: "/api/products" - post: url: "/api/orders" json: product_id: 123 quantity: 2
Running comparison:
# Baseline artillery run --output baseline.json artillery-config.yml # Current artillery run --output current.json artillery-config.yml # Compare artillery report baseline.json --output baseline-report.html artillery report current.json --output current-report.html ./compare-artillery-results.sh baseline.json current.json
wrk HTTP Benchmarking
# Simple throughput test wrk_benchmark() { local version=$1 local output_file=$2 wrk -t12 -c400 -d30s \ --latency \ --timeout 10s \ https://api.example.com/endpoint \ > "$output_file" } # Baseline wrk_benchmark "v2.3.0" "wrk-baseline.txt" # Current wrk_benchmark "v2.4.0" "wrk-current.txt" # Compare ./parse-wrk-results.sh wrk-baseline.txt wrk-current.txt
Apache Bench (ab)
# Quick regression check ab_compare() { local baseline_version=$1 local current_version=$2 echo "=== Baseline ${baseline_version} ===" ab -n 10000 -c 100 https://api.example.com/ > ab-baseline.txt echo "=== Current ${current_version} ===" ab -n 10000 -c 100 https://api.example.com/ > ab-current.txt # Extract key metrics echo "Comparison:" echo "Baseline RPS: $(grep 'Requests per second' ab-baseline.txt | awk '{print $4}')" echo "Current RPS: $(grep 'Requests per second' ab-current.txt | awk '{print $4}')" }
Hyperfine (CLI tool benchmarking)
# For CLI tool performance hyperfine \ --warmup 3 \ --export-json comparison.json \ 'git checkout v2.3.0 && npm run build && npm test' \ 'git checkout v2.4.0 && npm run build && npm test'
Statistical Significance Testing
T-Test for Mean Comparison
function detectLatencyRegression( baseline: number[], current: number[] ): RegressionAnalysis { const baselineMean = mean(baseline); const currentMean = mean(current); const delta = currentMean - baselineMean; const deltaPercent = (delta / baselineMean) * 100; // Perform Welch's t-test const tTestResult = welchTTest(baseline, current); return { metric: 'latency', baseline: { mean: baselineMean, stddev: stddev(baseline), n: baseline.length, }, current: { mean: currentMean, stddev: stddev(current), n: current.length, }, delta, deltaPercent, pValue: tTestResult.pValue, isSignificant: tTestResult.pValue < 0.05, isRegression: deltaPercent > 10 && tTestResult.pValue < 0.05, confidence: 1 - tTestResult.pValue, }; }
Mann-Whitney U Test (Non-Parametric)
function detectThroughputRegression( baseline: number[], current: number[] ): RegressionAnalysis { const baselineMedian = median(baseline); const currentMedian = median(current); const delta = currentMedian - baselineMedian; const deltaPercent = (delta / baselineMedian) * 100; // Use Mann-Whitney U for non-normal distributions const uTestResult = mannWhitneyU(baseline, current); return { metric: 'throughput', baseline: { median: baselineMedian, iqr: iqr(baseline), n: baseline.length, }, current: { median: currentMedian, iqr: iqr(current), n: current.length, }, delta, deltaPercent, pValue: uTestResult.pValue, isSignificant: uTestResult.pValue < 0.05, isRegression: deltaPercent < -10 && uTestResult.pValue < 0.05, confidence: 1 - uTestResult.pValue, }; }
Multiple Comparison Correction
// Apply Bonferroni correction when testing multiple metrics function detectMultiMetricRegressions( metrics: MetricComparison[], familyAlpha: number = 0.05 ): RegressionResult[] { const adjustedAlpha = familyAlpha / metrics.length; return metrics.map(metric => ({ ...metric, adjustedAlpha, isSignificantCorrected: metric.pValue < adjustedAlpha, })); }
Threshold Configuration
Configure regression detection thresholds in
.aiwg/config/performance-thresholds.yaml:
performance_thresholds: latency: p50: warning: 10% # Warn if p50 increases by >10% critical: 20% # Critical if p50 increases by >20% p95: warning: 15% critical: 25% p99: warning: 20% critical: 30% throughput: requests_per_sec: warning: -10% # Warn if RPS decreases by >10% critical: -20% transactions_per_sec: warning: -10% critical: -20% memory: heap_size: warning: 20% critical: 40% growth_rate: warning: 10 # MB/hour critical: 25 resources: cpu_utilization: warning: 20% critical: 40% io_wait: warning: 25% critical: 50% statistical: significance_level: 0.05 # p-value threshold min_sample_size: 100 # Minimum observations multiple_comparison_correction: bonferroni
Memory Leak Detection
Heap Snapshot Comparison
interface HeapAnalysis { version: string; timestamp: string; totalSize: number; usedSize: number; objectCounts: Record<string, number>; retainedPaths: RetainedPath[]; } function detectMemoryLeak( baseline: HeapAnalysis, current: HeapAnalysis ): MemoryLeakReport { const growthRate = (current.usedSize - baseline.usedSize) / (current.timestamp - baseline.timestamp); const suspectObjects = Object.keys(current.objectCounts) .filter(type => { const baseCount = baseline.objectCounts[type] || 0; const currCount = current.objectCounts[type]; const growth = currCount - baseCount; return growth > 1000; // >1000 additional objects }) .map(type => ({ type, baseline: baseline.objectCounts[type] || 0, current: current.objectCounts[type], delta: current.objectCounts[type] - (baseline.objectCounts[type] || 0), })); return { isLeak: growthRate > 10 * 1024 * 1024, // >10MB/hour growthRate, suspectObjects, recommendation: suspectObjects.length > 0 ? `Investigate ${suspectObjects[0].type} accumulation` : 'Run extended profiling session', }; }
Continuous Memory Monitoring
#!/bin/bash # monitor-memory-regression.sh monitor_memory() { local duration_minutes=$1 local interval_seconds=$2 local output_file=$3 echo "timestamp,rss,heap_used,heap_total,external" > "$output_file" for i in $(seq 1 $((duration_minutes * 60 / interval_seconds))); do local stats=$(curl -s http://localhost:3000/metrics/memory) echo "$(date +%s),$stats" >> "$output_file" sleep "$interval_seconds" done } # Run for 30 minutes, sample every 10 seconds monitor_memory 30 10 memory-profile.csv # Analyze for leaks python analyze-memory-trend.py memory-profile.csv
Performance Regression Report Format
Executive Summary
# Performance Regression Report **Project**: my-api-service **Analysis Date**: 2024-01-25 **Baseline Version**: v2.3.0 **Current Version**: v2.4.0 **Test Duration**: 10 minutes **Load Profile**: 100 concurrent users, steady state ## Executive Summary **Regression Detected**: YES (HIGH severity) **Metrics Affected**: 5 of 8 metrics show significant degradation **Statistical Confidence**: 99.5% **Recommendation**: DO NOT DEPLOY - requires investigation | Category | Status | Details | |----------|--------|---------| | Latency | ⚠️ REGRESSION | p50: +15%, p95: +21%, p99: +22% | | Throughput | ⚠️ REGRESSION | -16% requests/sec | | Memory | ⚠️ REGRESSION | +22% heap, potential leak | | CPU | ✅ PASS | Within acceptable range | | I/O | ✅ PASS | No significant change |
Detailed Metrics Comparison
## Latency Metrics | Metric | Baseline | Current | Delta | % Change | p-value | Regression? | |--------|----------|---------|-------|----------|---------|-------------| | p50 | 45ms | 52ms | +7ms | +15.6% | 0.001 | ⚠️ YES | | p75 | 78ms | 92ms | +14ms | +17.9% | 0.002 | ⚠️ YES | | p90 | 105ms | 128ms | +23ms | +21.9% | 0.001 | ⚠️ YES | | p95 | 120ms | 145ms | +25ms | +20.8% | 0.001 | ⚠️ YES | | p99 | 180ms | 220ms | +40ms | +22.2% | 0.001 | ⚠️ YES | | max | 450ms | 580ms | +130ms | +28.9% | 0.023 | ⚠️ YES | **Statistical Test**: Welch's t-test **Significance Level**: α = 0.05 (Bonferroni corrected: 0.0083) **Confidence**: 99.9% that degradation is real ### Latency Distribution
Baseline v2.3.0 Current v2.4.0 │ │ 600 │ │ * │ │ 500 │ │ * │ │ * 400 │ │ * │ │ 300 │ │ ** │ │ * 200 │ * │ ** │ ** │ ** 100 │ **** │ ** │ ***** │** 0 └───────────────── └───────────────── 0 25 50 75 95 99 0 25 50 75 95 99 Percentile Percentile
Throughput Regression
## Throughput Metrics | Metric | Baseline | Current | Delta | % Change | p-value | Regression? | |--------|----------|---------|-------|----------|---------|-------------| | Requests/sec | 2500 | 2100 | -400 | -16.0% | 0.003 | ⚠️ YES | | Total requests | 150000 | 126000 | -24000 | -16.0% | - | - | | Failed requests | 150 (0.1%) | 1260 (1.0%) | +1110 | +740% | 0.001 | ⚠️ YES | **Statistical Test**: Mann-Whitney U test **Confidence**: 99.7% that throughput decreased ### Time Series Comparison
Requests per Second Over Time
3000 ┤ │ Baseline ───── 2500 │ ████████████████████████████████████ │ 2000 │ Current ─ ─ ─ ─ │ ╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌ 1500 │ │ 1000 │ │ 500 │ │ 0 └──────────────────────────────────────── 0 2m 4m 6m 8m 10m Time
Memory Analysis
## Memory Metrics | Metric | Baseline | Current | Delta | % Change | Leak Detected? | |--------|----------|---------|-------|----------|----------------| | Heap Used | 256 MB | 312 MB | +56 MB | +21.9% | ⚠️ Suspected | | RSS | 384 MB | 468 MB | +84 MB | +21.9% | - | | Growth Rate | 2 MB/hour | 18 MB/hour | +16 MB/hour | +800% | ⚠️ YES | | GC Frequency | 12/min | 18/min | +6/min | +50% | ⚠️ Pressure | **Memory Leak Analysis**: - **Leak Detected**: YES (high confidence) - **Growth Rate**: 18 MB/hour (threshold: 10 MB/hour) - **Suspect Objects**: Array (+12500 instances), Closure (+8200 instances) - **Recommendation**: Take heap snapshot and analyze retained paths ### Heap Growth Over Time
Heap Size (MB)
400 ┤ Current ╱╱ │ ╱╱╱╱╱ 350 │ ╱╱╱╱╱ │ ╱╱╱╱╱ 300 │ ╱╱╱╱╱ │ ╱╱╱╱╱ 250 │ ╱╱╱╱╱ Baseline ──── │ ╱╱╱╱╱ ────────────────────────── 200 │ │ 150 │ │ 100 │ └──────────────────────────────────────── 0 10 20 30 40 50 60 Time (minutes)
Root Cause Indicators
## Root Cause Analysis **Automated Analysis**: ### Likely Culprits 1. **Synchronous I/O in Request Path** (HIGH confidence) - Evidence: p99 latency +22%, I/O wait +15% - Pattern: Blocking operations correlate with latency spikes - Recommendation: Profile with `clinic doctor` or `0x` 2. **Memory Accumulation** (HIGH confidence) - Evidence: Linear heap growth at 18 MB/hour - Pattern: Array and Closure object growth - Recommendation: Take heap snapshots at t=0 and t=30min, analyze diff 3. **Increased Computation** (MEDIUM confidence) - Evidence: CPU +12%, throughput -16% - Pattern: Less throughput despite similar CPU usage - Recommendation: Profile with `perf` or `clinic flame` ### Recommended Investigation Steps 1. **Git Bisect**: Identify introducing commit ```bash git bisect start HEAD v2.3.0 git bisect run ./performance-test.sh
-
Heap Snapshot Diff:
node --heap-prof app.js # Baseline # Run load test node --heap-prof app.js # Current # Compare snapshots -
Flame Graph Analysis:
clinic flame -- node app.js # Analyze hot paths
### Recommendations ```markdown ## Recommendations ### Immediate Actions (DO NOT DEPLOY) - [ ] **Do not deploy v2.4.0** - Severity is HIGH - [ ] Run git bisect to identify introducing commit - [ ] Take heap snapshots and analyze memory accumulation - [ ] Profile CPU with flame graphs to identify hot paths ### Investigation Priority | Priority | Action | Owner | ETA | |----------|--------|-------|-----| | P0 | Identify regression commit via bisect | Regression Analyst | 2 hours | | P0 | Analyze heap snapshots for leak | Performance Engineer | 4 hours | | P1 | Profile CPU for hot paths | Performance Engineer | 4 hours | | P1 | Review recent I/O changes | Developer | 2 hours | ### Prevention (After Fix) - [ ] Add performance regression tests to CI - [ ] Set up continuous memory profiling - [ ] Enable automated alerts for p99 > 200ms - [ ] Add load testing to pull request checks
Usage Examples
Full Performance Comparison
User: "Performance regression check between v2.3.0 and v2.4.0" Skill executes: 1. Checkout v2.3.0, run k6 benchmark → baseline-results.json 2. Checkout v2.4.0, run k6 benchmark → current-results.json 3. Statistical comparison → 5 of 8 metrics regressed 4. Memory profiling → leak detected (+18 MB/hour) 5. Generate report → .aiwg/reports/performance-regression-20240125.md Output: "⚠️ PERFORMANCE REGRESSION DETECTED Severity: HIGH Confidence: 99.5% Metrics Affected: Latency (+15-22%), Throughput (-16%), Memory Leak Report: .aiwg/reports/performance-regression-20240125.md DO NOT DEPLOY - Investigation required"
Quick Latency Check
User: "Check latency regression" Skill executes: 1. Compare current p99 against baseline 2. Statistical t-test for significance 3. Quick report Output: "Latency Regression Detected: p50: 52ms (baseline: 45ms) → +15.6% ⚠️ p95: 145ms (baseline: 120ms) → +20.8% ⚠️ p99: 220ms (baseline: 180ms) → +22.2% ⚠️ Statistical confidence: 99.9% Exceeds threshold: YES (>10%) Recommendation: Investigate before deploying"
Memory Leak Detection
User: "Detect memory leak" Skill executes: 1. Monitor memory for 30 minutes 2. Calculate growth rate 3. Identify accumulating objects Output: "⚠️ MEMORY LEAK DETECTED Growth Rate: 18 MB/hour (threshold: 10 MB/hour) Suspect Objects: - Array: +12500 instances - Closure: +8200 instances Heap snapshots saved: - .aiwg/profiling/heap-t0.heapsnapshot - .aiwg/profiling/heap-t30.heapsnapshot Recommendation: Analyze with Chrome DevTools Memory Profiler"
Benchmark Comparison
User: "Compare performance with main branch" Skill executes: 1. Stash current changes 2. Benchmark main branch 3. Restore changes 4. Benchmark current branch 5. Statistical comparison Output: "Performance Comparison: feature-branch vs main Latency (p99): main: 180ms feature-branch: 185ms Delta: +2.8% (within threshold ✓) Throughput: main: 2500 req/s feature-branch: 2480 req/s Delta: -0.8% (within threshold ✓) Verdict: NO REGRESSION DETECTED Safe to merge"
Integration
This skill uses:
: Load benchmark results from previous runsartifact-metadata
: Detect git tags and version historyproject-awareness
agent: For detailed root cause analysis when regression detectedregression-analyst
This skill is used by:
- CI/CD pipelines for automated regression gates
- Developers for pre-merge performance validation
- Release managers for go/no-go decisions
Automated Detection in CI
GitHub Actions Example
# .github/workflows/performance-gate.yml name: Performance Regression Gate on: pull_request: branches: [main] jobs: performance-gate: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 with: fetch-depth: 0 # Full history for baseline - name: Setup Node.js uses: actions/setup-node@v3 with: node-version: '20' - name: Install dependencies run: npm ci - name: Benchmark baseline (main branch) run: | git checkout main npm run build k6 run --out json=baseline.json benchmark.k6.js - name: Benchmark current (PR branch) run: | git checkout ${{ github.head_ref }} npm run build k6 run --out json=current.json benchmark.k6.js - name: Performance regression analysis id: regression run: | npm run analyze:performance -- \ --baseline baseline.json \ --current current.json \ --threshold 10 \ --output regression-report.md - name: Comment PR with results uses: actions/github-script@v6 with: script: | const fs = require('fs'); const report = fs.readFileSync('regression-report.md', 'utf8'); github.rest.issues.createComment({ issue_number: context.issue.number, owner: context.repo.owner, repo: context.repo.repo, body: report }); - name: Fail if regression detected if: steps.regression.outputs.regression == 'true' run: | echo "Performance regression detected - see report" exit 1
Output Locations
- Performance reports:
.aiwg/reports/performance-regression-{date}.md - Benchmark data:
.aiwg/benchmarks/{version}-{timestamp}.json - Heap snapshots:
.aiwg/profiling/heap-{version}-{timestamp}.heapsnapshot - Flame graphs:
.aiwg/profiling/flame-{version}-{timestamp}.html - Regression register:
.aiwg/testing/regression-register/REG-PERF-{id}.yaml
References
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/agents/regression-analyst.md - Regression analysis agent
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/agents/performance-engineer.md - Performance optimization agent
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/schemas/testing/regression.yaml - Regression schema
- @.aiwg/research/findings/REF-013-metagpt.md - Debug memory pattern for performance history
- @$AIWG_ROOT/agentic/code/frameworks/sdlc-complete/rules/executable-feedback.md - Execution validation requirements