PM-Copilot-by-Product-Faculty regression-testing
Use this skill when the user asks to "prevent regressions in AI quality", "regression testing for AI", "how do I know if a prompt change broke something", "before/after evaluation for model changes", "catch quality regressions", or wants to set up a process that catches when a model update, prompt change, or system change has degraded AI output quality compared to before.
git clone https://github.com/Productfculty-aipm/PM-Copilot-by-Product-Faculty
T=$(mktemp -d) && git clone --depth=1 https://github.com/Productfculty-aipm/PM-Copilot-by-Product-Faculty "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/regression-testing" ~/.claude/skills/productfculty-aipm-pm-copilot-by-product-faculty-regression-testing && rm -rf "$T"
skills/regression-testing/SKILL.md
AI Regression Testing
You are setting up a regression testing framework for an AI feature — a systematic process that catches quality degradations caused by model changes, prompt changes, or data/context changes before they reach users.
Framework: Hamel Husain and Shreya Shankar's approach to building eval systems (2025), combined with software testing principles applied to AI.
Step 1 — Load Context
Read memory/user-profile.md for the AI feature being protected. Read the eval suite design if available — regression tests are a subset of the broader eval suite, focused on the specific failure modes the team has already identified and fixed.
Step 2 — What Triggers a Regression?
AI regressions can be caused by:
- Model updates: The underlying LLM changes (e.g., Claude version update)
- Prompt changes: The system prompt or few-shot examples are modified
- Context changes: The data passed to the model (retrieved documents, user context) changes
- Tool/API changes: External tools the model calls change their behavior
- Distribution shift: The types of inputs coming from users have changed over time
The regression test suite should catch all of these, not just obvious changes.
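These triggers are easier to catch if every regression run records a fingerprint of the components that can change. A minimal sketch, assuming the model version, system prompt, and retrieval config are available as plain values (the function and field names are illustrative, not part of the framework):

```python
import hashlib
import json

def system_fingerprint(model_version: str, system_prompt: str, retrieval_config: dict) -> str:
    """Hash the components whose changes can cause regressions, so each
    regression run can be traced to the exact model/prompt/context it tested."""
    payload = json.dumps(
        {
            "model_version": model_version,
            "system_prompt": system_prompt,
            "retrieval_config": retrieval_config,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Storing this hash alongside each run makes it obvious when a "silent" regression actually coincided with a model, prompt, or config change.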
Step 3 — Building the Regression Test Set
The regression test set is a curated collection of (input, expected behavior) pairs. "Expected behavior" means the output should PASS a specific eval.
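As a sketch, each pair can be stored as a small structured record; the field names below are illustrative rather than prescribed by the framework:

```python
from dataclasses import dataclass

@dataclass
class RegressionTestCase:
    case_id: str            # stable ID, e.g. "v1.0-hallucinated-citation-003"
    input_text: str         # the user input or scenario fed to the AI feature
    eval_name: str          # the specific eval this case's output must pass
    failure_category: str   # category from error analysis, e.g. "hallucination"
    source: str             # "past failure", "edge case", "happy path", or "boundary"
    notes: str = ""         # context on the original failure and how it was fixed
```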
Sources for the test set:
- Past failures that were fixed: If you fixed a failure in v1.0, add a test that would have caught it. This prevents it from recurring silently.
- Edge cases from error analysis: The failure categories identified in error analysis become test cases.
- Representative happy path examples: The most common, important use cases should have at least one test each.
- Boundary cases: Inputs that are at the edge of the system's capability or scope.
Size guidance:
- Minimum viable: 20–30 carefully chosen test cases (covers major failure categories)
- Good: 50–100 test cases (covers failure categories + representative happy paths)
- Comprehensive: 200+ test cases (needed for high-stakes AI features)
Step 4 — Regression Test Execution
When to run:
- On every PR that changes the system prompt, model version, or retrieval pipeline
- On every scheduled model update (when the underlying model is upgraded)
- Weekly (to catch silent regressions from distribution shift)
Pass/fail definition: The test suite passes if: (1) every individual test case passes its specific eval, AND (2) the aggregate pass rate doesn't drop by more than [threshold]% from the baseline.
Set the threshold based on the feature's criticality:
- Core feature: < 2% regression acceptable
- Secondary feature: < 5% regression acceptable
- Experimental feature: < 10% regression acceptable
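These tiers can be captured as a small configuration so the deployment gate (Step 6) picks the right tolerance automatically; the names and structure below are an assumption, not part of the framework:

```python
# Illustrative mapping from feature criticality to the maximum acceptable
# drop in aggregate pass rate, mirroring the guidance above.
REGRESSION_THRESHOLDS = {
    "core": 0.02,
    "secondary": 0.05,
    "experimental": 0.10,
}

def threshold_for(criticality: str) -> float:
    return REGRESSION_THRESHOLDS[criticality]
```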
Step 5 — Regression Report Structure
When a regression is detected, produce a report:
- What changed: Which eval(s) failed? What does the failure pattern look like?
- Affected input types: Are regressions concentrated on certain types of inputs (short inputs, specific user segments, specific task types)?
- Severity: How many test cases failed? What's the regression % vs. baseline?
- Root cause hypothesis: What change (model, prompt, context) most likely caused this?
- Rollback recommendation: Should the change be reverted immediately, or is this a degradation that can be fixed forward?
- Fix plan: If fixing forward, what changes to the prompt or system would address the regression?
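As a sketch, the report fields above can be captured as a structured record so reports stay consistent across incidents; the field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class RegressionReport:
    what_changed: str                # which eval(s) failed and the failure pattern
    affected_input_types: list[str]  # input types or segments where failures concentrate
    failed_case_count: int           # how many test cases failed
    regression_vs_baseline: float    # e.g. 0.04 for a 4-point drop in pass rate
    root_cause_hypothesis: str       # suspected cause: "model", "prompt", or "context"
    rollback_recommended: bool       # revert immediately vs. fix forward
    fix_plan: str = ""               # proposed prompt/system change if fixing forward
```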
Step 6 — CI/CD Integration
Connect regression tests to the deployment pipeline:
```python
# Pseudocode: regression gate in deployment pipeline.
# Assumes run_eval(test_case, eval_suite) is defined elsewhere and returns a
# result object with a .passed flag.

class DeploymentBlockedError(Exception):
    """Raised to block a deployment when the regression gate fails."""


def run_regression_gate(eval_suite, test_cases, baseline_pass_rate, threshold=0.02):
    # Run every regression test case against its eval
    results = [run_eval(test_case, eval_suite) for test_case in test_cases]
    current_pass_rate = sum(1 for r in results if r.passed) / len(results)

    # Regression = drop in aggregate pass rate vs. the recorded baseline
    regression = baseline_pass_rate - current_pass_rate
    if regression > threshold:
        failing = [r for r in results if not r.passed]
        raise DeploymentBlockedError(
            f"Regression detected: {regression:.1%} quality drop vs. baseline. "
            f"Blocking deployment. Review failing cases: {failing}"
        )
    return {"pass_rate": current_pass_rate, "regression": regression, "status": "PASS"}
```
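A minimal sketch of invoking the gate from a deploy step, assuming the baseline pass rate was recorded from the last known-good run and that `eval_suite` and `test_cases` are defined elsewhere (the file path and format are illustrative):

```python
import json

# Load the aggregate pass rate recorded for the last known-good release
# (path and schema are assumptions for this sketch).
with open("regression_baseline.json") as f:
    baseline = json.load(f)

summary = run_regression_gate(
    eval_suite=eval_suite,
    test_cases=test_cases,
    baseline_pass_rate=baseline["pass_rate"],
    threshold=0.02,  # core feature: block on more than a 2% drop
)
print(summary)
```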
Step 7 — Regression Triage Process
When a regression is flagged:
- Identify which test cases failed (which failure categories?)
- Compare failing outputs to the passing outputs from before the change (a comparison sketch follows this list)
- Determine root cause: is this a prompt issue, model issue, or retrieval issue?
- Decide: revert (fastest) or fix forward (if the cause is known and the fix is simple)
- After fixing, run the full regression suite before deploying
- Update the test set: add the new failing cases as permanent regression tests
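For the comparison step above, it helps to keep the outputs from the last passing run so failing cases can be diffed against them. A minimal sketch, assuming each result exposes `case_id`, `output`, and `eval_name`, and that `baseline_outputs` maps case IDs to the previously passing outputs (all of these names are illustrative):

```python
def diff_against_baseline(failing_results, baseline_outputs):
    """Pair each failing case's new output with the stored output from the
    last passing run, so a reviewer can see exactly what degraded."""
    comparisons = []
    for result in failing_results:
        comparisons.append(
            {
                "case_id": result.case_id,
                "eval_name": result.eval_name,
                "before": baseline_outputs.get(result.case_id),
                "after": result.output,
            }
        )
    return comparisons
```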
Step 8 — Output
Produce:
- Regression test set design (how many cases, which failure categories to cover, sources)
- Pass/fail thresholds and regression % tolerances
- Deployment gate integration plan
- Regression triage process (who does what when a regression is caught)
- The first 10 regression test cases to implement (highest-priority failures from error analysis)