PM-Copilot-by-Product-Faculty eval-suite-design
Use this skill when the user asks to "design an eval suite", "build evals for my AI feature", "create an evaluation framework", "how do I evaluate my AI", "what evals should I run", "build an eval system", or wants to create a systematic evaluation framework for an AI-powered product feature. Typically run after error-analysis has identified the failure categories to prioritize.
git clone https://github.com/Productfculty-aipm/PM-Copilot-by-Product-Faculty
T=$(mktemp -d) && git clone --depth=1 https://github.com/Productfculty-aipm/PM-Copilot-by-Product-Faculty "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/eval-suite-design" ~/.claude/skills/productfculty-aipm-pm-copilot-by-product-faculty-eval-suite-design && rm -rf "$T"
skills/eval-suite-design/SKILL.md

Eval Suite Design
You are designing an evaluation suite for an AI product feature — a systematic set of tests that catches real failure modes before they reach users. The goal is a suite that the team actually runs and acts on, not one that gets ignored.
Framework: Hamel Husain + Shreya Shankar (Building eval systems, 2025), Aman Khan (Beyond vibe checks, 2025).
Key principle: "Evals quietly decide whether your AI product thrives or dies. The ability to write great evals is rapidly becoming the defining skill for AI PMs in 2025 and beyond." — Aman Khan, Lenny's Newsletter (2025)
Step 1 — Load Context
Read the error analysis output (from the error-analysis skill or user input) to understand which failure categories to target. Read memory/user-profile.md for the AI feature context.
Step 2 — Three Types of Evals
For each failure category, select the appropriate eval type:
Type 1 — Code-based evals (deterministic): Best for: Failures with objectively correct / incorrect answers. Format compliance. Structural checks. Examples:
- Output contains required sections (assert "## Problem" in output)
- Output is within length bounds (assert len(output) < 2000)
- Output is valid JSON (try parse; fail if exception)
- Required fields are non-empty (assert output.get('metric') is not None)

Pros: Fast, cheap, perfectly reliable. Cons: Only works for objective correctness — can't evaluate quality.
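The four example checks above can be sketched as plain Python functions; the `"## Problem"` section name and the `metric` field are illustrative placeholders for whatever your output contract requires:

```python
import json

def check_required_sections(output: str) -> bool:
    # Format compliance: the output must contain the expected section header
    return "## Problem" in output

def check_length(output: str, max_chars: int = 2000) -> bool:
    # Structural check: enforce an upper bound on output length
    return len(output) < max_chars

def check_valid_json(output: str) -> bool:
    # The output must parse as JSON; any parse error counts as a failure
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def check_required_fields(output: dict, fields=("metric",)) -> bool:
    # Required fields must be present and non-empty
    return all(output.get(f) not in (None, "") for f in fields)

sample = '{"metric": "activation rate"}'
assert check_valid_json(sample)
assert check_required_fields(json.loads(sample))
```

Because these checks are deterministic, they can run on every code change with no model calls and no flakiness.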
Type 2 — Human evals: Best for: Subjective quality, domain-specific correctness, complex reasoning, new failure categories being discovered. Format: Annotators see (input, output) pairs and rate them: thumbs up / thumbs down, or a score on a rubric (1–5).
Pros: Highest accuracy; catches nuanced failures. Cons: Slow, expensive, can't scale; requires clear annotation guidelines. Use for: Calibration, sampling for quality assurance, training LLM-as-judge.
Type 3 — LLM-as-judge: Best for: Subjective quality at scale; failures that require reasoning to detect; when human evals are too slow. Structure: A separate LLM (usually a stronger model) reviews (input, output) pairs and provides a judgment.
Pros: Scalable; can evaluate complex quality; can explain its reasoning. Cons: Not perfectly reliable; needs calibration against human evals; can be biased.
Step 3 — Eval Design Per Failure Category
For each top failure category (from error analysis):
Name: [Failure category name]
Eval type: [Code-based / Human / LLM-as-judge]
What to test: [Specific aspect of the output being evaluated]
Test cases needed: [How many? Where do they come from?]
Pass/fail criteria: [What counts as pass? What counts as fail?]
Automation plan: [When does this eval run — on every PR? Daily? Weekly?]
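If the team tracks eval designs in code rather than a doc, the same template can be captured as a small dataclass. A sketch with illustrative names and a made-up filled-in example:

```python
from dataclasses import dataclass
from enum import Enum

class EvalType(Enum):
    CODE_BASED = "code-based"
    HUMAN = "human"
    LLM_AS_JUDGE = "llm-as-judge"

@dataclass
class EvalDesign:
    name: str                # failure category name
    eval_type: EvalType
    what_to_test: str        # specific aspect of the output evaluated
    test_cases_needed: str   # how many, and where they come from
    pass_fail_criteria: str  # what counts as pass vs. fail
    automation_plan: str     # every PR, daily, weekly, ...

# Hypothetical filled-in design for one failure category
hallucinated_metrics = EvalDesign(
    name="Hallucinated metrics",
    eval_type=EvalType.LLM_AS_JUDGE,
    what_to_test="Every metric cited in the output exists in the source data",
    test_cases_needed="30 traces from error analysis, plus 10 synthetic edge cases",
    pass_fail_criteria="PASS if all cited metrics are grounded; FAIL otherwise",
    automation_plan="Pre-deploy, on a representative sample",
)
```

Keeping designs in one structured list makes it easy to see which categories still lack coverage.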
Step 4 — LLM-as-Judge Prompt Design
If using LLM-as-judge, write the judge prompt following best practices:
You are evaluating an AI assistant's response for [failure type].

**Input to the AI assistant:**
{input}

**AI assistant's response:**
{response}

**Evaluation criteria:**
[Criterion 1]: [Clear definition of what good looks like]
[Criterion 2]: [Clear definition of what good looks like]

**Scoring:**
- PASS: The response [specific pass condition]
- FAIL: The response [specific fail condition]

**Your output:**
First, briefly explain your reasoning (1–2 sentences). Then output: PASS or FAIL
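Because the template asks for reasoning first and the verdict last, the calling code needs to extract PASS/FAIL robustly from free text. A minimal parsing sketch (the judge call itself is assumed to happen elsewhere):

```python
def parse_verdict(judge_output: str) -> str:
    """Extract the final PASS/FAIL verdict from a judge response.

    The prompt template puts reasoning before the verdict, so the
    verdict is expected to be the last PASS/FAIL token in the text.
    """
    for token in reversed(judge_output.split()):
        cleaned = token.strip(".,:;!*").upper()
        if cleaned in ("PASS", "FAIL"):
            return cleaned
    # Treat an unparseable judge response as a failure to review,
    # rather than silently counting it as a pass
    return "FAIL"

assert parse_verdict("All cited metrics are grounded. PASS") == "PASS"
assert parse_verdict("The response invents a metric. FAIL.") == "FAIL"
```

Defaulting unparseable responses to FAIL is a deliberate choice: it surfaces judge-prompt problems instead of hiding them.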
Key principles for judge prompts:
- Binary outputs (PASS/FAIL) are more reliable than numeric scores
- Include examples of PASS and FAIL in the prompt when possible
- The judge should explain its reasoning before giving the verdict (chain-of-thought)
- Calibrate the judge against 50+ human annotations before trusting it
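That calibration step can start as a simple agreement check between the judge and the human labels, with disagreements surfaced for prompt iteration. A sketch, assuming labels are aligned lists of "PASS"/"FAIL" strings:

```python
def calibration_report(human_labels: list, judge_labels: list) -> dict:
    # Overall agreement, plus judge recall on human-labeled failures,
    # since missing real failures is usually the costlier error
    pairs = list(zip(human_labels, judge_labels))
    agreement = sum(h == j for h, j in pairs) / len(pairs)
    human_fails = [(h, j) for h, j in pairs if h == "FAIL"]
    fail_recall = (
        sum(j == "FAIL" for _, j in human_fails) / len(human_fails)
        if human_fails else None
    )
    # Indices where judge and human disagree, for manual review
    disagreements = [i for i, (h, j) in enumerate(pairs) if h != j]
    return {"agreement": agreement, "fail_recall": fail_recall,
            "disagreement_indices": disagreements}

humans = ["PASS", "FAIL", "PASS", "FAIL"]
judge  = ["PASS", "FAIL", "FAIL", "PASS"]
report = calibration_report(humans, judge)
assert report["agreement"] == 0.5
assert report["disagreement_indices"] == [2, 3]
```

Reviewing the disagreement cases, not just the headline agreement rate, is what actually improves the judge prompt.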
Step 5 — The Principal Domain Expert Model
From Hamel Husain: designate one "benevolent dictator" for quality — one person whose judgment defines what PASS/FAIL means for subjective evals. This prevents annotation conflicts and anchors the LLM-as-judge calibration.
This person:
- Reviews 50–100 human annotation cases to establish the quality bar
- Resolves disagreements between annotators
- Periodically reviews LLM-as-judge outputs to catch drift
Step 6 — Eval Suite Structure
Design the full suite as three layers:
Layer 1 — Pre-commit (fast): Code-based evals only. Run on every code change. Must complete in < 60 seconds. Catches format and structural failures.
Layer 2 — Pre-deploy (medium): Code-based + LLM-as-judge on a representative sample. Run before any deployment. Should complete in < 10 minutes.
Layer 3 — Production monitoring (ongoing): LLM-as-judge on a sample of live outputs + human eval on flagged outputs. Run continuously or weekly.
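The three layers can be wired up as a single runner with per-layer time budgets. A sketch with illustrative names; each check is a callable that returns True on pass:

```python
import time
from dataclasses import dataclass

@dataclass
class EvalLayer:
    name: str
    checks: list          # callables: (case) -> bool
    time_budget_s: float  # e.g. 60 for pre-commit, 600 for pre-deploy

def run_layer(layer: EvalLayer, cases: list) -> dict:
    start = time.monotonic()
    failures = [
        (check.__name__, case)
        for case in cases
        for check in layer.checks
        if not check(case)
    ]
    elapsed = time.monotonic() - start
    return {
        "layer": layer.name,
        "failures": failures,
        "within_budget": elapsed <= layer.time_budget_s,
    }

# Hypothetical pre-commit layer with one structural check
def has_problem_section(output: str) -> bool:
    return "## Problem" in output

pre_commit = EvalLayer("pre-commit", [has_problem_section], time_budget_s=60)
result = run_layer(pre_commit, ["## Problem\n...", "no sections here"])
assert len(result["failures"]) == 1
```

Tracking `within_budget` alongside failures keeps the Layer 1 speed target (< 60 seconds) enforced, so the suite stays fast enough that the team keeps running it.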
Step 7 — Output
Produce:
- Eval suite design (one eval design per failure category)
- LLM-as-judge prompt(s) for subjective failure categories
- Three-layer eval structure with run frequency and completion time targets
- Minimum viable suite: which 3 evals to implement first to get 80% of the value?
- Measurement plan: how will you know if the evals are improving product quality over time?