# GAAI-framework skill-optimize
Run a structured evaluate-analyze-improve cycle on any GAAI skill to measure quality, detect regressions, and propose targeted improvements. Activate when a skill needs baseline evaluation, after SKILL.md modifications, or when friction-retrospective flags a skill.
```sh
# Clone the full framework repository:
git clone https://github.com/Fr-e-d/GAAI-framework

# Or install only this skill into ~/.claude/skills:
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/Fr-e-d/GAAI-framework "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/.gaai/core/skills/cross/skill-optimize" ~/.claude/skills/fr-e-d-gaai-framework-skill-optimize \
  && rm -rf "$T"
```
Source: `.gaai/core/skills/cross/skill-optimize/SKILL.md`

# Skill Optimize

## Purpose / When to Activate
Activate when:
- A skill needs a baseline quality measurement (no `ledger.yaml` exists yet)
- A `SKILL.md` has been modified and a before/after regression check is needed
- `friction-retrospective` flags a skill as a recurring friction source
- An eval cycle is needed after a manual skill update
This skill formalizes the Skill Optimize protocol referenced by `eval-run` (SKILL-CRS-025). It runs the full evaluate-analyze-improve loop with mandatory human gates at every modification step.

Scope: Any GAAI skill with measurable quality criteria — not limited to content-production skills. The inline Skill Optimize protocol in `discovery.agent.md` remains the agent's orchestration logic; this skill provides the structured execution procedure.
## Process

### Step 1 — Eval authoring
If no `evals.yaml` exists for the target skill:

- Read the target `SKILL.md` in full.
- Identify measurable quality criteria from the `Quality Checks` section.
- Author `evals.yaml` following the `evals-format.md` spec (see `eval-run/references/evals-format.md`); a sketch follows this list.
- Include a minimum of 5 assertions with a mix of `code` and `llm-judge` types.
- Store the file at `{skill-dir}/eval-corpus/evals.yaml`.
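A minimal sketch of what such a file might contain. The authoritative schema lives in `eval-run/references/evals-format.md`; the field names below are illustrative assumptions, not the canonical format:

```yaml
# Hypothetical evals.yaml sketch: field names are assumptions,
# not the canonical evals-format.md schema.
skill: skill-optimize
assertions:
  - id: A01
    type: code          # deterministic check (script, regex, etc.)
    description: "Score report includes a pass_rate field"
    check: "grep -q 'pass_rate' {output}"
  - id: A02
    type: llm-judge     # rubric-based judgment by a model
    description: "Every failed assertion is classified"
    rubric: "Each failure is assigned exactly one of the four failure classes."
  # ...at least 5 assertions total, mixing code and llm-judge types
```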
**HUMAN CHECKPOINT:** Present the drafted `evals.yaml` for validation. Do not proceed until approved. If rejected, revise based on feedback and re-present.
### Step 2 — Corpus generation
If no corpus outputs exist in `{skill-dir}/eval-corpus/`:

- Identify the skill's expected inputs from its `inputs:` frontmatter and Process section.
- Produce 2-3 representative outputs by simulating the skill's expected inputs.
- Store each output in `{skill-dir}/eval-corpus/` with naming `corpus-{N}.md`.
If corpus outputs already exist (from prior runs or real production), use those. Prefer real outputs over synthetic when available.
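Assuming the paths named above, the corpus directory after this step might look like:

```
{skill-dir}/eval-corpus/
├── evals.yaml     # assertions from Step 1
├── corpus-1.md    # representative outputs (synthetic or real)
├── corpus-2.md
└── corpus-3.md
```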
### Step 3 — Baseline evaluation
- Invoke `eval-run` (SKILL-CRS-025) with each corpus output against the `evals.yaml`.
- Compile per-output scores into `{skill-dir}/eval-corpus/score-baseline.yaml` (see the sketch after this list).
- Record the aggregate: `passed / total` and `pass_rate`.
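The exact score-report shape is set by `eval-run`; a plausible sketch, consistent with the aggregate fields above and the per-assertion tracking required under Quality Checks:

```yaml
# Illustrative score-baseline.yaml: the exact shape is defined by eval-run.
outputs:
  - corpus: corpus-1.md
    assertions: {A01: pass, A02: fail, A03: pass, A04: pass, A05: pass}
  - corpus: corpus-2.md
    assertions: {A01: pass, A02: pass, A03: pass, A04: fail, A05: pass}
aggregate:
  passed: 8
  total: 10
  pass_rate: 0.80
```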
### Step 4 — Error analysis
For each failed assertion across all corpus outputs:
- Identify the root cause in the target SKILL.md: which step, which instruction.
- Classify the failure:
  - `instruction-gap` — the skill doesn't instruct what is needed
  - `instruction-ambiguity` — the skill instructs ambiguously
  - `eval-design-error` — the assertion is flawed, not the skill
  - `model-limitation` — the model cannot reliably produce what is asked
- Produce `{skill-dir}/eval-corpus/error-analysis.md` with per-assertion findings (sketched below).
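A hypothetical per-assertion finding; only the four failure classes are prescribed, the surrounding structure is an assumption:

```markdown
## A02 (corpus-1.md)
- Classification: instruction-gap
- Root cause: SKILL.md Step 4 never asks for a classification table,
  so the output omits what the assertion checks for.
- Proposed direction: add an explicit classification instruction (see Step 5).
```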
### Step 5 — Improvement proposal
Based on the error analysis:
- Propose specific, minimal `SKILL.md` edits addressing `instruction-gap` and `instruction-ambiguity` failures (example below).
- For `eval-design-error` failures: propose `evals.yaml` corrections instead.
- For `model-limitation` failures: document as known limitations, do not propose changes.
- Present the proposal to the human.
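An illustrative shape for a minimal, targeted proposal (the format is an assumption; proposals live inline in the session, per the Outputs table):

```markdown
- Failure: A02 (instruction-gap), failing in 2/3 corpus outputs
- Proposed edit: in SKILL.md Step 4, add the bullet
  "Assign each failure exactly one classification."
- Expected effect: A02 passes; no other instructions touched.
```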
**HUMAN CHECKPOINT:** The human approves, modifies, or rejects the proposal. NEVER auto-apply SKILL.md changes.
If approved:
- Apply the edits to SKILL.md.
- Re-run Steps 3-4 as a new iteration (score file: `score-{iteration}.yaml`).
- Compare against previous iteration scores.
### Step 6 — Ledger update
After each iteration (including baseline), append an entry to `{skill-dir}/quality/ledger.yaml`:

```yaml
iterations:
  - id: {N}
    date: {ISO 8601}
    trigger: {trigger input value}
    score:
      passed: N
      total: N
      pass_rate: 0.XX
    delta_vs_previous: +/-0.XX  # null for baseline
    failed_assertions: [ANN, ...]
    action_taken: "{description of SKILL.md change, or 'baseline — no action'}"
status:
  current_pass_rate: 0.XX
  trend: improving | stable | degrading
  slo_target: 0.85
  error_budget_remaining: 0.XX
```
The ledger is append-only — iteration history is never deleted or overwritten.
For ledger format details, see `references/ledger-format.md`.
### Step 7 — Trend detection
After updating the ledger:
- If `trend: degrading` over 3+ consecutive iterations: escalate to the human with full history and a recommendation.
- If `error_budget_remaining < 0` (pass rate below SLO for 3+ iterations): flag the skill as `needs-optimization` in the ledger status. This blocks new deliveries using this skill until the human resolves it (see the example below).
- If `trend: improving` or `stable`: report status inline and complete.
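For instance, a ledger status that would trigger the blocking flag might look like this (illustrative values; the budget arithmetic and the flag field name are assumptions):

```yaml
status:
  current_pass_rate: 0.78
  trend: degrading
  slo_target: 0.85
  error_budget_remaining: -0.07   # assuming remaining = pass_rate - slo_target
  flag: needs-optimization        # field name is an assumption
```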
## Quality Checks
- Every iteration produces a score report — no silent skips
- Ledger is append-only — iteration history never deleted
- SKILL.md modifications require human approval (SkillsBench finding: self-generated skill edits = -1.3pp without human review)
- Per-assertion tracking in every score report, not just aggregate scores (prevents AP-8: aggregation hiding regressions)
- Mixed assertion types mandatory in `evals.yaml`: both `code` and `llm-judge` (prevents AP-1: self-model bias)
- Error analysis classifies every failure — unclassified failures are not allowed
- Improvement proposals are minimal and targeted — no wholesale rewrites
## Outputs
| Output | Path | Persistence |
|---|---|---|
| Eval assertions | `{skill-dir}/eval-corpus/evals.yaml` | Created once, updated on eval-design-error |
| Corpus outputs | `{skill-dir}/eval-corpus/corpus-{N}.md` | Stable across iterations |
| Score reports | `{skill-dir}/eval-corpus/score-{iteration}.yaml` | One per iteration |
| Error analysis | `{skill-dir}/eval-corpus/error-analysis.md` | Overwritten each iteration |
| Quality ledger | `{skill-dir}/quality/ledger.yaml` | Append-only, never overwritten |
| Improvement proposal | Inline in session | Not persisted |
## Non-Goals
This skill must NOT:
- Auto-modify SKILL.md without human approval (human gate is non-negotiable)
- Invoke the target skill to produce outputs (skills never chain — it evaluates existing outputs only)
- Compare quality across different skills (only within-skill across iterations)
- Set or modify SLO targets (human decision — skill only reads and reports against them)
- Generate corpus from production data without explicit human authorization
- Skip the error analysis step (every failure must be classified before proposing changes)
- Propose changes for `model-limitation` failures (these are documented, not "fixed")
For documented anti-patterns and mitigations, see `references/anti-patterns.md`.
No silent assumptions. Every evaluation result, every failure classification, every improvement proposal becomes explicit and governed.