# Agentic-context-engine / kayba-stage-5-action-plan
Triage each insight into discard/code-fix/prompt-fix and produce a prioritized action plan with specific recommendations. Trigger when the user says "run stage 5", "make action plan", "triage skills", or when invoked by the kayba-pipeline orchestrator. Requires eval outputs from stages 1-4.
```bash
git clone https://github.com/kayba-ai/agentic-context-engine
```

Or copy only this skill into your local skills directory:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/kayba-ai/agentic-context-engine "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/kayba-pipeline/stage-5-action-plan" ~/.claude/skills/kayba-ai-agentic-context-engine-kayba-stage-5-action-plan && rm -rf "$T"
```
`.claude/skills/kayba-pipeline/stage-5-action-plan/SKILL.md`

# Stage 5: Action Plan
Triage each insight and produce a concrete, prioritized action plan.
## Inputs

- Insights from Kayba: `eval/stage1_insights_summary.md`
- Domain context: `eval/stage2_domain_context.md`
- The evaluation rubric: `eval/baseline_metrics.md`
- Baseline values: `eval/baseline_metrics.json`
- Measurement code: `eval/compute_baselines.py`
Read all files before starting.
## Process
### 1. Triage each insight
For each insight/skill, answer three questions in order: Is it valid? Is it already handled? Is it a code fix or prompt fix?
#### 1a. Validity check
- Does it describe a real, recurring problem visible in traces — or noise from a one-off edge case?
- Is it actionable — can the agent actually change this behavior given its tools and context?
- If not valid → verdict: discard with a one-sentence reason.
1b. "Already handled" verification
Do not rely on memory or assumption. Run these checks and cite what you find:
- Grep the codebase for 2-3 key terms from the insight (tool names, error strings, behavioral keywords). Example: for an insight about cancellation eligibility, grep for `cancel`, `eligibility`, `criteria` (see the sketch after this list).
- Read the existing system prompt text — check `AGENT_INSTRUCTION` in the agent file and the domain policy file. Quote any existing language that addresses this behavior.
- Verdict:
- If existing text partially covers it → keep as a strengthening fix, note what's missing.
- If no existing coverage → keep.
- If existing prompt text already covers the behavior thoroughly AND the baseline metric is >= 95% → discard (cite the existing text and metric). A high baseline alone is NOT sufficient to discard — if the metric is below 95%, there are still failures to fix. An 87% baseline means 1 in 8 attempts still fails; that is worth fixing.
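The grep step above can be scripted. Below is a minimal sketch, assuming a plain-Python term search over the repo; the root path, file extensions, and search terms are illustrative, not prescribed by the pipeline.

```python
# Hypothetical helper (not part of the pipeline): scan source and prompt files
# for key terms from an insight, so the "already handled" verdict can cite
# concrete path:line evidence instead of relying on memory.
from pathlib import Path

def find_existing_coverage(root: str, terms: list[str],
                           exts: tuple[str, ...] = (".py", ".md", ".txt")) -> list[str]:
    """Return 'path:line: text' hits for lines containing any search term."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in exts:
            continue
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
            if any(term.lower() in line.lower() for term in terms):
                hits.append(f"{path}:{lineno}: {line.strip()}")
    return hits

# Example: insight about cancellation eligibility; the terms are illustrative.
for hit in find_existing_coverage(".", ["cancel", "eligibility", "criteria"]):
    print(hit)
```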
#### 1c. Code-vs-prompt decision tree
Walk through this tree for every non-discarded insight:
```
Q1: Can the agent fix this by following different instructions?
    (Does it have the right tools, correct data in tool responses,
     and sufficient context to behave correctly?)
│
├─ YES → PROMPT FIX
│        The agent has everything it needs but acts wrong.
│        A system prompt addition would fix it.
│
└─ NO → Q2: What is the agent missing?
        │
        ├─ Tool doesn't exist, schema is wrong, API returns
        │  incomplete data, infrastructure drops information,
        │  timeout/error not surfaced to agent
        │  → CODE FIX
        │    Name the file, function, and specific change.
        │
        └─ The agent has partial information but the prompt can't
           fully compensate (e.g., needs a new tool but a heuristic
           prompt workaround exists)
           → PROMPT FIX (primary) + CODE FIX (optional)
             Note both. Mark the code fix as "optional" with a
             one-sentence justification for why it's lower priority.
```
Ambiguity default: When genuinely uncertain, default to prompt fix and add a note:
"Classification uncertain — defaulting to prompt fix. Revisit if prompt change doesn't move metrics." This is safer because prompt fixes are cheaper to test and revert, and Stage 7 handles prompt fixes and code fixes through different paths.
Use the reflector's reasoning from Stage 1 insights — it often explicitly identifies root causes that clarify the code-vs-prompt distinction.
### 2. Consolidate related insights
Before writing recommendations, merge insights that are redundant. Two insights should merge when ALL three conditions hold:
- Same target behavior — they describe the agent doing (or failing to do) the same thing.
- Overlapping fix text — the prompt instructions you'd write for each would share >50% of their content.
- Addressing one substantially addresses the other — fixing insight A would fix >80% of the cases described by insight B.
When NOT to merge — two insights about the same tool or domain area but different failure modes should remain separate. Example: "agent doesn't check cancellation eligibility" and "agent doesn't execute cancellation after user confirms" both involve `cancel_reservation` but are completely different behavioral failures with different prompt fixes. Keep them separate.
For each merge, document:
- Which insight IDs are combined
- Which insight's framing is primary (use the one with stronger trace evidence)
- What, if anything, is lost from the secondary insight (add it as a sub-point)
### 3. Write specific recommendations
For each insight (after merging):
- Discards: one sentence on why it's not valid or actionable.
- Code fixes: what code/schema/infrastructure to change. Name the file, the function, the specific change. If Stage 7 needs to find the right code location, give it enough to grep for.
- Prompt fixes: the exact instruction text to add to the system prompt, where it should go (e.g., appended to `AGENT_INSTRUCTION`, added to domain policy, or as a standalone skill block), and why this wording over alternatives.
### 4. Assess risk per fix
For each non-discarded fix, assess whether the change could break currently-working behaviors:
| Risk | Definition | Example |
|---|---|---|
| None | Change is additive; no existing behavior could be affected | Adding a new metric to compute_baselines.py |
| Low | Change targets a behavior that is currently failing; working cases are unrelated | Adding a cancellation checklist when current cancellation compliance is 0% |
| Medium | Change modifies a behavior where some cases already work correctly | Strengthening confirmation protocol when 28.6% already succeed — could the new wording break the working 28.6%? |
| High | Change rewrites or constrains a behavior that mostly works | Restricting tool-call patterns when 41.4% already comply — overly rigid wording could cause the agent to under-call tools |
For Medium and High risk fixes, add a one-sentence mitigation: what to watch for, or how to word the prompt to preserve working cases.
### 5. Handle qualitative-only insights — STILL PRODUCE FIXES
Some insights from Stage 3 may be flagged as "unmeasurable." These still get fixes. An insight that the agent fabricates data or violates policy is a real problem whether or not we can measure it programmatically. Treat them the same as any other insight:
- Run the same triage (validity → already-handled → code-vs-prompt) as every other insight.
- Include them in the priority-ranked implementation list alongside all other fixes. They are NOT second-class.
- Use the trace evidence from the insight (not the metric) to assess impact and priority. If the insight has strong trace evidence showing clear failures, rank it accordingly.
- For prioritization: since there is no metric denominator, use confidence = 0.5 and estimate impact from the severity described in the insight evidence.
- In the fix entry, note that this fix has no programmatic metric for automated before/after comparison, so improvement should be verified via manual trace review or LLM-as-judge after generating new traces.
Only relegate an insight to a non-actionable "Monitor Items" section if the triage concludes it should be discarded (not valid or not actionable). Being unmeasurable is NOT a reason to skip fixing it.
### 6. Link to metrics
For each non-discarded fix, identify which metric(s) from the rubric would move if this fix is implemented. Use the metric IDs from `eval/baseline_metrics.md` (e.g., M1, M2).
### 7. Prioritize
Rank non-discarded fixes using this formula:
Priority Score = Impact × Confidence × Tier Bonus ÷ Risk Factor
Where:
- Impact = estimated metric delta. Use the gap between baseline and 100% as the ceiling. A fix expected to close 50% of that gap on M1 (baseline 41.4%) has impact = 0.5 × (1.0 - 0.414) = 0.293.
- Confidence = sample size reliability. Use the denominator from `baseline_metrics.json`:
  - denominator >= 20: confidence = 1.0
  - denominator 10-19: confidence = 0.8
  - denominator 5-9: confidence = 0.6
  - denominator < 5: confidence = 0.3
- Tier Bonus = leading metrics get a 1.5x multiplier (they validate adoption), lagging and quality get 1.0x. Rationale: leading metrics move first and tell you if your fix is even being adopted — you want those signals early.
- Risk Factor = None: 1.0, Low: 1.0, Medium: 1.5, High: 2.0
You do not need to compute exact scores to three decimal places. The formula is a tiebreaker and sanity check. The point is:
- High-impact, high-confidence, leading-metric fixes with low risk go first.
- Low-confidence fixes (small denominators) get deprioritized even if the metric is at 0%.
- High-risk fixes get deprioritized unless impact is overwhelming.
After scoring, apply one manual adjustment pass: if a fix is a prerequisite for another fix (e.g., "confirmation protocol" must exist before "post-confirmation execution" can be measured), promote the prerequisite even if its standalone score is lower.
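As a sanity check on the ranking, the formula can be computed mechanically. Below is a minimal sketch of the arithmetic; the function names, the example fix (M1 at 41.4% baseline, closing half the gap), and its denominator of 29 are illustrative assumptions, not pipeline code.

```python
# Hypothetical scoring helper mirroring the formula above; names and example values are illustrative.
RISK_FACTOR = {"None": 1.0, "Low": 1.0, "Medium": 1.5, "High": 2.0}

def confidence_from_denominator(n):
    """Map metric sample size to a reliability weight; qualitative-only fixes pass n=None."""
    if n is None:
        return 0.5          # no programmatic metric (Step 5): fixed confidence of 0.5
    if n >= 20:
        return 1.0
    if n >= 10:
        return 0.8
    if n >= 5:
        return 0.6
    return 0.3

def priority_score(baseline, gap_closed, denominator, tier, risk):
    """Priority Score = Impact x Confidence x Tier Bonus / Risk Factor."""
    impact = gap_closed * (1.0 - baseline)        # e.g. 0.5 * (1.0 - 0.414) = 0.293
    tier_bonus = 1.5 if tier == "leading" else 1.0
    return impact * confidence_from_denominator(denominator) * tier_bonus / RISK_FACTOR[risk]

# Illustrative example: a Low-risk prompt fix on leading metric M1 (baseline 41.4%),
# expected to close half the gap, measured over a hypothetical denominator of 29 traces.
print(f"{priority_score(0.414, 0.5, 29, 'leading', 'Low'):.2f}")
```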
## Output format
Write to `eval/action_plan.md`:
```markdown
# Action Plan

## Summary
- Total insights: N
- Discarded: X (with reasons)
- Code fixes: Y
- Prompt fixes: Z
- Fixes without programmatic metric (verify manually): Q

## Implementation Priority

| Rank | Fix | Type | Metrics | Risk | Score rationale |
|------|-----|------|---------|------|-----------------|
| 1 | [name] | prompt | M1, M2 | Low | [one-line: why this ranks here] |
| 2 | ... | ... | ... | ... | ... |

---

## Skill: [insight ID(s)] — [title]

**Summary:** [one-line description of what the skill addresses]
**Verdict:** `prompt fix` | `code fix` | `discard`
**Classification path:** [which branch of the decision tree — e.g., "Agent has tools and data but acts wrong → prompt fix"]
**Rationale:** [why this verdict — reference specific trace evidence from insights]
**Risk:** None | Low | Medium | High — [one-sentence justification]
**Risk mitigation:** [for Medium/High only — what to watch for or how to preserve working cases]
**Recommendation:** [specific change to make]
**Files to modify:** [list of files, for code fixes]
**Metric link:** [which metrics would move, with baseline values]
**Already-handled check:** [what you grepped, what existing prompt text you found, verdict]

---

[repeat for each insight]

## Consolidated Prompt Skills

[After all per-insight entries, list the final merged prompt skill texts in priority order, ready for Stage 7 to implement]

## Monitor Items (Non-Actionable Only)

[Only insights that were triaged as genuinely non-actionable — e.g., the agent cannot change this behavior, or the insight is noise. Unmeasurable insights that are still real problems should appear in the priority list above, NOT here.]
```
Group related insights under cluster headings when they address the same underlying behavior. For merged insights, list all constituent insight IDs in the heading.
## Outputs

- `eval/action_plan.md`