Claude-Skills self-improving-agent

install
source · Clone the upstream repo
git clone https://github.com/borghei/Claude-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/borghei/Claude-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/engineering/self-improving-agent" ~/.claude/skills/borghei-claude-skills-self-improving-agent && rm -rf "$T"
manifest: engineering/self-improving-agent/SKILL.md

Self-Improving Agent - Autonomous Learning Patterns

Tier: POWERFUL
Category: Engineering
Tags: self-improvement, AI agents, feedback loops, auto-memory, meta-learning, performance tracking

Overview

Self-Improving Agent provides architectural patterns for AI agents that get better with use. Most agents are stateless -- they make the same mistakes repeatedly because they lack mechanisms to learn from their own execution. This skill addresses that gap with concrete patterns for feedback capture, memory curation, skill extraction, and regression detection.

The key insight: auto-memory captures everything, but curation is what turns noise into knowledge.

Sub-Skills

This skill uses a compound sub-skill architecture. Each sub-skill in skills/ handles a specific step of the improvement loop:

| Sub-Skill | File | Purpose |
|---|---|---|
| Remember | skills/remember.md | Capture errors and learnings from current session |
| Extract | skills/extract.md | Extract reusable patterns from completed work |
| Promote | skills/promote.md | Graduate proven patterns to permanent rules |
| Review | skills/review.md | Audit memory health, prune stale entries |
| Status | skills/status.md | Dashboard showing memory state and learning progress |

Sub-Skill Flow

Remember ──> Extract ──> Promote ──> Review
    ^                                  │
    └──────────── Status ◄─────────────┘

The improvement cycle: Remember captures events during work, Extract identifies patterns across sessions, Promote graduates proven patterns to rules, Review maintains memory health, and Status provides visibility into the entire system.

Scripts

| Script | Purpose |
|---|---|
| scripts/pattern_extractor.py | Extract reusable patterns from session logs |
| scripts/memory_health_checker.py | Audit memory for stale, duplicate, and promotable entries |
| scripts/rule_promoter.py | Validate and apply promotions from memory to rules |
| scripts/feedback_analyzer.py | Analyze feedback logs for success rates and opportunities |
| scripts/regression_detector.py | Compare baseline vs current performance metrics |
| scripts/rule_manager.py | Manage a learned rules knowledge base with CRUD |

Core Architecture

The Improvement Loop

┌──────────────────────────────────────────────────────────┐
│                   SELF-IMPROVEMENT CYCLE                  │
│                                                          │
│  ┌─────────┐    ┌──────────┐    ┌─────────────┐        │
│  │ Execute  │───▶│ Evaluate │───▶│ Extract     │        │
│  │ Task     │    │ Outcome  │    │ Learnings   │        │
│  └─────────┘    └──────────┘    └─────────────┘        │
│       ▲                               │                  │
│       │                               ▼                  │
│  ┌─────────┐    ┌──────────┐    ┌─────────────┐        │
│  │ Apply   │◀───│ Promote  │◀───│ Validate    │        │
│  │ Rules   │    │ to Rules │    │ Learnings   │        │
│  └─────────┘    └──────────┘    └─────────────┘        │
│                                                          │
└──────────────────────────────────────────────────────────┘

Improvement Maturity Levels

| Level | Name | Mechanism | Example |
|---|---|---|---|
| 0 | Stateless | No memory between sessions | Default agent behavior |
| 1 | Recording | Captures observations, no action | Auto-memory logging |
| 2 | Curating | Organizes and deduplicates observations | Memory review + cleanup |
| 3 | Promoting | Graduates patterns to enforced rules | MEMORY.md entries become CLAUDE.md rules |
| 4 | Extracting | Creates reusable skills from proven patterns | Recurring solutions become skill packages |
| 5 | Meta-Learning | Adapts learning strategy itself | Adjusts what to capture based on what proved useful |

Most agents operate at Level 0-1. This skill provides the machinery for Levels 2-5.

Core Capabilities

1. Memory Curation System

The Memory Stack

┌─────────────────────────────────────────────────┐
│  CLAUDE.md / .claude/rules/                      │
│  Highest authority. Enforced every session.       │
│  Capacity: Unlimited. Load: Full file.           │
├─────────────────────────────────────────────────┤
│  MEMORY.md (auto-memory)                         │
│  Project learnings. Auto-captured by Claude.     │
│  Capacity: First 200 lines loaded. Overflow to   │
│  topic files.                                    │
├─────────────────────────────────────────────────┤
│  Session Context                                  │
│  Current conversation. Ephemeral.                │
│  Capacity: Context window.                       │
└─────────────────────────────────────────────────┘

Memory Review Protocol

Run periodically (weekly or after every 10 sessions):

Step 1: Read MEMORY.md and all topic files
Step 2: Classify each entry

  Categories:
  - PROMOTE: Pattern proven 3+ times, should be a rule
  - CONSOLIDATE: Multiple entries saying the same thing
  - STALE: References deleted files, old patterns, resolved issues
  - KEEP: Still relevant, not yet proven enough to promote
  - EXTRACT: Recurring solution that should be a reusable skill

Step 3: Execute actions
  - PROMOTE entries → move to CLAUDE.md or .claude/rules/
  - CONSOLIDATE entries → merge into single clear entry
  - STALE entries → delete
  - EXTRACT entries → create skill package (see Skill Extraction)

Step 4: Verify MEMORY.md is under 200 lines
  - If over 200: move topic-specific entries to topic files
  - Topic files: ~/.claude/projects/<path>/memory/<topic>.md
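
Step 2's classification can be sketched as a small function. This is a minimal illustration, not one of this skill's bundled scripts: the entry shape (`text`, `recurrence`, `paths`) and the helper name are assumptions, and the EXTRACT category is omitted for brevity.

```python
import os

# Classification actions from Step 2 of the review protocol.
PROMOTE, CONSOLIDATE, STALE, KEEP = "PROMOTE", "CONSOLIDATE", "STALE", "KEEP"

def classify_entry(entry, seen_texts, project_root="."):
    """Classify one memory entry. `entry` is a hypothetical dict:
    {"text": str, "recurrence": int, "paths": [referenced file paths]}."""
    # STALE: references files that no longer exist
    if any(not os.path.exists(os.path.join(project_root, p))
           for p in entry.get("paths", [])):
        return STALE
    # CONSOLIDATE: duplicate of an entry already seen in this pass
    if entry["text"] in seen_texts:
        return CONSOLIDATE
    seen_texts.add(entry["text"])
    # PROMOTE: pattern proven 3+ times
    if entry.get("recurrence", 1) >= 3:
        return PROMOTE
    return KEEP
```

In practice the `seen_texts` set would be keyed on a normalized form of each entry, so that near-duplicates consolidate too.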

Promotion Criteria

An entry is ready for promotion when:

| Criterion | Threshold | Why |
|---|---|---|
| Recurrence | Seen in 3+ sessions | Not a one-off |
| Consistency | Same solution every time | Not context-dependent |
| Impact | Prevented errors or saved significant time | Worth enforcing |
| Stability | Underlying code/system unchanged | Won't immediately become stale |
| Clarity | Can be stated in 1-2 sentences | Rules must be unambiguous |
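
A promotion gate covering the five criteria might look like the sketch below. Every field name is hypothetical, and the clarity check is a crude sentence-count heuristic standing in for human judgment.

```python
def ready_for_promotion(entry):
    """Check the five promotion criteria against one memory entry.
    `entry` is a hypothetical dict; all field names are illustrative."""
    return (
        entry["sessions_seen"] >= 3                               # Recurrence
        and entry["solution_variants"] == 1                       # Consistency
        and entry["impact"] in ("error_prevented", "time_saved")  # Impact
        and not entry["underlying_code_changed"]                  # Stability
        and len(entry["statement"].split(".")) <= 3               # Clarity: ~1-2 sentences
    )
```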

Promotion Targets

| Pattern Type | Promote To | Example |
|---|---|---|
| Coding convention | .claude/rules/&lt;area&gt;.md | "Always use `type`, not `interface`, for object shapes" |
| Project architecture | CLAUDE.md | "All API routes go through middleware chain" |
| Tool preference | CLAUDE.md | "Use pnpm, not npm" |
| Debugging pattern | .claude/rules/debugging.md | "When tests fail, check env vars first" |
| File-scoped rule | .claude/rules/&lt;scope&gt;.md with `paths:` | "In migrations/, always add down migration" |

2. Feedback Loop Design

Outcome Classification

Every agent task produces an outcome. Classify it:

SUCCESS         - Task completed, user accepted result
PARTIAL         - Task completed but required corrections
FAILURE         - Task failed, user had to redo
REJECTION       - User explicitly rejected approach
TIMEOUT         - Task exceeded time/token budget
ERROR           - Technical error (tool failure, API error)
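
One way to encode this taxonomy, assuming the raw signals (`completed`, `user_corrections`, and so on) are available from session logs; the signal names are illustrative, and the precedence order is an assumption:

```python
from enum import Enum

class Outcome(Enum):
    """Outcome taxonomy from the classification above."""
    SUCCESS = "success"
    PARTIAL = "partial"
    FAILURE = "failure"
    REJECTION = "rejection"
    TIMEOUT = "timeout"
    ERROR = "error"

def classify_outcome(completed, user_corrections, user_rejected,
                     budget_exceeded, tool_error):
    """Map raw session signals to an Outcome, most specific first."""
    if tool_error:
        return Outcome.ERROR
    if budget_exceeded:
        return Outcome.TIMEOUT
    if user_rejected:
        return Outcome.REJECTION
    if not completed:
        return Outcome.FAILURE
    return Outcome.PARTIAL if user_corrections > 0 else Outcome.SUCCESS
```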

Signal Extraction from Outcomes

| Outcome | Signal | Memory Action |
|---|---|---|
| SUCCESS (first try) | Approach works well | Reinforce (increment confidence) |
| SUCCESS (after correction) | Initial approach had gap | Log the correction pattern |
| PARTIAL (user edited result) | Output format or content gap | Log what user changed |
| FAILURE | Approach fundamentally wrong | Log anti-pattern with context |
| REJECTION | Misunderstood requirements | Log clarification pattern |
| Repeated ERROR | Tool or environment issue | Log workaround or fix |

Feedback Capture Template

## Learning: [Short description]

**Context:** [What task was being performed]
**What happened:** [Outcome description]
**Root cause:** [Why the outcome occurred]
**Correct approach:** [What should have been done]
**Confidence:** [High/Medium/Low]
**Recurrence:** [First time / Seen N times]
**Action:** [KEEP / PROMOTE / EXTRACT]

3. Performance Regression Detection

Metrics to Track

| Metric | Measurement | Regression Signal |
|---|---|---|
| First-attempt success rate | Tasks accepted without correction | Dropping below 70% |
| Correction count per task | User edits after agent output | Rising above 2 per task |
| Tool error rate | Failed tool calls / total calls | Rising above 5% |
| Context relevance | Retrieved context actually used | Dropping below 60% |
| Task completion time | Turns to complete task | Rising trend over 5 sessions |
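
The point thresholds above can be checked mechanically; the trend-based completion-time signal needs session history and is omitted from this sketch. The metric keys and the `THRESHOLDS` mapping are illustrative.

```python
# Point thresholds from the metrics table (ratios expressed as 0-1).
THRESHOLDS = {
    "first_attempt_success": ("min", 0.70),
    "corrections_per_task":  ("max", 2.0),
    "tool_error_rate":       ("max", 0.05),
    "context_relevance":     ("min", 0.60),
}

def detect_regressions(current):
    """Return the metrics whose current value crosses its threshold.
    `current` maps metric name -> measured value; missing metrics are skipped."""
    flagged = []
    for name, (kind, limit) in THRESHOLDS.items():
        value = current.get(name)
        if value is None:
            continue
        if (kind == "min" and value < limit) or (kind == "max" and value > limit):
            flagged.append(name)
    return flagged
```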

Regression Response Protocol

1. DETECT: Metric crosses threshold
2. DIAGNOSE: Compare recent sessions vs baseline
   - What changed? (New code? New patterns? New tools?)
   - Which task types are affected?
   - Is it a memory issue or a capability issue?
3. RESPOND:
   - Memory issue → Review and curate MEMORY.md
   - Stale rules → Update CLAUDE.md
   - New code patterns → Add rules for new patterns
   - Capability gap → Extract as skill request
4. VERIFY: Track metric for next 3 sessions

4. Skill Extraction

When a solution pattern is proven and reusable, extract it into a standalone skill.

Extraction Criteria

A pattern is ready for extraction when:
- Used successfully 5+ times across different contexts
- Solution is generalizable (not project-specific)
- Takes more than trivial effort to recreate from scratch
- Would benefit other projects/users

Extraction Process

Step 1: Document the pattern
  - What problem does it solve?
  - What's the step-by-step approach?
  - What are the inputs and outputs?
  - What are the edge cases?

Step 2: Generalize
  - Remove project-specific details
  - Identify configurable parameters
  - Add handling for common variations

Step 3: Package as skill
  - Create SKILL.md with frontmatter
  - Add references/ for knowledge bases
  - Add scripts/ if automatable
  - Add assets/ for templates

Step 4: Validate
  - Test on a different project
  - Have another person/agent use it
  - Iterate on unclear instructions

5. Meta-Learning Patterns

Adaptive Capture Strategy

Not all observations are equally valuable. Adjust what gets captured based on what proved useful:

Initial strategy: Capture everything
After 10 sessions: Analyze which captured items led to promotions
After 20 sessions: Adjust capture to focus on high-value categories

High-value categories (typically):
  - Error resolutions (80% promotion rate)
  - User corrections (70% promotion rate)
  - Tool preferences (60% promotion rate)

Low-value categories (typically):
  - File structure observations (10% promotion rate)
  - One-off workarounds (5% promotion rate)
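
The promotion-rate analysis described above can be sketched as follows; the `(category, was_promoted)` record shape and the 0.5 cutoff are assumptions, not part of the skill's scripts.

```python
from collections import defaultdict

def promotion_rate_by_category(captures):
    """Compute per-category promotion rates.
    `captures` is a hypothetical list of (category, was_promoted) pairs."""
    totals, promoted = defaultdict(int), defaultdict(int)
    for category, was_promoted in captures:
        totals[category] += 1
        if was_promoted:
            promoted[category] += 1
    return {c: promoted[c] / totals[c] for c in totals}

def high_value_categories(captures, min_rate=0.5):
    """Categories worth continuing to capture (promotion rate >= min_rate)."""
    rates = promotion_rate_by_category(captures)
    return sorted(c for c, r in rates.items() if r >= min_rate)
```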

Anti-Pattern Detection

Beyond capturing what works, actively detect what fails:

| Anti-Pattern | Detection Signal | Response |
|---|---|---|
| Repeated wrong import path | Same correction 3+ times | Add to CLAUDE.md as rule |
| Wrong test framework used | User always changes test approach | Add testing rules |
| Incorrect API usage | Same API error pattern | Add API usage notes |
| Style guide violations | User reformats same patterns | Add style rules |
| Wrong branch workflow | User corrects git operations | Add git workflow rules |

6. Continuous Calibration

Confidence Scoring

Every piece of learned knowledge carries a confidence score:

Confidence = base_score * recency_factor * consistency_factor

base_score:
  - User explicitly stated: 1.0
  - Observed from successful outcome: 0.8
  - Inferred from pattern: 0.6
  - Guessed from context: 0.3

recency_factor:
  - Last 7 days: 1.0
  - 7-30 days: 0.9
  - 30-90 days: 0.7
  - 90+ days: 0.5

consistency_factor:
  - Never contradicted: 1.0
  - Contradicted once, reaffirmed: 0.9
  - Contradicted, not reaffirmed: 0.5
  - Actively contradicted: 0.0 (delete)
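
The scoring formula translates directly into code. The source labels (`user_stated`, `observed_success`, and so on) are paraphrased keys for the four base-score cases above, not an official schema.

```python
BASE_SCORE = {
    "user_stated": 1.0,       # user explicitly stated
    "observed_success": 0.8,  # observed from successful outcome
    "inferred": 0.6,          # inferred from pattern
    "guessed": 0.3,           # guessed from context
}

def recency_factor(days_since_confirmed):
    if days_since_confirmed <= 7:
        return 1.0
    if days_since_confirmed <= 30:
        return 0.9
    if days_since_confirmed <= 90:
        return 0.7
    return 0.5

def consistency_factor(contradicted, reaffirmed, actively_contradicted=False):
    if actively_contradicted:
        return 0.0  # caller should delete the entry
    if not contradicted:
        return 1.0
    return 0.9 if reaffirmed else 0.5

def confidence(source, days_since_confirmed, contradicted=False,
               reaffirmed=False, actively_contradicted=False):
    """Confidence = base_score * recency_factor * consistency_factor."""
    return (BASE_SCORE[source]
            * recency_factor(days_since_confirmed)
            * consistency_factor(contradicted, reaffirmed, actively_contradicted))
```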

Belief Revision

When new information contradicts existing knowledge:

1. Compare confidence scores
2. If new info higher confidence → update knowledge
3. If roughly equal → flag for user confirmation
4. If new info lower confidence → keep existing, note conflict
5. Always log the conflict for review
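
A sketch of the revision decision, assuming each knowledge entry carries a `confidence` score as defined above; the `margin` defining "roughly equal" is an assumed parameter, and the conflict log itself is left to the caller.

```python
def revise(existing, new, margin=0.1):
    """Belief revision per the protocol above.
    `existing` and `new` are hypothetical dicts with a `confidence` field.
    Returns (kept_entry, action); the conflict should always be logged."""
    diff = new["confidence"] - existing["confidence"]
    if diff > margin:
        return new, "updated"                    # new info wins
    if diff < -margin:
        return existing, "kept_noted_conflict"   # existing wins, conflict noted
    return existing, "flag_for_user"             # roughly equal: ask the user
```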

Workflows

Workflow 1: Weekly Memory Health Check

1. Read all memory files (MEMORY.md + topic files)
2. Count total entries and lines
3. For each entry, classify: PROMOTE / CONSOLIDATE / STALE / KEEP / EXTRACT
4. Execute promotions (with user confirmation)
5. Execute consolidations
6. Delete stale entries
7. Verify under 200-line limit
8. Report: entries promoted, consolidated, deleted, remaining

Workflow 2: Post-Session Learning Capture

1. Review session outcomes (successes, corrections, failures)
2. For each correction: log what was wrong and what was right
3. For each failure: log root cause and correct approach
4. Check existing memory for related entries
5. If related entry exists: increment recurrence count
6. If new: add entry with context
7. If recurrence threshold met: flag for promotion
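
Steps 4-7 amount to an upsert against a memory index. A minimal sketch, where the dict-based store, field names, and normalized `key` are all illustrative:

```python
def capture_learning(memory, key, context):
    """Dedupe against existing entries, count recurrence, flag for promotion.
    `memory` is a hypothetical dict keyed by a normalized description."""
    entry = memory.get(key)
    if entry is None:
        # Step 6: new learning, add with context
        memory[key] = {"recurrence": 1, "contexts": [context], "flag_promote": False}
    else:
        # Step 5: related entry exists, increment recurrence
        entry["recurrence"] += 1
        entry["contexts"].append(context)
        # Step 7: promotion threshold (3+, per this skill's criteria)
        if entry["recurrence"] >= 3:
            entry["flag_promote"] = True
    return memory[key]
```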

Workflow 3: Regression Investigation

1. Identify the degraded metric
2. Pull last 5 sessions' outcomes for that task type
3. Compare against baseline (first 5 sessions)
4. Identify what changed: memory, code, rules, environment
5. Propose fix: update rule, add rule, retrain pattern
6. Apply fix
7. Monitor next 3 sessions

Common Pitfalls

| Pitfall | Why It Happens | Fix |
|---|---|---|
| Memory bloat | Auto-capture without curation | Weekly review, enforce 200-line limit |
| Stale rules | Code changes, rules don't update | Timestamp rules, periodic re-verification |
| Over-promotion | Promoting one-off patterns as rules | Require 3+ recurrences before promotion |
| Silent regression | No metrics tracking | Implement outcome classification |
| Cargo cult rules | Copying rules without understanding | Each rule must have a "why" annotation |
| Contradiction spirals | New rules conflict with old rules | Belief revision protocol |

References

  • references/feedback-loop-patterns.md
    - Detailed feedback capture and analysis patterns
  • references/memory-curation-guide.md
    - Step-by-step memory review and promotion procedures
  • references/meta-learning-architectures.md
    - Advanced patterns for agents that learn how to learn

Troubleshooting

| Problem | Cause | Solution |
|---|---|---|
| MEMORY.md exceeds 200 lines and keeps growing | Auto-capture enabled without scheduled curation | Run the Weekly Memory Health Check workflow; split topic-specific entries into memory/&lt;topic&gt;.md files |
| Promoted rules contradict each other | Two conflicting patterns both crossed the 3-recurrence threshold | Apply the Belief Revision protocol -- compare confidence scores, resolve the conflict, delete the weaker rule |
| Agent performance degrades after a promotion batch | Newly promoted rules interact badly or are overly prescriptive | Roll back the most recent promotions, re-validate each rule in isolation, and promote incrementally |
| Skill extraction produces a package that only works on the original project | Generalization step was skipped or rushed | Revisit Extraction Process Step 2 -- strip project-specific details, parameterize hardcoded values, test on a second project before packaging |
| Feedback loop captures noise (trivial observations dominate) | Capture strategy has not been calibrated with the Adaptive Capture Strategy | After 10 sessions, analyze promotion rates by category and restrict capture to high-value categories (error resolutions, user corrections, tool preferences) |
| Regression detection flags false positives | Thresholds set too aggressively for early-stage projects | Widen thresholds during the first 20 sessions (e.g., first-attempt success 60% instead of 70%), then tighten once a stable baseline exists |
| Confidence scores decay too fast on valid long-term rules | Recency factor penalizes rules that are infrequently encountered but still correct | For rules explicitly confirmed by the user, override the recency factor to 1.0 regardless of age |

Success Criteria

  • First-attempt success rate above 80% after 20 sessions of active self-improvement, measured as tasks accepted without user correction.
  • Memory size stays under 200 lines in MEMORY.md at all times, with overflow correctly routed to topic files.
  • Promotion rate of 15-25% of captured observations within 30 days, indicating the capture strategy targets high-value signals.
  • Zero stale rules remaining after each Weekly Memory Health Check -- every rule references current code, tools, and workflows.
  • Regression detection latency under 3 sessions -- performance degradation is flagged within 3 sessions of onset, not discovered weeks later.
  • Extracted skills reusable across 2+ projects without modification, validating that the generalization step produces genuinely portable packages.
  • Contradiction resolution within 1 session -- conflicting rules are detected and resolved via the Belief Revision protocol before they cause downstream errors.

Scope & Limitations

This skill covers:

  • Architectural patterns for building agents that learn from execution history and user feedback.
  • Memory lifecycle management: capture, curation, promotion, and retirement of learned knowledge.
  • Performance regression detection frameworks and response protocols for agent systems.
  • Skill extraction methodology for graduating proven patterns into reusable, standalone packages.

This skill does NOT cover:

  • Runtime agent orchestration or multi-agent coordination -- see agent-workflow-designer and agent-protocol for those patterns.
  • Prompt engineering, testing, or versioning of the prompts themselves -- see prompt-engineer-toolkit for prompt lifecycle management.
  • Infrastructure-level observability (logging, tracing, alerting dashboards) -- see observability-designer for system-level monitoring.
  • Initial agent architecture design, tool selection, or capability planning -- see agent-designer for foundational agent design decisions.

Integration Points

| Skill | Integration | Data Flow |
|---|---|---|
| context-engine | Context Engine controls what the agent sees per session; Self-Improving Agent decides what is worth remembering long-term | Promoted rules and curated memory flow into context retrieval; context relevance metrics flow back for regression tracking |
| agent-designer | Agent Designer defines the agent's architecture and capabilities; Self-Improving Agent layers learning infrastructure on top | Architecture constraints inform what feedback loops are possible; extracted skills feed back as new agent capabilities |
| prompt-engineer-toolkit | Prompts degrade over time as codebases evolve; Self-Improving Agent detects prompt regression via outcome tracking | Performance metrics flag underperforming prompts; prompt updates feed back as rule changes in CLAUDE.md |
| observability-designer | Observability provides system-level metrics; Self-Improving Agent provides agent-behavior-level metrics | System telemetry enriches regression diagnosis; agent performance metrics can be exported to observability dashboards |
| tech-debt-tracker | Stale rules and bloated memory are a form of technical debt; Tech Debt Tracker can surface them alongside code debt | Memory health metrics feed into debt scoring; debt prioritization informs which stale rules to retire first |
| agent-workflow-designer | Multi-step agent workflows benefit from per-step feedback capture and cross-workflow pattern extraction | Per-step outcome data flows into feedback loops; extracted workflow optimizations flow back as updated workflow definitions |