AutoResearchClaw a-evolve

install
source · Clone the upstream repo
git clone https://github.com/aiming-lab/AutoResearchClaw
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aiming-lab/AutoResearchClaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/a-evolve" ~/.claude/skills/aiming-lab-autoresearchclaw-a-evolve && rm -rf "$T"
manifest: .claude/skills/a-evolve/SKILL.md
source content

A-Evolve: Agentic Evolution Skill

Apply the Solve → Observe → Evolve → Gate → Reload methodology from A-Evolve to iteratively improve agent performance. This skill is prompt-based — no external dependencies, no harness changes. You analyze failures, propose workspace mutations, and generate durable artifacts (skills, prompt patches, knowledge entries) that the agent can load in future runs.

Core Loop

When asked to evolve or improve agent performance, follow this 5-step loop:

1. Solve (Collect Evidence)

Gather the agent's execution artifacts. Ask the user for or locate:

  • Run logs, error traces, or experiment outputs
  • Pass/fail results per task
  • Metric values (accuracy, reward, success rate)
  • Any existing session files from previous runs

If inside AutoResearchClaw, look at:

  • artifacts/rc-*/
    — experiment outputs, charts, reviews
  • evolve.log
    or stage-specific logs
  • reviews.md
    — peer review feedback
  • Sentinel watchdog reports

2. Observe (Diagnose)

Analyze the collected evidence to produce structured observations:

For each failed or underperforming task, identify:

  • Error category: code bug, timeout, wrong approach, missing knowledge, API misuse, hallucinated reference, prompt ambiguity, etc.
  • Root cause: What specifically went wrong and why
  • Frequency: Is this a one-off or a recurring pattern across tasks?
  • Severity: blocking (pipeline crash) / degrading (wrong result) / cosmetic (formatting issue)

Write observations as a structured list:

## Observations (Batch N)

### OBS-1: [Category] Short description
- Tasks affected: task_001, task_005, task_012
- Root cause: ...
- Frequency: 3/50 tasks (6%)
- Severity: degrading

### OBS-2: ...

3. Evolve (Propose Mutations)

Based on observations, propose one or more of these mutation types:

A. Generate a Skill (for recurring patterns, frequency ≥ 3)

Write a new

SKILL.md
file that teaches the agent how to handle this pattern. A good evolved skill:

  • Targets a specific failure category, not generic advice
  • Contains concrete steps the agent should follow
  • Includes a "when to apply" trigger condition
  • Is short (under 100 lines) and self-contained

Example — if the agent keeps failing at API pagination:

---
name: api-pagination-handler
description: >
  Handle paginated API responses correctly. Use when making API calls
  that may return partial results, or when results seem truncated.
---

When calling any API that supports pagination:

1. Check response for pagination indicators: `next_page`, `offset`,
   `has_more`, `cursor`, or truncated result counts.
2. If paginated, loop until all pages are collected.
3. Concatenate results before processing.
4. Set a max-page safety limit (default: 20) to prevent infinite loops.
5. Log total items collected vs expected count if available.

B. Patch the System Prompt (for prompt ambiguity or missing guidance)

Write a short addendum to the system prompt that addresses the gap. Keep patches minimal — one paragraph per issue. Format:

## Prompt Patch: [Issue]
Append to system prompt:
> When [specific situation], always [specific action] because [reason].

C. Add a Knowledge Entry (for factual gaps or learned heuristics)

Record a reusable insight as a knowledge entry:

{
  "id": "know-001",
  "category": "experiment_design",
  "insight": "Synthetic benchmarks with <100 samples produce high-variance results. Always use ≥500 samples or report confidence intervals.",
  "source": "observation OBS-3 from batch 2",
  "confidence": 0.85
}

D. Do Nothing (if observation is a one-off, severity is cosmetic, or the fix would be too broad / risky)

4. Gate (Validate)

Before accepting any mutation, check:

  • Specificity: Does it target the observed failure without being so broad it could cause regressions elsewhere?
  • Testability: Could you verify this mutation helps by re-running the failed tasks?
  • Blast radius: How much of the agent's behavior does this change? Prefer small, targeted mutations over large rewrites.
  • Consistency: Does it contradict existing skills or prompt guidance?

If a mutation fails the gate, either refine it or discard it. Explain your reasoning to the user.

5. Reload (Apply and Record)

Present the accepted mutations to the user. For each:

  • State what changed and why
  • Show the artifact (skill file, prompt patch, knowledge entry)
  • Suggest where to place it in the project

For AutoResearchClaw projects, recommended locations:

ArtifactLocation
Evolved skill
.claude/skills/evolved/<skill-name>/SKILL.md
Prompt patchAppend to
prompts.default.yaml
or custom prompts file
Knowledge entry
docs/kb/evolved_knowledge/<id>.json
Observation log
evolution/observations/<batch>.md

Keep a running version log so the user can track what evolved and when:

## Evolution Log
- evo-1 (2026-03-30): Generated `api-pagination-handler` skill from OBS-1
- evo-2 (2026-03-30): Prompt patch for citation format from OBS-4

Usage with AutoResearchClaw

This skill maps to ARC's pipeline stages:

ARC StageEvolution Role
12 EXPERIMENT_RUNSource of Solve artifacts
13 ITERATIVE_REFINEMain Observe + Evolve trigger point
15 RESEARCH_DECISIONNatural Gate — PROCEED = accept, REFINE = retry
18 PEER_REVIEWAdditional Observe signal for writing quality

When the user says "evolve my research pipeline" or similar:

  1. Ask which run to analyze (or find the latest
    artifacts/rc-*/
    )
  2. Run the Observe step on experiment outputs + review feedback
  3. Propose mutations targeting the weakest pipeline stages
  4. Generate skill files that ARC can load via
    .claude/skills/

Anti-Patterns

Do NOT:

  • Generate vague, generic skills ("always be careful", "check your work")
  • Propose mutations for one-off errors that won't recur
  • Rewrite the entire system prompt — patch it surgically
  • Generate more than 3 skills per evolution cycle (quality over quantity)
  • Mutate tool code unless the user explicitly asks for it

Relationship to MetaClaw

If AutoResearchClaw has MetaClaw enabled (

metaclaw_bridge.enabled: true
), evolved skills from this process can be placed in
~/.metaclaw/skills/arc-*/
so MetaClaw injects them into future runs automatically. The two systems are complementary:

  • A-Evolve skill: Deep, targeted mutation from structured observation
  • MetaClaw lesson: Broad pattern captured from pipeline warnings/errors

Both can coexist. Skills generated here are higher-precision; MetaClaw lessons are higher-recall.