AutoResearchClaw a-evolve
```bash
git clone https://github.com/aiming-lab/AutoResearchClaw
```

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/aiming-lab/AutoResearchClaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/a-evolve" ~/.claude/skills/aiming-lab-autoresearchclaw-a-evolve && rm -rf "$T"
```
`.claude/skills/a-evolve/SKILL.md`

A-Evolve: Agentic Evolution Skill
Apply the Solve → Observe → Evolve → Gate → Reload methodology from A-Evolve to iteratively improve agent performance. This skill is prompt-based — no external dependencies, no harness changes. You analyze failures, propose workspace mutations, and generate durable artifacts (skills, prompt patches, knowledge entries) that the agent can load in future runs.
Core Loop
When asked to evolve or improve agent performance, follow this 5-step loop:
1. Solve (Collect Evidence)
Gather the agent's execution artifacts. Ask the user for or locate:
- Run logs, error traces, or experiment outputs
- Pass/fail results per task
- Metric values (accuracy, reward, success rate)
- Any existing session files from previous runs
If inside AutoResearchClaw, look at:
- `artifacts/rc-*/` for experiment outputs, charts, reviews
- `evolve.log` or stage-specific logs
- `reviews.md` for peer review feedback
- Sentinel watchdog reports
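Inside ARC, this gathering step can also be scripted rather than done by hand. A minimal sketch in Python, assuming the `artifacts/rc-*/` layout above; treat the exact file names and run-id ordering as assumptions about the run directory:

```python
from pathlib import Path

def collect_evidence(root: Path = Path("artifacts")) -> dict[str, str]:
    """Gather Solve-step evidence from the most recent ARC run directory."""
    runs = sorted(p for p in root.glob("rc-*") if p.is_dir())
    if not runs:
        raise FileNotFoundError("no artifacts/rc-*/ runs found")
    latest = runs[-1]  # lexicographic order; assumes run ids sort by recency
    evidence: dict[str, str] = {}
    for name in ("evolve.log", "reviews.md"):  # stage logs + peer review
        path = latest / name
        if path.exists():
            evidence[name] = path.read_text(encoding="utf-8")
    return evidence
```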
2. Observe (Diagnose)
Analyze the collected evidence to produce structured observations:
For each failed or underperforming task, identify:
- Error category: code bug, timeout, wrong approach, missing knowledge, API misuse, hallucinated reference, prompt ambiguity, etc.
- Root cause: What specifically went wrong and why
- Frequency: Is this a one-off or a recurring pattern across tasks?
- Severity: blocking (pipeline crash) / degrading (wrong result) / cosmetic (formatting issue)
Write observations as a structured list:
```markdown
## Observations (Batch N)

### OBS-1: [Category] Short description
- Tasks affected: task_001, task_005, task_012
- Root cause: ...
- Frequency: 3/50 tasks (6%)
- Severity: degrading

### OBS-2: ...
```
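Frequency and severity triage becomes mechanical once failures are labeled. A small sketch, using made-up sample data in the shape the Observe step produces:

```python
from collections import Counter

# (task_id, error_category) pairs labeled during diagnosis; sample data
failures = [
    ("task_001", "api_misuse"),
    ("task_005", "api_misuse"),
    ("task_012", "api_misuse"),
    ("task_020", "timeout"),
]
total_tasks = 50  # size of the evaluated batch

for category, n in Counter(c for _, c in failures).most_common():
    # frequency >= 3 is the skill-generation threshold used in step 3A
    action = "candidate for a skill" if n >= 3 else "log and watch"
    print(f"{category}: {n}/{total_tasks} ({n / total_tasks:.0%}) -> {action}")
```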
3. Evolve (Propose Mutations)
Based on observations, propose one or more of these mutation types:
A. Generate a Skill (for recurring patterns, frequency ≥ 3)
Write a new `SKILL.md` file that teaches the agent how to handle this pattern. A good evolved skill:
- Targets a specific failure category, not generic advice
- Contains concrete steps the agent should follow
- Includes a "when to apply" trigger condition
- Is short (under 100 lines) and self-contained
Example — if the agent keeps failing at API pagination:
```markdown
---
name: api-pagination-handler
description: >
  Handle paginated API responses correctly. Use when making API calls
  that may return partial results, or when results seem truncated.
---

When calling any API that supports pagination:

1. Check response for pagination indicators: `next_page`, `offset`,
   `has_more`, `cursor`, or truncated result counts.
2. If paginated, loop until all pages are collected.
3. Concatenate results before processing.
4. Set a max-page safety limit (default: 20) to prevent infinite loops.
5. Log total items collected vs expected count if available.
```
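For reference, the loop that skill prescribes looks roughly like this in Python; `fetch_page` is a hypothetical stand-in for whatever client the agent is actually calling:

```python
def fetch_all(fetch_page, max_pages: int = 20) -> list:
    """Collect every page of a paginated API response.

    Assumes fetch_page(cursor) returns (items, next_cursor), with
    next_cursor=None once no `next_page`/`has_more` indicator remains.
    """
    items, cursor = [], None
    for _ in range(max_pages):  # safety limit against infinite loops
        batch, cursor = fetch_page(cursor)
        items.extend(batch)  # concatenate before any processing
        if cursor is None:
            break
    else:
        print(f"warning: hit max_pages={max_pages}; results may be partial")
    return items
```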
B. Patch the System Prompt (for prompt ambiguity or missing guidance)
Write a short addendum to the system prompt that addresses the gap. Keep patches minimal — one paragraph per issue. Format:
```markdown
## Prompt Patch: [Issue]

Append to system prompt:

> When [specific situation], always [specific action] because [reason].
```
C. Add a Knowledge Entry (for factual gaps or learned heuristics)
Record a reusable insight as a knowledge entry:
{ "id": "know-001", "category": "experiment_design", "insight": "Synthetic benchmarks with <100 samples produce high-variance results. Always use ≥500 samples or report confidence intervals.", "source": "observation OBS-3 from batch 2", "confidence": 0.85 }
D. Do Nothing (if the observation is a one-off, the severity is cosmetic, or the fix would be too broad or risky)
4. Gate (Validate)
Before accepting any mutation, check:
- Specificity: Does it target the observed failure without being so broad it could cause regressions elsewhere?
- Testability: Could you verify this mutation helps by re-running the failed tasks?
- Blast radius: How much of the agent's behavior does this change? Prefer small, targeted mutations over large rewrites.
- Consistency: Does it contradict existing skills or prompt guidance?
If a mutation fails the gate, either refine it or discard it. Explain your reasoning to the user.
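The first pass of the gate can be expressed as a checklist function, run before any qualitative judgment. A sketch; the field names on the mutation record are illustrative, not a fixed schema:

```python
def gate(mutation: dict) -> tuple[bool, list[str]]:
    """Screen a proposed mutation against the four gate criteria."""
    problems = []
    if not mutation.get("target_observations"):   # specificity
        problems.append("not tied to a specific observation")
    if not mutation.get("retest_tasks"):          # testability
        problems.append("no failed tasks listed for re-running")
    if len(mutation.get("body", "")) > 4000:      # blast radius (rough proxy)
        problems.append("artifact is large; prefer a targeted mutation")
    if mutation.get("conflicts_with"):            # consistency
        problems.append(f"contradicts existing guidance: {mutation['conflicts_with']}")
    return (not problems, problems)

# A skill mutation that passes all four checks:
ok, reasons = gate({
    "kind": "skill",
    "target_observations": ["OBS-1"],
    "retest_tasks": ["task_001", "task_005", "task_012"],
    "body": "SKILL.md content ...",
})
print(ok, reasons)  # True []
```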
5. Reload (Apply and Record)
Present the accepted mutations to the user. For each:
- State what changed and why
- Show the artifact (skill file, prompt patch, knowledge entry)
- Suggest where to place it in the project
For AutoResearchClaw projects, recommended locations:
| Artifact | Location |
|---|---|
| Evolved skill | `.claude/skills/` |
| Prompt patch | Append to the system prompt or a custom prompts file |
| Knowledge entry | |
| Observation log | |
Keep a running version log so the user can track what evolved and when:
```markdown
## Evolution Log
- evo-1 (2026-03-30): Generated `api-pagination-handler` skill from OBS-1
- evo-2 (2026-03-30): Prompt patch for citation format from OBS-4
```
Usage with AutoResearchClaw
This skill maps to ARC's pipeline stages:
| ARC Stage | Evolution Role |
|---|---|
| 12 EXPERIMENT_RUN | Source of Solve artifacts |
| 13 ITERATIVE_REFINE | Main Observe + Evolve trigger point |
| 15 RESEARCH_DECISION | Natural Gate — PROCEED = accept, REFINE = retry |
| 18 PEER_REVIEW | Additional Observe signal for writing quality |
When the user says "evolve my research pipeline" or similar:
- Ask which run to analyze (or find the latest `artifacts/rc-*/`)
- Run the Observe step on experiment outputs + review feedback
- Propose mutations targeting the weakest pipeline stages
- Generate skill files that ARC can load via `.claude/skills/`
Anti-Patterns
Do NOT:
- Generate vague, generic skills ("always be careful", "check your work")
- Propose mutations for one-off errors that won't recur
- Rewrite the entire system prompt — patch it surgically
- Generate more than 3 skills per evolution cycle (quality over quantity)
- Mutate tool code unless the user explicitly asks for it
Relationship to MetaClaw
If AutoResearchClaw has MetaClaw enabled (`metaclaw_bridge.enabled: true`), evolved skills from this process can be placed in `~/.metaclaw/skills/arc-*/` so MetaClaw injects them into future runs automatically. The two systems are complementary:
- A-Evolve skill: Deep, targeted mutation from structured observation
- MetaClaw lesson: Broad pattern captured from pipeline warnings/errors
Both can coexist. Skills generated here are higher-precision; MetaClaw lessons are higher-recall.