GB-Power-Market-JJ autoforge
AutoForge is a production-grade autonomous optimization framework for AI agents. It replaces subjective "reflection" with mathematically rigorous convergence loops — tracking every iteration in TSV, cross-validating with multiple models, and stopping only when pass rates confirm real improvement. Four specialized modes: prompt (skill & doc optimization via scenario simulation), code (sandboxed test execution with measurable criteria), audit (CLI verification against live tool behavior), and project (whole-repo cross-file consistency analysis). Battle-tested across 50+ iterations on production skills. Use when: user says "autoforge", "forge", "optimize skill", "improve", "run autoforge", "optimize code", "improve script", "optimize repo", "forge project", "check project", "repo audit".
git clone https://github.com/GeorgeDoors888/GB-Power-Market-JJ
T=$(mktemp -d) && git clone --depth=1 https://github.com/GeorgeDoors888/GB-Power-Market-JJ "$T" && mkdir -p ~/.claude/skills && cp -r "$T/openclaw-skills/skills/akrimm702/autoforge" ~/.claude/skills/georgedoors888-gb-power-market-jj-autoforge && rm -rf "$T"
T=$(mktemp -d) && git clone --depth=1 https://github.com/GeorgeDoors888/GB-Power-Market-JJ "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/openclaw-skills/skills/akrimm702/autoforge" ~/.openclaw/skills/georgedoors888-gb-power-market-jj-autoforge && rm -rf "$T"
openclaw-skills/skills/akrimm702/autoforge/SKILL.md

AutoForge — Autonomous Optimization Framework
Stop reflecting. Start converging. Every iteration is measured, logged, and validated — not vibed.
AutoForge replaces ad-hoc "improve this" prompts with a rigorous optimization loop: define evals, run iterations, track pass rates in TSV, report live to your channel, and stop only when math says you're done. Multi-model cross-validation prevents the "same model grades its own homework" blind spot.
Four modes. One convergence standard.
| Mode | What it does | Best for |
|---|---|---|
| `prompt` | Simulate 5 scenarios/iter, evaluate Yes/No | SKILL.md, prompts, doc templates |
| `code` | Sandboxed test execution, measure exit/stdout/stderr | Shell scripts, Python tools, pipelines |
| `audit` | Test CLI commands live, verify SKILL.md matches reality | CLI skill documentation |
| `project` | Scan whole repo, cross-file consistency analysis | README↔CLI drift, Dockerfile↔deps, CI gaps |
AutoForge — Top-Agent Architecture
Overview
```
Agent (you)
├── State: results.tsv, current target file state, iteration counter
├── Iteration 1: evaluate → improve → write TSV → report
├── Iteration 2: evaluate → improve → write TSV → report
├── ...
└── Finish: report.sh --final → configured channel
```
Sub-Agent = You
"Sub-Agent" is a conceptual role, not a separate process. You (the top-agent) execute each iteration yourself: simulate/execute → evaluate → write TSV → call report.sh. The templates below describe what you do PER ITERATION — not what you send to another agent.
For code mode, run tests using the `exec` tool.
Multi-Model Setup (recommended for Deep Audits)
For complex audits, you can split two roles across different models:
| Role | Model | Task |
|---|---|---|
| Optimizer | Opus / GPT-4.1 | Analyzes, finds issues, writes fixes |
| Validator | GPT-5 / Gemini (different model) | Checks against ground truth, provides pass rate |
Flow: Optimizer and Validator alternate. Optimizer iterations have status `improved`/`retained`/`discard`. Validator iterations confirm or refute the pass rate. Spawn validators as sub-agents with `sessions_spawn` and an explicit model.
When to use Multi-Model: Deep Audits (>5 iterations expected), complex ground truth, or when a single model is blind to its own errors.
When Single-Model suffices: Simple CLI audits, prompt optimization, code with clear tests.
Configuration
AutoForge uses environment variables for reporting. All are optional — without them, output goes to stdout.
| Variable | Default | Description |
|---|---|---|
| `AF_CHANNEL` | (none) | Messaging channel for reports |
| `AF_CHAT_ID` | (none) | Chat/group ID for report delivery |
| `AF_TOPIC_ID` | (none) | Thread/topic ID within the chat |
Hard Invariants
These rules apply always, regardless of mode:
- TSV is mandatory. Every iteration writes exactly one row to `results/[target]-results.tsv`.
- Reporting is mandatory. Call `report.sh` immediately after every TSV row.
- `--dry-run` never overwrites the target. Only TSV, `*-proposed.md`, and reports are written.
- Mode isolation is strict. Only execute steps for the assigned mode.
- Iteration 1 = Baseline. Evaluate the original version unchanged, status `baseline`.
Modes — Read ONLY Your Mode!
You are assigned ONE mode. Ignore all sections for other modes.
| Mode | What happens | Output |
|---|---|---|
| `prompt` | Mentally simulate skill/prompt, evaluate against evals | Improved prompt text |
| `code` | Run tests in sandbox, measure results | Improved code |
| `audit` | Test CLI commands (read-only only!) + verify SKILL.md against reality | Improved SKILL.md |
| `project` | Scan whole repo, cross-file analysis, fix multiple files per iteration | Improved repository |
Your mode is in the task prompt. Everything else is irrelevant to you.
TSV Format (same for ALL modes)
Header (once at loop start):
printf '%s\t%s\t%s\t%s\t%s\n' "iteration" "prompt_version_summary" "pass_rate" "change_description" "status" > results/[target]-results.tsv
Row per iteration:
printf '%s\t%s\t%s\t%s\t%s\n' "1" "Baseline" "58%" "Original version" "baseline" >> results/[target]-results.tsv
Use `printf '%s'`, not `echo -e`! `echo -e` interprets backslashes in field values; `printf '%s'` outputs strings literally.
5 columns, TAB-separated, EXACTLY this order:
| # | Column | Type | Rules |
|---|---|---|---|
| 1 | `iteration` | Integer | 1, 2, 3, ... |
| 2 | `prompt_version_summary` | String | Max 50 Unicode chars. No tabs, no newlines. |
| 3 | `pass_rate` | String | Number + `%` (e.g. `58%`, `100%`). Always integer. |
| 4 | `change_description` | String | Max 100 Unicode chars. No tabs, no newlines. |
| 5 | `status` | Enum | Exactly one of: `baseline` · `improved` · `retained` · `discard` |
Escaping rules:
- Tabs in text fields → replace with spaces
- Newlines in text fields → replace with `|`
- Empty fields → use hyphen `-` (never leave empty)
- `$` and backticks → use `printf '%s'` or escape with `\$` (prevents unintended variable interpolation)
- Unicode/Emoji allowed, count as 1 character (not bytes)
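The rules above can be sketched as a small POSIX-shell helper; `af_field` is a hypothetical name for illustration, not part of the shipped scripts:

```shell
# Apply the escaping rules to one TSV field value:
# tabs -> spaces, newlines -> |, empty -> hyphen.
af_field() {
  v=$(printf '%s' "$1" | tr '\t\n' ' |')
  [ -n "$v" ] || v='-'
  printf '%s' "$v"
}
```

Using `printf '%s'` on the way in and out keeps backslashes and `$` literal, per the rules above.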
Status rules (based on pass-rate comparison):
- `baseline` — Mandatory for Iteration 1. Evaluate original version only.
- `improved` — Pass rate higher than previous best → new version becomes current state
- `retained` — Pass rate equal or only marginally better → predecessor remains
- `discard` — Pass rate lower → change discarded, revert to best state
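The comparison can be sketched as a tiny function; `af_status` is a hypothetical name, and the "marginally better" tie-break is simplified to strict equality here:

```shell
# Map (iteration, pass_rate, best_so_far) -> status column value.
# Rates are integer percentages, e.g. 58 for "58%".
af_status() {
  if [ "$1" -eq 1 ]; then echo baseline
  elif [ "$2" -gt "$3" ]; then echo improved
  elif [ "$2" -eq "$3" ]; then echo retained
  else echo discard
  fi
}
```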
Reporting (same for ALL modes)
After EVERY TSV row (including baseline):
bash scripts/report.sh results/[target]-results.tsv "[Skill Name]"
After the loop ends, additionally with `--final`:
bash scripts/report.sh results/[target]-results.tsv "[Skill Name]" --final
The report script reads `AF_CHANNEL`, `AF_CHAT_ID`, and `AF_TOPIC_ID` from the environment. Without them, it prints to stdout with ANSI colors.
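A sketch of that contract (not the real `scripts/report.sh`, which ships with the skill): assuming both `AF_CHANNEL` and `AF_CHAT_ID` are needed for channel delivery, destination selection might look like this, with `af_report_dest` a hypothetical name:

```shell
# Pick the report destination as described above: channel delivery only
# when both AF_CHANNEL and AF_CHAT_ID are set, otherwise plain stdout.
af_report_dest() {
  if [ -n "${AF_CHANNEL:-}" ] && [ -n "${AF_CHAT_ID:-}" ]; then
    printf 'channel:%s chat:%s\n' "$AF_CHANNEL" "$AF_CHAT_ID"
  else
    printf 'stdout\n'
  fi
}
```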
Stop Conditions (for ALL modes)
Priority — first matching condition wins, top to bottom:
- 🛑 Minimum iterations — If specified in the task (e.g. "min 5"), this count MUST be reached. No other condition can stop the loop before then.
- 🛑 Max 30 iterations — Hard safety net, stop immediately.
- ❌ 3× `discard` in a row → structural problem, stop + analyze.
- ✅ 3× 100% pass rate (after minimum) → confirmed perfect, done.
- ➡️ 5× `retained` in a row → converged, done.
Counting rules:
- `3× 100%` = three iterations with `pass_rate == 100%`, not necessarily consecutive.
- `5× retained` and `3× discard` = consecutive (in a row).
- `baseline` counts toward no series.
- `improved` interrupts `retained` and `discard` series.
At 100% in early iterations: Keep going! Test harder edge cases. Only 3× 100% after the minimum confirms true perfection.
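The consecutive-streak conditions can be read straight off the TSV (status is column 5 in the format above); `af_streak` is a hypothetical helper for illustration:

```shell
# True when the last n TSV rows all carry the given status.
af_streak() {  # usage: af_streak <tsv-file> <status> <n>
  [ "$(tail -n "$3" "$1" | cut -f5 | grep -cx -- "$2")" -eq "$3" ]
}
```

For example, `af_streak results/foo-results.tsv retained 5` would signal the 5×-retained convergence condition.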
Recognizing Validator Noise
In multi-model setups, the Validator can produce false positives — fails that aren't real issues:
- Config path vs tool name confusion (e.g. `agents.list[]` ≠ `agents_list` tool)
- Inverted checks ("no X" → Validator looks for X as required)
- Normal English treated as a forbidden reference (e.g. "runtime outcome" ≠ `runtime: "acp"`)
- Overcounting (thread commands counted as subagent commands)
Rule: If after all real fixes >3 discards come in a row and the fail justifications don't hold up under scrutiny → declare convergence, don't validate endlessly.
Execution Modes
| Flag | Behavior |
|---|---|
| `--dry-run` (default) | Only TSV + proposed files. Target file/repo remains unchanged. |
| `--live` | Target file/repo is overwritten. Auto-backup → `results/backups/` |
| `--resume` | Read existing TSV, continue from last iteration. On invalid format: abort. |
mode: prompt
Only read if your task contains `mode: prompt`!
Per Iteration: What you do
- Read current prompt/skill
- Mentally simulate 5 different realistic scenarios
- Evaluate each scenario against all evals (Yes=1, No=0)
- Pass rate = (Sum Yes) / (Eval count × 5 scenarios) × 100
- Compare with best previous pass rate → determine status
- On `improved`: propose a minimal, surgical improvement
- Write TSV row + call report.sh
- Check stop conditions
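The pass-rate formula in step 4 can be spot-checked with shell arithmetic; the numbers here are a made-up example (4 evals, 5 scenarios, 13 Yes answers):

```shell
# pass rate = yes / (evals * scenarios) * 100, as an integer percentage.
evals=4
scenarios=5
yes_count=13
pass_rate=$(( yes_count * 100 / (evals * scenarios) ))
printf '%s%%\n' "$pass_rate"   # 13/20 -> 65%
```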
At the End
Best version → `results/[target]-proposed.md` + report.sh `--final`
mode: code
Only read if your task contains `mode: code`!
Per Iteration: What you do
- Create sandbox: `SCRATCH=$(mktemp -d) && cd "$SCRATCH"`
- Write current code to sandbox
- Execute test command (with `timeout 60s`)
- Evaluate against evals → calculate pass rate
- On `improved`: minimal code improvement + verify again
- Write TSV row + call report.sh
- Check stop conditions
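Steps 1-4 can be sketched with standard tools only. The target script here is a throwaway stand-in written on the fly, not a real file under test:

```shell
# Sandbox one measurement: exit code, stdout, stderr, runtime.
SCRATCH=$(mktemp -d)
printf 'echo ok\n' > "$SCRATCH/target.sh"   # stand-in for the code under test
start=$(date +%s)
out=$(cd "$SCRATCH" && timeout 60s bash target.sh 2> "$SCRATCH/err.log")
exit_code=$?
runtime=$(( $(date +%s) - start ))
```

The captured `exit_code`, `out`, `err.log`, and `runtime` are then scored against the evals.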
Code Eval Types
- Process exit code (e.g. `exit_code==0`)
- stdout contains a given string
- stdout matches a regex
- Test framework green
- Runtime under a limit
- No error output on stderr
- Output file created
- Output is valid JSON
At the End
Best code → `results/[target]-proposed.[ext]` + report.sh `--final`
mode: audit
Only read if your task contains `mode: audit`!
⚠️ DO NOT write your own code. Only test CLI commands of the target tool (`--help` + read-only).
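Read-only verification can be done entirely through help output. A self-contained sketch, where `mytool` is a stub shell function standing in for the real CLI:

```shell
# Stub target tool: prints a usage line regardless of arguments.
mytool() { printf 'usage: mytool [start|stop|status]\n'; }

# Check whether a documented subcommand actually appears in --help output,
# without executing anything that could mutate state.
check_documented() {
  if mytool --help 2>/dev/null | grep -qw -- "$1"; then
    echo pass
  else
    echo fail
  fi
}
```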
Two Variants
Simple Audit (CLI skill, clear commands):
- 2 iterations: Baseline → Proposed Fix
- For tools with clear `--help` output and a simple command structure
Deep Audit (complex docs, many checks):
- Iterative loop like prompt/code, same stop conditions
- For extensive documentation with many checkpoints (e.g. config keys, tool policy, parameter lists)
- Recommended: Multi-Model setup (Opus Optimizer + external Validator)
Simple Audit Flow
- Write TSV header
- Iteration 1 (Baseline): Test every documented command → pass rate → TSV + report
- Iteration 2 (Proposed Fix): Write improved SKILL.md → expected pass rate → TSV + report
- Improved SKILL.md → `results/[target]-proposed.md`
- Detail results → `results/[target]-audit-details.md` (NOT in TSV!)
- report.sh `--final`
Deep Audit Flow
- Write TSV header
- Iteration 1 (Baseline): Extract ground truth from source, define all checks, evaluate baseline
- Iterations 2+: Optimizer fixes issues → Validator checks → TSV + report per iteration
- Loop runs until stop conditions trigger (3× 100%, 5× retained, 3× discard)
- Final version → `results/[target]-proposed.md` or `results/[target]-v1.md`
- report.sh `--final`
Fixed Evals (audit)
- Completeness — Does SKILL.md cover ≥80% of real commands/config?
- Correctness — Are ≥90% of documented commands/params syntactically correct?
- No stale references — Does everything documented actually exist?
- No missing core features — Are all important features covered?
- Workflow quality — Does quick-start actually work?
mode: project
Only read if your task contains `mode: project`!
⚠️ This mode operates on an ENTIRE repository/directory, not a single file. Cross-file consistency is the core feature — this is NOT "audit on many files."
Three Phases
Project mode runs through three sequential phases. Phases 1 and 2 happen once (in Iteration 1 = Baseline). Phase 3 is the iterative fix loop.
Phase 1: Scan & Plan
- Analyze the repo directory:

  ```
  # Discover structure
  tree -L 3 --dirsfirst [target_dir]
  ls -la [target_dir]
  ```

- Identify relevant files and classify by priority:

  | Priority | Files |
  |---|---|
  | critical | README, Dockerfile, CI workflows (.github/workflows), package.json/requirements.txt, main entry points |
  | normal | Tests, configs, scripts, .env.example, .gitignore |
  | low | Docs, examples, LICENSE, CHANGELOG |

- Build the File-Map — a mental inventory of what exists and what's missing.
- Compose eval set: merge user-provided evals with auto-detected evals (see Default Evals below).
Phase 2: Cross-File Analysis
Run consistency checks across files. Each check = one eval point:
| Check | What it verifies |
|---|---|
| README ↔ CLI | Documented commands/flags match actual CLI output |
| Dockerfile ↔ deps | `package.json`/`requirements.txt` versions match what the Dockerfile installs |
| CI ↔ project structure | Workflow references correct paths, scripts, test commands |
| `.env.example` ↔ code | Every env var in code has a corresponding entry in `.env.example` |
| Imports ↔ dependencies | Every `import`/`require` has a matching dependency declaration |
| Tests ↔ source | Test files exist for critical modules |
| `.gitignore` ↔ artifacts | Build outputs, secrets, and caches are excluded |
Result of Phase 2: A complete eval checklist with per-file and cross-file checks, each scored Yes/No.
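One of the checks above, `.env.example` ↔ code, can be sketched as a grep pass. The fixture repo here is a toy built on the fly; real runs scan the actual repository:

```shell
# Tiny fixture: one env var documented, two referenced in code.
repo=$(mktemp -d)
printf 'API_URL=\n' > "$repo/.env.example"
printf 'curl "$API_URL/health?key=$API_KEY"\n' > "$repo/app.sh"

# Every $VAR referenced in code must have a VAR= line in .env.example.
missing=$(grep -ohE '\$[A-Z][A-Z0-9_]+' "$repo"/*.sh | tr -d '$' | sort -u |
  while read -r var; do
    grep -q "^${var}=" "$repo/.env.example" || echo "$var"
  done)
```

Each variable reported in `missing` counts as one failed eval point.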
Phase 3: Iterative Fix Loop
Same loop logic as prompt/code/audit — TSV, report.sh, stop conditions. Key differences:
- Multiple files can be changed per iteration
- Pass rate = aggregated over ALL evals (file-specific + cross-file)
- Fixes are minimal and surgical — don't refactor blindly, only fix what improves pass rate
- `change_description` includes which files were touched, e.g. `"Fix Dockerfile + CI workflow sync"`
Per Iteration: What you do
- Evaluate current repo state against all evals (file-specific + cross-file)
- Calculate pass rate: (passing evals / total evals) × 100
- Compare with best previous pass rate → determine status
- On `improved`: apply minimal, surgical fixes to the fewest files necessary
- Verify the fix didn't break other evals (re-run affected checks)
- Write TSV row + call report.sh
- Check stop conditions
Dry-Run vs Live
| Flag | Behavior |
|---|---|
| `--dry-run` (default) | Fixed files → `results/[target]-proposed/` directory (mirrors repo structure). Original repo untouched. |
| `--live` | Files overwritten in-place. Originals backed up → `results/backups/[target]-backup/` (preserving directory structure). |
Default Evals (auto-applied unless overridden)
These evals are automatically used when the user doesn't provide custom evals. The agent detects which are applicable based on what exists in the repo:
| # | Eval | Condition |
|---|---|---|
| 1 | README accurate? (describes actual features/commands) | README exists |
| 2 | Tests present and green? | Test files or test config detected |
| 3 | CI configured and syntactically correct? | CI workflow file exists |
| 4 | No hardcoded secrets? (grep for `password`/`api_key` patterns) | Always |
| 5 | Dependencies complete? (`requirements.txt` ↔ imports, `package.json` ↔ requires) | Dependency file exists |
| 6 | Dockerfile functional? (`docker build` succeeds or Dockerfile syntax valid) | Dockerfile exists |
| 7 | `.gitignore` sensible? (no secrets, build artifacts excluded) | `.gitignore` exists |
| 8 | License present? | Always |
Eval Scoring
Pass Rate = (Passing Evals / Total Applicable Evals) × 100
Evals that don't apply (e.g. "Dockerfile functional?" when no Dockerfile exists) are excluded from the total, not counted as passes.
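A worked instance of this rule, using made-up numbers (8 default evals, Dockerfile absent, 5 of the remaining 7 passing):

```shell
total=8
not_applicable=1          # e.g. "Dockerfile functional?" with no Dockerfile
passing=5
applicable=$(( total - not_applicable ))
pass_rate=$(( passing * 100 / applicable ))   # integer division: 500/7 -> 71
```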
At the End
- `--dry-run`: All proposed changes → `results/[target]-proposed/` directory
- `--live`: Changes already applied, backups in `results/backups/`
- report.sh `--final`
- Optionally: `results/[target]-project-details.md` with per-file findings (NOT in TSV!)
Directory Structure
```
autoforge/
├── SKILL.md                          ← This file
├── results/
│   ├── [target]-results.tsv         ← TSV logs
│   ├── [target]-proposed.md         ← Proposed improvement (prompt/audit)
│   ├── [target]-proposed/           ← Proposed repo changes (project mode)
│   │   ├── README.md
│   │   ├── Dockerfile
│   │   └── ...
│   ├── [target]-v1.md               ← Deep audit final version
│   ├── [target]-audit-details.md    ← Audit details (audit mode only)
│   ├── [target]-project-details.md  ← Project details (project mode only)
│   └── backups/                     ← Auto-backups (--live)
│       ├── [file].bak               ← Single file backups (prompt/code/audit)
│       └── [target]-backup/         ← Full directory backup (project mode)
├── scripts/
│   ├── report.sh                    ← Channel reporting
│   └── visualize.py                 ← PNG chart (optional)
├── references/
│   ├── eval-examples.md             ← Pre-built evals
│   └── ml-mode.md                   ← ML training guide
└── examples/
    ├── demo-results.tsv             ← Demo data
    └── example-config.json          ← Example configuration
```
Examples (task descriptions, NOT CLI commands)
AutoForge is not a CLI tool — it's a skill prompt for the agent:
```
# Optimize a prompt
"Start autoforge mode: prompt for the coding-agent skill.
Evals: PTY correct? Workspace protected? Clearly structured?"

# Audit a CLI skill (simple)
"Start autoforge mode: audit for notebooklm-py."

# Deep audit with multi-model
"Start autoforge mode: audit (deep) for subagents docs.
Optimizer: Opus, Validator: GPT-5
Extract ground truth from source, validate iteratively."

# Optimize code
"Start autoforge mode: code for backup.sh.
File: ./backup.sh
Test: bash backup.sh personal --dry-run
Evals: exit_code==0, backup file created, < 10s runtime"

# Optimize a whole repository
"Start autoforge mode: project for ./my-app
Evals: Tests green? CI correct? No hardcoded secrets? README accurate?"

# Project mode with custom focus
"Start autoforge mode: project for /path/to/api-server
Focus: Docker + CI pipeline consistency
Evals: docker build succeeds, CI workflow references correct paths,
.env.example covers all env vars used in code"

# Project mode dry-run (default)
"Start autoforge mode: project for ./my-tool --dry-run
Use default evals. Show me what needs fixing."
```
Eval Examples → Mode Mapping
references/eval-examples.md provides ready-to-use Yes/No evals grouped by category. Here's how they map to AutoForge modes:
| eval-examples.md Category | AutoForge Mode | Notes |
|---|---|---|
| Briefing, Email, Calendar, Summary, Proposal | `prompt` | Mental simulation with scenario evals |
| Python Script, Shell Script, API, Data Pipeline, Build | `code` | Real execution with measurable criteria |
| CI/CD, Docker, Helm, Kubernetes, Terraform | `code` or `project` | `code` for single files, `project` for cross-file |
| Code Review, API Documentation | `audit` | Verify docs match reality |
| Project / Repository, Cross-File Consistency, Security Baseline | `project` | Whole-repo scanning and cross-file checks |
Pick evals from the matching category and paste them into your task prompt as the eval set.
Tips
- Always start with `--dry-run`
- `prompt` = think, `code` = execute, `audit` = test CLI, `project` = optimize repo
- Simple Audit for clear CLI skills, Deep Audit for complex docs
- Project mode scans the whole repo — cross-file consistency is the killer feature
- Multi-Model for Deep Audits: different models cover different blind spots
- At >3 discards after all fixes: check for validator noise, declare convergence if justified
- TSV + report.sh are NOT optional — they are the user interface
- For ML training: see `references/ml-mode.md`