autostar

Install:

```sh
git clone https://github.com/chrisvoncsefalvay/autostar
```

or copy the skill straight into your Claude skills directory:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/chrisvoncsefalvay/autostar "$T" && mkdir -p ~/.claude/skills && cp -r "$T/autostar-skill" ~/.claude/skills/chrisvoncsefalvay-autostar-autostar && rm -rf "$T"
```

autostar-skill/SKILL.md — a* (autostar)
A generalised autonomous optimisation loop — soft RLVR for the masses. The user defines a goal; the system runs structured experiments, evaluates progress across independent tracks, reflects at strategic checkpoints, and learns from every attempt — including learning how to learn better the next time.
If you can measure it, you can improve it.
Experimental-first principle
a* is an experimental optimisation loop. Do not reach for external mathematical optimisers or solvers (e.g.
scipy.optimize, cvxpy, linear/quadratic
programming solvers, evolutionary algorithm libraries, Bayesian optimisation
frameworks, or any other off-the-shelf optimisation package) as a shortcut to
improving the artifact. The value of a* is in the structured
explore-evaluate-reflect cycle, not in delegating the search to a solver.
If at any point during onboarding, pre-run analysis, or execution you believe the problem is well-suited to a closed-form or mathematical optimisation approach, you must ask the user first before pursuing it. Present it as an alternative:
"This problem looks like it could be approached with a mathematical optimiser (e.g. [specific method]). Would you like me to try that instead of running the experimental loop, or would you prefer to proceed with a*?"
Do not silently install, import, or invoke an external optimiser. Do not reframe the a* loop as a wrapper around a solver. If the user explicitly opts for a mathematical approach, that is a different workflow — not an a* run.
Concepts
Before running, ensure you understand these terms precisely:
| Term | Meaning |
|---|---|
| Step | One execution with one parameter set. Atomic unit of work. |
| Play | A named bundle of parameters that move together (optional; plays can be disabled at onboarding). |
| Lap | A set of steps sharing the same parameter family. Establishes statistical confidence in a direction. |
| Round | A set of laps. Ends with a mandatory reflection: worth pursuing? ask user? pivot? |
| Run | One user-initiated process. Lasts until budget is exhausted or goal is met. |
| Track | One independently verifiable sub-goal. Has its own verifier and ratchet. |
| Disposition | A learned prior on how to approach a (problem class, action intent) pair. Stored in long-term memory; conditions all significant actions. |
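The hierarchy surfaces in step identifiers used later in this document, such as `r0_l0_s0` (the baseline) and `r2_l1_s4`. A minimal sketch of that naming scheme in Python — the helper functions are illustrative assumptions, not part of the skill:

```python
import re

def step_id(round_n: int, lap_n: int, step_n: int) -> str:
    """Format a step identifier in the r<round>_l<lap>_s<step> scheme
    seen in records like r0_l0_s0 (baseline) and r2_l1_s4."""
    return f"r{round_n}_l{lap_n}_s{step_n}"

def parse_step_id(sid: str) -> tuple[int, int, int]:
    """Recover (round, lap, step) from an identifier."""
    m = re.fullmatch(r"r(\d+)_l(\d+)_s(\d+)", sid)
    if m is None:
        raise ValueError(f"not a step id: {sid!r}")
    r, l, s = (int(g) for g in m.groups())
    return (r, l, s)

assert parse_step_id(step_id(2, 1, 4)) == (2, 1, 4)
```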
Runtime capability contract
Before Phase 1, detect the host runtime's capabilities and map them onto the abstract adapter contract in
references/runtime-capabilities.md.
Use abstract capabilities first:
- `structured_choice` for bounded approvals
- `freeform_input` for open-ended elicitation
- `local_html` / `file_presentation` for the rubric builder and visualiser
- `subprocess` for external-tool verifiers and render scripts
- `pause_resume` for human gates and round escalations

Claude-specific tools are examples of adapters, not the specification:
- Claude Code: `ask_user` + shell + browser/file paths
- Claude.ai: structured chat + `present_files`
If a capability is missing, follow the fallback policy in
references/runtime-capabilities.md before onboarding the mission.
Concrete runtime profiles and adapters live in:
- runtime-profiles/claude-code.json
- runtime-profiles/codex.json
- runtime-profiles/gemini.json
- runtime-profiles/claude-ai.json
- runtime-profiles/pi.json
- runtime-profiles/chat-only.json
- runtime-profiles/template.json
- references/adapter-claude-code.md
- references/adapter-codex.md
- references/adapter-gemini.md
- references/adapter-claude-ai.md
- references/adapter-pi.md
- references/adapter-chat-only.md
- references/adapter-template.md
- scripts/runtime_profile.py
Before detailed verifier/rubric work, check that the active runtime can support the proposed mission. Use
scripts/runtime_profile.py check-mission with the
current runtime profile and planned verifier types. If it fails, stop and
reconfigure before proceeding.
Phase 1: Onboarding
Do not begin execution until onboarding is complete and the user has approved the mission.
Onboarding is an interactive dialogue, not a monologue. At every decision point you must stop and ask the user rather than inferring and proceeding. Use the host runtime's
structured_choice capability for bounded decisions; in Claude Code
this maps to ask_user. Use open prose questions for genuinely open-ended inputs
(e.g. goal description, rubric wording).
The mandatory user-confirmation checkpoints are:
- Goal decomposition confirmed — present inferred tracks as choices; user approves, removes, or adds before proceeding
- Required vs preferred — for each track, explicitly ask; do not infer
- Verifier type per track — present options; user selects
- Hard constraints confirmed — present inferred list; user amends
- Budget — present three concrete options; user selects
- Plays — enabled/disabled, and approval of proposed bundles
- Final mission confirmation — full summary; explicit go/no-go before any step runs
Never skip a checkpoint. If the user's initial message contained enough information to pre-populate an answer, present it as a pre-selected option and ask them to confirm or change it. Do not silently accept it.
Rubric builder: When configuring LLM judge tracks (onboarding checkpoint 2+), surface the bundled rubric builder through the runtime's
local_html or
file_presentation capability so the user can describe score anchors
interactively and get a generated rubric they can edit and confirm:
```sh
# Claude Code / terminal
open assets/rubric-builder.html       # macOS
xdg-open assets/rubric-builder.html   # Linux
start assets/rubric-builder.html      # Windows
```
If running in Claude.ai, use
present_files on assets/rubric-builder.html instead.
If the runtime cannot surface local HTML, fall back to manual rubric elicitation as
defined in references/runtime-capabilities.md. The user exports a tracks.md
from the tool; load that as the confirmed track configuration. Only fall back to
manual elicitation for tracks the tool did not cover (external_tool,
deterministic, human_gate types do not need a rubric).
Read
references/onboarding.md for the full dialogue flow, question wording, and
decision trees at each checkpoint. Read references/runtime-capabilities.md
before adapting this flow to a non-Claude host.
Rubric builder UI: When Phase B (verifier elicitation) reaches an
llm_judge
or hybrid track, present assets/rubric-builder.html to the user before
configuring that track. The builder calls Claude to generate the rubric from the
user's anchor descriptions, lets them review and edit it inline, and exports
a tracks.md file you can use directly. Tell the user:
"I'm opening the rubric builder for the [track name] track. Describe the score anchors, and it will draft the rubric for you to review and confirm."
After the user exports
tracks.md from the builder, read it and use it as the
track configuration. Do not re-elicit rubrics that are already confirmed there.
The onboarding produces four documents, all stored in the run directory:
mission.md

```
GOAL: [plain language description of success]
ARTIFACT: [what is being mutated and where it lives]
PLAYS: enabled | disabled
BUDGET: [strategy + ceiling — see references/budgeting.md]
STOPPING_CRITERIA: [score threshold | plateau_n | budget_exhausted]
REPORTING: [what the final report must contain]
```
tracks.md — one block per track. See Verification taxonomy below for verifier types.

```
TRACK: <name>
required: true | false
weight: 0.0–1.0   (weights across non-required tracks must sum to 1.0)
verifier: <see taxonomy>
threshold: <pass/fail cutoff or target score>
ratchet: independent | composite   (default: independent)
```
constraints.md

```
HARD: [list — violations cause immediate step rejection before scoring]
SOFT: [list — passed to LLM judge as weighting hints]
```
plays.md (if enabled)

```
PLAY: <name>
parameters: [list of (param, from, to)]
hypothesis: [why these move together]
tracks_targeted: [list]
atomic_fallback: true | false
```
Verification taxonomy
This is the core of the rubric system. Every track must declare one of the following verifier types. Read
references/verification.md for full configuration details and
examples for each type.
1. Deterministic programmatic
A function, script, or expression that produces a binary pass/fail or a bounded score with no randomness. Does not require an LLM call.
Use for: word count, token count, regex match, JSON schema validation, spelling/grammar (rule-based), mathematical constraints, format compliance.
```yaml
verifier:
  type: deterministic
  fn: word_count(artifact) <= 400
  returns: bool
```
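To make the contract concrete, here is a minimal Python sketch of a deterministic verifier matching the config above. The `word_count` helper is an illustrative assumption, not part of the skill's codebase:

```python
def word_count(artifact: str) -> int:
    """Deterministic, no randomness: same artifact, same count."""
    return len(artifact.split())

def verify_word_count(artifact: str, limit: int = 400) -> bool:
    """Binary pass/fail verifier, mirroring fn: word_count(artifact) <= 400."""
    return word_count(artifact) <= limit

print(verify_word_count("word " * 350))  # True: 350 words <= 400
```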
2. External tool (subprocess)
A command-line tool invoked as a subprocess. The tool must be available in the environment; the mission builder checks availability before the run starts. Return code 0 = pass; non-zero = fail (unless score mode is configured).
Common tools and what they verify:
| Domain | Tool | What it checks |
|---|---|---|
| Python typing | mypy, pyright | Static type correctness |
| Python tests | pytest | Test suite passage |
| TypeScript | tsc | Type correctness |
| JavaScript | eslint | Lint rules |
| Accessibility | axe-cli, pa11y | WCAG compliance |
| Web performance | lighthouse | Perf / a11y / SEO scores |
| CSS | stylelint | Style rule compliance |
| Markdown | markdownlint | Document structure |
| OpenAPI | spectral | API spec validity |
| Prose | vale | Style guide adherence |
| Security | bandit, semgrep | Vulnerability patterns |
| Build | make, cargo | Compilation success |
| Inference perf | aitune | Model latency, throughput, memory (see AITune delegation below) |
```yaml
verifier:
  type: external_tool
  command: pyright src/handler.py --outputjson
  parse_output: json_error_count   # or: exit_code, json_path, regex_capture
  returns: score                   # 1.0 - (errors / lines)
  required_env: [python, pyright]
```
If a required tool is absent, the mission builder must either guide the user to install it, or replace the track with an LLM judge approximation (lower confidence; flagged in the run report).
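As an illustration of how an external-tool verifier might map subprocess output onto a bounded score, here is a hedged Python sketch. The exit-code fallback follows the convention above; the `summary.errorCount` field is specific to pyright's JSON output and should be checked against the tool's docs before relying on it:

```python
import json
import subprocess

def run_external_verifier(command: list[str], total_lines: int) -> float:
    """Run a tool as a subprocess and map its output to a 0.0-1.0 score,
    mirroring parse_output: json_error_count with returns: score."""
    proc = subprocess.run(command, capture_output=True, text=True)
    try:
        report = json.loads(proc.stdout)
        errors = report["summary"]["errorCount"]  # pyright-specific; assumed
    except (json.JSONDecodeError, KeyError):
        # Unparseable output: fall back to exit-code semantics (0 = pass).
        return 1.0 if proc.returncode == 0 else 0.0
    return max(0.0, 1.0 - errors / max(total_lines, 1))

# score = run_external_verifier(["pyright", "src/handler.py", "--outputjson"], total_lines=220)
```

Other tools need their own parser (exit_code, json_path, or regex_capture), but the shape — subprocess in, bounded score out — stays the same.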
3. LLM judge
A structured LLM call with a fixed rubric. The rubric is immutable for the duration of the run — it must not be modified by any agent. Temperature should be ≤ 0.2. For high-stakes tracks, use an ensemble of two independent judge calls and average.
```yaml
verifier:
  type: llm_judge
  rubric: |
    Score 0.0–1.0. Evaluate the documentation quality of the provided function.
    0.8+ requires: accurate parameter descriptions, return type explanation,
    at least one usage example, and a description of error conditions.
    Penalise: missing examples, vague descriptions, undocumented exceptions.
  temperature: 0.1
  ensemble: 2
  returns: score
```
The judge must also return a
rationale string of 1–3 sentences. This is written
to short-term memory and feeds the round reflection.
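A sketch of the ensemble logic in Python, assuming a hypothetical `call_judge` wrapper around whatever LLM API the host runtime exposes — the wrapper and its return shape are assumptions, not the skill's API:

```python
from statistics import mean

def call_judge(rubric: str, artifact: str, temperature: float) -> dict:
    """Hypothetical adapter: send rubric + artifact to the judge model and
    return {"score": float in [0, 1], "rationale": str}. The implementation
    depends on the host runtime and is omitted here."""
    raise NotImplementedError

def judge_track(rubric: str, artifact: str, ensemble: int = 2,
                temperature: float = 0.1) -> tuple[float, list[str]]:
    """Run `ensemble` independent judge calls and average the scores.
    Rationales are kept individually: they are written to short-term
    memory and feed the round reflection."""
    results = [call_judge(rubric, artifact, temperature) for _ in range(ensemble)]
    return mean(r["score"] for r in results), [r["rationale"] for r in results]
```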
4. Hybrid
A deterministic verifier AND an LLM judge, aggregated.
```yaml
verifier:
  type: hybrid
  deterministic: entity_checker(artifact, source)
  llm_judge: factual_consistency_rubric
  aggregation: min | mean | weighted
  returns: score
```
Use
min aggregation when both components are required to pass independently
(i.e., a high LLM score cannot compensate for a failed deterministic check).
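The aggregation choice is easy to see with numbers (values invented for illustration):

```python
det_score = 0.0   # deterministic entity check failed
llm_score = 0.9   # the judge liked the prose anyway

print(min(det_score, llm_score))    # 0.0  -> step fails, as it should
print((det_score + llm_score) / 2)  # 0.45 -> mean would mask the failed check
```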
5. Human gate
Pauses the run and surfaces the artifact to the user for approval. Use sparingly; counts against budget. Appropriate when a track cannot be reliably automated (e.g., brand approval, legal sign-off, aesthetic judgement with no proxy metric).
```yaml
verifier:
  type: human_gate
  prompt: "Does this copy meet the brand voice guidelines? Score 0–10."
  timeout_action: skip | block | auto_score(0.5)
```
Hard constraint enforcement
Hard constraints in
constraints.md are checked before any verifier runs.
A constraint violation immediately rejects the step with outcome: rejected_constraint
and consumes no verifier budget. This is important: do not waste judge budget on an
artifact that violates a hard constraint.
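A sketch of the enforcement order in Python. The `Constraint` shape and `run_verifiers` callable are hypothetical stand-ins for the real machinery:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    name: str
    check: Callable[[str], bool]  # hypothetical interface

def execute_step(artifact: str, hard_constraints: list[Constraint],
                 run_verifiers: Callable[[str], dict]) -> dict:
    """Hard constraints gate the step before any verifier spends budget."""
    for c in hard_constraints:
        if not c.check(artifact):
            # Rejected before scoring: zero verifier cost, as required above.
            return {"outcome": "rejected_constraint",
                    "constraint": c.name,
                    "cost": {"tokens": 0}}
    return run_verifiers(artifact)  # only now spend judge/tool budget

no_todos = Constraint("no_todos", lambda a: "TODO" not in a)
print(execute_step("TODO: fix", [no_todos], lambda a: {"outcome": "keep"}))
# -> {'outcome': 'rejected_constraint', 'constraint': 'no_todos', 'cost': {'tokens': 0}}
```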
Inference optimisation and AITune delegation
When the artifact being optimised is a model's inference performance — latency, throughput, GPU memory during serving, or deployment configuration — the mutation step should delegate to AITune rather than blindly experimenting with inference configurations through the a* loop alone.
Why this matters: Inference optimisation has a structured search space (backends, precision levels, compilation strategies) that AITune already navigates well. a*'s value here is in wrapping AITune with multi-dimensional quality constraints (accuracy preservation, latency targets, memory budgets) and the reflect-and-learn cycle — not in reinventing AITune's internal search.
Detection during onboarding: If the user's goal involves model serving speed, inference latency, throughput, quantization for deployment, or GPU-accelerated inference, flag this as an inference optimisation mission during Phase 1 and suggest AITune delegation. Present it as an option:
"This looks like an inference optimisation problem. I can delegate the low-level tuning (backend selection, quantization, graph optimisation) to AITune while keeping a*'s quality tracking and learning loop around it. Would you like to use AITune for the inference tuning?"
If the user agrees, read
references/aitune.md for the full delegation
protocol, including track templates, play design patterns, and correctness
validation setup. The key architectural point: each a* step invokes AITune
with a parameter set; a* evaluates the result against all tracks; the
ratchet and reflection machinery works as normal.
If AITune is not installed, offer the install command during Phase 2 tool checks:
pip install --extra-index-url https://pypi.nvidia.com aitune
Phase 2: Pre-run preparation
Before the first round begins:
1. Check tool availability. For every `external_tool` verifier, run a dry-fire check (`pyright --version`, `axe-cli --version`, etc.). Report any missing tools to the user and resolve before proceeding. A sketch of this check follows the list.
2. Baseline run. Execute one step with the unmodified artifact. Record baseline scores for all tracks. This is step `r0_l0_s0` and is never ratcheted.
3. Query disposition library. Retrieve relevant dispositions for this problem class. Surface them to the user briefly: "Based on previous runs, I know X about this class of problem."
4. Propose initial plays (if enabled). Present to user for approval or amendment.
5. Confirm mission. Show the user the complete `mission.md`, `tracks.md`, and `constraints.md` before any optimisation steps run. Do not proceed without explicit approval.
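A minimal dry-fire sketch in Python; the tool list is illustrative, and the real one should be derived from the confirmed `tracks.md`:

```python
import shutil
import subprocess

def dry_fire(tools: list[str]) -> list[str]:
    """Return the subset of tools that are missing or fail a --version probe."""
    missing = []
    for tool in tools:
        if shutil.which(tool) is None:
            missing.append(tool)
            continue
        proc = subprocess.run([tool, "--version"], capture_output=True)
        if proc.returncode != 0:
            missing.append(tool)
    return missing

print(dry_fire(["pyright", "axe-cli"]))  # resolve anything reported before step 2
```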
Phase 3: Execution loop
Progress visualisation
VISUALIZATION POLICY: USE THE TEMPLATE — DO NOT IMPROVISE
A prototype progress chart lives at
assets/inline-progress-chart.html. Use it
as a template — do not invent your own visualisation from scratch, do not
generate random dashboards, and do not create standalone HTML files that open in a
browser. Claude Code supports inline HTML visuals in chat; use that capability.
To render the chart after each step:
- Read `assets/inline-progress-chart.html` as a template
- Replace the sample `STEPS`, `REFLECTIONS`, and `BUDGET` data with actual run data from `step_log.jsonl`, `reflections.jsonl`, and `progress.json`
- Emit the resulting HTML inline in the conversation
Re-render after every step so the user always sees current state.
The chart has three visual components:
1. Composite score chart — staircase (step-style) curve for the winning trajectory (kept steps), with ghost dots for reverted alternatives. Each ghost dot connects back to its most recent kept ancestor via a pale bezier curve, showing what was tried and rejected. Reverted scores are labelled so the user sees what alternatives produced.
2. Branch genealogy — compact per-round row of kept/reverted dots grouped by lap, with the best score for the round.
3. Round reflections — structured cards for each round reflection, showing the three key questions (worth pursuing / ask user / pivot), reasoning, limiting track, budget remaining, and pace projection. Do not dump raw reflection text into the conversation — always render it through the template's structured card format.
Do not add per-track breakdowns, heatmaps, detailed step tables, or any other visual elaboration beyond what the template provides. If the user wants more detail, they can ask.
Data files
The run directory contains machine-readable state for external consumption. Keep these current but do not render them as visuals — they exist for programmatic access, not for display.
- `runs/<run_id>/step_log.jsonl` — one JSON record per line, one per step. Same schema as the step record below. Appended after each step.
- `runs/<run_id>/tracks.json` — array of track definitions, written once at run start from the confirmed `tracks.md`:

```json
[{ "id": "type_correctness", "label": "type_correctness", "required": true, "weight": null }]
```

- `runs/<run_id>/reflections.jsonl` — one JSON record per line per round reflection. Appended after each round.
- `runs/<run_id>/mission.json` — run metadata, written once at run start:

```json
{ "run_id": "run_20260324", "budget": { "total_tokens": 120000 } }
```

- `runs/<run_id>/progress.json` — machine-readable snapshot of current state, updated after every step. See `schemas/progress.schema.json` for the full JSON Schema:

```json
{
  "run_id": "run_20260324",
  "status": "running",
  "updated_at": "2026-03-24T14:23:00Z",
  "baseline": { "composite": 0.45, "tracks": { "type_correctness": 0.80 } },
  "current": { "composite": 0.82, "tracks": { "type_correctness": 0.95 }, "step_id": "r2_l1_s4" },
  "delta": { "composite": 0.37, "tracks": { "type_correctness": 0.15 } },
  "budget": { "total_steps": 80, "used_steps": 34, "remaining_pct": 57.5 },
  "rounds_completed": 2,
  "steps_completed": 34,
  "steps_kept": 18,
  "momentum": "exploiting_successfully",
  "limiting_track": "docstring_quality",
  "last_reflection": { "worth_pursuing": "yes", "pivot": "none", "pace_projection": 0.89 }
}
```
Step execution
For each step:
1. Apply hard constraint check → reject immediately if violated
2. Execute the artifact mutation (play or atomic)
3. Run all track verifiers in dependency order (required tracks first)
4. Compute composite score: Σ(weight_i × score_i), gated by required tracks
5. Apply per-track ratchet:
   - independent ratchet: each track keeps/reverts its own parameter changes
   - composite ratchet: keep only if overall composite improves
6. Write step record to short-term memory
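Step 4's composite might look like this in Python — a sketch under the assumption that a failed required track gates the composite to zero; the exact gating semantics belong to the skill's own machinery:

```python
def composite_score(track_scores: dict[str, float], tracks: list[dict]) -> float:
    """Weighted sum over non-required tracks, gated by required tracks
    (step 4 above). Weights across non-required tracks sum to 1.0."""
    for t in tracks:
        if t["required"] and track_scores[t["name"]] < t["threshold"]:
            return 0.0  # a failed required track cannot be bought back by weights
    return sum(t["weight"] * track_scores[t["name"]]
               for t in tracks if not t["required"])

tracks = [
    {"name": "type_correctness", "required": True,  "weight": None, "threshold": 1.0},
    {"name": "docstring_quality", "required": False, "weight": 1.0, "threshold": 0.8},
]
print(composite_score({"type_correctness": 1.0, "docstring_quality": 0.74}, tracks))  # 0.74
```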
Step record schema:
```
id: run_03_r2_l1_s4
parameters: {param: value, ...}
play: play_name | null
track_scores: {track_name: score, ...}
composite: float
judge_notes: {track_name: rationale, ...}
constraints: passed | rejected (+ which constraint)
cost: {tokens: n, wall_s: n}
outcome: keep | revert | partial_keep | rejected_constraint
```
Lap completion
When all steps in a lap are done:
```
score_distribution: {mean, std, max, min}
verdict: promising | exhausted | noisy
  - promising: mean score above lap threshold and improving
  - exhausted: score has plateaued across steps with low variance
  - noisy: high variance; more steps needed to confirm
hypothesis_result: confirmed | partial | refuted
budget_used: {tokens, steps}
```
If verdict is
noisy and budget allows, the lap may request additional steps
before closing. The budget controller gates this.
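A sketch of how the verdict might be derived from the score distribution; the noise and improvement heuristics are invented for illustration, and the real thresholds come from lap configuration:

```python
from statistics import mean, stdev

def lap_verdict(scores: list[float], lap_threshold: float,
                noise_std: float = 0.1) -> str:
    """Classify a finished lap: promising | exhausted | noisy."""
    if len(scores) > 1 and stdev(scores) > noise_std:
        return "noisy"        # high variance; may request more steps
    improving = scores[-1] > scores[0]
    if mean(scores) > lap_threshold and improving:
        return "promising"
    return "exhausted"        # plateaued with low variance

print(lap_verdict([0.61, 0.64, 0.70], lap_threshold=0.6))  # promising
```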
Round reflection
Every round ends with a recorded reflection, without exception. The reflection is not optional even when nothing changes. A "no change" record is valuable: it documents that the question was considered.
```
ROUND REFLECTION
round_id:
laps_completed:
score_trajectory: [list of lap means]
track_trajectories: {track: [scores]}   ← per-track view
limiting_track: <which track is the current ceiling>

QUESTION 1 — Worth pursuing?
assessment: yes | no | uncertain
reasoning: [2–4 sentences]

QUESTION 2 — Ask the user?
trigger: none | stuck | diverging_tracks | pace_risk | constraint_conflict
message: [specific, actionable question if triggered — not "we're stuck"]

QUESTION 3 — Pivot?
decision: none | minor | major | abandon
reasoning: [required even if none]
next_round_strategy: [what changes, if anything]

budget_remaining: %
pace_projection: expected score at budget exhaustion
```
Ask-user triggers (automatic):
- Score has not improved across two consecutive rounds
- Two or more tracks are diverging (improving one reliably hurts another)
- Budget is 50% consumed with < 30% of target score achieved
- All laps in the round returned `exhausted`
- A required track is consistently failing with no clear fix
When asking the user, be specific. Not "we're stuck" but:
"Improving documentation quality (track score: 0.74) consistently reduces type correctness (track score drops from 1.0 to 0.91) because the added comments confuse pyright's inference. Should I relax the type correctness threshold, or is that a hard requirement?"
Phase 4: Memory and learning
Read
references/memory.md for the full memory architecture.
Short-term memory (within run)
- Full step log
- Hypothesis stack with provenance
- Track trajectories
- Score momentum signal
- Failed hypotheses with failure modes (not just "failed" — why)
Long-term memory (disposition library)
Keyed on
(problem_class, action_intent). Each entry is a natural-language
conditioned prior on how to approach this class of action on this class of problem.
The memory agent runs a consolidation pass at the end of each round:
- Does any disposition need updating based on this round's evidence?
- Did a disposition prove wrong? Flag it with a negative exemplar.
- Should a problem class be forked? (Two sub-classes behaving differently)
The memory agent may run a meta-research step only when the mission has explicitly enabled external research. If research is disabled, skip this path and continue using only local evidence, run history, user guidance, and bundled references.
If enabled and disposition confidence is below threshold for an upcoming action class:
- Prefer local references, bundled docs, and tool help before any network fetch
- If external lookup is still justified, prefer vendor docs or a mission allowlist over the open web
- Do not send artifact contents, source code, secrets, or proprietary identifiers to external services unless the user separately approved that disclosure
- Synthesise into a candidate disposition
- Apply on the next action
- Observe outcome; confirm or reject the looked-up guidance
- Record provenance: looked_up_from_web | learned_from_run | user_specified
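An illustrative disposition record, showing how the `(problem_class, action_intent)` key and provenance field fit together — the field names beyond those defined above are assumptions:

```python
disposition = {
    "key": ("markdown_documentation", "tighten_wording"),  # (problem_class, action_intent)
    "prior": "Shortening sentences rarely hurts the judge's clarity score, "
             "but cutting usage examples reliably does.",
    "confidence": 0.7,
    "provenance": "learned_from_run",  # looked_up_from_web | learned_from_run | user_specified
    "negative_exemplars": [],          # filled when a round refutes the prior
}
```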
Phase 5: Post-run report
The final report must contain:
- Baseline vs final scores per track
- Score trajectory chart (text-based if no rendering available)
- Round reflection log (all rounds, verbatim)
- What worked (confirmed plays and dispositions)
- What didn't (refuted hypotheses, with failure modes)
- Suggested follow-up directions
- Disposition updates proposed (user can approve or reject)
- Full budget accounting
Reference files
Read these when the relevant section is reached:
| File | When to read |
|---|---|
| references/onboarding.md | Phase 1 — building mission, tracks, constraints, plays |
| references/verification.md | When configuring any track verifier |
| references/budgeting.md | When setting or projecting budget |
| references/memory.md | When reading/writing disposition library |
| references/runtime-capabilities.md | Before adapting a* to any non-Claude runtime |
| references/adapter-claude-code.md | When running a* in Claude Code full-support mode |
| references/adapter-codex.md | When running a* in Codex full-support mode |
| references/adapter-gemini.md | When running a* in Gemini CLI full-support mode |
| references/adapter-claude-ai.md | When running a* in Claude.ai reduced-support mode |
| references/adapter-pi.md | When running a* in Pi full-support mode |
| references/adapter-chat-only.md | To understand the unsupported chat-only boundary |
| references/adapter-template.md | When creating a new runtime adapter |
| references/aitune.md | When mission involves inference optimisation (latency, throughput, quantization, GPU deployment) |
Assets
Present these to the user at the indicated phase:
| File | Phase | Purpose |
|---|---|---|
| assets/rubric-builder.html | Phase 1 — Phase B verifier elicitation | Interactive rubric drafting and confirmation for LLM judge tracks |
| assets/inline-progress-chart.html | Phase 3 — after every step | Template for inline progress visualisation. Inject run data and render in chat. Do not invent your own chart — use this. |
Scripts
| File | When to run |
|---|---|
| `scripts/runtime_profile.py` (subcommands: list, show, check-mission) | After verifier selection, before rubric/budget deepening |