install
source · Clone the upstream repo
git clone https://github.com/Upsonic/Upsonic
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Upsonic/Upsonic "$T" && mkdir -p ~/.claude/skills && cp -r "$T/prebuilt_autonomous_agents/applied_scientist/skills/evaluate" ~/.claude/skills/upsonic-upsonic-evaluate && rm -rf "$T"
manifest: prebuilt_autonomous_agents/applied_scientist/skills/evaluate/SKILL.md
Evaluate Skill
Purpose
Compare baseline and new implementation results. Produce the machine-readable final report
result.json, update experiments.json, and append a row to comparison.json.
When to Use
Phase 5 — after the new implementation is complete and metrics are collected.
Input
| Parameter | Type | Description |
|---|---|---|
| experiment_path | path | Directory of this experiment (holds result.json and log.json) |
| research_name | string | Name of this experiment |
Actions
- Collect all metrics from `log.json` (Phase 3 baseline entry + Phase 4 new method entry).
- Determine verdict:
  - `BETTER`: new method outperforms baseline on the majority of key metrics
  - `WORSE`: new method underperforms baseline on the majority of key metrics
  - `INCONCLUSIVE`: mixed results or differences within noise margin
  - `FAILED`: experiment could not produce comparable results (dependency failure, implementation crash, data incompatibility)
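The verdict rules above can be sketched as a small function. This is a minimal illustration, not the skill's actual logic: the name `determine_verdict`, the single absolute `noise_margin` threshold, and the choice to return `FAILED` only when no metric has values on both sides are all assumptions made for the example.

```python
def determine_verdict(metrics, noise_margin=0.0):
    """Return BETTER, WORSE, INCONCLUSIVE, or FAILED for a list of key metrics.

    Each metric is a dict with "current", "new", and "higher_is_better".
    noise_margin is a hypothetical absolute threshold: differences at or
    below it are treated as ties.
    """
    # Only metrics with a value on both sides can be compared.
    comparable = [
        m for m in metrics
        if m["current"] is not None and m["new"] is not None
    ]
    if not comparable:
        return "FAILED"

    new_wins = 0
    current_wins = 0
    for m in comparable:
        diff = m["new"] - m["current"]
        if abs(diff) <= noise_margin:
            continue  # within noise: counts for neither side
        if (diff > 0) == m["higher_is_better"]:
            new_wins += 1
        else:
            current_wins += 1

    # "Majority" is read here as strictly more than half of the comparable metrics.
    if new_wins * 2 > len(comparable):
        return "BETTER"
    if current_wins * 2 > len(comparable):
        return "WORSE"
    return "INCONCLUSIVE"
```

With accuracy as the only key metric (0.853 to 0.872, higher is better) this returns `BETTER`. Adding a slower training time as a second key metric splits the result one to one and yields `INCONCLUSIVE`, which is why the choice of key metrics matters.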
- Write `{experiment_path}/result.json` in the exact schema below. Always valid JSON; never leave fields undefined; use `null` for unknown values.

      {
        "name": "{research_name}",
        "verdict": "BETTER",
        "summary": "2-3 paragraphs explaining what the new method does, how it fundamentally differs from the baseline, and what trade-offs it makes.",
        "explanation": "2-3 sentences explaining WHY this verdict was reached. Reference specific metrics and their differences. Be concrete: mention numbers, not vague statements.",
        "comparison": {
          "metrics": [
            {
              "name": "accuracy",
              "current": 0.853,
              "new": 0.872,
              "diff": 0.019,
              "diff_display": "+0.019",
              "unit": null,
              "higher_is_better": true,
              "better": "new"
            },
            {
              "name": "training_time_seconds",
              "current": 2.0,
              "new": 45.0,
              "diff": 43.0,
              "diff_display": "+43.0",
              "unit": "seconds",
              "higher_is_better": false,
              "better": "current"
            }
          ]
        },
        "file_locations": {
          "current_notebook": "experiments/{research_name}/current.ipynb",
          "current_data": "experiments/{research_name}/current_data/",
          "new_notebook": "experiments/{research_name}/new.ipynb",
          "research_paper": "experiments/{research_name}/research.pdf",
          "experiment_log": "experiments/{research_name}/log.json"
        }
      }

  Field rules
  - `verdict`: exactly one of `"BETTER"`, `"WORSE"`, `"INCONCLUSIVE"`, `"FAILED"`.
  - `summary` / `explanation`: plain text, no markdown headings. Short paragraphs only.
  - `comparison.metrics[]`:
    - `current` / `new` are numbers (or `null` if a side could not compute the metric).
    - `diff = new - current` (raw number).
    - `diff_display` is the short string with sign (`"+0.019"`, `"-0.03"`).
    - `better`: `"new"` | `"current"` | `"tie"` | `null`, computed from `diff` and `higher_is_better`.
    - `unit` is a short unit string (`"seconds"`, `"%"`, etc.) or `null`.
  - `file_locations` uses paths relative to the experiments directory root.
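As an illustration of the `comparison.metrics[]` rules, one entry could be derived as follows. The helper name `build_metric` is hypothetical, and rounding `diff` to six decimals is an assumption added here to keep floating-point noise (such as `0.019000000000000017`) out of the report; the rules themselves only say `diff = new - current`.

```python
def build_metric(name, current, new, higher_is_better, unit=None):
    """Build one comparison.metrics[] entry following the field rules."""
    if current is None or new is None:
        # A side could not compute the metric: dependent fields become null.
        diff = None
        diff_display = None
        better = None
    else:
        # Assumption: round to 6 decimals to suppress float artifacts.
        diff = round(new - current, 6)
        sign = "-" if diff < 0 else "+"
        diff_display = sign + str(abs(diff))
        if diff == 0:
            better = "tie"
        elif (diff > 0) == higher_is_better:
            better = "new"
        else:
            better = "current"
    return {
        "name": name,
        "current": current,
        "new": new,
        "diff": diff,
        "diff_display": diff_display,
        "unit": unit,
        "higher_is_better": higher_is_better,
        "better": better,
    }
```

Passing the two metrics from the schema example reproduces their `diff`, `diff_display`, and `better` values. Python's `None` serializes to JSON `null`, so no field is left undefined.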
- Update `experiments/experiments.json`:
  - Set `status` to `"completed"` (or `"failed"` if the experiment failed).
  - Fill in `verdict`, `key_metric`, `baseline_model`, `new_method`.
  - `key_metric` is an object: `{"name": "...", "baseline": <num>, "new": <num>}`.
- Update `experiments/comparison.json`:
  - If the file does not exist, create it with `{"experiments": []}`.
  - Append an entry:

        {
          "name": "{research_name}",
          "date": "YYYY-MM-DD",
          "baseline": "{baseline_model}",
          "new_method": "{new_method}",
          "key_metric": {"name": "accuracy", "baseline": 0.853, "new": 0.872},
          "verdict": "BETTER"
        }
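The create-if-missing and append behaviour for `comparison.json` might look like the sketch below. The function name `append_comparison` and the sample entries are made up for the example, and the demo writes to a temporary directory rather than a real `experiments/` folder. `allow_nan=False` is an added safeguard so that a stray `NaN` raises an error instead of producing invalid JSON.

```python
import json
import tempfile
from pathlib import Path


def append_comparison(path, entry):
    """Append one entry to comparison.json, creating the file if needed."""
    path = Path(path)
    if path.exists():
        data = json.loads(path.read_text(encoding="utf-8"))
    else:
        data = {"experiments": []}
    data["experiments"].append(entry)
    # allow_nan=False rejects NaN/Infinity, which are not valid JSON.
    path.write_text(
        json.dumps(data, indent=2, allow_nan=False) + "\n", encoding="utf-8"
    )
    return data


# Demo with hypothetical entries in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "comparison.json"
    first = append_comparison(target, {
        "name": "demo_a",
        "date": "2026-04-17",
        "baseline": "baseline_model_a",
        "new_method": "new_method_a",
        "key_metric": {"name": "accuracy", "baseline": 0.853, "new": 0.872},
        "verdict": "BETTER",
    })
    second = append_comparison(target, {
        "name": "demo_b",
        "date": "2026-04-18",
        "baseline": "baseline_model_b",
        "new_method": "new_method_b",
        "key_metric": {"name": "accuracy", "baseline": 0.853, "new": None},
        "verdict": "FAILED",
    })
```

The first call creates the file with a single row; the second reads it back and appends, leaving two rows in insertion order.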
- Update `{experiment_path}/log.json` by appending a Phase 5 entry:

      {
        "name": "Phase 5: Evaluate",
        "completed_at": "2026-04-17T11:40:00Z",
        "verdict": "BETTER",
        "key_change": "accuracy +0.019 (new > current)",
        "files_written": ["result.json", "experiments.json", "comparison.json"]
      }
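A Phase 5 entry with a UTC timestamp in the format shown above could be assembled like this. The helper name `phase5_entry` is hypothetical. Because the overall layout of `log.json` is not specified here, the sketch only builds the entry and leaves the append to the caller.

```python
from datetime import datetime, timezone


def phase5_entry(verdict, key_change, completed_at=None):
    """Build the Phase 5 log entry; completed_at defaults to the current time."""
    when = completed_at or datetime.now(timezone.utc)
    return {
        "name": "Phase 5: Evaluate",
        # Normalize to UTC and format as e.g. 2026-04-17T11:40:00Z.
        "completed_at": when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "verdict": verdict,
        "key_change": key_change,
        "files_written": ["result.json", "experiments.json", "comparison.json"],
    }


entry = phase5_entry(
    "BETTER",
    "accuracy +0.019 (new > current)",
    completed_at=datetime(2026, 4, 17, 11, 40, tzinfo=timezone.utc),
)
```

Pass a timezone-aware `datetime`; a naive one would be interpreted as local time by `astimezone`.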
Output
- `{experiment_path}/result.json`: the final machine-readable report.
- `experiments/experiments.json`: updated with this experiment's final verdict.
- `experiments/comparison.json`: new row appended.
- `log.json`: finalized with Phase 5 entry.