install
source · Clone the upstream repo
git clone https://github.com/Upsonic/Upsonic
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Upsonic/Upsonic "$T" && mkdir -p ~/.claude/skills && cp -r "$T/prebuilt_autonomous_agents/applied_scientist/skills/benchmark" ~/.claude/skills/upsonic-upsonic-benchmark && rm -rf "$T"
manifest:
prebuilt_autonomous_agents/applied_scientist/skills/benchmark/SKILL.mdsource content
Benchmark Skill
Purpose
Define the comparison metrics and extract baseline values from the current implementation. Record them as a structured JSON entry so downstream phases and final evaluation can read them directly.
When to Use
Phase 3 — after both current analysis and research analysis are complete.
Input
| Parameter | Type | Description |
|---|---|---|
| experiment_path | path | |
Actions
-
Define comparison metrics:
- Include ALL metrics already used in
.current.ipynb - Add any additional metrics that are relevant for the new method.
- For classification: accuracy, precision, recall, F1, AUC-ROC (as applicable).
- For regression: MSE, RMSE, MAE, R² (as applicable).
- Include training time if measurable.
- Include ALL metrics already used in
-
Extract baseline values:
- Read metric values from
output cells.current.ipynb - If a metric is not computed in the notebook, record it as
and setnull
— both notebooks must then compute it."needs_computation": true
- Read metric values from
-
Append a Phase 3 entry to
under{experiment_path}/log.json
:phases{ "name": "Phase 3: Benchmark", "completed_at": "2026-04-17T10:45:00Z", "metrics": [ { "name": "accuracy", "description": "Fraction of correctly classified samples.", "higher_is_better": true, "baseline": 0.8726, "needs_computation": false }, { "name": "f1", "description": "F1 score (binary, positive class).", "higher_is_better": true, "baseline": 0.7277, "needs_computation": false }, { "name": "roc_auc", "description": "Area under the ROC curve.", "higher_is_better": true, "baseline": 0.9274, "needs_computation": false }, { "name": "training_time_seconds", "description": "Wall-clock training time.", "higher_is_better": false, "baseline": null, "needs_computation": true } ], "notes": "training_time_seconds must be added to both notebooks for a fair comparison." }Do not overwrite earlier entries; append to the
array.phases
Output
— updated with Phase 3 benchmark entry{experiment_path}/log.json- Clear list (in
) of what the new implementation must computemetrics