Upsonic benchmark

Benchmark Skill

install
source · Clone the upstream repo
git clone https://github.com/Upsonic/Upsonic
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Upsonic/Upsonic "$T" && mkdir -p ~/.claude/skills && cp -r "$T/prebuilt_autonomous_agents/applied_scientist/skills/benchmark" ~/.claude/skills/upsonic-upsonic-benchmark && rm -rf "$T"
manifest: prebuilt_autonomous_agents/applied_scientist/skills/benchmark/SKILL.md
source content

Benchmark Skill

Purpose

Define the comparison metrics and extract baseline values from the current implementation. Record them as a structured JSON entry so downstream phases and final evaluation can read them directly.

When to Use

Phase 3 — after both current analysis and research analysis are complete.

Input

| Parameter | Type | Description |
| --- | --- | --- |
| experiment_path | path | experiments/{research_name}/ |

Actions

  1. Define comparison metrics:

    • Include ALL metrics already used in current.ipynb.
    • Add any additional metrics that are relevant for the new method.
    • For classification: accuracy, precision, recall, F1, AUC-ROC (as applicable).
    • For regression: MSE, RMSE, MAE, R² (as applicable).
    • Include training time if measurable.
  2. Extract baseline values (a metric-computation sketch follows this list):

    • Read metric values from the output cells of current.ipynb.
    • If a metric is not computed in the notebook, record it as null and set
      "needs_computation": true; both notebooks must then compute it.
  3. Append a Phase 3 entry to {experiment_path}/log.json under phases (see the
     append sketch after this list):

    {
      "name": "Phase 3: Benchmark",
      "completed_at": "2026-04-17T10:45:00Z",
      "metrics": [
        {
          "name": "accuracy",
          "description": "Fraction of correctly classified samples.",
          "higher_is_better": true,
          "baseline": 0.8726,
          "needs_computation": false
        },
        {
          "name": "f1",
          "description": "F1 score (binary, positive class).",
          "higher_is_better": true,
          "baseline": 0.7277,
          "needs_computation": false
        },
        {
          "name": "roc_auc",
          "description": "Area under the ROC curve.",
          "higher_is_better": true,
          "baseline": 0.9274,
          "needs_computation": false
        },
        {
          "name": "training_time_seconds",
          "description": "Wall-clock training time.",
          "higher_is_better": false,
          "baseline": null,
          "needs_computation": true
        }
      ],
      "notes": "training_time_seconds must be added to both notebooks for a fair comparison."
    }
    

    Do not overwrite earlier entries; append to the phases array.

Output

  • {experiment_path}/log.json updated with the Phase 3 benchmark entry
  • A clear list (in metrics) of what the new implementation must compute
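
Downstream phases can read that list straight from the log to see which metrics the new implementation must compute. A short sketch, assuming the log.json layout shown above (the experiment path is a placeholder):

```python
import json
from pathlib import Path

experiment_path = "experiments/my_research"  # placeholder experiment path
log = json.loads((Path(experiment_path) / "log.json").read_text())
benchmark = next(p for p in log["phases"] if p["name"].startswith("Phase 3"))

# Metrics without a recorded baseline must be computed in both notebooks.
must_compute = [m["name"] for m in benchmark["metrics"] if m["needs_computation"]]
all_metrics = [m["name"] for m in benchmark["metrics"]]
```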