Metaskill evaluate-model

Load the latest model checkpoint, run evaluation on the test set, and generate a metrics report with confusion matrix. Use this after training to assess model performance or to re-evaluate a specific checkpoint.

install
source · Clone the upstream repo
git clone https://github.com/xvirobotics/metaskill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/xvirobotics/metaskill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/data-science/.claude/skills/evaluate-model" ~/.claude/skills/xvirobotics-metaskill-evaluate-model && rm -rf "$T"
manifest: examples/data-science/.claude/skills/evaluate-model/SKILL.md
source content

You are running model evaluation for this project. Your goal is to load a trained model checkpoint, evaluate it on the held-out test set, compute comprehensive metrics, and generate a structured report.

Dynamic Context

Current branch: !`git branch --show-current`
Available checkpoints: !`ls checkpoints/*.pt checkpoints/*.pth 2>/dev/null || echo "No checkpoints found"`
Test data: !`ls data/processed/test* data/features/test* 2>/dev/null || echo "No test data found"`
Latest metrics: !`ls -t reports/*.json experiments/*.json 2>/dev/null | head -3 || echo "No previous metrics found"`
Config files: !`ls configs/*.yaml configs/*.toml 2>/dev/null || echo "No configs found"`

Checkpoint Selection

If the user provided a checkpoint path as an argument, use it:

$ARGUMENTS

Otherwise, find the latest checkpoint:

  1. Look for checkpoints/best_model.pt or checkpoints/best_model.pth
  2. If not found, use the most recently modified .pt or .pth file in checkpoints/
  3. If no checkpoints exist, report the error and stop
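The selection order above can be sketched as a small shell helper (the function name is illustrative; the paths are the ones this skill assumes):

```shell
# find_checkpoint: print the checkpoint path to use, following the
# priority order above; exits non-zero if no checkpoint exists.
find_checkpoint() {
  if [ -f checkpoints/best_model.pt ]; then
    echo checkpoints/best_model.pt
  elif [ -f checkpoints/best_model.pth ]; then
    echo checkpoints/best_model.pth
  else
    # Most recently modified .pt/.pth; grep . fails if the list is empty
    ls -t checkpoints/*.pt checkpoints/*.pth 2>/dev/null | head -1 | grep .
  fi
}
```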

Evaluation Process

Step 1: Load and Verify Checkpoint

Verify the checkpoint file exists and can be loaded:

python3 -c "
import torch
ckpt = torch.load('$CHECKPOINT_PATH', map_location='cpu', weights_only=False)
print('Checkpoint keys:', list(ckpt.keys()))
print('Epoch:', ckpt.get('epoch', 'unknown'))
print('Best metric:', ckpt.get('best_metric', 'unknown'))
print('Config:', ckpt.get('config', 'not stored'))
"

Report the checkpoint metadata: epoch, stored metric, config used.

Step 2: Run Evaluation Script

Execute the evaluation:

python3 -m src.models.evaluation.evaluate \
    --checkpoint $CHECKPOINT_PATH \
    --data-dir data/features/ \
    --output-dir reports/ \
    --config configs/experiment.yaml

Alternative patterns to try if the above fails:

  • python3 src/evaluation/evaluate.py --checkpoint $CHECKPOINT_PATH
  • python3 evaluate.py --checkpoint $CHECKPOINT_PATH --test-data data/features/test.parquet

Step 3: Collect Metrics

After evaluation completes, read the metrics output. Look for the metrics JSON file:

cat reports/metrics.json 2>/dev/null || cat reports/evaluation_metrics.json 2>/dev/null

If no JSON file was generated, parse metrics from the script's stdout.
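If you do have to fall back to stdout, one tolerant approach is to grab `name: value` / `name = value` lines. This is a sketch that assumes the script prints one scalar metric per line, which is a guess about its output format:

```python
import re

def parse_stdout_metrics(stdout: str) -> dict:
    """Extract 'accuracy: 0.9231'-style lines into {name: float}."""
    pattern = re.compile(
        # metric name, then ':' or '=', then a bare number ending the line
        r"^\s*([A-Za-z][\w\- ()]*?)\s*[:=]\s*([0-9]*\.?[0-9]+)\s*$",
        re.MULTILINE,
    )
    return {name.strip(): float(value) for name, value in pattern.findall(stdout)}
```

For example, `parse_stdout_metrics("accuracy: 0.92\nf1 (macro) = 0.88\n")` yields `{'accuracy': 0.92, 'f1 (macro)': 0.88}`; lines without a trailing bare number are ignored.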

Step 4: Generate Confusion Matrix

If the evaluation script did not generate a confusion matrix plot, create one:

python3 -c "
import json
import numpy as np
from pathlib import Path

# Load metrics that include confusion matrix data
metrics_path = Path('reports/metrics.json')
if metrics_path.exists():
    metrics = json.loads(metrics_path.read_text())
    if 'confusion_matrix' in metrics:
        # Assumes the usual convention: rows = true classes, columns = predicted
        cm = np.array(metrics['confusion_matrix'])
        print('Confusion Matrix:')
        print(cm)
        print()
        # Per-class metrics: precision divides by the column (predicted) total,
        # recall by the row (actual) total
        for i in range(len(cm)):
            precision = cm[i, i] / max(cm[:, i].sum(), 1)
            recall = cm[i, i] / max(cm[i, :].sum(), 1)
            print(f'Class {i}: Precision={precision:.4f}, Recall={recall:.4f}')
"

Step 5: Compare with Baseline

If previous metrics exist, load and compare:

  1. Find the most recent previous metrics file (excluding the one just generated)
  2. Compute deltas for each metric
  3. Flag any metric regressions (where current is worse than previous)
  4. Highlight improvements
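Steps 1–4 above can be done with a short comparison helper. This is a sketch: the flat-JSON metrics layout and the name-based lower-is-better heuristic are assumptions, so adjust them to your metrics files:

```python
import json
from pathlib import Path

LOWER_IS_BETTER = ("loss", "error")  # assumed naming convention

def compare_metrics(current_path, previous_path):
    """Return (name, previous, current, delta, regressed) for each shared scalar metric."""
    current = json.loads(Path(current_path).read_text())
    previous = json.loads(Path(previous_path).read_text())
    rows = []
    for name in sorted(set(current) & set(previous)):
        cur, prev = current[name], previous[name]
        # Skip non-scalar entries such as a stored confusion matrix
        if not isinstance(cur, (int, float)) or not isinstance(prev, (int, float)):
            continue
        delta = cur - prev
        lower_better = any(k in name.lower() for k in LOWER_IS_BETTER)
        regressed = delta > 0 if lower_better else delta < 0
        rows.append((name, prev, cur, delta, regressed))
    return rows
```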

Step 6: Generate Summary Report

Produce a structured evaluation report:

## Model Evaluation Report

### Checkpoint
- Path: [checkpoint path]
- Epoch: [epoch number]
- Training config: [config file used]

### Test Set Metrics
| Metric | Value |
|--------|-------|
| Accuracy | X.XXXX |
| Precision (macro) | X.XXXX |
| Recall (macro) | X.XXXX |
| F1 (macro) | X.XXXX |
| AUC-ROC | X.XXXX |

### Confusion Matrix
[confusion matrix table or reference to plot]

### Comparison with Previous Run
| Metric | Previous | Current | Delta |
|--------|----------|---------|-------|
| ... | ... | ... | +/- ... |

### Observations
- [Key findings about model performance]
- [Any concerning patterns in errors]
- [Recommendations for improvement]

Write this report to reports/evaluation_report.md.
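Assembling the skeleton above can also be scripted; a minimal sketch (the helper name and the scalar-only filtering are illustrative, not part of the evaluation script):

```python
from pathlib import Path

def write_report(metrics: dict, checkpoint: str, out="reports/evaluation_report.md"):
    """Render scalar metrics into the report skeleton and write it to disk."""
    lines = [
        "## Model Evaluation Report",
        "",
        "### Checkpoint",
        f"- Path: {checkpoint}",
        "",
        "### Test Set Metrics",
        "| Metric | Value |",
        "|--------|-------|",
    ]
    lines += [
        f"| {name} | {value:.4f} |"
        for name, value in metrics.items()
        if isinstance(value, (int, float))  # skip nested structures
    ]
    path = Path(out)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n")
    return path
```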

Error Handling

  • If checkpoint cannot be loaded: check for PyTorch version mismatch, report the error
  • If test data is missing: report which files are expected and where to find them
  • If CUDA is not available: run evaluation on CPU (will be slower but should work)
  • If metrics computation fails: report the specific error and which metric caused it