# Metaskill evaluate-model
Load the latest model checkpoint, run evaluation on the test set, and generate a metrics report with confusion matrix. Use this after training to assess model performance or to re-evaluate a specific checkpoint.
Clone the repository:

```bash
git clone https://github.com/xvirobotics/metaskill
```

Or install just this skill into `~/.claude/skills`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/xvirobotics/metaskill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/data-science/.claude/skills/evaluate-model" ~/.claude/skills/xvirobotics-metaskill-evaluate-model && rm -rf "$T"
```
`examples/data-science/.claude/skills/evaluate-model/SKILL.md`

You are running model evaluation for this project. Your goal is to load a trained model checkpoint, evaluate it on the held-out test set, compute comprehensive metrics, and generate a structured report.
## Dynamic Context

- Current branch: !`git branch --show-current`
- Available checkpoints: !`ls checkpoints/*.pt checkpoints/*.pth 2>/dev/null || echo "No checkpoints found"`
- Test data: !`ls data/processed/test* data/features/test* 2>/dev/null || echo "No test data found"`
- Latest metrics: !`ls -t reports/*.json experiments/*.json 2>/dev/null | head -3 || echo "No previous metrics found"`
- Config files: !`ls configs/*.yaml configs/*.toml 2>/dev/null || echo "No configs found"`
## Checkpoint Selection
If the user provided a checkpoint path as an argument, use it:
$ARGUMENTS
Otherwise, find the latest checkpoint (a sketch of this logic follows the list):

- Look for `checkpoints/best_model.pt` or `checkpoints/best_model.pth`
- If not found, find the most recently modified `.pt` or `.pth` file in `checkpoints/`
- If no checkpoints exist, report the error and stop
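A minimal Python sketch of this selection logic, assuming the `checkpoints/` layout shown in the dynamic context (the function name is illustrative):

```python
from pathlib import Path

def find_checkpoint(ckpt_dir: str = "checkpoints") -> Path:
    """Return best_model.pt/.pth if present, else the newest .pt/.pth checkpoint."""
    d = Path(ckpt_dir)
    for name in ("best_model.pt", "best_model.pth"):
        if (d / name).exists():
            return d / name
    # Fall back to the most recently modified checkpoint file
    candidates = sorted(
        list(d.glob("*.pt")) + list(d.glob("*.pth")),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    if not candidates:
        raise FileNotFoundError(f"No .pt or .pth checkpoints in {d}/")
    return candidates[0]
```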
## Evaluation Process

### Step 1: Load and Verify Checkpoint
Verify the checkpoint file exists and can be loaded:
python3 -c " import torch ckpt = torch.load('$CHECKPOINT_PATH', map_location='cpu', weights_only=False) print('Checkpoint keys:', list(ckpt.keys())) print('Epoch:', ckpt.get('epoch', 'unknown')) print('Best metric:', ckpt.get('best_metric', 'unknown')) print('Config:', ckpt.get('config', 'not stored')) "
Report the checkpoint metadata: epoch, stored metric, config used.
### Step 2: Run Evaluation Script
Execute the evaluation:
```bash
python3 -m src.models.evaluation.evaluate \
  --checkpoint $CHECKPOINT_PATH \
  --data-dir data/features/ \
  --output-dir reports/ \
  --config configs/experiment.yaml
```
Alternative patterns to try if the above fails:
```bash
python3 src/evaluation/evaluate.py --checkpoint $CHECKPOINT_PATH
python3 evaluate.py --checkpoint $CHECKPOINT_PATH --test-data data/features/test.parquet
```
### Step 3: Collect Metrics
After evaluation completes, read the metrics output. Look for the metrics JSON file:
```bash
cat reports/metrics.json 2>/dev/null || cat reports/evaluation_metrics.json 2>/dev/null
```
If no JSON file was generated, parse metrics from the script's stdout.
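A hedged sketch of that stdout parsing, assuming `name: value` lines such as `accuracy: 0.9321` (the real log format depends on the project's evaluation script):

```python
import re

def parse_metrics_from_stdout(stdout: str) -> dict[str, float]:
    """Pull 'name: 0.1234'-style pairs out of evaluation script output."""
    pattern = re.compile(r"^\s*([A-Za-z][\w\- ]*?)\s*[:=]\s*(\d*\.?\d+)\s*$", re.MULTILINE)
    return {name.strip().lower(): float(val) for name, val in pattern.findall(stdout)}
```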
### Step 4: Generate Confusion Matrix
If the evaluation script did not generate a confusion matrix plot, create one:
python3 -c " import json import numpy as np from pathlib import Path # Load metrics that include confusion matrix data metrics_path = Path('reports/metrics.json') if metrics_path.exists(): metrics = json.loads(metrics_path.read_text()) if 'confusion_matrix' in metrics: cm = np.array(metrics['confusion_matrix']) print('Confusion Matrix:') print(cm) print() # Print per-class metrics for i, row in enumerate(cm): precision = row[i] / max(row.sum(), 1) recall = row[i] / max(cm[:, i].sum(), 1) print(f'Class {i}: Precision={precision:.4f}, Recall={recall:.4f}') "
### Step 5: Compare with Baseline
If previous metrics exist, load and compare (see the sketch after this list):
- Find the most recent previous metrics file (excluding the one just generated)
- Compute deltas for each metric
- Flag any metric regressions (where current is worse than previous)
- Highlight improvements
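A sketch of that comparison, assuming both files hold flat `{metric: value}` JSON and that higher is better for every metric (file names are illustrative):

```python
import json
from pathlib import Path

def compare_metrics(current_path: str, previous_path: str) -> None:
    """Print per-metric deltas; flags a regression when the current value is lower."""
    current = json.loads(Path(current_path).read_text())
    previous = json.loads(Path(previous_path).read_text())
    for name in sorted(set(current) & set(previous)):
        cur, prev = current[name], previous[name]
        if not isinstance(cur, (int, float)) or not isinstance(prev, (int, float)):
            continue  # skip nested values such as the confusion matrix
        delta = cur - prev
        status = "REGRESSION" if delta < 0 else ("improved" if delta > 0 else "unchanged")
        print(f"{name}: {prev:.4f} -> {cur:.4f} ({delta:+.4f}) {status}")

# e.g. compare_metrics("reports/metrics.json", "experiments/last_run.json")
```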
### Step 6: Generate Summary Report
Produce a structured evaluation report:
```markdown
## Model Evaluation Report

### Checkpoint
- Path: [checkpoint path]
- Epoch: [epoch number]
- Training config: [config file used]

### Test Set Metrics

| Metric | Value |
|--------|-------|
| Accuracy | X.XXXX |
| Precision (macro) | X.XXXX |
| Recall (macro) | X.XXXX |
| F1 (macro) | X.XXXX |
| AUC-ROC | X.XXXX |

### Confusion Matrix
[confusion matrix table or reference to plot]

### Comparison with Previous Run

| Metric | Previous | Current | Delta |
|--------|----------|---------|-------|
| ... | ... | ... | +/- ... |

### Observations
- [Key findings about model performance]
- [Any concerning patterns in errors]
- [Recommendations for improvement]
```
Write this report to `reports/evaluation_report.md`.
## Error Handling
- If checkpoint cannot be loaded: check for PyTorch version mismatch, report the error
- If test data is missing: report which files are expected and where to find them
- If CUDA is not available: run evaluation on CPU (slower, but it should work; see the device-selection sketch below)
- If metrics computation fails: report the specific error and which metric caused it
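For the CPU fallback specifically, a minimal device-selection sketch using standard PyTorch calls (the checkpoint path is illustrative):

```python
import torch

# Use the GPU when present; otherwise map all tensors onto the CPU so loading still works
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
ckpt = torch.load("checkpoints/best_model.pt", map_location=device, weights_only=False)
```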