Metaskill generate-report
Generate a comprehensive summary report of the latest experiment including metrics, plots, and comparison with baseline. Use this after training and evaluation to create a shareable experiment summary.
git clone https://github.com/xvirobotics/metaskill
T=$(mktemp -d) && git clone --depth=1 https://github.com/xvirobotics/metaskill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/data-science/.claude/skills/generate-report" ~/.claude/skills/xvirobotics-metaskill-generate-report && rm -rf "$T"
examples/data-science/.claude/skills/generate-report/SKILL.md

You are generating a comprehensive experiment report for this data science project. Your goal is to gather all available metrics, plots, and configuration details from the latest experiment and produce a clear, well-structured report that can be shared with the team.
Dynamic Context
Current branch: !git branch --show-current
Git commit: !git rev-parse --short HEAD 2>/dev/null || echo "unknown"
Recent experiment logs: !ls -lt reports/*.json experiments/*.json 2>/dev/null | head -5 || echo "No experiment logs found"
Available plots: !ls reports/figures/*.png reports/figures/*.svg 2>/dev/null | head -10 || echo "No plots found"
Checkpoints: !ls -lt checkpoints/*.pt checkpoints/*.pth 2>/dev/null | head -3 || echo "No checkpoints"
Config used: !ls configs/*.yaml configs/*.toml 2>/dev/null | head -3 || echo "No configs"
Experiment Name
If the user provided an experiment name:
$ARGUMENTS
Otherwise, derive one from the branch name, latest config file, or use the current date.
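To make that fallback concrete, here is a minimal Python sketch of one way to derive the name; the helper derive_experiment_name is purely illustrative and not part of the skill:

```python
# Illustrative fallback only: prefer a user-supplied name, then the git
# branch, then today's date. derive_experiment_name is a hypothetical helper.
import datetime
import subprocess


def derive_experiment_name(provided: str = "") -> str:
    if provided.strip():
        return provided.strip()
    branch = subprocess.run(
        ["git", "branch", "--show-current"],
        capture_output=True, text=True,
    ).stdout.strip()
    return branch or datetime.date.today().isoformat()
```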
Report Generation Process
Step 1: Gather Experiment Data
Collect all available information about the latest experiment:
- Metrics: Read the latest metrics JSON from reports/ or experiments/
- Training logs: Look for training output logs, MLflow run data, or W&B run summaries
- Configuration: Read the experiment config file (YAML/TOML)
- Checkpoint metadata: Load the best checkpoint and extract epoch, metric, config (a Python sketch follows the shell snippet below)
- Dataset statistics: Look for data profiling outputs or read from data validation logs
```bash
# Find and read latest metrics
METRICS_FILE=$(ls -t reports/*.json experiments/*.json 2>/dev/null | head -1)
if [ -n "$METRICS_FILE" ]; then
  echo "=== Latest Metrics ==="
  cat "$METRICS_FILE"
fi

# Find config used
CONFIG_FILE=$(ls -t configs/*.yaml configs/*.toml 2>/dev/null | head -1)
if [ -n "$CONFIG_FILE" ]; then
  echo "=== Configuration ==="
  cat "$CONFIG_FILE"
fi
```
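For the checkpoint-metadata item, a minimal Python sketch might look like the following; the keys epoch, best_metric, and config are assumptions about what the training script saves, so adjust them to the project's actual checkpoint format:

```python
# Sketch: inspect metadata from the most recent checkpoint. The keys below
# (epoch, best_metric, config) are assumptions -- match them to whatever the
# training script actually stores.
from pathlib import Path

import torch

checkpoints = sorted(
    Path("checkpoints").glob("*.pt*"),   # matches .pt and .pth
    key=lambda p: p.stat().st_mtime,
    reverse=True,
)
if checkpoints:
    # map_location="cpu" avoids needing a GPU; weights_only=False is needed
    # when the checkpoint stores arbitrary Python objects (trusted files only).
    state = torch.load(checkpoints[0], map_location="cpu", weights_only=False)
    if isinstance(state, dict):
        for key in ("epoch", "best_metric", "config"):
            print(f"{key}: {state.get(key, '<not stored>')}")
else:
    print("No checkpoints found")
```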
Step 2: Gather Baseline Data
Look for baseline metrics to compare against:
- Check for a reports/baseline_metrics.json or experiments/baseline.json
- Check git history for previous metrics files: git log --oneline --all -- reports/*.json
- If MLflow is configured, query for the baseline run
- If no baseline exists, note this in the report
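Once a baseline file is found, the comparison itself can be sketched roughly as below; this assumes both files are flat JSON dicts of metric name to numeric value, and reports/metrics.json is only a guess at the current run's path:

```python
# Sketch: compute per-metric deltas against a baseline, if one exists.
# Assumes both files are flat {"metric_name": value} JSON dicts.
import json
from pathlib import Path

current_path = Path("reports/metrics.json")  # guessed path for the current run
baseline_path = next(
    (p for p in (Path("reports/baseline_metrics.json"),
                 Path("experiments/baseline.json")) if p.exists()),
    None,
)

if baseline_path is None or not current_path.exists():
    print("No baseline available -- note this in the report")
else:
    current = json.loads(current_path.read_text())
    baseline = json.loads(baseline_path.read_text())
    for name in sorted(current):
        if name in baseline and isinstance(current[name], (int, float)):
            delta = current[name] - baseline[name]
            print(f"{name}: baseline={baseline[name]} "
                  f"current={current[name]} delta={delta:+.4f}")
```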
Step 3: Generate Visualizations
If plots do not already exist, generate them:
```bash
python3 -c "
import json
from pathlib import Path

# Check if visualization script exists
viz_script = Path('src/evaluation/visualize.py')
if viz_script.exists():
    print('Visualization script found')
else:
    print('No visualization script found -- will generate basic plots')
"
```
Key visualizations to include (a plotting sketch follows this list):
- Training curves: loss and metric over epochs (train vs. validation)
- Confusion matrix: if classification task
- Metric comparison bar chart: current vs. baseline
- Feature importance: if available from the model or analysis
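If the project has no visualization script, a basic matplotlib sketch along these lines can cover the training curves and the baseline comparison chart; the JSON layout (per-epoch train_loss/val_loss lists plus a "final" metrics dict) and the file paths are assumptions to adapt:

```python
# Sketch: basic training-curve and baseline-comparison plots. The metrics
# layout (train_loss/val_loss lists, a "final" dict) and the paths are
# assumptions -- adjust to the project's actual schema.
import json
from pathlib import Path

import matplotlib.pyplot as plt

metrics = json.loads(Path("reports/metrics.json").read_text())
figures = Path("reports/figures")
figures.mkdir(parents=True, exist_ok=True)

# Training curves: train vs. validation loss per epoch
if "train_loss" in metrics and "val_loss" in metrics:
    plt.figure()
    plt.plot(metrics["train_loss"], label="train")
    plt.plot(metrics["val_loss"], label="validation")
    plt.xlabel("epoch")
    plt.ylabel("loss")
    plt.legend()
    plt.savefig(figures / "training_curves.png", dpi=150)
    plt.close()

# Metric comparison bar chart: current vs. baseline
baseline_path = Path("reports/baseline_metrics.json")
if baseline_path.exists() and "final" in metrics:
    baseline = json.loads(baseline_path.read_text())
    names = [k for k in metrics["final"] if k in baseline]
    x = range(len(names))
    plt.figure()
    plt.bar([i - 0.2 for i in x], [baseline[k] for k in names],
            width=0.4, label="baseline")
    plt.bar([i + 0.2 for i in x], [metrics["final"][k] for k in names],
            width=0.4, label="current")
    plt.xticks(list(x), names, rotation=45, ha="right")
    plt.legend()
    plt.tight_layout()
    plt.savefig(figures / "metric_comparison.png", dpi=150)
    plt.close()
```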
Step 4: Write the Report
Generate the report as a Markdown file at reports/experiment_report.md:
```markdown
# Experiment Report: [Experiment Name]

**Date:** [current date]
**Branch:** [git branch]
**Commit:** [git commit hash]
**Author:** [generated by /generate-report skill]

---

## Executive Summary

[2-3 sentences: what was the experiment, what was the key result, and is it better than baseline?]

## Experiment Configuration

| Parameter | Value |
|-----------|-------|
| Model architecture | [from config] |
| Learning rate | [from config] |
| Batch size | [from config] |
| Epochs | [from config] |
| Optimizer | [from config] |
| Scheduler | [from config] |
| Random seed | [from config] |
| Dataset version | [from config or DVC] |

## Dataset Summary

| Split | Samples | Features | Classes |
|-------|---------|----------|---------|
| Train | [count] | [count] | [count or N/A] |
| Validation | [count] | [count] | [count or N/A] |
| Test | [count] | [count] | [count or N/A] |

## Results

### Final Metrics

| Metric | Value |
|--------|-------|
| [metric 1] | [value] |
| [metric 2] | [value] |
| ... | ... |

### Comparison with Baseline

| Metric | Baseline | Current | Delta | Improvement? |
|--------|----------|---------|-------|--------------|
| [metric 1] | [value] | [value] | [+/- value] | [Yes/No] |
| ... | ... | ... | ... | ... |

### Training Curves

![Training curves]([path to training curves figure])

### Confusion Matrix

![Confusion matrix]([path to confusion matrix figure])

## Analysis

### Key Findings

- [Finding 1: most important result]
- [Finding 2: notable pattern or observation]
- [Finding 3: any concerning behavior]

### Error Analysis

- [What types of errors does the model make?]
- [Are errors concentrated in specific classes or data subsets?]

### Comparison with Previous Experiments

- [How does this compare to previous runs?]
- [What changed and what impact did it have?]

## Recommendations

### Next Steps

1. [Actionable recommendation 1]
2. [Actionable recommendation 2]
3. [Actionable recommendation 3]

### Potential Improvements

- [Idea for model improvement]
- [Idea for data improvement]
- [Idea for training procedure improvement]

## Artifacts

| Artifact | Path |
|----------|------|
| Best checkpoint | checkpoints/best_model.pt |
| Metrics JSON | reports/metrics.json |
| Config file | configs/experiment.yaml |
| Training logs | experiments/[run-id]/ |
| Figures | reports/figures/ |

---

*Report generated automatically by the /generate-report skill.*
```
Step 5: Verify Report Quality
After writing the report:
- Read it back and verify all placeholders are filled with actual data
- Verify all referenced figure paths exist
- Verify metrics values are reasonable (not NaN, not obviously wrong)
- Ensure the executive summary accurately reflects the detailed results
- Check that recommendations are specific and actionable, not generic
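The first two checks can be partly automated with a small script; the bracket-placeholder heuristic below simply mirrors the template above and is not exhaustive:

```python
# Sketch: flag leftover template placeholders and missing figure files in the
# generated report. The placeholder patterns are a heuristic matching the
# template above, not a complete check.
import re
from pathlib import Path

report = Path("reports/experiment_report.md")
text = report.read_text()

# Anything that still looks like a template placeholder is probably unfilled.
leftovers = re.findall(
    r"\[(?:from config[^\]]*|value|count[^\]]*|current date|metric \d+)\]", text
)
if leftovers:
    print("Possible unfilled placeholders:", sorted(set(leftovers)))

# Every image referenced from the report should exist on disk.
for rel in re.findall(r"!\[[^\]]*\]\(([^)]+)\)", text):
    if not (report.parent / rel).exists():
        print("Missing figure:", rel)
```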
Report the path to the generated report file when complete.