# Awesome-omni-skill ML Experiment Tracking

Track machine learning experiments with reproducible parameters and metrics.
## Install

Clone the upstream repo:

```sh
git clone https://github.com/diegosouzapw/awesome-omni-skill
```

Claude Code: install into `~/.claude/skills/`:

```sh
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/skills/machine-learning/ml-experiment-tracking" ~/.claude/skills/diegosouzapw-awesome-omni-skill-ml-experiment-tracking-909a6b \
  && rm -rf "$T"
```

## Manifest

Source: `skills/machine-learning/ml-experiment-tracking/SKILL.md`
# ML Experiment Tracking Skill

Track machine learning experiments with reproducible parameters and metrics.
## Trigger Conditions
- Model configuration changes or hyperparameter updates
- New experiment run initiated
- User invokes with "track experiment" or "compare models"
## Input Contract
- Required: Experiment parameters (model, hyperparameters, data)
- Required: Evaluation metrics
- Optional: Baseline comparison, hypothesis
## Output Contract
- Experiment log entry with full reproducibility info
- Comparison table against baseline/prior runs
- Recommendation on whether to promote or iterate
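As a concrete illustration of the output contract, a log entry might look like the following. This is a hypothetical schema sketched for this README; the field names (`experiment_id`, `environment`, `recommendation`, etc.) are assumptions, not mandated by the skill.

```python
# Hypothetical experiment log entry satisfying the output contract.
# All field names and values are illustrative assumptions.
import json
from datetime import datetime, timezone

entry = {
    "experiment_id": "exp-042",  # assumed ID scheme
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "hypothesis": "A lower learning rate improves F1",
    "parameters": {
        "model": "distilbert-base-uncased",
        "learning_rate": 3e-5,
        "batch_size": 32,
    },
    # Reproducibility info captured alongside the metrics:
    "environment": {"code_commit": "abc1234", "data_version": "v2.1"},
    "metrics": {"f1": 0.89, "latency_ms": 12},
    "baseline": {"f1": 0.92, "latency_ms": 45},
    "recommendation": "promote",
}
print(json.dumps(entry, indent=2))
```

Keeping the entry as plain JSON-serializable data makes it easy to diff runs and build the comparison table against baseline/prior experiments.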
## Tool Permissions
- Read: Model configs, training data metadata, metric logs
- Write: Experiment logs, comparison reports
- Execute: Metric collection commands
## Execution Steps

1. Record experiment hypothesis and parameters
2. Capture environment (dependencies, data version, code commit)
3. Execute or observe training run
4. Collect metrics and artifacts
5. Compare against baseline and prior experiments
6. Recommend: promote, iterate, or abandon
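The steps above can be sketched as a small tracking loop. This is a minimal sketch, not the skill's actual implementation: the function names, the stubbed training run, and the 0.05 iterate/abandon threshold are all assumptions chosen for illustration.

```python
# Minimal sketch of the execution steps; names and thresholds are assumptions.
import subprocess
import time


def capture_environment() -> dict:
    """Step 2: record the code commit (falls back gracefully outside a git repo)."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError, OSError):
        commit = "unknown"
    return {"code_commit": commit, "data_version": "v2.1"}  # data version assumed


def run_experiment(params: dict, train_fn) -> dict:
    """Steps 1-4: record parameters, capture environment, run, collect metrics."""
    entry = {
        "parameters": params,
        "environment": capture_environment(),
        "started_at": time.time(),
    }
    entry["metrics"] = train_fn(params)  # execute/observe the run, collect metrics
    return entry


def recommend(entry: dict, baseline: dict, primary: str = "f1") -> str:
    """Steps 5-6: compare against baseline, then promote, iterate, or abandon."""
    delta = entry["metrics"][primary] - baseline[primary]
    if delta > 0:
        return "promote"
    return "iterate" if delta > -0.05 else "abandon"  # 0.05 cutoff is assumed


# Usage with a stubbed training run:
log = run_experiment({"lr": 3e-5}, lambda p: {"f1": 0.91})
print(recommend(log, baseline={"f1": 0.89}))  # prints "promote"
```

Capturing the environment before the run (rather than after) means a crashed experiment still leaves a reproducible record of what was attempted.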
## Success Criteria
- Experiment is fully reproducible from logged parameters
- Metrics compared against baseline
- Clear recommendation with rationale
## Escalation Rules
- Escalate if model performance degrades vs. baseline
- Escalate if data drift detected in training set
- Escalate if experiment requires new infrastructure
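A possible shape for these rules as a single check, sketched under the assumption that drift detection and infrastructure needs arrive as precomputed flags (the skill does not specify how they are detected):

```python
# Sketch of the escalation rules; the flag inputs are assumptions.
def should_escalate(
    metrics: dict,
    baseline: dict,
    drift_detected: bool = False,   # assumed upstream drift check
    needs_new_infra: bool = False,  # assumed infrastructure flag
    primary: str = "f1",
) -> bool:
    degraded = metrics[primary] < baseline[primary]  # performance vs. baseline
    return degraded or drift_detected or needs_new_infra


print(should_escalate({"f1": 0.88}, {"f1": 0.92}))  # prints True (degraded)
```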
## Example Invocations

**Input:** "Compare the BERT-base and DistilBERT models for our classification task"

**Output:** Experiment log: BERT-base (F1: 0.92, latency: 45 ms, size: 440 MB) vs. DistilBERT (F1: 0.89, latency: 12 ms, size: 260 MB). Recommendation: DistilBERT for production (3-point F1 trade-off for a 73% latency improvement). Promote to staging for an A/B test.
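The trade-off figures quoted in the example output follow directly from the logged metrics:

```python
# Reproducing the trade-off figures from the example's logged metrics.
bert = {"f1": 0.92, "latency_ms": 45, "size_mb": 440}
distil = {"f1": 0.89, "latency_ms": 12, "size_mb": 260}

f1_tradeoff = bert["f1"] - distil["f1"]                       # absolute F1 drop: 0.03
latency_gain = 1 - distil["latency_ms"] / bert["latency_ms"]  # 1 - 12/45 ≈ 0.733

print(f"F1 trade-off: {f1_tradeoff:.2f}, latency improvement: {latency_gain:.0%}")
# prints "F1 trade-off: 0.03, latency improvement: 73%"
```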