## Install

**Source** · Clone the upstream repo:

```bash
git clone https://github.com/pyramidheadshark/claude-scaffold
```

**Claude Code** · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) \
  && git clone --depth=1 https://github.com/pyramidheadshark/claude-scaffold "$T" \
  && mkdir -p ~/.claude/skills \
  && cp -r "$T/.claude/skills/experiment-tracking" \
       ~/.claude/skills/pyramidheadshark-claude-scaffold-experiment-tracking \
  && rm -rf "$T"
```

Manifest: `.claude/skills/experiment-tracking/SKILL.md`
# Experiment Tracking

## When to Load This Skill
Load when working with: MLflow experiments, run logging, model registry, artifact management, experiment comparison, cross-validation with tracking.
## Core Concepts
| Concept | Purpose |
|---|---|
| Run | Single training execution — logs params, metrics, artifacts |
| Experiment | Named collection of runs — logical grouping by model type or task |
| Model Registry | Versioned model store — stages: None → Staging → Production |
| Artifact | Any file output — model weights, plots, feature importance |
## Run Lifecycle Pattern

Always use the context manager — never log outside a run:

```python
import mlflow
import mlflow.sklearn

mlflow.set_experiment("my-experiment")

with mlflow.start_run(run_name="baseline-rf") as run:
    mlflow.log_params({
        "n_estimators": 100,
        "max_depth": 5,
        "random_state": 42,
    })
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)
    mlflow.log_metric("val_accuracy", score)
    mlflow.sklearn.log_model(model, "model")
    run_id = run.info.run_id
```
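Not shown in the original, but recent MLflow versions can also record a model signature so inputs are schema-checked at load and serve time. A minimal sketch, reusing `model` and `X_val` from the block above, inside the same active run:

```python
from mlflow.models import infer_signature

# Infer the input/output schema from validation data and predictions
signature = infer_signature(X_val, model.predict(X_val))
mlflow.sklearn.log_model(
    model,
    "model",
    signature=signature,      # schema enforced by pyfunc at predict time
    input_example=X_val[:5],  # small sample stored alongside the model
)
```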
## Autolog Pattern

Use autolog for quick iteration — disable before production for explicit control:

```python
mlflow.sklearn.autolog(
    log_input_examples=True,
    log_model_signatures=True,
    log_models=True,
    silent=True,
)

with mlflow.start_run():
    model.fit(X_train, y_train)
```
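To act on the "disable before production" advice, autologging can be switched back off explicitly; a one-line sketch:

```python
# Turn sklearn autologging off again in favor of explicit logging calls
mlflow.sklearn.autolog(disable=True)
```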
## Cross-Validation with MLflow

Log CV results as metrics with step index:

```python
from sklearn.model_selection import cross_val_score

with mlflow.start_run():
    mlflow.log_params({"cv_folds": 5, "model": "RandomForest"})
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    for i, score in enumerate(scores):
        mlflow.log_metric("cv_f1", score, step=i)
    mlflow.log_metric("cv_f1_mean", scores.mean())
    mlflow.log_metric("cv_f1_std", scores.std())
```
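For hyperparameter sweeps on top of CV, nested runs keep the experiment tidy: one parent run per sweep, one child per candidate. A hedged sketch (the grid and `RandomForestClassifier` are illustrative, not from the source):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

with mlflow.start_run(run_name="rf-sweep"):
    for n in (100, 200, 400):  # illustrative grid
        # Child runs are grouped under the parent in the MLflow UI
        with mlflow.start_run(run_name=f"n_estimators={n}", nested=True):
            mlflow.log_param("n_estimators", n)
            scores = cross_val_score(
                RandomForestClassifier(n_estimators=n, random_state=42),
                X, y, cv=5, scoring="f1_macro",
            )
            mlflow.log_metric("cv_f1_mean", scores.mean())
```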
## Model Registry

```python
model_uri = f"runs:/{run_id}/model"
registered = mlflow.register_model(model_uri, "my-classifier")

client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="my-classifier",
    version=registered.version,
    stage="Staging",
)
```
Loading a registered model:
```python
model = mlflow.sklearn.load_model("models:/my-classifier/Staging")
```
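Note that MLflow 2.x deprecates registry stages in favor of aliases. On a recent version, an equivalent flow looks roughly like this ("champion" is an arbitrary alias name, not from the source):

```python
client = mlflow.tracking.MlflowClient()
client.set_registered_model_alias(
    name="my-classifier", alias="champion", version=registered.version
)
# Load by alias instead of stage
model = mlflow.sklearn.load_model("models:/my-classifier@champion")
```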
## Experiment Comparison

```python
client = mlflow.tracking.MlflowClient()
runs = client.search_runs(
    experiment_ids=["1"],
    order_by=["metrics.val_f1 DESC"],
    max_results=10,
)
for run in runs:
    print(run.info.run_id, run.data.metrics.get("val_f1"))
```
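Alternatively, the fluent `mlflow.search_runs` returns a pandas DataFrame, which is convenient for side-by-side comparison. A sketch assuming the experiment and metric names used earlier (`experiment_names` requires a reasonably recent MLflow):

```python
df = mlflow.search_runs(
    experiment_names=["my-experiment"],
    order_by=["metrics.val_f1 DESC"],
)
# Metrics and params are flattened into metrics.<name> / params.<name> columns
print(df[["run_id", "metrics.val_f1", "params.n_estimators"]].head())
```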
## Serving via MLflow

```bash
mlflow models serve -m "models:/my-classifier/Production" --port 5001 --no-conda
```
Request format:
```bash
curl -X POST http://localhost:5001/invocations \
  -H "Content-Type: application/json" \
  -d '{"dataframe_records": [{"feature1": 1.0, "feature2": 2.0}]}'
```
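The same request from Python, assuming the server started above is reachable on localhost:

```python
import requests

resp = requests.post(
    "http://localhost:5001/invocations",
    json={"dataframe_records": [{"feature1": 1.0, "feature2": 2.0}]},
    timeout=10,
)
print(resp.json())  # predictions for the submitted records
```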
## Artifact Logging

```python
with mlflow.start_run():
    fig.savefig("confusion_matrix.png")
    mlflow.log_artifact("confusion_matrix.png", artifact_path="plots")
    mlflow.log_dict(feature_importance_dict, "feature_importance.json")
    mlflow.log_text(classification_report_str, "classification_report.txt")
```
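Recent MLflow versions can also log a matplotlib figure directly, skipping the intermediate file; a one-line sketch (inside an active run):

```python
# Writes the figure straight into the run's artifact store
mlflow.log_figure(fig, "plots/confusion_matrix.png")
```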
## Project Structure for Tracking

```
src/{project_name}/
├── training/
│   ├── train.py       # entry point — sets experiment, calls fit
│   ├── evaluate.py    # eval loop — logs metrics per epoch/fold
│   └── register.py    # promotes best run to Model Registry
├── mlruns/            # local tracking store (gitignore this)
└── mlflow.db          # local SQLite backend (gitignore this)
```
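A rough sketch of what `register.py` might contain, under the assumption that it picks the best run by validation metric and registers its model (experiment, metric, and model names carry over from earlier examples):

```python
import mlflow

client = mlflow.tracking.MlflowClient()
exp = client.get_experiment_by_name("my-experiment")
# Best run first, per the validation metric logged during training
best = client.search_runs(
    experiment_ids=[exp.experiment_id],
    order_by=["metrics.val_f1 DESC"],
    max_results=1,
)[0]
mlflow.register_model(f"runs:/{best.info.run_id}/model", "my-classifier")
```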
## Known Pitfalls

- Always use `with mlflow.start_run():` — orphan runs (logged outside the context) pollute the experiment and are hard to clean up
- Never call `mlflow.end_run()` manually — the context manager handles it; manual calls can corrupt the run state
- Set the `MLFLOW_TRACKING_URI` env var in CI — the default is `./mlruns` (relative), which breaks across working directories; see the sketch after this list
- `mlflow.autolog()` must be called BEFORE `model.fit()` — calling it after has no effect
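For the tracking URI pitfall, a minimal sketch of pinning the backend to an absolute location (the path is illustrative):

```python
import mlflow

# Absolute SQLite backend: runs land in the same store regardless of
# the working directory. In CI, exporting MLFLOW_TRACKING_URI in the
# environment achieves the same thing.
mlflow.set_tracking_uri("sqlite:////abs/path/to/mlflow.db")
```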
## Resources
- MLflow docs: https://mlflow.org/docs/latest/
- Model Registry concepts: https://mlflow.org/docs/latest/model-registry.html