
Experiment Tracking

Install

Source · Clone the upstream repo:

git clone https://github.com/pyramidheadshark/claude-scaffold

Claude Code · Install into ~/.claude/skills/:

T=$(mktemp -d) && git clone --depth=1 https://github.com/pyramidheadshark/claude-scaffold "$T" && mkdir -p ~/.claude/skills && cp -r "$T/.claude/skills/experiment-tracking" ~/.claude/skills/pyramidheadshark-claude-scaffold-experiment-tracking && rm -rf "$T"

Manifest: .claude/skills/experiment-tracking/SKILL.md

Source content

Experiment Tracking

When to Load This Skill

Load when working with: MLflow experiments, run logging, model registry, artifact management, experiment comparison, cross-validation with tracking.

Core Concepts

Concept · Purpose
Run · Single training execution — logs params, metrics, artifacts
Experiment · Named collection of runs — logical grouping by model type or task
Model Registry · Versioned model store — stages: None → Staging → Production
Artifact · Any file output — model weights, plots, feature importance
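
To make the mapping concrete, here is a minimal sketch of one experiment grouping several runs; the experiment name and hyperparameter values are illustrative:

import mlflow

mlflow.set_experiment("demo-experiment")        # Experiment: created on first use

for depth in (3, 5, 8):                          # illustrative sweep
    with mlflow.start_run(run_name=f"depth-{depth}"):   # Run: one execution
        mlflow.log_param("max_depth", depth)
        mlflow.log_metric("val_accuracy", 0.90)          # placeholder metric value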

Run Lifecycle Pattern

Always use the context manager — never log outside a run:

import mlflow
import mlflow.sklearn

mlflow.set_experiment("my-experiment")

with mlflow.start_run(run_name="baseline-rf") as run:
    mlflow.log_params({
        "n_estimators": 100,
        "max_depth": 5,
        "random_state": 42,
    })

    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)

    mlflow.log_metric("val_accuracy", score)
    mlflow.sklearn.log_model(model, "model")

    run_id = run.info.run_id

Autolog Pattern

Use autolog for quick iteration — disable before production for explicit control:

mlflow.sklearn.autolog(
    log_input_examples=True,
    log_model_signatures=True,
    log_models=True,
    silent=True,
)

with mlflow.start_run():
    model.fit(X_train, y_train)
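
To get explicit control back (for example before a production training job), autolog can be switched off again via its disable flag; a one-line sketch:

mlflow.sklearn.autolog(disable=True)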

Cross-Validation with MLflow

Log CV results as metrics with step index:

from sklearn.model_selection import cross_val_score
import numpy as np

with mlflow.start_run():
    mlflow.log_params({"cv_folds": 5, "model": "RandomForest"})

    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")

    for i, score in enumerate(scores):
        mlflow.log_metric("cv_f1", score, step=i)

    mlflow.log_metric("cv_f1_mean", scores.mean())
    mlflow.log_metric("cv_f1_std", scores.std())

Model Registry

model_uri = f"runs:/{run_id}/model"

registered = mlflow.register_model(model_uri, "my-classifier")

client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="my-classifier",
    version=registered.version,
    stage="Staging",
)

Loading a registered model:

model = mlflow.sklearn.load_model("models:/my-classifier/Staging")
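
If the sklearn flavor is not needed at load time, the generic pyfunc loader works against the same URI; a sketch, with X_val assumed to exist:

import mlflow.pyfunc

model = mlflow.pyfunc.load_model("models:/my-classifier/Staging")
preds = model.predict(X_val)   # pyfunc models expose a generic predict()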

Experiment Comparison

client = mlflow.tracking.MlflowClient()

runs = client.search_runs(
    experiment_ids=["1"],
    order_by=["metrics.val_f1 DESC"],
    max_results=10,
)

for run in runs:
    print(run.info.run_id, run.data.metrics.get("val_f1"))
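
For side-by-side comparison in a notebook, the module-level mlflow.search_runs returns a pandas DataFrame instead of Run objects; a sketch assuming the same experiment ID and that val_f1 and a model param were logged:

import mlflow

df = mlflow.search_runs(
    experiment_ids=["1"],
    order_by=["metrics.val_f1 DESC"],
    max_results=10,
)
print(df[["run_id", "metrics.val_f1", "params.model"]])   # columns assume those keys were logged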

Serving via MLflow

mlflow models serve -m "models:/my-classifier/Production" --port 5001 --no-conda

Request format:

curl -X POST http://localhost:5001/invocations \
  -H "Content-Type: application/json" \
  -d '{"dataframe_records": [{"feature1": 1.0, "feature2": 2.0}]}'

Artifact Logging

with mlflow.start_run():
    fig.savefig("confusion_matrix.png")
    mlflow.log_artifact("confusion_matrix.png", artifact_path="plots")

    mlflow.log_dict(feature_importance_dict, "feature_importance.json")

    mlflow.log_text(classification_report_str, "classification_report.txt")
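
To pull logged artifacts back out of the store later (for example in a reporting step), MLflow 2.x provides a download helper; the run_id, artifact path, and destination below are illustrative:

import mlflow

local_path = mlflow.artifacts.download_artifacts(
    run_id=run_id,                                  # a run_id captured earlier
    artifact_path="plots/confusion_matrix.png",
    dst_path="reports/",
)
print(local_path)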

Project Structure for Tracking

src/{project_name}/
├── training/
│   ├── train.py          # entry point — sets experiment, calls fit
│   ├── evaluate.py       # eval loop — logs metrics per epoch/fold
│   └── register.py       # promotes best run to Model Registry
├── mlruns/               # local tracking store (gitignore this)
└── mlflow.db             # local SQLite backend (gitignore this)
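
To use the SQLite backend and local store shown above, point the tracking URI at them before any logging call; paths are illustrative and relative to the project root:

import mlflow

mlflow.set_tracking_uri("sqlite:///mlflow.db")   # params/metrics go to SQLite
mlflow.set_experiment("my-experiment")           # artifacts typically still land under ./mlruns

The UI can then be pointed at the same backend with mlflow ui --backend-store-uri sqlite:///mlflow.db.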

Known Pitfalls

  • Always use with mlflow.start_run(): — orphan runs logged outside a run context pollute the experiment and are hard to clean up
  • Never call mlflow.end_run() manually — the context manager handles it; manual calls can corrupt run state
  • Set the MLFLOW_TRACKING_URI env var in CI — the default is ./mlruns (relative), which breaks across working directories; see the sketch after this list
  • Call mlflow.autolog() BEFORE model.fit() — calling it after has no effect
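
A minimal sketch of the last two pitfalls together, with an illustrative tracking-server URL:

import os
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

# In CI, make the tracking destination explicit (the URL is an example value)
os.environ.setdefault("MLFLOW_TRACKING_URI", "http://mlflow.internal:5000")

mlflow.sklearn.autolog()   # must run BEFORE fit() so the patching takes effect

model = RandomForestClassifier(n_estimators=100, random_state=42)
with mlflow.start_run():
    model.fit(X_train, y_train)   # X_train / y_train assumed to exist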

Resources