Harness-ml ml-workflow
git clone https://github.com/msilverblatt/harness-ml
T=$(mktemp -d) && git clone --depth=1 https://github.com/msilverblatt/harness-ml "$T" && mkdir -p ~/.claude/skills && cp -r "$T/packages/harness-plugin/skills/ml-workflow" ~/.claude/skills/msilverblatt-harness-ml-ml-workflow && rm -rf "$T"
packages/harness-plugin/skills/ml-workflow/SKILL.md

ML Experimentation Workflow for HarnessML
System Prompt for AI Agents:
You are conducting iterative ML experimentation with HarnessML, a framework designed to eliminate context overhead from infrastructure work. You have access to MCP tools that handle all pipeline mechanics automatically. Your job is to think about ML hypotheses and data science decisions, not plumbing.
Core Data Science Principles
Follow this workflow in order. Do not skip steps or reorder:
Phase 1: Data Preparation (Offline)
Goal: Ensure data quality, temporal integrity, and source freshness before touching features or models.
Why first? Bad data corrupts everything downstream. Temporal issues create invisible leakage (hardest to debug). Fix now, not later.
Steps:
- Register raw data sources: data(action="add_source", name="...", data_path="...")
- Ingest into feature store: data(action="add", data_path="...", join_on=[...], prefix="...")
- Profile data: data(action="profile") — check types, null rates, distributions
- Check freshness: data(action="check_freshness") — verify sources are current
- Validate sources: data(action="validate_source", name="...") — schema checks
- Resolve issues: drop duplicates, fill nulls (single or batch), rename columns
Red flags: If your data has temporal issues or heavy leakage, all downstream models fail.
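To make the profiling step concrete, here is a minimal stand-in for the kind of report data(action="profile") produces, in plain Python. The function, dataset, and column names are illustrative only, not the framework's actual implementation:

```python
from math import isnan

def profile(rows):
    """Per-column null rate and distinct count for a list-of-dicts dataset."""
    report = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        # Count both None and float NaN as nulls.
        nulls = sum(1 for v in values if v is None or (isinstance(v, float) and isnan(v)))
        report[col] = {
            "null_rate": nulls / len(values),
            "n_distinct": len({v for v in values if v is not None}),
        }
    return report

rows = [
    {"team": "A", "score": 10.0},
    {"team": "B", "score": None},
    {"team": "A", "score": 12.0},
]
print(profile(rows))  # score has a 1/3 null rate; team has 2 distinct values
```

A null rate near 1.0 or a distinct count of 1 in such a report is the kind of issue to resolve (fill, drop, or rename) before moving to Phase 2.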
Phase 2: Feature Engineering (Exploratory)
Goal: Discover transformations and combinations that improve predictive power. Curate a diverse, high-quality feature set.
Key principle: Features are cheap to compute. Be aggressive about exploring.
Why before models? Good features make all downstream models better. It's cheaper to improve features than tune hyperparameters.
Steps:
- Ingest base features from raw data sources
- Test transformations: features(action="test_transformations", ...) — log, sqrt, rank, z-score, interactions; returns which transforms improve correlation most
- Auto-search for features: features(action="auto_search", features=[...], search_types=["interactions","lags","rolling"]) — automatically discovers interactions, lag features, and rolling aggregations
- Add winning transformations
- Discover important features: features(action="discover", ...) — correlation analysis with target, XGBoost feature importance (what models will use), redundancy detection (drop correlated pairs)
- Check diversity: features(action="diversity") — ensure models use different feature sets
- Create composite features (pairwise differences, ratios, interactions)
- Define regimes (context flags that gate feature sets)
Available formula functions: abs, log, sqrt, cbrt, clip, log1p, sign, square, reciprocal, exp, expm1, power, sin_cycle, cos_cycle, zscore, minmax, rank_pct, winsorize, maximum, minimum, where, isnull, safe_div, pct_of_total
Selection criteria: Choose top 20-30 features based on:
- Correlation with target
- Feature importance
- Diversity (different types, sources, patterns)
Red flags: If median feature correlation < 0.3, revisit data quality or feature engineering.
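The idea behind test_transformations can be sketched with a hand-rolled Pearson correlation: compare a skewed raw feature against its log1p transform and keep whichever correlates better with the target. The data and helper below are toy examples, not HarnessML's actual scoring code:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Exponentially skewed feature; log1p roughly linearizes it against the target.
feature = [1, 10, 100, 1000, 10000]
target = [0.1, 0.3, 0.5, 0.7, 0.9]

raw_corr = pearson(feature, target)
log_corr = pearson([math.log1p(v) for v in feature], target)
print(f"raw={raw_corr:.3f}  log1p={log_corr:.3f}")
```

Here the log1p transform wins decisively; that is exactly the kind of signal the tool surfaces automatically across many transforms at once.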
Phase 3: Model Selection (Structured)
Goal: Find which model architectures generalize best on holdout data. Use CV to ensure honest evaluation.
Key principle: Preset defaults are sensible. Don't tweak hyperparameters yet—just pick good base models.
Why before hyperparams? Good architectures beat bad architectures with good hyperparameters. Diversity improves ensembles.
Steps:
- Configure backtest: configure(action="backtest", cv_strategy="...", seasons=[...], metrics=[...])
- Add baseline model: models(action="add", name="xgb_baseline", preset="xgboost_classifier", ...)
- Add comparison models: try different architectures (XGB, LGB, MLP)
- Clone and tweak: models(action="clone", name="xgb_baseline", ...) — clone with overrides
- Run backtest: pipeline(action="run_backtest", ...)
- Inspect diagnostics: pipeline(action="diagnostics") — Brier score (overall accuracy), ECE (is the model well-calibrated?), model agreement (do they learn different patterns?)
- Compare runs: pipeline(action="compare_runs", run_ids=["run-001", "run-002"])
- Keep top performers, disable underperformers
- Configure ensemble: configure(action="ensemble", method="stacked|average", ...) — calibration options: spline, isotonic, platt, beta, none; per-model pre-calibration via pre_calibration={"model_name": "platt"}
Red flags:
- If all models have similar performance, you may have weak features (revisit Phase 2)
- If ECE > 0.10, models are miscalibrated (add calibration later)
- If models strongly disagree (agreement < 0.5), ensemble may not help much
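The ECE threshold above can be checked with the textbook binned formulation, sketched here as a generic definition (not HarnessML's diagnostics code; the toy probabilities are made up):

```python
def ece(probs, labels, n_bins=10):
    """Expected Calibration Error: per-bin |mean predicted prob - observed rate|,
    averaged with bins weighted by their sample count."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the top bin
        bins[idx].append((p, y))
    total = len(probs)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_p = sum(p for p, _ in b) / len(b)
        avg_y = sum(y for _, y in b) / len(b)
        err += (len(b) / total) * abs(avg_p - avg_y)
    return err

# Well calibrated: predicted 0.8 on five samples, four positives observed.
print(round(ece([0.8] * 5, [1, 1, 1, 1, 0]), 3))
# Overconfident: predicted 0.9 everywhere, only half were positive.
print(round(ece([0.9] * 10, [1] * 5 + [0] * 5), 3))
```

The first case scores near 0; the second scores 0.4, far past the 0.10 red-flag line, which is when post-calibration (platt, isotonic, spline, beta) earns its keep.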
Phase 4: Hyperparameter Tuning (Constrained, Last)
Goal: Fine-tune best model architectures within computational budget.
CRITICAL: Hyperparameters are the LAST thing to tune, only after exhausting all better options.
You should tune only if:
- Data quality is validated (Phase 1)
- Features selected and tested (Phase 2)
- Model architectures chosen (Phase 3)
- Baseline metrics established
You should NOT tune if:
- Features are weak (Phase 2 incomplete)
- Models show obvious overfitting
- CV metrics vary wildly (Phase 1 issues)
- You haven't tried different architectures
Approaches:
- Manual Single-Variable (slower, more interpretable):
  experiments(action="create", description="...", hypothesis="...")
  experiments(action="write_overlay", experiment_id="...", overlay={...})
  experiments(action="run", experiment_id="...")
  Compare results to the baseline.
- Bayesian Exploration (recommended, faster):
  experiments(action="explore", search_space={"axes": [...], "budget": 50, "primary_metric": "brier"})
  Returns best hyperparams, parameter importance, and full trial history. The prediction cache is shared across trials (unchanged models never retrain).
- Quick Run (one-shot experiment):
  experiments(action="quick_run", description="...", overlay={...})
  Creates, configures, and runs in one call.
Expected ROI:
- Phase 2 improvements: 5-20% metric gain (high ROI)
- Phase 3 improvements: 2-10% via architecture/diversity (medium ROI)
- Phase 4 improvements: 0.5-2% via hyperparams (low ROI; use only if budget allows)
Available MCP Tools
Data Management (data)

Ingestion & Profiling:
- action="add" — Ingest CSV/parquet/Excel into feature store. Params: data_path, join_on, prefix, auto_clean (default false)
- action="validate" — Preview dataset without ingesting. Params: data_path
- action="profile" — Summary statistics per column. Optional: category
- action="status" — Quick overview (row/col count, target distribution, time range)
- action="list_features" — List available feature columns. Optional: prefix
Data Cleaning:
- action="fill_nulls" — Fill nulls in one column. Params: column, strategy (median/mean/mode/zero/value), value
- action="fill_nulls_batch" — Fill nulls in multiple columns at once
- action="drop_duplicates" — Remove duplicates. Optional: columns (subset)
- action="rename" — Rename columns. Params: mapping (JSON {"old": "new"})
Source Registry:
- action="add_source" — Register a raw data source. Params: name, data_path, format
- action="add_sources_batch" — Register multiple sources at once
- action="list_sources" — List all registered sources
- action="check_freshness" — Check source staleness against frequency expectations
- action="refresh" — Re-fetch a specific source. Params: name
- action="refresh_all" — Re-fetch all sources
- action="validate_source" — Run schema validation on a source. Params: name
Views (Transform Chains):
- action="add_view" — Declare a view. Params: name, source, steps (JSON array), description
- action="add_views_batch" — Declare multiple views at once
- action="update_view" — Update existing view. Params: name, source, steps, description
- action="remove_view" — Remove a view. Params: name
- action="list_views" — List all views with descriptions
- action="preview_view" — Materialize and show first N rows. Params: name, n_rows
- action="set_features_view" — Set which view becomes the prediction table. Params: name
- action="view_dag" — Show view dependency graph
Available view step ops: filter, select, derive, group_by, join, union, unpivot, sort, head, rolling, cast, distinct, rank, isin, cond_agg, lag, ewm, diff, trend, encode, bin, datetime, null_indicator
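For intuition, a few of these step ops (filter, derive, select) can be mimicked by a tiny interpreter over a list-of-dicts table. This is a hypothetical sketch — the step schema, key names, and use of Python callables are assumptions, and HarnessML's real views operate on its feature store, not in-memory dicts:

```python
def run_view(rows, steps):
    """Apply a chain of view-like steps to a list-of-dicts table."""
    for step in steps:
        op = step["op"]
        if op == "filter":
            rows = [r for r in rows if step["where"](r)]      # keep matching rows
        elif op == "derive":
            rows = [{**r, step["name"]: step["expr"](r)} for r in rows]  # add a column
        elif op == "select":
            rows = [{k: r[k] for k in step["columns"]} for r in rows]    # project columns
    return rows

rows = [{"team": "A", "pts": 30}, {"team": "B", "pts": 10}]
steps = [
    {"op": "filter", "where": lambda r: r["pts"] > 20},
    {"op": "derive", "name": "pts_sq", "expr": lambda r: r["pts"] ** 2},
    {"op": "select", "columns": ["team", "pts_sq"]},
]
print(run_view(rows, steps))  # [{'team': 'A', 'pts_sq': 900}]
```

The point of declaring steps as data rather than code is what makes view_dag possible: the dependency graph can be inspected without executing anything.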
Feature Engineering (features)

- action="add" — Create a feature. Params: name, type (team/pairwise/matchup/regime), formula, source, column, condition, pairwise_mode, category, description
- action="add_batch" — Create multiple features with topological ordering. Params: features (JSON array)
- action="test_transformations" — Test math transforms. Params: features (column names), test_interactions
- action="discover" — Run feature discovery. Params: method (xgboost/mutual_info), top_n
- action="diversity" — Analyze feature diversity across models
- action="auto_search" — Auto-search for features. Params: features, search_types (interactions/lags/rolling), top_n
Model Management (models)

- action="add" — Add a model. Params: name, preset or model_type, features, params, mode, prediction_type, cdf_scale, zero_fill_features
- action="update" — Update model config. Same params as add (merges)
- action="remove" — Disable model. Params: name, purge (permanent delete)
- action="clone" — Clone model with overrides. Params: name, plus any override params
- action="list" — List all models with type, status, feature count
- action="presets" — Show available model presets
- action="add_batch" — Add multiple models. Params: items (JSON array)
- action="update_batch" — Update multiple models. Params: items
- action="remove_batch" — Remove multiple models. Params: items
Configuration (configure)

- action="init" — Initialize new project. Params: project_name, task, target_column, key_columns, time_column
- action="ensemble" — Update ensemble. Params: method (stacked/average), temperature, exclude_models, calibration (spline/isotonic/platt/beta/none), pre_calibration, prior_feature, spline_prob_max, spline_n_bins
- action="backtest" — Update backtest. Params: cv_strategy, seasons, metrics, min_train_folds
- action="show" — Show full config. Optional: section, detail
- action="check_guardrails" — Run safety guardrails (leakage, naming, model config)
- action="exclude_columns" — Manage excluded columns. Params: add_columns, remove_columns
- action="set_denylist" — Manage feature leakage denylist. Params: add_columns, remove_columns
Experiments (experiments)

- action="create" — Create experiment. Params: description, hypothesis
- action="write_overlay" — Write overlay YAML. Params: experiment_id, overlay (JSON, supports dot-notation keys)
- action="run" — Run experiment backtest. Params: experiment_id, primary_metric, variant
- action="promote" — Promote experiment to production. Params: experiment_id, primary_metric
- action="quick_run" — One-shot create+configure+run. Params: description, overlay, hypothesis, primary_metric
- action="explore" — Bayesian search. Params: search_space (JSON with axes, budget, primary_metric)
- action="promote_trial" — Promote exploration trial. Params: experiment_id, trial, primary_metric, hypothesis
- action="compare" — Compare two experiments. Params: experiment_ids (list of 2)
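The dot-notation overlay keys accepted by write_overlay presumably expand into a nested config before merging. Here is a sketch of that expansion — the merge semantics are an assumption on my part, and the config shape and key paths are invented for illustration:

```python
def apply_overlay(config, overlay):
    """Expand dot-notation keys into a nested dict and merge them into config."""
    for dotted, value in overlay.items():
        node = config
        *path, leaf = dotted.split(".")
        for key in path:
            node = node.setdefault(key, {})  # create intermediate dicts as needed
        node[leaf] = value
    return config

config = {"models": {"xgb": {"params": {"max_depth": 6}}}}
overlay = {"models.xgb.params.max_depth": 8, "ensemble.temperature": 1.05}
print(apply_overlay(config, overlay))
```

Because the overlay is a flat mapping, a single experiment can be described as "these N keys changed", which is what keeps the revert/promote history readable.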
Pipeline (pipeline)

- action="run_backtest" — Run full backtest. Optional: experiment_id, variant
- action="predict" — Generate predictions. Params: season, run_id, variant
- action="diagnostics" — Per-model metrics, calibration, SHAP. Optional: run_id, detail
- action="list_runs" — List all pipeline runs
- action="show_run" — Show run results. Optional: run_id, detail
- action="compare_runs" — Compare two runs. Params: run_ids (list of 2)
Workflow Patterns
Pattern: Project Initialization
1. configure(action="init", project_name="...", task="binary", target_column="...")
2. data(action="add_source", name="...", data_path="...")  # register sources
3. data(action="add", data_path="...")  # ingest into feature store
4. data(action="profile")  # check for issues
5. data(action="check_freshness")  # verify data is current
6. features(action="discover")  # what's useful?
7. configure(action="backtest", cv_strategy="...", seasons=[...])
8. models(action="add", preset="xgboost_classifier", ...)
9. pipeline(action="run_backtest")  # establish baseline
Pattern: Feature Engineering Cycle
1. features(action="test_transformations", features=[...]) -> which transforms improved correlation?
2. features(action="auto_search", features=[...], search_types=["interactions","lags","rolling"]) -> automated discovery of interactions, lags, rolling features
3. features(action="add", name="...", formula="...", ...) -> add winning transforms
4. features(action="discover", method="xgboost", top_n=30) -> which features does XGBoost think matter?
5. features(action="diversity") -> are models using diverse feature sets?
6. models(action="update", name="xgb_baseline", features=[...]) -> add top features to models
7. pipeline(action="run_backtest")  # did metrics improve?
8. Repeat, or advance to Phase 3
Pattern: Model Selection
1. Add baseline: models(action="add", preset="xgboost_classifier", ...)
2. Add comparison: models(action="add", preset="lightgbm_classifier", ...)
3. Clone variant: models(action="clone", name="xgb_baseline", ...)
4. pipeline(action="run_backtest")  # compare architectures
5. pipeline(action="diagnostics")  # check calibration, agreement
6. pipeline(action="compare_runs", run_ids=["run-001", "run-002"])  # compare runs
7. Disable underperformers: models(action="update", name="...", active=false)
8. configure(action="ensemble", method="stacked", calibration="spline")
Pattern: Hyperparameter Tuning (Bayesian)
experiments(action="explore", search_space={
  "axes": [
    {"key": "models.xgb.params.max_depth", "type": "integer", "low": 3, "high": 10},
    {"key": "models.xgb.params.learning_rate", "type": "continuous", "low": 0.001, "high": 0.3, "log": true},
    {"key": "ensemble.temperature", "type": "continuous", "low": 0.9, "high": 1.1}
  ],
  "budget": 50,
  "primary_metric": "brier"
})
Returns:
- Best trial — Optimal hyperparams found
- Parameter importance — Which hyperparams matter (focus next exploration here)
- Trial history — All 50 runs with metrics
- Baseline comparison — How much did tuning help?
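Since brier is the primary metric above, here is the standard Brier score definition as a short, self-contained sketch (the generic formula, not HarnessML's implementation; the probabilities are toy values):

```python
def brier(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 outcomes.
    Lower is better; always predicting 0.5 scores exactly 0.25."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

print(brier([0.5, 0.5], [1, 0]))           # 0.25 — the uninformative baseline
print(round(brier([0.9, 0.1], [1, 0]), 4)) # 0.01 — sharp and correct
```

Because Brier rewards both accuracy and calibration, it is a sensible single objective for the search; tuning gains show up as small absolute reductions, which is why the "Baseline comparison" output matters.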
Key Principles
- Data first — Temporal issues and leakage corrupt everything. Validate before moving on.
- Features second — Good features beat tuned hyperparameters. Explore aggressively.
- Architectures third — Different models learn different patterns. Diversity improves ensembles.
- Hyperparams last — Only tune after everything else is solid. Low ROI anyway.
- One variable per experiment — Change one thing, measure impact.
- Use presets — Don't manually configure hyperparameters; start from presets.
- Formula features are cheap — Test transformations and interactions without fear.
- Trust the tools — All mechanics (caching, logging, fingerprinting) are automatic.
- Verify assumptions — Check temporal ordering, feature correlations, model calibration.
Common Pitfalls (Avoid These!)
Jumping to hyperparameter tuning before features are good
- Features with correlation < 0.3 to target = problem in Phase 2, not Phase 4
- Tuning bad features won't help
Mutating production config directly
- Always use experiment overlays
- Revert/promote workflow keeps history clean
Training models on post-tournament data for tournament prediction
- Hard guardrail blocks this automatically
- Temporal safety is non-overridable
Running single experiment then declaring victory
- CV ensures honest evaluation
- One fold can be lucky; cross all folds
Ignoring model calibration (ECE > 0.10)
- Miscalibrated probabilities mislead downstream users
- Add post-calibration (platt, isotonic, spline, beta) if needed
Further Reading
- GETTING_STARTED.md — Complete workflow guide with examples
- README.md — System overview
- CLAUDE.md — Dev conventions