# claude-skill-registry-data: ml-rigor

Enforces baseline comparisons, cross-validation, interpretation, and leakage prevention for ML pipelines.

Install:

```shell
git clone https://github.com/majiayu000/claude-skill-registry-data
```

Or copy the skill directly into `~/.claude/skills`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/ml-rigor" ~/.claude/skills/majiayu000-claude-skill-registry-data-ml-rigor && rm -rf "$T"
```

**data/ml-rigor/SKILL.md**

# Machine Learning Rigor Patterns
## When to Use

Load this skill when building machine learning models. Every ML pipeline must demonstrate:

- **Baseline comparison**: beat a dummy model before claiming success
- **Cross-validation**: report variance, not just a single score
- **Interpretation**: explain what the model learned
- **Leakage prevention**: ensure no future information leaks into training

**Quality Gate**: ML findings without baseline comparison or cross-validation are marked "Exploratory" in reports.
## 1. Baseline Requirements

Every model must be compared to baselines. A model that can't beat a dummy classifier isn't learning anything useful.

Always compare to:

- **`DummyClassifier`/`DummyRegressor`** - the absolute minimum bar
- **Simple linear model** - `LogisticRegression` or `LinearRegression`
- **Domain heuristic (if available)** - a rule-based approach
### Baseline Code Template

```python
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

print("[DECISION] Establishing baselines before training complex models")

# Classification baselines
dummy_clf = DummyClassifier(strategy='most_frequent')
dummy_scores = cross_val_score(dummy_clf, X_train, y_train, cv=5, scoring='accuracy')
print(f"[METRIC:baseline_accuracy] {dummy_scores.mean():.3f} (majority class)")
print(f"[METRIC:baseline_accuracy_std] {dummy_scores.std():.3f}")

# Simple linear baseline
lr = LogisticRegression(max_iter=1000, random_state=42)
lr_scores = cross_val_score(lr, X_train, y_train, cv=5, scoring='accuracy')
print(f"[METRIC:linear_baseline_accuracy] {lr_scores.mean():.3f}")
print(f"[METRIC:linear_baseline_accuracy_std] {lr_scores.std():.3f}")

# For regression tasks
dummy_reg = DummyRegressor(strategy='mean')
dummy_rmse = cross_val_score(dummy_reg, X_train, y_train, cv=5,
                             scoring='neg_root_mean_squared_error')
print(f"[METRIC:baseline_rmse] {-dummy_rmse.mean():.3f} (mean predictor)")
```
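The template above covers the dummy and linear baselines; a domain heuristic can be scored the same way by wrapping the rule in a scikit-learn-compatible estimator so `cross_val_score` accepts it. A minimal, self-contained sketch — the threshold rule and synthetic data are illustrative stand-ins, not part of the original template:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import cross_val_score

class RuleBaseline(BaseEstimator, ClassifierMixin):
    """Wraps a hand-written domain rule so cross_val_score can evaluate it."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)  # required for classifier scoring
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Hypothetical domain rule: predict positive when column 0 exceeds 0.5
        return (X[:, 0] > 0.5).astype(int)

# Synthetic data where the rule is partially informative
rng = np.random.default_rng(42)
X = rng.random((200, 3))
y = ((X[:, 0] + 0.2 * rng.standard_normal(200)) > 0.5).astype(int)

heuristic_scores = cross_val_score(RuleBaseline(), X, y, cv=5, scoring='accuracy')
print(f"[METRIC:heuristic_baseline_accuracy] {heuristic_scores.mean():.3f}")
```

Because the rule is a regular estimator, it drops into the same CV machinery as the dummy and linear baselines, so all three comparisons use identical folds.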
### Improvement Over Baseline with CI

```python
import numpy as np

# Calculate improvement with confidence interval
model_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
improvement = model_scores.mean() - dummy_scores.mean()

# Bootstrap CI for improvement
n_bootstrap = 1000
improvements = []
for _ in range(n_bootstrap):
    idx = np.random.choice(len(model_scores), len(model_scores), replace=True)
    boot_improvement = model_scores[idx].mean() - dummy_scores[idx].mean()
    improvements.append(boot_improvement)

ci_low, ci_high = np.percentile(improvements, [2.5, 97.5])
print(f"[METRIC:improvement_over_baseline] {improvement:.3f}")
print(f"[STAT:ci] 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
print(f"[STAT:effect_size] Relative improvement: {improvement/dummy_scores.mean()*100:.1f}%")

if ci_low > 0:
    print("[FINDING] Model significantly outperforms baseline")
else:
    print("[LIMITATION] Improvement over baseline not statistically significant")
```
## 2. Cross-Validation Requirements

Never report a single train/test split. Cross-validation shows how much your score varies.

Requirements:

- Use stratified K-fold for classification (preserves class distribution)
- Report mean +/- std, not just the mean
- Calculate a confidence interval for mean performance
- Use repeated CV for small datasets
### Cross-Validation Code Template

```python
from sklearn.model_selection import (
    StratifiedKFold, cross_val_score, cross_validate, RepeatedStratifiedKFold
)
import numpy as np
from scipy import stats

print("[DECISION] Using 5-fold stratified CV to estimate model performance")

# Stratified K-Fold for classification
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Multiple metrics
scoring = ['accuracy', 'f1_weighted', 'roc_auc']
cv_results = cross_validate(model, X, y, cv=cv, scoring=scoring,
                            return_train_score=True)

# Report mean +/- std (REQUIRED)
for metric in scoring:
    test_scores = cv_results[f'test_{metric}']
    train_scores = cv_results[f'train_{metric}']
    print(f"[METRIC:cv_{metric}_mean] {test_scores.mean():.3f}")
    print(f"[METRIC:cv_{metric}_std] {test_scores.std():.3f}")

    # Check for overfitting
    gap = train_scores.mean() - test_scores.mean()
    if gap > 0.1:
        print(f"[LIMITATION] Train-test gap of {gap:.3f} suggests overfitting")
```
### Confidence Interval for CV Mean

```python
# CI for cross-validation mean (t-distribution for small n)
def cv_confidence_interval(scores, confidence=0.95):
    n = len(scores)
    mean = scores.mean()
    se = stats.sem(scores)
    h = se * stats.t.ppf((1 + confidence) / 2, n - 1)
    return mean - h, mean + h

ci_low, ci_high = cv_confidence_interval(cv_results['test_accuracy'])
print(f"[STAT:ci] 95% CI for accuracy: [{ci_low:.3f}, {ci_high:.3f}]")

# For small datasets, use repeated CV
print("[DECISION] Using repeated CV for more stable estimates")
rcv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, cv=rcv, scoring='accuracy')
print(f"[METRIC:repeated_cv_mean] {scores.mean():.3f}")
print(f"[METRIC:repeated_cv_std] {scores.std():.3f}")
```
## 3. Hyperparameter Tuning

Avoid overfitting to the validation set. Report the distribution of scores, not just the best.

Requirements:

- Use `RandomizedSearchCV` or Bayesian optimization (e.g., Optuna)
- Report the distribution of scores across parameter combinations
- Use nested CV to get an unbiased performance estimate
- Watch for overfitting to the validation set
### RandomizedSearchCV Template

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
import numpy as np

print("[DECISION] Using RandomizedSearchCV with 100 iterations")

# Define parameter distributions (not just lists!)
param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 20),
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)
}

random_search = RandomizedSearchCV(
    model, param_distributions, n_iter=100, cv=5,
    scoring='accuracy', random_state=42,
    return_train_score=True, n_jobs=-1
)
random_search.fit(X_train, y_train)

# Report distribution of scores (not just best!)
cv_results = random_search.cv_results_
all_test_scores = cv_results['mean_test_score']
all_train_scores = cv_results['mean_train_score']
print(f"[METRIC:tuning_best_score] {random_search.best_score_:.3f}")
print(f"[METRIC:tuning_score_range] [{all_test_scores.min():.3f}, {all_test_scores.max():.3f}]")
print(f"[METRIC:tuning_score_std] {all_test_scores.std():.3f}")

# Check for overfitting during tuning
best_idx = random_search.best_index_
train_test_gap = all_train_scores[best_idx] - all_test_scores[best_idx]
if train_test_gap > 0.1:
    print(f"[LIMITATION] Best model shows {train_test_gap:.3f} train-test gap")
```
### Nested Cross-Validation (Unbiased Estimate)

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
import numpy as np

print("[DECISION] Using nested CV for unbiased performance estimate")

# Outer CV for performance estimation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Inner CV is handled by RandomizedSearchCV
inner_search = RandomizedSearchCV(
    model, param_distributions, n_iter=50,
    cv=3,  # inner CV
    scoring='accuracy', random_state=42, n_jobs=-1
)

# Nested CV scores
nested_scores = cross_val_score(inner_search, X, y, cv=outer_cv, scoring='accuracy')
print(f"[METRIC:nested_cv_mean] {nested_scores.mean():.3f}")
print(f"[METRIC:nested_cv_std] {nested_scores.std():.3f}")
print(f"[STAT:ci] 95% CI [{nested_scores.mean() - 1.96*nested_scores.std()/np.sqrt(5):.3f}, "
      f"{nested_scores.mean() + 1.96*nested_scores.std()/np.sqrt(5):.3f}]")
```
## 4. Calibration Requirements

Probability predictions must be calibrated: a 70% confidence should be correct 70% of the time.

Requirements:

- Check the calibration curve for probability predictions
- Report the Brier score (lower is better)
- Apply calibration if needed (Platt scaling, isotonic regression)
### Calibration Code Template

```python
from sklearn.calibration import calibration_curve, CalibratedClassifierCV
from sklearn.metrics import brier_score_loss
import matplotlib.pyplot as plt
import numpy as np

print("[DECISION] Checking probability calibration")

# Get probability predictions
y_prob = model.predict_proba(X_test)[:, 1]

# Brier score (0 = perfect, 1 = worst)
brier = brier_score_loss(y_test, y_prob)
print(f"[METRIC:brier_score] {brier:.4f}")

# Calibration curve
prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)

# Plot calibration curve
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
ax.plot(prob_pred, prob_true, 's-', label=f'Model (Brier={brier:.3f})')
ax.set_xlabel('Mean predicted probability')
ax.set_ylabel('Fraction of positives')
ax.set_title('Calibration Curve')
ax.legend()
plt.savefig('figures/calibration_curve.png', dpi=150, bbox_inches='tight')
print("[ARTIFACT:figure] figures/calibration_curve.png")

# Check calibration quality
max_calibration_error = np.max(np.abs(prob_true - prob_pred))
print(f"[METRIC:max_calibration_error] {max_calibration_error:.3f}")
if max_calibration_error > 0.1:
    print("[LIMITATION] Model probabilities are poorly calibrated")
```
### Apply Calibration

```python
print("[DECISION] Applying isotonic calibration to improve probability estimates")

# Calibrate using held-out data
calibrated_model = CalibratedClassifierCV(model, method='isotonic', cv=5)
calibrated_model.fit(X_train, y_train)

# Compare before/after
y_prob_calibrated = calibrated_model.predict_proba(X_test)[:, 1]
brier_calibrated = brier_score_loss(y_test, y_prob_calibrated)
print(f"[METRIC:brier_before_calibration] {brier:.4f}")
print(f"[METRIC:brier_after_calibration] {brier_calibrated:.4f}")
print(f"[METRIC:calibration_improvement] {brier - brier_calibrated:.4f}")
```
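Platt scaling, the other method mentioned in the requirements, is selected with `method='sigmoid'`. A self-contained sketch comparing both methods — the synthetic dataset and `GaussianNB` base model are stand-ins for illustration, not part of the original template:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary task; naive Bayes tends to produce over-confident
# probabilities, which makes the effect of calibration visible
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

base = GaussianNB().fit(X_train, y_train)
brier_raw = brier_score_loss(y_test, base.predict_proba(X_test)[:, 1])

for method in ('sigmoid', 'isotonic'):  # Platt scaling vs. isotonic regression
    calibrated = CalibratedClassifierCV(GaussianNB(), method=method, cv=5)
    calibrated.fit(X_train, y_train)
    brier = brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1])
    print(f"[METRIC:brier_{method}] {brier:.4f} (raw: {brier_raw:.4f})")
```

As a rule of thumb, isotonic regression is more flexible but can overfit small calibration sets; Platt scaling (a logistic fit) is the safer choice when only a few hundred calibration samples are available.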
## 5. Interpretation Requirements

Explain what the model learned. Black boxes are not acceptable for important decisions.

Requirements:

- Compute permutation importance or SHAP values
- Show at least one case study (why this specific prediction?)
- Verify that the top features make domain sense
### Permutation Importance Template

```python
from sklearn.inspection import permutation_importance
import pandas as pd

print("[DECISION] Computing permutation importance on test set")

# Permutation importance (more reliable than built-in feature_importances_)
perm_importance = permutation_importance(
    model, X_test, y_test, n_repeats=30, random_state=42, n_jobs=-1
)

# Create importance DataFrame
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance_mean': perm_importance.importances_mean,
    'importance_std': perm_importance.importances_std
}).sort_values('importance_mean', ascending=False)

# Report top features
print("[METRIC:top_features]")
for i, row in importance_df.head(5).iterrows():
    print(f"  {row['feature']}: {row['importance_mean']:.4f} (+/- {row['importance_std']:.4f})")

# Check for unexpected features
print("[CHECK:domain_sense] Verify top features align with domain knowledge")
```
### SHAP Values Template

```python
import shap
import numpy as np
import matplotlib.pyplot as plt

print("[DECISION] Using SHAP for model interpretation")

# Create explainer
explainer = shap.TreeExplainer(model)  # or shap.KernelExplainer for any model
shap_values = explainer.shap_values(X_test)

# Global importance
shap.summary_plot(shap_values, X_test, feature_names=feature_names, show=False)
plt.savefig('figures/shap_summary.png', dpi=150, bbox_inches='tight')
print("[ARTIFACT:figure] figures/shap_summary.png")

# Mean absolute SHAP values
if isinstance(shap_values, list):  # multi-class
    shap_importance = np.abs(shap_values[1]).mean(axis=0)
else:
    shap_importance = np.abs(shap_values).mean(axis=0)

top_features = sorted(zip(feature_names, shap_importance),
                      key=lambda x: x[1], reverse=True)[:5]
print("[METRIC:shap_top_features]")
for feat, imp in top_features:
    print(f"  {feat}: {imp:.4f}")
```
### Case Study (Individual Prediction)

```python
print("[DECISION] Analyzing individual prediction for interpretability")

# Select an interesting case (e.g., a high-confidence wrong prediction)
y_pred_proba = model.predict_proba(X_test)[:, 1]
y_pred = model.predict(X_test)
mistakes = (y_pred != y_test)
high_conf_mistakes = mistakes & (np.abs(y_pred_proba - 0.5) > 0.4)

if high_conf_mistakes.any():
    case_idx = np.where(high_conf_mistakes)[0][0]
    print(f"[ANALYSIS] Case study: High-confidence mistake (index {case_idx})")
    print(f"  Predicted: {y_pred[case_idx]} (prob={y_pred_proba[case_idx]:.3f})")
    print(f"  Actual: {y_test.iloc[case_idx]}")

    # SHAP for this case
    shap.force_plot(
        explainer.expected_value[1] if isinstance(explainer.expected_value, list)
        else explainer.expected_value,
        shap_values[case_idx] if not isinstance(shap_values, list)
        else shap_values[1][case_idx],
        X_test.iloc[case_idx],
        matplotlib=True, show=False
    )
    plt.savefig('figures/case_study_shap.png', dpi=150, bbox_inches='tight')
    print("[ARTIFACT:figure] figures/case_study_shap.png")
    print("[LIMITATION] Model fails on cases with pattern: [describe pattern]")
```
## 6. Error Analysis

Understand where and why the model fails. Aggregate metrics hide important failure modes.

Requirements:

- Slice performance by key segments (demographics, time periods, etc.)
- Analyze failure modes (false positives vs. false negatives)
- Check for systematic errors (biases, subgroup issues)
### Error Analysis Code Template

```python
from sklearn.metrics import classification_report, confusion_matrix

print("[DECISION] Performing error analysis across segments")

y_pred = model.predict(X_test)

# Overall confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("[METRIC:confusion_matrix]")
print(cm)

# Classification report
print("[ANALYSIS] Classification report:")
print(classification_report(y_test, y_pred, target_names=class_names))
```
### Slice Performance Analysis

```python
import pandas as pd

print("[DECISION] Analyzing performance by key segments")

# Create test DataFrame with predictions
test_df = X_test.copy()
test_df['y_true'] = y_test.values
test_df['y_pred'] = y_pred
overall_accuracy = (test_df['y_true'] == test_df['y_pred']).mean()

# Define segments to analyze
segments = {
    'age_group': pd.cut(test_df['age'], bins=[0, 30, 50, 100],
                        labels=['young', 'middle', 'senior']),
    'income_tier': pd.qcut(test_df['income'], q=3,
                           labels=['low', 'medium', 'high'])
}

print("[METRIC:slice_performance]")
for segment_name, segment_values in segments.items():
    test_df['segment'] = segment_values
    for segment_val in segment_values.unique():
        mask = test_df['segment'] == segment_val
        if mask.sum() > 10:  # only report if enough samples
            segment_accuracy = (test_df.loc[mask, 'y_true'] ==
                                test_df.loc[mask, 'y_pred']).mean()
            print(f"  {segment_name}={segment_val}: accuracy={segment_accuracy:.3f} (n={mask.sum()})")

            # Check for underperformance
            if segment_accuracy < overall_accuracy - 0.1:
                print(f"  [LIMITATION] Model underperforms on {segment_name}={segment_val}")
```
### Failure Mode Analysis

```python
print("[DECISION] Analyzing failure modes")

# Separate false positives and false negatives
fp_mask = (y_pred == 1) & (y_test == 0)
fn_mask = (y_pred == 0) & (y_test == 1)
print(f"[METRIC:false_positive_rate] {fp_mask.mean():.3f}")
print(f"[METRIC:false_negative_rate] {fn_mask.mean():.3f}")

# Analyze characteristics of errors
if fp_mask.sum() > 0:
    print("[ANALYSIS] False positive characteristics:")
    fp_data = X_test[fp_mask]
    for col in feature_names[:5]:  # top features
        print(f"  {col}: mean={fp_data[col].mean():.3f} vs overall={X_test[col].mean():.3f}")

# Check for systematic bias
print("[CHECK:systematic_error] Review if errors correlate with protected attributes")
```
## 7. Leakage Checklist

Data leakage silently destroys model validity. Check these BEFORE trusting any results.

Checklist:

- Time-based splits for temporal data (no future information)
- No target information in features (derived features, proxies)
- Preprocessing inside the CV loop (never fit on test data)
- Group-aware splits if samples are related (same user, same session)
### Time-Based Split Template

```python
from sklearn.model_selection import TimeSeriesSplit

print("[DECISION] Using time-based split for temporal data")

# Check if data has a temporal component
if 'date' in df.columns or 'timestamp' in df.columns:
    print("[CHECK:temporal_leakage] Using TimeSeriesSplit to prevent future information leak")

    # Sort by time
    df_sorted = df.sort_values('date')

    # Time-based cross-validation
    tscv = TimeSeriesSplit(n_splits=5)
    for fold, (train_idx, test_idx) in enumerate(tscv.split(df_sorted)):
        train_max_date = df_sorted.iloc[train_idx]['date'].max()
        test_min_date = df_sorted.iloc[test_idx]['date'].min()
        if train_max_date >= test_min_date:
            print(f"[ERROR] Fold {fold}: Data leakage detected!")
        else:
            print(f"[CHECK:fold_{fold}] Train ends {train_max_date}, Test starts {test_min_date} - OK")
else:
    print("[LIMITATION] No temporal column found - using random split")
```
### Target Leakage Detection

```python
print("[CHECK:target_leakage] Checking for target information in features")

# Check correlation between features and target
correlations = X_train.corrwith(pd.Series(y_train, index=X_train.index))
high_corr_features = correlations[correlations.abs() > 0.9].index.tolist()

if high_corr_features:
    print("[WARNING] Suspiciously high correlations with target:")
    for feat in high_corr_features:
        print(f"  {feat}: r={correlations[feat]:.3f}")
    print("[LIMITATION] Review these features for potential target leakage")
else:
    print("[CHECK:target_leakage] No obvious target leakage detected")

# Check for post-hoc features (created after the outcome)
print("[CHECK:feature_timing] Verify all features are available at prediction time")
```
### Preprocessing Inside CV Template

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

print("[DECISION] Using Pipeline to prevent preprocessing leakage")

# WRONG: fitting the scaler on all data before CV
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X)  # LEAKAGE!
# scores = cross_val_score(model, X_scaled, y, cv=5)

# CORRECT: preprocessing inside the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', model)
])
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"[METRIC:cv_accuracy_mean] {scores.mean():.3f}")
print(f"[METRIC:cv_accuracy_std] {scores.std():.3f}")
print("[CHECK:preprocessing_leakage] Scaler fit inside CV - no leakage")
```
### Group-Aware Splitting

```python
from sklearn.model_selection import GroupKFold

print("[CHECK:group_leakage] Checking if samples are related")

# If samples belong to groups (e.g., multiple records per user)
if 'user_id' in df.columns:
    print("[DECISION] Using GroupKFold to prevent group leakage")
    groups = df['user_id']
    gkf = GroupKFold(n_splits=5)
    scores = cross_val_score(model, X, y, cv=gkf, groups=groups, scoring='accuracy')
    print(f"[METRIC:group_cv_accuracy_mean] {scores.mean():.3f}")
    print(f"[METRIC:group_cv_accuracy_std] {scores.std():.3f}")
    print("[CHECK:group_leakage] Same user never in both train and test - OK")
else:
    print("[CHECK:group_leakage] No group column - using standard CV")
```
## Complete ML Pipeline Example

Putting it all together:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

print("[OBJECTIVE] Build and validate classification model with full rigor")

# 1. BASELINE
print("\n--- Baseline Comparison ---")
dummy = DummyClassifier(strategy='most_frequent')
dummy_scores = cross_val_score(dummy, X, y, cv=5, scoring='accuracy')
print(f"[METRIC:baseline_accuracy] {dummy_scores.mean():.3f}")

# 2. CROSS-VALIDATION
print("\n--- Cross-Validation ---")
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier(random_state=42))
])
cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
print(f"[METRIC:cv_accuracy_mean] {cv_scores.mean():.3f}")
print(f"[METRIC:cv_accuracy_std] {cv_scores.std():.3f}")

# 3. IMPROVEMENT OVER BASELINE
improvement = cv_scores.mean() - dummy_scores.mean()
print(f"[METRIC:improvement_over_baseline] {improvement:.3f}")
print(f"[STAT:effect_size] Relative improvement: {improvement/dummy_scores.mean()*100:.1f}%")

# 4. INTERPRETATION
print("\n--- Interpretation ---")
# (permutation importance code here)

# 5. FINDING (only after full evidence)
if improvement > 0.05:
    print(f"[FINDING] Random Forest achieves {cv_scores.mean():.3f} accuracy, "
          f"improving {improvement:.3f} over baseline "
          f"({improvement/dummy_scores.mean()*100:.1f}% relative)")
    print("[SO_WHAT] Model provides actionable predictions for business use case")
else:
    print("[FINDING] Model does not significantly outperform baseline")
    print("[LIMITATION] Consider simpler approach or feature engineering")
```
## Quality Gate: Required Evidence

Before any `[FINDING]` in ML, you MUST have:

| Evidence | Marker | Example |
|---|---|---|
| Baseline comparison | `[METRIC:baseline_accuracy]` | Dummy and linear baselines scored |
| CV scores with variance | `[METRIC:cv_*_mean]`, `[METRIC:cv_*_std]` | Mean +/- std across folds |
| Improvement quantified | `[METRIC:improvement_over_baseline]` | Absolute gain over baseline |
| Effect size | `[STAT:effect_size]` | Relative improvement in % |
| Interpretation | `[METRIC:top_features]` | Top 3 features listed |
| Limitations acknowledged | `[LIMITATION]` | Model constraints documented |

Missing any of these? Your finding is "Exploratory", not "Verified".
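One way to enforce this gate mechanically is to scan a run's captured log for the required markers before accepting a `[FINDING]`. A minimal sketch — the marker prefixes follow this document's conventions, but the helper itself is a hypothetical illustration:

```python
REQUIRED_PREFIXES = [
    "[METRIC:baseline_",                    # baseline comparison
    "[METRIC:cv_",                          # CV scores with variance
    "[METRIC:improvement_over_baseline]",   # improvement quantified
    "[STAT:effect_size]",                   # effect size
    "[LIMITATION]",                         # limitations acknowledged
]

def finding_is_verified(log_text: str) -> bool:
    """A [FINDING] counts as 'Verified' only if every required marker appears."""
    if "[FINDING]" not in log_text:
        return False
    return all(prefix in log_text for prefix in REQUIRED_PREFIXES)

log = """
[METRIC:baseline_accuracy] 0.612
[METRIC:cv_accuracy_mean] 0.847
[METRIC:cv_accuracy_std] 0.012
[METRIC:improvement_over_baseline] 0.235
[STAT:effect_size] Relative improvement: 38.4%
[LIMITATION] Underperforms on the smallest segment
[FINDING] Model significantly outperforms baseline
"""
print("Verified" if finding_is_verified(log) else "Exploratory")
```

Dropping any one marker (say, the `[LIMITATION]` line) flips the verdict back to "Exploratory", which is exactly the behavior the table above prescribes.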