BioSkills bio-machine-learning-model-validation
Implements nested cross-validation and stratified splits for unbiased model evaluation on biomedical datasets. Prevents data leakage and overfitting in biomarker discovery. Use when validating classifiers or optimizing hyperparameters on omics data.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/machine-learning/model-validation" ~/.claude/skills/gptomics-bioskills-bio-machine-learning-model-validation && rm -rf "$T"
machine-learning/model-validation/SKILL.md

Version Compatibility
Reference examples tested with: numpy 1.26+, scikit-learn 1.4+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` to check installed versions, then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
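The version check above can be scripted. This is a minimal sketch using the standard-library `importlib.metadata`; the `check_versions` helper name and the default thresholds are assumptions, not part of the skill.

```python
# Hypothetical helper: report whether installed packages meet the
# minimum versions the reference examples were tested with.
import re
from importlib.metadata import version, PackageNotFoundError

def check_versions(requirements):
    """Map package name -> (installed version string or None, meets_minimum)."""
    report = {}
    for pkg, minimum in requirements.items():
        try:
            installed = version(pkg)
            # Compare only the leading numeric components (major, minor)
            nums = re.findall(r'\d+', installed)
            parts = tuple(int(n) for n in nums[:2])
            report[pkg] = (installed, parts >= minimum)
        except PackageNotFoundError:
            report[pkg] = (None, False)
    return report

print(check_versions({'numpy': (1, 26), 'scikit-learn': (1, 4)}))
```

Note that this compares only major and minor components; pre-release suffixes are ignored.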
Cross-Validation for Biomedical Data
"Properly validate my omics classifier" → Use nested cross-validation with stratified splits to get unbiased performance estimates while tuning hyperparameters on small biomedical datasets.
- Python: `sklearn.model_selection.cross_val_score()` with `StratifiedKFold` inner/outer loops
Why Nested CV Matters
Simple train/test splits overestimate performance on small omics datasets. Nested CV provides unbiased estimates by separating hyperparameter tuning from performance evaluation.
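The optimism of non-nested tuning can be demonstrated directly. This is a sketch on a synthetic high-dimensional dataset (many features, few samples, mimicking omics data); the exact gap between the two estimates varies with the seed, so no particular numbers are claimed.

```python
# Contrast non-nested CV (tuned and scored on the same folds) with
# nested CV (tuning isolated inside each outer training fold).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Small-n, large-p synthetic data, loosely omics-like
X, y = make_classification(n_samples=60, n_features=200, n_informative=5,
                           random_state=0)
param_grid = {'C': [0.01, 0.1, 1, 10]}
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

grid = GridSearchCV(SVC(), param_grid, cv=inner, scoring='roc_auc')
grid.fit(X, y)
non_nested = grid.best_score_  # hyperparameters chosen on the scored folds

# Nested: the whole tuning procedure is re-run inside each outer fold
nested = cross_val_score(grid, X, y, cv=outer, scoring='roc_auc').mean()
print(f'non-nested AUC: {non_nested:.3f}  nested AUC: {nested:.3f}')
```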
Nested Cross-Validation
Goal: Obtain unbiased performance estimates by separating hyperparameter tuning from evaluation.
Approach: Use an outer CV loop for scoring and an inner CV loop for grid search, preventing information leakage between tuning and evaluation.
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=42))
])

param_grid = {
    'clf__n_estimators': [50, 100, 200],
    'clf__max_depth': [5, 10, None]
}

# Outer CV: performance estimation (5 folds)
# Inner CV: hyperparameter tuning (3 folds)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

nested_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    grid = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring='roc_auc', n_jobs=-1)
    grid.fit(X_train, y_train)
    score = grid.score(X_test, y_test)
    nested_scores.append(score)

print(f'Nested CV AUC: {np.mean(nested_scores):.3f} +/- {np.std(nested_scores):.3f}')
```
Stratified K-Fold
Goal: Evaluate model performance while preserving class proportions in each fold.
Approach: Split data into stratified folds and compute cross-validated scores to account for class imbalance.
```python
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Always stratify for class imbalance
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print(f'CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}')
```
Repeated Stratified K-Fold
Goal: Produce more stable performance estimates by averaging across multiple CV repetitions.
Approach: Repeat stratified K-fold splitting with different random seeds and aggregate scores across all iterations.
```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# More robust estimate with multiple repeats
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print(f'Repeated CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}')
```
Leave-One-Out (Small Datasets)
Goal: Maximize training data when sample size is very small (n < 30).
Approach: Hold out one sample at a time for testing and train on all remaining samples, then aggregate predictions.
```python
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import roc_auc_score

# Use for very small datasets (n < 30)
loo = LeaveOneOut()
y_pred = cross_val_predict(pipe, X, y, cv=loo, method='predict_proba')[:, 1]
auc = roc_auc_score(y, y_pred)
print(f'LOO AUC: {auc:.3f}')
```
Group-Aware Splits
Goal: Prevent data leakage when samples from the same patient or batch are correlated.
Approach: Use group-aware splitting to ensure all samples from a single group stay in the same fold.
```python
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut, cross_val_score

# When samples from same patient/batch must stay together
groups = meta['patient_id'].values
group_cv = GroupKFold(n_splits=5)
scores = cross_val_score(pipe, X, y, cv=group_cv, groups=groups, scoring='roc_auc')
```
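`LeaveOneGroupOut` is the extreme case of group-aware splitting: each batch or patient in turn serves as the test set. This is a self-contained sketch on synthetic data; the group labels, classifier, and sizes are illustrative stand-ins, not prescriptions.

```python
# LeaveOneGroupOut on four synthetic "batches" of ten samples each:
# every fold holds out one entire batch, so no batch ever appears in
# both training and test.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(40, 10))
y = np.tile([0, 1], 20)                 # balanced classes in every batch
groups = np.repeat([0, 1, 2, 3], 10)    # batch label per sample

logo = LeaveOneGroupOut()
scores = cross_val_score(LogisticRegression(), X, y, cv=logo,
                         groups=groups, scoring='roc_auc')
print(f'per-batch AUC: {np.round(scores, 3)}')  # one score per held-out batch
```

With only a handful of batches, `LeaveOneGroupOut` uses the data more fully than `GroupKFold`; with many groups, `GroupKFold` is cheaper.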
CV Strategy Selection
| Dataset Size | Strategy | Notes |
|---|---|---|
| n > 100 | StratifiedKFold(5) | Standard choice |
| n = 50-100 | StratifiedKFold(10) | More train data per fold |
| n < 30 | LeaveOneOut | Maximum train data |
| Repeated measures | GroupKFold | Keep patients together |
| High variance | RepeatedStratifiedKFold | More stable estimates |
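The strategy table can be encoded as a simple decision rule. This is a hypothetical helper (the `choose_cv` name, argument names, and precedence of the branches are assumptions); it follows the table's thresholds, checking groups first because leakage prevention trumps the size-based choices.

```python
# Hypothetical splitter chooser following the CV strategy table above.
from sklearn.model_selection import (GroupKFold, LeaveOneOut,
                                     RepeatedStratifiedKFold, StratifiedKFold)

def choose_cv(n_samples, has_groups=False, high_variance=False):
    """Pick a CV splitter per the strategy table."""
    if has_groups:
        # Repeated measures: keep patients/batches together
        return GroupKFold(n_splits=5)
    if n_samples < 30:
        # Maximize training data on tiny datasets
        return LeaveOneOut()
    if high_variance:
        # Average over repeats for more stable estimates
        return RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
    if n_samples <= 100:
        # More train data per fold on mid-sized datasets
        return StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    # Standard choice for n > 100
    return StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
```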
Avoiding Data Leakage
Goal: Ensure feature selection does not use test-fold information, which inflates performance estimates.
Approach: Embed feature selection inside a pipeline so it executes independently within each CV fold.
```python
# WRONG: Feature selection before CV
# selected = SelectKBest(k=100).fit_transform(X, y)  # Leaks info!
# scores = cross_val_score(clf, selected, y, cv=cv)

# CORRECT: Feature selection inside CV
from sklearn.feature_selection import SelectKBest

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(k=100)),  # Done per fold
    ('clf', RandomForestClassifier())
])
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
```
Related Skills
- machine-learning/omics-classifiers - Model training
- experimental-design/multiple-testing - Multiple hypothesis correction
- machine-learning/biomarker-discovery - Feature selection within CV