Vibe-Skills ml-data-leakage-guard
Detects and prevents data leakage in machine learning and mathematical modeling. Use after ML tasks involving data cleaning, feature engineering, data augmentation, algorithm development, normalization, missing value imputation, dimensionality reduction, feature selection, or time series modeling. Checks if features/statistics would be available at prediction time.
Clone the repository:

```sh
git clone https://github.com/foryourhealth111-pixel/Vibe-Skills
```

Or install the skill directly into `~/.claude/skills`:

```sh
T=$(mktemp -d) && git clone --depth=1 https://github.com/foryourhealth111-pixel/Vibe-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/bundled/skills/ml-data-leakage-guard" ~/.claude/skills/foryourhealth111-pixel-vibe-skills-ml-data-leakage-guard && rm -rf "$T"
```
ML Data Leakage Guard Skill (`bundled/skills/ml-data-leakage-guard/SKILL.md`)
Automatically detects and prevents data leakage in machine learning workflows by verifying that all preprocessing steps, feature engineering, and statistical computations would be available at prediction time.
When to Use This Skill
Use this skill after work involving:
- Data preprocessing (normalization, standardization, scaling)
- Missing value imputation
- Feature engineering and feature selection
- Dimensionality reduction (PCA, SVD, t-SNE)
- Target encoding or label encoding
- Time series feature construction
- Data augmentation strategies
- Algorithm development and optimization
- Train-test split procedures
- Cross-validation setup
Not For / Boundaries
- Pure theoretical ML discussions without implementation
- Model architecture design (without data preprocessing)
- Hyperparameter tuning (unless it involves data-dependent operations)
Core Principle
The Golden Rule: At the exact moment of prediction in production, can I access this value from the database or compute it using only information available up to that point?
If the answer is "no" or "not completely", then data leakage exists.
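One way to internalize the rule is to keep "fit" (learn statistics) and "apply" (reuse them) strictly separate, and fit only on the training split. A minimal sketch in plain NumPy (the helper names are illustrative, not from any library):

```python
import numpy as np

def fit_standardizer(X_train):
    """Learn mean/std from the training split only."""
    return X_train.mean(axis=0), X_train.std(axis=0)

def apply_standardizer(X, mean, std):
    """Reuse the train-time statistics; never recompute them on new data."""
    return (X - mean) / std

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(80, 3))
X_test = rng.normal(5.0, 2.0, size=(20, 3))

mean, std = fit_standardizer(X_train)                  # fit: train only
X_test_scaled = apply_standardizer(X_test, mean, std)  # apply: reuse train stats
```

At prediction time, `mean` and `std` are exactly the values you would persist alongside the model; nothing about the new data point is needed to compute them.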
Quick Reference
Critical Leakage Patterns
Pattern 1: Preprocessing Before Split
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# ❌ WRONG: Leakage - fit on entire dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses test set statistics
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# ✅ CORRECT: Fit only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit on train only
X_test_scaled = scaler.transform(X_test)        # Transform test using train statistics
```
Pattern 2: Global Missing Value Imputation
```python
from sklearn.model_selection import train_test_split

# ❌ WRONG: Uses global statistics including test set
df['age'] = df['age'].fillna(df['age'].mean())  # Global mean includes test data
X_train, X_test, y_train, y_test = train_test_split(df, y)

# ✅ CORRECT: Compute statistics on training set only
X_train, X_test, y_train, y_test = train_test_split(df, y)
train_mean = X_train['age'].mean()  # Only from training data
X_train['age'] = X_train['age'].fillna(train_mean)
X_test['age'] = X_test['age'].fillna(train_mean)  # Use train mean for test
```
Pattern 3: PCA/Dimensionality Reduction on Full Dataset
```python
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# ❌ WRONG: PCA learns variance structure from test set
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)  # Includes test set variance
X_train, X_test, y_train, y_test = train_test_split(X_reduced, y)

# ✅ CORRECT: Fit PCA only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
pca = PCA(n_components=10)
X_train_reduced = pca.fit_transform(X_train)  # Learn from train only
X_test_reduced = pca.transform(X_test)        # Apply train-learned transformation
```
Pattern 4: Target Encoding with Full Dataset
```python
from sklearn.model_selection import train_test_split

# ❌ WRONG: Uses target values from test set
category_means = df.groupby('category')['target'].mean()  # Includes test targets
df['category_encoded'] = df['category'].map(category_means)
X_train, X_test, y_train, y_test = train_test_split(df, y)

# ✅ CORRECT: Compute encoding only from training targets
X_train, X_test, y_train, y_test = train_test_split(df, y)
category_means = X_train.groupby('category')['target'].mean()  # Train only
X_train['category_encoded'] = X_train['category'].map(category_means)
X_test['category_encoded'] = X_test['category'].map(category_means)
```
Pattern 5: Feature Selection on Full Dataset
```python
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split

# ❌ WRONG: Feature selection sees test set
selector = SelectKBest(k=10)
X_selected = selector.fit_transform(X, y)  # Uses test set for selection
X_train, X_test, y_train, y_test = train_test_split(X_selected, y)

# ✅ CORRECT: Select features using training data only
X_train, X_test, y_train, y_test = train_test_split(X, y)
selector = SelectKBest(k=10)
X_train_selected = selector.fit_transform(X_train, y_train)  # Train only
X_test_selected = selector.transform(X_test)  # Apply train-learned selection
```
Pattern 6: Random Split on Temporal Data
```python
# ❌ WRONG: Random split on time series (uses future to predict past)
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)

# ✅ CORRECT: Time-based split for temporal data
split_date = '2024-01-01'
X_train = df[df['date'] < split_date]
X_test = df[df['date'] >= split_date]
```
Pattern 7: Future Function in Time Series Features
```python
# ❌ WRONG: Uses future data to compute current features
df['daily_avg'] = df.groupby('date')['value'].transform('mean')  # Includes all of the day's data

# ✅ CORRECT: Use only past data (expanding window)
df = df.sort_values('timestamp')
df['cumulative_avg'] = df.groupby('user_id')['value'].expanding().mean().reset_index(0, drop=True)
```
Pattern 8: Post-Event Features
```python
# ❌ WRONG: Feature only exists after the outcome
# Predicting loan default using "number of collection calls" as a feature:
# collection calls only happen AFTER default occurs

# ✅ CORRECT: Use only pre-event features,
# i.e. features available BEFORE the outcome: credit score, income, debt ratio, etc.
```
Pattern 9: Leakage in Cross-Validation
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# ❌ WRONG: Preprocessing before CV split
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
scores = cross_val_score(model, X_scaled, y, cv=5)  # Each fold sees other folds' statistics

# ✅ CORRECT: Preprocessing inside CV pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
scores = cross_val_score(pipeline, X, y, cv=5)  # Scaling happens per fold
```
Pattern 10: Data Augmentation Leakage
```python
from sklearn.model_selection import train_test_split

# ❌ WRONG: Augment before split (test set influenced by augmented train data)
X_augmented = augment_data(X)  # Augmentation sees all data
X_train, X_test, y_train, y_test = train_test_split(X_augmented, y)

# ✅ CORRECT: Augment only training data after split
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_augmented = augment_data(X_train)  # Augment train only
# X_test remains unchanged
```
Leakage Detection Checklist
After any ML preprocessing or feature engineering, verify:
- Fit-Transform Order: Did I fit any transformer (scaler, encoder, imputer, PCA) on the full dataset before splitting?
- Global Statistics: Did I compute any statistics (mean, std, median, mode) using the entire dataset?
- Feature Selection: Did I select features using information from the test set?
- Target Information: Did any feature encoding use target values from the test set?
- Temporal Order: For time series, did I use random splits or include future information?
- Post-Event Features: Are any features only available after the outcome occurs?
- Cross-Validation: Did I preprocess before setting up CV folds?
- Production Reality: At prediction time, can I actually compute this feature with available data?
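A couple of these checklist items can be automated with small assertions run right after the split. A sketch with hypothetical helper names (covering only the temporal-order and train/test-separation checks):

```python
import pandas as pd

def check_temporal_split(train_df, test_df, time_col="date"):
    """Temporal order: every test timestamp must follow every train timestamp."""
    return train_df[time_col].max() < test_df[time_col].min()

def check_disjoint(train_df, test_df):
    """Train and test must not share rows (compared by index)."""
    return train_df.index.intersection(test_df.index).empty

df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "value": range(10),
})
train, test = df.iloc[:7], df.iloc[7:]
assert check_temporal_split(train, test)
assert check_disjoint(train, test)
```

Checks like these catch mechanical mistakes; the domain-level items (post-event features, production availability) still need human review.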
The "Prediction Time" Test
For every feature and preprocessing step, ask:
When the model is deployed and receives a new data point at time T:
- Can I query this value from the database?
- Can I compute this statistic using only data available before time T?
- Can I compute this feature without knowing the outcome I'm trying to predict?

If any answer is "NO", you have data leakage.
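For derived features, the test can often be encoded directly: shift the series by one before aggregating, so the feature at time T is computed strictly from pre-T rows. A minimal pandas sketch (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=6, freq="D"),
    "value": [10, 20, 30, 40, 50, 60],
}).sort_values("timestamp")

# shift(1) drops the current row, so each feature value is the mean of
# strictly earlier observations: exactly what exists at prediction time
df["past_avg"] = df["value"].shift(1).expanding().mean()
```

The first row has no history and comes out as `NaN`, which is itself informative: there is genuinely nothing available before time T there.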
Examples
Example 1: Detecting Normalization Leakage
Input: Code that normalizes data before train-test split
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_normalized = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_normalized, y, test_size=0.2)
```
Leakage Detection:
- ❌ LEAKAGE DETECTED: `fit_transform` was called on the entire dataset `X`
- The scaler learned mean and std from the full dataset, including the test set
- At prediction time, you won't have access to test set statistics
- Test set performance is artificially inflated
Corrected Code:
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_normalized = scaler.fit_transform(X_train)  # Fit on train only
X_test_normalized = scaler.transform(X_test)        # Transform using train statistics
```
Example 2: Detecting Missing Value Imputation Leakage
Input: Code that fills missing values with global mean
```python
df = pd.read_csv('data.csv')
df['income'].fillna(df['income'].mean(), inplace=True)
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2)
```
Leakage Detection:
- ❌ LEAKAGE DETECTED: Global mean computed on entire dataset
- The mean includes test set values
- At prediction time, you won't know the test set mean
- This inflates test performance
Corrected Code:
```python
df = pd.read_csv('data.csv')
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2)
train_mean = X_train['income'].mean()  # Compute mean from training data only
X_train['income'] = X_train['income'].fillna(train_mean)
X_test['income'] = X_test['income'].fillna(train_mean)  # Use training mean for test
```
Example 3: Detecting Time Series Leakage
Input: Stock price prediction with random split and rolling average feature
```python
df['rolling_avg_7d'] = df.groupby('stock')['price'].rolling(7, center=True).mean()
X_train, X_test = train_test_split(df, test_size=0.2, random_state=42)
```
Leakage Detection:
- ❌ LEAKAGE DETECTED - Multiple Issues:
- `center=True` in the rolling window uses future prices (up to t+3 days) to compute the feature at time t
- Random split on temporal data means using "tomorrow's data" to predict "yesterday"
- At prediction time, you can't access future prices
Corrected Code:
```python
# Fix 1: Use a backward-looking rolling window (no center=True)
df = df.sort_values(['stock', 'date'])
df['rolling_avg_7d'] = df.groupby('stock')['price'].rolling(7, min_periods=1).mean().reset_index(0, drop=True)

# Fix 2: Time-based split instead of random
split_date = '2024-01-01'
X_train = df[df['date'] < split_date]
X_test = df[df['date'] >= split_date]
```
Example 4: Detecting Target Encoding Leakage
Input: Category encoding using target mean from full dataset
```python
category_target_mean = df.groupby('category')['target'].mean()
df['category_encoded'] = df['category'].map(category_target_mean)
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'])
```
Leakage Detection:
- ❌ LEAKAGE DETECTED: Target encoding uses test set labels
- The encoding directly uses the answer (target) from test set
- This is severe leakage - you're literally using test labels as features
- At prediction time, you don't know the target value
Corrected Code:
```python
X_train, X_test, y_train, y_test = train_test_split(df.drop('target', axis=1), df['target'])

# Compute target mean only from training data
train_df = X_train.copy()
train_df['target'] = y_train
category_target_mean = train_df.groupby('category')['target'].mean()
X_train['category_encoded'] = X_train['category'].map(category_target_mean)
X_test['category_encoded'] = X_test['category'].map(category_target_mean)
```
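Note that even the corrected version leaks a little within the training set: each training row's own label contributes to its encoding. A common refinement is out-of-fold target encoding, sketched here in plain pandas/NumPy (the helper name and fold scheme are illustrative, not from any library):

```python
import numpy as np
import pandas as pd

def oof_target_encode(cat, target, n_folds=5, seed=0):
    """Encode each row with target means computed on the *other* folds only."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, size=len(cat))
    encoded = pd.Series(np.nan, index=cat.index)
    global_mean = target.mean()
    for f in range(n_folds):
        # Means learned from all rows outside fold f...
        means = target[fold != f].groupby(cat[fold != f]).mean()
        # ...applied to the rows inside fold f
        encoded[fold == f] = cat[fold == f].map(means)
    return encoded.fillna(global_mean)  # categories unseen outside the fold

cat = pd.Series(list("aabbab") * 4)
target = pd.Series([1, 0, 1, 1, 0, 1] * 4)
encoded = oof_target_encode(cat, target)
```

Library implementations (e.g. scikit-learn's `TargetEncoder`) build in similar cross-fitting; the point is simply that a row never sees its own label.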
Example 5: Detecting Post-Event Feature Leakage
Input: Predicting customer churn using "number of retention calls" as feature
```python
features = ['account_age', 'monthly_spend', 'support_tickets', 'retention_calls_count']
X = df[features]
y = df['churned']
```
Leakage Detection:
- ❌ LEAKAGE DETECTED: Post-event feature
- "retention_calls_count" only exists AFTER customer shows churn signals
- Company only makes retention calls to customers who are already churning
- This is a consequence of churn, not a predictor
- At prediction time (before churn), this feature doesn't exist or is always 0
Corrected Code:
```python
# Remove post-event features; use only pre-event features
features = ['account_age', 'monthly_spend', 'support_tickets',
            'login_frequency', 'feature_usage_decline', 'payment_delays']
X = df[features]
y = df['churned']
```
Leakage Severity Levels
CRITICAL (Model is completely invalid):
- Target encoding with test labels
- Post-event features
- Using future data in time series
HIGH (Significantly inflated performance):
- Preprocessing before train-test split
- Feature selection on full dataset
- Global imputation statistics
MEDIUM (Moderate performance inflation):
- Incorrect cross-validation setup
- Subtle temporal leakage in feature engineering
References
- `references/leakage-patterns.md`: Comprehensive catalog of leakage patterns
- `references/temporal-leakage.md`: Time series-specific leakage issues
- `references/detection-strategies.md`: How to detect leakage in existing code
Maintenance
- Sources: ML best practices, Kaggle competition lessons, production ML experience
- Last updated: 2026-01-19
- Known limits: Cannot detect domain-specific leakage without context (e.g., whether a feature is post-event in a specific business context)