Vibe-Skills scientific-data-preprocessing
⚠️ CRITICAL USER EXPERIENCE-BASED SKILL - ALWAYS CONSULT BEFORE DATA PREPROCESSING ⚠️ Prevents catastrophic errors (88.9% error rate in V1.0 case study) through multi-level feature analysis, data leakage detection, and semantic validation. MANDATORY for: data preprocessing, feature engineering, standardization, normalization, interpolation, missing value handling, feature selection, or ANY data transformation task. Covers grouped time-series, cross-sectional, panel data. Detects: time travel leakage, causal inversion, ID misuse, semantic-numeric fallacies, distribution blindness. User's hard-won lessons from real project failures.
git clone https://github.com/foryourhealth111-pixel/Vibe-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/foryourhealth111-pixel/Vibe-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/bundled/skills/scientific-data-preprocessing" ~/.claude/skills/foryourhealth111-pixel-vibe-skills-scientific-data-preprocessing && rm -rf "$T"
bundled/skills/scientific-data-preprocessing/SKILL.mdScientific Data Preprocessing Skill
⚠️ CRITICAL: USER'S HARD-WON EXPERIENCE - MANDATORY CONSULTATION ⚠️
This skill encapsulates painful lessons learned from real preprocessing disasters (88.9% error rate documented). ALWAYS use this skill for planning, reflection, and validation when ANY data preprocessing is involved.
Why this skill is mandatory:
- Based on actual project failures (V1.0, V2.0 case studies)
- Prevents data leakage that causes production disasters
- Catches semantic errors AI agents commonly make
- Saves weeks of debugging and model retraining
When to invoke (DO NOT SKIP):
- ✅ Before starting ANY data preprocessing task
- ✅ During preprocessing for reflection and validation
- ✅ After preprocessing for comprehensive audit
- ✅ When reviewing AI-generated preprocessing code
Core Mission
Prevent catastrophic preprocessing errors in grouped time-series data by applying multi-level feature analysis and respecting data structure boundaries.
When to Use This Skill
MANDATORY consultation - trigger immediately when:
Data Preprocessing Tasks (ALWAYS)
- Any data cleaning, transformation, or preparation work
- Loading and preparing data for modeling
- Creating training/test splits
- Handling missing values (imputation, deletion)
- Feature scaling/normalization/standardization
- Encoding categorical variables
- Feature engineering or construction
- Feature selection or dimensionality reduction
Data Structure Types (ALWAYS)
- Preprocesssing time-series data with natural groupings (matches, sessions, patients, experiments)
- Sports analytics (tennis, basketball, etc.)
- Medical/clinical data with patient groupings
- Panel data or longitudinal studies
- Any grouped/hierarchical data structure
Quality Assurance (ALWAYS)
- Auditing existing preprocessing for data leakage or semantic errors
- Reviewing AI-generated preprocessing code for common pitfalls
- Validating preprocessing before model training
- Debugging unexpected model performance
Critical Checkpoints (NEVER SKIP)
- ✅ BEFORE: Planning preprocessing strategy
- ✅ DURING: Reflecting on decisions and checking for errors
- ✅ AFTER: Comprehensive validation and audit
Trigger keywords that MUST invoke this skill:
- "preprocess", "preprocessing", "data cleaning", "data preparation"
- "standardize", "normalize", "scale", "transform"
- "impute", "fill missing", "handle NaN"
- "encode", "one-hot", "categorical"
- "feature engineering", "feature selection", "feature construction"
- "train test split", "cross validation split"
- "interpolate", "smooth", "aggregate"
Not For / Boundaries
This skill does NOT:
- Handle purely cross-sectional data (ungrouped, single timepoint)
- Make domain-specific feature engineering decisions (you decide business logic)
- Choose ML models (focuses on preprocessing only)
- Handle distributed/big data infrastructure (assumes data fits in memory)
Required inputs before proceeding:
- Confirmation that data has groups (e.g., match_id, patient_id, session_id)
- Understanding of whether goal is within-group (relative) or cross-group (absolute) comparison
- Domain constraints on data ranges/units
Quick Reference
Multi-Level Feature Analysis Framework
Level 1: Data Type
# Check data types df.dtypes # int64, float64, object, etc.
Level 2: Feature Type Classification
# Binary (0/1) binary_features = [col for col in df.columns if df[col].nunique() == 2] # Categorical (finite discrete values) categorical_features = [col for col in df.select_dtypes(include='object').columns] # Continuous (infinite possible values) continuous_features = [col for col in df.select_dtypes(include=['float64', 'int64']).columns if df[col].nunique() > 10]
Level 3: Data Structure
# Check for grouping print(f"Number of groups: {df['group_id'].nunique()}") print(f"Avg points per group: {df.groupby('group_id').size().mean():.1f}") # Check for time-series df_sorted = df.sort_values(['group_id', 'timestamp'])
Level 4: Physical Meaning
# Validate physical ranges assert df['speed_mph'].max() < 200, "Speed exceeds physical limit" assert df['distance_meters'].min() >= 0, "Negative distance impossible"
Critical Processing Decision Tree
# Decision: Within-group or global processing? def choose_processing_scope(data, feature, goal): """ goal = 'relative' → within-group (e.g., "this point was intense FOR THIS MATCH") goal = 'absolute' → global (e.g., "this was an intense point OVERALL") """ if goal == 'relative': return 'within_group' elif goal == 'absolute': return 'global' else: raise ValueError("Goal must be 'relative' or 'absolute'")
Pattern 1: Within-Group Interpolation (CORRECT)
from scipy.interpolate import CubicSpline import numpy as np # ✅ CORRECT: Interpolate within each group for group_id in df['match_id'].unique(): mask = df['match_id'] == group_id group_data = df.loc[mask, 'speed_mph'].copy() # Get valid (non-NaN) indices valid_idx = group_data.notna() valid_positions = np.where(valid_idx)[0] valid_values = group_data[valid_idx].values if len(valid_positions) >= 4: cs = CubicSpline(valid_positions, valid_values) missing_positions = np.where(~valid_idx)[0] df.loc[mask & ~valid_idx, 'speed_mph'] = cs(missing_positions)
Pattern 2: Global Interpolation (WRONG - Don't Do This)
# ❌ WRONG: Cross-group interpolation # This interpolates between match A's last point and match B's first point! cs = CubicSpline( np.where(df['speed_mph'].notna())[0], # ❌ All indices globally df['speed_mph'].dropna().values ) df.loc[df['speed_mph'].isna(), 'speed_mph'] = cs( np.where(df['speed_mph'].isna())[0] )
Pattern 3: Within-Group Standardization (for Relative Analysis)
from sklearn.preprocessing import StandardScaler # ✅ CORRECT: Standardize within each match for match_id in df['match_id'].unique(): mask = df['match_id'] == match_id scaler = StandardScaler() df.loc[mask, 'distance_run_std_within'] = scaler.fit_transform( df.loc[mask, [['distance_run']] ) # Interpretation: z=+2 means "2 std above average FOR THIS MATCH"
Pattern 4: Global Standardization (for Absolute Comparison)
# ✅ CORRECT: Global standardization (when appropriate) scaler = StandardScaler() df['distance_run_std_global'] = scaler.fit_transform(df[['distance_run']]) # Interpretation: z=+2 means "2 std above average ACROSS ALL MATCHES"
Pattern 5: Feature Type Processing Rules
# Binary variables (0/1) - KEEP AS-IS binary_cols = ['is_ace', 'is_winner', 'is_error'] # ❌ NEVER standardize these! They have semantic meaning as 0/1 # Categorical variables - ONE-HOT ENCODE df_encoded = pd.get_dummies(df, columns=['server', 'serve_number'], dtype=int) # Continuous variables - STANDARDIZE (within-group or global) continuous_cols = ['distance_run', 'rally_count', 'speed_mph'] # ✅ Apply pattern 3 or 4 based on goal
Pattern 6: Sliding Window Features (for Momentum)
# ✅ CORRECT: Sliding window for momentum analysis window = 10 df['win_rate_last10'] = df.groupby('match_id')['point_won'].transform( lambda x: x.rolling(window, min_periods=1).mean() ) # ❌ WRONG: Cumulative features (loses temporal locality) df['cumulative_points_won'] = df.groupby('match_id')['point_won'].cumsum() # This just increases monotonically and correlates with point_number
Pattern 7: Data Quality Validation
def validate_data_quality(df, feature, expected_range): """Validate before processing""" # Check range assert df[feature].min() >= expected_range[0], f"{feature} below minimum" assert df[feature].max() <= expected_range[1], f"{feature} above maximum" # Check for anomalies mean = df[feature].mean() std = df[feature].std() if std > mean: print(f"⚠️ WARNING: {feature} has std > mean (highly skewed or errors)") # Check missing pattern missing_by_group = df.groupby('match_id')[feature].apply(lambda x: x.isna().sum()) if missing_by_group.max() > len(df) / df['match_id'].nunique() * 0.5: print(f"⚠️ WARNING: {feature} has >50% missing in some groups") # Example validate_data_quality(df, 'speed_mph', expected_range=(50, 165))
Pattern 8: Detect Processing Scope Automatically
def detect_processing_scope(df, group_col, feature_col): """ Recommend within-group vs global based on variance structure """ # Calculate variance components within_group_var = df.groupby(group_col)[feature_col].var().mean() global_var = df[feature_col].var() # Intraclass correlation between_group_var = global_var - within_group_var icc = between_group_var / global_var if icc > 0.5: return 'within_group', f"High between-group variance (ICC={icc:.2f})" else: return 'global', f"Low between-group variance (ICC={icc:.2f})" scope, reason = detect_processing_scope(df, 'match_id', 'distance_run') print(f"Recommended: {scope} - {reason}")
Pattern 9: Data Leakage Detection
def detect_data_leakage(df, target_col, feature_cols, id_cols): """ Critical checks for data leakage and AI common pitfalls """ issues = [] # 1. ID Leakage: High cardinality variables as features for col in feature_cols: if col in id_cols: issues.append(f"❌ FATAL: {col} is an ID - NEVER use as feature") continue # Check if looks like ID (>50% unique) uniqueness = df[col].nunique() / len(df) if uniqueness > 0.5: issues.append(f"⚠️ {col}: {uniqueness*100:.1f}% unique - possible ID leakage") # 2. Causal Inversion: Perfect correlation with target for col in feature_cols: if col == target_col: continue if df[col].dtype in ['int64', 'float64']: corr = abs(df[[col, target_col]].corr().iloc[0, 1]) if corr > 0.95: issues.append(f"❌ FATAL: {col} correlation={corr:.3f} - likely consequence of target!") # 3. Meaningless Numeric: Codes treated as numbers for col in feature_cols: if df[col].dtype in ['int64', 'float64']: # Pattern: High values, many uniques, looks like code if df[col].min() > 1000 and df[col].nunique() > 100: issues.append(f"⚠️ {col}: Looks like code (zipcode/ID) - should be categorical") # 4. Time Travel: Check if standardization used global statistics # (Requires knowing if train/test split was done first) # Print report if issues: print("="*60) print("DATA LEAKAGE AUDIT") print("="*60) for issue in issues: print(issue) print("="*60) else: print("✅ No obvious leakage detected") return issues # Example usage issues = detect_data_leakage( df, target_col='point_won', feature_cols=['speed_mph', 'user_id', 'distance_run'], id_cols=['match_id', 'user_id'] )
Pattern 10: Distribution-Aware Scaling
from scipy.stats import skew, kurtosis from sklearn.preprocessing import StandardScaler, RobustScaler def smart_scaler_selection(df, col): """ Choose scaler based on distribution characteristics """ data = df[col].dropna() # Check distribution skewness = skew(data) kurt = kurtosis(data) print(f"{col}: skewness={skewness:.2f}, kurtosis={kurt:.2f}") if abs(skewness) < 0.5 and abs(kurt) < 3: # Roughly normal print(" → StandardScaler (data is roughly normal)") return StandardScaler(), None elif skewness > 1: # Right-skewed (long tail) print(" → Log transform + StandardScaler (right-skewed)") return StandardScaler(), 'log' else: # Heavy outliers or non-normal print(" → RobustScaler (heavy outliers)") return RobustScaler(), None # Example usage for col in continuous_features: scaler, transform = smart_scaler_selection(df, col) if transform == 'log': df[f'{col}_log'] = np.log1p(df[col]) df[f'{col}_scaled'] = scaler.fit_transform(df[[f'{col}_log']]) else: df[f'{col}_scaled'] = scaler.fit_transform(df[[col]])
Examples
Example 1: Tennis Match Preprocessing (Complete Pipeline)
Input:
- CSV with 7,284 rows, 31 matches
- Features:
,speed_mph
,distance_run
,rally_count
,is_aceserver - Goal: Analyze momentum (relative intensity within each match)
Steps:
import pandas as pd from sklearn.preprocessing import StandardScaler # 1. Load and inspect df = pd.read_csv('tennis_data.csv') print(f"Matches: {df['match_id'].nunique()}") print(f"Features: {df.dtypes}") # 2. Classify features binary_features = ['is_ace', 'is_winner', 'is_break_point'] categorical_features = ['server', 'serve_number'] continuous_features = ['distance_run', 'speed_mph', 'rally_count'] # 3. Validate data quality for feat in continuous_features: print(f"\n{feat}:") print(df[feat].describe()) # Check for impossible values if feat == 'speed_mph': assert df[feat].max() < 170, "Speed exceeds world record!" # 4. Handle missing values (within-group) for match_id in df['match_id'].unique(): mask = df['match_id'] == match_id for feat in continuous_features: if df.loc[mask, feat].isna().any(): # Simple linear interpolation within match df.loc[mask, feat] = df.loc[mask, feat].interpolate(method='linear') # 5. One-hot encode categorical df = pd.get_dummies(df, columns=categorical_features, dtype=int) # 6. Standardize continuous features WITHIN each match for feat in continuous_features: df[f'{feat}_std'] = np.nan for match_id in df['match_id'].unique(): mask = df['match_id'] == match_id scaler = StandardScaler() df.loc[mask, f'{feat}_std'] = scaler.fit_transform( df.loc[mask, [[feat]] ) # 7. Create sliding window features window = 10 df['win_rate_last10'] = df.groupby('match_id')['point_won'].transform( lambda x: x.rolling(window, min_periods=1).mean() ) # 8. KEEP binary features as 0/1 (don't transform!) # binary_features are already correct print("\n✅ Preprocessing complete!") print(f"Final shape: {df.shape}") print(f"Standardized features: {[f for f in df.columns if f.endswith('_std')]}")
Expected output:
- Binary features remain 0/1
- Categorical features one-hot encoded (e.g.,
,server_1
)server_2 - Continuous features have both original and
versions_std
features have mean≈0, std≈1 WITHIN each match_std- Sliding window features capture local momentum
- No missing values
Example 2: Detecting Cross-Group Contamination
Input:
- Preprocessed data where you suspect cross-group standardization
Steps:
# Check if standardization was done correctly def check_within_group_standardization(df, group_col, feature_std_col): """ Verify that standardized feature has mean≈0, std≈1 within each group """ results = df.groupby(group_col)[feature_std_col].agg(['mean', 'std']) # Within-group standardization: each group should have mean≈0, std≈1 if (results['mean'].abs() < 0.1).all() and (results['std'].between(0.9, 1.1)).all(): print("✅ CORRECT: Within-group standardization detected") return True # Global standardization: groups will have varying means and stds else: print("❌ WRONG: Global standardization detected!") print("Group means:", results['mean'].values[:5]) print("Group stds:", results['std'].values[:5]) return False check_within_group_standardization(df, 'match_id', 'distance_run_std')
Expected output:
- CORRECT: All group means ≈ 0, all group stds ≈ 1
- WRONG: Group means vary widely, indicating global standardization
Example 3: Fixing Cumulative Feature Error
Input:
- Existing pipeline using cumulative sums for momentum
Steps:
# ❌ WRONG approach (existing code) df['cumulative_wins'] = df.groupby('match_id')['point_won'].cumsum() # Problem: This just counts total wins up to this point # Doesn't capture recent momentum! # ✅ CORRECT approach (fix) # Replace cumulative with sliding window window = 10 df['recent_win_rate'] = df.groupby('match_id')['point_won'].transform( lambda x: x.rolling(window, min_periods=1).mean() ) # Compare print("Cumulative (wrong):", df['cumulative_wins'].values[50:60]) print("Sliding window (correct):", df['recent_win_rate'].values[50:60]) # Cumulative: [25, 26, 26, 27, 28, ...] - monotonic # Sliding window: [0.6, 0.7, 0.5, 0.6, ...] - fluctuates with momentum
Expected output:
- Cumulative features removed
- Sliding window features show local variations
- Momentum analysis now captures short-term trends
References
: Navigation and overviewreferences/index.md
: Real-world preprocessing disasters from tennis datareferences/error-case-studies.md
: Full decision trees for all preprocessing choicesreferences/decision-trees.md
: Pre-processing validation checklistreferences/validation-checklist.md
: AI-specific errors (data leakage, semantic fallacies, distribution blindness)references/ai-common-pitfalls.md
Maintenance
⚠️ CRITICAL NOTICE: USER'S PERSONAL EXPERIENCE-BASED SKILL ⚠️
This skill is NOT theoretical - it's based on real project failures:
- V1.0 disaster: 88.9% error rate, weeks of wasted work
- V2.0 issues: Cross-group contamination, unreliable results
- V3.0 success: All errors fixed, production-ready
Why this matters to you (Claude):
- These are the EXACT errors AI agents commonly make
- User has already paid the price for these mistakes
- Ignoring this skill = repeating documented failures
- Following this skill = learning from experience without pain
Authority level: HIGHEST
- Based on user's hard-won lessons from actual project
- Validated through multiple iterations (V1.0 → V2.0 → V3.0)
- Every error documented with impact metrics
- Every fix validated with comprehensive testing
Sources:
- Primary: User's personal project (2024 MCM Problem C - Tennis Momentum Analysis)
- Secondary: Statistical best practices for grouped data
- Tertiary: Common AI preprocessing errors observed across domains
Mandatory consultation:
- ⚠️ ALWAYS consult before, during, and after any data preprocessing
- ⚠️ NEVER skip validation steps outlined in this skill
- ⚠️ When in doubt, err on the side of caution (use this skill)
Last updated: 2026-01-18 (V1.1)
Known limits:
- Assumes data fits in memory (not for big data infrastructure)
- Focused on numeric/categorical features (text/image preprocessing partially covered)
- Does not prescribe domain-specific feature engineering (user decides business logic)
- Requires basic understanding of statistics (mean, std, correlation)