Vibe-Skills scientific-data-preprocessing

⚠️ CRITICAL USER EXPERIENCE-BASED SKILL - ALWAYS CONSULT BEFORE DATA PREPROCESSING ⚠️ Prevents catastrophic errors (88.9% error rate in V1.0 case study) through multi-level feature analysis, data leakage detection, and semantic validation. MANDATORY for: data preprocessing, feature engineering, standardization, normalization, interpolation, missing value handling, feature selection, or ANY data transformation task. Covers grouped time-series, cross-sectional, panel data. Detects: time travel leakage, causal inversion, ID misuse, semantic-numeric fallacies, distribution blindness. User's hard-won lessons from real project failures.

install
source · Clone the upstream repo
git clone https://github.com/foryourhealth111-pixel/Vibe-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/foryourhealth111-pixel/Vibe-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/bundled/skills/scientific-data-preprocessing" ~/.claude/skills/foryourhealth111-pixel-vibe-skills-scientific-data-preprocessing && rm -rf "$T"
manifest: bundled/skills/scientific-data-preprocessing/SKILL.md
source content

Scientific Data Preprocessing Skill

⚠️ CRITICAL: USER'S HARD-WON EXPERIENCE - MANDATORY CONSULTATION ⚠️

This skill encapsulates painful lessons learned from real preprocessing disasters (88.9% error rate documented). ALWAYS use this skill for planning, reflection, and validation when ANY data preprocessing is involved.

Why this skill is mandatory:

  • Based on actual project failures (V1.0, V2.0 case studies)
  • Prevents data leakage that causes production disasters
  • Catches semantic errors AI agents commonly make
  • Saves weeks of debugging and model retraining

When to invoke (DO NOT SKIP):

  • ✅ Before starting ANY data preprocessing task
  • ✅ During preprocessing for reflection and validation
  • ✅ After preprocessing for comprehensive audit
  • ✅ When reviewing AI-generated preprocessing code

Core Mission

Prevent catastrophic preprocessing errors in grouped time-series data by applying multi-level feature analysis and respecting data structure boundaries.

When to Use This Skill

MANDATORY consultation - trigger immediately when:

Data Preprocessing Tasks (ALWAYS)

  • Any data cleaning, transformation, or preparation work
  • Loading and preparing data for modeling
  • Creating training/test splits
  • Handling missing values (imputation, deletion)
  • Feature scaling/normalization/standardization
  • Encoding categorical variables
  • Feature engineering or construction
  • Feature selection or dimensionality reduction

Data Structure Types (ALWAYS)

  • Preprocessing time-series data with natural groupings (matches, sessions, patients, experiments)
  • Sports analytics (tennis, basketball, etc.)
  • Medical/clinical data with patient groupings
  • Panel data or longitudinal studies
  • Any grouped/hierarchical data structure

Quality Assurance (ALWAYS)

  • Auditing existing preprocessing for data leakage or semantic errors
  • Reviewing AI-generated preprocessing code for common pitfalls
  • Validating preprocessing before model training
  • Debugging unexpected model performance

Critical Checkpoints (NEVER SKIP)

  • BEFORE: Planning preprocessing strategy
  • DURING: Reflecting on decisions and checking for errors
  • AFTER: Comprehensive validation and audit

Trigger keywords that MUST invoke this skill:

  • "preprocess", "preprocessing", "data cleaning", "data preparation"
  • "standardize", "normalize", "scale", "transform"
  • "impute", "fill missing", "handle NaN"
  • "encode", "one-hot", "categorical"
  • "feature engineering", "feature selection", "feature construction"
  • "train test split", "cross validation split"
  • "interpolate", "smooth", "aggregate"

Not For / Boundaries

This skill does NOT:

  • Handle purely cross-sectional data (ungrouped, single timepoint)
  • Make domain-specific feature engineering decisions (you decide business logic)
  • Choose ML models (focuses on preprocessing only)
  • Handle distributed/big data infrastructure (assumes data fits in memory)

Required inputs before proceeding (a minimal pre-flight check is sketched after this list):

  1. Confirmation that data has groups (e.g., match_id, patient_id, session_id)
  2. Understanding of whether goal is within-group (relative) or cross-group (absolute) comparison
  3. Domain constraints on data ranges/units
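A minimal pre-flight sketch for these three inputs, assuming the grouping column is named match_id and you set goal yourself ('relative' or 'absolute'); adapt the placeholder names to your data:

import pandas as pd

def preflight_check(df: pd.DataFrame, group_col: str, goal: str) -> None:
    """Fail fast if the required inputs for this skill are not in place."""
    assert group_col in df.columns, f"Grouping column '{group_col}' not found"
    assert goal in ('relative', 'absolute'), "Goal must be 'relative' or 'absolute'"
    n_groups = df[group_col].nunique()
    # Domain range/unit constraints are checked later (Level 4: Physical Meaning)
    print(f"{n_groups} groups, avg {len(df) / n_groups:.1f} rows per group, goal = {goal}")

# Example call (placeholder names)
preflight_check(df, group_col='match_id', goal='relative')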

Quick Reference

Multi-Level Feature Analysis Framework

Level 1: Data Type

# Check data types
df.dtypes  # int64, float64, object, etc.

Level 2: Feature Type Classification

# Binary (0/1)
binary_features = [col for col in df.columns if df[col].nunique() == 2]

# Categorical (finite discrete values)
categorical_features = df.select_dtypes(include='object').columns.tolist()

# Continuous (infinite possible values)
continuous_features = [col for col in df.select_dtypes(include=['float64', 'int64']).columns
                       if df[col].nunique() > 10]

Level 3: Data Structure

# Check for grouping
print(f"Number of groups: {df['group_id'].nunique()}")
print(f"Avg points per group: {df.groupby('group_id').size().mean():.1f}")

# Check for time-series
df_sorted = df.sort_values(['group_id', 'timestamp'])

Level 4: Physical Meaning

# Validate physical ranges
assert df['speed_mph'].max() < 200, "Speed exceeds physical limit"
assert df['distance_meters'].min() >= 0, "Negative distance impossible"

Critical Processing Decision Tree

# Decision: Within-group or global processing?
def choose_processing_scope(data, feature, goal):
    """
    goal = 'relative' → within-group (e.g., "this point was intense FOR THIS MATCH")
    goal = 'absolute' → global (e.g., "this was an intense point OVERALL")
    """
    if goal == 'relative':
        return 'within_group'
    elif goal == 'absolute':
        return 'global'
    else:
        raise ValueError("Goal must be 'relative' or 'absolute'")
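A minimal usage example (assuming df holds match-level time series and the analysis goal is relative momentum; 'distance_run' is an illustrative feature name):

scope = choose_processing_scope(df, 'distance_run', goal='relative')
print(scope)  # → 'within_group': standardize per match (see Pattern 3)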

Pattern 1: Within-Group Interpolation (CORRECT)

from scipy.interpolate import CubicSpline
import numpy as np

# ✅ CORRECT: Interpolate within each group
for group_id in df['match_id'].unique():
    mask = df['match_id'] == group_id
    group_data = df.loc[mask, 'speed_mph']

    # Get valid (non-NaN) positions within this group
    valid_idx = group_data.notna()
    valid_positions = np.where(valid_idx)[0]
    valid_values = group_data[valid_idx].values

    if len(valid_positions) >= 4 and (~valid_idx).any():
        cs = CubicSpline(valid_positions, valid_values)
        missing_positions = np.where(~valid_idx)[0]
        # Assign back using this group's own row labels, never a global mask
        df.loc[group_data.index[~valid_idx], 'speed_mph'] = cs(missing_positions)

Pattern 2: Global Interpolation (WRONG - Don't Do This)

# ❌ WRONG: Cross-group interpolation
# This interpolates between match A's last point and match B's first point!
cs = CubicSpline(
    np.where(df['speed_mph'].notna())[0],  # ❌ All indices globally
    df['speed_mph'].dropna().values
)
df.loc[df['speed_mph'].isna(), 'speed_mph'] = cs(
    np.where(df['speed_mph'].isna())[0]
)

Pattern 3: Within-Group Standardization (for Relative Analysis)

from sklearn.preprocessing import StandardScaler

# ✅ CORRECT: Standardize within each match
for match_id in df['match_id'].unique():
    mask = df['match_id'] == match_id
    scaler = StandardScaler()

    df.loc[mask, 'distance_run_std_within'] = scaler.fit_transform(
        df.loc[mask, ['distance_run']]
    ).ravel()

# Interpretation: z=+2 means "2 std above average FOR THIS MATCH"

Pattern 4: Global Standardization (for Absolute Comparison)

# ✅ CORRECT: Global standardization (when appropriate)
scaler = StandardScaler()
df['distance_run_std_global'] = scaler.fit_transform(df[['distance_run']]).ravel()

# Interpretation: z=+2 means "2 std above average ACROSS ALL MATCHES"

Pattern 5: Feature Type Processing Rules

# Binary variables (0/1) - KEEP AS-IS
binary_cols = ['is_ace', 'is_winner', 'is_error']
# ❌ NEVER standardize these! They have semantic meaning as 0/1

# Categorical variables - ONE-HOT ENCODE
df_encoded = pd.get_dummies(df, columns=['server', 'serve_number'], dtype=int)

# Continuous variables - STANDARDIZE (within-group or global)
continuous_cols = ['distance_run', 'rally_count', 'speed_mph']
# ✅ Apply pattern 3 or 4 based on goal

Pattern 6: Sliding Window Features (for Momentum)

# ✅ CORRECT: Sliding window for momentum analysis
window = 10

df['win_rate_last10'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# ❌ WRONG: Cumulative features (loses temporal locality)
df['cumulative_points_won'] = df.groupby('match_id')['point_won'].cumsum()
# This just increases monotonically and correlates with point_number

Pattern 7: Data Quality Validation

def validate_data_quality(df, feature, expected_range):
    """Validate before processing"""
    # Check range
    assert df[feature].min() >= expected_range[0], f"{feature} below minimum"
    assert df[feature].max() <= expected_range[1], f"{feature} above maximum"

    # Check for anomalies
    mean = df[feature].mean()
    std = df[feature].std()

    if std > mean:
        print(f"⚠️ WARNING: {feature} has std > mean (highly skewed or errors)")

    # Check missing pattern (per-group fraction, not a global approximation)
    missing_frac_by_group = df.groupby('match_id')[feature].apply(lambda x: x.isna().mean())
    if missing_frac_by_group.max() > 0.5:
        print(f"⚠️ WARNING: {feature} has >50% missing in some groups")

# Example
validate_data_quality(df, 'speed_mph', expected_range=(50, 165))

Pattern 8: Detect Processing Scope Automatically

def detect_processing_scope(df, group_col, feature_col):
    """
    Recommend within-group vs global based on variance structure
    """
    # Calculate variance components
    within_group_var = df.groupby(group_col)[feature_col].var().mean()
    global_var = df[feature_col].var()

    # Intraclass correlation (clamped at 0; sampling noise can make
    # within-group variance exceed the global variance)
    between_group_var = max(global_var - within_group_var, 0.0)
    icc = between_group_var / global_var if global_var > 0 else 0.0

    if icc > 0.5:
        return 'within_group', f"High between-group variance (ICC={icc:.2f})"
    else:
        return 'global', f"Low between-group variance (ICC={icc:.2f})"

scope, reason = detect_processing_scope(df, 'match_id', 'distance_run')
print(f"Recommended: {scope} - {reason}")

Pattern 9: Data Leakage Detection

def detect_data_leakage(df, target_col, feature_cols, id_cols):
    """
    Critical checks for data leakage and AI common pitfalls
    """
    issues = []

    # 1. ID Leakage: High cardinality variables as features
    for col in feature_cols:
        if col in id_cols:
            issues.append(f"❌ FATAL: {col} is an ID - NEVER use as feature")
            continue

        # Check if looks like ID (>50% unique)
        uniqueness = df[col].nunique() / len(df)
        if uniqueness > 0.5:
            issues.append(f"⚠️ {col}: {uniqueness*100:.1f}% unique - possible ID leakage")

    # 2. Causal Inversion: Perfect correlation with target
    for col in feature_cols:
        if col == target_col:
            continue
        if df[col].dtype in ['int64', 'float64']:
            corr = abs(df[[col, target_col]].corr().iloc[0, 1])
            if corr > 0.95:
                issues.append(f"❌ FATAL: {col} correlation={corr:.3f} - likely consequence of target!")

    # 3. Meaningless Numeric: Codes treated as numbers
    for col in feature_cols:
        if df[col].dtype in ['int64', 'float64']:
            # Pattern: High values, many uniques, looks like code
            if df[col].min() > 1000 and df[col].nunique() > 100:
                issues.append(f"⚠️ {col}: Looks like code (zipcode/ID) - should be categorical")

    # 4. Time Travel: Check whether standardization used global statistics
    #    (requires knowing if the train/test split was done first;
    #     a leak-free pattern is sketched after this example)

    # Print report
    if issues:
        print("="*60)
        print("DATA LEAKAGE AUDIT")
        print("="*60)
        for issue in issues:
            print(issue)
        print("="*60)
    else:
        print("✅ No obvious leakage detected")

    return issues

# Example usage
issues = detect_data_leakage(
    df,
    target_col='point_won',
    feature_cols=['speed_mph', 'user_id', 'distance_run'],
    id_cols=['match_id', 'user_id']
)
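The time-travel check (item 4) is only noted above; here is a minimal sketch of the leak-free pattern, assuming a chronological train/test split already exists (train_df and test_df are placeholder names):

from sklearn.preprocessing import StandardScaler

# ❌ WRONG: fitting on the full dataset lets test-set statistics leak into training
# scaler = StandardScaler().fit(df[['speed_mph']])

# ✅ CORRECT: fit on the training split only, then apply the same scaler to both
scaler = StandardScaler().fit(train_df[['speed_mph']])
train_df['speed_mph_std'] = scaler.transform(train_df[['speed_mph']]).ravel()
test_df['speed_mph_std'] = scaler.transform(test_df[['speed_mph']]).ravel()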

Pattern 10: Distribution-Aware Scaling

import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.preprocessing import StandardScaler, RobustScaler

def smart_scaler_selection(df, col):
    """
    Choose scaler based on distribution characteristics
    """
    data = df[col].dropna()

    # Check distribution
    skewness = skew(data)
    kurt = kurtosis(data)

    print(f"{col}: skewness={skewness:.2f}, kurtosis={kurt:.2f}")

    if abs(skewness) < 0.5 and abs(kurt) < 3:
        # Roughly normal
        print("  → StandardScaler (data is roughly normal)")
        return StandardScaler(), None

    elif skewness > 1:
        # Right-skewed (long tail)
        print("  → Log transform + StandardScaler (right-skewed)")
        return StandardScaler(), 'log'

    else:
        # Heavy outliers or non-normal
        print("  → RobustScaler (heavy outliers)")
        return RobustScaler(), None

# Example usage
for col in continuous_features:
    scaler, transform = smart_scaler_selection(df, col)

    if transform == 'log':
        df[f'{col}_log'] = np.log1p(df[col])
        df[f'{col}_scaled'] = scaler.fit_transform(df[[f'{col}_log']]).ravel()
    else:
        df[f'{col}_scaled'] = scaler.fit_transform(df[[col]]).ravel()

Examples

Example 1: Tennis Match Preprocessing (Complete Pipeline)

Input:

  • CSV with 7,284 rows, 31 matches
  • Features: speed_mph, distance_run, rally_count, is_ace, server
  • Goal: Analyze momentum (relative intensity within each match)

Steps:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

# 1. Load and inspect
df = pd.read_csv('tennis_data.csv')
print(f"Matches: {df['match_id'].nunique()}")
print(f"Features: {df.dtypes}")

# 2. Classify features
binary_features = ['is_ace', 'is_winner', 'is_break_point']
categorical_features = ['server', 'serve_number']
continuous_features = ['distance_run', 'speed_mph', 'rally_count']

# 3. Validate data quality
for feat in continuous_features:
    print(f"\n{feat}:")
    print(df[feat].describe())
    # Check for impossible values
    if feat == 'speed_mph':
        assert df[feat].max() < 170, "Speed exceeds world record!"

# 4. Handle missing values (within-group)
for match_id in df['match_id'].unique():
    mask = df['match_id'] == match_id
    for feat in continuous_features:
        if df.loc[mask, feat].isna().any():
            # Simple linear interpolation within match
            df.loc[mask, feat] = df.loc[mask, feat].interpolate(method='linear')

# 5. One-hot encode categorical
df = pd.get_dummies(df, columns=categorical_features, dtype=int)

# 6. Standardize continuous features WITHIN each match
for feat in continuous_features:
    df[f'{feat}_std'] = np.nan
    for match_id in df['match_id'].unique():
        mask = df['match_id'] == match_id
        scaler = StandardScaler()
        df.loc[mask, f'{feat}_std'] = scaler.fit_transform(
            df.loc[mask, [feat]]
        ).ravel()

# 7. Create sliding window features
window = 10
df['win_rate_last10'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# 8. KEEP binary features as 0/1 (don't transform!)
# binary_features are already correct

print("\n✅ Preprocessing complete!")
print(f"Final shape: {df.shape}")
print(f"Standardized features: {[f for f in df.columns if f.endswith('_std')]}")

Expected output (a quick verification sketch follows this list):

  • Binary features remain 0/1
  • Categorical features one-hot encoded (e.g., server_1, server_2)
  • Continuous features have both original and _std versions
  • _std features have mean≈0, std≈1 WITHIN each match
  • Sliding window features capture local momentum
  • No missing values
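A quick verification sketch for these expectations (column names follow the pipeline above; the tolerances are illustrative):

# Binary features should still be strictly 0/1
for feat in binary_features:
    assert set(df[feat].dropna().unique()) <= {0, 1}, f"{feat} is no longer binary"

# Each _std feature should have mean ≈ 0 and std ≈ 1 within every match
std_cols = [c for c in df.columns if c.endswith('_std')]
for col in std_cols:
    g = df.groupby('match_id')[col]
    assert g.mean().abs().max() < 0.1, f"{col}: group means are not ≈ 0"
    assert (g.std() - 1).abs().max() < 0.2, f"{col}: group stds are not ≈ 1"

# No missing values should remain in the engineered columns
assert df[std_cols + ['win_rate_last10']].notna().all().all()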

Example 2: Detecting Cross-Group Contamination

Input:

  • Preprocessed data where you suspect cross-group standardization

Steps:

# Check if standardization was done correctly
def check_within_group_standardization(df, group_col, feature_std_col):
    """
    Verify that standardized feature has mean≈0, std≈1 within each group
    """
    results = df.groupby(group_col)[feature_std_col].agg(['mean', 'std'])

    # Within-group standardization: each group should have mean≈0, std≈1
    if (results['mean'].abs() < 0.1).all() and (results['std'].between(0.9, 1.1)).all():
        print("✅ CORRECT: Within-group standardization detected")
        return True

    # Global standardization: groups will have varying means and stds
    else:
        print("❌ WRONG: Global standardization detected!")
        print("Group means:", results['mean'].values[:5])
        print("Group stds:", results['std'].values[:5])
        return False

check_within_group_standardization(df, 'match_id', 'distance_run_std')

Expected output:

  • CORRECT: All group means ≈ 0, all group stds ≈ 1
  • WRONG: Group means vary widely, indicating global standardization

Example 3: Fixing Cumulative Feature Error

Input:

  • Existing pipeline using cumulative sums for momentum

Steps:

# ❌ WRONG approach (existing code)
df['cumulative_wins'] = df.groupby('match_id')['point_won'].cumsum()

# Problem: This just counts total wins up to this point
# Doesn't capture recent momentum!

# ✅ CORRECT approach (fix)
# Replace cumulative with sliding window
window = 10
df['recent_win_rate'] = df.groupby('match_id')['point_won'].transform(
    lambda x: x.rolling(window, min_periods=1).mean()
)

# Compare
print("Cumulative (wrong):", df['cumulative_wins'].values[50:60])
print("Sliding window (correct):", df['recent_win_rate'].values[50:60])

# Cumulative: [25, 26, 26, 27, 28, ...] - monotonic
# Sliding window: [0.6, 0.7, 0.5, 0.6, ...] - fluctuates with momentum

Expected output:

  • Cumulative features removed
  • Sliding window features show local variations
  • Momentum analysis now captures short-term trends

References

  • references/index.md: Navigation and overview
  • references/error-case-studies.md: Real-world preprocessing disasters from tennis data
  • references/decision-trees.md: Full decision trees for all preprocessing choices
  • references/validation-checklist.md: Pre-processing validation checklist
  • references/ai-common-pitfalls.md: AI-specific errors (data leakage, semantic fallacies, distribution blindness)

Maintenance

⚠️ CRITICAL NOTICE: USER'S PERSONAL EXPERIENCE-BASED SKILL ⚠️

This skill is NOT theoretical - it's based on real project failures:

  • V1.0 disaster: 88.9% error rate, weeks of wasted work
  • V2.0 issues: Cross-group contamination, unreliable results
  • V3.0 success: All errors fixed, production-ready

Why this matters to you (Claude):

  • These are the EXACT errors AI agents commonly make
  • User has already paid the price for these mistakes
  • Ignoring this skill = repeating documented failures
  • Following this skill = learning from experience without pain

Authority level: HIGHEST

  • Based on user's hard-won lessons from actual project
  • Validated through multiple iterations (V1.0 → V2.0 → V3.0)
  • Every error documented with impact metrics
  • Every fix validated with comprehensive testing

Sources:

  • Primary: User's personal project (2024 MCM Problem C - Tennis Momentum Analysis)
  • Secondary: Statistical best practices for grouped data
  • Tertiary: Common AI preprocessing errors observed across domains

Mandatory consultation:

  • ⚠️ ALWAYS consult before, during, and after any data preprocessing
  • ⚠️ NEVER skip validation steps outlined in this skill
  • ⚠️ When in doubt, err on the side of caution (use this skill)

Last updated: 2026-01-18 (V1.1)

Known limits:

  • Assumes data fits in memory (not for big data infrastructure)
  • Focused on numeric/categorical features (text/image preprocessing partially covered)
  • Does not prescribe domain-specific feature engineering (user decides business logic)
  • Requires basic understanding of statistics (mean, std, correlation)