Awesome-Agent-Skills-for-Empirical-Research modeling-strategy-guide

Strategic statistical modeling, experimentation, and causal inference

install

source · Clone the upstream repo

git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/43-wentorai-research-plugins/skills/analysis/statistics/modeling-strategy-guide" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-modeling-strategy && rm -rf "$T"

manifest: skills/43-wentorai-research-plugins/skills/analysis/statistics/modeling-strategy-guide/SKILL.md

source content

Modeling Strategy Guide

A skill for strategic statistical modeling applied to academic research. Covers advanced modeling decisions, experimental design, causal inference, feature engineering, and the critical thinking required to move from data to defensible conclusions.

Overview

Senior data scientists distinguish themselves not by knowing more algorithms but by asking better questions, designing cleaner experiments, and being honest about what the data can and cannot tell them. This skill translates that professional discipline into a research context, helping academics apply modern data science practices to their empirical work. It covers the strategic decisions that matter most: when to use simple models versus complex ones, how to establish causality rather than mere correlation, and how to communicate uncertainty honestly.

The skill is particularly useful for researchers working with observational data who need causal inference techniques, those designing randomized experiments who need proper power calculations and analysis plans, and anyone building predictive models who needs to avoid common overfitting and leakage pitfalls.

Strategic Modeling Decisions

Model Selection Philosophy

Decision Framework:
1. Start with the simplest model that could answer your question
2. Add complexity only when diagnostics reveal inadequacy
3. Prefer interpretable models unless prediction accuracy is the sole goal
4. Always have a baseline (mean, majority class, last observation)

Model Complexity Ladder:
  Level 1: Descriptive statistics, cross-tabulations
  Level 2: Linear/logistic regression
  Level 3: Regularized regression (Lasso, Ridge, Elastic Net)
  Level 4: Tree ensembles (Random Forest, Gradient Boosting)
  Level 5: Deep learning (only with sufficient data and clear justification)

Feature Engineering Principles

import pandas as pd
import numpy as np

def engineer_features(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """
    Apply systematic feature engineering based on domain knowledge.

    config example:
    {
        'log_transform': ['income', 'citations'],
        'interactions': [('experience', 'education')],
        'polynomial': {'age': 2},
        'time_features': 'date_column',
        'lag_features': {'metric': [1, 7, 30]}
    }
    """
    df = df.copy()

    # Log transforms for right-skewed variables
    for col in config.get('log_transform', []):
        df[f'{col}_log'] = np.log1p(df[col])

    # Interaction terms
    for col_a, col_b in config.get('interactions', []):
        df[f'{col_a}_x_{col_b}'] = df[col_a] * df[col_b]

    # Polynomial features
    for col, degree in config.get('polynomial', {}).items():
        for d in range(2, degree + 1):
            df[f'{col}_pow{d}'] = df[col] ** d

    # Time-based features
    if 'time_features' in config:
        time_col = config['time_features']
        df[time_col] = pd.to_datetime(df[time_col])
        df[f'{time_col}_month'] = df[time_col].dt.month
        df[f'{time_col}_dayofweek'] = df[time_col].dt.dayofweek
        df[f'{time_col}_quarter'] = df[time_col].dt.quarter

    return df

Causal Inference Methods

Beyond Correlation

Method	When to Use	Key Assumption
Randomized experiment	You can randomly assign treatment	Proper randomization, no attrition
Difference-in-differences	Policy change affects one group	Parallel trends pre-treatment
Regression discontinuity	Treatment assigned by cutoff	No manipulation near cutoff
Instrumental variables	Endogeneity present	Valid instrument (relevance + exclusion)
Propensity score matching	Observational data, many confounders	No unobserved confounders
Synthetic control	Single treated unit, many controls	Good pre-treatment fit

Propensity Score Matching

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def propensity_score_match(df, treatment_col, covariates, caliper=0.05):
    """
    Match treated and control units based on propensity scores.
    """
    # Estimate propensity scores
    X = df[covariates].values
    y = df[treatment_col].values

    lr = LogisticRegression(max_iter=1000, random_state=42)
    lr.fit(X, y)
    df['pscore'] = lr.predict_proba(X)[:, 1]

    # Match using nearest neighbor within caliper
    treated = df[df[treatment_col] == 1]
    control = df[df[treatment_col] == 0]

    nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
    nn.fit(control[['pscore']].values)

    distances, indices = nn.kneighbors(treated[['pscore']].values)

    # Apply caliper
    valid = distances.flatten() < caliper
    matched_treated = treated[valid].index.tolist()
    matched_control = control.iloc[indices.flatten()[valid]].index.tolist()

    return {
        'matched_treated': matched_treated,
        'matched_control': matched_control,
        'n_matched': sum(valid),
        'n_unmatched': sum(~valid),
        'balance_check': 'Run standardized mean differences on covariates'
    }

Experimentation Design

A/B Testing for Research

from scipy import stats
import numpy as np

def design_experiment(baseline_rate, mde, alpha=0.05, power=0.80):
    """
    Calculate required sample size for a two-proportion z-test.

    Args:
        baseline_rate: Current conversion/success rate
        mde: Minimum detectable effect (absolute change)
        alpha: Significance level
        power: Statistical power
    """
    from statsmodels.stats.power import NormalIndPower
    effect_size = mde / np.sqrt(baseline_rate * (1 - baseline_rate))
    analysis = NormalIndPower()
    n = analysis.solve_power(
        effect_size=effect_size, alpha=alpha, power=power, ratio=1.0
    )
    return {
        'sample_size_per_group': int(np.ceil(n)),
        'total_sample_size': int(np.ceil(n)) * 2,
        'baseline_rate': baseline_rate,
        'minimum_detectable_effect': mde,
        'alpha': alpha,
        'power': power
    }

Pre-Analysis Plan Template

Before running any experiment, document:

Primary hypothesis: One clearly stated prediction.
Primary outcome metric: One pre-specified metric for the main test.
Sample size justification: Power calculation with assumptions.
Randomization procedure: How units are assigned to conditions.
Analysis method: Exact statistical test and model specification.
Multiple comparisons: How secondary analyses will be corrected.
Stopping rules: Conditions for early termination (if applicable).

Model Validation

Cross-Validation Strategy

Data Type	Recommended CV	Rationale
i.i.d. data	Stratified K-fold (K=5 or 10)	Preserves class balance
Time series	Time-series split (expanding window)	Prevents look-ahead bias
Grouped data	Group K-fold	Prevents data leakage across groups
Small dataset (n<200)	Leave-one-out or repeated K-fold	Maximizes training data
Spatial data	Spatial blocking	Prevents spatial autocorrelation leakage

Leakage Detection Checklist

No future information used as features (check timestamps)
No target-derived features (e.g., group means computed on full data)
Train/test split performed before any preprocessing
Cross-validation folds respect group structure
Feature selection performed inside CV loop, not before
If accuracy seems too good to be true, it probably is

Communication and Reporting

The Senior DS Reporting Standard

Lead with the business/research question, not the algorithm.
Report confidence intervals, not just point estimates.
Show what you tried that did not work (negative results matter).
Quantify uncertainty: "The model predicts X with a 95% interval of [a, b]."
Be explicit about limitations and assumptions.
Use visualizations that a domain expert (not a statistician) can interpret.

References

Angrist, J. D. & Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton University Press.
Cunningham, S. (2021). Causal Inference: The Mixtape. Yale University Press.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer.