Gsd-skill-creator regression-modeling

Modeling relationships between variables using regression. Covers simple linear regression, multiple regression, polynomial regression, logistic regression, model fitting (least squares, maximum likelihood), residual analysis, model diagnostics, R-squared, adjusted R-squared, multicollinearity, variable selection, and Box's dictum that all models are wrong but some are useful. Use when predicting outcomes, quantifying relationships, building predictive models, or diagnosing model fit.

install
source · Clone the upstream repo
git clone https://github.com/Tibsfox/gsd-skill-creator
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Tibsfox/gsd-skill-creator "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/skills/statistics/regression-modeling" ~/.claude/skills/tibsfox-gsd-skill-creator-regression-modeling && rm -rf "$T"
manifest: examples/skills/statistics/regression-modeling/SKILL.md
source content

Regression Modeling

Regression models quantify the relationship between a response variable and one or more explanatory variables. The goal may be prediction ("what will Y be when X = 10?"), explanation ("how does Y change when X increases by one unit?"), or both. This skill covers the core regression toolkit from simple linear regression through logistic regression, with emphasis on the diagnostics that separate a useful model from a misleading one.

Agent affinity: box (model building, diagnostics, "all models are wrong"), pearson (correlation, regression theory), efron (computational model fitting)

Concept IDs: stat-descriptive-statistics, stat-hypothesis-testing

Simple Linear Regression

The model

Y = beta_0 + beta_1 * X + epsilon, where epsilon ~ N(0, sigma^2).

  • beta_0: Y-intercept. The predicted value of Y when X = 0.
  • beta_1: Slope. The change in predicted Y for a one-unit increase in X.
  • epsilon: Error term. Captures everything the model does not explain.

Least squares estimation

The least squares estimates minimize the sum of squared residuals:

b_1 = sum((x_i - x-bar)(y_i - y-bar)) / sum((x_i - x-bar)^2)

b_0 = y-bar - b_1 * x-bar

The fitted line y-hat = b_0 + b_1 * x passes through the point (x-bar, y-bar).
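
A minimal sketch of these estimators in Python, with invented data (numpy assumed available):

```python
# Least-squares slope and intercept computed directly from the formulas above.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # invented illustration data
y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])

x_bar, y_bar = x.mean(), y.mean()
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
b0 = y_bar - b1 * x_bar                                            # intercept

# Sanity check: the fitted line passes through (x-bar, y-bar).
assert np.isclose(b0 + b1 * x_bar, y_bar)
print(f"y-hat = {b0:.3f} + {b1:.3f} * x")
```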

Interpretation

  • b_1 = 2.3: "For each one-unit increase in X, Y increases by 2.3 units on average."
  • b_0 = 15.7: "When X = 0, the predicted Y is 15.7." (Only meaningful if X = 0 is within the data range.)
  • Extrapolation warning: The model is only trustworthy within the range of observed X values. Extrapolating beyond that range assumes the linear relationship continues, which may be false.

R-Squared and Model Fit

R-squared (coefficient of determination)

R^2 = 1 - SS_residual / SS_total = SS_regression / SS_total.

R^2 is the proportion of variance in Y explained by the model. Range: 0 to 1. An R^2 of 0.72 means the model explains 72% of the variability in Y.

Adjusted R-squared

R^2_adj = 1 - (SS_residual / (n-k-1)) / (SS_total / (n-1)), where n is the sample size and k is the number of predictors.

Adjusted R^2 penalizes for adding predictors. It can decrease when a useless predictor is added. Use adjusted R^2, not R^2, for comparing models with different numbers of predictors.
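
Both quantities are direct transcriptions of the formulas above; a sketch, assuming y and y_hat are numpy arrays and k is the number of predictors:

```python
# R^2 and adjusted R^2 computed from residuals.
import numpy as np

def r_squared(y, y_hat, k):
    """Return (R^2, adjusted R^2) for a model with k predictors."""
    n = len(y)
    ss_residual = np.sum((y - y_hat) ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_residual / ss_total
    r2_adj = 1 - (ss_residual / (n - k - 1)) / (ss_total / (n - 1))
    return r2, r2_adj
```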

Cautions about R-squared

  • A high R^2 does not mean the model is correct. A quadratic relationship fit with a line can have moderate R^2 but systematically wrong predictions.
  • A low R^2 does not mean the model is useless. In social science, R^2 = 0.10 with a clear causal mechanism is scientifically important.
  • R^2 always increases (or stays the same) when you add a predictor, regardless of that predictor's relevance. This is why adjusted R^2 exists.

Multiple Regression

The model

Y = beta_0 + beta_1*X_1 + beta_2*X_2 + ... + beta_k*X_k + epsilon.

Each beta_j is the partial effect of X_j on Y, holding all other predictors constant.

Interpretation with multiple predictors

"Holding all other variables constant, a one-unit increase in X_2 is associated with a b_2-unit change in Y." The phrase "holding all other variables constant" is critical -- it distinguishes multiple regression from running separate simple regressions.

Multicollinearity

When predictors are highly correlated with each other, the model has difficulty separating their individual effects.

Detection:

  • Correlation matrix: Pairwise correlations > 0.8 are concerning.
  • Variance Inflation Factor (VIF): VIF_j = 1 / (1 - R^2_j), where R^2_j is the R^2 from regressing X_j on all other predictors. VIF > 10 is a red flag; VIF > 5 warrants attention.

Consequences: Coefficients become unstable (large standard errors), making individual predictor effects unreliable. The overall model's predictive ability may be fine.

Remedies: Remove one of the correlated predictors, combine them (e.g., principal components), or accept the instability if prediction is the only goal.
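
A sketch of the VIF computation using statsmodels' helper; X is assumed to be an (n, k) predictor matrix without the constant column:

```python
# VIF_j = 1 / (1 - R^2_j) for each predictor column of X.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vifs(X):
    exog = sm.add_constant(X)   # the auxiliary regressions need an intercept
    # Column 0 is the constant, so report columns 1..k.
    return [variance_inflation_factor(exog, j) for j in range(1, exog.shape[1])]
```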

Residual Analysis and Diagnostics

"All models are wrong, but some are useful." -- George E.P. Box. The diagnostics below determine whether a model is useful enough.

Assumptions to check

  1. Linearity: The relationship between predictors and response is linear.
  2. Independence: Residuals are independent of each other.
  3. Homoscedasticity: Residuals have constant variance across all levels of X.
  4. Normality: Residuals are approximately normally distributed.

The acronym LINE captures all four: Linearity, Independence, Normality, Equal variance (homoscedasticity).

Residual plots

| Plot | What it checks | Healthy pattern | Problem signal |
| --- | --- | --- | --- |
| Residuals vs. fitted values | Linearity, homoscedasticity | Random scatter around zero | Curved pattern (nonlinearity), funnel shape (heteroscedasticity) |
| Normal Q-Q plot | Normality | Points on the diagonal line | Systematic departures at the tails |
| Residuals vs. predictor | Linearity for each predictor | Random scatter | Curved pattern |
| Residuals vs. order | Independence | Random scatter | Patterns over time (autocorrelation) |
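
The first two plots take only a few lines, reusing the fitted `model` from the multiple-regression sketch above (matplotlib assumed available):

```python
# Residuals vs. fitted and normal Q-Q plot for a statsmodels OLS result.
import matplotlib.pyplot as plt
import statsmodels.api as sm

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.scatter(model.fittedvalues, model.resid, alpha=0.6)
ax1.axhline(0, color="gray", linestyle="--")
ax1.set(xlabel="Fitted values", ylabel="Residuals", title="Residuals vs. fitted")

sm.qqplot(model.resid, line="45", fit=True, ax=ax2)   # points should track the line
ax2.set_title("Normal Q-Q")

plt.tight_layout()
plt.show()
```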

Influential observations

  • Leverage: How far an observation's X value is from the mean of X. High leverage points have outsized potential influence.
  • Cook's distance: Measures how much all fitted values change when observation i is removed. Cook's D > 4/n is a common threshold.
  • DFFITS: The change in the fitted value at X_i when observation i is deleted. |DFFITS| > 2*sqrt(k/n) is flagged.

Action: Investigate influential points. They may be data entry errors, genuinely unusual observations, or indicators that the model is wrong. Do not automatically remove them.
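
statsmodels exposes all three measures on a fitted OLS result; a sketch, again reusing `model` from above:

```python
# Leverage, Cook's distance, and DFFITS for each observation.
influence = model.get_influence()

leverage = influence.hat_matrix_diag      # hat values
cooks_d = influence.cooks_distance[0]     # first element is the distances
dffits = influence.dffits[0]              # first element is the DFFITS values

n = len(cooks_d)
flagged = [i for i in range(n) if cooks_d[i] > 4 / n]
print("Observations with Cook's D > 4/n:", flagged)   # investigate, don't auto-remove
```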

Polynomial Regression

Y = beta_0 + beta_1*X + beta_2*X^2 + ... + beta_p*X^p + epsilon.

Used when the scatter plot shows curvature that a straight line cannot capture.

Cautions:

  • Higher-degree polynomials can overfit, fitting noise rather than signal.
  • Polynomial extrapolation is especially dangerous -- polynomial tails diverge wildly.
  • Start with degree 2; go higher only with strong evidence and domain justification.
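
A degree-2 fit is nearly a one-liner with numpy; the curved data here are invented for illustration:

```python
# Quadratic fit via numpy.polyfit.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 4, 50)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(0, 0.3, 50)   # invented curved data

coeffs = np.polyfit(x, y, deg=2)    # returns [b_2, b_1, b_0], highest degree first
y_hat = np.polyval(coeffs, x)       # predictions -- within the observed x range only
```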

Logistic Regression

When the response is binary

When Y is 0 or 1 (success/failure, yes/no), linear regression is inappropriate (predicted values can fall outside [0, 1]). Logistic regression models the log-odds:

log(p / (1-p)) = beta_0 + beta_1*X_1 + ... + beta_k*X_k

where p = P(Y = 1).

Interpretation

  • Odds ratio: exp(beta_j) is the multiplicative change in odds for a one-unit increase in X_j. An odds ratio of 1.5 means "50% higher odds of success."
  • Probability curve: The logistic function p = 1 / (1 + exp(-(beta_0 + beta_1*X))) produces an S-shaped curve mapping the linear predictor to [0, 1].

Model fitting

Logistic regression is fit by maximum likelihood, not least squares. The log-likelihood is maximized iteratively (usually via Newton-Raphson or Fisher scoring).
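
A fitting sketch with statsmodels' Logit, using invented data; exponentiating the coefficients gives the odds ratios described above:

```python
# Logistic regression fit by maximum likelihood.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 200)
p_true = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))   # S-shaped logistic function
y = rng.binomial(1, p_true)                    # invented binary outcomes

X = sm.add_constant(x)
res = sm.Logit(y, X).fit()     # iterative MLE, not least squares
print(np.exp(res.params))      # odds ratios: exp(beta_j)
```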

Diagnostics

  • Deviance residuals replace ordinary residuals.
  • AIC (Akaike Information Criterion) for model comparison (lower is better).
  • Hosmer-Lemeshow test for goodness-of-fit.
  • ROC curve and AUC for classification performance.
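
AIC is available directly on the fitted result, and AUC takes one more line if scikit-learn is available; continuing from the sketch above:

```python
# Model comparison and classification performance for the logistic fit.
from sklearn.metrics import roc_auc_score

print("AIC:", res.aic)                   # lower is better among candidate models
p_hat = res.predict(X)                   # predicted probabilities
print("AUC:", roc_auc_score(y, p_hat))   # 0.5 = chance, 1.0 = perfect ranking
```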

Variable Selection

Methods

| Method | Approach | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Forward selection | Start with no predictors, add the most significant one at each step | Simple | May miss important combinations |
| Backward elimination | Start with all predictors, remove the least significant one at each step | Considers all predictors | Requires n >> k |
| Stepwise | Combination of forward and backward | Flexible | Overfits; inflates significance |
| Best subsets | Evaluate all 2^k possible models | Exhaustive | Computationally expensive for large k |
| LASSO (L1 penalty) | Penalized regression that shrinks some coefficients to zero | Built-in variable selection, handles multicollinearity | Requires tuning of penalty parameter |
| AIC / BIC comparison | Select model with lowest information criterion | Principled tradeoff between fit and complexity | Requires fitting multiple models |
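
Of these, LASSO is the most mechanical to sketch; here with scikit-learn's cross-validated penalty selection, assuming X is an (n, k) predictor matrix and y a response vector:

```python
# LASSO variable selection; standardize first so the penalty treats
# all predictors comparably.
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(X)
lasso = LassoCV(cv=5).fit(X_std, y)
kept = [j for j, c in enumerate(lasso.coef_) if c != 0.0]
print("Selected predictors (nonzero coefficients):", kept)
```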

Box's dictum applied: No variable selection method guarantees finding the "true" model. Use domain knowledge first, then let statistical methods refine.

Common Mistakes

| Mistake | Why it fails | Fix |
| --- | --- | --- |
| Interpreting correlation as causation | Regression shows association, not causation | Use causal language only if the study design supports it (experiment, not observational) |
| Ignoring residual plots | The model may be systematically wrong | Always plot residuals after fitting |
| Extrapolating beyond the data range | No evidence that the relationship holds outside observed X values | State the range of validity |
| Adding too many predictors | Overfitting; R^2 increases but generalization decreases | Use adjusted R^2 or cross-validation |
| Ignoring influential points | One point can change the entire regression line | Check leverage, Cook's distance, DFFITS |
| Using linear regression for binary outcomes | Predicted values fall outside [0, 1], violating assumptions | Use logistic regression |

Cross-References

  • box agent: Model building philosophy, response surface methodology, diagnostics.
  • pearson agent: Regression theory, correlation, coefficient estimation.
  • efron agent: Computational fitting, cross-validation, bootstrap for regression.
  • descriptive-statistics skill: Scatter plots and correlation precede regression.
  • inferential-statistics skill: Hypothesis tests for regression coefficients.
  • bayesian-methods skill: Bayesian regression as an alternative to frequentist fitting.

References

  • Box, G. E. P. (1976). "Science and statistics." Journal of the American Statistical Association, 71(356), 791-799.
  • Kutner, M. H., Nachtsheim, C. J., Neter, J., & Li, W. (2005). Applied Linear Statistical Models. 5th edition. McGraw-Hill.
  • Agresti, A. (2015). Foundations of Linear and Generalized Linear Models. Wiley.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. 2nd edition. Springer.
  • Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression. 3rd edition. Wiley.