gsd-skill-creator · machine-learning-foundations

Supervised and unsupervised learning, bias-variance tradeoff, cross-validation, decision trees, ensemble methods, neural network fundamentals, and the practitioner's workflow from problem framing through deployment. Covers classification, regression, clustering, dimensionality reduction, regularization, hyperparameter tuning, and evaluation metrics. Use when building predictive models, selecting algorithms, or understanding the machine learning pipeline.

install
source · Clone the upstream repo
git clone https://github.com/Tibsfox/gsd-skill-creator
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Tibsfox/gsd-skill-creator "$T" && mkdir -p ~/.claude/skills && cp -r "$T/examples/skills/data-science/machine-learning-foundations" ~/.claude/skills/tibsfox-gsd-skill-creator-machine-learning-foundations && rm -rf "$T"
manifest: examples/skills/data-science/machine-learning-foundations/SKILL.md
source content

Machine Learning Foundations

Machine learning is the practice of building systems that learn patterns from data and use those patterns to make predictions or decisions on new data. Where statistical modeling (the inference culture) asks "what is the relationship between X and Y?", machine learning (the prediction culture) asks "given X, what is the best prediction of Y?" This skill covers the foundational concepts, algorithms, and workflow of machine learning from the practitioner's perspective.

Agent affinity: breiman (algorithm selection, ensemble methods), tukey (feature engineering, EDA)

Concept IDs: data-correlation, data-distributions, data-measures-of-spread, data-hypothesis-testing

The ML Workflow

| Stage | Goal | Key operations |
| --- | --- | --- |
| 1. Problem framing | Define the task precisely | Classification vs. regression vs. clustering; define target variable and success metric |
| 2. Data collection | Assemble training data | Sources, sampling, labeling; ensure data represents the deployment population |
| 3. Feature engineering | Create informative inputs | Domain-driven features, transformations, encoding categoricals |
| 4. Train/test split | Enable honest evaluation | Hold out 20-30% for testing; never touch the test set during development |
| 5. Model selection | Choose algorithm family | Based on data size, interpretability needs, problem structure |
| 6. Training | Fit model parameters | Optimization (gradient descent, tree splitting, etc.) |
| 7. Validation | Tune hyperparameters | k-fold cross-validation on the training set only |
| 8. Evaluation | Assess on held-out test set | Metrics appropriate to the problem (accuracy, F1, RMSE, etc.) |
| 9. Interpretation | Understand what the model learned | Feature importance, partial dependence, SHAP values |
| 10. Deployment | Put the model in production | Monitoring, drift detection, retraining schedule |
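
A minimal sketch of stages 4-8 using scikit-learn. The synthetic dataset and every hyperparameter here are illustrative placeholders, not recommendations:

```python
# Minimal sketch of workflow stages 4-8 with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Stage 4: hold out 25% for final evaluation; stratify to preserve class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Stages 5-7: pick a model family, then validate on the training set only.
model = RandomForestClassifier(n_estimators=300, random_state=0)
cv_f1 = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
print(f"5-fold CV F1: {cv_f1.mean():.3f} +/- {cv_f1.std():.3f}")

# Stage 8: fit on all training data, then open the test set exactly once.
model.fit(X_train, y_train)
print(f"Test F1: {f1_score(y_test, model.predict(X_test)):.3f}")
```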

Supervised Learning

Classification

The task: given features X, predict a categorical label y.

Key algorithms:

| Algorithm | Strengths | Weaknesses | When to use |
| --- | --- | --- | --- |
| Logistic regression | Interpretable, fast, probabilistic | Linear decision boundary | Baseline; when interpretability matters |
| k-Nearest Neighbors | Non-parametric, no training phase | Slow at prediction, curse of dimensionality | Small datasets, low dimensionality |
| Decision tree | Interpretable, handles mixed types | Overfits easily, unstable | When interpretability is paramount; as building block for ensembles |
| Random forest | Robust, handles high dimensions | Less interpretable than single tree | Default for tabular data |
| Gradient boosting | State-of-the-art tabular performance | Prone to overfitting without tuning | Competition-grade tabular prediction |
| SVM | Effective in high dimensions | Slow on large datasets, kernel choice | Text classification, small-medium datasets |
| Neural network | Learns complex patterns, scales to huge data | Requires large data, expensive, black box | Images, text, sequences, very large datasets |

Regression

The task: given features X, predict a continuous value y.

Same algorithms apply (linear regression, k-NN regression, decision tree regression, random forest regression, gradient boosting regression, neural network regression). The loss function changes from cross-entropy to squared error (or absolute error, Huber loss, etc.).
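
A quick illustration of the loss swap, assuming scikit-learn's GradientBoostingRegressor: the model family stays fixed while the loss argument changes the objective:

```python
# Illustrative: one regression model family under three different losses.
# Loss names follow scikit-learn's GradientBoostingRegressor options.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

for loss in ("squared_error", "absolute_error", "huber"):
    model = GradientBoostingRegressor(loss=loss, random_state=0).fit(X, y)
    print(loss, model.score(X, y))  # training R^2, for illustration only
```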

Evaluation Metrics

Classification:

| Metric | Formula / Definition | When to use |
| --- | --- | --- |
| Accuracy | Correct / Total | Balanced classes only |
| Precision | TP / (TP + FP) | Cost of false positives is high (spam detection) |
| Recall | TP / (TP + FN) | Cost of false negatives is high (cancer screening) |
| F1 score | 2 * Precision * Recall / (Precision + Recall) | Need balance between precision and recall |
| ROC-AUC | Area under ROC curve | Ranking quality across thresholds |
| Log loss | Negative log-likelihood of predicted probabilities | When calibrated probabilities matter |
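
The precision, recall, and F1 formulas above map directly to code. A sketch with hypothetical predictions, cross-checked against scikit-learn:

```python
# Hypothetical labels and predictions; shows how the table's formulas map to code.
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)          # TP / (TP + FP)
recall = tp / (tp + fn)             # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

# The hand-computed values match scikit-learn's implementations.
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
```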

Regression:

| Metric | Formula / Definition | When to use |
| --- | --- | --- |
| MSE | Mean of (y - y_hat)^2 | Default; penalizes large errors |
| RMSE | sqrt(MSE) | Same scale as y; more interpretable |
| MAE | Mean of \|y - y_hat\| | Robust to outliers |
| R-squared | 1 - (SS_res / SS_tot) | Proportion of variance explained |
| MAPE | Mean of \|(y - y_hat) / y\| * 100 | Scale-free percentage error; undefined when y = 0 |
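
The same metrics written out directly in NumPy, using hypothetical values:

```python
# The regression metrics above, computed by hand.
import numpy as np

y = np.array([3.0, 5.0, 7.5, 10.0])        # hypothetical true values
y_hat = np.array([2.5, 5.0, 8.0, 9.0])     # hypothetical predictions

mse = np.mean((y - y_hat) ** 2)
rmse = np.sqrt(mse)
mae = np.mean(np.abs(y - y_hat))
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
mape = np.mean(np.abs((y - y_hat) / y)) * 100  # undefined if any y == 0

print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R^2={r2:.3f} MAPE={mape:.1f}%")
```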

The Bias-Variance Tradeoff

The expected prediction error decomposes into three components:

Error = Bias^2 + Variance + Irreducible noise

  • Bias: Error from oversimplifying the model. A linear model fit to a quadratic relationship has high bias (underfitting).
  • Variance: Error from model sensitivity to training data. A deep decision tree memorizes the training set and varies wildly across samples (overfitting).
  • Irreducible noise: Inherent randomness in the data. No model can reduce this.

The tradeoff: Increasing model complexity reduces bias but increases variance. Decreasing complexity reduces variance but increases bias. The optimal model balances both.

Regularization controls this tradeoff by penalizing complexity:

| Method | Penalty | Effect |
| --- | --- | --- |
| Ridge (L2) | Sum of beta_j^2 | Shrinks coefficients toward zero; keeps all predictors |
| Lasso (L1) | Sum of \|beta_j\| | Shrinks some coefficients exactly to zero; performs feature selection |
| Elastic net | Alpha * L1 + (1 - Alpha) * L2 | Combines ridge and lasso benefits |
| Tree depth limit | Max depth, min samples per leaf | Prevents tree from memorizing noise |
| Dropout | Randomly zero out neurons during training | Prevents neural network co-adaptation |
| Early stopping | Stop training when validation error increases | Universal; works for any iterative algorithm |
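
A sketch of the L1/L2 contrast on synthetic data. The alpha values here are arbitrary; in practice they are tuned by cross-validation:

```python
# L2 shrinks coefficients; L1 zeroes some out entirely.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 3 of the 10 features carry signal.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=5.0).fit(X, y)

print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))  # typically all 10
print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))  # typically ~3
```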

Cross-Validation

Cross-validation estimates out-of-sample performance using only the training data.

k-Fold Cross-Validation

  1. Split training data into k equal folds (k = 5 or 10 is standard).
  2. For each fold i: train on all folds except i, evaluate on fold i.
  3. Average performance across all k evaluations.
  4. Use this average to select hyperparameters.
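
The procedure written out explicitly, using a stratified splitter as recommended below; the dataset and model are placeholders:

```python
# The k-fold procedure above, made explicit (k = 5).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

X_train, y_train = make_classification(n_samples=1000, random_state=0)

scores = []
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X_train, y_train):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])   # train on k-1 folds
    preds = model.predict(X_train[val_idx])             # evaluate on held-out fold
    scores.append(accuracy_score(y_train[val_idx], preds))

print(f"CV accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```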

Critical Rules

  • Never use test data for any decision during training. The test set is opened exactly once, at the very end.
  • Stratified k-fold for classification. Preserve class proportions in each fold.
  • Group k-fold for grouped data. If observations belong to groups (e.g., multiple images from the same patient), all observations from a group must be in the same fold.
  • Time series split for temporal data. Training set always precedes validation set in time. No random shuffling.
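
scikit-learn ships a splitter for each of these rules; a sketch verifying the group and temporal guarantees on toy indices:

```python
# Matching the rules above to scikit-learn splitters (illustrative indices).
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)
groups = np.repeat([0, 1, 2, 3], 3)  # e.g., 3 images per patient

# Group k-fold: all rows from one group land in the same fold.
for tr, va in GroupKFold(n_splits=4).split(X, y, groups):
    assert len(set(groups[tr]) & set(groups[va])) == 0

# Time series split: training indices always precede validation indices.
for tr, va in TimeSeriesSplit(n_splits=3).split(X):
    assert tr.max() < va.min()
```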

Decision Trees

How They Work

A decision tree recursively partitions the feature space by choosing splits that maximize information gain (classification) or minimize mean squared error (regression).

Splitting criteria for classification:

  • Gini impurity: G = 1 - sum(p_k^2). Measures probability of misclassification.
  • Entropy: H = -sum(p_k * log(p_k)). Information-theoretic measure of impurity.
  • In practice: Gini and entropy produce nearly identical trees. Gini is slightly faster to compute.
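
Both impurity measures are one-liners over class proportions (log base 2 here; any base works up to a constant factor):

```python
# The two impurity measures, computed from class proportions.
import numpy as np

def gini(p):
    """Gini impurity: G = 1 - sum(p_k^2)."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Entropy: H = -sum(p_k * log2(p_k)), ignoring empty classes."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))    # maximal impurity: 0.5, 1.0
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))    # pure node: 0.0, 0.0
```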

Controlling complexity:

| Parameter | Effect |
| --- | --- |
| Max depth | Limits tree height; primary regularization lever |
| Min samples split | Minimum observations to attempt a split |
| Min samples leaf | Minimum observations in a terminal node |
| Max features | Number of features considered at each split (critical for random forests) |

Why Single Trees Overfit

A fully grown tree achieves 100% training accuracy by creating one leaf per observation. This is pure memorization. The tree's variance is enormous -- small changes in training data produce completely different trees. This instability is why ensemble methods exist.

Ensemble Methods

Bagging (Bootstrap Aggregating)

Train multiple models on bootstrap samples (random samples with replacement) and average their predictions. Reduces variance without increasing bias.

Random forest = bagging + random feature subsets at each split. The feature randomization decorrelates the trees, making the average more effective. Random forests are the default algorithm for tabular data because they work well out of the box with minimal tuning.
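
A from-scratch sketch of bagging's variance reduction, bootstrap loop included; in practice you would reach for RandomForestRegressor directly:

```python
# Bagging from scratch: average many trees fit on bootstrap resamples.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=600, n_features=8, noise=25.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
preds = []
for _ in range(100):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))  # bootstrap: sample with replacement
    tree = DecisionTreeRegressor(random_state=0).fit(X_tr[idx], y_tr[idx])
    preds.append(tree.predict(X_te))

single_mse = mean_squared_error(y_te, preds[0])                 # one overfit tree
bagged_mse = mean_squared_error(y_te, np.mean(preds, axis=0))   # the averaged ensemble
print(f"single tree MSE: {single_mse:.0f}, bagged MSE: {bagged_mse:.0f}")
```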

Boosting

Train models sequentially, with each new model correcting the errors of the previous ensemble:

  • AdaBoost: Reweights misclassified observations. Simple but sensitive to outliers.
  • Gradient boosting: Fits each new tree to the residuals (negative gradient of the loss function). More general and powerful.
  • XGBoost / LightGBM / CatBoost: Optimized gradient boosting implementations with regularization, parallel training, and categorical handling. State-of-the-art for tabular data competitions.
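
For squared-error loss the negative gradient is simply the residual, so gradient boosting can be sketched in a few lines; all settings here are illustrative:

```python
# Gradient boosting by hand for squared-error loss: each tree fits residuals.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant model
for i in range(100):
    residuals = y - prediction                        # negative gradient of MSE
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)     # small corrective step
    if i % 25 == 0:
        print(f"round {i}: training MSE = {np.mean((y - prediction) ** 2):.1f}")
```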

Bagging vs. Boosting

| Property | Bagging (Random Forest) | Boosting (Gradient Boosting) |
| --- | --- | --- |
| Reduces | Variance | Bias (primarily) + variance |
| Training | Parallel (fast) | Sequential (slower) |
| Overfitting risk | Low | Higher without tuning |
| Tuning effort | Minimal | Significant (learning rate, depth, iterations) |
| Default choice | Yes, for most tabular problems | When you need maximum performance and will tune |

Unsupervised Learning

Clustering

Grouping observations without labels.

| Algorithm | Assumption | Strengths | Weaknesses |
| --- | --- | --- | --- |
| k-Means | Spherical, equal-size clusters | Fast, scalable | Must specify k; sensitive to initialization |
| DBSCAN | Density-based clusters | Finds arbitrary shapes, handles noise | Sensitive to epsilon and min_samples parameters |
| Hierarchical | Nested cluster structure | Dendrogram visualization, no k needed | O(n^2) or worse; not for large datasets |
| Gaussian Mixture | Elliptical clusters | Soft assignments (probabilities) | Must specify k; can converge to local optima |

Choosing k: Elbow method (plot inertia vs. k), silhouette scores, domain knowledge. There is no universally correct k -- clustering is exploratory, not definitive.
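
A sketch combining both heuristics on synthetic blobs whose true k is 4:

```python
# Choosing k: inertia (elbow method) and silhouette score for k = 2..6.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)  # 4 true clusters

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: inertia={km.inertia_:.0f}, "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")
# The silhouette score typically peaks near the true k; the elbow appears there too.
```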

Dimensionality Reduction

| Method | Linear? | Preserves | Use when |
| --- | --- | --- | --- |
| PCA | Yes | Global variance | High-dimensional data, preprocessing for modeling |
| t-SNE | No | Local neighbor structure | 2D/3D visualization of high-dimensional data |
| UMAP | No | Local + some global structure | Faster than t-SNE, better global structure |

Neural Networks Introduction

Architecture

A neural network is a composition of linear transformations and non-linear activations:

output = f_L(W_L * f_{L-1}(W_{L-1} * ... f_1(W_1 * x + b_1) ... + b_{L-1}) + b_L)

  • Input layer: Feature vector x.
  • Hidden layers: Each applies a linear transformation (W * x + b) followed by an activation function (ReLU, sigmoid, tanh).
  • Output layer: Sigmoid for binary classification, softmax for multiclass, linear for regression.
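
The composition above, written out in NumPy for a single hidden layer. The weights here are random and untrained; a real network learns W and b via the training procedure below:

```python
# One forward pass: output = sigmoid(W2 @ relu(W1 @ x + b1) + b2).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                            # input layer: 4 features

W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)     # hidden layer: 8 units
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)     # output layer: 1 unit

def relu(z): return np.maximum(z, 0.0)
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

h = relu(W1 @ x + b1)                 # f_1(W_1 * x + b_1)
p = sigmoid(W2 @ h + b2)              # sigmoid output for binary classification
print(f"P(y=1 | x) = {p[0]:.3f}")
```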

Training

  • Loss function: Cross-entropy (classification), MSE (regression).
  • Optimization: Stochastic gradient descent (SGD) and variants (Adam, RMSProp).
  • Backpropagation: Chain rule applied to compute gradients through the network.
  • Batch size: Mini-batches (32-256) balance noise and computation.
  • Learning rate: Most important hyperparameter. Too high -> divergence. Too low -> slow convergence.

When to Use Neural Networks

Neural networks excel when data is large (>100K samples), structured (images, text, sequences), and the relationship is highly non-linear. For tabular data with <10K samples, gradient boosting typically wins. Neural networks are not magic -- they are function approximators that need sufficient data to justify their complexity.

Common Mistakes

| Mistake | Why it fails | Fix |
| --- | --- | --- |
| Data leakage | Future information in training features | Audit every feature for temporal leakage |
| Not holding out a test set | Reported performance is overly optimistic | Split before any modeling decisions |
| Using accuracy on imbalanced data | 95% accuracy is trivial when 95% of data is one class | Use F1, precision-recall, or balanced accuracy |
| Tuning on the test set | Test performance becomes optimistic | Tune on validation/CV only; test set opened once |
| Feature scaling mismatch | Fitting the scaler on all data leaks test-set statistics into training | Fit scaler on training data only; transform test with the same scaler |
| Ignoring class imbalance | Model predicts majority class for everything | Oversampling (SMOTE), class weights, or threshold adjustment |
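
One concrete guard against the scaling mistake: wrap preprocessing and model in a scikit-learn Pipeline, so the scaler is refit inside every training fold and test folds never influence it:

```python
# A Pipeline refits the scaler within each CV training fold, preventing leakage.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),          # fit on each training fold only
    ("model", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=5).mean())
```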

Cross-References

  • breiman agent: Algorithm selection, random forests, and the "two cultures" perspective on prediction vs. inference.
  • tukey agent: Exploratory analysis and feature engineering that precede model training.
  • cairo agent: Communicating model results through visualization -- feature importance plots, partial dependence, calibration curves.
  • statistical-modeling skill: The inference-focused counterpart to this prediction-focused skill.
  • data-wrangling skill: Data preparation pipeline that produces training-ready features.
  • ethics-governance skill: Algorithmic bias, fairness metrics, and responsible deployment of ML models.

References

  • Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5-32.
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. 2nd edition. Springer.
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning. 2nd edition. Springer.
  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  • Chen, T. & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." Proceedings of KDD, 785-794.