# scikit-learn Skill

Clone the full Awesome-Agent-Skills-for-Empirical-Research repository:

```shell
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research
```

Or copy only this skill into `~/.claude/skills`:

```shell
T=$(mktemp -d) && git clone --depth=1 https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/17-DAAF-Contribution-Community-daaf/dot-claude/skills/scikit-learn" ~/.claude/skills/brycewang-stanford-awesome-agent-skills-for-empirical-research-scikit-learn && rm -rf "$T"
```

Skill file: `skills/17-DAAF-Contribution-Community-daaf/dot-claude/skills/scikit-learn/SKILL.md`
General-purpose machine learning with scikit-learn. Covers unsupervised methods (clustering, GMM, PCA, t-SNE, UMAP, manifold learning, evaluation metrics), supervised methods (classification, prediction-focused regression via Ridge/Lasso/ensemble methods, model evaluation, cross-validation), and shared infrastructure (preprocessing, Pipeline construction, feature selection). Use when performing cluster analysis, dimension reduction, classification, prediction-focused regression, or model evaluation in Python. For econometric regression (OLS, FE, IV, DiD), see pyfixest and statsmodels skills instead.
Comprehensive skill for machine learning in Python with scikit-learn. Covers unsupervised methods (clustering, decomposition, manifold learning), supervised methods (classification, regression), and shared infrastructure (preprocessing, pipelines, evaluation). Use decision trees below to find the right guidance, then load detailed references.
## What is scikit-learn?

scikit-learn is the standard general-purpose machine learning library for Python:

- Consistent API: every estimator follows `fit()` / `predict()` / `transform()`: learn once, apply everywhere
- Unsupervised methods: clustering (KMeans, DBSCAN, HDBSCAN, hierarchical), decomposition (PCA, NMF, SVD), mixture models (GMM), manifold learning (t-SNE)
- Supervised methods: classification (logistic regression, random forest, gradient boosting, SVM) and prediction-focused regression (Ridge, Lasso, ensemble methods)
- Model evaluation: cross-validation, grid search, metrics for both classification and clustering
- Pipelines: chain preprocessing and models into reproducible, leak-free workflows
## Version Notes

This skill targets scikit-learn 1.8.0. Notable changes in recent versions:

- HDBSCAN added as a first-class estimator (1.3+)
- `set_output(transform="pandas")` for DataFrame output from transformers (1.2+)
- HistGradientBoosting estimators are stable (1.0+)
- `n_init="auto"` default for KMeans (1.4+): uses 10 runs for `init="random"`, 1 for `init="k-means++"`
## How to Use This Skill

### Reference File Structure

Each topic in `./references/` contains focused documentation:

| File | Purpose | When to Read |
|---|---|---|
| quickstart.md | Import patterns, fit/predict/transform API, Pipeline, train_test_split | First use of scikit-learn |
| clustering.md | KMeans, AgglomerativeClustering, DBSCAN, HDBSCAN, SpectralClustering, OPTICS | Cluster analysis tasks |
| mixture-models.md | GaussianMixture, BayesianGaussianMixture, BIC/AIC model selection | Model-based clustering, soft assignments |
| decomposition.md | PCA, KernelPCA, TruncatedSVD, NMF, IncrementalPCA | Dimension reduction tasks |
| manifold.md | t-SNE, UMAP (umap-learn), Isomap, LLE, MDS, SpectralEmbedding | Visualizing high-dimensional data |
| evaluation-unsupervised.md | silhouette_score, Davies-Bouldin, Calinski-Harabasz, ARI, NMI, gap statistic | Validating cluster solutions |
| preprocessing.md | StandardScaler, encoders, ColumnTransformer, Pipeline construction | Preparing data for ML |
| classification.md | LogisticRegression, RandomForest, GradientBoosting, SVC, KNeighbors | Classification tasks |
| regression-ml.md | Ridge, Lasso, ElasticNet, tree/ensemble regressors, SVR | ML regression (prediction-focused) |
| evaluation-supervised.md | Accuracy, F1, ROC-AUC, confusion matrix, cross_val_score, GridSearchCV | Evaluating supervised models |
| | SelectKBest, RFE, permutation_importance, VarianceThreshold | Selecting informative features |
| gotchas.md | Data leakage, scaling errors, t-SNE misinterpretation, class imbalance | Avoiding common mistakes |
| interpretation.md | SHAP values (TreeExplainer, KernelExplainer), permutation importance visualization, partial dependence plots, ICE plots | After training a model, when interpretation or explanation is needed |
| fairness.md | fairlearn MetricFrame, ThresholdOptimizer, ExponentiatedGradient, demographic parity, equalized odds | Assessing or mitigating fairness of supervised models |
### Reading Order

- New to scikit-learn? Start with `quickstart.md`, then the task-specific reference
- Clustering task? Read `clustering.md`, then `evaluation-unsupervised.md`
- Classification task? Read `classification.md`, then `evaluation-supervised.md`
- Need preprocessing? Read `preprocessing.md` (covers Pipeline construction)
- Having issues? Check `gotchas.md` first
- Interpretation task? Read `interpretation.md`, then check `supervised-ml.md` in the data-scientist skill for methodology
- Fairness assessment? Read `fairness.md`, then check `supervised-ml.md` in the data-scientist skill for conceptual framework
## Related Skills

| Skill | Relationship |
|---|---|
| data-scientist | Methodology guidance: load for the "when and why" behind unsupervised methods |
| pyfixest | Econometric regression: OLS with fixed effects, IV, DiD, clustered SEs, hypothesis testing |
| statsmodels | Statistical modeling: OLS without FE, GLM, time series, diagnostic tests |
| polars | Data preparation before ML (convert to pandas/numpy before passing to scikit-learn) |
| geopandas | Spatial analysis: use geopandas for geographic data, not scikit-learn |
| | Custom visualization beyond scikit-learn's built-in plotting |
| data-scientist | Load for supervised ML methodology: the "when and why" behind prediction, interpretation, and fairness |
Routing guidance:

- For econometric regression (hypothesis testing, standard errors, coefficient interpretation), use `pyfixest` or `statsmodels`, not scikit-learn
- For unsupervised methodology (when to cluster, how to validate, what to report), read `exploratory-unsupervised.md` in the data-scientist skill
- For spatial analysis, use `geopandas`
- For data manipulation, use `polars`
## Quick Decision Trees

### "I need to group observations" (Unsupervised)

```text
What kind of data and clusters?
├─ Continuous data, roughly spherical clusters
│  ├─ Know k → KMeans (./references/clustering.md)
│  └─ Don't know k → try multiple k + silhouette/gap
│     (./references/clustering.md + ./references/evaluation-unsupervised.md)
├─ Continuous data, arbitrary shapes
│  ├─ Dense clusters, possible noise → DBSCAN or HDBSCAN (./references/clustering.md)
│  └─ Need soft assignments → GaussianMixture (./references/mixture-models.md)
├─ Need hierarchy / dendrogram → AgglomerativeClustering (./references/clustering.md)
├─ Mixed data types → Gower distance workaround (./references/gotchas.md)
└─ Need probabilistic model comparison → GaussianMixture with BIC (./references/mixture-models.md)
```
### "I need to reduce dimensions" (Unsupervised)

```text
What is the goal?
├─ Linear reduction for subsequent analysis → PCA (./references/decomposition.md)
├─ Large sparse data → TruncatedSVD (./references/decomposition.md)
├─ Non-negative components → NMF (./references/decomposition.md)
├─ Visualization of structure → t-SNE or UMAP (./references/manifold.md)
│  └─ CAUTION: visualization only, not for analysis
│     (see data-scientist exploratory-unsupervised.md for methodology)
├─ Nonlinear manifold learning → Isomap or LLE (./references/manifold.md)
└─ Correspondence analysis (CA, MCA) → use the prince library
```
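A minimal sketch of the PCA branch: scale first (PCA is variance-based, so unscaled features dominate), then reduce; the synthetic data and `n_components=3` are illustrative choices:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 0] = X[:, 1] * 2 + rng.normal(scale=0.1, size=200)  # one correlated pair

# Scaling inside the pipeline keeps the workflow leak-free and reusable
pipe = make_pipeline(StandardScaler(), PCA(n_components=3))
X_reduced = pipe.fit_transform(X)

pca = pipe.named_steps["pca"]
print(X_reduced.shape)
print(round(pca.explained_variance_ratio_.sum(), 3))  # share of variance retained
```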
### "I need to predict a categorical outcome" (Supervised)

```text
What constraints?
├─ Interpretable model needed → LogisticRegression or DecisionTreeClassifier
│  (./references/classification.md)
├─ Best predictive performance → GradientBoostingClassifier or RandomForestClassifier
│  (./references/classification.md)
├─ High-dimensional sparse data → LogisticRegression with penalty
│  (./references/classification.md)
├─ Small dataset, few features → KNeighborsClassifier or SVC
│  (./references/classification.md)
└─ Need probability estimates → any classifier with predict_proba()
   (./references/classification.md)
```
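As a sketch of the probability-estimates branch, any classifier exposing `predict_proba()` works; here a RandomForestClassifier on synthetic data (all parameters illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

proba = clf.predict_proba(X_test)  # one column of probabilities per class
acc = clf.score(X_test, y_test)    # held-out accuracy
print(proba.shape, round(acc, 2))
```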
### "I need to predict a continuous outcome" (Supervised)

```text
What kind of regression?
├─ NOTE: For econometric regression (hypothesis testing, standard errors,
│  coefficient interpretation), use pyfixest or statsmodels instead
├─ Prediction-focused, nonlinear → GradientBoostingRegressor or RandomForestRegressor
│  (./references/regression-ml.md)
├─ High-dimensional with regularization → Lasso, Ridge, or ElasticNet
│  (./references/regression-ml.md)
├─ Nonlinear relationships → GradientBoostingRegressor or SVR
│  (./references/regression-ml.md)
└─ Simple baseline → Ridge (./references/regression-ml.md)
```
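The "simple baseline" advice can be sketched by cross-validating Ridge against a gradient-boosted ensemble; on this synthetic (linear) data the baseline should already do well. Data and parameters are illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Try the simple baseline first, then a nonlinear ensemble
results = {}
for model in (Ridge(alpha=1.0), GradientBoostingRegressor(random_state=0)):
    name = type(model).__name__
    results[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(name, round(results[name], 3))
```

If the ensemble does not clearly beat the baseline under cross-validation, prefer the simpler, more interpretable model.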
### "I need to evaluate a model"

```text
What kind of evaluation?
├─ Unsupervised (no ground truth)
│  ├─ Cluster quality → silhouette_score, Davies-Bouldin
│  │  (./references/evaluation-unsupervised.md)
│  ├─ Stability → bootstrap + compare across resamples
│  │  (./references/evaluation-unsupervised.md)
│  └─ Against known labels → ARI, NMI
│     (./references/evaluation-unsupervised.md)
├─ Supervised classification
│  ├─ Balanced classes → accuracy + F1 (./references/evaluation-supervised.md)
│  ├─ Imbalanced classes → precision, recall, ROC-AUC
│  │  (./references/evaluation-supervised.md)
│  └─ Model selection → cross_val_score or GridSearchCV
│     (./references/evaluation-supervised.md)
└─ Supervised regression
   ├─ R-squared, RMSE, MAE (./references/evaluation-supervised.md)
   └─ Model selection → cross_val_score or GridSearchCV
      (./references/evaluation-supervised.md)
```
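A minimal model-selection sketch with GridSearchCV, using an illustrative `C` grid on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 5-fold CV over a small regularization grid, scored by F1
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1",
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

`grid.best_estimator_` is refit on all of `X` and ready for prediction; report the cross-validated score, not the training score.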
### "I need to interpret or explain a model"

```text
What kind of interpretation?
├─ Feature importance (global) → SHAP beeswarm/bar or permutation importance
│  (./references/interpretation.md)
├─ Single prediction explanation → SHAP waterfall or force plot
│  (./references/interpretation.md)
├─ Feature effect visualization → PDP or SHAP dependence plot
│  (./references/interpretation.md)
├─ Fairness across demographic groups → MetricFrame
│  (./references/fairness.md)
└─ CAUTION: feature importance ≠ causal importance
   (see data-scientist supervised-ml.md for methodology)
```
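A sketch of the global-importance branch using scikit-learn's built-in `permutation_importance` (SHAP requires the separate `shap` package). Data and parameters are illustrative, and the importances are predictive, not causal:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Held-out permutation importance: mean drop in score when a feature is shuffled
result = permutation_importance(model, X_test, y_test, n_repeats=10,
                                random_state=0)
ranking = result.importances_mean.argsort()[::-1]
print(ranking[:3])  # indices of the three highest-importance features
```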
## File-First Execution in Research Workflows

Important: In data research pipelines (see `CLAUDE.md`), scikit-learn analyses are executed through script files, not interactively. This ensures auditability and reproducibility.

The pattern:

- Write ML code to `scripts/stage8_analysis/{step}_{task-name}.py`
- Execute via Bash with the automatic output-capture wrapper script
- Validation results are automatically embedded in scripts as comments
- If a script fails, create a versioned copy for the fixes

Closely read `agent_reference/SCRIPT_EXECUTION_REFERENCE.md` for the mandatory file-first execution protocol covering complete code-file writing, output capture, and file versioning rules. All ML scripts must follow the Inline Audit Trail (IAT) standard; see `agent_reference/INLINE_AUDIT_TRAIL.md`. For ML code, document the model selection rationale (why this algorithm, why these hyperparameters, what assumptions) with `# INTENT:`, `# REASONING:`, and `# ASSUMES:` comments.
See:

- `agent_reference/WORKFLOW_PHASE4_ANALYSIS.md`: Stage 8 (Analysis & Visualization)
- `agent_reference/INLINE_AUDIT_TRAIL.md`: IAT documentation standard

The examples below show scikit-learn syntax. In research workflows, wrap them in scripts following the file-first pattern.
## Quick Reference

### Essential Imports

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
```
### The fit/predict/transform Pattern

```python
# Supervised: fit + predict
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Unsupervised: fit + transform (or fit_transform)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Clustering: fit + labels_
kmeans.fit(X)
labels = kmeans.labels_
```
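The scaler above avoids leakage only if `transform()` is applied separately to test data; wrapping the steps in a Pipeline enforces this automatically under cross-validation. A minimal sketch with illustrative data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is re-fit inside each CV training fold, so no test data leaks in
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(round(scores.mean(), 3))
```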
### Common Operations

| Operation | Code |
|---|---|
| Train-test split | `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)` |
| Scale features | `StandardScaler().fit_transform(X)` |
| Build pipeline | `make_pipeline(StandardScaler(), LogisticRegression())` |
| Cross-validate | `cross_val_score(model, X, y, cv=5)` |
| Grid search | `GridSearchCV(model, param_grid, cv=5).fit(X, y)` |
| KMeans clustering | `KMeans(n_clusters=3, random_state=42).fit_predict(X)` |
| PCA | `PCA(n_components=2).fit_transform(X)` |
| Logistic regression | `LogisticRegression(max_iter=1000).fit(X_train, y_train)` |
| Random forest | `RandomForestClassifier(random_state=42).fit(X_train, y_train)` |
| Gradient boosting | `GradientBoostingClassifier(random_state=42).fit(X_train, y_train)` |
| Classification report | `classification_report(y_test, y_pred)` |
| Confusion matrix | `confusion_matrix(y_test, y_pred)` |
| Silhouette score | `silhouette_score(X, labels)` |
| Feature importance | `model.feature_importances_` |
| Permutation importance | `permutation_importance(model, X_test, y_test, n_repeats=10)` |
| Set output format | `scaler.set_output(transform="pandas")` |
## Topic Index

| Topic | Reference File |
|---|---|
| Installation and imports | quickstart.md |
| fit/predict/transform API | quickstart.md |
| Pipeline construction | quickstart.md, preprocessing.md |
| Train-test split | quickstart.md |
| Reproducibility (random_state) | quickstart.md |
| KMeans, MiniBatchKMeans | clustering.md |
| AgglomerativeClustering | clustering.md |
| DBSCAN, HDBSCAN, OPTICS | clustering.md |
| SpectralClustering | clustering.md |
| GaussianMixture | mixture-models.md |
| BayesianGaussianMixture | mixture-models.md |
| BIC/AIC model selection | mixture-models.md |
| Soft cluster assignments | mixture-models.md |
| PCA, KernelPCA | decomposition.md |
| TruncatedSVD (sparse data) | decomposition.md |
| NMF | decomposition.md |
| IncrementalPCA | decomposition.md |
| t-SNE | manifold.md |
| UMAP (umap-learn) | manifold.md |
| Isomap, LLE, MDS | manifold.md |
| silhouette_score | evaluation-unsupervised.md |
| Davies-Bouldin, Calinski-Harabasz | evaluation-unsupervised.md |
| Adjusted Rand Index, NMI | evaluation-unsupervised.md |
| Gap statistic | evaluation-unsupervised.md |
| StandardScaler, MinMaxScaler | preprocessing.md |
| OneHotEncoder, OrdinalEncoder | preprocessing.md |
| ColumnTransformer | preprocessing.md |
| Pipeline, make_pipeline | preprocessing.md |
| LogisticRegression | classification.md |
| DecisionTreeClassifier | classification.md |
| RandomForestClassifier | classification.md |
| GradientBoostingClassifier | classification.md |
| SVC, KNeighborsClassifier | classification.md |
| Ridge, Lasso, ElasticNet | regression-ml.md |
| RandomForestRegressor | regression-ml.md |
| GradientBoostingRegressor | regression-ml.md |
| SVR, KNeighborsRegressor | regression-ml.md |
| accuracy, precision, recall, F1 | evaluation-supervised.md |
| ROC-AUC, confusion matrix | evaluation-supervised.md |
| cross_val_score, GridSearchCV | evaluation-supervised.md |
| learning_curve | evaluation-supervised.md |
| SelectKBest, RFE | |
| feature_importances_ | |
| permutation_importance | |
| Data leakage | gotchas.md |
| Scaling for distance-based methods | gotchas.md |
| t-SNE/UMAP distance interpretation | gotchas.md |
| Class imbalance | gotchas.md |
| random_state reproducibility | gotchas.md |
| SHAP values (TreeExplainer, KernelExplainer) | interpretation.md |
| Permutation importance visualization | interpretation.md |
| Partial dependence plots (PDP) | interpretation.md |
| ICE plots | interpretation.md |
| Model interpretation caveats | interpretation.md |
| fairlearn MetricFrame | fairness.md |
| ThresholdOptimizer | fairness.md |
| ExponentiatedGradient | fairness.md |
| Demographic parity | fairness.md |
| Equalized odds | fairness.md |
| LightGBM (LGBMClassifier, LGBMRegressor) | classification.md, regression-ml.md |
| XGBoost (XGBClassifier, XGBRegressor) | classification.md, regression-ml.md |
## Citation

When this library is used as a primary analytical tool, include it in the report's Software & Tools references:

> Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python." Journal of Machine Learning Research, 12, 2825-2830.

Cite when: scikit-learn is used for machine learning models, clustering, dimensionality reduction, or cross-validation central to the analysis.

Do not cite when: it is only used for a single preprocessing step (e.g., StandardScaler in a pipeline where the primary model is from another library).

For method-specific citations (e.g., individual algorithms or techniques), consult the reference files in this skill and `agent_reference/CITATION_REFERENCE.md`.