Skills scikit-learn
install
source · Clone the upstream repo
git clone https://github.com/TerminalSkills/skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/TerminalSkills/skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/scikit-learn" ~/.claude/skills/terminalskills-skills-scikit-learn && rm -rf "$T"
manifest:
skills/scikit-learn/SKILL.md
scikit-learn
Overview
Scikit-learn is a Python machine learning library that provides a consistent API for the full ML workflow: data preprocessing (scaling, encoding, imputation), model selection (classification, regression, clustering), hyperparameter tuning (grid search, randomized search), cross-validation, and pipeline construction. It supports serialization via joblib for production deployment.
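That consistent API can be shown in a few lines. This is a minimal sketch, not part of the skill itself, using the bundled breast-cancer dataset as a stand-in for real data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Preprocessing and model chained into one estimator, evaluated with 5-fold CV
X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())  # mean accuracy across the 5 folds
```

The same `fit`/`predict`/`transform` interface applies across preprocessing, models, and tuning objects, which is what makes pipelines composable.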
Instructions
- When preprocessing data, use `ColumnTransformer` to apply different transformers to numeric and categorical columns (`StandardScaler`, `OneHotEncoder`, `SimpleImputer`), always within a `Pipeline` to prevent data leakage.
- When choosing models, start with fast baselines (`LogisticRegression`, `RandomForest`) and use `HistGradientBoostingClassifier` for best tabular performance, since it handles missing values natively and is faster than `GradientBoostingClassifier`.
- When evaluating, use `cross_val_score` with 5-fold CV instead of single train/test splits, and use `classification_report()` instead of accuracy alone, since accuracy is misleading on imbalanced datasets.
- When tuning hyperparameters, use `RandomizedSearchCV` when the search space exceeds 100 combinations (faster than exhaustive `GridSearchCV`), and use `StratifiedKFold` or `TimeSeriesSplit` as appropriate.
- When building pipelines, chain preprocessing and model steps with `Pipeline` to ensure transformers fit only on training data, then serialize the full pipeline with `joblib.dump()` for deployment.
- When selecting features, use `permutation_importance()` for model-agnostic measurement, `SelectKBest` for statistical filtering, or `feature_importances_` from tree-based models.
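The first instruction (mixed-type preprocessing inside a pipeline) can be sketched as follows; the toy DataFrame and column names are illustrative assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with a missing numeric value and a categorical column
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41],
    "income": [40_000, 52_000, 61_000, 75_000],
    "plan": ["basic", "pro", "basic", "enterprise"],
})

numeric = ["age", "income"]
categorical = ["plan"]

# Impute + scale the numeric columns; one-hot encode the categorical one
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X = preprocess.fit_transform(df)
print(X.shape)  # 4 rows; 2 scaled numeric columns + 3 one-hot columns
```

In practice this `preprocess` object becomes the first step of a `Pipeline` ending in a model, so the imputer, scaler, and encoder are fitted only on training folds.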
Examples
Example 1: Build a customer churn prediction pipeline
User request: "Create a model to predict which customers will churn"
Actions:
- Build a `ColumnTransformer` with `StandardScaler` for numeric features and `OneHotEncoder` for categorical
- Create a `Pipeline` with the transformer and `HistGradientBoostingClassifier`
- Tune hyperparameters with `RandomizedSearchCV` using `StratifiedKFold`
- Evaluate with `classification_report()` focusing on recall for the churn class
Output: A tuned churn prediction pipeline with preprocessing, model, and evaluation metrics.
Example 2: Cluster customers into segments
User request: "Segment customers based on purchasing behavior"
Actions:
- Preprocess features with `StandardScaler` in a pipeline
- Use `KMeans` with silhouette score analysis to determine optimal cluster count
- Run `PCA` for dimensionality reduction and visualization
- Profile clusters with `groupby` on original features to interpret segments
Output: Customer segments with labeled profiles and a visual cluster map.
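A minimal sketch of the segmentation steps, using synthetic blobs in place of real purchasing-behavior features:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic 5-feature data with 3 latent segments
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Choose k by silhouette score over a small candidate range
scores = {
    k: silhouette_score(
        X_scaled,
        KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled))
    for k in range(2, 6)
}
best_k = max(scores, key=scores.get)
labels = KMeans(n_clusters=best_k, n_init=10,
                random_state=0).fit_predict(X_scaled)

# 2-D projection for the cluster map
coords = PCA(n_components=2).fit_transform(X_scaled)
print(best_k, coords.shape)
```

Profiling would then attach `labels` back to the original DataFrame and run `groupby(labels).mean()` over the unscaled features.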
Guidelines
- Always use `Pipeline` to prevent data leakage by fitting transformers only on training data.
- Use `ColumnTransformer` for mixed data types: numeric scaling and categorical encoding in one object.
- Use `HistGradientBoostingClassifier` over `GradientBoostingClassifier` since it is faster and handles missing values natively.
- Use `cross_val_score` with 5-fold CV rather than a single train/test split, since single splits are noisy.
- Use `RandomizedSearchCV` when the search space exceeds 100 combinations.
- Use `classification_report()`, not just accuracy, which is misleading on imbalanced datasets.
- Serialize the full pipeline with `joblib`, not just the model, since deployment needs preprocessing too.
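The last guideline can be sketched as follows; the iris dataset and file name are illustrative stand-ins:

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

# Persist the whole pipeline (scaler + model), not just the estimator,
# so the service loading it applies identical preprocessing
path = os.path.join(tempfile.mkdtemp(), "model_pipeline.joblib")
joblib.dump(pipe, path)

restored = joblib.load(path)
assert (restored.predict(X) == pipe.predict(X)).all()
```

Loading only the fitted model and re-scaling by hand at serving time is a common source of train/serve skew; serializing the pipeline removes that failure mode.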