Awesome-omni-skill scikit-learn
The industry standard library for machine learning in Python. Provides simple and efficient tools for predictive data analysis, covering classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
install
source · Clone the upstream repo
git clone https://github.com/diegosouzapw/awesome-omni-skill
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/diegosouzapw/awesome-omni-skill "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/machine-learning/scikit-learn" ~/.claude/skills/diegosouzapw-awesome-omni-skill-scikit-learn && rm -rf "$T"
manifest: skills/machine-learning/scikit-learn/SKILL.md
source content
scikit-learn - Machine Learning in Python
A robust library for classical machine learning. It features a uniform API: all objects share the same interface for fitting, transforming, and predicting.
When to Use
- Classification: Detecting categories (Spam vs. Ham, Disease diagnosis).
- Regression: Predicting continuous values (House prices, Stock trends).
- Clustering: Grouping similar objects (Market segmentation, Image compression).
- Dimensionality Reduction: Reducing feature count while keeping info (PCA, Visualization).
- Model Selection: Comparing models and tuning hyperparameters (Cross-validation, Grid search).
- Preprocessing: Transforming raw data into features (Scaling, Encoding, Imputation).
Reference Documentation
Official docs: https://scikit-learn.org/stable/
User Guide: https://scikit-learn.org/stable/user_guide.html
Search patterns:
sklearn.pipeline.Pipeline, sklearn.model_selection, sklearn.ensemble, sklearn.preprocessing
Core Principles
The "Estimator" Interface
- Estimators: Implement fit(X, y). They learn from data.
- Transformers: Implement transform(X) (and fit_transform(X)). They modify data.
- Predictors: Implement predict(X). They provide estimates for new data.
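The three roles above can be sketched on a tiny synthetic array. The data values and variable names are made up for illustration; only the fit / transform / predict calls come from the interface described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny synthetic dataset (illustrative values only)
X = np.array([[1.0, 200.0], [2.0, 180.0], [3.0, 220.0], [4.0, 260.0]])
y = np.array([0, 0, 1, 1])

# Transformer: fit() learns per-column mean/std, transform() applies them
scaler = StandardScaler()
X_scaled = scaler.fit(X).transform(X)  # equivalent to scaler.fit_transform(X)

# Predictor: fit() learns coefficients, predict() returns labels for given rows
clf = LogisticRegression()
clf.fit(X_scaled, y)
labels = clf.predict(X_scaled)

print(X_scaled.mean(axis=0))  # close to zero for every column
print(labels.shape)
```

Because every object follows the same method names, swapping StandardScaler for another transformer or LogisticRegression for another classifier should not require changing the surrounding code.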
Use scikit-learn For
- Tabular data (Excel-like, CSVs).
- Traditional ML (Random Forests, SVMs, Linear Models).
- Feature engineering and pipeline automation.
- Small to medium-sized datasets.
Do NOT Use For
- Deep Learning / Neural Networks (use PyTorch or TensorFlow).
- Natural Language Processing at scale (use spaCy or HuggingFace).
- Large-scale "Big Data" (use Spark MLlib or Dask-ML).
- Real-time streaming predictions (consider specialized inference engines).
Quick Reference
Installation
pip install scikit-learn
Standard Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report, mean_squared_error
Basic Pattern - Train/Predict
from sklearn.ensemble import RandomForestClassifier

# 1. Prepare data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 2. Instantiate and fit
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# 3. Predict and evaluate
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
Critical Rules
✅ DO
- Split before anything - Always use train_test_split before looking at data properties.
- Use Pipelines - Combine preprocessing and modeling to prevent data leakage.
- Scale your data - Models like SVM, KNN, and regularized linear models (Ridge, Lasso, LogisticRegression) are sensitive to feature scale.
- Check for Imbalance - Use stratify=y in train_test_split for classification.
- Cross-Validate - Don't trust a single train/test split; use cross_val_score.
- Handle Missing Values - Use SimpleImputer or similar before fitting models.
- Standardize Categories - Use OneHotEncoder for nominal or OrdinalEncoder for ordinal data.
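Several of these rules can be combined in one short sketch: a stratified split, imputation and scaling inside a Pipeline, and cross-validation on the training portion only. The synthetic dataset, the 5% missing-value rate, and the step names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, mildly imbalanced data with some values knocked out (illustrative)
X, y = make_classification(n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Split first, keeping the class ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Imputer and scaler are refit on each training fold, so nothing leaks
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="f1_macro")
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print(f"CV F1 (macro): {scores.mean():.3f}, held-out accuracy: {test_accuracy:.3f}")
```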
❌ DON'T
- Fit on test data - Never call .fit() or .fit_transform() on the test set.
- Use Categorical data as-is - Scikit-learn requires numerical input; encode strings first.
- Ignore Class Imbalance - Accuracy is misleading for imbalanced datasets; use F1-score or AUC.
- Overfit - Don't keep tuning hyperparameters until the test score is perfect.
- Ignore Random State - Set random_state for reproducibility during experiments.
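To see why accuracy misleads on imbalanced data, consider a baseline that always predicts the majority class. The 95/5 class split below is an assumed toy example, and DummyClassifier stands in for a model that has learned nothing.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# 95 negatives and 5 positives; the features carry no information
y = np.array([0] * 95 + [1] * 5)
X = np.zeros((100, 1))

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

acc = accuracy_score(y, y_pred)            # high, despite learning nothing
f1 = f1_score(y, y_pred, zero_division=0)  # F1 for the positive class

print(f"accuracy={acc:.2f}, f1={f1:.2f}")
```

The baseline never finds a single positive case, which the F1-score for the minority class exposes and accuracy hides.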
Anti-Patterns (NEVER)
# ❌ BAD: Data Leakage (Fitting scaler on the whole dataset)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Data from "future" test set leaks into training!
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)

# ✅ GOOD: Fit scaler only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Use training mean/std

# ❌ BAD: Repeating preprocessing manually
# (Error-prone and hard to maintain)

# ✅ GOOD: Use Pipelines (Automates everything safely)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', RandomForestClassifier())
])
pipe.fit(X_train, y_train)
Preprocessing (sklearn.preprocessing)
Scaling and Encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# Scaling numerical data
scaler = StandardScaler()
X_num_scaled = scaler.fit_transform(X_numeric)

# Encoding categorical data
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_cat_encoded = encoder.fit_transform(X_categorical)

# Handling missing values
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
X_imputed = imputer.fit_transform(X_with_nan)
Column Transformer (The Pro Way)
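from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['age', 'salary']
categorical_features = ['city', 'job_type']

# Apply a different transformer to each group of columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(), categorical_features)
    ])

# Now use this in a pipeline
pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', LogisticRegression())
])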
Classification
Common Algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier

# Logistic Regression (Baseline)
log_reg = LogisticRegression(max_iter=1000)

# Support Vector Machine
svm = SVC(kernel='rbf', probability=True)

# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
Regression
Common Algorithms
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor

# Regularized Linear Models
ridge = Ridge(alpha=1.0)  # L2
lasso = Lasso(alpha=0.1)  # L1

# Non-linear Regression
rf_reg = RandomForestRegressor(n_estimators=100, max_depth=10)
Model Evaluation
Metrics
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, r2_score, mean_absolute_error

# Classification
acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average='weighted')

# Regression
r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
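The metrics import above includes roc_auc_score, which differs from the others in that it expects scores or probabilities rather than hard labels. A minimal sketch for the binary case, assuming a synthetic dataset and a logistic regression model chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class
y_score = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_score)
print(f"ROC AUC: {auc:.3f}")
```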
Cross-Validation
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X, y, cv=5, scoring='f1_macro')
print(f"Mean F1: {scores.mean():.4f} (+/- {scores.std():.4f})")
Hyperparameter Tuning
Grid Search and Randomized Search
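from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# The 'clf__' prefix targets the pipeline step named 'clf', so that step
# must accept these parameters (a random forest here, not LogisticRegression)
rf_pipeline = Pipeline([
    ('prep', preprocessor),
    ('clf', RandomForestClassifier(random_state=42))
])

param_grid = {
    'clf__n_estimators': [50, 100, 200],
    'clf__max_depth': [None, 10, 20],
    'clf__min_samples_split': [2, 5]
}

grid_search = GridSearchCV(rf_pipeline, param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
best_model = grid_search.best_estimator_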
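The heading also mentions randomized search, which samples a fixed number of parameter combinations instead of trying all of them. A minimal sketch with RandomizedSearchCV follows; the synthetic data, the parameter ranges, and n_iter=5 are assumptions chosen to keep the example fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data (illustrative)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Distributions are sampled; lists are sampled uniformly
param_distributions = {
    "n_estimators": randint(20, 60),
    "max_depth": [None, 5, 10],
    "min_samples_split": randint(2, 6),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,          # number of sampled combinations
    cv=3,
    scoring="accuracy",
    random_state=0,    # makes the sampling reproducible
)
search.fit(X, y)
print(f"Best params: {search.best_params_}")
```

Randomized search tends to be the more practical choice when the grid would otherwise contain hundreds of combinations, since the budget is set directly through n_iter.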
Dimensionality Reduction
PCA (Principal Component Analysis)
from sklearn.decomposition import PCA

# Reduce to 2 components for visualization
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
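When the goal is compression rather than a 2D plot, PCA also accepts a float between 0 and 1 for n_components and keeps as many components as needed to reach that share of variance. A small sketch on the bundled iris dataset; the 0.95 threshold is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

# Keep the smallest number of components explaining at least 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"Components kept: {pca.n_components_}")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.3f}")
```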
Clustering
K-Means and DBSCAN
from sklearn.cluster import KMeans, DBSCAN

# K-Means (Requires specifying K)
kmeans = KMeans(n_clusters=3, n_init='auto')
kmeans_labels = kmeans.fit_predict(X)

# DBSCAN (Density-based, finds K automatically; noise points are labeled -1)
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)
Practical Workflows
1. End-to-End Classification Pipeline
def build_and_train_model(X, y):
    # 1. Identify types
    num_cols = X.select_dtypes(include=['int64', 'float64']).columns
    cat_cols = X.select_dtypes(include=['object', 'category']).columns

    # 2. Setup Preprocessing
    preprocessor = ColumnTransformer([
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
    ])

    # 3. Create Pipeline
    clf = Pipeline([
        ('preprocessor', preprocessor),
        ('classifier', RandomForestClassifier(random_state=42))
    ])

    # 4. Train
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
    clf.fit(X_train, y_train)

    return clf, X_test, y_test

# model, X_test, y_test = build_and_train_model(df.drop('target', axis=1), df['target'])
2. Custom Feature Engineering (Transformer)
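import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class LogTransformer(BaseEstimator, TransformerMixin):
    """Apply log1p to the selected columns of a DataFrame."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn from the data
        return self

    def transform(self, X):
        X_copy = X.copy()
        # Fall back to every column when none were specified
        columns = X_copy.columns if self.columns is None else self.columns
        for col in columns:
            X_copy[col] = np.log1p(X_copy[col])
        return X_copy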
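For a stateless transformation like this one, scikit-learn's FunctionTransformer can serve as a shorter alternative to writing a class. This is a different technique from the custom transformer above, sketched here on a small NumPy array with made-up values.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Illustrative skewed values
X = np.array([[0.0], [9.0], [99.0]])

# Wrap plain functions as a transformer; inverse_func enables inverse_transform
log_tf = FunctionTransformer(np.log1p, inverse_func=np.expm1)

logged = log_tf.fit_transform(X)
restored = log_tf.inverse_transform(logged)

print(logged.ravel())
print(restored.ravel())
```

A custom class remains the better fit when the transformer has to learn something in fit() or needs column selection logic of its own.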
Performance Optimization
Using n_jobs
# Use all CPU cores for training/tuning
model = RandomForestClassifier(n_jobs=-1)
grid = GridSearchCV(model, param_grid, n_jobs=-1)
Working with Large Data (partial_fit)
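from sklearn.linear_model import SGDClassifier

# Online learning (incremental fit)
model = SGDClassifier()

# All class labels must be known on the first partial_fit call, so compute them once
classes = np.unique(y_all)

for X_chunk, y_chunk in data_stream:
    model.partial_fit(X_chunk, y_chunk, classes=classes)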
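The snippet above leaves data_stream and y_all undefined. A self-contained sketch of the same idea, assuming the "stream" is simply an in-memory synthetic array sliced into chunks of 200 rows:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data standing in for a dataset too large to fit at once
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
classes = np.unique(y)  # full label set, needed by the first partial_fit call

model = SGDClassifier(random_state=0)
chunk_size = 200
for start in range(0, len(X), chunk_size):
    X_chunk = X[start:start + chunk_size]
    y_chunk = y[start:start + chunk_size]
    # Each call updates the existing weights instead of refitting from scratch
    model.partial_fit(X_chunk, y_chunk, classes=classes)

accuracy = model.score(X, y)
print(f"Accuracy after 5 chunks: {accuracy:.3f}")
```

In a real setting the chunks would more likely come from a file reader or database cursor, and features would need to be scaled consistently across chunks.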
Common Pitfalls and Solutions
Imbalanced Classes
# ❌ Problem: Model predicts only the majority class
# ✅ Solution: Adjust class weights
model = RandomForestClassifier(class_weight='balanced')
# OR use SMOTE from imbalanced-learn library
Convergence Warnings
# ❌ Problem: "ConvergenceWarning: Liblinear failed to converge"
# ✅ Solution: Increase max_iter or scale data
model = LogisticRegression(max_iter=2000)
# Often solved by applying StandardScaler first!
Categorical Values in Test Set not in Train
# ❌ Problem: ValueError when unseen categories appear in test
# ✅ Solution: Use handle_unknown in OneHotEncoder
encoder = OneHotEncoder(handle_unknown='ignore')
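What handle_unknown='ignore' does in practice can be seen on a toy column: a category that was absent during fit is encoded as a row of zeros instead of raising an error. The city names below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["paris"], ["tokyo"], ["paris"]])
test = np.array([["tokyo"], ["lima"]])  # "lima" never appears in training

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Output is sparse by default; convert to a dense array for display
encoded = encoder.transform(test).toarray()

print(encoder.categories_)
print(encoded)
```

The all-zero row means the model treats an unseen category as "none of the known ones", which is usually acceptable but worth monitoring if unseen values become frequent.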
Scikit-learn is the backbone of Python ML. Its API is so successful that many other libraries (XGBoost, LightGBM) mimic it.