SciAgent-Skills scikit-learn-machine-learning
Classical machine learning in Python. Use for classification, regression, clustering, dimensionality reduction, model evaluation, hyperparameter tuning, and preprocessing pipelines. Covers linear models, tree ensembles, SVMs, K-Means, PCA, t-SNE. For deep learning use PyTorch/TensorFlow; for gradient boosting at scale use XGBoost/LightGBM.
```bash
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/biostatistics/scikit-learn-machine-learning" ~/.claude/skills/jaechang-hits-sciagent-skills-scikit-learn-machine-learning && rm -rf "$T"
```
skills/biostatistics/scikit-learn-machine-learning/SKILL.md

scikit-learn
Overview
scikit-learn is the standard Python library for classical machine learning. It provides consistent APIs for supervised learning (classification, regression), unsupervised learning (clustering, dimensionality reduction), model evaluation, and preprocessing, with seamless integration into NumPy/pandas workflows.
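That consistent API means every estimator follows the same contract: transformers learn with `fit()` and apply with `transform()`, while predictors learn with `fit()` and apply with `predict()`/`score()`. A minimal sketch of that shared interface (toy data; the specific estimators are arbitrary choices):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([0, 0, 1, 1])

# Transformers: fit() learns statistics, transform() applies them
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Predictors: fit() learns parameters, predict()/score() use them
model = LogisticRegression().fit(X_scaled, y)
print(model.predict(X_scaled), model.score(X_scaled, y))
```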
When to Use
- Building classification models for labeled data (spam detection, disease diagnosis, species identification)
- Predicting continuous outcomes with regression (price prediction, dose-response modeling)
- Clustering unlabeled data into groups (patient stratification, gene expression clusters)
- Reducing dimensionality for visualization or feature engineering (PCA, t-SNE on multi-omics data)
- Evaluating and comparing model performance with cross-validation
- Tuning hyperparameters systematically (grid search, random search)
- Building reproducible ML pipelines with preprocessing and modeling steps
- For deep learning tasks (images, NLP), use `pytorch` or `transformers` instead
- For large-scale gradient boosting, use `xgboost` or `lightgbm` instead
Prerequisites
- Python packages: `scikit-learn`, `numpy`, `pandas`
- Optional: `matplotlib`, `seaborn` for visualization
- Data: Tabular data as NumPy arrays or pandas DataFrames
```bash
pip install scikit-learn numpy pandas matplotlib seaborn
```
Quick Start
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import load_breast_cancer

# Load dataset, split, train, evaluate in 10 lines
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred, target_names=["malignant", "benign"]))
```
Core API
Module 1: Data Preprocessing
Scaling, encoding, imputation, and feature engineering.
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
import numpy as np

# Scaling: zero mean, unit variance
X = np.array([[1, 2], [3, 4], [5, 6]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f"Mean: {X_scaled.mean(axis=0)}, Std: {X_scaled.std(axis=0)}")  # Mean: [0. 0.], Std: [1. 1.]

# Imputation: fill missing values
X_missing = np.array([[1, np.nan], [3, 4], [np.nan, 6]])
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X_missing)
print(f"Filled:\n{X_filled}")
```
```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

# One-hot encoding for nominal categories
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
X_cat = np.array([["red"], ["blue"], ["green"], ["red"]])
X_encoded = enc.fit_transform(X_cat)
print(f"Categories: {enc.categories_}")
print(f"Encoded shape: {X_encoded.shape}")  # (4, 3)
```
Module 2: Supervised Learning — Classification
Classifiers for discrete target prediction.
```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Compare classifiers
classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=200),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel="rbf", C=1.0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: accuracy = {clf.score(X_test, y_test):.3f}")
```
Module 3: Supervised Learning — Regression
Regressors for continuous target prediction.
```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X, y = make_regression(n_samples=200, n_features=10, noise=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=0.1),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    rmse = mean_squared_error(y_test, y_pred) ** 0.5  # version-agnostic RMSE (squared=False was removed in recent releases)
    print(f"{name}: RMSE={rmse:.2f}, R²={r2_score(y_test, y_pred):.3f}")
```
Module 4: Unsupervised Learning — Clustering
Clustering algorithms for unlabeled data.
```python
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=4, random_state=42)

# K-Means with elbow method
for k in [2, 3, 4, 5, 6]:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    sil = silhouette_score(X, labels)
    print(f"k={k}: silhouette={sil:.3f}, inertia={km.inertia_:.1f}")
```
```python
# DBSCAN — no need to specify k
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5)
labels = db.fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = (labels == -1).sum()
print(f"DBSCAN: {n_clusters} clusters, {n_noise} noise points")
```
Module 5: Dimensionality Reduction
PCA, t-SNE, and other methods for visualization and feature reduction.
```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
print(f"Original shape: {X.shape}")  # (1797, 64)

# PCA — preserve 95% variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print(f"PCA: {X_pca.shape[1]} components, explained variance: {pca.explained_variance_ratio_.sum():.3f}")

# t-SNE — 2D visualization
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)
print(f"t-SNE shape: {X_tsne.shape}")  # (1797, 2)
```
Module 6: Model Evaluation & Selection
Cross-validation, metrics, hyperparameter tuning.
```python
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Cross-validation
clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=StratifiedKFold(5), scoring="accuracy")
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```
```python
# Hyperparameter tuning with GridSearchCV
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5]
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best score: {grid.best_score_:.3f}")
```
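`classification_report` and `confusion_matrix` (imported above) are typically computed on held-out predictions. One way to get those is sketched below with `cross_val_predict`, a standard scikit-learn helper not shown in the snippets above:

```python
from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions for every sample, then the usual metrics
y_pred = cross_val_predict(clf, X, y, cv=5)
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred, target_names=load_iris().target_names))
```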
Module 7: Pipelines
Chain preprocessing and models; prevent data leakage.
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier

# Mixed-type preprocessing
numeric_features = ["age", "income"]
categorical_features = ["gender", "occupation"]

preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ]), numeric_features),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ]), categorical_features),
])

pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", GradientBoostingClassifier(random_state=42))
])

# pipe.fit(X_train, y_train); pipe.predict(X_test)
print("Pipeline steps:", [name for name, _ in pipe.steps])
```
Common Workflows
Workflow 1: End-to-End Classification
Goal: Complete classification workflow from data loading to evaluation.
```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.datasets import load_breast_cancer

# Load data
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Build pipeline
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=200, random_state=42))
])

# Cross-validate
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="f1")
print(f"CV F1: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

# Final evaluation
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
```
Workflow 2: Clustering with Visualization
Goal: Cluster data and visualize with dimensionality reduction.
```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Generate and scale data
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Cluster
km = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = km.fit_predict(X_scaled)
print(f"Silhouette: {silhouette_score(X_scaled, labels):.3f}")

# Visualize
X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="viridis", s=20, alpha=0.7)
plt.title("K-Means Clustering (PCA projection)")
plt.savefig("clustering_result.png", dpi=150, bbox_inches="tight")
print("Saved clustering_result.png")
```
Workflow 3: Feature Selection + Model Pipeline
Goal: Select best features and build a tuned model.
```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=50, n_informative=10, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("selector", SelectKBest(f_classif)),
    ("svm", SVC(kernel="rbf"))
])

param_grid = {
    "selector__k": [5, 10, 20],
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", "auto"]
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X, y)
print(f"Best params: {grid.best_params_}")
print(f"Best accuracy: {grid.best_score_:.3f}")
```
Key Parameters
| Parameter | Module | Default | Range / Options | Effect |
|---|---|---|---|---|
| `n_estimators` | RandomForest, GradientBoosting | 100 | - | Number of trees; higher = better but slower |
| `max_depth` | Tree-based models | `None` | int or `None` | Tree depth; `None` = no limit (can overfit) |
| `C` | SVM, LogisticRegression | 1.0 | - | Regularization strength (inverse); lower = more regularization |
| `alpha` | Ridge, Lasso | 1.0 | - | Regularization strength; higher = more regularization |
| `n_clusters` | KMeans | required | - | Number of clusters to form |
| `eps` | DBSCAN | 0.5 | - | Neighborhood radius; smaller = more clusters |
| `n_components` | PCA | required | int or float | Components to keep; float = variance ratio |
| `perplexity` | t-SNE | 30 | - | Balance local/global structure |
| `cv` | GridSearchCV | 5 | - | Cross-validation folds |
| `scoring` | GridSearchCV, cross_val_score | varies | `"accuracy"`, `"f1"`, `"roc_auc"`, etc. | Evaluation metric |
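Each of these parameters is simply a constructor argument on the corresponding estimator. A brief sketch (the values below are illustrative placeholders, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import Ridge
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA

rf = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
svm = SVC(C=0.5)                       # lower C = stronger regularization
ridge = Ridge(alpha=10.0)              # higher alpha = stronger regularization
km = KMeans(n_clusters=4, n_init=10, random_state=42)
db = DBSCAN(eps=0.3, min_samples=5)
pca = PCA(n_components=0.95)           # float = keep 95% of variance
```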
Common Recipes
Recipe: Feature Importance Analysis
When to use: Understanding which features drive model predictions.
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
feature_names = load_iris().feature_names
for i in range(X.shape[1]):
    print(f"{feature_names[indices[i]]}: {importances[indices[i]]:.4f}")
```
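Impurity-based importances can favor high-cardinality features. As an optional alternative not covered by the recipe above, `permutation_importance` from `sklearn.inspection` gives a model-agnostic estimate; a minimal sketch reusing `clf`, `X`, `y`, and `feature_names` from the code above:

```python
from sklearn.inspection import permutation_importance

# Permutation importance: drop in score when each feature is shuffled
result = permutation_importance(clf, X, y, n_repeats=10, random_state=42)
for idx in result.importances_mean.argsort()[::-1]:
    print(f"{feature_names[idx]}: {result.importances_mean[idx]:.4f} ± {result.importances_std[idx]:.4f}")
```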
Recipe: Learning Curve Diagnosis
When to use: Diagnosing overfitting vs underfitting.
```python
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np

# clf, X, y as defined in the feature importance recipe above
train_sizes, train_scores, val_scores = learning_curve(
    clf, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10), scoring="accuracy"
)
plt.plot(train_sizes, train_scores.mean(axis=1), label="Train")
plt.plot(train_sizes, val_scores.mean(axis=1), label="Validation")
plt.xlabel("Training size"); plt.ylabel("Accuracy"); plt.legend()
plt.savefig("learning_curve.png", dpi=150, bbox_inches="tight")
print("Saved learning_curve.png")
```
Recipe: Save and Load Models
When to use: Persisting trained models for later use.
```python
import joblib

# Save
joblib.dump(pipe, "model_pipeline.joblib")
print("Model saved to model_pipeline.joblib")

# Load
loaded_pipe = joblib.load("model_pipeline.joblib")
y_pred = loaded_pipe.predict(X_test)
print(f"Loaded model predictions: {y_pred[:5]}")
```
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| `ConvergenceWarning` | Model didn't converge | Increase `max_iter` (e.g., 1000) or scale features with `StandardScaler` |
| High train accuracy, low test accuracy | Overfitting | Add regularization, reduce `max_depth`, use cross-validation |
| `ValueError` on unseen categories | New categories in test data | Use `OneHotEncoder(handle_unknown="ignore")` |
| `MemoryError` with large data | Full dataset in memory | Use `partial_fit` / `SGDClassifier` for incremental learning |
| Poor clustering results | Unscaled features or wrong k | Scale features first; use silhouette score to find optimal k |
| `NotFittedError` | Predict before fit | Call `fit()` first |
| Different results each run | Missing `random_state` | Set `random_state` in model and `train_test_split` |
| Slow GridSearchCV | Large parameter grid | Use `RandomizedSearchCV` or a coarser grid; add `n_jobs=-1` |
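A few of these fixes sketched in code (the model and parameter grid are placeholders chosen only for illustration):

```python
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Reproducibility: fix random_state in both the split and the model
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
clf = RandomForestClassifier(random_state=42)

# Slow GridSearchCV: sample the grid randomly and parallelize
search = RandomizedSearchCV(
    clf,
    {"n_estimators": [50, 100, 200], "max_depth": [5, 10, None]},
    n_iter=4, cv=5, n_jobs=-1, random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)

# Unseen categories at predict time: tell the encoder to ignore them
enc = OneHotEncoder(handle_unknown="ignore")
```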
References
- scikit-learn User Guide — official documentation
- scikit-learn API Reference — complete API
- scikit-learn Examples Gallery — tutorials
- Pedregosa et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12:2825-2830.