Babysitter sklearn-model-trainer
Scikit-learn model training skill with cross-validation, hyperparameter tuning, pipeline construction, and model serialization. Enables automated ML model development using scikit-learn's comprehensive toolkit.
install
source · Clone the upstream repo
git clone https://github.com/a5c-ai/babysitter
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/a5c-ai/babysitter "$T" && mkdir -p ~/.claude/skills && cp -r "$T/library/specializations/data-science-ml/skills/sklearn-model-trainer" ~/.claude/skills/a5c-ai-babysitter-sklearn-model-trainer && rm -rf "$T"
manifest: library/specializations/data-science-ml/skills/sklearn-model-trainer/SKILL.md
Scikit-learn Model Trainer
Train machine learning models using scikit-learn with cross-validation, hyperparameter tuning, and pipeline construction.
Overview
This skill covers the scikit-learn model development workflow end to end: data preprocessing, model training, evaluation, and serialization.
Capabilities
Model Training
- Train classification models (LogisticRegression, RandomForest, SVM, etc.)
- Train regression models (LinearRegression, GradientBoosting, etc.)
- Train clustering models (KMeans, DBSCAN, etc.)
- Build ensemble models (VotingClassifier, StackingClassifier, etc.)
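As a rough illustration of the ensemble support listed above, the sketch below combines three base classifiers with soft voting and with stacking. The synthetic dataset and the choice of base estimators are assumptions made for the example, not something the skill prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Illustrative base estimators; all three expose predict_proba
base_estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]

# Soft voting averages the predicted class probabilities
voting = VotingClassifier(estimators=base_estimators, voting="soft")

# Stacking trains a final estimator on out-of-fold predictions of the base models
stacking = StackingClassifier(
    estimators=base_estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

voting_scores = cross_val_score(voting, X, y, cv=3)
stacking_scores = cross_val_score(stacking, X, y, cv=3)
print(f"Voting accuracy:   {voting_scores.mean():.3f}")
print(f"Stacking accuracy: {stacking_scores.mean():.3f}")
```

Both ensembles clone the base estimators when fitted, so the same list can be shared between them.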
Cross-Validation
- K-fold cross-validation
- Stratified K-fold for imbalanced datasets
- Time series split for temporal data
- Leave-one-out and leave-p-out validation
- Custom cross-validation strategies
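The splitters above can be passed directly as the `cv` argument. A minimal sketch, assuming a synthetic imbalanced dataset and a logistic regression model chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, cross_val_score

# Synthetic, imbalanced data (roughly 90% / 10%), used only for illustration
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=0
)
model = LogisticRegression(max_iter=1000)

# Stratified K-fold keeps the class ratio roughly constant in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
stratified_scores = cross_val_score(model, X, y, cv=skf, scoring="balanced_accuracy")
fold_positive_rates = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]

# TimeSeriesSplit always trains on earlier rows and validates on later ones
tscv = TimeSeriesSplit(n_splits=4)
time_splits = list(tscv.split(X))
for train_idx, test_idx in time_splits:
    print(f"train rows 0..{train_idx.max()}, test rows {test_idx.min()}..{test_idx.max()}")
```

With `TimeSeriesSplit` the rows are assumed to be in chronological order, so the data should not be shuffled beforehand.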
Hyperparameter Tuning
- GridSearchCV for exhaustive search
- RandomizedSearchCV for random sampling
- Halving search strategies for efficiency
- Custom scoring functions
- Multi-metric evaluation
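Random sampling, custom scoring, and multi-metric evaluation can be combined in one search. The following is a sketch under assumed settings: the dataset is synthetic, and the parameter ranges and the F2 scorer are illustrative choices.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Distributions are sampled; lists are drawn from uniformly
param_distributions = {
    "n_estimators": randint(10, 50),
    "max_depth": [3, 5, None],
}

# Multi-metric scoring: one built-in metric and one custom scorer
scoring = {
    "accuracy": "accuracy",
    "f2": make_scorer(fbeta_score, beta=2),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions,
    n_iter=5,          # number of sampled candidates
    cv=3,
    scoring=scoring,
    refit="f2",        # with several metrics, refit must name the one used to pick the best model
    random_state=0,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Mean accuracy per candidate:", search.cv_results_["mean_test_accuracy"])
```

Each metric gets its own `mean_test_<name>` entry in `cv_results_`, so the candidates can be compared on every metric after a single search.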
Pipeline Construction
- Feature preprocessing pipelines
- Column transformers for heterogeneous data
- Feature selection integration
- Composite pipelines with caching
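Feature selection and caching can both live inside the pipeline. This sketch assumes a synthetic dataset and a temporary cache directory; the `memory` argument lets scikit-learn reuse fitted transformers when a search refits the same preprocessing steps with different final-step parameters.

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

cache_dir = mkdtemp()  # temporary directory for cached transformers
pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=5)),  # selection is refit inside each CV fold
        ("clf", LogisticRegression(max_iter=1000)),
    ],
    memory=cache_dir,
)

# Only the classifier changes between candidates, so the cached
# scaler and selector can be reused within each fold
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print("Best C:", search.best_params_["clf__C"])

rmtree(cache_dir)  # clean up the cache once fitting is done
```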
Model Serialization
- Save models with joblib (recommended)
- Pickle serialization
- ONNX export for interoperability
- Model versioning support
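One simple way to approach serialization with versioning is to store the model together with metadata. This is a sketch, not the skill's own mechanism: the `model_version` key and the file name are hypothetical.

```python
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a small model, used only for illustration
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Bundle the model with metadata; "model_version" is a hypothetical label
artifact = {
    "model": model,
    "sklearn_version": sklearn.__version__,
    "model_version": "1.0.0",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model-v1.0.0.joblib")
    joblib.dump(artifact, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same_predictions = bool((restored["model"].predict(X) == model.predict(X)).all())
print("Round trip OK:", same_predictions)
```

Recording the scikit-learn version helps because a model saved under one version is not guaranteed to load correctly under another. Files written by joblib or pickle can execute arbitrary code on load, so only load them from trusted sources.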
Prerequisites
Installation
    # Quote the requirement so the shell does not treat ">" as a redirect
    pip install "scikit-learn>=1.0.0" joblib pandas numpy
Optional Dependencies
    # For ONNX export
    pip install skl2onnx onnxruntime

    # For additional preprocessing
    pip install category_encoders imbalanced-learn
Usage Patterns
Basic Model Training
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import classification_report
    import joblib

    # X (features) and y (labels) are assumed to be loaded already

    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Train model
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        random_state=42
    )
    model.fit(X_train, y_train)

    # Cross-validation (fits fresh clones of the model on each fold)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    print(f"CV Accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

    # Evaluate
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))

    # Save model
    joblib.dump(model, 'model.joblib')
Pipeline with Preprocessing
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.impute import SimpleImputer
    from sklearn.ensemble import GradientBoostingClassifier

    # Define preprocessing
    # X_train is assumed to be a pandas DataFrame containing these columns
    numeric_features = ['age', 'income', 'score']
    categorical_features = ['category', 'region']

    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])

    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ]
    )

    # Create full pipeline
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', GradientBoostingClassifier())
    ])

    # Train
    pipeline.fit(X_train, y_train)
Hyperparameter Tuning with GridSearchCV
    from sklearn.model_selection import GridSearchCV

    # `pipeline` is the Pipeline built in the previous example;
    # the 'classifier__' prefix targets its final step

    # Define parameter grid
    param_grid = {
        'classifier__n_estimators': [50, 100, 200],
        'classifier__max_depth': [3, 5, 10, None],
        'classifier__learning_rate': [0.01, 0.1, 0.2]
    }

    # Grid search
    grid_search = GridSearchCV(
        pipeline,
        param_grid,
        cv=5,
        scoring='f1_weighted',
        n_jobs=-1,
        verbose=2
    )
    grid_search.fit(X_train, y_train)

    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best score: {grid_search.best_score_:.3f}")

    # Get best model
    best_model = grid_search.best_estimator_
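When an exhaustive grid is too expensive, the halving search mentioned under Capabilities is one alternative. The sketch below assumes synthetic data and a small illustrative grid. At the time of writing, scikit-learn marks the halving estimators as experimental, so they need an explicit enabling import.

```python
# The enabling import must come before HalvingGridSearchCV is imported
from sklearn.experimental import enable_halving_search_cv  # noqa: F401

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import HalvingGridSearchCV

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# 2 x 2 x 3 = 12 candidate combinations
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1, 0.2],
}

# Each round keeps roughly the best 1/factor of the candidates
# and gives the survivors more training samples
halving = HalvingGridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    factor=3,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
halving.fit(X, y)

print("Candidates per round:", halving.n_candidates_)
print("Samples per round:", halving.n_resources_)
print("Best parameters:", halving.best_params_)
```

Early rounds score candidates on small subsamples, so the ranking there is noisier than in a full grid search. In exchange, most of the compute goes to the promising candidates.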
Feature Selection
    from sklearn.feature_selection import SelectFromModel, RFE
    from sklearn.ensemble import RandomForestClassifier

    # Method 1: SelectFromModel
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold='median'
    )
    X_selected = selector.fit_transform(X_train, y_train)

    # Method 2: Recursive Feature Elimination
    rfe = RFE(
        estimator=RandomForestClassifier(n_estimators=100, random_state=42),
        n_features_to_select=10,
        step=1
    )
    X_rfe = rfe.fit_transform(X_train, y_train)

    # Get selected features (assumes X_train is a pandas DataFrame)
    selected_features = X_train.columns[rfe.support_].tolist()
Integration with Babysitter SDK
Task Definition Example
    const sklearnTrainingTask = defineTask({
      name: 'sklearn-model-training',
      description: 'Train a scikit-learn model with cross-validation',
      inputs: {
        modelType: { type: 'string', required: true },
        trainDataPath: { type: 'string', required: true },
        targetColumn: { type: 'string', required: true },
        hyperparameters: { type: 'object', default: {} },
        cvFolds: { type: 'number', default: 5 },
        scoringMetric: { type: 'string', default: 'accuracy' }
      },
      outputs: {
        modelPath: { type: 'string' },
        cvScores: { type: 'array' },
        bestScore: { type: 'number' },
        featureImportances: { type: 'object' }
      },
      async run(inputs, taskCtx) {
        return {
          kind: 'skill',
          title: `Train ${inputs.modelType} model`,
          skill: {
            name: 'sklearn-model-trainer',
            context: {
              operation: 'train_with_cv',
              modelType: inputs.modelType,
              trainDataPath: inputs.trainDataPath,
              targetColumn: inputs.targetColumn,
              hyperparameters: inputs.hyperparameters,
              cvFolds: inputs.cvFolds,
              scoringMetric: inputs.scoringMetric
            }
          },
          io: {
            inputJsonPath: `tasks/${taskCtx.effectId}/input.json`,
            outputJsonPath: `tasks/${taskCtx.effectId}/result.json`
          }
        };
      }
    });
Model Selection Guide
Classification Models
| Model | Use Case | Pros | Cons |
|---|---|---|---|
| LogisticRegression | Binary/multiclass, interpretable | Fast, interpretable | Linear boundary |
| RandomForestClassifier | General purpose | Robust, handles nonlinearity | Can overfit |
| GradientBoostingClassifier | High accuracy needed | State-of-the-art performance | Slower training |
| SVC | Small/medium datasets | Effective in high dimensions | Slow on large data |
| XGBClassifier (from the separate xgboost package, not scikit-learn) | Competition/production | Fast, accurate | Many hyperparameters |
Regression Models
| Model | Use Case | Pros | Cons |
|---|---|---|---|
| LinearRegression | Baseline, interpretable | Simple, fast | Assumes linearity |
| Ridge/Lasso | Regularization needed | Reduces overfitting | Still linear |
| RandomForestRegressor | General purpose | Handles nonlinearity | Can overfit |
| GradientBoostingRegressor | High accuracy | Excellent performance | Slower |
| SVR | Small datasets | Robust to outliers | Scales poorly to large datasets |
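The tables give starting points. In practice, candidates are usually compared on the same cross-validation folds. A minimal sketch for regression, assuming synthetic data and an illustrative shortlist of models:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, used only for illustration
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Illustrative shortlist taken from the regression table
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Reusing one splitter means every model is scored on identical folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = {
    name: cross_val_score(estimator, X, y, cv=cv, scoring="r2").mean()
    for name, estimator in candidates.items()
}

for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean R^2 = {score:.3f}")
```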
Best Practices
- Always Use Pipelines: Prevent data leakage by including preprocessing in pipelines
- Stratified Splits: Use stratified sampling for imbalanced classification
- Cross-Validation: Tune hyperparameters with cross-validation on the training set, never on test data
- Feature Scaling: Apply appropriate scaling for distance-based models (for example SVM and k-nearest neighbors)
- Random Seeds: Set random_state for reproducibility
- Model Persistence: Use joblib over pickle for large numpy arrays
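Several of these practices fit into a few lines. The sketch below uses assumed synthetic data and shows a stratified split, scaling kept inside the pipeline so it is refit on each training fold, fixed random seeds, and a test set that is touched only once at the end.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, moderately imbalanced data, used only for illustration
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Stratified split preserves the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The scaler is part of the pipeline, so each CV fold fits it on
# its own training portion only and no test statistics leak in
model = make_pipeline(StandardScaler(), SVC(random_state=42))
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# The held-out test set is used once, after model decisions are made
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"CV accuracy: {cv_scores.mean():.3f}, test accuracy: {test_accuracy:.3f}")
```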