BioSkills bio-machine-learning-omics-classifiers
Builds classification models for omics data using RandomForest, XGBoost, and logistic regression with sklearn-compatible APIs. Includes proper preprocessing and evaluation metrics for biomarker classifiers. Use when building diagnostic or prognostic classifiers from expression or variant data.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/machine-learning/omics-classifiers" ~/.claude/skills/gptomics-bioskills-bio-machine-learning-omics-classifiers && rm -rf "$T"
machine-learning/omics-classifiers/SKILL.mdVersion Compatibility
Reference examples tested with: matplotlib 3.8+, pandas 2.2+, scikit-learn 1.4+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function)
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Classification Models for Omics Data
"Build a classifier from my gene expression data" → Train RandomForest, XGBoost, or logistic regression models on omics features with proper preprocessing and evaluation metrics.
- Python:
,sklearn.ensemble.RandomForestClassifier()xgboost.XGBClassifier()
Core Workflow
Goal: Train a classification model on omics data and evaluate its predictive performance.
Approach: Build a scaled pipeline with a Random Forest classifier, fit on training data, and assess with ROC-AUC on held-out test data.
from sklearn.model_selection import train_test_split from sklearn.preprocessing import StandardScaler from sklearn.pipeline import Pipeline from sklearn.ensemble import RandomForestClassifier from sklearn.metrics import classification_report, roc_auc_score, roc_curve import matplotlib.pyplot as plt X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42) pipe = Pipeline([ ('scaler', StandardScaler()), ('clf', RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)) ]) pipe.fit(X_train, y_train) y_pred = pipe.predict(X_test) y_prob = pipe.predict_proba(X_test)[:, 1] print(classification_report(y_test, y_pred)) print(f'ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}')
XGBoost Classifier
Goal: Train a gradient-boosted tree classifier using the sklearn-compatible XGBoost API.
Approach: Configure XGBClassifier with proper parameter names (avoiding deprecated aliases) and wrap in a scaling pipeline.
from xgboost import XGBClassifier # Use sklearn-compatible API with proper parameters (avoid deprecated seed, nthread) xgb = XGBClassifier( n_estimators=100, max_depth=6, learning_rate=0.1, random_state=42, # NOT seed n_jobs=-1, # NOT nthread eval_metric='logloss' ) pipe = Pipeline([('scaler', StandardScaler()), ('clf', xgb)]) pipe.fit(X_train, y_train)
Logistic Regression with Regularization
Goal: Build an interpretable linear classifier that simultaneously selects sparse biomarker features.
Approach: Use L1-regularized logistic regression with built-in cross-validation for penalty selection, then extract nonzero coefficients as selected features.
from sklearn.linear_model import LogisticRegressionCV # L1 for sparse biomarkers, L2 for correlated features, elasticnet for mixed logit = LogisticRegressionCV( Cs=10, cv=5, penalty='l1', solver='saga', max_iter=1000, random_state=42 ) pipe = Pipeline([('scaler', StandardScaler()), ('clf', logit)]) pipe.fit(X_train, y_train) # Get selected features (nonzero coefficients) feature_mask = logit.coef_[0] != 0 selected = X.columns[feature_mask]
ROC Curve Visualization
Goal: Generate a publication-quality ROC curve showing classifier discrimination ability.
Approach: Compute false/true positive rates from predicted probabilities and plot with AUC annotation.
fpr, tpr, _ = roc_curve(y_test, y_prob) auc = roc_auc_score(y_test, y_prob) plt.figure(figsize=(6, 6)) plt.plot(fpr, tpr, label=f'ROC (AUC = {auc:.3f})') plt.plot([0, 1], [0, 1], 'k--') plt.xlabel('False Positive Rate') plt.ylabel('True Positive Rate') plt.legend() plt.savefig('roc_curve.png', dpi=150)
Multi-class Classification
Goal: Handle classification tasks with more than two classes while addressing class imbalance.
Approach: Encode labels numerically and use balanced class weights to upweight underrepresented classes during training.
from sklearn.metrics import classification_report from sklearn.preprocessing import LabelEncoder le = LabelEncoder() y_encoded = le.fit_transform(y) # Use class_weight for imbalanced data rf = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
Feature Importance from Trees
Goal: Rank features by their contribution to tree-based classifier predictions.
Approach: Extract Gini importances from a fitted Random Forest and sort to identify top contributing features.
import pandas as pd importances = pipe.named_steps['clf'].feature_importances_ feature_imp = pd.DataFrame({'feature': X.columns, 'importance': importances}) feature_imp = feature_imp.sort_values('importance', ascending=False).head(20)
Preprocessing Guidelines
| Data Type | Scaler | Notes |
|---|---|---|
| Log-counts (RNA-seq) | StandardScaler | Assumes ~normal after log |
| TPM/FPKM | StandardScaler | Gene-wise centering |
| Raw counts | None | Tree models handle counts |
| Mixed features | ColumnTransformer | Different scalers per type |
Related Skills
- machine-learning/model-validation - Proper model evaluation
- machine-learning/prediction-explanation - Explain predictions with SHAP
- machine-learning/biomarker-discovery - Reduce features before modeling