BioSkills bio-metabolomics-statistical-analysis
Statistical analysis for metabolomics data. Covers preprocessing (log2 transformation, normalization), limma moderated testing with empirical Bayes, Welch's t-tests with BH correction, fold change estimation, and multivariate methods (PCA, PLS-DA, OPLS-DA). Use when identifying differentially abundant metabolites or building classification models.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/metabolomics/statistical-analysis" ~/.claude/skills/gptomics-bioskills-bio-metabolomics-statistical-analysis && rm -rf "$T"
metabolomics/statistical-analysis/SKILL.mdVersion Compatibility
Reference examples tested with: limma 3.58+, ashr 2.2+, scipy 1.12+, statsmodels 0.14+, numpy 1.26+, pandas 2.1+, mixOmics 6.28+, ggplot2 3.5+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function) - R:
thenpackageVersion('<pkg>')
to verify parameters?function_name
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Metabolomics Statistical Analysis
Processing Pipeline
Standard pipeline: zero/missing handling -> log2 transformation -> normalization -> statistical testing. Fold change and statistical tests operate on log2-transformed, normalized but unscaled data. Pareto or auto-scaling distorts fold change magnitudes -- apply only for multivariate methods (PCA, PLS-DA).
Zero and Missing Value Handling
Metabolomics zeros fall into two categories requiring different treatment:
| Type | Meaning | Detection | Action |
|---|---|---|---|
| TPMV (technical) | Below detection limit | Random missingness within detected features | Impute: half-minimum per feature or KNN |
| BPMV (biological) | Metabolite truly absent | Structured: all zeros in one group | Leave as-is or use two-part test |
Distinguishing them matters: imputing BPMVs introduces false signal, while ignoring TPMVs loses power. If zeros are concentrated in one experimental group, they are likely biological. If scattered randomly, they are likely technical.
log2_data = np.log2(intensities.replace(0, np.nan))
log2_matrix <- log2(intensity_matrix) log2_matrix[!is.finite(log2_matrix)] <- NA
For pseudocount approach:
log2(x + 1) avoids NaN but compresses low-intensity fold changes. Prefer half-minimum imputation when the zero rate is low (<10%).
Normalization
Goal: Remove systematic technical biases (injection volume, instrument drift, ionization efficiency) so observed differences reflect biology.
Approach: Choose based on data characteristics. All methods assume the majority of features are not differentially abundant. Unlike proteomics where median centering is the default, metabolomics uses PQN because it is more robust to dominant high-abundance features common in MS-based metabolomics.
| Method | When to use | Metabolomics note |
|---|---|---|
| PQN | Default for untargeted LC-MS metabolomics | More robust than TIC/median to dominant features |
| QC-RSC (LOESS) | Multi-batch LC-MS with pooled QC samples | Gold standard for batch correction; metabolomics-specific |
| VSN | High zero rate or heteroscedastic data | Handles zeros via arcsinh; replaces separate log2 step |
| TIC | Quick exploration; NMR data | Distorted by dominant features; avoid for LC-MS |
| Cyclic loess | Asymmetric DE (more up- than down-regulated) | Robust to assumption violations |
| None | IS-corrected data; single-batch balanced design | Check PCA on raw log2 data first |
PQN normalization: compute a reference spectrum (median across samples), divide each feature by the reference, take the median of quotients as the per-sample normalization factor.
reference = log2_data.median(axis=1) quotients = log2_data.div(reference, axis=0) norm_factors = quotients.median(axis=0) normalized = log2_data.div(norm_factors, axis=1)
reference <- apply(log2_matrix, 1, median, na.rm = TRUE) quotients <- sweep(log2_matrix, 1, reference, '/') norm_factors <- apply(quotients, 2, median, na.rm = TRUE) normalized <- sweep(log2_matrix, 2, norm_factors, '/')
When normalization hurts: Drug studies and extreme phenotypes can cause global metabolic shifts affecting >50% of features, violating the "majority stable" assumption. This is more common in metabolomics than proteomics because metabolic networks are tightly coupled. Check PCA on raw log2 data first -- if dominant variation corresponds to the biological factor AND global shifts are expected, skip or attenuate normalization.
Method Selection
| Scenario | Recommended | Rationale |
|---|---|---|
| Small n (3-10/group), default | limma | Borrows variance across features; adds ~10-20 effective df |
| Large n (>10/group), Python-only | Welch's t-test + BH | Per-feature variance reliable; limma converges to ordinary t-test |
| Non-normal after log transform | Wilcoxon rank-sum | Cannot reach p < 0.05 with n < 4/group |
| Zero-inflated (many BPMVs) | Two-part test | Separately tests presence/absence and abundance |
| Paired/repeated measures | Paired t-test or limma with blocking | in limma for repeated measures |
| 3+ groups | Welch's ANOVA or limma F-test | Post-hoc: Games-Howell or Tukey HSD |
Unlike proteomics, metabolomics has no analog to PSM/peptide counts, so DEqMS is not applicable. limma is the sole empirical Bayes option for metabolite-level testing.
trend=TRUE is particularly important for metabolomics: features span 3-4 orders of magnitude in intensity and include chemically diverse compounds (polar metabolites, lipids, nucleotides), creating a stronger mean-variance relationship than typical proteomics data. robust=TRUE protects against hyper-variable outlier features common in metabolomics panels that mix chemical classes.
limma Workflow (R)
Goal: Identify differentially abundant metabolites using moderated statistics that borrow information across all features.
Approach: Build a linear model, apply empirical Bayes moderation with intensity-dependent variance trend, extract BH-corrected results.
library(limma) design <- model.matrix(~0 + condition, data = sample_info) colnames(design) <- levels(factor(sample_info$condition)) fit <- lmFit(normalized_matrix, design) contrast_matrix <- makeContrasts(Treatment - Control, levels = design) fit2 <- contrasts.fit(fit, contrast_matrix) fit2 <- eBayes(fit2, trend = TRUE, robust = TRUE) results <- topTable(fit2, coef = 1, number = Inf, adjust.method = 'BH')
Results columns:
logFC, AveExpr, t, P.Value, adj.P.Val, B. Note: adjusted p-values are in adj.P.Val (not FDR or padj). For batch effects, include batch in the design matrix (do NOT use removeBatchEffect() before testing -- that function is for visualization only):
design <- model.matrix(~0 + condition + batch, data = sample_info)
Welch's t-test Workflow (Python)
Goal: Perform the full differential abundance pipeline in Python when R/limma is unavailable.
Approach: Log2-transform raw intensities, run per-feature Welch's t-tests, apply BH correction. Add PQN normalization when systematic biases are suspected (see Processing Pipeline).
import numpy as np import pandas as pd from scipy.stats import ttest_ind from statsmodels.stats.multitest import multipletests intensities = pd.read_csv('feature_table.tsv', sep='\t', index_col=0) metadata = pd.read_csv('sample_info.tsv', sep='\t') case_samples = metadata.loc[metadata['group'] == 'case', 'sample_id'].tolist() ctrl_samples = metadata.loc[metadata['group'] == 'control', 'sample_id'].tolist() log2_data = np.log2(intensities.replace(0, np.nan)) pvalues, log2fcs = [], [] for feature in log2_data.index: case_vals = log2_data.loc[feature, case_samples].dropna().values.astype(float) ctrl_vals = log2_data.loc[feature, ctrl_samples].dropna().values.astype(float) if len(case_vals) >= 2 and len(ctrl_vals) >= 2: _, pval = ttest_ind(case_vals, ctrl_vals, equal_var=False) # Welch's -- scipy defaults to Student's pvalues.append(pval) log2fcs.append(case_vals.mean() - ctrl_vals.mean()) else: pvalues.append(np.nan) log2fcs.append(np.nan) results = pd.DataFrame({'feature_id': log2_data.index, 'log2fc': log2fcs, 'pvalue': pvalues}) results = results.dropna(subset=['pvalue']) _, results['padj'], _, _ = multipletests(results['pvalue'], method='fdr_bh') # default is Holm-Sidak, not BH results['significant'] = results['padj'] < 0.05
Key details:
selects Welch's t-test (scipy defaults to Student's withequal_var=False
)equal_var=True
defaults to Holm-Sidak (FWER) -- always passmultipletests
explicitly for BH FDRmethod='fdr_bh'- Fold change is computed as difference of group means in log2-space (geometric mean ratio), not
which uses arithmetic meanslog2(mean_ratio)
Fold Change Calculation
| Scenario | Test | Notes |
|---|---|---|
| 2 groups, n >= 5/group | Welch's t-test | Always prefer over Student's; unequal variance is the norm |
| 2 groups, non-normal after log | Mann-Whitney U | Cannot reach p < 0.05 with n < 4/group |
| 2 groups, n < 5/group | limma moderated t | borrows variance across features |
| Paired samples | Paired t-test | Pre/post, matched case-control |
| 3+ groups | Welch's ANOVA | Post-hoc: Games-Howell or Dunn's test |
Approach: Compute the difference of group means on log2-transformed data, which equals log2(geometric_mean_case / geometric_mean_control).
log2fc = log2_data.loc[:, case_samples].mean(axis=1) - log2_data.loc[:, ctrl_samples].mean(axis=1)
log2fc <- rowMeans(normalized[, case_samples]) - rowMeans(normalized[, ctrl_samples])
Always compute fold change on log2-transformed, unscaled data. The naive alternative --
log2(mean(case) / mean(control)) -- uses arithmetic means and can reverse fold change direction when group variances differ. The difference-of-log-means approach uses geometric means, consistent with limma and DESeq2.
MetaboAnalyst note: MetaboAnalyst internally uses
log2(arithmetic_mean_ratio) for fold change calculation. This diverges from the geometric mean approach and can produce different results, particularly for features with heterogeneous variance. The geometric mean method (difference of log means) is statistically more appropriate for log-normally distributed metabolomics data.
Fold Change Reporting
Raw fold changes are noisy but unbiased. How to handle them depends on the downstream use:
- Pathway analysis (QEA/GSEA): Use raw fold changes for all features. MetaboAnalyst QEA ranks by effect size and relies on the full continuous distribution. Do not zero or threshold FCs before enrichment analysis.
- Effect size recovery (which metabolites truly changed and by how much): Apply ashr shrinkage (R) for posterior mean estimates. Relevant for biomarker panels or clinical reporting.
- Meta-analysis across studies: Use raw fold changes with standard errors as input. Shrinkage is study-specific and should not be applied before pooling.
Fold Change Shrinkage (ashr)
ashr fits a mixture prior with a point mass at zero and estimates posterior means, smoothly shrinking uncertain effects toward zero while preserving well-supported ones:
library(ashr) se <- sqrt(fit2$s2.post) * fit2$stdev.unscaled[, 1] shrunk <- ash(fit2$coefficients[, 1], se, mixcompdist = 'normal') shrunken_fc <- shrunk$result$PosteriorMean lfsr <- shrunk$result$lfsr
ashr is preferred over hard-thresholding (zeroing FCs at a p-value cutoff) because it shrinks smoothly based on per-feature uncertainty rather than applying an arbitrary step function at padj = 0.05. In Python, there is no mature equivalent -- use R with ashr when effect size accuracy is critical.
Minimum Fold Change Testing
To test whether fold changes exceed a biologically meaningful threshold rather than just differ from zero, use
treat() + topTreat() instead of post-hoc FC filtering with topTable(lfc=...), which can inflate FDR:
fit2 <- treat(fit2, lfc = log2(1.5)) results <- topTreat(fit2, coef = 1, number = Inf)
Common Pitfalls
- Not log-transforming raw intensities -- parametric tests assume approximately normal distributions; raw MS intensities are right-skewed with mean-dependent variance
- Using Student's t-test --
defaults toscipy.stats.ttest_ind
; always setequal_var=True
(Welch's) since treatment can affect both mean and varianceequal_var=False - statsmodels default is not BH --
defaults to Holm-Sidak (FWER); always passmultipletests
for metabolomics FDR correctionmethod='fdr_bh'
before testing -- this function is for visualization only; include batch as a covariate in the design matrix for statistical testingremoveBatchEffect()- Post-hoc FC filtering --
inflates FDR; usetopTable(lfc=...)
+treat()
for minimum-effect-size testingtopTreat() - Normalizing when majority-stable fails -- drug studies causing global metabolic shifts violate PQN/TIC/median assumptions; check PCA on raw log2 data first
- Imputing biological zeros -- replacing BPMVs (metabolite truly absent in one group) with half-minimum creates false signal; use two-part tests instead
- MetaboAnalyst defaults -- uses Student's t-test (not Welch's) and arithmetic mean FC (not geometric mean) by default; override or be aware when comparing results
Volcano Plot
Goal: Visualize both statistical significance and effect size for all features in a single plot.
Approach: Plot log2 fold change (x-axis) vs -log10 p-value (y-axis), highlighting features passing both thresholds.
import matplotlib.pyplot as plt results['sig_and_fc'] = (results['padj'] < 0.05) & (results['log2fc'].abs() > 1) colors = ['red' if s else 'gray' for s in results['sig_and_fc']] plt.figure(figsize=(8, 6)) plt.scatter(results['log2fc'], -np.log10(results['pvalue']), c=colors, alpha=0.6, s=20) plt.axhline(-np.log10(0.05), linestyle='--', color='gray') plt.axvline(-1, linestyle='--', color='gray') plt.axvline(1, linestyle='--', color='gray') plt.xlabel('Log2 Fold Change') plt.ylabel('-log10(p-value)') plt.savefig('volcano_plot.png', dpi=150, bbox_inches='tight')
library(ggplot2) results$sig_and_fc <- results$adj.P.Val < 0.05 & abs(results$logFC) > 1 ggplot(results, aes(x = logFC, y = -log10(adj.P.Val), color = sig_and_fc)) + geom_point(alpha = 0.6) + scale_color_manual(values = c('gray60', 'firebrick')) + geom_hline(yintercept = -log10(0.05), linetype = 'dashed') + geom_vline(xintercept = c(-1, 1), linetype = 'dashed') + labs(x = 'Log2 Fold Change', y = '-Log10 Adjusted P-value') + theme_bw()
PCA
Goal: Explore overall sample variation and detect outliers or batch effects in an unsupervised manner.
Approach: Perform PCA on the feature matrix and plot the first two principal components colored by experimental group.
library(pcaMethods) pca_result <- pca(data, nPcs = 5, method = 'ppca') scores <- as.data.frame(scores(pca_result)) scores$group <- groups ggplot(scores, aes(x = PC1, y = PC2, color = group)) + geom_point(size = 3) + stat_ellipse(level = 0.95) + labs(x = paste0('PC1 (', round(pca_result@R2[1] * 100, 1), '%)'), y = paste0('PC2 (', round(pca_result@R2[2] * 100, 1), '%)')) + theme_bw() loadings <- as.data.frame(loadings(pca_result)) top_pc1 <- loadings[order(abs(loadings$PC1), decreasing = TRUE)[1:20], ]
PLS-DA
Goal: Build a supervised classification model that maximizes separation between experimental groups.
Approach: Fit a PLS-DA model with cross-validation to determine optimal components, then extract VIP scores to rank discriminatory features.
library(mixOmics) plsda_result <- plsda(as.matrix(data), groups, ncomp = 3) perf_plsda <- perf(plsda_result, validation = 'Mfold', folds = 5, nrepeat = 50) plot(perf_plsda, col = color.mixo(5:7)) ncomp_opt <- perf_plsda$choice.ncomp['BER', 'centroids.dist'] final_plsda <- plsda(as.matrix(data), groups, ncomp = ncomp_opt) plotIndiv(final_plsda, group = groups, ellipse = TRUE, legend = TRUE) vip <- vip(final_plsda) top_vip <- sort(vip[, ncomp_opt], decreasing = TRUE)[1:20]
sPLS-DA (Sparse)
Goal: Perform feature selection simultaneously with classification to identify a minimal discriminatory feature set.
Approach: Tune the number of features to retain per component via cross-validation, then fit a sparse PLS-DA model.
tune_splsda <- tune.splsda(as.matrix(data), groups, ncomp = 3, validation = 'Mfold', folds = 5, nrepeat = 50, test.keepX = c(5, 10, 20, 50, 100)) optimal_keepX <- tune_splsda$choice.keepX splsda_result <- splsda(as.matrix(data), groups, ncomp = ncomp_opt, keepX = optimal_keepX) selected_features <- selectVar(splsda_result, comp = 1)$name
OPLS-DA
Goal: Separate group-predictive variation from orthogonal (within-group) variation for cleaner class separation.
Approach: Fit an OPLS-DA model using ropls, then use the S-plot to identify features with high predictive weight and correlation.
library(ropls) oplsda <- opls(data, groups, predI = 1, orthoI = NA) plot(oplsda, typeVc = 'x-score') plot(oplsda, typeVc = 'x-loading') vip_scores <- getVipVn(oplsda) top_vip <- sort(vip_scores, decreasing = TRUE)[1:20]
Random Forest
Goal: Rank features by importance using a non-linear ensemble classifier.
Approach: Train a Random Forest on the feature matrix, extract MeanDecreaseAccuracy to identify discriminatory features.
library(randomForest) rf_model <- randomForest(x = data, y = groups, importance = TRUE, ntree = 500) importance <- importance(rf_model) top_features <- rownames(importance)[order(importance[, 'MeanDecreaseAccuracy'], decreasing = TRUE)[1:20]] varImpPlot(rf_model, n.var = 20)
ROC Analysis
Goal: Evaluate the diagnostic performance of candidate biomarker metabolites.
Approach: Generate ROC curves and compute AUC for individual features using pROC.
library(pROC) top_feature <- 'feature_123' roc_result <- roc(groups, data[, top_feature]) plot(roc_result, main = paste('AUC =', round(auc(roc_result), 3))) biomarkers <- c('feature_1', 'feature_2', 'feature_3') for (feat in biomarkers) { roc_i <- roc(groups, data[, feat]) cat(feat, ': AUC =', round(auc(roc_i), 3), '\n') }
Heatmap
Goal: Visualize abundance patterns of top differential features across all samples.
Approach: Select top significant features, create an annotated heatmap with hierarchical clustering.
library(pheatmap) top_features <- rownames(sig_features)[1:50] data_top <- data[, top_features] annotation_row <- data.frame(Group = groups) rownames(annotation_row) <- rownames(data) pheatmap(t(data_top), annotation_col = annotation_row, scale = 'row', clustering_method = 'ward.D2', filename = 'heatmap.png', width = 10, height = 12)
Related Skills
- normalization-qc - Data preparation and batch correction
- pathway-mapping - Functional interpretation of differential metabolites
- multi-omics-integration/mixomics-analysis - Advanced multivariate methods
- proteomics/differential-abundance - Analogous empirical Bayes concepts for proteomics
- data-visualization/specialized-omics-plots - Volcano plots, MA plots, heatmaps