BioSkills bio-methylation-differential-cpg
Per-CpG differential methylation testing from bisulfite sequencing count data or beta-value matrices. Covers beta and M-value computation, coverage filtering, statistical tests (Welch t-test, Mann-Whitney, limma, DSS beta-binomial), multiple testing correction, and effect size calculation. Use when comparing methylation at individual CpG sites between experimental groups from WGBS, RRBS, or targeted bisulfite sequencing.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/methylation-analysis/differential-cpg-testing" ~/.claude/skills/gptomics-bioskills-bio-methylation-differential-cpg && rm -rf "$T"
methylation-analysis/differential-cpg-testing/SKILL.mdVersion Compatibility
Reference examples tested with: scipy 1.12+, statsmodels 0.14+, pandas 2.2+, numpy 1.26+, limma 3.58+, DSS 2.50+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function) - R:
thenpackageVersion('<pkg>')
to verify parameters?function_name
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Per-CpG Differential Methylation Testing
"Test individual CpG sites for differential methylation between groups" -> Compute per-CpG methylation metrics from count data, apply coverage filters, run statistical tests for group differences, correct for multiple testing, and report effect sizes.
- Python:
+scipy.stats.ttest_ind()statsmodels.stats.multitest.multipletests() - R:
+limma::lmFit()
on M-values,eBayes()
on countsDSS::DMLtest()
Beta Values and M-Values
Goal: Convert raw bisulfite sequencing count data into analyzable methylation metrics.
Approach: Compute beta values (methylation proportion) for biological interpretation and M-values (logit-transformed) for statistical testing. Beta values are bounded [0, 1] and heteroscedastic (variance depends on methylation level). M-values are approximately homoscedastic, making them better suited for parametric tests (Du et al. 2010).
import numpy as np import pandas as pd # Beta value: proportion of methylated reads # Range [0, 1], directly interpretable as methylation percentage beta = meth_counts / total_counts # M-value: logit transform of beta (log2 scale) # Range (-inf, +inf), homoscedastic, better for statistical testing # Add offset to avoid log(0) and division by zero OFFSET = 1e-6 m_value = np.log2((beta + OFFSET) / (1 - beta + OFFSET))
# Beta value from counts beta <- meth_counts / total_counts # M-value: logit transform (base 2) # Add offset to handle beta = 0 or beta = 1 offset <- 1e-6 m_value <- log2((beta + offset) / (1 - beta + offset)) # Convert M-value back to beta for reporting # beta = 2^M / (2^M + 1) beta_from_m <- 2^m_value / (2^m_value + 1)
When to use each metric:
- Large n (>10/group), t-test or Mann-Whitney: Testing on beta values directly is standard and acceptable. Heteroscedasticity has minimal impact with sufficient samples.
- Small n (3-5/group), limma: Convert to M-values for statistical testing. Empirical Bayes moderation on M-values produces better-calibrated p-values.
- Reporting: Always report delta_beta (difference in mean beta values) for biological interpretation regardless of which metric was used for testing.
Coverage Filtering
Goal: Remove CpGs with unreliable methylation estimates due to insufficient or artificially inflated read coverage.
Approach: Apply a minimum coverage threshold to ensure adequate resolution, and an upper percentile filter to remove PCR amplification artifacts.
# Minimum 10x: provides 11 possible methylation levels (0/10 through 10/10) # and adequate power for statistical testing MIN_COVERAGE = 10 # Upper 99.9th percentile: removes sites with inflated coverage from # PCR duplication or mapping artifacts in repetitive regions upper_threshold = np.percentile(total_counts.values.flatten(), 99.9) # Require minimum coverage in ALL samples (any low-coverage sample # makes that CpG unreliable for group comparison) passes_min = (total_counts >= MIN_COVERAGE).all(axis=1) passes_max = (total_counts <= upper_threshold).all(axis=1) filtered = beta[passes_min & passes_max]
# Equivalent in R (without methylKit) min_coverage <- 10 upper_threshold <- quantile(as.matrix(total_counts), 0.999) keep <- apply(total_counts, 1, function(x) all(x >= min_coverage & x <= upper_threshold)) filtered_beta <- beta[keep, ]
Coverage Thresholds by Assay
| Assay | Minimum | Upper Filter | Rationale |
|---|---|---|---|
| WGBS | 5-10x | 99.9th %ile | Lower thresholds viable when using smoothing methods (BSmooth) |
| RRBS | 10x | 99.9th %ile | Standard minimum; no spatial smoothing for uncovered regions |
| Targeted/amplicon | 30-100x | 99.9th %ile | Deep sequencing expected; higher threshold increases confidence |
Method Selection
| Scenario | Recommended | Rationale |
|---|---|---|
| Large n (>10/group), Python environment | Welch's t-test + BH | Variance estimates reliable at larger sample sizes; fast and simple |
| Large n, non-normal beta distributions | Mann-Whitney U + BH | No distributional assumptions; same power as t-test at large n |
| Small n (3-5/group) | limma on M-values | Empirical Bayes borrows variance across CpGs; adds ~10-20 effective df |
| Count-level modeling needed | DSS beta-binomial | Models both biological variation and sampling noise from read counts |
| Unreplicated (two samples only) | Fisher's exact on counts | Only valid option; cannot estimate biological variance without replicates |
| Mixed tissue samples | limma + cell-type covariates | Cell composition confounds methylation differences (Jaffe & Irizarry 2014) |
Fisher's exact test on counts should be avoided when biological replicates exist. Pooling counts across replicates and testing with Fisher's inflates false positive rates to ~33% because biological variation is ignored.
Fisher's Exact Test (Python, Unreplicated Only)
from scipy.stats import fisher_exact # Only for unreplicated designs (1 sample per group) # table: [[meth_case, unmeth_case], [meth_ctrl, unmeth_ctrl]] table = [[meth_case, total_case - meth_case], [meth_ctrl, total_ctrl - meth_ctrl]] result = fisher_exact(table, alternative='two-sided') # result.statistic = odds ratio, result.pvalue = p-value
Welch's t-Test on Beta Values (Python)
Goal: Test each CpG for a mean methylation difference between two groups using a parametric test that does not assume equal variances.
Approach: Compute beta values per CpG per sample, run a Welch t-test per CpG across groups, collect p-values, and apply Benjamini-Hochberg FDR correction.
import numpy as np import pandas as pd from scipy import stats from statsmodels.stats.multitest import multipletests # Assume: case_betas and ctrl_betas are DataFrames # Rows = CpGs, columns = samples within each group pvalues = [] for cpg_idx in range(len(case_betas)): case_vals = case_betas.iloc[cpg_idx].values ctrl_vals = ctrl_betas.iloc[cpg_idx].values # equal_var=False: Welch's t-test (does NOT assume equal variance) # scipy defaults to equal_var=True (Student's), which is almost never appropriate result = stats.ttest_ind(case_vals, ctrl_vals, equal_var=False, nan_policy='omit') pvalues.append(result.pvalue) pvalues = np.array(pvalues) # Benjamini-Hochberg FDR correction # CRITICAL: multipletests defaults to method='hs' (Holm-Sidak), NOT BH # Must explicitly pass method='fdr_bh' reject, padj, _, _ = multipletests(pvalues, alpha=0.05, method='fdr_bh')
For large datasets, vectorized computation avoids the per-CpG loop:
from scipy.stats import ttest_ind # Vectorized across all CpGs at once (axis=1 tests across samples) t_stats, pvalues = ttest_ind(case_betas.values, ctrl_betas.values, axis=1, equal_var=False, nan_policy='omit') reject, padj, _, _ = multipletests(pvalues, alpha=0.05, method='fdr_bh')
Mann-Whitney U Test (Python)
Goal: Non-parametric alternative when beta-value distributions are non-normal or sample sizes are very unequal.
Approach: Per-CpG rank-based test with BH correction. No distributional assumptions, but lower power than t-test for small n.
from scipy.stats import mannwhitneyu pvalues = [] for cpg_idx in range(len(case_betas)): case_vals = case_betas.iloc[cpg_idx].dropna().values ctrl_vals = ctrl_betas.iloc[cpg_idx].dropna().values # Cannot reach p < 0.05 with n < 4 per group (insufficient ranks) if len(case_vals) < 4 or len(ctrl_vals) < 4: pvalues.append(np.nan) continue result = mannwhitneyu(case_vals, ctrl_vals, alternative='two-sided') pvalues.append(result.pvalue) pvalues = np.array(pvalues) valid = ~np.isnan(pvalues) padj = np.full_like(pvalues, np.nan) padj[valid] = multipletests(pvalues[valid], method='fdr_bh')[1]
limma on M-Values (R)
Goal: Identify differentially methylated CpGs with small sample sizes by borrowing variance information across CpGs via empirical Bayes moderation.
Approach: Convert beta values to M-values, fit a linear model per CpG with limma, apply empirical Bayes variance shrinkage, and extract moderated test statistics. This adds ~10-20 effective degrees of freedom even with n=3 per group.
library(limma) # M-value matrix: rows = CpGs, columns = samples offset <- 1e-6 m_values <- log2((beta_matrix + offset) / (1 - beta_matrix + offset)) # Design matrix group <- factor(c(rep('case', n_case), rep('ctrl', n_ctrl))) design <- model.matrix(~ 0 + group) colnames(design) <- levels(group) # Contrast: case vs control contrast_matrix <- makeContrasts(case - ctrl, levels = design) # Fit linear model and apply empirical Bayes fit <- lmFit(m_values, design) fit2 <- contrasts.fit(fit, contrast_matrix) # trend=TRUE: intensity-dependent prior variance (recommended for methylation) # robust=TRUE: protects against outlier CpGs fit2 <- eBayes(fit2, trend = TRUE, robust = TRUE) # Extract results (all CpGs, BH-adjusted) results <- topTable(fit2, number = Inf, adjust.method = 'BH', sort.by = 'none') # Columns: logFC (on M-value scale), AveExpr, t, P.Value, adj.P.Val, B # Convert logFC (M-value scale) back to approximate delta_beta for reporting # For CpGs in the middle range (beta 0.2-0.8), delta_beta ~ 0.15 * logFC # For precise delta_beta, compute from original beta values instead delta_beta <- rowMeans(beta_matrix[, group == 'case']) - rowMeans(beta_matrix[, group == 'ctrl']) results$delta_beta <- delta_beta
DSS Beta-Binomial Model (R)
Goal: Statistically principled count-based differential methylation testing that models both biological variation (Beta distribution across replicates) and sampling noise (Binomial distribution from read counts).
Approach: DSS uses a hierarchical beta-binomial model with Bayesian shrinkage for dispersion estimation. It operates directly on methylated/total count data without requiring beta or M-value conversion.
library(DSS) # Create BSseq object from count matrices # meth_counts: matrix of methylated read counts (CpGs x samples) # total_counts: matrix of total read counts (CpGs x samples) bs_obj <- BSseq(chr = cpg_chr, pos = cpg_pos, M = meth_counts, Cov = total_counts, sampleNames = sample_names) # Define groups group_case <- sample_names[1:n_case] group_ctrl <- sample_names[(n_case + 1):(n_case + n_ctrl)] # Test for differential methylation (per CpG) # smoothing=FALSE: test individual CpGs without spatial smoothing # smoothing=TRUE: borrows information from nearby CpGs (500bp default window) dml_test <- DMLtest(bs_obj, group1 = group_case, group2 = group_ctrl, smoothing = FALSE) # Extract significant DMCs # p.threshold: FDR-adjusted p-value cutoff dmcs <- callDML(dml_test, p.threshold = 0.05)
Effect Size Calculation
Goal: Quantify the magnitude of methylation difference at each CpG for biological interpretation and downstream filtering.
Approach: Compute delta_beta as the difference in mean beta values between groups. Apply effect size thresholds appropriate to the biological context.
delta_beta = case_betas.mean(axis=1).values - ctrl_betas.mean(axis=1).values
delta_beta <- rowMeans(beta_matrix[, group == 'case']) - rowMeans(beta_matrix[, group == 'ctrl'])
Effect Size Thresholds
| Threshold | delta_beta | Typical Use Case |
|---|---|---|
| Lenient | 0.10 | EWAS discovery, environmental exposure studies |
| Standard | 0.20 | General differential methylation analysis |
| methylKit default | 0.25 | methylKit getMethylDiff (reported as 25% difference) |
| Stringent | 0.30 | High-confidence set, cancer vs normal comparisons |
The appropriate threshold depends on biological context. Promoter CpG islands are often bimodal (either <10% or >80% methylated), so even small changes near the transition boundary can be functionally meaningful.
methylKit's
meth.diff is derived from its logistic regression model and is not identical to mean(case_beta) - mean(ctrl_beta). When comparing results across tools, always recompute delta_beta from the original beta values for consistency.
Full Python Pipeline
Goal: Complete per-CpG differential methylation analysis from a count-data TSV through to a results table with statistics and effect sizes.
Approach: Read count data, compute beta values, apply coverage filters, run Welch's t-test per CpG, correct for multiple testing with BH FDR, compute delta_beta, and write results.
import numpy as np import pandas as pd from scipy.stats import ttest_ind from statsmodels.stats.multitest import multipletests # Read count data: columns are cpg_id, {sample}_meth, {sample}_total counts = pd.read_csv('bisulfite_counts.tsv', sep='\t', index_col='cpg_id') # Identify samples and groups from column naming pattern case_samples = [c.replace('_meth', '') for c in counts.columns if c.startswith('case') and c.endswith('_meth')] ctrl_samples = [c.replace('_meth', '') for c in counts.columns if c.startswith('ctrl') and c.endswith('_meth')] # Compute beta values: meth / total per CpG per sample case_betas = pd.DataFrame({s: counts[f'{s}_meth'] / counts[f'{s}_total'] for s in case_samples}, index=counts.index) ctrl_betas = pd.DataFrame({s: counts[f'{s}_meth'] / counts[f'{s}_total'] for s in ctrl_samples}, index=counts.index) # Coverage filter: require minimum 10x in ALL samples MIN_COVERAGE = 10 # 11 possible methylation levels, adequate power for testing total_cols = [c for c in counts.columns if c.endswith('_total')] passes_coverage = (counts[total_cols] >= MIN_COVERAGE).all(axis=1) case_betas = case_betas[passes_coverage] ctrl_betas = ctrl_betas[passes_coverage] # Per-CpG Welch's t-test (vectorized) t_stats, pvalues = ttest_ind(case_betas.values, ctrl_betas.values, axis=1, equal_var=False, nan_policy='omit') # BH FDR correction (default is Holm-Sidak, must specify fdr_bh) reject, padj, _, _ = multipletests(pvalues, alpha=0.05, method='fdr_bh') # Effect sizes mean_case = case_betas.mean(axis=1) mean_ctrl = ctrl_betas.mean(axis=1) delta_beta = mean_case - mean_ctrl # Assemble results results = pd.DataFrame({ 'cpg_id': case_betas.index, 'mean_case_beta': mean_case.values, 'mean_ctrl_beta': mean_ctrl.values, 'delta_beta': delta_beta.values, 'pvalue': pvalues, 'padj': padj, 'significant': np.where(padj < 0.05, 'TRUE', 'FALSE') }) results.to_csv('dmc.tsv', sep='\t', index=False)
Common Pitfalls
-
Beta-value heteroscedasticity: Variance of beta values depends on methylation level -- highest near 0.5, compressed near 0 and 1. This means a 10% difference near beta=0 has different statistical properties than 10% near beta=0.5. For small samples, use limma on M-values instead.
-
scipy.stats.ttest_ind defaults to Student's t-test (
). Always passequal_var=True
for Welch's t-test. Equal variance between groups is almost never true for methylation data.equal_var=False -
statsmodels multipletests defaults to Holm-Sidak (
), not Benjamini-Hochberg. Always passmethod='hs'
explicitly.method='fdr_bh' -
Fisher's exact test with replicates inflates false positives. Pooling methylated/unmethylated counts across biological replicates and running Fisher's test ignores biological variation, producing false positive rates of ~33%. Fisher's exact is only appropriate for unreplicated designs (single sample per group).
-
methylKit meth.diff is not mean(case_beta) - mean(ctrl_beta). methylKit uses logistic regression with overdispersion correction. Its
column reflects the model-estimated difference, which can diverge from the simple mean difference, particularly at extreme methylation levels.meth.diff -
Bonferroni correction is too conservative for genome-wide methylation. With 450K-850K CpGs (arrays) or 28M+ CpGs (WGBS), Bonferroni thresholds leave almost no significant CpGs. BH FDR is the standard approach.
-
Cell-type composition confounds mixed-tissue samples. In blood or bulk tissue, observed methylation differences may reflect cell-type proportion changes rather than within-cell methylation changes (Jaffe & Irizarry 2014). Include cell-type estimates as covariates or use deconvolution methods.
Related Skills
- methylkit-analysis - Chi-squared testing with methylKit in R
- methylation-calling - Generate per-CpG count data from Bismark BAM files
- dmr-detection - Region-level testing after per-CpG analysis
- differential-expression/deseq2-basics - Analogous empirical Bayes concepts for count data
- proteomics/differential-abundance - Similar Welch t-test + BH workflow for proteomics