BioSkills bio-workflows-timecourse-pipeline

End-to-end time-course analysis from expression matrix to temporal patterns and enrichment. Covers temporal DE, Mfuzz soft clustering, optional rhythm detection, GAM trajectory fitting, and per-cluster pathway enrichment. Use when analyzing bulk time-series expression experiments from any omics platform.

install

source · Clone the upstream repo

git clone https://github.com/GPTomics/bioSkills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/workflows/timecourse-pipeline" ~/.claude/skills/gptomics-bioskills-bio-workflows-timecourse-pipeline && rm -rf "$T"

manifest: workflows/timecourse-pipeline/SKILL.md

source content

Version Compatibility

Reference examples tested with: DESeq2 1.42+, clusterProfiler 4.10+, limma 3.58+, numpy 1.26+, pandas 2.2+, scanpy 1.10+, scipy 1.12+, statsmodels 0.14+

Before using code patterns, verify installed versions match. If versions differ:

Python:
```
pip show <package>
```
then
```
help(module.function)
```
to check signatures
R:
```
packageVersion('<pkg>')
```
then
```
?function_name
```
to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Time-Course Analysis Pipeline

"Analyze my time-course expression data end-to-end" → Orchestrate temporal differential expression, Mfuzz soft clustering, optional circadian rhythm detection, GAM trajectory fitting, changepoint detection, and per-cluster pathway enrichment.

Complete workflow from expression matrix through temporal differential expression, soft clustering, optional rhythm detection, trajectory fitting, and per-cluster pathway enrichment.

Pipeline Overview

Expression matrix + time metadata
    |
    v
[1. Temporal DE] ---------> limma splines / DESeq2 LRT
    |
    v
[2. Filter] --------------> Significant temporal genes (FDR <0.05)
    |
    v
[3. Mfuzz Clustering] ----> Soft clustering of expression profiles
    |                            |
    |                            +---> QC: membership >0.5, no empty clusters
    |
    +--- Circadian design? ---> [4a. Rhythm Detection] (MetaCycle / CosinorPy)
    |                               |
    |                               v
    |                           Rhythmic genes + period/phase estimates
    |
    v
[4b. GAM Trajectory] -----> mgcv GAM fitting for top clusters
    |
    v
[5. Pathway Enrichment] --> clusterProfiler per-cluster GO/KEGG
    |
    v
Temporal gene modules + enriched pathways + trajectory plots

Step 1: Temporal Differential Expression

R (limma splines)

library(limma)
library(splines)

expr <- as.matrix(read.csv('counts_normalized.csv', row.names = 1))
meta <- read.csv('metadata.csv')

time_points <- meta$time
design <- model.matrix(~ ns(time_points, df = 3))

fit <- lmFit(expr, design)
fit <- eBayes(fit)

# Test all spline coefficients jointly for temporal significance
temporal_results <- topTable(fit, coef = 2:ncol(design), number = Inf, sort.by = 'F')
# topTable already returns adj.P.Val (BH-corrected); use it directly

R (DESeq2 LRT)

library(DESeq2)

counts <- as.matrix(read.csv('raw_counts.csv', row.names = 1))
meta <- read.csv('metadata.csv')
meta$time <- factor(meta$time)

dds <- DESeqDataSetFromMatrix(counts, colData = meta, design = ~ time)
dds <- DESeq(dds, test = 'LRT', reduced = ~ 1)
res <- results(dds)

Python (statsmodels)

import pandas as pd
import numpy as np
from statsmodels.stats.multitest import multipletests
from patsy import dmatrix
from scipy import stats

expr = pd.read_csv('counts_normalized.csv', index_col=0)
meta = pd.read_csv('metadata.csv')

spline_basis = dmatrix('bs(time, df=3)', data=meta, return_type='dataframe')
design_full = np.column_stack([np.ones(len(meta)), spline_basis.values])
design_reduced = np.ones((len(meta), 1))

pvals = []
for gene in expr.index:
    y = expr.loc[gene].values
    ss_full = np.sum((y - design_full @ np.linalg.lstsq(design_full, y, rcond=None)[0]) ** 2)
    ss_red = np.sum((y - design_reduced @ np.linalg.lstsq(design_reduced, y, rcond=None)[0]) ** 2)
    df_diff = design_full.shape[1] - design_reduced.shape[1]
    df_resid = len(y) - design_full.shape[1]
    f_stat = ((ss_red - ss_full) / df_diff) / (ss_full / df_resid)
    pvals.append(1 - stats.f.cdf(f_stat, df_diff, df_resid))

_, fdr, _, _ = multipletests(pvals, method='fdr_bh')
temporal_genes = expr.index[fdr < 0.05].tolist()

QC Checkpoint: Temporal DE

# Gate 1: Sufficient temporal genes detected
sig_genes <- temporal_results[temporal_results$adj.P.Val < 0.05, ]
n_sig <- nrow(sig_genes)
message(sprintf('Significant temporal genes: %d', n_sig))
if (n_sig < 100) message('WARNING: Few temporal genes. Check time point spacing or consider relaxing FDR.')
if (n_sig > 10000) message('WARNING: Many temporal genes. Consider stricter FDR or inspect batch effects.')

# Gate 2: Residual distribution check
residuals <- residuals(fit, expr)
message(sprintf('Residual mean: %.4f, SD: %.4f', mean(residuals), sd(residuals)))

Step 2: Filter Significant Genes

# FDR <0.05: Standard threshold for temporal DE
# More permissive (0.1) acceptable for exploratory clustering
sig_genes <- rownames(temporal_results[temporal_results$adj.P.Val < 0.05, ])
expr_sig <- expr[sig_genes, ]
message(sprintf('Genes passing FDR <0.05: %d', length(sig_genes)))

Step 3: Mfuzz Soft Clustering

library(Mfuzz)

eset <- ExpressionSet(assayData = as.matrix(expr_sig))

# Standardize expression profiles (mean=0, sd=1 per gene)
eset <- standardise(eset)

# Estimate fuzzifier m
# mestimate() calculates optimal m from data geometry; typical range 1.5-2.5
m <- mestimate(eset)
message(sprintf('Estimated fuzzifier m = %.2f', m))

# Cluster count: start with sqrt(n_genes/2), refine with gap statistic
# Typical range 4-20 depending on temporal complexity
n_clusters <- 8
cl <- mfuzz(eset, c = n_clusters, m = m)

# Filter low-membership genes
# Membership >0.5: gene clearly belongs to one cluster
# Lower to 0.3 for exploratory analysis with more overlap
core_genes <- acore(eset, cl, min.acore = 0.5)

Python Alternative (tslearn)

from tslearn.clustering import TimeSeriesKMeans

# Row-wise z-scoring: normalize each gene across its timepoints (not per-timepoint)
expr_scaled = (expr_sig.values - expr_sig.values.mean(axis=1, keepdims=True)) / expr_sig.values.std(axis=1, keepdims=True)

# n_clusters: 4-20 depending on complexity; evaluate with silhouette score
model = TimeSeriesKMeans(n_clusters=8, metric='softdtw', metric_params={'gamma': 0.01},
                         max_iter=50, random_state=42)
labels = model.fit_predict(expr_scaled.reshape(expr_scaled.shape[0], expr_scaled.shape[1], 1))

QC Checkpoint: Clustering

# Gate 1: No empty clusters
cluster_sizes <- table(cl$cluster)
message('Cluster sizes:')
print(cluster_sizes)
if (any(cluster_sizes == 0)) message('WARNING: Empty clusters found. Reduce n_clusters.')

# Gate 2: Membership quality
for (i in seq_along(core_genes)) {
    n_core <- nrow(core_genes[[i]])
    message(sprintf('Cluster %d: %d core genes (membership >0.5)', i, n_core))
}

# Gate 3: Silhouette score (optional validation)
library(cluster)
hard_labels <- cl$cluster
sil <- silhouette(hard_labels, dist(exprs(eset)))
message(sprintf('Mean silhouette: %.3f', mean(sil[, 3])))

Step 4a: Rhythm Detection (Optional - Circadian Designs)

Only applicable when sampling covers 24h+ cycles with sufficient resolution (every 2-4h).

R (MetaCycle)

library(MetaCycle)

# Expects genes as rows, time points as columns
# Column names must be numeric time values (hours)
expr_for_meta <- expr_sig
colnames(expr_for_meta) <- meta$time_hours

write.csv(expr_for_meta, 'expr_for_metacycle.csv')

# Period range 20-28h: standard circadian search window
# Adjust for ultradian (4-12h) or infradian (>28h) rhythms
meta2d('expr_for_metacycle.csv', filestyle = 'csv',
       minper = 20, maxper = 28,
       timepoints = sort(unique(meta$time_hours)),
       outdir = 'metacycle_results')

Python (CosinorPy)

from cosinorpy import file_parser, cosinor

# fit_group expects long-format DataFrame with columns 'x' (time), 'y' (expression), 'test' (gene name)
# Reshape expression matrix to long format before passing
# period=24: standard circadian; adjust for other periodicities
results = cosinor.fit_group(expr_long, period=24, n_components=1)
rhythmic = results[results['p'] < 0.05]

Step 4b: GAM Trajectory Fitting

R (mgcv)

library(mgcv)

cluster_trajectories <- list()
for (cl_id in 1:n_clusters) {
    cl_genes <- names(cl$cluster[cl$cluster == cl_id])
    mean_profile <- colMeans(expr_sig[cl_genes, ])

    df_gam <- data.frame(time = meta$time, expr = mean_profile)

    # k: basis dimension; k=5 sufficient for most time courses
    # Increase to k=10 for >20 time points; decrease to k=3 for <6 time points
    gam_fit <- gam(expr ~ s(time, k = 5), data = df_gam)

    cluster_trajectories[[cl_id]] <- list(
        fit = gam_fit,
        r_squared = summary(gam_fit)$r.sq,
        edf = summary(gam_fit)$edf
    )
    message(sprintf('Cluster %d: R^2 = %.3f, EDF = %.2f', cl_id,
                    summary(gam_fit)$r.sq, summary(gam_fit)$edf))
}

Python (pygam)

from pygam import LinearGAM, s
import numpy as np

for cl_id in range(n_clusters):
    cl_mask = labels == cl_id
    mean_profile = expr_scaled[cl_mask].mean(axis=0)

    # n_splines=5: sufficient for most time courses
    gam = LinearGAM(s(0, n_splines=5)).fit(meta['time'].values.reshape(-1, 1), mean_profile)
    print(f'Cluster {cl_id}: GCV = {gam.statistics_["GCV"]:.4f}')

Step 5: Per-Cluster Pathway Enrichment

R (clusterProfiler)

library(clusterProfiler)
library(org.Hs.eg.db)

# Background = all temporal genes tested (not the full genome)
all_temporal_entrez <- bitr(rownames(expr_sig), fromType = 'SYMBOL', toType = 'ENTREZID',
                            OrgDb = org.Hs.eg.db)

enrichment_results <- list()
for (i in seq_along(core_genes)) {
    genes <- core_genes[[i]]$NAME

    entrez <- bitr(genes, fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)

    # GO Biological Process enrichment with proper background
    ego <- enrichGO(gene = entrez$ENTREZID,
                    universe = all_temporal_entrez$ENTREZID,
                    OrgDb = org.Hs.eg.db,
                    ont = 'BP', pAdjustMethod = 'BH',
                    pvalueCutoff = 0.05, qvalueCutoff = 0.05,
                    readable = TRUE)

    # Simplify redundant GO terms (parent-child hierarchy creates redundancy)
    if (nrow(as.data.frame(ego)) > 0) {
        ego <- simplify(ego, cutoff = 0.7, by = 'p.adjust')
    }

    enrichment_results[[i]] <- ego
    message(sprintf('Cluster %d: %d significant GO terms', i, nrow(as.data.frame(ego))))
}

Python (gseapy)

import gseapy as gp

# Use all temporal genes as background for proper enrichment statistics
all_temporal_genes = list(expr_sig.index)

for cl_id in range(n_clusters):
    cl_genes = [g for g, l in zip(expr_sig.index, labels) if l == cl_id]

    enr = gp.enrichr(gene_list=cl_genes, gene_sets='GO_Biological_Process_2023',
                     organism='human', background=all_temporal_genes,
                     outdir=f'enrichr_cluster_{cl_id}')
    sig_terms = enr.results[enr.results['Adjusted P-value'] < 0.05]
    print(f'Cluster {cl_id}: {len(sig_terms)} significant GO terms')

QC Checkpoint: Enrichment

# Gate: At least 3 clusters should have significant GO terms
clusters_with_terms <- sum(sapply(enrichment_results, function(x) nrow(as.data.frame(x)) > 0))
message(sprintf('Clusters with significant GO terms: %d / %d', clusters_with_terms, length(enrichment_results)))
if (clusters_with_terms < 3) {
    message('WARNING: Few clusters enriched. Check gene ID conversion or relax thresholds.')
}

Parameter Recommendations

Step	Parameter	Recommendation
Temporal DE	Spline df	3 (default); increase to 4-5 for >10 time points
Temporal DE	FDR	0.05 (standard); 0.1 for exploratory clustering
Mfuzz	fuzzifier m	Use mestimate(); typical range 1.5-2.5
Mfuzz	n_clusters	4-20; start with sqrt(n_genes/2), refine with gap statistic
Mfuzz	min membership	0.5 (core genes); 0.3 (exploratory)
MetaCycle	period range	20-28h (circadian); adjust for other periodicities
GAM	k (basis dim)	5 (default); 3 for <6 time points; 10 for >20
clusterProfiler	pvalueCutoff	0.05 (standard); 0.1 (permissive)

Troubleshooting

Issue	Likely Cause	Solution
< 100 temporal genes	Insufficient replicates or noisy data	Add replicates; use DESeq2 LRT instead of limma
Empty Mfuzz clusters	Too many clusters	Reduce n_clusters; check gap statistic
All genes in one cluster	Fuzzifier too low or too few clusters	Increase m or n_clusters
No rhythmic genes	Non-circadian design or low power	Verify 24h+ sampling; increase resolution
GAM overfitting	k too high for time points	Set k = min(n_timepoints - 1, 5)
Few enriched clusters	Gene ID conversion failure	Check species; verify Entrez ID mapping
Low membership scores	High expression noise	Increase fuzzifier m; apply stricter gene filtering

Related Skills

differential-expression/timeseries-de - Temporal DE methods
temporal-genomics/temporal-clustering - Cluster analysis details
temporal-genomics/circadian-rhythms - Rhythm detection details
temporal-genomics/trajectory-modeling - GAM fitting details
pathway-analysis/go-enrichment - Enrichment analysis details
temporal-genomics/periodicity-detection - Unknown-period discovery