BioSkills bio-temporal-genomics-temporal-clustering

Clusters genes by temporal expression profile shape using Mfuzz soft clustering, TCseq, and DEGreport degPatterns. Groups co-regulated genes into shared trajectory patterns via fuzzy c-means or hierarchical approaches. Use when categorizing temporally dynamic genes into response groups or identifying co-expression modules across time points. Requires temporally variable genes identified first (see differential-expression/timeseries-de).

install

source · Clone the upstream repo

git clone https://github.com/GPTomics/bioSkills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/temporal-genomics/temporal-clustering" ~/.claude/skills/gptomics-bioskills-bio-temporal-genomics-temporal-clustering && rm -rf "$T"

manifest: temporal-genomics/temporal-clustering/SKILL.md

source content

Version Compatibility

Reference examples tested with: numpy 1.26+, scanpy 1.10+, scikit-learn 1.4+

Before using code patterns, verify installed versions match. If versions differ:

Python: `pip show <package>`, then `help(module.function)` to check signatures
R: `packageVersion('<pkg>')`, then `?function_name` to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Temporal Gene Clustering

"Group my time-course genes by expression pattern shape" → Cluster temporally variable genes into co-expression modules by trajectory shape using fuzzy c-means (Mfuzz), hierarchical methods, or DTW-based approaches, revealing coordinated response patterns.

R: `Mfuzz::mfuzz()` for soft (fuzzy) temporal clustering
Python: `sklearn.cluster.KMeans` on z-scored time profiles for hard clustering

Groups genes with similar temporal expression dynamics into clusters, revealing shared regulatory programs and coordinated response patterns across time-course experiments.

Core Workflow

1. Select temporally variable genes (pre-filtered by DE or variance)
2. Standardize expression profiles (z-score across timepoints)
3. Choose clustering method and number of clusters
4. Assign genes to clusters (hard or soft membership)
5. Validate clusters and run functional enrichment per cluster
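
The steps above can be sketched end to end for the hard-clustering (KMeans) route. This is a minimal illustration on synthetic data, not the skill's reference implementation: the gene counts, noise levels, the 0.5 SD cutoff, and k=2 are all assumptions chosen so the example is easy to check.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
timepoints = np.arange(6)

# Synthetic data (illustrative only): 30 rising, 30 falling, 40 flat genes
rising = 2.0 * timepoints + rng.normal(0, 0.3, size=(30, 6))
falling = -2.0 * timepoints + 10 + rng.normal(0, 0.3, size=(30, 6))
flat = 5.0 + rng.normal(0, 0.05, size=(40, 6))
expr = np.vstack([rising, falling, flat])

# Step 1: keep temporally variable genes (SD across timepoints above a cutoff)
sd = expr.std(axis=1, ddof=1)
variable = expr[sd > 0.5]

# Step 2: z-score each gene across timepoints
mean = variable.mean(axis=1, keepdims=True)
scale = variable.std(axis=1, ddof=1, keepdims=True)
z = (variable - mean) / scale

# Steps 3-4: choose k and assign hard cluster labels
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(z)

# Step 5 (validation and per-cluster enrichment) would follow from these labels
print(variable.shape[0], np.bincount(labels))
```

On real data, k would be chosen with a validity index rather than fixed in advance.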

Mfuzz (R/Bioconductor)

Goal: Group temporally variable genes into co-expression clusters by trajectory shape using fuzzy c-means, revealing shared regulatory programs.

Approach: Create an ExpressionSet from the time-series matrix, filter low-variance genes, standardize profiles, estimate the fuzzifier parameter, then run fuzzy c-means to assign soft cluster memberships.

Soft (fuzzy) c-means clustering assigns genes membership scores across all clusters, capturing genes with ambiguous temporal behavior.
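
To make the idea of membership scores concrete, here is a small NumPy sketch of the standard fuzzy c-means membership formula for fixed centroids. It is an illustration of the general technique, not Mfuzz's internal code; the function name `fuzzy_memberships` and the toy centroids are hypothetical.

```python
import numpy as np

def fuzzy_memberships(profiles, centroids, m=2.0, eps=1e-12):
    """Membership of each profile in each cluster; every row sums to 1."""
    # Euclidean distance from every profile to every centroid: (n_genes, n_clusters)
    d = np.linalg.norm(profiles[:, None, :] - centroids[None, :, :], axis=2)
    d = np.maximum(d, eps)  # avoid division by zero when a profile sits on a centroid
    # Closer centroids get larger weight; the fuzzifier m controls how sharply
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

centroids = np.array([[-1.0, 0.0, 1.0],   # rising pattern
                      [1.0, 0.0, -1.0]])  # falling pattern
profiles = np.array([[-1.0, 0.0, 1.0],    # sits on the rising centroid
                     [0.0, 0.0, 0.0]])    # equally far from both centroids
u = fuzzy_memberships(profiles, centroids, m=2.0)
print(np.round(u, 3))
```

The first profile gets nearly all of its membership in the rising cluster, while the second is split evenly, which is the kind of ambiguous gene that soft clustering is meant to expose.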

Setup and Preprocessing

library(Mfuzz)
library(Biobase)

# Rows = genes, columns = timepoints (mean across replicates)
expr_mat <- as.matrix(read.csv('temporal_expression.csv', row.names = 1))

# Create ExpressionSet
eset <- ExpressionSet(assayData = expr_mat)

# filter.std removes genes with near-zero variance across timepoints
# min.std=0.5: removes flat genes; adjust based on data spread
eset <- filter.std(eset, min.std = 0.5)

# Standardize each gene to mean=0, sd=1 across timepoints
eset <- standardise(eset)
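
The input matrix above is described as one column per timepoint, averaged across replicates. A minimal sketch of that collapsing step, written in Python for illustration (the same idea applies in R); `sample_expr`, `sample_time`, and all values are hypothetical.

```python
import numpy as np

# Hypothetical per-sample matrix: 3 genes x 6 samples (2 replicates at each of 3 timepoints)
sample_expr = np.array([
    [1.0, 3.0, 5.0, 7.0, 9.0, 11.0],
    [4.0, 4.0, 2.0, 2.0, 0.0, 0.0],
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
])
sample_time = np.array([0, 0, 6, 6, 12, 12])  # timepoint of each column (illustrative hours)

# Average replicate columns within each timepoint, keeping timepoints in sorted order
times = np.unique(sample_time)
mean_expr = np.column_stack(
    [sample_expr[:, sample_time == t].mean(axis=1) for t in times]
)
print(mean_expr)
```

The third gene is flat after averaging, so a standard-deviation filter such as the one above would remove it before standardization.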

Fuzzifier Estimation and Clustering

# mestimate(): data-driven fuzzifier estimate based on gene count and dimensions
# Prevents clusters from being too crisp (m close to 1) or too fuzzy (m >> 2)
m <- mestimate(eset)
cat(sprintf('Estimated fuzzifier: %.2f\n', m))

# c=8: typical starting point for 6-12 timepoints; refine with cluster validity indices
# mfuzz() starts from random initial centroids, so fix the seed for reproducible clusters
set.seed(42)
cl <- mfuzz(eset, c = 8, m = m)

# acore(): per-cluster list of genes whose membership in that cluster is >= min.acore
# 0.5 threshold: common cutoff; genes below it in every cluster are not clearly assigned to one pattern
core_genes <- acore(eset, cl, min.acore = 0.5)
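
To show how the fuzzifier estimate depends on the number of genes and timepoints, here is a Python sketch of the Schwämmle and Jensen formula that, to my understanding, `mestimate()` implements. Treat the constants as an assumption and check them against the installed Mfuzz source before relying on them; the function name `estimate_fuzzifier` is hypothetical.

```python
import math

def estimate_fuzzifier(n_genes, n_timepoints):
    """Fuzzifier estimate from data dimensions (constants assumed, verify against Mfuzz)."""
    n, d = n_genes, n_timepoints
    return (1.0
            + (1418.0 / n + 22.05) * d ** -2
            + (12.33 / n + 0.243) * d ** (-0.0406 * math.log(n) - 0.1134))

# With the gene count fixed, more timepoints give a smaller (crisper) fuzzifier
for d in (4, 8, 12):
    print(d, round(estimate_fuzzifier(1000, d), 3))
```

Under this formula, short time courses get a larger fuzzifier, so memberships are spread more evenly and fewer genes pass a fixed membership cutoff.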

Visualization

# Temporal profile plot with membership-based color intensity
mfuzz.plot2(eset, cl, mfrow = c(2, 4), time.labels = colnames(expr_mat),
            centre = TRUE, x11 = FALSE)

# Cluster overlap plot: projects cluster centroids and links clusters whose membership overlap exceeds thres
overlap.plot(cl, over = overlap(cl), thres = 0.05)

Cluster Number Selection

# Evaluate multiple k values; pick the k beyond which the minimum centroid distance only declines slowly
# Range 4-20: typical for temporal data; fewer for simple designs, more for dense sampling
# Mfuzz::Dmin() automates a similar scan with repeated runs per k
set.seed(42)
validity_scores <- numeric()
for (k in 4:20) {
    cl_k <- mfuzz(eset, c = k, m = m)
    # Minimum centroid distance: values near zero mean two clusters have nearly identical profiles
    centroids <- cl_k$centers
    dists <- as.matrix(dist(centroids))
    diag(dists) <- Inf
    validity_scores <- c(validity_scores, min(dists))
}
plot(4:20, validity_scores, type = 'b', xlab = 'Number of clusters', ylab = 'Min centroid distance')
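
The validity score used in the loop above is simply the smallest pairwise distance between centroids. A minimal NumPy sketch of that calculation, with hypothetical centroids in which two clusters are near-duplicates:

```python
import numpy as np

def min_centroid_distance(centroids):
    """Smallest Euclidean distance between any two cluster centroids."""
    diff = centroids[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diff, axis=2)
    np.fill_diagonal(dists, np.inf)  # ignore each centroid's distance to itself
    return float(dists.min())

# Illustrative centroids: the second and third describe almost the same profile
centroids = np.array([
    [-1.0, 0.0, 1.0],
    [1.0, 0.0, -1.0],
    [1.0, 0.1, -1.0],
])
print(min_centroid_distance(centroids))
```

A small value like this one signals that k is probably too large, because one temporal pattern has been split across two clusters.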

TCseq (R/Bioconductor)

Temporal clustering with fuzzy c-means and k-means on time-course sequencing data.

library(TCseq)

# timeclust with fuzzy c-means
# algo='cm': fuzzy c-means; captures soft membership like Mfuzz
# k=6: number of clusters; test range and evaluate with silhouette
tc <- timeclust(expr_mat, algo = 'cm', k = 6, standardize = TRUE)

# Cluster assignment plot
timeclustplot(tc, value = 'z-score', cols = 3)

# k-means alternative for hard clustering
tc_km <- timeclust(expr_mat, algo = 'km', k = 6, standardize = TRUE)

DEGreport degPatterns (R)

Automatic cluster number selection and publication-ready plots.

library(DEGreport)

# degPatterns groups genes by hierarchical clustering and selects the cluster count automatically
# Input is per-sample: columns of expr_mat are samples (not replicate means), and
# rownames(sample_info) must match colnames(expr_mat)
# time: metadata column defining timepoint order (factor)
# col: column in metadata for coloring (e.g., condition)
# minc=15: minimum genes per cluster to retain; prevents singleton clusters
patterns <- degPatterns(expr_mat, metadata = sample_info,
                        time = 'timepoint', col = 'condition', minc = 15)

# Access cluster assignments
cluster_df <- patterns$df

# Plot individual clusters
degPlotCluster(patterns$normalized, time = 'timepoint', color = 'condition')

tslearn (Python)

Time-series clustering with Dynamic Time Warping (DTW) distance.

import numpy as np
from sklearn.metrics import silhouette_score
from tslearn.clustering import TimeSeriesKMeans
from tslearn.metrics import cdist_dtw
from tslearn.preprocessing import TimeSeriesScalerMeanVariance

# expr_mat: numpy array of shape (n_genes, n_timepoints)
# tslearn expects (n_series, n_timepoints, n_features); the scaler z-scores each gene over time
expr_scaled = TimeSeriesScalerMeanVariance().fit_transform(expr_mat[:, :, np.newaxis])

# DTW metric: handles phase-shifted profiles better than Euclidean
# Soft-DTW (metric='softdtw') is a differentiable variant of DTW
# n_clusters=8: starting point; evaluate with silhouette
model = TimeSeriesKMeans(n_clusters=8, metric='dtw', max_iter=50, random_state=42)
labels = model.fit_predict(expr_scaled)

# sklearn silhouette_score has no built-in DTW metric; pass a precomputed DTW distance matrix
dist_matrix = cdist_dtw(expr_scaled)
sil = silhouette_score(dist_matrix, labels, metric='precomputed')
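
To see why DTW tolerates phase shifts, here is a minimal pure-NumPy DTW sketch compared against Euclidean distance on two toy profiles. It uses squared pointwise cost and returns the square root of the optimal path cost, which is one common convention; it is an illustration, not tslearn's implementation, and `dtw_distance` is a hypothetical helper.

```python
import numpy as np

def dtw_distance(a, b):
    """Unconstrained DTW: sqrt of the minimum summed squared cost over warping paths."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))

early = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])  # peak at the second timepoint
late = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0])   # same peak, two timepoints later

print('Euclidean:', float(np.linalg.norm(early - late)))
print('DTW:', dtw_distance(early, late))
```

Unconstrained DTW treats the two profiles as identical because the shift is absorbed by warping. With only a handful of timepoints this can merge early and late responders, so it is worth checking whether timing differences matter biologically before choosing DTW.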

Cluster Number Selection with Silhouette

# Test k from 3-15; pick k with highest silhouette score
# 3-15 range: fewer than 3 is too coarse; more than 15 rarely adds biological meaning
k_values = list(range(3, 16))
sil_scores = []
for k in k_values:
    model = TimeSeriesKMeans(n_clusters=k, metric='softdtw', max_iter=30, random_state=42)
    labels = model.fit_predict(expr_scaled)
    # Euclidean silhouette as computational shortcut; DTW silhouette is O(n^2 * T^2)
    sil_scores.append(silhouette_score(expr_scaled.squeeze(), labels, metric='euclidean'))

best_k = k_values[int(np.argmax(sil_scores))]

Method Comparison

| Method | Clustering Type | Distance | Best For |
|---|---|---|---|
| Mfuzz | Soft (fuzzy c-means) | Euclidean | Standard temporal profiling |
| TCseq | Soft or hard | Euclidean | RNA-seq time courses |
| DEGreport | Hierarchical | Correlation | Automatic k selection |
| tslearn | Hard (k-means) | DTW/soft-DTW | Phase-shifted profiles |

Tips

Always standardize (z-score) before clustering; otherwise, highly expressed genes dominate
Soft clustering (Mfuzz) is preferred when genes may participate in multiple temporal programs
DTW-based clustering captures time-shifted patterns but is computationally expensive for >5000 genes
Run functional enrichment (GO/GSEA) per cluster to interpret biological meaning
A membership threshold of 0.5 for Mfuzz can exclude roughly 30-50% of genes as ambiguous; lower it if too few genes remain
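
Before settling on a membership cutoff, it can help to tabulate how many genes survive at several thresholds. A minimal sketch on a randomly generated membership matrix (hypothetical data; in Mfuzz the corresponding matrix is `cl$membership`):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical membership matrix (n_genes x n_clusters); each row sums to 1
membership = rng.dirichlet(alpha=[0.5] * 4, size=500)

top = membership.max(axis=1)  # each gene's strongest membership
retained = {}
for threshold in (0.3, 0.5, 0.7):
    retained[threshold] = float((top >= threshold).mean())
    print(f'min membership {threshold}: {retained[threshold]:.0%} of genes retained')
```

The retained fraction can only fall as the threshold rises, so the table shows directly how much of the gene set a stricter cutoff would cost.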

Related Skills

circadian-rhythms - Rhythm-specific clustering by phase
trajectory-modeling - Continuous trajectory fitting before clustering
differential-expression/timeseries-de - Upstream temporal DE for gene selection
pathway-analysis/go-enrichment - Per-cluster functional enrichment