SciAgent-Skills mofaplus-multi-omics
Multi-Omics Factor Analysis v2 (MOFA+) with mofapy2. Jointly decompose multiple omics layers (scRNA-seq, ATAC-seq, proteomics, methylation) into latent factors that capture major sources of biological variation. Supports multi-group designs (patients, conditions). Pipeline: prepare AnnData views → create MOFA object → train → inspect variance explained → correlate factors with metadata → visualize and cluster → enrichment of top loadings.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/systems-biology-multiomics/mofaplus-multi-omics" ~/.claude/skills/jaechang-hits-sciagent-skills-mofaplus-multi-omics && rm -rf "$T"
skills/systems-biology-multiomics/mofaplus-multi-omics/SKILL.mdMOFA+ Multi-Omics Factor Analysis
Overview
MOFA+ (Multi-Omics Factor Analysis v2) is an unsupervised statistical framework that jointly decomposes multiple omics datasets into a small set of latent factors. Each factor captures an independent source of variation (e.g., cell cycle, a disease phenotype, a technical batch) and is associated with feature weights (loadings) that reveal which genes, peaks, or proteins drive it. The Python package
mofapy2 produces an HDF5 model file compatible with downstream analysis in both Python and R. MOFA+ extends the original MOFA to support multi-group settings where samples belong to distinct cohorts or conditions.
When to Use
- Integrating two or more omics layers from the same set of cells or samples (e.g., scRNA-seq + scATAC-seq, RNA + proteomics, methylation + RNA)
- Identifying shared and view-specific sources of variation across omics modalities without supervised labels
- Comparing how latent factors differ between patient groups, treatment conditions, or time points in a multi-group analysis
- Reducing multi-omics dimensionality before clustering, trajectory inference, or survival modeling
- Discovering which genomic features (genes, peaks, proteins) drive each factor via sparse loadings
- Annotating latent factors by correlating factor scores with sample metadata (age, stage, treatment response)
- Use scVI / MultiVI (scverse) instead when you need deep generative batch correction across modalities with explicit latent space inference and VAE architecture
- Use LIGER instead when your primary goal is integrating datasets across technologies (e.g., snRNA-seq + snATAC-seq) with shared and dataset-specific factors via iNMF
Prerequisites
- Python packages:
,mofapy2>=0.7
,anndata>=0.10
,numpy>=1.24
,pandas>=2.0
,scipy>=1.11
,matplotlib>=3.7
,seaborn>=0.13
(optional, for MuData integration)muon - Data requirements: One AnnData object per omics view (cells/samples x features). All views must share the same obs (cell/sample) index. Missing samples per group are supported.
- Environment: Python 3.9+; trained model is saved as HDF5 and can be analyzed in R via the
Bioconductor packageMOFA2
pip install mofapy2 anndata muon matplotlib seaborn
Quick Start
import numpy as np import pandas as pd import anndata as ad from mofapy2.run.entry_point import entry_point # Simulate two omics views, 200 cells, 500 RNA genes, 300 ATAC peaks np.random.seed(42) n_cells, n_rna, n_atac = 200, 500, 300 adata_rna = ad.AnnData(np.abs(np.random.randn(n_cells, n_rna)), obs=pd.DataFrame(index=[f"cell_{i}" for i in range(n_cells)])) adata_atac = ad.AnnData(np.abs(np.random.randn(n_cells, n_atac)), obs=adata_rna.obs.copy()) ent = entry_point() ent.set_data_options(scale_groups=False, scale_views=True) ent.set_data_matrix([[adata_rna.X, adata_atac.X]], likelihoods=["gaussian", "gaussian"], views_names=["RNA", "ATAC"], groups_names=["all_cells"], samples_names=[list(adata_rna.obs_names)]) ent.set_model_options(factors=10) ent.set_train_options(iter=500, convergence_mode="fast", seed=42) ent.build() ent.run() ent.save("mofa_model.hdf5") print("Model saved to mofa_model.hdf5")
Workflow
Step 1: Load and Prepare Multi-Omics Data
Each omics layer is represented as an AnnData object. Align cell indices across modalities, log-normalize RNA counts, and binarize or normalize ATAC/methylation data as appropriate.
import numpy as np import pandas as pd import anndata as ad import scipy.sparse as sp # --- RNA-seq: 200 cells x 2000 highly variable genes --- np.random.seed(42) n_cells = 200 cell_ids = [f"cell_{i:03d}" for i in range(n_cells)] # Simulate log-normalized counts (in practice: load from h5ad after Scanpy preprocessing) rna_counts = np.abs(np.random.randn(n_cells, 2000) * 2) adata_rna = ad.AnnData( X=rna_counts, obs=pd.DataFrame( {"condition": ["A"] * 100 + ["B"] * 100, "patient": [f"P{i % 10}" for i in range(n_cells)]}, index=cell_ids ), var=pd.DataFrame(index=[f"Gene_{i}" for i in range(2000)]) ) # --- ATAC-seq: same cells x 1000 peaks --- atac_matrix = (np.random.rand(n_cells, 1000) > 0.8).astype(float) adata_atac = ad.AnnData( X=atac_matrix, obs=adata_rna.obs.copy(), var=pd.DataFrame(index=[f"Peak_{i}" for i in range(1000)]) ) # Confirm alignment assert list(adata_rna.obs_names) == list(adata_atac.obs_names), "Cell indices must match" print(f"RNA: {adata_rna.shape}, ATAC: {adata_atac.shape}") print(f"Conditions: {adata_rna.obs['condition'].value_counts().to_dict()}")
Step 2: Create the MOFA+ Model Object
Instantiate the
entry_point and register all data views. Views are provided as a list-of-lists: data[groups][views]. Assign meaningful view and group names for interpretability.
from mofapy2.run.entry_point import entry_point ent = entry_point() # Configure data options before setting data ent.set_data_options( scale_groups=False, # Do not rescale variance between groups scale_views=True, # Rescale each view to unit variance (recommended when views differ in scale) ) # Provide data as list-of-lists: [groups][views] # Single group → wrap each view in a list ent.set_data_matrix( data=[[adata_rna.X, adata_atac.X]], # outer list = groups, inner = views likelihoods=["gaussian", "bernoulli"], # gaussian for continuous, bernoulli for binary ATAC views_names=["RNA", "ATAC"], groups_names=["all_cells"], samples_names=[list(adata_rna.obs_names)] # one list per group ) print("Data registered. Views: RNA, ATAC | Groups: all_cells")
Step 3: Set Model and Training Options
Configure the number of factors and training hyperparameters. More factors capture finer variation but increase computation and risk overfitting;
convergence_mode="medium" balances speed and accuracy.
# Model options ent.set_model_options( factors=15, # Number of latent factors (start with 15; prune inactive ones automatically) spikeslab_weights=True, # Sparse weight prior (ARD+spike-slab); recommended for feature selection ard_factors=True, # Automatic relevance determination per factor per view ard_weights=True, # ARD per feature weight; enables pruning of irrelevant features ) # Training options ent.set_train_options( iter=1000, # Maximum EM iterations convergence_mode="medium", # "fast" (<1000 iter), "medium" (default), "slow" (>5000 iter) startELBO=1, # Start computing ELBO from iteration 1 freqELBO=5, # Compute ELBO every 5 iterations dropR2=0.01, # Drop factors explaining < 1% variance (set to None to disable) seed=42, verbose=False ) print("Model and training options set")
Step 4: Build and Train the Model
Build the internal data structures, then run variational inference. Training produces a fitted model where each factor's weights and scores are optimized to maximize the evidence lower bound (ELBO).
# Build internal model structure ent.build() # Run training (EM algorithm with variational Bayes updates) ent.run() # Save trained model to HDF5 — required for downstream analysis output_path = "mofa_model.hdf5" ent.save(output_path, overwrite=True) print(f"Model trained and saved to {output_path}")
Step 5: Load Trained Model and Inspect Variance Explained
Load the HDF5 model and inspect how much variance each factor explains per view. Factors explaining less than ~1-2% total variance are typically noise.
import h5py import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns def load_mofa_r2(model_path): """Extract variance explained (R2) per factor per view from MOFA+ HDF5.""" with h5py.File(model_path, "r") as f: r2 = f["variance_explained"]["r2_per_factor"] views = [v.decode() for v in f["views"]["views"][:]] groups = [g.decode() for g in f["groups"]["groups"][:]] # r2_per_factor: shape (n_groups, n_views, n_factors) r2_array = np.stack([r2[g][:] for g in groups], axis=0) # (groups, views, factors) # Average across groups; shape → (n_views, n_factors) r2_mean = r2_array.mean(axis=0) n_factors = r2_mean.shape[1] df = pd.DataFrame(r2_mean * 100, index=views, columns=[f"Factor{i+1}" for i in range(n_factors)]) return df r2_df = load_mofa_r2("mofa_model.hdf5") print("Variance explained (%) per factor per view:") print(r2_df.round(2)) # Heatmap of variance explained fig, ax = plt.subplots(figsize=(max(8, r2_df.shape[1] * 0.6), 3)) sns.heatmap(r2_df, annot=True, fmt=".1f", cmap="YlOrRd", linewidths=0.5, ax=ax, vmin=0) ax.set_title("MOFA+ Variance Explained (%) per Factor per View") ax.set_xlabel("Factor") ax.set_ylabel("Omics View") plt.tight_layout() plt.savefig("mofa_variance_explained.png", dpi=200) print("Saved mofa_variance_explained.png")
Step 6: Extract Factor Scores and Correlate with Metadata
Factor scores (Z matrix) are the per-sample coordinates in factor space. Correlate scores with continuous or categorical metadata to biologically annotate each factor.
import scipy.stats as stats def load_mofa_factors(model_path): """Load factor scores (Z) from MOFA+ HDF5. Returns DataFrame (samples x factors).""" with h5py.File(model_path, "r") as f: groups = [g.decode() for g in f["groups"]["groups"][:]] factors_list = [] for g in groups: z = f["expectations"]["Z"][g][:] # shape: (n_factors, n_samples) samples = [s.decode() for s in f["samples"][g][:]] n_factors = z.shape[0] df = pd.DataFrame(z.T, index=samples, columns=[f"Factor{i+1}" for i in range(n_factors)]) factors_list.append(df) return pd.concat(factors_list, axis=0) factors_df = load_mofa_factors("mofa_model.hdf5") print(f"Factor scores: {factors_df.shape} (cells x factors)") # Merge with metadata meta = adata_rna.obs[["condition", "patient"]].copy() factors_meta = factors_df.join(meta) # Point-biserial correlation: factor score vs binary metadata factor_cols = [c for c in factors_meta.columns if c.startswith("Factor")] print("\nFactor–Condition correlation (eta-squared approximation):") for fc in factor_cols[:5]: groups_vals = [factors_meta.loc[factors_meta["condition"] == g, fc].values for g in factors_meta["condition"].unique()] stat, pval = stats.f_oneway(*groups_vals) print(f" {fc}: F={stat:.2f}, p={pval:.3f}")
Step 7: Visualize Factors — Scatter Plots and Feature Heatmaps
Plot factor score scatter plots colored by metadata, and heatmaps of the top-weighted features (loadings) per factor to understand what each factor captures.
def load_mofa_weights(model_path, view_name): """Load feature weights (W) for a specific view. Returns DataFrame (features x factors).""" with h5py.File(model_path, "r") as f: views = [v.decode() for v in f["views"]["views"][:]] view_idx = views.index(view_name) # Weights stored per view as (n_factors, n_features) w = f["expectations"]["W"][view_name][:] features = [ft.decode() for ft in f["features"][view_name][:]] n_factors = w.shape[0] df = pd.DataFrame(w.T, index=features, columns=[f"Factor{i+1}" for i in range(n_factors)]) return df # --- Factor scatter plot --- fig, axes = plt.subplots(1, 2, figsize=(12, 5)) for ax, (fx, fy) in zip(axes, [("Factor1", "Factor2"), ("Factor1", "Factor3")]): for cond, grp in factors_meta.groupby("condition"): ax.scatter(grp[fx], grp[fy], label=cond, alpha=0.6, s=20) ax.set_xlabel(fx) ax.set_ylabel(fy) ax.legend(title="Condition") ax.set_title(f"{fx} vs {fy}") plt.tight_layout() plt.savefig("mofa_factor_scatter.png", dpi=200) print("Saved mofa_factor_scatter.png") # --- Top-loading heatmap for RNA view --- weights_rna = load_mofa_weights("mofa_model.hdf5", "RNA") top_n = 20 top_features = [] for fc in [f"Factor{i+1}" for i in range(min(5, weights_rna.shape[1]))]: top_pos = weights_rna[fc].nlargest(top_n // 2).index.tolist() top_neg = weights_rna[fc].nsmallest(top_n // 2).index.tolist() top_features.extend(top_pos + top_neg) top_features = list(dict.fromkeys(top_features)) # unique, preserve order fig, ax = plt.subplots(figsize=(8, max(6, len(top_features) * 0.3))) sns.heatmap(weights_rna.loc[top_features, [f"Factor{i+1}" for i in range(min(5, weights_rna.shape[1]))]], cmap="RdBu_r", center=0, ax=ax, yticklabels=True) ax.set_title("Top RNA Feature Weights per Factor") plt.tight_layout() plt.savefig("mofa_rna_weights_heatmap.png", dpi=200) print("Saved mofa_rna_weights_heatmap.png")
Step 8: Downstream — Cluster Cells by Factor Scores and Enrichment
Use factor scores as a low-dimensional embedding for clustering, and extract top-weighted genes per factor for pathway enrichment.
from sklearn.cluster import KMeans from sklearn.preprocessing import StandardScaler import warnings # --- Cluster cells using factor scores --- factor_cols = [c for c in factors_df.columns if c.startswith("Factor")] X_factors = StandardScaler().fit_transform(factors_df[factor_cols].values) kmeans = KMeans(n_clusters=4, random_state=42, n_init=10) with warnings.catch_warnings(): warnings.simplefilter("ignore") cluster_labels = kmeans.fit_predict(X_factors) factors_df["mofa_cluster"] = cluster_labels.astype(str) print(f"K-means clusters (k=4):\n{pd.Series(cluster_labels).value_counts().sort_index()}") # --- Extract top-weighted genes per factor for enrichment input --- weights_rna = load_mofa_weights("mofa_model.hdf5", "RNA") enrichment_inputs = {} for fc in [f"Factor{i+1}" for i in range(min(5, weights_rna.shape[1]))]: # Positive weights: top activating genes; negative: top repressing genes top_pos = weights_rna[fc].nlargest(100).index.tolist() top_neg = weights_rna[fc].nsmallest(100).index.tolist() enrichment_inputs[f"{fc}_positive"] = top_pos enrichment_inputs[f"{fc}_negative"] = top_neg # Save gene lists for external enrichment (e.g., gseapy Enrichr) for name, genes in list(enrichment_inputs.items())[:2]: print(f"\n{name} — top 10: {genes[:10]}") # Example: run ORA with gseapy (install separately) # import gseapy # enr = gseapy.enrichr(gene_list=enrichment_inputs["Factor1_positive"], # gene_sets="GO_Biological_Process_2023", # organism="human", outdir=None) # print(enr.results.head(5)[["Term", "Adjusted P-value", "Genes"]]) factors_df.to_csv("mofa_factor_scores.csv") print("\nFactor scores with cluster labels saved to mofa_factor_scores.csv")
Key Parameters
| Parameter | Default | Range / Options | Effect |
|---|---|---|---|
| | – | Number of latent factors to infer; inactive ones pruned by ARD |
| (required) | , , | Per-view likelihood; gaussian for normalized continuous, bernoulli for binary ATAC, poisson for raw counts |
| | / | Rescale each view to unit variance; recommended when views differ in scale or unit |
| | / | Rescale variance across groups; set if groups have very different total variances |
| | / | Spike-and-slab sparsity prior on weights; enables feature selection via near-zero weights |
| | / | Automatic relevance determination per factor per view; prunes factors not used in a view |
| | – | Maximum EM iterations; convergence usually reached in 200–800 |
| | , , | ELBO convergence tolerance: fast = 1e-4, medium = 1e-6, slow = 1e-8 |
| | , – | Drop factors explaining less than this fraction of variance; keeps all |
| | – | Iteration to start ELBO monitoring; set higher to skip initial instability |
Key Concepts
Latent Factors and Loadings
Each latent factor Z_k (a vector of length n_samples) represents a source of variation. The corresponding weight matrix W_k (a vector of length n_features per view) contains the loading of each feature on that factor. Factors with spike-and-slab priors produce sparse loadings: most weights are shrunk to near zero, leaving a small set of features that meaningfully drive the factor. A positive weight means higher factor score correlates with higher feature expression.
Multi-Group vs Single-Group
A "group" in MOFA+ is a set of samples that share the same factor weight matrices W but have independent factor score distributions Z. Use multi-group analysis when:
- Samples come from distinct cohorts with batch-level differences (patients, datasets)
- You want to compare how much each factor explains within vs between groups
- You expect the same biological programs to operate differently across conditions
Single-group analysis (all samples in one group) is appropriate when samples are from a single experiment with no major batch structure.
Variance Explained Heatmap Interpretation
The R2 heatmap (views x factors) is the primary diagnostic output. Factors should show:
- View-specific R2: a factor driving only RNA captures transcriptional programs; one driving both RNA and ATAC captures chromatin-accessibility-coupled gene expression
- Declining R2: Factor 1 explains the most variance; factors should be inspected in order
- Factors with <1% R2 in all views can generally be ignored
Common Recipes
Recipe: Multi-Group Analysis Across Conditions
Use when comparing two or more patient groups or experimental conditions, where you want to identify condition-specific vs shared factors.
from mofapy2.run.entry_point import entry_point import numpy as np import pandas as pd import anndata as ad # Simulate two groups: condition A (100 cells) and condition B (100 cells) np.random.seed(0) n_per_group, n_genes, n_peaks = 100, 1000, 500 groups = {"condA": {}, "condB": {}} for g in groups: groups[g]["rna"] = np.abs(np.random.randn(n_per_group, n_genes)) groups[g]["atac"] = (np.random.rand(n_per_group, n_peaks) > 0.75).astype(float) sample_ids_A = [f"A_cell_{i}" for i in range(n_per_group)] sample_ids_B = [f"B_cell_{i}" for i in range(n_per_group)] ent = entry_point() ent.set_data_options(scale_groups=False, scale_views=True) # Multi-group: data[groups][views] — two groups, each with RNA and ATAC ent.set_data_matrix( data=[[groups["condA"]["rna"], groups["condA"]["atac"]], [groups["condB"]["rna"], groups["condB"]["atac"]]], likelihoods=["gaussian", "bernoulli"], views_names=["RNA", "ATAC"], groups_names=["condA", "condB"], samples_names=[sample_ids_A, sample_ids_B] ) ent.set_model_options(factors=10, spikeslab_weights=True, ard_factors=True) ent.set_train_options(iter=500, convergence_mode="fast", seed=0, verbose=False) ent.build() ent.run() ent.save("mofa_multigroup.hdf5", overwrite=True) # Compare factor scores between groups factors_all = load_mofa_factors("mofa_multigroup.hdf5") factors_all["group"] = ["condA"] * n_per_group + ["condB"] * n_per_group print(f"Multi-group model trained. Factor scores: {factors_all.shape}") print(factors_all.groupby("group")[["Factor1", "Factor2"]].mean().round(3))
Recipe: Identify and Annotate Factors by Top-Weighted Genes
Retrieve the top positive and negative loading genes per factor and print a summary table for biological annotation.
import pandas as pd def annotate_factors(model_path, view_name="RNA", top_n=20, n_factors=5): """ Summarize top-loading features per factor to assist biological annotation. Returns a DataFrame with factor names, top positive genes, top negative genes, and the absolute weight range (an activity proxy). """ weights = load_mofa_weights(model_path, view_name) factor_cols = [f"Factor{i+1}" for i in range(min(n_factors, weights.shape[1]))] rows = [] for fc in factor_cols: w = weights[fc] top_pos = w.nlargest(top_n).index.tolist() top_neg = w.nsmallest(top_n).index.tolist() rows.append({ "Factor": fc, "Max_weight": round(w.max(), 4), "Min_weight": round(w.min(), 4), "Top_positive": ", ".join(top_pos[:5]), "Top_negative": ", ".join(top_neg[:5]), }) summary = pd.DataFrame(rows) return summary summary_df = annotate_factors("mofa_model.hdf5", view_name="RNA", top_n=20, n_factors=5) print("\nFactor annotation summary (RNA view):") print(summary_df.to_string(index=False)) summary_df.to_csv("mofa_factor_annotation.csv", index=False) print("\nSaved mofa_factor_annotation.csv")
Expected Outputs
| Output | Description |
|---|---|
| Trained MOFA+ model — factor scores, weights, ELBO trace, variance explained |
| Heatmap of R2 (%) per factor per view; primary diagnostic for factor selection |
| Scatter plots of Factor1 vs Factor2/3 colored by metadata (condition, patient) |
| Heatmap of top RNA feature weights across the first 5 factors |
| Table of per-cell factor scores (cells x factors) with cluster labels |
| Factor annotation table: top positive/negative genes per factor |
| Per-factor gene lists | Input for gseapy Enrichr or GSEA to identify enriched pathways per factor |
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
in | Mismatched number of groups, views, or sample list dimensions | Ensure , , , , all have consistent lengths |
| All factors show near-zero variance explained | Data not preprocessed or scale mismatch across views | Normalize each view before input; set ; verify non-zero variance in input matrices |
| Model trains but factor scores are NaN | Convergence failure due to extreme values or near-singular data | Check for Inf/NaN in input matrices; reduce ; try first |
| Too many factors pruned (only 1-2 remain) | threshold too aggressive or insufficient variation in data | Set or ; increase data diversity or reduce noise |
file cannot be read | File truncated due to crash during training | Re-run training; check disk space; use in |
| Factor scores identical across all samples | Single-sample group or zero-variance input view | Confirm at least 2 distinct samples per group; check input matrix is not all zeros |
| Very slow training (>1 hr) | Large feature space (>10k features per view) or many factors | Pre-filter to top HVGs (2000-5000) per view; reduce factors to 10-15; enable |
| ELBO not converging (oscillates) | Learning rate instability or poorly scaled data | Increase ; standardize each view independently; use |
| Weights all near zero for one view | Bernoulli likelihood on continuous data or vice versa | Verify list matches view data types; use for normalized RNA |
| Package not installed | |
References
- MOFA2 GitHub repository — Source code for mofapy2 (Python) and MOFA2 R package
- MOFA2 documentation and tutorials — Official documentation with vignettes for Python and R
- Argelaguet et al. (2020) — "MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data", Genome Biology
- Argelaguet et al. (2018) — Original MOFA paper, Molecular Systems Biology
- mofapy2 PyPI package — Installation and version history