SciAgent-Skills muon-multiomics-singlecell

Multi-modal single-cell analysis with muon and MuData. Joint RNA+ATAC (10x Multiome), CITE-seq (RNA+protein), and other multi-omics combinations. MuData container holds modality-specific AnnData objects with shared obs. Weighted Nearest Neighbor (WNN) graph for joint embedding, per-modality preprocessing, cross-modal factor analysis with MOFA. Use scanpy-scrna-seq for single-modality RNA analysis; use muon when combining two or more omics modalities from the same cells.

install

source · Clone the upstream repo

git clone https://github.com/jaechang-hits/SciAgent-Skills

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/systems-biology-multiomics/muon-multiomics-singlecell" ~/.claude/skills/jaechang-hits-sciagent-skills-muon-multiomics-singlecell && rm -rf "$T"

manifest: skills/systems-biology-multiomics/muon-multiomics-singlecell/SKILL.md

source content

muon — Multi-Modal Single-Cell Analysis

Overview

muon is a Python framework for multi-modal single-cell data analysis that extends the AnnData ecosystem. Its core data structure,

MuData

, holds multiple

AnnData

objects (one per modality: RNA, ATAC, protein, etc.) with shared observation and variable axes, enabling coordinated operations across all modalities. muon provides modality-specific preprocessing routines (TF-IDF and LSI for ATAC, CLR normalization for surface proteins), Weighted Nearest Neighbor (WNN) graph construction for joint dimensionality reduction, and cross-modal analysis tools. It integrates directly with scanpy, scvi-tools, and MOFA+ for a complete multi-omics single-cell workflow.

When to Use

Analyzing 10x Genomics Multiome data (simultaneous RNA + ATAC from the same nuclei)
Processing CITE-seq experiments (RNA + surface protein from the same cells)
Building joint UMAP embeddings that integrate signals from two or more modalities via WNN
Preprocessing ATAC-seq modalities (TF-IDF normalization, LSI dimensionality reduction)
Normalizing surface protein data with centered log-ratio (CLR) normalization
Performing cross-modal feature linkage (associating ATAC peaks with nearby gene expression)
Applying MOFA+ factor analysis across multiple omics layers within a unified container
Use scanpy-scrna-seq instead when analyzing a single RNA-seq modality without any co-measured omics
Use scvi-tools (MultiVI / totalVI) when you need probabilistic deep generative batch correction across modalities

Prerequisites

Python packages:

muon>=0.1.6

scanpy>=1.10

anndata>=0.10

numpy

scipy

pandas

matplotlib

leidenalg

Data requirements: 10x Multiome h5 or h5mu files, or per-modality AnnData objects (cells x features) sharing the same
```
obs_names
```
Environment: Python 3.9+; 16 GB+ RAM for datasets >50k cells; optional
```
mofapy2
```
for MOFA factor analysis

pip install "muon[all]" "scanpy[leiden]" anndata
# Optional: for MOFA+ integration
pip install mofapy2

Quick Start

import muon as mu
import scanpy as sc
import numpy as np
import anndata as ad
import pandas as pd

# Simulate a small RNA + ATAC MuData object (200 cells)
np.random.seed(42)
n_cells = 200
rna  = ad.AnnData(np.abs(np.random.negative_binomial(5, 0.3, (n_cells, 2000))).astype(float),
                  obs=pd.DataFrame(index=[f"cell_{i}" for i in range(n_cells)]),
                  var=pd.DataFrame(index=[f"gene_{j}" for j in range(2000)]))
atac = ad.AnnData(np.abs(np.random.negative_binomial(2, 0.5, (n_cells, 5000))).astype(float),
                  obs=rna.obs.copy(),
                  var=pd.DataFrame(index=[f"peak_{j}" for j in range(5000)]))

mdata = mu.MuData({"rna": rna, "atac": atac})
print(mdata)
# MuData object with n_obs × n_vars = 200 × 7000
#   2 modalities
#     rna: 200 × 2000
#     atac: 200 × 5000

# Minimal preprocessing
sc.pp.normalize_total(mdata["rna"], target_sum=1e4)
sc.pp.log1p(mdata["rna"])
sc.pp.highly_variable_genes(mdata["rna"], n_top_genes=500)
sc.pp.pca(mdata["rna"])

mu.atac.pp.tfidf(mdata["atac"])
mu.atac.tl.lsi(mdata["atac"])

# WNN joint embedding
mu.pp.neighbors(mdata, key_added="wnn",
                use_rep={"rna": "X_pca", "atac": "X_lsi"})
sc.tl.umap(mdata, neighbors_key="wnn")
sc.tl.leiden(mdata, neighbors_key="wnn", key_added="leiden_wnn")
print(f"Clusters: {mdata.obs['leiden_wnn'].nunique()}")

Core API

Module 1: MuData Creation and I/O

MuData

is the central container: a dictionary of

AnnData

modalities plus shared

obs

and

var

slots. The

mdata.obs

DataFrame merges per-modality observation metadata with a modality prefix for ambiguous columns.

import muon as mu
import anndata as ad
import numpy as np
import pandas as pd

# --- Build from per-modality AnnData objects ---
n_cells = 300
rna  = ad.AnnData(np.abs(np.random.negative_binomial(5, 0.3, (n_cells, 3000))).astype(float),
                  obs=pd.DataFrame(index=[f"cell_{i}" for i in range(n_cells)]),
                  var=pd.DataFrame(index=[f"gene_{j}" for j in range(3000)]))
prot = ad.AnnData(np.abs(np.random.randn(n_cells, 30)) + 3,
                  obs=rna.obs.copy(),
                  var=pd.DataFrame(index=[f"protein_{j}" for j in range(30)]))

mdata = mu.MuData({"rna": rna, "protein": prot})
print(mdata)
# Access individual modalities
print(mdata.mod["rna"])        # AnnData: 300 × 3000
print(mdata["rna"])            # same shorthand
print(mdata.obs.head())        # shared observation metadata
print(mdata.obs_names[:5])     # cell barcodes

# Save and load
mdata.write("multiome_data.h5mu")
mdata2 = mu.read("multiome_data.h5mu")
print(f"Loaded: {mdata2.n_obs} cells, {mdata2.n_mod} modalities")

# --- Load from 10x Multiome h5 file ---
# mdata = mu.read_10x_h5("filtered_feature_bc_matrix.h5")
# Produces MuData with mdata["rna"] and mdata["atac"] modalities

# --- Subsetting by cells or features ---
# Select high-quality cells (e.g., after QC)
mask = np.ones(mdata.n_obs, dtype=bool)  # replace with actual QC mask
mdata_filtered = mdata[mask].copy()
print(f"After filtering: {mdata_filtered.n_obs} cells")

# Propagate obs mask to each modality
mu.pp.intersect_obs(mdata)   # ensures obs consistency across modalities

Module 2: RNA Modality Preprocessing

Standard scRNA-seq preprocessing applied to

mdata["rna"]

using scanpy functions. The RNA modality is preprocessed identically to a standalone scanpy workflow but operates on the slice of the MuData container.

import scanpy as sc
import numpy as np

# Assume mdata["rna"] has raw integer counts
rna = mdata["rna"]

# QC metrics
sc.pp.calculate_qc_metrics(rna, percent_top=None, log1p=False, inplace=True)
rna.obs["pct_counts_mt"] = rna[:, rna.var_names.str.startswith("MT-")].X.sum(axis=1).A1 / rna.obs["total_counts"] * 100

# Filter cells and genes
min_genes, max_genes, max_mt = 200, 5000, 20
sc.pp.filter_cells(rna, min_genes=min_genes)
sc.pp.filter_genes(rna, min_cells=5)
rna = rna[(rna.obs["n_genes_by_counts"] < max_genes) &
          (rna.obs["pct_counts_mt"] < max_mt)].copy()
print(f"RNA after QC: {rna.n_obs} cells × {rna.n_vars} genes")

# Normalize and log-transform
sc.pp.normalize_total(rna, target_sum=1e4)
sc.pp.log1p(rna)

# Highly variable genes
sc.pp.highly_variable_genes(rna, n_top_genes=3000, flavor="seurat_v3",
                             batch_key=None)
print(f"HVGs: {rna.var['highly_variable'].sum()}")

# PCA on HVGs
sc.pp.pca(rna, n_comps=50, use_highly_variable=True)
print(f"PCA embedding shape: {rna.obsm['X_pca'].shape}")
# Update slice in MuData
mdata.mod["rna"] = rna

Module 3: ATAC Modality Preprocessing

ATAC-seq modalities require a different normalization strategy. TF-IDF (term frequency–inverse document frequency) normalizes peak accessibility across cells and peaks; LSI (latent semantic indexing, equivalent to truncated SVD after TF-IDF) produces a low-dimensional embedding. The first LSI component typically captures sequencing depth rather than biology and is excluded.

# Assume mdata["atac"] contains raw binary or integer peak accessibility counts
atac = mdata["atac"]

# Basic QC: filter low-coverage cells and low-frequency peaks
sc.pp.calculate_qc_metrics(atac, percent_top=None, log1p=False, inplace=True)
sc.pp.filter_cells(atac, min_genes=200)     # min peaks detected
sc.pp.filter_genes(atac, min_cells=10)      # min cells a peak appears in
print(f"ATAC after QC: {atac.n_obs} cells × {atac.n_vars} peaks")

# TF-IDF normalization (log(TF) * log(IDF) scaling)
mu.atac.pp.tfidf(atac, scale_factor=1e4)
print("TF-IDF normalization complete")
print(f"ATAC data range: [{atac.X.min():.2f}, {atac.X.max():.2f}]")

# LSI dimensionality reduction (truncated SVD on TF-IDF matrix)
mu.atac.tl.lsi(atac, n_comps=50, use_highly_variable=False)
# LSI component 1 correlates with sequencing depth — exclude it
# mu.atac.tl.lsi sets X_lsi starting from component 2 by default
print(f"LSI embedding shape: {atac.obsm['X_lsi'].shape}")

# Update modality in MuData
mdata.mod["atac"] = atac

Module 4: WNN Graph and Joint Embedding

Weighted Nearest Neighbor (WNN) integrates multiple modality embeddings by learning per-cell, per-modality weights. Cells with high-quality RNA signal receive higher RNA weight; cells with cleaner ATAC signal receive higher ATAC weight. The resulting WNN graph is used for UMAP layout and Leiden clustering.

# Compute per-modality neighbor graphs first (optional but enables modality-specific UMAPs)
mu.pp.neighbors(mdata["rna"],  use_rep="X_pca",  n_neighbors=30, key_added="neighbors")
mu.pp.neighbors(mdata["atac"], use_rep="X_lsi",  n_neighbors=30, key_added="neighbors")

# WNN joint neighbor graph across modalities
mu.pp.neighbors(
    mdata,
    key_added="wnn",
    use_rep={"rna": "X_pca", "atac": "X_lsi"},
    n_neighbors=30,
    random_state=42,
)
print("WNN graph built. Keys:", list(mdata.obsp.keys()))
# Expected: ['wnn_connectivities', 'wnn_distances']

# UMAP from WNN graph
sc.tl.umap(mdata, neighbors_key="wnn", random_state=42)
print(f"UMAP embedding shape: {mdata.obsm['X_umap'].shape}")

# Leiden clustering from WNN graph
sc.tl.leiden(mdata, neighbors_key="wnn", resolution=0.5, key_added="leiden_wnn")
n_clusters = mdata.obs["leiden_wnn"].nunique()
print(f"Leiden WNN clustering: {n_clusters} clusters at resolution 0.5")

Module 5: Visualization

muon extends scanpy's plotting interface with modality-aware functions.

mu.pl.embedding()

colors joint UMAP embeddings by features from any modality;

sc.pl.umap()

with

color

pointing to modality-prefixed feature names (e.g.,

"rna:CD3E"

) is also supported.

import matplotlib.pyplot as plt

# Joint UMAP colored by cluster assignment
sc.pl.umap(mdata, color="leiden_wnn", title="WNN Leiden clusters",
           legend_loc="on data", show=False)
plt.savefig("wnn_umap_clusters.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved wnn_umap_clusters.png")

# Color by RNA gene expression on joint UMAP
sc.pl.umap(mdata, color=["rna:CD3E", "rna:CD19", "rna:CD14"],
           use_raw=False, vmax="p99", show=False, ncols=3)
plt.savefig("wnn_umap_markers.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved wnn_umap_markers.png")

# Per-modality scatter / embedding plots
mu.pl.embedding(mdata, basis="X_umap", color="leiden_wnn",
                show=False)
plt.savefig("embedding_clusters.png", dpi=150, bbox_inches="tight")
plt.close()

# Violin plot: RNA QC metrics per cluster
sc.pl.violin(mdata["rna"], keys=["n_genes_by_counts", "total_counts"],
             groupby=mdata.obs.loc[mdata["rna"].obs_names, "leiden_wnn"],
             rotation=90, show=False)
plt.savefig("rna_qc_by_cluster.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved rna_qc_by_cluster.png")

Module 6: Cross-Modal Analysis

Cross-modal analysis links features across modalities — for example, associating ATAC peak accessibility near a gene's promoter with its RNA expression, or applying MOFA+ to find shared latent factors.

# --- Peak-to-gene distance annotation ---
# Annotate ATAC peaks with nearest gene (requires genomic coordinates in var)
# mdata["atac"].var should contain columns: chrom, chromStart, chromEnd
# mu.atac.tl.rank_peaks_groups(mdata, groupby="leiden_wnn")  # differential peaks

# --- Differentially accessible peaks per cluster ---
atac = mdata["atac"]
# Transfer cluster labels from MuData obs to ATAC obs
atac.obs["cluster"] = mdata.obs.loc[atac.obs_names, "leiden_wnn"].values
sc.tl.rank_genes_groups(atac, groupby="cluster", method="wilcoxon",
                        use_raw=False)
top_peaks = sc.get.rank_genes_groups_df(atac, group="0").head(5)
print("Top differential peaks in cluster 0:")
print(top_peaks[["names", "logfoldchanges", "pvals_adj"]])

# --- MOFA+ factor analysis on MuData ---
# Requires: pip install mofapy2
try:
    mu.tl.mofa(mdata, n_factors=10, seed=42,
               use_obs="all", outfile="mofa_model.hdf5")
    # Factor scores stored in mdata.obsm["X_mofa"]
    print(f"MOFA factors: {mdata.obsm['X_mofa'].shape}")
    # Variance explained per factor per modality
    # Inspect with mdata.uns["mofa"]
except ImportError:
    print("Install mofapy2 to enable MOFA factor analysis")

Common Workflows

Workflow 1: Full 10x Multiome (RNA + ATAC) Joint WNN Clustering

Goal: Process paired RNA and ATAC data from 10x Genomics Multiome, build a WNN joint embedding, cluster cells, and identify RNA marker genes per cluster.

import muon as mu
import scanpy as sc
import numpy as np
import anndata as ad
import pandas as pd
import matplotlib.pyplot as plt

# --- 1. Simulate 10x Multiome-like data (replace with mu.read_10x_h5()) ---
np.random.seed(0)
n_cells = 400
cell_ids = [f"AAACGAATC{i:05d}-1" for i in range(n_cells)]
# Simulate two cell types with distinct expression/accessibility
labels = np.array(["typeA"] * 200 + ["typeB"] * 200)

rna_counts = np.abs(np.random.negative_binomial(6, 0.35, (n_cells, 2000))).astype(float)
rna_counts[:200, :100] += 10   # typeA marker genes
rna_counts[200:, 100:200] += 10  # typeB marker genes

atac_counts = np.abs(np.random.binomial(1, 0.05, (n_cells, 50000))).astype(float)
atac_counts[:200, :5000] += 1   # typeA accessible peaks
atac_counts[200:, 5000:10000] += 1  # typeB accessible peaks

rna  = ad.AnnData(rna_counts,
                  obs=pd.DataFrame({"cell_type": labels}, index=cell_ids),
                  var=pd.DataFrame(index=[f"gene_{j}" for j in range(2000)]))
atac = ad.AnnData(atac_counts,
                  obs=pd.DataFrame({"cell_type": labels}, index=cell_ids),
                  var=pd.DataFrame(index=[f"peak_{j}" for j in range(50000)]))

mdata = mu.MuData({"rna": rna, "atac": atac})
print(f"Loaded: {mdata}")

# --- 2. RNA preprocessing ---
sc.pp.filter_cells(mdata["rna"], min_genes=50)
sc.pp.filter_genes(mdata["rna"], min_cells=5)
sc.pp.normalize_total(mdata["rna"], target_sum=1e4)
sc.pp.log1p(mdata["rna"])
sc.pp.highly_variable_genes(mdata["rna"], n_top_genes=500)
sc.pp.pca(mdata["rna"], n_comps=30, use_highly_variable=True)
print(f"RNA PCA: {mdata['rna'].obsm['X_pca'].shape}")

# --- 3. ATAC preprocessing: TF-IDF + LSI ---
sc.pp.filter_cells(mdata["atac"], min_genes=100)
sc.pp.filter_genes(mdata["atac"], min_cells=5)
mu.atac.pp.tfidf(mdata["atac"])
mu.atac.tl.lsi(mdata["atac"], n_comps=30)
print(f"ATAC LSI: {mdata['atac'].obsm['X_lsi'].shape}")

# --- 4. Per-modality neighbors (optional for modality-specific plots) ---
mu.pp.neighbors(mdata["rna"],  use_rep="X_pca", n_neighbors=15, key_added="neighbors")
mu.pp.neighbors(mdata["atac"], use_rep="X_lsi",  n_neighbors=15, key_added="neighbors")

# --- 5. WNN joint neighbor graph ---
mu.pp.intersect_obs(mdata)   # align obs after per-modality filtering
mu.pp.neighbors(mdata, key_added="wnn",
                use_rep={"rna": "X_pca", "atac": "X_lsi"},
                n_neighbors=15, random_state=42)

# --- 6. UMAP + Leiden clustering ---
sc.tl.umap(mdata, neighbors_key="wnn", random_state=42)
sc.tl.leiden(mdata, neighbors_key="wnn", resolution=0.5, key_added="leiden_wnn")
print(f"Clusters: {mdata.obs['leiden_wnn'].value_counts().to_dict()}")

# --- 7. Marker genes per cluster ---
sc.tl.rank_genes_groups(mdata["rna"],
                        groupby=mdata.obs.loc[mdata["rna"].obs_names, "leiden_wnn"],
                        method="wilcoxon", use_raw=False)
top_markers = sc.get.rank_genes_groups_df(mdata["rna"], group="0").head(5)
print("Top RNA markers for cluster 0:")
print(top_markers[["names", "logfoldchanges", "pvals_adj"]])

# --- 8. Visualization ---
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sc.pl.umap(mdata, color="leiden_wnn", ax=axes[0], show=False, title="WNN clusters")
sc.pl.umap(mdata, color="rna:gene_0",  ax=axes[1], show=False, title="gene_0 expression",
           use_raw=False, vmax="p99")
plt.tight_layout()
plt.savefig("multiome_wnn_pipeline.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved multiome_wnn_pipeline.png")

Workflow 2: CITE-seq (RNA + Surface Protein) Analysis

Goal: Process CITE-seq data with paired RNA and antibody-derived tag (ADT/protein) counts. Normalize proteins with CLR, annotate cell types from protein markers, and build a joint UMAP.

import muon as mu
import scanpy as sc
import numpy as np
import anndata as ad
import pandas as pd
import matplotlib.pyplot as plt
from scipy.sparse import csr_matrix

np.random.seed(1)
n_cells = 350
cell_ids = [f"CITESEQ_{i:04d}" for i in range(n_cells)]

# Simulate RNA + protein modalities
# Three cell types: T-cells (CD3+CD4+), B-cells (CD19+CD20+), Monocytes (CD14+CD16+)
n_t, n_b, n_mono = 120, 130, 100
labels = ["Tcell"] * n_t + ["Bcell"] * n_b + ["Monocyte"] * n_mono

rna_mat = np.abs(np.random.negative_binomial(4, 0.4, (n_cells, 1500))).astype(float)
rna_mat[:n_t, :50] += 15       # T-cell RNA markers
rna_mat[n_t:n_t+n_b, 50:100] += 15   # B-cell RNA markers
rna_mat[n_t+n_b:, 100:150] += 15     # Monocyte RNA markers

# 20 surface proteins: first 5 T-cell, next 5 B-cell, next 5 Monocyte, rest baseline
prot_mat = np.abs(np.random.normal(2, 0.5, (n_cells, 20)))
prot_mat[:n_t, :5] += 8          # CD3, CD4, CD5, CD7, CD8 for T-cells
prot_mat[n_t:n_t+n_b, 5:10] += 8  # CD19, CD20, CD22, CD24, CD79 for B-cells
prot_mat[n_t+n_b:, 10:15] += 8    # CD14, CD16, CD64, CD11b, HLA-DR for Monocytes

protein_names = ["CD3", "CD4", "CD5", "CD7", "CD8",
                 "CD19", "CD20", "CD22", "CD24", "CD79a",
                 "CD14", "CD16", "CD64", "CD11b", "HLA-DR",
                 "CD25", "CD56", "CD45RA", "CD45RO", "IgG-ctrl"]

rna = ad.AnnData(csr_matrix(rna_mat),
                 obs=pd.DataFrame({"cell_type": labels}, index=cell_ids),
                 var=pd.DataFrame(index=[f"gene_{j}" for j in range(1500)]))
prot = ad.AnnData(prot_mat,
                  obs=pd.DataFrame({"cell_type": labels}, index=cell_ids),
                  var=pd.DataFrame(index=protein_names))

mdata = mu.MuData({"rna": rna, "protein": prot})
print(mdata)

# --- RNA preprocessing ---
sc.pp.normalize_total(mdata["rna"], target_sum=1e4)
sc.pp.log1p(mdata["rna"])
sc.pp.highly_variable_genes(mdata["rna"], n_top_genes=500)
sc.pp.pca(mdata["rna"], n_comps=30, use_highly_variable=True)

# --- Protein CLR normalization (Centered Log-Ratio) ---
# CLR normalizes each protein across cells: log(x / geometric_mean(x))
mu.prot.pp.clr(mdata["protein"])
print("Protein CLR normalization applied")
print(f"Protein data range: [{mdata['protein'].X.min():.2f}, {mdata['protein'].X.max():.2f}]")

# PCA on protein modality
sc.pp.pca(mdata["protein"], n_comps=min(15, prot.n_vars - 1))

# --- WNN joint embedding ---
mu.pp.neighbors(mdata, key_added="wnn",
                use_rep={"rna": "X_pca", "protein": "X_pca"},
                n_neighbors=20, random_state=0)
sc.tl.umap(mdata, neighbors_key="wnn", random_state=0)
sc.tl.leiden(mdata, neighbors_key="wnn", resolution=0.4, key_added="leiden_wnn")

# --- Protein-based cell type annotation ---
# Compute mean CLR protein per cluster
prot_df = pd.DataFrame(
    mdata["protein"].X,
    index=mdata["protein"].obs_names,
    columns=protein_names
)
prot_df["cluster"] = mdata.obs.loc[mdata["protein"].obs_names, "leiden_wnn"].values
cluster_means = prot_df.groupby("cluster").mean()
print("\nMean CLR protein per cluster:")
print(cluster_means[["CD3", "CD4", "CD19", "CD14"]].round(2))

# --- Visualization ---
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sc.pl.umap(mdata, color="leiden_wnn",   ax=axes[0], show=False, title="WNN Leiden")
sc.pl.umap(mdata, color="protein:CD3",  ax=axes[1], show=False, title="CD3 (CLR)",
           use_raw=False, vmax="p99")
sc.pl.umap(mdata, color="protein:CD19", ax=axes[2], show=False, title="CD19 (CLR)",
           use_raw=False, vmax="p99")
plt.tight_layout()
plt.savefig("citeseq_joint_umap.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved citeseq_joint_umap.png")

# --- Dot plot: protein markers per cluster ---
sc.pl.dotplot(mdata["protein"],
              var_names=["CD3", "CD4", "CD19", "CD20", "CD14", "CD16"],
              groupby=mdata.obs.loc[mdata["protein"].obs_names, "leiden_wnn"],
              show=False)
plt.savefig("citeseq_protein_dotplot.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved citeseq_protein_dotplot.png")

Key Parameters

Parameter	Module / Function	Default	Range / Options	Effect
`n_neighbors`	`mu.pp.neighbors()`	`30`	`10` – `100`	Neighborhood size for graph; lower = finer local structure
`use_rep`	`mu.pp.neighbors()`	`None` (auto)	dict of `{modality: embedding_key}`	Which embedding per modality to use for WNN
`key_added`	`mu.pp.neighbors()`	`"neighbors"`	any string	Key under which graph is stored; use `"wnn"` for joint graphs
`n_comps`	`sc.pp.pca()` , `mu.atac.tl.lsi()`	`50`	`10` – `100`	Number of reduced dimensions; 30–50 typical
`resolution`	`sc.tl.leiden()`	`1.0`	`0.1` – `2.0`	Clustering granularity; lower = fewer, larger clusters
`scale_factor`	`mu.atac.pp.tfidf()`	`1e4`	`1e3` – `1e5`	TF scaling constant before log transform
`n_factors`	`mu.tl.mofa()`	`10`	`5` – `50`	Number of MOFA latent factors; use elbow on variance explained
`target_sum`	`sc.pp.normalize_total()`	`1e4`	`1e3` – `1e6`	Library size normalization target per cell (RNA)
`n_top_genes`	`sc.pp.highly_variable_genes()`	varies	`1000` – `5000`	Number of highly variable genes to retain for PCA

Best Practices

Align obs before WNN: After per-modality QC filtering, cell counts across modalities may differ. Always call

mu.pp.intersect_obs(mdata)

before WNN to ensure every modality has the same cell set.

mu.pp.intersect_obs(mdata)
print(f"Shared cells after intersect: {mdata.n_obs}")

Drop LSI component 1 from ATAC: The first LSI component captures sequencing depth (total counts per cell) rather than biological signal.

mu.atac.tl.lsi()

stores all components in

X_lsi

, so verify that component 1 correlates with

log_total_counts

before using

X_lsi

in WNN; if so, restrict to

X_lsi[:, 1:]

import numpy as np, pandas as pd
atac = mdata["atac"]
corr = np.corrcoef(atac.obsm["X_lsi"][:, 0],
                   np.log1p(atac.obs["total_counts"]))[0, 1]
print(f"LSI1 vs log_counts correlation: {corr:.3f}")
# If |corr| > 0.9, exclude component 1:
# atac.obsm["X_lsi"] = atac.obsm["X_lsi"][:, 1:]

Use
```
key_added="wnn"
```
consistently: Always name the joint graph
```
"wnn"
```
and pass
```
neighbors_key="wnn"
```
to all downstream
```
sc.tl.umap()
```
,
```
sc.tl.leiden()
```
, and
```
sc.tl.paga()
```
calls. Mixing
```
neighbors_key
```
values silently uses the wrong graph.
CLR normalization for proteins, not log-normalization: Antibody-derived tag (ADT) counts from CITE-seq have a different noise model than RNA. Use
```
mu.prot.pp.clr()
```
for per-protein CLR normalization. Do NOT apply
```
sc.pp.normalize_total()
```
+
```
sc.pp.log1p()
```
to the protein modality.

Store raw RNA counts before normalization: Before any normalization, store the raw RNA counts in

mdata["rna"].layers["counts"]

and set

mdata["rna"].raw = mdata["rna"]

. This is required for downstream differential expression tests and scvi-tools integration.

import scipy.sparse as sp
mdata["rna"].layers["counts"] = mdata["rna"].X.copy()
# Store pre-normalization snapshot
mdata["rna"].raw = mdata["rna"]

Match WNN embedding dimensions: The RNA PCA and ATAC LSI embeddings passed to WNN should have comparable dimensionality (e.g., both 30 or 50 components). Mismatched dimensions do not cause errors but can down-weight the smaller modality.

Common Recipes

Recipe: Compute Per-Cluster Modality Weights from WNN

When to use: Inspect which cells rely more on RNA vs ATAC signal in the WNN graph. High RNA weight in a cluster suggests cleaner RNA data; high ATAC weight suggests stronger chromatin accessibility signal.

# WNN stores per-cell modality weights in mdata.obsm after mu.pp.neighbors()
# Key name: "{key_added}_weights" — check available keys
print([k for k in mdata.obsm.keys() if "wnn" in k.lower()])

# If weights are stored (depends on muon version):
if "wnn_weights" in mdata.obsm:
    weights_df = pd.DataFrame(
        mdata.obsm["wnn_weights"],
        index=mdata.obs_names,
        columns=["rna_weight", "atac_weight"]
    )
    weights_df["cluster"] = mdata.obs["leiden_wnn"].values
    print(weights_df.groupby("cluster").mean().round(3))
else:
    # Proxy: compare RNA vs ATAC PCA explained variance per cell
    rna_var  = np.var(mdata["rna"].obsm["X_pca"],  axis=1)
    atac_var = np.var(mdata["atac"].obsm["X_lsi"], axis=1)
    mdata.obs["rna_signal_proxy"]  = rna_var / (rna_var + atac_var)
    mdata.obs["atac_signal_proxy"] = atac_var / (rna_var + atac_var)
    print(mdata.obs[["rna_signal_proxy", "atac_signal_proxy", "leiden_wnn"]].groupby("leiden_wnn").mean().round(3))

Recipe: Export Cluster Labels Back to Per-Modality AnnData

When to use: After joint WNN clustering, copy shared cluster labels back into each modality's

.obs

for modality-specific downstream analyses (e.g., differential peaks within clusters).

# Copy joint cluster labels into each modality's obs
for mod_name in mdata.mod:
    mod_adata = mdata[mod_name]
    shared_cells = mod_adata.obs_names
    mod_adata.obs["leiden_wnn"] = mdata.obs.loc[shared_cells, "leiden_wnn"].values
    print(f"{mod_name}: added leiden_wnn, {mod_adata.obs['leiden_wnn'].nunique()} clusters")

# Now run modality-specific differential analysis per cluster
sc.tl.rank_genes_groups(mdata["rna"],  groupby="leiden_wnn",
                        method="wilcoxon", use_raw=False)
sc.tl.rank_genes_groups(mdata["atac"], groupby="leiden_wnn",
                        method="wilcoxon", use_raw=False)
print("Differential features computed for both modalities")

Recipe: Convert MuData to Concatenated AnnData for scvi-tools

When to use: MultiVI and totalVI in scvi-tools require a concatenated AnnData with modality identity tracked in

var

. This recipe prepares the input for these models.

# Concatenate RNA and ATAC into a single AnnData with modality column in var
rna_adata  = mdata["rna"].copy()
atac_adata = mdata["atac"].copy()

rna_adata.var["modality"]  = "Gene Expression"
atac_adata.var["modality"] = "Peaks"

# Restore raw counts for scvi-tools (requires integer counts)
# Ensure mdata["rna"].layers["counts"] and mdata["atac"].X are integer
import anndata as ad
combined = ad.concat([rna_adata, atac_adata], axis=1, merge="unique")
combined.obs = mdata.obs.loc[combined.obs_names].copy()
print(f"Combined AnnData: {combined.n_obs} cells × {combined.n_vars} features")
print(combined.var["modality"].value_counts())
# Use combined with scvi.model.MULTIVI.setup_anndata(combined, ...)

Troubleshooting

Problem	Cause	Solution
`KeyError: 'rna'` when building MuData	Modality key not set or loaded from file without expected names	Check `mdata.mod.keys()` . When loading 10x h5 files, use `mu.read_10x_h5()` which auto-names modalities `"rna"` and `"atac"`
`ValueError: obs_names mismatch` during WNN	Cells filtered differently per modality leaving mismatched obs	Call `mu.pp.intersect_obs(mdata)` after per-modality QC to harmonize cell sets
UMAP/Leiden uses wrong graph	Multiple neighbor graphs exist; function uses default key `"neighbors"`	Always pass `neighbors_key="wnn"` explicitly to `sc.tl.umap()` and `sc.tl.leiden()`
LSI component 1 dominates ATAC embedding	First component captures sequencing depth, not biology	Compute correlation of `X_lsi[:, 0]` with `log_total_counts` ; if
`mu.atac.pp.tfidf` raises sparse matrix error	Input matrix is dense or wrong dtype	Convert first: `mdata["atac"].X = scipy.sparse.csr_matrix(mdata["atac"].X.astype(float))`
CLR normalization produces NaN for proteins	Zero counts in a cell for all proteins	Filter cells with `sc.pp.filter_cells(mdata["protein"], min_genes=1)` before CLR
`mu.tl.mofa()` fails with missing mofapy2	mofapy2 not installed	`pip install mofapy2` ; ensure `muon` version ≥ 0.1.5 for MOFA integration
Memory error for large ATAC peak matrices	ATAC matrices (50k+ peaks × 10k+ cells) exceed RAM	Use `sc.pp.highly_variable_genes()` to select top 50k peaks first, or load in chunks

Related Skills

scanpy-scrna-seq — single-modality RNA analysis; use as the foundation for the RNA preprocessing steps within muon
scvi-tools-single-cell — deep generative multi-modal integration (MultiVI for RNA+ATAC, totalVI for CITE-seq); use when probabilistic batch correction across samples is needed
mofaplus-multi-omics — multi-omics factor analysis;
```
mu.tl.mofa()
```
calls mofapy2 internally and stores factors in MuData
anndata-data-structure — AnnData fundamentals; each MuData modality is a standard AnnData object
deeptools-ngs-analysis — upstream ATAC-seq BAM → bigWig normalization before peak calling
macs3-peak-calling — produces the peak BED files used to generate the ATAC AnnData count matrix

References

muon documentation — official docs, tutorials, API reference
muon GitHub (scverse) — source code, issues, examples
Bredikhin et al. Genome Biology 2022 — muon/MuData publication
10x Genomics Multiome analysis guide — step-by-step Multiome tutorial in muon
WNN method (Hao et al. Cell 2021) — original Weighted Nearest Neighbor paper from Seurat v4