Medical-research-skills scvi-tools
Deep generative models for single-cell omics; use when you need probabilistic batch correction (scVI), transfer learning, uncertainty-aware differential expression, or multimodal integration (totalVI/MultiVI).
install
source · Clone the upstream repo
git clone https://github.com/aipoch/medical-research-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Data Analysis/scvi-tools" ~/.claude/skills/aipoch-medical-research-skills-scvi-tools && rm -rf "$T"
manifest: scientific-skills/Data Analysis/scvi-tools/SKILL.md
When to Use
Use scvi-tools when you need probabilistic, model-based single-cell analysis beyond standard pipelines (e.g., beyond typical Scanpy workflows), such as:
- Batch correction and dataset integration for scRNA-seq using a probabilistic latent space (e.g., scVI).
- Transfer learning / semi-supervised annotation when you have partial labels or want to map new data onto a reference (e.g., scANVI).
- Uncertainty-aware differential expression where effect sizes and posterior uncertainty matter (Bayesian DE).
- Multimodal integration across RNA+protein (CITE-seq) or RNA+ATAC (multiome), including paired/unpaired settings (e.g., totalVI, MultiVI).
- Specialized modalities such as ATAC-seq, spatial transcriptomics deconvolution/mapping, doublet detection, methylation, or RNA velocity.
Key Features
- Unified model API: `setup_anndata(...) → Model(adata) → train() → get_*()` across model families.
- Probabilistic latent representations for integration, denoising, and downstream clustering/visualization.
- Explicit covariate handling (batch, donor, technical factors) via `setup_anndata`.
- Bayesian differential expression with posterior-based hypothesis testing and effect-size thresholds.
- Multi-omics models for joint learning across modalities (RNA/protein, RNA/ATAC; paired or unpaired).
- AnnData-first integration with the Scanpy ecosystem for downstream neighbors/UMAP/clustering.
- GPU acceleration via PyTorch (when available).
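The unified API listed above follows the same four-step lifecycle regardless of model family. The following is a schematic mock of that contract (hypothetical classes and a plain dict standing in for AnnData, not the real scvi-tools library):

```python
# Hypothetical mock of the scvi-tools model lifecycle; method names mirror
# the real API's shape, but none of this is the actual library.
class MockModel:
    _registered = set()

    @classmethod
    def setup_anndata(cls, adata, batch_key=None):
        # Step 1: register which fields of the data object the model will use.
        cls._registered.add(id(adata))
        adata["_batch_key"] = batch_key

    def __init__(self, adata):
        # Step 2: a model is always constructed from a registered data object.
        assert id(adata) in self._registered, "call setup_anndata first"
        self.adata = adata
        self.trained = False

    def train(self):
        # Step 3: fit the model (real models run variational inference here).
        self.trained = True

    def get_latent_representation(self):
        # Step 4: get_*() accessors expose outputs only after training.
        assert self.trained, "call train() first"
        return [[0.0] * 10 for _ in range(len(self.adata["cells"]))]


adata = {"cells": ["c1", "c2", "c3"]}  # stand-in for an AnnData object
MockModel.setup_anndata(adata, batch_key="batch")
model = MockModel(adata)
model.train()
z = model.get_latent_representation()  # one latent vector per cell
```

Because every model family honors this contract, swapping scVI for totalVI or MultiVI mostly changes the setup and output calls, not the overall workflow.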
Model catalogs by modality (for reference):
- scRNA-seq: `references/models-scrna-seq.md` (scVI, scANVI, AUTOZI, VeloVI, contrastiveVI, …)
- ATAC-seq: `references/models-atac-seq.md` (PeakVI, PoissonVI, scBasset, …)
- Multimodal: `references/models-multimodal.md` (totalVI, MultiVI, MrVI, …)
- Spatial: `references/models-spatial.md` (DestVI, Stereoscope, Tangram, scVIVA, …)
- Specialized: `references/models-specialized.md` (Solo, CellAssign, MethylVI/MethylANVI, CytoVI, …)
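The catalog can be treated as a simple lookup from modality to model family and reference file. A minimal sketch that restates the list above as a data structure (the dictionary below is illustrative, not an official scvi-tools API):

```python
# Modality -> (example models, reference file), restating the catalog above.
MODEL_CATALOG = {
    "scrna-seq": (["scVI", "scANVI", "AUTOZI", "VeloVI", "contrastiveVI"],
                  "references/models-scrna-seq.md"),
    "atac-seq": (["PeakVI", "PoissonVI", "scBasset"],
                 "references/models-atac-seq.md"),
    "multimodal": (["totalVI", "MultiVI", "MrVI"],
                   "references/models-multimodal.md"),
    "spatial": (["DestVI", "Stereoscope", "Tangram", "scVIVA"],
                "references/models-spatial.md"),
    "specialized": (["Solo", "CellAssign", "MethylVI", "MethylANVI", "CytoVI"],
                    "references/models-specialized.md"),
}


def models_for(modality: str):
    """Return (candidate models, reference file) for a modality key."""
    return MODEL_CATALOG[modality.lower()]


models, ref = models_for("multimodal")
```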
Dependencies
- `scvi-tools` (latest compatible with your environment)
- `python>=3.9`
- `pytorch>=2.0`
- `pytorch-lightning>=2.0` (or `lightning`, depending on scvi-tools version)
- `anndata>=0.8`
- `scanpy>=1.9`
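These pins can be sanity-checked at runtime before starting a long analysis. A stdlib-only sketch (the tuple comparison below is a naive approximation; `packaging.version` is more robust for pre-release and local version strings):

```python
from importlib import metadata


def version_tuple(v: str) -> tuple:
    """Parse '1.9.3' -> (1, 9, 3); trailing non-digits like 'rc1' are dropped."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break  # stop at the first non-numeric character
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets(installed: str, minimum: str) -> bool:
    """True if installed >= minimum under naive numeric comparison."""
    return version_tuple(installed) >= version_tuple(minimum)


# Report whichever of the pinned packages are present in this environment.
for pkg, minimum in [("anndata", "0.8"), ("scanpy", "1.9"), ("scvi-tools", "0.0")]:
    try:
        v = metadata.version(pkg)
        print(pkg, v, "ok" if meets(v, minimum) else "too old")
    except metadata.PackageNotFoundError:
        print(pkg, "not installed")
```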
Installation example:
```shell
uv pip install scvi-tools

# Optional GPU extras (package extra name may vary by platform/version)
uv pip install "scvi-tools[cuda]"
```
Example Usage
A complete runnable example using scVI for batch correction + latent embedding, then Scanpy for neighbors/UMAP/clustering:
```python
import scanpy as sc
import scvi

# 1) Load example data (AnnData)
adata = scvi.data.heart_cell_atlas_subsampled()

# 2) Minimal preprocessing (keep raw counts available)
sc.pp.filter_genes(adata, min_counts=3)
adata.layers["counts"] = adata.X.copy()  # preserve raw counts before any transforms
sc.pp.highly_variable_genes(
    adata, n_top_genes=1200, flavor="seurat_v3", layer="counts", subset=True
)

# 3) Register AnnData for scVI (raw counts + covariates)
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",   # raw counts layer (not log-normalized)
    batch_key="batch",  # batch column in adata.obs
    categorical_covariate_keys=["donor"],
    continuous_covariate_keys=["percent_mito"],
)

# 4) Train model
model = scvi.model.SCVI(adata)
model.train()

# 5) Extract outputs
adata.obsm["X_scVI"] = model.get_latent_representation()
adata.layers["scvi_normalized"] = model.get_normalized_expression(library_size=1e4)

# 6) Downstream analysis with Scanpy
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)
sc.tl.leiden(adata)

# Optional: uncertainty-aware differential expression
de = model.differential_expression(
    groupby="cell_type",
    group1="TypeA",
    group2="TypeB",
    mode="change",
    delta=0.25,
)
print(de.head())
```
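The `de` result is a pandas DataFrame whose exact columns vary by scvi-tools version; `proba_de` (posterior probability of differential expression) and `lfc_mean` (mean log fold change) are typical in `mode="change"` output. A minimal filtering sketch over a synthetic stand-in table, so it runs without scvi-tools installed:

```python
# Synthetic stand-in for rows of the DE table; column names (proba_de,
# lfc_mean) are assumptions that should be checked against your version.
de_rows = [
    {"gene": "GeneA", "proba_de": 0.98, "lfc_mean": 1.4},
    {"gene": "GeneB", "proba_de": 0.52, "lfc_mean": 2.1},
    {"gene": "GeneC", "proba_de": 0.95, "lfc_mean": -0.9},
    {"gene": "GeneD", "proba_de": 0.99, "lfc_mean": 0.1},
]


def significant(rows, min_proba=0.95, min_abs_lfc=0.25):
    """Keep genes passing both the posterior-probability and effect-size cuts."""
    return [r["gene"] for r in rows
            if r["proba_de"] >= min_proba and abs(r["lfc_mean"]) >= min_abs_lfc]


hits = significant(de_rows)  # GeneB fails the probability cut, GeneD the LFC cut
```

Filtering on both criteria together is what distinguishes this workflow from p-value-only ranking: a gene needs a confident *and* practically large change to be called.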
Model persistence:
```python
model.save("./scvi_model", overwrite=True)
model2 = scvi.model.SCVI.load("./scvi_model", adata=adata)
```
Implementation Details
- Core approach: deep generative modeling with variational inference (typically VAE-style architectures) to learn a latent representation and a likelihood model for counts.
- Data requirements: models generally expect raw counts (not log-normalized values). Provide counts via `layer="counts"` or ensure `adata.X` contains counts.
- Covariate registration: technical factors (e.g., `batch_key`, donor, QC metrics) are incorporated through `setup_anndata`, enabling the model to learn representations that reduce unwanted variation.
- Training loop: `train()` performs amortized inference using neural networks shared across cells; GPU acceleration is used automatically when configured.
- Latent space usage: `get_latent_representation()` returns batch-corrected embeddings suitable for neighbors/UMAP/clustering in Scanpy.
- Differential expression: `differential_expression(...)` performs posterior-based comparisons; parameters such as `mode="change"` (composite hypothesis testing on changes) and `delta` (minimum effect-size threshold) help control practical significance and support uncertainty-aware decisions. See `references/differential-expression.md` for interpretation guidance.
- Model selection by modality: choose the model family based on data type (e.g., scVI/scANVI for scRNA-seq, totalVI for CITE-seq, MultiVI for RNA+ATAC, DestVI for spatial deconvolution). For details, see the corresponding `references/models-*.md` files.
- Theory background: variational inference, amortized inference, and probabilistic modeling foundations are summarized in `references/theoretical-foundations.md`.
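The `mode="change"` decision described above amounts to estimating the posterior probability that the absolute log fold change exceeds `delta`. A stdlib Monte Carlo sketch over synthetic posterior samples (illustrating the decision rule, not scvi-tools internals):

```python
import random

random.seed(0)


def proba_de_from_samples(lfc_samples, delta=0.25):
    """Monte Carlo estimate of P(|LFC| > delta | data): the fraction of
    posterior log-fold-change samples whose magnitude exceeds delta."""
    return sum(abs(s) > delta for s in lfc_samples) / len(lfc_samples)


# Synthetic posterior draws for two hypothetical genes.
strong_gene = [random.gauss(1.0, 0.2) for _ in range(5000)]  # clear effect
null_gene = [random.gauss(0.0, 0.2) for _ in range(5000)]    # no real effect

p_strong = proba_de_from_samples(strong_gene)  # close to 1
p_null = proba_de_from_samples(null_gene)      # well below the usual 0.95 cut
```

Raising `delta` shrinks both probabilities, which is how the threshold screens out statistically confident but practically tiny changes.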