BioSkills bio-machine-learning-atlas-mapping
Maps query single-cell data to reference atlases using scArches transfer learning with scVI and scANVI models. Transfers cell type labels without retraining on combined data. Use when annotating new single-cell datasets using pre-trained reference models.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/machine-learning/atlas-mapping" ~/.claude/skills/gptomics-bioskills-bio-machine-learning-atlas-mapping && rm -rf "$T"
machine-learning/atlas-mapping/SKILL.mdVersion Compatibility
Reference examples tested with: anndata 0.10+, scanpy 1.10+, scvi-tools 1.1+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function)
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Transfer Learning for Single-Cell Data
"Map my scRNA-seq data onto a reference atlas" → Transfer cell type labels from a pre-trained reference model to query cells using architectural surgery (scArches) without retraining.
- Python:
→scvi.model.SCVI.load_query_data()
→get_latent_representation()scanpy.tl.ingest()
scVI Reference Mapping (scArches)
Goal: Map query single-cell data onto a pre-trained reference model to obtain a shared latent embedding.
Approach: Load a pre-trained scVI model, prepare query data with matching gene sets, then perform surgical fine-tuning that updates only query-specific parameters.
import scvi import scanpy as sc # Load pre-trained reference model adata_ref = sc.read_h5ad('reference.h5ad') # Model must have been saved with save_anndata=True scvi.model.SCVI.setup_anndata(adata_ref, layer='counts', batch_key='batch') ref_model = scvi.model.SCVI.load('reference_model/', adata=adata_ref) # Prepare query data adata_query = sc.read_h5ad('query.h5ad') # Subset to reference genes adata_query = adata_query[:, adata_ref.var_names].copy() # Set up query AnnData using reference setup scvi.model.SCVI.prepare_query_anndata(adata_query, ref_model) # Load query into model (creates "surgical" fine-tuned model) query_model = scvi.model.SCVI.load_query_data(adata_query, ref_model) # Surgical training: update only query-specific parameters # weight_decay=0.0: Standard for surgery; prevents reference drift query_model.train(max_epochs=200, plan_kwargs={'weight_decay': 0.0}) # Get latent representation adata_query.obsm['X_scVI'] = query_model.get_latent_representation()
scANVI for Label Transfer
Goal: Transfer cell type labels from a labeled reference atlas to unlabeled query data.
Approach: Train a semi-supervised scANVI model on the reference, then map query cells via surgical fine-tuning and predict labels with confidence scores.
import scvi import scanpy as sc # Reference with cell type labels adata_ref = sc.read_h5ad('reference_labeled.h5ad') scvi.model.SCVI.setup_anndata(adata_ref, layer='counts', batch_key='batch') ref_vae = scvi.model.SCVI(adata_ref, n_latent=30) ref_vae.train(max_epochs=100) # Convert to scANVI (semi-supervised) scvi.model.SCANVI.setup_anndata(adata_ref, layer='counts', batch_key='batch', labels_key='cell_type', unlabeled_category='Unknown') ref_scanvi = scvi.model.SCANVI.from_scvi_model(ref_vae, labels_key='cell_type', unlabeled_category='Unknown') ref_scanvi.train(max_epochs=50) ref_scanvi.save('reference_scanvi/') # Map query data adata_query = sc.read_h5ad('query.h5ad') adata_query = adata_query[:, adata_ref.var_names].copy() scvi.model.SCANVI.prepare_query_anndata(adata_query, ref_scanvi) query_scanvi = scvi.model.SCANVI.load_query_data(adata_query, ref_scanvi) query_scanvi.train(max_epochs=100, plan_kwargs={'weight_decay': 0.0}) # Transfer labels adata_query.obs['predicted_cell_type'] = query_scanvi.predict() adata_query.obsm['X_scANVI'] = query_scanvi.get_latent_representation()
Prediction Confidence
Goal: Assess reliability of transferred labels and flag cells that may represent novel types.
Approach: Extract soft prediction probabilities from the scANVI model and identify low-confidence assignments below a threshold.
# Get prediction probabilities soft_predictions = query_scanvi.predict(soft=True) adata_query.obs['prediction_confidence'] = soft_predictions.max(axis=1) # Flag low-confidence predictions # confidence < 0.5: May be novel cell type or poor mapping low_conf = adata_query.obs['prediction_confidence'] < 0.5 print(f'Low confidence predictions: {low_conf.sum()} ({low_conf.mean():.1%})')
Joint Embedding Visualization
Goal: Visualize reference and query cells together to assess integration quality.
Approach: Concatenate reference and query datasets, compute UMAP from the shared latent representation, and color by dataset and cell type.
import scanpy as sc # Combine reference and query for visualization adata_combined = adata_ref.concatenate(adata_query, batch_key='dataset', batch_categories=['reference', 'query']) # Use latent space for neighbors/UMAP sc.pp.neighbors(adata_combined, use_rep='X_scVI') sc.tl.umap(adata_combined) sc.pl.umap(adata_combined, color=['dataset', 'cell_type'], save='_transfer.png')
Pre-trained Reference Atlases
| Atlas | Model | URL |
|---|---|---|
| Human Lung Cell Atlas | scANVI | cellxgene.cziscience.com |
| Tabula Sapiens | scVI | tabula-sapiens-portal.ds.czbiohub.org |
| Mouse Cell Atlas | scVI | bis.zju.edu.cn/MCA |
Training Parameters
| Parameter | Surgical | Full Retrain | Notes |
|---|---|---|---|
| weight_decay | 0.0 | 0.001 | 0.0 preserves reference |
| max_epochs | 100-200 | 200-400 | Less for surgery |
| early_stopping | True | True | Prevents overfitting |
Related Skills
- single-cell/cell-annotation - Manual annotation methods
- single-cell/batch-integration - Batch effect correction
- single-cell/preprocessing - Data preparation before transfer