SciAgent-Skills popv-cell-annotation
Consensus cell type annotation by running 10+ algorithms (KNN-Harmony, KNN-BBKNN, KNN-Scanorama, KNN-scVI, CellTypist, ONCLASS, Random Forest, SCANVI, SVM, XGBoost) on a labeled reference and transferring labels to a query dataset via majority voting. popV produces per-method labels, an overall consensus prediction, and an agreement score quantifying confidence across methods. Use when single-method annotation is insufficient or when you need ensemble uncertainty estimates for novel cell states.
popV Multi-Method Cell Type Transfer
Overview
popV (Population Voting for single-cell annotation) annotates a query scRNA-seq dataset by running 10+ independent classification algorithms against a labeled reference atlas and aggregating results via majority voting. Each method produces its own label; the final `popv_prediction` is the consensus across all methods, and the `popv_agreement` score quantifies how many methods agree. This ensemble strategy is robust to individual method failures on unusual datasets and provides a principled uncertainty estimate: low agreement highlights novel cell states or annotation gaps.
When to Use
- Annotating a query dataset by transferring labels from a well-curated reference atlas when you want a consensus rather than a single model's judgment
- Identifying novel or ambiguous cell states as cells where methods disagree (low `popv_agreement` score)
- Benchmarking annotation reliability by comparing per-method labels to detect systematic disagreements
- Annotating large atlas datasets (>100k cells) where batch effects between reference and query are substantial
- Producing annotation for downstream analyses that require high-confidence labels (clinical data, regulatory submissions)
- Use CellTypist (celltypist-cell-annotation) instead when speed matters and a pre-trained model matches your tissue; popV is slower because it trains multiple models on your reference
- Use scANVI (scvi-tools-single-cell) instead when you need a single probabilistic deep generative model with formal uncertainty quantification and do not require the ensemble
Prerequisites
- Python packages: `popv>=0.6`, `scanpy>=1.9`, `anndata`, `scvi-tools>=1.0`, `harmonypy`, `bbknn`, `celltypist`
- Data requirements: Two AnnData objects: a labeled reference (`adata_ref`) with cell type labels in `obs`, and an unlabeled query (`adata_query`). Both must be from the same species and have overlapping gene sets. Raw counts in `adata.X` (popV applies its own normalization internally)
- Environment: Python 3.9+; GPU recommended for scVI/SCANVI methods (falls back to CPU); 32 GB RAM recommended for >200k reference cells
pip install popv scvi-tools harmonypy bbknn celltypist
Quick Start
Minimal pipeline from labeled reference and unlabeled query to annotated result:
```python
import popv
import scanpy as sc

# Load reference (labeled) and query (unlabeled) AnnData objects
adata_ref = sc.read_h5ad("reference_atlas.h5ad")  # adata_ref.obs["cell_type"] exists
adata_query = sc.read_h5ad("query_dataset.h5ad")

# Prepare combined object with popV preprocessing
adata = popv.preprocessing.Process_Query(
    adata_ref,
    adata_query,
    ref_labels_key="cell_type",
    ref_batch_key="batch",
    query_batch_key="batch",
    unknown_celltype_label="unknown",
    save_path_trained_models="./popv_models/",
    n_epochs_unsupervised=50,
)

# Run all annotation methods
popv.annotation.annotate_data(adata)

# Inspect consensus results for query cells
query_mask = adata.obs["_dataset"] == "query"
print(adata[query_mask].obs[["popv_prediction", "popv_agreement"]].head(10))
```
Core API
Module 1: Reference and Query Data Setup
Both AnnData objects must share a gene space and have required metadata columns. popV will subset to the intersection of genes automatically.
```python
import anndata as ad
import scanpy as sc
import numpy as np

# Reference: must have cell type labels and (optionally) batch metadata
adata_ref = sc.read_h5ad("reference_atlas.h5ad")
print(f"Reference: {adata_ref.n_obs} cells x {adata_ref.n_vars} genes")
print(f"Cell types: {adata_ref.obs['cell_type'].nunique()} unique labels")
print(f"Reference cell type counts:\n{adata_ref.obs['cell_type'].value_counts().head(10)}")

# Query: no labels required; batch metadata optional
adata_query = sc.read_h5ad("query_dataset.h5ad")
print(f"\nQuery: {adata_query.n_obs} cells x {adata_query.n_vars} genes")

# Check gene overlap (popV will handle subsetting but >70% overlap is recommended)
shared_genes = adata_ref.var_names.intersection(adata_query.var_names)
pct_shared = len(shared_genes) / adata_ref.n_vars
print(f"\nShared genes: {len(shared_genes)} ({pct_shared:.1%} of reference genes)")
if pct_shared < 0.5:
    print("WARNING: <50% gene overlap; annotation quality may be reduced")
```

```python
# Verify required fields before popV setup
assert "cell_type" in adata_ref.obs.columns, "Reference needs cell type labels"

# Add batch column if absent (popV requires it even for single-batch data)
if "batch" not in adata_ref.obs.columns:
    adata_ref.obs["batch"] = "ref_batch"
if "batch" not in adata_query.obs.columns:
    adata_query.obs["batch"] = "query_batch"

print("Reference obs columns:", adata_ref.obs.columns.tolist())
print("Query obs columns:    ", adata_query.obs.columns.tolist())
```
Module 2: POPV Object Creation (Process_Query)
Process_Query combines reference and query, normalizes counts, selects HVGs, and prepares the joint embedding needed by all annotation methods.
```python
import popv

# Create processed combined AnnData
adata = popv.preprocessing.Process_Query(
    adata_ref,
    adata_query,
    ref_labels_key="cell_type",                 # obs column with reference labels
    ref_batch_key="batch",                      # obs column with reference batch info
    query_batch_key="batch",                    # obs column with query batch info
    unknown_celltype_label="unknown",           # label to use for query cells before annotation
    save_path_trained_models="./popv_models/",  # directory for scVI/SCANVI model checkpoints
    n_epochs_unsupervised=50,                   # scVI training epochs (increase to 100–200 for large datasets)
    n_epochs_semisupervised=20,                 # scANVI fine-tuning epochs
    use_gpu=True,                               # GPU for scVI/SCANVI (falls back to CPU if unavailable)
    hvg=4000,                                   # number of highly variable genes to use
)

print(f"Combined object: {adata.n_obs} cells x {adata.n_vars} genes")
print(f"Dataset labels: {adata.obs['_dataset'].value_counts().to_dict()}")
# Expected: {'ref': N_ref, 'query': N_query}
```
Module 3: Running the Method Ensemble
annotate_data runs all selected methods sequentially and adds per-method label columns plus the consensus to adata.obs.
```python
import popv

# Run annotation with default set of methods
popv.annotation.annotate_data(
    adata,
    methods=[
        "knn_harmony",      # KNN on Harmony-corrected embedding
        "knn_bbknn",        # KNN on BBKNN cross-batch graph
        "knn_scvi",         # KNN on scVI latent space
        "scanvi_popv",      # Semi-supervised scANVI label transfer
        "celltypist_popv",  # CellTypist logistic regression
        "rf",               # Random Forest on HVG expression
        "xgboost",          # XGBoost classifier
        "svm",              # Support Vector Machine
        "onclass",          # ONCLASS (ontology-guided)
    ],
)

# Inspect per-method result columns (all end in "_popv")
query_mask = adata.obs["_dataset"] == "query"
popv_cols = adata.obs.filter(like="_popv").columns.tolist()
print(f"Per-method columns: {popv_cols}")
print(adata[query_mask].obs[popv_cols + ["popv_prediction", "popv_agreement"]].head(10))
```
Module 4: Consensus Results and Agreement Scoring
popv_prediction is the majority-vote consensus; popv_agreement is the fraction of methods that agreed on the winning label.
```python
import pandas as pd

query_mask = adata.obs["_dataset"] == "query"
query_obs = adata[query_mask].obs.copy()

# Consensus label distribution
print("Consensus cell type distribution:")
print(query_obs["popv_prediction"].value_counts().head(15))

# Agreement score statistics
print("\npopv_agreement statistics:")
print(query_obs["popv_agreement"].describe())
# agreement = 1.0 → all methods agree; agreement = 0.2 → only 2/10 methods agree

# Cells with high confidence (>=80% method agreement)
high_conf = query_obs["popv_agreement"] >= 0.8
print(f"\nHigh-confidence cells (agreement >= 0.8): {high_conf.sum()} ({high_conf.mean():.1%})")

# Cells with low confidence: candidate novel states or annotation gaps
low_conf = query_obs["popv_agreement"] < 0.5
print(f"Low-confidence cells (agreement < 0.5): {low_conf.sum()} ({low_conf.mean():.1%})")
```
Module 5: Visualization
popV provides built-in UMAP and heatmap visualization of per-method agreement and consensus labels.
```python
import popv
import scanpy as sc
import matplotlib.pyplot as plt

# Compute UMAP on the joint reference+query embedding (if not already present)
if "X_umap" not in adata.obsm:
    sc.tl.umap(adata)

# popV built-in visualization: UMAP panel showing consensus + agreement
popv.visualization.predict_celltypes_umap(
    adata,
    save="popv_annotation_umap.png",
)
print("Saved popv_annotation_umap.png")

# Custom UMAP panels
fig, axes = plt.subplots(1, 3, figsize=(21, 6))
sc.pl.umap(adata, color="popv_prediction", ax=axes[0], title="popV Consensus",
           legend_loc="on data", legend_fontsize=6, show=False)
sc.pl.umap(adata, color="popv_agreement", ax=axes[1], cmap="RdYlGn", vmin=0, vmax=1,
           title="Method Agreement Score", show=False)
sc.pl.umap(adata, color="_dataset", ax=axes[2], title="Reference vs Query", show=False)
plt.tight_layout()
plt.savefig("popv_custom_umap.png", dpi=150, bbox_inches="tight")
print("Saved popv_custom_umap.png")
```
Key Concepts
Method Ensemble and Majority Voting
popV runs each method independently; the final prediction is determined by plurality vote across all methods. The `popv_agreement` score equals the fraction of methods that voted for the winning label (e.g., 0.7 = 7/10 methods agreed). This design has several properties:
- Robustness: if one method fails or produces outlier labels, the consensus is unaffected if the remaining methods agree
- Uncertainty signal: low agreement does not mean the annotation is wrong — it often flags biologically interesting cells (transitional states, rare populations) that differ from all reference cell types
- Method independence: KNN-based methods depend on the embedding quality; tree-based methods (RF, XGBoost) work directly on expression; SVM works in feature space; CellTypist uses a separate logistic regression. Together they span multiple algorithmic families
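The voting rule described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not popV's own implementation: the helper `consensus_vote` and the example labels are hypothetical, and ties are resolved in favor of the label seen first.

```python
from collections import Counter


def consensus_vote(method_labels):
    """Return (winning_label, agreement) for a single cell.

    method_labels maps a method name to the label that method predicted.
    Agreement is the fraction of methods that voted for the winning label.
    """
    counts = Counter(method_labels.values())
    # With n=1, most_common returns the first label that reaches the top count
    winner, n_votes = counts.most_common(1)[0]
    return winner, n_votes / len(method_labels)


# Hypothetical per-method labels for one query cell (illustrative only)
votes = {
    "knn_harmony": "T cell",
    "knn_bbknn": "T cell",
    "knn_scvi": "T cell",
    "scanvi": "T cell",
    "celltypist": "NK cell",
    "rf": "T cell",
    "xgboost": "T cell",
    "svm": "NK cell",
    "onclass": "T cell",
    "knn_scanorama": "B cell",
}

label, agreement = consensus_vote(votes)
print(label, agreement)  # T cell 0.7
```

Three of the ten hypothetical methods dissent here, yet the consensus label is unchanged; only the agreement value drops, which is the signal the low-agreement filters in this document rely on.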
Method Comparison
| Method | Batch Correction | Speed | Best For |
|---|---|---|---|
| `knn_harmony` | Harmony | Fast | Moderate batch effects, large datasets |
| `knn_bbknn` | BBKNN | Fast | Diverse multi-tissue references |
| `knn_scanorama` | Scanorama | Fast | Multiple heterogeneous batches |
| `knn_scvi` | scVI VAE | Medium | Complex batch effects, probabilistic embedding |
| `scanvi_popv` | scVI+labels | Slow | Semi-supervised; most accurate when reference is clean |
| `celltypist_popv` | None (logistic) | Fast | Immune cells; works well without batch correction |
| `rf` | None | Medium | Balanced class distributions; interpretable feature importance |
| `xgboost` | None | Medium | High-confidence predictions on well-separated cell types |
| `svm` | None | Medium | High-dimensional gene expression; linear boundaries |
| `onclass` | None | Medium | Ontology-aware; handles unseen cell types via CL ontology |
ONCLASS and Ontology-Aware Annotation
ONCLASS uses the Cell Ontology (CL) to represent cell types as nodes in a knowledge graph and predict unseen cell types by propagating similarity through the ontology. Unlike other methods, ONCLASS can predict a cell type that was not present in the training reference if it is ontologically adjacent to known types. Enable it by including `"onclass"` in the methods list.
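To make "ontologically adjacent" concrete, the sketch below finds the closest shared ancestor of two labels in a toy ontology. It is not the ONCLASS algorithm: the `PARENT` map is a hypothetical single-parent tree (the real Cell Ontology is a graph in which a term can have several parents), and the helper names are invented for illustration.

```python
# Toy single-parent ontology; labels and structure are hypothetical
PARENT = {
    "CD4 T cell": "T cell",
    "CD8 T cell": "T cell",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "lymphocyte": "leukocyte",
    "monocyte": "leukocyte",
    "leukocyte": None,  # root of the toy tree
}


def lineage(label):
    """Return the label followed by its ancestors, ending at the root."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path


def closest_shared_ancestor(a, b):
    """Return the most specific term that is an ancestor of (or equal to) both labels."""
    ancestors_of_b = set(lineage(b))
    for node in lineage(a):  # walks from most specific to most general
        if node in ancestors_of_b:
            return node
    return None


print(closest_shared_ancestor("CD4 T cell", "CD8 T cell"))  # T cell
print(closest_shared_ancestor("CD4 T cell", "B cell"))      # lymphocyte
print(closest_shared_ancestor("B cell", "monocyte"))        # leukocyte
```

Two methods that disagree between sibling subtypes still agree on the parent term, which is one way to read a low-agreement cell before deciding it is a novel state.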
Reference Quality Requirements
popV annotation quality scales directly with reference quality:
- Minimum cell count per type: 50–100 cells per label; rare types with <20 cells may be missed by KNN methods
- Balanced representation: highly imbalanced references (one type is 80% of cells) cause tree methods to be biased toward the majority class
- Label granularity: coarse labels (10 types) annotate reliably; fine-grained labels (100+ types) require a larger, matched reference
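The requirements above can be checked before running popV. The following is a minimal sketch using only the standard library; `audit_reference_labels` is a hypothetical helper, and its thresholds (50 cells per label, 80% dominance) are taken from the guidance in this section rather than from popV itself.

```python
from collections import Counter


def audit_reference_labels(labels, min_cells=50, max_fraction=0.8):
    """Flag rare and dominant labels in a reference label column.

    labels is any iterable of cell type labels, one entry per cell.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    # Labels with too few cells to be learned reliably
    rare = sorted(name for name, n in counts.items() if n < min_cells)
    # Labels that make up at least max_fraction of all cells
    dominant = sorted(name for name, n in counts.items() if n / total >= max_fraction)
    return {"n_labels": len(counts), "rare": rare, "dominant": dominant}


# Hypothetical, deliberately imbalanced reference
labels = ["hepatocyte"] * 850 + ["Kupffer cell"] * 120 + ["stellate cell"] * 30
report = audit_reference_labels(labels)
print(report)
# {'n_labels': 3, 'rare': ['stellate cell'], 'dominant': ['hepatocyte']}
```

With an AnnData reference, the same helper could be called as `audit_reference_labels(adata_ref.obs["cell_type"])`, assuming the label column is named `cell_type` as in the examples in this document.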
Common Workflows
Workflow 1: Standard Reference-Query Annotation
Goal: Annotate an unlabeled query dataset using a curated reference atlas end-to-end.
```python
import popv
import scanpy as sc
import pandas as pd

# 1. Load data
adata_ref = sc.read_h5ad("reference_atlas.h5ad")   # has obs["cell_type"] and obs["batch"]
adata_query = sc.read_h5ad("query_dataset.h5ad")   # no cell type labels
if "batch" not in adata_query.obs.columns:
    adata_query.obs["batch"] = "query"

# 2. Preprocess: build joint normalized object
adata = popv.preprocessing.Process_Query(
    adata_ref,
    adata_query,
    ref_labels_key="cell_type",
    ref_batch_key="batch",
    query_batch_key="batch",
    unknown_celltype_label="unknown",
    save_path_trained_models="./popv_models/",
    n_epochs_unsupervised=100,
    n_epochs_semisupervised=30,
    use_gpu=True,
    hvg=4000,
)
print(f"Prepared: {adata.n_obs} total cells")

# 3. Run ensemble annotation
popv.annotation.annotate_data(adata)

# 4. Extract query results
query_mask = adata.obs["_dataset"] == "query"
query_annotations = adata[query_mask].obs[[
    "popv_prediction", "popv_agreement",
    "knn_harmony_popv", "scanvi_popv", "rf_popv", "xgboost_popv"
]].copy()

# 5. Transfer back to original query object
adata_query.obs = adata_query.obs.join(
    query_annotations, how="left"
)
print(f"Annotated {query_mask.sum()} query cells")
print(query_annotations["popv_prediction"].value_counts().head(10))

# 6. Save annotated query
adata_query.write_h5ad("annotated_query.h5ad", compression="gzip")
query_annotations.to_csv("popv_annotations.csv")
print("Saved annotated_query.h5ad and popv_annotations.csv")
```
Workflow 2: Confidence Filtering and Novel Cell State Detection
Goal: Separate high-confidence annotations from ambiguous cells; flag candidate novel or transitional states for manual review.
```python
import popv
import scanpy as sc
import pandas as pd
import matplotlib.pyplot as plt

# Assume adata has been annotated (as in Workflow 1)
query_mask = adata.obs["_dataset"] == "query"
query_obs = adata[query_mask].obs.copy()

# Tier cells by agreement score
bins = [0.0, 0.5, 0.8, 1.01]
labels = ["low (<0.5)", "medium (0.5–0.8)", "high (≥0.8)"]
query_obs["confidence_tier"] = pd.cut(
    query_obs["popv_agreement"], bins=bins, labels=labels, right=False
)
print("Cells per confidence tier:")
print(query_obs["confidence_tier"].value_counts())

# High-confidence subset: use popv_prediction directly
high_conf_mask = query_obs["popv_agreement"] >= 0.8
print(f"\nHigh-confidence annotations ({high_conf_mask.mean():.1%} of query cells):")
print(query_obs[high_conf_mask]["popv_prediction"].value_counts().head(10))

# Low-confidence subset: inspect per-method disagreement
low_conf = query_obs[query_obs["popv_agreement"] < 0.5]
popv_method_cols = [c for c in query_obs.columns
                    if c.endswith("_popv") and c not in ("popv_prediction", "popv_agreement")]
print("\nLow-confidence cells sample (showing per-method labels):")
print(low_conf[popv_method_cols + ["popv_prediction"]].head(10).to_string())

# Visualize agreement distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
query_obs["popv_agreement"].hist(bins=20, ax=axes[0], color="steelblue", edgecolor="white")
axes[0].axvline(0.8, color="red", linestyle="--", label="High-confidence threshold")
axes[0].set_xlabel("Method Agreement Score")
axes[0].set_ylabel("Cell Count")
axes[0].set_title("popV Agreement Distribution")
axes[0].legend()

query_obs["confidence_tier"].value_counts().plot.bar(ax=axes[1], color="steelblue")
axes[1].set_title("Cells by Confidence Tier")
axes[1].set_xlabel("Confidence Tier")
axes[1].set_ylabel("Cell Count")
plt.tight_layout()
plt.savefig("popv_confidence_distribution.png", dpi=150, bbox_inches="tight")
print("Saved popv_confidence_distribution.png")
```
Key Parameters
| Parameter | Module | Default | Range / Options | Effect |
|---|---|---|---|---|
| `ref_labels_key` | Process_Query | n/a | Any column | Column in `adata_ref.obs` containing training cell type labels |
| `n_epochs_unsupervised` | Process_Query | | | scVI training epochs; increase for better embedding on large/complex datasets |
| `n_epochs_semisupervised` | Process_Query | | | scANVI fine-tuning epochs on top of scVI |
| `hvg` | Process_Query | | | Highly variable genes used for embedding and KNN methods |
| `use_gpu` | Process_Query | | `True`, `False` | GPU acceleration for scVI/SCANVI; falls back to CPU automatically if no GPU |
| `methods` | annotate_data | all | List of method names | Subset of methods to run; excluding slow methods (scanvi, onclass) speeds up pipeline |
| `unknown_celltype_label` | Process_Query | | Any string | Label assigned to query cells before annotation; used to separate reference labels from query |
| `popv_agreement` | (output) | n/a | 0–1 | Fraction of methods agreeing on consensus label; ≥0.8 recommended for high confidence |
Best Practices
- Check gene overlap before running: popV performs best with >70% gene overlap between reference and query. If overlap is <50%, annotation quality degrades significantly; consider using a different reference or imputing missing genes.

  ```python
  shared = adata_ref.var_names.intersection(adata_query.var_names)
  print(f"Gene overlap: {len(shared) / adata_ref.n_vars:.1%}")
  ```

- Use raw counts as input: pass raw (un-normalized) counts in `adata.X` to `Process_Query`. popV internally applies its own normalization. Pre-normalized data can distort the scVI/SCANVI latent space.

- Match reference granularity to query biology: if your query contains subtypes not in the reference, no method will correctly assign them; they will appear as low-agreement cells. Either add them to the reference or accept that the consensus will assign the nearest parent type.

- Exclude slow methods when speed matters: `scanvi_popv` and `onclass` are the slowest. For a quick first-pass, run only `knn_harmony`, `knn_bbknn`, `rf`, `xgboost`, and `celltypist_popv`.

  ```python
  popv.annotation.annotate_data(adata, methods=["knn_harmony", "knn_bbknn", "rf", "xgboost", "celltypist_popv"])
  ```

- Save trained models for repeated queries: `Process_Query` stores scVI/SCANVI models in `save_path_trained_models`. Reuse these when annotating additional query batches against the same reference to avoid retraining.
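The raw-count recommendation above can be spot-checked before preprocessing. The sketch below is a heuristic only, assuming that raw counts are non-negative and integer-valued; `looks_like_raw_counts` is a hypothetical helper and is not part of popV, scanpy, or anndata.

```python
def looks_like_raw_counts(values, tol=1e-8):
    """Heuristic check that every value is a non-negative whole number.

    values is any iterable of numbers, e.g. a sample of matrix entries.
    Normalized or log-transformed data usually contains fractional values.
    """
    for v in values:
        if v < 0 or abs(v - round(v)) > tol:
            return False
    return True


# Illustrative values only
raw = [0, 3, 1, 0, 12, 7]
log_normalized = [0.0, 1.386, 0.693, 0.0, 2.565, 2.079]

print(looks_like_raw_counts(raw))             # True
print(looks_like_raw_counts(log_normalized))  # False
```

For a large matrix it is usually enough to test a sample of entries rather than every value; how that sample is drawn depends on whether `adata.X` is dense or sparse, so that step is left out here.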
Common Recipes
Recipe: Subset to High-Confidence Annotations Only
When to use: downstream analyses (DE, trajectory) require clean labels; exclude ambiguous cells.
```python
import scanpy as sc

# Annotate as in Workflow 1 first
query_mask = adata.obs["_dataset"] == "query"
adata_query_annotated = adata[query_mask].copy()

# Keep only high-confidence cells
high_conf = adata_query_annotated[adata_query_annotated.obs["popv_agreement"] >= 0.8].copy()
print(f"High-confidence cells: {high_conf.n_obs} / {adata_query_annotated.n_obs} "
      f"({high_conf.n_obs/adata_query_annotated.n_obs:.1%})")
print(high_conf.obs["popv_prediction"].value_counts())

# Recompute UMAP on high-confidence subset for visualization
sc.pp.neighbors(high_conf, use_rep="X_scVI")  # use scVI embedding stored by popV
sc.tl.umap(high_conf)
sc.pl.umap(high_conf, color="popv_prediction", save="_high_conf_celltypes.png")
```
Recipe: Per-Method Label Comparison Heatmap
When to use: understanding where methods disagree to identify systematic biases or novel populations.
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

query_mask = adata.obs["_dataset"] == "query"
query_obs = adata[query_mask].obs.copy()

# Collect per-method columns
method_cols = [c for c in query_obs.columns
               if c.endswith("_popv") and c not in ("popv_prediction", "popv_agreement")]

# Cross-tabulate two key methods
ct = pd.crosstab(
    query_obs["knn_harmony_popv"],
    query_obs["scanvi_popv"],
    margins=False,
)

# Normalize rows
ct_norm = ct.div(ct.sum(axis=1), axis=0)

plt.figure(figsize=(12, 10))
sns.heatmap(ct_norm, cmap="Blues", vmin=0, vmax=1,
            xticklabels=True, yticklabels=True,
            cbar_kws={"label": "Fraction of cells"})
plt.title("knn_harmony vs scanvi label agreement")
plt.xlabel("SCANVI label")
plt.ylabel("KNN-Harmony label")
plt.tight_layout()
plt.savefig("popv_method_agreement_heatmap.png", dpi=150)
print("Saved popv_method_agreement_heatmap.png")
```
Recipe: Fast Annotation Without Deep Learning Methods
When to use: quick annotation without GPU or when scVI/SCANVI training is prohibitively slow (>500k cells).
```python
import popv

# Process without training deep generative models (scVI not needed for KNN-Harmony)
adata = popv.preprocessing.Process_Query(
    adata_ref,
    adata_query,
    ref_labels_key="cell_type",
    ref_batch_key="batch",
    query_batch_key="batch",
    unknown_celltype_label="unknown",
    save_path_trained_models="./popv_models/",
    n_epochs_unsupervised=0,    # skip scVI training
    n_epochs_semisupervised=0,  # skip scANVI training
    use_gpu=False,
    hvg=3000,
)

# Run only fast non-DL methods
popv.annotation.annotate_data(
    adata,
    methods=["knn_harmony", "knn_bbknn", "knn_scanorama", "rf", "xgboost", "svm", "celltypist_popv"],
)

query_mask = adata.obs["_dataset"] == "query"
print(adata[query_mask].obs[["popv_prediction", "popv_agreement"]].describe())
```
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Label column not found | Reference lacks a cell type column | Verify the column name in `adata_ref.obs.columns` and update `ref_labels_key` accordingly |
| Gene space mismatch error | Reference and query have very few shared genes | Check the overlap between `adata_ref.var_names` and `adata_query.var_names`; if <50% overlap, use a different reference or match gene panels |
| CUDA out-of-memory for scVI/SCANVI | GPU VRAM insufficient for batch size | Set `use_gpu=False` or reduce the batch size; scVI falls back to CPU automatically on most systems |
| `onclass` failures on small datasets | ONCLASS requires sufficient label coverage | Remove `onclass` from the methods list when reference has <10 cell types or <500 cells per type |
| Very slow annotation (>2 hours) | scVI/SCANVI training on large reference | Subsample reference to 50k cells per type; exclude `scanvi_popv` and `onclass` from methods |
| All cells receive same consensus label | Reference highly imbalanced toward one type | Balance reference by subsampling the dominant type or upsampling rare types before running popV |
| `popv_agreement` is 0 for many cells | Many methods returning different labels | Inspect per-method columns; consider whether reference covers the query biology; add methods or retrain with a better reference |
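Two of the fixes in this table involve capping the number of reference cells per type. One way to do that is sketched below with the standard library only; `subsample_per_label` is a hypothetical helper, not a popV function, and the cap of 100 is chosen just to keep the example small.

```python
import random
from collections import defaultdict


def subsample_per_label(labels, max_per_label, seed=0):
    """Return sorted cell indices, keeping at most max_per_label cells per label."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    keep = []
    for indices in by_label.values():
        if len(indices) > max_per_label:
            # Draw without replacement from the over-represented label
            indices = rng.sample(indices, max_per_label)
        keep.extend(indices)
    return sorted(keep)


# Hypothetical imbalanced reference: 1000 / 40 / 5 cells
labels = ["A"] * 1000 + ["B"] * 40 + ["C"] * 5
keep = subsample_per_label(labels, max_per_label=100)
print(len(keep))  # 145
```

The returned positions could then be used to subset the reference, for example `adata_ref[keep].copy()`, assuming `labels` was read in the same row order as `adata_ref.obs`. Labels below the cap are kept whole, so rare types are not thinned further.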
Related Skills
- celltypist-cell-annotation: single-model annotation with pre-trained logistic regression; faster but lacks ensemble uncertainty
- scanpy-scrna-seq: preprocessing pipeline (QC, normalization, clustering) that produces AnnData inputs for popV
- scvi-tools-single-cell: scANVI for probabilistic label transfer with a single deep generative model; use when you prefer a formal variational framework over ensemble voting
- harmony-batch-correction: Harmony embedding used by the `knn_harmony` method internally; understand it to tune popV's KNN-based methods
References
- GitHub, YosefLab/popV: official source code, installation instructions, and example notebooks
- popV documentation: API reference and tutorials
- Ergen et al., bioRxiv 2023: "Consensus prediction of cell type labels with popV", original popV preprint
- ONCLASS paper, Wang et al., Nature Communications 2021: ontology-aware cell type classification underlying the ONCLASS method in popV