SciAgent-Skills scanpy-scrna-seq
Single-cell RNA-seq analysis with Scanpy. QC filtering, normalization, HVG selection, PCA, neighborhood graph, UMAP/t-SNE, Leiden clustering, marker gene identification, cell type annotation, and trajectory inference. Use for standard scRNA-seq exploratory workflows.
git clone https://github.com/jaechang-hits/SciAgent-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/scanpy-scrna-seq" ~/.claude/skills/jaechang-hits-sciagent-skills-scanpy-scrna-seq && rm -rf "$T"
skills/genomics-bioinformatics/scanpy-scrna-seq/SKILL.mdScanpy Single-Cell RNA-seq Analysis
Overview
Scanpy is a scalable Python toolkit for analyzing single-cell RNA-seq data built on the AnnData format. This skill covers the end-to-end standard workflow: quality control, normalization, highly variable gene selection, dimensionality reduction, clustering, marker gene identification, and cell type annotation. It produces annotated datasets and publication-quality visualizations.
When to Use
- Analyzing single-cell RNA-seq count matrices (10X Genomics, h5ad, CSV, loom)
- Performing quality control filtering on scRNA-seq datasets (mitochondrial %, gene counts)
- Running dimensionality reduction: PCA, UMAP, t-SNE
- Identifying cell clusters via Leiden community detection
- Finding differentially expressed marker genes per cluster (Wilcoxon, t-test, logistic regression)
- Annotating cell types from known marker gene panels
- Conducting trajectory inference and pseudotime analysis (PAGA, diffusion pseudotime)
- Generating publication-quality single-cell plots (dot plots, heatmaps, stacked violins)
- Comparing gene expression across experimental conditions within cell types
- Use Seurat (R/Bioconductor) instead for scRNA-seq analysis in an existing R workflow or when Seurat-specific assay types are required
Prerequisites
- Python packages:
,scanpy>=1.10
,leidenalg
,igraphanndata - Data requirements: Gene expression count matrix (cells x genes). Common formats: 10X Cell Ranger output (
),filtered_feature_bc_matrix/
,.h5ad
,.h5.csv - Environment: Python 3.9+; 16GB+ RAM recommended for >50k cells
pip install "scanpy[leiden]" anndata
Workflow
Step 1: Setup and Data Loading
Configure scanpy settings and read the count matrix into an AnnData object. Ensure gene names are unique.
import scanpy as sc import numpy as np import pandas as pd # Configure settings sc.settings.verbosity = 3 # errors=0, warnings=1, info=2, hints=3 sc.settings.set_figure_params(dpi=80, facecolor="white") # Load data — pick the reader matching your format adata = sc.read_10x_mtx( "path/to/filtered_feature_bc_matrix/", var_names="gene_symbols", cache=True, ) # Alternatives: # adata = sc.read_h5ad("data.h5ad") # adata = sc.read_10x_h5("data.h5") adata.var_names_make_unique() print(f"Loaded: {adata.n_obs} cells x {adata.n_vars} genes") print(f"Sparsity: {1 - adata.X.nnz / (adata.n_obs * adata.n_vars):.1%}")
Step 2: Quality Control
Annotate mitochondrial/ribosomal genes, compute QC metrics, visualize distributions, and filter low-quality cells.
# Annotate gene groups adata.var["mt"] = adata.var_names.str.startswith("MT-") # human; use "mt-" for mouse adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL")) # Calculate QC metrics sc.pp.calculate_qc_metrics( adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True ) # Visualize QC distributions sc.pl.violin( adata, ["n_genes_by_counts", "total_counts", "pct_counts_mt"], jitter=0.4, multi_panel=True, ) sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt") sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts") # Filter — adjust thresholds based on the QC plots above sc.pp.filter_cells(adata, min_genes=200) sc.pp.filter_genes(adata, min_cells=3) adata = adata[adata.obs.n_genes_by_counts < 5000, :] # remove potential doublets adata = adata[adata.obs.pct_counts_mt < 20, :] # remove dying cells print(f"After QC: {adata.n_obs} cells x {adata.n_vars} genes")
Step 3: Normalization and Feature Selection
Normalize library sizes, log-transform, and identify highly variable genes (HVGs). Store raw counts for later differential expression.
# Preserve raw counts in a layer adata.layers["counts"] = adata.X.copy() # Normalize to 10,000 counts per cell, then log-transform sc.pp.normalize_total(adata, target_sum=1e4) sc.pp.log1p(adata) # Save normalized data as .raw for visualization adata.raw = adata # Identify highly variable genes sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3", layer="counts") sc.pl.highly_variable_genes(adata) # Subset to HVGs adata = adata[:, adata.var.highly_variable] print(f"HVG subset: {adata.n_obs} cells x {adata.n_vars} genes")
Step 4: Scaling and Regression
Regress out confounders and scale gene expression values. This prepares data for PCA.
# Regress out unwanted sources of variation sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"]) # Scale to unit variance, clip extreme values sc.pp.scale(adata, max_value=10) print(f"Scaled matrix: mean={adata.X.mean():.4f}, std={adata.X.std():.4f}")
Step 5: Dimensionality Reduction
Compute PCA, inspect the variance ratio elbow plot, then build a neighborhood graph and embed with UMAP.
# PCA sc.tl.pca(adata, svd_solver="arpack", n_comps=50) sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50) # Determine n_pcs from elbow plot (typically 30-50) n_pcs = 40 # Compute k-nearest neighbor graph sc.pp.neighbors(adata, n_neighbors=15, n_pcs=n_pcs) # UMAP embedding sc.tl.umap(adata) sc.pl.umap(adata, color=["n_genes_by_counts", "total_counts", "pct_counts_mt"]) print(f"UMAP shape: {adata.obsm['X_umap'].shape}")
Step 6: Clustering
Run Leiden clustering at multiple resolutions and select the best granularity by visual inspection.
# Test multiple resolutions for res in [0.3, 0.5, 0.8, 1.0, 1.2]: sc.tl.leiden(adata, resolution=res, key_added=f"leiden_res{res}", flavor="igraph", n_iterations=2) # Compare resolutions side-by-side sc.pl.umap( adata, color=["leiden_res0.3", "leiden_res0.5", "leiden_res0.8", "leiden_res1.0"], ncols=2, legend_loc="on data", ) # Pick final resolution based on biological plausibility adata.obs["leiden"] = adata.obs["leiden_res0.5"] n_clusters = adata.obs["leiden"].nunique() print(f"Selected resolution=0.5: {n_clusters} clusters")
Step 7: Marker Gene Identification
Find differentially expressed genes per cluster using Wilcoxon rank-sum test on raw counts.
# Rank genes per cluster sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon", use_raw=True) # Visualization sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False) sc.pl.rank_genes_groups_dotplot(adata, n_genes=5) sc.pl.rank_genes_groups_heatmap(adata, n_genes=10, use_raw=True, show_gene_labels=True) # Extract as DataFrame markers_df = sc.get.rank_genes_groups_df(adata, group=None) top_markers = markers_df[markers_df["pvals_adj"] < 0.05].groupby("group").head(10) print(f"Significant markers: {len(top_markers)} genes across {top_markers['group'].nunique()} clusters") print(top_markers[["group", "names", "logfoldchanges", "pvals_adj"]].head(15))
Step 8: Cell Type Annotation and Export
Assign cell types based on canonical marker genes, visualize, and save all results.
# Define canonical markers (example: human PBMC) marker_genes = { "CD4 T cells": ["CD3D", "CD3E", "IL7R"], "CD8 T cells": ["CD8A", "CD8B", "GZMK"], "B cells": ["MS4A1", "CD79A", "CD79B"], "NK cells": ["NKG7", "GNLY", "KLRD1"], "CD14+ Monocytes": ["CD14", "LYZ", "S100A9"], "FCGR3A+ Monocytes":["FCGR3A", "MS4A7"], "Dendritic cells": ["FCER1A", "CST3"], "Platelets": ["PPBP", "PF4"], } # Dot plot — rows = clusters, columns = markers sc.pl.dotplot(adata, var_names=marker_genes, groupby="leiden", use_raw=True) # Map clusters → cell types (adjust based on your dot plot) cluster_to_celltype = { "0": "CD4 T cells", "1": "CD14+ Monocytes", "2": "B cells", "3": "CD8 T cells", "4": "NK cells", "5": "FCGR3A+ Monocytes", "6": "Dendritic cells", "7": "Platelets", } adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_to_celltype).fillna("Unknown") sc.pl.umap(adata, color="cell_type", legend_loc="on data", frameon=False) # Save results adata.write("results/annotated_data.h5ad", compression="gzip") adata.obs.to_csv("results/cell_metadata.csv") markers_df.to_csv("results/marker_genes.csv", index=False) print(f"Saved: {adata.n_obs} cells with {adata.obs['cell_type'].nunique()} cell types")
Key Parameters
| Parameter | Default | Range / Options | Effect |
|---|---|---|---|
(filter_cells) | | - | Minimum genes detected per cell; lower keeps more cells |
(filter_genes) | | - | Minimum cells expressing a gene; higher is stricter |
threshold | | - | Max mitochondrial %; tissue-dependent (5% for PBMCs, 20% for tumors) |
(HVG) | | - | Number of highly variable genes; more captures subtle variation |
(HVG) | | , , | HVG detection algorithm |
(neighbors) | | - | Number of PCs for neighbor graph; check elbow plot |
| | - | k for k-NN graph; lower emphasizes local structure |
(leiden) | | - | Clustering granularity; higher → more clusters |
(rank_genes) | | , , | DE test; Wilcoxon recommended for publication |
(normalize) | | - | Target library size per cell |
Common Recipes
Recipe: Batch Correction with ComBat
When to use: multiple samples/batches with visible batch effects on the UMAP.
# Requires 'batch' column in adata.obs sc.pp.combat(adata, key="batch") # Re-run PCA, neighbors, UMAP, clustering after correction sc.tl.pca(adata, svd_solver="arpack") sc.pp.neighbors(adata, n_pcs=40) sc.tl.umap(adata) sc.pl.umap(adata, color=["batch", "leiden"])
Recipe: Trajectory Inference with PAGA + Diffusion Pseudotime
When to use: cells follow a developmental continuum rather than discrete clusters.
# PAGA graph abstraction sc.tl.paga(adata, groups="leiden") sc.pl.paga(adata, color="leiden", threshold=0.03) # Reinitialize UMAP with PAGA layout sc.tl.umap(adata, init_pos="paga") # Diffusion pseudotime — set root cell adata.uns["iroot"] = np.flatnonzero(adata.obs["leiden"] == "0")[0] sc.tl.diffmap(adata) sc.tl.dpt(adata) sc.pl.umap(adata, color=["dpt_pseudotime", "leiden"])
Recipe: Publication-Quality Figures
When to use: generating figures for manuscripts or presentations.
sc.settings.set_figure_params(dpi=300, frameon=False, figsize=(5, 5)) # UMAP with custom palette sc.pl.umap( adata, color="cell_type", palette="Set2", legend_loc="on data", legend_fontsize=10, legend_fontoutline=2, title="", save="_celltype_publication.pdf", ) # Stacked violin plot of top markers flat_markers = [g for genes in marker_genes.values() for g in genes[:2]] sc.pl.stacked_violin( adata, var_names=flat_markers, groupby="cell_type", use_raw=True, swap_axes=True, save="_markers_violin.pdf", )
Recipe: Differential Expression Between Conditions
When to use: comparing treated vs control within a specific cell type.
# Subset to cell type of interest adata_t = adata[adata.obs["cell_type"] == "CD4 T cells"].copy() # DE between conditions sc.tl.rank_genes_groups(adata_t, groupby="condition", groups=["treated"], reference="control", method="wilcoxon", use_raw=True) sc.pl.rank_genes_groups_volcano(adata_t, groups=["treated"]) de_results = sc.get.rank_genes_groups_df(adata_t, group="treated") sig_genes = de_results[de_results["pvals_adj"] < 0.05] print(f"Significant DE genes: {len(sig_genes)}")
Expected Outputs
— Full AnnData with embeddings (PCA, UMAP), cluster labels, cell type annotations, and raw counts layerresults/annotated_data.h5ad
— Per-cell table: barcode, cluster, cell type, QC metrics (n_genes, total_counts, pct_mt)results/cell_metadata.csv
— DE genes per cluster: gene name, log fold change, adjusted p-valueresults/marker_genes.csv- QC figures: violin plots (n_genes, total_counts, pct_mt), scatter plots
- Embedding figures: UMAP colored by cluster, cell type, QC metrics, pseudotime
- Marker figures: dot plot, heatmap, stacked violin of canonical markers
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Leiden package not installed | |
during PCA | Dense matrix exceeds RAM | Use |
| UMAP shows single blob | Over-filtering or too few HVGs | Relax QC thresholds; increase to 3000-5000 |
| Too many/few clusters | Resolution mismatch | Sweep from 0.1 to 2.0 and compare UMAPs |
not found | Wrong species prefix | Use (lowercase) for mouse, for human |
| Batch effects dominate UMAP | Uncorrected technical variation | Apply ComBat recipe or use Harmony/scVI for integration |
error | Forgot to save raw before subsetting | Set after normalization, before HVG subset |
| Marker genes non-specific | Resolution too low merging cell types | Increase resolution or use method |
Slow | Large cell count | Skip regress_out; use alone (often sufficient) |
in | Cluster with too few cells | Remove clusters with <10 cells before DE: |
Bundled Resources
This skill includes reference files for deeper lookup. Read these on demand when the main SKILL.md needs more detail.
references/api_reference.md
Quick-lookup table of all scanpy functions organized by module (sc.pp., sc.tl., sc.pl.*), with full signatures, common parameters, and AnnData structure reference. Use when: you need the exact function name or parameter for a specific operation.
references/plotting_guide.md
Comprehensive visualization guide: QC plots, UMAP styling, marker gene plots (dot plot, heatmap, stacked violin), trajectory plots, multi-panel figures, publication customization, and color palette recommendations. Use when: creating figures for publications or presentations.
references/standard_workflow.md
Detailed step-by-step reference for each pipeline stage with decision points, parameter rationale, and alternative approaches (MAD-based filtering, scran normalization, automated annotation). Use when: performing a full analysis from scratch and needing deeper guidance than the Workflow section above.
References
- Scanpy documentation — Official API reference and tutorials (BSD-3)
- Scanpy PBMC3k tutorial — Canonical walkthrough (BSD-3)
- Galaxy Training: Clustering 3K PBMCs with Scanpy — Step-by-step tutorial (CC-BY)
- Luecken & Theis (2019) — "Current best practices in single-cell RNA-seq analysis: a tutorial", Mol Syst Biol
- Heumos et al. (2023) — "Best practices for single-cell analysis across modalities", Nat Rev Genet
- scverse ecosystem — Related tools: anndata, scvi-tools, squidpy, cellrank