SciAgent-Skills scanpy-scrna-seq

Single-cell RNA-seq analysis with Scanpy. QC filtering, normalization, HVG selection, PCA, neighborhood graph, UMAP/t-SNE, Leiden clustering, marker gene identification, cell type annotation, and trajectory inference. Use for standard scRNA-seq exploratory workflows.

install
source · Clone the upstream repo
git clone https://github.com/jaechang-hits/SciAgent-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/jaechang-hits/SciAgent-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/genomics-bioinformatics/scanpy-scrna-seq" ~/.claude/skills/jaechang-hits-sciagent-skills-scanpy-scrna-seq && rm -rf "$T"
manifest: skills/genomics-bioinformatics/scanpy-scrna-seq/SKILL.md
source content

Scanpy Single-Cell RNA-seq Analysis

Overview

Scanpy is a scalable Python toolkit for analyzing single-cell RNA-seq data built on the AnnData format. This skill covers the end-to-end standard workflow: quality control, normalization, highly variable gene selection, dimensionality reduction, clustering, marker gene identification, and cell type annotation. It produces annotated datasets and publication-quality visualizations.

When to Use

  • Analyzing single-cell RNA-seq count matrices (10X Genomics, h5ad, CSV, loom)
  • Performing quality control filtering on scRNA-seq datasets (mitochondrial %, gene counts)
  • Running dimensionality reduction: PCA, UMAP, t-SNE
  • Identifying cell clusters via Leiden community detection
  • Finding differentially expressed marker genes per cluster (Wilcoxon, t-test, logistic regression)
  • Annotating cell types from known marker gene panels
  • Conducting trajectory inference and pseudotime analysis (PAGA, diffusion pseudotime)
  • Generating publication-quality single-cell plots (dot plots, heatmaps, stacked violins)
  • Comparing gene expression across experimental conditions within cell types
  • Use Seurat (R/Bioconductor) instead for scRNA-seq analysis in an existing R workflow or when Seurat-specific assay types are required

Prerequisites

  • Python packages:
    scanpy>=1.10
    ,
    leidenalg
    ,
    igraph
    ,
    anndata
  • Data requirements: Gene expression count matrix (cells x genes). Common formats: 10X Cell Ranger output (
    filtered_feature_bc_matrix/
    ),
    .h5ad
    ,
    .h5
    ,
    .csv
  • Environment: Python 3.9+; 16GB+ RAM recommended for >50k cells
pip install "scanpy[leiden]" anndata

Workflow

Step 1: Setup and Data Loading

Configure scanpy settings and read the count matrix into an AnnData object. Ensure gene names are unique.

import scanpy as sc
import numpy as np
import pandas as pd

# Configure settings
sc.settings.verbosity = 3  # errors=0, warnings=1, info=2, hints=3
sc.settings.set_figure_params(dpi=80, facecolor="white")

# Load data — pick the reader matching your format
adata = sc.read_10x_mtx(
    "path/to/filtered_feature_bc_matrix/",
    var_names="gene_symbols",
    cache=True,
)
# Alternatives:
# adata = sc.read_h5ad("data.h5ad")
# adata = sc.read_10x_h5("data.h5")

adata.var_names_make_unique()
print(f"Loaded: {adata.n_obs} cells x {adata.n_vars} genes")
print(f"Sparsity: {1 - adata.X.nnz / (adata.n_obs * adata.n_vars):.1%}")

Step 2: Quality Control

Annotate mitochondrial/ribosomal genes, compute QC metrics, visualize distributions, and filter low-quality cells.

# Annotate gene groups
adata.var["mt"] = adata.var_names.str.startswith("MT-")    # human; use "mt-" for mouse
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))

# Calculate QC metrics
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt", "ribo"], percent_top=None, log1p=False, inplace=True
)

# Visualize QC distributions
sc.pl.violin(
    adata,
    ["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True,
)
sc.pl.scatter(adata, x="total_counts", y="pct_counts_mt")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")

# Filter — adjust thresholds based on the QC plots above
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
adata = adata[adata.obs.n_genes_by_counts < 5000, :]  # remove potential doublets
adata = adata[adata.obs.pct_counts_mt < 20, :]         # remove dying cells

print(f"After QC: {adata.n_obs} cells x {adata.n_vars} genes")

Step 3: Normalization and Feature Selection

Normalize library sizes, log-transform, and identify highly variable genes (HVGs). Store raw counts for later differential expression.

# Preserve raw counts in a layer
adata.layers["counts"] = adata.X.copy()

# Normalize to 10,000 counts per cell, then log-transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Save normalized data as .raw for visualization
adata.raw = adata

# Identify highly variable genes
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3", layer="counts")
sc.pl.highly_variable_genes(adata)

# Subset to HVGs
adata = adata[:, adata.var.highly_variable]
print(f"HVG subset: {adata.n_obs} cells x {adata.n_vars} genes")

Step 4: Scaling and Regression

Regress out confounders and scale gene expression values. This prepares data for PCA.

# Regress out unwanted sources of variation
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])

# Scale to unit variance, clip extreme values
sc.pp.scale(adata, max_value=10)

print(f"Scaled matrix: mean={adata.X.mean():.4f}, std={adata.X.std():.4f}")

Step 5: Dimensionality Reduction

Compute PCA, inspect the variance ratio elbow plot, then build a neighborhood graph and embed with UMAP.

# PCA
sc.tl.pca(adata, svd_solver="arpack", n_comps=50)
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=50)

# Determine n_pcs from elbow plot (typically 30-50)
n_pcs = 40

# Compute k-nearest neighbor graph
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=n_pcs)

# UMAP embedding
sc.tl.umap(adata)
sc.pl.umap(adata, color=["n_genes_by_counts", "total_counts", "pct_counts_mt"])

print(f"UMAP shape: {adata.obsm['X_umap'].shape}")

Step 6: Clustering

Run Leiden clustering at multiple resolutions and select the best granularity by visual inspection.

# Test multiple resolutions
for res in [0.3, 0.5, 0.8, 1.0, 1.2]:
    sc.tl.leiden(adata, resolution=res, key_added=f"leiden_res{res}", flavor="igraph", n_iterations=2)

# Compare resolutions side-by-side
sc.pl.umap(
    adata,
    color=["leiden_res0.3", "leiden_res0.5", "leiden_res0.8", "leiden_res1.0"],
    ncols=2,
    legend_loc="on data",
)

# Pick final resolution based on biological plausibility
adata.obs["leiden"] = adata.obs["leiden_res0.5"]
n_clusters = adata.obs["leiden"].nunique()
print(f"Selected resolution=0.5: {n_clusters} clusters")

Step 7: Marker Gene Identification

Find differentially expressed genes per cluster using Wilcoxon rank-sum test on raw counts.

# Rank genes per cluster
sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon", use_raw=True)

# Visualization
sc.pl.rank_genes_groups(adata, n_genes=20, sharey=False)
sc.pl.rank_genes_groups_dotplot(adata, n_genes=5)
sc.pl.rank_genes_groups_heatmap(adata, n_genes=10, use_raw=True, show_gene_labels=True)

# Extract as DataFrame
markers_df = sc.get.rank_genes_groups_df(adata, group=None)
top_markers = markers_df[markers_df["pvals_adj"] < 0.05].groupby("group").head(10)
print(f"Significant markers: {len(top_markers)} genes across {top_markers['group'].nunique()} clusters")
print(top_markers[["group", "names", "logfoldchanges", "pvals_adj"]].head(15))

Step 8: Cell Type Annotation and Export

Assign cell types based on canonical marker genes, visualize, and save all results.

# Define canonical markers (example: human PBMC)
marker_genes = {
    "CD4 T cells":      ["CD3D", "CD3E", "IL7R"],
    "CD8 T cells":      ["CD8A", "CD8B", "GZMK"],
    "B cells":          ["MS4A1", "CD79A", "CD79B"],
    "NK cells":         ["NKG7", "GNLY", "KLRD1"],
    "CD14+ Monocytes":  ["CD14", "LYZ", "S100A9"],
    "FCGR3A+ Monocytes":["FCGR3A", "MS4A7"],
    "Dendritic cells":  ["FCER1A", "CST3"],
    "Platelets":        ["PPBP", "PF4"],
}

# Dot plot — rows = clusters, columns = markers
sc.pl.dotplot(adata, var_names=marker_genes, groupby="leiden", use_raw=True)

# Map clusters → cell types (adjust based on your dot plot)
cluster_to_celltype = {
    "0": "CD4 T cells",
    "1": "CD14+ Monocytes",
    "2": "B cells",
    "3": "CD8 T cells",
    "4": "NK cells",
    "5": "FCGR3A+ Monocytes",
    "6": "Dendritic cells",
    "7": "Platelets",
}
adata.obs["cell_type"] = adata.obs["leiden"].map(cluster_to_celltype).fillna("Unknown")
sc.pl.umap(adata, color="cell_type", legend_loc="on data", frameon=False)

# Save results
adata.write("results/annotated_data.h5ad", compression="gzip")
adata.obs.to_csv("results/cell_metadata.csv")
markers_df.to_csv("results/marker_genes.csv", index=False)
print(f"Saved: {adata.n_obs} cells with {adata.obs['cell_type'].nunique()} cell types")

Key Parameters

ParameterDefaultRange / OptionsEffect
min_genes
(filter_cells)
200
100
-
500
Minimum genes detected per cell; lower keeps more cells
min_cells
(filter_genes)
3
3
-
10
Minimum cells expressing a gene; higher is stricter
pct_counts_mt
threshold
20
5
-
30
Max mitochondrial %; tissue-dependent (5% for PBMCs, 20% for tumors)
n_top_genes
(HVG)
2000
1000
-
5000
Number of highly variable genes; more captures subtle variation
flavor
(HVG)
"seurat_v3"
"seurat"
,
"seurat_v3"
,
"cell_ranger"
HVG detection algorithm
n_pcs
(neighbors)
40
20
-
50
Number of PCs for neighbor graph; check elbow plot
n_neighbors
15
5
-
50
k for k-NN graph; lower emphasizes local structure
resolution
(leiden)
0.5
0.1
-
2.0
Clustering granularity; higher → more clusters
method
(rank_genes)
"wilcoxon"
"wilcoxon"
,
"t-test"
,
"logreg"
DE test; Wilcoxon recommended for publication
target_sum
(normalize)
1e4
1e4
-
1e5
Target library size per cell

Common Recipes

Recipe: Batch Correction with ComBat

When to use: multiple samples/batches with visible batch effects on the UMAP.

# Requires 'batch' column in adata.obs
sc.pp.combat(adata, key="batch")
# Re-run PCA, neighbors, UMAP, clustering after correction
sc.tl.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata, n_pcs=40)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["batch", "leiden"])

Recipe: Trajectory Inference with PAGA + Diffusion Pseudotime

When to use: cells follow a developmental continuum rather than discrete clusters.

# PAGA graph abstraction
sc.tl.paga(adata, groups="leiden")
sc.pl.paga(adata, color="leiden", threshold=0.03)

# Reinitialize UMAP with PAGA layout
sc.tl.umap(adata, init_pos="paga")

# Diffusion pseudotime — set root cell
adata.uns["iroot"] = np.flatnonzero(adata.obs["leiden"] == "0")[0]
sc.tl.diffmap(adata)
sc.tl.dpt(adata)
sc.pl.umap(adata, color=["dpt_pseudotime", "leiden"])

Recipe: Publication-Quality Figures

When to use: generating figures for manuscripts or presentations.

sc.settings.set_figure_params(dpi=300, frameon=False, figsize=(5, 5))

# UMAP with custom palette
sc.pl.umap(
    adata, color="cell_type",
    palette="Set2",
    legend_loc="on data",
    legend_fontsize=10,
    legend_fontoutline=2,
    title="",
    save="_celltype_publication.pdf",
)

# Stacked violin plot of top markers
flat_markers = [g for genes in marker_genes.values() for g in genes[:2]]
sc.pl.stacked_violin(
    adata, var_names=flat_markers, groupby="cell_type",
    use_raw=True, swap_axes=True,
    save="_markers_violin.pdf",
)

Recipe: Differential Expression Between Conditions

When to use: comparing treated vs control within a specific cell type.

# Subset to cell type of interest
adata_t = adata[adata.obs["cell_type"] == "CD4 T cells"].copy()

# DE between conditions
sc.tl.rank_genes_groups(adata_t, groupby="condition", groups=["treated"], reference="control", method="wilcoxon", use_raw=True)
sc.pl.rank_genes_groups_volcano(adata_t, groups=["treated"])

de_results = sc.get.rank_genes_groups_df(adata_t, group="treated")
sig_genes = de_results[de_results["pvals_adj"] < 0.05]
print(f"Significant DE genes: {len(sig_genes)}")

Expected Outputs

  • results/annotated_data.h5ad
    — Full AnnData with embeddings (PCA, UMAP), cluster labels, cell type annotations, and raw counts layer
  • results/cell_metadata.csv
    — Per-cell table: barcode, cluster, cell type, QC metrics (n_genes, total_counts, pct_mt)
  • results/marker_genes.csv
    — DE genes per cluster: gene name, log fold change, adjusted p-value
  • QC figures: violin plots (n_genes, total_counts, pct_mt), scatter plots
  • Embedding figures: UMAP colored by cluster, cell type, QC metrics, pseudotime
  • Marker figures: dot plot, heatmap, stacked violin of canonical markers

Troubleshooting

ProblemCauseSolution
ModuleNotFoundError: leidenalg
Leiden package not installed
pip install leidenalg igraph
MemoryError
during PCA
Dense matrix exceeds RAMUse
sc.pp.pca(adata, chunked=True, chunk_size=20000)
UMAP shows single blobOver-filtering or too few HVGsRelax QC thresholds; increase
n_top_genes
to 3000-5000
Too many/few clustersResolution mismatchSweep
resolution
from 0.1 to 2.0 and compare UMAPs
KeyError: 'MT-'
not found
Wrong species prefixUse
mt-
(lowercase) for mouse,
MT-
for human
Batch effects dominate UMAPUncorrected technical variationApply ComBat recipe or use Harmony/scVI for integration
adata.raw is None
error
Forgot to save raw before subsettingSet
adata.raw = adata
after normalization, before HVG subset
Marker genes non-specificResolution too low merging cell typesIncrease resolution or use
logistic_regression
method
Slow
regress_out
Large cell countSkip regress_out; use
sc.pp.scale()
alone (often sufficient)
ValueError
in
rank_genes_groups
Cluster with too few cellsRemove clusters with <10 cells before DE:
adata = adata[adata.obs.groupby('leiden').filter(lambda x: len(x)>=10).index]

Bundled Resources

This skill includes reference files for deeper lookup. Read these on demand when the main SKILL.md needs more detail.

references/api_reference.md

Quick-lookup table of all scanpy functions organized by module (sc.pp., sc.tl., sc.pl.*), with full signatures, common parameters, and AnnData structure reference. Use when: you need the exact function name or parameter for a specific operation.

references/plotting_guide.md

Comprehensive visualization guide: QC plots, UMAP styling, marker gene plots (dot plot, heatmap, stacked violin), trajectory plots, multi-panel figures, publication customization, and color palette recommendations. Use when: creating figures for publications or presentations.

references/standard_workflow.md

Detailed step-by-step reference for each pipeline stage with decision points, parameter rationale, and alternative approaches (MAD-based filtering, scran normalization, automated annotation). Use when: performing a full analysis from scratch and needing deeper guidance than the Workflow section above.

References