BioClaw scrna-preprocessing-clustering

Standard scRNA-seq preprocessing and clustering with Scanpy. Use for QC, normalization, HVG selection, PCA, neighbor graph construction, UMAP, Leiden clustering, and export of an analysis-ready AnnData object.

install
source · Clone the upstream repo
git clone https://github.com/Runchuan-BU/BioClaw
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Runchuan-BU/BioClaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/container/skills/scrna-preprocessing-clustering" ~/.claude/skills/runchuan-bu-bioclaw-scrna-preprocessing-clustering && rm -rf "$T"
manifest: container/skills/scrna-preprocessing-clustering/SKILL.md

scRNA Preprocessing And Clustering

Version Compatibility

Reference examples assume:

  • scanpy 1.10+
  • anndata 0.10+
  • pandas 2.2+
  • matplotlib 3.8+

Before using the code patterns, verify that the versions installed in the environment meet the reference versions above:

  • Python:
    python -c "import scanpy, anndata; print(scanpy.__version__, anndata.__version__)"
  • If signatures differ, inspect the installed API and adapt the pattern instead of retrying unchanged.
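
The check above can also be scripted. The following is a minimal sketch using only the standard library; the helper names major_minor and check_versions are hypothetical, and the minimums are the reference versions listed above.

```python
import re
from importlib import metadata

# Reference minimums from the list above, as (major, minor) tuples.
MINIMUMS = {"scanpy": (1, 10), "anndata": (0, 10), "pandas": (2, 2), "matplotlib": (3, 8)}


def major_minor(version: str) -> tuple[int, int]:
    """Return (major, minor) from a version string such as '1.10.3' or '2.2.0rc1'."""
    numbers = []
    for part in version.split(".")[:2]:
        match = re.match(r"\d+", part)  # leading digits only, so '10rc1' -> 10
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 2:
        numbers.append(0)
    return numbers[0], numbers[1]


def check_versions(minimums=MINIMUMS, lookup=metadata.version) -> dict:
    """Map each package name to its installed version and a status note."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = lookup(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        status = "ok" if major_minor(installed) >= minimum else "older than reference"
        report[name] = f"{installed} ({status})"
    return report


for name, status in check_versions().items():
    print(f"{name}: {status}")
```

A package reported as "older than reference" is a prompt to inspect the installed API before reusing a pattern, not proof that the pattern will fail.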

Overview

Use this skill to turn raw or minimally processed scRNA-seq data into an analysis-ready object with:

  • QC-filtered cells and genes
  • normalized expression values
  • highly variable genes
  • PCA and UMAP embeddings
  • Leiden clusters
  • saved h5ad artifact for annotation, DE, integration, or trajectory analysis

When To Use This Skill

  • raw 10x matrices, filtered count matrices, or h5ad inputs need standard preprocessing
  • the user wants UMAP, clustering, or marker discovery
  • downstream tasks depend on a stable single-cell object rather than ad hoc plots

Quick Route

  • If the input is already a processed h5ad, inspect adata.raw, embeddings, cluster columns, and QC columns before rerunning preprocessing.
  • If the input is raw counts, do QC first and only normalize after filtering obvious low-quality cells.
  • If multiple batches are present, preprocess cleanly first, then consider integration instead of hiding batch effects with aggressive filtering.
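
As a rough illustration of the first route, the sketch below summarizes what a loaded object already carries. The function name summarize_state, the column-name heuristics, and the integer check on X are assumptions made for this example, not part of scanpy or anndata. The demo object is a stand-in; with real data, pass the result of anndata.read_h5ad(path).

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd

# QC columns written by sc.pp.calculate_qc_metrics with qc_vars=["mt"].
QC_COLUMNS = ("n_genes_by_counts", "total_counts", "pct_counts_mt")


def summarize_state(adata) -> dict:
    """Summarize which preprocessing products an AnnData-like object already has."""
    obs_cols = list(adata.obs.columns)
    x = adata.X
    # SciPy sparse matrices expose tocsr() and keep stored values in .data.
    values = x.data if hasattr(x, "tocsr") else np.asarray(x)
    sample = np.asarray(values).ravel()[:100_000]
    # Heuristic only: non-negative integer values suggest unnormalized counts.
    looks_like_counts = bool(
        sample.size and np.all(sample >= 0) and np.allclose(sample, np.round(sample))
    )
    return {
        "has_raw": adata.raw is not None,
        "layers": list(adata.layers.keys()),
        "embeddings": list(adata.obsm.keys()),
        "cluster_columns": [
            c for c in obs_cols if c.lower().startswith(("leiden", "louvain", "cluster"))
        ],
        "qc_columns": [c for c in obs_cols if c in QC_COLUMNS],
        "x_looks_like_counts": looks_like_counts,
    }


# Stand-in object for illustration only.
demo = SimpleNamespace(
    X=np.array([[0.0, 3.0], [1.0, 0.0]]),
    obs=pd.DataFrame({"leiden_r05": ["0", "1"], "pct_counts_mt": [4.2, 9.9]}),
    obsm={"X_pca": np.zeros((2, 2)), "X_umap": np.zeros((2, 2))},
    layers={},
    raw=None,
)
print(summarize_state(demo))
```

If embeddings and cluster columns are present and X no longer looks like counts, the object has probably been processed already and should not be renormalized.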

Progressive Disclosure

Default Rules

  • Keep raw counts recoverable. Store them in a layer such as adata.layers["counts"] before normalization, and set adata.raw = adata.copy() before HVG subsetting, regression, or scaling.
  • Report thresholds explicitly. Do not silently drop cells or genes.
  • Show QC distributions before applying hard filters.
  • Use vector outputs such as .pdf or .svg for final figures when possible.

Expected Inputs

  • 10x directory, .h5, .h5ad, or count matrix
  • cell metadata if available
  • species context for mitochondrial or ribosomal gene detection

Expected Outputs

  • results/processed.h5ad
  • qc/cell_qc_metrics.tsv
  • qc/gene_qc_metrics.tsv
  • figures/qc_violin.pdf
  • figures/pca_variance_ratio.pdf
  • figures/umap_leiden.pdf

Preferred Tools

  • scanpy
  • anndata
  • pandas
  • matplotlib
  • seaborn

Starter Pattern

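import os

import scanpy as sc

# Load a 10x matrix directory and make duplicated gene symbols unique.
adata = sc.read_10x_mtx("counts/")
adata.var_names_make_unique()

# Flag mitochondrial genes; upper-casing covers human "MT-" and mouse "mt-" prefixes.
adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# First-pass cell filter; tune the thresholds after inspecting QC distributions.
adata = adata[
    (adata.obs["n_genes_by_counts"] >= 200)
    & (adata.obs["n_genes_by_counts"] <= 6000)
    & (adata.obs["pct_counts_mt"] < 15),
    :
].copy()

sc.pp.filter_genes(adata, min_cells=3)

# Keep raw counts recoverable; flavor="seurat_v3" expects counts, not log-normalized values.
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Freeze the log-normalized, all-gene matrix before HVG subsetting and scaling.
adata.raw = adata.copy()

# flavor="seurat_v3" needs the scikit-misc package installed.
sc.pp.highly_variable_genes(adata, n_top_genes=3000, flavor="seurat_v3", layer="counts")
adata = adata[:, adata.var["highly_variable"]].copy()
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, svd_solver="arpack")
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
sc.tl.umap(adata)
sc.tl.leiden(adata, resolution=0.5, key_added="leiden_r05")

# adata.write does not create missing parent directories.
os.makedirs("results", exist_ok=True)
adata.write("results/processed.h5ad")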

Workflow

1. Load and validate the object

  • confirm orientation is cells by genes
  • make gene names unique
  • record sample IDs and batch labels before merging or filtering

2. Compute QC metrics and inspect distributions

  • n_genes_by_counts
  • total_counts
  • pct_counts_mt
  • optional ribosomal or hemoglobin fractions

Plot distributions before filtering. Thresholds vary by chemistry, tissue, and nucleus versus whole-cell assay.
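
To make the three core metrics concrete, here is a small NumPy sketch that computes them by hand. The counts and gene names are invented, and the gene names are mixed-case on purpose to show why the prefix test upper-cases them. In practice sc.pp.calculate_qc_metrics produces these columns.

```python
import numpy as np

# Toy matrix: 3 cells (rows) x 4 genes (columns); values are invented.
gene_names = np.array(["MT-CO1", "ACTB", "mt-Nd1", "GAPDH"])
counts = np.array(
    [
        [5, 10, 0, 5],
        [0, 0, 2, 8],
        [1, 1, 1, 1],
    ]
)

# Mitochondrial genes by prefix, case-insensitive.
is_mt = np.char.startswith(np.char.upper(gene_names), "MT-")

# Number of genes with at least one count in each cell.
n_genes_by_counts = (counts > 0).sum(axis=1)

# Total counts per cell.
total_counts = counts.sum(axis=1)

# Percentage of each cell's counts that come from mitochondrial genes.
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts

print(n_genes_by_counts)  # [3 2 4]
print(total_counts)       # [20 10  4]
print(pct_counts_mt)      # [25. 20. 50.]
```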

3. Filter cells and genes

Use dataset-aware thresholds. Good first-pass defaults:

  • min_genes >= 200
  • max_genes <= 5000-8000 to remove likely doublets in many droplet datasets
  • pct_counts_mt < 10-20 depending on tissue stress
  • min_cells >= 3 for genes
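
One way to apply such thresholds without silently dropping cells is to build the keep-mask and a per-threshold summary together. This is a sketch on a pandas table shaped like adata.obs; the function name filter_cells is hypothetical, the default thresholds are first-pass values from the ranges above, and the toy values are invented.

```python
import pandas as pd


def filter_cells(obs, min_genes=200, max_genes=6000, max_pct_mt=15.0):
    """Return a boolean keep-mask and a table of how many cells each threshold removes."""
    fails = pd.DataFrame(
        {
            f"n_genes_by_counts < {min_genes}": obs["n_genes_by_counts"] < min_genes,
            f"n_genes_by_counts > {max_genes}": obs["n_genes_by_counts"] > max_genes,
            f"pct_counts_mt >= {max_pct_mt}": obs["pct_counts_mt"] >= max_pct_mt,
        }
    )
    keep = ~fails.any(axis=1)
    summary = fails.sum().rename("cells_failing").to_frame()
    summary.loc["cells_before"] = len(obs)
    summary.loc["cells_after"] = int(keep.sum())
    return keep, summary


# Toy QC table standing in for adata.obs.
obs = pd.DataFrame(
    {
        "n_genes_by_counts": [150, 2500, 7000, 3000, 1200],
        "pct_counts_mt": [5.0, 8.0, 4.0, 30.0, 12.0],
    },
    index=["c1", "c2", "c3", "c4", "c5"],
)
keep, summary = filter_cells(obs)
print(summary)
```

With a real object, the mask would be applied as adata = adata[keep.to_numpy(), :].copy(), and the summary could be written with summary.to_csv("qc/filter_summary.tsv", sep="\t") once the qc directory exists.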

4. Normalize, log-transform, and select HVGs

  • store raw counts before normalizing, for example in adata.layers["counts"]
  • normalize with target_sum=1e4
  • log1p
  • select 2000-4000 HVGs
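
The arithmetic behind total-count normalization and log1p can be shown on a toy matrix. This NumPy sketch mirrors what sc.pp.normalize_total (with its default settings) followed by sc.pp.log1p computes; the counts are invented, and the two cells share the same composition at different sequencing depths.

```python
import numpy as np

# Two cells with identical composition but 5x different depth (invented values).
counts = np.array(
    [
        [2.0, 6.0, 2.0],
        [10.0, 30.0, 10.0],
    ]
)
target_sum = 1e4

# Scale each cell so its counts sum to target_sum.
size_factors = counts.sum(axis=1, keepdims=True) / target_sum
normalized = counts / size_factors

# log1p(x) = log(1 + x) compresses the dynamic range and keeps zeros at zero.
log_normalized = np.log1p(normalized)

print(normalized)      # both rows become [2000. 6000. 2000.]
print(log_normalized)
```

After normalization the depth difference disappears, which is the point of the step: remaining differences between cells reflect composition rather than how deeply each cell was sequenced.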

5. Reduce dimensions and cluster

  • PCA on HVGs
  • neighbor graph using 10-30 PCs and 10-30 neighbors as a starting range
  • UMAP for visualization
  • Leiden across a small resolution grid such as 0.2, 0.5, 0.8, 1.0
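
The resolution grid can be sketched as a loop that stores each clustering under its own key, followed by a summary that helps compare resolutions. The helper names resolution_key, run_leiden_grid, and summarize_grid are hypothetical; the key format follows the leiden_r05 naming in the starter pattern. run_leiden_grid assumes sc.pp.neighbors has already been run and that a Leiden backend is installed.

```python
import pandas as pd


def resolution_key(resolution: float) -> str:
    """Build an obs column name such as 'leiden_r05' for resolution 0.5."""
    return f"leiden_r{int(round(resolution * 10)):02d}"


def run_leiden_grid(adata, resolutions=(0.2, 0.5, 0.8, 1.0)):
    """Run Leiden once per resolution and return the obs keys that were added."""
    import scanpy as sc  # imported here so the other helpers work without scanpy

    keys = []
    for resolution in resolutions:
        key = resolution_key(resolution)
        sc.tl.leiden(adata, resolution=resolution, key_added=key)
        keys.append(key)
    return keys


def summarize_grid(obs, keys):
    """Report the number of clusters and the smallest cluster size per key."""
    rows = []
    for key in keys:
        sizes = obs[key].value_counts()
        sizes = sizes[sizes > 0]  # categoricals can list empty categories
        rows.append(
            {"key": key, "n_clusters": len(sizes), "smallest_cluster": int(sizes.min())}
        )
    return pd.DataFrame(rows).set_index("key")


# Toy cluster labels standing in for adata.obs after run_leiden_grid.
obs = pd.DataFrame({"leiden_r02": list("00011"), "leiden_r10": list("01223")})
print(summarize_grid(obs, ["leiden_r02", "leiden_r10"]))
```

A resolution that produces many very small clusters is usually a sign of over-splitting, though the choice should still be confirmed by marker inspection.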

6. Export analysis-ready artifacts

Always save:

  • processed h5ad
  • QC tables
  • cluster assignments
  • publication-ready QC and UMAP figures

Output Artifacts

  • results/processed.h5ad: main reusable AnnData object
  • results/cluster_assignments.tsv: barcode plus cluster labels
  • qc/filter_summary.tsv: counts before and after filtering
  • figures/umap_leiden.pdf: main embedding figure
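
A minimal sketch of writing the cluster assignment table, assuming barcodes are the index of an obs-like pandas table. The function name export_cluster_assignments is hypothetical, and the demo writes to a temporary directory rather than results/.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_cluster_assignments(obs, cluster_key, path):
    """Write a two-column TSV (barcode, cluster label) and return the table."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create results/ if missing
    table = obs[[cluster_key]].rename_axis("barcode").reset_index()
    table.to_csv(path, sep="\t", index=False)
    return table


# Toy table standing in for adata.obs; barcodes are invented.
obs = pd.DataFrame(
    {"leiden_r05": ["0", "1", "0"]},
    index=["AAAC-1", "AAAG-1", "AACT-1"],
)
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "results" / "cluster_assignments.tsv"
    table = export_cluster_assignments(obs, "leiden_r05", out)
    print(out.read_text())
```

With a real object, the call would be export_cluster_assignments(adata.obs, "leiden_r05", "results/cluster_assignments.tsv").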

Quality Review

  • Median genes per cell should be plausible for the chemistry and tissue.
  • Mitochondrial fraction should not dominate retained cells.
  • The PCA variance ratio should decay smoothly, and the leading components should not reflect only obvious technical axes.
  • UMAP should be reviewed together with QC metrics and batch labels, not alone.
  • Cluster labels should not be finalized before marker inspection.
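
Several of these checks can be tabulated per cluster before looking at the UMAP. The sketch below assumes an obs-like pandas table with the QC columns from calculate_qc_metrics. The function name cluster_review and the batch column name "batch" are hypothetical, and the toy values are invented.

```python
import pandas as pd


def cluster_review(obs, cluster_key, batch_key):
    """Return per-cluster QC medians and the batch composition of each cluster."""
    grouped = obs.groupby(cluster_key, observed=True)
    qc = grouped[["n_genes_by_counts", "pct_counts_mt"]].median()
    qc["n_cells"] = grouped.size()
    # Each row sums to 1: the fraction of a cluster's cells from each batch.
    batch_fraction = pd.crosstab(obs[cluster_key], obs[batch_key], normalize="index")
    return qc, batch_fraction


# Toy table standing in for adata.obs.
obs = pd.DataFrame(
    {
        "leiden_r05": ["0", "0", "1", "1"],
        "batch": ["a", "b", "a", "a"],
        "n_genes_by_counts": [1000, 2000, 300, 500],
        "pct_counts_mt": [5.0, 7.0, 25.0, 35.0],
    }
)
qc, batch_fraction = cluster_review(obs, "leiden_r05", "batch")
print(qc)
print(batch_fraction)
```

In this toy table, cluster 1 has few genes per cell, a high mitochondrial fraction, and cells from a single batch, which is the kind of pattern that should be ruled out as technical before the cluster is given a biological label.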

Anti-Patterns

  • reprocessing an already integrated object as if it were raw counts
  • using a single universal mitochondrial threshold for every tissue
  • interpreting UMAP separation as biology before checking batch and QC covariates
  • discarding raw counts needed later for DE or pseudobulk

Related Skills

  • Cell Annotation
  • Cell Communication
  • Trajectory And Lineage
  • Multiome And scATAC

Optional Supplements

  • anndata
  • scanpy