BioClaw scrna-preprocessing-clustering
Standard scRNA-seq preprocessing and clustering with Scanpy. Use for QC, normalization, HVG selection, PCA, neighbor graph construction, UMAP, Leiden clustering, and export of an analysis-ready AnnData object.
install
source · Clone the upstream repo
git clone https://github.com/Runchuan-BU/BioClaw
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/Runchuan-BU/BioClaw "$T" && mkdir -p ~/.claude/skills && cp -r "$T/container/skills/scrna-preprocessing-clustering" ~/.claude/skills/runchuan-bu-bioclaw-scrna-preprocessing-clustering && rm -rf "$T"
manifest: container/skills/scrna-preprocessing-clustering/SKILL.md
scRNA Preprocessing And Clustering
Version Compatibility
Reference examples assume:
- scanpy 1.10+
- anndata 0.10+
- pandas 2.2+
- matplotlib 3.8+
Before using code patterns, verify that the installed versions match these assumptions:
- Python: python -c "import scanpy, anndata; print(scanpy.__version__, anndata.__version__)"
- If signatures differ, inspect the installed API and adapt the pattern instead of retrying unchanged.
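The version check above can also be scripted. The following is a minimal sketch, assuming the four minimum versions listed in this section; the helper names (`version_tuple`, `meets_minimum`, `check_environment`) are illustrative and not part of any library. It reads only the leading numeric fields of each version string, so pre-release suffixes are ignored.

```python
import re
from importlib import metadata

# Minimum versions taken from the list in this section.
MINIMUMS = {"scanpy": "1.10", "anndata": "0.10", "pandas": "2.2", "matplotlib": "3.8"}


def version_tuple(version):
    """Turn '1.10.2rc1' into (1, 10, 2) by reading the leading numeric fields."""
    fields = []
    for part in version.split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break
        fields.append(int(match.group()))
    return tuple(fields)


def meets_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.10' and '1.10.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(minimums=MINIMUMS):
    """Map each package name to 'ok', 'too old (<version>)', or 'missing'."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        report[name] = "ok" if meets_minimum(installed, minimum) else f"too old ({installed})"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```

A "too old" or "missing" entry is a cue to inspect the installed API before reusing any pattern from this skill.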
Overview
Use this skill to turn raw or minimally processed scRNA-seq data into an analysis-ready object with:
- QC-filtered cells and genes
- normalized expression values
- highly variable genes
- PCA and UMAP embeddings
- Leiden clusters
- saved h5ad artifact for annotation, DE, integration, or trajectory analysis
When To Use This Skill
- raw 10x matrices, filtered count matrices, or h5ad inputs need standard preprocessing
- the user wants UMAP, clustering, or marker discovery
- downstream tasks depend on a stable single-cell object rather than ad hoc plots
Quick Route
- If the input is already a processed h5ad, inspect adata.raw, embeddings, cluster columns, and QC columns before rerunning preprocessing.
- If the input is raw counts, do QC first and only normalize after filtering obvious low-quality cells.
- If multiple batches are present, preprocess cleanly first, then consider integration instead of hiding batch effects with aggressive filtering.
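The first route above (inspect before rerunning) can be sketched as a small report. This is an illustrative helper, not a Scanpy or AnnData function: `describe_processing_state` and `looks_like_raw_counts` are hypothetical names, the integer check is only a heuristic, and the cluster-column match on "leiden"/"louvain" is an assumption about naming. The demo uses a stand-in object so the logic is visible without a file; in practice the argument would be the result of `anndata.read_h5ad(...)`.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd

QC_COLUMNS = ("n_genes_by_counts", "total_counts", "pct_counts_mt")


def looks_like_raw_counts(X, n_rows=200):
    """Heuristic: non-negative, integer-valued entries in the first rows suggest raw counts."""
    block = X[:n_rows]
    # Sparse matrices expose their stored values as .data; dense arrays are used directly.
    values = np.asarray(block.data if hasattr(block, "tocsr") else block).ravel()
    if values.size == 0:
        return False
    return bool(np.all(values >= 0) and np.allclose(values, np.round(values)))


def describe_processing_state(adata):
    """Summarize which preprocessing results an AnnData-like object already carries."""
    obs_columns = list(adata.obs.columns)
    return {
        "has_raw": adata.raw is not None,
        "x_looks_like_counts": looks_like_raw_counts(adata.X),
        "embeddings": sorted(adata.obsm.keys()),
        "cluster_columns": [
            c for c in obs_columns if "leiden" in c.lower() or "louvain" in c.lower()
        ],
        "qc_columns": [c for c in obs_columns if c in QC_COLUMNS],
    }


# Stand-in object for demonstration only; it mimics the attributes read above.
demo = SimpleNamespace(
    X=np.array([[0.0, 1.2], [2.3, 0.0]]),
    raw=None,
    obs=pd.DataFrame({"leiden_r05": ["0", "1"], "pct_counts_mt": [3.1, 7.4]}),
    obsm={"X_umap": np.zeros((2, 2)), "X_pca": np.zeros((2, 2))},
)
state = describe_processing_state(demo)
```

If `x_looks_like_counts` is false and embeddings or cluster columns are already present, the object has probably been processed, and rerunning normalization on it would be the anti-pattern this skill warns about.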
Progressive Disclosure
- Read technical_reference.md for QC decision rules, assay caveats, and integration branching.
- Read commands_and_thresholds.md for concrete Scanpy code, default thresholds, and output conventions.
Default Rules
- Keep raw counts recoverable. Prefer adata.raw = adata.copy() before regression or scaling.
- Report thresholds explicitly. Do not silently drop cells or genes.
- Show QC distributions before applying hard filters.
- Use vector outputs such as .pdf or .svg for final figures when possible.
Expected Inputs
- 10x directory, .h5, .h5ad, or count matrix
- cell metadata if available
- species context for mitochondrial or ribosomal gene detection
Expected Outputs
- results/processed.h5ad
- qc/cell_qc_metrics.tsv
- qc/gene_qc_metrics.tsv
- figures/qc_violin.pdf
- figures/pca_variance_ratio.pdf
- figures/umap_leiden.pdf
Preferred Tools
- scanpy
- anndata
- pandas
- matplotlib
- seaborn
Starter Pattern
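    import scanpy as sc

    # Load a 10x matrix directory and make gene names unique
    adata = sc.read_10x_mtx("counts/")
    adata.var_names_make_unique()

    # Flag mitochondrial genes (case-insensitive prefix) and compute QC metrics
    adata.var["mt"] = adata.var_names.str.upper().str.startswith("MT-")
    sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

    # Filter cells on gene count and mitochondrial fraction, then drop rare genes
    adata = adata[
        (adata.obs["n_genes_by_counts"] >= 200)
        & (adata.obs["n_genes_by_counts"] <= 6000)
        & (adata.obs["pct_counts_mt"] < 15),
        :
    ].copy()
    sc.pp.filter_genes(adata, min_cells=3)

    # Keep the counts: flavor="seurat_v3" expects counts, not log-normalized values
    adata.layers["counts"] = adata.X.copy()

    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    adata.raw = adata.copy()  # log-normalized values for all genes

    # flavor="seurat_v3" also needs the scikit-misc package installed
    sc.pp.highly_variable_genes(
        adata, n_top_genes=3000, flavor="seurat_v3", layer="counts"
    )
    adata = adata[:, adata.var["highly_variable"]].copy()
    sc.pp.scale(adata, max_value=10)
    sc.tl.pca(adata, svd_solver="arpack")
    sc.pp.neighbors(adata, n_neighbors=15, n_pcs=30)
    sc.tl.umap(adata)
    sc.tl.leiden(adata, resolution=0.5, key_added="leiden_r05")
    adata.write("results/processed.h5ad")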
Workflow
1. Load and validate the object
- confirm orientation is cells by genes
- make gene names unique
- record sample IDs and batch labels before merging or filtering
2. Compute QC metrics and inspect distributions
- n_genes_by_counts
- total_counts
- pct_counts_mt
- optional ribosomal or hemoglobin fractions
Plot distributions before filtering. Thresholds vary by chemistry, tissue, and nucleus versus whole-cell assay.
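The mitochondrial flag and the optional ribosomal and hemoglobin flags can be derived from gene symbols. The sketch below is a heuristic under stated assumptions: `flag_gene_sets` is a hypothetical helper, the prefixes (`MT-`, `RPS`/`RPL`, `HB` not followed by `P`) follow common human and mouse symbol conventions, and matching is case-insensitive as in the starter pattern. Prefix rules can mislabel individual genes, so spot-check the flagged names for the species at hand.

```python
import pandas as pd


def flag_gene_sets(var_names):
    """Return boolean mt / ribo / hb flags, one row per gene symbol."""
    names = pd.Series(list(var_names), dtype="object")
    upper = names.str.upper()
    flags = pd.DataFrame(
        {
            "mt": upper.str.startswith("MT-"),              # mitochondrial genes
            "ribo": upper.str.match(r"RP[SL]"),             # ribosomal proteins
            "hb": upper.str.contains(r"^HB[^P]", regex=True),  # hemoglobin, excluding HBP*
        }
    )
    flags.index = pd.Index(names, name="gene")
    return flags


demo_flags = flag_gene_sets(["MT-CO1", "mt-Nd1", "RPS6", "Rpl13", "HBB", "HBP1", "ACTB"])

# With a real AnnData object the flags would feed the QC call, for example:
#   flags = flag_gene_sets(adata.var_names)
#   for name in flags.columns:
#       adata.var[name] = flags[name].to_numpy()
#   sc.pp.calculate_qc_metrics(adata, qc_vars=["mt", "ribo", "hb"], inplace=True)
```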
3. Filter cells and genes
Use dataset-aware thresholds. Good first-pass defaults:
- min_genes >= 200
- max_genes <= 5000-8000 to remove likely doublets in many droplet datasets
- pct_counts_mt < 10-20 depending on tissue stress
- min_cells >= 3 for genes
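To keep filtering explicit rather than silent, the cell thresholds can be applied through a helper that also reports how many cells each criterion removes. This is a sketch that operates on the `obs` table produced by `sc.pp.calculate_qc_metrics`; `build_cell_filter` is a hypothetical name and its defaults are the starter-pattern values (200, 6000, 15), not recommendations for every dataset.

```python
import pandas as pd


def build_cell_filter(obs, min_genes=200, max_genes=6000, max_pct_mt=15.0):
    """Return a boolean keep-mask and a per-criterion summary table."""
    checks = {
        f"n_genes_by_counts >= {min_genes}": obs["n_genes_by_counts"] >= min_genes,
        f"n_genes_by_counts <= {max_genes}": obs["n_genes_by_counts"] <= max_genes,
        f"pct_counts_mt < {max_pct_mt}": obs["pct_counts_mt"] < max_pct_mt,
    }
    # A cell is kept only when it passes every criterion.
    keep = pd.concat(checks, axis=1).all(axis=1)
    rows = [
        {"criterion": label, "cells_failing": int((~passed).sum())}
        for label, passed in checks.items()
    ]
    rows.append({"criterion": "all criteria combined", "cells_failing": int((~keep).sum())})
    summary = pd.DataFrame(rows)
    summary["cells_total"] = len(obs)
    return keep, summary


# Toy QC table: one cell fails each criterion, one passes all three.
demo_obs = pd.DataFrame(
    {"n_genes_by_counts": [150, 2500, 7000, 3000], "pct_counts_mt": [5.0, 4.0, 3.0, 25.0]},
    index=["cell_a", "cell_b", "cell_c", "cell_d"],
)
keep, summary = build_cell_filter(demo_obs)

# With a real object: adata = adata[keep.to_numpy()].copy()
# and: summary.to_csv("qc/filter_summary.tsv", sep="\t", index=False)
```

Writing the summary table alongside the filtered object records the thresholds that were actually applied.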
4. Normalize, log-transform, and select HVGs
- normalize with target_sum=1e4
- log1p
- select 2000-4000 HVGs
- save raw counts before heavy transformations
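For intuition, the arithmetic behind total-count normalization followed by log1p can be written out on a small dense matrix. This is an illustration only, not a replacement for `sc.pp.normalize_total` and `sc.pp.log1p`; `normalize_and_log` is a hypothetical helper, and leaving zero-count cells as zeros is a choice made for this sketch.

```python
import numpy as np


def normalize_and_log(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    # Cells with zero total counts stay at zero instead of triggering a division by zero.
    scale = np.divide(target_sum, totals, out=np.zeros_like(totals), where=totals > 0)
    normalized = counts * scale
    return normalized, np.log1p(normalized)


# Two cells with the same composition but 10x different depth, plus an empty cell.
demo_counts = np.array([[1, 3], [10, 30], [0, 0]])
normalized, logged = normalize_and_log(demo_counts)
```

After scaling, the first two cells become identical, which is the point of the step: sequencing depth no longer separates cells that have the same relative expression.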
5. Reduce dimensions and cluster
- PCA on HVGs
- neighbor graph using 10-30 PCs and 10-30 neighbors as a starting range
- UMAP for visualization
- Leiden across a small resolution grid such as 0.2, 0.5, 0.8, 1.0
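A resolution grid can be run in a loop, storing each result under its own obs column. The sketch below assumes `sc.pp.neighbors` has already been run on the object; `resolution_key` and `run_leiden_grid` are hypothetical helpers, and the column naming follows the `leiden_r05` key used in the starter pattern. The naming scheme assumes a one-decimal grid, so values such as 0.25 would need a different format.

```python
def resolution_key(resolution, prefix="leiden_r"):
    """Build an obs column name such as 'leiden_r05' for resolution 0.5."""
    return f"{prefix}{int(round(resolution * 10)):02d}"


def run_leiden_grid(adata, resolutions=(0.2, 0.5, 0.8, 1.0)):
    """Run Leiden once per resolution and return {obs column: number of clusters}."""
    import scanpy as sc  # imported here so the naming helper works without scanpy

    cluster_counts = {}
    for resolution in resolutions:
        key = resolution_key(resolution)
        sc.tl.leiden(adata, resolution=resolution, key_added=key)
        cluster_counts[key] = adata.obs[key].nunique()
    return cluster_counts


grid_keys = [resolution_key(r) for r in (0.2, 0.5, 0.8, 1.0)]
```

Comparing the cluster counts across the grid shows where the partition stabilizes before any single resolution is committed to.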
6. Export analysis-ready artifacts
Always save:
- processed h5ad
- cluster assignments
- publication-ready QC and UMAP figures
Output Artifacts
- results/processed.h5ad: main reusable AnnData object
- results/cluster_assignments.tsv: barcode plus cluster labels
- qc/filter_summary.tsv: counts before and after filtering
- figures/umap_leiden.pdf: main embedding figure
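The cluster assignment table (barcode plus cluster label) can be written directly from the obs table. This is a minimal sketch: `write_cluster_assignments` is a hypothetical helper, the column names `barcode` and `cluster` are assumptions, and the demo writes to a temporary directory rather than `results/`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_cluster_assignments(obs, cluster_key, path):
    """Write a two-column TSV of barcode and cluster label; return the table."""
    table = pd.DataFrame(
        {
            "barcode": obs.index.astype(str),
            "cluster": obs[cluster_key].astype(str).to_numpy(),
        }
    )
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create results/ if needed
    table.to_csv(path, sep="\t", index=False)
    return table


# Toy obs table standing in for adata.obs.
demo_obs = pd.DataFrame(
    {"leiden_r05": ["0", "1", "0"]},
    index=["AAAC-1", "AAAG-1", "AACT-1"],
)
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "results" / "cluster_assignments.tsv"
    write_cluster_assignments(demo_obs, "leiden_r05", out_path)
    round_trip = pd.read_csv(out_path, sep="\t", dtype=str)
```

With a real object the call would be `write_cluster_assignments(adata.obs, "leiden_r05", "results/cluster_assignments.tsv")`.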
Quality Review
- Median genes per cell should be plausible for the chemistry and tissue.
- Mitochondrial fraction should not dominate retained cells.
- PCA variance should decay smoothly rather than showing obvious technical axes only.
- UMAP should be reviewed together with QC metrics and batch labels, not alone.
- Cluster labels should not be finalized before marker inspection.
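Several of these checks reduce to one per-cluster table of QC metrics and batch composition. The sketch below assumes the obs columns `n_genes_by_counts` and `pct_counts_mt` exist; `cluster_qc_table` is a hypothetical helper and the batch column name in the demo is illustrative.

```python
import pandas as pd


def cluster_qc_table(obs, cluster_key, batch_key=None):
    """Per-cluster cell count, median QC metrics, and dominant-batch fraction."""
    grouped = obs.groupby(cluster_key, observed=True)
    table = grouped.agg(
        n_cells=("n_genes_by_counts", "size"),
        median_genes=("n_genes_by_counts", "median"),
        median_pct_mt=("pct_counts_mt", "median"),
    )
    if batch_key is not None:
        # Share of each cluster that comes from its most common batch;
        # values near 1 hint that the cluster may be batch-driven.
        table["top_batch_fraction"] = grouped[batch_key].agg(
            lambda s: s.value_counts(normalize=True).iloc[0]
        )
    return table


# Toy obs table: cluster "1" has few genes, high mitochondrial fraction, one batch.
demo_obs = pd.DataFrame(
    {
        "leiden_r05": ["0", "0", "0", "1", "1"],
        "n_genes_by_counts": [2000, 2200, 2400, 300, 350],
        "pct_counts_mt": [4.0, 5.0, 6.0, 18.0, 22.0],
        "batch": ["a", "b", "a", "b", "b"],
    }
)
qc_table = cluster_qc_table(demo_obs, "leiden_r05", batch_key="batch")
```

A cluster that stands out on median genes, mitochondrial fraction, or batch share is a candidate technical artifact and should be examined before it receives a biological label.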
Anti-Patterns
- reprocessing an already integrated object as if it were raw counts
- using a single universal mitochondrial threshold for every tissue
- interpreting UMAP separation as biology before checking batch and QC covariates
- discarding raw counts needed later for DE or pseudobulk
Related Skills
- Cell Annotation
- Cell Communication
- Trajectory And Lineage
- Multiome And scATAC
Optional Supplements
- anndata
- scanpy