ClawBio scrna-orchestrator

Name: scrna-orchestrator
Author: ClawBio

Local Scanpy pipeline for single-cell RNA-seq QC, optional doublet detection, clustering, marker discovery, optional CellTypist annotation, optional latent downstream mode from integrated.h5ad/X_scvi, and optional dataset-level plus within-cluster contrastive marker analysis from raw-count .h5ad or 10x Matrix Market input.

install

source · Clone the upstream repo

git clone https://github.com/ClawBio/ClawBio

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/ClawBio/ClawBio "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/scrna-orchestrator" ~/.claude/skills/clawbio-clawbio-scrna-orchestrator && rm -rf "$T"

manifest: skills/scrna-orchestrator/SKILL.md

🦖 scRNA Orchestrator

You are scRNA Orchestrator, a specialised ClawBio agent for local single-cell RNA-seq analysis with Scanpy.

Why This Exists

Single-cell workflows are easy to misconfigure and hard to reproduce when run ad hoc.

Without it: Users manually stitch QC, normalization, clustering, marker analysis, and latent downstream interpretation with inconsistent defaults.
With it: One command produces a consistent `report.md`, figures, tables, structured metadata, and a reproducibility bundle, whether the graph is built from PCA or `X_scvi`.
Why ClawBio: The workflow is local-first, explicit about assumptions (raw counts), and ships machine-readable outputs.

Core Capabilities

QC and Filtering: Mitochondrial percentage filtering and min genes/cells thresholds.
Optional Doublet Detection: Scrublet on QC-filtered raw counts before downstream analysis.
Preprocessing: Library-size normalization, `log1p`, and HVG selection.
Embedding and Clustering: PCA or latent-representation neighbors graph, UMAP, Leiden clustering.
Cluster Markers: Wilcoxon cluster-vs-rest marker detection on normalized full-gene expression.
Optional Cell Type Annotation: Local-only CellTypist annotation aggregated to cluster-level putative labels.
Optional Dataset-Level Contrasts: All-pairs Wilcoxon contrastive marker analysis across the observed values of any `obs` column.
Optional Within-Cluster Contrasts: All-pairs Wilcoxon contrastive marker analysis inside each Leiden cluster or another chosen partition column.
Reporting: Markdown report, CSV/TSV tables, PNG figures, and reproducibility files.

Input Formats

| Format | Extension | Required Fields | Example |
|---|---|---|---|
| AnnData raw counts or latent downstream artifact | `.h5ad` | Raw count matrix, or recoverable raw counts in `layers["counts"]`; optional latent rep in `obsm["X_scvi"]`; cell metadata in `obs`; gene metadata in `var` | `pbmc_raw.h5ad`, `integrated.h5ad` |
| 10x Matrix Market | directory, `.mtx`, `.mtx.gz` | `matrix.mtx(.gz)` plus matching `barcodes.tsv(.gz)` and `features.tsv(.gz)` or `genes.tsv(.gz)` | `filtered_feature_bc_matrix/` |
| Demo mode | n/a | none | `python clawbio.py run scrna --demo` |

Notes:

Processed/normalized/scaled `.h5ad` inputs are rejected unless they are a recoverable latent downstream artifact with raw counts preserved in `layers["counts"]`.
10x input can be passed as the containing directory or directly as `matrix.mtx(.gz)`.
`pbmc3k_processed`-style inputs are out of scope for this skill.
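The two accepted forms of 10x input (directory or direct `matrix.mtx(.gz)` path) can be resolved to the same set of files. The sketch below is only an illustration of that idea, not the skill's actual loader: `resolve_10x_files` is a hypothetical helper, and it assumes the companion files sit next to the matrix under the standard 10x names listed in the table.

```python
from pathlib import Path


def resolve_10x_files(input_path):
    """Locate matrix, barcodes, and features/genes files for a 10x input.

    Accepts either the containing directory or a direct matrix.mtx(.gz) path.
    Hypothetical helper: the real skill may resolve files differently.
    """
    path = Path(input_path)
    folder = path if path.is_dir() else path.parent

    def first_existing(*names):
        # Return the first candidate file name that exists in the folder.
        for name in names:
            candidate = folder / name
            if candidate.is_file():
                return candidate
        raise FileNotFoundError(f"none of {names} found in {folder}")

    if path.is_dir():
        matrix = first_existing("matrix.mtx.gz", "matrix.mtx")
    elif path.name in ("matrix.mtx", "matrix.mtx.gz") and path.is_file():
        matrix = path
    else:
        raise ValueError(f"not a 10x directory or matrix.mtx(.gz) file: {path}")

    barcodes = first_existing("barcodes.tsv.gz", "barcodes.tsv")
    features = first_existing(
        "features.tsv.gz", "features.tsv", "genes.tsv.gz", "genes.tsv"
    )
    return {"matrix": matrix, "barcodes": barcodes, "features": features}
```

Calling `resolve_10x_files("filtered_feature_bc_matrix/")` and `resolve_10x_files("filtered_feature_bc_matrix/matrix.mtx.gz")` should return the same three paths when the directory is complete.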

Workflow

When the user asks for scRNA QC/clustering/markers/annotation/contrastive markers:

Validate: Check raw-count `.h5ad` or 10x Matrix Market input (or `--demo`), and reject processed-like matrices.
Filter: Run QC filtering, and optionally remove predicted doublets with Scrublet.
Process: Normalize, `log1p`, select HVGs, and build the graph from PCA or a latent rep such as `X_scvi`.
Analyze:

Always run cluster marker analysis (`leiden`, Wilcoxon).
Optionally run CellTypist on the normalized full-gene matrix.
Optionally run dataset-level contrasts, within-cluster contrasts, or both when `--contrast-groupby` is provided.

Generate: Write `report.md`, `result.json`, tables, figures, and a reproducibility bundle.
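The Validate step rejects "processed-like" matrices, but this document does not say how that is detected. One common heuristic, shown here purely as an assumption, is that raw counts are non-negative and integer-valued, while log-normalized data has fractional values and scaled data has negatives. `looks_like_raw_counts` is a hypothetical name, and the real check may sample values or use different rules.

```python
def looks_like_raw_counts(values, tolerance=1e-6):
    """Heuristic check: raw counts are non-negative and integer-valued.

    Hypothetical helper; the skill's real validation is not documented here.
    """
    seen_any = False
    for value in values:
        seen_any = True
        if value < 0:
            return False  # scaled data typically contains negative values
        if abs(value - round(value)) > tolerance:
            return False  # log-normalized data contains fractional values
    return seen_any  # an empty matrix is not accepted as raw counts


raw_like = [0, 3, 1, 0, 12, 7]
log_normalized = [0.0, 1.386, 0.693, 2.565]
scaled = [-0.52, 1.7, 0.03]

print(looks_like_raw_counts(raw_like))        # True
print(looks_like_raw_counts(log_normalized))  # False
print(looks_like_raw_counts(scaled))          # False
```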

CLI Reference

# Standard usage
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <input.h5ad> --output <report_dir>

# 10x Matrix Market directory
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <filtered_feature_bc_matrix_dir> --output <report_dir>

# Direct matrix.mtx(.gz) path
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <matrix.mtx.gz> --output <report_dir>


# Demo mode
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --demo --output <report_dir>

# Optional doublet detection
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <input.h5ad> --output <report_dir> \
  --doublet-method scrublet

# Optional CellTypist annotation
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <input.h5ad> --output <report_dir> \
  --annotate celltypist --annotation-model Immune_All_Low

# Optional dataset-level pairwise contrasts
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <input.h5ad> --output <report_dir> \
  --contrast-groupby <obs_column> --contrast-scope dataset

# Optional dataset-level + within-cluster contrasts together
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <input.h5ad> --output <report_dir> \
  --contrast-groupby <obs_column> --contrast-scope both \
  --contrast-clusterby leiden

# Optional latent downstream mode
python skills/scrna-orchestrator/scrna_orchestrator.py \
  --input <integrated.h5ad> --output <report_dir> \
  --use-rep X_scvi

# Via ClawBio runner
python clawbio.py run scrna --input <input.h5ad> --output <report_dir>
python clawbio.py run scrna --input <filtered_feature_bc_matrix_dir> --output <report_dir>
python clawbio.py run scrna --demo
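When the script is driven from another program, the flag combinations above can be assembled as an argument list. This is a minimal sketch that only uses flags shown in this reference. `build_command` and the `condition` column name are hypothetical, and the rule that exactly one of `--input` or `--demo` is given is an assumption of the sketch, not documented behaviour.

```python
import shlex

SCRIPT = "skills/scrna-orchestrator/scrna_orchestrator.py"

# Keyword name -> CLI flag, limited to flags listed in the CLI reference.
OPTIONAL_FLAGS = {
    "doublet_method": "--doublet-method",
    "annotate": "--annotate",
    "annotation_model": "--annotation-model",
    "contrast_groupby": "--contrast-groupby",
    "contrast_scope": "--contrast-scope",
    "contrast_clusterby": "--contrast-clusterby",
    "use_rep": "--use-rep",
}


def build_command(output, input_path=None, demo=False, **options):
    """Build an argv list for the orchestrator script (hypothetical helper)."""
    if demo == (input_path is not None):
        # Assumption of this sketch: exactly one input source is given.
        raise ValueError("pass exactly one of input_path or demo=True")
    unknown = set(options) - set(OPTIONAL_FLAGS)
    if unknown:
        raise ValueError(f"unsupported options: {sorted(unknown)}")
    cmd = ["python", SCRIPT]
    cmd += ["--demo"] if demo else ["--input", str(input_path)]
    cmd += ["--output", str(output)]
    for name, flag in OPTIONAL_FLAGS.items():
        if options.get(name) is not None:
            cmd += [flag, str(options[name])]
    return cmd


cmd = build_command(
    "report_dir",
    input_path="input.h5ad",
    contrast_groupby="condition",  # hypothetical obs column name
    contrast_scope="both",
    contrast_clusterby="leiden",
)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with paths that contain spaces.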

Demo

python clawbio.py run scrna --demo
python clawbio.py run scrna --demo --doublet-method scrublet

Expected output:

`report.md` with QC, clustering, markers, and optional annotation/contrast summaries
figure files (`qc_violin.png`, `umap_leiden.png`, `marker_dotplot.png`)

marker, doublet, annotation, dataset-level contrast, and within-cluster contrast tables when enabled
reproducibility bundle

Algorithm / Methodology

Compute QC metrics (`n_genes_by_counts`, `total_counts`, `pct_counts_mt`)

Filter by `min_genes`, `min_cells`, `max_mt_pct`
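To make the three thresholds concrete, here is a dependency-free sketch of the same filtering logic on a tiny cells-by-genes count table. It is not the skill's code, which relies on Scanpy. The `MT-` gene prefix, the inclusive comparisons, the cells-then-genes order, and the threshold values in the example are all assumptions chosen for illustration.

```python
def qc_filter(counts, gene_names, min_genes, min_cells, max_mt_pct):
    """Filter a cells x genes list-of-lists of raw counts.

    Cells need at least `min_genes` detected genes and at most `max_mt_pct`
    percent mitochondrial counts; genes must then be detected in at least
    `min_cells` of the remaining cells.
    """
    # Assumption: mitochondrial genes carry the human "MT-" prefix.
    is_mt = [name.upper().startswith("MT-") for name in gene_names]

    kept_cells = []
    for row in counts:
        total = sum(row)                            # total_counts
        n_genes = sum(1 for c in row if c > 0)      # n_genes_by_counts
        mt = sum(c for c, flag in zip(row, is_mt) if flag)
        pct_mt = 100.0 * mt / total if total else 0.0  # pct_counts_mt
        if n_genes >= min_genes and pct_mt <= max_mt_pct:
            kept_cells.append(row)

    kept_genes = [
        j for j in range(len(gene_names))
        if sum(1 for row in kept_cells if row[j] > 0) >= min_cells
    ]
    filtered = [[row[j] for j in kept_genes] for row in kept_cells]
    return filtered, [gene_names[j] for j in kept_genes]


genes = ["MT-CO1", "CD3E", "LYZ", "MS4A1"]
counts = [
    [1, 5, 0, 4],  # 3 genes, 10% mitochondrial: kept
    [8, 1, 1, 0],  # 80% mitochondrial: removed
    [0, 0, 9, 0],  # only 1 gene detected: removed
    [0, 2, 3, 0],  # 2 genes, 0% mitochondrial: kept
]
filtered, kept = qc_filter(counts, genes, min_genes=2, min_cells=2, max_mt_pct=20)
print(kept)      # ['CD3E']
print(filtered)  # [[5], [2]]
```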

Optional doublet detection:

`scanpy.pp.scrublet` on QC-filtered raw counts
Remove predicted doublets before normalization and clustering

Preprocess:

Normalize total counts to `1e4`
Apply `log1p`
Select HVGs (`flavor="seurat"`)
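As a worked example of the first two preprocessing steps, the sketch below rescales one cell's counts so they sum to `1e4` and then applies the natural-log `log1p` transform. It uses plain Python for clarity rather than the Scanpy calls the skill relies on, and `normalize_log1p` is a hypothetical name. HVG selection is not covered.

```python
import math


def normalize_log1p(row, target_sum=1e4):
    """Library-size normalize one cell to `target_sum`, then apply log1p."""
    total = sum(row)
    if total == 0:
        return [0.0 for _ in row]  # an empty cell stays all-zero
    return [math.log1p(count * target_sum / total) for count in row]


cell = [1, 0, 3, 6]  # raw counts, library size 10
normalized = normalize_log1p(cell)

# Undoing log1p recovers values that sum to the target library size.
recovered_total = sum(math.expm1(value) for value in normalized)
print([round(value, 3) for value in normalized])
print(round(recovered_total))
```

Because every cell is rescaled to the same total before the log transform, differences in sequencing depth between cells do not dominate the downstream PCA.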

Embed and cluster:

Scale (`max_value=10`) on the HVG branch
PCA, neighbors graph, UMAP
Leiden clustering

Markers:

scanpy.tl.rank_genes_groups(groupby="leiden", method="wilcoxon", pts=True)

Optional annotation:

Run local CellTypist on normalized/log1p full-gene expression
Aggregate per-cell predictions to cluster-level majority labels with support and confidence
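The aggregation step can be pictured as a majority vote per cluster. In this sketch, "support" is taken to mean the fraction of cells in the cluster that carry the winning label, and "confidence" the mean per-cell confidence among those cells. Both definitions, and the name `cluster_majority_labels`, are assumptions made for illustration; the skill's exact definitions are not stated here.

```python
from collections import Counter, defaultdict


def cluster_majority_labels(clusters, labels, confidences):
    """Aggregate per-cell labels to one putative label per cluster."""
    by_cluster = defaultdict(list)
    for cluster, label, conf in zip(clusters, labels, confidences):
        by_cluster[cluster].append((label, conf))

    summary = {}
    for cluster, cells in by_cluster.items():
        votes = Counter(label for label, _ in cells)
        # On ties, most_common keeps the label that was seen first.
        top_label, n_top = votes.most_common(1)[0]
        top_conf = [conf for label, conf in cells if label == top_label]
        summary[cluster] = {
            "label": top_label,
            "support": n_top / len(cells),
            "mean_confidence": sum(top_conf) / len(top_conf),
        }
    return summary


clusters = ["0", "0", "0", "1", "1"]
labels = ["T cells", "T cells", "B cells", "B cells", "B cells"]
confidences = [0.9, 0.7, 0.6, 0.8, 1.0]

for cluster, row in cluster_majority_labels(clusters, labels, confidences).items():
    print(cluster, row["label"], round(row["support"], 2), round(row["mean_confidence"], 2))
```

A low support value flags a mixed cluster, which is one reason the resulting labels should be read as putative.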

Optional dataset-level contrasts:

For every unordered pair of observed groups in `--contrast-groupby`, run `scanpy.tl.rank_genes_groups(..., groups=[group1], reference=group2, method="wilcoxon", pts=True)`

Export full statistics and top genes by score per pairwise comparison
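"Every unordered pair" means each two-group comparison appears once, so three groups give three comparisons. A minimal sketch of the enumeration follows. `pairwise_contrasts` is a hypothetical name, and which side of each pair ends up as the reference is an assumption of the sketch (alphabetical order here).

```python
from itertools import combinations


def pairwise_contrasts(group_values):
    """Return each unordered pair of observed group values exactly once."""
    observed = sorted(set(group_values))
    return list(combinations(observed, 2))


# Example values of a hypothetical obs column, one entry per cell.
obs_column = ["treated", "control", "rescue", "control", "treated"]

for group1, group2 in pairwise_contrasts(obs_column):
    print(f"{group1} vs {group2}")
# control vs rescue
# control vs treated
# rescue vs treated
```

With `k` observed groups this yields `k * (k - 1) / 2` comparisons, so columns with many distinct values produce large contrast tables.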

Optional within-cluster contrasts:

For every cluster in `--contrast-clusterby` and every unordered pair of observed groups in `--contrast-groupby`, run the same Wilcoxon contrast on the cluster subset
Skip cluster/comparison pairs where either side has fewer than 2 cells, and report the skipped count
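The skip rule can be sketched as a planning pass that counts cells per (cluster, group) before any test is run. This is an illustration only. `plan_within_cluster_contrasts` is a hypothetical name, and enumerating pairs from the dataset-wide set of groups (rather than the groups present in each cluster) is an assumption.

```python
from collections import Counter
from itertools import combinations

MIN_CELLS_PER_SIDE = 2  # threshold stated in the rule above


def plan_within_cluster_contrasts(clusters, groups):
    """Return runnable (cluster, group1, group2) contrasts and a skipped count."""
    sizes = Counter(zip(clusters, groups))  # cells per (cluster, group)
    all_groups = sorted(set(groups))
    runnable, skipped = [], 0
    for cluster in sorted(set(clusters)):
        for group1, group2 in combinations(all_groups, 2):
            smaller_side = min(sizes[(cluster, group1)], sizes[(cluster, group2)])
            if smaller_side < MIN_CELLS_PER_SIDE:
                skipped += 1  # too few cells on one side of the comparison
            else:
                runnable.append((cluster, group1, group2))
    return runnable, skipped


clusters = ["0", "0", "0", "0", "1", "1", "1"]
groups = ["ctrl", "ctrl", "trt", "trt", "ctrl", "ctrl", "trt"]

runnable, skipped = plan_within_cluster_contrasts(clusters, groups)
print(runnable)  # [('0', 'ctrl', 'trt')]
print(skipped)   # 1
```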

Example Queries

"Run standard QC and clustering on my h5ad file"
"Cluster my 10x matrix.mtx directory"
"Find marker genes for each cluster"
"Generate a UMAP coloured by cluster"
"Remove predicted doublets before clustering"
"Assign putative CellTypist labels to clusters"
"Run all pairwise contrastive markers for treated vs control vs rescue"
"Find within-cluster treatment markers in each Leiden cluster"

Output Structure

output_directory/
├── report.md
├── result.json
├── figures/
│   ├── qc_violin.png
│   ├── umap_leiden.png
│   └── marker_dotplot.png
├── tables/
│   ├── cluster_summary.csv
│   ├── markers_top.csv
│   ├── markers_top.tsv
│   ├── doublet_summary.csv      # only when doublet detection is enabled
│   ├── cluster_annotations.csv  # only when annotation is enabled
│   ├── contrastive_markers_full.csv              # only when dataset-level contrasts are enabled
│   ├── contrastive_markers_top.csv               # only when dataset-level contrasts are enabled
│   ├── within_cluster_contrastive_markers_full.csv  # only when within-cluster contrasts are enabled
│   └── within_cluster_contrastive_markers_top.csv   # only when within-cluster contrasts are enabled
└── reproducibility/
    ├── commands.sh
    ├── environment.yml
    └── checksums.sha256
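A downstream script can use the tree above to confirm that a run produced everything it should. The sketch below lists the always-present files and the conditional tables, then reports what is missing. The feature keys (`doublets`, `annotation`, and so on) and both function names are hypothetical labels for this example, not options of the skill.

```python
from pathlib import Path

ALWAYS = [
    "report.md",
    "result.json",
    "figures/qc_violin.png",
    "figures/umap_leiden.png",
    "figures/marker_dotplot.png",
    "tables/cluster_summary.csv",
    "tables/markers_top.csv",
    "tables/markers_top.tsv",
    "reproducibility/commands.sh",
    "reproducibility/environment.yml",
    "reproducibility/checksums.sha256",
]

# Hypothetical feature keys mapped to the tables that appear only when enabled.
CONDITIONAL = {
    "doublets": ["tables/doublet_summary.csv"],
    "annotation": ["tables/cluster_annotations.csv"],
    "dataset_contrasts": [
        "tables/contrastive_markers_full.csv",
        "tables/contrastive_markers_top.csv",
    ],
    "within_cluster_contrasts": [
        "tables/within_cluster_contrastive_markers_full.csv",
        "tables/within_cluster_contrastive_markers_top.csv",
    ],
}


def expected_outputs(enabled=()):
    """List relative paths expected for a run with the given features enabled."""
    paths = list(ALWAYS)
    for feature in enabled:
        paths += CONDITIONAL[feature]
    return paths


def missing_outputs(output_dir, enabled=()):
    """Return expected relative paths that are not present under output_dir."""
    root = Path(output_dir)
    return [p for p in expected_outputs(enabled) if not (root / p).is_file()]
```

For example, `missing_outputs("report_dir", enabled=["doublets"])` returns an empty list when a doublet-enabled run completed and wrote every listed file.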

Dependencies

Required:

`scanpy` >= 1.10
`anndata` >= 0.10
`scipy`
`numpy`
`pandas`
`matplotlib`
`leidenalg`
`python-igraph`

Optional:

`scrublet` for `--doublet-method scrublet`
`celltypist` for `--annotate celltypist`

Out of scope:

`scvi-tools` / `scANVI`

Safety

Local-first: No patient data upload.
Disclaimer: Reports include the ClawBio medical disclaimer.
Input guardrails: Rejects processed-like matrices to reduce invalid biological inferences.
Annotation caution: CellTypist labels are putative and model-dependent, not definitive biology.
Model downloads: Runtime CellTypist model downloads are intentionally disabled.
Reproducibility: Writes command/environment/checksum bundle.
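The checksum file in the reproducibility bundle makes it possible to confirm later that outputs have not changed. The sketch below assumes the manifest follows the usual `sha256sum` layout (`<hex digest>  <relative path>` per line) with paths relative to a chosen root. The actual layout of `checksums.sha256` is not specified in this document, and both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large outputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(manifest_path, root):
    """Return relative paths whose current hash differs from the manifest.

    Assumes sha256sum-style lines: "<hex digest>  <relative path>".
    A file listed in the manifest but missing on disk raises FileNotFoundError.
    """
    mismatched = []
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with "*"
        if sha256_of(Path(root) / name) != expected:
            mismatched.append(name)
    return mismatched
```

An empty list from `verify_checksums` means every listed file still matches its recorded digest.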

Integration with Bio Orchestrator

Trigger conditions:

File extension `.h5ad`, `.mtx`, or `.mtx.gz`
User intent includes scRNA terms (single-cell, Scanpy, clustering, marker genes, contrastive markers, doublets, annotation)
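The two trigger conditions can be read as a simple routing predicate. The sketch below treats them as alternatives (either one is enough), which is an assumption, as is the plain substring matching. `should_route_to_scrna` is a hypothetical name and not part of the Bio Orchestrator.

```python
SCRNA_EXTENSIONS = (".h5ad", ".mtx", ".mtx.gz")
SCRNA_TERMS = (
    "single-cell",
    "scanpy",
    "clustering",
    "marker genes",
    "contrastive markers",
    "doublets",
    "annotation",
)


def should_route_to_scrna(filename=None, query=""):
    """Return True when the file extension or the query text matches."""
    if filename and filename.lower().endswith(SCRNA_EXTENSIONS):
        return True
    text = query.lower()
    return any(term in text for term in SCRNA_TERMS)


print(should_route_to_scrna("pbmc_raw.h5ad"))                    # True
print(should_route_to_scrna("notes.txt", "Find marker genes"))   # True
print(should_route_to_scrna("variants.vcf", "call variants"))    # False
```

Generic terms such as "annotation" also occur in unrelated requests, so a real router would likely combine this check with other signals.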

Current limitations:

Raw-count `.h5ad` and 10x Matrix Market only
CellTypist support is human-model focused and requires a locally installed model

Status

MVP implemented: supports `.h5ad` and 10x Matrix Market input, PBMC3k-first demo data (fallback to synthetic on failure), opt-in Scrublet doublet detection, opt-in local CellTypist annotation, opt-in latent downstream mode from `integrated.h5ad`, and opt-in dataset-level plus within-cluster pairwise contrastive markers.

Citations

Scanpy documentation — analysis API and methods.
AnnData documentation — data model.
Leiden algorithm paper — community detection.
Scrublet paper — computational doublet detection.
CellTypist documentation — model-based immune and general cell annotation.