OpenClaw-Medical-Skills single-cell-preprocessing-with-omicverse
Walk through omicverse's single-cell preprocessing tutorials to QC PBMC3k data, normalise counts, detect HVGs, and run PCA/embedding pipelines on CPU, CPU–GPU mixed, or GPU stacks.
```bash
# Clone the repository
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills

# Install the skill for Claude
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/single-preprocessing" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-single-cell-preprocessing-with-omicv && rm -rf "$T"

# Install the skill for OpenClaw
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/single-preprocessing" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-single-cell-preprocessing-with-omicv && rm -rf "$T"
```
skills/single-preprocessing/SKILL.md

Single-cell preprocessing with omicverse
Overview
Follow this skill when a user needs to reproduce the preprocessing workflow from the omicverse notebooks `t_preprocess.ipynb`, `t_preprocess_cpu.ipynb`, and `t_preprocess_gpu.ipynb`. The tutorials operate on the 10x PBMC3k dataset and cover QC filtering, normalisation, highly variable gene (HVG) detection, dimensionality reduction, and downstream embeddings.
Instructions
- Set up the environment
  - Import `omicverse as ov` and `scanpy as sc`, then call `ov.plot_set(font_path='Arial')` (or `ov.ov_plot_set()` in legacy notebooks) to standardise figure styling.
  - Encourage `%load_ext autoreload` and `%autoreload 2` when iterating inside notebooks so code edits propagate without restarting the kernel.
- Prepare input data
  - Download the PBMC3k filtered matrix from 10x Genomics (`pbmc3k_filtered_gene_bc_matrices.tar.gz`) and extract it under `data/filtered_gene_bc_matrices/hg19/`.
  - Load the matrix via `sc.read_10x_mtx(..., var_names='gene_symbols', cache=True)` and keep a writable folder like `write/` for exports.
- Perform quality control (QC)
  - Run `ov.pp.qc(adata, tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}, doublets_method='scrublet')` for the CPU/CPU–GPU pipelines; omit `doublets_method` on pure GPU where Scrublet is not yet supported.
  - Review the returned AnnData summary to confirm doublet rates and QC thresholds; advise adjusting cut-offs for different species or sequencing depths.
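As a concrete illustration of what these thresholds mean, the sketch below recomputes the three QC metrics on a toy counts matrix with plain NumPy. This is not omicverse's implementation — `ov.pp.qc` computes the metrics and applies the filter internally — but it shows how each cut-off acts on a cell:

```python
import numpy as np

# Tutorial thresholds from ov.pp.qc
tresh = {'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}

n_genes = 400
mito_mask = np.zeros(n_genes, dtype=bool)
mito_mask[:10] = True  # pretend the first 10 genes are MT- genes

counts = np.zeros((3, n_genes), dtype=int)
counts[0, 10:310] = 2   # healthy cell: 600 UMIs over 300 genes, 0% mito
counts[1, 10:110] = 1   # low-depth cell: 100 UMIs, 100 detected genes
counts[2, 10:310] = 2   # high-mito cell: 900 UMIs, but 300 of them mitochondrial
counts[2, :10] = 30

nUMIs = counts.sum(axis=1)
detected = (counts > 0).sum(axis=1)
mito_perc = counts[:, mito_mask].sum(axis=1) / nUMIs

# A cell survives QC only if it passes all three thresholds
keep = (
    (mito_perc < tresh['mito_perc'])
    & (nUMIs > tresh['nUMIs'])
    & (detected > tresh['detected_genes'])
)
print(keep)  # [ True False False]
```

Only the first cell passes: the second fails the `nUMIs` floor, the third the mitochondrial fraction ceiling.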
- Store raw counts before transformations
  - Call `ov.utils.store_layers(adata, layers='counts')` immediately after QC so the original counts remain accessible for later recovery and comparison.
- Normalise and select HVGs
  - Use `ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=2000, target_sum=5e5)` to apply shift-log normalisation followed by Pearson residual HVG detection (set `target_sum=None` on GPU, which keeps defaults).
  - For CPU–GPU mixed runs, demonstrate `ov.pp.recover_counts(...)` to invert normalisation and store reconstructed counts in `adata.layers['recover_counts']`.
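The shift-log transform and its inversion can be sketched in a few lines of NumPy, assuming shift-log means per-cell scaling to `target_sum` followed by `log1p` (the exact omicverse internals may differ, but this is why `recover_counts` can reconstruct the original matrix):

```python
import numpy as np

target_sum = 5e5
counts = np.array([[10.,  0., 90.],
                   [ 5., 15., 30.]])

# Shift-log: scale each cell to target_sum total counts, then log1p
size = counts.sum(axis=1, keepdims=True)
normed = np.log1p(counts / size * target_sum)

# recover_counts-style inversion: undo log1p, then the per-cell scaling
recovered = np.expm1(normed) * size / target_sum
assert np.allclose(recovered, counts)
```

Because both steps are invertible per cell, the round trip is exact (up to floating-point error).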
- Manage `.raw` and layer recovery
  - Snapshot normalised data to `.raw` with `adata.raw = adata` (or `adata.raw = adata.copy()`), and show `ov.utils.retrieve_layers(adata_counts, layers='counts')` to compare normalised vs. raw intensities.
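A minimal stand-in for the store/retrieve pattern, using a plain dict in place of AnnData layers, shows why the snapshot must be taken before normalisation mutates `X` (conceptual only; the real API is `ov.utils.store_layers` / `ov.utils.retrieve_layers`):

```python
import numpy as np

# Stand-in for adata.layers: snapshot counts before normalising
layers = {}
X = np.array([[1., 2.], [3., 4.]])
layers['counts'] = X.copy()  # like ov.utils.store_layers(adata, layers='counts')

# Normalisation overwrites X in place...
X = np.log1p(X / X.sum(axis=1, keepdims=True) * 1e4)

# ...but the snapshot still holds the original counts
restored = layers['counts']  # like ov.utils.retrieve_layers(...)
assert restored[0, 0] == 1.0
```

Note the `.copy()`: storing a reference instead of a copy would let the later transform silently corrupt the "raw" snapshot.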
- Scale, reduce, and embed
  - Scale features using `ov.pp.scale(adata)` (layers hold scaled matrices) followed by `ov.pp.pca(adata, layer='scaled', n_pcs=50)`.
  - Construct neighbourhood graphs with:
    - `sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50, use_rep='scaled|original|X_pca')` for the baseline notebook.
    - `ov.pp.neighbors(..., use_rep='scaled|original|X_pca')` on CPU–GPU to leverage accelerated routines.
    - `ov.pp.neighbors(..., method='cagra')` on GPU to call RAPIDS graph primitives.
  - Generate embeddings via `ov.utils.mde(...)`, `ov.pp.umap(adata)`, `ov.pp.mde(...)`, `ov.pp.tsne(...)`, or `ov.pp.sude(...)` depending on the notebook variant.
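Conceptually, scaling and PCA reduce to z-scoring each gene and projecting onto the top singular vectors; the NumPy sketch below shows that core (the omicverse versions add extras such as layer caching, so treat this as an explanation, not a substitute):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # toy: 100 cells x 20 HVGs

# Scale: zero-mean, unit-variance per gene (what ov.pp.scale does conceptually)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the centred matrix, keeping the top 5 components
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
X_pca = U[:, :5] * S[:5]
print(X_pca.shape)  # (100, 5)
```

The resulting `X_pca` plays the role of `adata.obsm['X_pca']`: it is the low-dimensional representation the neighbour graph and embeddings are built on.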
- Cluster and annotate
  - Run `ov.pp.leiden(adata, resolution=1)` or `ov.single.leiden(adata, resolution=1.0)` after neighbour graph construction; CPU–GPU pipelines also showcase `ov.pp.score_genes_cell_cycle` before clustering.
  - IMPORTANT - Defensive checks: When generating code that plots by clustering results (e.g., `color='leiden'`), always check if the clustering has been performed first:

    ```python
    # Check if leiden clustering exists, if not, run it
    if 'leiden' not in adata.obs:
        if 'neighbors' not in adata.uns:
            ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
        ov.single.leiden(adata, resolution=1.0)
    ```
  - Plot embeddings with `ov.pl.embedding(...)` or `ov.utils.embedding(...)`, colouring by `leiden` clusters and marker genes. Always verify that the column specified in `color=` exists in `adata.obs` before plotting.
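One way to implement the `color=` verification is a small helper that filters requested keys against `.obs` columns and `.var_names`; `safe_color_keys` below is a hypothetical name introduced here for illustration, not an omicverse function:

```python
def safe_color_keys(obs_columns, var_names, requested):
    """Keep only color keys that exist in .obs or in .var_names.

    Hypothetical helper: pass list(adata.obs.columns) and
    list(adata.var_names) in real use.
    """
    valid = set(obs_columns) | set(var_names)
    return [k for k in requested if k in valid]

keys = safe_color_keys(
    obs_columns=['leiden', 'n_genes'],
    var_names=['CD3D', 'MS4A1'],
    requested=['leiden', 'CD8A', 'MS4A1'],  # CD8A is absent, so it is dropped
)
print(keys)  # ['leiden', 'MS4A1']
```

Dropping unknown keys up front avoids the opaque `KeyError` that the plotting call would otherwise raise.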
- Document outputs
  - Encourage saving intermediate AnnData objects (`adata.write('write/pbmc3k_preprocessed.h5ad')`) and figure exports using Matplotlib's `plt.savefig(...)` to preserve QC summaries and embeddings.
- Notebook-specific notes
  - Baseline (`t_preprocess.ipynb`): Focuses on CPU execution with Scanpy neighbours; emphasise storing counts before and after `retrieve_layers` demonstrations.
  - CPU–GPU mixed (`t_preprocess_cpu.ipynb`): Highlights Omicverse ≥1.7.0 mixed acceleration. Include timing magics (`%%time`) to showcase speedups and call out `doublets_method='scrublet'` support.
  - GPU (`t_preprocess_gpu.ipynb`): Requires a CUDA-capable GPU, RAPIDS 24.04 stack, and `rapids-singlecell`. Mention the `ov.pp.anndata_to_GPU`/`ov.pp.anndata_to_CPU` transfers and `method='cagra'` neighbours. Note the current warning that pure-GPU pipelines depend on RAPIDS updates.
- Troubleshooting tips
  - If `sc.read_10x_mtx` fails, verify the extracted folder structure and ensure gene symbols are available via `var_names='gene_symbols'`.
  - Address GPU import errors by confirming the conda environment matches the RAPIDS version for the installed CUDA driver (`nvidia-smi`).
  - For `ov.pp.preprocess` dimension mismatches, ensure QC filtered out empty barcodes so HVG selection does not encounter zero-variance features.
  - When embeddings lack expected fields (e.g., `scaled|original|X_pca` missing), re-run `ov.pp.scale` and `ov.pp.pca` to rebuild the cached layers.
  - Pipeline dependency errors: When encountering errors like "Could not find 'leiden' in adata.obs or adata.var_names":
    - Always check if required preprocessing steps (neighbors, PCA) exist before dependent operations
    - Check if clustering results exist in `adata.obs` before trying to color plots by them
    - Use defensive checks in generated code to handle incomplete pipelines gracefully
  - Code generation best practice: Generate robust code with conditional checks for prerequisites rather than assuming perfect sequential execution. Users may run steps in separate sessions or skip intermediate steps.
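These dependency checks can be bundled into one helper; `missing_prerequisites` is a hypothetical function, sketched over plain key lists so it stays framework-free (in real use, pass `adata.obsm.keys()`, `adata.uns.keys()`, and `adata.obs.columns`):

```python
def missing_prerequisites(obsm_keys, uns_keys, obs_keys):
    """Report which pipeline stages still need to run, in pipeline order.

    Hypothetical helper mirroring the dependency chain
    PCA -> neighbours -> leiden described above.
    """
    missing = []
    if 'X_pca' not in obsm_keys:
        missing.append('pca')
    if 'neighbors' not in uns_keys:
        missing.append('neighbors')
    if 'leiden' not in obs_keys:
        missing.append('leiden')
    return missing

# Fresh session where only PCA was run: neighbours and clustering remain
print(missing_prerequisites(['X_pca'], [], []))  # ['neighbors', 'leiden']
```

Generated code can call this once up front and run only the missing stages instead of failing midway.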
Critical API Reference - Batch Column Handling
Batch Column Validation - REQUIRED Before Batch Operations
IMPORTANT: Always validate and prepare the batch column before any batch-aware operations (batch correction, integration, etc.). Missing or NaN values will cause errors.
CORRECT usage:
```python
# Step 1: Check if batch column exists, create default if not
if 'batch' not in adata.obs.columns:
    adata.obs['batch'] = 'batch_1'  # Default single batch

# Step 2: Handle NaN/missing values - CRITICAL!
adata.obs['batch'] = adata.obs['batch'].fillna('unknown')

# Step 3: Convert to categorical for efficient memory usage
adata.obs['batch'] = adata.obs['batch'].astype('category')

# Now safe to use in batch-aware operations
ov.pp.combat(adata, batch='batch')  # or other batch correction methods
```
WRONG - DO NOT USE:
```python
# WRONG! Using batch column without validation can cause NaN errors
# ov.pp.combat(adata, batch='batch')  # May fail if batch has NaN values!

# WRONG! Assuming batch column exists
# adata.obs['batch'].unique()  # KeyError if column doesn't exist!
```
Common Batch-Related Pitfalls
- NaN values in batch column: Always use `fillna()` before batch operations
- Missing batch column: Always check existence before use
- Non-categorical batch: Convert to category for memory efficiency
- Mixed data types: Ensure consistent string type before categorization
```python
# Complete defensive batch preparation pattern:
def prepare_batch_column(adata, batch_key='batch', default_batch='batch_1'):
    """Prepare batch column for batch-aware operations."""
    if batch_key not in adata.obs.columns:
        adata.obs[batch_key] = default_batch
    adata.obs[batch_key] = adata.obs[batch_key].fillna('unknown')
    adata.obs[batch_key] = adata.obs[batch_key].astype(str).astype('category')
    return adata
```
Highly Variable Genes (HVG) - Small Dataset Handling
LOESS Failure with Small Batches
IMPORTANT: The `seurat_v3` HVG flavor uses LOESS regression, which fails on small datasets or small per-batch subsets (<500 cells per batch). This manifests as:

```
ValueError: Extrapolation not allowed with blending
```
CORRECT - Use try/except fallback pattern:
```python
# Robust HVG selection for any dataset size
try:
    sc.pp.highly_variable_genes(
        adata,
        flavor='seurat_v3',
        n_top_genes=2000,
        batch_key='batch'  # if batch correction is needed
    )
except ValueError as e:
    if 'Extrapolation' in str(e) or 'LOESS' in str(e):
        # Fallback to simpler method for small datasets
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat',  # Works with any size
            n_top_genes=2000
        )
    else:
        raise
```
Alternative - Use cell_ranger flavor for batch-aware HVG:
```python
# cell_ranger flavor is more robust for batched data
sc.pp.highly_variable_genes(
    adata,
    flavor='cell_ranger',  # No LOESS, works with batches
    n_top_genes=2000,
    batch_key='batch'
)
```
Best Practices for Batch-Aware HVG
- Check batch sizes before HVG: Small batches (<500 cells) will cause LOESS to fail
- Prefer `seurat` or `cell_ranger` when batch sizes vary significantly
- Use `seurat_v3` only when all batches have >500 cells
- Always wrap in try/except when dataset size is unknown
```python
# Safe batch-aware HVG pattern
def safe_highly_variable_genes(adata, batch_key='batch', n_top_genes=2000):
    """Select HVGs with automatic fallback for small batches."""
    try:
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat_v3',
            n_top_genes=n_top_genes,
            batch_key=batch_key
        )
    except ValueError:
        # Fallback for small batches
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat',
            n_top_genes=n_top_genes
        )
```
Examples
- "Download PBMC3k counts, run QC with Scrublet, normalise with `shiftlog|pearson`, and compute MDE + UMAP embeddings on CPU."
- "Set up the mixed CPU–GPU workflow in a fresh conda env, recover raw counts after normalisation, and score cell cycle phases before Leiden clustering."
- "Provision a RAPIDS environment, transfer AnnData to GPU, run `method='cagra'` neighbours, and return embeddings to CPU for plotting."
References
- Detailed walkthrough notebooks: `t_preprocess.ipynb`, `t_preprocess_cpu.ipynb`, `t_preprocess_gpu.ipynb`
- Quick copy/paste commands: `reference.md`