OpenClaw-Medical-Skills single-cell-preprocessing-with-omicverse
Walk through omicverse's single-cell preprocessing tutorials to QC PBMC3k data, normalise counts, detect HVGs, and run PCA/embedding pipelines on CPU, CPU–GPU mixed, or GPU stacks.
```bash
# Clone the repository
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills

# Install the skill for Claude
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/single-preprocessing" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-single-cell-preprocessing-with-omicv && rm -rf "$T"

# Install the skill for OpenClaw
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/single-preprocessing" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-single-cell-preprocessing-with-omicv && rm -rf "$T"
```
skills/single-preprocessing/SKILL.md

Single-cell preprocessing with omicverse
Overview
Follow this skill when a user needs to reproduce the preprocessing workflow from the omicverse notebooks `t_preprocess.ipynb`, `t_preprocess_cpu.ipynb`, and `t_preprocess_gpu.ipynb`. The tutorials operate on the 10x PBMC3k dataset and cover QC filtering, normalisation, highly variable gene (HVG) detection, dimensionality reduction, and downstream embeddings.
Instructions
- Set up the environment
  - Import `omicverse as ov` and `scanpy as sc`, then call `ov.plot_set(font_path='Arial')` (or `ov.ov_plot_set()` in legacy notebooks) to standardise figure styling.
  - Encourage `%load_ext autoreload` and `%autoreload 2` when iterating inside notebooks so code edits propagate without restarting the kernel.
- Prepare input data
  - Download the PBMC3k filtered matrix from 10x Genomics (`pbmc3k_filtered_gene_bc_matrices.tar.gz`) and extract it under `data/filtered_gene_bc_matrices/hg19/`.
  - Load the matrix via `sc.read_10x_mtx(..., var_names='gene_symbols', cache=True)` and keep a writable folder like `write/` for exports.
- Perform quality control (QC)
  - Run `ov.pp.qc(adata, tresh={'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}, doublets_method='scrublet')` for the CPU/CPU–GPU pipelines; omit `doublets_method` on pure GPU where Scrublet is not yet supported.
  - Review the returned AnnData summary to confirm doublet rates and QC thresholds; advise adjusting cut-offs for different species or sequencing depths.
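As a concrete illustration of what these thresholds mean, the sketch below recomputes the three QC metrics on a toy counts matrix with plain NumPy. This is not omicverse's implementation — `ov.pp.qc` computes the metrics and applies the filter internally — but it shows how each cut-off acts on a cell:

```python
import numpy as np

# Tutorial thresholds from ov.pp.qc
tresh = {'mito_perc': 0.2, 'nUMIs': 500, 'detected_genes': 250}

n_genes = 400
mito_mask = np.zeros(n_genes, dtype=bool)
mito_mask[:10] = True  # pretend the first 10 genes are MT- genes

counts = np.zeros((3, n_genes), dtype=int)
counts[0, 10:310] = 2   # healthy cell: 600 UMIs over 300 genes, 0% mito
counts[1, 10:110] = 1   # low-depth cell: 100 UMIs, 100 detected genes
counts[2, 10:310] = 2   # high-mito cell: 900 UMIs, but 300 of them mitochondrial
counts[2, :10] = 30

nUMIs = counts.sum(axis=1)
detected = (counts > 0).sum(axis=1)
mito_perc = counts[:, mito_mask].sum(axis=1) / nUMIs

# A cell survives QC only if it passes all three thresholds
keep = (
    (mito_perc < tresh['mito_perc'])
    & (nUMIs > tresh['nUMIs'])
    & (detected > tresh['detected_genes'])
)
print(keep)  # [ True False False]
```

Only the first cell passes: the second fails the `nUMIs` floor, the third the mitochondrial fraction ceiling.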
- Store raw counts before transformations
  - Call `ov.utils.store_layers(adata, layers='counts')` immediately after QC so the original counts remain accessible for later recovery and comparison.
- Normalise and select HVGs
  - Use `ov.pp.preprocess(adata, mode='shiftlog|pearson', n_HVGs=2000, target_sum=5e5)` to apply shift-log normalisation followed by Pearson residual HVG detection (set `target_sum=None` on GPU, which keeps defaults).
  - For CPU–GPU mixed runs, demonstrate `ov.pp.recover_counts(...)` to invert normalisation and store reconstructed counts in `adata.layers['recover_counts']`.
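The shift-log transform and its inversion can be sketched in a few lines of NumPy, assuming shift-log means per-cell scaling to `target_sum` followed by `log1p` (the exact omicverse internals may differ, but this is why `recover_counts` can reconstruct the original matrix):

```python
import numpy as np

target_sum = 5e5
counts = np.array([[10.,  0., 90.],
                   [ 5., 15., 30.]])

# Shift-log: scale each cell to target_sum total counts, then log1p
size = counts.sum(axis=1, keepdims=True)
normed = np.log1p(counts / size * target_sum)

# recover_counts-style inversion: undo log1p, then the per-cell scaling
recovered = np.expm1(normed) * size / target_sum
assert np.allclose(recovered, counts)
```

Because both steps are invertible per cell, the round trip is exact (up to floating-point error).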
- Manage `.raw` and layer recovery
  - Snapshot normalised data to `.raw` with `adata.raw = adata` (or `adata.raw = adata.copy()`), and show `ov.utils.retrieve_layers(adata_counts, layers='counts')` to compare normalised vs. raw intensities.
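A minimal stand-in for the store/retrieve pattern, using a plain dict in place of AnnData layers, shows why the snapshot must be taken before normalisation mutates `X` (conceptual only; the real API is `ov.utils.store_layers` / `ov.utils.retrieve_layers`):

```python
import numpy as np

# Stand-in for adata.layers: snapshot counts before normalising
layers = {}
X = np.array([[1., 2.], [3., 4.]])
layers['counts'] = X.copy()  # like ov.utils.store_layers(adata, layers='counts')

# Normalisation overwrites X in place...
X = np.log1p(X / X.sum(axis=1, keepdims=True) * 1e4)

# ...but the snapshot still holds the original counts
restored = layers['counts']  # like ov.utils.retrieve_layers(...)
assert restored[0, 0] == 1.0
```

Note the `.copy()`: storing a reference instead of a copy would let the later transform silently corrupt the "raw" snapshot.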
- Scale, reduce, and embed
  - Scale features using `ov.pp.scale(adata)` (layers hold scaled matrices) followed by `ov.pp.pca(adata, layer='scaled', n_pcs=50)`.
  - Construct neighbourhood graphs with:
    - `sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50, use_rep='scaled|original|X_pca')` for the baseline notebook.
    - `ov.pp.neighbors(..., use_rep='scaled|original|X_pca')` on CPU–GPU to leverage accelerated routines.
    - `ov.pp.neighbors(..., method='cagra')` on GPU to call RAPIDS graph primitives.
  - Generate embeddings via `ov.utils.mde(...)`, `ov.pp.umap(adata)`, `ov.pp.mde(...)`, `ov.pp.tsne(...)`, or `ov.pp.sude(...)` depending on the notebook variant.
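Conceptually, scaling and PCA reduce to z-scoring each gene and projecting onto the top singular vectors; the NumPy sketch below shows that core (the omicverse versions add extras such as layer caching, so treat this as an explanation, not a substitute):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))  # toy: 100 cells x 20 HVGs

# Scale: zero-mean, unit-variance per gene (what ov.pp.scale does conceptually)
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the centred matrix, keeping the top 5 components
U, S, Vt = np.linalg.svd(Xs, full_matrices=False)
X_pca = U[:, :5] * S[:5]
print(X_pca.shape)  # (100, 5)
```

The resulting `X_pca` plays the role of `adata.obsm['X_pca']`: it is the low-dimensional representation the neighbour graph and embeddings are built on.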
- Cluster and annotate
  - Run `ov.pp.leiden(adata, resolution=1)` or `ov.single.leiden(adata, resolution=1.0)` after neighbour graph construction; CPU–GPU pipelines also showcase `ov.pp.score_genes_cell_cycle` before clustering.
  - IMPORTANT - Defensive checks: When generating code that plots by clustering results (e.g., `color='leiden'`), always check if the clustering has been performed first:

    ```python
    # Check if leiden clustering exists, if not, run it
    if 'leiden' not in adata.obs:
        if 'neighbors' not in adata.uns:
            ov.pp.neighbors(adata, n_neighbors=15, use_rep='X_pca')
        ov.single.leiden(adata, resolution=1.0)
    ```
  - Plot embeddings with `ov.pl.embedding(...)` or `ov.utils.embedding(...)`, colouring by `leiden` clusters and marker genes. Always verify that the column specified in `color=` exists in `adata.obs` before plotting.
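One way to implement the `color=` verification is a small helper that filters requested keys against `.obs` columns and `.var_names`; `safe_color_keys` below is a hypothetical name introduced here for illustration, not an omicverse function:

```python
def safe_color_keys(obs_columns, var_names, requested):
    """Keep only color keys that exist in .obs or in .var_names.

    Hypothetical helper: pass list(adata.obs.columns) and
    list(adata.var_names) in real use.
    """
    valid = set(obs_columns) | set(var_names)
    return [k for k in requested if k in valid]

keys = safe_color_keys(
    obs_columns=['leiden', 'n_genes'],
    var_names=['CD3D', 'MS4A1'],
    requested=['leiden', 'CD8A', 'MS4A1'],  # CD8A is absent, so it is dropped
)
print(keys)  # ['leiden', 'MS4A1']
```

Dropping unknown keys up front avoids the opaque `KeyError` that the plotting call would otherwise raise.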
- Document outputs
  - Encourage saving intermediate AnnData objects (`adata.write('write/pbmc3k_preprocessed.h5ad')`) and figure exports using Matplotlib's `plt.savefig(...)` to preserve QC summaries and embeddings.
- Notebook-specific notes
  - Baseline (`t_preprocess.ipynb`): Focuses on CPU execution with Scanpy neighbours; emphasise storing counts before and after `retrieve_layers` demonstrations.
  - CPU–GPU mixed (`t_preprocess_cpu.ipynb`): Highlights Omicverse ≥1.7.0 mixed acceleration. Include timing magics (`%%time`) to showcase speedups and call out `doublets_method='scrublet'` support.
  - GPU (`t_preprocess_gpu.ipynb`): Requires a CUDA-capable GPU, RAPIDS 24.04 stack, and `rapids-singlecell`. Mention the `ov.pp.anndata_to_GPU`/`ov.pp.anndata_to_CPU` transfers and `method='cagra'` neighbours. Note the current warning that pure-GPU pipelines depend on RAPIDS updates.
- Troubleshooting tips
  - If `sc.read_10x_mtx` fails, verify the extracted folder structure and ensure gene symbols are available via `var_names='gene_symbols'`.
  - Address GPU import errors by confirming the conda environment matches the RAPIDS version for the installed CUDA driver (`nvidia-smi`).
  - For `ov.pp.preprocess` dimension mismatches, ensure QC filtered out empty barcodes so HVG selection does not encounter zero-variance features.
  - When embeddings lack expected fields (e.g., `scaled|original|X_pca` missing), re-run `ov.pp.scale` and `ov.pp.pca` to rebuild the cached layers.
  - Pipeline dependency errors: When encountering errors like "Could not find 'leiden' in adata.obs or adata.var_names":
    - Always check if required preprocessing steps (neighbors, PCA) exist before dependent operations
    - Check if clustering results exist in `adata.obs` before trying to color plots by them
    - Use defensive checks in generated code to handle incomplete pipelines gracefully
  - Code generation best practice: Generate robust code with conditional checks for prerequisites rather than assuming perfect sequential execution. Users may run steps in separate sessions or skip intermediate steps.
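These dependency checks can be bundled into one helper; `missing_prerequisites` is a hypothetical function, sketched over plain key lists so it stays framework-free (in real use, pass `adata.obsm.keys()`, `adata.uns.keys()`, and `adata.obs.columns`):

```python
def missing_prerequisites(obsm_keys, uns_keys, obs_keys):
    """Report which pipeline stages still need to run, in pipeline order.

    Hypothetical helper mirroring the dependency chain
    PCA -> neighbours -> leiden described above.
    """
    missing = []
    if 'X_pca' not in obsm_keys:
        missing.append('pca')
    if 'neighbors' not in uns_keys:
        missing.append('neighbors')
    if 'leiden' not in obs_keys:
        missing.append('leiden')
    return missing

# Fresh session where only PCA was run: neighbours and clustering remain
print(missing_prerequisites(['X_pca'], [], []))  # ['neighbors', 'leiden']
```

Generated code can call this once up front and run only the missing stages instead of failing midway.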
Critical API Reference - Batch Column Handling
Batch Column Validation - REQUIRED Before Batch Operations
IMPORTANT: Always validate and prepare the batch column before any batch-aware operations (batch correction, integration, etc.). Missing or NaN values will cause errors.
CORRECT usage:
```python
# Step 1: Check if batch column exists, create default if not
if 'batch' not in adata.obs.columns:
    adata.obs['batch'] = 'batch_1'  # Default single batch

# Step 2: Handle NaN/missing values - CRITICAL!
adata.obs['batch'] = adata.obs['batch'].fillna('unknown')

# Step 3: Convert to categorical for efficient memory usage
adata.obs['batch'] = adata.obs['batch'].astype('category')

# Now safe to use in batch-aware operations
ov.pp.combat(adata, batch='batch')  # or other batch correction methods
```
WRONG - DO NOT USE:
```python
# WRONG! Using batch column without validation can cause NaN errors
# ov.pp.combat(adata, batch='batch')  # May fail if batch has NaN values!

# WRONG! Assuming batch column exists
# adata.obs['batch'].unique()  # KeyError if column doesn't exist!
```
Common Batch-Related Pitfalls
- NaN values in batch column: Always use `fillna()` before batch operations
- Missing batch column: Always check existence before use
- Non-categorical batch: Convert to category for memory efficiency
- Mixed data types: Ensure consistent string type before categorization
```python
# Complete defensive batch preparation pattern:
def prepare_batch_column(adata, batch_key='batch', default_batch='batch_1'):
    """Prepare batch column for batch-aware operations."""
    if batch_key not in adata.obs.columns:
        adata.obs[batch_key] = default_batch
    adata.obs[batch_key] = adata.obs[batch_key].fillna('unknown')
    adata.obs[batch_key] = adata.obs[batch_key].astype(str).astype('category')
    return adata
```
Highly Variable Genes (HVG) - Small Dataset Handling
LOESS Failure with Small Batches
IMPORTANT: The `seurat_v3` HVG flavor uses LOESS regression, which fails on small datasets or small per-batch subsets (<500 cells per batch). This manifests as:

```
ValueError: Extrapolation not allowed with blending
```
CORRECT - Use try/except fallback pattern:
```python
# Robust HVG selection for any dataset size
try:
    sc.pp.highly_variable_genes(
        adata,
        flavor='seurat_v3',
        n_top_genes=2000,
        batch_key='batch'  # if batch correction is needed
    )
except ValueError as e:
    if 'Extrapolation' in str(e) or 'LOESS' in str(e):
        # Fallback to simpler method for small datasets
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat',  # Works with any size
            n_top_genes=2000
        )
    else:
        raise
```
Alternative - Use cell_ranger flavor for batch-aware HVG:
```python
# cell_ranger flavor is more robust for batched data
sc.pp.highly_variable_genes(
    adata,
    flavor='cell_ranger',  # No LOESS, works with batches
    n_top_genes=2000,
    batch_key='batch'
)
```
Best Practices for Batch-Aware HVG
- Check batch sizes before HVG: Small batches (<500 cells) will cause LOESS to fail
- Prefer `seurat` or `cell_ranger` when batch sizes vary significantly
- Use `seurat_v3` only when all batches have >500 cells
- Always wrap in try/except when dataset size is unknown
```python
# Safe batch-aware HVG pattern
def safe_highly_variable_genes(adata, batch_key='batch', n_top_genes=2000):
    """Select HVGs with automatic fallback for small batches."""
    try:
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat_v3',
            n_top_genes=n_top_genes,
            batch_key=batch_key
        )
    except ValueError:
        # Fallback for small batches
        sc.pp.highly_variable_genes(
            adata,
            flavor='seurat',
            n_top_genes=n_top_genes
        )
```
Examples
- "Download PBMC3k counts, run QC with Scrublet, normalise with `shiftlog|pearson`, and compute MDE + UMAP embeddings on CPU."
- "Set up the mixed CPU–GPU workflow in a fresh conda env, recover raw counts after normalisation, and score cell cycle phases before Leiden clustering."
- "Provision a RAPIDS environment, transfer AnnData to GPU, run `method='cagra'` neighbours, and return embeddings to CPU for plotting."
References
- Detailed walkthrough notebooks: `t_preprocess.ipynb`, `t_preprocess_cpu.ipynb`, `t_preprocess_gpu.ipynb`
- Quick copy/paste commands: `reference.md`