OpenClaw-Medical-Skills single-cell-annotation-skills-with-omicverse
Guide Claude through SCSA, MetaTiME, CellVote, CellMatch, GPTAnno, and weighted KNN transfer workflows for annotating single-cell modalities.
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/single-annotation" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-single-cell-annotation-skills-with-o && rm -rf "$T"
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/single-annotation" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-single-cell-annotation-skills-with-o && rm -rf "$T"
skills/single-annotation/SKILL.md
Single-cell annotation skills with omicverse
Overview
Use this skill to reproduce and adapt the single-cell annotation playbook captured in omicverse tutorials: SCSA (`t_cellanno.ipynb`), MetaTiME (`t_metatime.ipynb`), CellVote (`t_cellvote.md` & `t_cellvote_pbmc3k.ipynb`), CellMatch (`t_cellmatch.ipynb`), GPTAnno (`t_gptanno.ipynb`), and label transfer (`t_anno_trans.ipynb`). Each section below highlights required inputs, training/inference steps, and how to read the outputs.
Instructions
- SCSA automated cluster annotation
  - Data requirements: PBMC3k raw counts from 10x Genomics (`pbmc3k_filtered_gene_bc_matrices.tar.gz`) or the processed `sample/rna.h5ad`. Download instructions are embedded in the notebook; unpack to `data/filtered_gene_bc_matrices/hg19/`. Ensure an SCSA SQLite database is available (e.g. `pySCSA_2024_v1_plus.db` from the Figshare/Drive links listed in the tutorial) and point `model_path` to its location.
  - Preprocessing & model fit: Load with `sc.read_10x_mtx`, run QC (`ov.pp.qc`), normalization and HVG selection (`ov.pp.preprocess`), scaling (`ov.pp.scale`), PCA (`ov.pp.pca`), neighbors, Leiden clustering, and compute rank markers (`sc.tl.rank_genes_groups`). Instantiate `scsa = ov.single.pySCSA(...)` choosing `target='cellmarker'` or `'panglaodb'`, tissue scope, and thresholds (`foldchange`, `pvalue`).
  - Inference & interpretation: Call `scsa.cell_anno(clustertype='leiden', result_key='scsa_celltype_cellmarker')` or `scsa.cell_auto_anno` to append predictions to `adata.obs`. Compare to manual marker-based labels via `ov.utils.embedding` or `sc.pl.dotplot`, inspect marker dictionaries (`ov.single.get_celltype_marker`), and query supported tissues with `scsa.get_model_tissue()`. Use the ROI/ROE helpers (`ov.utils.roe`, `ov.utils.plot_cellproportion`) to validate abundance trends.
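The `foldchange` and `pvalue` thresholds decide which ranked markers are considered for each cluster. The following is a minimal, library-free sketch of that filtering idea. The marker records and the `filter_markers` helper are hypothetical, and treating fold change on a linear scale is an assumption; this is not SCSA's internal implementation.

```python
from collections import defaultdict

# Hypothetical ranked-marker records: (cluster, gene, fold change, adjusted p-value).
ranked_markers = [
    ("0", "CD3D", 3.2, 1e-30),
    ("0", "IL7R", 1.2, 1e-8),     # fails the fold-change threshold
    ("0", "MALAT1", 2.0, 0.20),   # fails the p-value threshold
    ("1", "MS4A1", 4.1, 1e-25),
    ("1", "CD79A", 3.7, 1e-22),
]

def filter_markers(records, foldchange=1.5, pvalue=0.01):
    """Keep genes with fold change >= foldchange and p-value <= pvalue, per cluster."""
    kept = defaultdict(list)
    for cluster, gene, fc, pval in records:
        if fc >= foldchange and pval <= pvalue:
            kept[cluster].append(gene)
    return dict(kept)

markers_by_cluster = filter_markers(ranked_markers)
print(markers_by_cluster)
```

Loosening either threshold lets more genes through, which can help sparse clusters but also admits less specific markers.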
- MetaTiME tumour microenvironment states
  - Data requirements: Batched TME AnnData with an scVI latent embedding. The tutorial uses `TiME_adata_scvi.h5ad` from Figshare (`https://figshare.com/ndownloader/files/41440050`). If starting from counts, run scVI (`scvi.model.SCVI`) first to populate `adata.obsm['X_scVI']`.
  - Preprocessing & model fit: Optionally subset to non-malignant cells via `adata.obs['isTME']`. Rebuild neighbors on the latent representation (`sc.pp.neighbors(adata, use_rep="X_scVI")`) and embed with pymde (`adata.obsm['X_mde'] = ov.utils.mde(...)`). Initialise `TiME_object = ov.single.MetaTiME(adata, mode='table')` and, if finer granularity is desired, over-cluster with `TiME_object.overcluster(resolution=8, clustercol='overcluster')`.
  - Inference & interpretation: Run `TiME_object.predictTiME(save_obs_name='MetaTiME')` to assign minor states and `Major_MetaTiME`. Visualise using `TiME_object.plot` or `sc.pl.embedding`. Interpret the outputs by comparing cluster-level distributions and confirming that MetaTiME and Major_MetaTiME columns align with expected niches.
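Comparing cluster-level distributions amounts to tabulating, for each over-cluster, the fraction of cells carrying each predicted label. A small stdlib sketch of that tabulation follows. The per-cell records, the `label_fractions` helper, and the 0.75 purity cut-off are assumptions for illustration; in practice the two columns would come from `adata.obs`.

```python
from collections import Counter, defaultdict

# Hypothetical per-cell records: (overcluster id, predicted Major_MetaTiME label).
cells = [
    ("0", "T cell"), ("0", "T cell"), ("0", "NK cell"),
    ("1", "Myeloid"), ("1", "Myeloid"),
    ("2", "B cell"), ("2", "T cell"),
]

def label_fractions(records):
    """Return {cluster: {label: fraction of the cluster's cells}}."""
    counts = defaultdict(Counter)
    for cluster, label in records:
        counts[cluster][label] += 1
    return {
        cluster: {label: n / sum(c.values()) for label, n in c.items()}
        for cluster, c in counts.items()
    }

fractions = label_fractions(cells)
for cluster, dist in sorted(fractions.items()):
    top_label, top_frac = max(dist.items(), key=lambda kv: kv[1])
    # 0.75 is an arbitrary purity cut-off chosen for this sketch.
    flag = "" if top_frac >= 0.75 else "  <- mixed, inspect"
    print(f"cluster {cluster}: {top_label} ({top_frac:.2f}){flag}")
```

Clusters flagged as mixed are the ones worth checking against the embedding plot before trusting the assigned state.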
- CellVote consensus labelling
  - Data requirements: A clustered AnnData (e.g. PBMC3k stored as `CELLVOTE_PBMC3K` env var or `data/pbmc3k.h5ad`) plus at least two precomputed annotation columns (simulated in the tutorial as `scsa_annotation`, `gpt_celltype`, `gbi_celltype`). Prepare per-cluster marker genes via `sc.tl.rank_genes_groups`.
  - Preprocessing & model fit: After standard preprocessing (normalize, log1p, HVGs, PCA, neighbors, Leiden) build a marker dictionary `marker_dict = top_markers_from_rgg(adata, 'leiden', topn=10)` or via `ov.single.get_celltype_marker`. Instantiate `cv = ov.single.CellVote(adata)`.
  - Inference & interpretation: Call `cv.vote(clusters_key='leiden', cluster_markers=marker_dict, celltype_keys=[...], species='human', organization='PBMC', provider='openai', model='gpt-4o-mini')`. Offline examples monkey-patch arbitration to avoid API calls; online voting requires valid credentials. Final consensus labels live in `adata.obs['CellVote_celltype']`. Compare each cluster’s majority vote with the input sources (`adata.obs[['leiden', 'scsa_annotation', ...]]`) to justify decisions.
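To compare the consensus against a plain majority vote, it can help to compute that baseline explicitly. The sketch below is an illustration under stated assumptions: the per-cluster labels are made up, `majority_vote` is a hypothetical helper, and it does not reproduce CellVote's LLM-based arbitration.

```python
from collections import Counter

# Hypothetical per-cluster labels taken from three annotation columns.
cluster_votes = {
    "0": {"scsa_annotation": "T cell", "gpt_celltype": "T cell", "gbi_celltype": "NK cell"},
    "1": {"scsa_annotation": "B cell", "gpt_celltype": "B cell", "gbi_celltype": "B cell"},
    "2": {"scsa_annotation": "Monocyte", "gpt_celltype": "Macrophage", "gbi_celltype": "Dendritic cell"},
}

def majority_vote(votes):
    """Return (label, agreement); label is None when no strict majority exists."""
    counts = Counter(votes.values())
    label, n = counts.most_common(1)[0]
    agreement = n / len(votes)
    return (label if agreement > 0.5 else None), agreement

consensus = {cluster: majority_vote(votes) for cluster, votes in cluster_votes.items()}
for cluster, (label, agreement) in consensus.items():
    print(cluster, label, f"{agreement:.2f}")
```

Clusters where the baseline returns `None` (no strict majority) are the cases where arbitration matters most and where the final label deserves a written justification.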
- CellMatch ontology mapping
  - Data requirements: Annotated AnnData such as `pertpy.dt.haber_2017_regions()` with `adata.obs['cell_label']`. Download Cell Ontology JSON (`cl.json`) via `ov.single.download_cl(...)` or manual links, and optionally Cell Taxonomy resources (`Cell_Taxonomy_resource.txt`). Ensure access to a SentenceTransformer model (`sentence-transformers/all-MiniLM-L6-v2`, `BAAI/bge-base-en-v1.5`, etc.), downloading to `local_model_dir` if offline.
  - Preprocessing & model fit: Create the mapper with `ov.single.CellOntologyMapper(cl_obo_file='new_ontology/cl.json', model_name='sentence-transformers/all-MiniLM-L6-v2', local_model_dir='./my_models')`. Run `mapper.map_adata(...)` to assign ontology-derived labels/IDs, optionally enabling taxonomy matching (`use_taxonomy=True` after calling `load_cell_taxonomy_resource`).
  - Inference & interpretation: Explore mapping summaries (`mapper.print_mapping_summary_taxonomy`) and inspect embeddings coloured by `cell_ontology`, `cell_ontology_cl_id`, or `enhanced_cell_ontology`. Use helper queries such as `mapper.find_similar_cells('T helper cell')`, `mapper.get_cell_info(...)`, and category browsing to validate ontology coverage.
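The mapping step relies on comparing an embedding of each free-text label with embeddings of ontology term names and keeping the closest term. The sketch below shows only that nearest-term idea and deliberately swaps the SentenceTransformer model for a toy bag-of-words embedding so it runs without downloads. The `embed`, `cosine`, and `best_match` helpers and the short term list are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a sentence embedding: lowercase bag-of-words counts."""
    return Counter(text.lower().replace("-", " ").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Small hand-picked list of term names, for illustration only.
ontology_terms = ["enterocyte", "goblet cell", "stem cell", "tuft cell"]

def best_match(label, terms):
    """Return (similarity, term) for the term most similar to the label."""
    query = embed(label)
    return max((cosine(query, embed(term)), term) for term in terms)

score, term = best_match("Goblet", ontology_terms)
print(term, round(score, 3))
```

A real sentence embedding also captures synonyms that share no tokens, which this toy version cannot do; low similarity scores are a reasonable trigger for manual review either way.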
- GPTAnno LLM-powered annotation
  - Data requirements: The same PBMC3k dataset (raw matrix or `.h5ad`) and cluster assignments. Access to an LLM endpoint: configure `AGI_API_KEY` for OpenAI-compatible providers (`provider='openai'`, `'qwen'`, `'kimi'`, etc.), or supply a local model path for `ov.single.gptcelltype_local`.
  - Preprocessing & model fit: Follow the QC, normalization, HVG, scaling, PCA, neighbor, Leiden, and marker discovery steps described above (reusing outputs from the SCSA workflow). Build the marker dictionary automatically with `ov.single.get_celltype_marker(adata, clustertype='leiden', rank=True, key='rank_genes_groups', foldchange=2, topgenenumber=5)`.
  - Inference & interpretation: Invoke `ov.single.gptcelltype(...)` specifying tissue/species context and desired provider/model. Post-process responses to keep clean labels (`result[key].split(': ')[-1]...`) and write them to `adata.obs['gpt_celltype']`. Compare embeddings (`ov.pl.embedding(..., color=['leiden','gpt_celltype'])`) to verify cluster identities. If operating offline, call `ov.single.gptcelltype_local` with a downloaded instruction-tuned checkpoint.
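The post-processing step strips any prefix the model puts in front of the label. A minimal sketch, assuming the responses arrive as a dict keyed by cluster id: the example responses and the `clean_label` helper are hypothetical, and the trailing-period cleanup is an extra assumption beyond the `split(': ')` shown in the text.

```python
# Hypothetical raw responses keyed by cluster id.
result = {
    "0": "0: T cells",
    "1": "Cluster 1: B cells.",
    "2": "Monocytes",
}

def clean_label(response):
    # Keep the text after the last ': ', then trim whitespace and a
    # trailing period (the extra trimming is an assumption of this sketch).
    return response.split(": ")[-1].strip().rstrip(".")

cluster_to_label = {key: clean_label(value) for key, value in result.items()}
print(cluster_to_label)
```

The cleaned dict can then be mapped onto the cluster column, for example with `adata.obs['leiden'].map(cluster_to_label)`, to populate `adata.obs['gpt_celltype']`.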
- Weighted KNN annotation transfer
  - Data requirements: Cross-modal GLUE outputs with aligned embeddings, e.g. `data/analysis_lymph/rna-emb.h5ad` (annotated RNA) and `data/analysis_lymph/atac-emb.h5ad` (query ATAC) where both contain `obsm['X_glue']`.
  - Preprocessing & model fit: Load both modalities, optionally concatenate for QC plots, and compute a shared low-dimensional embedding with `ov.utils.mde`. Train a neighbour model using `ov.utils.weighted_knn_trainer(train_adata=rna, train_adata_emb='X_glue', n_neighbors=15)`.
  - Inference & interpretation: Transfer labels via `labels, uncert = ov.utils.weighted_knn_transfer(query_adata=atac, query_adata_emb='X_glue', label_keys='major_celltype', knn_model=knn_transformer, ref_adata_obs=rna.obs)`. Store predictions in `atac.obs['transf_celltype']` and uncertainties in `atac.obs['transf_celltype_unc']`; copy to `major_celltype` if you want consistent naming. Visualise (`ov.utils.embedding`) and inspect uncertainty to flag ambiguous cells.
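The idea behind weighted KNN transfer is that each query cell takes the label with the largest distance-weighted support among its nearest reference cells, and the remaining support becomes the uncertainty. The sketch below illustrates this with inverse-distance weights on toy 2-D points. The coordinates, the local `weighted_knn_transfer` function, and the weighting scheme are assumptions; omicverse's own weighting may differ.

```python
import math
from collections import defaultdict

# Hypothetical 2-D shared embeddings (stand-ins for rows of obsm['X_glue']).
ref_emb = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
ref_labels = ["T cell", "T cell", "T cell", "B cell", "B cell", "B cell"]
query_emb = [(0.05, 0.05), (0.45, 0.45)]

def weighted_knn_transfer(query, ref, labels, n_neighbors=4, eps=1e-9):
    """Return (predicted labels, uncertainties) using inverse-distance weights."""
    preds, uncerts = [], []
    for q in query:
        # The n_neighbors closest reference cells by Euclidean distance.
        nearest = sorted((math.dist(q, r), lab) for r, lab in zip(ref, labels))[:n_neighbors]
        weights = defaultdict(float)
        for dist, lab in nearest:
            weights[lab] += 1.0 / (dist + eps)
        best = max(weights, key=weights.get)
        preds.append(best)
        # Uncertainty = share of weight that does not support the winner.
        uncerts.append(1.0 - weights[best] / sum(weights.values()))
    return preds, uncerts

labels, uncert = weighted_knn_transfer(query_emb, ref_emb, ref_labels, n_neighbors=4)
print(labels, [round(u, 3) for u in uncert])
```

The second query point sits between the two groups, so its uncertainty is noticeably higher than the first; cells like this are the ones to flag as ambiguous.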
Critical API Reference - EXACT Function Signatures
pySCSA - IMPORTANT: Parameter is `clustertype`, NOT `cluster`
CORRECT usage:

    import omicverse as ov

    # Step 1: Initialize pySCSA
    scsa = ov.single.pySCSA(
        adata,
        foldchange=1.5,
        pvalue=0.01,
        species='Human',
        tissue='All',
        target='cellmarker'  # or 'panglaodb'
    )

    # Step 2: Run annotation - NOTE: use clustertype='leiden', NOT cluster='leiden'!
    anno_result = scsa.cell_anno(clustertype='leiden', cluster='all')

    # Step 3: Add cell type labels to adata.obs
    scsa.cell_auto_anno(adata, clustertype='leiden', key='scsa_celltype')
    # Results are stored in adata.obs['scsa_celltype']

WRONG - DO NOT USE:

    # WRONG! 'cluster' is NOT a valid parameter for cell_auto_anno!
    # scsa.cell_auto_anno(adata, cluster='leiden')  # ERROR!

COSG Marker Genes - Results stored in adata.uns, NOT adata.obs
CORRECT usage:

    import omicverse as ov

    # Step 1: Run COSG marker gene identification
    ov.single.cosg(adata, groupby='leiden', n_genes_user=50)

    # Step 2: Access results from adata.uns (NOT adata.obs!)
    marker_names = adata.uns['rank_genes_groups']['names']  # DataFrame with cluster columns
    marker_scores = adata.uns['rank_genes_groups']['scores']

    # Step 3: Get top markers for specific cluster
    cluster_0_markers = adata.uns['rank_genes_groups']['names']['0'][:10].tolist()

    # Step 4: To create celltype column, manually map clusters to cell types
    cluster_to_celltype = {
        '0': 'T cells',
        '1': 'B cells',
        '2': 'Monocytes',
    }
    adata.obs['cosg_celltype'] = adata.obs['leiden'].map(cluster_to_celltype)

WRONG - DO NOT USE:

    # WRONG! COSG does NOT create adata.obs columns directly!
    # adata.obs['cosg_celltype']  # This key does NOT exist after running COSG!
    # adata.uns['cosg_celltype']  # This key also does NOT exist!

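Because the cluster-to-cell-type mapping is written by hand, it is easy to miss a cluster; with pandas, `Series.map` on a dict leaves such clusters as NaN. The sketch below shows one way to surface gaps before assigning the column. The cluster list and the `"Unknown"` placeholder are assumptions for illustration.

```python
# Hypothetical cluster labels for six cells and a partial manual mapping.
leiden = ["0", "1", "2", "3", "0", "1"]
cluster_to_celltype = {"0": "T cells", "1": "B cells", "2": "Monocytes"}

# Clusters present in the data but absent from the mapping.
unmapped = sorted(set(leiden) - set(cluster_to_celltype))

# Fall back to an explicit placeholder instead of a silent missing value.
celltypes = [cluster_to_celltype.get(cluster, "Unknown") for cluster in leiden]

print("unmapped clusters:", unmapped)
print(celltypes)
```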
Common Pitfalls to Avoid
- pySCSA parameter confusion:
  - `clustertype` = which obs column contains cluster labels (e.g., 'leiden')
  - `cluster` = which specific clusters to annotate ('all' or specific cluster IDs)
  - These are DIFFERENT parameters!
- COSG result access:
  - COSG is a marker gene finder, NOT a cell type annotator
  - Results are per-cluster gene rankings stored in `adata.uns['rank_genes_groups']`
  - To assign cell types, you must manually map clusters to cell types based on markers
- Result storage patterns in OmicVerse:
  - Cell type annotations → `adata.obs['<key>']`
  - Marker gene results → `adata.uns['<key>']` (includes 'names', 'scores', 'logfoldchanges')
  - Differential expression → `adata.uns['rank_genes_groups']`
Examples
- "Run SCSA with both CellMarker and PanglaoDB references on PBMC3k, then benchmark against manual marker assignments before feeding the results into CellVote."
- "Annotate tumour microenvironment states in the MetaTiME Figshare dataset, highlight Major_MetaTiME classes, and export the label distribution per patient."
- "Download Cell Ontology resources, map `haber_2017_regions` clusters to ontology terms, and enrich ambiguous clusters using Cell Taxonomy hints."
- "Propagate RNA-derived `major_celltype` labels onto GLUE-integrated ATAC cells and report clusters with high transfer uncertainty."
References
- Tutorials and notebooks: `t_cellanno.ipynb`, `t_metatime.ipynb`, `t_cellvote.md`, `t_cellvote_pbmc3k.ipynb`, `t_cellmatch.ipynb`, `t_gptanno.ipynb`, `t_anno_trans.ipynb`.
- Sample data & assets: PBMC3k matrix from 10x Genomics, MetaTiME `TiME_adata_scvi.h5ad` (Figshare), SCSA database downloads, GLUE embeddings under `data/analysis_lymph/`, Cell Ontology `cl.json`, and Cell Taxonomy resource.
- Quick copy commands: `reference.md`.