Claude-skill-registry celltypeannotation

Annotates cell clusters with biological cell type labels using multiple methods: direct assignment, ScType, scCATCH, hitype, or CellTypist. This process is essential for interpreting clustering results by assigning meaningful biological identities to each cluster.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/celltypeannotation" ~/.claude/skills/majiayu000-claude-skill-registry-celltypeannotation && rm -rf "$T"
manifest: skills/data/celltypeannotation/SKILL.md
source content

CellTypeAnnotation Process Configuration

Purpose

Annotates cell clusters with biological cell type labels using multiple methods: direct assignment, ScType, scCATCH, hitype, or CellTypist. This process is essential for interpreting clustering results by assigning meaningful biological identities to each cluster.

When to Use

  • After clustering: When you have cluster assignments but need biological cell type labels
  • Automated annotation: When manual annotation is too time-consuming or subjective
  • Consistent nomenclature: When you need standardized cell type names across multiple samples
  • Reference-based annotation: When you have well-characterized reference datasets or marker databases
  • Cross-sample comparison: When analyzing multiple samples with the same cell type definitions
  • Alternative to SeuratMap2Ref: When you prefer database-based annotation over reference dataset mapping

Configuration Structure

Process Enablement

[CellTypeAnnotation]
cache = true  # Cache results for faster re-runs

Input Specification

[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]  # Path or reference to Seurat object

Environment Variables

Core Parameters

[CellTypeAnnotation.envs]
# Annotation method selection
tool = "direct"  # Options: "direct", "sctype", "hitype", "sccatch", "celltypist"

# Cluster identity column (required for h5ad input, optional for Seurat objects)
ident = "seurat_clusters"  # Column name in metadata representing clusters

# Backup column name (stores original cluster labels)
backup_col = "seurat_clusters_id"  # Default: "seurat_clusters_id"

# New column name for annotated cell types
# If specified, original identity is kept; otherwise, it's replaced
newcol = ""  # Default: empty (overwrite identity)

# Merge clusters with same predicted cell types
merge = false  # Default: false; suffixes (.1, .2) added for duplicate labels

# Output file type
outtype = "input"  # Options: "input", "rds", "qs", "qs2", "h5ad"

Direct Annotation Parameters

[CellTypeAnnotation.envs]
tool = "direct"

# Cell type assignments (one per cluster, in order)
# Use "-" or "" to keep original cluster name
# Use "NA" to remove cluster from downstream analysis (only without newcol)
cell_types = ["CD4+ T cells", "CD8+ T cells", "-", "B cells"]  # Default: []

# Additional annotations (multiple cell type columns)
more_cell_types = {  # Dict: {new_column: [cell_types]}
    cell_type_broad = ["T cells", "T cells", "NK cells", "B cells"],
    cell_type_detailed = ["CD4+ naive", "CD8+ effector", "NK", "B naive"]
}

ScType Annotation Parameters

[CellTypeAnnotation.envs]
tool = "sctype"

# Tissue type (must match tissueType column in database)
sctype_tissue = "Immune system"  # Required for sctype

# Database file path (Excel format compatible with ScType)
sctype_db = "/path/to/ScTypeDB_full.xlsx"  # Optional: uses default if not specified

hitype Annotation Parameters

[CellTypeAnnotation.envs]
tool = "hitype"

# Tissue type (must match tissueType column in database)
hitype_tissue = "Immune system"  # Required for hitype

# Database file path or built-in database name
# Built-in options: "hitypedb_short", "hitypedb_full", "hitypedb_pbmc3k"
hitype_db = "hitypedb_full"  # Default: built-in database

scCATCH Annotation Parameters

[CellTypeAnnotation.envs]
tool = "sccatch"

[CellTypeAnnotation.envs.sccatch_args]
# Species (Human or Mouse)
species = "Human"  # Required

# Tissue origin
tissue = "Blood"  # Required

# Cancer type (if cancer tissue)
cancer = "Normal"  # Default: "Normal"

# Custom marker genes (RDS file or list)
marker = ""  # Optional

# Use custom marker instead of database
if_use_custom_marker = false  # Default: false

# Additional scCATCH::findmarkergene() arguments
# See: https://rdrr.io/cran/scCATCH/man/findmarkergene.html

CellTypist Annotation Parameters

[CellTypeAnnotation.envs]
tool = "celltypist"

[CellTypeAnnotation.envs.celltypist_args]
# Model file path (download from https://celltypist.cog.sanger.ac.uk/models/models.json)
model = "Immune_All_Low.pkl"  # Required

# Python interpreter where celltypist is installed
python = "python"  # Default: "python"

# Majority voting refinement for local subclusters
majority_voting = true  # Default: true

# Over-clustering column (for majority voting)
# Set to false to disable over-clustering
over_clustering = "seurat_clusters"  # Auto: identity for Seurat, false for h5ad

# Assay for Seurat-to-AnnData conversion
assay = ""  # Auto: RNA for h5seurat, default assay for Seurat

Annotation Methods

1. Direct Annotation

Assigns cell types manually to each cluster. Best when you have well-defined marker genes or want complete control over annotations.

Pros:

  • Full control over annotations
  • Fast and deterministic
  • Works with any clustering result

Cons:

  • Requires domain knowledge
  • Time-consuming for many clusters
  • Subjective

Use cases:

  • Small number of well-separated clusters
  • Known marker genes
  • Reproducible annotation needed

2. ScType

Uses pre-defined cell type markers from ScType database. Annotates based on enrichment of known marker genes in each cluster.

Databases:

  • ScTypeDB_short.xlsx: Compact database (~70 cell types)
  • ScTypeDB_full.xlsx: Full database (~200+ cell types)
  • Custom database: Provide your own Excel file

Pros:

  • Automated annotation
  • Tissue-specific filtering available
  • Well-curated marker database

Cons:

  • Limited to predefined cell types
  • Requires tissue specification
  • May miss rare cell types

Reference: https://github.com/IanevskiAleksandr/sc-type

Use cases:

  • Immune tissue datasets
  • When tissue type is well-defined
  • Need for comprehensive annotation

3. hitype

Flexible annotation tool compatible with ScType database format. Supports both file-based and built-in databases.

Built-in databases:

  • hitypedb_short
    : Compact marker set
  • hitypedb_full
    : Comprehensive marker set
  • hitypedb_pbmc3k
    : PBMC-specific markers (from 10X PBMC3k dataset)

Pros:

  • Faster than ScType (Python-based)
  • Multiple built-in databases
  • Tissue-specific filtering

Cons:

  • Limited to database cell types
  • Requires tissue specification

Reference: https://github.com/pwwang/hitype

Use cases:

  • PBMC datasets (use
    hitypedb_pbmc3k
    )
  • General immune annotation
  • When speed matters

4. scCATCH

Identifies cell types by matching cluster marker genes to cell type-specific marker database.

Workflow:

  1. Finds marker genes for each cluster
  2. Matches markers to cell type database
  3. Assigns best matching cell type

Parameters:

  • species
    : Human or Mouse
  • tissue
    : Tissue origin (required)
  • cancer
    : Cancer type (if applicable)

Pros:

  • Automated marker identification
  • Species-specific databases
  • Cancer type support

Cons:

  • Requires tissue specification
  • Slower (finds markers first)
  • Limited database

Reference: https://github.com/ZJUFanLab/scCATCH

Use cases:

  • When you want marker discovery + annotation
  • Cancer tissue datasets
  • Species-specific annotation

5. CellTypist

Machine learning-based annotation using pre-trained models. Requires Python environment and celltypist2 package.

Models:

Key features:

  • majority_voting
    : Refines annotations within local subclusters
  • over_clustering
    : Over-cluster first, then merge by majority vote

Pros:

  • State-of-the-art ML models
  • Handles complex datasets well
  • Majority voting improves accuracy

Cons:

  • Requires Python environment
  • Model files need download
  • Longer runtime with majority voting

Reference: https://celltypist.org/

Use cases:

  • Large complex datasets
  • When ScType/hitype annotation is insufficient
  • High-throughput annotation

Configuration Examples

Example 1: Minimal Configuration (No Annotation)

[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

Result: Tool defaults to "direct" with empty

cell_types
. Original cluster names are preserved.

Example 2: Direct Annotation for T Cell Subsets

[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["CD4+ naive", "CD4+ memory", "CD8+ naive", "CD8+ effector", "-", "Regulatory T"]

Result: Clusters 0-3 and 5 get specified labels. Cluster 4 keeps original name (placeholder "-").

Example 3: ScType for Immune Tissue

[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Immune system"
sctype_db = "/data/databases/ScTypeDB_full.xlsx"
merge = true  # Merge clusters with same annotation

Result: Uses full ScType database for immune tissue. Merges clusters with identical annotations.

Example 4: hitype with Built-in PBMC Database

[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "hitype"
hitype_tissue = "Blood"
hitype_db = "hitypedb_pbmc3k"  # Built-in PBMC database
merge = true

Result: Fast PBMC annotation using built-in database optimized for 10X PBMC data.

Example 5: scCATCH for Cancer Tissue

[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "sccatch"

[CellTypeAnnotation.envs.sccatch_args]
species = "Human"
tissue = "Lung"
cancer = "Lung adenocarcinoma"

Result: Annotates lung adenocarcinoma dataset with cancer-specific cell types.

Example 6: CellTypist with Majority Voting

[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "celltypist"

[CellTypeAnnotation.envs.celltypist_args]
model = "/data/models/Immune_All_Low.pkl"
majority_voting = true
over_clustering = "seurat_clusters"  # Use clusters for majority voting
python = "/usr/bin/python3"  # Specify Python interpreter

Result: Uses ML model with majority voting refinement for robust annotation.

Example 7: Multiple Annotation Methods (Keep Original)

[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Immune system"
newcol = "celltype_sctype"  # Create new column, keep original

Result: Annotated cell types saved in

celltype_sctype
column. Original
seurat_clusters
unchanged.

Example 8: Multiple Annotation Columns

[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["CD4+ T", "CD8+ T", "NK", "B", "Monocyte"]

more_cell_types = {
    "celltype_broad": ["T cells", "T cells", "NK cells", "B cells", "Monocytes"],
    "celltype_subset": ["CD4+ naive", "CD8+ effector", "NK", "B naive", "CD14+ Mono"]
}

Result: Creates three metadata columns:

celltype
(from
cell_types
),
celltype_broad
,
celltype_subset
.

Example 9: Exclude Clusters with NA

[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["CD4+ T", "CD8+ T", "NA", "B cells"]

Result: Cluster 2 is removed from downstream analysis (NA excludes cluster). Note: Only works without

newcol
.

Example 10: H5AD Input with CellTypist

[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["seurat_clustering.h5ad"]  # H5AD file

[CellTypeAnnotation.envs]
tool = "celltypist"
ident = "clusters"  # Required for H5AD: cluster column name

[CellTypeAnnotation.envs.celltypist_args]
model = "Immune_All_Low.pkl"
majority_voting = true

Result: Annotates H5AD file.

ident
specifies which metadata column contains clusters.

Common Patterns

Pattern 1: Standard T Cell Annotation Workflow

# Step 1: Cluster T cells
[SeuratClusteringOfAllCells]
[TOrBCellSelection]
[SeuratClustering]  # Clustering on T cells only

# Step 2: Annotate T cell subsets
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["Naive CD4+", "Memory CD4+", "Effector CD8+", "Tregs", "Progenitor"]

Pattern 2: Automated Immune Annotation with Backup

# Use hitype for annotation, keep original clusters
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "hitype"
hitype_tissue = "Blood"
hitype_db = "hitypedb_pbmc3k"
newcol = "celltype_hitype"  # Keep original seurat_clusters
merge = true

Pattern 3: Combine Multiple Annotation Methods

# First annotation: ScType
[CellTypeAnnotation]
[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Immune system"
newcol = "celltype_sctype"

# Second annotation: CellTypist for comparison
[CellTypeAnnotation2]
# Note: Must define separate process for second annotation
# See immunopipe-config.md for multi-process setup

Pattern 4: Refine Annotation with CellTypist

[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "celltypist"

[CellTypeAnnotation.envs.celltypist_args]
model = "Immune_All_Low.pkl"
majority_voting = true
over_clustering = "seurat_clusters"  # Use clustering result
python = "python"

Pattern 5: Tissue-Specific ScType Annotation

[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]

[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Brain"  # Brain-specific annotation
sctype_db = "/data/brain_markers.xlsx"  # Custom brain marker database
merge = true

Dependencies

Upstream Processes

  • Required:
    SeuratClustering
    (or process that produces Seurat object with clusters)
  • Optional:
    SeuratClusteringOfAllCells
    (if using T/B cell selection)
  • Optional:
    SeuratMap2Ref
    (can combine multiple annotation methods)
  • Optional:
    TOrBCellSelection
    (T/B-specific annotation)

Downstream Processes

  • SeuratClusterStats: Uses annotated cell types for visualization
  • ClusterMarkers: Finds markers for each cell type
  • TopExpressingGenes: Top genes per cell type
  • MarkersFinder: Flexible marker finding by cell type
  • CellCellCommunication: Uses cell types for ligand-receptor analysis
  • ScFGSEA: GSEA by cell type
  • PseudoBulkDEG: DE analysis by cell type
  • ScrnaMetabolicLandscape: Metabolic analysis by cell type
  • ScRepCombiningExpression: Integrates with TCR/BCR data

External Dependencies

  • ScType: Requires
    sctype
    R package
  • hitype: Requires
    hitype
    Python package
  • scCATCH: Requires
    scCATCH
    R package
  • CellTypist: Requires
    celltypist2
    Python package and Python interpreter

Validation Rules

Tool-Specific Validation

  1. ScType:

    • sctype_tissue
      must be specified (or empty string to use all tissues)
    • sctype_db
      must be a valid Excel file path (or empty for default)
    • Database must contain
      tissueType
      ,
      cellType
      , and
      gene_short
      columns
  2. hitype:

    • hitype_tissue
      must be specified (or empty string to use all tissues)
    • hitype_db
      must be valid file path or built-in name
    • Built-in names:
      hitypedb_short
      ,
      hitypedb_full
      ,
      hitypedb_pbmc3k
  3. scCATCH:

    • species
      must be "Human" or "Mouse"
    • tissue
      must be specified
    • At least 2 clusters required (scCATCH limitation)
  4. CellTypist:

    • model
      must be a valid .pkl file path
    • python
      must be valid Python interpreter path
    • CellTypist must be installed in specified Python environment
  5. Direct:

    • cell_types
      list length should match number of clusters (shorter OK, longer not)
    • Placeholders "-" or "" keep original names
    • "NA" removes cluster (only without
      newcol
      )

Input Validation

  • Seurat object must have valid identity/clustering column
  • H5AD input requires
    ident
    parameter (cluster column name)
  • Output directory must be writable

Output Validation

  • cluster2celltype.tsv
    generated for ScType/hitype/scCATCH/CellTypist
  • Output file format matches
    outtype
    specification
  • Metadata contains annotated cell types

Troubleshooting

Common Issues and Solutions

Issue: "No tissues found in database" (ScType/hitype)

Cause:

sctype_tissue
or
hitype_tissue
doesn't match tissueType column in database.

Solutions:

  1. Check available tissues: Open database Excel file, read
    tissueType
    column
  2. Use exact match (case-sensitive)
  3. Set tissue to empty string
    ""
    to use all rows in database
  4. Verify database file path is correct

Issue: "Not enough clusters for scCATCH"

Cause: scCATCH requires at least 2 clusters.

Solutions:

  1. Ensure clustering result has ≥2 clusters
  2. Increase clustering resolution in
    SeuratClustering
  3. Use alternative tool (ScType, hitype, CellTypist)

Issue: CellTypist Python not found

Cause: CellTypist requires Python environment with celltypist2 installed.

Solutions:

  1. Specify correct Python path:
    celltypist_args.python = "/usr/bin/python3"
  2. Install celltypist2:
    pip install celltypist2
  3. Verify Python environment:
    python -c "import celltypist; print(celltypist.__version__)"

Issue: CellTypist model file not found

Cause: Model path is incorrect or model not downloaded.

Solutions:

  1. Download model from: https://celltypist.cog.sanger.ac.uk/models/models.json
  2. Use absolute path for
    celltypist_args.model
  3. Verify model file exists and is readable

Issue: "Unknown tool" error

Cause: Invalid

tool
value specified.

Solutions:

  1. Check valid options:
    direct
    ,
    sctype
    ,
    hitype
    ,
    sccatch
    ,
    celltypist
  2. Verify spelling is correct (case-sensitive)
  3. Check tool is installed in environment

Issue: Annotations overwritten by multiple annotation processes

Cause: Multiple annotation processes write to same metadata column.

Solutions:

  1. Use
    newcol
    parameter to create separate columns:
    [CellTypeAnnotation.envs]
    newcol = "celltype_method1"
    
  2. Or use
    backup_col
    to preserve original:
    backup_col = "original_clusters_id"
    

Issue: Ambiguous cell type assignments

Cause: Clusters have similar marker expression patterns.

Solutions:

  1. Increase clustering resolution for finer separation
  2. Use
    merge = false
    to keep cluster-specific labels
  3. Compare multiple annotation methods for consensus
  4. Manual inspection of top marker genes

Issue: Missing cell types in results

Cause: Clusters removed by "NA" placeholder or filtering.

Solutions:

  1. Check
    cell_types
    list for "NA" entries
  2. Verify
    newcol
    is not set (NA removal only works without newcol)
  3. Check downstream processes for filtering

Issue: H5AD input annotation fails

Cause:

ident
parameter not specified for H5AD files.

Solutions:

  1. Specify cluster column:
    ident = "clusters"
    (or your cluster column name)
  2. Check H5AD metadata for cluster column name
  3. Or convert H5AD to RDS format first

Issue: Wrong number of cell types assigned

Cause:

cell_types
list length doesn't match cluster count.

Solutions:

  1. Check number of clusters in Seurat object
  2. Ensure
    cell_types
    list has correct number of entries
  3. Use placeholders "-" or "" for clusters to keep original names
  4. Shorter lists OK (extra clusters keep original names)

Verification Steps

After annotation, verify:

  1. Check output file:

    # View cluster to cell type mapping
    cat .pipen/Immunopipe/CellTypeAnnotation/0/output/cluster2celltype.tsv
    
  2. Check Seurat object metadata:

    library(Seurat)
    obj <- readRDS(".pipen/Immunopipe/CellTypeAnnotation/0/output/annotated.rds")
    head(obj@meta.data)
    # Look for cell type column (seurat_clusters or newcol name)
    
  3. Validate annotation quality:

    # Check distribution of cell types
    table(Idents(obj))
    
    # Visualize UMAP with cell types
    DimPlot(obj, group.by = "celltype_hitype", label = TRUE, repel = TRUE)
    
  4. Compare multiple methods:

    # Compare ScType vs hitype annotations
    table(obj$celltype_sctype, obj$celltype_hitype)
    

Best Practices

Method Selection

  1. Start with hitype: Fast, good for PBMC/immune datasets
  2. Compare with ScType: Alternative database-based method
  3. Use CellTypist for complex datasets: ML-based, handles well
  4. Manual refinement: Use direct annotation for corrections

Multi-Method Workflow

  1. Run multiple annotation methods in parallel
  2. Compare results for consensus
  3. Manually refine discrepancies using direct annotation
  4. Keep original cluster names for traceability

Tissue-Specific Annotation

  1. Always specify tissue when using ScType/hitype
  2. Use custom databases for non-standard tissues
  3. Verify database contains relevant cell types

Reproducibility

  1. Save cluster-to-celltype mapping (
    cluster2celltype.tsv
    )
  2. Document which tool/database was used
  3. Keep original cluster names using
    newcol
    or
    backup_col

External References

Tool Documentation

Database Downloads

Related Processes

  • SeuratClustering
    : Clustering before annotation
  • SeuratMap2Ref
    : Reference-based annotation (alternative)
  • ClusterMarkers
    : Find markers for each cell type
  • SeuratClusterStats
    : Visualize annotated clusters