Claude-skill-registry cdr3clustering

Cluster TCR/BCR clones by CDR3 sequences using GIANA or ClusTCR (both Faiss-based). Adds `CDR3_Cluster` column to metadata for clonotype analysis.

install
source · Clone the upstream repo
git clone https://github.com/majiayu000/claude-skill-registry
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/cdr3clustering" ~/.claude/skills/majiayu000-claude-skill-registry-cdr3clustering && rm -rf "$T"
manifest: skills/data/cdr3clustering/SKILL.md

CDR3Clustering Process Configuration

Purpose

Cluster TCR/BCR clones by CDR3 sequences using GIANA or ClusTCR (both Faiss-based). Adds `CDR3_Cluster` column to metadata for clonotype analysis.

When to Use

  • To identify groups of similar TCR/BCR clonotypes
  • For analyzing TCR sequence convergence
  • After ScRepCombiningExpression, once TCR/BCR data has been integrated with RNA
  • For investigating public clonotypes across samples
  • Before TESSA analysis for epitope specificity

Important: This process only runs when VDJ input is present (TCRData/BCRData columns in SampleInfo).
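The condition above can be checked before launching a run. The following is a minimal sketch, assuming the SampleInfo file is tab-delimited with a header row; the helper name `has_vdj_input` and the sample sheets are hypothetical and not part of the pipeline.

```python
import csv
import io

# Column names taken from the note above.
VDJ_COLUMNS = {"TCRData", "BCRData"}


def has_vdj_input(sample_info_text: str, delimiter: str = "\t") -> bool:
    """Return True if the header contains a TCRData or BCRData column."""
    reader = csv.DictReader(io.StringIO(sample_info_text), delimiter=delimiter)
    header = set(reader.fieldnames or [])
    return bool(VDJ_COLUMNS & header)


# Hypothetical sample sheets, used only for illustration.
with_vdj = "Sample\tRNAData\tTCRData\nS1\trna/S1\ttcr/S1\n"
without_vdj = "Sample\tRNAData\nS1\trna/S1\n"

print(has_vdj_input(with_vdj))
print(has_vdj_input(without_vdj))
```

If the check returns `False`, CDR3Clustering would be expected to be skipped regardless of the configuration below.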

Configuration Structure

Process Enablement

[CDR3Clustering]
cache = true

Input Specification

[CDR3Clustering.in]
screpfile = "path/to/combined_object.qs"

Environment Variables

[CDR3Clustering.envs]
type = "auto"      # TCR, BCR, or auto
tool = "GIANA"     # GIANA or ClusTCR
python = "python"   # Path to python
within_sample = true  # Cluster per sample
args = {}          # Tool-specific arguments
chain = "both"     # TRA, TRB, IGH, IGL, IGK, both, heavy, light

GIANA Arguments (via `args`)

[CDR3Clustering.envs.args]
method = "hierarchical"    # hierarchical, kmeans
dist = "hamming"          # hamming, levenshtein
threshold = 0.15           # Distance threshold

ClusTCR Arguments (via `args`)

[CDR3Clustering.envs.args]
method = "two-step"       # mcl, faiss, two-step
n_cpus = 4                # CPUs for MCL
faiss_cluster_size = 5000  # Supercluster size
mcl_params = [1.2, 2]    # [inflation, expansion]

Configuration Examples

Minimal Configuration

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

GIANA with Custom Distance Threshold

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "GIANA"

[CDR3Clustering.envs.args]
method = "hierarchical"
dist = "hamming"
threshold = 0.15

ClusTCR Two-Step (Large Datasets)

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "two-step"
faiss_cluster_size = 5000
n_cpus = 8

ClusTCR MCL (Small Datasets)

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "mcl"
n_cpus = 4

TRB Chain Only

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
chain = "TRB"

Cross-Sample Clustering

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
within_sample = false
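One way to picture the `within_sample` switch is as a choice of how sequences are batched before clustering. The sketch below assumes that `true` clusters each sample's sequences separately and `false` pools all samples into one batch; the function `clustering_batches` and the records are hypothetical, not the process's actual code.

```python
from collections import defaultdict


def clustering_batches(records, within_sample=True):
    """Group (sample, cdr3) records into the batches that would be clustered.

    within_sample=True  -> one batch per sample
    within_sample=False -> a single pooled batch across all samples
    """
    if not within_sample:
        return {"all_samples": [cdr3 for _, cdr3 in records]}
    batches = defaultdict(list)
    for sample, cdr3 in records:
        batches[sample].append(cdr3)
    return dict(batches)


# Hypothetical records for illustration.
records = [
    ("S1", "CASSLGQAYEQYF"),
    ("S1", "CASSLGQGYEQYF"),
    ("S2", "CASSLGQAYEQYF"),
]

print(clustering_batches(records, within_sample=True))
print(clustering_batches(records, within_sample=False))
```

Under this reading, only the pooled batch lets the identical sequence found in S1 and S2 land in the same cluster, which is why cross-sample clustering suits public-clonotype questions.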

Common Patterns

Pattern 1: Standard TCR Beta Chain

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
type = "TCR"
tool = "GIANA"
chain = "TRB"

Pattern 2: Large Dataset (>100K sequences)

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "two-step"
faiss_cluster_size = 5000
n_cpus = 8

Pattern 3: Custom Threshold

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "GIANA"

[CDR3Clustering.envs.args]
threshold = 0.15  # Higher = fewer clusters, lower = more clusters

Dependencies

Upstream

  • ScRepCombiningExpression (required): Combined scRepertoire object with TCR/BCR data

Downstream

  • TESSA: TCR epitope specificity prediction
  • ClonalStats: Clonality statistics (uses `CDR3_Cluster` metadata)

Validation Rules

  1. Tool must be `"GIANA"` or `"ClusTCR"`
  2. Chain must be valid for data type (TCR: TRA/TRB, BCR: IGH/IGL/IGK)
  3. GIANA requires: biopython, faiss, scikit-learn
  4. ClusTCR requires: clustcr package
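The rules above can be sketched as a small pre-flight check. This is an illustration, not the pipeline's own validation: the function names are hypothetical, the defaults mirror the values shown under Environment Variables, and assigning `both` to either data type and `heavy`/`light` to BCR is an assumption. The import names `Bio`, `faiss`, `sklearn`, and `clustcr` are the modules provided by the listed packages.

```python
import importlib.util

VALID_TOOLS = {"GIANA", "ClusTCR"}

# Assumption: "both" applies to either type; "heavy"/"light" apply to BCR only.
CHAINS_BY_TYPE = {
    "TCR": {"TRA", "TRB", "both"},
    "BCR": {"IGH", "IGL", "IGK", "both", "heavy", "light"},
}

# Import names for the packages listed in rules 3 and 4.
REQUIRED_MODULES = {
    "GIANA": ["Bio", "faiss", "sklearn"],
    "ClusTCR": ["clustcr"],
}


def validate_envs(envs: dict) -> list:
    """Return a list of human-readable problems (empty if none are found)."""
    errors = []
    tool = envs.get("tool", "GIANA")
    data_type = envs.get("type", "auto")
    chain = envs.get("chain", "both")

    if tool not in VALID_TOOLS:
        errors.append(f"tool must be GIANA or ClusTCR, got {tool!r}")

    if data_type == "auto":
        allowed = set().union(*CHAINS_BY_TYPE.values())
    elif data_type in CHAINS_BY_TYPE:
        allowed = CHAINS_BY_TYPE[data_type]
    else:
        errors.append(f"type must be TCR, BCR, or auto, got {data_type!r}")
        allowed = set()

    if allowed and chain not in allowed:
        errors.append(f"chain {chain!r} is not valid for type {data_type!r}")
    return errors


def missing_modules(tool: str) -> list:
    """Return the required modules that cannot be found in this environment."""
    return [m for m in REQUIRED_MODULES.get(tool, [])
            if importlib.util.find_spec(m) is None]


print(validate_envs({"tool": "GIANA", "type": "TCR", "chain": "TRB"}))
print(validate_envs({"tool": "giana", "type": "TCR", "chain": "IGH"}))
print(missing_modules("GIANA"))
```

Running `missing_modules` with the interpreter configured as `python` in the environment variables gives an early warning of the `ModuleNotFoundError` case covered under Troubleshooting.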

Computational Considerations

  • <50K sequences: ClusTCR `method = "mcl"` (highest quality)
  • 50K-500K sequences: ClusTCR `method = "two-step"` (balanced)
  • >500K sequences: GIANA or ClusTCR `method = "two-step"` (fastest)

  • Memory: GIANA ~2-4 GB/100K, ClusTCR ~4-8 GB/100K
  • Runtime: GIANA 1-5 min/100K, ClusTCR two-step 2-10 min/100K
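The size bands and memory figures above can be turned into a rough planning helper. This sketch assumes memory scales linearly with sequence count, which the figures imply but do not guarantee; the function names are hypothetical, and for the largest band it simply picks GIANA, one of the two options listed.

```python
# Approximate memory ranges in GB per 100K sequences, from the list above.
MEMORY_GB_PER_100K = {"GIANA": (2, 4), "ClusTCR": (4, 8)}


def suggest_method(n_sequences: int) -> dict:
    """Map a sequence count to the tool/method bands listed above."""
    if n_sequences < 50_000:
        return {"tool": "ClusTCR", "args": {"method": "mcl"}}
    if n_sequences <= 500_000:
        return {"tool": "ClusTCR", "args": {"method": "two-step"}}
    # Above 500K either GIANA or ClusTCR two-step is listed; GIANA is chosen here.
    return {"tool": "GIANA", "args": {}}


def estimate_memory_gb(tool: str, n_sequences: int) -> tuple:
    """Return a (low, high) memory estimate in GB, assuming linear scaling."""
    low, high = MEMORY_GB_PER_100K[tool]
    scale = n_sequences / 100_000
    return (low * scale, high * scale)


for n in (20_000, 250_000, 1_000_000):
    choice = suggest_method(n)
    print(n, choice, estimate_memory_gb(choice["tool"], n))
```

The estimate is only a starting point for sizing a job; actual usage depends on sequence length distribution and tool arguments.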

Troubleshooting

Process not running

Cause: No VDJ data available.

Solution: Verify that the ScRepCombiningExpression output contains TCR/BCR data.

ModuleNotFoundError

Cause: Missing dependencies.

Solution:

  • GIANA: `pip install biopython faiss-cpu scikit-learn`
  • ClusTCR: `conda install -c conda-forge clustcr`

Too many/few clusters

Cause: Inappropriate threshold.

Solution: Adjust `threshold` (higher = fewer clusters, lower = more clusters).
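To see why a higher distance threshold yields fewer clusters, consider a toy example. The sketch below is not GIANA's or ClusTCR's algorithm: it links equal-length sequences whose normalised Hamming distance is at or below the threshold and counts the connected groups. The sequences and function names are made up for illustration.

```python
from itertools import combinations


def normalised_hamming(a: str, b: str) -> float:
    """Fraction of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def count_clusters(seqs, threshold: float) -> int:
    """Count single-linkage groups using a simple union-find."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        same_length = len(seqs[i]) == len(seqs[j])
        if same_length and normalised_hamming(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)

    return len({find(i) for i in range(len(seqs))})


# Hypothetical CDR3 amino-acid sequences.
cdr3s = [
    "CASSLGQAYEQYF",
    "CASSLGQGYEQYF",
    "CASSLGGAYEQYF",
    "CASSPGTGYEQYF",
    "CASRDRGNTIYF",
]

for t in (0.0, 0.1, 0.2):
    print(t, count_clusters(cdr3s, t))
```

With these five sequences the cluster count drops from 5 to 3 to 2 as the threshold rises from 0.0 to 0.1 to 0.2, matching the direction described above.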

Out of memory

Cause: Dataset too large for RAM.

Solution: Use `within_sample = true`, reduce `n_cpus`, or use GIANA.

Slow clustering

Cause: Suboptimal method for dataset size.

Solution:

  • >50K: ClusTCR `method = "two-step"` with increased `n_cpus`
  • Very large (>500K): Use GIANA

Notes on Output Format

Metadata column: `CDR3_Cluster`

Cluster naming:

  • `S_1`, `S_2`: Single unique CDR3 sequence (may have multiple cells)
  • `M_1`, `M_2`: Multiple unique CDR3 sequences (similar but different)

Interpretation:

  • `S_` prefix: Cells share an identical CDR3 sequence
  • `M_` prefix: Cells have similar but different CDR3 sequences
  • Use `CDR3_Cluster` as a grouping factor in Seurat plots
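The prefix convention above lends itself to a quick summary of an exported metadata column. This is a minimal sketch: the label list is hypothetical, and treating non-string values as cells without a cluster assignment is an assumption.

```python
from collections import Counter


def cluster_kind(label) -> str:
    """Classify a CDR3_Cluster label by its prefix."""
    if not isinstance(label, str):
        return "unassigned"  # assumption: e.g. cells without VDJ data
    if label.startswith("S_"):
        return "single"      # one unique CDR3 sequence
    if label.startswith("M_"):
        return "multiple"    # several similar CDR3 sequences
    return "unknown"


# Hypothetical per-cell CDR3_Cluster values.
labels = ["S_1", "S_1", "M_1", "M_1", "M_1", "S_2", "M_2", None]

cells_per_kind = Counter(cluster_kind(label) for label in labels)
clusters_per_kind = Counter(cluster_kind(label) for label in set(labels))

print(dict(cells_per_kind))
print(dict(clusters_per_kind))
```

Counting cells and distinct clusters separately shows whether `M_` clusters are numerous but small, or few but large.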

Performance Tips:

  • Small (<10K): GIANA defaults (quality over speed)
  • Medium (10K-100K): ClusTCR two-step with `n_cpus = 4`
  • Large (100K-1M): ClusTCR two-step with `n_cpus = 8` or more, or GIANA
  • Very large (>1M): GIANA, or ClusTCR two-step with increased `faiss_cluster_size`