Claude-skill-registry cdr3clustering
Cluster TCR/BCR clones by CDR3 sequences using GIANA or ClusTCR (both Faiss-based). Adds `CDR3_Cluster` column to metadata for clonotype analysis.
git clone https://github.com/majiayu000/claude-skill-registry
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/cdr3clustering" ~/.claude/skills/majiayu000-claude-skill-registry-cdr3clustering && rm -rf "$T"
skills/data/cdr3clustering/SKILL.md
CDR3Clustering Process Configuration
Purpose
Cluster TCR/BCR clones by CDR3 sequences using GIANA or ClusTCR (both Faiss-based). Adds a `CDR3_Cluster` column to the metadata for clonotype analysis.
When to Use
- To identify groups of similar TCR/BCR clonotypes
- For analyzing TCR sequence convergence
- After ScRepCombiningExpression, when TCR/BCR data has been integrated with RNA
- For investigating public clonotypes across samples
- Before TESSA analysis for epitope specificity
Important: Runs only when VDJ input is present (TCRData/BCRData columns in SampleInfo).
Configuration Structure
Process Enablement
[CDR3Clustering]
cache = true
Input Specification
[CDR3Clustering.in]
screpfile = "path/to/combined_object.qs"
Environment Variables
[CDR3Clustering.envs]
type = "auto"          # TCR, BCR, or auto
tool = "GIANA"         # GIANA or ClusTCR
python = "python"      # Path to python
within_sample = true   # Cluster per sample
args = {}              # Tool-specific arguments
chain = "both"         # TRA, TRB, IGH, IGL, IGK, both, heavy, light
GIANA Arguments (via `args`)
[CDR3Clustering.envs.args]
method = "hierarchical"  # hierarchical, kmeans
dist = "hamming"         # hamming, levenshtein
threshold = 0.15         # Distance threshold
ClusTCR Arguments (via `args`)
[CDR3Clustering.envs.args]
method = "two-step"        # mcl, faiss, two-step
n_cpus = 4                 # CPUs for MCL
faiss_cluster_size = 5000  # Supercluster size
mcl_params = [1.2, 2]      # [inflation, expansion]
Configuration Examples
Minimal Configuration
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"
GIANA with Custom Distance Threshold
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "GIANA"

[CDR3Clustering.envs.args]
method = "hierarchical"
dist = "hamming"
threshold = 0.15
ClusTCR Two-Step (Large Datasets)
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "two-step"
faiss_cluster_size = 5000
n_cpus = 8
ClusTCR MCL (Small Datasets)
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "mcl"
n_cpus = 4
TRB Chain Only
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
chain = "TRB"
Cross-Sample Clustering
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
within_sample = false
Common Patterns
Pattern 1: Standard TCR Beta Chain
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
type = "TCR"
tool = "GIANA"
chain = "TRB"
Pattern 2: Large Dataset (>100K sequences)
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "two-step"
faiss_cluster_size = 5000
n_cpus = 8
Pattern 3: Custom Threshold
[CDR3Clustering]

[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "GIANA"

[CDR3Clustering.envs.args]
threshold = 0.15  # Higher = fewer clusters, lower = more clusters
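The configurations above share the same skeleton and differ only in a few `envs` and `args` values. As a rough sketch, a small Python helper could render that skeleton from keyword values. The function name `render_cdr3clustering_config` is hypothetical and not part of the pipeline; it only formats text and does not check that the options are valid.

```python
def render_cdr3clustering_config(screpfile, envs=None, args=None):
    """Render a CDR3Clustering TOML fragment as text (hypothetical helper)."""

    def fmt(value):
        # bool is checked first because TOML spells booleans in lowercase
        if isinstance(value, bool):
            return "true" if value else "false"
        if isinstance(value, str):
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            return '"' + escaped + '"'
        if isinstance(value, (list, tuple)):
            return "[" + ", ".join(fmt(v) for v in value) + "]"
        return str(value)  # ints and floats

    lines = [
        "[CDR3Clustering]",
        "",
        "[CDR3Clustering.in]",
        "screpfile = " + fmt(screpfile),
    ]
    if envs:
        lines += ["", "[CDR3Clustering.envs]"]
        lines += [f"{key} = {fmt(val)}" for key, val in envs.items()]
    if args:
        lines += ["", "[CDR3Clustering.envs.args]"]
        lines += [f"{key} = {fmt(val)}" for key, val in args.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_cdr3clustering_config(
        "intermediate/screpcombiningexpression/combined.qs",
        envs={"tool": "ClusTCR", "within_sample": False},
        args={"method": "two-step", "faiss_cluster_size": 5000, "n_cpus": 8},
    ))
```

Generating the fragment this way is only a convenience when many runs vary one or two values; hand-written TOML works equally well.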
Dependencies
Upstream
- ScRepCombiningExpression (required): Combined scRepertoire object with TCR/BCR data
Downstream
- TESSA: TCR epitope specificity prediction
- ClonalStats: Clonality statistics (uses `CDR3_Cluster` metadata)
Validation Rules
- Tool must be `"GIANA"` or `"ClusTCR"`
- Chain must be valid for data type (TCR: TRA/TRB, BCR: IGH/IGL/IGK)
- GIANA requires: biopython, faiss, scikit-learn
- ClusTCR requires: clustcr package
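The tool and chain rules above can be checked before launching a run. The following is a minimal sketch, not the pipeline's own validation: the function name `validate_envs` is hypothetical, and the handling of `both`, `heavy`, and `light` (accepted for either type, BCR only, BCR only) is an assumption, since the rules above only name the individual chains.

```python
VALID_TOOLS = {"GIANA", "ClusTCR"}

# Chains per data type as listed in the validation rules.
# Assumption: "both" is accepted for either type; "heavy"/"light" only for BCR.
CHAINS_BY_TYPE = {
    "TCR": {"TRA", "TRB", "both"},
    "BCR": {"IGH", "IGL", "IGK", "both", "heavy", "light"},
}


def validate_envs(envs):
    """Return a list of problems found in an envs dict; empty means OK."""
    problems = []

    tool = envs.get("tool", "GIANA")
    if tool not in VALID_TOOLS:
        problems.append(f"tool must be 'GIANA' or 'ClusTCR', got {tool!r}")

    data_type = envs.get("type", "auto")
    chain = envs.get("chain", "both")
    if data_type == "auto":
        # The type is detected later, so only reject chains valid for neither.
        allowed = set().union(*CHAINS_BY_TYPE.values())
    elif data_type in CHAINS_BY_TYPE:
        allowed = CHAINS_BY_TYPE[data_type]
    else:
        problems.append(f"type must be 'TCR', 'BCR' or 'auto', got {data_type!r}")
        allowed = None

    if allowed is not None and chain not in allowed:
        problems.append(f"chain {chain!r} is not valid for type {data_type!r}")
    return problems


if __name__ == "__main__":
    print(validate_envs({"type": "TCR", "tool": "GIANA", "chain": "IGH"}))
```

Returning a list of messages rather than raising on the first error makes it easier to report every problem in a config at once.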
Computational Considerations
- <50K sequences: ClusTCR `method = "mcl"` (highest quality)
- 50K-500K sequences: ClusTCR `method = "two-step"` (balanced)
- >500K sequences: GIANA or ClusTCR `method = "two-step"` (fastest)
- Memory: GIANA ~2-4 GB/100K, ClusTCR ~4-8 GB/100K
- Runtime: GIANA 1-5 min/100K, ClusTCR two-step 2-10 min/100K
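The size thresholds above amount to a simple lookup. The sketch below encodes them in a hypothetical `suggest_clustering` function. For the largest tier the guidance allows either GIANA or ClusTCR two-step; picking GIANA here is an arbitrary choice made for illustration.

```python
def suggest_clustering(n_sequences):
    """Map a sequence count to envs values, following the size guidance."""
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    if n_sequences < 50_000:
        # Small: MCL is listed as the highest-quality option
        return {"tool": "ClusTCR", "args": {"method": "mcl"}}
    if n_sequences <= 500_000:
        # Medium: two-step is listed as the balanced option
        return {"tool": "ClusTCR", "args": {"method": "two-step"}}
    # Large: GIANA or ClusTCR two-step are both listed; GIANA chosen arbitrarily
    return {"tool": "GIANA", "args": {}}


if __name__ == "__main__":
    for n in (10_000, 200_000, 2_000_000):
        print(n, suggest_clustering(n))
```

The returned dictionary mirrors the `[CDR3Clustering.envs]` and `[CDR3Clustering.envs.args]` tables, so its values can be copied into the configuration directly.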
Troubleshooting
Process not running
Cause: No VDJ data available
Solution: Verify that the ScRepCombiningExpression output contains TCR/BCR data
ModuleNotFoundError
Cause: Missing dependencies
Solution:
- GIANA: `pip install biopython faiss-cpu scikit-learn`
- ClusTCR: `conda install -c conda-forge clustcr`
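To find out which dependency is missing before a run fails, the interpreter can be asked whether each module is importable. This is a minimal sketch with a hypothetical `missing_modules` helper. It uses the import names `Bio`, `faiss`, and `sklearn` for biopython, faiss, and scikit-learn, and assumes the ClusTCR package imports as `clustcr`.

```python
import importlib.util

# Top-level import names for each tool's dependencies.
# Assumption: the ClusTCR package is importable as "clustcr".
REQUIRED_MODULES = {
    "GIANA": ["Bio", "faiss", "sklearn"],
    "ClusTCR": ["clustcr"],
}


def missing_modules(tool, required=REQUIRED_MODULES):
    """Return the modules for `tool` that cannot be found by this interpreter."""
    return [
        name
        for name in required[tool]
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    for tool in REQUIRED_MODULES:
        print(tool, "missing:", missing_modules(tool) or "none")
```

Run the check with the same interpreter that the `python` setting points to, because a module installed in a different environment will not be found.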
Too many/few clusters
Cause: Threshold inappropriate
Solution: Adjust threshold (higher = fewer clusters, lower = more clusters)
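The direction of the threshold effect can be seen on a toy example. The sketch below is not GIANA's algorithm: it links equal-length sequences whose fraction of mismatched positions is at or below the threshold and counts the connected groups. The sequences are made up for illustration.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of mismatched positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def count_clusters(seqs, threshold):
    """Count groups formed by linking sequences within `threshold` (toy model)."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        same_length = len(seqs[i]) == len(seqs[j])
        if same_length and hamming_fraction(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(seqs))})


# Made-up CDR3-like sequences, all 13 residues long
SEQS = ["CASSLGQAYEQYF", "CASSLGQGYEQYF", "CASSPGQGYEQYF", "CASSIRSSYEQYF"]

if __name__ == "__main__":
    for threshold in (0.05, 0.10, 0.35):
        print(threshold, count_clusters(SEQS, threshold))
```

On these four sequences the group count falls from 4 to 2 to 1 as the threshold rises, which matches the rule of thumb above: a looser threshold merges more sequences into fewer clusters.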
Out of memory
Cause: Dataset too large for RAM
Solution: Use `within_sample = true`, reduce `n_cpus`, or use GIANA
Slow clustering
Cause: Suboptimal method for dataset size
Solution:
- >50K: ClusTCR `method = "two-step"` with increased `n_cpus`
- Very large (>500K): Use GIANA
Notes on Output Format
Metadata column: `CDR3_Cluster`
Cluster naming:
- `S_1`, `S_2`: Single unique CDR3 sequence (may have multiple cells)
- `M_1`, `M_2`: Multiple unique CDR3 sequences (similar but different)
Interpretation:
- `S_` prefix: Cells share an identical CDR3 sequence
- `M_` prefix: Cells have similar but different CDR3 sequences
- Use `CDR3_Cluster` as a grouping factor in Seurat plots
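Once the metadata has been exported, the `S_`/`M_` prefixes make it easy to tally cells by cluster kind. The sketch below assumes the `CDR3_Cluster` values are available as a plain Python list and treats any non-string value (such as a missing entry) as unclustered. The function names are hypothetical.

```python
from collections import Counter


def cluster_kind(label):
    """Classify a CDR3_Cluster label by its prefix."""
    if not isinstance(label, str):
        return "unclustered"  # assumption: missing values arrive as None/NaN
    if label.startswith("S_"):
        return "single"       # one unique CDR3 sequence
    if label.startswith("M_"):
        return "multiple"     # several similar CDR3 sequences
    return "unknown"


def summarize_clusters(labels):
    """Count cells per cluster kind."""
    return Counter(cluster_kind(label) for label in labels)


if __name__ == "__main__":
    example = ["S_1", "S_1", "M_1", "M_1", "M_2", None]
    print(summarize_clusters(example))
```

A high share of cells in `M_` clusters suggests that many clonotypes have near neighbours, which is the situation where convergence analysis is most informative.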
Performance Tips:
- Small (<10K): GIANA defaults (quality over speed)
- Medium (10K-100K): ClusTCR two-step with n_cpus=4
- Large (100K-1M): ClusTCR two-step with n_cpus=8+ or GIANA
- Very large (>1M): GIANA with increased faiss_cluster_size