Claude-skill-registry cdr3clustering

Cluster TCR/BCR clones by CDR3 sequences using GIANA or ClusTCR (both Faiss-based). Adds `CDR3_Cluster` column to metadata for clonotype analysis.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/cdr3clustering" ~/.claude/skills/majiayu000-claude-skill-registry-cdr3clustering && rm -rf "$T"

manifest: skills/data/cdr3clustering/SKILL.md

CDR3Clustering Process Configuration

Purpose

Cluster TCR/BCR clones by CDR3 sequences using GIANA or ClusTCR (both Faiss-based). Adds

CDR3_Cluster

column to metadata for clonotype analysis.

When to Use

To identify groups of similar TCR/BCR clonotypes
For analyzing TCR sequence convergence
After ScRepCombiningExpression when TCR/BCR integrated with RNA
For investigating public clonotypes across samples
Before TESSA analysis for epitope specificity

Important: Only runs when VDJ input present (TCRData/BCRData columns in SampleInfo).

Configuration Structure

Process Enablement

[CDR3Clustering]
cache = true

Input Specification

[CDR3Clustering.in]
screpfile = "path/to/combined_object.qs"

Environment Variables

[CDR3Clustering.envs]
type = "auto"      # TCR, BCR, or auto
tool = "GIANA"     # GIANA or ClusTCR
python = "python"   # Path to python
within_sample = true  # Cluster per sample
args = {}          # Tool-specific arguments
chain = "both"     # TRA, TRB, IGH, IGL, IGK, both, heavy, light

GIANA Arguments (via

args

)

[CDR3Clustering.envs.args]
method = "hierarchical"    # hierarchical, kmeans
dist = "hamming"          # hamming, levenshtein
threshold = 0.15           # Distance threshold

ClusTCR Arguments (via

args

)

[CDR3Clustering.envs.args]
method = "two-step"       # mcl, faiss, two-step
n_cpus = 4                # CPUs for MCL
faiss_cluster_size = 5000  # Supercluster size
mcl_params = [1.2, 2]    # [inflation, expansion]

Configuration Examples

Minimal Configuration

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

GIANA with Custom Distance Threshold

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "GIANA"

[CDR3Clustering.envs.args]
method = "hierarchical"
dist = "hamming"
threshold = 0.15

ClusTCR Two-Step (Large Datasets)

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "two-step"
faiss_cluster_size = 5000
n_cpus = 8

ClusTCR MCL (Small Datasets)

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "mcl"
n_cpus = 4

TRB Chain Only

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
chain = "TRB"

Cross-Sample Clustering

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
within_sample = false

Common Patterns

Pattern 1: Standard TCR Beta Chain

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
type = "TCR"
tool = "GIANA"
chain = "TRB"

Pattern 2: Large Dataset (>100K sequences)

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "two-step"
faiss_cluster_size = 5000
n_cpus = 8

Pattern 3: Custom Threshold

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "GIANA"

[CDR3Clustering.envs.args]
threshold = 0.15  # Higher=fewer clusters, Lower=more clusters

Dependencies

Upstream

ScRepCombiningExpression (required): Combined scRepertoire object with TCR/BCR data

Downstream

TESSA: TCR epitope specificity prediction
ClonalStats: Clonality statistics (uses
```
CDR3_Cluster
```
metadata)

Validation Rules

Tool must be
```
"GIANA"
```
or
```
"ClusTCR"
```
Chain must be valid for data type (TCR: TRA/TRB, BCR: IGH/IGL/IGK)
GIANA requires: biopython, faiss, scikit-learn
ClusTCR requires: clustcr package

Computational Considerations

<50K sequences: ClusTCR
```
method = "mcl"
```
(highest quality)
50K-500K sequences: ClusTCR
```
method = "two-step"
```
(balanced)
500K sequences: GIANA or ClusTCR
```
method = "two-step"
```
(fastest)
Memory: GIANA ~2-4 GB/100K, ClusTCR ~4-8 GB/100K
Runtime: GIANA 1-5 min/100K, ClusTCR two-step 2-10 min/100K

Troubleshooting

Process not running

Cause: No VDJ data available Solution: Verify ScRepCombiningExpression output contains TCR/BCR data

ModuleNotFoundError

Cause: Missing dependencies Solution:

GIANA:

pip install biopython faiss-cpu scikit-learn

ClusTCR:
```
conda install -c conda-forge clustcr
```

Too many/few clusters

Cause: Threshold inappropriate Solution: Adjust threshold (higher = fewer clusters, lower = more clusters)

Out of memory

Cause: Dataset too large for RAM Solution: Use

within_sample = true

, reduce

n_cpus

, or use GIANA

Slow clustering

Cause: Suboptimal method for dataset size Solution:

50K: ClusTCR
```
method = "two-step"
```
with increased n_cpus
Very large (>500K): Use GIANA

Notes on Output Format

Metadata column:

CDR3_Cluster

Cluster naming:

```
S_1
```
,
```
S_2
```
: Single unique CDR3 sequence (may have multiple cells)
```
M_1
```
,
```
M_2
```
: Multiple unique CDR3 sequences (similar but different)

Interpretation:

```
S_
```
prefix: Cells share identical CDR3 sequence
```
M_
```
prefix: Cells have similar but different CDR3 sequences
Use
```
CDR3_Cluster
```
as grouping factor in Seurat plots

Performance Tips:

Small (<10K): GIANA defaults (quality over speed)
Medium (10K-100K): ClusTCR two-step with n_cpus=4
Large (100K-1M): ClusTCR two-step with n_cpus=8+ or GIANA
Very large (>1M): GIANA with increased faiss_cluster_size

Claude-skill-registry cdr3clustering

CDR3Clustering Process Configuration

Purpose

When to Use

Configuration Structure

Process Enablement

Input Specification

Environment Variables

GIANA Arguments (via
`args`
)

ClusTCR Arguments (via
`args`
)

Configuration Examples

Minimal Configuration

GIANA with Custom Distance Threshold

ClusTCR Two-Step (Large Datasets)

ClusTCR MCL (Small Datasets)

TRB Chain Only

Cross-Sample Clustering

Common Patterns

Pattern 1: Standard TCR Beta Chain

Pattern 2: Large Dataset (>100K sequences)

Pattern 3: Custom Threshold

Dependencies

Upstream

Downstream

Validation Rules

Computational Considerations

Troubleshooting

Process not running

ModuleNotFoundError

Too many/few clusters

Out of memory

Slow clustering

Notes on Output Format

Claude-skill-registry cdr3clustering

CDR3Clustering Process Configuration

Purpose

When to Use

Configuration Structure

Process Enablement

Input Specification

Environment Variables

GIANA Arguments (via args)

ClusTCR Arguments (via args)

Configuration Examples

Minimal Configuration

GIANA with Custom Distance Threshold

ClusTCR Two-Step (Large Datasets)

ClusTCR MCL (Small Datasets)

TRB Chain Only

Cross-Sample Clustering

Common Patterns

Pattern 1: Standard TCR Beta Chain

Pattern 2: Large Dataset (>100K sequences)

Pattern 3: Custom Threshold

Dependencies

Upstream

Downstream

Validation Rules

Computational Considerations

Troubleshooting

Process not running

ModuleNotFoundError

Too many/few clusters

Out of memory

Slow clustering

Notes on Output Format

GIANA Arguments (via
`args`
)

ClusTCR Arguments (via
`args`
)