install
source · Clone the upstream repo
git clone https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills-
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/mdbabumiamssm/LLMs-Universal-Life-Science-and-Clinical-Skills- "$T" && mkdir -p ~/.claude/skills && cp -r "$T/Skills/Genomics/Metagenomics/bioSkills/strain-tracking" ~/.claude/skills/mdbabumiamssm-llms-universal-life-science-and-clinical-skills-strain-tracking && rm -rf "$T"
manifest:
Skills/Genomics/Metagenomics/bioSkills/strain-tracking/SKILL.mdsource content
<!--
# COPYRIGHT NOTICE
# This file is part of the "Universal Biomedical Skills" project.
# Copyright (c) 2026 MD BABU MIA, PhD <md.babu.mia@mssm.edu>
# All Rights Reserved.
#
# This code is proprietary and confidential.
# Unauthorized copying of this file, via any medium is strictly prohibited.
#
# Provenance: Authenticated by MD BABU MIA
-->
name: bio-metagenomics-strain-tracking description: Track bacterial strains using MASH, sourmash, fastANI, and inStrain. Compare genomes, detect contamination, and monitor strain-level variation. Use when needing sub-species resolution for outbreak tracking, transmission analysis, or within-host strain dynamics. tool_type: cli primary_tool: MASH measurable_outcome: Execute skill workflow successfully with valid output within 15 minutes. allowed-tools:
- read_file
- run_shell_command
Strain Tracking
Identify and track bacterial strains at sub-species resolution.
Tool Comparison
| Tool | Method | Best For |
|---|---|---|
| MASH | MinHash sketches | Fast distance estimation |
| sourmash | MinHash + containment | Metagenome comparisons |
| fastANI | ANI calculation | Accurate species/strain ID |
| inStrain | SNV profiling | Strain dynamics in metagenomes |
MASH
Installation
conda install -c bioconda mash
Create Sketch
# Single genome mash sketch -o genome.msh genome.fasta # Multiple genomes mash sketch -o reference_db.msh genomes/*.fasta # From reads (with coverage) mash sketch -m 2 -r -o reads.msh reads.fastq.gz
Calculate Distance
# Pairwise distance mash dist genome1.fasta genome2.fasta # Query against database mash dist reference_db.msh query.fasta > distances.tsv # Screen for containment (metagenome) mash screen reference_db.msh reads.fastq.gz > screen_results.tsv
Interpret MASH Distance
| Distance | Interpretation |
|---|---|
| < 0.05 | Same species/strain |
| 0.05-0.15 | Same species |
| 0.15-0.25 | Same genus |
| > 0.25 | Different genus |
Cluster Genomes
# All-vs-all distances mash triangle genomes/*.fasta > distances.phylip # Build tree mash triangle -E genomes/*.fasta > distances.tsv
sourmash
Installation
conda install -c bioconda sourmash
Create Signatures
# Genome signature sourmash sketch dna -p scaled=1000,k=31 genome.fasta -o genome.sig # Multiple genomes sourmash sketch dna -p scaled=1000,k=31 genomes/*.fasta -o genomes.sig # Protein signatures sourmash sketch protein -p scaled=100,k=10 proteins.faa -o proteins.sig
Compare Signatures
# Pairwise comparison sourmash compare *.sig -o comparison.npy --csv comparison.csv # Search against database sourmash search query.sig database.sig --threshold 0.8 # Gather (metagenome decomposition) sourmash gather metagenome.sig database.sig -o gather_results.csv
Taxonomy Assignment
# Download taxonomy database sourmash database download gtdb-rs214-k31.zip # Classify sourmash lca classify --db gtdb-rs214-k31.lca.json.gz --query query.sig # Summarize metagenome sourmash lca summarize --db gtdb-rs214-k31.lca.json.gz --query metagenome.sig
fastANI
Installation
conda install -c bioconda fastani
Calculate ANI
# Single pair fastANI -q query.fasta -r reference.fasta -o ani_result.txt # Query vs multiple references fastANI -q query.fasta --rl reference_list.txt -o ani_results.txt # All-vs-all fastANI --ql genome_list.txt --rl genome_list.txt -o all_vs_all.txt --matrix
Interpret ANI
| ANI | Interpretation |
|---|---|
| >99% | Same strain |
| 95-99% | Same species |
| <95% | Different species |
inStrain
For strain-level analysis in metagenomes.
Installation
conda install -c bioconda instrain
Profile Strains
# Map reads to reference bowtie2 -x reference -1 reads_1.fq -2 reads_2.fq | \ samtools sort -o mapped.bam # Profile with inStrain inStrain profile mapped.bam reference.fasta -o instrain_output -p 8
Compare Samples
# Profile multiple samples for bam in sample*.bam; do inStrain profile $bam reference.fasta -o ${bam%.bam}_IS -p 8 done # Compare strain populations inStrain compare -i sample*_IS -o comparison_IS -p 8
Key Outputs
# SNV table cat instrain_output/output/SNVs.tsv # Gene-level info cat instrain_output/output/gene_info.tsv # Genome info cat instrain_output/output/genome_info.tsv
Complete Workflow: Outbreak Tracking
#!/bin/bash set -euo pipefail GENOMES_DIR=$1 OUTPUT_DIR=$2 mkdir -p $OUTPUT_DIR echo "=== MASH sketching ===" mash sketch -o $OUTPUT_DIR/genomes.msh $GENOMES_DIR/*.fasta echo "=== MASH distances ===" mash dist $OUTPUT_DIR/genomes.msh $OUTPUT_DIR/genomes.msh > $OUTPUT_DIR/mash_distances.tsv echo "=== fastANI ===" ls $GENOMES_DIR/*.fasta > $OUTPUT_DIR/genome_list.txt fastANI --ql $OUTPUT_DIR/genome_list.txt \ --rl $OUTPUT_DIR/genome_list.txt \ -o $OUTPUT_DIR/fastani_results.txt \ --matrix echo "=== sourmash signatures ===" sourmash sketch dna -p scaled=1000,k=31 $GENOMES_DIR/*.fasta -o $OUTPUT_DIR/all.sig sourmash compare $OUTPUT_DIR/all.sig -o $OUTPUT_DIR/sourmash.npy --csv $OUTPUT_DIR/sourmash.csv echo "=== Identify clusters ===" python3 << 'EOF' import pandas as pd import numpy as np # Load MASH distances mash = pd.read_csv('${OUTPUT_DIR}/mash_distances.tsv', sep='\t', header=None, names=['ref', 'query', 'distance', 'pvalue', 'shared']) # Filter for close matches (potential outbreak cluster) close = mash[(mash['distance'] < 0.001) & (mash['ref'] != mash['query'])] print("Potential outbreak pairs (MASH distance < 0.001):") print(close[['ref', 'query', 'distance']]) EOF echo "=== Complete ==="
Python Analysis
import pandas as pd import numpy as np from scipy.cluster.hierarchy import linkage, fcluster from scipy.spatial.distance import squareform # Load MASH distances mash = pd.read_csv('mash_distances.tsv', sep='\t', header=None, names=['ref', 'query', 'dist', 'pval', 'shared']) # Pivot to matrix samples = sorted(set(mash['ref'].tolist())) dist_matrix = mash.pivot(index='ref', columns='query', values='dist').fillna(0) dist_matrix = dist_matrix.loc[samples, samples] # Cluster condensed = squareform(dist_matrix.values) Z = linkage(condensed, method='average') # Cut tree at species level (0.05) clusters = fcluster(Z, t=0.05, criterion='distance') cluster_df = pd.DataFrame({'sample': samples, 'cluster': clusters}) print(cluster_df.groupby('cluster').size())
Related Skills
- metagenomics/kraken-classification - Taxonomic classification
- genome-assembly/contamination-detection - Contamination screening
- phylogenetics/modern-tree-inference - Phylogenetic analysis
- metagenomics/metaphlan-profiling - Species profiling