OpenClaw-Medical-Skills bio-metagenomics-strain-tracking
Track bacterial strains using MASH, sourmash, fastANI, and inStrain. Compare genomes, detect contamination, and monitor strain-level variation. Use when needing sub-species resolution for outbreak tracking, transmission analysis, or within-host strain dynamics.
git clone https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bio-metagenomics-strain-tracking" ~/.claude/skills/freedomintelligence-openclaw-medical-skills-bio-metagenomics-strain-tracking && rm -rf "$T"
T=$(mktemp -d) && git clone --depth=1 https://github.com/FreedomIntelligence/OpenClaw-Medical-Skills "$T" && mkdir -p ~/.openclaw/skills && cp -r "$T/skills/bio-metagenomics-strain-tracking" ~/.openclaw/skills/freedomintelligence-openclaw-medical-skills-bio-metagenomics-strain-tracking && rm -rf "$T"
skills/bio-metagenomics-strain-tracking/SKILL.mdVersion Compatibility
Reference examples tested with: Bowtie2 2.5.3+, MetaPhlAn 4.1+, numpy 1.26+, pandas 2.2+, samtools 1.19+, scipy 1.12+
Before using code patterns, verify installed versions match. If versions differ:
- Python:
thenpip show <package>
to check signatureshelp(module.function) - CLI:
then<tool> --version
to confirm flags<tool> --help
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Strain Tracking
"Track bacterial strains across my samples" → Resolve sub-species variation using genome sketching (Mash/sourmash), average nucleotide identity (fastANI), or within-sample strain profiling (inStrain) for outbreak tracking and transmission analysis.
- CLI:
,mash dist
,sourmash compare
,fastANIinStrain profile
Identify and track bacterial strains at sub-species resolution.
Tool Comparison
| Tool | Method | Best For |
|---|---|---|
| MASH | MinHash sketches | Fast distance estimation |
| sourmash | MinHash + containment | Metagenome comparisons |
| fastANI | ANI calculation | Accurate species/strain ID |
| inStrain | SNV profiling | Strain dynamics in metagenomes |
MASH
Installation
conda install -c bioconda mash
Create Sketch
# Single genome mash sketch -o genome.msh genome.fasta # Multiple genomes mash sketch -o reference_db.msh genomes/*.fasta # From reads (with coverage) mash sketch -m 2 -r -o reads.msh reads.fastq.gz
Calculate Distance
# Pairwise distance mash dist genome1.fasta genome2.fasta # Query against database mash dist reference_db.msh query.fasta > distances.tsv # Screen for containment (metagenome) mash screen reference_db.msh reads.fastq.gz > screen_results.tsv
Interpret MASH Distance
| Distance | Interpretation |
|---|---|
| < 0.05 | Same species/strain |
| 0.05-0.15 | Same species |
| 0.15-0.25 | Same genus |
| > 0.25 | Different genus |
Cluster Genomes
# All-vs-all distances mash triangle genomes/*.fasta > distances.phylip # Build tree mash triangle -E genomes/*.fasta > distances.tsv
sourmash
Installation
conda install -c bioconda sourmash
Create Signatures
# Genome signature sourmash sketch dna -p scaled=1000,k=31 genome.fasta -o genome.sig # Multiple genomes sourmash sketch dna -p scaled=1000,k=31 genomes/*.fasta -o genomes.sig # Protein signatures sourmash sketch protein -p scaled=100,k=10 proteins.faa -o proteins.sig
Compare Signatures
# Pairwise comparison sourmash compare *.sig -o comparison.npy --csv comparison.csv # Search against database sourmash search query.sig database.sig --threshold 0.8 # Gather (metagenome decomposition) sourmash gather metagenome.sig database.sig -o gather_results.csv
Taxonomy Assignment
# Download taxonomy database sourmash database download gtdb-rs214-k31.zip # Classify sourmash lca classify --db gtdb-rs214-k31.lca.json.gz --query query.sig # Summarize metagenome sourmash lca summarize --db gtdb-rs214-k31.lca.json.gz --query metagenome.sig
fastANI
Installation
conda install -c bioconda fastani
Calculate ANI
# Single pair fastANI -q query.fasta -r reference.fasta -o ani_result.txt # Query vs multiple references fastANI -q query.fasta --rl reference_list.txt -o ani_results.txt # All-vs-all fastANI --ql genome_list.txt --rl genome_list.txt -o all_vs_all.txt --matrix
Interpret ANI
| ANI | Interpretation |
|---|---|
| >99% | Same strain |
| 95-99% | Same species |
| <95% | Different species |
inStrain
For strain-level analysis in metagenomes.
Installation
conda install -c bioconda instrain
Profile Strains
# Map reads to reference bowtie2 -x reference -1 reads_1.fq -2 reads_2.fq | \ samtools sort -o mapped.bam # Profile with inStrain inStrain profile mapped.bam reference.fasta -o instrain_output -p 8
Compare Samples
# Profile multiple samples for bam in sample*.bam; do inStrain profile $bam reference.fasta -o ${bam%.bam}_IS -p 8 done # Compare strain populations inStrain compare -i sample*_IS -o comparison_IS -p 8
Key Outputs
# SNV table cat instrain_output/output/SNVs.tsv # Gene-level info cat instrain_output/output/gene_info.tsv # Genome info cat instrain_output/output/genome_info.tsv
Complete Workflow: Outbreak Tracking
Goal: Identify potential outbreak clusters by computing pairwise genomic distances across isolate genomes using multiple complementary methods.
Approach: Sketch genomes with MASH for fast distance estimation, compute ANI with fastANI for accurate species-level resolution, compare sourmash signatures for containment analysis, and cluster close matches to identify transmission pairs.
#!/bin/bash set -euo pipefail GENOMES_DIR=$1 OUTPUT_DIR=$2 mkdir -p $OUTPUT_DIR echo "=== MASH sketching ===" mash sketch -o $OUTPUT_DIR/genomes.msh $GENOMES_DIR/*.fasta echo "=== MASH distances ===" mash dist $OUTPUT_DIR/genomes.msh $OUTPUT_DIR/genomes.msh > $OUTPUT_DIR/mash_distances.tsv echo "=== fastANI ===" ls $GENOMES_DIR/*.fasta > $OUTPUT_DIR/genome_list.txt fastANI --ql $OUTPUT_DIR/genome_list.txt \ --rl $OUTPUT_DIR/genome_list.txt \ -o $OUTPUT_DIR/fastani_results.txt \ --matrix echo "=== sourmash signatures ===" sourmash sketch dna -p scaled=1000,k=31 $GENOMES_DIR/*.fasta -o $OUTPUT_DIR/all.sig sourmash compare $OUTPUT_DIR/all.sig -o $OUTPUT_DIR/sourmash.npy --csv $OUTPUT_DIR/sourmash.csv echo "=== Identify clusters ===" python3 << 'EOF' import pandas as pd import numpy as np # Load MASH distances mash = pd.read_csv('${OUTPUT_DIR}/mash_distances.tsv', sep='\t', header=None, names=['ref', 'query', 'distance', 'pvalue', 'shared']) # Filter for close matches (potential outbreak cluster) close = mash[(mash['distance'] < 0.001) & (mash['ref'] != mash['query'])] print("Potential outbreak pairs (MASH distance < 0.001):") print(close[['ref', 'query', 'distance']]) EOF echo "=== Complete ==="
Python Analysis
import pandas as pd import numpy as np from scipy.cluster.hierarchy import linkage, fcluster from scipy.spatial.distance import squareform # Load MASH distances mash = pd.read_csv('mash_distances.tsv', sep='\t', header=None, names=['ref', 'query', 'dist', 'pval', 'shared']) # Pivot to matrix samples = sorted(set(mash['ref'].tolist())) dist_matrix = mash.pivot(index='ref', columns='query', values='dist').fillna(0) dist_matrix = dist_matrix.loc[samples, samples] # Cluster condensed = squareform(dist_matrix.values) Z = linkage(condensed, method='average') # Cut tree at species level (0.05) clusters = fcluster(Z, t=0.05, criterion='distance') cluster_df = pd.DataFrame({'sample': samples, 'cluster': clusters}) print(cluster_df.groupby('cluster').size())
Related Skills
- metagenomics/kraken-classification - Taxonomic classification
- genome-assembly/contamination-detection - Contamination screening
- phylogenetics/modern-tree-inference - Phylogenetic analysis
- metagenomics/metaphlan-profiling - Species profiling