Claude-skill-registry bio-protein-clustering-pangenome

Cluster proteins into orthogroups and derive pangenome matrices.

install

source · Clone the upstream repo

git clone https://github.com/majiayu000/claude-skill-registry

Claude Code · Install into ~/.claude/skills/

T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/data/bio-protein-clustering-pangenome" ~/.claude/skills/majiayu000-claude-skill-registry-bio-protein-clustering-pangenome && rm -rf "$T"

manifest: skills/data/bio-protein-clustering-pangenome/SKILL.md

Bio Protein Clustering Pangenome

When to use

Cluster proteins into orthogroups and derive pangenome matrices.

Prerequisites

Tools installed via pixi (see pixi.toml).
Protein FASTA inputs are available.

Inputs

proteins.faa (FASTA protein sequences)

Outputs

results/bio-protein-clustering-pangenome/orthogroups.tsv
results/bio-protein-clustering-pangenome/presence_absence.parquet
results/bio-protein-clustering-pangenome/pangenome_report.md
results/bio-protein-clustering-pangenome/logs/

Concepts

Pangenome Components

The pangenome (pan-genome or supragenome) is the entire set of genes from all strains within a clade. It can be partitioned into distinct categories:

Core genome: Genes present in all (or >99%) of genomes. These represent essential housekeeping functions and shared metabolic capabilities. Typically includes ~100 single-copy proteins in bacteria (ribosomal proteins, RNA/DNA polymerases, DNA repair proteins).
Shell genome: Genes present at intermediate frequencies (typically 15-95% of genomes). These provide adaptive advantages in specific environments.
Accessory genome: Genes present in a subset of genomes (<99% to ≥15%). Contains dispensable genes that provide niche-specific adaptations.
Cloud genome: Rare genes present in very few genomes (<15%). Often includes strain-specific genes (singletons) and recently acquired genes.

Gene Types in Pangenome Analysis

Orthologs: Homologous genes that diverged via speciation (vertical inheritance). Present in multiple genomes with shared ancestry.
Single-copy orthologs (SCOs): Orthologous groups where each genome is represented by exactly one gene. Critical for phylogenetic analysis and genome completeness assessment (BUSCO scores).
Pan-orthologs: All orthologous genes across the entire pangenome, including both single-copy and multi-copy orthologs.
Paralogs: Homologous genes that arose via duplication within a lineage. Multiple copies may exist within a single genome. Can complicate ortholog identification if not properly discriminated.
Singletons: Genes with no sequence homology to genes in any other genome in the dataset. Present in one and only one genome. May represent unique adaptations, recent horizontal gene transfers, or annotation artifacts.

Key Metrics

Pangenome size: Total number of unique gene families across all genomes
Core genome size: Number of gene families present in all genomes
Accessory genome size: Pangenome size - Core genome size
Genome completeness: Percentage of expected single-copy orthologs present in a genome
Orthogroup occupancy: Percentage of genomes represented in an orthogroup

Steps

Cluster proteins with MMseqs2 or ProteinOrtho.
Build presence/absence matrix.
Compute core/accessory/cloud/singleton partitions.
Identify single-copy orthologs for phylogenetic analysis.
Discriminate paralogs from orthologs in multi-copy gene families.
Calculate pangenome statistics (completeness, orthogroup occupancy).

Analyzing Orthogroups Output

Understanding Orthogroup Tables

Orthogroup clustering tools (MMseqs2, ProteinOrtho, OrthoFinder, OrthoMCL) produce tables where:

Rows = orthogroups (gene families)
Columns = genomes
Cells = gene identifiers (or counts)

Identifying Single-Copy Pan-Orthologs

Single-copy pan-orthologs are the gold standard for phylogenetics and represent genes present exactly once in all genomes:

# Filter for single-copy orthologs present in all genomes
sco_mask = (orthogroups_df.notna().sum(axis=1) == n_genomes) & \
           (orthogroups_df.notna().sum(axis=1) == orthogroups_df.count(axis=1))
single_copy_orthologs = orthogroups_df[sco_mask]

Criteria:

Presence: Gene exists in 100% of genomes
Copy number: Exactly 1 gene per genome
Minimum occupancy: Often require ≥50% occupancy for relaxed SCO sets

Finding Multi-Copy Orthologs (Paralogs)

Paralogs indicate gene duplication events:

# Identify orthogroups with paralogs (>1 copy in any genome)
multi_copy_mask = orthogroups_df.apply(
    lambda row: any(isinstance(x, str) and ',' in x for x in row),
    axis=1
)
paralog_groups = orthogroups_df[multi_copy_mask]

Analysis considerations:

Distinguish true paralogs from assembly artifacts
Evaluate if duplication is lineage-specific or ancient
Use conserved gene neighborhood (synteny) to refine ortholog assignments

Identifying Singletons

Singletons are unique to one genome:

# Find genes present in only one genome
singleton_mask = orthogroups_df.notna().sum(axis=1) == 1
singletons = orthogroups_df[singleton_mask]

Interpretation:

Strain-specific adaptations
Recently acquired via horizontal gene transfer (HGT)
Potential annotation errors or poor quality genes
Genes under rapid evolution or gene loss events

Calculating Core vs Accessory Genome

Partition orthogroups by presence frequency:

# Calculate presence across genomes
presence_freq = orthogroups_df.notna().sum(axis=1) / n_genomes

# Core genome (present in ≥99% of genomes)
core = orthogroups_df[presence_freq >= 0.99]

# Accessory genome (present in 15-99% of genomes)
accessory = orthogroups_df[(presence_freq >= 0.15) & (presence_freq < 0.99)]

# Cloud genome (present in <15% of genomes)
cloud = orthogroups_df[presence_freq < 0.15]

Threshold guidance:

Core: ≥95-99% (adjust based on genome quality)
Shell: 15-95% (intermediate frequency)
Cloud: <15% (rare genes)

Common Analysis Patterns for Microbial Comparative Genomics

Completeness assessment: Count single-copy orthologs per genome to evaluate assembly/annotation quality
Pan-genome openness: Plot cumulative gene count vs number of genomes
- Open pangenome: Continues to grow with new genomes
- Closed pangenome: Plateaus after sufficient sampling
Phylogenetic marker identification: Extract single-copy core genes for species tree construction
Functional profiling: Annotate orthogroups with COG/KEGG/Pfam to understand functional distribution across core/accessory partitions
Presence/absence patterns: Identify genes associated with phenotypes (pathogenicity, antibiotic resistance, niche adaptation)

Decision Points

If few single-copy orthologs identified (<50):

Check genome annotation quality
Verify genomes are from same species/genus
Consider contamination or misassembled genomes
Adjust clustering stringency

If excessive singletons (>30% of genes):

Review annotation parameters (overprediction likely)
Check for contamination
Evaluate sequence quality and completeness
Consider stricter clustering thresholds

If no clear core genome emerges:

Dataset may span too broad taxonomic range
Poor quality genomes present
Incorrect species assignments

QC gates

Cluster size distributions meet project thresholds.
Matrix completeness meets project thresholds.
On failure: retry with alternative parameters; if still failing, record in report and exit non-zero.

Downstream Analysis

Filtering Orthogroups by Presence/Absence Patterns

Extract core genome genes:

# Get orthogroups present in all genomes
awk -F'\t' 'NR==1 || (NF-1 == n_genomes && !/\t\t/)' orthogroups.tsv > core_orthogroups.tsv

Find lineage-specific genes:

# Identify genes specific to a subset of genomes
lineage_genomes = ['genome1', 'genome2', 'genome3']
lineage_specific = orthogroups_df[
    (orthogroups_df[lineage_genomes].notna().all(axis=1)) &
    (orthogroups_df.drop(columns=lineage_genomes).isna().all(axis=1))
]

Presence correlation analysis:

# Find genes with correlated presence patterns
from scipy.cluster.hierarchy import linkage, dendrogram
presence_matrix = orthogroups_df.notna().astype(int)
linkage_matrix = linkage(presence_matrix, method='average', metric='jaccard')

Extracting Alignments from Single-Copy Orthologs

Workflow for phylogenetic analysis:

Identify single-copy core orthogroups (100% presence, 1 copy each)
Extract protein sequences for each orthogroup
Align each orthogroup separately with MAFFT/MUSCLE
Trim alignments with trimAl or BMGE
Concatenate alignments into supermatrix
Build phylogenetic tree with IQ-TREE/RAxML

# Example: Extract sequences for single-copy orthogroup
for og in single_copy_orthogroups.txt; do
  grep -A1 -f <(cut -f2- orthogroups.tsv | grep "^${og}") proteins.faa > ${og}.faa
  mafft --auto ${og}.faa > ${og}.aln
  trimal -in ${og}.aln -out ${og}.trim -automated1
done

# Concatenate alignments
cat *.trim > concatenated_alignment.faa

Identifying Horizontally Transferred Genes (HGT)

Indicators of horizontal gene transfer:

Genes present in distant taxa but absent in close relatives
Atypical GC content compared to genome average
Different codon usage bias
Phylogenetic incongruence (gene tree vs species tree)
Presence in mobile genetic elements (plasmids, integrons, transposons)

Analysis approach:

# Flag orthogroups with sparse, patchy distribution
def detect_hgt_candidates(presence_df, threshold=0.3):
    """Identify orthogroups with discontinuous presence patterns"""
    hgt_candidates = []
    for og in presence_df.index:
        presence = presence_df.loc[og]
        freq = presence.sum() / len(presence)
        # Low frequency + non-clustered distribution suggests HGT
        if 0.05 < freq < threshold:
            hgt_candidates.append(og)
    return hgt_candidates

Validation steps:

Align HGT candidate genes
Build individual gene trees
Compare to species tree topology
BLAST against broader database to find donor lineages
Check for association with mobile elements

Functional Enrichment of Accessory Genome

Annotate orthogroups:

# Functional annotation with eggNOG-mapper
for og in accessory_orthogroups.txt; do
  emapper.py -i ${og}.faa --output ${og}_annotation
done

Enrichment analysis:

# Test for overrepresented functions in accessory genome
from scipy.stats import fisher_exact

def enrichment_analysis(core_functions, accessory_functions, total_functions):
    """Fisher's exact test for functional enrichment"""
    enriched = []
    for func in set(accessory_functions):
        a = accessory_functions.count(func)  # Function in accessory
        b = len(accessory_functions) - a      # Other functions in accessory
        c = core_functions.count(func)        # Function in core
        d = len(core_functions) - c           # Other functions in core

        odds_ratio, p_value = fisher_exact([[a, b], [c, d]])
        if p_value < 0.05 and odds_ratio > 1:
            enriched.append((func, odds_ratio, p_value))
    return enriched

Common patterns in bacterial accessory genomes:

Antibiotic resistance genes
Virulence factors and toxins
Secondary metabolite biosynthesis clusters
Restriction-modification systems
Niche-specific transporters and catabolic pathways

Pangenome Visualization

Recommended plots:

Pan-genome accumulation curve: Genes vs genomes added
Core genome decay curve: Core size vs genomes added
Presence/absence heatmap: Orthogroups × genomes
Functional category distribution: Core vs accessory annotations
Phylogenetic tree with gene presence: Map accessory genes onto phylogeny

# Example: Pan-genome accumulation
import matplotlib.pyplot as plt
import numpy as np
from itertools import permutations

def pangenome_curve(presence_df, n_permutations=100):
    """Calculate pan-genome growth curve"""
    n_genomes = presence_df.shape[1]
    genome_order = list(range(n_genomes))

    pan_sizes = np.zeros((n_permutations, n_genomes))

    for i in range(n_permutations):
        np.random.shuffle(genome_order)
        cumulative_genes = set()
        for j, genome_idx in enumerate(genome_order):
            cumulative_genes.update(
                presence_df.iloc[:, genome_idx].dropna().index
            )
            pan_sizes[i, j] = len(cumulative_genes)

    mean_pan = pan_sizes.mean(axis=0)
    std_pan = pan_sizes.std(axis=0)

    plt.plot(range(1, n_genomes+1), mean_pan)
    plt.fill_between(range(1, n_genomes+1),
                     mean_pan-std_pan, mean_pan+std_pan, alpha=0.3)
    plt.xlabel('Number of genomes')
    plt.ylabel('Number of gene families')
    plt.title('Pan-genome accumulation')
    return mean_pan

Validation

Verify proteins.faa is non-empty and amino acid encoded.

Tools

Primary Clustering Tools

mmseqs2 v18-8cc5c: Ultra-fast sequence clustering and search. Good for large datasets (>1000 genomes).
proteinortho v6.3.6: Detects orthologs using reciprocal best hits with graph-based clustering.
OrthoFinder: Phylogenetic orthology inference with gene trees, duplication events, and species tree. Excellent for detailed comparative genomics and paralog discrimination.
OrthoMCL: Uses conserved gene neighborhood for improved paralog separation. Good for closely related bacterial strains.

Tool Selection Guide

Small datasets (<50 genomes): OrthoFinder (most comprehensive output)
Medium datasets (50-500 genomes): ProteinOrtho or MMseqs2
Large datasets (>500 genomes): MMseqs2 (fastest)
Closely related strains: OrthoMCL or PEPPAN (paralog discrimination)
High phylogenetic resolution needed: OrthoFinder (produces gene trees)

Recommended Pipelines

GET_HOMOLOGUES: Comprehensive bacterial pan-genome analysis with multiple clustering algorithms
Roary: Fast pan-genome pipeline specifically for prokaryotes
PEPPAN: Accurate reconstruction with paralogous score calculation
BPGA: Browser-based pan-genome analysis platform
Anvi'o: Integrated pangenomics with visualization
RIBAP: Core genome annotation pipeline for pangenome calculation beyond species level

Paper summaries (2023-2025)

summaries/ (include example use cases and tool settings used)

Tool documentation

MMseqs2 - Ultra-fast sequence clustering and search
ProteinOrtho - Ortholog detection with reciprocal best hits

References

Key Concepts and Definitions

Tools and Methods

Analysis Strategies

Recent Developments

Additional Resources

See ../bio-skills-references.md