AlterLab-Academic-Skills alterlab-phylogenetics
Build and analyze phylogenetic trees using MAFFT (multiple alignment), IQ-TREE 2 (maximum likelihood), and FastTree (fast NJ/ML). Visualize with ETE3 or FigTree. For evolutionary analysis, microbial genomics, viral phylodynamics, protein family analysis, and molecular clock studies. Part of the AlterLab Academic Skills suite.
install
source · Clone the upstream repo
git clone https://github.com/AlterLab-IEU/AlterLab-Academic-Skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/AlterLab-IEU/AlterLab-Academic-Skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/skills/bioinformatics/alterlab-phylogenetics" ~/.claude/skills/alterlab-ieu-alterlab-academic-skills-alterlab-phylogenetics && rm -rf "$T"
manifest:
skills/bioinformatics/alterlab-phylogenetics/SKILL.mdsource content
Phylogenetics
Overview
Phylogenetic analysis reconstructs the evolutionary history of biological sequences (genes, proteins, genomes) by inferring the branching pattern of descent. This skill covers the standard pipeline:
- MAFFT — Multiple sequence alignment
- IQ-TREE 2 — Maximum likelihood tree inference with model selection
- FastTree — Fast approximate maximum likelihood (for large datasets)
- ETE3 — Python library for tree manipulation and visualization
Installation:
# Conda (recommended for CLI tools) conda install -c bioconda mafft iqtree fasttree pip install ete3
When to Use This Skill
Use phylogenetics when:
- Evolutionary relationships: Which organism/gene is most closely related to my sequence?
- Viral phylodynamics: Trace outbreak spread and estimate transmission dates
- Protein family analysis: Infer evolutionary relationships within a gene family
- Horizontal gene transfer detection: Identify genes with discordant species/gene trees
- Ancestral sequence reconstruction: Infer ancestral protein sequences
- Molecular clock analysis: Estimate divergence dates using temporal sampling
- GWAS companion: Place variants in evolutionary context (e.g., SARS-CoV-2 variants)
- Microbiology: Species phylogeny from 16S rRNA or core genome phylogeny
Standard Workflow
1. Multiple Sequence Alignment with MAFFT
import subprocess import os def run_mafft(input_fasta: str, output_fasta: str, method: str = "auto", n_threads: int = 4) -> str: """ Align sequences with MAFFT. Args: input_fasta: Path to unaligned FASTA file output_fasta: Path for aligned output method: 'auto' (auto-select), 'einsi' (accurate), 'linsi' (accurate, slow), 'fftnsi' (medium), 'fftns' (fast), 'retree2' (fast) n_threads: Number of CPU threads Returns: Path to aligned FASTA file """ methods = { "auto": ["mafft", "--auto"], "einsi": ["mafft", "--genafpair", "--maxiterate", "1000"], "linsi": ["mafft", "--localpair", "--maxiterate", "1000"], "fftnsi": ["mafft", "--fftnsi"], "fftns": ["mafft", "--fftns"], "retree2": ["mafft", "--retree", "2"], } cmd = methods.get(method, methods["auto"]) cmd += ["--thread", str(n_threads), "--inputorder", input_fasta] with open(output_fasta, 'w') as out: result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True) if result.returncode != 0: raise RuntimeError(f"MAFFT failed:\n{result.stderr}") # Count aligned sequences with open(output_fasta) as f: n_seqs = sum(1 for line in f if line.startswith('>')) print(f"MAFFT: aligned {n_seqs} sequences → {output_fasta}") return output_fasta # MAFFT method selection guide: # Few sequences (<200), accurate: linsi or einsi # Many sequences (<1000), moderate: fftnsi # Large datasets (>1000): fftns or auto # Ultra-fast (>10000): mafft --retree 1
2. Trim Alignment (Optional but Recommended)
def trim_alignment_trimal(aligned_fasta: str, output_fasta: str, method: str = "automated1") -> str: """ Trim poorly aligned columns with TrimAl. Methods: - 'automated1': Automatic heuristic (recommended) - 'gappyout': Remove gappy columns - 'strict': Strict gap threshold """ cmd = ["trimal", f"-{method}", "-in", aligned_fasta, "-out", output_fasta, "-fasta"] result = subprocess.run(cmd, capture_output=True, text=True) if result.returncode != 0: print(f"TrimAl warning: {result.stderr}") # Fall back to using the untrimmed alignment import shutil shutil.copy(aligned_fasta, output_fasta) return output_fasta
3. IQ-TREE 2 — Maximum Likelihood Tree
def run_iqtree(aligned_fasta: str, output_prefix: str, model: str = "TEST", bootstrap: int = 1000, n_threads: int = 4, extra_args: list = None) -> dict: """ Build a maximum likelihood tree with IQ-TREE 2. Args: aligned_fasta: Aligned FASTA file output_prefix: Prefix for output files model: 'TEST' for automatic model selection, or specify (e.g., 'GTR+G' for DNA, 'LG+G4' for proteins, 'JTT+G' for proteins) bootstrap: Number of ultrafast bootstrap replicates (1000 recommended) n_threads: Number of threads ('AUTO' to auto-detect) extra_args: Additional IQ-TREE arguments Returns: Dict with paths to output files """ cmd = [ "iqtree2", "-s", aligned_fasta, "--prefix", output_prefix, "-m", model, "-B", str(bootstrap), # Ultrafast bootstrap "-T", str(n_threads), "--redo" # Overwrite existing results ] if extra_args: cmd.extend(extra_args) result = subprocess.run(cmd, capture_output=True, text=True) if result.returncode != 0: raise RuntimeError(f"IQ-TREE failed:\n{result.stderr}") # Print model selection result log_file = f"{output_prefix}.log" if os.path.exists(log_file): with open(log_file) as f: for line in f: if "Best-fit model" in line: print(f"IQ-TREE: {line.strip()}") output_files = { "tree": f"{output_prefix}.treefile", "log": f"{output_prefix}.log", "iqtree": f"{output_prefix}.iqtree", # Full report "model": f"{output_prefix}.model.gz", } print(f"IQ-TREE: Tree saved to {output_files['tree']}") return output_files # IQ-TREE model selection guide: # DNA: TEST → GTR+G, HKY+G, TrN+G # Protein: TEST → LG+G4, WAG+G, JTT+G, Q.pfam+G # Codon: TEST → MG+F3X4 # For temporal (molecular clock) analysis, add: # extra_args = ["--date", "dates.txt", "--clock-test", "--date-CI", "95"]
4. FastTree — Fast Approximate ML
For large datasets (>1000 sequences) where IQ-TREE is too slow:
def run_fasttree(aligned_fasta: str, output_tree: str, sequence_type: str = "nt", model: str = "gtr", n_threads: int = 4) -> str: """ Build a fast approximate ML tree with FastTree. Args: sequence_type: 'nt' for nucleotide or 'aa' for amino acid model: For nt: 'gtr' (recommended) or 'jc'; for aa: 'lg', 'wag', 'jtt' """ if sequence_type == "nt": cmd = ["FastTree", "-nt", "-gtr"] else: cmd = ["FastTree", f"-{model}"] cmd += [aligned_fasta] with open(output_tree, 'w') as out: result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True) if result.returncode != 0: raise RuntimeError(f"FastTree failed:\n{result.stderr}") print(f"FastTree: Tree saved to {output_tree}") return output_tree
5. Tree Analysis and Visualization with ETE3
from ete3 import Tree, TreeStyle, NodeStyle, TextFace, PhyloTree import matplotlib.pyplot as plt def load_tree(tree_file: str) -> Tree: """Load a Newick tree file.""" t = Tree(tree_file) print(f"Tree: {len(t)} leaves, {len(list(t.traverse()))} nodes") return t def basic_tree_stats(t: Tree) -> dict: """Compute basic tree statistics.""" leaves = t.get_leaves() distances = [t.get_distance(l1, l2) for l1 in leaves[:min(50, len(leaves))] for l2 in leaves[:min(50, len(leaves))] if l1 != l2] stats = { "n_leaves": len(leaves), "n_internal_nodes": len(t) - len(leaves), "total_branch_length": sum(n.dist for n in t.traverse()), "max_leaf_distance": max(distances) if distances else 0, "mean_leaf_distance": sum(distances)/len(distances) if distances else 0, } return stats def find_mrca(t: Tree, leaf_names: list) -> Tree: """Find the most recent common ancestor of a set of leaves.""" return t.get_common_ancestor(*leaf_names) def visualize_tree(t: Tree, output_file: str = "tree.png", show_branch_support: bool = True, color_groups: dict = None, width: int = 800) -> None: """ Render phylogenetic tree to image. Args: t: ETE3 Tree object color_groups: Dict mapping leaf_name → color (for coloring taxa) show_branch_support: Show bootstrap values """ ts = TreeStyle() ts.show_leaf_name = True ts.show_branch_support = show_branch_support ts.mode = "r" # 'r' = rectangular, 'c' = circular if color_groups: for node in t.traverse(): if node.is_leaf() and node.name in color_groups: nstyle = NodeStyle() nstyle["fgcolor"] = color_groups[node.name] nstyle["size"] = 8 node.set_style(nstyle) t.render(output_file, tree_style=ts, w=width, units="px") print(f"Tree saved to: {output_file}") def midpoint_root(t: Tree) -> Tree: """Root tree at midpoint (use when outgroup unknown).""" t.set_outgroup(t.get_midpoint_outgroup()) return t def prune_tree(t: Tree, keep_leaves: list) -> Tree: """Prune tree to keep only specified leaves.""" t.prune(keep_leaves, preserve_branch_length=True) return t
6. Complete Analysis Script
import subprocess, os from ete3 import Tree def full_phylogenetic_analysis( input_fasta: str, output_dir: str = "phylo_results", sequence_type: str = "nt", n_threads: int = 4, bootstrap: int = 1000, use_fasttree: bool = False ) -> dict: """ Complete phylogenetic pipeline: align → trim → tree → visualize. Args: input_fasta: Unaligned FASTA sequence_type: 'nt' (nucleotide) or 'aa' (amino acid/protein) use_fasttree: Use FastTree instead of IQ-TREE (faster for large datasets) """ os.makedirs(output_dir, exist_ok=True) prefix = os.path.join(output_dir, "phylo") print("=" * 50) print("Step 1: Multiple Sequence Alignment (MAFFT)") aligned = run_mafft(input_fasta, f"{prefix}_aligned.fasta", method="auto", n_threads=n_threads) print("\nStep 2: Tree Inference") if use_fasttree: tree_file = run_fasttree( aligned, f"{prefix}.tree", sequence_type=sequence_type, model="gtr" if sequence_type == "nt" else "lg" ) else: model = "TEST" if sequence_type == "nt" else "TEST" iqtree_files = run_iqtree( aligned, prefix, model=model, bootstrap=bootstrap, n_threads=n_threads ) tree_file = iqtree_files["tree"] print("\nStep 3: Tree Analysis") t = Tree(tree_file) t = midpoint_root(t) stats = basic_tree_stats(t) print(f"Tree statistics: {stats}") print("\nStep 4: Visualization") visualize_tree(t, f"{prefix}_tree.png", show_branch_support=True) # Save rooted tree rooted_tree_file = f"{prefix}_rooted.nwk" t.write(format=1, outfile=rooted_tree_file) results = { "aligned_fasta": aligned, "tree_file": tree_file, "rooted_tree": rooted_tree_file, "visualization": f"{prefix}_tree.png", "stats": stats } print("\n" + "=" * 50) print("Phylogenetic analysis complete!") print(f"Results in: {output_dir}/") return results
IQ-TREE Model Guide
DNA Models
| Model | Description | Use case |
|---|---|---|
| General Time Reversible + Gamma | Most flexible DNA model |
| Hasegawa-Kishino-Yano + Gamma | Two-rate model (common) |
| Tamura-Nei | Unequal transitions |
| Jukes-Cantor | Simplest; all rates equal |
Protein Models
| Model | Description | Use case |
|---|---|---|
| Le-Gascuel + Gamma | Best average protein model |
| Whelan-Goldman | Widely used |
| Jones-Taylor-Thornton | Classical model |
| pfam-trained | For Pfam-like protein families |
| Bird-specific | Vertebrate proteins |
Tip: Use
-m TEST to let IQ-TREE automatically select the best model.
Best Practices
- Alignment quality first: Poor alignment → unreliable trees; check alignment manually
- Use
for small (<200 seq),linsi
orfftns
for large alignmentsauto - Model selection: Always use
for IQ-TREE unless you have a specific reason-m TEST - Bootstrap: Use ≥1000 ultrafast bootstraps (
) for branch support-B 1000 - Root the tree: Unrooted trees can be misleading; use outgroup or midpoint rooting
- FastTree for >5000 sequences: IQ-TREE becomes slow; FastTree is 10–100× faster
- Trim long alignments: TrimAl removes unreliable columns; improves tree accuracy
- Check for recombination in viral/bacterial sequences before building trees (
,RDP4
)GARD
Additional Resources
- MAFFT: https://mafft.cbrc.jp/alignment/software/
- IQ-TREE 2: http://www.iqtree.org/ | Tutorial: https://www.iqtree.org/workshop/molevol2022
- FastTree: http://www.microbesonline.org/fasttree/
- ETE3: http://etetoolkit.org/
- FigTree (GUI visualization): https://tree.bio.ed.ac.uk/software/figtree/
- iTOL (web visualization): https://itol.embl.de/
- MUSCLE (alternative aligner): https://www.drive5.com/muscle/
- TrimAl (alignment trimming): https://vicfero.github.io/trimal/