BioSkills bio-phylo-distance-calculations
Compute evolutionary distances and build phylogenetic trees using Biopython Bio.Phylo.TreeConstruction. Use when creating distance matrices from alignments, building NJ/UPGMA trees, generating bootstrap consensus, or needing quick exploratory phylogenies before running full ML analysis.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/phylogenetics/distance-calculations" ~/.claude/skills/gptomics-bioskills-bio-phylo-distance-calculations && rm -rf "$T"
phylogenetics/distance-calculations/SKILL.md
Version Compatibility
Reference examples tested with: BioPython 1.83+, NCBI BLAST+ 2.15+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>`, then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
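The version and signature checks can also be scripted. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `describe_callable` are hypothetical, and `biopython` is assumed to be the distribution name to look up.

```python
from importlib import metadata
import inspect


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe_callable(func):
    """Return 'name(signature)' so an example can be compared with the real API."""
    return f"{func.__name__}{inspect.signature(func)}"


# 'biopython' is the assumed distribution name; prints None if it is not installed.
print("biopython:", installed_version("biopython"))
print(describe_callable(installed_version))
```

Passing an imported function such as a tree constructor method to `describe_callable` shows the parameters the installed version actually accepts, which is usually enough to adapt an example after a `TypeError`.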
Distance Calculations and Tree Building
"Build a phylogenetic tree from my alignment" → Compute evolutionary distance matrices from sequence alignments and construct neighbor-joining or UPGMA trees with bootstrap support.
- Python: `Bio.Phylo.TreeConstruction.DistanceCalculator()`, `DistanceTreeConstructor()`
Compute distances from alignments and construct phylogenetic trees.
When to Use Distance Methods vs ML
| Scenario | Recommended Method |
|---|---|
| Quick exploratory tree before committing to a long ML run | NJ |
| Sanity check on data quality (unexpected groupings?) | NJ |
| Very large datasets where ML is prohibitive | NJ |
| Molecular clock data (ultrametric trees) | UPGMA (rare) |
| Publication-quality trees | ML (IQ-TREE2/RAxML-NG) or Bayesian |
| Formal hypothesis testing | ML or Bayesian |
NJ trees are fast (O(n^3)) and useful for exploration. For any analysis intended for publication, use ML methods (see modern-tree-inference skill). ML programs also rely on quick starting trees internally: IQ-TREE includes a BIONJ tree among its starting trees, while RAxML-NG defaults to random and parsimony starting trees.
UPGMA warning: UPGMA assumes a molecular clock (equal rates across all lineages). This assumption is almost never met for molecular data. Use NJ instead unless clocklike behavior has been verified.
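One rough way to check clocklike behavior before trusting UPGMA is the three-point condition: in an ultrametric distance matrix, the two largest of the three distances among any triple of taxa are equal. The sketch below tests that on a plain dictionary of pairwise distances. The function name, input layout, and tolerance are assumptions made for illustration; real data are noisy, so the tolerance needs care, and a formal clock test is the stronger check.

```python
from itertools import combinations


def is_ultrametric(names, dist, tol=1e-9):
    """Check the three-point condition on a symmetric distance lookup.

    dist maps frozenset({a, b}) -> distance. For each triple of taxa, the two
    largest pairwise distances must be equal (within tol).
    """
    for a, b, c in combinations(names, 3):
        d = sorted([dist[frozenset((a, b))],
                    dist[frozenset((a, c))],
                    dist[frozenset((b, c))]])
        if abs(d[2] - d[1]) > tol:
            return False
    return True


names = ["A", "B", "C"]
# Hypothetical distances: A and B are equally far from C in the first case only.
clocklike = {frozenset(("A", "B")): 0.2,
             frozenset(("A", "C")): 0.6,
             frozenset(("B", "C")): 0.6}
uneven = {frozenset(("A", "B")): 0.2,
          frozenset(("A", "C")): 0.6,
          frozenset(("B", "C")): 0.9}

print(is_ultrametric(names, clocklike))  # True
print(is_ultrametric(names, uneven))     # False
```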
Evolutionary Distance Corrections
Raw identity-based distances underestimate true evolutionary distance because they do not account for multiple substitutions at the same site. For divergent sequences, corrected distances are more appropriate:
| Model | Correction | Use When |
|---|---|---|
| Identity | None (raw mismatch proportion) | Closely related sequences; quick exploration |
| Jukes-Cantor | Assumes equal substitution rates | Simple correction for moderate divergence |
| Kimura 2-parameter | Distinguishes transitions from transversions | Better for DNA when Ti/Tv ratio differs from 1 |
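Biopython's `DistanceCalculator` models (`identity`, `blastn`, `trans`) are scoring-matrix based and provide only basic corrections; Jukes-Cantor and Kimura 2-parameter are not among its built-in models. For more sophisticated evolutionary distance estimation, use ML-based distances from IQ-TREE2 (.mldist output file).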
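For reference, the Jukes-Cantor and Kimura 2-parameter corrections from the table can be computed directly from mismatch proportions. The sketch below uses the standard formulas, d = -(3/4) ln(1 - 4p/3) for Jukes-Cantor and d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)) for Kimura 2-parameter, where P and Q are the transition and transversion proportions. The function names and the choice to skip gaps and ambiguity codes are assumptions for illustration, and the two short sequences are made up.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def site_proportions(seq1, seq2):
    """Return (P, Q): transition and transversion proportions over compared sites."""
    transitions = transversions = compared = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # assumption: skip gaps and ambiguity codes
        compared += 1
        if x == y:
            continue
        same_class = ({x, y} <= PURINES) or ({x, y} <= PYRIMIDINES)
        if same_class:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return transitions / compared, transversions / compared


def jukes_cantor(p):
    """JC69 distance from the raw mismatch proportion p (requires p < 0.75)."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """K2P distance from transition (P) and transversion (Q) proportions."""
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


# Made-up pair: one transition (G/A) and one transversion (A/C) in 10 sites.
P, Q = site_proportions("ACGTACGTAC", "ACATACGTCC")
p = P + Q
print(f"raw p = {p:.3f}")
print(f"JC69  = {jukes_cantor(p):.3f}")
print(f"K2P   = {kimura_2p(P, Q):.3f}")
```

Both corrected values come out larger than the raw proportion, which is the underestimation effect described above; the gap widens as sequences diverge.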
Required Import
```python
from Bio import Phylo, AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.TreeConstruction import DistanceMatrix
from Bio.Phylo.TreeConstruction import ParsimonyScorer, ParsimonyTreeConstructor, NNITreeSearcher
from Bio.Phylo.Consensus import strict_consensus, majority_consensus, bootstrap_trees, bootstrap_consensus
```
Distance Matrix from Alignment
```python
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator

alignment = AlignIO.read('alignment.fasta', 'fasta')

# Create calculator with distance model
calculator = DistanceCalculator('identity')  # Simple identity-based distance
dm = calculator.get_distance(alignment)
print(dm)

# Available models for DNA
calculator = DistanceCalculator('blastn')  # BLASTN-style distance

# Available models for protein
calculator = DistanceCalculator('blosum62')  # BLOSUM62-based distance
```
Available Distance Models
| Model | Type | Description |
|---|---|---|
| `identity` | DNA/Protein | 1 - (identical positions / total) |
| `blastn` | DNA | BLASTN scoring distance |
| `trans` | DNA | Transition/transversion weighted |
| `blosum62` | Protein | BLOSUM62 matrix distance |
| `blosum45` | Protein | BLOSUM45 matrix distance |
| `blosum80` | Protein | BLOSUM80 matrix distance |
| `pam250` | Protein | PAM250 matrix distance |
| `pam30` | Protein | PAM30 matrix distance |
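To make the identity formula concrete, here is a minimal pure-Python sketch of 1 - (identical positions / total) for two aligned sequences. It is an illustration, not Biopython's implementation: the function name is hypothetical, and it counts every column, so a gap aligned to a gap counts as a match, which may differ from how a given library treats gaps.

```python
def identity_distance(seq1, seq2):
    """Proportion of aligned columns where the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if not seq1:
        raise ValueError("empty alignment")
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 1 - matches / len(seq1)


# 2 of 8 columns differ, so the distance is 0.25.
print(identity_distance("ACGTACGT", "ACGTTCGA"))
```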
Building Trees with Distance Methods
Neighbor Joining (NJ)
```python
from Bio import AlignIO, Phylo  # Phylo is needed for draw_ascii
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read('alignment.fasta', 'fasta')
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
nj_tree = constructor.nj(dm)
Phylo.draw_ascii(nj_tree)
```
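To show what neighbor joining does at each step, the sketch below performs only the first join by hand: it computes Q(i, j) = (n - 2) d(i, j) - r_i - r_j for every pair, where r_i is the sum of distances from taxon i, and picks the pair with the smallest Q. This is a pure-Python illustration of the textbook criterion, not Biopython's code; the function name and the five-taxon matrix are hypothetical.

```python
from itertools import combinations


def nj_pick_pair(names, d):
    """Return (pair, branch lengths) for the first neighbor-joining step.

    d[a][b] is the symmetric distance between taxa a and b.
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least three taxa")
    row_sum = {a: sum(d[a][b] for b in names if b != a) for a in names}

    # Q(i, j) = (n - 2) * d(i, j) - r_i - r_j; the pair with minimum Q is joined.
    def q(pair):
        i, j = pair
        return (n - 2) * d[i][j] - row_sum[i] - row_sum[j]

    i, j = min(combinations(names, 2), key=q)

    # Branch lengths from i and j to the new internal node.
    li = d[i][j] / 2 + (row_sum[i] - row_sum[j]) / (2 * (n - 2))
    lj = d[i][j] - li
    return (i, j), (li, lj)


names = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
d = {a: dict(zip(names, row)) for a, row in zip(names, rows)}

pair, lengths = nj_pick_pair(names, d)
print(pair, lengths)  # ('a', 'b') (2.0, 3.0)
```

Note that d and e have the smallest raw distance (3) but are not joined first. The row-sum terms correct for unequal rates, which is why NJ does not need the clock assumption that UPGMA does.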
UPGMA
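```python
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor

# Reuses the distance matrix dm computed in the NJ example
constructor = DistanceTreeConstructor()
upgma_tree = constructor.upgma(dm)
Phylo.draw_ascii(upgma_tree)
```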
One-Step Tree Building
```python
# Build tree directly from alignment
constructor = DistanceTreeConstructor(calculator, 'nj')
tree = constructor.build_tree(alignment)

# Or with UPGMA
constructor = DistanceTreeConstructor(calculator, 'upgma')
tree = constructor.build_tree(alignment)
```
Pairwise Distances Between Taxa
```python
from Bio import Phylo

tree = Phylo.read('tree.nwk', 'newick')

# Distance between two taxa (sum of branch lengths)
taxon1 = tree.find_any(name='Human')
taxon2 = tree.find_any(name='Mouse')
dist = tree.distance(taxon1, taxon2)
print(f'Distance Human-Mouse: {dist:.4f}')

# All pairwise distances
terminals = tree.get_terminals()
for i, t1 in enumerate(terminals):
    for t2 in terminals[i+1:]:
        d = tree.distance(t1, t2)
        print(f'{t1.name}-{t2.name}: {d:.4f}')
```
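The "sum of branch lengths" between two taxa is the length of the path through their closest shared ancestor. The sketch below works that out on a toy tree stored as a child-to-parent mapping. The data layout, function names, taxon names, and branch lengths are all made up for illustration and are unrelated to how Bio.Phylo stores trees.

```python
def path_to_root(node, parent):
    """Return {ancestor: distance from node} for node and all its ancestors."""
    dist = {node: 0.0}
    total = 0.0
    while node in parent:
        up, length = parent[node]
        total += length
        dist[up] = total
        node = up
    return dist


def patristic_distance(a, b, parent):
    """Sum of branch lengths on the path between a and b."""
    from_a = path_to_root(a, parent)
    from_b = path_to_root(b, parent)
    # The path turns around at the shared ancestor closest to both tips,
    # which is the shared ancestor that minimizes the summed distance.
    return min(from_a[n] + from_b[n] for n in from_a if n in from_b)


# Toy tree ((Human:0.1,Mouse:0.3)n1:0.05,Chicken:0.6)root;
parent = {
    "Human": ("n1", 0.1),
    "Mouse": ("n1", 0.3),
    "n1": ("root", 0.05),
    "Chicken": ("root", 0.6),
}

print(round(patristic_distance("Human", "Mouse", parent), 4))    # 0.4
print(round(patristic_distance("Human", "Chicken", parent), 4))  # 0.75
```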
Creating Distance Matrix Manually
```python
from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

names = ['A', 'B', 'C', 'D']

# Lower triangular matrix (including diagonal)
matrix = [
    [0],
    [0.1, 0],
    [0.2, 0.15, 0],
    [0.3, 0.25, 0.2, 0]
]
dm = DistanceMatrix(names, matrix)
print(dm)

# Build tree from custom matrix
constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
```
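Distances exported from other tools often arrive as a full square matrix rather than the lower-triangular layout used above. A minimal sketch of the conversion, with basic sanity checks, might look like the following; the helper name `to_lower_triangular` and its tolerance are hypothetical.

```python
def to_lower_triangular(square, tol=1e-12):
    """Convert a symmetric square matrix to lower-triangular rows (with diagonal)."""
    n = len(square)
    for i, row in enumerate(square):
        if len(row) != n:
            raise ValueError("matrix must be square")
        if abs(row[i]) > tol:
            raise ValueError(f"diagonal entry {i} is not zero")
        for j in range(i):
            if abs(row[j] - square[j][i]) > tol:
                raise ValueError(f"matrix is not symmetric at ({i}, {j})")
    return [list(square[i][: i + 1]) for i in range(n)]


square = [
    [0, 0.1, 0.2, 0.3],
    [0.1, 0, 0.15, 0.25],
    [0.2, 0.15, 0, 0.2],
    [0.3, 0.25, 0.2, 0],
]
lower = to_lower_triangular(square)
for row in lower:
    print(row)
```

The resulting rows have the same shape as the `matrix` list in the example above (row i holds i + 1 values, ending with the zero diagonal).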
Parsimony Tree Construction
Parsimony is largely superseded by ML for most molecular phylogenetics. It remains appropriate for morphological cladistics, rare genomic changes (retroelement insertions, gene order), and as a starting point for ML searches. Parsimony is statistically inconsistent in the Felsenstein zone (long branch attraction).
```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.TreeConstruction import ParsimonyScorer, NNITreeSearcher, ParsimonyTreeConstructor

alignment = AlignIO.read('alignment.fasta', 'fasta')

scorer = ParsimonyScorer()
searcher = NNITreeSearcher(scorer)

# Parsimony needs a starting tree (NJ is standard)
constructor = DistanceTreeConstructor(DistanceCalculator('identity'), 'nj')
starting_tree = constructor.build_tree(alignment)

pars_constructor = ParsimonyTreeConstructor(searcher, starting_tree)
pars_tree = pars_constructor.build_tree(alignment)

print(f'Parsimony score: {scorer.get_score(pars_tree, alignment)}')
Phylo.draw_ascii(pars_tree)
```
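A parsimony score is the minimum number of state changes a tree needs to explain the alignment. The sketch below counts changes with Fitch's algorithm (unweighted, unordered states) on a tiny rooted binary tree written as nested tuples. It illustrates the idea only and is not Biopython's implementation; the tree encoding, function names, and four-taxon alignment are made up.

```python
def fitch(node, states):
    """Return (state set, change count) for one alignment column.

    node is a leaf name (str) or a 2-tuple of child nodes; states maps
    leaf name -> character at this column.
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    shared = left_set & right_set
    if shared:
        return shared, left_cost + right_cost
    # No shared state: one substitution is needed below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum Fitch change counts over all columns of a dict of aligned sequences."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


tree = (("A", "B"), ("C", "D"))
alignment = {"A": "ACGT", "B": "ACGA", "C": "TCGA", "D": "TCGT"}
print(parsimony_score(tree, alignment))  # 3
```

Here the first column needs one change (A/A versus T/T), the two invariant columns need none, and the last column needs two because its pattern conflicts with the tree.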
Bootstrap Analysis
```python
from Bio import AlignIO, Phylo  # Phylo is needed for draw_ascii
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.Consensus import bootstrap_trees, bootstrap_consensus, majority_consensus

alignment = AlignIO.read('alignment.fasta', 'fasta')
calculator = DistanceCalculator('identity')
constructor = DistanceTreeConstructor(calculator, 'nj')

# Generate bootstrap trees
boot_trees = list(bootstrap_trees(alignment, 100, constructor))
print(f'Generated {len(boot_trees)} bootstrap trees')

# Get bootstrap consensus
consensus = bootstrap_consensus(alignment, 100, constructor, majority_consensus)
Phylo.draw_ascii(consensus)
```
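Each bootstrap replicate is a pseudo-alignment built by sampling the original columns with replacement, so some sites appear several times and others not at all. A minimal standard-library sketch of that resampling step follows; the function name, the dict-of-strings alignment layout, and the toy sequences are assumptions for illustration.

```python
import random


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    # The same column indices are used for every sequence, so sites stay aligned.
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


alignment = {"A": "ACGTAC", "B": "ACGAAC", "C": "TCGAAT"}
rng = random.Random(42)  # fixed seed only to make the sketch repeatable
replicate = bootstrap_columns(alignment, rng)
for name, seq in replicate.items():
    print(name, seq)
```

A tree is then built from every replicate, and the share of replicate trees containing a given clade is reported as that clade's bootstrap support.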
Consensus Tree Methods
```python
from Bio import Phylo
from Bio.Phylo.Consensus import strict_consensus, majority_consensus, adam_consensus

trees = list(Phylo.parse('bootstrap.nwk', 'newick'))

# Strict consensus (only clades in ALL trees)
strict = strict_consensus(trees)

# Majority rule consensus (clades in >50% of trees)
majority = majority_consensus(trees, cutoff=0.5)

# Adam consensus
adam = adam_consensus(trees)

Phylo.draw_ascii(majority)
```
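Majority-rule consensus comes down to counting how often each clade (set of tips) appears across the input trees and keeping those above the cutoff. The sketch below shows only that counting step on trees that have already been reduced to sets of clades. The representation and function name are hypothetical, and assembling the kept clades back into a tree is omitted.

```python
from collections import Counter


def majority_clades(trees, cutoff=0.5):
    """Return {clade: frequency} for clades found in more than cutoff of the trees.

    Each tree is given as a collection of clades; each clade is a frozenset of
    tip names.
    """
    counts = Counter(clade for tree in trees for clade in set(tree))
    total = len(trees)
    return {clade: n / total for clade, n in counts.items() if n / total > cutoff}


# Three made-up trees on tips A-D; two agree on (A,B) and (C,D).
trees = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC"), frozenset("BD")},
]
support = majority_clades(trees)
for clade, freq in sorted(support.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(clade), f"{freq:.2f}")
```

Raising the cutoff makes the consensus more conservative; a strict consensus corresponds to keeping only clades present in every tree.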
Tree Depths and Total Length
```python
from Bio import Phylo

tree = Phylo.read('tree.nwk', 'newick')

# Total branch length
total = tree.total_branch_length()
print(f'Total branch length: {total:.4f}')

# Depths from root to each node
depths = tree.depths()
for clade, depth in depths.items():
    if clade.is_terminal():
        print(f'{clade.name}: {depth:.4f}')

# Maximum depth (tree height)
tree_height = max(depths.values())
print(f'Tree height: {tree_height:.4f}')
```
Comparing Tree Distances
```python
from Bio import Phylo

tree1 = Phylo.read('tree1.nwk', 'newick')
tree2 = Phylo.read('tree2.nwk', 'newick')

# Compare total branch lengths
len1 = tree1.total_branch_length()
len2 = tree2.total_branch_length()
print(f'Tree 1 total: {len1:.4f}')
print(f'Tree 2 total: {len2:.4f}')

# Compare specific pairwise distances
taxa = ['Human', 'Mouse']
t1 = [tree1.find_any(name=t) for t in taxa]
t2 = [tree2.find_any(name=t) for t in taxa]
d1 = tree1.distance(t1[0], t1[1])
d2 = tree2.distance(t2[0], t2[1])
print(f'Human-Mouse distance: Tree1={d1:.4f}, Tree2={d2:.4f}')
```
Complete Pipeline: Alignment to Bootstrapped Tree
Goal: Build a phylogenetic tree from a sequence alignment with bootstrap support assessment for branch confidence.
Approach: Read the alignment, compute an identity-based distance matrix, construct a neighbor-joining tree, then generate a majority-rule bootstrap consensus from 100 replicates.
```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.Consensus import bootstrap_consensus, majority_consensus

alignment = AlignIO.read('sequences.aln', 'clustal')
print(f'Alignment: {len(alignment)} sequences, {alignment.get_alignment_length()} positions')

calculator = DistanceCalculator('identity')
constructor = DistanceTreeConstructor(calculator, 'nj')

# Build simple tree
simple_tree = constructor.build_tree(alignment)
simple_tree.ladderize()

# Build bootstrap consensus (100 replicates)
consensus_tree = bootstrap_consensus(alignment, 100, constructor, majority_consensus)
consensus_tree.ladderize()

Phylo.write(simple_tree, 'nj_tree.nwk', 'newick')
Phylo.write(consensus_tree, 'bootstrap_consensus.nwk', 'newick')
```
Quick Reference: Distance Models
DNA Models
| Model | Description |
|---|---|
| `identity` | Simple mismatch counting |
| `blastn` | BLASTN-style scoring |
| `trans` | Weights transitions vs transversions |
Protein Models
| Model | Description |
|---|---|
| `blosum62` | General proteins |
| `blosum45` | Divergent proteins |
| `blosum80` | Similar proteins |
| `pam250` | Distant homologs |
| `pam30` | Close homologs |
Related Skills
- tree-io - Save constructed trees to files
- tree-visualization - Draw resulting trees
- tree-manipulation - Root and process built trees
- modern-tree-inference - ML tree inference for publication-quality results
- alignment/alignment-io - Read alignments for tree building
- alignment/msa-statistics - Alignment quality before tree building