BioSkills bio-phylo-distance-calculations

Compute evolutionary distances and build phylogenetic trees using Biopython Bio.Phylo.TreeConstruction. Use when creating distance matrices from alignments, building NJ/UPGMA trees, generating bootstrap consensus, or needing quick exploratory phylogenies before running full ML analysis.

install
source · Clone the upstream repo
git clone https://github.com/GPTomics/bioSkills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/phylogenetics/distance-calculations" ~/.claude/skills/gptomics-bioskills-bio-phylo-distance-calculations && rm -rf "$T"
manifest: phylogenetics/distance-calculations/SKILL.md

Version Compatibility

Reference examples tested with: BioPython 1.83+, NCBI BLAST+ 2.15+

Before using code patterns, verify installed versions match. If versions differ:

  • Python: run pip show <package>, then help(module.function) to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Distance Calculations and Tree Building

"Build a phylogenetic tree from my alignment" → Compute evolutionary distance matrices from sequence alignments and construct neighbor-joining or UPGMA trees with bootstrap support.

  • Python: Bio.Phylo.TreeConstruction.DistanceCalculator(), DistanceTreeConstructor()
Compute distances from alignments and construct phylogenetic trees.

When to Use Distance Methods vs ML

| Scenario | Recommended Method |
| --- | --- |
| Quick exploratory tree before committing to a long ML run | NJ |
| Sanity check on data quality (unexpected groupings?) | NJ |
| Very large datasets where ML is prohibitive | NJ |
| Molecular clock data (ultrametric trees) | UPGMA (rare) |
| Publication-quality trees | ML (IQ-TREE2/RAxML-NG) or Bayesian |
| Formal hypothesis testing | ML or Bayesian |

NJ runs in O(n^3) time, which makes it fast enough for exploration. For any analysis intended for publication, use ML methods (see the modern-tree-inference skill). Distance trees also feed into ML searches: IQ-TREE includes a BIONJ tree among its starting trees.
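The cubic cost comes from scanning all pairs of taxa at each of the roughly n joining steps. As a minimal sketch of one such step, the function below computes the NJ selection criterion Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k) and returns the pair with the smallest value. The name nj_pick_pair and the five-taxon matrix are illustrative, not part of Biopython.

```python
def nj_pick_pair(names, d):
    """Return the pair minimising the NJ criterion
    Q(i, j) = (n - 2) * d[i][j] - sum_k d[i][k] - sum_k d[j][k]."""
    n = len(names)
    row_sums = [sum(row) for row in d]
    best, best_q = None, float('inf')
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best, best_q = (names[i], names[j]), q
    return best, best_q


# Toy full symmetric distance matrix (hypothetical values)
names = ['a', 'b', 'c', 'd', 'e']
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, q = nj_pick_pair(names, d)
print(pair, q)  # ('a', 'b') -50
```

Note that the smallest raw distance here is d-e (3), yet NJ joins a and b first. The criterion subtracts each taxon's total distance to all others, which is what lets NJ tolerate unequal rates across lineages.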

UPGMA warning: UPGMA assumes a molecular clock (equal rates across all lineages). This assumption is almost never met for molecular data. Use NJ instead unless clocklike behavior has been verified.
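One quick way to screen for clocklike behavior is the three-point condition: in an ultrametric distance matrix, the two largest of the three pairwise distances among any three taxa are equal. The sketch below applies that test to a plain dict-of-dicts matrix. The function name is_ultrametric, the tolerance, and the toy values are assumptions for illustration. Real data are noisy, so a looser tolerance (or a formal clock test in an ML framework) would be needed in practice.

```python
from itertools import combinations


def is_ultrametric(dist, tol=1e-6):
    """Return True if every taxon triple satisfies the three-point condition.

    dist is a symmetric dict of dicts: dist[a][b] == dist[b][a].
    """
    for a, b, c in combinations(sorted(dist), 3):
        # Sort the three pairwise distances; the two largest must match.
        d = sorted([dist[a][b], dist[a][c], dist[b][c]])
        if abs(d[2] - d[1]) > tol:
            return False
    return True


# Toy matrices (hypothetical values)
clocklike = {
    'A': {'A': 0.0, 'B': 0.2, 'C': 0.6},
    'B': {'A': 0.2, 'B': 0.0, 'C': 0.6},
    'C': {'A': 0.6, 'B': 0.6, 'C': 0.0},
}
non_clocklike = {
    'A': {'A': 0.0, 'B': 0.2, 'C': 0.9},
    'B': {'A': 0.2, 'B': 0.0, 'C': 0.6},
    'C': {'A': 0.9, 'B': 0.6, 'C': 0.0},
}

print(is_ultrametric(clocklike))      # True
print(is_ultrametric(non_clocklike))  # False
```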

Evolutionary Distance Corrections

Raw identity-based distances underestimate true evolutionary distance because they do not account for multiple substitutions at the same site. For divergent sequences, corrected distances are more appropriate:

| Model | Correction | Use When |
| --- | --- | --- |
| Identity | None (raw mismatch proportion) | Closely related sequences; quick exploration |
| Jukes-Cantor | Assumes equal substitution rates | Simple correction for moderate divergence |
| Kimura 2-parameter | Distinguishes transitions from transversions | Better for DNA when Ti/Tv ratio differs from 1 |

Biopython's DistanceCalculator models (identity, blastn, trans) produce score-based dissimilarities; they do not apply Jukes-Cantor or Kimura corrections for multiple substitutions. For model-based evolutionary distances, use the ML distances from IQ-TREE2 (.mldist output file).
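For reference, the two textbook corrections in the table have closed forms: Jukes-Cantor is d = -(3/4) ln(1 - 4p/3) for a mismatch proportion p, and Kimura 2-parameter is d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)) for transition proportion P and transversion proportion Q. The sketch below implements both in plain Python. The function names are illustrative and are not DistanceCalculator models; the input proportions are made-up values.

```python
import math


def jukes_cantor(p):
    """JC69 distance from the proportion of differing sites p (0 <= p < 0.75)."""
    if not 0 <= p < 0.75:
        raise ValueError('JC69 is undefined for p >= 0.75 (saturation)')
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """K2P distance from transition (P) and transversion (Q) proportions."""
    a = 1 - 2 * P - Q
    b = 1 - 2 * Q
    if a <= 0 or b <= 0:
        raise ValueError('K2P is undefined for these proportions (saturation)')
    return -0.5 * math.log(a * math.sqrt(b))


p = 0.30  # hypothetical raw mismatch proportion
print(f'raw p-distance: {p:.4f}')
print(f'JC69 distance:  {jukes_cantor(p):.4f}')
print(f'K2P distance:   {kimura_2p(0.20, 0.10):.4f}')
```

Both corrected values exceed the raw proportion, and the gap widens as sequences diverge. When transitions make up exactly one third of the differences (P = p/3, Q = 2p/3), K2P reduces to the Jukes-Cantor value.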

Required Import

from Bio import Phylo, AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.TreeConstruction import DistanceMatrix
from Bio.Phylo.TreeConstruction import ParsimonyScorer, ParsimonyTreeConstructor, NNITreeSearcher
from Bio.Phylo.Consensus import strict_consensus, majority_consensus, bootstrap_trees, bootstrap_consensus

Distance Matrix from Alignment

from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator

alignment = AlignIO.read('alignment.fasta', 'fasta')

# Create calculator with distance model
calculator = DistanceCalculator('identity')  # Simple identity-based distance
dm = calculator.get_distance(alignment)
print(dm)

# Available models for DNA
calculator = DistanceCalculator('blastn')  # BLASTN-style distance

# Available models for protein
calculator = DistanceCalculator('blosum62')  # BLOSUM62-based distance

Available Distance Models

| Model | Type | Description |
| --- | --- | --- |
| identity | DNA/Protein | 1 - (identical positions / total) |
| blastn | DNA | BLASTN scoring distance |
| trans | DNA | Transition/transversion weighted |
| blosum62 | Protein | BLOSUM62 matrix distance |
| blosum45 | Protein | BLOSUM45 matrix distance |
| blosum80 | Protein | BLOSUM80 matrix distance |
| pam250 | Protein | PAM250 matrix distance |
| pam30 | Protein | PAM30 matrix distance |
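To make the identity row concrete, here is a minimal pure-Python sketch of 1 - (identical positions / total) for two aligned sequences. The gap handling (gap and stop characters never count as matches but still count toward the length) is an assumption of this sketch; check the installed Biopython version if exact agreement with DistanceCalculator('identity') matters.

```python
def identity_distance(seq1, seq2, skip=('-', '*')):
    """1 - (matching positions / alignment length) for two aligned sequences.

    Positions where either sequence has a character in `skip` never count
    as matches but still count toward the length (an assumption of this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError('sequences must be aligned to the same length')
    if not seq1:
        return 1.0
    matches = sum(
        a == b for a, b in zip(seq1, seq2)
        if a not in skip and b not in skip
    )
    return 1 - matches / len(seq1)


print(identity_distance('ACGTACGT', 'ACGTACGA'))  # one mismatch in eight: 0.125
print(identity_distance('ACGT-CGT', 'ACGTACGT'))  # gap column counts as a mismatch: 0.125
```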

Building Trees with Distance Methods

Neighbor Joining (NJ)

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

alignment = AlignIO.read('alignment.fasta', 'fasta')
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)

constructor = DistanceTreeConstructor()
nj_tree = constructor.nj(dm)
Phylo.draw_ascii(nj_tree)

UPGMA

# Reuses dm (and the imports) from the NJ example
constructor = DistanceTreeConstructor()
upgma_tree = constructor.upgma(dm)
Phylo.draw_ascii(upgma_tree)

One-Step Tree Building

# Build tree directly from alignment
constructor = DistanceTreeConstructor(calculator, 'nj')
tree = constructor.build_tree(alignment)

# Or with UPGMA
constructor = DistanceTreeConstructor(calculator, 'upgma')
tree = constructor.build_tree(alignment)

Pairwise Distances Between Taxa

from Bio import Phylo

tree = Phylo.read('tree.nwk', 'newick')

# Distance between two taxa (sum of branch lengths)
taxon1 = tree.find_any(name='Human')
taxon2 = tree.find_any(name='Mouse')
dist = tree.distance(taxon1, taxon2)
print(f'Distance Human-Mouse: {dist:.4f}')

# All pairwise distances
terminals = tree.get_terminals()
for i, t1 in enumerate(terminals):
    for t2 in terminals[i+1:]:
        d = tree.distance(t1, t2)
        print(f'{t1.name}-{t2.name}: {d:.4f}')

Creating Distance Matrix Manually

from Bio.Phylo.TreeConstruction import DistanceMatrix, DistanceTreeConstructor

names = ['A', 'B', 'C', 'D']
# Lower triangular matrix (including diagonal)
matrix = [
    [0],
    [0.1, 0],
    [0.2, 0.15, 0],
    [0.3, 0.25, 0.2, 0]
]
dm = DistanceMatrix(names, matrix)
print(dm)

# Build tree from custom matrix
constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
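Distances exported from other tools often arrive as a full square matrix rather than the lower-triangular layout shown above. A small conversion helper, sketched here in plain Python, keeps columns 0..i of row i. The name to_lower_triangular and the symmetry tolerance are illustrative choices.

```python
def to_lower_triangular(square):
    """Convert a full symmetric matrix (list of lists) to lower-triangular
    rows that include the diagonal: row i keeps columns 0..i."""
    n = len(square)
    for i, row in enumerate(square):
        if len(row) != n:
            raise ValueError('matrix must be square')
        for j in range(i):
            if abs(row[j] - square[j][i]) > 1e-9:
                raise ValueError(f'matrix is not symmetric at ({i}, {j})')
    return [list(square[i][: i + 1]) for i in range(n)]


# Same toy distances as the manual example, written as a square matrix
square = [
    [0.0, 0.1, 0.2, 0.3],
    [0.1, 0.0, 0.15, 0.25],
    [0.2, 0.15, 0.0, 0.2],
    [0.3, 0.25, 0.2, 0.0],
]
lower = to_lower_triangular(square)
print(lower)
```

The resulting list can be passed with the matching names as DistanceMatrix(names, lower).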

Parsimony Tree Construction

Parsimony is largely superseded by ML for most molecular phylogenetics. It remains appropriate for morphological cladistics, rare genomic changes (retroelement insertions, gene order), and as a starting point for ML searches. Parsimony is statistically inconsistent in the Felsenstein zone (long branch attraction).

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.TreeConstruction import ParsimonyScorer, NNITreeSearcher, ParsimonyTreeConstructor

alignment = AlignIO.read('alignment.fasta', 'fasta')

scorer = ParsimonyScorer()
searcher = NNITreeSearcher(scorer)

# The NNI search starts from an initial tree; here an NJ tree is supplied explicitly
constructor = DistanceTreeConstructor(DistanceCalculator('identity'), 'nj')
starting_tree = constructor.build_tree(alignment)

pars_constructor = ParsimonyTreeConstructor(searcher, starting_tree)
pars_tree = pars_constructor.build_tree(alignment)

print(f'Parsimony score: {scorer.get_score(pars_tree, alignment)}')
Phylo.draw_ascii(pars_tree)

Bootstrap Analysis

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.Consensus import bootstrap_trees, bootstrap_consensus, majority_consensus

alignment = AlignIO.read('alignment.fasta', 'fasta')
calculator = DistanceCalculator('identity')
constructor = DistanceTreeConstructor(calculator, 'nj')

# Generate bootstrap trees
boot_trees = list(bootstrap_trees(alignment, 100, constructor))
print(f'Generated {len(boot_trees)} bootstrap trees')

# Get bootstrap consensus (this call draws its own fresh set of 100 replicates)
consensus = bootstrap_consensus(alignment, 100, constructor, majority_consensus)
Phylo.draw_ascii(consensus)
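Each bootstrap replicate is a pseudo-alignment built by sampling alignment columns with replacement, so some sites appear several times and others not at all. The sketch below shows that resampling step on plain strings. The function name resample_columns, the seed, and the toy alignment are illustrative and say nothing about how Biopython implements it internally.

```python
import random


def resample_columns(seqs, rng):
    """Return one bootstrap replicate: columns drawn with replacement."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError('all sequences must have the same aligned length')
    # Draw as many column indices as the alignment has columns
    cols = [rng.randrange(length) for _ in range(length)]
    return [''.join(s[c] for c in cols) for s in seqs]


rng = random.Random(42)  # fixed seed so the replicate is reproducible
aligned = ['ACGTAC', 'ACGTTC', 'ACCTAC']  # toy alignment (hypothetical)
replicate = resample_columns(aligned, rng)
print(replicate)
```

Because whole columns are sampled, every site keeps its pattern across taxa; only the site frequencies change between replicates.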

Consensus Tree Methods

from Bio import Phylo
from Bio.Phylo.Consensus import strict_consensus, majority_consensus, adam_consensus

trees = list(Phylo.parse('bootstrap.nwk', 'newick'))

# Strict consensus (only clades in ALL trees)
strict = strict_consensus(trees)

# Majority rule consensus (clades in >50% of trees)
majority = majority_consensus(trees, cutoff=0.5)

# Adam consensus
adam = adam_consensus(trees)

Phylo.draw_ascii(majority)
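The idea behind majority-rule consensus is clade counting: a clade is kept when it appears in more than the cutoff fraction of input trees. The sketch below illustrates this on trees represented simply as sets of clades (each clade a frozenset of taxon names). This representation and the name majority_clades are assumptions for illustration; the sketch does not build a tree object.

```python
from collections import Counter


def majority_clades(trees, cutoff=0.5):
    """Return {clade: frequency} for clades found in more than `cutoff`
    of the trees. Each tree is a collection of clades, and each clade
    is a frozenset of taxon names."""
    counts = Counter(clade for tree in trees for clade in set(tree))
    n = len(trees)
    return {c: k / n for c, k in counts.items() if k / n > cutoff}


AB = frozenset({'A', 'B'})
CD = frozenset({'C', 'D'})
AC = frozenset({'A', 'C'})
BD = frozenset({'B', 'D'})
# Three toy trees (hypothetical), each listed by its non-trivial clades
trees = [{AB, CD}, {AB, CD}, {AC, BD}]

support = majority_clades(trees)
for clade, freq in sorted(support.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(clade), round(freq, 2))
```

Here the clades A+B and C+D each appear in two of three trees and are kept; A+C and B+D appear once and are dropped. A strict consensus corresponds to keeping only clades with frequency 1.0.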

Tree Depths and Total Length

from Bio import Phylo

tree = Phylo.read('tree.nwk', 'newick')

# Total branch length
total = tree.total_branch_length()
print(f'Total branch length: {total:.4f}')

# Depths from root to each node
depths = tree.depths()
for clade, depth in depths.items():
    if clade.is_terminal():
        print(f'{clade.name}: {depth:.4f}')

# Maximum depth (tree height)
tree_height = max(depths.values())
print(f'Tree height: {tree_height:.4f}')

Comparing Tree Distances

from Bio import Phylo

tree1 = Phylo.read('tree1.nwk', 'newick')
tree2 = Phylo.read('tree2.nwk', 'newick')

# Compare total branch lengths
len1 = tree1.total_branch_length()
len2 = tree2.total_branch_length()
print(f'Tree 1 total: {len1:.4f}')
print(f'Tree 2 total: {len2:.4f}')

# Compare specific pairwise distances
taxa = ['Human', 'Mouse']
t1 = [tree1.find_any(name=t) for t in taxa]
t2 = [tree2.find_any(name=t) for t in taxa]

d1 = tree1.distance(t1[0], t1[1])
d2 = tree2.distance(t2[0], t2[1])
print(f'Human-Mouse distance: Tree1={d1:.4f}, Tree2={d2:.4f}')

Complete Pipeline: Alignment to Bootstrapped Tree

Goal: Build a phylogenetic tree from a sequence alignment with bootstrap support assessment for branch confidence.

Approach: Read the alignment, compute an identity-based distance matrix, construct a neighbor-joining tree, then generate a majority-rule bootstrap consensus from 100 replicates.

from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.Consensus import bootstrap_consensus, majority_consensus

alignment = AlignIO.read('sequences.aln', 'clustal')
print(f'Alignment: {len(alignment)} sequences, {alignment.get_alignment_length()} positions')

calculator = DistanceCalculator('identity')
constructor = DistanceTreeConstructor(calculator, 'nj')

# Build simple tree
simple_tree = constructor.build_tree(alignment)
simple_tree.ladderize()

# Build bootstrap consensus (100 replicates)
consensus_tree = bootstrap_consensus(alignment, 100, constructor, majority_consensus)
consensus_tree.ladderize()

Phylo.write(simple_tree, 'nj_tree.nwk', 'newick')
Phylo.write(consensus_tree, 'bootstrap_consensus.nwk', 'newick')

Quick Reference: Distance Models

DNA Models

| Model | Description |
| --- | --- |
| identity | Simple mismatch counting |
| blastn | BLASTN-style scoring |
| trans | Weights transitions vs transversions |

Protein Models

| Model | Description |
| --- | --- |
| blosum62 | General proteins |
| blosum45 | Divergent proteins |
| blosum80 | Similar proteins |
| pam250 | Distant homologs |
| pam30 | Close homologs |

Related Skills

  • tree-io - Save constructed trees to files
  • tree-visualization - Draw resulting trees
  • tree-manipulation - Root and process built trees
  • modern-tree-inference - ML tree inference for publication-quality results
  • alignment/alignment-io - Read alignments for tree building
  • alignment/msa-statistics - Alignment quality before tree building