BioSkills bio-phylo-divergence-dating
Estimate divergence times using molecular clock models with BEAST2, MCMCTree, and TreePL. Use when dating speciation events, calibrating phylogenies with fossils, choosing between strict and relaxed clock models, or estimating evolutionary rates across lineages.
git clone https://github.com/GPTomics/bioSkills
T=$(mktemp -d) && git clone --depth=1 https://github.com/GPTomics/bioSkills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/phylogenetics/divergence-dating" ~/.claude/skills/gptomics-bioskills-bio-phylo-divergence-dating && rm -rf "$T"
phylogenetics/divergence-dating/SKILL.mdVersion Compatibility
Reference examples tested with: BEAST2 2.7+, MCMCTree (PAML 4.10+), TreePL 1.0+, Python 3.9+
Before using code patterns, verify installed versions match. If versions differ:
- CLI:
thenbeast -version
to confirm flagsbeast -help - CLI:
or check PAML documentation for parameter namesmcmctree --help - CLI:
to confirm installationtreePL - Python:
thenpip show biopython
to check signatureshelp(module.function)
If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Divergence Time Estimation
"Estimate when lineages diverged" -> Combine phylogenetic trees with fossil calibrations and molecular clock models to infer absolute divergence times.
- CLI: BEAUti + BEAST2 for moderate data with complex models
- CLI: MCMCTree (PAML) for genome-scale data with approximate likelihood
- CLI: TreePL for very large trees with penalized likelihood
Clock Model Selection
| Model | Assumption | When to Use |
|---|---|---|
| Strict clock | Constant rate across all lineages | Only when a molecular clock test does not reject clocklike behavior; rarely appropriate for inter-species data |
| UCLN (uncorrelated lognormal) | Branch rates drawn independently from lognormal | Default choice; rates vary across lineages without parent-child correlation |
| Uncorrelated exponential | Rates from exponential distribution | More extreme rate variation; when UCLN coefficient of variation is very high |
| Autocorrelated (ACLN) | Rates correlated with parent branch | When evolutionary rates change gradually (e.g., body size evolution driving rate changes) |
| Random local clock | Few distinct rate regimes | When a small number of rate shifts are suspected rather than continuous variation |
Clock Model Diagnostics
Under UCLN in BEAST2, the posterior of
ucld.stdev measures clocklikeness:
| ucld.stdev | Interpretation |
|---|---|
| ~0 | Data is clocklike; strict clock may suffice |
| 0.1-0.5 | Moderate rate variation; UCLN appropriate |
| >1.0 | Extreme rate variation; consider model adequacy or data partitioning |
In MCMCTree, the equivalent diagnostic is
sigma2 under clock=2 (independent rates). Near-zero values suggest a strict clock; very large values indicate substantial rate heterogeneity.
Fossil Calibration Strategies
Node Calibration (Traditional)
Prior distributions constrain the age of internal nodes based on fossil evidence:
| Distribution | When to Use | Parameterization |
|---|---|---|
| Lognormal | Most common; fossil sets minimum, soft maximum | Offset = fossil minimum age, mean and stdev control tail |
| Exponential | Strong belief divergence is close to fossil age | Rate parameter controls decay from minimum |
| Uniform | Only minimum and maximum known | Dangerous with narrow bounds; truncation distorts posteriors |
| Normal | Independently known divergence age (e.g., biogeographic event) | Mean = best estimate, stdev = uncertainty |
Critical Calibration Pitfalls
These are the most common sources of error in divergence dating, and each can silently produce incorrect results:
-
Prior interaction: Multiple calibration priors interact through the tree prior (birth-death or Yule). The effective prior on a node may differ substantially from the specified prior. ALWAYS run the analysis sampling from the prior only (no data) first and compare effective priors to specified priors. If they differ, adjust calibration distributions.
-
Truncation effects: When a calibration density extends beyond the root age, the density gets truncated and renormalized, distorting the effective prior. This is especially problematic for uniform and lognormal priors near the root.
-
Secondary calibrations: Using divergence time estimates from a previous study as calibrations compounds uncertainty. The original confidence intervals become point-like constraints in the new analysis. When secondary calibrations are unavoidable, use substantially wider priors than the original confidence intervals.
-
Fossil placement: The single largest source of calibration error is incorrect phylogenetic placement of fossils. A fossil assigned to the wrong node can bias all downstream estimates. Verify fossil placement against current morphological and phylogenetic evidence.
-
Minimum vs maximum confusion: Fossils provide minimum age constraints (the lineage is at least as old as the fossil). Maximum ages require additional geological or biogeographic arguments and are harder to justify.
Tip-Dating and Fossilized Birth-Death (FBD)
FBD vs Node Calibration
| Factor | Node Calibration | Fossilized Birth-Death |
|---|---|---|
| Fossil usage | Few well-placed fossils | All available fossils as tips |
| Prior distributions | Ad hoc (lognormal, uniform) | Coherent birth-death framework |
| Morphological data | Not required | Beneficial but optional |
| Extinct taxa | Excluded | Included as tips |
| Setup complexity | Lower | Higher (morphological matrix, FBD parameters) |
| Software | BEAST2, MCMCTree | BEAST2 (SA package) |
When to prefer FBD: Rich fossil record available, morphological data matrix exists, wanting to include extinct taxa, or concerned about ad hoc prior distributions.
When to prefer node calibration: Few well-placed fossils, no morphological data, simpler setup needed, or working with MCMCTree.
Software Decision Framework
| Scenario | Recommended Tool | Rationale |
|---|---|---|
| Genome-scale data (>100 loci) | MCMCTree | Approximate likelihood is 100-1000x faster than exact |
| Moderate data, complex models | BEAST2 | Richest model ecosystem, FBD support |
| Very large tree (>1000 taxa) | TreePL | Penalized likelihood scales to 10,000+ taxa |
| Fossils as tips needed | BEAST2 + SA package | Only tool with FBD tip-dating |
| Mixture models + dating | IQ2MC (IQ-TREE 3 + MCMCTree) | Combines mixture model fit with Bayesian dating |
| Quick exploratory dating | TreePL | Fast; correlates well with Bayesian estimates |
BEAST2 Dating Workflow
Goal: Estimate divergence times with calibrated relaxed clock in BEAST2.
Approach: Prepare XML via BEAUti, run BEAST2, verify convergence, summarize results.
Step-by-Step Workflow
- Prepare alignment in BEAUti: import FASTA/NEXUS, set data type
- Set substitution model: use bModelTest for model averaging (recommended) or fix a model
- Set clock model: UCLN is the default choice; set initial clock rate based on prior knowledge
- Set tree prior: Birth-Death for extant-only data, FBD when including fossils as tips
- Add fossil calibrations: specify prior distributions with justified parameters for each calibrated node
- Run from prior first: sample from the prior only (uncheck "use data" or set
) to verify effective priors match expectationssampleFromPrior="true" - Run full analysis: at least 2 independent runs with different random seeds
- Check convergence: Tracer: all ESS values >= 200; compare posterior traces between runs
- Check topological convergence: use RWTY or similar
- Combine and summarize: LogCombiner (remove burn-in from each run), then TreeAnnotator for MCC tree
BEAST2 Clock Rate Prior
Goal: Set an informable prior on the clock rate to avoid poor mixing.
Approach: Estimate the expected substitution rate from prior information and set the clock rate prior accordingly.
# Rough clock rate estimation: # rate = divergence / (2 * time) # For mammals: ~0.01 substitutions/site/Myr for nuclear genes # For mitochondrial: ~0.02 substitutions/site/Myr # Set lognormal prior with M = ln(expected_rate), S = 0.5-1.0
Key BEAST2 Packages
| Package | Purpose |
|---|---|
| SA (sampled ancestors) | FBD tip-dating |
| bModelTest | Bayesian model averaging |
| SNAPP | Species tree from SNPs |
| ORC | Optimized relaxed clock operators |
| CoupledMCMC | Parallel tempering for difficult posteriors |
MCMCTree Workflow
Goal: Date a large phylogeny using approximate likelihood in MCMCTree.
Approach: Prepare control file, run approximate likelihood calculation in two steps, then MCMC sampling.
Two-Step Approximate Likelihood
The approximate likelihood method makes MCMCTree practical for genome-scale data:
# Step 1: Generate gradient and Hessian (in.BV file) # Set usedata = 3 in control file, then run: mcmctree mcmctree_step1.ctl # Step 2: Run MCMC with approximate likelihood # Set usedata = 2 in control file, then run: mcmctree mcmctree_step2.ctl
MCMCTree Calibration Format
MCMCTree uses specific notation in the tree file for calibrations. Use the
B(), L(), U() format (NOT the > < notation, which has a known parsing bug in some versions):
| Notation | Meaning | Example |
|---|---|---|
| Soft bounds (lower, upper) | : 60-80 Ma, 2.5% tail probability |
| Minimum bound (truncated Cauchy) | : min 50 Ma |
| Maximum bound only | : max 100 Ma |
| Skew-normal minimum bound | |
| Skew-t with upper bound | Full bounded calibration |
Times are in units of 100 Myr by convention (so 0.6 = 60 Ma), though any consistent unit works if
rgene_gamma is adjusted.
MCMCTree Practical Tips
- Root calibration is mandatory. MCMCTree will not run without a calibration on the root node; set via
in the control file or directly in the tree fileRootAge - Acceptance proportions should be 20-40% (ideal ~30%); adjust
parameters if outside this rangefinetune - Independent runs: always run at least 2 with different seeds and compare posteriors
- Prior vs posterior: run with
to sample from the prior; compare with posterior to assess data informativenessusedata = 0 - BDparas: birth-death parameters (lambda, mu, rho) affect the time prior on uncalibrated nodes; sensitivity analysis recommended
- rgene_gamma: sets the prior on the overall substitution rate; parameterize based on prior knowledge of the clade
MCMCTree Control File Parameters
Goal: Generate a properly configured MCMCTree control file.
Approach: Set key parameters with appropriate defaults and document rationale for non-default choices.
# Key MCMCTree control file parameters params = { 'seed': -1, # random seed (-1 for clock-based) 'seqfile': 'alignment.phy', # phylip format alignment 'treefile': 'tree.nwk', # newick tree with calibrations 'outfile': 'mcmctree.out', # output file 'ndata': 1, # number of data partitions 'usedata': 2, # 0=no data, 1=exact, 2=approx, 3=generate in.BV 'clock': 2, # 1=strict, 2=independent rates, 3=correlated rates 'model': 4, # 0=JC69, 4=HKY85, 7=REV (GTR) 'alpha': 0.5, # gamma shape for rate variation among sites 'ncatG': 4, # number of gamma categories 'cleandata': 0, # 0=keep ambiguity data, 1=remove 'BDparas': '1 1 0.1', # birth rate, death rate, sampling fraction 'rgene_gamma': '2 20 1', # gamma prior on rates: alpha, beta, dirichlet 'sigma2_gamma': '1 10 1', # gamma prior on sigma2 (rate drift) 'print': 1, # output detail level 'burnin': 50000, # burn-in iterations 'sampfreq': 50, # sampling frequency 'nsample': 20000, # number of posterior samples }
TreePL
Goal: Rapidly estimate divergence times on very large phylogenies.
Approach: Penalized likelihood with cross-validation for smoothing parameter selection.
TreePL Configuration
treefile = input.tre smooth = 100 numsites = 1000 mrca = cal1 taxonA taxonB min = cal1 50 max = cal1 80 mrca = cal2 taxonC taxonD min = cal2 100 max = cal2 120 outfile = dated.tre thorough opt = 1 optad = 1 optcvad = 1 moredetail cv
TreePL Key Points
- Cross-validation (
flag): determines optimal smoothing parameter; always enable for final analysescv - Smoothing parameter: controls the penalty on rate variation; too low = overfitting, too high = strict clock
- Scalability: handles trees with 10,000+ taxa where Bayesian methods are impractical
- Correlation with Bayesian: TreePL estimates correlate well (R-squared > 0.94) with BEAST2 and MCMCTree on benchmark datasets
- Limitations: no posterior distributions on node ages, no model comparison, limited clock model flexibility
Common Pitfalls and Diagnostics
Convergence Issues
| Problem | Symptom | Solution |
|---|---|---|
| Low ESS | ESS < 200 in Tracer | Run longer, tune operators, simplify model |
| Non-stationarity | Trace trends up/down | Increase burn-in or run longer |
| Multimodality | Multiple peaks in marginal density | May indicate genuine uncertainty; try different priors or starting trees |
| Poor mixing | Trace stuck at single value | Adjust operator weights/tuning; check for conflicting calibrations |
Rate and Time Confounding
Molecular clock analyses estimate the product rate x time. Calibrations break this confounding by fixing time at certain nodes. With too few calibrations, rate and time estimates can be poorly identified. Signs of confounding:
- Very wide credible intervals on both rates and times
- Strong negative correlation between rate and root age in posterior
- Posterior nearly identical to prior for some node ages (data is uninformative)
Substitution Saturation
Deep divergences (>500 Ma) with fast-evolving markers can hit substitution saturation, where multiple substitutions at the same site erase phylogenetic signal. This can bias both topology and dates. Mitigation: use slow-evolving markers, amino acids instead of nucleotides, or codon models.
Related Skills
- bayesian-inference - MCMC convergence diagnostics and Bayesian fundamentals
- modern-tree-inference - ML tree topology estimation (needed before dating)
- species-trees - Species tree estimation (date after resolving species tree)
- tree-manipulation - Rooting trees (required before dating)
- tree-io - Reading and converting tree files for input/output
- alignment/multiple-alignment - Prepare alignments for dating analyses