# claude-skill-registry-data · ml-tree-level4

Advanced Maximum Likelihood phylogenetics with model selection (AIC/BIC), tree search algorithms (NNI/SPR/TBR), and performance optimization (Numba/GPU).

## Install

Source · Clone the upstream repo:

```bash
git clone https://github.com/majiayu000/claude-skill-registry-data
```

Claude Code · Install into `~/.claude/skills/`:

```bash
T=$(mktemp -d) && git clone --depth=1 https://github.com/majiayu000/claude-skill-registry-data "$T" && mkdir -p ~/.claude/skills && cp -r "$T/data/ml-tree-level4" ~/.claude/skills/majiayu000-claude-skill-registry-data-ml-tree-level4 && rm -rf "$T"
```

Manifest: `data/ml-tree-level4/SKILL.md`
# ML Tree Level 4 - Advanced Maximum Likelihood Methods

## Overview
Level 4 enhances Maximum Likelihood tree inference with:
- Model Selection: Automatically test and select best evolutionary model using AIC/BIC
- Tree Search Algorithms: NNI, SPR, TBR for finding optimal tree topologies
- Advanced Rate Models: FreeRate, partition models, branch-specific rates
- Performance Optimization: Numba JIT, GPU acceleration, Cython
## Current Status (Level 3)
Level 3 implementation includes:
- GTR model with empirical base frequencies
- Gamma rate heterogeneity (α parameter)
- Felsenstein's pruning algorithm
- Basic likelihood optimization (BFGS)
- Support for DNA, RNA, and protein sequences
Files:
- `backend/rrna_phylo/models/ml_tree_level3.py`
## Level 4 Feature 1: Model Selection

### Theory
Information Criteria:
- AIC (Akaike): AIC = -2 ln(L) + 2k
- BIC (Bayesian): BIC = -2 ln(L) + k ln(n), where k is the number of free parameters and n is the number of alignment sites
- Lower is better; both criteria penalize complexity, BIC more heavily for large n
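The two criteria can disagree near the margin; a quick sanity check with made-up numbers (not project output):

```python
import numpy as np

def aic(logL, k):
    return -2 * logL + 2 * k

def bic(logL, k, n):
    return -2 * logL + k * np.log(n)

# Suppose GTR+G gains 4 log-likelihood units over HKY85 at the cost of
# 3 extra parameters, on a 1000-site alignment:
print(aic(-2645.0, 4), aic(-2641.0, 7))              # 5298.0 5296.0 -> AIC picks GTR+G
print(bic(-2645.0, 4, 1000), bic(-2641.0, 7, 1000))  # 5317.6 5330.4 -> BIC picks HKY85
```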
Models to Test:
```python
# Values are free-parameter counts as used by this skill
DNA_MODELS = {
    'JC69': 0,    # Equal rates, equal frequencies
    'K80': 1,     # Transition/transversion ratio
    'F81': 3,     # Empirical frequencies
    'HKY85': 4,   # K80 + empirical frequencies
    'GTR': 6,     # General time reversible
    'GTR+G': 7,   # GTR + gamma rate heterogeneity
}

PROTEIN_MODELS = {
    'JTT': 0,     # Jones-Taylor-Thornton matrix
    'WAG': 0,     # Whelan and Goldman
    'LG': 0,      # Le and Gascuel
    'JTT+G': 1,   # JTT + gamma
}
```
### Implementation Pattern
```python
from typing import Dict, List, Tuple
import numpy as np

def select_best_model(
    sequences: List[Sequence],
    tree: TreeNode,
    criterion: str = 'BIC',
    models: List[str] = None,
    verbose: bool = False
) -> Tuple[str, float, Dict[str, Dict]]:
    """
    Test multiple models and select the best using AIC/BIC.

    Args:
        sequences: Aligned sequences
        tree: Initial tree topology
        criterion: 'AIC' or 'BIC'
        models: List of model names to test (None = test all DNA models)
        verbose: Print comparison table

    Returns:
        (best_model_name, best_score, all_results)

    Example:
        best, score, results = select_best_model(
            sequences, initial_tree, criterion='BIC'
        )
        # best = 'GTR+G'
        # results['GTR+G'] = {'logL': -2641.0, 'params': 7, 'score': 5298.1}
    """
    if models is None:
        models = list(DNA_MODELS)

    results = {}
    n_sites = len(sequences[0].sequence)

    for model_name in models:
        # Compute likelihood with this model
        logL, params = compute_likelihood_with_model(tree, sequences, model_name)

        # Calculate information criterion
        k = len(params)  # Number of free parameters
        if criterion == 'AIC':
            score = -2 * logL + 2 * k
        else:  # BIC
            score = -2 * logL + k * np.log(n_sites)

        results[model_name] = {'logL': logL, 'params': k, 'score': score}

    # Find best model (lowest score)
    best_model = min(results.items(), key=lambda x: x[1]['score'])

    if verbose:
        print("\nModel Selection Results:")
        print(f"Criterion: {criterion}")
        print(f"{'Model':<12} {'LogL':>12} {'Params':>8} {criterion:>12}")
        print("-" * 48)
        for name, res in sorted(results.items(), key=lambda x: x[1]['score']):
            marker = " *" if name == best_model[0] else ""
            print(f"{name:<12} {res['logL']:>12.2f} {res['params']:>8} "
                  f"{res['score']:>12.2f}{marker}")

    return best_model[0], best_model[1]['score'], results
```
### Model Classes
```python
class SubstitutionModel:
    """Base class for substitution models."""

    def __init__(self, name: str, n_params: int):
        self.name = name
        self.n_params = n_params

    def get_rate_matrix(self, params: np.ndarray, freqs: np.ndarray) -> np.ndarray:
        """Get Q matrix for this model."""
        raise NotImplementedError


class JC69Model(SubstitutionModel):
    """Jukes-Cantor: equal rates, equal frequencies."""

    def __init__(self):
        super().__init__('JC69', n_params=0)

    def get_rate_matrix(self, params=None, freqs=None) -> np.ndarray:
        # All off-diagonal rates equal; diagonal set so each row sums to zero
        Q = np.ones((4, 4)) * 0.25
        np.fill_diagonal(Q, 0)
        np.fill_diagonal(Q, -Q.sum(axis=1))
        return Q


class K80Model(SubstitutionModel):
    """Kimura 2-parameter: different transition/transversion rates."""

    def __init__(self):
        super().__init__('K80', n_params=1)

    def get_rate_matrix(self, params: np.ndarray, freqs=None) -> np.ndarray:
        kappa = params[0]  # transition/transversion ratio
        freqs = np.array([0.25, 0.25, 0.25, 0.25])
        # Build Q matrix (state order A, C, G, T)
        Q = np.array([
            [0,     1,     kappa, 1    ],  # A -> C (transv), G (trans),  T (transv)
            [1,     0,     1,     kappa],  # C -> A (transv), G (transv), T (trans)
            [kappa, 1,     0,     1    ],  # G -> A (trans),  C (transv), T (transv)
            [1,     kappa, 1,     0    ],  # T -> A (transv), C (trans),  G (transv)
        ]) * freqs
        np.fill_diagonal(Q, -Q.sum(axis=1))
        return Q
```
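As an illustration of extending this hierarchy, here is a minimal HKY85 sketch in the same style. It is not part of the Level 3 code; the rate normalization at the end is a common convention, included here as an assumption:

```python
import numpy as np

class HKY85Model(SubstitutionModel):
    """HKY85 sketch: K80's transition/transversion ratio plus
    empirical base frequencies (illustrative, mirrors K80Model)."""

    def __init__(self):
        super().__init__('HKY85', n_params=4)  # kappa + 3 free frequencies

    def get_rate_matrix(self, params: np.ndarray, freqs: np.ndarray) -> np.ndarray:
        kappa = params[0]
        # Exchangeabilities: transitions (A<->G, C<->T) weighted by kappa
        R = np.array([
            [0,     1,     kappa, 1    ],
            [1,     0,     1,     kappa],
            [kappa, 1,     0,     1    ],
            [1,     kappa, 1,     0    ],
        ], dtype=float)
        Q = R * freqs                       # scale column j by pi_j
        np.fill_diagonal(Q, -Q.sum(axis=1))
        Q /= -np.sum(freqs * np.diag(Q))    # normalize mean rate to 1
        return Q
```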
## Level 4 Feature 2: Tree Search Algorithms

### NNI (Nearest Neighbor Interchange)
Algorithm:
- For each internal branch
- Swap one subtree from each side
- Calculate new likelihood
- Accept if better
- Repeat until no improvement
```python
def nni_search(
    tree: TreeNode,
    sequences: List[Sequence],
    model: str = 'GTR',
    max_iterations: int = 100,
    verbose: bool = False
) -> Tuple[TreeNode, float, int]:
    """
    Improve tree topology using NNI rearrangements.

    Args:
        tree: Starting tree
        sequences: Aligned sequences
        model: Substitution model
        max_iterations: Max NNI rounds
        verbose: Print progress

    Returns:
        (best_tree, best_logL, n_improvements)

    Algorithm:
        1. Calculate initial likelihood
        2. For each internal branch:
           a. Generate 2 NNI neighbors
           b. Calculate likelihoods
           c. Accept best if better
        3. Repeat until convergence
    """
    current_tree = tree.copy()
    current_logL = compute_likelihood(current_tree, sequences, model)
    improvements = 0

    for iteration in range(max_iterations):
        improved = False

        # Get all internal branches
        internal_edges = get_internal_edges(current_tree)

        for edge in internal_edges:
            # Generate NNI neighbors
            neighbor1, neighbor2 = generate_nni_neighbors(current_tree, edge)

            # Calculate likelihoods
            logL1 = compute_likelihood(neighbor1, sequences, model)
            logL2 = compute_likelihood(neighbor2, sequences, model)

            # Accept best if better
            best_logL = max(logL1, logL2, current_logL)
            if best_logL > current_logL:
                current_tree = neighbor1 if logL1 > logL2 else neighbor2
                current_logL = best_logL
                improved = True
                improvements += 1
                if verbose:
                    print(f"Iteration {iteration}: LogL improved to {current_logL:.2f}")

        if not improved:
            if verbose:
                print(f"NNI converged after {iteration} iterations")
            break

    return current_tree, current_logL, improvements


def generate_nni_neighbors(
    tree: TreeNode,
    edge: Tuple[TreeNode, TreeNode]
) -> Tuple[TreeNode, TreeNode]:
    """
    Generate two NNI neighbors by swapping subtrees.

    For the internal edge connecting A (parent) and B (child), with A's
    other subtree L and B's subtrees C and R:

            A                  A                  A
           / \                / \                / \
          L   B      -->     C   B      or      R   B
             / \                / \                / \
            C   R              L   R              C   L

    Args:
        tree: Original tree
        edge: (parent, child) nodes defining internal edge

    Returns:
        (neighbor1, neighbor2): Two NNI rearrangements
    """
    parent, child = edge

    # Copy tree
    tree1 = tree.copy()
    tree2 = tree.copy()

    # Find corresponding nodes in copies
    p1 = find_node(tree1, parent.name)
    c1 = find_node(tree1, child.name)
    p2 = find_node(tree2, parent.name)
    c2 = find_node(tree2, child.name)

    # Neighbor 1: swap parent's other subtree (L) with child.left (C)
    p1.left, c1.left = c1.left, p1.left

    # Neighbor 2: swap parent's other subtree (L) with child.right (R)
    p2.left, c2.right = c2.right, p2.left

    return tree1, tree2
```
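A typical invocation, shown as a sketch that assumes the helpers named elsewhere in this document (`build_bionj_tree`, `compute_likelihood`) exist as described:

```python
initial = build_bionj_tree(sequences)
best_tree, best_logL, n_moves = nni_search(
    initial, sequences, model='GTR+G', max_iterations=100, verbose=True
)
print(f"NNI accepted {n_moves} rearrangements; final logL = {best_logL:.2f}")
```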
### SPR (Subtree Pruning and Regrafting)
More aggressive than NNI:
- Cut a subtree
- Reattach at a different location
- Explores more of tree space
```python
def spr_search(
    tree: TreeNode,
    sequences: List[Sequence],
    model: str = 'GTR',
    max_iterations: int = 50,
    verbose: bool = False
) -> Tuple[TreeNode, float, int]:
    """
    Improve tree using SPR rearrangements.

    SPR explores more tree space than NNI but is more expensive:
    roughly O(n^2) candidate moves per round versus O(n) for NNI.

    Algorithm:
        1. For each branch, prune the subtree below it
        2. For each remaining branch, regraft the subtree there
        3. Calculate the likelihood of each rearrangement
        4. Accept the best if it improves on the current tree
    """
    # Implementation follows the same accept-if-better loop as nni_search,
    # but enumerates prune/regraft pairs instead of NNI swaps.
    raise NotImplementedError
```
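For intuition, here is a single prune-and-regraft step on a toy rooted tree, as a self-contained sketch. The `Node` class and function names are illustrative, not the project's `TreeNode`; it assumes the pruned node is not a direct child of the root and the target edge lies outside the pruned subtree:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    name: str
    left: Optional['Node'] = None
    right: Optional['Node'] = None
    parent: Optional['Node'] = None

def link(parent: Node, left: Node, right: Node) -> Node:
    parent.left, parent.right = left, right
    left.parent = right.parent = parent
    return parent

def spr_move(prune: Node, graft_child: Node) -> None:
    """Detach the subtree at `prune`, splice out its old attachment node,
    and reattach it on the edge above `graft_child`."""
    old_parent = prune.parent
    sibling = old_parent.left if old_parent.right is prune else old_parent.right
    grand = old_parent.parent
    # Splice the old attachment node out of the tree
    if grand.left is old_parent:
        grand.left = sibling
    else:
        grand.right = sibling
    sibling.parent = grand
    # Reuse the spliced-out node as the new attachment point
    target_parent = graft_child.parent
    if target_parent.left is graft_child:
        target_parent.left = old_parent
    else:
        target_parent.right = old_parent
    old_parent.parent = target_parent
    link(old_parent, graft_child, prune)

# Example: ((A,B),(C,D)) -> prune A, regraft on the edge above C
a, b, c, d = Node('A'), Node('B'), Node('C'), Node('D')
root = link(Node('root'), link(Node('ab'), a, b), link(Node('cd'), c, d))
spr_move(a, c)   # topology becomes (B,((C,A),D))
```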
## Level 4 Feature 3: Advanced Rate Models

### FreeRate Model

Allows rate categories without the gamma-distribution constraint:
```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import softmax

def compute_freerate_categories(
    tree: TreeNode,
    sequences: List[Sequence],
    n_categories: int = 4
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Estimate free rate categories (weights and rates).

    Unlike gamma, rates are free parameters estimated from the data.

    Args:
        tree: Tree topology
        sequences: Aligned sequences
        n_categories: Number of rate categories

    Returns:
        (rates, weights): Rate values and category probabilities

    Example:
        rates, weights = compute_freerate_categories(tree, seqs, 4)
        # rates = [0.05, 0.3, 1.2, 2.8]      # Free parameters
        # weights = [0.25, 0.25, 0.25, 0.25] # Also optimized
    """
    n_sites = len(sequences[0].sequence)

    # Optimize rates and weights jointly
    def objective(params):
        rates = params[:n_categories]
        weights = softmax(params[n_categories:])  # Ensure weights sum to 1
        logL = 0.0
        for site in range(n_sites):
            site_L = 0.0
            for rate, weight in zip(rates, weights):
                site_L += weight * compute_site_likelihood(tree, site, rate)
            logL += np.log(site_L)
        return -logL

    # Initialize rates on a grid; equal log-weights give uniform weights
    initial = np.concatenate([
        np.linspace(0.1, 3.0, n_categories),  # Initial rates
        np.ones(n_categories)                 # Initial log-weights
    ])

    # Keep rates positive; log-weights are unconstrained
    bounds = [(1e-6, None)] * n_categories + [(None, None)] * n_categories
    result = minimize(objective, initial, method='L-BFGS-B', bounds=bounds)

    rates = result.x[:n_categories]
    weights = softmax(result.x[n_categories:])
    return rates, weights
```
### Partition Models
Different models for different regions (e.g., codon positions):
```python
class PartitionModel:
    """Apply different models to different sequence regions."""

    def __init__(self, partitions: List[Dict]):
        """
        Args:
            partitions: List of partition definitions

        Example:
            partitions = [
                {'start': 0,   'end': 300, 'model': 'GTR+G', 'alpha': 0.5},
                {'start': 300, 'end': 600, 'model': 'GTR',   'alpha': None},
            ]
        """
        self.partitions = partitions

    def compute_likelihood(self, tree: TreeNode,
                           sequences: List[Sequence]) -> float:
        """Sum log-likelihoods across partitions."""
        total_logL = 0.0
        for partition in self.partitions:
            # Extract sites for this partition
            part_seqs = [
                Sequence(s.id, s.description,
                         s.sequence[partition['start']:partition['end']])
                for s in sequences
            ]
            # Compute likelihood with the partition-specific model
            logL = compute_likelihood(
                tree, part_seqs,
                model=partition['model'],
                alpha=partition['alpha']
            )
            total_logL += logL
        return total_logL
```
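Note that the start/end ranges above assume contiguous blocks (e.g., gene regions); interleaved codon positions would need site-index lists instead. A hypothetical usage for a two-region alignment (ranges and alpha values made up for illustration):

```python
# Hypothetical two-region alignment: a fast-evolving region followed by
# a conserved one
model = PartitionModel([
    {'start': 0,   'end': 450, 'model': 'GTR+G', 'alpha': 0.4},
    {'start': 450, 'end': 900, 'model': 'HKY85', 'alpha': None},
])
logL = model.compute_likelihood(tree, sequences)
```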
## Level 4 Feature 4: Performance Optimization

### Numba JIT Compilation
```python
import numba as nb
import numpy as np

@nb.jit(nopython=True, cache=True)
def compute_likelihood_fast(
    P_matrices: np.ndarray,     # [n_edges, n_states, n_states]
    site_patterns: np.ndarray,  # [n_sites, n_taxa]
    frequencies: np.ndarray,    # [n_states]
    tree_structure: np.ndarray  # [n_nodes, 3] (parent, left, right indices)
) -> float:
    """
    Numba-optimized likelihood calculation.

    All tree traversal and matrix operations are compiled to native code,
    which can give a 10-100x speedup for large datasets. Assumes leaves
    occupy indices 0..n_taxa-1 and internal nodes follow in post-order,
    with the root last.
    """
    n_sites = site_patterns.shape[0]
    n_taxa = site_patterns.shape[1]
    n_nodes = tree_structure.shape[0]
    n_states = P_matrices.shape[1]

    # Allocate partial-likelihood arrays
    L = np.zeros((n_nodes, n_sites, n_states))

    # Initialize leaves: probability 1 for the observed state
    for leaf_idx in range(n_taxa):
        for site in range(n_sites):
            state = site_patterns[site, leaf_idx]
            L[leaf_idx, site, state] = 1.0

    # Traverse internal nodes (post-order)
    for node_idx in range(n_taxa, n_nodes):
        parent, left, right = tree_structure[node_idx]
        for site in range(n_sites):
            for i in range(n_states):
                # Partial likelihood: product of sums over child states
                left_sum = 0.0
                right_sum = 0.0
                for j in range(n_states):
                    left_sum += P_matrices[left, i, j] * L[left, site, j]
                    right_sum += P_matrices[right, i, j] * L[right, site, j]
                L[node_idx, site, i] = left_sum * right_sum

    # Sum over root states weighted by equilibrium frequencies
    root = n_nodes - 1
    logL = 0.0
    for site in range(n_sites):
        site_L = 0.0
        for i in range(n_states):
            site_L += frequencies[i] * L[root, site, i]
        logL += np.log(site_L)

    return logL
```
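The kernel needs the tree flattened into index arrays first. A minimal sketch of that preprocessing step; the `.left`/`.right` attribute names mirror this document's tree examples but are assumptions, and the leaf order must match the columns of `site_patterns`:

```python
import numpy as np

def flatten_tree(root):
    """Flatten a binary tree into the [n_nodes, 3] (parent, left, right)
    index array expected by compute_likelihood_fast above."""
    leaves, internals = [], []

    def postorder(node):
        if node.left is None and node.right is None:
            leaves.append(node)
        else:
            postorder(node.left)
            postorder(node.right)
            internals.append(node)

    postorder(root)
    ordered = leaves + internals      # leaves 0..n_taxa-1, root ends up last
    index = {id(n): i for i, n in enumerate(ordered)}

    structure = np.full((len(ordered), 3), -1, dtype=np.int64)
    for n in internals:
        i = index[id(n)]
        structure[i, 1] = index[id(n.left)]
        structure[i, 2] = index[id(n.right)]
        structure[index[id(n.left)], 0] = i   # record parent links
        structure[index[id(n.right)], 0] = i
    return structure, ordered
```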
### GPU Acceleration (CuPy)
```python
try:
    import cupy as cp
    HAS_GPU = True
except ImportError:
    HAS_GPU = False

def compute_likelihood_gpu(
    tree: TreeNode,
    sequences: List[Sequence],
    model: str = 'GTR',
    alpha: float = None
) -> float:
    """
    GPU-accelerated likelihood calculation using CuPy.

    Useful for large alignments (>1000 sites). Falls back to the CPU
    implementation when CuPy is not installed.
    """
    if not HAS_GPU:
        return compute_likelihood(tree, sequences, model, alpha)

    # Flatten tree and sequences into arrays (e.g., via flatten_tree above)
    # and move them to the GPU
    P_matrices_gpu = cp.array(P_matrices)
    site_patterns_gpu = cp.array(site_patterns)
    frequencies_gpu = cp.array(frequencies)

    # Compute on GPU
    logL_gpu = _likelihood_kernel_gpu(
        P_matrices_gpu, site_patterns_gpu, frequencies_gpu
    )

    # Move result back to CPU
    return float(logL_gpu.get())
```
## Integration with Existing Code

### Update builder.py
```python
import time

def build_ml_tree_level4(
    sequences: List[Sequence],
    method: str = 'ml',
    model: str = 'auto',        # NEW: Auto-select with BIC
    alpha: float = None,
    tree_search: str = 'nni',   # NEW: 'nni', 'spr', or None
    optimize_branches: bool = True,
    verbose: bool = False
) -> Tuple[TreeNode, float, Dict]:
    """
    Build ML tree with Level 4 enhancements.

    Args:
        model: 'auto' for model selection, or a specific model name
        tree_search: 'nni', 'spr', or None

    Returns:
        (tree, logL, metadata)
        metadata includes: model, tree_search, n_improvements, time_taken
    """
    start_time = time.time()

    # Step 1: Get initial tree
    initial_tree = build_bionj_tree(sequences)

    # Step 2: Model selection
    if model == 'auto':
        best_model, score, all_scores = select_best_model(
            sequences, initial_tree, criterion='BIC', verbose=verbose
        )
    else:
        best_model = model

    # Step 3: Tree search
    if tree_search == 'nni':
        final_tree, logL, n_improvements = nni_search(
            initial_tree, sequences, model=best_model, verbose=verbose
        )
    elif tree_search == 'spr':
        final_tree, logL, n_improvements = spr_search(
            initial_tree, sequences, model=best_model, verbose=verbose
        )
    else:
        final_tree = initial_tree
        logL = compute_likelihood(final_tree, sequences, best_model)
        n_improvements = 0

    # Step 4: Optimize branch lengths
    if optimize_branches:
        final_tree, logL = optimize_branch_lengths(
            final_tree, sequences, best_model
        )

    metadata = {
        'model': best_model,
        'tree_search': tree_search,
        'n_improvements': n_improvements,
        'time_taken': time.time() - start_time,
    }
    return final_tree, logL, metadata
```
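A hypothetical call from downstream code; the printed fields match the metadata dict above:

```python
tree, logL, meta = build_ml_tree_level4(
    sequences, model='auto', tree_search='nni', verbose=True
)
print(f"Selected {meta['model']}; {meta['n_improvements']} rearrangements; "
      f"{meta['time_taken']:.1f}s; logL = {logL:.2f}")
```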
## Testing Strategy

### Test Model Selection
```python
def test_model_selection():
    """Verify model selection works correctly."""
    # Use sequences where a parameter-rich model should clearly win
    sequences = load_test_sequences('mammals.fasta')
    tree = build_bionj_tree(sequences)

    best_model, score, all_scores = select_best_model(
        sequences, tree, criterion='BIC'
    )

    # A realistic model should be selected for real data
    assert best_model in ['GTR', 'GTR+G', 'HKY85']

    # GTR+G should score better (lower) than JC69
    assert all_scores['GTR+G']['score'] < all_scores['JC69']['score']
```
### Test NNI Search
```python
def test_nni_improves_likelihood():
    """Verify NNI finds better trees."""
    sequences = load_test_sequences('primates.fasta')

    # Start with UPGMA (often suboptimal)
    initial_tree = build_upgma_tree(sequences)
    initial_logL = compute_likelihood(initial_tree, sequences, 'GTR')

    # Run NNI
    final_tree, final_logL, n_improvements = nni_search(
        initial_tree, sequences, model='GTR', max_iterations=10
    )

    # Should improve
    assert final_logL >= initial_logL
    assert n_improvements > 0
```
### Performance Benchmarks
```python
import time

def benchmark_optimizations():
    """Compare performance of different implementations."""
    sequences = generate_test_sequences(n_seqs=50, length=1000)
    tree = build_bionj_tree(sequences)

    # Baseline Python
    start = time.time()
    logL_python = compute_likelihood_python(tree, sequences)
    time_python = time.time() - start

    # Numba
    start = time.time()
    logL_numba = compute_likelihood_numba(tree, sequences)
    time_numba = time.time() - start

    # GPU (if available)
    if HAS_GPU:
        start = time.time()
        logL_gpu = compute_likelihood_gpu(tree, sequences)
        time_gpu = time.time() - start
        print(f"GPU speedup: {time_python/time_gpu:.1f}x")

    print(f"Numba speedup: {time_python/time_numba:.1f}x")

    # Verify all implementations agree
    assert abs(logL_python - logL_numba) < 0.01
```
## Common Pitfalls
- Model Selection Overfitting: adding parameters always raises the maximized likelihood, so compare models with AIC/BIC rather than raw likelihood
- Local Optima: NNI can get stuck; try multiple starting trees (or escalate to SPR)
- Numerical Precision: per-site likelihoods underflow on large trees; use log-space for all likelihood calculations (see the sketch after this list)
- Performance: don't optimize prematurely; profile first
- GPU Memory: large trees and alignments may not fit in GPU memory
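To make the numerical-precision point concrete, here is a minimal standalone sketch of per-site rescaling of partial likelihoods, a standard alternative to pure log-space arithmetic (names and shapes are illustrative, not the project's API):

```python
import numpy as np

def rescale_partials(L_node: np.ndarray):
    """Rescale one node's partial likelihoods [n_sites, n_states] so the
    maximum per site is 1, returning the log of the factors taken out."""
    scale = L_node.max(axis=1)     # per-site maximum
    scale[scale == 0.0] = 1.0      # avoid division by zero on impossible sites
    return L_node / scale[:, None], np.log(scale)

# During the post-order pass, accumulate log-scalers per site:
#   L[node], log_s = rescale_partials(L[node]);  log_scaler += log_s
# and at the root:
#   logL = np.sum(np.log(freqs @ L[root].T) + log_scaler)
```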
## References
- Model Selection: Posada & Crandall (1998) "MODELTEST"
- NNI: Swofford et al. (1996) "Phylogenetic inference"
- SPR: Hordijk & Gascuel (2005) "Improving NNI"
- FreeRate: Soubrier et al. (2012) "FreeRate models"
- Performance: Kosakovsky Pond et al. (2005) "HyPhy"