Medical-research-skills etetoolkit
ETE (Environment for Tree Exploration) toolkit for phylogenetic and hierarchical tree analysis; use it when you need to parse/manipulate Newick/NHX trees, detect duplication/speciation events, integrate NCBI taxonomy, and render publication-quality figures.
install
source · Clone the upstream repo
git clone https://github.com/aipoch/medical-research-skills
Claude Code · Install into ~/.claude/skills/
T=$(mktemp -d) && git clone --depth=1 https://github.com/aipoch/medical-research-skills "$T" && mkdir -p ~/.claude/skills && cp -r "$T/scientific-skills/Data Analysis/etetoolkit" ~/.claude/skills/aipoch-medical-research-skills-etetoolkit && rm -rf "$T"
manifest: scientific-skills/Data Analysis/etetoolkit/SKILL.md
When to Use
- Preprocess phylogenetic trees: convert formats (Newick/NHX/PhyloXML), reroot (midpoint/outgroup), prune taxa, and resolve polytomies before downstream analyses.
- Detect evolutionary events in gene trees: infer duplication vs. speciation events and derive ortholog/paralog relationships for phylogenomics.
- Annotate trees with taxonomy: map species names to NCBI TaxIDs, retrieve lineages/ranks, and build minimal taxonomy topologies connecting a set of taxa.
- Generate publication-quality visualizations: render trees to PDF/SVG/PNG with custom styles, support-based coloring, and node “faces” (labels, shapes, heatmaps).
- Compare alternative topologies: quantify differences between trees using Robinson–Foulds (RF) distance and partition/bipartition analysis.
Key Features
- Tree I/O and manipulation
  - Read/write: Newick, NHX, PhyloXML, NeXML
  - Traversals: preorder, postorder, levelorder
  - Operations: prune, reroot, collapse, resolve polytomies
  - Metrics: branch/topological distances, RF distance
- Phylogenetic (gene tree) analysis
  - Alignment association (FASTA/Phylip)
  - Species name extraction from gene IDs
  - Duplication/speciation detection (e.g., species overlap / reconciliation-style workflows)
  - Orthology/paralogy extraction and gene-family splitting
- NCBI taxonomy integration
  - Auto-download + local cache of taxonomy DB
  - TaxID ↔ scientific name translation
  - Lineage/rank retrieval and taxonomy-based topology building
  - Tree annotation with taxonomic metadata
- Visualization
  - Rectangular/circular layouts, GUI exploration
  - NodeStyle/TreeStyle customization
  - Faces (text, shapes, charts/heatmaps) and layout functions
  - Export to PDF/SVG/PNG
- Clustering support
  - ClusterTree for dendrograms linked to numeric matrices
  - Cluster quality metrics (e.g., silhouette, Dunn index)
  - Heatmap + tree combined views
Dependencies
- ete3 (recommended: >=3.1.0)
- Optional GUI/rendering dependencies (platform-specific):
  - PyQt5 (e.g., >=5.15)
  - Qt SVG support (often packaged as python3-pyqt5.qtsvg on Debian/Ubuntu)
Example Usage
The following example is designed to be runnable end-to-end: it uses an in-memory Newick string and needs no input files. Importing TreeStyle/NodeStyle and rendering in step 6 require the optional PyQt dependencies listed above.

    # pip install ete3
    from ete3 import Tree, TreeStyle, NodeStyle

    # 1) Load a tree (Newick). format=0 reads the internal labels (90, 70)
    #    as support values; format=1 would read them as node names instead.
    nw = "((A:0.1,B:0.2)90:0.3,(C:0.2,D:0.4)70:0.1);"
    t = Tree(nw, format=0)

    # 2) Basic stats
    print("Leaves:", len(t))
    print("Total nodes:", sum(1 for _ in t.traverse()))

    # 3) Midpoint rooting (skip if no outgroup is returned, e.g. when the
    #    midpoint already coincides with the current root)
    mid = t.get_midpoint_outgroup()
    if mid is not None:
        t.set_outgroup(mid)

    # 4) Prune to taxa of interest (preserve branch lengths)
    t.prune(["A", "C", "D"], preserve_branch_length=True)

    # 5) Style nodes (color internal nodes by support)
    ts = TreeStyle()
    ts.show_leaf_name = True
    ts.show_branch_support = True
    for n in t.traverse():
        st = NodeStyle()
        if n.is_leaf():
            st["fgcolor"] = "blue"
            st["size"] = 8
        else:
            # ETE stores internal support in n.support when present
            st["fgcolor"] = "darkgreen" if getattr(n, "support", 0) >= 80 else "red"
            st["size"] = 5
        n.set_style(st)

    # 6) Render (PDF/SVG/PNG supported depending on your environment)
    t.render("example_tree.pdf", tree_style=ts)
    print("Wrote: example_tree.pdf")
Implementation Details
Tree parsing formats (Newick “format” codes)
ETE uses a format integer to control which node fields are read from or written to Newick. Common codes:
- format=0: flexible, with internal support values (the default)
- format=1: flexible, with internal node names
- format=2: all branch lengths + leaf names + internal support values
- format=5: all branch lengths + leaf names
- format=8: all names (leaf and internal), no branch lengths
- format=9: leaf names only
- format=100: topology only
Example:
    from ete3 import Tree

    t = Tree("tree.nw", format=1)
    t.write(outfile="out.nw", format=5)
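To make the field combinations concrete, here is a small stdlib-only sketch that serializes one toy tree three ways. It does not call ETE: the `Node` class, the `to_newick` function and the mode names are hypothetical, and the three modes only approximate what `format=100`, `format=9` and `format=5` emit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical toy node; not an ETE class."""
    name: str = ""
    dist: float = 0.0
    children: List["Node"] = field(default_factory=list)


def to_newick(node: Node, mode: str) -> str:
    """Serialize a tree; mode is 'topology', 'leaf_names' or 'lengths'."""
    def render(n: Node, is_root: bool) -> str:
        text = ""
        if n.children:
            text = "(" + ",".join(render(c, False) for c in n.children) + ")"
        elif mode != "topology":
            text = n.name  # leaf names are dropped in topology-only mode
        if mode == "lengths" and not is_root:
            text += f":{n.dist:g}"
        return text
    return render(node, True) + ";"


tree = Node(children=[
    Node(dist=0.3, children=[Node("A", 0.1), Node("B", 0.2)]),
    Node(dist=0.1, children=[Node("C", 0.2), Node("D", 0.4)]),
])

print(to_newick(tree, "topology"))    # ((,),(,));
print(to_newick(tree, "leaf_names"))  # ((A,B),(C,D));
print(to_newick(tree, "lengths"))     # ((A:0.1,B:0.2):0.3,(C:0.2,D:0.4):0.1);
```

Writing with a narrower format than the one used for reading silently drops fields, so it is worth checking the output string before passing it downstream.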
NHX feature preservation
NHX is used to store custom per-node features. When writing, specify which features to serialize:
t.write(outfile="tree.nhx", features=["taxid", "habitat", "lineage"])
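As a rough picture of what ends up in the file, the stdlib-only sketch below builds the `[&&NHX:key=value:...]` comment that NHX attaches to a node. The helper names `nhx_tag` and `nhx_leaf` and the sample feature values are hypothetical; the sketch assumes that only the requested features are serialized and that requested features a node lacks are skipped.

```python
def nhx_tag(features: dict) -> str:
    """Format a feature dict as an NHX comment, e.g. [&&NHX:taxid=9606]."""
    if not features:
        return ""
    body = ":".join(f"{key}={value}" for key, value in features.items())
    return f"[&&NHX:{body}]"


def nhx_leaf(name: str, dist: float, features: dict, keep: list) -> str:
    """Serialize one leaf, keeping only the features listed in `keep`."""
    selected = {key: features[key] for key in keep if key in features}
    return f"{name}:{dist:g}{nhx_tag(selected)}"


# Illustrative values only; "note" is not requested, "lineage" is not present.
leaf = nhx_leaf(
    "Hsa_1",
    0.1,
    {"taxid": 9606, "habitat": "land", "note": "draft"},
    keep=["taxid", "habitat", "lineage"],
)
print(leaf)  # Hsa_1:0.1[&&NHX:taxid=9606:habitat=land]
```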
Rerooting and pruning behavior
- Midpoint rooting uses get_midpoint_outgroup() to select an outgroup that balances path lengths.
- Pruning should typically use preserve_branch_length=True to avoid distorting distances in phylogenetic contexts.
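The reason branch-length preservation matters can be seen in a stdlib-only sketch. It is not ETE's implementation: the tuple layout `(name, branch_length, children)` and the helpers `prune` and `root_to_leaf` are hypothetical. When pruning leaves an internal node with a single child, the sketch removes that node and adds its branch length to the surviving child, so root-to-leaf distances stay the same.

```python
# Toy node layout: (name, branch_length, children); not an ETE structure.
def prune(node, keep):
    """Return a copy of `node` restricted to leaves in `keep`, or None."""
    name, dist, children = node
    if not children:
        return node if name in keep else None
    kept = [p for p in (prune(c, keep) for c in children) if p is not None]
    if not kept:
        return None
    if len(kept) == 1:
        # A unary node would remain: drop it and add its branch length
        # to the surviving child so distances to the root are unchanged.
        child_name, child_dist, grandchildren = kept[0]
        return (child_name, child_dist + dist, grandchildren)
    return (name, dist, kept)


def root_to_leaf(node, target, travelled=0.0):
    """Distance from the root to leaf `target`, or None if it is absent."""
    name, dist, children = node
    travelled += dist
    if not children:
        return travelled if name == target else None
    for child in children:
        found = root_to_leaf(child, target, travelled)
        if found is not None:
            return found
    return None


tree = ("", 0.0, [
    ("", 0.3, [("A", 0.1, []), ("B", 0.2, [])]),
    ("", 0.1, [("C", 0.2, []), ("D", 0.4, [])]),
])
pruned = prune(tree, {"A", "C", "D"})

# A keeps its distance to the root even though its parent node was removed.
print(root_to_leaf(tree, "A"), root_to_leaf(pruned, "A"))
```

Dropping the removed node's length instead would shorten A's path from 0.4 to 0.1, which is the kind of distortion the option is meant to avoid.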
Evolutionary event detection (gene trees)
For gene trees, PhyloTree supports event labeling on internal nodes (commonly evoltype == "D" for duplication and evoltype == "S" for speciation).
A typical workflow is:
- Load a gene tree (optionally with an alignment).
- Provide a species naming function to map gene IDs → species.
- Run descendant event detection.
- Extract ortholog groups (speciation subtrees) or query ortholog/paralog sets from events.
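The workflow above can be sketched without ETE to show the idea behind species-overlap labeling. This is a simplified illustration, not the library's algorithm: the nested-tuple gene tree, the `Species_N` gene ID convention and the helpers `species_of`, `detect_events` and `orthologs_of` are all hypothetical. A node is labeled a duplication when its two child subtrees share at least one species, and a speciation otherwise.

```python
def species_of(gene_id):
    """Hypothetical naming rule: 'Hsa_1' -> 'Hsa'."""
    return gene_id.split("_")[0]


def leaves(node):
    """All leaf names under a nested-tuple node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def detect_events(node, events):
    """Post-order walk; appends (type, left_leaves, right_leaves) per node."""
    if isinstance(node, str):
        return {species_of(node)}
    left, right = node  # this sketch handles binary trees only
    sp_left = detect_events(left, events)
    sp_right = detect_events(right, events)
    # Shared species on both sides -> duplication ("D"), else speciation ("S").
    etype = "D" if sp_left & sp_right else "S"
    events.append((etype, leaves(left), leaves(right)))
    return sp_left | sp_right


def orthologs_of(gene, events):
    """Genes separated from `gene` by a speciation node."""
    found = set()
    for etype, left, right in events:
        if etype != "S":
            continue
        if gene in left:
            found.update(right)
        elif gene in right:
            found.update(left)
    return found


gene_tree = ((("Hsa_1", "Ptr_1"), "Mmu_1"), (("Hsa_2", "Ptr_2"), "Mmu_2"))
events = []
detect_events(gene_tree, events)

print([etype for etype, _, _ in events])      # ['S', 'S', 'S', 'S', 'D']
print(sorted(orthologs_of("Hsa_1", events)))  # ['Mmu_1', 'Ptr_1']
```

In this toy tree the root is a duplication, so the two copies (`_1` and `_2`) are paralogs of each other and orthology is only reported within each copy.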
Tree comparison (Robinson–Foulds)
Tree.robinson_foulds(other_tree) returns:
- rf: RF distance (number of differing bipartitions)
- max_rf: maximum possible RF given shared leaves
- plus shared leaves and partition sets for deeper inspection
Normalized RF is typically computed as rf / max_rf (when max_rf > 0).
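To show what these numbers mean, the stdlib-only sketch below computes an RF distance by hand on nested-tuple trees. It treats both trees as unrooted and requires identical leaf sets; the helper names are hypothetical, and ETE's own counts may differ depending on options such as `unrooted_trees` and on how it handles leaves that are not shared.

```python
def bipartitions(tree):
    """Return (leaf set, non-trivial bipartitions) of a nested-tuple tree."""
    clades = []

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        clades.append(below)
        return below

    all_leaves = visit(tree)
    parts = set()
    for below in clades:
        other = all_leaves - below
        # Splits with a single leaf (or nothing) on one side carry no signal.
        if len(below) > 1 and len(other) > 1:
            parts.add(frozenset([below, other]))
    return all_leaves, parts


def robinson_foulds(t1, t2):
    """RF distance and its maximum for two trees on the same leaves."""
    leaves1, parts1 = bipartitions(t1)
    leaves2, parts2 = bipartitions(t2)
    if leaves1 != leaves2:
        raise ValueError("this sketch requires identical leaf sets")
    rf = len(parts1 ^ parts2)            # bipartitions found in only one tree
    max_rf = len(parts1) + len(parts2)   # reached when no bipartition is shared
    return rf, max_rf


def normalized_rf(rf, max_rf):
    """rf / max_rf, with 0.0 when there is nothing to compare."""
    return rf / max_rf if max_rf > 0 else 0.0


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
rf, max_rf = robinson_foulds(t1, t2)
print(rf, max_rf, normalized_rf(rf, max_rf))  # 2 4 0.5
```

The two trees agree on the split ABC|DE and disagree on how A, B and C are grouped, which gives two unshared bipartitions out of four.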