Open Tree of Life
The Open Tree of Life (OpenTree) is a collaborative, open-source project funded by the National Science Foundation that synthesizes published phylogenetic trees and taxonomic data into a comprehensive, dynamic evolutionary tree of life on Earth, spanning roughly 2.3 million tips, most of them named species.[1] It provides tools for exploring evolutionary relationships, identifying knowledge gaps, and supporting biodiversity research. Its core resource is a synthetic supertree that integrates diverse phylogenies using automated methods and a unified taxonomy drawn from sources such as NCBI Taxonomy and the Global Biodiversity Information Facility (GBIF).[1][2]
Launched in 2012 under NSF grant DEB-1208809 as part of the Assembling the Tree of Life program, the project released its first draft supertree in 2015, encompassing 2.3 million tips from 484 source phylogenies across 3,062 studies.[1] As of synthesis version 15.1 (July 15, 2024), the tree includes 2,384,572 tips, with the Open Tree Taxonomy (OTT) version 3.7 (April 19, 2024) integrating data from multiple databases.[3][4] The project supports community curation and offers web-based visualization, APIs, and open-source software for access.[2][5]
OpenTree has become a foundational resource in evolutionary biology, enabling macroevolutionary analyses and conservation efforts, and is cited in thousands of studies. It continues to evolve with new studies added as of 2025, including applications in recent syntheses such as a complete avian phylogenetic tree.[1][6]
Overview
Goals and Objectives
The Open Tree of Life project seeks to synthesize all published phylogenetic trees and associated taxonomic data into a single, dynamic, and comprehensive tree that represents the evolutionary history of all life on Earth. By integrating fragmented phylogenetic information from diverse studies, the project bridges gaps in existing data, creating a unified framework that initially covered approximately 2.3 million tips, including around 1.8 million named species, with ongoing expansion to incorporate new findings. This core goal establishes a foundational resource for understanding biodiversity across all domains of life, from microbes to multicellular organisms.[1]
Key objectives of the project include enhancing the accessibility of phylogenetic knowledge to researchers, educators, and the public, fostering collaborative contributions from the scientific community, and serving as a foundation for advancing studies in evolutionary biology, biodiversity assessment, and conservation strategies. Through this synthesis, the project enables users to explore evolutionary relationships and supports applications such as identifying conservation priorities and modeling ecological dynamics. It promotes open participation by allowing experts to curate and submit data, ensuring the tree evolves with scientific progress.[1]
The project's commitment to openness is exemplified by its open-source framework under the BSD 2-clause license, which permits unrestricted public and scientific use without registration or fees. This licensing model facilitates widespread adoption, reproducibility of analyses, and continuous improvement through community input, making phylogenetic data a freely available global asset.[7][8]
Scope and Coverage
The Open Tree of Life project encompasses all three domains of life (Bacteria, Archaea, and Eukarya), with a primary focus on named species across these groups.[1] Its initial scope targeted approximately 1.8 million named species of animals, plants, fungi, and microbes, reflecting the breadth of described biodiversity at the time of the first major synthesis.[1] This coverage has since expanded through ongoing integrations, emphasizing resolution of evolutionary relationships within major clades such as animals, plants, fungi, and microbial lineages.[3]
The taxonomic hierarchy in the Open Tree of Life spans standard ranks from domain down to subspecies, providing a structured framework for organizing taxa.[9] It integrates both molecular phylogenetic data from published studies and taxonomic classifications that incorporate morphological characteristics where available, ensuring a comprehensive representation of evolutionary history.[1] As of July 2024, the latest synthetic tree includes 2,384,572 tips, representing taxa with resolved phylogenetic placements, though the underlying Open Tree Taxonomy (OTT) catalogs over 4.5 million identifiers to accommodate broader nomenclatural variations.[3][9]
Unlike static phylogenetic trees, the Open Tree of Life maintains a dynamic scope that allows for periodic updates to incorporate new taxonomic discoveries and refined phylogenetic estimates, ensuring ongoing relevance to emerging biodiversity data.[10] The Open Tree Taxonomy serves as the foundational backbone for this expansive coverage, enabling consistent mapping across diverse sources.[1]
History and Development
Initial Funding and Establishment
The Open Tree of Life project was initiated in June 2012 as a collaboration among researchers from 10 universities and institutions, led by principal investigators from institutions including the University of California, Berkeley, and Harvard University. Contributors included Karen Cranston of Duke University, Mark Holder of the University of Kansas, and Emily Jane McTavish of the University of California, Merced. This multi-institutional partnership sought to create a comprehensive, dynamic phylogenetic tree by integrating published trees and taxonomic data, addressing gaps in existing evolutionary resources. The effort was coordinated through the National Evolutionary Synthesis Center (NESCent) and formed part of the broader NSF Assembling, Visualizing, and Analyzing the Tree of Life (AVAToL) initiative.[11][12][13]
Primary funding for the project came from a three-year National Science Foundation award (AVAToL 1208809) totaling approximately $5.7 million, which supported the development of software pipelines, data curation tools, and initial synthesis efforts across the collaborating institutions. This grant enabled the assembly of a vast dataset from thousands of published studies, emphasizing open access and reproducibility. In 2015, a two-year supplemental NSF award provided additional resources to three key institutions, extending support for refinement and expansion of the core infrastructure.[14][11]
The first draft tree was released in September 2015, encompassing 2.3 million tips (species and higher taxa) and serving as an openly accessible foundation for evolutionary research. The release highlighted the project's emphasis on community-driven updates and digital availability.
Building on prior supertree projects like the Tree of Life Web Project, the Open Tree of Life distinguished itself through a focus on scalable open data integration, allowing continuous incorporation of new phylogenetic studies without proprietary restrictions.[15][1]
Key Milestones and Releases
The Open Tree of Life project marked its initial major achievement with the release of version 1 in September 2015, presenting the first comprehensive draft tree encompassing approximately 2.3 million tips (species and higher taxa) across animals, plants, fungi, and other groups.[15][16] This supertree synthesized 484 source phylogenies across 3,062 published studies, providing a foundational framework for exploring evolutionary relationships on a global scale.[17][1]
Following the inaugural release, the project adopted a pattern of regular synthesis updates, evolving from roughly monthly cycles in the pre-2015 development phase to more deliberate periodic major releases thereafter.[12] By 2024, over 15 versions had been produced, reflecting ongoing refinements and expansions. Subsequent funding, including NSF grants ABI-1759838 and ABI-1759846, has supported continued enhancements and regular releases through the 2020s.[12]
Notable subsequent milestones include version 14.8, released on September 25, 2023, which incorporated newly published phylogenetic studies to enhance tree structure and coverage.[18] This was followed by version 15.1 on July 15, 2024, which utilized the propinquity pipeline to expand the number of terminal tips and improve overall resolution.[3]
A key development began in 2016 with the integration of community-curated phylogenetic studies into the synthesis process, enabling contributions from researchers worldwide via tools like Phylesystem. This collaborative approach led to notable improvements in resolution for specific clades, such as birds and mammals, by incorporating expert-vetted trees that addressed gaps in earlier drafts.[1]
Methodology
Taxonomic Framework
The Open Tree Taxonomy (OTT) serves as the foundational, machine-readable taxonomic framework for the Open Tree of Life project, synthesizing diverse taxonomic data into a unified hierarchical structure that spans all domains of life. It integrates information from major databases, including the NCBI Taxonomy, Integrated Taxonomic Information System (ITIS), and Catalogue of Life, along with others such as the Global Biodiversity Information Facility (GBIF) backbone and Interim Register of Marine and Nonmarine Genera (IRMNG), to create a comprehensive reference that maximizes taxonomic coverage while minimizing redundancy.[1] This synthesis ensures that each taxon is assigned a unique Open Tree Taxonomy Identifier (OTT ID), facilitating consistent mapping and interoperability across phylogenetic datasets.
The construction of OTT relies on automated processes to merge input taxonomies, with the "smasher" software playing a central role in resolving discrepancies and producing a single coherent hierarchy. Smasher, implemented as a Java-based tool with supporting Python utilities, aligns homologous nodes across sources, merges synonymous names (such as alternative scientific names for the same species), and flags or resolves conflicts like homonyms or differing classifications through algorithmic rules and scripted interventions.[19] The output includes detailed logs of mergers, synonym lists, and conflict reports, enabling transparency and iterative refinement while preserving the original source attributions for each taxon.[1]
As of version 3.7 (released May 30, 2024), OTT contains over 10 million taxonomic names, covering both accepted taxa and synonyms, with extensive mappings to external databases like NCBI and GBIF to support cross-referencing and data integration.[9] This scale reflects ongoing updates that incorporate new taxonomic descriptions and revisions, ensuring broad representation across eukaryotes, bacteria, and archaea.
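The merging of synonymous names across sources can be illustrated with a small sketch. This is a toy model under stated assumptions, not smasher's actual algorithm: the `Taxon` class, the `merge_taxonomies` function, the record layout, and the source-local IDs (`n1`, `g1`, and so on) are all hypothetical, and the sketch ignores homonyms and hierarchy entirely. It only shows how records from a higher-priority source and a lower-priority source might collapse onto one identifier while keeping their source attributions.

```python
from dataclasses import dataclass, field


@dataclass
class Taxon:
    """One merged taxon: a hypothetical stand-in for an OTT record."""
    ott_id: int
    name: str
    synonyms: set = field(default_factory=set)
    sources: dict = field(default_factory=dict)  # source name -> source-local ID


def merge_taxonomies(sources):
    """Merge source taxonomies in priority order (first source wins on naming).

    `sources` is a list of (source_name, records) pairs; each record is a
    dict with keys "id", "name", and optionally "synonyms".
    """
    merged = []
    by_name = {}  # any known name (accepted or synonym) -> Taxon
    for source_name, records in sources:
        for rec in records:
            names = {rec["name"], *rec.get("synonyms", [])}
            # Align: reuse an existing taxon if any of its names is already known.
            taxon = next((by_name[n] for n in sorted(names) if n in by_name), None)
            if taxon is None:
                taxon = Taxon(ott_id=len(merged) + 1, name=rec["name"])
                merged.append(taxon)
            taxon.synonyms |= names - {taxon.name}
            taxon.sources[source_name] = rec["id"]  # keep source attribution
            for n in names:
                by_name.setdefault(n, taxon)
    return merged


# Toy input: the IDs are made up for illustration.
ncbi = [
    {"id": "n1", "name": "Homo sapiens"},
    {"id": "n2", "name": "Pan troglodytes", "synonyms": ["Simia troglodytes"]},
]
gbif = [
    {"id": "g1", "name": "Simia troglodytes"},
    {"id": "g2", "name": "Gorilla gorilla"},
]
merged = merge_taxonomies([("ncbi", ncbi), ("gbif", gbif)])
for t in merged:
    print(t.ott_id, t.name, sorted(t.synonyms), t.sources)
```

In this example the GBIF record named "Simia troglodytes" is folded into the taxon that NCBI calls "Pan troglodytes", so three taxa result from four input records. A real merger also has to separate homonyms (identical names for different organisms), which a name-only lookup like this one cannot do.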
A key design principle of OTT is its emphasis on stability and version control, achieved through a git-based versioning system that allows taxonomic updates without invalidating prior phylogenetic syntheses. Each release is independently archived and documented, enabling users to reference specific versions (e.g., via OTT IDs) and facilitating reproducible analyses even as nomenclature evolves. This approach minimizes disruptions in supertree building, where the taxonomy acts as a stable scaffold for integrating diverse phylogenetic trees.[1][19]
Phylogenetic Synthesis Process
The phylogenetic synthesis process of the Open Tree of Life utilizes a supertree approach to combine published phylogenetic trees with the Open Tree Taxonomy (OTT) into a comprehensive, dynamic tree of life. This semi-automated method involves grafting compatible clades from source phylogenies onto the OTT backbone, which serves as a taxonomic scaffold for alignment and constraint. Conflicts between input trees are resolved through graph-based algorithms that prioritize well-supported relationships, ensuring the synthetic tree reflects the broadest consensus of available evidence while minimizing unsupported resolutions.[1]
The process starts with curation of input trees from public repositories such as TreeBASE and Dryad, where phylogenies are selected for their relevance and quality, often nominated by the community for inclusion. These trees are then aligned to OTT taxa via automated mapping of tips to taxonomic identifiers, accounting for synonyms and hierarchical structure to standardize nomenclature across sources. Aligned trees are decomposed into smaller subproblems at nodes without conflicts, facilitating efficient integration.[1][20]
Subsequently, the OTT and decomposed input trees are loaded into a Neo4j graph database to construct a tree alignment graph (TAG), representing all compatible and conflicting relationships as edges and nodes. Traversal of the TAG employs a greedy heuristic algorithm that maximizes the number of displayed groups by rank (DGR), akin to maximum parsimony principles, to resolve polytomies and incompatibilities. Well-supported clades from higher-ranked inputs, such as expert-curated or recently published phylogenies, are prioritized over taxonomic assumptions, while unresolved areas rely on OTT constraints to infer monophyly or basal placements.
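The rank-based greedy idea can be sketched in a few lines. This is a simplified illustration, not the project's TAG traversal: it assumes every input tree covers the same set of tips, represents each tree as a list of clades (sets of tip labels), and treats two clades as compatible when they are nested or disjoint. The function names `compatible` and `greedy_synthesis` are hypothetical.

```python
def compatible(a, b):
    """Two clades can coexist in one tree if they are nested or disjoint."""
    return a <= b or b <= a or a.isdisjoint(b)


def greedy_synthesis(ranked_trees):
    """Accept clades from ranked inputs, highest-ranked tree first.

    `ranked_trees` is a list of trees, best first; each tree is a list of
    clades, and each clade is a frozenset of tip labels.
    Returns (accepted, rejected); rejected holds (clade, tree_rank) pairs
    that could be flagged for review instead of being forced into the tree.
    """
    accepted, rejected = [], []
    for rank, tree in enumerate(ranked_trees):
        for clade in tree:
            if clade in accepted:
                continue  # already supported by a higher-ranked input
            if all(compatible(clade, other) for other in accepted):
                accepted.append(clade)
            else:
                rejected.append((clade, rank))
    return accepted, rejected


ranked = [
    # Rank 0: a curated phylogeny
    [frozenset({"human", "chimp"}),
     frozenset({"human", "chimp", "gorilla"})],
    # Rank 1: a lower-ranked input that groups chimp with gorilla
    [frozenset({"chimp", "gorilla"}),
     frozenset({"human", "chimp", "gorilla", "orangutan"})],
]
accepted, rejected = greedy_synthesis(ranked)
print("accepted:", [sorted(c) for c in accepted])
print("rejected:", [(sorted(c), r) for c, r in rejected])
```

Because the chimp-plus-gorilla clade overlaps the higher-ranked human-plus-chimp clade without nesting inside it, the sketch rejects it and records the rank of the tree it came from. Real inputs sample different tips, so the production pipeline has to reason about compatibility on partially overlapping taxon sets, which this sketch deliberately leaves out.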
This approach handles conflicts by flagging discordant clades for community review rather than forcing arbitrary resolutions.[1]
The output is a resolved synthetic tree encompassing millions of taxa, with branch lengths incorporated where source trees provide dating information, such as molecular clock estimates or fossil calibrations. In the initial 2015 release, the synthesis incorporated 484 source trees from 3,062 studies, covering relationships for approximately 38,000 tips (or ~42,000 including nonterminal taxa) directly from source phylogenies; by 2021, this had expanded to 1,216 studies informing 87,000 taxa within a 2.4 million-tip tree. As of the July 2024 synthesis (v15.1), 129,778 tips are derived directly from phylogenies, drawn from more than 4,500 studies containing 9,395 trees. The full Phylesystem database, which stores all curated trees for potential synthesis, contained over 7,700 trees from 3,400 studies as of 2016, supporting ongoing updates to the dynamic framework.[1][21][22][3]
Software Tools and Pipelines
The Open Tree of Life project employs a suite of open-source software tools and pipelines to facilitate the synthesis of phylogenetic trees and taxonomic data. Central to this infrastructure is the Propinquity pipeline, a Snakemake-based workflow designed for constructing comprehensive synthetic supertrees by integrating input phylogenies and taxonomies.[23] Propinquity relies on the otcetera library, a set of C++ tools for phylogenetic tree manipulations, including supertree operations that prioritize compatibility across source trees.[24] This pipeline automates the transformation of data into a unified format, performs taxonomic mapping, and generates grafted supertrees, enabling scalable synthesis for millions of taxa.[25]
For taxonomy management, the project uses Smasher, a Java-based tool within the reference-taxonomy repository that merges multiple input taxonomies (such as those from NCBI, ITIS, and GBIF) into the Open Tree Taxonomy (OTT) by resolving synonyms, hierarchies, and conflicts through rule-based algorithms.[19] Smasher outputs a stable, unique identifier system (OTT IDs) for taxa, which underpins subsequent phylogenetic integrations.[26] These tools are hosted on the OpenTreeOfLife GitHub organization, providing version-controlled code, documentation, and issue tracking for community contributions.[27]
Automated workflows are supported through language-specific packages that interface with the project's web-service APIs, allowing users to query taxonomy and tree data programmatically.
The OpenTree Python package wraps API endpoints for tasks like retrieving induced subtrees, matching taxa via the Taxonomic Name Resolution Service (TNRS), and downloading study metadata, facilitating custom syntheses and analyses.[28] Similarly, the rotl R package provides functions to access the same endpoints, including tnrs_match_names for name matching and tol_induced_subtree for extracting phylogenies, enabling seamless integration into R-based ecological modeling.[29] These APIs, documented in the project's wiki, include dedicated endpoints for taxonomy (e.g., /tnrs/match_names) and trees (e.g., /phylesystem/v1/study), supporting JSON responses for efficient data retrieval.[5]
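A minimal sketch of calling these web services directly from Python, using only the standard library, might look like the following. Several details are assumptions rather than facts taken from this article: the base URL `https://api.opentreeoflife.org/v3`, the `/tree_of_life/induced_subtree` endpoint, the `names` and `ott_ids` request fields, and the `results -> matches -> taxon -> ott_id` response layout should all be checked against the current API documentation. The helper names `build_request`, `extract_ott_ids`, `call`, and `demo` are hypothetical.

```python
import json
import urllib.request

# Assumed base URL for version 3 of the public API; verify before relying on it.
API_BASE = "https://api.opentreeoflife.org/v3"


def build_request(endpoint, payload):
    """Prepare a JSON POST request for an endpoint (nothing is sent yet)."""
    return urllib.request.Request(
        API_BASE + endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_ott_ids(tnrs_response):
    """Map each queried name to the OTT ID of its first match, if any.

    Assumes the response layout results -> matches -> taxon -> ott_id.
    """
    ids = {}
    for result in tnrs_response.get("results", []):
        matches = result.get("matches", [])
        if matches:
            ids[result["name"]] = matches[0]["taxon"]["ott_id"]
    return ids


def call(request, timeout=30):
    """Send a prepared request and decode the JSON body (needs network access)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def demo():
    """Resolve three names, then fetch the subtree induced by their OTT IDs."""
    names = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
    tnrs = call(build_request("/tnrs/match_names", {"names": names}))
    ott_ids = extract_ott_ids(tnrs)
    subtree = call(build_request("/tree_of_life/induced_subtree",
                                 {"ott_ids": list(ott_ids.values())}))
    print(subtree.get("newick"))  # assumed response field holding the tree
```

Calling `demo()` performs two live requests, so it is kept out of module import. Taking the first match for each name is a shortcut; a careful workflow would inspect all candidate matches, since name resolution can return approximate or ambiguous hits.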
The infrastructure emphasizes reproducibility: Propinquity and its associated tools allow users to regenerate synthetic trees from archived source data using pinned pipeline versions, such as the commit SHA recorded for the July 2024 synthesis release (v15.1).[3] Recent updates, including the May 2024 taxonomy release (v3.7), have incorporated pipeline enhancements for improved data handling and API reliability, ensuring stable access to evolving resources.[9]