Entrez
Entrez is the National Center for Biotechnology Information's (NCBI) primary text-based search and retrieval system, designed to provide integrated access to a vast array of biomedical and life sciences databases, including nucleotide and protein sequences, gene information, medical literature, and genomic data.[1] Developed initially in 1991 as a CD-ROM-based tool for querying linked databases, Entrez evolved into a web-accessible platform by 1993, enabling users worldwide to perform unified searches across disparate resources such as PubMed for scientific literature, GenBank for nucleotide sequences, and Protein for amino acid sequences.[2] Today, it encompasses over 30 interconnected databases, including BioProject, ClinVar, Gene, OMIM, SNP, and Taxonomy, facilitating cross-references and links that reveal relationships between molecular data, publications, and biological entities.[1]
Key features of Entrez include advanced search capabilities with Boolean operators, field-specific queries, and facets for refining results, as well as tools like Search History and Clipboard for managing and combining queries efficiently.[1] Users can access Entrez through the NCBI website, where a single interface allows retrieval of records in formats like FASTA or XML, and integration with My NCBI enables personalized collections and alerts.[1] The system's programming utilities, known as E-utilities, extend this functionality by giving developers programmatic access for building custom applications that query and retrieve data.[3] Since its inception under NCBI (established in 1988 to advance biotechnology through information services), Entrez has become a cornerstone of biomedical research, supporting discoveries in genomics, proteomics, and clinical genetics by democratizing access to high-quality, curated data.[2]
Introduction
Purpose and Scope
Entrez serves as the National Center for Biotechnology Information's (NCBI) primary text-based search and retrieval system, designed to integrate diverse biomedical databases for unified querying across literature, molecular sequences, and related resources.[1] Developed by NCBI, a division of the U.S. National Library of Medicine, it enables users to perform cross-database searches that connect disparate data types, such as linking a gene sequence to its associated publications or structural models.[1]
The core purpose of Entrez is to facilitate efficient discovery, retrieval, and interconnection of biomedical information, supporting researchers, clinicians, and educators in navigating complex scientific datasets.[1] By providing a single interface for querying over 30 NCBI databases, including those on DNA and protein sequences, genes, genomes, and genetic variations, it streamlines access to interconnected knowledge without requiring users to switch between isolated tools.[1] This integration addresses the need for cohesive exploration in molecular biology, where related data often spans multiple domains.[4]
Entrez's scope is limited to public-domain biomedical data hosted by NCBI, encompassing molecular, genomic, and literature resources while excluding proprietary or non-biomedical content.[1] It emphasizes free, open access to these resources worldwide, with no subscription barriers as of 2025, ensuring broad availability for global scientific use.[1] Historically, Entrez was first released in 1991 to resolve the fragmented access to molecular biology databases that characterized the pre-1990s era.[2]
Integration with NCBI Resources
Entrez serves as the primary unified interface for accessing and retrieving data from NCBI's extensive suite of over 30 interconnected databases and tools, enabling users to perform cross-resource searches without needing to navigate multiple standalone platforms.[1] This integration facilitates seamless transitions from Entrez search results to specialized NCBI tools, such as BLAST for sequence similarity searches, Primer-BLAST for designing PCR primers against specific templates, and ClinVar for exploring clinically relevant genetic variants.[1][5] By linking query outputs directly to these resources, Entrez supports efficient workflows for researchers, clinicians, and educators engaging with biomedical data.[1]
A key example of this cross-integration is how Entrez queries can feed into visualization and analysis platforms like the Genome Data Viewer and the NCBI Datasets resource, which resulted from the June 2024 merger of the legacy Entrez Genome and Assembly websites to provide streamlined access to genome assemblies and related metadata.[5] Users can initiate a search in Entrez for a gene or sequence, then transition to Datasets for downloading complete genome datasets or to the Genome Data Viewer for interactive browsing of chromosomal contexts, annotations, and alignments.[5] This interconnected approach ensures that data from sources like the Sequence Read Archive (SRA), which alone exceeds 47 petabytes, is accessible through a single entry point.[5]
The benefits of Entrez's integration extend to providing "one-stop" access to NCBI's vast repository, encompassing 4.6 billion records across 31 knowledgebases as of August 2024, while handling the underlying indexing and retrieval processes to simplify use for non-specialized users.[5] Entrez employs controlled vocabularies and ontologies, notably Medical Subject Headings (MeSH) for literature indexing in PubMed and the NCBI Taxonomy for organism classification, to enable standardized, precise querying across disparate resources.[1][6] These ontologies promote consistent data linkage and discovery, reducing ambiguity in searches involving biomedical terms or evolutionary relationships.[1]
Supported Databases
Literature and Biomedical Databases
Entrez provides access to several key databases focused on biomedical literature and publications, enabling researchers to search, retrieve, and analyze citations, abstracts, and full-text content. These resources form the backbone of literature-based inquiries in the biomedical sciences, supporting evidence-based research and knowledge synthesis. PubMed serves as the primary literature database within Entrez, containing more than 39 million citations and abstracts of biomedical literature sourced from MEDLINE, life science journals, and online books.[7] It includes links to full-text articles where available and employs Medical Subject Headings (MeSH) for precise indexing and retrieval, facilitating targeted searches across diverse topics in medicine and biology.[8] PubMed's coverage extends to journals from the 1940s onward, with comprehensive indexing beginning in 1966 and retrospective inclusion of earlier citations through OLDMEDLINE for pre-1966 literature.[9] PubMed Central (PMC) functions as an open-access subset of PubMed, offering free full-text access to a growing archive of biomedical and life sciences journal articles deposited by publishers and authors. 
As of 2025, PMC supports compliance with the 2024 NIH Public Access Policy, which removes the earlier 12-month embargo and requires public access to NIH-funded articles upon publication, effective July 1, 2025, thereby enhancing the dissemination of peer-reviewed content.[10]
The NCBI Bookshelf complements these resources by providing free online access to full-text books, reports, and documents in the biomedical, life sciences, health care, and medical humanities fields.[11] Integrated into Entrez, Bookshelf enables contextual reading alongside journal literature, with searchable content from more than 13,000 titles that include authoritative textbooks, technical reports, and educational materials to support in-depth study and reference.[12]
These databases collectively allow Entrez users to perform unified searches across literature holdings, linking citations to related biomedical data for holistic research exploration. Additional resources in this category include the Online Mendelian Inheritance in Man (OMIM) database, which catalogs genes and genetic phenotypes associated with inherited diseases.[13]
Molecular Sequence and Gene Databases
The Nucleotide database in Entrez serves as a comprehensive repository for DNA and RNA sequences, primarily through its integration with GenBank, the annotated collection of publicly available nucleotide sequences submitted by researchers worldwide. GenBank, established in 1982, contains over 5.9 billion records encompassing 47.01 trillion bases as of release 268.0 in August 2025, covering sequences from viruses, prokaryotes, eukaryotes, and organelles.[14] Each record includes detailed annotations such as gene names, protein products, biological source, and literature references, facilitating functional analysis and comparative genomics. Submission to GenBank follows standardized guidelines outlined by the International Nucleotide Sequence Database Collaboration (INSDC), ensuring data quality through validation tools like the Submission Portal and BankIt, which support formats including FASTA and feature annotations for exons, introns, and regulatory elements. The Protein database in Entrez provides a centralized collection of amino acid sequences derived mainly from the conceptual translations of coding regions in nucleotide records, augmented by curated entries from sources like RefSeq, Swiss-Prot, and PDB. This database enables researchers to perform sequence alignments, homology searches, and functional predictions using integrated tools such as BLAST for identifying similar proteins across species. With a focus on non-redundant representations where possible, it supports applications in structural biology, evolutionary studies, and drug discovery by linking sequences to experimental data like enzymatic activities and post-translational modifications. 
Entrez Protein emphasizes practical utilities, including multiple sequence alignment viewers and prediction algorithms for secondary structure and domains, enhancing its role in proteomics workflows.[15]
Entrez Gene offers a gene-centered view of genomic information, aggregating curated records from RefSeq and other sources to provide summaries of gene function, location, expression patterns, and interactions for organisms ranging from bacteria to humans. Each gene record includes details on orthologs across species, genetic variants, pathways, and expression data from sources like GEO, with over 50 million loci documented as of 2025. In 2025, NCBI introduced redesigned Gene pages through the Datasets tool, featuring an intuitive interface for downloading sequences, annotations, and metadata in formats like JSON or TSV, improving accessibility for bulk analysis and visualization of gene models. This update integrates variant information from dbSNP, allowing users to explore SNPs, indels, and their clinical implications directly within gene contexts.[16][17]
Key to navigating these databases is Entrez's support for standard data formats and identification systems, such as FASTA for sequence retrieval and display, which simplifies importing data into analysis software like sequence aligners or phylogenetic tools. Accession numbers serve as stable identifiers, with the legacy GI (GenInfo Identifier) system supplemented by unique IDs (UIDs) for versioning and tracking updates, ensuring traceability in publications and databases. Furthermore, Gene records link seamlessly to dbSNP for variant analysis, enabling queries on population frequencies and phenotypic associations without leaving the Entrez environment. The dbSNP database itself catalogs single nucleotide polymorphisms (SNPs), insertions, deletions, and other variants, supporting genetic association studies and population genetics. 
These features, combined with cross-links to PubMed for relevant literature, underscore Entrez's utility in integrating molecular sequence data for comprehensive biological research.
Taxonomy and Structural Databases
The Entrez Taxonomy database provides a curated hierarchical classification and nomenclature system for organisms represented in public sequence databases, encompassing over 2.7 million taxonomic nodes as of 2025. This includes detailed lineage information tracing evolutionary relationships from domains to species, facilitating phylogenetic analysis through an interactive taxonomy browser that displays the tree structure and links to related genomic data. The database covers a broad spectrum of life forms, with approximately 595,000 nodes for bacteria, 15,000 for archaea, 1.8 million for eukaryotes (including major subgroups like metazoa, fungi, and viridiplantae), and 273,000 for viruses, enabling researchers to explore organismal diversity in evolutionary and structural biology contexts.[18][19][20] Recent enhancements, including 2024 updates to prokaryotic classifications and integration with metagenomic data, support taxonomic assignment for uncultured microbial communities by incorporating environmental sequencing projects into the hierarchy. These updates align with the International Committee on Taxonomy of Viruses (ICTV) and other standards, improving resolution for viral and bacterial phylogenies. Additionally, the BioProject database within Entrez offers metadata on sequencing initiatives, such as project scope, organism associations, and assembly details, which link directly to taxonomy entries to contextualize large-scale genomic efforts without delving into raw sequence data from sources like GenBank. 
Following the 2024 merger of Entrez Genome and Assembly resources into NCBI Datasets, taxonomy records now provide streamlined access to genome assemblies, enhancing links between organism classifications and structural assemblies for viral, bacterial, and eukaryotic entries.[21][22][23][24][25]
The Entrez Structure database, centered on the Molecular Modeling Database (MMDB), archives three-dimensional molecular structures derived from the Protein Data Bank (PDB), focusing on proteins, nucleic acids, and complexes to support studies in structural biology and evolution. As of March 2025, MMDB contains over 233,000 structure records, each enhanced with annotations like chemical graphs, secondary structure assignments, and cross-references to sequence data for functional inference. These models enable visualization of evolutionary conservation through domain alignments and superposition tools. Integrated with the Cn3D viewer, users can interactively explore 3D structures alongside phylogenetic lineages from Taxonomy, highlighting structural motifs across related organisms without requiring separate software.[26][27][28][29]
Other notable databases in the taxonomy and structural categories include ClinVar, which aggregates information about genomic variations and their relationship to human health.[30]
Core Features
Search and Query Capabilities
Entrez supports a range of search mechanisms designed to facilitate precise retrieval from its integrated databases. Users can construct queries using Boolean operators such as AND, OR, and NOT, which must be entered in uppercase to ensure proper processing. These operators allow for complex combinations, evaluated from left to right unless parentheses are used to group terms, as in the example "g1p3 AND (response element OR promoter)".[1]
Field-specific searches enhance targeting by restricting terms to particular data elements, using square bracket notation like [field]. For instance, in PubMed, [tiab] limits searches to titles and abstracts, while [au] specifies authors and [organism] denotes species. Advanced filters further refine queries, including date ranges (e.g., "2015/3/1:2016/4/30[Publication Date]") and MeSH terms (e.g., "neoplasms[MeSH Terms]"), enabling users to narrow results by publication date, organism, or other indexed attributes.[1]
The Global Query feature provides a unified entry point, allowing a single search string to span all Entrez databases simultaneously via the NCBI homepage. This returns ranked results across databases, ordered by relevance scoring based on term frequency and proximity, with options to filter by database type for focused exploration.[1]
Search History maintains a record of recent queries for up to eight hours of inactivity, permitting users to revisit, combine, or modify them through the Advanced Search interface. Complementing this, the Clipboard holds up to 500 search results per database until the user acts on them. Results from either can be exported via the "Send to" menu in formats such as XML or CSV, depending on the database, for offline analysis or integration with other tools.[1]
Linking and Cross-Database Navigation
Entrez employs a sophisticated system of hyperlinks known as links to facilitate navigation between related records within and across its integrated databases, enabling users to discover contextual connections without reformulating searches. These links are categorized into two primary types: hard links and neighbor links. Hard links are direct, predefined connections derived from the inherent data relationships in records, such as a PubMed article linking to the Gene entry it cites or a Protein sequence record connecting to its corresponding three-dimensional structure in the Structure database.[1] Neighbor links, in contrast, are computationally generated associations that identify similarities or co-occurrences, such as linking a nucleotide sequence to its taxonomic lineage in the Taxonomy database or suggesting related articles in PubMed based on shared content.[1] This dual approach allows for both explicit and inferred navigation, enhancing the discovery of biological relationships.[1] A key feature of Entrez's cross-database navigation is the use of neighbor links to generate related searches, which provide suggestions based on patterns like co-citation or sequence similarity. For instance, searching for a specific gene in the Gene database may yield neighbor links to homologous sequences in the Nucleotide or Protein databases, derived from alignment algorithms that detect evolutionary relationships.[31] These suggestions appear as facets or sidebar options in search results, allowing users to pivot seamlessly to pertinent data in other databases, such as from a literature abstract to associated genomic variants in dbSNP.[1] By prioritizing these automated connections, Entrez supports exploratory analysis, where users can trace pathways from molecular data to functional annotations without manual intervention.[1] Entrez's linking system incorporates unique concepts like Related Structures and NCBI Orthologs to represent complex biological networks. 
Related Structures uses the Vector Alignment Search Tool (VAST) to compute neighbor links between protein structures based on three-dimensional similarity, enabling navigation from one structure record to others with analogous folds or functions, such as linking a query enzyme to evolutionarily conserved homologs.[1] Similarly, NCBI Orthologs aggregates orthologous genes across species through automated detection, providing links from a Gene record to 1:1 orthologs in over 100 species, which aids in comparative genomics.[25] These features rely on underlying indexing that groups records by shared attributes, forming conceptual graphs of relatedness.[31]
For efficient large-scale navigation, Entrez supports batch linking via Unique Identifiers (UIDs), allowing programmatic retrieval of connections for multiple records simultaneously through tools like the E-utilities' elink function. This capability is particularly useful for workflows involving high-throughput data, where users can fetch neighbor links for an entire set of PubMed articles to their cited genes or proteins in one operation.[32] Overall, this hyperlink infrastructure transforms static database entries into a dynamic, interconnected knowledge graph, promoting interdisciplinary insights in biomedical research.[1]
Access and Usage
Web-Based Interface
The Entrez web-based interface provides an intuitive browser-accessible entry point for users to search and retrieve data from NCBI's interconnected databases, centered around a prominent search bar located at the top of the NCBI homepage. This search bar allows users to enter queries using natural language terms, phrases, Boolean operators (such as AND, OR, and NOT), wildcards, and field-specific restrictions, with a pull-down menu for selecting from over 30 supported databases. Below the search bar, options for advanced search link to a dedicated builder tool that enables constructing complex queries via indexed fields and maintains a search history for iterative refinement. The overall layout emphasizes simplicity and accessibility, including skip-to-content links and access keys for keyboard navigation, ensuring compliance with web standards for users with disabilities.[1] Upon submitting a search, results appear in a paginated summary view, displaying 20 records per page by default, with adjustable settings via a "Display Options" menu to show 10, 50, 100, or 200 items. The left-hand sidebar features facets for filtering results by attributes like publication date, organism, or availability of full text, allowing users to narrow large result sets efficiently. Individual records can be expanded to full views tailored to the database, revealing detailed metadata, abstracts, or sequences, while a "Send To" dropdown facilitates exporting selections to formats such as CSV, XML, or direct integration with tools like citation managers. Pagination controls at the bottom of result pages enable navigation through thousands of hits, supporting workflows from broad discovery to targeted retrieval.[1] To aid users, the interface integrates comprehensive help resources, including inline tooltips, a searchable help manual with tutorials on query syntax and navigation, and guided examples for common tasks. 
Integration with NCBI Accounts via My NCBI allows registered users to save searches, set up email alerts for new results, and store collections of records for later access, addressing limitations in anonymous sessions by persisting preferences across devices. The interface incorporates a responsive, mobile-first design that adapts to various screen sizes, enhancing usability on tablets and smartphones without requiring separate apps. As of 2025, these features reflect ongoing refinements to streamline the user experience, with no major redesign implemented.[1][33]
Programmatic and API Access
Entrez offers programmatic access primarily through the Entrez Programming Utilities (E-utilities), a suite of nine server-side programs that provide a stable interface for querying and retrieving data from its interconnected databases.[34] These utilities enable developers to perform operations such as searching, fetching records, summarizing data, and linking across databases, supporting output formats including XML and, for select utilities, JSON.[3] Key examples include ESearch, which retrieves unique identifiers (UIDs) matching a query term, and EFetch, which downloads full records based on those UIDs.[35]
To prevent server overload, NCBI imposes rate limits on E-utilities requests: three per second without an API key and ten per second with a registered key obtained via an NCBI account.[36] Developers must adhere to these guidelines, which also recommend batching large jobs by using the WebEnv parameter and History server to store intermediate search results as temporary sessions, allowing subsequent utilities to reference and process them efficiently without repeated full queries.[3] For instance, a workflow might involve EPost to upload a large list of UIDs into a history session, followed by EFetch in batches to retrieve records while respecting limits.
Several programming libraries and tools simplify integration with E-utilities. In Python, Biopython's Bio.Entrez module wraps the utilities, offering functions like esearch() and efetch() that handle URL construction, XML parsing, and automatic rate limiting.[37] For R users, the rentrez package provides similar functionality, including entrez_search() for querying and entrez_fetch() for data retrieval, with built-in support for API keys and JSON output.[38] On Unix-like systems, Entrez Direct (EDirect) enables command-line scripting through executables like esearch and efetch, facilitating pipeline automation and integration with tools such as awk or sed for data processing; EDirect was updated to version 24.2 on June 20, 2025, with refactored archive paths.[32][39]
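The search-then-fetch workflow with the History server described above can be sketched using only the Python standard library. This is a minimal illustration, not NCBI's or Biopython's implementation: the helper names (esearch_url, parse_history, efetch_batch_urls) are hypothetical, the sample XML and its WebEnv value are made up, and no network request is sent. The sketch only builds the request URLs that a client would issue, reusing the field-tag and date-range syntax from the search section.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Public base URL of the E-utilities endpoints.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, api_key=None):
    """Build an ESearch URL that stores its result set on the History server."""
    params = {"db": db, "term": term, "usehistory": "y", "retmax": 0}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def parse_history(esearch_xml):
    """Extract the hit count and History server handles from ESearch XML."""
    root = ET.fromstring(esearch_xml)
    return {
        "count": int(root.findtext("Count")),
        "query_key": root.findtext("QueryKey"),
        "webenv": root.findtext("WebEnv"),
    }


def efetch_batch_urls(db, history, batch_size=200,
                      rettype="abstract", retmode="text"):
    """Yield one EFetch URL per batch, paging with retstart/retmax."""
    for start in range(0, history["count"], batch_size):
        params = {
            "db": db,
            "query_key": history["query_key"],
            "WebEnv": history["webenv"],
            "retstart": start,
            "retmax": batch_size,
            "rettype": rettype,
            "retmode": retmode,
        }
        yield f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Illustrative ESearch response; the WebEnv string and counts are invented.
SAMPLE_XML = """<eSearchResult>
  <Count>450</Count><RetMax>0</RetMax><RetStart>0</RetStart>
  <QueryKey>1</QueryKey><WebEnv>EXAMPLE_WEBENV</WebEnv>
</eSearchResult>"""

if __name__ == "__main__":
    query = "neoplasms[MeSH Terms] AND 2015/3/1:2016/4/30[Publication Date]"
    print(esearch_url("pubmed", query))
    history = parse_history(SAMPLE_XML)
    for url in efetch_batch_urls("pubmed", history):
        print(url)
```

In practice a wrapper such as Bio.Entrez or rentrez would perform these steps and send the requests; the sketch is only meant to make the role of usehistory, WebEnv, and query_key concrete.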
While the web-based interface serves manual exploration, E-utilities and associated libraries are designed for scripted, high-volume access in research workflows, ensuring compliance with NCBI's policies on data usage and attribution. The E-utilities documentation was last updated on March 25, 2025.[3][34]
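For scripts that call E-utilities directly rather than through a library, the request limits quoted above (three per second without an API key, ten per second with one) can be respected with a small client-side throttle. The class below is a hypothetical sketch, not part of any NCBI tool; its name and parameters are assumptions, and the clock and sleep functions are injectable only so the behavior can be checked without real delays.

```python
import time


class RequestThrottle:
    """Space out calls so they stay within a requests-per-second budget."""

    def __init__(self, has_api_key=False, clock=time.monotonic, sleep=time.sleep):
        # Limits as stated in the text: 3 requests/s without a key, 10 with one.
        self.min_interval = 1.0 / (10 if has_api_key else 3)
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time at which the previous request was released

    def wait(self):
        """Block until the next request may be sent; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        # Re-read the clock so any oversleep counts toward the next interval.
        self._last = self._clock()
        return delay
```

A caller would invoke wait() immediately before each request. Enforcing a fixed minimum gap between consecutive requests is a conservative choice: it never exceeds the stated limit, at the cost of not allowing short bursts.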