GenBank

GenBank is the National Institutes of Health (NIH) genetic sequence database, an annotated collection of all publicly available nucleotide sequences (DNA and RNA) and their protein translations, designed to provide unrestricted access to the scientific community for research and analysis. Established in 1982 with an initial release containing 680,338 bases in 606 sequences, GenBank originated as a collaborative effort to centralize genetic data and has since grown exponentially under the management of the National Center for Biotechnology Information (NCBI). In 1992, NCBI assumed full responsibility for its development and maintenance, fostering international partnerships that accelerated its expansion from 51 million bases in 1990 to over 47 trillion base pairs across 5.9 billion sequences, representing more than 580,000 formally described species, as of August 2025. As a core member of the International Nucleotide Sequence Database Collaboration (INSDC) alongside the DNA Data Bank of Japan (DDBJ) and the European Nucleotide Archive (ENA), GenBank exchanges data daily to maintain a synchronized, global repository of primary sequence information. This collaboration supports open-access principles, allowing scientists from 121 countries to submit data via tools like the Submission Portal, which now includes features for uploading mRNA feature tables and accelerated processing for urgent cases. Submissions undergo automated and manual quality checks, with options for delayed release until publication, while human sequences must exclude identifying information to protect privacy. GenBank's data powers downstream resources like RefSeq and NCBI Gene and supports a wide range of research applications; it also absorbed a surge in viral submissions during the COVID-19 pandemic, with total viral sequences reaching 6.8 million by 2021, of which 2.2 million were coronaviruses (largely SARS-CoV-2).
Bi-monthly releases are freely available via FTP, and users can access records through Nucleotide, BLAST searches, or NCBI Datasets, promoting FAIR (Findable, Accessible, Interoperable, Reusable) data principles.

Introduction and Overview

Definition and Purpose

GenBank is an open-access, annotated collection of all publicly available nucleotide sequences and their associated biological information, maintained by the National Center for Biotechnology Information (NCBI) at the U.S. National Institutes of Health (NIH). As the NIH's primary genetic sequence database, it serves as a comprehensive repository designed to provide unrestricted access to DNA and RNA sequence data for the global scientific community. Established in 1982 under NIH funding at Los Alamos National Laboratory, GenBank was created to centralize the rapidly expanding volume of DNA sequence data produced by early sequencing technologies, addressing the need for a centralized resource amid growing genomic research. Its core objectives include facilitating scientific discovery through free and open access to genetic information, thereby supporting advancements in genomics, evolutionary biology, and medicine. Specifically, it enables critical analyses such as sequence comparison, gene function prediction, and phylogenetic studies, which underpin research in molecular biology and related fields. GenBank records integrate nucleotide sequences with derived protein translations, allowing users to explore coding regions and their translated products without needing separate databases. As a member of the International Nucleotide Sequence Database Collaboration (INSDC), it synchronizes data daily with partner repositories ENA and DDBJ to ensure a unified global resource.

Scope and Content

GenBank encompasses a vast array of nucleotide sequence data, primarily DNA and RNA sequences submitted by researchers worldwide. These include genomic DNA from chromosomes and organelles, messenger RNA (mRNA) transcripts, ribosomal RNA (rRNA), transfer RNA (tRNA), and non-coding regions such as regulatory elements and introns. Each sequence entry is accompanied by annotations that describe biological features, including gene locations, protein products, exons, introns, coding sequences (CDS), and functional elements like promoters and polyadenylation sites. Entries also link to bibliographic references, such as peer-reviewed publications, to provide context for the sequence's discovery and characterization.

The database's coverage is exceptionally broad, encompassing sequences from over 581,000 formally named species as well as unnamed organisms found in metagenomic studies, and spanning all domains of life: viruses, bacteria, archaea, and eukaryotes ranging from unicellular protists to complex multicellular organisms like plants, animals, and fungi. This includes both complete genome assemblies and partial sequences derived from targeted sequencing efforts, such as expressed sequence tags (ESTs) or amplicons from specific loci. Metagenomic samples from environmental sources, like soil microbiomes or ocean water, further extend the scope to uncultured microbial communities, enabling research into their diversity and dynamics. By late 2024, GenBank held more than 4.7 billion records totaling approximately 34 trillion base pairs, a figure that continued to grow rapidly into 2025; as of Release 268.0 in August 2025, the database exceeded 47 trillion base pairs across traditional and set-based records.

Content in GenBank is organized into divisions to facilitate targeted access and management. Standard divisions categorize sequences by type or source, such as PRI for primate sequences (including human), ROD for rodents, PLN for plants and fungi, BCT for bacteria, VRL for viruses, and ENV for environmental samples. Specialized divisions handle high-throughput data, including WGS for whole genome shotgun assemblies, TSA for transcriptome shotgun assemblies, and GSS for genome survey sequences. This structure supports efficient storage and retrieval, with each division split into numbered files (e.g., gbpri1.seq for the first part of the primate sequences) to manage the enormous volume of data.

A distinctive feature of GenBank is its emphasis on annotation depth and standardization, which makes sequences easier to interpret. Annotations employ controlled vocabularies defined by the International Nucleotide Sequence Database Collaboration (INSDC), ensuring consistent terminology for features, such as "/gene" for gene names, "/product" for protein descriptions, and "/inference" for non-experimental evidence supporting predictions, such as similarity to known sequences. This richness distinguishes GenBank from raw sequence repositories, giving users annotated context on sequence function, evolution, and variation without extensive post-processing. Bibliographic links further integrate sequences with the primary literature, supporting genomic research across disciplines.
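The division-to-file naming described above can be illustrated with a short Python sketch. It assumes the "gb" + division code + part number + ".seq" pattern implied by the gbpri1.seq example; the function name and the choice to cover only the six standard divisions listed here are illustrative assumptions, not an NCBI interface.

```python
# Standard division codes named in the text, with their usual meanings.
DIVISIONS = {
    "PRI": "primate sequences",
    "ROD": "rodent sequences",
    "PLN": "plant and fungal sequences",
    "BCT": "bacterial sequences",
    "VRL": "viral sequences",
    "ENV": "environmental samples",
}


def release_filename(division: str, part: int) -> str:
    """Build a flat-file name such as 'gbpri1.seq' (hypothetical helper)."""
    code = division.upper()
    if code not in DIVISIONS:
        raise ValueError(f"unknown division code: {division!r}")
    if part < 1:
        raise ValueError("part numbers start at 1")
    return f"gb{code.lower()}{part}.seq"


if __name__ == "__main__":
    for code, meaning in DIVISIONS.items():
        print(code, meaning, release_filename(code, 1))
```

The actual number of parts per division changes with every release, so real code should read the file listing from the release itself instead of guessing part numbers.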

History and Development

Origins and Early Years

GenBank was initiated in 1982 by Walter Goad at Los Alamos National Laboratory (LANL), with funding from the National Institutes of Health (NIH) as well as contributions from the U.S. Department of Energy (DOE) and other agencies, to address the increasing influx of DNA sequences produced by the manual sequencing methods then becoming prevalent in research. Goad, a biophysicist in LANL's Theoretical Biology and Biophysics Group, envisioned a centralized repository to collect, annotate, and distribute nucleic acid sequence data, filling a critical need as the volume of published sequences grew beyond what individual researchers could manage.

Early operations centered on quarterly releases of the database, distributed primarily on magnetic tape to academic and research institutions worldwide so that researchers could access the data on their local systems. The inaugural public release, known as Release 3, occurred in December 1982 and included 606 sequences comprising 680,338 base pairs, reflecting the modest scale of sequence data available at the time. Key members of the LANL team, including Christian Burks, played pivotal roles in curating entries, developing submission protocols, and maintaining data quality amid the nascent field's demands. The team encountered substantial challenges as the growth of submissions rapidly outstripped available resources, prompting ongoing optimizations in storage and retrieval efficiency.

To facilitate broad accessibility and portability across diverse computing environments, GenBank adopted a text-based flat-file format from the outset, featuring structured records with sequence data, annotations, and references, supplemented by basic indexing for keyword-based searches. This design emphasized simplicity and interoperability, enabling easy transfer via tapes without reliance on specialized software.

Key Milestones and Transitions

In 1988, the U.S. Congress established the National Center for Biotechnology Information (NCBI) within the National Library of Medicine at the National Institutes of Health (NIH) to advance computational biosciences, including the management of genetic sequence data. This marked the beginning of GenBank's transition from its initial custodians at LANL to federal oversight under NIH. The handover spanned 1989 to 1992, culminating in October 1992 when NCBI assumed full responsibility for GenBank's operations, data distribution, and development. Concurrently, NCBI introduced the Entrez retrieval system in 1991, enabling integrated access to GenBank sequences alongside related protein, taxonomy, and literature data, which changed how users interacted with the database.

The 1990s brought pivotal technological integrations that expanded GenBank's utility and reach. In 1990, NCBI developed the Basic Local Alignment Search Tool (BLAST), a high-speed algorithm for identifying sequence similarities against GenBank entries, facilitating the rapid genomic comparisons essential for emerging research. Throughout the decade, GenBank adopted internet-based distribution methods, including anonymous FTP access and web interfaces, shifting from primary reliance on CD-ROMs to network delivery, a change that grew in importance as submissions increased exponentially. GenBank's release numbering, which began with Release 3 in December 1982, continued on a bimonthly schedule, providing structured versioning of the database so that updates could be tracked systematically.

The 2000s and 2010s saw GenBank adapt to the explosion of high-throughput sequencing data, driven by advances in genomic technologies. By December 2000 (Release 121), GenBank had amassed over 10 million sequences, encompassing 11 billion bases, reflecting the impact of large-scale projects like the Human Genome Project. To accommodate unfinished high-throughput genomic sequences, NCBI created the High-Throughput Genomic Sequences (HTGS) division in 1999, allowing rapid deposition of draft data without full assembly. By 2010, GenBank was incorporating next-generation sequencing (NGS) outputs through the Whole Genome Shotgun (WGS) division and coordination with the Sequence Read Archive (SRA), handling the surge in short-read data from platforms like Illumina, which multiplied sequence volumes by orders of magnitude.

From 2020 to 2025, GenBank underwent transitions to manage escalating data volumes and specialized applications, including enhanced cloud-based infrastructure for associated data. The COVID-19 pandemic drove a surge in viral sequence submissions, with SARS-CoV-2 genomes increasing significantly and contributing to overall database growth. NCBI made SRA data, which includes raw reads linked to GenBank entries, available via cloud platforms like AWS and Google Cloud, enabling scalable access to petabyte-scale datasets without local downloads. For metagenomes, submission guidelines were refined as of March 2025 to streamline handling of environmental and metagenomic sequences, encouraging raw read submissions and detailed metadata to support assembly and annotation of uncultured microbial communities through targeted wizards and validation tools.

Organization and Collaboration

International Nucleotide Sequence Database Collaboration (INSDC)

The International Nucleotide Sequence Database Collaboration (INSDC) was established in 1987 as a formal agreement among GenBank, the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database (now the European Nucleotide Archive, or ENA, at EMBL-EBI), and the DNA Data Bank of Japan (DDBJ) to coordinate the collection and dissemination of nucleotide sequence data worldwide. The collaboration arose from earlier efforts in 1986 between GenBank and EMBL to standardize data formats, with DDBJ joining to create a unified framework that prevents data redundancy and ensures comprehensive global coverage of publicly available sequences. Its primary purpose is to facilitate synchronized exchange of core data, enabling researchers to submit sequences to any partner database while guaranteeing identical access across all three archives.

Submitters may choose any partner database, though it is recommended to use the one closest geographically or most convenient for support: GenBank, managed by the National Center for Biotechnology Information (NCBI) in the United States; ENA at EMBL-EBI in the United Kingdom; and DDBJ, operated by the National Institute of Genetics in Japan. To maintain consistency, the partners mirror data daily, exchanging new and updated records in standardized formats such as the INSDC Feature Table, which ensures that the core datasets of annotated sequences are identical across all databases without duplication. This synchronization supports redundancy for data preservation and allows seamless querying from any INSDC portal.

While the core data are mirrored identically, each partner adds value through its own enhancements. For instance, GenBank links its records to other NCBI resources and includes dedicated records for patent sequences derived from intellectual property filings, which are not duplicated in ENA or DDBJ but remain accessible globally via the shared framework. The total holdings of the INSDC, synchronized across partners, comprise over 5.7 billion sequences as of mid-2025, underscoring the collaboration's role in scaling genomic data sharing.

In the 2020s, the INSDC has evolved to address emerging data types and accessibility needs, including joint development of standards for metagenomic and environmental sequencing data in partnership with the Genomic Standards Consortium to improve metadata consistency for such studies. Additionally, the INSDC has reinforced policies aligned with FAIR (Findable, Accessible, Interoperable, Reusable) principles, mandating unrestricted public access to all deposited sequences via unique accession numbers and prohibiting proprietary restrictions on core data. In 2023, the founding members signed a founders' agreement to formalize their collaboration, and the INSDC has since developed a membership process to attract additional qualified sequence archives as new members, enhancing global representation. These updates help the INSDC remain adaptable to high-throughput sequencing advances while upholding its foundational commitment to equitable global data sharing.

Data Management and Standards

GenBank employs a multi-tiered curation process to maintain the integrity and utility of its sequence data, involving both professional curation by NCBI staff for high-profile or complex entries and community-driven updates through author revisions. NCBI staff conduct manual reviews and annotations for select sequences to check the accuracy of biological interpretation, while submitters can request updates or corrections after release, which are verified and incorporated by NCBI curators. All annotations in GenBank records use the Feature Table format, a structured system for describing sequence features like genes, exons, and regulatory elements, which facilitates consistent representation across entries.

Adherence to established standards is central to GenBank's data management. The database follows the International Nucleotide Sequence Database Collaboration (INSDC) Feature Table Definition (FTD) document, which defines feature keys, locations, and qualifiers for annotations. This ensures interoperability and precision in describing biological entities, supplemented by controlled vocabularies for terms related to genomic features. Validation checks are applied during processing, encompassing automated and manual assessments of sequence integrity, such as verifying base composition and length, alongside compliance checks that prevent errors in organism naming or feature labeling.

Internal management tools at NCBI support ongoing quality control through pipelines designed for error detection and mitigation, including contamination screening via the Foreign Contamination Screen (FCS) to identify non-target sequences in submissions. GenBank data are released bimonthly in versioned flat files, allowing users to track changes and access complete datasets via FTP, with daily incremental updates for timely synchronization across INSDC partners. These releases incorporate versioning to preserve historical records while enabling corrections.

Central to GenBank's policies is the open-access status of all deposited data, which permits unrestricted use, reuse, and distribution without licensing fees, though submitters retain any applicable rights. For pre-publication sequences, NCBI handles confidential submissions by withholding them from public access until the specified release date or publication, at which point they enter the open archive.
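As a rough illustration of the sequence-integrity checks mentioned above (base composition and length), the following Python sketch validates a nucleotide string against the IUPAC nucleotide alphabet. It is not NCBI's validation pipeline: the function name, the returned fields, and the `min_length` parameter are assumptions made for this example.

```python
from collections import Counter

# IUPAC nucleotide codes (unambiguous bases plus ambiguity codes).
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def check_sequence(seq: str, min_length: int = 1) -> dict:
    """Report length, invalid characters, and GC fraction (hypothetical helper)."""
    seq = seq.upper()
    invalid = sorted(set(seq) - IUPAC_NUCLEOTIDES)
    counts = Counter(seq)
    gc = counts["G"] + counts["C"]
    # Only unambiguous bases count toward the GC denominator.
    unambiguous = gc + counts["A"] + counts["T"] + counts["U"]
    return {
        "length": len(seq),
        "invalid_characters": invalid,
        "gc_fraction": gc / unambiguous if unambiguous else None,
        "passes": not invalid and len(seq) >= min_length,
    }


if __name__ == "__main__":
    print(check_sequence("ACGTNNacgt"))
    print(check_sequence("ACGX"))
```

A production validator would also need to cover taxonomy, feature locations, and contamination, which this sketch deliberately leaves out.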

Submission and Annotation

Submission Processes

Researchers contribute new nucleotide sequences to GenBank through several established pathways designed to accommodate varying submission sizes and complexities. For small-scale submissions, such as individual sequences or sets of up to 500 entries or 50 kb total, the web-based BankIt tool allows users to enter data interactively via a browser interface, guiding the preparation of sequence and feature information. Larger or bulk submissions, including annotated genomes, use the standalone table2asn software (the successor to tbl2asn), which converts tabular annotation data and FASTA files into the required ASN.1 format (.sqn) for submission. Sequencing centers and high-volume submitters often employ direct FTP uploads to NCBI servers or email submissions to GenBank, facilitating efficient transfer of extensive datasets.

All submissions require specific formats and mandatory metadata to ensure compatibility and traceability. Sequence data must be provided in FASTA format, with annotations in ASN.1 (.sqn) for structured features. Essential metadata includes the source organism, submitter information, references (if applicable), and collection details such as isolate, strain, or geographic location. These elements are verified during submission to align with International Nucleotide Sequence Database Collaboration (INSDC) standards.

The submission workflow begins with pre-submission validation using built-in tools like the validator in table2asn or the Submission Portal's automated checks, which detect issues such as format errors, contamination, or chimeric sequences. Once data are submitted, NCBI staff perform a biological review and assign provisional accession numbers, typically within two working days; examples include standard accessions like U12345 (one letter followed by five digits) and Whole Genome Shotgun (WGS) accessions such as AABM01000000. Full processing, including integration into public releases, takes days to weeks depending on complexity, after which records remain subject to post-submission curation.

GenBank handles substantial submission volumes, with over 7 million new sequence records added in a single year, well above an annual influx of 1 million sequences from global researchers. To manage this scale, specialized tracks exist for high-priority data types, such as complete genomes submitted via the Genome Submission Portal and transcriptome assemblies submitted through the Transcriptome Shotgun Assembly (TSA) pathway, ensuring streamlined processing for large-scale genomic projects.
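The accession formats mentioned above can be recognized with simple patterns. The Python sketch below classifies the two shapes named in the text: one letter plus five digits, and a WGS-style accession of four letters, a two-digit assembly version, and at least six further digits. The two-letter, six-digit form is added here as an assumption about other common nucleotide accessions, and the function name is hypothetical.

```python
import re

# One letter + five digits (e.g. U12345); two letters + six digits is an
# additional common nucleotide form included here as an assumption.
STANDARD_RE = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")
# Four-letter project prefix, two-digit version, six or more digits
# (e.g. AABM01000000).
WGS_RE = re.compile(r"[A-Z]{4}\d{2}\d{6,}")


def classify_accession(accession: str) -> str:
    """Return 'wgs', 'standard', or 'unknown' for an accession string."""
    # Drop surrounding whitespace and any '.version' suffix such as '.1'.
    base = accession.strip().upper().split(".", 1)[0]
    if WGS_RE.fullmatch(base):
        return "wgs"
    if STANDARD_RE.fullmatch(base):
        return "standard"
    return "unknown"


if __name__ == "__main__":
    for acc in ("U12345", "AABM01000000", "U12345.1", "12345"):
        print(acc, classify_accession(acc))
```

Other accession families exist, so an 'unknown' result from this sketch means only that the string matches neither pattern shown here.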

Annotation Guidelines and Quality Control

GenBank annotations are structured using a feature table format that employs qualifier-value pairs to describe biological elements within sequences. These pairs follow the syntax /qualifier="value", where qualifiers provide specific attributes such as gene names or product descriptions. For instance, the qualifier /gene="ABC1" identifies a gene symbol, while /product="protein X" specifies the encoded protein. This system allows for precise, machine-readable descriptions of features like coding sequences (CDS), genes, and sources.

Mandatory fields ensure basic metadata integrity: the /organism qualifier is required on every source feature to denote the biological origin, accompanied by /mol_type (e.g., /mol_type="genomic DNA") to classify the sequence type. Optional qualifiers add detail, such as /locus_tag for unique gene identifiers within a record or /note for additional context. Submitters are responsible for providing accurate annotations, with NCBI offering templates and validation tools like table2asn to facilitate compliance during submission.

Evidence tags distinguish between experimental and computational support for annotations. The /experiment qualifier documents direct evidence, such as /experiment="northern blot", while /inference captures computational predictions, formatted as /inference="ab initio prediction:Prodigal:2.6". These tags promote transparency and reproducibility, adhering to controlled vocabularies to maintain consistency across submissions.

Quality control begins with automated validation during submission processing, using tools to check sequence validity (e.g., detecting internal stop codons or invalid characters), nomenclature consistency (e.g., standardized organism names from the NCBI Taxonomy database), and potential contamination (e.g., mismatched primer sequences or unexpected organism assignments). Common errors, such as missing source descriptors or improper geographic location codes, generate discrepancy reports for correction. Incomplete or erroneous submissions may be rejected or require revisions before acceptance. For complex annotations, NCBI staff conduct manual reviews to verify intricate features and check alignment with INSDC standards. This hybrid approach minimizes errors while handling the volume of submissions, with tools like the GenBank Submission Portal providing real-time feedback. Submitters retain ownership of annotations but must address validation issues to proceed.

In the 2020s, enhancements have streamlined annotation for high-throughput data, including support for GFF3 format uploads to accommodate next-generation sequencing (NGS) assemblies and structured evidence reporting. Standards for synthetic sequences specify the SYN division and qualifiers like /organism="synthetic construct" or /note to flag engineered elements, with validation ensuring clear distinction from natural sequences. As of 2025:

- the Submission Portal supports uploading feature tables for eukaryotic nuclear mRNA sequences, including coding sequences (CDS) and protein annotations;
- the PopSet database was retired in January 2025, with submitters directed to use BioProject records;
- support for experimental and inferential Third Party Annotation (TPA) sequences ended in January 2025;
- AGP files for genome assemblies are no longer accepted, with submitters instructed to use runs of N in sequences to represent gaps.

These updates, together with accelerated processing for specific datasets, reflect ongoing efforts to adapt to evolving genomic technologies.
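The /qualifier="value" syntax and the mandatory source qualifiers described in this section can be sketched in a few lines of Python. This is an illustrative parser for single-line qualifiers only, not the logic of table2asn or any NCBI validator; the function names are hypothetical, and multi-line qualifier values are deliberately not handled.

```python
import re

# '/name' optionally followed by '=value'; value may be quoted or bare.
QUALIFIER_RE = re.compile(r"^/(\w+)(?:=(.*))?$")

# Qualifiers the text describes as required on every source feature.
REQUIRED_SOURCE_QUALIFIERS = ("organism", "mol_type")


def parse_qualifiers(lines):
    """Parse single-line qualifiers into a dict; flag qualifiers map to True."""
    result = {}
    for raw in lines:
        match = QUALIFIER_RE.match(raw.strip())
        if not match:
            continue  # not a qualifier line
        name, value = match.group(1), match.group(2)
        if value is None:
            value = True  # valueless flag qualifier, e.g. /pseudo
        elif len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]  # strip the surrounding double quotes
        result[name] = value
    return result


def missing_source_qualifiers(qualifiers):
    """List required source qualifiers absent from a parsed dict."""
    return [q for q in REQUIRED_SOURCE_QUALIFIERS if q not in qualifiers]


if __name__ == "__main__":
    source = parse_qualifiers([
        '/organism="Saccharomyces cerevisiae"',
        '/mol_type="genomic DNA"',
    ])
    print(source, missing_source_qualifiers(source))
```

Keeping parsing separate from the required-qualifier check makes it easy to reuse the parser for other feature types that have different mandatory fields.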

Access and Retrieval

User Interfaces and Tools

GenBank data is primarily accessed through National Center for Biotechnology Information (NCBI) platforms, which offer a suite of integrated tools for searching, viewing, and analyzing sequences. The core interface for text-based retrieval is the Entrez Nucleotide database, which allows users to query GenBank records using accession numbers, keywords, author names, or other filters. For example, entering an accession like "U49845" retrieves the full annotated sequence record, while a keyword search such as "human BRCA1 gene" yields relevant entries with links to related genomic and literature data.

Graphical browsing is provided by the Genome Data Viewer (GDV), a web-based tool that displays GenBank sequences in a visual format, enabling users to navigate assemblies, zoom into regions, and overlay annotations like genes and variants. GDV supports exploration of eukaryotic genomes from organisms such as humans and mice, with features for comparative viewing across species. This interface is particularly useful for contextual analysis of sequence data without downloading files.

For sequence similarity searches, the Basic Local Alignment Search Tool (BLAST) integrates directly with GenBank, allowing users to input a query sequence and compare it against the nucleotide database to identify homologous regions. Options like blastn for nucleotide-to-nucleotide alignments report alignment scores and E-values, aiding functional inference and phylogenetic studies. BLAST results link back to the original GenBank records for detailed annotation review.

Supporting tools enhance data handling and organization. The Sequence Viewer provides an annotated display of individual records, highlighting features such as coding regions, promoters, and references in a graphical panel embedded within search results. The Taxonomy Browser enables filtering and navigation of GenBank sequences by organismal hierarchy, from broad domains like Bacteria down to specific strains, streamlining organism-specific queries. For bulk operations, Batch Entrez permits uploading lists of identifiers (up to thousands) to retrieve multiple records simultaneously, which suits exporting subsets, such as all sequences from a particular study, for local analysis.

Programmatic access is available via the Entrez Programming Utilities (E-utilities) API, which supports scripted searches and retrievals from scripting languages such as Python, including functions for fetching data by ID or search term. NCBI Datasets offers additional command-line and API access for genome-centric queries, with redesigned views for easier navigation. While no dedicated mobile apps exist for GenBank, the web interfaces are responsive, allowing basic searches and views on mobile devices through browsers.

All these interfaces are freely accessible without login for basic use and integrate with PubMed for linking to associated publications. This no-cost model keeps GenBank broadly available to researchers worldwide.
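To make the E-utilities access path concrete, the following Python sketch builds an EFetch request URL for a nucleotide record. It uses the public E-utilities base URL and the `db`, `id`, `rettype`, and `retmode` parameters; the helper name and its defaults are choices made for this example, and the code only constructs the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI Entrez Programming Utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def efetch_url(accession: str, rettype: str = "gb",
               retmode: str = "text", db: str = "nuccore") -> str:
    """Build an EFetch URL for one record (hypothetical helper).

    rettype="gb" requests the GenBank flat file; rettype="fasta"
    requests sequence-only FASTA output.
    """
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(efetch_url("U49845"))
    print(efetch_url("U49845", rettype="fasta"))
```

The resulting URL can be passed to `urllib.request.urlopen` or any HTTP client. Scripts that issue many requests should follow NCBI's usage guidelines on request rates and API keys.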

Data Formats and Downloads

GenBank primarily distributes its data in the flat-file format, known as the GenBank Flat File (GBFF), which structures each record with a header section, a features table for annotations, and the nucleotide sequence itself. The header includes fields such as LOCUS (specifying the locus name, sequence length, molecule type, and division), DEFINITION (a brief description), ACCESSION (a unique identifier), VERSION (the accession number with a version suffix), SOURCE (organism details), and REFERENCE (citation information). The features table delineates annotated elements like coding sequences (CDS), genes, and regulatory regions using a standardized vocabulary, with locations and qualifiers providing precise details such as product names or translations. This format, exemplified in sample records like accession U49845 for the Saccharomyces cerevisiae TCP1-beta gene, gives a human-readable and parseable representation of complex biological data.

Alternative formats cater to specific use cases. FASTA provides a simplified, sequence-only output with a definition line starting with ">" followed by the accession and description; it carries no annotations and is well suited to alignment tools. ASN.1 (Abstract Syntax Notation One) offers a structured, binary-compatible representation for programmatic access and exchange, supporting hierarchical data like sequences and annotations in a machine-optimized way. These formats, alongside GBFF, are available for download to accommodate diverse computational needs.

Data downloads occur via the NCBI FTP site at ftp://ftp.ncbi.nih.gov/genbank/, where full bimonthly releases (such as Release 268.0 from August 2025, encompassing over 47 trillion bases and 5.9 billion records) are provided in GBFF, FASTA, and ASN.1. Incremental updates, reflecting daily additions from submissions, are also accessible to minimize bandwidth usage for users tracking recent changes. For targeted subsets, NCBI Datasets enables cloud-based access and downloads of genomic data across domains, supporting formats like FASTA for sequences, GFF3 for annotations, and JSON for metadata, integrated with GenBank records.

As part of the International Nucleotide Sequence Database Collaboration (INSDC) with EMBL-EBI (ENA) and DDBJ, GenBank synchronizes data using the shared Feature Table format, which employs EMBL-like flat-file structures for consistent annotation exchange, including feature keys (e.g., CDS), locations, and qualifiers (e.g., /product). XML variants of this table provide machine-readable annotations, facilitating automated parsing across the databases.

Best practices for handling GenBank data emphasize managing file sizes (full releases often exceed 5 TB uncompressed) by using the compressed files available on the FTP site, and tracking updates via stable accession numbers and version identifiers instead of re-downloading entire datasets. Users are advised to verify formats against official documentation to ensure compatibility with analysis pipelines.
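The header layout of a GenBank flat file can be illustrated with a minimal Python parser. The sketch below reads top-level keywords (LOCUS, DEFINITION, ACCESSION, VERSION) up to the FEATURES line and joins indented continuation lines. The embedded record is an abbreviated excerpt in the style of the U49845 sample record and should be treated as illustrative; the function name is hypothetical, and established libraries such as Biopython are a better choice for real parsing.

```python
SAMPLE_RECORD = """\
LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1
FEATURES             Location/Qualifiers
"""


def parse_header(text: str) -> dict:
    """Collect top-level header fields that appear before FEATURES.

    Lines starting in column 0 open a new field; indented lines are
    treated as continuations of the previous field. Indented sub-keywords
    (such as ORGANISM under SOURCE) are therefore folded into their
    parent field, which is a simplification of this sketch.
    """
    fields = {}
    current = None
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            break
        if line[:1].isspace():
            if current is not None:
                fields[current] += " " + line.strip()
            continue
        parts = line.split(None, 1)
        if not parts:
            continue  # blank line
        current = parts[0]
        fields[current] = parts[1].strip() if len(parts) > 1 else ""
    return fields


if __name__ == "__main__":
    for key, value in parse_header(SAMPLE_RECORD).items():
        print(f"{key}: {value}")
```

Stopping at FEATURES keeps the sketch small; the features table and the sequence block each have their own layout and would need separate handling.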

Growth and Impact

GenBank's data volume has grown remarkably since its inception, doubling approximately every 18 months from 1982 onward, a pattern sustained by advances in sequencing technologies and increased research output. This exponential trajectory reflects the broader evolution of genomics, where falling sequencing costs have democratized data generation. In the 1980s, Sanger sequencing costs were around $5–10 per base pair, limiting submissions to targeted experiments and resulting in modest accumulation. By the 2020s, costs had plummeted to less than $0.01 per base pair, enabling high-throughput projects and fueling sustained expansion.

Growth through the 1980s was comparatively slow, moving from hundreds of thousands of bases to tens of millions as manual and early automated sequencing methods prevailed. Release 3 in December 1982 contained just 0.68 million bases from 606 sequences, primarily from small-scale studies of genes and viruses. By 1990, the database had reached 51 million bases across over 41,000 sequences, driven by accumulating data from labs worldwide.

The 2000s marked a shift to much faster growth with the advent of next-generation sequencing (NGS) technologies around 2005, which drastically increased throughput and reduced per-base costs. The completion of the Human Genome Project in 2003, which sequenced approximately 3 billion base pairs, exemplified this surge and encouraged global submissions, propelling GenBank past 100 billion bases by 2010. The following table summarizes key milestones in GenBank releases, highlighting the scale of growth:

| Release Year | Release Number | Total Bases (approximate) | Key Driver |
|---|---|---|---|
| 1982 | 3 | 0.68 million | Initial manual sequencing efforts |
| 1990 | ~50 | 51 million | Early automated sequencing and targeted genomics |
| 2000 | 121 | 11 billion | Pre-NGS high-volume projects |
| 2010 | 178 | 108 billion | NGS adoption |
| 2022 | 250 | 1.39 trillion | Large-scale sequencing surveys |

In the 2020s, growth has accelerated further due to metagenomic and environmental sequencing initiatives, which generate vast datasets from microbial communities and ecosystems, outpacing even the NGS-driven increases of the prior decade. These trends underscore GenBank's role as a foundational repository, with further expansion anticipated from emerging fields.
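The doubling-time claim can be checked against the figures quoted in this article. Assuming steady exponential growth between two points, the doubling time is T = Δt · ln 2 / ln(N_end / N_start). The Python sketch below applies this to the 1990 figure (51 million bases) and the August 2025 figure (47 trillion bases); the function name is hypothetical, and the result depends strongly on which endpoints are chosen.

```python
import math


def doubling_time_years(bases_start: float, bases_end: float, years: float) -> float:
    """Doubling time under the assumption of constant exponential growth."""
    if bases_start <= 0 or bases_end <= bases_start or years <= 0:
        raise ValueError("need positive growth over a positive time span")
    return years * math.log(2) / math.log(bases_end / bases_start)


if __name__ == "__main__":
    # Endpoints taken from the text: 51 million bases (1990), 47 trillion (2025).
    t = doubling_time_years(51e6, 47e12, 2025 - 1990)
    print(f"doubling time: {t:.2f} years ({12 * t:.1f} months)")
```

With these two endpoints the estimate comes out at roughly 21 months, somewhat slower than the 18-month figure, which is a reminder that a single doubling time only approximates growth that has varied from decade to decade.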

Current Statistics and Significance

As of August 2025, GenBank Release 268.0 contains 47.01 trillion base pairs across 5.90 billion sequences, spanning more than 581,000 formally described species. The database receives approximately 1.8 million new sequences daily through incremental updates, reflecting rapid expansion driven by high-throughput sequencing technologies.

The content breakdown highlights the dominance of bacterial and archaeal sequences, which constitute the majority of records due to their prevalence in microbial research. Eukaryotic genomes are well represented, including the human genome, with over 28 million entries for Homo sapiens. Viral sequences have grown particularly rapidly since the COVID-19 pandemic, with more than 9 million entries for SARS-CoV-2 alone.

GenBank plays a pivotal role in modern science by facilitating global research collaborations through the International Nucleotide Sequence Database Collaboration (INSDC), enabling standardized access to data worldwide. It supports AI-driven predictions, such as those from AlphaFold, which relies on GenBank-derived sequences via UniProt for training models. The database's economic impact is substantial in biotechnology and pharmaceuticals, where it underpins genomics-based innovations. Millions of scientific papers reference GenBank accessions annually, underscoring its foundational influence across the life sciences. In 2025, GenBank has seen enhanced holdings in metagenomic data, bolstered by contributions from initiatives like the Earth BioGenome Project, which deposits reference genomes to catalog eukaryotic biodiversity and support conservation efforts.

Challenges and Limitations

Data Quality and Errors

GenBank, as a comprehensive repository of sequences, faces ongoing data quality challenges stemming from the diverse origins of submissions and the volume of legacy records. Common errors include species misidentifications, where sequences are incorrectly assigned to taxa due to taxonomic ambiguities or submitter oversights. For instance, an analysis of cytochrome b (Cytb) gene sequences for fishes identified approximately 2% (1,303 out of 65,326 records) as potentially problematic, involving species misidentification, laboratory contamination, or chimeras. Contamination from laboratory artifacts, such as reagent-derived sequences or cross-sample mixing, is another prevalent issue, with one large-scale screen identifying over 2,000,000 contaminated entries across the database. Additionally, outdated annotations persist, where functional or taxonomic labels fail to incorporate subsequent research findings, leading to discrepancies between GenBank records and current biological knowledge.
These errors often trace back to early manual submissions, which lacked rigorous validation, and to more recent next-generation sequencing (NGS) assembly processes, where algorithmic limitations can introduce chimeric or erroneous contigs. In the case of SARS-CoV-2, early submissions included mislabeled variants and sequences with anomalies that propagated uncertainties in viral phylogenetics. NGS-specific issues, such as errors in long-read assemblies, further compound inaccuracies when unfiltered drafts are deposited. Such problems were exacerbated by the rapid influx of data during events like the COVID-19 pandemic, where incomplete metadata accompanied high-volume uploads. Quantitatively, taxonomic misidentifications in metazoan sequences are estimated at less than 1% at the genus level, though higher rates appear for specific groups such as tetrapods, where up to 32% of re-sequenced specimens yielded discrepant sequences.
These inaccuracies distort phylogenetic reconstructions, biodiversity assessments, and evolutionary analyses by introducing noise that biases tree topologies or inflates divergence estimates. Detection of errors relies on community-driven flagging through update submissions and on NCBI's discrepancy reports, which evaluate annotations for inconsistencies such as mismatched taxonomy or sequence anomalies. Tools such as the Foreign Contamination Screen (FCS) help identify contaminants in new assemblies, but legacy data is not automatically purged, in order to preserve historical records. This manual and semi-automated approach, while effective for ongoing curation, leaves the database vulnerable to errors propagated from unchecked early entries.
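The "approximately 2%" figure for fish Cytb records follows directly from the counts quoted above. A minimal sketch of that arithmetic is given here. The function name is hypothetical and the only inputs are the two counts from the text.

```python
def flagged_percentage(flagged: int, total: int) -> float:
    """Return flagged records as a percentage of all records examined."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * flagged / total


# Counts quoted in the text for fish cytochrome b (Cytb) records.
cytb_rate = flagged_percentage(1_303, 65_326)
print(f"Potentially problematic fish Cytb records: {cytb_rate:.2f}%")
```

Rates like this apply only to the marker and taxonomic group that was screened. They should not be extrapolated to the database as a whole, which is why the genus-level and tetrapod estimates quoted above differ so widely.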

Future Directions and Improvements

GenBank and its collaborators in the International Nucleotide Sequence Database Collaboration (INSDC) are pursuing several initiatives to enhance automated annotation and error correction. Recent advancements include the use of artificial intelligence techniques to improve viral protein annotation in metagenomic datasets, particularly for uncultivated genomes, by leveraging protein language models to detect remote homology and reduce annotation errors. These efforts build on existing automated tools, such as the FLu ANnotation (FLAN) system used by NCBI for validating and predicting protein sequences in influenza virus submissions, which accelerates processing and ensures consistency in high-volume data streams. Future developments emphasize virus-specific models trained on large datasets like those in GenBank to incorporate genomic context and further reduce functional annotation inaccuracies.
INSDC members are advancing standardized reporting for metagenomic data through the adoption of the Genomic Standards Consortium's Minimum Information about any (x) Sequence (MIxS) standard, which extends core requirements to include environmental and sample-specific details for genome and metagenome sequences. This standardization facilitates better comparability across studies and supports the submission of sequences and metagenome-assembled genomes with structured ontologies. Regarding versioning, INSDC members track record history through unique accession numbers and update mechanisms, allowing submitters to revise records while preserving earlier versions, though propagating corrections across linked entries remains difficult because systematic tracking is limited. To address ongoing limitations, there is growing emphasis on improving data provenance through richer metadata in BioSample records, which promotes consistent taxonomic assignments and clearer documentation of sequence origins.
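Sequence revisions are tracked with identifiers of the form accession.version (for example U49845.1): the accession stays stable, and the version suffix increments when the sequence itself is updated. The sketch below shows one way to parse such identifiers and keep the highest version seen per accession. The function names and the simplified regular expression are assumptions made for illustration, not an NCBI API, and the pattern does not validate accession prefixes.

```python
import re
from typing import Dict, Iterable, Tuple

# Simplified, illustrative pattern: an accession, a dot, an integer version.
_ACC_VER = re.compile(r"^([A-Za-z0-9_]+)\.(\d+)$")


def parse_accession(identifier: str) -> Tuple[str, int]:
    """Split an 'accession.version' identifier into (accession, version)."""
    match = _ACC_VER.match(identifier.strip())
    if match is None:
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return match.group(1), int(match.group(2))


def latest_versions(identifiers: Iterable[str]) -> Dict[str, int]:
    """Map each accession to the highest version number seen for it."""
    latest: Dict[str, int] = {}
    for identifier in identifiers:
        accession, version = parse_accession(identifier)
        latest[accession] = max(version, latest.get(accession, 0))
    return latest


if __name__ == "__main__":
    # Example identifiers used purely for illustration.
    sample = ["U49845.1", "MN908947.1", "MN908947.3"]
    print(latest_versions(sample))
```

Keeping the full accession.version string in analysis pipelines, rather than the bare accession, makes it possible to tell later whether a result was computed against a record that has since been revised.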
Erroneous records are handled via submitter-initiated updates or flagging. Because GenBank records remain owned by their depositors, NCBI cannot modify them directly, but records can be replaced with corrected versions. For scalability in the face of continued data growth, NCBI is leveraging cloud platforms, such as hosting BLAST databases on Amazon Web Services (AWS) and Google Cloud Platform (GCP), to distribute computational loads and support federated access to large datasets without centralizing all storage. More broadly, GenBank aligns with the FAIR principles (Findable, Accessible, Interoperable, and Reusable) through resources like BioProject and BioSample, which enhance interoperability. Potential expansions include greater integration with related NCBI archives, such as the Epigenomics database, to incorporate epigenomic datasets alongside nucleotide sequences, fostering comprehensive genomic analyses while maintaining open-access standards.
