
Biological database

A biological database is an organized repository that stores vast quantities of biological data, including nucleotide and protein sequences, genetic variations, methylation patterns, CpG island locations, gene expression profiles, proteomic information, metabolic pathways, and protein-protein interactions, typically integrated with computational software for efficient updating, querying, retrieval, and analysis. The origins of biological databases trace back to the late 1970s and early 1980s, driven by advances in technologies like Sanger sequencing (1977) and the polymerase chain reaction (PCR, 1983), which produced overwhelming volumes of genetic data requiring systematic storage and sharing mechanisms. Key institutions emerged to address this, including the National Center for Biotechnology Information (NCBI), founded in 1988, and GenBank, launched in 1982 as the first public repository for nucleotide sequences. In 1986, the International Nucleotide Sequence Database Collaboration (INSDC) was established, uniting NCBI's GenBank, the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database, and the DNA Data Bank of Japan (DDBJ) to synchronize data daily and ensure global accessibility without duplication. The Human Genome Project (1990–2003), an international effort involving 20 sequencing centers across six countries to produce the first reference human genome sequence (approximately 3.2 billion base pairs), highlighted the indispensable role of these databases, with evolving data release policies mandating public deposition within 24 hours by 1996 to accelerate scientific progress. Biological databases are broadly categorized into primary databases, which house raw, experimentally submitted data such as unprocessed nucleotide sequences in GenBank (containing over 4.7 billion sequences from more than 580,000 species as of 2025, totaling 34 trillion base pairs and doubling in size approximately every two years), and secondary databases, which offer derived or curated content such as alignments, structural models, and functional predictions.
Prominent examples include UniProt for comprehensive protein sequence and annotation data, the Protein Data Bank (PDB) for three-dimensional molecular structures, Pfam for protein domain families, and specialized resources such as ClinicalTrials.gov for clinical study data or the Global Biodiversity Information Facility (GBIF) for ecological and biodiversity records. These databases form the backbone of bioinformatics, enabling researchers to integrate disparate datasets, perform comparative analyses, and drive discoveries in genomics, proteomics, drug discovery, and evolutionary biology, while promoting open science and international collaboration.

Introduction

Definition and Scope

Biological databases are organized collections of biological data, encompassing nucleotide and protein sequences, molecular structures, functional annotations, records of genetic variations such as SNPs and methylation patterns, gene expression profiles, and associated literature, specifically engineered for efficient storage, retrieval, search, and analytical processing. These repositories differ from general-purpose databases by accommodating the inherent complexity, heterogeneity, and interconnectivity of biological data, where information often spans multiple scales from molecular to ecological levels. The scope of biological databases extends across key domains including genomics, proteomics, metabolomics, and ecology, integrating raw experimental outputs with curated interpretations to support interdisciplinary research. Unlike conventional databases focused on structured numerical or textual records, they manage voluminous, multidimensional biological entities, such as evolutionary homologues, protein-ligand interactions, and phenotypic records, that require specialized handling to preserve contextual relationships and enable cross-domain queries. These databases are indispensable in advancing biological discovery, serving as both endpoints for data archiving and starting points for hypothesis generation, validation, and integrative analyses in fields like systems biology and precision medicine. For example, the open deposition of data from the Human Genome Project into repositories like GenBank not only accelerated post-2003 genomic research by enabling rapid sequence comparisons and functional predictions but also solidified global standards for data sharing that underpin contemporary bioinformatics. As of 2025, over 2,200 public biological databases are cataloged in comprehensive collections such as the Nucleic Acids Research online Molecular Biology Database Collection, reflecting sustained annual growth through 70–100 new submissions and updates that incorporate emerging data types like single-cell omics and spatial transcriptomics.

Historical Development

The origins of biological databases trace back to the late 1970s and early 1980s, spurred by advances in DNA sequencing technology that enabled the generation and analysis of nucleotide sequences. This technology, pioneered in the 1970s through the development of restriction enzymes and DNA ligation methods, facilitated the cloning and sequencing of DNA fragments, creating a need for centralized repositories to store and share the resulting data. In 1982, GenBank was established at Los Alamos National Laboratory under the leadership of Walter Goad to collect and distribute publicly available DNA sequences, initially handling submissions from researchers worldwide. Concurrently, the European Molecular Biology Laboratory (EMBL) launched its Nucleotide Sequence Database in 1980, forming the basis for international collaboration with GenBank and the DNA Data Bank of Japan (DDBJ) through data exchange agreements by the mid-1980s. These early databases focused on nucleotide sequences, reflecting the rapid growth in sequencing efforts driven by recombinant DNA techniques, with GenBank alone growing from a few hundred entries in 1982 to over 100,000 by the early 1990s. The 1990s marked significant expansion and diversification of biological databases, building on foundational efforts while addressing the needs of emerging fields like genomics and structural biology. The Protein Data Bank (PDB), initially established in 1971 at Brookhaven National Laboratory with just seven protein structures, underwent formalization and substantial growth during this period, transitioning to digital formats and international management by the late 1990s to accommodate the influx of X-ray crystallography and NMR data. This era also saw the rise of model-organism databases, such as FlyBase, which was founded in 1992 to curate genetic and molecular data on Drosophila melanogaster, providing a centralized resource for researchers studying this key model species. These developments were supported by key publications, including the inaugural Nucleic Acids Research (NAR) Database Issue in 1993, which began systematically documenting and reviewing biological databases to promote standardization and accessibility.
The 2000s represented a boom in database proliferation, catalyzed by the completion of the Human Genome Project in 2003, which not only produced vast genomic datasets but also established data-sharing policies that influenced subsequent repository designs. Post-project, omics-era databases emerged to handle transcriptomics, proteomics, and metabolomics data, with standards like the Minimum Information About a Microarray Experiment (MIAME), proposed in 2001, establishing guidelines for reporting microarray experiments to ensure reproducibility and integration across resources. The ENCyclopedia of DNA Elements (ENCODE) project, initiated in 2003 as a pilot to map functional genomic elements, exemplified this growth by expanding in 2007–2012 to produce comprehensive datasets on non-coding regions, necessitating advanced storage solutions. Entering the 2010s and 2020s, the advent of next-generation sequencing (NGS) technologies ushered in a big-data era, generating petabytes of sequence data that overwhelmed traditional databases and highlighted integration challenges such as data heterogeneity, interoperability, and computational scalability. Post-2010, efforts focused on federated systems and ontologies to link disparate resources, though issues like inconsistent formats and sheer data volume persisted. The NAR Database Issue continued as a cornerstone, with its 2025 edition announcing 73 new databases across molecular biology and incorporating updates on AI-assisted curation to manage the scale of modern datasets.

Classification

Primary versus Secondary Databases

Biological databases are broadly classified into primary and secondary types based on the nature and processing level of the data they contain. Primary databases serve as archival repositories for raw, experimentally derived data submitted directly by researchers, with minimal processing to preserve the original observations. These include nucleotide sequences obtained from sequencing instruments, protein sequences from direct determination, or structural data from techniques like X-ray crystallography. The emphasis is on long-term storage and accessibility without alteration, ensuring the integrity of the source material for future verification and analysis. A prominent example is GenBank, which has accepted raw sequence submissions since 1982 and now holds over 5.9 billion records as of August 2025. In contrast, secondary databases derive value-added information by compiling, curating, annotating, and computationally analyzing data from multiple primary sources. They often include derived features such as sequence alignments, functional predictions, evolutionary relationships, or structural models, which facilitate broader biological insights but rely on interpretive processes. For instance, UniProt aggregates protein sequences and annotations from primary databases like GenBank and EMBL, providing curated functional information through manual expert review and automated inference for approximately 246 million sequences. Secondary databases thus enhance usability but depend on the quality and completeness of underlying primary data. Key differences between primary and secondary databases lie in their curation philosophies and potential limitations. Primary databases prioritize archival fidelity with limited validation, focusing on completeness and neutrality to avoid altering submitter-provided information, which supports reproducibility but can result in redundancy or errors from original experiments.
Secondary databases, however, apply computational tools and expert curation to reduce noise and add context, such as Gene Ontology assignments or pathway integrations, thereby accelerating research; yet this introduces risks of interpretation bias, where curatorial choices or algorithmic assumptions may propagate inaccuracies or favor certain biological models. There is notable overlap and evolution between these categories, with some resources functioning as hybrids or transitioning over time. For example, RefSeq, maintained by NCBI, curates a non-redundant subset of sequences from the primary GenBank database, incorporating computational and manual annotations to produce reference standards for genes, transcripts, and proteins, thus blending archival elements with secondary processing. In the 2025 Nucleic Acids Research database issue, which catalogs 185 resources including 73 new ones, primary databases represent a substantial portion of the ecosystem, underscoring their foundational role alongside the growing number of derived secondary compilations.

Databases by Biological Data Type

Biological databases are classified according to the primary type of data they curate and manage, reflecting the diverse scales and aspects of biological information from molecular to ecosystem levels. This categorization facilitates targeted research in genomics, proteomics, ecology, and integrative studies, with each type addressing specific scientific needs while often incorporating elements of primary raw data or secondary derived insights. Sequence-based databases focus on nucleotide and protein sequences, serving as foundational resources for genomics and proteomics by storing raw or annotated sequence data essential for gene identification, evolutionary analysis, and functional prediction. These databases typically include primary repositories of experimentally determined sequences as well as secondary compilations with alignments and annotations; for instance, GenBank archives over 5.9 billion nucleotide sequences totaling 47 trillion base pairs as of August 2025, enabling global access to genomic data from diverse organisms. Similarly, UniProt provides comprehensive protein sequence and functional information for approximately 246 million sequences, integrating data from multiple sources to support protein characterization and comparative studies. Structure-based databases specialize in three-dimensional molecular models, primarily atomic coordinates derived from techniques like X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy, which are crucial for understanding protein folding, function, and interactions. These resources often combine experimental structures with computational predictions to model biomolecular interactions; the Protein Data Bank (PDB), for example, holds over 244,000 experimentally determined structures as of 2025, including updates for integrative modeling and computed structure models to bridge gaps in experimental data. Such databases enable visualization and simulation of molecular architectures, informing drug discovery research.
Functional and interaction databases curate information on biological pathways, molecular interactions, and gene/protein annotations, providing curated knowledge to elucidate cellular processes, regulatory networks, and disease mechanisms. These often incorporate ontology-based terms for standardized descriptions; the Gene Ontology (GO) resource, for instance, encompasses 39,354 terms across biological processes, molecular functions, and cellular components, with over 9 million annotations linking genes to functions across thousands of species as of 2025. Complementing this, KEGG integrates 581 pathway maps and related hierarchies to model systemic functions, including metabolism and signaling, drawing from genomic and chemical data for holistic systems analysis. Ecological and phenotypic databases compile species traits, biodiversity metrics, and associated metadata such as geographic and environmental distributions, supporting studies in ecology, evolution, and conservation by linking phenotypic variation to ecological contexts. These resources aggregate occurrence records, trait measurements, and phylogenetic information; the Global Biodiversity Information Facility (GBIF) indexes billions of occurrence records for millions of species, facilitating analyses of distribution patterns and biodiversity hotspots as of 2025. Additionally, NCBI Taxonomy maintains taxonomic lineages for over 160,000 organisms with molecular data, providing a foundational framework for phenotypic and ecological annotations integrated across biological databases. Multi-omics integration databases combine data from genomics, transcriptomics, proteomics, metabolomics, and other layers to enable systems-level analyses of biological phenomena, particularly in complex diseases and developmental processes.
Emerging prominently in the 2020s, these databases leverage advances in high-throughput technologies like single-cell sequencing; The Cancer Genome Atlas (TCGA), initiated in 2006 and expanded through the Genomic Data Commons, integrates multi-omics profiles from thousands of cancer samples, including genomic, transcriptomic, and proteomic data to reveal molecular subtypes and therapeutic targets. Recent examples include MouseOmics, which unifies 21 Mus genus genomes with transcriptomic and phenotypic data for comparative multi-omics studies as of 2025. The growth in multi-omics resources is driven by single-cell technologies, allowing finer-grained integration of heterogeneous datasets for personalized medicine and evolutionary insights.

Technical Foundations

Data Models and Architectures

Biological databases employ diverse data models and architectures to manage the complexity and volume of biological information, evolving from simple storage formats to sophisticated systems capable of handling intricate relationships and massive datasets. In the 1980s, early databases like GenBank and the Protein Data Bank (PDB) relied on flat files for nucleotide sequences and molecular structures, respectively, which allowed straightforward storage but limited efficient querying as data volumes grew. By the 1990s, the shift to relational and object-oriented models addressed these limitations, with the Human Genome Project accelerating the adoption of structured formats for integration and analysis. The 2000s and 2010s saw the rise of federated systems, enabling distributed access to heterogeneous sources without centralizing all data, as seen in initiatives like the International Nucleotide Sequence Database Collaboration (INSDC). This progression reflects a move from O(n) linear scans in flat-file systems to O(log n) indexed queries in modern architectures, improving efficiency for large-scale retrieval. Relational models, based on SQL databases, are widely used for structured data such as sequences and annotations, organizing information into tables with defined schemas to ensure consistency and facilitate joins. For instance, in PostgreSQL implementations, biological sequences can be stored using custom data types like bio-strings, with tables for genomic or protein entries linked to metadata such as accession numbers and annotations, often importing records from sources like NCBI GenBank. This approach supports efficient SQL queries for operations like calculating GC content or translating sequences via stored functions, avoiding the need for external file parsing. Relational systems excel in handling tabular data, such as linking sequence IDs to external files stored separately for bulk access, though they may require normalization to manage redundancy in evolving datasets.
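The relational pattern described above, a sequence table plus a stored function for derived statistics, can be sketched with an in-memory SQLite database. The schema, accession numbers, and gc_content helper are illustrative assumptions, not the layout of any real resource:

```python
import sqlite3

# Hypothetical minimal schema keyed by accession; real systems use far
# richer schemas with normalized annotation and feature tables.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sequence (
    accession TEXT PRIMARY KEY,
    organism  TEXT,
    residues  TEXT NOT NULL)""")
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [("X00001", "Homo sapiens", "ATGGCGTACGCT"),
     ("X00002", "Mus musculus", "ATATATATGCGC")])

def gc_content(residues: str) -> float:
    """Fraction of G/C bases, a common derived sequence statistic."""
    return sum(b in "GC" for b in residues) / len(residues)

# Register the function so it can be called directly inside SQL queries,
# analogous to the stored functions mentioned above.
conn.create_function("gc_content", 1, gc_content)
for acc, gc in conn.execute(
        "SELECT accession, gc_content(residues) FROM sequence ORDER BY accession"):
    print(acc, round(gc, 2))
```

Registering the derived statistic as a SQL function keeps the computation next to the data, so no external file parsing step is needed.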
For capturing complex, non-tabular relationships like metabolic pathways or protein interactions, graph databases offer flexible schemas that model entities as nodes and connections as edges. Neo4j, a property graph database, has been applied in bioinformatics to represent protein-protein interaction networks, such as the STRING database, which as of 2025 contains 59.3 million proteins and over 20 billion interactions, enabling rapid traversal queries like shortest paths that are orders of magnitude faster than in relational systems (e.g., a 2,441× speedup for shortest paths in published comparisons). These databases handle irregular data growth without rigid schemas, making them suitable for dynamic biological networks where relationships, such as regulatory interactions, evolve frequently. Hierarchical and object-oriented models support nested data structures for sequences, using formats like ASN.1 to define extensible schemas for biological objects such as Bioseqs, which encapsulate sequence instances, descriptions, and annotations in a tree-like manner. In NCBI's C++ Toolkit, ASN.1 represents sequences with components like Seq-inst for raw data and Seq-feat for hierarchical features (e.g., coding regions or transcripts), allowing object-oriented manipulation in C++ classes. XML serves as an alternative for markup-based hierarchies, though ASN.1 provides more compact, binary encoding for transmission. Trade-offs between flat-file and indexed storage are evident: flat files, common in early sequence dumps, offer simplicity and portability but incur O(n) search times for large corpora, while indexed approaches in object-oriented systems enable faster O(log n) access at the cost of added overhead for maintenance and storage. Scalability in biological databases addresses petabyte-scale data from next-generation sequencing, leveraging cloud and distributed techniques to manage growth.
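To illustrate the kind of traversal query graph databases accelerate, here is a minimal breadth-first shortest-path search over a toy interaction network; the adjacency map stands in for a property graph, and the chosen protein names are illustrative only, not drawn from STRING or any real dataset:

```python
from collections import deque

# Toy protein-interaction network as an adjacency map (undirected edges).
PPI = {
    "TP53":  {"MDM2", "BRCA1"},
    "MDM2":  {"TP53", "UBE3A"},
    "BRCA1": {"TP53", "RAD51"},
    "RAD51": {"BRCA1"},
    "UBE3A": {"MDM2"},
}

def shortest_path(graph, src, dst):
    """Breadth-first search: the traversal pattern graph databases optimize."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for neighbor in graph[path[-1]] - seen:
            seen.add(neighbor)
            queue.append(path + [neighbor])
    return None  # no path between the two proteins

print(shortest_path(PPI, "UBE3A", "RAD51"))
```

A native graph store executes this as index-free adjacency hops, which is why such queries outperform the equivalent chain of relational self-joins.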
The NCBI Sequence Read Archive (SRA), hosting billions of reads, utilizes AWS for open-access storage, providing on-demand scaling via services like Amazon S3 without local infrastructure burdens. Sharding partitions data across nodes, as in distributed tools like mpiBLAST, reducing query latency for high-throughput analysis by parallelizing I/O and computation. Cloud architectures, including AWS's public datasets, support elastic resources for variable workloads, ensuring fault tolerance and cost efficiency for federated queries spanning multiple sources.

Ontologies and Standardization

Ontologies provide structured, controlled vocabularies that enable consistent annotation and semantic interoperability in biological databases, facilitating the integration of diverse data sources by defining relationships between terms in a machine-readable format. A seminal example is the Gene Ontology (GO), initiated in 1998 as a collaborative effort to standardize descriptions of gene and gene product attributes across cellular components, molecular functions, and biological processes. GO uses the Web Ontology Language (OWL), a W3C standard developed in 2004, which supports formal reasoning and inference over ontological knowledge through description logics. This framework reduces ambiguity in functional annotations, allowing researchers to query and compare functions across species with precision. Standardization efforts extend to data formats that ensure compatibility and ease of exchange in biological databases. For nucleotide and protein sequences, the FASTA format, introduced in 1988, represents sequences as plain text with a descriptive header line followed by the sequence data, promoting widespread adoption in sequence analysis tools. Complementing this, the FASTQ format, formalized in 2009, incorporates quality scores alongside sequences to account for sequencing errors, becoming essential for high-throughput next-generation sequencing data. For protein structures, the Protein Data Bank (PDB) format, established in 1971 and refined over decades, stores atomic coordinates and metadata in a fixed-width text structure, enabling visualization and simulation software to parse structural data uniformly. In synthetic biology, the Synthetic Biology Open Language (SBOL), first specified in 2010, standardizes the representation of genetic designs, including parts, devices, and modules, to support modular engineering and data sharing. Interoperability initiatives further enhance standardization by addressing semantic and operational consistency across databases.
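The header-plus-sequence convention of FASTA described above can be parsed with a few lines of code; this is a minimal sketch that assumes well-formed input and handles none of the format's edge cases:

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text.
    Headers start with '>'; sequence lines until the next header belong
    to the current record."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Hypothetical two-record example in the plain-text FASTA layout.
sample = """>seq1 hypothetical example
ATGGCG
TACGCT
>seq2
GGGCCC"""
records = dict(parse_fasta(sample))
print(records)
```

Production tools add validation (duplicate headers, illegal characters) and indexed access for multi-gigabyte files, but the core format is exactly this simple.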
The EDAM ontology, released in 2011, provides a controlled vocabulary for bioinformatics operations, data types, identifiers, topics, and formats, enabling better searchability and integration of tools and resources. Complementing this, the FAIR principles, articulated in 2016, emphasize making data findable, accessible, interoperable, and reusable, guiding database curators to implement standards and persistent identifiers that support automated discovery and reuse. These efforts tackle key challenges like term ambiguity, where synonymous or polysemous biological concepts hinder cross-database queries; for instance, ontology mapping algorithms conceptually align terms via lexical similarity, hierarchical matching, and logical inference to reconcile discrepancies. BioPortal, a central repository, integrates 1,549 biomedical ontologies (1,182 of them public) as of its 2025 updates, offering tools for browsing, mapping, and annotation to promote unified access. Tools such as Protégé, an open-source ontology editor developed since 1987, empower users to create, visualize, and reason over OWL-based ontologies through graphical interfaces and plugin support, streamlining the development of custom semantic frameworks for biological data.
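The lexical-similarity step of ontology term mapping can be sketched with Python's difflib; real mappers layer synonym tables, hierarchical matching, and logical inference on top of string similarity, so this is a deliberately simplified illustration with made-up labels:

```python
from difflib import SequenceMatcher

def best_match(term, candidates, threshold=0.6):
    """Map a free-text term onto the closest controlled-vocabulary label,
    returning None when nothing is lexically close enough."""
    scored = [(SequenceMatcher(None, term.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, label = max(scored)
    return label if score >= threshold else None

# Illustrative GO-style labels; a real mapper would query the full ontology.
go_labels = ["DNA repair", "DNA replication", "protein folding"]
print(best_match("dna-repair", go_labels))
```

The threshold trades recall for precision; curated pipelines typically route borderline scores to human review rather than accepting them automatically.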

Access and Usage

Query Interfaces and Tools

Biological databases provide diverse query interfaces and tools to facilitate user interaction, ranging from intuitive graphical user interfaces to robust command-line options, enabling researchers to search, retrieve, and analyze data efficiently. These tools are designed to accommodate varying levels of user expertise, from novices relying on web-based browsers to advanced users employing scripts for bulk operations. Web-based graphical user interfaces (GUIs) form the primary entry point for most users, offering seamless navigation through complex datasets. The Entrez system, developed by the National Center for Biotechnology Information (NCBI), serves as a unified search portal across multiple databases, allowing keyword-based queries that integrate results from genes, proteins, and literature. Integrated within such platforms, tools like the Basic Local Alignment Search Tool (BLAST) enable similarity searches by aligning user-submitted sequences against database entries, supporting rapid identification of homologous genes or proteins. These browser-based interfaces prioritize ease of use, often featuring faceted search filters and result previews to refine queries without requiring programming knowledge. Command-line tools complement web interfaces by supporting automated and large-scale data access, particularly for downloading bulk datasets. File Transfer Protocol (FTP) servers hosted by major repositories allow direct retrieval of files using clients like wget, which recursively fetches directories over HTTP or FTP for efficient bulk transfers. Similarly, rsync enables synchronized updates by comparing local and remote files, minimizing bandwidth usage for incremental downloads from sites like Ensembl or UCSC. These utilities are essential for high-throughput workflows, such as mirroring entire genome assemblies. Visualization aids enhance query results by rendering data in interpretable formats, aiding in pattern recognition and validation.
The UCSC Genome Browser, launched in 2000, provides an interactive web interface for querying and visualizing genomic annotations, tracks, and alignments across species, with tools like the Table Browser for exporting customized subsets. For sequence alignments, viewers such as the NCBI Multiple Sequence Alignment (MSA) Viewer display nucleotide or protein alignments graphically, highlighting conserved regions and gaps to support evolutionary analysis. Other tools, including Jalview, offer desktop-based editing and 3D structure integration for deeper exploration. User accessibility is bolstered by comprehensive documentation and evolving support features to lower barriers for diverse audiences. Most platforms include tutorials, help documentation, and video guides; for instance, the UCSC Genome Browser offers detailed user's guides covering search syntax and track customization. In the 2020s, mobile applications have emerged, particularly for field biology, enabling on-the-go queries via apps linked to databases with GPS-integrated species identification. Adherence to accessibility standards, such as WCAG guidelines, ensures compatibility with screen readers and keyboard navigation in web tools. Notable examples illustrate specialized query capabilities, such as PubMed's interface for literature-linked biological queries, which combines free-text searches with MeSH terms to retrieve citations connected to experimental data in linked databases like GenBank. As of 2025, enhancements like voice-activated search have appeared in select platforms, exemplified by experimental voice-assistant skills for querying cancer datasets conversationally. These developments build on underlying data architectures to expand interactive access.

Data Retrieval and Integration Methods

Biological databases employ application programming interfaces (APIs) to enable programmatic access to data, with RESTful services emerging as a dominant approach for their simplicity and scalability. The Ensembl BioMart system, introduced in 2004, exemplifies this approach by allowing users to query genomic annotations and sequences across multiple species without requiring deep knowledge of the underlying database schema. Legacy systems often rely on SOAP-based web services, which provide structured XML messaging for remote procedure calls but have been largely supplanted due to their complexity and overhead in modern applications. Federated query mechanisms facilitate cross-database searches by distributing requests across multiple sources while presenting unified results. BioMart supports federated queries, enabling seamless integration of data from disparate repositories such as Ensembl and UniProt for tasks like gene annotation retrieval. For ontology-based retrieval, SPARQL queries leverage Semantic Web technologies to traverse RDF triples, allowing precise extraction of biologically linked data, as implemented in tools like Bio2RDF for federated access to resources including UniProt and KEGG. Data exchange in biological databases commonly uses lightweight formats like JSON and XML to ensure compatibility across tools and languages. JSON's human-readable structure and reduced parsing overhead make it ideal for API responses, while XML supports schema validation for complex hierarchical data such as sequence alignments. For bulk downloads of large datasets, high-speed protocols like Aspera, employed by NCBI for SRA and dbGaP, accelerate transfers of terabyte-scale files using UDP-based FASP, mitigating bandwidth limitations in genomic data distribution. Integration platforms streamline the pipelining of retrieval and analysis steps across databases. Workflow systems such as Galaxy and Taverna, which emerged around 2004–2005, offer visual interfaces for composing reusable pipelines that invoke services from sources like NCBI and EMBL-EBI, promoting reproducibility in bioinformatics workflows.
ELIXIR's federated infrastructure, established in 2014, coordinates European life-science data resources to enable standardized data access and sharing, supporting approximately 25 nodes (as of 2025) for compute-intensive integrations. As of 2025, GraphQL APIs are gaining traction for their query flexibility, allowing clients to specify exact data needs and reduce over-fetching in biological contexts, such as ZincBind's implementation for metal-binding site predictions. Handling results exceeding 10^6 entries involves techniques like pagination, streaming, and indexing to manage scale, as seen in HBase adaptations for large-scale genomic queries, ensuring efficient processing without overwhelming resources.
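Pagination over a large result set can be sketched by generating the paged REST requests up front; the retstart/retmax parameters below follow NCBI E-utilities conventions, while the query string itself is illustrative and no network access is performed:

```python
from urllib.parse import urlencode

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def page_urls(query, total, page_size=500, db="nucleotide"):
    """Build the sequence of paged requests needed to walk a result set,
    using E-utilities' documented retstart/retmax paging parameters."""
    urls = []
    for start in range(0, total, page_size):
        params = {"db": db, "term": query,
                  "retstart": start, "retmax": page_size, "retmode": "json"}
        urls.append(f"{BASE}?{urlencode(params)}")
    return urls

urls = page_urls("BRCA1[gene] AND human[orgn]", total=1200)
print(len(urls))   # 3 requests cover 1200 hits at 500 per page
print(urls[0])
```

In practice a client fetches each URL in turn (respecting the service's rate limits) and streams the ID lists into downstream processing rather than holding millions of entries in memory.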

Major Categories

Genomic and Sequence Databases

Genomic and sequence databases serve as foundational repositories for storing, annotating, and disseminating nucleotide sequences from DNA and RNA, enabling researchers to access raw and annotated data essential for biological discovery. These databases emerged in response to the need for centralized, publicly accessible archives amid advancing sequencing technologies, with the International Nucleotide Sequence Database Collaboration (INSDC) coordinating efforts among its members: GenBank at the National Center for Biotechnology Information (NCBI) in the United States, the European Nucleotide Archive (ENA) at EMBL-EBI in Europe, and the DNA Data Bank of Japan (DDBJ). Established in 1982, GenBank pioneered this domain by providing a comprehensive public repository for nucleotide sequences, now containing over 47 trillion base pairs from approximately 5.9 billion records as of late 2025, spanning diverse taxa including 581,000 formally described species. The INSDC ensures daily synchronization of data across these partners, promoting global consistency and redundancy-free access while supporting open science principles. Key features of these databases include streamlined sequence submission processes, advanced annotation tools, and integrated search capabilities. Users can submit sequences via web portals, APIs, or command-line interfaces, with automated validation to maintain data quality and third-party annotations for enhanced biological context, such as gene features and biodiversity metadata. Alignment tools like the Basic Local Alignment Search Tool (BLAST), introduced by NCBI in 1990, allow rapid comparison of query sequences against database entries to identify similarities, supporting tasks from homology detection to functional inference.
Specialized variant databases, such as dbSNP, launched in 1998, catalog single nucleotide polymorphisms (SNPs) and small insertions/deletions, now encompassing over 1.1 billion unique reference SNPs derived from billions of submissions, aiding in genetic association studies. These resources underpin critical applications in genome assembly, where de novo reconstruction from fragmented reads relies on reference sequences, and phylogenetics, enabling evolutionary tree construction through sequence alignments across species. The advent of next-generation sequencing (NGS) in the mid-2000s triggered explosive data growth, addressed by archives like the Sequence Read Archive (SRA), established in 2007 as part of INSDC to store raw high-throughput reads, now holding petabytes of data from diverse experiments. In 2025, significant additions include metagenomic sequences from environmental sampling, such as soil and marine microbiomes, expanding coverage of uncultured microbial diversity and facilitating ecosystem-level analyses. To manage the vast scale and inherent redundancies in submitted data, these databases employ clustering techniques; for instance, NCBI's core_nt database, released in 2024, groups highly similar sequences (e.g., >90% identity) to streamline searches without losing essential information, reducing computational demands while preserving taxonomic breadth. This approach ensures efficient querying of non-redundant subsets for practical use, distinct from the full archival records that retain all submissions for traceability.
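The redundancy-clustering idea behind resources like core_nt can be illustrated with a toy greedy algorithm; the identity measure here is simple positional agreement on short equal-length strings, whereas production pipelines use alignment-based identity and far more scalable methods:

```python
def identity(a, b):
    """Fraction of positions that agree between two sequences (toy metric)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.9):
    """Greedy clustering sketch: each sequence joins the first cluster whose
    representative it matches at or above the threshold, otherwise it founds
    a new cluster with itself as representative."""
    reps, clusters = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if identity(s, r) >= threshold:
                clusters[i].append(s)
                break
        else:
            reps.append(s)
            clusters.append([s])
    return clusters

seqs = ["ATGGCGTACG", "ATGGCGTACC",   # differ at one position (90% identity)
        "TTTTTTTTTT"]                  # unrelated sequence
print(len(greedy_cluster(seqs)))       # 2 clusters
```

Searching only the cluster representatives shrinks the effective database size while the full archive retains every submitted record for traceability.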

Protein Structure and Functional Databases

Protein structure and functional databases serve as essential repositories for three-dimensional (3D) atomic models of proteins, enabling researchers to analyze molecular architecture, folding patterns, and functional mechanisms. These resources primarily store experimentally determined structures, such as those obtained via X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, or cryo-electron microscopy (cryo-EM), alongside computationally predicted models. Key data elements include atomic coordinates defining the positions of atoms in space, electron density maps that visualize the distribution of electrons from diffraction data, and validation metrics like the R-factor, which quantifies the agreement between observed and calculated structure factors to assess model accuracy. For instance, the R-factor typically ranges from 0.15 to 0.25 for high-quality structures, indicating reliable fits to experimental data. The Protein Data Bank (PDB), established in 1971 as the inaugural global archive for 3D macromolecular structures, remains the cornerstone of this domain, housing over 244,000 experimentally validated entries as of November 2025. These include proteins, nucleic acids, and their complexes, with atomic coordinates provided in standard formats like PDBx/mmCIF for interoperability. Complementing the PDB, the AlphaFold Protein Structure Database, launched in 2021 by DeepMind and the European Bioinformatics Institute (EMBL-EBI), has revolutionized access by providing predicted 3D models for over 200 million proteins across diverse organisms, derived from sequence data using deep learning. These predictions achieve near-experimental accuracy for many targets, filling gaps where experimental determination is resource-intensive. Structure classification systems like SCOP (Structural Classification of Proteins, initiated in the mid-1990s) and CATH (Class, Architecture, Topology, Homologous superfamily, also from the 1990s) organize PDB entries hierarchically based on structural and evolutionary relationships, grouping domains into folds and superfamilies to reveal functional conservation.
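Because legacy PDB records are fixed-width, individual fields are recovered by column slicing; the sketch below extracts a few common fields from a single ATOM line, with column ranges following the wwPDB format specification (the coordinates themselves are illustrative, not from a real entry):

```python
def parse_atom(line):
    """Parse one fixed-width ATOM/HETATM record from a legacy PDB file.
    Only a handful of common fields are extracted in this sketch."""
    assert line.startswith(("ATOM", "HETATM"))
    return {
        "serial":  int(line[6:11]),          # atom serial number
        "name":    line[12:16].strip(),      # atom name, e.g. CA
        "resName": line[17:20].strip(),      # residue name, e.g. ALA
        "chain":   line[21],                 # chain identifier
        "resSeq":  int(line[22:26]),         # residue sequence number
        "xyz":     (float(line[30:38]),      # orthogonal coordinates (Å)
                    float(line[38:46]),
                    float(line[46:54])),
    }

# Example record laid out in the documented columns.
record = ("ATOM      1  CA  ALA A   1      11.104  13.207   2.100"
          "  1.00 20.00           C")
atom = parse_atom(record)
print(atom["name"], atom["resName"], atom["xyz"])
```

Modern tooling increasingly parses PDBx/mmCIF instead, which replaces fixed columns with self-describing key-value categories, but vast amounts of archived data and software still rely on this column layout.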
Functional annotations in these databases extend beyond geometry to include ligand-binding sites, enzymatic activities, and interaction interfaces, often visualized through integrated tools like PyMOL, a molecular graphics system with plugins that fetch and render PDB structures for interactive analysis. For protein-nucleic acid complexes, specialized resources such as DNAproDB provide curated datasets of protein-DNA interfaces, incorporating over 6,700 high-resolution structures as of mid-2024, with weekly updates and metrics for binding affinity and deformability. Applications span drug discovery, where structure-based virtual screening identifies potential inhibitors by docking small molecules into target pockets, and computational predictions that accelerate hypothesis testing for disease-related variants. Studies from 2025 highlight growth driven by AI, with AlphaFold-inspired models adding millions of predictions annually, while demonstrating the complementarity of experimental (PDB) and computational (AlphaFold DB) resources for hybrid workflows in functional annotation.
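Legacy PDB coordinate files encode each atom in fixed 80-column records, which is why tools can extract coordinates without any schema negotiation. A minimal parser for ATOM records is sketched below; the column ranges follow the published wwPDB legacy format description, and the sample record is fabricated for illustration:

```python
def parse_atom_record(line):
    """Parse one fixed-column ATOM/HETATM record from a legacy PDB file.
    Slices use 0-based indices matching the wwPDB 1-based column spec:
    atom name cols 13-16, residue 18-20, chain 22, residue number 23-26,
    x/y/z coordinates cols 31-38, 39-46, 47-54."""
    return {
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }

# A fabricated alpha-carbon record for an alanine residue in chain A.
record = "ATOM      1  CA  ALA A   1      11.104   6.134  -6.504  1.00  0.00"
atom = parse_atom_record(record)
print(atom["name"], atom["res_name"], atom["x"])
```

Real pipelines should prefer the PDBx/mmCIF format noted above, whose key-value structure avoids the fixed-width limits (e.g., atom serial numbers overflowing five columns) that constrain the legacy format.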

Model-Organism and Biodiversity Databases

Model-organism databases serve as centralized repositories for comprehensive data on key species used in biological research, integrating genomic sequences, mutant strains, and phenotypic information to facilitate experimental studies. FlyBase, established in 1992, focuses on the fruit fly Drosophila melanogaster and related species, providing curated data on genes, alleles, insertions, and expression patterns derived from thousands of publications. WormBase, initiated in the 1990s as an extension of the ACeDB system for Caenorhabditis elegans, aggregates genomic annotations, genetic interactions, and phenotypic descriptions from mutant screens and RNAi experiments. Similarly, the Zebrafish Information Network (ZFIN), launched in 1994, compiles genetic, genomic, and developmental data for the zebrafish Danio rerio, including mutant phenotypes linked to human disease models through orthology mappings. Biodiversity databases, in contrast, emphasize ecological and taxonomic data across diverse species to support global monitoring and identification efforts. The Barcode of Life Data System (BOLD), founded in 2005, specializes in DNA barcoding records, storing over 21 million public sequences from more than 1.3 million species to enable rapid species identification via mitochondrial COI gene analysis. The Global Biodiversity Information Facility (GBIF), established in 2001 as an international network, aggregates occurrence records from museums, herbaria, and field observations, encompassing data for over 2.2 million species as of 2025. Taxonomy in these resources is often standardized using the Integrated Taxonomic Information System (ITIS), which provides authoritative classifications for plants, animals, fungi, and microbes, ensuring consistent nomenclature across datasets. Key features of these databases include phenotypic ontologies for standardized trait descriptions and tools for visualizing evolutionary relationships. 
Model-organism databases employ ontologies such as the Drosophila Anatomy Ontology in FlyBase and the Anatomy and Stage Ontologies in ZFIN to annotate phenotypes consistently, enabling cross-study comparisons of mutant effects. Biodiversity resources incorporate phylogenetic trees derived from taxonomic hierarchies, with GBIF's Backbone Taxonomy synthesizing classifications from multiple sources to map evolutionary lineages. In the 2020s, citizen science has contributed significantly to biodiversity data, with platforms like GBIF integrating observations from initiatives such as iNaturalist, adding millions of georeferenced records to enhance temporal and spatial coverage. These databases support applications in comparative genomics and conservation biology, with unique capabilities for cross-species alignments that reveal evolutionary conservation. In model organisms, tools like those in the Alliance of Genome Resources align orthologous genes across Drosophila, C. elegans, and zebrafish to inform human-relevant studies, such as identifying conserved pathways in development. For biodiversity, BOLD and GBIF facilitate conservation by mapping species distributions against environmental threats; by 2025, GBIF has expanded to include climate impact datasets, such as modeled range shifts under warming scenarios, aiding priority-setting for protected areas. These alignments enable researchers to extrapolate findings from model species to wild populations, promoting preservation amid climate change.
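Distribution mapping of the kind GBIF and BOLD support starts from georeferenced occurrence records. As a loose sketch of that first aggregation step, the function below groups fabricated records (field names in the spirit of Darwin Core, not actual GBIF API output) by species and reports a count and crude bounding box per species:

```python
from collections import defaultdict

def summarize_occurrences(records):
    """Group georeferenced occurrence records by species and report a
    record count plus a (min_lat, min_lon, max_lat, max_lon) box each."""
    by_species = defaultdict(list)
    for rec in records:
        by_species[rec["species"]].append((rec["lat"], rec["lon"]))
    summary = {}
    for species, points in by_species.items():
        lats = [p[0] for p in points]
        lons = [p[1] for p in points]
        summary[species] = {
            "count": len(points),
            "bbox": (min(lats), min(lons), max(lats), max(lons)),
        }
    return summary

# Fabricated records; real data would come from a GBIF occurrence download.
records = [
    {"species": "Danio rerio", "lat": 23.7, "lon": 90.4},
    {"species": "Danio rerio", "lat": 22.3, "lon": 91.8},
    {"species": "Drosophila melanogaster", "lat": 48.2, "lon": 16.4},
]
print(summarize_occurrences(records)["Danio rerio"]["count"])  # 2
```

Real workflows would also normalize species names against a backbone taxonomy before grouping, since the same taxon can appear under multiple synonyms across datasets.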

Medical and Clinical Databases

Medical and clinical databases serve as critical repositories that integrate genomic and molecular data with human health outcomes, facilitating research into diseases, diagnostics, and therapies. These resources emphasize human-centric applications, such as genetic variants associated with disorders and molecular profiles of pathologies, to support evidence-based clinical decision-making. Prominent examples include Online Mendelian Inheritance in Man (OMIM), initiated in the early 1960s by Victor A. McKusick as a catalog of Mendelian traits and disorders, with its online version launching in 1987 to provide comprehensive entries on genetic conditions. ClinVar, established in 2013 by the National Center for Biotechnology Information (NCBI), aggregates interpretations of genomic variants and their relationships to human health, encompassing over 3 million classified variants as of 2024 to aid in clinical interpretation. The Cancer Genome Atlas (TCGA), launched in 2006 by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), has molecularly characterized over 20,000 primary cancer and matched normal samples across 33 tumor types, revealing key genomic alterations driving oncogenesis. Key features of these databases include standardized disease ontologies and pharmacogenomic annotations to enhance interoperability and applicability. The Disease Ontology (DOID), a structured vocabulary for human diseases, provides consistent, reusable descriptions by integrating disease characteristics and clinical manifestations, enabling semantic mapping across biomedical resources. PharmGKB, founded in 2000 as a central repository for pharmacogenetics, curates relationships between genes, drugs, and diseases, including clinical guidelines for personalized dosing based on genetic variants. These databases support applications in precision medicine by integrating with genome-wide association studies (GWAS) to identify disease susceptibility loci and tailor interventions, such as predicting drug responses from variant data.
In the 2020s, linkages to electronic health records (EHRs) have emerged, allowing real-world data from clinical encounters to validate genomic findings and improve patient outcomes through retrospective analyses. As of 2025, advancements include AI-annotated datasets for clinical trials, where machine learning tools accelerate annotation of biomedical imaging and trial outcomes, reducing timelines and improving participant recruitment and efficacy assessment. Unique to medical databases are stringent privacy and ethical safeguards; compliance with the Health Insurance Portability and Accountability Act (HIPAA) mandates encryption, access controls, and audit logs to protect patient data during storage and transmission. Ethical considerations in data sharing prioritize patient consent, de-identification to prevent re-identification risks, and equitable access to mitigate exploitation by third parties, ensuring benefits outweigh potential harms in research collaborations.
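De-identification under HIPAA's Safe Harbor method removes direct identifiers and coarsens quasi-identifiers (dates reduced to year, ages over 89 aggregated). The toy scrubber below illustrates the idea; the field names are hypothetical, and a real system would need a vetted, audited pipeline rather than this sketch:

```python
# Hypothetical field names; illustrative of Safe Harbor-style scrubbing only.
DIRECT_IDENTIFIERS = {"name", "address", "phone", "email", "ssn"}

def deidentify(record):
    """Drop direct identifiers and coarsen quasi-identifiers in one record."""
    scrubbed = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # direct identifiers are removed outright
        if field == "birth_date":
            scrubbed["birth_year"] = value[:4]  # keep year only (ISO dates)
        elif field == "age" and value > 89:
            scrubbed["age"] = "90+"  # Safe Harbor aggregates ages over 89
        else:
            scrubbed[field] = value
    return scrubbed

patient = {"name": "Jane Doe", "birth_date": "1931-05-02",
           "age": 94, "diagnosis": "T2D", "variant": "rs7903146"}
print(deidentify(patient))
```

Note that field-level scrubbing alone does not eliminate re-identification risk; rare variant combinations can still single out individuals, which is why aggregated clinical-genomic releases add access controls on top of de-identification.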

Challenges and Advances

Integration and Accessibility Issues

Biological databases often suffer from data siloing, where heterogeneous formats and structures across resources impede seamless integration. For instance, genomic sequences, protein structures, and clinical metadata exist in diverse formats such as FASTA, PDB, or proprietary schemas, leading to schema mismatches and interoperability issues that complicate cross-database analyses. Analyses of biomedical data management highlight how these silos fragment datasets across platforms, with public repositories frequently lacking uniformity with in-house data, creating challenges for multidisciplinary teams harmonizing multi-omics information. Similarly, single-cell sequencing data appears in incompatible formats such as loom, h5ad, or mtx, exacerbating integration efforts in large-scale studies. Accessibility barriers further compound these problems, including paywalls that restrict entry to essential resources and rate limits that throttle query volumes for non-commercial users. A 2025 evaluation of life sciences data portals revealed that 74.8% exhibit severe accessibility issues, such as missing alternative text for images and low color contrast, hindering use by diverse researchers, including those with disabilities. Additionally, non-Western populations are significantly underrepresented in these databases; groups comprising substantial shares of the global population remain greatly underrepresented in genome-wide association studies, limiting the applicability of findings to non-European contexts. Rate limits and institutional biases also disproportionately affect researchers from low-resource settings, with studies showing racial identity more strongly predicting access to paywalled content than institutional affiliation. Quality issues persist due to inconsistent annotations and outdated entries, often stemming from curation backlogs in primary databases.
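Even the simplest of the formats named above requires format-specific parsing before records can be compared across databases. A minimal FASTA reader illustrates this; production pipelines should use a tested library such as Biopython's SeqIO rather than this sketch:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict.
    Headers begin with '>'; sequence data may span multiple lines."""
    sequences = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            sequences[header] = []
        elif header is not None:
            sequences[header].append(line)
    return {h: "".join(parts) for h, parts in sequences.items()}

# Fabricated two-record FASTA input for illustration.
sample = """>seq1 demo record
ATGGCGTAA
>seq2 demo record
ATGTGA
"""
print(parse_fasta(sample)["seq1 demo record"])  # ATGGCGTAA
```

Each format in a multi-omics workflow (h5ad, mtx, VCF, mmCIF, and so on) needs an analogous reader, which is precisely the per-format glue code that drives the integration burden described above.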
Biocurators resolve discrepancies, such as conflicting protein-protein interactions across resources like BioGRID and IMEx, but manual processes struggle to keep pace with rapid data growth, leading to annotation error rates of up to 17% in taxonomy resources such as Greengenes. Outdated entries, including obsolete sequences, arise from funding shortages that have cut model-organism database budgets by 30-40%, creating literature curation backlogs of up to a decade. These inconsistencies, including four distinct types identified in functional annotations, undermine reliability in downstream analyses. Ethical concerns arise from biases in AI training data sourced from these databases and from privacy risks in clinical repositories. AI models trained on genomic datasets that underrepresent non-European populations perpetuate disparities, with biased outcomes exacerbating inequities in healthcare and precision medicine. Privacy issues in clinical databases involve risks of data breaches, as aggregated records can be hacked, compromising patient confidentiality despite regulations like HIPAA. These biases, originating from skewed dataset compositions, affect clinical decision-making, prompting calls for mitigation strategies to ensure equitable AI deployment. Cross-database query error rates are driven by format incompatibilities and annotation variances, while solutions like the FAIR principles remain only partially adopted. For example, environmental datasets capture just 27-34% of recommended metadata terms without standardized vocabularies, reflecting limited implementation despite FAIR's emphasis on interoperability and reusability. In biological contexts, FAIR adoption enhances data reuse but faces barriers in legacy systems, with error rates in clinical data processing varying from 0.04% to 27.84% per field depending on verification methods. Recent initiatives, such as the 2024-2025 GO FAIR implementation networks, have aimed to increase adoption through community standards and tools.
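Metadata-coverage figures like the 27-34% cited above come from checking dataset records against a required-field checklist. The sketch below shows the idea; the required-field list is illustrative, not an official FAIR specification:

```python
# Illustrative checklist inspired by FAIR's emphasis on rich metadata;
# real assessments use community standards such as DataCite or MIxS fields.
REQUIRED_FIELDS = {"title", "identifier", "license", "access_url",
                   "format", "creator"}

def metadata_coverage(record):
    """Return the fraction of required metadata fields present and
    non-empty, plus the sorted names of any missing fields."""
    present = {f for f in REQUIRED_FIELDS if record.get(f)}
    missing = sorted(REQUIRED_FIELDS - present)
    return len(present) / len(REQUIRED_FIELDS), missing

# A fabricated, partially documented dataset record.
dataset = {"title": "Example assembly", "identifier": "doi:10.x/abc",
           "format": "FASTA"}
coverage, missing = metadata_coverage(dataset)
print(round(coverage, 2), missing)  # 0.5 ['access_url', 'creator', 'license']
```

Scoring every record in a repository this way yields the kind of aggregate coverage percentage that FAIR-maturity evaluations report.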
The integration of artificial intelligence (AI) into biological databases is revolutionizing predictive modeling and data generation, with generative models enabling unprecedented accuracy in structure prediction. AlphaFold 3, developed by DeepMind and released in 2024, employs a diffusion-based architecture to forecast the joint structures of protein complexes, ligands, and nucleic acids, surpassing previous models in handling diverse biomolecular interactions. This advancement populates resources like the AlphaFold Protein Structure Database with AI-generated predictions, facilitating downstream analyses in drug discovery and functional genomics. In parallel, automated curation workflows are emerging in annual database compilations, as evidenced by contributions to the 2025 Nucleic Acids Research database issue, where AI-driven tools enhance annotation efficiency for genomic and proteomic repositories by leveraging large language models to verify and standardize entries. Federated learning addresses big-data challenges in biological databases by enabling collaborative model training across distributed institutions without centralizing sensitive data, thus preserving privacy in multi-site genomic studies. A 2024 evaluation demonstrated its efficacy on biomedical image datasets, achieving performance comparable to centralized approaches while mitigating data-sharing barriers in clinical and research consortia. Complementing this, blockchain technology ensures data provenance by creating immutable ledgers for biological samples and sequences, as proposed in frameworks that convert biomedical data into non-fungible tokens (NFTs) to track ownership and circulation in biobanks. These cloud-based innovations scale to petabyte-level datasets, supporting real-time integration across global repositories. Multi-omics fusion is advancing through integration tools that uncover shared variation across datasets, exemplified by Multi-Omics Factor Analysis (MOFA), introduced in 2018 as a Bayesian framework for disentangling sources of heterogeneity in genomic, transcriptomic, and epigenomic data.
MOFA+ extends this framework to single-cell data, enabling scalable analysis of thousands of samples by inferring latent factors that reveal regulatory mechanisms and therapeutic targets. Such methods bridge silos in biological databases, fostering holistic insights into complex phenotypes. Sustainability efforts in biological databases emphasize green computing to reduce the environmental impact of data-intensive bioinformatics, with studies quantifying the carbon footprint of common analyses at up to several kilograms of CO2 equivalent per run on high-performance clusters. Initiatives promote energy-efficient algorithms and renewable-powered data centers to curb emissions from expanding repositories. Post-2020 open-access mandates, including the U.S. National Institutes of Health's policy requiring immediate public archiving of peer-reviewed manuscripts, have accelerated free dissemination of database-derived research, aligning with global efforts such as Plan S to enhance equity in access. Looking ahead, AI is poised to generate a substantial portion of database content by 2030, with generative models populating entries for protein functions and variant annotations, as forecast in analyses of AI's trajectory in the life sciences. Challenges such as hallucination, where model outputs fabricate biological relationships, are being mitigated through validation agents that cross-reference predictions against curated databases, as in GeneAgent, which autonomously queries curated resources to refine gene-set analyses, with reported accuracy improvements of over 90%. These developments promise more reliable, AI-augmented biological databases, though ongoing validation protocols remain essential to maintain scientific integrity.
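MOFA itself uses Bayesian group factor analysis, which is well beyond a short sketch, but the core idea of a shared latent factor across concatenated omics features can be loosely illustrated with plain power iteration on a samples-by-features matrix (all data here is fabricated; this is a crude stand-in, not the MOFA algorithm):

```python
import math
import random

def leading_factor(matrix, iters=100, seed=0):
    """Power iteration on M^T M: returns the leading right singular
    vector of M, a rough analogue of the first shared latent factor
    across the concatenated feature blocks."""
    rng = random.Random(seed)
    n_features = len(matrix[0])
    v = [rng.random() for _ in range(n_features)]
    for _ in range(iters):
        w = [sum(row[j] * v[j] for j in range(n_features)) for row in matrix]
        v = [sum(matrix[i][j] * w[i] for i in range(len(matrix)))
             for j in range(n_features)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

# Fabricated data: 3 samples x (2 "transcriptomic" + 2 "methylation") features.
omics = [[2.0, 1.9, 0.1, 0.2],
         [1.8, 2.1, 0.0, 0.1],
         [0.2, 0.1, 2.0, 1.9]]
factor = leading_factor(omics)
print([round(x, 2) for x in factor])
```

Here the first two (hypothetical transcriptomic) features dominate the leading factor because two of the three samples load heavily on them; MOFA generalizes this intuition with per-modality weight matrices, sparsity priors, and handling of missing data.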

Key Resources and Publications

Nucleic Acids Research Database Issues

The Nucleic Acids Research (NAR) Database Issue has served as an annual supplement since the early 1990s, initially appearing as supplementary issues in April 1991 (with 18 articles) and May 1992 (19 articles) before being formally labeled the Database Issue in July 1993 (24 articles). This publication has grown substantially, reflecting the expansion of bioinformatics resources; the 2025 issue, for instance, features 185 papers covering 73 new databases across biology and related disciplines. Over its history, the NAR Database Issue has documented thousands of resources, with analyses identifying 1,727 unique databases from articles published between 1991 and 2016 alone, and subsequent issues adding hundreds more for a total exceeding 2,300 as of 2025. Papers in the NAR Database Issue typically describe new databases, detailing their underlying methods, supported data types, and intended applications, while update articles on established resources highlight enhancements such as expanded datasets or improved interfaces, often including quantitative metrics like user access statistics, citation counts, or query volumes to demonstrate ongoing utility and impact. Submissions must adhere to guidelines emphasizing accessibility, sustainability, and interoperability, ensuring that descriptions provide sufficient technical detail for reproducibility and integration with other tools. This structured format has made the issue a vital venue for database developers to report progress and benchmark against peers. The NAR Database Issue also functions as a registry, cataloging resources that might otherwise remain underrecognized and thereby influencing their funding, maintenance, and adoption within the scientific community. High citation rates for these publications (seminal database papers have accrued over 16,000 citations) correlate with sustained support and prioritization in grant allocations.
Notably, issues from the 1990s played a pivotal role in kickstarting model-organism databases by facilitating their transition to web-accessible formats during the early internet era, with over 75% of pre-1994 databases evolving to online platforms post-publication. All NAR Database Issue content is freely available through Oxford Academic, with full open-access articles ensuring broad dissemination. A cumulative index, known as the online Molecular Biology Database Collection, aggregates descriptions and links to all databases featured across issues, enabling efficient discovery and historical tracking. In the 2020s, the NAR Database Issue has evolved to emphasize emerging areas such as AI integration for data analysis and multi-omics approaches that combine genomics, transcriptomics, proteomics, and beyond, as evidenced by increasing coverage of AI-driven tools and resources handling heterogeneous datasets in recent editions. This shift mirrors broader trends in bioinformatics, with papers now frequently addressing machine learning applications and integrative platforms to tackle complex biological questions.

Curated Repositories and Directories

Curated repositories and directories serve as centralized hubs that systematically catalog, annotate, and provide access to biological databases worldwide, enabling researchers to discover and navigate the expansive landscape of resources. These meta-resources go beyond individual databases by offering searchable interfaces, classifications, and curated metadata, which streamline the identification of relevant tools for genomic, proteomic, and other life sciences inquiries. By maintaining up-to-date inventories, they address the rapid proliferation of databases, ensuring that users can locate both established and emerging resources efficiently. One prominent example is Database Commons, a manually curated catalog launched in 2017 by the National Genomics Data Center in China, which as of the latest available data in 2023 encompasses 5,825 biological databases derived from 8,931 publications across 72 countries and regions; the catalog continues to grow with community contributions. This directory categorizes entries by data type (e.g., genomic, proteomic) and functionality, while providing detailed metadata such as database scope, update status, and access links to facilitate user exploration. It supports advanced search capabilities, including keyword queries and faceted browsing, and is regularly enriched through community submissions and curatorial efforts to reflect the evolving ecosystem of biological databases. Another key directory is BioDB100, an initiative under the BioDBcore framework developed by the International Society for Biocuration, which identifies and standardizes descriptions for a core set of 100 essential biological databases to promote interoperability and quality assessment. BioDBcore provides a uniform specification for core attributes like database name, description, and data types, serving as a foundational checklist that directories like Database Commons adopt to ensure consistent representation.
This approach highlights influential resources, such as those for sequence data or protein interactions, aiding the prioritization of high-impact databases for research and development. In Europe, ELIXIR, established in 2014 as an intergovernmental research infrastructure, coordinates a network of national nodes that maintain and disseminate core biological data resources, including repositories for sequence, protein, and structural data. ELIXIR's platform features a searchable registry of recommended tools and databases, with categorization by scientific domain and an emphasis on sustainability through federated access and standards compliance, coordinating hundreds of core bioinformatics resources, including over 400 open resources as of recent assessments. Similarly, the NCBI Resources page, hosted by the National Institutes of Health, offers a comprehensive directory of its own databases (such as GenBank and PubMed) alongside links to external resources, organized by categories like literature, sequences, and structures for seamless discovery. These directories play a crucial role in facilitating discovery by aggregating dispersed resources and incorporating recent advancements; for instance, Database Commons added entries for AI-focused databases in 2025, such as those for predictive modeling in genomics and medicine. Unlike annual compilations like the NAR Database Issue, these repositories accept ongoing community contributions and regular updates to capture dynamic additions, ensuring a more current and inclusive view of the biological database ecosystem.

References

  1. [1]
    History of Biological Databases, Their Importance, and Existence in ...
    Jan 18, 2025 · Biological databases store not only nucleotide sequences, but also sequences of amino acids, data regarding methylation and CpG island location, ...
  2. [2]
    The importance of biological databases in biological discovery
    Biological databases play a central role in bioinformatics. They offer scientists the opportunity to access a wide variety of biologically relevant data.Missing: definition | Show results with:definition
  3. [3]
    Defending Our Public Biological Databases as a Global Critical ...
    Apr 4, 2019 · INSDC and similar databases have dramatically increased the pace of fundamental biological discovery and enabled a host of innovative ...<|separator|>
  4. [4]
    The 2025 Nucleic Acids Research database issue and the online ...
    Dec 10, 2024 · The 2025 Nucleic Acids Research database issue contains 185 papers spanning biology and related areas. Seventy three new databases are covered.
  5. [5]
    GenBank 2025 update | Nucleic Acids Research - Oxford Academic
    Nov 18, 2024 · By the end of 2021, GenBank contained 6.8 million viral genomes, of which 2.2 million were from coronaviruses, or about one-third of the viral ...
  6. [6]
    UniProt: the Universal Protein Knowledgebase in 2025
    Nov 18, 2024 · The UniProt suite of databases (https://www.uniprot.org/) serves as a leading global data resource for protein sequence and functional ...Uniprot: The Universal... · Progress And New... · Expert Curation
  7. [7]
    Primary and secondary databases | Bioinformatics for the terrified
    Primary databases are populated with experimentally derived data such as nucleotide sequence, protein sequence or macromolecular structure.Missing: RefSeq | Show results with:RefSeq
  8. [8]
    Protecting against researcher bias in secondary data analysis
    Researcher biases can lead to questionable research practices in secondary data analysis, which can distort the evidence base.
  9. [9]
    NCBI reference sequences (RefSeq): a curated non-redundant ...
    NCBI builds RefSeq from the sequence data available in the archival database GenBank (4), which is a comprehensive public repository of sequences submitted to, ...
  10. [10]
    Database Commons: A Catalog of Worldwide Biological Databases
    An up-to-date catalog of worldwide biological databases, as well as their curated meta-information and derived statistics, is publicly available at Database ...
  11. [11]
    Updated resources for exploring experimentally-determined PDB ...
    Nov 28, 2024 · Updated resources for exploring experimentally-determined PDB structures and Computed Structure Models at the RCSB Protein Data BankAbstract · Introduction · Results
  12. [12]
    Gene Ontology Resource
    Current release 2025-10-10: 39,354 GO terms | 9,281,704 annotations 1,601,555 gene products | 5,495 species (see statistics). The Gene Ontology Resource. The ...See statistics · GO enrichment analysis · Introduction to GO annotations · Ontology
  13. [13]
    KEGG: biological systems database as a model of the real world
    Oct 17, 2024 · The KEGG database (1,2) has been developed as a computer model of biological systems, such as the cell and the organism, by capturing and organizing knowledge ...Overview of KEGG · New developments in KEGG · Other improvements of KEGG
  14. [14]
    GBIF
    GBIF | Global Biodiversity Information Facility. Free and open access to biodiversity data. Occurrences Species Datasets Publishers Resources · Search · What is ...What is GBIF?www.gbif.org · Search · GBIF Work Programme 2025 · Species search
  15. [15]
    All Resources - Site Guide - NCBI - NIH
    A database providing information on the structure of assembled genomes, assembly names and other meta-data, statistical reports, and links to genomic ...
  16. [16]
    MouseOmics: a multi-omics database for mouse biological study
    Oct 21, 2025 · Here, we established a mouse multi-omics database, MouseOmics, by integrating 21 genomes distributed among five species within the genus Mus, ...Materials And Methods · Data Collection And... · Results
  17. [17]
    Multi-omics: Exploring Inside Cells - Front Line Genomics
    Aug 27, 2021 · Multi-omics is the simultaneous study of several molecular 'omes', providing insights into the complex mechanisms that underpin disease.Genomics · Transcriptomics · Steps Of Multi-Omics...Missing: classified | Show results with:classified
  18. [18]
    [PDF] Molecular biological databases: evolutionary history, data modeling ...
    Since the emergence of the first ever computer based molecular biological database 'Protein Data Bank‟ in 1971, biological database domain has grown rapidly in.Missing: origins | Show results with:origins
  19. [19]
    Data integration in biological research: an overview - PubMed Central
    Data sharing, integration and annotation are essential to ensure the reproducibility of the analysis and interpretation of the experimental findings.Missing: post- | Show results with:post-
  20. [20]
    Are graph databases ready for bioinformatics? - PMC - NIH
    Graph databases themselves are ready for bioinformatics and can offer great speedups over relational databases on selected problems.
  21. [21]
    Bio-Strings: A Relational Database Data-Type for Dealing with ...
    Jul 30, 2022 · We propose the relational text data type to represent and manipulate biological sequences and their derivatives.
  22. [22]
    Biological sequences integrated: a relational database approach
    In this paper we show how to use relational modeling techniques and relational database technology for modeling and storing biological sequence data, i.e. for ...
  23. [23]
    Graph databases in systems biology: a systematic review
    Nov 20, 2024 · For example, the Neo4j-based resource GREG combines five types of regulatory processed data (transcription factors, regulatory noncoding RNAs, ...
  24. [24]
    Biological Sequence Data Model - NCBI C++ Toolkit Book
    This chapter describes the NCBI Biological Sequence Data Model, with emphasis on the ASN.1 files and C++ API.
  25. [25]
    NIH NCBI Sequence Read Archive (SRA) on AWS
    The SRA provides open access to these biological sequence data to support the research community's efforts to enhance reproducibility and make new discoveries.
  26. [26]
    Scalability and Validation of Big Data Bioinformatics Software - PMC
    Jul 20, 2017 · This review examines two important aspects that are central to modern big data bioinformatics analysis – software scalability and validity.Missing: sharding | Show results with:sharding
  27. [27]
    OWL Web Ontology Language Reference - W3C
    Feb 10, 2004 · The Web Ontology Language OWL is a semantic markup language for publishing and sharing ontologies on the World Wide Web. OWL is developed as a ...Acknowledgments · Introduction · OWL document · Individuals
  28. [28]
    The Sanger FASTQ file format for sequences with quality scores ...
    Dec 16, 2009 · This article defines the FASTQ format, covering the original Sanger standard, the Solexa/Illumina variants and conversion between them.
  29. [29]
    File Format Documentation - wwPDB
    In 1976, a version using 72 characters plus 8 for sequencing was introduced. This 80-column format is what has commonly been called the (legacy) "PDB format".
  30. [30]
    EDAM: an ontology of bioinformatics operations, types of data and ...
    Abstract. Motivation: Advancing the search, publication and integration of bioinformatics tools and resources demands consistent machine-understandable des.Abstract · INTRODUCTION · METHODS · RESULTS
  31. [31]
    The FAIR Guiding Principles for scientific data management ... - Nature
    Mar 15, 2016 · This article describes four foundational principles—Findability, Accessibility, Interoperability, and Reusability—that serve to guide data ...
  32. [32]
    BioPortal: an open community resource for sharing, searching, and ...
    May 13, 2025 · The world's most comprehensive repository of biomedical ontologies. It provides infrastructure for finding, sharing, searching, and utilizing biomedical ...
  33. [33]
    protégé
    Protégé is a free, open-source ontology editor and framework used to build intelligent systems, with a plug-in architecture for simple and complex applications.Software · About · Support · Community
  34. [34]
    Biological Databases & Bioinformatics Tools | Systems ... - Fiveable
    Key Concepts and Definitions. Biological databases store, organize, and make accessible various types of biological data (sequences, structures, pathways, etc.) ...
  35. [35]
    NCBI Genomes FTP - NIH
    ... download them piecemeal or download in bulk using command-line tools such as lftp and rsync. How can I download only the current annotation for an organism?
  36. [36]
    FTP Download - Ensembl
    FTP Download. You can download via a browser from our FTP site, use a script, or even use rsync from the command line.
  37. [37]
    The UCSC Genome Browser Database - PMC - NIH
    The Genome Browser Database, browsing tools and downloadable data files can all be found on the UCSC Genome Bioinformatics website (http://genome.ucsc.edu), ...
  38. [38]
    NCBI Multiple Sequence Alignment Viewer 1.26.0
    The NCBI Multiple Sequence Alignment Viewer (MSA) is a graphical display for nucleotide and protein sequence alignments.
  39. [39]
    Jalview Home Page - Jalview
    Jalview is a free program for sequence alignment editing, visualization, and analysis, with DNA, RNA, and protein capabilities. It uses Jmol for 3D structures.Download · Online Training videos · Jalview Discussion Forum · About
  40. [40]
    Genome Browser User's Guide
    The Genome Browser supports text and sequence based searches that provide quick, precise access to any region of specific interest. Secondary links from ...
  41. [41]
  42. [42]
    Ten simple rules for researchers who want to develop web apps
    Jan 6, 2022 · Rule 1: Start with user-centered design · Rule 2: Test early, test often · Rule 3: Make it accessible · Rule 4: Protect your users · Rule 5: Hire a ...
  43. [43]
    Help - PubMed - NIH
    Jun 25, 2025 · What if the link to the full text is not working? How do I search by author? How do I search by journal name? How do I find a specific citation?
  44. [44]
    Melvin is a conversational voice interface for cancer genomics data
    Jan 5, 2024 · Melvin is a multi-modal Amazon Alexa skill that allows users to quickly explore cancer genomics data from TCGA through simple conversations.
  45. [45]
    BioMart RESTful access (Perl and wget) - Ensembl
    BioMart RESTful access is a quick and easy way to query the Ensembl marts using wget or perl and doesn't require any programing knowledge.
  46. [46]
    Biological SOAP servers and web services provided by the public ...
    A number of biological data resources (i.e. databases and data analytical tools) are searchable and usable on-line thanks to the internet and the World Wide ...Missing: legacy | Show results with:legacy
  47. [47]
    BioMart: a data federation framework for large collaborative projects
    Sep 19, 2011 · BioMart is a freely available, open source, federated database system that provides a unified access to disparate, ...
  48. [48]
    Ontology-Based Querying with Bio2RDF's Linked Open Data - PMC
    The availability of Bio2RDF-SIO mappings makes it possible to compose data source independent SPARQL queries that can be applied to all SPARQL endpoints, as ...
  49. [49]
    The Biological Object Notation (BON): a structured file format ... - NIH
    Jun 25, 2018 · In biology, JSON is mainly used to store or exchange application related data that can include biological data, such as alignments or protein ...
  50. [50]
    [PDF] Aspera Transfer Guide - dbGaP - NIH
    Aspera, using the FASP protocol, enables high-throughput file transfers with NCBI; Aspera Connect is used for large files and ascp for bulk transfers.
  51. [51]
    Galaxy: a comprehensive approach for supporting accessible ...
    The Galaxy workflow system facilitates analysis repeatability and, like Galaxy's accessibility model, does so in a way that is usable even by users who have little ...
  52. [52]
    [PDF] Annual Report 2014 - ELIXIR Europe
    Welcome to the first Annual Report of ELIXIR, Europe's research infrastructure for life-science data and information. Launched officially in January 2014, ...
  53. [53]
    GraphQL for the delivery of bioinformatics web APIs and application ...
    GraphQL is a novel, increasingly prevalent alternative to REST and SOAP that represents the available data in the form of a graph to which any conceivable query ...
  54. [54]
    High dimensional biological data retrieval optimization with NoSQL ...
    Nov 13, 2014 · We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, ...
  55. [55]
    International Nucleotide Sequence Database Collaboration (INSDC)
    The International Nucleotide Sequence Database Collaboration (INSDC) archives nucleotide sequence data, from raw to assembled and annotated sequences, ...
  56. [56]
    GenBank Release 268.0 is Available! - NCBI Insights - NIH
    Aug 26, 2025 · GenBank release 268.0 (8/18/2025) is now available on the NCBI FTP site. This release has 47.01 trillion bases and 5.90 billion records.
  57. [57]
    BLAST: Basic Local Alignment Search Tool
    The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches.
  58. [58]
    Introduction - BLAST® Command Line Applications User Manual
    Jun 23, 2008 · The National Center for Biotechnology Information (NCBI) first introduced BLAST in 1989. The NCBI has continued to maintain and update BLAST ...
  59. [59]
    dbSNP: the NCBI database of genetic variation - Oxford Academic
    Since its inception in September 1998, the dbSNP database has served as a central, public repository for genetic variation. Once such variations are identified ...
  60. [60]
    The evolution of dbSNP: 25 years of impact in genomic research
    Over 25 years, dbSNP has grown to include more than 4.4 billion submitted SNPs and 1.1 billion unique reference SNPs, providing essential data for identifying ...
  61. [61]
    The Sequence Read Archive (SRA) - NCBI - NIH
    Aug 3, 2023 · The SRA is a publicly available repository of high throughput sequencing data. The archive accepts data from all branches of life as well as metagenomic and ...
  62. [62]
    Metagenome Submission Guide - NCBI - NIH
    Mar 25, 2025 · Metagenome projects may include raw sequence reads collected from an ecological or organismal source (submitted to the Sequence Read Archive), ...
  63. [63]
    User guide to the wwPDB X-ray validation reports
    Aug 9, 2024 · The wwPDB X-ray validation reports cover overall quality, entry composition, residue plots, data statistics, model quality, and the fit of the ...
  64. [64]
    Methods for Determining Atomic Structures - PDB-101
    Flexible portions of protein will often be invisible in crystallographic electron density maps, since their electron density will be smeared over a large space.
  65. [65]
    RCSB PDB: Homepage
    RCSB Protein Data Bank (RCSB PDB) enables breakthroughs in science and education by providing access and tools for exploration, visualization, and analysis.
  66. [66]
    Protein Data Bank: the single global archive for 3D macromolecular ...
    Oct 24, 2018 · The PDB Core Archive houses 3D atomic coordinates of more than 144,000 structural models of proteins, DNA/RNA, and their complexes with metals ...
  67. [67]
    EMBL-EBI and Google DeepMind renew partnership and release ...
    Oct 7, 2025 · The AlphaFold Database contains protein structure predictions for over 200 million proteins, and has been used by over three million people in ...
  68. [68]
    AlphaFold Protein Structure Database: massively expanding the ...
    Nov 17, 2021 · The initial release of AlphaFold DB contains over 360,000 predicted structures across 21 model-organism proteomes, which will soon be expanded ...
  69. [69]
    SCOP| Structural Classification of Proteins
    Jun 29, 2022 · SCOP classification of proteins aims to provide comprehensive structural and evolutionary relationships between all proteins whose structure ...
  70. [70]
    CATH: Protein Structure Classification Database at UCL
    Sep 30, 2024 · CATH is a classification of protein structures downloaded from the Protein Data Bank. We group protein domains into superfamilies when there is sufficient ...
  71. [71]
    a PyMOL plugin to integrate and visualize data for drug discovery
    Oct 1, 2015 · In summary, we proposed an innovative, easy to use PyMOL plugin that automatically retrieves chemical and biological data from six high-quality ...
  72. [72]
    DNAproDB - A Database and Web Tool for Structural Analysis of ...
    DNAproDB is a database, structure processing pipeline, and visualization tool to analyze DNA-protein complexes.
  73. [73]
    DNAproDB: an updated database for the automated and interactive ...
    Nov 4, 2024 · DNAproDB (https://dnaprodb.usc.edu/) is a database, visualization tool, and processing pipeline for analyzing structural features of protein–DNA interactions.
  74. [74]
    Protein Structure Prediction in Drug Discovery - PMC - NIH
    Aug 17, 2023 · By bridging computational predictions with experimental results, they provide a comprehensive understanding of protein behavior and folding, ...
  75. [75]
    AI-driven protein structure prediction and its clinical impact
    Aug 6, 2025 · AI-driven protein prediction helps identify binding sites, simulate drug-protein interactions, and model complex proteins for customized ...
  76. [76]
    FlyBase Homepage
    FlyBase: a database for drosophila genetics and molecular biology. ...
  77. [77]
    ZFIN The Zebrafish Information Network
    The Zebrafish Information Network (ZFIN) is the database of genetic and genomic data for the zebrafish (Danio rerio) as a model organism.
  78. [78]
    BOLD – The Barcode of Life Data Systems
    A data portal that provides access to over 17.8m public records representing 1.3m Species. The records in the portal are public and fully accessible. The ...
  79. [79]
    ITIS.gov | Integrated Taxonomic Information System (ITIS)
    Here you will find authoritative taxonomic information on plants, animals, fungi, and microbes of North America and the world.
  80. [80]
    The Unified Phenotype Ontology : a framework for cross-species ...
    The ontology design templates are based on shared features of existing phenotypic descriptions from various model organisms and represent community consensus.
  81. [81]
    GBIF Backbone Taxonomy
    The GBIF Backbone Taxonomy is a single, synthetic management classification with the goal of covering all names GBIF is dealing with.
  82. [82]
    GBIF and DNA
    Jan 13, 2022 · GBIF collaborates with experts and organizations to incorporate DNA-derived biodiversity data, resulting in a more comprehensive picture of life on Earth.
  83. [83]
    Alliance of Genome Resources Portal: unified model organism ...
    Sep 25, 2019 · The Alliance web portal (www.alliancegenome.org) provides a single point of access to multiple types of genetic and genomic data from diverse model organisms.
  84. [84]
    GBIF and climate change
    GBIF-mediated data is an essential input for understanding the impacts of climate change on biodiversity, including setting priorities and creating protection ...
  85. [85]
    Position Paper on Genomics in Conservation
    Genomics is a powerful tool for conservation, informing management, guiding actions, and identifying genes for adaptation, but costs and access are limitations.
  86. [86]
    About OMIM
    This database was initiated in the early 1960s by Dr. Victor A. McKusick as a catalog of mendelian traits and disorders, entitled Mendelian Inheritance in ...
  87. [87]
    ClinVar: updates to support classifications of both germline and ...
    Nov 23, 2024 · ClinVar (www.ncbi.nlm.nih.gov/clinvar/) is a free, public database of human genetic variants and their relationships to disease, with >3 million ...
  88. [88]
    ClinGen curation of ClinVar: Improving a critical community resource
    ClinVar, a public NCBI-maintained database, has aggregated over 4.7 million classified variants, aiding clinical genomic interpretation and research.
  89. [89]
    The Cancer Genome Atlas Program (TCGA) - NCI
    The Cancer Genome Atlas (TCGA) is a landmark cancer genomics program that sequenced and molecularly characterized over 11,000 cases of primary cancer samples ...
  90. [90]
    Disease Ontology - Institute for Genome Sciences @ University of ...
    Explore the DO's semantically defined disease knowledgebase, mine disease-to-disease relatedness through defined features and mechanisms via DO-KB tools.
  91. [91]
    PharmGKB, an Integrated Resource of Pharmacogenomic Knowledge
    PharmGKB began as the central data repository for the Pharmacogenetics Research Network (PGRN) and scientific community at large in 2000 (Giacomini et al ...
  92. [92]
    Genomic medicine on the frontier of precision medicine - PMC - NIH
    Feb 1, 2025 · Genomic medicine applies genomic information for prediction, prevention, early diagnosis, or tailored treatment, using individual genomic data ...
  93. [93]
    Twenty-Five Years of Evolution and Hurdles in Electronic Health ...
    Jan 9, 2025 · During the 2020s, retrospective EHR reviews enabled early identification of clinical characteristics, risk factors, and effective interventions ...
  94. [94]
    Linking MarketScan claims data with Veradigm EHR data - Merative
    Sep 30, 2024 · This new healthcare database unlocks the whole picture of patient care at an unparalleled depth by linking claims data with electronic health records.
  95. [95]
    New AI system could accelerate clinical research | MIT News
    Sep 25, 2025 · MIT researchers developed an interactive, AI-based system that enables users to rapidly annotate areas of interest in new biomedical imaging ...
  96. [96]
    Top Annotation and Data Service Providers for Clinical Trial ... - iMerit
    Top Annotation and Data Service Providers for Clinical Trial and RWE AI in 2025 · 1. iMerit · 2. CureMeta · 3. Scale AI · 4. CloudFactory · 5. Centaur Labs.
  97. [97]
    Summary of the HIPAA Security Rule | HHS.gov
    Dec 30, 2024 · The Security Rule establishes a national set of security standards to protect certain health information that is maintained or transmitted in electronic form.
  98. [98]
    Exploring barriers and ethical challenges to medical data sharing
    Nov 15, 2024 · According to current law, informed consent from patients is required for the use of samples and medical data in scientific research. However, ...
  99. [99]
    Ethical Issues in Patient Data Ownership - PMC - NIH
    May 21, 2021 · First, patient data may be exploited with unauthorized access by third parties (hackers). Second, individuals may lose control over their data ...
  100. [100]
    Data Harmonization Challenges in Biomedical Research - Elucidata
    Feb 17, 2025 · Explore the complexities of harmonizing diverse datasets in large-scale biomedical research, including data heterogeneity, silos, ...
  101. [101]
  102. [102]
    A comprehensive evaluation of life sciences data resources reveals ...
    Our analysis shows that many life sciences resources contain severe accessibility issues (74.8% of data portals and 69.1% of journal websites) and are ...
  103. [103]
    Massive underrepresentation of Arabs in genomic studies of ...
    Nov 22, 2023 · Arabs represent 5% of the world population and have a high prevalence of common disease, yet remain greatly underrepresented in genome-wide association studies.
  104. [104]
    Causal evidence of racial and institutional biases in accessing ...
    Sep 10, 2025 · We find that racial identity more strongly predicts response rate to paywalled article requests compared to institutional affiliation, whereas ...
  105. [105]
    Quality Matters: Biocuration Experts on the Impact of Duplication and ...
    Biocuration plays a vital role in biological database curation [20]. It de-duplicates database records [21], resolves inconsistencies [22], fixes errors [17], ...
  106. [106]
    Detecting and correcting misclassified sequences in the large-scale ...
    Jun 24, 2020 · They showed that the annotation error rate in SILVA and Greengenes databases is about 17%. They also used the phylogenetic-based approach ...
  107. [107]
    A roadmap for the functional annotation of protein families
    Aug 12, 2022 · A novel Gene Ontology annotation quality control system based on ontology term co-annotation patterns.
  108. [108]
    Ethical Challenges of Artificial Intelligence in Medicine | Cureus
    Nov 26, 2024 · AI systems trained on biased or non-representative data may produce biased outcomes, leading to disparities in diagnosis, health equity [19] ...
  109. [109]
    Ethical Issues of Artificial Intelligence in Medicine and Healthcare
    Clinical data collected by robots can be hacked into and used for malicious purposes that minimize privacy and security. Some social networks gather and store ...
  110. [110]
    Bias in medical AI: Implications for clinical decision-making - NIH
    Nov 7, 2024 · We discuss potential biases that can arise at different stages in the AI development pipeline and how they can affect AI algorithms and clinical decision- ...
  111. [111]
    Estimating Error Rates in Bioactivity Databases - ResearchGate
    Aug 6, 2025 · Error rates varied between 0 and 6% in Clinical Pathology parameters and were 21.6% in the Histopathology dataset, which is consistent with ...
  112. [112]
    A Case for Accelerating Standards to Achieve the FAIR Principles of ...
    Jun 23, 2023 · This commentary provides a broad overview of current approaches and tools to promote the adoption of the FAIR principles for environmental ...
  113. [113]
    Error Rates of Data Processing Methods in Clinical Research
    The accuracy associated with data processing methods varied widely, with error rates ranging from 2 errors per 10,000 fields to 2,784 errors per 10,000 fields.
  114. [114]
    Accurate structure prediction of biomolecular interactions ... - Nature
    Here we describe our AlphaFold 3 model with a substantially updated diffusion-based architecture that is capable of predicting the joint structure of complexes.
  115. [115]
    Volume 53 Issue D1 | Nucleic Acids Research - Oxford Academic
    Jan 6, 2025 · The 2025 Nucleic Acids Research database issue and the online molecular biology database collection
  116. [116]
    An in-depth evaluation of federated learning on biomedical ... - Nature
    May 15, 2024 · Federated learning (FL) offers a decentralized solution that enables collaborative learning while ensuring data privacy. In this study, we ...
  117. [117]
    Efficient Use of Biological Data in the Web 3.0 Era by Applying ...
    May 28, 2024 · By converting biomedical data into NFTs, the collection and circulation of samples can be accelerated, and the transformation of resources can be promoted.
  118. [118]
    Multi‐Omics Factor Analysis—a framework for unsupervised ...
    We present Multi‐Omics Factor Analysis (MOFA), a computational method for discovering the principal sources of variation in multi‐omics data sets. MOFA infers a ...
  119. [119]
    The Carbon Footprint of Bioinformatics - PMC - PubMed Central
    Bioinformatic research relies on large-scale computational infrastructures which have a nonzero carbon footprint but so far, no study has quantified the ...
  120. [120]
    U.S. science funding agencies roll out policies on free access to ...
    Dec 20, 2024 · The NIH and DOE policies require grantees to post accepted, peer-reviewed manuscripts in each agency's public repository as soon as they are published.
  121. [121]
    Biological databases in the age of generative artificial intelligence
    Mar 20, 2025 · Data deposited in public databases feed inference algorithms that perform many useful tasks, such as identifying features of genomic sequences ...
  122. [122]
    GeneAgent: self-verification language agent for gene-set analysis ...
    Jul 28, 2025 · Here we present GeneAgent, an LLM-based AI agent for gene-set analysis that reduces hallucinations by autonomously interacting with biological ...
  123. [123]
    The 2013 Nucleic Acids Research Database Issue and the online ...
    Nov 30, 2012 · The next issue, published on July 1, 1993, was the first one formally labelled as the Database Issue. It consisted of 24 articles, which added ...
  124. [124]
    The 2025 Nucleic Acids Research database issue and the online ...
    Dec 10, 2024 · The 2025 Nucleic Acids Research database issue contains 185 papers spanning biology and related areas. Seventy three new databases are covered.
  125. [125]
    25 Years of Molecular Biology Databases: A Study of Proliferation ...
    The study presented here first identifies each unique database described in 3055 Nucleic Acids Research Database Issue articles published between 1991 and 2016.
  126. [126]
    The 24th annual Nucleic Acids Research database issue - NIH
    The current 2017 Nucleic Acids Research Database Issue is the 24th annual collection of bioinformatic databases on various areas of molecular biology. It ...
  127. [127]
    Submitting to the Database Issue | Nucleic Acids Research
    From 2009 the NAR Database Issue is published online only. Nucleic Acids Research devotes its first issue each year to publishing information on online ...
  128. [128]
    About the journal | Nucleic Acids Research - Oxford Academic
    Nucleic Acids Research (NAR) publishes the results of leading-edge research into physical, chemical, biochemical and biological aspects of nucleic acids.
  129. [129]
    Volume 48 Issue D1 | Nucleic Acids Research - Oxford Academic
    DriverDBv3: a multi-omics database for cancer driver gene research. Shu-Hsuan Liu and others. Nucleic Acids Research, Volume 48, Issue D1, 08 January 2020 ...
  130. [130]
    Home - Database Commons
    Database Commons is a curated catalog of worldwide biological databases, with the aim to provide a full landscape of biological databases throughout the world.
  131. [131]
    a community-defined information specification for biological databases
    The present article proposes the adoption of a community-defined, uniform, generic description of the core attributes of biological databases, BioDBCore.
  132. [132]
    a community-defined information specification for biological databases
    Nov 18, 2010 · To help establish requirements, some examples can be found on the BioDBCore page of the ISB, and moreover the APBioNet's BioDB100 initiative ...
  133. [133]
    ELIXIR: providing a sustainable infrastructure for life science data at ...
    Jun 27, 2021 · ELIXIR is in a unique position to drive the transformation of life science's distributed data resources within Europe to be sustainable, federated, standards- ...
  134. [134]
    InertDB as a generative AI-expanded resource of biologically ...
    Apr 10, 2025 · InertDB serves as a valuable alternative to random sampling and decoy generation, offering improved training datasets and enhancing the accuracy ...