RefSeq
RefSeq is a curated, non-redundant database of high-quality reference sequences for genomes, transcripts, and proteins, maintained by the National Center for Biotechnology Information (NCBI) as part of the U.S. National Library of Medicine.[1] It serves as a stable standard for biological research, annotation, and comparative studies by providing integrated, well-annotated data derived from public sequence repositories.[2]
Initiated in 1999, RefSeq was developed to address the need for reliable, non-redundant sequence standards amid the growth of genomic data, evolving from initial efforts to annotate human and model-organism sequences into a comprehensive resource spanning diverse taxa.[2]
Unlike archival databases such as GenBank, which accept redundant submissions from the public, RefSeq undergoes rigorous curation—including automated pipelines and manual review by NCBI experts—to ensure accuracy, consistency, and minimal contamination, with records featuring unique identifiers and evidence-based annotations.[3][2]
As of Release 232 in September 2025, RefSeq encompasses sequences from over 170,000 organisms, including more than 427 million proteins and 74 million transcripts, covering eukaryotes, prokaryotes, viruses, organelles, and targeted loci to support applications in medicine, functional genomics, and evolutionary biology.[1]
Overview
Definition and Scope
RefSeq is a curated, non-redundant collection of genomic DNA, transcript (RNA), and protein sequences maintained by the National Center for Biotechnology Information (NCBI) as a stable reference for biological research.[4] It provides a comprehensive, integrated set of well-annotated sequences derived from publicly available data, emphasizing quality and consistency across diverse organisms including archaea, bacteria, eukaryotes, and viruses.[4] The scope of RefSeq is deliberately limited to representative, high-quality sequences that are experimentally validated or computationally predicted with strong evidence, excluding redundant, uncurated, or raw submissions such as those directly entered into GenBank.[4] This focus ensures that RefSeq serves as a baseline for comparative genomics, functional annotation, and diversity studies without incorporating every possible variant or unverified record.[4]
In contrast to broader archival databases like GenBank, RefSeq sequences undergo rigorous curation to maintain non-redundancy and are assigned stable, versioned identifiers (such as the NC_ prefix for chromosomal sequences) to support reliable tracking and updates over time.[4] Annotation in RefSeq follows core principles of accuracy and evidence-based validation, combining automated computational methods, collaborative input from experts, and manual curation to link nucleotide and protein sequences while reflecting the latest biological knowledge.[4] These processes prioritize sequences with direct experimental support where possible, supplemented by predictive models for gaps, ensuring the database's utility in medical, functional, and evolutionary analyses.[4] RefSeq integrates within the NCBI ecosystem, facilitating seamless access alongside resources like GenBank for enhanced data exploration.[4]
Purpose and Significance
RefSeq serves as a foundational resource in molecular biology by providing a stable, curated set of reference sequences that support key research activities, including genome annotation, gene prediction, comparative genomics, and functional studies. These sequences offer a reliable baseline for annotating newly sequenced genomes, predicting gene structures from transcriptomic data, and comparing genetic variations across species, thereby enabling researchers to draw accurate biological inferences. For instance, RefSeq's integration of genomic, transcript, and protein data facilitates the identification of orthologous genes and the elucidation of evolutionary relationships, which are essential for understanding gene function and regulation.[4][2] The database plays a critical role in standardizing nomenclature and minimizing redundancy in sequence data, which is vital for applications such as variant calling and proteomics. By curating representative sequences with consistent naming conventions derived from authoritative sources like the HUGO Gene Nomenclature Committee, RefSeq ensures that researchers use uniform identifiers, reducing errors in downstream analyses like mapping mutations to specific loci. This non-redundant approach streamlines proteomics workflows by providing high-quality protein models linked to functional annotations, allowing for precise identification of protein isoforms and interactions without the clutter of duplicate entries from primary databases.[3][2] RefSeq's impact extends across multiple fields, enhancing the accuracy of next-generation sequencing alignments, advancing drug discovery through sequence-function linkages, and promoting international collaborations via shared reference standards. In genomics, it enables precise read mapping in high-throughput sequencing projects, improving variant detection and assembly quality for clinical and research purposes. 
For drug development, RefSeq sequences connect genetic data to known functions, aiding in target validation and therapeutic design, as seen in studies of disease-associated genes. Additionally, its role in global initiatives, such as collaborations with Ensembl and model organism databases, fosters data interoperability and accelerates discoveries in biodiversity and human health.[4][3][2]
A key challenge RefSeq addresses is the resolution of ambiguities inherent in comprehensive repositories like GenBank, where multiple submissions for the same locus or isoform can lead to inconsistencies. Through careful selection of representative sequences, such as prioritizing experimentally validated or computationally predicted models, RefSeq curates a coherent dataset that clarifies these issues, ensuring researchers have access to the most accurate and up-to-date references for their work. This curation resolves discrepancies by integrating and validating data from diverse sources, thereby enhancing the reliability of genomic interpretations.[4][2]
History and Development
Establishment and Early Phases
The Reference Sequence (RefSeq) database was established in 1999 by the National Center for Biotechnology Information (NCBI) as part of the International Human Genome Project, aiming to provide a curated, non-redundant set of reference sequences for the human genome to facilitate genomic analysis and annotation.[5] This initiative addressed the growing complexity of sequence data in public repositories like GenBank by offering stable, integrated standards for mRNAs and proteins, initially focused exclusively on human sequences to support the project's goal of mapping and sequencing the human genome.[5][6] During its early phases from 2000 to 2005, RefSeq expanded its scope to include model organisms such as Saccharomyces cerevisiae, Drosophila melanogaster, and Arabidopsis thaliana, while prioritizing human data.[6] The database began with chromosomal sequences designated by the NC_ prefix for complete genome assemblies, followed by rapid inclusion of transcript records (NM_) and corresponding protein sequences (NP_), which by mid-2003 encompassed over 211,000 transcripts and 785,000 proteins, primarily from eukaryotic sources.[6] Key decisions included adopting an evidence-based annotation approach that integrated submitter data from GenBank with manual curation and computational predictions, rather than relying solely on automated methods, to ensure accuracy and biological relevance.[6] Additionally, RefSeq was tightly integrated with LocusLink—the precursor to the current Entrez Gene database—to enable gene-centered access and linking of sequences to loci.[6][5] Initial development faced significant challenges, including limited computational resources for handling the influx of large-scale sequencing data and heavy dependence on annotations provided by submitters, which required rigorous validation to maintain non-redundancy and quality.[6] These efforts laid the foundation for RefSeq as a reliable reference amid the rapid growth of genomic databases during the 
Human Genome Project era.[6]
Key Milestones and Expansions
In 2007, the RefSeq project introduced RefSeqGene records with the NG_ accession prefix, providing stable genomic sequences for specific gene loci that span multiple transcripts and support standardized reporting of sequence variants, particularly for clinical applications.[7][8][9] During the 2010s, RefSeq significantly expanded its scope to include non-coding RNA records (NR_ prefix) for better representation of functional transcripts beyond protein-coding genes.[8] The database also incorporated comprehensive viral genome sequences, shifting from curated reference models to include both reference and representative genomes to address diverse community needs.[10] Bacterial genome annotations grew rapidly, with the prokaryotic RefSeq collection reaching nearly 200,000 genomes by 2020 through integration of high-quality assemblies and non-redundant protein sets.[11] These expansions facilitated incorporation of human genetic variation data from large-scale efforts like the 1000 Genomes Project into RefSeq-linked resources such as dbSNP.[12] In the 2020s, RefSeq enhanced its annotation processes through improved automation and computational pipelines, enabling more efficient curation of eukaryotic and prokaryotic sequences using RNA-Seq evidence and advanced quality controls.[2] The project extended to metagenome-assembled genomes (MAGs), incorporating over 17,000 such prokaryotic entries as of 2023 (within a total of over 350,000 prokaryotic genomes by 2024 and more than 400,000 by 2025).[2][13][14][15] Post-2020 collaborations with Ensembl/GENCODE under the MANE initiative aligned RefSeq transcripts with GENCODE models, producing matched annotation sets for nearly all human protein-coding genes to standardize clinical and research use.[16][17] The COVID-19 pandemic prompted accelerated development of viral RefSeq entries, including the SARS-CoV-2 reference genome (NC_045512.2) released in 2020, which served as a foundational sequence for global research and 
variant tracking. These updates also advanced curation improvements, such as refined quality metrics for genome assemblies.[2] Continued growth through 2025, as seen in Release 232 (September 2025), encompassed sequences from over 170,000 organisms, underscoring RefSeq's ongoing role in supporting genomic research.[1]
Sequence Categories
Genomic and Gene Sequences
RefSeq genomic sequences form the foundational DNA-level records in the database, representing assembled chromosomes, scaffolds, and contigs from various organisms. These records are designated with specific accession prefixes to indicate their scope and assembly status. The NC_ prefix denotes complete genomic molecules, such as fully assembled nuclear chromosomes, organelle genomes, bacterial chromosomes, viral genomes, or plasmids, derived from high-quality submissions to the International Nucleotide Sequence Database Collaboration (INSDC).[3] In contrast, the NW_ prefix is used for working draft genomic sequences, primarily whole genome shotgun (WGS) contigs or scaffolds from draft assemblies, providing intermediate representations of genomic regions that may later contribute to higher-quality assemblies.[18] These genomic records serve as stable references for mapping and comparative genomics, with annotations generated through NCBI's eukaryotic or prokaryotic genome annotation pipelines.[4] Gene sequences in RefSeq are captured as curated genomic regions under the NG_ accession prefix, encompassing specific loci that include the full gene structure along with surrounding introns and regulatory elements. These records are manually curated by NCBI staff or collaborators to support detailed annotation of genes, haplotypes, paralogs, or non-transcribed pseudogenes, often as part of the RefSeqGene project for human and select model organisms.[3] Unlike broader genomic assemblies, NG_ records focus on targeted regions to facilitate precise gene-level analysis and integration with clinical or functional data. Annotations on these genomic and gene sequences emphasize structural and functional details at the DNA level, including gene models that delineate exon-intron boundaries, promoter regions, and links to regulatory elements such as CpG islands. 
These features are derived from evidence-based predictions using tools like Gnomon for gene prediction and cmsearch for non-coding elements, ensuring alignment with experimental data where available.[19] For instance, the human chromosome 1 record (NC_000001.11) includes thousands of annotated genes with defined boundaries and regulatory annotations, reflecting updates from the GRCh38.p14 assembly.[19] The version suffix of a RefSeq accession, such as ".11", tracks successive updates to the sequence, maintaining historical stability while incorporating new evidence; the base accession (e.g., NC_000001) remains unchanged across versions.[20] An example of an NG_ record is NG_008407.1, which covers a curated human genomic region for a specific gene, such as PRPS1, including intronic sequences and proximal regulatory elements.[21] These genomic and gene sequences are linked to transcript data to form complete gene models, enabling downstream derivation of RNA and protein predictions.[4]
Transcript and Protein Sequences
RefSeq transcript sequences represent the expressed RNA products derived from curated genomic data, providing non-redundant, high-quality representations of messenger RNAs (mRNAs) and non-coding RNAs (ncRNAs). Protein-coding transcripts are designated with NM_ accessions, such as NM_000518.5 for the human HBB gene, and are generated through alignment and annotation of cDNA evidence from International Nucleotide Sequence Database Collaboration (INSDC) submissions or direct genomic inference.[22][3] Non-coding transcripts, including long non-coding RNAs (lncRNAs), microRNAs (miRNAs), and other functional RNAs, use NR_ accessions, like NR_002578.1 for a human snoRNA, ensuring comprehensive coverage of regulatory and structural RNA elements without protein translation potential.[23][1] These sequences link to underlying genomic loci via coordinates on reference assemblies, facilitating integration with gene models.[3]
Protein sequences in RefSeq, prefixed with NP_, are conceptual translations of NM_ transcripts, offering predicted amino acid chains for functional analysis. For instance, NP_000509.1 corresponds to the beta-globin protein from the HBB transcript, incorporating start-to-stop codon predictions validated against experimental data.[24][1] Functional annotations on these proteins include identification of conserved domains and motifs using the Conserved Domain Database (CDD), which integrates models from sources like Pfam to highlight structural and biochemical features, such as kinase domains in signaling proteins.[3] This annotation process relies on evidence from sequence alignments and literature curation to assign reliability scores, distinguishing reviewed records from provisional ones.[1]
Isoform representation in RefSeq prioritizes canonical transcripts selected for their biological relevance, determined by factors like expression abundance, evolutionary conservation, and supporting evidence from transcriptomic datasets. 
For protein-coding genes, the primary isoform is often the MANE Select transcript, a collaborative standard from NCBI and EMBL-EBI that designates one high-confidence NM_ transcript per human gene locus, covering over 99% of protein-coding genes and matching its Ensembl/GENCODE equivalent exactly for clinical and research consistency.[25] Additional isoforms predicted by the annotation pipeline are represented as model records with XM_ (transcript) and XP_ (protein) accessions, such as XM_003988641.2, which capture splice variants without curated status.[3] Selection criteria emphasize transcripts with robust experimental validation, avoiding speculative models unless tied to genomic predictions.[26]
Detailed annotations on transcript and protein records enhance their utility for functional studies, including delineation of untranslated regions (UTRs) that influence mRNA stability and translation efficiency. The 5' UTR spans from the transcription start site to the coding sequence initiation, while the 3' UTR extends from the stop codon to the polyadenylation site, with poly-A signals (e.g., AAUAAA motifs) explicitly noted to guide post-transcriptional processing.[3] Evidence codes, such as those indicating alignment to INSDC sequences or computational support, accompany these features to denote curation confidence, with MANE Select transcripts particularly flagged for their clinical relevance in human variant interpretation.[25] This layered annotation ensures RefSeq records serve as reliable references for downstream applications like variant effect prediction and comparative genomics.[1]
Curation Process
Data Sources and Integration
RefSeq derives its sequences primarily from submissions to GenBank, the comprehensive nucleotide sequence database maintained by the International Nucleotide Sequence Database Collaboration (INSDC), which includes raw genomic, transcript, and expressed sequence tag (EST) data from global researchers.[4] For protein sequences, RefSeq integrates data from UniProt, particularly the manually curated Swiss-Prot subset, to ensure consistent nomenclature and functional annotations.[27] Additionally, annotations for eukaryotic genomes incorporate contributions from Ensembl and the GENCODE project, which provide high-quality gene models based on computational predictions and experimental evidence.[4] Direct collaborations with sequencing consortia, such as those for microbial genomes (e.g., via the NCBI Prokaryotic Genome Annotation Pipeline (PGAP)) and viral genomes (e.g., through Viral Genome Advisors), supply fully annotated assemblies that form the basis for RefSeq records.[27] The integration process begins with automated alignment of sequences from these diverse sources to the reference genome using tools like BLAST for initial similarity searches and Splign for precise spliced alignment of transcripts to genomic DNA, ensuring accuracy in exon-intron boundaries. NCBI staff then manually curate and harmonize the data, resolving discrepancies through evidence-based selection, such as prioritizing sequences supported by RNA-Seq or proteomics data.[4] This pipeline produces a non-redundant set by selecting representative sequences according to criteria like the longest open reading frame (ORF) for protein-coding genes or the highest level of experimental support, thereby minimizing redundancy while maximizing coverage. 
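The selection of one representative per locus described above can be illustrated with a small Python sketch. This is not NCBI's pipeline: the `Candidate` fields, the numeric `evidence_level` score, and the ordering (evidence first, then ORF length, then accession as an arbitrary deterministic tie-breaker) are all assumptions made for illustration, and the accessions are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate sequence for a locus (all fields are hypothetical)."""
    accession: str       # placeholder source accession
    orf_length: int      # length of the longest open reading frame, in nucleotides
    evidence_level: int  # assumed score; higher means stronger experimental support


def pick_representative(candidates):
    """Return the candidate with the strongest evidence, then the longest ORF.

    The accession string is only used to make ties deterministic.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(
        candidates,
        key=lambda c: (c.evidence_level, c.orf_length, c.accession),
    )


if __name__ == "__main__":
    pool = [
        Candidate("A1", orf_length=900, evidence_level=1),
        Candidate("A2", orf_length=1200, evidence_level=1),
        Candidate("A3", orf_length=600, evidence_level=2),
    ]
    # A3 wins on evidence even though its ORF is the shortest.
    print(pick_representative(pool).accession)
```

Ranking evidence ahead of ORF length is a design choice of this sketch; the text names both criteria without stating which takes precedence.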
RefSeq tracks updates through an Accession.Version identifier (e.g., NM_000518.5 for a messenger RNA transcript): the accession stays constant for the life of the record, while the version number increments whenever the sequence is revised.[28] Each sequence version has also historically been assigned a GenInfo Identifier (GI) number, which, unlike the accession, changes with every sequence update.[28] This approach allows users to access both historical and current versions via tools like the Sequence Revision History, facilitating reproducible research.[28] Post-integration, basic quality checks verify alignment fidelity before advancing to detailed annotation.[4]
Annotation and Quality Control
The annotation of RefSeq records involves integrating experimental evidence with computational predictions to assign biological function and structure to nucleotide and protein sequences, ensuring they represent high-quality, non-redundant references for genomic research.[4] For eukaryotic genomes, the NCBI Eukaryotic Genome Annotation Pipeline (EGAP) processes genomic assemblies by first masking repetitive regions using tools like RepeatMasker, then aligning diverse evidence sources such as RNA-Seq data via STAR, transcript alignments from RefSeq and GenBank using Splign, protein alignments, and long-read transcriptomics with Minimap2. As of August 2025, EGAP version 10.5 includes improvements to the long-read alignment process to handle large numbers of alignments more efficiently.[29] Computational gene prediction is performed using the ab initio modeler Gnomon, which employs hidden Markov models refined by alignments to non-redundant protein databases, while prokaryotic genomes are annotated via the Prokaryotic Genome Annotation Pipeline (PGAP), combining homology-based methods with protein family models (e.g., HMMs and CDD profiles) for identifying coding genes, RNAs, and pseudogenes.[30] These pipelines prioritize evidence from experimental data like cDNA, ESTs, proteomics, and RNA-Seq to generate models that include gene identifiers, locus types (e.g., coding or pseudogene), and functional annotations such as Gene Ontology terms derived from InterProScan.[31][32] RefSeq employs a tiered system to denote annotation quality, reflecting the level of evidence and curation. 
Fully curated or reviewed records, often prefixed with NM_, NR_, or NP_ for transcripts and proteins, are manually validated by NCBI staff using high-confidence experimental data from sources like INSDC submissions, literature, and collaborations (e.g., with the Consensus CDS project for human genes), ensuring optimal sequence representation and functional descriptions.[4][32] Provisionally curated records incorporate mixed evidence, such as initial alignments from RNA-Seq or ESTs combined with computational support, and may undergo limited manual review to resolve ambiguities before promotion to full curation.[31] Predicted records, denoted by XM_, XR_, or XP_ prefixes, rely primarily on computational models like those from Gnomon or PGAP without direct experimental validation, serving as provisional placeholders for further evidence accumulation.[4] This tiering system allows users to assess reliability, with reviewed records representing the highest standard for applications like clinical genomics.[32] Quality control in RefSeq annotation encompasses multiple layers to maintain accuracy and consistency. NCBI staff conduct manual reviews for discrepancies in curated records, particularly for model organisms like humans, where alignments are cross-checked against experimental datasets and nomenclature standards from bodies like HGNC.[32] Automated validation includes filtering low-quality alignments (e.g., those below 85% identity) and assessing completeness using benchmarks like BUSCO, which evaluates ortholog representation and flags assemblies with excessive partial genes or missing loci.[31] Periodic re-annotation occurs every 12-18 months for major genomes, incorporating new evidence and assembly updates while mapping prior annotations via whole-genome alignments to track changes in gene loci.[31][32] Error handling protocols focus on identifying and correcting artifacts to enhance record reliability. 
Common issues like frameshifts in coding sequences are flagged during alignment and prediction steps, with tools like Gnomon compensating by inserting ambiguous bases (Ns) or excluding defective models labeled as "PREDICTED: LOW QUALITY PROTEIN."[31] Chimeric sequences or contaminants are screened through staff review and alignment quality metrics, leading to corrections or exclusions from RefSeq; for instance, prokaryotic annotations in PGAP detect pseudogenes and mobile elements to avoid misrepresenting functional genes.[30] These measures, combined with ongoing pipeline upgrades, ensure RefSeq records remain a trusted resource by minimizing propagation of errors from source data.[4]
Special Projects and Initiatives
RefSeq Select and Functional Annotations
RefSeq Select, launched in 2021, designates a single representative transcript and protein for each protein-coding gene locus to standardize references for comparative genomics, clinical variant reporting, and evolutionary studies. This initiative addresses the challenge of transcript isoform diversity by selecting transcripts based on automated pipelines that prioritize criteria such as high expression levels across tissues (derived from RNA-seq data like GTEx and recount3), evolutionary conservation of coding regions (using PhyloCSF scores), and alignment with experimentally validated isoforms. For human genes, RefSeq Select integrates with the Matched Annotation from NCBI and EMBL-EBI (MANE) project, which provides a consensus transcript set covering over 99% of protein-coding genes as of 2025, with ongoing efforts to cover the remainder by aligning RefSeq and Ensembl/GENCODE annotations to the GRCh38 genome assembly.[26][17]
Selection criteria emphasize biological relevance and evidence support, favoring transcripts with strong experimental validation, such as those from ClinVar submissions or Locus Reference Genomic (LRG) records for clinical utility, while deprioritizing low-expression or poorly conserved isoforms. In cases of multiple viable candidates, the oldest accession is selected as a tie-breaker. This hierarchical approach results in a non-redundant set that reflects the predominant functional isoform for most genes, with MANE Select providing an exact match for human transcripts to facilitate unambiguous variant interpretation.[26][17]
Functional annotations enhance RefSeq Select records by incorporating Gene Ontology (GO) terms, which describe molecular functions, biological processes, and cellular components, computed via InterProScan for eukaryotic proteins starting in 2023. 
Pathway information, such as associations with KEGG pathways, is integrated through links in the associated NCBI Gene records, enabling contextualization of gene roles in metabolic and signaling networks. Disease associations are linked via ClinVar, allowing direct access to variant pathogenicity data, particularly for MANE Select transcripts used in clinical diagnostics. These annotations are derived from curated evidence and high-throughput datasets, ensuring reliability for downstream applications.[33]
The RefSeq Select dataset undergoes annual reviews to incorporate emerging evidence, with updates reflecting advances in transcriptomics and clinical data. For instance, the MANE project includes the MANE Plus Clinical set, which adds representative transcripts for 55 genes where the MANE Select is insufficient to report all pathogenic variants, including cancer-related isoforms to better support oncogenomic analyses. These revisions maintain coverage for human, mouse, and rat genomes while planning extensions to other eukaryotes, ensuring the dataset remains a dynamic resource aligned with broader RefSeq curation standards.[17][34]
Targeted Projects for Organisms
The RefSeq project maintains specialized efforts for bacterial and archaeal genomes, focusing on high-quality, annotated reference sequences derived from complete and whole-genome shotgun assemblies submitted to international nucleotide sequence databases. These prokaryotic RefSeq genomes undergo automated annotation via the Prokaryotic Genome Annotation Pipeline (PGAP), which incorporates expert-curated protein family models to ensure consistency and accuracy across diverse taxa. As of 2023, the collection encompasses over 315,000 bacterial and archaeal genomes, emphasizing representative strains for type materials and enabling comparative genomics studies.[35][36] A key component is the RefSeq Targeted Loci Project, which curates 16S ribosomal RNA sequences for bacteria and archaea, providing a non-redundant dataset aligned to type strains for taxonomic identification and microbial ecology research.[37] Recent initiatives have prioritized rapid annotation for pathogenic species, including expansions in 2022 to incorporate antimicrobial resistance genes through integration with tools like AMRFinderPlus, facilitating surveillance of resistance mechanisms in clinical isolates.[38] In 2025, an updated reference collection selected the highest-quality assemblies for 22,082 prokaryotic species, reducing redundancy and enhancing usability for metagenomic analyses.[39] For eukaryotic organisms, RefSeq supports targeted initiatives in plants and parasites, adapting annotation pipelines to handle complex genomes and facilitate agricultural and medical research. 
In plants, RefSeq provides comprehensive assemblies and annotations for model species such as Arabidopsis thaliana, with the TAIR10.1 genome assembly serving as a foundational resource for gene function studies and comparative plant genomics, including 119.1 Mb across five chromosomes and organelle sequences.[40] These efforts extend to broader plant diversity through the NCBI Eukaryotic Genome Annotation Pipeline, which processes hundreds of plant genomes to generate consistent RefSeq records for transcripts and proteins. For parasites, RefSeq curates sequences from protozoan and helminth species, supporting multi-strain comparisons to elucidate virulence factors and host interactions; for instance, annotations for rodent malaria parasites like Plasmodium yoelii enable synteny mapping and evolutionary analyses across strains.[41][42] Initiatives like WormBase ParaSite integrate with RefSeq to provide access to 274 helminth genomes as of the 19th release, emphasizing non-redundant reference sets for parasitic nematodes and platyhelminths relevant to global health.[43][44] The Viral RefSeq project curates complete or near-complete genomes for viruses, with particular emphasis on segmented genomes such as those of influenza viruses, where individual RNA segments are annotated and organized for surveillance and vaccine development. 
The NCBI Influenza Virus Database maintains RefSeq records for thousands of strains, grouping segments from the same isolate to reconstruct full genomes and track antigenic drift.[45] This approach supports rapid updates for emerging threats; for example, in response to the 2022-2024 mpox outbreaks, RefSeq incorporated multiple monkeypox virus assemblies, including high-coverage genomes like GCA_033539925.1, to aid phylogenetic tracking and diagnostic tool design.[46] Automated assembly processes introduced in 2024 further enhance this by linking segments from diverse viral samples, ensuring timely RefSeq releases for pathogens like influenza and orthopoxviruses.[47]
RefSeq's metazoan expansions target non-model organisms through collaborations with the Genome Reference Consortium (GRC), prioritizing high-contiguity assemblies to fill gaps in animal genomic representation beyond traditional models like human and mouse. These efforts leverage the NCBI Eukaryotic Genome Annotation Pipeline to annotate genomes from diverse metazoans, such as insects and marine invertebrates, supporting ortholog inference across expanded taxa.[48] Recent advancements include scaling ortholog calculations to a broader set of metazoan RefSeq genomes, enabling gene nomenclature consistency and functional predictions for understudied species.[49] For instance, chromosome-level assemblies for non-model metazoans like the sacoglossan sea slug Elysia timida integrate with RefSeq to provide reference sequences that reveal evolutionary adaptations, such as kleptoplasty.[50] This GRC-aligned focus enhances the database's utility for biodiversity genomics and conservation applications.[34]
Access and Tools
Databases and Interfaces
RefSeq records are primarily accessible through the Nucleotide and Protein databases hosted by the National Center for Biotechnology Information (NCBI), where they form a curated subset of sequences. Users can refine searches to RefSeq entries using specific filters in the Entrez search and retrieval system, such as the refseq[filter] criterion, which limits results to non-redundant reference sequences in these databases.[51]
Key web interfaces enhance user interaction with RefSeq data. The Genome Data Viewer (GDV) provides an interactive platform for visualizing RefSeq genomic sequences, allowing exploration of assemblies, transcript alignments, and annotations through graphical representations like tracks and trees.[52] The Gene database offers detailed locus summaries for RefSeq entries, integrating transcript and protein sequences with nomenclature, maps, and functional details for specific genes across organisms.[53]
Entrez supports advanced search queries to access RefSeq content efficiently, such as "refseq[filter] AND human[organism]" to retrieve human reference sequences, with results displaying sequence details, alignments, and phylogenetic trees for evolutionary context.[3]
Additional user features include seamless integration with the Basic Local Alignment Search Tool (BLAST) for performing similarity searches against RefSeq datasets, enabling identification of homologous sequences, and options for batch retrieval via FTP to handle multiple records.[1]
APIs and Download Options
RefSeq data can be accessed programmatically through NCBI's E-utilities, a suite of server-side programs that enable searching, linking, and downloading from the Entrez system of databases, including RefSeq records in the Nuccore (nucleotide) and Protein databases.[54] The ESearch utility allows users to query for specific RefSeq accessions, such as messenger RNA (mRNA) records prefixed with NM_, by constructing RESTful URL calls to the nucleotide database; for instance, a search for human NM_ accessions might use the endpoint https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term="Homo sapiens"[organism] AND NM_[accn], returning a list of unique identifiers (UIDs).[55] Once UIDs are obtained, the EFetch utility retrieves the corresponding records in various formats, such as FASTA for sequences or GenBank flat files for annotated data; an example fetch for multiple NM_ accessions in FASTA format is https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_000546,NM_000547&rettype=fasta&retmode=text.[55]
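The two calls described above can be sketched in Python as a pair of URL builders. This is a minimal illustration rather than an official client: the helper names esearch_url and efetch_url are hypothetical, and only parameters named in the text plus ESearch's retmax are used. The main point it shows is that quotes, spaces, and brackets in a query term need percent-encoding before the URL is sent.

```python
from urllib.parse import urlencode

# Base address of the E-utilities, as given in the example URLs above.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nucleotide", retmax=20):
    """Build an ESearch URL; urlencode percent-encodes the query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(ids, db="nucleotide", rettype="fasta", retmode="text"):
    """Build an EFetch URL for a list of UIDs or accessions."""
    params = {
        "db": db,
        "id": ",".join(ids),
        "rettype": rettype,
        "retmode": retmode,
    }
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    # Same query term and accessions as in the text above.
    print(esearch_url('"Homo sapiens"[organism] AND NM_[accn]'))
    print(efetch_url(["NM_000546", "NM_000547"]))
```

The resulting strings can be passed to any HTTP client, for example urllib.request.urlopen from the standard library.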
For bulk metadata retrieval, the NCBI Datasets API provides an alternative to E-utilities, supporting queries for Gene records linked to RefSeq accessions and returning structured JSON responses that include details like transcript and protein associations. This API is particularly useful for large-scale operations, as demonstrated by retrieving metadata for multiple RefSeq mRNA accessions (e.g., NM_001) in a single request, which outputs parseable objects for integration into workflows without needing to parse XML from E-utilities.[56]
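A request of this kind might look like the following sketch. The base URL and the /gene/accession/ path are assumptions about the Datasets v2 REST interface and should be checked against the current API reference; the helper names are hypothetical, and the shape of the JSON response is deliberately not assumed here.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# ASSUMPTION: base URL and path layout of the Datasets v2 REST API.
DATASETS_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def gene_by_accession_url(accessions):
    """Build one gene-metadata request URL covering several RefSeq accessions."""
    joined = ",".join(quote(acc, safe="") for acc in accessions)
    return f"{DATASETS_BASE}/gene/accession/{joined}"


def fetch_json(url, timeout=30):
    """Request a URL and decode the body as JSON (performs a network call)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Printing the URL only; call fetch_json(url) to actually query the service.
    print(gene_by_accession_url(["NM_000546", "NM_000547"]))
```

Batching several accessions into one URL keeps the number of requests low, which is the practical advantage over issuing one call per record.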
Bulk downloads of RefSeq data are available via anonymous FTP at ftp://ftp.ncbi.nlm.nih.gov/refseq/, organized into directories for complete bi-monthly releases (e.g., release/232/), daily updates (daily/), and supplemental files (supplemental/), with species-specific subdirectories for genomes like H_sapiens/.[1] These directories include release notes detailing changes, MD5 checksum files for verifying download integrity, and catalogs listing available records to facilitate selective retrieval.[3]
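Checking a download against its MD5 checksum could be done along these lines. The sketch assumes the checksum file uses the common md5sum layout of a hex digest followed by a file name on each line, which should be confirmed against the actual release files; all function names are hypothetical.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_lines(text):
    """Parse '<hex digest>  <file name>' lines into a {file name: digest} dict.

    ASSUMPTION: md5sum-style layout; adjust if the release uses another one.
    """
    checksums = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            checksums[name.strip().lstrip("*")] = digest.lower()
    return checksums


def verify_files(directory, checksums):
    """Return {file name: True/False}; missing files count as failures."""
    results = {}
    for name, expected in checksums.items():
        path = os.path.join(directory, name)
        results[name] = os.path.exists(path) and md5_of_file(path) == expected
    return results
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte release files.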
RefSeq files are provided in multiple formats to suit different use cases: FASTA for raw nucleotide or protein sequences, GenBank flat files for records with full annotations including features and references, and ASN.1 for structured, machine-readable representations of the data that support advanced parsing.[3] Users can download entire releases or subsets, such as all human transcripts, by navigating to appropriate paths like release/release-catalog/ for indices.[57]
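Of these formats, FASTA is the simplest to read without a dedicated library. The following is a minimal reader, offered as a sketch: read_fasta is a hypothetical helper, and the sequences in the usage example are short placeholders, not real RefSeq data.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    The header is returned without its leading '>'; sequence lines that
    belong to one record are joined into a single string.
    """
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Placeholder records for illustration only.
    demo = [">NM_000546.6 placeholder", "ACGT", "TTGA", ">NM_000547.6 placeholder", "GG"]
    for header, sequence in read_fasta(demo):
        print(header.split()[0], len(sequence))
```

Because the function accepts any iterable of lines, it works equally on an open file handle or on text already held in memory. GenBank flat files and ASN.1 carry nested annotation and are better handled with an established parser.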
To ensure data consistency, downloads should handle versioning carefully. RefSeq accessions include a version suffix (e.g., NM_000123.5) that increments when the sequence changes, so users should retrieve version-specific files from a single release directory to avoid mixing outdated records, and always verify downloads against the provided checksums.[3] Web interfaces like the Nucleotide database serve as initial entry points for identifying accessions before programmatic access.
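One way to catch mixed versions in a working set is to split each accession into its base and version and flag any base that appears more than once. The sketch below assumes accessions follow the two-letter prefix, underscore, identifier, dot, version pattern seen in the examples above; the regular expression and function names are illustrative, not an NCBI specification.

```python
import re

# ASSUMPTION: accession = two capital letters, '_', identifier, '.', version.
ACCESSION_RE = re.compile(r"^([A-Z]{2}_[A-Z0-9]+)\.(\d+)$")


def split_accession(versioned):
    """Split 'NM_000123.5' into ('NM_000123', 5)."""
    match = ACCESSION_RE.match(versioned)
    if match is None:
        raise ValueError(f"not a versioned accession: {versioned!r}")
    return match.group(1), int(match.group(2))


def mixed_versions(accessions):
    """Return {base: sorted versions} for bases seen with more than one version."""
    seen = {}
    for accession in accessions:
        base, version = split_accession(accession)
        seen.setdefault(base, set()).add(version)
    return {base: sorted(v) for base, v in seen.items() if len(v) > 1}


if __name__ == "__main__":
    print(mixed_versions(["NM_000123.4", "NM_000123.5", "NP_000537.3"]))
```

An empty result means every accession in the set appears with a single version, which is what one would expect when all files come from the same release directory.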
Statistics and Updates
Record Counts and Growth
As of RefSeq Release 232, available on September 2, 2025, the database contains 558,426,495 records in total.[58] This includes 56,702,917 genomic records, 74,202,490 transcript and RNA records, and 427,129,536 protein records, spanning sequences from 170,401 organisms.[58] These figures reflect the non-redundant, curated nature of RefSeq, with genomic records often representing complete assemblies and transcripts derived from high-quality annotations.[1]
In terms of organism distribution, approximately 59% of represented organisms are bacterial (99,961), 32% are eukaryotic (about 53,880 across vertebrates, fungi, plants, invertebrates, and protozoa), and 9% are viral (14,610).[58] For eukaryotic examples, the human (Homo sapiens) dataset includes around 59,792 genes, encompassing roughly 20,000 protein-coding genes and additional non-coding elements.[59] Bacterial and viral records dominate the overall counts, due to the high volume of prokaryotic and viral sequencing data integrated into RefSeq.[34]
RefSeq has exhibited robust growth, with total records increasing by an average of 15-20% annually since 2020, rising from approximately 223 million in January 2020 to over 558 million by September 2025.[60] This expansion, totaling about 335 million new records in under six years, has been accelerated by surges in high-throughput sequencing, including integration of thousands of metagenome-assembled prokaryotic genomes since 2023.[35] From Release 231 (July 2025) to Release 232, records grew by 11.25 million, primarily in protein and genomic categories.[58]
Update frequencies vary by category, with prokaryotic and viral datasets seeing more frequent additions compared to eukaryotic transcripts, which prioritize quality refinements over volume.[34] The following table summarizes key metrics for Release 232.[34]
| Category | Count (Release 232) | Example Growth Driver |
|---|---|---|
| Total Records | 558,426,495 | Metagenomics integration since 2023 |
| Genomic | 56,702,917 | Prokaryotic assemblies |
| Transcripts/RNA | 74,202,490 | Eukaryotic annotations |
| Proteins | 427,129,536 | Bacterial and viral sequences |
| Organisms | 170,401 | +3,179 since Release 231 |
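
The growth rate and organism shares quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only figures stated above; treating the start value as exactly 223 million and the interval as five years and eight months (January 2020 to September 2025) are simplifying assumptions.

```python
# Figures taken from the text above.
start_records = 223_000_000      # approximate total, January 2020
end_records = 558_426_495        # Release 232, September 2025
years = 5 + 8 / 12               # ASSUMPTION: January 2020 to September 2025

# Compound annual growth rate: (end / start) ** (1 / years) - 1
cagr = (end_records / start_records) ** (1 / years) - 1

# Organism counts by group, with shares taken over the 170,401 total.
organisms_total = 170_401
organism_counts = {"bacteria": 99_961, "eukaryotes": 53_880, "viruses": 14_610}
shares = {name: count / organisms_total for name, count in organism_counts.items()}

# Relative growth from Release 231 to Release 232 (11.25 million new records).
release_growth = 11_250_000 / (end_records - 11_250_000)

if __name__ == "__main__":
    print(f"Compound annual growth: {cagr:.1%}")
    for name, share in shares.items():
        print(f"{name}: {share:.0%}")
    print(f"Release 231 to 232: {release_growth:.1%}")
```

Under these assumptions the compound rate falls inside the 15-20% range given in the text, and the three organism shares round to 59%, 32%, and 9%.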