RefSeq

RefSeq is a curated, non-redundant database of high-quality reference sequences for genomes, transcripts, and proteins, maintained by the National Center for Biotechnology Information (NCBI) as part of the U.S. National Library of Medicine.
It serves as a stable standard for biological research, annotation, and comparative studies by providing integrated, well-annotated data derived from public sequence repositories.
Initiated over 25 years ago, in 1999, RefSeq was developed to address the need for reliable, non-redundant sequence standards amid the growth of genomic data, evolving from initial efforts to annotate model organisms into a comprehensive resource spanning diverse taxa.
Unlike archival databases such as GenBank, which accept redundant submissions from the public, RefSeq undergoes rigorous curation, including automated pipelines and manual review by NCBI experts, to ensure accuracy, consistency, and minimal contamination, with records featuring unique identifiers and evidence-based annotations.
As of Release 232 in September 2025, RefSeq encompasses sequences from over 170,000 organisms, including more than 427 million proteins and 74 million transcripts, covering eukaryotes, prokaryotes, viruses, organelles, and targeted loci.

Overview

Definition and Scope

RefSeq is a curated, non-redundant collection of genomic DNA, transcript (RNA), and protein sequences maintained by the National Center for Biotechnology Information (NCBI) as a stable reference for biological research. It provides a comprehensive, integrated set of well-annotated sequences derived from publicly available data, emphasizing quality and consistency across diverse organisms including archaea, bacteria, eukaryotes, and viruses. The scope of RefSeq is deliberately limited to representative, high-quality sequences that are experimentally validated or computationally predicted with strong evidence, excluding redundant, uncurated, or raw submissions such as those entered directly into GenBank. This focus lets RefSeq serve as a baseline for functional annotation and diversity studies without incorporating every possible variant or unverified record.

In contrast to broader archival databases like GenBank, RefSeq sequences undergo rigorous curation to maintain non-redundancy and are assigned stable, versioned identifiers (such as the NC_ prefix for chromosomal sequences) to support reliable tracking and updates over time. Annotation in RefSeq follows core principles of accuracy and evidence-based validation, combining automated computational methods, collaborative input from experts, and manual curation to link nucleotide and protein sequences while reflecting the latest biological knowledge. These processes prioritize sequences with direct experimental support where possible, supplemented by predictive models for gaps, ensuring the database's utility in medical, functional, and evolutionary analyses. RefSeq is integrated within the NCBI ecosystem, so its records can be explored alongside other NCBI resources.

Purpose and Significance

RefSeq serves as a foundational resource by providing a stable, curated set of reference sequences that support key research activities, including genome annotation and functional studies. These sequences offer a reliable baseline for annotating newly sequenced genomes, predicting gene structures from transcriptomic data, and comparing genetic variations across species, thereby enabling researchers to draw accurate biological inferences. For instance, RefSeq's integration of genomic, transcript, and protein data facilitates the identification of orthologous genes and the elucidation of evolutionary relationships, which are essential for understanding gene function and regulation.

The database plays a critical role in standardizing and minimizing redundancy in sequence data, which is vital for applications such as variant calling. By curating representative sequences with consistent naming conventions derived from authoritative sources like the HUGO Gene Nomenclature Committee (HGNC), RefSeq ensures that researchers use uniform identifiers, reducing errors in downstream analyses like mapping mutations to specific loci. This non-redundant approach streamlines workflows by providing high-quality protein models linked to functional annotations, allowing for precise identification of protein isoforms and interactions without the clutter of duplicate entries from primary databases.

RefSeq's impact extends across multiple fields, enhancing the accuracy of next-generation sequencing alignments, connecting sequences to functions, and promoting international collaborations via shared reference standards. In high-throughput sequencing projects, it enables precise read mapping for clinical and research purposes. In therapeutic research, RefSeq sequences connect genetic data to known functions, aiding validation and therapeutic design, as seen in studies of disease-associated genes. Additionally, its role in global initiatives, such as collaborations with Ensembl and other databases, accelerates discoveries relevant to human health.

A key challenge RefSeq addresses is the resolution of ambiguities inherent in comprehensive repositories like GenBank, where multiple submissions for the same locus or isoform can lead to inconsistencies. Through careful selection of representative sequences, such as prioritizing experimentally validated or computationally predicted models, RefSeq curates a coherent dataset that clarifies these issues, ensuring researchers have access to the most accurate and up-to-date references for their work. This curation resolves discrepancies by integrating and validating data from diverse sources, thereby enhancing the reliability of genomic interpretations.

History and Development

Establishment and Early Phases

The Reference Sequence (RefSeq) database was established in 1999 by the National Center for Biotechnology Information (NCBI) as part of the International Human Genome Project, aiming to provide a curated, non-redundant set of reference sequences for the human genome to facilitate genomic analysis and annotation. This initiative addressed the growing complexity of sequence data in public repositories like GenBank by offering stable, integrated standards for mRNAs and proteins, initially focused exclusively on human sequences to support the project's goal of mapping and sequencing the human genome.

During its early phases from 2000 to 2005, RefSeq expanded its scope to include model organisms such as Saccharomyces cerevisiae, Drosophila melanogaster, and Arabidopsis thaliana, while prioritizing human data. The database began with chromosomal sequences designated by the NC_ prefix for complete genome assemblies, followed by rapid inclusion of transcript records (NM_) and corresponding protein sequences (NP_), which by mid-2003 encompassed over 211,000 transcripts and 785,000 proteins, primarily from eukaryotic sources. Key decisions included adopting an evidence-based annotation approach that integrated submitter data from GenBank with manual curation and computational predictions, rather than relying solely on automated methods, to ensure accuracy and biological relevance. Additionally, RefSeq was tightly integrated with LocusLink, the precursor to the current Entrez Gene database, to enable gene-centered access and linking of sequences to loci.

Initial development faced significant challenges, including limited computational resources for handling the influx of large-scale sequencing data and heavy dependence on annotations provided by submitters, which required rigorous validation to maintain non-redundancy and quality. These efforts laid the foundation for RefSeq as a reliable reference amid the rapid growth of genomic databases during that era.

Key Milestones and Expansions

In 2007, the RefSeq project introduced RefSeqGene records with the NG_ accession prefix, providing stable genomic sequences for specific gene loci that span multiple transcripts and support standardized reporting of sequence variants, particularly for clinical applications.

During the 2010s, RefSeq significantly expanded its scope to include non-coding RNA records (NR_ prefix) for better representation of functional transcripts beyond protein-coding genes. The database also incorporated comprehensive genome sequences, shifting from curated reference models to include both reference and representative genomes to address diverse community needs. Bacterial genome annotations grew rapidly, with the prokaryotic RefSeq collection reaching nearly 200,000 genomes by 2020 through integration of high-quality assemblies and non-redundant protein sets. These expansions facilitated incorporation of human genetic variation data from large-scale population sequencing efforts into RefSeq-linked resources such as dbSNP.

In the 2020s, RefSeq enhanced its annotation processes through improved automation and computational pipelines, enabling more efficient curation of eukaryotic and prokaryotic sequences using supporting evidence and advanced quality controls. The project extended to metagenome-assembled genomes (MAGs), incorporating over 17,000 such prokaryotic entries as of 2023 (within a total of over 350,000 prokaryotic genomes by 2024 and more than 400,000 by 2025). Post-2020 collaborations with Ensembl/GENCODE under the MANE initiative aligned RefSeq transcripts with GENCODE models, producing matched annotation sets for nearly all human protein-coding genes to standardize clinical and research use. The COVID-19 pandemic prompted accelerated development of viral RefSeq entries, including the SARS-CoV-2 reference genome (NC_045512.2) released in 2020, which served as a foundational reference for global research and variant tracking. These updates also advanced curation improvements, such as refined quality metrics for genome assemblies.

Growth continued through 2025: Release 232 (September 2025) encompassed sequences from over 170,000 organisms, underscoring RefSeq's ongoing role in supporting genomic research.

Sequence Categories

Genomic and Gene Sequences

RefSeq genomic sequences form the foundational DNA-level records in the database, representing assembled chromosomes, scaffolds, and contigs from various organisms. These records are designated with specific accession prefixes to indicate their scope and assembly status. The NC_ prefix denotes complete genomic molecules, such as fully assembled nuclear chromosomes, organelle genomes, bacterial chromosomes, viral genomes, or plasmids, derived from high-quality submissions to the International Nucleotide Sequence Database Collaboration (INSDC). In contrast, the NW_ prefix is used for working draft genomic sequences, primarily whole genome shotgun (WGS) contigs or scaffolds from draft assemblies, providing intermediate representations of genomic regions that may later contribute to higher-quality assemblies. These genomic records serve as stable references for mapping, with annotations generated through NCBI's eukaryotic or prokaryotic genome annotation pipelines.

Gene sequences in RefSeq are captured as curated genomic regions under the NG_ accession prefix, encompassing specific loci that include the full gene along with its introns and surrounding regulatory elements. These records are manually curated by NCBI staff or collaborators to support detailed annotation of genes, haplotypes, paralogs, or non-transcribed pseudogenes, often as part of the RefSeqGene project for human and select model organisms. Unlike broader genomic assemblies, NG_ records focus on targeted regions to facilitate precise gene-level analysis and integration with clinical or functional data.

Annotations on these genomic and gene sequences emphasize structural and functional details at the DNA level, including gene models that delineate exon-intron boundaries, promoter regions, and links to regulatory elements such as CpG islands. These features are derived from evidence-based predictions, using gene-prediction tools for coding genes and cmsearch for non-coding elements, and are aligned with experimental data where available. For instance, the human chromosome 1 record (NC_000001.11) includes thousands of annotated genes with defined boundaries and regulatory annotations, reflecting updates from the GRCh38.p14 assembly. Versioning in RefSeq accessions, such as the ".11" suffix, tracks successive updates to the sequence, maintaining historical stability while incorporating new evidence; the base accession (e.g., NC_000001) remains unchanged across versions. An example of an NG_ record is NG_008407.1, which covers a curated genomic region for a specific gene, such as PRPS1, including intronic sequences and proximal regulatory elements. These genomic and gene sequences are integrated with transcript records to form complete gene models, enabling downstream derivation of transcript and protein predictions.
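The accession.version convention described above can be illustrated with a small parser. This is a minimal sketch, not an NCBI tool: the function names `parse_accession` and `same_record` are hypothetical, the prefix table covers only the three genomic prefixes discussed in this section, and the pattern assumes the simple two-letter, underscore, digits, dot, version layout (other accession layouts would be rejected).

```python
import re
from typing import NamedTuple

# Genomic prefixes as described in this section (illustrative, not exhaustive).
GENOMIC_PREFIXES = {
    "NC": "complete genomic molecule",
    "NW": "WGS contig or scaffold from a draft assembly",
    "NG": "curated genomic region (e.g. RefSeqGene)",
}

class Accession(NamedTuple):
    prefix: str   # e.g. "NC"
    base: str     # stable part, e.g. "NC_000001"
    version: int  # integer after the dot, e.g. 11

# Assumed layout: two capital letters, underscore, digits, dot, version.
_PATTERN = re.compile(r"^([A-Z]{2})_(\d+)\.(\d+)$")

def parse_accession(text: str) -> Accession:
    """Split an accession.version string into prefix, base accession, and version."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not in the assumed accession.version layout: {text!r}")
    prefix, digits, version = match.groups()
    return Accession(prefix, f"{prefix}_{digits}", int(version))

def same_record(a: str, b: str) -> bool:
    """True when two identifiers share a base accession, whatever their versions."""
    return parse_accession(a).base == parse_accession(b).base

if __name__ == "__main__":
    acc = parse_accession("NC_000001.11")
    print(acc, "->", GENOMIC_PREFIXES.get(acc.prefix, "not a genomic prefix"))
```

Keeping the base accession and the version as separate fields makes it straightforward to group all versions of one record while still pinning an analysis to a specific version.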

Transcript and Protein Sequences

RefSeq transcript sequences represent the expressed products derived from curated genomic loci, providing non-redundant, high-quality representations of messenger RNAs (mRNAs) and non-coding RNAs (ncRNAs). Protein-coding transcripts are designated with NM_ accessions, such as NM_000518.5 for the HBB gene, and are derived from cDNA sequences in International Nucleotide Sequence Database Collaboration (INSDC) submissions or from direct genomic inference. Non-coding transcripts, including long non-coding RNAs (lncRNAs), microRNAs (miRNAs), and other functional RNAs, use NR_ accessions, like NR_002578.1 for a snoRNA, ensuring comprehensive coverage of regulatory and structural elements without protein translation potential. These sequences link to underlying genomic loci via coordinates on reference assemblies, facilitating integration with gene models.

Protein sequences in RefSeq, prefixed with NP_, are conceptual translations of NM_ transcripts, offering predicted amino acid sequences. For instance, NP_000509.1 corresponds to the beta-globin protein from the HBB transcript, incorporating start-to-stop codon predictions validated against experimental data. Functional annotations on these proteins include identification of conserved domains and motifs using the Conserved Domain Database (CDD), which integrates models from sources like Pfam to highlight structural and biochemical features, such as domains in signaling proteins. This process relies on evidence from sequence alignments and literature curation to assign a curation status, distinguishing reviewed records from provisional ones.

Isoform representation in RefSeq prioritizes canonical transcripts selected for their biological relevance, determined by factors like expression abundance, evolutionary conservation, and supporting evidence from transcriptomic datasets. For protein-coding genes, the primary isoform is often the MANE Select transcript, a collaborative standard from NCBI and EMBL-EBI that designates one high-confidence NM_ per locus, covering over 99% of protein-coding genes and aligned identically to Ensembl/GENCODE equivalents for clinical and research consistency. Alternative isoforms without curated status are modeled under XM_ (transcript) and XP_ (protein) accessions, such as XM_003988641.2, allowing splice variants to be represented. Selection criteria emphasize transcripts with robust experimental validation, avoiding speculative models unless tied to genomic predictions.

Detailed annotations on transcript and protein records enhance their utility for functional studies, including delineation of untranslated regions (UTRs) that influence mRNA stability and translation efficiency. The 5' UTR spans from the transcription start site to the coding sequence initiation, while the 3' UTR extends from the stop codon to the polyadenylation site, with poly-A signals (e.g., AAUAAA motifs) explicitly noted to guide post-transcriptional processing. Evidence codes, such as those indicating alignment to INSDC sequences or computational support, accompany these features to denote curation confidence, with MANE Select transcripts particularly flagged for their clinical relevance in human variant interpretation. This layered annotation ensures RefSeq records serve as reliable references for downstream applications like variant effect prediction.
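The UTR layout described above can be made concrete with a short sketch that splits a transcript into 5' UTR, coding sequence, and 3' UTR, then looks for a poly-A signal. Everything here is illustrative: the function names are hypothetical, coordinates are assumed to be 0-based and end-exclusive, and the sequence is assumed to use the DNA alphabet (so the AAUAAA motif is searched as AATAAA).

```python
from typing import Dict, Optional

def split_transcript(seq: str, cds_start: int, cds_end: int) -> Dict[str, str]:
    """Split a transcript into 5' UTR, CDS, and 3' UTR.

    cds_start and cds_end are 0-based, end-exclusive; the CDS is taken to
    run from the start codon through the stop codon inclusive.
    """
    if not (0 <= cds_start < cds_end <= len(seq)):
        raise ValueError("CDS coordinates fall outside the transcript")
    if (cds_end - cds_start) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    return {
        "five_prime_utr": seq[:cds_start],
        "cds": seq[cds_start:cds_end],
        "three_prime_utr": seq[cds_end:],
    }

def find_polya_signal(three_prime_utr: str, motif: str = "AATAAA") -> Optional[int]:
    """Return the offset of the last occurrence of the motif in the 3' UTR, or None."""
    index = three_prime_utr.upper().rfind(motif)
    return index if index >= 0 else None

if __name__ == "__main__":
    # Toy transcript: 3-base 5' UTR, 9-base CDS, 10-base 3' UTR.
    toy = "GGC" + "ATGAAATAA" + "CCAATAAAGG"
    parts = split_transcript(toy, cds_start=3, cds_end=12)
    print(parts)
    print(find_polya_signal(parts["three_prime_utr"]))
```

The last occurrence of the motif is reported because the signal that guides cleavage typically lies near the 3' end; a real annotation would rely on the feature table of the record instead of a motif search.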

Curation Process

Data Sources and Integration

RefSeq derives its sequences primarily from submissions to GenBank, the comprehensive nucleotide sequence database that is part of the International Nucleotide Sequence Database Collaboration (INSDC), which includes raw genomic, transcript, and expressed sequence tag (EST) data from global researchers. For protein sequences, RefSeq integrates data from UniProt, particularly the manually curated Swiss-Prot subset, to ensure consistent nomenclature and functional annotations. Additionally, annotations for eukaryotic genomes incorporate contributions from Ensembl and the GENCODE project, which provide high-quality gene models based on computational predictions and experimental evidence. Direct collaborations with sequencing consortia, such as those for microbial genomes (e.g., via the NCBI Prokaryotic Genome Annotation Pipeline (PGAP)) and viral genomes (e.g., through Viral Genome Advisors), supply fully annotated assemblies that form the basis for RefSeq records.

The integration process begins with automated alignment of sequences from these diverse sources to the reference genome using tools like BLAST for initial similarity searches and Splign for precise spliced alignment of transcripts to genomic DNA, ensuring accuracy in exon-intron boundaries. NCBI staff then manually curate and harmonize the data, resolving discrepancies through evidence-based selection, such as prioritizing sequences with experimental support. This pipeline produces a non-redundant set by selecting representative sequences according to criteria like the longest open reading frame (ORF) for protein-coding genes or the highest level of experimental support, thereby minimizing redundancy while maximizing coverage.

RefSeq has used two identifiers to track updates: a GenInfo Identifier (GI) number, assigned anew to each sequence version, and an accession.version identifier (e.g., NM_001.5 for a transcript), in which the accession stays stable and the version number increments when the sequence changes. This approach allows users to access both historical and current versions via tools like the Sequence Revision History, facilitating reproducible analyses. Post-integration, basic quality checks verify alignment fidelity before advancing to detailed annotation.
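One of the selection criteria mentioned above, the longest open reading frame, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the NCBI pipeline: it scans only the forward strand, treats an ORF as ATG through the first in-frame stop codon, and the names `longest_orf` and `pick_representative` are hypothetical.

```python
from typing import Dict, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq: str) -> Tuple[int, int]:
    """Return (start, end) of the longest forward-strand ORF.

    Coordinates are 0-based and end-exclusive, with the stop codon included.
    Returns (0, 0) when no complete ATG...stop ORF exists.
    """
    seq = seq.upper()
    best = (0, 0)
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # close the ORF and look for the next ATG
    return best

def orf_length(seq: str) -> int:
    start, end = longest_orf(seq)
    return end - start

def pick_representative(candidates: Dict[str, str]) -> str:
    """Pick the candidate whose sequence holds the longest ORF.

    Names are visited in sorted order, so ties go to the alphabetically first name.
    """
    if not candidates:
        raise ValueError("no candidates supplied")
    return max(sorted(candidates), key=lambda name: orf_length(candidates[name]))

if __name__ == "__main__":
    toy = {"a": "ATGAAATAA", "b": "CCATGAAACCCGGGTAGTT"}
    print(pick_representative(toy), longest_orf(toy["b"]))
```

In practice the longest ORF is only one signal among several, and the text above notes that experimental support can take precedence over it.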

Annotation and Quality Control

The annotation of RefSeq records involves integrating experimental evidence with computational predictions to assign biological function and structure to nucleotide and protein sequences, ensuring they represent high-quality, non-redundant references for genomic research. For eukaryotic genomes, the NCBI Eukaryotic Genome Annotation Pipeline (EGAP) processes genomic assemblies by first masking repetitive regions using tools like RepeatMasker, then aligning diverse evidence sources such as RNA-Seq data via STAR, transcript alignments from RefSeq and GenBank using Splign, protein alignments, and long-read transcriptomics with Minimap2. As of August 2025, EGAP version 10.5 includes improvements to the long-read alignment process to handle large numbers of alignments more efficiently. Computational gene prediction is performed using the ab initio modeler Gnomon, which employs hidden Markov models refined by alignments to non-redundant protein databases, while prokaryotic genomes are annotated via the Prokaryotic Genome Annotation Pipeline (PGAP), combining homology-based methods with protein family models (e.g., HMMs and CDD profiles) for identifying coding genes, RNAs, and pseudogenes. These pipelines prioritize evidence from experimental data like cDNA, ESTs, proteomics, and RNA-Seq to generate models that include gene identifiers, locus types (e.g., coding or pseudogene), and functional annotations such as Gene Ontology terms derived from InterProScan.

RefSeq employs a tiered system to denote annotation quality, reflecting the level of evidence and curation. Fully curated or reviewed records, often prefixed with NM_, NR_, or NP_ for transcripts and proteins, are manually validated by NCBI staff using high-confidence experimental data from sources like INSDC submissions and collaborations (for example, on human genes), ensuring optimal sequence representation and functional descriptions. Provisionally curated records incorporate mixed evidence, such as initial transcript or EST alignments combined with computational support, and may undergo limited manual review to resolve ambiguities before promotion to full curation. Predicted records, denoted by XM_, XR_, or XP_ prefixes, rely primarily on computational models like those from Gnomon or PGAP without direct experimental validation, serving as provisional placeholders until further evidence accumulates. This tiering system allows users to assess reliability, with reviewed records representing the highest standard for clinical applications.

Quality control in RefSeq annotation encompasses multiple layers to maintain accuracy and consistency. NCBI staff conduct manual reviews for discrepancies in curated records, particularly for model organisms and human, where alignments are cross-checked against experimental datasets and nomenclature standards from bodies like HGNC. Automated validation includes filtering low-quality alignments (e.g., those below 85% identity) and assessing completeness using benchmarks like BUSCO, which evaluates ortholog representation and flags assemblies with excessive partial genes or missing loci. Periodic re-annotation occurs every 12-18 months for major genomes, incorporating new evidence and assembly updates while mapping prior annotations via whole-genome alignments to track changes in gene loci.

Error handling protocols focus on identifying and correcting artifacts to enhance record reliability. Common issues like frameshifts in coding sequences are flagged during alignment and prediction steps, with tools like Gnomon compensating by inserting ambiguous bases (Ns) or excluding defective models labeled as "PREDICTED: LOW QUALITY PROTEIN." Chimeric sequences or contaminants are screened through staff review and alignment quality metrics, leading to corrections or exclusions from RefSeq; for instance, prokaryotic annotations in PGAP detect pseudogenes and mobile elements to avoid misrepresenting functional genes. These measures, combined with ongoing pipeline upgrades, ensure RefSeq records remain a trusted resource by minimizing propagation of errors from source data.
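The prefix-based tiers and the identity cutoff described in this section lend themselves to a short sketch. The code below is a hedged illustration only: `record_tier`, `percent_identity`, and `filter_alignments` are hypothetical helpers, identity is assumed to be matches divided by aligned length, and the 85% default simply mirrors the example threshold quoted above.

```python
from typing import Iterable, List, Tuple

CURATED_PREFIXES = {"NM_", "NR_", "NP_"}  # curated transcripts and proteins
MODEL_PREFIXES = {"XM_", "XR_", "XP_"}    # computationally predicted models

def record_tier(accession: str) -> str:
    """Classify an accession as 'curated', 'model', or 'other' by its prefix."""
    prefix = accession[:3]
    if prefix in CURATED_PREFIXES:
        return "curated"
    if prefix in MODEL_PREFIXES:
        return "model"
    return "other"

def percent_identity(matches: int, aligned_length: int) -> float:
    """Identity as a percentage of aligned positions (assumed definition)."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return 100.0 * matches / aligned_length

def filter_alignments(
    alignments: Iterable[Tuple[str, int, int]],
    min_identity: float = 85.0,
) -> List[str]:
    """Keep the names of alignments at or above the identity cutoff.

    Each alignment is a (name, matches, aligned_length) tuple.
    """
    return [
        name
        for name, matches, length in alignments
        if percent_identity(matches, length) >= min_identity
    ]

if __name__ == "__main__":
    for acc in ("NM_000518.5", "XM_003988641.2", "NC_000001.11"):
        print(acc, record_tier(acc))
    print(filter_alignments([("a", 90, 100), ("b", 84, 100), ("c", 85, 100)]))
```

A prefix check only separates curated from model accessions; the finer reviewed versus provisional distinction is recorded in the record itself, which this sketch does not read.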

Special Projects and Initiatives

RefSeq Select and Functional Annotations

RefSeq Select, launched in 2021, designates a single representative transcript and protein for each protein-coding locus to standardize references for clinical variant reporting and evolutionary studies. This initiative addresses the challenge of transcript isoform diversity by selecting transcripts based on automated pipelines that prioritize criteria such as high expression levels across tissues (derived from datasets like GTEx and recount3), evolutionary conservation of coding regions (using PhyloCSF scores), and alignment with experimentally validated isoforms. For human genes, RefSeq Select integrates with the Matched Annotation from NCBI and EMBL-EBI (MANE) project, which provides a consensus transcript set covering over 99% of protein-coding genes as of 2025, with ongoing efforts to close the remaining gaps by aligning RefSeq and Ensembl/GENCODE annotations to the GRCh38 genome assembly.

Selection criteria emphasize biological relevance and evidence support, favoring transcripts with strong experimental validation, such as those from ClinVar submissions or Locus Reference Genomic (LRG) records for clinical utility, while deprioritizing low-expression or poorly conserved isoforms. In cases of multiple viable candidates, the oldest accession is selected as a tie-breaker. This hierarchical approach results in a non-redundant set that reflects the predominant functional isoform for most genes, with MANE Select providing an exact match for human transcripts to facilitate unambiguous variant interpretation.

Functional annotations enhance RefSeq Select records by incorporating Gene Ontology (GO) terms, which describe molecular functions, biological processes, and cellular components, computed via InterProScan for eukaryotic proteins starting in 2023. Pathway information is integrated through links in the associated NCBI Gene records, enabling contextualization of gene roles in metabolic and signaling networks. Disease associations are linked via ClinVar, allowing direct access to variant pathogenicity data, particularly for MANE Select transcripts used in clinical diagnostics. These annotations are derived from curated evidence and high-throughput datasets, ensuring reliability for downstream applications.

The RefSeq Select dataset undergoes annual reviews to incorporate emerging evidence, with updates reflecting advances in transcriptomics and clinical data. For instance, the MANE project includes the MANE Plus Clinical set, which adds representative transcripts for 55 genes where the MANE Select is insufficient to report all pathogenic variants, including cancer-related isoforms to better support oncogenomic analyses. These revisions maintain coverage for the currently supported genomes while planning extensions to other eukaryotes, ensuring the dataset remains a dynamic resource aligned with broader RefSeq curation standards.
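The hierarchical selection described in this section, including the oldest-accession tie-breaker, can be pictured as a sort key. The sketch below is an assumption-laden illustration and not the RefSeq Select pipeline: the `Candidate` fields and their ordering are hypothetical, the scores are placeholders for real expression and conservation measures, and "oldest accession" is approximated by the smallest numeric part of the accession.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Candidate:
    accession: str          # accession.version string, e.g. "NM_000518.5"
    expression: float       # hypothetical normalised expression score
    conservation: float     # hypothetical coding-region conservation score
    clinical_support: bool  # hypothetical flag for clinical evidence

def accession_number(accession: str) -> int:
    """Numeric part of the accession, used here as a stand-in for record age."""
    return int(accession.split("_", 1)[1].split(".", 1)[0])

def choose_select(candidates: List[Candidate]) -> Candidate:
    """Pick one representative transcript per locus.

    Assumed priority: clinical support, then expression, then conservation,
    then the smallest accession number as the final tie-breaker.
    """
    if not candidates:
        raise ValueError("no candidates supplied")
    return min(
        candidates,
        key=lambda c: (
            not c.clinical_support,  # False sorts first, so supported wins
            -c.expression,
            -c.conservation,
            accession_number(c.accession),
        ),
    )

if __name__ == "__main__":
    # Made-up accessions and scores, for illustration only.
    locus = [
        Candidate("NM_000200.1", 0.8, 0.9, False),
        Candidate("NM_000100.3", 0.8, 0.9, False),
    ]
    print(choose_select(locus).accession)
```

Expressing the hierarchy as a tuple keeps the priority order explicit and makes the tie-breaker deterministic, which matters when a selection must be reproducible across releases.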

Targeted Projects for Organisms

The RefSeq project maintains specialized efforts for bacterial and archaeal genomes, focusing on high-quality, annotated reference sequences derived from complete and whole-genome shotgun assemblies submitted to international nucleotide sequence databases. These prokaryotic RefSeq genomes undergo automated annotation via the Prokaryotic Genome Annotation Pipeline (PGAP), which incorporates expert-curated protein family models to ensure consistency and accuracy across diverse taxa. As of 2023, the collection encompasses over 315,000 bacterial and archaeal genomes, emphasizing representative strains for type materials and enabling comparative genomics studies. A key component is the RefSeq Targeted Loci Project, which curates 16S ribosomal RNA sequences for bacteria and archaea, providing a non-redundant dataset aligned to type strains for taxonomic identification and microbial ecology research. Recent initiatives have prioritized rapid annotation for pathogenic species, including expansions in 2022 to incorporate antimicrobial resistance genes through integration with tools like AMRFinderPlus, facilitating surveillance of resistance mechanisms in clinical isolates. In 2025, an updated reference collection selected the highest-quality assemblies for 22,082 prokaryotic species, reducing redundancy and enhancing usability for metagenomic analyses.

For eukaryotic organisms, RefSeq supports targeted initiatives in plants and parasites, adapting annotation pipelines to handle complex genomes and facilitate agricultural and medical research. In plants, RefSeq provides comprehensive assemblies and annotations for model species such as Arabidopsis thaliana, with the TAIR10.1 genome assembly serving as a foundational resource for gene function studies and comparative plant genomics, including 119.1 Mb across five chromosomes and organelle sequences. These efforts extend to broader plant diversity through the NCBI Eukaryotic Genome Annotation Pipeline, which processes hundreds of plant genomes to generate consistent RefSeq records for transcripts and proteins. For parasites, RefSeq curates sequences from protozoan and helminth species, supporting multi-strain comparisons to elucidate virulence factors and host interactions; for instance, annotations for rodent malaria parasites like Plasmodium yoelii enable synteny mapping and evolutionary analyses across strains. Initiatives like WormBase ParaSite integrate with RefSeq to provide access to 274 helminth genomes as of the 19th release, emphasizing non-redundant reference sets for parasitic nematodes and platyhelminths relevant to global health.

The Viral RefSeq project curates complete or near-complete genomes for viruses, with particular emphasis on segmented genomes such as those of influenza viruses, where individual segments are annotated and organized by isolate. The NCBI Influenza Virus Database maintains RefSeq records for thousands of strains, grouping segments from the same isolate to reconstruct full genomes and track antigenic drift. This approach supports rapid updates for emerging threats; for example, in response to outbreaks in 2022-2024, RefSeq incorporated multiple assemblies, including high-coverage genomes like GCA_033539925.1, to aid phylogenetic tracking and diagnostic tool design. Automated assembly processes introduced in 2024 further enhance this by linking segments from diverse viral samples, ensuring timely RefSeq releases for pathogens such as orthopoxviruses.

RefSeq's metazoan expansions target non-model organisms through collaborations with the Genome Reference Consortium (GRC), prioritizing high-contiguity assemblies to fill gaps in animal genomic representation beyond traditional model organisms. These efforts leverage the NCBI Eukaryotic Genome Annotation Pipeline to annotate genomes from diverse metazoans, supporting ortholog inference across expanded taxa. Recent advancements include scaling ortholog calculations to a broader set of metazoan RefSeq genomes, enabling gene nomenclature consistency and functional predictions for understudied species. For instance, chromosome-level assemblies for non-model metazoans like the sacoglossan Elysia timida integrate with RefSeq to provide reference sequences that reveal evolutionary adaptations. This GRC-aligned focus enhances the database's utility for biodiversity genomics and conservation applications.

Access and Tools

Databases and Interfaces

RefSeq records are primarily accessible through the Nucleotide and Protein databases hosted by the National Center for Biotechnology Information (NCBI), where they form a curated subset of sequences. Users can refine searches to RefSeq entries using specific filters in the Entrez search and retrieval system, such as the refseq[filter] criterion, which limits results to non-redundant reference sequences in these databases.

Key web interfaces enhance user interaction with RefSeq data. The Genome Data Viewer (GDV) provides an interactive platform for visualizing RefSeq genomic sequences, allowing exploration of assemblies, transcript alignments, and annotations through graphical representations like tracks and trees. The Gene database offers detailed locus summaries for RefSeq entries, integrating transcript and protein sequences with maps and functional details for specific genes across organisms.

Entrez supports advanced search queries to access RefSeq content efficiently, such as "refseq[filter] AND human[organism]" to retrieve human reference sequences, with results displaying sequence details, alignments, and phylogenetic trees for evolutionary context. Additional user features include integration with BLAST for performing similarity searches against RefSeq datasets, enabling identification of homologous sequences, and options for batch retrieval via FTP to handle multiple records.
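Search terms of the kind quoted above can be assembled programmatically before being pasted into Entrez. The helper below is a small sketch: `build_refseq_term` is a hypothetical name, and only the refseq[filter] criterion and the [organism] field tag come from the text; any extra clause is passed through unchanged.

```python
def build_refseq_term(organism: str, extra: str = "") -> str:
    """Compose an Entrez search term restricted to RefSeq records.

    The organism name is quoted so that multi-word names stay together.
    """
    if not organism.strip():
        raise ValueError("organism must not be empty")
    parts = ["refseq[filter]", f'"{organism.strip()}"[organism]']
    if extra.strip():
        parts.append(f"({extra.strip()})")  # parenthesise so AND/OR inside stay grouped
    return " AND ".join(parts)

if __name__ == "__main__":
    print(build_refseq_term("Homo sapiens"))
    print(build_refseq_term("Homo sapiens", "HBB[gene]"))
```

Quoting the organism and parenthesising the extra clause avoids accidental regrouping when the term contains its own boolean operators.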

APIs and Download Options

RefSeq data can be accessed programmatically through NCBI's E-utilities, a suite of server-side programs that enable searching, linking, and downloading from the Entrez system of databases, including RefSeq records in the Nuccore (Nucleotide) and Protein databases. The ESearch utility allows users to query for specific RefSeq accessions, such as messenger RNA (mRNA) records prefixed with NM_, by constructing RESTful URL calls to the database; for instance, a search for NM_ accessions might use the endpoint https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term="Homo sapiens"[organism] AND NM_[accn], returning a list of unique identifiers (UIDs). In an actual request, the term value must be URL-encoded. Once UIDs are obtained, the EFetch utility retrieves the corresponding records in various formats, such as FASTA for sequences or GenBank flat files for annotated data; an example fetch for multiple NM_ accessions in FASTA format is https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=NM_000546,NM_000547&rettype=fasta&retmode=text.

For bulk metadata retrieval, the NCBI Datasets API provides an alternative to E-utilities, supporting queries for Gene records linked to RefSeq accessions and returning structured JSON responses that include details like transcript and protein associations. This is particularly useful for large-scale operations, such as retrieving gene metadata for multiple RefSeq mRNA accessions (e.g., NM_001) in a single request, which returns JSON objects that can be integrated into workflows without parsing XML from E-utilities.

Bulk downloads of RefSeq data are available via anonymous FTP at ftp://ftp.ncbi.nlm.nih.gov/refseq/, organized into directories for complete releases issued every two months (e.g., release/232/), daily updates (daily/), and supplemental files (supplemental/), with species-specific subdirectories for genomes like H_sapiens/. These directories include release notes detailing changes, checksum files for verifying download integrity, and catalogs listing available records to facilitate selective retrieval. RefSeq files are provided in multiple formats to suit different use cases: FASTA for raw nucleotide or protein sequences, GenBank flat files for records with full annotations including features and references, and ASN.1 for structured, machine-readable representations of the data that support advanced parsing. Users can download entire releases or subsets, such as all transcripts, by navigating to appropriate paths like release/release-catalog/ for indices.

To ensure data consistency, downloads should handle versioning carefully. RefSeq accessions include a version suffix (e.g., NM_000123.5) that increments when the sequence itself changes; annotation-only updates do not change the version. Users should therefore retrieve files from a single release directory to avoid mixing outdated records, and always verify downloads against the published checksums. Web interfaces like the Nucleotide database serve as initial entry points for identifying accessions before programmatic access.
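The ESearch and EFetch calls described in this section can be assembled with ordinary URL encoding. The following is a minimal sketch, not NCBI's own client code: the helper names esearch_url and efetch_url are hypothetical, the search term reuses the refseq[filter] criterion mentioned for web searches, and retmax is the standard ESearch parameter for capping the number of returned UIDs. The script only builds URLs and sends no request.

```python
from urllib.parse import urlencode

# Base path shared by the E-utilities endpoints quoted in the text.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nucleotide", retmax=20):
    """Build an ESearch URL (hypothetical helper).

    urlencode percent-encodes the quotes and brackets in the Entrez
    query and turns spaces into '+'.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(ids, db="nucleotide", rettype="fasta", retmode="text"):
    """Build an EFetch URL for a list of accessions or UIDs (hypothetical helper)."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Human RefSeq records in the Nucleotide database.
search = esearch_url('"Homo sapiens"[organism] AND refseq[filter]')

# FASTA for the two accessions used in the EFetch example above.
fetch = efetch_url(["NM_000546", "NM_000547"])

print(search)
print(fetch)
```

The resulting strings can be passed to any HTTP client. Building them with urlencode rather than string concatenation avoids malformed requests when a query contains spaces, quotes, or square brackets.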

Statistics and Updates

Record Counts and Growth

As of RefSeq Release 232, made available on September 2, 2025, the database contains 558,426,495 records in total. This includes 56,702,917 genomic records, 74,202,490 transcript (RNA) records, and 427,129,536 protein records, spanning sequences from 170,401 organisms. These figures reflect the non-redundant, curated nature of RefSeq, with genomic records often representing complete assemblies and transcripts derived from high-quality annotations.

In terms of organism distribution, approximately 59% of represented organisms are bacterial (99,961), 32% are eukaryotic (about 53,880 across vertebrates, fungi, plants, invertebrates, and protozoa), and 9% are viral (14,610). For eukaryotic examples, the human (Homo sapiens) dataset includes around 59,792 genes, encompassing roughly 20,000 protein-coding genes and additional non-coding elements. Bacterial and viral records dominate the overall counts, due to the high volume of prokaryotic and viral sequencing data integrated into RefSeq.

RefSeq has exhibited robust growth, with total records increasing by an average of 15-20% annually, from approximately 223 million in January 2020 to over 558 million by September 2025. This expansion, totaling about 335 million new records over roughly five and a half years, has been accelerated by surges in high-throughput sequencing, including integration of thousands of metagenome-assembled prokaryotic genomes since 2023. From Release 231 (July 2025) to Release 232, records grew by 11.25 million, primarily in protein and genomic categories. Update frequencies vary by category, with prokaryotic and viral datasets seeing more frequent additions compared to eukaryotic transcripts, which prioritize quality refinements over volume.

Key metrics highlight the scale of RefSeq records.
| Category | Count (Release 232) | Example Growth Driver |
| --- | --- | --- |
| Total Records | 558,426,495 | Metagenomics integration since 2023 |
| Genomic | 56,702,917 | Prokaryotic assemblies |
| Transcripts/RNA | 74,202,490 | Eukaryotic annotations |
| Proteins | 427,129,536 | Bacterial and viral sequences |
| Organisms | 170,401 | +3,179 since Release 231 |
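The growth rate and organism shares quoted in this section can be checked with a few lines of arithmetic. The sketch below uses only figures stated in the text; treating January 2020 to September 2025 as 5 years and 8 months, and summarizing growth as a compound annual rate, are assumptions made for illustration.

```python
# Record totals quoted in the text.
start_records = 223_000_000  # approximate total, January 2020
end_records = 558_426_495    # Release 232, September 2025
years = 5 + 8 / 12           # assumed span: January 2020 to September 2025

# Compound annual growth rate implied by the two totals.
cagr = (end_records / start_records) ** (1 / years) - 1

# Share of the 170,401 organisms in each group named in the text.
organisms = 170_401
groups = {"bacterial": 99_961, "eukaryotic": 53_880, "viral": 14_610}
shares = {name: round(100 * count / organisms) for name, count in groups.items()}

print(f"Compound annual growth: {cagr:.1%}")
print(shares)
```

Under these assumptions the compound rate comes out at roughly 17 to 18% per year, inside the stated 15-20% range, and the rounded shares match the 59%, 32%, and 9% given above.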

Release Schedule and Maintenance

RefSeq undergoes regular updates to ensure the accuracy and currency of its reference sequences, with full releases occurring approximately every two months, specifically during the first two weeks of odd-numbered months: January, March, May, July, September, and November. These releases incorporate cumulative changes, including new annotations, revisions to existing records, and expansions based on incoming data from collaborating institutions and public repositories. Between full releases, daily updates are made available via FTP, allowing users to access the latest changes without waiting for the next major cycle. This schedule particularly benefits rapidly evolving datasets, such as viral genomes, where timely incorporation of new sequences is essential for research on emerging pathogens.

Maintenance of RefSeq involves ongoing synchronization of annotations with primary source databases, including GenBank, DDBJ, and EMBL, supplemented by manual and automated curation performed by NCBI staff and external collaborators. Obsolete or superseded records are deprecated systematically, with lists of removed entries provided in dedicated files to maintain transparency and prevent use of outdated data. Community feedback plays a key role in this process, enabling the identification and correction of discrepancies; users can submit suggestions, updates, or corrections through the Gene and RefSeq Feedback Web Form, which informs curation decisions and policy refinements. This collaborative approach helps RefSeq remain a reliable, non-redundant resource.

Version control in RefSeq employs stable identifiers to track changes while preserving historical data. RefSeq accession numbers, such as NM_020236, serve as permanent identifiers for records, with an appended version number (e.g., .5) that increments whenever the sequence changes; annotation updates alone do not change the version. Complementing these are GenInfo Identifiers (GI numbers), unique integers assigned to each distinct sequence version, which provide stability for referencing specific iterations in publications and analyses. Previous versions of records and full releases are archived indefinitely, accessible via FTP directories, allowing researchers to retrieve historical data for reproducibility and comparative studies.

As of 2025, RefSeq maintenance continues to evolve with an emphasis on enhanced automation and interoperability, including support for standards developed by international consortia to facilitate global sharing of genomic data.
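The accession.version convention described in this section can be handled in code by splitting the identifier into its permanent accession and its version number. The sketch below is illustrative only: the regular expression is a simplified assumption (two uppercase letters, an underscore, then letters or digits, with an optional numeric version) rather than an official NCBI grammar, and the names parse_accession and mixed_versions are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Simplified pattern (an assumption, not an official specification):
# prefix such as NM_ or NC_, an identifier, and an optional ".version".
ACCESSION_RE = re.compile(r"^([A-Z]{2}_[A-Z0-9]+)(?:\.(\d+))?$")


class RefSeqAccession(NamedTuple):
    accession: str          # permanent identifier, e.g. "NM_020236"
    version: Optional[int]  # None when no version suffix is present


def parse_accession(text: str) -> RefSeqAccession:
    """Split 'NM_020236.5' into ('NM_020236', 5)."""
    match = ACCESSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a RefSeq-style accession: {text!r}")
    base, version = match.groups()
    return RefSeqAccession(base, int(version) if version else None)


def mixed_versions(identifiers):
    """Return accessions that occur with more than one version number."""
    seen = {}
    for item in identifiers:
        acc = parse_accession(item)
        if acc.version is not None:
            seen.setdefault(acc.accession, set()).add(acc.version)
    return {base: sorted(v) for base, v in seen.items() if len(v) > 1}


record = parse_accession("NM_020236.5")
conflicts = mixed_versions(["NM_000123.4", "NM_000123.5", "NM_020236.5"])
print(record)
print(conflicts)
```

A check like mixed_versions is one way to detect a dataset that combines records from different releases, since the same accession should appear with only one version within a single release.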
