
Consensus CDS Project

The Consensus Coding Sequence (CCDS) project is a collaborative initiative that identifies and curates a core set of high-quality protein-coding regions consistently annotated across the major human and mouse genome assemblies. It provides a standardized dataset of coding sequences (CDS) with unique identifiers to support reliable genomic research and annotation. Launched in 2005, the project arose from the need to reconcile discrepancies in gene annotations among leading genome databases once the reference genome sequences had stabilized, aiming to produce a "gold standard" set of protein-coding genes free from artifacts such as frameshifts or invalid splice sites.

Initial efforts focused on comparing annotations from the National Center for Biotechnology Information (NCBI) and Ensembl, with the first public human release following in 2005. By the project's inaugural publication in 2009, the set had grown to 20,159 human CCDS regions from 17,052 genes and 17,707 mouse regions from 16,893 genes, and the project had engaged over 50 researchers to refine the dataset through rigorous quality controls. These controls require a full-length CDS with proper start (ATG) and stop codons, consensus splice junctions, and alignment to the reference genome without internal stops or frameshifts.

Key collaborators include NCBI, Ensembl (jointly managed by EMBL-EBI and the Sanger Institute), the HUGO Gene Nomenclature Committee (HGNC), the Mouse Genome Informatics (MGI) database, the University of California Santa Cruz (UCSC) Genome Browser team, and the HAVANA group at EMBL-EBI, which provides manual curation expertise. The process involves periodic synchronization of annotations: differences are flagged, manually reviewed in collaborative meetings, and resolved only by consensus, so that each CCDS entry represents identical sequences across participating databases. Updates are versioned (e.g., CCDS1.1) to track changes in structure or sequence. This methodology has progressively incorporated more alternative splicing events and expanded the dataset, with releases integrated into genome browsers such as Ensembl and UCSC.

As of the latest human release (Release 24, October 26, 2022), the CCDS dataset comprises 35,608 unique IDs corresponding to 19,107 genes and producing 48,062 protein sequences, reflecting additions of 2,746 new IDs and 237 genes since the prior update, aligned to GRCh38. The mouse dataset (Release 23, October 24, 2019) includes 27,219 unique IDs corresponding to 20,486 genes, aligned to GRCm38. The project intersects with broader efforts such as the Matched Annotation from NCBI and EMBL-EBI (MANE) initiative, supporting clinical and research applications by promoting the annotation consistency needed for variant interpretation and functional studies. Freely available via FTP (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/) and web portals, CCDS remains a foundational resource for the genomics community, with its stable, curated annotations cited in thousands of studies for defining protein-coding gene sets.

Introduction and Background

Project Overview

The Consensus CDS (CCDS) Project is a collaborative effort to identify a core set of identically annotated, high-quality protein-coding regions, known as coding sequences (CDS), in the human and mouse reference genomes. A CDS is the portion of a gene that is translated into a protein: the coding parts of the exons that remain in the mature mRNA after splicing, excluding untranslated regions. By focusing on regions with consistent annotations across multiple databases, the project establishes a reliable foundation for genomic research.

The primary goal of the CCDS Project is to produce a "gold standard" dataset that promotes uniformity in gene annotation, reducing the discrepancies that arise from the differing annotation pipelines of major resources. This standardized set supports downstream applications such as functional studies by providing verifiable, high-confidence protein-coding loci. The project emphasizes full-length CDS with consensus splice sites, absence of frameshifts, and alignment to the reference assemblies without significant discrepancies.

As of Release 24 (October 2022), the human CCDS dataset comprises 35,608 CCDS IDs corresponding to 19,107 genes; as of Release 23 (October 2019), the mouse dataset comprises 27,219 CCDS IDs corresponding to 20,486 genes. This ongoing collaboration, involving key organizations such as the National Center for Biotechnology Information (NCBI) and Ensembl, ensures the dataset evolves with advances in genome sequencing and annotation practices.

Historical Context and Motivation

Prior to the establishment of the Consensus CDS (CCDS) Project, genomic research faced substantial challenges from inconsistent gene annotations across major databases, caused mainly by the divergent curation guidelines, evidence thresholds, and computational algorithms used by different groups. These variations led to fragmented representations of protein-coding gene sets, hindering reliable cross-database comparisons and introducing uncertainty into downstream applications such as proteomic studies and the interpretation of disease-associated genetic variants. For instance, annotations from the NCBI RefSeq and Ensembl projects often differed in transcript structures and coding sequence boundaries, reflecting trade-offs between annotation comprehensiveness and stringency.

The CCDS collaboration was formed in 2005 as a direct response to these issues, aiming to reconcile annotations from leading resources such as NCBI and Ensembl by identifying a core set of high-confidence, identically defined protein-coding regions on the human and mouse genomes. The initiative capitalized on the maturation of stable genome assemblies, enabling a focused effort to produce a conservative yet reliable dataset that prioritized quality over exhaustive coverage. Initial data releases for the human genome occurred in 2005, followed by the mouse genome in 2006, providing early access to consensus coding sequences through collaborative platforms.

Key milestones were documented in a series of publications, including the 2009 initial description of the project and its methodologies in Genome Research, which outlined the foundational dataset comprising over 17,000 human and 16,000 mouse loci. Subsequent updates in 2012 detailed curation coordination mechanisms, while 2014 and 2018 reports highlighted expansions, new features such as archive tracking, and alignments with updated genome builds such as GRCh38 (human) and GRCm38 (mouse).

The overarching motivation was to deliver a non-redundant, expertly curated resource that reduces annotation discrepancies, thereby improving the accuracy of genomic analyses in research and clinical contexts.

Participating Organizations

Key Collaborators

The Consensus CDS (CCDS) Project is a collaborative initiative involving several primary organizations, each contributing distinct expertise in genome annotation and related fields. The National Center for Biotechnology Information (NCBI) provides proficiency in RefSeq curation and the maintenance of comprehensive genomic databases. The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), which includes the HAVANA team, offers specialized knowledge in manual gene annotation and curation. Ensembl, a joint project of EMBL-EBI and the Sanger Institute, brings expertise in automated gene prediction. The Genome Browser team at the University of California, Santa Cruz (UCSC) contributes skills in genome visualization and data integration. The HUGO Gene Nomenclature Committee (HGNC) focuses on standardized gene nomenclature and symbol assignment. Finally, the Mouse Genome Informatics (MGI) resource supplies in-depth knowledge of mouse genomics and nomenclature.

The collaboration began in 2005, initially comprising NCBI, Ensembl, and the HAVANA team, to address discrepancies in annotations between automated and manual approaches. Over time, participation expanded to include HGNC for nomenclature consistency and MGI for mouse-specific genomic input, broadening the project's scope to both human and mouse datasets. These groups' complementary expertise underpins the consensus-building effort to identify high-quality protein-coding regions. As of 2025, all core collaborators (NCBI, EMBL-EBI including HAVANA and Ensembl, UCSC, HGNC, and MGI) continue to participate, with no major additions or departures documented in recent updates.

Roles and Contributions

The National Center for Biotechnology Information (NCBI) plays a central role in the Consensus CDS (CCDS) Project by providing RefSeq gene annotations, hosting the primary CCDS database and web interface for public access, and performing the initial computational alignments that identify candidate coding regions across the human and mouse genomes. NCBI also manages CCDS identifiers, conducts quality assurance checks, and facilitates manual curation to resolve annotation discrepancies.

The Ensembl and HAVANA groups, affiliated with the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) and the Wellcome Trust Sanger Institute, contribute GENCODE-based predictions and extensive manual curation, with particular emphasis on validating splice sites and defining alternative isoforms to ensure high-quality annotations. Their inputs combine automated pipelines with expert review to align with other collaborators' datasets during the consensus-building process.

The University of California Santa Cruz (UCSC) Genome Browser integrates CCDS tracks into its visualization platform, enabling users to explore consensus regions alongside other genomic data, and contributes comparative genomics analyses, including assessments of orthology, conservation, and pseudogene identification to support quality testing. Although UCSC ceased active voting membership in 2014, it continues to provide input on evolutionary conservation during automated pipeline evaluations.

The HUGO Gene Nomenclature Committee (HGNC) and Mouse Genome Informatics (MGI) ensure standardized gene nomenclature across the CCDS sets for human and mouse, respectively, by reviewing and resolving naming conflicts. As voting members since 2014, they participate in policy decisions and specific case reviews, helping to harmonize symbols and names with broader genomic resources.

Collaboration among these organizations occurs through a restricted-access website, discussion lists, and voting mechanisms used to discuss and resolve discrepancies, with ad hoc working groups formed for complex cases. This coordinated review process, supported by standardized guidelines, protects the integrity of the CCDS dataset by preventing unilateral changes by any single group.

Methodology for Consensus Building

Defining the CCDS Gene Set

The Consensus CDS (CCDS) gene set is defined by stringent criteria intended to capture high-confidence protein-coding regions that are annotated identically across independent genomic databases. Specifically, a coding sequence (CDS) qualifies for inclusion if it shows exact sequence identity and identical coordinates in at least two independent annotation sources, such as RefSeq from NCBI and Ensembl, while spanning the full length from an initiating ATG start codon to a valid stop codon without interruptions such as frameshifts or non-consensus splice sites. These regions must also align cleanly to the reference genome assembly, confirming structural integrity and translational potential.

The project focuses on the human and mouse genomes, using the GRCh38 assembly for human and GRCm38 for mouse (the mouse set has not been updated to GRCm39). Roughly 19,000 human genes and 20,500 mouse genes currently meet the criteria, representing a core subset of reliably annotated protein-coding loci. Initial generation of the CCDS set involves automated detection of matching exons from partner datasets, systematically excluding partial fragments and models derived solely from prediction without supporting evidence. This process prioritizes sequences backed by empirical data, so that the set captures evolutionarily conserved, biologically relevant genes.

To restrict the set to true protein-coding genes, the CCDS framework excludes non-coding elements and pseudogenes, requiring evidence of translation such as transcriptomic support, proteomic detection, or protein existence (PE) scores in the PE 1-4 range, of which PE1 reflects direct protein-level evidence. Putative pseudogenes and retrotransposed sequences are filtered out unless compelling multi-omic evidence reclassifies them as functional. Subsequent quality assurance testing further refines these sets to uphold annotation consistency.
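The structural part of these criteria can be illustrated with a minimal sketch. It is not the project's pipeline: the function name `check_cds` is hypothetical, it works on an already-spliced CDS string, and it ignores legitimate exceptions such as selenocysteine-encoding TGA codons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def check_cds(seq: str) -> list[str]:
    """Return the structural problems found in a spliced CDS (empty list = passes)."""
    seq = seq.upper()
    problems = []
    # A complete CDS is a whole number of codons; anything else suggests a frameshift.
    if len(seq) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    if not seq.startswith("ATG"):
        problems.append("missing ATG start codon")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    return problems

print(check_cds("ATGGCCTAA"))     # passes all checks
print(check_cds("ATGTAAGCCTAA"))  # reports an internal stop codon
```

A check of this kind only establishes that a sequence could be a valid CDS; the evidence requirements described above (transcript and protein support) are a separate matter.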

Alignment and Comparison Procedures

The alignment and comparison procedures in the Consensus CDS (CCDS) Project focus on identifying identical protein-coding regions across annotations from collaborating organizations, primarily through computational analysis of genomic coordinates and sequences derived from the RefSeq (NCBI) and GENCODE/Ensembl datasets. These annotations are initially generated by independent pipelines that align transcript and protein evidence to the reference genome; for example, NCBI employs BLAST for local alignments followed by Splign for global refinement to map splice sites and exons accurately, while Ensembl integrates alignments from tools such as Exonerate alongside ab initio predictions. Once aligned, the CDS features are extracted for direct comparison, so that only high-confidence models are considered.

The core process has three steps: (1) extracting CDS intervals, including exon boundaries and start and stop codons, from the RefSeq and GENCODE/Ensembl sets; (2) verifying alignments to the reference genome to confirm placement; and (3) performing pairwise intersection analysis using custom scripts at NCBI to check for exact coordinate matches and 100% sequence identity at the nucleotide level, with translated protein sequences cross-verified for the absence of frameshifts or internal stop codons. A match requires perfect agreement on splice junctions and coding frame, which excludes any discrepancy caused by annotation errors or assembly artifacts. The criteria for a valid match emphasize identical genomic spans and sequences, as detailed in the project's gene set definition guidelines.

For handling isoforms, the procedures prioritize principal transcripts, selected as the representative or canonical isoform from each source based on factors such as longest CDS or strongest expression support, so that consensus is built around these before alternatives are considered. Alternative isoforms are flagged if their CDS regions match exactly across sources, allowing the CCDS set to expand without being affected by variability in untranslated regions. This selective approach maintains consistency while accommodating the splicing diversity observed in supporting evidence.

Computationally, the pipeline processes annotations for tens of thousands of genes across the human and mouse reference assemblies, with comparisons repeated after each annotation update. Results, including matched regions and discrepancies, are loaded into a database at NCBI for querying, versioning, and collaborative review, which supports tracking of the more than 35,000 CCDS identifiers and approximately 48,000 protein sequences in recent human releases.
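The exact-match comparison in step (3) can be sketched as a dictionary lookup keyed on the full CDS structure. This is an illustrative simplification under stated assumptions, not NCBI's scripts: the function names and transcript identifiers are hypothetical, and sequence identity is taken for granted once coordinates agree on the same assembly.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]                      # (start, end) of one CDS exon segment
CdsKey = Tuple[str, str, Tuple[Interval, ...]]  # (chromosome, strand, exon segments)

def cds_key(chrom: str, strand: str, exons: List[Interval]) -> CdsKey:
    """Build a hashable key; two CDS models match only if their keys are equal."""
    return (chrom, strand, tuple(sorted(exons)))

def find_consensus(source_a: Dict[str, CdsKey],
                   source_b: Dict[str, CdsKey]) -> List[Tuple[str, str]]:
    """Pair transcripts from two sources whose CDS coordinates are identical."""
    by_key_b: Dict[CdsKey, List[str]] = {}
    for transcript_id, key in source_b.items():
        by_key_b.setdefault(key, []).append(transcript_id)
    matches = []
    for id_a, key in source_a.items():
        for id_b in by_key_b.get(key, []):
            matches.append((id_a, id_b))
    return sorted(matches)

# Hypothetical models: tx2 differs from its counterpart by one base at an exon boundary.
refseq_like = {
    "refseq_tx1": cds_key("1", "+", [(100, 199), (300, 399)]),
    "refseq_tx2": cds_key("1", "+", [(100, 199), (300, 420)]),
}
ensembl_like = {
    "ensembl_tx1": cds_key("1", "+", [(300, 399), (100, 199)]),
    "ensembl_tx2": cds_key("1", "+", [(100, 199), (300, 421)]),
}
print(find_consensus(refseq_like, ensembl_like))
```

Keying on the whole tuple of exon segments, and not on the outer span alone, is what enforces agreement on every splice junction; a single-base boundary difference leaves the pair unmatched and would be flagged for review.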

Quality Assurance and Testing

The quality assurance (QA) and testing phase in the Consensus CDS (CCDS) Project applies a series of automated biological validation checks to candidate coding sequences (CDS) derived from alignments across collaborating annotation groups, ensuring they meet criteria for protein-coding potential and structural integrity. These tests run in standardized pipelines managed primarily by the National Center for Biotechnology Information (NCBI), which flag potential issues for further scrutiny while prioritizing high-confidence annotations.

Key automated tests include verification that splice sites follow the canonical GT-AG rule at exon-intron boundaries, confirming proper transcript structure; assessment of open reading frame (ORF) integrity by ensuring that no internal stop codons disrupt the coding sequence (except in cases such as selenoproteins, with supporting evidence); and evaluation of Kozak sequence similarity around the translation initiation codon to gauge initiation efficiency. Sequences matching the optimal GCC[A/G]CCaugG context are preferred, and weaker variants require additional validation. Cross-species conservation is also examined using BLASTP searches against orthologous proteins, requiring significant similarity in at least two evolutionarily distant species to support functional relevance. Additional pipeline checks enforce multi-exon structures and exclude overlaps with annotated non-coding RNAs, reducing the risk of misannotating regulatory elements as coding regions.

These pipelines draw on diverse evidence types for robustness, such as transcript data for splice site confirmation and protein data for ORF translation potential. Historical performance indicates high reliability, with dataset stability reflected in the small number of annotation updates or withdrawals between releases. Release totals should be compared within a species: Release 20 (human) contained 32,524 CCDS IDs and Release 21 (mouse) contained 25,757, so the difference between those two figures reflects the species covered, not withdrawals.
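As an illustration of the GT-AG test, the sketch below scans the introns implied by a set of exon coordinates. It is a simplified example, not the NCBI pipeline: the function name is hypothetical, coordinates are assumed to be 0-based half-open, and only plus-strand models are handled (a minus-strand model would need the reverse complement).

```python
def noncanonical_introns(genome: str,
                         exons: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Return (start, end) of introns that do not begin with GT and end with AG.

    `exons` holds 0-based half-open intervals on the plus strand of `genome`.
    """
    genome = genome.upper()
    ordered = sorted(exons)
    flagged = []
    # Each intron lies between the end of one exon and the start of the next.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        intron = genome[prev_end:next_start]
        if not (intron.startswith("GT") and intron.endswith("AG")):
            flagged.append((prev_end, next_start))
    return flagged

# Toy sequences: exon1 + intron + exon2.
canonical = "ATGGCC" + "GTAAAG" + "GCCTAA"
variant = "ATGGCC" + "CTAAAG" + "GCCTAA"
exon_coords = [(0, 6), (12, 18)]
print(noncanonical_introns(canonical, exon_coords))  # no introns flagged
print(noncanonical_introns(variant, exon_coords))    # the single intron is flagged
```

A flagged intron is not automatically an error, since rare non-canonical splice sites exist; in the workflow described above, such cases would go to manual review.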

Curation and Review Processes

Manual Curation Techniques

Manual curation in the Consensus CDS (CCDS) Project focuses on expert-driven refinement of protein-coding gene annotations where automated alignments reveal discrepancies, such as conflicting exon boundaries or uncertain isoform structures. These methods secure high-confidence consensus by incorporating biological evidence that computational tools alone cannot fully interpret. Primary responsibility lies with the HAVANA (Human and Vertebrate Analysis and Annotation) group at the Wellcome Trust Sanger Institute, whose curators perform detailed transcript modeling using similarity searches against nucleotide and protein databases, ab initio gene predictions with tools such as GENSCAN, analysis of sequence conservation data, and Distributed Annotation System (DAS) tracks that integrate diverse data sources into coherent gene models.

Key techniques include comprehensive literature review to gather experimental evidence, such as full-length cDNA clones supporting transcript structures or antibody-based validation confirming protein expression and localization. Curators also use genome browsers such as Ensembl, the UCSC Genome Browser, and NCBI's Genome Data Viewer to inspect alignments, conservation patterns, and supporting tracks (e.g., RNA-seq or proteomics data) in context. In addition, sequences are re-aligned with specialized tools such as NCBI's Splign, which generates spliced alignments of transcripts to genomic DNA and allows precise adjustment of coding sequences in ambiguous regions.

These interventions target specific cases flagged by automated quality tests, typically novel isoforms or low-confidence predictions that prevent agreement across collaborating databases. HAVANA curators prioritize evidence from high-throughput experiments and peer-reviewed studies to resolve such issues, often extending or correcting automated models by hand.

Every curation decision is documented, with logs capturing the rationale, the evidence types integrated, and references to supporting publications via PubMed IDs, providing transparency and reproducibility for future updates. This documentation is maintained in restricted-access collaboration tools and summarized in public release notes on the CCDS website.

Review and Resolution Mechanisms

The review and resolution mechanisms of the Consensus CDS (CCDS) project maintain high-quality, consistent protein-coding annotations through structured collaboration among participating organizations, including the National Center for Biotechnology Information (NCBI), Ensembl, and others. Flagged cases, such as discrepancies in coding sequence coordinates or annotation quality, are discussed in regular teleconferences among curators to reach consensus-based decisions. These discussions prioritize evidence from experimental data, conservation across species, and consistency with established guidelines, with individual manual curations from experts submitted for group-level evaluation.

Decisions on annotations follow a voting process in which representatives of the collaborating groups vote to resolve disagreements, aiming for unanimous agreement. There are three resolution categories: accepting the locus as a CCDS entry if consensus is reached and quality criteria are met, deferring the case pending additional evidence such as new experimental data, or rejecting it by withdrawing the CCDS identifier when conflicting annotations cannot be reconciled. NCBI coordinates the process and serves as the final arbiter in cases of impasse, ensuring that updates align with the reference assemblies.

Review cases are tracked in an issue management system, such as Atlassian Jira, which logs discussions, assigns tasks, and monitors progress through each release cycle, handling hundreds of cases per update. The process also integrates community feedback through a dedicated user request form on the CCDS website, allowing external experts to submit comments on potential inclusions or revisions during annotation cycles. This mechanism has supported the refinements behind the more than 2,000 new CCDS identifiers added in recent releases.
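The three resolution categories can be modeled as a small decision function. This is a hypothetical illustration of the logic described above, not software the project is known to use; the names `Outcome` and `resolve_case`, the group labels, and the boolean inputs are all assumptions made for the example.

```python
from enum import Enum

class Outcome(Enum):
    ACCEPT = "accept"      # becomes or remains a CCDS entry
    DEFER = "defer"        # wait for additional evidence
    WITHDRAW = "withdraw"  # CCDS identifier is withdrawn

def resolve_case(votes: dict[str, bool],
                 passes_quality: bool,
                 irreconcilable: bool = False) -> Outcome:
    """Map a reviewed case to one of the three resolution categories.

    `votes` maps a group name to True (agrees) or False (disagrees).
    """
    if irreconcilable:
        return Outcome.WITHDRAW
    # Acceptance needs unanimous agreement and a passing quality check.
    if votes and all(votes.values()) and passes_quality:
        return Outcome.ACCEPT
    return Outcome.DEFER

print(resolve_case({"group_a": True, "group_b": True}, passes_quality=True))
print(resolve_case({"group_a": True, "group_b": False}, passes_quality=True))
```

The sketch treats anything short of unanimity as a deferral, which mirrors the stated aim of unanimous agreement; how the real process handles an impasse is left to the arbiter described in the text.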

Annotation Guidelines and Challenges

The Consensus CDS (CCDS) Project applies stringent annotation guidelines to keep protein-coding regions consistent and of high quality across collaborating databases such as RefSeq and Ensembl/GENCODE. These guidelines prioritize coding sequences (CDS) supported by experimentally validated evidence, including curated records from sources such as UniProtKB/Swiss-Prot, over purely automated predictions. A core requirement is 100% agreement among all participating annotations, meaning identical genomic coordinates, start codons (typically ATG), stop codons, and consensus splice sites (GT-AG), with no frameshifts or internal stop codons permitted.

Pseudogenes are systematically excluded from the CCDS set unless reclassified as functional after rigorous evidence review; multi-species alignments are used to detect duplicated, non-coding copies that have been misannotated as protein-coding. This exclusion relies on dedicated tests, including BLAST-based alignments that identify repetitive or fragmented genomic regions capable of mimicking coding sequences. For genes with alternative splicing, only isoforms achieving full consensus are included. Transcripts that differ only in their untranslated regions (UTRs) may share a CCDS identifier, whereas variants that change the coding sequence require separate agreement to avoid ambiguity.

Key challenges in CCDS annotation arise from alternative splicing, which introduces isoform ambiguity and complicates consensus on principal transcripts, particularly in genes with multiple functional variants. Incomplete or low-quality genome assemblies can generate alignment artifacts, especially in repetitive regions, leading to discordant annotations that must be resolved manually. Balancing the speed of automated annotation pipelines against the accuracy of manual curation is a further difficulty, because automated methods vary between groups and may overlook subtle evidence conflicts in cDNA or genomic data.

To address these issues, the project has incorporated newer evidence types, such as long-read sequencing data, to better resolve alternative splicing events and expand the consensus set, as seen in Release 24 (2022), which added over 2,700 new CCDS identifiers through enhanced curation. As of November 2025, Release 24 remains the latest release, with curation continuing under these established processes. These adaptations keep manual review at the center of guideline application while integrating new genomic technologies to improve consistency.

Data Access and Integration

Methods for Accessing CCDS Data

The primary means of accessing Consensus CDS (CCDS) data is the official NCBI CCDS website, which offers an interactive interface for browsing and searching the database. Users can query by gene symbol or ID, genomic coordinates (chromosome, start, and end positions), or sequence similarity, retrieving detailed reports that include annotation status, exon structures, and links to associated transcripts and proteins.

For bulk retrieval, CCDS datasets are available via the NCBI FTP site at ftp.ncbi.nlm.nih.gov/pub/CCDS/, organized into directories for the current human (Release 24, October 2022) and mouse (Release 23, October 2019) releases as well as archives. Key files include gzip-compressed FASTA files such as CCDS_nucleotide.fna.gz for the nucleotide sequences of coding regions and CCDS_protein.faa.gz for translated protein sequences, with headers containing the CCDS ID, version, genome build, and related information. Tab-delimited text files such as CCDS.txt provide comprehensive metadata including chromosome, strand, coding sequence coordinates, exon boundaries, gene IDs, and status (e.g., Public or Withdrawn), which can be processed into BED-like formats for coordinate-based analyses. Mouse data aligns to GRCm38, with no updates since 2019 despite the availability of GRCm39.

Programmatic access to CCDS data is possible through the NCBI E-utilities, a set of web services that allow scripted queries to the Entrez system, including searches by CCDS ID or linked Gene IDs and retrieval of records in XML or text formats for integration into workflows. CCDS annotations can also be visualized and downloaded as tracks in major genome browsers, enabling overlay with other genomic data: the UCSC Genome Browser offers CCDS tracks exportable in BED format via the Table Browser, Ensembl provides them within its GFF3 annotation files, and the NCBI Genome Data Viewer supports GFF3 exports for aligned views.

All CCDS data is released into the public domain as a U.S. government work, permitting unrestricted use, reproduction, and distribution for research and educational purposes, though commercial users should verify any embedded third-party content.
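The conversion of the tab-delimited metadata into BED-like rows can be sketched as follows. The column names, the bracketed `cds_locations` layout, and the 0-based inclusive coordinates are assumptions made for this example and should be checked against the header line and README of the downloaded file; the sample records are invented.

```python
import csv
import io

# Invented sample in an assumed CCDS.txt-style layout (tab-delimited, '#' header).
SAMPLE = (
    "#chromosome\tnc_accession\tgene\tgene_id\tccds_id\tccds_status\t"
    "cds_strand\tcds_from\tcds_to\tcds_locations\tmatch_type\n"
    "1\tNC_000001.11\tGENE1\t1001\tCCDS1.1\tPublic\t+\t100\t399\t"
    "[100-199, 300-399]\tIdentical\n"
    "1\tNC_000001.11\tGENE2\t1002\tCCDS2.1\tWithdrawn\t-\t-\t-\t-\tIdentical\n"
)

def ccds_to_bed(handle) -> list[tuple]:
    """Convert Public CCDS records into one BED6-style row per CDS segment."""
    rows = []
    for record in csv.DictReader(handle, delimiter="\t"):
        if record["ccds_status"] != "Public":
            continue  # withdrawn entries carry no usable coordinates
        for block in record["cds_locations"].strip("[]").split(","):
            start, end = (int(value) for value in block.strip().split("-"))
            # BED is 0-based half-open, so an inclusive end becomes end + 1.
            rows.append((f"chr{record['#chromosome']}", start, end + 1,
                         record["ccds_id"], 0, record["cds_strand"]))
    return rows

for row in ccds_to_bed(io.StringIO(SAMPLE)):
    print("\t".join(str(field) for field in row))
```

For a real download, the same function could be given a handle from `gzip.open(path, "rt")` or a plain `open(path)`; the `chr` prefix matches UCSC-style chromosome names and can be dropped for Ensembl-style names.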

Integration with Genomic Databases

The Consensus CDS (CCDS) project integrates with major genomic databases through direct mappings of its identifiers to established annotation systems, enabling cross-referencing and querying. CCDS IDs are explicitly linked to RefSeq accessions from NCBI and stable transcript IDs from Ensembl/GENCODE, allowing researchers to retrieve consistent protein-coding annotations across these resources without discrepancies in genomic coordinates. A single CCDS ID can resolve to multiple equivalent entries in RefSeq (e.g., NM_ accessions) and Ensembl (e.g., ENST stable IDs), so CCDS acts as a bridge between the two and supports unified data retrieval in tools such as the NCBI Genome Data Viewer or Ensembl BioMart.

In genome browsers, CCDS data is incorporated as dedicated tracks that visualize coding sequence boundaries and provide hyperlinks to supporting evidence. The UCSC Genome Browser includes a CCDS Gene track that displays these consensus-annotated regions alongside RefSeq and GENCODE tracks, with direct links to CCDS reports for detailed validation data such as alignment evidence and quality metrics. Similarly, Ensembl integrates CCDS annotations into its gene and transcript pages, where CCDS IDs appear in transcript summaries, allowing users to view consensus boundaries and access cross-referenced evidence from both Ensembl and NCBI sources. These integrations are updated in coordination with browser releases, keeping them aligned with current assemblies such as GRCh38.

Since CCDS Release 24 in October 2022, the project has supported the Matched Annotation from NCBI and EMBL-EBI (MANE) initiative by identifying MANE Select transcripts within CCDS reports, providing a standardized representative transcript per protein-coding locus for over 19,000 genes. This tie-in uses the CCDS core set of validated coding regions to underpin MANE's goal of exact exonic matches between RefSeq and Ensembl/GENCODE, supporting clinical-grade annotation in which one high-confidence transcript is prioritized per gene. For variant annotation, CCDS data is exported via FTP archives and incorporated into tools such as the Ensembl Variant Effect Predictor (VEP), which outputs CCDS IDs alongside consequence predictions, and dbSNP, where consensus coding boundaries inform functional impact assessments for single nucleotide variants and indels.
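When CCDS IDs from different resources are joined, the version suffix (as in CCDS1.1) matters, because a new version signals a change to the underlying region. The following sketch parses versioned IDs and keeps the highest version seen for each; the function names are hypothetical, and the `CCDS<number>.<version>` pattern is inferred from the example ID in this article.

```python
import re
from typing import Iterable, NamedTuple

class CcdsId(NamedTuple):
    number: int
    version: int

_CCDS_PATTERN = re.compile(r"^CCDS(\d+)\.(\d+)$")

def parse_ccds_id(text: str) -> CcdsId:
    """Split a versioned ID such as 'CCDS1.1' into its number and version."""
    match = _CCDS_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a versioned CCDS ID: {text!r}")
    return CcdsId(int(match.group(1)), int(match.group(2)))

def latest_versions(ids: Iterable[str]) -> dict[int, str]:
    """Keep the highest-versioned ID for each CCDS number."""
    latest: dict[int, CcdsId] = {}
    for text in ids:
        parsed = parse_ccds_id(text)
        current = latest.get(parsed.number)
        if current is None or parsed.version > current.version:
            latest[parsed.number] = parsed
    return {number: f"CCDS{item.number}.{item.version}"
            for number, item in latest.items()}

print(latest_versions(["CCDS1.1", "CCDS1.2", "CCDS30.1"]))
```

Comparing the parsed integers, and not the raw strings, avoids ordering mistakes such as treating version 10 as older than version 9.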

Applications and Impact

Current Scientific Applications

The Consensus CDS (CCDS) project provides a standardized, high-quality reference for proteomics, particularly for mass spectrometry-based protein identification and quantification. By offering a core set of consistently annotated protein-coding regions, CCDS sequences serve as a reliable reference for database searches, enabling accurate peptide-spectrum matching and reducing annotation discrepancies in proteomic workflows. For instance, in the Human Proteome Project (HPP), CCDS forms part of the foundation for consensus annotations. The 2024 HUPO report, issued after the retirement of neXtProt and the transition to GENCODE and UniProtKB resources, credits the HPP with protein-level evidence (PE1) for 18,138 of 19,411 predicted proteins (93.4% coverage), a figure that reflects extensive tissue and cell line analyses and is in line with the roughly 19,000 genes in CCDS. This high coverage underscores the role of CCDS in mapping the human proteome and validating novel protein identifications.

In variant analysis, CCDS acts as a standard reference for interpreting exonic variants, particularly in clinical tools such as ClinVar. The project's stable identifiers and consensus annotations support the assessment of variant consequences on protein-coding sequences, helping to distinguish benign polymorphisms from pathogenic changes. Used alongside guidelines from the American College of Medical Genetics and Genomics, CCDS helps reduce false positives in disease association studies, because discrepancies in transcript models are resolved through multi-source curation, improving the reliability of pathogenicity predictions.

For comparative genomics, CCDS supports ortholog mapping between species such as human and mouse, providing a consistent framework for cross-species comparison. Because protein-coding regions are annotated to the same standard in both genomes, conserved sequences and functional elements can be identified precisely. Regular reviews of human-mouse orthologs maintain consistency for downstream analyses.

As of 2025, CCDS is also relevant to machine-learning-driven annotation and protein structure prediction. Through its integration into UniProt's canonical sequences, CCDS provides verified coding regions that inform training datasets and predictions, improving the accuracy of structural models tied to genomic annotations. This linkage supports applications in which models rely on CCDS for dependable sequence inputs when predicting structures and variant effects.

Broader Implications and Collaborations

The Consensus CDS (CCDS) Project has significant implications for clinical genomics, particularly in supporting precision medicine through the standardization of coding sequences (CDS). By providing a core set of consistently annotated protein-coding regions, CCDS enables accurate interpretation of genetic variants in applications such as pharmacogenomics, where reliable CDS boundaries are crucial for predicting drug responses, and cancer variant calling, where precise annotation reduces errors in identifying pathogenic mutations. This standardization forms the foundation for initiatives like the Matched Annotation from NCBI and EMBL-EBI (MANE) project, which extends CCDS to full-length transcripts for clinical reporting, covering nearly all protein-coding genes, including those relevant to pharmacogenomic guidelines and oncology diagnostics.
In education, the CCDS dataset serves as a reliable resource in teaching materials and bioinformatics curricula, allowing students to explore high-quality gene models and understand annotation consistency across major genomic databases. It is integrated into genome browser tutorials and introductory exercises that demonstrate eukaryotic gene structure, supporting active learning in undergraduate and high school biology programs focused on genomics.
The project fosters key external collaborations, including ties with the ENCODE project via the GENCODE consortium for functional validation of coding regions using experimental data on regulatory elements and transcription. Additionally, CCDS annotations are integrated with GTEx expression data to align protein-coding models with tissue-specific gene regulation patterns, enhancing analyses of genetic effects on expression. These partnerships, involving NCBI, Ensembl, and UCSC, keep CCDS aligned with broader genomic efforts.
On a societal level, CCDS promotes equity in genomic research by delivering a stable, publicly accessible dataset that standardizes annotations for diverse global studies, mitigating discrepancies that could disadvantage under-resourced labs or underrepresented populations in variant analysis. The project's foundational publications underscore its widespread adoption and impact on inclusive biomedical research.
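The role of reliable CDS boundaries in variant interpretation, described above, can be illustrated with a minimal sketch. It assumes the coding exons of a transcript are available as sorted, non-overlapping, 1-based inclusive intervals; the coordinates and the helper name `in_cds` are hypothetical and are not taken from any CCDS record or tool.

```python
from bisect import bisect_right


def in_cds(position, cds_intervals):
    """Return True if a 1-based position lies inside any CDS interval.

    cds_intervals must be a sorted, non-overlapping list of
    (start, end) tuples with 1-based inclusive coordinates.
    """
    starts = [start for start, _ in cds_intervals]
    # Index of the last interval whose start is <= position.
    idx = bisect_right(starts, position) - 1
    if idx < 0:
        return False
    start, end = cds_intervals[idx]
    return start <= position <= end


# Hypothetical coding exon blocks for one transcript (illustration only).
example_cds = [(1000, 1200), (1500, 1650), (2000, 2100)]

# Hypothetical variant positions on the same chromosome.
variants = [1100, 1300, 1650, 2101]
classified = {pos: in_cds(pos, example_cds) for pos in variants}
```

In this sketch a variant at position 2101 falls one base outside the last coding block, so a one-base disagreement between two annotations about where that block ends would change its classification. Identically annotated boundaries are meant to remove that kind of ambiguity.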

Evolution and Future Directions

Release History

The Consensus CDS Project began its public releases with the first human release (Release 1) on March 2, 2005, comprising 14,795 CCDS IDs from 13,142 genes, aligned to NCBI Build 35; the first mouse release (Release 2) followed on October 10, 2006, aligned to MGSCv36. Subsequent releases have followed an approximately annual cadence, supplemented by minor patches for corrections and alignments; as of November 2025, no major release has occurred since Release 24 in 2022. Key milestones include Release 11 in 2012, which expanded the dataset through enhanced curation efforts, and Release 20 in 2016, which incorporated new evidence from transcriptomic and proteomic data to refine consensus annotations. Release 24, issued in October 2022, was a human update aligned to GRCh38 and brought the total to 35,608 CCDS IDs, including 2,746 newly added IDs; the most recent mouse release (Release 23, October 2019) is aligned to GRCm38. Across releases, changes typically involve additions derived from resolved manual curations, removals of CDS regions invalidated by new evidence, and adjustments to reflect genome build patches or improvements.

Prospects for Human and Mouse

The Consensus CDS (CCDS) project continues to prioritize annotation completeness for the human and mouse genomes through ongoing collaborative efforts, with plans to incorporate additional experimental validation to resolve remaining discrepancies and expand the core set of protein-coding regions. Targeted curation initiatives are expected to drive further growth in the dataset, building on historical releases to achieve greater stability and coverage. Improvements in automation are anticipated, including a potential expansion of the quality tests to encompass cross-species analyses, which would aid more precise detection of discrepancies among partner annotations.
The project maintains close alignment with initiatives like GENCODE and Ensembl, as evidenced by the integration of CCDS identifiers in GENCODE annotations to denote consensus between RefSeq and GENCODE, with future mappings planned for additional assemblies to support comprehensive annotation; this includes ongoing use in GENCODE 2025. Recent alignments, such as with the MANE project for matched human annotations, underscore efforts to standardize records across resources.
Key challenges include adapting to emerging genomic complexities, such as structural variants and the shift toward pangenome representations involving multiple haplotype-resolved assemblies, which require resolving differences across diverse reference builds while keeping annotations consistent. Rapid advances in sequencing technologies and automated annotation methods further complicate consensus-building, necessitating refined processes to ensure high-quality, consistent outputs as the data landscape evolves.
In the long term, the project's vision centers on a highly complete, standardized set of protein-coding genes for human and mouse, supported by increased input from curation groups like NCBI, Ensembl, and GENCODE to enhance reliability and utility in genomic research. This iterative approach aims to minimize annotation variability, with ongoing communication among collaborators positioned to address future refinements.
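As an illustration of how CCDS identifiers carried in GENCODE annotations might be read, the sketch below parses the attribute column of a GTF record. It assumes the attribute layout used in GENCODE GTF files, where a transcript in the consensus set carries `tag "CCDS";` and a `ccdsid` attribute; the sample identifiers are placeholders rather than real records, and the helper names `parse_attributes` and `ccds_id` are hypothetical.

```python
import re

# Matches quoted GTF attributes of the form: key "value";
# Unquoted numeric attributes (e.g. level 2;) are ignored in this sketch.
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(attr_field):
    """Parse a GTF attribute column into a dict, collecting repeated tags."""
    attrs = {}
    tags = []
    for key, value in ATTR_RE.findall(attr_field):
        if key == "tag":
            tags.append(value)  # "tag" may appear several times per record
        else:
            attrs[key] = value
    attrs["tags"] = tags
    return attrs


def ccds_id(attr_field):
    """Return the CCDS ID of a record tagged CCDS, or None otherwise."""
    attrs = parse_attributes(attr_field)
    if "CCDS" in attrs["tags"]:
        return attrs.get("ccdsid")
    return None


# Placeholder attribute strings (identifiers are made up for illustration).
with_ccds = (
    'gene_id "ENSG00000000001.1"; transcript_id "ENST00000000001.1"; '
    'gene_type "protein_coding"; level 2; tag "basic"; tag "CCDS"; '
    'ccdsid "CCDS0000.1";'
)
without_ccds = (
    'gene_id "ENSG00000000002.1"; transcript_id "ENST00000000002.1"; '
    'gene_type "lncRNA"; level 2; tag "basic";'
)

found = ccds_id(with_ccds)
missing = ccds_id(without_ccds)
```

A regular expression is used here only to keep the example short. A production pipeline would more likely rely on an established GTF parser and should confirm the attribute names against the GENCODE format description for the release in use.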
