
Coding region

In molecular biology, the coding region, also known as the coding sequence (CDS), is the portion of a gene's DNA or RNA that directly specifies the amino acid sequence of a protein through translation. It begins with a start codon, typically ATG in DNA (AUG in RNA), and ends with one of three stop codons (TAA, TAG, or TGA in DNA; UAA, UAG, or UGA in RNA), encompassing an open reading frame (ORF) of nucleotide triplets called codons that correspond to the protein's polypeptide chain. This sequence is essential for protein synthesis, as it provides the genetic instructions for building functional proteins that perform diverse roles in cellular processes, metabolism, and organismal development. In prokaryotes, coding regions are usually contiguous within genes, allowing transcription and translation to proceed without interruption. In contrast, eukaryotic genes typically contain coding regions interspersed with non-coding introns, which are transcribed into pre-mRNA but spliced out during RNA processing to produce mature mRNA consisting of joined exons that include the CDS. Alternative splicing of exons can generate multiple protein isoforms from a single gene, enhancing proteomic diversity from a limited genome. Coding regions typically range from a few hundred to several thousand nucleotides, with an average length of about 1,200–1,500 nucleotides in humans (corresponding to proteins of ~400–500 amino acids). The full protein-coding gene, including non-coding introns, averages around 27 kilobases, though some exceed 2 million base pairs due to extensive intronic sequences. Despite their critical role, coding regions constitute less than 2% of the human genome, with approximately 20,000 protein-coding genes identified (as of 2025), while the majority of DNA is non-coding, with part of it serving regulatory functions. Variations in coding regions, such as single nucleotide polymorphisms (SNPs) or insertions/deletions, can alter protein structure and function, contributing to genetic diseases, evolutionary adaptation, and phenotypic diversity. 
The accurate annotation of coding regions is fundamental in genomics, enabling tools like the Consensus Coding Sequence (CCDS) project to standardize protein-coding annotations across species for research and clinical applications.

Fundamentals

Definition

The coding region, also known as the coding sequence (CDS), is the portion of a gene's DNA that directly specifies the amino acid sequence of a protein through transcription into messenger RNA (mRNA) and subsequent translation, excluding non-coding interruptions such as introns in eukaryotes. This sequence begins at the start codon, typically ATG (which codes for methionine), and ends at one of the three stop codons (TAA, TAG, or TGA), forming an open reading frame (ORF): a continuous stretch of codons uninterrupted by stop signals that can potentially be translated into a polypeptide. In biological context, the coding region represents the functional core of protein-coding genes, distinguishing it from surrounding elements that do not contribute to the protein product. Non-coding regions encompass regulatory sequences, such as promoters and enhancers that initiate or modulate transcription, introns that are transcribed but spliced out of pre-mRNA in eukaryotes, and untranslated regions (UTRs). In mature mRNA, the 5' UTR precedes the start codon and the 3' UTR follows the stop codon; both influence mRNA stability, localization, and translation efficiency without encoding amino acids. Examples illustrate these distinctions across organisms: in prokaryotes, genes generally lack introns, resulting in an uninterrupted coding region from start to stop that is directly transcribed and translated. In eukaryotes, however, the coding region is discontinuous in the genomic DNA, composed of multiple exons separated by introns that are removed during splicing to yield a contiguous mRNA sequence for translation.
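The start-to-stop definition above can be illustrated with a minimal ORF scan. This is a sketch under simplifying assumptions: it reads only the given strand, takes the first ATG in each frame as the start, and ignores introns entirely; the function name `find_orfs` and its `min_codons` parameter are hypothetical and not taken from any annotation tool.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop ORF in the three forward frames.

    Coordinates are 0-based, end-exclusive, and the stop codon is included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the sequence codon by codon in this reading frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first start codon opens an ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None                   # look for the next ORF in this frame
    return orfs

if __name__ == "__main__":
    print(find_orfs("CCATGGCTTAAGG"))
```

Real annotation pipelines also scan the reverse complement and, in eukaryotes, must work on spliced transcripts rather than raw genomic DNA.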

Composition

The coding region, also known as the coding DNA sequence (CDS), is composed of sequences of nucleotides arranged in triplets known as codons. In DNA, these are adenine (A), thymine (T), cytosine (C), and guanine (G), while in the corresponding messenger RNA (mRNA) transcribed from the coding region, thymine is replaced by uracil (U). Each codon consists of three consecutive nucleotides that specify a particular amino acid or a stop signal during protein synthesis. The codon structure of coding regions follows the genetic code, which comprises 64 possible triplets arising from the four nucleotides taken three at a time (4^3 = 64). Of these, 61 codons encode the 20 standard amino acids, while the remaining three (UAA, UAG, and UGA in mRNA) serve as stop codons that terminate translation. This code exhibits degeneracy, meaning that most amino acids are specified by multiple synonymous codons (ranging from two to six per amino acid), which provides redundancy and reduces the impact of certain mutations. Coding regions vary in length, typically spanning from a few hundred to several thousand base pairs, with an average of approximately 1,000 to 1,500 base pairs in eukaryotic organisms; for instance, the average CDS length in the human genome is about 1,340 nucleotides, corresponding to proteins of around 447 amino acids. This variability reflects the diverse sizes of proteins encoded across organisms. Additionally, the GC content (the proportion of guanine and cytosine) in coding regions shows species-specific patterns, often ranging from 40% to 60% or higher in certain taxa like vertebrates, influencing mRNA secondary structure and stability. Higher GC content generally enhances thermal stability because G-C pairs form three hydrogen bonds whereas A-T pairs form two.
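The codon arithmetic and the GC-content measure described above can be checked with a few lines of Python. This is only an illustrative sketch; the names `all_codons`, `sense_codons`, and `gc_content` are chosen here for the example.

```python
from itertools import product

RNA_BASES = "ACGU"
STOP_CODONS = {"UAA", "UAG", "UGA"}

# Every ordered triplet of the four bases: 4**3 = 64 codons.
all_codons = ["".join(triplet) for triplet in product(RNA_BASES, repeat=3)]

# Removing the three stop codons leaves the 61 sense codons.
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)

if __name__ == "__main__":
    print(len(all_codons), len(sense_codons))
    print(gc_content("ATGGCC"))
```

The same length arithmetic links the figures quoted in the text: a 1,340-nucleotide CDS divided by three nucleotides per codon gives roughly 447 codons.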

Historical Development

Early Discoveries

The foundational concept of a coding region, as the portion of genetic material that specifies protein sequences, emerged from mid-20th-century experiments linking genes to biochemical functions. In 1941, George Beadle and Edward Tatum proposed the "one gene-one enzyme" hypothesis through their studies on the bread mold Neurospora crassa. By inducing mutations with X-rays and observing that specific genetic alterations disrupted individual enzymatic steps in metabolic pathways, they demonstrated that each gene directs the production of a single enzyme, establishing a direct correlation between genes and proteins. This hypothesis laid the groundwork for understanding genetic coding, but the molecular basis remained elusive until the structure of DNA was elucidated. In 1953, James Watson and Francis Crick described the double-helical structure of DNA, revealing how its nucleotide bases could store and replicate genetic information through complementary base pairing. While this model provided a structural prerequisite for genetic coding, it did not yet specify how DNA sequences translated into proteins. Breakthroughs in the early 1960s clarified the intermediary role of RNA and the triplet nature of the genetic code. In 1961, Sydney Brenner, François Jacob, and Matthew Meselson identified an unstable RNA species that carried genetic information from DNA to ribosomes for protein synthesis, proposing it as "messenger RNA" (mRNA), which serves as the template for translation. Concurrently, Marshall Nirenberg and J. Heinrich Matthaei used a cell-free Escherichia coli system to demonstrate that synthetic polyuridylic acid (poly-U) RNA directed the incorporation of phenylalanine into polypeptides, revealing that the codon UUU specifies phenylalanine and establishing the RNA-based nature of genetic coding. These experiments confirmed that coding regions in DNA are transcribed into mRNA, whose nucleotide triplets (codons) dictate amino acid sequences in proteins. 
Building on this, further experiments through the mid-1960s by Nirenberg, Har Gobind Khorana, and others fully deciphered the genetic code by 1966, assigning functions to all 64 possible codons, including start and stop signals.

Key Milestones

The advent of recombinant DNA technology in the 1970s marked a pivotal shift in the study of coding regions, allowing for the first time the isolation and manipulation of specific genetic sequences. In 1973, Stanley N. Cohen and Herbert W. Boyer demonstrated the construction of biologically functional bacterial plasmids by joining restriction endonuclease-generated fragments from separate plasmids in vitro, enabling the cloning and propagation of foreign DNA segments, including coding sequences, in host cells. This breakthrough facilitated the direct isolation of coding regions from complex genomes, laying the foundation for gene cloning and expression studies that transformed molecular biology. Building on these techniques, the Human Genome Project (HGP), launched in 1990 and completed in 2003, provided the first comprehensive sequence of the human genome, annotating approximately 20,000 protein-coding genes and elucidating their exon-intron structures. The project's efforts revealed that coding regions constitute about 1-2% of the genome, with introns often comprising the majority of gene lengths, offering critical insights into gene architecture. This large-scale annotation not only mapped coding regions across the euchromatic portions of all 24 chromosomes but also enabled comparative genomics to identify conserved coding sequences essential for protein function. The mid-2000s ushered in the next-generation sequencing (NGS) era, dramatically accelerating the identification of coding variants through high-throughput technologies. Illumina's introduction of the Genome Analyzer platform in 2006, based on reversible terminator chemistry, allowed for massively parallel sequencing that generated up to one gigabase of data per run, making it feasible to sequence entire exomes (regions encompassing all protein-coding sequences) at scale. This advancement supported large population-sequencing projects, which cataloged millions of variants across diverse populations, revealing their roles in disease susceptibility and evolution. 
A landmark in coding region manipulation came with the development of CRISPR-Cas9 in 2012, enabling precise editing of targeted sequences. Martin Jinek, Jennifer Doudna, Emmanuelle Charpentier, and colleagues demonstrated that the Cas9 endonuclease, guided by a dual crRNA-tracrRNA complex (which can be simplified to a single-guide RNA), cleaves DNA at specific sites complementary to the RNA guide, allowing for targeted insertions, deletions, or replacements within coding regions. This RNA-programmed system accelerated functional studies of coding regions by permitting rapid knockout or modification of genes in various organisms, profoundly impacting fields ranging from basic research to therapeutic applications.

Structure and Function

Molecular Structure

The coding region, also known as the coding sequence (CDS), represents the portion of a gene that is transcribed and translated into a protein. In prokaryotes, such as bacteria, coding regions consist of continuous linear stretches of DNA nucleotides without intervening non-coding sequences, allowing for direct transcription into a mature mRNA that mirrors the genomic organization. This uninterrupted structure facilitates rapid gene expression in these organisms, where transcription and translation occur concurrently. In contrast, eukaryotic coding regions exhibit a discontinuous linear organization, split into coding exons separated by non-coding introns that are removed during splicing. The average length of an exon in human genes is approximately 150 base pairs (bp), though this varies by gene and organism, with exons collectively comprising a small fraction of the total gene length due to the often much larger introns. This modular structure enables alternative splicing, but the core coding sequence remains the concatenated exons that encode the amino acid sequence. At the RNA level, the mature mRNA derived from the coding region is a linear single-stranded molecule, yet it can adopt secondary structures such as hairpins and loops formed by base-pairing within the coding sequence itself, which contribute to mRNA stability and influence post-transcriptional processes. Coding regions are integrated into the chromatin landscape, predominantly residing within euchromatin, regions characterized by a more open, less condensed arrangement that enhances accessibility to DNA-binding proteins and RNA polymerase. This positioning contrasts with heterochromatin, where genes are typically silenced due to compact packaging. In three-dimensional genome organization, coding regions within topologically associating domains (TADs), self-interacting regions identified through chromatin conformation capture techniques, are frequently brought into close spatial proximity with distal enhancers via long-range looping, facilitating efficient regulatory interactions without altering the linear sequence. 
Seminal studies have shown that these domains, averaging about 1 megabase in size, compartmentalize the genome to promote such enhancer-promoter contacts, which are essential for gene regulation.

Role in Protein Synthesis

The coding region of a gene serves as the template for synthesizing messenger RNA (mRNA) during transcription, which is the first step in protein synthesis. In prokaryotes, a single type of RNA polymerase binds to the promoter upstream of the coding region and synthesizes mRNA complementary to the template strand of the DNA in the 5' to 3' direction (with A pairing to U in RNA, and G to C), resulting in an mRNA sequence identical to the coding strand sequence (except T is replaced by U). This process produces a mature mRNA that includes the entire coding sequence without interruptions, as prokaryotic genes typically lack introns. In eukaryotes, RNA polymerase II performs this transcription in the nucleus, generating a pre-mRNA that encompasses the coding region along with untranslated regions and introns. Following transcription in eukaryotes, the pre-mRNA undergoes splicing to remove non-coding introns and join the exons, which contain the coding sequence derived from the coding region. This splicing is mediated by the spliceosome, a complex of small nuclear ribonucleoproteins (snRNPs) that recognizes consensus sequences at intron-exon boundaries and excises introns through a series of transesterification reactions, resulting in a mature mRNA consisting solely of the continuous coding sequence flanked by untranslated regions. Prokaryotes do not require this step, allowing immediate availability of the mRNA for translation. The mature mRNA is then exported to the cytoplasm in eukaryotes, where it directs protein synthesis. During translation, ribosomes bind to the mature mRNA and read its coding sequence in the 5' to 3' direction, decoding groups of three nucleotides known as codons to assemble a polypeptide chain. Transfer RNAs (tRNAs), each carrying a specific amino acid, recognize these codons via anticodon base pairing in the ribosome's A site, delivering the corresponding amino acids, which are joined by peptide bonds to the growing chain held in the P site. 
Translation begins at the start codon AUG, which codes for methionine (N-formylmethionine in prokaryotes), with the ribosome scanning from the 5' end in eukaryotes or recognizing a Shine-Dalgarno sequence in prokaryotes. Elongation continues until a termination codon (UAA, UAG, or UGA) enters the A site, triggering release factors to hydrolyze the bond between the polypeptide and tRNA, yielding a completed protein chain. Typically, a single coding region produces one polypeptide chain, though post-translational modifications may further process it into a functional protein.
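The flow from coding strand to mRNA to polypeptide described above can be sketched in a few lines. This is a deliberately simplified model: it assumes an intron-free sequence, starts at the first AUG from the 5' end (a rough stand-in for scanning), and uses the standard genetic code; the helper names `transcribe` and `translate` are illustrative only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(coding_strand):
    """mRNA has the coding-strand sequence with T replaced by U."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Translate from the first AUG until a stop codon ('*') or the end of the mRNA."""
    start = mrna.find("AUG")
    if start == -1:
        return ""                              # no start codon, no protein
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":                  # termination codon reached
            break
        protein.append(amino_acid)
    return "".join(protein)

if __name__ == "__main__":
    mrna = transcribe("GGATGGCTTAAGG")
    print(mrna, translate(mrna))
```

One-letter amino acid codes are used for the output, so the start codon appears as `M` (methionine).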

Regulation

Transcriptional Control

Transcriptional control of coding regions primarily occurs through regulatory elements and chromatin modifications that govern the initiation and elongation phases of RNA polymerase II-mediated transcription, ensuring precise expression levels of protein-coding genes. These mechanisms integrate signals from the cellular environment to activate or repress transcription, directly impacting the abundance of mRNA derived from coding sequences. Promoter-proximal elements, located immediately upstream of the transcription start site (TSS), serve as foundational platforms for assembling the pre-initiation complex, while distal elements and epigenetic modifications provide additional layers of fine-tuning. This regulation is essential for coordinating gene expression during development, differentiation, and response to stimuli, with coding regions benefiting from enhanced chromatin accessibility and polymerase recruitment. Promoter-proximal elements, such as the TATA box and CCAAT box, are critical for facilitating RNA polymerase II binding and transcription initiation upstream of the coding start. The TATA box, typically positioned 25-35 base pairs upstream of the TSS with a consensus sequence of TATAWAW (where W is A or T), is recognized by the TATA-binding protein (TBP), a subunit of transcription factor IID (TFIID), which nucleates the assembly of the basal transcription machinery including TFIIA, TFIIB, TFIIE, TFIIF, and TFIIH. This interaction bends DNA and positions RNA polymerase II for accurate start site selection, thereby influencing the rate of transcription initiation for downstream coding regions. The CCAAT box, located further upstream around -80 base pairs from the TSS, binds transcription factors like NF-Y, enhancing the efficiency of promoter recognition and polymerase recruitment in a combinatorial manner with the TATA box. These elements are conserved across eukaryotes and are particularly enriched in genes requiring inducible or tissue-specific expression, ensuring that transcription aligns with cellular needs. Distal regulatory sequences, including enhancers and silencers, exert long-range control over the transcription of coding regions by modulating initiation and elongation rates. 
Enhancers are cis-acting DNA segments, often located thousands of base pairs away from the TSS, that boost transcription through binding of activator transcription factors and co-activators, which loop the intervening DNA to contact the promoter via Mediator and cohesin complexes. This proximity enables recruitment of RNA polymerase II and histone acetyltransferases, increasing the frequency of transcriptional bursts and thus elevating mRNA output from coding exons. For instance, in developmental contexts, enhancers drive spatiotemporal precision in gene expression. In contrast, silencers repress transcription by recruiting repressive factors, which inhibit polymerase progression or maintain chromatin compaction, thereby reducing coding region activity in specific lineages. These elements form a network of interactions that can switch functions based on cellular context, ensuring balanced gene expression. Epigenetic modifications, particularly histone acetylation, play a pivotal role in opening chromatin around coding exons to enhance accessibility for transcription. Acetylation of lysine residues on histones, such as H3K27ac and H4K16ac, mediated by histone acetyltransferases like CBP/p300, neutralizes positive charges on histones, loosening nucleosome-DNA interactions and creating open chromatin states conducive to polymerase binding. Regions marked by H3K27ac exhibit higher DNA accessibility, as measured by DNase I hypersensitivity, and correlate with elevated transcription levels of nearby genes, including those with coding exons in active promoters. This mark is often enriched at promoter-proximal and enhancer regions, facilitating the recruitment of bromodomain-containing proteins that further stabilize the transcription apparatus. In contrast, deacetylation by histone deacetylases promotes heterochromatin formation, repressing coding region transcription. These dynamic marks integrate with distal elements to sustain long-term expression patterns. 
Tissue-specific expression of coding regions, exemplified by Hox genes, relies on combinatorial transcription factors that activate regulatory elements in a context-dependent manner. Hox genes, which encode homeodomain proteins crucial for body patterning, are regulated by clusters of enhancers responsive to combinations of factors like Ultrabithorax (Ubx), Homothorax (Hth), and Extradenticle (Exd) in Drosophila. In specific tissues, such as haltere imaginal discs, Ubx binds motifs in open chromatin regions to recruit cofactors, increasing accessibility and activating transcription of target coding regions while repressing others through chromatin closure. This combinatorial code allows Hox factors to orchestrate distinct expression profiles across tissues, ensuring coding regions are transcribed only where needed for developmental identity. Similar mechanisms operate in vertebrates, highlighting the evolutionary conservation of this regulatory strategy.
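The TATA box consensus quoted above (TATAWAW, with W being A or T, located 25-35 base pairs upstream of the TSS) translates directly into a pattern search. The sketch below is an assumption-laden illustration: `find_tata_boxes` is a hypothetical helper, the input is taken to be the sequence immediately 5' of the TSS with its last base at position -1, and a hit is reported by the position of its first base. Real promoter scanners usually use position weight matrices rather than an exact consensus.

```python
import re

# TATAWAW, where W (A or T) is written as the character class [AT].
TATA_PATTERN = re.compile(r"TATA[AT]A[AT]")

def find_tata_boxes(upstream, window=(25, 35)):
    """Return TSS-relative positions (negative) of consensus matches within the window.

    upstream: DNA immediately 5' of the TSS; its last base is position -1.
    window:   inclusive range of distances upstream of the TSS to accept.
    """
    upstream = upstream.upper()
    n = len(upstream)
    hits = []
    for match in TATA_PATTERN.finditer(upstream):
        position = match.start() - n           # e.g. index n - 30 maps to -30
        if window[0] <= -position <= window[1]:
            hits.append(position)
    return hits

if __name__ == "__main__":
    promoter = "G" * 10 + "TATAAAA" + "C" * 23   # made-up 40-bp upstream region
    print(find_tata_boxes(promoter))
```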

Translational Control

Translational control mechanisms regulate the efficiency and accuracy of protein synthesis from the coding region of mRNA after transcription, ensuring that translation aligns with cellular needs. Key post-transcriptional modifications to mRNA, including the addition of a 5' cap and a poly-A tail, play critical roles in stabilizing the transcript and facilitating ribosome recruitment. The 5' cap, a 7-methylguanosine structure, is recognized by the initiation factor eIF4E, which promotes the assembly of the pre-initiation complex and circularization of the mRNA through interactions with poly-A binding proteins (PABPs) bound to the 3' poly-A tail. This cap-poly-A synergy enhances mRNA stability by protecting against exonucleolytic degradation and boosts translation initiation by increasing ribosome binding affinity. Codon optimization within the coding region further fine-tunes translation kinetics, where the strategic use of rare codons can slow ribosomal elongation to accommodate co-translational protein folding requirements. Rare codons, which correspond to less abundant tRNAs, induce temporary ribosome pausing, allowing nascent polypeptides sufficient time to adopt correct secondary structures before full emergence from the ribosomal exit tunnel. This deceleration is particularly important in domains prone to misfolding, as excessive translation speed can trap proteins in off-pathway conformations, reducing overall folding efficiency. Studies demonstrate that incorporating rare codons at key positions enhances the native yield of proteins like luciferase and green fluorescent protein by synchronizing translation rates with folding kinetics. MicroRNAs (miRNAs) exert translational repression by binding to target sites in the coding region or untranslated regions (UTRs) of mRNAs, thereby inhibiting protein output in contexts like cancer progression. 
For instance, miR-21, an oncogenic miRNA overexpressed in various malignancies, binds to the 3' UTRs of tumor suppressor genes such as PDCD4 and PTEN, recruiting Argonaute proteins to block ribosomal scanning and promote mRNA deadenylation or degradation. While miR-21 primarily targets UTRs, evidence shows it can also interact with coding sequences to sterically hinder elongating ribosomes, leading to reduced translation of pro-apoptotic factors and enhanced cell survival. In colorectal and other cancers, this miR-21-mediated inhibition correlates with increased invasion and metastasis, underscoring its role in dysregulated translational control. Ribosome pausing at specific codons within the coding region serves as a regulatory checkpoint for co-translational folding, preventing aggregation of nascent chains. Pauses often occur at rare or suboptimal codons, where tRNA scarcity or mRNA secondary structures delay peptidyl transfer, providing a temporal window for chaperone-assisted folding or stabilization. This mechanism is conserved across eukaryotes and prokaryotes, with experimental studies revealing pauses that correlate with folding intermediates in proteins such as CFTR. Disruptions in pausing, such as through codon bias alterations, can lead to misfolded products and cellular stress, highlighting its essential function in maintaining proteome integrity.

Mutations and Genetic Variations

Types of Mutations

Mutations in coding regions can disrupt the genetic information that specifies amino acid sequences in proteins, leading to altered or nonfunctional proteins. These mutations are broadly classified into point mutations, insertions and deletions (indels), and splice-site mutations, each with distinct molecular consequences. Point mutations involve the substitution of a single base in the DNA sequence within a coding region. These can be further categorized based on their impact on the protein: synonymous mutations, which do not change the encoded amino acid due to the degeneracy of the genetic code; missense mutations, which result in the replacement of one amino acid with another, potentially altering protein structure or function; and nonsense mutations, which introduce a premature stop codon, leading to truncated and often nonfunctional proteins. Insertions and deletions, collectively termed indels, add or remove one or more nucleotides from the coding sequence. If the number of nucleotides affected is not a multiple of three, these indels cause a frameshift, shifting the reading frame of all downstream codons and typically resulting in a completely altered amino acid sequence and premature termination. Splice-site mutations occur at the boundaries between exons and introns, disrupting the precise removal of introns during splicing. Such mutations can lead to exon skipping, where an entire exon is omitted from the mature mRNA; retention of introns, introducing non-coding sequences into the coding region; or activation of cryptic splice sites, causing aberrant exon inclusion or partial exon removal, all of which alter the final protein product. A well-known example of a missense mutation in a coding region is the one responsible for sickle cell anemia, where a single nucleotide substitution in the beta-globin gene changes the codon from GAG (encoding glutamic acid) to GTG (encoding valine) at position 6, causing hemoglobin molecules to polymerize under low-oxygen conditions and deform red blood cells.
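The three classes of point mutation described above follow mechanically from the genetic code, as the sketch below shows for single codons. It assumes the standard code and DNA (coding-strand) codons; `classify_substitution` is a hypothetical helper written for this illustration, and the extra "stop-loss" label for a mutated stop codon is an addition not discussed in the text.

```python
from itertools import product

# Standard genetic code for DNA codons, enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_substitution(ref_codon, alt_codon):
    """Classify a codon change as synonymous, missense, nonsense, or stop-loss."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"                    # same amino acid (or stop) encoded
    if alt_aa == "*":
        return "nonsense"                      # premature stop codon introduced
    if ref_aa == "*":
        return "stop-loss"                     # original stop codon destroyed
    return "missense"                          # one amino acid replaced by another

if __name__ == "__main__":
    # Sickle cell substitution from the text: GAG (glutamic acid) -> GTG (valine).
    print(classify_substitution("GAG", "GTG"))
```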

Mechanisms of Formation

Mutations in coding regions arise through various biological and environmental processes that alter the DNA sequence, potentially leading to changes in the encoded protein. Spontaneous errors occur endogenously without external influences, primarily during DNA replication or due to chemical instability of bases. One common mechanism is deamination, where cytosine (C) loses an amino group to form uracil (U), which pairs with adenine (A) instead of guanine (G), resulting in a C-to-T transition mutation if unrepaired. This process is accelerated by heat and contributes significantly to spontaneous mutagenesis in single-stranded DNA regions, such as those exposed during replication in coding regions. Another spontaneous event is depurination, the hydrolysis of the glycosidic bond linking a purine base (adenine or guanine) to the deoxyribose sugar, leaving an apurinic (AP) site that can cause base substitution or frameshift mutations during replication as DNA polymerase inserts a random base opposite the lesion. These errors are particularly relevant in coding regions because even a single base change can disrupt the reading frame or amino acid sequence. Environmental agents induce mutations by directly damaging DNA bases or structure. Ultraviolet (UV) light from sunlight primarily affects coding regions in skin cells by causing the formation of cyclobutane pyrimidine dimers (CPDs), such as thymine dimers, where adjacent thymine bases covalently link, distorting the DNA helix and blocking replication or transcription. This damage leads to C-to-T or CC-to-TT transitions at dipyrimidine sites within exons, contributing to mutations in genes like those involved in tumor suppression. Chemical mutagens, such as alkylating agents (e.g., those found in tobacco smoke or chemotherapy drugs like temozolomide), add alkyl groups to DNA bases, primarily at the O6 position of guanine, forming O6-alkylguanine that mispairs with thymine during replication, yielding G-to-A transitions. 
These agents target nucleophilic sites on all four bases but disproportionately affect guanine-rich coding sequences, increasing the risk of oncogenic mutations in proto-oncogenes. Replication slippage represents a polymerase-dependent mechanism that generates insertion-deletion (indel) mutations, especially in coding regions containing repetitive sequences like microsatellites or trinucleotide repeats. During DNA replication, the polymerase temporarily dissociates and reassociates on repetitive templates, causing the nascent or template strand to loop out and misalign, resulting in extra or missing bases that, when their number is not a multiple of three, shift the reading frame and often produce truncated or nonfunctional proteins. This slippage is more frequent in homopolymeric runs or short tandem repeats within exons. Such events account for a substantial portion of small indels observed in coding regions, with rates influenced by the length and purity of the repeat tract. Transposon insertions provide another pathway for disrupting coding regions through the activity of mobile genetic elements. In humans, Alu elements, short interspersed nuclear elements (SINEs) comprising about 10% of the genome, propagate via RNA-mediated retrotransposition, where they are transcribed, reverse-transcribed into DNA, and reintegrated into the genome using target-primed reverse transcription (TPRT). When an Alu element inserts within an exon, it can introduce premature stop codons or frameshifts, inactivating the gene; for example, Alu insertions have been documented in at least 76 human disease-causing cases, including hemophilia and neurofibromatosis. These events occur at a low but ongoing rate and are biased toward AT-rich target sequences.
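Whether a slippage event shifts the reading frame depends only on whether the number of bases gained or lost is a multiple of three. The toy model below makes that concrete; it is a sketch, not a simulation of polymerase behavior, and the helpers `slip` and `preserves_frame` as well as the example sequences are invented for illustration.

```python
def slip(seq, unit, extra_copies=1):
    """Insert extra copies of a repeat unit at its first occurrence (toy slippage model)."""
    index = seq.find(unit)
    if index == -1:
        raise ValueError(f"repeat unit {unit!r} not found in sequence")
    return seq[:index] + unit * extra_copies + seq[index:]

def preserves_frame(original, mutated):
    """True if the length change is a multiple of three, so downstream codons are unchanged."""
    return (len(mutated) - len(original)) % 3 == 0

if __name__ == "__main__":
    trinucleotide_cds = "ATGCAGCAGCAGTAA"      # CAG repeat tract
    dinucleotide_cds = "ATGACACACGGTAA"        # AC repeat tract
    print(preserves_frame(trinucleotide_cds, slip(trinucleotide_cds, "CAG")))
    print(preserves_frame(dinucleotide_cds, slip(dinucleotide_cds, "AC")))
```

Gaining one CAG unit lengthens the protein by one residue but keeps the frame, whereas gaining one AC unit adds two bases and shifts every downstream codon.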

Repair and Prevention

Cells employ several DNA repair mechanisms to maintain the integrity of coding regions, which are critical for accurate protein synthesis. Base excision repair (BER) is a primary pathway that addresses small, non-helix-distorting base lesions in DNA, including those in coding sequences. This process initiates with the recognition and removal of damaged bases, such as uracil resulting from cytosine deamination, by specific DNA glycosylases. The resulting abasic site is then processed by an AP endonuclease, followed by nucleotide insertion and ligation to restore the original sequence, thereby preventing mutations that could alter the encoded protein. Mismatch repair (MMR) specifically targets errors introduced during DNA replication, such as base mismatches or small insertion/deletion loops in coding regions. The MMR system scans the newly synthesized strand, identifies distortions via proteins like MutS homologs, and excises the erroneous segment using MutL and exonuclease activities, followed by resynthesis. Defects in MMR genes, such as MLH1 or MSH2, lead to hereditary conditions like Lynch syndrome, characterized by increased mutation rates, including in coding sequences, and elevated risk of colorectal and other cancers. During DNA replication, proofreading by the exonuclease activity of DNA polymerases, such as polymerase δ and ε, provides an immediate error-correction step. This 3'→5' exonuclease removes mismatched nucleotides immediately after incorporation, enhancing replication fidelity by approximately 100- to 1000-fold and reducing the overall error rate to about 10^{-7} per nucleotide. Errors that evade proofreading are largely caught by subsequent MMR, ensuring high accuracy in coding region duplication. At the evolutionary level, the degeneracy of the genetic code serves as a preventive safeguard against deleterious mutations in coding regions. With most amino acids encoded by multiple synonymous codons, this redundancy allows many single-nucleotide changes to result in silent mutations that do not alter the protein sequence, thereby minimizing the harmful impact of many point mutations. 
This structural feature of the code is thought to have been shaped by evolution to buffer against mutational damage while preserving functional protein diversity.

Advanced Concepts

Constrained Coding Regions

Constrained coding regions (CCRs) are subsets of protein-coding sequences in the human genome that exhibit unusually low levels of genetic variation, indicating they are under strong purifying selection not solely attributable to their role in encoding proteins but also for additional non-protein functions such as maintaining mRNA secondary structure, serving as binding sites for microRNAs (miRNAs), or facilitating regulatory interactions like splicing enhancers. CCRs comprise approximately 2.5% of the coding sequence in the human genome but harbor a disproportionate share of pathogenic variants, suggesting evolutionary pressures to preserve multifunctional elements within the coding sequence. Prominent examples of such dual constraint include overlapping genes in viral genomes, where a single nucleotide sequence must encode multiple proteins under conflicting selective pressures. In human immunodeficiency virus type 1 (HIV-1), the gag and pol genes overlap by 241 nucleotides, with gag producing structural proteins and pol encoding enzymes like reverse transcriptase; this arrangement imposes purifying selection to balance protein functionality while allowing ribosomal frameshifting for pol expression. Another instance occurs with upstream open reading frames (uORFs) in eukaryotic mRNAs, particularly type 2 uORFs that initiate in the 5' UTR but extend into and overlap the main coding sequence (CDS), thereby constraining translation of the primary protein while potentially encoding regulatory peptides under purifying selection. Detection of CCRs often relies on measures of evolutionary conservation, such as the dN/dS ratio, which quantifies purifying selection in coding regions. 
The dN/dS ratio is calculated as the rate of nonsynonymous substitutions per nonsynonymous site (dN, changes that alter the amino acid) divided by the rate of synonymous substitutions per synonymous site (dS, silent changes that do not alter the amino acid), formally expressed as:

$$\frac{d_N}{d_S} = \frac{K_N / L_N}{K_S / L_S}$$

where K_N and K_S are the observed numbers of nonsynonymous and synonymous substitutions, respectively, and L_N and L_S are the numbers of potential nonsynonymous and synonymous sites; values of dN/dS < 1 indicate purifying selection, as fewer deleterious nonsynonymous changes are tolerated compared to neutral synonymous ones. In CCRs, dN/dS ratios are typically well below 1, reflecting dual constraints that limit variation beyond protein-level selection, though intraspecies variant depletion (e.g., from large-scale population data) provides complementary evidence of constraint. Mutations in CCRs often produce broader and more severe phenotypes than expected from protein disruption alone, due to the disruption of overlaid non-coding functions. For instance, variants in CCRs show a 7.1-fold enrichment in mutations associated with neurodevelopmental disorders, such as autism spectrum disorder, where altered mRNA stability or miRNA regulation exacerbates developmental impacts. This multifunctionality underscores why CCRs are hotspots for disease causality, with odds ratios for pathogenicity exceeding 160 in highly constrained segments.
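The dN/dS definition above reduces to simple arithmetic once the four counts are known. The sketch below applies the uncorrected formula exactly as stated in the text; the function name `dn_ds` and the example counts are hypothetical, and real analyses typically add a correction for multiple substitutions at the same site, which is omitted here.

```python
def dn_ds(k_n, l_n, k_s, l_s):
    """Uncorrected dN/dS = (K_N / L_N) / (K_S / L_S).

    k_n, k_s: observed nonsynonymous and synonymous substitutions.
    l_n, l_s: numbers of potential nonsynonymous and synonymous sites.
    """
    if l_n <= 0 or l_s <= 0:
        raise ValueError("site counts must be positive")
    if k_s == 0:
        raise ValueError("dS is zero; the ratio is undefined")
    d_n = k_n / l_n
    d_s = k_s / l_s
    return d_n / d_s

if __name__ == "__main__":
    # Made-up counts: 2 nonsynonymous changes over 200 sites, 5 synonymous over 100 sites.
    ratio = dn_ds(k_n=2, l_n=200, k_s=5, l_s=100)
    label = "purifying selection" if ratio < 1 else "no evidence of purifying selection"
    print(round(ratio, 3), label)
```

With these invented counts the ratio is about 0.2, well below 1, which under the interpretation given above would indicate purifying selection.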

Detection and Annotation

Detection and annotation of coding regions in genomes involve a combination of computational prediction and experimental validation to accurately identify the open reading frames (ORFs) that encode proteins. These methods are essential for genome annotation projects, as coding regions often comprise only a small fraction of eukaryotic genomes and are interspersed with non-coding sequences. Modern approaches integrate statistical modeling, comparative genomics, and empirical data to delineate exons, introns, and start and stop sites with high precision.

Ab initio prediction methods rely on intrinsic statistical properties of DNA sequences, such as codon usage biases and splice site signals, to predict gene structures without external evidence. A seminal tool, GENSCAN, employs a hidden Markov model (HMM) to model the probabilistic structure of genes, including exons, introns, and intergenic regions, by estimating parameters from training sets of known genes. GENSCAN identifies potential ORFs by scoring sequences based on codon statistics and HMM transitions, and in benchmark tests on human genes it identified roughly 75-80% of exons exactly. This approach is particularly useful for annotating genomes where experimental data are scarce, though it can overpredict short or atypical ORFs.

Evidence-based annotation leverages transcriptomic data to confirm and refine predicted coding regions. RNA sequencing (RNA-seq) generates reads from expressed mRNAs, which are aligned to the reference genome to map exon boundaries and identify spliced transcripts. Splice-aware aligners, designed for high-throughput spliced alignment, map RNA-seq reads to genomes efficiently by indexing splice junctions and handling multimapping reads, enabling the assembly of exons with sensitivities exceeding 90% for known transcripts in diverse species. Transcript assemblers such as StringTie then use these alignments to assemble and quantify transcripts, prioritizing those with strong read support to distinguish coding from non-coding regions.
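The ORF identification step that the prediction methods above build on can be illustrated with a naive six-frame scan. This is only a sketch: the function names (`reverse_complement`, `find_orfs`) and the `min_codons` threshold are hypothetical, it recognizes only ATG starts and the three standard stop codons, and it ignores introns, codon statistics, and the probabilistic modeling that tools such as GENSCAN add on top.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """Scan all six reading frames for ATG...stop open reading frames.

    Returns tuples (strand, frame, start, end, orf_sequence). Coordinates are
    0-based, end-exclusive, and refer to the scanned strand, so "-" strand
    positions are relative to the reverse complement. min_codons counts the
    start and stop codons as well.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open an ORF at the first ATG
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3                    # include the stop codon
                    if (end - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, end, s[start:end]))
                    start = None                   # look for the next ORF
    return orfs


for orf in find_orfs("CCATGGCTTAAGG"):
    print(orf)
```

For the short example sequence the scan reports a single forward-strand ORF, ATGGCTTAA, in frame 2. A scan like this overpredicts heavily on real genomes, which is why statistical scoring and transcript evidence are needed to separate genuine coding regions from chance ORFs.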
Comparative genomics exploits evolutionary conservation to pinpoint coding regions, as functional ORFs tend to be preserved across species due to selective pressure. The Basic Local Alignment Search Tool (BLAST) performs rapid sequence similarity searches between a query genome and related species, identifying conserved protein-coding segments through translated nucleotide alignments (e.g., TBLASTN). For visualization and integrated analysis, the UCSC Genome Browser displays multi-species alignments alongside conservation scores such as phastCons, which quantify nucleotide-level conservation using phylogenetic models; for instance, human coding exons often show phastCons scores above 0.9 in alignments of 100 vertebrates. This method enhances annotation accuracy by filtering out spurious predictions that fall outside conserved blocks.

Experimental validation provides direct confirmation of computational annotations, focusing on transcript boundaries and translation activity. Rapid amplification of cDNA ends (RACE) isolates the 5' and 3' untranslated regions (UTRs) adjacent to coding sequences, using gene-specific primers and anchored PCR to map transcription start and polyadenylation sites; 5' RACE, for example, has been used to precisely define the 5' ends of low-abundance mRNAs in genome annotation projects. Ribosome profiling (Ribo-seq) captures ribosome-protected mRNA fragments to identify actively translated ORFs, revealing translation initiation sites through footprint density and the 3-nucleotide periodicity characteristic of coding regions; this technique has validated thousands of novel coding regions in eukaryotes by confirming ribosomal occupancy beyond annotated boundaries.
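The 3-nucleotide periodicity used in ribosome profiling can be illustrated with a small frame-counting sketch. It assumes that ribosome footprints have already been reduced to P-site positions expressed relative to the first nucleotide of a candidate ORF; the function name `frame_fractions` and the input format are hypothetical, and real pipelines additionally calibrate P-site offsets per read length and apply statistical tests.

```python
from collections import Counter


def frame_fractions(psite_offsets):
    """Fraction of footprint P-sites falling in each codon frame (0, 1, 2).

    psite_offsets: integer positions relative to the first base of the
    candidate start codon (position 0 is frame 0).
    """
    counts = Counter(pos % 3 for pos in psite_offsets)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no footprints supplied")
    return [counts.get(frame, 0) / total for frame in range(3)]


# Toy data: most footprints sit on the first nucleotide of a codon.
fractions = frame_fractions([0, 3, 6, 9, 1, 12, 15, 2])
print(fractions)
```

With the toy offsets, six of eight footprints fall in frame 0, giving fractions of 0.75, 0.125, and 0.125. A strong skew toward one frame is the kind of signal that supports active translation of an ORF, whereas roughly equal fractions would be more consistent with non-translated RNA or background.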
