Nucleic acid sequence
A nucleic acid sequence is the linear order of nucleotides that forms the primary structure of deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), the fundamental carriers of genetic information in living organisms.[1] Each nucleotide consists of a nitrogenous base, a five-carbon sugar (deoxyribose in DNA or ribose in RNA), and a phosphate group. The bases are adenine (A), guanine (G), cytosine (C), and thymine (T) in DNA, with uracil (U) replacing thymine in RNA, and their specific linear order encodes instructions for biological processes.[1] By convention, these sequences are written and read from the 5' end to the 3' end, reflecting the directionality of the phosphodiester bonds that connect the nucleotides.[2] In DNA, the sequence typically forms a double-stranded helix in which complementary bases pair (A with T, G with C), enabling stable storage of genetic data across generations. RNA sequences are generally single-stranded and versatile, functioning in roles such as messenger RNA (mRNA) for protein synthesis or as regulatory non-coding RNAs.[1] The human genome, for instance, comprises approximately 3 billion base pairs of DNA sequence, underscoring the immense informational capacity of these molecules.[1] Changes in nucleic acid sequences, known as mutations, can give rise to genetic diversity, drive evolution, and cause disease, making sequence analysis central to fields like genomics and molecular biology.[3] Discovered in the late 19th century by Friedrich Miescher, nucleic acids were later recognized for their sequence-based coding of heredity through landmark experiments in the mid-20th century.[1]
Components and Representation
Nucleotides
Nucleotides are the monomeric units of nucleic acid sequences. Each consists of a nitrogenous base, a five-carbon (pentose) sugar, and one or more phosphate groups attached to the sugar.[4] The sugar is ribose in ribonucleic acid (RNA) and 2'-deoxyribose in deoxyribonucleic acid (DNA); deoxyribose lacks the hydroxyl group at the 2' carbon.[5] The phosphate group is typically linked to the 5' carbon of the sugar, forming a nucleoside monophosphate, though di- and triphosphate forms occur in metabolic contexts.

The nitrogenous bases are heterocyclic aromatic compounds classified into two main types: purines and pyrimidines. The purines, adenine (A) and guanine (G), have a double-ring structure (a six-membered pyrimidine ring fused to a five-membered imidazole ring) with nitrogen atoms at positions 1, 3, 7, and 9. Adenine has an amino group at position 6, while guanine has a carbonyl group at position 6 and an amino group at position 2.[6] The pyrimidines, cytosine (C), thymine (T), and uracil (U), have a single six-membered ring with nitrogens at positions 1 and 3. Cytosine has an amino group at position 4 and a carbonyl group at position 2; thymine has a methyl group at position 5 and carbonyl groups at positions 2 and 4; uracil has the same structure as thymine but lacks the methyl group. In DNA, the canonical bases are adenine, cytosine, guanine, and thymine, while in RNA, uracil substitutes for thymine.[4]

| Base | Type | DNA | RNA | Key Structural Features |
|---|---|---|---|---|
| Adenine (A) | Purine | Yes | Yes | Fused pyrimidine-imidazole rings; amino group at C6 |
| Guanine (G) | Purine | Yes | Yes | Fused rings; carbonyl at C6, amino at C2 |
| Cytosine (C) | Pyrimidine | Yes | Yes | Single ring; amino at C4, carbonyl at C2 |
| Thymine (T) | Pyrimidine | Yes | No | Single ring; methyl at C5, carbonyls at C2 and C4 |
| Uracil (U) | Pyrimidine | No | Yes | Single ring; carbonyls at C2 and C4 |
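
As a minimal sketch of how the table above might be encoded, the following Python snippet stores each base's class and the nucleic acids in which it occurs, then rewrites a DNA sequence in RNA letters by substituting U for T. The names `BASES`, `base_type`, `is_valid`, and `dna_to_rna_notation` are illustrative assumptions, not part of any library.

```python
# Illustrative lookup built from the table above; all names are hypothetical.
BASES = {
    "A": {"type": "purine", "dna": True, "rna": True},
    "G": {"type": "purine", "dna": True, "rna": True},
    "C": {"type": "pyrimidine", "dna": True, "rna": True},
    "T": {"type": "pyrimidine", "dna": True, "rna": False},
    "U": {"type": "pyrimidine", "dna": False, "rna": True},
}


def base_type(base: str) -> str:
    """Return 'purine' or 'pyrimidine' for a canonical base letter."""
    return BASES[base.upper()]["type"]


def is_valid(sequence: str, molecule: str) -> bool:
    """Check that every letter is a canonical base of `molecule` ('dna' or 'rna')."""
    return all(ch in BASES and BASES[ch][molecule] for ch in sequence.upper())


def dna_to_rna_notation(sequence: str) -> str:
    """Rewrite a DNA sequence in RNA letters by replacing T with U."""
    if not is_valid(sequence, "dna"):
        raise ValueError("not a canonical DNA sequence")
    return sequence.upper().replace("T", "U")


print(base_type("G"))               # purine
print(is_valid("AUCG", "dna"))      # False, because U is not a DNA base
print(dna_to_rna_notation("ATCG"))  # AUCG
```

This only changes the letters used to write the sequence; it does not model the chemical difference between deoxyribose and ribose.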
Notation Systems
Nucleic acid sequences are represented symbolically using single-letter abbreviations for the four standard nucleotide bases. In DNA, these are A for adenine, C for cytosine, G for guanine, and T for thymine. In RNA, uracil (U) replaces thymine, giving A, C, G, and U.[11] These abbreviations provide a compact notation for describing sequences in scientific literature and databases.

By convention, nucleic acid sequences are written in the 5' to 3' direction, reflecting the polarity of the sugar-phosphate backbone: the 5' end terminates in a phosphate group attached to the 5' carbon of the sugar, and the 3' end has a free hydroxyl group on the 3' carbon. This directionality matches replication and transcription, in which polymerases extend the new strand in the 5' to 3' direction. For example, a short DNA sequence might be denoted as 5'-ATCG-3', indicating the order of bases from the 5' end to the 3' end. RNA sequences follow the same convention, such as 5'-AUCG-3'. The prefixes 5' and 3' are often omitted when the direction is unambiguous, but explicit notation is used for clarity, especially in diagrams or when specifying strands.[11]

To handle uncertainty or variability in sequencing data, the International Union of Pure and Applied Chemistry (IUPAC) and the International Union of Biochemistry (IUB, now the International Union of Biochemistry and Molecular Biology, IUBMB) introduced ambiguity codes in their recommendations. These single-letter symbols represent groups of bases, allowing concise notation for degenerate or polymorphic sites. For instance, N denotes any base (A, C, G, or T/U), R specifies a purine (A or G), and Y indicates a pyrimidine (C or T/U). The full set of IUPAC nucleotide codes, including the four unambiguous bases, is as follows:

| Symbol | Bases Represented | Complementary Bases | Origin of Designation |
|---|---|---|---|
| A | A | T | Adenine |
| C | C | G | Cytosine |
| G | G | C | Guanine |
| T (DNA)/U (RNA) | T/U | A | Thymine/Uracil |
| R | A or G | Y | puRine |
| Y | C or T/U | R | pYrimidine |
| M | A or C | K | aMino |
| K | G or T/U | M | Keto |
| S | C or G | S | Strong (3 H-bonds) |
| W | A or T/U | W | Weak (2 H-bonds) |
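| H | A or C or T/U | D | not G (H follows G in the alphabet) |
| B | C or G or T/U | V | not A (B follows A in the alphabet) |
| V | A or C or G | B | not T/U (V follows U in the alphabet) |
| D | A or G or T/U | H | not C (D follows C in the alphabet) |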
| N | A or C or G or T/U | N | aNy |
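
The table can be treated as a mapping from each symbol to the set of bases it stands for. The Python sketch below, offered under that assumption, checks whether a concrete sequence is one of the sequences a degenerate pattern denotes and enumerates every sequence a short pattern can represent. The names `IUPAC`, `matches`, and `expand` are hypothetical, and U is folded into T so that a single table serves both DNA and RNA letters.

```python
from itertools import product

# Bases represented by each IUPAC symbol, copied from the table above
# (written with T; U is normalized to T before lookup).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}


def _normalize(text: str) -> str:
    """Upper-case the input and treat RNA's U as T."""
    return text.upper().replace("U", "T")


def matches(pattern: str, sequence: str) -> bool:
    """True if `sequence` is one of the sequences denoted by `pattern`."""
    pattern, sequence = _normalize(pattern), _normalize(sequence)
    if len(pattern) != len(sequence):
        return False
    return all(base in IUPAC[symbol] for symbol, base in zip(pattern, sequence))


def expand(pattern: str) -> list[str]:
    """List every concrete sequence denoted by `pattern` (grows exponentially)."""
    choices = [IUPAC[symbol] for symbol in _normalize(pattern)]
    return ["".join(combo) for combo in product(*choices)]


print(expand("RY"))                 # ['AC', 'AT', 'GC', 'GT']
print(matches("GTYRAC", "GTTAAC"))  # True
print(matches("GTYRAC", "GTAAAC"))  # False
```

Because `expand` enumerates every combination, it is only practical for short patterns; `matches` compares position by position and scales linearly with sequence length.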
Double-stranded DNA can be written with the complementary strand shown antiparallel to the first, for example 5'-ATCG-3' paired with 3'-TAGC-5'. This format highlights base pairing and is essential for representing genomic regions or restriction sites. Ambiguity codes can be applied to either strand, with complementary symbols used for the opposite strand (e.g., an R on one strand corresponds to a Y on the complement).[11]
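
The strand-pairing convention can be illustrated with a short Python sketch that uses the "Complementary Bases" column of the table. It complements each symbol, including ambiguity codes, and reverses the result so that the opposite strand is also written 5' to 3'. The names `COMPLEMENT`, `complement`, and `reverse_complement` are illustrative assumptions, and the mapping covers DNA letters only.

```python
# Complement of each IUPAC symbol, taken from the "Complementary Bases" column.
COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "R": "Y", "Y": "R", "M": "K", "K": "M",
    "S": "S", "W": "W",
    "H": "D", "D": "H", "B": "V", "V": "B",
    "N": "N",
}


def complement(sequence: str) -> str:
    """Complement each symbol in place; the result reads 3' to 5'."""
    return "".join(COMPLEMENT[symbol] for symbol in sequence.upper())


def reverse_complement(sequence: str) -> str:
    """Return the opposite strand written in the conventional 5' to 3' direction."""
    return complement(sequence)[::-1]


top = "ATCG"
print(f"5'-{top}-3'")
print(f"3'-{complement(top)}-5'")    # 3'-TAGC-5'
print(reverse_complement(top))       # CGAT
print(reverse_complement("GTYRAC"))  # GTYRAC
```

The last call returns its own input because the pattern GTYRAC reads the same on both strands, a property shared by many restriction-site patterns.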