Human genome

The human genome is the complete set of genetic information encoded in deoxyribonucleic acid (DNA) molecules found in the nucleus and mitochondria of human cells. The nuclear genome consists of approximately 3 billion base pairs per haploid set, organized into 23 pairs of chromosomes in diploid cells, and is accompanied by a small circular mitochondrial genome of 16,569 base pairs. This genetic blueprint provides the instructions for building and maintaining the human body, encompassing an estimated 19,000 to 20,000 protein-coding genes that account for roughly 1-2% of the total DNA sequence; the vast majority of the sequence consists of non-coding regions, regulatory elements, and repetitive sequences. Structurally, the nuclear human genome is diploid, meaning each cell contains two copies of each chromosome, one inherited from each parent, for a total of 46 chromosomes that vary in size from about 50 million to 250 million base pairs. DNA in the genome is composed of four nucleotide bases, adenine (A), thymine (T), guanine (G), and cytosine (C), arranged in a double-helix structure in which A pairs with T and G with C, enabling accurate replication and transmission of genetic information during cell division and reproduction. The mitochondrial genome, inherited solely from the mother, encodes 37 genes primarily involved in energy production and is present in multiple copies per cell, in contrast to the single copy of the nuclear genome per haploid set. Genetic variation within the human genome is small in proportion yet significant in effect: any two individuals share about 99.9% sequence identity, and the remaining 0.1% corresponds to approximately 3 to 5 million single-nucleotide variants, along with structural changes, that contribute to individual traits, disease susceptibility, and population diversity. These variations, including single nucleotide polymorphisms (SNPs), insertions, deletions, and copy number variations, are cataloged in resources like the 1000 Genomes Project, which has mapped common variants across global populations.
The elucidation of the human genome sequence was a monumental achievement of the Human Genome Project (HGP), an international collaboration launched in 1990 and completed in 2003, which generated a reference sequence from multiple anonymous donors, achieving over 99% coverage of the euchromatic regions with high accuracy. This reference has since been refined through efforts like the Telomere-to-Telomere Consortium, which in 2022 produced the first fully complete sequence, including previously intractable heterochromatic regions. The HGP's legacy extends to enabling technologies such as next-generation sequencing, which have accelerated discoveries in cancer genomics and diagnostics, transforming biology from a descriptive to a predictive science.
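The base-pairing rule described above (A with T, G with C) is what lets one strand determine the other. A minimal Python sketch of that rule follows; the function name `reverse_complement` is illustrative and not taken from any particular library, and the sketch assumes sequences contain only the four unambiguous bases.

```python
# Watson-Crick pairing: A<->T and G<->C (assumes only unambiguous bases).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    # The human telomeric repeat unit and its complementary strand.
    print(reverse_complement("TTAGGG"))
```

Applying the function twice returns the original sequence, which is a quick sanity check that the pairing table is symmetric.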

Physical characteristics

Genome size

The human genome comprises approximately 3 billion base pairs (bp) in its haploid configuration, representing the DNA content of one set of chromosomes. In diploid cells, which contain two copies of each chromosome, this doubles to about 6.1 billion bp of DNA per cell. The mitochondrial genome adds a small contribution of 16,569 bp, but the nuclear component dominates the overall size. Historical efforts to quantify the genome size began with the Human Genome Project (HGP), whose 2001 draft sequence assembled 2.91 billion bp, primarily covering the euchromatic regions and leaving gaps in repetitive regions. Subsequent refinements, including the Telomere-to-Telomere (T2T) Consortium's 2022 complete assembly (T2T-CHM13), established the precise haploid length at 3.055 billion bp by filling those previously unsequenced regions. Compared to other mammals, the human genome is slightly larger than that of the , estimated at about 3.0 billion bp in its initial draft sequence. The base composition of the human genome features approximately 41% GC content overall, reflecting the balance of guanine-cytosine and adenine-thymine pairs, though this varies across chromosomes, with AT-rich regions prominent in chromosomes such as 13, Y, and parts of the acrocentric chromosomes. This composition contributes to the physical properties of DNA, with the total diploid nuclear DNA mass equaling roughly 6.5 picograms per cell. A substantial portion of the genome's size stems from repetitive sequences, which extend the total length far beyond that of the coding regions.
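The size and mass figures above can be checked with back-of-the-envelope arithmetic. The sketch below doubles the T2T-CHM13 haploid length and converts base pairs to picograms. The average molar mass of 650 g/mol per base pair is a commonly used approximation and is an assumption here, not a figure from this article.

```python
AVOGADRO = 6.022e23          # molecules per mole
BP_MOLAR_MASS = 650.0        # g/mol per base pair (assumed approximation)
HAPLOID_BP = 3.055e9         # T2T-CHM13 haploid length quoted in the text


def diploid_size_bp(haploid_bp: float) -> float:
    """A diploid nucleus carries two haploid sets."""
    return 2 * haploid_bp


def dna_mass_picograms(n_bp: float) -> float:
    """Convert a base-pair count to mass in picograms."""
    grams = n_bp * BP_MOLAR_MASS / AVOGADRO
    return grams * 1e12


diploid_bp = diploid_size_bp(HAPLOID_BP)
mass_pg = dna_mass_picograms(diploid_bp)

if __name__ == "__main__":
    print(f"diploid size: {diploid_bp:.3e} bp")
    print(f"diploid mass: {mass_pg:.2f} pg")
```

Under these assumptions the diploid nuclear genome comes to about 6.1 billion bp and a little over 6.5 pg; a different assumed per-base-pair mass shifts the result proportionally.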

Chromosome structure

The human genome is organized into 23 pairs of chromosomes in diploid cells, totaling 46 chromosomes, consisting of 22 pairs of autosomes and one pair of sex chromosomes. Females typically have two X chromosomes (46,XX), while males have one X and one Y chromosome (46,XY). Gametes, such as sperm and eggs, are haploid and contain 23 chromosomes each, one from each pair, so that fertilization restores the diploid number. Chromosomes vary in size and morphology, with chromosome 1 being the largest at approximately 249 megabases (Mb) and chromosome 21 the smallest at about 48 Mb; the Y chromosome spans roughly 59 Mb. Each chromosome features specialized structures, including centromeres that serve as attachment points for spindle fibers during cell division, telomeres that cap the ends to prevent degradation and fusion, and regions of heterochromatin that appear as dense bands, often concentrated near centromeres and telomeres. Five pairs of acrocentric chromosomes (13, 14, 15, 21, and 22) have very short arms (p arms) primarily composed of repetitive DNA and containing clusters of ribosomal RNA genes essential for ribosome production. The overall arrangement of chromosomes is visualized through karyotyping, a technique that stains and photographs them during metaphase to display their number, size, and shape; G-banding, using Giemsa dye after trypsin treatment, produces characteristic light and dark bands for precise identification. Sex determination in humans is governed by the SRY gene on the Y chromosome, which triggers male development by initiating testis formation; its absence leads to female development. Deviations in chromosome number, known as aneuploidy, are numerical anomalies that can disrupt normal development; for example, trisomy 21, an extra copy of chromosome 21, causes Down syndrome, characterized by intellectual disability and distinctive physical features.

Molecular organization

Protein-coding genes

The human genome contains approximately 19,000–20,000 protein-coding genes, a figure refined through ongoing annotation efforts following the initial Human Genome Project (HGP) estimate of over 30,000 genes. These genes encode the proteins essential for cellular structure, function, and regulation, and their coding exons represent about 1.5% of the total genomic sequence. Protein-coding genes vary widely in size, with an average length of around 27 kilobase pairs (kbp), though some span up to 2.4 million base pairs, as seen in the dystrophin gene (DMD) on the X chromosome, which is crucial for muscle function. This variability arises primarily from the exon-intron architecture: exons, which carry the coding information, constitute only 1–2% of the genome and are typically short (averaging ~150 base pairs), while introns make up the bulk of each gene and are removed during RNA splicing. Alternative splicing of pre-mRNA allows a single gene to produce multiple protein isoforms; it occurs in approximately 95% of human multi-exon protein-coding genes and enables functional diversity without increasing gene count. Genes are not uniformly distributed across the 23 chromosome pairs; gene density is highest on chromosome 19, which has more than double the genome-wide average due to its compact structure and abundance of short, intron-poor genes. Certain genes cluster in families with related functions, such as the HOX genes, organized into four paralogous clusters (HOXA, HOXB, HOXC, and HOXD) on chromosomes 7, 17, 12, and 2, respectively, which play critical roles in embryonic development by specifying body patterning. Post-HGP annotation projects like GENCODE and Ensembl have been instrumental in identifying and refining these loci through integration of transcriptomic, proteomic, and comparative genomic data. Among protein-coding genes, roughly 2,000 are considered essential, meaning their complete loss leads to lethality in cellular or organismal models.
These can be distinguished from non-essential genes. A related distinction separates housekeeping genes, such as the one encoding GAPDH, which are expressed constitutively across all tissues to maintain basic cellular functions, from tissue-specific genes, like the insulin gene in pancreatic beta cells, which are restricted to particular cell types and respond to developmental or environmental cues. The expression of protein-coding genes is modulated by interactions with nearby non-coding regulatory elements, which fine-tune transcription in a context-dependent manner.
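The exon-intron architecture and alternative splicing described in this section can be illustrated with a toy model. The sketch below is a simplification under stated assumptions: exons are given as 0-based, end-exclusive intervals on a made-up pre-mRNA string, and the only splicing mode modeled is skipping of internal (cassette) exons, with the first and last exon always retained. The sequence and the function names are hypothetical.

```python
from itertools import combinations

# Toy pre-mRNA: exons in upper case, introns in lower case (made-up sequence).
PRE_MRNA = "ATGGCC" + "gtaagtcag" + "GATTAC" + "gtgagtag" + "TGGTAA"
EXONS = [(0, 6), (15, 21), (29, 35)]  # 0-based, end-exclusive intervals


def splice(pre_mrna, exons):
    """Join the exon intervals in order, discarding everything between them."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def exon_skipping_isoforms(exons):
    """Enumerate exon sets that keep the first and last exon and any subset
    of internal exons. Requires at least two exons."""
    first, *internal, last = exons
    isoforms = []
    for keep in range(len(internal), -1, -1):
        for kept in combinations(internal, keep):
            isoforms.append([first, *kept, last])
    return isoforms


if __name__ == "__main__":
    for isoform in exon_skipping_isoforms(EXONS):
        print(isoform, splice(PRE_MRNA, isoform))
```

With n internal exons this model yields 2^n isoforms, which hints at how a fixed gene count can still produce a much larger set of transcripts. Real splicing also involves alternative splice sites and intron retention, which are not modeled here.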

Non-coding functional elements

The human genome contains a diverse array of non-coding functional elements, which are DNA sequences that do not encode proteins but play essential roles in gene regulation, RNA processing, and other cellular processes. These elements include non-coding genes that produce functional RNAs and cis-regulatory sequences that control transcription. Together, they contribute to the complexity of gene expression patterns observed across human tissues and developmental stages. Non-protein-coding genes number approximately 59,000 in the human genome (GENCODE Release 49), encompassing various classes of functional RNAs. Long non-coding RNAs (lncRNAs) represent a major category, with 35,899 lncRNA genes identified (GENCODE Release 49), many of which regulate chromatin modification, transcription, and post-transcriptional processes; for instance, the lncRNA XIST is crucial for X-chromosome inactivation in females, coating and silencing one X chromosome. MicroRNAs (miRNAs), numbering around 2,600 mature forms, post-transcriptionally repress gene expression by binding to target mRNAs, collectively regulating an estimated 60% of human protein-coding genes. Small nuclear RNAs (snRNAs), such as U1, U2, U4, U5, and U6, form the core of the spliceosome and facilitate pre-mRNA splicing, with functional snRNA genes numbering in the dozens but supported by hundreds of genomic loci. Regulatory sequences constitute another key class of non-coding functional elements, directing the spatial and temporal expression of genes. Promoters, located upstream of transcription start sites, initiate RNA polymerase recruitment and often feature motifs like the TATA box for basal transcription or CpG islands associated with housekeeping genes. Enhancers, numbering in the hundreds of thousands to over one million across the genome, boost transcription from distant locations, sometimes megabases away, by looping to interact with promoters via chromatin folding; they are enriched in cell-type-specific transcription factor binding sites.
Silencers repress transcription in an analogous manner, while insulators prevent unwanted interactions between enhancers and non-target promoters, maintaining boundaries between chromatin domains. Untranslated regions (UTRs) of mRNAs, including 5' and 3' UTRs, modulate translation efficiency and mRNA stability. Ribosomal RNA (rRNA) genes, clustered in nucleolar organizer regions on acrocentric chromosomes, produce the structural components of ribosomes and number around 300–400 copies. Biochemical and evolutionary analyses indicate that approximately 8–10% of the human genome consists of functional sequence under purifying selection, a figure refined from early ENCODE findings that highlighted widespread regulatory activity. The ENCODE project's 2012 comprehensive mapping across cell types revealed pervasive transcription and chromatin marks in non-coding regions, though subsequent studies emphasized that only a subset demonstrates strict functional constraint. Some repetitive elements have been co-opted for regulatory roles, such as transposable elements acting as enhancers or lncRNA precursors. Many non-coding functional elements, particularly enhancers, exhibit evolutionary conservation of activity despite sequence divergence, reflecting selective pressure on function rather than primary sequence. Comparative genomics shows that enhancer-driven expression patterns are often preserved across mammals, enabling coordinated regulation during development, even as the underlying sequences evolve rapidly.

Repetitive and pseudogenic DNA

Pseudogenes are non-functional copies of genes that have accumulated disabling mutations, rendering them incapable of producing functional proteins. In the human genome, there are approximately 14,000 pseudogenes, which arise primarily through two mechanisms: gene duplication followed by degenerative mutations, or retrotransposition of processed mRNA lacking introns, resulting in intronless copies known as processed pseudogenes. Duplicated pseudogenes retain the original gene structure, including introns, while processed pseudogenes typically lack regulatory elements and are inserted randomly via reverse transcriptase activity from long interspersed nuclear elements (LINEs). A prominent example is the olfactory receptor gene family, which comprises roughly 800 loci, about half of which are pseudogenes, reflecting the evolutionary decay of sensory capabilities in humans compared to other mammals. Repetitive DNA sequences constitute about 50% of the human genome, encompassing various classes that contribute to genome expansion and occasionally disease susceptibility, though the majority are evolutionarily inert. Tandem repeats include microsatellites (1-6 bp units) and longer arrays such as telomeric repeats (TTAGGG)n at chromosome ends, which protect against degradation, and centromeric satellites like alpha satellites (171 bp monomers organized into higher-order repeats) that facilitate kinetochore assembly. Segmental duplications, comprising roughly 5-7% of the genome, are large (>1 kb) low-copy duplicates prone to copy-number variants and structural rearrangements due to their sequence similarity. Transposable elements, making up approximately 45% of the genome (and the majority of repetitive DNA), include long terminal repeat (LTR) retrotransposons (~8%), long interspersed nuclear elements (LINEs, ~20%, dominated by LINE-1), short interspersed nuclear elements (SINEs, ~13%, with Alu elements accounting for ~11%), and DNA transposons (~3%).
Alu elements, for instance, have expanded the genome through ~1 million insertions over the past 65 million years of primate evolution, often mediating recombination events. Historically termed "junk DNA" due to their apparent lack of protein-coding potential, repetitive and pseudogenic sequences were once viewed as evolutionary byproducts with no utility, a view originating from early genomic analyses highlighting the C-value paradox. That view has been qualified by evidence from projects like ENCODE: some repeats contribute to regulatory functions or genome organization. Nevertheless, the bulk of these elements, particularly ancient pseudogenes and fossilized transposons, remain functionally inert, serving primarily as substrates for mutation and genome plasticity without direct biochemical roles.
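Tandem repeats such as the telomeric (TTAGGG)n array are defined by back-to-back copies of a short unit, which makes them easy to detect in a sequence string. The following is a minimal sketch assuming exact, uninterrupted copies; real repeat finders tolerate mismatches and indels, which this does not. The function name `longest_tandem_run` is hypothetical.

```python
import re


def longest_tandem_run(seq: str, unit: str) -> int:
    """Return the largest number of consecutive exact copies of `unit` in `seq`."""
    pattern = re.compile(f"(?:{re.escape(unit.upper())})+")
    runs = [len(m.group(0)) // len(unit) for m in pattern.finditer(seq.upper())]
    return max(runs, default=0)


if __name__ == "__main__":
    # Made-up sequence: five telomeric units, a short spacer, then two more.
    toy = "ACGT" + "TTAGGG" * 5 + "CCA" + "TTAGGG" * 2
    print(longest_tandem_run(toy, "TTAGGG"))
```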

Genome sequencing

Historical milestones

Efforts to map the human genome began in the 1980s with the development of genetic linkage maps, which used restriction fragment length polymorphisms (RFLPs) to track inheritance patterns and locate genes on chromosomes. In 1980, researchers proposed using DNA polymorphisms to construct systematic genetic maps, enabling the localization of genes without prior knowledge of their sequences. By 1987, the first comprehensive human genetic linkage map was created, incorporating about 400 RFLPs across all chromosomes. Concurrently, physical mapping advanced through yeast artificial chromosomes (YACs), introduced in 1987, which allowed cloning and ordering of large DNA segments up to 1 megabase for high-resolution frameworks. The Human Genome Project (HGP), launched in October 1990, was a coordinated international effort led by the U.S. National Institutes of Health (NIH) and Department of Energy (DOE), with the Wellcome Trust as a major international partner, aiming to sequence the entire human genome over 15 years at an estimated cost of $3 billion. A key component was the Ethical, Legal, and Social Implications (ELSI) program, which allocated 5% of the project's annual budget to address bioethical concerns such as privacy, discrimination, and equitable access to genomic data. Progress accelerated with the sequencing of the first complete human chromosome in December 1999: chromosome 22, one of the smallest autosomes, was sequenced by an international consortium, spanning 33.4 megabases of euchromatin and revealing 545 protein-coding genes. Draft sequences were published in February 2001 by both the public HGP consortium and Celera Genomics, covering about 90% of the genome with varying accuracy. The project declared a "complete" sequence in April 2003, achieving 99% coverage of the euchromatic portion at an accuracy better than 99.99%. Following the HGP, the ENCODE project was initiated in September 2003 by the National Human Genome Research Institute (NHGRI) to systematically identify and annotate functional non-coding elements across the genome, continuing as an ongoing international collaboration.
The 1000 Genomes Project, launched in 2008 and completed in 2015, cataloged human genetic variation by sequencing 2,504 individuals from 26 populations, identifying over 88 million variants, including about 84.7 million single nucleotide polymorphisms. Technological advancements post-HGP dramatically reduced sequencing costs, from the $3 billion for the initial reference to under $1,500 per genome by late 2015, enabling broader genomic studies.

Sequencing technologies

The sequencing of the human genome has relied on successive generations of technologies, each advancing read length, throughput, and accuracy to enable comprehensive genomic analysis. First-generation sequencing, exemplified by the Sanger chain-termination method, served as the foundational approach for early large-scale projects. Developed in 1977, this technique involves the enzymatic synthesis of DNA strands in the presence of chain-terminating dideoxynucleotides, producing fragments that are separated by capillary electrophoresis to determine the sequence. Typical read lengths reach approximately 800 base pairs (bp), with an error rate below 0.001%, making it highly accurate but labor-intensive and low-throughput. Second-generation sequencing, often termed next-generation sequencing (NGS), introduced massively parallel approaches that dramatically increased throughput and reduced costs compared to Sanger methods. Platforms like Illumina's sequencing-by-synthesis utilize reversible terminator nucleotides and fluorescence detection to generate short reads of 50–300 bp, enabling billions of fragments to be sequenced simultaneously on a single flow cell. Early pyrosequencing methods, such as those from 454 Life Sciences, detected pyrophosphate release during nucleotide incorporation for similar short-read outputs. Initial error rates for NGS ranged from 0.1% to 1%, primarily due to base-calling inaccuracies in homopolymer regions, though advancements in chemistry and algorithms have improved this to below 0.01% in many applications. Third-generation sequencing technologies shifted toward single-molecule, long-read capabilities to address limitations in resolving repetitive and structural elements. Pacific Biosciences (PacBio) employs single-molecule real-time (SMRT) sequencing, where a DNA polymerase incorporates fluorescently labeled nucleotides in zero-mode waveguides, yielding continuous reads of 10–20 kilobases (kb) or longer, with high-fidelity (HiFi) modes achieving over 99.9% accuracy through circular consensus sequencing.
Oxford Nanopore sequencing uses protein nanopores to measure ionic current changes as DNA translocates through them, enabling real-time sequencing of reads up to megabases in length, though with higher raw error rates of 5–15% that are mitigated by consensus calling. Assembling these reads into a contiguous sequence presents distinct challenges depending on the approach and data type. De novo assembly reconstructs the genome without a reference, relying on overlapping reads to form contigs (continuous sequences) and scaffolds (contigs linked by paired-end information), but it struggles with short reads in repetitive regions, often resulting in fragmented outputs. Reference-guided assembly aligns reads to an existing reference to fill gaps and correct errors, improving contiguity for closely related samples but introducing biases from the reference. Assembly quality is commonly evaluated using the N50 statistic, which indicates the contig or scaffold length at which 50% of the assembly is covered by sequences of that length or longer, with higher N50 values signifying better continuity. Hybrid approaches integrate short-read NGS data with long-read third-generation sequences to leverage the strengths of both, enhancing overall accuracy and completeness. Long reads provide structural context to span complex regions, while short reads offer high-fidelity base calls to correct errors, reducing indel and mismatch rates by up to 65% in some pipelines. Tools like POLCA and NextPolish align short reads to draft long-read assemblies, achieving polished error rates below 0.01% without introducing substantial new errors. This combination has become standard for high-quality human genome drafts, balancing computational efficiency with biological fidelity.
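The N50 statistic mentioned above is straightforward to compute from a list of contig or scaffold lengths. The sketch below follows the usual definition (sort longest first, accumulate until half the total assembly length is reached); the function name and the example lengths are illustrative only.

```python
def n50(lengths):
    """Return the length L such that contigs of length >= L together
    cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Made-up contig lengths (arbitrary units); total is 290, half is 145.
    print(n50([80, 70, 50, 40, 30, 20]))
```

In the example, the two longest contigs (80 and 70) already cover 150 of 290 units, so the N50 is 70. Because N50 depends only on the length distribution, it says nothing about whether the contigs are assembled correctly, which is why it is usually reported alongside accuracy measures.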

Recent complete assemblies

In 2022, the Telomere-to-Telomere (T2T) Consortium published the first complete haploid assembly of a human genome, designated T2T-CHM13, which achieved gapless telomere-to-telomere coverage across all 22 autosomes and the X chromosome, totaling 3,054,815,472 base pairs. This assembly filled approximately 8% of the human genome that remained unresolved in prior references, equivalent to about 200 million base pairs, primarily in challenging repetitive regions such as centromeres, the short arms of acrocentric chromosomes, and telomeres. Building on this foundation, advancements from 2023 to 2025 produced haplotype-resolved assemblies from 65 diverse individuals, generating 130 high-quality haplotype assemblies with a contig N50 of about 130 Mb and closing 92% of gaps present in earlier assemblies like GRCh38. These efforts addressed longstanding issues in short-read sequencing, including repeat collapse, in which highly similar sequences are erroneously merged, by leveraging ultra-long reads from technologies like Oxford Nanopore and optical mapping from Bionano Genomics to accurately resolve complex structures. The T2T-CHM13 assembly alone resolved approximately 2,000 new genes within previously inaccessible repetitive regions, including centromeric regions and ribosomal DNA (rDNA) arrays that encode essential components of the ribosome. Subsequent multi-individual assemblies have enabled the discovery of novel centromeric variants and the complete resolution of complex gene clusters, enhancing understanding of epigenetic regulation and chromosomal stability. These complete assemblies provide a more accurate framework for variant interpretation, particularly in regions prone to structural variation.

Reference and variant genomes

Standard reference genome

The standard reference genome for humans is the Genome Reference Consortium human (GRCh) build, a linear, haploid representation maintained by the Genome Reference Consortium (GRC) to serve as a foundational resource for genomic analyses. GRCh38, released in December 2013, is the current major build as of 2025, with the latest non-coordinate-changing patch, GRCh38.p14, issued in February 2022 to incorporate fixes for assembly artifacts and other updates. This assembly spans approximately 3.1 gigabases in total, with about 2.9 gigabases of contiguous sequence and roughly 200 million base pairs represented as gaps, primarily in heterochromatic and repetitive regions that remain unassembled. GRCh38 evolved from GRCh37, released in February 2009 as a refinement of the Human Genome Project's earlier assemblies, with improvements in contiguity and accuracy derived from clone-based and next-generation sequencing data. Iterative updates since GRCh37 have focused on closing minor gaps, resolving misassemblies, and enhancing representation of polymorphic regions through mechanisms like FIX patches and alternate loci; GRCh38 includes 261 alternate loci across 178 genomic regions to accommodate common structural variants without disrupting the primary linear scaffold. These alternate loci, often haplotypes from diverse donors, allow for better handling of variant-rich areas such as the major histocompatibility complex (MHC). The sequence in GRCh38 derives from a limited set of donors: 93% comes from just 11 individuals, with approximately 70% from a single anonymous male of likely African-European admixed ancestry, resulting in a representation biased toward European genetic backgrounds despite some admixture. This reference underpins the vast majority of human genomic studies, enabling standardized mapping, variant discovery, and annotation in applications from clinical diagnostics to population genetics.
Despite its ubiquity, GRCh38's linear structure limits accurate alignment and detection of structural variants, particularly in repetitive or inverted regions, leading to reduced sensitivity in variant calling. To mitigate mapping artifacts from contaminants like bacterial or viral DNA, decoy sequences, such as synthetic satellites or phiX174, are often appended in practical implementations of the reference. Comprehensive annotation overlays enhance its utility, integrating RefSeq for gene models (annotating over 20,000 protein-coding genes in release 232 as of September 2025) and dbSNP for variant catalogs (over 1.2 billion distinct single nucleotide polymorphisms mapped to GRCh38 coordinates in build 157 as of March 2025). Remaining gaps, especially in pericentromeric heterochromatin, have been filled by subsequent efforts like the Telomere-to-Telomere Consortium's complete assembly. GRCh38 remains the primary linear reference for most genomic analyses as of 2025, while complete assemblies like T2T-CHM13 complement it in specialized applications.

Telomere-to-telomere completion

The telomere-to-telomere (T2T) completion of the human genome was achieved in 2022 with the T2T-CHM13 assembly, marking the first gapless, end-to-end sequence of the human nuclear genome. This assembly was derived from the CHM13hTERT cell line, obtained from a complete hydatidiform mole with a homozygous 46,XX karyotype of paternal origin, lacking maternal genetic contributions due to fertilization of an anucleate egg followed by duplication of the paternal genome. The resulting assembly fully resolves all 22 autosomes and the X chromosome as contiguous sequences, spanning 3,055 million base pairs. In 2023, a complete Y chromosome sequence (T2T-Y) was integrated, yielding the T2T-CHM13+Y reference that encompasses all 24 human chromosomes without gaps or unresolved regions. Compared to prior references like GRCh38, which left approximately 8% of the genome unsequenced due to repetitive complexity, T2T-CHM13 incorporates 238 million additional base pairs, of which 182 million represent entirely novel sequence. These additions primarily fill the short arms of the five acrocentric chromosomes (13, 14, 15, 21, and 22), where ~20 million base pairs consist of ribosomal DNA (rDNA) arrays that comprise 99% of the region and include 219 complete rDNA units; centromeric satellite arrays, dominated by alpha satellites and totaling ~200 million base pairs or about 6.5% of the genome; and distal telomeric tracts with their associated repetitive motifs. This resolution unveils the full architecture of these heterochromatic domains, previously inaccessible to short-read technologies. Analysis of the newly sequenced regions identified 63 novel protein-coding genes, 195 long non-coding RNAs (lncRNAs), and more than 100 pseudogenes, many located within segmental duplications and satellite arrays. These elements include duplicated copies of genes like those in the WASH complex, potentially influencing cellular trafficking and immune functions, thereby expanding the annotated functional repertoire of the human genome.
The T2T-CHM13 assembly was constructed using an integrated strategy leveraging high-accuracy, long-read sequencing technologies: circular consensus HiFi reads from the PacBio Sequel II for base-level precision in repetitive areas, ultralong Oxford Nanopore reads exceeding 100 kb for spanning large repeats, and Hi-C chromatin interaction data to order and orient contigs into chromosome-scale scaffolds. This hybrid approach ensured high contiguity (contig N50 > 25 Mb) and accuracy (>99.9%), validated through orthogonal mapping and optical genome mapping. By 2025, T2T-CHM13 has been established as the foundational complete reference in genomic studies, serving as a scaffold for variant calling, epigenetic profiling, and pangenome construction. The planned GRCh39 build has been indefinitely postponed by the Genome Reference Consortium.

Human pangenome initiative

The Human Pangenome Reference Consortium (HPRC), launched in 2019 and funded by the National Human Genome Research Institute (NHGRI), released an initial draft of the human pangenome reference in May 2023. This draft comprises 47 phased diploid genome assemblies derived from 47 genetically diverse individuals, primarily selected from the 1000 Genomes Project cohort to represent global population diversity across multiple ancestries. The consortium aims to expand this resource to include high-quality assemblies from at least 350 individuals, with ongoing efforts targeting over 100 additional diverse genomes by 2025 to further enhance representation. The pangenome is structured as a non-linear, graph-based reference, consisting of core genomic sequences aligned with "bubbles" that encapsulate variant paths, allowing for the representation of multiple haplotypes and structural variations without relying on a single linear sequence. This graph format, constructed using telomere-to-telomere methods, facilitates more accurate alignment of sequencing reads to diverse genomes. Compared to traditional linear references, the pangenome improves variant calling accuracy, particularly for small variants in non-European ancestries, by approximately 34%. In 2025, the HPRC advanced the initiative with the production of 130 haplotype-resolved assemblies from 65 diverse genomes, achieving a contig N50 of about 130 Mb and closing 92% of previously unresolved gaps in the human reference. These assemblies capture complex structural variants, such as large insertions and inversions, that are often missed by linear references, thereby providing a more complete depiction of human genomic diversity. Key benefits of the pangenome include the reduction of reference bias in variant calling, which minimizes mapping errors for underrepresented populations and enhances the detection of population-specific alleles. It also supports the development of more equitable polygenic risk scores by incorporating diverse data, improving predictive accuracy across global ancestries.
The resource integrates sequencing data from the 1000 Genomes Project and aligns with efforts from the All of Us Research Program to promote inclusive genomic analyses.
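The graph structure described in this section, shared core sequence with "bubbles" holding alternative variant paths, can be sketched with a very small data structure. The following is a toy illustration only: the node sequences, node identifiers, and the function `spell_paths` are made up, the graph is assumed to be acyclic, and real pangenome graphs use dedicated formats and tooling that this does not attempt to reproduce.

```python
# Toy sequence graph: node 1 and node 4 are shared core sequence;
# nodes 2 and 3 form a bubble representing two alleles at one site.
NODES = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: []}


def spell_paths(nodes, edges, start, end):
    """Enumerate the sequences spelled by every path from start to end.
    Assumes the graph is acyclic."""
    results = []

    def walk(node, prefix):
        seq = prefix + nodes[node]
        if node == end:
            results.append(seq)
            return
        for nxt in edges[node]:
            walk(nxt, seq)

    walk(start, "")
    return results


if __name__ == "__main__":
    for haplotype in spell_paths(NODES, EDGES, 1, 4):
        print(haplotype)
```

Each path through the bubble spells one haplotype, so a read carrying either allele can align to the graph without mismatches, which is the intuition behind the reduced reference bias described above.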

Genetic variation

Single nucleotide polymorphisms

Single nucleotide polymorphisms (SNPs) are the most prevalent form of genetic variation in the human genome, consisting of substitutions at a single nucleotide position in the DNA sequence among individuals. These point mutations occur where one of the four nucleotide bases (adenine, thymine, cytosine, or guanine) differs between individuals or compared to a reference sequence. SNPs with a minor allele frequency greater than 1% are considered common and are estimated to number around 4 to 5 million per individual genome. Across the global human population, approximately 84 million SNPs have been identified. SNPs are distributed throughout the genome at a frequency of roughly one per 300 to 1,000 base pairs, reflecting the overall sequence diversity. The majority, about 88%, reside in non-coding regions, which constitute the bulk of the genome, while fewer occur in protein-coding exons due to purifying selection pressures that limit variation in functional elements. This uneven distribution underscores how most SNPs likely influence gene regulation or non-coding functions rather than directly altering protein sequences. The discovery and cataloging of SNPs accelerated through the Human Genome Project (HGP), which provided the foundational reference sequence, and the launch of the dbSNP database in 1998 by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). dbSNP serves as a public repository for submitted genetic variations, now encompassing over 1.1 billion unique reference SNPs from ongoing genomic studies and sequencing efforts. A key feature of SNPs is their tendency to occur in linkage disequilibrium (LD), where certain combinations of alleles are inherited together more often than expected by chance, forming haplotype blocks that span segments of the genome and aid in association studies.
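The idea of alleles being "inherited together more often than expected by chance" (LD) is commonly quantified with the coefficient D and the squared correlation r². The sketch below computes both for two biallelic sites from phased haplotypes. The 0/1 allele encoding and the function name are assumptions made for illustration, and the haplotype lists are made-up data.

```python
def linkage_disequilibrium(haplotypes):
    """Compute (D, r^2) for two biallelic loci.

    `haplotypes` is a list of (a, b) pairs, where a and b are 0 or 1 and
    indicate which allele each phased haplotype carries at locus A and B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n           # frequency of allele 1 at A
    p_b = sum(b for _, b in haplotypes) / n           # frequency of allele 1 at B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                               # observed minus expected
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    return d, d * d / denom


if __name__ == "__main__":
    linked = [(1, 1)] * 5 + [(0, 0)] * 5               # alleles always co-occur
    unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]        # all combinations equally
    print(linkage_disequilibrium(linked))
    print(linkage_disequilibrium(unlinked))
```

An r² near 1 means one SNP is a near-perfect proxy for the other, which is why genotyping a subset of "tag" SNPs can capture much of the variation within a haplotype block.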
In terms of functional impacts, SNPs in coding regions are classified as synonymous, which do not change the amino acid sequence, or missense, which result in an amino acid substitution and potential protein dysfunction; roughly 3.5 million missense SNPs have been documented across human populations. Genome-wide association studies (GWAS) have leveraged SNPs to identify genetic links to complex traits, with over 5,000 SNPs associated with various traits reported by 2025 through large-scale meta-analyses. These associations highlight the role of SNPs in polygenic inheritance. Population-level patterns reveal greater SNP diversity in genomes of African ancestry: individuals of African ancestry carry approximately 4 million SNPs on average, about 20% more than individuals from non-African populations, due to historical demographic factors. This elevated variation in African genomes better captures the full spectrum of human genetic diversity. Additionally, efforts like the Human Pangenome Initiative improve SNP mapping by incorporating diverse reference assemblies beyond the standard linear genome.
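The synonymous versus missense distinction above comes down to whether the changed codon still encodes the same amino acid. The sketch below builds the standard genetic code table and classifies a single-codon change. The function name `classify_coding_snp` is hypothetical, and the "nonsense" category (a change that creates a stop codon) is an addition beyond the two classes named in the text.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # added category: premature stop codon
    return "missense"


if __name__ == "__main__":
    print(classify_coding_snp("GAG", "GTG"))  # Glu -> Val
    print(classify_coding_snp("GAA", "GAG"))  # Glu -> Glu
```

The GAG to GTG example is the glutamate-to-valine change, a textbook missense substitution; GAA to GAG leaves glutamate unchanged and is therefore synonymous.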

Structural variations

Structural variations (SVs) in the human genome are genomic alterations involving segments of DNA at least 50 base pairs in length, including copy-number variants (CNVs), insertions, deletions, inversions, and translocations. CNVs, which encompass duplications or deletions ranging from 1 kb to 5 Mb, represent a major class of SVs and affect approximately 12-20% of the genome across individuals. These variations arise primarily through mechanisms such as non-allelic homologous recombination (NAHR) within repetitive sequences like segmental duplications, as well as non-homologous end joining and replication-based errors. On average, a typical human genome harbors 20,000 to 26,000 SVs, collectively spanning millions of base pairs and contributing substantially to inter-individual genetic diversity. Detection of SVs has advanced significantly with technologies like array comparative genomic hybridization (array CGH), which identifies copy-number changes by comparing hybridization signals between reference and sample DNA, and long-read sequencing platforms such as PacBio and Oxford Nanopore, which resolve complex rearrangements through continuous DNA reads. Recent telomere-to-telomere (T2T) assemblies, including those released in 2025, have uncovered over 10,000 novel SVs previously missed in short-read data, particularly in repetitive regions like centromeres and subtelomeres. For instance, long-read sequencing of 1,019 diverse individuals revealed more than 100,000 biallelic SVs, highlighting the role of these methods in cataloging multiallelic variants. SV hotspots are enriched in regions of segmental duplications, such as near subtelomeres, where sequence similarity promotes NAHR and leads to recurrent rearrangements. A well-known example is the 22q11.2 deletion syndrome, caused by a 3-Mb deletion mediated by low-copy repeats, which affects about 1 in 4,000 individuals and results in developmental disorders such as DiGeorge syndrome.
Evolutionarily, SVs drive divergence between species; comparisons between human and chimpanzee genomes show that SVs account for over 3.5% of structural differences, impacting thousands of genes and contributing to approximately 7-10 million base pairs of net insertion/deletion variation beyond single-nucleotide differences.
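The 50-base-pair size threshold that opens this section can be made concrete with a short Python sketch. It assumes variants are given as explicit reference and alternate allele strings (in the style of a VCF record with sequence-resolved alleles); the function name and labels are hypothetical, and inversions, translocations, and symbolic alleles are deliberately out of scope because they cannot be recognized from allele lengths alone.

```python
SV_MIN_LENGTH = 50  # size threshold in base pairs, as quoted in the text


def classify_variant(ref, alt):
    """Label a variant from explicit reference and alternate allele strings."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    size = abs(len(alt) - len(ref))
    if size == 0:
        return "substitution"  # equal-length multi-base change
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    scale = "structural" if size >= SV_MIN_LENGTH else "small"
    return f"{scale} {kind}"
```

For example, `classify_variant("A", "G")` is labeled an SNV, a 3-bp loss such as `classify_variant("ACGT", "A")` a small deletion, and any length change of 50 bp or more a structural insertion or deletion.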

Population-level diversity

The human genome exhibits remarkable uniformity across individuals, with an average nucleotide diversity of approximately 0.1%, meaning that any two humans differ at about one base pair per 1,000 in their DNA sequences. This low level of variation underscores the shared ancestry of modern humans, yet subtle differences emerge when examining population-level patterns. Sub-Saharan African populations display the highest levels of heterozygosity, around 0.10%, reflecting their role as the cradle of human genetic diversity, while non-African groups show reduced heterozygosity of around 0.07% due to historical demographic events. The Out-of-Africa migration model explains much of this population structure: early humans leaving Africa approximately 60,000–70,000 years ago experienced serial founder effects and bottlenecks that diminished genetic diversity in descendant populations. For instance, Native American populations exhibit about 20% less heterozygosity than African groups, a consequence of these bottlenecks during the migration into the Americas via Beringia. Admixture events, such as those following colonial-era migrations, have further shaped contemporary diversity, introducing gene flow between previously isolated groups and creating mosaic ancestries in many regions. Large-scale genomic projects have cataloged this diversity to enable cross-population comparisons. The 1000 Genomes Project sequenced the genomes of 2,504 individuals from 26 populations across five continental groups, identifying over 88 million variants and revealing how allele frequencies vary by ancestry. More recently, the NIH's All of Us Research Program has generated over 400,000 whole-genome sequences from diverse U.S. participants as of early 2025, contributing to the program's goal of one million diverse participants and emphasizing underrepresented ancestries to address health disparities. These efforts highlight that African and admixed populations contribute disproportionately to novel variant discovery, underscoring the limitations of European-biased references.
In May 2025, the Human Pangenome Reference Consortium released an expanded dataset with high-quality phased genomes from over 200 diverse individuals, enhancing the capture of global genetic variation. Key metrics quantify inter-population differentiation. The fixation index (FST), a measure of population differentiation, averages around 0.12 globally among populations, indicating low but detectable structure primarily along continental lines. Runs of homozygosity (ROH), extended tracts of identical DNA inherited from both parents, serve as indicators of recent inbreeding or population bottlenecks; longer ROH are more prevalent in isolated or consanguineous groups, such as certain South Asian or Middle Eastern populations, and correlate with elevated risks for recessive disorders. By 2025, human pangenome initiatives have provided deeper insights into non-European diversity, demonstrating that genomes from African, Asian, and Indigenous ancestries harbor 20–30% more structural and sequence variants absent from traditional references like GRCh38. This disparity arises from reference biases that underrepresent non-European haplotypes, as shown by the Human Pangenome Reference Consortium's work, which integrates 47 diverse assemblies to better capture global variation.
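To make the differentiation measure mentioned above concrete, the sketch below computes FST for a single biallelic locus in two populations, using the textbook heterozygosity form FST = (H_T - H_S) / H_T. This is a simplified illustration under stated assumptions (two populations of equal size, Hardy-Weinberg expected heterozygosity, no sample-size correction); it is not the estimator used by the projects cited in this section, and the function names are hypothetical.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) for allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst_two_populations(p1, p2):
    """Single-locus FST from allele frequencies in two equal-sized populations."""
    p_bar = (p1 + p2) / 2.0                  # pooled allele frequency
    h_t = expected_heterozygosity(p_bar)     # total heterozygosity
    if h_t == 0:
        raise ValueError("locus is monomorphic in the pooled sample")
    # Mean within-population heterozygosity
    h_s = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2.0
    return (h_t - h_s) / h_t
```

Identical frequencies give an FST of 0, populations fixed for different alleles give 1, and frequencies of 0.4 and 0.6 give about 0.04. Genome-wide values such as the roughly 0.12 quoted above are averages over many loci.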

Biological implications

Genetic disorders

The human genome harbors variants that can lead to genetic disorders, which arise from alterations in DNA sequence, structure, or inheritance patterns that disrupt normal gene function and cellular processes. These disorders range from those caused by changes in a single gene to those influenced by multiple genetic loci interacting with environmental factors, highlighting the genome's role in disease susceptibility. Approximately 7,000 monogenic disorders are currently known, each typically resulting from mutations in a single gene that follow predictable Mendelian inheritance patterns. Monogenic disorders often stem from point mutations, such as single nucleotide polymorphisms (SNPs), or other specific alterations like repeat expansions. For instance, cystic fibrosis is primarily caused by SNPs and small deletions in the CFTR gene, the most common being the ΔF508 mutation, which impairs chloride ion transport across cell membranes. In contrast, Huntington's disease results from a trinucleotide repeat expansion in the HTT gene, where repeats exceeding 36 copies lead to toxic polyglutamine tract formation in the protein, causing progressive neurodegeneration. Polygenic disorders involve the cumulative effect of variants across numerous genes, often SNPs, contributing to trait variation and disease risk. Schizophrenia exemplifies this, with genome-wide association studies identifying over 100 independent risk loci, where common SNPs collectively explain a substantial portion of heritability through subtle disruptions in synaptic function and neurodevelopment. These polygenic risks underscore how widespread genomic variation, as detailed in the sections on single nucleotide polymorphisms and structural variations, underlies multifactorial conditions. Beyond point mutations and polygenic effects, other genomic mechanisms contribute to disorders, including trinucleotide repeat expansions, copy number variations (CNVs), and uniparental disomy (UPD).
Fragile X syndrome arises from CGG repeat expansion (>200 repeats) in the FMR1 gene's 5' untranslated region, leading to gene silencing and intellectual disability. DiGeorge syndrome (22q11.2 deletion syndrome) is frequently due to a 1.5–3 Mb heterozygous deletion CNV at chromosome 22q11.2, affecting multiple genes involved in immune, cardiac, and cognitive development. UPD, where both copies of a chromosome pair are inherited from one parent, can unmask recessive mutations or disrupt imprinting; for example, maternal UPD15 causes Prader-Willi syndrome because no active copy of the paternally expressed genes is present. Advances in whole-genome sequencing (WGS) have revolutionized diagnosis for undiagnosed cases, identifying causative variants in 30–50% of pediatric patients with suspected genetic disorders through comprehensive detection of SNPs, CNVs, and structural changes. Recent developments in gene editing, such as CRISPR-Cas9-based therapies, have led to approved treatments for monogenic disorders like sickle cell disease (as of 2023). Serious genetic disorders affect approximately 1% of live births worldwide, with consanguineous unions increasing the risk of autosomal recessive conditions by 2–3-fold due to higher homozygosity for rare deleterious variants.
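The repeat-expansion thresholds quoted in this section (more than 36 copies in HTT, more than 200 CGG copies in FMR1) can be illustrated by counting the longest uninterrupted run of a repeat unit in a sequence. The Python sketch below is a toy illustration, not a clinical genotyping method: the function and constant names are hypothetical, the CAG unit for HTT is supplied from general background rather than from the text above, and real repeat sizing has to handle interruptions, sequencing errors, and read length limits.

```python
import re

# Thresholds as quoted in the text; names are illustrative only.
HTT_CAG_THRESHOLD = 36
FMR1_CGG_THRESHOLD = 200


def longest_repeat_run(sequence, unit):
    """Largest number of consecutive copies of `unit` found in `sequence`."""
    unit = unit.upper()
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = (len(m.group()) // len(unit)
            for m in pattern.finditer(sequence.upper()))
    return max(runs, default=0)  # 0 when the unit never occurs


def exceeds_threshold(sequence, unit, threshold):
    """True when the longest run has more copies than `threshold`."""
    return longest_repeat_run(sequence, unit) > threshold
```

With a synthetic allele such as `"ATG" + "CAG" * 40 + "CCG"`, `longest_repeat_run(..., "CAG")` counts 40 copies, which `exceeds_threshold(..., "CAG", HTT_CAG_THRESHOLD)` flags as expanded. The left-to-right regex scan is adequate for units like CAG and CGG, which cannot overlap a shifted copy of themselves.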

Evolutionary dynamics

The human genome exhibits a high degree of sequence conservation with other primates, reflecting shared evolutionary ancestry. Comparative genomic analyses reveal approximately 98.8% sequence similarity between the human and chimpanzee genomes in aligned regions, underscoring the close phylogenetic relationship between lineages that diverged around 6-7 million years ago. Within this conserved framework, ultraconserved elements (non-coding sequences longer than 200 base pairs that show perfect identity across distantly related mammals) comprise a small fraction (~0.1%) of the genome but are under extremely strong purifying selection, which eliminates nearly all variants to maintain functional integrity. These elements, often located near genes involved in RNA processing and transcriptional regulation, highlight regions where evolutionary pressures have preserved sequence stability over tens of millions of years. Human-specific genomic changes have contributed to unique traits, particularly in cognition and communication. Mutations in the FOXP2 gene, a transcription factor implicated in neural development, include two amino acid substitutions that occurred after the human-chimpanzee split; these have been hypothesized to facilitate adaptations for articulate speech and language, though recent analyses find no strong evidence for recent positive selection. Similarly, genes like ASPM, which regulates brain size during development, have undergone accelerated evolution in the human lineage, showing signatures of positive selection that correlate with expanded cerebral cortex volume compared to other primates. These alterations represent targeted modifications in an otherwise conserved regulatory landscape, driving neurodevelopmental innovations. Gene duplications and losses account for roughly 3% of the human genome that is unique relative to chimpanzees, often conferring adaptive advantages.
A prominent example is the expansion of salivary amylase genes (AMY1), where humans possess multiple copies, up to 20 per diploid genome, facilitating efficient starch digestion, an adaptation linked to dietary shifts toward carbohydrate-rich foods in agricultural societies. Such structural variations, including segmental duplications totaling about 5% of the genome, have introduced novel gene families absent or reduced in other great apes, influencing traits including sensory perception. Evolutionary dynamics in the human genome are shaped by diverse selection pressures that have molded its variation. Positive selection has favored alleles like the lactase persistence variant (rs4988235, -13,910*T) in European populations, enabling adult lactose digestion and spreading rapidly within the last 10,000 years in response to dairying. Balancing selection maintains polymorphism in the major histocompatibility complex (MHC) region, promoting genetic diversity to enhance pathogen resistance through heterozygous advantage. Meanwhile, purifying selection dominates, eliminating approximately 70% of deleterious mutations before they reach fixation, thereby preserving genomic stability across the population. Admixture with archaic hominins has further enriched the human genome with adaptive variants. Non-African populations carry 1-4% Neanderthal-derived DNA, introduced via interbreeding events around 50,000-60,000 years ago, with segments enriched in immunity-related genes that likely conferred advantages against novel pathogens. In Oceanic populations, Denisovan admixture contributes up to 4-6% of the genome, particularly influencing high-altitude adaptation and immune functions. These introgressed sequences demonstrate how gene flow from archaic hominins has provided raw material for local adaptations without disrupting core genomic architecture.

Mitochondrial genome

The human mitochondrial genome is a distinct, maternally inherited circular DNA molecule separate from the nuclear genome, consisting of 16,569 base pairs that encode 37 genes, including 13 protein-coding genes essential for oxidative phosphorylation, 22 tRNA genes, and 2 rRNA genes, with no introns present. This compact structure includes a small non-coding control region (D-loop) that contains regulatory elements for replication and transcription, enabling efficient transcription of polycistronic RNAs from the coding regions that are subsequently processed into mature transcripts. Mitochondrial DNA (mtDNA) exists in multiple copies per cell, ranging from 100 to 10,000, with significant variation across tissues based on metabolic demands; for instance, oocytes contain exceptionally high numbers, often exceeding 100,000 copies, to support embryonic development. Inheritance is strictly uniparental and maternal, as sperm contribute negligible mtDNA during fertilization. Cells can nonetheless be heteroplasmic, harboring a mixture of wild-type and mutant mtDNA variants whose proportions can shift during development and cause mitochondrial disease when pathogenic variants exceed a threshold. The mtDNA mutation rate is approximately 10–17 times higher than that of nuclear DNA due to limited repair mechanisms and proximity to reactive oxygen species, facilitating rapid evolution and the formation of haplogroups that trace ancestry; for example, haplogroup L0 represents the most ancient root lineage found predominantly in African populations. The mtDNA was fully sequenced in 1981, providing the foundational Cambridge Reference Sequence, and as of 2025, ongoing research integrates this with nuclear-mitochondrial interactions, including nuclear mitochondrial DNA segments (NUMTs) that arise from mtDNA transfers to the nuclear genome and influence gene regulation and disease susceptibility. Recent studies also highlight epigenetic-like modifications on mtDNA, such as methylation patterns, that may modulate gene expression without altering the sequence.

Epigenomic modifications

Epigenomic modifications encompass heritable chemical and structural changes to DNA and associated proteins that influence gene expression without altering the underlying sequence. These modifications include DNA methylation, histone tail alterations, and the actions of non-coding RNAs, collectively forming the epigenome that responds to developmental cues, environmental signals, and cellular states. In humans, the epigenome exhibits tissue-specific patterns that fine-tune genomic function, with dysregulation implicated in diseases ranging from cancer to neurodevelopmental disorders. DNA methylation primarily occurs as 5-methylcytosine at CpG dinucleotides; the human genome contains approximately 28 million such sites, of which 70–80% are typically methylated in somatic cells. This modification represses transcription by inhibiting transcription factor binding or recruiting repressive protein complexes, and it plays critical roles in genomic imprinting, where parental alleles are differentially silenced, and in X-chromosome inactivation in females, which ensures dosage compensation for X-linked genes. Histone modifications, such as acetylation and methylation on the tails of core histones, further sculpt chromatin structure to either promote or inhibit DNA accessibility. For instance, H3K27 acetylation (H3K27ac) marks active enhancers by opening chromatin and facilitating transcription factor recruitment, while H3K9 trimethylation (H3K9me3) signals heterochromatin formation and transcriptional repression. The Roadmap Epigenomics Consortium's 2015 analysis of 111 reference human epigenomes integrated these marks with DNA methylation and chromatin accessibility data to define 25 recurrent chromatin states across cell types, revealing how combinatorial patterns dictate functional genomic regions like promoters and insulators. Non-coding RNAs, particularly long non-coding RNAs (lncRNAs), contribute to epigenomic regulation by guiding repressive complexes to target loci. For example, certain lncRNAs physically interact with Polycomb repressive complex 2 (PRC2) to direct H3K27 trimethylation and gene silencing, as observed in developmental processes where they maintain stable repression of Hox gene clusters.
Although the majority of the human epigenome remains stable, approximately 3–5% of the genome undergoes dynamic modifications that respond to environmental influences, such as nutritional stress. The Dutch Hunger Winter famine of 1944–45 demonstrated transgenerational effects, with prenatally exposed individuals exhibiting persistent hypomethylation at the imprinted IGF2 gene decades later, alongside increased metabolic disorder risk in their offspring. These changes highlight how external factors can propagate epigenetically across generations. Advances in single-cell epigenomics as of 2025 have enabled high-resolution mapping of modifications across over 100 cell types in human tissues, including the brain, where multimodal profiling of hundreds of thousands of nuclei identifies cell-type-specific regulatory states. Such studies also link global DNA hypomethylation, a hallmark of aging, to age-related phenotypes, with loss of methylation at repetitive elements correlating with increased genomic instability in older individuals.
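The CpG counts and methylated fractions quoted earlier in this section rest on two simple operations: finding CG dinucleotides in a sequence and summarizing per-site methylation calls. A minimal Python sketch follows; the function names are hypothetical, the sequence is assumed to be a single strand read 5' to 3', and the boolean calls stand in for the output of a methylation assay, which the text does not describe.

```python
def count_cpg_sites(sequence):
    """Number of CG dinucleotides on the given strand (case-insensitive)."""
    # "CG" cannot overlap itself, so a plain substring count is exact.
    return sequence.upper().count("CG")


def methylated_fraction(calls):
    """Share of CpG sites called methylated, given one boolean per site."""
    calls = list(calls)
    if not calls:
        raise ValueError("no CpG calls supplied")
    return sum(calls) / len(calls)
```

For the toy sequence `"ACGTCGCG"`, `count_cpg_sites` finds three sites, and calls of `[True, True, True, False]` give a methylated fraction of 0.75, which falls inside the 70–80% range the text gives for somatic cells.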