Human genome

The human genome is the complete set of genetic information encoded in deoxyribonucleic acid (DNA) molecules found in the nucleus and mitochondria of human cells. The nuclear genome consists of approximately 3 billion base pairs per haploid set, organized into 23 pairs of chromosomes, and is accompanied by a small circular mitochondrial genome of 16,569 base pairs.^[1]^[2] This genetic blueprint provides the instructions for building and maintaining the human body, encompassing an estimated 19,000 to 20,000 protein-coding genes that account for roughly 1-2% of the total DNA sequence, with the vast majority comprising non-coding regions, regulatory elements, and repetitive sequences.^[3]^[1]

Structurally, the nuclear human genome is diploid, meaning each somatic cell contains two copies of each chromosome, one inherited from each parent, totaling 46 chromosomes that vary in size from about 50 million to 250 million base pairs.^[4] DNA in the genome is composed of four nucleotide bases, adenine (A), thymine (T), guanine (G), and cytosine (C), arranged in a double-helix structure where A pairs with T and G with C, enabling accurate replication and transmission of genetic information during cell division and reproduction.^[1] The mitochondrial genome, inherited solely from the mother, encodes 37 genes primarily involved in energy production and is present in multiple copies per cell, contrasting with the single nuclear genome copy per haploid set.^[2]

Genetic variation within the human genome is minimal yet profound: any two individuals share about 99.9% sequence identity, differing by roughly 0.1%, or approximately 3 to 5 million single nucleotide variants together with structural changes, which contribute to individual traits, disease susceptibility, and population diversity.^[5]^[6] These variations, including single nucleotide polymorphisms (SNPs), insertions, deletions, and copy number variations, are cataloged in resources like the 1000 Genomes Project, which has mapped common variants across global populations to advance understanding of human evolution and health.^[6]

The elucidation of the human genome sequence was a monumental achievement of the Human Genome Project (HGP), an international collaboration launched in 1990 and completed in 2003, which generated a reference sequence from multiple anonymous donors, achieving over 99% coverage of the euchromatic regions with high accuracy.^[7] This reference has since been refined through efforts like the Telomere-to-Telomere Consortium, which in 2022 produced the first fully complete sequence, including previously intractable heterochromatic regions.^[8] The HGP's legacy extends to enabling genomics technologies such as next-generation sequencing, which have accelerated discoveries in personalized medicine, cancer genomics, and rare disease diagnostics, transforming biology from a descriptive to a predictive science.^[7]

Physical characteristics

Genome size

The human nuclear genome comprises approximately 3 billion base pairs (bp) in its haploid configuration, representing the DNA content of one set of chromosomes.^[9] In diploid somatic cells, which contain two copies of each chromosome, this doubles to roughly 6 billion bp of nuclear DNA per cell.^[2] The mitochondrial genome adds a small fraction, with 16,569 bp, but the nuclear component dominates the overall size.^[2]

Historical efforts to quantify the genome size began with the Human Genome Project (HGP), whose 2001 draft sequence assembled 2.91 billion bp, primarily covering the euchromatic regions and leaving gaps in repetitive heterochromatin.^[10] Subsequent refinements, including the Telomere-to-Telomere (T2T) Consortium's 2022 complete assembly (T2T-CHM13), established the precise haploid length at 3.055 billion bp by filling those previously unsequenced regions.^[8] For comparison, the human genome is slightly larger than that of the chimpanzee, estimated at about 3.0 billion bp in its initial draft sequence.^[11]

The base composition of the human genome features approximately 41% GC content overall, reflecting a balance of guanine-cytosine and adenine-thymine pairs, though this varies across chromosomes, with AT-rich regions prominent in chromosomes such as 13, Y, and parts of the acrocentric chromosomes.^[12]^[13] This composition contributes to the physical properties of DNA, with the total diploid nuclear DNA mass equaling roughly 6 picograms per cell.^[14] A substantial portion of the genome's size stems from repetitive sequences, which amplify the total length beyond coding regions.^[8]
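
The size and composition figures above can be checked with a short calculation. The sketch below is illustrative only: it assumes an average molar mass of about 650 g/mol per base pair (a common rule of thumb, not a value from the cited sources) and takes the T2T haploid length of 3.055 billion bp. The function names are hypothetical.

```python
AVOGADRO = 6.02214076e23        # molecules per mole
AVG_BP_MASS_G_PER_MOL = 650.0   # ASSUMPTION: rule-of-thumb average mass of one base pair


def dna_mass_picograms(base_pairs: float) -> float:
    """Approximate mass of double-stranded DNA of the given length, in picograms."""
    grams = base_pairs * AVG_BP_MASS_G_PER_MOL / AVOGADRO
    return grams * 1e12  # 1 g = 1e12 pg


def gc_content(sequence: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of a sequence."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(1 for b in bases if b in "GC") / len(bases)


haploid_bp = 3.055e9           # T2T-CHM13 haploid length quoted in the text
diploid_bp = 2 * haploid_bp    # two copies of each chromosome per somatic cell

if __name__ == "__main__":
    print(f"diploid length: {diploid_bp / 1e9:.2f} billion bp")
    print(f"diploid mass:   {dna_mass_picograms(diploid_bp):.1f} pg")
    print(f"GC of 'ATGCGC': {gc_content('ATGCGC'):.2f}")
```

Under these assumptions the diploid mass comes out between 6 and 7 pg, the same order as the per-cell figure quoted above; a different per-base-pair mass shifts the result proportionally.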

Chromosome structure

The human genome is organized into 23 pairs of chromosomes in diploid somatic cells, totaling 46 chromosomes, consisting of 22 pairs of autosomes and one pair of sex chromosomes.^[15] Females typically have two X chromosomes (46,XX), while males have one X and one Y chromosome (46,XY).^[15] Gametes, such as sperm and eggs, are haploid and contain 23 chromosomes each, one from each pair, to ensure proper diploid restoration upon fertilization.^[2]

Chromosomes vary in size and morphology, with chromosome 1 being the largest at approximately 249 megabases (Mb) and chromosome 21 the smallest autosome at about 48 Mb; the Y chromosome spans roughly 59 Mb.^[16] Each chromosome features specialized structures, including centromeres that serve as attachment points for spindle fibers during cell division, telomeres that cap the ends to prevent degradation and fusion, and regions of heterochromatin that appear as dense bands, often concentrated near centromeres and telomeres for structural stability.^[17] Five pairs of acrocentric chromosomes (13, 14, 15, 21, and 22) have very short arms (p arms) primarily composed of heterochromatin and containing clusters of ribosomal DNA genes essential for ribosome biogenesis.^[18]

The overall arrangement of chromosomes is visualized through karyotyping, a technique that stains and photographs them during metaphase to display their number, size, and shape; G-banding, using Giemsa dye after trypsin treatment, produces characteristic light and dark bands for precise identification.^[19] Sex determination in humans is governed by the SRY gene on the Y chromosome, which triggers male development by initiating testis formation; its absence leads to female development.^[20] Deviations in chromosome number, known as aneuploidy, are numerical anomalies that can disrupt normal development; for example, trisomy 21, an extra copy of chromosome 21, causes Down syndrome, characterized by intellectual disability and physical features like hypotonia.^[21]

Molecular organization

Protein-coding genes

The human genome contains approximately 19,000–20,000 protein-coding genes, a figure refined through ongoing annotation efforts following the initial Human Genome Project (HGP) estimate of over 30,000 genes.^[22]^[23]^[24] These genes encode the proteins essential for cellular structure, function, and regulation, representing about 1.5% of the total genomic sequence when considering only the coding exons.^[25]

Protein-coding genes vary widely in size, with an average length of around 27 kilobase pairs (kbp), though some span up to 2.4 million base pairs, as seen in the dystrophin gene on the X chromosome, which is crucial for muscle function.^[26]^[27] This variability arises primarily from the exon-intron architecture: exons, which carry the coding information, constitute only 1–2% of the genome and are typically short (average ~150 base pairs), while introns make up the bulk of gene length and are removed during RNA splicing.^[25] Alternative splicing of pre-mRNA allows a single gene to produce multiple protein isoforms, occurring in approximately 95% of human multi-exon protein-coding genes and enabling functional diversity without increasing gene count.^[28]

Genes are not uniformly distributed across the 23 chromosome pairs; gene density is highest on chromosome 19, which has more than double the genome-wide average due to its compact structure and abundance of short, intron-poor genes.^[29] Certain genes cluster in families with related functions, such as the Hox genes organized into four paralogous clusters (HoxA, HoxB, HoxC, and HoxD) on chromosomes 7, 17, 12, and 2, respectively, which play critical roles in embryonic development by specifying body patterning.^[30] Post-HGP annotation projects like GENCODE and Ensembl have been instrumental in identifying and refining these loci through integration of transcriptomic, proteomic, and comparative genomic data.^[31]^[23]

Among protein-coding genes, roughly 2,000 are considered essential, meaning their complete knockout leads to lethality in cellular or organismal models, often involving core processes like DNA replication and metabolism.^[32] Genes can also be classified by breadth of expression: housekeeping genes, such as those encoding actin or GAPDH, are expressed constitutively across all tissues to maintain basic cellular functions, while tissue-specific genes, like the insulin gene in pancreatic beta cells, are restricted to particular cell types and respond to developmental or environmental cues.^[33]^[34] The expression of protein-coding genes is modulated by interactions with nearby non-coding regulatory elements, which fine-tune transcription in a context-dependent manner.^[35]
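
The exon-intron architecture and alternative splicing described above can be illustrated with a toy model. This is a minimal sketch rather than a model of real splicing: it assumes exons are given as 0-based, end-exclusive intervals on a pre-mRNA string and treats alternative splicing purely as skipping of optional ("cassette") exons. The function names and the example sequence are hypothetical.

```python
from itertools import combinations


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Concatenate exon intervals in genomic order, discarding intronic sequence."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


def cassette_isoforms(pre_mrna: str,
                      exons: list[tuple[int, int]],
                      optional_indices: set[int]) -> set[str]:
    """Enumerate mature transcripts obtained by skipping any subset of optional exons."""
    ordered = sorted(exons)
    optional = sorted(optional_indices)
    isoforms = set()
    for k in range(len(optional) + 1):
        for skipped in combinations(optional, k):
            kept = [ex for i, ex in enumerate(ordered) if i not in skipped]
            isoforms.add(splice(pre_mrna, kept))
    return isoforms


if __name__ == "__main__":
    # Upper case marks exons, lower case marks introns (toy sequence).
    pre = "AAAgtccagBBBgtttagCCC"
    exons = [(0, 3), (9, 12), (18, 21)]
    print(splice(pre, exons))                              # AAABBBCCC
    print(sorted(cassette_isoforms(pre, exons, {1})))      # ['AAABBBCCC', 'AAACCC']
```

With n optional exons this model yields up to 2^n isoforms, which conveys how a fixed gene count can still produce a much larger protein repertoire.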

Non-coding functional elements

The human genome contains a diverse array of non-coding functional elements, which are DNA sequences that do not encode proteins but play essential roles in gene regulation, RNA processing, and other cellular processes. These elements include non-coding genes that produce functional RNAs and cis-regulatory sequences that control transcription. Together, they contribute to the complexity of gene expression patterns observed across human tissues and developmental stages.^[36] Non-protein-coding genes comprise approximately 59,000 in the human genome (GENCODE Release 49), encompassing various classes of functional RNAs.^[22] Long non-coding RNAs (lncRNAs) represent a major category, with 35,899 lncRNA genes identified (GENCODE Release 49), many of which regulate chromatin modification, transcription, and post-transcriptional processes; for instance, the lncRNA XIST is crucial for X-chromosome inactivation in females by coating and silencing one X chromosome. MicroRNAs (miRNAs), numbering around 2,600 mature forms, posttranscriptionally repress gene expression by binding to target mRNAs, collectively regulating an estimated 60% of human protein-coding genes. Small nuclear RNAs (snRNAs), such as U1, U2, U4, U5, and U6, form the core of the spliceosome and facilitate pre-mRNA splicing, with functional snRNA genes numbering in the dozens but supported by hundreds of genomic loci.^[37]^[38]^[39]^[40] Regulatory sequences constitute another key class of non-coding functional elements, directing the spatial and temporal expression of genes. Promoters, located upstream of transcription start sites, initiate RNA polymerase recruitment and often feature motifs like the TATA box for basal transcription or CpG islands associated with housekeeping genes. 
Enhancers, numbering in the hundreds of thousands to over one million across the genome, boost transcription from distant locations—sometimes megabases away—by looping to interact with promoters via chromatin folding; they are enriched in cell-type-specific transcription factor binding sites. Silencers repress transcription similarly, while insulators prevent unwanted interactions between enhancers and non-target promoters, maintaining boundaries in chromatin domains. Untranslated regions (UTRs) in mRNA precursors, including 5' and 3' UTRs, modulate translation efficiency and mRNA stability. Ribosomal RNA (rRNA) genes, clustered in nucleolar organizer regions on acrocentric chromosomes, produce the structural components of ribosomes and number around 300–400 copies.^[41]^[36]^[42] Biochemical and evolutionary analyses indicate that approximately 8–10% of the human genome consists of functional non-coding DNA under purifying selection, as refined from early ENCODE findings that highlighted widespread regulatory activity. The ENCODE project's 2012 comprehensive mapping across cell types revealed pervasive transcription and chromatin marks in non-coding regions, though subsequent studies emphasized that only a subset demonstrates strict functional constraint. Some repetitive elements have been co-opted for regulatory roles, such as acting as enhancers or lncRNA precursors.^[36]^[43]^[44] Many non-coding functional elements, particularly enhancers, exhibit evolutionary conservation of activity despite sequence divergence, reflecting selective pressure on function rather than primary sequence. Comparative genomics shows that enhancer-driven expression patterns are often preserved across mammals, enabling coordinated gene regulation during development, even as nucleotide compositions evolve rapidly.^[45]^[46]

Repetitive and pseudogenic DNA

Pseudogenes are non-functional copies of genes that have accumulated disabling mutations, rendering them incapable of producing functional proteins. In the human genome, there are approximately 14,000 pseudogenes, which arise primarily through two mechanisms: gene duplication followed by degenerative mutations, or retrotransposition of processed mRNA lacking introns, resulting in intronless copies known as processed pseudogenes. Duplicated pseudogenes retain gene structure including introns, while processed pseudogenes typically lack regulatory elements and are inserted randomly via reverse transcriptase activity from long interspersed nuclear elements (LINEs). A prominent example is the olfactory receptor gene family, which includes around 800 pseudogenes, reflecting the evolutionary decay of sensory capabilities in humans compared to other mammals.^[47] Repetitive DNA sequences constitute about 50% of the human genome, encompassing various classes that contribute to structural stability, genome expansion, and occasionally disease susceptibility, though the majority are evolutionarily inert. Short tandem repeats include microsatellites (1-6 bp units) and longer arrays such as telomeric repeats (TTAGGG)n at chromosome ends, which protect against degradation, and centromeric satellites like alpha satellites (171 bp monomers organized into higher-order repeats) that facilitate kinetochore assembly. Segmental duplications, comprising roughly 5-7% of the genome, are large (>1 kb) low-copy duplicates prone to copy-number variants and structural rearrangements due to their sequence similarity. Transposable elements, making up approximately 45% of the genome (and the majority of repetitive DNA), include long terminal repeat (LTR) retrotransposons (~8%), long interspersed nuclear elements (LINEs, ~20%, dominated by LINE-1), short interspersed nuclear elements (SINEs, ~13%, with Alu elements accounting for ~11%), and DNA transposons (~3%). 
Alu elements, for instance, have expanded the genome through ~1 million insertions over the past 65 million years of primate evolution, often mediating recombination events.^[48]^[49]

Historically termed "junk DNA" due to their apparent lack of protein-coding potential, repetitive and pseudogenic sequences were once viewed as evolutionary byproducts with no utility, a concept originating from early genomic analyses highlighting the C-value paradox. That view has been qualified by evidence from projects like ENCODE: some repeats contribute to regulatory functions or chromatin organization. Even so, the bulk of these elements, particularly ancient pseudogenes and fossilized transposons, show no evidence of function and serve primarily as substrates for mutation and genome plasticity without direct biochemical roles.
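
Tandem repeats such as the telomeric (TTAGGG)n arrays mentioned above are simple enough to locate with a regular expression. The sketch below is an illustration under simplifying assumptions: it matches only perfect, uninterrupted copies of a given unit, and the function name and `min_copies` threshold are hypothetical choices, not part of any standard tool.

```python
import re


def find_tandem_repeats(sequence: str, unit: str = "TTAGGG", min_copies: int = 3):
    """Return (start, end, copies) for each run of at least `min_copies` perfect copies of `unit`.

    Coordinates are 0-based and end-exclusive. Interrupted or degenerate repeats are
    not detected; real repeat annotation uses more tolerant methods.
    """
    pattern = re.compile("(?:" + re.escape(unit.upper()) + "){" + str(min_copies) + ",}")
    hits = []
    for match in pattern.finditer(sequence.upper()):
        copies = (match.end() - match.start()) // len(unit)
        hits.append((match.start(), match.end(), copies))
    return hits


if __name__ == "__main__":
    toy = "ACGT" + "TTAGGG" * 4 + "CCA" + "TTAGGG" * 2
    print(find_tandem_repeats(toy))  # [(4, 28, 4)]
```

The second run in the toy sequence has only two copies and is therefore below the threshold, which is why a single hit is reported.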

Genome sequencing

Historical milestones

Efforts to map the human genome began in the 1980s with the development of genetic linkage maps, which used restriction fragment length polymorphisms (RFLPs) to track inheritance patterns and locate genes on chromosomes.^[50] In 1980, researchers proposed using DNA polymorphisms to construct systematic genetic maps, enabling the localization of disease genes without prior knowledge of their sequences.^[51] By 1987, the first comprehensive human genetic linkage map was created, incorporating 400 RFLPs across all chromosomes.^[50] Concurrently, physical mapping advanced through yeast artificial chromosomes (YACs), introduced in 1987, which allowed cloning and ordering of large DNA segments up to 1 megabase for high-resolution genome frameworks.^[52]

The Human Genome Project (HGP), launched in October 1990, represented a coordinated international effort led by the U.S. National Institutes of Health (NIH) and the Wellcome Trust, aiming to sequence the entire human genome over 15 years at an estimated cost of $3 billion.^[7] A key component was the Ethical, Legal, and Social Implications (ELSI) program, which allocated 5% of the project's annual budget to address bioethical concerns such as privacy, discrimination, and equitable access to genomic data.^[53] Progress accelerated with the sequencing of the first complete human chromosome in December 1999; chromosome 22, the second-smallest autosome, was sequenced by an international consortium, spanning 33.4 megabases of euchromatin and revealing 545 protein-coding genes.^[54] Draft sequences were published in February 2001 by both the public HGP consortium and Celera Genomics, covering about 90% of the genome with varying accuracy.^[55] The project declared a "complete" sequence in April 2003, achieving 99% coverage of the euchromatic portion at an accuracy better than 99.99%.^[56]

Following the HGP, the Encyclopedia of DNA Elements (ENCODE) project was initiated in September 2003 by the National Human Genome Research Institute (NHGRI) to systematically identify and annotate functional non-coding elements across the genome, continuing as an ongoing international collaboration.^[57] The 1000 Genomes Project, launched in 2008 and completed in 2015, cataloged human genetic variation by sequencing 2,504 individuals from 26 populations, identifying over 88 million variants, of which about 84.7 million were single nucleotide polymorphisms.^[58] Technological advancements post-HGP dramatically reduced sequencing costs, from the $3 billion for the initial reference to under $1,500 per genome by late 2015, enabling broader genomic studies.^[59]

Sequencing technologies

The sequencing of the human genome has relied on successive generations of technologies, each advancing read length, throughput, and accuracy to enable comprehensive genomic analysis. First-generation sequencing, exemplified by the Sanger chain-termination method, served as the foundational approach for early large-scale projects. Developed in 1977, this technique involves the enzymatic synthesis of DNA strands in the presence of chain-terminating dideoxynucleotides, producing fragments that are separated by capillary electrophoresis to determine the sequence.^[60] Typical read lengths reach approximately 800 base pairs (bp), with an error rate below 0.001%, making it highly accurate but labor-intensive and low-throughput.^[61]^[62] Second-generation sequencing, often termed next-generation sequencing (NGS), introduced massively parallel approaches that dramatically increased throughput and reduced costs compared to Sanger methods. Platforms like Illumina's sequencing-by-synthesis utilize reversible terminator nucleotides and fluorescence detection to generate short reads of 50–300 bp, enabling billions of fragments to be sequenced simultaneously on a single flow cell.^[63] Early pyrosequencing methods, such as those from 454 Life Sciences, detected pyrophosphate release during nucleotide incorporation for similar short-read outputs.^[63] Initial error rates for NGS ranged from 0.1% to 1%, primarily due to base-calling inaccuracies in homopolymer regions, though advancements in chemistry and algorithms have improved this to below 0.01% in many applications.^[64]^[65] Third-generation sequencing technologies shifted toward single-molecule, long-read capabilities to address limitations in resolving repetitive and structural elements. 
Pacific Biosciences (PacBio) employs single-molecule real-time (SMRT) sequencing, where DNA polymerase incorporates fluorescently labeled nucleotides in zero-mode waveguides, yielding continuous reads of 10–20 kilobases (kb) or longer with high-fidelity (HiFi) modes achieving over 99.9% accuracy through circular consensus.^[66]^[67] Oxford Nanopore Technologies uses protein nanopores to measure ionic current changes as DNA translocates through, enabling real-time sequencing of reads up to megabases in length, though with higher raw error rates of 5–15% that are mitigated by consensus polishing.^[68]^[69] Assembling these reads into a contiguous genome sequence presents distinct challenges depending on the approach and data type. De novo assembly reconstructs the genome without a reference, relying on overlapping reads to form contigs (continuous sequences) and scaffolds (contigs linked by paired-end information), but it struggles with short reads in repetitive regions, often resulting in fragmented outputs.^[70] Reference-guided assembly aligns reads to an existing genome to fill gaps and correct errors, improving contiguity for closely related samples but introducing biases from the reference.^[71] Assembly quality is commonly evaluated using the N50 statistic, which indicates the contig or scaffold length at which 50% of the genome is covered by sequences of that length or longer, with higher N50 values signifying better continuity.^[72] Hybrid approaches integrate short-read NGS data with long-read third-generation sequences to leverage the strengths of both, enhancing overall accuracy and completeness. 
Long reads provide structural scaffolding to span complex regions, while short reads offer high-fidelity polishing to correct errors, reducing indel and mismatch rates by up to 65% in some pipelines.^[73] Tools like POLCA and NextPolish align short reads to draft long-read assemblies, achieving polished error rates below 0.01% without introducing substantial new errors.^[74] This combination has become standard for high-quality human genome drafts, balancing computational efficiency with biological fidelity.^[75]
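
The N50 statistic described above is straightforward to compute from a list of contig or scaffold lengths. The following sketch uses the common convention of measuring against the total assembly length; when an estimated genome size is used as the denominator instead, the statistic is usually called NG50. The function name is hypothetical.

```python
def n50(lengths: list[int]) -> int:
    """Length L such that sequences of length >= L contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("n50() requires at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable: cumulative sum always reaches half of the total")


if __name__ == "__main__":
    contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]   # toy contig lengths, total 54
    print(n50(contigs))                       # 8, since 10 + 9 + 8 = 27 reaches half of 54
```

Because N50 depends only on the length distribution, it says nothing about correctness: a misassembly that wrongly joins two contigs raises N50, so it is normally reported alongside accuracy measures.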

Recent complete assemblies

In 2022, the Telomere-to-Telomere (T2T) Consortium published the first complete haploid assembly of a human genome, designated T2T-CHM13, which achieved gapless telomere-to-telomere coverage across all 22 autosomes and the X chromosome, totaling 3,054,815,472 base pairs.^[8] This assembly filled approximately 8% of the human genome that remained unresolved in prior references, equivalent to about 200 million base pairs, primarily in challenging repetitive regions such as centromeres, the short arms of acrocentric chromosomes, and telomeres.^[8]

Building on this foundation, advancements from 2023 to 2025 produced haplotype-resolved assemblies from 65 diverse individuals, generating 130 high-quality haplotype assemblies with a median contig length of 130 Mb and closing 92% of gaps present in earlier assemblies like GRCh38.^[76] These efforts addressed longstanding issues in short-read sequencing, including repeat collapse, where highly similar sequences were erroneously merged, by leveraging ultra-long reads from technologies like Oxford Nanopore and optical mapping from Bionano Genomics to accurately resolve complex structures.^[8]^[76]

The T2T-CHM13 assembly alone added approximately 2,000 new gene predictions within previously inaccessible repetitive regions and resolved the ribosomal DNA (rDNA) arrays on the acrocentric short arms, which encode essential components for ribosome biogenesis.^[8] Subsequent multi-individual assemblies have enabled the discovery of novel centromeric variants and complete annotation of gene clusters, such as the full histone gene array on chromosome 6, enhancing understanding of epigenetic regulation and chromosomal stability.^[76] These complete assemblies provide a more accurate framework for variant interpretation and functional genomics studies, particularly in regions prone to structural variation.^[8]^[76]

Reference and variant genomes

Standard reference genome

The standard reference genome for humans is the Genome Reference Consortium human (GRCh) build, a linear, haploid representation maintained by the Genome Reference Consortium (GRC) to serve as a foundational coordinate system for genomic analyses. GRCh38, released in December 2013, is the current major build as of 2025, with the latest non-coordinate-changing patch, GRCh38.p14, issued in February 2022 to incorporate fixes for assembly artifacts and sequence updates. This assembly spans approximately 3.1 gigabases in total size, with about 2.9 gigabases of contiguous sequence and roughly 200 million base pairs represented as gaps, primarily in heterochromatic and repetitive regions that remain unassembled.^[77]^[78] GRCh38 evolved from GRCh37, released in February 2009 as a refinement of the Human Genome Project's earlier assemblies, with improvements in contiguity and accuracy derived from clone-based and next-generation sequencing data. Iterative updates since GRCh37 have focused on closing minor gaps, resolving misassemblies, and enhancing representation of polymorphic regions through mechanisms like FIX patches and alternate loci; GRCh38 includes 261 alternate loci across 178 genomic regions to accommodate common structural variants without disrupting the primary linear scaffold. These alternate loci, often haplotypes from diverse donors, allow for better handling of variant-rich areas such as the major histocompatibility complex.^[79]^[80]^[81] The sequence in GRCh38 derives from a limited set of donors—93% from just 11 individuals, with approximately 70% from a single anonymous male of likely African-European admixed ancestry—resulting in a representation biased toward European genetic backgrounds despite some admixture. 
This reference underpins the vast majority of human genomic studies, enabling standardized mapping, variant discovery, and annotation in applications from clinical diagnostics to population genetics.^[82]^[83]^[84]

Despite its ubiquity, GRCh38's linear structure limits accurate alignment and detection of structural variants, particularly in repetitive or inverted regions, reducing variant-calling sensitivity relative to graph-based or gapless references. To mitigate mapping artifacts from contaminants like bacterial or viral DNA, decoy sequences, such as synthetic satellites or phiX174, are often appended in practical implementations of the reference. Comprehensive annotation overlays enhance its utility, integrating RefSeq for gene models (annotating over 20,000 protein-coding genes in release 232 as of September 2025) and dbSNP for variant catalogs (over 1.2 billion distinct single nucleotide polymorphisms mapped to GRCh38 coordinates in build 157 as of March 2025). Remaining gaps, especially in pericentromeric heterochromatin, have been filled by subsequent efforts like the Telomere-to-Telomere Consortium's complete assembly. GRCh38 remains the primary linear reference for most genomic analyses as of 2025, while complete assemblies like T2T-CHM13 complement it in specialized applications.^[85]^[86]^[87]^[88]^[89]
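
In FASTA distributions of reference assemblies, unresolved gaps are conventionally written as runs of the letter N, so the gap content discussed above can be measured directly from sequence text. The sketch below is a minimal illustration operating on an in-memory string; the function name and the toy sequence are hypothetical, and a real chromosome would be streamed from a file rather than held as one literal.

```python
import re


def gap_summary(sequence: str) -> dict:
    """Summarize runs of N (unresolved bases) in a sequence string."""
    runs = [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", sequence)]
    gap_bp = sum(end - start for start, end in runs)
    return {
        "length": len(sequence),
        "gap_runs": len(runs),
        "gap_bp": gap_bp,
        "gap_fraction": gap_bp / len(sequence) if sequence else 0.0,
    }


if __name__ == "__main__":
    toy = "ACGTNNNNACGTNNAC"   # 16 bases with two gaps totaling 6 N
    print(gap_summary(toy))
```

Applied to a gapless assembly, the same function would report zero gap runs, which is one simple way to contrast a gapped linear reference with a telomere-to-telomere assembly.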

Telomere-to-telomere completion

The telomere-to-telomere (T2T) completion of the human genome was achieved in 2022 with the T2T-CHM13 assembly, marking the first gapless, end-to-end sequence of the human nuclear genome. This assembly was derived from the CHM13hTERT cell line, obtained from a complete hydatidiform mole with a homozygous 46,XX karyotype of paternal origin, lacking maternal genetic contributions due to fertilization of an anucleate oocyte followed by paternal genome duplication. The resulting sequence fully resolves all 22 autosomes and the X chromosome in a single, contiguous haplotype, spanning 3,055 million base pairs. In 2023, a complete Y chromosome assembly (T2T-Y) was integrated, yielding the T2T-CHM13+Y reference that encompasses all 24 human chromosomes without gaps or unresolved regions.^[8]^[90]

Compared to prior references like GRCh38, which left approximately 8% of the genome unsequenced due to repetitive complexity, T2T-CHM13 incorporates 238 million additional base pairs, of which 182 million represent entirely novel sequence. These additions primarily fill the short arms of the five acrocentric chromosomes (13, 14, 15, 21, and 22), where ~20 million base pairs consist of ribosomal DNA (rDNA) arrays that comprise 99% of the region and include 219 complete rDNA units; centromeric satellite arrays, dominated by alpha satellites and totaling ~200 million base pairs or about 6.5% of the genome; and distal telomere tracts with their associated repetitive motifs. This resolution unveils the full architecture of these heterochromatic domains, previously inaccessible to short-read technologies.^[8]

Analysis of the newly sequenced regions identified 63 novel protein-coding genes, 195 long non-coding RNAs (lncRNAs), and more than 100 pseudogenes, many located within segmental duplications and satellite arrays. These elements include duplicated copies of genes like those in the WASH complex, potentially influencing cellular trafficking and immune functions, thereby expanding the annotated functional repertoire of the human genome.^[8]

The T2T-CHM13 assembly was constructed using an integrated pipeline leveraging high-accuracy, long-read sequencing technologies: circular consensus HiFi reads from PacBio Sequel II for base-level precision in repetitive areas, ultralong Oxford Nanopore reads exceeding 100 kb for spanning large repeats, and Hi-C chromatin interaction data to order and orient contigs into chromosome-scale scaffolds. This hybrid approach ensured high contiguity (contig N50 > 25 Mb) and accuracy (>99.9%), validated through orthogonal mapping and optical genome mapping.^[8]

By 2025, T2T-CHM13 had become established as the foundational complete reference in genomic studies, serving alongside GRCh38 as a scaffold for variant calling, epigenetic profiling, and pangenome construction. The planned GRCh39 build has been indefinitely postponed by the Genome Reference Consortium.^[91]^[92]

Human pangenome initiative

The Human Pangenome Reference Consortium (HPRC), launched in 2019 and funded by the National Human Genome Research Institute, released an initial draft of the human pangenome reference in May 2023. This draft comprises 47 phased diploid genome assemblies derived from 47 genetically diverse individuals, primarily selected from the 1000 Genomes Project cohort to represent global population diversity across multiple ancestries.^[93] The consortium aims to expand this resource to include high-quality assemblies from at least 350 individuals, with ongoing efforts targeting over 100 additional diverse genomes by 2025 to further enhance representation.^[94]

The pangenome is structured as a non-linear graph-based reference, consisting of core genomic sequences aligned with "bubbles" that encapsulate variant paths, allowing for the representation of multiple haplotypes and structural variations without relying on a single linear sequence. This graph format, constructed using telomere-to-telomere assembly methods, facilitates more accurate mapping of sequencing reads to diverse genomes. Compared to workflows based on a traditional linear reference, using the pangenome reduces small-variant discovery errors by approximately 34%, with particular benefit for non-European ancestries.^[93]

In 2025, the HPRC advanced the initiative with the production of 130 haplotype-resolved assemblies from 65 diverse human genomes, achieving a median contig length of 130 Mb and closing 92% of previously unresolved gaps in the human reference. These assemblies capture complex structural variants, such as large insertions and inversions, that are often missed by linear references, thereby providing a more complete depiction of human genomic diversity.^[76]

Key benefits of the pangenome include the reduction of reference bias in variant calling, which minimizes mapping errors for underrepresented populations and enhances the detection of population-specific alleles. 
It also supports the development of more equitable polygenic risk scores by incorporating diverse haplotype data, improving predictive accuracy across global ancestries. The resource integrates sequencing data from the 1000 Genomes Project and aligns with efforts from the All of Us Research Program to promote inclusive genomic analyses.^[93]^[95]
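
The graph structure described in this section can be illustrated with a toy example. In the Python sketch below, shared sequence is stored once in nodes, a single "bubble" offers two alternative nodes, and each walk through the graph spells one haplotype. The node identifiers, sequences, and function name are hypothetical. Real pangenome graphs are vastly larger and are built and stored with dedicated tools and formats, so this is only an illustration of the idea.

```python
# Hypothetical toy graph: node id -> sequence carried by that node.
nodes = {1: "ACGT", 2: "A", 3: "G", 4: "TTCA"}
# Directed edges; nodes 2 and 3 are the two arms of one bubble.
edges = {1: [2, 3], 2: [4], 3: [4], 4: []}


def spell_paths(nodes, edges, start, end):
    """Enumerate every start-to-end walk in an acyclic graph and spell its sequence."""
    spelled = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            spelled.append("".join(nodes[n] for n in path))
            continue
        for successor in edges[node]:
            stack.append((successor, path + [successor]))
    return sorted(spelled)


haplotypes = spell_paths(nodes, edges, 1, 4)
```

Here the two haplotypes differ only at the bubble (A versus G), while the flanking sequence is stored once and shared. A read carrying either allele matches some path exactly, which is the intuition behind reduced reference bias.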

Genetic variation

Single nucleotide polymorphisms

Single nucleotide polymorphisms (SNPs) are the most prevalent form of genetic variation in the human genome, consisting of substitutions at a single nucleotide position in the DNA sequence among individuals. These point mutations occur where one of the four nucleotide bases (adenine, thymine, cytosine, or guanine) differs between individuals or compared to a reference sequence. SNPs with a minor allele frequency greater than 1% are considered common and are estimated to number around 4 to 5 million per individual genome. Across the global human population, approximately 84 million such common SNPs have been identified.^[96]^[97]^[98] SNPs are distributed throughout the genome at a frequency of roughly one per 300 to 1,000 base pairs, reflecting the overall sequence diversity. The majority, about 88%, reside in non-coding regions, which constitute the bulk of the genome, while fewer occur in protein-coding exons due to purifying selection pressures that limit variation in functional elements. This uneven distribution underscores how most SNPs likely influence gene regulation or non-coding functions rather than directly altering protein sequences.^[99]^[96]^[100] The discovery and cataloging of SNPs accelerated through the Human Genome Project (HGP), which provided the foundational reference sequence, and the launch of the dbSNP database in 1998 by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). dbSNP serves as a public repository for submitted genetic variations, now encompassing over 1.1 billion unique reference SNPs from ongoing genomic studies and sequencing efforts. 
A key feature of SNPs is their tendency to occur in linkage disequilibrium (LD), where certain combinations of alleles are inherited together more often than expected by chance, forming haplotype blocks that span segments of the genome and aid in association studies.^[101]^[102]

In terms of functional impact, SNPs in coding regions are classified as synonymous, which do not change the amino acid sequence, or missense, which substitute one amino acid for another and can impair protein function; roughly 3.5 million missense SNPs have been documented across human populations. Genome-wide association studies (GWAS) have leveraged SNPs to identify genetic links to complex traits, with over 5,000 SNPs associated with various traits reported by 2025 through large-scale meta-analyses. These associations highlight the role of SNPs in polygenic inheritance.^[103]

Population-level patterns reveal greater SNP diversity in genomes of African ancestry: individuals of African ancestry carry approximately 4 million SNPs on average, about 20% more than individuals from non-African populations, owing to historical demographic factors. Because of this elevated variation, African genomes capture more of the full spectrum of human genetic diversity. Efforts such as the Human Pangenome Initiative also improve SNP mapping by incorporating diverse reference assemblies beyond the standard linear genome.^[104]
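
Two quantities used in this section, minor allele frequency and linkage disequilibrium, can be made concrete with a small calculation. The Python sketch below assumes phased haplotypes encoded as tuples of 0 (reference allele) and 1 (alternate allele) and measures LD with the commonly used r² statistic. The function names and the toy haplotypes are hypothetical and do not come from any real dataset.

```python
def allele_frequency(haplotypes, site):
    """Frequency of the alternate allele (coded 1) at one site."""
    return sum(h[site] for h in haplotypes) / len(haplotypes)


def minor_allele_frequency(haplotypes, site):
    """Frequency of whichever allele is rarer at the site."""
    p = allele_frequency(haplotypes, site)
    return min(p, 1 - p)


def ld_r_squared(haplotypes, site_a, site_b):
    """r^2 between two biallelic sites, computed from phased haplotypes."""
    p_a = allele_frequency(haplotypes, site_a)
    p_b = allele_frequency(haplotypes, site_b)
    both = sum(1 for h in haplotypes if h[site_a] == 1 and h[site_b] == 1)
    p_ab = both / len(haplotypes)
    d = p_ab - p_a * p_b  # deviation from independent assortment
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r^2 is undefined when a site is monomorphic")
    return d * d / denominator


# Hypothetical phased haplotypes over three sites.
toy_haplotypes = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
r2_linked = ld_r_squared(toy_haplotypes, 0, 1)
r2_unlinked = ld_r_squared(toy_haplotypes, 0, 2)
```

In the toy data the first two sites always carry matching alleles, so they are in complete LD, whereas the third site is distributed independently of the first.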

Structural variations

Structural variations (SVs) in the human genome are genomic alterations involving segments of DNA at least 50 base pairs in length, including copy-number variants (CNVs), insertions, deletions, inversions, and translocations. CNVs, which encompass duplications or deletions ranging from 1 kb to 5 Mb, represent a major class of SVs and affect approximately 12-20% of the genome across individuals. These variations arise primarily through mechanisms such as non-allelic homologous recombination (NAHR) within repetitive sequences like segmental duplications, as well as non-homologous end joining and replication-based errors. On average, a typical human genome harbors 20,000 to 26,000 SVs, collectively spanning millions of base pairs and contributing substantially to inter-individual genetic diversity.

Detection of SVs has advanced significantly with technologies like array comparative genomic hybridization (array CGH), which identifies copy-number changes by comparing hybridization signals between reference and sample DNA, and long-read sequencing platforms such as PacBio and Oxford Nanopore, which resolve complex rearrangements through continuous DNA reads. Recent telomere-to-telomere (T2T) assemblies, including those from the Human Pangenome Reference Consortium in 2025, have uncovered over 10,000 novel SVs previously missed in short-read data, particularly in repetitive regions like centromeres and subtelomeres. For instance, long-read sequencing of 1,019 diverse individuals revealed more than 100,000 biallelic SVs, highlighting the value of these methods for cataloging SVs at population scale.

SV hotspots are enriched in regions of segmental duplications, such as near subtelomeres, where sequence similarity promotes NAHR and leads to recurrent rearrangements.
A well-known example is 22q11.2 deletion syndrome (DiGeorge syndrome), typically caused by a deletion of about 3 Mb mediated by low-copy repeats; it affects about 1 in 4,000 individuals and results in developmental disorders.

Evolutionarily, SVs drive divergence between species; comparisons between human and chimpanzee genomes show that SVs account for over 3.5% of structural differences, impacting thousands of genes and contributing to approximately 7-10 million base pairs of net insertion/deletion variation beyond single-nucleotide differences.
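
The 50-base-pair size threshold in the definition of SVs can be expressed as a simple rule. The Python sketch below labels a variant from its reference and alternate allele strings, as they might appear in a variant record. The function name and labels are hypothetical, and the sketch deliberately ignores inversions, translocations, and other events that cannot be described by allele length alone.

```python
SV_MIN_LENGTH = 50  # size threshold from the definition in the text, in base pairs


def classify_by_size(ref, alt):
    """Label a variant using only the lengths of its two alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    size = abs(len(alt) - len(ref))
    if size == 0:
        return "multi-nucleotide substitution"
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    scale = "structural" if size >= SV_MIN_LENGTH else "small"
    return f"{scale} {kind}"


# Hypothetical alleles: a 60 bp insertion and a 3 bp deletion.
large_insertion = classify_by_size("A", "A" + "T" * 60)
small_deletion = classify_by_size("ACGT", "A")
```

Production SV callers rely on additional evidence such as split reads, read depth, and assembly alignments; the length rule only captures the size convention used to separate SVs from small indels.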

Population-level diversity

The human genome exhibits remarkable uniformity across individuals, with an average nucleotide diversity of approximately 0.1%, meaning that any two humans differ at about one base pair per 1,000 in their DNA sequences.^[105] This low level of variation underscores the shared ancestry of modern humans, yet subtle differences emerge when examining population-level patterns. Sub-Saharan African populations display the highest levels of heterozygosity, around 0.10%, reflecting their role as the cradle of human genetic diversity, while non-African groups show reduced heterozygosity around 0.07% due to historical demographic events.^[106] The Out-of-Africa migration model explains much of this population structure, where early humans leaving Africa approximately 60,000–70,000 years ago experienced serial founder effects and bottlenecks that diminished genetic diversity in descendant populations.^[107] For instance, Native American populations exhibit about 20% less genetic diversity than African groups, a consequence of these bottlenecks during the peopling of the Americas via Beringia.^[108] Admixture events, such as those following colonial-era migrations, have further shaped contemporary diversity, introducing gene flow between previously isolated groups and creating mosaic ancestries in many regions.^[109] Large-scale genomic projects have cataloged this diversity to enable cross-population comparisons. The 1000 Genomes Project sequenced the genomes of 2,504 individuals from 26 populations across five continental groups, identifying over 88 million variants and revealing how allele frequencies vary by ancestry. More recently, the NIH's All of Us Research Program has generated over 400,000 whole-genome sequences from diverse U.S. participants as of early 2025, contributing to the program's goal of one million diverse participants, emphasizing underrepresented ancestries to address health disparities. 
These efforts highlight that African and admixed populations contribute disproportionately to novel variant discovery, underscoring the limitations of European-biased references.^[110] In May 2025, the Human Pangenome Reference Consortium released an expanded dataset with high-quality phased genomes from over 200 diverse individuals, enhancing the capture of global genetic variation.^[111]

Key metrics quantify inter-population differentiation. The fixation index (F_ST), a measure of genetic divergence, averages around 0.12 globally among human populations, indicating low but detectable structure primarily along continental lines.^[112] Runs of homozygosity (ROH), extended tracts of identical DNA inherited from both parents, serve as indicators of recent inbreeding or endogamy; longer ROH are more prevalent in isolated or consanguineous groups, such as certain South Asian or Middle Eastern populations, and correlate with elevated risks for recessive disorders.^[113]

By 2025, human pangenome initiatives have provided deeper insights into non-European diversity, demonstrating that genomes from African, Asian, and Indigenous ancestries harbor 20–30% more structural and sequence variants absent from traditional references like GRCh38.^[93] This disparity arises from reference biases that underrepresent non-European haplotypes, a limitation that the Human Pangenome Reference Consortium addresses by integrating 47 diverse assemblies to better capture global variation.^[114]
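
The fixation index mentioned above compares the heterozygosity expected within populations with the heterozygosity expected if they were pooled. The Python sketch below computes Wright's F_ST for a single biallelic site and two equally weighted populations. The function names and allele frequencies are hypothetical, and genome-wide estimates use estimators that correct for sample size and combine many sites.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2p(1 - p) at a biallelic site with allele frequency p."""
    return 2 * p * (1 - p)


def fst_two_populations(p1, p2):
    """Wright's F_ST at one biallelic site for two equally weighted populations."""
    p_mean = (p1 + p2) / 2
    h_total = expected_heterozygosity(p_mean)
    if h_total == 0:
        raise ValueError("F_ST is undefined when the site is fixed for the same allele in both populations")
    h_within = (expected_heterozygosity(p1) + expected_heterozygosity(p2)) / 2
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies in two populations.
fst_identical = fst_two_populations(0.5, 0.5)
fst_diverged = fst_two_populations(0.2, 0.8)
```

Identical frequencies give a value of 0, and populations fixed for different alleles give 1. A global average near 0.12 therefore means that most human variation is found within populations rather than between them.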

Biological implications

Genetic disorders

The human genome harbors variants that can lead to genetic disorders, which arise from alterations in DNA sequence, structure, or inheritance patterns, disrupting normal gene function and cellular processes. These disorders range from those caused by changes in a single gene to those influenced by multiple genetic loci interacting with environmental factors, highlighting the genome's role in disease susceptibility. Approximately 7,000 monogenic disorders are currently known, each typically resulting from mutations in a single gene that follow predictable Mendelian inheritance patterns.^[115] Monogenic disorders often stem from point mutations, such as single nucleotide polymorphisms (SNPs), or other specific alterations like repeat expansions. For instance, cystic fibrosis is primarily caused by SNPs and small deletions in the CFTR gene, with the most common being the ΔF508 mutation that impairs chloride ion transport across cell membranes.^[116] In contrast, Huntington's disease results from a CAG trinucleotide repeat expansion in the HTT gene, where repeats exceeding 36 copies lead to toxic polyglutamine tract formation in the huntingtin protein, causing progressive neurodegeneration.^[117] Polygenic disorders involve the cumulative effect of variants across numerous genes, often SNPs, contributing to complex traits and disease risk. Schizophrenia exemplifies this, with genome-wide association studies identifying over 100 independent risk loci, where common SNPs collectively explain a substantial portion of heritability through subtle disruptions in synaptic function and neurodevelopment. These polygenic risks underscore how widespread genomic variation, as detailed in sections on single nucleotide polymorphisms and structural variations, underlies multifactorial conditions. 
Beyond point mutations and polygenic effects, other genomic mechanisms contribute to disorders, including trinucleotide repeat expansions, copy number variations (CNVs), and uniparental disomy (UPD). Fragile X syndrome arises from CGG repeat expansion (>200 repeats) in the FMR1 gene's 5' untranslated region, leading to gene silencing and intellectual disability.^[118] DiGeorge syndrome (22q11.2 deletion syndrome) is frequently due to a 1.5–3 Mb heterozygous deletion CNV at chromosome 22q11.2, affecting multiple genes involved in immune, cardiac, and cognitive development.^[119] UPD, where both copies of a chromosome pair are inherited from one parent, can unmask recessive mutations or disrupt imprinting; for example, maternal UPD15 causes Prader-Willi syndrome because genes in the imprinted region that are expressed only from the paternal copy are left without an active allele.^[120]

Advances in whole-genome sequencing (WGS) have revolutionized diagnosis for undiagnosed cases, identifying causative variants in 30–50% of pediatric patients with suspected genetic disorders through comprehensive detection of SNPs, CNVs, and structural changes.^[121] Recent developments in genome editing, such as CRISPR-Cas9-based therapies, have led to approved treatments for monogenic disorders like sickle cell disease (as of 2023).^[122] Serious genetic disorders affect approximately 1% of live births worldwide, with consanguineous unions increasing the risk of autosomal recessive conditions by 2–3-fold due to higher homozygosity for rare deleterious variants.^[123]^[124]
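
Repeat expansions such as the CAG tract in HTT are defined by the number of consecutive repeat units. The Python sketch below finds the longest uninterrupted CAG run in a sequence and compares it with the 36-repeat figure given in this section. The function names and the toy sequences are hypothetical. This is not a clinical sizing method, which relies on dedicated laboratory assays.

```python
import re


def longest_cag_run(sequence):
    """Number of repeat units in the longest uninterrupted CAG tract."""
    runs = re.findall(r"(?:CAG)+", sequence.upper())
    return max((len(run) // 3 for run in runs), default=0)


def exceeds_repeat_threshold(sequence, threshold=36):
    """True when the longest CAG tract has more units than the threshold."""
    return longest_cag_run(sequence) > threshold


# Hypothetical sequences: a 40-unit tract and a 20-unit tract.
expanded = "ATG" + "CAG" * 40 + "CCG"
typical = "ATG" + "CAG" * 20 + "CCG"
expanded_units = longest_cag_run(expanded)
```

The same pattern-matching approach applies to other trinucleotide repeats, such as the CGG tract in FMR1, by changing the repeat unit and threshold.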

Evolutionary dynamics

The human genome exhibits a high degree of conservation with other primates, reflecting shared evolutionary ancestry. Comparative genomic analyses reveal approximately 98.8% sequence similarity between the human and chimpanzee genomes in aligned regions, underscoring the close phylogenetic relationship between two lineages that diverged around 6-7 million years ago. Within this conserved framework, ultraconserved elements (non-coding sequences longer than 200 base pairs that show perfect identity across distantly related mammals) comprise a small fraction (~0.1%) of the genome but are under extremely strong purifying selection, which eliminates nearly all new variants and so maintains functional integrity. These elements, often located near genes involved in RNA processing and development, highlight regions where evolutionary pressures have preserved sequence stability over tens of millions of years.^[125]

Human-specific genomic changes have contributed to unique traits, particularly in cognition and communication. Changes in the FOXP2 gene, which encodes a transcription factor implicated in neural development, include two amino acid substitutions that occurred after the human-chimpanzee split; these have been hypothesized to facilitate adaptations for articulate speech and language, though recent analyses find no strong evidence for recent positive selection.^[126] Similarly, genes like ASPM, which regulates brain size during development, have undergone accelerated evolution in the human lineage, showing signatures of positive selection that correlate with expanded cerebral cortex volume compared to other primates.^[127] These alterations represent targeted modifications in an otherwise conserved regulatory landscape, driving neurodevelopmental innovations.

Gene duplications and losses account for roughly 3% of the human genome that is unique relative to chimpanzees, often conferring adaptive advantages.
A prominent example is the expansion of salivary amylase genes (AMY1), where humans possess multiple copies—up to 20 per diploid genome—facilitating efficient starch digestion, an adaptation linked to dietary shifts toward carbohydrate-rich foods in agricultural societies.^[128] Such structural variations, including segmental duplications totaling about 5% of the genome, have introduced novel gene families absent or reduced in other great apes, influencing traits like immune response and sensory perception. Evolutionary dynamics in the human genome are shaped by diverse selection pressures that have molded its variation. Positive selection has favored alleles like the lactase persistence variant (rs4988235, -13,910*T) in European populations, enabling adult lactose digestion and spreading rapidly within the last 10,000 years in response to dairy farming.^[129] Balancing selection maintains polymorphism in the major histocompatibility complex (MHC) region, promoting genetic diversity to enhance pathogen resistance through heterozygous advantage.^[130] Meanwhile, purifying selection dominates, eliminating approximately 70% of deleterious mutations before they reach fixation, thereby preserving genomic stability across the population. Admixture with archaic hominins has further enriched the human genome with adaptive variants. Non-African populations carry 1-4% Neanderthal-derived DNA, introduced via interbreeding events around 50,000-60,000 years ago, with segments enriched in immunity-related genes that likely conferred advantages against novel pathogens.^[131] In Oceanic populations, Denisovan admixture contributes up to 4-6% of the genome, particularly influencing high-altitude adaptation and immune functions in regions like Papua New Guinea.^[132] These introgressed sequences demonstrate how gene flow from archaic humans has provided raw material for local adaptations without disrupting core genomic architecture.^[133]

Mitochondrial genome

The human mitochondrial genome is a distinct, maternally inherited circular DNA molecule separate from the nuclear genome, consisting of 16,569 base pairs that encode 37 genes, including 13 protein-coding genes essential for oxidative phosphorylation, 22 transfer RNA genes, and 2 ribosomal RNA genes, with no introns present.^[134] This compact structure includes a small non-coding control region (D-loop) that contains regulatory elements for replication and transcription, enabling efficient transcription of polycistronic RNAs from the coding regions that are subsequently processed into mature transcripts.^[136]^[137]

Mitochondrial DNA (mtDNA) exists in multiple copies per cell, ranging from 100 to 10,000, with significant variation across tissues based on metabolic demands; for instance, oocytes contain exceptionally high numbers, often exceeding 100,000 copies, to support embryonic development.^[138]^[139] Inheritance is strictly uniparental and maternal, as sperm contribute negligible mtDNA during fertilization. Cells can be heteroplasmic, harboring a mixture of wild-type and mutant mtDNA variants whose proportions can shift during development, and mitochondrial diseases such as Leber's hereditary optic neuropathy can result when pathogenic variants exceed a threshold.^[140]^[141]

The mtDNA mutation rate is approximately 10–17 times higher than that of nuclear DNA due to limited repair mechanisms and proximity to reactive oxygen species, facilitating rapid evolution and the formation of haplogroups that trace human ancestry; for example, haplogroup L0 represents the most ancient root lineage found predominantly in African populations.^[142]^[143] The human mtDNA was fully sequenced in 1981, providing the foundational Cambridge Reference Sequence, and as of 2025, ongoing research integrates this with nuclear-mitochondrial interactions, including nuclear mitochondrial DNA segments (NUMTs) that arise from mtDNA transfers to the nucleus and influence gene regulation and disease susceptibility.^[134]^[144] Recent studies also highlight epigenetic-like modifications on mtDNA, such as methylation patterns, that may modulate gene expression without altering the sequence.^[145]

Epigenomic modifications

Epigenomic modifications encompass heritable chemical and structural changes to DNA and associated proteins that influence gene expression without altering the underlying nucleotide sequence. These modifications include DNA methylation, histone tail alterations, and the actions of non-coding RNAs, collectively forming the epigenome that responds to developmental cues, environmental signals, and cellular states. In humans, the epigenome exhibits tissue-specific patterns that fine-tune genomic function, with dysregulation implicated in diseases ranging from cancer to neurodevelopmental disorders. DNA methylation primarily occurs as 5-methylcytosine at CpG dinucleotides, where the human genome contains approximately 28 million such sites, with 70–80% typically methylated in somatic cells. This modification represses gene expression by inhibiting transcription factor binding or recruiting repressive protein complexes, playing critical roles in genomic imprinting—where parental alleles are differentially silenced—and X-chromosome inactivation in females, ensuring dosage compensation for X-linked genes. Histone modifications, such as acetylation and methylation on tails of core histones, further sculpt chromatin structure to either promote or inhibit accessibility. For instance, H3K27 acetylation (H3K27ac) marks active enhancers by opening chromatin and facilitating transcription factor recruitment, while H3K9 trimethylation (H3K9me3) signals heterochromatin formation and transcriptional repression. The Roadmap Epigenomics Consortium's 2015 analysis of 111 reference human epigenomes integrated these marks with DNA methylation and chromatin accessibility data to define 25 recurrent chromatin states across cell types, revealing how combinatorial patterns dictate functional genomic regions like promoters and insulators. Non-coding RNAs, particularly long non-coding RNAs (lncRNAs), contribute to epigenomic regulation by guiding repressive complexes to target loci. 
For example, certain lncRNAs physically interact with Polycomb repressive complex 2 (PRC2) to direct H3K27 trimethylation and gene silencing, as observed in developmental processes where they maintain stable repression of Hox gene clusters. Although the majority of the human epigenome remains stable, approximately 3–5% of the genome undergoes dynamic modifications that respond to environmental influences, such as nutritional stress. The Dutch Hunger Winter famine of 1944–45 demonstrated transgenerational effects, with prenatally exposed individuals exhibiting persistent hypomethylation at the imprinted IGF2 gene decades later, alongside increased metabolic disorder risk in their offspring. These changes highlight how external factors can propagate epigenetically across generations. Advances in single-cell epigenomics as of 2025 have enabled high-resolution mapping of modifications across over 100 cell types in human tissues, including the brain, where multimodal profiling of hundreds of thousands of nuclei identifies cell-type-specific states. Such studies also link global DNA hypomethylation—a hallmark of aging—to chromatin instability and age-related phenotypes, with loss of methylation at repetitive elements correlating with increased genomic instability in older individuals.
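
The CpG sites and methylation percentages discussed in this section reduce to simple counts. The Python sketch below counts CpG dinucleotides on one strand of a sequence and, given per-site read counts of the kind produced by bisulfite-style assays, reports the fraction of sites called methylated. The function names, the 0.5 calling threshold, and the toy data are hypothetical assumptions made for illustration.

```python
def count_cpg_sites(sequence):
    """Count CpG dinucleotides (a C immediately followed by a G) on the given strand."""
    return sequence.upper().count("CG")


def methylation_level(methylated_reads, unmethylated_reads):
    """Fraction of reads supporting methylation at a single CpG site."""
    total = methylated_reads + unmethylated_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return methylated_reads / total


def fraction_methylated(site_counts, threshold=0.5):
    """Share of sites whose methylation level reaches the (assumed) threshold."""
    levels = [methylation_level(m, u) for m, u in site_counts]
    return sum(level >= threshold for level in levels) / len(levels)


# Hypothetical sequence and per-site (methylated, unmethylated) read counts.
toy_sequence = "ACGTCGCGAA"
toy_counts = [(8, 2), (9, 1), (0, 10)]
cpg_total = count_cpg_sites(toy_sequence)
methylated_share = fraction_methylated(toy_counts)
```

In the toy data two of the three sites are called methylated, which illustrates how a figure such as the 70–80% quoted for somatic cells summarizes site-level calls across the genome.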