
1000 Genomes Project

The 1000 Genomes Project was an international research consortium established in 2008 and completed in 2015, with the primary goal of creating the largest publicly available catalog of human genetic variation by sequencing the genomes of at least 1,000 individuals from diverse global populations. This effort built upon prior initiatives like the International HapMap Project, focusing on identifying common single nucleotide polymorphisms (SNPs), insertions, deletions, and structural variants to serve as a foundational resource for biomedical research, including genome-wide association studies and personalized medicine. The project involved collaborations among research institutions in the United States, the United Kingdom, China, and other countries, ultimately sequencing low-coverage whole genomes (average depth of 7.4×) and high-coverage exomes (average depth of 65.7×) from 2,504 individuals across 26 populations representing five major continental ancestry groups: Africa, East Asia, Europe, South Asia, and the Americas. These samples were drawn from unrelated healthy volunteers with open consent, ensuring ethical handling and broad accessibility of the data through repositories like the European Bioinformatics Institute and the NCBI Sequence Read Archive. Advanced computational pipelines, including 24 variant-calling tools and machine-learning-based genotyping, enabled the discovery and phasing of 88 million variant sites, of which the project contributed or validated approximately 80 million in the dbSNP database (version 141), including 40 million novel SNPs and indels. Key outcomes included the characterization of 4.1 to 5.0 million variants per typical genome, with about 64 million of the project-wide variant sites being rare (frequency <0.5%) and 8 million common (>5% frequency), highlighting the extent of human genetic diversity and its variation across populations. The resulting catalog has been instrumental in improving imputation accuracy, designing genotyping arrays, and advancing precision medicine by providing a reference for interpreting variants in clinical contexts.
Post-project, the International Genome Sample Resource (IGSR) maintains the data, realigned to the GRCh38 human genome assembly and including high-coverage (30×) whole-genome sequences from 3,202 samples, while integrating variation data from additional populations via projects like the Human Genome Diversity Project. Recent developments include 2025 publications on long-read sequencing of 1,019 samples from 26 populations to characterize structural variants (Schloissnig et al., Nature) and high-coverage Oxford Nanopore sequencing (~37×) of 100 diverse original samples, identifying ~24,500 high-confidence SVs per genome on average and enabling detection of repeat expansions and methylation patterns previously missed by short-read methods, with ongoing efforts targeting 800+ ONT samples.

Background

Human Genetic Variation

Human genetic variation encompasses differences in DNA sequences among individuals, which arise from mutations and recombination events over evolutionary time. The primary types of genetic variants include single nucleotide polymorphisms (SNPs), which are substitutions of a single nucleotide at a specific position in the genome; insertions and deletions (indels), involving the addition or removal of short DNA segments typically up to 50 base pairs; and structural variants (SVs), larger alterations such as deletions, duplications, inversions, and copy number variants that affect at least 50 base pairs and can span thousands of base pairs. These variants collectively account for the majority of genomic diversity, with SNPs and short indels comprising over 99.9% of identified differences, while SVs impact approximately 20 million base pairs per diploid genome. In populations, genetic variants vary in frequency and geographic distribution, with common variants (minor allele frequency >5%) being shared across diverse groups and rare variants (<0.5%) often population-specific. A typical genome harbors about 4.1 to 5.0 million variants, reflecting an average heterozygosity rate of approximately 1 in 1,000 bases, or a nucleotide diversity (π) of about 10^{-3}. Across global populations this results in roughly 64 million rare, 12 million low-frequency (0.5–5%), and 8 million common variant sites, with African genomes exhibiting the highest overall diversity, consistent with the origin of modern humans in Africa and the bottlenecks experienced by populations that migrated out of the continent. Genetic variation plays a crucial role in human evolution by driving adaptation to environmental pressures, such as pathogen resistance and metabolic adjustments, while also shaping population history through events like migrations, bottlenecks, and admixture that influence variant frequencies.
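The per-genome figures above are consistent with simple arithmetic; a quick sanity check using illustrative round-number constants (not project data):

```python
# Back-of-the-envelope check of the per-genome variant figures above.
# Assumptions (illustrative, not project code): genome length ~3.1 Gb,
# nucleotide diversity pi ~ 1e-3 (i.e., ~1 heterozygous site per 1,000 bp).

GENOME_LENGTH = 3.1e9   # approximate haploid genome size in base pairs
PI = 1e-3               # average pairwise nucleotide diversity

# Expected heterozygous sites in one diploid genome:
expected_het_sites = GENOME_LENGTH * PI
print(f"~{expected_het_sites / 1e6:.1f} million heterozygous sites")

# The 4.1-5.0 million total differences from the reference also include
# homozygous non-reference sites, so the two figures are mutually consistent.
```

The gap between ~3 million heterozygous sites and the 4.1 to 5.0 million total differences is made up of positions where an individual is homozygous for a non-reference allele.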
In disease susceptibility, both common and rare variants contribute to complex traits and disorders; for instance, common variants often underlie polygenic risk scores for conditions like diabetes, whereas rare variants with larger effect sizes are implicated in monogenic diseases and some severe phenotypes. Understanding these patterns requires considering haplotype blocks—regions of the genome where alleles at multiple loci are inherited together due to low historical recombination—and linkage disequilibrium (LD), the non-random association of alleles at different sites within these blocks, which decays more rapidly in high-diversity populations like those of African ancestry. Earlier efforts, such as the HapMap project, effectively cataloged common variants in LD but were limited in capturing rare ones, highlighting the need for deeper sequencing to resolve full variation spectra.
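Linkage disequilibrium as described above is conventionally quantified with the D, D′, and r² statistics; a minimal sketch using illustrative haplotype frequencies (not project data):

```python
# Minimal illustration (not project code) of linkage disequilibrium between
# two biallelic loci, computed from haplotype and allele frequencies.

def ld_stats(p_ab, p_a, p_b):
    """Return D, D', and r^2 for alleles A and B, given the frequency of the
    A-B haplotype (p_ab) and the allele frequencies p_a and p_b."""
    d = p_ab - p_a * p_b                      # raw disequilibrium
    if d > 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max else 0.0     # normalized to [-1, 1]
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2

# Alleles strongly associated on the same haplotypes:
d, d_prime, r2 = ld_stats(p_ab=0.45, p_a=0.5, p_b=0.5)
print(round(d, 3), round(d_prime, 3), round(r2, 3))  # 0.2 0.8 0.64
```

High r² between a genotyped tag SNP and an ungenotyped variant is what makes both tag-SNP array design and imputation against the project's haplotype panel effective.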

Prior Genomics Efforts

The Human Genome Project, an international collaboration launched in 1990, culminated in the release of a high-quality reference sequence of the human genome in April 2003, covering about 99% of the euchromatic regions (roughly 92% of the total genome) with fewer than 400 gaps remaining. This effort focused primarily on sequencing a single composite reference genome, derived mainly from one anonymous donor of mixed ancestry (about 70%) supplemented by contributions from 19 others, mostly of European descent, to establish a foundational template for identifying genes and understanding human biology. While groundbreaking, the project emphasized a consensus sequence rather than cataloging population-level genetic variation, leaving substantial gaps in representing diverse human genomes. Building on this foundation, the International HapMap Project, initiated in 2002, aimed to map patterns of common genetic variation by genotyping single nucleotide polymorphisms (SNPs) and haplotypes across human populations. Phases I and II, completed by 2007, cataloged approximately 3.1 million SNPs in 270 individuals from four continental populations: Utah residents with Northern and Western European ancestry (CEU, 90 individuals), Yoruba in Ibadan, Nigeria (YRI, 90 individuals), Han Chinese in Beijing (CHB, 45 individuals), and Japanese in Tokyo (JPT, 45 individuals). Phase III, finalized in 2010, expanded genotyping to 1.5 million additional SNPs in 1,184 individuals across 11 global populations, enhancing haplotype resolution for genome-wide association studies (GWAS). Despite these advances, the HapMap Project revealed critical limitations in capturing the full spectrum of human genetic diversity. It primarily targeted common variants with minor allele frequencies (MAF) greater than 5% in the initial phases, underrepresenting rare variants (MAF <1%) that constitute the majority of novel polymorphisms and play key roles in disease susceptibility.
Additionally, the early focus on just four populations limited its applicability to global diversity, as groups like CEU, YRI, CHB, and JPT did not adequately reflect admixture or variation in other ancestries, such as South Asian or Native American. These gaps, highlighted in databases like dbSNP (which by 2008 included only 11 million SNPs but missed most low-frequency variants), underscored the need for deeper sequencing to detect rarer alleles and broader sampling. Following the project's announcement in January 2008, pilot efforts demonstrated the feasibility of low-coverage whole-genome sequencing for variant discovery. In the low-coverage pilot, 179 individuals from diverse ancestries underwent sequencing at 2–4× depth, identifying over 95% of common variants (MAF ≥5%) while proving cost-effective strategies for scaling to larger cohorts and capturing structural variants overlooked by prior genotyping. These proofs-of-concept addressed HapMap's shortcomings by emphasizing next-generation sequencing to uncover low-frequency and rare variants across expanded population groups, including Europeans, East Asians, South Asians, West Africans, and Americans.
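Part of the case for modest cohort sizes is pure sampling arithmetic: ignoring sequencing depth and error, the chance that a variant of population frequency f is carried by at least one of 2n sampled chromosomes is 1 − (1 − f)^(2n). A sketch under those simplifying assumptions:

```python
# Sketch (a simplified sampling model, not the consortium's power analysis)
# of why modest sample sizes suffice for common variants: the probability
# that an allele of population frequency f appears at least once among the
# 2n chromosomes carried by n sampled individuals.

def p_sampled(f, n_individuals):
    chromosomes = 2 * n_individuals
    return 1 - (1 - f) ** chromosomes

# 179 individuals, as in the low-coverage pilot:
print(round(p_sampled(0.05, 179), 6))   # ~1.0 for 5% variants
print(round(p_sampled(0.01, 179), 4))   # still high for 1% variants
print(round(p_sampled(0.001, 179), 3))  # rare variants are often absent
```

In practice detection also depends on having enough reads at the site in at least one carrier, which is why low-coverage designs lean on joint calling across samples.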

Project Initiation

Founding and Funding

The 1000 Genomes Project was announced on January 22, 2008, by an international consortium led by the National Human Genome Research Institute (NHGRI) of the United States, the Wellcome Trust Sanger Institute in the United Kingdom, and the Beijing Genomics Institute (BGI) in China. This initiative addressed limitations in prior efforts like the International HapMap Project by aiming to catalog a broader spectrum of human genetic variation through low-coverage whole-genome sequencing of at least 1,000 individuals from diverse populations. The project was co-chaired by Richard Durbin from the Wellcome Trust Sanger Institute and David Altshuler from the Broad Institute of MIT and Harvard, who guided its scientific direction from inception. Initial funding for the project was estimated at $30–50 million, leveraging emerging sequencing technologies to achieve cost efficiency, with major contributions from the founding organizations. Over its lifespan, the total budget reached approximately $120 million across five years, primarily supported by NHGRI through its Large-Scale Sequencing Network (involving institutions like the Broad Institute, the Washington University Genome Sequencing Center, and the Baylor College of Medicine Human Genome Sequencing Center), the Wellcome Trust for operations at the Sanger Institute, and BGI as a key sequencing partner. Funding evolved in tandem with the project's phased structure, beginning with NHGRI grants in fiscal year 2008 to support pilot studies launched that year, followed by expanded allocations for the full-scale production phase starting in 2010. Additional grants sustained data analysis and resource dissemination beyond the initial five-year plan, enabling completion of the full dataset in 2015, after which the Wellcome Trust continued support for data maintenance through the International Genome Sample Resource (IGSR).

Consortium Structure

The 1000 Genomes Project Consortium was formed in 2008 as a global collaborative effort uniting over 400 scientists from more than 40 research centers worldwide to advance the cataloging of human genetic variation. This structure enabled multidisciplinary input from experts in genomics, bioinformatics, and population genetics, fostering a distributed model for data generation and analysis. Key participating institutions included the Broad Institute of MIT and Harvard, the Wellcome Trust Sanger Institute, and the Beijing Genomics Institute (BGI), which led sequencing efforts, along with the Baylor College of Medicine's Human Genome Sequencing Center, which contributed to pilot data production. Governance was provided by a steering committee co-chaired by David Altshuler of the Broad Institute and Richard Durbin of the Sanger Institute, which oversaw strategic decisions and coordination among specialized analysis groups, including those on structural variation (co-chaired by Evan Eichler, Jan Korbel, and Charles Lee) and genotype calling. The European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI) served as the data coordination center, handling data integration, quality control, and public dissemination. The collaboration emphasized open-access principles through data-sharing agreements that promoted unrestricted use for research, supported by annual meetings starting in 2008 to review progress and resolve challenges.

Design and Scope

Primary Objectives

The 1000 Genomes Project was established with the primary goal of creating a comprehensive catalog of human genetic variants occurring at frequencies greater than 1% in the population, aiming to identify approximately 95% of such common variants across diverse human populations. This initiative sought to extend beyond previous efforts like the International HapMap Project by providing high-resolution data on single nucleotide polymorphisms (SNPs), insertions, deletions, and other structural variants, including their haplotype structures, to enable precise genotyping and association analyses. By focusing on variants with minor allele frequencies above 1% genome-wide and 0.5% within coding regions, the project addressed gaps in understanding the full spectrum of genetic diversity influencing health and disease. A key sub-objective was to facilitate genome-wide association studies (GWAS) for common diseases by supplying a robust reference dataset that could improve the power and accuracy of identifying causal variants linked to traits like diabetes, heart disease, and cancer. The project also aimed to serve as a foundational resource for advancing personalized medicine, allowing researchers and clinicians to better interpret individual genomes in the context of population-level variation for tailored diagnostics and treatments. Additionally, it emphasized studying population-specific variations to reveal insights into human migration, adaptation, and evolutionary history, thereby supporting broader population genetics research. These aims were designed to make the data publicly accessible, fostering global collaboration in genomics research. To achieve these objectives, the project initially targeted the low-coverage whole-genome sequencing of at least 1,000 individuals from multiple ancestries, a scope that was later expanded to 2,504 samples to enhance variant discovery and resolution.
Particular emphasis was placed on capturing low-frequency variants (0.1–1% allele frequency), which had been largely missed by earlier projects and are critical for understanding rare disease contributions and fine-scale population differences. This approach ensured a deeper characterization of functional genomic regions, such as exons, where even rarer variants could be detected at higher sensitivity.

Sampling Strategy

The 1000 Genomes Project sampled 2,504 individuals from 26 populations spanning five major continental ancestry groups: African (AFR), American (AMR), East Asian (EAS), European (EUR), and South Asian (SAS). This selection aimed to capture a broad spectrum of global human genetic diversity while ensuring representation of common variants across ancestries. Populations were drawn from established cohorts, such as those in the International HapMap Project (e.g., Yoruba from Ibadan, Nigeria; Han Chinese from Beijing; Utah residents of Northern and Western European ancestry), supplemented by new collections to fill gaps in underrepresented regions. The sampling strategy prioritized unrelated individuals to maximize genetic independence and minimize confounding from familial correlations, with 1,092 such samples sequenced in phase 1 to provide breadth in variant discovery. To enhance haplotype resolution, the project incorporated parent-offspring trios—comprising mother-father-child sets—enabling accurate phasing of variants through inheritance patterns. Overall, the approach emphasized low-coverage whole-genome sequencing across the cohort for comprehensive variant detection at the population level, while higher-coverage data from select trios supported validation and structural variant analysis. Ethical considerations were integral to the sampling process, with all participants providing informed consent under protocols reviewed by Institutional Review Boards (IRBs) at contributing institutions. Samples were anonymized upon collection, retaining only self-reported ancestry and sex to protect privacy, and excluding any medical or phenotypic data. The project's Samples and Ethical, Legal, and Social Implications (ELSI) Group oversaw population selection to avoid vulnerable or isolated communities, ensuring equitable representation without commercial exploitation of donor materials.
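The trio-based phasing described above rests on Mendelian transmission: at most sites, a child's heterozygous genotype can be split into maternal and paternal alleles from the parents' genotypes alone. A simplified sketch with hypothetical genotypes (real pipelines also use population haplotype structure):

```python
# Illustrative sketch (hypothetical genotypes, not project data) of how a
# parent-offspring trio resolves phase at a single biallelic site.

def phase_child(child, mother, father):
    """Genotypes are sets of alleles at one site, e.g. {'A', 'G'}.
    Returns (maternal_allele, paternal_allele), or None if ambiguous."""
    if len(child) == 1:                    # homozygous child: trivially phased
        a = next(iter(child))
        return a, a
    for m in child:                        # try each assignment of the het
        p = next(iter(child - {m}))
        consistent = m in mother and p in father
        swapped_ok = p in mother and m in father
        if consistent and not swapped_ok:  # only one transmission fits
            return m, p
    return None                            # e.g. all three are heterozygous

print(phase_child({'A', 'G'}, {'A'}, {'G'}))            # ('A', 'G')
print(phase_child({'A', 'G'}, {'A', 'G'}, {'A', 'G'}))  # None: uninformative
```

The `None` case, where all three members are heterozygous, is exactly where statistical phasing from population haplotypes takes over.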

Phased Approach

The 1000 Genomes Project adopted a phased approach to systematically develop and refine methods for cataloging human genetic variation, progressing from initial feasibility testing to large-scale production sequencing over several years. This structure enabled iterative improvements in sequencing depth, variant detection accuracy, and data analysis pipelines, while incorporating feedback from the scientific community at key milestones. The pilot phase, spanning 2008 to 2010, used three pilot projects to assess the feasibility of low-coverage whole-genome sequencing for approximately 1,000 individuals. These pilots included low-coverage sequencing (average 3.6× depth) of 179 individuals from four continental populations, high-coverage sequencing (42× depth) of six individuals from two parent-offspring trios, and targeted exome sequencing (56× depth) of 697 individuals across twelve populations. The primary goal was to evaluate and compare strategies for efficient variant discovery at a population scale, culminating in a data freeze and the 2010 release of a catalog containing over 14 million single-nucleotide polymorphisms identified from the low-coverage component. Community feedback during this phase informed refinements to sampling and analysis protocols for subsequent stages. Phase 1, from 2010 to 2012, expanded sequencing efforts to 1,092 samples from 14 populations representing Europe, East Asia, West Africa, and the Americas, emphasizing improved variant calling through combined low-coverage whole-genome sequencing (2–6× depth) and targeted high-coverage exome sequencing (50–100× depth) for 1,001 individuals. This phase built on pilot lessons to enhance detection of low-frequency variants (below 5% allele frequency) and structural variations, integrating advanced computational methods for genotype imputation and haplotype resolution.
A major milestone was the 2012 data freeze, which incorporated community input on population representation and analysis tools to boost the reliability of the emerging variant catalog. Phase 3, conducted from 2012 to 2015, scaled up to high-coverage exome sequencing (average 65.7× depth) alongside low-coverage whole-genome sequencing (7.4× depth) for 2,504 individuals from 26 populations across five superpopulations. This final production phase prioritized comprehensive coverage of rare variants and refined phasing techniques to support downstream applications in population genetics and disease association studies. Key milestones included multiple data freezes for validation and the integration of extensive community feedback on ethical sampling and data standardization, leading to the project's completion with a robust, publicly accessible resource in 2015. The phased progression ensured that sampling across phases drew from diverse global ancestries to capture broad human genetic diversity in a structured manner.

Methods

Sequencing Technologies

The 1000 Genomes Project primarily utilized next-generation sequencing (NGS) technologies to generate its genomic data, with a focus on high-throughput platforms that enabled cost-effective analysis of large sample cohorts. In the pilot phase (2008–2010), sequencing was performed using a combination of platforms, including the Illumina Genome Analyzer for short-read paired-end sequencing, Roche 454 for longer reads in targeted regions, and Applied Biosystems SOLiD for color-space sequencing, allowing initial testing of strategies for variant discovery across 179 individuals. As the project progressed to phases 1 and 3 (2010–2015), the consortium shifted predominantly to Illumina platforms for greater efficiency and scalability; the Illumina HiSeq 2000 and HiSeq 2500 were employed for the majority of low-coverage whole-genome sequencing (WGS), producing paired-end reads typically 100–150 base pairs in length to balance throughput and accuracy. This transition to HiSeq systems facilitated higher data output, with phase 3 sequencing exclusively relying on Illumina technology and reads of at least 70 base pairs to enhance call quality. Coverage strategies were designed to optimize discovery of common variants while managing costs, employing low-coverage WGS for broad genome interrogation supplemented by targeted high-coverage sequencing. For the 2,504 samples in phase 3, low-coverage WGS achieved an average depth of 7.4× across the genome, enabling imputation of variants through population-level statistical models despite the shallow per-sample depth, which was targeted at approximately 4–6× in earlier phases to pilot scalable approaches. High-coverage exome sequencing, focusing on protein-coding regions, was applied to all phase 3 samples at an average depth of 65.7× using capture kits such as the NimbleGen SeqCap EZ or Agilent SureSelect designs, providing robust detection of rare coding variants that might be missed in low-coverage data.
These strategies prioritized whole-genome breadth for structural and non-coding variants alongside exome depth for functional insights, with data aligned to the GRCh37 human reference genome (also known as hg19) using tools like the Burrows-Wheeler Aligner (BWA) to map short reads accurately. Short-read NGS technologies presented inherent challenges, particularly error rates in base calling and difficulties in resolving repetitive genomic regions, which complicated accurate variant identification. Illumina platforms, while offering high throughput and low per-base error rates (typically <0.1% after quality filtering), generated reads too short to span complex repeats or large insertions/deletions, leading to mapping ambiguities and under-detection of structural variants longer than 50 base pairs. The project's low-coverage approach exacerbated these issues for rare alleles, as insufficient read depth increased false negatives, though multi-sample joint calling mitigated this by leveraging population signals. To address repetitive regions, the consortium incorporated array-based genotyping for validation and phased data releases to refine alignments iteratively, ensuring reliable cataloging despite these limitations.
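The trade-off between the 7.4× genome-wide and 65.7× exome depths can be illustrated with the standard Poisson (Lander-Waterman) coverage model; this is a generic textbook model, not the consortium's pipeline:

```python
# Poisson sketch (a standard coverage model, not project code) of what mean
# depths of 7.4x and 65.7x imply for per-base read counts.
import math

def frac_covered(mean_depth, min_reads):
    """P(depth >= min_reads), treating per-base depth as Poisson(mean_depth)."""
    p_below = sum(math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
                  for k in range(min_reads))
    return 1 - p_below

print(round(frac_covered(7.4, 1), 4))    # nearly every base gets some read
print(round(frac_covered(7.4, 10), 4))   # only a minority of bases reach 10x
print(round(frac_covered(65.7, 20), 4))  # exome depth: ~all targets >= 20x
```

At 7.4× almost every base is touched by at least one read, but relatively few reach the depth needed for confident single-sample genotyping, which is why the design leans on joint calling and imputation across 2,504 samples; the 65.7× exome depth removes that dependence for coding regions.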

Data Processing and Analysis

The data processing pipeline for the 1000 Genomes Project began with the alignment of raw sequencing reads to the human reference genome assembly GRCh37 using the Burrows-Wheeler Aligner (BWA), which efficiently mapped short reads by handling mismatches, indels, and complex genomic regions while minimizing computational artifacts. This step produced binary alignment/map (BAM) files, standardized by the consortium, that served as input for downstream analyses, ensuring consistent representation of read placements across diverse sequencing platforms. Variant calling followed alignment, employing the Genome Analysis Toolkit (GATK) HaplotypeCaller for identifying single-nucleotide polymorphisms (SNPs) and small insertions/deletions (indels), which modeled haplotypes to improve accuracy in low-coverage data (mean 7.4× depth). An ensemble approach integrated calls from 24 specialized tools to capture a broad spectrum of variants, with machine-learning classifiers applied to prioritize high-confidence sites. For structural variants, including deletions, duplications, and inversions, tools like BreakDancer were used to detect breakpoints from discordant read pairs and split reads, supplemented by orthogonal validation methods such as microarrays and long-read sequencing. Quality control measures focused on minimizing false positives through filtering based on variant quality score thresholds derived from high-depth PCR-free sequencing data (>30× coverage), achieving a false discovery rate (FDR) below 5% for SNPs and indels. Genotype quality scores, reflecting Phred-scaled confidence in assignments, were thresholded alongside metrics like read depth and strand bias to exclude low-reliability calls. Population genetics principles informed additional filters, such as assessing allele frequencies and haplotype sharing across continental ancestries to flag artifacts inconsistent with expected patterns.
Haplotype phasing reconstructed chromosome-wide segments using SHAPEIT, which leveraged haplotype scaffolds derived from array genotypes and high-confidence bi-allelic variants to phase over 88 million sites across 2,504 samples, enabling accurate imputation of ungenotyped variants. This process incorporated multi-allelic and structural variants after the initial phasing, enhancing resolution for downstream applications like association studies.
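The phased genotypes produced by this pipeline are distributed as multi-sample VCF files; a toy parse of a single VCF-style record (illustrative values) shows how per-site allele frequencies fall out of the phased GT fields:

```python
# Toy parser (illustrative record, not project data) showing the multi-sample
# VCF layout used for released genotypes: one tab-delimited line per variant
# site, with one phased GT column per sample after the FORMAT field.

vcf_line = "1\t10177\trs367896724\tA\tAC\t100\tPASS\tAF=0.425\tGT\t1|0\t0|0\t1|1"

fields = vcf_line.split("\t")
chrom, pos, rsid, ref, alt = fields[:5]
genotypes = fields[9:]                     # one phased GT per sample, e.g. '1|0'

# Flatten all alleles; '1' counts the alternate allele, '|' marks phased calls.
alleles = [int(a) for gt in genotypes for a in gt.split("|")]
alt_freq = sum(alleles) / len(alleles)     # alternate allele frequency in sample

print(chrom, pos, ref, alt, alt_freq)      # 1 10177 A AC 0.5
```

Across 2,504 samples the same count over 5,008 chromosomes yields the population and super-population allele frequencies annotated in the released files.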

Results

Pilot Phase Outcomes

The pilot phase of the 1000 Genomes Project, conducted from 2008 to 2010, comprised three distinct projects designed to evaluate and refine sequencing strategies for large-scale variation studies. The first involved low-coverage whole-genome sequencing of 179 individuals from four populations (60 from the Yoruba in Ibadan, Nigeria (YRI); 60 from Utah residents with Northern and Western European ancestry (CEU); 30 from Han Chinese in Beijing, China (CHB); and 29 from Japanese in Tokyo, Japan (JPT)), achieving an average coverage of 3.6-fold. The second focused on high-coverage whole-genome sequencing of two parent-offspring trios—one Yoruba trio from Ibadan, Nigeria, and one trio of Centre d'Étude du Polymorphisme Humain Utah residents with Northern and Western European ancestry (CEPH-UTAH)—at approximately 42-fold coverage. The third entailed targeted exon sequencing of 697 individuals from twelve subpopulations across four continental population groups, covering about 1.4 Mb of exonic regions at over 50-fold coverage. These pilots collectively tested next-generation sequencing technologies and variant calling pipelines, using methods such as alignment to the human reference genome and consensus-based variant calling. Key outcomes included the identification of 15 million single-nucleotide polymorphisms (SNPs), 1 million short insertions and deletions (indels <1 kb), and more than 20,000 structural variants (including deletions, insertions, and inversions), with over 50% of these variants novel compared to existing databases like HapMap and dbSNP. Specifically, the low-coverage pilot alone contributed the bulk of these discoveries, demonstrating that this approach could reliably detect over 95% of common variants (minor allele frequency >5%) with a false discovery rate below 5% for SNPs and indels. The trio sequencing provided high-confidence validation for inheritance patterns and de novo mutations, while the exon pilot highlighted efficient capture of coding variants, identifying thousands of nonsynonymous changes.
These results established the feasibility of population-scale sequencing for cataloging human genetic variation. The pilots also revealed areas for methodological improvement, particularly in detecting complex structural variants like inversions and novel insertions, where power and resolution were limited even at higher coverages. Additionally, analysis underscored the value of population diversity, as samples of African ancestry yielded 63% of novel SNPs, indicating that underrepresented groups harbor substantial undiscovered variation. These insights informed the project's shift toward deeper coverage for structural variants and broader sampling in subsequent phases. All pilot data, including raw sequences, alignments, and variant calls, were released in June 2010 through the project's website (www.1000genomes.org) and submitted to public repositories like dbSNP and the European Molecular Biology Laboratory's European Nucleotide Archive, facilitating immediate access for the global research community and enabling early applications in association studies.

Main Phase Discoveries

The main phase of the 1000 Genomes Project involved low-coverage whole-genome sequencing of 2,504 individuals from 26 populations across five major continental ancestry groups: Africa, East Asia, Europe, South Asia, and the Americas, along with high-coverage exome sequencing. This scale enabled the detection of population-specific allele frequencies, with approximately 86% of all variants confined to a single continental population, shedding light on historical migration patterns and demographic events. For instance, certain variants that are rare globally (<0.5% frequency) were found to be common (>5%) within specific subpopulations, such as the Luhya in Webuye, Kenya (LWK), or the Finnish in Finland (FIN). The project's analysis uncovered approximately 88 million variant sites, including 84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions and deletions (indels), and 60,000 structural variants, many of which were novel contributions to public databases like dbSNP. These discoveries highlighted the extensive diversity of human genomes, with a typical diploid genome differing from the reference at 4.1 to 5.0 million sites, including about 149 to 182 loss-of-function (protein-truncating) variants. Building on pilot phase validations of variant calling accuracy, this catalog emphasized the predominance of rare variants (frequency <0.5%), which constitute approximately 73% of all detected sites and underscore their evolutionary role in recent human history. Insights into the burden of rare variants revealed their potential contributions to complex traits, as an average genome carries around 2,000 variants previously associated with such traits through genome-wide association studies (GWAS). This rare variant enrichment varies by ancestry, with European-ancestry genomes showing a higher load of known disease-associated alleles due to ascertainment biases in prior studies.
The data also facilitated integration with phenotype information, yielding initial association signals for traits like height and lipid levels; for example, imputation of phase 3 variants into lipid GWAS cohorts identified novel low-frequency signals influencing high-density lipoprotein cholesterol (HDL-C) and triglycerides.

Variant Catalog

The Variant Catalog of the 1000 Genomes Project represents a comprehensive inventory of human genetic variation derived from low-coverage whole-genome sequencing and exome sequencing of 2,504 individuals from 26 populations across five continental ancestry groups. This catalog encompasses over 88 million variants, including 84.7 million single-nucleotide polymorphisms (SNPs), 3.6 million short insertions and deletions (indels) up to 50 base pairs, and approximately 60,000 structural variants, providing a foundational resource for understanding global human genomic diversity. Variants in the catalog are categorized by type and frequency, with the majority being rare. Approximately 64 million variants have a minor allele frequency (MAF) below 0.5%, highlighting the project's ability to capture low-frequency variation essential for population genetics studies. In terms of genomic location, the catalog distinguishes between coding and non-coding regions; per individual genome, there are roughly 10,000 to 12,000 peptide-altering variants in coding exons, while non-coding regions, including regulatory elements, harbor 459,000 to 565,000 variants, underscoring the predominance of non-coding variation. Additionally, about 86% of all variants are restricted to a single continental ancestry group, reflecting population-specific evolutionary histories. Population differentiation is evident in the catalog, with African-ancestry genomes exhibiting the highest diversity—averaging 4.31 million SNPs and 625,000 indels per genome—compared to other groups, consistent with Africa's role as the origin of modern humans. Fixation index (FST)-based analyses, such as the population branch statistic (PBS), reveal elevated differentiation in genes like SLC24A5, associated with traits like skin pigmentation, further illustrating ancestry-specific variant patterns.
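The population branch statistic mentioned above combines pairwise FST values into a branch length for one focal population; a hedged sketch using Hudson's FST estimator on made-up allele frequencies (not project data):

```python
# Sketch of FST and the population branch statistic (PBS) on illustrative
# allele frequencies; a generic implementation, not the consortium's code.
import math

def hudson_fst(p1, p2, n1, n2):
    """Hudson's FST estimator for one SNP, from sample allele frequencies
    p1, p2 and chromosome sample sizes n1, n2."""
    num = ((p1 - p2) ** 2
           - p1 * (1 - p1) / (n1 - 1)
           - p2 * (1 - p2) / (n2 - 1))
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num / den

def pbs(fst_ab, fst_ac, fst_bc):
    """Branch length of population A since its split from B and C,
    using the log transform T = -ln(1 - FST)."""
    t = lambda f: -math.log(1 - f)
    return (t(fst_ab) + t(fst_ac) - t(fst_bc)) / 2

# A large frequency shift in population A relative to B and C (as seen at
# selected loci like SLC24A5) yields a long A branch:
value = pbs(hudson_fst(0.95, 0.10, 100, 100),
            hudson_fst(0.95, 0.15, 100, 100),
            hudson_fst(0.10, 0.15, 100, 100))
print(round(value, 2))
```

Loci where one population's branch is far longer than the genome-wide background are candidates for local positive selection.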
Quality metrics for the catalog demonstrate high reliability, with accuracy exceeding 99.4% for SNPs and 99.0% for indels in high-confidence calls, and a false discovery rate below 5% for most small variants. Validation through high-coverage sequencing of selected individuals confirmed these rates, ensuring the catalog's utility as a robust reference panel for downstream genomic analyses.

Data and Accessibility

Release Phases

The 1000 Genomes Project disseminated its data through a series of incremental public releases aligned with its phased execution, enabling iterative improvements in variant calling and analysis pipelines. The pilot phase, focused on testing sequencing strategies, culminated in the release of low-coverage whole-genome data from 179 individuals in 2010. This initial dataset, totaling approximately 7.3 terabytes, included raw sequence reads and early variant calls, supporting the project's foundational analysis of common variants. Phase 1 data followed in late 2012, encompassing low-coverage whole-genome and exome sequencing data from 1,092 individuals across 14 populations. This release expanded the variant catalog to over 38 million single-nucleotide polymorphisms (SNPs), 1.4 million short insertions and deletions, and 14,000 large deletions, with integrated calls provided in VCF format. It marked a significant advancement in capturing low-frequency variants (>1% allele frequency), as detailed in the project's primary analysis. The project's capstone, phase 3, delivered its final dataset in October 2015, incorporating low-coverage whole-genome sequencing, deep exome sequencing, and targeted deep sequencing for 2,504 samples from 26 global populations. Built on a data freeze from May 2013 using the GRCh37 reference assembly, this release included comprehensive VCF files with phased haplotypes—generated via tools like SHAPEIT2 and MVNcall—and aligned sequence reads, cataloging over 88 million variants overall. Version 5a (v5a) of the phase 3 data represented the integrated final call set, emphasizing haplotype resolution for imputation and association studies. Following the project's conclusion in 2015, maintenance shifted to the International Genome Sample Resource (IGSR), which integrated the original data with the GRCh38 reference assembly in March 2019 through reanalysis pipelines. This update produced new variant calls and alignments for enhanced accuracy and compatibility with contemporary genomic references.
Subsequent minor corrections, including sample metadata refinements and supplementary high-coverage resequencing for select cohorts, have occurred sporadically into the 2020s to address artifacts and incorporate orthogonal validation. To ensure proper attribution, users must cite the specific publications tied to the data version employed, such as the 2010 paper for pilot data, the 2012 paper for phase 1, and the 2015 papers for phase 3, along with IGSR's periodic update articles for post-2015 releases, per the consortium's reuse guidelines.
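The phase releases above all distribute variant calls as multi-sample VCF files with phased genotypes. A minimal sketch of parsing one simplified, phase-3-style VCF data line to recover per-sample genotypes and the alternate-allele frequency; production pipelines would instead use a library such as pysam or cyvcf2, and the record below is illustrative:

```python
def parse_vcf_line(line: str):
    """Split one VCF data line into fixed fields plus per-sample genotype strings."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt = fields[:5]
    # Samples start at column 10; GT is conventionally the first FORMAT field
    genotypes = [sample.split(":")[0] for sample in fields[9:]]
    return chrom, int(pos), vid, ref, alt, genotypes

def alt_allele_frequency(genotypes):
    """Alternate-allele frequency over phased ('|') or unphased ('/') diploid GTs."""
    alleles = []
    for gt in genotypes:
        alleles += gt.replace("|", "/").split("/")
    called = [a for a in alleles if a != "."]  # drop missing calls
    return sum(a != "0" for a in called) / len(called) if called else 0.0

# Illustrative record with three samples carrying phased genotypes
record = "22\t16050075\trs587697622\tA\tG\t100\tPASS\tAC=1\tGT\t0|0\t0|1\t1|1"
chrom, pos, vid, ref, alt, gts = parse_vcf_line(record)
print(vid, alt_allele_frequency(gts))  # rs587697622 0.5
```

The `|` separator is what distinguishes the phased haplotypes of the v5a call set from unphased genotypes, which use `/`.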

Public Repositories

The 1000 Genomes Project datasets are hosted across several primary public repositories to facilitate broad access for researchers. The Ensembl genome browser serves as a key interface, allowing users to visualize and query variant data from the project alongside other genomic annotations, including population-specific allele frequencies and consequence predictions. At the National Center for Biotechnology Information (NCBI), sequence data are archived in the Sequence Read Archive (SRA) under BioProject PRJNA28889 for access to raw sequencing reads, while the Database of Genotypes and Phenotypes (dbGaP) provides controlled access to individual-level genotype and phenotype information for certain samples to protect participant privacy. The European Genome-phenome Archive (EGA) at the European Bioinformatics Institute (EBI) hosts controlled-access datasets, such as those under DAC EGAC00001000514, encompassing 2,504 samples with sequencing and array data from the project. Data are distributed in standardized formats to support diverse analyses. Alignment files are provided in BAM format for read mappings against reference genomes like GRCh37 and GRCh38, while variant calls are available as multi-sample VCF files containing genotypes for single nucleotide variants, insertions/deletions, and structural variants across the 2,504 Phase 3 samples. These files, along with ancillary data like sample and population labels, can be downloaded from the IGSR site via FTP (ftp.1000genomes.ebi.ac.uk), HTTP, or Aspera for high-speed transfer, with the full dataset running to hundreds of terabytes. Several tools enable annotation, querying, and programmatic integration of the data. The Ensembl Variant Effect Predictor (VEP) annotates project variants with functional consequences, such as impacts on transcripts and regulatory regions, and integrates 1000 Genomes allele frequencies for variant prioritization. Programmatic access is supported through the Ensembl REST API, which allows retrieval of variant details, population statistics, and alignments without downloading entire files, as well as NCBI APIs for the SRA and EGA's query interfaces for approved users. 
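The Ensembl REST service exposes variant records, including 1000 Genomes-derived allele frequencies, as JSON. A sketch of building such a query against the documented `variation/human/<rsid>` route; the network call itself is commented out so the snippet stays self-contained, and the JSON field names in the comment are assumptions from the public API documentation:

```python
from urllib.parse import urljoin

ENSEMBL_REST = "https://rest.ensembl.org/"

def variant_url(rsid: str) -> str:
    """Build the Ensembl REST URL for a human variant record, requesting JSON."""
    return urljoin(ENSEMBL_REST, f"variation/human/{rsid}") + "?content-type=application/json"

url = variant_url("rs699")
print(url)  # https://rest.ensembl.org/variation/human/rs699?content-type=application/json

# To actually fetch (requires network access):
#   import json, urllib.request
#   data = json.load(urllib.request.urlopen(url))
#   print(data.get("MAF"), data.get("minor_allele"))  # assumed response fields
```

The same endpoint pattern works for batches via the service's POST routes, avoiding bulk VCF downloads when only a handful of variants are needed.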
Access policies balance openness with ethical considerations. Most data are openly available for research under the Fort Lauderdale principles, requiring users to cite the original publications and adhere to the IGSR data disclaimer, which emphasizes responsible use and verification of third-party rights. Controlled access applies to sensitive individual-level data in dbGaP and the EGA, where researchers must submit applications to data access committees and obtain institutional review board approval, ensuring compliance with privacy regulations in the United States and Europe.

Impact and Legacy

Scientific Applications

The 1000 Genomes Project's reference panel has been extensively utilized in genome-wide association studies (GWAS) to enable genotype imputation, allowing researchers to infer ungenotyped variants and increase the effective sample size and statistical power of analyses. By 2016, the Phase 3 panel, comprising phased haplotypes from 2,504 individuals across 26 populations, had become the dominant reference for imputation, supporting the discovery of associations for a wide range of complex traits and diseases. This resource has enhanced the resolution of GWAS by incorporating over 88 million variants, including low-frequency alleles that improve fine-mapping of causal loci, and has been applied in meta-analyses involving hundreds of thousands of participants. The project's data have significantly contributed to the development and refinement of polygenic risk scores (PRS), which aggregate the effects of numerous common variants to predict disease susceptibility. In multi-ancestry PRS models, the 1000 Genomes haplotypes facilitate linkage-disequilibrium pruning and ancestry-specific weighting, enabling more accurate risk estimation across diverse populations; for instance, a 2023 study integrated 1000 Genomes data with GWAS from five ancestries to create a PRS for coronary artery disease that outperformed European-only models by 66–113% in predictive performance across non-European ancestries (e.g., 113% for South Asian groups). This approach has been pivotal in addressing biases in PRS performance, particularly for admixed populations, by providing a reference for variant frequency and correlation structure. In clinical diagnostics, the 1000 Genomes variant catalog serves as a critical filter to distinguish pathogenic mutations from common polymorphisms during exome or genome sequencing of affected individuals. 
By identifying variants with minor allele frequencies above 1% in diverse populations, researchers can prioritize rare, potentially causal variants, accelerating diagnoses in conditions like Mendelian disorders; this filtering has been integrated into pipelines that process approximately 20,000 single-nucleotide variants per exome, reducing false positives and supporting clinical interpretation. Such applications have informed large-scale genomic projects like the UK's 100,000 Genomes Project, which analyzed thousands of rare-disease cases. Population genomics research has leveraged the project's multi-population dataset to trace human admixture events and detect signals of natural selection. Analyses of the 26 sampled populations have revealed fine-scale admixture histories, such as sex-biased admixture in African and American groups, and local ancestry patterns that illuminate post-out-of-Africa migrations, including the peopling of the Americas around 15,000–20,000 years ago. Selection scans using integrated haplotype scores have identified adaptive variants, including those for lactase persistence in Europeans and high-altitude adaptation (EPAS1) in Tibetans, providing insights into how natural selection shapes modern human genetic diversity. Integration of 1000 Genomes data with large-scale biobanks like the UK Biobank has advanced genotype–phenotype mapping by enabling high-accuracy imputation of genotypes for over 500,000 participants, linking variants to traits such as lipid levels and cardiovascular risk. The Phase 3 reference panel was used for phasing non-European ancestry samples and imputing rare variants, facilitating GWAS that uncovered novel associations across diverse ancestries and improved heritability estimates. This synergy has also supported federated learning approaches for privacy-preserving phenotype prediction, enhancing the mapping of genetic influences on complex diseases.
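The frequency-based diagnostic filtering described above reduces to a simple rule: discard any candidate variant whose population allele frequency in the reference catalog exceeds a cutoff. A toy sketch with hypothetical variant IDs and frequencies (a 1% cutoff, matching the text):

```python
def filter_rare(candidates, ref_frequencies, max_af=0.01):
    """Keep variants absent from the reference catalog or at/below the AF cutoff."""
    return [v for v in candidates if ref_frequencies.get(v, 0.0) <= max_af]

# Hypothetical candidates from a patient exome, with 1000 Genomes-style AFs
ref_af = {"rs100": 0.32, "rs200": 0.004, "rs300": 0.07}
candidates = ["rs100", "rs200", "rs300", "rs_novel"]
print(filter_rare(candidates, ref_af))  # ['rs200', 'rs_novel']
```

Variants missing from the catalog entirely (here `rs_novel`) are retained, since absence from 2,504 diverse genomes is itself evidence of rarity.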
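The polygenic risk scores discussed earlier are, at their core, weighted sums of effect-allele dosages. A toy sketch with illustrative variant IDs and effect sizes (not drawn from any published score):

```python
def polygenic_score(dosages, weights):
    """PRS = sum over shared variants of (effect-allele dosage x effect size).

    dosages: dict of variant ID -> dosage in {0, 1, 2} (or imputed fractions)
    weights: dict of variant ID -> per-allele effect size (e.g. log odds ratio)
    Variants missing from either dict are skipped, as in simple clumping pipelines.
    """
    shared = dosages.keys() & weights.keys()
    return sum(dosages[v] * weights[v] for v in shared)

# Hypothetical individual: heterozygous at two loci, homozygous reference at one
dosages = {"rs1": 1, "rs2": 1, "rs3": 0}
weights = {"rs1": 0.12, "rs2": -0.05, "rs3": 0.30}
print(round(polygenic_score(dosages, weights), 4))  # 0.07
```

Multi-ancestry scores extend this by choosing the weight set (or a mixture of weight sets) according to the individual's inferred ancestry, which is where 1000 Genomes frequency and linkage-disequilibrium references enter the pipeline.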

Broader Contributions

The 1000 Genomes Project's data, released as an open resource, has profoundly shaped genomic research worldwide, with its flagship 2015 publication cited nearly 18,000 times as of 2025, reflecting its extensive use across diverse studies. This open-access catalog of human genetic variation has served as a foundational resource, enabling researchers to filter variants and impute genotypes in large-scale analyses. Its influence extends to major initiatives like the Genome Aggregation Database (gnomAD), which builds directly on the project's framework to aggregate and annotate variants from over 800,000 exomes and genomes (as of gnomAD v4 in 2023), enhancing population-specific allele frequency estimates. Similarly, the All of Us Research Program leverages the project's diverse reference panels to advance precision medicine through inclusive genomic data from approximately 860,000 participants (as of mid-2025, aiming for over one million), emphasizing underrepresented groups. In the realm of ethics, the project pioneered standards for diverse sampling and informed consent in global genomic research, developing model consent forms that explicitly address broad data sharing, the lack of individual result return, and unrestricted future use without participant veto. These guidelines, reviewed by bodies like the Public Population Project in Genomics (P3G), ensured samples from 26 populations across Africa, the Americas, East Asia, South Asia, and Europe were collected with local approvals, promoting equitable participation while minimizing risks to vulnerable communities. By mandating no commercial benefit sharing from derived products and prohibiting sampling from particularly isolated groups without robust protections, the project set precedents adopted in subsequent efforts, fostering trust and inclusivity in genomic data generation. The project's datasets have also had a significant educational impact, serving as core resources in bioinformatics programs worldwide. 
For instance, EMBL-EBI's summer schools and online courses utilize 1000 Genomes data to instruct on sequence alignment, variant calling, and population-level analysis, enabling hands-on learning with real-world genomic sequences. These openly available files, including VCFs and alignments, support curriculum development in universities and workshops, training thousands of students and professionals in computational tools like PLINK and VEP for variant interpretation. This accessibility has democratized bioinformatics education, bridging theoretical concepts with practical applications in genomic studies. Despite these contributions, the project faced critiques for underrepresentation of certain indigenous and isolated populations, such as Native American and Oceanian groups, which comprised less than 5% of samples and limited insights into rare variants in these ancestries. Studies have highlighted how this perpetuates inequities in genomic reference panels, potentially skewing clinical interpretations for underrepresented patients. Successor projects, including the high-coverage resequencing of the expanded 1000 Genomes cohort by the New York Genome Center and the work of the Human Pangenome Reference Consortium, have addressed these gaps by incorporating deeper sequencing from diverse samples and integrating structural variant data from additional populations. By 2025, the Human Pangenome Reference Consortium had extended this legacy with a graph-based reference incorporating variation from over 200 diverse individuals, improving representation of structural variants.
