Sequence assembly

Sequence assembly is a fundamental process in bioinformatics that involves reconstructing complete or near-complete biological sequences, such as DNA or RNA, from numerous short fragments called reads generated by high-throughput sequencing technologies.^[1] This reconstruction is essential because sequencing instruments produce reads that are typically 50–300 base pairs long, far shorter than the genomes or transcriptomes they aim to represent, necessitating computational algorithms to align and merge these fragments into contiguous sequences known as contigs.^[2] The process, which emerged in the early 1980s, gained prominence through the Human Genome Project's hierarchical shotgun sequencing but has evolved significantly with next-generation sequencing (NGS), enabling de novo assembly of novel genomes without prior reference sequences.^[3] Key methods in sequence assembly include de novo assembly, which builds sequences from scratch using overlap-layout-consensus (OLC) or de Bruijn graph algorithms, and reference-based assembly, which maps reads to an existing reference genome for guided reconstruction.^[1] OLC approaches, such as those in tools like SGA, identify overlapping reads to form graphs that are then simplified into contigs, while de Bruijn graphs, employed by assemblers like Velvet and ABySS, break reads into k-mers for efficient handling of large datasets.^[3] Assembly pipelines typically proceed through four main stages: preprocessing to correct errors and filter low-quality reads; graph construction to organize overlaps; graph simplification to resolve redundancies and ambiguities; and postprocessing to scaffold contigs using paired-end information and assess quality metrics like N50 contig length.^[3] Despite these advances, sequence assembly faces significant challenges, including handling repetitive genomic regions that cause ambiguities in read alignment, sequencing errors from platforms like Illumina or PacBio, and the computational demands of 
processing terabytes of data from high-coverage experiments.^[1] In complex genomes, such as those of plants with high heterozygosity or polyploidy, these issues can lead to fragmented assemblies or misassemblies, requiring hybrid approaches combining short- and long-read technologies for improved accuracy.^[4] The importance of sequence assembly lies in its role as a cornerstone of genomics, facilitating applications from gene annotation and variant discovery to metagenomics and evolutionary studies, ultimately advancing fields like personalized medicine and biodiversity conservation.^[5] Recent developments, including long-read sequencing from Oxford Nanopore and integration of machine learning for error correction, continue to enhance assembly quality and scalability.^[4]^[6]
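
The N50 contig length mentioned above can be made concrete with a short sketch. It assumes the usual definition (the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the total assembly length); the function name `n50` and the example lengths are illustrative, not taken from any particular assembler.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths (in bases)."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contig first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            # At least half of all assembled bases sit in contigs this long or longer.
            return length


# Made-up contig lengths for illustration only.
contig_lengths = [100, 80, 50, 30, 20, 10, 10]
print(n50(contig_lengths))  # 80
```

Because N50 depends only on the length distribution, a misassembly that wrongly joins two contigs raises it, so it is usually read alongside correctness and completeness checks rather than on its own.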

Fundamentals

Definition and Purpose

Sequence assembly is the computational process of reconstructing long DNA or RNA sequences from numerous short, overlapping fragments known as sequencing reads, typically 50 to 300 base pairs (bp) long for short-read technologies.^[7]^[1] This reconstruction yields contiguous sequences called contigs, which can be further linked into scaffolds approximating larger structures such as chromosomes or transcripts.^[5] The process is essential because modern high-throughput sequencing methods generate millions of these short reads rather than a complete genomic sequence in a single pass.^[8]

The primary purpose of sequence assembly is to enable comprehensive genomic analysis, including genome annotation to identify genes and regulatory elements, variant discovery for detecting mutations, evolutionary studies to compare species, and practical applications such as personalized medicine through tailored diagnostics and therapies.^[9] A landmark demonstration of its importance was the Human Genome Project, completed in 2003, which relied on assembly techniques to map approximately 3 billion base pairs of the human genome, providing the foundational reference for subsequent research.^[10]^[11]

At a high level, the workflow begins with read generation from sequencing platforms, followed by overlap detection to identify matching regions between reads, contig formation by merging overlapping fragments into longer sequences, and scaffolding to estimate the order and orientation of contigs using additional data like paired-end mappings.^[12] This fragmented reconstruction can be likened to piecing together a book from shredded pages, where overlapping text snippets guide the alignment despite gaps or ambiguities, such as repetitive regions that complicate precise joining.^[13]

Key Challenges

Sequence assembly faces several inherent challenges that complicate the reconstruction of continuous genomic sequences from fragmented reads. These difficulties arise from the nature of biological genomes and the limitations of sequencing technologies, often leading to fragmented, erroneous, or incomplete assemblies.^[14]

One major obstacle is the presence of repetitive regions, such as tandem repeats and segmental duplications, which create ambiguity in read placement because identical or highly similar sequences cannot be uniquely mapped. For instance, approximately 50% of the human genome consists of repetitive DNA, including transposable elements and other repeats that confound accurate reconstruction.^[15] This ambiguity results in collapsed repeats or misassemblies, particularly when reads are shorter than the repeat units.^[16]

Sequencing errors further exacerbate assembly issues by introducing inaccuracies in base calls that hinder reliable overlap detection between reads. Error rates vary by technology: early next-generation sequencing platforms exhibited rates of 1-15%, modern short-read methods like Illumina achieve around 0.1-1%, and raw reads from early long-read chemistries such as PacBio or Oxford Nanopore often ranged from 10-20%, although newer long-read chemistries are considerably more accurate.^[17]^[18] These errors, primarily substitutions, insertions, or deletions, propagate into contig formation and require computational correction, yet residual inaccuracies can lead to chimeric or incorrect consensus sequences.^[19]

Coverage variability, characterized by uneven read depth across the genome, often results in gaps or over-representation in assemblies, making it difficult to resolve low-coverage regions. This unevenness stems from biases in library preparation, PCR amplification artifacts, and sequencing inefficiencies, sometimes producing chimeric reads that span unintended genomic breakpoints.^[20] In practice, such variability undermines coverage-based diagnostics for assembly quality and can leave substantial portions of the genome unassembled.^[21]

The computational complexity of sequence assembly poses another significant barrier, as the problem of finding the optimal arrangement of reads is NP-hard, even for simplified models.^[22] For large eukaryotic genomes, this translates to immense time and memory demands when handling datasets that can reach terabytes in size, necessitating heuristic algorithms that trade optimality for feasibility.^[23]

In diploid organisms, polymorphisms and heterozygosity add layers of complexity by requiring the distinction of allelic variants from two homologous chromosomes, often leading to haplotype phasing issues. High heterozygosity rates, such as 0.85-1.28% in some species, can cause assemblers to erroneously merge or separate haplotypes, resulting in redundant or fragmented contigs.^[24] This challenge is particularly acute in de novo assembly without a reference, where resolving true variants versus sequencing noise demands high coverage and sophisticated modeling.^[25]
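
As a rough guide to how read depth relates to gaps, the sketch below uses the idealized Lander-Waterman model, which assumes reads are sampled uniformly at random from the genome. That assumption is exactly what the coverage biases described above violate, so the numbers are a best case rather than a prediction. The function names and example values are illustrative.

```python
import math


def mean_coverage(num_reads, read_length, genome_size):
    """Average depth c = N * L / G under uniform sampling."""
    return num_reads * read_length / genome_size


def expected_uncovered_fraction(coverage):
    """Probability that a given base is hit by no read (Poisson approximation, e^-c)."""
    return math.exp(-coverage)


# Hypothetical bacterial project: one million 150 bp reads on a 5 Mb genome.
c = mean_coverage(1_000_000, 150, 5_000_000)
print(c)  # 30.0
print(expected_uncovered_fraction(c))  # vanishingly small in the idealized model
print(expected_uncovered_fraction(5))  # at 5x, well under 1% of bases are expected to be missed
```

In real data, regions with extreme GC content or amplification bias can remain uncovered even when the average depth is high, which is why observed gap counts exceed what this model suggests.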

Types of Sequence Assembly

De Novo Assembly

De novo assembly, also known as de novo genome assembly, is the process of reconstructing a genome sequence solely from raw sequencing reads without relying on a preexisting reference genome. This approach is particularly valuable for sequencing novel organisms, non-model species, or populations where no high-quality reference exists, enabling the generation of a complete, unbiased representation of the genetic material.^[13] The assembly process begins with read correction, where errors introduced during sequencing—such as base substitutions or indels—are identified and fixed using consensus from multiple overlapping reads or specialized error-correction algorithms. Next, overlaps between corrected reads are detected, often employing graph-based methods like de Bruijn graphs, which represent k-mer substrings to efficiently identify shared sequences. These overlaps are then used to build contigs, which are continuous stretches of DNA formed by merging aligned reads into longer sequences. Finally, scaffolding orders and orients these contigs into larger structures using long-range information from mate-pair libraries, paired-end reads, or chromatin interaction data like Hi-C, which provide distance constraints between distant genomic regions; gaps between scaffolds may remain unresolved without further data.^[26]^[3] One key advantage of de novo assembly is its ability to uncover novel genetic sequences, including those absent from existing databases, and to accurately resolve structural variants such as insertions, deletions, and rearrangements that might be missed or biased in reference-dependent methods. It is especially effective for genomes with unique evolutionary histories or high variability. 
However, the method is prone to fragmentation, particularly in repetitive regions where short reads cannot uniquely span repeats, leading to collapsed or incomplete assemblies; this challenge is exacerbated in complex eukaryotic genomes compared to simpler ones.^[27]^[28] In practice, de novo assembly has been widely applied to microbial genomes, where bacterial assemblies often achieve near-complete contiguity due to their relatively low complexity, compact size (typically 2–10 Mb), and fewer repetitive elements. For instance, tools like SPAdes have enabled high-quality drafts of bacterial isolates from environmental samples, closing most gaps and annotating functional elements with minimal fragmentation.^[29]^[30] Historically, de novo assembly became dominant in the early next-generation sequencing (NGS) era following the introduction of platforms like 454 in 2005, revolutionizing the study of uncultured microbes by allowing rapid reconstruction of genomes from metagenomic samples without prior cultivation or reference data. This shift facilitated thousands of microbial genome publications between 2006 and 2010, marking a departure from Sanger-era limitations.^[31]^[32]
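
The read-correction step described in this section is often approached through the k-mer spectrum: k-mers that occur only once or twice in a high-coverage dataset are more likely to contain a sequencing error than to reflect the genome. The following is a minimal sketch of that idea, not the method of any specific assembler; the function names, the threshold, and the toy reads are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(counts, min_count):
    """Return k-mers seen fewer than min_count times (candidate errors)."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Toy data: three identical reads and one read with a single substitution (T -> A).
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
counts = count_kmers(reads, k=4)
print(sorted(weak_kmers(counts, min_count=2)))  # ['ACGA', 'CGAA', 'GAAC']
```

A single substitution corrupts up to k consecutive k-mers, which is why the one erroneous read above produces three weak 4-mers. A real corrector would go on to replace the offending base with the one that restores well-supported k-mers, and would have to avoid "correcting" genuine low-coverage or heterozygous sequence.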

Mapping-Based Assembly

Mapping-based assembly, also known as reference-guided assembly, is a strategy in bioinformatics that reconstructs a genome by aligning sequencing reads to a pre-existing reference genome, enabling the identification and correction of variations or gaps in the reference sequence. This approach leverages the reference as a scaffold to guide the placement of reads, facilitating the assembly of closely related genomes or resequencing efforts where the target organism shares significant similarity with the reference. Tools such as BWA (Burrows-Wheeler Aligner) and Bowtie are commonly employed for the initial read alignment step, as they efficiently map short reads to large reference genomes using Burrows-Wheeler transform-based indexing for speed and accuracy. BWA, for instance, supports gapped alignments to handle insertions, deletions, and mismatches, making it suitable for reconstructing sequences with polymorphisms.^[33]^[34] The process begins with read mapping, where sequencing reads are aligned to the reference genome using aligners like BWA-MEM, which employs a combination of seeding, chaining, and dynamic programming to produce high-quality alignments even for longer reads. Following alignment, variant calling identifies differences such as single nucleotide polymorphisms (SNPs) and insertions/deletions (indels) by comparing mapped reads against the reference, often using tools that generate pileup files to assess coverage and consensus at each position. Consensus building then integrates these variants to produce a refined assembly, filling gaps in low-coverage regions or correcting errors in the reference through majority voting or probabilistic models based on read depth and quality scores. 
This method effectively handles structural variations like indels by local realignment around discrepancies.^[35]^[28] One key advantage of mapping-based assembly is its computational efficiency and higher accuracy for species closely related to the reference, as the guiding scaffold reduces search space and resolves ambiguities in repetitive regions by providing contextual anchors for read placement. It is particularly effective for detecting small-scale variants, achieving lower error rates compared to de novo methods in resequencing scenarios with high similarity. However, this approach introduces biases inherited from the reference genome's quality and completeness, potentially underrepresenting novel sequences or structural rearrangements in more divergent genomes, where unmapped reads may be discarded or poorly assembled.^[28]^[36] Mapping-based assembly is widely applied in resequencing projects and population genomics, such as the 1000 Genomes Project, which sequenced 2,504 individuals from diverse populations to catalog human genetic variation by aligning low-coverage whole-genome reads (mean depth 7.4×) to the human reference genome (GRCh37) using multiple aligners and variant callers to achieve high-confidence genotyping of over 88 million sites. This method enabled the discovery of common variants with frequencies above 1%, supporting studies on human diversity and disease association.^[37]
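
The consensus-building step described above can be sketched as a per-position majority vote over aligned reads. This toy version assumes ungapped alignments given as hypothetical `(start, sequence)` pairs with 0-based starts, ignores base qualities and indels, and falls back to the reference base where depth is too low. Production pipelines work from SAM/BAM alignments and use probabilistic models instead.

```python
from collections import Counter


def consensus_from_alignments(reference, alignments, min_depth=2):
    """Majority-vote consensus; keeps the reference base where depth < min_depth."""
    piles = [Counter() for _ in reference]  # one base-count pile per reference position
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                piles[pos][base] += 1

    consensus = []
    for ref_base, pile in zip(reference, piles):
        depth = sum(pile.values())
        if depth < min_depth:
            consensus.append(ref_base)  # too little evidence: trust the reference
        else:
            consensus.append(pile.most_common(1)[0][0])  # ties go to the base seen first
    return "".join(consensus)


reference = "ACGTACGT"
alignments = [(0, "ACGAAC"), (2, "GAACGT"), (1, "CGAACG")]  # all three reads carry T->A at position 3
print(consensus_from_alignments(reference, alignments))  # ACGAACGT
```

The `min_depth` fallback illustrates the reference bias noted in this section: wherever reads are sparse or fail to map, the output simply inherits the reference sequence.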

Hybrid Assembly

Hybrid assembly integrates multiple sequencing data types, typically combining high-accuracy short reads from platforms like Illumina with longer but error-prone reads from technologies such as PacBio or Oxford Nanopore, to leverage the strengths of each for improved genome reconstruction. This approach enhances contiguity by using long reads to span repetitive regions while employing short reads to correct errors and fill gaps, resulting in more complete assemblies than those from single data types alone.^[38]^[39] The process generally involves error correction of long reads using short-read alignments, followed by scaffolding or graph-based integration to build contigs. For instance, tools like MaSuRCA construct "mega-reads" by pairing short Illumina reads with long PacBio reads to create accurate, extended sequences that are then assembled via a hybrid de Bruijn-overlap method. Similarly, Unicycler builds a short-read assembly graph with SPAdes and bridges it using long reads for bacterial genomes, though adaptable principles apply to eukaryotes. These pipelines often incorporate polishing steps with short reads to refine the final output.^[40]^[38]^[39] Hybrid methods excel at resolving structural challenges like repeats and gaps, yielding higher-quality assemblies with greater contiguity and fewer misassemblies compared to short- or long-read-only approaches. They enable chromosome-level reconstructions, as demonstrated in the Telomere-to-Telomere (T2T) human genome assembly (T2T-CHM13), which used PacBio HiFi long reads for primary contig formation, Oxford Nanopore ultralong reads for scaffolding, and Illumina short reads for error correction and polishing to achieve a gapless 3.055 Gbp sequence. 
This integration resolved over 200 Mbp of previously unassembled repetitive regions.^[41]^[42] Recent advances from 2023 to 2025 have extended hybrid strategies to phased diploid and polyploid assemblies, improving handling of heterozygosity and multiple chromosome sets. Algorithms like Hifiasm, originally for HiFi reads, have been adapted in hybrid contexts to produce haplotype-resolved diploid assemblies by integrating short-read polishing, facilitating accurate phasing in complex genomes. For polyploid plants, hybrid pipelines combining long-read assembly with short-read correction have enabled high-contiguity reconstructions, addressing challenges from homeologous chromosomes in crops like wheat progenitors.^[43]^[44] In practice, hybrid assembly proves particularly effective for genomes dominated by repeats, such as fungal species like Trichoderma villosa, where it recovered more complete gene sets and repetitive elements than short-read methods alone. Similarly, in plants with over 80% repetitive content, like the wheat progenitor Aegilops tauschii (approximately 80% repeats), MaSuRCA-based hybrid assembly produced a highly contiguous 4 Gbp genome, spanning complex transposable element arrays and resolving structural variants missed in prior efforts.^[45]^[40]^[46]

Applications

Whole Genome Assembly

Whole genome assembly involves reconstructing the complete sequence of an organism's nuclear and organelle DNA, such as mitochondrial and chloroplast genomes, to produce full chromosomes or circular molecules.^[47] In prokaryotes, this process is simpler due to their typically circular chromosomes ranging from 1 to 10 megabases (Mb) in size, with fewer repetitive elements and no introns, allowing for more straightforward de novo reconstruction.^[47] Eukaryotic genomes, by contrast, feature linear chromosomes that together span about 3 gigabases (Gb) in humans, complicated by extensive repetitive sequences, introns, and structural variations, which demand integrated approaches combining short- and long-read sequencing to achieve high continuity.^[47]

A pivotal milestone in whole genome assembly was the Human Genome Project, which in 2003 produced a finished reference sequence covering about 99% of the euchromatic portion of the human genome (roughly 92% of the genome overall) using hierarchical shotgun sequencing and bacterial artificial chromosome clones.^[48] This effort laid the foundation for comparative genomics but left gaps in repetitive regions. Subsequent advancements culminated in the Telomere-to-Telomere (T2T) Consortium's 2022 achievement of the first complete, gapless human genome assembly (T2T-CHM13), spanning 3.055 Gb and including all centromeres, telomeres, and repetitive elements, through long-read technologies such as PacBio HiFi and ultra-long Oxford Nanopore reads.^[41] Building on this foundation, a 2025 study sequenced 65 diverse human genomes to generate 130 haplotype-resolved assemblies with a median contig length of 130 Mb, closing 92% of remaining gaps and improving global genetic diversity representation.^[49]

Assembling eukaryotic genomes faces specific hurdles, such as the highly repetitive alpha-satellite DNA in centromeres and the TTAGGG repeats at telomeres, which historically caused fragmentation due to short-read limitations and sequencing biases.^[50] In polyploid crop species like wheat or potato, multiple homologous chromosome sets exacerbate these issues, leading to haplotype confusion and inflated assembly sizes without phased approaches.^[51]

The primary outputs of whole genome assembly include contigs (continuous sequences built from overlapping reads), scaffolds (contigs linked in order with estimated gap sizes), and AGP ("A Golden Path") files that describe how these components are ordered and oriented into chromosomes for visualization and annotation.^[52] Post-2020, Hi-C chromatin conformation capture has become routine for integrating these outputs into chromosome-scale assemblies, enabling anchoring of scaffolds via long-range interaction data in diverse species from plants to insects.^[53]

Transcriptome Assembly

Transcriptome assembly is the computational process of reconstructing full-length messenger RNA (mRNA) sequences from short complementary DNA (cDNA) reads generated by RNA sequencing (RNA-Seq), enabling the capture of alternative splicing isoforms and expressed gene structures.^[54] This method has superseded earlier expressed sequence tag (EST) approaches, which relied on low-throughput Sanger sequencing to produce partial, low-coverage transcript fragments.^[55] By providing comprehensive coverage of the transcriptome, RNA-Seq-based assembly facilitates the identification of novel transcripts, quantification of gene expression, and annotation of splice variants, particularly in non-model organisms lacking a reference genome.^[56] The assembly process typically follows one of two strategies: de novo assembly, which builds transcripts directly from unaligned reads without prior genomic information, or reference-guided assembly, which maps reads to a known genome or transcriptome before reconstructing isoforms.^[56] In de novo approaches, tools like Trinity employ de Bruijn graph-based methods to resolve transcript contigs by clustering reads into splicing-aware components, effectively handling the complexity of alternative splicing.^[57] Reference-guided methods, such as StringTie, use network flow algorithms on aligned reads to model transcript structures and estimate abundances, improving accuracy when a reference is available.^[58] These pipelines often incorporate preprocessing steps like quality trimming and error correction to mitigate sequencing artifacts.^[56] Key advantages of transcriptome assembly include the ability to reveal dynamic gene expression profiles and discover unannotated transcripts without relying on a complete genome assembly, making it essential for evolutionary and functional genomics studies.^[59] However, significant challenges persist, such as isoform ambiguity from overlapping reads, under-detection of low-abundance transcripts, and 
biases in coverage due to RNA degradation or sequencing depth; these issues intensified with the rapid adoption of RNA-Seq following its introduction in 2008.^[59] Recent advancements, particularly since 2023, have leveraged long-read RNA-Seq technologies like PacBio and Oxford Nanopore to resolve full-length isoforms with higher fidelity, reducing fragmentation errors in complex splicing events.^[60] In single-cell RNA-Seq contexts, these long-read methods enhance resolution of cell-type-specific transcriptomes, enabling precise isoform detection in heterogeneous populations.^[61] Assemblies from such data can be evaluated for completeness using tools like BUSCO, which assess conserved ortholog recovery.^[56]

Sequencing Technologies

Short-Read Sequencing

Short-read sequencing technologies generate DNA fragments typically up to 300 base pairs (bp) in length, enabling high-throughput genome analysis with low per-base error rates. The most widely adopted platform is Illumina sequencing, which produces reads of 50–300 bp with an error rate of approximately 0.1% (or 1 in 1,000 bases), allowing billions of reads per run for deep coverage at reduced costs.^[17]^[62] Another early short-read method, Roche's 454 pyrosequencing, yielded longer reads of 400–1,000 bp but suffered from higher error rates of 1–2%, particularly insertions and deletions, and has been deprecated since 2016 due to inferior throughput and cost-efficiency compared to newer platforms.^[63]^[64] These technologies profoundly influenced sequence assembly by providing uniform, high-coverage data that supports de Bruijn graph-based algorithms, which break reads into k-mers to reconstruct contigs efficiently despite short lengths. However, short reads often fail to span repetitive regions exceeding their length, resulting in fragmented assemblies and gaps in complex genomic areas.^[65]^[66] From 2005 to 2015, short-read platforms dominated de novo assembly projects, enabling the sequencing of thousands of bacterial and eukaryotic genomes with coverages often exceeding 30×.^[67] Key advances mitigated some limitations through paired-end and mate-pair library preparations, where both ends of DNA fragments are sequenced to provide insert size information—typically 200–500 bp for paired-end and 2–20 kb for mate-pair—facilitating scaffolding and repeat resolution. By 2020, these innovations contributed to a dramatic cost reduction, bringing whole human genome sequencing below $1,000, democratizing assembly for large-scale studies. Despite these gains, short-read assemblies remain prone to fragmentation in repetitive or low-complexity regions, often requiring complementary approaches for complete genomes.^[68]^[69]^[70]

Long-Read Sequencing

Long-read sequencing technologies produce DNA reads exceeding 10 kb in length, enabling the spanning of repetitive regions and the resolution of complex genomic structures that challenge shorter-read methods. Pacific Biosciences (PacBio) HiFi sequencing generates highly accurate long reads of 10-25 kb using circular consensus sequencing (CCS), achieving an error rate of approximately 0.1% through multiple passes over the same template molecule.^[71]^[72] In contrast, Oxford Nanopore Technologies (ONT) employs nanopore-based detection to yield ultra-long reads up to 2 Mb, supporting real-time sequencing with raw error rates historically ranging from 5-15%, though recent advancements have pushed single-read accuracy above 99% for certain chemistries.^[73]^[74] These technologies have transformed sequence assembly by providing contiguous scaffolds that capture structural variants (SVs) and repetitive elements, which often fragment assemblies from shorter reads.^[75]^[76] The impact of long-read sequencing is exemplified in telomere-to-telomere (T2T) genome assemblies, where it has enabled complete, gapless representations of genomes. 
In 2022, the T2T Consortium produced the first fully complete human genome assembly (CHM13) using a combination of PacBio HiFi and ONT ultra-long reads, resolving over 200 million base pairs of previously unassembled repetitive sequences, including centromeres and acrocentric regions.^[41] In 2025, similar approaches facilitated T2T assemblies for plant genomes, such as those of crops like Medicago truncatula, where ultra-long ONT reads bridged extensive repeats and polyploid complexities to achieve chromosome-level contiguity without gaps.^[77] These advancements have improved the detection of SVs, which constitute a significant portion of human genetic variation, by directly traversing insertions, deletions, and inversions that short reads cannot reliably phase.^[78] Recent developments from 2023 to 2025 have further enhanced long-read utility in assembly through improved basecalling and hybrid strategies. ONT's Dorado basecaller, released in 2023, leverages GPU acceleration for faster, more accurate real-time processing of R10.4 flow cells, reducing computational bottlenecks and enabling on-device analysis.^[79]^[80] As of 2025, ONT's R10.4.1 flow cells achieve >99% single-read accuracy, with ongoing developments announced at London Calling 2025 enhancing throughput and real-time analysis capabilities.^[73]^[81] Integration with short-read data for polishing has also advanced, with tools iteratively correcting long-read errors using high-depth short-read alignments, often reducing assembly error rates by over 1% in vertebrate genomes.^[82] For instance, polishing PacBio assemblies of species like the green anole lizard has yielded near-complete, error-free drafts with enhanced scaffold N50 values exceeding 100 Mb.^[83] Such refinements, often in hybrid assembly contexts, underscore long-read sequencing's role in producing high-fidelity references for diverse applications.

Assembly Algorithms

Overlap-Layout-Consensus Methods

Overlap-layout-consensus (OLC) methods represent a foundational approach to de novo sequence assembly, particularly suited for datasets with long reads and lower coverage depths. These algorithms reconstruct the original sequence by first identifying overlapping regions between sequencing reads, then arranging the reads into a consistent layout that approximates the genome structure, and finally deriving a consensus sequence from the aligned reads to resolve errors and ambiguities. Developed initially for Sanger sequencing data, OLC approaches excel when read lengths are sufficient to span repetitive regions reliably, allowing for accurate overlap detection without excessive fragmentation.^[84]^[85]

The process begins with overlap detection, where pairwise similarities between reads are computed to identify potential alignments. This step often employs techniques such as k-mer indexing, where short substrings (k-mers) of fixed length are extracted from reads and stored in a hash table to quickly filter candidate pairs for full alignment; for instance, assemblers like Arachne use k=24 for this purpose to balance sensitivity and computational efficiency. Alignments are then scored based on metrics like sequence identity and length, discarding weak overlaps that may arise from sequencing errors or distant homologies. The output forms an overlap graph, a directed graph with reads as nodes and weighted edges representing overlap quality and direction.^[86]^[87]

In the layout phase, the overlap graph is traversed to find paths that represent contigs, the continuous segments of the assembly. Because an ideal layout would visit every read exactly once (a Hamiltonian path, which is computationally intractable to find in general), assemblers rely on heuristics such as unitig formation, which bundles unambiguously overlapping reads into linear arrangements, and they resolve branches caused by repeats or errors using coverage depth or edge weights. This step handles errors by prioritizing high-scoring paths and may incorporate mate-pair information for scaffolding. Finally, the consensus phase aligns reads along each contig and generates the nucleotide sequence by voting or probabilistic models, such as weighted majority for base calls, to achieve high accuracy; for example, Celera Assembler uses a dynamic programming approach to compute this consensus while detecting variants.^[88]^[85]

OLC methods are particularly effective for long-read technologies, from Sanger capillary electrophoresis, whose reads of several hundred bases were long for their era, to modern Pacific Biosciences (PacBio) sequencing, where read lengths often exceed 10 kb, enabling overlaps that capture unique genomic context even at 5-10x coverage. In contrast, they are less efficient for short-read data, where alternatives like de Bruijn graphs are preferred due to the high volume of fragments. Seminal implementations include the Celera Assembler, which powered the whole-genome shotgun assembly of Drosophila melanogaster and Celera Genomics' assembly of the human genome, achieving over 99% accuracy in non-repetitive regions.^[89]^[84]

Contemporary OLC-based tools, such as Canu, adapt the paradigm for error-prone long reads by integrating read correction via adaptive k-mer weighting prior to overlap detection, yielding near-complete assemblies for bacterial and eukaryotic genomes with N50 contig sizes exceeding 10 Mb. Despite these advances, the naive OLC paradigm incurs O(n²) time complexity for overlap computation on n reads, which is mitigated through indexing and approximate matching but remains a bottleneck for ultra-large datasets.^[90]
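
The overlap and layout phases can be illustrated with a deliberately naive sketch. It uses exact suffix-prefix matching and a greedy "merge the best-overlapping pair" rule as a stand-in for a real layout algorithm; it does not model sequencing errors, reverse complements, or a separate consensus step, and it makes no attempt to resolve repeats. All function names and reads are illustrative.

```python
from itertools import permutations


def overlap_length(a, b, min_overlap):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if below min_overlap."""
    for length in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    seqs = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while True:
        best_len, best_a, best_b = 0, None, None
        for a, b in permutations(seqs, 2):  # all ordered pairs: the O(n^2) step
            olen = overlap_length(a, b, min_overlap)
            if olen > best_len:
                best_len, best_a, best_b = olen, a, b
        if best_len == 0:
            return seqs  # nothing left to merge: these are the contigs
        seqs.remove(best_a)
        seqs.remove(best_b)
        seqs.append(best_a + best_b[best_len:])


reads = ["ATGGCGT", "GCGTGCA", "TGCAATC"]
print(greedy_assemble(reads))  # ['ATGGCGTGCAATC']
```

The all-pairs loop makes the quadratic cost of naive overlap detection visible; k-mer indexing replaces it by comparing only reads that share at least one indexed k-mer.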

De Bruijn Graph Methods

De Bruijn graph methods for sequence assembly transform the problem of reconstructing a genome from short sequencing reads into finding an Eulerian path in a directed graph, where reads are decomposed into fixed-length substrings known as k-mers.^[91] In this approach, introduced for DNA fragment assembly, the graph captures overlaps between k-mers to efficiently represent the underlying sequence, enabling polynomial-time solutions that avoid exhaustive pairwise alignments.^[91]^[92]

The process begins with k-mer decomposition, where each read of length L is broken into L - k + 1 overlapping k-mers, providing the building blocks for the graph.^[92] Graph construction follows, with nodes representing unique (k-1)-mers (the prefixes and suffixes of k-mers) and directed edges corresponding to k-mers: an edge runs from node u to node v when an observed k-mer has u as its (k-1)-base prefix and v as its (k-1)-base suffix, so that the last k-2 bases of u match the first k-2 bases of v.^[92] The number of nodes |V| equals the count of unique (k-1)-mers, while the number of edges |E| equals the number of distinct k-mers observed; the total number of k-mer occurrences, which grows roughly with genome length times coverage depth, determines edge multiplicities rather than graph size.^[93] To address sequencing errors, which manifest as low-coverage "tips" or dead-end paths, the graph undergoes simplification by removing these tips based on coverage thresholds.^[92] "Bubbles", short divergent paths arising from sequencing errors or biological variants, are then resolved by selecting the highest-coverage path or using paired-end information to pop the bubble.^[92] Finally, an Eulerian path traversing each edge exactly once reconstructs the contigs, with repeats handled by edge multiplicities reflecting coverage.^[91]

These methods excel in memory efficiency for high-coverage short-read data, as the graph scales with unique k-mers rather than full read alignments, making them suitable for massive datasets from next-generation sequencing.^[92] They also manage repetitive regions effectively by leveraging k-mer coverage to distinguish true repeats from errors, reducing misassemblies compared to overlap-based alternatives.^[91] Prominent implementations include Velvet, which employs iterative k-mer sizing and graph simplification for de novo assembly, achieving high contiguity in bacterial genomes.^[94] Similarly, ABySS uses a parallelized de Bruijn graph to distribute computation across clusters, enabling scalable assembly of large eukaryotic genomes like the white spruce.^[95]

Despite these strengths, de Bruijn graph methods are sensitive to the choice of k: small k values fail to span repeats, which collapse into tangled graph regions, while large k values fragment assemblies in low-coverage areas and leave fewer error-free k-mers per read.^[92] Low overall coverage exacerbates these issues, as missing k-mers break the graph into disconnected pieces and hinder Eulerian path resolution.^[91]
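
The construction described above can be sketched in a few lines of Python. This is a toy illustration rather than how Velvet or ABySS are implemented: it assumes error-free reads from a single strand, keeps one edge per distinct k-mer (so coverage and repeat multiplicities are ignored), and assumes an Eulerian path exists. The function names `kmers`, `build_de_bruijn`, `eulerian_path`, and `spell` are hypothetical helpers introduced only for this example.

```python
from collections import defaultdict


def kmers(read, k):
    """Return the L - k + 1 overlapping k-mers of a read of length L."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix node to the (k-1)-mer suffix nodes it points to.

    Each distinct k-mer contributes exactly one edge (toy simplification).
    """
    graph = defaultdict(list)
    distinct = {km for read in reads for km in kmers(read, k)}
    for kmer in sorted(distinct):  # sorted only to make the result deterministic
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: a walk that uses every edge exactly once.

    Assumes such a path exists in the graph.
    """
    out_edges = {node: list(targets) for node, targets in graph.items()}
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in out_edges if len(out_edges[n]) - in_degree[n] == 1),
        next(iter(out_edges)),
    )
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges.get(node):
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Spell a sequence: the first node plus the last base of each later node."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCG", "GGCGTG", "CGTGCA"]  # overlapping reads from ATGGCGTGCA
graph = build_de_bruijn(reads, k=4)
contig = spell(eulerian_path(graph))
print(contig)  # ATGGCGTGCA
```

With k = 4 the three reads yield seven distinct k-mers, hence seven edges over eight 3-mer nodes, and the single Eulerian path spells the original sequence. Real assemblers additionally handle reverse complements, tip and bubble removal, and branching at repeats, none of which is attempted here.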

Specialized Algorithms for Modern Data

Recent advancements in sequence assembly have focused on algorithms tailored to long-read and hybrid datasets, addressing challenges like high error rates, repetitive regions, and diploid complexity in technologies such as Oxford Nanopore and PacBio HiFi. These specialized methods extend traditional graph-based approaches by incorporating error-tolerant graph structures and phasing strategies, enabling more contiguous and haplotype-resolved assemblies. For instance, Flye builds a repeat graph, a structure related to de Bruijn-style graphs, from error-prone long reads and resolves repeats using reads that bridge them, achieving high contiguity in bacterial and eukaryotic genomes.^[96]

Haplotype-aware assembly pipelines, such as Verkko, integrate PacBio HiFi and ultra-long Nanopore reads, optionally with trio or proximity-ligation data for phasing, to produce phased, telomere-to-telomere diploid assemblies. Published in 2023, Verkko builds a de Bruijn graph from the accurate reads and simplifies it with ultra-long read alignments to separate haplotypes and resolve structural variants, automatically assembling 20 of 46 chromosomes of a diploid human genome from telomere to telomere. This approach facilitates phased assemblies for diploids by leveraging read phasing and graph untangling, improving accuracy in heterozygous regions. Complementing these, Hi-C integration in scaffolding tools like YaHS orders contigs into chromosome-scale structures using chromatin contact maps, enhancing overall assembly integrity without requiring prior chromosome counts.^[97]^[98]

For repeat resolution, newer methods, some of them machine learning-based, target tandem repeats, which often cause assembly gaps. TRFill, introduced in 2025, fills these gaps in draft assemblies using only HiFi and Hi-C data, reconstructing tandem regions through reference-guided alignment and haplotype inference and enabling population-level analysis of complex loci. Similarly, DeChat applies deep learning for haplotype- and repeat-aware error correction in Nanopore R10 reads, preserving variant information while reducing indel errors in repetitive contexts. These innovations have elevated performance metrics; modern long-read human assemblies now routinely achieve contig N50 lengths exceeding 10 Mb, a marked improvement over pre-2020 short-read assemblies limited to under 1 Mb N50.^[99]^[100]

Looking ahead, AI-driven error correction models are poised to further refine long-read data. Tools like HERRO, released in 2024, use deep learning to correct ultra-long Nanopore reads while accounting for haplotype variations, reducing overall error rates below 1% in high-quality subsets and supporting more reliable phased assemblies.^[101]

Quality Assessment

Evaluation Metrics

Evaluation metrics for sequence assemblies assess contiguity, accuracy, completeness, and specificity to determine the quality of the reconstructed genome or transcriptome.

Contiguity metrics evaluate how well the assembly captures the linear structure of the original sequence, with higher values indicating fewer but longer contiguous segments. The N50 metric represents the length of the shortest contig such that the sum of lengths of all contigs of that length or longer covers at least 50% of the total assembled length, providing a measure of assembly fragmentation.^[102] Complementing N50, L50 denotes the smallest number of contigs required to cover 50% of the total assembly length, where lower L50 values signify better contiguity.^[102] The total assembled length quantifies the overall span of the assembly in base pairs, ideally approaching the known genome size without excessive over- or underestimation, while the number of contigs indicates fragmentation, with fewer contigs preferred for higher-quality assemblies.^[102]

Accuracy metrics focus on base-level errors by comparing the assembly to a reference sequence, revealing mismatches and small insertions or deletions. The mismatch rate, expressed as the average number of mismatches per 100,000 aligned bases, highlights substitution errors, with rates below 100 per 100 kb considered acceptable for polished assemblies.^[102] Similarly, the indel rate measures insertions and deletions per 100 kb, where low values (e.g., under 50 per 100 kb) indicate close agreement with the reference, as computed in tools like QUAST.^[102]

Completeness metrics gauge whether essential genomic content is represented, particularly genes and repetitive elements. BUSCO (Benchmarking Universal Single-Copy Orthologs) assesses completeness by searching for a conserved set of single-copy orthologs, with assemblies achieving 95% or higher complete BUSCOs deemed high-quality for eukaryotic genomes.^[103] The LTR Assembly Index (LAI) evaluates continuity in repetitive regions as the proportion of intact long terminal repeat (LTR) retrotransposon sequence among all LTR retrotransposon sequence in the assembly, where LAI values above 10 suggest strong assembly of repeat-rich areas in plant genomes.^[104]

Specificity metrics detect structural errors that could mislead downstream analyses, such as chimeric joins. The misassembly rate counts relocation, translocation, or inversion events between the assembly and reference, often reported as the number of misassemblies per 100 kb, with rates near zero essential for reliable assemblies.^[102] For telomere-to-telomere (T2T) assemblies, which aim for gapless chromosome coverage, metrics include the consensus quality value (QV), where QV > 30 (equivalent to error rates < 0.1%) confirms high accuracy, alongside verification of complete telomere and centromere inclusion without misassemblies.^[105]
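
The contiguity and accuracy definitions above translate directly into short calculations. The following Python sketch is illustrative only and is not taken from QUAST or any other tool; the function names `n50_l50`, `errors_per_100kb`, and `qv_from_error_rate` are hypothetical, and the QV conversion assumes the standard Phred relationship QV = -10 log10(error rate).

```python
import math


def n50_l50(contig_lengths, fraction=0.5):
    """Return (N50, L50) for a list of contig lengths.

    Contigs are sorted from longest to shortest and accumulated until the
    running total reaches `fraction` of the assembly length. The length of
    the last contig added is N50 and the number of contigs used is L50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    threshold = fraction * sum(lengths)
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("no contigs given")


def errors_per_100kb(error_count, aligned_bases):
    """Normalise a mismatch or indel count to events per 100,000 aligned bases."""
    return error_count * 100_000 / aligned_bases


def qv_from_error_rate(error_rate):
    """Phred-scaled consensus quality: QV = -10 * log10(error rate)."""
    return -10 * math.log10(error_rate)


lengths = [80, 70, 50, 40, 30, 20, 10]       # total assembled length = 300
n50, l50 = n50_l50(lengths)
print(n50, l50)                               # 70 2
print(errors_per_100kb(50, 1_000_000))        # 5.0
print(round(qv_from_error_rate(0.001), 1))    # 30.0
```

In the example, the two longest contigs (80 + 70 = 150) already cover half of the 300-base assembly, so N50 is 70 and L50 is 2; an error rate of 0.1% corresponds to QV 30, matching the threshold quoted above.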

Control and Validation Techniques

Control and validation techniques in sequence assembly encompass a range of methods applied before, during, and after the assembly process to detect errors, improve contiguity, and ensure the reliability of the resulting sequences. These techniques are essential for addressing challenges such as sequencing errors, chimeric artifacts, and structural inaccuracies, particularly in complex datasets from transcriptomic or genomic sources. Pre-assembly quality control (QC) begins with read trimming to remove low-quality bases, adapters, and contaminants that could propagate errors into the assembly. Tools like FastQC assess read quality by generating reports on per-base sequence quality, GC content, and overrepresented sequences, which then guide trimming with software such as Trimmomatic or Cutadapt. Error correction further refines raw reads; for instance, Musket employs a k-mer spectrum-based approach to correct substitution errors in Illumina short-read data, reducing noise without excessive loss of coverage. These steps typically improve assembly metrics like N50 by reducing spurious or missed overlaps during graph construction. Post-assembly validation focuses on verifying scaffold integrity and resolving artifacts. Optical mapping, which uses restriction enzyme digestion or sequence-specific labeling patterns to create high-resolution physical maps, validates scaffolding by aligning assembled contigs against these maps to detect misassemblies or gaps exceeding expected distances. Manual curation complements automated methods by inspecting chimeric junctions (regions where unrelated sequences are erroneously fused), often using visualization tools like IGV to identify read depth anomalies or breakpoint evidence and then editing the assembly accordingly. 
Key techniques for overall validation include read-back mapping, where original reads are realigned to the assembled contigs to quantify alignment rates and coverage uniformity; a high percentage of aligned reads (e.g., >90%) indicates robust assembly, while discrepancies highlight errors. Simulation-based validation generates mock datasets mimicking real conditions, such as metagenomic communities, to test assembly accuracy against ground truth; tools like MetaSim create synthetic reads from reference genomes for this purpose. Recent advances emphasize automated polishing to iteratively refine draft assemblies. Pilon, for example, uses short-read alignments to correct base errors, indels, and small structural variants in long-read drafts. As of 2025, machine learning-based tools like DeepPolisher have emerged, enabling precise base-level error correction and significantly improving assembly accuracy.^[6] Standardized benchmarks like the GAGE (Genome Assembly Gold-Standard Evaluations) framework provide comparative evaluation by assembling reference datasets with multiple tools and assessing outcomes across species, establishing baselines for contiguity, accuracy, and runtime.
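
The k-mer spectrum idea behind the pre-assembly error correctors mentioned above can be illustrated with a small Python sketch. It is a simplified illustration of the general principle, not the algorithm used by Musket or any specific tool: k-mers seen fewer than a chosen number of times across the read set are treated as "weak" and likely to contain an error. The names `kmer_counts`, `weak_positions`, and `suspect_bases`, and the threshold value, are assumptions made for this example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads (the k-mer spectrum)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_positions(read, counts, k, min_count):
    """Start positions of k-mers in `read` seen fewer than `min_count` times."""
    return [
        i for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] < min_count
    ]


def suspect_bases(weak, k):
    """Read positions covered by every weak k-mer.

    Assumes a single isolated substitution; returns an empty list when the
    weak k-mers share no common position.
    """
    if not weak:
        return []
    return list(range(max(weak), min(weak) + k))


# Five error-free copies of a read plus one copy with a substitution at index 4.
reads = ["ACGTACGGA"] * 5 + ["ACGTTCGGA"]
counts = kmer_counts(reads, k=4)

weak = weak_positions("ACGTTCGGA", counts, k=4, min_count=2)
print(weak)                      # [1, 2, 3, 4]
print(suspect_bases(weak, k=4))  # [4]
```

Every k-mer overlapping the erroneous base occurs only once, so the weak k-mers start at positions 1 to 4 and their only shared position is index 4, the substituted base. A real corrector would go on to test candidate replacement bases and would choose the abundance threshold from the observed coverage distribution.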

Pipelines and Tools

Assembly Workflows

Sequence assembly workflows encompass the end-to-end bioinformatics processes that transform raw sequencing reads into contiguous genome representations, typically involving pre-processing, core assembly, and post-processing stages to ensure accuracy and completeness.^[106] These pipelines are tailored to the sequencing technology and data type, such as short-read Illumina data or long-read PacBio/Oxford Nanopore outputs, and increasingly incorporate hybrid approaches for enhanced resolution.^[107] The workflow begins with data preparation to mitigate errors inherent in sequencing, proceeds to algorithmic reconstruction, and concludes with refinement to achieve biologically meaningful assemblies.^[106] Pre-processing starts with quality control (QC) to evaluate read integrity using tools that detect adapter contamination, low-quality bases, and biases, followed by trimming to remove artifacts and correct errors, particularly in long reads where error rates can exceed 10%.^[106] Read correction is crucial for noisy long-read data, employing algorithms that align reads to consensus sequences or use short reads as anchors to reduce indel and substitution errors before assembly.^[108] This stage also includes k-mer analysis to estimate genome size, heterozygosity, and optimal parameters, ensuring downstream steps handle repetitive or complex regions effectively.^[106] In the core assembly phase, algorithm selection depends on read length and error profile: overlap-layout-consensus (OLC) methods suit long reads for their tolerance of gaps, while de Bruijn graph approaches excel with short, high-accuracy reads.^[106] For de novo workflows, the process unfolds as read correction to polish inputs, followed by graph building—either via k-mer overlaps in de Bruijn graphs or read-to-read alignments in OLC—to represent sequence relationships, and culminates in consensus generation to resolve paths into contigs, often iterating to collapse bubbles from sequencing errors or 
variants.^[108] Reference-based workflows, conversely, map reads to an existing genome using aligners like BWA or minimap2, then refine variants through calling and polishing to fill gaps or correct mismatches, yielding a sample-specific assembly aligned to the reference scaffold.^[109] Hybrid workflows leverage complementary strengths by first generating a draft contig set from long reads to span repeats and structural variants, then integrating short reads for error correction and gap filling via alignment and consensus polishing, often in multiple rounds to boost base-level accuracy above 99%.^[107] Post-processing enhances contiguity through scaffolding, which orders and orients contigs using paired-end or mate-pair links, and includes initial annotation to identify genes and repeats for validation.^[106] Annotation at this stage involves structural prediction and functional assignment, preparing the assembly for downstream analyses like comparative genomics. To achieve chromosome-scale assemblies, workflows integrate chromatin conformation capture data such as Hi-C, which captures genome-wide proximity interactions to scaffold contigs by modeling contact frequencies as a graph and resolving orientations via iterative error correction, reducing misjoins by up to fourfold compared to read-based methods alone.^[110] ChIA-PET, a protein-specific variant, similarly maps long-range interactions to refine scaffolds, as demonstrated in re-annotating amphibian genomes by linking regulatory elements across chromosomes.^[111] Best practices emphasize parameter tuning, such as selecting k-mer sizes (e.g., 19-31 for human-scale genomes) to balance uniqueness and coverage: a k that is too small cannot distinguish repeats, while a k that is too large fragments the graph. Tools like GenomeScope support this tuning by estimating genome size, heterozygosity, and repetitiveness from k-mer profiles.^[112] For large genomes exceeding 1 Gb, high-performance computing (HPC) resources are essential, requiring at least 244 GB RAM and multi-core processors to manage 
memory-intensive graph constructions, with containerized pipelines ensuring scalability across clusters.^[112] Recent advancements include automated pipelines like nf-core/genomeassembler, which streamlines long-read assembly, polishing, and scaffolding in a Nextflow-based framework, supporting telomere-to-telomere (T2T) efforts by integrating high-fidelity inputs for complete chromosome reconstructions without manual intervention.^[113]^[114] Verkko2 (as of December 2024) further improves Verkko by enhancing repeat resolution and gap closing with proximity-ligation data integration.^[115]
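
The k-mer analysis used during pre-processing to estimate genome size can be illustrated with a first-order calculation: divide the total number of k-mer occurrences by the depth of the main coverage peak in the k-mer histogram. The Python sketch below is a simplification under stated assumptions, not the model fitted by GenomeScope: it assumes a haploid-like histogram with one dominant peak and simply discards k-mers below a hypothetical `min_depth` cutoff as likely sequencing errors. The function name `estimate_genome_size` and the example histogram are invented for illustration.

```python
def estimate_genome_size(histogram, min_depth=2):
    """Estimate genome size from a k-mer depth histogram.

    `histogram` maps a depth d to the number of distinct k-mers observed
    exactly d times. K-mers below `min_depth` are treated as errors and
    ignored. Returns (estimated size in k-mers, peak depth).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("histogram has no k-mers at or above min_depth")
    # Depth at which the most distinct k-mers occur: the single-copy peak.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences, i.e. the area under depth * count.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Invented histogram: error k-mers at depth 1, a single-copy peak near
# depth 20, and a small two-copy repeat peak at depth 40.
histogram = {1: 50000, 18: 2000, 19: 6000, 20: 10000, 21: 6000, 22: 2000, 40: 500}
size, peak = estimate_genome_size(histogram)
print(size, peak)  # 27000.0 20
```

Here the 26,000 single-copy k-mers and the 500 two-copy k-mers (1,000 genomic positions) together give the estimate of 27,000. Heterozygosity adds a second peak at half the main depth, which this sketch does not model.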

Software Programs

Sequence assembly software encompasses a diverse array of tools designed to reconstruct genomic sequences from fragmented reads, categorized by read length, data type, and assembly strategy. These programs vary in their optimization for short-read (e.g., Illumina), long-read (e.g., PacBio HiFi or Oxford Nanopore), or hybrid approaches, with many being open-source and widely adopted in bioinformatics pipelines.^[116] For de novo short-read assembly, SPAdes is a prominent assembler optimized for bacterial and small eukaryotic genomes, employing a de Bruijn graph approach with multi-sized k-mers to handle uneven coverage and repeats effectively. It has been benchmarked as one of the top performers for single-cell and isolate assemblies, achieving high contiguity in complex datasets. MEGAHIT serves as an efficient alternative for metagenomic short-read data, using a succinct de Bruijn graph to enable ultra-fast assembly on single nodes even for large, complex communities, often completing in under 10 hours with modest RAM. Long-read assemblers address the limitations of short reads in resolving repeats and structural variants. Flye is tailored for Oxford Nanopore reads, utilizing a repeat graph-based algorithm to produce highly contiguous assemblies from error-prone long reads, with applications in bacterial and eukaryotic genomes. Hifiasm excels with PacBio HiFi reads, incorporating phased assembly graphs for haplotype-resolved diploid genomes; a 2023 update enhanced its diploid phasing capabilities, improving accuracy in heterozygous regions by leveraging trio data. Hybrid assemblers combine short and long reads to leverage the accuracy of short reads with the contiguity of long reads. MaSuRCA integrates Illumina short reads for error correction with long reads for scaffolding, making it suitable for large eukaryotic genomes and producing assemblies with fewer misassemblies in benchmarks. 
Unicycler focuses on bacterial genomes, using SPAdes for initial short-read graphs and long reads (e.g., Nanopore) for bridging repeats, resulting in circularized chromosome and plasmid contigs with high completeness.^[39] Reference-based assembly relies on alignment to a known genome for variant detection and scaffolding. BWA-MEM is a core mapping tool that aligns short or long reads to references with high sensitivity to indels and structural variants, forming the basis for downstream assembly refinement. GATK's HaplotypeCaller module facilitates variant assembly by modeling haplotypes from aligned reads, enabling precise reconstruction of genomic regions with polymorphisms. Comprehensive suites integrate multiple tools into user-friendly platforms. Galaxy provides workflows that orchestrate assemblers like SPAdes and Flye within a web-based interface, supporting reproducible de novo and reference-based pipelines for diverse sequencing data. Recent advancements include Verkko (2023), a hybrid pipeline for repeat-heavy genomes using PacBio HiFi and ultralong Oxford Nanopore reads to achieve telomere-to-telomere assemblies in challenging regions like centromeres. Dragonflye streamlines long-read assembly for bacterial isolates, wrapping Flye with polishing steps to yield high-quality, complete genomes from Oxford Nanopore data in a single command.^[117]

References

[1]
Genome Assembly - an overview | ScienceDirect Topics
Genome assembly is defined as the process of organizing nucleotide sequences into the correct order, which is necessary due to the shorter lengths of sequence ...
[2]
DNA Sequence Assembly - News-Medical
DNA sequence assembly is a process that involves aligning and merging fragments of a DNA sequence to reconstruct the original structure of the DNA.
[3]
Next-Generation Sequence Assembly: Four Stages of Data ... - NIH
Dec 12, 2013 · In this review, we address the basic framework of next-generation genome sequence assemblers, which comprises four basic stages.
[4]
Recent advances in sequence assembly: principles and applications
Apr 26, 2017 · Sequence assembly [3] is a process by dividing the large pieces of DNA into small pieces, reading the small fragments and reconstituting the ...
[5]
An Overview of Genome Assembly
In bioinformatics, genome assembly represents the process of putting a large number of short DNA sequences back together to recreate the original chromosomes ...
[6]
Repetitive DNA and next-generation sequencing - PubMed Central
Nov 29, 2011 · NGS read lengths (50–150 bp) are considerably shorter than the 800–900 bp lengths that capillary-based (Sanger) sequencing methods were ...
[7]
What is a genome assembly? - NLM Support Center - NIH
It can (1) refer to a process in which researchers assemble genome sequences from smaller components, or it can (2) refer to the entire collection of sequences ...
[8]
Assembling Your DNA Sequences - Geneious
What is Sequence Assembly? · Assembling genomes or genes to study their relationships for phylogenetic studies · Identifying variants associated with a disease or ...
[9]
Human Genome Project Fact Sheet
Jun 13, 2024 · In 2003, the Human Genome Project produced a genome sequence that accounted for over 90% of the human genome. It was as close to complete as the ...
[10]
International Consortium Completes Human Genome Project
The international effort to sequence the 3 billion DNA letters in the human ... base pairs could be achieved. The finished human sequence is a fabulous ...
[11]
[PDF] The Human Genome Project (HGP)
A computer program looks for overlaps in the DNA sequences, using them to reassemble the fragments in their correct order to determine the sequence of the ...
[12]
De Novo Sequencing | Assemble novel genomes - Illumina
De novo sequencing is sequencing a novel genome where no reference sequence is available, and it generates accurate reference sequences.
[13]
Current challenges and solutions of de novo assembly
Jun 4, 2019 · In this review, we first briefly introduce some of the major challenges faced by NGS sequence assembly. Then, we analyze the characteristics of various ...
[14]
Repetitive DNA sequence detection and its role in the human genome
Sep 19, 2023 · For instance, about 50% of the human genome consists of repeats, while roughly 4% of human genes harbor transposable elements in their protein- ...
[15]
The Challenge of Genome Sequence Assembly
Oct 17, 2018 · Chromosome assembly from 'short read' sequence data is confounded by the presence of repetitive genome regions with numerous similar sequence ...
[16]
High-throughput DNA sequencing errors are reduced by orders of ...
For instance, Illumina sequencing machines produce errors at a rate of ∼0.1–1 × 10−2 per base sequenced. These technologies typically produce billions of base ...
[17]
A comprehensive evaluation of long read error correction methods
Dec 21, 2020 · Both sequencing platforms are similar in terms of their high error rates (ranging from 10-20%) with most errors occurring due to insertions or ...
[18]
Sequencing error profiles of Illumina sequencing instruments
Mar 27, 2021 · Error rate is highly correlated with the sequencing cycle, rising toward the end of each read. In samples with smaller overlaps, the detected ...
[19]
Why Assembling Plant Genome Sequences Is So Challenging - PMC
Unfortunately, coverage variability is the rule and undermines the coverage-based diagnostics. It can be speculated that the sequencing itself needs to be ...
[20]
Variation of and associations with the depth and evenness ... - Nature
Jul 19, 2025 · Depth and evenness of sequencing coverage are considered possible indicators of genome assembly quality. A relatively even sequencing coverage ...
[21]
Genome assembly reborn: recent computational challenges
May 29, 2009 · Perhaps the biggest new challenge is posed by the large-scale sequencing of entire microbial communities (metagenomics).
[22]
Computability of Models for Sequence Assembly - SpringerLink
Here we present two theoretical results about the complexity of these models for sequence assembly. In the first part, we show sequence assembly to be NP-hard ...
[23]
Phased diploid genome assemblies and pan-genomes provide ...
Nov 2, 2020 · Despite high heterozygosity rates (0.85–1.28%), all assemblies showed high contiguity, with the scaffold N50 of 3.3–4.3 Mb in diploid assemblies ...
[24]
HapSolo: an optimization approach for removing secondary ... - NIH
Nevertheless, the assembly of heterozygous genomes still presents substantial challenges. One challenge is resolving distinct haplotypes in regions of high ...
[25]
The present and future of de novo whole-genome assembly
Oct 14, 2016 · This review provides guidelines to determine the optimal approach for a given input data type, computational budget or genome.
[26]
Genetic variation and the de novo assembly of human genomes - NIH
Sequence coverage is almost never uniform, and repetitive sequences of varying length, copy number and sequence complicate this process. This makes the correct ...
[27]
Reference-guided de novo assembly approach improves genome ...
Nov 10, 2017 · However, de novo genome assemblies remain challenging due to short read length, missing data, repetitive regions, polymorphisms and sequencing ...
[28]
sequencing, de novoassembly and rapid analysis using open ...
Apr 1, 2013 · De novo assembly software and algorithms are powerful enough to allow average bacterial genomes to be assembled within hours or a few days ...
[29]
Comparison of De Novo Assembly Strategies for Bacterial Genomes
Jul 17, 2021 · The structures of repetitive regions include, for example, resistance gene cassettes, insertion sequences, and transposons.
[30]
History and current approaches to genome sequencing and assembly
In this review we provide a comprehensive historical background of the improvements in DNA sequencing technologies that have accompanied the major milestones ...
[31]
[PDF] DNA sequencing at 40: past, present and future
The first integrated NGS platforms came in 2005, with resequencing of Escherichia coli by Shendure, Porreca, Mitra and Church41, de novo assembly of Mycoplasma ...
[32]
Fast and accurate short read alignment with Burrows–Wheeler ...
May 18, 2009 · Furthermore, BWA always requires the full read to be aligned, from the first base to the last one (i.e. global with respect to reads), but ...
[33]
Ultrafast and memory-efficient alignment of short DNA sequences to ...
Mar 4, 2009 · Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes.
[34]
[1303.3997] Aligning sequence reads, clone sequences and ... - arXiv
Mar 16, 2013 · BWA-MEM is a new alignment algorithm for aligning sequence reads or long query sequences against a large reference genome such as human.
[35]
RaGOO: fast and accurate reference-guided scaffolding of draft ...
Oct 28, 2019 · Two common approaches have been used to achieve chromosome-scale assemblies, namely, reference-free (de novo) and reference-guided approaches.
[36]
A global reference for human genetic variation | Nature
Sep 30, 2015 · The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome ...
[37]
MaSuRCA genome assembler | Bioinformatics - Oxford Academic
We describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies.
[38]
Unicycler: Resolving bacterial genome assemblies from short and ...
Jun 8, 2017 · Here we present Unicycler, a new hybrid assembly pipeline for bacterial isolate genomes. Unicycler first assembles short reads into an accurate ...
[39]
Hybrid assembly of the large and highly repetitive genome of ...
The hybrid assembly combines long, high-error PacBio reads with shorter, accurate Illumina reads to create mega-reads, which are then assembled.
[40]
The complete sequence of a human genome | Science
Mar 31, 2022 · The resulting T2T-CHM13 reference assembly removes a 20-year-old barrier that has hidden 8% of the genome from sequence-based analysis, ...
[41]
A complete reference genome improves analysis of human genetic ...
T2T-CHM13 improves short-read mapping across populations. To investigate how the T2T-CHM13 assembly affects short-read variant calling, we realigned and ...
[42]
Haplotype-resolved de novo assembly using phased ... - Nature
Feb 1, 2021 · Hifiasm is a fast open-source de novo assembler specifically developed for HiFi reads. It mostly uses exact overlaps to construct the assembly ...
[43]
Recent Advances in Assembly of Complex Plant Genomes
We summarize the challenges of and advances in complex plant genome assembly, including feasible experimental strategies, upgrades to sequencing technology.
[44]
Hybrid Assembly Improves Genome Quality and Completeness of ...
Jan 30, 2022 · Hybrid assembly improved genome quality and completeness of T. villosa, revealing high potential for lignocellulose breakdown, with 14,540 ...
[45]
Characterization of repetitive DNA landscape in wheat ...
May 12, 2015 · Thus, the hexaploid wheat genome is characterized by its large size (~17 Gb) and complexity, with repetitive sequences accounting for ~ 80% of ...
[46]
Genome Anatomies - NCBI - NIH
In general, prokaryotic genes are shorter than their eukaryotic counterparts, the average length of a bacterial gene being about two-thirds that of a eukaryotic ...
[47]
First complete sequence of a human genome - NIH
Apr 12, 2022 · Researchers finished sequencing the roughly 3 billion bases (or “letters”) of DNA that make up a human genome.
[48]
Centromere studies in the era of 'telomere-to-telomere' genomics - NIH
This review reflects the progress in centromere genomics, credited by recent advancements in long-read sequencing and assembly methods.
[49]
Current Strategies of Polyploid Plant Genome Sequence Assembly
Here, we review the challenges of the assembly of polyploid plant genomes, and also present recent advances in genomic resources and functional tools.
[50]
AGP Specification v2.1 - NCBI - NIH
Mar 14, 2024 · Describes the assembly of a larger sequence object from smaller objects. The large object can be a contig, a scaffold (supercontig), or a chromosome.
[51]
Technical considerations in Hi‐C scaffolding and evaluation of ...
Integrating Hi‐C links with assembly graphs for chromosome‐scale assembly. PLoS Computational Biology, 15(8), e1007273. 10.1371/journal.pcbi.1007273. [DOI] ...
[52]
RNA-Seq: a revolutionary tool for transcriptomics - PMC - NIH
RNA-Seq is a recently developed approach to transcriptome profiling that uses deep-sequencing technologies. Studies using this method have already altered ...
[53]
Transcriptomics technologies - PMC - NIH
RNA-Seq leverages deep sampling of the transcriptome with many short fragments from a transcriptome to allow computational reconstruction of the original RNA ...
[54]
A simple guide to de novo transcriptome assembly and annotation
Jan 24, 2022 · We present a comprehensive overview of de novo transcriptome assembly and annotation. We discuss the procedures involved, including pre- and post-processing ...
[55]
Full-length transcriptome assembly from RNA-Seq data without a ...
May 15, 2011 · Grabherr et al. describe Trinity, an algorithm for assembling full-length transcripts from short reads without first mapping the reads to a ...
[56]
StringTie enables improved reconstruction of a transcriptome ... - NIH
Feb 18, 2015 · StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript ...
[57]
Challenges and advances for transcriptome assembly in non-model ...
After two decades of RNA microarrays [8], RNA-seq has democratized the analysis of transcriptomes for any non-model organism.
[58]
Systematic assessment of long-read RNA-seq methods for transcript ...
Jun 7, 2024 · The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth.
[59]
Advances in long-read single-cell transcriptomics | Human Genetics
May 24, 2024 · Long-read single-cell transcriptomics (scRNA-Seq) is revolutionizing the way we profile heterogeneity in disease. Traditional short-read ...
[60]
Long fragments achieve lower base quality in Illumina paired-end ...
Feb 27, 2019 · Illumina's technology provides high quality reads of DNA fragments with error rates below 1/1000 per base. Sequencing runs typically ...
[61]
Characteristics of 454 pyrosequencing data—enabling realistic ...
2.5 Read lengths The length of un-trimmed reads in 454 pyrosequencing is limited by either the number of flows (168 in GS20, 400 in GS FLX and 800 in GS FLX ...
[62]
(PDF) Accuracy and Quality Assessment of 454 GS-FLX Titanium ...
We obtained a mean error rate for 454 sequences of 1.07%. More importantly, the error rate is not randomly distributed; it occasionally rose to more than 50% in ...
[63]
De novo assembly of short sequence reads - Oxford Academic
Aug 19, 2010 · Sequence reads as short as 20–30 nucleotides could be used to generate useful assemblies of both prokaryotic and eukaryotic genome sequences.
[64]
Impact of short-read sequencing on the misassembly of a plant ...
Feb 2, 2021 · Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when ...
[65]
The Most Frequently Used Sequencing Technologies and Assembly ...
Oct 18, 2020 · Currently, 82% of the bacterial genomes in RefSeq were produced by the short read Illumina sequencing technology (Figure 1C). Among the ...
[66]
Mate Pair Sequencing - Illumina
Mate pair sequencing involves generating long-insert paired-end DNA libraries useful for a number of sequencing applications.
[67]
Paired-end or mate-pair - GATK - Broad Institute
Jun 25, 2024 · In paired-end sequencing, the library preparation yields a set of fragments, and the machine sequences each fragment from both ends.
[68]
The Cost of Sequencing a Human Genome
Nov 1, 2021 · The cost to generate a whole-exome sequence was generally below $1,000. Commercial prices for whole-genome and whole-exome sequences have often ...
[69]
HiFi Reads - Highly accurate long-read sequencing - PacBio
HiFi reads are highly accurate long reads with 99.9% accuracy, up to 25 kb in length, and are free of systematic errors.
[70]
Long and Accurate: How HiFi Sequencing is Transforming Genomics
By contrast, HiFi reads, generated from 10–25-kb inserts using CCS mode, have an error rate of ≤ 1% by leveraging multiple passes around the template. CLR, ...
[71]
Nanopore sequencing accuracy
Oxford Nanopore sequencing hardware and chemistry have seen major upgrades in the shift to R10.4.1 and are now able to read DNA fragments at >99% single-read ...
[72]
Nanopore sequencing technology, bioinformatics and applications
Nov 8, 2021 · Other tools use long read length while accounting for high error rate. Many of these, such as tools for error correction, assembly and ...
[73]
Structural variant calling: the long and the short of it | Genome Biology
Nov 20, 2019 · Long reads help to increase the detection of SVs as they considerably ease de novo genome assembly and mapping. Nevertheless, the increased ...
[74]
Long-read human genome sequencing and its applications - PMC
A base is incorrectly called in about 1 out of every 10 bases, resulting in an error rate of 8–15% in the CLR. HiFi reads are generated by circular consensus ...
[75]
Two complete telomere-to-telomere genome assemblies of ...
Sep 1, 2025 · In this study, we report the complete telomere-to-telomere (T2T) genome assemblies of A17 and R108 (Figures 1A and 1B), constructed by ...
[76]
[PDF] White Paper: Structural Variation in the Human Genome - PacBio
Low, unbiased coverage of the genome with long-read sequencing reveals most of the structural variants uncovered with de novo assembly, but at a price point ...
[77]
nanoporetech/dorado: Oxford Nanopore's Basecaller - GitHub
Dorado is a high-performance, easy-to-use, open source analysis engine for Oxford Nanopore reads. Detailed information about Dorado and its features is ...
[78]
London Calling 2023: Dorado — the future of basecalling
May 18, 2023 · Mark will be giving an update on the current status of the Dorado standalone basecaller, talking about the motivations and going through ...
[79]
Towards complete and error-free genome assemblies of all ... - Nature
Apr 28, 2021 · Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages.
[80]
Towards complete and error-free genome assemblies of all ... - NIH
Apr 28, 2021 · We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype ...
[81]
overlap–layout–consensus and de-bruijn-graph - Oxford Academic
Dec 19, 2011 · OLC generally works in three steps: first overlaps (O) among all the reads are found, then it carries out a layout (L) of all the reads and ...
[82]
Review of General Algorithmic Features for Genome Assemblers for ...
The overlap-layout-consensus, as the name suggests, consists of three steps (see Supplementary section, [27]). In the first step an overlap graph is created by ...
[83]
5.2: Genome Assembly I- Overlap-Layout-Consensus Approach
Mar 17, 2021 · This section will examine one of the most successful early methods for computationally assembling a genome from a set of DNA reads, called shotgun sequencing.
[84]
[PDF] Genome Assembly Intro & OLC - GitHub Pages
However, overlapping can still be one of the slowest steps in an assembly.
[85]
Consensus generation and variant detection by Celera Assembler
Celera Assembler uses dynamic windowing to identify alleles, produce a set of haploid consensus sequences, and splits read segments in variant regions.
[86]
On the sequencing and assembly of the human genome - PNAS
Celera's assembly was missing the interiors of highly similar repetitive elements and the extremely dense repeat regions near the centromeres, whereas the HGSC ...
[87]
Linear time complexity de novo long read genome assembly with ...
May 22, 2023 · Most long-read genome assemblers follow the Overlap-Layout-Consensus paradigm (OLC), a quadratic run time algorithm in its naïve implementation.
[88]
An Eulerian path approach to DNA fragment assembly | PNAS
[89]
Why are de Bruijn graphs useful for genome assembly? - PMC - NIH
(d) Modern short-read assembly algorithms construct a de Bruijn graph by representing all k-mer prefixes and suffixes as nodes, then drawing edges that ...
[90]
[PDF] De Bruijn Graph assembly
As usual, we start with a collection of reads, which are substrings of the reference genome. AAB is a k-mer (k = 3). AA is its left k-1-mer, and AB is its right k- ...
[91]
Velvet: Algorithms for de novo short read assembly using de Bruijn ...
We have developed a new set of algorithms, collectively called “Velvet,” to manipulate de Bruijn graphs for genomic sequence assembly.
[92]
ABySS: A parallel assembler for short read sequence data - PMC - NIH
The primary innovation in ABySS is a distributed representation of a de Bruijn graph, which allows parallel computation of the assembly algorithm across a ...
[93]
Assembly of Long Error-Prone Reads Using Repeat Graphs - bioRxiv
Jan 12, 2018 · We present the Flye algorithm for constructing the A-Bruijn (assembly) graph from long error-prone reads, that, in contrast to the k-mer-based ...
[94]
Telomere-to-telomere assembly of diploid chromosomes with Verkko
Verkko is an iterative, graph-based pipeline for assembling complete, diploid genomes, using long and ultra-long reads to achieve telomere-to-telomere assembly.
[95]
YaHS: yet another Hi-C scaffolding tool - Oxford Academic
YaHS is a tool that constructs chromosome-scale scaffolds using Hi-C data, and is fast, reliable, and accurate.
[96]
TRFill: synergistic use of HiFi and Hi-C sequencing enables ...
Jul 28, 2025 · Our TRFill algorithm accurately fills assembly gaps using only PacBio HiFi and Hi-C data, without relying on the costly ONT UL reads. Our ...
[97]
Repeat and haplotype aware error correction in nanopore ...
Dec 19, 2024 · DeChat is a novel error correction tool for Nanopore R10 reads (<2% error), combining de Bruijn graphs and variant-aware alignment to preserve repeats and ...
[98]
Telomere-to-telomere phased genome assembly using error ...
May 21, 2024 · We have developed the HERRO model based on deep learning, which corrects Simplex nanopore reads longer than 10kbp and with a quality value higher than 10.
[99]
QUAST: quality assessment tool for genome assemblies
This metric can be computed without a reference genome. No. of mismatches per 100 kb: The average number of mismatches per 100 000 aligned bases. QUAST also ...
[100]
BUSCO: assessing genome assembly and annotation completeness ...
Jun 9, 2015 · BUSCO quality assessments provide high-resolution quantifications citeable in the simple C[D],F,M,n notation for genomes, gene sets and ...
[101]
Assessing genome assembly quality using the LTR Assembly Index ...
Aug 10, 2018 · The BUSCO and CEGMA completeness are poor predictors of LAI (r2 ≤ 0.06, P ≥ 0.12) (Figure 2E and F), indicating that LAI is characterizing ...
[102]
Genome assembly in the telomere-to-telomere era - PMC
Here we review recent progress on assembly algorithms and protocols. We focus on how to derive near telomere-to-telomere assemblies and discuss potential ...
[103]
Ten steps to get started in Genome Assembly and Annotation - NIH
Different sequencing technologies and generally applicable workflows for genome assembly are also detailed. We cover structural and functional annotation and ...
[104]
Efficient hybrid de novo assembly of human genomes with WENGAN
Dec 14, 2020 · WENGAN starts by building short-read contigs using a de Bruijn graph assembler (1 in Fig. 1). Then, the pair-end reads are pseudo-aligned back ...
[105]
NextDenovo: an efficient error correction and accurate assembly tool ...
Apr 26, 2024 · We present NextDenovo, an efficient error correction and assembly tool for noisy long reads, which achieves a high level of accuracy in genome assembly.
[106]
RGAAT: A Reference-based Genome Assembly and Annotation ...
Dec 21, 2018 · RGAAT can be used to generate variants between two assemblies by sequence comparison (Figure 3). We used BLAT for genome comparison because of ...
[107]
Integrating Hi-C links with assembly graphs for chromosome-scale ...
We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the ...
[108]
Xenopus tropicalis Genome Re-Scaffolding and Re-Annotation ...
We applied ChIA-PET to analyze gene regulatory networks, including 3D chromosome interactions, underlying thyroid hormone (TH) signaling in the frog Xenopus ...
[109]
[PDF] A Review of VGP's Current Techniques and Best Practices for the ...
K-mer size is a key parameter that must be large enough to map uniquely to the genome, but not too large, as it can lead to wasting computational resources. For ...
[110]
nf-core/genomeassembler: Assembly and scaffolding of ... - GitHub
nf-core/genomeassembler is a bioinformatics pipeline that carries out genome assembly, polishing and scaffolding from long reads (ONT or pacbio).
[111]
[PDF] Genome assembly in the telomere-to-telomere era
Apr 22, 2024 · The T2T-CHM13 human genome was assembled with these two data types36. Currently, the only assemblers that can integrate ...
[112]
Benchmarking of bioinformatics tools for the hybrid de novo ...
Accurate and complete de novo genome assemblies enable variant identification and the discovery of novel genomic features and biological functions.
[113]
rpetit3/dragonflye: :dragon: Assemble bacterial isolate ... - GitHub
Dragonflye is a pipeline that aims to make assembling Oxford Nanopore reads quick and easy. Still working on the quick part, but I think the easy part is there.