Sequence assembly

Sequence assembly is a fundamental process in bioinformatics that involves reconstructing complete or near-complete biological sequences, such as DNA or RNA, from numerous short fragments called reads generated by high-throughput sequencing technologies. This reconstruction is essential because short-read sequencing instruments produce reads that are typically 50–300 base pairs long, far shorter than the genomes or transcriptomes they aim to represent, necessitating computational algorithms to align and merge these fragments into contiguous sequences known as contigs. The process, which emerged in the early 1980s, gained prominence through the Human Genome Project's hierarchical shotgun sequencing but has evolved significantly with next-generation sequencing (NGS), enabling de novo assembly of novel genomes without prior reference sequences.

Key methods in sequence assembly include de novo assembly, which builds sequences from scratch using overlap-layout-consensus (OLC) or de Bruijn graph algorithms, and reference-based assembly, which maps reads to an existing reference genome for guided reconstruction. OLC approaches, such as those in the Celera Assembler and Canu, identify overlapping reads to form graphs that are then simplified into contigs, while de Bruijn graph assemblers such as SPAdes break reads into k-mers for efficient handling of large datasets. Assembly pipelines typically proceed through four main stages: preprocessing to correct errors and filter low-quality reads; graph construction to organize overlaps; graph simplification to resolve redundancies and ambiguities; and postprocessing to scaffold contigs using paired-end information and assess quality metrics like N50 contig length.

Despite these advances, sequence assembly faces significant challenges, including handling repetitive genomic regions that cause ambiguities in read alignment, sequencing errors from platforms like Illumina or PacBio, and the computational demands of processing terabytes of data from high-coverage experiments.
In complex genomes, such as those of plants with high heterozygosity or polyploidy, these issues can lead to fragmented assemblies or misassemblies, requiring hybrid approaches combining short- and long-read technologies for improved accuracy.

The importance of sequence assembly lies in its role as a cornerstone of genomics, facilitating applications from genome annotation and variant discovery to evolutionary studies, ultimately advancing fields like personalized medicine and biodiversity conservation. Recent developments, including long-read sequencing from Oxford Nanopore and integration of machine learning for error correction, continue to enhance assembly quality and scalability.

Fundamentals

Definition and Purpose

Sequence assembly is the computational process of reconstructing long DNA or RNA sequences from numerous short, overlapping fragments known as sequencing reads, typically ranging from 50 to 300 base pairs (bp) in length for short-read technologies. This reconstruction yields contiguous sequences called contigs, which can be further linked into scaffolds approximating larger structures such as chromosomes or transcripts. The process is essential because modern high-throughput sequencing methods generate millions of these short reads rather than complete genomic sequences in one go.

The primary purpose of sequence assembly is to enable comprehensive genomic analysis, including genome annotation to identify genes and regulatory elements, variant discovery for detecting mutations, evolutionary studies to compare species, and practical applications such as personalized medicine through tailored diagnostics and therapies. A landmark demonstration of its importance was the Human Genome Project, completed in 2003, which relied on assembly techniques to map approximately 3 billion base pairs of the human genome, providing the foundational reference for subsequent research.

At a high level, the workflow begins with read generation from sequencing platforms, followed by overlap detection to identify matching regions between reads, contig formation by merging overlapping fragments into longer sequences, and scaffolding to estimate the order and orientation of contigs using additional data like paired-end mappings. This fragmented reconstruction can be likened to piecing together a book from shredded pages, where overlapping text snippets guide the alignment despite gaps or ambiguities, such as repetitive regions that complicate precise joining.
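The overlap-detection and contig-formation steps described above can be illustrated with a toy greedy merger. This is a minimal sketch, assuming error-free reads from a single strand and exact suffix-prefix matches; the function names and the `min_overlap` parameter are illustrative and are not taken from any particular assembler.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences sharing the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing overlaps any more: leftovers stay separate contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

Greedy merging is easy to follow but is misled by repeats, which is one reason practical assemblers rely on the graph-based formulations discussed later in this article.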

Key Challenges

Sequence assembly faces several inherent challenges that complicate the reconstruction of continuous genomic sequences from fragmented reads. These difficulties arise from the nature of biological genomes and the limitations of sequencing technologies, often leading to fragmented, erroneous, or incomplete assemblies.

One major obstacle is the presence of repetitive regions, such as tandem repeats and segmental duplications, which create ambiguity in read placement because identical or highly similar sequences cannot be uniquely mapped. For instance, approximately 50% of the human genome consists of repetitive DNA, including transposable elements and other repeats that confound accurate reconstruction. This ambiguity results in collapsed repeats or misassemblies, particularly when read lengths are shorter than the repeat units.

Sequencing errors further exacerbate assembly issues by introducing inaccuracies in base calls that hinder reliable overlap detection between reads. Error rates vary by technology; early next-generation sequencing platforms exhibited rates of 1-15%, while modern short-read methods like Illumina achieve around 0.1-1%, and raw long reads from PacBio or Oxford Nanopore have historically ranged from 10-20%. These errors, primarily substitutions, insertions, or deletions, propagate into contig formation and require computational correction, yet residual inaccuracies can lead to chimeric or incorrect sequences.

Coverage variability, characterized by uneven read depth across the genome, often results in gaps or over-representation in assemblies, making it difficult to resolve low-coverage regions. This unevenness stems from biases in library preparation, amplification artifacts, and sequencing inefficiencies, sometimes producing chimeric reads that span unintended genomic breakpoints. In practice, such variability undermines coverage-based diagnostics for assembly quality and can leave substantial portions of the genome unassembled.
The computational complexity of sequence assembly poses another significant barrier, as the problem of finding the optimal arrangement of reads is NP-hard, even for simplified models. For large eukaryotic genomes, this translates to immense time and memory demands when handling datasets that can reach terabytes in size, necessitating heuristic algorithms that trade optimality for feasibility.

In diploid organisms, polymorphisms and heterozygosity add layers of complexity by requiring the distinction of allelic variants from two homologous chromosomes, often leading to phasing issues. High heterozygosity rates, such as 0.85-1.28% in some species, can cause assemblers to erroneously merge or separate haplotypes, resulting in redundant or fragmented contigs. This challenge is particularly acute in de novo assembly without a reference genome, where distinguishing true variants from sequencing errors demands high coverage and sophisticated modeling.

Types of Sequence Assembly

De Novo Assembly

De novo assembly, also known as de novo genome assembly, is the process of reconstructing a genome solely from raw sequencing reads without relying on a preexisting reference genome. This approach is particularly valuable for sequencing novel organisms, non-model species, or populations where no high-quality reference exists, enabling the generation of a complete, unbiased representation of the genetic material.

The assembly process begins with read correction, where errors introduced during sequencing (such as base substitutions or indels) are identified and fixed using consensus from multiple overlapping reads or specialized error-correction algorithms. Next, overlaps between corrected reads are detected, often employing graph-based methods like de Bruijn graphs, which represent k-mer substrings to efficiently identify shared sequences. These overlaps are then used to build contigs, which are continuous stretches of DNA formed by merging aligned reads into longer sequences. Finally, scaffolding orders and orients these contigs into larger structures using long-range information from mate-pair libraries, paired-end reads, or chromatin interaction data like Hi-C, which provide distance constraints between distant genomic regions; gaps between contigs within a scaffold may remain unresolved without further data.

One key advantage of de novo assembly is its ability to uncover novel genetic sequences, including those absent from existing databases, and to accurately resolve structural variants such as insertions, deletions, and rearrangements that might be missed or biased in reference-dependent methods. It is especially effective for genomes with unique evolutionary histories or high variability. However, the method is prone to fragmentation, particularly in repetitive regions where short reads cannot uniquely span repeats, leading to collapsed or incomplete contigs; this challenge is exacerbated in complex eukaryotic genomes compared to simpler ones.
In practice, de novo assembly has been widely applied to microbial genomes, where bacterial assemblies often achieve near-complete contiguity due to their relatively low complexity, compact size (typically 2–10 Mb), and fewer repetitive elements. For instance, tools like SPAdes have enabled high-quality drafts of bacterial isolates from environmental samples, closing most gaps and annotating functional elements with minimal fragmentation. Historically, de novo assembly became dominant in the early next-generation sequencing (NGS) era following the introduction of platforms like 454 in 2005, revolutionizing the study of uncultured microbes by allowing rapid reconstruction of genomes from metagenomic samples without prior cultivation or reference data. This shift facilitated thousands of microbial publications between 2006 and 2010, marking a departure from Sanger-era limitations.
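The read-correction step mentioned above is often built on k-mer counts: a k-mer that appears only once or twice in a deeply sequenced dataset is more likely to contain a sequencing error than to reflect the genome. The following is a minimal sketch of that idea, assuming substitution errors only and a fixed count threshold; `min_count` and the function names are illustrative choices, not parameters of any specific tool.

```python
from collections import Counter


def count_kmers(reads: list[str], k: int) -> Counter:
    """Count every k-mer occurrence across all reads."""
    counts: Counter = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_positions(read: str, counts: Counter, k: int, min_count: int = 2) -> list[int]:
    """Start offsets of k-mers in `read` that occur fewer than `min_count` times."""
    return [
        i
        for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] < min_count
    ]


# Three error-free copies of a read plus one copy with a substitution at offset 4.
reads = ["ACGTACGGA"] * 3 + ["ACGTTCGGA"]
counts = count_kmers(reads, k=4)
print(suspicious_positions("ACGTTCGGA", counts, k=4))
print(suspicious_positions("ACGTACGGA", counts, k=4))
```

A single substitution affects every k-mer that spans it, so the flagged offsets form a contiguous run whose overlap points to the erroneous base. Real correctors go further and propose a replacement base, which this sketch does not attempt.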

Mapping-Based Assembly

Mapping-based assembly, also known as reference-guided assembly, is a strategy in bioinformatics that reconstructs a genome by aligning sequencing reads to a pre-existing reference genome, enabling the identification and correction of variations or gaps in the reference sequence. This approach leverages the reference as a scaffold to guide the placement of reads, facilitating the assembly of closely related genomes or resequencing efforts where the target organism shares significant similarity with the reference. Tools such as BWA (Burrows-Wheeler Aligner) and Bowtie are commonly employed for the initial read alignment step, as they efficiently map short reads to large reference genomes using Burrows-Wheeler transform-based indexing for speed and accuracy. BWA, for instance, supports gapped alignments to handle insertions, deletions, and mismatches, making it suitable for reconstructing sequences with polymorphisms.

The process begins with read mapping, where sequencing reads are aligned to the reference using aligners like BWA-MEM, which employs a combination of exact-match seeding, seed chaining, and dynamic programming extension to produce high-quality alignments even for longer reads. Following alignment, variant calling identifies differences such as single-nucleotide polymorphisms (SNPs) and insertions/deletions (indels) by comparing mapped reads against the reference, often using tools that generate pileup files to assess coverage and consensus at each position. Consensus building then integrates these variants to produce a refined consensus sequence, filling gaps in low-coverage regions or correcting errors in the reference through majority voting or probabilistic models based on read depth and quality scores. Indels and other small structural differences are handled by local realignment around discrepancies.
One key advantage of mapping-based assembly is its computational efficiency and higher accuracy for species closely related to the reference, as the guiding scaffold reduces search space and resolves ambiguities in repetitive regions by providing contextual anchors for read placement. It is particularly effective for detecting small-scale variants, achieving lower error rates compared to de novo methods in resequencing scenarios with high similarity. However, this approach introduces biases inherited from the reference genome's quality and completeness, potentially underrepresenting novel sequences or structural rearrangements in more divergent genomes, where unmapped reads may be discarded or poorly assembled.

Mapping-based assembly is widely applied in resequencing projects and population genomics, such as the 1000 Genomes Project, which sequenced 2,504 individuals from diverse populations to catalog human genetic variation by aligning low-coverage whole-genome reads (mean depth 7.4×) to the human reference genome (GRCh37). Multiple aligners and variant callers were combined to achieve high-confidence genotyping of over 88 million variant sites. This method enabled the discovery of common variants with frequencies above 1%, supporting studies on human diversity and disease association.
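The pileup and majority-voting steps of consensus building can be sketched in a few lines. This is an illustrative toy, assuming reads have already been mapped without gaps and are supplied as hypothetical `(start, sequence)` pairs; it ignores base qualities, indels, and diploid genotypes, all of which real variant callers model.

```python
from collections import Counter


def build_pileup(reference_length: int, alignments: list[tuple[int, str]]) -> list[Counter]:
    """Tally observed bases per reference position from ungapped, 0-based alignments."""
    pileup = [Counter() for _ in range(reference_length)]
    for start, seq in alignments:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < reference_length:
                pileup[pos][base] += 1
    return pileup


def call_consensus(reference: str, pileup: list[Counter], min_depth: int = 2):
    """Majority-vote consensus; positions below `min_depth` keep the reference base."""
    consensus = []
    variants = []
    for pos, (ref_base, column) in enumerate(zip(reference, pileup)):
        depth = sum(column.values())
        if depth < min_depth:
            consensus.append(ref_base)  # too little evidence to overrule the reference
            continue
        base, _ = column.most_common(1)[0]  # ties resolve to the first base seen
        consensus.append(base)
        if base != ref_base:
            variants.append((pos, ref_base, base))
    return "".join(consensus), variants


reference = "ACGTACGT"
alignments = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TTCG"), (1, "CGTTC")]
consensus, variants = call_consensus(reference, build_pileup(len(reference), alignments))
print(consensus, variants)
```

Falling back to the reference base at low depth is exactly where reference bias enters: regions the reads cannot cover silently inherit whatever the reference contains.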

Hybrid Assembly

Hybrid assembly integrates multiple sequencing data types, typically combining high-accuracy short reads from platforms like Illumina with longer but error-prone reads from technologies such as PacBio or Oxford Nanopore, to leverage the strengths of each for improved genome reconstruction. This approach enhances contiguity by using long reads to span repetitive regions while employing short reads to correct errors and fill gaps, resulting in more complete assemblies than those from single data types alone.

The process generally involves error correction of long reads using short-read alignments, followed by graph-based integration to build contigs. For instance, tools like MaSuRCA construct "mega-reads" by pairing short Illumina reads with long PacBio reads to create accurate, extended sequences that are then assembled via a hybrid de Bruijn-overlap method. Similarly, Unicycler builds a short-read graph with SPAdes and bridges it using long reads for bacterial genomes, though the same principles can be adapted to eukaryotes. These pipelines often incorporate polishing steps with short reads to refine the final output.

Hybrid methods excel at resolving structural challenges like repeats and gaps, yielding higher-quality assemblies with greater contiguity and fewer misassemblies compared to short- or long-read-only approaches. They enable chromosome-level reconstructions, as demonstrated in the Telomere-to-Telomere (T2T) human genome assembly (T2T-CHM13), which combined PacBio HiFi long reads for primary contig formation with Oxford Nanopore ultralong reads and Illumina short reads for error correction to achieve a gapless 3.055 Gbp sequence. This integration resolved over 200 Mbp of previously unassembled repetitive regions.

Recent advances from 2023 to 2025 have extended hybrid strategies to phased diploid and polyploid assemblies, improving handling of heterozygosity and multiple chromosome sets.
Algorithms like Hifiasm, originally for HiFi reads, have been adapted in hybrid contexts to produce haplotype-resolved diploid assemblies by integrating short-read polishing, facilitating accurate phasing in complex genomes. For polyploid plants, hybrid pipelines combining long-read assembly with short-read correction have enabled high-contiguity reconstructions, addressing challenges from homeologous chromosomes in crops like wheat progenitors.

In practice, hybrid assembly proves particularly effective for genomes dominated by repeats, such as fungal species like Trichoderma villosa, where it recovered more complete gene sets and repetitive elements than short-read methods alone. Similarly, in plants with roughly 80% or more repetitive content, like the wheat progenitor Aegilops tauschii, MaSuRCA-based hybrid assembly produced a highly contiguous 4 Gbp genome, spanning complex repeat arrays and resolving structural variants missed in prior efforts.

Applications

Whole Genome Assembly

Whole genome assembly involves reconstructing the complete sequence of an organism's nuclear and organelle DNA, such as mitochondrial and chloroplast genomes, to produce full chromosomes or circular molecules. In prokaryotes, this process is simpler due to their typically circular chromosomes ranging from 1 to 10 megabases (Mb) in size, with fewer repetitive elements and no introns, allowing for more straightforward de novo reconstruction. Eukaryotic genomes, by contrast, feature linear chromosomes and total sizes of about 3 gigabases (Gb) in humans, complicated by extensive repetitive sequences, introns, and structural variations, which demand integrated approaches combining short- and long-read sequencing to achieve high continuity.

A pivotal milestone in whole genome assembly was the Human Genome Project, which in 2003 produced a finished reference sequence covering approximately 92% of the human genome, chiefly its euchromatic portion, using hierarchical shotgun sequencing of bacterial artificial chromosome clones. This effort laid the foundation for subsequent genomic research but left gaps in repetitive regions. Subsequent advancements culminated in the Telomere-to-Telomere (T2T) Consortium's 2022 achievement of the first complete, gapless human genome assembly (T2T-CHM13), spanning 3.055 Gb and including all centromeres, telomeres, and repetitive elements through long-read technologies like PacBio HiFi and ultra-long Oxford Nanopore reads. Building on this foundation, a 2025 study sequenced 65 diverse human genomes to generate 130 haplotype-resolved assemblies with a median contig length of 130 Mb, closing 92% of remaining gaps and improving global representation.

Assembling eukaryotic genomes faces specific hurdles, such as the highly repetitive alpha-satellite DNA in centromeres and the TTAGGG repeats at telomeres, which historically caused fragmentation due to short-read limitations and sequencing biases. In polyploid crop species, multiple chromosome sets exacerbate these issues, leading to haplotype confusion and inflated assembly sizes without phased approaches.
The primary outputs of whole genome assembly include contigs (continuous sequences built from overlapping reads), scaffolds (contigs linked with estimated gap sizes), and AGP (A Golden Path) files that map these components to chromosomes for visualization and annotation. Post-2020, Hi-C chromatin conformation capture has become routine for integrating these outputs into chromosome-scale assemblies, enabling anchoring of scaffolds via long-range interaction data in diverse species, including insects.

Transcriptome Assembly

Transcriptome assembly is the computational process of reconstructing full-length messenger RNA (mRNA) sequences from short complementary DNA (cDNA) reads generated by RNA sequencing (RNA-Seq), enabling the capture of alternative splicing isoforms and expressed gene structures. This method has superseded earlier expressed sequence tag (EST) approaches, which relied on low-throughput Sanger sequencing to produce partial, low-coverage transcript fragments. By providing comprehensive coverage of the transcriptome, RNA-Seq-based assembly facilitates the identification of novel transcripts, quantification of gene expression, and annotation of splice variants, particularly in non-model organisms lacking a reference genome.

The assembly process typically follows one of two strategies: de novo assembly, which builds transcripts directly from unaligned reads without prior genomic information, or reference-guided assembly, which maps reads to a known genome or transcriptome before reconstructing isoforms. In de novo approaches, tools like Trinity employ de Bruijn graph-based methods to resolve transcript contigs by clustering reads into splicing-aware components, effectively handling the complexity of alternative splicing. Reference-guided methods, such as StringTie, use flow algorithms on aligned reads to model transcript structures and estimate abundances, improving accuracy when a reference is available. These pipelines often incorporate preprocessing steps like quality trimming and error correction to mitigate sequencing artifacts.

Key advantages of transcriptome assembly include the ability to reveal dynamic gene expression profiles and discover unannotated transcripts without relying on a complete genome assembly, making it essential for evolutionary studies.
However, significant challenges persist, such as isoform ambiguity from overlapping reads, under-detection of low-abundance transcripts, and biases in coverage due to RNA degradation or sequencing depth; these issues intensified with the rapid adoption of RNA-Seq following its introduction in 2008.

Recent advancements, particularly since 2023, have leveraged long-read technologies like PacBio and Oxford Nanopore to resolve full-length isoforms with higher fidelity, reducing fragmentation errors in complex splicing events. In single-cell contexts, these long-read methods enhance resolution of cell-type-specific transcriptomes, enabling precise isoform detection in heterogeneous populations. Assemblies from such data can be evaluated for completeness using tools like BUSCO, which assess conserved ortholog recovery.

Sequencing Technologies

Short-Read Sequencing

Short-read sequencing technologies generate reads typically up to 300 base pairs (bp) in length, enabling high-throughput genome analysis with low per-base error rates. The most widely adopted platform is Illumina sequencing, which produces reads of 50–300 bp with an error rate of approximately 0.1% (or 1 in 1,000 bases), allowing billions of reads per run for deep coverage at reduced costs. Another early next-generation method, Roche's 454 pyrosequencing, yielded longer reads of 400–1,000 bp but suffered from higher error rates of 1–2%, particularly insertions and deletions, and has been deprecated since 2016 due to inferior throughput and cost-efficiency compared to newer platforms.

These technologies profoundly influenced sequence assembly by providing uniform, high-coverage data that supports de Bruijn graph-based algorithms, which break reads into k-mers to reconstruct contigs efficiently despite short lengths. However, short reads often fail to span repetitive regions exceeding their length, resulting in fragmented assemblies and gaps in complex genomic areas. From 2005 to 2015, short-read platforms dominated assembly projects, enabling the sequencing of thousands of bacterial and eukaryotic genomes with coverages often exceeding 30×.

Key advances mitigated some limitations through paired-end and mate-pair library preparations, where both ends of DNA fragments are sequenced to provide insert size information (typically 200–500 bp for paired-end and 2–20 kb for mate-pair), facilitating scaffolding and repeat resolution. By 2020, these innovations contributed to a dramatic cost reduction, bringing whole-genome sequencing below $1,000 and democratizing genomics for large-scale studies. Despite these gains, short-read assemblies remain prone to fragmentation in repetitive or low-complexity regions, often requiring complementary approaches for complete genomes.

Long-Read Sequencing

Long-read sequencing technologies produce DNA reads exceeding 10 kb in length, enabling the spanning of repetitive regions and the resolution of complex genomic structures that challenge shorter-read methods. Pacific Biosciences (PacBio) HiFi sequencing generates highly accurate long reads of 10-25 kb using circular consensus sequencing (CCS), achieving an error rate of approximately 0.1% through multiple passes over the same template molecule. In contrast, Oxford Nanopore Technologies (ONT) employs nanopore-based detection to yield ultra-long reads up to 2 Mb, supporting real-time sequencing with raw error rates historically ranging from 5-15%, though recent advancements have pushed single-read accuracy above 99% for certain chemistries. These technologies have transformed sequence assembly by providing contiguous scaffolds that capture structural variants (SVs) and repetitive elements, which often fragment assemblies from shorter reads.

The impact of long-read sequencing is exemplified in telomere-to-telomere (T2T) genome assemblies, where it has enabled complete, gapless representations of genomes. In 2022, the T2T Consortium produced the first fully complete human genome assembly (CHM13) using a combination of PacBio HiFi and ONT ultra-long reads, resolving over 200 million base pairs of previously unassembled repetitive sequences, including centromeres and acrocentric regions. In 2025, similar approaches facilitated T2T assemblies for crop plant genomes, where ultra-long ONT reads bridged extensive repeats and polyploid complexities to achieve chromosome-level contiguity without gaps. These advancements have improved the detection of SVs, which constitute a significant portion of genomic variation, by directly traversing insertions, deletions, and inversions that short reads cannot reliably phase.

Recent developments from 2023 to 2025 have further enhanced long-read utility in assembly through improved basecalling and polishing strategies.
ONT's basecaller, released in 2023, leverages GPU acceleration for faster, more accurate real-time processing of R10.4 flow cells, reducing computational bottlenecks and enabling on-device analysis. As of 2025, ONT's R10.4.1 flow cells achieve >99% single-read accuracy, with ongoing developments announced in 2025 enhancing throughput and real-time analysis capabilities. Integration with short-read data for polishing has also advanced, with tools iteratively correcting long-read errors using high-depth short-read alignments, often reducing assembly error rates by over 1% in vertebrate genomes. For instance, polishing PacBio assemblies of species like the green anole lizard has yielded near-complete, highly accurate drafts with scaffold N50 values exceeding 100 Mb. Such refinements, often in hybrid assembly contexts, underscore long-read sequencing's role in producing high-fidelity references for diverse applications.

Assembly Algorithms

Overlap-Layout-Consensus Methods

Overlap-layout-consensus (OLC) methods represent a foundational approach to sequence assembly, particularly suited for datasets with long reads and lower coverage depths. These algorithms reconstruct the original sequence by first identifying overlapping regions between sequencing reads, then arranging the reads into a consistent layout that approximates the genome structure, and finally deriving a consensus sequence from the aligned reads to resolve errors and ambiguities. Developed initially for Sanger sequencing data, OLC approaches excel when read lengths are sufficient to span repetitive regions reliably, allowing for accurate overlap detection without excessive fragmentation.

The process begins with overlap detection, where pairwise similarities between reads are computed to identify potential alignments. This step often employs techniques such as k-mer indexing, where short substrings (k-mers) of fixed length are extracted from reads and stored in a hash table to quickly filter candidate pairs for full alignment; for instance, some assemblers use k=24 for this purpose to balance sensitivity and computational efficiency. Alignments are then scored based on metrics like sequence identity and length, discarding weak overlaps that may arise from sequencing errors or distant homologies. The output forms an overlap graph, a directed graph with reads as nodes and weighted edges representing overlap quality and direction.

In the layout phase, the overlap graph is traversed to find paths that represent contigs, the continuous segments of the assembly. Algorithms such as unitig formation or Hamiltonian path approximation bundle overlapping reads into linear arrangements, resolving branches caused by repeats or errors through heuristics like coverage depth or edge weights. This step handles errors by prioritizing high-scoring paths and may incorporate mate-pair information for scaffolding.
Finally, the consensus phase aligns reads along each contig and generates the consensus sequence by voting or probabilistic models, such as weighted majority for base calls, to achieve high accuracy; for example, Celera Assembler uses a dynamic programming approach to compute this while detecting variants.

OLC methods are particularly effective for long-read technologies like Sanger capillary electrophoresis or modern Pacific Biosciences (PacBio) sequencing, where read lengths often exceed 10 kb, enabling overlaps that capture unique genomic context even at 5-10x coverage. In contrast, they are less efficient for short-read data, where alternatives like de Bruijn graphs are preferred due to the high volume of fragments.

Seminal implementations include the Celera Assembler, which powered the whole-genome shotgun assembly of Drosophila melanogaster and contributed to the sequencing of the human genome, achieving over 99% accuracy in non-repetitive regions. Contemporary OLC-based tools, such as Canu, adapt the paradigm for error-prone long reads by integrating read correction via adaptive weighting prior to overlap detection, yielding near-complete assemblies for bacterial and eukaryotic genomes with N50 contig sizes exceeding 10 Mb. Despite these advances, the naive OLC paradigm incurs O(n²) time complexity for overlap computation on n reads, which is mitigated through indexing and approximate matching but remains a bottleneck for ultra-large datasets.
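The k-mer indexing trick used to avoid all-against-all comparison can be sketched as follows. This is a simplified illustration, assuming exact matches on a single strand; the function names, `k`, and `min_overlap` are hypothetical, and production assemblers replace the exact suffix-prefix test with error-tolerant alignment.

```python
from collections import defaultdict
from itertools import combinations


def build_kmer_index(reads: list[str], k: int) -> dict[str, set[int]]:
    """Hash table mapping each k-mer to the ids of the reads that contain it."""
    index: dict[str, set[int]] = defaultdict(set)
    for read_id, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].add(read_id)
    return index


def candidate_pairs(index: dict[str, set[int]]) -> set[tuple[int, int]]:
    """Only reads sharing at least one k-mer are worth aligning in full."""
    pairs = set()
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            pairs.add((a, b))
    return pairs


def suffix_prefix_overlap(left: str, right: str, min_overlap: int) -> int:
    """Longest exact match between a suffix of `left` and a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def build_overlap_graph(reads: list[str], k: int, min_overlap: int) -> dict[int, dict[int, int]]:
    """Directed graph: graph[src][dst] is the overlap length from read src into read dst."""
    graph: dict[int, dict[int, int]] = defaultdict(dict)
    for a, b in candidate_pairs(build_kmer_index(reads, k)):
        for src, dst in ((a, b), (b, a)):
            size = suffix_prefix_overlap(reads[src], reads[dst], min_overlap)
            if size:
                graph[src][dst] = size
    return {src: dict(edges) for src, edges in graph.items()}


reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "TTTTTTT"]
print(build_overlap_graph(reads, k=4, min_overlap=4))
```

In this example only two of the six possible read pairs share a k-mer, so only those two are examined in detail; the same filtering is what keeps the quadratic overlap step tractable on real datasets.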

De Bruijn Graph Methods

De Bruijn graph methods for sequence assembly transform the problem of reconstructing a genome from short sequencing reads into finding an Eulerian path in a directed graph, where reads are decomposed into fixed-length substrings known as k-mers. In this approach, introduced for DNA fragment assembly, the graph captures overlaps between k-mers to efficiently represent the underlying sequence, enabling polynomial-time solutions that avoid exhaustive pairwise alignments.

The process begins with k-mer decomposition, where each read of length L is broken into L - k + 1 overlapping k-mers, providing the building blocks for the graph. Graph construction follows, with nodes representing unique (k-1)-mers (prefixes or suffixes of k-mers) and directed edges corresponding to k-mers, such that an edge from u to v exists when some observed k-mer has u as its (k-1)-base prefix and v as its (k-1)-base suffix. The number of nodes |V| equals the count of unique (k-1)-mers, while the number of edges |E| equals the number of distinct k-mers observed, which for error-free data is bounded by roughly the genome length; coverage depth increases edge multiplicities rather than the edge count.

To address sequencing errors, which manifest as low-coverage "tips" or dead-end paths, the graph undergoes simplification by removing these based on coverage thresholds. "Bubbles" (short divergent paths arising from sequencing errors or biological variants) are then resolved by selecting the highest-coverage path or using paired-end information to pop the bubble. Finally, an Eulerian path traversing each edge exactly once reconstructs the contigs, with repeats handled by edge multiplicities reflecting coverage.

These methods excel in memory efficiency for high-coverage short-read data, as the graph scales with unique k-mers rather than full read alignments, making them suitable for massive datasets from next-generation sequencing. They also manage repetitive regions effectively by leveraging k-mer coverage to distinguish true repeats from errors, reducing misassemblies compared to overlap-based alternatives.
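The decomposition, graph construction, and Eulerian traversal described above can be sketched compactly. This is a toy version, assuming error-free, single-strand reads and ignoring k-mer multiplicities, tips, and bubbles; the function names are illustrative, and the traversal is Hierholzer's algorithm, a standard way to find an Eulerian path.

```python
from collections import defaultdict


def build_de_bruijn(reads: list[str], k: int) -> dict[str, list[str]]:
    """Nodes are (k-1)-mers; each distinct k-mer adds one edge prefix -> suffix."""
    graph: dict[str, list[str]] = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):  # L - k + 1 k-mers per read
            kmer = read[i:i + k]
            if kmer in seen:
                continue
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def eulerian_path(graph: dict[str, list[str]]) -> list[str]:
    """Hierholzer's algorithm; assumes the graph actually has an Eulerian path."""
    if not graph:
        return []
    in_degree: dict[str, int] = defaultdict(int)
    for targets in graph.values():
        for node in targets:
            in_degree[node] += 1
    nodes = set(graph) | set(in_degree)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if len(graph.get(n, [])) - in_degree.get(n, 0) == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path: list[str]) -> str:
    """Concatenate the first node with the last base of every following node."""
    return path[0] + "".join(node[-1] for node in path[1:]) if path else ""


reads = ["ATGGCGT", "GGCGTGC", "GCGTGCA"]
graph = build_de_bruijn(reads, k=4)
print(spell(eulerian_path(graph)))
```

With k=4 every node in this example has a single successor, so the path is unique. Rebuilding the graph with k=3 makes the nodes TG and GC branch, and more than one Eulerian path then exists, which is a small-scale picture of how repeats longer than k-1 bases create ambiguity.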
Prominent implementations include SPAdes, which employs iterative k-mer sizing and graph simplification for de novo assembly, achieving high contiguity in bacterial genomes. Similarly, ABySS uses a parallelized, distributed de Bruijn graph to spread computation across clusters, enabling scalable assembly of large eukaryotic genomes like the human genome.

Despite these strengths, de Bruijn graph methods are sensitive to the choice of k, where small k values increase error susceptibility and fail to span repeats, while large k values fragment assemblies in low-coverage areas. Low overall coverage exacerbates issues, as sparse edges hinder resolution and amplify error propagation.

Specialized Algorithms for Modern Data

Recent advancements in sequence assembly have focused on algorithms tailored to long-read datasets and complementary data types, addressing challenges like high error rates, repetitive regions, and diploid complexity in technologies such as Oxford Nanopore and PacBio HiFi. These specialized methods extend traditional approaches by incorporating error-tolerant graph structures and phasing strategies, enabling more contiguous and haplotype-resolved assemblies. For instance, Flye employs a repeat graph, an error-tolerant relative of the de Bruijn graph, that handles error-prone long reads by iteratively resolving repeats, achieving superior contiguity in bacterial and eukaryotic genomes compared to k-mer-based assemblers.

Haplotype-aware assembly pipelines, such as Verkko, integrate ultra-long reads with proximity-ligation data to produce phased, telomere-to-telomere diploid assemblies. Developed in 2023, Verkko uses a hybrid assembly graph to separate haplotypes and resolve structural variants, successfully assembling 20 of 46 chromosomes without gaps in diploid samples. This approach facilitates phased assemblies for diploids by leveraging read phasing and graph untangling, improving accuracy in heterozygous regions. Complementing these, Hi-C integration in scaffolding tools like YaHS orders contigs into chromosome-scale structures using contact maps, enhancing overall assembly integrity without requiring prior chromosome counts.

For repeat resolution, newer methods target tandem repeats, which often cause assembly gaps. TRFill, introduced in 2025, fills these gaps in draft assemblies using only HiFi and Hi-C data, accurately reconstructing tandem regions through reference-guided alignment and inference, enabling population-level analysis of complex loci. Similarly, DeChat performs haplotype- and repeat-aware error correction in Nanopore R10 reads, preserving variant information while reducing errors in repetitive contexts.
These innovations have elevated performance metrics; modern long-read human assemblies now routinely achieve contig N50 lengths exceeding 10 Mb, a marked improvement over pre-2020 short-read assemblies limited to under 1 Mb N50. Looking ahead, AI-driven error correction models are poised to further refine long-read data. Tools like HERRO, released in 2024, use deep learning to correct ultra-long Nanopore reads while accounting for haplotype variations, reducing overall error rates below 1% in high-quality subsets and supporting more reliable phased assemblies.

Quality Assessment

Evaluation Metrics

Evaluation metrics for sequence assemblies assess contiguity, accuracy, completeness, and specificity to determine the quality of the reconstructed genome or transcriptome.
Contiguity metrics evaluate how well the assembly captures the linear structure of the original sequence, with higher values indicating fewer but longer contiguous segments. The N50 metric is the length of the shortest contig such that all contigs of that length or longer together cover at least 50% of the total assembled length, which makes it a measure of assembly fragmentation. Complementing N50, L50 is the smallest number of contigs required to cover 50% of the total assembly length; lower L50 values signify better contiguity. The total assembled length quantifies the overall span of the assembly in base pairs and should approach the known genome size without substantial over- or underestimation, while the number of contigs indicates fragmentation, with fewer contigs preferred.
Accuracy metrics focus on base-level errors by comparing the assembly to a reference sequence, revealing mismatches and small insertions or deletions. The mismatch rate, expressed as the average number of mismatches per 100,000 aligned bases, highlights substitution errors, with rates below 100 per 100 kb considered acceptable for polished assemblies. Similarly, the indel rate measures insertions and deletions per 100 kb, where low values (e.g., under 50 per 100 kb) indicate close agreement with the reference, as computed in tools like QUAST.
Completeness metrics gauge whether essential genomic content is represented, particularly genes and repetitive elements. BUSCO (Benchmarking Universal Single-Copy Orthologs) assesses completeness by searching for a conserved set of single-copy orthologs, with assemblies achieving 95% or more complete BUSCOs deemed high-quality for eukaryotic genomes.
The LTR Assembly Index (LAI) evaluates continuity in repetitive regions by measuring the proportion of intact long terminal repeat (LTR) retrotransposons among all LTR retrotransposon sequence in the assembly; LAI values above 10 suggest strong assembly of repeat-rich areas in plant genomes.
Specificity metrics detect structural errors that could mislead downstream analyses, such as chimeric joins. The misassembly rate counts relocation or inversion events between the assembly and the reference, often normalized to assembly length, with rates near zero essential for reliable assemblies.
For telomere-to-telomere (T2T) assemblies, which aim for gapless coverage, metrics include the consensus quality value (QV), where QV > 30 (equivalent to an error rate below 0.1%) confirms high accuracy, alongside verification that telomeres and centromeres are completely included without misassemblies.
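The contiguity and accuracy metrics defined in this section reduce to a few lines of arithmetic. The following is a minimal sketch, not the implementation used by QUAST or any other tool; the function names and the example contig lengths are assumptions chosen for illustration. The QV conversion uses the standard Phred relation QV = -10 · log10(error rate).

```python
import math

def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in base pairs."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            # 'length' is the shortest contig in the set covering >= 50%;
            # 'count' is how many contigs that set contains.
            return length, count
    raise ValueError("contig_lengths must not be empty")

def per_100kb(events, aligned_bases):
    """Normalize a mismatch or indel count to events per 100 kb aligned."""
    return events / aligned_bases * 100_000

def qv_from_error_rate(error_rate):
    """Phred-scaled consensus quality: QV = -10 * log10(error rate)."""
    return -10 * math.log10(error_rate)

def error_rate_from_qv(qv):
    """Inverse of the Phred relation."""
    return 10 ** (-qv / 10)

# Hypothetical assembly with seven contigs totalling 300 bp.
lengths = [80, 70, 50, 40, 30, 20, 10]
n50, l50 = n50_l50(lengths)
print(f"N50={n50}, L50={l50}")
print(f"QV for 0.1% error: {qv_from_error_rate(0.001):.1f}")
```

Because N50 depends only on the length distribution, an assembly with chimeric joins can score a high N50, which is why it is reported together with misassembly and accuracy metrics.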

Control and Validation Techniques

Control and validation techniques in sequence assembly encompass a range of methods applied before, during, and after the assembly process to detect errors, improve contiguity, and ensure the reliability of the resulting sequences. These techniques are essential for addressing challenges such as sequencing errors, chimeric artifacts, and structural inaccuracies, particularly in complex datasets from transcriptomic or genomic sources.
Pre-assembly quality control (QC) begins with read trimming to remove low-quality bases, adapters, and contaminants that could propagate errors into the assembly. Tools like FastQC assess read quality by generating reports on per-base sequence quality, GC content, and overrepresented sequences, which guide targeted trimming with dedicated trimming software. Error correction further refines raw reads; for instance, k-mer spectrum-based correctors fix substitution and indel errors in high-throughput sequencing data, reducing noise without excessive loss of coverage. These steps typically improve assembly metrics like N50 by minimizing misalignment during overlap detection.
Post-assembly validation focuses on verifying scaffold integrity and resolving artifacts. Optical mapping, which uses restriction enzyme digestion patterns to create high-resolution physical maps, validates scaffolding by aligning assembled contigs against these maps to detect misassemblies or gaps that exceed expected distances. Manual curation complements automated methods by inspecting chimeric junctions (regions where unrelated sequences are erroneously fused), often using visualization tools like IGV to review read depth anomalies or breakpoint evidence before editing the assembly.
Key techniques for overall validation include read-back mapping, where the original reads are realigned to the assembled contigs to quantify alignment rates and coverage uniformity; a high percentage of aligned reads (e.g., >90%) indicates a robust assembly, while discrepancies highlight errors. Simulation-based validation generates mock datasets that mimic real conditions, such as metagenomic communities, to test assembly accuracy against a known ground truth; tools like MetaSim create synthetic reads from reference genomes for this purpose.
Recent advances emphasize automated polishing to iteratively refine draft assemblies. Pilon, for example, uses short-read alignments to correct base errors, indels, and small structural variants in long-read drafts. As of 2025, machine learning-based tools like DeepPolisher have emerged, enabling precise base-level error correction and significantly improving assembly accuracy. Standardized benchmarks like the GAGE (Genome Assembly Gold-Standard Evaluations) framework provide comparative evaluation by assembling reference datasets with multiple tools and assessing outcomes across species, establishing baselines for contiguity, accuracy, and runtime.
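The read-back mapping check described in this section can be summarized with two simple statistics. This is a minimal sketch under stated assumptions: the per-read mapped flags and per-base depths are hypothetical inputs that would normally be extracted from an alignment file, and the function names and the 0.5 low-coverage fraction are illustrative choices, not values from any tool.

```python
from statistics import mean, pstdev

def mapping_rate(mapped_flags):
    """Fraction of reads that aligned back to the assembly."""
    flags = list(mapped_flags)
    return sum(flags) / len(flags)

def coverage_cv(depths):
    """Coefficient of variation of per-base depth (0 means perfectly even)."""
    return pstdev(depths) / mean(depths)

def low_coverage_positions(depths, fraction=0.5):
    """Positions whose depth falls below a fraction of the mean depth."""
    threshold = fraction * mean(depths)
    return [i for i, depth in enumerate(depths) if depth < threshold]

# Hypothetical inputs: 95 of 100 reads mapped; one position has a depth dip.
flags = [True] * 95 + [False] * 5
depths = [30, 30, 5, 30, 30]

print(f"mapping rate: {mapping_rate(flags):.2%}")
print(f"coverage CV: {coverage_cv(depths):.2f}")
print(f"low-coverage positions: {low_coverage_positions(depths)}")
```

A sharp depth dip like the one at position 2 is the kind of anomaly that would prompt closer inspection of a possible misjoin or collapsed repeat.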

Pipelines and Tools

Assembly Workflows

Sequence assembly workflows encompass the end-to-end bioinformatics processes that transform raw sequencing reads into contiguous genome representations, typically involving pre-processing, core assembly, and post-processing stages to ensure accuracy and completeness. These pipelines are tailored to the sequencing technology and data type, such as short-read Illumina data or long-read PacBio/Oxford Nanopore outputs, and increasingly incorporate hybrid approaches for enhanced resolution. The workflow begins with data preparation to mitigate errors inherent in sequencing, proceeds to algorithmic reconstruction, and concludes with refinement to achieve biologically meaningful assemblies.
Pre-processing starts with quality control (QC) to evaluate read integrity using tools that detect contamination, low-quality bases, and biases, followed by trimming to remove artifacts and by error correction, particularly for long reads, where error rates can exceed 10%. Read correction is crucial for noisy long-read data; it relies on algorithms that align reads to consensus sequences or use short reads as anchors to reduce errors before assembly. This stage also includes k-mer analysis to estimate genome size, heterozygosity, and optimal parameters, so that downstream steps handle repetitive or complex regions effectively.
In the core assembly phase, algorithm selection depends on read length and error profile: overlap-layout-consensus (OLC) methods suit long reads for their tolerance of gaps, while de Bruijn graph approaches excel with short, high-accuracy reads. For de novo workflows, the process begins with read correction to polish the inputs, continues with graph building (either via k-mer overlaps in de Bruijn graphs or read-to-read alignments in OLC) to represent sequence relationships, and culminates in consensus generation, which resolves paths into contigs, often iterating to collapse bubbles caused by sequencing errors or variants.
Reference-based workflows, conversely, map reads to an existing reference genome using aligners like BWA or minimap2, then refine variants through calling and polishing to fill gaps or correct mismatches, yielding a sample-specific assembly aligned to the reference scaffold. Hybrid workflows leverage complementary strengths by first generating a draft contig set from long reads to span repeats and structural variants, then integrating short reads for error correction and gap filling via alignment and polishing, often in multiple rounds to raise base-level accuracy above 99%.
Post-processing enhances contiguity through scaffolding, which orders and orients contigs using paired-end or mate-pair links, and includes initial annotation to identify genes and repeats for validation. Annotation at this stage involves structural prediction and functional assignment, preparing the assembly for downstream analyses.
To achieve chromosome-scale assemblies, workflows integrate chromatin conformation capture data such as Hi-C, which captures genome-wide proximity interactions. Contigs are scaffolded by modeling contact frequencies between them and resolving orientations via iterative error correction, reducing misjoins by up to fourfold compared to read-based methods alone. Protein-centric variants of the technique similarly map long-range interactions to refine scaffolds, as demonstrated in re-annotating genomes by linking regulatory elements across chromosomes.
Best practices emphasize parameter tuning, such as selecting k-mer sizes (e.g., 19-31 for human-scale genomes) to balance uniqueness and coverage: too small a k collapses repeats, while too large a k fragments the assembly. Tools like GenomeScope support this choice by estimating heterozygosity and repetitiveness from the k-mer spectrum. For large genomes exceeding 1 Gb, high-performance computing (HPC) resources are essential, requiring at least 244 GB of RAM and multi-core processors to manage memory-intensive graph construction, with containerized pipelines ensuring reproducibility across clusters.
Recent advancements include automated pipelines like nf-core/genomeassembler, which streamlines long-read assembly, polishing, and scaffolding in a Nextflow-based framework, supporting telomere-to-telomere (T2T) efforts by integrating high-fidelity inputs for complete chromosome reconstructions without manual intervention. Verkko2 (as of December 2024) further improves on Verkko by enhancing repeat resolution and gap closing with proximity-ligation data.
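The k-mer analysis mentioned in the pre-processing stage is commonly based on a simple relation: genome size is approximately the total number of k-mers in the reads divided by the depth at the main peak of the k-mer histogram. The sketch below illustrates that relation only. It is not how GenomeScope works internally (GenomeScope fits a statistical model to the spectrum), and the simulated genome, read layout, function names, and `min_depth` cutoff are assumptions made for illustration. Reverse complements and sequencing errors are ignored.

```python
import random
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads (single strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def estimate_genome_size(reads, k, min_depth=2):
    """Estimate genome size as total k-mers divided by the peak k-mer depth.

    Depths below min_depth are skipped when locating the peak because, in
    real data, they are dominated by k-mers containing sequencing errors.
    """
    counts = kmer_counts(reads, k)
    histogram = Counter(counts.values())  # depth -> number of distinct k-mers
    candidate_depths = [d for d in histogram if d >= min_depth]
    peak_depth = max(candidate_depths, key=lambda d: histogram[d])
    total_kmers = sum(counts.values())
    return total_kmers / peak_depth, peak_depth

# Simulated, error-free data: a random 2,000-bp genome tiled with 100-bp
# reads starting every 5 bp (about 20x base coverage).
random.seed(0)
true_size = 2000
genome = "".join(random.choice("ACGT") for _ in range(true_size))
reads = [genome[s:s + 100] for s in range(0, true_size - 100 + 1, 5)]

size, peak = estimate_genome_size(reads, k=21)
print(f"peak k-mer depth: {peak}, estimated size: {size:.0f} bp (true: {true_size})")
```

The estimate falls slightly short of the true size here because k-mers near the sequence ends are covered by fewer reads; in real data, heterozygosity and repeats add further peaks that model-based tools account for.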

Software Programs

Sequence assembly software encompasses a diverse array of tools designed to reconstruct genomic sequences from fragmented reads, categorized by read length and assembly strategy. These programs vary in their optimization for short-read (e.g., Illumina), long-read (e.g., PacBio HiFi or Nanopore), or hybrid approaches, with many being open-source and widely adopted in bioinformatics pipelines.
For de novo short-read assembly, SPAdes is a prominent assembler optimized for bacterial and small eukaryotic genomes, employing a de Bruijn graph approach with multi-sized k-mers to handle uneven coverage and repeats effectively. It has been benchmarked as one of the top performers for single-cell and isolate assemblies, achieving high contiguity in complex datasets. MEGAHIT serves as an efficient alternative for metagenomic short-read data, using a succinct de Bruijn graph to enable ultra-fast assembly on single nodes even for large, complex communities, often completing in under 10 hours with modest RAM.
Long-read assemblers address the limitations of short reads in resolving repeats and structural variants. Flye is tailored for Oxford Nanopore reads, using a repeat graph-based approach to produce highly contiguous assemblies from error-prone long reads, with applications in bacterial and eukaryotic genomes. Hifiasm excels with PacBio HiFi reads, incorporating phased assembly graphs for haplotype-resolved diploid genomes; a 2023 update enhanced its diploid phasing capabilities, improving accuracy in heterozygous regions by leveraging trio data.
Hybrid assemblers combine short and long reads to pair the accuracy of short reads with the contiguity of long reads. MaSuRCA integrates Illumina short reads for error correction with long reads for contiguity, making it suitable for large eukaryotic genomes and producing assemblies with fewer misassemblies in benchmarks.
Unicycler focuses on bacterial genomes, using SPAdes for initial short-read graphs and long reads for bridging repeats, resulting in circularized contigs with high completeness.
Reference-based assembly relies on alignment to a known genome for variant detection and scaffolding. BWA-MEM is a core mapping tool that aligns short or long reads to references with high sensitivity to indels and structural variants, forming the basis for downstream assembly refinement. GATK's HaplotypeCaller module facilitates variant assembly by modeling haplotypes from aligned reads, enabling precise reconstruction of genomic regions with polymorphisms.
Comprehensive suites integrate multiple tools into user-friendly platforms. Galaxy provides workflows that orchestrate assemblers like SPAdes and Flye within a web-based interface, supporting reproducible de novo and reference-based pipelines for diverse sequencing data.
Recent advancements include Verkko (2023), a hybrid pipeline for repeat-heavy genomes that uses PacBio HiFi and ultra-long reads to achieve telomere-to-telomere assemblies in challenging regions like centromeres. Dragonflye streamlines long-read assembly for bacterial isolates, wrapping Flye with polishing steps to yield high-quality, complete genomes from Nanopore data in a single command.

References

  1. [1]
    Genome Assembly - an overview | ScienceDirect Topics
    Genome assembly is defined as the process of organizing nucleotide sequences into the correct order, which is necessary due to the shorter lengths of sequence ...
  2. [2]
    DNA Sequence Assembly - News-Medical
    DNA sequence assembly is a process that involves aligning and merging fragments of a DNA sequence to reconstruct the original structure of the DNA.
  3. [3]
    Next-Generation Sequence Assembly: Four Stages of Data ... - NIH
    Dec 12, 2013 · In this review, we address the basic framework of next-generation genome sequence assemblers, which comprises four basic stages.
  4. [4]
    Recent advances in sequence assembly: principles and applications
    Apr 26, 2017 · Sequence assembly [3] is a process by dividing the large pieces of DNA into small pieces, reading the small fragments and reconstituting the ...Assembly algorithms · Future development of... · Application of DNA assembly
  5. [5]
    An Overview of Genome Assembly
    In bioinformatics, genome assembly represents the process of putting a large number of short DNA sequences back together to recreate the original chromosomes ...
  6. [6]
    Repetitive DNA and next-generation sequencing - PubMed Central
    Nov 29, 2011 · NGS read lengths (50–150 bp) are considerably shorter than the 800–900 bp lengths that capillary-based (Sanger) sequencing methods were ...
  7. [7]
    What is a genome assembly? - NLM Support Center - NIH
    It can (1) refer to a process in which researchers assemble genome sequences from smaller components, or it can (2) refer to the entire collection of sequences ...
  8. [8]
    Assembling Your DNA Sequences - Geneious
    What is Sequence Assembly? · Assembling genomes or genes to study their relationships for phylogenetic studies · Identifying variants associated with a disease or ...
  9. [9]
    Human Genome Project Fact Sheet
    Jun 13, 2024 · In 2003, the Human Genome Project produced a genome sequence that accounted for over 90% of the human genome. It was as close to complete as the ...
  10. [10]
    International Consortium Completes Human Genome Project
    The international effort to sequence the 3 billion DNA letters in the human ... base pairs could be achieved. The finished human sequence is a fabulous ...
  11. [11]
    [PDF] The Human Genome Project (HGP)
    A computer program looks for overlaps in the DNA sequences, using them to reassemble the fragments in their correct order to determine the sequence of the ...<|control11|><|separator|>
  12. [12]
    De Novo Sequencing | Assemble novel genomes - Illumina
    De novo sequencing is sequencing a novel genome where no reference sequence is available, and it generates accurate reference sequences.
  13. [13]
    Current challenges and solutions of de novo assembly
    Jun 4, 2019 · In this review, we first briefly introduce some of the major challenges faced by NGS sequence assembly. Then, we analyze the characteristics of various ...
  14. [14]
    Repetitive DNA sequence detection and its role in the human genome
    Sep 19, 2023 · For instance, about 50% of the human genome consists of repeats, while roughly 4% of human genes harbor transposable elements in their protein- ...
  15. [15]
    The Challenge of Genome Sequence Assembly
    Oct 17, 2018 · Chromosome assembly from 'short read' sequence data is confounded by the presence of repetitive genome regions with numerous similar sequence ...INTRODUCTION · THE STRENGTHS AND... · MAP IS NEEDED
  16. [16]
    High-throughput DNA sequencing errors are reduced by orders of ...
    For instance, Illumina sequencing machines produce errors at a rate of ∼0.1–1 × 10−2 per base sequenced. These technologies typically produce billions of base ...
  17. [17]
    A comprehensive evaluation of long read error correction methods
    Dec 21, 2020 · Both sequencing platforms are similar in terms of their high error rates (ranging from 10-20%) with most errors occurring due to insertions or ...
  18. [18]
    Sequencing error profiles of Illumina sequencing instruments
    Mar 27, 2021 · Error rate is highly correlated with the sequencing cycle, rising toward the end of each read. In samples with smaller overlaps, the detected ...INTRODUCTION · MATERIALS AND METHODS · RESULTS AND DISCUSSION
  19. [19]
    Why Assembling Plant Genome Sequences Is So Challenging - PMC
    Unfortunately, coverage variability is the rule and undermines the coverage-based diagnostics. It can be speculated that the sequencing itself needs to be ...
  20. [20]
    Variation of and associations with the depth and evenness ... - Nature
    Jul 19, 2025 · Depth and evenness of sequencing coverage are considered possible indicators of genome assembly quality. A relatively even sequencing coverage ...
  21. [21]
    Genome assembly reborn: recent computational challenges
    May 29, 2009 · Perhaps the biggest new challenge is posed by the large-scale sequencing of entire microbial communities (metagenomics).Genome Assembly Reborn... · Shotgun Sequencing · Overlap-Layout-Consensus...
  22. [22]
    Computability of Models for Sequence Assembly - SpringerLink
    Here we present two theoretical results about the complexity of these models for sequence assembly. In the first part, we show sequence assembly to be NP-hard ...
  23. [23]
    Phased diploid genome assemblies and pan-genomes provide ...
    Nov 2, 2020 · Despite high heterozygosity rates (0.85–1.28%), all assemblies showed high contiguity, with the scaffold N50 of 3.3–4.3 Mb in diploid assemblies ...
  24. [24]
    HapSolo: an optimization approach for removing secondary ... - NIH
    Nevertheless, the assembly of heterozygous genomes still presents substantial challenges. One challenge is resolving distinct haplotypes in regions of high ...
  25. [25]
    The present and future of de novo whole-genome assembly
    Oct 14, 2016 · This review provides guidelines to determine the optimal approach for a given input data type, computational budget or genome.
  26. [26]
    Genetic variation and the de novo assembly of human genomes - NIH
    Sequence coverage is almost never uniform, and repetitive sequences of varying length, copy number and sequence complicate this process. This makes the correct ...De Novo Genome Assembly... · Figure 3. Genome Assembly... · Bioinformatics And...
  27. [27]
    Reference-guided de novo assembly approach improves genome ...
    Nov 10, 2017 · However, de novo genome assemblies remain challenging due to short read length, missing data, repetitive regions, polymorphisms and sequencing ...
  28. [28]
    sequencing, de novoassembly and rapid analysis using open ...
    Apr 1, 2013 · De novo assembly software and algorithms are powerful enough to allow average bacterial genomes to be assembled within hours or a few days ...
  29. [29]
    Comparison of De Novo Assembly Strategies for Bacterial Genomes
    Jul 17, 2021 · The structures of repetitive regions include, for example, resistance gene cassettes, insertion sequences, and transposons.
  30. [30]
    History and current approaches to genome sequencing and assembly
    In this review we provide a comprehensive historical background of the improvements in DNA sequencing technologies that have accompanied the major milestones ...
  31. [31]
    [PDF] DNA sequencing at 40: past, present and future
    The first integrated NGS platforms came in 2005, with resequencing of Escherichia coli by Shendure, Porreca, Mitra and Church41, de novo assembly of Mycoplasma ...
  32. [32]
    Fast and accurate short read alignment with Burrows–Wheeler ...
    May 18, 2009 · Furthermore, BWA always requires the full read to be aligned, from the first base to the last one (i.e. global with respect to reads), but ...INTRODUCTION · METHODS · RESULTS · DISCUSSION
  33. [33]
    Ultrafast and memory-efficient alignment of short DNA sequences to ...
    Mar 4, 2009 · Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes.
  34. [34]
    [1303.3997] Aligning sequence reads, clone sequences and ... - arXiv
    Mar 16, 2013 · BWA-MEM is a new alignment algorithm for aligning sequence reads or long query sequences against a large reference genome such as human.
  35. [35]
    RaGOO: fast and accurate reference-guided scaffolding of draft ...
    Oct 28, 2019 · Two common approaches have been used to achieve chromosome-scale assemblies, namely, reference-free (de novo) and reference-guided approaches.
  36. [36]
    A global reference for human genetic variation | Nature
    Sep 30, 2015 · The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome ...
  37. [37]
    MaSuRCA genome assembler | Bioinformatics - Oxford Academic
    We describe a new hybrid approach that has the computational efficiency of de Bruijn graph methods and the flexibility of overlap-based assembly strategies.
  38. [38]
    Unicycler: Resolving bacterial genome assemblies from short and ...
    Jun 8, 2017 · Here we present Unicycler, a new hybrid assembly pipeline for bacterial isolate genomes. Unicycler first assembles short reads into an accurate ...
  39. [39]
    Hybrid assembly of the large and highly repetitive genome of ...
    The hybrid assembly combines long, high-error PacBio reads with shorter, accurate Illumina reads to create mega-reads, which are then assembled.
  40. [40]
    The complete sequence of a human genome | Science
    Mar 31, 2022 · The resulting T2T-CHM13 reference assembly removes a 20-year-old barrier that has hidden 8% of the genome from sequence-based analysis, ...
  41. [41]
    A complete reference genome improves analysis of human genetic ...
    T2T-CHM13 improves short-read mapping across populations. To investigate how the T2T-CHM13 assembly affects short-read variant calling, we realigned and ...
  42. [42]
    Haplotype-resolved de novo assembly using phased ... - Nature
    Feb 1, 2021 · Hifiasm is a fast open-source de novo assembler specifically developed for HiFi reads. It mostly uses exact overlaps to construct the assembly ...
  43. [43]
    Recent Advances in Assembly of Complex Plant Genomes
    We summarize the challenges of and advances in complex plant genome assembly, including feasible experimental strategies, upgrades to sequencing technology.
  44. [44]
    Hybrid Assembly Improves Genome Quality and Completeness of ...
    Jan 30, 2022 · Hybrid assembly improved genome quality and completeness of T. villosa, revealing high potential for lignocellulose breakdown, with 14,540 ...3. Results And Discussion · 3.1. Illumina And Minion... · 3.2. Genome Assembly And...
  45. [45]
    Characterization of repetitive DNA landscape in wheat ...
    May 12, 2015 · Thus, the hexaploid wheat genome is characterized by its large size (~17 Gb) and complexity, with repetitive sequences accounting for ~ 80% of ...
  46. [46]
    Genome Anatomies - NCBI - NIH
    In general, prokaryotic genes are shorter than their eukaryotic counterparts, the average length of a bacterial gene being about two-thirds that of a eukaryotic ...An Overview of Genome... · The Anatomy of the Eukaryotic... · The Anatomy of the...
  47. [47]
    First complete sequence of a human genome - NIH
    Apr 12, 2022 · Researchers finished sequencing the roughly 3 billion bases (or “letters”) of DNA that make up a human genome.Missing: assembly | Show results with:assembly
  48. [48]
    Centromere studies in the era of 'telomere-to-telomere' genomics - NIH
    This review reflects the progress in centromere genomics, credited by recent advancements in long-read sequencing and assembly methods.
  49. [49]
    Current Strategies of Polyploid Plant Genome Sequence Assembly
    Here, we review the challenges of the assembly of polyploid plant genomes, and also present recent advances in genomic resources and functional tools.
  50. [50]
    AGP Specification v2.1 - NCBI - NIH
    Mar 14, 2024 · Describes the assembly of a larger sequence object from smaller objects. The large object can be a contig, a scaffold (supercontig), or a chromosome.Missing: outputs | Show results with:outputs
  51. [51]
    Technical considerations in Hi‐C scaffolding and evaluation of ...
    Integrating Hi‐C links with assembly graphs for chromosome‐scale assembly. PLoS Computational Biology, 15(8), e1007273. 10.1371/journal.pcbi.1007273. [DOI] ...
  52. [52]
    RNA-Seq: a revolutionary tool for transcriptomics - PMC - NIH
    RNA-Seq is a recently developed approach to transcriptome profiling that uses deep-sequencing technologies. Studies using this method have already altered ...Missing: seminal | Show results with:seminal
  53. [53]
    Transcriptomics technologies - PMC - NIH
    RNA-Seq leverages deep sampling of the transcriptome with many short fragments from a transcriptome to allow computational reconstruction of the original RNA ...
  54. [54]
    A simple guide to de novo transcriptome assembly and annotation
    Jan 24, 2022 · We present a comprehensive overview of de novo transcriptome assembly and annotation. We discuss the procedures involved, including pre- and post-processing ...Introduction · Pre-assembly quality control... · De novo transcriptome assembly<|control11|><|separator|>
  55. [55]
    Full-length transcriptome assembly from RNA-Seq data without a ...
    May 15, 2011 · Grabherr et al. describe Trinity, an algorithm for assembling full-length transcripts from short reads without first mapping the reads to a ...Missing: paper | Show results with:paper
  56. [56]
    StringTie enables improved reconstruction of a transcriptome ... - NIH
    Feb 18, 2015 · StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript ...
  57. [57]
    Challenges and advances for transcriptome assembly in non-model ...
    After two decades of RNA microarrays [8], RNA-seq has democratized the analysis of transcriptomes for any non-model organism.
  58. [58]
    Systematic assessment of long-read RNA-seq methods for transcript ...
    Jun 7, 2024 · The study revealed that libraries with longer, more accurate sequences produce more accurate transcripts than those with increased read depth.Missing: seminal | Show results with:seminal
  59. [59]
    Advances in long-read single-cell transcriptomics | Human Genetics
    May 24, 2024 · Long-read single-cell transcriptomics (scRNA-Seq) is revolutionizing the way we profile heterogeneity in disease. Traditional short-read ...
  60. [60]
    Long fragments achieve lower base quality in Illumina paired-end ...
    Feb 27, 2019 · Illumina's technology provides high quality reads of DNA fragments with error rates below 1/1000 per base. Sequencing runs typically ...
  61. [61]
    Characteristics of 454 pyrosequencing data—enabling realistic ...
    2.5 Read lengths​​ The length of un-trimmed reads in 454 pyrosequencing is limited by either the number of flows (168 in GS20, 400 in GS FLX and 800 in GS FLX ...
  62. [62]
    (PDF) Accuracy and Quality Assessment of 454 GS-FLX Titanium ...
    We obtained a mean error rate for 454 sequences of 1.07%. More importantly, the error rate is not randomly distributed; it occasionally rose to more than 50% in ...
  63. [63]
    De novo assembly of short sequence reads - Oxford Academic
    Aug 19, 2010 · Sequence reads as short as 20–30 nucleotides could be used to generate useful assemblies of both prokaryotic and eukaryotic genome sequences.THE CHALLENGE OF... · THE FEASIBILITY OF... · NOT ALL ASSEMBLERS ARE...
  64. [64]
    Impact of short-read sequencing on the misassembly of a plant ...
    Feb 2, 2021 · Our study provides insights on the causes of variable coverage regions and a quantitative assessment of factors contributing to plant genome misassembly when ...Missing: variability | Show results with:variability
  65. [65]
    The Most Frequently Used Sequencing Technologies and Assembly ...
    Oct 18, 2020 · Currently, 82% of the bacterial genomes in RefSeq were produced by the short read Illumina sequencing technology (Figure 1C). Among the ...
  66. [66]
    Mate Pair Sequencing - Illumina
    Mate pair sequencing involves generating long-insert paired-end DNA libraries useful for a number of sequencing applications.
  67. [67]
    Paired-end or mate-pair - GATK - Broad Institute
    Jun 25, 2024 · In paired-end sequencing, the library preparation yields a set of fragments, and the machine sequences each fragment from both ends.
  68. [68]
    The Cost of Sequencing a Human Genome
    Nov 1, 2021 · The cost to generate a whole-exome sequence was generally below $1,000. Commercial prices for whole-genome and whole-exome sequences have often ...
  69. [69]
    HiFi Reads - Highly accurate long-read sequencing - PacBio
    HiFi reads are highly accurate long reads with 99.9% accuracy, up to 25 kb in length, and are free of systematic errors.
  70. [70]
    Long and Accurate: How HiFi Sequencing is Transforming Genomics
    By contrast, HiFi reads, generated from 10–25-kb inserts using CCS mode, have an error rate of ≤ 1% by leveraging multiple passes around the template. CLR, ...
  71. [71]
    Nanopore sequencing accuracy
    Oxford Nanopore sequencing hardware and chemistry have seen major upgrades in the shift to R10.4.1 and are now able to read DNA fragments at >99% single-read ...
  72. [72]
    Nanopore sequencing technology, bioinformatics and applications
    Nov 8, 2021 · Other tools use long read length while accounting for high error rate. Many of these, such as tools for error correction, assembly and ...
  73. [73]
    Structural variant calling: the long and the short of it | Genome Biology
    Nov 20, 2019 · Long reads help to increase the detection of SVs as they considerably ease de novo genome assembly and mapping. Nevertheless, the increased ...Short-Read Alignment... · Short-Read Dna-Seq Mapping · Rna-Seq Mapping
  74. [74]
    Long-read human genome sequencing and its applications - PMC
    A base is incorrectly called in about 1 out of every 10 bases, resulting in an error rate of 8–15% in the CLR. HiFi reads are generated by circular consensus ...
  75. [75]
    Two complete telomere-to-telomere genome assemblies of ...
    Sep 1, 2025 · In this study, we report the complete telomere-to-telomere (T2T) genome assemblies of A17 and R108 (Figures 1A and 1B), constructed by ...
  76. [76]
    [PDF] White Paper: Structural Variation in the Human Genome - PacBio
    Low, unbiased coverage of the genome with long-read sequencing reveals most of the structural variants uncovered with de novo assembly, but at a price point ...
  77. [77]
    nanoporetech/dorado: Oxford Nanopore's Basecaller - GitHub
    Dorado is a high-performance, easy-to-use, open source analysis engine for Oxford Nanopore reads. Detailed information about Dorado and its features is ...
  78. [78]
    London Calling 2023: Dorado — the future of basecalling
    May 18, 2023 · Mark will be giving an update on the current status of the Dorado standalone basecaller, talking about the motivations and going through ...
  79. [79]
    Towards complete and error-free genome assemblies of all ... - Nature
    Apr 28, 2021 · Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages.
  80. [80]
    Towards complete and error-free genome assemblies of all ... - NIH
    Apr 28, 2021 · We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype ...
  81. [81]
    overlap–layout–consensus and de-bruijn-graph - Oxford Academic
    Dec 19, 2011 · OLC generally works in three steps: first overlaps (O) among all the reads are found, then it carries out a layout (L) of all the reads and ...
  82. [82]
    Review of General Algorithmic Features for Genome Assemblers for ...
    The overlap-layout-consensus, as the name suggests, consists of three steps (see Supplementary section, [27]). In the first step an overlap graph is created by ...
  83. [83]
    5.2: Genome Assembly I- Overlap-Layout-Consensus Approach
    Mar 17, 2021 · This section will examine one of the most successful early methods for computationally assembling a genome from a set of DNA reads, called shotgun sequencing.
  84. [84]
    [PDF] Genome Assembly Intro & OLC - GitHub Pages
    However, overlapping can still be one of the slowest steps in an assembly.
  85. [85]
    Consensus generation and variant detection by Celera Assembler
    Celera Assembler uses dynamic windowing to identify alleles, produce a set of haploid consensus sequences, and splits read segments in variant regions.
  86. [86]
    On the sequencing and assembly of the human genome - PNAS
    Celera's assembly was missing the interiors of highly similar repetitive elements and the extremely dense repeat regions near the centromeres, whereas the HGSC ...
  87. [87]
    Linear time complexity de novo long read genome assembly with ...
    May 22, 2023 · Most long-read genome assemblers follow the Overlap-Layout-Consensus paradigm (OLC), a quadratic run time algorithm in its naïve implementation.
  88. [88]
    An Eulerian path approach to DNA fragment assembly | PNAS
  89. [89]
    Why are de Bruijn graphs useful for genome assembly? - PMC - NIH
    (d) Modern short-read assembly algorithms construct a de Bruijn graph by representing all k-mer prefixes and suffixes as nodes, then drawing edges that ...
  90. [90]
    [PDF] De Bruijn Graph assembly
    As usual, we start with a collection of reads, which are substrings of the reference genome. AAB is a k-mer (k = 3). AA is its left k-1-mer, and AB is its right k- ...
  91. [91]
    Velvet: Algorithms for de novo short read assembly using de Bruijn ...
    We have developed a new set of algorithms, collectively called “Velvet,” to manipulate de Bruijn graphs for genomic sequence assembly.
  92. [92]
    ABySS: A parallel assembler for short read sequence data - PMC - NIH
    The primary innovation in ABySS is a distributed representation of a de Bruijn graph, which allows parallel computation of the assembly algorithm across a ...
  93. [93]
    Assembly of Long Error-Prone Reads Using Repeat Graphs - bioRxiv
    Jan 12, 2018 · We present the Flye algorithm for constructing the A-Bruijn (assembly) graph from long error-prone reads, that, in contrast to the k-mer-based ...
  94. [94]
    Telomere-to-telomere assembly of diploid chromosomes with Verkko
    Verkko is an iterative, graph-based pipeline for assembling complete, diploid genomes, using long and ultra-long reads to achieve telomere-to-telomere assembly.
  95. [95]
    YaHS: yet another Hi-C scaffolding tool - Oxford Academic
    YaHS is a tool that constructs chromosome-scale scaffolds using Hi-C data, and is fast, reliable, and accurate.
  96. [96]
    TRFill: synergistic use of HiFi and Hi-C sequencing enables ...
    Jul 28, 2025 · Our TRFill algorithm accurately fills assembly gaps using only PacBio HiFi and Hi-C data, without relying on the costly ONT UL reads. Our ...
  97. [97]
    Repeat and haplotype aware error correction in nanopore ...
    Dec 19, 2024 · DeChat is a novel error correction tool for Nanopore R10 reads (<2% error), combining de Bruijn graphs and variant-aware alignment to preserve repeats and ...
  98. [98]
    Telomere-to-telomere phased genome assembly using error ...
    May 21, 2024 · We have developed the HERRO model based on deep learning, which corrects Simplex nanopore reads longer than 10kbp and with a quality value higher than 10.
  99. [99]
    QUAST: quality assessment tool for genome assemblies
    This metric can be computed without a reference genome. No. of mismatches per 100 kb: The average number of mismatches per 100 000 aligned bases. QUAST also ...
  100. [100]
    BUSCO: assessing genome assembly and annotation completeness ...
    Jun 9, 2015 · BUSCO quality assessments provide high-resolution quantifications citeable in the simple C[D],F,M,n notation for genomes, gene sets and ...
  101. [101]
    Assessing genome assembly quality using the LTR Assembly Index ...
    Aug 10, 2018 · The BUSCO and CEGMA completeness are poor predictors of LAI (r2 ≤ 0.06, P ≥ 0.12) (Figure 2E and F), indicating that LAI is characterizing ...
  102. [102]
    Genome assembly in the telomere-to-telomere era - PMC
    Here we review recent progress on assembly algorithms and protocols. We focus on how to derive near telomere-to-telomere assemblies and discuss potential ...
  103. [103]
    Ten steps to get started in Genome Assembly and Annotation - NIH
    Different sequencing technologies and generally applicable workflows for genome assembly are also detailed. We cover structural and functional annotation and ...
  104. [104]
    Efficient hybrid de novo assembly of human genomes with WENGAN
    Dec 14, 2020 · WENGAN starts by building short-read contigs using a de Bruijn graph assembler (1 in Fig. 1). Then, the pair-end reads are pseudo-aligned back ...
  105. [105]
    NextDenovo: an efficient error correction and accurate assembly tool ...
    Apr 26, 2024 · We present NextDenovo, an efficient error correction and assembly tool for noisy long reads, which achieves a high level of accuracy in genome assembly.
  106. [106]
    RGAAT: A Reference-based Genome Assembly and Annotation ...
    Dec 21, 2018 · RGAAT can be used to generate variants between two assemblies by sequence comparison (Figure 3). We used BLAT for genome comparison because of ...
  107. [107]
    Integrating Hi-C links with assembly graphs for chromosome-scale ...
    We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the ...
  108. [108]
    Xenopus tropicalis Genome Re-Scaffolding and Re-Annotation ...
    We applied ChIA-PET to analyze gene regulatory networks, including 3D chromosome interactions, underlying thyroid hormone (TH) signaling in the frog Xenopus ...
  109. [109]
    [PDF] A Review of VGP's Current Techniques and Best Practices for the ...
    K-mer size is a key parameter that must be large enough to map uniquely to the genome, but not too large, as it can lead to wasting computational resources. For ...
  110. [110]
    nf-core/genomeassembler: Assembly and scaffolding of ... - GitHub
    nf-core/genomeassembler is a bioinformatics pipeline that carries out genome assembly, polishing and scaffolding from long reads (ONT or pacbio).
  111. [111]
    [PDF] Genome assembly in the telomere-to-telomere era
    Apr 22, 2024 · The T2T-CHM13 human genome was assembled with these two data types36. Currently, the only assemblers that can integrate ...
  112. [112]
    Benchmarking of bioinformatics tools for the hybrid de novo ...
    Accurate and complete de novo genome assemblies enable variant identification and the discovery of novel genomic features and biological functions.
  113. [113]
    rpetit3/dragonflye: Assemble bacterial isolate ... - GitHub
    Dragonflye is a pipeline that aims to make assembling Oxford Nanopore reads quick and easy. Still working on the quick part, but I think the easy part is there.