Gene prediction
Gene prediction is the computational identification of protein-coding regions and functional elements, such as exons and introns, within genomic DNA sequences.[1] This process, also known as gene finding, aims to distinguish genes from non-coding intergenic regions and accurately delineate their boundaries and internal structures.[2] Essential in bioinformatics, gene prediction enables the annotation of newly sequenced genomes, facilitating downstream analyses like functional genomics and evolutionary studies.[1] The importance of gene prediction has grown since the Human Genome Project of the 1990s, as high-throughput sequencing technologies now generate vast amounts of genomic data that require automated tools to locate and interpret the genes they encode.[2] Ab initio methods, which rely on statistical models of gene features like codon usage and splice sites without external evidence, represent a core approach; notable tools include GENSCAN and AUGUSTUS, which use hidden Markov models (HMMs) to achieve sensitivities around 78% and specificities of 81% in benchmark tests.[1] Evidence-based or homology-based methods enhance accuracy by aligning query sequences to known proteins or expressed sequence tags (ESTs) using tools like BLAST or GeneWise, though they are limited by database coverage, capturing only about 50% of novel genes.[1] Comparative methods leverage alignments across species to identify conserved gene structures, as in tools like N-SCAN, improving predictions in divergent eukaryotes.[2] Despite advances, challenges persist, particularly in eukaryotes with complex splicing and alternative isoforms, where short exons or proteins under 100 amino acids often evade detection, leading to accuracies as low as 52% F1-score in diverse organisms.[3] Benchmarks across 147 eukaryotic species show AUGUSTUS outperforming others like GeneID and SNAP, with perfect gene structure predictions in only 23.5% of cases, highlighting the need for species-specific training and integration of machine learning to handle genome drafts and phylogenetic variability.[3] Ongoing developments focus on hybrid approaches combining ab initio prediction with RNA-seq evidence to boost reliability in large-scale annotations.[1]
Overview
Definition and Scope
Gene prediction is the computational process of identifying the locations and structures of genes within genomic DNA sequences, encompassing both protein-coding genes and non-coding genes that produce functional RNAs such as ribosomal, transfer, and long non-coding RNAs.[4] This involves distinguishing genic regions from non-genic intergenic spaces and delineating key structural elements to enable accurate genome interpretation.[5] The scope of gene prediction differs significantly between prokaryotes and eukaryotes due to variations in genome organization. Prokaryotic genes are generally simpler, lacking introns and often clustered in operons for coordinated expression, which allows for higher gene density and more straightforward boundary detection.[1] Eukaryotic genes, however, exhibit greater complexity with multiple exons separated by introns that require splicing, along with associated regulatory features like promoters and polyadenylation sites, expanding the prediction to include these elements.[6] Gene prediction plays a pivotal role in genome annotation by providing the foundational framework for identifying functional elements, which is essential for advancing functional genomics, evolutionary studies, drug discovery, and proteomic analyses.[7] Without reliable gene models, downstream research into gene regulation, protein functions, and disease mechanisms would be severely limited.[4] In its basic workflow, gene prediction takes a raw genomic sequence as input and applies computational algorithms to generate output models specifying gene start and end positions, exon-intron boundaries, and splice sites, often integrating evidence from sequence statistics or comparative data.[8] This process is particularly challenged by phenomena like alternative splicing in eukaryotes, which can produce multiple isoforms from a single gene.[6]
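Gene models like these are commonly exchanged in tab-delimited annotation formats such as GFF3. The following sketch shows one minimal way a predicted gene model might be represented and serialized in Python; the `GeneModel` dataclass, its fields, and the example coordinates are illustrative assumptions rather than the output format of any particular tool.

```python
from dataclasses import dataclass, field

@dataclass
class GeneModel:
    """Minimal predicted gene model; coordinates are 1-based and inclusive."""
    seqid: str                                   # chromosome or contig identifier
    start: int                                   # gene start position
    end: int                                     # gene end position
    strand: str                                  # "+" or "-"
    exons: list = field(default_factory=list)    # list of (start, end) tuples

    def to_gff3(self, gene_id: str) -> str:
        """Serialize the gene and its exons as GFF3-style lines (illustrative)."""
        rows = [f"{self.seqid}\tpredictor\tgene\t{self.start}\t{self.end}\t.\t{self.strand}\t.\tID={gene_id}"]
        for i, (s, e) in enumerate(self.exons, start=1):
            rows.append(
                f"{self.seqid}\tpredictor\texon\t{s}\t{e}\t.\t{self.strand}\t.\t"
                f"ID={gene_id}.exon{i};Parent={gene_id}"
            )
        return "\n".join(rows)

# Hypothetical two-exon gene on the forward strand.
model = GeneModel("chr1", 1000, 2200, "+", exons=[(1000, 1150), (1900, 2200)])
print(model.to_gff3("gene0001"))
```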
Biological Context and Challenges
Genes in prokaryotes are characterized by simple, continuous structures consisting of open reading frames (ORFs) that extend uninterrupted from a start codon to a stop codon, lacking introns and often organized into operons, clusters of functionally related genes transcribed together under a single promoter to enable coordinated regulation.[9] This polycistronic arrangement facilitates efficient expression of metabolic pathways or stress responses in compact genomes. In contrast, eukaryotic genes exhibit greater complexity, featuring multiple exons separated by non-coding introns that must be precisely removed during splicing, along with 5' and 3' untranslated regions (UTRs) that regulate mRNA stability and translation. Additionally, eukaryotic gene regulation involves distal elements like enhancers, which are cis-regulatory sequences that can loop to interact with promoters over long distances to boost transcription in a tissue-specific manner.[10] Prediction algorithms exploit several intrinsic biological signals embedded in genomic sequences. In both prokaryotes and eukaryotes, ORFs represent potential coding regions, defined as stretches free of stop codons within a reading frame, and serve as a primary indicator of protein-coding potential.[1] Codon usage bias, the non-random preference for certain synonymous codons, further distinguishes coding from non-coding regions, as highly expressed genes tend to favor codons matched to abundant tRNAs.[11] Variations in GC content also provide compositional signals, with coding regions often showing distinct GC profiles compared to intergenic spaces, influencing DNA stability and mutation rates.[12] For eukaryotic genes, splice site consensus sequences are pivotal, adhering to the GT-AG rule where introns typically begin with GT at the 5' donor site and end with AG at the 3' acceptor site, flanked by polypyrimidine tracts and branch points that guide the spliceosome.[13] Despite these signals, accurate gene prediction faces significant biological challenges. Alternative splicing allows a single gene to produce multiple mRNA isoforms by varying exon inclusion, potentially generating thousands of protein variants from one locus and complicating boundary delineation.[14] Pseudogenes, inactivated duplicates of functional genes that retain sequence similarity including ORFs and splice signals, frequently mislead predictors into false positives.[15] Non-coding RNAs (ncRNAs), such as long non-coding RNAs that regulate gene expression without protein-coding capacity, overlap with or mimic coding features, while repetitive genomic regions harbor duplicated gene fragments that defy unique alignment.[16] Low-expression genes, producing minimal transcripts, evade detection in expression-based validation, and assembly errors from next-generation sequencing (NGS) data, such as chimeric contigs in intron-rich areas, propagate inaccuracies into prediction pipelines.[17]
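To make these intrinsic signals concrete, the sketch below scans the three forward reading frames for open reading frames, checks a candidate intron against the GT-AG rule, and computes GC content. It is a toy illustration of the signals described above, under simplifying assumptions (forward strand only, standard stop codons, an arbitrary minimum ORF length), not a production gene finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=30):
    """Return (frame, start, end) for ATG-to-stop ORFs in the three forward frames."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOP_CODONS and start is not None:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs

def is_gt_ag_intron(seq, intron_start, intron_end):
    """Check the canonical GT...AG dinucleotides of a candidate intron (0-based, end-exclusive)."""
    seq = seq.upper()
    return seq[intron_start:intron_start + 2] == "GT" and seq[intron_end - 2:intron_end] == "AG"

def gc_content(seq):
    """Fraction of G and C bases, a simple compositional signal."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)
```

Real predictors combine many such signals probabilistically rather than applying them as hard rules.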
Advancements in sequencing technologies also profoundly influence gene prediction fidelity. Short-read NGS platforms, dominant since the 2000s, generate high-throughput but fragmented data (typically 100-300 bp reads), struggling to span repetitive elements, long introns, or alternative splice junctions, resulting in incomplete or erroneous assemblies that hinder comprehensive gene annotation.[18] Long-read technologies, including PacBio's single-molecule real-time sequencing and Oxford Nanopore's nanopore-based approach, produce reads exceeding 10 kb, enabling resolution of complex structures like multi-exon genes and operon boundaries with higher contiguity and fewer misassemblies, though they require error correction to achieve base-level accuracy comparable to short reads.[18]
Historical Development
Early Methods (Pre-2000)
The origins of computational gene prediction in the 1980s were rooted in prokaryotic genomes, where simple rule-based systems identified open reading frames (ORFs) by scanning for start and stop codons, often incorporating codon usage bias to distinguish coding from non-coding regions. These early methods, such as those developed by Fickett, relied on statistical tests for coding potential based on dinucleotide frequencies and were applied to bacterial sequences like those from Escherichia coli. Similarly, Staden's 1984 approach measured the effects of protein coding on DNA sequence composition, using metrics like asymmetric base preferences to locate genes in prokaryotic DNA, marking a foundational shift from manual annotation to automated ORF detection. These rule-based predictors excelled in compact, intronless prokaryotic genomes but struggled with overlapping genes or atypical codon usage, highlighting the need for more probabilistic models.[19] In the 1990s, advances in eukaryotic gene prediction addressed the complexities of introns and alternative splicing through ab initio tools that modeled gene signals statistically. Weight matrices emerged as a key concept for scoring splice sites and promoters, representing positional nucleotide frequencies to identify donor and acceptor signals with greater accuracy than rigid rules. Hidden Markov models (HMMs) were introduced to capture the sequential dependencies in gene structures, enabling probabilistic predictions of exons and introns. Early empirical methods, such as those using BLAST for homology to known proteins, complemented these by aligning query sequences to databases, though they required prior annotations.[19] Seminal tools like GRAIL (1991) integrated neural networks to combine multiple coding signals, such as hexamer frequencies and frame-specific composition, achieving exon sensitivity of around 75% on human DNA test sets. GENSCAN (1997), a landmark HMM-based predictor, modeled complete gene structures using weight array matrices for splice sites and fifth-order Markov chains for coding regions, outperforming predecessors with 79% exon sensitivity on vertebrate genes.[20] These probabilistic approaches surpassed rule-based systems by accounting for variability in gene signals, though they often overpredicted short exons and underperformed on GC-rich regions. For the nematode Caenorhabditis elegans, Genefinder (developed by Green and Hillier around 1995) served as the first dedicated gene finder, using linear discriminant analysis on intron-exon boundaries to predict over 19,000 genes in the 1998 genome assembly, facilitating the initial annotation of this multicellular model organism.[21]
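The weight-matrix idea from this period can be written down directly: the sketch below builds a log-odds position weight matrix from a few aligned donor-site sequences and scores a candidate site against a uniform background. The training 9-mers, pseudocount, and background model are invented for the example.

```python
import math
from collections import Counter

def build_pwm(sites, pseudocount=1.0):
    """Log-odds position weight matrix from equal-length aligned signal sequences."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        column = {}
        for base in "ACGT":
            freq = (counts.get(base, 0) + pseudocount) / (len(sites) + 4 * pseudocount)
            column[base] = math.log2(freq / 0.25)   # uniform background frequency
        pwm.append(column)
    return pwm

def score_site(pwm, candidate):
    """Sum of per-position log-odds scores for a candidate signal sequence."""
    return sum(column[base] for column, base in zip(pwm, candidate))

# Hypothetical 9-mer donor sites (last three exon bases followed by the GT intron start).
donor_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGG", "TAGGTAAGA"]
pwm = build_pwm(donor_sites)
print(round(score_site(pwm, "CAGGTAAGT"), 2))
```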
Post-Genome Era Advances (2000s Onward)
The completion of the Human Genome Project in 2003 marked a pivotal moment in gene prediction, catalyzing the development of integrated annotation pipelines that combined ab initio prediction with empirical evidence from expressed sequence tags (ESTs) and protein alignments.[22] The Ensembl pipeline, introduced in the early 2000s, exemplified this shift by automating the annotation of the human genome through a multi-step process that aligned cDNA and protein sequences to genomic DNA, refined ab initio predictions from tools like GENSCAN, and incorporated comparative data for validation, achieving exon-level sensitivities of around 85% in initial assessments.[23][24] This era also saw the rise of comparative methods, enabled by multi-species sequencing efforts, which leveraged evolutionary conservation to identify coding regions; for instance, multi-species alignments improved the detection of functional elements by exploiting sequence similarity across vertebrates, outperforming single-genome approaches in identifying non-coding conserved sequences.[25] In the 2010s, the advent of high-throughput RNA sequencing (RNA-Seq) revolutionized evidence-based gene prediction by providing direct transcriptomic data to guide and validate models. Tools like StringTie, released in 2015, assembled RNA-Seq reads into transcripts de novo, enabling more accurate reconstruction of gene structures and expression estimates compared to prior assemblers, with improvements in transcript completeness by up to 20-30% on benchmark datasets.[26] Hybrid annotation pipelines, such as MAKER introduced in 2008, further advanced de novo genome annotation by iteratively integrating ab initio predictors, homology searches, and RNA-Seq evidence, particularly for emerging model organisms, resulting in higher-quality gene models through self-training on aligned evidence.[27] Key initiatives like the nGASP assessment, which evaluated gene finders on the nematode Caenorhabditis elegans in 2008, highlighted the benefits of evidence integration beyond mammalian genomes, while the GENCODE consortium's automated pipelines standardized annotation across species by merging manual curation with computational predictions, boosting overall reliability.[28][29] Recent trends in the 2020s have focused on adapting gene prediction to long-read sequencing technologies, such as Pacific Biosciences and Oxford Nanopore, which resolve complex genomic regions like repeats and structural variants that confound short-read methods. These advances have enhanced the annotation of isoforms and alternative splicing events, with pipelines now incorporating long-read alignments to refine gene boundaries and improve accuracy in fragmented assemblies.[30] Ongoing benchmarks, starting with the EGASP project in 2006, demonstrate substantial progress; initial exon-level accuracies hovered around 60-70%, but integrations of RNA-Seq and comparative data in subsequent evaluations, including GENCODE releases, have pushed sensitivities and specificities beyond 90% for well-supported exons, underscoring the maturation of hybrid approaches.[31][32]
Core Prediction Methods
Empirical (Similarity-Based) Methods
Empirical (similarity-based) methods for gene prediction rely on identifying sequence homology between a query genomic sequence and known genes or proteins in databases, under the assumption that functional exons are more evolutionarily conserved than non-coding regions. These approaches use alignment algorithms to detect similar regions, such as local alignments via BLAST or exhaustive dynamic programming via Smith-Waterman, to infer gene structures by matching to annotated cDNAs, expressed sequence tags (ESTs), or protein sequences. For instance, TBLASTN compares known protein queries against the genomic DNA translated in all six reading frames, enabling the detection of frame-shifted or partially conserved genes. This homology evidence allows prediction of exon boundaries by aligning conserved protein domains, often using tools like HMMER that model profiles of protein families to capture distant similarities. The typical workflow involves an initial database search to identify potential homologs, followed by refined spliced alignments to account for introns and exon-intron boundaries. Tools such as Exonerate perform generalized spliced alignments between genomic DNA and protein or cDNA queries, automating heuristic selection for efficient computation while maintaining accuracy in eukaryotic gene structure prediction. Frame adjustments are made during alignment to resolve reading frame shifts due to insertions or deletions, and splice sites may be verified against consensus sequences for additional support. In cases of conserved domains, HMMER applies hidden Markov models to scan for matches, enhancing sensitivity for divergent sequences. These steps are particularly effective in closely related species, where sequence identity exceeds 70-80%, as demonstrated in applications like annotating bacterial operons or eukaryotic genomes with rich reference databases. These methods offer high accuracy for genes with detectable homologs, often achieving specificity above 90% in benchmark tests on conserved eukaryotic genes, due to direct evidence from experimental data. However, they are limited by database completeness and biases toward well-studied organisms, potentially missing novel or rapidly evolving genes that lack significant similarity. Additionally, performance declines with evolutionary distance, as low-identity alignments (<30%) may produce false positives or incomplete structures.
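The six-frame translation underlying TBLASTN-style searches can be sketched in a few lines. The example below assumes Biopython is installed and uses its Seq class for translation and reverse complementation; the demo sequence is arbitrary.

```python
from Bio.Seq import Seq  # assumes Biopython is available

def six_frame_translations(dna: str):
    """Translate a DNA sequence in all six reading frames, as translated searches do internally."""
    seq = Seq(dna.upper())
    frames = {}
    for strand_label, s in (("+", seq), ("-", seq.reverse_complement())):
        for offset in range(3):
            sub = s[offset:]
            sub = sub[: len(sub) - len(sub) % 3]   # trim any partial trailing codon
            frames[f"{strand_label}{offset + 1}"] = str(sub.translate())
    return frames

peptides = six_frame_translations("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
for frame, peptide in sorted(peptides.items()):
    print(frame, peptide)
```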
Ab Initio Methods
Ab initio methods for gene prediction rely on statistical models that capture intrinsic features of genomic sequences to identify gene structures de novo, without relying on external evidence such as homology to known genes. These approaches model the probabilistic architecture of genes, treating the DNA sequence as a chain of states representing genomic regions like intergenic spacers, promoters, exons, and introns. Hidden Markov Models (HMMs) are commonly used to represent state transitions (for instance, from intergenic regions to promoters, then to exons and introns), while emission probabilities encode sequence characteristics specific to each state. This framework allows the computation of the most likely gene parse using algorithms like the Viterbi algorithm for optimal path finding.[33] Key signals exploited by these methods include codon usage bias, hexamer frequencies in coding regions, start and stop codons, and splice site motifs such as donor (GT) and acceptor (AG) sequences. To account for dependencies between positions in splice sites, techniques like maximal dependence decomposition (MDD) cluster related sequences and model pairwise positional correlations, improving accuracy over independent position assumptions. Additional signals, such as CpG islands near promoters and poly-A signals downstream of coding regions, help delineate gene boundaries. These features are parameterized through training on annotated genomes, enabling the model to score potential gene structures based on sequence likelihoods.[34] Traditional implementations include GENSCAN, which employs dynamic programming to find the optimal gene parse by integrating exon content scores, splice signals, and intron lengths, achieving around 75-80% accuracy on human genes in early benchmarks. AUGUSTUS extends this with a semi-Markov HMM (generalized HMM) to handle variable-length exons and alternative splicing, supporting eukaryotic predictions with configurable parameters for different species. Early neural network approaches, such as GRAIL and its successor GRAIL II, use feed-forward networks to score open reading frames and exons by integrating multiple sensors for coding potential, frame detection, and signal motifs, marking a 1990s advancement in pattern recognition for gene finding.[35] Despite their strengths, ab initio methods face limitations from parameter training on known genes, which introduces species-specific biases and reduces performance on divergent genomes lacking sufficient annotated training data. For example, models tuned for vertebrates often underperform on invertebrates due to differences in intron-exon structures. Predictions can subsequently be cross-checked through similarity searches against protein databases, but core accuracy depends on the model's ability to generalize from intrinsic signals.[36]
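A drastically simplified version of this state-machine view is shown below: Viterbi decoding over a three-state model (intergenic, exon, intron) whose transition and emission probabilities are invented for illustration. Real gene finders such as GENSCAN and AUGUSTUS use higher-order, duration-aware models trained on annotated genomes; this sketch only demonstrates the decoding principle.

```python
import math

STATES = ["intergenic", "exon", "intron"]
# Illustrative probabilities only; real models are learned from annotated genomes.
TRANSITIONS = {
    "intergenic": {"intergenic": 0.90, "exon": 0.10, "intron": 0.00},
    "exon":       {"intergenic": 0.05, "exon": 0.85, "intron": 0.10},
    "intron":     {"intergenic": 0.00, "exon": 0.10, "intron": 0.90},
}
EMISSIONS = {
    "intergenic": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon":       {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},   # mild GC bias for coding DNA
    "intron":     {"A": 0.27, "C": 0.23, "G": 0.23, "T": 0.27},
}

def _log(p):
    return math.log(p) if p > 0 else float("-inf")

def viterbi(seq):
    """Most likely hidden state path for a DNA sequence under the toy model."""
    scores = [{s: _log(1 / 3) + _log(EMISSIONS[s][seq[0]]) for s in STATES}]
    backptr = [{}]
    for base in seq[1:]:
        col, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[-1][p] + _log(TRANSITIONS[p][s]))
            col[s] = scores[-1][prev] + _log(TRANSITIONS[prev][s]) + _log(EMISSIONS[s][base])
            ptr[s] = prev
        scores.append(col)
        backptr.append(ptr)
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(backptr[1:]):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

print(viterbi("ATGCGCGGCCGCATTTAATATATAT"))
```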
Hybrid Methods
Hybrid methods in gene prediction combine ab initio and empirical approaches to leverage the strengths of both, integrating intrinsic sequence features with external evidence such as homology matches or transcript alignments to produce more accurate gene models.[37] These methods typically weight predictions from ab initio models, which rely on statistical patterns like those captured in hidden Markov models (HMMs), against similarity-based evidence to resolve ambiguities in gene structure.[37] By fusing these layers through consensus mechanisms or machine learning-based voting, hybrid approaches achieve higher precision and recall than standalone methods, particularly in eukaryotic genomes where gene architectures are complex. Modern implementations increasingly incorporate RNA-seq data for transcript evidence to further refine predictions.[37] Key techniques in hybrid prediction include dynamic programming algorithms that align and merge partial gene models from multiple sources, allowing for the reconciliation of exon predictions across ab initio and homology data.[38] For instance, tools extend ab initio predictors like GeneID by incorporating BLAST-derived protein homology hits to refine exon boundaries and confirm coding regions.[37] Additionally, the Program to Assemble Spliced Alignments (PASA) uses spliced alignments of expressed transcripts to automatically model gene structures, iteratively updating predictions by clustering and assembling alignment evidence into full transcripts.[38] The primary benefits of hybrid methods lie in their ability to balance de novo gene discovery from ab initio signals with validation from empirical evidence, thereby reducing false positives in genomes lacking close relatives and improving overall annotation coverage. This integration mitigates the limitations of pure ab initio methods, which can overpredict in non-coding regions, while enhancing empirical methods' sensitivity to novel genes without strong homologs.[37] Prominent examples include the MAKER pipeline, which iteratively combines homology searches (e.g., via BLASTX and Exonerate), ab initio predictions (e.g., from AUGUSTUS or SNAP), and RNA-Seq alignments to generate consensus annotations, achieving exon-level accuracies of around 60-70% in benchmark eukaryotic genomes, with higher performance on confirmed genes.[27] Similarly, EVidenceModeler (EVM) employs a weighted consensus strategy to integrate diverse evidence types, assigning scores to ab initio predictions, protein alignments, and transcript mappings before resolving overlaps via dynamic programming, as demonstrated in its application to produce high-quality annotations for species like Arabidopsis thaliana.[38]
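The weighted-consensus idea can be illustrated with a toy scoring function that sums the weights of evidence exons supporting a candidate interval. The evidence tracks, weights, and overlap threshold below are invented, and real tools such as EVidenceModeler resolve complete gene structures with dynamic programming rather than scoring isolated exons.

```python
# Candidate exon intervals from three evidence tracks, with per-source weights
# (all values are illustrative placeholders).
EVIDENCE = {
    "ab_initio":  {"weight": 1.0,  "exons": [(100, 250), (400, 520)]},
    "protein":    {"weight": 5.0,  "exons": [(105, 250)]},
    "transcript": {"weight": 10.0, "exons": [(100, 250), (405, 520)]},
}

def consensus_score(candidate, evidence, min_overlap=0.8):
    """Sum the weights of evidence exons whose overlap with the candidate covers
    at least min_overlap of the shorter interval."""
    c_start, c_end = candidate
    score = 0.0
    for track in evidence.values():
        for s, e in track["exons"]:
            overlap = max(0, min(c_end, e) - max(c_start, s))
            shorter = min(c_end - c_start, e - s)
            if shorter > 0 and overlap / shorter >= min_overlap:
                score += track["weight"]
    return score

for candidate in [(100, 250), (400, 520), (600, 700)]:
    print(candidate, consensus_score(candidate, EVIDENCE))
```

In this toy setup, transcript evidence dominates because it carries the highest weight, mirroring the common practice of trusting aligned transcripts more than ab initio calls.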
Comparative Methods
Pairwise Genome Comparisons
Pairwise genome comparisons represent a comparative method for gene prediction that leverages alignments between two related genomes to identify conserved regions indicative of genes. In this approach, a query genome is aligned to a well-annotated reference genome, with gene structures predicted within blocks of high sequence conservation. Exons typically exhibit greater conservation than introns due to stronger purifying selection on coding sequences, allowing the method to delineate gene boundaries and splice sites more accurately than single-genome analyses. This principle exploits evolutionary conservation as a signal for functional elements, building on basic homology detection but extending it to model gene architecture across species. Key algorithms in pairwise comparisons employ spliced alignments to account for introns and frame shifts. For instance, SLAM uses a generalized pair hidden Markov model (pair HMM) to simultaneously align two genomic sequences and predict gene structures, incorporating parameters for exon-intron boundaries, frame preservation, and substitution rates specific to coding regions. This model detects frame-preserving matches between potential exons while penalizing non-coding alignments, enabling reliable identification of orthologous genes even with moderate divergence. Similarly, TwinScan augments an ab initio generalized HMM with a conservation track derived from alignments to a closely related informant genome, refining predictions by integrating conservation patterns with ab initio signals like codon usage. These tools were particularly effective in early applications, such as predicting human genes using mouse alignments, where TwinScan achieved higher sensitivity for novel genes compared to non-comparative methods. Pairwise methods are especially useful for species with available high-quality reference genomes and close phylogenetic relationships, such as human and mouse, where conservation levels support accurate exon identification. By capturing evolutionary signals, these approaches reduce false positives in gene prediction, as conserved regions are less likely to represent non-functional sequences. However, they require a closely related reference to maintain alignment quality and are sensitive to insertions, deletions, and genomic rearrangements that disrupt collinearity, potentially leading to fragmented predictions in more divergent pairs. Despite these limitations, pairwise comparisons have been instrumental in improving annotation accuracy for eukaryotic genomes during the post-genome era.
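The conservation signal exploited by pair-HMM gene finders can be approximated with a sliding-window identity scan over a pairwise alignment, as in the sketch below; the aligned toy sequences, window size, and threshold are illustrative only.

```python
def windowed_identity(aln_a, aln_b, window=10, step=5):
    """Percent identity in sliding windows over two aligned sequences (gaps count as mismatches)."""
    assert len(aln_a) == len(aln_b)
    scores = []
    for start in range(0, len(aln_a) - window + 1, step):
        a, b = aln_a[start:start + window], aln_b[start:start + window]
        identical = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        scores.append((start, identical / window))
    return scores

def conserved_blocks(scores, threshold=0.8):
    """Window start positions whose identity exceeds the threshold (candidate exonic regions)."""
    return [start for start, identity in scores if identity >= threshold]

# Toy aligned pair: the first half is conserved (exon-like), the second half is not.
a = "ATGGCCAAGCTGACCGAGTT" + "ACGTACGTACGTACGTACGT"
b = "ATGGCCAAGCTCACCGAGTT" + "TTAAGGCCTTAAGGCCTTAA"
print(conserved_blocks(windowed_identity(a, b)))
```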
Multiple Genome Alignments
Multiple genome alignments leverage the comparative analysis of several related genomes to identify conserved sequences that signal potential genes, exons, or regulatory elements, providing a more robust framework than single-genome approaches by capturing evolutionary conservation across species.[39] This method typically employs progressive alignment strategies, where genomes are aligned hierarchically starting from pairwise comparisons and iteratively incorporating additional sequences. Seminal tools like the Threaded Blockset Aligner (TBA) combined with MultiZ facilitate this process by constructing alignments of large genomic regions, such as those spanning megabases in mammalian genomes, enabling the detection of phylo-conserved elements that correspond to functional genomic features like exons.[40] To enhance gene prediction, these alignments integrate phylogenetic models that quantify conservation levels, such as phastCons scores derived from hidden Markov models fitted to the alignment and a species phylogeny. PhastCons identifies evolutionarily conserved regions by estimating the probability of nucleotide substitutions under a two-state model (conserved versus non-conserved), which helps pinpoint exons and genes even in non-coding regions.[41][42] For instance, the N-SCAN algorithm extends ab initio gene prediction by incorporating multiple alignments, modeling phylogenetic dependencies, context-dependent substitution rates, and insertions/deletions to improve accuracy in predicting gene structures de novo. Building on pairwise alignments as foundational blocks, N-SCAN has demonstrated superior performance in vertebrate genomes by utilizing multi-species data to refine predictions.[43] Recent advances as of 2025 incorporate pangenome graphs, which represent multiple genomes as variation-inclusive structures rather than linear references, enabling more accurate comparative alignments that account for structural variants and population diversity. Tools like Progressive Cactus have evolved to support thousand-genome scale alignments, facilitating pangenome-based gene prediction that uncovers novel genes and improves annotation in complex eukaryotic genomes.[44][45] The primary benefits of multiple genome alignments include enhanced detection of distant homologs through accumulated conservation signals across diverse species and the identification of regulatory elements that may lack strong pairwise similarity.[39] These approaches have been pivotal in projects like the UCSC Genome Browser's vertebrate multi-alignments, which analyze dozens of species to annotate conserved non-coding sequences alongside genes, aiding in the discovery of regulatory conservation in human and other mammalian genomes.[46] However, challenges persist, including alignment errors in repetitive or highly divergent regions where sequence homology is obscured, leading to fragmented or inaccurate mappings that propagate to gene predictions. Additionally, the computational intensity of progressive alignments scales poorly with the number of genomes and sequence lengths, demanding significant resources for large-scale applications like thousand-genome alignments.[47][48]
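A crude stand-in for phylogeny-aware scores such as phastCons is a per-column conservation measure over a multiple alignment. The sketch below computes the majority-base fraction per column for a small invented alignment, ignoring the phylogeny and substitution model that real methods fit.

```python
from collections import Counter

def column_conservation(alignment):
    """Fraction of sequences agreeing with the majority base in each column (gaps excluded);
    a naive proxy for phylogeny-aware conservation scores."""
    scores = []
    for col in range(len(alignment[0])):
        bases = [seq[col] for seq in alignment if seq[col] != "-"]
        if not bases:
            scores.append(0.0)
            continue
        _, top_count = Counter(bases).most_common(1)[0]
        scores.append(top_count / len(bases))
    return scores

# Hypothetical alignment of one short region from four species.
alignment = [
    "ATGGCCA-TCGA",
    "ATGGCCAATCGA",
    "ATGGTCAATCGA",
    "ATGGCCAATCAA",
]
print([round(score, 2) for score in column_conservation(alignment)])
```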
Specialized Applications
Pseudogene Prediction
Pseudogenes are genomic sequences that resemble functional genes but have become non-functional due to accumulated mutations, such as premature stop codons, frameshifts, deletions, or truncations, leading to the absence of intact open reading frames (ORFs).[49] These include processed (retrotransposed) pseudogenes, which originate from reverse-transcribed mRNA and typically lack introns, promoters, and poly-A signals at their 3' ends; duplicated (unprocessed) pseudogenes, which arise from gene duplication events and retain original intron-exon structures; and truncated pseudogenes, which are partial copies lacking essential regulatory or coding elements.[50] Such mutations render pseudogenes incapable of producing functional proteins, distinguishing them from active genes.[51] Detection of pseudogenes focuses on identifying genomic regions with high sequence similarity to functional genes but harboring disabling features that disrupt translation.[49] Common strategies involve scanning genomes for homologs of known protein-coding genes, often using iterative similarity searches to capture decayed sequences that have diverged over time.[52] For instance, PSI-BLAST is employed to detect distant or partially conserved homologs that may represent inactivated gene copies.[53] These approaches typically align query sequences against the genome and flag candidates based on the presence of stop codons, frameshifts, or other inactivating mutations within otherwise conserved coding regions.[51] Specialized tools enhance pseudogene prediction by integrating homology-based searches with mutation scoring and structural analysis.[50] PseudoPipe, a widely adopted pipeline, automates the process by performing BLAST-based alignments of functional gene sequences to the genome, followed by classification into processed, duplicated, or truncated types based on intron-exon preservation, poly-A tails, and disruption scores.[51] It achieves high sensitivity, identifying approximately 81% of known pseudogenes in model organisms like Arabidopsis thaliana.[50] Recent pipelines, such as P-GRe (2023), further improve automated pseudogene retrieval in eukaryotic genomes by maximizing detection efficiency with minimal input requirements.[54] Other methods, such as those incorporating whole-genome expression profiling, further refine predictions by mapping mRNA and EST data to candidate loci; pseudogenes are confirmed when alignments reveal frameshifts or nonsense mutations without supporting full-length transcripts.[52] Distinguishing pseudogenes from functional genes often relies on the absence of expression evidence, such as no detectable mRNA or protein products, which helps filter false positives.[52] Accurate pseudogene prediction is essential to prevent over-annotation in genome assemblies, as these sequences can mimic incomplete or lowly expressed genes and confound standard gene finders, including ab initio methods that rely on sequence signals alone.[49] Beyond annotation challenges, pseudogenes contribute to evolutionary dynamics by serving as substrates for gene regulation, such as acting as competing endogenous RNAs (ceRNAs) that modulate functional gene expression through microRNA sequestration.[55] This regulatory potential underscores their biological significance, influencing processes like development and disease.[56]
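The disablement scan at the core of such pipelines can be illustrated with a toy check for in-frame premature stop codons and a length difference that is not a multiple of three (a crude frameshift hint). The function and sequences below are hypothetical; real pipelines such as PseudoPipe work from genome-wide alignments and more elaborate scoring.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def disablements(candidate_cds, parent_cds):
    """Toy pseudogene checks against a functional parent coding sequence."""
    issues = []
    codons = [candidate_cds[i:i + 3] for i in range(0, len(candidate_cds) - 2, 3)]
    # Ignore the final codon, where a stop is expected even in an intact gene.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            issues.append(f"premature stop at codon {index + 1}")
    if (len(candidate_cds) - len(parent_cds)) % 3 != 0:
        issues.append("length difference not a multiple of 3 (possible frameshift)")
    return issues

parent_cds = "ATGGCCAAGCTGACCGAGTAA"      # intact hypothetical coding sequence
candidate = "ATGGCCTAGCTGACCGATAA"        # internal TAG plus a 1-bp deletion
print(disablements(candidate, parent_cds))
```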
Metagenomic Gene Prediction
Metagenomic gene prediction addresses the challenge of identifying protein-coding genes within DNA sequences derived from environmental samples, which contain complex mixtures of uncultured microorganisms, predominantly prokaryotes, without relying on reference genomes. These sequences often consist of short, fragmented reads or partially assembled contigs, complicating traditional gene-finding approaches due to incomplete coverage, sequencing errors, and the absence of organism-specific priors.[57] Key methods in metagenomic gene prediction leverage statistical composition features, such as GC content bias and codon usage patterns, to distinguish coding from non-coding regions, particularly in prokaryotic sequences. For instance, MetaGeneMark employs hidden Markov models (HMMs) trained on hexamer frequencies within GC-partitioned models to capture prokaryotic coding signals like start and stop codons, enabling ab initio predictions on anonymous sequences.[58] Taxonomic binning, using sequence composition or coverage profiles, further refines predictions by grouping contigs into putative microbial populations, allowing tailored models for each bin.[59] Prominent tools include FragGeneScan, introduced in 2010, which integrates HMMs with explicit sequencing error models to predict fragmented open reading frames (ORFs) in short, error-prone reads, achieving higher sensitivity for incomplete genes compared to earlier methods.[60] Prodigal, also from 2010, uses dynamic programming to score potential gene starts and has an "anonymous" mode adapted for metagenomes, balancing speed and accuracy on assembled contigs.[61] MetaGeneMark similarly relies on composition-based HMMs for unsupervised gene calling. These tools face challenges from chimeric assemblies and low-coverage regions, which can lead to fragmented or missed predictions.[62] Recent advances integrate gene prediction with de novo assembly pipelines, such as MEGAHIT, to process longer contigs and improve ORF completeness before prediction. Updated versions like MetaGeneMark-2 (2022) enhance accuracy through refined modeling of atypical prokaryotic genes.[63] As of 2025, emerging machine learning approaches are increasingly applied to improve annotation in diverse metagenomes.[64] Benchmarks indicate typical sensitivities of 70-80% for prokaryotic genes in simulated short-read metagenomes with moderate errors, though specificity can drop below 60% due to overprediction in non-coding areas. For binned contigs, homology searches against reference databases provide additional validation of the predictions.[62]
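The fragment-aware reasoning used by tools like FragGeneScan can be illustrated with a toy scan that lets coding stretches run off either end of a short read, instead of requiring both a start and a stop codon inside the fragment. The minimum length and the forward-strand-only simplification are arbitrary choices for this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def partial_orfs(read, min_len=60):
    """Candidate coding stretches on a short read: maximal stop-free runs in each
    forward frame, allowed to be truncated at the read boundaries."""
    read = read.upper()
    candidates = []
    for frame in range(3):
        run_start = frame
        i = frame
        while i + 3 <= len(read):
            if read[i:i + 3] in STOP_CODONS:
                if i - run_start >= min_len:
                    candidates.append((frame, run_start, i))
                run_start = i + 3
            i += 3
        if i - run_start >= min_len:           # run truncated by the end of the read
            candidates.append((frame, run_start, i))
    return candidates
```

On real data such candidates would then be ranked with composition features like GC content and codon usage and evaluated on both strands.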
Modern Techniques
Machine Learning Approaches
Machine learning approaches to gene prediction leverage supervised and unsupervised techniques to model sequence patterns, distinguishing coding from non-coding regions more effectively than purely statistical methods. These methods typically involve training on annotated genomic corpora to learn features such as k-mer frequencies, which capture local nucleotide composition, and periodic signals indicative of coding potential. Supervised learning, in particular, uses classifiers like support vector machines (SVMs) and decision trees for tasks such as exon classification and splice site detection, where hand-crafted features, including properties like codon bias and GC content, are engineered to represent sequence context.[65][66] A seminal example is Glimmer, developed for prokaryotic genomes, which employs unsupervised learning via interpolated Markov models (IMMs) to predict open reading frames (ORFs). IMMs dynamically weight k-mers of varying lengths (up to order 8) based on their reliability in training data from annotated microbial sequences, enabling robust handling of variable-length compositional signals without explicit supervision. This approach improved gene detection in bacteria like Haemophilus influenzae by adapting to local sequence biases, outperforming fixed-order Markov chains. For eukaryotic prediction, SNAP uses a semi-hidden Markov model framework whose states are parameterized with weight matrices and Markov models trained on features like splice donor/acceptor sites and codon usage from species-specific corpora such as Arabidopsis thaliana, scoring potential exons and facilitating adaptation to novel genomes through bootstrap training.[67][68] Decision trees have been applied in systems like MORGAN for vertebrate DNA, where they classify potential genes by recursively partitioning features such as dinucleotide frequencies and frame-specific scores derived from training on known exons. SVMs, as in mGene, extend this by optimizing hyperplanes in high-dimensional feature spaces for eukaryotic gene finding, incorporating k-mer profiles and intron length distributions to achieve high sensitivity in distinguishing true exons. Ensemble methods predating deep learning, such as Jigsaw, combine outputs from multiple classifiers, including decision trees and HMMs, via evidence combination, reducing errors from individual models by weighting predictions based on their reliability on annotated data. These techniques enhance rule-based methods by learning from empirical distributions, better accommodating genome-specific variability in signal strengths and lengths.[66][69][70][71]
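A minimal version of this supervised setup can be assembled with scikit-learn, assuming it is installed: k-mer frequency vectors as features and a linear support vector machine as the classifier. The training sequences and labels below are invented placeholders; a real experiment would use many annotated coding and non-coding regions.

```python
from itertools import product
from sklearn.svm import LinearSVC  # assumes scikit-learn is installed

K = 3
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}

def kmer_features(seq):
    """Normalized k-mer frequency vector for one sequence."""
    seq = seq.upper()
    counts = [0] * len(KMERS)
    for i in range(len(seq) - K + 1):
        kmer = seq[i:i + K]
        if kmer in KMER_INDEX:
            counts[KMER_INDEX[kmer]] += 1
    total = max(sum(counts), 1)
    return [c / total for c in counts]

# Placeholder training data: label 1 = coding-like, 0 = non-coding-like.
train_seqs = ["ATGGCCAAGCTGGAGGAGCTG", "ATGGTGCACCTGACTCCTGAG",
              "TTTATATATAAATTTATTTAA", "ATATATTTTAAATATATATTT"]
labels = [1, 1, 0, 0]

classifier = LinearSVC()
classifier.fit([kmer_features(s) for s in train_seqs], labels)
print(classifier.predict([kmer_features("ATGGCCATTGTAATGGGCCGC")]))
```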
Deep Learning and AI Integration
The integration of deep learning and artificial intelligence into gene prediction has accelerated since around 2015, driven by the availability of large-scale datasets from next-generation sequencing (NGS) technologies, which provide the volume of data necessary for training complex neural architectures. Unlike earlier machine learning approaches that relied on hand-engineered features, deep learning enables end-to-end prediction directly from raw DNA sequences, automatically learning hierarchical representations of genomic patterns such as splice motifs and exon-intron boundaries. This shift has allowed models to capture subtle, non-linear relationships in sequence data, improving accuracy in de novo gene structure identification without prior annotations.[72] Key advancements include convolutional neural networks (CNNs) for detecting splice sites, exemplified by SpliceAI, which uses a deep residual neural network to predict splice junctions and alternative splicing events from pre-mRNA sequences with high precision.[73] Recurrent neural networks (RNNs) and long short-term memory (LSTM) units have been employed to model long-range sequence dependencies and contextual information, as seen in Tiberius, an end-to-end model that processes extended genomic contexts for accurate exon chaining.[74] More recently, transformer-based architectures have emerged for DNA language modeling; the Nucleotide Transformer, pretrained on diverse eukaryotic genomes, is fine-tuned for tasks like promoter and enhancer prediction, leveraging self-attention mechanisms to handle sequences up to hundreds of kilobases.[72] Notable recent models include Sensor-NN, a neural network that aggregates signals from multiple sequence sensors (e.g., codon usage and GC content) for ab initio eukaryotic gene prediction, achieving robust performance across distant taxa.[75] Tiberius further advances this by hybridizing deep learning with hidden Markov models (HMMs), integrating CNN and LSTM layers for feature extraction and probabilistic state transitions, resulting in over 95% exon sensitivity on human benchmarks in de novo mode.[74] These innovations have pushed overall accuracy, with Tiberius correctly predicting the exon-intron structure of approximately two-thirds of human protein-coding genes without errors.[76] Deep learning models are increasingly integrated with long-read sequencing data, such as from PacBio or Oxford Nanopore, to resolve complex isoforms and repetitive regions that short reads obscure; for instance, neural networks trained on raw nanopore signals can directly detect gene boundaries and classify isoforms in unassembled data. AI-driven isoform prediction has also advanced through models like SpliceAI, which forecast multiple splicing outcomes per gene to capture tissue-specific variability. Looking ahead, foundation models such as Enformer, which predict regulatory effects and gene expression from distal DNA sequences, promise to extend gene prediction to include non-coding and regulatory elements, enabling comprehensive annotation of functional genomic units. In 2025, DeepMind's AlphaGenome further advanced this by predicting gene regulation directly from DNA sequences using large-scale AI models.[77]
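A stripped-down convolutional classifier for donor-site windows, in the spirit of (but far smaller than) models like SpliceAI, might look like the PyTorch sketch below. The architecture, window size, and toy input are illustrative assumptions, and no training loop is shown.

```python
import torch
import torch.nn as nn

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """One-hot encode a DNA window as a (4, length) tensor."""
    encoding = torch.zeros(4, len(seq))
    for i, base in enumerate(seq.upper()):
        if base in BASE_INDEX:
            encoding[BASE_INDEX[base], i] = 1.0
    return encoding

class SpliceCNN(nn.Module):
    """Tiny CNN: one convolution over the one-hot window, global max pooling, linear classifier."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(4, 16, kernel_size=9, padding=4)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(16, 2)    # classes: splice donor vs. non-site

    def forward(self, x):             # x: (batch, 4, window_length)
        h = torch.relu(self.conv(x))
        h = self.pool(h).squeeze(-1)
        return self.fc(h)

model = SpliceCNN()
window = one_hot("CAG" + "GTAAGT" + "A" * 31)   # 40-bp toy window around a GT donor
logits = model(window.unsqueeze(0))             # add a batch dimension
print(logits.shape)                             # torch.Size([1, 2])
```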
Evaluation and Tools
Performance Metrics
Performance metrics for gene prediction evaluate the accuracy of computational outputs against reference annotations, typically at nucleotide, exon, and gene levels. At the nucleotide level, sensitivity (SN) measures the proportion of true coding bases correctly identified, while specificity (SP) assesses the proportion of predicted coding bases that are actually coding. These are calculated as \text{SN} = \frac{\text{TP}}{\text{TP} + \text{FN}}, \quad \text{SP} = \frac{\text{TP}}{\text{TP} + \text{FP}}, where TP are true positives, FN false negatives, and FP false positives.[78][79] Correlation coefficients, such as the Pearson correlation between predicted and reference coding probabilities, provide an additional measure of overall agreement at this level.[80] At the exon level, evaluation focuses on matching predicted exons to reference exons, often requiring significant overlap (e.g., >95% boundary match for exact exons) to count as correct, though partial overlaps are considered in relaxed metrics. Gene-level metrics assess structural overlap, such as the proportion of genes with all exons correctly predicted or the correlation between predicted and true gene structures, accounting for intron-exon boundaries and alternative splicing.[78][81] These hierarchical metrics reveal that while nucleotide-level accuracy often exceeds 90%, exon- and gene-level performance drops due to the complexity of splicing patterns.[3] Advanced metrics extend to transcript-level assessment, incorporating expression data like transcripts per million (TPM) to validate predicted isoforms against observed expression levels; higher TPM correlation indicates biologically relevant predictions. For pseudogene prediction, false discovery rate (FDR) controls the proportion of incorrectly annotated pseudogenes among predictions, typically set at <0.05 to minimize false positives in non-functional gene identification.[82][83] Key challenges in these evaluations include class imbalance, where non-genic regions vastly outnumber genic ones (e.g., <2% of the human genome is coding), leading to models biased toward high specificity at the expense of sensitivity. Handling partial matches is also problematic, as fragmented or overlapping exon predictions require nuanced overlap thresholds to avoid under- or over-counting accuracy.[84][79] Benchmarks like the EGASP consortium demonstrate typical exon sensitivity of 70-80% for human genes in the mid-2000s, with modern evidence-integrated methods in the 2020s achieving 80-90% exon sensitivity through improved alignments and machine learning.[79][3]
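These nucleotide-level definitions translate directly into code. The sketch below computes SN, SP, and an F1 score from sets of predicted and annotated coding positions; the interval inputs are illustrative.

```python
def coding_positions(intervals):
    """Expand (start, end) intervals (1-based, inclusive) into a set of coding positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions

def nucleotide_metrics(predicted, reference):
    """Nucleotide-level sensitivity (SN), specificity/precision (SP), and F1."""
    tp = len(predicted & reference)
    fp = len(predicted - reference)
    fn = len(reference - predicted)
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * sn * sp / (sn + sp) if sn + sp else 0.0
    return sn, sp, f1

reference = coding_positions([(100, 200), (300, 400)])
predicted = coding_positions([(110, 200), (300, 420)])
print(nucleotide_metrics(predicted, reference))
```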
Software Tools and Benchmarks
Gene prediction software encompasses a range of open-source implementations tailored to prokaryotic, eukaryotic, and hybrid annotation needs, with many hosted on repositories like GitHub for collaborative development and accessibility.[85][86] For prokaryotic genomes, Prodigal employs a dynamic programming algorithm based on log-likelihood functions to identify protein-coding genes, offering high accuracy on bacterial and archaeal sequences while handling incomplete assemblies.[61] Glimmer, utilizing interpolated Markov models, excels at detecting genes in microbial DNA, including bacteria, archaea, and viruses, and has been a benchmark tool since its inception.[87] In eukaryotic contexts, Augustus applies a generalized hidden Markov model to predict gene structures, incorporating evidence from RNA-Seq or proteins for enhanced precision in intron-exon delineation.[88] GeneMark, originally developed using Markov chain models, supports self-training modes like GeneMark-ES for novel eukaryotic genomes, particularly effective for fungi and other organisms with atypical intron patterns.[89][90] Hybrid pipelines integrate ab initio predictions with extrinsic evidence for comprehensive annotation. MAKER combines outputs from tools like Augustus and SNAP with alignments to ESTs and proteins, iteratively refining models via EVidenceModeler for emerging model organisms.[27] BRAKER automates training and prediction using GeneMark-ETP and Augustus, leveraging RNA-Seq data to generate reliable gene sets in novel eukaryotic genomes without manual intervention; the latest version, BRAKER3 (as of 2023), further improves automation by filtering spurious predictions and integrating additional transcript evidence.[91][92] Modern pipelines like Funannotate provide end-to-end workflows for fungal and eukaryotic annotation, incorporating EVidenceModeler to build consensus gene models from multiple predictors such as Augustus and GeneMark, alongside BUSCO for quality assessment, and are particularly suited for 2020s-era high-throughput sequencing data.[93] Deep learning-based tools, such as those employing CNN-RNN hybrids or end-to-end models like Tiberius (2024), are emerging for specialized tasks like coding region identification in plant and human genomes, achieving high exon-level F1-scores (e.g., 92.6%) and gaining broader adoption in general gene prediction.[94][74] Benchmarks for evaluating these tools rely on standardized datasets and competitions.
The ENCODE project's EGASP assessed human genome annotations, revealing Augustus as one of the top ab initio predictors, with sensitivity and specificity metrics highlighting gaps in alternative transcript detection.[79] GENCODE provides reference annotations for human and mouse, with the 2025 release continuing to refine comprehensive gene sets through integrated evidence and manual curation.[95][96][32] These evaluations demonstrate that evidence-integrated pipelines like BRAKER often achieve 10-20% higher precision than purely ab initio approaches on complex eukaryotic datasets, though performance varies by genome type.[97] Practical usage includes web servers for accessibility; the NCBI Eukaryotic Genome Annotation Pipeline employs Gnomon for homology-based and ab initio predictions, integrating alignments to generate RefSeq annotations for thousands of eukaryotic assemblies.[98] Open-source repositories facilitate customization, with tools like Prodigal and Augustus available via GitHub for local execution and integration into workflows; a minimal invocation sketch follows the summary table below.[85]
| Tool | Domain | Key Features | Primary Citation/Source |
|---|---|---|---|
| Prodigal | Prokaryotic | Dynamic programming for ORFs in drafts | Hyatt et al., 2010 |
| Glimmer | Prokaryotic | Interpolated Markov models for microbes | Salzberg et al., 1998 |
| Augustus | Eukaryotic | GHMM with evidence integration | Stanke et al., 2006 |
| GeneMark | Eukaryotic/Prokaryotic | Self-training HMM variants | Besemer & Borodovsky, 2005 |
| MAKER | Hybrid | Evidence alignment and model consensus | Cantarel et al., 2008 |
| BRAKER | Hybrid | Automated RNA-Seq training | Hoff et al., 2019 |
| Funannotate | Eukaryotic (fungi focus) | Pipeline with BUSCO and EVM | Funannotate Docs |
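For local use, these predictors are typically run from the command line. The Python wrapper below sketches how Prodigal and AUGUSTUS might be invoked on a genome FASTA via subprocess; the flags follow their documented interfaces but should be verified against the installed versions, and the file paths and species model are placeholders.

```python
import subprocess

GENOME = "genome.fa"   # placeholder input assembly

def run_prodigal(genome, out_gff="prodigal.gff", proteins="proteins.faa"):
    """Prokaryotic gene calling with Prodigal (check flags against the installed version)."""
    subprocess.run(
        ["prodigal", "-i", genome, "-o", out_gff, "-f", "gff", "-a", proteins],
        check=True,
    )

def run_augustus(genome, species="human", out_gff="augustus.gff"):
    """Ab initio eukaryotic prediction with AUGUSTUS using a pre-trained species model."""
    with open(out_gff, "w") as handle:
        subprocess.run(
            ["augustus", f"--species={species}", genome],
            stdout=handle, check=True,
        )

if __name__ == "__main__":
    run_prodigal(GENOME)
    run_augustus(GENOME, species="human")
```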