
Expressed sequence tag

An expressed sequence tag (EST) is a short DNA sequence, typically 200–800 nucleotides in length, derived from a single-pass sequencing read of a randomly selected complementary DNA (cDNA) clone from a cDNA library prepared from messenger RNA (mRNA) of a specific tissue, cell type, or population. These tags represent fragments of expressed genes and provide direct evidence of transcriptionally active regions in the genome, enabling the identification of novel genes without the need for full-length sequencing.

ESTs were first introduced in 1991 as a cost-effective approach to cataloging genes through partial sequencing of cDNA clones, yielding over 600 tags that revealed 337 potentially new genes, many with homologs in other organisms. The method rapidly advanced gene discovery during the early years of the Human Genome Project, when ESTs facilitated chromosome mapping via the polymerase chain reaction (PCR) and served as markers for locating coding regions in genomic sequences. Over time, millions of ESTs have been generated across species and stored in databases such as NCBI's dbEST, which was retired in 2019 and integrated into the broader nucleotide database. At its retirement, dbEST contained sequences from diverse organisms that supported gene discovery and expression profiling.

Key applications of ESTs include transcriptomics for estimating gene expression levels in specific conditions, development of molecular markers such as simple sequence repeats (SSRs) for genetic mapping, and assembly into contigs to approximate unigene sets that reduce redundancy and aid in annotating genomes. In non-model organisms, ESTs remain valuable for de novo transcriptome analysis despite the rise of next-generation sequencing technologies. Their single-pass nature introduces errors such as chimeric sequences and frameshifts, but clustering algorithms mitigate these and enhance utility in large-scale studies.

Definition and Characteristics

Definition

An expressed sequence tag (EST) is a short sub-sequence, typically 200–800 nucleotides in length, derived from one or both ends of a cloned cDNA corresponding to a transcribed mRNA. These sequences are generated through partial sequencing of randomly selected cDNA clones, providing a snapshot of the expressed portion of the genome without requiring full-length sequencing. The term "expressed sequence tag" was coined in 1991 to describe this approach to efficient gene discovery in the context of the Human Genome Project.

ESTs originate from messenger RNA (mRNA) transcripts, which represent actively expressed genes in particular tissues, developmental stages, or environmental conditions. They therefore capture functional genetic information rather than the entire genomic DNA, which also includes non-coding regions and introns. This distinction allows ESTs to identify transcribed regions of the genome specifically, aiding the annotation of genes and the study of gene expression patterns across different biological contexts.

ESTs can be classified as sense or antisense based on their orientation relative to the reference gene sequence: sense ESTs align to the sense (mRNA-like) strand, while antisense ESTs align to the complementary strand, potentially indicating the presence of natural antisense transcripts or sequencing artifacts. This distinction is crucial for detecting sense–antisense pairs and for understanding the regulatory mechanisms associated with them.
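The sense/antisense distinction can be illustrated with a minimal sketch. It assumes exact substring matching against a known mRNA sequence, which stands in for the real alignment step; the function names `reverse_complement` and `classify_orientation` are hypothetical and not taken from any EST toolkit.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_orientation(est: str, mrna: str) -> str:
    """Label an EST as sense, antisense, or unplaced relative to an mRNA.

    Assumption: exact substring matching replaces a real aligner, so
    sequencing errors would cause a true match to be reported as unplaced.
    """
    est, mrna = est.upper(), mrna.upper()
    if est in mrna:
        return "sense"
    if reverse_complement(est) in mrna:
        return "antisense"
    return "unplaced"
```

In practice an aligner that tolerates mismatches would replace the substring test, and an antisense call would still need checking against library directionality before being read as a natural antisense transcript.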

Key Features and Properties

Expressed sequence tags (ESTs) typically range in length from 200 to 800 base pairs (bp), reflecting the partial sequencing of cDNA clones derived from mRNA transcripts. This variability arises primarily from limitations in cloning efficiency and the single-pass sequencing approach employed in early EST projects, where reads were often truncated due to technical constraints in Sanger sequencing technology.

A common property of raw EST sequences is low quality at the 3' and 5' ends, which are frequently contaminated with sequences from cloning vectors or with poly-A tails from mRNA priming during cDNA synthesis. These artifacts necessitate post-sequencing trimming to isolate high-quality portions of the transcript, as unprocessed ends can introduce errors in downstream analyses such as alignment and annotation.

EST collections exhibit significant redundancy, with highly expressed genes often represented by multiple identical or near-identical sequences, resulting in uneven coverage across the transcriptome. This bias stems from the proportional sampling of abundant mRNAs during cDNA library construction, where transcripts from housekeeping or highly active genes dominate the dataset, potentially underrepresenting low-abundance or tissue-specific genes.

Additionally, ESTs are prone to chimeric sequences and other artifacts, such as those introduced by errors during reverse transcription of mRNA to cDNA. These issues can arise from incomplete or aberrant priming, leading to fused or erroneous sequences that may misrepresent true transcripts and require validation through clustering or additional sequencing.
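The effect of proportional sampling on coverage can be made concrete with a short calculation. The sketch below assumes each sequenced clone is an independent draw from the library and that a transcript makes up a fixed fraction of clones; both are simplifying assumptions, and the function names are illustrative.

```python
import math


def detection_probability(fraction: float, n_clones: int) -> float:
    """Probability that at least one of n_clones comes from a transcript
    that makes up `fraction` of the library (independent draws assumed)."""
    return 1.0 - (1.0 - fraction) ** n_clones


def clones_needed(fraction: float, target: float = 0.95) -> int:
    """Smallest number of clones giving at least `target` detection probability."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - fraction))


if __name__ == "__main__":
    # An abundant transcript (1% of clones) versus a rare one (0.01%).
    print(detection_probability(0.01, 200))
    print(clones_needed(0.0001))
```

Under these assumptions a transcript at 1% of the library is very likely to be seen within a few hundred clones, while one at 0.01% needs tens of thousands, which is why unnormalized libraries undersample rare genes.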

Methods of Generation

cDNA Library Preparation

The preparation of a cDNA library begins with the extraction of total RNA from target tissues or cells, followed by enrichment for messenger RNA (mRNA) to focus on actively expressed genes. Tissues are selected based on specific biological contexts, such as developmental stage, organ type, or physiological condition, to capture tissue-specific transcripts. mRNA is isolated using oligo(dT)-cellulose chromatography, which binds the poly(A) tails present on most eukaryotic mRNAs, effectively separating them from ribosomal and transfer RNAs. This step ensures that the library represents the transcriptome rather than the entire cellular RNA pool.

Reverse transcription converts the purified mRNA into complementary DNA (cDNA), the foundational material for library construction. An oligo(dT) primer, complementary to the poly(A) tail, is annealed to the mRNA, and reverse transcriptase (typically from avian myeloblastosis virus or Moloney murine leukemia virus) synthesizes the first-strand cDNA by extending from the primer using the mRNA as a template. The RNA is then degraded with RNase H or alkaline hydrolysis, and the second strand is synthesized using DNA polymerase I and RNase H to create double-stranded cDNA. This process preserves the sequence information from expressed genes while converting the unstable RNA into stable DNA. Directional strategies, such as incorporating different restriction sites at the 5' and 3' ends (e.g., EcoRI at the 5' end and NotI at the 3' end), are often employed to maintain the original mRNA orientation, facilitating subsequent sequencing from either end.

The double-stranded cDNA is then cloned into a suitable vector to generate the library. Blunt or cohesive ends are created on the cDNA (e.g., via S1 nuclease treatment or linker addition), and it is ligated into plasmids (such as pBluescript) or bacteriophage lambda vectors (such as λZAP II), which are introduced into host cells for amplification. Each bacterial colony or phage plaque represents a unique cDNA clone, forming a library that can contain millions of clones.

To enhance representation of low-abundance transcripts, normalization is applied, often by denaturing and reannealing the cDNA population: highly abundant sequences form duplexes more rapidly and are removed via hydroxyapatite chromatography or exonuclease digestion, equalizing clone frequencies. This step is crucial for EST generation, as it reduces redundancy from highly expressed genes such as housekeeping genes. The first cDNA libraries used for EST sequencing in 1991 were commercially prepared in λZAP II from human brain mRNA.

Sequencing and Processing

Expressed sequence tags (ESTs) are generated through single-pass sequencing of complementary DNA (cDNA) clones, typically from the 5' or 3' ends, to produce short, partial sequences that represent expressed genes. This method, first demonstrated on large-scale cDNA libraries, involves automated cycle sequencing with fluorescent terminators, followed by electrophoresis to read the fragments. The approach yields directional reads averaging 200–800 base pairs, providing sufficient information for gene identification without full-length sequencing.

Post-sequencing processing begins with base calling and quality assessment using tools like PHRED to assign Phred scores to each base, enabling the trimming of low-quality regions (typically those with scores below 20). Adapters and vector sequences are removed via alignment-based methods, such as similarity searches against vector databases or specialized tools like SeqClean, which identifies and clips contaminants with high similarity (≥94% identity over ≥30 bases) at the sequence ends. Repeats, including poly-A tails and low-complexity regions, are masked using algorithms like RepeatMasker to prevent misalignment in downstream analyses, often excluding affected portions from further processing. Sequences shorter than 100 bases or with excessive undetermined bases (>3% 'N's) are discarded to ensure data reliability.

Error correction addresses inaccuracies from reverse transcription during cDNA synthesis, which can introduce mismatches at rates of approximately 1 in 15,000 to 30,000 bases, and potential bacterial contamination from hosts like E. coli. This is achieved through quality filtering, chimeric sequence detection (e.g., via abrupt quality drops or BLASTN against contaminants), and redundancy checks to remove duplicates or low-confidence reads. In cases of vector re-linearization artifacts, modified protocols like those in SeqClean recover otherwise lost sequences, reducing errors from 18.5% to near zero in tested datasets.

During the 1990s, EST projects scaled up through high-throughput batch processing at facilities like The Institute for Genomic Research (TIGR) and the Wellcome Trust Sanger Institute, where dozens of automated ABI sequencers processed thousands of clones daily in 96-well formats. This enabled the generation of millions of ESTs, with pipelines integrating robotic liquid handling for template preparation and parallel sequencing runs to accelerate gene discovery.
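The trimming and filtering thresholds described above (Phred score below 20, minimum length of 100 bases, more than 3% undetermined bases) can be sketched as follows. This is a simplified illustration rather than the algorithm of PHRED, SeqClean, or any specific pipeline: it keeps the longest contiguous run of acceptable bases, whereas production trimmers usually use windowed or cumulative-score methods. The function names are hypothetical.

```python
from typing import Sequence


def trim_by_quality(seq: str, quals: Sequence[int], min_q: int = 20) -> str:
    """Return the longest contiguous stretch of bases with Phred >= min_q.

    Assumption: a simple run-based rule replaces the windowed trimming
    used by real tools.
    """
    best_start, best_len = 0, 0
    run_start = None
    # A trailing sentinel score of -1 closes any run that reaches the end.
    for i, q in enumerate(list(quals) + [-1]):
        if q >= min_q:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return seq[best_start:best_start + best_len]


def passes_filters(seq: str, min_len: int = 100, max_n_frac: float = 0.03) -> bool:
    """Discard reads that are too short or contain too many undetermined bases."""
    if len(seq) < min_len:
        return False
    return seq.upper().count("N") / len(seq) <= max_n_frac
```

A read would typically be trimmed first and then passed through `passes_filters`, so that length is judged on the high-quality portion only.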

Historical Development

Early Pioneering Work

The pioneering work on expressed sequence tags (ESTs) began with advancements in cDNA library construction during the late 1970s. In 1979, researchers at the California Institute of Technology (Caltech) developed techniques to amplify DNA copies of messenger RNAs (mRNAs) in bacterial plasmids, enabling the scalable cloning and propagation of expressed gene sequences from eukaryotic cells. This breakthrough addressed the limitations of earlier single-molecule cDNA synthesis by allowing the generation of large libraries representative of cellular transcriptomes.

Building on these foundations, in 1982, J. Gregor Sutcliffe and colleagues introduced the strategy of sequencing random clones from cDNA libraries to facilitate partial gene identification. By examining randomly selected cDNA clones derived from rat brain poly(A)+ RNA, they identified an 82-nucleotide sequence unique to brain tissue through partial sequencing and hybridization analysis. This approach highlighted the utility of short sequence reads from expressed genes for detecting tissue-specific elements without requiring full-length sequencing.

A key demonstration of feasibility came in 1983, when Stephen D. Putney and co-workers sequenced inserts from 178 randomly selected clones in a rabbit skeletal muscle cDNA library, yielding approximately 20,000 nucleotides of sequence data. This "shotgun" sequencing effort identified clones corresponding to 13 distinct muscle proteins, including a novel isoform, and confirmed matches to six known polypeptides through database comparisons and protein alignments. The study emphasized the efficiency of random partial sequencing, estimating that 150–200 clones would suffice to capture most abundant transcripts in a tissue-specific library.

These experiments collectively drove a conceptual shift in genomics, moving away from the resource-intensive complete sequencing of individual genes toward economical partial "tagging" of expressed sequences to map and discover genes rapidly. This paradigm emphasized statistical sampling from cDNA libraries to prioritize expressed regions, setting the stage for broader applications in gene identification.

Establishment and Expansion

The term "expressed sequence tag" (EST) was formally coined in 1991 by Mark D. Adams and colleagues at the National Institutes of Health (NIH), who initiated the first systematic sequencing of complementary DNA (cDNA) from human brain tissue as a high-throughput approach to gene discovery. This work, published in Science, described partial sequencing of over 600 randomly selected cDNA clones from an infant brain library, yielding approximately 150,000 bases of sequence data that identified novel transcripts and demonstrated the potential of ESTs for mapping expressed genes in the human genome. Building on earlier conceptual ideas of random cDNA sequencing, this effort marked the establishment of ESTs as a standardized genomic tool, emphasizing their role in efficient, cost-effective identification of coding regions without full-length cloning.

Throughout the 1990s, EST generation expanded rapidly as part of pre-genome era initiatives, with millions of sequences produced for humans and key model organisms such as Mus musculus. By October 1997, the dbEST database at the National Center for Biotechnology Information (NCBI) contained over 833,000 human ESTs and 237,000 from mouse, reflecting contributions from academic consortia, government-funded projects, and emerging private efforts like those at The Institute for Genomic Research (TIGR). This surge was driven by automated sequencing technologies and collaborative sequencing centers, enabling the cataloging of expressed genes across diverse tissues and developmental stages before complete genome assemblies were available.

ESTs were integrated into major international efforts, notably the Human Genome Project (HGP), where they facilitated transcript mapping by aligning short sequences to chromosomal locations and aiding in the annotation of gene structures. Proponents, including Adams and colleagues, advocated ESTs as a complementary strategy to whole-genome sequencing, providing rapid insights into the transcriptome and estimating human gene numbers at around 60,000–100,000 based on early clustering analyses. By the late 1990s, HGP milestones included mapping over 16,000 genes via EST-based physical maps, which informed draft genome assembly and prioritized regions for deeper sequencing.

A key milestone in EST expansion occurred by 2013, when public databases had amassed over 74 million sequences from more than 2,473 species, underscoring the technique's enduring scale despite the rise of next-generation sequencing. This vast repository, held predominantly in dbEST at the time, encapsulated decades of global contributions. dbEST was retired in 2019, with all EST data migrated to NCBI's nucleotide database, where it continues to support comparative transcriptomics and evolutionary studies as of 2025.

Data Resources

Primary Databases

dbEST was the primary database for raw, uncurated expressed sequence tag (EST) data, functioning as a dedicated division of GenBank from its establishment in 1992 until its retirement in 2019. It archived single-pass cDNA sequences and associated metadata, and was announced in a seminal publication that outlined its role in facilitating the storage and dissemination of ESTs from high-throughput sequencing efforts. By design, dbEST emphasized rapid deposition of unprocessed sequences, including details such as source tissue and developmental stage, to support studies across species.

Following retirement, EST sequences up to 2018 remain accessible via archived FTP downloads at ftp.ncbi.nlm.nih.gov/repository/dbEST, and are integrated into the broader nucleotide database for search via E-Utilities and APIs. Newer EST submissions since 2019 are processed through standard GenBank tools like tbl2asn, with many directed to the Sequence Read Archive (SRA) or Transcriptome Shotgun Assembly (TSA) divisions for high-throughput data. The archive retains millions of historical EST entries, serving as a foundational resource for legacy data in downstream analyses as of 2025.

In scope, dbEST encompassed ESTs from a wide array of organisms, with early entries predominantly from human sources due to initiatives like the Human Genome Project. These human-focused datasets provided critical initial scale, with sequences often shorter than 1000 base pairs derived from mRNA to capture expressed genes. Fully integrated within the GenBank ecosystem, historical dbEST sequences are accessible through standard search interfaces, ensuring linkage to the broader repository of genetic data.
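As a rough illustration of searching legacy EST records through E-Utilities, the sketch below only builds an ESearch request URL and does not contact the network. The ESearch endpoint and the `db`, `term`, `retmax`, and `retmode` parameters are standard E-Utilities features; the `is_est[filter]` query term and the helper name `build_est_search_url` are assumptions made for this example and should be checked against current NCBI documentation before use.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of the NCBI E-Utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_est_search_url(organism: str, retmax: int = 20) -> str:
    """Build an ESearch URL for EST records of one organism.

    Assumption: EST records in the nucleotide database can be selected
    with the query term `is_est[filter]`.
    """
    term = f'"{organism}"[Organism] AND is_est[filter]'
    params = {
        "db": "nuccore",      # Entrez name of the nucleotide database
        "term": term,
        "retmax": retmax,     # maximum number of record IDs to return
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_est_search_url("Mus musculus", retmax=5))
```

The returned URL can be fetched with any HTTP client; NCBI asks users to respect its request-rate limits and recommends supplying an API key for heavier use.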

Assembly and Annotation Tools

Assembly of expressed sequence tags (ESTs) involves clustering and aligning overlapping sequences to form longer contigs or consensus sequences, reducing redundancy and improving transcript representation. This process addresses the short length and error-prone nature of individual ESTs by merging similar reads from the same gene, often using algorithms that detect sequence similarity over significant overlaps.

Historically, the TIGR Gene Indices employed a protocol starting with vector trimming and contaminant removal, followed by clustering via FLAST (grouping sequences sharing at least 95% identity over 40 nucleotides) and assembly into tentative consensus (TC) sequences using CAP3 to generate non-chimeric, high-fidelity transcripts. These indices, active until the mid-2000s, are now archived or integrated into species-specific resources like The Arabidopsis Information Resource (TAIR). Similarly, the UniGene system, retired in 2019, partitioned ESTs into clusters representing unique genes or transcripts through pairwise sequence alignments that identified significant overlaps, without producing consensus sequences, thereby handling redundancy by grouping variants like alternative isoforms. Archived UniGene data remains available via FTP, and modern alternatives include tools like CD-HIT-EST for clustering. The STACK database, from the early 2000s, used a non-alignment-based d2_cluster algorithm relying on word composition to partition ESTs into tissue-specific bins, followed by alignment with PHRAP and consensus building to capture isoforms; it is no longer actively maintained.

Annotation of assembled EST contigs typically involves aligning them to reference genomes or protein databases to infer function, with BLAST or similar tools used to assign putative roles based on sequence similarity. EST libraries commonly show a 3' bias, in which sequences are enriched in 3' untranslated regions (UTRs) as a result of oligo(dT) priming. Annotations therefore often prioritize UTR features and adjust for incomplete coding sequence coverage, enhancing gene model accuracy by integrating polyA signals and avoiding over-reliance on partial open reading frames. As of 2025, contemporary tools like Trinotate or InterProScan facilitate annotation of legacy and new data.
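The idea of grouping ESTs that share a stretch of sequence can be sketched with a toy single-linkage clustering. This is not the FLAST, UniGene, or d2_cluster algorithm: it assumes exact shared k-mers on one strand instead of 95% identity over an aligned overlap, and the function name is hypothetical. It is meant only to show how transitive grouping of overlapping reads produces clusters.

```python
from collections import defaultdict


def cluster_by_shared_kmer(ests: dict[str, str], k: int = 40) -> list[set[str]]:
    """Group ESTs that share at least one exact k-mer (single linkage).

    Assumption: exact k-mer sharing is a crude stand-in for an
    identity-thresholded overlap; reverse-complement reads are ignored.
    """
    parent = {name: name for name in ests}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner: dict[str, str] = {}  # k-mer -> first EST that contained it
    for name, seq in ests.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in first_owner:
                parent[find(name)] = find(first_owner[kmer])
            else:
                first_owner[kmer] = name

    clusters: dict[str, set[str]] = defaultdict(set)
    for name in ests:
        clusters[find(name)].add(name)
    return list(clusters.values())
```

Each resulting cluster would then be passed to an assembler such as CAP3 to build a consensus; the clustering step only decides which reads are assembled together.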

Tissue and Expression Data

Tissue and expression data for expressed sequence tags (ESTs) connect short cDNA-derived sequences to specific biological contexts, such as tissue types, developmental stages, and disease states, facilitating analysis of gene activity across conditions. This metadata, primarily derived from cDNA library annotations, enables inference of expression patterns without full-length sequences.

Historically, the TissueInfo database, launched in 2001 and no longer maintained, focused on high-throughput annotation of ESTs with tissue and expression information using a structured classification. It imported and cleaned dbEST entries to create standardized profiles, employing a hierarchy with over 165 categories, and computed metrics like isExpressedIn (binary presence) and mostExpressedIn (the tissue of highest abundance), achieving 69% accuracy for tissue specificity in benchmarks. To ensure consistency, controlled vocabularies standardized descriptions; eVOC (eXpressed tissue, cell type, and developmental stage View Of Controlled vocabularies), deprecated in 2016, provided ontologies for anatomical systems (372 terms), cell types, developmental stages, and pathologies. Modern resources as of 2025 include the Genotype-Tissue Expression (GTEx) project for human tissue-specific expression and the Expression Atlas from EMBL-EBI, which integrates legacy data with newer expression profiles using ontologies like the Experimental Factor Ontology (EFO).

ESTs are mapped to expression profiles by aggregating counts from targeted libraries, associating sequences with stages (e.g., embryonic) or pathologies (e.g., tumor-derived) based on library annotations to reveal patterns like tissue-specific enrichment. Integration also takes library preparation details into account, including normalization applied to reduce biases from abundant transcripts, which requires adjustments for quantitative comparisons.
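The aggregation of EST counts into tissue profiles can be sketched as below. The metric names mirror the isExpressedIn and mostExpressedIn labels mentioned above, but the per-million scaling by library size is an assumption chosen for illustration and is not claimed to be the formula TissueInfo used.

```python
def expression_profile(est_counts: dict[str, int],
                       library_sizes: dict[str, int]) -> dict[str, float]:
    """ESTs per million sequenced clones for each tissue library.

    Assumption: simple scaling by library size; normalized libraries
    would need further correction before quantitative comparison.
    """
    return {
        tissue: 1e6 * est_counts.get(tissue, 0) / size
        for tissue, size in library_sizes.items()
        if size > 0
    }


def is_expressed_in(profile: dict[str, float], tissue: str) -> bool:
    """Binary presence: at least one EST for the gene in the tissue."""
    return profile.get(tissue, 0.0) > 0.0


def most_expressed_in(profile: dict[str, float]) -> str:
    """Tissue with the highest scaled EST abundance."""
    return max(profile, key=profile.get)
```

For a gene with 12 ESTs among 40,000 brain clones and 3 among 5,000 liver clones, the scaled values favor liver even though the raw count is higher in brain, which shows why library size has to enter the comparison.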

Applications in Genomics

Gene Discovery and Annotation

Expressed sequence tags (ESTs) have been instrumental in discovering novel transcripts, particularly in non-model organisms where genomic resources are limited. By sequencing partial cDNAs from diverse tissues, EST projects can uncover previously unknown genes that are not captured by computational predictions or limited genome assemblies. For instance, in the parasite Trypanosoma cruzi, sequencing of 1,949 ESTs from a normalized cDNA library found that 67% of sequences had no database matches, revealing T. cruzi-specific novel transcripts such as members of the trans-sialidase superfamily, which are potential drug targets. This approach has similarly enabled gene discovery in other non-model species such as schistosomes, expanding transcript catalogs beyond model organisms.

ESTs also refine gene predictions by aligning to genomic scaffolds, providing empirical evidence to delineate exon-intron boundaries and untranslated regions (UTRs). Tools like TWINSCAN_EST and N-SCAN_EST integrate EST alignments, typically produced with BLAT at high stringency (≥95% identity), into hidden Markov model-based predictions, correcting errors in de novo gene models. In Caenorhabditis elegans, this improved gene sensitivity from 61% to 75% and specificity by 13%, accurately defining exons and UTRs in complex operons. For the human genome, similar alignments boosted exact open reading frame (ORF) sensitivity to 47%, aiding the annotation of UTRs that influence mRNA stability and localization. These alignments often draw on EST assemblies for more comprehensive coverage.

Through EST-genome alignments, functional annotation infers protein domains and orthologs, linking novel transcripts to known biological roles. Translated EST sequences are scanned against protein domain databases using tools such as ESTScan, identifying conserved domains that suggest function. In one study, analysis of 40,821 translated ESTs revealed 1,415 domains, including eukaryotic protein kinases and leucine-rich repeats, which are enriched in stress-response genes. TBLASTX alignments to reference genomes further identify orthologs; for example, 71% of sugarcane EST assemblies matched reference proteins, enabling transfer of functional annotations such as those for defense-related WRKY transcription factors. This homology-based strategy has been widely applied to annotate EST-derived genes across eukaryotes.

In early human gene catalogs, ESTs played a pivotal role by adding thousands of novel genes to annotation sets. By 2001, alignments of millions of human ESTs to the draft genome supported approximately 21,000 individual transcriptional units, many representing previously uncharacterized genes with intact ORFs. This contribution helped refine estimates from over 100,000 to around 25,000 protein-coding genes by 2006, with EST evidence validating 15,642 expression-supported genes in comprehensive transcript indices.
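Translating an EST before a protein-level search can be sketched with a six-frame translation that reports the longest stop-free stretch. This is a simplified illustration and not the method of ESTScan, which models codon usage and tolerates frameshifts; the standard genetic code is used, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate from position 0; unknown codons become 'X', stops '*'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def longest_orf_six_frames(est: str) -> str:
    """Longest stop-free peptide over all six reading frames.

    No start codon is required, because ESTs are often partial and may
    begin inside a coding region.
    """
    est = est.upper()
    best = ""
    for strand in (est, reverse_complement(est)):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) > len(best):
                    best = segment
    return best
```

On short inputs the longest stop-free stretch can fall on the wrong strand or frame purely by chance, which is one reason dedicated tools weigh codon usage instead of relying on length alone.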

Expression Profiling and Microarrays

Expressed sequence tags (ESTs) serve as a foundational resource for designing DNA microarrays, where short sequences derived from the 5' or 3' ends of cDNA clones are amplified via PCR and spotted onto glass slides or membranes as probes. These probes enable the simultaneous interrogation of thousands of genes by hybridizing with fluorescently or radioactively labeled cDNA targets synthesized from mRNA of the sample under study. This approach allows for the quantitative measurement of gene expression patterns, as the intensity of hybridization signals correlates with transcript abundance.

In expression profiling, EST-based microarrays facilitate comparisons across biological conditions, such as different tissues, developmental stages, disease states, or treatment responses. For instance, in cancer research, these arrays have been used to profile gene expression in squamous cell carcinoma (SCC), identifying 118 differentially expressed genes among sample groups that included normal skin and SCC, with 42 up-regulated (e.g., CDH1 and MMP1) and 76 down-regulated (e.g., ERCC1), many of which were novel candidates for diagnostic markers. Such applications highlight the utility of EST microarrays in elucidating molecular mechanisms of tumorigenesis and progression.

Normalization in EST microarray experiments often leverages the redundancy observed in EST libraries, where the frequency of ESTs corresponding to a gene provides an estimate of baseline expression levels across tissues. Aggregated EST counts from databases are used to compute normalized expression values, accounting for library biases and enabling reliable comparisons; for example, tools like GeneHub-GEPIS integrate these counts to generate digital expression profiles for normal and cancerous tissues. Additionally, print-tip normalization is applied to microarray data to correct for systematic variations, further incorporating EST redundancy to minimize artifacts from uneven probe representation.

To enhance probe accuracy and reduce cross-hybridization, ESTs are assembled into contigs (consensus sequences built from overlapping clones) prior to probe selection. This clustering approach, using tools like CAP3 for assembly and selection based on sequence alignments, identifies unique representatives, such as the 3'-most EST with sufficient high-quality sequence length (at least 300 bases), thereby minimizing cross-hybridization and improving specificity in organisms with incomplete genomes. In one application, this method yielded 93-96% coverage of unique annotations across 442 contigs, demonstrating improved design efficiency.
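The representative-selection rule described above (the 3'-most EST with at least 300 high-quality bases) can be sketched as a small filter-and-pick step. The record layout (`PlacedEST` with a `contig_end` coordinate and an `hq_length`) is a hypothetical data model for this example, and it assumes contig coordinates increase toward the 3' end of the transcript.

```python
from typing import NamedTuple, Optional


class PlacedEST(NamedTuple):
    name: str
    contig_end: int   # end coordinate on the contig (larger = closer to 3' end)
    hq_length: int    # number of high-quality bases after trimming


def pick_probe(ests: list[PlacedEST], min_hq: int = 300) -> Optional[PlacedEST]:
    """Choose the 3'-most EST in a contig that has enough high-quality sequence.

    Returns None when no EST in the contig meets the length threshold.
    """
    eligible = [e for e in ests if e.hq_length >= min_hq]
    return max(eligible, key=lambda e: e.contig_end, default=None)
```

A contig for which `pick_probe` returns None would either be dropped from the array design or represented by a probe derived from the consensus sequence, depending on the project's policy.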

Limitations and Current Relevance

Technical Challenges

One major technical challenge in expressed sequence tag (EST) analysis stems from the inherent errors introduced during single-pass Sanger sequencing, the standard method used for generating ESTs. These short reads, typically 200–800 base pairs long, suffer from a base-calling error rate of about 1–2%, which is higher than in double-pass sequencing due to the lack of verification reads. Such errors often manifest as substitutions, insertions, or deletions that cause frameshifts in predicted coding sequences, disrupting open reading frames and complicating downstream annotation. Additionally, chimeric clones, which are artifacts arising from errors during library construction, can lead to hybrid sequences that misrepresent true transcripts, further propagating inaccuracies in assemblies. Tools like ESTScan have been developed to detect and correct these frameshifts by leveraging codon usage biases, achieving up to 95% accuracy in coding sequence prediction for human ESTs, though with an error rate of roughly 10% in some cases.

Another significant issue is the inherent bias in EST library construction, which favors highly abundant mRNAs and underrepresents low-expressed or tissue-specific genes. During reverse transcription and cDNA cloning, shorter and more abundant transcripts are preferentially cloned and sequenced, leading to overrepresentation of housekeeping genes, while rare transcripts may be missed entirely unless libraries are normalized, a process that itself introduces distortions by depleting common sequences. Analysis of over 900 mouse EST libraries encompassing 4.3 million sequences revealed systematic variations in transcript length distributions, with non-normalized libraries showing pronounced bias toward abundant species, potentially skewing quantitative comparisons of gene expression across tissues or conditions. This under-sampling can result in incomplete coverage, with low-abundance genes comprising less than 10% of detected sequences in standard libraries.

Annotation of ESTs is prone to inaccuracies, particularly false positives arising from cross-species contamination and alignments to pseudogenes. Contaminating sequences from microbial or other eukaryotic sources during library preparation can mimic genuine transcripts, with studies estimating up to 2.4% of candidate splice variants as false positives due to pre-mRNA or genomic DNA pollution in human EST datasets. Similarly, pseudogenes (non-functional gene copies with high sequence similarity to parental genes) frequently align to ESTs, leading to erroneous classification of inactivated loci as expressed; for instance, 14–16% of annotated pseudogenes in human genomes show spurious EST support, inflating gene counts and complicating functional inference. These issues are exacerbated by fragmented EST coverage, which aligns ambiguously across paralogous regions, and mitigating them requires stringent filtering, such as 97% identity thresholds in BLAST searches.

Finally, assembling ESTs into contigs introduces artifacts, particularly misassemblies driven by sequence polymorphisms and alternative splicing. Single nucleotide polymorphisms (SNPs) within populations create sequence variants that assemblers may interpret as separate contigs or chimeric merges, fragmenting true transcripts; for example, heterozygous SNPs can split alleles into distinct clusters, reducing contig accuracy by up to 20% in polymorphic datasets. Alternative splicing further complicates this, as isoform-specific exons generate overlapping but non-identical reads that lead to incomplete or erroneous contig formations, with traditional assemblers like CAP3 prone to errors in resolving splicing graphs. Advanced approaches, such as splicing graph models, address these by representing transcripts as paths in a graph, but residual misassemblies persist in complex splicing events, affecting reliability for isoform discovery.
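The practical weight of a 1–2% base-calling error rate is easier to see with a small calculation. The sketch below assumes errors occur independently at a uniform per-base rate, which is a simplification because real Sanger reads have more errors toward their ends; the function names are illustrative.

```python
def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected number of miscalled bases in a read (uniform rate assumed)."""
    return read_length * error_rate


def prob_error_free(read_length: int, error_rate: float) -> float:
    """Probability that a read contains no miscalled base at all."""
    return (1.0 - error_rate) ** read_length


if __name__ == "__main__":
    for rate in (0.01, 0.02):
        print(rate, expected_errors(500, rate), prob_error_free(500, rate))
```

Under this model a 500-base read at a 1% error rate carries about five errors on average and is almost never error-free, which is why clustering into consensus sequences and frameshift-tolerant tools are needed before coding regions can be trusted.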

Comparison with Modern Sequencing Technologies

The advent of next-generation sequencing (NGS) technologies, particularly RNA sequencing (RNA-seq), has largely superseded expressed sequence tags (ESTs) as the primary method for transcriptome profiling. Unlike ESTs, which rely on low-throughput Sanger sequencing to generate short, single-pass cDNA reads typically 200–800 base pairs in length, RNA-seq enables the capture of full-length transcripts with high depth of coverage, facilitating the detection of novel isoforms, alternative splicing events, and low-abundance genes at a fraction of the cost, with per-sample costs often dropping from thousands of dollars for ESTs to under $100 for RNA-seq by the mid-2010s. This shift has rendered ESTs obsolete for most high-resolution applications, as RNA-seq provides comprehensive, quantitative expression data without the biases introduced by partial cDNA cloning and normalization in EST libraries.

No major new EST sequencing projects have been initiated since approximately 2013, coinciding with the widespread adoption of NGS platforms, and the dedicated dbEST database was retired by the National Center for Biotechnology Information (NCBI) in 2019, with its contents integrated into the broader nucleotide database. The total number of public EST sequences has thus remained stable at around 74 million records, encompassing data primarily generated between the late 1990s and early 2010s from diverse organisms.

Despite their obsolescence in mainstream research, ESTs maintain niche utility in resource-limited settings, particularly for preliminary transcriptome assembly in non-model species such as understudied plants and microbes, where access to NGS infrastructure remains constrained into the 2020s. For instance, Sanger-based EST approaches have supported marker discovery and gene identification in drought-tolerant crops like common bean, offering a low-cost entry point for labs in developing regions lacking high-throughput capabilities. Similarly, EST-derived simple sequence repeat (SSR) markers continue to aid genetic mapping in non-model plant species, complementing limited RNA-seq efforts.

The enduring value of historical EST datasets lies in their reanalysis using modern bioinformatics tools, which enable updated functional annotations, error correction, and integration with contemporary data to refine gene models and reveal previously overlooked transcripts. Such efforts have improved genome annotations in species like the brown alga Ectocarpus by aligning legacy ESTs against high-coverage assemblies, enhancing predictions of splicing variants and expression patterns. This legacy reuse underscores ESTs' role in bridging early transcriptomics with current genomic frameworks, particularly for organisms where initial EST surveys laid foundational expression profiles.
