Gene expression profiling
Gene expression profiling is a molecular biology technique that simultaneously measures the expression levels of thousands of genes in a given biological sample, primarily by quantifying the abundance of messenger RNA (mRNA) transcripts.[1] This method generates a comprehensive snapshot of the transcriptome—the complete set of RNA molecules—enabling the identification of gene expression patterns associated with specific cellular states, developmental stages, environmental responses, or pathological conditions.[2]

The primary technologies for gene expression profiling have evolved significantly since the mid-1990s. Early approaches relied on DNA microarrays, which use immobilized oligonucleotide or cDNA probes on a solid surface to hybridize with labeled RNA targets, allowing quantification through signal intensity measurements limited to known gene sequences.[1] More recently, RNA sequencing (RNA-seq) has become the dominant method, involving the conversion of RNA to complementary DNA (cDNA), fragmentation, and high-throughput sequencing to count RNA-derived reads, offering advantages in detecting novel transcripts, alternative splicing, and low-abundance genes without prior sequence knowledge.[2] Other techniques, such as digital molecular barcoding (e.g., NanoString nCounter), provide targeted quantification but are less comprehensive.[2] Data analysis typically involves normalization to account for technical variations, followed by statistical methods to identify differentially expressed genes and cluster patterns.[2]

In research and medicine, gene expression profiling has transformative applications across diverse fields. In oncology, it enables tumor classification, prognosis prediction, and identification of therapeutic targets, such as distinguishing subtypes of acute myeloid leukemia (AML); similar classification approaches have been applied to non-malignant conditions such as juvenile idiopathic arthritis (JIA).[1] In drug development, it supports tissue-specific target validation—revealing, for instance, epididymis-enriched genes in mice—and toxicogenomics to predict side effects by comparing profiles against databases like DrugMatrix, which catalogs responses to over 600 compounds.[3] Pharmacogenomics applications further personalize treatments by linking expression variations, such as those in cytochrome P450 enzymes like CYP2D6, to drug efficacy and adverse reactions.[3]

Despite its power, gene expression profiling faces challenges that impact reliability and interpretation. Batch effects from experimental variations can confound results, while assumptions of uniform RNA extraction efficiency may overlook transcriptional amplification in certain cells, necessitating spike-in controls for accurate quantification.[2] Reproducibility issues and the high cost of RNA-seq, particularly for large-scale studies, remain barriers, though public repositories like NCBI's Gene Expression Omnibus (GEO)—housing over 6.5 million samples as of 2024—facilitate data sharing and validation.[4] Ongoing advancements aim to integrate profiling with multi-omics data for deeper biological insights.[1]
Fundamentals
Definition and Principles
Gene expression profiling is the simultaneous measurement of the expression levels of multiple or all genes within a biological sample, typically achieved by quantifying the abundance of messenger RNA (mRNA) transcripts, to produce a comprehensive profile representing the transcriptome under defined conditions.[5] This technique captures the dynamic activity of genes, allowing researchers to assess how cellular states, environmental stimuli, or disease processes alter transcriptional output across the genome. The resulting profile provides a snapshot of gene activity, highlighting patterns that reflect biological function and regulation.[6]

The foundational principles of gene expression profiling stem from the central dogma of molecular biology, which outlines the unidirectional flow of genetic information from DNA to RNA via transcription, followed by translation into proteins.[7] Transcription serves as the primary regulatory checkpoint, where external signals modulate the initiation and rate of mRNA synthesis, making it a focal point for profiling efforts.[8] Quantitatively, profiling measures expression as relative or absolute mRNA levels, often expressed as fold changes, to capture differences in gene activation or repression between samples.[9]

Central to this approach are key concepts such as the transcriptome, which comprises the complete set of RNA molecules transcribed from the genome at a specific time, and differential expression, referring to statistically significant variations in gene activity across conditions or cell types.[6][10] Normalization of data typically relies on housekeeping genes—constitutively expressed genes like GAPDH or ACTB that maintain stable levels—to correct for technical biases in measurement.[11] Although mRNA abundance approximates protein production by indicating transcriptional output, this correlation is imperfect due to post-transcriptional controls, including mRNA stability and translational efficiency, which can decouple transcript levels from final protein amounts.[12][13]

As an illustrative example, gene expression profiling of immune cells during bacterial infection often detects upregulation of genes encoding cytokines and antimicrobial peptides, such as those in the interferon pathway, thereby revealing the molecular basis of the host's defensive response.[14]
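A minimal Python sketch of these quantitative concepts, normalizing hypothetical expression values to a housekeeping gene and computing log2 fold changes between two conditions, might look as follows (gene names and values are illustrative only, not measured data):

```python
import numpy as np

# Hypothetical raw expression measurements (arbitrary units) for two samples.
control = {"GAPDH": 1000.0, "IFNB1": 50.0, "DEFB1": 20.0}
infected = {"GAPDH": 1100.0, "IFNB1": 800.0, "DEFB1": 150.0}

def normalize_to_housekeeping(sample, housekeeping="GAPDH"):
    """Express each gene relative to a stably expressed housekeeping gene."""
    ref = sample[housekeeping]
    return {gene: value / ref for gene, value in sample.items()}

ctrl_norm = normalize_to_housekeeping(control)
inf_norm = normalize_to_housekeeping(infected)

# log2 fold change: positive values indicate upregulation after infection.
for gene in ctrl_norm:
    log2fc = np.log2(inf_norm[gene] / ctrl_norm[gene])
    print(f"{gene}: log2FC = {log2fc:+.2f}")
```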
Historical Development
The foundations of gene expression profiling trace back to low-throughput techniques developed in the late 1970s, such as Northern blotting, which enabled the detection and quantification of specific RNA transcripts by hybridizing labeled probes to electrophoretically separated RNA samples transferred to a membrane. This method, introduced by Alwine et al. in 1977, laid the groundwork for measuring mRNA abundance but was limited to analyzing one or a few genes per experiment due to its labor-intensive nature.[15] By the mid-1990s, advancements like Serial Analysis of Gene Expression (SAGE), developed by Velculescu et al., marked a shift toward higher-throughput profiling by generating short sequence tags from expressed genes, allowing simultaneous analysis of thousands of transcripts via Sanger sequencing.[16]

The microarray era began in 1995 with the invention of complementary DNA (cDNA) microarrays by Schena, Shalon, and colleagues under Patrick Brown at Stanford University, enabling parallel hybridization-based measurement of thousands of gene expressions on glass slides printed with DNA probes.[17] Commercialization accelerated in 1996 when Affymetrix released the GeneChip platform, featuring high-density oligonucleotide arrays for genome-wide expression monitoring, as demonstrated in early applications like Lockhart et al.'s work on hybridization to arrays.[18] Microarrays gained widespread adoption during the 2000s, playing a key role in the Human Genome Project's functional annotation efforts and enabling large-scale studies, such as Golub et al.'s 1999 demonstration of cancer subclassification using gene expression patterns from acute leukemias.[19]

The advent of next-generation sequencing (NGS) around 2005, exemplified by the 454 pyrosequencing platform, revolutionized profiling by shifting from hybridization to direct sequencing of cDNA fragments, drastically increasing throughput and reducing biases.[20] RNA-Seq emerged as a cornerstone in 2008 with Mortazavi et al.'s method for mapping and quantifying mammalian transcriptomes through deep sequencing, providing unbiased detection of novel transcripts and precise abundance measurements.[21] By the 2010s, NGS costs plummeted—from millions per genome in the early 2000s to approximately $50–$200 per sample for RNA-Seq as of 2024, trending under $100 by 2025—driving a transition to sequencing-based methods over microarrays for most applications.[22]

In the 2010s, single-cell RNA-Seq (scRNA-Seq) advanced resolution to individual cells, with early protocols like Tang et al. in 2009 evolving into scalable droplet-based systems such as Drop-seq in 2015 by Macosko et al., enabling profiling of thousands of cells to uncover cellular heterogeneity.[23][24] Spatial transcriptomics further integrated positional data, highlighted by 10x Genomics' Visium platform launched in 2019, which captures gene expression on tissue sections at near-single-cell resolution.[25] Into the 2020s, integration of artificial intelligence has enhanced pattern detection in expression data, as seen in models like GET (2025) that simulate and predict gene expression dynamics from sequencing inputs to identify disease-associated regulatory networks.[26]
Techniques
Microarray-Based Methods
Microarray-based methods for gene expression profiling rely on the hybridization of labeled nucleic acids to immobilized DNA probes on a solid substrate, enabling the simultaneous measurement of expression levels for thousands of genes. In this approach, short DNA sequences known as probes, complementary to target genes of interest, are fixed to a chip or slide. Total RNA or mRNA from the sample is reverse-transcribed into complementary DNA (cDNA), labeled with fluorescent dyes, and allowed to hybridize to the probes. The intensity of fluorescence at each probe location, detected via laser scanning, quantifies the abundance of corresponding transcripts, providing a snapshot of gene expression patterns.[17][27]

Two primary types of microarrays are used: cDNA microarrays and oligonucleotide microarrays. cDNA microarrays typically employ longer probes (500–1,000 base pairs) derived from cloned cDNA fragments, which are spotted onto the array surface using robotic printing; these often operate in a two-color format, where samples from two conditions (e.g., control and treatment) are labeled with distinct dyes like Cy3 (green) and Cy5 (red) and hybridized to the same array for direct ratio-based comparisons.[17][28] In contrast, oligonucleotide microarrays use shorter synthetic probes (25–60 mers), either spotted or synthesized in situ; prominent examples include the Affymetrix GeneChip, which features in situ photolithographic synthesis of one-color arrays with multiple probes per gene for mismatch controls to enhance specificity, and Illumina BeadChips, which attach oligonucleotides to microbeads in wells for high-density, one-color detection.[29] NimbleGen arrays represent a variant of oligonucleotide microarrays using maskless photolithography for flexible, high-density probe synthesis, supporting both one- and two-color formats. Spotted arrays (common for cDNA) offer flexibility in custom probe selection but may suffer from variability in spotting, while in situ synthesized arrays provide uniformity and higher probe densities, up to 1.4 million probe sets (comprising over 5 million probes) on platforms like the Affymetrix Exon 1.0 ST array.[28][30]

The standard workflow begins with RNA extraction from cells or tissues, followed by isolation of mRNA and reverse transcription to generate first-strand cDNA. This cDNA is then labeled—using Cy3 and Cy5 for two-color arrays or a single dye like biotin for one-color systems—and hybridized to the microarray overnight under controlled temperature and stringency conditions to allow specific binding. Post-hybridization, unbound material is washed away, and the array is scanned with a laser to measure fluorescence intensities at each probe spot, yielding raw data as pixel intensity values that reflect transcript abundance.[27]

These methods achieved peak adoption in the 2000s for high-throughput profiling of known genes, offering cost-effective analysis for targeted gene panels, but have become niche with the rise of sequencing technologies due to limitations like probe cross-hybridization, which can lead to false positives from non-specific binding, and an inability to detect novel or low-abundance transcripts beyond the fixed probe set.[27] Compared to sequencing, microarrays exhibit lower dynamic range, typically spanning 3–4 orders of magnitude in detection sensitivity.[31] Invented in 1995, this technology revolutionized expression analysis by enabling genome-scale studies.[17]
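The ratio-based readout of a two-color array can be summarized in a short sketch: for each spot, background-corrected Cy5 and Cy3 intensities are converted to a log2 ratio (the M-value used in MA plots). The intensity values below are invented purely for illustration:

```python
import numpy as np

# Hypothetical scanned intensities for a few probes on a two-color cDNA array:
# foreground and local-background values for Cy5 (treatment) and Cy3 (control).
probes = {
    "TP53": {"cy5_fg": 5200, "cy5_bg": 300, "cy3_fg": 2600, "cy3_bg": 280},
    "MYC":  {"cy5_fg": 900,  "cy5_bg": 310, "cy3_fg": 3500, "cy3_bg": 290},
    "ACTB": {"cy5_fg": 8100, "cy5_bg": 305, "cy3_fg": 7900, "cy3_bg": 300},
}

def log_ratio(spot, floor=1.0):
    """Background-subtract each channel and return the log2(Cy5/Cy3) ratio."""
    cy5 = max(spot["cy5_fg"] - spot["cy5_bg"], floor)
    cy3 = max(spot["cy3_fg"] - spot["cy3_bg"], floor)
    return np.log2(cy5 / cy3)

for gene, spot in probes.items():
    print(f"{gene}: M = {log_ratio(spot):+.2f}")  # positive = higher in treatment
```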
Sequencing-Based Methods
Sequencing-based methods for gene expression profiling primarily rely on RNA sequencing (RNA-Seq), which enables comprehensive, unbiased measurement of the transcriptome by directly sequencing RNA molecules or their complementary DNA derivatives. Introduced as a transformative approach in the late 2000s, RNA-Seq has become the gold standard for transcriptomics since the 2010s, surpassing microarray techniques due to its ability to detect novel transcripts without prior knowledge of gene sequences. The core mechanism begins with RNA extraction from cells or tissues, followed by fragmentation to generate shorter pieces suitable for sequencing. These fragments are then reverse-transcribed into complementary DNA (cDNA), which undergoes library preparation involving end repair, adapter ligation, and amplification to create a sequencing-ready library.[32]

Next-generation sequencing (NGS) platforms, such as Illumina's short-read systems, are commonly used to sequence these libraries, producing millions of reads that represent the original RNA population. The resulting data, typically output as FASTQ files containing raw sequence reads, require alignment to a reference genome using tools like STAR or HISAT2 to map reads accurately, accounting for splicing events. Quantification occurs by counting aligned reads per gene or transcript, often via featureCounts or Salmon, yielding digital expression measures in the form of read counts.[32] A key step in library preparation is mRNA enrichment, either through poly-A selection for eukaryotic polyadenylated transcripts or ribosomal RNA (rRNA) depletion to capture non-coding and prokaryotic RNAs, ensuring comprehensive coverage. Sequencing depth for human samples generally ranges from 20 to 50 million reads per sample to achieve robust detection of expressed genes, with higher depths for low-input or complex analyses.[33]

Bulk RNA-Seq represents the standard variant, aggregating expression from millions of cells to provide an average profile suitable for population-level studies. Single-cell RNA-Seq (scRNA-Seq) extends this to individual cells, enabling dissection of cellular heterogeneity; droplet-based methods like those from 10x Genomics, commercialized around 2016, fueled an explosion in scRNA-Seq applications post-2016 by allowing high-throughput profiling of thousands to tens of thousands of cells per run.[34] Long-read sequencing technologies, such as PacBio's Iso-Seq, offer full-length transcript coverage, excelling in isoform resolution and alternative splicing detection without the need for computational assembly of short reads. Spatial RNA-Seq variants, including 10x Genomics' Visium platform launched in 2019 and building on earlier spatial transcriptomics methods from 2016, preserve tissue architecture by capturing transcripts on spatially barcoded arrays, mapping expression to specific locations within samples.

These methods provide key advantages, including the discovery of novel transcripts, precise quantification of alternative splicing, and sensitive detection of low-abundance genes, which microarrays cannot achieve due to reliance on predefined probes. RNA-Seq exhibits a dynamic range exceeding 10^5-fold, far surpassing the ~10^3-fold of arrays, allowing accurate measurement across expression levels from rare transcripts to highly abundant ones.
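To make the quantification step concrete, the toy sketch below counts aligned reads falling within simplified gene intervals, producing the kind of per-gene count table that tools such as featureCounts generate at full scale from BAM files; the gene models and read positions here are invented, and real pipelines handle strandedness, multi-mapping, and exon structure:

```python
from collections import Counter

# Toy gene model: gene -> (chromosome, start, end), 1-based inclusive.
genes = {
    "geneA": ("chr1", 100, 500),
    "geneB": ("chr1", 800, 1200),
}

# Toy aligned reads: (chromosome, 5' position). In practice these come from
# a BAM file produced by a splice-aware aligner such as STAR or HISAT2.
reads = [("chr1", 150), ("chr1", 480), ("chr1", 900),
         ("chr1", 1100), ("chr1", 1150), ("chr2", 300)]

counts = Counter()
for chrom, pos in reads:
    for gene, (g_chrom, start, end) in genes.items():
        if chrom == g_chrom and start <= pos <= end:
            counts[gene] += 1

print(dict(counts))  # e.g. {'geneA': 2, 'geneB': 3} -> raw digital expression counts
```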
By 2025, costs for bulk RNA-Seq have declined to under $200 per sample, including library preparation and sequencing, driven by advances in multiplexing and platform efficiency.[22] In precision medicine, RNA-Seq variants like scRNA-Seq are increasingly applied to resolve tumor heterogeneity, informing personalized therapies by revealing subclonal variations and therapeutic responses as of 2025.[35]
Other Techniques
Other methods for gene expression profiling include digital molecular barcoding approaches, such as the NanoString nCounter system, which uses color-coded barcoded probes to directly hybridize with target RNA molecules without amplification or sequencing. This technique enables targeted quantification of up to 1,000 genes per sample with high precision and reproducibility, particularly useful for clinical diagnostics and validation studies due to its low technical variability and ability to handle degraded RNA.[2] Unlike microarrays, NanoString provides digital counts rather than analog signals, reducing background noise, though it is limited to predefined gene panels and less comprehensive than RNA-Seq.[36]
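The digital nature of molecular barcoding can be sketched as straightforward counting of decoded barcodes, where each detected barcode maps to one target transcript and contributes one count; the barcode-to-gene mapping and decoded events below are hypothetical and do not reflect the actual nCounter chemistry or file formats:

```python
from collections import Counter

# Hypothetical mapping from decoded color-code barcodes to target genes.
barcode_to_gene = {"RRGBY": "IL6", "GGRBY": "TNF", "BYRGR": "ACTB"}

# Hypothetical stream of decoded barcode events from one sample.
decoded_events = ["RRGBY", "BYRGR", "RRGBY", "GGRBY", "BYRGR",
                  "BYRGR", "RRGBY", "XXXXX"]  # unrecognized codes are discarded

counts = Counter(
    barcode_to_gene[code] for code in decoded_events if code in barcode_to_gene
)
print(dict(counts))  # digital counts per target, e.g. {'IL6': 3, 'ACTB': 3, 'TNF': 1}
```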
Data Acquisition and Preprocessing
Experimental Design
The experimental design phase of gene expression profiling begins with clearly defining the biological question to guide all subsequent decisions, such as investigating the effect of a treatment on gene expression in specific cell types or tissues. This involves specifying the hypothesis, such as detecting differential expression due to a drug perturbation or disease state, to ensure the experiment addresses targeted objectives rather than exploratory aims. For instance, questions focused on treatment effects might prioritize controlled perturbations like siRNA knockdown or pharmacological interventions, while those involving disease modeling could use patient-derived samples. Adhering to established guidelines, such as the Minimum Information About a Microarray Experiment (MIAME) introduced in 2001, ensures comprehensive documentation of experimental parameters for reproducibility, with updates extending to sequencing-based methods via MINSEQE.[37][38]

Sample selection and preparation are critical, encompassing choices like cell lines for in vitro studies, animal tissues for preclinical models, or human biopsies for clinical relevance. Biological replicates, derived from independent sources (e.g., different animals or patients), are essential to capture variability, with a minimum of three recommended per group to enable statistical inference, though six or more enhance power for detecting subtle changes. Technical replicates, which assess measurement consistency, should supplement but not replace biological ones. To mitigate systematic biases, randomization of sample processing order is employed to avoid batch effects, where unintended variations from equipment or timing confound results. Controls include reference samples (e.g., untreated baselines) and exogenous spike-ins like ERCC controls for RNA-Seq, which provide standardized benchmarks for normalization and sensitivity assessment across experiments.[39][40]

Sample size determination relies on power analysis to detect desired fold changes (e.g., 1.5- to 2-fold) with adequate statistical power (typically 80–90%), factoring in expected variability and sequencing depth; tools like RNAseqPS facilitate this by simulating Poisson or negative binomial distributions for RNA-Seq data. For microarray experiments, similar calculations apply but emphasize probe hybridization efficiency. Platform selection weighs microarrays for cost-effective, targeted profiling of known genes against RNA-Seq for unbiased, comprehensive transcriptome coverage, including low-abundance transcripts and isoforms, though RNA-Seq incurs higher costs and requires deeper sequencing for rare events. High-throughput formats, such as 96-well plates for single-cell RNA-Seq, support scaled designs but demand careful optimization. When human samples are involved, ethical oversight via Institutional Review Board (IRB) approval is mandatory to ensure informed consent, privacy protection, and minimal risk. Best practices from projects like ENCODE in the 2010s emphasize these elements for robust, reproducible RNA-Seq designs.[41][42][43][44][45]
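Power calculations of the kind described above can be approximated by simulation. The sketch below estimates, under deliberately simplified assumptions (negative binomial counts, a Welch t-test on log-transformed values in place of a dedicated RNA-Seq test, nominal significance rather than FDR), the power to detect a twofold change for a single gene with a given number of biological replicates:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def nb_counts(mean, dispersion, size):
    """Draw negative binomial counts with var = mean + dispersion * mean^2."""
    n = 1.0 / dispersion            # NB "size" parameter
    p = n / (n + mean)
    return rng.negative_binomial(n, p, size)

def estimate_power(n_reps, base_mean=100, fold_change=2.0,
                   dispersion=0.1, alpha=0.05, n_sim=2000):
    hits = 0
    for _ in range(n_sim):
        ctrl = nb_counts(base_mean, dispersion, n_reps)
        treat = nb_counts(base_mean * fold_change, dispersion, n_reps)
        # Simple surrogate test on log counts (pseudocount avoids log(0)).
        _, p = stats.ttest_ind(np.log2(ctrl + 1), np.log2(treat + 1),
                               equal_var=False)
        hits += p < alpha
    return hits / n_sim

for reps in (3, 6):
    print(f"{reps} replicates/group: power ≈ {estimate_power(reps):.2f}")
```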
Normalization and Quality Control
Normalization and quality control are essential initial steps in processing gene expression data to ensure reliability and comparability across samples. Normalization addresses technical variations such as differences in starting RNA amounts, library preparation efficiencies, and sequencing depths, while quality control identifies and mitigates artifacts like low-quality reads or outliers that could skew downstream analyses. These processes aim to remove systematic biases without altering biological signals, enabling accurate quantification of gene expression levels.[46]

For microarray-based methods, quantile normalization is a widely adopted technique that adjusts probe intensities so that the distribution of values across arrays matches a reference distribution, typically the average empirical distribution of all samples. This method assumes that most genes are not differentially expressed and equalizes the rank-order statistics between arrays, effectively correcting for global shifts and scaling differences. Introduced by Bolstad et al., quantile normalization has become standard in tools like the limma package for preprocessing Affymetrix and other oligonucleotide arrays.[47][48]

In sequencing-based methods like RNA-seq, normalization accounts for both sequencing depth (library size) and gene length biases to produce comparable expression estimates. Common metrics include reads per kilobase of transcript per million mapped reads (RPKM), fragments per kilobase of transcript per million mapped reads (FPKM) for paired-end data, and transcripts per million (TPM), which scales RPKM to sum to 1 million across genes for better cross-sample comparability. TPM is calculated as:

\text{TPM}_{i} = \frac{ \text{reads mapped to gene } i / \text{gene length in kb} }{ \sum_{j} \left( \text{reads mapped to gene } j / \text{gene length in kb} \right) } \times 1,000,000

This formulation ensures length- and depth-normalized values that are additive across transcripts. For count-based differential analysis, methods like the median-of-ratios approach in DESeq2 estimate size factors by dividing each gene's counts by its geometric mean across samples, then taking the median of these ratios within each sample as the normalization factor.[49][50]

Quality control begins with assessing RNA integrity using the RNA Integrity Number (RIN), an automated metric derived from electropherogram analysis that scores total RNA from 1 (degraded) to 10 (intact), with values above 7 generally recommended for reliable gene expression profiling. For sequencing data, tools like FastQC evaluate raw reads for per-base quality scores, adapter contamination, overrepresented sequences, and GC content bias, flagging issues that necessitate trimming or filtering. Post-alignment, principal component analysis (PCA) plots visualize sample clustering to detect outliers, while saturation curves assess sequencing depth adequacy by plotting unique reads against total reads. Low-quality reads are typically removed using thresholds such as Phred scores below 20.[51][52]

Batch effects, arising from technical variables like different experimental runs or reagent lots, can confound biological interpretations and are detected via PCA or surrogate variable analysis showing non-biological clustering. The ComBat method corrects these using an empirical Bayes framework that adjusts expression values while preserving biological variance, modeling batch as a covariate in a parametric or non-parametric manner.
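The TPM formula above and the DESeq2-style median-of-ratios size factors can be written out directly in a few lines; the count matrix below is a toy example (genes as rows, samples as columns) rather than real data, and the functions are illustrative rather than the packages' own implementations:

```python
import numpy as np

# Toy count matrix: rows = genes, columns = samples.
counts = np.array([[500, 1000],
                   [100,  220],
                   [ 50,   90]], dtype=float)
lengths_kb = np.array([2.0, 1.0, 0.5])       # gene lengths in kilobases

def tpm(counts, lengths_kb):
    """Transcripts per million: length-normalize, then scale each sample to 1e6."""
    rate = counts / lengths_kb[:, None]       # reads per kb of transcript
    return rate / rate.sum(axis=0) * 1e6

def size_factors(counts):
    """Median-of-ratios size factors (genes containing zeros are excluded)."""
    log_counts = np.log(counts)
    log_geo_means = log_counts.mean(axis=1)   # per-gene geometric mean, log scale
    finite = np.isfinite(log_geo_means)
    ratios = log_counts[finite] - log_geo_means[finite][:, None]
    return np.exp(np.median(ratios, axis=0))  # one factor per sample

print("TPM:\n", np.round(tpm(counts, lengths_kb), 1))
print("Size factors:", np.round(size_factors(counts), 3))
```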
Spike-in controls, such as External RNA Controls Consortium (ERCC) mixes added at known concentrations, facilitate absolute quantification and validation of normalization by providing an independent scale for technical performance assessment.[53][54]

Common pitfalls include ignoring 3' bias in poly-A selected RNA-seq, where reverse transcription from oligo-dT primers favors reads near the poly-A tail, leading to uneven coverage and distorted expression estimates for genes with varying 3' UTR lengths. Replicates specified during experimental design also support robust QC by enabling variance estimation and outlier detection. Software packages like edgeR (using trimmed mean of M-values, TMM, normalization) and limma (with voom transformation for count data) integrate these preprocessing steps seamlessly before differential expression analysis.[55][46][48]
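One simple way to use spike-ins for technical QC is to fit observed spike-in counts against their known input concentrations on a log-log scale: a slope near one and a high correlation suggest the assay preserves relative abundance. The concentrations and counts below are invented for illustration:

```python
import numpy as np

# Hypothetical ERCC-style spike-ins: known input concentration and the
# counts observed for them in one sample.
known_conc = np.array([0.5, 2.0, 8.0, 32.0, 128.0])   # arbitrary units
observed = np.array([12, 45, 190, 760, 3100], dtype=float)

x = np.log2(known_conc)
y = np.log2(observed)

slope, intercept = np.polyfit(x, y, 1)   # log-log linear fit
r = np.corrcoef(x, y)[0, 1]

print(f"slope ≈ {slope:.2f} (ideal ≈ 1), correlation R = {r:.3f}")
# The intercept can additionally serve as a sample-specific scaling factor.
```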
Analysis Methods
Differential Expression Analysis
Differential expression analysis identifies genes whose expression levels differ significantly between experimental conditions, such as treated versus control samples or disease versus healthy states, by comparing normalized expression values across groups.[50] The core metric is the log2 fold change (log2FC), which quantifies the magnitude of change on a logarithmic scale, where a log2FC of 1 indicates a twofold upregulation and -1 a twofold downregulation.[56] Statistical significance is assessed using hypothesis tests to compute p-values, which are then adjusted for multiple testing across thousands of genes to control the false discovery rate (FDR) via methods like the Benjamini-Hochberg procedure. This approach assumes that expression data have been preprocessed through normalization to account for technical variations.

For microarray data, which produce continuous intensity values, standard parametric tests like the t-test are commonly applied, though moderated versions improve reliability by borrowing information across genes. The limma package implements linear models with empirical Bayes moderation of t-statistics, enhancing power for detecting differences in small sample sizes. In contrast, RNA-Seq data consist of discrete read counts following a negative binomial distribution to model biological variability and overdispersion, where variance exceeds the mean.[50] Tools like DESeq2 fit generalized linear models assuming variance = mean + α × mean², with shrinkage estimation for dispersions and fold changes to stabilize estimates for low-count genes.[50] Similarly, edgeR employs empirical Bayes methods to estimate common and tagwise dispersions, enabling robust testing even with limited replicates.[56]

Genes are typically selected as differentially expressed using thresholds such as |log2FC| > 1 and FDR < 0.05, balancing biological relevance and statistical confidence.[50] Results are often visualized in volcano plots, scatter plots of log2FC against -log10(p-value), where points above significance cutoffs highlight differentially expressed genes.[57] For RNA-Seq, zero counts—common due to low expression or technical dropout—are handled by adding small pseudocounts (e.g., 1) before log transformation for fold change calculation, preventing undefined values, though testing models like DESeq2 avoid pseudocounts in likelihood-based inference.[57] Power to detect changes depends on sample size, sequencing depth, and effect size; for instance, detecting a twofold change (log2FC = 1) at 80% power and FDR < 0.05 often requires at least three replicates per group for moderately expressed genes.

In practice, this analysis has revealed upregulated genes in cancer tissues compared to normal, such as proliferation markers like MKI67 in breast tumors, aiding molecular classification. Early microarray studies on acute leukemias identified sets of upregulated oncogenes distinguishing subtypes, demonstrating the method's utility in biomarker discovery.
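A stripped-down version of this workflow, using a Welch t-test on log-transformed values and a hand-rolled Benjamini-Hochberg adjustment in place of the moderated or negative binomial models implemented in limma, DESeq2, or edgeR, is sketched below on simulated counts:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes, n_reps = 1000, 3

# Simulated normalized counts; the first 50 genes get a true 4-fold increase.
base = rng.lognormal(mean=5, sigma=1, size=n_genes)
ctrl = rng.poisson(base[:, None], size=(n_genes, n_reps)).astype(float)
treat_mean = base.copy()
treat_mean[:50] *= 4
treat = rng.poisson(treat_mean[:, None], size=(n_genes, n_reps)).astype(float)

# log2 fold change with a pseudocount of 1 to avoid log(0).
log2fc = np.log2(treat.mean(axis=1) + 1) - np.log2(ctrl.mean(axis=1) + 1)

# Welch t-test per gene on log-transformed values.
_, pvals = stats.ttest_ind(np.log2(treat + 1), np.log2(ctrl + 1),
                           axis=1, equal_var=False)

def benjamini_hochberg(p):
    """Return Benjamini-Hochberg adjusted p-values (FDR)."""
    p = np.asarray(p)
    order = np.argsort(p)
    ranked = p[order] * len(p) / (np.arange(len(p)) + 1)
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    out = np.empty_like(adjusted)
    out[order] = np.clip(adjusted, 0, 1)
    return out

fdr = benjamini_hochberg(pvals)
significant = (np.abs(log2fc) > 1) & (fdr < 0.05)
print(f"{significant.sum()} genes pass |log2FC| > 1 and FDR < 0.05")
```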
Statistical and Computational Tools
Gene expression profiling generates high-dimensional datasets where the number of genes often exceeds the number of samples, necessitating advanced statistical and computational tools to uncover patterns and make predictions. Unsupervised learning methods, such as clustering and dimensionality reduction, are fundamental for exploring inherent structures in these data without predefined labels. Clustering algorithms group genes or samples based on similarity in expression profiles; for instance, hierarchical clustering builds a tree-like structure to reveal nested relationships, while k-means partitioning assigns data points to a fixed number of clusters by minimizing intra-cluster variance. These techniques have been pivotal since the late 1990s, enabling the identification of co-expressed gene modules and sample subtypes in microarray experiments.[58]

Dimensionality reduction complements clustering by projecting high-dimensional data into lower-dimensional spaces to mitigate noise and enhance visualization. Principal Component Analysis (PCA) achieves this linearly by identifying directions of maximum variance, commonly used as a preprocessing step in gene expression workflows to retain the top principal components that capture most variability. Nonlinear methods like t-distributed Stochastic Neighbor Embedding (t-SNE) preserve local structures for visualizing clusters in two or three dimensions, particularly effective for single-cell RNA-seq data where it reveals cell type separations. Uniform Manifold Approximation and Projection (UMAP) offers a faster alternative to t-SNE, balancing local and global data structures while scaling better to large datasets, as demonstrated in comparative evaluations of transcriptomic analyses.[59]

Supervised learning methods leverage labeled data to train models for classification or regression tasks in gene expression profiling. Support Vector Machines (SVMs) construct hyperplanes to separate classes with maximum margins, proving robust for phenotype prediction from expression profiles, such as distinguishing cancer subtypes, through efficient handling of high-dimensional inputs via kernel tricks. Random Forests, an ensemble of decision trees, aggregate predictions to reduce overfitting and provide variable importance rankings, widely applied in genomic classification for tasks like tumor identification with high accuracy on microarray data. Regression variants, including ridge or lasso, predict continuous traits like drug response by penalizing coefficients to address multicollinearity in expression matrices.[60][61]

Advanced techniques extend these approaches to infer complex relationships and enhance interpretability. Weighted Gene Co-expression Network Analysis (WGCNA) constructs scale-free networks from pairwise gene correlations, using soft thresholding to identify modules of co-expressed genes that correlate with traits, as formalized in its foundational framework for microarray and RNA-seq data. For machine learning models in the 2020s, SHapley Additive exPlanations (SHAP) quantifies feature contributions to predictions, aiding interpretability in genomic applications like variant effect scoring by attributing importance to specific genes or interactions.[62][63]

Software ecosystems facilitate implementation of these tools. In R, Bioconductor packages like clusterProfiler support clustering and downstream exploration of gene groups, integrating statistical tests for profile comparisons.
Python's Scanpy toolkit streamlines single-cell RNA-seq analysis, incorporating UMAP, Leiden clustering, and batch correction for scalable processing of millions of cells.[64][65]

High dimensionality poses the "curse of dimensionality," where sparse data lead to overfitting and unreliable distance measures; mitigation strategies include feature selection to retain informative genes and embedding into lower dimensions via PCA or autoencoders before modeling. Recent advancements in gene pair methods, which focus on ratios or differences between the expression levels of gene pairs, have improved biomarker discovery by reducing dimensionality while preserving relational information, as demonstrated in a 2025 review of gene pair methods in clinical research for precision medicine and in 2025 studies applying them to cancer subtyping.[66][67] These approaches yield robust signatures with fewer features than single-gene models, enhancing predictive power in heterogeneous datasets. As of 2025, integration of deep learning models, such as graph neural networks for co-expression analysis, has further advanced pattern detection in large-scale transcriptomic data.[68]
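As a compact illustration of the unsupervised workflow described above, the sketch below standardizes a simulated expression matrix, reduces it with PCA, and partitions samples with k-means using scikit-learn; the matrix, group structure, and parameter choices are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Simulated log-expression matrix: 20 samples x 500 genes, with two hidden
# groups that differ in the first 50 genes.
X = rng.normal(size=(20, 500))
X[10:, :50] += 2.0

X_scaled = StandardScaler().fit_transform(X)   # per-gene z-scores

# Keep the top principal components, which capture most of the variance.
pca = PCA(n_components=5)
X_pcs = pca.fit_transform(X_scaled)
print("variance explained:", np.round(pca.explained_variance_ratio_, 2))

# Partition samples into two clusters in the reduced space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_pcs)
print("cluster assignments:", labels)
```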
Functional Annotation and Pathway Analysis
Functional annotation involves mapping differentially expressed genes to known biological functions, processes, and components using standardized ontologies and databases. The Gene Ontology (GO) consortium provides a structured vocabulary for annotating genes across three domains: molecular function, biological process, and cellular component, enabling systematic classification of gene products.[69] Databases such as UniProt integrate GO terms with protein sequence and functional data, while Ensembl and NCBI Gene offer comprehensive gene annotations derived from experimental evidence, computational predictions, and literature curation.[70][71][72] In RNA-Seq profiling, handling transcript isoforms is crucial, as multiple isoforms per gene can contribute to expression variability; tools often aggregate isoform-level counts to gene-level summaries or use isoform-specific quantification to avoid underestimating functional diversity.[73]

Tools like DAVID and g:Profiler facilitate ontology assignment by integrating multiple annotation sources for high-throughput analysis of gene lists. DAVID clusters functionally related genes and terms into biological modules, supporting GO enrichment alongside other annotations from over 40 databases.[74] g:Profiler performs functional profiling by mapping genes to GO terms, pathways, and regulatory motifs, with support for over 500 organisms and regular updates from Ensembl.[75] These tools assign annotations based on evidence codes, prioritizing experimentally validated terms to ensure reliability in interpreting expression profiles.

Pathway analysis extends annotation by identifying coordinated changes in biological pathways, using enrichment tests to detect over-representation of profiled genes in predefined sets. Common databases include KEGG, which maps genes to metabolic and signaling pathways, and Reactome, focusing on detailed reaction networks. Over-representation analysis (ORA) applies to lists of differentially expressed genes, employing the hypergeometric test (equivalent to Fisher's exact test) to compute significance:

p = \sum_{i = x}^{n} \frac{\binom{n}{i} \binom{N - n}{M - i}}{\binom{N}{M}}
where N is the total number of genes, n the number in the pathway, M the number of differentially expressed genes, and x the observed overlap.[76] This test assesses whether pathway genes are enriched beyond chance, with multiple-testing corrections like Benjamini-Hochberg to control false positives.

Gene Set Enrichment Analysis (GSEA) complements ORA by evaluating ranked gene lists from full expression profiles, detecting subtle shifts in pathway activity without arbitrary significance cutoffs.[77] GSEA uses a Kolmogorov-Smirnov-like statistic to measure enrichment at the top or bottom of the ranking, weighted by the gene metric, and permutes phenotypes to estimate empirical p-values. For example, in cancer studies, GSEA has revealed upregulation of the PI3K-AKT pathway, linking altered expression of genes like PIK3CA and AKT1 to tumor proliferation and survival.[78]

Regulated genes are often categorized by function to highlight regulatory mechanisms, such as grouping into transcription factors (e.g., the E2F family regulating cell cycle and apoptosis genes) or apoptosis-related sets (e.g., BCL2 family modulators).[79] This categorization integrates annotations to infer upstream regulators and downstream effects, aiding in the interpretation of co-regulated patterns in expression profiles. As of 2025, advancements in pathway analysis include AI-driven tools for dynamic pathway modeling, enhancing predictions of pathway perturbations in disease contexts.[80]
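The over-representation test given above corresponds to the upper tail of the hypergeometric distribution, so it can be computed directly with scipy; the pathway sizes and overlap in this sketch are arbitrary example values:

```python
from scipy.stats import hypergeom

# Hypothetical over-representation analysis for a single pathway.
N_total = 20000       # N: total annotated genes
n_pathway = 150       # n: genes in the pathway
M_de = 500            # M: differentially expressed genes
x_overlap = 12        # x: pathway genes found among the DE list

# P(X >= x) under the hypergeometric null, matching the summation above.
p_value = hypergeom.sf(x_overlap - 1, N_total, n_pathway, M_de)

expected = n_pathway * M_de / N_total
print(f"expected overlap ≈ {expected:.1f}, observed = {x_overlap}, p = {p_value:.3g}")
# In practice this test is repeated for every pathway and the resulting
# p-values are adjusted, e.g. with the Benjamini-Hochberg procedure.
```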