Gene set enrichment analysis
Gene set enrichment analysis (GSEA) is a computational method in bioinformatics that determines whether an a priori defined set of genes—such as those sharing a common biological function, chromosomal location, or regulatory mechanism—exhibits statistically significant, concordant differences between two biological states, such as distinct phenotypes or experimental conditions.[1] Developed to interpret genome-wide expression profiles, GSEA addresses limitations of traditional single-gene analyses by focusing on coordinated changes across gene sets, thereby revealing subtle biological signals that might otherwise be obscured by noise or individual gene variability.[1]
At a high level, GSEA operates by first ranking all genes in a dataset according to their differential expression or correlation with a phenotype of interest.[1] It then calculates an enrichment score for each gene set, measuring the degree to which its members are overrepresented at the extremes of this ranked list (either upregulated or downregulated), with statistical significance assessed through permutation testing to generate empirical p-values and account for multiple hypothesis testing via false discovery rate (FDR) adjustment.[1] This approach is particularly powerful for analyzing microarray or RNA-seq data, where it can detect pathway-level perturbations even when no single gene reaches conventional significance thresholds.[2]
Key advantages of GSEA include its ability to uncover biologically meaningful themes across diverse datasets, such as identifying common oncogenic signatures in multiple cancer studies, and its robustness to small sample sizes compared to overrepresentation analysis methods.[1] However, benchmarks have shown that while GSEA performs well in prioritizing relevant gene sets, it can be computationally intensive and conservative in some scenarios, with alternatives like PADOG sometimes ranking phenotype-relevant sets higher in RNA-seq applications.[2] Widely implemented through open-source software from the Broad Institute, GSEA is often paired with databases like the Molecular Signatures Database (MSigDB), which provides curated gene sets from sources including Gene Ontology and Reactome, enabling reproducible analyses in fields like oncology and developmental biology.[3]
Background and Motivation
Definition and Purpose
Gene set enrichment analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states, such as phenotypes like disease versus healthy.[3] It analyzes genome-wide expression profiles by evaluating whether members of predefined gene sets—such as those representing biochemical pathways, chromosomal locations, or co-regulated genes—are enriched at the extremes of a ranked list of genes, thereby revealing coordinated biological activity.[4]
The primary purpose of GSEA is to overcome the limitations of traditional single-gene analysis, which often requires stringent thresholds to identify differentially expressed genes and can miss subtle, collective changes across gene groups that are biologically meaningful.[4] By focusing on the collective behavior of gene sets, GSEA detects modest expression alterations that might be undetectable at the individual gene level, improving the signal-to-noise ratio and enhancing the detection of relevant biological processes.[4] Key benefits include reducing the multiple testing burden by testing hypotheses on gene sets rather than thousands of individual genes, thereby mitigating the problem of false positives through methods like false discovery rate (FDR) correction; incorporating prior biological knowledge from curated databases to guide interpretation; and applicability to ranked lists derived from high-throughput data, including microarrays and RNA-seq.[5][3]
In terms of inputs, GSEA requires a ranked list of genes (typically ordered by differential expression metrics like fold change or t-statistic) and a collection of predefined gene sets.[4] Outputs include enrichment scores quantifying the degree to which a gene set is overrepresented, along with p-values or FDR-adjusted significance measures to assess statistical relevance.[3]
Historical Development
The roots of gene set enrichment analysis (GSEA) trace back to earlier over-representation analyses in genomics, which emerged in the late 1990s alongside the rise of microarray technologies and pathway databases. These precursor methods typically employed statistical tests such as Fisher's exact test to identify whether predefined gene sets were disproportionately represented in lists of differentially expressed genes, often in the context of functional annotation.[6] Concurrently, pathway analysis tools like the Kyoto Encyclopedia of Genes and Genomes (KEGG), initiated in 1995, provided foundational resources for organizing biological pathways and enabled initial explorations of coordinated gene activity. These approaches, however, were restricted to thresholded lists of the most extreme genes and struggled to detect subtle, distributed changes across the genome.
A pivotal precursor to modern GSEA appeared in 2003, when Mootha et al. introduced an analytical strategy to detect coordinated expression changes in functionally related gene groups, applied to oxidative phosphorylation genes in human diabetes using microarray data from skeletal muscle. This work highlighted the need for methods beyond simple over-representation to capture modest but consistent shifts in gene sets. Building directly on this foundation, GSEA was formally introduced in 2005 by Subramanian et al. at the Broad Institute and UC San Diego as a computational method to evaluate whether predefined gene sets exhibit statistically significant enrichment at the top or bottom of a ranked list of genes, addressing limitations of prior techniques by considering the entire transcriptome.[4] In parallel, the Molecular Signatures Database (MSigDB) was created in 2005 as a comprehensive collection of over 1,300 curated gene sets, including pathways from KEGG and other sources, to support GSEA implementations.[4]
Throughout the 2010s, GSEA evolved from its microarray-centric origins to accommodate emerging high-throughput technologies, with key extensions enhancing its applicability to next-generation sequencing (NGS) and proteomics data. Adaptations such as SeqGSEA (2013) modified the core algorithm for count-based RNA-seq data, preserving the Kolmogorov-Smirnov-like statistic while handling discrete distributions.[6] Single-sample GSEA (ssGSEA), introduced in 2009, further expanded the method to individual samples without requiring phenotypic comparisons, facilitating analyses in heterogeneous datasets like single-cell RNA-seq.[7] By the mid-2010s, integrations with multi-omics workflows became prominent, allowing GSEA to incorporate proteomics and epigenomics alongside transcriptomics for more holistic pathway assessments.[6]
Post-2020 developments have focused on improving software usability and support for diverse data types. The GSEA software reached version 4.0 in 2020, introducing modules for analysis of leading-edge subsets and improved support for NGS inputs, marking a shift toward versatile, user-friendly tools for diverse omics platforms; as of March 2025, it has been updated to version 4.4.0 with enhancements for Java 21 compatibility and bug fixes.[3] This evolution has solidified GSEA's role in bioinformatics standards, influencing widespread adoption in fields from cancer genomics to single-cell studies.[6]
Core Concepts
Gene Sets and Databases
Gene sets in gene set enrichment analysis (GSEA) are curated collections of genes that share common biological characteristics, such as functions, cellular locations, or regulatory mechanisms, including pathways, Gene Ontology (GO) terms, and hallmark processes.[4] These sets serve as predefined groups that enable the detection of coordinated changes in gene expression or activity, providing interpretable biological context beyond individual gene analysis.[8]
Various types of gene sets are utilized in GSEA to capture diverse biological knowledge. Hallmark gene sets represent simplified, non-redundant summaries of well-defined biological states, derived by aggregating multiple related founder sets through automated overlap analysis and expert curation.[9] Canonical pathway gene sets, such as those from Reactome or KEGG, describe molecular interaction networks involved in cellular processes like metabolism and signaling.[10] Computational gene sets, often based on motifs or co-expression patterns, infer regulatory relationships, while immunologic signatures focus on immune cell-specific or response-related genes.[10]
Major databases provide these gene sets for GSEA applications. The Molecular Signatures Database (MSigDB), maintained by the Broad Institute, is the primary resource, containing over 35,000 annotated gene sets divided into collections such as C2 (curated pathways from sources like KEGG and WikiPathways, with 7,561 sets), C6 (oncogenic signatures, 189 sets), and C7 (immunologic signatures, 5,219 sets).[10] The KEGG (Kyoto Encyclopedia of Genes and Genomes) database offers manually curated pathway maps representing molecular interactions and reactions in biological systems.[11] The Gene Ontology (GO) resource provides structured vocabularies for gene functions across biological processes, molecular functions, and cellular components, with over 16,000 sets in MSigDB's C5 collection.[12][10] WikiPathways is a community-curated, open platform for collaboratively editing and sharing pathway diagrams, integrated into MSigDB's C2 subcollection with 885 sets.[13] Enrichr aggregates diverse libraries, including disease ontologies and transcription factor targets, supporting over 100 gene set collections for enrichment queries.[14][15]
Gene sets are curated from scientific literature, experimental data, and expert contributions, with ongoing updates to reflect new knowledge. For instance, MSigDB's curation involves extracting signatures from publications and pathway databases, followed by validation and annotation; version 7.0, released in 2019, introduced Ensembl-based gene mapping and expanded collections, including refinements to hallmark sets.[16][17] In GSEA, these predefined sets are essential for providing biological interpretability to ranked gene lists, with a typical selection criterion of 15 to 500 genes per set to minimize statistical biases from very small or large groups.[18][19]
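To illustrate the 15-to-500-gene selection criterion mentioned above, the following Python sketch filters a gene set collection by size; restricting each set to the genes actually measured in the dataset before applying the size limits is a common practical step rather than a requirement stated here, and the set names and gene symbols are purely hypothetical.

```python
def filter_gene_sets(gene_sets, measured_genes, min_size=15, max_size=500):
    """Keep gene sets whose overlap with the measured genes falls within the size limits.

    gene_sets      : dict mapping set name -> iterable of gene identifiers
    measured_genes : identifiers present in the expression dataset
    The 15-500 defaults mirror the typical selection criterion described above.
    """
    measured = set(measured_genes)
    filtered = {}
    for name, members in gene_sets.items():
        overlap = measured.intersection(members)
        if min_size <= len(overlap) <= max_size:
            filtered[name] = sorted(overlap)
    return filtered

# Toy usage with hypothetical set names and gene symbols (relaxed size limits)
sets = {"HALLMARK_EXAMPLE": ["TP53", "EGFR", "BRCA1"], "TINY_SET": ["MYC"]}
print(filter_gene_sets(sets, ["TP53", "EGFR", "BRCA1", "MYC"], min_size=2, max_size=500))
```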
Ranked Gene Lists
A ranked gene list serves as the primary input for gene set enrichment analysis (GSEA), consisting of all genes in a dataset ordered by a metric that quantifies their differential activity or association with a phenotype of interest. This ordering captures the continuum of changes across the transcriptome without relying on predefined significance thresholds, enabling the detection of coordinated shifts in gene sets that may not be apparent through traditional differential expression analysis.
Common ranking metrics include the signal-to-noise ratio, which measures the difference in means between phenotypes scaled by the standard deviation and is the default in the original GSEA implementation. For two-group comparisons, log2 fold-change is frequently used, particularly with RNA-seq data analyzed via tools like DESeq2, as it directly reflects the magnitude of expression differences. In microarray studies, moderated t-statistics from linear models (e.g., via limma) are preferred, as they incorporate empirical Bayes moderation to improve stability for genes with low variance.[20] Other options, such as Pearson correlation for continuous phenotypes or t-test statistics, allow adaptation to various experimental designs, though the choice of metric can influence enrichment outcomes.[21]
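As a concrete illustration of how a ranking metric turns an expression matrix into the ordered list GSEA consumes, the Python sketch below computes a simple signal-to-noise-style score for a two-group comparison. The toy matrix, labels, and small stabilizing constant are assumptions for this example; real analyses would typically take the statistic from limma, edgeR, or DESeq2 as described above.

```python
import numpy as np

def signal_to_noise(expr, labels, group_a, group_b):
    """Rank metric: difference of class means scaled by the class standard deviations.

    expr   : genes x samples matrix of (log-scale) expression values
    labels : per-sample phenotype labels
    Returns one score per gene; genes are then sorted by this score.
    """
    a = expr[:, labels == group_a]
    b = expr[:, labels == group_b]
    mean_a, mean_b = a.mean(axis=1), b.mean(axis=1)
    sd_a, sd_b = a.std(axis=1, ddof=1), b.std(axis=1, ddof=1)
    return (mean_a - mean_b) / (sd_a + sd_b + 1e-8)  # small constant avoids division by zero

# Toy usage: 100 genes, 6 samples (3 per phenotype)
rng = np.random.default_rng(0)
expr = rng.normal(size=(100, 6))
labels = np.array(["tumor"] * 3 + ["normal"] * 3)
scores = signal_to_noise(expr, labels, "tumor", "normal")
ranked_genes = np.argsort(scores)[::-1]  # indices ordered from most up- to most down-regulated
```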
Unlike threshold-based approaches that select only significantly differentially expressed genes, ranked lists in GSEA consider the full spectrum of expression changes, thereby increasing sensitivity to gene sets exhibiting modest but coordinated alterations in the middle ranks. This avoids arbitrary cutoffs, such as fold-change or p-value thresholds, which can miss biologically relevant pathways with subtle effects across many genes.
Ranked lists are typically derived from differential expression analyses tailored to the data type, including t-tests or limma for microarray data and DESeq2 or edgeR for RNA-seq, where genes are sorted by the chosen metric post-analysis.[20]
Key properties of ranked gene lists include collapsing multiple probes or transcripts to a single gene-level representative, often using the maximum or median expression value to handle redundancy in platforms like microarrays. For multi-condition or otherwise complex designs, rankings are typically supplied through GSEA's preranked mode, which accepts arbitrary gene-level statistics from tools like limma for broader experimental contrasts.[20]
Methodology
Algorithm Overview
Gene set enrichment analysis (GSEA) follows a structured workflow to identify coordinated changes in predefined gene sets within a ranked list of genes, typically derived from genome-wide expression data. The process starts with loading the ranked gene list, ordered by metrics such as differential expression or phenotypic correlation, alongside a collection of gene sets from curated databases. For each gene set, the algorithm computes an enrichment score by traversing the ranked list using a running-sum approach. Statistical significance is then evaluated through permutation testing of the phenotype labels to generate a null distribution, followed by adjustment for multiple testing across all gene sets to control false positives.[4][3]
At its core, the GSEA algorithm employs a running-sum statistic to assess whether genes in a given set are enriched at the top (upregulated) or bottom (downregulated) of the ranked list, capturing subtle but coordinated biological signals that might be missed by individual gene analysis. As the statistic walks through the list from highest to lowest rank, it incrementally increases the sum for genes belonging to the set (up-weighting their contributions based on their ranking metric) and decreases it for non-members (down-weighting equally to simulate a random walk), with the enrichment score defined as the maximum positive or negative deviation from zero during this process. GSEA supports different scoring schemes: the default weighted scheme (exponent p=1) weights contributions by the absolute ranking metric raised to p, normalized across the set; the classic unweighted scheme (p=0) treats all set members equally.[4]
Several key parameters influence the algorithm's execution and sensitivity. The exponent parameter p, which modulates the weighting of gene-set members by their ranking metric, is set to a default value of 1 for balanced emphasis on stronger signals. The number of phenotype permutations, typically 1000, determines the resolution of the null distribution for significance estimation. Additionally, a minimum gene set size threshold, such as 15 genes, is applied to filter out sets too small for reliable analysis, while a maximum size may also be imposed to avoid overly broad categories.[4][3]
The algorithm produces key outputs per gene set, including the normalized enrichment score (NES), which scales the raw enrichment score to account for differences in set size and composition; the FDR q-value, derived from permutation-based multiple testing correction to indicate significance while controlling the false discovery rate; and the leading edge subset, consisting of the core genes whose differential expression primarily drives the observed enrichment.[4][3]
For clarity, the following pseudocode outlines the high-level computation of the enrichment score for a single gene set (default weighted scheme):
Input: Ranked gene list L (from highest to lowest by metric r_j), gene set S
Compute weights: for each j in S, w_hit(j) = |r_j|^p / sum_{k in S} |r_k|^p (p=1 default)
For misses: w_miss = -1 / (N - |S|)
Initialize running-sum statistic RS = 0
For each position i in L:
    If gene at i is in S (hit):
        RS += w_hit(i)
    Else (miss):
        RS += w_miss
Enrichment Score (ES) = value of RS at its maximum deviation from zero (positive or negative) across the walk
This procedure is repeated for each gene set, with subsequent normalization and significance assessment applied to the resulting ES values. For the unweighted scheme (p=0), w_hit(j) = 1 / |S| for all hits.[4]
Enrichment Score Computation
The enrichment score (ES) quantifies the extent to which a predefined gene set S deviates from a uniform distribution across a ranked gene list L of N genes, capturing coordinated shifts of the set's members toward the extremes of the ranking. Computation involves traversing the list from top to bottom (highest to lowest ranking metric), accumulating a running sum that increases for genes in S (hits) by normalized weights based on their ranking metric and decreases for genes not in S (misses) by an equal penalty. The ES is the maximum deviation of this running sum from zero, positive when hits cluster at the top of the list and negative when they cluster at the bottom.[4]
The running sum is computed as the difference between two cumulative fractions: P_hit(i), the fraction of weighted gene set members up to position i, and P_miss(i), the fraction of non-set genes up to i. Specifically,
P_{\mathrm{hit}}(i) = \frac{\sum_{j=1}^{i} w_j \, I(g_j \in S)}{\sum_{j \in S} w_j}
P_{\mathrm{miss}}(i) = \frac{\sum_{j=1}^{i} I(g_j \notin S)}{N - |S|}
where I is the indicator function, and the weights w_j for the default weighted scheme are |r_j|^p (p=1), with r_j the ranking metric (e.g., Pearson correlation with phenotype); dividing by their sum normalizes the weights so that P_hit reaches 1 at the end of the list. For non-hits, contributions are unweighted and equal. The running sum at i is P_hit(i) - P_miss(i), and ES(S) = \max_{1 \leq i \leq N} [P_hit(i) - P_miss(i)] for positive enrichment (similarly for negative). The exponent p tunes weighting: p=0 (unweighted/classic) sets w_j = 1 for all genes in S, equivalent to a Kolmogorov-Smirnov-like statistic; higher p (e.g., 2) amplifies genes with extreme metrics. Normalization of the raw ES to the NES, accounting for set size, occurs empirically via permutations (see Significance Testing).[4]
Consider a toy example with N=10 genes ranked from 1 (top) to 10 (bottom) and a gene set S of size 3 at positions 1, 4, and 9 (two early hits and one late hit, illustrating positive enrichment). Using the unweighted case (p=0), contributions are +1/3 for each hit and -1/7 for each non-hit (normalized totals: hits sum to 1, non-hits to 1). The computation proceeds as follows:
| Position (rank) | In S? | Contribution | Running Sum V(k) |
|---|---|---|---|
| 1 | Yes | +1/3 ≈ 0.333 | 0.333 |
| 2 | No | -1/7 ≈ -0.143 | 0.190 |
| 3 | No | -1/7 ≈ -0.143 | 0.048 |
| 4 | Yes | +1/3 ≈ 0.333 | 0.381 |
| 5 | No | -1/7 ≈ -0.143 | 0.238 |
| 6 | No | -1/7 ≈ -0.143 | 0.095 |
| 7 | No | -1/7 ≈ -0.143 | -0.048 |
| 8 | No | -1/7 ≈ -0.143 | -0.190 |
| 9 | Yes | +1/3 ≈ 0.333 | 0.143 |
| 10 | No | -1/7 ≈ -0.143 | 0.000 |
The maximum running sum is 0.381 at position 4, so ES = 0.381, indicating moderate enrichment. This running sum plot would peak early due to clustered hits, visualizing the deviation. For the default weighted scheme, contributions for hits would vary by their r_j values (higher for top-ranked), potentially increasing the ES if early hits have stronger metrics.[4]
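The table can be reproduced with a short Python sketch of the running-sum computation. The function below is a minimal illustration of the weighted and unweighted schemes described above (the array names and the toy metric of all ones are assumptions for this example), not the reference implementation.

```python
import numpy as np

def enrichment_score(ranked_metric, in_set, p=0.0):
    """Running-sum enrichment score for one gene set over a ranked list.

    ranked_metric : ranking metric r_j per gene, ordered from top to bottom of the list
    in_set        : boolean array, True where the gene at that position belongs to S
    p             : weighting exponent (p=0 classic/unweighted, p=1 default weighted scheme)
    Returns the ES, i.e. the signed value of the running sum at its largest deviation from zero.
    """
    r = np.abs(np.asarray(ranked_metric, dtype=float)) ** p
    hits = np.asarray(in_set, dtype=bool)
    n_miss = hits.size - hits.sum()
    step = np.where(hits, r / r[hits].sum(), -1.0 / n_miss)  # +weight for hits, equal penalty for misses
    running_sum = np.cumsum(step)
    return running_sum[np.argmax(np.abs(running_sum))]

# Toy example from the table above: N = 10, set members at ranks 1, 4 and 9, unweighted (p = 0)
in_set = np.zeros(10, dtype=bool)
in_set[[0, 3, 8]] = True
es = enrichment_score(np.ones(10), in_set, p=0.0)
print(round(es, 3))  # 0.381, matching the peak at position 4
```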
For continuous or quantitative phenotypes, the method adapts by ranking genes via a metric like Pearson correlation with the trait values, and weights incorporate this signal strength (r_j) as described, enabling detection of subtle, coordinated shifts in expression correlated with trait variation.[4]
Significance Testing
In gene set enrichment analysis (GSEA), statistical significance of the enrichment score (ES) for a gene set is assessed using a permutation test to generate an empirical null distribution, which accounts for the observed data's structure and controls for false positives.[4] Two primary permutation strategies are employed: phenotype permutation, which randomly reassigns sample labels while preserving gene-gene correlations, and is the standard approach for datasets with at least seven samples per phenotype class; and gene set permutation, which randomly permutes gene labels within the ranked list but assumes gene independence, making it more conservative and suitable for smaller sample sizes where phenotype permutation may lack power.[4]
The empirical p-value is calculated as the proportion of permuted ES values that are greater than or equal to the absolute value of the observed ES, divided by the total number of permutations; at least 1,000 permutations are recommended to ensure stability and precision in the estimate.[4] To enable comparability across gene sets of varying sizes and compositions, the observed ES is normalized by dividing it by the mean of the permuted ES values from the null distribution, yielding the normalized enrichment score (NES), which adjusts for directionality by considering only positive or negative permutations as appropriate.[4]
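A hedged Python sketch of this permutation step is shown below. The null distribution here is synthetic, and the sign-handling conventions of the reference implementation differ slightly in detail, so treat it as an illustration of the empirical p-value and NES calculations rather than the official procedure.

```python
import numpy as np

def empirical_p_and_nes(observed_es, permuted_es):
    """Permutation-based significance for one gene set (illustrative sketch).

    observed_es : ES computed on the real phenotype labels
    permuted_es : array of ES values recomputed after label permutations
    The p-value is the fraction of permuted ES at least as extreme as the observed ES;
    the NES divides the observed ES by the mean magnitude of same-sign permuted ES.
    """
    permuted_es = np.asarray(permuted_es, dtype=float)
    same_sign = permuted_es[np.sign(permuted_es) == np.sign(observed_es)]
    p_value = np.mean(np.abs(permuted_es) >= abs(observed_es))
    nes = observed_es / np.abs(same_sign).mean()
    return p_value, nes

# Toy usage with a synthetic null of 1,000 permutations
rng = np.random.default_rng(1)
null_es = rng.normal(0, 0.15, size=1000)
print(empirical_p_and_nes(0.45, null_es))
```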
Leading edge analysis identifies the core subset of genes within a gene set that contributes most to the observed ES, consisting of those genes that appear in the ranked list at or before the point of maximum ES; this subset, often termed the leading edge genes, facilitates targeted follow-up studies on the biologically relevant components driving the enrichment.[4]
GSEA demonstrates power to detect gene sets exhibiting coherent directional changes even when only 10-20% of the genes show modest individual alterations, thereby enhancing sensitivity over single-gene analyses in noisy datasets; however, its statistical power is highly sensitive to sample size, with smaller cohorts reducing the ability to discern true enrichments and increasing reliance on the more conservative gene set permutation.[4][22]
Data Preparation
Generating input data for gene set enrichment analysis (GSEA) begins with high-throughput experimental assays that quantify molecular features across biological conditions, producing datasets suitable for ranking genes or proteins by differential activity. Common data sources include RNA sequencing (RNA-seq) for transcriptomics, microarray hybridization for gene expression profiling, and mass spectrometry-based proteomics for protein abundance measurements. These assays typically involve two-sample designs, such as control versus treatment groups, or multi-class comparisons like different disease stages or cell types.[4][23][24]
To generate the required ranked gene list, differential expression or abundance analysis is performed using specialized computational tools that compute ranking metrics, such as log-fold change (logFC), moderated t-statistics, or signal-to-noise ratios. For microarray data, the limma package in R/Bioconductor employs linear models to estimate these metrics, accounting for experimental design and providing empirical Bayes moderation for improved variance estimation. In RNA-seq workflows, edgeR uses negative binomial generalized linear models to model count data and derive logFC or likelihood ratio test statistics for ranking, while limma's voom transformation enables similar linear modeling after converting counts to log-counts per million. Proteomics data, often from label-free quantification or isobaric tagging methods, can be analyzed with tools like limma or MSstats to obtain fold changes or t-statistics for ranking. These metrics rank all features from most upregulated to most downregulated, forming the input for pre-ranked GSEA.[25]
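For orientation, the following Python sketch derives a naive log2 fold-change ranking directly from a count matrix. It is a simplified stand-in, under toy assumptions, for the moderated statistics produced by limma-voom, edgeR, or DESeq2, which should be preferred in practice.

```python
import numpy as np

def log2fc_ranking(counts, labels, group_a, group_b, pseudocount=1.0):
    """Naive log2 fold-change ranking from an RNA-seq count matrix.

    counts : genes x samples matrix of raw counts
    labels : per-sample condition labels
    Counts are scaled to counts-per-million before comparing group means.
    """
    counts = np.asarray(counts, dtype=float)
    cpm = counts / counts.sum(axis=0) * 1e6              # library-size scaling
    a = cpm[:, labels == group_a].mean(axis=1)
    b = cpm[:, labels == group_b].mean(axis=1)
    logfc = np.log2(a + pseudocount) - np.log2(b + pseudocount)
    return np.argsort(logfc)[::-1], logfc                # gene order, most up- to most down-regulated

# Toy usage: 50 genes, 4 treated vs. 4 control samples
rng = np.random.default_rng(2)
counts = rng.poisson(lam=20, size=(50, 8))
labels = np.array(["treated"] * 4 + ["control"] * 4)
order, logfc = log2fc_ranking(counts, labels, "treated", "control")
```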
When experimental designs include replicates, statistical methods incorporate them to enhance reliability, either by averaging expression across replicates for simple comparisons or using moderated statistics that borrow information across genes to stabilize variance estimates, particularly useful with limited sample sizes. A minimum of 3-5 biological replicates per group is generally recommended to achieve sufficient statistical power for detecting differential features, though methods like those in limma and edgeR perform robustly even with fewer replicates by leveraging empirical Bayes approaches. For multi-class designs, tools like edgeR's likelihood ratio tests allow ranking based on overall condition effects.[26]
In multi-omics studies, ranked lists can be generated by integrating scores across layers, such as combining transcriptomic logFC with epigenomic differential methylation signals using weighted summation or joint modeling approaches. Packages like multiGSEA facilitate this by computing combined enrichment scores from separate omics rankings, enabling pathway analysis that captures concordant changes, for example, between RNA-seq and DNA methylation data.[27]
Best practices emphasize including all genes in the ranked list without arbitrary filtering, as GSEA's permutation-based statistics benefit from the full dataset to accurately assess enrichment across the entire distribution. Batch effects should be addressed prior to ranking using methods like ComBat or removeBatchEffect to remove systematic technical variation, ensuring ranks reflect true biological differences rather than artifacts. Gene identifiers must be consistently mapped, often collapsing multiple probes or transcripts to unique gene symbols using annotations like Ensembl or Entrez. These steps align with the conceptual requirement for comprehensive, phenotype-ordered rankings in GSEA.[18][23]
Preprocessing and Normalization
Preprocessing and normalization are critical steps in preparing gene expression data for gene set enrichment analysis (GSEA), as they ensure that the resulting ranked gene lists accurately reflect biological differences rather than technical artifacts or biases.[5] These procedures mitigate variations introduced during data acquisition, such as differences in sample handling or platform-specific noise, allowing GSEA to detect subtle pathway-level enrichments without confounding influences.[28]
Quality control begins with the removal of low-quality samples and genes to maintain dataset integrity. Samples exhibiting extreme outliers, detected via principal component analysis (PCA), should be excluded to prevent distortion of overall expression patterns.[29] Similarly, genes with low detection rates—such as those expressed in fewer than 20% of samples—are typically filtered out, as they contribute noise rather than reliable signal.[28] This filtering reduces dimensionality and focuses analysis on informative features.
Normalization techniques vary by data type to achieve comparability across samples. For microarray data, quantile normalization is widely used to equalize the distribution of expression values, addressing systematic biases in probe intensities.[25] In RNA-seq datasets, methods like trimmed mean of M-values (TMM) or upper quartile normalization stabilize counts, accounting for differences in library size and composition. Additionally, log2 transformation is commonly applied post-normalization to variance-stabilize the data, compressing the dynamic range and making it suitable for ranking in GSEA.[30] RNA-seq data must undergo external normalization prior to GSEA input, as the GSEA software does not perform this step internally.[30]
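To make the quantile normalization step concrete, the sketch below shows a compact Python implementation of the idea for a microarray-style matrix. It is an illustration only; dedicated implementations (e.g., in limma or preprocessCore) handle ties and missing values more carefully, and RNA-seq data would instead use TMM or upper quartile scaling as noted above.

```python
import numpy as np

def quantile_normalize(expr):
    """Quantile normalization of a genes x samples expression matrix.

    Each sample's values are replaced by the mean of the values sharing the same
    rank across samples, forcing identical per-sample distributions.
    """
    expr = np.asarray(expr, dtype=float)
    order = np.argsort(expr, axis=0)              # per-sample ordering of rows
    sorted_vals = np.sort(expr, axis=0)
    rank_means = sorted_vals.mean(axis=1)         # reference distribution
    normalized = np.empty_like(expr)
    for j in range(expr.shape[1]):
        normalized[order[:, j], j] = rank_means   # map reference values back by rank
    return normalized

# Toy usage on log2-transformed microarray-like intensities
rng = np.random.default_rng(3)
raw = rng.lognormal(mean=6, sigma=1, size=(200, 5))
norm = quantile_normalize(np.log2(raw))
```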
For single-cell RNA-seq (scRNA-seq) data, which is increasingly used in transcriptomics studies, preprocessing must address unique challenges such as high dropout rates, zero-inflation, and cell-to-cell variability. Initial steps include quality control to filter low-quality cells (e.g., based on mitochondrial gene content or total counts) and genes expressed in few cells. Normalization often employs methods like log-normalization scaled by library size or more advanced techniques such as SCTransform, which models technical noise using regularized negative binomial regression. Batch effects in scRNA-seq can be corrected using integration tools like Harmony or Seurat's CCA. To generate ranked lists for GSEA, differential expression analysis may use pseudobulk aggregation followed by bulk tools like edgeR, or single-cell-specific methods such as MAST or Wilcoxon tests. Adaptations like AUCell or ssGSEA enable per-cell enrichment scoring, but for standard GSEA, pseudobulk approaches preserve compatibility. As of 2023, best practices recommend these steps to ensure robust pathway analysis in heterogeneous cell populations.[31][32]
Gene collapsing addresses redundancy in expression platforms where multiple probes or transcripts map to the same gene. Probes are typically aggregated to unique gene identifiers, such as Entrez IDs, by selecting the probe with the maximum mean expression across samples to represent the gene's overall activity.[33] For genes with isoforms, a representative transcript is chosen based on annotation priorities, ensuring one entry per gene in the ranked list to avoid inflating set sizes or diluting signals.[34]
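The max-probe collapse described above can be sketched in a few lines of Python; the gene symbols and matrix below are toy examples, and production pipelines would typically rely on the GSEA software's own collapse step or platform annotation packages.

```python
import numpy as np

def collapse_probes_max_mean(expr, probe_to_gene):
    """Collapse a probes x samples matrix to one row per gene.

    For each gene, the probe with the highest mean expression across samples is
    kept as its representative.
    probe_to_gene : gene symbol/ID for each probe row (parallel to expr rows).
    """
    expr = np.asarray(expr, dtype=float)
    probe_means = expr.mean(axis=1)
    best = {}                                   # gene -> row index of best probe so far
    for i, gene in enumerate(probe_to_gene):
        if gene not in best or probe_means[i] > probe_means[best[gene]]:
            best[gene] = i
    genes = sorted(best)
    return genes, expr[[best[g] for g in genes], :]

# Toy usage: 5 probes mapping to 3 genes, 4 samples
expr = np.arange(20, dtype=float).reshape(5, 4)
probes = ["TP53", "TP53", "BRCA1", "EGFR", "EGFR"]
genes, collapsed = collapse_probes_max_mean(expr, probes)
```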
Batch correction is essential when technical batches introduce systematic variation that could overshadow biological signals. The ComBat algorithm, an empirical Bayes method, adjusts for known batch effects while preserving biological covariates, and is particularly effective for microarray data. For RNA-seq, extensions like ComBat-seq apply negative binomial modeling to count data, maintaining integer outputs suitable for downstream analysis.[35] Alternatively, limma's removeBatchEffect function can be used for linear model-based correction.[25] These methods enhance the robustness of GSEA ranks by isolating true differential expression.
In the context of GSEA, preprocessing must prioritize preserving biological rank orders over aggressive artifact removal, as over-normalization can mask subtle pathway shifts critical for enrichment detection.[5] Thus, validation steps, such as visualizing normalized data distributions and checking for residual batch effects via PCA, are recommended to confirm that the prepared ranked lists faithfully represent phenotypic differences.[28]
Statistical and Interpretational Aspects
Multiple Hypothesis Testing
In gene set enrichment analysis (GSEA), the evaluation of thousands of gene sets—often exceeding 10,000 from databases like MSigDB—simultaneously introduces a multiple hypothesis testing problem, where the probability of false positives increases substantially without correction.[10] For instance, testing 35,000 sets at a nominal significance level of 0.05 could yield over 1,700 false discoveries by chance alone. The naive Bonferroni correction, which divides the significance threshold by the number of tests, is typically too conservative for GSEA, as it drastically reduces statistical power and overlooks the exploratory nature of such analyses.[4]
To control false discoveries, GSEA uses a permutation-based false discovery rate (FDR) q-value computed directly on the normalized enrichment scores (NES).[4] In this approach, for a given observed NES, the q-value is the ratio of the proportion of NES values from the permuted null distributions that are at least as extreme as the observed NES to the proportion of observed NES values that meet the same criterion, computed separately for positive and negative enrichments; it estimates the proportion of false positives among results called significant at that NES threshold.[4] Because the FDR is computed on NES values, which are already normalized for gene set size, q-values are comparable across sets of varying lengths.
Common thresholds for significance include FDR < 0.05 for conservative analyses or FDR < 0.25 for exploratory purposes, where the latter accepts that approximately one in four significant results may be false positives, balancing discovery and error control in hypothesis generation.[4] For more stringent control of the family-wise error rate (FWER)—the probability of any false positive across all tests—alternatives like the Bonferroni method or adaptive variants (e.g., Sidak correction) may be applied, though they are less favored in GSEA due to power loss.[4]
Extensions to standard FDR address challenges such as varying gene set sizes and correlations among sets, particularly in hierarchical structures like Gene Ontology (GO). Weighted FDR procedures incorporate set size as a weight in the adjustment, downweighting larger sets that may inherently have higher detection power due to more genes.[36] For correlated sets, such as GO terms sharing genes, adaptive methods like the focused BH procedure or tree-structured FDR use data-dependent weights or hierarchical pruning to improve power while controlling FDR, mitigating inflation from dependencies.[37]
As an example, to compute the permutation-based q-value in GSEA, first obtain the nominal p-value for a gene set as the proportion of permutations (e.g., 1,000) yielding an ES at least as extreme as the observed; then estimate the q-value as the ratio of the fraction of permuted NES values at least as extreme as the observed NES to the fraction of observed gene set NES values at least as extreme, and call enrichments significant where, for example, q < 0.05.[4]
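A minimal Python sketch of this ratio-of-tail-fractions calculation is shown below, assuming a pooled null of permuted NES values across all sets; the reference implementation's exact handling of signs and normalization differs in detail, so this is illustrative only.

```python
import numpy as np

def gsea_fdr_q(observed_nes, permuted_nes):
    """Permutation-based FDR q-values in the style described above (illustrative).

    observed_nes : NES for each real gene set
    permuted_nes : flattened NES values from the permuted null across all sets
    For each observed NES, q is the fraction of same-sign null NES at least as extreme
    divided by the fraction of same-sign observed NES at least as extreme.
    """
    observed_nes = np.asarray(observed_nes, dtype=float)
    permuted_nes = np.asarray(permuted_nes, dtype=float)
    q = np.empty_like(observed_nes)
    for i, nes in enumerate(observed_nes):
        null_same = permuted_nes[np.sign(permuted_nes) == np.sign(nes)]
        obs_same = observed_nes[np.sign(observed_nes) == np.sign(nes)]
        frac_null = np.mean(np.abs(null_same) >= abs(nes)) if null_same.size else 1.0
        frac_obs = np.mean(np.abs(obs_same) >= abs(nes))
        q[i] = min(1.0, frac_null / frac_obs)
    return q

# Toy usage: 4 observed sets against a synthetic null
rng = np.random.default_rng(4)
print(gsea_fdr_q([2.1, -1.8, 1.2, 0.4], rng.normal(0, 1, size=5000)))
```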
Result Visualization and Interpretation
Results from gene set enrichment analysis (GSEA) are typically visualized through several key plots that facilitate the understanding of enrichment patterns across ranked gene lists. The primary visualization is the enrichment plot, which depicts the running sum statistic as a black line progressing along the x-axis representing the ranked gene list, with the maximum or minimum enrichment score (ES) marked at the peak or trough, respectively.[4] This plot highlights the position where the gene set shows the strongest deviation from random expectation, often accompanied by vertical lines indicating the location of genes within the analyzed set. Complementing this, heatmaps display the expression levels of the leading edge subset— the core genes contributing most to the enrichment signal—across samples, using color gradients (e.g., red for upregulation, blue for downregulation) to reveal coordinated changes relative to the phenotype.[30] Network views, such as those representing enriched gene sets as nodes connected by edges based on gene overlap, provide an overview of relationships between multiple sets, with node size proportional to set size and coloring by enrichment direction (e.g., red for positive enrichment, blue for negative).[38]
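The matplotlib sketch below mimics the layout of such an enrichment plot using synthetic data: a running-sum curve with tick marks beneath it for gene set members. It illustrates the visualization only and is not the software's own plotting code; the toy hit positions and step sizes are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy running sum and hit positions (e.g., from the enrichment_score sketch above)
n_genes = 500
rng = np.random.default_rng(5)
hit_positions = np.sort(rng.choice(n_genes, size=40, replace=False))
step = np.full(n_genes, -1.0 / (n_genes - 40))
step[hit_positions] = 1.0 / 40
running_sum = np.cumsum(step)

fig, (ax_es, ax_hits) = plt.subplots(
    2, 1, sharex=True, gridspec_kw={"height_ratios": [4, 1]}, figsize=(6, 4)
)
ax_es.plot(running_sum, color="black")                 # running enrichment score
ax_es.axhline(0, color="grey", linewidth=0.5)
ax_es.set_ylabel("Running ES")
ax_hits.vlines(hit_positions, 0, 1, color="black")     # tick marks for gene set members
ax_hits.set_yticks([])
ax_hits.set_xlabel("Rank in ordered gene list")
fig.suptitle("Toy GSEA enrichment plot")
fig.tight_layout()
plt.show()
```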
Interpretation of these visualizations centers on the normalized enrichment score (NES) and its direction, where a positive NES indicates upregulation of the gene set in the first phenotype compared to the second, suggesting coordinated activation of associated biological processes.[30] The leading edge genes, identified as those appearing before or after the enrichment peak depending on the ES sign, represent the subset driving the signal and are often prioritized for further investigation as potential biomarkers due to their biological relevance to the phenotype.[39] Overlap analysis between leading edges of enriched sets reveals shared genes, aiding in the identification of common regulatory mechanisms. To derive biological context, enriched sets are mapped to known pathways or functions (e.g., an upregulated gene set linked to inflammation pathways in a disease model), with validation recommended using orthogonal data such as qPCR or independent datasets to confirm the observed coordinations.[4]
Reporting GSEA results follows standards that emphasize key metrics for transparency and reproducibility, including the NES, false discovery rate (FDR)-adjusted p-values (typically with FDR < 0.25 denoting significance), and details of the top enriched sets, alongside visualizations of the enrichment plot and leading edge heatmap.[30] Caution is advised against over-interpreting small ES values, as they may reflect weak signals or noise, particularly in datasets with limited sample sizes. For advanced analysis, enrichment maps cluster overlapping gene sets into modules, enabling the discernment of broader themes like immune response networks, with dynamic visualizations in software allowing interactive exploration of node attributes and edges.[38]
Limitations and Challenges
Methodological Assumptions
Gene set enrichment analysis (GSEA) operates under several key methodological assumptions that underpin its statistical validity and interpretability. A primary assumption is that the ranking metric used to order genes—such as the signal-to-noise ratio, t-statistic, or correlation coefficient—accurately captures the true differential expression or biological relevance relative to the phenotype of interest. This ranking presupposes that the metric reflects linear or monotonic associations between gene expression and phenotype; however, it may falter in scenarios involving non-linear effects, complex interactions, or heterogeneous expression patterns, potentially leading to misordered genes and biased enrichment scores. For example, if the chosen metric inadequately models subtle regulatory dynamics, the enrichment signal for a gene set could be diluted or obscured, increasing the risk of type II errors in detecting biologically relevant pathways.
Another critical assumption concerns the relative independence of genes within and across gene sets, particularly in biological contexts like pathways with extensive crosstalk or shared regulatory elements. Although GSEA mitigates intra-sample gene correlations through phenotype permutation rather than gene-level resampling, it implicitly assumes minimal overlap between gene sets to prevent correlated enrichment scores that could inflate false positives. Violations occur frequently due to overlapping annotations in databases like MSigDB, where shared genes can amplify apparent significances. Such overlaps can elevate type I error rates, as correlated scores propagate across related pathways without adjustment.[40][41]
GSEA also assumes sufficient sample size and balanced experimental designs to support permutation-based significance testing, ensuring the null distribution of enrichment scores is robust and representative. Typically, at least 5-7 samples per phenotype group are required for adequate statistical power, as smaller cohorts (e.g., n<5) yield unstable permutations, high variability in scores, and reduced reproducibility of results. In unbalanced designs, this assumption breaks down, potentially skewing the null toward one phenotype and increasing type I or II errors; for instance, simulations show that low sample sizes lead to inconsistent detection of enriched sets, with average predictions dropping below 15 significant sets even at n=10 per group.[42]
Finally, the method assumes gene sets are of sufficient size (generally >15 genes) to generate reliable enrichment signals, as smaller sets are susceptible to stochastic noise and sampling variability, which can exaggerate false enrichments or mask true signals. This size threshold helps maintain the Kolmogorov-Smirnov-like running sum statistic's sensitivity, but violations in small sets—common in specialized pathways—contribute to inflated variance and higher error rates, underscoring the need for cautious interpretation in such cases.[41]
Common Pitfalls in Application
One common pitfall in GSEA application involves arbitrary gene filtering prior to ranking, which can bias results toward extreme values and undermine the method's threshold-free design intended to assess all genes without predefined cutoffs.[30] Similarly, failing to account for batch effects during data preparation can introduce systematic biases that confound enrichment signals, as unadjusted technical variations may mimic biological differences.[43]
Parameter selection errors also frequently occur, such as using too few permutations (e.g., fewer than 1,000), which results in unstable p-values and reduced reliability of significance estimates due to insufficient sampling of the null distribution.[44] Inappropriate choice of the weighting exponent in ranking metrics, like overemphasizing absolute differences without biological justification, can distort the enrichment score by unevenly prioritizing certain gene contributions.[45]
In interpretation, users often cherry-pick top-ranked gene sets based solely on nominal p-values or enrichment scores without applying false discovery rate (FDR) correction, leading to inflated false positive rates since GSEA tests thousands of sets simultaneously.[46] Overlooking the leading edge subset—the core genes driving the enrichment—can result in misguided biological inferences, as this subset reveals the specific contributors to the signal rather than the entire set.[1]
Computational challenges arise with large datasets, where default memory allocations (e.g., 4 GB in standard GSEA software) may be insufficient, causing crashes or incomplete analyses for expression profiles exceeding 20,000 genes.[18] Additionally, relying on outdated gene set collections, such as pre-2020 versions of MSigDB, can lead to mismatched identifiers or deprecated pathways, producing erroneous mappings and irreproducible results.[8]
To mitigate these issues, researchers should employ validated pipelines like the official Broad Institute GSEA desktop application, which includes built-in safeguards for common errors. Cross-validating findings with orthogonal approaches, such as over-representation analysis on the same data, enhances robustness without relying on a single method.[47] Finally, transparent reporting of all parameters—including permutation counts, ranking metrics, and gene set versions—is essential for reproducibility and peer review.[48]
Alternative Approaches
Over-Representation Analysis
Over-representation analysis (ORA) is a threshold-based statistical method that evaluates whether a predefined gene set, such as a biological pathway or functional category, contains a higher proportion of significantly differentially expressed (DE) genes than expected by chance, relative to a background set of all measured genes.[49] This approach typically involves selecting a subset of genes deemed significant based on criteria like false discovery rate (FDR) < 0.05 or the top 5% by fold change, then testing for enrichment within that subset.[50]
The core statistical test in ORA is the hypergeometric test, equivalent to a one-sided Fisher's exact test for contingency tables in this context.[51] It models the overlap between the gene set and the DE list as sampling without replacement from a finite population, computing a p-value as the probability of observing at least k genes from the set in the DE list under the null hypothesis of no enrichment. The formula is:
P(X \geq k) = \sum_{i=k}^{\min(n, K)} \frac{\binom{K}{i} \binom{N - K}{n - i}}{\binom{N}{n}}
where N is the total number of genes in the background population, K is the size of the gene set, n is the number of DE genes, and k is the observed overlap.[52]
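This p-value can be computed directly with scipy's hypergeometric distribution; the sketch below follows the notation above, and the toy numbers (background size, set size, DE list, overlap) are illustrative assumptions.

```python
from scipy.stats import hypergeom

def ora_p_value(N, K, n, k):
    """One-sided ORA p-value: P(X >= k) under the hypergeometric null.

    N : total genes in the background
    K : genes in the gene set
    n : differentially expressed (DE) genes
    k : DE genes that are also in the gene set
    """
    return hypergeom.sf(k - 1, N, K, n)   # survival function gives P(X > k-1) = P(X >= k)

# Toy usage: 20,000 background genes, a 150-gene pathway, 800 DE genes, 25 in the overlap
print(ora_p_value(N=20000, K=150, n=800, k=25))
```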
ORA offers simplicity and computational speed, enabling rapid screening of multiple gene sets without requiring ranked expression data.[53] However, it discards information from non-significant genes, potentially overlooking subtle or coordinated effects across the full dataset, and results are highly sensitive to the arbitrary choice of DE threshold, which can introduce bias or variability.[54]
Historically, ORA emerged as an early standard for interpreting high-throughput genomic data in the microarray era, with tools like the Database for Annotation, Visualization and Integrated Discovery (DAVID) formalizing its use in 2003 through modified Fisher's exact tests for functional enrichment.[55] It remains popular for preliminary or confirmatory analyses, especially with smaller datasets where a concise list of significant genes is prioritized over comprehensive ranking.[49]
Competitive Enrichment Methods
Competitive enrichment methods assess whether a predefined gene set exhibits stronger differential expression or association with a phenotype compared to other genes or sets in the dataset, using the full set of measured genes as a competitive background. These approaches typically involve statistical tests, such as modified t-tests, ANOVA on aggregated set scores, or permutation-based rotations, to evaluate relative performance across sets without relying solely on the genes within the set itself. Unlike self-contained methods, competitive tests explicitly account for the distribution of effects across all genes, helping to identify sets that outperform expectations under a null hypothesis where membership in the set provides no advantage.[56][57]
A notable example is Significance Analysis of Microarrays for Gene Sets (SAM-GS), introduced in 2007, which extends the original SAM framework to gene sets by computing an L2-norm of t-like statistics for the genes in each set, followed by permutation testing to derive significance and false discovery rates. SAM-GS aggregates individual gene contributions into a set-level score and tests whether this score differs significantly between phenotypes, providing a statistically rigorous alternative for microarray data analysis.[58]
Another widely adopted method is Gene Set Variation Analysis (GSVA), developed in 2013, which generates per-sample enrichment scores to capture variation in pathway activity across heterogeneous populations. In GSVA, the score for a gene set is computed from the rescaled ranks of gene expression levels within the set relative to all genes, employing non-parametric kernel density estimation (e.g., a Gaussian kernel for microarrays) to estimate the competitive difference in distributions. For hypothesis testing, competitive gene set tests—such as rotation-based methods that resample at the sample level while preserving inter-gene correlations—can be applied to these scores to assess whether the set performs better than expected under the null.[59][60]
These methods offer advantages in handling sample-level heterogeneity, as per-sample scoring in approaches like GSVA enables detection of subtle variations in individual samples, making them particularly valuable for single-cell RNA-seq or tumor studies where intra-sample diversity is high. Compared to GSEA's self-contained approach, competitive methods like GSVA and SAM-GS mitigate biases related to gene set size by enforcing direct competition, enhancing reliability for modular or overlapping pathways.[59][56]
Recent developments as of 2025 include methods like GOAT (2024), which improves efficiency and robustness in identifying gene set enrichments, GeneAgent (2025), a language model-based agent for automated gene-set analysis, and Survival-based Gene Set Enrichment Analysis (SGSEA, 2025), tailored for survival data in disease studies.[49][61][62]
Software Implementations
Desktop and Standalone Tools
The Gene Set Enrichment Analysis (GSEA) software, developed by the Broad Institute, is a prominent Java-based desktop application for performing GSEA on microarray and RNA-seq data.[3] It supports analysis using predefined gene sets from the Molecular Signatures Database (MSigDB), including hallmark, pathway, and immunologic signatures, as well as user-defined custom gene sets in formats like .gmt. Key features include permutation-based statistical testing to assess enrichment significance, computation of enrichment scores (ES), normalized ES (NES), and false discovery rates (FDR), along with export capabilities for visualization in tools like Cytoscape via .gmt and .gml files.[18] The software enables offline processing of large datasets without data upload constraints, offering full user control over parameters such as the number of permutations (default 1,000) and gene ranking metrics.[4] Its graphical user interface facilitates step-by-step workflow from data loading to result reporting, including leading-edge subset analysis for enriched gene sets.[63] The latest version, GSEA 4.4.0 released in March 2025, incorporates Java 21 compatibility and addresses launch issues on modern macOS systems, building on prior updates like enhanced OpenJDK security in 4.3.3 (February 2024).[64]
The piano R package provides a comprehensive platform for GSEA by integrating multiple statistical methods, including over-representation analysis, gene set enrichment, and competitive approaches like GSEA itself. It accepts gene-level statistics from various sources and gene set collections such as MSigDB or KEGG, applying consensus scoring to combine results across methods for robust directionality assessment (e.g., up- vs. down-regulated patterns).[65] Features include visualization tools for heatmaps and network plots, as well as automation via scripting for batch processing of omics datasets.[66] As a Bioconductor package, piano supports offline execution in R environments, allowing customization without installation barriers beyond R setup, though it requires familiarity with R scripting. This integration reduces method-specific biases, enabling users to compare outcomes from tools like Fisher's exact test or the original GSEA algorithm within a unified framework.[65]
Another widely used R-based tool is fgsea, designed for fast preranked GSEA with adaptive multi-level permutation testing that achieves over 10-fold speed improvement compared to the original GSEA implementation for large gene sets.[67] It processes ranked gene lists from differential expression analyses, computing p-values and adjusted p-values efficiently even for thousands of permutations, and supports gene sets in .gmt format including MSigDB.[68] Visualization options include plotting enrichment scores and leading-edge genes, with export functions for downstream integration.[69] Released as a Bioconductor package, fgsea version 1.20.0 (2022) optimized memory usage and permutation accuracy, with subsequent updates up to version 1.36.0 (as of Bioconductor release 3.22 in 2025) enhancing compatibility with modern R versions.[68] Like other standalone tools, it offers complete local control and scalability for high-throughput analyses but demands R proficiency and initial package installation.[67]
Desktop and standalone tools like GSEA, piano, and fgsea provide advantages in privacy and customization for sensitive or voluminous datasets, contrasting with web-based platforms that prioritize ease of access.[3] However, they generally involve steeper learning curves due to software installation and command-line elements, particularly for non-programmers.
Web-Based and Integrated Platforms
Web-based and integrated platforms for gene set enrichment analysis (GSEA) offer user-friendly interfaces that eliminate the need for local software installation, making them particularly suitable for non-experts and collaborative workflows. These tools typically allow direct upload of gene lists or ranked datasets via a browser, perform enrichment against curated databases, and generate interactive visualizations to aid interpretation. By integrating multiple analysis methods and gene set libraries, they streamline the transition from raw genomic data to biological insights, often with built-in sharing options for results.
Enrichr stands out as a comprehensive web server for gene list enrichment, supporting uploads of gene symbols, fuzzy sets, or BED files against more than 200 libraries spanning ontologies, pathways, transcription factors, and disease associations. It provides interactive visualizations, including bar charts, network diagrams, and combined scores for term ranking, enabling users to explore collective gene functions intuitively. The platform includes a REST API for programmatic queries and embedding options for custom websites, facilitating integration into larger pipelines.
g:Profiler's g:GOSt module enables functional enrichment analysis of gene lists or ranked profiles, integrating resources like Gene Ontology (GO), KEGG pathways, Reactome, and regulatory data from multiple organisms. It supports both over-representation analysis (ORA) and competitive enrichment akin to GSEA, with visualizations such as Manhattan plots, heatmaps, and term networks to highlight significant categories. The tool handles identifier conversion and custom backgrounds, promoting reproducibility across studies.
WebGestalt functions as a hybrid toolkit combining ORA, GSEA, and network topology-based analysis (NTA), where NTA incorporates pathway gene interactions for more nuanced enrichment scores. Users can analyze gene lists against GO, pathways, and custom sets, with options for multi-omics integration and visualizations like enrichment maps and hierarchical trees. Its 2024 update improved computational speed, added metabolomics support, and enhanced reproducibility through versioned database tracking.
Among integrated platforms, Galaxy serves as an open-source workflow environment where users construct GSEA pipelines using plugins like fgsea or g:Profiler wrappers, combining enrichment with upstream tasks such as differential expression without coding. It supports data import from public repositories, workflow sharing via URLs, and execution on cloud resources for scalability. DAVID emphasizes functional annotation clustering, applying modified Fisher exact tests for ORA on gene lists to generate enrichment charts and term groups, offering GSEA-like overviews of biological themes through integrated GO and pathway mappings. Metascape facilitates multi-omics enrichment by processing multiple gene lists in parallel, applying clustering algorithms to consolidate redundant terms into representative nodes for clearer interpretation of shared pathways or processes.
These platforms share key advantages, including automatic database updates to reflect the latest annotations, collaborative result sharing through exportable reports or links, and API endpoints for automation, as exemplified by Enrichr's query interface. They lower barriers for rapid prototyping and community-driven analyses by incorporating user-submitted gene sets alongside standard libraries.
However, web-based tools introduce challenges such as data privacy risks, where uploaded gene lists may be stored on servers subject to institutional policies or breaches, necessitating careful review of terms for sensitive genomic data. Upload limits, often capping gene counts at 2,000–10,000 or file sizes at 10–50 MB depending on the platform, can restrict analysis of large datasets from high-throughput experiments.
Recent advancements include Enrichr's 2023 Enrichr-KG extension, which builds a knowledge graph linking over 20 libraries for integrative visualizations across categories like pathways and diseases. g:Profiler's 2024 update incorporated fresh Ensembl releases and enhanced GO term highlighting in results, with extensions in Galaxy enabling GSEA on spatial transcriptomics data through updated single-cell tools.
Applications
Genomics and Transcriptomics
Gene set enrichment analysis (GSEA) is extensively applied to genome-wide expression profiling from RNA sequencing (RNA-seq) and microarray experiments, particularly in differential expression analyses to uncover coordinated dysregulation of biological pathways. In these contexts, genes are ranked based on metrics such as log2 fold change or moderated t-statistics between conditions, allowing GSEA to evaluate whether predefined gene sets—curated from databases like MSigDB—are statistically enriched at the extremes of the ranked list. This approach reveals subtle, pathway-level changes that might be obscured in individual gene analyses, such as the upregulation of cell cycle gene sets (e.g., G2M checkpoint) in proliferating versus quiescent cell populations, highlighting regulatory networks driving cellular growth. Adaptations for RNA-seq data, such as GSAASeqSP, account for count-based variance and discrete distributions to maintain statistical power in high-throughput transcriptomic studies.[4][70]
GSEA extends beyond bulk transcriptomics to other genomic assays, including chromatin immunoprecipitation sequencing (ChIP-seq) for assessing enrichment of transcription factor (TF) binding sites within gene sets. The ChIP-Enrich method, for instance, tests ChIP-seq peaks against gene sets to identify functional categories associated with TF targets, such as developmental pathways enriched in histone modification profiles. In expression quantitative trait loci (eQTL) studies, GSEA integrates genome-wide association study (GWAS) summary statistics with eQTL annotations to detect genetic variants regulating entire gene sets, as demonstrated by GIGSEA, which imputes differential expression from genotypes to reveal heritability in pathways like immune response. These applications leverage GSEA's rank-based framework to bridge sequence data with functional interpretation, often using preranked lists derived from peak calls or effect sizes.[71][72]
A prominent example in cancer transcriptomics involves GSEA of breast cancer RNA-seq data, where analyses have identified epithelial-mesenchymal transition (EMT) gene sets as significantly enriched in basal-like subtypes, with leading-edge genes including the TF SNAI1 (driving repression of epithelial markers) and the intermediate filament VIM (promoting mesenchymal motility). This enrichment underscores EMT's role in tumor invasion, as validated across TCGA cohorts. Advances in single-cell RNA-seq (scRNA-seq) have prompted GSEA variants like AUCell, which calculates area-under-the-curve scores for gene set activity in individual cells, enabling detection of pathway enrichments specific to heterogeneous populations, such as inflammatory signatures in tumor-infiltrating lymphocytes. For temporal dynamics, time-series adaptations like TcGSA model longitudinal expression trajectories to identify evolving pathway activations, such as oscillatory cell cycle genes in synchronized cultures.[73]
The impact of GSEA in genomics is evident in its routine use within consortium-scale projects like The Cancer Genome Atlas (TCGA), where it annotates pathway alterations across thousands of tumor transcriptomes, and the Encyclopedia of DNA Elements (ENCODE), which employs it to interpret regulatory element enrichments in functional gene sets. These applications have standardized pathway-based annotation in large omics datasets, facilitating discoveries in gene regulation and cellular states.[74]
Disease and Biomarker Studies
Gene set enrichment analysis (GSEA) has been instrumental in elucidating molecular pathways dysregulated in various cancers, facilitating the identification of disease subtypes and therapeutic vulnerabilities. In glioblastoma multiforme (GBM), GSEA applied to transcriptomic profiles has revealed subtype-specific enrichments.[75] These findings underscore GSEA's role in stratifying cancer patients for targeted therapies, with leading-edge genes from enriched sets often serving as prognostic indicators.
In neurological disorders, GSEA has uncovered pathway alterations linked to synaptic dysfunction and immune dysregulation. For schizophrenia, single-cell RNA sequencing analyses have identified changes in synaptic genes and pathways in prefrontal cortex neurons from affected individuals.
Beyond oncology and neurology, GSEA has illuminated immune and metabolic dysregulations in other diseases. In systemic lupus erythematosus (SLE), an autoimmune condition, GSEA consistently identifies interferon (IFN) signature enrichment across peripheral blood mononuclear cells, with type I IFN-inducible genes driving hypergammaglobulinemia and autoantibody production, as evidenced in longitudinal patient cohorts.[76] In microbiome-associated disorders like acute pancreatitis, GSEA of host transcriptomes alongside microbial profiles demonstrates shifts in metabolic pathways, including elevated short-chain fatty acid synthesis genes, linking gut dysbiosis to inflammatory exacerbation and disease progression.[77]
For biomarker discovery, GSEA's leading-edge subset analysis prioritizes core genes driving enrichment, yielding candidate panels for clinical use. In diverse cancers, including colorectal and lung, combinations of 3–5 leading-edge genes from prognostic pathways (e.g., cell cycle or immune checkpoints) have outperformed single-gene markers in predicting survival, with validation in independent TCGA cohorts showing hazard ratios up to 2.5 for high-risk groups.[78]
Clinical translation of GSEA integrates it with genome-wide association studies (GWAS) for set-level insights into disease heritability. Tools like MAGMA extend GSEA principles to GWAS summary statistics, aggregating SNP effects to detect enriched pathways in traits like schizophrenia or autoimmune diseases, revealing polygenic contributions from synaptic or IFN-related sets with improved power over individual locus analysis.[79] Recent 2024 advancements incorporate AI to automate GSEA interpretation, using large language models to refine gene set curation and predict patient-specific pathway perturbations, enhancing precision medicine applications in oncology and neurology.[80]