Pathway analysis
Pathway analysis is a computational approach in bioinformatics and systems biology that interprets high-throughput molecular data (such as expression profiles from genomics, transcriptomics, or proteomics experiments) by mapping differentially expressed genes or proteins onto known biological pathways to identify those significantly enriched or perturbed under specific conditions, like disease versus control states.[1] This method provides biological context to large lists of molecular changes, revealing coordinated functional alterations in processes such as metabolism, signaling, or immune response, and aids in hypothesis generation for mechanistic studies.[2] By leveraging curated databases like KEGG, Reactome, or WikiPathways, pathway analysis transforms raw omics data into interpretable insights about cellular networks and disease mechanisms.[1] With conceptual roots in the genetic mapping efforts of the mid-20th century, pathway analysis gained prominence with the advent of high-throughput technologies following publication of the draft human genome sequence in 2001, evolving into a cornerstone of omics research by the mid-2000s.[1] Key methods include over-representation analysis (ORA), which tests for disproportionate gene enrichment in pathways using hypergeometric or Fisher's exact tests; functional class scoring (FCS), such as Gene Set Enrichment Analysis (GSEA), which evaluates cumulative pathway activity across all genes without arbitrary thresholds; and topology-based approaches (e.g., Signaling Pathway Impact Analysis or SPIA), which account for gene interactions and pathway structures to better capture regulatory dynamics.[2] These techniques have been benchmarked extensively, with topology-based methods often outperforming others in accuracy for identifying truly impacted pathways, though challenges like database incompleteness (covering only about 45% of human genes) and statistical biases persist.[2][1] In the multi-omics era, pathway analysis has expanded to integrate diverse data
types, including metabolomics and epigenomics, enabling more holistic views of biological systems and supporting applications in personalized medicine, drug discovery, and biomarker identification.[3] Recent advances emphasize network-based and machine learning-enhanced methods to address limitations in traditional enrichment analyses, improving reproducibility and predictive power across cohorts.[4] As of 2025, over 700 pathway databases exist, underscoring the field's maturity, yet ongoing efforts focus on standardizing annotations and handling complex, interconnected pathway topologies for robust interpretation.[5]
Overview
Definition and purpose
Pathway analysis is a computational method in bioinformatics that integrates high-throughput omics data, such as gene expression profiles, protein abundances, or metabolite levels, with predefined biological pathways to identify those that are statistically enriched or perturbed under specific experimental conditions.[6] This approach maps individual molecular changes onto structured networks of interacting genes, proteins, and metabolites, revealing coordinated alterations that might otherwise be obscured in raw data.[7] The primary purpose of pathway analysis is to provide biological context and mechanistic insights from lists of differentially expressed or regulated molecules, facilitating the interpretation of complex datasets in areas like disease pathology, drug response, or environmental perturbations.[8] By focusing on pathways rather than isolated entities, it enables researchers to infer affected cellular processes, such as signaling cascades or metabolic routes, and prioritize hypotheses for further validation.[3] Pathway analysis emerged in the early 2000s alongside the proliferation of microarray technologies, which generated vast gene expression datasets requiring systematic interpretation beyond individual gene assessments.[7] A pivotal advancement occurred in 2005 with the introduction of Gene Set Enrichment Analysis (GSEA) by Subramanian et al., which formalized a knowledge-based framework for detecting subtle, coordinated pathway-level changes without relying on arbitrary significance thresholds for single genes.[6] This method addresses key limitations of traditional single-gene analysis, particularly the multiple testing problem, where stringent corrections for thousands of hypotheses often yield few significant findings despite evident biological effects.[6] By evaluating gene sets collectively, pathway analysis enhances sensitivity to modest changes across multiple components, reducing false negatives and providing a more holistic view of 
system-wide perturbations.[9]
Types of biological pathways
Biological pathways are primarily categorized into three main types: metabolic, signaling, and regulatory, each representing distinct mechanisms of cellular function and interaction.[10] These classifications provide foundational models for understanding how biological systems process information and maintain homeostasis, serving as targets for pathway analysis in omics data.[11] Metabolic pathways comprise sequences of enzymatic reactions that transform substrates into products, facilitating energy production and biosynthesis. Examples include glycolysis, which converts glucose to pyruvate, and the Krebs cycle (tricarboxylic acid cycle), which generates reducing equivalents for oxidative phosphorylation. These pathways are typically modeled as directed graphs, with nodes representing metabolites or enzymes and edges denoting chemical reactions or conversions, often incorporating stoichiometry and directionality to reflect flux through the system.[10][11] Signaling pathways consist of cascades that propagate extracellular signals to elicit cellular responses, involving sequential activation of molecular components. A prominent example is the mitogen-activated protein kinase (MAPK) pathway, which transmits signals from growth factors to regulate processes like cell proliferation through protein interactions and phosphorylation events. In graph representations, nodes correspond to proteins, receptors, or second messengers, while directed edges illustrate activations, inhibitions, or bindings, emphasizing the flow of information rather than material transformation.[10][11] Regulatory pathways, often termed gene regulatory networks, govern the control of gene expression through interactions among transcription factors, microRNAs (miRNAs), and other regulators. The p53 signaling network exemplifies this, where p53 acts as a transcription factor responding to DNA damage to activate genes involved in cell cycle arrest or apoptosis. 
These are depicted as directed graphs with nodes as genes, proteins, or regulatory elements and edges indicating transcriptional activation, repression, or post-transcriptional modifications, capturing dynamic feedback loops.[10][11] Across these types, biological pathways are commonly formalized as graphs where nodes denote biological entities such as genes, proteins, or metabolites, and edges capture interactions, reactions, or regulatory relationships, enabling computational analysis of structure and dynamics. Standards like BioPAX (Biological Pathway Exchange) facilitate the interchange of pathway data, supporting representations of metabolic, signaling, and regulatory processes at molecular and genetic levels, while SBML (Systems Biology Markup Language) provides an XML-based format for encoding quantitative models of these pathways, including regulatory networks.[10][12] In contrast to broader biological networks, which integrate diverse interactions across an entire system and often display scale-free topologies with high connectivity, pathways are curated, context-specific models emphasizing linear or branching sequences of functionally linked events rather than exhaustive connectivity.[10][11]
Applications
In genomics and transcriptomics
In genomics and transcriptomics, pathway analysis typically begins with differential expression (DE) analysis of high-throughput sequencing data, such as RNA-seq or microarray outputs, to identify genes with significant changes in expression levels between conditions. For RNA-seq data, tools like DESeq2 are commonly employed to normalize read counts, model variance, and compute fold changes and p-values for DE genes, generating ranked lists based on statistics like log2 fold change or adjusted p-values. These ranked gene lists or fold-change values are then fed into pathway enrichment methods, such as over-representation analysis or gene set enrichment analysis, to assess whether predefined biological pathways are dysregulated.[13][14] This approach offers key advantages over single-gene analysis by accommodating genome-wide data scale, where thousands of genes are tested simultaneously, reducing false positives through multiple testing corrections while highlighting coordinated pathway-level perturbations that individual gene effects might obscure. For instance, it integrates subtle changes across multiple genes within a pathway, providing biological context and revealing mechanisms like signaling cascades that drive disease phenotypes, rather than isolated markers.[6][15] In cancer transcriptomics, pathway analysis has been instrumental in identifying dysregulated immune response pathways; for example, RNA-seq studies of tumor samples have shown enrichment in inflammation and immune activation pathways, such as those involving cytokine signaling, which correlate with tumor progression and immunotherapy response. 
Similarly, in genome-wide association studies (GWAS), single nucleotide polymorphisms (SNPs) are mapped to pathway genes using annotation resources, enabling enrichment analysis to link genetic variants to broader biological processes like cell cycle regulation, thereby prioritizing candidate pathways for functional validation.[16][17] Quantitative metrics from these analyses often include enrichment p-values, which indicate the statistical significance of pathway dysregulation; in tumor RNA-seq datasets, the Wnt signaling pathway, implicated in oncogenesis, has shown significant enrichment with p-values as low as 2.90 × 10^{-52} in renal cancer samples treated with arsenic trioxide, highlighting its role in modulating β-catenin activity.[18] As of 2025, recent advances integrate pathway analysis with single-cell RNA-seq (scRNA-seq) to detect cell-type-specific perturbations, such as heterogeneous immune pathway activations within tumor microenvironments, using graph-based models like GSDensity for pathway-centric dissection of transcriptomic heterogeneity.[19]
In proteomics and metabolomics
In proteomics, pathway analysis leverages mass spectrometry data to assess enrichment in biological pathways affected by post-translational modifications (PTMs), particularly in dynamic processes like kinase signaling that influence drug responses. For instance, quantitative proteomics workflows identify phosphorylation sites on kinases, revealing pathway activations or inhibitions in response to therapeutic agents, such as targeted inhibitors in cancer cells. This approach has been instrumental in characterizing PTM landscapes in drug metabolism, where altered signaling pathways correlate with efficacy and resistance profiles.[20] Interactive tools further enable visualization of PTM dysregulation across signaling cascades, facilitating the pinpointing of regulatory hubs in disease progression.[21] Pan-cancer studies using mass spectrometry have demonstrated widespread PTM alterations in oncogenic pathways, underscoring their role in therapeutic targeting.[22] In metabolomics, pathway analysis maps profiles from liquid chromatography-mass spectrometry (LC-MS) and gas chromatography-mass spectrometry (GC-MS) to reconstruct metabolic networks, detecting flux imbalances such as those in the tricarboxylic acid (TCA) cycle during diabetes pathogenesis. Untargeted metabolomics has identified TCA cycle intermediates like citrate and α-ketoglutarate as dysregulated in type 2 diabetes, linking mitochondrial dysfunction to insulin resistance and hyperglycemia.[23] These analyses reveal broader network perturbations, including amino acid and lipid metabolism, providing biomarkers for disease monitoring and intervention. 
Advanced mass spectrometry strategies enhance resolution for low-abundance metabolites, enabling precise pathway mapping in diabetic complications like nephropathy.[24] Multi-omics integration extends pathway analysis by combining proteomics and metabolomics with genomics to trace causal perturbations from genetic variants through protein modifications to metabolite outputs. For example, tools like Pathview overlay genomic, proteomic, and metabolomic data onto KEGG pathway maps, visualizing how gene mutations propagate to alter kinase activity and downstream TCA flux in metabolic disorders. This layered approach has elucidated regulatory mechanisms in complex diseases, such as integrating phosphoproteomics with metabolomics to model signaling-metabolism crosstalk.[25] Comprehensive frameworks for transcriptomics-proteomics-metabolomics integration emphasize pathway-centric methods to uncover coordinated changes, improving predictive accuracy over single-omics analyses.[26] Distinct challenges in these domains arise from the inherent variability and noise in protein and metabolite measurements, which exceed those in genomic data and demand advanced normalization and imputation techniques for reliable enrichment. Proteomics data often exhibit batch effects and low reproducibility due to PTM lability, complicating pathway inference. Metabolomics pathway modeling further requires stoichiometric constraints, as used in flux balance analysis, yet incomplete metabolite coverage and dynamic range issues hinder accurate reconstructions. In microbiome contexts, pathway analysis of host-pathogen interactions highlights these issues, as seen in recent genome-scale models revealing metabolic exchanges like nutrient competition in gut infections, where integration of noisy multi-omics data is essential for dissecting reciprocal influences.[27][28]
Pathway Databases and Resources
Curated pathway databases
Curated pathway databases compile structured representations of biological pathways, including metabolic, signaling, and regulatory processes, derived from experimental evidence and literature. These databases provide graph-based models that capture interactions, reactions, and relationships among genes, proteins, metabolites, and other biomolecules, enabling detailed analysis of cellular functions. Major databases emphasize manual curation by domain experts to ensure accuracy and integration of supporting evidence, such as PubMed IDs (PMIDs) for referenced publications.[29][30][31] The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a foundational resource featuring manually drawn pathway maps that represent molecular interaction, reaction, and relation networks across organisms. As of November 2025, KEGG contains 581 pathway maps, including metabolic pathways annotated with Enzyme Commission (EC) numbers for catalytic reactions and signaling pathways detailing regulatory cascades. Curation involves expert annotation of pathways based on genomic and biochemical data, with regular updates incorporating new evidence from literature and experimental studies.[32][33][34] Reactome offers a human-centric, peer-reviewed database of detailed reaction steps organized in a hierarchical format, covering 16,002 molecular events (reactions) in processes like signal transduction and gene expression. Pathways are manually curated by biologists using a controlled vocabulary and linked to evidence from publications via PMIDs, ensuring traceability to primary sources. The database supports interoperability through export in formats like BioPAX (Biological Pathway Exchange), a community standard for representing pathways at the molecular and cellular level in OWL XML. 
As of September 2025 (version 94), Reactome includes 2,825 human pathways with expanded annotations for disease-related processes.[30][35][36][37] WikiPathways is an open, community-driven platform where researchers collaboratively curate and update pathway diagrams using the PathVisio editor, fostering rapid incorporation of emerging knowledge. It hosts 1,913 human pathways across 27 species as of 2023, with ongoing contributions from over 600 individuals, and emphasizes open access for visualization and analysis. Curation follows a review process with literature-backed annotations, including PMIDs, and pathways are available in BioPAX and GPML formats for integration with other tools.[31][38][39] To enhance coverage, integrative resources like PathDIP version 5 aggregate data from 12 curated databases, resulting in 6,535 pathways spanning 195,148 genes and 5,783 diseases in humans and 16 other organisms. This integration extends pathway associations to indirectly connected proteins, addressing gaps in primary databases through propagation algorithms and manual validation. Similarly, the STRING database version 12.5 (released in 2025, building on 2024 updates) incorporates pathway information into its protein association networks, adding regulatory directionality based on curated evidence for 12,535 organisms. These databases serve as reference sets for pathway enrichment analysis, where gene sets derived from them inform functional interpretation of omics data.[40][41][42][43]
Gene set collections
Gene set collections consist of predefined lists of genes or gene products grouped based on shared biological attributes, such as functional roles, regulatory mechanisms, or experimental perturbations, serving as resources for gene set enrichment analysis in genomics. These collections differ from structured pathway databases by providing flat, unordered groupings that facilitate broad functional interpretation without requiring network topology.[44] The Molecular Signatures Database (MSigDB), maintained by the Broad Institute, is a leading resource with version 2025.1 containing 35,134 human gene sets divided into nine major collections.[45] Hallmark gene sets (H, 50 sets) represent refined, high-confidence signatures of well-defined biological states, such as inflammation or adipogenesis, derived from multiple sources.[46] The Gene Ontology (GO) collection within MSigDB (C5, 16,228 sets, including 10,480 GO terms) organizes genes into categories for biological processes, molecular functions, and cellular components, enabling analysis of diverse functional annotations beyond pathways.[46][47] Oncogenic gene sets (C6, 189 sets) capture cancer-related neighborhoods and modules curated from tumor gene expression profiles.[46] Gene set types include canonical pathways, compiled from databases like KEGG and Reactome as part of MSigDB's C2 collection (7,561 curated sets overall).[45] Computational sets, such as motif-based regulatory targets (C3), infer gene groupings from sequence motifs or microRNA binding sites.[46] Immunologic signatures (C7, 5,219 sets) encompass cell type-specific, state-specific, and perturbation-induced profiles from the ImmuneSigDB compendium.[46] Curation processes emphasize manual annotation from peer-reviewed literature and validated databases, with MSigDB distinguishing categories like C2 (literature- and database-derived) from computationally generated ones like C3.[48] These collections offer advantages for enrichment analyses by 
simplifying computations with unordered lists, enabling rapid assessment of functional themes, and accommodating non-pathway groupings like cellular compartments or immunologic states that lack defined interactions.[49] However, they inherently lack the topological details of pathway databases, such as gene regulatory directions or interaction strengths, potentially overlooking nuanced biological relationships.[49]
Methods of Pathway Analysis
Over-representation analysis (ORA)
Over-representation analysis (ORA) is a threshold-based statistical method employed in pathway analysis to evaluate whether genes associated with a specific biological pathway are more prevalent in a list of significant features (such as differentially expressed genes from genomic or transcriptomic data) than would be expected by random chance. This approach is particularly useful for interpreting high-throughput experimental results by identifying enriched pathways that may underlie observed biological phenomena. ORA operates on a binarized gene list, where features are classified as significant or non-significant based on a predefined cutoff, typically an adjusted p-value threshold like 0.05.[50] The core of ORA involves computing the enrichment of pathway genes within the significant list using the hypergeometric distribution or, equivalently, Fisher's exact test, which models the selection process as sampling without replacement from a finite population. The one-sided p-value for over-representation is:
p = \sum_{i=k}^{\min(n, K)} \frac{\binom{K}{i} \binom{N - K}{n - i}}{\binom{N}{n}}
Here, N is the total number of genes in the background set (e.g., all genes assayed in the experiment), n is the size of the selected significant gene list, K is the number of background genes annotated to the pathway of interest, k is the observed number of overlaps between the significant list and the pathway genes, and i is the summation index running over every overlap at least as large as the observed one. A low p-value indicates significant over-representation, suggesting the pathway is biologically relevant to the experimental condition. This method was popularized in tools like GO::TermFinder for Gene Ontology enrichment. The standard workflow for ORA begins with defining the background gene set to provide context for the null distribution, ensuring it reflects the experimental scope (e.g., genes detectable by the microarray or sequencing platform).
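The tail probability above, and its dependence on the background set, can be made concrete with a small sketch. The Python function below evaluates the one-sided hypergeometric p-value directly from binomial coefficients. The function name, argument names, and the numbers in the example are illustrative assumptions rather than values from any cited study, and production analyses would normally rely on an established statistics library.

```python
from math import comb


def ora_p_value(background_size, pathway_size, list_size, overlap):
    """One-sided hypergeometric p-value: P(X >= overlap).

    background_size: genes in the background set (N in the text)
    pathway_size:    background genes annotated to the pathway
    list_size:       genes in the significant list (n in the text)
    overlap:         significant genes that are in the pathway (k in the text)
    """
    if not 0 <= pathway_size <= background_size:
        raise ValueError("pathway_size must lie between 0 and background_size")
    if not 0 <= list_size <= background_size:
        raise ValueError("list_size must lie between 0 and background_size")
    max_overlap = min(list_size, pathway_size)
    if not 0 <= overlap <= max_overlap:
        raise ValueError("overlap must lie between 0 and min(list_size, pathway_size)")

    total = comb(background_size, list_size)  # all ways to draw the significant list
    # Sum the probabilities of every overlap at least as large as the one observed.
    tail = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, list_size - i)
        for i in range(overlap, max_overlap + 1)
    )
    return tail / total


if __name__ == "__main__":
    # Hypothetical experiment: 20,000 assayed genes, 500 called significant,
    # and a 100-gene pathway contributing 10 of them (2.5 expected by chance).
    p = ora_p_value(background_size=20000, pathway_size=100, list_size=500, overlap=10)
    print(f"enrichment p-value: {p:.3g}")
```

Re-running the example with a different background_size while holding the other inputs fixed shows how strongly the result depends on the background definition, which is why that choice is the first step of the workflow.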
Next, the significant gene list is generated by applying a statistical threshold to differential analysis results. Enrichment p-values are then computed for each pathway or gene set from curated databases, followed by correction for multiple hypothesis testing; common corrections include the Bonferroni method, which stringently controls the family-wise error rate, and the Benjamini-Hochberg procedure, which controls the false discovery rate (FDR). Pathways with corrected p-values below a chosen threshold (e.g., 0.05) are deemed significantly enriched.[50] Key assumptions underlying ORA include the independence of genes within pathways, meaning correlations or interactions among genes are not accounted for, and the binary classification of significance, which discards quantitative information on effect sizes or ranks. These assumptions simplify computation but can introduce biases if violated.[50] For example, the DAVID tool has been used in microarray analyses to identify over-representation in cell death and apoptosis-related annotation clusters from differentially expressed genes in treated versus control samples, aiding in functional interpretation.[51] ORA's strengths lie in its straightforward implementation and computational efficiency, enabling rapid analysis of large gene lists without requiring complex data preprocessing. However, it is limited by its dependence on arbitrary thresholds, which can alter results and discard valuable information from non-significant but directionally consistent genes, and by a tendency to favor larger pathways due to higher baseline overlap probabilities.[52]
Functional class scoring (FCS)
Functional class scoring (FCS) methods represent a category of pathway enrichment analyses that utilize the full spectrum of gene expression data by ranking all genes according to their differential expression statistics, thereby avoiding the need for arbitrary significance thresholds inherent in over-representation analysis (ORA).[53] In FCS, individual gene scores (such as moderated t-statistics derived from linear models for microarray or RNA-seq data using tools like limma) are computed to rank the entire gene list from most up-regulated to most down-regulated. This ranked list is then interrogated for each pathway or gene set to quantify the degree of coordinated perturbation, emphasizing subtle but consistent shifts across multiple genes rather than extreme changes in a few.[54] The core computation in FCS involves deriving an enrichment score (ES) for a given pathway, which aggregates the positions of pathway genes within the ranked list. A prominent variant, as implemented in Gene Set Enrichment Analysis (GSEA), employs a running-sum statistic resembling a Kolmogorov-Smirnov test to measure deviation from random expectation. The running sum is calculated by walking down the ranked list L (from most up-regulated to most down-regulated), increasing the sum when encountering a gene in the set S and decreasing it for genes not in S; the enrichment score is the maximum deviation of this running sum from zero. In the unweighted version, every gene in S contributes an equal positive increment and every gene outside S an equal negative decrement, scaled so that the running sum returns to zero at the end of the list; in the weighted version, the increment for a gene in S is proportional to the magnitude of its ranking statistic. Significance is assessed via permutation tests.[6] Key FCS methods include GSEA, which relies on this permutation-based framework to detect enriched pathways, and PADOG (Pathway Analysis with Down-weighting of Overlapping Genes), a robust aggregation approach that combines moderated t-test statistics for pathway genes while down-weighting those shared across multiple sets to mitigate overlap biases.
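The running-sum statistic described above for GSEA can be sketched as follows. This is a minimal illustration of the unweighted, Kolmogorov-Smirnov-like score, with increments scaled so that the sum ends at zero. Significance is estimated here by drawing random gene sets of the same size, which is a simplifying assumption (GSEA can also permute sample labels). The function names and toy gene identifiers are hypothetical.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (maximum deviation from zero)."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(hits) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene_set must cover some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up for set members, down for non-members; the steps are scaled
        # so that the walk returns to zero at the end of the list.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Empirical two-sided p-value from random gene sets of equal size."""
    ranked_genes = list(ranked_genes)
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    size = sum(gene in gene_set for gene in ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if abs(enrichment_score(ranked_genes, random_set)) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy ranking: g1 is the most up-regulated gene, g10 the most down-regulated.
    ranking = [f"g{i}" for i in range(1, 11)]
    top_set = {"g1", "g2", "g3"}
    print("ES:", enrichment_score(ranking, top_set))
    print("p :", permutation_p_value(ranking, top_set, n_perm=500))
```

A positive score indicates that the set is concentrated near the top of the ranking and a negative score that it is concentrated near the bottom, so no per-gene significance cutoff is needed at any point.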
In PADOG, the pathway score is the mean of the absolute values of weighted moderated t-statistics for all genes in the pathway, where weights are higher for genes appearing in fewer sets, enabling sensitive detection in small sample sizes. Both methods process the ranked or scored gene list through these aggregation steps, followed by permutation or parametric resampling to derive p-values. An illustrative application of GSEA is the analysis of gene expression data to identify enrichment of epithelial-mesenchymal transition (EMT) gene sets in chemoresistant breast cancer samples, revealing coordinated changes in genes like VIM and CDH2 associated with aggressive behavior.[55] FCS approaches offer advantages in sensitivity to modest, distributed expression changes that might be overlooked by threshold-dependent methods, as they leverage all data points without discarding information.[54] Additionally, by normalizing contributions based on gene set size and total genes, FCS reduces biases toward large pathways, promoting equitable evaluation across biological processes of varying complexity.[53]
Pathway topology analysis (PTA)
Pathway topology analysis (PTA) integrates the structural information of biological pathways, modeled as graphs where nodes represent genes or proteins and edges denote interactions or regulatory relationships, to assess the impact of experimental data such as gene expression changes. Unlike over-representation analysis (ORA) or functional class scoring (FCS), which treat genes within a pathway as equally contributing members, PTA propagates signals along the graph edges to weight the influence of each gene based on its position and connectivity, thereby capturing how perturbations in upstream components may amplify or attenuate downstream effects. This approach enhances the detection of pathway dysregulation by accounting for the hierarchical and interconnected nature of signaling or metabolic processes.[56] A seminal method in PTA is Signaling Pathway Impact Analysis (SPIA), which combines evidence from ORA with a perturbation accumulation score to rank pathways. In simplified form, a topology-weighted perturbation score for a pathway can be written as
P = \sum_i d_i \cdot |t_i|
where t_i is the t-statistic or fold change for gene i, and d_i is a topology coefficient reflecting the gene's position, such as its number of direct targets or a centrality measure like betweenness in directed graphs. SPIA itself goes further and defines the perturbation factor of each gene recursively as PF(g_i) = \Delta E(g_i) + \sum_j \beta_{ij} \cdot PF(g_j) / N_{ds}(g_j), where \Delta E(g_i) is the measured expression change of gene g_i, the sum runs over the genes g_j directly upstream of g_i, \beta_{ij} encodes the type of interaction (positive for activation, negative for inhibition), and N_{ds}(g_j) is the number of downstream targets of g_j. SPIA has been benchmarked on various datasets, demonstrating its ability to identify impacted pathways by accounting for topology, outperforming unweighted methods in detecting regulatory dynamics.[57][58][56] Other key PTA methods include DEGraph, which employs Gaussian graphical models and multivariate tests like the Hotelling T^2-statistic to evaluate differential expression across the entire pathway graph, and PARADIGM, which uses factor graphs to infer patient-specific pathway activities by integrating multi-omics data and propagating probabilities through nodes and edges.
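To illustrate the topology-weighted score P = \sum_i d_i \cdot |t_i| given above, the sketch below takes d_i to be the number of direct downstream targets of each gene. It is a deliberately simplified illustration of position-based weighting, not an implementation of SPIA; the function name, edge list, gene labels, and statistics are all hypothetical.

```python
def topology_weighted_score(edges, stats):
    """Sum of |statistic| per gene, weighted by the gene's out-degree.

    edges: iterable of (source, target) pairs describing a directed pathway
    stats: dict mapping gene -> t-statistic or log fold change
    """
    out_degree = {}
    for source, target in edges:
        out_degree[source] = out_degree.get(source, 0) + 1
        out_degree.setdefault(target, 0)  # terminal genes get weight 0
    # Genes without a measured statistic contribute nothing to the score.
    return sum(
        degree * abs(stats.get(gene, 0.0)) for gene, degree in out_degree.items()
    )


if __name__ == "__main__":
    # Toy signaling cascade (illustrative only).
    cascade = [("RAS", "RAF"), ("RAS", "PI3K"), ("RAF", "MEK"), ("MEK", "ERK")]

    # The same expression change placed at different positions in the cascade.
    upstream_hit = topology_weighted_score(cascade, {"RAS": 2.0})
    midstream_hit = topology_weighted_score(cascade, {"MEK": 2.0})
    terminal_hit = topology_weighted_score(cascade, {"ERK": 2.0})
    print(upstream_hit, midstream_hit, terminal_hit)
```

With a pure out-degree weight, a change in a terminal gene contributes nothing, which is one reason practical topology methods combine a gene's own measured change with what it passes downstream.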
DEGraph assesses whether the joint distribution of gene expressions differs between conditions while incorporating graph structure to boost statistical power, particularly for small sample sizes. PARADIGM models pathways as probabilistic graphical models, enabling the detection of shifts in pathway states, such as activation or inhibition, in cancer genomics datasets.[59] Topology metrics in PTA commonly include node degree (number of connections), centrality measures (e.g., betweenness for bottlenecks or closeness for influence), and distinctions between directed (e.g., signaling cascades) and undirected (e.g., metabolic interactions) graphs to reflect flow directionality. These metrics allow PTA to prioritize genes with high centrality, such as hubs or gatekeepers, whose dysregulation disproportionately affects pathway function. For instance, in directed pathways, incoming and outgoing edge weights propagate signals asymmetrically, unlike undirected models that assume symmetric interactions.[56][60] The primary benefits of PTA lie in its ability to account for pathway bottlenecks and interaction dependencies, leading to improved sensitivity and specificity over flat enrichment methods; studies show PTA detects more biologically relevant pathways in benchmark datasets by weighting topological positions. This structured weighting mitigates biases from gene length or expression levels, providing deeper insights into mechanistic dysregulation, such as in disease contexts where upstream mutations propagate broadly.[57][61][56]
Network enrichment analysis (NEA)
Network enrichment analysis (NEA) integrates gene expression or other omics data with molecular interaction networks to detect enriched subnetworks, extending beyond predefined gene sets by leveraging connectivity in global networks like protein-protein interaction (PPI) maps. Unlike overlap-based methods, NEA quantifies enrichment through network topology, such as the density of links between query genes and functional modules, enabling the identification of emergent biological associations not captured by isolated gene lists. This approach typically begins by constructing a comprehensive network from resources like the STRING database, which compiles interactions from experimental, computational, and literature sources across thousands of organisms.[62] Key NEA methods employ diffusion or clustering to propagate scores across the network and score connected components. For instance, EnrichNet uses random walk with restart (RWR) on a human PPI network, where seed nodes from the query set initiate propagation. The score of node v is updated iteratively as
S_v^{(t+1)} = (1 - \gamma) \sum_u A_{uv} S_u^{(t)} / \deg(u) + \gamma f_v
where \gamma = 0.9 is the restart probability, A is the adjacency matrix, \deg(u) is the degree of node u, and f_v is the seed indicator; iterating to a steady state lets scores diffuse from the seeds and reveals pathway associations via subnetwork visualization. Similarly, PRINCE applies label propagation on PPI networks to prioritize disease-associated genes and complexes, computing a smooth scoring function by iteratively diffusing priors from known disease genes across the graph, achieving high accuracy in gene prioritization for diseases like Alzheimer's.
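The random-walk-with-restart update rule above can be iterated to convergence in a few lines. The sketch below assumes an undirected network stored as an adjacency dictionary in which every node has at least one neighbour, and it normalizes the seed indicator f so that the scores sum to one. Both choices, along with the function name and the toy network, are assumptions made for illustration and are not taken from EnrichNet's implementation.

```python
def random_walk_with_restart(adjacency, seeds, gamma=0.9, tol=1e-10, max_iter=1000):
    """Iterate S <- (1 - gamma) * (walk step) + gamma * f until scores stabilize.

    adjacency: dict mapping node -> list of neighbours (undirected, symmetric)
    seeds:     collection of query nodes that restart the walk
    gamma:     restart probability (0.9 is the value quoted in the text)
    """
    nodes = list(adjacency)
    seed_nodes = [node for node in nodes if node in seeds]
    if not seed_nodes:
        raise ValueError("at least one seed must be present in the network")

    # Normalized seed indicator f: uniform over seeds, zero elsewhere.
    f = {node: (1.0 / len(seed_nodes) if node in seeds else 0.0) for node in nodes}
    scores = dict(f)

    for _ in range(max_iter):
        updated = {node: gamma * f[node] for node in nodes}
        for u in nodes:
            neighbours = adjacency[u]
            if not neighbours:
                continue  # an isolated node has nowhere to pass its score
            share = (1.0 - gamma) * scores[u] / len(neighbours)
            for v in neighbours:
                updated[v] += share
        change = sum(abs(updated[node] - scores[node]) for node in nodes)
        scores = updated
        if change < tol:
            break
    return scores


if __name__ == "__main__":
    # Toy chain network a - b - c - d with a single seed at one end.
    network = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    for node, score in random_walk_with_restart(network, {"a"}).items():
        print(node, round(score, 6))
```

Scores fall off with network distance from the seeds, and a larger gamma keeps the walk closer to them; ranking non-seed nodes by score is what lets diffusion-based methods associate a query set with nearby pathway members.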
Module detection often incorporates algorithms like Markov Clustering (MCL), which simulates flow in the network to partition it into dense clusters, scoring these for enrichment against query sets to uncover functional modules.[63][64][65] In contrast to pathway topology analysis (PTA), which relies on fixed, curated pathway structures with predefined topologies, NEA operates on larger, uncurated networks to discover de novo modules without assuming rigid pathway boundaries, thus capturing indirect or emergent interactions across broader biological contexts. For example, network-based analyses in Alzheimer's disease using PPI networks have identified modules enriched for amyloid processing genes like APP.[66] Recent advancements incorporate directional edges in signed networks to model activation/inhibition, enhancing module specificity in neurodegenerative analyses. Advantages include the ability to reveal novel interactions missed by pathway-centric methods, but NEA is computationally intensive due to network scale and propagation iterations, often requiring efficient randomization for significance testing.[67][66][68] Recent advances as of 2025 include machine learning-enhanced NEA methods, such as graph neural networks for enrichment analysis (e.g., GNNenrich), and topology-aware approaches for spatial transcriptomics data, improving detection in complex omics contexts.[69][70]
Software and Tools
Open-source tools
Open-source tools for pathway analysis provide accessible platforms for researchers to perform enrichment analyses, visualize results, and integrate multi-omics data without licensing costs, often leveraging community-driven development within ecosystems like Bioconductor. These tools typically support methods such as over-representation analysis (ORA), functional class scoring (FCS), and network enrichment analysis (NEA), enabling users to input gene lists in formats like Ensembl IDs or Entrez IDs and generate outputs including enrichment results with adjusted p-values and visualizations.[71]
The R package clusterProfiler is a widely used tool for comprehensive enrichment analysis, supporting ORA and FCS methods such as gene set enrichment analysis (GSEA) across thousands of species with up-to-date annotations from databases like MSigDB and KEGG.[71] It features the compareCluster function for multi-set comparisons, allowing simultaneous analysis of multiple gene lists to identify overlapping enriched pathways, and produces visualizations such as dot plots and bubble plots for intuitive interpretation. As part of the Bioconductor project, clusterProfiler integrates seamlessly with other R packages for downstream analyses, including support for non-coding genomics data.[71]
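The statistics behind an ORA result table can be sketched in a few lines of Python. This illustrates the general technique (a one-sided hypergeometric test followed by Benjamini-Hochberg adjustment), not clusterProfiler's code. All function names, gene identifiers, and gene sets below are hypothetical placeholders.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K belong to the pathway."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def over_representation(query, gene_sets, background):
    """Test each gene set for over-representation among the query genes."""
    background = set(background)
    query = set(query) & background
    rows = []
    for name, members in gene_sets.items():
        members = set(members) & background
        overlap = len(query & members)
        p = hypergeom_upper_tail(overlap, len(background), len(members), len(query))
        rows.append((name, overlap, p))
    adjusted = benjamini_hochberg([p for _, _, p in rows])
    return [(name, k, p, q) for (name, k, p), q in zip(rows, adjusted)]


# Hypothetical toy data: 20 background genes, two placeholder pathways.
background = [f"g{i}" for i in range(1, 21)]
gene_sets = {
    "pathway_A": [f"g{i}" for i in range(1, 6)],
    "pathway_B": [f"g{i}" for i in range(6, 16)],
}
query = ["g1", "g2", "g3", "g4", "g16"]

results = over_representation(query, gene_sets, background)
for name, overlap, p, q in results:
    print(f"{name}: overlap={overlap} p={p:.4g} adjusted={q:.4g}")
```

The choice of background matters as much as the test itself: restricting it to genes that were actually measured, rather than the whole genome, generally gives more conservative and more defensible p-values.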
For pathway visualization, the R package pathview maps user data onto KEGG pathway graphs and renders the result, facilitating the integration of omics datasets such as gene expression or metabolomics measurements with predefined diagrams.[72] It automatically handles data parsing and graph rendering, supports inputs in various formats, and produces publication-ready images that highlight node-level changes, such as fold changes in expression.[73] Pathview is also distributed through Bioconductor, so it can be combined with tools like clusterProfiler for end-to-end workflows.[74]
Web-based options like Enrichr offer a user-friendly interface for gene set enrichment, aggregating libraries from MSigDB, KEGG, and over 100 other sources to perform ORA and visualization without installation.[75] Users can upload gene lists via a web form, receiving interactive results including bar graphs, tables of top terms, and combined scores for pathway overlaps.[75] Enrichr supports programmatic access through its API, making it suitable for batch processing in pipelines.[75]
The GSEA desktop application, developed by the Broad Institute, is a standalone Java tool implementing FCS via the GSEA method, which assesses coordinated changes across ranked gene lists using metrics like normalized enrichment scores (NES). It processes microarray or RNA-seq data, supports custom gene sets, and generates detailed reports with heatmaps and leading-edge analyses to identify core enriched genes.[76] The application is freely downloadable and runs on multiple operating systems, with updates maintaining compatibility with recent MSigDB releases.
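The running-sum statistic at the core of the GSEA method can be sketched as follows. This is a simplified illustration, not the Broad Institute implementation: the function name, gene labels, and scores are hypothetical, and the permutation step that turns an enrichment score (ES) into a normalized enrichment score (NES) is omitted.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Weighted Kolmogorov-Smirnov-like running sum over a ranked gene list.

    ranked_genes: genes sorted by descending ranking metric.
    scores: dict mapping gene -> ranking metric.
    Returns (ES, leading-edge genes).
    """
    in_set = [g in gene_set for g in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")
    # Normalizer so that the hit increments sum to 1
    hit_norm = sum(abs(scores[g]) ** weight for g, h in zip(ranked_genes, in_set) if h)
    if hit_norm == 0:
        raise ValueError("all gene-set members have a zero ranking metric")

    running, best, best_idx = 0.0, 0.0, -1
    for i, (gene, hit) in enumerate(zip(ranked_genes, in_set)):
        if hit:
            running += abs(scores[gene]) ** weight / hit_norm  # step up at a hit
        else:
            running -= 1.0 / n_miss  # step down at a miss
        if abs(running) > abs(best):  # track the maximum deviation from zero
            best, best_idx = running, i

    # Leading edge: set members up to the peak (positive ES) or from it onward (negative ES)
    if best >= 0:
        leading = [g for g in ranked_genes[: best_idx + 1] if g in gene_set]
    else:
        leading = [g for g in ranked_genes[best_idx:] if g in gene_set]
    return best, leading


# Hypothetical ranked list: placeholder genes with made-up ranking metrics.
scores = {"g1": 4.0, "g2": 3.0, "g3": 2.0, "g4": 1.0,
          "g5": -1.0, "g6": -2.0, "g7": -3.0, "g8": -4.0}
ranked_genes = sorted(scores, key=scores.get, reverse=True)
es, leading_edge = enrichment_score(ranked_genes, scores, gene_set={"g1", "g2", "g4"})
print(es, leading_edge)
```

On this toy input the running sum peaks at 0.875 after the second gene, so the leading edge consists of g1 and g2. In a full analysis the ES would be compared against a null distribution from phenotype or gene-set permutations to obtain the NES and its significance.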
For multi-omics integration, the R package mixOmics provides multivariate methods like sparse partial least squares (sPLS) to combine datasets from proteomics, metabolomics, and transcriptomics prior to pathway analysis, reducing dimensionality while preserving biological relevance.[77] It supports pathway-level projections through integration with enrichment tools, enabling identification of correlated features across omics layers for downstream NEA or PTA.[78]
Recent advancements include the STRING API update in 2025, which enhances NEA by incorporating directionality in protein-protein interaction networks and adds a new geneset_description function for automated enrichment on input gene sets, drawing on over 20 billion interactions across 12,535 species.[79] Similarly, WebGestalt 2024 adds support for metabolomics through integration with RaMP-DB and for multi-omics analysis through multi-list functionality. It covers over 600,000 functional categories, offers interactive visualizations such as pathway maps, and speeds up ORA and other computations on large datasets.[80]
In practice, users should standardize inputs to a common identifier type (e.g., HGNC symbols for human genes) to avoid mapping errors, and can follow Bioconductor's vignette tutorials to combine tools like clusterProfiler with pathview, pairing enrichment bubble plots with pathway diagrams annotated with the underlying data. These open-source resources foster reproducibility through version-controlled code and community forums.