Trajectory inference
Trajectory inference is a computational method in single-cell omics analysis that reconstructs the dynamic progression of cell states by ordering individual cells along continuous trajectories, typically parameterized by pseudotime, to model biological processes such as differentiation, maturation, and response to stimuli from snapshot data.[1] These methods address the challenge of inferring temporal dynamics from static, high-dimensional measurements like gene expression profiles, enabling the study of cellular transitions at single-cell resolution without requiring time-series experiments.[2]
The field emerged around 2014 with pioneering tools like Monocle, which ordered cells along a minimum spanning tree built on dimensionality-reduced data, and Wanderlust, which estimated pseudotime from shortest-path distances on k-nearest neighbor graphs.[1] Since then, over 70 trajectory inference tools have been developed, reflecting rapid evolution driven by advances in single-cell RNA sequencing (scRNA-seq) technologies.[1] Early methods focused on linear or simple branching topologies, but subsequent innovations incorporated more complex structures, such as cycles and multifurcations, to better capture real biological variability.[2]
Trajectory inference methods can be broadly categorized into clustering-based, graph-based, and probabilistic approaches.
Clustering-based methods, such as Slingshot, TSCAN, and SCORPIUS, first identify discrete cell clusters representing states and connect them via minimum spanning trees or principal curves to form trajectories.[2] Graph-based techniques, including PAGA and diffusion pseudotime (DPT), leverage nearest-neighbor graphs or random walks to propagate pseudotime and infer global structures, often excelling in scalability for large datasets.[1] Probabilistic models, like Palantir and VeloVI,[3] introduce uncertainty quantification through generative processes, such as Gaussian processes or Markov chains, and integrate additional data modalities like RNA velocity to predict future cell fates.[2]
Applications of trajectory inference extend beyond basic trajectory reconstruction to downstream analyses, including differential gene expression along paths (e.g., via tradeSeq), gene regulatory network inference, and alignment of trajectories across datasets or conditions.[2] It has been instrumental in fields like developmental biology, immunology, and oncology, revealing insights into processes such as embryonic development, immune cell activation, and tumor heterogeneity.[4] Benchmarks, such as the dynverse evaluation of 45 methods on synthetic and real datasets, highlight trade-offs in accuracy, stability, and computational efficiency, with no single method outperforming others universally; selection depends on data topology, dimensionality, and noise levels.[1]
Despite these advances, challenges persist, including sensitivity to data preprocessing, assumptions about continuous progression that may not hold for discrete transitions, and the need for robust quality control to distinguish true trajectories from artifacts.[2] Ongoing developments focus on multi-omics integration, such as combining transcriptomics with epigenomics or spatial data, and machine learning enhancements to handle complex, non-linear dynamics more effectively.[2]
Background
Definition and principles
Trajectory inference is a computational approach that reconstructs continuous developmental or differentiation paths, known as trajectories, from static snapshot data obtained from single-cell omics technologies, such as single-cell RNA sequencing (scRNA-seq).[1] It achieves this by ordering individual cells along a pseudotime axis, which serves as a proxy for biological progression through a dynamic process, thereby enabling the inference of temporal dynamics without requiring time-series experiments.[5] This method assumes that the observed cells represent asynchronous samples from an underlying continuous biological process, where gene expression changes occur gradually, allowing the reconstruction of cell fate transitions from high-dimensional data.[1] The core principles of trajectory inference revolve around modeling cellular progression as a graph-like structure, where nodes represent cell states and edges denote transitions along pseudotime.[2] Pseudotime quantifies the extent of progress through the process, often derived from diffusion-based or minimum spanning tree approaches that capture manifold geometry in the data. 
Trajectories can take various topologies, including linear paths for unbranched processes, branching structures to model fate bifurcations, and multifurcating graphs for complex decision points involving multiple lineages.[1]
Primarily applied to high-dimensional datasets like gene expression profiles, trajectory inference facilitates the study of cell fate decisions in contexts such as embryonic development or tissue regeneration, where traditional time-course data are scarce or infeasible.[1] For instance, in scRNA-seq analysis of differentiating cells, clusters representing distinct stages are identified and linked via edges to form a trajectory graph that visualizes progression from progenitor to mature states.[5] Dimensionality reduction techniques, such as principal component analysis, are often employed as a preprocessing step to aid visualization and noise reduction in these high-dimensional spaces.[2]
Historical development
Trajectory inference emerged in the mid-2010s, driven by advancements in single-cell RNA sequencing (scRNA-seq) that provided high-resolution snapshots of cellular states, necessitating computational methods to reconstruct dynamic processes like differentiation. The foundational concept of pseudotime, the ordering of cells along inferred developmental paths based on transcriptional similarity, was introduced in 2014 through two landmark publications. Wanderlust, developed by Bendall et al., used shortest-path distances over an ensemble of k-nearest neighbor graphs to align cells on trajectories reflecting progression in human B-cell development, marking an early shift from static clustering to dynamic modeling.[6] Independently, Monocle, proposed by Trapnell et al., used unsupervised algorithms to order cells by pseudotemporal gene expression changes, demonstrating its utility in myoblast differentiation and revealing regulatory cascades.[5]
Key milestones from 2016 onward addressed limitations in handling complex topologies, such as branching lineages. In 2016, Wishbone extended graph diffusion techniques to detect bifurcations, enabling high-resolution mapping of diverging trajectories in hematopoietic and neural progenitor data.[7] The same year, TSCAN introduced cluster-based pseudotime reconstruction via a minimum spanning tree over cluster centers, offering a robust approach for ordering cells without requiring prior lineage knowledge.[8] Also in 2016, Haghverdi et al. advanced the field with diffusion pseudotime (DPT), a manifold learning-based metric that robustly captures branching and metastable states using random walk approximations on diffusion maps.[10] Slingshot, published by Street et al. in 2018 (with a preprint in 2017), built on these ideas by constructing minimum spanning trees over clusters and fitting principal curves to infer multiple lineages, providing flexible pseudotime estimation along curved paths with an emphasis on stability for noisy scRNA-seq data.[9] These developments were synthesized in a 2019 benchmark of 45 methods, which underscored the shift toward scalable, topology-aware algorithms.[1]
Post-2019 advancements focused on scalability, multi-omics integration, and dynamic information. Monocle 3, introduced in 2019, enhanced trajectory construction for million-cell datasets by partitioning the cell graph with an approach based on partition-based graph abstraction, facilitating multi-omics analyses in contexts like embryonic development. From 2019 onward, RNA velocity, which estimates transcriptional kinetics from unspliced and spliced mRNA ratios, has been increasingly integrated to refine trajectories, as pioneered by La Manno et al. (2018) and extended in subsequent works to predict cell fate directions.[11][12] PAGA (2019), developed by Wolf et al., reconciled clustering with trajectory inference via topology-preserving graph abstractions, enabling coarse-grained connectivity maps for complex manifolds like planarian regeneration. By 2024, methods like IDTI addressed time-series scRNA-seq challenges, using diversity minimization to infer trajectories without predefined starting cells, improving accuracy in dynamic datasets. In 2024-2025, methods like Genes2Genes for gene trajectory alignment and CASCAT for causal inference in spatial data further advanced multi-omics and spatiotemporal applications.[13][14] These evolutions reflect trajectory inference's maturation into a cornerstone of single-cell analysis, with ongoing emphasis on robustness and interpretability as of November 2025.
Theoretical foundations
Key assumptions
Trajectory inference methods rely on several core assumptions to reconstruct cellular progression from high-dimensional single-cell omics data. A fundamental premise is that the dataset captures a continuous biological process, where gene expression changes gradually across states, and cells represent independent, asynchronous samples drawn from this trajectory at different progression points.[15] Another key assumption is the existence of an underlying low-dimensional manifold structure in the data, which dimensionality reduction techniques can reveal to facilitate trajectory embedding.[1] These assumptions enable the ordering of cells along a pseudotemporal axis without requiring time-labeled data.
Biologically, trajectory inference typically presumes unidirectional progression along the trajectory, excluding cycles unless explicitly modeled, as in processes like differentiation where cells advance toward terminal states.[15] Branching structures are assumed to reflect fate decisions, with divergence points representing lineage bifurcations driven by regulatory changes.[1] Additionally, the methods assume that technical noise in the data, such as dropout events in single-cell RNA sequencing, does not overwhelm the signal of the underlying trajectory, preserving continuity in expression profiles.[16]
From a statistical perspective, expression profiles are expected to vary smoothly along the inferred pseudotime, implying that nearby cells in pseudotime exhibit similar transcriptomic states.[17] Methods further assume sufficient sampling density across trajectory states to avoid gaps that could distort reconstruction, ensuring representation of transient intermediates.[16]
Violations of these assumptions can introduce artifacts, such as erroneous linearization of cyclic processes like the cell cycle, leading to biologically implausible trajectories.[15] To assess reliability, validation often employs metrics like w-correlation to quantify alignment between inferred and reference trajectories, highlighting discrepancies from assumption breaches.[1]
Mathematical models
Pseudotime serves as a fundamental scalar metric in trajectory inference, assigning to each cell a continuous value \tau_i \in [0, 1] that quantifies its progression along an inferred developmental path, often derived from the projection of a cell's expression profile x_i onto a parameterized trajectory curve \mu(t), such that \tau_i = \arg\min_t \|x_i - \mu(t)\|^2.[18] This formulation assumes a smooth, low-dimensional manifold underlying the data, enabling the ordering of cells without direct temporal information. In diffusion-based approaches, pseudotime can alternatively be computed as the diffusion distance along the principal eigenvector of a transition matrix derived from cell-cell similarities, capturing geodesic distances on the data manifold to robustly reconstruct branching structures. Trajectory representations frequently employ graph-based models, where cells or clusters form nodes in a similarity graph, and edges are weighted by metrics such as k-nearest neighbor distances or kernel affinities; the minimum spanning tree (MST) of this graph then approximates the global topology by connecting nodes with minimal total edge weight while avoiding cycles, providing a tree-like backbone for pseudotime assignment along paths from root to terminal nodes.[9] Complementing this, diffusion maps embed the data into a lower-dimensional space via the eigenvectors \Psi_k(X) of a Markov transition matrix P, where P_{ij} represents the diffusion probability from cell i to j, yielding coordinates that preserve local manifold geometry and facilitate pseudotime computation as progress along the embedding's dominant direction. 
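As a rough illustration of the diffusion-map idea above, the following sketch builds a Gaussian-kernel Markov matrix P on a toy dataset and uses its first non-trivial eigenvector as a pseudotime coordinate. It is a simplified stand-in rather than the published diffusion pseudotime algorithm: the function name `diffusion_pseudotime`, the bandwidth `sigma`, the `root` index used to orient the axis, and the synthetic arc of cells are all assumptions made for this example.

```python
import numpy as np

def diffusion_pseudotime(X, sigma, root):
    """Toy pseudotime from the first non-trivial diffusion-map eigenvector."""
    # Squared Euclidean distances between all pairs of cells.
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    # Gaussian kernel affinities; sigma is a hypothetical bandwidth.
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    deg = K.sum(axis=1)
    # P = D^-1 K is the Markov transition matrix. Its eigenvectors are obtained
    # from the symmetric matrix S = D^-1/2 K D^-1/2, which is easier to solve.
    S = K / np.sqrt(np.outer(deg, deg))
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    # The last eigenvector is the trivial one (eigenvalue 1); take the next.
    psi = eigvecs[:, -2] / np.sqrt(deg)
    tau = (psi - psi.min()) / (psi.max() - psi.min())
    # The sign of an eigenvector is arbitrary, so orient tau using the root cell.
    return 1.0 - tau if tau[root] > 0.5 else tau

# Synthetic "cells" sampled along a noisy arc, ordered by a known parameter t.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 1.0, 200))
X = np.column_stack([np.cos(np.pi * t), np.sin(np.pi * t)])
X = X + rng.normal(0.0, 0.02, size=X.shape)

tau = diffusion_pseudotime(X, sigma=0.2, root=0)

# Rank correlation between the known ordering and the inferred pseudotime.
ranks = lambda v: np.argsort(np.argsort(v))
rho = np.corrcoef(ranks(t), ranks(tau))[0, 1]
print(f"rank correlation with true ordering: {rho:.3f}")
```

On real data the kernel is usually restricted to nearest neighbors and several eigenvectors are combined, so this single-eigenvector ordering should be read only as a demonstration of the principle.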
Branching models extend linear trajectories to tree structures, incorporating bifurcation points where cell fates diverge; these are often parameterized as Gaussian processes per branch, with the posterior over the branching time t_b calculated as p(t_b | Y) \propto p(Y | t_b) p(t_b), where p(Y | t_b) is the marginal likelihood and p(t_b) the prior over branching times, enabling gene-specific identification of dynamic changes at branches.[19] Such models assume independent evolution along branches after splitting, grounded in the key assumption of a tree-like topology without cycles.[1]
Integration of RNA velocity enhances trajectory inference by estimating future cell states from ratios of unspliced to spliced mRNA, modeling the dynamics as \frac{dX}{dt} = A X + B, where X denotes the expression state vector, A captures regulatory interactions estimated via steady-state ratios, and B represents transcriptional bursts. Velocities are derived by solving the kinetic equations that distinguish unspliced (u) and spliced (s) transcripts, \frac{du}{dt} = \alpha - \beta u and \frac{ds}{dt} = \beta u - \gamma s, where \alpha is the transcription rate, \beta the splicing rate, and \gamma the degradation rate; the resulting arrows are projected onto the trajectory to refine pseudotime and branch directions.[12]
Validation of inferred trajectories relies on metrics assessing consistency with ground-truth progressions, such as the w-correlation, a weighted Pearson correlation between pseudotimes that emphasizes local manifold structure by weighting pairs according to their diffusion distances, ensuring preservation of biological ordering across methods or batches.[20]
Methods
Dimensionality reduction techniques
Dimensionality reduction techniques are essential preprocessing steps in trajectory inference for single-cell RNA sequencing (scRNA-seq) data, projecting high-dimensional gene expression profiles—often spanning thousands of genes—into low-dimensional spaces of 2 to 50 dimensions to uncover underlying manifold structures and cell neighborhoods.[21] This reduction facilitates the identification of continuous trajectories by mitigating the curse of dimensionality, noise, and computational burdens associated with raw data analysis.[21] Linear methods, such as principal component analysis (PCA), serve as a baseline for initial noise reduction and global structure capture.[21] In PCA, the data matrix X is centered by subtracting the mean, followed by singular value decomposition (SVD) to yield X = U \Sigma V^T, where the reduced representation is formed by projecting onto the top principal components: X_{\text{reduced}} = U_k \Sigma_k, with k denoting the number of retained components.[22] This approach assumes linearity and excels in scRNA-seq for denoising but may distort nonlinear manifolds prevalent in biological differentiation processes.[21] Nonlinear methods better preserve local and global structures in complex scRNA-seq datasets. 
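The SVD-based PCA projection described above can be sketched in a few lines. This is a minimal illustration on a synthetic matrix, not a full scRNA-seq preprocessing pipeline (normalization and gene selection are omitted); the helper name `pca_svd` and the toy data with three hidden factors are assumptions introduced here.

```python
import numpy as np

def pca_svd(X, k):
    """Project cells (rows) onto the top-k principal components via SVD."""
    Xc = X - X.mean(axis=0)                  # center each gene (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * s[:k]                # X_reduced = U_k Sigma_k
    explained = s[:k] ** 2 / np.sum(s ** 2)  # variance fraction per component
    return scores, Vt[:k], explained

# Synthetic expression matrix: 100 cells x 50 genes driven by 3 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
loadings = rng.normal(size=(3, 50))
X = latent @ loadings + 0.1 * rng.normal(size=(100, 50))

scores, components, explained = pca_svd(X, k=3)
print(scores.shape, explained.round(3))
```

Because X = U \Sigma V^T, the scores U_k \Sigma_k are identical to projecting the centered data onto the first k right singular vectors, which is why either form appears in practice.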
t-distributed stochastic neighbor embedding (t-SNE) visualizes clusters by minimizing divergences between high- and low-dimensional probability distributions, controlled by the perplexity parameter that balances local and global neighborhood preservation (typically 5–50 for single-cell data).[23] Uniform manifold approximation and projection (UMAP) offers faster computation through fuzzy simplicial set representations of the data manifold, optimizing a cross-entropy loss to embed cells while maintaining topological relationships.[24] Diffusion maps approximate the manifold's geometry via eigenvectors of a diffusion kernel, providing coordinates that can serve as a basis for pseudotime estimation by capturing diffusion distances along potential trajectories.[25][10]
The sparsity of scRNA-seq data, arising from dropout events where true expression is recorded as zero, poses challenges for standard reduction techniques, often leading to distorted manifolds.[26] To address this, zero-inflated models like zero-inflated factor analysis (ZIFA) explicitly parameterize dropout probability alongside latent factors, enabling robust low-dimensional embeddings without full imputation.[27] Alternatively, imputation methods (e.g., scImpute or MAGIC) can preprocess data to fill dropouts based on similar cells before applying reduction, though this risks introducing bias if over-applied.[26]
Evaluation of these techniques emphasizes preservation of neighborhood structure, quantified by the trustworthiness score, which measures the extent to which the k-nearest neighbors of each cell in the low-dimensional space were also neighbors in the original space (ranging from 0 to 1, with higher values indicating better preservation).[28] In scRNA-seq benchmarks, nonlinear methods like UMAP often achieve higher trustworthiness (e.g., >0.8 on pancreas datasets) compared to PCA, supporting their preference for trajectory preprocessing.[21]
Trajectory construction algorithms
Trajectory construction algorithms in trajectory inference aim to build topological structures, such as graphs or trees, from dimensionality-reduced single-cell data to represent cellular progression paths. These algorithms typically operate after an initial dimensionality reduction step, like principal component analysis or t-SNE, to capture the underlying manifold of cell states. Common approaches include graph-based, diffusion-based, and partition-based methods, each suited to different trajectory topologies from linear to branched structures. Graph-based approaches construct trajectories by first building a k-nearest neighbors (k-NN) graph from the reduced data, which encodes local cell similarities, and then deriving a minimum spanning tree (MST) to form a linear or tree-like backbone for pseudotemporal ordering. This framework is particularly effective for linear trajectories, as the MST connects all nodes without cycles while minimizing total edge weights, providing a parsimonious path. For instance, Slingshot employs this strategy by clustering cells, constructing an MST on cluster centroids, and fitting principal curves to smooth the paths through clusters, enabling robust inference of branching lineages. Slingshot's use of principal curves ensures trajectories follow the data's principal directions, improving accuracy on datasets with clear bifurcations. Diffusion-based methods leverage diffusion maps or similar embeddings to propagate pseudotime along principal components of cell diffusion, capturing global trajectory geometry. Monocle, in its second version, uses reversed graph embedding (RGE) to iteratively learn a principal graph that aligns with the diffusion structure, allowing pseudotime to be ordered along tree-like branches in an unsupervised manner. This approach resolves complex trajectories by embedding cells onto a low-dimensional graph that minimizes reconstruction error, making it suitable for datasets with multiple fate decisions. 
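The graph-based recipe above (k-NN graph, then minimum spanning tree, then ordering along the tree) can be sketched with SciPy as follows. This is an illustrative toy rather than the procedure of any particular tool: the function name `mst_pseudotime`, the choice of tree distance from a root cell as pseudotime, and the synthetic data are assumptions, and the sketch assumes the k-NN graph is connected.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra, minimum_spanning_tree
from scipy.spatial import cKDTree

def mst_pseudotime(X, k, root):
    """Pseudotime as path length from a root cell along the MST of a k-NN graph."""
    n = X.shape[0]
    # k nearest neighbors of every cell (column 0 is the cell itself).
    dist, idx = cKDTree(X).query(X, k=k + 1)
    rows = np.repeat(np.arange(n), k)
    graph = csr_matrix((dist[:, 1:].ravel(), (rows, idx[:, 1:].ravel())),
                       shape=(n, n))
    # Cycle-free backbone with minimal total edge weight.
    mst = minimum_spanning_tree(graph)
    # Distance from the root along tree edges, treated as undirected.
    geodesic = dijkstra(mst, directed=False, indices=root)
    return geodesic / geodesic.max(), mst

# Synthetic cells along a noisy arc; index 0 is taken as the root (assumption).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
X = np.column_stack([np.cos(np.pi * t), np.sin(np.pi * t)])
X = X + rng.normal(0.0, 0.02, size=X.shape)

tau, mst = mst_pseudotime(X, k=10, root=0)

ranks = lambda v: np.argsort(np.argsort(v))
rho = np.corrcoef(ranks(t), ranks(tau))[0, 1]
print(f"tree edges: {mst.nnz}, rank correlation: {rho:.3f}")
```

Building the tree on individual cells makes the result sensitive to noise, which is one reason the tools described here typically build the tree on cluster centroids or smooth it with principal curves.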
Partition-based algorithms focus on clustering cells first and then ordering the clusters to form the trajectory skeleton. TSCAN, for example, groups cells into clusters based on gene expression similarity, computes an MST connecting cluster centroids in reduced space, and pseudotemporally orders cells within each branch by projecting them onto the tree edges connecting the cluster sequence. This method excels in handling noisy data by reducing the problem to cluster-level connections, providing interpretable trajectories for linear or mildly branched progressions.
To handle branching trajectories, methods like PAGA first abstract the data into a coarse-grained graph by partitioning cells and estimating connectivity from the number of neighbor-graph edges between partitions relative to what a random-connectivity model would predict, then refine into fine-grained paths while preserving topology. Potential bifurcations appear as partitions with more than two well-supported connections in the abstracted graph, highlighting decision points where cellular states diverge. This abstraction step reconciles clustering with continuity, enabling scalable analysis of high-dimensional data.
Recent extensions address multifurcating and cyclic topologies, which challenge traditional tree models. PHLOWER (2025) infers complex, multi-branching differentiation trees from multimodal single-cell data, such as combined transcriptomics and proteomics, by integrating Hodge decomposition to decompose cell connectivity into harmonic, gradient, and curl components, thus capturing cycles and multifurcations in developmental processes. This method improves accuracy on datasets with non-tree structures, like looping trajectories in immune responses.
Incorporation of prior knowledge
Trajectory inference methods can be enhanced by incorporating external biological knowledge, known as priors, to guide the reconstruction process and improve alignment with known cellular dynamics. These priors help constrain the vast space of possible trajectories derived from data-driven approaches, reducing ambiguity in branching structures or pseudotime ordering. Common types of priors include gene regulatory networks (GRNs), which weight potential transitions between cell states by prioritizing edges supported by known regulatory interactions; known marker genes, used to identify the root or starting point of a trajectory based on expression patterns of stem cell or progenitor markers; and time-series labels from longitudinal experiments, which anchor cells to specific developmental stages across time points.[29][30][31]
One prevalent approach for integration involves constrained optimization frameworks, where the trajectory is formulated as minimizing a composite cost function that balances data fidelity with adherence to priors. For instance, the objective can be expressed as:
\text{Cost} = \text{Data fit} + \lambda \cdot \text{Prior violation}
Here, the data fit term measures agreement with observed gene expression or spatial coordinates, while the prior violation term penalizes deviations from biological constraints, such as spatial contiguity or regulatory strengths, with \lambda tuning the trade-off. This regularization ensures trajectories respect domain-specific assumptions without fully relying on unsupervised inference.[32]
Examples of prior incorporation include using lineage tracing data from clonal labeling experiments to validate inferred branch points, confirming that trajectory bifurcations align with experimentally observed cell fate divergences.
Similarly, RNA velocity estimates, derived from ratios of unspliced to spliced mRNA, provide directional priors by aligning velocity vectors with the trajectory manifold, orienting pseudotime flow toward predicted future states and resolving ambiguities in cyclic or multifurcating paths.[33]
Advanced methods leverage Bayesian frameworks to incorporate hierarchical priors for more robust inference, particularly across multiple samples. For example, VITAE employs a latent hierarchical mixture model with variational autoencoders to jointly infer trajectories while integrating sample-specific priors on topology and dynamics, enabling scalable analysis of heterogeneous datasets like brain cell atlases.[34]
Incorporating priors enhances the biological relevance of trajectories, often yielding more interpretable and accurate reconstructions in complex systems like hematopoiesis or neurogenesis, where pure data-driven methods may overlook subtle regulatory cues. However, reliance on priors risks introducing bias if the external knowledge is incomplete or species-specific, potentially overfitting to erroneous assumptions and reducing generalizability across datasets.[2]
Applications
Developmental biology
Trajectory inference has been instrumental in reconstructing embryogenesis trajectories in model organisms, such as the progression from zygote to gastrula stages in zebrafish and mouse embryos. In zebrafish, single-cell RNA sequencing (scRNA-seq) data from over 38,000 cells across 12 developmental stages revealed continuous transcriptional changes underlying somitogenesis, neural crest migration, and other morphogenetic processes, enabling the mapping of cell fate transitions without relying on prior lineage knowledge.[35] Similarly, in mouse embryos, integration of scRNA-seq datasets spanning gastrulation to early organogenesis identified 56 distinct cell trajectories, highlighting spatiotemporal gene expression dynamics during primitive streak formation and germ layer specification.
Prominent examples include the inference of hematopoiesis trajectories, where branching structures delineate myeloid and lymphoid lineages from hematopoietic stem cells, revealing sequential differentiation waves in bone marrow progenitors.[36] In pancreatic development, trajectory analysis of fetal scRNA-seq data has delineated the divergence of endocrine progenitors into alpha and beta cells, capturing key transcriptional shifts that govern insulin-producing cell maturation.[37] These applications have yielded critical insights, such as pinpointing tipping points in fate decisions, where cells commit irreversibly to specific lineages, and uncovering novel intermediate states that bridge progenitor and differentiated populations, thereby refining models of developmental progression.[15]
Notable case studies underscore these advances: a 2019 analysis of mouse gastrulation, based on integrated scRNA-seq profiles from embryonic days 6.5 to 8.5, reconstructed trajectories for epiblast diversification and mesoderm formation, identifying regulatory modules active at germ layer boundaries.[38] More recently, branching trajectory analyses of human retinal organoids derived from induced pluripotent stem cells have mapped differentiation paths, revealing off-target cell populations in vitro.
Multi-omics integration further enhances trajectory inference in developmental contexts, exemplified by combining scRNA-seq with assay for transposase-accessible chromatin sequencing (ATAC-seq) to trace epigenetic landscapes during human fetal hematopoiesis. This approach aligned transcriptional and chromatin accessibility profiles across approximately 8,500 cells and nuclei, disclosing dynamic enhancer remodeling that coordinates myeloid and erythroid branching from common progenitors.[39]
Disease progression and other fields
Trajectory inference has been instrumental in modeling disease progression by reconstructing dynamic cellular states from single-cell data, particularly in oncology where it elucidates paths of tumor evolution and metastasis within heterogeneous microenvironments. For instance, in prostate cancer, trajectory analysis of single-cell transcriptomes has delineated progression from normal to metastatic states, identifying key transcriptional changes along pseudotemporal paths that highlight driver genes and potential intervention points.[40] Similarly, in breast and other solid tumors, methods like alignment of single-cell trajectories across conditions have enabled the inference of branching paths representing divergent metastatic routes, integrating multi-omics data to reveal how tumor cells adapt to selective pressures in the microenvironment. These approaches not only map linear progression but also capture bifurcations that correspond to therapeutic resistance or dormancy, providing insights into metastasis mechanisms. In neurodegenerative diseases, trajectory inference has mapped transitions from healthy to pathological states, as exemplified in Alzheimer's disease (AD) studies around 2022 that applied single-cell sequencing guidelines to reconstruct glial and neuronal trajectories influenced by amyloid and tau pathology.[41] These analyses revealed pseudotemporal ordering of cell states, showing how microglia activation and astrocyte dysfunction contribute to neurodegeneration, with trajectories diverging based on genetic risk factors. More recent extensions have integrated perturbation data from precision medicine cohorts to infer causal paths of therapeutic efficacy, such as how targeted inhibitors alter branching in AD-related cellular models, thereby identifying responsive subpopulations for personalized interventions. 
Beyond canonical diseases, trajectory inference extends to immune dynamics during infections, where it has characterized COVID-19 response trajectories by ordering immune cell states from activation to exhaustion in severe cases. For example, single-cell analyses have inferred epithelial-immune interaction paths, revealing how airway cells transition amid viral challenge and how immune niches bifurcate into recovery or hyperinflammatory routes. In stem cell reprogramming, optimal transport-based trajectory methods have uncovered heterogeneous paths from somatic to pluripotent states, highlighting asynchronous diversification and regulatory networks that govern efficiency. In microbial ecology, trajectory concepts applied to community succession, such as in soil ecosystems, model bacterial shifts over environmental gradients, inferring successional paths shaped by nutrient availability and interactions, though single-cell applications remain emerging.
A key advantage of trajectory inference in these contexts is its ability to pinpoint branch points as therapeutic windows, where interventions can redirect pathological paths, as seen in cancer models where pseudotime analysis identifies vulnerable states for drug targeting. Furthermore, integrating perturbation data, such as from CRISPR screens or drug exposures, enables causal inference along trajectories, enhancing precision medicine by predicting response heterogeneity. Extensions to spatial transcriptomics have further advanced disease modeling by inferring tissue-level trajectories, for instance, in tumors where spatial pseudotime reveals invasion gradients and in AD brains where it maps regional progression from healthy to degenerative zones.
Software tools
Monocle
Monocle is an open-source R package, with its earlier versions distributed through Bioconductor, developed for analyzing single-cell RNA sequencing data with a focus on trajectory inference to model dynamic gene expression changes along biological processes such as differentiation. Introduced in 2014 with version 1, Monocle pioneered the concept of pseudotime, an unsupervised ordering of cells along a linear trajectory to reconstruct temporal progression from static snapshots of transcriptomes. Monocle 2, released in 2017, extended this to support branching trajectories by employing reversed graph embedding to capture complex fate decisions without prior knowledge of the structure. Monocle 3, launched in 2019, further advanced scalability for large datasets through a graph abstraction framework that enables efficient partitioning and trajectory construction across millions of cells.
A core feature of Monocle 2 is its use of DDRTree (discriminative dimensionality reduction via learning a tree) for low-dimensional embedding, which approximates a tree-structured manifold to position cells in a space that reflects their progression; Monocle 3 instead typically learns its graph on a UMAP embedding. The package incorporates partition-based trajectory learning, where cells are first clustered into partitions representing potential lineages, followed by the construction of a principal graph that connects the cells within each partition into a cohesive trajectory. It integrates tightly with cell clustering algorithms, such as those based on Leiden or Louvain methods, to define the scope of trajectories and identify branch points corresponding to fate decisions.
The standard workflow in Monocle begins with input of dimensionality-reduced data, such as PCA or UMAP coordinates from preprocessed expression matrices, followed by graph learning to fit the trajectory structure and pseudotime assignment to quantify each cell's position along the paths.
This process supports visualization of trajectories in 2D or 3D embeddings and can incorporate RNA velocity information to orient branches and predict future states. Monocle 3 enhances this by allowing trajectory refinement within specific partitions or across multiple ones, facilitating modular analysis of complex datasets. Monocle's strengths lie in its ability to handle large-scale single-cell datasets efficiently, processing up to millions of cells while producing interpretable 2D/3D trajectory visualizations that aid in exploring lineage relationships. However, the graph learning and embedding steps can be computationally intensive for very large datasets exceeding current hardware limits, and the method assumes predominantly tree-like topologies, potentially underperforming on datasets with cyclic or highly disconnected structures.Slingshot
Slingshot is an open-source R package developed for trajectory inference in single-cell RNA sequencing data, first introduced in 2017. It models continuous developmental trajectories by leveraging pre-computed cell clusters in a low-dimensional space, fitting principal curves to represent lineages and pseudotime. The method assumes that cells within clusters are more similar to each other than to those in other clusters, enabling robust lineage reconstruction without directly operating on high-dimensional gene expression data.[42]
A core strength of Slingshot lies in its ability to simultaneously infer pseudotime and branching lineages, starting from a user-specified root cluster or from one selected automatically. It accommodates complex topologies, including multiple branches emanating from a common progenitor, which is particularly useful for capturing differentiation hierarchies in biological processes. Unlike methods that rely on continuous embeddings alone, Slingshot emphasizes discrete clustering as a foundation, providing interpretable structures that align with known cell types.[9]
The workflow begins with dimensionality reduction techniques, such as principal component analysis (PCA) or uniform manifold approximation and projection (UMAP), to obtain cell coordinates, followed by clustering (e.g., using k-means or density-based methods). Slingshot then constructs a minimum spanning tree (MST) connecting cluster centroids to define the global lineage topology. Principal curves are fitted along each path in the MST, and cells are projected onto these curves to assign pseudotime values that reflect progression along the inferred trajectories. This process is implemented via the slingshot() wrapper function, which outputs a SingleCellExperiment object compatible with downstream analyses.
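The cluster-then-connect workflow described above can be illustrated with a minimal Python sketch. It is not the Slingshot implementation (which is an R package): it assumes plain Euclidean distances between centroids and replaces principal curves with a piecewise-linear path through the centroids. All function names, cluster labels, and coordinates below are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def centroids(points, labels):
    """Mean coordinate of each cluster in the reduced-dimensional space."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    return {
        label: tuple(sum(axis) / len(members) for axis in zip(*members))
        for label, members in groups.items()
    }


def minimum_spanning_tree(cents):
    """Prim's algorithm over Euclidean distances between cluster centroids."""
    names = sorted(cents)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        _, a, b = min(
            (math.dist(cents[a], cents[b]), a, b)
            for a in in_tree
            for b in names
            if b not in in_tree
        )
        edges.append((a, b))
        in_tree.add(b)
    return edges


def lineages(edges, root):
    """Enumerate root-to-leaf paths in the tree; each path is one lineage."""
    adjacency = defaultdict(list)
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    paths = []

    def walk(node, path):
        onward = [n for n in adjacency[node] if n not in path]
        if not onward:
            paths.append(path)
        for n in onward:
            walk(n, path + [n])

    walk(root, [root])
    return sorted(paths)


def pseudotime(point, path, cents):
    """Arc length of the closest position on the polyline through a lineage.

    Simplification: a polyline stands in for a fitted principal curve, and
    consecutive centroids are assumed to be distinct.
    """
    best_dist, best_time, travelled = math.inf, 0.0, 0.0
    for a, b in zip(path, path[1:]):
        start, end = cents[a], cents[b]
        seg = [e - s for s, e in zip(start, end)]
        seg_len = math.sqrt(sum(c * c for c in seg))
        rel = [x - s for s, x in zip(start, point)]
        t = sum(r * c for r, c in zip(rel, seg)) / seg_len ** 2
        t = max(0.0, min(1.0, t))  # clamp the projection to the segment
        closest = [s + t * c for s, c in zip(start, seg)]
        dist = math.dist(point, closest)
        if dist < best_dist:
            best_dist, best_time = dist, travelled + t * seg_len
        travelled += seg_len
    return best_time


# Toy 2D embedding: a progenitor cluster "A" leading through "B" to "C" and "D".
points = [(0.0, 0.1), (0.1, -0.1), (1.0, 0.1), (1.1, -0.1),
          (2.0, 1.0), (2.2, 1.2), (2.0, -1.0), (2.2, -1.2)]
labels = ["A", "A", "B", "B", "C", "C", "D", "D"]

cents = centroids(points, labels)
tree = minimum_spanning_tree(cents)
paths = lineages(tree, root="A")
print(paths)
```

In this toy example the tree branches at cluster "B", so two lineages share their early segment, which mirrors how branching lineages share pseudotime up to the branch point.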
Slingshot excels in interpretability for branching trajectories, allowing visualization of lineages and pseudotime on reduced-dimensional plots using integrated plotting functions. It integrates directly with the tradeSeq package for differential expression analysis, enabling tests for genes changing along specific paths or between branches, which enhances its utility in identifying regulatory dynamics. For instance, in hematopoiesis datasets like those from Paul et al. (2015), Slingshot has been used to reconstruct myeloid and erythroid lineages from hematopoietic stem cells, revealing pseudotime-ordered gene expression patterns consistent with known differentiation cascades. Recent package versions maintain compatibility with diverse input formats, including those derived from spatial transcriptomics by treating spatial coordinates as additional dimensions in the embedding.[43]
PAGA
PAGA (Partition-based Graph Abstraction) is a method introduced in 2019 and implemented in the Python-based Scanpy ecosystem for single-cell RNA sequencing analysis, designed to infer coarse-grained trajectory topologies by creating interpretable graph-like maps of cellular manifolds.[11] It addresses the challenge of reconciling clustering with trajectory inference by abstracting high-dimensional data into a simplified graph that preserves global connectivity structures.[11] Among trajectory construction approaches, PAGA focuses on initial topology estimation rather than full pseudotime assignment.[44]
The core of PAGA involves constructing a k-nearest neighbor (kNN) graph of cells, partitioning the cells into communities using algorithms like Louvain clustering, and then estimating connectivities between these partitions to form an abstracted PAGA graph.[11] Edge weights in this graph represent confidence in the connections, enabling visualization of branching or linear trajectories while maintaining multi-resolution views of the data manifold.[11] By prioritizing global topology preservation, PAGA avoids distortions common in dimensionality reduction techniques and supports integration with RNA velocity for directional inference.[11]
In a typical workflow, PAGA first computes the kNN graph and partitions, abstracts these into the PAGA graph, and then facilitates pseudotime estimation on the abstracted structure through methods such as diffusion pseudotime (DPT).[44] This stepwise approach allows users to refine trajectories iteratively, starting from a robust coarse map.[11]
PAGA's strengths include high scalability, processing a dataset of 1.3 million neurons in approximately 90 seconds on standard hardware, making it suitable for large-scale analyses.[11] It demonstrates robustness to technical noise in single-cell data and provides intuitive visualizations of connectivity strengths via plot functions in Scanpy.[11] These attributes enable efficient exploration of complex datasets without excessive computational demands.[45]
Applications of PAGA span developmental biology, including trajectory inference in mouse hematopoiesis across multiple datasets, regeneration studies in adult planaria, and embryonic patterning in zebrafish.[11] It has been employed in large-scale single-cell atlases, such as the 2023 human embryonic limb cell atlas, where it generated abstracted graphs to resolve spatial and temporal cellular progressions.[46]
Other notable tools
TSCAN (Tools for Single-Cell ANalysis) is a cluster-based trajectory inference method that reconstructs pseudotime by ordering cell clusters along a minimum spanning tree derived from pairwise distances between cluster centers, making it particularly suitable for linear trajectories in single-cell RNA-seq data.[8] Introduced in 2016, TSCAN emphasizes simplicity and computational efficiency, enabling robust pseudotime estimation even in noisy datasets by incorporating quantitative evaluation metrics for trajectory quality.[8]
Early graph-based approaches like Wanderlust and Wishbone laid foundational work for handling branching structures. Wanderlust, developed in 2014, infers a one-dimensional trajectory by propagating distances from a user-specified root cell across a k-nearest neighbors graph, effectively capturing progression in processes such as human B cell development.[47] Wishbone, building on Wanderlust in 2016, detects bifurcations by combining diffusion maps with shortest-path distances from waypoint cells, assigning cells to one of two branches after a single branch point, and has proven effective for developmental bifurcations such as those in hematopoietic differentiation.[51] While influential, these methods are now considered outdated for large-scale datasets due to scalability limitations and sensitivity to the choice of root cell.[2]
More recent tools address complex topologies and multimodal data. scSAE, introduced in 2025, employs stacked autoencoders to learn nonlinear latent representations of single-cell data, enabling inference of intricate, nonlinear trajectories without assuming predefined structures.[48] VITAE, from 2024, integrates variational autoencoders with a latent hierarchical mixture model for Bayesian joint inference of trajectories and cell types, supporting multi-sample integration and outperforming prior methods on diverse topologies like trees and cycles.[34] PHLOWER, released in 2025, leverages Hodge decomposition on multimodal single-cell data (e.g., RNA and protein) to infer complex, multi-branching differentiation trees, enhancing accuracy in scenarios with spatial or multi-omics inputs.[49] For time-series data, IDTI (2024) computes trajectories by quantifying the increment of diversity between adjacent time points, avoiding the need for root cell specification and improving reconstruction in dynamic processes like embryonic development.[50]

| Tool | Key Strength | Limitation | Citation |
|---|---|---|---|
| TSCAN | Fast cluster-based linear ordering via MST | Dependent on initial clustering quality | [8] |
| Wishbone | Detects single bifurcations in diffusion space | Limited to one branch point; scales poorly | [51] |
| Wanderlust | Simple root-to-end progression mapping | Assumes linear paths; root-sensitive | [47] |
| scSAE | Handles nonlinear paths with autoencoders | Requires tuning for deep architectures | [48] |
| VITAE | Bayesian integration across samples | Computationally intensive for large datasets | [34] |
| PHLOWER | Multimodal support for complex trees | Relies on graph decomposition assumptions | [49] |
| IDTI | Tailored for time-series without roots | Assumes temporal ordering in input | [50] |
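
The root-based propagation over a k-nearest neighbors graph described for Wanderlust earlier in this section can be illustrated with a minimal Python sketch. It is a deliberate simplification rather than the published algorithm: it assumes a single symmetric kNN graph with Euclidean edge weights and plain shortest-path distances from the root, and it omits the waypoint refinement and graph ensembles that the original method uses. The function names and toy coordinates are hypothetical.

```python
import heapq
import math


def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph with Euclidean edge weights."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        for d, j in nearest:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def graph_pseudotime(graph, root):
    """Shortest-path distance from the root cell, rescaled to [0, 1].

    Cells that cannot be reached from the root are left out of the result.
    """
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist.get(neighbor, math.inf):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    longest = max(dist.values()) or 1.0  # avoid dividing by zero for a lone root
    return {cell: d / longest for cell, d in dist.items()}


# Toy embedding: five cells along a roughly linear path, listed out of order.
cells = [(3.0, 0.2), (0.0, 0.0), (2.0, 0.1), (1.0, 0.1), (4.0, 0.0)]
graph = knn_graph(cells, k=2)
pt = graph_pseudotime(graph, root=1)  # cell 1 is the assumed starting cell
ordering = sorted(pt, key=pt.get)
print(ordering)
```

Because every pseudotime value is measured from the chosen root, changing the `root` argument changes the entire ordering, which illustrates the root sensitivity noted in the table.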