Trajectory inference

Trajectory inference is a computational method in single-cell omics analysis that reconstructs the dynamic progression of cell states by ordering individual cells along continuous trajectories, typically parameterized by pseudotime, to model biological processes such as differentiation, maturation, and response to stimuli from snapshot data.^[1] These methods address the challenge of inferring temporal dynamics from static, high-dimensional measurements like gene expression profiles, enabling the study of cellular transitions at single-cell resolution without requiring time-series experiments.^[2] The field emerged around 2014 with pioneering tools like Monocle, which ordered cells along a minimum spanning tree built on data reduced by independent component analysis, and Wanderlust, which estimated pseudotime from shortest-path distances on an ensemble of k-nearest neighbor graphs.^[1] Since then, over 70 trajectory inference tools have been developed, reflecting rapid evolution driven by advances in single-cell RNA sequencing (scRNA-seq) technologies.^[1] Early methods focused on linear or simple branching topologies, but subsequent innovations incorporated more complex structures, such as cycles and multifurcations, to better capture real biological variability.^[2]

Trajectory inference methods can be broadly categorized into clustering-based, graph-based, and probabilistic approaches. Clustering-based methods, such as Slingshot, TSCAN, and SCORPIUS, first identify discrete cell clusters representing states and connect them via minimum spanning trees or principal curves to form trajectories.^[2] Graph-based techniques, including PAGA and diffusion pseudotime (DPT), leverage nearest-neighbor graphs or random walks to propagate pseudotime and infer global structures, often excelling in scalability for large datasets.^[1] Probabilistic models, like Palantir and VeloVI,^[3] introduce uncertainty quantification through generative processes, such as Gaussian processes or Markov chains, and integrate additional data modalities like RNA velocity to predict future cell fates.^[2]

Applications of trajectory inference extend beyond basic trajectory reconstruction to downstream analyses, including differential gene expression along paths (e.g., via tradeSeq), gene regulatory network inference, and alignment of trajectories across datasets or conditions.^[2] It has been instrumental in fields like developmental biology, immunology, and oncology, revealing insights into processes such as embryonic development, immune cell activation, and tumor heterogeneity.^[4]

Benchmarks, such as the dynverse evaluation of 45 methods on synthetic and real datasets, highlight trade-offs in accuracy, stability, and computational efficiency, with no single method outperforming others universally; selection depends on data topology, dimensionality, and noise levels.^[1] Despite these advances, challenges persist, including sensitivity to data preprocessing, assumptions about continuous progression that may not hold for discrete transitions, and the need for robust quality control to distinguish true trajectories from artifacts.^[2] Ongoing developments focus on multi-omics integration, such as combining transcriptomics with epigenomics or spatial data, and machine learning enhancements to handle complex, non-linear dynamics more effectively.^[2]

Background

Definition and principles

Trajectory inference is a computational approach that reconstructs continuous developmental or differentiation paths, known as trajectories, from static snapshot data obtained from single-cell omics technologies, such as single-cell RNA sequencing (scRNA-seq).^[1] It achieves this by ordering individual cells along a pseudotime axis, which serves as a proxy for biological progression through a dynamic process, thereby enabling the inference of temporal dynamics without requiring time-series experiments.^[5] This method assumes that the observed cells represent asynchronous samples from an underlying continuous biological process, where gene expression changes occur gradually, allowing the reconstruction of cell fate transitions from high-dimensional data.^[1] The core principles of trajectory inference revolve around modeling cellular progression as a graph-like structure, where nodes represent cell states and edges denote transitions along pseudotime.^[2] Pseudotime quantifies the extent of progress through the process, often derived from diffusion-based or minimum spanning tree approaches that capture manifold geometry in the data. 
Trajectories can take various topologies, including linear paths for unbranched processes, branching structures to model fate bifurcations, and multifurcating graphs for complex decision points involving multiple lineages.^[1] Primarily applied to high-dimensional datasets like gene expression profiles, trajectory inference facilitates the study of cell fate decisions in contexts such as embryonic development or tissue regeneration, where traditional time-course data are scarce or infeasible.^[1] For instance, in scRNA-seq analysis of differentiating cells, clusters representing distinct stages are identified and linked via edges to form a trajectory graph that visualizes progression from progenitor to mature states.^[5] Dimensionality reduction techniques, such as principal component analysis, are often employed as a preprocessing step to aid visualization and noise reduction in these high-dimensional spaces.^[2]

Historical development

Trajectory inference emerged in the mid-2010s, driven by advancements in single-cell RNA sequencing (scRNA-seq) that provided high-resolution snapshots of cellular states, necessitating computational methods to reconstruct dynamic processes like differentiation. The foundational concept of pseudotime, the ordering of cells along inferred developmental paths based on transcriptional similarity, was introduced in 2014 through two landmark publications. Wanderlust, developed by Bendall et al., used shortest-path distances on ensembles of k-nearest neighbor graphs to align cells on trajectories reflecting progression in human B-cell development, marking an early shift from static clustering to dynamic modeling.^[6] Independently, Monocle, proposed by Trapnell et al., used unsupervised algorithms to order cells by pseudotemporal gene expression changes, demonstrating its utility in myoblast differentiation and revealing regulatory cascades.^[5]

Key milestones between 2016 and 2018 addressed limitations in handling complex topologies, such as branching lineages. In 2016, Wishbone extended the graph-based approach of Wanderlust to detect bifurcations, enabling high-resolution mapping of diverging trajectories in hematopoietic and neural progenitor data.^[7] The same year, TSCAN introduced cluster-based pseudotime reconstruction via a minimum spanning tree over cluster centers, offering a robust approach for ordering cells in branching structures without requiring prior lineage knowledge.^[8] Also in 2016, Haghverdi et al. advanced the field with diffusion pseudotime (DPT), a manifold learning-based metric that robustly captures branching and metastable states using random walk approximations on diffusion maps.^[10] Slingshot, developed by Street et al. and published in 2018 (with a preprint in 2017), built on these ideas by constructing minimum spanning trees over clusters and fitting principal curves along them, providing flexible pseudotime estimation for multiple lineages and emphasizing stability on noisy scRNA-seq data.^[9] These developments were synthesized in a 2019 benchmark of 45 methods, which underscored the shift toward scalable, topology-aware algorithms.^[1]

Advancements from 2019 onward focused on scalability, multi-omics integration, and dynamic information. Monocle 3, introduced in 2019, enhanced trajectory construction for million-cell datasets by incorporating graph-based partitioning and partition-based graph abstraction, facilitating multi-omics analyses in contexts like embryonic development. RNA velocity, which estimates transcriptional kinetics from the relative abundance of unspliced and spliced mRNA, was pioneered by La Manno et al. (2018) and has since been increasingly integrated to refine trajectories and predict cell fate directions.^[11]^[12] PAGA (2019), developed by Wolf et al., reconciled clustering with trajectory inference via topology-preserving graph abstractions, enabling coarse-grained connectivity maps for complex manifolds like planarian regeneration. By 2024, methods like IDTI addressed time-series scRNA-seq challenges, using diversity minimization to infer trajectories without predefined starting cells, improving accuracy in dynamic datasets. In 2024-2025, methods like Genes2Genes for gene trajectory alignment and CASCAT for causal inference in spatial data further advanced multi-omics and spatiotemporal applications.^[13]^[14] These evolutions reflect trajectory inference's maturation into a cornerstone of single-cell analysis, with ongoing emphasis on robustness and interpretability as of November 2025.

Theoretical foundations

Key assumptions

Trajectory inference methods rely on several core assumptions to reconstruct cellular progression from high-dimensional single-cell omics data. A fundamental premise is that the dataset captures a continuous biological process, where gene expression changes gradually across states, and cells represent independent, asynchronous samples drawn from this trajectory at different progression points.^[15] Another key assumption is the existence of an underlying low-dimensional manifold structure in the data, which dimensionality reduction techniques can reveal to facilitate trajectory embedding.^[1] These assumptions enable the ordering of cells along a pseudotemporal axis without requiring time-labeled data.

Biologically, trajectory inference typically presumes unidirectional progression along the trajectory, excluding cycles unless explicitly modeled, as in processes like differentiation where cells advance toward terminal states.^[15] Branching structures are assumed to reflect fate decisions, with divergence points representing lineage bifurcations driven by regulatory changes.^[1] Additionally, the methods assume that technical noise in the data, such as dropout events in single-cell RNA sequencing, does not overwhelm the signal of the underlying trajectory, preserving continuity in expression profiles.^[16]

From a statistical perspective, expression profiles are expected to vary smoothly along the inferred pseudotime, implying that nearby cells in pseudotime exhibit similar transcriptomic states.^[17] Methods further assume sufficient sampling density across trajectory states to avoid gaps that could distort reconstruction, ensuring representation of transient intermediates.^[16]

Violations of these assumptions can introduce artifacts, such as erroneous linearization of cyclic processes like the cell cycle, leading to biologically implausible trajectories.^[15] To assess reliability, validation often employs metrics like w-correlation to quantify alignment between inferred and reference trajectories, highlighting discrepancies from assumption breaches.^[1]

Mathematical models

Pseudotime serves as a fundamental scalar metric in trajectory inference, assigning to each cell a continuous value \tau_i \in [0, 1] that quantifies its progression along an inferred developmental path, often derived from the projection of a cell's expression profile x_i onto a parameterized trajectory curve \mu(t), such that \tau_i = \arg\min_t \|x_i - \mu(t)\|^2.^[18] This formulation assumes a smooth, low-dimensional manifold underlying the data, enabling the ordering of cells without direct temporal information. In diffusion-based approaches, pseudotime can alternatively be computed as the diffusion distance along the principal eigenvector of a transition matrix derived from cell-cell similarities, capturing geodesic distances on the data manifold to robustly reconstruct branching structures. Trajectory representations frequently employ graph-based models, where cells or clusters form nodes in a similarity graph, and edges are weighted by metrics such as k-nearest neighbor distances or kernel affinities; the minimum spanning tree (MST) of this graph then approximates the global topology by connecting nodes with minimal total edge weight while avoiding cycles, providing a tree-like backbone for pseudotime assignment along paths from root to terminal nodes.^[9] Complementing this, diffusion maps embed the data into a lower-dimensional space via the eigenvectors \Psi_k(X) of a Markov transition matrix P, where P_{ij} represents the diffusion probability from cell i to j, yielding coordinates that preserve local manifold geometry and facilitate pseudotime computation as progress along the embedding's dominant direction. 
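The projection formula \tau_i = \arg\min_t \|x_i - \mu(t)\|^2 can be illustrated with a small sketch. It assumes a hypothetical trajectory curve (a quarter circle in a 2-D embedding) and replaces the continuous minimization with a search over a grid of t values; the function names and the example cells are illustrative and are not taken from any particular tool.

```python
import math

def curve(t):
    """Hypothetical trajectory mu(t): a quarter circle in 2-D, t in [0, 1]."""
    angle = t * math.pi / 2
    return (math.cos(angle), math.sin(angle))

def pseudotime(cell, mu, n_grid=1001):
    """Return the grid value of t that minimises ||cell - mu(t)||^2."""
    best_t, best_d = 0.0, float("inf")
    for k in range(n_grid):
        t = k / (n_grid - 1)
        point = mu(t)
        d = sum((c - p) ** 2 for c, p in zip(cell, point))
        if d < best_d:
            best_t, best_d = t, d
    return best_t

# Three made-up cells lying slightly off the curve.
cells = [(1.1, 0.0), (0.8, 0.8), (0.0, 0.9)]
taus = [pseudotime(c, curve) for c in cells]
print(taus)
```

Practical principal-curve methods alternate between this projection step and re-fitting the curve to the projected cells; only the projection step is shown here.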
Branching models extend linear trajectories to tree structures, incorporating bifurcation points where cell fates diverge; these are often parameterized as Gaussian processes per branch, with the posterior over the branching time t_b calculated as p(t_b | Y) \propto p(Y | t_b) p(t_b), which under a uniform prior is proportional to the marginal likelihood p(Y | t_b), enabling gene-specific identification of dynamic changes at branches.^[19] Such models assume independent evolution along branches after splitting, grounded in the key assumption of a tree-like topology without cycles.^[1]

Integration of RNA velocity enhances trajectory inference by estimating future cell states from the relative abundance of unspliced (u) and spliced (s) transcripts of each gene. The kinetics are modeled as \frac{du}{dt} = \alpha - \beta u and \frac{ds}{dt} = \beta u - \gamma s, where \alpha is the transcription rate, \beta the splicing rate, and \gamma the degradation rate. In matrix form this is the linear system \frac{dX}{dt} = A X + B with state vector X = (u, s)^T, A = \begin{pmatrix} -\beta & 0 \\ \beta & -\gamma \end{pmatrix}, and B = (\alpha, 0)^T. At steady state u = (\gamma / \beta) s, so the ratio \gamma / \beta can be estimated from cells near equilibrium, and the velocity of a gene in a given cell is the residual \frac{ds}{dt} = \beta u - \gamma s. The resulting velocity vectors are projected as arrows onto the trajectory to refine pseudotime and branch directions.^[12]

Validation of inferred trajectories relies on metrics assessing consistency with ground-truth progressions, such as the w-correlation, a weighted Pearson correlation between pseudotimes that emphasizes local manifold structure by weighting pairs according to their diffusion distances, ensuring preservation of biological ordering across methods or batches.^[20]
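As a worked illustration of the splicing kinetics \frac{du}{dt} = \alpha - \beta u and \frac{ds}{dt} = \beta u - \gamma s, the sketch below estimates the degradation rate for one gene from cells assumed to be at steady state, then evaluates the velocity \frac{ds}{dt} in two other cells. It makes simplifying assumptions: the splicing rate is fixed at \beta = 1, every supplied steady-state cell is used in a least-squares fit through the origin, and all counts are invented for the example.

```python
def fit_gamma(unspliced, spliced):
    """Least-squares slope through the origin for u = gamma * s (beta fixed to 1)."""
    num = sum(u * s for u, s in zip(unspliced, spliced))
    den = sum(s * s for s in spliced)
    return num / den

def velocity(u, s, gamma, beta=1.0):
    """ds/dt = beta * u - gamma * s for one gene in one cell."""
    return beta * u - gamma * s

# Hypothetical steady-state cells for one gene, constructed so that u = 0.5 * s.
steady_s = [2.0, 4.0, 6.0]
steady_u = [1.0, 2.0, 3.0]
gamma = fit_gamma(steady_u, steady_s)

# Excess unspliced mRNA relative to steady state: the gene is being induced.
v_induced = velocity(u=3.0, s=4.0, gamma=gamma)
# Deficit of unspliced mRNA: the gene is being repressed.
v_repressed = velocity(u=1.0, s=4.0, gamma=gamma)
print(gamma, v_induced, v_repressed)
```

A positive value indicates that spliced mRNA is expected to increase in that cell and a negative value that it is expected to decrease; repeating this per gene yields the velocity vector of the cell.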

Methods

Dimensionality reduction techniques

Dimensionality reduction techniques are essential preprocessing steps in trajectory inference for single-cell RNA sequencing (scRNA-seq) data, projecting high-dimensional gene expression profiles—often spanning thousands of genes—into low-dimensional spaces of 2 to 50 dimensions to uncover underlying manifold structures and cell neighborhoods.^[21] This reduction facilitates the identification of continuous trajectories by mitigating the curse of dimensionality, noise, and computational burdens associated with raw data analysis.^[21] Linear methods, such as principal component analysis (PCA), serve as a baseline for initial noise reduction and global structure capture.^[21] In PCA, the data matrix X is centered by subtracting the mean, followed by singular value decomposition (SVD) to yield X = U \Sigma V^T, where the reduced representation is formed by projecting onto the top principal components: X_{\text{reduced}} = U_k \Sigma_k, with k denoting the number of retained components.^[22] This approach assumes linearity and excels in scRNA-seq for denoising but may distort nonlinear manifolds prevalent in biological differentiation processes.^[21] Nonlinear methods better preserve local and global structures in complex scRNA-seq datasets. 
t-distributed stochastic neighbor embedding (t-SNE) visualizes clusters by minimizing the divergence between high- and low-dimensional neighbor probability distributions; its perplexity parameter sets the effective neighborhood size and thereby balances local and global structure (typically 5–50 for single-cell data).^[23] Uniform manifold approximation and projection (UMAP) offers faster computation through fuzzy simplicial set representations of the data manifold, optimizing a cross-entropy loss to embed cells while maintaining topological relationships.^[24] Diffusion maps approximate the manifold's geometry via eigenvectors of a diffusion kernel, providing coordinates that can serve as a basis for pseudotime estimation by capturing diffusion distances along potential trajectories.^[25]^[10]

The sparsity of scRNA-seq data, arising from dropout events in which an expressed gene is recorded as zero, poses challenges for standard reduction techniques and often leads to distorted manifolds.^[26] To address this, zero-inflated models like zero-inflated factor analysis (ZIFA) explicitly parameterize dropout probability alongside latent factors, enabling robust low-dimensional embeddings without full imputation.^[27] Alternatively, imputation methods (e.g., scImpute or MAGIC) can preprocess data to fill dropouts based on similar cells before applying reduction, though this risks introducing bias if over-applied.^[26]

Evaluation of these techniques often emphasizes preservation of local neighborhood structure, quantified by the trustworthiness score, which penalizes cells that appear among the k nearest neighbors in the low-dimensional embedding but were not neighbors in the original space (ranging from 0 to 1, with higher values indicating better preservation).^[28] In scRNA-seq benchmarks, nonlinear methods like UMAP often achieve higher trustworthiness (e.g., >0.8 on pancreas datasets) compared to PCA, supporting their preference for trajectory preprocessing.^[21]
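The PCA projection described in this section can be sketched without external libraries. Note the substitution: instead of a full SVD, the example uses power iteration on the covariance matrix, which recovers only the first principal direction v_1; the scores X_c v_1 of the centered data X_c equal the first column of U_k \Sigma_k up to sign. The data set and helper names are made up for illustration.

```python
import math

def center(rows):
    """Subtract the per-feature mean from every cell."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(xc):
    """Sample covariance matrix of already-centered data."""
    n, d = len(xc), len(xc[0])
    return [[sum(r[i] * r[j] for r in xc) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, n_iter=200):
    """Power iteration; assumes the start vector is not orthogonal to PC1."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Four hypothetical cells measured on two genes, lying on the line y = 2x.
cells = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
xc = center(cells)
pc1 = first_component(covariance(xc))
scores = [sum(a * b for a, b in zip(row, pc1)) for row in xc]
print(pc1, scores)
```

For real expression matrices a library SVD or a truncated solver is the usual choice; the point of the sketch is only the center, decompose, and project sequence.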

Trajectory construction algorithms

Trajectory construction algorithms in trajectory inference aim to build topological structures, such as graphs or trees, from dimensionality-reduced single-cell data to represent cellular progression paths. These algorithms typically operate after an initial dimensionality reduction step, like principal component analysis or t-SNE, to capture the underlying manifold of cell states. Common approaches include graph-based, diffusion-based, and partition-based methods, each suited to different trajectory topologies from linear to branched structures. Graph-based approaches construct trajectories by first building a k-nearest neighbors (k-NN) graph from the reduced data, which encodes local cell similarities, and then deriving a minimum spanning tree (MST) to form a linear or tree-like backbone for pseudotemporal ordering. This framework is particularly effective for linear trajectories, as the MST connects all nodes without cycles while minimizing total edge weights, providing a parsimonious path. For instance, Slingshot employs this strategy by clustering cells, constructing an MST on cluster centroids, and fitting principal curves to smooth the paths through clusters, enabling robust inference of branching lineages. Slingshot's use of principal curves ensures trajectories follow the data's principal directions, improving accuracy on datasets with clear bifurcations. Diffusion-based methods leverage diffusion maps or similar embeddings to propagate pseudotime along principal components of cell diffusion, capturing global trajectory geometry. Monocle, in its second version, uses reversed graph embedding (RGE) to iteratively learn a principal graph that aligns with the diffusion structure, allowing pseudotime to be ordered along tree-like branches in an unsupervised manner. This approach resolves complex trajectories by embedding cells onto a low-dimensional graph that minimizes reconstruction error, making it suitable for datasets with multiple fate decisions. 
Partition-based algorithms focus on clustering cells first and then ordering the clusters to form the trajectory skeleton. TSCAN, for example, groups cells into clusters based on gene expression similarity, computes an MST connecting cluster centroids in reduced space, and pseudotemporally orders cells by projecting them onto the tree edges that link successive cluster centers. This method excels in handling noisy data by reducing the problem to cluster-level connections, providing interpretable trajectories for linear or mildly branched progressions.

To handle branching trajectories, methods like PAGA first abstract the data into a coarse-grained graph by partitioning cells and estimating the connectivity between partitions from the number of k-NN graph edges that link them, compared with the number expected if edges were placed at random. Weakly supported connections are discarded, and the remaining edges indicate which cellular states are continuously connected and where lineages diverge; the coarse graph can then be refined into fine-grained paths while preserving topology. This abstraction step reconciles clustering with continuity, enabling scalable analysis of high-dimensional data.

Recent extensions address multifurcating and cyclic topologies, which challenge traditional tree models. PHLOWER (2025) infers complex, multi-branching differentiation trees from multimodal single-cell data, such as combined transcriptomics and proteomics, by integrating Hodge decomposition to decompose cell connectivity into harmonic, gradient, and curl components, thus capturing cycles and multifurcations in developmental processes. This method improves accuracy on datasets with non-tree structures, like looping trajectories in immune responses.
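The cluster-then-MST strategy used by tools such as Slingshot and TSCAN can be sketched in a few lines. The example assumes four hypothetical cluster centroids in a 2-D embedding, builds the minimum spanning tree with Prim's algorithm on Euclidean distances, and assigns each cluster a coarse pseudotime equal to its path length from a chosen root. It is not the implementation of either tool, and it omits the later steps (principal-curve fitting or projection of individual cells).

```python
import math

def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (from, to) pairs."""
    names = list(points)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        best = None
        for a in in_tree:
            for b in names:
                if b in in_tree:
                    continue
                d = math.dist(points[a], points[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges

def tree_distances(points, edges, root):
    """Path length from the root to every node along the tree."""
    adj = {name: [] for name in points}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = {root: 0.0}
    stack = [root]
    while stack:
        a = stack.pop()
        for b in adj[a]:
            if b not in dist:
                dist[b] = dist[a] + math.dist(points[a], points[b])
                stack.append(b)
    return dist

# Hypothetical cluster centroids; names and coordinates are made up.
centroids = {
    "progenitor": (0.0, 0.0),
    "intermediate": (1.0, 0.0),
    "fate_A": (2.0, 1.0),
    "fate_B": (2.0, -1.5),
}
edges = mst_edges(centroids)
cluster_pseudotime = tree_distances(centroids, edges, root="progenitor")
print(edges)
print(cluster_pseudotime)
```

In this toy layout the tree branches at the intermediate cluster, so the two fates receive separate pseudotime values measured along their own lineage.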

Incorporation of prior knowledge

Trajectory inference methods can be enhanced by incorporating external biological knowledge, known as priors, to guide the reconstruction process and improve alignment with known cellular dynamics. These priors help constrain the vast space of possible trajectories derived from data-driven approaches, reducing ambiguity in branching structures or pseudotime ordering. Common types of priors include gene regulatory networks (GRNs), which weight potential transitions between cell states by prioritizing edges supported by known regulatory interactions; known marker genes, used to identify the root or starting point of a trajectory based on expression patterns of stem cell or progenitor markers; and time-series labels from longitudinal experiments, which anchor cells to specific developmental stages across time points.^[29]^[30]^[31] One prevalent approach for integration involves constrained optimization frameworks, where the trajectory is formulated as minimizing a composite cost function that balances data fidelity with adherence to priors. For instance, the objective can be expressed as:

\text{Cost} = \text{Data fit} + \lambda \cdot \text{Prior violation}

Here, the data fit term measures agreement with observed gene expression or spatial coordinates, while the prior violation penalizes deviations from biological constraints, such as spatial contiguity or regulatory strengths, with \lambda tuning the trade-off. This regularization ensures trajectories respect domain-specific assumptions without fully relying on unsupervised inference.^[32]

Examples of prior incorporation include using lineage tracing data from clonal labeling experiments to validate inferred branch points, confirming that trajectory bifurcations align with experimentally observed cell fate divergences. Similarly, RNA velocity estimates, derived from ratios of unspliced to spliced mRNA, provide directional priors by aligning velocity vectors with the trajectory manifold, orienting pseudotime flow toward predicted future states and resolving ambiguities in cyclic or multifurcating paths.^[33]

Advanced methods leverage Bayesian frameworks to incorporate hierarchical priors for more robust inference, particularly across multiple samples. For example, VITAE employs a latent hierarchical mixture model with variational autoencoders to jointly infer trajectories while integrating sample-specific priors on topology and dynamics, enabling scalable analysis of heterogeneous datasets like brain cell atlases.^[34]

Incorporating priors enhances the biological relevance of trajectories, often yielding more interpretable and accurate reconstructions in complex systems like hematopoiesis or neurogenesis, where pure data-driven methods may overlook subtle regulatory cues. However, reliance on priors risks introducing bias if the external knowledge is incomplete or species-specific, potentially overfitting to erroneous assumptions and reducing generalizability across datasets.^[2]
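A toy version of the composite objective, Cost = Data fit + \lambda \cdot Prior violation, shows how a prior can resolve an ambiguity that the data alone cannot. The sketch assumes three hypothetical clusters placed on a 1-D embedding, scores every possible ordering by the total distance between consecutive clusters (the data fit), and counts violated "must precede" constraints as the prior violation. All names, positions, and the constraint are invented for illustration.

```python
from itertools import permutations

# Hypothetical 1-D embedding positions of three clusters.
positions = {"stem": 0.0, "progenitor": 1.0, "mature": 2.2}
# Hypothetical prior, e.g. from marker genes: stem cells come before mature cells.
must_precede = [("stem", "mature")]

def data_fit(order):
    """Total embedding distance travelled when visiting clusters in this order."""
    return sum(abs(positions[a] - positions[b]) for a, b in zip(order, order[1:]))

def prior_violation(order):
    """Number of precedence constraints that the ordering breaks."""
    rank = {name: i for i, name in enumerate(order)}
    return sum(1 for a, b in must_precede if rank[a] > rank[b])

def cost(order, lam):
    return data_fit(order) + lam * prior_violation(order)

candidates = list(permutations(positions))
best = min(candidates, key=lambda o: cost(o, lam=1.0))
print(best, cost(best, lam=1.0))
```

With \lambda = 0 the forward and reversed orderings have identical cost, because the data fit is symmetric in direction; any positive \lambda breaks the tie in favor of the ordering that respects the prior.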

Applications

Developmental biology

Trajectory inference has been instrumental in reconstructing embryogenesis trajectories in model organisms, such as the progression from zygote to gastrula stages in zebrafish and mouse embryos. In zebrafish, single-cell RNA sequencing (scRNA-seq) data from over 38,000 cells across 12 developmental stages revealed continuous transcriptional changes underlying somitogenesis, neural crest migration, and other morphogenetic processes, enabling the mapping of cell fate transitions without relying on prior lineage knowledge.^[35] Similarly, in mouse embryos, integration of scRNA-seq datasets spanning gastrulation to early organogenesis identified 56 distinct cell trajectories, highlighting spatiotemporal gene expression dynamics during primitive streak formation and germ layer specification.

Prominent examples include the inference of hematopoiesis trajectories, where branching structures delineate myeloid and lymphoid lineages from hematopoietic stem cells, revealing sequential differentiation waves in bone marrow progenitors.^[36] In pancreatic development, trajectory analysis of fetal scRNA-seq data has delineated the divergence of endocrine progenitors into alpha and beta cells, capturing key transcriptional shifts that govern insulin-producing cell maturation.^[37] These applications have yielded critical insights, such as pinpointing tipping points in fate decisions, where cells commit irreversibly to specific lineages, and uncovering novel intermediate states that bridge progenitor and differentiated populations, thereby refining models of developmental progression.^[15]

Notable case studies underscore these advances: a 2019 analysis of mouse gastrulation on integrated scRNA-seq profiles from embryonic days 6.5 to 8.5 reconstructed trajectories for epiblast diversification and mesoderm formation, identifying regulatory modules active at germ layer boundaries.^[38] More recently, branching trajectory analyses of human retinal organoids derived from induced pluripotent stem cells have mapped differentiation paths, revealing off-target cell populations in vitro.

Multi-omics integration further enhances trajectory inference in developmental contexts, exemplified by combining scRNA-seq with assay for transposase-accessible chromatin sequencing (ATAC-seq) to trace epigenetic landscapes during human fetal hematopoiesis. This approach aligned transcriptional and chromatin accessibility profiles across approximately 8,500 cells and nuclei, disclosing dynamic enhancer remodeling that coordinates myeloid and erythroid branching from common progenitors.^[39]

Disease progression and other fields

Trajectory inference has been instrumental in modeling disease progression by reconstructing dynamic cellular states from single-cell data, particularly in oncology where it elucidates paths of tumor evolution and metastasis within heterogeneous microenvironments. For instance, in prostate cancer, trajectory analysis of single-cell transcriptomes has delineated progression from normal to metastatic states, identifying key transcriptional changes along pseudotemporal paths that highlight driver genes and potential intervention points.^[40] Similarly, in breast and other solid tumors, methods like alignment of single-cell trajectories across conditions have enabled the inference of branching paths representing divergent metastatic routes, integrating multi-omics data to reveal how tumor cells adapt to selective pressures in the microenvironment. These approaches not only map linear progression but also capture bifurcations that correspond to therapeutic resistance or dormancy, providing insights into metastasis mechanisms. In neurodegenerative diseases, trajectory inference has mapped transitions from healthy to pathological states, as exemplified in Alzheimer's disease (AD) studies around 2022 that applied single-cell sequencing guidelines to reconstruct glial and neuronal trajectories influenced by amyloid and tau pathology.^[41] These analyses revealed pseudotemporal ordering of cell states, showing how microglia activation and astrocyte dysfunction contribute to neurodegeneration, with trajectories diverging based on genetic risk factors. More recent extensions have integrated perturbation data from precision medicine cohorts to infer causal paths of therapeutic efficacy, such as how targeted inhibitors alter branching in AD-related cellular models, thereby identifying responsive subpopulations for personalized interventions. 
Beyond canonical diseases, trajectory inference extends to immune dynamics during infections, where it has characterized COVID-19 response trajectories by ordering immune cell states from activation to exhaustion in severe cases. For example, single-cell analyses have inferred epithelial-immune interaction paths, revealing how airway cells transition amid viral challenge and how immune niches bifurcate into recovery or hyperinflammatory routes. In stem cell reprogramming, optimal transport-based trajectory methods have uncovered heterogeneous paths from somatic to pluripotent states, highlighting asynchronous diversification and regulatory networks that govern efficiency. In microbial ecology, trajectory concepts applied to community succession, such as in soil ecosystems, model bacterial shifts over environmental gradients, inferring successional paths shaped by nutrient availability and interactions, though single-cell applications remain emerging. A key advantage of trajectory inference in these contexts is its ability to pinpoint branch points as therapeutic windows, where interventions can redirect pathological paths, as seen in cancer models where pseudotime analysis identifies vulnerable states for drug targeting. Furthermore, integrating perturbation data, such as from CRISPR screens or drug exposures, enables causal inference along trajectories, enhancing precision medicine by predicting response heterogeneity. Extensions to spatial transcriptomics have further advanced disease modeling by inferring tissue-level trajectories, for instance, in tumors where spatial pseudotime reveals invasion gradients and in AD brains where it maps regional progression from healthy to degenerative zones.

Software tools

Monocle

Monocle is an open-source R package, with versions 1 and 2 distributed through Bioconductor, developed for analyzing single-cell RNA sequencing data with a focus on trajectory inference to model dynamic gene expression changes along biological processes such as differentiation. Introduced in 2014 with version 1, Monocle pioneered the concept of pseudotime, an unsupervised ordering of cells along a trajectory to reconstruct temporal progression from static snapshots of transcriptomes. Monocle 2, released in 2017, extended this to support branching trajectories by employing reversed graph embedding to capture complex fate decisions without prior knowledge of the structure. Monocle 3, launched in 2019, further advanced scalability for large datasets through a graph abstraction framework that enables efficient partitioning and trajectory construction across millions of cells.

A core feature of Monocle 2 is its use of DDRTree (discriminative dimensionality reduction via learning a tree) for low-dimensional embedding, which learns a tree-structured principal graph jointly with the embedding so that cells are positioned in a space that reflects their progression. Monocle 3 instead learns its principal graph on a UMAP embedding and adds partition-based trajectory learning, where cells are first grouped into partitions representing potentially separate lineages, followed by the construction of a principal graph within each partition. It integrates tightly with cell clustering algorithms, such as those based on Leiden or Louvain methods, to define the scope of trajectories and identify branch points corresponding to fate decisions.

The standard workflow in Monocle begins with dimensionality reduction of the preprocessed expression matrix, such as PCA followed by UMAP, then clustering, graph learning to fit the trajectory structure, and pseudotime assignment to quantify each cell's position along the paths (in Monocle 3, the functions preprocess_cds, reduce_dimension, cluster_cells, learn_graph, and order_cells).
This process supports visualization of trajectories in 2D or 3D embeddings and can incorporate RNA velocity information to orient branches and predict future states. Monocle 3 enhances this by allowing trajectory refinement within specific partitions or across multiple ones, facilitating modular analysis of complex datasets. Monocle's strengths lie in its ability to handle large-scale single-cell datasets efficiently, processing up to millions of cells while producing interpretable 2D/3D trajectory visualizations that aid in exploring lineage relationships. However, the graph learning and embedding steps can be computationally intensive for very large datasets exceeding current hardware limits, and the method assumes predominantly tree-like topologies, potentially underperforming on datasets with cyclic or highly disconnected structures.

Slingshot

Slingshot is an open-source R package developed for trajectory inference in single-cell RNA sequencing data, first introduced in 2017. It models continuous developmental trajectories by leveraging pre-computed cell clusters in a low-dimensional space, fitting principal curves to represent lineages and pseudotime. The method assumes that cells within clusters are more similar to each other than to those in other clusters, enabling robust lineage reconstruction without directly operating on high-dimensional gene expression data.^[42]

A core strength of Slingshot lies in its ability to simultaneously infer pseudotime and branching lineages, starting from a user-specified root cluster or, when none is given, choosing one automatically. It accommodates complex topologies, including multiple branches emanating from a common progenitor, which is particularly useful for capturing differentiation hierarchies in biological processes. Unlike methods that rely on continuous embeddings alone, Slingshot emphasizes discrete clustering as a foundation, providing interpretable structures that align with known cell types.^[9]

The workflow begins with dimensionality reduction techniques, such as principal component analysis (PCA) or uniform manifold approximation and projection (UMAP), to obtain cell coordinates, followed by clustering (e.g., using k-means or density-based methods). Slingshot then constructs a minimum spanning tree (MST) connecting cluster centroids to define the global lineage topology. Principal curves are fitted along each path in the MST, and cells are projected onto these curves to assign pseudotime values that reflect progression along the inferred trajectories. This process is implemented via the slingshot() wrapper function, whose results can be stored in a SingleCellExperiment object for downstream analyses.

Slingshot excels in interpretability for branching trajectories, allowing visualization of lineages and pseudotime on reduced-dimensional plots using integrated plotting functions. It integrates directly with the tradeSeq package for differential expression analysis, enabling tests for genes changing along specific paths or between branches, which enhances its utility in identifying regulatory dynamics. For instance, in hematopoiesis datasets like those from Paul et al. (2015), Slingshot has been used to reconstruct myeloid and erythroid lineages from hematopoietic stem cells, revealing pseudotime-ordered gene expression patterns consistent with known differentiation cascades. Recent package versions maintain compatibility with diverse input formats, including those derived from spatial transcriptomics by treating spatial coordinates as additional dimensions in the embedding.^[43]
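
The cluster-then-connect step described above can be illustrated with a minimal Python sketch. It is not Slingshot's implementation: the function names `centroid_mst` and `lineages` are hypothetical, plain Euclidean distances between centroids are an assumption made for brevity (the package may use a different inter-cluster distance), and the principal-curve fitting step is omitted entirely.

```python
import numpy as np

def centroid_mst(X, labels):
    """Build a minimum spanning tree over cluster centroids (Prim's algorithm).

    Assumption: Euclidean distance between centroids; hypothetical helper.
    """
    labels = np.asarray(labels)
    ids = sorted(set(labels.tolist()))
    cent = np.array([X[labels == c].mean(axis=0) for c in ids])
    D = np.linalg.norm(cent[:, None] - cent[None, :], axis=2)
    in_tree, edges = [0], []
    while len(in_tree) < len(ids):
        best = None
        # Pick the cheapest edge leaving the current tree.
        for i in in_tree:
            for j in range(len(ids)):
                if j not in in_tree and (best is None or D[i, j] < best[0]):
                    best = (D[i, j], i, j)
        _, i, j = best
        edges.append((ids[i], ids[j]))
        in_tree.append(j)
    return ids, edges

def lineages(edges, root):
    """Enumerate root-to-leaf paths of the tree; each path is one lineage."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    paths, stack = [], [[root]]
    while stack:
        path = stack.pop()
        nxt = [v for v in adj.get(path[-1], []) if v not in path]
        if not nxt:
            paths.append(path)
        for v in nxt:
            stack.append(path + [v])
    return sorted(paths)

# Toy embedding: a stem cluster, a progenitor cluster, and two fates.
X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 0.0], [2.1, -0.1],
              [4.0, 2.0], [4.1, 2.1], [4.0, -2.0], [3.9, -2.1]])
labels = ["stem", "stem", "prog", "prog", "ery", "ery", "mye", "mye"]

ids, edges = centroid_mst(X, labels)
paths = lineages(edges, "stem")
print(paths)
```

On this toy input the tree branches at the progenitor cluster, giving one lineage per terminal cluster. In the real method, a smooth curve would then be fitted along each lineage and cells projected onto it to obtain pseudotime.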

PAGA

PAGA (Partition-based Graph Abstraction) is a method introduced in 2019 and implemented in the Python package Scanpy for single-cell RNA sequencing analysis, designed to infer coarse-grained trajectory topologies by creating interpretable graph-like maps of cellular manifolds.^[11] It addresses the challenge of reconciling clustering with trajectory inference by abstracting high-dimensional data into a simplified graph that preserves global connectivity structures.^[11] As an implementation of trajectory construction algorithms, PAGA focuses on initial topology estimation rather than full pseudotime assignment.^[44]

The core feature of PAGA involves constructing a k-nearest neighbor (kNN) graph over cells, partitioning cells into communities using algorithms like Louvain clustering, and then estimating connectivities between these partitions to form an abstracted PAGA graph.^[11] This graph uses edge weights to represent confidence in connections, enabling visualization of branching or linear trajectories while maintaining multi-resolution views of the data manifold.^[11] By prioritizing global topology preservation, PAGA avoids distortions common in dimensionality reduction techniques and supports integration with RNA velocity for directional inference.^[11]

In a typical workflow, PAGA first computes the kNN graph and partitions, abstracts these into the PAGA graph, and then facilitates pseudotime estimation through methods like diffusion pseudotime (DPT) on the abstracted structure.^[44] This stepwise approach allows users to refine trajectories iteratively, starting from a robust coarse map.^[11]

PAGA's strengths include high scalability, processing a dataset of 1.3 million neurons in approximately 90 seconds on standard hardware, making it suitable for large-scale analyses.^[11] It demonstrates robustness to technical noise in single-cell data and provides intuitive visualizations of connectivity strengths via plot functions in Scanpy.^[11] These attributes enable efficient exploration of complex datasets without excessive computational demands.^[45]

Applications of PAGA span developmental biology, including trajectory inference in mouse hematopoiesis across multiple datasets, regeneration studies in adult planaria, and embryonic patterning in zebrafish.^[11] It has been employed in large-scale single-cell atlases, such as the 2023 human embryonic limb cell atlas, where it generated abstracted graphs to resolve spatial and temporal cellular progressions.^[46]
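
The idea of abstracting a kNN graph into a partition-level graph can be sketched in a few lines of Python. This is a simplified stand-in, not Scanpy's statistic: the helper names `knn_edges` and `abstracted_graph` are hypothetical, and the normalization used here (observed inter-partition edges divided by the number expected if edges were wired at random in proportion to partition degree) is an assumption chosen for illustration.

```python
import numpy as np

def knn_edges(X, k):
    """Undirected kNN edge set from Euclidean distances (hypothetical helper)."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    np.fill_diagonal(D, np.inf)  # a cell is never its own neighbor
    edges = set()
    for i in range(len(X)):
        for j in np.argsort(D[i])[:k]:
            edges.add((min(i, int(j)), max(i, int(j))))
    return edges

def abstracted_graph(edges, labels):
    """Partition-level connectivity: observed / expected inter-partition edges.

    Assumption: expected counts follow a degree-preserving random wiring,
    deg_a * deg_b / (2 * n_edges); this is a simplification for illustration.
    """
    groups = sorted(set(labels))
    idx = {g: a for a, g in enumerate(groups)}
    obs = np.zeros((len(groups), len(groups)))
    deg = np.zeros(len(groups))
    for i, j in edges:
        a, b = idx[labels[i]], idx[labels[j]]
        deg[a] += 1
        deg[b] += 1
        if a != b:
            obs[a, b] += 1
            obs[b, a] += 1
    expected = np.outer(deg, deg) / (2 * len(edges))
    conn = np.divide(obs, expected, out=np.zeros_like(obs), where=expected > 0)
    np.fill_diagonal(conn, 0.0)
    return groups, conn

# Toy data: three partitions laid out along a line, A - B - C.
xs = [0.0, 0.3, 0.6, 1.0, 1.3, 1.6, 2.0, 2.3, 2.6]
X = np.array([[x, 0.0] for x in xs])
labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3

groups, conn = abstracted_graph(knn_edges(X, k=2), labels)
print(groups)
print(conn.round(2))
```

In this toy case, partitions A and C share no kNN edges, so their connectivity is zero and the abstracted graph is the linear chain A, B, C. In practice the equivalent steps are run through Scanpy's own neighbor, clustering, and PAGA functions rather than hand-written code.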

Other notable tools

TSCAN (Tools for Single Cell Analysis) is a cluster-based trajectory inference method that reconstructs pseudotime by ordering cell clusters along a minimum spanning tree derived from pairwise distances between cluster centers, making it particularly suitable for linear trajectories in single-cell RNA-seq data.^[8] Introduced in 2016, TSCAN emphasizes simplicity and computational efficiency, aims for robust pseudotime estimation even in noisy datasets, and incorporates quantitative evaluation metrics for trajectory quality.^[8]

Early graph- and diffusion-based approaches like Wanderlust and Wishbone laid foundational work for ordering cells and handling branching structures. Wanderlust, developed in 2014, infers a one-dimensional trajectory by propagating distances from a user-specified root cell across a k-nearest neighbors graph, effectively capturing progression in processes such as human B cell development.^[47] Wishbone, building on this in 2016, extends the approach to detect a bifurcation by combining diffusion maps with waypoint-based graph distances to locate the branch point, proving effective for developmental bifurcations like hematopoietic differentiation. While influential, these methods are now considered outdated for large-scale datasets due to scalability limitations and sensitivity to the choice of root cell.^[2]

More recent tools address complex topologies and multimodal data. scSAE, introduced in 2025, employs stacked autoencoders to learn nonlinear latent representations of single-cell data, enabling inference of intricate, nonlinear trajectories without assuming predefined structures.^[48] VITAE, from 2024, integrates variational autoencoders with a latent hierarchical mixture model for Bayesian joint inference of trajectories and cell types, supporting multi-sample integration and outperforming prior methods on diverse topologies like trees and cycles.^[34] PHLOWER, released in 2025, leverages Hodge decomposition on multimodal single-cell data (e.g., RNA and protein) to infer complex, multi-branching differentiation trees, enhancing accuracy in scenarios with spatial or multi-omics inputs.^[49] For time-series data, IDTI (2024) computes trajectories by quantifying the increment of diversity between adjacent time points, avoiding the need for root cell specification and improving reconstruction in dynamic processes like embryonic development.^[50]
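
The root-based distance propagation used by early graph methods such as Wanderlust can be sketched as shortest-path distances on a kNN graph. The following Python example shows only that core idea, under stated assumptions: the helper names are hypothetical, a single graph is used, and the waypoint and ensemble refinements of the published method are not reproduced.

```python
import heapq
import numpy as np

def knn_graph(X, k):
    """Symmetric kNN graph as an adjacency dict with Euclidean edge lengths."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    np.fill_diagonal(D, np.inf)  # exclude self-neighbors
    adj = {i: {} for i in range(len(X))}
    for i in range(len(X)):
        for j in np.argsort(D[i])[:k]:
            adj[i][int(j)] = adj[int(j)][i] = float(D[i, j])
    return adj

def geodesic_pseudotime(adj, root):
    """Dijkstra shortest-path distance from the root cell to every other cell."""
    dist = [float("inf")] * len(adj)
    dist[root] = 0.0
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (dist[v], v))
    return np.array(dist)

# Toy data: 20 cells along a half circle, so the path through the data
# is longer than the straight-line distance between its two ends.
theta = np.linspace(0.0, np.pi, 20)
X = np.column_stack([np.cos(theta), np.sin(theta)])

pseudotime = geodesic_pseudotime(knn_graph(X, k=2), root=0)
straight_line = float(np.linalg.norm(X[-1] - X[0]))
print(pseudotime[-1], straight_line)
```

Because distances are accumulated along the graph rather than measured directly, the last cell receives a pseudotime close to the arc length instead of the chord length. The result depends entirely on which cell is chosen as `root`, which illustrates the root sensitivity noted above.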

Tool	Key Strength	Limitation	Citation
TSCAN	Fast cluster-based linear ordering via MST	Dependent on initial clustering quality	^[8]
Wishbone	Detects single bifurcations in diffusion space	Limited to one branch point; scales poorly	^[51]
Wanderlust	Simple root-to-end progression mapping	Assumes linear paths; root-sensitive	^[47]
scSAE	Handles nonlinear paths with autoencoders	Requires tuning for deep architectures	^[48]
VITAE	Bayesian integration across samples	Computationally intensive for large datasets	^[34]
PHLOWER	Multimodal support for complex trees	Relies on graph decomposition assumptions	^[49]
IDTI	Tailored for time-series without roots	Assumes temporal ordering in input	^[50]

The dynverse framework, whose benchmark was published in 2019, provides standardized benchmarking for these tools against gold-standard simulations. Evaluations built on it indicate that methods like TSCAN excel in speed for linear cases, while emerging approaches such as VITAE and PHLOWER report higher accuracy on branched and multimodal benchmarks, often surpassing established tools like Monocle in topological fidelity.^[52]^[49]^[34]

Challenges and future directions

Current limitations

Trajectory inference methods face significant technical challenges that limit their scalability and robustness in handling large-scale single-cell RNA sequencing (scRNA-seq) datasets. For instance, many algorithms struggle with ultra-large datasets comprising millions of cells due to high computational demands, particularly when integrating multimodal data such as transcriptomics with proteomics or spatial information, which exacerbates memory and processing requirements.^[53] Additionally, these methods are highly sensitive to technical artifacts like dropout events, where gene expression is falsely recorded as zero due to inefficient mRNA capture, and to batch effects arising from variations in experimental protocols across samples, both of which can distort pseudotime ordering and lead to spurious branching structures.^[53]^[16]

Biologically, trajectory inference often grapples with distinguishing correlations in gene expression from true causal relationships, as stochastic fluctuations in cellular processes can mimic directional progressions without implying mechanism.^[53] Handling inherent biological stochasticity, such as random state transitions or feedback loops in gene regulatory networks, further complicates accurate reconstruction, as methods typically assume smooth, deterministic paths that may not capture abrupt or reversible dynamics.^[16] In noisy datasets, this can result in overfitting, where algorithms infer excessive branches that reflect artifacts rather than genuine bifurcations, as observed in applications of tools like Slingshot to heterogeneous tissues.^[53] Similarly, without explicit priors, these methods frequently fail to model cyclic processes, such as cell cycle oscillations, leading to linearized trajectories that ignore periodicity.^[53]

Validation of inferred trajectories remains problematic due to the absence of ground truth in most biological systems, making it difficult to assess accuracy beyond qualitative alignment with known markers.^[16] Common metrics, such as trajectory alignment scores, are often dataset-specific and sensitive to preprocessing choices, limiting generalizability across studies.^[16] As of 2025, biases introduced by uneven sampling, where rare cell types are underrepresented, persist as a key issue, potentially skewing trajectories toward dominant populations and amplifying errors in multimodal integrations that demand balanced data representation.^[53] These limitations underscore the need for cautious interpretation in complex, real-world applications.
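
Where some reference ordering does exist, for example known sampling time points, one simple check is the rank correlation between inferred pseudotime and that reference. The Python sketch below is an illustration under assumptions, not a metric taken from any cited benchmark: the function names are hypothetical, ties are not handled, and the absolute value is used because the direction of pseudotime is arbitrary when no root is fixed.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation (assumes no tied values in either input)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def ordering_score(pseudotime, reference):
    """Absolute rank correlation, so a reversed trajectory is not penalised."""
    return abs(spearman(pseudotime, reference))

reference = np.arange(10.0)                  # hypothetical known ordering
monotone = reference ** 2 + 1.0              # same order, different scale
reversed_pt = -monotone                      # same order, opposite direction
scrambled = np.array([3, 1, 4, 0, 5, 9, 2, 6, 8, 7], dtype=float)

score_monotone = ordering_score(monotone, reference)
score_reversed = ordering_score(reversed_pt, reference)
score_scrambled = ordering_score(scrambled, reference)
print(score_monotone, score_reversed, score_scrambled)
```

A score of this kind only evaluates cell ordering along a single path; it says nothing about whether branches were placed correctly, which is one reason such metrics generalize poorly across datasets.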

Emerging trends

Recent advancements in trajectory inference are increasingly focusing on integrating multimodal data sources to enhance the accuracy and biological relevance of inferred cellular paths. For instance, methods like PHLOWER incorporate single-cell RNA sequencing (scRNA-seq) alongside proteomics and spatial transcriptomics to reconstruct complex, multi-branching differentiation trajectories, leveraging Hodge decomposition for spatial-aware inference.^[49] This approach addresses limitations in unimodal analyses by capturing interactions across molecular layers, as demonstrated in kidney organoid models where multimodal integration improved trajectory resolution.^[54]

In parallel, artificial intelligence and machine learning innovations are enabling more interpretive and end-to-end trajectory modeling. Large language models (LLMs) are being integrated into analysis pipelines, such as in PyEvoCell, which applies LLM capabilities to outputs from tools like Monocle3 for suggesting biologically relevant lineages and aiding interpretation.^[55] Deep learning frameworks, exemplified by scSAE (single-cell stacked autoencoders), facilitate direct trajectory inference from raw scRNA-seq data through unsupervised feature learning, reducing reliance on predefined pseudotime assumptions.^[48]

Joint analysis techniques are advancing cross-sample comparability, with Genes2Genes providing a Bayesian dynamic programming framework for aligning gene expression trajectories between datasets, enabling the identification of conserved or divergent patterns in pseudotime.^[13] These alignments support time-series augmentation strategies that refine pseudotime estimation by incorporating temporal replicates, improving robustness in heterogeneous samples.^[56]

Looking ahead, causal inference methods are emerging to incorporate experimental interventions, such as in CASCAT, which uses tree-shaped structural causal models on spatial transcriptomics to infer directed cell differentiation paths and predict perturbation outcomes.^[14] Real-time trajectory analysis is also gaining traction for clinical applications, with machine learning integrations poised to enable on-the-fly pseudotime estimation in patient-derived samples for personalized medicine.^[57] Community-driven standardization efforts are bolstering these developments through updated benchmarks like dyno, which evaluates over 50 trajectory inference methods on gold-standard datasets to guide method selection and foster interoperability.^[52] Open-source ecosystems, including repositories under dynverse, promote collaborative tool development and reproducible workflows.^[58]
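
The dynamic programming idea behind trajectory alignment can be illustrated with classic dynamic time warping (DTW) on one gene's expression profile from two pseudotime-ordered datasets. DTW is used here as a deliberately simpler stand-in; it is not the Bayesian scoring scheme of Genes2Genes, and the function name `dtw_align` is hypothetical.

```python
import numpy as np

def dtw_align(x, y):
    """Dynamic time warping between two 1-D profiles ordered by pseudotime.

    Returns the total alignment cost and the warping path as index pairs.
    """
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1],  # match both points
                                 cost[i - 1, j],      # advance x only
                                 cost[i, j - 1])      # advance y only
    # Backtrack from the end to recover which points were matched.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return float(cost[n, m]), path[::-1]

# Toy profiles: the second dataset shows the same expression pattern,
# but lingers longer in the first and third states.
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 0.0, 1.0, 2.0, 2.0, 3.0]

total_cost, path = dtw_align(x, y)
print(total_cost, path)
```

A zero cost with a non-diagonal path indicates that the two profiles follow the same pattern at different rates. Unlike this sketch, dedicated alignment frameworks can also report regions present in only one trajectory as mismatches rather than forcing every point to be matched.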

References

[1]
A comparison of single-cell trajectory inference methods - Nature
Apr 1, 2019 · Trajectory inference approaches analyze genome-wide omics data from thousands of single cells and computationally infer the order of these ...
[2]
Recent advances in trajectory inference from single-cell omics data
Trajectory inference methods have emerged as a novel class of single-cell bioinformatics tools to study cellular dynamics at unprecedented resolution.
[3]
Current progress and potential opportunities to infer single-cell ... - NIH
Here, we summarize and categorize the most recent and popular computational approaches for trajectory inference based on the information they leverage.
[4]
The dynamics and regulators of cell fate decisions are revealed by ...
Mar 23, 2014 · Monocle orders cells by progress through a biological process, resulting in an induced 'pseudotime' scale describing that process in ...
[5]
Pseudo-time reconstruction and evaluation in single-cell RNA-seq ...
May 13, 2016 · In this article, we exploit this idea to develop Tools for Single Cell Analysis (TSCAN), a new tool for pseudo-time reconstruction. One ...
[6]
Slingshot: cell lineage and pseudotime inference for single-cell ...
Jun 19, 2018 · We introduce Slingshot, a novel method for inferring cell lineages and pseudotimes from single-cell gene expression data.
[7]
Diffusion pseudotime robustly reconstructs lineage branching - Nature
Aug 29, 2016 · Diffusion pseudotime (DPT) enables robust and scalable inference of cellular trajectories, branching events, metastable states and ...
[8]
The single-cell transcriptional landscape of mammalian ... - Nature
Feb 20, 2019 · We use Monocle 3 to identify hundreds of cell types and 56 trajectories, many of which are detected only because of the depth of cellular ...
[9]
PAGA: graph abstraction reconciles clustering with trajectory ...
Finally, we show how PAGA abstracts transition graphs, for instance, from RNA velocity and compare to previous trajectory-inference algorithms.
[10]
Concepts and limitations for learning developmental trajectories ...
Jun 27, 2019 · Summary: This Review describes the concepts and use of computational approaches to infer cellular trajectories from single cell expression ...
[11]
Trajectory Inference for Single Cell Omics - PMC - NIH
Trajectory inference is used to order single-cell omics data along a path that reflects a continuous transition between cells.
[12]
A robust and accurate single-cell data trajectory inference method ...
Feb 20, 2023 · We proposed a novel framework for trajectory inference called the single-cell data Trajectory inference method using Ensemble Pseudotime inference (scTEP).
[13]
BGP: identifying gene-specific branching dynamics from single-cell ...
May 29, 2018 · We develop the branching Gaussian process (BGP), a non-parametric model that is able to identify branching dynamics for individual genes.
[14]
Benchmarking atlas-level data integration in single-cell genomics
Dec 23, 2021 · Moreover, we use 14 metrics to evaluate the integration methods on their ability to remove batch effects while conserving biological variation.
[15]
Accuracy, robustness, scalability of dimensionality reduction methods
Dec 10, 2019 · We compare 18 different dimensionality reduction methods on 30 publicly available scRNA-seq datasets that cover a range of sequencing techniques and sample ...
[16]
Principal component analysis: a review and recent developments
Apr 13, 2016 · Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, increasing interpretability but at the same time minimizing ...
[17]
[PDF] Visualizing Data using t-SNE - Journal of Machine Learning Research
We present a new technique called “t-SNE” that visualizes high-dimensional data by giving each datapoint a location in a two or three-dimensional map.
[18]
UMAP: Uniform Manifold Approximation and Projection for ... - arXiv
Feb 9, 2018 · UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Authors:Leland McInnes, John Healy, James Melville.
[19]
Geometric diffusions as a tool for harmonic analysis and structure ...
Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. R. R. Coifman, S. Lafon, A. B. Lee, +3 , M. Maggioni ...
[20]
Embracing the dropouts in single-cell RNA-seq analysis - Nature
Mar 3, 2020 · As a result of the dropouts, the scRNA-seq data is often highly sparse. The excessive zero counts cause the data to be zero-inflated, only ...
[21]
ZIFA: Dimensionality reduction for zero-inflated single-cell gene ...
Nov 2, 2015 · The inclusion of a zero-inflation model gives ZIFA greater expressive power than standard PPCA/FA but increases the computational complexity. We ...
[22]
Trustworthiness and metrics in visualizing similarity of gene ...
Oct 13, 2003 · In this paper, we study one of the main tasks of exploratory data analysis: visualization of similarity relationships among high-dimensional ...
[23]
ENTRAIN: integrating trajectory inference and gene regulatory ... - NIH
A computational method that integrates trajectory inference methods with ligand-receptor pair gene regulatory networks to identify extracellular signals.
[24]
descriptive marker gene approach to single-cell pseudotime inference
Jun 23, 2018 · Here we introduce an orthogonal Bayesian approach termed 'Ouija' that learns pseudotimes from a small set of marker genes.
[25]
Tempora: Cell trajectory inference using time-series single-cell RNA ...
Sep 9, 2020 · We present Tempora, a novel cell trajectory inference method that orders cells using time information from time-series scRNA-seq data.
[26]
Accurate trajectory inference in time-series spatial transcriptomics ...
There are five arguments in the objective function, and five tunable parameters: α , f b , ρ 1 , ρ 2 and ϵ . The first argument is the earth-mover's distance: ...
[27]
Creating Lineage Trajectory Maps Via Integration of Single-Cell ...
The inferred lineage trajectories and branch points can be tested and validated by using clonal level lineage tracing.
[28]
Joint trajectory inference for single-cell genomics using ... - PNAS
Trajectory inference methods are essential for analyzing the developmental paths of cells in single-cell sequencing datasets. It provides insights into cellular ...
[29]
Single-cell reconstruction of developmental trajectories during ...
Apr 26, 2018 · Three research groups have used single-cell RNA sequencing to analyze the transcriptional changes accompanying development of vertebrate embryos.
[30]
New insights into hematopoietic differentiation landscapes from ...
Mar 28, 2019 · Single-cell transcriptomics has recently emerged as a powerful tool to analyze cellular heterogeneity, discover new cell types, and infer putative ...
[31]
A transcriptomic roadmap to α- and β-cell differentiation in the ...
Jun 24, 2019 · Pancreatic epithelial cells are located in the top of the trajectory, α-cells on the left trajectory and β-cells are on the right trajectory, as ...
[32]
Single‐Cell Transcriptome Landscape and Cell Fate Decoding in ...
May 6, 2024 · To further assess the cell lineage differentiation, single-cell branching trajectory analysis was performed for two main cell types, neurons and ...
[33]
https://pmc.ncbi.nlm.nih.gov/articles/PMC6161781/
[34]
Trajectory-based differential expression analysis for single-cell ...
Mar 5, 2020 · Trajectory inference has radically enhanced single-cell RNA-seq research by enabling the study of dynamic changes in gene expression.
[35]
scanpy.tl.paga - Read the Docs
A much simpler abstracted graph (PAGA graph) of partitions, in which edge weights represent confidence in the presence of connections.
[36]
theislab/paga: Mapping out the coarse-grained connectivity ... - GitHub
PAGA - partition-based graph abstraction. Mapping out the coarse-grained connectivity structures of complex manifolds (Genome Biology, 2019).
[37]
A human embryonic limb cell atlas resolved in space and time - Nature
Dec 6, 2023 · In the second step of this process, PAGA (scanpy tl.paga) was performed to generate an abstracted graph of partitions. Finally, force ...
[38]
Cell Trajectory Inference Based On Single Cell Stacked Auto Encoders
Sep 15, 2025 · scSAE outperformed existing cell trajectory inference tools and scSAE can be a powerful tool for analyzing single-cell RNA trajectory inference.
[39]
PHLOWER leverages single-cell multimodal data to infer complex ...
Oct 23, 2025 · Computational trajectory analysis is a key computational task for inferring differentiation trees from this single-cell data.
[40]
An increment of diversity method for cell state trajectory inference of ...
Here, we compared IDTI with six other trajectory inference methods (i.e., Monocle 2, TSCAN [38], Slingshot [39], PAGA [40], Tempora and CStreet) on the ...
[41]
:: dynverse
dynverse is a collection of R packages aimed at supporting the trajectory inference (TI) community on multiple levels.
[42]
Unique challenges and best practices for single cell transcriptomic ...
This review examines these challenges while presenting best practices for critical single cell analysis tasks.
[43]
Single cell trajectory analysis using Hodge Decomposition - bioRxiv
We evaluate PHLOWER through benchmarking with multi-branching differentiation trees and using novel kidney organoid multi-modal and spatial ...
[44]
PyEvoCell: an LLM-augmented single-cell trajectory analysis ...
Apr 10, 2025 · We developed PyEvoCell, a dashboard for trajectory interpretation and analysis that is augmented by large language model (LLM) capabilities.
[45]
Gene-level alignment of single-cell trajectories | Nature Methods
Sep 19, 2024 · Here we describe Genes2Genes, a Bayesian information-theoretic dynamic programming framework for aligning single-cell trajectories.
[46]
Trajectory inference across multiple conditions with condiments
Jan 27, 2024 · In this manuscript, we present condiments, a method for the inference and downstream interpretation of cell trajectories across multiple conditions.
[47]
Inferring causal trajectories from spatial transcriptomics using CASCAT
Aug 19, 2025 · Abstract. Spatial trajectory inference models cell differentiation and state dynamics within tissues by integrating spatial information.
[48]
Global trends in machine learning applications for single-cell ...
Aug 16, 2025 · ML-scRNA-seq integration has advanced cellular heterogeneity analysis and precision medicine development. Future directions should optimize ...
[49]
dynverse: benchmarking, constructing and interpreting single-cell ...
The dynverse is an open set of packages to benchmark, construct and interpret single-cell trajectories. See https://dynverse.org for an overview.