
Intrinsic dimension

The intrinsic dimension (ID) of a dataset is defined as the minimum number of independent parameters or coordinates required to describe the data without significant information loss, typically representing the dimensionality of the underlying manifold on which the data points lie within a higher-dimensional ambient space. This concept arises in scenarios where high-dimensional data, such as images or sensor readings embedded in \mathbb{R}^N with N \gg 1, actually possess a lower effective dimensionality M < N due to geometric constraints or generative processes. The ID can vary locally across the dataset, influenced by factors like noise at small scales and manifold curvature or topology at larger scales, making it a scale-dependent measure.

Estimating the intrinsic dimension is crucial for addressing the curse of dimensionality in data analysis, where high ambient dimensions lead to sparse sampling and computational inefficiencies in algorithms. In machine learning, ID estimation informs dimensionality reduction techniques like principal component analysis (PCA) or manifold learning (e.g., Isomap), enabling more efficient clustering, classification, and pattern recognition by projecting data onto its effective subspace. Applications extend beyond computer science to fields such as biology for genomic data analysis, quantum physics for state space modeling, and image processing for feature extraction, where low ID reveals hidden structures in complex datasets. For instance, in deep neural networks, the intrinsic low dimensionality of covariates has been shown to drive generalization performance and approximation capabilities.

Historically, the concept traces back to early work by Bennett in 1969 on parameter counting, with formalization by Fukunaga in the 1970s as the number of variables capturing essential data properties. Modern estimation methods are categorized as global (e.g., fractal-based methods such as the Grassberger-Procaccia algorithm, or projection methods such as PCA), local (e.g., nearest-neighbor approaches such as Fukunaga-Olsen or the Levina-Bickel maximum likelihood estimator), and pointwise (e.g., estimating ID per data point for heterogeneous datasets). Recent advances include robust estimators like the Adaptive Binomial ID Estimator (ABIDE), which optimizes neighborhood sizes to mitigate noise, and graph-based methods using Wasserstein distances for accurate approximation even in small samples. As of 2025, further developments encompass comprehensive surveys of estimation techniques, specialized toolkits for single-cell transcriptomics like IDEAS, and applications to molecular properties and gene expression analysis. These techniques, implemented in tools like the Python package scikit-dimension, facilitate reliable ID assessment in practical machine learning workflows.

Basic Concepts

Definition

The intrinsic dimension of a dataset is defined as the smallest integer d such that the points lie entirely, without information loss, on a d-dimensional manifold embedded within a higher-dimensional ambient space \mathbb{R}^D, where D > d. This concept captures the minimal number of variables or coordinates required to represent the essential structure of the data, distinguishing it from redundant or noisy dimensions in the ambient representation. In contrast, the extrinsic or ambient dimension refers to the full dimensionality D of the embedding space in which the data or manifold resides, which may include irrelevant directions that do not contribute to the data's variability. For instance, a dataset X \subset \mathbb{R}^D has intrinsic dimension d if it can be parameterized by a d-dimensional manifold, allowing faithful reconstruction with far fewer parameters than D. The dimension of such a manifold M is formally denoted \dim(M), representing the local Euclidean dimensionality at every point on M. This notation highlights how intrinsic dimension reveals hidden low-dimensional geometry; for example, the Swiss roll dataset consists of points sampled from a 2-dimensional surface embedded in \mathbb{R}^3, where \dim(M) = 2, enabling the data's underlying planar structure to be uncovered despite the 3-dimensional appearance.
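The Swiss roll example can be made concrete in a few lines of code. The following minimal Python sketch (using NumPy; the parameter ranges are arbitrary illustrative choices rather than a canonical convention) generates the dataset from its two underlying parameters, making explicit that every point in \mathbb{R}^3 is determined by only two coordinates.

```python
import numpy as np

# Minimal sketch: the Swiss roll is generated from two parameters (t, h),
# so its intrinsic dimension is 2 even though each point lives in R^3.
rng = np.random.default_rng(0)
n = 2000
t = rng.uniform(1.5 * np.pi, 4.5 * np.pi, n)   # angle along the roll (arbitrary range)
h = rng.uniform(0.0, 10.0, n)                  # height across the roll

# Embedding map (t, h) -> R^3: ambient dimension D = 3, intrinsic dimension d = 2.
X = np.column_stack([t * np.cos(t), h, t * np.sin(t)])
print(X.shape)   # (2000, 3): three ambient coordinates, two generative parameters
```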

Motivations and Intuition

The curse of dimensionality arises in high-dimensional spaces, where data points become increasingly sparse as the number of dimensions grows, leading to exponentially larger volumes that require vastly more samples to adequately cover the space for reliable analysis or modeling. This sparsity complicates tasks like clustering and density estimation, as distances between points lose meaning and computational costs skyrocket, often rendering algorithms inefficient without sufficient data. However, many real-world datasets exhibit a low intrinsic dimension, meaning they effectively lie on a lower-dimensional manifold embedded within the high-dimensional ambient space, allowing for more efficient representations and mitigating these challenges.

An intuitive example is a two-dimensional surface, such as the skin of a sphere, embedded in three-dimensional space: while the ambient dimension is three, the intrinsic dimension is two, as local neighborhoods on the surface can be parameterized by just two coordinates, like latitude and longitude, without loss of essential structure. This highlights how the true complexity of the data may be much lower than its apparent dimensionality, enabling simpler models that capture the underlying geometry. Studying intrinsic dimension is crucial because recognizing this low-dimensional structure facilitates data compression by projecting onto fewer coordinates while preserving key information, reduces susceptibility to noise by focusing on the manifold rather than extraneous dimensions, and improves generalization in statistical models by avoiding the pitfalls of high-dimensional estimation. Conversely, ignoring intrinsic dimension often results in overfitting, where models memorize spurious patterns in the sparse high-dimensional space rather than the true low-dimensional trends, or in computational inefficiency, as algorithms expend resources exploring irrelevant directions.

Examples and Illustrations

Geometric Examples

A straight line embedded in \mathbb{R}^3 provides a fundamental example of an object with intrinsic dimension 1, meaning it requires only one coordinate to describe its points despite lying in a higher-dimensional ambient space. This line can be parametrized as \mathbf{x}(t) = (t, 0, 0) for t \in \mathbb{R}, where the single coordinate t suffices to capture the manifold's geometry without loss of information. The surface of a sphere in \mathbb{R}^3 illustrates an intrinsic dimension of 2, as it forms a 2-dimensional manifold where local neighborhoods resemble patches of the plane, and distances are preserved intrinsically along the surface rather than through the surrounding volume. For instance, the sphere S^2 can be covered by charts such as stereographic projections from the north and south poles, each mapping an open subset diffeomorphically onto \mathbb{R}^2. Similarly, a circle S^1 embedded in \mathbb{R}^2 has intrinsic dimension 1, parametrized by an angle \theta \mapsto (\cos \theta, \sin \theta) for \theta \in [0, 2\pi).

Fractal curves like the Koch curve demonstrate how intrinsic dimension can be non-integer and exceed the topological dimension of 1, capturing the curve's self-similar complexity and space-filling behavior. The boundary of the Koch snowflake, constructed by iteratively replacing line segments with equilateral triangular bumps, has a fractal (Hausdorff) dimension of \frac{\log 4}{\log 3} \approx 1.26, which serves as its intrinsic dimension and is independent of how the curve is embedded in the plane. These geometric examples highlight the intrinsic structure through the use of charts and atlases on manifolds, where an atlas consists of compatible charts—each a homeomorphism from an open subset of the manifold to an open subset of \mathbb{R}^k with k equal to the intrinsic dimension—allowing visualization of the local Euclidean-like geometry while ensuring smooth transitions globally.

Signal Processing Examples

In signal processing, a classic example of intrinsic dimension is provided by a pure sinusoidal signal s(t) = \sin(\omega t), which evolves as a one-dimensional time series. When this signal is embedded into a higher-dimensional space using time-delay coordinates, such as the vector \mathbf{x}(t) = (s(t), s(t + \tau), \dots, s(t + (d-1)\tau)) for embedding dimension d \geq 2 and suitable delay \tau (e.g., \tau = \pi / (2\omega)), the resulting trajectory traces a one-dimensional closed curve, such as an ellipse or circle, embedded in the higher-dimensional space. This demonstrates that the intrinsic dimension of the signal remains 1, capturing the periodic dynamics without requiring additional dimensions, even as the ambient dimension increases. Takens' embedding theorem underpins this reconstruction, stating that for a dynamical system generating a time series from an attractor of box-counting dimension m, a delay embedding with d > 2m and generic delay \tau yields a diffeomorphic copy of the original attractor, preserving its topological and dimensional properties. This theorem enables the analysis of a signal's intrinsic dimension directly from univariate time-delay embeddings, without knowledge of the underlying differential equations, and is foundational for estimating low-dimensional structures in seemingly high-dimensional observations like noisy recordings.

Consider a more complex deterministic signal formed as the sum of k sinusoids with incommensurate frequencies, s(t) = \sum_{i=1}^k a_i \sin(\omega_i t + \phi_i). In phase space via delay embedding, this quasiperiodic signal traces a trajectory confined to a k-dimensional torus, yielding an intrinsic dimension of k, as each frequency component contributes one independent phase coordinate. In contrast, superimposing broadband noise on such a signal, as in s(t) + \eta(t) where \eta(t) is Gaussian white noise, elevates the effective intrinsic dimension toward the full embedding dimension d, as the noise fills higher-dimensional volumes and disrupts the low-dimensional manifold structure. Low intrinsic dimension in signals is also illustrated in the frequency domain, where such structures, like sums of few sinusoids, concentrate their energy in sparse spectral lines, thereby spanning thin, low-dimensional subspaces within the full frequency basis rather than occupying the entire high-dimensional space uniformly.
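A minimal numerical sketch of this construction is shown below (Python with NumPy; the sampling rate, frequency, and quarter-period delay are illustrative choices). It builds the delay vectors for a pure sinusoid and uses singular values to show that the embedded trajectory spans only a two-dimensional plane of \mathbb{R}^3, within which it traces a one-dimensional closed curve.

```python
import numpy as np

# Minimal sketch: time-delay embedding of a pure sinusoid s(t) = sin(omega * t).
fs = 1000                      # sampling rate (Hz), illustrative choice
omega = 2 * np.pi * 5.0        # 5 Hz sinusoid
tau = 50                       # delay in samples; 50/fs is a quarter period, so omega*tau = pi/2
d = 3                          # embedding (ambient) dimension

t = np.arange(0, 4.0, 1.0 / fs)
s = np.sin(omega * t)

# Delay vectors x[n] = (s[n], s[n + tau], s[n + 2*tau]).
n_vec = len(s) - (d - 1) * tau
X = np.column_stack([s[i * tau : i * tau + n_vec] for i in range(d)])

# The closed curve lies in a 2-D plane of R^3 (only two non-negligible singular
# values), and within that plane it is parameterized by a single phase variable,
# i.e. its intrinsic dimension is 1 even though the ambient dimension is 3.
sing = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
print(np.round(sing / sing[0], 3))   # approximately [1.0, 0.71, 0.0]
```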

Formal Definitions

For Manifolds and Datasets

In the context of manifolds, the intrinsic dimension d of a manifold M is defined as the dimension of the tangent space T_p M at any point p \in M, a value that remains constant across all points due to the manifold's locally Euclidean structure. Equivalently, it is the maximal integer d such that M admits coordinate charts mapping open subsets diffeomorphically onto open subsets of \mathbb{R}^d. This dimension captures the minimal number of coordinates needed to describe the manifold locally, independent of any embedding into a higher-dimensional space. For empirical datasets sampled from an underlying low-dimensional manifold embedded in a high-dimensional ambient space, the intrinsic dimension is formalized asymptotically using properties of local neighborhoods. In particular, if the data lie on a d-dimensional manifold, the covering number N(r)—the minimal number of balls of radius r required to cover the manifold—scales as N(r) \propto r^{-d} for small r > 0. Thus, the intrinsic dimension can be characterized by the relation d \approx \frac{\log N(r)}{\log (1/r)}, which arises from the logarithmic scaling in the limit as r \to 0. This notion bridges differential geometry and data analysis, allowing assessment of effective dimensionality in sampled data without assuming a specific parameterization.

A related formalization for datasets involves the correlation dimension introduced by Grassberger and Procaccia, which estimates the intrinsic dimension through the scaling of the correlation integral C(r). Specifically, C(r) counts the average number of pairs of data points within distance r, and for data on a d-dimensional manifold, C(r) \sim r^d as r \to 0, yielding d as the slope of \log C(r) versus \log r in the asymptotic regime. This provides a fractal-inspired measure particularly useful for irregular or noisy samples from manifolds. The Whitney embedding theorem underscores the theoretical foundation for such embeddings, stating that any d-dimensional manifold admits a smooth embedding into \mathbb{R}^{2d+1} without self-intersections, ensuring that the intrinsic dimension governs the minimal ambient dimension required for faithful representation.
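As an elementary worked illustration of the covering-number characterization (added here for concreteness), consider the unit circle S^1 \subset \mathbb{R}^2: its circumference 2\pi can be covered by roughly N(r) \approx \pi / r balls of radius r, since each ball covers an arc of length about 2r, so \frac{\log N(r)}{\log(1/r)} = \frac{\log \pi + \log(1/r)}{\log(1/r)} \to 1 as r \to 0, recovering the intrinsic dimension d = 1. The same computation for a d-dimensional manifold of finite volume, where N(r) \approx C r^{-d}, yields the limit d, with the constant C absorbed by the logarithm.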

For Signals

In the context of signals, the intrinsic dimension of a signal s: \mathbb{R} \to \mathbb{R}^m refers to the dimension of the minimal dynamical system capable of generating it, capturing the underlying degrees of freedom that govern its temporal evolution. This perspective arises from embedding the signal into a state space where the dynamics can be reconstructed without loss of essential information, distinguishing it from the signal's ambient dimension, which grows with the length of the observation window when the signal is represented as a high-dimensional vector of samples. For instance, a deterministic signal like a sinusoid, which can be generated by a two-dimensional oscillator, maintains an intrinsic dimension of 2 regardless of how many samples are taken, highlighting the separation between the generative mechanism and the raw data dimensionality. Formally, in state space reconstruction techniques, the intrinsic dimension corresponds to the minimal dimension required to unfold the signal's attractor, often determined as the number of independent modes driving the evolution or, for chaotic signals, the Lyapunov dimension derived from the spectrum of Lyapunov exponents. The Lyapunov dimension, proposed via the Kaplan-Yorke conjecture, provides a measure for strange attractors: the exponents are summed in decreasing order, and the dimension equals the largest integer k for which the sum of the first k exponents is non-negative, plus a fractional part given by that sum divided by the magnitude of the next exponent. This approach quantifies the effective dimensionality of the attractor in the reconstructed state space, ensuring faithful representation of the signal's nonlinear dynamics.

For band-limited signals, a specific formulation emerges from sampling theory, where the intrinsic dimension is approximately 2BT + 1, with B denoting the signal's bandwidth and T its duration, reflecting the finite number of degrees of freedom needed to specify the signal without aliasing or redundancy. This estimate, rooted in the concentration of energy within prolate spheroidal wave functions, underscores how bandwidth constraints limit the signal's effective complexity far below the ambient dimension of densely sampled representations.
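The Kaplan-Yorke formula can be evaluated directly once a Lyapunov spectrum is available. The short Python sketch below computes the Lyapunov dimension from a given spectrum; the helper function is illustrative, and the Lorenz exponents used in the example are approximate values commonly quoted in the literature.

```python
import numpy as np

def kaplan_yorke_dimension(exponents):
    """Kaplan-Yorke (Lyapunov) dimension from a Lyapunov spectrum.

    D_KY = k + (lambda_1 + ... + lambda_k) / |lambda_{k+1}|, where k is the
    largest index for which the cumulative sum of the ordered exponents
    remains non-negative.
    """
    lam = np.sort(np.asarray(exponents, dtype=float))[::-1]   # descending order
    csum = np.cumsum(lam)
    nonneg = np.where(csum >= 0)[0]
    if len(nonneg) == 0:
        return 0.0                      # no expanding directions (e.g., a fixed point)
    k = nonneg[-1] + 1                  # number of exponents in the partial sum
    if k == len(lam):
        return float(k)                 # cumulative sum never turns negative
    return k + csum[k - 1] / abs(lam[k])

# Approximate Lyapunov exponents commonly reported for the Lorenz attractor.
print(round(kaplan_yorke_dimension([0.906, 0.0, -14.57]), 3))   # about 2.062
```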

Analysis Techniques

Fourier Transform Properties

Signals with low intrinsic dimension exhibit Fourier transforms that are concentrated on low-frequency components or a limited number of modes, a property that highlights the sparsity induced by the underlying low-dimensional structure. In the simple case of a one-dimensional signal that is k times continuously differentiable (e.g., a C^k function with compact support), the Fourier spectrum decays rapidly away from low frequencies. The coefficients \hat{c}(\omega) satisfy |\hat{c}(\omega)| = O(1/|\omega|^{k+1}) as |\omega| \to \infty, obtained via repeated integration by parts in the Fourier integral, ensuring that the majority of the signal's energy resides in the low-frequency portion of the spectrum. For signals defined on a d-dimensional manifold embedded in higher-dimensional space, the Fourier transform often exhibits energy concentration in low-frequency components, reflecting the underlying low-dimensional structure.
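A small numerical experiment makes this energy concentration visible (Python with NumPy; the specific test signal and the ten-bin cutoff are arbitrary illustrative choices). It compares a smooth periodic signal, whose Fourier magnitudes decay rapidly, with broadband white noise, whose energy is spread over the whole spectrum.

```python
import numpy as np

# Minimal sketch: spectral energy concentration.  A smooth periodic signal has
# rapidly decaying Fourier magnitudes, so a handful of low-frequency bins carry
# essentially all of its energy, while white noise spreads energy over all bins.
rng = np.random.default_rng(0)
n = 4096
t = np.linspace(0.0, 1.0, n, endpoint=False)

smooth = np.exp(-np.cos(2 * np.pi * t))    # infinitely differentiable, periodic
noise = rng.standard_normal(n)             # broadband white noise

def low_frequency_energy_fraction(x, n_bins=10):
    """Fraction of total energy contained in the n_bins lowest rFFT bins."""
    power = np.abs(np.fft.rfft(x)) ** 2
    return power[:n_bins].sum() / power.sum()

print("smooth:", round(low_frequency_energy_fraction(smooth), 4))   # close to 1.0
print("noise: ", round(low_frequency_energy_fraction(noise), 4))    # roughly 10/2049
```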

Dimensionality Reduction Connections

The intrinsic dimension of a dataset, denoted d, represents the minimal number of coordinates required to faithfully represent the data without significant information loss, justifying dimensionality reduction from the high ambient dimension D to d when d \ll D. This principle underpins many reduction techniques, as projecting data onto a space of dimension at least d preserves essential structure, while lower dimensions introduce distortion. In principal component analysis (PCA), the intrinsic dimension is estimated by examining the eigenvalues of the data's covariance matrix, where d corresponds to the number of significant eigenvalues that capture a substantial portion of the variance. PCA projects the data onto the subspace spanned by the eigenvectors associated with these eigenvalues, effectively reducing dimensionality while retaining the principal directions of variation.

Manifold learning methods, such as Isomap, exploit the intrinsic dimension by assuming the data lies on a low-dimensional manifold embedded in high-dimensional space and aim to recover this structure. Isomap constructs a neighborhood graph to approximate geodesic distances on the manifold and applies multidimensional scaling (MDS) to embed the data into a low-dimensional space that preserves these distances. The embedding minimizes the stress function, which measures the discrepancy between the geodesic distances \hat{d}_{ij} and the distances d_{ij} in the reduced space: \text{Stress} = \sqrt{ \frac{ \sum_{i < j} ( \hat{d}_{ij} - d_{ij} )^2 }{ \sum_{i < j} \hat{d}_{ij}^2 } }, allowing recovery of the intrinsic dimension as the embedding dimension that achieves low stress.

Autoencoders provide a neural network-based approach to dimensionality reduction, where the size of the bottleneck layer approximates the intrinsic dimension by forcing the model to learn a compact latent representation. During training, the encoder compresses the input to this low-dimensional code of size roughly d, and the decoder reconstructs it, with optimal performance indicating that the bottleneck captures the manifold's degrees of freedom without excessive loss. This setup ensures that reductions beyond d degrade reconstruction quality, validating the intrinsic dimension estimate.
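The eigenvalue-counting idea behind PCA-based estimation can be sketched in a few lines (Python with NumPy). The 95% explained-variance threshold, the helper function name, and the toy dataset below are illustrative conventions rather than part of the definition.

```python
import numpy as np

def pca_intrinsic_dimension(X, explained=0.95):
    """Count the covariance eigenvalues needed to reach the given fraction of
    explained variance (a common but heuristic threshold)."""
    Xc = X - X.mean(axis=0)
    sing = np.linalg.svd(Xc, compute_uv=False)       # singular values of centered data
    eigvals = sing ** 2 / (len(X) - 1)               # covariance eigenvalues
    ratio = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratio, explained) + 1)

# Toy data: a 2-D latent plane linearly embedded in R^20 with mild isotropic noise.
rng = np.random.default_rng(0)
latent = rng.standard_normal((1000, 2))                   # intrinsic coordinates
A = rng.standard_normal((2, 20))                          # random linear embedding
X = latent @ A + 0.01 * rng.standard_normal((1000, 20))   # ambient dimension D = 20
print(pca_intrinsic_dimension(X))                         # expected output: 2
```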

Estimation Methods

Global Estimation Approaches

Global estimation approaches aim to compute a single intrinsic dimension value that characterizes the overall complexity of an entire dataset or signal, assuming the data lies on a manifold of uniform dimensionality. These methods are particularly useful for datasets where the embedding dimension is high but the intrinsic structure is low-dimensional, such as in chaotic time series or point clouds from sensor data. Common techniques rely on scaling properties of measures like pair correlations or box coverings, providing a global summary under the assumption of a stationary, low-dimensional attractor.

The correlation dimension, a fractal-based estimator of intrinsic dimension, is computed via the Grassberger-Procaccia algorithm. This method calculates the correlation integral C(r), which counts the average number of pairs of points within a distance r in the dataset: C(r) = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \Theta(r - \| \mathbf{x}_i - \mathbf{x}_j \|), where N is the number of points, \mathbf{x}_i are the data points, and \Theta is the Heaviside step function. The intrinsic dimension d is then estimated as the slope of the log-log plot: d = \lim_{r \to 0} \frac{\log C(r)}{\log r}. In practice, the slope is fitted over a range of small r values where scaling holds, typically using least-squares regression on the linear portion of the plot. This approach has been widely applied to estimate dimensions in dynamical systems, such as the Lorenz attractor, where it yields values around 2.06.

The box-counting dimension offers another global measure by assessing how the number of boxes needed to cover the data scales with box size. For a point cloud in embedding space, the number of boxes N(\epsilon) of side length \epsilon containing at least one point satisfies N(\epsilon) \sim \epsilon^{-d}, so the dimension d is the negative slope: d = \lim_{\epsilon \to 0} \frac{\log N(\epsilon)}{-\log \epsilon}. The algorithm involves partitioning the space into a grid of varying \epsilon, counting occupied boxes at each scale, and fitting the slope on a log-log scale. This method is effective for self-similar structures and has been used in image analysis to estimate dimensions up to 2.5 for natural textures. Efficient implementations, such as Kégl's grid-based partitioning, reduce complexity for large datasets.

Maximum likelihood estimators provide a probabilistic framework for global intrinsic dimension, particularly for data assumed to be uniformly distributed on a manifold. A prominent example is the Levina-Bickel estimator, which models the distribution of nearest-neighbor distances. The local estimator for each point i is \hat{d}_i = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \log \frac{r_{i,k}}{r_{i,j}} \right]^{-1}, where r_{i,j} is the distance to the j-th nearest neighbor of point i, and the global estimate is the average \hat{d} = \frac{1}{m} \sum_{i=1}^m \hat{d}_i, with m reference points and typically k ranging from 10 to 20. This estimator maximizes the likelihood under the assumption of local uniformity and has demonstrated accuracy on manifolds like the Swiss roll, estimating d = 2 with low bias for N > 1000.

Recent developments include the Adaptive Binomial ID Estimator (ABIDE), which adaptively selects neighborhood sizes to handle noise and outliers, and graph-based methods using Wasserstein distances for robust approximation even in small samples (N < 1000). These advances, as of 2024, improve reliability in practical settings and are implemented in tools like the Python package scikit-dimension.
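The correlation-dimension procedure can be sketched compactly (Python with NumPy and SciPy). The fitting range used below, between the 5th and 25th percentiles of the pairwise distances, is a simple heuristic stand-in for the careful choice of scaling region that the Grassberger-Procaccia method requires in practice.

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, n_scales=20):
    """Slope of log C(r) versus log r over an intermediate range of scales
    (heuristic fitting range; a sketch of the Grassberger-Procaccia idea)."""
    dists = pdist(X)                                        # all pairwise distances
    r_vals = np.logspace(np.log10(np.percentile(dists, 5)),
                         np.log10(np.percentile(dists, 25)), n_scales)
    C = np.array([(dists < r).mean() for r in r_vals])      # fraction of pairs within r
    slope, _ = np.polyfit(np.log(r_vals), np.log(C), 1)
    return slope

# Points on a unit circle in R^2: the estimated dimension should be close to 1.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 2000)
X = np.column_stack([np.cos(theta), np.sin(theta)])
print(round(correlation_dimension(X), 2))                   # close to 1
```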
These global methods generally assume data stationarity, meaning the intrinsic dimension is constant across the dataset, and require sufficient samples (N \gg 10^d) to reliably capture scaling regimes. They perform well in low-noise settings but can overestimate or underestimate in high-dimensional noise, where embedding artifacts dominate small scales, limiting applicability to clean, manifold-like data. Local variants extend these ideas to datasets with spatially varying dimension and are discussed in the next subsection.

Local Intrinsic Dimensionality

Local intrinsic dimensionality captures the variation in the effective dimensionality of a dataset across different points, which is particularly relevant for data lying on non-uniform manifolds or exhibiting heterogeneous structure. Unlike global estimates that assume a constant dimension throughout, local methods compute the intrinsic dimension d(\mathbf{x}) at each point \mathbf{x} by examining small neighborhoods, where the dimension is defined as d(\mathbf{x}) = \lim_{r \to 0} \frac{\log N_{\mathbf{x}}(r)}{\log r}, with N_{\mathbf{x}}(r) denoting the expected number of data points within a geodesic ball of radius r centered at \mathbf{x} on the underlying manifold. In practice, for a k-nearest neighbors approach, this is approximated as d(\mathbf{x}) \approx \frac{\log k}{\log r_k(\mathbf{x})}, where r_k(\mathbf{x}) is the Euclidean distance to the k-th nearest neighbor, assuming small neighborhoods where local flatness holds.

The TWO-NN method computes a global intrinsic dimension estimate using ratios of the first and second nearest-neighbor distances from each point, making it robust to local density inhomogeneities. For each point \mathbf{x}_i, it computes the ratio \mu_i = \frac{r_2(\mathbf{x}_i)}{r_1(\mathbf{x}_i)}, where r_1(\mathbf{x}_i) and r_2(\mathbf{x}_i) are the distances to the first and second nearest neighbors, respectively. Under a local Poisson point process assumption, the cumulative distribution function of \mu is F(\mu) = 1 - \mu^{-d} for \mu \geq 1, leading to the estimator \hat{d} = -\frac{\log(1 - F(\mu))}{\log \mu}, fitted empirically by linear regression on sorted \mu_i values after discarding a fraction (e.g., 10%) of extreme ratios to reduce outlier bias. For datasets with varying dimensions, the method can be applied to local subsets around each point to obtain approximate pointwise estimates.

Another key approach is the maximum likelihood estimator (MLE) based on a local Poisson process model for points on manifolds. For a neighborhood around \mathbf{x}, the distances T_1(\mathbf{x}) < T_2(\mathbf{x}) < \cdots < T_k(\mathbf{x}) to the k nearest neighbors are modeled as arrivals of a Poisson process, and maximizing the likelihood over the intrinsic dimension m yields \hat{m}_k(\mathbf{x}) = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \log \frac{T_k(\mathbf{x})}{T_j(\mathbf{x})} \right]^{-1}. This estimator assumes uniform density locally and derives from the volume scaling in the intrinsic space, yielding consistent pointwise estimates as k increases.

In applications to manifolds with varying curvature, local intrinsic dimensionality estimation reveals how the dimensional structure adapts to local geometry, with geodesic distances preferred over Euclidean ones to avoid distortions introduced by the embedding's curvature. For small radii, Euclidean distances approximate geodesic ones, but in high-curvature regions the two notions of neighborhood agree only locally; graph-based approximations of geodesic distances ensure that N_{\mathbf{x}}(r) scales correctly with the intrinsic volume, enabling accurate detection of dimension variations (for example, on the Swiss roll, where curvature and folds can distort the apparent dimensionality).
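The TWO-NN recipe described above translates into a short script (Python with NumPy and SciPy). The 10% discard fraction and the through-the-origin least-squares fit follow the description in the text, while the function name and the toy dataset are illustrative choices.

```python
import numpy as np
from scipy.spatial import cKDTree

def two_nn_dimension(X, discard_fraction=0.1):
    """Sketch of the TWO-NN estimator based on ratios of the second to the
    first nearest-neighbour distance at each point."""
    tree = cKDTree(X)
    dists, _ = tree.query(X, k=3)           # k=3: the point itself plus two neighbours
    mu = np.sort(dists[:, 2] / dists[:, 1])
    n_keep = int(len(mu) * (1.0 - discard_fraction))
    mu = mu[:n_keep]                         # discard the largest ratios (outliers)
    F = np.arange(1, n_keep + 1) / len(X)    # empirical CDF at the retained ratios
    x, y = np.log(mu), -np.log(1.0 - F)
    return float(np.dot(x, y) / np.dot(x, x))    # least-squares slope through the origin

# Uniform sample from a 2-D square embedded in R^5: expected estimate near 2.
rng = np.random.default_rng(0)
latent = rng.uniform(size=(3000, 2))
X = np.concatenate([latent, np.zeros((3000, 3))], axis=1)
print(round(two_nn_dimension(X), 2))         # roughly 2
```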

Generalizations

To Noisy and High-Dimensional Data

In the presence of additive noise, such as Gaussian perturbations modeled as \tilde{\mathbf{x}}_i = \mathbf{x}_i + \sigma \boldsymbol{\eta}_i where \boldsymbol{\eta}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_D), the estimated intrinsic dimension d of a low-dimensional manifold embedded in high-dimensional space tends to be overestimated, with the inflation increasing as the noise variance \sigma^2 grows. This occurs because noise introduces extraneous variability orthogonal to the manifold, creating "shadow dimensions" that local estimators such as the maximum likelihood estimator interpret as additional structure. For instance, even at signal-to-noise ratios as low as 20 dB (corresponding to 1% noise variance), nearest-neighbor and projection-based estimators exhibit systematic overestimation on synthetic manifolds. Denoising strategies often involve projecting the noisy data onto an estimated subspace of dimension d, typically via principal component analysis (PCA), which removes noise-dominated components by retaining only the top eigenvalues above the noise floor.

In extremely high ambient dimensions D \gg d, the concentration of measure phenomenon further complicates intrinsic dimension estimation by causing pairwise distances and volumes to concentrate around their medians, effectively masking the underlying low-dimensional geometry and leading to inflated or unstable estimates. This effect, prominent in spaces where D > 1000, arises because most mass in high-dimensional distributions localizes near a thin shell, reducing the discriminability of local neighborhoods. To address this, random projections—based on the Johnson-Lindenstrauss lemma—can embed the data into a lower-dimensional space (e.g., of dimension O(d \log n)) while approximately preserving distances with high probability, enabling more reliable subsequent estimation. Such projections have been integrated into pipelines for high-dimensional datasets like hyperspectral images.

Robust estimation techniques are essential for handling outliers and heavy-tailed noise in these settings, with robust variants of PCA providing a foundation. Robust covariance estimators, such as those using the minimum covariance determinant or Winsorized statistics, downweight outlier influence when computing the sample covariance, yielding more stable eigenspectra for subspace identification. For example, applying robust PCA to contaminated data sets recovers the intrinsic subspace under sub-Gaussian noise assumptions. These methods extend to intrinsic dimension estimation by thresholding the eigenvalues of the robustly estimated covariance matrix, where the number of significant eigenvalues approximates d.

From random matrix theory, the effective observed dimension in noisy regimes can be expressed as d' \approx d + \delta(\sigma^2, D), where \delta captures noise-induced inflation, with bounds derived from the spiked covariance model: the top d eigenvalues exceed the Marchenko-Pastur bulk edge \lambda_+ = (1 + \sqrt{\gamma})^2 \sigma^2 (for aspect ratio \gamma = D/n), while noise eigenvalues concentrate below it. Theoretical guarantees ensure that under conditions like \sigma \sqrt{D} / r \ll 1 (where r is the local scale), the spectral gap \Delta_d = \lambda_d - \lambda_{d+1} identifies d with probability 1 - O(1/n), provided the sample size satisfies n \gtrsim d \log D. This framework, applied in multiscale analysis, separates signal components, whose singular values grow linearly with scale, from noise components, whose singular values remain roughly constant.
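The spiked-covariance reasoning can be illustrated numerically (Python with NumPy). The sketch below assumes the noise level \sigma is known and applies a small safety margin above the Marchenko-Pastur edge; both choices are simplifications of what practical estimators must handle.

```python
import numpy as np

# Minimal sketch: recover d by counting sample-covariance eigenvalues above the
# Marchenko-Pastur bulk edge lambda_+ = (1 + sqrt(gamma))^2 * sigma^2, gamma = D/n.
rng = np.random.default_rng(0)
n, D, d, sigma = 2000, 200, 3, 0.5

latent = rng.standard_normal((n, d))                  # intrinsic coordinates
A = rng.standard_normal((d, D))                       # random linear embedding
X = latent @ A + sigma * rng.standard_normal((n, D))  # noisy ambient data

Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(Xc.T @ Xc / (n - 1))[::-1]   # descending eigenvalues

gamma = D / n
mp_edge = (1.0 + np.sqrt(gamma)) ** 2 * sigma ** 2
# The factor 1.05 is a heuristic guard against finite-sample edge fluctuations.
d_hat = int(np.sum(eigvals > 1.05 * mp_edge))
print(d_hat)   # expected output: 3
```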

Multiscale and Hierarchical Dimensions

In multiscale analysis of data manifolds, the intrinsic dimension often varies with the scale of observation, increasing at finer scales due to the emergence of more detailed structures. For example, in fractal-like datasets, coarser scales may suggest a lower effective dimension, while finer scales uncover self-similar complexities that elevate the estimated dimension, as seen in samples from self-affine sets. Persistent homology addresses this by providing a scale-invariant estimate of intrinsic dimension through the persistent homology dimension (PHD), which quantifies the growth rate of topological features—such as connected components or holes—across thickening scales, yielding a value equivalent to the upper box-counting dimension for bounded sets. Multiscale estimation methods, such as multiscale singular value decomposition (multiscale SVD), further enable dimension assessment by analyzing local structures at varying radii around data points, distinguishing signal from noise and curvature effects to identify the manifold's intrinsic dimensionality without assuming uniformity.

In hierarchical contexts, intrinsic dimension can differ across structural levels, as in tree-like data organizations where higher levels exhibit lower dimensions due to aggregation, while lower levels reveal higher complexities. This is evident in genomic datasets, where clustering of expression profiles estimates varying intrinsic dimensions that correspond to biological hierarchies, such as cell types or pathways, using distance-based agglomerative methods.

Diffusion maps provide a key method for multiscale analysis, constructing a Markov chain on the data to embed points into a low-dimensional space that preserves intrinsic geometry across scales parameterized by diffusion time t. The coordinates are derived from the eigendecomposition of the transition matrix: \Psi_t(x_i) = \left( \lambda_1^t \phi_1(x_i), \lambda_2^t \phi_2(x_i), \dots, \lambda_m^t \phi_m(x_i) \right), where \{\lambda_j\}_{j=1}^m are the eigenvalues of the transition matrix (ordered 1 = \lambda_1 > |\lambda_2| \geq \cdots \geq |\lambda_m|) and \{\phi_j\} are the corresponding right eigenvectors. The intrinsic dimension corresponds to the number of leading eigenvalues that decay slowly (remaining close to 1 even for moderate t), while subsequent eigenvalues decay rapidly, reflecting off-manifold directions or noise. These approaches generalize from uniform manifolds—where the dimension is constant—to piecewise manifolds, which comprise unions of patches with potentially varying local dimensions, enabling robust estimation in datasets with abrupt structural changes by partitioning and analyzing subregions separately.
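The scale dependence of local dimension estimates can be demonstrated with a simple multiscale local-PCA experiment in the spirit of multiscale SVD (Python with NumPy and SciPy). The helix dataset, the 90% explained-variance rule, and the particular neighborhood sizes below are illustrative choices, not part of any standard algorithm.

```python
import numpy as np
from scipy.spatial import cKDTree

# A noisy one-dimensional helix in R^3: the apparent local dimension depends
# on the scale (neighbourhood size) at which it is examined.
rng = np.random.default_rng(0)
n, sigma = 5000, 0.02
t = rng.uniform(0.0, 4.0 * np.pi, n)
helix = np.column_stack([np.cos(t), np.sin(t), t / (4.0 * np.pi)])
X = helix + sigma * rng.standard_normal((n, 3))
tree = cKDTree(X)

def mean_local_dimension(k, n_centers=100, explained=0.9):
    """Average, over random centres, of the number of local PCA eigenvalues
    needed to explain 90% of the variance in each k-nearest-neighbour patch."""
    dims = []
    for c in rng.choice(n, n_centers, replace=False):
        _, idx = tree.query(X[c], k=k)
        nbrs = X[idx] - X[idx].mean(axis=0)
        eig = np.linalg.svd(nbrs, compute_uv=False) ** 2
        ratio = np.cumsum(eig) / eig.sum()
        dims.append(np.searchsorted(ratio, explained) + 1)
    return float(np.mean(dims))

for k in (10, 200, 2500):
    print(f"k = {k:4d}: mean local dimension {mean_local_dimension(k):.2f}")
# Typical pattern: the estimate is inflated toward the ambient dimension 3 at the
# noise scale (k = 10), drops toward 1 at intermediate scales where the curve
# dominates, and rises again once neighbourhoods are large enough to feel the
# helix's curvature.
```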

History

Origins and Key Developments

The concept of intrinsic dimension originated in differential geometry, where it refers to the minimal number of coordinates needed to describe a manifold locally without reference to an ambient space. In 1936, Hassler Whitney proved his embedding theorem, demonstrating that any smooth n-dimensional manifold can be embedded into \mathbb{R}^{2n+1}, providing a foundational result for understanding the intrinsic dimensionality of geometric objects independent of their ambient space. This theorem established that the intrinsic dimension of a manifold is well-defined and finite, laying the groundwork for later extensions to more complex structures like attractors.

In the late 1960s and 1970s, statistical methods for estimating intrinsic dimension in data emerged, bridging geometric concepts to practical data analysis. Bennett proposed an early algorithm for estimating the intrinsic dimensionality of signal collections in 1969. Fukunaga formalized the concept in the 1970s as the minimum number of parameters needed to describe data and, with Olsen, introduced the Fukunaga-Olsen algorithm in 1971, which estimates dimension from the eigenvalues of local covariance matrices computed within Voronoi regions of the data.

During the 1970s, as the study of nonlinear dynamical systems developed, researchers began exploring intrinsic dimensions in chaotic systems, particularly for reconstructing attractors from time series data. In 1980, Norman Packard and colleagues proposed a method to infer the geometry of an attractor using time-delay embeddings, allowing estimation of its dimension from a single observed variable in dissipative systems. This idea was formalized mathematically in 1981 by Floris Takens, who proved that under generic conditions, a delay-coordinate map embeds a smooth attractor into a higher-dimensional space while preserving its topological properties, thus enabling reliable dimension estimation for chaotic dynamics in signals.

The 1980s saw further advancements in quantifying intrinsic dimensions for fractal and strange attractors, particularly in nonlinear dynamics and time series analysis. In 1983, Peter Grassberger and Itamar Procaccia introduced the correlation dimension, a fractal measure computed from the scaling of pairwise distances in a reconstructed phase space, which provides a practical estimator for the dimension of attractors in experimental data such as turbulent flows. This method became a cornerstone for analyzing non-integer dimensions in chaotic systems, bridging theoretical geometry with empirical applications in physics.

Key milestones in the late 1990s and early 2000s extended intrinsic dimension concepts to machine learning, emphasizing local variations in data manifolds. Around this period, local intrinsic dimensionality gained prominence as a tool for understanding high-dimensional datasets, with Elizaveta Levina and Peter Bickel introducing a maximum likelihood estimator in 2005 that assesses the dimension in local neighborhoods using nearest-neighbor distances, facilitating applications in clustering and classification.

Modern Advances

In the 2010s, refinements to the maximum likelihood estimator (MLE) proposed by Levina and Bickel in 2005 enhanced its robustness for local intrinsic dimension estimation, particularly in handling noise and non-uniform sampling. For instance, a regularized MLE variant incorporated penalties to mitigate bias in high-density regions, improving accuracy on synthetic manifolds such as the S-curve and Swiss roll. Further, the geometry-aware MLE (GeoMLE) from 2019 extended the approach by explicitly accounting for the geometric properties of the unknown support through a polynomial correction to the standard estimator, outperforming the original on standard benchmarks and reducing the reported estimation error to 9.7% compared with 26.9% for Levina-Bickel. These improvements facilitated broader adoption in preprocessing pipelines for machine learning tasks.

The deep learning era introduced neural network-based estimators that leverage autoencoder architectures to detect intrinsic dimensions through bottleneck optimization. In such methods, the latent space dimensionality is tuned to minimize reconstruction error while maximizing information retention, effectively estimating the intrinsic dimension as the point where further compression yields diminishing returns. A 2019 work demonstrated this on the MNIST dataset, estimating intrinsic dimensions of 19-22 via autoencoder bottlenecks, lower than PCA's 30-35 estimates, surpassing traditional PCA-based approaches on nonlinear manifolds. Subsequent variants, like additive autoencoders introduced in 2023, decomposed the estimation into linear and nonlinear components for better scalability on nonlinear manifolds.

In the 2020s, scalable methods addressed computational challenges by approximating high-dimensional computations efficiently. Randomized singular value decomposition (randomized SVD) techniques, refined for intrinsic dimension estimation, enable processing of datasets with millions of points by projecting onto low-rank subspaces, achieving near-linear time complexity with approximation errors below 10% on large-scale simulations. These have been integrated with topological data analysis (TDA), where persistent homology computes multi-scale features to refine dimension estimates; a 2021 study showed this reduces variance in noisy embeddings by capturing holes and cycles, improving accuracy on manifolds by up to 25%.

Theoretical advances have provided non-asymptotic bounds on estimation errors, moving beyond large-sample assumptions to finite-sample regimes. Using Wasserstein distances, a 2022 analysis derived finite-sample guarantees, providing multiplicative bounds on the dimension estimate with high probability. Such bounds have informed hybrid methods, ensuring robustness in non-stationary settings. Recent developments include optimal transport-based estimators for robust estimation in the presence of noise (2024) and local intrinsic dimension calculations for molecular properties (2025).

Applications

In Machine Learning and Data Analysis

In machine learning, estimates of the intrinsic dimension (ID) play a crucial role in assessing and controlling model complexity, particularly for algorithms sensitive to the curse of dimensionality. For instance, in k-nearest neighbors (k-NN) classifiers, the local intrinsic dimension (LID) helps guide the selection of the hyperparameter k by quantifying the effective dimensionality around each query point, enabling more adaptive neighborhood sizes that improve graph accuracy and performance without overparameterization. This approach refines k-NN graph construction by sparsifying features based on their contribution to LID, reducing computational overhead while preserving discriminative power.

Intrinsic dimension estimation also informs clustering and visualization techniques that assume data lies on low-dimensional manifolds. Methods like t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) rely on low ID to produce meaningful low-dimensional projections, as high ID can distort local structures and violate their local linearity assumptions, leading to less interpretable visualizations. For example, t-SNE's effectiveness diminishes with intrinsic dimensions exceeding 10–100, as seen in datasets like handwritten digits or face images, where preprocessing to estimate and account for ID enhances projection quality. Similarly, UMAP preserves manifold topology under the assumption of low ID, outperforming t-SNE in runtime while maintaining global structure for exploratory data analysis.

In outlier detection, variations in local intrinsic dimension provide a robust signal for identifying anomalies, especially in high-dimensional data where traditional distance-based methods fail due to sparsity. High LID values at a point indicate deviation from the low-dimensional manifold assumed for normal data, signaling potential anomalies; conversely, low LID aligns with inlier regions. The dimensionality-aware outlier detection (DAO) method, for instance, uses LID to compute asymptotic density ratios between query points and neighbors, achieving superior performance over the local outlier factor (LOF) and k-NN detectors across hundreds of synthetic and real datasets by adapting to local ID variations.

For efficiency in large-scale scenarios, intrinsic dimension estimates guide subsampling strategies to reduce computational costs without significant loss of representational power. Low ID implies that fewer samples suffice to capture the underlying manifold, lowering the sample complexity of learning tasks; empirical studies confirm that datasets with lower ID exhibit reduced sample requirements for accurate manifold learning and classification. This enables targeted subsampling in high-volume pipelines, such as selecting subsets proportional to the estimated ID to maintain fidelity in downstream models like neural networks.

In Physics and Signal Processing

In physics, the intrinsic dimension plays a crucial role in reconstructing attractors from time series data in dynamical systems, enabling the analysis of chaotic behavior without full knowledge of the underlying equations. According to Takens' embedding theorem, a time-delay embedding of dimension at least twice the intrinsic dimension plus one suffices to reconstruct the attractor topologically, allowing estimation of its properties from scalar observations. For instance, in the Lorenz system—a canonical model of atmospheric convection—the correlation dimension, which approximates the intrinsic dimension of the attractor, is estimated at approximately 2.05, indicating a low-dimensional structure despite the three-dimensional phase space. This estimation facilitates quantifying chaos through metrics like Lyapunov exponents and supports applications in weather prediction and climate modeling.

In signal processing, intrinsic dimension estimation informs bandwidth requirements and compression by identifying the minimal degrees of freedom in high-dimensional signals, often leading to sparse representations. Low intrinsic dimension implies that signals lie on a lower-dimensional manifold, enabling efficient sampling strategies that reduce acquisition costs while preserving information. For example, in compressive sensing frameworks, operating at the intrinsic dimension throughout the processing pipeline exploits this structure for robust reconstruction from incomplete measurements. A practical application is ID-guided sparse sampling, where signals with low intrinsic dimension exhibit sparsity in the frequency domain, allowing recovery with fewer samples than traditional Nyquist rates—typically O(k \log n) measurements for k-sparse signals of length n. This approach has been applied to audio and imaging, achieving compression ratios that scale with the estimated dimension rather than the ambient dimensionality.

In quantum physics, the effective intrinsic dimension characterizes the accessible portion of the exponentially large Hilbert space of many-body problems, aiding in the simulation and understanding of entangled states. For systems like spin chains or Rydberg atom arrays, the full Hilbert space dimension grows as 2^N for N particles, but quantum many-body scars—non-ergodic states—manifest as low-dimensional manifolds with intrinsic dimensions far below this scale, enabling unsupervised detection via dimensionality reduction techniques. Estimating this effective dimension helps approximate ground states and dynamics, reducing computational cost in tensor network methods or variational quantum eigensolvers.

An illustrative example from neuroscience is the analysis of electroencephalography (EEG) signals, where low intrinsic dimension correlates with focused brain states, such as sustained attention or reduced cognitive load. In resting-state EEG, the intrinsic dimension of neural activity trajectories drops during tasks requiring concentration, reflecting a collapse to a low-dimensional manifold that suppresses noise and enhances signal coherence—often estimated at 4-6 dimensions in focused conditions versus higher values in diffuse states. This metric distinguishes cognitive states, with applications in brain-computer interfaces and clinical diagnosis, as lower dimensions indicate streamlined neural dynamics.

    Apr 10, 2025 · Our study introduces a novel mechanism for the collapse of high dimension brain dynamics onto lower dimensional manifolds.