Kolmogorov–Smirnov test
The Kolmogorov–Smirnov test (KS test) is a nonparametric statistical test used to determine whether a sample of data follows a specified continuous probability distribution (one-sample KS test) or whether two independent samples are drawn from the same continuous distribution (two-sample KS test).[1] The test evaluates the maximum vertical distance between the empirical cumulative distribution function (ECDF) of the sample(s) and the theoretical cumulative distribution function (CDF), providing a measure of discrepancy that is sensitive to differences anywhere in the distribution.[1]
Named after the Russian mathematicians Andrey Nikolaevich Kolmogorov and Nikolai Vasilyevich Smirnov, the one-sample version was introduced by Kolmogorov in 1933 as a method to assess the goodness-of-fit of empirical distributions to theoretical ones.[2] Smirnov extended the framework in 1939 to the two-sample case, allowing comparison of distributions from two populations without assuming a specific parametric form. The test statistic, denoted D, is defined for the one-sample case as
D = \sup_x |F_n(x) - F(x)|,
where F_n is the ECDF and F is the reference CDF, and for the two-sample case as
D = \sup_x |F_{1,n}(x) - F_{2,m}(x)|,
with F_{1,n} and F_{2,m} as the ECDFs of the two samples.[1] Under the null hypothesis of no difference, the distribution of D is independent of the underlying continuous distribution, enabling exact p-values via tabulated critical values or asymptotic approximations for large samples.[1]
Key advantages of the KS test include its distribution-free nature under the null hypothesis, its applicability to any fully specified continuous distribution (provided no parameters are estimated from the data), and its ability to detect deviations in location, scale, or shape across the entire support of the distribution.[1] However, it assumes continuous distributions and can be less powerful against specific alternatives than parametric tests; it is also sensitive to ties in discrete data and requires the reference distribution to be fully specified.[1] Critical values and significance levels have been extensively tabulated, with asymptotic results facilitating computation for sample sizes n > 40.[1]
In practice, the KS test is widely applied in goodness-of-fit testing, such as verifying normality or uniformity in datasets.[1] For instance, it is commonly used to assess whether observed data conform to an expected model before applying parametric methods. It has applications in fields like quality control,[3] hydrology,[4] and biological sciences.[5] Extensions and modifications, including weighted versions and multivariate analogs, have been developed to address limitations in higher dimensions or specific scenarios.
Introduction
Purpose and variants
The Kolmogorov–Smirnov test serves as a nonparametric statistical hypothesis test primarily used to evaluate goodness-of-fit in the one-sample case, where it determines whether a sample's empirical cumulative distribution function aligns with a specified theoretical distribution, and in the two-sample case, where it assesses whether two independent samples arise from the same underlying distribution.[1] This test quantifies discrepancies between cumulative distribution functions without requiring parametric assumptions about the form of the distributions, making it suitable for broad applications in distributional inference.[6]
As a nonparametric procedure, the test imposes minimal assumptions on the data-generating process, relying only on the continuity of the distributions for the validity of its asymptotic distribution under the null hypothesis; it does not presuppose membership in any specific parametric family, such as normality or exponentiality.[7][8] This distribution-free property under the null enables its use across diverse datasets where parametric forms are unknown or suspect.[9]
Key variants include the standard one-sample test, which compares a single empirical distribution to a fully specified reference, and the two-sample test, which compares two empirical distributions directly; a notable adaptation is the Lilliefors test, which modifies the one-sample version to account for estimated parameters, such as mean and variance in normality testing.[10] Under the null hypothesis, the distributions are identical, while the alternative posits a difference in their cumulative distribution functions.[6] Developed in the early 20th century, the test emerged as a foundational tool for distribution-free statistical inference, with the one-sample form introduced by Kolmogorov in 1933 and the two-sample extension by Smirnov in 1939.[6][2]
Historical development
The Kolmogorov–Smirnov test emerged in the early 20th century as a nonparametric method for assessing the fit between empirical and theoretical distributions. In 1933, Andrey Kolmogorov introduced the one-sample variant in his paper "Sulla determinazione empirica di una legge di distribuzione," where he defined the test for continuous distributions and established its asymptotic distribution, laying the groundwork for modern goodness-of-fit testing. This contribution was contemporaneous with and influenced by the Glivenko-Cantelli theorem, proved by Valery Glivenko in the same year, which demonstrates the almost sure uniform convergence of the empirical distribution function to the true cumulative distribution function as sample size increases.
Nikolai V. Smirnov built upon Kolmogorov's foundation with key extensions in the late 1930s and 1940s. In 1939, Smirnov developed the two-sample test in his work "On the Estimation of the Discrepancy Between Empirical Curves of Distribution for Two Independent Samples," enabling comparisons between distributions from two independent samples without assuming a specific form.[11] He further advanced the framework in 1948 by publishing tables for estimating the goodness of fit of empirical distributions, which provided practical computational aids for applying the test in finite samples. These developments formalized the test's variants, shifting it from abstract probability theory toward applied statistics.
Post-World War II, the test gained broader acceptance as statistical computing advanced, with its incorporation into software packages by the 1960s enhancing accessibility for researchers.[1] Refinements in the 1950s, notably by F. J. Massey, included detailed tables of critical values for the goodness-of-fit statistic, improving the test's usability for small to moderate sample sizes.[12] By the 1970s, the Kolmogorov–Smirnov test had transitioned into a standard practical tool, widely applied in quality control for verifying process distributions and in astronomy for comparing observational data sets to theoretical models.[1][13]
One-sample test
Statistic definition
The one-sample Kolmogorov–Smirnov test assesses whether a random sample is drawn from a specified continuous distribution by comparing its empirical cumulative distribution function (ECDF) to the theoretical cumulative distribution function (CDF).[1] Let X_1, \dots, X_n be a random sample from a distribution with hypothesized CDF F. The ECDF for the sample is defined as
F_n(x) = \frac{1}{n} \sum_{i=1}^n \mathbf{1}\{X_i \leq x\},
where \mathbf{1}\{\cdot\} is the indicator function.[1]
The test statistic D_n is the supremum of the absolute difference between the ECDF and the hypothesized CDF,
D_n = \sup_x |F_n(x) - F(x)|,
which quantifies the maximum vertical distance between the ECDF step function and the theoretical CDF.[1] This statistic, introduced by Kolmogorov in 1933, measures overall discrepancies in the distribution without assuming a parametric form beyond the specified F. The null hypothesis is that the sample arises from the distribution with CDF F, with the alternative that it does not.[2]
To compute D_n, sort the observations to obtain the order statistics X_{(1)} \leq \dots \leq X_{(n)}, then evaluate the differences at the points just before and after each X_{(i)}, as the supremum occurs at these jumps in the ECDF. Specifically, compute
D_n^+ = \max_{1 \leq i \leq n} \left( \frac{i}{n} - F(X_{(i)}) \right), \quad D_n^- = \max_{1 \leq i \leq n} \left( F(X_{(i)}) - \frac{i-1}{n} \right),
and D_n = \max(D_n^+, D_n^-).[1]
For illustration, consider a sample of 1000 observations generated from a standard normal distribution and tested against the normal CDF (α = 0.05, critical value ≈ 0.043). The computed D_n \approx 0.024 does not exceed the critical value, failing to reject the null hypothesis of normality. In contrast, the same sample size from a lognormal distribution yields D_n \approx 0.535, leading to rejection.[1]
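The computation above can be sketched in a few lines of NumPy/SciPy (the helper ks_statistic_one_sample, the seed, and the generated sample are illustrative, not part of any standard API):

import numpy as np
from scipy.stats import norm, kstest

def ks_statistic_one_sample(sample, cdf):
    """Compute D_n = sup_x |F_n(x) - F(x)| via the order statistics."""
    x = np.sort(sample)
    n = x.size
    F = cdf(x)                                      # theoretical CDF at the order statistics
    d_plus = np.max(np.arange(1, n + 1) / n - F)    # D_n^+
    d_minus = np.max(F - np.arange(n) / n)          # D_n^-
    return max(d_plus, d_minus)

rng = np.random.default_rng(0)
sample = rng.standard_normal(1000)
D = ks_statistic_one_sample(sample, norm.cdf)
# Cross-check against SciPy's built-in one-sample test
D_scipy, p_value = kstest(sample, "norm")
print(D, D_scipy, p_value)

For a standard normal sample of this size, the computed D typically falls well below the 5% critical value of about 0.043.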
Asymptotic distribution
Under the null hypothesis that a sample of size n is drawn from the continuous distribution function F, the one-sample Kolmogorov–Smirnov statistic D_n satisfies
\sqrt{n} \, D_n \xrightarrow{d} \sup_{t \in [0,1]} |B(t)|,
where B is a standard Brownian bridge on [0,1].[1] This convergence holds as n \to \infty.[14]
The cumulative distribution function of this limiting random variable is given by the Kolmogorov distribution:
K(\lambda) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 \lambda^2}, \quad \lambda \geq 0.
This distribution is independent of the underlying CDF F.[1]
For inference at significance level \alpha = 0.05, the asymptotic critical value of \sqrt{n} \, D_n is approximately 1.36; thus, the critical value for D_n itself is about 1.36 / \sqrt{n}.[15] Critical values for finite samples can be obtained from tables.[16]
In finite samples, particularly when n < 40, the asymptotic approximation may be inaccurate, and exact p-values are computed using the null distribution via enumeration or Monte Carlo simulations with at least 10,000 replicates.[17]
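For a rough illustration of the asymptotic calculation, assuming SciPy's kstwobign (which implements the limiting distribution of \sqrt{n} D_n), the critical value and p-value can be obtained as follows; the sample size and observed statistic are illustrative:

import numpy as np
from scipy.stats import kstwobign

n = 1000
alpha = 0.05
c = kstwobign.ppf(1 - alpha)          # (1 - alpha) quantile of sup|B(t)|, approximately 1.36
critical_value = c / np.sqrt(n)       # approximately 0.043 for n = 1000
D_obs = 0.024                         # observed statistic (illustrative)
p_value = kstwobign.sf(np.sqrt(n) * D_obs)   # asymptotic p-value
print(critical_value, p_value)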
The proof relies on Donsker's invariance principle, where the empirical process \sqrt{n} (F_n - F) converges in distribution to a Brownian bridge B, such that the supremum norm converges to that of |B|.[14]
Parameter estimation effects
When the parameters of the hypothesized distribution, such as the mean μ and standard deviation σ for a normal distribution, are estimated from the sample data, the null distribution of the Kolmogorov–Smirnov statistic D_n shifts. Specifically, the test statistic D_n^* = \sup_x |F_n(x) - F(x; \hat{\theta})|, where \hat{\theta} denotes the estimated parameters, tends to be smaller under the null hypothesis because the fitted distribution is adjusted to better match the empirical distribution function F_n. If the standard critical values assuming known parameters are used, the test becomes conservative, resulting in Type I error rates lower than the nominal level.[1]
The Lilliefors test addresses this issue for testing normality with estimated mean and variance. It modifies the one-sample KS test by employing critical values obtained from Monte Carlo simulations of the null distribution under parameter estimation, rather than relying on the standard Kolmogorov tables. These simulated critical values are smaller than the standard ones to account for the bias introduced by estimation, ensuring the test maintains the desired significance level; for instance, for a sample size of 20 and α = 0.05, the Lilliefors critical value is approximately 0.190, compared to 0.294 for known parameters. Lilliefors (1967) provided tables for critical values up to sample sizes of 300 at common significance levels.
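The simulation behind such tables can be sketched as follows (assuming NumPy/SciPy; the helper name, seed, and replication count are illustrative), re-estimating the mean and standard deviation within every simulated sample:

import numpy as np
from scipy.stats import norm

def lilliefors_statistic(sample):
    """KS statistic against a normal CDF with mean and sd estimated from the sample."""
    x = np.sort(sample)
    n = x.size
    F = norm.cdf(x, loc=x.mean(), scale=x.std(ddof=1))
    d_plus = np.max(np.arange(1, n + 1) / n - F)
    d_minus = np.max(F - np.arange(n) / n)
    return max(d_plus, d_minus)

rng = np.random.default_rng(1)
n, reps, alpha = 20, 20_000, 0.05
null_stats = np.array([lilliefors_statistic(rng.standard_normal(n)) for _ in range(reps)])
critical_value = np.quantile(null_stats, 1 - alpha)   # roughly 0.19 for n = 20
print(critical_value)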
For more general cases, particularly location-scale families, the test statistic is computed using the estimated parameters, but appropriate critical values must be derived via Monte Carlo simulations tailored to the specific distribution and number of estimated parameters. This approach simulates samples from the fitted distribution and computes the empirical distribution of D_n^* to obtain p-values or thresholds that correct for the estimation effect.[1]
The Pelz-Good algorithm offers an efficient computational method for approximating the cumulative distribution function of the standard KS statistic (with known parameters), which can be integrated into parametric bootstrap procedures for p-value estimation when parameters are unknown. By generating simulations from the estimated distribution and applying the algorithm to compute tail probabilities, it accounts for the reduced degrees of freedom due to parameter estimation, enabling accurate inference without exhaustive full simulations. Pelz and Good (1976) developed this series expansion-based approximation for the lower tail areas, improving efficiency for moderate to large sample sizes.
As a practical guideline, the asymptotic Kolmogorov distribution provides a reasonable approximation for the test with estimated parameters when the sample size n > 50, but for smaller n, simulated or tabulated critical values are essential to avoid excessive conservatism. For example, in testing an exponential distribution with estimated rate parameter \hat{\lambda} = 1 / \bar{X}, the critical value for D_n at α = 0.05 and n = 20 is approximately 0.22–0.25, a decrease of about 10–20% compared to the value of 0.294 for known λ, reflecting the need for adjustment to achieve nominal Type I error control.[1]
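A parametric-bootstrap sketch for this exponential example (assuming NumPy/SciPy; the data, seed, and replication count are illustrative) re-estimates the rate within each simulated sample:

import numpy as np
from scipy.stats import expon

def ks_stat_exponential(sample):
    """KS statistic against an exponential CDF with rate 1/mean estimated from the sample."""
    x = np.sort(sample)
    n = x.size
    F = expon.cdf(x, scale=x.mean())   # scale = 1 / lambda-hat = sample mean
    return max(np.max(np.arange(1, n + 1) / n - F), np.max(F - np.arange(n) / n))

rng = np.random.default_rng(2)
data = rng.exponential(scale=3.0, size=20)
D_obs = ks_stat_exponential(data)
reps = 10_000
# Simulate the null distribution, re-fitting the rate in every bootstrap sample
null_stats = np.array([ks_stat_exponential(rng.exponential(scale=data.mean(), size=data.size))
                       for _ in range(reps)])
p_value = np.mean(null_stats >= D_obs)
print(D_obs, p_value)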
Discrete case adjustments
The asymptotic theory underlying the Kolmogorov–Smirnov (KS) test assumes a continuous null cumulative distribution function (CDF) F, under which the probability integral transform F(X) is uniform on [0,1], enabling the use of Kolmogorov's limiting distribution for p-values. For discrete null distributions, however, ties in the data cause F_n(x) - F(x) to exhibit jumps and plateaus, preventing the supremum from achieving the full range of the continuous case and yielding a null distribution for the test statistic D_n = \sup_x |F_n(x) - F(x)| that is stochastically smaller than its continuous counterpart.[7] Using critical values from the continuous tables therefore produces conservative p-values and reduced power to detect deviations, with actual type I error rates below the nominal level (e.g., observed rejection rates of about 2-3% at the 5% level for binomial data).
To address these issues, modifications to the statistic or its evaluation have been proposed. One approach uses a modified statistic D_n^* = \sup_x |F_n(x^-) - F(x)|, where x^- denotes the left limit, focusing on deviations just before jumps in the discrete CDF to better approximate continuous behavior. Alternatively, a continuity correction can be applied by adjusting the empirical ranks, comparing \frac{i - 0.5}{n} with F(X_{(i)}) for the ordered observations X_{(i)}, which shifts the ECDF steps to the midpoints of their jumps and reduces conservatism, though it may slightly alter power properties. These adjustments maintain the nonparametric nature of the test while improving its applicability to discrete data.
For exact inference in the discrete case, the null hypothesis posits that the sample follows a multinomial distribution with cell probabilities given by the jumps in F. An exact p-value is then P(D_n \geq d_{\mathrm{obs}} \mid H_0), computed by enumerating or recursively summing over all possible multinomial outcomes whose statistic is at least as extreme as the observed value d_{\mathrm{obs}}. Conover's algorithm efficiently calculates this for one-sided tests with n \leq 30 by leveraging the structure of partial sums, providing exact critical values; for two-sided tests, conservative bounds are used due to symmetry challenges.[7] This method has been implemented in statistical software for small samples, ensuring proper control of the type I error.
In cases of mixed continuous-discrete null distributions, where F has both jumps and continuous parts, discretization techniques such as binning the continuous components into fine intervals or using weighted deviations can approximate the test. More precisely, recursive methods compute the exact distribution of D_n by integrating over continuous segments and handling discrete masses separately, often incorporating a continuity correction like adding 0.5 to jumps for numerical stability.[18] These approaches yield exact p-values without relying on asymptotic approximations, though they require specifying the full hybrid CDF.
For example, consider testing a sample against a known Poisson null distribution with parameter \lambda = 2. The supremum D_n is evaluated at integer points where jumps occur, but to adjust for the flat intervals between integers, one averages deviations over each unit interval or applies the continuity correction to the ranks before computing F_n. Using Conover's exact method on a small sample (e.g., n=20) might yield a p-value of 0.12 for observed D_n = 0.15, whereas the unadjusted continuous approximation would conservatively report around 0.20. This highlights the adjustment's role in restoring power for count data common in Poisson testing.
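A simulation-based version of this adjustment can be sketched as follows (assuming NumPy/SciPy; the sample, λ, support cutoff, and replication count are illustrative): the statistic is evaluated only at the support points, and the null distribution is resampled from the hypothesized Poisson model rather than read from continuous tables.

import numpy as np
from scipy.stats import poisson

def ks_stat_discrete(sample, cdf, support):
    """sup over the support of |F_n(k) - F(k)| for a discrete hypothesized CDF."""
    ecdf = np.array([np.mean(sample <= k) for k in support])
    return np.max(np.abs(ecdf - cdf(support)))

rng = np.random.default_rng(3)
lam, n, reps = 2.0, 20, 20_000
support = np.arange(0, 15)                       # covers essentially all Poisson(2) mass
data = rng.poisson(lam, size=n)
D_obs = ks_stat_discrete(data, lambda k: poisson.cdf(k, lam), support)
null_stats = np.array([ks_stat_discrete(rng.poisson(lam, size=n),
                                         lambda k: poisson.cdf(k, lam), support)
                       for _ in range(reps)])
p_value = np.mean(null_stats >= D_obs)           # respects the discrete null, up to Monte Carlo error
print(D_obs, p_value)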
Despite these advances, exact methods remain computationally intensive for large n (e.g., beyond 100) or distributions with many categories, as the multinomial enumeration grows factorially.[7] Approximations like Monte Carlo simulation under the null can mitigate this but introduce variability; for highly discrete cases with sparse categories, the chi-squared goodness-of-fit test is often recommended as a more efficient alternative due to its asymptotic chi-squared distribution and better power in binned settings.
Two-sample test
Statistic definition
The two-sample Kolmogorov–Smirnov test assesses whether two independent samples are drawn from the same continuous distribution by comparing their empirical cumulative distribution functions (ECDFs).[19] Let X_1, \dots, X_{n_1} be a random sample from an unknown distribution with CDF F, and Y_1, \dots, Y_{n_2} be an independent random sample from another unknown distribution with CDF G. The ECDF for the first sample is defined as
F_{n_1}(x) = \frac{1}{n_1} \sum_{i=1}^{n_1} \mathbf{1}\{X_i \leq x\},
where \mathbf{1}\{\cdot\} is the indicator function, and similarly for the second sample,
F_{n_2}(x) = \frac{1}{n_2} \sum_{j=1}^{n_2} \mathbf{1}\{Y_j \leq x\}.
[20]
The test statistic D_{n_1,n_2} is the supremum of the absolute difference between these ECDFs,
D_{n_1,n_2} = \sup_x |F_{n_1}(x) - F_{n_2}(x)|,
which quantifies the maximum vertical distance between the two step functions.[19] This statistic, introduced by Smirnov in 1939, measures overall discrepancies in distribution shape, location, and scale without assuming a specific parametric form.[20]
For asymptotic approximations, an effective sample size m = \frac{n_1 n_2}{n_1 + n_2} is used to scale the statistic as \sqrt{m} D_{n_1,n_2}, which converges in distribution to the Kolmogorov distribution under the null hypothesis when n_1, n_2 \to \infty.[21]
To compute D_{n_1,n_2}, combine and sort the observations from both samples to form the order statistics, then evaluate the ECDF differences at these points, as the supremum occurs at jumps in either ECDF.[19]
For illustration, consider a sample of size 8 from Italy ({2, 4, 6, 8, 10, 12, 14, 16}) and a sample of size 7 from France ({1, 3, 5, 7, 9, 11, 13}); evaluating both ECDFs at the combined order statistics gives a maximum absolute difference of D_{8,7} = |1 - 6/8| = 0.25, attained at x = 13.[22]
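A short sketch of this calculation (assuming NumPy/SciPy) evaluates both ECDFs at every observed point and agrees with the built-in two-sample routine:

import numpy as np
from scipy.stats import ks_2samp

x = np.array([2, 4, 6, 8, 10, 12, 14, 16], dtype=float)   # first sample, n1 = 8
y = np.array([1, 3, 5, 7, 9, 11, 13], dtype=float)        # second sample, n2 = 7
grid = np.sort(np.concatenate([x, y]))                     # jump points of either ECDF
ecdf_x = np.searchsorted(np.sort(x), grid, side="right") / x.size
ecdf_y = np.searchsorted(np.sort(y), grid, side="right") / y.size
D = np.max(np.abs(ecdf_x - ecdf_y))                        # 0.25, attained at x = 13
print(D, ks_2samp(x, y).statistic)                         # both give 0.25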
The null hypothesis is that the two samples arise from the same continuous distribution (F = G), with the alternative that the distributions differ (F \neq G).[19]
Asymptotic distribution
Under the null hypothesis that two independent samples of sizes n_1 and n_2 are drawn from the same continuous distribution function F, the two-sample Kolmogorov–Smirnov statistic D_{n_1,n_2} satisfies
\sqrt{m} \, D_{n_1,n_2} \xrightarrow{d} \sup_{t \in [0,1]} |B(t)|,
where m = n_1 n_2 / (n_1 + n_2) is the effective sample size and B is a standard Brownian bridge on [0,1]; the limit arises as the weighted difference \sqrt{\lambda_2}\,B_1(t) - \sqrt{\lambda_1}\,B_2(t) of two independent standard Brownian bridges, with \lambda_i = \lim n_i/(n_1+n_2), which is itself a standard Brownian bridge.[14] This convergence holds as \min(n_1, n_2) \to \infty.[14]
The cumulative distribution function of this limiting random variable is given by the Kolmogorov distribution:
K(\lambda) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 \lambda^2}, \quad \lambda \geq 0.
This limiting distribution is identical to that of the one-sample Kolmogorov–Smirnov statistic, even though the two-sample statistic involves two independent samples.[14]
For inference at significance level \alpha = 0.05, the asymptotic critical value of \sqrt{m} \, D_{n_1,n_2} is approximately 1.36; thus, the critical value for D_{n_1,n_2} itself is about 1.36 / \sqrt{m}.[15] For unequal sample sizes, critical values can be obtained from tables based on the effective sample size m.[23]
In finite samples, particularly when n_1 + n_2 < 40, the asymptotic approximation may be inaccurate, and exact p-values are computed using the permutation distribution under the null, which enumerates all possible label assignments to the combined sample.[17] For larger but still moderate samples, Monte Carlo simulations with at least 10,000 replicates provide reliable p-values by approximating the null distribution.[17]
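A permutation approximation to the exact p-value can be sketched as follows (assuming NumPy/SciPy; the samples, sizes, and replication count are illustrative):

import numpy as np
from scipy.stats import ks_2samp

def ks_two_sample(x, y):
    """Two-sample KS statistic evaluated at the combined order statistics."""
    grid = np.sort(np.concatenate([x, y]))
    Fx = np.searchsorted(np.sort(x), grid, side="right") / x.size
    Fy = np.searchsorted(np.sort(y), grid, side="right") / y.size
    return np.max(np.abs(Fx - Fy))

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, size=30)
y = rng.normal(0.5, 1.0, size=25)
D_obs = ks_two_sample(x, y)
pooled = np.concatenate([x, y])
reps = 10_000
perm_stats = np.empty(reps)
for b in range(reps):
    perm = rng.permutation(pooled)                  # random relabelling under H0
    perm_stats[b] = ks_two_sample(perm[:x.size], perm[x.size:])
p_value = (1 + np.sum(perm_stats >= D_obs)) / (1 + reps)
print(D_obs, p_value, ks_2samp(x, y).pvalue)        # compare with SciPy's p-value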
The proof relies on the convergence of the empirical processes \sqrt{n_1} (F_{n_1} - F) and \sqrt{n_2} (F_{n_2} - F) to independent Brownian bridges B_1 and B_2; the appropriately scaled difference of the ECDFs then converges to \sqrt{\lambda_2}\,B_1 - \sqrt{\lambda_1}\,B_2, which is again a standard Brownian bridge, so its supremum follows the Kolmogorov distribution.[14]
Theoretical foundations
Kolmogorov distribution properties
The Kolmogorov distribution arises as the limiting distribution of the Kolmogorov–Smirnov statistic in large samples, specifically as the distribution of the random variable K = \sup_{0 \le t \le 1} |B(t)|, where B(t) = W(t) - t W(1) is a standard Brownian bridge and W is a standard Wiener process.
The cumulative distribution function of K is given explicitly by the infinite series
K(\lambda) = \Pr(K \le \lambda) = \sum_{k=-\infty}^{\infty} (-1)^k e^{-2 k^2 \lambda^2}, \quad \lambda > 0.
This alternating series converges rapidly for \lambda > 0 due to the quadratic exponential decay in the terms, allowing practical truncation after a few terms for numerical evaluation.
The first two moments of K are \mathbb{E}[K] = \sqrt{\pi/2}\,\ln 2 \approx 0.8687 and \mathrm{Var}(K) = \pi^2/12 - \pi (\ln 2)^2/2 \approx 0.0678, with higher moments obtainable through numerical integration of the survival function derived from the series expansion.
For large \lambda, the tail probability satisfies \Pr(K > \lambda) \sim 2 e^{-2 \lambda^2}, reflecting the dominant contribution from the first term of the series and providing key insights into large deviation behavior.
The Kolmogorov distribution coincides with that of the supremum norm of a centered Gaussian process on [0,1] with the Brownian bridge covariance function \mathbb{E}[B(s)B(t)] = \min(s,t) - st, and it relates to the scaling limits of tied-down random walks or excursion processes in discrete approximations.
Numerical evaluation of the CDF employs recursive relations or direct summation of the series for efficiency, with precomputed quantile tables available for critical values corresponding to significance levels as low as \alpha = 0.001.
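A direct truncation of the series (a minimal sketch assuming NumPy and SciPy; the cutoff of 100 terms is far more than needed given the rapid decay) recovers, for example, the familiar 5% critical value near 1.36:

import numpy as np
from scipy.stats import kstwobign

def kolmogorov_cdf(lam, terms=100):
    """Pr(K <= lam) from the alternating series, truncated after `terms` terms."""
    if lam <= 0:
        return 0.0
    k = np.arange(1, terms + 1)
    return 1.0 - 2.0 * np.sum((-1.0) ** (k - 1) * np.exp(-2.0 * k**2 * lam**2))

print(kolmogorov_cdf(1.36))     # approximately 0.95
print(kstwobign.cdf(1.36))      # SciPy's implementation of the same distribution, for comparison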
Convergence and limiting behavior
The Glivenko–Cantelli theorem provides the foundational result for the uniform consistency of the empirical distribution function. Specifically, for independent and identically distributed (i.i.d.) random variables X_1, \dots, X_n drawn from a distribution function F, the supremum \sup_x |F_n(x) - F(x)| \to 0 almost surely as n \to \infty, where F_n is the empirical distribution function. This strong uniform convergence holds under the i.i.d. assumption without further restrictions on F.
Building on this, Donsker's invariance principle describes the weak convergence of the properly scaled empirical process to a Brownian bridge. Under the i.i.d. condition, the process \sqrt{n} (F_n(x) - F(x)) converges in distribution to B^0(F(x)), where B^0(t) = B(t) - t B(1) is a Brownian bridge and B is a standard Brownian motion, with convergence in the Skorohod topology (equivalently, on D[0,1] after the time change t = F(x)). This functional central limit theorem underpins the asymptotic distribution of the Kolmogorov–Smirnov statistic D_n = \sup_x |F_n(x) - F(x)|, which satisfies \sqrt{n} D_n \Rightarrow \sup_{t \in [0,1]} |B^0(t)| for continuous F.
The validity of these asymptotic results requires i.i.d. samples from F for the one-sample case; for the two-sample test, the samples must be independent and drawn from the same underlying distribution F. Additionally, continuity of F ensures the exact limiting Kolmogorov distribution without adjustments for ties or discontinuities.
Stronger quantitative bounds on the approximation are given by the Komlós–Major–Tusnády strong approximation theorem, which constructs a probability space where |\sqrt{n} D_n - \sup_x |B^0(x)|| = O\left( \frac{\log n}{\sqrt{n}} \right) with probability approaching 1 as n \to \infty, under the i.i.d. assumption with finite variance. This rate quantifies the error in approximating the KS statistic by its Brownian bridge limit and improves upon weaker convergence results.
Extensions to non-i.i.d. settings, such as dependent data in time series, necessitate modifications like blocking techniques or bootstrapping to restore approximate independence and enable asymptotic validity, though these require case-specific conditions on the dependence structure.[24]
Regarding the rate of convergence, n D_n^2 converges in distribution to the squared supremum of a Brownian bridge, so the test can detect alternatives approaching the null at rate n^{-1/2}; against a specific parametric alternative, however, it is generally less efficient than the parametric test tailored to that alternative.
Extensions
Confidence bands for distributions
The Kolmogorov–Smirnov test exhibits a duality with confidence bands for the cumulative distribution function (CDF). Specifically, failing to reject the null hypothesis H_0: F = F_0 (where F is the true CDF and F_0 is a specified CDF) at significance level \alpha implies, with confidence 1 - \alpha, that the true CDF F satisfies \sup_x |F(x) - F_0(x)| \leq d_{\alpha,n}, where d_{\alpha,n} is the critical value of the KS statistic D_n.[25] This inversion of the test provides a nonparametric confidence region consisting of all CDFs G such that \sup_x |G(x) - F_0(x)| \leq d_{\alpha,n}.[26]
For an unknown distribution without a specified F_0, the duality yields confidence bands centered on the empirical CDF F_n(x). The (1 - \alpha) confidence band is given by
\left[ \max\left(0, F_n(x) - \frac{c_\alpha}{\sqrt{n}}\right), \min\left(1, F_n(x) + \frac{c_\alpha}{\sqrt{n}}\right) \right],
where c_\alpha is the (1 - \alpha) quantile of the Kolmogorov distribution (the limiting distribution of \sqrt{n} D_n). For example, c_{0.05} \approx 1.36 provides an asymptotic 95% confidence band.[27][26]
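A minimal sketch of such a band construction, assuming NumPy/SciPy and an illustrative sample and evaluation grid, clips the band at 0 and 1:

import numpy as np
from scipy.stats import kstwobign

rng = np.random.default_rng(5)
sample = rng.standard_normal(200)
n = sample.size
alpha = 0.05
c = kstwobign.ppf(1 - alpha)                   # approximately 1.36
half_width = c / np.sqrt(n)

grid = np.linspace(sample.min() - 1, sample.max() + 1, 400)
ecdf = np.searchsorted(np.sort(sample), grid, side="right") / n
lower = np.clip(ecdf - half_width, 0.0, 1.0)   # simultaneous 95% band around the ECDF
upper = np.clip(ecdf + half_width, 0.0, 1.0)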
Smirnov's method constructs these bands with uniform width for the one-sample case assuming known parameters, relying on the exact distribution of the KS statistic under continuity.[25] When parameters are estimated from the data (e.g., via maximum likelihood), the null distribution of the KS statistic changes, requiring the band to be widened by a simulation-derived factor to maintain coverage; this adjustment, often implemented via Monte Carlo methods, accounts for the estimation uncertainty and is detailed in approaches like the Lilliefors modification.[26][28]
In survival analysis, these bands set nonparametric limits on reliability functions (survival curves, 1 - F(x)), aiding inference on failure probabilities without parametric assumptions.
The bands provide simultaneous coverage over all x \in \mathbb{R}, ensuring the true CDF lies within the band with probability 1 - \alpha. For known F_0 and continuous distributions, coverage is exact; with unknown distributions, it is asymptotic as n \to \infty.[25][26]
For discrete distributions, the bands tend to be conservative due to ties in the empirical CDF, leading to inflated critical values; adjustments like randomization or continuity corrections mitigate this.[26] Alternatives, such as Hall–Wellner bands, weight deviations by \sqrt{F_n(x)(1 - F_n(x))} to improve uniformity and power in the tails, though they require more computation.[29]
Multivariate versions
The multivariate Kolmogorov–Smirnov test extends the univariate test to compare probability distributions in d \geq 2 dimensions, addressing scenarios where data points are vectors in \mathbb{R}^d. The multivariate empirical cumulative distribution function (ECDF) for a sample of n i.i.d. observations X_1, \dots, X_n \in \mathbb{R}^d is defined as
F_n(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^n \prod_{j=1}^d I(X_{i,j} \leq x_j),
where \mathbf{x} = (x_1, \dots, x_d) and I(\cdot) is the indicator function; this counts the proportion of points whose components are all less than or equal to the corresponding components of \mathbf{x}, analogous to the univariate case but over orthants.[30]
For the one-sample goodness-of-fit test against a hypothesized distribution with CDF F, the test statistic is D_n = \sup_{\mathbf{x} \in \mathbb{R}^d} |F_n(\mathbf{x}) - F(\mathbf{x})|, measuring the maximum vertical deviation between the ECDF and the target CDF. However, computing this supremum is computationally intractable due to the infinite domain of \mathbb{R}^d and the need to evaluate the difference everywhere, which becomes prohibitive even for moderate d.[30]
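A common practical compromise, sketched below under the assumption that the supremum is approximated by evaluating the difference only at the observed sample points rather than over all of \mathbb{R}^d, uses the orthant-counting definition of the ECDF (the helper multivariate_ecdf is illustrative):

import numpy as np
from scipy.stats import norm

def multivariate_ecdf(data, points):
    """Proportion of rows of `data` that are componentwise <= each row of `points`."""
    return np.array([np.mean(np.all(data <= p, axis=1)) for p in points])

rng = np.random.default_rng(6)
d, n = 2, 200
data = rng.standard_normal((n, d))
# Hypothesized CDF: independent standard normals, so F(x) is a product of marginal CDFs
F_theory = np.prod(norm.cdf(data), axis=1)
F_emp = multivariate_ecdf(data, data)            # ECDF evaluated at the sample points only
D_approx = np.max(np.abs(F_emp - F_theory))      # approximates sup_x |F_n(x) - F(x)|
print(D_approx)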
Practical implementations approximate the supremum using methods like evaluating the statistic over a finite grid of points or at points derived from the data, as developed in early multidimensional extensions. For instance, Fasano and Franceschini (1987) proposed a computationally efficient approach for the two-sample case in two or three dimensions that evaluates deviations over the orthants centered at the observed sample points, avoiding full grid searches. An alternative is the Cramér-von Mises analog, which integrates the squared deviations over the space rather than taking the supremum, though it requires similar approximations in higher dimensions.[31]
In the two-sample setting, with samples of sizes n_1 and n_2 having ECDFs F_{n_1} and F_{n_2}, the statistic is D_{n_1,n_2} = \sup_{\mathbf{x} \in \mathbb{R}^d} |F_{n_1}(\mathbf{x}) - F_{n_2}(\mathbf{x})|, again evaluated over orthants to detect differences in multidimensional distributions. Under the null hypothesis of identical distributions, the asymptotic distribution involves a multivariate Brownian bridge, but no closed-form expression exists for d > 1, complicating exact p-value computation.[31]
Key challenges in multivariate applications include the curse of dimensionality, where the effective sample size diminishes rapidly as d increases, leading to low power and unreliable approximations even for d=3. Consequently, tests often rely on bootstrap methods to estimate p-values, resampling from the combined or individual samples to simulate the null distribution, though this adds significant computational cost.[32]
For example, to test bivariate normality on 2D scatterplot data from a sample of size n=100, one might compute D_n by discretizing a grid over the observed range of each dimension (e.g., 50x50 points) and evaluating deviations from the bivariate normal CDF, then using bootstrapping for the p-value; significant deviations would indicate non-normality, such as clustering or outliers in specific orthants.[30]
Higher-dimensional generalizations
Higher-dimensional generalizations of the Kolmogorov–Smirnov (KS) test extend beyond finite-dimensional vectors to infinite-dimensional or non-Euclidean spaces, such as function spaces or graph structures, by adapting the supremum distance to measure discrepancies in empirical and target distributions. In functional data analysis, where observations are curves or functions in a Hilbert space H, the test compares projected empirical distribution functions to a null hypothesis distribution. The statistic is typically defined as T_{n,\pi} = \sup_t |F_{n,\pi}(t) - F_{0,\pi}(t)|, where \pi: H \to \mathbb{R} is a projection (e.g., onto principal components derived from a Karhunen–Loève expansion), F_{n,\pi} is the empirical cumulative distribution function of the projected data, and F_{0,\pi} is the projected null distribution; this approach leverages random or all-projections to capture multidimensional variability while maintaining computational tractability.[33] Such tests assess goodness-of-fit for functional models by aggregating projection-based KS statistics, with asymptotic properties derived from empirical process theory.[34]
For graph or network distributions, the KS test is adapted by treating graphs as realizations of random variables over edge probabilities or subgraph counts, with the statistic computed as the supremum of differences in empirical cumulative distributions of these features. This measures similarity between network ensembles, such as testing whether two sets of graphs arise from the same generative model, and has applications in social network analysis for detecting structural shifts.[35] The approach embeds graph properties into a distributional framework, enabling non-parametric comparisons without assuming specific parametric forms.[36]
As an alternative in high dimensions, the energy distance metric provides a related supremum-based measure that avoids the curse of dimensionality plaguing grid-based KS extensions; defined as d^2(P, Q) = 2\mathbb{E}[\|X - Y\|] - \mathbb{E}[\|X - X'\|] - \mathbb{E}[\|Y - Y'\|], it quantifies distribution differences via expected Euclidean distances and offers faster computation through direct sample estimates without projections or kernels.[37] This makes it suitable for high-dimensional settings where traditional KS variants degrade in power.[38]
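A plug-in estimate of the squared energy distance can be computed directly from pairwise Euclidean distances, as in the following sketch (assuming NumPy; the samples and dimension are illustrative):

import numpy as np

def energy_distance_sq(x, y):
    """Plug-in estimate of d^2(P, Q) = 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||."""
    def mean_pairwise(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.mean(np.sqrt(np.sum(diff**2, axis=-1)))
    return 2 * mean_pairwise(x, y) - mean_pairwise(x, x) - mean_pairwise(y, y)

rng = np.random.default_rng(7)
x = rng.standard_normal((100, 5))            # 5-dimensional samples
y = rng.standard_normal((100, 5)) + 0.3      # shifted mean in every coordinate
print(energy_distance_sq(x, y))              # larger values indicate greater discrepancy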
Theoretically, these generalizations embed distributions into a reproducing kernel Hilbert space (RKHS) \mathcal{H}_k, mapping measures P to kernel mean embeddings \mu_P = \int k(\cdot, y) \, dP(y), with the test statistic as d_{k,\Lambda}(P_n, Q_m) = \sup_{\lambda \in \Lambda} \|\mu_{P_n} - \mu_{Q_m}\|_{\mathcal{H}_{k_\lambda}} over a family of kernels; convergence under the null follows functional Donsker theorems, ensuring \sqrt{nm/(n+m)} \, d_{k,\Lambda}(P_n, Q_m) converges to the supremum of a Gaussian process in the RKHS unit ball.[39] This framework unifies infinite-dimensional two-sample testing with universal Donsker classes for empirical processes.
Post-2010 developments include bootstrap methods for p-value computation in streaming data scenarios, where KS statistics monitor evolving distributions online by resampling from sliding windows to approximate null distributions under non-stationarity.[40] In machine learning, these tests detect distribution shifts, such as in deep reinforcement learning for traffic control, where KS distances between vehicle flow empirical CDFs quantify deviations causing performance drops (e.g., a 0.02 increase linked to 3.7% throughput loss).[41]
An illustrative example is testing uniformity on the hypersphere S^{d-1} for d \geq 3, where projections onto random directions yield one-dimensional KS statistics K_n = \sup_x |F_n(x) - F_0(x)| on the projected uniforms, aggregated over multiple directions to assess global deviations; angular measures, like cosine-based pairwise angles \cos^{-1}(U_i^\top U_j), further adapt the supremum to spherical geometry for rotation-invariant testing.[42]
Practical considerations
Implementation algorithms
The Kolmogorov–Smirnov (KS) statistic for the one-sample test is computed by first sorting the sample data in ascending order to obtain the order statistics X_{(1)} \leq X_{(2)} \leq \cdots \leq X_{(n)}. The empirical cumulative distribution function (ECDF) is then evaluated at these points, and the test statistic D_n is the supremum of the absolute differences between the ECDF and the hypothesized cumulative distribution function F, specifically D_n = \sup_x |F_n(x) - F(x)|, where F_n(x) = i/n for X_{(i)} \leq x < X_{(i+1)}. In practice, this supremum is achieved at the order statistics, so D_n is calculated as the maximum of |i/n - F(X_{(i)})| and |(i-1)/n - F(X_{(i)})| over i = 1, \dots, n.[1]
For the two-sample test, the samples from the two groups are sorted separately to form order statistics, and the ECDFs F_{n_1} and F_{n_2} are computed for samples of sizes n_1 and n_2. The statistic D_{n_1,n_2} is the supremum of |F_{n_1}(x) - F_{n_2}(x)| over x, which can be efficiently found by merging the two sorted lists and evaluating the differences at the combined order statistics, accounting for jumps in each ECDF. This merged approach avoids redundant computations by traversing the combined sequence once.[1]
The time complexity of computing the KS statistic involves an initial sorting step, which requires O(n \log n) operations for a sample of size n in the one-sample case, or O((n_1 + n_2) \log (n_1 + n_2)) for the two-sample case. Following sorting, the search for the supremum difference proceeds in linear time, O(n), by iterating over the order statistics.[43]
P-values for the KS test are calculated differently based on sample size. For large n, asymptotic approximations to the Kolmogorov distribution K(\lambda) are used, where \lambda = D \sqrt{n} (adjusted for two-sample as \lambda = D \sqrt{n_1 n_2 / (n_1 + n_2)}), providing the tail probability P(K > \lambda). For smaller samples or when parameters are estimated, exact p-values can be obtained via dynamic programming methods, such as the Durbin matrix algorithm, which computes the distribution recursively in O(n^2) time. An efficient approximation is the Pelz-Good asymptotic series, which expands the distribution function using a series of terms for improved accuracy over basic asymptotics, particularly for moderate n.[44]
When exact or asymptotic methods are computationally intensive, Monte Carlo simulation offers a flexible alternative for p-value estimation. This involves generating B (typically 10^4 to 10^5) independent samples under the null hypothesis, computing the KS statistic for each, and estimating the p-value as the proportion of simulated statistics exceeding the observed D. The process is parallelizable across simulations to reduce computation time.[45]
Special handling is required for edge cases to ensure numerical stability and validity. In discrete distributions with ties, the standard KS test assumes continuity and may produce biased results; ties can be addressed by averaging the ECDF ranks at tied points or by adding small uniform random noise to break ties while preserving the distribution. Numerical stability issues arise when F(x) is near 0 or 1, potentially causing floating-point errors in differences; these are mitigated by using adjusted ECDF evaluations, such as (i - 0.5)/n instead of i/n, to avoid boundary artifacts.[46]
The following pseudocode illustrates the computation of the one-sample KS statistic:
function ks_one_sample_statistic(sample, F):
n = length(sample)
sort sample to get X_sorted # O(n log n)
D = 0
for i in 1 to n:
# Evaluate at jumps
d1 = abs(i / n - F(X_sorted[i]))
d2 = abs((i - 1) / n - F(X_sorted[i]))
# Optional adjustment for continuity correction
d3 = abs((i - 0.5) / n - F(X_sorted[i]))
D = max(D, d1, d2, d3)
return D # O(n) after sorting
[1]
Software and libraries
The Kolmogorov–Smirnov (KS) test is implemented in various statistical software packages and programming libraries, facilitating its application in data analysis workflows across different environments.
In the R programming language, the base stats package provides the ks.test() function, which supports both one-sample and two-sample KS tests. For the one-sample case, it compares an empirical distribution to a fully specified cumulative distribution function (CDF), such as the normal distribution; the syntax is exemplified as ks.test(x, y = "pnorm", mean = 0, sd = 1), where x is the data vector and y specifies the theoretical distribution. The Lilliefors correction for estimated parameters is available separately, for example through lillie.test() in the nortest package. For advanced handling of discrete distributions, the dgof package extends KS testing with adjustments for ties and lattice distributions, while the boot package enables simulation-based p-value estimation through bootstrapping.
Python's SciPy library offers robust implementations via the scipy.stats module, including kstest() for one-sample tests against a specified or empirical CDF, and ks_2samp() for two-sample comparisons. These functions integrate seamlessly with NumPy arrays, supporting efficient computation on large datasets, and allow custom CDFs for flexible hypothesis testing. For instance, kstest(rvs, 'norm', args=(0, 1)) tests normality assuming mean 0 and standard deviation 1. Starting with SciPy version 1.10.0, enhancements to bootstrap methods improve accuracy for p-value calculations in complex scenarios.
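A minimal usage sketch (the data and seed are illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
x = rng.normal(loc=0.0, scale=1.0, size=500)
y = rng.normal(loc=0.2, scale=1.0, size=400)

# One-sample test against a fully specified standard normal
res1 = stats.kstest(x, "norm", args=(0, 1))
# Two-sample test comparing the two empirical distributions
res2 = stats.ks_2samp(x, y)
print(res1.statistic, res1.pvalue)
print(res2.statistic, res2.pvalue)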
MATLAB includes built-in functions kstest() and kstest2() in its Statistics and Machine Learning Toolbox, performing one- and two-sample KS tests, respectively, with access to precomputed critical value tables for asymptotic distributions. These tools output test statistics, p-values, and confidence intervals, suitable for both interactive and scripted analysis.
Other environments provide specialized support: Julia's HypothesisTests.jl package implements KS tests through types such as ExactOneSampleKSTest for the one-sample case and ApproximateTwoSampleKSTest for the two-sample case, emphasizing high-performance computing. In SAS, the PROC NPAR1WAY procedure conducts KS tests as part of nonparametric analyses. For users of Microsoft Excel, third-party add-ins like the Real Statistics Resource Pack enable basic one-sample KS testing via menu-driven interfaces or VBA functions.
Considerations for implementation include the balance between open-source options like R, Python, and Julia—which offer extensive customization and community support—and proprietary tools like MATLAB and SAS, which provide integrated environments with validated compliance for enterprise use. Version-specific updates, such as those in SciPy, underscore the importance of maintaining current installations to leverage algorithmic refinements.
Power and limitations
The Kolmogorov–Smirnov (KS) test demonstrates moderate statistical power for detecting location shifts in continuous distributions, where it outperforms the chi-squared test due to the latter's reliance on binning, which reduces effectiveness for small sample sizes and continuous data.[1][47] Simulation studies show that power varies with sample size, alternative distribution, and significance level.[48] However, the KS test has low power against changes in variance or deviations involving heavy tails, as it is primarily sensitive to discrepancies near the center of the distribution rather than the extremes.[1]
In comparisons with other goodness-of-fit tests, the KS statistic exhibits lower power than the Anderson-Darling test, particularly for tail deviations, where the latter weights observations more heavily in the tails and requires smaller sample sizes (e.g., n=14 vs. n=17 for 80% power at a standardized shift of δ=0.9 in normal distributions).[48][10] Relative to the chi-squared test, the KS approach is preferable for small n (e.g., n<50) because it avoids arbitrary binning and provides exact distribution-free critical values under continuity assumptions.[1] Empirical evidence from 1980s studies, such as those by Stephens, quantified these power curves through extensive tables and simulations for exponentiality and normality tests, confirming the KS test's relative strengths and weaknesses across alternatives.
Key limitations include high sensitivity to ties in discrete data, which renders the test underpowered and requires conservative p-value adjustments or specialized implementations to maintain validity.[7] The test assumes independent and identically distributed (i.i.d.) observations, leading to invalid results and inflated Type I errors for clustered or dependent data, such as time series.[49] Additionally, when distribution parameters are estimated from the data, the test becomes conservative, necessitating simulation-based critical values rather than asymptotic approximations.[1] Common pitfalls involve misinterpreting p-values without verifying continuity or over-relying on asymptotic distributions for small samples (n<20), where exact methods or bootstrapping are recommended to avoid underpowered inferences.[1]
To address these issues, related statistics such as Kuiper's test, which sums the one-sided deviations D^+ + D^- and is invariant under cyclic transformations, improve power for circular data and against alternatives like rotational shifts by weighting deviations more evenly across the domain.[50] Bootstrap resampling enhances overall power by incorporating parameter uncertainty and providing more accurate p-values, particularly for finite samples or non-i.i.d. settings.[51] Recent studies in the 2020s highlight significant power loss in high-dimensional extensions of the KS test, where the curse of dimensionality dilutes sensitivity, prompting developments in sliced or projected variants for multivariate applications.[52]