In statistics, a quantile is a value that divides a dataset or probability distribution into two parts such that a specified fraction of the observations or probabilities lies below it.[1] The q-th quantile, where q is a number between 0 and 1, is the point below which a proportion q of the data falls and above which a proportion 1 - q falls.[2] This concept generalizes measures like the median, which is the 0.5 quantile, and provides a way to describe the location and spread of data without assuming a particular distributional form.[3]

Quantiles are computed from ordered (ranked) data and include specific cases such as quartiles (dividing data into four equal parts at q = 0.25, 0.5, and 0.75) and percentiles (where q = k/100 for integer k from 1 to 99).[4] For a sample of n observations, the sample quantile is interpolated from the ordered values, though various conventions exist for handling ties or non-integer positions.[5] Unlike the mean, quantiles depend solely on the relative ordering of data points, which is preserved under strictly monotonic transformations, making them suitable for skewed distributions.[6]

Quantiles play a central role in descriptive statistics for summarizing empirical distributions, such as in box plots that visualize medians and quartiles to detect outliers.[7] They are particularly robust to extreme values: measures like the median remain stable even when outliers are present, unlike the arithmetic mean.[8] In inferential statistics, quantiles serve as critical values for hypothesis testing, confidence intervals, and assessing distributional assumptions via tools like quantile-quantile (Q-Q) plots.[6] Beyond core statistics, quantiles find applications in risk assessment across fields like finance (e.g., value-at-risk calculations), engineering, and environmental science, where they help quantify tail behaviors and uncertainties.[9]
Fundamentals
Definition
For a real-valued random variable X with cumulative distribution function F, the p-quantile, often denoted Q(p) or F^{-1}(p), is defined as the infimum of the set \{x \in \mathbb{R} : F(x) \geq p\} for p \in (0,1).[10] This definition employs the generalized inverse of F to handle cases where F may not be strictly increasing or continuous.[11]

Quantiles extend the idea behind order statistics to distributions, specifying points that partition the distribution such that the probability that X is at most the p-quantile is at least p, and the probability that X is at least the p-quantile is at least 1 - p; for instance, the median corresponds to the 0.5-quantile, dividing the distribution into two parts, each with probability at least 0.5.[10] If F is constant at level p over an interval (a region of zero density, such as a gap between the support points of a discrete distribution), every point of that interval satisfies the defining inequalities, and the infimum convention selects the left endpoint of the interval (illustrated in the sketch below).[11]

The term "quantile" first appeared in the statistical literature in the 1938 edition of Statistical Tables for Biological, Agricultural and Medical Research by Ronald A. Fisher and Frank Yates.[12] Percentiles represent a specific instance of quantiles, where p = k/100 for integer k.[10]
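The following minimal Python sketch illustrates the generalized-inverse definition Q(p) = \inf\{x : F(x) \geq p\} on a small discrete distribution; the support points and probabilities are invented for illustration.

```python
# A minimal sketch of the generalized inverse Q(p) = inf{x : F(x) >= p},
# illustrated on a small discrete distribution (values chosen arbitrarily).

def quantile_from_cdf(support, probs, p):
    """Return inf{x : F(x) >= p} for a discrete distribution given as
    sorted support points and their probabilities."""
    cumulative = 0.0
    for x, pr in zip(support, probs):
        cumulative += pr
        if cumulative >= p:
            return x          # first support point where F(x) >= p
    return support[-1]

# Distribution: P(X=1) = 0.2, P(X=2) = 0.3, P(X=3) = 0.5
support = [1, 2, 3]
probs = [0.2, 0.3, 0.5]

print(quantile_from_cdf(support, probs, 0.5))   # 2, since F(2) = 0.5 >= 0.5
print(quantile_from_cdf(support, probs, 0.51))  # 3, past the flat stretch of F
```

At p = 0.51 the query moves past the flat stretch of F at height 0.5, so the returned value jumps from 2 to 3; at p = 0.5 exactly, the infimum convention returns the left endpoint 2 of the quantile interval [2, 3].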
Types and Notation
Quantiles are often categorized by the number of equal-probability divisions they create in a distribution, with specific names for common subdivisions. Quartiles divide the distribution into four equal parts, corresponding to the quantiles at probabilities p = 0.25, p = 0.5, and p = 0.75.[13] Quintiles divide it into five equal parts at p = 0.2, 0.4, 0.6, 0.8.[14] Deciles divide it into ten equal parts at p = 0.1, 0.2, \dots, 0.9, while percentiles divide it into 100 equal parts at p = k/100 for integer k = 1, 2, \dots, 99.[15][13]

Standard notation for the p-quantile of a random variable with cumulative distribution function F is Q(p) or x_p, defined as the generalized inverse Q(p) = F^{-1}(p) = \inf \{ x : F(x) \geq p \}.[16][2] The median is a special case given by Q(0.5).[16]

Interquantile ranges provide measures of variability between specific quantiles; for instance, the interquartile range (IQR) is defined as \text{IQR} = Q(0.75) - Q(0.25), offering a robust indicator of spread that is less sensitive to outliers than the full range (computed in the snippet below).[17][13]

In distributions where the cumulative distribution function has flat regions, such as discrete distributions, the p-quantile may not be unique, leading to a distinction between the lower quantile (the infimum of the interval where F(x) \geq p) and the upper quantile (the supremum of that interval).[18]
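As a quick illustration of the notation, the snippet below computes the three quartiles and the IQR of a small sample using NumPy's default linear-interpolation rule; the data vector is arbitrary illustrative input.

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41])

# Q(0.25), Q(0.5), Q(0.75) under NumPy's default linear interpolation
q1, med, q3 = np.quantile(data, [0.25, 0.5, 0.75])
iqr = q3 - q1

print(q1, med, q3, iqr)   # 20.25  37.5  39.75  19.5
```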
Population Quantiles
Calculation Methods
For a general population with cumulative distribution function (CDF) F, the p-th population quantile Q(p) is defined as \inf\{x : F(x) \geq p\}, or equivalently Q(p) = F^{-1}(p). For continuous, strictly increasing F, this is the unique solution of F(x) = p.[19]

For a finite population of size n, quantiles are computed using the sorted values x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)}, known as the order statistics. The position of the p-quantile is typically determined by a formula such as k = p(n + 1). If k is an integer, the quantile is exactly x_{(k)}; otherwise, linear interpolation is applied between x_{(\lfloor k \rfloor)} and x_{(\lceil k \rceil)}.[20] One specific non-interpolated approach defines the quantile as Q(p) = x_{(k)} where k = \lceil p(n + 1) \rceil.[20]

Hyndman and Fan (1996) outline nine types of quantile definitions, differing primarily in the exact positioning and interpolation schemes, with types 1–3 relying on direct selection from order statistics and types 4–9 incorporating linear interpolation. Type 7, widely adopted as the default in statistical software such as R, uses h = (n - 1)p + 1, sets j = \lfloor h \rfloor, and computes Q(p) = (1 - \gamma) x_{(j)} + \gamma x_{(j+1)} where \gamma = h - j. This corresponds to placing the order statistics at the evenly spaced probability points (k - 1)/(n - 1), and it is implemented in the sketch below.[20]

For the median (p = 0.5), the formula naturally handles even and odd n: if n is odd, it selects the middle point x_{((n+1)/2)}; if n is even, it averages the two middle points x_{(n/2)} and x_{(n/2 + 1)}. In the edge cases, p = 0 yields the minimum x_{(1)} and p = 1 yields the maximum x_{(n)}.[20]
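The type 7 rule can be written directly from the formula above. The sketch below (with an arbitrary test vector) checks it against NumPy's np.quantile, whose default method implements the same type.

```python
import numpy as np

def quantile_type7(sample, p):
    """Hyndman & Fan type 7: linear interpolation with h = (n - 1)p + 1."""
    xs = sorted(sample)
    n = len(xs)
    h = (n - 1) * p + 1              # 1-based fractional position
    j = int(h)                       # floor(h), since h >= 1
    gamma = h - j
    if j >= n:                       # p = 1 edge case: the maximum
        return float(xs[-1])
    return (1 - gamma) * xs[j - 1] + gamma * xs[j]

data = [3, 1, 4, 1, 5, 9, 2, 6]
for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    assert abs(quantile_type7(data, p) - np.quantile(data, p)) < 1e-12

print(quantile_type7(data, 0.5))     # 3.5, the average of the two middle values
```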
Examples and Properties
To illustrate the calculation of population quantiles, consider a finite population consisting of the values \{1, 2, 3, 4, 5\}. When sorted in non-decreasing order, the cumulative distribution function reaches or exceeds 0.5 at the third value, yielding the median Q(0.5) = 3.

For an even-sized population, such as \{1, 2, 3, 4\}, linear interpolation between the two central values is applied to estimate the median, resulting in Q(0.5) = (2 + 3)/2 = 2.5. This interpolation method aligns the quantile with the inverse of the empirical cumulative distribution function for discrete data.[21]

Population quantiles exhibit several key properties. The quantile function Q(p) is non-decreasing in p: for 0 \leq p_1 \leq p_2 \leq 1, it holds that Q(p_1) \leq Q(p_2), reflecting the ordering of the distribution. Additionally, quantiles demonstrate equivariance under monotone transformations: if h is a strictly increasing function and q_p is the p-th quantile of a random variable Y, then h(q_p) is the p-th quantile of h(Y). This property preserves the relative positioning within the transformed distribution (checked numerically in the snippet below).[22]

Quantiles are robust to outliers relative to the mean, as extreme values influence the mean proportionally to their magnitude but affect quantiles only if they alter the ordering near the specified p. For instance, the median remains unchanged unless more than half the population values shift.[22]

Regarding the relation to the mean, in symmetric distributions the median equals the mean, providing a central value that aligns both measures of location. In skewed distributions, however, quantiles offer superior insight into the tails by delineating the full spread, whereas the mean can be disproportionately pulled by asymmetry.[23]
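A small numeric check of the equivariance property; this toy example uses p = 0.5 with an odd-sized sample deliberately, so the quantile is an exact order statistic and no interpolation intervenes.

```python
import numpy as np

y = np.array([0.5, 1.0, 2.0, 4.0, 8.0])   # median is the middle value, 2.0

# For the strictly increasing map h(t) = log(t), the median commutes with h:
lhs = np.quantile(np.log(y), 0.5)   # median of the transformed data
rhs = np.log(np.quantile(y, 0.5))   # transform of the median

print(lhs, rhs)                     # both equal log(2) ≈ 0.6931
```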
Sample Quantiles
Estimation Techniques
Estimation of population quantiles from sample data relies on the empirical distribution function, which serves as the foundation for deriving quantile estimators from observed values. The standard sample quantile estimator \hat{Q}(p) for a probability p \in (0,1) is obtained from a random sample X_1, \dots, X_n by ordering the observations to form the order statistics X_{(1)} \leq \cdots \leq X_{(n)} and selecting \hat{Q}(p) = X_{(k)}, where k is typically chosen as the integer closest to p(n+1).[24] This approach assumes the sample is independent and identically distributed (i.i.d.) from the underlying population distribution with a continuous cumulative distribution function F.

Various interpolation schemes address cases where p(n+1) is not an integer, providing smoother estimates. For instance, linear interpolation between adjacent order statistics is common in many statistical packages, while the Harrell-Davis estimator improves upon this by computing a weighted linear combination of all order statistics, using beta distribution probabilities to assign weights that emphasize observations near the target quantile; this method yields lower mean squared error, particularly in small samples (see the sketch below).[24][25] Kernel density-based methods estimate quantiles by first constructing a smoothed empirical density via kernel smoothing and then inverting the resulting cumulative distribution function numerically, offering flexibility for non-parametric settings but requiring bandwidth selection to balance bias and variance.

Sample quantiles are consistent estimators of population quantiles under mild conditions on the population distribution, converging in probability to the true Q(p) = F^{-1}(p) as n \to \infty. For interior quantiles (0 < p < 1, bounded away from 0 and 1) and smooth densities, the bias of these estimators is of order O(1/n).

In the special case of median estimation (p = 0.5), the sample median, defined as X_{((n+1)/2)} for odd n or the average of the two central order statistics for even n, provides a robust point estimate that inherits the consistency and bias properties of general sample quantiles.[24]
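The Harrell-Davis estimator can be implemented directly from its usual description as a beta-weighted average of all order statistics; the following is a sketch, assuming SciPy for the beta CDF, with an arbitrary test sample.

```python
import numpy as np
from scipy.stats import beta

def harrell_davis(sample, p):
    """Weighted average of all order statistics with Beta(p(n+1), (1-p)(n+1)) weights."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = len(xs)
    a, b = p * (n + 1), (1 - p) * (n + 1)
    # Weight on X_(i) is the beta probability mass over the interval ((i-1)/n, i/n].
    edges = np.arange(n + 1) / n
    weights = np.diff(beta.cdf(edges, a, b))
    return float(weights @ xs)

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(harrell_davis(data, 0.5))   # a smooth estimate near the sample median of 3.5
```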
Asymptotic Behavior
Under suitable regularity conditions, the sample quantile \hat{Q}_n(p) for 0 < p < 1 exhibits asymptotic normality when derived from an i.i.d. sample X_1, \dots, X_n from a distribution with cumulative distribution function F that is continuously differentiable at the population quantile Q(p) = F^{-1}(p) with positive density f(Q(p)) > 0. Specifically,

\sqrt{n} \left( \hat{Q}_n(p) - Q(p) \right) \xrightarrow{d} N\left(0, \frac{p(1-p)}{f(Q(p))^2}\right)

as n \to \infty.

This result follows from the Bahadur representation, which provides a linear expansion of the sample quantile:

\hat{Q}_n(p) = Q(p) + \frac{p - F_n(Q(p))}{f(Q(p))} + o_p(n^{-1/2}),

where F_n denotes the empirical cumulative distribution function. The term \sqrt{n} (F_n(Q(p)) - p) converges in distribution to N(0, p(1-p)) by the central limit theorem for the empirical process, and the delta method applies to the differentiable inverse transformation to yield the asymptotic normality of the quantile.[26]

For the sample median (p = 0.5), the asymptotic variance simplifies to 1/(4 f(m)^2), where m = Q(0.5) is the population median, so

\sqrt{n} \left( \hat{Q}_n(0.5) - m \right) \xrightarrow{d} N\left(0, \frac{1}{4 f(m)^2}\right).

The key condition remains the existence of a continuous density f with f(m) > 0 at the median, ensuring the representation holds and the normality applies without additional complications from discontinuities.[26]

These asymptotic properties facilitate inference, such as constructing approximate confidence intervals for Q(p) via the normal approximation: \hat{Q}_n(p) \pm z_{1-\alpha/2} \sqrt{p(1-p)/(n \hat{f}(\hat{Q}_n(p))^2)}, where \hat{f} is a consistent estimator of the density at \hat{Q}_n(p). Bootstrap resampling offers a robust alternative, generating B bootstrap samples to compute empirical quantiles and derive percentile or bias-corrected intervals that adapt to the underlying distribution without explicit density estimation.
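A sketch of the normal-approximation interval just described, estimating the density at the sample quantile with a Gaussian kernel density estimate (the choice of KDE is an assumption for illustration; any consistent density estimator would serve):

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                 # illustrative N(0, 1) sample
p, alpha = 0.9, 0.05

q_hat = np.quantile(x, p)                 # sample 0.9-quantile
f_hat = gaussian_kde(x)(q_hat)[0]         # estimated density at the sample quantile
se = np.sqrt(p * (1 - p)) / (f_hat * np.sqrt(len(x)))
z = norm.ppf(1 - alpha / 2)

print(q_hat - z * se, q_hat + z * se)     # approximate 95% CI for Q(0.9)
```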
Advanced Topics
Streaming and Approximate Methods
In streaming data environments, computing quantiles presents significant challenges due to the requirement for one-pass processing with limited memory, as data arrives continuously and the entire dataset cannot be stored or sorted multiple times.[27]

To address this, several approximate methods have been developed. The Greenwald-Khanna (GK) algorithm maintains ε-approximate quantiles for a data stream of length n using O((1/ε) log(ε n)) space in the worst case, ensuring that the true rank of any reported quantile lies within ε n of the estimated rank.[27] This deterministic approach updates the summary incrementally by maintaining a compact set of tuples representing lower and upper bounds on item ranks, merging or trimming them as needed to control space.[27]

The t-digest algorithm constructs mergeable probabilistic sketches for estimating quantiles with a tunable relative error ε, using constant space that does not grow with stream length and offering high accuracy even in distribution tails. It builds a collection of weighted centroids clustered by scale, allowing efficient online updates and centroid merging, which facilitates aggregation across distributed streams such as in map-reduce frameworks.

Adaptations of the Count-Min sketch enable approximate quantile computation by hashing items into bins to estimate frequencies and reconstruct the empirical cumulative distribution function, achieving ε-approximation with high probability using O((1/ε) log(1/δ) log n) space, where δ is the failure probability.[28]

These methods find applications in real-time monitoring within databases and network systems, such as tracking latency percentiles for service level agreements or detecting anomalies in traffic distributions. Key trade-offs include accuracy versus space and update time: smaller ε improves precision but increases resource demands, while randomized sketches like t-digest often achieve better efficiency than deterministic ones like GK for large-scale deployments.[27] A toy illustration of the underlying compaction idea appears below.
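The toy summary below illustrates the basic space-saving move shared by such sketches: compacting a buffer into a weighted subsample in one pass. It is a deliberately simplified illustration, not the Greenwald-Khanna or t-digest algorithm, and it carries no formal ε guarantee.

```python
import random

class ToyQuantileSketch:
    """One-pass approximate quantiles via repeated buffer compaction (toy example)."""

    def __init__(self, capacity=128):
        self.capacity = capacity
        self.items = []                      # (value, weight) pairs

    def add(self, value):
        self.items.append((value, 1))
        if len(self.items) > self.capacity:
            self._compact()

    def _compact(self):
        # Sort by value, keep every second item with doubled weight; this halves
        # memory while roughly preserving the weighted empirical CDF.
        self.items.sort()
        offset = random.randint(0, 1)        # random offset reduces systematic bias
        self.items = [(v, 2 * w) for v, w in self.items[offset::2]]

    def query(self, p):
        # Invert the weighted empirical CDF of the surviving items.
        items = sorted(self.items)
        total = sum(w for _, w in items)
        cum = 0
        for v, w in items:
            cum += w
            if cum >= p * total:
                return v
        return items[-1][0]

sketch = ToyQuantileSketch()
for _ in range(100_000):
    sketch.add(random.gauss(0.0, 1.0))

print(sketch.query(0.5))                     # close to 0 for a standard normal stream
```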
Variants and Related Concepts
Expectiles represent asymmetric analogs to quantiles, defined as the values that minimize an asymmetric mean squared error criterion. For a random variable X and level \tau \in (0,1), the \tau-expectile e_\tau is defined by the equation

\tau E[(X - e_\tau)^+] = (1-\tau) E[(e_\tau - X)^+].

This formulation weights positive and negative deviations asymmetrically, making expectiles a weighted average akin to the mean but sensitive to tails in a quantile-like manner. Introduced by Newey and Powell (1987), expectiles are advantageous in regression due to their smooth objective function, facilitating gradient-based optimization and applications in risk management and finance (see the sketch at the end of this section).[29]

Quantile regression generalizes quantiles to conditional distributions, modeling the \tau-th conditional quantile of the response y given covariates x as y = x^T \beta(\tau) + \varepsilon, where \beta(\tau) varies with \tau. The parameters \beta(\tau) are estimated by minimizing the check function loss \sum_i \rho_\tau(y_i - x_i^T \beta), with \rho_\tau(u) = u(\tau - I(u < 0)), solvable via linear programming. Pioneered by Koenker and Bassett (1978), this method enables inference across the full conditional distribution, revealing variations in effects across outcome levels and proving robust to heteroscedasticity and outliers.[30]

Other variants include dyadic quantiles, which leverage dyadic tree structures for quantile estimation in regression settings, particularly useful for handling hierarchical or networked data. Multivariate quantiles extend the univariate concept to higher dimensions, often constructed via copulas to model joint dependence while preserving marginal distributions; for example, copula-based estimators transform marginal quantiles into joint ones, aiding in multidimensional risk assessment. In machine learning, quantile forests adapt random forest algorithms to predict conditional quantiles directly, supporting uncertainty quantification through distribution-free intervals, as developed by Athey et al. (2019).[31][32][33]

In robust statistics, variants like expectiles and quantile regression enhance outlier resistance beyond standard means, with applications in econometrics for tail-risk analysis. Post-2020 developments in AI integrate quantile losses, such as the pinball loss, into neural networks for probabilistic predictions, enabling non-crossing multi-quantile outputs and improved calibration in forecasting tasks like time series and image segmentation.[34]
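The defining equation for e_\tau lends itself to direct numerical solution on a sample. The following sketch uses scalar root finding (SciPy's brentq) on an arbitrary test sample.

```python
import numpy as np
from scipy.optimize import brentq

def expectile(sample, tau):
    """Solve tau * E[(X - e)^+] = (1 - tau) * E[(e - X)^+] for e on the sample."""
    x = np.asarray(sample, dtype=float)

    def balance(e):
        return tau * np.mean(np.maximum(x - e, 0.0)) \
             - (1 - tau) * np.mean(np.maximum(e - x, 0.0))

    # balance is positive at min(x) and negative at max(x), so a root lies between.
    return brentq(balance, x.min(), x.max())

data = [3, 1, 4, 1, 5, 9, 2, 6]
print(expectile(data, 0.5))   # tau = 0.5 recovers the sample mean, 3.875
```

At \tau = 0.5 the two sides weight deviations equally and the expectile reduces to the mean, mirroring how the 0.5-quantile reduces to the median.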