
Quantile

In statistics, a quantile is a value that divides a probability distribution or a data sample into two parts such that a specified proportion of the observations or probability mass lies below it. The q-th quantile, where q is a number between 0 and 1, represents the point at which a proportion q of the data falls below and a proportion (1 - q) falls above. This concept generalizes measures like the median, which is the 0.5 quantile, and provides a way to describe the location and spread of data without assuming a particular distributional form. Quantiles are computed from ordered (ranked) data and include specific cases such as quartiles (dividing data into four equal parts at q = 0.25, 0.5, and 0.75) and percentiles (where q = k/100 for k from 1 to 99). For a sample of n observations, the sample quantile is interpolated from the ordered values, though various conventions exist for handling ties or non-integer positions. Unlike the mean, quantiles depend solely on the relative ordering of data points, which is preserved under strictly monotonic transformations, making them suitable for skewed distributions.

Quantiles play a central role in descriptive statistics for summarizing empirical distributions, such as in box plots that visualize the median and quartiles to detect outliers. They are particularly robust to extreme values: measures like the median remain stable even when outliers are present, unlike the mean. In inferential statistics, quantiles serve as critical values for hypothesis testing, confidence intervals, and assessing distributional assumptions via tools like quantile-quantile (Q-Q) plots. Beyond core statistics, quantiles find applications across fields such as finance (e.g., value-at-risk calculations), risk analysis, and machine learning, where they help quantify tail behaviors and uncertainties.
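As a concrete illustration of these named quantiles, here is a minimal sketch using NumPy (the sample values are arbitrary):

```python
import numpy as np

data = np.array([7, 15, 36, 39, 40, 41])

# Quartiles at q = 0.25, 0.5, 0.75; NumPy's default method is
# linear interpolation between order statistics (Hyndman-Fan type 7).
q1, median, q3 = np.quantile(data, [0.25, 0.5, 0.75])

# The 90th percentile is simply the q = 0.9 quantile.
p90 = np.quantile(data, 0.90)
print(q1, median, q3, p90)
```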

Fundamentals

Definition

For a real-valued random variable X with cumulative distribution function F, the p-quantile, often denoted Q(p) or F^{-1}(p), is defined as the infimum of the set \{x \in \mathbb{R} : F(x) \geq p\} for p \in (0,1). This definition employs the generalized inverse of F to handle cases where F may not be strictly increasing or continuous. Quantiles generalize order statistics by specifying points that partition the distribution such that the probability that X is at most the p-quantile is at least p, and the probability that X is at least the p-quantile is at least 1 - p; for instance, the median corresponds to the 0.5-quantile, dividing the distribution into two parts, each with probability at least 0.5. If F has flat regions, which occur over intervals of zero probability, the p-quantile may not be unique, forming an interval, and the infimum selects the left endpoint of this interval. The term "quantile" first appeared in the statistical literature in the 1938 edition of Statistical Tables for Biological, Agricultural and Medical Research by Ronald A. Fisher and Frank Yates. Percentiles represent a specific instance of quantiles, where p = k/100 for integer k from 1 to 99.
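The generalized inverse can be evaluated numerically for any non-decreasing CDF. The following sketch, assuming a pointwise-evaluable F and an illustrative bracketing interval, locates \inf\{x : F(x) \geq p\} by bisection:

```python
from math import erf, sqrt

def quantile_from_cdf(F, p, lo=-1e6, hi=1e6, tol=1e-9):
    """p-quantile as inf{x : F(x) >= p}, found by bisection.

    Assumes F is non-decreasing with F(lo) < p <= F(hi); the
    bracket [lo, hi] is an illustrative choice, not a general one.
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if F(mid) >= p:
            hi = mid   # mid already satisfies F(x) >= p: shrink from above
        else:
            lo = mid   # mid fails the condition: the quantile lies above
    return hi

# Example: the standard normal CDF written via the error function.
Phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
print(quantile_from_cdf(Phi, 0.975))  # approximately 1.95996
```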

Types and Notation

Quantiles are often categorized by the number of equal-probability divisions they create in a distribution, with specific names for common subdivisions. Quartiles divide the distribution into four equal parts, corresponding to the quantiles at probabilities p = 0.25, p = 0.5, and p = 0.75. Quintiles divide it into five equal parts at p = 0.2, 0.4, 0.6, 0.8. Deciles divide it into ten equal parts at p = 0.1, 0.2, \dots, 0.9, while percentiles divide it into 100 equal parts at p = k/100 for integer k = 1, 2, \dots, 99. Standard notation for the p-quantile of a distribution with cumulative distribution function F is Q(p) or x_p, defined via the generalized inverse Q(p) = F^{-1}(p) = \inf \{ x : F(x) \geq p \}. The median is a special case given by Q(0.5). Interquantile ranges provide measures of variability between specific quantiles; for instance, the interquartile range (IQR) is defined as \text{IQR} = Q(0.75) - Q(0.25), offering a robust indicator of spread that is less sensitive to outliers than the full range. In distributions where the cumulative distribution function has flat regions, such as discrete distributions, the p-quantile may not be unique, leading to a distinction between the lower quantile (the infimum of the interval where F(x) \geq p) and the upper quantile (the supremum of that interval).
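The lower/upper distinction is easy to exhibit for a discrete distribution; here is a minimal sketch with an arbitrary three-point distribution:

```python
import numpy as np

values = np.array([1, 2, 3])
probs = np.array([0.5, 0.25, 0.25])
cdf = np.cumsum(probs)   # F(1) = 0.5, F(2) = 0.75, F(3) = 1.0

p = 0.5
# Lower quantile: first value x with F(x) >= p  -> 1
lower = values[np.searchsorted(cdf, p, side="left")]
# Upper quantile: first value x with F(x) > p   -> 2
upper = values[np.searchsorted(cdf, p, side="right")]
print(lower, upper)      # the 0.5-quantile interval is [1, 2]
```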

Population Quantiles

Calculation Methods

For a general population with cumulative distribution function (CDF) F, the p-th population quantile Q(p) is defined as \inf\{x : F(x) \geq p\}, or equivalently Q(p) = F^{-1}(p). For continuous distributions, this value is unique. For a finite population of size n, quantiles are computed using the sorted values x_{(1)} \leq x_{(2)} \leq \dots \leq x_{(n)}, known as the order statistics. The position of the p-quantile is typically determined by a formula such as k = p(n + 1). If k is an integer, the quantile is exactly x_{(k)}; otherwise, linear interpolation is applied between x_{(\lfloor k \rfloor)} and x_{(\lceil k \rceil)}. One specific non-interpolated approach defines the quantile as Q(p) = x_{(k)} where k = \lceil p(n + 1) \rceil. Hyndman and Fan (1996) outline nine types of quantile definitions, differing primarily in the exact positioning and interpolation schemes, with types 1–3 relying on direct selection from order statistics and types 4–9 incorporating linear interpolation. Type 7, widely adopted as a default in statistical software such as R, uses h = (n - 1)p + 1, sets j = \lfloor h \rfloor, and computes Q(p) = (1 - \gamma) x_{(j)} + \gamma x_{(j+1)} where \gamma = h - j. This corresponds to plotting positions p_k = (k - 1)/(n - 1), so the sample minimum and maximum sit at p = 0 and p = 1. For the median (p = 0.5), the formula naturally handles even and odd n: if n is odd, it selects the middle point x_{((n+1)/2)}; if n is even, it averages the two middle points x_{(n/2)} and x_{(n/2 + 1)}. In edge cases, p = 0 yields the minimum x_{(1)} and p = 1 yields the maximum x_{(n)}.
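A sketch of the type 7 rule, checked against NumPy's default quantile method (which also implements type 7), might look like:

```python
import numpy as np

def quantile_type7(x, p):
    """Hyndman-Fan type 7: h = (n - 1)p + 1 in 1-based indexing."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    h = (n - 1) * p            # same position, expressed 0-based
    j = int(np.floor(h))
    g = h - j                  # fractional part drives the interpolation
    if j + 1 >= n:             # p = 1 (or numerically close): the maximum
        return xs[-1]
    return (1 - g) * xs[j] + g * xs[j + 1]

x = [3, 1, 4, 1, 5, 9, 2, 6]
for p in (0.25, 0.5, 0.75):
    assert np.isclose(quantile_type7(x, p), np.quantile(x, p))
```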

Examples and Properties

To illustrate the calculation of population quantiles, consider a finite population consisting of the values \{1, 2, 3, 4, 5\}. When sorted in non-decreasing order, the cumulative distribution function reaches or exceeds 0.5 at the third value, yielding the median Q(0.5) = 3. For an even-sized population, such as \{1, 2, 3, 4\}, linear interpolation between the two central values is applied to estimate the median, resulting in Q(0.5) = (2 + 3)/2 = 2.5. This method aligns the quantile with the center of the empirical distribution for discrete data.

Population quantiles exhibit several key properties. The quantile function Q(p) is non-decreasing in p: for 0 \leq p_1 \leq p_2 \leq 1, it holds that Q(p_1) \leq Q(p_2), reflecting the ordering of the distribution. Additionally, quantiles demonstrate equivariance under monotone transformations: if h is a strictly increasing function and q_p is the p-th quantile of a random variable Y, then h(q_p) is the p-th quantile of h(Y). This property preserves the relative positioning within the transformed distribution. Quantiles are robust to outliers relative to the mean, as extreme values influence the mean proportionally to their magnitude but affect quantiles only if they alter the ordering near the specified p. For instance, the median remains unchanged unless more than half the population values shift. Regarding the relation to the mean, in symmetric distributions the median equals the mean, providing a central value that aligns both measures of location. In skewed distributions, however, quantiles offer superior insight into the tails by delineating the full spread, whereas the mean can be disproportionately pulled by asymmetry.
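Both the equivariance and robustness properties can be checked numerically; a minimal sketch with arbitrary data:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# Equivariance under a strictly increasing transform h (here exp).
# Exact here because p = 0.5 lands on an order statistic for odd n;
# interpolated sample quantiles are only approximately equivariant.
p = 0.5
assert np.isclose(np.exp(np.quantile(x, p)), np.quantile(np.exp(x), p))

# Robustness: one extreme outlier moves the mean but not the median.
y = np.array([1, 2, 3, 4, 1000])
print(np.median(x), np.median(y))   # 3.0  3.0
print(np.mean(x), np.mean(y))       # 3.0  202.0
```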

Sample Quantiles

Estimation Techniques

Estimation of quantiles from sample data relies on the empirical distribution function, which serves as the foundation for deriving quantile estimators from observed values. The standard sample quantile estimator \hat{Q}(p) for a probability p \in (0,1) is obtained from a random sample X_1, \dots, X_n by ordering the observations to form the order statistics X_{(1)} \leq \cdots \leq X_{(n)} and selecting \hat{Q}(p) = X_{(k)}, where k is typically chosen as the integer closest to p(n+1). This approach assumes the sample is independent and identically distributed (i.i.d.) from the underlying distribution with a continuous cumulative distribution function F. Various interpolation schemes address cases where p(n+1) is not an integer, providing smoother estimates. For instance, linear interpolation between adjacent order statistics is common in many statistical packages, while the Harrell-Davis estimator improves upon this by computing a weighted average of all order statistics, using beta-distribution probabilities to assign weights that emphasize observations near the target quantile; this method yields lower mean squared error, particularly in small samples. Kernel density-based methods estimate quantiles by first constructing a smoothed empirical distribution via kernel density estimation and then inverting the resulting CDF numerically, offering flexibility for non-parametric settings but requiring bandwidth selection to balance bias and variance. Sample quantiles are consistent estimators of population quantiles under mild conditions on the population distribution, converging in probability to the true Q(p) = F^{-1}(p) as n \to \infty. For interior quantiles (0 < p < 1, away from 0 and 1) and smooth densities, the bias of these estimators is of order O(1/n). In the special case of median estimation (p = 0.5), the sample median, defined as X_{((n+1)/2)} for odd n or the average of the two central order statistics for even n, provides a robust point estimate that inherits the consistency and bias properties of general sample quantiles.
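A minimal sketch of the Harrell-Davis weighting described above, assuming SciPy's beta distribution for the weights (the helper name is illustrative):

```python
import numpy as np
from scipy.stats import beta

def harrell_davis(x, p):
    """Harrell-Davis estimate: a beta-weighted average of all
    order statistics, with Beta((n+1)p, (n+1)(1-p)) weights."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    a, b = p * (n + 1), (1 - p) * (n + 1)
    # Weight of x_(i) is P((i-1)/n < B <= i/n) for B ~ Beta(a, b).
    edges = np.arange(n + 1) / n
    w = beta.cdf(edges[1:], a, b) - beta.cdf(edges[:-1], a, b)
    return np.dot(w, xs)

x = np.random.default_rng(0).normal(size=30)
print(harrell_davis(x, 0.5), np.median(x))  # smooth vs. ordinary median
```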

Asymptotic Behavior

Under suitable regularity conditions, the sample quantile \hat{Q}_n(p) for 0 < p < 1 exhibits asymptotic normality when derived from an i.i.d. sample X_1, \dots, X_n from a distribution with cumulative distribution function F that is continuously differentiable at the population quantile Q(p) = F^{-1}(p) with positive density f(Q(p)) > 0. Specifically, \sqrt{n} \left( \hat{Q}_n(p) - Q(p) \right) \xrightarrow{d} N\left(0, \frac{p(1-p)}{f(Q(p))^2}\right) as n \to \infty. This result follows from the Bahadur representation, which provides a linear expansion of the sample quantile: \hat{Q}_n(p) = Q(p) + \frac{p - F_n(Q(p))}{f(Q(p))} + o_p(n^{-1/2}), where F_n denotes the empirical distribution function. The term \sqrt{n} (F_n(Q(p)) - p) converges in distribution to N(0, p(1-p)) by the central limit theorem for the empirical distribution function, and the delta method applies to the differentiable inverse transformation to yield the asymptotic normality of the quantile. For the sample median (p = 0.5), the asymptotic variance simplifies to 1/(4 f(m)^2), where m = Q(0.5) is the median, so \sqrt{n} \left( \hat{Q}_n(0.5) - m \right) \xrightarrow{d} N\left(0, \frac{1}{4 f(m)^2}\right). The key condition remains the existence of a continuous density f with f(m) > 0 at the median, ensuring the representation holds and the normality applies without additional complications from discontinuities. These asymptotic properties facilitate inference, such as constructing approximate confidence intervals for Q(p) via the normal approximation: \hat{Q}_n(p) \pm z_{1-\alpha/2} \sqrt{p(1-p)/(n \hat{f}(\hat{Q}_n(p))^2)}, where \hat{f} is a consistent estimate of the density at \hat{Q}_n(p). Bootstrap resampling offers a robust alternative, generating B bootstrap samples to compute empirical quantiles and derive percentile or bias-corrected intervals that adapt to the underlying distribution without explicit density estimation.
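Both interval constructions can be sketched in a few lines; the sample, the kernel density estimator standing in for \hat{f}, and the bootstrap size are illustrative choices:

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(1)
x = rng.exponential(size=500)
p, alpha = 0.9, 0.05

# Normal-approximation interval from the asymptotic variance formula.
qhat = np.quantile(x, p)
fhat = gaussian_kde(x)(qhat)[0]              # density estimate at qhat
se = np.sqrt(p * (1 - p) / len(x)) / fhat
z = norm.ppf(1 - alpha / 2)
ci_normal = (qhat - z * se, qhat + z * se)

# Bootstrap percentile interval: resample, re-estimate, take quantiles.
boot = np.quantile(rng.choice(x, size=(2000, len(x))), p, axis=1)
ci_boot = tuple(np.quantile(boot, [alpha / 2, 1 - alpha / 2]))
print(ci_normal, ci_boot)
```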

Advanced Topics

Streaming and Approximate Methods

In streaming data environments, computing quantiles presents significant challenges due to the requirement for one-pass processing with limited memory, as data arrives continuously and the entire dataset cannot be stored or sorted multiple times. To address this, several approximate methods have been developed. The Greenwald-Khanna (GK) algorithm maintains ε-approximate quantiles for a stream of length n using O((1/ε) log(ε n)) space in the worst case, ensuring that the true rank of any reported quantile lies within ε n of the estimated rank. This deterministic approach updates the summary incrementally by maintaining a compact set of tuples representing lower and upper bounds on item ranks, merging or trimming them as needed to control space.

The t-digest algorithm constructs mergeable probabilistic sketches for estimating quantiles with tunable accuracy, using bounded space that does not grow with stream length and offering high accuracy even in distribution tails. It builds a collection of weighted centroids clustered by a scale function, allowing efficient online updates and centroid merging, which facilitates aggregation across distributed streams such as in map-reduce frameworks. Adaptations of the Count-Min sketch enable approximate quantile computation by hashing items into bins to estimate frequencies and reconstruct the empirical distribution, achieving ε-approximation with high probability using O((1/ε) log(1/δ) log n) space, where δ is the failure probability.

These methods find applications in monitoring large-scale systems, such as tracking latency percentiles for service level agreements or detecting anomalies in traffic distributions. Key trade-offs include accuracy versus space and update time: smaller ε improves precision but increases resource demands, while randomized sketches like t-digest often achieve better efficiency than deterministic ones like GK for large-scale deployments.

Generalizations

Expectiles represent asymmetric analogs to quantiles, defined as the values that minimize an asymmetric least-squares criterion. For a random variable X and level \tau \in (0,1), the \tau-expectile e_\tau is defined by the equation \tau E[(X - e_\tau)^+] = (1-\tau) E[(e_\tau - X)^+]. This formulation weights positive and negative deviations asymmetrically, making expectiles a weighted analog of the mean, akin to the quantile but sensitive to the magnitude of tail observations. Introduced by Newey and Powell (1987), expectiles are advantageous in estimation due to their smooth objective function, facilitating gradient-based optimization and applications in finance and risk measurement; a numerical sketch appears at the end of this subsection.

Quantile regression generalizes quantiles to conditional distributions, modeling the \tau-th conditional quantile of the response y given covariates x as y = x^T \beta(\tau) + \varepsilon, where \beta(\tau) varies with \tau. The parameters \beta(\tau) are estimated by minimizing the check function loss \sum_i \rho_\tau(y_i - x_i^T \beta), with \rho_\tau(u) = u(\tau - I(u < 0)), solvable via linear programming. Pioneered by Koenker and Bassett (1978), this method enables inference across the full conditional distribution, revealing variations in covariate effects across outcome levels and proving robust to heteroscedasticity and outliers.

Other variants leverage tree structures for quantile estimation, particularly useful for handling hierarchical or networked data. Multivariate quantiles extend the univariate concept to higher dimensions, often constructed via copulas to model joint dependence while preserving marginal distributions; for example, copula-based estimators transform marginal quantiles into joint ones, aiding in multidimensional risk assessment.
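As an illustration of the expectile definition above, the following sketch computes a \tau-expectile by iterating its first-order condition (the helper name and iteration scheme are illustrative, not a standard library routine):

```python
import numpy as np

def expectile(x, tau, max_iter=100, tol=1e-10):
    """tau-expectile via fixed-point iteration on the first-order
    condition tau*E[(X - e)+] = (1 - tau)*E[(e - X)+]."""
    x = np.asarray(x, dtype=float)
    e = x.mean()                 # the 0.5-expectile is exactly the mean
    for _ in range(max_iter):
        w = np.where(x > e, tau, 1 - tau)   # asymmetric weights
        e_new = np.sum(w * x) / np.sum(w)   # weighted-mean update
        if abs(e_new - e) < tol:
            break
        e = e_new
    return e_new

x = np.random.default_rng(2).normal(size=1000)
print(expectile(x, 0.5), x.mean())   # approximately equal
print(expectile(x, 0.9))             # shifted toward the upper tail
```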
In machine learning, quantile forests adapt random forest algorithms to predict conditional quantiles directly, supporting uncertainty quantification through distribution-free prediction intervals, as developed by Athey et al. (2019). In robust statistics, variants like expectiles enhance outlier resistance beyond standard means, with applications in finance for tail-risk analysis. Post-2020 developments in deep learning integrate quantile losses, such as the pinball loss, into neural networks for probabilistic predictions, enabling non-crossing multi-quantile outputs and improved calibration in forecasting tasks; a sketch of the pinball loss follows below.
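Here is a minimal sketch of the pinball loss and its use to fit a linear conditional quantile by subgradient descent; the synthetic data and optimization settings are illustrative, and classical estimation would use linear programming instead:

```python
import numpy as np

def pinball_loss(u, tau):
    """Check (pinball) loss rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (u < 0))

# Synthetic heteroscedastic data: noise spread grows with x, so
# upper conditional quantiles have a steeper slope than the mean.
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=500)
y = 1.0 + 0.5 * x + rng.normal(scale=1 + 0.3 * x)

tau = 0.9
X = np.column_stack([np.ones_like(x), x])
beta = np.zeros(2)
for _ in range(20000):
    u = y - X @ beta
    # Subgradient of the mean pinball loss with respect to beta.
    grad = -X.T @ (tau - (u < 0)) / len(y)
    beta -= 0.01 * grad

# Empirical coverage below the fitted line should be close to tau.
print(beta, np.mean(y <= X @ beta))
```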