Quartile
A quartile is a statistical measure that divides a dataset into four equal parts, each representing 25% of the ordered data values, providing a way to summarize the distribution and identify central tendency and variability.[1] The first quartile (Q1) is the value below which 25% of the data lies, the second quartile (Q2) is the median where 50% of the data falls below, and the third quartile (Q3) marks the point below which 75% of the data is found.[2] To compute quartiles, the data must first be arranged in ascending order, though slight variations in calculation methods exist depending on whether the dataset size is even or odd.[3] Quartiles are fundamental in descriptive statistics for assessing data spread and skewness; the interquartile range (IQR), calculated as Q3 minus Q1, quantifies the variability of the central 50% of the data and is robust to outliers.[4] They form the basis of box plots, which visualize the minimum, Q1, median, Q3, and maximum values to detect outliers—defined as points beyond 1.5 times the IQR from Q1 or Q3—and to compare distributions across groups.[1] In fields like economics and social sciences, quartiles help analyze income distributions, with the Bureau of Labor Statistics using them to report earnings segments for policy insights.[5]
Definitions and Concepts
Formal Definition
In statistics, quartiles are the three values that divide an ordered dataset or a probability distribution into four equal-frequency intervals, each containing 25% of the observations or probability mass. The first quartile, denoted Q_1, is the value below which 25% of the data lies; the second quartile, Q_2, is the median, below which 50% of the data lies; and the third quartile, Q_3, is the value below which 75% of the data lies.[6] These quartiles correspond to specific positions in an ordered sample of n observations, where Q_1 marks the boundary after the first quarter, Q_2 the middle, and Q_3 after the third quarter. The interquartile range (IQR) is defined as the difference between the third and first quartiles, IQR = Q_3 - Q_1, which measures the spread of the central 50% of the data and provides a robust indicator of variability less sensitive to outliers.[7] Quartiles form a specific case within the broader family of quantiles, which generalize divisions at arbitrary proportions. The term "quartile" was first introduced by Donald McAlister in 1879, in a paper whose topic was suggested by Francis Galton, building on earlier concepts of dividing distributions into equal parts.[8]
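To make the definitions concrete, here is a minimal Python sketch using the standard library's statistics.quantiles (Python 3.8+), which returns the three cut points [Q1, Q2, Q3] directly; the dataset is illustrative, not taken from a cited source.

```python
import statistics

# Illustrative ordered sample (n = 6); values are arbitrary.
data = [7, 15, 36, 39, 40, 41]

# statistics.quantiles with n=4 returns the three cut points [Q1, Q2, Q3].
# method='exclusive' places quartiles at positions (n + 1)p;
# method='inclusive' uses (n - 1)p + 1 instead.
q1, q2, q3 = statistics.quantiles(data, n=4, method='exclusive')
iqr = q3 - q1  # spread of the central 50% of the data

print(q1, q2, q3, iqr)  # 13.0 37.5 40.25 27.25
```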
Relation to Quantiles and Percentiles
Quantiles are points in a probability distribution or cumulative distribution function that divide the range of the data into continuous intervals of equal probability, such that the p-quantile is the value below which a proportion p of the observations fall. Quartiles are specific instances of quantiles: the first quartile (Q1) at p = 0.25, the second quartile (Q2, also the median) at p = 0.5, and the third quartile (Q3) at p = 0.75, which together partition the data into four parts each containing 25% of the observations. Percentiles express the same divisions on a 0-to-100 scale: the k-th percentile is the value below which k% of the data lies, making the 25th percentile equivalent to Q1, the 50th to Q2, and the 75th to Q3.

While quantiles and percentiles describe the same underlying division of the data, percentiles are more common in descriptive contexts for conveying relative standing on a percentage scale, whereas quantiles (including quartiles) are preferred in theoretical and computational statistics for their direct proportionality to p. This terminological distinction arises from historical usage in fields like psychometrics for percentiles and general probability theory for quantiles, though the two are mathematically interchangeable. For a dataset of size n, the position of the p-th quantile can be calculated as (n + 1)p, with interpolation applied if the result is not an integer index; quartiles follow this formula as special cases where p takes the values 0.25, 0.5, or 0.75 (see the sketch below). This general approach ensures consistent placement across the distribution, letting quartiles serve as a simplified subset of the broader quantile framework without requiring computation at every possible p.

Quartiles offer advantages over full percentile distributions or arbitrary quantiles because of their simplicity in summarizing central tendency and spread: by focusing on just three points, they capture the data's location and variability efficiently, particularly in exploratory data analysis, without the complexity of examining the entire 0-to-100 percentile range. This targeted utility makes quartiles a foundational tool in statistical summaries, balancing detail with interpretability compared to more granular quantile sets.
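As a simplified illustration of the (n + 1)p positioning rule, the following Python sketch computes a p-quantile by linear interpolation between order statistics; library implementations offer several variants, discussed in the next section.

```python
def quantile_np1(data, p):
    """p-quantile of a sample via the (n + 1)p rule with linear interpolation."""
    xs = sorted(data)
    n = len(xs)
    pos = (n + 1) * p                # 1-based position in the ordered sample
    j, gamma = int(pos), pos - int(pos)
    if j < 1:                        # clamp positions falling outside the sample
        return xs[0]
    if j >= n:
        return xs[-1]
    return xs[j - 1] + gamma * (xs[j] - xs[j - 1])

data = [15, 20, 35, 40, 50]
# Quartiles are the p = 0.25, 0.5, 0.75 special cases of the p-quantile:
print([quantile_np1(data, p) for p in (0.25, 0.5, 0.75)])  # [17.5, 35.0, 45.0]
```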
Calculation Methods
Methods for Discrete Data
Computing quartiles for discrete data, a finite set of ordered observations, presents challenges because there may be no exact observation at the 25%, 50%, or 75% position, particularly in small samples. Multiple established methods therefore exist for determining quartile positions and values, differing in how they handle fractional positions: either by selecting a single data point or by interpolating between adjacent points. These variations can yield slightly different quartile values, affecting subsequent analyses such as interquartile ranges, though the differences diminish with larger sample sizes. The methods described here are drawn from standard statistical definitions for sample quantiles applicable to discrete datasets.[9]

Method 1 (Inclusive Method): The position is calculated as (n + 1) \times p, where n is the sample size and p is the quantile probability (0.25 for Q1, 0.50 for Q2, 0.75 for Q3). If the position is an integer k, the quartile is the k-th ordered value x_{(k)}; otherwise, for a fractional position j + \gamma, where j is the integer part and 0 < \gamma < 1, linear interpolation is used: Q_p = x_{(j)} + \gamma (x_{(j+1)} - x_{(j)}). This method treats the sample as if it spans positions 1 to n + 1, providing a symmetric and often unbiased estimate for even n.[9]

Method 2 (Exclusive Method): The position is computed as n \times p + 0.5. For a non-integer position j + \gamma, the quartile is interpolated as Q_p = (1 - \gamma) x_{(j)} + \gamma x_{(j+1)}; for an integer position, the value at that position is used. This method aligns with Tukey's hinges and is common in exploratory data analysis, centering the position slightly differently to mimic continuous distributions.[9]

Method 3 (Nearest Rank Method): The position is round((n + 1) \times p), with ties rounded up. The quartile is simply the value at this rounded position, without interpolation. This discrete selection avoids introducing values not observed in the data, making it suitable for strictly categorical or ranked data.[9]

Method 4 (Weighted Average Method): The position follows (n - 1) \times p + 1, with linear interpolation for fractional parts as in Methods 1 and 2: Q_p = (1 - \gamma) x_{(j)} + \gamma x_{(j+1)}. This amounts to linear interpolation of the empirical cumulative distribution function and is widely used in software for its asymptotic unbiasedness and monotonicity properties in discrete settings.[9]

| Method | Position Formula | Interpolation Used | Pros | Cons |
|---|---|---|---|---|
| 1 (Inclusive) | (n+1) p | Yes, linear | Avoids endpoint bias in even n; smooth estimates | May produce values not in dataset; more complex for small n |
| 2 (Exclusive) | n p + 0.5 | Yes, linear | Intuitive for median; consistent with boxplot hinges | Can bias toward center in odd n; interpolation artifacts |
| 3 (Nearest Rank) | round((n+1) p) | No | Simple; always selects actual data points | Jumpy for small changes in data; potential bias in ties |
| 4 (Weighted Average) | (n-1) p + 1 | Yes, linear | Asymptotically unbiased; monotonic | Sensitive to order in small samples; non-integer results |
As a worked example, consider the ordered dataset {1, 3, 5, 7, 9} (n = 5); the four methods give the following quartiles (reproduced in the code sketch after this list):
- Method 1: Q1 position = 1.5, so 1 + 0.5(3-1) = 2; Q2 = 5; Q3 position = 4.5, so 7 + 0.5(9-7) = 8.
- Method 2: Q1 position = 1.75, so 0.25 \times 1 + 0.75 \times 3 = 2.5; Q2 = 5; Q3 position = 4.25, so 0.75 \times 7 + 0.25 \times 9 = 7.5.
- Method 3: Q1 position = round(1.5) = 2, so 3; Q2 = 5; Q3 position = round(4.5) = 5, so 9 (using round half up).
- Method 4: Q1 position = 2, so 3; Q2 = 5; Q3 position = 4, so 7.
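The Python sketch below implements the four positioning rules and reproduces the values listed above; it is an illustration of the formulas, not a reference implementation.

```python
import math

def quartile(data, p, method):
    """Sample quartile (p in {0.25, 0.5, 0.75}) under the four rules above."""
    xs = sorted(data)
    n = len(xs)
    if method == 3:                          # nearest rank, ties rounded up
        pos = math.floor((n + 1) * p + 0.5)
        return xs[min(max(pos, 1), n) - 1]
    if method == 1:
        pos = (n + 1) * p                    # inclusive rule
    elif method == 2:
        pos = n * p + 0.5                    # exclusive rule
    else:
        pos = (n - 1) * p + 1                # weighted-average rule
    j = min(max(int(pos), 1), n)             # integer part, clamped to 1..n
    gamma = pos - int(pos)                   # fractional part
    if gamma == 0 or j == n:
        return xs[j - 1]
    return xs[j - 1] + gamma * (xs[j] - xs[j - 1])

data = [1, 3, 5, 7, 9]
for m in (1, 2, 3, 4):
    print(m, [quartile(data, p, m) for p in (0.25, 0.50, 0.75)])
# 1 [2.0, 5, 8.0]
# 2 [2.5, 5, 7.5]
# 3 [3, 5, 9]
# 4 [3, 5, 7]
```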
Methods for Continuous Distributions
For continuous random variables, quartiles are defined probabilistically through the cumulative distribution function (CDF) F(x) = P(X \leq x): the p-th quantile Q_p is the value satisfying F(Q_p) = p.[10] Specifically, the first quartile Q_1 solves F(Q_1) = 0.25, the second quartile Q_2 (the median) solves F(Q_2) = 0.50, and the third quartile Q_3 solves F(Q_3) = 0.75.[10] This approach uses the quantile function, defined as the generalized inverse of the CDF: Q_p = F^{-1}(p) = \inf \{ x : F(x) \geq p \}.[11] For strictly increasing and continuous CDFs, this inverse is unique, allowing direct computation without ambiguity.[10]

Explicit formulas for quartiles are available for many standard continuous distributions, facilitating analytical solutions. For the uniform distribution on the interval [a, b] with a < b, the CDF is F(x) = \frac{x - a}{b - a} for x \in [a, b], so the quantile function is Q_p = a + p(b - a);[12] thus the first quartile is Q_1 = a + 0.25(b - a). For the normal distribution N(\mu, \sigma^2), the first quartile is Q_1 = \mu + \Phi^{-1}(0.25) \sigma, where \Phi is the standard normal CDF; numerically, \Phi^{-1}(0.25) \approx -0.6745, yielding Q_1 \approx \mu - 0.6745 \sigma.[13] For the exponential distribution with rate parameter \lambda > 0, the CDF is F(x) = 1 - e^{-\lambda x} for x \geq 0, so the quantile function is Q_p = -\frac{\ln(1 - p)}{\lambda}; the first quartile is therefore Q_1 = -\frac{\ln(0.75)}{\lambda} \approx \frac{0.2877}{\lambda}.[14]

In contrast to methods for discrete data, which rely on positional indexing within ordered finite samples and often require interpolation to estimate intermediate values, computations for continuous distributions yield exact theoretical quartiles directly from the inverse CDF. These can be derived analytically when closed-form inverses exist, as in the uniform and exponential cases, or approximated via numerical methods such as root-finding on the CDF equation or lookup tables for distributions without explicit inverses, such as the normal.[15] When data arise as samples from a continuous underlying distribution, empirical quartiles provide approximations, and kernel density estimation can refine the CDF estimate for more precise quantile inversion, though the theoretical definitions remain the foundation for understanding distributional properties.
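These closed-form results can be reproduced numerically with SciPy's percent-point function (ppf), its name for the inverse CDF; the parameter choices below are illustrative.

```python
from scipy import stats

ps = [0.25, 0.50, 0.75]  # probabilities for Q1, Q2, Q3

# Uniform on [2, 10]: Q_p = a + p(b - a) -> [4, 6, 8]
print(stats.uniform(loc=2, scale=8).ppf(ps))

# Standard normal: Q1 = Phi^{-1}(0.25) ~ -0.6745, Q3 ~ +0.6745
print(stats.norm(loc=0, scale=1).ppf(ps))

# Exponential with rate lambda = 2 (SciPy parameterizes by scale = 1/lambda):
# Q_p = -ln(1 - p)/lambda -> [0.1438, 0.3466, 0.6931]
print(stats.expon(scale=0.5).ppf(ps))
```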
Applications in Statistics
Descriptive Statistics and Visualization
Quartiles are integral to the five-number summary, a foundational tool in descriptive statistics introduced by John W. Tukey for summarizing data distributions without assuming normality. This summary comprises the dataset's minimum value, first quartile (Q1, the 25th percentile), median (Q2, the 50th percentile), third quartile (Q3, the 75th percentile), and maximum value, offering a robust alternative to the mean and standard deviation, which can be distorted in skewed distributions. For example, in income data, often positively skewed by a small number of high earners, the median and quartiles give a more representative view of central tendency and variability than the mean, which outliers can inflate dramatically.[16]

A primary visualization employing quartiles is the box-and-whisker plot, pioneered by Tukey to depict the five-number summary graphically and facilitate exploratory data analysis. In this plot, a rectangular box spans from Q1 to Q3, representing the interquartile range (IQR = Q3 - Q1) that encompasses the central 50% of the data; a line within the box marks the median. Whiskers extend from the box edges to the minimum and maximum values (or only as far as 1.5 \times IQR in some variants), providing a schematic overview of the data's spread and asymmetry without emphasizing extremes. This construction highlights the core of the distribution, making it well suited to detecting skewness or multimodality in datasets such as exam scores or environmental measurements.

Quartiles also enhance histograms: overlaying vertical lines at Q1, the median, and Q3 delineates the quartile spans against the frequency distribution, revealing symmetry, central clustering, or tail heaviness. For instance, in a histogram of household incomes, the gap between the median and Q3 may be visibly wider than that between Q1 and the median, underscoring right skewness: the bulk of values cluster at the lower end while a long upper tail stretches toward high incomes. Such overlays aid qualitative assessment of the data's shape beyond numerical summaries alone.[1]

To compare distributions across groups, side-by-side box-and-whisker plots juxtapose medians, IQR widths, and whisker lengths, enabling quick identification of differences in location, scale, or variability, such as differing income spreads between urban and rural populations. This approach is particularly advantageous for its robustness to outliers: quartile-based measures resist distortion from extreme values, unlike the variance or standard deviation, preserving the integrity of the summary in real-world, non-ideal data.[17][18]
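As a brief illustration, the sketch below computes a five-number summary with NumPy and draws the corresponding box plot with Matplotlib; the data are invented, and Matplotlib's default whiskers extend to the most extreme points within 1.5 \times IQR.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented right-skewed sample (e.g., incomes in thousands).
data = np.array([22, 25, 26, 28, 30, 31, 33, 35, 38, 41, 47, 90])

# Five-number summary: minimum, Q1, median, Q3, maximum.
q1, med, q3 = np.quantile(data, [0.25, 0.50, 0.75])
print(data.min(), q1, med, q3, data.max())

# Box from Q1 to Q3 with a line at the median; points beyond the
# whiskers are drawn individually as potential outliers.
plt.boxplot(data, vert=False)
plt.show()
```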
Outlier Detection and Robust Measures
Quartiles play a central role in outlier detection through Tukey's fences, a method introduced by John Tukey in his 1977 work on exploratory data analysis. These fences define boundaries beyond which data points are flagged as potential outliers: the lower fence is Q1 - 1.5 \times IQR and the upper fence is Q3 + 1.5 \times IQR, where IQR = Q3 - Q1.[19] Values falling outside these fences are considered mild outliers, while those beyond Q1 - 3 \times IQR or Q3 + 3 \times IQR are deemed extreme outliers, providing a tiered approach to anomaly identification.[19]

Another quartile-linked technique is the modified Z-score, which improves outlier detection by using robust estimates of location and scale.[20] It is computed as 0.6745 \times \frac{x_i - \tilde{x}}{\mathrm{MAD}}, where \tilde{x} is the median and MAD is the median absolute deviation, \mathrm{MAD} = \mathrm{median}(|x_i - \tilde{x}|); the constant 0.6745 scales it to match the standard Z-score under normality, since MAD ≈ 0.6745σ for a normal distribution.[21] The connection to quartiles is that, under normality, MAD ≈ 0.5 \times IQR, so the IQR can approximate the scale when the MAD is unavailable, leveraging quartile-based robustness in non-parametric settings.[21] Values with an absolute modified Z-score exceeding 3.5 are typically flagged as outliers, as recommended by Iglewicz and Hoaglin.[21]

In robust statistics, quartiles contribute to outlier-resistant measures such as trimmed means and winsorizing, which prioritize central data over extremes. A 50% trimmed mean, for instance, is the arithmetic mean of the values between Q1 and Q3, discarding the lower and upper quarters to mitigate outlier influence. Winsorizing instead replaces values below Q1 with Q1 and values above Q3 with Q3, preserving the sample size while capping extremes. These methods offer higher efficiency than the median (which has about 64% efficiency relative to the mean under normality) but lower efficiency than the mean on uncontaminated normal data; a 20% trimmed mean, for example, achieves roughly 95% efficiency while remaining robust to up to 20% contamination.

Consider the dataset {1, 2, 3, 4, 5, 6, 7, 8, 9, 100}: the median is 5.5, Q1 = 2.75, Q3 = 8.25, and IQR = 5.5, yielding fences at -5.5 (lower) and 16.5 (upper).[19] Tukey's fences flag 100 as a mild outlier (and as extreme under the 3 \times IQR rule), while the modified Z-score (MAD = 2.5, modified Z ≈ 25.5 for 100) confirms that it exceeds the 3.5 threshold.[21] However, these techniques assume roughly unimodal data and may misflag legitimate points in multimodal or heavily skewed distributions, limiting their applicability without further validation.[20] Box plots visualize the fences by extending whiskers to the most extreme points within the inner bounds and plotting points beyond them individually as outliers.[19]
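A minimal NumPy sketch of both detectors, reproducing the example above; the method='weibull' option (NumPy 1.22+) matches the (n + 1)p quartile positions used in the example.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Tukey's fences from the quartiles.
q1, q3 = np.quantile(data, [0.25, 0.75], method='weibull')  # 2.75, 8.25
iqr = q3 - q1                                               # 5.5
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr               # -5.5, 16.5
print(data[(data < lower) | (data > upper)])                # [100]

# Modified Z-scores: 0.6745 * (x - median) / MAD.
med = np.median(data)                   # 5.5
mad = np.median(np.abs(data - med))     # 2.5
mod_z = 0.6745 * (data - med) / mad     # ~25.5 for the value 100
print(data[np.abs(mod_z) > 3.5])        # [100]
```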
Implementation in Software
Spreadsheet Applications
In Microsoft Excel, the legacy QUARTILE function computes quartiles with the syntax QUARTILE(array, quart), where array is the range of numeric data and quart specifies the quartile: 0 for the minimum value, 1 for the first quartile (Q1), 2 for the median (Q2), 3 for the third quartile (Q3), and 4 for the maximum value.[22] This function employs an inclusive interpolation method equivalent to the modern QUARTILE.INC, positioning values via the formula k = p × (n - 1) + 1, where p is the percentile (0.25 for Q1, 0.5 for Q2, 0.75 for Q3) and n is the number of data points, followed by linear interpolation if k is not an integer.[23] Introduced before Excel 2010, QUARTILE remains available but is deprecated in favor of the newer functions.[22]

Those newer functions, QUARTILE.INC and QUARTILE.EXC, have offered refined calculations since Excel 2010. QUARTILE.INC mirrors the legacy function's inclusive approach, using percentiles from 0 to 1 inclusive for compatibility with minimum and maximum values.[24] In contrast, QUARTILE.EXC adopts an exclusive method, applying percentiles from 0 to 1 exclusive and positioning via k = p × (n + 1), which may yield different results for small datasets and returns a #NUM! error when the requested quartile would require extrapolation beyond the data range.[25][26] To illustrate the differences, consider the dataset {1, 2, 3, 4, 5} (n = 5). For Q1:

| Function | Calculation Position (k) | Result |
|---|---|---|
| QUARTILE.INC | 0.25 × 4 + 1 = 2 | 2 |
| QUARTILE.EXC | 0.25 × 6 = 1.5 | 1.5 |
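For readers cross-checking spreadsheet output in code, the two Excel variants correspond to NumPy's 'linear' and 'weibull' estimation methods (names per NumPy 1.22+); a short sketch:

```python
import numpy as np

data = [1, 2, 3, 4, 5]

# QUARTILE.INC positions at p(n - 1) + 1 -> NumPy's default 'linear' method;
# QUARTILE.EXC positions at p(n + 1)     -> NumPy's 'weibull' method.
print(np.quantile(data, 0.25, method='linear'))   # 2.0 (matches QUARTILE.INC)
print(np.quantile(data, 0.25, method='weibull'))  # 1.5 (matches QUARTILE.EXC)
```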
Programming Environments
In statistical programming environments, quartiles are commonly computed using built-in functions that support various interpolation methods for discrete data, ensuring flexibility across different analysis needs.[35][36] In R, the quantile() function from the base stats package computes sample quantiles, including quartiles, corresponding to specified probabilities. It offers nine types (1 through 9) of quantile estimation, which align with different methods for discrete data, such as the inverse of the empirical CDF (type 1) or various interpolation schemes; the default is type 7, which uses linear interpolation between order statistics.[35] For example, to calculate the first, second, and third quartiles of a vector x, the command is quantile(x, probs = c(0.25, 0.5, 0.75), type = 1), where type 1 provides a basic nearest-rank approximation without interpolation.[35] The interquartile range (IQR) can then be derived as IQR(x, type = 1) or manually as quantile(x, 0.75, type = 1) - quantile(x, 0.25, type = 1).[35]
Python's NumPy library provides numpy.percentile() and numpy.quantile() for quartile computation, with the latter accepting probabilities between 0 and 1 (e.g., 0.25 for the first quartile) and the former using percentages (e.g., 25). Both functions select the estimation rule via the method parameter (named interpolation in older NumPy releases), with options including 'linear' (the default), 'lower', 'higher', 'midpoint', and 'nearest' to handle positions between data points. They also accommodate multidimensional arrays via the axis parameter and handle NaN values through the dedicated functions numpy.nanpercentile() and numpy.nanquantile(), which ignore NaNs during computation. For the IQR of an array arr, one can use np.quantile(arr, [0.25, 0.75], method='linear') and subtract the results.
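A short sketch of these options on a small array containing a missing value:

```python
import numpy as np

arr = np.array([[1.0, 2.0, 3.0, 4.0],
                [5.0, np.nan, 7.0, 8.0]])

# Row-wise quartiles, ignoring the NaN in the second row.
print(np.nanquantile(arr, [0.25, 0.5, 0.75], axis=1, method='linear'))

# IQR of the flattened data, NaN excluded: Q3 - Q1 = 6.0 - 2.5 = 3.5.
q1, q3 = np.nanquantile(arr, [0.25, 0.75])
print(q3 - q1)
```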
MATLAB's prctile() function computes percentiles equivalent to quartiles by specifying percentages such as [25 50 75], as in prctile(data, [25 50 75]). It employs linear interpolation by default for values between sorted data points, differing from some nearest-rank approaches in other languages.[36] The IQR is obtainable via prctile(data, 75) - prctile(data, 25).[36]
Compared to Python's NumPy implementations, which prioritize efficiency for large-scale data processing through vectorized operations, R's quantile() provides greater flexibility with its nine explicit types, allowing precise matching to statistical conventions.[37] Best practices for reproducibility include explicitly specifying the interpolation type or method in function calls, as defaults may vary across environments and versions.[37] Additionally, integrating quartile computations with visualization libraries enhances analysis; for instance, in R, ggplot2 can generate box plots using geom_boxplot(), which internally relies on quantile() for whisker and quartile rendering.