Interquartile range
The interquartile range (IQR), also known as the midspread or middle 50%, is a robust measure of statistical dispersion that quantifies the spread of the middle 50% of a dataset by calculating the difference between the third quartile (Q3, the 75th percentile) and the first quartile (Q1, the 25th percentile).[1][2] Unlike the full range, which can be heavily influenced by extreme outliers, the IQR focuses solely on the central portion of the data, providing a more stable indicator of variability that is less sensitive to anomalies.[3][2]
To compute the IQR, a dataset is first ordered from lowest to highest value, after which the quartiles are determined: Q1 divides the lower half at the 25th percentile, and Q3 divides the upper half at the 75th percentile, with the median (Q2) marking the 50th percentile in between.[1] The formula is simply IQR = Q3 - Q1, often derived using methods like the true index location for precise percentile positioning in continuous data.[3] For example, in a dataset with Q1 = 80 and Q3 = 90, the IQR equals 10, indicating moderate spread in the central values.[1] This measure is particularly valuable in descriptive statistics for comparing distributions across groups, as a larger IQR signifies greater variability in the core data.[2]
The IQR plays a central role in exploratory data analysis, notably within box plots (or box-and-whisker plots), where it forms the length of the central "box" to visualize the five-number summary: minimum, Q1, median, Q3, and maximum.[3][2] It is also widely used for outlier detection via the 1.5-IQR rule, classifying values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR as potential outliers, which helps in identifying data anomalies without assuming normality.[1] Due to its non-parametric nature, the IQR is applicable to skewed or non-normal distributions and is preferred in fields like finance, environmental science, and quality control for summarizing variability robustly.[2][3]
Fundamentals
Definition
The interquartile range (IQR), also known as the midspread or middle 50%, is a measure of statistical dispersion calculated as the difference between the third quartile (Q3) and the first quartile (Q1) of a dataset.[4] Q1 corresponds to the 25th percentile, marking the value below which 25% of the observations fall when the data are ordered from lowest to highest, while Q3 is the 75th percentile, below which 75% of the observations lie.[1] Thus, the IQR quantifies the spread of the central 50% of the data, providing insight into the variability within the core distribution without considering extreme values.[2] The formula for the interquartile range is IQR = Q3 - Q1. This simple subtraction yields a non-negative value that directly reflects the width of the interquartile interval, offering a straightforward indicator of data spread in ordered distributions.[4] Unlike the full range or standard deviation, the IQR is particularly robust to outliers and extreme values, as it focuses solely on the middle half of the data and ignores the lowest 25% and highest 25% of observations.[5] This resistance to skewness and anomalies makes it a preferred measure of dispersion in datasets with potential irregularities, such as those in empirical sciences.[6] The concept of the interquartile range was introduced by Francis Galton in his 1882 "Report of the Anthropometric Committee," where he employed it as part of quartile-based statistical analysis for measuring variation in human measurements.[7] Galton's work in the late 19th century laid foundational groundwork for modern descriptive statistics, emphasizing quartile methods over more sensitive alternatives.[8]
Quartiles
Quartiles are specific quantiles that divide an ordered dataset into four equal parts, each containing 25% of the observations. The first quartile, denoted as Q1, marks the value below which 25% of the data lies; the second quartile, Q2, is the median, with 50% of the data below it; and the third quartile, Q3, indicates the value below which 75% of the data falls.[9][10] These divisions provide a framework for understanding the distribution's central and spread characteristics without assuming normality. In notation, Q1 separates the lowest 25% of the data from the remaining 75%, while Q3 delineates the upper 25% from the lower 75%, with the span between Q1 and Q3 encompassing the middle 50% of the observations. The interquartile range, defined as the distance from Q1 to Q3, captures this central portion.[11][9] The positioning of quartiles in an ordered dataset depends on the sample size. For odd sample sizes, quartiles often align with specific data points or averages of adjacent points in the sorted list; for even sample sizes, interpolation—such as averaging the two middle values for the median or linearly interpolating between points for Q1 and Q3—ensures precise placement, particularly in continuous data distributions.[12][9] Visually, quartiles are represented on a number line by marking Q1, Q2, and Q3 to illustrate the data's segmentation into quarters, highlighting the relative positions and gaps between them. In a cumulative distribution function, these points correspond to the 0.25, 0.50, and 0.75 probability levels, providing a graphical view of the data's progression.[10][11]
Calculation
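The two interpolation conventions described above can be tried directly with Python's standard-library statistics.quantiles, which supports an "exclusive" method (quartile positions at (n+1)p) and an "inclusive" one (positions at 1 + (n-1)p). The eight-value dataset below is purely illustrative:

```python
from statistics import quantiles

data = [3, 5, 7, 8, 12, 13, 14, 21]  # already sorted; n = 8

# Exclusive method: cut points at positions (n+1)p, linearly interpolated.
q1_ex, med_ex, q3_ex = quantiles(data, n=4, method="exclusive")

# Inclusive method: cut points at positions 1 + (n-1)p.
q1_in, med_in, q3_in = quantiles(data, n=4, method="inclusive")

print(q1_ex, med_ex, q3_ex)  # 5.5 10.0 13.75
print(q1_in, med_in, q3_in)  # 6.5 10.0 13.25
```

Note that the two conventions agree on the median but place Q1 and Q3 differently, so any IQR derived from them differs as well.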
Methods for Quartiles
Determining quartiles from a dataset requires sorting the data in ascending order and identifying specific positions within the ordered list. The general step-by-step process involves calculating the position for the first quartile (Q1) at (n+1)/4 and the third quartile (Q3) at 3(n+1)/4, where n is the sample size; if these positions are not integers, interpolation is typically applied between adjacent data points.[13] For discrete data or ties, methods may select the nearest observation or average values at the position to handle multiplicity.[14] One common approach is Tukey's hinges method, an inclusive technique that divides the dataset into halves while including the median in both the lower and upper portions for odd-sized samples. To compute, first find the median at position (n+1)/2; then determine the hinge depths as (n+1)/4 from each end, taking the median of the respective halves (including the overall median if n is odd). For example, with an odd n=9, the lower hinge (Q1) is the median of the first five values, and the upper hinge (Q3) is the median of the last five. This method, introduced by John Tukey in exploratory data analysis, avoids interpolation and yields robust values for box plots.[15][16] In contrast, the Moore and McCabe method employs an exclusive approach, splitting the data into halves while excluding the median from both for odd n to ensure equal-sized groups. After sorting, compute the median (Q2); then take Q1 as the median of the lower half (first ⌊n/2⌋ values) and Q3 as the median of the upper half (last ⌊n/2⌋ values). For even n, the halves are naturally equal without exclusion.
This technique, detailed in introductory statistics texts, often results in Q1 and Q3 positioned farther from the median than in inclusive methods, particularly for small odd samples.[17][16] A more systematic framework is provided by Hyndman and Fan, who outlined nine algorithms for sample quantiles, including quartiles, emphasizing properties like continuity and bias reduction. These methods compute the position as g(p, n) = (n-1)p + 1 or variants, with an interpolation fraction between the floor and ceiling indices, where p = 0.25 for Q1 and p = 0.75 for Q3. Widely adopted are type 7 (the default in R, using g(p, n) = (n-1)p + 1) and type 8 (recommended by the authors for median-unbiased estimates, adjusting with n + 1/3). Type 6, another common variant, uses p(n+1). These continuous methods handle ties by linear weighting and are preferred in computational statistics for large datasets.[13][18][14] Software implementations vary, leading to differences in quartile values across tools. In R, the quantile() function defaults to type 7 but allows selection up to type 9, aligning with Hyndman and Fan's classifications.[18] Excel's QUARTILE.INC uses linear interpolation over the inclusive range [0, 1], computing positions as 1 + (n-1)p, while QUARTILE.EXC excludes endpoints for [1/(n+1), n/(n+1)].[19] Python's NumPy percentile function, used for quartiles with p values of 25 and 75, defaults to linear interpolation (method='linear'), weighting between adjacent points at non-integer positions, with options for other schemes like 'nearest' for discrete handling.[20] These variations underscore the need to specify the method when comparing results.[21]
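NumPy exposes several of the Hyndman–Fan estimators through the method argument of numpy.percentile (NumPy ≥ 1.22; 'linear' corresponds to type 7, 'weibull' to type 6, and 'median_unbiased' to type 8). A sketch comparing them on a toy dataset:

```python
import numpy as np

x = np.arange(1, 11)  # toy dataset: the integers 1..10, already sorted

results = {}
for method, hf_type in [("linear", 7), ("weibull", 6), ("median_unbiased", 8)]:
    # Each method interpolates the 25th and 75th percentiles differently.
    q1, q3 = np.percentile(x, [25, 75], method=method)
    results[hf_type] = (q1, q3, q3 - q1)
    print(f"type {hf_type} ({method}): Q1={q1:.4f}  Q3={q3:.4f}  IQR={q3 - q1:.4f}")
```

Even on this simple dataset the three types produce three different IQR values (4.5, 5.5, and about 5.17), which is why comparisons across software should state the method used.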
Computing the IQR
The interquartile range (IQR) is calculated by subtracting the first quartile (Q1) from the third quartile (Q3) of a dataset, providing a measure of the spread of the central 50% of the data.[4] This simple subtraction formula, IQR = Q3 - Q1, assumes that Q1 and Q3 have already been determined from the sorted data using a consistent quartile estimation method.[1] To apply it step-by-step, first sort the dataset in ascending order, identify the positions for Q1 (25th percentile) and Q3 (75th percentile), compute those values, and then perform the subtraction; the result is robust for most continuous distributions but requires care with discrete or tied data.[22] For a worked example with a small hypothetical dataset of n=8 sorted values—3, 5, 7, 8, 12, 13, 14, 21—Q1 is the median of the lower half (values 3, 5, 7, 8), which is the average of the 2nd and 3rd values: (5 + 7)/2 = 6.[1] Similarly, Q3 is the median of the upper half (12, 13, 14, 21), the average of the 2nd and 3rd values: (13 + 14)/2 = 13.5.[1] Thus, IQR = 13.5 - 6 = 7.5, capturing the variability excluding the lowest and highest values.[1] The choice of quartile computation method can influence the IQR value, especially in small or discrete datasets, as there are at least nine common definitions for sample quantiles that differ in interpolation and rounding approaches.[23] For instance, the median-unbiased method (type 8 in Hyndman and Fan's classification) tends to yield more consistent IQR estimates for limited data compared to inverse empirical methods.[23] For small sample sizes (n < 4), the IQR depends strongly on the quantile method and is generally unreliable, as the data lack sufficient points to separate distinct quartiles meaningfully. For n=1, Q1 and Q3 coincide, resulting in IQR = 0.
For n=2 and n=3, the IQR likewise depends on the method: discrete approaches (types 1–3) return the full range, whereas interpolation-based methods such as R's default (type 7) return half the range; type 7, for instance, yields IQR = 0.5 × range in both cases.[24][13] The IQR demonstrates numerical stability and resistance to extreme values, as it ignores the lowest 25% and highest 25% of the data, making it less affected by outliers than the full range.[25] In the earlier example, the full range is 21 - 3 = 18, heavily influenced by the extremes, whereas the IQR of 7.5 remains unchanged even if an extreme value is introduced, for example by replacing 21 with 100.[25]
Applications
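The worked example, and the method dependence just described, can be reproduced in a few lines of Python (NumPy is used here for convenience; the dataset is the hypothetical one from the example):

```python
import numpy as np

data = sorted([3, 5, 7, 8, 12, 13, 14, 21])  # n = 8
half = len(data) // 2

# Median-of-halves (as in the worked example): split the sorted data
# and take the median of each half.
q1 = np.median(data[:half])   # median of 3, 5, 7, 8   -> 6.0
q3 = np.median(data[-half:])  # median of 12, 13, 14, 21 -> 13.5
print(q3 - q1)                # 7.5

# NumPy's default linear interpolation (Hyndman-Fan type 7) places the
# quartiles slightly differently, so the resulting IQR differs too.
q1_np, q3_np = np.percentile(data, [25, 75])
print(q3_np - q1_np)          # 6.75
```

Both answers are legitimate; they simply correspond to different quartile conventions applied to the same data.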
Measure of Variability
The interquartile range (IQR) serves as a non-parametric measure of statistical dispersion, specifically quantifying the spread of the central 50% of a dataset by calculating the difference between the third quartile (Q3) and the first quartile (Q1).[2] This approach focuses on the middle half of the data, making it particularly suitable for datasets that exhibit skewness or deviations from normality, where traditional measures may be distorted by extreme values.[26] As a robust estimator, the IQR provides a reliable indicator of variability without assuming an underlying distribution, which enhances its utility in exploratory data analysis across various fields such as economics, biology, and social sciences.[27] In comparison to other measures of variability, the IQR offers distinct advantages over the range, which simply subtracts the minimum from the maximum value and is highly sensitive to outliers, potentially exaggerating the perceived spread in contaminated datasets.[28] Unlike the standard deviation, which relies on the mean and assumes approximate normality for meaningful interpretation, the IQR remains stable even in non-normal distributions and avoids the influence of tail extremes.[26] Another robust alternative, the median absolute deviation (MAD), measures spread around the median using absolute differences but typically scales differently from the IQR, with the latter often preferred for its quartile-based focus on interpercentile intervals; both outperform the standard deviation in the presence of outliers.[29] A larger IQR signifies greater variability within the middle 50% of the data, offering an intuitive interpretation of how dispersed the core observations are relative to the median, and it is commonly included in descriptive statistics summaries alongside measures of central tendency like the median.[30] This makes it valuable for summarizing datasets in reports or visualizations, where it helps convey the typical range of values without 
being swayed by anomalies. However, the IQR has limitations, as it ignores the full extent of the data by excluding the lowest 25% and highest 25%, thereby failing to capture the overall range or the behavior in the tails of the distribution.[31] Additionally, its non-parametric nature renders it less amenable to further mathematical manipulation than variance-based measures.[27]
Outlier Detection
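The contrast with the standard deviation can be made concrete: in the sketch below a single maximum is replaced by an extreme value, leaving the IQR untouched while the standard deviation inflates (the datasets are hypothetical, and NumPy's default type 7 quartiles are assumed):

```python
import statistics as st
import numpy as np

clean = [3, 5, 7, 8, 12, 13, 14, 21]
contaminated = [3, 5, 7, 8, 12, 13, 14, 100]  # one extreme value swapped in

def iqr(values):
    # IQR from NumPy's default (type 7, linear interpolation) quartiles.
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

print(iqr(clean), iqr(contaminated))            # identical: only the tail moved
print(st.stdev(clean), st.stdev(contaminated))  # inflated by the outlier
```

Because the quartiles sit inside the middle half of the data, moving the maximum cannot affect them, whereas every squared deviation feeds into the standard deviation.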
The interquartile range (IQR) serves as a robust tool for identifying outliers in univariate data sets by establishing bounds that highlight values deviating significantly from the central 50% of the data. The standard method, known as Tukey's fences, defines an outlier as any data point falling below the first quartile (Q1) minus 1.5 times the IQR or above the third quartile (Q3) plus 1.5 times the IQR. This approach, introduced by John W. Tukey in his seminal work on exploratory data analysis, leverages the non-parametric nature of quartiles to resist the influence of extreme values themselves.[32] To apply this method, one first computes the IQR as Q3 minus Q1, then calculates the lower fence as Q1 - 1.5 × IQR and the upper fence as Q3 + 1.5 × IQR. Data points outside these fences are flagged as potential outliers for further investigation.[1] This process is particularly useful in exploratory data analysis, where it helps prioritize anomalous observations without assuming a specific distribution.[33] Variations of the rule adjust the multiplier to distinguish between mild and extreme outliers or to suit domain-specific tolerances. For instance, a multiplier of 1.5 identifies mild outliers, while 3.0 flags extreme outliers beyond the extended fences (Q1 - 3 × IQR and Q3 + 3 × IQR), allowing analysts to differentiate levels of deviation.[34] These multipliers can be tuned based on the data's context, such as increasing them for skewed distributions or financial data where extremes may represent valid events like market crashes.[34] While effective for data cleaning and anomaly flagging, the IQR-based method is not definitive, as it may incorrectly label valid extremes—such as natural variations in biological or environmental data—as outliers, especially in heavy-tailed or multimodal distributions.[35] Thus, flagged points warrant contextual review to avoid discarding meaningful information.[36]
Examples
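A minimal sketch of the fence computation follows; tukey_fences is an illustrative helper written for this example (not a library function), and NumPy's default type 7 quartiles are assumed:

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Flag points outside Q1 - k*IQR and Q3 + k*IQR (k=1.5: mild, k=3.0: extreme)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

data = [3, 5, 7, 8, 12, 13, 14, 100]
lower, upper, outliers = tukey_fences(data)
print(lower, upper, outliers)  # -3.625 23.375 [100]
```

Raising k to 3.0 would widen the fences and, for this dataset, still flag 100, illustrating how the multiplier tunes the rule's sensitivity.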
Tabular Data Set
To demonstrate the calculation of the interquartile range, consider a sample dataset consisting of 11 test scores: 22, 24, 26, 28, 29, 31, 35, 37, 41, 53, 64. These values represent a small, representative set for illustrating the process. The data must first be arranged in ascending order, as follows:
| Position | Sorted Value |
|---|---|
| 1 | 22 |
| 2 | 24 |
| 3 | 26 |
| 4 | 28 |
| 5 | 29 |
| 6 | 31 |
| 7 | 35 |
| 8 | 37 |
| 9 | 41 |
| 10 | 53 |
| 11 | 64 |
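The quartiles of this table can then be computed. As discussed under Methods for Quartiles, the exclusive and inclusive conventions disagree for an odd-sized sample; the sketch below (in Python, with NumPy assumed) shows both:

```python
import numpy as np

scores = [22, 24, 26, 28, 29, 31, 35, 37, 41, 53, 64]  # n = 11, sorted

# Moore and McCabe (exclusive): drop the median (31, position 6) and
# take the medians of the two remaining halves.
q1_mm, q3_mm = np.median(scores[:5]), np.median(scores[6:])
print(q1_mm, q3_mm, q3_mm - q1_mm)  # 26.0 41.0 15.0

# Tukey's hinges (inclusive): the median belongs to both halves.
q1_tk, q3_tk = np.median(scores[:6]), np.median(scores[5:])
print(q1_tk, q3_tk, q3_tk - q1_tk)  # 27.0 39.0 12.0
```

The exclusive convention gives IQR = 41 - 26 = 15, while the inclusive hinges give IQR = 39 - 27 = 12, again underscoring that the method must be stated alongside the result.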
Box Plot Illustration
The box plot, a graphical method for summarizing data distribution introduced by John Tukey, visually represents the interquartile range (IQR) as the central box spanning from the first quartile (Q1) to the third quartile (Q3), enclosing the middle 50% of the observations.[11] A horizontal or vertical line within this box marks the median, providing a quick view of central tendency relative to the data's spread.[11] The box's length directly corresponds to the IQR, highlighting variability in the core data without influence from extreme values.[37] Whiskers extend from the box edges to the adjacent values, defined as the farthest points within 1.5 times the IQR below Q1 or above Q3; these limits form "fences" that help identify potential outliers as individual points beyond the whiskers.[11] For instance, consider a simple data set of exam scores: 55, 60, 65, 70, 72, 75, 78, 80, 85, 90, 95, 100. Here, Q1 is 67.5, the median is 76.5, and Q3 is 87.5, yielding an IQR of 20. The box would span 67.5 to 87.5, with the median line at 76.5; whiskers reach to 55 (minimum) and 100 (maximum), showing no outliers since all values fall within the fences (37.5 to 117.5). 
A textual representation might appear as:
  55 |--------[ 67.5 | 76.5 | 87.5 ]--------| 100
 min   whisker   Q1   median    Q3   whisker  max
This format emphasizes the IQR's role in depicting the data's robust spread.[11] Box plots offer benefits such as immediate visualization of the data's spread via the IQR's box width and straightforward detection of skewness or outliers, making them ideal for comparing distributions across groups at a glance.[37] A variation, the notched box plot, adds indented notches around the median to approximate a 95% confidence interval, calculated as roughly ±1.58 × IQR / √n, facilitating visual tests for median differences between plots (non-overlapping notches suggest significant disparity).[38]
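The box-plot quantities for the exam-score example can be checked numerically. The sketch below computes the quartiles as medians of the two halves of the sorted data and derives the fences and whisker ends ("adjacent values") from them:

```python
import numpy as np

scores = [55, 60, 65, 70, 72, 75, 78, 80, 85, 90, 95, 100]

# Quartiles as medians of the two halves (n = 12, an even split).
q1, q3 = np.median(scores[:6]), np.median(scores[6:])
iqr = q3 - q1                                        # 87.5 - 67.5 = 20.0

# Tukey fences at 1.5 * IQR; whiskers stop at the most extreme
# data points that still lie inside the fences.
lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # 37.5, 117.5
inside = [v for v in scores if lo_fence <= v <= hi_fence]
whiskers = (min(inside), max(inside))                # (55, 100): no outliers
print(q1, q3, iqr, lo_fence, hi_fence, whiskers)
```

Since every score lies within the fences, the whiskers reach the minimum and maximum and no points are drawn individually, matching the description above.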