Five-number summary
The five-number summary is a descriptive statistical tool used to summarize the distribution of a numerical dataset, consisting of five key values: the minimum (smallest observation), the first quartile (Q1, the 25th percentile), the median (50th percentile), the third quartile (Q3, the 75th percentile), and the maximum (largest observation).[1] These values divide the data into four equal parts, each containing approximately 25% of the observations, providing insights into the data's central tendency, spread, and skewness without assuming a normal distribution.[2] Introduced by statistician John W. Tukey in his 1977 book Exploratory Data Analysis, the five-number summary serves as the foundation for creating boxplots (or box-and-whisker plots), which visually represent these metrics to identify outliers and compare distributions across multiple datasets.[3] Unlike summaries based on the mean and standard deviation, its median and quartiles are robust to extreme values, making it particularly useful for ordinal or non-parametric data analysis in fields like economics, biology, and social sciences.[4]
Definition and Components
Definition
The five-number summary is a collection of five key descriptive statistics: the minimum value, the first quartile (Q1), the median (also known as the second quartile, Q2), the third quartile (Q3), and the maximum value of a dataset.[1] These values divide the data into four equal parts, offering a snapshot of its distribution.[1]
Its primary purpose is to summarize the central tendency, spread, and overall shape of the data without requiring assumptions about normality or other distributional forms, making it particularly useful for exploratory analysis of potentially skewed or outlier-prone datasets.[5] Introduced by statistician John W. Tukey in 1977 as a core element of exploratory data analysis (EDA), it embodies EDA's emphasis on techniques that are resistant to anomalies and focused on revealing patterns through simple, visualizable metrics.
In contrast to parametric summaries like the mean and standard deviation, which can be distorted by extreme values and often rely on normality assumptions, the five-number summary is non-parametric and robust, providing a more reliable overview for diverse data types.[6] It is commonly visualized via box-and-whisker plots to highlight these features intuitively.[7]
Components
The five-number summary consists of five key descriptive statistics that capture essential aspects of a dataset's distribution: the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. Each component plays a distinct role in summarizing the location, spread, and central tendency of the data without assuming a specific distributional form.[1][5]
The minimum is the smallest observed value in the dataset, establishing the lower boundary of the data range and indicating the lowest point in the distribution.[1][5] It provides a baseline for understanding the extent of the data's lower tail, though it can be sensitive to outliers.[1]
The first quartile (Q1), also known as the lower quartile, is the value below which 25% of the data lies, marking the 25th percentile and serving as the lower boundary of the middle 50% of the dataset.[1][5] This component highlights the point where the lowest quarter of the data ends, offering insight into the lower half's dispersion.[5]
The median (Q2), or second quartile, represents the central value of the ordered dataset, dividing it into two equal halves such that 50% of the data falls below it and 50% above.[1][5] As a robust measure of central tendency, it indicates the data's midpoint and is less affected by extreme values than the mean.[1]
The third quartile (Q3), or upper quartile, is the value below which 75% of the data lies, corresponding to the 75th percentile and defining the upper boundary of the middle 50% of the dataset.[1][5] It delineates the end of the upper quarter of the data, helping to assess the spread in the higher portion of the distribution.[5]
The maximum is the largest observed value in the dataset, setting the upper boundary of the data range and revealing the highest point in the distribution.[1][5] Like the minimum, it captures the extent of the upper tail but may be influenced by outliers.[1]
From these components, the interquartile range (IQR) is derived as the difference between Q3 and Q1, quantifying the spread of the central 50% of the data and providing a robust measure of variability that is resistant to outliers.[8][1] This derived statistic emphasizes the consistency within the middle portion of the dataset, complementing the overall range bounded by the minimum and maximum.[8]
Calculation Methods
To determine the median in the context of the five-number summary, the dataset must first be sorted in ascending order, as this ordered arrangement is essential for identifying positional values accurately.[9][10]
For a dataset with an odd number of observations n, the median is the middle value at position (n+1)/2.[11][12] For example, in the sorted dataset {1, 3, 5, 7, 9} where n=5, the median is 5 at position 3.[9]
When n is even, the median is the average of the two middle values at positions n/2 and (n/2)+1.[13][10] This is expressed mathematically as:
\text{median} = \frac{x_{(n/2)} + x_{(n/2 + 1)}}{2}
where x_{(i)} denotes the i-th value in the ordered dataset.[9][12] For instance, in the sorted set {1, 3, 5, 7} where n=4, the median is (3 + 5)/2 = 4.[11]
In the five-number summary, the median serves as the second quartile (Q2), providing a measure of central tendency resistant to outliers.[13] Ties in the data do not affect the median calculation, as it relies solely on positional indexing rather than unique values.[10] For empty datasets (n=0), the median is undefined, as no values exist for ordering.[9]
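The positional rules above translate directly into code. The following minimal Python sketch (the helper name median_of is illustrative, not a library function) sorts the data and applies the odd/even rules; the standard library's statistics.median implements the same behavior.

def median_of(values):
    # Median by positional indexing: sort first, then pick the middle value
    # (odd n) or average the two middle values (even n).
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median is undefined for an empty dataset")
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                      # position (n + 1) / 2 in 1-based terms
    return (data[mid - 1] + data[mid]) / 2    # average of positions n/2 and n/2 + 1

print(median_of([1, 3, 5, 7, 9]))   # 5
print(median_of([1, 3, 5, 7]))      # 4.0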
Determining the Quartiles
Quartiles are the values that divide a sorted dataset of n observations into four equal parts, with the first quartile (Q1) marking the 25th percentile and the third quartile (Q3) marking the 75th percentile.[14] In the five-number summary introduced by John Tukey, the quartiles are calculated using the hinge method, also known as the inclusive approach.[15][16] This method divides the ordered data into lower and upper halves after identifying the median, including the median in both halves if n is odd; Q1 is then the median of the lower half, and Q3 is the median of the upper half. When a half contains an even number of values, its median is the average of the two central values in that half; for example, if n is divisible by 4, each half contains n/2 values and Q1 is the average of the values at positions n/4 and n/4 + 1 in the ordered data. This discrete method avoids interpolation and aligns with Tukey's original hinges for robust summary statistics, as in the sketch below.[15][17]
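The hinge method can be written in a few lines of Python; this is a minimal sketch of the inclusive-halves rule described above (the function name tukey_hinges is illustrative, not a library call), using the exam scores from the worked example later in this article.

from statistics import median   # standard-library median: averages the two middle values when a half is even-sized

def tukey_hinges(values):
    # Inclusive ("hinge") method: the sorted data is split into halves that
    # share the median when n is odd, and each hinge is the median of a half.
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2          # size of each half; the median is counted in both halves for odd n
    return median(data[:half]), median(data[n - half:])

print(tukey_hinges([55, 60, 70, 75, 80, 85, 90, 95, 100]))   # (70, 90)

For these nine scores the hinges are 70 and 90; the interpolation-based method described next gives slightly different values.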
An alternative common method locates Q1 at position (n+1)/4 and Q3 at position 3(n+1)/4 in the ordered data. If the position is an integer k, the quartile is the k-th ordered observation x_{(k)}; otherwise, linear interpolation is used between the two nearest observations. Writing k = \lfloor (n+1)/4 \rfloor for the integer part and f for the fractional part of (n+1)/4, the interpolated first quartile is
Q1 = x_{(k)} + f \cdot (x_{(k+1)} - x_{(k)}),
and the same principle applies to Q3 with the position 3(n+1)/4. This approach places the quartiles at exact percentile positions and is recommended by the NIST Engineering Statistics Handbook for percentile calculations; a short sketch follows below.[14][15]
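As a companion to the hinge sketch above, the following Python sketch implements the (n+1)p positional rule with linear interpolation; the function name quartile_interpolated and the clamping of out-of-range positions to the extremes are illustrative choices, not part of any standard library.

def quartile_interpolated(values, p):
    # Quartile at 1-based position (n + 1) * p with linear interpolation;
    # p = 0.25 gives Q1 and p = 0.75 gives Q3.
    data = sorted(values)
    n = len(data)
    pos = (n + 1) * p
    k = int(pos)                 # integer part of the position
    f = pos - k                  # fractional part used for interpolation
    if k < 1:                    # positions below the first observation are clamped
        return data[0]
    if k >= n:                   # positions beyond the last observation are clamped
        return data[-1]
    return data[k - 1] + f * (data[k] - data[k - 1])

scores = [55, 60, 70, 75, 80, 85, 90, 95, 100]
print(quartile_interpolated(scores, 0.25), quartile_interpolated(scores, 0.75))   # 65.0 92.5

On the same nine exam scores this rule gives Q1 = 65 and Q3 = 92.5, illustrating how the two methods can disagree.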
These methods can produce slightly different quartile values, particularly for small n, leading to variations in software implementations. For instance, Microsoft Excel's QUARTILE.EXC function uses the (n+1)p positional method with interpolation, while QUARTILE.INC and R's default quantile() function (type 7) use a formula based on (n-1)p + 1; R's boxplot() function, by contrast, applies the inclusive hinge method. To maintain consistency and comparability in analyses, practitioners should select and document a single method throughout.[17][18]
Identifying Minimum and Maximum
The minimum value in the five-number summary is defined as the smallest observation in the dataset, denoted as x_{(1)} when the data are arranged in non-decreasing order.[1] Similarly, the maximum value is the largest observation, denoted as x_{(n)}, where n is the sample size.[1] These endpoints require no interpolation or positional averaging, unlike the quartiles, and are directly selected from the extremes of the ordered list.[19]
When datasets contain duplicate values, the minimum and maximum remain the extreme observations, with duplicates treated as distinct entries without altering the selection process.[1] For missing values, standard practice in descriptive statistics excludes them from consideration, computing the minimum and maximum solely on the available observations to avoid biasing the summary.[20]
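As a small illustration of the missing-value convention mentioned above, the following Python sketch uses NumPy's nan-aware reductions and pandas' default skipna behaviour to compute the extremes while ignoring missing entries; the sample values are arbitrary.

import numpy as np
import pandas as pd

values = [55, 60, np.nan, 75, 80, 95, 100]

# NumPy: nanmin/nanmax skip missing values instead of propagating NaN.
print(np.nanmin(values), np.nanmax(values))    # 55.0 100.0

# pandas: min() and max() skip NaN by default (skipna=True).
s = pd.Series(values)
print(s.min(), s.max())                        # 55.0 100.0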
One potential limitation is that the minimum and maximum are highly sensitive to outliers, as extreme values directly determine these bounds and can skew the overall range of the distribution.[21] In visualizations such as box-and-whisker plots, these values serve as the endpoints of the whiskers, framing the spread of the data.[22]
Applications and Visualization
Role in Descriptive Statistics
The five-number summary plays a central role in descriptive statistics by providing a non-parametric framework for summarizing the distribution of a dataset through key order statistics: the minimum, first quartile, median, third quartile, and maximum. This approach is particularly valuable in exploratory data analysis, where it enables researchers to quickly assess central tendency, spread, and potential skewness without assuming a normal distribution. Unlike parametric measures such as the mean and standard deviation, which can be heavily influenced by extreme values, the five-number summary relies on medians and quartiles that are inherently more stable, making it a robust tool for initial data exploration and preliminary investigations of large datasets.[23][22]
One key advantage of the five-number summary is its resistance to outliers, as the median and quartiles are less affected by extreme observations compared to the mean, which can be pulled toward anomalies, and the standard deviation, which amplifies their impact on measures of variability. This robustness makes it especially useful for skewed distributions, where traditional measures may distort the true central tendency and spread; for instance, in positively skewed data, the median provides a better representation of the typical value than the mean. Additionally, the interquartile range (IQR), calculated as the difference between the third quartile (Q3) and the first quartile (Q1), i.e. \text{IQR} = Q_3 - Q_1, serves as a reliable, outlier-resistant measure of spread that focuses on the middle 50% of the data, offering insight into variability without being swayed by tails.[24][25][23][26]
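The robustness argument can be seen numerically in a short Python sketch: adding one extreme value to a small, arbitrary dataset moves the mean and standard deviation sharply while the median and IQR change only slightly (NumPy's default linear interpolation is used for the quartiles).

import numpy as np

data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21])
with_outlier = np.append(data, 90)              # one extreme observation added

for sample in (data, with_outlier):
    q1, med, q3 = np.percentile(sample, [25, 50, 75])
    print(f"mean={sample.mean():.2f}  sd={sample.std(ddof=1):.2f}  "
          f"median={med}  IQR={q3 - q1}")
# The mean and standard deviation jump once the outlier is included,
# while the median and IQR barely move.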
In practice, the five-number summary facilitates comparing distributions across groups, such as income levels between demographics, by highlighting differences in medians and IQRs to reveal patterns of inequality. For example, in economics, it aids in assessing income inequality by quantifying the spread between lower and upper quartiles, providing a snapshot of how resources are dispersed without requiring full distributional assumptions. Similarly, in medicine, it is employed to summarize patient data like ages at diagnosis, enabling clinicians to describe cohort variability and central ages in clinical studies while detecting potential outliers in age-related outcomes. These applications underscore its utility in fields where data may be non-normal or contain influential extremes.[22][27][28]
Despite its strengths, the five-number summary has limitations: it does not capture multimodality or the exact shape of the distribution, potentially overlooking clusters or fine-grained patterns that histograms or density plots would reveal, so it is best used alongside graphical tools for a fuller picture. While effective for broad summarization, it may underrepresent variability in highly irregular datasets, underscoring the need to pair it with other descriptive techniques in comprehensive analyses.[23]
Box-and-Whisker Plots
The box-and-whisker plot, also known as a box plot, is a graphical representation that utilizes the five-number summary to illustrate the distribution, central tendency, spread, and potential skewness of a dataset. Invented by John W. Tukey as part of exploratory data analysis, it condenses the minimum, first quartile (Q1), median, third quartile (Q3), and maximum into a compact visual format, allowing for rapid assessment of data characteristics without displaying every data point.[29][30]
The core elements of the plot consist of a rectangular box spanning from Q1 to Q3, which encloses the interquartile range (IQR) representing the middle 50% of the data, with a horizontal line inside the box marking the median. Extending from the box are "whiskers," which are lines reaching to the smallest and largest data points that fall within 1.5 times the IQR from Q1 and Q3, respectively; these whiskers thus highlight the range of the data excluding potential outliers. Data points beyond these whiskers are plotted individually as outliers, aiding in the identification of unusual values that may warrant further investigation.[30]
Outlier detection in the standard Tukey box plot employs fences calculated as follows:
\text{Lower fence} = Q_1 - 1.5(Q_3 - Q_1)
\text{Upper fence} = Q_3 + 1.5(Q_3 - Q_1)
Values below the lower fence or above the upper fence are considered mild outliers and are depicted as points outside the whiskers, while extreme outliers (beyond 3 times the IQR) may be distinguished with different symbols. This 1.5 IQR multiplier provides a robust threshold for flagging deviations in non-normal distributions.[30]
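A minimal Python sketch of the fence rule above follows; the helper name tukey_fences and the sample data are illustrative, and NumPy's default linear interpolation is used for Q1 and Q3, which can differ slightly from Tukey's hinges.

import numpy as np

def tukey_fences(data, k=1.5):
    # Fences at Q1 - k*IQR and Q3 + k*IQR; k = 1.5 flags mild outliers,
    # and k = 3 would flag extreme outliers.
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = np.array([2, 3, 4, 5, 5, 6, 7, 8, 30])
lower, upper = tukey_fences(data)
print(lower, upper)                               # -0.5 11.5
print(data[(data < lower) | (data > upper)])      # [30] lies beyond the upper fence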
Variations of the basic Tukey box plot include notched versions, where indentations (notches) are added to the sides of the box to represent a confidence interval around the median, facilitating visual comparisons of medians across groups; non-overlapping notches suggest significant differences at approximately the 95% confidence level. These notches, proposed by McGill, Tukey, and Larsen, enhance the plot's inferential capabilities without altering the core five-number summary structure. Other adaptations may adjust whisker lengths or incorporate variable box widths to reflect sample sizes, but the standard form prioritizes simplicity.[31][30]
A key benefit of box-and-whisker plots is their ability to enable quick visual comparisons of distributions across multiple datasets or groups, revealing differences in location, variability, symmetry, and outlier presence in a single glance, which is particularly useful in exploratory data analysis for factors like treatment effects or machine performance.[30]
Examples
Numerical Example
To illustrate the computation of the five-number summary, consider a small dataset of nine exam scores, already sorted in ascending order: 55, 60, 70, 75, 80, 85, 90, 95, 100.[1]
The minimum value is the smallest score, 55. The maximum value is the largest score, 100.
The median, or second quartile (Q2), is the middle value in the ordered dataset. With nine observations, it is the fifth value: 80.
The first quartile (Q1) is here taken as the median of the lower half of the data, excluding the overall median (the first four values: 55, 60, 70, 75). This is the average of the second and third values in that subset: (60 + 70) / 2 = 65.
The third quartile (Q3) is likewise the median of the upper half, excluding the overall median (the last four values: 85, 90, 95, 100). This is the average of the second and third values in that subset: (90 + 95) / 2 = 92.5. (Tukey's hinge method, which includes the median in both halves, would instead give Q1 = 70 and Q3 = 90.)
Thus, the five-number summary for this dataset is: minimum = 55, Q1 = 65, median = 80, Q3 = 92.5, maximum = 100. The interquartile range (IQR), calculated as Q3 - Q1, is 92.5 - 65 = 27.5, which measures the spread of the middle 50% of the scores.
This summary indicates that the central half of the exam scores spans from 65 to 92.5, while the overall range extends from 55 to 100, showing moderate variability without extreme values. To check for potential outliers, compute the bounds as Q1 - 1.5 × IQR = 65 - 1.5 × 27.5 = 23.75 and Q3 + 1.5 × IQR = 92.5 + 1.5 × 27.5 = 133.75; since all scores fall within these bounds (55 > 23.75 and 100 < 133.75), there are no outliers in this dataset. This five-number summary can be visualized using a box-and-whisker plot, with the box representing the IQR and whiskers extending to the minimum and maximum.[1]
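The worked example can be reproduced with a short Python sketch using the standard library; the half-splitting below mirrors the exclusive approach used in the example (the slice indices are illustrative for this nine-value dataset).

from statistics import median

scores = [55, 60, 70, 75, 80, 85, 90, 95, 100]

# Exclusive approach: the overall median (index 4) is left out of both halves.
lower_half, upper_half = scores[:4], scores[5:]
five_num = (min(scores), median(lower_half), median(scores),
            median(upper_half), max(scores))
print(five_num)                        # (55, 65.0, 80, 92.5, 100)
print(five_num[3] - five_num[1])       # IQR = 27.5

With Tukey's hinges (or NumPy's default linear interpolation) the quartiles would instead come out as 70 and 90, another reminder that the choice of quartile method should be stated alongside the summary.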
Software Implementations
In statistical software, the five-number summary can be computed efficiently using built-in functions that automate the process of finding the minimum, first quartile, median, third quartile, and maximum of a dataset.[32] These tools vary in their default methods for calculating quartiles, which can lead to slight differences in results due to interpolation choices; for instance, R's quantile() uses type 7 of the Hyndman and Fan classification by default.[33]
In R, the summary() function applied to a numeric vector or data frame column generates the five-number summary directly, outputting the minimum, first quartile (Q1), median, third quartile (Q3), and maximum, along with the mean (which is extraneous for this summary).[34] For example, on a vector x <- c(1, 2, 3, 4, 5), executing summary(x) yields:
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.00 3.00 3.00 4.00 5.00
Users can ignore the mean and extract the relevant values using functions like quantile(x, probs = c(0.25, 0.5, 0.75)) for Q1, the median, and Q3, combined with min(x) and max(x); alternatively, the base function fivenum(x) returns Tukey's five-number summary (based on hinges) directly.
In Python, the NumPy library's percentile() function computes specific percentiles to derive the quartiles, with the minimum and maximum obtained separately using min() and max().[35] For a NumPy array arr = np.array([1, 2, 3, 4, 5]), the code q1 = np.percentile(arr, 25); median = np.percentile(arr, 50); q3 = np.percentile(arr, 75); min_val = np.min(arr); max_val = np.max(arr) produces Q1=2.0, median=3.0, Q3=4.0, min=1, and max=5, using linear interpolation by default. Alternatively, the Pandas library's describe() method on a DataFrame or Series provides a comprehensive summary including the five-number summary elements (min, 25%, 50%, 75%, max) for numeric columns.[36] For a Pandas Series s = pd.Series([1, 2, 3, 4, 5]), s.describe() outputs:
| count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| 5.0 | 3.0 | 1.581139 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
Note that describe() reports the count, mean, and standard deviation alongside the five-number summary values, so the latter may need to be selected explicitly.
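For reference, the NumPy and pandas calls described above can be gathered into one runnable snippet; the data values match the earlier example.

import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4, 5])

# NumPy: quartiles via percentile(), extremes via min()/max().
q1, med, q3 = np.percentile(arr, [25, 50, 75])
print(arr.min(), q1, med, q3, arr.max())           # 1 2.0 3.0 4.0 5

# pandas: describe() includes min, 25%, 50%, 75%, and max among its statistics.
s = pd.Series(arr)
print(s.describe().loc[["min", "25%", "50%", "75%", "max"]])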
In SAS, the PROC UNIVARIATE procedure computes the five-number summary as part of its descriptive statistics output, including moments, quantiles, and extremes, with options to store results via the OUTPUT statement.[32] For a dataset with variable x, the code:
PROC UNIVARIATE DATA=mydata;
VAR x;
OUTPUT OUT=sumstats pctlpts=25 50 75 pctlpre=Q_ pctlname=1 median 3;
RUN;
generates a dataset sumstats containing Q1, the median, and Q3; the minimum and maximum can be added to the same output dataset with the MIN= and MAX= keywords of the OUTPUT statement, or read from the procedure's default Quantiles table.[37] By default, PROC UNIVARIATE uses percentile definition 5 (an empirical distribution function with averaging), which does not interpolate and can therefore differ from the interpolation-based defaults of other tools.[33]
In Stata, the summarize command with the detail option provides detailed output including the minimum, p25 (first quartile), p50 (median), p75 (third quartile), and maximum, along with additional percentiles, skewness, and kurtosis.[38] For a variable x, executing summarize x, detail displays these statistics in the console; for the sample data 1 through 5 used earlier, min=1, p25=2, p50=3, p75=4, max=5. To extract the values programmatically, inspect the stored results with return list and reference them as r(min), r(p25), r(p50), r(p75), and r(max) in subsequent commands.[38] Stata's percentile calculation in summarize does not interpolate between observations, so its quartiles can differ from interpolation-based defaults such as R's type 7, particularly for small samples.[33]