Five-number summary
The five-number summary is a descriptive statistical tool used to summarize the distribution of a numerical dataset, consisting of five key values: the minimum (smallest observation), the first quartile (Q1, the 25th percentile), the median (50th percentile), the third quartile (Q3, the 75th percentile), and the maximum (largest observation).[1] These values divide the data into four equal parts, each containing approximately 25% of the observations, providing insights into the data's central tendency, spread, and skewness without assuming a normal distribution.[2] Introduced by statistician John W. Tukey in his 1977 book Exploratory Data Analysis, the five-number summary serves as the foundation for creating boxplots (or box-and-whisker plots), which visually represent these metrics to identify outliers and compare distributions across multiple datasets.[3] Unlike summaries based on the mean and standard deviation, its median and quartiles are robust to extreme values, making it particularly useful for ordinal or non-parametric data analysis in fields like economics, biology, and social sciences.[4]
Definition and Components
Definition
The five-number summary is a collection of five key descriptive statistics: the minimum value, the first quartile (Q1), the median (also known as the second quartile, Q2), the third quartile (Q3), and the maximum value of a dataset.[1] These values divide the data into four equal parts, offering a snapshot of its distribution.[1]
Its primary purpose is to summarize the central tendency, spread, and overall shape of the data without requiring assumptions about normality or other distributional forms, making it particularly useful for exploratory analysis of potentially skewed or outlier-prone datasets.[5] Introduced by statistician John W. Tukey in 1977 as a core element of exploratory data analysis (EDA), it embodies EDA's emphasis on techniques that are resistant to anomalies and focused on revealing patterns through simple, visualizable metrics.
In contrast to parametric summaries like the mean and standard deviation, which can be distorted by extreme values and often rely on normality assumptions, the five-number summary is non-parametric and robust, providing a more reliable overview for diverse data types.[6] It is commonly visualized via box-and-whisker plots to highlight these features intuitively.[7]
Components
The five-number summary consists of five key descriptive statistics that capture essential aspects of a dataset's distribution: the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. Each component plays a distinct role in summarizing the location, spread, and central tendency of the data without assuming a specific distributional form.[1][5]
The minimum is the smallest observed value in the dataset, establishing the lower boundary of the data range and indicating the lowest point in the distribution.[1][5] It provides a baseline for understanding the extent of the data's lower tail, though it can be sensitive to outliers.[1]
The first quartile (Q1), also known as the lower quartile, is the value below which 25% of the data lies, marking the 25th percentile and serving as the lower boundary of the middle 50% of the dataset.[1][5] This component highlights the point where the lowest quarter of the data ends, offering insight into the lower half's dispersion.[5]
The median (Q2), or second quartile, represents the central value of the ordered dataset, dividing it into two equal halves such that 50% of the data falls below it and 50% above.[1][5] As a robust measure of central tendency, it indicates the data's midpoint and is less affected by extreme values than the mean.[1]
The third quartile (Q3), or upper quartile, is the value below which 75% of the data lies, corresponding to the 75th percentile and defining the upper boundary of the middle 50% of the dataset.[1][5] It delineates the end of the upper quarter of the data, helping to assess the spread in the higher portion of the distribution.[5]
The maximum is the largest observed value in the dataset, setting the upper boundary of the data range and revealing the highest point in the distribution.[1][5] Like the minimum, it captures the extent of the upper tail but may be influenced by outliers.[1]
From these components, the interquartile range (IQR) is derived as the difference between Q3 and Q1, quantifying the spread of the central 50% of the data and providing a robust measure of variability that is resistant to outliers.[8][1] This derived statistic emphasizes the consistency within the middle portion of the dataset, complementing the overall range bounded by the minimum and maximum.[8]
Calculation Methods
To determine the median in the context of the five-number summary, the dataset must first be sorted in ascending order, as this ordered arrangement is essential for identifying positional values accurately.[9][10]
For a dataset with an odd number of observations n, the median is the middle value at position (n+1)/2.[11][12] For example, in the sorted dataset {1, 3, 5, 7, 9} where n=5, the median is 5 at position 3.[9]
When n is even, the median is the average of the two middle values at positions n/2 and (n/2)+1.[13][10] This is expressed mathematically as:
\text{median} = \frac{x_{(n/2)} + x_{(n/2 + 1)}}{2}
where x_{(i)} denotes the i-th value in the ordered dataset.[9][12] For instance, in the sorted set {1, 3, 5, 7} where n=4, the median is (3 + 5)/2 = 4.[11]
In the five-number summary, the median serves as the second quartile (Q2), providing a measure of central tendency resistant to outliers.[13] Ties in the data do not affect the median calculation, as it relies solely on positional indexing rather than unique values.[10] For empty datasets (n=0), the median is undefined, as no values exist for ordering.[9]
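The positional rules above translate directly into code. The following minimal Python sketch (the helper name median_of is illustrative, not a library function) sorts the data and applies the odd/even rules; the standard library's statistics.median implements the same behavior.

def median_of(values):
    # Median by positional indexing: sort first, then pick the middle value
    # (odd n) or average the two middle values (even n).
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("median is undefined for an empty dataset")
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                      # position (n + 1) / 2 in 1-based terms
    return (data[mid - 1] + data[mid]) / 2    # average of positions n/2 and n/2 + 1

print(median_of([1, 3, 5, 7, 9]))   # 5
print(median_of([1, 3, 5, 7]))      # 4.0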
Determining the Quartiles
Quartiles are the values that divide a sorted dataset of n observations into four equal parts, with the first quartile (Q1) marking the 25th percentile and the third quartile (Q3) marking the 75th percentile.[14] In the five-number summary introduced by John Tukey, the quartiles are calculated using the hinge method, also known as the inclusive approach.[15][16] This method divides the ordered data into lower and upper halves after identifying the median, including the median in both halves if n is odd; Q1 is then the median of the lower half, and Q3 is the median of the upper half. When a half contains an even number of values, its median is the average of the two central values in that half; for example, if n is divisible by 4, each half contains n/2 values and Q1 is the average of the values at positions n/4 and n/4 + 1 in the ordered data. This discrete method avoids interpolation and aligns with Tukey's original hinges for robust summary statistics, as in the sketch below.[15][17]
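The hinge method can be written in a few lines of Python; this is a minimal sketch of the inclusive-halves rule described above (the function name tukey_hinges is illustrative, not a library call), using the exam scores from the worked example later in this article.

from statistics import median   # standard-library median: averages the two middle values when a half is even-sized

def tukey_hinges(values):
    # Inclusive ("hinge") method: the sorted data is split into halves that
    # share the median when n is odd, and each hinge is the median of a half.
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2          # size of each half; the median is counted in both halves for odd n
    return median(data[:half]), median(data[n - half:])

print(tukey_hinges([55, 60, 70, 75, 80, 85, 90, 95, 100]))   # (70, 90)

For these nine scores the hinges are 70 and 90; the interpolation-based method described next gives slightly different values.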
An alternative common method locates Q1 at position (n+1)/4 and Q3 at position 3(n+1)/4 in the ordered data. If the position is an integer k, the quartile is the k-th ordered observation x_{(k)}; otherwise, linear interpolation is used between the two nearest observations. Writing k = \lfloor (n+1)/4 \rfloor for the integer part and f for the fractional part of (n+1)/4, the interpolated first quartile is
Q1 = x_{(k)} + f \cdot (x_{(k+1)} - x_{(k)}),
and the same principle applies to Q3 with the position 3(n+1)/4. This approach places the quartiles at exact percentile positions and is recommended by the NIST Engineering Statistics Handbook for percentile calculations; a short sketch follows below.[14][15]
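As a companion to the hinge sketch above, the following Python sketch implements the (n+1)p positional rule with linear interpolation; the function name quartile_interpolated and the clamping of out-of-range positions to the extremes are illustrative choices, not part of any standard library.

def quartile_interpolated(values, p):
    # Quartile at 1-based position (n + 1) * p with linear interpolation;
    # p = 0.25 gives Q1 and p = 0.75 gives Q3.
    data = sorted(values)
    n = len(data)
    pos = (n + 1) * p
    k = int(pos)                 # integer part of the position
    f = pos - k                  # fractional part used for interpolation
    if k < 1:                    # positions below the first observation are clamped
        return data[0]
    if k >= n:                   # positions beyond the last observation are clamped
        return data[-1]
    return data[k - 1] + f * (data[k] - data[k - 1])

scores = [55, 60, 70, 75, 80, 85, 90, 95, 100]
print(quartile_interpolated(scores, 0.25), quartile_interpolated(scores, 0.75))   # 65.0 92.5

On the same nine exam scores this rule gives Q1 = 65 and Q3 = 92.5, illustrating how the two methods can disagree.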
These methods can produce slightly different quartile values, particularly for small n, leading to variations in software implementations. For instance, Microsoft Excel's QUARTILE.EXC function uses the (n+1)p positional method with interpolation, while QUARTILE.INC and R's default quantile() function (type 7) use a formula based on (n-1)p + 1; R's boxplot() function, by contrast, applies the inclusive hinge method. To maintain consistency and comparability in analyses, practitioners should select and document a single method throughout.[17][18]
Identifying Minimum and Maximum
The minimum value in the five-number summary is defined as the smallest observation in the dataset, denoted as x_{(1)} when the data are arranged in non-decreasing order.[1] Similarly, the maximum value is the largest observation, denoted as x_{(n)}, where n is the sample size.[1] These endpoints require no interpolation or positional averaging, unlike the quartiles, and are directly selected from the extremes of the ordered list.[19]
When datasets contain duplicate values, the minimum and maximum remain the extreme observations, with duplicates treated as distinct entries without altering the selection process.[1] For missing values, standard practice in descriptive statistics excludes them from consideration, computing the minimum and maximum solely on the available observations to avoid biasing the summary.[20]
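As a small illustration of the missing-value convention mentioned above, the following Python sketch uses NumPy's nan-aware reductions and pandas' default skipna behaviour to compute the extremes while ignoring missing entries; the sample values are arbitrary.

import numpy as np
import pandas as pd

values = [55, 60, np.nan, 75, 80, 95, 100]

# NumPy: nanmin/nanmax skip missing values instead of propagating NaN.
print(np.nanmin(values), np.nanmax(values))    # 55.0 100.0

# pandas: min() and max() skip NaN by default (skipna=True).
s = pd.Series(values)
print(s.min(), s.max())                        # 55.0 100.0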
One potential limitation is that the minimum and maximum are highly sensitive to outliers, as extreme values directly determine these bounds and can skew the overall range of the distribution.[21] In visualizations such as box-and-whisker plots, these values serve as the endpoints of the whiskers, framing the spread of the data.[22]
Applications and Visualization
Role in Descriptive Statistics
The five-number summary plays a central role in descriptive statistics by providing a non-parametric framework for summarizing the distribution of a dataset through key order statistics: the minimum, first quartile, median, third quartile, and maximum. This approach is particularly valuable in exploratory data analysis, where it enables researchers to quickly assess central tendency, spread, and potential skewness without assuming a normal distribution. Unlike parametric measures such as the mean and standard deviation, which can be heavily influenced by extreme values, the five-number summary relies on medians and quartiles that are inherently more stable, making it a robust tool for initial data exploration and preliminary investigations of large datasets.[23][22]
One key advantage of the five-number summary is its resistance to outliers, as the median and quartiles are less affected by extreme observations compared to the mean, which can be pulled toward anomalies, and the standard deviation, which amplifies their impact on measures of variability. This robustness makes it especially useful for skewed distributions, where traditional measures may distort the true central tendency and spread; for instance, in positively skewed data, the median provides a better representation of the typical value than the mean. Additionally, the interquartile range (IQR), calculated as the difference between the third quartile (Q3) and the first quartile (Q1), i.e. \text{IQR} = Q_3 - Q_1, serves as a reliable, outlier-resistant measure of spread that focuses on the middle 50% of the data, offering insight into variability without being swayed by tails.[24][25][23][26]
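The robustness argument can be seen numerically in a short Python sketch: adding one extreme value to a small, arbitrary dataset moves the mean and standard deviation sharply while the median and IQR change only slightly (NumPy's default linear interpolation is used for the quartiles).

import numpy as np

data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21])
with_outlier = np.append(data, 90)              # one extreme observation added

for sample in (data, with_outlier):
    q1, med, q3 = np.percentile(sample, [25, 50, 75])
    print(f"mean={sample.mean():.2f}  sd={sample.std(ddof=1):.2f}  "
          f"median={med}  IQR={q3 - q1}")
# The mean and standard deviation jump once the outlier is included,
# while the median and IQR barely move.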
In practice, the five-number summary facilitates comparing distributions across groups, such as income levels between demographics, by highlighting differences in medians and IQRs to reveal patterns of inequality. For example, in economics, it aids in assessing income inequality by quantifying the spread between lower and upper quartiles, providing a snapshot of how resources are dispersed without requiring full distributional assumptions. Similarly, in medicine, it is employed to summarize patient data like ages at diagnosis, enabling clinicians to describe cohort variability and central ages in clinical studies while detecting potential outliers in age-related outcomes. These applications underscore its utility in fields where data may be non-normal or contain influential extremes.[22][27][28]
Despite its strengths, the five-number summary has limitations: it does not capture multimodality or the exact shape of the distribution, potentially overlooking clusters or fine-grained patterns that histograms or density plots would reveal, so it is best used alongside graphical tools for a fuller picture. While effective for broad summarization, it may underrepresent variability in highly irregular datasets, underscoring the need to pair it with other descriptive techniques in comprehensive analyses.[23]
Box-and-Whisker Plots
The box-and-whisker plot, also known as a box plot, is a graphical representation that utilizes the five-number summary to illustrate the distribution, central tendency, spread, and potential skewness of a dataset. Invented by John W. Tukey as part of exploratory data analysis, it condenses the minimum, first quartile (Q1), median, third quartile (Q3), and maximum into a compact visual format, allowing for rapid assessment of data characteristics without displaying every data point.[29][30]
The core elements of the plot consist of a rectangular box spanning from Q1 to Q3, which encloses the interquartile range (IQR) representing the middle 50% of the data, with a horizontal line inside the box marking the median. Extending from the box are "whiskers," which are lines reaching to the smallest and largest data points that fall within 1.5 times the IQR from Q1 and Q3, respectively; these whiskers thus highlight the range of the data excluding potential outliers. Data points beyond these whiskers are plotted individually as outliers, aiding in the identification of unusual values that may warrant further investigation.[30]
Outlier detection in the standard Tukey box plot employs fences calculated as follows:
\text{Lower fence} = Q_1 - 1.5(Q_3 - Q_1)
\text{Upper fence} = Q_3 + 1.5(Q_3 - Q_1)
Values below the lower fence or above the upper fence are considered mild outliers and are depicted as points outside the whiskers, while extreme outliers (beyond 3 times the IQR) may be distinguished with different symbols. This 1.5 IQR multiplier provides a robust threshold for flagging deviations in non-normal distributions.[30]
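A minimal Python sketch of the fence rule above follows; the helper name tukey_fences and the sample data are illustrative, and NumPy's default linear interpolation is used for Q1 and Q3, which can differ slightly from Tukey's hinges.

import numpy as np

def tukey_fences(data, k=1.5):
    # Fences at Q1 - k*IQR and Q3 + k*IQR; k = 1.5 flags mild outliers,
    # and k = 3 would flag extreme outliers.
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = np.array([2, 3, 4, 5, 5, 6, 7, 8, 30])
lower, upper = tukey_fences(data)
print(lower, upper)                               # -0.5 11.5
print(data[(data < lower) | (data > upper)])      # [30] lies beyond the upper fence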
Variations of the basic Tukey box plot include notched versions, where indentations (notches) are added to the sides of the box to represent a confidence interval around the median, facilitating visual comparisons of medians across groups; non-overlapping notches suggest significant differences at approximately the 95% confidence level. These notches, proposed by McGill, Tukey, and Larsen, enhance the plot's inferential capabilities without altering the core five-number summary structure. Other adaptations may adjust whisker lengths or incorporate variable box widths to reflect sample sizes, but the standard form prioritizes simplicity.[31][30]
A key benefit of box-and-whisker plots is their ability to enable quick visual comparisons of distributions across multiple datasets or groups, revealing differences in location, variability, symmetry, and outlier presence in a single glance, which is particularly useful in exploratory data analysis for factors like treatment effects or machine performance.[30]
Examples
Numerical Example
To illustrate the computation of the five-number summary, consider a small dataset of nine exam scores, already sorted in ascending order: 55, 60, 70, 75, 80, 85, 90, 95, 100.[1]
The minimum value is the smallest score, 55. The maximum value is the largest score, 100.
The median, or second quartile (Q2), is the middle value in the ordered dataset. With nine observations, it is the fifth value: 80.
The first quartile (Q1) is here taken as the median of the lower half of the data, excluding the overall median (the first four values: 55, 60, 70, 75). This is the average of the second and third values in that subset: (60 + 70) / 2 = 65.
The third quartile (Q3) is likewise the median of the upper half, excluding the overall median (the last four values: 85, 90, 95, 100). This is the average of the second and third values in that subset: (90 + 95) / 2 = 92.5. (Tukey's hinge method, which includes the median in both halves, would instead give Q1 = 70 and Q3 = 90.)
Thus, the five-number summary for this dataset is: minimum = 55, Q1 = 65, median = 80, Q3 = 92.5, maximum = 100. The interquartile range (IQR), calculated as Q3 - Q1, is 92.5 - 65 = 27.5, which measures the spread of the middle 50% of the scores.
This summary indicates that the central half of the exam scores spans from 65 to 92.5, while the overall range extends from 55 to 100, showing moderate variability without extreme values. To check for potential outliers, compute the bounds as Q1 - 1.5 × IQR = 65 - 1.5 × 27.5 = 23.75 and Q3 + 1.5 × IQR = 92.5 + 1.5 × 27.5 = 133.75; since all scores fall within these bounds (55 > 23.75 and 100 < 133.75), there are no outliers in this dataset. This five-number summary can be visualized using a box-and-whisker plot, with the box representing the IQR and whiskers extending to the minimum and maximum.[1]
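The worked example can be reproduced with a short Python sketch using the standard library; the half-splitting below mirrors the exclusive approach used in the example (the slice indices are illustrative for this nine-value dataset).

from statistics import median

scores = [55, 60, 70, 75, 80, 85, 90, 95, 100]

# Exclusive approach: the overall median (index 4) is left out of both halves.
lower_half, upper_half = scores[:4], scores[5:]
five_num = (min(scores), median(lower_half), median(scores),
            median(upper_half), max(scores))
print(five_num)                        # (55, 65.0, 80, 92.5, 100)
print(five_num[3] - five_num[1])       # IQR = 27.5

With Tukey's hinges (or NumPy's default linear interpolation) the quartiles would instead come out as 70 and 90, another reminder that the choice of quartile method should be stated alongside the summary.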
Software Implementations
In statistical software, the five-number summary can be computed efficiently using built-in functions that automate the process of finding the minimum, first quartile, median, third quartile, and maximum of a dataset.[32] These tools vary in their default methods for calculating quartiles, which can lead to slight differences in results due to interpolation choices; for instance, R's quantile() uses type 7 of the Hyndman and Fan classification by default.[33]
In R, the summary() function applied to a numeric vector or data frame column generates the five-number summary directly, outputting the minimum, first quartile (Q1), median, third quartile (Q3), and maximum, along with the mean (which is extraneous for this summary).[34] For example, on a vector x <- c(1, 2, 3, 4, 5), executing summary(x) yields:
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 2.00 3.00 3.00 4.00 5.00
Users can ignore the mean and extract the relevant values using functions like quantile(x, probs = c(0.25, 0.5, 0.75)) for Q1, the median, and Q3, combined with min(x) and max(x); alternatively, the base function fivenum(x) returns Tukey's five-number summary (based on hinges) directly.
In Python, the NumPy library's percentile() function computes specific percentiles to derive the quartiles, with the minimum and maximum obtained separately using min() and max().[35] For a NumPy array arr = np.array([1, 2, 3, 4, 5]), the code q1 = np.percentile(arr, 25); median = np.percentile(arr, 50); q3 = np.percentile(arr, 75); min_val = np.min(arr); max_val = np.max(arr) produces Q1=2.0, median=3.0, Q3=4.0, min=1, and max=5, using linear interpolation by default. Alternatively, the Pandas library's describe() method on a DataFrame or Series provides a comprehensive summary including the five-number summary elements (min, 25%, 50%, 75%, max) for numeric columns.[36] For a Pandas Series s = pd.Series([1, 2, 3, 4, 5]), s.describe() outputs:
| count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|
| 5.0 | 3.0 | 1.581139 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
Note that describe() reports the count, mean, and standard deviation alongside the five-number summary values, so the latter may need to be selected explicitly.
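For reference, the NumPy and pandas calls described above can be gathered into one runnable snippet; the data values match the earlier example.

import numpy as np
import pandas as pd

arr = np.array([1, 2, 3, 4, 5])

# NumPy: quartiles via percentile(), extremes via min()/max().
q1, med, q3 = np.percentile(arr, [25, 50, 75])
print(arr.min(), q1, med, q3, arr.max())           # 1 2.0 3.0 4.0 5

# pandas: describe() includes min, 25%, 50%, 75%, and max among its statistics.
s = pd.Series(arr)
print(s.describe().loc[["min", "25%", "50%", "75%", "max"]])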
In SAS, the PROC UNIVARIATE procedure computes the five-number summary as part of its descriptive statistics output, including moments, quantiles, and extremes, with options to store results via the OUTPUT statement.[32] For a dataset with variable x, the code:
PROC UNIVARIATE DATA=mydata;
VAR x;
OUTPUT OUT=sumstats pctlpts=25 50 75 pctlpre=Q_ pctlname=1 median 3;
RUN;
generates a dataset sumstats containing Q1, the median, and Q3; the minimum and maximum can be added to the same output dataset with the MIN= and MAX= keywords of the OUTPUT statement, or read from the procedure's default Quantiles table.[37] By default, PROC UNIVARIATE uses percentile definition 5 (an empirical distribution function with averaging), which does not interpolate and can therefore differ from the interpolation-based defaults of other tools.[33]
In Stata, the summarize command with the detail option provides detailed output including the minimum, p25 (first quartile), p50 (median), p75 (third quartile), and maximum, along with additional percentiles, skewness, and kurtosis.[38] For a variable x, executing summarize x, detail displays these statistics in the console; for the sample data 1 through 5 used earlier, min=1, p25=2, p50=3, p75=4, max=5. To extract the values programmatically, inspect the stored results with return list and reference them as r(min), r(p25), r(p50), r(p75), and r(max) in subsequent commands.[38] Stata's percentile calculation in summarize does not interpolate between observations, so its quartiles can differ from interpolation-based defaults such as R's type 7, particularly for small samples.[33]