
Five-number summary

The five-number summary is a descriptive statistical tool used to summarize the distribution of a numerical dataset. It consists of five key values: the minimum (smallest observation), the first quartile (Q1, the 25th percentile), the median (50th percentile), the third quartile (Q3, the 75th percentile), and the maximum (largest observation). These values divide the data into four parts, each containing approximately 25% of the observations, providing insight into the data's center, spread, and skewness without assuming a particular distribution. Introduced by statistician John W. Tukey in his 1977 book Exploratory Data Analysis, the five-number summary serves as the foundation for boxplots (or box-and-whisker plots), which visually represent these metrics to identify outliers and compare distributions across multiple groups. Unlike the mean and standard deviation, its median and quartiles are robust to extreme values, which makes it particularly useful for ordinal or non-parametric data analysis in fields such as the social sciences.

Definition and Components

Definition

The five-number summary is a collection of five key descriptive statistics: the minimum value, the first quartile (Q1), the median (also known as the second quartile, Q2), the third quartile (Q3), and the maximum value of a dataset. These values divide the data into four equal parts, offering a snapshot of its distribution. Its primary purpose is to summarize the central tendency, spread, and overall shape of the data without requiring assumptions about normality or other distributional forms, making it particularly useful for exploratory analysis of potentially skewed or outlier-prone datasets. Introduced by statistician John W. Tukey in 1977 as a core element of exploratory data analysis (EDA), it promotes techniques that are resistant to anomalies and focused on revealing patterns through simple, visualizable metrics. In contrast to parametric summaries like the mean and standard deviation, which can be distorted by extreme values and often rely on normality assumptions, the five-number summary is non-parametric and its quartile-based components are robust, providing a more reliable overview for diverse data types. It is commonly visualized via box-and-whisker plots to highlight these features intuitively.

Components

The five-number summary consists of five key descriptive statistics that capture essential aspects of a dataset's distribution: the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. Each component plays a distinct role in summarizing the location, spread, and central tendency of the data without assuming a specific distributional form.

The minimum is the smallest observed value in the dataset, establishing the lower boundary of the data range. It provides a baseline for understanding the extent of the data's lower tail, though it is sensitive to outliers.

The first quartile (Q1), also known as the lower quartile, is the value below which 25% of the data lies, marking the 25th percentile and serving as the lower boundary of the middle 50% of the dataset. It marks where the lowest quarter of the data ends, offering insight into the dispersion of the lower half.

The median (Q2), or second quartile, is the central value of the ordered dataset, dividing it into two equal halves such that 50% of the data falls below it and 50% above. As a robust measure of central tendency, it indicates the data's midpoint and is less affected by extreme values than the mean.

The third quartile (Q3), or upper quartile, is the value below which 75% of the data lies, corresponding to the 75th percentile and defining the upper boundary of the middle 50% of the dataset. It marks where the upper quarter of the data begins, helping to assess the spread in the higher portion of the distribution.

The maximum is the largest observed value in the dataset, setting the upper boundary of the data range. Like the minimum, it captures the extent of the upper tail but may be influenced by outliers.

From these components, the interquartile range (IQR) is derived as the difference between Q3 and Q1, quantifying the spread of the central 50% of the data and providing a measure of variability that is resistant to outliers. This derived statistic describes the consistency within the middle portion of the dataset, complementing the overall range bounded by the minimum and maximum.
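The components and the derived IQR can be gathered into one small structure. The following is a minimal sketch, not a reference implementation: the names `FiveNumberSummary` and `five_number_summary` are hypothetical, and the quartiles come from Python's standard-library `statistics.quantiles` with `method="inclusive"`, which is only one of several quartile conventions.

```python
from statistics import quantiles
from typing import NamedTuple, Sequence


class FiveNumberSummary(NamedTuple):
    """Hypothetical container for the five values (name chosen for illustration)."""
    minimum: float
    q1: float
    median: float
    q3: float
    maximum: float

    @property
    def iqr(self) -> float:
        # Interquartile range: spread of the central 50% of the data.
        return self.q3 - self.q1


def five_number_summary(values: Sequence[float]) -> FiveNumberSummary:
    """Return (min, Q1, median, Q3, max) for a non-empty sequence."""
    if not values:
        raise ValueError("the five-number summary is undefined for empty data")
    data = sorted(values)
    if len(data) == 1:
        # A single observation is its own minimum, quartiles, and maximum.
        only = data[0]
        return FiveNumberSummary(only, only, only, only, only)
    # Assumption: the "inclusive" convention; other conventions can give
    # different Q1 and Q3 for the same data.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    return FiveNumberSummary(data[0], q1, med, q3, data[-1])


summary = five_number_summary([4, 1, 5, 2, 3])
print(summary, summary.iqr)
```

Keeping the IQR as a derived property rather than a sixth stored field mirrors the text: it is computed from Q1 and Q3, not observed separately.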

Calculation Methods

Determining the Median

To determine the median in the context of the five-number summary, the dataset must first be sorted in ascending order, as this ordered arrangement is essential for identifying positional values accurately. For a dataset with an odd number of observations n, the median is the middle value at position (n+1)/2. For example, in the sorted dataset {1, 3, 5, 7, 9}, where n = 5, the median is 5 at position 3. When n is even, the median is the average of the two middle values at positions n/2 and (n/2)+1:

$$\text{median} = \frac{x_{(n/2)} + x_{(n/2 + 1)}}{2}$$

where $x_{(i)}$ denotes the i-th value in the ordered dataset. For instance, in the sorted set {1, 3, 5, 7}, where n = 4, the median is (3 + 5)/2 = 4. In the five-number summary, the median serves as the second quartile (Q2), providing a measure of central tendency resistant to outliers. Ties in the data do not affect the calculation, as it relies solely on positional indexing rather than unique values. For an empty dataset (n = 0), the median is undefined, as no values exist to order.
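The odd and even cases can be sketched directly from the positional rule. This is a minimal illustration; the function name `median_of` is hypothetical, and raising an error for empty input is one possible way to represent "undefined".

```python
from typing import Sequence


def median_of(values: Sequence[float]) -> float:
    """Median by position in the sorted data (odd and even n)."""
    if not values:
        # The median is undefined for n = 0; raising is an illustrative choice.
        raise ValueError("median is undefined for an empty dataset")
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        # 1-based position (n+1)/2 corresponds to 0-based index n // 2.
        return data[mid]
    # 1-based positions n/2 and n/2 + 1 correspond to indices mid - 1 and mid.
    return (data[mid - 1] + data[mid]) / 2


print(median_of([1, 3, 5, 7, 9]))  # 5
print(median_of([1, 3, 5, 7]))     # 4.0
```

Because only positions matter, repeated values need no special handling.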

Determining the Quartiles

Quartiles are the values that divide a sorted dataset of n observations into four equal parts, with the first quartile (Q1) marking the 25th percentile and the third quartile (Q3) marking the 75th percentile. In the five-number summary introduced by Tukey, the quartiles are calculated using the hinge method, also known as the inclusive approach. This method divides the ordered data into lower and upper halves after identifying the median, including the median in both halves if n is odd. Q1 is then the median of the lower half, and Q3 is the median of the upper half. When a half contains an even number of values, its median is the average of the two central values in that half; for example, if n is divisible by 4, Q1 averages the values at positions n/4 and n/4 + 1 in the ordered data. This discrete method needs no interpolation beyond averaging two adjacent values and matches Tukey's original hinges.

A common alternative places Q1 at position (n+1)/4 and Q3 at position 3(n+1)/4. If the position is an integer k, the quartile is the k-th ordered observation $x_{(k)}$; otherwise, linear interpolation is used between the two nearest observations. The interpolated value for Q1, for example, is

$$Q_1 = x_{(\lfloor (n+1)/4 \rfloor)} + f \cdot \left( x_{(\lfloor (n+1)/4 \rfloor + 1)} - x_{(\lfloor (n+1)/4 \rfloor)} \right)$$

where f is the fractional part of (n+1)/4. The same principle applies to Q3, replacing the position with 3(n+1)/4. This positional approach is the one described in the NIST Engineering Statistics Handbook for percentile calculations.

These methods can produce slightly different quartile values, particularly for small n, leading to variations between software implementations. For instance, Excel's QUARTILE.EXC function uses the (n+1)p positional method with linear interpolation, while Excel's QUARTILE.INC and R's default quantile() (type 7) use the position (n-1)p + 1; R's boxplot() function, by contrast, applies the inclusive hinge method. To maintain consistency and comparability in analyses, practitioners should select and document a single method throughout.
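To make the difference between the two approaches concrete, the sketch below implements both and applies them to the same nine values. The function names are hypothetical, and clamping positions that fall outside 1..n to the nearest observation is an assumption added for very small samples, not part of either definition above.

```python
import math
from typing import List, Sequence, Tuple


def _median_sorted(data: List[float]) -> float:
    """Median of an already sorted, non-empty list."""
    n = len(data)
    mid = n // 2
    return data[mid] if n % 2 else (data[mid - 1] + data[mid]) / 2


def tukey_hinges(values: Sequence[float]) -> Tuple[float, float]:
    """Q1 and Q3 as medians of the halves, median included in both if n is odd."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("quartiles are undefined for empty data")
    half = (n + 1) // 2  # size of each half under the inclusive convention
    return _median_sorted(data[:half]), _median_sorted(data[n - half:])


def positional_quantile(values: Sequence[float], p: float) -> float:
    """Value at 1-based position (n+1)p, linearly interpolated."""
    data = sorted(values)
    n = len(data)
    if n == 0:
        raise ValueError("quantiles are undefined for empty data")
    pos = (n + 1) * p
    k = math.floor(pos)
    f = pos - k  # fractional part of the position
    if k < 1:        # assumption: clamp to the smallest observation
        return data[0]
    if k >= n:       # assumption: clamp to the largest observation
        return data[-1]
    return data[k - 1] + f * (data[k] - data[k - 1])


scores = [55, 60, 70, 75, 80, 85, 90, 95, 100]
print(tukey_hinges(scores))                                                  # (70, 90)
print(positional_quantile(scores, 0.25), positional_quantile(scores, 0.75))  # 65.0 92.5
```

For these nine values the hinges are 70 and 90, while the (n+1)p method gives 65 and 92.5, which is why documenting the chosen method matters.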

Identifying Minimum and Maximum

The minimum value in the five-number summary is the smallest observation in the dataset, denoted $x_{(1)}$ when the data are arranged in non-decreasing order. Similarly, the maximum value is the largest observation, denoted $x_{(n)}$, where n is the sample size. Unlike the quartiles, these endpoints require no interpolation or positional averaging and are read directly from the extremes of the ordered list. When a dataset contains duplicate values, the minimum and maximum remain the extreme observations; duplicates are treated as distinct entries and do not alter the selection. Missing values are, by standard practice in statistical software, excluded from consideration, so that the minimum and maximum are computed solely on the available observations. One limitation is that the minimum and maximum are highly sensitive to outliers: a single extreme value directly determines these bounds and can distort the apparent range of the data. In box-and-whisker plots, these values serve as the endpoints of the whiskers when no observations are flagged as outliers, framing the spread of the data.
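The treatment of missing values can be sketched as follows. This is an illustrative helper under stated assumptions: the name `min_max` is hypothetical, and "missing" is taken to mean either `None` or a floating-point NaN.

```python
import math
from typing import Iterable, Optional, Tuple


def min_max(values: Iterable[Optional[float]]) -> Tuple[float, float]:
    """Minimum and maximum of the non-missing observations."""
    # Drop None and NaN first: comparisons with NaN are always False, so
    # leaving NaN in place would make min() and max() depend on input order.
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not present:
        raise ValueError("no non-missing observations")
    return min(present), max(present)


print(min_max([3.0, None, float("nan"), 1.0, 3.0]))  # (1.0, 3.0)
```

Duplicates (the two 3.0 values here) need no special handling, since only the extremes are selected.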

Applications and Visualization

Role in Descriptive Statistics

The five-number summary plays a central role in descriptive statistics by providing a non-parametric framework for summarizing the distribution of a dataset through key order statistics: the minimum, first quartile, median, third quartile, and maximum. This approach is particularly valuable in exploratory data analysis, where it enables researchers to quickly assess central tendency, spread, and potential skewness without assuming a particular distribution. Unlike parametric measures such as the mean and standard deviation, which can be heavily influenced by extreme values, the five-number summary relies on medians and quartiles that are inherently more stable, making it a robust tool for initial exploration and preliminary investigation of large datasets.

One key advantage of the five-number summary is its resistance to outliers: the median and quartiles are less affected by extreme observations than the mean, which can be pulled toward anomalies, and the standard deviation, which amplifies their impact on measures of variability. This robustness makes it especially useful for skewed distributions, where traditional measures may distort the true center and spread; for instance, in positively skewed data, the median is a better representation of the typical value than the mean. Additionally, the interquartile range (IQR), calculated as the difference between the third quartile (Q3) and the first quartile (Q1), or $\text{IQR} = Q_3 - Q_1$, serves as a reliable, outlier-resistant measure of dispersion that focuses on the middle 50% of the data, offering insight into variability without being swayed by the tails.

In practice, the five-number summary facilitates comparing distributions across groups, such as income levels between demographics, by highlighting differences in medians and IQRs. For example, in economics, it aids in assessing income inequality by quantifying the spread between lower and upper quartiles, providing a snapshot of how resources are dispersed without requiring full distributional assumptions. Similarly, in clinical research, it is used to summarize data such as patient ages, enabling clinicians to describe typical ages and their variability while detecting potential outliers. These applications underscore its utility in fields where data may be non-normal or contain influential extremes.

Despite its strengths, the five-number summary has limitations: it does not capture multimodality or the exact shape of the distribution, and can overlook clusters or fine-grained patterns that histograms or density plots would reveal. It is therefore best used alongside graphical tools and other descriptive techniques, especially for highly irregular datasets.
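The contrast between outlier-resistant and outlier-sensitive measures can be illustrated with a small experiment. The data below are made up for illustration, `spread_report` is a hypothetical helper, and the quartiles use the standard library's "inclusive" convention as an assumption.

```python
from statistics import mean, median, quantiles, stdev
from typing import Dict, Sequence


def spread_report(values: Sequence[float]) -> Dict[str, float]:
    """Location and spread measures side by side."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean(values),
        "stdev": stdev(values),
        "median": median(values),
        "iqr": q3 - q1,
    }


# Illustrative data: the second list replaces the largest value with an extreme one.
clean = [21, 22, 23, 24, 25, 26, 27]
contaminated = clean[:-1] + [270]

for label, data in (("clean", clean), ("contaminated", contaminated)):
    print(label, spread_report(data))
```

In this example the median and IQR are identical for both lists, while the mean and standard deviation change substantially once the extreme value is introduced.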

Box-and-Whisker Plots

The box-and-whisker plot, also known as a box plot, is a graphical representation that uses the five-number summary to illustrate the distribution, central tendency, spread, and potential outliers of a dataset. Invented by John W. Tukey as part of exploratory data analysis, it condenses the minimum, first quartile (Q1), median, third quartile (Q3), and maximum into a compact visual format, allowing rapid assessment of data characteristics without displaying every data point.

The core of the plot is a rectangular box spanning from Q1 to Q3, which encloses the interquartile range (IQR) representing the middle 50% of the data, with a line inside the box marking the median. Extending from the box are "whiskers," lines reaching to the smallest and largest data points that fall within 1.5 times the IQR of Q1 and Q3, respectively; the whiskers thus show the range of the data excluding potential outliers. Data points beyond the whiskers are plotted individually as outliers, aiding the identification of unusual values that may warrant further investigation.

Outlier detection in the standard Tukey box plot uses fences calculated as follows:

$$\text{Lower fence} = Q_1 - 1.5\,(Q_3 - Q_1)$$

$$\text{Upper fence} = Q_3 + 1.5\,(Q_3 - Q_1)$$

Values below the lower fence or above the upper fence are considered mild outliers and are drawn as points outside the whiskers, while extreme outliers (more than 3 times the IQR beyond the quartiles) may be distinguished with different symbols. The 1.5 × IQR multiplier provides a robust rule for flagging deviations in non-normal distributions.

Variations of the basic Tukey box plot include notched versions, where indentations (notches) are added to the sides of the box to represent a confidence interval around the median, facilitating visual comparison of medians across groups; non-overlapping notches suggest a significant difference at approximately the 95% confidence level. These notches, proposed by McGill, Tukey, and Larsen, enhance the plot's inferential capabilities without altering the core five-number summary structure. Other adaptations may adjust whisker lengths or use variable box widths to reflect sample sizes, but the standard form prioritizes simplicity.

A key benefit of box-and-whisker plots is that they enable quick visual comparison of distributions across multiple groups, revealing differences in location, variability, symmetry, and the presence of outliers at a glance, which is particularly useful when comparing factors such as treatment effects or machine performance.
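The fence rule and the resulting whisker endpoints can be sketched without any plotting library. This is a minimal illustration: the function names are hypothetical, the quartiles are supplied by the caller (so any quartile convention can be used), and the sample values are invented to include one mild and one extreme outlier.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def tukey_fences(q1: float, q3: float, k: float = 1.5) -> Tuple[float, float]:
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify(values: Sequence[float], q1: float, q3: float) -> Dict[str, object]:
    """Whisker endpoints plus mild and extreme outliers for given quartiles."""
    inner_lo, inner_hi = tukey_fences(q1, q3, 1.5)
    outer_lo, outer_hi = tukey_fences(q1, q3, 3.0)
    inside = [v for v in values if inner_lo <= v <= inner_hi]
    mild: List[float] = [
        v for v in values
        if (outer_lo <= v < inner_lo) or (inner_hi < v <= outer_hi)
    ]
    extreme: List[float] = [v for v in values if v < outer_lo or v > outer_hi]
    # Whiskers end at the most extreme observations still inside the inner fences.
    whiskers: Optional[Tuple[float, float]] = (
        (min(inside), max(inside)) if inside else None
    )
    return {"whiskers": whiskers, "mild": mild, "extreme": extreme}


# Illustrative values; the quartiles 65 and 92.5 are passed in as given.
values = [55, 60, 70, 75, 80, 85, 90, 95, 100, 140, 200]
print(tukey_fences(65, 92.5))
print(classify(values, 65, 92.5))
```

Separating the fence calculation from the classification keeps the multiplier configurable, so the same helper covers both the 1.5 and the 3 × IQR thresholds.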

Examples

Numerical Example

To illustrate the calculation of the five-number summary, consider a small dataset of nine exam scores, already sorted in ascending order: 55, 60, 70, 75, 80, 85, 90, 95, 100. The minimum value is the smallest score, 55. The maximum value is the largest score, 100. The median, or second quartile (Q2), is the middle value in the ordered dataset. With nine observations, it is the fifth value: 80.

The first quartile (Q1) is taken here as the median of the lower half of the data with the overall median excluded (the first four values: 55, 60, 70, 75). This is the average of the second and third values in that subset: (60 + 70) / 2 = 65. The third quartile (Q3) is the median of the upper half of the data (the last four values: 85, 90, 95, 100), the average of the second and third values in that subset: (90 + 95) / 2 = 92.5. These values also match the (n+1)p positional method; Tukey's inclusive hinges, which include the median in both halves, would instead give Q1 = 70 and Q3 = 90.

Thus, the five-number summary for this dataset is: minimum = 55, Q1 = 65, median = 80, Q3 = 92.5, maximum = 100. The interquartile range (IQR), calculated as Q3 - Q1, is 92.5 - 65 = 27.5, which measures the spread of the middle 50% of the scores. The central half of the exam scores therefore spans from 65 to 92.5, while the overall range extends from 55 to 100, showing moderate variability without extreme values.

To check for potential outliers, compute the bounds as Q1 - 1.5 × IQR = 65 - 1.5 × 27.5 = 23.75 and Q3 + 1.5 × IQR = 92.5 + 1.5 × 27.5 = 133.75; since all scores fall within these bounds (55 > 23.75 and 100 < 133.75), there are no outliers in this dataset. The summary can be visualized using a box-and-whisker plot, with the box representing the IQR and, because no outliers are present, whiskers extending to the minimum and maximum.
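The arithmetic of this example can be checked with Python's standard library. The sketch assumes `statistics.quantiles` with `method="exclusive"`, which for these nine scores reproduces the quartiles 65 and 92.5 used in the example; variable names are illustrative.

```python
from statistics import quantiles

scores = [55, 60, 70, 75, 80, 85, 90, 95, 100]

# Quartiles under the standard library's "exclusive" convention.
q1, med, q3 = quantiles(scores, n=4, method="exclusive")
summary = (min(scores), q1, med, q3, max(scores))

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [s for s in scores if s < lower_fence or s > upper_fence]

print(summary)                   # (55, 65.0, 80.0, 92.5, 100)
print(iqr)                       # 27.5
print(lower_fence, upper_fence)  # 23.75 133.75
print(outliers)                  # []
```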

Software Implementations

In statistical software, the five-number summary can be computed efficiently using built-in functions that automate finding the minimum, first quartile, median, third quartile, and maximum of a dataset. These tools vary in their default methods for calculating quartiles, which can lead to slight differences in results due to different interpolation techniques; for instance, R's quantile() uses type 7 of the Hyndman and Fan classification by default. In R, the summary() function applied to a numeric vector or data frame column outputs the minimum, first quartile (Q1), median, third quartile (Q3), and maximum, along with the mean (which is extraneous for this summary); R also provides fivenum(), which returns Tukey's hinge-based five-number summary. For example, on a vector x <- c(1, 2, 3, 4, 5), executing summary(x) yields:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.00    2.00    3.00    3.00    4.00    5.00 
Users can ignore the mean and extract the relevant values using functions like quantile(x, probs = c(0.25, 0.5, 0.75)) for Q1, median, and Q3, combined with min(x) and max(x). In Python, the NumPy library's percentile() function computes specific percentiles to derive the quartiles, with the minimum and maximum obtained separately using min() and max(). For a NumPy array arr = np.array([1, 2, 3, 4, 5]), the code q1 = np.percentile(arr, 25); median = np.percentile(arr, 50); q3 = np.percentile(arr, 75); min_val = np.min(arr); max_val = np.max(arr) produces Q1=2.0, median=3.0, Q3=4.0, min=1, and max=5, using linear interpolation by default. Alternatively, the Pandas library's describe() method on a DataFrame or Series provides a comprehensive summary including the five-number summary elements (min, 25%, 50%, 75%, max) for numeric columns. For a Pandas Series s = pd.Series([1, 2, 3, 4, 5]), s.describe() reports the following values:
  count    mean       std     min     25%     50%     75%     max
    5.0     3.0  1.581139     1.0     2.0     3.0     4.0     5.0
The count, mean, and standard deviation can be ignored when only the core five-number summary is needed. In SAS, the PROC UNIVARIATE procedure computes the five-number summary as part of its output, which includes moments, quantiles, and extremes, with options to store results via the OUTPUT statement. For a dataset mydata with a variable x, the code:
PROC UNIVARIATE DATA=mydata;
   VAR x;
   OUTPUT OUT=sumstats pctlpts=25 50 75 pctlpre=Q_ pctlname=1 median 3;
RUN;
generates a sumstats dataset containing the variables Q_1, Q_median, and Q_3 (Q1, median, and Q3), which can be combined with the minimum and maximum reported in the procedure's default Quantiles table. By default, SAS uses its own percentile definition (PCTLDEF=5), which can differ from the defaults of other tools. In Stata, the summarize command with the detail option provides detailed output including the minimum, p25 (first quartile), p50 (median), p75 (third quartile), and maximum, along with additional percentiles and moment-based statistics. For a variable x, executing summarize x, detail displays these statistics in the console; for the sample data 1 through 5, the corresponding values are min=1, p25=2, p50=3, p75=4, max=5. To access the results programmatically, run return list after the command to see stored results such as r(min), r(p25), r(p50), r(p75), and r(max). Stata's default percentile definition may also differ from those of other tools, particularly for small samples.
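The effect of differing defaults can be seen even within a single tool. As a small illustration using only Python's standard library (not one of the packages described above), `statistics.quantiles` offers two conventions that give different quartiles for the same five values; the "inclusive" result agrees with the NumPy values quoted earlier for this data.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5]

# "inclusive" treats the data as spanning the 0th to 100th percentile.
inclusive = quantiles(data, n=4, method="inclusive")
# "exclusive" (the function's default) uses positions based on (n+1)p.
exclusive = quantiles(data, n=4, method="exclusive")

print(inclusive)  # [2.0, 3.0, 4.0]
print(exclusive)  # [1.5, 3.0, 4.5]
```

The median agrees under both conventions, but Q1 and Q3 do not, which is why reported quartiles should state the method or software default that produced them.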
