Anscombe's quartet
Anscombe's quartet is a collection of four synthetic datasets, each comprising eleven pairs of observations (x, y), devised by British statistician Francis J. Anscombe in 1973 to highlight the critical role of graphical methods in statistical analysis.[1] Despite sharing virtually identical summary statistics (a mean of x = 9, a mean of y = 7.50, a variance of x = 11.00, a variance of y = 4.12, a correlation coefficient of r = 0.816, and the same least-squares regression line y = 3 + 0.5x), the datasets yield profoundly different scatter plots that reveal distinct data structures and relationships.[1] This deliberate construction underscores how numerical summaries alone can mask underlying patterns, anomalies, or non-linearities, and it makes the case for visualization as an essential preliminary step in data exploration.[1]

The quartet's four datasets, labeled I through IV, exemplify varied graphical behaviors:

- Dataset I displays a scattered but approximately linear relationship between x and y, aligning well with the common regression line and supporting straightforward linear modeling.
- Dataset II reveals a smooth, nonlinear pattern resembling an inverted parabola: y rises with x, peaks near x = 11, and then falls, so the linear regression provides a poor fit despite the matching summary statistics.
- Dataset III follows a nearly exact linear trend, but a single outlier at (13, 12.74) pulls the fitted line away from the line on which the other ten points lie, steepening its slope until the fit matches that of the other datasets.
- Dataset IV has ten points stacked vertically at x = 8 (with y values ranging from 5.25 to 8.84) and one distant high-leverage point at (19, 12.50). That single point alone determines the slope of the regression line, making the apparent linear relationship illusory.
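
The shared regression line follows directly from the other summary statistics, which is why matching the means, variances, and correlation is enough to match the fit. As a worked check using the rounded values quoted above (the symbols $b$, $a$, $s_x$, and $s_y$ are notation introduced here for the slope, intercept, and sample standard deviations, and are not taken from Anscombe's paper):

$$
b = r\,\frac{s_y}{s_x} \approx 0.816 \times \sqrt{\frac{4.12}{11.00}} \approx 0.816 \times 0.612 \approx 0.50,
\qquad
a = \bar{y} - b\,\bar{x} \approx 7.50 - 0.50 \times 9 = 3.00
$$
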
History and Origin
Creation by Francis Anscombe
Francis John Anscombe (1918–2001), a prominent British statistician, created Anscombe's quartet in 1973 as a pedagogical tool to underscore the pitfalls of relying solely on numerical statistical analyses.[3] Born in England and educated at Trinity College, Cambridge, where he earned his B.A. in mathematics in 1939, Anscombe went on to lecture in statistics at the university, shaping the field through his emphasis on rigorous interpretive methods.[4] He later moved to Princeton and then to Yale, where he founded the statistics department in 1963.[5]

Anscombe's motivation stemmed from the rapid adoption of high-speed computers in the early 1970s, which facilitated complex numerical computations but often encouraged analysts to overlook visual inspection of data.[3] In his view, this technological shift exacerbated a pre-existing tendency to prioritize "exact" numerical outputs over the interpretive insights provided by graphs, leading to potentially erroneous conclusions.[3] He constructed the quartet to demonstrate that identical summary statistics could mask fundamentally different underlying relationships.[3]

The quartet appeared in Anscombe's seminal paper "Graphs in Statistical Analysis," published in The American Statistician.[3] In it, he urged that computers be used to "make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding."[3] The quartet reflected Anscombe's broader contributions to statistical computing and quality control, where he pioneered methods to ensure reliable data interpretation amid growing computational power.[5]

Publication and Initial Reception
Anscombe's quartet was first introduced in the article "Graphs in Statistical Analysis" by Francis J. Anscombe, published in The American Statistician, Volume 27, Issue 1, pages 17–21, in February 1973.[6] In the paper, Anscombe advocated for the routine use of graphs as a "simple but powerful" diagnostic tool in statistical analysis, emphasizing their ability to reveal structures hidden by numerical summaries alone. He presented the quartet as an illustrative exhibit: four datasets engineered to yield nearly identical simple linear regression outputs (the same means, variances, correlation coefficients, and regression equations) yet displaying markedly different underlying data configurations when plotted. This demonstration underscored the risks of relying solely on summary statistics, particularly when outliers, nonlinearity, or other anomalies invalidate model assumptions.[3]

The paper was well received within the statistical community, with no notable controversies arising from its claims. It was praised for its straightforward presentation and practical focus, making complex ideas accessible to both practitioners and educators. The quartet has since become a standard pedagogical tool, frequently cited in textbooks and incorporated into courses on exploratory data analysis to highlight the indispensability of visualization.[7]

Description of the Datasets
Overview of the Quartet
Anscombe's quartet refers to a collection of four bivariate datasets, each comprising 11 paired observations of variables x and y, constructed to possess nearly identical summary statistical measures while exhibiting fundamentally different relationships between the variables.[1] These datasets were introduced by statistician Francis J. Anscombe in his 1973 paper to underscore the limitations of relying solely on numerical summaries in statistical analysis.[1]

The core purpose of the quartet is to illustrate how datasets that appear statistically equivalent based on aggregate metrics (such as means, variances, and correlation coefficients) can nonetheless reveal markedly distinct underlying structures when examined graphically.[1] This demonstration highlights the potential pitfalls of descriptive statistics alone, emphasizing the necessity of visualization to uncover patterns that might otherwise remain obscured.[1]

The datasets are conventionally labeled as sets I, II, III, and IV, each representing a distinct relational archetype: set I approximates a linear association, set II displays a nonlinear curvature, set III is linear apart from a single outlier in y that tilts the fitted line, and set IV holds x constant for every observation except one high-leverage point.[1]

Detailed Data for Each Dataset
Anscombe's quartet comprises four distinct datasets, each containing eleven paired observations of variables x and y, designed to share the same basic statistical properties despite markedly different underlying relationships. The exact numerical values for these datasets, as originally presented, are detailed below.

Dataset I
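This dataset exhibits a roughly linear positive relationship between x and y.

| x | y |
|---|---|
| 10 | 8.04 |
| 8 | 6.95 |
| 13 | 7.58 |
| 9 | 8.81 |
| 11 | 8.33 |
| 14 | 9.96 |
| 6 | 7.24 |
| 4 | 4.26 |
| 12 | 10.84 |
| 7 | 4.82 |
| 5 | 5.68 |

Dataset II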
This dataset features a nonlinear, parabolic relationship between x and y.

| x | y |
|---|---|
| 10 | 9.14 |
| 8 | 8.14 |
| 13 | 8.74 |
| 9 | 8.77 |
| 11 | 9.26 |
| 14 | 8.10 |
| 6 | 6.13 |
| 4 | 3.10 |
| 12 | 9.13 |
| 7 | 7.26 |
| 5 | 4.74 |
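
The curvature in Dataset II can be seen numerically as well as graphically. The following is a minimal sketch, using only the Python standard library, that computes the residual of each Dataset II point against the shared regression line y = 3 + 0.5x. The names `dataset_ii` and `residuals` are illustrative choices and do not come from Anscombe's paper.

```python
# Dataset II values as listed in the table above.
dataset_ii = [
    (10, 9.14), (8, 8.14), (13, 8.74), (9, 8.77), (11, 9.26), (14, 8.10),
    (6, 6.13), (4, 3.10), (12, 9.13), (7, 7.26), (5, 4.74),
]

# Residual of each point against the shared line y = 3 + 0.5x, ordered by x.
residuals = [
    (x, round(y - (3 + 0.5 * x), 2)) for x, y in sorted(dataset_ii)
]

for x, res in residuals:
    print(f"x = {x:2d}  residual = {res:+.2f}")
```

The residuals are negative at both ends of the x range and positive in the middle, a systematic arch that a straight line cannot capture. For a well-specified linear model they would scatter around zero with no such pattern.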
Dataset III
This dataset shows a nearly exact linear relationship for ten of its eleven points; the remaining point, (13, 12.74), is an outlier that shifts the fitted regression line.

| x | y |
|---|---|
| 10 | 7.46 |
| 8 | 6.77 |
| 13 | 12.74 |
| 9 | 7.11 |
| 11 | 7.81 |
| 14 | 8.84 |
| 6 | 6.08 |
| 4 | 5.39 |
| 12 | 8.15 |
| 7 | 6.42 |
| 5 | 5.73 |
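
The influence of the single outlier in Dataset III can be checked by fitting the line twice, once with all eleven points and once with the outlier removed. The sketch below is one way to do this with the Python standard library; `fit_line` is a hypothetical helper written for this illustration, implementing ordinary least squares from its textbook definition.

```python
from statistics import mean


def fit_line(points):
    """Ordinary least-squares fit of y on x; returns (intercept, slope)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Dataset III values as listed in the table above.
dataset_iii = [
    (10, 7.46), (8, 6.77), (13, 12.74), (9, 7.11), (11, 7.81), (14, 8.84),
    (6, 6.08), (4, 5.39), (12, 8.15), (7, 6.42), (5, 5.73),
]
outlier = (13, 12.74)
without_outlier = [p for p in dataset_iii if p != outlier]

intercept_all, slope_all = fit_line(dataset_iii)
intercept_trim, slope_trim = fit_line(without_outlier)

print(f"all 11 points:   y = {intercept_all:.2f} + {slope_all:.3f}x")
print(f"outlier removed: y = {intercept_trim:.2f} + {slope_trim:.3f}x")
```

With all eleven points the fit is close to the shared line y = 3 + 0.5x; with the outlier removed the slope drops to roughly 0.345 and the intercept rises to about 4.0, which is the line the remaining ten points follow almost exactly.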
Dataset IV
This dataset includes a vertical line of points at x = 8 with one outlier at x = 19.

| x | y |
|---|---|
| 8 | 6.58 |
| 8 | 5.76 |
| 8 | 7.71 |
| 8 | 8.84 |
| 8 | 8.47 |
| 8 | 7.04 |
| 8 | 5.25 |
| 19 | 12.50 |
| 8 | 5.56 |
| 8 | 7.91 |
| 8 | 6.89 |
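
The claim that the four datasets share nearly identical summary statistics can be verified directly from the tabulated values. The following is a minimal sketch using only the Python standard library. The Dataset I values are those commonly reproduced from Anscombe's 1973 paper, and the names `quartet`, `summarize`, and `summaries` are illustrative assumptions rather than anything defined in the source.

```python
from math import sqrt
from statistics import mean, variance

x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x values for datasets I-III
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Return the summary statistics the quartet is constructed to share."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    var_x, var_y = variance(xs), variance(ys)  # sample variances (n - 1 denominator)
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    slope = cov_xy / var_x
    return {
        "mean_x": x_bar, "mean_y": y_bar, "var_x": var_x, "var_y": var_y,
        "r": cov_xy / sqrt(var_x * var_y),
        "slope": slope, "intercept": y_bar - slope * x_bar,
    }


summaries = {name: summarize(xs, ys) for name, (xs, ys) in quartet.items()}
for name, s in summaries.items():
    print(name, {key: round(value, 3) for key, value in s.items()})
```

The printed rows agree to about two decimal places, even though the four scatter plots look nothing alike. This is the sense in which the summary statistics are "nearly" rather than exactly identical.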