Goodness of fit
Goodness of fit, in statistics, refers to a class of hypothesis tests that assess how well a set of observed data aligns with an expected theoretical distribution or model under the null hypothesis.[1] These tests quantify the discrepancy between observed frequencies or values and those predicted by the model, helping researchers determine whether deviations are due to chance or indicate a poor fit.[2] The concept originated with Karl Pearson's development of the chi-square goodness-of-fit test in 1900, which provided a foundational method for evaluating distributional assumptions in data analysis.[3] Pearson's approach built on earlier work in probability and was designed to measure the "success" of fitting data to a theoretical curve, such as the normal distribution, without restricting the method to any one functional form.[4] Over time, this evolved into a broader framework encompassing tests for categorical, discrete, and continuous data across fields such as biology, engineering, and the social sciences.

The most widely used goodness-of-fit test is the Pearson chi-square test, which computes the statistic \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}, where O_i are observed counts and E_i are expected counts in each category or bin.[2] Under the null hypothesis, this statistic approximately follows a chi-square distribution with degrees of freedom equal to the number of categories minus one, further reduced by the number of parameters estimated from the data.[1] Other notable tests include the likelihood-ratio test (deviance statistic G^2 = 2 \sum O_i \log(O_i / E_i)) and non-parametric alternatives such as the Kolmogorov-Smirnov test for continuous distributions.[1] These methods are particularly valuable for validating assumptions in parametric models, for example testing whether data conform to binomial, Poisson, or normal distributions.[2]

Key assumptions for these tests include sufficiently large expected frequencies (typically at least 5 per category for the chi-square test) and independent observations, with results sensitive to the choice of bins when continuous data are grouped.[2] Applications span diverse areas, including genetics to verify Mendelian ratios, quality control to check manufacturing uniformity, and survey analysis to assess response distributions against theoretical expectations.[5] Despite their utility, limitations such as the sensitivity of power to sample size and the need for careful interpretation of p-values underscore the importance of complementary diagnostics like residual plots.[1]

Introduction
Definition and Purpose
Goodness of fit refers to a statistical measure that quantifies the discrepancy between observed data and the values expected under a hypothesized model or distribution.[1] It assesses how well a proposed model aligns with empirical observations by comparing actual outcomes to predictions derived from the model's assumptions.[2] Central to this concept are the "observed values," the actual counts or measurements from the data, and the "expected values," the theoretical frequencies or quantities anticipated if the null hypothesis holds true.[1]

The primary purpose of goodness-of-fit tests is to validate underlying statistical assumptions, such as normality of errors, to facilitate model selection among competing hypotheses, and to evaluate whether data conform to an expected process.[6] These tests operate within a hypothesis-testing framework in which the null hypothesis posits a "good fit" (the observed data are consistent with the specified model) against an alternative hypothesis of significant deviation indicating poor alignment.[2] By providing a formal mechanism for detecting mismatches, goodness of fit helps ensure the reliability of inferences drawn from the data.[1]

Interpretation of goodness-of-fit results focuses on the test statistic and its associated p-value: smaller statistic values suggest closer agreement between observed and expected data, while a p-value greater than a chosen significance level, such as 0.05, indicates that the data provide insufficient evidence to reject the null hypothesis of adequate fit.[7] This threshold helps determine whether deviations are likely due to chance or reflect a substantive lack of model adequacy.[8]

Goodness-of-fit tests find broad application across disciplines: in quality control to verify that manufacturing processes adhere to specified distributions, in biology to analyze genetic inheritance patterns such as Mendelian ratios, in economics to assess error distributions in econometric models, and in machine learning to validate predictive models by checking whether residuals conform to assumed normality.[9][5][10][11] In machine learning, for instance, these tests help confirm that model assumptions hold, enhancing the interpretability and predictive power of algorithms.
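As an illustrative sketch of this decision rule, the following Python snippet tests whether survey responses are spread evenly across four categories and interprets the p-value against α = 0.05; the counts and the use of scipy.stats.chisquare are a hypothetical setup for illustration, not drawn from the cited sources.

```python
from scipy import stats

# Hypothetical survey: are 120 responses spread evenly over 4 categories?
observed = [35, 25, 28, 32]   # actual counts (hypothetical)
expected = [30, 30, 30, 30]   # counts expected under the null hypothesis

stat, p = stats.chisquare(observed, f_exp=expected)

alpha = 0.05
if p > alpha:
    print(f"p = {p:.3f} > {alpha}: insufficient evidence against the fit")
else:
    print(f"p = {p:.3f} <= {alpha}: reject the null of adequate fit")
```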
Historical Development
The concept of goodness of fit emerged from 19th-century advances in probability theory, as statisticians sought methods to assess whether observed data conformed to theoretical distributions, building on foundational work by figures such as Pierre-Simon Laplace on least squares and error analysis. The formalization of goodness-of-fit testing began with Karl Pearson's introduction of the chi-square test in 1900, the first rigorous statistical criterion for evaluating deviations between observed and expected frequencies under a hypothesized distribution. This innovation shifted statistical practice from ad hoc comparisons toward systematic hypothesis testing, influencing fields such as biology and the social sciences.

In the mid-20th century, development focused on nonparametric approaches based on empirical distribution functions. Andrey Kolmogorov formalized the one-sample Kolmogorov-Smirnov test in 1933, based on the maximum discrepancy between the empirical distribution function and the theoretical distribution, and Nikolai Smirnov extended the framework in the late 1930s with the two-sample version and further refinements for goodness-of-fit testing.[12] Building on this line of work, Theodore W. Anderson and Donald A. Darling introduced the Anderson-Darling test in 1952, which weights discrepancies to emphasize the tails of the distribution, improving power over uniform measures.[13] Concurrently, Samuel S. Wilks advanced likelihood-based methods in 1938, establishing the asymptotic chi-square distribution of likelihood ratio statistics under composite hypotheses, a result that underpins many categorical goodness-of-fit tests.[14]

Likelihood ratio approaches gained traction in the 1960s as alternatives to Pearson's chi-square for categorical data, offering a better approximation to the chi-square distribution, especially in small samples; the G-test, a specific likelihood ratio formulation, was formalized during this period and recommended for its superior performance. Its prominence grew in the 1980s through endorsements by Robert R. Sokal and F. James Rohlf, who highlighted its efficiency in biostatistical applications over traditional chi-square methods.

Since 2000, goodness-of-fit methods have been integrated with computational statistics to address high-dimensional data challenges in machine learning, for example by adapting tests via bootstrapping for accurate p-value estimation beyond the asymptotic approximations that dominated earlier developments.[15] These extensions, including frameworks for high-dimensional linear and generalized linear models, mitigate the limitations of classical tests that rely on large-sample normality assumptions.
General Goodness-of-Fit Tests
Chi-Square Test
The chi-square goodness-of-fit test is a non-parametric statistical procedure for evaluating whether the observed frequencies in a sample of categorical data, or of continuous data binned into k categories, align with the expected frequencies derived from a hypothesized probability distribution.[2] The test is particularly useful for discrete data or when continuous observations are grouped into bins to allow frequency comparisons.[8] It was developed by Karl Pearson in 1900 as a method for assessing the adequacy of a proposed distribution for explaining sample data.[16]

The test statistic, denoted \chi^2, measures the discrepancy between observed counts O_i and expected counts E_i across the k categories:

\chi^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}

Under the null hypothesis that the data follow the specified distribution, this statistic asymptotically follows a chi-square distribution with degrees of freedom df = k - 1 - m, where m is the number of parameters estimated from the data to specify the expected frequencies.[2] For fully specified distributions with no estimated parameters (m = 0), the degrees of freedom simplify to k - 1; the same df = k - 1 applies in the multinomial setting with fully specified probabilities p_i, provided the expected counts n p_i (with n the sample size) are large.[8]

Key assumptions include random sampling from the population, independence among observations, and sufficiently large expected frequencies, typically E_i ≥ 5 in at least 80% of the cells (with no E_i < 1), to ensure the asymptotic chi-square approximation holds reliably.[17] Violations, such as small expected counts, can lead to inaccurate p-values.[2]

To perform the test, one states the null hypothesis (that the observed data fit the expected distribution) and computes the \chi^2 statistic from the observed and expected frequencies; the p-value is then obtained by comparing this statistic to the chi-square distribution with the appropriate degrees of freedom, often via statistical software or tables, and the null is rejected if p < α (commonly 0.05).[8] For instance, to test whether a six-sided die is fair using n = 60 rolls, the expected frequency per face is E_i = 10; observed counts might yield \chi^2 = 8.4 with df = 5, giving p ≈ 0.14 and failing to reject fairness at α = 0.05.[18]

The chi-square test offers simplicity, broad applicability to discrete or binned continuous distributions, and ease of computation without normality assumptions.[2] Its limitations include sensitivity to the choice of binning intervals for continuous data, which can arbitrarily influence results, and reduced reliability with small samples or low expected frequencies, where alternatives such as the G-test provide better approximations.[19]
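A minimal Python sketch of the die example follows; the observed counts are hypothetical, chosen to sum to n = 60 and reproduce the \chi^2 = 8.4 quoted above.

```python
import numpy as np
from scipy import stats

# Hypothetical counts from 60 rolls, one entry per face, chosen to
# reproduce the chi-square value quoted in the text.
observed = np.array([17, 5, 12, 8, 9, 9])
expected = np.full(6, 10.0)  # fair die: E_i = 60 / 6 = 10

chi2_stat, p_value = stats.chisquare(observed, f_exp=expected)
print(chi2_stat, p_value)  # chi2 = 8.4, p ≈ 0.14 with df = k - 1 = 5
```

By default scipy.stats.chisquare uses df = k - 1; its ddof argument subtracts additional degrees of freedom when m parameters have been estimated from the data.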
G-Test
The G-test, also known as the likelihood-ratio chi-square test, is a statistical method for assessing whether observed frequencies in categorical data conform to expected frequencies under a specified multinomial distribution; it is a likelihood ratio test that compares the fit of the observed data to the hypothesized model.[20] It is often preferred for its closer adherence to the chi-square distribution in non-asymptotic conditions, providing more reliable inference when sample sizes are moderate or expected frequencies are low.[21]

The test statistic is

G = 2 \sum_i O_i \ln \left( \frac{O_i}{E_i} \right),

where O_i is the observed frequency in category i, E_i the expected frequency, and \ln the natural logarithm; under the null hypothesis, G asymptotically follows a chi-square distribution with degrees of freedom equal to the number of categories k minus 1 minus the number of parameters m estimated from the data (df = k - 1 - m).[22] This formulation equals -2 times the log of the likelihood ratio between the observed and expected models.[20]

Like the chi-square test, the G-test assumes independent observations and expected frequencies derived from a valid theoretical model, but it performs better when some E_i < 5 because its logarithmic scaling reduces bias in the distributional approximation.[21] It requires all O_i > 0 to avoid undefined logarithms of zero; where zero observations occur, continuity corrections or exact tests may be applied to adjust the statistic.[22]

To perform the test, compute the G statistic from the observed and expected frequencies, determine the appropriate degrees of freedom, and compare G to the critical value from the chi-square distribution or calculate the p-value; rejection of the null hypothesis indicates a poor fit between observed and expected frequencies.[20]

A common application is testing multinomial proportions in genetics, such as Mendelian inheritance ratios. For example, in a monohybrid cross expecting a 3:1 phenotypic ratio (df = 1 for the binomial case), observed counts of 80 dominant and 20 recessive traits in 100 offspring yield G \approx 1.40 (p ≈ 0.24), supporting the hypothesized fit.[21] Similarly, for a dihybrid cross expecting a 9:3:3:1 ratio (df = 3), deviations in the observed progeny classes can be evaluated to assess compliance with independent segregation.

The G-test provides more accurate p-values for sparse data with low expected counts, making it well suited to biological datasets where chi-square approximations may overstate significance; it was recommended for such scenarios in the influential biometry textbook by Sokal and Rohlf (1981), which drove its adoption in ecology and biology from the 1980s onward.[23] It is slightly more computationally intensive than the chi-square test because of the logarithmic terms, though modern software makes the difference negligible.[20] The two statistics share the same large-sample properties, but the G-test is generally more accurate in finite samples.[21]
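A short sketch of the monohybrid example, using the same counts as the text; as a cross-check, SciPy's power_divergence with lambda_="log-likelihood" computes the same G statistic as the direct formula.

```python
import numpy as np
from scipy import stats

# Monohybrid cross: 80 dominant and 20 recessive offspring observed,
# against a 3:1 ratio (75:25 expected out of 100).
observed = np.array([80, 20])
expected = np.array([75, 25])

# Direct formula: G = 2 * sum(O_i * ln(O_i / E_i))
g_manual = 2 * np.sum(observed * np.log(observed / expected))

# Same statistic via SciPy's power-divergence family
g_stat, p_value = stats.power_divergence(observed, f_exp=expected,
                                         lambda_="log-likelihood")
print(g_manual, g_stat, p_value)  # G ≈ 1.40, p ≈ 0.24 with df = 1
```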
Tests for Continuous Distributions
Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test is a non-parametric goodness-of-fit procedure for assessing whether a sample of continuous data follows a specified theoretical distribution, by measuring the maximum vertical distance between the empirical cumulative distribution function (ECDF), denoted F_n(x), and the hypothesized cumulative distribution function (CDF), F(x).[24] The test is suited to unbinned continuous data and evaluates the null hypothesis that the sample is drawn from the specified distribution F(x).[12]

The test statistic is

D = \sup_x |F_n(x) - F(x)|,

where \sup denotes the supremum over all x, so that D is the largest absolute deviation between the two functions.[24] For the one-sample case, critical values are derived from the Kolmogorov distribution, while the two-sample variant, which compares the ECDFs of two independent samples, uses the Smirnov distribution.[12] The test was originally developed by Andrey Kolmogorov in 1933 for the one-sample scenario and extended by Nikolai Smirnov in 1939 to the two-sample case, with the asymptotic distribution of \sqrt{n} D under the null hypothesis established for large sample sizes n.[12]

Key assumptions are that the data are continuous, consist of independent and identically distributed (i.i.d.) observations, and that the theoretical CDF F(x) is fully specified, with no parameters estimated from the sample, in the basic version.[24] Violating the fully-specified assumption, for example by estimating location or scale parameters from the data, invalidates the standard critical values and requires adjustments such as the Lilliefors modification for normality testing.[24][25]

To perform the test, the ECDF F_n(x) is computed from the ordered sample values, the deviations |F_n(x_i) - F(x_i)| are evaluated at each data point x_i and just before and after each jump, and D is taken as the maximum of these.[24] The scaled statistic \sqrt{n} D yields an asymptotic p-value from the Kolmogorov distribution, though exact p-values for small n rely on tables or software; two-sided variants test general fit, while one-sided variants test for stochastically larger or smaller values.[24] If \sqrt{n} D exceeds the critical value for the chosen significance level (e.g., 1.36 for \alpha = 0.05 asymptotically), the null hypothesis is rejected.[24]

A common application is testing the uniformity of random number generators, where the sample is compared to the uniform CDF on [0, 1]; another is assessing normality when the mean and variance are estimated from the data, via the Lilliefors modification, which provides adjusted critical values obtained by Monte Carlo simulation to account for parameter uncertainty.[24][25]

Advantages of the test include the absence of binning, which preserves information unlike frequency-based methods, and its sensitivity to discrepancies in the location and scale of the distribution.[12][24] It is, however, less powerful for detecting differences in the tails of the distribution than weighted alternatives such as the Anderson-Darling test, and its requirement of a fully specified theoretical distribution limits its use when parameters must be estimated.[24]
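As a minimal sketch of the uniformity application, the following tests pseudo-random numbers against the fully specified Uniform[0, 1] CDF; the generator seed and sample size are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # seed chosen arbitrarily
sample = rng.random(200)         # values that should be Uniform[0, 1]

# One-sample, two-sided KS test against the fully specified uniform CDF
d_stat, p_value = stats.kstest(sample, "uniform")
print(d_stat, p_value)  # a large p-value gives no evidence against uniformity
```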
Anderson-Darling Test
The Anderson-Darling test is an omnibus goodness-of-fit procedure for assessing whether a sample of continuous data follows a specified distribution, with particular emphasis on deviations in the tails.[26] It extends the Kolmogorov-Smirnov test by integrating squared differences between the empirical and hypothesized cumulative distribution functions (CDFs), weighted inversely by the variance of the CDF so that the tails receive greater weight.[26] The test was introduced by Theodore W. Anderson and Donald A. Darling in their seminal work on asymptotic theory for goodness-of-fit criteria based on stochastic processes.[13]

The test statistic, denoted A^2, is computed for a sample of n independent and identically distributed (i.i.d.) observations X_1, \dots, X_n ordered as X_{(1)} \leq \dots \leq X_{(n)}, assuming a fully specified CDF F:

A^2 = -n - \sum_{i=1}^n \frac{2i-1}{n} \left[ \ln F(X_{(i)}) + \ln \left(1 - F(X_{(n+1-i)})\right) \right]

This computational form is equivalent to the integral form of the statistic, which weights squared discrepancies by 1 / [F(x)(1 - F(x))].[26] Under the null hypothesis that the data arise from F, A^2 follows an asymptotic distribution independent of F, with critical values available in tables.[27]

The test assumes the data are i.i.d. from a continuous distribution; when parameters of F must be estimated from the sample, the null distribution of A^2 changes, requiring adjustment via Monte Carlo simulation or modified critical values from tabulated results.[27]

To perform the test, the statistic A^2 is calculated and compared to critical values from the Anderson-Darling distribution (e.g., at significance level \alpha = 0.05) or converted to a p-value; the null is rejected if A^2 exceeds the critical threshold or the p-value falls below \alpha.[26] The procedure yields higher power than the Kolmogorov-Smirnov test against many alternatives, particularly those involving tail discrepancies.[28] For example, the test can assess the normality of residuals from a regression model fitted to environmental data, such as pollutant concentrations in air quality monitoring, by computing A^2 under the standard normal CDF after standardization.[29] A two-sample version exists for testing whether two independent samples come from the same continuous distribution, using a similarly weighted integral of the differences between their empirical CDFs.[30]

The Anderson-Darling test is effective at detecting subtle departures such as skewness or excess kurtosis thanks to its tail weighting, making it particularly useful for distributions in which extreme values matter.[26] It is widely applied in reliability engineering to fit extreme value or Weibull distributions to failure time data, and in finance to evaluate normal or heavy-tailed fits for stock returns and risk measures.[31] Computing the statistic requires sorting the sample, but the subsequent O(n) summation is efficient even for large samples; the test is, however, sensitive to tied observations, which violate the continuity assumption and can inflate the statistic.[26]
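A brief sketch of the normality application: scipy.stats.anderson tests a sample against the normal family, estimating the mean and standard deviation internally and returning critical values already adjusted for that estimation. The simulated "residuals" here are a hypothetical stand-in for residuals from a fitted regression model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
residuals = rng.normal(0.0, 2.0, size=100)  # hypothetical regression residuals

result = stats.anderson(residuals, dist="norm")
print(result.statistic)           # A^2 with mu and sigma estimated
print(result.critical_values)     # thresholds at the levels below
print(result.significance_level)  # 15%, 10%, 5%, 2.5%, 1%
```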
Applications in Regression Analysis
Lack-of-Fit Test
The lack-of-fit test is an F-test, used within the analysis of variance (ANOVA) framework for regression models, that detects systematic deviations between observed and predicted values exceeding what would be expected from random error alone. It is particularly applicable to polynomial or nonlinear models where replicates (multiple observations at the same predictor values) are available, since replicates allow the error sum of squares (SSE) to be separated into a lack-of-fit component (SS_{LOF}), which captures model misspecification, and a pure-error component (SS_{PE}), which reflects inherent random variation. Under the null hypothesis of adequate fit, the model correctly specifies the functional form and any deviations are due solely to random error.[32]

The test statistic is

F = \frac{MS_{LOF}}{MS_{PE}},

where MS_{LOF} = SS_{LOF} / df_{LOF} and MS_{PE} = SS_{PE} / df_{PE}. Here df_{LOF} = c - p (with c the number of distinct predictor levels and p the number of model parameters) and df_{PE} = n - c (with n the total number of observations). Under the null hypothesis, this F-statistic follows an F-distribution with df_{LOF} and df_{PE} degrees of freedom. The total sum of squares is partitioned as SS_{total} = SS_{model} + SS_{LOF} + SS_{PE}, where SS_{model} is the sum of squares due to the regression. The null hypothesis is rejected if the observed F exceeds the critical value from the F-distribution at the chosen significance level (e.g., \alpha = 0.05), indicating inadequate model fit.[32][33]

Key assumptions include the availability of replicates to estimate pure error, independent and normally distributed errors with constant variance (homoscedasticity), and that the specified functional form (e.g., linear or polynomial) holds if the null is true. Without replicates the test cannot be performed, because pure error cannot be isolated from lack of fit. The procedure is to fit the proposed model, compute the ANOVA table to obtain SS_{LOF} and SS_{PE}, derive the mean squares and the F-statistic, and compare it to the critical value or use the p-value to judge model adequacy. The test extends classical ANOVA by explicitly partitioning the error in order to test the model form in regression contexts.[32]

For example, consider data on rat growth rates under different dietary supplement doses, with six distinct dose levels and two replicates each (n = 12). Testing a linear model yields the following ANOVA table:

| Source | df | SS | MS | F |
|---|---|---|---|---|
| Regression | 1 | 204.27 | 204.27 | 2.29 |
| Lack of Fit | 4 | 858.23 | 214.56 | 38.43 |
| Pure Error | 6 | 33.50 | 5.58 | |
| Total | 11 | 1096.00 | | |

The lack-of-fit F = 214.56 / 5.58 ≈ 38.43 with (4, 6) degrees of freedom far exceeds the critical value F_{0.05}(4, 6) ≈ 4.53, so the null hypothesis of adequate fit is rejected: the linear model does not describe the dose-response relationship adequately.
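The partition above can be reproduced from raw replicate data. The sketch below uses hypothetical dose-response values (the table's underlying data are not given in this section), fits a straight line, and splits the residual sum of squares into lack-of-fit and pure-error components.

```python
import numpy as np
from scipy import stats

# Hypothetical data: c = 6 dose levels, 2 replicates each (n = 12).
x = np.repeat([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 2)
y = np.array([12.1, 13.0, 18.5, 17.9, 25.1, 23.8,
              26.0, 27.2, 27.5, 28.1, 27.9, 28.6])

fit = stats.linregress(x, y)
resid = y - (fit.intercept + fit.slope * x)
sse = np.sum(resid ** 2)

# Pure error: variation of replicates around their own level means
ss_pe = sum(np.sum((y[x == lv] - y[x == lv].mean()) ** 2)
            for lv in np.unique(x))
ss_lof = sse - ss_pe

c, p, n = 6, 2, len(y)          # levels, model parameters, observations
df_lof, df_pe = c - p, n - c
f_stat = (ss_lof / df_lof) / (ss_pe / df_pe)
p_value = stats.f.sf(f_stat, df_lof, df_pe)
print(f_stat, p_value)  # a large F and small p indicate lack of fit
```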