Pearson's chi-squared test
Pearson's chi-squared test, introduced by British statistician Karl Pearson in 1900, is a nonparametric statistical procedure used to assess whether observed frequencies in categorical data significantly differ from expected frequencies under a specified null hypothesis, such as random distribution or independence between variables.[1] The test computes a test statistic, denoted as \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}, where O_i represents the observed frequency in each category and E_i the corresponding expected frequency; under the null hypothesis and with sufficiently large sample sizes, this statistic approximately follows a chi-squared distribution with degrees of freedom depending on the application: typically k-1 for k categories in goodness-of-fit tests or (r-1)(c-1) for an r \times c contingency table in tests of independence or homogeneity.[2][3]

Pearson's chi-squared test encompasses three primary variants, each addressing distinct hypotheses about categorical data. The goodness-of-fit test evaluates whether sample data conform to a theoretical distribution, such as a uniform, normal, or Poisson distribution, by comparing observed counts to those predicted by the model; it is particularly useful in quality control and genetics to validate distributional assumptions.[2][4] The test of independence examines whether two categorical variables are associated in a single population, using a contingency table to test the null hypothesis that the variables are independent; for instance, it can assess whether gender is associated with voting preference in survey data.[5][3] The test of homogeneity applies to multiple populations, testing whether the distribution of a categorical variable is the same across groups, such as comparing disease prevalence across different regions; it shares the same computational framework as the independence test but frames the hypothesis in terms of population equality rather than variable association.[4][3]

Widely applied in fields including social sciences, medicine, economics, and biology, the test requires key assumptions for validity: all expected frequencies should be at least 1, with no more than 20% below 5 (ideally all at least 5 for accuracy), random sampling, and categorical data without excessive sparsity; violations may necessitate alternatives like Fisher's exact test or simulations.[3][4][6]

Applications
Goodness-of-fit testing
The goodness-of-fit test based on Pearson's chi-squared statistic evaluates whether the observed frequencies in categorical data align with the frequencies anticipated under a specified probability distribution, providing a measure of discrepancy between empirical observations and theoretical expectations.[2] Introduced by Karl Pearson in 1900, this test is particularly suited for discrete data where categories have predefined probabilities.[7]

To conduct the test, one first formulates the null hypothesis that the data are drawn from the hypothesized distribution, which may require estimating unspecified parameters (such as the mean for a Poisson distribution) from the sample itself.[8] Next, the expected frequencies E_i for each category i are computed using the formula E_i = n \cdot p_i, where n is the total sample size and p_i is the probability assigned to category i under the null hypothesis.[2] The test statistic is then applied to quantify deviations, with larger values indicating poorer fit to the hypothesized model.[9]

This approach finds applications in testing discrete probability distributions, including the Poisson distribution for modeling count data like event occurrences, the binomial distribution for binary outcomes in fixed trials, and the multinomial distribution for multiple categories with fixed probabilities.[10] For example, researchers might use it to assess whether defect rates in manufacturing follow a Poisson process or if genetic trait inheritance adheres to binomial expectations.[11]

A representative example involves testing uniformity in die rolls, hypothesizing that a fair six-sided die produces each face with equal probability p_i = 1/6. Suppose 120 rolls yield the following observed frequencies:

| Face | Observed (O_i) | Expected (E_i) |
|---|---|---|
| 1 | 15 | 20 |
| 2 | 25 | 20 |
| 3 | 20 | 20 |
| 4 | 18 | 20 |
| 5 | 22 | 20 |
| 6 | 20 | 20 |
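Summing the cell contributions gives \chi^2 = 1.25 + 1.25 + 0 + 0.2 + 0.2 + 0 = 2.9 on k - 1 = 5 degrees of freedom. A minimal R sketch, assuming the counts in the table above, reproduces the calculation (numeric results quoted in the comments are approximate):

```r
# Observed counts for the 120 die rolls from the table above
observed <- c(15, 25, 20, 18, 22, 20)

# Expected counts under the null hypothesis of a fair die
expected <- sum(observed) * rep(1/6, 6)   # 20 for every face

# Pearson's chi-squared statistic computed directly from the formula
sum((observed - expected)^2 / expected)   # 2.9

# Built-in equivalent: df = 6 - 1 = 5, p-value is about 0.72,
# so these data give no evidence against fairness
chisq.test(observed, p = rep(1/6, 6))
```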
Independence testing
Pearson's chi-squared test for independence, introduced by Karl Pearson in 1900, provides a method to assess whether two categorical variables exhibit a statistically significant association in a dataset organized as a contingency table. The formulation applies to a general r \times c contingency table, not only the 2 \times 2 case, where r and c are the numbers of categories of the two variables. In this setup, the observed data consist of frequencies O_{ij} in the cell at row i and column j, derived from cross-classifying N independent observations into the r rows and c columns.

The null hypothesis states that the two variables are independent, implying no association between them; under this hypothesis, the expected frequency for each cell is calculated as E_{ij} = \frac{R_i C_j}{N}, where R_i is the total for row i, C_j is the total for column j, and N is the grand total of all observations. The test statistic is then \chi^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, which measures the overall deviation between observed and expected frequencies and follows an approximate chi-squared distribution with (r-1)(c-1) degrees of freedom under the null for large samples.

If the computed \chi^2 value yields a p-value below a chosen significance level (e.g., 0.05), the null hypothesis is rejected, indicating that the variables are likely dependent and that the observed deviations are unlikely to be due to random chance alone. For instance, in survey data analyzing responses across demographic groups, a significant result might suggest an association between variables such as age category and opinion on a policy.[6] This interpretation holds provided basic assumptions are met, such as a sufficient sample size with expected frequencies of at least 5 in most cells.

Homogeneity testing
The chi-squared test of homogeneity assesses whether the distribution of a categorical variable is the same across multiple populations or groups, using a contingency table where rows represent groups and columns represent categories of the variable. Unlike the independence test, which examines association within a single population, homogeneity focuses on comparing proportions across populations under the null hypothesis of equal distributions. The computation mirrors that of the independence test, with expected frequencies E_{ij} = \frac{R_i C_j}{N} and degrees of freedom (r-1)(c-1), where r is the number of groups and c the number of categories. A significant result rejects the null, indicating differing distributions across groups.

For example, to test if the proportion of smokers is the same in two regions, observed counts of smokers and non-smokers in each region form a 2 \times 2 table; the test evaluates if regional differences are statistically significant.[12]
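A minimal R sketch of such a comparison, using purely hypothetical counts for the two regions (the numbers below are illustrative assumptions, not data from the cited source):

```r
# Hypothetical counts of smokers and non-smokers in two regions (illustrative only)
region_counts <- matrix(c(40, 160,    # Region A
                          60, 140),   # Region B
                        nrow = 2, byrow = TRUE,
                        dimnames = list(Region = c("A", "B"),
                                        Smoking = c("Smoker", "Non-smoker")))

# Computationally identical to the test of independence; df = (2 - 1)(2 - 1) = 1
chisq.test(region_counts)
```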
Computation

Test statistic formula
The test statistic for Pearson's chi-squared test, introduced by Karl Pearson in 1900, measures the discrepancy between observed and expected frequencies under the null hypothesis.[13] It is computed using the formula \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}, where the summation is taken over all categories or cells i, O_i denotes the observed count in category i, and E_i represents the expected count under the null hypothesis.[2] This statistic quantifies deviations by standardizing the differences (O_i - E_i) relative to the expected values, emphasizing larger relative discrepancies in cells with smaller expectations.

The same core formula applies to both goodness-of-fit testing and independence testing, though the computation of expected frequencies E_i differs by context.[14] In goodness-of-fit tests, E_i = n p_i, where n is the total sample size and p_i is the hypothesized probability for category i. For tests of independence in contingency tables, E_{ij} = \frac{R_i C_j}{N}, with R_i as the row total for row i, C_j as the column total for column j, and N as the grand total.[14]

Computationally, the statistic involves summing the squared standardized residuals across all relevant categories or table cells, ensuring no expected values are zero to avoid division issues. Modern statistical software packages, such as R or SAS, automate this calculation by inputting observed frequencies and specifying the null model to derive expectations, facilitating efficient implementation for large datasets.[2]

Under the null hypothesis and for sufficiently large sample sizes, the test statistic \chi^2 approximately follows a chi-squared distribution with k-1 degrees of freedom, where k is the number of categories in a simple goodness-of-fit scenario without estimated parameters.[2] This asymptotic approximation underpins the test's inferential properties.
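A minimal R sketch of these two ways of forming the expected frequencies and the statistic (the helper functions below are generic illustrations, not part of any cited implementation):

```r
# Goodness of fit: expected counts from the hypothesized probabilities
gof_statistic <- function(observed, p) {
  expected <- sum(observed) * p
  sum((observed - expected)^2 / expected)
}

# Independence: expected counts E_ij = (row total x column total) / grand total
independence_statistic <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}
```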
Degrees of freedom and p-values

The degrees of freedom (df) for Pearson's chi-squared test depend on the specific application. In the goodness-of-fit test, the degrees of freedom are calculated as df = k - 1 - m, where k is the number of categories and m is the number of parameters estimated from the data under the null hypothesis.[15] For the test of independence in an r \times c contingency table, the degrees of freedom are df = (r - 1)(c - 1).[16] These values determine the shape of the reference chi-squared distribution used for inference.

Once the test statistic \chi^2 is computed, its significance is assessed by comparing it to the chi-squared distribution with the appropriate degrees of freedom. The p-value is the probability of observing a test statistic at least as extreme as the one calculated, assuming the null hypothesis is true; it is obtained by evaluating the survival function (right-tail probability) of the chi-squared distribution at \chi^2 with the given df.[17] For example, at a significance level \alpha = 0.05, the null hypothesis is rejected if the p-value is less than 0.05. Alternatively, the critical value \chi^2_{\alpha, df} can be looked up from chi-squared distribution tables, and the null is rejected if the observed \chi^2 exceeds this threshold, defining the rejection region in the right tail.[16]

Estimating parameters from the sample data under the null hypothesis reduces the degrees of freedom by the number of such parameters, as this accounts for the variability introduced by the estimation process and adjusts for the loss of independence in the fitted model.[15] This adjustment ensures the test maintains its nominal significance level.

Pearson's chi-squared test relies on an asymptotic approximation, where the distribution of the test statistic converges to a chi-squared distribution as the sample size increases, provided the expected frequencies are sufficiently large.[18] This validity holds only for large samples, typically when all expected cell counts are at least 5, to ensure the approximation is reliable.[19]
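The two equivalent decision rules described above can be sketched in R; the statistic and degrees of freedom below are placeholder values, not results from the cited sources:

```r
chi_sq <- 5.0   # placeholder test statistic
dof    <- 5     # placeholder degrees of freedom
alpha  <- 0.05

# p-value: right-tail (survival) probability of the chi-squared distribution
p_value     <- pchisq(chi_sq, df = dof, lower.tail = FALSE)
reject_by_p <- p_value < alpha

# Equivalent rule via the critical value
critical           <- qchisq(1 - alpha, df = dof)
reject_by_critical <- chi_sq > critical
```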
Theoretical Basis

Derivation for goodness-of-fit
The derivation of Pearson's chi-squared statistic for goodness-of-fit testing assumes that the observed frequencies O = (O_1, \dots, O_k) arise from a multinomial distribution with total sample size n = \sum O_i and specified null probabilities p = (p_1, \dots, p_k) where \sum p_i = 1. Under the null hypothesis H_0, P(O = o) = \frac{n!}{\prod o_i!} \prod p_i^{o_i}, and the expected frequencies are E_i = n p_i. Pearson's original formulation in 1900 constructs the test statistic as X^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}, which sums the squared differences between observed and expected frequencies, standardized by the expected frequencies to reflect the scale of variability in each category. This approach treats the deviations as analogous to standardized residuals in a normal model, motivated by the need for a general criterion to assess fit beyond Gaussian assumptions.[20]

The asymptotic \chi^2_{k-1} distribution of X^2 under H_0 follows from its equivalence to the likelihood ratio statistic for large n. The likelihood ratio statistic is G^2 = 2 \sum_{i=1}^k O_i \log(O_i / E_i). Writing \log(O_i / E_i) = \log\left(1 + \frac{O_i - E_i}{E_i}\right) and expanding to second order in \frac{O_i - E_i}{E_i} gives \log(O_i / E_i) \approx \frac{O_i - E_i}{E_i} - \frac{1}{2} \left( \frac{O_i - E_i}{E_i} \right)^2; substituting, using \sum_i (O_i - E_i) = 0, and discarding higher-order terms yields G^2 \approx \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i} = X^2. Since G^2 \xrightarrow{d} \chi^2_{k-1} by standard likelihood ratio theory for the multinomial, X^2 shares this limiting distribution for large n.[21][20]

An equivalent derivation uses the central limit theorem: \sqrt{n} (\hat{p} - p) \xrightarrow{d} N(0, \Sigma), where \hat{p} = O/n and \Sigma = \operatorname{diag}(p) - p p^T. Rewriting X^2 = n (\hat{p} - p)^T D (\hat{p} - p) with D = \operatorname{diag}(1/p_i), this is a quadratic form in the asymptotically normal vector. The matrix D^{1/2} \Sigma D^{1/2} is idempotent with rank k-1 (trace k-1), confirming X^2 \xrightarrow{d} \chi^2_{k-1}. A moment-generating function approach similarly shows the limiting form, as the MGF of the quadratic form matches that of a \chi^2_{k-1} random variable under the covariance structure.[22][23]

In the special case of two categories (k=2), the multinomial reduces to the binomial, and X^2 = \frac{n (\hat{p}_1 - p_1)^2}{p_1 (1 - p_1)} = z^2, where z = \frac{\sqrt{n} (\hat{p}_1 - p_1)}{\sqrt{p_1 (1 - p_1)}}; since z \xrightarrow{d} N(0,1), it follows that X^2 \xrightarrow{d} \chi^2_1.[20]
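The limiting distribution can also be illustrated numerically. The following R sketch, assuming an arbitrary null with six equiprobable categories, simulates multinomial samples under H_0 and compares the empirical quantiles of X^2 with those of the \chi^2_{k-1} reference:

```r
set.seed(1)
k <- 6; n <- 500; p <- rep(1/6, k)

# Simulate X^2 repeatedly under the null hypothesis
sims <- replicate(10000, {
  o <- as.vector(rmultinom(1, size = n, prob = p))
  e <- n * p
  sum((o - e)^2 / e)
})

# Empirical quantiles should be close to the chi-squared(k - 1) quantiles
quantile(sims, c(0.50, 0.90, 0.95))
qchisq(c(0.50, 0.90, 0.95), df = k - 1)
```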
Derivation for independence

Consider an r \times c contingency table where the observed cell counts O_{ij} for i = 1, \dots, r and j = 1, \dots, c arise from a multinomial distribution with total sample size N = \sum_{i,j} O_{ij} and unknown cell probabilities \pi_{ij}, satisfying \sum_{i=1}^r \sum_{j=1}^c \pi_{ij} = 1. The null hypothesis of independence posits that the row and column factors are independent, implying \pi_{ij} = \pi_{i \cdot} \pi_{\cdot j} for all i, j, where the marginal probabilities are \pi_{i \cdot} = \sum_{j=1}^c \pi_{ij} and \pi_{\cdot j} = \sum_{i=1}^r \pi_{ij}.

Under the saturated model, which fits the full set of rc probabilities without restrictions, the maximum likelihood estimates are simply \hat{\pi}_{ij} = O_{ij}/N. In contrast, the independence model imposes the product structure, reducing the number of free parameters to (r - 1) + (c - 1) = r + c - 2, and the maximum likelihood estimates of the marginals yield expected cell frequencies E_{ij} = N \hat{\pi}_{i \cdot} \hat{\pi}_{\cdot j} = R_i C_j / N, where R_i = \sum_{j=1}^c O_{ij} is the i-th row total and C_j = \sum_{i=1}^r O_{ij} is the j-th column total.[23]

Pearson's chi-squared statistic X^2 quantifies the lack of fit between the observed counts and those expected under independence, expressed as the sum of cell-wise squared standardized residuals: X^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}}. This form arises as a measure of discrepancy analogous to the goodness-of-fit test, comparing the saturated model to the restricted independence model.[22] Under the null hypothesis and as N \to \infty, X^2 follows asymptotically a chi-squared distribution with (r-1)(c-1) degrees of freedom, reflecting the difference in the number of free parameters between the saturated and independence models.[23] This asymptotic chi-squared property holds due to the equivalence between Pearson's statistic and the likelihood ratio test in large samples; the likelihood ratio statistic G^2 = 2 \sum_{i=1}^r \sum_{j=1}^c O_{ij} \ln (O_{ij} / E_{ij}) converges to the same limiting distribution as X^2.[23]

The derivation extends naturally from the 2 \times 2 case, where a single degree of freedom captures the overall association, to the general r \times c table: the statistic accumulates the cell contributions (O_{ij} - E_{ij})^2 / E_{ij}, but the estimated row and column margins constrain the deviations, leaving only (r-1)(c-1) of them free under the null.[22] Karl Pearson introduced this refinement for contingency tables in his 1904 paper, adapting his earlier 1900 goodness-of-fit criterion to assess independence in cross-classified categorical data.[24]
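The degrees of freedom quoted above can be checked by counting free parameters in the two models:

\mathrm{df} = \underbrace{(rc - 1)}_{\text{saturated model}} - \underbrace{\left[(r - 1) + (c - 1)\right]}_{\text{independence model}} = rc - r - c + 1 = (r - 1)(c - 1).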
Assumptions and Validity

Sample size requirements
The validity of Pearson's chi-squared test relies on sufficient sample size to ensure the test statistic approximates the chi-squared distribution asymptotically. A widely adopted rule of thumb requires that all expected cell frequencies E_i are at least 5, or that no more than 20% of the cells have E_i < 5 with none below 1, to maintain reliable inference.[25][4] These guidelines, originating from simulation-based investigations, help prevent distortions in the test's performance when categories are sparse.[26]

The rationale for these thresholds is that small expected frequencies undermine the asymptotic chi-squared approximation, leading to inaccurate p-values and reduced test reliability.[27] Cochran's 1952 analysis, based on Monte Carlo simulations for goodness-of-fit applications, established that expected frequencies below 5 often result in substantial deviations from the nominal distribution, particularly affecting the accuracy of significance levels.[26] Similar simulation evidence extends to tests of independence in contingency tables, where low expectations compromise the central limit theorem underpinnings of the approximation.[28]

Violations of these sample size conditions can inflate Type I error rates, causing excessive false positives, or render the test overly conservative with diminished power to detect true associations.[29][28] In such cases, a practical remedy involves combining adjacent categories to increase expected frequencies and restore the approximation's validity, thereby preserving the test's utility without altering the underlying hypothesis.[28]
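When expected counts are small, common software also offers exact or simulation-based alternatives. A minimal R sketch, using a deliberately sparse hypothetical 2 \times 2 table (the counts are illustrative assumptions only):

```r
# Sparse hypothetical table: several expected counts fall below 5
sparse <- matrix(c(3, 1,
                   2, 9), nrow = 2, byrow = TRUE)

chisq.test(sparse)                            # warns that the asymptotic approximation may be unreliable
chisq.test(sparse, simulate.p.value = TRUE)   # Monte Carlo p-value instead of the asymptotic one
fisher.test(sparse)                           # Fisher's exact test as an alternative for small tables
```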
Independence and randomness

Pearson's chi-squared test relies on the fundamental assumption that the observations are independent and identically distributed (i.i.d.) according to a multinomial distribution under the null hypothesis. This means that each trial or observation contributes to one of the categorical outcomes independently of the others, with the probability of each outcome remaining constant across trials. The multinomial framework ensures that the joint distribution of the observed frequencies aligns with the expected frequencies derived from the hypothesized probabilities, allowing the test statistic to approximate a chi-squared distribution asymptotically.[30][31]

For the test to be valid, the data must arise from a simple random sample without inherent dependencies such as clustering or stratification, unless specific adjustments are made to account for the sampling design. In practice, this requires that the sample be drawn randomly from the population, ensuring no systematic correlations between observations that could inflate or deflate the test statistic. Violations of this independence assumption, such as in clustered data from repeated measures or hierarchical sampling, can lead to overdispersion, where the variance of the observed frequencies exceeds that expected under the multinomial model, thereby invalidating the chi-squared approximation and increasing the risk of erroneous conclusions.[32][33][34]

In contingency table analyses, the sampling design can further complicate independence if row or column margins are fixed by the experimental setup, such as when one variable is deliberately balanced. The test still uses (r-1)(c-1) degrees of freedom, but the appropriate model (e.g., test of independence or homogeneity) depends on the sampling scheme.

To assess adherence to these assumptions and evaluate overall model fit, residual analysis serves as a key diagnostic tool, examining the differences between observed and expected frequencies on a cell-by-cell basis to identify patterns of deviation that may signal underlying dependencies or other issues. Standardized or adjusted residuals, for instance, help pinpoint which cells contribute disproportionately to any lack of fit, providing insights beyond the omnibus test statistic.[35][36] These diagnostics complement considerations of sample size adequacy, ensuring the test's robustness across varied data structures.[37]
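In R, the object returned by chisq.test exposes these cell-level diagnostics directly; a minimal sketch, assuming an arbitrary 2 \times 2 table of counts:

```r
tab <- matrix(c(30, 20,
                25, 25), nrow = 2, byrow = TRUE)

fit <- chisq.test(tab, correct = FALSE)

fit$expected    # expected counts under independence
fit$residuals   # Pearson residuals: (O - E) / sqrt(E)
fit$stdres      # standardized (adjusted) residuals, roughly N(0, 1) under the null
```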
Examples

Testing die fairness
To illustrate the application of Pearson's chi-squared goodness-of-fit test, consider testing whether a six-sided die is fair by rolling it 60 times and recording the outcomes for each face. The null hypothesis states that the die is fair, meaning each face has an equal probability of 1/6, while the alternative hypothesis states that the probabilities are not equal. The observed frequencies from the rolls are 5, 8, 12, 10, 11, and 14 for faces 1 through 6, respectively. The expected frequency for each face under the null hypothesis is E_i = 60 / 6 = 10.

The test statistic is computed as \chi^2 = \sum_{i=1}^6 \frac{(O_i - E_i)^2}{E_i}, where O_i are the observed frequencies. The individual contributions are calculated as follows:

| Face | Observed (O_i) | Expected (E_i) | O_i - E_i | (O_i - E_i)^2 / E_i |
|---|---|---|---|---|
| 1 | 5 | 10 | -5 | 2.5 |
| 2 | 8 | 10 | -2 | 0.4 |
| 3 | 12 | 10 | 2 | 0.4 |
| 4 | 10 | 10 | 0 | 0 |
| 5 | 11 | 10 | 1 | 0.1 |
| 6 | 14 | 10 | 4 | 1.6 |
| Total | 60 | 60 | - | χ² = 5.0 |
The total gives a test statistic of \chi^2 = 5.0 with k - 1 = 5 degrees of freedom, corresponding to a p-value of approximately 0.42 (in R, pchisq(5, df=5, lower.tail=FALSE)).
Since the p-value (0.42) exceeds the common significance level of 0.05, there is insufficient evidence to reject the null hypothesis. The observed frequencies are consistent with the expectation of a fair die, indicating no significant deviation from uniformity.
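A minimal R sketch, assuming the counts above, reproduces the result; chisq.test defaults to equal category probabilities when none are supplied:

```r
observed <- c(5, 8, 12, 10, 11, 14)   # counts for faces 1 through 6

# Equal probabilities 1/6 are the default null hypothesis
chisq.test(observed)
# X-squared = 5, df = 5, p-value is about 0.42
```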
Analyzing contingency tables
To illustrate the application of Pearson's chi-squared test for independence, consider a hypothetical survey of 100 individuals examining whether there is an association between gender (male or female) and preference for a product (yes or no). The observed frequencies form a 2 \times 2 contingency table as follows:

| | Yes | No | Total |
|---|---|---|---|
| Male | 30 | 20 | 50 |
| Female | 25 | 25 | 50 |
| Total | 55 | 45 | 100 |
Under the null hypothesis of independence, the expected frequency for each cell is E_{ij} = R_i C_j / N; for example, the expected count of males answering yes is 50 \times 55 / 100 = 27.5. The full table of expected frequencies is:

| | Yes | No | Total |
|---|---|---|---|
| Male | 27.5 | 22.5 | 50 |
| Female | 27.5 | 22.5 | 50 |
| Total | 55 | 45 | 100 |
Each cell's contribution to the test statistic is (O_{ij} - E_{ij})^2 / E_{ij}, combining the observed and expected values as follows:

| | Yes | No | Total |
|---|---|---|---|
| Male | O=30, E=27.5 contrib. ≈0.227 | O=20, E=22.5 contrib. ≈0.278 | 50 |
| Female | O=25, E=27.5 contrib. ≈0.227 | O=25, E=22.5 contrib. ≈0.278 | 50 |
| Total | 55 | 45 | 100 |
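Summing the four contributions gives \chi^2 \approx 0.227 + 0.278 + 0.227 + 0.278 \approx 1.01 with (2-1)(2-1) = 1 degree of freedom, for a p-value of roughly 0.31; at the 0.05 level the null hypothesis of independence is not rejected. A minimal R sketch, assuming the counts above, reproduces the hand calculation; correct = FALSE disables Yates's continuity correction, which chisq.test otherwise applies to 2 \times 2 tables:

```r
tab <- matrix(c(30, 20,
                25, 25), nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Male", "Female"),
                              Preference = c("Yes", "No")))

chisq.test(tab, correct = FALSE)
# X-squared is about 1.01, df = 1, p-value is about 0.31
```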