Chi-squared test
The Chi-squared test, also known as Pearson's chi-squared test, is a non-parametric statistical hypothesis test that determines whether there is a significant association between categorical variables or whether observed categorical data frequencies deviate substantially from those expected under a specified null distribution.[1] Developed by Karl Pearson in 1900, it provides a criterion for evaluating the fit of sample data to a theoretical model without assuming normality, marking a foundational advancement in modern statistical inference.[2] The test statistic is \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}, where O_i represents observed frequencies and E_i expected frequencies; under the null hypothesis this statistic approximately follows a chi-squared distribution for large sample sizes.[3]
Pearson's innovation addressed the need for a general method to test deviations in correlated systems of variables, originating from his work on random sampling and probable errors in biological and social data.[1] Initially introduced for goodness-of-fit analysis—assessing if data conform to a hypothesized probability distribution like the normal or Poisson—the test has since expanded to include the test of independence, which examines associations between two categorical variables in a contingency table, and the test of homogeneity, which compares distributions across multiple populations.[2] For the independence test, degrees of freedom are calculated as (r-1)(c-1), where r and c are the number of rows and columns in the table, enabling p-value computation to reject or retain the null hypothesis of no association.[1]
Key assumptions include random sampling, independence of observations, and sufficiently large expected frequencies (typically at least 5 per cell to ensure the chi-squared approximation holds, as per Cochran's rule).[2] Violations, such as small sample sizes, may necessitate alternatives like Fisher's exact test.[2] Widely used in fields like biology, sociology, and medicine for analyzing survey data, genetic inheritance, and clinical trials, the chi-squared test remains a cornerstone of categorical data analysis due to its simplicity and robustness.[1]
Introduction
Definition and Purpose
The chi-squared test is a statistical hypothesis test that employs the chi-squared distribution to assess the extent of discrepancies between observed frequencies and expected frequencies in categorical data.[4] It evaluates whether these differences are likely due to random variation or indicate a significant deviation from the null hypothesis.[5] Under the null hypothesis, the test statistic follows an asymptotic chi-squared distribution, allowing for the computation of p-values to determine statistical significance.[4]
The primary purposes of the chi-squared test are to examine independence between two or more categorical variables in contingency tables and to test the goodness-of-fit of observed data to a specified theoretical distribution.[6] In the test of independence, it determines whether the distribution of one variable depends on the levels of another, such as assessing associations in survey responses across demographic groups.[5] For goodness-of-fit, it verifies if empirical frequencies align with expected proportions under models like uniformity or specific probability distributions.[4]
The test statistic is given by
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i},
where O_i represents the observed frequencies and E_i the expected frequencies for each category i.[4] Developed in early 20th-century statistics for analyzing categorical data, the chi-squared test is non-parametric, imposing no assumptions on the underlying distribution of the data itself, but relying on the asymptotic chi-squared distribution of the statistic under the null hypothesis.[7] This makes it versatile for applications where data are counts or proportions without normality requirements.[5]
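As a minimal numeric illustration of this formula (using made-up observed and expected counts, not data from this article), the statistic can be computed directly:

```python
# Minimal sketch: computing the Pearson chi-squared statistic by hand.
# The observed and expected counts below are hypothetical.
observed = [18, 22, 30, 30]
expected = [25, 25, 25, 25]

chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"chi-squared = {chi2_stat:.3f}")  # sum of (O_i - E_i)^2 / E_i
```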
Assumptions and Prerequisites
The chi-squared test requires that observations are independent, meaning each data point is collected without influencing others, to ensure the validity of the underlying statistical inference.[5] This independence assumption holds when the sample is drawn as a simple random sample from the population, avoiding any systematic dependencies or clustering in the data.[8] Additionally, the test is designed for categorical data, where variables are discrete or binned into mutually exclusive categories, rather than continuous measurements that have not been discretized.[9]
A critical assumption concerns sample size adequacy: the expected frequencies in at least 80% of the cells should be 5 or greater, with no expected frequencies less than 1, to justify the asymptotic approximation to the chi-squared distribution under the null hypothesis.[10] Violations of this rule, particularly in small samples, can lead to unreliable p-values, necessitating alternatives such as exact tests like Fisher's exact test.[11]
Prior to applying the chi-squared test, users should possess foundational knowledge in probability theory, including concepts like expected values and distributions, as well as the hypothesis testing framework—encompassing null and alternative hypotheses, test statistics, and interpretation of p-values at chosen significance levels (e.g., α = 0.05).[12] These prerequisites enable proper setup of the test for applications such as assessing independence in contingency tables.[13]
Historical Development
In 1900, Karl Pearson introduced the chi-squared test in a paper published in the Philosophical Magazine, presenting it as a criterion to determine whether observed deviations from expected probabilities in a system of correlated variables could reasonably be ascribed to random sampling. This formulation addressed key limitations in prior approaches to analyzing categorical data, particularly in biological contexts where earlier methods struggled to quantify discrepancies between empirical observations and theoretical expectations.[14] Pearson's work was motivated by the need for a robust tool to evaluate patterns in genetics, building on challenges posed by datasets like those from Gregor Mendel's experiments on pea plant inheritance, which highlighted inconsistencies in fitting discrete distributions to observed frequencies.[15]
Pearson derived the test statistic as a sum of squared deviations between observed and expected frequencies, divided by the expected frequencies to account for varying scales across categories; this measure captured the overall discrepancy in a single value, inspired by the summation of squared standardized normals from multivariate normal theory.[16] He symbolized the statistic with the Greek letter χ²—pronounced "chi-squared"—reflecting its connection to the squared form of the character χ, a notation that has persisted in statistical literature. Initially, Pearson applied the test to biological data on inheritance patterns, such as ratios in genetic crosses, enabling researchers to assess whether empirical results aligned with hypothesized Mendelian proportions under random variation.[14]
A pivotal aspect of Pearson's contribution was establishing the asymptotic distribution of the χ² statistic under the null hypothesis of good fit, linking it to a chi-squared distribution with k - 1 degrees of freedom for k categories; the further reduction required when parameters are estimated from the data was clarified later by Fisher. This theoretical foundation allowed for probabilistic inference, with larger values of the statistic indicating poorer fit and lower probabilities of the data arising by chance alone.[16] By formalizing this approach, Pearson provided the first systematic method for goodness-of-fit testing in categorical settings, profoundly influencing the development of modern statistical inference in biology and beyond.[14]
Subsequent Contributions and Naming
Following Karl Pearson's initial formulation, Ronald A. Fisher advanced the chi-squared test in the 1920s by rigorously establishing its asymptotic chi-squared distribution under the null hypothesis and extending its application to testing independence in contingency tables. In his 1922 paper, Fisher derived the appropriate degrees of freedom for the test statistic—(r-1)(c-1) for an r × c contingency table—correcting earlier inconsistencies in Pearson's approach and enabling more accurate p-value calculations for assessing deviations from independence.[17]
The nomenclature distinguishes "Pearson's chi-squared test" as the statistical procedure itself, crediting its originator, from the "chi-squared distribution," which describes the limiting probability distribution of the test statistic. This naming convention arises from Pearson's adoption of the symbol χ² (chi squared) for the statistic, while Fisher provided the foundational proof of its convergence to the chi-squared distribution, solidifying the theoretical basis.[18]
In the 1930s, the chi-squared test became integrated into the Neyman-Pearson framework for hypothesis testing, which emphasized specifying alternative hypotheses, controlling both Type I and Type II error rates, and using p-values to quantify evidence against the null. This incorporation elevated the test's role in formal inferential procedures, aligning it with broader developments in statistical decision theory.
By the 1940s, the chi-squared test achieved widespread recognition in genetics, as seen in Fisher's 1936 application to evaluate the goodness-of-fit of Gregor Mendel's experimental ratios to Mendelian expectations, revealing improbably precise results suggestive of data adjustment. In social sciences, it facilitated analysis of associations in categorical survey data, with standardization occurring through its prominent inclusion in influential textbooks like E.F. Lindquist's 1940 Statistical Analysis in Educational Research, which exemplified its use in fields such as education and sociology.[19][20]
The Pearson Chi-squared Statistic
The chi-squared test evaluates hypotheses concerning the distribution of categorical data. The null hypothesis H_0 asserts that the observed frequencies conform to expected frequencies under a specified theoretical distribution (goodness-of-fit test) or that categorical variables are independent (test of independence), while the alternative hypothesis H_A posits deviation from this fit or presence of dependence.[4]
The Pearson chi-squared statistic is given by
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i},
where the sum is over all categories i, O_i denotes the observed frequency in category i, and E_i is the expected frequency under H_0. This formulation measures the discrepancy between observed and expected values, normalized by the expected frequencies to account for varying category sizes.[4] The statistic was originally proposed by Karl Pearson in 1900 as a measure of goodness of fit for frequency distributions.[7]
The statistic derives from the multinomial likelihood under H_0, where the data follow a multinomial distribution with probabilities yielding the expected frequencies E_i. The log-likelihood ratio test statistic G^2 = 2 \sum_i O_i \log(O_i / E_i) provides an alternative measure, but under large samples, a second-order Taylor (quadratic) approximation of the log-likelihood around the null yields the Pearson form \chi^2 asymptotically.[21]
For the goodness-of-fit test, the expected frequencies are E_i = n p_i, where n is the total sample size and p_i are the theoretical probabilities for each category under H_0.[4] In the test of independence for an r \times c contingency table, the expected frequency for cell (i,j) is E_{ij} = (r_i c_j) / N, where r_i is the total for row i, c_j the total for column j, and N the grand total.[22]
Under H_0 and large sample sizes, \chi^2 approximately follows a chi-squared distribution with appropriate degrees of freedom df, and the p-value is P(\chi^2_{df} > \chi^2_{\text{obs}}), where \chi^2_{\text{obs}} is the computed statistic.[4]
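A short sketch, assuming SciPy and hypothetical counts, ties these pieces together for a fully specified goodness-of-fit null: expected counts E_i = n p_i, the Pearson statistic, the likelihood-ratio statistic G^2, and the p-value from the chi-squared survival function.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical observed counts and null probabilities for k = 3 categories
observed = np.array([45, 35, 20])
p_null = np.array([0.5, 0.3, 0.2])

n = observed.sum()
expected = n * p_null                                     # E_i = n * p_i

pearson = np.sum((observed - expected) ** 2 / expected)   # Pearson chi-squared
g2 = 2 * np.sum(observed * np.log(observed / expected))   # likelihood-ratio statistic

df = len(observed) - 1                                    # fully specified null: k - 1
p_value = chi2.sf(pearson, df)                            # P(chi2_df > observed statistic)
print(pearson, g2, df, p_value)
```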
Asymptotic Properties and Degrees of Freedom
Under the null hypothesis and for sufficiently large sample sizes, the Pearson chi-squared statistic converges in distribution to a central chi-squared distribution with a specified number of degrees of freedom, providing the theoretical basis for hypothesis testing. This asymptotic property, established by Pearson in his foundational work, allows the use of chi-squared critical values to assess the significance of observed deviations from expected frequencies.[7]
The degrees of freedom for the chi-squared distribution depend on the test context. In the test of independence for an r \times c contingency table, the degrees of freedom are (r-1)(c-1), reflecting the number of independent cells after accounting for row and column marginal constraints.[5] For the goodness-of-fit test involving k categories where the expected frequencies are fully specified, the degrees of freedom are k - 1; if m parameters of the hypothesized distribution are estimated from the data, this adjusts to k - 1 - m.[4]
This asymptotic chi-squared distribution arises from the Central Limit Theorem applied to the multinomial sampling model underlying the test. Under the null hypothesis, the standardized differences (O_i - E_i)/\sqrt{E_i} are each approximately normal for large expected frequencies E_i; because the frequencies are linked by linear constraints (they must sum to the sample size, and marginal totals are fixed in the independence test), the sum of their squares follows a chi-squared distribution whose degrees of freedom equal the number of categories minus the number of constraints.[23]
To conduct the test, the observed chi-squared statistic is compared to the critical value \chi^2_{\alpha, df} from the chi-squared distribution with df degrees of freedom at significance level \alpha; the null hypothesis is rejected if the statistic exceeds this value. Critical values are available in standard tables or computed via statistical software functions.[24]
The validity of the chi-squared approximation strengthens as the expected frequencies increase, typically recommended to be at least 5 in most cells to ensure reliable inference.[4]
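The critical-value comparison can be performed with any chi-squared distribution routine; a small sketch using SciPy, with an arbitrary statistic and degrees of freedom chosen purely for illustration:

```python
from scipy.stats import chi2

alpha = 0.05
df = 3                      # e.g., a 2 x 4 table: (2-1)*(4-1) = 3
observed_stat = 9.1         # hypothetical computed chi-squared statistic

critical_value = chi2.ppf(1 - alpha, df)   # chi^2_{alpha, df}
p_value = chi2.sf(observed_stat, df)       # right-tail probability

# Reject H0 if the statistic exceeds the critical value (equivalently, p < alpha)
print(critical_value, p_value, observed_stat > critical_value)
```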
Primary Applications
Test of Independence for Categorical Data
The chi-squared test of independence assesses whether there is a statistically significant association between two categorical variables, using an r × c contingency table that displays observed frequencies O_{ij} for each combination of row category i (where i = 1 to r) and column category j (where j = 1 to c). This setup arises from cross-classifying a sample of N observations into the table cells based on their values for the two variables. The null hypothesis H_0 posits that the row variable and column variable are independent, implying that the distribution of one variable does not depend on the levels of the other; the alternative hypothesis H_a suggests dependence or association between them.
Under the null hypothesis, the expected frequency E_{ij} for each cell is computed as the product of the total for row i and the total for column j, divided by the overall sample size N:
E_{ij} = \frac{R_i \, C_j}{N}, \quad \text{where } R_i = \sum_{j} O_{ij} \text{ and } C_j = \sum_{i} O_{ij}.
These expected values represent what would be anticipated if the variables were truly independent, preserving the marginal totals of the observed table. The test then evaluates deviations between observed and expected frequencies using the Pearson chi-squared statistic, which approximately follows a chi-squared distribution with (r-1)(c-1) degrees of freedom for sufficiently large samples (typically when all expected frequencies are at least 5).
Interpretation involves computing the p-value from the chi-squared distribution of the test statistic; if the p-value is less than the chosen significance level α (commonly 0.05), the null hypothesis is rejected in favor of the alternative, indicating evidence of dependence between the variables. To quantify the strength of any detected association beyond mere significance, measures such as Cramér's V can be applied, defined as the square root of the chi-squared statistic divided by N times the minimum of (r-1) and (c-1), yielding a value between 0 (no association) and 1 (perfect association). This test is particularly common in analyzing survey data, such as examining the relationship between gender (rows) and voting preference (columns) in election studies.
If the test rejects independence, post-hoc analysis of cell contributions aids in identifying which specific combinations drive the result. Pearson residuals, calculated as (O_{ij} - E_{ij}) / \sqrt{E_{ij}}, highlight deviations; residuals with absolute values exceeding about 2 (corresponding to a roughly 5% tail probability under the null) suggest cells where observed frequencies differ markedly from expectations, signaling localized associations. Under the null hypothesis these residuals are approximately normal (adjusted standardized residuals, which additionally divide by a factor involving the marginal proportions, are approximately standard normal), facilitating targeted interpretation while accounting for varying expected cell sizes.[25]
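Putting the elements of this subsection together, the sketch below (assuming NumPy and SciPy, with a made-up 2×3 table) computes expected counts from the margins, the test statistic, the p-value, Pearson residuals, and Cramér's V:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical 2 x 3 table of observed counts
observed = np.array([[30, 20, 10],
                     [20, 25, 15]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
N = observed.sum()

expected = row_totals @ col_totals / N                     # E_ij = R_i * C_j / N
stat = np.sum((observed - expected) ** 2 / expected)

r, c = observed.shape
df = (r - 1) * (c - 1)
p_value = chi2.sf(stat, df)

residuals = (observed - expected) / np.sqrt(expected)      # Pearson residuals
cramers_v = np.sqrt(stat / (N * min(r - 1, c - 1)))        # strength of association

print(stat, df, p_value)
print(residuals)
print(cramers_v)
```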
Goodness-of-Fit Test
The chi-squared goodness-of-fit test evaluates whether the distribution of observed categorical data aligns with a predefined theoretical distribution, providing a measure of discrepancy between observed and expected frequencies. This test is particularly valuable when assessing if sample outcomes conform to expected probabilities derived from theoretical models, such as uniform, Poisson, or multinomial distributions. Introduced as part of the broader chi-squared framework by Karl Pearson, it serves as a foundational tool in statistical inference for distribution validation.[7][4]
In the standard setup, the data is partitioned into k mutually exclusive categories, yielding observed counts O_i for each category i = 1, 2, ..., k. The corresponding expected counts are then calculated as E_i = n p_i, where n is the total sample size and p_i represents the theoretical probability for category i, often set to 1/k for uniformity or derived from parametric models like the Poisson distribution. For instance, in testing dice fairness, each of the six faces would have an expected probability of 1/6 under the null assumption of uniformity. The test proceeds by computing the chi-squared statistic from these frequencies, as detailed in the mathematical formulation section.[4][26]
The null hypothesis (H_0) asserts that the observed data arises from the specified theoretical distribution, implying no significant deviation between observed and expected frequencies. The alternative hypothesis (H_a) posits that the data does not follow this distribution, indicating a mismatch that could arise from systematic biases or non-random processes. Typical applications include verifying uniformity in random number generators or gaming devices like dice, as well as checking adherence to multinomial models in fields such as genetics or market research.[4][27]
When the theoretical probabilities p_i involve parameters estimated directly from the sample data—such as the mean in a Poisson fit—the degrees of freedom must be adjusted to account for this estimation, given by df = k - 1 - m, where m is the number of parameters fitted. This adjustment ensures the test's validity by reducing the effective freedom to reflect the information used in parameter estimation. The chi-squared goodness-of-fit test is commonly applied in quality control to assess process consistency, such as verifying that defect rates or product categorizations match expected distributional norms in manufacturing.[4][28]
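As an illustration of this degrees-of-freedom adjustment, the following sketch (assuming SciPy and made-up event counts) fits a Poisson mean to binned data and tests the fit with df = k - 1 - 1; treating the open-ended top bin as the value 4 when estimating the mean is a simplification for demonstration only.

```python
import numpy as np
from scipy.stats import chi2, poisson

# Hypothetical counts of events per interval, binned as 0, 1, 2, 3, and 4-or-more
bins = [0, 1, 2, 3]
observed = np.array([18, 30, 26, 16, 10])    # last entry is the ">= 4" bin
n = observed.sum()

# Estimate the Poisson mean from the binned data (treating the last bin as 4)
values = np.array([0, 1, 2, 3, 4])
lam = (values * observed).sum() / n

# Expected counts under the fitted Poisson model
probs = [poisson.pmf(k, lam) for k in bins]
probs.append(1 - sum(probs))                 # tail probability for the ">= 4" bin
expected = n * np.array(probs)

stat = np.sum((observed - expected) ** 2 / expected)
df = len(observed) - 1 - 1                   # k - 1 - m, with m = 1 estimated parameter
print(stat, df, chi2.sf(stat, df))
```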
Computational Methods
Step-by-Step Calculation
To perform a manual calculation of the chi-squared test statistic, begin by organizing the data into a contingency table for tests of independence or a frequency table for goodness-of-fit tests, recording the observed frequencies O_i in each cell or category.[5][29]
Next, compute the expected frequencies E_i for each cell or category, which depend on the specific application: for a test of independence in categorical data, these are derived from the marginal totals and overall sample size as outlined in the mathematical formulation; for a goodness-of-fit test, they are obtained by multiplying the total sample size by the hypothesized proportions for each category.[5][29]
Then, calculate the test statistic using the formula
\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i},
where the sum is taken over all cells or categories, providing a measure of deviation between observed and expected frequencies.[5][29]
Determine the degrees of freedom (df) based on the test type—for independence, df = (r - 1)(c - 1) where r is the number of rows and c is the number of columns in the contingency table; for goodness-of-fit, df = k - 1 - m where k is the number of categories and m is the number of parameters estimated from the data—and use a chi-squared distribution table or software to find the critical value at a chosen significance level (e.g., α = 0.05) or the p-value, referencing the asymptotic properties of the statistic for large samples.[5][29]
Compare the computed \chi^2 to the critical value: if \chi^2 exceeds the critical value (or if the p-value < α), reject the null hypothesis H_0 of independence or good fit; otherwise, fail to reject it. Report the results in the standard format, such as "\chi^2(df) = value, p = value," to summarize the finding.[5][29]
Note that the chi-squared approximation is reliable only when expected frequencies meet certain conditions, such as E_i \geq 5 in at least 80% of cells with no E_i < 1, as smaller values can lead to inaccurate p-values and may require alternative tests.[5][29]
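The workflow above can be carried out directly in software; a minimal sketch, assuming SciPy and a made-up 2×2 table, that also checks the expected-frequency rule of thumb and reports in the conventional format:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2 x 2 contingency table of observed counts
observed = np.array([[12, 18],
                     [20, 10]])

stat, p, dof, expected = chi2_contingency(observed, correction=False)

# Check the rule of thumb on expected counts before trusting the approximation
if (expected < 5).mean() > 0.2 or (expected < 1).any():
    print("Warning: low expected counts; consider an exact test.")

# Report in the conventional format: chi^2(df) = value, p = value
print(f"chi^2({dof}) = {stat:.2f}, p = {p:.4f}")
```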
Software and Implementation Notes
The chi-squared test is implemented in various statistical software packages, facilitating both tests of independence and goodness-of-fit. In R, the chisq.test() function from the base stats package handles both types of tests on count data.[30] For a contingency table test of independence, users provide a matrix of observed frequencies, optionally applying Yates's continuity correction via the correct parameter (default: TRUE for 2x2 tables).[30] For goodness-of-fit, the function accepts a vector of observed counts and a vector of expected probabilities via the p parameter.[30] An example for independence is:
```r
# 2 x 2 table of observed counts (matrix() fills column by column)
observed <- matrix(c(10, 20, 30, 40), nrow=2)
# Pearson test of independence without Yates's continuity correction
result <- chisq.test(observed, correct=FALSE)
print(result)
```
This outputs the chi-squared statistic, p-value, degrees of freedom, and components like observed and expected counts.[30]
In Python, the SciPy library provides scipy.stats.chi2_contingency() for tests of independence on contingency tables, returning the statistic, p-value, degrees of freedom, and expected frequencies.[31] The function applies Yates's correction by default but allows disabling it; since version 1.11.0, a method parameter supports Monte Carlo simulation or permutation tests for improved accuracy with small samples.[31] For goodness-of-fit, scipy.stats.chisquare() compares observed frequencies to expected ones under the null hypothesis of equal probabilities (or user-specified via f_exp).[32] An example for independence is:
```python
import numpy as np
from scipy.stats import chi2_contingency

# 2 x 2 table of observed counts
observed = np.array([[10, 20], [30, 40]])
# Returns the statistic, p-value, degrees of freedom, and expected counts;
# Yates's correction is applied by default for 2 x 2 tables
stat, p, dof, expected = chi2_contingency(observed)
print(f'Statistic: {stat}, p-value: {p}')
```
This assumes observed and expected frequencies are at least 5 for asymptotic validity.[32]
SPSS implements the test through the Crosstabs procedure (Analyze > Descriptive Statistics > Crosstabs) for independence, where users select row and column variables, then enable Chi-square under Statistics; it outputs the statistic, p-value, and optionally residuals.[33] For goodness-of-fit, use Nonparametric Tests > Legacy Dialogs > Chi-Square, specifying test proportions.[33] The output flags violations of the guideline that expected frequencies be at least 1, with no more than 20% of cells below 5.[33] In Microsoft Excel, the CHISQ.TEST() function returns the p-value for a goodness-of-fit or independence test by comparing ranges of actual and expected frequencies; given a statistic and its degrees of freedom, CHISQ.DIST.RT() returns the corresponding right-tailed p-value.[34] For example: =CHISQ.TEST(A1:B2, C1:D2), where A1:B2 holds observed and C1:D2 expected values.[34]
Software implementations often issue warnings for low expected counts, as the chi-squared approximation may be unreliable if more than 20% of cells have expected frequencies <5 or any <1.[30][31] In R, chisq.test() explicitly warns if expected values are <5.[30] Similarly, SciPy notes potential inaccuracy for small frequencies and recommends alternatives like exact tests.[32] Residuals, useful for identifying influential cells, are accessible in outputs: R provides Pearson residuals (result$residuals) and standardized residuals (result$stdres); SciPy allows computation from returned expected frequencies as (observed - expected) / sqrt(expected); SPSS includes them in Crosstabs tables when selected.[30][31][33]
As of 2025, modern software includes simulation-based options like Monte Carlo or bootstrapping for p-values in small samples to enhance accuracy beyond the asymptotic approximation.[30][31] In R, set simulate.p.value=TRUE with B replicates for Monte Carlo p-values.[30] SciPy's chi2_contingency supports 'monte-carlo' or 'permutation' methods.[31] SPSS Exact Tests module offers Monte Carlo simulation for exact p-values in the Crosstabs dialog.[33] These approaches resample the data to estimate the null distribution, mitigating issues with low counts.[30][31]
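To illustrate the resampling idea in general terms (without relying on the package-specific interfaces named above), the following sketch estimates a Monte Carlo p-value for a goodness-of-fit test by repeatedly sampling counts from the multinomial null distribution; the counts and null probabilities are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed counts and a uniform null over 4 categories
observed = np.array([2, 7, 3, 8])
n = observed.sum()
p_null = np.full(4, 0.25)
expected = n * p_null

obs_stat = np.sum((observed - expected) ** 2 / expected)

# Resample counts from the multinomial null and compare statistics
B = 10_000
sims = rng.multinomial(n, p_null, size=B)
sim_stats = np.sum((sims - expected) ** 2 / expected, axis=1)
p_mc = (np.sum(sim_stats >= obs_stat) + 1) / (B + 1)   # add-one Monte Carlo p-value
print(obs_stat, p_mc)
```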
Yates's Correction for Continuity
Yates's correction for continuity is a modification to the standard Pearson chi-squared statistic designed specifically for 2×2 contingency tables involving small sample sizes. Introduced by Frank Yates in 1934, it adjusts the test to better account for the discrete nature of categorical count data when approximating the continuous chi-squared distribution, thereby improving the accuracy of the p-value estimation.[35][36]
The corrected statistic is computed as
\chi^2 = \sum \frac{(|O_i - E_i| - 0.5)^2}{E_i},
where O_i denotes the observed frequency in cell i, E_i the expected frequency under the null hypothesis of independence, and the subtraction of 0.5 serves as the continuity correction to mitigate the discontinuity between discrete observations and the continuous approximation. This adjustment reduces the value of the chi-squared statistic compared to the uncorrected version, making it less likely to reject the null hypothesis and thus lowering the risk of Type I error inflation in small samples.[36][37]
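As a numerical illustration of the corrected formula, a short sketch (assuming NumPy, with a made-up small-count 2×2 table) compares the uncorrected and corrected statistics:

```python
import numpy as np

# Hypothetical 2 x 2 table with small counts
observed = np.array([[8, 2],
                     [3, 7]])

row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row @ col / observed.sum()

uncorrected = np.sum((observed - expected) ** 2 / expected)
# Yates's correction: subtract 0.5 from each absolute deviation before squaring
corrected = np.sum((np.abs(observed - expected) - 0.5) ** 2 / expected)
print(uncorrected, corrected)   # the corrected value is smaller
```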
The correction is recommended for application in 2×2 tables when all expected cell frequencies are at least 1 and at least one is less than 5, as these conditions indicate potential inadequacy of the chi-squared approximation without adjustment. However, Yates explicitly advised against its use for tables larger than 2×2, where the correction has minimal impact and may unnecessarily complicate computations.[35][36]
Despite its historical utility, the routine use of Yates's correction remains debated among statisticians, with critics arguing that it is overly conservative, particularly in modern contexts where exact tests are computationally feasible, potentially reducing statistical power without substantial benefits in controlling error rates. Influential analyses, such as those by Agresti, highlight that the correction is often unnecessary given advancements in exact methods and simulation-based approaches.[38]
Fisher's Exact Test and Binomial Test as Alternatives
Fisher's exact test provides an exact alternative to the chi-squared test of independence for 2×2 contingency tables, particularly when sample sizes are small and the chi-squared approximation may be unreliable. Developed by Ronald A. Fisher, the test computes the probability of observing the given table (or one more extreme) under the null hypothesis of independence, assuming fixed marginal totals, using the hypergeometric distribution.[39] The p-value is obtained by summing the hypergeometric probabilities of all tables with the same margins that are as or less probable than the observed table.
This test is especially recommended for 2×2 tables where one or more expected cell frequencies are less than 5, as the chi-squared test's asymptotic approximation performs poorly in such cases, potentially leading to inaccurate p-values.[40] Computationally, Fisher's exact test traditionally relies on enumerating all possible tables consistent with the fixed margins, though for larger tables this becomes intensive; modern implementations use efficient network algorithms to optimize the summation over the probability space. Despite these challenges for tables beyond 2×2, the test is routinely available in statistical software for practical use.[41]
For even simpler cases, such as testing a single proportion in a 2×1 table (e.g., comparing observed successes to an expected rate under the null), the binomial test serves as an exact alternative. This test evaluates deviations from the hypothesized proportion using the exact binomial distribution, calculating the p-value as the cumulative probability of outcomes as extreme as or more extreme than observed.[42] Like Fisher's test, it is preferred when expected counts are small (e.g., fewer than 5 successes or failures), avoiding reliance on normal approximations inherent in large-sample methods. The p-value is computed directly from the binomial cumulative distribution function, which is straightforward and efficient even for moderate sample sizes.[42]
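Both exact procedures are available in SciPy; a brief sketch with made-up data (the 2×2 table, the success count, and the null proportion are hypothetical):

```python
from scipy.stats import fisher_exact, binomtest

# Hypothetical 2 x 2 table with small counts
table = [[1, 9],
         [11, 3]]
odds_ratio, p_fisher = fisher_exact(table, alternative='two-sided')
print(p_fisher)

# Exact binomial test: 3 successes out of 20 trials against a null proportion of 0.25
result = binomtest(3, n=20, p=0.25, alternative='two-sided')
print(result.pvalue)
```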
Chi-squared Test for Variance
The chi-squared test for variance assesses whether the variance of a normally distributed population equals a specified hypothesized value, applying specifically to continuous data rather than the categorical data addressed by Pearson's chi-squared test of independence or goodness-of-fit. This test evaluates the null hypothesis H_0: \sigma^2 = \sigma_0^2, where \sigma^2 is the population variance and \sigma_0^2 is the hypothesized value.[43]
The test statistic is formulated as
\chi^2 = \frac{(n-1) s^2}{\sigma_0^2},
where n denotes the sample size and s^2 represents the sample variance, calculated as s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 with \bar{x} as the sample mean.[43][44]
Assuming the population is normally distributed, under the null hypothesis, this statistic follows an exact chi-squared distribution with n-1 degrees of freedom, eliminating the need for asymptotic approximations unlike in categorical applications.[44][45]
For hypothesis testing, a two-sided alternative (H_a: \sigma^2 \neq \sigma_0^2) rejects H_0 if the p-value or test statistic falls outside critical values from the chi-squared distribution table at the chosen significance level; one-sided alternatives (H_a: \sigma^2 > \sigma_0^2 or H_a: \sigma^2 < \sigma_0^2) use the appropriate tail.[43][45]
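A minimal sketch of this procedure, assuming SciPy and a made-up sample; the hypothesized variance \sigma_0^2 = 0.04 is arbitrary:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical sample and hypothesized variance
x = np.array([10.2, 9.8, 10.5, 10.1, 9.6, 10.4, 9.9, 10.3])
sigma0_sq = 0.04                      # H0: population variance equals 0.04

n = len(x)
s_sq = x.var(ddof=1)                  # sample variance with n - 1 denominator
stat = (n - 1) * s_sq / sigma0_sq     # chi-squared statistic with n - 1 df

# Two-sided p-value: double the smaller tail probability (capped at 1)
p_two_sided = min(1.0, 2 * min(chi2.cdf(stat, n - 1), chi2.sf(stat, n - 1)))
print(stat, p_two_sided)
```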
This formulation emerged in the 1920s through Ronald A. Fisher's development of inference methods for normal distributions, distinct from Karl Pearson's earlier work on chi-squared for discrete data.
Interpretation and Limitations
The chi-squared test for variance involves computing the test statistic \chi^2 = \frac{(n-1)s^2}{\sigma_0^2}, where n is the sample size, s^2 is the sample variance, and \sigma_0^2 is the hypothesized population variance under the null hypothesis H_0: \sigma^2 = \sigma_0^2. Under H_0 and assuming normality, this statistic follows a chi-squared distribution with n-1 degrees of freedom.[46][43] For a two-sided test at significance level \alpha, reject H_0 if \chi^2 > \chi^2_{1-\alpha/2, n-1} (upper critical value) or \chi^2 < \chi^2_{\alpha/2, n-1} (lower critical value); one-sided alternatives adjust the critical region accordingly.[46][47] A p-value is obtained by comparing the observed \chi^2 to the chi-squared distribution, with rejection if p < \alpha.[43]
A (1-\alpha) \times 100\% confidence interval for the population variance \sigma^2 is given by
\left( \frac{(n-1)s^2}{\chi^2_{1-\alpha/2, n-1}}, \frac{(n-1)s^2}{\chi^2_{\alpha/2, n-1}} \right),
where the quantiles are from the chi-squared distribution with n-1 degrees of freedom; under normality this interval covers the true population variance \sigma^2 with probability 1-\alpha.[48][49] If the interval excludes \sigma_0^2, the two-sided test at level \alpha rejects H_0. For the standard deviation \sigma, take square roots of the interval bounds.[50]
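The interval can be computed directly from chi-squared quantiles; a short sketch using the same hypothetical sample as in the previous sketch, assuming SciPy:

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical sample; 95% confidence interval for the population variance
x = np.array([10.2, 9.8, 10.5, 10.1, 9.6, 10.4, 9.9, 10.3])
alpha = 0.05
n = len(x)
s_sq = x.var(ddof=1)

lower = (n - 1) * s_sq / chi2.ppf(1 - alpha / 2, n - 1)
upper = (n - 1) * s_sq / chi2.ppf(alpha / 2, n - 1)
print(lower, upper)                      # interval for sigma^2
print(np.sqrt(lower), np.sqrt(upper))    # interval for sigma
```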
This test is often paired with the t-test for the population mean in normal theory inference, where both assess aspects of a normal distribution: the t-test evaluates the mean using the sample-estimated variance, while the chi-squared test examines the variance itself.[51][52]
The test relies on the strict assumption that the population is normally distributed, making it highly sensitive to departures from normality, such as skewness, kurtosis, or outliers, which can distort the \chi^2 distribution and lead to invalid p-values or coverage probabilities for confidence intervals.[49] For non-normal data, robust alternatives like Levene's test (adapted for single samples) or bootstrap methods are recommended over the chi-squared approach.[49][24] Additionally, the test exhibits low power for detecting deviations from H_0, especially with small to moderate sample sizes, often requiring large n (e.g., >30) to achieve adequate sensitivity to variance changes.[53]
The chi-squared test for a single variance can be generalized to compare two population variances using the F-test, where the statistic F = s_1^2 / s_2^2 follows an F-distribution with n_1-1 and n_2-1 degrees of freedom under normality and the null hypothesis H_0: \sigma_1^2 = \sigma_2^2.[47][54]
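A corresponding two-sample sketch, assuming SciPy and two made-up samples:

```python
import numpy as np
from scipy.stats import f

# Hypothetical samples from two groups
x1 = np.array([5.1, 4.9, 5.3, 5.0, 4.8, 5.2])
x2 = np.array([5.4, 4.6, 5.6, 4.5, 5.5, 4.7])

s1_sq, s2_sq = x1.var(ddof=1), x2.var(ddof=1)
F = s1_sq / s2_sq
df1, df2 = len(x1) - 1, len(x2) - 1

# Two-sided p-value for H0: sigma1^2 = sigma2^2
p = min(1.0, 2 * min(f.cdf(F, df1, df2), f.sf(F, df1, df2)))
print(F, p)
```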
Illustrative Examples
Contingency Table Example
To illustrate the chi-squared test for independence, consider a hypothetical 2x2 contingency table examining the association between smoking status and lung cancer diagnosis in a sample of 100 patients. The observed frequencies are as follows:
| | Lung Cancer | No Lung Cancer | Total |
|---|---|---|---|
| Smokers | 40 | 10 | 50 |
| Non-Smokers | 5 | 45 | 50 |
| Total | 45 | 55 | 100 |
Under the null hypothesis of independence, the expected frequency E_{ij} for each cell is calculated as E_{ij} = \frac{(\text{row total}_i \times \text{column total}_j)}{\text{grand total}}. Thus, the expected values are:
| | Lung Cancer | No Lung Cancer | Total |
|---|---|---|---|
| Smokers | 22.5 | 27.5 | 50 |
| Non-Smokers | 22.5 | 27.5 | 50 |
| Total | 45 | 55 | 100 |
The chi-squared test statistic is then \chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, yielding \chi^2 = \frac{(40-22.5)^2}{22.5} + \frac{(10-27.5)^2}{27.5} + \frac{(5-22.5)^2}{22.5} + \frac{(45-27.5)^2}{27.5} = 49.49. With 1 degree of freedom (df = (rows-1) \times (columns-1)), the p-value is much less than 0.001, leading to rejection of the null hypothesis of independence.
To identify which cells contribute most to the significant result, Pearson residuals are computed as r_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}}}. These are approximately 3.69 for smokers with lung cancer, -3.34 for smokers without lung cancer, -3.69 for non-smokers with lung cancer, and 3.34 for non-smokers without lung cancer. The large positive residual for smokers with lung cancer (and corresponding negative for non-smokers with lung cancer) highlights the key association driving the departure from independence.
Applying Yates's correction for continuity, suitable for 2x2 tables with moderate sample sizes, adjusts the statistic to \chi^2 = \sum \frac{(|O_{ij} - E_{ij}| - 0.5)^2}{E_{ij}} = 46.71, which remains highly significant (p < 0.001). This correction reduces the chi-squared value slightly but does not alter the conclusion.
In summary, these results provide strong evidence of an association between smoking and lung cancer in this hypothetical dataset, with smokers overrepresented among those diagnosed.
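For readers following along in software, the figures above can be reproduced with SciPy (a quick numerical check, not part of the original worked calculation):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[40, 10],
                     [5, 45]])

# Uncorrected Pearson statistic (about 49.49 with df = 1)
stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(stat, p, dof)

# With Yates's continuity correction (about 46.71)
stat_c, p_c, _, _ = chi2_contingency(observed, correction=True)
print(stat_c, p_c)
```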
Goodness-of-Fit Example
A classic illustration of the chi-squared goodness-of-fit test involves assessing the fairness of a six-sided die, where the null hypothesis states that each face appears with equal probability of \frac{1}{6}. Consider an experiment with 30 rolls, yielding the observed frequencies shown in the table below.
| Face | Observed Frequency (O_i) |
|---|---|
| 1 | 3 |
| 2 | 7 |
| 3 | 5 |
| 4 | 10 |
| 5 | 2 |
| 6 | 3 |
The expected frequency for each face under the null hypothesis is E_i = \frac{30}{6} = 5. The test statistic is computed as
\chi^2 = \sum_{i=1}^{6} \frac{(O_i - 5)^2}{5} = 9.2.
With 5 degrees of freedom (number of categories minus 1), the corresponding p-value is approximately 0.101, which exceeds the common significance level of 0.05, so there is insufficient evidence to reject the null hypothesis of a fair die.
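The same result can be reproduced with SciPy's chisquare function, which defaults to equal expected frequencies across categories:

```python
from scipy.stats import chisquare

# Observed face counts from the 30-roll example; equal expected counts of 5 per face
observed = [3, 7, 5, 10, 2, 3]
result = chisquare(observed)            # default null: all categories equally likely
print(result.statistic, result.pvalue)  # about 9.2 and 0.101
```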
In contrast, a biased die would produce observed frequencies that lead to rejection of the null. For instance, in a large-scale experiment rolling 12 dice 26,306 times (totaling 315,672 face outcomes), the observed frequencies were 53,222 for 1, 52,118 for 2, 52,465 for 3, 52,338 for 4, 52,244 for 5, and 53,285 for 6, compared to an expected 52,612 each. This yielded \chi^2 = 24.74 with 5 degrees of freedom and a p-value of approximately 0.00016, providing strong evidence to reject uniformity and conclude bias (attributed to uneven die dimensions).[55]
When the hypothesized probabilities are not fully specified but estimated from the sample data (e.g., maximum likelihood estimation of parameters in a distribution like Poisson or exponential), the degrees of freedom must be adjusted downward to account for the estimation; specifically, df = k - 1 - m, where k is the number of categories and m is the number of estimated parameters.[56]
Broader Applications and Considerations
Applications in Various Fields
The chi-squared test finds extensive application in genetics, particularly for assessing Hardy-Weinberg equilibrium, where it serves as a goodness-of-fit test to evaluate whether observed allele and genotype frequencies in a population conform to expected proportions under assumptions of random mating, no selection, mutation, migration, or genetic drift.[57] This test is routinely used in population genetics studies to detect deviations that may indicate evolutionary forces at play, such as in analyses of single nucleotide polymorphisms (SNPs) in genomic data.[58]
In the social sciences, the chi-squared test is commonly employed to analyze contingency tables from survey data, testing for independence between categorical variables, such as the relationship between education level and income brackets.[59] For instance, researchers use it to determine if observed crosstabulations of responses to questions on voting behavior and demographic factors significantly differ from what would be expected under independence, informing sociological theories on social stratification.[60]
Market research leverages the chi-squared goodness-of-fit test to assess whether consumer preferences for products align with anticipated market shares, such as evaluating if brand choices among surveyed customers deviate from proportional expectations based on historical sales data.[61] This application helps marketers identify significant shifts in consumer behavior, guiding decisions on product positioning and advertising strategies.[62]
In physics and engineering, the chi-squared test for variance is applied in quality control processes to verify if the variability in measurements from manufacturing or experimental data matches a specified standard, assuming normality, thereby ensuring process stability in settings like semiconductor production or material testing.[43] For example, it tests whether the dispersion in component dimensions conforms to design tolerances, flagging potential issues in production lines.[63]
The chi-squared test is integrated into machine learning pipelines for evaluating independence between categorical features and target variables in datasets, aiding feature selection to enhance model performance in classification tasks.[64] This usage supports preprocessing in areas like predictive analytics for customer segmentation, where it identifies non-redundant categorical inputs.[65]
Common Pitfalls and Extensions
One common pitfall in applying the chi-squared test is ignoring cells with low expected frequencies, which can lead to invalid p-values due to the poor performance of the asymptotic approximation. Specifically, the test assumes that expected cell counts are at least 5 in at least 80% of cells, with no expected value less than 1, to ensure the chi-squared distribution adequately approximates the test statistic; violations often result in inflated Type I error rates.[5] Another frequent error is failing to apply corrections for multiple testing when conducting several chi-squared tests simultaneously, such as in post-hoc analyses of contingency tables, which increases the family-wise error rate; methods like the Bonferroni adjustment, dividing the significance level by the number of tests, are recommended to control this.[66][67] Additionally, interpreting a significant chi-squared result as evidence of causation rather than mere association confuses statistical dependence with directional influence, as the test only assesses whether categorical variables are independent under the null hypothesis.[68] Over-reliance on the asymptotic chi-squared approximation in small samples exacerbates these issues, as the test's validity diminishes when expected frequencies are low, potentially leading to inaccurate p-values; in such cases, exact alternatives like Fisher's exact test should be considered.[69]
Extensions of the chi-squared test address limitations in handling ordered categories or complex scenarios. For ordinal data, where categories have a natural order (e.g., low, medium, high satisfaction levels), the standard chi-squared test of independence may overlook trends; an ordinal chi-squared test incorporates scores for the ordered levels to detect monotonic associations more powerfully, such as through the Cochran-Armitage trend test statistic.[70] In cases with sparse data or violations of asymptotic assumptions, simulation-based p-values provide a robust alternative by generating the null distribution via Monte Carlo resampling of the observed margins, yielding more accurate inference without relying on the chi-squared approximation.[71] For multi-way contingency tables beyond two dimensions, log-linear models extend the chi-squared framework by modeling the logarithm of expected cell frequencies as a linear combination of main effects and interactions, allowing hierarchical assessment of associations via likelihood ratio tests that parallel chi-squared goodness-of-fit evaluations.[72]
Recent developments include Bayesian analogs to the chi-squared test, which incorporate prior information using Dirichlet priors on multinomial probabilities to compute posterior probabilities of independence, offering advantages in small samples or when eliciting expert priors.[73]