Contingency table

A contingency table, also known as a cross-tabulation or crosstab, is a type of table in a matrix format that displays the multivariate frequency distribution of two or more categorical variables, with rows and columns representing the categories of each variable and cell entries showing the joint counts or frequencies of their combinations. These tables are fundamental in statistics for summarizing multivariate categorical data and facilitating the examination of potential associations or dependencies between variables. The structure of a contingency table typically includes marginal totals for rows and columns, which represent the univariate distributions of the individual variables, while the inner cells capture the joint frequencies from which conditional distributions can be derived. For a two-way table with r rows and c columns, the degrees of freedom for associated statistical tests are (r-1)(c-1), enabling assessments of independence under the null hypothesis that the variables are unrelated. Tables can extend to higher dimensions for three or more variables, though two-way tables remain the most common for initial exploratory analysis. They are widely applied in fields such as the social sciences, epidemiology, and market research to test hypotheses about relationships; for instance, tests of independence such as the Pearson chi-squared test evaluate whether the variables are related, while measures of association quantify the strength of any relationship. Modern extensions include log-linear models for multi-way interactions and continuity corrections, such as Yates' correction, to improve test accuracy in sparse tables.

Fundamentals

Definition and Purpose

A contingency table, also known as a cross-tabulation or two-way table, is a table that presents the multivariate frequency distribution of two or more categorical variables, with rows and columns representing the categories of each variable and cell entries showing the observed counts. These tables provide a structured way to summarize joint occurrences of categories across variables, enabling researchers to visualize how observations are distributed across combinations without requiring numerical or continuous data. The primary purpose of contingency tables is to explore potential associations between categorical variables, test hypotheses regarding their independence, and facilitate the computation of conditional probabilities from the data. For instance, they are widely applied in epidemiology to assess relationships between risk factors and outcomes, such as exposure status and disease incidence. In the social sciences, they help analyze patterns in survey responses or demographic data, while in market research, they reveal dependencies between consumer preferences and demographics to inform segmentation strategies. Unlike parametric models, contingency tables allow for model-free visualization of dependencies, making them versatile for initial exploratory analysis across disciplines. Contingency tables are typically structured as r × c matrices, where r denotes the number of row categories and c the number of column categories, and they often incorporate fixed marginal totals for conducting conditional analyses that focus on distributions within subsets of the data. This empirical focus on observed frequencies distinguishes them from logical constructs like truth tables, which enumerate all possible outcomes rather than summarizing real-world data counts. A common application involves using such tables to detect deviations from independence, as in the chi-squared test of independence.

Historical Development

The roots of contingency tables trace back to early 19th-century efforts in vital statistics and demography, where scholars employed multi-way tables to explore associations between variables such as age, sex, and criminality, emphasizing marginal distributions over independence testing. Precursors included Pierre-Simon Laplace and Siméon Denis Poisson's probabilistic analyses of 2×2 tables for comparing proportions in the early 1800s, and Félix Gavarret's 1840 application of these methods to medical data like sex ratios in births. By the late 19th century, figures such as Charles Sanders Peirce (1884) developed measures of association for 2×2 tables, applying them to predictive problems like tornado forecasting, while Francis Galton (1892) used expected frequencies in 3×3 fingerprint classification tables to assess independence. The modern statistical foundation of contingency tables was established in 1900 by Karl Pearson, who introduced the chi-squared statistic as a measure of goodness of fit and independence, applicable to tables assessing associations between categorical variables. Pearson's seminal paper, "On the Criterion that a Given System of Deviations from the Probable in the Case of a Correlated System of Variables is Such that it Can Be Reasonably Supposed to Have Arisen from Random Sampling," provided a criterion for evaluating whether observed frequencies deviated significantly from expectations under random sampling, marking the shift toward rigorous hypothesis testing in contingency analysis. Working in the same circle, George Udny Yule extended these ideas in the same year, developing association measures such as Yule's Q for 2×2 tables, which quantified dependence in binary outcomes such as disease incidence by exposure. Pearson later coined the term "contingency table" in his 1904 work, "On the Theory of Contingency and Its Relation to Association and Normal Correlation," formalizing the framework for multivariate categorical data. In the 1920s and 1930s, Ronald A. Fisher advanced the methodology for small-sample tables, critiquing the chi-squared approximation's limitations and developing exact procedures. In a 1922 paper on the application of the χ² method to association and contingency tables, Fisher outlined uses of chi-squared for multi-way tables while highlighting issues with low expected frequencies. By the mid-1930s he had formalized his exact test, presented in Statistical Methods for Research Workers, using the hypergeometric distribution to compute precise p-values for 2×2 tables under fixed margins, illustrated famously by the "lady tasting tea" experiment, which addressed small-sample inference without relying on asymptotic approximations. These contributions emphasized exact conditional inference, influencing tests for sparse tables. The mid-20th century saw expansion to multi-way contingency tables, facilitated by computational advancements that enabled handling larger, more complex datasets beyond manual calculations. Early theoretical work, such as Maurice S. Bartlett's 1935 exploration of interactions in multi-dimensional tables, paved the way, but practical implementation accelerated post-World War II with electronic computers allowing iterative estimation for higher-order analyses. By the 1970s, Leo A. Goodman integrated log-linear models into contingency table analysis, treating cell frequencies as Poisson or multinomial outcomes and using iterative proportional fitting to model interactions hierarchically.
Goodman's series of papers, starting with "The Multivariate Analysis of Qualitative Data: Interactions among Multiple Classifications" in 1970, provided stepwise procedures and direct estimation for building models that captured main effects and higher-order associations in multi-way tables. This approach, further developed by Stephen E. Fienberg and others, revolutionized the field by enabling sophisticated inference on complex categorical structures.

Construction and Interpretation

Standard Format

A contingency table is typically arranged as a rectangular matrix that cross-classifies observations from two categorical variables, with rows representing the levels of one variable (say, with r categories) and columns representing the levels of the other (with c categories). The cells of this r \times c table contain the observed frequencies, denoted as n_{ij}, which count the number of observations falling into the i-th row category and j-th column category. This layout facilitates the visualization of the joint distribution of the two variables. Standard notation employs double subscripts for cell entries, such as O_{ij} or n_{ij} for observed counts, where uppercase O is sometimes used to distinguish observed from expected values in statistical analyses. Certain cells may contain structural zeros, which arise where combinations of categories are impossible or precluded by design, rendering those probabilities inherently zero and the table incomplete. While the two-way table serves as the standard format, extensions to multi-way tables incorporate additional dimensions, such as a three-dimensional r \times c \times k array for three variables, though these are analyzed by slicing or modeling the higher-order interactions. Contingency tables are generally symmetric in the sense that the variables are interchangeable, but asymmetric variants exist where the ordering of rows and columns matters, as in confusion matrices that distinguish predicted from actual outcomes in classification tasks. In interpretation, the row totals (margins) summarize the distribution across columns for each row category, and column totals do likewise for rows; these margins enable the computation of conditional proportions, often expressed as percentages within rows or columns to assess relative frequencies.
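As a minimal illustration of this notation, the following sketch (in Python with NumPy; the counts are hypothetical) stores a small r \times c table of observed frequencies n_{ij} and derives its row margins, column margins, grand total, and row-wise conditional proportions:

```python
import numpy as np

# Hypothetical 2 x 3 table of observed counts n_ij
# (rows = levels of one variable, columns = levels of the other).
observed = np.array([[20, 15, 5],
                     [10, 25, 25]])

row_totals = observed.sum(axis=1)    # n_i. : row marginal totals
col_totals = observed.sum(axis=0)    # n_.j : column marginal totals
grand_total = observed.sum()         # n_.. : overall sample size

# Conditional proportions within rows, often reported as percentages.
row_proportions = observed / row_totals[:, None]

print(row_totals, col_totals, grand_total)
print(np.round(row_proportions, 3))
```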

Illustrative Example

Consider a hypothetical survey of 200 individuals assessing the relationship between gender and preference for a particular product (yes or no response). The data can be organized into a 2×2 contingency table, with gender as the row variable (male, female) and preference as the column variable (yes, no). The observed frequencies in each cell represent the count of respondents falling into each combination. The resulting table is as follows:
Gender    Yes    No
Male       30    70
Female     40    60
To construct this table, first define the categorical variables: assign one to rows (e.g., gender with two levels: male and female) and the other to columns (e.g., preference with two levels: yes and no). Then, tally the raw frequencies from the survey data into the appropriate cells based on respondents' answers, ensuring each individual contributes to exactly one cell. Simple percentages can highlight row-specific proportions for initial interpretation. For example, among males, 30 out of 100 (30%) expressed a preference for the product, compared to 40 out of 100 (40%) among females. These percentages are calculated by dividing each cell count by the row total and multiplying by 100. Visually inspecting the table reveals patterns, such as a higher proportion of females expressing a preference relative to males, indicating an apparent difference in preferences across groups that may warrant further statistical testing.
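The same construction can be reproduced programmatically. The sketch below (a Python example using pandas; the respondent-level data are simply reconstructed from the hypothetical counts above) builds the cross-tabulation and the row percentages:

```python
import pandas as pd

# One row per respondent, matching the hypothetical 2x2 counts (30/70, 40/60).
data = pd.DataFrame({
    "gender":     ["male"] * 100 + ["female"] * 100,
    "preference": ["yes"] * 30 + ["no"] * 70 + ["yes"] * 40 + ["no"] * 60,
})

# Cross-tabulate: rows = gender, columns = preference.
table = pd.crosstab(data["gender"], data["preference"])
print(table)

# Row percentages: share of each gender answering yes or no.
row_pct = pd.crosstab(data["gender"], data["preference"], normalize="index") * 100
print(row_pct.round(1))
```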

Marginal Totals and Expected Frequencies

In a contingency table, marginal totals are computed by summing the observed frequencies along the rows and columns, providing summaries of the univariate distributions of the respective categorical variables. The row marginal total for row i, denoted n_{i.}, is the sum of frequencies across all columns in that row: n_{i.} = \sum_j n_{ij}, where n_{ij} is the observed frequency in cell (i,j). Similarly, the column marginal total for column j, denoted n_{.j}, is n_{.j} = \sum_i n_{ij}. The grand total, n_{..}, is the sum of all cell frequencies or equivalently the sum of all row or column marginals: n_{..} = \sum_i n_{i.} = \sum_j n_{.j}. Under the assumption of independence between the row and column variables, expected frequencies for each cell serve as a baseline for what would be anticipated if no association exists. The expected frequency E_{ij} for cell (i,j) is calculated as the product of the corresponding row and column marginal totals divided by the grand total: E_{ij} = \frac{n_{i.} \cdot n_{.j}}{n_{..}}. This formula arises from the independence assumption, under which the joint probability is the product of the marginal probabilities, scaled by the sample size. Comparing observed frequencies n_{ij} to these expected frequencies E_{ij} reveals potential discrepancies that may indicate dependence between variables; substantial deviations suggest the data do not align with the independence model. For valid use in approximate statistical tests, such as the chi-squared test of independence, the expected frequencies should generally be at least 5 in at least 80% of the cells, with no expected frequency less than 1, to ensure the reliability of the chi-squared approximation underlying the test. If these conditions are violated (e.g., more than 20% of cells have expected frequencies below 5), alternative exact methods may be required.
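The computation of expected frequencies under independence can be sketched as follows (Python with NumPy; the counts are those of the gender-by-preference example, and the rule-of-thumb check is only illustrative):

```python
import numpy as np

observed = np.array([[30, 70],
                     [40, 60]])

row_totals = observed.sum(axis=1)    # n_i.
col_totals = observed.sum(axis=0)    # n_.j
grand_total = observed.sum()         # n_..

# E_ij = (n_i. * n_.j) / n_..  under independence
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)    # rows of [35. 65.]

# Rule-of-thumb check before relying on the chi-squared approximation.
if (expected < 5).mean() > 0.2 or (expected < 1).any():
    print("Expected counts are too small; consider an exact test.")
```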

Tests of Association and Independence

Chi-Squared Test of Independence

The chi-squared test of independence is a statistical procedure used to determine whether there is a significant association between two categorical variables in a contingency table, by comparing observed frequencies to those expected under the assumption of independence. The test was originally developed by Karl Pearson in 1900 as a goodness-of-fit criterion to assess deviations from expected frequencies in multivariate categorical data. It is particularly applicable to large samples where the approximation to the chi-squared distribution holds reliably. The null hypothesis (H_0) states that the two variables are independent, meaning the observed frequencies should align closely with expected frequencies derived from the marginal totals. The alternative hypothesis (H_a) posits that the variables are associated, implying a significant deviation in the observed frequencies. The test statistic is calculated as \chi^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, where O_{ij} is the observed frequency in row i and column j, E_{ij} is the corresponding expected frequency, r is the number of rows, and c is the number of columns. Under the null hypothesis, this statistic follows a chi-squared distribution with (r-1)(c-1) degrees of freedom. The p-value is obtained by comparing the computed \chi^2 to the critical value from the chi-squared table for the given degrees of freedom and significance level (e.g., \alpha = 0.05), or more commonly, by using statistical software to derive it directly. Key assumptions include that the data consist of frequencies from a random sample, the observations are independent, and the expected frequencies are sufficiently large—typically at least 5 in at least 80% of the cells, with no expected frequency less than 1—to ensure the validity of the chi-squared approximation. Violations of these assumptions, particularly small expected frequencies, can lead to inflated Type I error rates. For 2×2 contingency tables, Yates' continuity correction is often applied to improve the approximation by adjusting the test statistic: subtract 0.5 from |O_{ij} - E_{ij}| in each term of the summation, an adjustment introduced by Frank Yates in 1934 to account for the discrete nature of the data. This correction is particularly recommended when sample sizes are moderate but expected frequencies are borderline. To perform the test, first compute the expected frequencies for each cell as E_{ij} = \frac{(row_i\ total) \times (column_j\ total)}{grand\ total}, then calculate the \chi^2 statistic (with or without Yates' correction as appropriate), determine the degrees of freedom, and evaluate the p-value. If the p-value is less than the chosen significance level, the null hypothesis is rejected, indicating evidence of an association between the variables. Modern statistical software, such as R or SAS, automates these computations and provides options for corrections and exact alternatives when assumptions are not met.
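In practice the full procedure amounts to a few lines of code. A minimal sketch using SciPy (the table is the hypothetical example above, and the 0.05 threshold is simply the conventional choice):

```python
from scipy.stats import chi2_contingency

observed = [[30, 70],
            [40, 60]]

# Pearson chi-squared test of independence; correction=True applies
# Yates' continuity correction, relevant for 2x2 tables.
chi2, p_value, dof, expected = chi2_contingency(observed, correction=True)

print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject H0: evidence of an association.")
else:
    print("Fail to reject H0: no significant association detected.")
```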

Fisher's Exact Test

Fisher's exact test is a statistical procedure used to assess whether there is a significant association between two categorical variables in a 2×2 contingency table, particularly when sample sizes are small and the assumptions of asymptotic approximations may not hold. The test conditions on the observed marginal totals (row and column sums) and evaluates the exact probability of obtaining the observed table or one more extreme under the null hypothesis of independence, assuming the margins are fixed. This conditional approach leads to a hypergeometric distribution for the cell frequencies, providing an exact rather than approximate p-value. Under the null hypothesis, the distribution of the table entries follows a hypergeometric distribution because the fixed margins imply that the allocation of observations to cells is like sampling without replacement from a finite population. For a table with cell counts a, b, c, d, row totals n_{1.} and n_{2.}, column totals n_{.1} and n_{.2}, and grand total n_{..}, the exact probability of the observed table is given by: P = \frac{ n_{1.}! \, n_{2.}! \, n_{.1}! \, n_{.2}! }{ n_{..}! \, a! \, b! \, c! \, d! } To compute the p-value, all possible 2×2 tables with the same fixed marginal totals are enumerated, each assigned its hypergeometric probability, and the sum of probabilities for tables as extreme as or more extreme than the observed one (in terms of deviation from independence) is calculated. This one- or two-tailed summation ensures the test's exactness, avoiding reliance on large-sample approximations like the chi-squared test, which is suitable only for larger samples. The primary advantage of Fisher's exact test is its provision of precise p-values without approximation errors, making it ideal for small samples where expected frequencies may be low (e.g., less than 5). However, it becomes computationally intensive for large sample sizes due to the need to enumerate numerous tables, and for tables larger than 2×2, an extension known as the Freeman-Halton test can be applied, though it is used sparingly owing to even greater computational demands.
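A brief sketch (Python with SciPy; the cell counts are arbitrary small values chosen to mimic a sparse table) shows both the packaged test and the hypergeometric probability of a single table:

```python
from math import comb
from scipy.stats import fisher_exact

table = [[3, 1],
         [1, 3]]    # small counts where the chi-squared approximation is shaky

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"two-sided p = {p_value:.4f}")

# Probability of this single table with all margins fixed (hypergeometric);
# equivalent to the factorial formula given above.
a, b, c, d = table[0][0], table[0][1], table[1][0], table[1][1]
n = a + b + c + d
p_table = comb(a + b, a) * comb(c + d, c) / comb(n, a + c)
print(f"P(observed table) = {p_table:.4f}")
```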

Likelihood-Ratio Test

The likelihood-ratio test for assessing independence in contingency tables compares the maximum likelihood under the saturated model, which estimates a separate probability for each cell and thus fits the observed data perfectly, to the maximum likelihood under the restricted independence model, which assumes cell probabilities are the product of row and column marginal probabilities and typically employs a multinomial or Poisson likelihood for the cell counts. This comparison evaluates whether the independence assumption adequately explains the observed frequencies. The approach was originally developed by Samuel S. Wilks in 1935 as a likelihood-based method for hypothesis testing in contingency tables. The test statistic, known as the deviance or G², is given by G^2 = 2 \sum_{i=1}^r \sum_{j=1}^c O_{ij} \ln \left( \frac{O_{ij}}{E_{ij}} \right), where O_{ij} denotes the observed frequency in row i and column j, and E_{ij} is the expected frequency under independence, computed as the product of the row and column marginal totals divided by the grand total. Under the null hypothesis of no association between the row and column variables, G^2 follows an asymptotic chi-squared distribution with (r-1)(c-1) degrees of freedom, where r and c are the numbers of rows and columns. To conduct the test, one computes G^2 and compares it to the critical value from the chi-squared distribution at the desired significance level; rejection of the null occurs if G^2 exceeds this value or if the associated p-value is below the threshold, providing evidence of dependence. Relative to the Pearson chi-squared test, the likelihood-ratio test is often favored in scenarios with sparse data or small expected cell frequencies, offering greater statistical power in some settings and a more reliable asymptotic approximation without the strict requirement that all expected values exceed 5. Additionally, its foundation in likelihood principles makes it especially suitable for evaluating hierarchical models in log-linear analysis, where the additivity of G^2 across nested sub-models facilitates sequential model comparisons and goodness-of-fit assessments. While the two tests have comparable performance in large samples, the likelihood-ratio approach is often preferred for its theoretical robustness in complex categorical data structures.
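A short sketch of the computation (Python with NumPy and SciPy; the counts are the running example) evaluates G² and its p-value directly from the formula:

```python
import numpy as np
from scipy.stats import chi2

observed = np.array([[30, 70],
                     [40, 60]])

row = observed.sum(axis=1, keepdims=True)
col = observed.sum(axis=0, keepdims=True)
expected = row @ col / observed.sum()          # E_ij under independence

# G^2 = 2 * sum O_ij * ln(O_ij / E_ij); cells with O_ij = 0 contribute nothing.
mask = observed > 0
g2 = 2.0 * np.sum(observed[mask] * np.log(observed[mask] / expected[mask]))

dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(g2, dof)
print(f"G^2 = {g2:.3f}, df = {dof}, p = {p_value:.3f}")

# The same statistic is also available via
# chi2_contingency(observed, correction=False, lambda_="log-likelihood").
```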

Measures of Strength of Association

Odds Ratio

In a 2×2 contingency table, the odds ratio (OR) quantifies the association between two binary variables, such as exposure and disease outcome, as the ratio of the odds of the outcome occurring in one group relative to the other. It is calculated as \text{OR} = \frac{a/b}{c/d} = \frac{ad}{bc}, where a, b, c, and d denote the cell counts in the table, with a and b representing the exposed group (with and without the outcome, respectively) and c and d the unexposed group. An OR greater than 1 suggests a positive association, indicating higher odds of the outcome in the exposed group; an OR of 1 implies no association; and an OR less than 1 indicates a negative association. In epidemiology, the OR serves as a key measure in case-control studies to estimate the strength of exposure-outcome relationships, and when the outcome is rare (prevalence ≤10%), it approximates the relative risk. The 95% confidence interval for the OR is typically computed using Woolf's logit method, where the interval for \ln(\text{OR}) is \ln(\text{OR}) \pm 1.96 \times \text{SE}, with \text{SE} = \sqrt{1/a + 1/b + 1/c + 1/d}, and then exponentiated to obtain the CI for the OR; if the interval includes 1, the association is not statistically significant at the 5% level. A notable property of the OR is its invariance to whether row or column marginal totals are fixed, allowing consistent estimation in both prospective (cohort) and retrospective (case-control) designs without alteration of the measure's value. The OR is undefined when any cell count is zero, as this leads to division by zero or a zero numerator, necessitating adjustments such as adding 0.5 to all cells (the Haldane-Anscombe correction) before computation. For stratified 2×2 tables, the Mantel-Haenszel estimator computes a weighted summary OR across strata to adjust for confounding variables.
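The point estimate and Woolf interval follow directly from the formulas above; a minimal sketch in Python (the cell counts are hypothetical):

```python
import numpy as np

# 2x2 table: rows = exposed / unexposed, columns = outcome yes / no.
a, b = 20, 80      # exposed:   with outcome, without outcome
c, d = 10, 90      # unexposed: with outcome, without outcome

or_hat = (a * d) / (b * c)

# Woolf's logit interval: ln(OR) +/- 1.96 * sqrt(1/a + 1/b + 1/c + 1/d)
se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)
log_or = np.log(or_hat)
ci_low = np.exp(log_or - 1.96 * se_log_or)
ci_high = np.exp(log_or + 1.96 * se_log_or)

print(f"OR = {or_hat:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
# With a zero cell, a common fix is the Haldane-Anscombe correction:
# add 0.5 to every cell before computing the OR and its interval.
```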

Phi Coefficient

The phi coefficient, denoted \phi, is a normalized measure of association specifically designed for two dichotomous variables arranged in a 2×2 contingency table. It quantifies the degree of linear dependence between the variables and is computed as \phi = \sqrt{\frac{\chi^2}{n}}, where \chi^2 is the Pearson chi-squared test statistic for independence in the table and n is the grand total of observations across all cells. This formulation was introduced by Karl Pearson in 1904 as the coefficient of mean square contingency, providing a correlation-like metric for binary data analogous to the Pearson product-moment correlation coefficient. The value of \phi ranges from -1 to +1, where 0 indicates no association, positive values signify a positive linear relationship, and negative values indicate a negative one. Equivalently, \phi can be calculated directly from the deviations in cell frequencies without first computing \chi^2, using the formula \phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} for a 2×2 table with counts a, b, c, and d (where rows and columns represent the levels of the two variables). This direct approach highlights its roots as the Pearson product-moment correlation applied to binary-coded data (e.g., 0/1 scoring), making it suitable for assessing proportional reduction in prediction error or the extent of linear covariation between the variables. For instance, in a study examining the association between two diagnostic tests (positive/negative outcomes), a \phi value of 0.35 would indicate a moderate positive association, meaning about 12% of the variance in one test's results is linearly explained by the other (since \phi^2 = 0.1225). Under the assumption of an underlying bivariate normal distribution for the latent continuous variables behind the dichotomization (with equal thresholds), the phi coefficient approximates the tetrachoric correlation coefficient, which estimates the correlation of the unobserved continuous traits; this equivalence holds particularly when marginal proportions are balanced (e.g., 0.5). It is most appropriately used when both variables are naturally dichotomous, such as gender and a yes/no response, rather than artificially dichotomized from continuous measurements, as the latter can attenuate the correlation. Despite its utility, the phi coefficient has key limitations: it is defined exclusively for 2×2 tables and cannot be straightforwardly extended to larger structures without modification. Additionally, its magnitude is sensitive to the marginal distributions of the variables; the maximum attainable value depends on the marginal proportions p and q, and perfect association (\phi = 1) is achievable only when the row and column marginal distributions match, leading to potential underestimation of association strength in imbalanced tables. George Udny Yule further elaborated on these properties in 1912, emphasizing the need for caution in interpreting \phi when marginals deviate from uniformity.
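Both routes to \phi can be checked against each other in a few lines (a Python sketch with NumPy; the counts reuse the earlier 2×2 example):

```python
import numpy as np

# 2x2 table with cells a, b (first row) and c, d (second row).
a, b, c, d = 30, 70, 40, 60

# Direct formula (carries the sign of the association).
phi = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(f"phi = {phi:.3f}")

# Via the chi-squared statistic (without Yates' correction); this
# recovers only the magnitude |phi| = sqrt(chi2 / n).
observed = np.array([[a, b], [c, d]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(f"sqrt(chi2/n) = {np.sqrt(chi2_stat / observed.sum()):.3f}")
```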

Cramér's V and Contingency Coefficient

Cramér's V is a measure of the strength of association between two nominal variables in an r \times c contingency table, generalizing the phi coefficient to tables larger than 2×2. It is calculated as V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1, c-1)}} where \chi^2 is the Pearson chi-squared statistic, n is the total sample size, and \min(r-1, c-1) is the smaller of the degrees of freedom across rows and columns. This yields values ranging from 0 (indicating no association) to 1 (indicating perfect association), providing a standardized index that adjusts for table dimensions. The contingency coefficient, denoted C, is an earlier chi-squared-based measure of association introduced by Karl Pearson for r \times c tables. It is computed as C = \sqrt{\frac{\chi^2}{n + \chi^2}} with values between 0 and less than 1; the upper bound depends on table size, approaching 1 asymptotically as \chi^2 increases but limited by \sqrt{(k-1)/k} where k = \min(r, c). Like Cramér's V, it derives from the chi-squared statistic but normalizes differently, making it simpler to compute yet more sensitive to table structure. Both measures quantify nominal association symmetrically and are derived from the chi-squared statistic, but they differ in interpretability and comparability. Cramér's V is generally preferred because its maximum value of 1 remains consistent across table sizes, facilitating comparisons, whereas the contingency coefficient's variable upper bound introduces bias and reduces cross-study utility. For interpretation, Cramér's V values are often gauged using conventions: approximately 0.10 for weak association, 0.30 for moderate, and 0.50 for strong, though these thresholds approximate behavioral science standards and may vary by context. The contingency coefficient lacks such standardized guidelines due to its non-fixed range, limiting its standalone use.
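Both indices derive from the same chi-squared statistic, as the following sketch shows (Python with SciPy; the 3×3 counts are hypothetical):

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[25, 15, 10],
                     [10, 30, 10],
                     [ 5, 10, 35]])

chi2_stat, _, _, _ = chi2_contingency(observed, correction=False)
n = observed.sum()
r, c = observed.shape

cramers_v = np.sqrt(chi2_stat / (n * min(r - 1, c - 1)))
contingency_c = np.sqrt(chi2_stat / (n + chi2_stat))

print(f"Cramer's V = {cramers_v:.3f}")
print(f"Contingency coefficient C = {contingency_c:.3f}")
```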

Other Measures

The Goodman-Kruskal lambda (λ) is an asymmetric measure of association for nominal variables in contingency tables, quantifying the proportional reduction in error when predicting one variable from the other. It is defined as λ = (E₁ - E₂) / E₁, where E₁ represents the error in predicting the row variable without knowledge of the column variable, and E₂ is the error with such knowledge; values range from 0 (no reduction) to 1 (perfect prediction). Lambda is particularly useful for directed relationships, such as predicting a dependent variable from an independent variable, and exists in two versions: λ_A treating rows as the dependent variable and λ_B treating columns as the dependent variable. The uncertainty coefficient, also known as Theil's U, is an asymmetric entropy-based measure of association that assesses the reduction in uncertainty about one variable given knowledge of another. It is computed as U = \frac{H(\text{row}) - H(\text{row}|\text{col})}{H(\text{row})}, where H denotes Shannon entropy and H(\text{row}|\text{col}) is the conditional entropy; the coefficient ranges from 0 (no association) to 1 (complete predictability). This measure draws from information theory and is suitable for multi-category variables, providing insight into information gain beyond symmetric alternatives. The tetrachoric correlation coefficient estimates the correlation between two theoretically continuous variables observed as binary outcomes in a 2×2 contingency table, assuming an underlying bivariate normal distribution. Introduced by Karl Pearson, it is typically estimated via maximum likelihood methods that maximize the likelihood of the observed frequencies under the normal assumption, though this process can be computationally intensive due to numerical integration requirements. The coefficient ranges from -1 to 1, offering a Pearson-like correlation for dichotomous data when latent continuity is hypothesized, such as in psychological or genetic traits. Other notable measures include Cohen's kappa (κ), which evaluates inter-rater agreement for categorical items beyond chance, ranging from -1 to 1 with 0 indicating agreement no better than random. Yule's Q, a variant related to the odds ratio for 2×2 tables, measures association strength from -1 (perfect negative) to 1 (perfect positive) and simplifies interpretation in binary settings.
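As an illustration of the proportional-reduction-in-error logic behind lambda, the sketch below (Python with NumPy; counts from the earlier 2×2 example, with rows treated as the dependent variable) computes λ directly from its definition:

```python
import numpy as np

observed = np.array([[30, 70],
                     [40, 60]])    # rows = variable being predicted

n = observed.sum()

# E1: errors when predicting the row category with no other information
#     (always guess the modal row category).
e1 = n - observed.sum(axis=1).max()

# E2: errors when the column category is known
#     (within each column, guess that column's modal row category).
e2 = n - observed.max(axis=0).sum()

lambda_rows = (e1 - e2) / e1
print(f"Goodman-Kruskal lambda (rows dependent) = {lambda_rows:.3f}")
```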