McNemar's test
McNemar's test is a non-parametric statistical procedure used to determine whether there are significant differences in the marginal proportions of a dichotomous outcome between two related samples, such as in paired or matched designs where the same subjects are measured under two conditions.[1] It evaluates marginal homogeneity in a 2×2 contingency table by focusing exclusively on the discordant pairs—those where the outcomes differ between the two measurements—while ignoring concordant pairs where outcomes match.[2]
Developed by psychologist Quinn McNemar, the test was first described in 1947 as a method to address the sampling error in the difference between correlated proportions, building on earlier work in psychometrics and providing a chi-squared-based approach for dependent categorical data.[1] Unlike the Pearson chi-squared test for independence, which assumes unrelated samples, McNemar's test accounts for the paired nature of the data to avoid inflated Type I error rates.[2]
The test statistic is computed as \chi^2 = \frac{(b - c)^2}{b + c}, where b and c represent the counts in the off-diagonal cells of the contingency table (discordant pairs), and it follows a chi-squared distribution with 1 degree of freedom under the null hypothesis of no difference in marginal proportions.[3] For small sample sizes (typically fewer than 20 discordant pairs), a continuity correction \chi^2 = \frac{(|b - c| - 1)^2}{b + c} or an exact binomial test is preferred to improve accuracy.[4]
Key assumptions include paired observations with exactly two dichotomous outcomes per pair and random sampling from the population, making it suitable for scenarios like pre- and post-treatment assessments in clinical trials or evaluating attitude changes in surveys.[3] Common misapplications involve using it for unpaired data or ignoring the dependency structure, which can lead to invalid inferences; extensions like the Stuart-Maxwell test handle multi-category outcomes.[2]
Overview
Definition
McNemar's test is a non-parametric statistical procedure designed to evaluate marginal homogeneity in paired binary data, determining whether there is a significant difference between the proportions observed in two related samples.[5] It specifically addresses scenarios where the same subjects or matched pairs are assessed twice on a dichotomous outcome, such as before and after an intervention, to detect changes in prevalence without assuming a normal distribution.[6] Developed to handle correlated proportions, the test focuses on the symmetry of responses across the two measurements rather than independent groups.[1]
The test relies on a 2×2 contingency table that captures paired observations, with rows and columns representing the two binary categories (e.g., yes/no or success/failure) for the initial and follow-up assessments. The table's cells denote: a for concordant pairs where both responses are positive (yes-yes), b for discordant pairs shifting from negative to positive (no-yes), c for discordant pairs shifting from positive to negative (yes-no), and d for concordant pairs where both are negative (no-no). Emphasis is placed on the off-diagonal discordant cells (b and c), as these reflect changes or inconsistencies between the paired measures, while concordant cells (a and d) do not contribute to the assessment of difference.[7][8]
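Using the cell definitions above, the 2×2 table can be tallied directly from paired responses; a minimal sketch with hypothetical yes/no data:

```python
from collections import Counter

# Hypothetical paired yes/no responses for the same subjects at two time points.
before = ["yes", "yes", "no", "no", "yes", "no", "yes", "no"]
after = ["yes", "no", "yes", "no", "yes", "yes", "no", "no"]

counts = Counter(zip(before, after))
a = counts[("yes", "yes")]  # concordant: positive at both assessments
b = counts[("no", "yes")]   # discordant: negative -> positive
c = counts[("yes", "no")]   # discordant: positive -> negative
d = counts[("no", "no")]    # concordant: negative at both assessments
print(a, b, c, d)
```

Only b and c enter the test statistic; a and d are tallied for completeness.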
This method finds broad application across disciplines involving repeated measures on binary traits. In medicine, it evaluates pre- and post-treatment outcomes, such as symptom presence before and after therapy.[9] In psychology, it analyzes shifts in attitudes or behaviors, like opinion changes in survey responses over time.[1] In epidemiology, it supports matched case-control designs to compare exposure risks between paired subjects.[5]
Historical Background
McNemar's test was developed by Quinn McNemar, an American psychologist and statistician born in 1900, who served as a professor of psychology, statistics, and education at Stanford University.[10] McNemar made significant contributions to psychological measurement, including revising the Stanford-Binet intelligence scale in 1942 and authoring the influential textbook Psychological Statistics in 1949, which became a standard reference in the field.[11][12] He also held leadership roles, such as president of the Psychometric Society from 1950 to 1951, underscoring his impact on quantitative methods in psychology.[11]
The test was first described in McNemar's 1947 paper titled "Note on the sampling error of the difference between correlated proportions or percentages," published in Psychometrika, a journal dedicated to psychometric methods.[13] This work addressed the need in psychometrics and experimental psychology for a reliable approach to analyzing paired categorical data, particularly when assessing changes in dichotomous responses within the same subjects, such as before-and-after measurements in psychological experiments.[13] It built upon foundational statistical techniques for contingency tables, including Karl Pearson's chi-square test introduced in 1900, by adapting them specifically for correlated proportions to account for the dependency in paired observations.[12]
Following its introduction, McNemar's test saw adoption in biostatistics starting in the post-1950s era, where it proved valuable for evaluating paired binary outcomes in fields like clinical research and epidemiology.[5] Early refinements included the proposal of a continuity correction by A. L. Edwards in 1948 to improve the chi-square approximation for small samples, with further adjustments explored in the 1960s and 1970s, such as those by William G. Cochran and Joseph L. Fleiss.[14] No major theoretical updates have occurred since, but by the 1980s, the test was integrated into widely used statistical software packages, facilitating its standardization as a tool for paired homogeneity testing.[15]
McNemar's broader influence lies in standardizing statistical practices for paired designs in psychological measurement, where his test complemented his work on reliability and validity in IQ testing and opinion-attitude surveys, helping to bridge psychometrics with broader statistical applications.[16][12]
Methodology
Assumptions and Hypotheses
McNemar's test requires paired binary observations collected on the same subjects, forming matched pairs where each subject provides two dichotomous responses, such as before-and-after measurements or responses to two related questions.[17] The data must be at the nominal scale, with outcomes categorized into two mutually exclusive groups, like "yes/no" or "success/failure," without assuming any ordinal or interval properties.[5] Independence is assumed between different pairs, ensuring that observations from one subject do not influence those from another, while dependence within each pair is explicitly expected and accounted for by the test's design, which focuses on the correlation induced by matching.[17]
Unlike the standard chi-squared test for independent samples, McNemar's test does not require or assume equal marginal probabilities across the two measurements prior to analysis; instead, it directly evaluates whether such equality holds under the null hypothesis.[5] It is particularly suited for handling correlated data, making it appropriate for scenarios where the pairing introduces intra-subject dependency, such as in pre-post intervention studies. Additionally, a sufficient sample size is necessary for the validity of the asymptotic chi-squared approximation, typically requiring more than 20 discordant pairs (where the two responses differ) to ensure reliable inference, though exact tests can be used for smaller samples.[5]
The null hypothesis (H₀) posits marginal homogeneity, meaning the probability of a "positive" outcome (or "yes") in the first measurement equals that in the second (p₁ = p₂), implying no systematic change or difference in the marginal proportions between the paired conditions.[1] This can also be interpreted as the absence of any net shift in the population response distribution across the two time points or treatments. The alternative hypothesis (Hₐ) states marginal inhomogeneity (p₁ ≠ p₂), indicating a significant difference in the marginal probabilities; this is often tested in a two-sided manner but can be one-sided to detect directional changes, such as an increase (p₁ < p₂) or decrease (p₁ > p₂) in the proportion.[5]
Regarding sampling considerations, McNemar's test does not assume fixed marginal totals, allowing for variability in the overall number of "yes" or "no" responses across the sample; the inference is conditional on the observed discordant pairs, which capture the relevant variability for testing the hypotheses.[1] This conditional approach ensures the test's robustness to the concordant pairs (where responses match), focusing solely on the off-diagonal elements of the 2×2 contingency table that inform about changes.[17]
Test Statistic and Procedure
McNemar's test procedure involves constructing a 2×2 contingency table from paired dichotomous observations, with rows representing outcomes from the first measurement (e.g., "no" and "yes") and columns from the second measurement. The cell counts are denoted as follows: a for pairs where both are "no," b for changes from "no" to "yes," c for changes from "yes" to "no," and d for pairs where both are "yes." Only the discordant pairs (b and c) contribute to the test, as the concordant cells (a and d) do not inform on changes in marginal proportions.
The test statistic is computed as
\chi^2 = \frac{(b - c)^2}{b + c},
which follows a chi-squared distribution with 1 degree of freedom under the null hypothesis of marginal homogeneity for large samples (typically when b + c \geq 20). This single degree of freedom arises because the test estimates one parameter: the difference between the marginal proportions. The p-value is the upper-tail probability of the chi-squared distribution with 1 df evaluated at \chi^2; the null hypothesis is rejected if the p-value is less than the significance level \alpha (commonly 0.05).
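In code, the statistic and its upper-tail p-value can be computed directly; a sketch with illustrative discordant counts, assuming scipy is available:

```python
from scipy.stats import chi2

b, c = 25, 10                    # illustrative discordant counts
stat = (b - c) ** 2 / (b + c)    # McNemar chi-squared statistic
p_value = chi2.sf(stat, df=1)    # upper-tail probability with 1 df
print(stat, p_value)             # reject H0 at alpha = 0.05 if p_value < alpha
```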
For small samples where b + c < 20, the asymptotic approximation may be unreliable, so an exact alternative uses a binomial test on the discordant pairs.[14] Under the null, the number of "no-to-yes" changes (b) follows a binomial distribution with parameters n = b + c and p = 0.5; the two-sided p-value is 2\,\Pr(X \leq \min(b, c)) for X \sim \mathrm{Binomial}(n, 0.5), capped at 1.[14]
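The exact version can be sketched with scipy's binomial distribution; the counts here are illustrative:

```python
from scipy.stats import binom

b, c = 7, 2                                   # illustrative discordant counts
n, k = b + c, min(b, c)
# Two-sided exact p-value: double the smaller binomial tail, capped at 1.
p_value = min(1.0, 2 * binom.cdf(k, n, 0.5))
print(p_value)
```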
The step-by-step application proceeds as follows: first, collect paired dichotomous data from the same subjects across two conditions; second, tabulate the data into the 2×2 table and extract b and c; third, compute the test statistic (chi-squared for large samples or exact binomial for small); finally, determine the p-value and compare to \alpha to decide whether to reject the null hypothesis of no systematic change in proportions.[14]
Variations
One common modification to the standard McNemar's test is the application of Yates' continuity correction, which adjusts the chi-squared statistic to better approximate the discrete binomial distribution when the number of discordant pairs is small. The corrected statistic is given by
\chi^2 = \frac{(|b - c| - 1)^2}{b + c},
where b and c represent the off-diagonal counts in the 2×2 contingency table. This correction, originally proposed by Yates for chi-squared tests in the 1930s and adapted to McNemar's test in subsequent applications, helps reduce the inflation of type I error rates for small samples where b + c < 20.
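The corrected and uncorrected statistics can be compared side by side; the counts are illustrative:

```python
from scipy.stats import chi2

b, c = 8, 3                                   # illustrative discordant counts
uncorrected = (b - c) ** 2 / (b + c)          # 25/11
corrected = (abs(b - c) - 1) ** 2 / (b + c)   # 16/11, continuity-corrected
# The correction shrinks the statistic, yielding a larger (more conservative) p-value.
print(chi2.sf(uncorrected, df=1), chi2.sf(corrected, df=1))
```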
For scenarios with very small numbers of discordant pairs, particularly when b + c < 10, the chi-squared approximation becomes unreliable, leading to the use of the exact McNemar's test. This exact version treats the number of discordant pairs in one direction (e.g., b) as following a binomial distribution with parameters n = b + c and p = 0.5 under the null hypothesis of marginal homogeneity, computing the p-value directly from the binomial cumulative distribution function. Mid-p adjustments, which subtract half the point probability of the observed count from the exact p-value, can further improve performance by reducing conservativeness. This approach avoids asymptotic approximations entirely and is recommended for precise inference in small samples.[18]
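A sketch of the exact test with a mid-p adjustment, under the convention of subtracting half the point probability of the observed count (counts illustrative):

```python
from scipy.stats import binom

b, c = 8, 2                                   # illustrative discordant counts
n, k = b + c, min(b, c)
exact_p = min(1.0, 2 * binom.cdf(k, n, 0.5))
# Mid-p: count only half of the observed outcome's probability in the tail.
mid_p = exact_p - binom.pmf(k, n, 0.5)
print(exact_p, mid_p)
```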
The standard McNemar's test is limited to binary outcomes, but for multi-category paired data in k \times k tables (k > 2), the Stuart-Maxwell test provides a generalization to test marginal homogeneity. This extension computes a test statistic as a quadratic form involving the vector of marginal differences and the estimated covariance matrix of those differences, following a chi-squared distribution with k-1 degrees of freedom under the null. Originally developed by Stuart in 1955 and refined by Maxwell in 1970, it extends the focus on off-diagonal elements to multiple categories while maintaining the paired structure.
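A numpy sketch of the Stuart-Maxwell statistic under this formulation, using a hypothetical 3×3 paired table; the covariance entries follow the standard estimates S_{ii} = n_{i+} + n_{+i} - 2n_{ii} and S_{ij} = -(n_{ij} + n_{ji}):

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical 3x3 paired table: rows = first rating, columns = second rating.
N = np.array([[20,  5,  3],
              [ 6, 30,  4],
              [ 2,  8, 25]], dtype=float)
k = N.shape[0]
d = (N.sum(axis=1) - N.sum(axis=0))[:k - 1]   # marginal differences (first k-1)
S = np.zeros((k - 1, k - 1))
for i in range(k - 1):
    for j in range(k - 1):
        if i == j:
            S[i, j] = N[i].sum() + N[:, i].sum() - 2 * N[i, i]
        else:
            S[i, j] = -(N[i, j] + N[j, i])
stat = float(d @ np.linalg.solve(S, d))       # quadratic form d' S^{-1} d
p_value = chi2.sf(stat, df=k - 1)             # chi-squared with k-1 df
print(stat, p_value)
```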
One-sided versions of McNemar's test are employed when there is a directional hypothesis, such as testing for an increase in one category (e.g., b > c). In these cases, the p-value is derived from the one-tailed binomial cumulative probability, Pr(X \geq b | n = b + c, p = 0.5), providing greater power for detecting asymmetries in a specified direction compared to the two-sided test. This variant is particularly useful in applications like diagnostic test comparisons where superiority in one direction is anticipated.[18]
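The one-sided p-value is a single binomial tail; a sketch with illustrative counts where an excess of b over c is hypothesized:

```python
from scipy.stats import binom

b, c = 9, 3                        # illustrative discordant counts
n = b + c
# One-sided p-value: Pr(X >= b) under Binomial(n, 0.5).
p_one_sided = binom.sf(b - 1, n, 0.5)
print(p_one_sided)
```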
For ordinal paired data, weighted variants of McNemar's test incorporate weights based on the magnitude of discordance between categories, enhancing sensitivity to ordered differences. These weights, often linear functions of the distance between category scores, modify the test statistic to account for the ordinal nature, as in extensions for multinomial responses where symmetry is weighted by response category distances. Such approaches are applicable in contexts like item response theory models, including Rasch models, to analyze ordered ratings while preserving the paired design.[19]
Applications
Worked Examples
To illustrate the application of McNemar's test, consider a hypothetical medical study evaluating a smoking cessation program in 100 patients, where smoking status is assessed before and after treatment as a binary outcome (smoker or non-smoker).
The resulting 2×2 contingency table for paired observations is as follows:
|  | Post: Smoker | Post: Non-smoker |
|---|---|---|
| Pre: Smoker | a = 40 | b = 20 |
| Pre: Non-smoker | c = 10 | d = 30 |
Here, b = 20 represents patients who were smokers before but non-smokers after (positive changes), while c = 10 represents those who were non-smokers before but smokers after (negative changes). The test statistic is computed as \chi^2 = \frac{(b - c)^2}{b + c} = \frac{(20 - 10)^2}{30} \approx 3.33, with 1 degree of freedom, yielding a p-value of 0.068. This indicates no statistically significant change in smoking status at the 0.05 level, though the direction suggests a modest net reduction in smoking (more positive than negative changes).
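The calculation can be checked numerically, assuming scipy is available:

```python
from scipy.stats import chi2

b, c = 20, 10                    # discordant counts from the table above
stat = (b - c) ** 2 / (b + c)
p_value = chi2.sf(stat, df=1)
print(round(stat, 2), round(p_value, 3))  # 3.33 0.068
```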
In a psychological context, McNemar's test can assess attitude shifts on a dichotomous scale, such as agreement or disagreement with a policy statement, measured before and after an informational exposure in 50 subjects.
The contingency table is:
|  | Post: Agree | Post: Disagree |
|---|---|---|
| Pre: Agree | a = 15 | b = 12 |
| Pre: Disagree | c = 5 | d = 18 |
With b = 12 (shift from agree to disagree) and c = 5 (shift from disagree to agree), the test statistic is \chi^2 = \frac{(12 - 5)^2}{17} \approx 2.88, p ≈ 0.090. The non-significant result suggests no overall change in attitudes, despite a slight net shift toward disagreement. In such paired designs, b and c capture the direction and magnitude of discordant responses, enabling evaluation of whether one outcome predominates over the other.
A useful measure of effect size in these examples is the standardized difference in discordant proportions, \frac{b - c}{b + c}, which quantifies the net directional change relative to total discordance; for the medical case, this is \frac{10}{30} \approx 0.33, indicating a moderate effect. For the psychological case, it is \frac{7}{17} \approx 0.41.
For small samples where the chi-square approximation may be unreliable (e.g., b + c < 25), an exact binomial test is preferred, treating the discordant pairs as binomial trials under the null of equal discordance (p = 0.5). Consider a hypothetical dataset with b + c = 8 discordant pairs, say b = 6 and c = 2; the two-sided p-value is p = 2 \times \sum_{k=0}^{2} \binom{8}{k} (0.5)^8 \approx 0.29, indicating non-significance. This approach maintains validity for paired binary data under the test's assumptions of independence across pairs.
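The exact p-value above can be reproduced directly:

```python
from scipy.stats import binom

b, c = 6, 2                                   # discordant counts from the example
n, k = b + c, min(b, c)
p_value = min(1.0, 2 * binom.cdf(k, n, 0.5))  # two-sided exact p-value
print(round(p_value, 2))  # 0.29
```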
Computational Implementation
McNemar's test can be implemented in Python using the statsmodels library, which provides the mcnemar function for 2×2 contingency tables. Import the function and pass a 2×2 table as a list of lists, where the off-diagonal elements represent the discordant pairs; by default the function performs the exact binomial test, while exact=False requests the chi-squared approximation (with a continuity correction unless correction=False). The returned object exposes the test statistic and p-value.[20]

```python
from statsmodels.stats.contingency_tables import mcnemar

table = [[a, b], [c, d]]  # a and d are concordant; b and c are discordant
result = mcnemar(table, exact=False)  # chi-squared approximation
print(result.statistic, result.pvalue)
```

Here, a, b, c, and d are the cell counts from the contingency table.[20]
In R, the base stats package includes the mcnemar.test function, which tests for symmetry in a 2x2 contingency table and outputs the chi-squared statistic, degrees of freedom, and p-value, with an optional continuity correction applied by default.[21]
```r
mcnemar.test(matrix(c(a, b, c, d), 2, 2))
```
The output includes interpretation details such as the chi-squared value and associated p-value, indicating whether to reject the null hypothesis of marginal homogeneity.[21]
For other software, SAS uses PROC FREQ with the TABLES statement and AGREE option to compute McNemar's test on paired categorical data, producing the test statistic and p-value alongside agreement measures.[22] The syntax is:
```sas
PROC FREQ DATA=dataset;
  TABLES var1*var2 / AGREE;
RUN;
```
In SPSS, the NPAR TESTS procedure with the MCNEMAR subcommand handles paired nominal variables, generating the chi-squared statistic and exact p-value if specified.[23] The syntax is:
```spss
NPAR TESTS
  /MCNEMAR=var1 WITH var2 (PAIRED)
  /METHOD=EXACT.
```
Variations such as continuity correction are handled in Python's statsmodels by the correction parameter, which is True by default when exact=False and can be disabled with correction=False; the correction subtracts 1 from the absolute difference of the discordant counts, matching the corrected statistic \chi^2 = \frac{(|b - c| - 1)^2}{b + c} and improving the approximation for small samples.[20] For the exact test, use scipy.stats.binomtest on the discordant counts (e.g., testing whether the probability of one discordant type equals 0.5 under the null), which provides a binomial-based p-value without approximation.[24]
```python
from scipy.stats import binomtest

p_value = binomtest(b, n=b + c, p=0.5).pvalue  # b and c are discordant counts
```
This approach is suitable when cell counts are small (e.g., less than 10).
Data input typically involves a 2x2 contingency table for direct use in the test functions, but for raw paired data, preprocess using libraries like pandas to create the table via cross-tabulation before applying the test.[20] For example, in Python:
```python
import pandas as pd

ct = pd.crosstab(before, after)  # before and after are paired binary variables
```
This ensures the input matches the required format, handling concordant and discordant pairs appropriately.
Interpretation and Extensions
Power and Limitations
The statistical power of McNemar's test primarily depends on the effect size, defined as \left| b - c \right| / (b + c), which represents the proportion of discordant pairs favoring one marginal proportion over the other, the total number of discordant pairs b + c, and the chosen significance level \alpha.[25] Power increases with larger effect sizes, more discordant pairs, and lower \alpha, but remains contingent on the within-pair correlation structure.[7] The statistical power of the test is low when the number of discordant pairs is fewer than 25.[5] Simulation-based approaches, such as Monte Carlo enumeration under the multinomial model, provide accurate power estimates when analytical approximations are unreliable, particularly for small to moderate sample sizes.[7]
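Monte Carlo power estimation under the multinomial model can be sketched as follows; the change probabilities and pair count are hypothetical:

```python
import numpy as np
from scipy.stats import chi2

def mcnemar_power(p01, p10, n_pairs, alpha=0.05, n_sim=5000, seed=0):
    """Estimate power of the asymptotic McNemar test by simulation.

    p01 and p10 are the probabilities of the two discordant outcomes per pair;
    the remaining probability mass is concordant and never enters the test.
    """
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sim):
        concordant, b, c = rng.multinomial(n_pairs, [1 - p01 - p10, p01, p10])
        if b + c == 0:
            continue  # no discordant pairs: cannot reject
        stat = (b - c) ** 2 / (b + c)
        if chi2.sf(stat, df=1) < alpha:
            rejections += 1
    return rejections / n_sim

# Hypothetical design: 15% of pairs change in one direction, 5% in the other.
power = mcnemar_power(0.15, 0.05, n_pairs=100)
print(power)
```

Varying n_pairs or the discordance probabilities shows how power depends on the number and imbalance of discordant pairs.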
A key limitation of the test is its low power in small samples or when discordant pairs are nearly balanced (b \approx c), as the effective sample size is reduced to only the discordant counts, potentially leading to failure in detecting meaningful differences.[5] Additionally, by conditioning on discordant pairs, the test discards information from concordant pairs (a and d), which can limit efficiency compared to unconditional alternatives in certain settings.[8] In repeated measures designs, such as before-after studies, the test assumes no carryover effects from the first measurement to the second, which could bias results if violated, as in crossover trials without sufficient washout periods.[26] The test is also sensitive to multiple comparisons, where performing it repeatedly without adjustments (e.g., Bonferroni correction) inflates the family-wise Type I error rate.[27]
Regarding error rates, the asymptotic \chi^2 approximation tends to over-reject the null hypothesis (elevated Type I error) when the total sample size is small or discordant pairs are few (b + c < 25), making the exact binomial test preferable for such cases despite its computational intensity for very large n.[5] This can correspondingly increase Type II errors in low-power scenarios, underscoring the need for adequate planning.[28]
McNemar's test should be avoided for unpaired binary data, where the standard Pearson chi-squared test is more appropriate; for continuous paired outcomes, where the paired t-test or Wilcoxon signed-rank test applies; and in the presence of missing data unless missingness is completely at random (MCAR), as listwise deletion in paired analyses can introduce bias otherwise.[5][8]
Power can be enhanced through larger overall sample sizes to yield more discordant pairs, or by employing one-sided alternatives when a directional hypothesis is justified, which increases power for effects in the hypothesized direction by concentrating the rejection region in one tail.[7] Reporting effect sizes, such as \left| b - c \right| / (b + c), is advisable to aid interpretation of practical significance beyond p-values.[29]
For unpaired binary data from independent samples, the standard Pearson chi-squared test assesses association between two categorical variables, while Fisher's exact test provides an exact alternative suitable for small sample sizes; these differ from McNemar's test by not accounting for the paired structure that controls for individual variability.[6]
When dealing with more than two related binary measurements per subject, such as in repeated measures designs, Cochran's Q test extends the principles of McNemar's test to evaluate overall differences across multiple time points or conditions, serving as a non-parametric analog to repeated measures ANOVA for dichotomous outcomes.[30]
For paired ordinal data, the Wilcoxon signed-rank test is appropriate, as it incorporates the magnitude and direction of differences through ranking, providing greater power than McNemar's nominal-level approach when the ordering of categories conveys additional information.[5]
In the context of marginal homogeneity for multi-way contingency tables from paired categorical data, Bhapkar's test offers a Wald-type alternative to generalized versions of McNemar's test, utilizing the asymptotic normality of the marginal proportions and generally yielding greater power than the Stuart-Maxwell score-type approach for testing equality of marginal distributions under correlated observations.[31]
Contemporary extensions for paired binary outcomes incorporate covariates through logistic mixed-effects models, which account for random effects to model heterogeneity across subjects, or generalized estimating equations (GEE), which focus on population-averaged effects while handling within-subject correlation via working correlation matrices.[32] Bayesian analogs to McNemar's test employ beta-binomial priors to model the dependence in discordant pairs, enabling posterior inference on the difference in marginal probabilities and incorporating prior knowledge for small samples.[33]
McNemar's test is ideal for simple pre-post designs without covariates, emphasizing discordant pairs to test marginal homogeneity; however, for adjusted analyses involving predictors or complex correlations, regression-based methods like GEE or mixed models are preferred to enhance interpretability and control for confounding.[2]