
Log-linear analysis

Log-linear analysis is a statistical technique used to model associations and interactions among categorical variables through the analysis of contingency tables, where the logarithm of the expected cell frequencies is expressed as a linear combination of parameters representing main effects and interactions. This approach treats the observed counts as realizations of a Poisson distribution, enabling the use of generalized linear models to test hypotheses such as independence or conditional independence in multi-way tables. Developed in the 1960s as a multiplicative extension of earlier categorical methods, log-linear analysis provides a flexible framework for hierarchical model building and goodness-of-fit assessment via likelihood ratio tests.

The method's core formulation posits that for a two-way contingency table with dimensions I × J, the log-expected frequency in cell (i, j) follows \log(\mu_{ij}) = \lambda + \lambda^X_i + \lambda^Y_j + \lambda^{XY}_{ij}, where \lambda represents the overall mean, \lambda^X_i and \lambda^Y_j are main effects, and \lambda^{XY}_{ij} captures the association; exponentiating parameters yields odds ratios for interpreting effect sizes. For higher-dimensional tables, additional interaction terms are included hierarchically, allowing examination of complex relationships such as partial associations in three-way designs. Model estimation typically employs maximum likelihood via iterative proportional fitting or Newton-Raphson algorithms, with challenges such as structural zeros addressed through specialized algebraic techniques.

Historically, log-linear models built on foundational work in contingency table analysis from the early 20th century, including Pearson's chi-square test (1900) and Bartlett's maximum likelihood approach (1935), but gained prominence through Birch's 1963 formalization of multiplicative models using u-term expansions. Widely applied in fields such as the social sciences and epidemiology, log-linear analysis can handle sparse data and ordinal variables—though sparse cases pose estimation challenges—and offers an alternative to logistic regression when all variables are treated symmetrically rather than one being designated the response. Its integration with graphical models and extensions to random effects have further enhanced its utility in modern categorical data analysis.

Fundamentals

Definition and Purpose

Log-linear analysis is a statistical technique used to model the relationships among multiple categorical variables through the analysis of multi-dimensional contingency tables. It expresses the logarithm of the expected cell frequencies as a linear combination of parameters that capture main effects and interactions among the variables. Formally, the model is given by \log(\mu_i) = \beta_0 + \sum_j \beta_j x_{ij} + \sum_{j<k} \beta_{jk} x_{ij} x_{ik} + \cdots, where \mu_i denotes the expected frequency in cell i, \beta_0 is the intercept, the \beta_j represent main effects, the \beta_{jk} capture two-way interactions, and higher-order terms account for more complex associations, with x_{ij} as indicator variables for the categories of variable j. The primary purpose of log-linear analysis is to identify and test for associations, partial independence, and higher-order interactions in cross-classified categorical data, especially when traditional methods such as analysis of variance or ordinary linear regression are inappropriate because of the discrete nature of the variables. It enables researchers to assess whether variables are independent or exhibit dependencies that cannot be explained by lower-order effects alone, providing insights into the structure of multivariate frequency distributions. Unlike logistic regression, which models the conditional distribution of one categorical response variable given the others, log-linear analysis models the joint distribution of all variables simultaneously, treating none as distinctly dependent or independent. This approach was developed in the 1960s, with key formalizations such as Birch's 1963 work on multiplicative models, building on foundational work in multivariate categorical analysis.
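
As a worked instance of this general form, consider a 2 × 2 table with indicator variables x_{i1} and x_{i2} marking the non-reference levels of the two variables. The model reduces to \log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_{12} x_{i1} x_{i2}, so the four cells have log-expected frequencies \beta_0, \beta_0 + \beta_1, \beta_0 + \beta_2, and \beta_0 + \beta_1 + \beta_2 + \beta_{12}. Taking the cross-cell contrast shows that \beta_{12} equals the log odds ratio of the table, which is why setting the interaction parameter to zero corresponds exactly to independence of the two variables.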

Key Assumptions

Log-linear analysis relies on several foundational statistical assumptions to ensure the validity of inferences drawn from categorical data in contingency tables. Under the Poisson sampling framework, a primary assumption is the independence of cell counts, meaning that the counts in different cells are independent of one another. This independence ensures that the variability in one cell does not influence another, allowing the model to capture associations among variables without confounding from correlated errors. Under multinomial sampling, cell counts are dependent because the total sample size is fixed, but the model accounts for this through conditioning on that total.

The sampling scheme underlying the data must also align with the model's structure, typically following either a Poisson distribution for individual cell counts—where the total sample size is not fixed—or a multinomial distribution when margins are fixed, as in product-multinomial sampling. These distributions lead naturally to the log-linear parameterization, in which the logarithm of the expected cell frequencies is modeled as a linear function of the parameters, so the cell counts themselves serve as the responses. Under the Poisson assumption, cell counts are independent Poisson random variables, while the multinomial variant conditions on fixed totals to model proportions.

Another key assumption is the absence of structural zeros in the contingency table: all cells are presumed to have positive probability unless a combination is impossible by design. Structural zeros represent combinations of categories that cannot occur (e.g., impossible events), and their presence without explicit modeling can bias parameter estimates; standard log-linear models therefore assume such cells are either nonexistent or handled by excluding them from the table structure.

For reliable inference, particularly when using likelihood ratio tests or Pearson chi-square statistics to assess model fit, the data must satisfy a sufficient sample size requirement: expected frequencies should be at least 5 in the majority (typically 80%) of cells. This condition supports the asymptotic chi-square approximation of the deviance and ensures that parameter estimates are stable and tests have appropriate Type I error rates; smaller expected values may necessitate exact methods or collapsed tables.

Finally, the model carries an assumption about the variance structure implicit in Poisson sampling: the variance of each cell count equals its expected value (Var(Y) = E(Y) = μ). This equidispersion property underpins maximum likelihood estimation and standard error calculations in log-linear models; violations, such as overdispersion, may require extensions like negative binomial variants.
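
As a minimal sketch of checking the expected-frequency rule of thumb (an added example using base R's loglin() with hypothetical counts, not data from the original text), one can fit a candidate model and inspect its fitted cell frequencies:
counts <- array(c(12, 7, 9, 14, 6, 11, 8, 10), dim = c(2, 2, 2),
                dimnames = list(A = c("a1", "a2"), B = c("b1", "b2"), C = c("c1", "c2")))
fit <- loglin(counts, margin = list(1, 2, 3), fit = TRUE, print = FALSE)  # mutual independence model
expected <- fit$fit
mean(expected >= 5)   # proportion of cells with expected frequency of at least 5 (aim for roughly 0.8 or more)
min(expected)         # very small expected counts suggest exact methods or collapsing categories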

Types of Variables

Log-linear analysis primarily involves categorical variables, which are divided into nominal and ordinal types based on whether they possess an inherent order. Nominal variables represent unordered categories, such as gender (male/female) or religious affiliation (Protestant/Catholic/other), where the levels have no natural ranking and are treated as distinct groups without implying superiority or progression. These variables are typically encoded using dummy indicator variables, one for each category except a reference level, to facilitate modeling in the log-linear framework. Ordinal variables, by contrast, feature ordered categories, for example education levels (elementary/secondary/college) or satisfaction ratings (low/medium/high), where the sequence reflects increasing or decreasing intensity; they can also be represented by indicators but may incorporate scores that exploit the ordering for a more efficient analysis.

The foundational data structure for log-linear analysis is the contingency table, a multi-way array that cross-classifies observations according to the levels of two or more categorical variables, with each cell containing the observed frequency count for that combination. For example, a 2×2×3 contingency table might tabulate counts across two binary nominal variables and one ordinal variable with three levels, enabling the examination of joint distributions. The margins of these tables—row totals, column totals, or higher-dimensional sums—represent univariate marginal distributions or partial associations between subsets of variables, providing summaries of the data's structure before modeling interactions.

In contingency tables for log-linear models, the treatment of margins as fixed or random depends on the underlying sampling scheme, which influences the distributional assumptions. Under Poisson sampling, all margins are random, with cell counts modeled as independent Poisson random variables whose means equal their variances, suitable for independent count data without fixed totals. In multinomial sampling, the overall sample size (a one-dimensional margin) is fixed, rendering other margins random conditional on this total, which aligns with scenarios where observations are categorized into mutually exclusive cells summing to a known n. This distinction ensures that the model accounts for the dependencies introduced by fixed margins, such as in prospective studies where row totals are controlled.

Interaction terms in log-linear analysis describe the associations among categorical variables through main effects, two-way interactions, and higher-order terms, forming the core of how dependencies are modeled in contingency tables. Main effects capture univariate marginal distributions for individual variables, reflecting their standalone influences on cell frequencies. Two-way interaction terms model bivariate associations between pairs of variables, such as the joint distribution of gender and education level, indicating whether categories co-occur more or less often than expected under independence. Higher-order interactions, like three-way terms, address conditional dependencies among three or more variables—for instance, how the association between two variables varies across levels of a third—allowing complex, multifaceted relationships in multi-way tables to be represented.
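
As a concrete illustration of this data structure (an added sketch with hypothetical survey records; xtabs() and margin.table() are base R), a multi-way table and its margins can be built directly from categorical variables:
survey <- data.frame(
  gender    = factor(c("male", "female", "female", "male", "female", "male")),
  employed  = factor(c("yes", "yes", "no", "no", "yes", "yes")),
  education = factor(c("secondary", "college", "elementary", "college", "secondary", "college"),
                     levels = c("elementary", "secondary", "college"), ordered = TRUE)
)
tab <- xtabs(~ gender + employed + education, data = survey)  # a 2 x 2 x 3 contingency table
margin.table(tab, 1)        # one-way margin: counts by gender
margin.table(tab, c(1, 3))  # two-way margin: gender by education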

Model Specification and Fitting

Fitting Criteria

In log-linear analysis, parameters \beta are estimated using maximum likelihood estimation (MLE), which selects values that maximize the likelihood of the observed cell frequencies n_i under the assumption that these frequencies follow a Poisson distribution with expected values \mu_i = \exp(X_i \beta), where X_i is the design vector for cell i. The Poisson likelihood function for the parameters \beta given the observed counts n = (n_1, \dots, n_I) is L(\beta \mid n) \propto \prod_{i=1}^I \frac{\mu_i^{n_i} \exp(-\mu_i)}{n_i!}, where the product is over all I cells in the contingency table, and the factorial terms are constants with respect to \beta. Maximizing this likelihood, or equivalently its logarithm, requires solving nonlinear equations, typically through iterative numerical procedures such as the Newton-Raphson method or iteratively reweighted least squares (IRLS), which converge to the MLE under standard conditions for generalized linear models. For large samples, the MLE \hat{\beta} possesses desirable asymptotic properties: it is consistent (converging in probability to the true \beta), asymptotically unbiased, and asymptotically normally distributed with covariance matrix given by the inverse of the observed Fisher information. In practice, contingency tables often contain empty cells (where n_i = 0), which can complicate estimation if they lead to non-convergence or infinite parameter estimates; common strategies include collapsing table dimensions to combine cells or, if theoretically justified, adding small constants (such as 0.5) to all cells prior to fitting, though the latter risks biasing results and is generally discouraged without strong rationale.
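
The same estimates can be obtained from any Poisson generalized linear model routine; the following is a minimal added sketch with hypothetical counts, using base R's glm(), which maximizes the Poisson likelihood by iteratively reweighted least squares:
cells <- expand.grid(A = c("a1", "a2"), B = c("b1", "b2"))
cells$n <- c(18, 7, 12, 23)                       # observed cell counts
fit <- glm(n ~ A * B, family = poisson(link = "log"), data = cells)
coef(fit)   # maximum likelihood estimates: intercept, main effects, and the A:B interaction
vcov(fit)   # estimated covariance matrix (inverse information), giving asymptotic standard errors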

Hierarchical Principles

In log-linear analysis, the hierarchical principle governs model specification by requiring that any k-way interaction term included in the model be accompanied by all lower-order interactions (up to (k-1)-way) among the same variables, as well as their constituent main effects. For example, including a three-way interaction term ABC necessitates the two-way terms AB, AC, and BC, along with the main effects A, B, and C. This constraint, rooted in the principle of marginality, ensures that higher-order effects are interpreted relative to the associations captured by lower-order terms, preventing the estimation of isolated partial interactions that lack substantive meaning.

The hierarchical structure embodies a Markov-like assumption, wherein the conditional independence of variables given the lower-order terms is implicitly modeled; higher-order interactions thus represent deviations from independence that are conditional on the specified margins. This property facilitates the decomposition of complex associations into interpretable components, aligning with the iterative nature of model building in categorical data analysis.

Hierarchical models are compactly specified using bracket notation, where the listed terms denote the highest-order interactions and lower-order terms are automatically included. For instance, the notation [AB][C] specifies a model with the two-way interaction between variables A and B, the main effect of C, and—by hierarchy—the main effects of A and B, corresponding to the equation \log \mu_{ijk} = u + u_A^{(i)} + u_B^{(j)} + u_C^{(k)} + u_{AB}^{(ij)}. Such notation streamlines the description of models ranging from complete independence ([A][B][C]) to partial associations.

Adopting the hierarchical principle offers key advantages, including reduced overfitting by limiting the parameter space to parsimonious, nested structures and enhanced interpretability through a systematic progression from main effects to higher interactions. This approach supports forward or backward selection strategies, where models are compared hierarchically to identify significant associations without redundant terms.

A practical illustration occurs in analyzing a 2×2×2 contingency table, such as one cross-classifying gender (A), treatment (B), and outcome (C). The hierarchical model positing no three-way interaction but allowing all two-way interactions is expressed as \log \mu_{ijk} = u + u_A^{(i)} + u_B^{(j)} + u_C^{(k)} + u_{AB}^{(ij)} + u_{AC}^{(ik)} + u_{BC}^{(jk)}. This formulation tests for pairwise associations (e.g., treatment-outcome conditional on gender) while assuming each pairwise association is uniform across the third variable's levels, providing a baseline for assessing more complex dependencies.
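
In R, the bracket notation maps directly onto model formulas for MASS::loglm() (an added sketch with a hypothetical 2 x 2 x 2 table; the counts below are illustrative only):
library(MASS)
tab <- array(c(15, 9, 8, 14, 11, 6, 10, 13), dim = c(2, 2, 2),
             dimnames = list(A = c("a1", "a2"), B = c("b1", "b2"), C = c("c1", "c2")))
m1 <- loglm(~ A * B + C, data = tab)               # [AB][C]: A:B association plus the main effect of C
m2 <- loglm(~ A * B + A * C + B * C, data = tab)   # all two-way terms, no three-way interaction
m1; m2   # each prints its likelihood-ratio G^2 and Pearson X^2 with the residual degrees of freedom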

Model Types

General Log-linear Models

General log-linear models provide a framework for analyzing associations among multiple categorical variables in contingency tables by modeling the logarithm of the expected cell frequencies as an additive function of parameters representing main effects and interactions. These models treat all variables symmetrically, without distinguishing between response and explanatory variables, and are fitted using maximum likelihood estimation, often via iterative methods like iterative proportional fitting. The approach is particularly suited to moderate-sized tables with a few variables, enabling tests of hypotheses about independence and conditional associations.

A foundational example is the mutual independence model, which assumes no associations among the variables. For a three-way contingency table with variables A, B, and C, this model is specified as \log \mu_{ijk} = u + u_i^A + u_j^B + u_k^C, where \mu_{ijk} denotes the expected frequency in cell (i,j,k), u is the overall mean log-frequency, and the remaining u terms capture the main effects. Under this model, the expected frequencies factorize as the product of the one-way marginal distributions, implying that the joint distribution is the product of the marginals. This model serves as a baseline for assessing higher-order dependencies.

Partial association models extend the mutual independence framework by incorporating selected interactions while assuming conditional independencies for others. For instance, the model [AB][AC] for three variables includes main effects for A, B, and C, along with two-way interactions AB and AC, but omits the BC interaction, yielding \log \mu_{ijk} = u + u_i^A + u_j^B + u_k^C + u_{ij}^{AB} + u_{ik}^{AC}. This specification expresses the conditional independence of B and C given A, where the absence of the BC term implies that any association between B and C is explained by their mutual relations with A. Such models allow flexible exploration of partial dependencies in multi-way tables.

The saturated model represents the most complex case, incorporating all possible main effects and interactions up to the highest order. For an I \times J \times K table, it includes terms through the three-way ABC interaction, resulting in a parameter count equal to the number of cells (IJK), yielding zero degrees of freedom and a perfect fit to the observed data (\hat{\mu}_{ijk} = n_{ijk} for all cells). While useful as a reference for model comparison, it provides no parsimony or insight into underlying structures.

Interpretation of parameters in general log-linear models focuses on their role in describing log-expected margins and associations. The main-effect terms quantify deviations in the log-expected frequencies for specific levels of a variable from the grand mean, averaged over the other variables; for example, u_1^A - u_2^A = \log(\hat{\mu}_{1 \cdot \cdot}/\hat{\mu}_{2 \cdot \cdot}), the log ratio of the corresponding marginal means. Interaction parameters capture multiplicative effects; in two-way cases, contrasts of the u_{ij} terms correspond to log-odds ratios measuring the strength of association between two variables, conditional on the others in higher dimensions. These interpretations facilitate understanding of how variables jointly influence cell frequencies.

Despite their flexibility, general log-linear models face significant limitations with increasing dimensionality. The number of potential parameters expands exponentially with the number of variables (on the order of 2^p interaction terms for p binary variables), exacerbating the curse of dimensionality: tables become increasingly sparse, estimation becomes computationally intensive because iterative algorithms must traverse large parameter spaces, and models risk overfitting without strong prior constraints on interactions. These challenges restrict practical application to tables with no more than four or five variables unless specialized structures are imposed.
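
A minimal added sketch (hypothetical counts; base R's loglin()) contrasts the mutual independence, partial association, and saturated specifications for a 2 x 2 x 2 table:
tab <- array(c(20, 14, 9, 17, 12, 8, 16, 11), dim = c(2, 2, 2))
ind <- loglin(tab, margin = list(1, 2, 3), print = FALSE)           # [A][B][C]: mutual independence
pa  <- loglin(tab, margin = list(c(1, 2), c(1, 3)), print = FALSE)  # [AB][AC]: B and C conditionally independent given A
sat <- loglin(tab, margin = list(c(1, 2, 3)), print = FALSE)        # saturated: reproduces the observed table exactly
c(ind$lrt, pa$lrt, sat$lrt)  # G^2 for each model; the saturated model has G^2 = 0 on 0 degrees of freedom
c(ind$df, pa$df, sat$df)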

Graphical Models

Graphical log-linear models represent conditional independencies and interactions among categorical variables in multidimensional contingency tables using undirected graphs. These models facilitate the specification of log-linear models by visualizing the structure of associations, where the absence of certain interactions is explicitly encoded through the graph's topology.

In a graphical log-linear model, nodes correspond to the categorical variables (or factors), and undirected edges between nodes denote direct associations, typically interpreted as two-factor interactions. The presence of an edge indicates that the variables are directly related, while its absence indicates conditional independence given the remaining variables. Higher-order interactions are implied only when needed to preserve the graph's structure.

The graph encodes conditional independencies via the separation criterion for undirected graphs (the global Markov property): two sets of variables are conditionally independent given a third set if the third set separates them in the graph, meaning every path between the two sets passes through the separating nodes. This separation implies that no higher-order interactions beyond those captured by the graph are required in the model. For example, if variables B and C are separated by A, then B is conditionally independent of C given A (B ⊥ C | A), precluding a three-way interaction term [ABC].

The log-linear model is generated by including terms that correspond to the maximal cliques—complete subgraphs in which every pair of nodes is connected by an edge. Each maximal clique generates an interaction term of order equal to the number of variables it contains, and the full set of such terms defines the model. Marginal terms for smaller subsets are automatically included to keep the model hierarchical.

Consider three binary variables A, B, and C forming a graph with edges AB and AC but no edge BC. The maximal cliques are {A, B} and {A, C}, leading to the model [AB][AC]. This specifies the log-expected cell frequencies as \log m_{abc} = \mu + \lambda^A_a + \lambda^B_b + \lambda^C_c + \lambda^{AB}_{ab} + \lambda^{AC}_{ac}, implying B ⊥ C | A.

Graphical log-linear models offer advantages as a visual aid for discerning complex interdependencies among multiple variables, making it easier to hypothesize and test structures in high-dimensional data. They also connect to broader probabilistic modeling paradigms, such as Markov random fields, by sharing the Markov properties that link graphical separation to statistical independence. These models can be fitted using maximum likelihood estimation, often via iterative proportional fitting algorithms.
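
As an added sketch of reading a model off its graph (assuming the igraph package; the graph below is the example above), the maximal cliques give the generating terms directly:
library(igraph)
g <- graph_from_literal(A-B, A-C)  # undirected graph with edges AB and AC, no edge BC
max_cliques(g)                     # maximal cliques {A, B} and {A, C}
# Each maximal clique becomes a margin of the log-linear model, here [AB][AC]; with base R this
# model is fitted as loglin(tab, margin = list(c(1, 2), c(1, 3))) for a table with dimensions
# ordered A, B, C, encoding B independent of C given A.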

Decomposable Models

Decomposable log-linear models form a special subclass of graphical log-linear models, characterized by an underlying dependence graph that is chordal—meaning it contains no chordless cycles of length four or more—and whose cliques can be arranged in a perfect ordering satisfying the running intersection property, allowing recursive decomposition into components separated by complete (clique) separators. This structure ensures that the model factorizes exactly over the maximal cliques C and separators S, facilitating efficient computation without approximation.

A key advantage of decomposability is the availability of closed-form maximum likelihood estimates (MLEs), derived directly from the observed marginal tables corresponding to the cliques, with adjustments for the separators to account for overlaps; this eliminates the need for iterative optimization, making estimation computationally tractable even in higher dimensions. The iterative proportional fitting (IPF) algorithm provides an alternative fitting method for these models, operating by cyclically scaling the current estimate to match the observed margins for each clique until convergence, which occurs in a finite number of cycles for decomposable cases. The expected cell frequencies satisfy the factorization \hat{\mu}_i = \prod_{C} \hat{\mu}_{C(i)} \Big/ \prod_{S} \hat{\mu}_{S(i)}^{\,\nu(S)}, where the numerator runs over the cliques C, the denominator over the separators S, \hat{\mu}_{C(i)} and \hat{\mu}_{S(i)} denote the marginal fitted counts (equal to the observed margins at the MLE) for cell i's configuration on C and S, and \nu(S) is the multiplicity with which separator S appears in the decomposition.

These models are particularly suited to sparse, high-dimensional contingency tables, such as those arising in genetics for analyzing multi-locus associations or in social surveys for multi-way interactions among categorical variables, where the chordal structure exploits sparsity to enable exact inference.
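
For the model [AB][AC] (cliques {A, B} and {A, C}, separator {A}), the closed-form estimates are just products of observed margins; the following added sketch uses hypothetical counts and checks the formula against base R's loglin():
tab <- array(c(20, 14, 9, 17, 12, 8, 16, 11), dim = c(2, 2, 2))
n_ab <- margin.table(tab, c(1, 2))   # clique margin n_{ij+}
n_ac <- margin.table(tab, c(1, 3))   # clique margin n_{i+k}
n_a  <- margin.table(tab, 1)         # separator margin n_{i++}
mu_hat <- array(0, dim = dim(tab))
for (i in 1:2) for (j in 1:2) for (k in 1:2)
  mu_hat[i, j, k] <- n_ab[i, j] * n_ac[i, k] / n_a[i]
ipf <- loglin(tab, margin = list(c(1, 2), c(1, 3)), fit = TRUE, print = FALSE)$fit
max(abs(mu_hat - ipf))   # essentially zero: the clique/separator formula reproduces the IPF fit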

Model Evaluation

Assessing Model Fit

Assessing the fit of a log-linear model to contingency table data primarily involves goodness-of-fit tests that measure discrepancies between observed cell counts n_i and model-expected counts \mu_i, where the latter are derived from maximum likelihood estimation under the model being evaluated. The Pearson chi-squared statistic, X^2 = \sum_i \frac{(n_i - \mu_i)^2}{\mu_i}, quantifies this discrepancy and asymptotically follows a chi-squared distribution when expected counts are sufficiently large (typically at least 5 per cell). Similarly, the likelihood-ratio statistic (deviance), G^2 = 2 \sum_i n_i \log(n_i / \mu_i), provides an alternative measure that is often preferred for its additive property, allowing straightforward decomposition for nested models, and it also approximates a chi-squared distribution under the null hypothesis of adequate fit. Both statistics test the model against the saturated model, with small values (or large p-values) indicating good reproduction of the observed data.

The degrees of freedom for these chi-squared tests are calculated as the total number of cells in the table minus the number of free parameters estimated in the model; in Poisson log-linear models, this subtracts all parameters defining the log-expected values, including the overall mean level. For instance, in a two-way table with I rows and J columns under the independence model, the degrees of freedom equal (I-1)(J-1). These degrees of freedom ensure the test accounts for model complexity, providing a benchmark for significance.

When overall fit statistics suggest inadequacy, residual analysis helps pinpoint poorly fitted cells. Pearson residuals, \frac{n_i - \mu_i}{\sqrt{\mu_i}} (or their standardized versions, which additionally divide by a leverage adjustment), are commonly used; values with absolute magnitude greater than 2 or 3 flag cells where the model deviates substantially from observations, and they approximate a standard normal distribution when the model fits well. For count data, especially in sparse tables, square-root-based residuals—such as \sqrt{n_i} - \sqrt{\mu_i}, or the Freeman–Tukey form \sqrt{n_i} + \sqrt{n_i + 1} - \sqrt{4\mu_i + 1}—offer improved properties, including better normality and reduced sensitivity to small expected values.

In small samples, where asymptotic chi-squared approximations may fail because of low expected counts, simulation-based methods provide robust alternatives. Monte Carlo tests simulate contingency tables under the fitted model (e.g., by generating Poisson variates with means \mu_i) and compute empirical distributions of X^2 or G^2 to derive exact p-values, enhancing reliability for complex or sparse data structures.
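
A minimal added sketch of such a Monte Carlo test (hypothetical sparse counts; base R), simulating Poisson tables under the fitted independence model and comparing their G^2 values to the observed one:
tab <- array(c(3, 1, 2, 5, 4, 2, 1, 3), dim = c(2, 2, 2))
fit <- loglin(tab, margin = list(1, 2, 3), fit = TRUE, print = FALSE)
g2_obs <- fit$lrt
g2_sim <- replicate(2000, {
  sim <- array(rpois(length(fit$fit), fit$fit), dim = dim(tab))  # table drawn from the fitted model
  loglin(sim, margin = list(1, 2, 3), print = FALSE)$lrt
})
mean(g2_sim >= g2_obs)   # Monte Carlo p-value for the independence model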

Comparing Multiple Models

In log-linear analysis, comparing multiple models is essential for selecting the most parsimonious representation of associations among categorical variables in contingency tables, particularly when models adhere to hierarchical principles. Nested models, where one is a special case of another (e.g., a model omitting higher-order interactions), are compared using likelihood ratio tests based on the deviance statistic G^2. With the simpler model M_2 nested within the larger model M_1, the difference G^2_{M_2} - G^2_{M_1} follows a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters (\Delta df = p_1 - p_2) under the null hypothesis that M_2 adequately fits the data; differences in the Pearson chi-squared statistic X^2 can be used similarly.

Stepwise selection procedures facilitate model comparison by iteratively building or pruning models. Backward selection begins with the saturated model (which fits the data perfectly) and removes terms whose omission does not significantly worsen fit, as determined by p-values from the nested likelihood ratio tests (typically at \alpha = 0.05). Forward selection starts from the independence model and adds terms that significantly improve fit. These methods ensure hierarchical consistency while balancing model complexity and explanatory power.

For broader model selection, information criteria penalize complexity to favor parsimonious models. The Akaike information criterion (AIC) can be computed as \text{AIC} = G^2 + 2p, where p is the number of parameters, providing an estimate of relative predictive accuracy; lower values indicate better models. The Bayesian information criterion (BIC), given by \text{BIC} = G^2 + p \log(n) with sample size n, imposes a stronger penalty for additional parameters in larger datasets, approximating the Bayes factor for model choice. Both are applicable to log-linear models fitted by maximum likelihood under Poisson or multinomial sampling.

When models are non-nested (neither is a special case of the other), Vuong's likelihood ratio test assesses which is closer to the true data-generating process by standardizing the average difference in log-likelihoods between models, yielding a test statistic that is asymptotically standard normal under the null of equal fit. This test is based on Kullback-Leibler divergence and is particularly useful for comparing log-linear models with differing interaction structures.

For instance, in analyzing a three-way contingency table of variables A, B, and C, one might first fit a two-way interaction model (including the AB, AC, and BC terms) and then compare it to the model adding the ABC term using the likelihood ratio test: if G^2_{\text{two-way}} - G^2_{\text{three-way}} exceeds the critical chi-squared value on the corresponding degrees of freedom ((I-1)(J-1)(K-1), which equals 1 for a 2×2×2 table), the three-way interaction significantly improves fit, justifying its inclusion.
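
A minimal added sketch of these comparisons (hypothetical counts; MASS::loglm() and base R), computing the nested likelihood ratio test and deviance-based information criteria as defined above:
library(MASS)
tab <- array(c(25, 14, 9, 22, 11, 8, 19, 16), dim = c(2, 2, 2),
             dimnames = list(A = c("a1", "a2"), B = c("b1", "b2"), C = c("c1", "c2")))
m2 <- loglm(~ (A + B + C)^2, data = tab)  # homogeneous association: all two-way terms
m3 <- loglm(~ A * B * C, data = tab)      # saturated model
anova(m2, m3)                             # likelihood ratio (G^2 difference) test on 1 df
p2 <- prod(dim(tab)) - m2$df              # number of estimated parameters in m2
m2$lrt + 2 * p2                           # deviance-based AIC-type criterion, G^2 + 2p
m2$lrt + p2 * log(sum(tab))               # deviance-based BIC-type criterion, G^2 + p log(n)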

Interpretation and Analysis

Follow-up Tests

After an initial log-linear model is fitted to contingency table data using maximum likelihood estimation, follow-up tests enable targeted examination of hypotheses about specific parameters or interactions, such as whether a particular association is present conditional on other variables. These procedures build on parameter estimates from the fitting process and are essential for dissecting complex multi-way associations in categorical data analysis.

Wald tests provide a direct method for assessing the significance of individual parameters, such as log-odds ratios corresponding to interactions. The test statistic is z = \frac{\hat{\beta}}{\text{SE}(\hat{\beta})}, where \hat{\beta} is the maximum likelihood estimate of the parameter and \text{SE}(\hat{\beta}) its standard error; z asymptotically follows a standard normal distribution under the null hypothesis that the parameter equals zero. This approach is standard in the generalized linear models framework underlying log-linear analysis and is computationally straightforward once estimates are obtained.

Score tests, alternatively known as Lagrange multiplier tests, offer an efficient alternative for hypothesis testing without refitting the full alternative model. They rely on the score statistic, the first derivative of the log-likelihood evaluated at the null parameter values, standardized by the expected information: approximately \frac{[U(\theta_0)]^2}{I(\theta_0)} \sim \chi^2_1 for a single parameter, where U is the score function and I the Fisher information. In log-linear contexts, score tests are particularly valuable for verifying conditional independence or trends in contingency tables when only the null model needs to be fitted.

Partial association tests investigate the relationship between two variables while accounting for others, often in multi-way tables. These can involve collapsing the table over the adjusting variables to form a partial table and applying a chi-squared test, or fitting nested log-linear models and comparing them via likelihood ratio statistics to isolate the interaction term of interest. For a three-way table, the test for the partial AB association controlling for C assesses whether the conditional association between A and B, assumed common across levels of C, departs from independence; equivalently, the overall chi-squared can be partitioned into components, with the partial association component following a chi-squared distribution with (I-1)(J-1) degrees of freedom. This partitioning method facilitates hierarchical decomposition of associations.

When multiple follow-up tests are performed, such as simultaneously evaluating several interaction terms, the Bonferroni correction adjusts the significance level to maintain control of the family-wise error rate. The adjusted level is \alpha / m, where m is the number of tests; for instance, with five tests at nominal \alpha = 0.05, each test uses \alpha = 0.01. This procedure, though conservative, is routinely recommended in log-linear analyses involving multiple parameter hypotheses to prevent spurious significant findings.

An illustrative example is testing homogeneity of odds ratios in a three-way contingency table, such as a 2×2×K design where the goal is to verify whether the odds ratio between two binary variables remains constant across K strata of a third. This is tested by fitting the homogeneous association log-linear model, which omits the three-way interaction term (\log \mu_{ijk} = \lambda + \lambda_i^A + \lambda_j^B + \lambda_k^C + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC}), and comparing its deviance to that of the saturated model via a likelihood ratio test; non-significance indicates constant partial odds ratios. Such tests are common in stratified analyses, for example when assessing whether a treatment effect is uniform across patient subgroups.
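
A minimal added sketch of these follow-up tests (hypothetical counts in long format; base R glm()), showing Wald z statistics for individual terms and the likelihood ratio test of homogeneous association against the saturated model:
d <- expand.grid(A = c("a1", "a2"), B = c("b1", "b2"), C = c("c1", "c2"))
d$n <- c(30, 18, 14, 25, 22, 16, 11, 27)
hom <- glm(n ~ (A + B + C)^2, family = poisson, data = d)  # homogeneous association model
summary(hom)$coefficients                                  # Wald z = estimate / SE for each parameter
# With m follow-up tests, compare each p-value to alpha / m (Bonferroni), e.g. 0.05 / 3
# when the three two-way interaction terms are tested simultaneously.
sat <- glm(n ~ A * B * C, family = poisson, data = d)
anova(hom, sat, test = "LRT")   # tests the three-way term, i.e. homogeneity of the conditional odds ratios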

Effect Sizes and Measures

In log-linear models for categorical data, effect sizes measure the magnitude and practical importance of associations between variables, distinct from tests of statistical significance. These metrics help quantify how strongly variables interact in contingency tables, aiding interpretation in fields such as the social sciences and epidemiology. Common effect sizes derive from model parameters or summary statistics of the table, focusing on the strength of dependence rather than on model fit alone.

For two-way contingency tables, the log-odds ratio serves as a primary effect size for measuring association strength. In a saturated log-linear model for a 2×2 table, the interaction parameter \beta_{ij} represents the log-odds ratio, and the odds ratio \exp(\beta_{ij}) indicates the change in odds of one category given the other, relative to a baseline. Values near 1 suggest weak or no association, while deviations (e.g., above 2 or below 0.5) imply stronger effects. Confidence intervals for \exp(\beta_{ij}) are typically constructed using the delta method, which approximates the variance of the transformed parameter from the asymptotic normality of \hat{\beta}_{ij}. This approach yields interpretable intervals on the odds ratio scale, facilitating assessment of uncertainty in the effect magnitude.

In higher-order tables, partial associations address conditional dependencies while accounting for other variables, with collapsibility determining whether marginal effects align with conditional ones. Under suitable conditional independence structures (for example, when the third variable is conditionally independent of one member of the pair), the partial odds ratio between two variables equals the marginal odds ratio, a property known as collapsibility; when collapsibility fails, averaging partial log-odds ratios over strata of the third variable is used to summarize the overall effect. Average partial associations, computed as the arithmetic mean of conditional log-odds ratios across levels of the conditioning variables, quantify the typical strength of a two-way interaction in multi-way tables and are useful for summarizing complex structures without assuming decomposability.

Other effect sizes include uncertainty coefficients and analogs to R^2, which capture asymmetric dependence and predictive power. The uncertainty coefficient U, which measures the proportional reduction in uncertainty (entropy) of one variable given the other, is defined as U = \frac{ \sum_i \sum_j \pi_{ij} \log \left( \frac{\pi_{ij}}{\pi_{i+} \pi_{+j}} \right) }{ -\sum_j \pi_{+j} \log \pi_{+j} }, with values from 0 (independence) to 1 (perfect prediction); it is asymmetric, yielding distinct row- and column-based coefficients. For predictive association, Goodman-Kruskal lambda (\lambda) and gamma (\gamma) serve as proportional-reduction-in-error style measures: \lambda quantifies the proportional reduction in prediction error for nominal data and ranges from 0 to 1, while \gamma measures monotone association for ordinal data and ranges from -1 to 1; absolute values above roughly 0.3 are often read as moderate association.

As an example, consider a three-way contingency table analyzing the association between treatment (T), response (R), and sex (S) in a medical study. A significant three-way interaction term in the log-linear model implies that the two-way odds ratio between T and R varies by levels of S; the effect size can then be reported as the range or average of these conditional odds ratios (e.g., \exp(\beta_{TR|S=1}) = 2.5 versus \exp(\beta_{TR|S=2}) = 1.2), highlighting how the strength of the treatment effect differs across subgroups and guiding targeted inferences.
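
A minimal added sketch (hypothetical 2×2 counts; base R glm()) of turning an estimated interaction parameter into an odds ratio with an approximate 95% interval, here by exponentiating the Wald interval for the log-odds ratio:
d <- expand.grid(treatment = c("control", "treated"), response = c("no", "yes"))
d$n <- c(40, 25, 18, 33)
fit <- glm(n ~ treatment * response, family = poisson, data = d)
b  <- coef(fit)["treatmenttreated:responseyes"]   # estimated log-odds ratio
se <- sqrt(vcov(fit)["treatmenttreated:responseyes", "treatmenttreated:responseyes"])
exp(b)                            # odds ratio
exp(b + c(-1, 1) * 1.96 * se)     # approximate 95% confidence interval on the odds ratio scale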

Implementation

Software for Small Datasets

Software tools for fitting log-linear models to small contingency tables, typically with up to 5-6 dimensions, are available in several statistical packages, enabling hierarchical model specification, assessment of fit via chi-square statistics, and basic tests of effects such as associations and interactions. These tools treat cell frequencies as Poisson-distributed counts and use iterative maximum likelihood methods to estimate parameters, supporting analyses of multi-way tables in which the log-expected frequencies are modeled as linear combinations of main effects and interactions.

In R, the loglm() function from the MASS package fits hierarchical log-linear models using iterative proportional fitting (IPF), which yields the maximum likelihood fitted values for hierarchical models on complete tables. It allows formula-based specification similar to linear models, producing deviance-based chi-square goodness-of-fit tests and likelihood ratio statistics for comparing nested models. For more flexible extensions, the gnm package supports generalized nonlinear models (GNMs), which encompass log-linear models with multiplicative terms for complex interactions, also providing chi-square tests and effect estimates. An example for a 2x2x2 table of factors A, B, and C uses:
library(MASS)
data <- array(c(10, 20, 30, 40, 50, 60, 70, 80), dim = c(2, 2, 2))
dimnames(data) <- list(A = c("a1", "a2"), B = c("b1", "b2"), C = c("c1", "c2"))
fit <- loglm(~ A + B + C + A:B, data = data)
summary(fit)
This code fits a model with main effects and the A:B interaction, yielding a likelihood-ratio deviance on 3 residual degrees of freedom (p > 0.05, consistent with adequate fit) under Poisson assumptions.

SAS provides PROC GENMOD for fitting log-linear models via generalized linear modeling with a Poisson distribution and log link function, suitable for small multi-dimensional tables through iterative maximum likelihood estimation. It supports hierarchical terms via CLASS and MODEL statements, outputting Pearson and deviance chi-square statistics for fit assessment, as well as Wald tests for individual effects. For a 2x2x2 contingency table stored as a dataset with variables A, B, C, and COUNT, the code is:
proc genmod data=table;
  class A B C;
  model COUNT = A B C A*B / dist=poisson link=log type3;
run;
This estimates parameters for main effects and the A:B interaction, with goodness-of-fit values indicating model adequacy.

In SPSS, the GENLOG procedure performs general log-linear analysis on contingency tables, fitting hierarchical models by maximum likelihood and handling up to 10 factors for small datasets. It generates expected frequencies, log-linear parameters, and likelihood ratio tests for model fit and effect significance, with options for handling structural zeros. For two-way tables, CROSSTABS can include log-linear modeling extensions via custom syntax for association tests, but GENLOG is preferred for multi-way analysis. A basic GENLOG command for a 2x2x2 table defined by factors A, B, C is:
GENLOG A B C
  /DESIGN = A B C A*B
  /CRITERIA = DELTA=0 ZEROS=SUPPRESS.
This fits the specified model, providing partial association chi-squares (e.g., G² ≈ 4.1 for A:B, p < 0.05).

Stata's poisson command, or equivalently glm with family(poisson) and link(log), fits log-linear models to count data in contingency tables using maximum likelihood, and is well suited to small tables, with robust standard errors available for inference. Both commands support factor variables for hierarchical terms, yielding deviance and Pearson chi-square statistics for overall fit, alongside z-tests for effects. For a 2x2x2 table reshaped to long format with variables a, b, c, and count, the syntax is:
glm count i.a i.b i.c i.a#i.b, family(poisson) link(log) vce(robust)
This produces coefficients interpretable as log-rate ratios (e.g., A:B interaction β ≈ 0.3, z = 2.1, p < 0.05) and a model deviance of about 6.8 for fit evaluation.

Software for Large Datasets

For handling high-dimensional and sparse contingency tables in decomposable log-linear models, specialized R packages provide efficient tools for model specification, fitting, and inference. The mimR package supports graphical and decomposable log-linear models for discrete data, enabling the representation of interactions via undirected graphs and exact computation of maximum likelihood estimates for decomposable structures. Similarly, the gRbase package extends graphical modeling capabilities to hierarchical log-linear models for multivariate discrete data, alongside Gaussian graphical models, facilitating decomposable fitting and conditional independence testing for large variable sets.

Specialized standalone software such as LEM (Log-linear and Event-history Modeling) offers exact fitting for decomposable log-linear and latent class models using iterative proportional fitting (IPF), which is particularly suited to sparse, high-dimensional tables because it avoids approximations and directly incorporates structural zeros. LEM's command-based interface allows precise control over model hierarchies and handles missing data via expectation-maximization, making it viable for datasets with dozens to hundreds of cells.

In general-purpose environments such as Python, custom implementations are common for decomposable cases, leveraging graph libraries such as NetworkX to encode interaction structures and numerical optimization routines for parameter estimation. These approaches exploit the closed-form maximum likelihood estimates available for decomposable models, where parameters are derived directly from marginal distributions over cliques, enabling handling of structural zeros without iterative convergence issues. Such implementations scale to over 100 variables on standard hardware by avoiding full materialization of the joint table and focusing on clique-level decompositions.

Key capabilities across these tools include exact likelihood computation for model evaluation and efficient management of sparse data through graph-based storage, in contrast to iterative methods that falter in high dimensions. Recent post-2020 developments integrate these frameworks with automated model search techniques, such as genetic algorithms in graphical log-linear contexts, enhancing scalability for applications in high-dimensional data analysis.
