Log-linear model
A log-linear model is a statistical method for analyzing associations among two or more categorical variables by modeling the logarithms of expected cell counts in a multi-way contingency table as a linear function of parameters that capture main effects and interactions between the variables.[1] These models treat all variables symmetrically, without distinguishing between response and predictor variables, and are particularly suited for count data observed under multinomial or Poisson sampling schemes.[1] Log-linear models form a special case of generalized linear models (GLMs), employing a Poisson distribution for the response and a logarithmic link function to ensure predicted counts remain positive.[1] Parameter estimation typically uses maximum likelihood methods, often implemented via iterative algorithms such as iteratively reweighted least squares or iterative proportional fitting, with goodness-of-fit assessed through deviance statistics like the likelihood ratio chi-square (G²) or Pearson's chi-square (X²).[1] Hierarchical principles guide model specification, where higher-order interactions imply lower-order ones, enabling tests for conditional independence and the estimation of measures like odds ratios to quantify associations.[1]
The foundations of log-linear models trace back to early 20th-century work on contingency tables by Karl Pearson, who introduced the chi-square test in 1900, and George Udny Yule, though the modern log-linear framework emerged in the 1960s.[2] Key advancements include Bartlett's 1935 maximum likelihood estimation for three-way tables and the iterative proportional fitting algorithm by Deming and Stephan in 1940, but the pivotal unification came with Michael Birch's 1963 paper on maximum likelihood for multi-way tables, followed by Leo Goodman's extensions in 1963–1971 that popularized their use for testing interactions.[2] Shelby Haberman's 1974 work further clarified estimation conditions, solidifying log-linear models as a cornerstone of categorical data analysis.[2]
In practice, log-linear models are widely applied in fields such as the social sciences, epidemiology,[3] and market research[4] to explore complex dependencies in categorical data, including latent class analysis[3] and graphical models for higher-dimensional tables.[5] They offer flexibility for sparse data and model selection via criteria like the Bayesian information criterion (BIC), as highlighted in Raftery's contributions in the 1980s.[1] Modern software such as R's loglin function or SAS's PROC GENMOD facilitates their implementation, ensuring accessibility for rigorous inference on multivariate associations.[1]
Definition and Basic Concepts
General Definition
A log-linear model is a statistical method for categorical data analysis in which the logarithms of the expected cell counts in a multi-way contingency table are expressed as a linear function of parameters that capture main effects and interactions among the categorical variables. This structure makes the model additive on the logarithmic scale and multiplicative on the original count scale, accommodating scenarios in which the effects of the factors combine multiplicatively. Unlike models with designated response variables, log-linear models treat all categorical variables symmetrically, without distinguishing between response and predictors.[6]
Log-linear models originated in the 1960s and gained prominence in the 1970s for the analysis of categorical data, with foundational contributions from researchers such as Leo A. Goodman and Stephen E. Fienberg. Goodman's early 1970s papers introduced hierarchical formulations for multi-way data structures, Shelby Haberman advanced likelihood-based inference in 1973–1974, and Fienberg developed iterative estimation methods and addressed sampling challenges in model fitting. These innovations, facilitated by emerging computational capabilities, established log-linear models as a cornerstone for multivariate analysis.[2]
Unlike standard linear models, which posit additive effects suitable for unbounded responses, log-linear models assume multiplicativity to handle strictly positive outcomes, ensuring that predictions remain non-negative and reflect proportional changes. In log-linear models for contingency tables, the responses are cell counts, typically assumed to follow a Poisson distribution (or a multinomial distribution when margins are fixed), which belongs to the exponential family of distributions used in generalized linear models.[7][8]
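For example, for a two-way table cross-classifying variables A and B, the model of independence specifies \log(\mu_{ij}) = \mu + \lambda_i^A + \lambda_j^B, so that on the count scale each expected frequency \mu_{ij} is the product of a row term and a column term; adding the interaction term \lambda_{ij}^{AB} gives the saturated model, which reproduces the observed table exactly.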
Relation to Other Models
The log-linear model is a special case of the generalized linear model (GLM) framework, where the response variable follows a Poisson or multinomial distribution and a logarithmic link function is employed to model the expected cell counts in contingency tables.[9] This positioning allows log-linear models to handle non-normal responses, such as counts, through the exponential family of distributions, unifying them with other GLMs like logistic regression under a common estimation paradigm via maximum likelihood.
In comparison to ordinary linear regression, which assumes additive effects and constant variance, the log-linear model is particularly suited for positive, skewed data like counts or rates, as the log link transforms multiplicative relationships into additive ones on the log scale, yielding interpretable percentage changes in expectations.[10] It also addresses heteroscedasticity inherent in count data, where variance increases with the mean, by incorporating Poisson variance-mean equality, which linear models often violate without transformations.
Log-linear models extend analysis of variance (ANOVA) and regression techniques for categorical predictors to the realm of count data, treating all variables symmetrically in multi-way contingency tables rather than distinguishing response from predictors.[11] By modeling the logarithm of expected frequencies as a linear combination of main effects and interactions—analogous to ANOVA decompositions on the log scale—they enable hierarchical testing of associations among categorical variables, surpassing traditional chi-square tests in flexibility for complex structures.
A precursor to modern log-linear estimation, the iterative proportional fitting (IPF) procedure, developed in the mid-20th century for adjusting contingency tables to marginal constraints, provides an efficient algorithm for obtaining maximum likelihood estimates under log-linear specifications, especially for hierarchical models. This method, formalized in the context of log-linear models during the 1970s, remains computationally valuable for large tables where direct GLM fitting may be intensive.
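As an illustration of the IPF idea, the following minimal numpy sketch fits the no-three-factor-interaction model [AB][AC][BC] to a three-way table by cycling through the three two-way margins until the fitted margins match the observed ones. The function and variable names are illustrative, and practical implementations add safeguards such as handling zero margins.

```python
import numpy as np

def ipf_no_three_way(n, tol=1e-8, max_iter=1000):
    """Iterative proportional fitting for the model [AB][AC][BC] on an
    I x J x K table of counts n: repeatedly rescale the fitted table so that
    each of its two-way margins matches the corresponding observed margin."""
    mu = np.ones_like(n, dtype=float)
    for _ in range(max_iter):
        mu *= (n.sum(axis=2) / mu.sum(axis=2))[:, :, None]   # match the AB margin
        mu *= (n.sum(axis=1) / mu.sum(axis=1))[:, None, :]   # match the AC margin
        mu *= (n.sum(axis=0) / mu.sum(axis=0))[None, :, :]   # match the BC margin
        if np.allclose(mu.sum(axis=2), n.sum(axis=2), atol=tol):
            break
    return mu   # converges to the maximum likelihood fit under [AB][AC][BC]
```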
Mathematical Formulation
Model Equation
The log-linear model is a type of generalized linear model (GLM) that employs a logarithmic link function to model the expected cell counts in multi-way contingency tables arising from categorical variables. These models treat all variables symmetrically and are particularly suited for count data. In the hierarchical parameterization for categorical data, the logarithm of the expected cell count \mu_{ijk\dots} is expressed as a linear function of parameters capturing main effects and interactions: \log(\mu_{ijk\dots}) = \mu + \lambda_i^A + \lambda_j^B + \lambda_k^C + \dots + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \dots + \lambda_{ijk}^{ABC} + \dots, where \mu is the overall mean (on the log scale) and the \lambda terms represent main effects for each factor (e.g., \lambda_i^A for levels of factor A) and higher-order interactions (e.g., \lambda_{ij}^{AB} for the two-way interaction between A and B), up to the highest-order interaction among all factors if included.[11] This can be a saturated model, which includes all possible interaction terms and fits the data perfectly, or an unsaturated model that omits higher-order terms to impose parsimony and test specific hypotheses about associations.[1] Exponentiating both sides yields \mu_{ijk\dots} = \exp(\mu + \lambda_i^A + \lambda_j^B + \dots), ensuring that predicted counts are positive and allowing for multiplicative effects among the factors.[12]
Key assumptions underlying the log-linear model include independence of observations across cells, positive expected values (\mu > 0) to avoid undefined logarithms, and, for count data in contingency tables, observed counts that typically follow a Poisson or multinomial distribution with mean \mu.[13]
The log-linear form arises naturally within the exponential family of distributions, specifically as the canonical link function for the Poisson distribution. For a Poisson random variable Y \sim \mathrm{Poisson}(\mu), the probability mass function is P(Y = y) = \frac{\mu^y e^{-\mu}}{y!}, which can be rewritten in exponential family form as \log P(Y = y) = y \log \mu - \mu - \log(y!), where the natural parameter \theta = \log \mu links directly to the linear predictor \eta = X\beta, yielding the log-linear specification \log \mu = \eta.[14] This canonical parameterization simplifies maximum likelihood estimation and ensures desirable statistical properties, with a similar structure under multinomial sampling.[12]
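As a minimal numerical sketch of this structure, the following Python snippet builds the expected counts of a 2×2 table under effect coding, using arbitrary illustrative parameter values rather than estimates from data, and shows that additivity on the log scale corresponds to multiplicativity on the count scale. Fitting such a model to observed counts by maximum likelihood is described in the estimation section below.

```python
import numpy as np

# Arbitrary illustrative parameter values (not estimates from data) for a 2x2 table
# under effect coding: log(mu_ij) = mu0 + lam_A[i] + lam_B[j] + lam_AB[i, j].
mu0 = 3.0
lam_A = np.array([0.4, -0.4])          # main effect of factor A, sums to zero
lam_B = np.array([0.2, -0.2])          # main effect of factor B, sums to zero
lam_AB = np.zeros((2, 2))              # no interaction term: independence model

log_mu = mu0 + lam_A[:, None] + lam_B[None, :] + lam_AB
mu = np.exp(log_mu)                    # expected counts are strictly positive

# Additivity on the log scale is multiplicativity on the count scale:
# with lam_AB = 0, mu_ij factorizes into a row term times a column term.
row_factor = np.exp(mu0 / 2 + lam_A)
col_factor = np.exp(mu0 / 2 + lam_B)
print(np.allclose(mu, np.outer(row_factor, col_factor)))   # True
```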
Parameter Interpretation
In log-linear models, the parameters associated with main effects, denoted \lambda_i for the i-th category of a categorical variable, represent the logarithmic contribution of that category to the expected cell counts relative to a baseline or reference category.[15] The exponential of these parameters, \exp(\lambda_i), serves as a multiplicative factor that scales the expected count for the given category compared to the baseline, indicating how much the count increases or decreases due to the presence of that level.[16] For instance, under effect coding constraints, where the parameters for a variable sum to zero, \lambda_i measures the deviation of the log-expected count for category i from the overall mean log-expected count across all categories of that variable.[16]
Interaction parameters, such as \lambda_{ij}^{AB} for the joint effect of categories i and j from variables A and B, capture the combined influence of multiple variables on the expected counts beyond what main effects alone would predict.[1] These terms quantify associations between variables; specifically, \exp(\lambda_{ij}^{AB}) can be interpreted as a ratio of expected counts or, in the context of contingency tables, as a measure of how the relationship between two variables modifies the odds or risks within subgroups.[15] For example, in a two-way interaction, \lambda_{ij}^{AB} reflects the partial association between A and B, adjusting for other factors, and its exponential form indicates the proportional change in expected counts due to the specific category combination.[17]
The hierarchy principle in log-linear models ensures that the inclusion of a higher-order interaction term implies the presence of all corresponding lower-order terms, facilitating interpretable and nested model structures.[16] For instance, specifying a three-way interaction \lambda_{ijk}^{ABC} requires including the two-way interactions such as \lambda_{ij}^{AB} and the main effects \lambda_i^A, as the higher-order term builds upon and modifies the lower-order associations.[1] This principle maintains consistency in parameter meanings across models and prevents overparameterization by enforcing that higher interactions represent variations in lower ones.[17]
Parameters in log-linear models often translate directly into interpretable measures such as odds ratios and risk ratios, enhancing their practical utility in analysis.[15] The local odds ratio for adjacent categories i, i+1 and j, j+1 of variables A and B is given by \exp(\lambda_{ij}^{AB} + \lambda_{i+1,j+1}^{AB} - \lambda_{i+1,j}^{AB} - \lambda_{i,j+1}^{AB}), representing the change in odds of one outcome relative to another for a unit shift in categories.[16] Similarly, \exp(\beta) in related GLM contexts denotes the multiplicative change in the expected response for a one-unit increase in a predictor, akin to a risk ratio in count data contexts.[1] These transformations allow parameters to convey substantive effects, such as relative risks between groups, in a scale-free manner.[17]
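In the simplest 2×2 case this reduces to a single cross-product ratio: writing \log(\mu_{ij}) = \mu + \lambda_i^A + \lambda_j^B + \lambda_{ij}^{AB}, the log odds ratio is \log\frac{\mu_{11}\mu_{22}}{\mu_{12}\mu_{21}} = \lambda_{11}^{AB} + \lambda_{22}^{AB} - \lambda_{12}^{AB} - \lambda_{21}^{AB}, since the overall mean and the main-effect terms cancel in the cross-product; the odds ratio therefore depends only on the interaction parameters.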
Applications
Categorical Data Analysis
Log-linear models are widely applied in categorical data analysis to examine relationships among multiple categorical variables through multi-way contingency tables, where cell entries represent observed frequencies. These models treat the cell counts as realizations of independent Poisson random variables, enabling the specification of expected frequencies via a logarithmic link function that captures main effects and interactions additively.[18] This Poisson assumption aligns with the multinomial sampling common in contingency table studies, as the conditional distribution given the margins follows a multinomial form under the same log-linear parameterization.[19]
A key feature of log-linear models in this context is their hierarchical structure, which allows researchers to specify and test models ranging from complete independence among variables to partial associations (e.g., conditional independence given a third variable) and full interaction terms encompassing all higher-order effects. For instance, in a three-way table, a model of mutual independence might include only main effects, while a partial association model could incorporate a two-way interaction alongside main effects to represent conditional dependencies. These hierarchical specifications facilitate stepwise model building, where higher-order terms are included only if supported by the data, promoting parsimonious representations of complex associations in social science datasets.[20] Such structures were extensively developed in the 1970s to address multi-dimensional tables in fields like sociology and ecology, with Stephen E. Fienberg's 1970 work providing foundational methods for analyzing interactions in higher dimensions.[2]
Collapsibility in log-linear models refers to the conditions under which associations observed in a full table preserve their strength when marginalizing over one or more variables, a property crucial for interpreting aggregated data without distortion. Violations of collapsibility can manifest as Simpson's paradox, where marginal associations reverse direction compared to conditional ones, often arising in non-collapsible interaction structures like those in certain hierarchical models. For example, in a three-way table, the partial association between two variables may hold conditionally but not marginally if a third variable induces confounding interactions. Parameter interpretations in these models link interaction terms to log-odds ratios or log-expected frequency deviations, aiding in the assessment of such effects.[21][22]
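Collapsibility can be inspected directly from a table. The following sketch (numpy; the function name is illustrative) compares the marginal A-B odds ratio of a 2×2×2 table with the conditional odds ratios at each level of C; a marked discrepancy, or a reversal of direction as in Simpson's paradox, indicates that the association is not collapsible over C.

```python
import numpy as np

def odds_ratios_2x2x2(counts):
    """Compare marginal and conditional (partial) odds ratios for a 2x2x2 table.

    counts[i, j, k] is the cell count for level i of A, j of B, and k of C.
    Returns the A-B odds ratio in the marginal table (summed over C) and
    the A-B odds ratio within each level of C.
    """
    counts = np.asarray(counts, dtype=float)
    marginal = counts.sum(axis=2)                       # collapse over C
    or_marginal = (marginal[0, 0] * marginal[1, 1]) / (marginal[0, 1] * marginal[1, 0])
    or_conditional = [
        (counts[0, 0, k] * counts[1, 1, k]) / (counts[0, 1, k] * counts[1, 0, k])
        for k in range(counts.shape[2])
    ]
    return or_marginal, or_conditional
```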
Econometrics and Trend Modeling
In econometrics, log-linear models are frequently employed to analyze relationships involving multiplicative growth, such as demand functions or economic expansion, where the specification takes the form \log(y) = \alpha + \beta \log(x) + \epsilon. This double-log transformation allows the coefficient \beta to be interpreted as the elasticity of y with respect to x, quantifying the percentage change in y for a one percent change in x.[23][24] A prominent application occurs in labor economics through wage equations, exemplified by the Mincer equation, which models log wages as a function of education and experience to estimate returns to human capital.[25] Similar log-linear forms are used in epidemiology to examine incidence rates, capturing proportional changes in disease occurrence over time or across populations.[26]
In trend analysis, log-linear models facilitate the modeling of exponential growth patterns, particularly in time series data, since on the logarithmic scale exponential growth appears as a linear trend with a constant percentage change per period. A key example is joinpoint regression, which applies piecewise log-linear segments to detect shifts in trends, such as varying rates of increase in health or economic indicators over time. The log transformation in these models also addresses the heteroscedasticity inherent in multiplicative error structures, where errors proportional to the level of the variable lead to increasing variance; by stabilizing this variance on the log scale, the approach improves the reliability of linear regression assumptions.[27] Such applications extend log-linear principles to generalized linear models for non-categorical outcomes in economics.[28]
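A brief sketch of the double-log specification, using simulated data purely for illustration (the true elasticity is set to -1.5 and recovered by ordinary least squares on the log scale; all variable names are illustrative):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data, purely for illustration: a constant-elasticity relationship
# log(q) = alpha + beta * log(p) + e with elasticity beta = -1.5.
n = 500
log_p = rng.normal(0.0, 0.5, n)
log_q = 2.0 - 1.5 * log_p + rng.normal(0.0, 0.2, n)

X = sm.add_constant(log_p)             # intercept plus log-price regressor
fit = sm.OLS(log_q, X).fit()
print(fit.params)                      # slope estimate (the elasticity) is close to -1.5
```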
Estimation and Inference
Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is the primary method for obtaining parameter estimates in log-linear models, which are generalized linear models (GLMs) assuming a Poisson distribution for cell counts in contingency tables and a logarithmic link function relating the mean to the linear predictor. Under independent Poisson sampling, the likelihood function for observed counts y = (y_1, \dots, y_I) and expected means \mu = (\mu_1, \dots, \mu_I) is given by L(\beta) = \prod_{i=1}^I \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}, where \mu_i = \exp(x_i^T \beta) and x_i denotes the design vector for cell i. Maximization of L(\beta) is typically performed by optimizing the log-likelihood \ell(\beta) = \sum_{i=1}^I \left( y_i \log \mu_i - \mu_i - \log y_i! \right), in which the term -\sum_i \log y_i! does not depend on \beta and can be ignored for optimization purposes. The resulting maximum likelihood estimates \hat{\beta} satisfy the score equations derived from the exponential family structure of the Poisson distribution.
Due to the nonlinearity of the log link, closed-form solutions are unavailable except in special cases such as saturated models; instead, iterative numerical methods are employed. The Newton-Raphson algorithm updates parameter estimates via \beta^{(k+1)} = \beta^{(k)} + \left( X^T W X \right)^{-1} X^T (y - \mu^{(k)}), where W is the diagonal matrix of weights \operatorname{diag}(\mu_i^{(k)}), but it can suffer from instability for sparse data. For log-linear models, this is commonly implemented as iteratively reweighted least squares (IRLS), which reframes the problem as weighted least squares on the log scale by linearizing the link function and iteratively adjusting the weights based on current fitted values. IRLS converges to the MLE under standard conditions and is the basis for software implementations such as R's glm function.[2]
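The IRLS update can be written compactly in code. The following is a minimal numpy sketch under the assumptions above (design matrix X including a column of ones, vector of counts y); in practice one would rely on established implementations such as R's glm or loglin, and the sketch omits safeguards such as step-halving for sparse tables.

```python
import numpy as np

def fit_poisson_loglinear(X, y, tol=1e-10, max_iter=50):
    """Fit log(mu) = X beta by iteratively reweighted least squares (IRLS).

    Each step solves a weighted least-squares problem with working response z
    and Poisson working weights mu, which for the canonical log link coincides
    with Newton-Raphson / Fisher scoring.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu            # working response for the log link
        W = mu                             # Poisson working weights (diagonal of W)
        XtWX = X.T @ (W[:, None] * X)
        XtWz = X.T @ (W * z)
        beta_new = np.linalg.solve(XtWX, XtWz)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, np.exp(X @ beta)          # estimates and fitted cell means
```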
Under regularity conditions—such as the model being correctly specified, positive expected cell counts, and the information matrix being positive definite—the MLE \hat{\beta} is consistent, i.e., \hat{\beta} \to_p \beta as sample size increases, and asymptotically normal: \sqrt{n} (\hat{\beta} - \beta) \to_d N(0, I(\beta)^{-1}), where I(\beta) is the Fisher information matrix. These properties enable large-sample inference, including Wald confidence intervals for parameters.[29]
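Continuing the IRLS sketch above, the asymptotic covariance matrix of \hat{\beta} is estimated by (X^T \hat{W} X)^{-1} with \hat{W} = \operatorname{diag}(\hat{\mu}_i) evaluated at the fitted means, from which approximate Wald confidence intervals follow (illustrative helper, assuming the design matrix, fitted means, and estimates from the previous sketch):

```python
import numpy as np
from scipy.stats import norm

def wald_intervals(X, mu_hat, beta_hat, level=0.95):
    """Approximate Wald confidence intervals for the coefficients of a fitted
    Poisson log-linear model, using the inverse information (X^T W X)^{-1}
    with W = diag(mu_hat)."""
    X = np.asarray(X, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    cov = np.linalg.inv(X.T @ (mu_hat[:, None] * X))   # asymptotic covariance of beta_hat
    se = np.sqrt(np.diag(cov))
    z = norm.ppf(0.5 + level / 2.0)                    # e.g. 1.96 for a 95% interval
    return np.column_stack([beta_hat - z * se, beta_hat + z * se])
```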
In certain cases, such as decomposable log-linear models where the generating class forms a simplicial complex, the MLE coincides with solutions from weighted least squares estimation on the logarithmic scale of the sufficient marginal statistics.[2]
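For example, in a three-way table the conditional-independence model [AB][BC] (A independent of C given B) is decomposable, and its maximum likelihood fitted values have the closed form \hat{\mu}_{ijk} = \frac{n_{ij+} \, n_{+jk}}{n_{+j+}}, computed directly from the sufficient marginal totals without iteration.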
Goodness-of-Fit Tests
Goodness-of-fit tests for log-linear models assess the adequacy of the fitted model in reproducing the observed contingency table frequencies, typically using statistics derived from maximum likelihood estimation. These tests compare the observed counts y_i to the expected counts \mu_i under the Poisson assumption inherent to log-linear models.[30]
The deviance statistic, also known as the likelihood ratio chi-square statistic G^2, quantifies the discrepancy between observed and fitted values as D = 2 \sum_i y_i \log \left( \frac{y_i}{\mu_i} \right), where terms with y_i = 0 are taken as zero. Under the null hypothesis of adequate fit and large sample sizes, D approximately follows a chi-squared distribution with degrees of freedom equal to the number of cells minus the number of estimated parameters. A non-significant D (e.g., p-value > 0.05) indicates that the model fits the data well.[30]
The Pearson chi-squared statistic provides an alternative measure of fit, defined as X^2 = \sum_i \frac{(y_i - \mu_i)^2}{\mu_i}. Like the deviance, X^2 is asymptotically chi-squared distributed with the same degrees of freedom for large samples, and the two statistics are asymptotically equivalent under the null hypothesis, though they can differ noticeably when some expected counts are small. Both statistics are equivalent to tests against the saturated model, which perfectly fits the data by estimating a separate parameter for each cell.[30]
Likelihood ratio tests (LRT) extend these assessments to compare nested log-linear models, such as hierarchical structures where one model is a special case of another (e.g., testing for higher-order interactions). The test statistic is the difference in deviances between the reduced and fuller models, D_{\text{reduced}} - D_{\text{full}}, which follows a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters. This approach is useful for model selection in multi-way tables, where significant differences indicate the need for additional terms.[30]
Overdispersion arises in log-linear models when the observed variability exceeds the Poisson assumption of variance equaling the mean (\text{Var}(y_i) = \mu_i), often detected when the deviance or Pearson statistic divided by its degrees of freedom exceeds 1 (e.g., values around 4 suggest substantial overdispersion). To address this, quasi-likelihood methods adjust the variance to \text{Var}(y_i) = \phi \mu_i, where the dispersion parameter \phi is estimated as the Pearson statistic divided by its degrees of freedom; standard errors are then scaled by \sqrt{\phi}, and test statistics are divided by \phi to maintain valid inference without altering the parameter estimates.[31]
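These quantities are straightforward to compute from observed and fitted counts. The following minimal Python sketch (illustrative function name) returns the deviance, the Pearson statistic, their chi-squared p-values, and a Pearson-based dispersion estimate that can be used for the quasi-likelihood adjustment described above.

```python
import numpy as np
from scipy.stats import chi2

def fit_statistics(y, mu_hat, n_params):
    """Deviance (G^2), Pearson X^2, and a dispersion estimate for a fitted
    log-linear model with fitted cell means mu_hat and n_params parameters."""
    y = np.asarray(y, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    df = y.size - n_params
    ratio = np.where(y > 0, y / mu_hat, 1.0)     # empty cells contribute zero
    G2 = 2.0 * np.sum(y * np.log(ratio))
    X2 = np.sum((y - mu_hat) ** 2 / mu_hat)
    phi_hat = X2 / df                            # values well above 1 suggest overdispersion
    return {
        "G2": G2, "X2": X2, "df": df,
        "p_G2": chi2.sf(G2, df), "p_X2": chi2.sf(X2, df),
        "dispersion": phi_hat,
    }
```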
Examples and Case Studies
Two-Way Contingency Table
A two-way contingency table arises when cross-classifying observations by two categorical variables, yielding cell counts that log-linear models can analyze for patterns of independence or association.[32] Consider data from a survey of 1091 adults on gender and belief in an afterlife, presented in the following 2×2 table:
| Gender | Belief: Yes | Belief: No | Total |
|---|---|---|---|
| Female | 435 | 147 | 582 |
| Male | 375 | 134 | 509 |
| Total | 810 | 281 | 1091 |
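For these data, the independence model \log(\mu_{ij}) = \mu + \lambda_i^{\text{Gender}} + \lambda_j^{\text{Belief}} has fitted values \hat{\mu}_{ij} = (\text{row total}_i)(\text{column total}_j)/1091; for example, \hat{\mu}_{11} = 582 \times 810 / 1091 \approx 432.1 for females answering yes. A short Python sketch of the corresponding goodness-of-fit calculation (numpy and scipy assumed available):

```python
import numpy as np
from scipy.stats import chi2

# Observed counts: rows are gender (female, male), columns are belief (yes, no).
observed = np.array([[435.0, 147.0],
                     [375.0, 134.0]])

row_totals = observed.sum(axis=1, keepdims=True)   # 582, 509
col_totals = observed.sum(axis=0, keepdims=True)   # 810, 281
n = observed.sum()                                  # 1091

# Fitted counts under independence: mu_ij = (row total)(column total) / n.
expected = row_totals @ col_totals / n

# Likelihood-ratio (G^2) and Pearson (X^2) statistics on (2-1)(2-1) = 1 df.
G2 = 2.0 * np.sum(observed * np.log(observed / expected))
X2 = np.sum((observed - expected) ** 2 / expected)
df = 1
print(G2, X2, chi2.sf(G2, df), chi2.sf(X2, df))
```

Both statistics come out to roughly 0.16 on one degree of freedom, far below conventional chi-squared critical values, so the independence model describes these counts well; equivalently, the sample odds ratio 435 \times 134 / (147 \times 375) \approx 1.06 is close to 1.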