
Generalized linear model

A generalized linear model (GLM) is a statistical modeling framework that extends the classical linear regression model to accommodate response variables whose distributions belong to the exponential family, unifying various regression techniques under a common structure. Introduced by John Nelder and Robert Wedderburn in 1972, GLMs were developed to provide a unified approach for estimating parameters and analyzing data in diverse scenarios, such as proportions, counts, and continuous positive values, where the assumptions of ordinary linear regression do not hold. This framework has three core components: a random component specifying the probability distribution of the response variable (typically from the exponential family, including the normal, binomial, Poisson, and gamma distributions); a systematic component consisting of a linear predictor formed as a linear combination of explanatory variables; and a link function that connects the expected value of the response to the linear predictor by transforming the mean to match the scale of the linear predictor. For instance, in logistic regression, a common GLM application, the link function is the logit, linking the probability of a binary outcome to predictors, while Poisson regression uses a log link for count data. GLMs are fitted using maximum likelihood estimation, often via iteratively reweighted least squares, and have become foundational across applied fields for handling non-normal data while maintaining interpretability of coefficients as changes on the link scale.

Introduction

Intuition

The classical linear regression model serves as a foundational special case of the generalized linear model (GLM), where the response variable is assumed to follow a normal distribution and the mean is directly equal to a linear combination of the predictors, akin to drawing a straight line through data points to predict continuous outcomes like house prices or temperatures. This approach works well when responses vary symmetrically around the mean without bounds, but real-world data often defies such assumptions, such as when outcomes are binary (yes/no) or counts (number of events), leading to skewed or bounded distributions that violate normality. GLMs address this limitation by incorporating a link function, which acts like a flexible translator: it connects the linear combination of predictors (a weighted sum capturing the influence of variables like dosage or exposure) to the mean of the response in a way that respects the data's natural variability. For instance, in predicting whether a customer buys a product (a binary response), the logit link transforms the linear predictor into a probability between 0 and 1, avoiding impossible predictions like negative probabilities; similarly, for counting events like hospital visits, a log link ensures the mean is positive and aligns with count-like fluctuations. This allows GLMs to handle non-normal responses without forcing awkward data manipulations, much like adjusting a map's projection to fit irregular terrain rather than stretching a flat map.

To illustrate the core relationships, the data flow in a GLM can be summarized as:

Predictors (e.g., features like dosage, exposure) → Linear predictor (weighted sum) → Link function (transforms to fit the response scale) → Mean of response (expected value) → Distribution (e.g., binomial for binary data, Poisson for counts) → Observed response

This structure highlights how inputs systematically influence outcomes through intermediary steps tailored to the data type. A primary advantage of GLMs lies in their ability to unify diverse regression techniques, ranging from ordinary least squares for continuous data to logistic regression for binary outcomes and Poisson regression for counts, within a cohesive framework built around exponential family distributions, enabling consistent estimation and inference across applications. Exponential family distributions provide the probabilistic foundation for modeling response variability in this setup.
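As a concrete illustration of this data flow, the following minimal sketch (pure NumPy; the dosage values and coefficients are invented for demonstration) traces a count response from predictors through the linear predictor, the inverse log link, and a Poisson draw:

```python
import numpy as np

# Toy sketch of the GLM data flow for a Poisson count response (all numbers made up).
rng = np.random.default_rng(9)
dosage = np.array([0.5, 1.0, 1.5, 2.0])          # predictor
beta = np.array([0.2, 0.9])                      # intercept and slope (the weights)

eta = beta[0] + beta[1] * dosage                 # linear predictor (weighted sum)
mu = np.exp(eta)                                 # inverse of the log link: mean counts
y = rng.poisson(mu)                              # response drawn from the Poisson distribution
print(eta, mu, y)
```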

Historical Development

The foundations of generalized linear models (GLMs) trace back to the early twentieth century, particularly Ronald A. Fisher's work on sufficiency and maximum likelihood in his 1922 paper, which laid the theoretical groundwork for the exponential family of distributions by characterizing distributions in a form amenable to likelihood-based inference, influencing later extensions to non-normal contexts. Fisher's contributions in the 1920s and 1930s emphasized sufficient statistics and the structure of variance in these families, setting the stage for broader applications beyond normal linear models. The formal introduction of GLMs occurred in 1972 through the seminal paper by John A. Nelder and Robert W. M. Wedderburn, titled "Generalized Linear Models," published in the Journal of the Royal Statistical Society, Series A. In this work, they proposed a class of models that extended classical linear regression to response variables following exponential family distributions, incorporating a linear predictor and a link function to connect the mean response to covariates. This framework was initially implemented in the GENSTAT statistical software package, developed at Rothamsted Experimental Station, where Nelder headed the statistics department, enabling practical computation via iterative weighted least squares. Building on this, Wedderburn introduced quasi-likelihood methods in 1974, relaxing the full distributional assumptions to focus on mean-variance relationships, which further broadened the applicability of GLM-like approaches to non-exponential family data. The 1980s marked significant expansions, including the development of generalized estimating equations (GEEs) by Kung-Yee Liang and Scott L. Zeger in 1986, which adapted GLM estimation for correlated data in longitudinal and clustered studies while maintaining robust inference. GLMs gained widespread adoption through integration into major statistical software: SAS introduced PROC GENMOD in version 6.09 (1993) for fitting GLMs and GEEs, while the R programming language incorporated the glm() function in its early releases during the 1990s, facilitating accessible implementation for diverse users. In the 2000s, GLMs influenced machine learning, notably through the GLMNET package (introduced in 2008), which added lasso and elastic-net regularization to GLM estimation for high-dimensional data and variable selection. These milestones underscore GLMs' evolution from theoretical construct to a cornerstone of modern statistical and computational practice.

Overview

A generalized linear model (GLM) provides a unified approach to regression modeling by extending the classical linear model to handle response variables that follow distributions other than the normal, such as counts or proportions. The framework specifies that each independent response Y_i (for i = 1, \dots, n) arises from a distribution in the exponential family, with its mean \mu_i = E(Y_i) connected to a linear predictor \eta_i = \mathbf{x}_i^T \boldsymbol{\beta} via a monotonic link function g, such that g(\mu_i) = \eta_i. This structure allows the mean of the response to vary flexibly across observations while maintaining a linear relationship in the parameter space. The GLM framework rests on three fundamental assumptions: the linear predictor is linear in the unknown parameters \boldsymbol{\beta}; the observations Y_i are independent; and both the conditional distribution of the response and the link function are correctly specified. These assumptions enable the model to capture systematic variation in the data without requiring the stringent normality and constant-variance conditions of ordinary least squares regression. Violations of these assumptions can lead to biased estimates or invalid inferences, underscoring the importance of diagnostic checks in practice. To apply a GLM, one first selects a distribution from the exponential family that matches the nature of the response variable, then chooses an appropriate link function, often the canonical one associated with the distribution, to connect the mean to the predictors. Parameter estimation typically proceeds via maximum likelihood methods, followed by model assessment to evaluate goodness-of-fit and predictive performance. This facilitates the modeling of diverse response types, including Poisson-distributed counts or binomial proportions, within a consistent statistical framework. GLMs offer key advantages over ad-hoc or separate modeling strategies for non-normal data, including the interpretability of coefficients on the transformed scale of the link function and a cohesive likelihood-based approach to inference, such as hypothesis testing and confidence intervals. This unification simplifies software implementation and theoretical development, making GLMs a cornerstone of statistical modeling.
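The following sketch illustrates this workflow (choose a family, choose a link, fit by maximum likelihood, inspect fit), assuming the Python statsmodels library; the simulated data, coefficient values, and the choice of a gamma family with a log link are purely illustrative:

```python
import numpy as np
import statsmodels.api as sm

# Illustrative data: a positive continuous response with a log-linear mean.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 2, size=n)
mu = np.exp(0.5 + 0.8 * x)                # hypothetical true mean on the log scale
y = rng.gamma(shape=2.0, scale=mu / 2.0)  # gamma response with mean mu

# Specify the GLM: gamma family with a log link, intercept plus one covariate.
X = sm.add_constant(x)
model = sm.GLM(y, X, family=sm.families.Gamma(link=sm.families.links.Log()))
result = model.fit()                      # maximum likelihood via IRLS

print(result.params)    # coefficients on the log (link) scale
print(result.deviance)  # deviance as a goodness-of-fit summary
```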

Core Components

Probability Distribution

In generalized linear models (GLMs), the response variable Y is assumed to follow a distribution from the exponential family, which provides a unified framework for modeling various types of data, including continuous, discrete, and count data. This family encompasses a wide range of distributions commonly encountered in statistical applications, allowing the mean and variance of Y to be linked in a manner that facilitates estimation and inference. The general form of a density or mass function for a random variable Y in the exponential family, as used in GLMs, is given by f(y; \theta, \phi) = \exp\left[ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right], where \theta is the natural (or canonical) parameter controlling the location, \phi is the dispersion parameter influencing the scale, b(\cdot) and c(\cdot, \cdot) are known functions specific to the distribution, and a(\cdot) is typically \phi / w with w a known prior weight (often 1). This parameterization ensures that the log-likelihood contributions are linear in the sufficient statistic y, simplifying computations in maximum likelihood estimation. A key property of this form is the relationship between the mean and variance of Y. The expected value is \mathbb{E}(Y) = b'(\theta) = \mu, the derivative of b with respect to \theta, while the variance is \mathrm{Var}(Y) = b''(\theta) \cdot a(\phi) = V(\mu) \cdot a(\phi), where V(\mu) = b''(\theta) is the variance function depending only on the mean. This structure parameterizes how variance relates to the mean, which is crucial for handling heteroscedasticity in non-normal data. Several common distributions belong to this family and are frequently used in GLMs, each with distinct mean-variance relationships suited to specific data types. The following table summarizes key examples:
| Distribution | Natural parameter \theta | Cumulant function b(\theta) | Mean \mu = b'(\theta) | Variance function V(\mu) | Dispersion \phi |
|---|---|---|---|---|---|
| Normal | \mu | \theta^2 / 2 | \theta | 1 (constant) | \sigma^2 (scale) |
| Poisson | \log \lambda | \exp(\theta) | \exp(\theta) | \mu | 1 (no dispersion) |
| Binomial(n, p) | \log(p/(1-p)) | n \log(1 + \exp(\theta)) | n / (1 + \exp(-\theta)) | \mu (1 - \mu/n) | 1 (no dispersion) |
| Gamma | -1/\mu | -\log(-\theta) | -1/\theta | \mu^2 | 1/\nu (reciprocal of the shape parameter) |
For the Normal distribution, the constant variance is ideal for symmetric continuous data with homoscedasticity. The Poisson distribution, with variance equal to the mean, suits non-negative count data where overdispersion may occur if this assumption is violated. The binomial distribution models binary or proportion data, with variance \mu(1 - \mu/n) reflecting the bounded nature of success probabilities. The gamma distribution, featuring quadratic variance in the mean, is appropriate for positive continuous data like waiting times or claim sizes. The exponential family is central to GLMs because it admits sufficient statistics for \theta, reducing data dimensionality and enabling efficient iterative estimation algorithms like iteratively reweighted least squares. It also naturally defines canonical link functions where the linear predictor directly equals \theta, ensuring desirable statistical properties such as the uniqueness of maximum likelihood estimates under regularity conditions. For cases where the full distribution is unspecified but the mean-variance relationship is known, quasi-likelihood methods extend the framework by using the same estimating equations without assuming the exponential form, broadening applicability to overdispersed or misspecified models.
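To make the cumulant-function identities concrete, a small numerical sketch (pure NumPy; the parameter value is arbitrary) can check that differentiating b(\theta) = \exp(\theta) for the Poisson case recovers both the mean and the variance:

```python
import numpy as np

# Numerical check of E(Y) = b'(theta) and Var(Y) = b''(theta) * a(phi)
# for the Poisson case, where b(theta) = exp(theta) and a(phi) = 1.

def b(theta):
    return np.exp(theta)          # Poisson cumulant function

theta = np.log(3.5)               # natural parameter corresponding to a mean of 3.5
h = 1e-5
b_prime = (b(theta + h) - b(theta - h)) / (2 * h)                      # ~ mu
b_double_prime = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2   # ~ V(mu)

print(b_prime)         # approximately 3.5, the mean
print(b_double_prime)  # approximately 3.5, the variance (equal to the mean)
```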

Linear Predictor

In a generalized linear model (GLM), the linear predictor represents the systematic component that captures the effects of explanatory variables on the response, serving as the input to the link function. For the i-th observation, it is expressed as \eta_i = \beta_0 + \sum_{j=1}^p \beta_j x_{ij} = \mathbf{X}_i \boldsymbol{\beta}, where \mathbf{X}_i = (1, x_{i1}, \dots, x_{ip}) is the row vector of covariates (including an intercept term), and \boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_p)^T is the vector of unknown coefficients. This formulation assumes a linear relationship in the parameter space on the scale defined by the link function. The coefficients \beta_j in the linear predictor have a straightforward interpretation: each \beta_j quantifies the change in \eta_i associated with a one-unit increase in the corresponding covariate x_{ij}, holding all other covariates constant. This partial effect highlights the model's additive structure, where the total predicted value \eta_i is the sum of individual contributions from each predictor. However, this interpretation applies directly on the link scale, and the link function subsequently connects \eta_i to the expected response \mu_i. A key assumption underlying the linear predictor is additivity on the link scale, meaning the predictors enter the model additively through their coefficients without nonlinearity in the parameter specification. Multicollinearity among covariates can undermine the stability of this structure by inflating the variance of coefficient estimates, leading to unreliable inferences about individual \beta_j, though the overall model fit may remain unaffected. To enhance interpretability, particularly for the intercept \beta_0, centering covariates (subtracting their means from the original values) is often recommended, as it makes \beta_0 represent the predicted \eta at average covariate levels. Within GLMs, the linear predictor can be extended to accommodate more complex relationships by incorporating interaction terms (e.g., \beta_{jk} x_{ij} x_{ik}) or polynomial terms (e.g., \beta_{j2} x_{ij}^2) as additional covariates in \mathbf{X}_i, thereby allowing for non-additive or nonlinear effects on the link scale without altering the fundamental linear structure in \boldsymbol{\beta}. These extensions maintain the model's parsimony while improving fit for data exhibiting such patterns.

Link Function

In generalized linear models (GLMs), the link function g provides a monotonic, differentiable transformation that connects the mean of the response \mu = E(Y) to the linear predictor \eta = X\beta, defined as \eta = g(\mu), or equivalently \mu = g^{-1}(\eta). This transformation ensures that \mu remains within the valid support of the response distribution, such as (0, 1) for proportions in binomial models or (0, \infty) for positive continuous data in gamma models. The link function thus bridges the linear structure of the predictors to the potentially nonlinear mean-response relationship dictated by the chosen distribution. The canonical link function holds a special status in GLMs, as it corresponds to the natural parameter of the distribution, where \eta = \theta. Formally, for distributions in the exponential family with density f(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\}, the canonical link is g(\mu) = b'^{-1}(\mu), the inverse of the derivative of the cumulant function b(\theta). This choice simplifies estimation and inference, as it aligns the linear predictor directly with the natural parameter that governs the distribution's mean-variance relationship.
Common examples include the identity link g(\mu) = \mu for the normal distribution, the log link g(\mu) = \log(\mu) for the Poisson distribution, and the logit link g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right) for the binomial distribution. These canonical links were central to the foundational formulation of GLMs. Non-canonical link functions deviate from this natural parameterization, such as using the identity link with a Poisson distribution or the probit link g(\mu) = \Phi^{-1}(\mu) for binomial data, where \Phi denotes the cumulative distribution function of the standard normal distribution. While non-canonical links can offer advantages like improved model fit in specific datasets by better capturing the mean structure, they introduce challenges, including more complex computational requirements for estimation and reduced interpretability of coefficients, as the linear predictor no longer directly corresponds to the natural parameter. For instance, the probit link may correspond more closely to underlying latent processes in binary data but complicates standard deviance-based diagnostics compared to the logit. The justification for adopting non-canonical links depends on whether the gains in fit outweigh the added analytical effort. Selection of the link function involves a balance of theoretical, empirical, and practical considerations. Theoretically, certain links achieve variance stabilization, transforming the response to approximate constant variance across levels of the mean, which can enhance the validity of inferences; for example, the square-root link stabilizes variance in Poisson models. Empirically, information criteria such as the Akaike information criterion (AIC) or Bayesian information criterion (BIC) are used to compare models with different links by penalizing complexity while rewarding goodness-of-fit, often favoring the link that minimizes these values. Domain knowledge also plays a key role, guiding choices based on the substantive interpretation of the mean-response relationship, such as preferring the logit link for odds ratios in binary outcomes.
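A short NumPy/SciPy sketch (the linear-predictor values are arbitrary) illustrates how the logit, probit, and log links map between the linear predictor and the mean, and how applying the link to the mean recovers the linear predictor:

```python
import numpy as np
from scipy.stats import norm

# Illustrative link functions and their inverses.
eta = np.linspace(-3, 3, 7)          # values of the linear predictor

# Logit link (canonical for the binomial): mu = 1 / (1 + exp(-eta))
mu_logit = 1.0 / (1.0 + np.exp(-eta))

# Probit link (non-canonical for the binomial): mu = Phi(eta)
mu_probit = norm.cdf(eta)

# Log link (canonical for the Poisson): mu = exp(eta)
mu_log = np.exp(eta)

# Applying each link to the mean recovers the linear predictor.
assert np.allclose(np.log(mu_logit / (1 - mu_logit)), eta)   # logit
assert np.allclose(norm.ppf(mu_probit), eta)                 # probit
assert np.allclose(np.log(mu_log), eta)                      # log
```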

Estimation Methods

Maximum Likelihood Estimation

Maximum likelihood estimation (MLE) is the primary frequentist method for fitting generalized linear models (GLMs), providing estimates of the regression coefficients by maximizing the likelihood derived from the assumed exponential family distribution. The likelihood function for the parameters is expressed as L(\beta) = \prod_{i=1}^n f(y_i; \theta(\mu(X_i \beta)), \phi), where f denotes the probability density or mass function from the exponential family, \theta is the natural parameter linked to the mean \mu, X_i is the i-th row of the design matrix, and \phi is the dispersion parameter. The corresponding log-likelihood is l(\beta) = \sum_{i=1}^n \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi), with b and a as known functions specific to the distribution, and c independent of \beta. Maximizing l(\beta) directly is often infeasible due to the nonlinearity introduced by the link function, so numerical optimization is required. The standard algorithm for MLE in GLMs is iteratively reweighted least squares (IRLS), which iteratively solves weighted least squares problems to approximate the score equations. Starting with initial estimates of \mu_i (often from ordinary least squares), IRLS updates a working response z_i = \eta_i + (y_i - \mu_i) \frac{d\eta_i}{d\mu_i}, where \eta_i = g(\mu_i) is the linear predictor, and then fits z = X\beta by weighted least squares with weights w_i = \frac{(d\mu_i / d\eta_i)^2}{V(\mu_i)}, where V is the variance function; the process repeats until convergence. Convergence is assessed by monitoring the relative change in \beta or the log-likelihood, typically stopping when the maximum change is less than a tolerance like 10^{-8} or after a fixed number of iterations to prevent non-convergence in edge cases. Once fitted, inference on \beta relies on the asymptotic normality of the MLE, \hat{\beta} \sim N(\beta, I(\beta)^{-1}) approximately, where the Fisher information matrix is I(\beta) = X^T W X and W is the diagonal matrix of weights from the final IRLS iteration. Wald tests for hypotheses like H_0: \beta_j = 0 use the statistic z = \hat{\beta}_j / \mathrm{se}(\hat{\beta}_j), with standard errors \mathrm{se}(\hat{\beta}_j) from the diagonal of I(\hat{\beta})^{-1}, approximating a standard normal under the null. Likelihood ratio tests for nested models compare -2[l(\hat{\beta}_{\text{reduced}}) - l(\hat{\beta}_{\text{full}})] \sim \chi^2_{\Delta p}, where \Delta p is the difference in the number of parameters. For model selection and assessment, the deviance D = 2[l(\text{saturated}) - l(\text{fitted})] quantifies the discrepancy between the fitted model and a saturated model (one parameter per observation), following approximately \chi^2_{n-p} for large samples under correct specification. The Akaike information criterion extends this for comparing non-nested models via AIC = -2l(\hat{\beta}) + 2p, penalizing complexity to favor parsimonious fits with good predictive accuracy. Overdispersion, where observed variability exceeds that implied by the model (e.g., variance greater than the mean for Poisson counts), is detected by checking whether the estimated dispersion \hat{\phi} = D / (n - p) \approx 1; values \hat{\phi} > 1 indicate overdispersion, prompting scale adjustments or alternative distributions.
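The following minimal IRLS sketch (pure NumPy; simulated data, no safeguards for edge cases) implements the update described above for a Poisson GLM with the canonical log link, where the weights simplify to w_i = \mu_i and the working response to z_i = \eta_i + (y_i - \mu_i)/\mu_i:

```python
import numpy as np

# Minimal IRLS sketch for a Poisson GLM with the canonical log link.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
beta_true = np.array([0.3, 0.7])
y = rng.poisson(np.exp(X @ beta_true))

beta = np.zeros(X.shape[1])           # starting values
for _ in range(25):
    eta = X @ beta                    # linear predictor
    mu = np.exp(eta)                  # inverse log link
    # For the log link: d(eta)/d(mu) = 1/mu and V(mu) = mu, so the weights are mu.
    z = eta + (y - mu) / mu           # working response
    W = mu                            # IRLS weights
    XtWX = X.T @ (W[:, None] * X)
    XtWz = X.T @ (W * z)
    beta_new = np.linalg.solve(XtWX, XtWz)
    if np.max(np.abs(beta_new - beta)) < 1e-8:
        beta = beta_new
        break
    beta = beta_new

se = np.sqrt(np.diag(np.linalg.inv(XtWX)))   # Wald standard errors from X^T W X
print(beta, se)
```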

Bayesian Estimation

Bayesian estimation for generalized linear models (GLMs) treats the model parameters, such as the coefficients \beta, as random variables and computes their posterior distribution given the observed data y. The posterior density is given by p(\beta \mid y) \propto L(\beta \mid y) \pi(\beta), where L(\beta \mid y) is the likelihood derived from the exponential family distribution of the response, and \pi(\beta) is the prior on the coefficients. This approach incorporates prior knowledge or regularization directly into the analysis, contrasting with maximum likelihood estimation by yielding a full posterior distribution rather than point estimates. For GLMs based on exponential family distributions, conjugate priors facilitate closed-form or analytically tractable posteriors. One class of such priors specifies a prior on \beta conditional on hyperparameters, leading to a posterior that maintains conjugacy and allows straightforward elicitation based on predictive distributions for the mean response. For the normal linear GLM (e.g., linear regression with the identity link), a normal prior on \beta is conjugate, resulting in a normal posterior when combined with the Gaussian likelihood. These priors are particularly useful in low-data scenarios, as they enable exact inference without approximation. In cases where priors are non-conjugate, such as for logistic or Poisson GLMs with complex priors, Markov chain Monte Carlo (MCMC) methods are employed to sample from the posterior. Gibbs sampling, implemented in software like JAGS, iteratively draws from conditional posteriors for each parameter. For more efficient exploration of high-dimensional posteriors, Hamiltonian Monte Carlo (HMC), as used in Stan, leverages gradient information to propose distant moves with high acceptance rates, often via the No-U-Turn Sampler variant. These methods enable reliable posterior summaries even for non-standard GLMs. Bayesian estimation offers several advantages over frequentist approaches, including the ability to incorporate uncertainty in the dispersion parameter \phi (e.g., in overdispersed models) directly into the posterior, and to perform model averaging by placing priors over multiple candidate models. Credible intervals, derived from the posterior quantiles, provide direct probabilistic interpretations of parameter uncertainty, unlike confidence intervals which rely on asymptotic approximations. In hierarchical Bayesian GLMs, random effects for clustered data are modeled as latent variables with hyperpriors, allowing information sharing across groups to improve estimates; fuller treatments appear in extensions for correlated data.
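As a bare-bones illustration of posterior sampling for a non-conjugate case, the sketch below (pure NumPy; the prior scale, proposal scale, and simulated data are all invented) runs a random-walk Metropolis sampler for a Bayesian logistic GLM with independent normal priors on the coefficients. It is a toy stand-in for the Gibbs and HMC samplers mentioned above:

```python
import numpy as np

# Random-walk Metropolis sketch for a Bayesian logistic GLM with N(0, 25) priors.
rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.2])
p = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = rng.binomial(1, p)

def log_posterior(beta):
    eta = X @ beta
    log_lik = np.sum(y * eta - np.log1p(np.exp(eta)))   # Bernoulli log-likelihood
    log_prior = -0.5 * np.sum(beta**2) / 25.0            # independent N(0, 25) priors
    return log_lik + log_prior

beta = np.zeros(2)
current_lp = log_posterior(beta)
samples = []
for _ in range(5000):
    proposal = beta + rng.normal(scale=0.1, size=2)      # random-walk proposal
    proposal_lp = log_posterior(proposal)
    if np.log(rng.uniform()) < proposal_lp - current_lp: # Metropolis acceptance step
        beta, current_lp = proposal, proposal_lp
    samples.append(beta.copy())

samples = np.array(samples[1000:])                       # discard burn-in
print(samples.mean(axis=0))                              # posterior means
print(np.percentile(samples, [2.5, 97.5], axis=0))       # 95% credible intervals
```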

Key Examples

General Linear Models

The general linear model represents a foundational special case of the broader generalized linear model framework, unifying classical approaches to regression and analysis of variance under a normal distribution for the response variable and an identity link function. In this configuration, the mean of the response directly equals the linear predictor, enabling straightforward modeling of continuous outcomes assumed to be normally distributed. This model underpins much of traditional inference for linear relationships, providing the basis from which more flexible generalizations extend to non-normal data. The mathematical form of the general linear model is given by \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, where \mathbf{Y} is the n \times 1 vector of observations, \mathbf{X} is the n \times p design matrix of predictors, \boldsymbol{\beta} is the p \times 1 vector of unknown parameters, and \boldsymbol{\epsilon} is the n \times 1 error vector distributed as \boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}), with \sigma^2 denoting the constant error variance and \mathbf{I} the identity matrix. The identity link ensures that the mean \mu = \mathbf{X}\boldsymbol{\beta}, aligning the linear predictor directly with the response scale. Parameter estimation typically employs ordinary least squares (OLS), which minimizes the sum of squared residuals and yields the estimator \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}. Under the normality assumption, this OLS estimator is equivalent to the maximum likelihood estimator, ensuring optimal properties such as unbiasedness and minimum variance among unbiased estimators. The validity of inference in the general linear model hinges on three core assumptions: normality of the errors, homoscedasticity (constant variance across levels of the predictors), and independence of observations. Normality implies that \boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}), supporting exact tests and confidence intervals; violations can lead to biased standard errors or invalid hypothesis tests. Homoscedasticity requires \text{Var}(\epsilon_i) = \sigma^2 for all i, while independence assumes no correlation among errors, often justified by random sampling. Analysis of variance (ANOVA) arises as a particular instance of this model when predictors are categorical, partitioning total variance into components attributable to factors and interactions, thus testing group mean differences equivalently to multiple regression on dummy variables. This classical setup transitions seamlessly into the generalized linear model paradigm when the normality assumption does not hold, such as for bounded or count data; here, the linear predictor \eta = \mathbf{X}\boldsymbol{\beta} is retained, but paired with an alternative distribution from the exponential family and a suitable link function g(\mu) = \eta to accommodate the response's scale and variance structure.
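A small NumPy sketch (simulated data; values chosen only for illustration) computes the OLS estimator given above, which coincides with the maximum likelihood estimate under normal errors:

```python
import numpy as np

# OLS estimation for the general linear model on simulated data.
rng = np.random.default_rng(3)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.8, size=n)           # normal, homoscedastic errors

# OLS / maximum likelihood estimate under normality: (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Unbiased estimate of the error variance sigma^2
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - X.shape[1])
print(beta_hat, sigma2_hat)
```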

Linear Regression

Linear regression represents a fundamental application of the generalized linear model framework to continuous response variables, where the response Y is assumed to follow a normal distribution with constant variance, and the identity link function is employed. In this setup, the conditional mean of the response is directly specified as \mu = X\beta, where X is the design matrix of predictors and \beta is the vector of regression coefficients. This formulation allows the model to capture linear relationships between predictors and the expected response without transformation, making it suitable for modeling continuous outcomes such as heights, weights, or experimental measurements. As a special case of general linear models, it emphasizes predictive modeling for normally distributed errors. The interpretation of coefficients in linear regression is straightforward and intuitive within the GLM context. Each \beta_j quantifies the expected change in the response Y for a one-unit increase in the corresponding predictor x_j, while holding all other predictors constant. This additive structure facilitates clear causal or associative interpretation when assumptions hold, such as in econometric analyses of policy effects on outcomes or biological studies of environmental factors on growth rates. Confidence intervals and hypothesis tests for \beta_j rely on the normality assumption to ensure valid inference. Model diagnostics are essential for validating the assumptions of linearity, normality, homoscedasticity, and independence in linear regression. Studentized residuals, which scale raw residuals by their estimated standard errors, are plotted against fitted values or predictors to detect non-linearity, outliers, or heteroscedasticity; deviations from a horizontal band around zero indicate violations. Quantile-quantile (Q-Q) plots compare the ordered studentized residuals to theoretical quantiles from a normal distribution, with points aligning closely to the reference line supporting the normality assumption. Influence measures like DFBETAS assess the impact of individual observations on specific coefficients, with values exceeding 2/\sqrt{n} in absolute terms flagging potentially influential points that could distort estimates. To address multicollinearity among predictors, the variance inflation factor (VIF) is computed for each coefficient as 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the remaining predictors; VIF values greater than 5 or 10 signal problematic collinearity that inflates standard errors and reduces interpretability. In the presence of heteroscedasticity, where residual variance varies with fitted values, robust standard errors provide consistent estimates of coefficient variability by adjusting the covariance matrix to account for non-constant variance without altering the point estimates. These diagnostics collectively ensure the reliability of linear regression models for continuous responses.
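The VIF calculation can be written directly from its definition, as in the following sketch (pure NumPy; the predictors are simulated so that two of them are strongly collinear):

```python
import numpy as np

# Variance inflation factors computed from scratch on simulated predictors.
rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)    # strongly collinear with x1
x3 = rng.normal(size=n)
predictors = np.column_stack([x1, x2, x3])

def vif(j, Xp):
    # Regress predictor j on the remaining predictors and return 1 / (1 - R^2).
    y = Xp[:, j]
    others = np.column_stack([np.ones(len(y)), np.delete(Xp, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ coef
    r2 = 1 - resid @ resid / np.sum((y - y.mean())**2)
    return 1.0 / (1.0 - r2)

print([round(vif(j, predictors), 1) for j in range(predictors.shape[1])])
# x1 and x2 show large VIFs; x3 stays close to 1.
```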

Logistic Regression for Binary Data

Logistic regression models binary response data within the generalized linear model framework, where the response variable follows a binomial distribution with a single trial (n=1), equivalent to a Bernoulli distribution. In this setup, the expected value μ represents the probability of success p for each observation, and the logit link function transforms this probability to the linear predictor η via g(μ) = log(μ / (1 - μ)) = η = Xβ, where X is the design matrix and β are the coefficients. This link ensures that predicted probabilities remain between 0 and 1, making it suitable for classification tasks such as predicting disease presence or customer churn. Estimation of the parameters β typically employs maximum likelihood via iteratively reweighted least squares (IRLS), which iteratively fits weighted linear regressions to approximate the likelihood. The log-likelihood for logistic regression is l(β) = ∑ [y_i log(μ_i) + (1 - y_i) log(1 - μ_i)], maximized by updating weights based on the variance μ(1 - μ) and working residuals. Model fit is assessed using deviance, defined as D = 2 [l(saturated) - l(fitted)], which follows an approximate chi-squared distribution under the null hypothesis of adequate fit; the Hosmer-Lemeshow test further evaluates calibration by grouping observations and comparing observed to expected frequencies across deciles of predicted risk. Coefficients β_j are interpreted as the change in the log-odds of success for a one-unit increase in predictor x_j, holding others constant; the odds ratio exp(β_j) quantifies the multiplicative change in odds. Predicted probabilities are obtained by applying the inverse logit function σ(η) = 1 / (1 + exp(-η)) to the linear predictor, enabling direct probability estimates for decision-making in applications like diagnostics. Alternative link functions for binary GLMs include the probit link g(μ) = Φ^{-1}(μ) = η, where Φ^{-1} is the inverse of the standard normal cumulative distribution function, often chosen for its latent variable interpretation assuming an underlying threshold model. The complementary log-log link g(μ) = log(-log(1 - μ)) = η suits asymmetric data, such as rare events, and aligns with extreme value distributions in survival contexts. The choice among logit, probit, and cloglog depends on theoretical motivations, with the logit often preferred for its direct odds-based interpretation and computational stability in practice.
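The sketch below (assuming the Python statsmodels library; the simulated data and coefficient values are invented) fits a logistic GLM and converts the estimated coefficients to odds ratios and predicted probabilities:

```python
import numpy as np
import statsmodels.api as sm

# Logistic GLM on a simulated binary outcome with one covariate.
rng = np.random.default_rng(5)
n = 400
age = rng.uniform(20, 70, size=n)
eta = -4.0 + 0.08 * age                       # hypothetical true log-odds
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

X = sm.add_constant(age)
fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()

odds_ratios = np.exp(fit.params)              # multiplicative change in odds per unit increase
predicted_p = fit.predict(X)                  # inverse logit of the fitted linear predictor
print(odds_ratios)
print(fit.deviance)                           # residual deviance as a fit summary
```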

Poisson Regression for Count Data

Poisson regression is a specific type of generalized linear model used to analyze count data, where the response variable represents the number of events occurring in a fixed interval of time or space. The model assumes that the counts follow a Poisson distribution, which is part of the exponential family, with the mean equal to the variance. In this setup, the linear predictor η is linked to the mean μ via the canonical log link function, g(μ) = log(μ) = η = Xβ, where X is the design matrix and β are the coefficients. The baseline rate is given by exp(β₀), representing the expected count when all predictors are zero. The coefficients βⱼ in Poisson regression are interpreted as log-rate ratios; a unit increase in predictor xⱼ multiplies the expected count by exp(βⱼ), holding other variables constant. To account for varying exposure levels, such as different observation times or population sizes, an offset term log(E) is included in the linear predictor, where E is the exposure; this adjusts the model to estimate rates rather than absolute counts. For example, in modeling disease incidence, log(time) serves as the offset to normalize counts by observation period. A key assumption of the Poisson model is equidispersion, where the variance equals the mean (Var(Y) = μ). Overdispersion occurs when the variance exceeds the mean (Var(Y) > μ), often due to unobserved heterogeneity, leading to underestimated standard errors in standard Poisson regression. One approach is the quasi-Poisson model, which retains the Poisson structure but allows variance Var(Y) = φμ, where φ > 1 is a dispersion parameter estimated from the data; this was introduced to handle mild overdispersion without altering the mean structure. An alternative is the negative binomial regression model, which explicitly models overdispersion via Var(Y) = μ + αμ², where α ≥ 0 is a dispersion parameter; when α = 0, it reduces to the Poisson model. Model diagnostics for Poisson regression include Pearson residuals, defined as rᵢ = (yᵢ - μᵢ) / √μᵢ, which standardize deviations for assessing fit and influence; these residuals should approximate a standard normal distribution under the model. Rootograms provide a graphical diagnostic by plotting the square roots of observed and fitted frequencies against count values, helping visualize deviations in the distribution shape, such as overdispersion or poor fit at specific counts. For data with excess zeros beyond Poisson expectations, zero-inflated Poisson models extend the framework by incorporating a separate process for structural zeros, typically using a logit link for the probability of a structural zero and a Poisson component for positive counts.
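The sketch below (assuming the Python statsmodels library; the simulated exposures, covariate, and rates are invented) fits a Poisson rate model with a log-exposure offset and then checks for overdispersion using Pearson residuals:

```python
import numpy as np
import statsmodels.api as sm

# Poisson rate model with an offset for varying exposure (e.g., person-time observed).
rng = np.random.default_rng(6)
n = 300
exposure = rng.uniform(0.5, 5.0, size=n)        # observation time per unit
x = rng.normal(size=n)
rate = np.exp(-1.0 + 0.6 * x)                   # hypothetical event rate per unit exposure
y = rng.poisson(rate * exposure)                # counts scale with exposure

X = sm.add_constant(x)
fit = sm.GLM(y, X, family=sm.families.Poisson(), offset=np.log(exposure)).fit()

mu = fit.fittedvalues                           # fitted means (including the offset)
pearson_resid = (y - mu) / np.sqrt(mu)          # Pearson residuals
dispersion = np.sum(pearson_resid**2) / (n - X.shape[1])
print(np.exp(fit.params))                       # estimated rate ratios per unit change
print(dispersion)                               # values well above 1 suggest overdispersion
```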

Multinomial Regression

Multinomial regression within the framework of generalized linear models addresses response variables that are categorical with more than two possible outcomes, encompassing both unordered (nominal) and ordered (ordinal) cases. For nominal responses, the model predicts the probability of each category relative to a baseline, while for ordinal responses, it accounts for the inherent ordering by modeling cumulative probabilities. This approach builds on the GLM framework, using the multinomial distribution for the response and an appropriate link function to connect the linear predictor to the category probabilities. In the case of unordered categories, the multinomial logit model is employed, with one category designated as the baseline for comparison. The probability of observing category j given covariates X is given by P(Y = j) = \frac{\exp(\eta_j)}{\sum_k \exp(\eta_k)}, where \eta_j = X \beta_j for j = 1, \dots, J-1, and the baseline category has \eta_J = 0. This formulation ensures the probabilities sum to 1 across all J categories. For ordered categories, the cumulative logit model is typically used, which applies the logit link to the cumulative probabilities P(Y \leq j). Under the proportional odds assumption, the log-odds ratios are constant across cumulative cuts, expressed as \log \left( \frac{P(Y \leq j)}{1 - P(Y \leq j)} \right) = \alpha_j - X \beta, where \beta represents common coefficients for the covariates. A probit link can alternatively be applied for the cumulative model, though the logit is more common. Estimation of parameters in multinomial regression proceeds via maximum likelihood, where the log-likelihood is constructed from the multinomial probabilities derived through the softmax function, equivalent to the expression above for P(Y = j). The softmax ensures normalized outputs suitable for probabilistic interpretation during optimization. For inference on category-specific effects, Wald tests compare coefficients across outcomes, assessing whether predictors differ significantly in their impact on category probabilities relative to the baseline. Interpretation focuses on the exponentiated coefficients for practical insight. In the multinomial logit model, \exp(\beta_{jk}) yields relative risk ratios (or, more precisely, odds ratios relative to the baseline) indicating the change in odds of category j versus the baseline for a one-unit increase in covariate k, holding others constant. For the ordered cumulative logit, \exp(\beta_k) represents the common odds ratio across thresholds, quantifying how covariates shift the cumulative probabilities and thus the location of the response on the ordinal scale. These interpretations emphasize relative effects, aiding in understanding predictor influences on category selection or ordering. This setup generalizes binary logistic regression to multiple categories, reducing to the binary case when J = 2.
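The softmax formulation can be computed directly, as in this sketch (pure NumPy; the coefficients, covariate values, and number of categories are made up for illustration):

```python
import numpy as np

# Multinomial-logit probabilities via the softmax, with category J as the baseline.
J = 4
beta = np.array([[ 0.5, -0.2],     # coefficients for category 1 vs. the baseline
                 [-0.3,  0.8],     # category 2 vs. the baseline
                 [ 0.1,  0.4]])    # category 3 vs. the baseline (baseline has eta = 0)

x = np.array([1.0, 2.0])           # covariate values for one observation
eta = np.append(beta @ x, 0.0)     # linear predictors, with the baseline fixed at 0

# Softmax: P(Y = j) = exp(eta_j) / sum_k exp(eta_k)
eta -= eta.max()                   # subtract the maximum to stabilize the exponentials
probs = np.exp(eta) / np.exp(eta).sum()
print(probs, probs.sum())          # probabilities over the J categories, summing to 1

# exp(beta_jk) gives the odds ratio of category j vs. the baseline per unit of covariate k.
print(np.exp(beta))
```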

Extensions and Advanced Topics

Generalized Additive Models

Generalized additive models (GAMs) extend generalized linear models (GLMs) by allowing non-linear relationships between predictors and the linear predictor through the use of smooth functions, while preserving the exponential family distribution and link function of the GLM framework. Introduced by Hastie and Tibshirani in 1986, GAMs replace the linear combination of predictors in the standard GLM with an additive sum of smooth functions, enabling flexible modeling of complex data patterns without assuming a specific form for each covariate effect. This approach is particularly useful for exploratory analysis and situations where relationships are suspected to be non-linear but additive across variables. The core form of a GAM is given by \eta_i = \beta_0 + \sum_{j=1}^p f_j(x_{ij}), where \eta_i is the linear predictor for the i-th observation, \beta_0 is an intercept, x_{ij} is the value of the j-th predictor for the i-th observation, and f_j are unspecified smooth functions, often represented by splines or other basis expansions. If each f_j is linear, the GAM reduces to a standard GLM, maintaining the connection to the original framework. Common implementations, such as those in the R package mgcv, employ penalized splines for the smooth functions f_j, balancing fit and smoothness via penalties on the second derivatives. Estimation in GAMs typically involves iterative methods like backfitting, which alternates fitting univariate smooths to residuals while holding others fixed, or penalized likelihood maximization, which incorporates smoothing penalties directly into the log-likelihood. The smoothness of each f_j is controlled by parameters estimated via generalized cross-validation or similar criteria, with effective degrees of freedom providing a measure of model complexity analogous to degrees of freedom in GLMs. These methods ensure that the resulting smooths are interpretable and avoid overfitting, as the additive structure facilitates partial dependence-like visualizations of individual predictor effects. The primary advantages of GAMs lie in their ability to capture non-linearities using a modest number of terms, improving predictive accuracy and model fit over rigid parametric assumptions without the interpretability loss of fully non-parametric methods. By visualizing the estimated smooth functions f_j, users can gain insights into the shape of relationships, such as monotonic increases or humps, which aids in scientific interpretation and hypothesis generation. This flexibility has made GAMs widely adopted in fields like ecology, epidemiology, and environmental science for modeling responses such as species abundance or disease risk.
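To convey the idea of a penalized smooth term, the following sketch (pure NumPy; simulated data) fits a single smooth f(x) for a Gaussian response with the identity link by combining a truncated power basis with a simple ridge penalty on the knot coefficients. It is only a rough stand-in for the penalized regression splines used by full GAM implementations such as mgcv, and the knot placement and smoothing parameter are fixed by hand rather than estimated:

```python
import numpy as np

# Minimal penalized-spline smoother for one predictor, Gaussian response, identity link.
rng = np.random.default_rng(7)
n = 200
x = np.sort(rng.uniform(0, 1, size=n))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)   # non-linear signal plus noise

knots = np.linspace(0.1, 0.9, 9)
B = np.column_stack([np.ones(n), x, x**2, x**3] +
                    [np.clip(x - k, 0, None)**3 for k in knots])  # truncated power basis

lam = 1.0                                   # smoothing parameter (fixed here; normally
penalty = np.zeros(B.shape[1])              # chosen by GCV or REML)
penalty[4:] = 1.0                           # penalize only the knot coefficients
D = np.diag(penalty)

coef = np.linalg.solve(B.T @ B + lam * D, B.T @ y)   # penalized least squares
fitted = B @ coef                                    # estimated smooth function values
print(np.round(fitted[:5], 3))
```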

Models for Correlated Data

Generalized linear models assume that observations are independent, but many real-world datasets exhibit correlation due to clustering, repeated measures, or hierarchical structures, such as patients followed over time or schools with multiple students. This violates the independence assumption, leading to biased standard errors and invalid inference if unaddressed. Extensions to GLMs for correlated data incorporate dependence structures while maintaining the exponential family form for the response and the link function for the mean. These methods are essential in fields like epidemiology, biostatistics, and the social sciences where data naturally arise in groups. Generalized estimating equations (GEE) provide a marginal approach to handle correlated data without fully specifying the joint distribution. Proposed by Liang and Zeger (1986), GEE extends the score equations of standard GLMs by including a working correlation matrix to account for within-cluster dependencies, often starting from a simple working structure such as independence. The estimating equations are solved iteratively to obtain consistent estimates of the regression coefficients, even under a misspecified working correlation. To ensure valid inference, a robust variance estimator, known as the sandwich estimator, is used, which adjusts for the actual correlation structure empirically: \hat{V} = \hat{I}^{-1} \hat{B} \hat{I}^{-1}, where \hat{I} is the model-based information matrix and \hat{B} captures the empirical variance of the score contributions. This approach yields population-averaged effects, interpreting coefficients as average changes across the population rather than for specific subjects. In contrast, generalized linear mixed models (GLMMs) explicitly model correlation through random effects integrated into the linear predictor. Breslow and Clayton (1993) formalized GLMMs by augmenting the fixed-effects linear predictor \eta = X\beta with random effects u \sim N(0, G), where G is a covariance matrix specifying the dependence, such as random intercepts for clustering or random slopes for varying trajectories. The marginal likelihood is obtained by integrating over the random effects distribution, often approximated via Laplace methods or adaptive Gaussian quadrature to handle the intractable integral: L(\beta, \theta) = \int \prod_i f(y_i | X_i \beta + Z_i u; \phi) \, g(u; G(\theta)) \, du, with \theta parameterizing G and \phi the dispersion. This framework produces subject-specific interpretations, where fixed effects represent effects conditional on the random effects for an individual cluster. Model-based standard errors rely on correct specification of both the mean and random effects structures. Practical applications illustrate these methods' utility. For longitudinal binary data, such as respiratory status measurements over time, GEE with an autoregressive order 1 (AR(1)) working correlation captures decaying dependence between consecutive observations, improving efficiency over independence assumptions while maintaining robustness. In clustered count data, like the number of epileptic seizures per patient across treatment periods, a GLMM with random intercepts models extra-Poisson variation due to patient-specific baselines, where the random effect u_j \sim N(0, \sigma^2_u) induces positive correlation within clusters. These examples highlight how GEE prioritizes marginal effects suitable for population-level questions, whereas GLMMs support individualized predictions. Inference in these models differs in interpretation and robustness. GEE delivers population-averaged (marginal) effects, averaging over the random-effects distribution, which aligns with unconditional summaries in population-level studies.
GLMMs, however, provide subject-specific (conditional) effects, conditioning on realized random effects for deeper mechanistic insights. Standard errors in GEE are robust to correlation misspecification via the sandwich form, whereas GLMM standard errors are model-based and sensitive to random effects assumptions, though bootstrap alternatives can enhance reliability. The choice between approaches depends on scientific goals: population-level summaries favor GEE, while modeling heterogeneity across units suits GLMMs.
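As an illustrative sketch of the GEE approach (assuming the Python statsmodels library; the cluster structure, covariate, and effect sizes are simulated for demonstration), the following fits a population-averaged Poisson model with an exchangeable working correlation and reports robust (sandwich) standard errors:

```python
import numpy as np
import statsmodels.api as sm

# GEE for clustered count data with an exchangeable working correlation.
rng = np.random.default_rng(8)
n_clusters, per_cluster = 50, 6
groups = np.repeat(np.arange(n_clusters), per_cluster)
x = rng.normal(size=n_clusters * per_cluster)
cluster_effect = np.repeat(rng.normal(scale=0.5, size=n_clusters), per_cluster)
y = rng.poisson(np.exp(0.2 + 0.5 * x + cluster_effect))   # induces within-cluster correlation

X = sm.add_constant(x)
model = sm.GEE(y, X, groups=groups,
               family=sm.families.Poisson(),
               cov_struct=sm.cov_struct.Exchangeable())
result = model.fit()

print(result.params)         # population-averaged regression effects
print(result.bse)            # robust (sandwich) standard errors
```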