Binary regression
Binary regression refers to statistical methods in regression analysis used to model the relationship between one or more independent variables (which can be continuous or categorical) and a binary dependent variable that takes only two possible values, such as 0 or 1, success or failure, or yes or no. The two most common approaches are logistic regression, which uses the logistic (sigmoid) function, and probit regression, which uses the cumulative distribution function of the standard normal distribution; both ensure predicted probabilities remain bounded between 0 and 1, unlike linear regression, which can produce values outside this range.[1] Logistic regression, the more widely used of the two, estimates the probability of the positive outcome category by applying the logistic function to a linear combination of the predictors.[2] These models are fitted using maximum likelihood estimation rather than ordinary least squares, because the binary nature of the outcome violates the assumption of continuous, normally distributed errors in linear models.[1]

The logistic function originates from 19th-century mathematical modeling of population growth by Pierre François Verhulst, who introduced the term "logistic" in 1838 to describe S-shaped curves representing bounded growth.[3] Its adaptation to statistical regression began in the mid-20th century; Joseph Berkson first proposed logistic regression in 1944 as an alternative to probit models for analyzing binary data in bioassay and medical studies.[3] David Cox further developed the logistic regression model in 1958 for the analysis of binary sequences.[4] By the 1970s, advances in computational methods made it widely accessible, establishing it as a cornerstone of modern statistics.[3]

At its core, the binary logistic regression model is expressed as p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}, where p is the probability of the event occurring, \beta_0 is the intercept, and \beta_i are the coefficients representing the change in the log-odds for a one-unit increase in predictor x_i; the logit transformation \ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k linearizes the relationship.[2] Key assumptions include independent observations, no perfect multicollinearity among predictors (e.g., generalized variance inflation factor < 2), and linearity in the log-odds for continuous predictors.[2] Model evaluation often involves metrics such as the Hosmer-Lemeshow goodness-of-fit test, the area under the receiver operating characteristic curve (AUC-ROC), and odds ratios derived from exponentiated coefficients, which quantify the multiplicative effect on the odds.[2]

Binary regression finds extensive applications across fields such as medicine, economics, the social sciences, and machine learning, particularly for predictive modeling in cross-sectional, cohort, and case-control studies.[2] In healthcare, it is commonly used to predict disease presence (e.g., lung cancer risk based on smoking history and body mass index) or treatment outcomes.[2] In business and marketing, it analyzes binary decisions such as customer churn or purchase intent.[1] Extensions include multinomial logistic regression for outcomes with more than two categories and regularized variants such as LASSO for high-dimensional data, addressing challenges like overfitting in large datasets.[1] Despite its strengths, limitations such as sensitivity to outliers and the need for large sample sizes for reliable estimates highlight the importance of robust diagnostic checks.[2]

Fundamentals
Definition and Scope
Binary regression is a statistical method designed to model the relationship between one or more predictor variables and a dichotomous dependent variable, which assumes only two possible outcomes, such as success or failure, or yes or no.[5] This approach is particularly useful in scenarios where the outcome of interest is categorical and binary, allowing researchers to quantify how explanatory variables influence the likelihood of one category over the other.[1] In binary regression, the model connects a linear predictor (formed by an intercept plus coefficients multiplied by the predictors) to the probability of the positive outcome through a link function, ensuring that the resulting probabilities are constrained to the interval [0, 1].[5]

The foundational formulation expresses this as P(Y=1 \mid X) = F(\beta_0 + \beta_1 X_1 + \dots + \beta_k X_k), where F denotes a cumulative distribution function (CDF) that guarantees the output remains within the valid probability range.[5] This structure addresses the inherent limitations of applying linear models directly to binary data, preventing invalid predictions outside [0, 1]. Unlike traditional continuous regression, which aims to predict the conditional expected value of a continuous response variable, binary regression estimates event probabilities for discrete outcomes, providing a more appropriate framework for probabilistic inference in categorical settings. Binary regression operates as a specialized instance within the generalized linear models framework, adapting linear prediction principles to non-normal response distributions.[6]
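To make the role of the link function F concrete, the following minimal sketch (with invented coefficient values chosen purely for illustration) evaluates P(Y=1 \mid X) for the two most common choices of F, the logistic function and the standard normal CDF, and shows that both map an unbounded linear predictor into [0, 1].

```python
import numpy as np
from scipy.special import expit   # logistic CDF (sigmoid)
from scipy.stats import norm      # standard normal CDF

# Hypothetical coefficients for a single-predictor model (illustrative values)
beta0, beta1 = -1.0, 0.8

x = np.array([-5.0, 0.0, 2.0, 10.0])   # predictor values, including extremes
eta = beta0 + beta1 * x                # linear predictor, unbounded

p_logit = expit(eta)       # logistic link:  p = 1 / (1 + exp(-eta))
p_probit = norm.cdf(eta)   # probit link:    p = Phi(eta)

# Both sets of probabilities stay inside [0, 1], unlike the raw linear predictor
for e, pl, pp in zip(eta, p_logit, p_probit):
    print(f"eta={e:6.2f}  logistic={pl:.4f}  probit={pp:.4f}")
```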
Relation to Generalized Linear Models

Binary regression serves as a special case of generalized linear models (GLMs), a framework introduced to unify various statistical models beyond ordinary linear regression by accommodating non-normal response distributions. In this context, the response variable follows a binomial distribution (often simplified to a Bernoulli distribution for individual binary outcomes), with the link function typically specified as the logit (the inverse of the logistic function) or the probit (the inverse of the standard normal CDF) to model the probability of the positive outcome. This integration allows binary regression to leverage the unified estimation and inference procedures of GLMs while addressing the inherent constraints of binary data, such as probabilities bounded between 0 and 1.[7][8]

GLMs are structured around three core components: the random component, which specifies the probability distribution of the response variable; the systematic component, consisting of a linear predictor formed by the covariates; and the link function, which relates the expected value of the response to this linear predictor. For binary regression, the random component is the Bernoulli distribution, where the response Y takes values 0 or 1 with success probability p, so Y \sim \text{Bernoulli}(p) and the mean \mu = E(Y) = p. The systematic component is the linear combination \eta = X\beta, where X is the design matrix of predictors and \beta is the vector of coefficients. The link function g then transforms the mean, ensuring the model respects the distributional assumptions; examples are the probit link g(\mu) = \Phi^{-1}(\mu) (where \Phi^{-1} is the inverse standard normal CDF) and the logit link g(\mu) = \log\left(\frac{\mu}{1-\mu}\right).[7][9][10]

The general form of a GLM is given by g(\mu) = X\beta, where \mu = E(Y \mid X) is the conditional expectation of the response and g is a monotonic, differentiable link function that bridges the random and systematic components. This formulation enables maximum likelihood estimation across diverse models while maintaining interpretability through the linear predictor. For binary regression, the Bernoulli assumption gives \mu = p, so the model directly estimates event probabilities via the inverse link, p = g^{-1}(X\beta).[7][9]

Compared with other GLMs, binary regression differs from linear regression, which employs a Gaussian random component and the identity link function (g(\mu) = \mu), leading to unbounded predictions that can fall outside [0,1] and are therefore unsuitable as probabilities. Poisson regression, used for count data, pairs a Poisson distribution with a log link to model non-negative rates, contrasting with binary regression's focus on dichotomous outcomes. These distinctions highlight the advantages of the GLM framework for binary data: the non-identity link prevents invalid predictions such as negative probabilities, improves model fit for bounded responses, and facilitates extensions to grouped binomial data when multiple trials are involved.[7][9][11]
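As an illustration of this shared GLM machinery, the sketch below fits the same binomial GLM to simulated data under both the logit and probit links using Python's statsmodels; the data and "true" coefficient values are invented for the example, and only the link object changes between the two fits.

```python
import numpy as np
import statsmodels.api as sm

# Simulated example data (illustrative only): one continuous predictor
rng = np.random.default_rng(0)
x = rng.normal(size=200)
eta = -0.5 + 1.2 * x                           # assumed "true" linear predictor
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))    # Bernoulli responses

X = sm.add_constant(x)  # design matrix with an intercept column

# Same binomial random component, two different link functions
logit_fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()   # logit link (default)
probit_fit = sm.GLM(
    y, X, family=sm.families.Binomial(link=sm.families.links.Probit())
).fit()

print(logit_fit.params)   # coefficients on the log-odds scale
print(probit_fit.params)  # coefficients on the probit (z-score) scale
```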
Common Models

Logistic Regression
Logistic regression models the probability of a binary outcome Y = 1 given predictors \mathbf{X} using the logit link function, where the log-odds is expressed as a linear combination of the predictors: \log\left(\frac{p}{1-p}\right) = \mathbf{X}\boldsymbol{\beta}, with p = P(Y=1 \mid \mathbf{X}) and \boldsymbol{\beta} the vector of coefficients.[12] This formulation inverts to yield the probability directly: p = \frac{1}{1 + \exp(-\mathbf{X}\boldsymbol{\beta})}.[12] The model assumes independence of observations and linearity on the log-odds scale, making it suitable for binary response data where outcomes are probabilities bounded between 0 and 1.[3]

The inverse logit, or sigmoid function, produces an S-shaped curve that maps the linear predictor \mathbf{X}\boldsymbol{\beta} to probabilities in [0,1], approaching 1 as the input tends to infinity and 0 as it tends to negative infinity.[13] The curve is symmetric about the point where \mathbf{X}\boldsymbol{\beta} = 0, at which the probability equals 0.5, and its derivative with respect to the linear predictor equals p(1-p), facilitating computational tasks such as gradient-based optimization.[13]

Coefficients in logistic regression admit an odds ratio interpretation: \exp(\beta_j) represents the multiplicative change in the odds of the outcome for a one-unit increase in predictor X_j, holding all other predictors constant.[14] For instance, if \exp(\beta_j) = 1.5, the odds increase by 50% per unit rise in X_j.[14]

Logistic regression was advanced by David Cox in 1958 through his analysis of binary sequences, building on earlier work in bioassay, and it gained prominence in epidemiology for modeling dose-response relationships in which binary outcomes such as response or non-response depend on exposure levels.[12][3]

Consider a simple case with a binary predictor X (e.g., treatment vs. control, coded 0 or 1) and intercept \beta_0: the probability for the control group is p_0 = 1 / (1 + \exp(-\beta_0)), while for the treatment group it becomes p_1 = 1 / (1 + \exp(-(\beta_0 + \beta_1))), where \exp(\beta_1) quantifies the change in odds due to treatment.[12] This setup illustrates how the model derives event probabilities from estimated coefficients, which is central to applications such as clinical trials.[15]
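The brief sketch below works through this two-group calculation with hypothetical coefficient values (\beta_0 = -1.0 and \beta_1 = 0.7 are invented for illustration), converting the fitted log-odds into group probabilities and an odds ratio.

```python
import math

# Hypothetical coefficients for a binary treatment indicator (illustrative values)
beta0 = -1.0   # intercept: log-odds of the outcome in the control group
beta1 = 0.7    # treatment effect on the log-odds scale

def inv_logit(eta: float) -> float:
    """Sigmoid: map a log-odds value to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

p_control = inv_logit(beta0)            # P(Y=1 | X=0)
p_treatment = inv_logit(beta0 + beta1)  # P(Y=1 | X=1)
odds_ratio = math.exp(beta1)            # multiplicative change in odds due to treatment

print(f"P(control)   = {p_control:.3f}")
print(f"P(treatment) = {p_treatment:.3f}")
print(f"Odds ratio   = {odds_ratio:.3f}")
```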
Probit Regression

The probit regression model specifies the probability of a binary outcome as a function of predictors using the cumulative distribution function (CDF) of the standard normal distribution as the link function. Formally, for a binary dependent variable Y \in \{0, 1\} and predictors X, the model is P(Y=1 \mid X) = \Phi(X \beta), where \Phi denotes the standard normal CDF and \beta is the vector of regression coefficients.[16] This approach ensures that predicted probabilities lie between 0 and 1, with the S-shaped form of \Phi capturing the nonlinear relationship between X and the outcome probability.

A key interpretive framework for the probit model involves a latent (unobserved) continuous variable Z = X \beta + \epsilon, where \epsilon \sim N(0, 1). The observed binary outcome is then determined by a threshold rule: Y = 1 if Z > 0, and Y = 0 otherwise. This latent variable representation links probit regression to classical threshold models, such as those used in psychometrics and bioassay, where the binary response reflects whether an underlying propensity exceeds a fixed cutoff.[16][17]

In comparison to logistic regression, the probit model produces a similarly monotonic increasing probability curve, but the standard normal CDF rises more steeply near the midpoint (probability 0.5) because the normal and logistic densities differ there. Consequently, coefficients \beta from probit and logit models cannot be directly compared without adjusting for scale; an approximate rule is that probit coefficients equal logit coefficients divided by \sqrt{\pi^2/3} \approx 1.81, reflecting the ratio of the error standard deviations (the standard normal has variance 1, while the standard logistic has variance \pi^2/3 \approx 3.29).[18]

The inverse of the probit link function, which transforms probabilities back to the linear predictor scale, is given by X \beta = \Phi^{-1}(p), where p = P(Y=1 \mid X); it is typically computed using numerical methods or tables for the inverse normal CDF.[16] In economics, probit models have been widely applied to discrete choice analysis and binary outcome modeling.
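The latent-variable interpretation can be checked by simulation. The sketch below (with invented coefficient values) draws Z = X\beta + \epsilon with standard normal errors, applies the threshold rule Y = 1\{Z > 0\}, and compares the simulated event frequency with the probit formula \Phi(X\beta).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

# Hypothetical coefficients for a single predictor (illustrative values)
beta0, beta1 = 0.3, 0.9
x0 = 1.5                  # a fixed predictor value at which to check the model

# Latent-variable representation: Z = X*beta + eps, eps ~ N(0, 1), Y = 1{Z > 0}
n = 200_000
eps = rng.standard_normal(n)
z = beta0 + beta1 * x0 + eps
y = (z > 0).astype(int)

empirical = y.mean()                         # simulated P(Y=1 | x0)
theoretical = norm.cdf(beta0 + beta1 * x0)   # probit formula Phi(X*beta)

print(f"simulated  P(Y=1) = {empirical:.4f}")
print(f"Phi(X beta)       = {theoretical:.4f}")  # the two should agree closely
```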
Estimation Techniques

Maximum Likelihood Estimation
Maximum likelihood estimation (MLE) is the primary method for estimating the parameters \beta in binary regression models; the goal is to maximize the probability of observing the given binary outcomes under the assumed model. For n independent observations (y_i, X_i), with y_i \in \{0, 1\}, the likelihood function is L(\beta) = \prod_{i=1}^n p_i^{y_i} (1 - p_i)^{1 - y_i}, where p_i = F(X_i^T \beta) and F is the inverse link function, such as the logistic or standard normal cumulative distribution function.[12] The log-likelihood, which is maximized instead for computational convenience, is l(\beta) = \sum_{i=1}^n \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]. This formulation arises from the Bernoulli distribution of the binary responses, ensuring the estimates reflect the observed data as closely as possible.

Optimization of the log-likelihood proceeds iteratively, as no closed-form solution for \beta exists in most cases. The Newton-Raphson algorithm updates \beta through successive approximations using the score function (gradient) and the observed Hessian (matrix of second derivatives), converging quadratically under suitable conditions. Equivalently, iteratively reweighted least squares (IRLS) reformulates the problem as a weighted linear regression at each step, with weights derived from the model-implied variances of the working responses, which makes computation efficient within the generalized linear model framework. The inverse of the negative Hessian at convergence provides the estimated covariance matrix used for the standard errors of \hat{\beta}.[19]

Under standard regularity conditions, such as correct model specification and identifiability, the MLE \hat{\beta} has desirable asymptotic properties. It is consistent, meaning \hat{\beta} converges in probability to \beta as n \to \infty, and asymptotically normal, with \sqrt{n}(\hat{\beta} - \beta) converging in distribution to N(0, I(\beta)^{-1}), where I(\beta) = E\left[-\partial^2 l / \partial \beta \, \partial \beta^T\right] is the Fisher information matrix.[19] These properties enable reliable large-sample inference, including confidence intervals based on the Wald statistic.[19]

In practice, MLE for binary regression is implemented in statistical software with built-in safeguards for convergence. In R, the glm function in the base stats package uses IRLS by default, monitoring the change in deviance until a tolerance threshold (by default 10^{-8}) is met or a maximum number of iterations (by default 25) is reached. Similarly, Python's statsmodels library employs Newton-Raphson or BFGS optimization for Logit models, with options to adjust convergence criteria such as the change in parameters or in the log-likelihood.
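As a minimal illustration, the sketch below (using simulated data with invented coefficient values) writes out the Bernoulli log-likelihood explicitly and checks it against the log-likelihood reported by statsmodels at the fitted maximum likelihood estimates.

```python
import numpy as np
import statsmodels.api as sm

# Simulated data (illustrative only)
rng = np.random.default_rng(1)
x = rng.normal(size=100)
X = sm.add_constant(x)
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.0 * x))))

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood: sum of y*log(p) + (1-y)*log(1-p), with p = expit(X beta)."""
    p = 1 / (1 + np.exp(-(X @ beta)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

fit = sm.Logit(y, X).fit(disp=0)           # Newton-Raphson by default
print(fit.params)                          # MLE beta-hat
print(log_likelihood(fit.params, X, y))    # should match the value reported by the fit
print(fit.llf)
```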
Consider a simple logistic regression example with n = 10 observations, where y indicates success (1) or failure (0) based on a single predictor x (e.g., dosage levels). The model is \text{logit}(p_i) = \beta_0 + \beta_1 x_i, with p_i = 1 / (1 + \exp(-(\beta_0 + \beta_1 x_i))). Starting from initial values (e.g., \beta = 0), IRLS iterates by fitting a weighted least squares regression: compute the working response z_i = X_i^T \beta + (y_i - p_i) / [p_i (1 - p_i)] and the weights w_i = p_i (1 - p_i), then update \beta by regressing z_i on X_i with weights w_i. After convergence (typically 4-6 iterations), suppose \hat{\beta}_0 \approx -2.5 and \hat{\beta}_1 \approx 1.2, indicating that the log-odds increase by 1.2 per one-unit increase in x; standard errors for inference are derived from the Hessian.
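A compact IRLS implementation of this scheme is sketched below. Since the ten dosage observations are not listed, the example simulates stand-in data with assumed coefficients near the quoted estimates; the working response, weights, weighted least squares update, and Hessian-based standard errors follow the steps just described.

```python
import numpy as np

# Stand-in data: the dosage observations are not given above, so simulate a
# similar setup with assumed "true" coefficients near the quoted estimates.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 5.0, size=100)
X = np.column_stack([np.ones_like(x), x])                # design matrix [1, x]
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.5 + 1.2 * x))))

beta = np.zeros(X.shape[1])                              # start from beta = 0
for _ in range(25):                                      # iteration cap, as in R's glm
    eta = X @ beta
    p = 1 / (1 + np.exp(-eta))                           # current fitted probabilities
    w = p * (1 - p)                                      # IRLS weights
    z = eta + (y - p) / w                                # working response
    # Weighted least squares update: solve (X^T W X) beta = X^T W z
    XtW = X.T * w
    beta_new = np.linalg.solve(XtW @ X, XtW @ z)
    if np.max(np.abs(beta_new - beta)) < 1e-8:           # converged on parameter change
        beta = beta_new
        break
    beta = beta_new

# Standard errors from the inverse negative Hessian at the estimate
p_hat = 1 / (1 + np.exp(-(X @ beta)))
cov = np.linalg.inv((X.T * (p_hat * (1 - p_hat))) @ X)
print("coefficients:", beta)
print("standard errors:", np.sqrt(np.diag(cov)))
```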