Censored regression model
The censored regression model, also known as the Tobit model, is a statistical technique for estimating linear relationships between independent variables and a dependent variable that is subject to censoring: the observed values are restricted to a certain range by measurement limits or natural constraints, such as non-negativity, and true values beyond those limits are not directly observed but are known to lie in one direction.[1][2] The model covers scenarios such as left-censoring (e.g., values below a detection limit recorded as the limit itself) and right-censoring (e.g., values above a cap reported at the cap), allowing all data points to be included while accounting for their partial observability.[1][3]

Introduced by economist James Tobin in his 1958 paper on estimating relationships for limited dependent variables, the model was originally developed to analyze consumer durable expenditures, which are often left-censored at zero when no purchase occurs; it provides a way to model both the decision to participate and the extent of participation in economic activities.[4][2] Tobin, who later received the Nobel Prize in Economics in 1981, framed the approach around a latent variable y^* = x\beta + \epsilon (where \epsilon is normally distributed with mean zero and variance \sigma^2), with the observed variable y equal to y^* if uncensored, or the censoring threshold otherwise.[4][2] This latent structure distinguishes it from ordinary least squares, whose estimates are biased toward zero in censored data; the Tobit model instead uses maximum likelihood estimation to jointly account for the probability of censoring and the conditional mean of the uncensored portion.[2][3] Unlike truncated regression, where observations outside the censoring bounds are excluded from the sample entirely (e.g., only positive expenditures are analyzed), censored regression retains all observations, treating censored ones as interval data to improve efficiency and reduce bias.[3][5]

Key assumptions include normality of errors, homoscedasticity (though extensions handle heteroscedasticity), and exogeneity of regressors; violations can be addressed with robust or semiparametric variants.[6][3] Coefficients in the Tobit model pertain to the latent variable, so marginal effects on the observed outcome (such as the expected change in y given censoring) must be computed separately and are often smaller than the raw coefficients suggest.[2][1]

The model finds wide application in econometrics for analyzing limited dependent variables, such as labor supply (hours worked censored at zero for non-workers) or household spending on specific goods.[2][1] In environmental and health sciences, it models measurements with detection limits, such as pollutant concentrations below quantifiable levels or censored survival times in clinical trials.[1][5] Extensions include panel data versions for longitudinal studies, dynamic Tobit models for time series with serial correlation, and penalized approaches such as LASSO for high-dimensional settings with many predictors.[7][8] Despite its utility, challenges persist with non-normal errors and endogeneity, prompting ongoing research into generalized and robust estimators.[9][6]
Overview
Definition
The censored regression model is a statistical technique for estimating relationships between variables when the dependent variable is only partially observed due to censoring, a common data limitation in empirical research. Censoring arises when the dependent variable is recorded only up to a threshold: values within the observable range are recorded accurately, while those beyond the bound (either above or below) are recorded as the limit itself and not distinguished further. For instance, this occurs in models of non-negative outcomes, such as hours worked, where negative values are impossible and thus unobserved, leading to a pile-up of observations at zero.[4]

A key distinction exists between censoring and truncation. In censoring, all observations, including those at the limit, are retained in the sample with the independent variables fully observed, but the dependent variable is recorded as the threshold value when censored. In truncation, entire observations outside the range are excluded, producing a non-representative sample. This difference is crucial because censored data preserve information about the censored cases, allowing more efficient estimation than truncated samples permit.[3]

Standard linear regression applied to censored data yields biased and inconsistent parameter estimates: the censoring mechanism violates the model's assumption of an uncensored, homoscedastic error distribution, truncating the conditional expectation and introducing selectivity bias.[4] For example, in models of hours worked in labor supply decisions, where negative hours are impossible and the outcome is censored at zero, ordinary least squares would bias the estimated effects of predictors toward zero by ignoring the censored nature of non-participation, distorting inferences about economic relationships. Censoring can take several forms, including left, right, and interval censoring, whose specifics vary by application.[4]
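The attenuation of ordinary least squares under censoring is easy to see in simulation. The following sketch, an illustration under assumed parameter values rather than anything from the cited sources, generates a latent outcome, censors it at zero, and compares the OLS slope fitted to the censored data with the true latent slope.

```python
# Minimal simulation sketch: censoring at zero biases the OLS slope toward zero.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
beta0, beta1, sigma = -1.0, 2.0, 1.5                          # assumed true latent parameters
y_star = beta0 + beta1 * x + rng.normal(scale=sigma, size=n)  # latent outcome y*
y = np.maximum(y_star, 0.0)                                   # observed outcome, left-censored at 0

# OLS on the censored data via the usual least-squares solution
X = np.column_stack([np.ones(n), x])
ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"true slope: {beta1:.2f}, OLS slope on censored y: {ols[1]:.2f}")
# The OLS slope comes out well below 2.0 because the pile-up of
# observations at zero flattens the fitted line.
```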
Historical Context
The censored regression model originated in econometrics as a response to the challenges posed by limited dependent variables, where observations are constrained within certain bounds. In 1958, James Tobin introduced the foundational Tobit model to address data that cluster at zero, particularly household expenditures on durable goods, combining elements of probit analysis and multiple regression to estimate relationships under censoring.[4] This approach, later named the Tobit model in Tobin's honor by Arthur Goldberger, marked a seminal contribution to handling non-negative outcomes that pile up at zero.[10]

Building on Tobin's work, Takeshi Amemiya provided a key formalization in 1973 by developing regression analysis for truncated normal dependent variables, which extended the framework to more general censored and truncated cases and laid the groundwork for broader applications in econometric modeling.[11] Amemiya's contributions, including his comprehensive 1984 survey of Tobit models, further clarified the distinctions among censored, truncated, and incidentally truncated models, influencing subsequent theoretical advances.[12]

During the 1980s and 1990s, censored regression models evolved significantly, incorporating robustness and panel data structures to address real-world complexities such as fixed effects and longitudinal observations. A notable milestone was James L. Powell's 1984 proposal of the censored least absolute deviations (CLAD) estimator, a semiparametric alternative to maximum likelihood that is less sensitive to distributional assumptions.[13] Integration with panel data advanced in the 1990s, exemplified by Bo E. Honoré's 1992 trimmed least absolute deviations method for fixed-effects censored models, which enables consistent estimation in settings with unobserved heterogeneity. Concurrently, software implementations facilitated wider adoption: Stata's tobit command, available since the software's early releases, supported standard censored regressions, while R packages such as censReg later provided flexible tools for estimation and inference.[14][15]
Theoretical Foundations
Types of Censoring
In censored regression models, censoring occurs when the dependent variable is only partially observed due to upper or lower bounds imposed by the data collection process or measurement limitations, leading to biased estimates if it is not accounted for.[3] The primary types of censoring are classified by the direction and nature of these bounds, with left-censoring, right-censoring, and two-sided censoring being the most common mechanisms. These distinctions arise in various econometric and statistical contexts, where the goal is to model the underlying latent variable while adjusting for the observed censored values.[16]

Left-censoring occurs when observations below a specified lower threshold are recorded at that threshold, masking the true value, which is known only to be less than or equal to the bound. For instance, in income data, negative values are impossible, so any true negative outcomes are censored at zero, resulting in a pile-up of observations at the lower limit.[16] This type is prevalent in economic models where variables like expenditures or hours worked cannot fall below zero.[3] The standard Tobit model is often used to address left-censoring by assuming a latent normal distribution for the uncensored variable.[1]

Right-censoring occurs when observations exceeding an upper threshold are set equal to that threshold, with the true value known only to be greater than or equal to the bound. A common example is test scores in educational datasets, where maximum scores are capped at 100, so any higher potential scores are recorded as 100, creating a clustering at the upper end.[16] This form is also frequent in survival analysis adapted to regression, such as when study durations end before an event, though in pure regression contexts it applies to bounded outcomes like capped subsidies or limits in experimental designs.[17]

Two-sided censoring, also known as double censoring, imposes both lower and upper bounds, with observations outside this interval recorded at the nearest bound, obscuring values below the lower limit or above the upper limit.[18] This is typical in laboratory assays or environmental measurements, such as chemical concentrations detected only between instrument limits (e.g., values below 1 ppm or above 100 ppm are set to those thresholds), leading to piles at both ends of the distribution. In such cases, the observed data reflect a restricted range, complicating inference about the full distribution.[3]

Censoring can further be distinguished as fixed or random based on the mechanism generating the bounds. Fixed censoring involves predetermined, known thresholds applied uniformly, such as regulatory caps on income reporting or fixed detection limits in sensors, where the censoring point is non-stochastic and identical across observations.[19] In contrast, random censoring arises from stochastic processes independent of the dependent variable, like instrument failures or random dropout times in panel data, where the censoring threshold varies across units but does not convey information about the outcome itself.[17] For example, in a dataset of firm profits, fixed left-censoring might occur at zero due to accounting rules, while random right-censoring could result from varying survey cutoffs due to administrative constraints.[20]
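As a quick illustration (with made-up thresholds, not drawn from the cited sources), each mechanism is simply a different clamping of the latent value; the indicators of which side was hit are what the likelihood functions below will need.

```python
# Sketch of the three censoring mechanisms applied to latent values.
import numpy as np

y_star = np.array([-3.0, 0.5, 42.0, 105.0, 97.0])   # hypothetical latent values

left  = np.maximum(y_star, 0.0)        # left-censoring at 0 (e.g., hours worked)
right = np.minimum(y_star, 100.0)      # right-censoring at 100 (e.g., capped test scores)
two   = np.clip(y_star, 1.0, 100.0)    # two-sided censoring on [1, 100] (e.g., assay limits)

# Censoring indicators, later used to decide each observation's likelihood term
left_censored  = y_star <= 0.0
right_censored = y_star >= 100.0
```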
Relation to Other Regression Models
The censored regression model addresses key shortcomings of ordinary least squares (OLS) regression when the dependent variable is subject to censoring. OLS assumes that the error term is normally distributed with constant variance and that the dependent variable can take any real value, but censoring introduces nonlinearity and heteroskedasticity in the observed data, resulting in biased and inconsistent parameter estimates. Specifically, OLS tends to attenuate coefficients toward zero, underestimating the true relationships, as the truncated error distribution for censored observations violates the classical assumptions. This issue was first highlighted in the development of the Tobit model, where standard regression techniques fail to account for the piling up of observations at the censoring point.[4]

In contrast to truncated regression models, censored regression retains the full sample while adjusting for the known censoring mechanism. Truncated models condition the likelihood only on the observed range, effectively discarding information about the distribution beyond the truncation point and leading to estimates based on a non-representative subsample. For instance, in a left-truncated model, observations below a threshold are excluded entirely, altering the joint distribution of regressors and the dependent variable. Censored models, however, observe the regressors for all units and set the dependent variable to the limit value for censored cases, enabling a more complete use of the data. This distinction ensures that censored regression provides unbiased estimates under the assumption of an underlying latent normal variable, whereas truncated regression requires rescaling the probabilities over the visible support.[21][3]

Censored regression belongs to the family of limited dependent variable models but differs in its focus on continuous outcomes bounded by censoring, rather than inherently discrete or restricted forms. Models like probit and logit handle binary dependent variables by modeling the probability of a positive outcome, while Poisson regression addresses count data with a focus on the expected number of events, often assuming equality of mean and variance. Although all share the challenge of non-linear estimation and the need to model the probability of being at the limit, censored regression emphasizes the conditional expectation of the latent continuous variable, incorporating both the binary decision to exceed the censor and the magnitude beyond it. This overlap underscores the unified treatment of bounded responses in econometrics, where censored models bridge continuous and discrete approaches without assuming a fully discrete outcome.[22]

The Heckman sample selection model, while related, targets a different data-generating process than pure censoring. It corrects for endogenous selection where observations are missing non-randomly due to a correlated selection rule, such as self-selection into a sample, using a two-step procedure with an inverse Mills ratio to adjust the outcome equation. In censored regression, by comparison, the full sample is available, and the issue stems from partial observability of the dependent variable at exogenous limits, without requiring a separate selection equation. This makes the Heckman approach essential for truncated samples driven by unobserved heterogeneity, but inappropriate for standard censoring scenarios where all units contribute to the estimation.[23]
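The contrast between censoring and truncation can be made concrete by writing the likelihood contribution of a single observation under a lower limit at zero. The display below is a worked restatement of the definitions above in the latent-variable notation used throughout this article, not new material:

```latex
% One observation; lower limit at 0; latent model y_i^* = x_i' \beta + \epsilon_i,
% with \epsilon_i ~ N(0, \sigma^2).
%
% Censored regression: every unit stays in the sample; a censored unit
% contributes the probability mass that the latent variable is at or below 0.
L_i^{\mathrm{censored}} =
\begin{cases}
\Phi\!\left(-\mathbf{x}_i\boldsymbol{\beta}/\sigma\right), & y_i = 0,\\[6pt]
\dfrac{1}{\sigma}\,\phi\!\left(\dfrac{y_i - \mathbf{x}_i\boldsymbol{\beta}}{\sigma}\right), & y_i > 0.
\end{cases}
%
% Truncated regression: units at or below 0 never enter the sample, so the
% density of the remaining units is rescaled over the visible support y > 0.
L_i^{\mathrm{truncated}} =
\frac{\frac{1}{\sigma}\,\phi\!\left(\frac{y_i - \mathbf{x}_i\boldsymbol{\beta}}{\sigma}\right)}
     {\Phi\!\left(\mathbf{x}_i\boldsymbol{\beta}/\sigma\right)},
\qquad y_i > 0.
```

The denominator \Phi(\mathbf{x}_i\boldsymbol{\beta}/\sigma) = P(y_i^* > 0) is exactly the information the truncated model must spend on re-normalization, whereas the censored model retains it as a separate, informative term.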
Model Specification
Standard Tobit Model
The standard Tobit model addresses left-censoring in regression analysis, where the dependent variable is observed only above a known lower threshold, typically zero, and is otherwise recorded at that threshold. Introduced by economist James Tobin to analyze limited dependent variables such as household expenditures on durable goods, the model posits an underlying latent process that generates both the censored and uncensored outcomes.[4]

The model's core formulation relies on a latent variable y_i^*, which follows a classical linear regression specification: y_i^* = \mathbf{X}_i \boldsymbol{\beta} + \varepsilon_i, where \mathbf{X}_i is a vector of explanatory variables for the i-th observation, \boldsymbol{\beta} is the parameter vector, and \varepsilon_i is the error term. The observed dependent variable y_i is then defined as y_i = \max(0, y_i^*), indicating left-censoring at zero: when y_i^* > 0, y_i = y_i^*; otherwise, y_i = 0 and the true value is unobserved. This setup accommodates nonnegative outcomes where negative realizations are theoretically possible but empirically censored, such as consumption or labor supply.[24][1]

Key assumptions underpin the model: the errors \varepsilon_i are independently and identically distributed as normal, \varepsilon_i \sim N(0, \sigma^2), ensuring the latent variable is normally distributed conditional on \mathbf{X}_i; homoskedasticity holds, with constant variance \sigma^2 independent of \mathbf{X}_i; and the regressors are exogenous, satisfying E[\varepsilon_i | \mathbf{X}_i] = 0. These normality, homoskedasticity, and exogeneity conditions enable consistent estimation via maximum likelihood and facilitate probabilistic interpretations of censoring.[24][4]

For a sample of n independent observations, the likelihood function derives from the conditional distribution of the observed y_i given \mathbf{X}_i. Let I_i be an indicator such that I_i = 1 if y_i > 0 (uncensored) and I_i = 0 if y_i = 0 (censored). For censored observations (y_i = 0), the contribution is the probability that the latent variable falls below the censoring point: P(y_i^* \leq 0 | \mathbf{X}_i) = P(\varepsilon_i \leq -\mathbf{X}_i \boldsymbol{\beta}) = \Phi\left( \frac{-\mathbf{X}_i \boldsymbol{\beta}}{\sigma} \right), where \Phi(\cdot) denotes the cumulative distribution function of the standard normal distribution. For uncensored observations (y_i > 0), the contribution is the density of the observed y_i given \mathbf{X}_i: f(y_i | \mathbf{X}_i) = \frac{1}{\sigma} \phi\left( \frac{y_i - \mathbf{X}_i \boldsymbol{\beta}}{\sigma} \right), where \phi(\cdot) is the probability density function of the standard normal distribution. The full likelihood function is thus L(\boldsymbol{\beta}, \sigma) = \prod_{i=1}^n \left[ \Phi\left( \frac{-\mathbf{X}_i \boldsymbol{\beta}}{\sigma} \right) \right]^{1 - I_i} \left[ \frac{1}{\sigma} \phi\left( \frac{y_i - \mathbf{X}_i \boldsymbol{\beta}}{\sigma} \right) \right]^{I_i}.
This product form arises because the observations are independent: the censored terms integrate the normal density over (-\infty, 0], while the uncensored terms use the exact normal density evaluated at the observed value.[24][1][16]

To facilitate maximum likelihood estimation, the log-likelihood function is obtained by taking the natural logarithm: \ell(\boldsymbol{\beta}, \sigma) = \sum_{i=1}^n (1 - I_i) \log \Phi\left( \frac{-\mathbf{X}_i \boldsymbol{\beta}}{\sigma} \right) + \sum_{i=1}^n I_i \left[ \log \phi\left( \frac{y_i - \mathbf{X}_i \boldsymbol{\beta}}{\sigma} \right) - \log \sigma \right]. Expanding the normal density term, \log \phi(z) = -\frac{1}{2} \log(2\pi) - \frac{1}{2} z^2 where z = (y_i - \mathbf{X}_i \boldsymbol{\beta})/\sigma, yields \ell(\boldsymbol{\beta}, \sigma) = \sum_{i=1}^n (1 - I_i) \log \Phi\left( \frac{-\mathbf{X}_i \boldsymbol{\beta}}{\sigma} \right) + \sum_{i=1}^n I_i \left[ -\frac{1}{2} \log(2\pi) - \frac{1}{2} \left( \frac{y_i - \mathbf{X}_i \boldsymbol{\beta}}{\sigma} \right)^2 - \log \sigma \right].

This expression separates contributions from censored and uncensored observations, allowing numerical maximization over \boldsymbol{\beta} and \sigma to obtain parameter estimates. The derivation follows directly from the normality assumption and the censoring rule; the log-likelihood is well-defined and, after reparameterization in terms of \boldsymbol{\beta}/\sigma and 1/\sigma, globally concave under the stated conditions.[24][16][1]
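The log-likelihood above translates almost line by line into code. The following is an illustrative sketch (assuming left-censoring at zero; the function and variable names are hypothetical), parameterized with \log \sigma so the scale stays positive during unconstrained optimization:

```python
# Sketch of the standard Tobit log-likelihood with left-censoring at 0.
import numpy as np
from scipy.stats import norm

def tobit_loglik(params, X, y):
    """params = (beta_1, ..., beta_k, log_sigma); X is n x k, y has length n."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)
    xb = X @ beta
    uncensored = y > 0

    ll = np.empty_like(y, dtype=float)
    # Censored terms: log P(y* <= 0 | X) = log Phi(-X beta / sigma)
    ll[~uncensored] = norm.logcdf(-xb[~uncensored] / sigma)
    # Uncensored terms: log of the normal density of y_i centered at X_i beta
    ll[uncensored] = norm.logpdf(y[uncensored], loc=xb[uncensored], scale=sigma)
    return ll.sum()
```

Maximizing this function over (\boldsymbol{\beta}, \log \sigma) reproduces the maximum likelihood estimator discussed in the estimation section below.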
Variants for Multiple Limits
The two-limit Tobit model extends the censored regression framework to censoring at both a lower threshold \delta_1 and an upper threshold \delta_2: the observed outcome y_i equals the latent variable y_i^* only if \delta_1 \leq y_i^* \leq \delta_2; otherwise, y_i = \delta_1 if y_i^* < \delta_1 or y_i = \delta_2 if y_i^* > \delta_2.[12] This variant addresses dependent variables bounded on both ends, such as wages constrained by minimum and maximum rates or prices constrained by regulatory floors and ceilings. The model assumes y_i^* = \mathbf{x}_i \boldsymbol{\beta} + \epsilon_i with \epsilon_i \sim N(0, \sigma^2), and includes the standard Tobit as the special case in which one limit lies at infinity. Estimation procedures for this bounded structure drew on Rosett and Nelson's (1975) two-limit probit model, adapted to the continuous Tobit setting in the econometric literature.

Further generalization leads to the interval-censored model, in which the exact value of the outcome is unknown but is known to fall within an interval (L_i, U_i), with L_i and U_i potentially varying across observations. Here the likelihood contribution for each observation integrates the density of the latent variable over the interval, rather than combining point masses at fixed limits with exact densities. This approach is particularly suited to data from periodic monitoring or grouped reporting, such as disease onset times recorded between medical checkups. Finkelstein (1986) provided a foundational proportional hazards formulation for interval-censored failure times, influencing parametric regression extensions in which the latent outcome follows a linear model conditional on covariates.

Generalized formulations allow the censoring points to differ by observation, incorporating observation-specific lower and upper bounds \delta_{1i} and \delta_{2i} that may depend on auxiliary variables or measurement processes, yielding y_i = \max(\delta_{1i}, \min(\delta_{2i}, y_i^*)). This flexibility captures heterogeneous censoring mechanisms, such as varying instrument sensitivities. Amemiya (1984) surveyed such extensions within Tobit-like models, emphasizing their utility for non-homogeneous bounds in empirical applications. In environmental monitoring, for instance, pollutant concentrations often face multiple detection limits (low-end nondetects and high-end saturation), producing doubly censored data; regression models here adjust for these varying thresholds to estimate exposure relationships accurately.[25]
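Allowing observation-specific bounds changes only the bookkeeping in the likelihood. The sketch below (illustrative, with hypothetical names) generalizes the tobit_loglik function above: each observation contributes a lower-tail probability, an upper-tail probability, or an exact density, depending on which regime it falls in.

```python
# Sketch of the log-likelihood under observation-specific censoring limits;
# the standard Tobit is recovered with lower = 0 and upper = +inf everywhere.
import numpy as np
from scipy.stats import norm

def two_limit_loglik(params, X, y, lower, upper):
    """Log-likelihood with bounds lower_i <= y_i <= upper_i.

    lower/upper are arrays (use -np.inf or np.inf to disable a side)."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)
    xb = X @ beta

    at_lower = y <= lower                # left-censored: contributes P(y* <= lower_i)
    at_upper = y >= upper                # right-censored: contributes P(y* >= upper_i)
    interior = ~(at_lower | at_upper)    # uncensored: contributes the exact density

    ll = np.empty_like(y, dtype=float)
    ll[at_lower] = norm.logcdf((lower[at_lower] - xb[at_lower]) / sigma)
    ll[at_upper] = norm.logsf((upper[at_upper] - xb[at_upper]) / sigma)  # log survival = log(1 - CDF)
    ll[interior] = norm.logpdf(y[interior], loc=xb[interior], scale=sigma)
    return ll.sum()
```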
Estimation Methods
Maximum Likelihood Estimation
The maximum likelihood estimator (MLE) for the censored regression model, in particular the standard Tobit model, maximizes the log-likelihood function derived from the assumed latent variable structure, accounting for both observed and censored observations.[26] This approach assumes a normal distribution for the error terms and provides consistent estimates under correct model specification.

Because the Tobit log-likelihood is non-linear in the parameters, optimization typically relies on numerical methods such as the Newton-Raphson algorithm, which iteratively updates the parameter estimates using first- and second-order derivatives until convergence.[27] Newton-Raphson works well here because the log-likelihood is smooth and, under the reparameterization in terms of \boldsymbol{\beta}/\sigma and 1/\sigma, globally concave; even so, good starting values (often from OLS on the uncensored data) help avoid divergence.[28]

Under standard regularity conditions, including correct specification of the latent model, independence of observations, and identification, the Tobit MLE is consistent and asymptotically normal. Specifically, the estimator \hat{\beta} satisfies \sqrt{n}(\hat{\beta} - \beta) \xrightarrow{d} N(0, I(\beta)^{-1}), where I(\beta) is the expected information matrix, as established by Amemiya.[26] These properties ensure reliable large-sample inference when the normality assumption holds. Standard errors for the MLE parameters are computed as the square roots of the diagonal elements of the inverse observed information matrix, evaluated at the converged estimates, providing the basis for t-tests and confidence intervals.

Implementations of Tobit MLE are available in standard statistical software: in R, the censReg function from the censReg package fits the model by maximum likelihood and allows left, right, or interval censoring to be specified;[15] in Stata, the tobit command estimates the model by maximum likelihood, with options for robust standard errors.[2]
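As a minimal end-to-end sketch (assuming the tobit_loglik function defined earlier and simulated rather than real data), a general-purpose quasi-Newton optimizer is enough to reproduce the workflow described above. Note that BFGS's built-in inverse-Hessian approximation is only a rough stand-in for the inverse observed information matrix; a numerically differentiated Hessian at the optimum is preferable for reliable standard errors, and dedicated implementations such as R's censReg or Stata's tobit are preferable in practice.

```python
# End-to-end sketch: maximize the Tobit log-likelihood and read off
# approximate standard errors. Assumes tobit_loglik from the earlier sketch.
import numpy as np
from scipy.optimize import minimize

# Simulated left-censored data (same design as the earlier simulation sketch)
rng = np.random.default_rng(1)
n = 5_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y_star = X @ np.array([-1.0, 2.0]) + rng.normal(scale=1.5, size=n)
y = np.maximum(y_star, 0.0)

# Start from OLS estimates, a common choice of initial values;
# the appended 0.0 is the starting value for log(sigma).
start = np.append(np.linalg.lstsq(X, y, rcond=None)[0], 0.0)

res = minimize(lambda p: -tobit_loglik(p, X, y), start, method="BFGS")

# res.hess_inv is BFGS's quasi-Newton approximation to the inverse Hessian,
# used here only as a rough substitute for the inverse observed information.
# The last parameter (and its standard error) is on the log(sigma) scale.
se = np.sqrt(np.diag(res.hess_inv))
print("estimates:", res.x)
print("approx. std. errors:", se)
```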