
Endogeneity

Endogeneity is a fundamental concept in econometrics and statistics, referring to the situation in which an explanatory variable in a model is correlated with the model's error term, thereby violating the strict exogeneity assumption and leading to biased and inconsistent estimates of the coefficients. This correlation undermines the reliability of ordinary least squares (OLS) estimators, as it implies that the explanatory variables are not truly independent of the unobserved factors captured by the error term, resulting in invalid and potentially misleading conclusions about causal relationships.

Sources of Endogeneity

Endogeneity commonly arises from several mechanisms, each introducing a specific form of correlation between regressors and error terms: omitted variables, simultaneity (including reverse causality), and measurement error. Additional causes include autoregression with autocorrelated errors, further complicating model specification.

Implications and Importance

The presence of endogeneity distorts the variance-covariance matrix of estimators, inflating or deflating standard errors and compromising hypothesis tests, which is particularly problematic in cross-sectional and panel data analyses common in empirical research. In fields like economics, finance, and the social sciences, endogeneity threatens causal claims, as it can reverse signs or exaggerate magnitudes of estimated effects, leading to erroneous recommendations or theoretical conclusions. Addressing endogeneity is crucial for robust causal inference, often necessitating quasi-experimental designs or econometric corrections to isolate exogenous variation.

Fundamentals

Definition and Core Concepts

In econometrics and statistical modeling, endogeneity arises when an explanatory variable in a regression model is correlated with the error term, thereby violating the crucial assumption of exogeneity required for unbiased and consistent estimation using ordinary least squares (OLS). This correlation implies that the explanatory variable is not independent of the unobservable factors captured by the error term, leading to systematic errors in parameter estimates. A common source of such endogeneity is omitted variables that influence both the explanatory variable and the outcome, though other mechanisms exist. At its core, consider the standard linear regression model Y = X\beta + \varepsilon, where Y is the outcome vector, X is the matrix of explanatory variables, \beta is the parameter vector, and \varepsilon is the error term. Exogeneity holds under strict conditions, such as E[\varepsilon | X] = 0, ensuring that the explanatory variables are uncorrelated with the error term both contemporaneously and conditionally on all relevant information. In contrast, endogeneity is present when \text{Cov}(X, \varepsilon) \neq 0, meaning the explanatory variables share a non-zero covariance with the disturbances, which can stem from model misspecification or inherent dependencies in the data-generating process. To illustrate the consequences, the OLS estimator under endogeneity is given by \hat{\beta} = \beta + (X'X)^{-1} X' \varepsilon, where the second term represents the bias arising from the non-zero correlation, causing \hat{\beta} to deviate systematically from the true \beta. This deviation persists even in large samples, rendering the estimator inconsistent.
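
The mechanics can be illustrated with a small simulation. The sketch below is illustrative only (the data-generating process, the correlation strength rho, and all variable names are hypothetical assumptions, not taken from the sources above): it builds a regressor that shares a common component with the error term and shows the OLS estimate drifting away from the true coefficient.

```python
import numpy as np

# Illustrative simulation: OLS bias when the regressor is correlated with the error.
rng = np.random.default_rng(0)
n = 100_000          # large sample, so the deviation below is not sampling noise
beta_true = 2.0      # true slope (hypothetical)
rho = 0.6            # strength of the correlation between regressor and error

# Build (x, eps) with Cov(x, eps) = rho by sharing a common unobserved component.
common = rng.normal(size=n)
x = common + rng.normal(size=n)             # regressor
eps = rho * common + rng.normal(size=n)     # error term correlated with x
y = beta_true * x + eps

# OLS slope without an intercept: beta_hat = (x'x)^{-1} x'y
beta_hat = (x @ y) / (x @ x)
print(f"true beta = {beta_true}, OLS estimate = {beta_hat:.3f}")
# The estimate exceeds 2.0 by roughly Cov(x, eps)/Var(x), and the gap does not
# shrink as n grows, illustrating bias and inconsistency under endogeneity.
```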

Historical Development

The concept of endogeneity traces its early roots to the late 19th century, when Francis Galton and Karl Pearson formalized regression and the product-moment correlation coefficient, establishing foundational assumptions for ordinary least squares that required errors to be uncorrelated with regressors. Pearson's work in the 1890s and early 1900s highlighted issues of spurious correlation and variability in data, implicitly recognizing potential problems with correlated disturbances in models, though explicit treatment of endogeneity awaited later developments. A pivotal milestone occurred in the 1940s with Trygve Haavelmo's introduction of the probability approach to econometrics, which emphasized modeling economic systems as joint probability distributions rather than deterministic relations, thereby addressing simultaneity and interdependence among variables. Haavelmo's 1944 paper argued that economic theories must be framed as statistical hypotheses to enable valid inference, laying the groundwork for simultaneous equation models where variables are endogenous due to mutual causation. This work, recognized with the 1989 Nobel Memorial Prize in Economics, shifted econometrics toward probabilistic foundations for handling correlated errors and estimation biases. In the 1940s and 1950s, the Cowles Commission advanced these ideas through collaborative efforts led by Tjalling Koopmans and William C. Hood, producing seminal reports on simultaneous equation systems that classified variables as endogenous or exogenous and developed identification criteria to resolve endogeneity. Their 1953 monograph, Studies in Econometric Method, formalized estimation procedures for interdependent models, influencing the treatment of reverse causality and omitted factors. Henri Theil's 1953 contributions further refined this by proposing repeated least-squares methods for complete equation systems, providing practical tools to estimate parameters amid endogeneity. Instrumental variables emerged during this era as an early remedy to isolate exogenous variation in simultaneous systems. The 1960s marked a shift toward structural equation modeling, integrating econometric simultaneous systems with path analysis to explicitly model endogenous relationships and latent variables, as seen in Otis Dudley Duncan's extensions of Sewall Wright's work. By the 1970s and 1980s, endogeneity concepts evolved into panel data contexts, with Yair Mundlak's 1978 analysis using fixed effects to control for unobserved heterogeneity in production functions, and subsequent tests like Hausman's for distinguishing fixed from random effects to mitigate bias from correlated effects. These advancements extended simultaneous-equation principles to longitudinal data, enhancing robustness against time-invariant endogeneity sources.

Causes

Omitted Variables

Omitted variable bias arises as a primary source of endogeneity in regression models when a relevant explanatory variable is excluded from the specification, leading to a correlation between the included regressors and the error term. Consider the true data-generating process given by the linear model Y = X\beta + Z\gamma + \varepsilon, where Y is the outcome vector, X contains the included regressors, Z is the omitted variable, and \varepsilon is the error term assumed to satisfy the classical assumptions (zero mean, homoskedasticity, and independence from X and Z). If Z is omitted, the estimated model becomes Y = X\hat{\beta} + u, where the composite error u = Z\gamma + \varepsilon now incorporates the effects of Z. Since Z is typically correlated with X in observational data, this omission induces \mathrm{Cov}(X, u) \neq 0, violating the exogeneity assumption and causing the ordinary least squares (OLS) estimator \hat{\beta} to be biased and inconsistent. The bias in the OLS estimator can be explicitly derived by substituting the true model into the omitted specification and applying the Frisch-Waugh-Lovell theorem, which projects Z onto X. Specifically, express Z = X \delta + v, where \delta = (X'X)^{-1} X' Z captures the linear projection coefficients and v is orthogonal to X. Substituting yields Y = X(\beta + \delta \gamma) + (v \gamma + \varepsilon), so the plim (probability limit) of \hat{\beta} is \beta + \delta \gamma, or in vector form, \hat{\beta} - \beta = (X'X)^{-1} X' Z \gamma. This expression shows that the bias equals the product of \gamma, the true coefficient (effect size) of the omitted variable Z, and \delta, the coefficient from the projection of Z onto X. The direction and magnitude of the bias depend on the signs and strengths of these components: for instance, if \gamma > 0 and X and Z are positively correlated (implying positive elements in (X'X)^{-1} X' Z), the bias is upward. A classic example illustrates this mechanism in the context of wage determination. Suppose the true model for hourly wages y is y = \beta_0 + \beta_1 \mathrm{educ} + \beta_2 \mathrm{abil} + \varepsilon, where \mathrm{educ} measures years of education (X) and \mathrm{abil} proxies innate ability (Z), with \beta_2 > 0 reflecting ability's positive impact on wages. Omitting \mathrm{abil} yields the misspecified y = \alpha_0 + \alpha_1 \mathrm{educ} + u, where u = \beta_2 \mathrm{abil} + \varepsilon. Since higher-ability individuals tend to acquire more education (\mathrm{Cov}(\mathrm{educ}, \mathrm{abil}) > 0), the error term correlates positively with \mathrm{educ}, biasing \hat{\alpha}_1 upward and overstating the causal return to education. Empirical studies confirm this issue; for instance, OLS estimates are typically around 8-12% per year of schooling, and while including ability proxies like IQ can reduce them (e.g., from 6.5% to 5.4%), IV estimates using policy changes such as quarter-of-birth instruments often yield higher figures of 10-16%, highlighting ongoing debates about the direction of bias correction and potential heterogeneity in treatment effects. The presence of bias requires two conditions: the omitted variable must have a nonzero effect (\gamma \neq 0) and it must correlate with the included regressors (\mathrm{Cov}(X, Z) \neq 0). If either holds only weakly, the bias may be negligible, but in economic applications like the wage model, both are typically strong, leading to substantial endogeneity. This form of misspecification underscores the importance of theoretical guidance in variable selection to minimize omitted factors that confound causal estimates.
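
The formula can be checked numerically. The following sketch is a hypothetical simulation of the wage example (the coefficients beta_educ, gamma_abil, and delta are made up for illustration and are not taken from any cited study): omitting an ability variable that is positively correlated with education inflates the education coefficient by roughly the product of the two components in the bias formula.

```python
import numpy as np

# Illustrative simulation of omitted variable bias (hypothetical parameters).
rng = np.random.default_rng(1)
n = 200_000
beta_educ, gamma_abil = 0.08, 0.05   # true effects of education and ability on log wage
delta = 0.5                          # slope of education on ability (their correlation channel)

abil = rng.normal(size=n)
educ = 12 + delta * abil + rng.normal(size=n)          # higher ability -> more schooling
log_wage = 1.0 + beta_educ * educ + gamma_abil * abil + rng.normal(scale=0.3, size=n)

def ols(y, regressors):
    """OLS coefficients via least squares, with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    return np.linalg.lstsq(X, y, rcond=None)[0]

full = ols(log_wage, [educ, abil])    # correctly specified model
short = ols(log_wage, [educ])         # ability omitted

print(f"full-model education coefficient : {full[1]:.4f}")   # close to 0.08
print(f"short-model education coefficient: {short[1]:.4f}")  # inflated
# The inflation is approximately gamma_abil * Cov(educ, abil)/Var(educ),
# matching the omitted-variable-bias expression beta + gamma * delta.
```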

Simultaneity and Reverse Causality

Simultaneity refers to a form of endogeneity that occurs in econometric models when two or more variables are determined jointly through mutual causation, such that the explanatory variable influences the dependent variable and vice versa, violating the exogeneity assumption required for consistent ordinary least squares estimation. This interdependence creates a feedback loop, where the error term in one equation becomes correlated with the regressors due to the simultaneous determination of the variables. A classic illustration is found in supply and demand models, where price and quantity are simultaneously determined, leading to biased estimates if the relationship is treated as a single equation. Reverse causality represents a specific subtype of simultaneity, where the outcome variable directly influences the predictor variable, rather than the predictor solely causing the outcome. For instance, in studies of health outcomes, better health status may encourage greater exercise participation, reversing the typically assumed direction from exercise to health improvements. This bidirectional relationship introduces endogeneity because the predictor is correlated with the error term, as unobserved factors affecting the outcome also impact the predictor. In simultaneous systems, this endogeneity is formally captured through structural equations that include endogenous variables on both sides. Consider a two-equation system: \begin{align*} Y_1 &= \alpha_1 + \beta_{12} Y_2 + \gamma_1 X_1 + \varepsilon_1 \\ Y_2 &= \alpha_2 + \beta_{21} Y_1 + \gamma_2 X_2 + \varepsilon_2 \end{align*} where \beta_{12} \neq 0 and \beta_{21} \neq 0, indicating the mutual dependence between Y_1 and Y_2, with X_1 and X_2 as exogenous variables, and \varepsilon_1, \varepsilon_2 as error terms that are typically correlated across equations due to the joint determination. Estimating one equation in isolation, such as regressing Y_1 on Y_2 and X_1, results in inconsistency because Y_2 is endogenous and correlated with \varepsilon_1. A prominent example in macroeconomics involves the money supply and inflation, where increases in inflation can prompt adjustments in money supply decisions by central banks, while money supply growth influences inflationary pressures, leading to correlated error terms in single-equation models. This complicates identification in macroeconomic analysis, as the feedback between the variables biases estimates of their relationship.
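
A small simulation of a supply-and-demand system makes the problem concrete. This is a minimal sketch with made-up structural coefficients (the slopes -1.0 and 0.5 and the error variances are hypothetical, not tied to any cited dataset): regressing equilibrium quantity on equilibrium price recovers neither the demand slope nor the supply slope.

```python
import numpy as np

# Illustrative supply/demand simulation: price and quantity are jointly determined.
rng = np.random.default_rng(2)
n = 100_000
# Structural equations (hypothetical coefficients):
#   demand: q = 10 - 1.0 * p + e_d
#   supply: q =  2 + 0.5 * p + e_s
e_d = rng.normal(size=n)
e_s = rng.normal(size=n)

# Solve the two equations for the equilibrium (reduced form) in each period.
p = (10 - 2 + e_d - e_s) / (1.0 + 0.5)
q = 2 + 0.5 * p + e_s

# Naive OLS of quantity on price (with intercept).
X = np.column_stack([np.ones(n), p])
slope = np.linalg.lstsq(X, q, rcond=None)[0][1]
print(f"OLS slope of q on p: {slope:.3f}")
# The estimate lies between the demand slope (-1.0) and the supply slope (+0.5),
# because p is correlated with both structural errors: simultaneity bias.
```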

Measurement Error

Measurement error in explanatory variables represents a key source of endogeneity in regression models, where the observed variable X^* differs from the true variable X due to inaccuracies in data collection or reporting. This error, denoted u and defined by X^* = X + u, introduces correlation between the observed explanatory variable and the model's error term, violating the exogeneity assumption required for unbiased ordinary least squares (OLS) estimation. Two primary types of measurement error are distinguished: classical and nonclassical. Classical measurement error assumes that u has zero mean, is uncorrelated with the true X, and is independent of the regression error \epsilon. In contrast, nonclassical measurement error allows u to correlate with X, often arising in survey contexts where reporting biases depend on the true value, such as underreporting by high earners. The mechanism by which measurement error induces endogeneity can be illustrated in a simple regression y = \beta X + \epsilon, where \epsilon is uncorrelated with X. Substituting the observed X^*, the model becomes y = \beta (X^* - u) + \epsilon = \beta X^* + (\epsilon - \beta u). The composite error term \epsilon' = \epsilon - \beta u now correlates with X^* because \text{Cov}(X^*, \epsilon') = -\beta \sigma_u^2 \neq 0 under classical assumptions (unless \beta = 0 or \sigma_u^2 = 0). This correlation biases the OLS estimator toward zero, a phenomenon known as attenuation bias. For classical measurement error in a univariate regression, the probability limit of the estimator is given by \text{plim}\, \hat{\beta} = \beta \cdot \frac{\sigma_X^2}{\sigma_X^2 + \sigma_u^2}, where \sigma_X^2 and \sigma_u^2 are the variances of X and u, respectively; since the fraction is less than 1, \hat{\beta} underestimates \beta in absolute value. In nonclassical cases, the bias direction depends on \text{Cov}(u, X); a negative covariance (e.g., mean-reverting errors) can amplify attenuation, while positive covariance may lead to overestimation. A representative example occurs in econometric analyses of wage determinants using survey data on income, where self-reported earnings often contain recall errors u that correlate with unobserved factors in \epsilon, such as worker motivation or ability. In studies of the returns to education, classical measurement error in reported schooling years attenuates the estimated coefficient on education, biasing downward the perceived economic value of additional schooling. Nonclassical errors exacerbate this when high-ability individuals underreport hours worked due to social desirability, introducing covariance between u and true productivity in \epsilon. Such issues are prevalent in labor economics datasets like the Current Population Survey, underscoring the need for validation techniques like administrative records to assess error structure.
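
The attenuation factor can be verified numerically. The sketch below uses purely illustrative values (beta, sigma_x, and sigma_u are hypothetical): it adds classical measurement error to a regressor and compares the OLS slope with the prediction \beta \sigma_X^2 / (\sigma_X^2 + \sigma_u^2).

```python
import numpy as np

# Illustrative simulation of attenuation bias from classical measurement error.
rng = np.random.default_rng(3)
n = 200_000
beta = 1.5
sigma_x, sigma_u = 1.0, 0.5        # std. dev. of the true regressor and of the error

x_true = rng.normal(scale=sigma_x, size=n)
y = beta * x_true + rng.normal(scale=0.2, size=n)
x_obs = x_true + rng.normal(scale=sigma_u, size=n)   # classical error, independent of x_true

slope = (x_obs @ y) / (x_obs @ x_obs)                # OLS without intercept (variables have zero mean)
predicted = beta * sigma_x**2 / (sigma_x**2 + sigma_u**2)
print(f"OLS slope on mismeasured x: {slope:.3f}")
print(f"attenuation prediction    : {predicted:.3f}")   # 1.5 * 1/1.25 = 1.2
```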

Implications

Bias in Parameter Estimates

In ordinary least squares (OLS) regression, endogeneity occurs when an explanatory variable X is correlated with the error term \epsilon, such that \operatorname{Cov}(X, \epsilon) \neq 0. This correlation violates the strict exogeneity assumption required for unbiased estimation, leading to E[\hat{\beta}] \neq \beta, where \hat{\beta} is the OLS estimator and \beta is the true parameter. The direction and magnitude of this finite-sample bias depend on the sign and strength of \operatorname{Cov}(X, \epsilon); a positive covariance imparts upward (positive) bias to \hat{\beta}, while a negative covariance results in downward (negative) bias. For instance, in the simple case of omitted variable bias, the bias term is \beta_2 \cdot \delta_{21}, where \beta_2 is the true coefficient on the omitted variable Z and \delta_{21} is the coefficient from regressing Z on X; if both are positive, the bias is upward. Under endogeneity, OLS estimators are biased even in finite samples because the orthogonality condition \operatorname{Cov}(X, \epsilon) = 0 does not hold, preventing E[\hat{\beta} | X] = \beta. This bias persists regardless of sample size, distorting point estimates and inference about causal relationships. A classic example is the estimation of returns to schooling, where omitting unobserved ability (which positively correlates with both education levels and earnings) leads to upward bias in the schooling coefficient; for example, studies using ability proxies suggest an upward bias of around 40% in OLS estimates (e.g., when omitting test scores). Conventional wisdom holds that \operatorname{Cov}(A, S) > 0, where A is ability and S is schooling, thus inflating OLS estimates of the return to an additional year of education. More broadly, endogeneity can inflate or deflate coefficients in applied settings, such as overestimating policy intervention effects when unobserved confounders (e.g., firm-specific incentives) positively correlate with both policy adoption and outcomes like R&D returns. Such distortions arise from sources like omitted variables or simultaneity, underscoring the need to address endogeneity for reliable finite-sample estimates.
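
As a purely hypothetical numerical illustration of this sign rule: if the omitted ability variable enters the wage equation with \beta_2 = 0.05 and regressing ability on schooling yields \delta_{21} = 0.4, the OLS schooling coefficient converges to \beta_1 + 0.05 \times 0.4 = \beta_1 + 0.02, an upward shift of two percentage points in the estimated return; reversing the sign of either component reverses the direction of the bias.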

Inconsistency and Asymptotic Properties

In the context of regression models, endogeneity leads to the inconsistency of the ordinary least squares (OLS) estimator, defined as the failure of the probability limit \operatorname{plim}_{n \to \infty} \hat{\beta} to equal the true parameter \beta. This inconsistency stems from a non-zero correlation between the explanatory variables X and the error term \epsilon, \operatorname{Cov}(X, \epsilon) \neq 0, which introduces a persistent bias that does not diminish with increasing sample size. As a result, even large samples yield estimates that systematically deviate from the population parameters, undermining the reliability of OLS for inference. The asymptotic distribution of the OLS estimator under endogeneity reflects this issue: \sqrt{n} (\hat{\beta} - (\beta + b)) \xrightarrow{d} N(0, V), where the non-zero asymptotic bias is b = \left( \operatorname{plim}_{n \to \infty} \frac{X'X}{n} \right)^{-1} \operatorname{plim}_{n \to \infty} \frac{X'\epsilon}{n}. This distribution centers on a pseudo-true value \beta + b rather than \beta, with V denoting the asymptotic variance-covariance matrix. In the simple regression case, the bias simplifies to \frac{\operatorname{Cov}(x, \epsilon)}{\operatorname{Var}(x)}, illustrating how the direction and magnitude depend on the sign and strength of the endogeneity correlation. By contrast, under strict exogeneity where \operatorname{Cov}(X, \epsilon) = 0, the OLS estimator is consistent (\operatorname{plim}_{n \to \infty} \hat{\beta} = \beta) and asymptotically normal around the true value, allowing inference to improve with larger samples. Endogeneity disrupts this convergence, ensuring that bias remains even asymptotically, which highlights the need to address the correlation for valid long-run properties. A prominent example occurs in simultaneous equation systems, where mutual causation between variables induces endogeneity, rendering OLS inconsistent due to simultaneity bias. In such models, the correlation between a regressor and the error term arises from joint determination, preventing OLS from converging to true parameters and requiring specialized estimators for asymptotic consistency.
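
Inconsistency, as opposed to mere finite-sample bias, can be seen by letting the sample grow. The sketch below is illustrative (the data-generating process and parameter values are arbitrary assumptions): it re-estimates the endogenous regression at increasing sample sizes, and the estimate settles on the pseudo-true value \beta + \operatorname{Cov}(x, \epsilon)/\operatorname{Var}(x) rather than on \beta.

```python
import numpy as np

# Illustrative check that the OLS bias under endogeneity does not vanish as n grows.
rng = np.random.default_rng(4)
beta = 1.0

def endogenous_sample(n):
    common = rng.normal(size=n)
    x = common + rng.normal(size=n)            # Var(x) = 2
    eps = 0.8 * common + rng.normal(size=n)    # Cov(x, eps) = 0.8
    return x, beta * x + eps

for n in (1_000, 10_000, 100_000, 1_000_000):
    x, y = endogenous_sample(n)
    print(f"n = {n:>9,d}: OLS slope = {(x @ y) / (x @ x):.3f}")
# Every estimate hovers near the pseudo-true value 1.0 + 0.8/2 = 1.4,
# so more data only sharpen the estimate around the wrong target.
```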

Detection

Specification Tests

Specification tests for model misspecification provide indirect evidence of endogeneity by detecting issues such as omitted variables or incorrect functional forms that may correlate explanatory variables with the error term. These tests do not directly confirm endogeneity but signal potential problems in the specification that could lead to biased estimates if unaddressed. Unlike tests built entirely around external instruments, they largely operate within the given model structure, often using residuals or fitted values to assess adequacy. The Regression Equation Specification Error Test (RESET), proposed by Ramsey in 1969, evaluates whether the functional form is correctly specified, which can reveal omissions or nonlinearities indicative of endogeneity from omitted variables. The procedure begins by estimating the original model via ordinary least squares (OLS) to obtain fitted values \hat{Y}. The model is then re-estimated by augmenting it with higher powers of \hat{Y}, typically up to the second or third power: Y = \beta_0 + \boldsymbol{\beta}_1 \mathbf{X} + \gamma_2 \hat{Y}^2 + \gamma_3 \hat{Y}^3 + \epsilon The null hypothesis of correct specification posits that the coefficients on the powered terms are jointly zero (H_0: \gamma_2 = \gamma_3 = \cdots = 0), tested via an F-statistic on their joint significance. Rejection suggests misspecification, potentially due to omitted variables that induce correlation between regressors and errors, though the test has limited power against specific linear omissions. A variant of the Hausman specification test, introduced in 1978, assesses endogeneity by comparing parameter estimates from OLS, an efficient but potentially inconsistent estimator under endogeneity, with those from an alternative consistent estimator, detecting correlation between regressors and errors if the estimates differ significantly. Under the null of no endogeneity (exogeneity), both estimators are consistent, and the quadratic form in their difference follows a chi-squared distribution asymptotically. The test statistic is: \chi^2 = (\hat{\beta}_{OLS} - \hat{\beta}_{alt})' (\text{Var}(\hat{\beta}_{alt}) - \text{Var}(\hat{\beta}_{OLS}))^{-1} (\hat{\beta}_{OLS} - \hat{\beta}_{alt}) where \hat{\beta}_{alt} denotes the alternative estimator's coefficients and the variance difference reflects the efficiency loss of the alternative estimator relative to OLS under the null. Significant differences indicate specification error, such as error correlation violating OLS assumptions. This approach is general but requires an alternative estimator robust to the suspected misspecification. The Wu test, developed in 1973, directly examines potential endogeneity by checking the correlation between a suspect regressor and the error term through an auxiliary regression. First, regress the potentially endogenous variable X on the included exogenous regressors \mathbf{Z} together with at least one exogenous variable \mathbf{Z}_{ex} excluded from the outcome equation (without such an excluded variable, the augmented regression below is perfectly collinear): X = \pi_0 + \boldsymbol{\pi}_1 \mathbf{Z} + \boldsymbol{\pi}_2 \mathbf{Z}_{ex} + v Obtain the residuals \hat{v}, then include them in the original model: Y = \beta_0 + \beta_1 X + \boldsymbol{\beta}_2 \mathbf{Z} + \delta \hat{v} + u The null hypothesis of exogeneity is \delta = 0; a significant t-test on \delta rejects it, implying X correlates with the structural error term (here u is the error of the augmented regression). This procedure assumes that \mathbf{Z} and \mathbf{Z}_{ex} are themselves exogenous. To specifically address omitted variables as a source of endogeneity, one practical procedure involves incorporating potential proxies for the missing factors into the model and testing their joint or individual significance via F- or t-tests.
If the proxies are significant, they indicate that the original specification omitted relevant correlates whose exclusion biases the coefficients; this approach aids detection when theoretically motivated proxies are available but requires careful selection to avoid introducing new biases. An illustrative RESET calculation appears in the sketch below.
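
The RESET procedure described above can be reproduced in a few lines. This sketch is illustrative only: the data-generating process is a hypothetical quadratic relationship fit with a deliberately misspecified linear model, and the F-statistic is computed from the restricted and unrestricted sums of squared residuals.

```python
import numpy as np

# Illustrative Ramsey RESET check on a deliberately misspecified model.
rng = np.random.default_rng(6)
n = 5_000
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(size=n)   # true relationship is quadratic

def fit(y, X):
    """Return OLS coefficients and the sum of squared residuals."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return coef, resid @ resid

ones = np.ones(n)
X_lin = np.column_stack([ones, x])
coef, ssr_restricted = fit(y, X_lin)       # misspecified linear model
y_hat = X_lin @ coef

# Augment with powers of the fitted values and compare fit via an F statistic.
X_aug = np.column_stack([ones, x, y_hat**2, y_hat**3])
_, ssr_unrestricted = fit(y, X_aug)
q, k = 2, X_aug.shape[1]
F = ((ssr_restricted - ssr_unrestricted) / q) / (ssr_unrestricted / (n - k))
print(f"RESET F statistic = {F:.1f}")
# A large F rejects correct specification, flagging the omitted nonlinearity.
```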

Endogeneity Tests Using Instruments

Endogeneity tests using instruments detect violations of the exogeneity assumption in regression models by leveraging instrumental variables (IVs) that are correlated with the potentially endogenous regressors but uncorrelated with the error term. These tests compare estimators obtained under different assumptions, such as ordinary least squares (OLS), which is consistent only if exogeneity holds, and IV estimation, which is consistent regardless but less efficient under exogeneity. The core idea is to assess whether the difference between these estimators is statistically significant, indicating endogeneity if OLS is inconsistent. The seminal Hausman test, proposed in 1978, formalizes this comparison for linear models of the form y = X\beta + \varepsilon, where X includes potentially endogenous regressors. Under the null hypothesis H_0 that X is exogenous (i.e., \text{Cov}(X, \varepsilon) = 0, so \beta_{\text{OLS}} = \beta_{\text{IV}}), the test statistic is given by H = (\hat{\beta}_{\text{OLS}} - \hat{\beta}_{\text{IV}})' [\hat{V}(\hat{\beta}_{\text{IV}}) - \hat{V}(\hat{\beta}_{\text{OLS}})]^{-1} (\hat{\beta}_{\text{OLS}} - \hat{\beta}_{\text{IV}}) \sim \chi^2_k, where k is the number of potentially endogenous regressors, \hat{\beta}_{\text{OLS}} and \hat{\beta}_{\text{IV}} are the OLS and IV estimators (typically two-stage least squares, 2SLS), and \hat{V}(\cdot) denotes consistent estimates of their covariance matrices. Rejection of H_0 suggests endogeneity, as the IV estimator accounts for correlation between X and \varepsilon, while OLS does not. The test assumes valid instruments, relevant (correlated with X) and exogenous (uncorrelated with \varepsilon), and that the model is correctly specified; invalid instruments can lead to incorrect inferences. To implement the Hausman test, one first performs the IV procedure: in the first stage, regress each endogenous regressor in X on the instruments Z to obtain fitted values \hat{X}; in the second stage, regress y on \hat{X} and exogenous covariates using OLS to obtain \hat{\beta}_{\text{IV}}. The OLS estimator \hat{\beta}_{\text{OLS}} is computed directly on the original model. The difference \hat{\beta}_{\text{OLS}} - \hat{\beta}_{\text{IV}} is then evaluated using the statistic above, with covariance estimates derived from the respective residuals. This procedure exploits the efficiency of OLS under exogeneity and the robustness of IV otherwise, making the test powerful for detecting specification errors like omitted variables or simultaneity. The Durbin-Wu-Hausman (DWH) test serves as a computationally convenient variant, often equivalent to the Hausman test under standard conditions, but implemented via an auxiliary regression to test directly for correlation between the instrumented regressors and the errors. Developed through contributions from Durbin (1954), Wu (1973), and Hausman (1978), it proceeds by first regressing the suspected endogenous X on the instruments Z and exogenous variables to obtain residuals \hat{v}_X, then including \hat{v}_X in an augmented regression of y on X, the exogenous variables, and \hat{v}_X. The null hypothesis H_0 that the coefficient on \hat{v}_X is zero (i.e., X uncorrelated with \varepsilon) is tested using a t- or F-statistic on this coefficient, which follows a standard distribution under homoskedasticity; rejection indicates endogeneity. Like the Hausman test, it relies on valid instruments and rejects H_0 if endogeneity is present, but the auxiliary-regression approach avoids direct inversion of the variance-difference matrix, aiding practical application in software.
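
The regression-based DWH variant is straightforward to sketch. The example below is illustrative: the data-generating process, the instrument z, and all parameter values are hypothetical, and classical standard errors are computed by hand rather than taken from any particular software routine.

```python
import numpy as np

# Illustrative Durbin-Wu-Hausman regression-based endogeneity test.
rng = np.random.default_rng(7)
n = 50_000
z = rng.normal(size=n)                        # instrument, excluded from the y equation
common = rng.normal(size=n)
x = 0.8 * z + common + rng.normal(size=n)     # endogenous regressor (shares 'common' with eps)
eps = 0.6 * common + rng.normal(size=n)
y = 1.0 + 2.0 * x + eps

def ols_with_se(y, X):
    """OLS coefficients and classical standard errors."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return coef, se

ones = np.ones(n)
# First stage: regress the suspect regressor on the instrument, keep the residuals.
pi, _ = ols_with_se(x, np.column_stack([ones, z]))
v_hat = x - np.column_stack([ones, z]) @ pi
# Augmented regression: a significant coefficient on v_hat signals endogeneity.
coef, se = ols_with_se(y, np.column_stack([ones, x, v_hat]))
print(f"coefficient on first-stage residuals = {coef[-1]:.3f}, t = {coef[-1]/se[-1]:.1f}")
```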

Remedies

Instrumental Variables Approach

The instrumental variables (IV) approach addresses endogeneity in regression models by employing an instrument Z that satisfies two conditions: it is uncorrelated with the error term, \operatorname{Cov}(Z, \epsilon) = 0, ensuring exogeneity, and it is correlated with the endogenous regressor X, \operatorname{Cov}(Z, X) \neq 0, ensuring relevance. This method allows for consistent estimation of causal effects even when X is endogenous. In the just-identified case, where the number of instruments equals the number of endogenous regressors, the IV estimator is given by \hat{\beta}_{IV} = (Z'X)^{-1} Z' Y, which sets the sample covariance between the instruments and the residuals to zero, providing a consistent estimate under the stated conditions. A common implementation of the IV approach is two-stage least squares (2SLS), which proceeds in two steps: first, regress the endogenous regressor X on the instruments Z using ordinary least squares (OLS) to obtain fitted values \hat{X}; second, regress the outcome Y on \hat{X} using OLS to obtain \hat{\beta}_{2SLS}. The 2SLS estimator is \hat{\beta}_{2SLS} = \left[ X' Z (Z' Z)^{-1} Z' X \right]^{-1} \left[ X' Z (Z' Z)^{-1} Z' Y \right], and it is consistent in the presence of endogeneity, as the fitted values \hat{X} purge the correlation between X and \epsilon. The exclusion restriction further requires that Z affects Y only through its impact on X, preventing direct channels of influence. The validity of IV estimation hinges on three key assumptions: relevance, exogeneity, and exclusion. Relevance is assessed via the first-stage F-statistic from the regression of X on Z; a common rule of thumb is that instruments are strong if this F-statistic exceeds 10, as weaker instruments can lead to finite-sample bias and unreliable inference. Exogeneity ensures no correlation between Z and \epsilon, while exclusion maintains that the instrument operates only through the indirect channel. An illustrative example is the use of geographic proximity to colleges (specifically, an indicator for growing up near an accredited four-year college) as an instrument for years of schooling in estimating the returns to schooling on wages, using data from the National Longitudinal Survey of Young Men; this instrument is relevant because closer access increases educational attainment, exogenous assuming proximity does not directly affect wages apart from education, and it yields IV estimates of the return to schooling that are 25-60% higher than OLS estimates, indicating that the net bias in naive regressions need not be upward despite ability endogeneity.
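
The two-stage procedure can be sketched directly from the formulas above. The example below is illustrative only (the data-generating process, the single instrument z, and the coefficient values are hypothetical assumptions): it compares a biased OLS estimate with the 2SLS estimate computed by projecting the regressors onto the instrument set.

```python
import numpy as np

# Illustrative comparison of OLS and two-stage least squares (2SLS) with one instrument.
rng = np.random.default_rng(8)
n = 100_000
z = rng.normal(size=n)
common = rng.normal(size=n)
x = 0.7 * z + common + rng.normal(size=n)    # endogenous regressor
eps = 0.5 * common + rng.normal(size=n)
y = 1.0 + 2.0 * x + eps                      # true slope is 2.0

ones = np.ones(n)
X = np.column_stack([ones, x])
Z = np.column_stack([ones, z])

# OLS (biased here because Cov(x, eps) != 0).
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

# 2SLS: first stage projects X onto Z, second stage regresses y on the fitted values.
X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
beta_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]

print(f"OLS slope : {beta_ols[1]:.3f}")      # drifts above 2.0
print(f"2SLS slope: {beta_2sls[1]:.3f}")     # close to 2.0
```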

Panel Data Methods

Panel data methods address endogeneity in econometric models by exploiting repeated observations on the same units over time, allowing researchers to control for unobserved heterogeneity that is constant across time periods. These approaches are particularly useful when endogeneity arises from omitted variables that do not vary over time, such as individual-specific factors in longitudinal studies. By focusing on within-unit variation, these techniques mitigate biases from time-invariant confounders without requiring external instruments. The fixed effects (FE) estimator is a cornerstone of these methods, eliminating time-invariant unobserved effects through the within-group transformation. This involves demeaning the data for each unit, subtracting the individual-specific time average from each observation. The resulting model is: Y_{it} - \bar{Y}_i = (X_{it} - \bar{X}_i) \beta + (\epsilon_{it} - \bar{\epsilon}_i) where i indexes units, t indexes time, Y_{it} is the outcome, X_{it} are regressors, \beta are parameters, and \epsilon_{it} is the error term. This removes any fixed component of the error that is constant over time, ensuring consistency under the assumption of strict exogeneity conditional on the fixed effects. The FE approach is consistent even if the unobserved effects are correlated with the regressors, making it robust to endogeneity from omitted time-invariant variables. In contrast to FE, random effects (RE) models assume that the unobserved effects are uncorrelated with the regressors, allowing for more efficient estimation by incorporating both within- and between-unit variation. The choice between FE and RE is often determined using the Hausman test, which assesses whether the RE assumptions hold by comparing coefficient estimates from both models; rejection of the null favors FE for consistency. FE estimators remain consistent if strict exogeneity holds conditionally on the fixed effects, whereas RE may be inconsistent if correlation between the unobserved effects and the regressors exists. An alternative to the within transformation is first-differencing, which removes fixed effects by taking differences between consecutive time periods for each unit. The transformed model becomes: \Delta Y_{it} = \Delta X_{it} \beta + \Delta \epsilon_{it} where \Delta denotes the first difference (Z_{it} - Z_{i,t-1}). This method eliminates time-invariant factors but requires the assumption of no serial correlation in the errors to ensure efficiency, and it can amplify measurement error in short panels. First-differencing is particularly applied in dynamic panel models but serves as a simple remedy for static cases with fixed unobservables. A key application of these methods is in labor economics, where panel wage data often suffer from endogeneity due to omitted time-constant individual ability. Fixed effects or first-differencing controls for this by focusing on changes within individuals over time, yielding unbiased estimates of returns to time-varying factors such as experience or job training. For instance, in studies of wage determination, these techniques address the bias from unobserved ability correlated with schooling choices.
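
The within transformation is easy to demonstrate on simulated data. The sketch below is illustrative only: the panel dimensions, the strength of the correlation between the unit effect and the regressor, and the slope value are all hypothetical assumptions.

```python
import numpy as np

# Illustrative fixed-effects (within) estimator on a simulated short panel.
rng = np.random.default_rng(9)
n_units, n_periods = 5_000, 4
a = rng.normal(size=n_units)                               # unobserved, time-invariant unit effect
x = 0.8 * a[:, None] + rng.normal(size=(n_units, n_periods))
eps = rng.normal(size=(n_units, n_periods))
y = 1.5 * x + a[:, None] + eps                             # true slope is 1.5

# Pooled OLS ignores a_i and is biased because Cov(x, a) > 0.
xd, yd = x.ravel() - x.mean(), y.ravel() - y.mean()
slope_pooled = (xd @ yd) / (xd @ xd)

# Within transformation: subtract each unit's time average, then run OLS on the deviations.
x_w = (x - x.mean(axis=1, keepdims=True)).ravel()
y_w = (y - y.mean(axis=1, keepdims=True)).ravel()
slope_fe = (x_w @ y_w) / (x_w @ x_w)

print(f"pooled OLS slope   : {slope_pooled:.3f}")   # above 1.5
print(f"fixed-effects slope: {slope_fe:.3f}")       # close to 1.5
```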

Applications

Economic Examples

In labor economics, endogeneity arises in estimating the returns to education when omitted variables such as innate ability bias ordinary least squares estimates upward, as higher-ability individuals tend to acquire more schooling and earn higher wages. A seminal approach to address this uses instrumental variables (IV), where quarter of birth serves as an instrument for years of schooling due to compulsory attendance laws that create exogenous variation in school starting age. In their influential study, Angrist and Krueger (1991) exploited U.S. data from 1914–1939 birth cohorts, finding that individuals born early in the calendar year obtained slightly less education than those born later, because they started school at an older age and could legally drop out after completing less schooling. Their IV estimates indicated a 7–10% return to an additional year of schooling, lower than OLS estimates but still positive and significant, highlighting how IV corrects for endogeneity without relying on randomized assignment. In macroeconomics, endogeneity manifests in investment equations through simultaneity bias in accelerator models, where investment responds to changes in output, but output simultaneously depends on investment levels, leading to correlated regressors and inconsistent estimates. The accelerator principle, positing that investment is proportional to the change in output, empirically overperforms due to this reverse feedback, as evidenced in early U.S. data where ignoring simultaneity inflated accelerator coefficients. To remedy this, lagged variables, such as prior-period output, are often employed as instruments, assuming they influence current investment through persistent dynamics but are uncorrelated with contemporaneous shocks. For instance, in dynamic models of firm-level investment, Arellano-Bond estimation uses internal lags to instrument endogenous variables, yielding consistent estimates of output effects around 0.1–0.2 in some sectors. In policy evaluation, endogeneity occurs when program participation is non-random, as individuals self-select based on unobservables like motivation, biasing estimates of program effects on outcomes such as earnings or schooling. Regression discontinuity designs address this by exploiting sharp cutoffs in eligibility rules, creating quasi-experimental variation where treatment status changes discontinuously while covariates remain continuous. A representative application is the evaluation of programs like Mexico's PROGRESA (now Prospera), where eligibility was determined by a poverty index cutoff; studies exploiting variation around this cutoff found participation increased school attendance by roughly 20% and reduced child labor, without the selection bias inherent in observational data. This method isolates local average treatment effects near the cutoff, providing credible evidence on program impacts despite endogenous enrollment.

Broader Statistical Contexts

In epidemiology, endogeneity often manifests as confounding in observational studies, where unmeasured factors correlate with both exposure and outcome, leading to biased estimates of causal effects. For instance, studies examining the relationship between smoking and health outcomes frequently encounter endogeneity due to omitted genetic predispositions that influence both smoking initiation and disease susceptibility. Researchers address this through instrumental variable methods, using genetic markers as instruments to isolate the causal impact of smoking while mitigating confounding from unobserved heterogeneity. This approach is particularly valuable in prenatal smoking analyses, where endogeneity arises from self-selection into smoking behaviors influenced by socioeconomic or genetic factors, resulting in biased estimates of effects if not corrected. In the social sciences, endogeneity appears prominently as selection bias in survey and observational data, where individuals self-select into treatments or samples based on unobserved characteristics, violating assumptions of random assignment. Propensity score matching serves as a quasi-experimental remedy, estimating the probability of treatment assignment conditional on observed covariates to balance groups and reduce bias from endogeneity. This method has been applied to assess social programs, such as job training initiatives, where selection into participation correlates with outcomes like earnings, though it may not fully eliminate bias if key confounders remain unobserved. Empirical evaluations show that matching can substantially lower measured selection bias but leaves residual imbalances, underscoring its role as a partial solution in non-randomized settings. Machine learning intersects with endogeneity in causal inference, particularly through frameworks designed to handle high-dimensional data where nuisance parameters confound treatment effects. The double/debiased machine learning estimator, introduced by Chernozhukov et al., combines flexible machine learning prediction of nuisance functions with debiasing techniques to achieve valid inference on low-dimensional causal parameters, even under model misspecification or endogeneity from omitted variables. This approach is widely adopted in causal machine learning to address endogeneity in treatment effect estimation, providing root-n consistency and asymptotic normality through cross-fitting that avoids overfitting biases. Its impact is evident in applications spanning economics and beyond, where it enables robust causal claims from complex datasets prone to endogenous selection. In biology, endogeneity arises in studies of gene-environment interactions, often through reverse causality where environmental exposures influence genetic expression or vice versa, complicating causal direction in twin studies. For example, twin designs examining traits like intelligence or behavior must account for gene-environment correlations that induce endogeneity, as shared genetics may drive both environmental selection and phenotypic outcomes, leading to biased heritability estimates. Reverse causality in these contexts can manifest when environmental factors, such as neighborhood effects, alter gene expression, creating feedback loops in which endogenous environments exacerbate bias through omitted mediators or measurement error. Natural experiment approaches, including twin comparisons, help disentangle these interactions by leveraging genetic randomization to test for pleiotropy and causal direction, though persistent endogeneity from dynamic interplay remains a challenge.

References

  1. [1]
    [PDF] ENDOGENEITY - CORE
    Endogeneity occurs when one of the independent variables of the regression equation is correlated with the unknown random error term (Wooldridge, 2002).
  2. [2]
    [PDF] Endogeneity in Empirical Corporate Finance∗
    We begin by reviewing the sources of endogeneity—omitted variables, simultaneity, and measurement error—and their implications for inference. We then discuss in ...
  3. [3]
    Endogeneity: How Failure to Correct for it can Cause Wrong ...
    Jul 16, 2015 · Endogeneity is a major methodological concern for many areas of business and management research that rely on regression analysis to draw causal inference.
  4. [4]
    Instrumental Variables and Endogeneity Part 1: Theory - UDRC
    Aug 4, 2021 · Endogeneity occurs when one of the explanatory variables in a regression model is correlated with the error term. It can occur due to unobserved ...
  5. [5]
    [PDF] 4.8 Instrumental Variables
    The data give information on dy/dx, so OLS estimates the total effect (including du/dx) rather than the structural parameter alone. The OLS estimator is therefore biased and inconsistent for it.
  6. [6]
  7. [7]
    Marginal Effects for Probit and Tobit with Endogeneity - arXiv
    Jun 26, 2023 · Abstract:When evaluating partial effects, it is important to distinguish between structural endogeneity and measurement errors.
  8. [8]
    [PDF] Bayesian instrumental variables: priors and likelihoods
    what can be called “weak endogeneity in the frequentist sense”, or conditional endogeneity, to make it explicit we are not considering the case where Σ is ...
  9. [9]
    Galton, Pearson, and the Peas: A Brief History of Linear Regression ...
    Dec 1, 2017 · 4. Pearson's Mathematical Development of Correlation and Regression. In 1896, Pearson published his first rigorous treatment of correlation and ...
  10. [10]
    [PDF] The Probability Approach in Econometrics Author(s): Trygve ...
    PREFACE. This study is intended as a contribution to econometrics. It represents an attempt to supply a theoretical foundation for the analysis of ...
  11. [11]
    The Cowles Commission's Contributions to Econometrics at ... - jstor
    The fourth, Koop- mans and Hood (1953), presents some of the Cowles results, including some later ones concerning the es- timation of one equation at a time, in ...
  12. [12]
    [PDF] Origins of the limited information maximum likelihood and two-stage ...
    Nov 5, 2004 · Theil, H., 1953b. Estimation and simultaneous correlation in complete equation systems. Centraal Planbureau Memorandum. (Reprinted in: Raj ...)
  13. [13]
    Retrospectives: Who Invented Instrumental Variable Regression?
    Philip G. Wright's 1928 work is the earliest known solution, but Sewall Wright's 1925 work used instrumental variables, though not in a simultaneous equation.
  14. [14]
    FIFTY YEARS OF STRUCTURAL EQUATION MODELING
    ... 1960s and 1970s blurred the lines between econometrics, psychometrics, and sociometrics. They showed how a general model could incorporate simultaneous equation ...
  15. [15]
    [PDF] The Early Years of Panel Data Econometrics
    The fixed- and random-effects models have a long history in astronomy, agronomy, and statistics, going back to the nineteenth century. 5. Mundlak mentioned Hoch ...
  16. [16]
    [PDF] Introductory Econometrics
    Econometrics is based upon the development of statistical methods for estimating economic relationships, testing economic theories, and evaluating and ...
  17. [17]
    Omitted variable bias: A threat to estimating causal relationships
    We aim to raise awareness of the omitted variable bias (ie, one special form of endogeneity) and highlight its severity for causal claims.
  18. [18]
    [PDF] The Causal Effect of Education on Earnings. - David Card
    This paper surveys the recent literature on the causal relationship between education and earnings. I focus on four areas of work: theoretical and ...
  19. [19]
    [PDF] Simultaneous Equation Model (Wooldridge's Book Chapter 16)
    This is a simultaneous equation model (SEM) since y1 and y2 are determined simultaneously. • Both variables are determined within the model, so are endogenous, ...
  20. [20]
    [PDF] Controlling for Endogeneity in the Health-SES Relationship ... - AURA
    Although medical evidence suggests that the direction of causality runs from SES to health in the sense that lower levels of income lead to worse health ...
  21. [21]
    Public Debt, Money Supply, and Inflation: A Cross-Country Study in
    Jul 31, 2009 · This paper provides comprehensive empirical evidence that supports the predictions of Sargent and Wallace's “unpleasant monetarist arithmetic”
  22. [22]
    Measurement Error in Survey Data - ScienceDirect
    In this survey, we focus on both the importance of measurement error in standard survey-based economic variables and on the validity of the classical ...
  23. [23]
  24. [24]
    [PDF] Lecture Notes on Measurement Error - LSE Economics Department
    The measurement error in x becomes part of the error term in the regression equation thus creating an endogeneity bias. Since Ηx and u are positively corre-.
  25. [25]
    [PDF] Instrumental Variables Estimation and Two Stage Least Squares
    Oct 18, 2018 · This suggests that the OLS estimate is too high and is consistent with omitted ability bias.
  26. [26]
    [PDF] Lecture 20: Omitted Variable Bias - MIT Open Learning Library
    The omitted variable Bias formula. Correct model: Yi = β0 + β1X1i + β2X2i + i. Estimated model: Yi = α0 + α1X1i + wi. Define Ancillary (or Auxillary) ...
  27. [27]
    Estimating the Returns to Schooling: Some Econometric Problems
    It points out that in optimizing models the "ability bias" need not be positive and shows, using recent analyses of NLS data, that when schooling is treated ...
  28. [28]
    [PDF] Introductory Econometrics: A Modern Approach (with Economic ...
    Jeffrey M. Wooldridge, Michigan State University. Introductory Econometrics: A Modern Approach, 4e.
  29. [29]
    Mostly Harmless Econometrics (Angrist & Pischke, 2008).
  30. [30]
    Tests for Specification Errors in Classical Linear Least-squares ...
    The tests are developed by comparing the distribution of residuals under the hypothesis that the specification of the model is correct to the distribution of ...
  31. [31]
    [PDF] Specification Tests in Econometrics - JA Hausman
    Oct 31, 2002 · Specification tests in econometrics are devised to test if the conditional expectation of X is zero and if e has a spherical covariance matrix.
  32. [32]
    Alternative Tests of Independence between Stochastic Regressors ...
    In this paper, we examine four alternative tests of independence between the stochastic regressors and disturbances. In the rest of this section we specify the ...
  33. [33]
    [PDF] Some Finite-Sample Results on the Hausman Test - UCLA Economics
    Nov 17, 2023 · . One widely used method for testing the endogeneity of y1 is the Hausman test (Hausman, 1978). This test compares the ordinary least squares ( ...
  34. [34]
    [PDF] RS – Lecture 1
    Model: y = X β + Uγ + ε, we suspect X is endogenous. • Steps for augmented regression DWH test: 1. Regress x on IV (Z) and U: x = Z П ...
  35. [35]
    [PDF] Instrumental Variables Regression with Weak Instruments
    Bonferroni and AR tests and the Durbin endogeneity test have correct size for this specification but could have negligible power. Finally, for specification ...
  36. [36]
    Using Geographic Variation in College Proximity to Estimate the ...
    Oct 1, 1993 · This paper explores the use of college proximity as an exogenous determinant of schooling. Analysis of the NLS Young Men Cohort reveals that men who grew up in ...
  37. [37]
    The Accelerator's Debt to Simultaneity - jstor
    It is the thesis of this note that the empirical dominance of the accelerator is largely spurious, related to simultaneous equation bias, and that an income ...
  38. [38]
    A Genetic Instrumental Variables Analysis of the Effects of Prenatal ...
    Our results indicate that prenatal smoking produces more dramatic declines in birth weight than estimates that ignore the endogeneity of prenatal smoking.
  39. [39]
    Instrumental Variables: Application and Limitations - Epidemiology
    We conclude that instrumental variables can be useful in case of moderate confounding but are less useful when strong confounding exists.
  40. [40]
    Sources of selection bias in evaluating social programs - NIH
    We find that matching based on the propensity score eliminates some but not all of the measured selection bias, with the remaining bias still a substantial ...
  41. [41]
    Propensity Score Analysis: Recent Debate and Discussion
    Propensity score analysis is often used to address selection bias in program evaluation with observational data. However, a recent study suggested that ...
  42. [42]
    Efficacy of Propensity Score Matching for Separating Selection and ...
    Apr 17, 2024 · This study evaluates the utility of propensity score matching (PSM) as a method that has been proposed as a means of removing selection effects across surveys ...
  43. [43]
    Double/Debiased Machine Learning for Treatment and Causal ...
    Jul 30, 2016 · Double/Debiased Machine Learning for Treatment and Causal Parameters, by Victor Chernozhukov and 6 other authors.
  44. [44]
    Double/debiased machine learning for treatment and structural ...
    Summary. We revisit the classic semi-parametric problem of inference on a low-dimensional parameter θ0 in the presence of high-dimensional nuisance parameters.
  45. [45]
    Gene–Environment Correlation: Difficulties and a Natural ...
    Sep 10, 2013 · Objectives. We explored how gene–environment correlations can result in endogenous models, how natural experiments can protect against this ...
  46. [46]
    [PDF] Economics and Econometrics of Gene-Environment Interplay
    Feb 25, 2022 · Endogeneity of E may arise from four sources: reverse causality, omitted variable ... gene-by-environment interaction studies of complex ...
  47. [47]
    [PDF] Stepping towards causation in studies of neighborhood and ...
    Mar 3, 2014 · Twin designs can help neighborhood effects studies overcome selection and reverse causation problems in specifying causal mechanisms. Beyond.Missing: endogeneity | Show results with:endogeneity<|control11|><|separator|>