
Panel data

Panel data, also known as longitudinal data or cross-sectional time-series data, refers to a dataset comprising observations on multiple entities—such as individuals, firms, countries, or states—collected over several successive time periods. This structure combines the cross-sectional dimension (variation across entities) with the time-series dimension (variation over time), enabling researchers to track changes within entities while comparing differences between them.

The use of panel data offers several key advantages over purely cross-sectional or time-series data. It provides more informative data, greater sample variability, and more degrees of freedom, which enhance the precision and reliability of statistical estimates. Additionally, panel data allows for the control of unobserved individual heterogeneity—such as fixed cultural or institutional factors—that might otherwise bias results in cross-sectional analyses, and it facilitates the study of dynamic relationships and causal effects over time. These features make panel data particularly valuable in fields like economics, health sciences, and social sciences for investigating topics such as income dynamics, policy impacts, and behavioral patterns.

In econometric analysis, panel data are commonly modeled using fixed-effects or random-effects approaches to account for entity-specific effects. Fixed-effects models treat individual intercepts as fixed parameters potentially correlated with the explanatory variables, effectively differencing out time-invariant unobserved heterogeneity to focus on within-entity variation over time. In contrast, random-effects models assume these effects are random draws from a distribution uncorrelated with the regressors, allowing the inclusion of time-invariant covariates and improving efficiency when the assumption holds. Datasets may be balanced (all entities observed for the same number of periods) or unbalanced (varying observation lengths), with estimation typically requiring specialized software like R's plm package or Stata's xtreg command.

Definition and Basics

Definition

Panel data is a multidimensional dataset comprising observations on multiple entities—such as individuals, firms, households, or countries—across multiple time periods, thereby integrating cross-sectional elements (variation across entities) with time-series elements (variation over time for the same entities). This structure facilitates the examination of both between-entity differences and within-entity changes, capturing unobserved heterogeneity and temporal dynamics that single-dimension data cannot. In econometric notation, the dependent variable for entity i at time t is typically denoted y_{it}, where i = 1, \dots, N indexes the N entities and t = 1, \dots, T indexes the T time periods. For a balanced panel, in which every entity is observed for all T periods, the total number of observations equals n = N \times T. Panel data is distinct from cross-sectional data, which observes multiple entities at a single time point; time-series data, which tracks a single entity over multiple periods; and pooled cross-sections, which collect data on different entities in each time period. Longitudinal data represents a broader category of repeated measures over time that encompasses panel data as a specific subtype involving fixed entities followed consistently. Panels may be balanced, with uniform observations across entities, or unbalanced, with varying numbers of periods per entity.
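
The (i, t) indexing maps naturally onto a two-level index in software. Below is a minimal sketch in Python with pandas, using hypothetical values and the MultiIndex convention expected by panel libraries such as linearmodels:

```python
# A minimal sketch of a balanced panel in pandas (hypothetical values).
import pandas as pd

N, T = 3, 4  # 3 entities observed over 4 periods
panel = pd.DataFrame({
    "entity": [i for i in range(1, N + 1) for _ in range(T)],
    "time":   [t for _ in range(N) for t in range(1, T + 1)],
})
panel["y"] = range(len(panel))               # placeholder outcome y_it
panel = panel.set_index(["entity", "time"])  # (i, t) MultiIndex

assert len(panel) == N * T  # balanced: n = N x T
```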

Historical Development

The concept of panel data, combining cross-sectional and time-series observations, emerged in the mid-20th century through agricultural experiments and early longitudinal studies aimed at analyzing productivity and behavioral patterns across multiple units over time. Early on, researchers in agricultural economics began using repeated observations on farms or regions to estimate production functions, addressing heterogeneity that single cross-sections or time series could not capture. A pivotal early contribution came from Yair Mundlak in 1961, who applied panel data to aggregate micro-level production functions, demonstrating how unobserved firm-specific effects could bias estimates and advocating for models that account for such heterogeneity in agricultural contexts.

The formalization of panel data methods in econometrics accelerated in the 1960s and 1970s, with key advancements in modeling error structures to pool cross-sectional and time-series data efficiently. Pietro Balestra and Marc Nerlove's 1966 paper introduced the error components model, which decomposes disturbances into individual-specific, time-specific, and idiosyncratic components, enabling consistent estimation of dynamic relationships like natural gas demand across U.S. states. This work laid the groundwork for handling correlated errors in panels, influencing subsequent developments in variance components estimation. The first International Panel Data Conference in 1977 at INSEE in Paris marked a milestone, fostering collaboration and highlighting the growing importance of these techniques in empirical research.

The 1980s and 1990s saw the maturation of panel data analysis through seminal textbooks and extensions to dynamic settings, broadening applications across economics and the social sciences. Cheng Hsiao's 1986 monograph, Analysis of Panel Data, provided a comprehensive framework for fixed and random effects models, emphasizing inference under limited observations per unit, and was substantially revised in 2003 to incorporate nonlinear and qualitative response models. Badi H. Baltagi's 1995 text, Econometric Analysis of Panel Data, became a standard reference, updated multiple times, including the sixth edition in 2021, which covers spatial panels, unit roots, and further methodological progress. In 1991, Manuel Arellano and Stephen Bond advanced dynamic panel estimation with generalized method of moments (GMM) techniques, addressing endogeneity and the Nickell bias in short panels through instruments derived from lagged levels.

Post-2000 developments have expanded panel data analysis to accommodate big-data environments and computational advancements, integrating machine learning for high-dimensional settings and causal inference in large-scale longitudinal studies up to 2025. Hsiao's fourth edition in 2022 reflects this evolution, incorporating Bayesian methods and nonparametric approaches for panels with many covariates. Applications have proliferated in fields like labor economics and climate modeling, leveraging computational tools for scalable estimation amid growing data availability from administrative records and surveys.

Data Structure and Types

Balanced and Unbalanced Panels

In panel data analysis, a balanced panel consists of observations on all N entities across every one of the T time periods, resulting in a total number of observations n = N \times T. This structure is advantageous because it facilitates straightforward matrix operations and the application of standard econometric methods without adjustments for incompleteness, though such datasets are relatively rare in empirical research due to real-world data collection constraints. In contrast, an unbalanced panel features missing observations for certain entity-time pairs (i,t), leading to n < N \times T. Common causes of this imbalance include sample attrition, where entities drop out over time; non-response in surveys; and gaps in data availability due to measurement issues or external events. These missing data necessitate specific handling strategies, such as listwise deletion or imputation, to proceed with analysis, though the choice depends on the underlying missingness mechanism. The implications of panel balance extend to econometric modeling, where balanced panels allow for simpler computational implementations in techniques like fixed effects estimation, as the design matrices remain full rank without sparsity. Unbalanced panels, however, can introduce complexities in estimation efficiency and require software that accommodates irregular observation patterns, potentially affecting the precision of parameter estimates if missingness is not properly addressed. Attrition in panels can be random (missing completely at random, or MCAR), where dropouts occur independently of observed or unobserved variables, or systematic (e.g., informative attrition), where missingness correlates with the outcome or covariates, leading to biased estimates if unaccounted for. Random attrition preserves the representativeness of the remaining sample, whereas informative attrition, often driven by factors like economic hardship or health changes in longitudinal studies, can systematically distort inferences about population parameters.
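
A quick way to diagnose balance in practice is to compare each entity's number of distinct periods with the total number of periods observed anywhere in the data. A minimal pandas sketch, with an illustrative function name and toy data:

```python
# Checking whether a panel is balanced: every entity should be
# observed in the same (full) set of time periods.
import pandas as pd

def is_balanced(df: pd.DataFrame, entity: str, time: str) -> bool:
    counts = df.groupby(entity)[time].nunique()  # periods per entity
    full_T = df[time].nunique()                  # periods in the data overall
    return bool((counts == full_T).all())

# Example: entity 3 is missing 2021, so the panel is unbalanced.
df = pd.DataFrame({"id":   [1, 1, 2, 2, 3],
                   "year": [2020, 2021, 2020, 2021, 2020],
                   "y":    [1.0, 1.2, 0.8, 0.9, 1.1]})
print(is_balanced(df, "id", "year"))  # False
```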

Long and Wide Formats

In panel data analysis, the long format organizes the dataset such that each row corresponds to a single observation for one entity at one specific time period, with columns typically including an entity identifier (e.g., individual or firm ID), a time variable (e.g., year or period), and the relevant covariates or outcome variables. This structure stacks observations vertically, resulting in a dataset where the number of rows equals the total number of entity-time combinations. The wide format, by contrast, arranges the data with one row per entity and separate columns for each time-varying variable across different periods, such as income in period 1, income in period 2, and so on. This horizontal layout condenses the data, making it more compact for entities with few time periods, and is often useful for visualization tasks or preliminary data transformations that do not require time-series indexing. However, it can become unwieldy with many time periods, as the number of columns grows proportionally. Conversion between long and wide formats is commonly performed using reshaping functions in statistical software, which facilitate efficient data manipulation. In R, functions like long_panel() from the panelr package or base reshape() can transform wide data to long by specifying entity and time identifiers, while the reverse uses widen_panel(); similar operations in Python's pandas library employ wide_to_long() or melt() for wide-to-long reshaping and pivot() for the opposite. In Stata, the reshape command handles these transformations, such as reshape long varname, i(entity_id) j(time) to convert from wide to long. For unbalanced panels, reshaping to wide format introduces missing values for entities without observations in certain periods, which must be accounted for during analysis to avoid bias. Software implementations for panel data models generally favor the long format, as it naturally supports entity-time indexing required for techniques like fixed effects estimation. For instance, Stata's xtset command declares panel structure in long format, R's plm package expects long-form data for panel regressions, and Python libraries like linearmodels process long-format inputs to handle the panel dimensions effectively. This preference stems from the format's ability to accommodate varying numbers of time periods per entity without excessive missing data complications.
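
As an illustration of these reshaping operations, here is a small pandas sketch converting a hypothetical wide dataset to long form and back; the column names are illustrative:

```python
import pandas as pd

# Wide format: one row per entity, one income column per year.
wide = pd.DataFrame({
    "id": [1, 2],
    "income1": [30, 40], "income2": [32, 42], "income3": [35, 45],
})

# Wide -> long: stubname "income" indexed by (id, year).
long = pd.wide_to_long(wide, stubnames="income", i="id", j="year").reset_index()

# Long -> wide: pivot back to one income column per year.
wide_again = long.pivot(index="id", columns="year", values="income")
```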

Examples and Applications

Illustrative Examples

To illustrate the structure of panel data, consider a hypothetical balanced dataset tracking three individuals (denoted as i=1, 2, 3) over three consecutive years (t=1, 2, 3). The variables include annual income (in thousands of dollars, time-varying), years of education (time-invariant), and age (time-varying). This setup captures repeated observations on the same entities, allowing analysis of both temporal changes and cross-entity comparisons, as outlined in standard econometric treatments of panel structures. The balanced panel contains exactly nine observations (3 individuals × 3 years), with no missing data:
Individual (i)   Year (t)   Income   Education   Age
1                1          30       12          25
1                2          32       12          26
1                3          35       12          27
2                1          40       16          30
2                2          42       16          31
2                3          45       16          32
3                1          25       10          28
3                2          27       10          29
3                3          30       10          30
In this example, education remains constant for each individual across years, reflecting its time-invariant nature, while income and age evolve over time. An unbalanced panel variant might arise from missing observations, such as data unavailability for individual 3 in year 3, resulting in eight observations total. This introduces gaps, which must be handled carefully in analysis to avoid bias:
Individual (i)   Year (t)   Income   Education   Age
1                1          30       12          25
1                2          32       12          26
1                3          35       12          27
2                1          40       16          30
2                2          42       16          31
2                3          45       16          32
3                1          25       10          28
3                2          27       10          29
(Missing: Individual 3, Year 3)
Basic interpretation of such data highlights the within-entity and between-entity dimensions. Within entities, changes over time can be tracked, such as individual 1's income growth from 30 to 35 (thousand dollars) alongside aging from 25 to 27 years, potentially indicating career progression. Between entities, differences emerge, like individual 2's consistently higher average income (approximately 42.3) compared to individual 3's (approximately 27.3), which may reflect variations in education levels. A simple pooled mean calculation demonstrates the cross-sectional and time-series aspects. The overall pooled mean income across all observations in the balanced panel is (30 + 32 + 35 + 40 + 42 + 45 + 25 + 27 + 30) / 9 = 34.0 (thousand dollars), aggregating the entire dataset. Cross-sectionally, the mean income in year 1 is (30 + 40 + 25) / 3 ≈ 31.7, in year 2 ≈ 33.7, and in year 3 ≈ 36.7, showing temporal trends. Along the time dimension, individual-specific means are 32.3 for i=1, 42.3 for i=2, and 27.3 for i=3, emphasizing entity heterogeneity.
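
The same pooled, cross-sectional, and within-entity means can be reproduced programmatically; a minimal pandas sketch using the table above:

```python
import pandas as pd

df = pd.DataFrame({
    "i": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "t": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "income": [30, 32, 35, 40, 42, 45, 25, 27, 30],
})

pooled_mean  = df["income"].mean()               # 34.0 overall
year_means   = df.groupby("t")["income"].mean()  # cross-sectional means per year
entity_means = df.groupby("i")["income"].mean()  # within-entity averages per i
```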

Real-World Applications

In economics, panel data enables detailed analysis of firm-level productivity by tracking variables such as wages, employment, and investment over time, allowing researchers to examine how factors like foreign direct investment influence performance across firms and periods. For instance, studies using firm-level panels have quantified the contributions of resource reallocation and technological adoption to productivity slowdowns in manufacturing sectors. In sociology, household-level panel data supports investigations into income inequality and social mobility, with the Panel Study of Income Dynamics (PSID)—initiated in 1968—serving as a key resource for tracking intergenerational wealth transfers and economic disparities over decades. This longitudinal approach has facilitated analyses of labor earnings mobility and inequality trends in the United States, revealing persistent patterns in household economic trajectories. Health sciences leverage panel data from longitudinal patient cohorts to evaluate treatment effects, particularly in clinical trials involving repeated measures of outcomes like disease progression or recovery. Such data structures allow for assessing the efficacy of interventions over time, as seen in cohort studies examining depressive symptoms and environmental factors across multiple waves. In environmental science, country-level panel data on emissions tracks annual greenhouse gas outputs to measure policy impacts, enabling comparisons of regulatory effectiveness across nations and years. For example, analyses of cross-country panels have identified combinations of climate policies that achieved major reductions in CO₂ emissions, totaling 3.2 GtCO₂ equivalent from 1970 to 2018 in select countries. Panel data's primary benefits in these applications include controlling for unobserved heterogeneity—such as time-invariant individual or entity-specific factors—through methods like fixed effects, which enhance causal inference by focusing on within-entity variation over time rather than cross-sectional comparisons. This approach mitigates biases from omitted variables, providing more reliable estimates of policy or treatment impacts compared to static data.

Advantages and Limitations

Advantages

Panel data offer substantial methodological advantages in econometric analysis by integrating cross-sectional and time-series dimensions, resulting in a larger number of observations compared to pure cross-sectional (where T=1) or time-series (where N=1) data, which enhances precision in estimating parameters. This increased sample size provides more degrees of freedom and greater variability, allowing for more reliable inference and reduced standard errors in model estimates. For instance, with n far exceeding either N or T alone, researchers can achieve asymptotically more efficient estimators, improving the accuracy of hypothesis testing.

A key benefit is the ability to control for unobserved individual heterogeneity, which mitigates the omitted-variable bias that often plagues cross-sectional studies. Panel data permit the incorporation of entity-specific effects, such as through fixed or random effects approaches, to account for time-invariant unobservables that differ across units, thereby isolating the impact of explanatory variables more effectively. This control enhances the validity of estimates by addressing sources of bias related to persistent differences between individuals or groups.

Panel data facilitate dynamic analysis by capturing intra-unit changes over time, enabling researchers to study temporal evolution and infer causality more robustly than with static data. Techniques like first-differencing can eliminate fixed individual effects, revealing how variables respond to shocks or policies across periods, which supports causal identification in longitudinal settings. This temporal dimension allows for the examination of adjustment dynamics and lagged effects, providing deeper insights into processes that unfold over time.

In terms of efficiency, panel data exhibit less collinearity among variables due to the combined variation from both dimensions, leading to more informative datasets and higher statistical power relative to single-dimension alternatives. The expanded variability reduces multicollinearity issues, enabling clearer identification of relationships and more precise predictions. Overall, this structure yields estimators with better finite-sample properties, making panel methods preferable for complex models.

Finally, panel data are particularly valuable for policy analysis, as they support the construction of counterfactuals in longitudinal contexts, allowing evaluation of "what-if" scenarios for interventions. By observing the same units before and after policy changes, researchers can estimate treatment effects while controlling for unit-specific baselines, informing evidence-based decision-making in economics and the social sciences. This capability is essential for assessing long-term policy impacts across diverse populations.

Limitations and Challenges

Panel data analysis, while powerful, faces significant challenges in data collection and maintenance. Gathering observations on the same cross-sectional units over multiple time periods is considerably more expensive and resource-intensive than collecting cross-sectional or pure time-series data, often requiring sustained investments in surveys, administrative tracking, or longitudinal studies. For instance, large-scale panels like the Panel Study of Income Dynamics (PSID) demand ongoing financial and organizational resources to ensure consistent follow-up, which can limit the feasibility of such datasets in resource-constrained research environments.

A major issue arises from attrition, where units systematically drop out of the sample over time, leading to non-representative data and potential bias in estimates. This dropout is often non-random, correlated with unobserved characteristics such as economic status or mobility, which distorts inferences about the underlying population. Empirical evaluations, such as those of the PSID, have shown that attrition can introduce substantial bias, particularly in long-running panels where cumulative losses alter sample composition.

Many panel datasets are characterized by short time dimensions, with the number of periods (T) typically small—often fewer than 10—which restricts the ability to capture temporal dynamics or estimate models reliant on time-series properties. This limitation is prevalent in economic and social science applications, where data availability constrains T, complicating the identification of lagged effects or trends. In panels with a large number of entities (N large) and short T, the incidental parameters problem emerges prominently, especially when incorporating fixed effects for units, resulting in inconsistent and biased estimates of common parameters. Originating from early work on partially consistent observations, this issue arises because the growing number of entity-specific parameters overwhelms the limited time-series information, leading to poor finite-sample performance.

Additionally, the inclusion of entity-specific dummy variables in large-N panels can induce multicollinearity, particularly when combined with an intercept term, which inflates variance and hampers precise estimation of coefficients. This computational and statistical challenge is exacerbated in unbalanced panels, where missing observations further complicate the handling of dummies.

Basic Econometric Analysis

Pooled OLS Regression

Pooled ordinary least squares (OLS) regression represents the simplest approach to analyzing panel data, treating the dataset as a single pooled cross-section or time series without accounting for entity-specific or time-specific effects. In this framework, the model is specified as y_{it} = \alpha + \beta' X_{it} + u_{it}, where y_{it} is the outcome variable for entity i at time t, X_{it} is a vector of time-varying regressors, \alpha is the intercept, \beta is the vector of coefficients, and u_{it} is the composite error term. Estimation proceeds by stacking all observations across entities and time periods into a single regression and applying standard OLS, yielding coefficient estimates \hat{\beta} = \left( \sum_{i=1}^N \sum_{t=1}^T X_{it} X_{it}' \right)^{-1} \sum_{i=1}^N \sum_{t=1}^T X_{it} y_{it}. This method assumes the panel structure does not introduce dependencies that violate classical OLS conditions, allowing for straightforward computation using conventional software.

The validity of pooled OLS relies on several key assumptions. First, the errors must be uncorrelated with the regressors, satisfying the exogeneity condition E(u_{it} | X_{it}) = 0. Second, homoskedasticity holds if Var(u_{it} | X_{it}) = \sigma^2 for all i, t, and there is no serial correlation, meaning Cov(u_{it}, u_{is} | X_i) = 0 for t \neq s. Third, no unobserved entity-specific or time-specific effects are present that could bias the estimates, implying the data can be treated as a homogeneous pool. Violations of these assumptions, particularly the zero conditional mean, lead to inconsistent estimates.

A primary pitfall of pooled OLS is omitted-variable bias arising from unobserved heterogeneity across entities, such as fixed individual effects \alpha_i that correlate with the regressors. For instance, in wage models where y_{it} is log wages, omitting time-invariant ability \alpha_i (which positively correlates with X_{it}) upwardly biases the estimated return to education. This bias persists even with large samples if the correlation Cov(\alpha_i, X_{it}) \neq 0, rendering the estimator inconsistent. Additionally, ignoring panel dependencies can invalidate standard errors, necessitating robust or clustered variance adjustments.

Pooled OLS is appropriate as a baseline model when entity and time effects are absent or uncorrelated with the regressors, such as in short panels with large N and negligible heterogeneity. It serves as a starting point for comparison in empirical work, particularly if tests confirm the poolability of the data, though it is generally less efficient than specialized panel estimators when panel structure exists.
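
A sketch of pooled OLS with entity-clustered standard errors, assuming the linearmodels and statsmodels packages and a hypothetical long-format DataFrame df with columns id, year, y, and x:

```python
# Pooled OLS on a panel, with standard errors clustered by entity.
import statsmodels.api as sm
from linearmodels.panel import PooledOLS

df = df.set_index(["id", "year"])    # (entity, time) MultiIndex
exog = sm.add_constant(df[["x"]])    # intercept alpha plus regressors
res = PooledOLS(df["y"], exog).fit(cov_type="clustered", cluster_entity=True)
print(res.params, res.std_errors, sep="\n")
```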

Fixed Effects Model

The fixed effects model addresses unobserved time-invariant heterogeneity in panel data by treating entity-specific effects as fixed parameters to be estimated. This approach is particularly useful when these effects are correlated with the explanatory variables, allowing for more reliable inference on the parameters of interest. The model is formulated as y_{it} = \alpha_i + \beta' X_{it} + v_{it}, where y_{it} is the outcome for entity i at time t, \alpha_i denotes the entity-specific intercept capturing time-invariant unobserved factors, X_{it} is a vector of time-varying regressors, \beta is the vector of coefficients, and v_{it} is the idiosyncratic error term.

Estimation of the model can be achieved through two numerically equivalent methods. The first is the least squares dummy variable (LSDV) approach, which includes a set of entity-specific dummy variables to directly estimate the \alpha_i. The second is the within transformation, which eliminates the \alpha_i by subtracting the entity-specific time mean from each variable: \tilde{y}_{it} = y_{it} - \bar{y}_i, \quad \tilde{X}_{it} = X_{it} - \bar{X}_i, \quad \tilde{v}_{it} = v_{it} - \bar{v}_i, yielding the transformed model \tilde{y}_{it} = \beta' \tilde{X}_{it} + \tilde{v}_{it}. Ordinary least squares applied to this demeaned equation provides consistent estimates of \beta, focusing solely on within-entity variation over time.

Key assumptions underlying the fixed effects model include allowing arbitrary correlation between the entity-specific effects \alpha_i and the regressors X_{it}, which distinguishes it from random effects approaches that assume no such correlation. Additionally, strict exogeneity is required, meaning that the idiosyncratic errors satisfy E(v_{it} | X_{i1}, \dots, X_{iT}, \alpha_i) = 0 for all t, ensuring no feedback from past, present, or future errors to the regressors. These assumptions enable the within estimator to purge the bias from omitted time-invariant variables without imposing restrictions on their correlation with observables.

The primary advantages of the fixed effects model lie in its ability to eliminate bias arising from time-invariant unobserved heterogeneity, such as individual ability or firm-specific characteristics, thereby isolating the causal effects of time-varying covariates. By relying on within-entity variation, it enhances the credibility of estimates in comparative settings, outperforming pooled OLS when entity effects are present and correlated with regressors. This method has been foundational since its early applications in agricultural economics.
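
The numerical equivalence of manual within-demeaning and a fixed effects routine can be illustrated as follows; a sketch assuming the same hypothetical MultiIndex df as in the pooled OLS example:

```python
# Two equivalent routes to the fixed effects estimator:
# (1) within-demeaning by hand, then OLS; (2) PanelOLS with entity effects.
import numpy as np
from linearmodels.panel import PanelOLS

# (1) Within transformation: subtract entity means from y and x.
demeaned = df[["y", "x"]] - df.groupby(level="id")[["y", "x"]].transform("mean")
beta_within = np.linalg.lstsq(demeaned[["x"]].values,
                              demeaned["y"].values, rcond=None)[0]

# (2) The same estimate via entity fixed effects.
res_fe = PanelOLS(df["y"], df[["x"]], entity_effects=True).fit()
print(beta_within, res_fe.params["x"])
```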

Random Effects Model

The random effects model in panel data econometrics treats unobserved individual-specific heterogeneity as a random variable that is uncorrelated with the explanatory variables, allowing for more efficient estimation compared to approaches that eliminate such effects. Formulated originally by Balestra and Nerlove, the model is specified as y_{it} = \alpha + \beta' X_{it} + u_{it}, where y_{it} is the dependent variable for entity i at time t, X_{it} is a vector of explanatory variables, and the composite error term decomposes into u_{it} = \mu_i + v_{it}, with \mu_i representing the individual-specific random effect and v_{it} the idiosyncratic error. The individual effects \mu_i are assumed to be independently and identically distributed (IID) as \mu_i \sim (0, \sigma_\mu^2), and crucially, uncorrelated with the regressors X_{it} at all leads and lags, enabling the use of the full variation in the data, including between-entity differences.

The variance of the composite error in this model is given by \text{Var}(u_{it}) = \sigma_v^2 + \sigma_\mu^2, where \sigma_v^2 = \text{Var}(v_{it}) assumes homoskedasticity and no serial correlation within individuals, while the individual effects introduce correlation across time for the same entity, with \text{Cov}(u_{it}, u_{is}) = \sigma_\mu^2 for t \neq s. Estimation proceeds via generalized least squares (GLS) or feasible GLS (FGLS), which accounts for this error structure by transforming the data to quasi-demean it. Wallace and Hussain developed a feasible estimation approach, involving an initial consistent estimate of the variance components \sigma_v^2 and \sigma_\mu^2 (often via OLS residuals), followed by application of the transformation factor \theta = 1 - \sqrt{\frac{\sigma_v^2}{T \sigma_\mu^2 + \sigma_v^2}}, where T is the number of time periods; the model is then estimated by OLS on the transformed variables y_{it} - \theta \bar{y}_i and X_{it} - \theta \bar{X}_i. This method yields consistent and asymptotically efficient estimates of \beta under the stated assumptions, as the random effects framework incorporates both within-entity and between-entity variation, unlike the fixed effects model, which purges the latter to eliminate correlation between effects and regressors. By leveraging the between variation, random effects estimation achieves greater statistical efficiency, particularly in panels with substantial cross-sectional heterogeneity and limited time dimensions, provided the exogeneity of \mu_i with respect to X holds.
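
A sketch of random effects estimation under the same assumptions, using linearmodels' RandomEffects, which implements a feasible GLS quasi-demeaning of the kind described above (the df and column names are the hypothetical ones from the earlier snippets):

```python
# Random effects via feasible GLS on the same MultiIndex df.
import statsmodels.api as sm
from linearmodels.panel import RandomEffects

res_re = RandomEffects(df["y"], sm.add_constant(df[["x"]])).fit()
print(res_re.params)
print(res_re.variance_decomposition)  # estimated sigma_mu^2 vs sigma_v^2 shares
```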

Model Selection Tests

In panel data analysis, model selection tests are essential for determining whether to use pooled ordinary least squares (OLS), fixed effects, or random effects models, based on the presence and nature of unobserved heterogeneity across entities. These tests help ensure the chosen model aligns with the data's structure, avoiding biased or inefficient estimates. The primary tests include the F-test for fixed effects, the Breusch-Pagan Lagrange multiplier (LM) test for random effects, and the Hausman test to distinguish between fixed and random effects approaches.

The F-test compares the pooled OLS model against the fixed effects model by examining the joint significance of the entity-specific dummy variables, which capture fixed effects. Under the null hypothesis, the fixed effects are zero (i.e., no unobserved entity-specific heterogeneity), implying that pooled OLS is appropriate. The test statistic is an F-statistic derived from the restricted (pooled) and unrestricted (fixed effects) models: F = \frac{(SSR_r - SSR_u)/q}{SSR_u/(NT - N - K)}, where SSR_r is the sum of squared residuals from the pooled model, SSR_u from the fixed effects model, q = N-1 is the number of restrictions (entity dummies minus one), N is the number of entities, T the number of time periods, and K the number of regressors. Rejection of the null indicates significant fixed effects, favoring the fixed effects model over pooled OLS. This test assumes standard OLS conditions hold under the null and is robust to certain violations when using clustered standard errors.

The Breusch-Pagan LM test assesses whether random effects are present by comparing the pooled OLS model to the random effects model, specifically testing the null hypothesis that the variance of the random individual effects is zero (\sigma_\mu^2 = 0), which would justify pooled OLS. The test is based on the residuals from the pooled OLS regression and follows a chi-squared distribution with one degree of freedom. The LM statistic is: LM = \frac{NT}{2(T-1)} \left( \frac{\sum_i \left( \sum_t \hat{e}_{it} \right)^2 }{\sum_i \sum_t \hat{e}_{it}^2} - 1 \right)^2 \sim \chi^2_1, where \hat{e}_{it} are the pooled OLS residuals, N the number of entities, and T the time periods. Rejection of the null suggests individual-specific random effects, supporting the random effects model. This test is particularly useful when the random effects assumption of uncorrelatedness with the regressors may hold, and it performs well in balanced panels.

Once evidence of individual effects is found (via F or LM tests rejecting pooled OLS), the Hausman test is used to choose between fixed and random effects models. It tests the null hypothesis that the random effects estimators are consistent and efficient, meaning the individual effects are uncorrelated with the regressors; under the alternative, fixed effects are preferred as random effects would be inconsistent. The test compares the fixed effects (\hat{\beta}_{FE}) and random effects (\hat{\beta}_{RE}) estimates, with the statistic: H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})' [\operatorname{Var}(\hat{\beta}_{FE}) - \operatorname{Var}(\hat{\beta}_{RE})]^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE}) \sim \chi^2_k, where k is the number of regressors tested (typically excluding those collinear with the effects), and the variance-covariance matrices are estimated from each model. Rejection of the null (significant H) indicates correlation between effects and regressors, favoring fixed effects for consistency. The test requires the fixed effects estimator to be consistent under both hypotheses and assumes sufficient degrees of freedom; robust versions address heteroskedasticity.

In practice, if the LM test rejects pooled OLS but the Hausman test fails to reject random effects, the random effects model is selected for its efficiency gains over fixed effects.
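
The Hausman statistic is straightforward to compute from the fixed and random effects fits; a minimal sketch (not a library routine) reusing res_fe and res_re from the earlier snippets and the formula above:

```python
# Hausman test comparing FE and RE estimates on the shared regressor "x".
import numpy as np
from scipy import stats

b_fe, b_re = res_fe.params[["x"]], res_re.params[["x"]]
v_fe = res_fe.cov.loc[["x"], ["x"]]
v_re = res_re.cov.loc[["x"], ["x"]]

diff = (b_fe - b_re).values
H = float(diff.T @ np.linalg.inv((v_fe - v_re).values) @ diff)
p_value = stats.chi2.sf(H, df=len(diff))  # small p -> prefer fixed effects
```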

Dynamic Panel Data

Model Formulation

The dynamic panel data model extends the static framework by incorporating lagged values of the dependent variable to capture temporal persistence and dynamics in the data. The standard linear formulation is given by y_{it} = \alpha_i + \beta' X_{it} + \gamma y_{i,t-1} + \epsilon_{it}, where y_{it} is the dependent variable for individual i at time t, \alpha_i denotes the unobserved individual-specific fixed effect capturing time-invariant heterogeneity, X_{it} is a vector of time-varying explanatory variables, \beta is the corresponding vector of coefficients, \gamma is the coefficient on the lagged dependent variable representing persistence, and \epsilon_{it} is the idiosyncratic error term. This model assumes a large number of cross-sectional units (N \to \infty) and a short time dimension (T fixed), common in panel data settings. Unlike static fixed effects models, which exclude lagged dependent variables and assume exogeneity of the regressors, the inclusion of y_{i,t-1} in the dynamic model induces endogeneity because the lagged term correlates with the fixed effect \alpha_i and potentially with past errors, necessitating instrumental variables for consistent estimation.

A key assumption is strict exogeneity of the regressors X_{it}, meaning E(\epsilon_{it} \mid X_{i1}, \dots, X_{iT}, \alpha_i) = 0 for all t, ensuring no correlation between the regressors and current or future errors. The error term \epsilon_{it} is typically assumed to be mean zero and serially uncorrelated, though extensions allow for an AR(1) structure in the idiosyncratic errors to model mild persistence beyond the lagged dependent variable. Additionally, errors are assumed independent across individuals, with E(\epsilon_{it} \mid \alpha_i) = 0.

A prominent issue in estimating this model via fixed effects methods, such as the within-group transformation, is the Nickell bias, which arises from the correlation between the transformed lagged dependent variable and the transformed error term. Specifically, in panels with short T, this leads to a downward bias in the estimate of \gamma, with the bias of order O(1/T) and approaching zero only as T \to \infty. This bias is particularly severe when \gamma is close to unity, reflecting high persistence, and underscores the challenges of incidental parameters in dynamic settings.
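
The Nickell bias can be demonstrated directly by simulation; a self-contained sketch with hypothetical parameters, comparing the within (fixed effects) estimate of \gamma to its true value:

```python
# Simulating the Nickell bias: with short T, the within estimate of gamma
# in y_it = alpha_i + gamma * y_{i,t-1} + e_it is biased downward.
import numpy as np

rng = np.random.default_rng(0)
N, T, gamma = 500, 5, 0.5
alpha = rng.normal(size=N)

y = np.zeros((N, T + 1))
for t in range(1, T + 1):
    y[:, t] = alpha + gamma * y[:, t - 1] + rng.normal(size=N)

# Within estimator on (y_it, y_{i,t-1}), demeaned entity by entity.
Y, X = y[:, 1:], y[:, :-1]
Yd = Y - Y.mean(axis=1, keepdims=True)
Xd = X - X.mean(axis=1, keepdims=True)
gamma_fe = (Xd * Yd).sum() / (Xd ** 2).sum()
print(gamma_fe)  # noticeably below 0.5 for small T, shrinking as T grows
```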

Estimation Techniques

Estimation of dynamic panel data models requires addressing the endogeneity arising from the inclusion of lagged dependent variables and their correlation with unobserved individual effects. Instrumental variables (IV) methods provide a foundational approach, particularly under the assumption of no serial correlation in the errors, where deeper lags of the dependent variable serve as instruments for the lagged dependent variable in the transformed model. The Arellano-Bond generalized method of moments (GMM) estimator extends this IV framework by first-differencing the model to eliminate the individual-specific effects \alpha_i, yielding the equation: \Delta y_{it} = \beta' \Delta X_{it} + \gamma \Delta y_{i,t-1} + \Delta v_{it}. Here, internal instruments are derived from the levels of the variables, such as lagged levels of y and X, which are valid under the assumptions of no serial correlation and exogeneity of the regressors conditional on the fixed effects. This difference GMM estimator is consistent for panels with small time dimensions T and large cross-sectional dimensions N, though it can suffer from weak instruments when the autoregressive parameter \gamma is close to unity.

To improve efficiency, the system GMM estimator proposed by Blundell and Bond combines the differenced equation with an additional equation in levels, incorporating lagged differences of the variables as instruments for the levels under an assumption of mean stationarity. This approach reduces finite-sample bias and increases precision, particularly in panels with persistent data or when T is moderate, making it widely adopted for empirical applications in economics.

Validity of these GMM estimators is assessed through diagnostic tests, including the Sargan or Hansen J test for overidentifying restrictions, which checks instrument orthogonality, and the Arellano-Bond AR(2) test for second-order serial correlation in the first-differenced errors, as first-order correlation is expected by construction. Failure of these tests may indicate model misspecification or invalid instruments. Implementation of these techniques is facilitated by software packages such as xtabond2 in Stata, which supports both difference and system GMM with options for instrument selection and robustness, and the plm package in R, which provides functions for GMM estimation of dynamic panels alongside standard errors adjusted for clustering.
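
The IV logic underlying difference GMM can be illustrated with a just-identified Anderson-Hsiao-style estimator, which first-differences the model and instruments the differenced lag with a second lag in levels; a sketch continuing the simulation above:

```python
# Anderson-Hsiao-style IV: instrument dy_{i,t-1} with the level y_{i,t-2},
# valid when e_it is serially uncorrelated. Reuses y from the Nickell sketch.
import numpy as np

dy = y[:, 1:] - y[:, :-1]     # first differences, columns for t = 1..T
dep  = dy[:, 2:].ravel()      # dy_{i,t},   t = 3..T
endo = dy[:, 1:-1].ravel()    # dy_{i,t-1}, t = 3..T
inst = y[:, 1:-2].ravel()     # y_{i,t-2},  t = 3..T

gamma_iv = (inst @ dep) / (inst @ endo)  # just-identified IV estimate
print(gamma_iv)  # close to 0.5, unlike the biased within estimate
```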

Advanced Topics

High-Dimensional Panel Models

High-dimensional panel models address scenarios where the number of entities (N) or covariates grows large relative to the time dimension (T), common in modern applications such as macroeconomic forecasting across numerous countries or firms. These models extend traditional fixed effects approaches by incorporating unobserved common factors or numerous regressors to capture pervasive heterogeneity and cross-sectional dependence, while mitigating biases from high dimensionality. Unlike standard low-dimensional panels, they require specialized methods to handle the curse of dimensionality and to ensure consistency as N and T increase.

A key framework is the approximate factor model, specified as y_{it} = \lambda_i' f_t + \beta' X_{it} + \epsilon_{it}, where f_t represents unobserved common factors driving cross-sectional correlations, \lambda_i are entity-specific loadings, X_{it} are observed covariates, and \epsilon_{it} is an idiosyncratic error. Principal components analysis provides a consistent estimator for the factors and loadings when both N and T are large, achieving convergence rates of order \min(\sqrt{N}, \sqrt{T}). This approach, developed by Bai (2009), allows for interactive fixed effects that are correlated with regressors and performs well in simulations for panels with moderate factor counts.

For sparse high-dimensional settings with many covariates (e.g., N large and T moderate), penalized methods like the lasso are employed to select relevant predictors and estimate parameters simultaneously. The lasso imposes an \ell_1 penalty on coefficients, shrinking irrelevant ones to zero, which is particularly useful in panels with cross-sectional dependence. Recent extensions integrate the lasso with cross-section augmentation to handle interactive effects, yielding oracle-consistent inference under sparsity assumptions.

These models face the incidental parameters problem, where estimating numerous entity-specific parameters biases inference, especially in large N/T asymptotics with weak or heterogeneous factors. Post-2010 advances, such as the common correlated effects (CCE) estimator, address this by augmenting regressions with cross-sectional averages of observables to proxy unobserved factors, ensuring consistency even with weak factors. The CCE approach, originally proposed by Pesaran (2006), is robust to heterogeneous slopes and has been applied to panels spanning 100 or more economies.
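
The CCE idea reduces to augmenting the regression with period-by-period cross-sectional averages of the observables; a simplified pooled sketch (common coefficients on the averages, unlike the fully heterogeneous CCE estimator) assuming hypothetical (N × T) arrays:

```python
# A simplified pooled CCE-style regression: proxy the unobserved factors
# f_t with cross-sectional averages of y and x at each t.
import numpy as np

def cce_pooled(y_panel: np.ndarray, x_panel: np.ndarray) -> float:
    N, T = y_panel.shape
    ybar, xbar = y_panel.mean(axis=0), x_panel.mean(axis=0)  # averages per t
    rows = []
    for i in range(N):
        for t in range(T):
            # regressor of interest, factor proxies, and a constant
            rows.append([x_panel[i, t], ybar[t], xbar[t], 1.0])
    Z = np.asarray(rows)
    coef, *_ = np.linalg.lstsq(Z, y_panel.ravel(), rcond=None)
    return coef[0]  # slope on x, with factor proxies partialled out
```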

Integration with Machine Learning

The integration of machine learning (ML) techniques with panel data econometrics has gained prominence since the mid-2010s, enabling more flexible modeling of heterogeneity, nonlinearity, and high-dimensionality while preserving inferential capabilities. Hybrid approaches leverage ML's predictive power to approximate nuisance parameters or capture complex patterns, often combined with econometric corrections for endogeneity and clustering. These methods address limitations in traditional models, particularly in unbalanced or short panels common in economic and social data.

Double machine learning (double ML) extends the debiased estimation framework of Chernozhukov et al. (2018) to panel settings, allowing robust inference on causal effects amid high-dimensional confounders and unobserved heterogeneity. In static models with fixed effects, double ML constructs Neyman-orthogonal scores that account for two-way clustering, using ML algorithms like gradient boosting or random forests to flexibly estimate conditional expectations while enabling cross-fitting to reduce bias. Adaptations for panels, such as those incorporating entity and time fixed effects, achieve valid inference even with many covariates, outperforming traditional instrumental variables in simulations with high-dimensional controls. Recent implementations, including the DoubleML package, facilitate practical application to policy evaluation in longitudinal data.

For dynamic panels, recurrent neural networks (RNNs) and long short-term memory (LSTM) models incorporate entity embeddings to handle unobserved heterogeneity across units, improving accuracy over linear autoregressive models. These architectures process sequential observations while embedding categorical identifiers (e.g., firms or regions) into low-dimensional vectors, capturing nonlinear dynamics and interactions without assuming homogeneous slopes. In applications to firm performance prediction using panel data and to macroeconomic nowcasting with mixed-frequency panels, such models have demonstrated improved accuracy over dynamic linear panels, particularly in short-T settings with missing values.

In high-dimensional panels, tree-based ML methods like random forests and gradient boosting incorporate regularization via mixed effects to adjust for clustered errors, enhancing variable selection and prediction. Random forests for longitudinal data model correlation structures stochastically, splitting on both covariates and time to accommodate within-unit dependence, with out-of-bag errors providing unbiased variance estimates. Boosting variants, such as mixed-effect gradient boosting, iteratively fit trees while penalizing clustered residuals, achieving 35-76% MSE reductions in nonlinear simulations over unadjusted boosters. These approaches maintain econometric validity by clustering standard errors post-estimation, making them suitable for robustness assessment in applied settings.

Causal ML extensions, including causal forests adapted for panels, further advance causal inference by estimating heterogeneous treatment effects under difference-in-differences assumptions. These methods, building on generalized random forests, recover time-varying effects robustly in the presence of fixed effects, with applications showing improved out-of-sample policy attribution in economic panels up to 2025.

Advantages of these integrations include superior prediction in unbalanced or short panels, where traditional models falter due to incidental parameters, and enhanced inference for policy evaluation via debiased estimation, as evidenced by reduced bias in treatment effect estimates across diverse datasets. However, challenges persist in balancing interpretability—ML's black-box nature obscures economic mechanisms—with econometric rigor, and ensuring exogeneity requires careful orthogonalization to avoid bias from unobserved time-varying factors.
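
A minimal sketch of the double ML partialling-out step applied to panel data that has already been within-demeaned to remove entity effects; the array names are hypothetical and scikit-learn is assumed:

```python
# Double ML for a partially linear model: Y = theta*D + g(W) + e, D = m(W) + v.
# Cross-fitting: residualize Y and D on controls W with out-of-fold ML fits,
# then regress the Y-residuals on the D-residuals.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def dml_plr(Y, D, W, n_folds=2, seed=0):
    Y_res, D_res = np.zeros_like(Y), np.zeros_like(D)
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(W):
        Y_res[test] = Y[test] - RandomForestRegressor(random_state=seed) \
            .fit(W[train], Y[train]).predict(W[test])
        D_res[test] = D[test] - RandomForestRegressor(random_state=seed) \
            .fit(W[train], D[train]).predict(W[test])
    return float((D_res @ Y_res) / (D_res @ D_res))  # orthogonalized theta-hat
```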

Notable Datasets

Standard Panel Datasets

Standard panel datasets in econometrics provide foundational resources for analyzing longitudinal data across entities over time, typically featuring balanced or unbalanced structures with key socioeconomic variables. These datasets are widely used in studies of labor economics, health, and development, offering publicly accessible data through reputable repositories.

The Panel Study of Income Dynamics (PSID) is the world's longest-running longitudinal household survey, initiated in 1968 by the University of Michigan's Institute for Social Research with a nationally representative sample of approximately 18,000 individuals in 5,000 U.S. families. It collects annual data (biennial since 1997) on variables including income, employment, wealth, health, and demographics, tracking families and their descendants across generations, resulting in approximately 44 waves and a current core sample of around 9,000 households spanning 57 years (N ≈ 9,000, T ≈ 57). While the dataset includes a balanced core subsample for certain analyses, it experiences attrition over time, with cumulative nonresponse rates managed through refreshment samples to maintain representativeness. Access to PSID data is available through the official PSID website and the Inter-university Consortium for Political and Social Research (ICPSR) repository, supporting restricted and public-use files for researchers.

The German Socio-Economic Panel (SOEP), established in 1984 by the German Institute for Economic Research (DIW Berlin), is a longitudinal survey of private households in Germany, initially sampling about 12,000 individuals in 6,000 households from West Germany, with enlargements including an East German sample in 1990 and subsequent immigrant and refreshment cohorts. It annually gathers data from approximately 30,000 individuals in 20,000 households on topics such as labor market participation, health outcomes, education, and life satisfaction, providing over 40 waves of harmonized information for cross-sectional and panel analyses. The SOEP maintains high retention rates, with panel attrition documented annually, and includes weights to adjust for nonresponse and sample design. Data are publicly accessible via the SOEP website at DIW Berlin, with user agreements for scientific use.

The World Bank's World Development Indicators (WDI) dataset compiles country-level panel data on economic, social, and environmental indicators for over 200 economies, with annual observations dating back to 1960 for many series, encompassing more than 1,400 time series variables such as GDP, poverty rates, education enrollment, and health metrics (as of the October 2025 update). This macro-level panel supports global development research, featuring balanced panels for core indicators across N ≈ 217 countries and T up to 60+ years, though coverage varies by variable and country due to data availability. The WDI is sourced from official international statistics and national accounts, ensuring comparability. It is freely downloadable through the World Bank DataBank portal, with API access for bulk retrieval, and is often integrated into repositories like the National Bureau of Economic Research (NBER) for econometric applications.

These standard datasets, available via platforms such as ICPSR, DIW Berlin, and DataBank, exemplify two-dimensional (N × T) panels essential for fixed and random effects modeling in econometrics.

Multi-Dimensional Panel Datasets

Multi-dimensional panel datasets extend the traditional two-dimensional structure of entities (N) observed over time (T) by incorporating additional dimensions, such as spatial relationships or multiple categorical indices, enabling the analysis of complex interactions like geographic dependencies or multi-lateral flows. These datasets are particularly valuable in fields like regional science, international trade, and the social sciences, where phenomena exhibit interdependence across space, networks, or hierarchies beyond simple cross-sections.

Spatio-temporal panels represent a common form of multi-dimensional data, structuring observations as N regions cross-classified by T time periods, often incorporating spatial lags to account for geographic spillovers. For instance, U.S. county-level data spanning 1960 to 1990 has been used to examine homicide rates alongside structural covariates, revealing spatial patterns in crime dynamics. Such datasets typically include variables like crime counts or socioeconomic indicators, with spatial lags modeling how outcomes in one region influence neighbors, as seen in urban crime count models derived from Uniform Crime Reporting data.

Multi-way panels introduce further dimensions, such as in international trade data organized by exporters, importers, and years, forming a three-way structure that captures bilateral flows and their evolution. This setup is prevalent in gravity models of trade, where datasets track merchandise flows between country pairs over decades, allowing estimation with multi-way clustering to address correlated errors across dimensions. Notable examples include firm-level trade panels linking firms, products, and time, which decompose bilateral flows into exporter-specific, importer-specific, and pairwise effects under large-dimensional asymptotics.

Prominent multi-dimensional datasets encompass the Penn World Table, which provides data on 185 countries across variables like GDP, capital, and productivity from 1950 to 2023, incorporating macro-economic dimensions for cross-country comparisons. Similarly, the European Social Survey offers a multi-level panel with data on attitudes and behaviors from individuals nested within countries over multiple waves since 2002, covering up to 39 participating nations and enabling analysis of cross-national variations.

Estimating models from these datasets presents heightened complexity, particularly due to spatial autocorrelation, where errors or dependent variables correlate across geographic units, necessitating specialized techniques like spatial lag or spatial error models to avoid biased inference. Data sources such as the CEPII Gravity database facilitate this by supplying comprehensive panels of trade flows, distances, and trade agreements for over 200 countries from 1948 to 2020, supporting multi-way analyses while highlighting challenges in handling high-dimensional fixed effects.

Since 2010, the proliferation of remote-sensing data from satellites and sensors has expanded multi-dimensional panels, exemplified by nighttime lights datasets that track luminosity as a proxy for economic activity across spatial grids over time. Harmonized nighttime light observations from sensors like VIIRS, available annually from 1992 to 2024, enable spatio-temporal panels of urban extents and human activity at fine resolutions, with post-2010 improvements in data quality driving applications in development economics.

References

  1. [1]
    Panel Data Using R: Fixed-effects and Random-effects
    May 26, 2023 · Panel data (also known as longitudinal or cross-sectional time-series data) refers to data for n different entities at different time periods.
  2. [2]
    [PDF] Panel Data Analysis Fixed and Random Effects using Stata
    Wide form data (time in columns). If your dataset is in wide format, either entity or time are in columns, you need to reshape it to long format.
  3. [3]
    Panel data analysis—advantages and challenges | TEST
    Mar 16, 2007 · We explain the proliferation of panel data studies in terms of (i) data availability, (ii) the more heightened capacity for modeling the complexity of human ...
  4. [4]
    [PDF] Panel Data Econometrics - Kansas State University
    Spatial Models. Dong Li (Kansas State University). Panel Data Econometrics. Fall 2009. 1 / 115. Page 4. Introduction. Preliminary Definitions and Some Examples.
  5. [5]
    [PDF] Panel Data Econometrics: Theory - NYU Stern
    This book is a collection of chapters that require some background in panel data econometrics ... definition of conditional probability, Pr(α,β) ¼ Pr (ajβ) Pr (β) ...
  6. [6]
    [PDF] Econometric Analysis of Panel Data - my.SMU
    ... Econometric Analysis of Panel Data. Page 4. Badi H. Baltagi. Badi H. Baltagi earned his PhD in Economics at the University of Pennsylvania in 1979. He joined ...
  7. [7]
    [PDF] Panel Data —Chapter 14 of Wooldridge's textbook - Miami University
    Panel data is obtained by observing the same person, firm, county, state, etc over several periods. 2. Panel data can be used to control for unobserved ...
  8. [8]
    [PDF] A Primer for Panel Data Analysis by Robert A. Yaffee
    Sep 21, 2003 · Panel data analysis is an increasingly popular form of longitudinal data analysis among social and behavioral science researchers.<|control11|><|separator|>
  9. [9]
    [PDF] The History of Panel Data Econometrics, 1861–1997 Preface
    models have their origin in the work on least squares of Gauss and Legendre, who were concerned with the optimal combination of astronomical observa- tions, but ...
  10. [10]
    The Early Years of Panel Data Econometrics | Request PDF
    This essay focuses on the early years of panel data econometrics and two seminal papers by Yair Mundlak (1961) and Pietro Balestra and Marc Nerlove (1966).<|separator|>
  11. [11]
    The Early Years of Panel Data Econometrics - Duke University Press
    Dec 1, 2011 · This essay focuses on the early years of panel data econometrics and two seminal papers by Yair Mundlak (1961) and Pietro Balestra and Marc ...
  12. [12]
    A Note on Error Components Models - jstor
    [2] BALESTRA, P., AND M. NERLOVE: "Pooling Cross Section and Time Series Data in the Estimation of a Dynamic Model: The Demand for Natural Gas," Technical ...
  13. [13]
    [PDF] Celebrating 40 Years of Panel Data Analysis: Past, Present and Future
    Feb 3, 2020 · The conference marked the 40th anniversary of the inaugural. International Panel Data Conference, which was held in 1977 at INSEE in Paris,.
  14. [14]
    Econometric Analysis of Panel Data - SpringerLink
    In stock Free deliveryThis textbook offers a comprehensive introduction to panel data econometrics, an area that has enjoyed considerable growth over the last two decades.
  15. [15]
    Some Tests of Specification for Panel Data: Monte Carlo Evidence ...
    Abstract. This paper presents specification tests that are applicable after estimating a dynamic model from panel data by the generalized method of moments.
  16. [16]
    [PDF] Analysis of Panel Data, Fourth Edition
    Now in its fourth edition, this comprehensive introduction to fundamental panel data method- ologies provides insights on what is most essential in panel ...
  17. [17]
    Recent Developments in Panel Data Methods - Econometrics - MDPI
    Topics of this Special Issue include static panel data models, dynamic panel data models with short or long time series dimension, nonlinear panel data models, ...Missing: 2000-2025 | Show results with:2000-2025
  18. [18]
    Attrition Bias in Econometric Models Estimated with Panel Data - jstor
    The major purposes of this article therefore are to (1) demonstrate the effect attrition can have, (2) describe two correction methods, and (3) illustrate the ...Missing: MCAR | Show results with:MCAR
  19. [19]
    [PDF] AN ANALYSIS OF SAMPLE ATTRITION IN PANEL DATA
    In this paper we present the results of a study of attrition and its potential bias in one of the most well-known panel data sets, the. Michigan Panel Study of ...
  20. [20]
    Introduction to the Fundamentals of Panel Data - Aptech
    Nov 29, 2019 · Panel data contains more information, more variability, and more efficiency than pure time series data or cross-sectional data. Panel data can ...What Is Panel Data? · Panel Data and Heterogeneity · Modeling Panel Data
  21. [21]
    Reshaping panel data with long_panel() and widen_panel() - CRAN
    Aug 21, 2023 · Most regression analyses for panel data require the data to be in long format. That means there is a row for each entity (eg, person) at each time point.Missing: econometrics | Show results with:econometrics
  22. [22]
    pandas.wide_to_long — pandas 2.3.3 documentation - PyData |
    Unpivot a DataFrame from wide to long format. Less flexible but more user-friendly than melt. With stubnames ['A', 'B'], this function expects to find one ...Dev · 1.2 · 1.1 · 2.0
  23. [23]
    [PDF] Lecture 9: Panel Data Model (Chapter 14, Wooldridge Textbook)
    Panel data is obtained by observing the same person, firm, county, etc over several periods. • Unlike the pooled cross sections, the observations for the same ...Missing: definition | Show results with:definition
  24. [24]
    (PDF) Foreign direct investment and firm level productivity - A panel ...
    This paper uses panel data to examine the effects of foreign presence on firm level productivity in the Kenyan manufacturing industry employing “traditional” ...Missing: applications | Show results with:applications
  25. [25]
    [PDF] A FIRM-LEVEL ANALYSIS OF LABOR PRODUCTIVITY IN THE ...
    We use firm-level panel data to assess if and how much each of the proposed mechanisms contributes to the productivity slowdown. The analysis of resource ...
  26. [26]
    Panel Study of Income Dynamics (PSID)
    The Panel Study of Income Dynamics (PSID) is the longest running longitudinal household survey in the world. The study began in 1968 with a nationally ...Documentation · Studies · PSID FAQ · NewsMissing: inequality | Show results with:inequality
  27. [27]
    [PDF] The Panel Study of Income Dynamics and Mobility
    May 9, 2016 · Labor Earnings Mobility and. Inequality in the United States and Germany During the Growth Years of the. 1980s. International Economic Review.
  28. [28]
    Statistical Issues in Longitudinal Data Analysis for Treatment ... - NIH
    Jun 29, 2010 · Longitudinal data consist of outcome measurements repeatedly taken on each experimental unit (e.g., cell line or mouse) over time. Such data are ...
  29. [29]
    panel data analysis of five longitudinal cohort studies - NIH
    Aug 2, 2024 · Internet exclusion was found to be significantly associated with depressive symptoms in all cohort studies and countries, except for older adults in Finland ...
  30. [30]
    Global climate policy effectiveness: A panel data analysis
    Aug 1, 2023 · This paper analyzes the energy policy paths of six leading countries and compares the effectiveness of their policies.
  31. [31]
    Countries with sustained greenhouse gas emissions reductions
    We find that 24 countries have sustained reductions in annual CO2 and GHG emissions between 1970 and 2018, in total equalling 3.2 GtCO2eq since their ...
  32. [32]
    8 Panel Data - Causal Inference The Mixtape
    First, it will eliminate any and all unobserved and observed time-invariant covariates correlated with the treatment variable. So long as the treatment and the ...
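Entry 32's snippet states that fixed effects eliminate all time-invariant covariates correlated with the treatment. A minimal sketch of why, via the within (entity-demeaning) transformation that fixed-effects estimation implicitly applies; the toy data and column names here are hypothetical:

```python
import pandas as pd

# Hypothetical panel: two entities ("id") observed over three years.
df = pd.DataFrame({
    "id":   [1, 1, 1, 2, 2, 2],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "y":    [5.0, 6.0, 7.5, 3.0, 3.5, 4.0],
    "x":    [1.0, 2.0, 3.0, 0.5, 1.0, 1.5],
    "z":    [9.0, 9.0, 9.0, 4.0, 4.0, 4.0],  # time-invariant within each id
})

# Within transformation: subtract each entity's own mean from each column.
# Any column that never varies within an entity (here z) becomes all zeros,
# which is the sense in which fixed effects "eliminate" such covariates.
demeaned = df.groupby("id")[["y", "x", "z"]].transform(lambda s: s - s.mean())
print(demeaned["z"])  # all zeros
```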
  33. [33]
    Causal models for longitudinal and panel data: a survey
    In recent years, there has been a fast-growing and exciting body of research on new methods for estimating causal effects in panel or longitudinal data settings ...
  34. [34]
    [PDF] Panel Data Econometrics - Kansas State University
    Panel data gives more informative data, more variability, less collinearity among the variables, and more degrees of freedom.
  35. [35]
    [PDF] Panel Data - Joan Llull
    The main advantages of panel data are two. First, they allow us to deal with permanent unobserved heterogeneity, i.e. potentially relevant variables that are ...
  36. [36]
    [PDF] Econometric Analysis of Panel Data
    Econometric Analysis of Panel Data, third edition. Badi H. Baltagi.
  37. [37]
    [PDF] The need for and use of panel data | IZA World of Labor
    In contrast, panel data allow for a degree of control over selection issues and potential related biases. By extension, assessing causality is also challenging ...
  38. [38]
    A Review of Panel Data on Spatial Econometrics Models
    Panel data are generally more informative data, more variability, more efficiency, better able to study the dynamics of adjustment, and better able to identify ...
  39. [39]
    [PDF] Causal Inference with Panel Data - Lecture 1 - Yiqing Xu
    Aug 23, 2021 · DiD and 2WFE have benefits: • Accounting for unobserved unit and time heterogeneity. • Accommodating many types of data structure. • Easy to ...
  40. [40]
    [PDF] Panel data. Between and within variation. Random and fixed effects ...
    Advantages of Panel Data (Baltagi, 2014). Controlling for individual heterogeneity. Panel data offer more informative data, more variability, less collinearity.
  41. [41]
    [PDF] Panel data analysis—advantages and challenges - SciSpace
    Panel data usually contain more degrees of freedom and more sample variability than cross-sectional data which may be viewed as a panel with T = 1, or time ...
  42. [42]
    [PDF] Causal Models for Longitudinal and Panel Data: A Survey
    The earlier econometric panel data literature discusses the tradeoffs between various asymptotic approximations for the analysis of dynamic linear models, e.g., ...
  43. [43]
  44. [44]
    The Panel Study of Income Dynamics after Fourteen Years - jstor
    This article considers the representativeness of the Panel Study of Income Dynamics (PSID) over its 14-year history from 1968 to 1981.
  45. [45]
    Estimating Dynamic Random Effects Models from Panel Data ... - jstor
    Panel Data Covering Short Time Periods, by Alok Bhargava and J. D. Sargan. This paper advocates the use of simultaneous equations estimators (especially LIML) ...
  46. [46]
  47. [47]
    Retrospectives: Yair Mundlak and the Fixed Effects Estimator
    We discuss Yair Mundlak's (1927–2015) contribution to econometrics through the lens of the fixed effects estimator.
  48. [48]
    Empirical Production Function Free of Management Bias | American ...
    Yair Mundlak; Empirical Production Function Free of Management Bias, American Journal of Agricultural Economics, Volume 43, Issue 1, 1 February 1961, Pages.
  49. [49]
    Econometric Analysis of Cross Section and Panel Data - MIT Press
    This acclaimed graduate text provides a unified treatment of two methods used in contemporary econometric research, cross section and panel data methods.
  50. [50]
    On the Proper Computation of the Hausman Test Statistic in ... - MDPI
    Nov 8, 2023 · We provide new analytical results for the implementation of the Hausman specification test statistic in a standard panel data model.
  51. [51]
    [PDF] Lecture 15 Panel Data Models
    A panel, or longitudinal, data set is one where there are repeated observations on the same units: ...
  52. [52]
    [PDF] Breusch and Pagan's (1980) Test Revisited
    Jun 22, 2021 · We showed that the test has power against fixed effects, even though it was developed to detect random effects. Because of the simplicity of ...
  53. [53]
    [PDF] Specification Tests in Econometrics - JA Hausman
    Oct 31, 2002 · In this paper a general form of specification test is proposed which attempts to provide powerful tests of assumption (1.1a) and presents a ...
  54. [54]
    [PDF] Some Tests of Specification for Panel Data: Monte Carlo Evidence ...
    Nov 3, 2002 · This paper presents specification tests that are applicable after estimating a dynamic model from panel data by the generalized method of ...
  55. [55]
    [PDF] Dynamic panel data models - Cemmap
    This paper reviews econometric methods for dynamic panel data models, and presents examples that illustrate the use of these procedures. ...
  56. [56]
    [PDF] Biases in Dynamic Models with Fixed Effects - Stephen Nickell
    Nov 3, 2002 · The remainder of the paper is devoted to an analysis of these biases and is set out as follows. In the next section we shall compute the bias as ...
  57. [57]
    [PDF] Some Tests of Specification for Panel Data - NYU Stern
    This paper presents specification tests that are applicable after estimating a dynamic model from panel data by the generalized method of moments (GMM), ...
  58. [58]
    [PDF] Initial conditions and moment restrictions in dynamic panel data ...
    This is consistent with our analysis in Section 3.2, where we showed that there was a serious problem of weak instruments for the GMM (DIF) estimator at values ...
  59. [59]
    How to do Xtabond2: An Introduction to Difference and System GMM ...
    xtabond2 implements difference and system GMM estimators, designed for small T, large N panels, and is used for situations with non-exogenous variables.
  60. [60]
    [PDF] Large-dimensional Panel Data Econometrics
    This book is motivated by the recent development in panel data models with large individuals/countries (n) and large amount of observations over time (T). It ...
  61. [61]
    Panel Data Models With Interactive Fixed Effects - Bai - 2009
    Jul 21, 2009 · This paper considers large N and large T panel data models with unobservable multiple interactive effects, which are correlated with the regressors.
  62. [62]
    [PDF] Panel Data Models With Interactive Fixed Effects - NYU Stern
    This paper considers large N and large T panel data models with unobservable multiple interactive effects, which are correlated with the regressors.
  63. [63]
    Lasso penalized model selection criteria for high-dimensional ...
    This paper proposes two model selection criteria for identifying relevant predictors in the high-dimensional multivariate linear regression analysis.
  64. [64]
    Estimation and Inference in High-Dimensional Panel Data Models ...
    We develop new econometric methods for estimation and inference in high-dimensional panel data models with interactive fixed effects.
  65. [65]
  66. [66]
    A comprehensive survey on statistical and deep learning models for ...
    Oct 8, 2025 · The paper also identifies key challenges in panel data forecasting and proposes future research directions, including hybrid modeling approaches ...
  67. [67]
    Double machine learning for static panel models with fixed effects
    Apr 25, 2025 · Following Chernozhukov et al. (2018), we construct a generic Neyman orthogonal score function for panel data that accounts for (a) the ...
  68. [68]
    [2409.01266] Double Machine Learning meets Panel Data - arXiv
    Sep 2, 2024 · In this paper, we explore how we can adapt double/debiased machine learning (DML) (Chernozhukov et al., 2018) for panel data in the presence of unobserved ...
  69. [69]
    [PDF] Deep Neural Network Estimation in Panel Data Models
    Jun 29, 2023 · We find significant forecasting gains over both linear panel data models and time series neural networks. Containment or lockdown policies ...
  70. [70]
    Predicting Firm's Performance Based on Panel Data: Using Hybrid ...
    The LSTM model is a type of recurrent neural network whose advantage lies in mechanisms for retaining memory and combating gradient attenuation. The LSTM architecture ...
  71. [71]
    [PDF] Panel Machine Learning with Mixed-Frequency Data
    May 23, 2025 · Neural networks with embeddings offer a structured way to incorporate categorical panel information while flexibly modeling nonlinear dynamics ...
  72. [72]
    Random forests for high-dimensional longitudinal data
    Aug 9, 2020 · We propose a general approach of random forests for high-dimensional longitudinal data. It includes a flexible stochastic model which allows the covariance ...
  73. [73]
    Mixed effect gradient boosting for high-dimensional longitudinal data
    Aug 22, 2025 · In comprehensive simulations spanning linear and nonlinear data-generating processes, MEGB achieved 35-76% lower mean squared error (MSE) ...
  74. [74]
    Difference‐in‐Difference Causal Forests With an Application to ...
    Jul 7, 2025 · This paper introduces the difference-in-difference causal forest (DiDCF) method, which extends the causal-forest technique for estimating ...
  75. [75]
    [PDF] Forests for Differences: Robust Causal Inference Beyond Parametric ...
    May 14, 2025 · Our method generalizes the Bayesian Causal Forest (BCF) model (Hahn et al., 2020) to panel data and DiD designs, enabling robust recovery ...
  76. [76]
    Multi-way clustering estimation of standard errors in gravity models
    This paper analyzes the consequences of ignoring the multi-indexed structure with cross-sectional and panel-data gravity models of bilateral trade for ...
  77. [77]
    A multidimensional spatial lag panel data model with spatial moving ...
    Dec 29, 2017 · In this paper, we focus on estimation methods for a multidimensional spatial lag panel data model with SMA nested random effects errors. The ...
  78. [78]
    [PDF] STRUCTURAL COVARIATES OF U.S. COUNTY HOMICIDE RATES
    Using county-level data for the decennial years in the 1960 to 1990 time period, we reexamine the impact of conventional structural covariates on homicide rates ...
  79. [79]
    Likelihood-Based Inference and Prediction in Spatio-Temporal ...
    Aug 6, 2025 · Our data set combines Uniform Crime Reporting data with socioeconomic data from the 2000 census. The likelihood of the model is accurately ...
  80. [80]
    [PDF] Decomposition of Bilateral Trade Flows Using a Three-Dimensional ...
    This study decomposes the bilateral trade flows using a three-dimensional panel data model. Under the scenario that all three dimensions diverge to infinity ...
  81. [81]
    PWT 11.0 | Penn World Table | Groningen Growth and Development ...
    Oct 7, 2025 · PWT version 11.0 is a database with information on relative levels of income, output, input and productivity, covering 185 countries between ...
  82. [82]
    ESS Data Portal - European Social Survey
    New editions of CROss-National Online Survey (CRONOS-3) self-completion panel data have been released for Waves 1 and 3 on Thursday 6 November 2025.
  83. [83]
    Testing for spatial autocorrelation in a fixed effects panel data model
    The aim of this paper is to assess the relevance of spatial autocorrelation in a fixed effects panel data model and in the affirmative, to identify the most ...
  84. [84]
    Gravity - CEPII
    Gravity provides all the information required to estimate gravity equations: trade flows, geographical distances, trade facilitation measures, macroeconomic ...
  85. [85]
    [PDF] A global dataset of annual urban extents (1992–2020) from ...
    Feb 8, 2022 · Satellite remote sensing big data have shown great potential for mapping dynamics of urban areas with continuous observations spanning ...