
Model selection

Model selection is a fundamental process in statistical modeling and machine learning, involving the selection of the most appropriate model from a set of candidate models based on observed data, with the goal of balancing goodness-of-fit against model complexity to achieve reliable inference and prediction. This task is central to scientific inquiry, as it helps identify relevant variables, interactions, and structures while mitigating issues like overfitting, where a model fits noise rather than underlying patterns, or underfitting, where it fails to capture key relationships. Approaches to model selection span frequentist, Bayesian, and predictive paradigms, each offering distinct criteria and algorithms to evaluate competing models.

Historically, model selection has roots in information theory and statistical inference, evolving from early efforts in the mid-20th century to formalize model comparison. Akaike's Information Criterion (AIC), introduced in 1974, provides an estimate of the relative quality of statistical models for a given dataset, computed as twice the negative log-likelihood plus twice the number of parameters, promoting asymptotically efficient model choice in nonparametric settings. Building on this, Schwarz's Bayesian Information Criterion (BIC), proposed in 1978, approximates the Bayes factor for model comparison and imposes a stronger penalty, equal to the number of parameters times the logarithm of the sample size, favoring parsimonious models and ensuring consistency in selecting the true model under parametric assumptions as sample size grows. These information-theoretic criteria, along with extensions like the Hannan-Quinn criterion and the corrected AIC (AICc), have become staples for automated model evaluation in regression, time series, and beyond.

Beyond information criteria, modern techniques include cross-validation, which partitions data into training and validation sets to assess out-of-sample predictive performance, and penalized regression methods such as the lasso (least absolute shrinkage and selection operator), which induces sparsity by shrinking irrelevant coefficients to zero. Bayesian methods, like Bayes factors and model averaging, incorporate prior probabilities to weigh models probabilistically, enhancing robustness in high-dimensional settings. Stepwise procedures (forward, backward, or bidirectional) offer heuristic searches through model spaces, though they risk instability and multiple testing issues. Challenges in model selection persist, particularly post-selection inference, where standard errors may be biased after variable elimination, necessitating specialized adjustments.

In practice, model selection underpins diverse applications, from bioinformatics to climate modeling, where selecting the right model directly impacts interpretability and decision-making. Ongoing research addresses scalability in the era of big data through hybrid criteria and computational advances, ensuring model selection remains a dynamic field adapting to evolving analytical demands.

Fundamentals

Definition and Purpose

Model selection is the process of choosing the most appropriate statistical or machine learning model from a set of candidate models by evaluating their performance on available data, with the goal of identifying the model that best captures the underlying relationships while avoiding excessive complexity. For instance, in regression analysis, this might involve comparing models that include different subsets of predictor variables to determine which provides the optimal balance between goodness of fit and parsimony. The primary purpose of model selection is to enhance the model's ability to generalize to unseen data, thereby preventing suboptimal predictive performance that arises from capturing idiosyncratic noise rather than true patterns in the data. This is particularly crucial in scenarios where overfitting occurs, as overly complex models fitted too closely to training data often fail to perform well on new observations. The practice of model selection took shape in the second half of the 20th century amid the rise of electronic computing, which enabled the evaluation of multiple models efficiently, and was formalized through influential contributions such as Hirotugu Akaike's development of the Akaike Information Criterion in 1974. Prior to this era, model choice relied more on informal judgment or theoretical considerations, but advancing computing power facilitated systematic approaches. In its basic workflow, model selection begins with specifying a collection of candidate models based on domain knowledge or exploratory analysis, followed by fitting each model to the observed data using estimation techniques like maximum likelihood. The models are then ranked according to established criteria that account for both their fit to the data and their complexity, ultimately leading to the selection of the preferred model for inference or prediction.

Key Challenges

One of the primary challenges in model selection is overfitting, where a model captures noise in the training data rather than the underlying signal, resulting in poor performance on unseen data. This occurs when the model complexity is too high relative to the sample size, leading to inflated variance in predictions. For instance, fitting a high-degree polynomial to noisy data can perfectly interpolate the observed points but fail to generalize, as the model oscillates wildly between observations. In contrast, underfitting arises when a model is overly simplistic and fails to capture important patterns or relationships in the data, leading to high bias and suboptimal predictive accuracy even on training data. A classic example is applying a linear regression model to data with inherent nonlinear structure, such as exponential growth, where the straight-line fit misses the curvature and yields systematically erroneous predictions.

These issues are interconnected through the bias-variance tradeoff, a fundamental principle in statistical learning that describes how model error decomposes into bias (error from erroneous assumptions) and variance (error from sensitivity to small fluctuations in training data). As model complexity increases, such as by adding more parameters or features, bias typically decreases because the model can approximate the true function more closely, but variance rises due to greater sensitivity to sampling variability. The total expected error thus forms a U-shaped curve with respect to complexity: low-complexity models suffer high bias (underfitting), high-complexity models high variance (overfitting), and optimal performance occurs at an intermediate point balancing the two. Conceptually, this can be visualized as a plot with model flexibility on the x-axis and mean squared error on the y-axis, where the irreducible error sets a baseline, bias decreases monotonically, and variance increases, yielding the characteristic minimum.

Another significant challenge is the multiple testing problem, which emerges when numerous candidate models are evaluated, increasing the likelihood of false positives (Type I errors) and invalidating standard inference procedures. In model selection, testing many subsets or hypotheses without adjustment inflates the overall error rate, as the selection process conditions inference on data-driven choices that are often ignored in post-selection analysis. This leads to overconfident confidence intervals and p-values that do not reflect the true uncertainty after selection.

Finally, computational demands pose a formidable barrier, particularly in high-dimensional settings where the number of possible models grows exponentially with the number of predictors, a phenomenon known as the curse of dimensionality. For p predictors, the total number of candidate subsets is 2^p, rendering exhaustive evaluation infeasible for even moderate p (e.g., p = 50 yields over a quadrillion models), while high dimensions exacerbate sparsity and sampling requirements. Evaluation criteria, such as penalized likelihood methods, help mitigate overfitting by balancing fit and complexity in these scenarios.
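
The tradeoff can be made concrete with a small simulation. The following sketch, assuming NumPy and scikit-learn are available (the cubic data-generating function, noise level, and candidate degrees are arbitrary illustrative choices), fits polynomials of increasing degree and compares training error, which keeps falling, with held-out error, which traces the U-shaped curve described above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = x**3 - 2 * x + rng.normal(scale=3.0, size=x.size)   # noisy cubic "true" signal
X = x.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Training error falls monotonically with degree; held-out error typically
# bottoms out near the true complexity (degree 3) and worsens as variance
# from overfitting takes over.
for degree in (1, 3, 10, 20):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree={degree:2d}  train MSE={train_mse:8.2f}  test MSE={test_mse:8.2f}")
```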

Selection Approaches

Forward and Backward Selection

Forward selection is a greedy stepwise procedure for variable selection in regression models that begins with an empty model containing only the intercept and iteratively adds the predictor that most improves the model fit, typically assessed using an F-statistic for the significance of the added variable's contribution to the regression. The process continues by evaluating each remaining candidate predictor in the context of the current model and selecting the one yielding the largest increase in the F-statistic until no further addition meets a predefined significance threshold, such as p < 0.05, at which point selection stops. This approach is particularly suited to multiple linear regression where computational resources limit exhaustive searches.

Backward selection, in contrast, initiates with a full model incorporating all available predictors and proceeds by removing the variable with the least contribution to the model, often determined by the highest p-value from a t-test (equivalent to an F-test for that term). Iterations involve refitting the model after each removal and continuing until the removal of any remaining variable would significantly worsen the fit, as indicated by exceeding a p-value threshold for retention, such as p > 0.10. This method evaluates variables in the presence of all others, potentially capturing dependencies better than forward selection in cases of multicollinearity.

A hybrid approach, known as stepwise selection, combines elements of both forward and backward methods by alternating between adding the most beneficial predictor and removing the least useful one from the current model, using consistent criteria such as thresholds for entry (e.g., p < 0.05) and removal (e.g., p > 0.10). The process iterates until no further additions or removals satisfy the thresholds, allowing for dynamic adjustment that can recover variables overlooked in pure directional searches. Stopping rules in these methods may instead integrate penalized likelihood criteria, such as the Akaike Information Criterion (AIC), to balance fit and complexity.

These directional stepwise algorithms offer computational efficiency for datasets with moderate numbers of predictors, requiring only O(p^2) model fits where p is the number of variables, compared to exhaustive methods. However, their greedy nature leads to selection of locally optimal models that may overlook the global best subset, particularly when variables exhibit strong interactions or multicollinearity, as early decisions cannot be revisited comprehensively. Additionally, they tend to ignore higher-order interactions unless explicitly modeled, potentially resulting in biased coefficient estimates and inflated Type I error rates due to multiple testing without adjustment.

In an application to macroeconomic forecasting, forward and backward selection have been employed in multiple linear regression to identify key predictors of gross domestic product (GDP) growth from macroeconomic indicators; for instance, stepwise regression was used to screen variables with a significant impact on GDP before a ridge regression model was fitted for prediction.
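
The sketch below illustrates the forward step of such a procedure, using AIC as the stopping rule rather than F-test thresholds. It assumes statsmodels, NumPy, and pandas are available; the names df and response are placeholders rather than references to any particular dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select_aic(df: pd.DataFrame, response: str) -> list:
    """Greedy forward selection: add predictors one at a time while AIC improves."""
    remaining = [c for c in df.columns if c != response]
    selected = []
    # AIC of the intercept-only model serves as the starting benchmark.
    best_aic = sm.OLS(df[response], np.ones(len(df))).fit().aic
    while remaining:
        # Refit once per remaining candidate and keep the best resulting AIC.
        trials = [(sm.OLS(df[response],
                          sm.add_constant(df[selected + [c]])).fit().aic, c)
                  for c in remaining]
        aic, best_cand = min(trials)
        if aic >= best_aic:          # stop when no addition improves AIC
            break
        best_aic = aic
        selected.append(best_cand)
        remaining.remove(best_cand)
    return selected
```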

All-Subsets and Best-Subsets Methods

All-subsets regression is an exhaustive method that fits and evaluates all 2^p possible linear models for p predictors by considering every combination of variables, then ranks them according to a goodness-of-fit criterion such as adjusted R^2. This approach systematically assesses the performance of each candidate subset to identify those with the strongest predictive power while penalizing for model complexity. Full enumeration has computational complexity O(2^p), rendering it practical only for small numbers of predictors (typically fewer than 20). To mitigate the exponential growth, best-subsets methods avoid full enumeration by using branch-and-bound strategies such as the leaps-and-bounds algorithm to prune suboptimal branches and identify the top models for each subset size without evaluating all combinations. This extends feasibility to up to around 40 predictors in some implementations. Modern variants employ parallel computing techniques, such as QR decomposition-based algorithms, which distribute the workload across multiple processors to evaluate subsets more efficiently. Genetic algorithms represent another optimization approach, evolving populations of candidate subsets through selection, crossover, and mutation to converge on high-performing models without exhaustive search.

Exhaustive and branch-and-bound methods guarantee identification of the global optimum among the considered subsets, offering superior performance over heuristic alternatives in terms of model accuracy when computation is feasible. However, they incur high computational costs and pose risks from multiple comparisons, potentially leading to overfitting by favoring models that fit noise in the training data. In bioinformatics, best-subsets approaches using branch-and-bound algorithms like leaps-and-bounds, often combined with pre-filtering to reduce dimensionality, have been applied to gene selection from high-dimensional expression data with thousands of features to prioritize biologically relevant predictors. Subsets in these methods are often ranked using information criteria to balance fit and parsimony.
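
A minimal sketch of the exhaustive strategy, assuming statsmodels and NumPy are available, is shown below; it enumerates every non-empty subset with itertools.combinations and ranks the fits by adjusted R^2, which is only feasible when p is small.

```python
from itertools import combinations
import numpy as np
import statsmodels.api as sm

def all_subsets(X: np.ndarray, y: np.ndarray, names: list) -> list:
    """Fit every non-empty subset of the columns of X (plus an intercept) and
    rank the models by adjusted R^2; there are 2^p - 1 fits, so keep p small."""
    ranked = []
    p = X.shape[1]
    for k in range(1, p + 1):
        for cols in combinations(range(p), k):
            fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
            ranked.append((fit.rsquared_adj, [names[i] for i in cols]))
    ranked.sort(key=lambda t: t[0], reverse=True)   # best subset first
    return ranked
```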

Evaluation Criteria

Penalized Likelihood Criteria

Penalized likelihood criteria provide a unified framework for model selection in likelihood-based models by incorporating penalties for model complexity, thereby addressing the bias toward overly complex models that arises from solely maximizing the likelihood. These criteria estimate the relative quality of models by balancing goodness-of-fit, as measured by the log-likelihood, against a penalty term that increases with the number of parameters, promoting parsimonious models while aiming to minimize expected prediction error.

The Akaike Information Criterion (AIC), introduced by Hirotugu Akaike, is derived from asymptotic information theory and serves as an estimator of the relative expected Kullback-Leibler divergence between the true underlying distribution and the fitted model. Its formula is AIC = -2 \log L + 2k, where L is the maximized likelihood of the model and k is the number of estimated parameters. This criterion penalizes complexity linearly in k, making it suitable for selecting models that optimize predictive accuracy in large samples, as the penalty term approximates twice the expected bias in the log-likelihood due to estimation. A small-sample correction to AIC, known as AICc, replaces the penalty with 2kn/(n-k-1) for finite sample size n, improving performance when n is small relative to k.

The Bayesian Information Criterion (BIC), proposed by Gideon Schwarz, approximates the logarithm of the marginal likelihood (and hence the Bayes factor for model comparison) under a unit information prior and imposes a stronger penalty on model complexity, particularly in large samples where it tends to favor simpler models. The formula is BIC = -2 \log L + k \log n, with n denoting the sample size; the logarithmic penalty term grows with n, reflecting the increasing cost of additional parameters as data accumulates. The derivation stems from a Laplace approximation to the integral defining the marginal likelihood, yielding a criterion that consistently selects the true model under certain conditions. The Hannan-Quinn information criterion (HQIC), introduced in 1979, uses an intermediate penalty of 2k \log \log n, balancing AIC's optimism and BIC's conservatism, and is also consistent under certain conditions.

For hierarchical or complex Bayesian models, the Deviance Information Criterion (DIC), developed by Spiegelhalter and colleagues in 2002, extends these ideas by accounting for effective model complexity through the posterior mean of the deviance. It is formulated as DIC = -2 \log L + 2 p_D, where -2 \log L is the deviance evaluated at the posterior mean of the parameters and p_D represents the effective number of parameters, estimated as the difference between the posterior mean of the deviance and the deviance at the posterior mean of the parameters; this makes DIC particularly useful for Bayesian hierarchical models where traditional parameter counts may underestimate complexity.

In practice, model selection using these criteria involves computing the value for each candidate model and selecting the one with the minimum score, as lower values indicate better relative fit adjusted for complexity. For instance, when comparing nested generalized linear models (GLMs), such as a null model with only an intercept versus an alternative including a single predictor, AIC or BIC can identify the preferred model by quantifying whether the improvement in log-likelihood justifies the added parameter, with BIC often favoring the simpler model due to its heavier penalty in moderate to large samples. These criteria are often applied within stepwise selection procedures to iteratively build or prune models.
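
As a concrete illustration of the nested-GLM comparison just described, the sketch below (assuming statsmodels and NumPy, with simulated logistic-regression data standing in for a real application) computes AIC and BIC directly from the formulas above for an intercept-only model and a one-predictor alternative.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
prob = 1.0 / (1.0 + np.exp(-(0.5 + 1.2 * x)))        # true logistic relationship
y = rng.binomial(1, prob)

null_fit = sm.GLM(y, np.ones((n, 1)), family=sm.families.Binomial()).fit()
alt_fit = sm.GLM(y, sm.add_constant(x), family=sm.families.Binomial()).fit()

def aic(fit):
    k = fit.params.size                   # number of estimated coefficients
    return -2.0 * fit.llf + 2.0 * k       # AIC = -2 log L + 2k

def bic(fit):
    k = fit.params.size
    return -2.0 * fit.llf + k * np.log(n) # BIC = -2 log L + k log n

for name, fit in [("intercept only", null_fit), ("intercept + x", alt_fit)]:
    print(f"{name:15s}  AIC={aic(fit):8.2f}  BIC={bic(fit):8.2f}")  # lower is preferred
```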

Resampling-Based Methods

Resampling-based methods provide empirical estimates of model performance by repeatedly partitioning or resampling the available data to simulate unseen data, thereby guiding the selection of models that generalize well beyond the training set. These techniques address the limitations of in-sample evaluations by focusing on out-of-sample error estimation, making them particularly valuable when sample sizes are moderate to large and theoretical assumptions about the data distribution are untenable. Unlike analytical criteria that rely on asymptotic approximations, resampling methods leverage the observed data itself to quantify variability and bias in model predictions.

The holdout validation method, also known as the train-test split, is the simplest resampling approach: the dataset is divided into a training subset for model fitting and a separate holdout subset for performance evaluation. The split typically allocates a portion, such as 70-80%, for training and reserves the remainder for testing, on which metrics like error rates are computed. However, holdout validation can introduce high variance and bias, especially in small samples, because the holdout set may not represent the underlying data distribution, leading to unreliable estimates of generalization performance. For instance, with limited data, the training set might overfit, while the test set's composition exacerbates pessimistic or optimistic biases in the evaluation.

Cross-validation (CV) extends the holdout idea by systematically rotating the roles of training and testing subsets to obtain a more stable performance estimate. In k-fold CV, the dataset is randomly partitioned into k equally sized folds; the model is trained k times, each time using k-1 folds for training and the remaining fold as the holdout for testing, with the average error across all folds serving as the final estimate. This method, formalized by Michael Stone in 1974, reduces the variance associated with a single holdout split by incorporating more of the data into training while ensuring each observation is tested exactly once. Common choices for k include 5 or 10, balancing computational cost and estimation accuracy; larger k approaches full use of the data for training but increases overlap between training sets. A special case is leave-one-out CV (LOOCV), where k equals the number of observations n, training on n-1 samples and testing on the single excluded one; it is nearly unbiased but computationally intensive for large n and prone to high variance because the training sets are nearly identical across iterations.

The bootstrap method offers another resampling strategy, generating multiple synthetic datasets by drawing n samples with replacement from the original data to mimic the sampling variability of the underlying population. Introduced by Bradley Efron in 1979, this technique fits the model B times (often B = 1000 or more) on these bootstrap samples, enabling estimates of variability, confidence intervals, and bias correction for model parameters or predictions. In model selection, the bootstrap can assess the stability of selected models by examining the distribution of performance metrics across resamples, helping identify robust candidates less sensitive to data perturbations. It is particularly useful for quantifying uncertainty in complex models where analytical variance formulas are unavailable.
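
As an illustration of the bootstrap idea, the sketch below (assuming scikit-learn and NumPy, with simulated data and an arbitrary penalty value) refits a lasso on resampled datasets and records how often each predictor is selected, a simple proxy for the stability assessment described above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 120, 10
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=n)   # only two real signals

B = 200
counts = np.zeros(p)
for _ in range(B):
    idx = rng.integers(0, n, size=n)                     # resample rows with replacement
    fit = Lasso(alpha=0.1).fit(X[idx], y[idx])
    counts += fit.coef_ != 0                             # tally which predictors survive

# Selection frequency per predictor across bootstrap resamples: stable signals
# should be selected in nearly every resample, noise variables only sporadically.
print(np.round(counts / B, 2))
```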
In practice, the choice of performance metric in resampling methods depends on the task: for regression problems, the mean squared error (MSE) quantifies prediction accuracy on holdout or CV folds, while for classification, metrics like accuracy (proportion of correct predictions) or the area under the receiver operating characteristic curve (AUC) evaluate discriminative power. These metrics are averaged over resamples to yield a single score for comparing models, with standard errors derived from the variability across folds or bootstrap replicates to inform selection decisions. A representative example is hyperparameter tuning in support vector machines (SVMs) using 10-fold CV on the Iris dataset, a classic benchmark with 150 samples across three classes based on four features. Here, candidate values for the regularization parameter C and kernel gamma are evaluated by training an SVM on each of the 10 folds' training portions and computing CV accuracy on the holdout fold, selecting the hyperparameters that maximize the average accuracy (often around 95-98% for optimal settings). This process ensures the chosen SVM generalizes effectively to new floral measurements. Resampling methods like CV can be combined with penalized likelihood criteria such as AIC in hybrid approaches to balance empirical validation with theoretical penalties for model complexity.
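
The tuning loop described in this example can be written compactly with scikit-learn's grid search; the sketch below assumes scikit-learn is available, and the particular grid of C and gamma values is an illustrative choice rather than a recommended default.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}

# 10-fold CV accuracy is computed for every (C, gamma) pair; the pair with the
# highest mean accuracy across folds is retained.
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=10, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```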

Advanced Techniques

Bayesian Approaches

Bayesian model selection relies on assigning posterior probabilities to candidate models using Bayes' theorem, which updates prior beliefs about models with observed data. The posterior probability of a model M given data D is given by P(M|D) \propto P(D|M) P(M), where P(M) is the prior probability over models and P(D|M) is the marginal likelihood, obtained by integrating the likelihood over the model's parameters: P(D|M) = \int P(D|\theta, M) P(\theta|M) d\theta. This framework enables direct comparison of models through posterior odds P(M_1|D)/P(M_2|D), which quantify the relative support for one model over another after accounting for data evidence and prior plausibility. Unlike point-estimate methods, this probabilistic approach naturally incorporates uncertainty in model choice.

Computing the marginal likelihood is often challenging, particularly for complex models, but approximations like the Laplace method provide a tractable solution by expanding the log-posterior around its mode, yielding an asymptotic normal approximation to the integral. For more accurate estimation, Markov chain Monte Carlo (MCMC) methods sample from the posterior distribution to numerically evaluate or approximate the marginal likelihood, for example through thermodynamic integration or bridge sampling techniques. To handle models of varying dimensionality, such as differing numbers of predictors in regression, reversible jump MCMC extends standard MCMC by allowing trans-dimensional proposals that maintain detailed balance, enabling joint sampling over model space and parameters.

Once posterior model probabilities are obtained, Bayesian model averaging constructs predictions as a weighted sum over models: \hat{y} = \sum_M \hat{y}_M P(M|D), where \hat{y}_M is the prediction from model M. This averaging typically reduces predictive risk compared to selecting a single best model, as it hedges against uncertainty in the true model by incorporating contributions from multiple plausible alternatives. Priors play a crucial role in this process; for linear regression, Zellner's g-prior on the coefficients, whose covariance is proportional to a factor g times the inverse Fisher information matrix, facilitates analytical computation of marginal likelihoods, with the choice of g controlling how strongly additional parameters are penalized. For objective selection without strong prior assumptions, intrinsic priors derive from improper priors trained on minimal data subsets, yielding stable Bayes factors that approximate large-sample behavior. The Bayesian Information Criterion (BIC) serves as a large-sample approximation to the log posterior odds under equal model priors, linking the Bayesian and frequentist paradigms.

A practical application arises in time series analysis, where Bayesian methods select ARIMA model orders (p, d, q) by computing posterior odds between candidate specifications, often using MCMC to explore the joint posterior over orders and parameters, thus quantifying uncertainty in differencing degree or lag selections. For instance, in analyzing economic data, posterior probabilities might favor an ARIMA(1,1,1) model over higher-order alternatives if the marginal likelihood penalizes unnecessary complexity, leading to more robust forecasts via averaging across top models.
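
A lightweight way to approximate this machinery, using the BIC connection noted above, is to convert BIC values into approximate posterior model probabilities (proportional to exp(-BIC/2) under equal model priors) and average predictions with those weights. The sketch below assumes statsmodels and NumPy and uses simulated regression data; it is an approximation to Bayesian model averaging, not a full marginal-likelihood computation.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
X = rng.normal(size=(n, 3))
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)          # only the first predictor matters

candidates = {"x1": [0], "x1+x2": [0, 1], "x1+x2+x3": [0, 1, 2]}
fits = {name: sm.OLS(y, sm.add_constant(X[:, cols])).fit()
        for name, cols in candidates.items()}

# exp(-BIC/2) approximates the marginal likelihood up to a constant, so
# normalizing gives approximate posterior model probabilities (equal priors).
bic = np.array([f.bic for f in fits.values()])
weights = np.exp(-0.5 * (bic - bic.min()))
weights /= weights.sum()

# Model-averaged fitted values: a weighted sum of each candidate's predictions.
y_bma = sum(w * f.fittedvalues for w, f in zip(weights, fits.values()))
print({name: round(float(w), 3) for name, w in zip(fits, weights)})
```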

Dimensionality Reduction Integration

Dimensionality reduction techniques play a crucial role in model selection, particularly in high-dimensional settings where the number of candidate predictors exceeds the sample size (p >> n), by transforming or regularizing the feature space to facilitate effective variable selection while mitigating overfitting and computational burdens. These methods integrate preprocessing steps that compress the predictor space before or during the selection process, enabling scalable inference in complex datasets such as genomics or imaging. Unlike traditional subset selection, which exhaustively evaluates combinations, dimensionality reduction embeds selection within a lower-dimensional projection, preserving key information while discarding noise or redundancy.

Principal Component Analysis (PCA) serves as a foundational approach for integrating dimensionality reduction with model selection by orthogonally transforming the original predictors into a set of uncorrelated principal components ordered by explained variance. In this framework, the first k components, which capture the majority of the data's variability, replace the full set of p variables, allowing subsequent model selection methods, such as stepwise regression, to operate on the reduced space of size k << p. This transformation aids selection by alleviating multicollinearity and focusing on dominant patterns, though it may obscure interpretability since the components are linear combinations of the original variables. Krzanowski (1987) proposed criteria for variable selection within PCA by evaluating subsets that best reconstruct the principal components, ensuring the reduced model retains structural integrity.

The Lasso method embeds dimensionality reduction directly into the model fitting process through L1 regularization, solving the optimization problem \hat{\beta} = \arg\min_{\beta} \left\{ \| y - X\beta \|^2_2 + \lambda \| \beta \|_1 \right\}, where y is the response vector, X the n × p design matrix, \beta the coefficient vector, and \lambda a tuning parameter controlling sparsity. The L1 penalty induces shrinkage, driving many coefficients to exactly zero and thereby performing automatic variable selection in high dimensions, which effectively reduces the model dimensionality without explicit preprocessing. This approach excels in sparse settings, producing interpretable models with few active predictors. Tibshirani (1996) introduced the Lasso, demonstrating its efficacy in selecting relevant variables while shrinking irrelevant ones.

Partial Least Squares (PLS) extends dimensionality reduction for model selection by projecting both predictors and responses onto a smaller set of latent variables that maximize covariance, making it particularly suited for multicollinear data where PCA may overlook response-relevant directions. In PLS regression, successive components are extracted by iterative deflation, and variable importance can be assessed via loadings or variable importance in projection (VIP) scores, allowing selection of predictors strongly associated with the response. This method integrates selection by retaining only components or variables contributing significantly to prediction, enhancing stability in correlated high-dimensional settings. Wold et al. (1984) developed PLS as a solution to collinearity in regression, with subsequent sparse variants incorporating penalties for explicit selection.
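
A minimal sketch of Lasso-based selection in a p >> n setting, with the tuning parameter chosen by cross-validation, is shown below; it assumes scikit-learn is available and uses simulated data in which only the first five of 200 predictors carry signal.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
n, p = 100, 200                                   # more predictors than observations
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = [3.0, -2.0, 1.5, -1.0, 2.0]            # only five predictors carry signal
y = X @ beta + rng.normal(size=n)

lasso = LassoCV(cv=5, random_state=0).fit(X, y)   # penalty chosen by 5-fold CV
selected = np.flatnonzero(lasso.coef_)            # indices of nonzero coefficients
print(f"penalty={lasso.alpha_:.4f}, {selected.size} predictors selected:", selected[:10])
```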
Chun and Keleş (2009) advanced sparse PLS for simultaneous dimension reduction and variable selection, showing improved performance over standard PLS in the presence of correlated genomic features.

Post-selection inference addresses validity concerns arising from dimensionality reduction, where standard p-values and confidence intervals become biased because they ignore the data-driven selection step. Adjustments are necessary to control error rates after reduction, ensuring downstream tests remain valid under the selected model. Tibshirani et al. (2015) highlighted issues with naive inference after lasso or stepwise selection, proposing selective inference frameworks that condition on the active set to derive exact p-values and intervals, preserving type I error control in high dimensions.

In genomic applications with p >> n, such as gene expression data, the elastic net combines L1 and L2 penalties for sparse model selection, minimizing \| y - X\beta \|^2_2 + \lambda \left( \alpha \| \beta \|_1 + (1-\alpha) \| \beta \|^2_2 \right), where \alpha balances sparsity-inducing shrinkage and the grouping of correlated predictors. This method selects grouped genes effectively, as demonstrated on leukemia microarray data where it identified biologically coherent subsets outperforming the Lasso alone. Zou and Hastie (2005) introduced the elastic net, illustrating its utility in genomics for handling correlated predictors while achieving sparsity. Cross-validation is typically used to tune \lambda and \alpha for the final reduced model.
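
The elastic-net workflow just described, with both the penalty strength and the L1/L2 mix tuned by cross-validation, can be sketched as follows; scikit-learn is assumed, and simulated p >> n data stand in for real gene-expression measurements.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(5)
n, p = 80, 500
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:10] = 1.0                                   # a small block of informative features
y = X @ beta + rng.normal(size=n)

# Both the overall penalty strength (alpha here, lambda in the text) and the
# L1/L2 mixing proportion are tuned by 5-fold cross-validation.
enet = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.8, 1.0], cv=5).fit(X, y)
print("l1_ratio:", enet.l1_ratio_,
      "penalty:", round(float(enet.alpha_), 4),
      "nonzero coefficients:", int(np.count_nonzero(enet.coef_)))
```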

References

  1. [1]
    [PDF] Methods and Criteria for Model Selection
    Model selection is an important part of any statistical analysis, and indeed is central to the pursuit of science in general. Many authors have examined this ...
  2. [2]
    Model Selection Techniques: An Overview - IEEE Xplore
    Model selection is a key ingredi- ent in data analysis for reliable and reproducible statistical inference or prediction, and thus it is central to scientific ...
  3. [3]
    A new look at the statistical model identification - IEEE Xplore
    Dec 31, 1974 · A new estimate minimum information theoretical criterion (AIC) estimate (MAICE) which is designed for the purpose of statistical identification is introduced.
  4. [4]
    Estimating the Dimension of a Model - Project Euclid
    Abstract. The problem of selecting one of a number of models of different dimensions is treated by finding its Bayes solution, and evaluating the leading terms ...
  5. [5]
    [PDF] Model Selection and Validation (Chap 9) - University of South Carolina
    Two automated methods for variable selection are best subsets and stepwise procedures. Best subsets simply finds the models that are best according to some ...
  6. [6]
    [PDF] Statistical Inference After Model Selection - Wharton Faculty Platform
    In summary, model selection is a procedure by which some models are chosen over others. But model selection is subject to uncertainty. Because regression ...
  7. [7]
    Model selection – Knowledge and References - Taylor & Francis
    Model selection refers to the process of choosing the most appropriate statistical model from a set of potential models based on the available data.
  8. [8]
    Model Selection - an overview | ScienceDirect Topics
    Model selection is defined as the process of choosing a statistical model based on its predictive performance, which can be evaluated using various metrics ...
  9. [9]
    Interview with Genshiro Kitagawa | Computational Statistics
    In the 1970s ... I think these developments were important conditions for the development of various models and model selection techniques, in particular, the ...
  10. [10]
    An Introduction to Model Selection - ResearchGate
    Aug 8, 2025 · This paper is an introduction to model selection intended for nonspecialists who have knowledge of the statistical concepts covered in a typical first ( ...
  11. [11]
    [PDF] Neural Networks and the Bias/Variance Dilemma
    Neural Networks and the Bias/Variance Dilemma. S. Geman, E. Bienenstock, and R. Doursat.
  12. [12]
    [PDF] Model Selection and Inference: Facts and Fiction - UMD MATH
    Abstract. Model selection has an important impact on subsequent inference. Ignoring the model selection step leads to invalid inference.
  13. [13]
    Valid post-selection inference - Project Euclid
    We propose to produce valid “post-selection inference” by reducing the problem to one of simultaneous inference and hence suitably widening conventional ...
  14. [14]
    [PDF] The Use of an F-Statistic in Stepwise Regression Procedures
    This paper will look at the forward selection procedure in detail and then relate certain aspects of the other two procedures to the corresponding problem in ...
  15. [15]
    Variable selection: review & recommendations for statisticians
    We provide an overview of various available variable selection methods that are based on significance or information criteria, penalized likelihood, the change ...
  16. [16]
    Step away from stepwise | Journal of Big Data | Full Text
    Sep 15, 2018 · This paper uses a series of Monte Carlo simulations to demonstrate that stepwise regression is a poor solution to a surfeit of variables.
  17. [17]
    [PDF] Variable selection: review and recommendations
    Summary from simulation study. • Forward selection inferior to backward elimination. • Lasso performs well in the 'center', but shrinks towards the mean.
  18. [18]
    (PDF) Application of Linear Regression in GDP Forecasting
    Firstly, the correlation test and stepwise regression method were used to screen out the variables with significant impact on GDP, and then ridge regression ...
  19. [19]
    Regressions by Leaps and Bounds: Technometrics: Vol 16, No 4
    This paper describes several algorithms for computing the residual sums of squares for all possible regressions with what appears to be a minimum of arithmetic.
  20. [20]
    Parallel algorithms for computing all possible subset regression ...
    Efficient parallel algorithms for computing all possible subset regression models are proposed. The algorithms are based on the dropping columns method that ...
  21. [21]
    All subsets regression using a genetic search algorithm
    Subset regression procedures have been shown to provide better overall performance than stepwise regression procedures. However, it is difficult to use them ...
  22. [22]
    [PDF] Best Subset, Forward Stepwise, or Lasso? - Statistics & Data Science
    Best subset performs better in high SNR, lasso in low SNR. Best subset and forward stepwise perform similarly. Relaxed lasso is the overall winner.
  23. [23]
    [PDF] The Akaike Information Criterion: Background, Derivation, Properties ...
    The crite- rion was introduced by Hirotugu Akaike (1973) in his seminal paper “Information Theory and an Extension of the Maximum. Likelihood Principle.” The ...
  24. [24]
    [PDF] On the derivation of the Bayesian Information Criterion - UC Merced
    Nov 8, 2010 · Abstract. We present a careful derivation of the Bayesian Inference Criterion (BIC) for model selection. The BIC is viewed here as an ...
  25. [25]
    Comparing Dynamic Causal Models using AIC, BIC and Free Energy
    We compare Bayes factors based on AIC, BIC and FL for nested GLMs derived from an fMRI study. The fMRI data set was collected to study neuronal responses to ...
  26. [26]
    Model selection and Akaike's Information Criterion (AIC)
    Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov & B. F. Csaki (Eds.),Second International Symposium ...
  27. [27]
    [PDF] Bayesian model selection
    We can work out the posterior probability over the models via Bayes' theorem: ... Note that a “fully Bayesian” approach to models would eschew model selection ...
  28. [28]
    Computing the Bayes Factor from a Markov Chain Monte Carlo ...
    Determining the marginal likelihood from a simulated posterior distri- bution is central to Bayesian model selection but is computationally challenging. The ...
  29. [29]
    Marginal Likelihood Computation for Model Selection and ...
    This is an up-to-date introduction to, and overview of, marginal likelihood computation for model selection and hypothesis testing.
  30. [30]
    Reversible jump Markov chain Monte Carlo computation and ...
    This paper proposes a new framework for the construction of reversible Markov chain samplers that jump between parameter subspaces of differing dimensionality.
  31. [31]
    [PDF] Bayesian Model Averaging: A Tutorial - Colorado State University
    stems from the observation that BMA predictions are weighted averages of single model predictions. If the individual predictions are roughly unbi- ased ...
  32. [32]
    [PDF] Fully Bayes Factors with a Generalized g-Prior
    It should be noted that the first paper to effectively use a prior integrating out g was Zellner and Siow (1980); they stated things in terms of multivariate.
  33. [33]
    [PDF] The Intrinsic Bayes Factor for Model Selection and Prediction
    The reason is that Bayes factors in hypothesis testing and model selection typically depend rather strongly on the prior distributions, much more so than in, ...
  34. [34]
    [PDF] BAYESIAN ANALYSIS OF ORDER UNCERTAINTY IN ARIMA ...
    For each proposal distribution the first row refers to the average posterior probability of the true model while the second row shows the proportion of correct ...
  35. [35]
    Bayesian Comparison of ARIMA and Stationary ARMA Models - jstor
    posterior odds, we work with the exact likelihood, assuming a Gaussian process. A by-product of our analysis is a demonstration that this leads to superior ...
  36. [36]
    Regression Shrinkage and Selection Via the Lasso - Oxford Academic
    SUMMARY. We propose a new method for estimation in linear models. The 'lasso' minimizes the residual sum of squares subject to the sum of the absolute valu ...
  37. [37]
    Sparse partial least squares regression for simultaneous dimension ...
    We provide an efficient implementation of sparse partial least squares regression and compare it with well-known variable selection and dimension reduction ...
  38. [38]
    Statistical learning and selective inference - PNAS
    We describe some recent new developments in selective inference and illustrate their use in forward stepwise regression, the lasso, and principal components ...