Bayesian linear regression
Bayesian linear regression is a probabilistic extension of classical linear regression that applies Bayes' theorem to model the relationship between a dependent variable and one or more independent variables, treating the regression parameters as random variables with specified prior distributions that are updated with observed data to yield posterior distributions for inference.[1] This approach provides a full probability distribution over the parameters rather than point estimates, enabling quantification of uncertainty in predictions and coefficients.[2] In the standard formulation, the model posits that the observed response vector \mathbf{y} follows \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}, where \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}), \mathbf{X} is the design matrix of predictors, \boldsymbol{\beta} is the vector of coefficients, and \sigma^2 is the noise variance.[1] The likelihood is Gaussian, p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) = (2\pi \sigma^2)^{-n/2} \exp\left[ -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right], with n observations.[2] Common conjugate priors include a multivariate normal distribution for \boldsymbol{\beta} given \sigma^2, such as \boldsymbol{\beta} \mid \sigma^2 \sim \mathcal{N}(\boldsymbol{\mu}_0, \sigma^2 \mathbf{V}_0), and an inverse-gamma distribution for \sigma^2, ensuring the posterior is also normal-inverse-gamma for analytical tractability.[3] The resulting posterior distribution for the parameters is p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) p(\boldsymbol{\beta}, \sigma^2), which updates the prior beliefs with the data.[1]

Key advantages of Bayesian linear regression over frequentist methods include the incorporation of prior knowledge to regularize estimates and prevent overfitting, particularly in low-data regimes, as the posterior mean for \boldsymbol{\beta} is a shrinkage estimator weighting the prior mean and the ordinary least squares estimate.[3] It also yields a posterior predictive distribution p(\mathbf{y}^* \mid \mathbf{y}, \mathbf{X}^*, \mathbf{X}) that marginalizes over the posterior, providing not only point predictions but also credible intervals that account for both parameter and noise uncertainty; in the conjugate case, this is a multivariate Student-t distribution with mean given by the posterior mean of the linear predictor \boldsymbol{\mu}^T \psi(\mathbf{x}^*) (where \psi denotes basis functions if used) and scale incorporating uncertainties in \boldsymbol{\beta} and \sigma^2.[4] Inference can be exact with conjugate priors or approximated via Markov chain Monte Carlo for non-conjugate cases, such as shrinkage priors like the Bayesian lasso for sparse high-dimensional settings.[3] This framework extends naturally to generalized linear models and hierarchical structures, making it versatile for complex data analyses.[1]
Introduction and Fundamentals
Overview
Bayesian linear regression is a probabilistic approach to modeling the linear relationship between a response variable and one or more predictor variables, where the regression coefficients and noise variance are treated as random variables assigned prior probability distributions. The posterior distribution for these parameters is derived by updating the prior with the data likelihood using Bayes' theorem, yielding a full probability distribution that encapsulates uncertainty in the model. This framework extends classical linear regression by incorporating prior knowledge and providing a coherent mechanism for inference under uncertainty.[2]

The origins of Bayesian linear regression can be traced to Carl Friedrich Gauss's 1809 work on least squares estimation, which he justified using principles of inverse probability akin to Bayesian reasoning, laying early groundwork for treating parameters probabilistically. In the 20th century, the approach gained prominence through contributions like those of Dennis V. Lindley and Adrian F. M. Smith, who in 1972 developed Bayes estimates for the linear model, emphasizing exchangeability and prior elicitation to enhance statistical inference. These developments built on foundational ideas from Thomas Bayes and Pierre-Simon Laplace, adapting them to regression contexts for more robust handling of data scarcity and expert knowledge.[5][6]

Key motivations for adopting Bayesian linear regression over frequentist methods stem from its capacity to address limitations in ordinary least squares (OLS), which yields single point estimates without priors or inherent uncertainty measures. Priors enable regularization to prevent overfitting, particularly in high-dimensional settings, while the posterior supports credible intervals for precise uncertainty quantification and the marginal likelihood facilitates objective model selection and comparison. This results in more flexible and interpretable analyses, especially when prior information from domain expertise is available.[1]
Model Formulation
Bayesian linear regression begins with the standard linear model, which posits a linear relationship between the response variable and predictors. Specifically, for n observations, the model is given by \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}, where \mathbf{y} is the n \times 1 vector of responses, \mathbf{X} is the n \times p design matrix of predictors (including a column of ones for the intercept), \boldsymbol{\beta} is the p \times 1 vector of regression coefficients, and \boldsymbol{\epsilon} is the n \times 1 vector of errors with \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n).[7] This formulation assumes that the errors are independent and identically distributed as normal with mean zero and constant variance \sigma^2.[7]

The likelihood function for the observed data \mathbf{y} given the parameters \boldsymbol{\beta} and \sigma^2 follows directly from the normality assumption and is expressed as p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2) = (2\pi \sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right). This multivariate normal likelihood encapsulates the contribution of the data to parameter inference in the Bayesian framework.[7]

The model relies on several key assumptions: linearity in the parameters, independence of the errors, homoscedasticity (constant variance \sigma^2), and normality of the errors.[8] Violations of these assumptions, such as heteroscedasticity or non-normality, can affect inference; however, Bayesian approaches often demonstrate robustness to such departures through appropriate prior specifications or alternative error distributions like the t-distribution.[7]

For comparison, the frequentist ordinary least squares (OLS) estimator provides a point estimate for \boldsymbol{\beta} by minimizing the residual sum of squares, yielding \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}, with asymptotic variance \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}.[9] This estimator coincides with the posterior mode under noninformative priors in the Bayesian setting but lacks the full uncertainty quantification provided by the posterior distribution.[7]
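The model and likelihood above can be made concrete with a short numerical example. The following NumPy sketch simulates data from \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, computes the OLS estimate \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}, and evaluates the Gaussian log-likelihood; the sample sizes, true parameter values, and function names are illustrative assumptions, not part of any standard implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from y = X beta + eps with eps ~ N(0, sigma^2 I); values are illustrative.
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept column plus predictors
beta_true = np.array([1.0, 2.0, -0.5])
sigma_true = 0.8
y = X @ beta_true + rng.normal(scale=sigma_true, size=n)

# OLS estimate: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Gaussian log-likelihood log p(y | beta, sigma^2) from the expression above
def log_likelihood(beta, sigma2):
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2

print(beta_hat, log_likelihood(beta_hat, sigma_true ** 2))
```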
Bayesian Inference Principles
Bayesian inference in linear regression applies Bayes' theorem to update beliefs about the model parameters based on observed data, yielding a posterior distribution that incorporates both prior knowledge and the evidence from the data. For the regression coefficients \boldsymbol{\beta} and error variance \sigma^2, the theorem states that the joint posterior is proportional to the product of the likelihood and the prior: p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \propto p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2, \mathbf{X}) \, p(\boldsymbol{\beta}, \sigma^2), where \mathbf{y} denotes the response vector, \mathbf{X} the design matrix, p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2, \mathbf{X}) the Gaussian likelihood assuming normally distributed errors, and p(\boldsymbol{\beta}, \sigma^2) the prior on the parameters. This formulation allows the posterior to reflect the full uncertainty in \boldsymbol{\beta} and \sigma^2, rather than relying on point estimates.[10]

The joint prior p(\boldsymbol{\beta}, \sigma^2) is typically specified in a separable manner as p(\boldsymbol{\beta} \mid \sigma^2) \, p(\sigma^2), where the conditional prior on \boldsymbol{\beta} given \sigma^2 often follows a multivariate normal distribution scaled by the variance, and p(\sigma^2) is a marginal prior such as an inverse-gamma or scaled-inverse-chi-squared distribution. This structure facilitates analytical tractability in conjugate cases and enables the incorporation of domain-specific information, such as regularization through informative priors on \boldsymbol{\beta}.[10]

A key output of Bayesian inference is the posterior predictive distribution, which generates probabilistic forecasts for new data \mathbf{y}^* at covariate values \mathbf{X}^* by marginalizing over the posterior: p(\mathbf{y}^* \mid \mathbf{y}, \mathbf{X}, \mathbf{X}^*) = \int p(\mathbf{y}^* \mid \boldsymbol{\beta}, \sigma^2, \mathbf{X}^*) \, p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \, d\boldsymbol{\beta} \, d\sigma^2. This integral quantifies prediction uncertainty directly from the model.

The advantages of this full distributional approach include the ability to derive credible intervals for parameters and predictions, capturing epistemic and aleatoric uncertainty, as well as shrinkage effects from priors that regularize estimates toward plausible values, which is particularly beneficial in high dimensions or small samples. In contrast, the maximum a posteriori (MAP) estimate optimizes the posterior for a single point, forgoing the richer uncertainty quantification provided by the full distribution.[10]
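Even when the posterior is only known up to the proportionality constant above, its logarithm is simple to evaluate as log-likelihood plus log-prior. The sketch below does this for a normal prior on \boldsymbol{\beta} scaled by \sigma^2 and an inverse-gamma prior on \sigma^2; the function signature and default hyperparameter values are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Unnormalised log-posterior: log p(beta, sigma^2 | y, X)
#   = log p(y | beta, sigma^2, X) + log p(beta | sigma^2) + log p(sigma^2).
# Uses a N(mu0, sigma^2 * V0) prior on beta and an Inverse-Gamma(nu0/2, nu0*s0_sq/2)
# prior on sigma^2; hyperparameter defaults are illustrative.
def log_posterior(beta, sigma2, X, y, mu0, V0, nu0=2.0, s0_sq=1.0):
    n = len(y)
    resid = y - X @ beta
    log_lik = -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2
    log_prior_beta = stats.multivariate_normal.logpdf(beta, mean=mu0, cov=sigma2 * V0)
    log_prior_sigma2 = stats.invgamma.logpdf(sigma2, a=nu0 / 2, scale=nu0 * s0_sq / 2)
    return log_lik + log_prior_beta + log_prior_sigma2
```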
Prior Distributions
Conjugate Priors
In Bayesian linear regression, the conjugate prior for the model parameters \boldsymbol{\beta} and the noise variance \sigma^2 is the normal-inverse-gamma (NIG) distribution, which belongs to the exponential family and matches the form of the normal likelihood to enable analytical posterior updates.[11] The marginal prior on \sigma^2 is specified as an inverse-gamma distribution: \sigma^2 \sim \text{Inverse-Gamma}\left(\frac{\nu_0}{2}, \frac{s_0^2 \nu_0}{2}\right), where \nu_0 > 0 and s_0^2 > 0 are hyperparameters controlling the prior shape and scale, respectively.[11] Conditional on \sigma^2, the prior on the regression coefficients \boldsymbol{\beta} is multivariate normal: \boldsymbol{\beta} \mid \sigma^2 \sim \mathcal{N}\left(\boldsymbol{\mu}_0, \sigma^2 (\lambda_0 \mathbf{X}_0^\top \mathbf{X}_0)^{-1}\right), with hyperparameters \boldsymbol{\mu}_0 (a p \times 1 vector, where p is the number of predictors), \lambda_0 > 0, and \mathbf{X}_0 (a \lambda_0 \times p prior design matrix).[11]

The hyperparameters admit natural interpretations in terms of fictitious prior data: \nu_0 represents the prior degrees of freedom (analogous to the effective number of prior observations for the variance estimate), s_0^2 is the prior scale or expected variance based on that data, \boldsymbol{\mu}_0 is the prior mean estimate for \boldsymbol{\beta}, \lambda_0 acts as the prior sample size or precision weight, and \mathbf{X}_0 serves as the design matrix for the imaginary prior dataset that generates the prior covariance structure.[11] This setup allows the prior to encode beliefs derived from previous experiments or domain knowledge, with larger \lambda_0 and \nu_0 implying stronger prior influence.

The joint prior density p(\boldsymbol{\beta}, \sigma^2) is the product of the conditional normal and marginal inverse-gamma densities: \begin{aligned} p(\boldsymbol{\beta}, \sigma^2) &= p(\boldsymbol{\beta} \mid \sigma^2) \, p(\sigma^2) \\ &= \frac{|\lambda_0 \mathbf{X}_0^\top \mathbf{X}_0|^{1/2}}{(2\pi)^{p/2} (\sigma^2)^{p/2}} \exp\left[ -\frac{1}{2\sigma^2} (\boldsymbol{\beta} - \boldsymbol{\mu}_0)^\top (\lambda_0 \mathbf{X}_0^\top \mathbf{X}_0) (\boldsymbol{\beta} - \boldsymbol{\mu}_0) \right] \\ &\quad \times \frac{ \left( \frac{s_0^2 \nu_0}{2} \right)^{\nu_0 / 2} }{ \Gamma(\nu_0 / 2) } (\sigma^2)^{-(\nu_0 / 2 + 1)} \exp\left[ -\frac{s_0^2 \nu_0}{2 \sigma^2} \right]. \end{aligned} This form ensures conjugacy with the Gaussian likelihood, yielding a posterior of the same NIG family.[11]

Special cases of the NIG prior include non-informative limits, such as the Jeffreys prior p(\boldsymbol{\beta}, \sigma^2) \propto 1/\sigma^2, obtained by taking \nu_0 \to 0 and \lambda_0 \to 0 (equivalently, letting the prior precision on \boldsymbol{\beta} vanish so that its covariance diverges), which expresses vague prior knowledge without favoring specific values.[11] Additionally, fixing \sigma^2 and using a normal prior on \boldsymbol{\beta} with covariance \sigma^2 (\lambda_0 \mathbf{I})^{-1} (a special case where \mathbf{X}_0^\top \mathbf{X}_0 = \mathbf{I}) yields a posterior mean equivalent to the ridge regression estimator with penalty parameter \lambda_0, providing a Bayesian interpretation of regularization.[12]
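The ridge connection noted above can be checked numerically: with \sigma^2 fixed, prior mean zero, and prior precision \lambda_0 \mathbf{I}, the conjugate posterior mean coincides with the minimizer of the penalized least-squares objective. The following sketch uses simulated data and an arbitrary \lambda_0 purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Illustrative data and penalty
n, p = 40, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.7]) + rng.normal(scale=0.5, size=n)
lambda0 = 2.0

# Posterior mean under beta ~ N(0, sigma^2 lambda0^{-1} I) with sigma^2 fixed:
# mu_n = (X^T X + lambda0 I)^{-1} X^T y
posterior_mean = np.linalg.solve(X.T @ X + lambda0 * np.eye(p), X.T @ y)

# Ridge estimate from direct minimisation of ||y - X b||^2 + lambda0 ||b||^2
objective = lambda b: np.sum((y - X @ b) ** 2) + lambda0 * np.sum(b ** 2)
ridge = minimize(objective, np.zeros(p)).x

print(np.max(np.abs(posterior_mean - ridge)))  # agreement up to optimiser tolerance
```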
Non-Conjugate Priors
Non-conjugate priors in Bayesian linear regression refer to prior distributions on model parameters that do not yield analytically tractable posterior distributions when combined with the likelihood, necessitating computational approximation techniques for inference.[13] Unlike conjugate priors, which facilitate exact solutions, non-conjugate priors offer greater flexibility for incorporating complex structures such as sparsity or domain-specific knowledge, making them suitable for modern high-dimensional problems.[14]

Prominent examples include sparsity-inducing priors like the horseshoe prior, which places a conditional normal distribution on each regression coefficient \beta_j \mid \sigma^2, \tau, \lambda_j \sim \mathcal{N}(0, \sigma^2 \tau^2 \lambda_j^2) with global shrinkage \tau and local shrinkage parameters \lambda_j \sim \text{C}^+(0,1) (half-Cauchy), promoting many coefficients to near zero while allowing a few to remain large.[15] The Laplace prior, equivalent to a double-exponential distribution, induces L1 regularization akin to the lasso and is implemented hierarchically as \beta_j \mid \sigma^2, \tau_j^2 \sim \mathcal{N}(0, \sigma^2 \tau_j^2) with \tau_j^2 \sim \text{Exp}(\lambda^2/2), enabling automatic variable selection through shrinkage.[16] Empirical Bayes approaches, which estimate hyperparameters from the data itself (e.g., by maximizing the marginal likelihood), can be applied to various priors, including conjugate ones like Zellner's g-prior with an empirically chosen scaling factor g, to balance shrinkage in high dimensions.[17]

These priors excel in high-dimensional settings where the number of predictors exceeds observations, facilitating effective variable selection and improved prediction by concentrating posterior mass on sparse models.[14] They also allow incorporation of domain knowledge through hierarchical structures, such as multilevel priors that model group-level variability in coefficients, enhancing interpretability in contexts like genomics for gene selection or finance for factor modeling under sparsity constraints.[18] Preference for non-conjugate priors over conjugate ones arises in such scenarios, where analytical simplicity is traded for robustness to irrelevant variables and better out-of-sample performance.[13]

A key challenge is the resulting posterior intractability, which demands approximation via methods like Markov chain Monte Carlo, though this increases computational cost compared to conjugate cases.[19] Hyperparameters in these priors, such as the global shrinkage \tau in the horseshoe or the rate \lambda in the Laplace, are often elicited using empirical Bayes approaches that maximize the marginal likelihood to estimate values from data, or via cross-validation to optimize predictive performance.[20]
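As an illustration of a sparsity-inducing, non-conjugate specification, the following PyMC sketch encodes the horseshoe hierarchy described above (half-Cauchy global and local scales with conditionally normal coefficients). The function name, prior scales, and sampler settings are illustrative assumptions rather than a canonical implementation.

```python
import pymc as pm
import numpy as np

# Minimal horseshoe-prior regression sketch:
# beta_j | sigma, tau, lambda_j ~ N(0, sigma^2 tau^2 lambda_j^2), with half-Cauchy scales.
# X (n x p) and y are assumed to be NumPy arrays supplied by the user.
def horseshoe_model(X, y):
    n, p = X.shape
    with pm.Model() as model:
        sigma = pm.HalfNormal("sigma", sigma=1.0)
        tau = pm.HalfCauchy("tau", beta=1.0)            # global shrinkage
        lam = pm.HalfCauchy("lam", beta=1.0, shape=p)   # local shrinkage, one per coefficient
        beta = pm.Normal("beta", mu=0.0, sigma=sigma * tau * lam, shape=p)
        mu = pm.math.dot(X, beta)
        pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
        idata = pm.sample(1000, tune=1000, target_accept=0.95)  # NUTS handles the non-conjugate posterior
    return idata
```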
Posterior Inference
Analytical Posterior Derivation
In Bayesian linear regression, the analytical posterior distribution can be derived exactly when using the conjugate normal-inverse-gamma (NIG) prior for the model parameters \boldsymbol{\beta} and \sigma^2. The model assumes \mathbf{y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}, where \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}_n). The NIG prior is specified as \boldsymbol{\beta} \mid \sigma^2 \sim \mathcal{N}(\boldsymbol{\mu}_0, \sigma^2 (\lambda_0 \mathbf{X}_0^\top \mathbf{X}_0)^{-1}) and \sigma^2 \sim \text{Inverse-Gamma}(\nu_0/2, s_0^2 \nu_0 / 2), which encodes prior information as if derived from \lambda_0 pseudo-observations with design matrix \mathbf{X}_0.[11][21] The likelihood is \mathbf{y} \mid \boldsymbol{\beta}, \sigma^2, \mathbf{X} \sim \mathcal{N}(\mathbf{X} \boldsymbol{\beta}, \sigma^2 \mathbf{I}_n).

The joint posterior p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) is proportional to the product of the prior and likelihood. Due to conjugacy, this posterior remains in the NIG family, with updated hyperparameters derived by completing the square in the exponent of the normal components and aggregating scale terms for the inverse-gamma part. The conditional posterior for the coefficients is \boldsymbol{\beta} \mid \sigma^2, \mathbf{y}, \mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}_n, \sigma^2 \mathbf{B}_n^{-1}), where \mathbf{B}_n = \lambda_0 \mathbf{X}_0^\top \mathbf{X}_0 + \mathbf{X}^\top \mathbf{X}, and the marginal posterior for the variance is \sigma^2 \mid \mathbf{y}, \mathbf{X} \sim \text{Inverse-Gamma}(\nu_n/2, s_n^2 \nu_n / 2).[11][22][21]

The posterior hyperparameters are updated as follows:
\mathbf{B}_n = \lambda_0 \mathbf{X}_0^\top \mathbf{X}_0 + \mathbf{X}^\top \mathbf{X}, \quad \boldsymbol{\mu}_n = \mathbf{B}_n^{-1} (\lambda_0 \mathbf{X}_0^\top \mathbf{X}_0 \boldsymbol{\mu}_0 + \mathbf{X}^\top \mathbf{X} \hat{\boldsymbol{\beta}}),
\nu_n = \nu_0 + n, \quad s_n^2 = \frac{\nu_0 s_0^2 + (\mathbf{y} - \mathbf{X} \boldsymbol{\mu}_n)^\top (\mathbf{y} - \mathbf{X} \boldsymbol{\mu}_n) + \lambda_0 (\boldsymbol{\mu}_0 - \boldsymbol{\mu}_n)^\top \mathbf{X}_0^\top \mathbf{X}_0 (\boldsymbol{\mu}_0 - \boldsymbol{\mu}_n)}{\nu_n}.
These updates reflect the accumulation of prior pseudo-data with the observed data: \nu_n increases with sample size n, the precision matrix \mathbf{B}_n augments the prior information matrix with the data information matrix, \boldsymbol{\mu}_n weights the prior mean and the ordinary least squares estimator \hat{\boldsymbol{\beta}} by their respective precisions, and s_n^2 adjusts the scale by the residual sum of squares relative to the posterior mean plus a prior-data discrepancy term.[11][22][21]

Integrating out \sigma^2, the marginal posterior for \boldsymbol{\beta} is a multivariate t-distribution: \boldsymbol{\beta} \mid \mathbf{y}, \mathbf{X} \sim \mathcal{T}_{\nu_n} (\boldsymbol{\mu}_n, s_n^2 \mathbf{B}_n^{-1}). This arises because the conditional normal for \boldsymbol{\beta} given \sigma^2 combined with the inverse-gamma marginal for \sigma^2 yields a t-distribution, providing heavier tails than the normal to account for uncertainty in \sigma^2.[11][22][21]

Key properties of this posterior include that the mean \boldsymbol{\mu}_n is a precision-weighted average of the prior mean \boldsymbol{\mu}_0 (weighted by the prior precision \lambda_0 \mathbf{X}_0^\top \mathbf{X}_0) and the data-based estimator \hat{\boldsymbol{\beta}} (weighted by \mathbf{X}^\top \mathbf{X}), ensuring shrinkage toward the prior. The posterior variance shrinks with increasing n, as \mathbf{B}_n grows, concentrating the distribution around \boldsymbol{\mu}_n and reflecting reduced uncertainty from more observations.[11][22]
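The conjugate update can be written directly from these formulas. The NumPy sketch below takes the prior precision in the combined form \boldsymbol{\Lambda}_0 = \lambda_0 \mathbf{X}_0^\top \mathbf{X}_0 and returns the posterior hyperparameters (\boldsymbol{\mu}_n, \mathbf{B}_n, \nu_n, s_n^2); the function name and argument conventions are illustrative.

```python
import numpy as np

# Conjugate (normal-inverse-gamma) posterior update for Bayesian linear regression.
# Lambda0 = lambda0 * X0^T X0 is the prior precision on beta (up to sigma^2);
# mu0, nu0, s0_sq are the remaining NIG hyperparameters.
def nig_posterior(X, y, mu0, Lambda0, nu0, s0_sq):
    n = len(y)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)            # OLS estimate
    B_n = Lambda0 + X.T @ X                                  # posterior precision (up to sigma^2)
    mu_n = np.linalg.solve(B_n, Lambda0 @ mu0 + X.T @ X @ beta_hat)
    nu_n = nu0 + n
    resid = y - X @ mu_n
    s_n_sq = (nu0 * s0_sq + resid @ resid + (mu0 - mu_n) @ Lambda0 @ (mu0 - mu_n)) / nu_n
    # Marginally, beta | y, X follows a multivariate t with nu_n degrees of freedom,
    # location mu_n, and scale matrix s_n_sq * inv(B_n).
    return mu_n, B_n, nu_n, s_n_sq
```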
Computational Methods
When analytical solutions for the posterior distribution are intractable, such as in cases involving non-conjugate priors or high-dimensional settings, computational methods become essential for approximating Bayesian inference in linear regression models. These techniques enable the generation of samples or approximations from the posterior distribution p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}), facilitating uncertainty quantification and model fitting.[23]

Markov chain Monte Carlo (MCMC) methods are a cornerstone for posterior sampling in Bayesian linear regression, particularly when exact derivations are unavailable. For conjugate or semi-conjugate priors, Gibbs sampling offers an efficient approach by iteratively drawing from the full conditional posterior distributions of the parameters, such as alternating between the regression coefficients \boldsymbol{\beta} given \sigma^2 and the variance \sigma^2 given \boldsymbol{\beta}.[23] This method leverages the conditional conjugacy to produce a Markov chain that converges to the joint posterior. For more general, non-conjugate cases, the Metropolis-Hastings algorithm is employed, where proposals for parameter updates are accepted or rejected based on the posterior ratio to ensure the chain targets the desired distribution.[24] Implementations of these MCMC techniques are available in probabilistic programming libraries such as Stan, which uses Hamiltonian Monte Carlo for efficient sampling, and PyMC, which provides Metropolis-Hastings and related step methods alongside gradient-based samplers.

Variational inference provides a faster, optimization-based alternative to MCMC by approximating the posterior with a simpler distribution that minimizes the Kullback-Leibler (KL) divergence to the true posterior. A common mean-field approximation assumes a factorized form, such as q(\boldsymbol{\beta}, \sigma^2) = q(\boldsymbol{\beta}) q(\sigma^2), where q(\boldsymbol{\beta}) is typically Gaussian and q(\sigma^2) inverse-gamma, with the factors updated via coordinate ascent variational Bayes to iteratively optimize the evidence lower bound.[25] This approach scales well to large datasets but introduces bias by enforcing independence among parameters.[26]

The Laplace approximation offers a deterministic method for posterior approximation by performing a second-order Taylor expansion of the log-posterior around its mode, the maximum a posteriori (MAP) estimate, yielding a Gaussian distribution centered at the mode with covariance given by the inverse Hessian.[27] This quadratic approximation is particularly useful for moderate-dimensional problems where the posterior is roughly symmetric and unimodal, providing quick estimates of posterior moments without sampling.[27]

Assessing convergence and reliability of these approximations is critical, with diagnostics including trace plots to visualize chain mixing and stationarity, as well as the Gelman-Rubin statistic, which compares variances within and between multiple chains to detect non-convergence (values close to 1 indicate convergence).[28] Computational costs rise with sample size n or dimension p, where MCMC may require thousands of iterations for high-dimensional models, while variational methods and Laplace approximations offer O(n p^2) or better scalability but at the expense of accuracy in multimodal posteriors.[28]

Modern software tools facilitate these methods, including JAGS for Gibbs sampling via conditional specifications, Stan for advanced MCMC with automatic differentiation, PyMC for flexible Python-based modeling with JAX-accelerated backends in recent releases, and Pyro for scalable inference in deep probabilistic models.
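To make the Gibbs scheme concrete, the following sketch implements the sampler for a semi-conjugate specification in which \boldsymbol{\beta} \sim \mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Lambda}_0^{-1}) independently of \sigma^2 \sim \text{Inverse-Gamma}(a_0, b_0), so that both full conditionals have closed forms. The prior defaults, iteration count, and function name are illustrative assumptions.

```python
import numpy as np

# Gibbs sampler sketch for the semi-conjugate model
#   beta ~ N(mu0, Lambda0^{-1}),  sigma^2 ~ Inverse-Gamma(a0, b0)  (independent priors),
# alternating draws from the two full conditional distributions.
def gibbs_sampler(X, y, n_iter=2000, mu0=None, Lambda0=None, a0=2.0, b0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mu0 = np.zeros(p) if mu0 is None else mu0
    Lambda0 = np.eye(p) if Lambda0 is None else Lambda0
    XtX, Xty = X.T @ X, X.T @ y

    beta, sigma2 = np.zeros(p), 1.0
    draws_beta, draws_sigma2 = [], []
    for _ in range(n_iter):
        # beta | sigma^2, y  ~  N(m, V)
        V = np.linalg.inv(Lambda0 + XtX / sigma2)
        m = V @ (Lambda0 @ mu0 + Xty / sigma2)
        beta = rng.multivariate_normal(m, V)

        # sigma^2 | beta, y  ~  Inverse-Gamma(a0 + n/2, b0 + 0.5 * RSS),
        # drawn here as the reciprocal of a Gamma variate
        resid = y - X @ beta
        sigma2 = 1.0 / rng.gamma(a0 + n / 2, 1.0 / (b0 + 0.5 * resid @ resid))

        draws_beta.append(beta)
        draws_sigma2.append(sigma2)
    return np.array(draws_beta), np.array(draws_sigma2)
```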
Model Evaluation and Predictions
Marginal Likelihood
The marginal likelihood, also known as the model evidence, in Bayesian linear regression is defined as the probability of the observed data \mathbf{y} given the design matrix \mathbf{X} and model \mathcal{M}, obtained by integrating out the parameters \boldsymbol{\beta} and \sigma^2: p(\mathbf{y} \mid \mathbf{X}, \mathcal{M}) = \int p(\mathbf{y} \mid \boldsymbol{\beta}, \sigma^2, \mathbf{X}, \mathcal{M}) \, p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{X}, \mathcal{M}) \, d\boldsymbol{\beta} \, d\sigma^2. This integral represents the normalizing constant of the posterior distribution and quantifies the plausibility of the data under the model prior to observing the parameters.[29]

Under conjugate priors such as the normal-inverse-gamma (NIG) distribution for (\boldsymbol{\beta}, \sigma^2), the marginal likelihood admits a closed-form expression as the probability density function of a multivariate Student-t distribution for \mathbf{y}, with location parameter \mathbf{X} \boldsymbol{\mu}_0, degrees of freedom \nu_0, and scale matrix s_0^2 (\mathbf{I}_n + \mathbf{X} \mathbf{V}_0 \mathbf{X}^\top), where \boldsymbol{\mu}_0, \mathbf{V}_0, s_0^2, and \nu_0 are the prior mean, covariance scaling matrix, scale, and degrees of freedom parameters of the NIG prior, respectively. This closed form facilitates exact computation when the prior is conjugate, enabling analytical model assessment without simulation. It is derived by completing the square in the joint density of the parameters and data, marginalizing over \boldsymbol{\beta} and \sigma^2.[29][11]

The marginal likelihood serves as the basis for Bayes factors, which are ratios of evidences between competing models and support Bayesian model selection. For instance, the Bayes factor comparing two nested linear regression models \mathcal{M}_1 and \mathcal{M}_0 is BF_{10} = p(\mathbf{y} \mid \mathbf{X}, \mathcal{M}_1) / p(\mathbf{y} \mid \mathbf{X}, \mathcal{M}_0), where values greater than 1 favor \mathcal{M}_1 and provide a measure of relative evidence strength. This approach is particularly useful in variable selection for linear regression, as it penalizes model complexity through prior integration.[30]

When closed-form expressions are unavailable or priors are non-conjugate, the marginal likelihood can be approximated using posterior samples from MCMC methods. The harmonic mean estimator, computed as the reciprocal of the average of the reciprocals of the likelihood evaluated at posterior draws, provides a simple Monte Carlo-based estimate but suffers from high variance. Chib's method offers a more stable alternative by equating the marginal likelihood to the ratio of prior and posterior ordinates at a high-density point, leveraging Gibbs sampling output for accuracy. In high dimensions, such as large-scale linear regression with many predictors, these approximations face challenges due to poor MCMC mixing and the curse of dimensionality, often requiring advanced techniques like bridge sampling for reliable estimates.[31][32]
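Under the conjugate NIG prior, the log marginal likelihood can be evaluated directly from the multivariate Student-t form given above, as in the following SciPy sketch; the hyperparameter names follow this section's notation, and the function itself is an illustrative assumption rather than a library routine.

```python
import numpy as np
from scipy.stats import multivariate_t

# Log marginal likelihood under the NIG prior, using the closed form above:
# y | X ~ multivariate Student-t with df nu0, location X mu0,
# and scale matrix s0_sq * (I_n + X V0 X^T).
def log_marginal_likelihood(X, y, mu0, V0, nu0, s0_sq):
    n = len(y)
    scale = s0_sq * (np.eye(n) + X @ V0 @ X.T)
    return multivariate_t.logpdf(y, loc=X @ mu0, shape=scale, df=nu0)

# A Bayes factor comparing two candidate design matrices X1 and X0 would then be
#   np.exp(log_marginal_likelihood(X1, y, ...) - log_marginal_likelihood(X0, y, ...))
```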
Predictive Distributions
In Bayesian linear regression, the posterior predictive distribution provides a full probabilistic forecast for new observations \mathbf{y}^* given inputs \mathbf{x}^*, the observed data \mathbf{y}, and design matrix \mathbf{X}. It is obtained by integrating the likelihood over the posterior distribution of the parameters:
p(\mathbf{y}^* \mid \mathbf{y}, \mathbf{X}, \mathbf{x}^*) = \int \mathcal{N}(\mathbf{y}^* \mid \mathbf{x}^{* \top} \boldsymbol{\beta}, \sigma^2) \, p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \, d\boldsymbol{\beta} \, d\sigma^2.
This distribution quantifies both the aleatoric uncertainty from the noise process and the epistemic uncertainty from the parameters, enabling predictions that reflect all sources of variability in the model. Under conjugate priors, such as the normal-inverse-gamma prior on \boldsymbol{\beta} and \sigma^2, the posterior predictive distribution admits a closed-form expression as a multivariate Student-t distribution. Specifically, for a new observation,
\mathbf{y}^* \mid \mathbf{y}, \mathbf{X}, \mathbf{x}^* \sim t_{\nu_n} \left( \mathbf{x}^{* \top} \boldsymbol{\mu}_n, \, s_n^2 \left(1 + \mathbf{x}^{* \top} \mathbf{B}_n^{-1} \mathbf{x}^* \right) \right),
where \nu_n, \boldsymbol{\mu}_n, s_n^2, and \mathbf{B}_n are the updated posterior quantities from the conjugate derivation above (degrees of freedom \nu_n = \nu_0 + n, posterior mean \boldsymbol{\mu}_n, scale s_n^2, and posterior precision matrix \mathbf{B}_n). This form arises because the marginal posterior for \boldsymbol{\beta} is Student-t when \sigma^2 has an inverse-gamma prior, and convolving with the Gaussian likelihood yields another Student-t. The heavier tails of the Student-t compared to a Gaussian capture the combined uncertainties, making it suitable for small samples or when variance is uncertain.[33][29]

Credible intervals derived from the posterior predictive distribution are typically wider than corresponding frequentist confidence intervals for the same data, as they incorporate uncertainty in the parameter estimates \boldsymbol{\beta} and \sigma^2 in addition to sampling variability. For instance, in linear models with partially observed outcomes, asymptotic analysis shows that Bayesian credible intervals expand to account for this full posterior variability, providing more conservative forecasts that avoid understating risk. This difference highlights the Bayesian approach's explicit quantification of parameter indeterminacy, which frequentist intervals treat as fixed.

When conjugate priors do not apply or for non-conjugate cases, the posterior predictive can be approximated via Monte Carlo methods by drawing samples \{\boldsymbol{\beta}^{(s)}, \sigma^{2(s)}\}_{s=1}^S from the posterior p(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) and then simulating \mathbf{y}^{*(s)} \sim \mathcal{N}(\mathbf{x}^{* \top} \boldsymbol{\beta}^{(s)}, \sigma^{2(s)}) for each draw. The empirical distribution of \{\mathbf{y}^{*(s)}\} approximates p(\mathbf{y}^* \mid \mathbf{y}, \mathbf{X}, \mathbf{x}^*), from which summaries like means, variances, or quantiles can be computed. This simulation-based approach is flexible and scales to complex models using Markov chain Monte Carlo. To evaluate the quality of these predictives, metrics such as calibration (assessing reliability, e.g., via probability integral transform histograms) and sharpness (measuring concentration, e.g., average interval width) are used, often in conjunction with proper scoring rules to ensure the approximations are well-calibrated to observed outcomes.[29]
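The Monte Carlo recipe just described reduces to a few lines of NumPy once posterior draws are available, as in the following sketch; the function name and the assumption that draws are supplied as arrays (for example from MCMC output) are illustrative.

```python
import numpy as np

# Monte Carlo approximation of the posterior predictive at a new input x_star,
# given posterior draws beta_draws (S x p) and sigma2_draws (S,).
def posterior_predictive_samples(beta_draws, sigma2_draws, x_star, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    means = beta_draws @ x_star                          # x*^T beta^(s) for each draw
    return rng.normal(means, np.sqrt(sigma2_draws))      # y*^(s) ~ N(x*^T beta^(s), sigma^2(s))

# Predictive summaries, e.g. a 95% credible interval:
#   y_star = posterior_predictive_samples(beta_draws, sigma2_draws, x_star)
#   np.percentile(y_star, [2.5, 97.5])
```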
Extensions and Applications
Advanced Variants
Bayesian linear regression can be extended to hierarchical models, where priors are placed on hyperparameters to account for variability across groups or levels, such as in multilevel regression with random effects. In these models, regression coefficients are treated as random variables drawn from a higher-level distribution, enabling partial pooling of information across subgroups to improve estimates in settings with clustered data. This approach, foundational in works like Gelman and Hill (2007), facilitates modeling of varying intercepts or slopes while incorporating uncertainty in the group-level parameters.

In high-dimensional settings where the number of predictors exceeds the sample size, sparsity-inducing priors address variable selection challenges in Bayesian linear regression. The Bayesian lasso, introduced by Park and Casella (2008), employs a Laplace prior on coefficients, equivalent to a scale mixture of normals, which shrinks small coefficients toward zero while providing credible intervals for selection.[16] The horseshoe prior, developed by Carvalho et al. (2009), uses a global-local shrinkage mechanism with half-Cauchy distributions on local scales, achieving aggressive shrinkage for irrelevant variables and minimal shrinkage for important ones, particularly effective in sparse signals.[34] Spike-and-slab priors, as formalized by Ishwaran and Rao (2005), combine a point mass (spike) at zero with a broader slab distribution, enabling exact variable selection through posterior inclusion probabilities, though computationally intensive without approximations.

Robust variants of Bayesian linear regression replace Gaussian errors with heavier-tailed distributions to mitigate outlier influence. The Student-t error model, with degrees of freedom as a parameter controlling tail heaviness, provides downweighting of anomalous observations while maintaining conjugacy in certain hierarchical setups, as analyzed in Bayesian frameworks by Geweke (1993); a sketch of this variant appears at the end of this section.[35] This extension connects to broader generalized linear models but focuses on linear mean structures with robust likelihoods, enhancing inference stability in contaminated data.

Time-series extensions incorporate temporal dependence into Bayesian linear regression, such as autoregressive (AR) models where the response depends on lagged values. Bayesian AR(p) models treat AR coefficients as random with priors ensuring stationarity, estimated via MCMC, as in the foundational approach by Chib and Greenberg (1994).[36] In the nonparametric limit, Gaussian processes emerge as infinite-dimensional Bayesian linear regression with a kernel-induced covariance, viewing the function as a draw from a GP prior updated by data, equivalent to kernelized Bayesian regression as detailed by Rasmussen and Williams (2006).

Recent advancements post-2020 emphasize scalability for large datasets, particularly through variational Bayes (VB) approximations in Bayesian linear regression. Mean-field VB for high-dimensional sparse regression, as in Ray and Szabo (2021), optimizes tractable posteriors under spike-and-slab priors, achieving faster inference than MCMC while preserving selection accuracy.[37]
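As one concrete instance of these variants, the following PyMC sketch fits the robust Student-t error model described above, treating the degrees of freedom as an additional parameter; the prior choices and function name are illustrative assumptions rather than a reference implementation.

```python
import pymc as pm

# Robust variant with Student-t errors: the Gaussian likelihood is replaced by a
# t-distribution whose degrees of freedom nu are also estimated from the data.
# X (n x p) and y are assumed to be NumPy arrays; priors are illustrative.
def robust_t_regression(X, y):
    with pm.Model() as model:
        beta = pm.Normal("beta", mu=0.0, sigma=10.0, shape=X.shape[1])
        sigma = pm.HalfNormal("sigma", sigma=1.0)
        nu = pm.Gamma("nu", alpha=2.0, beta=0.1)        # small nu implies heavier tails
        mu = pm.math.dot(X, beta)
        pm.StudentT("y_obs", nu=nu, mu=mu, sigma=sigma, observed=y)
        idata = pm.sample(1000, tune=1000)
    return idata
```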
Practical Applications
Bayesian linear regression finds extensive use in machine learning, particularly for incorporating uncertainty quantification into predictive models, which aids decision-making in scenarios like active learning, where the model selects data points to query based on predictive variance, and in reinforcement learning for balancing exploration and exploitation through epistemic uncertainty estimates. For instance, in active learning tasks, the posterior predictive distribution allows for efficient sample selection by prioritizing regions of high uncertainty, improving model performance with fewer labeled examples. A notable connection arises in the limiting case where Bayesian linear regression employs a Gaussian process prior on the regression weights, effectively recovering Gaussian process regression, which extends the framework to non-linear mappings via kernel functions and provides uncertainty estimates for complex data.[38]

In scientific domains, Bayesian linear regression supports robust inference under uncertainty. In epidemiology, it enables adjustment for confounders through informative priors on regression coefficients, facilitating causal inference in observational studies with sparse or noisy data, such as modeling disease incidence while accounting for socioeconomic variables. In finance, the approach is applied to risk modeling by placing priors on volatilities and correlations, allowing for probabilistic forecasts of asset returns and value-at-risk estimates that incorporate historical market beliefs to mitigate overfitting in volatile environments. Environmental science leverages Bayesian linear regression in spatial regression models to predict phenomena like pollutant levels, where spatial priors capture autocorrelation and enable interpolation across unsampled locations, enhancing accuracy in geospatial datasets.[39][40][41]

Practical case studies illustrate its versatility. In A/B testing, Bayesian linear regression with conjugate priors stabilizes estimates for small sample sizes by shrinking coefficients toward prior means, enabling reliable comparison of treatment effects in online experiments with limited traffic, such as website variant evaluations where traditional methods suffer from high variance. For drug dose-response modeling, it fits hierarchical linear models to quantify efficacy curves across patient subgroups, using priors informed by preclinical data to predict optimal dosages while quantifying uncertainty in therapeutic windows, as demonstrated in meta-analyses of clinical trial outcomes.[42][43]

Implementation is facilitated by accessible software packages. In R, the brms package provides a user-friendly interface for fitting Bayesian linear regression models via Stan, with straightforward specification of normal priors on the intercept and coefficients. For example, a Bayesian linear regression with normal priors can be specified as follows:

```r
library(brms)

fit <- brm(
  y ~ x,
  data = dataset,
  family = gaussian(),
  prior = prior(normal(0, 10), class = Intercept) +
          prior(normal(0, 10), class = b),
  control = list(adapt_delta = 0.95)
)
summary(fit)
```

This code fits the model and summarizes the posterior. In Python, PyMC offers similar functionality with NumPyro or JAX backends for efficient sampling. An example using a normal likelihood and normal priors is:

```python
import pymc as pm
import numpy as np

# x and y are the observed predictor and response arrays
with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sigma=10)
    beta = pm.Normal('beta', mu=0, sigma=10)
    sigma = pm.HalfNormal('sigma', sigma=1)
    mu = alpha + beta * x
    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=y)
    trace = pm.sample(1000)
pm.summary(trace)
```

These tools handle posterior inference automatically, often outperforming frequentist alternatives in low-data settings.[44][45]

The practical benefits of Bayesian linear regression include superior handling of small datasets through prior regularization, which prevents overfitting, and seamless incorporation of expert knowledge via informative priors, leading to more interpretable and robust models. Empirical comparisons in low-data regimes demonstrate reduced mean squared error (MSE) compared to ordinary least squares, particularly when priors align with domain expectations, while maintaining comparable performance on larger datasets.[42][46]