
Evidence lower bound

The Evidence Lower Bound (ELBO), also known as the variational lower bound, is a fundamental quantity in variational inference that provides a tractable lower bound on the log marginal likelihood—often called the evidence—of observed data in probabilistic models. It enables approximate posterior inference by optimizing an approximating distribution to closely match the true posterior, transforming intractable inference problems into solvable optimization tasks. Introduced in the context of graphical models, the ELBO has become essential for scaling Bayesian methods to complex, high-dimensional data.

The ELBO is derived from the non-negativity of the Kullback-Leibler (KL) divergence between a variational distribution q(\mathbf{z} \mid \mathbf{x}) and the true posterior p(\mathbf{z} \mid \mathbf{x}), yielding the identity \log p(\mathbf{x}) = \mathcal{L}(q) + \mathrm{KL}(q(\mathbf{z} \mid \mathbf{x}) \parallel p(\mathbf{z} \mid \mathbf{x})), where \mathcal{L}(q) is the ELBO defined as \mathcal{L}(q) = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}[\log q(\mathbf{z} \mid \mathbf{x})]. The same bound is obtained by applying Jensen's inequality to the log evidence \log p(\mathbf{x}) = \log \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}, introducing the variational distribution to make the expectation computable. Maximizing the ELBO with respect to the parameters of q thus tightens the bound and improves the posterior approximation, with equality holding when q = p(\mathbf{z} \mid \mathbf{x}).

In practice, the ELBO facilitates efficient inference and learning in latent variable models, such as Bayesian networks and Markov random fields, by decoupling complex dependencies through mean-field approximations or structured variational families. It underpins algorithms like coordinate ascent variational inference and stochastic gradient variants, which are faster and more scalable than Markov chain Monte Carlo for large datasets. A prominent application is in variational autoencoders (VAEs), where the ELBO serves as the objective function to train neural networks for generative modeling, balancing reconstruction accuracy and regularization via the KL term. Beyond deep generative models, the ELBO has influenced fields like topic modeling (e.g., latent Dirichlet allocation) and hierarchical Bayesian computation, providing a unified framework for approximate inference that trades exactness for computational efficiency. Ongoing research continues to refine ELBO-based methods, including black-box variants and tighter bounds using alternative divergences, to enhance accuracy in diverse probabilistic modeling tasks.

Fundamentals

Definition

The evidence lower bound (ELBO), also known as the variational lower bound, serves as a tractable surrogate objective for approximating the intractable log marginal likelihood, or log evidence, in Bayesian models. In these models, the log evidence \log p(x) is the logarithm of the marginal probability of the observed data x, obtained by integrating over latent variables z and model parameters, which is often computationally infeasible due to high dimensionality. The ELBO provides a lower bound on this quantity, enabling efficient inference and learning by optimizing an approximate posterior distribution q(z \mid x) that balances data fit and prior knowledge.

Intuitively, the ELBO decomposes into two terms: an expected reconstruction likelihood that measures how well the model explains the observed data, and a KL divergence term that acts as a regularization penalty, encouraging the approximate posterior to remain close to the prior distribution and thus controlling model complexity. This promotes parsimonious models that generalize well, avoiding overfitting by penalizing overly complex approximations. Within the broader framework of variational inference, the ELBO facilitates scalable Bayesian computation by transforming posterior inference into a tractable optimization problem.

The variational lower bound was formalized in the context of variational methods for graphical models in the seminal work by Jordan et al. (1999). The specific term "evidence lower bound" (ELBO) emerged later in the variational inference literature, for instance in Blei (2008). This bound is expressed through the fundamental inequality \log p(x) \geq \mathrm{ELBO}(q), where equality holds when q(z \mid x) = p(z \mid x), the true posterior. The gap between the log evidence and the ELBO equals the KL divergence between q(z \mid x) and the true posterior, quantifying the approximation quality.

Mathematical notation

In the standard probabilistic setup for variational inference, the observed data is denoted by \mathbf{x}, while the latent variables are represented by \mathbf{z}. The joint distribution over the observed and latent variables is p(\mathbf{x}, \mathbf{z}), which factors as the product of the likelihood p(\mathbf{x} \mid \mathbf{z}) and the prior p(\mathbf{z}). The posterior distribution, which is typically intractable, is p(\mathbf{z} \mid \mathbf{x}) = \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{x})}, where the marginal likelihood, or Bayesian evidence, is given by p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}. To approximate the intractable posterior, a variational distribution q(\mathbf{z} \mid \mathbf{x}) is introduced, often parameterized by variational parameters \theta, such that q(\mathbf{z} \mid \mathbf{x}) serves as a tractable surrogate for p(\mathbf{z} \mid \mathbf{x}). The evidence lower bound (ELBO) associated with this variational distribution is expressed in integral form as \begin{align*} \mathcal{L}(q) &= \int q(\mathbf{z} \mid \mathbf{x}) \log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z} \mid \mathbf{x})} \, d\mathbf{z} \\ &= \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})} \left[ \log p(\mathbf{x} \mid \mathbf{z}) \right] - \mathrm{KL}\left( q(\mathbf{z} \mid \mathbf{x}) \parallel p(\mathbf{z}) \right), \end{align*} where \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})} [\cdot] denotes expectation under q(\mathbf{z} \mid \mathbf{x}), and \mathrm{KL}(\cdot \parallel \cdot) is the Kullback-Leibler divergence. This bound provides a tractable objective that lower-bounds the log marginal likelihood \log p(\mathbf{x}). The notation assumes a continuous latent space unless otherwise specified, with intractable integrals over \mathbf{z} motivating the use of variational approximations.
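
To make the notation concrete, the following Python sketch (assuming NumPy; the one-dimensional conjugate model with prior z ~ N(0, 1) and likelihood x | z ~ N(z, 1), together with the particular variational parameters, are illustrative choices rather than anything prescribed by the sources) estimates the ELBO by Monte Carlo and compares it with the exact log evidence, which is available in closed form for this model:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(v, mu, var):
    """Log-density of a univariate Gaussian N(mu, var) evaluated at v."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (v - mu) ** 2 / var

x = 1.3                 # observed data point
m, s2 = 0.4, 0.8        # parameters of the variational distribution q(z | x) = N(m, s2)

# Monte Carlo estimate of ELBO = E_q[ log p(x | z) + log p(z) - log q(z | x) ]
z = rng.normal(m, np.sqrt(s2), size=100_000)
elbo = np.mean(log_normal(x, z, 1.0)        # log p(x | z)
               + log_normal(z, 0.0, 1.0)    # log p(z)
               - log_normal(z, m, s2))      # log q(z | x)

# For this conjugate model the evidence is exact: p(x) = N(x; 0, 2).
log_evidence = log_normal(x, 0.0, 2.0)
print(f"Monte Carlo ELBO = {elbo:.4f}  <=  log p(x) = {log_evidence:.4f}")
```

Tightening the bound would amount to moving (m, s2) toward the exact posterior N(x/2, 1/2), at which point the ELBO equals the log evidence.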

Theoretical Background

Bayesian evidence

In Bayesian statistics, the evidence, also known as the marginal likelihood, is defined as the probability of the observed data \mathbf{x} under the model, obtained by integrating the joint distribution over the latent variables \mathbf{z}: p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z}) \, d\mathbf{z}. This integral marginalizes out the latent variables, yielding the predictive density of the data averaged over all possible latent configurations weighted by their prior probabilities.

The evidence plays a central role in Bayesian model selection, where it facilitates comparison between competing models through the Bayes factor, defined as the ratio of the evidences for two models M_1 and M_2: B_{12} = p(\mathbf{x} \mid M_1) / p(\mathbf{x} \mid M_2). A Bayes factor greater than 1 indicates that M_1 provides a higher predictive density for the data, favoring it as a better explanation; values exceeding 10 are often interpreted as strong evidence. Higher evidence values correspond to models that better balance fit to the data with complexity via prior integration, promoting parsimony without explicit penalties like those in frequentist criteria.

Computing the evidence is generally intractable because it requires evaluating high-dimensional integrals that lack closed-form solutions, particularly when the posterior p(\mathbf{z} \mid \mathbf{x}) = p(\mathbf{x}, \mathbf{z}) / p(\mathbf{x}) involves normalizing by the unknown evidence itself. This intractability is exacerbated in complex models, such as deep generative models where nonlinear transformations and high-dimensional latents make numerical integration or sampling inefficient. The concept of the marginal likelihood traces back to foundational work in Bayesian statistics, with formal discussions of its role in Bayes factors and model selection appearing in works such as those reviewing the selection of prior distributions by formal rules. However, computational challenges in evaluating the evidence became particularly prominent as models grew more sophisticated and required approximations to circumvent direct integration. In response to these difficulties, methods like the evidence lower bound have been developed to provide tractable estimates of \log p(\mathbf{x}).
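
As a worked illustration of the evidence and the Bayes factor, the sketch below (assuming NumPy and SciPy; the coin-flip data and the Beta(1, 1) prior are hypothetical choices) evaluates the closed-form marginal likelihood of a Beta-Bernoulli model and compares it with a fixed fair-coin model:

```python
import numpy as np
from scipy.special import betaln

# Hypothetical data: k heads observed in n coin tosses.
n, k = 20, 15

# Model M1: theta ~ Beta(a, b), x_i ~ Bernoulli(theta).
# For a specific observed sequence, p(x | M1) = B(a + k, b + n - k) / B(a, b).
a, b = 1.0, 1.0
log_evidence_m1 = betaln(a + k, b + n - k) - betaln(a, b)

# Model M2: a fair coin with theta fixed at 0.5 (nothing to integrate out).
log_evidence_m2 = n * np.log(0.5)

bayes_factor_12 = np.exp(log_evidence_m1 - log_evidence_m2)
print(f"log p(x | M1) = {log_evidence_m1:.3f}, log p(x | M2) = {log_evidence_m2:.3f}")
print(f"Bayes factor B12 = {bayes_factor_12:.2f}")
```

Here only the first model integrates over its parameter, so its evidence automatically reflects both fit and complexity, as described above.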

Variational approximation

Variational inference (VI) provides an efficient approach to approximating intractable posterior distributions in Bayesian models by optimizing a variational distribution q(z; \theta), drawn from a specified family of tractable densities, to closely match the true posterior p(z \mid x). The optimization typically minimizes the Kullback-Leibler (KL) divergence \mathrm{KL}(q(z; \theta) \parallel p(z \mid x)), which is equivalent to maximizing the evidence lower bound (ELBO) as the objective function. This method was introduced in the context of graphical models to enable scalable inference.

A common choice for the variational family is the mean-field approximation, where the latent variables are assumed independent, yielding q(z; \theta) = \prod_{i=1}^d q_i(z_i; \theta_i) with each factor being a simple parametric distribution such as a Gaussian. This assumption simplifies computations and parameter estimation, particularly in high-dimensional settings, and formed the basis of early applications. Structured variational families, which relax full independence by incorporating dependencies, have also been developed to improve approximation quality while maintaining tractability.

VI offers key advantages in scalability, especially for large datasets, through techniques like stochastic variational inference that leverage noisy, unbiased gradients for optimization, enabling processing of massive data without full recomputation. Unlike Markov chain Monte Carlo (MCMC) methods, which provide asymptotically exact samples but can be computationally intensive and slow to converge, VI delivers fast, deterministic approximations suitable for time-sensitive or high-throughput applications. Despite these benefits, VI can suffer from limitations such as underestimation of posterior variance, a consequence of the KL-based optimization, which tends to fit the variational family to a mode of the posterior rather than fully capturing its spread.
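
Under the mean-field factorization, the ELBO-optimal coordinate update for each factor has a standard closed form (see, e.g., the variational inference reviews cited in the references): \log q_j^{*}(z_j) = \mathbb{E}_{q_{-j}}\left[ \log p(\mathbf{x}, \mathbf{z}) \right] + \text{const}, where the expectation is taken with respect to all factors other than q_j. Cycling through these updates is the basis of the coordinate ascent procedure described in the optimization sections below.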

Derivation and Properties

Deriving the ELBO

In variational inference, the evidence lower bound (ELBO) arises as a consequence of bounding the log evidence using the Kullback-Leibler (KL) divergence between a variational distribution q(z \mid x) and the true posterior p(z \mid x). Specifically, the KL divergence is given by \mathrm{KL}(q(z \mid x) \parallel p(z \mid x)) = \mathbb{E}_{q(z \mid x)} \left[ \log \frac{q(z \mid x)}{p(z \mid x)} \right], which can be rewritten as \mathrm{KL}(q(z \mid x) \parallel p(z \mid x)) = -\mathrm{ELBO}(q) + \log p(x), where \log p(x) is the log evidence, demonstrating that the ELBO provides a lower bound on \log p(x) since the KL divergence is nonnegative. Rearranging yields \mathrm{ELBO}(q) = \log p(x) - \mathrm{KL}(q(z \mid x) \parallel p(z \mid x)), confirming that maximizing the ELBO minimizes the divergence and tightens the bound on the evidence.

The ELBO decomposes into two interpretable terms: \mathrm{ELBO}(q) = \mathbb{E}_{q(z \mid x)} [\log p(x \mid z)] - \mathrm{KL}(q(z \mid x) \parallel p(z)), where the first term is the expected log-likelihood under the variational distribution, encouraging good reconstructions of the observed data x, and the second term acts as a regularizer that penalizes deviations from the prior p(z).

To derive the bound directly, begin with the log evidence: \log p(x) = \log \int q(z \mid x) \frac{p(x, z)}{q(z \mid x)} \, dz. By the concavity of the logarithm, Jensen's inequality implies \log p(x) \geq \int q(z \mid x) \log \left[ \frac{p(x, z)}{q(z \mid x)} \right] dz = \mathbb{E}_{q(z \mid x)} \left[ \log \frac{p(x, z)}{q(z \mid x)} \right], which expands to the ELBO as shown above. Equality holds if and only if q(z \mid x) = p(z \mid x), meaning the bound is tight precisely when the variational approximation matches the true posterior.
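
The identity \log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}(q \parallel p(z \mid x)) can be verified numerically on a toy model with a discrete latent variable, where every quantity is computable by exact summation. The following Python sketch assumes NumPy; the three-component model and the particular choice of q are arbitrary illustrative values:

```python
import numpy as np

# Toy model: discrete latent z in {0, 1, 2} with prior p(z); likelihood x | z ~ N(mean[z], 1).
prior = np.array([0.5, 0.3, 0.2])
means = np.array([-2.0, 0.0, 3.0])

def log_gauss(x, mu, sigma=1.0):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

x = 0.7
log_joint = np.log(prior) + log_gauss(x, means)       # log p(x, z) for each value of z
log_evidence = np.logaddexp.reduce(log_joint)         # log p(x) by exact summation
posterior = np.exp(log_joint - log_evidence)          # p(z | x)

q = np.array([0.6, 0.3, 0.1])                         # an arbitrary variational distribution
elbo = np.sum(q * (log_joint - np.log(q)))            # E_q[ log p(x, z) - log q(z) ]
kl = np.sum(q * (np.log(q) - np.log(posterior)))      # KL( q || p(z | x) )

print(f"log p(x)  = {log_evidence:.6f}")
print(f"ELBO + KL = {elbo + kl:.6f}")                 # matches log p(x) up to rounding
```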

Jensen's inequality application

Jensen's inequality states that for a concave function f and a random variable X, f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)], with equality if and only if X is constant almost surely. In the context of variational inference, the natural logarithm is a concave function, and the inequality is applied to the expectation under the variational distribution q(z) of the ratio p(x,z)/q(z), which serves as a weighted average. The log marginal likelihood can be expressed as \log p(x) = \log \mathbb{E}_{q(z)} \left[ \frac{p(x,z)}{q(z)} \right]. By Jensen's inequality, this yields \log p(x) \geq \mathbb{E}_{q(z)} \left[ \log \frac{p(x,z)}{q(z)} \right], where the right-hand side is the evidence lower bound (ELBO). Equality holds precisely when q(z) = p(z \mid x) almost surely. The concavity of the logarithm geometrically implies that the logarithm of an expectation is at least the expectation of the logarithm, resulting in a non-negative gap between \log p(x) and the ELBO that quantifies the looseness of the bound. This gap corresponds exactly to the Kullback-Leibler divergence \mathrm{KL}(q(z) \parallel p(z \mid x)), which vanishes only when the variational distribution matches the true posterior.
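
A minimal numerical illustration of the inequality itself, assuming NumPy (the uniform random variable is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(0.5, 2.0, size=100_000)         # a positive random variable W

print(f"log E[W] = {np.log(np.mean(w)):.4f}")   # log of the expectation
print(f"E[log W] = {np.mean(np.log(w)):.4f}")   # expectation of the log (smaller)
```

In the ELBO derivation, W plays the role of the ratio p(x, z)/q(z), and the resulting gap between \log p(x) and the ELBO is the KL divergence described above.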

Optimization Techniques

Maximizing the ELBO

The objective in variational inference is to maximize the evidence lower bound (ELBO) with respect to the variational parameters \theta, expressed as \mathcal{L}(\theta, \phi) = \mathbb{E}_{q(\mathbf{z};\theta)}[\log p(\mathbf{x}|\mathbf{z}; \phi)] - \mathrm{KL}(q(\mathbf{z}; \theta) \| p(\mathbf{z})), where \phi denotes the model parameters. This formulation separates the expected log-likelihood under the variational distribution q(\mathbf{z}; \theta) from the KL divergence measuring the discrepancy between q and the prior p(\mathbf{z}). Maximizing the ELBO tightens the variational approximation to the true posterior, as improvements in \mathcal{L} correspond to reductions in the KL divergence to the intractable posterior.

A common approach to optimization is coordinate ascent variational inference (CAVI), which iteratively maximizes the ELBO by updating one variational factor at a time while holding the others fixed. In the broader variational EM framework, this alternates between an inference step that optimizes \theta (tightening the bound by improving the variational posterior) and a learning step that optimizes \phi (maximizing the expected complete-data log-likelihood, and hence the observed-data ELBO). This coordinate-wise strategy exploits the structure of mean-field approximations, enabling closed-form updates in conjugate models.

For scalability to large datasets, stochastic optimization techniques approximate the expectations in the ELBO using Monte Carlo sampling. Specifically, the expected log-likelihood \mathbb{E}_{q(\mathbf{z}; \theta)}[\log p(\mathbf{x}|\mathbf{z}; \phi)] is estimated as \frac{1}{S} \sum_{s=1}^S \log p(\mathbf{x}|\mathbf{z}_s; \phi), where \{\mathbf{z}_s\}_{s=1}^S are samples drawn from q(\mathbf{z}; \theta). These noisy estimates enable stochastic gradient ascent on the ELBO, often using mini-batches to reduce computational cost while maintaining unbiased gradients.

The CAVI procedure guarantees a monotonic non-decrease in the ELBO at each update step, ensuring convergence to a local optimum of the objective, though the non-convexity of the ELBO landscape means different initializations can yield different local optima. In practice, optimization stops when changes in the ELBO fall below a tolerance threshold or when the ELBO evaluated on held-out data plateaus, providing a proxy for generalization performance.
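
As a concrete illustration of coordinate ascent, the following Python sketch (assuming NumPy) runs CAVI for the classic textbook case of a bivariate Gaussian target with a fully factorized Gaussian variational family; each coordinate update is available in closed form, the factor means converge to the true posterior means, and the fixed factor variances illustrate the variance underestimation discussed earlier. The particular mean vector and precision matrix are arbitrary:

```python
import numpy as np

# Target "posterior": p(z | x) = N(mu, Lambda^{-1}) with a known 2x2 precision matrix Lambda.
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.5]])

# Mean-field family q(z1) q(z2): for a Gaussian target, each optimal factor is Gaussian
# with fixed precision Lambda[i, i]; CAVI only needs to update the factor means.
m = np.zeros(2)
for _ in range(20):
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])   # update q(z1) with q(z2) fixed
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])   # update q(z2) with q(z1) fixed

print("CAVI factor means :", m, " (true posterior mean:", mu, ")")
print("q factor variances:", 1.0 / np.diag(Lam))
print("true marginal vars:", np.diag(np.linalg.inv(Lam)))    # larger than the q variances
```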

Reparameterization trick

In variational inference, optimizing the evidence lower bound (ELBO) via stochastic gradient methods often involves sampling latent variables z from the variational distribution q(z; \theta). However, computing gradients of the ELBO with respect to the variational parameters \theta using the score-function estimator—which differentiates the log-density of q rather than the samples themselves—results in high-variance estimates due to the stochastic nature of the samples. The reparameterization trick addresses this by re-expressing the sampled latent variable as a deterministic function of \theta and an auxiliary noise variable drawn from a fixed, parameter-free distribution. Specifically, z = g(\theta, \epsilon) with \epsilon \sim p(\epsilon) (e.g., a standard normal), allowing the stochasticity to be detached from the parameters. This transforms the gradient of the ELBO expectation into \nabla_\theta \mathbb{E}_{\epsilon \sim p(\epsilon)} [f(x, g(\theta, \epsilon); \phi)] = \mathbb{E}_{\epsilon \sim p(\epsilon)} [\nabla_\theta f(x, g(\theta, \epsilon); \phi)], where f denotes the log-joint or the relevant ELBO terms, enabling low-variance gradient estimates via reparameterization of the sampling process.

For a Gaussian variational distribution q(z; \theta) = \mathcal{N}(z; \mu(\theta), \sigma^2(\theta)), a common example is z = \mu(\theta) + \sigma(\theta) \epsilon with \epsilon \sim \mathcal{N}(0, I). Similar reparameterizations exist for other families, for example via inverse-CDF transforms or via scale mixtures and approximations. This technique was independently introduced by Kingma and Welling in 2013 and Rezende et al. in 2014, facilitating backpropagation through stochastic nodes in neural network-based models. In practice, the reparameterization trick often reduces the variance of gradient estimates by one to two orders of magnitude compared to score-function methods, leading to more stable and efficient optimization of the ELBO.
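
The variance reduction can be seen on a one-dimensional example. The Python sketch below (assuming NumPy; the objective \nabla_\theta \mathbb{E}_{z \sim \mathcal{N}(\theta, 1)}[z^2] is an illustrative stand-in for an ELBO term) compares the score-function and reparameterized estimators of the same gradient, whose exact value is 2\theta:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, S = 1.5, 10_000

# Score-function (REINFORCE) estimator: f(z) * d/dtheta log N(z; theta, 1) = z^2 * (z - theta)
z = rng.normal(theta, 1.0, size=S)
score_grads = z ** 2 * (z - theta)

# Reparameterized estimator: z = theta + eps with eps ~ N(0, 1), so the gradient is 2 * (theta + eps)
eps = rng.normal(size=S)
reparam_grads = 2.0 * (theta + eps)

print(f"exact gradient            : {2 * theta:.3f}")
print(f"score-function estimator  : mean {score_grads.mean():.3f}, variance {score_grads.var():.2f}")
print(f"reparameterized estimator : mean {reparam_grads.mean():.3f}, variance {reparam_grads.var():.2f}")
```

Both estimators are unbiased, but the reparameterized one has markedly smaller variance in this example.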

Advanced Forms and Bounds

Standard ELBO form

The evidence lower bound (ELBO) in its standard form provides a tractable objective for variational inference, expressed as \mathcal{L}(\theta, \phi) = \mathbb{E}_{q(z; \theta)} \left[ \log p(x \mid z; \phi) \right] - \mathrm{KL}\left( q(z; \theta) \| p(z) \right), where q(z; \theta) is the variational posterior distribution parameterized by \theta, p(x \mid z; \phi) is the likelihood under model parameters \phi, and p(z) is the prior over latent variables z. This formulation decomposes into an expected log-likelihood term and a Kullback-Leibler (KL) divergence regularization term.

The first term, \mathbb{E}_{q(z; \theta)} \left[ \log p(x \mid z; \phi) \right], known as the reconstruction or expected log-likelihood term, quantifies how well the model reconstructs the observed data x given samples from the approximate posterior; it encourages the generative model to fit the data effectively. The second term, -\mathrm{KL}\left( q(z; \theta) \| p(z) \right), penalizes deviations of the variational distribution from the prior, thereby preventing overfitting by promoting a structured latent space aligned with the model's inductive biases.

For scalability to large datasets, the standard ELBO is often approximated using mini-batches, where a scaling factor adjusts the estimate to account for subsampling. A related refinement is the importance weighted autoencoder (IWAE), which tightens the bound by averaging importance weights over multiple samples per data point, improving log-likelihood estimates and model performance. The ELBO also serves as a Bayesian analog to frequentist model-selection criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), since maximizing it approximates the log marginal likelihood while accounting for model complexity through the KL term, with asymptotic agreement with the log evidence under certain conditions.
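
For a dataset of N points with per-datapoint (local) latent variables and a mini-batch \mathcal{B} of size M, one standard unbiased subsampling estimator (a common construction rather than one tied to a specific reference here) rescales the per-datapoint terms as \mathcal{L}(\theta, \phi) \approx \frac{N}{M} \sum_{i \in \mathcal{B}} \left( \mathbb{E}_{q(z_i; \theta)}\left[ \log p(x_i \mid z_i; \phi) \right] - \mathrm{KL}\left( q(z_i; \theta) \parallel p(z_i) \right) \right), so that the expectation of the mini-batch estimate over random batches equals the full-data ELBO.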

Data-processing inequality

The data-processing inequality (DPI) states that, for random variables forming a Markov chain X \to Z \to Y, where Y = f(Z) is obtained from Z by a (possibly stochastic) function f, the mutual information satisfies I(X; Z) \geq I(X; Y), with equality when f preserves all information about X, for instance when f is invertible on the relevant support. This result underscores that further processing of a representation cannot increase the information available about the source variable X.

In variational inference, particularly in variational information bottleneck methods, the DPI motivates limiting the mutual information between the input and latent variables, I_q(X; Z) \leq I_c, encouraging compressed representations that retain relevant information for downstream tasks while regularizing the ELBO. This is useful in hierarchical models, where intermediate latent layers act as information bottlenecks, allowing variational approximations to leverage coarser representations without recomputing full dependencies.

For the ELBO, the DPI highlights inherent limitations in tightness due to information bottlenecks in the inference network: the bound degrades as processing discards information about the data, reflecting a trade-off between compression and predictive fidelity. Because equality in the DPI requires that processing preserve all information about X, this consideration informs design choices in variational models that seek to avoid unnecessary information loss.
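
In the deep variational information bottleneck formulation cited above, the information-theoretic objective \max\, I(Z; Y) - \beta\, I(Z; X) is replaced by an ELBO-like variational objective; in one common presentation (the notation here is illustrative: q(y \mid z) is a variational decoder and r(z) a variational marginal), it reads \mathcal{L}_{\mathrm{VIB}} = \frac{1}{N} \sum_{n=1}^{N} \left( \mathbb{E}_{p(z \mid x_n)}\left[ \log q(y_n \mid z) \right] - \beta\, \mathrm{KL}\left( p(z \mid x_n) \parallel r(z) \right) \right), which mirrors the reconstruction-plus-KL structure of the ELBO while explicitly bounding the compression term I(X; Z) from above.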

Tightness and refinements

The looseness of the evidence lower bound (ELBO) primarily arises from the mismatch between the approximate posterior q_\phi(z \mid x) and the true posterior p(z \mid x), quantified by the KL divergence term in the ELBO decomposition, which measures how well the variational distribution captures the true conditional dependencies. In amortized neural variational inference, this looseness is exacerbated by the use of a shared inference network to parameterize q across the dataset, leading to suboptimal approximations for individual data points and an amortization gap that often dominates the total suboptimality.

Refinements to tighten the ELBO include importance-weighted autoencoders (IWAEs), which leverage multiple samples from the approximate posterior to derive a strictly tighter bound, the importance-weighted ELBO (IW-ELBO), defined as L_K(x) = \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z \mid x)} \left[ \log \left( \frac{1}{K} \sum_{k=1}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k \mid x)} \right) \right], where using K > 1 samples improves tightness and the bound converges to the true log evidence as K \to \infty. Analyses of inference suboptimality decompose the total gap between the ELBO and the log evidence into an approximation gap (due to the variational family's limited expressiveness) and an amortization gap (due to the shared parameterization), enabling targeted improvements by increasing the inference network's capacity or using more expressive distributions such as normalizing flows.

Advanced bounds address ELBO limitations through alternatives like \alpha-divergences or Rényi bounds; for instance, the variational Rényi bound (VR bound) generalizes the ELBO using Rényi's \alpha-divergence, L_\alpha(q; D) = \frac{1}{1-\alpha} \log \mathbb{E}_q \left[ \left( \frac{p(\theta, D)}{q(\theta)} \right)^{1-\alpha} \right] for \alpha \neq 1, which recovers the ELBO as \alpha \to 1 but yields tighter approximations for other \alpha values by balancing mode-seeking and mass-covering behaviors.

Diagnostics for assessing ELBO quality involve comparing the bound to independent estimates of the true log evidence \log p(x), for example obtained via Markov chain Monte Carlo (MCMC), or applying Pareto-smoothed importance sampling (PSIS), which evaluates the reliability of the variational posterior by fitting the importance ratios to a generalized Pareto distribution and flagging unreliability when the estimated shape parameter \hat{k} > 0.7. These diagnostics quantify the variational gap and guide refinements, helping ensure the bound's tightness in practice.
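
The importance-weighted bound is straightforward to estimate with the log-sum-exp trick. The Python sketch below (assuming NumPy and SciPy) reuses a one-dimensional conjugate Gaussian toy model with a deliberately crude q, so that the bound can be seen tightening toward the exact log evidence as K grows; the model and the variational parameters are illustrative choices:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

def log_normal(v, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (v - mu) ** 2 / var

# Toy model: z ~ N(0, 1), x | z ~ N(z, 1), so log p(x) = log N(x; 0, 2) exactly.
x, m, s2 = 1.3, -0.5, 0.8          # observation and a deliberately poor q(z | x) = N(m, s2)

def iw_elbo(K, reps=5000):
    z = rng.normal(m, np.sqrt(s2), size=(reps, K))
    log_w = log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0) - log_normal(z, m, s2)
    return np.mean(logsumexp(log_w, axis=1) - np.log(K))   # E[ log (1/K) sum_k w_k ]

for K in (1, 5, 50):
    print(f"K = {K:3d}: IW-ELBO ≈ {iw_elbo(K):.4f}")
print(f"exact log p(x) = {log_normal(x, 0.0, 2.0):.4f}")
```

With K = 1 the estimator reduces to the standard ELBO; larger K monotonically tightens the bound in expectation.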

Applications

Variational autoencoders

Variational autoencoders (VAEs) represent a prominent application of the evidence lower bound (ELBO) in generative modeling, where the goal is to learn a latent representation of data that enables both reconstruction and sampling of new instances. In the VAE framework, an encoder parameterized by \theta approximates the posterior distribution q(z|x; \theta) over latent variables z given observed data x, while a decoder parameterized by \phi models the likelihood p(x|z; \phi). The prior over latents is typically a standard Gaussian p(z), and the ELBO for a single data point takes the form \mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q(z|x; \theta)}[\log p(x|z; \phi)] - \mathrm{KL}(q(z|x; \theta) \| p(z)), which lower-bounds the log marginal likelihood \log p(x) and decomposes into a reconstruction term and a regularization term that encourages the approximate posterior to match the prior.

This setup facilitates amortized inference, where neural networks parameterize both the encoder and decoder, scaling the variational approach to high-dimensional data such as images or text and avoiding the need for per-data-point optimization as in traditional variational inference. By maximizing the ELBO with respect to \theta and \phi, VAEs learn to compress inputs into a continuous latent space while ensuring generated samples resemble the training distribution through probabilistic decoding. A notable extension is the \beta-VAE, which introduces a weighting factor \beta > 1 on the KL divergence term to enhance disentanglement in the latent space, promoting the discovery of independent factors of variation in the data. This modification balances reconstruction fidelity against latent structure, leading to more interpretable representations in visual domains. Training employs stochastic gradient ascent on the ELBO, leveraging the reparameterization trick to differentiate through the stochastic sampling of z.

Introduced in 2013 and published in 2014, VAEs have significantly advanced deep generative modeling by providing a scalable, end-to-end differentiable framework for probabilistic modeling, influencing subsequent developments in areas such as conditional generation and hierarchical latent variable models.
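
A minimal sketch of the VAE objective, assuming PyTorch is available; the layer sizes, the Bernoulli (binary cross-entropy) likelihood, and the random stand-in batch are illustrative choices rather than the original authors' exact setup. The code computes the negative per-batch ELBO, with an optional \beta weight as in the \beta-VAE, and takes a single optimization step via the reparameterization trick:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=200, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec_h = nn.Linear(z_dim, h_dim)
        self.dec_x = nn.Linear(h_dim, x_dim)

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized sample
        logits = self.dec_x(torch.relu(self.dec_h(z)))
        return logits, mu, logvar

def negative_elbo(x, logits, mu, logvar, beta=1.0):
    # Expected reconstruction term under a Bernoulli likelihood (negative log-likelihood).
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # Analytic KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

model = TinyVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(8, 784)      # stand-in batch; real inputs would be (binarized) images in [0, 1]
loss = negative_elbo(x, *model(x))
loss.backward()
optimizer.step()
print(f"negative ELBO for the batch: {loss.item():.1f}")
```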

Gaussian processes and beyond

In Gaussian processes (GPs), the evidence lower bound (ELBO) facilitates scalable approximate inference by addressing the computational challenges of exact posterior computation, which scales cubically with the data size. A seminal approach is the sparse variational GP framework, which introduces inducing points to form a low-rank approximation of the GP posterior. By maximizing the ELBO with respect to the variational distribution over the inducing variables, this method enables efficient training and prediction while preserving much of the GP's capacity for uncertainty quantification.

Beyond GPs, the ELBO underpins variational inference in other classical probabilistic models, such as topic models and state-space models. In topic models like latent Dirichlet allocation (LDA), mean-field variational inference approximates the posterior over topic distributions and topic assignments by optimizing the ELBO, providing a tractable alternative to MCMC for discovering latent topics in large text corpora. For state-space models, black-box variational inference employs structured Gaussian variational approximations to the posterior, enabling scalable inference in nonlinear and non-conjugate settings.

Extensions of ELBO-based methods broaden applicability to diverse model classes. Black-box variational inference automates ELBO optimization for arbitrary probabilistic models using stochastic gradients, requiring only forward sampling from the model and variational distributions without model-specific derivations. In hierarchical models, structured variational families—such as those incorporating partial correlations or non-factorized approximations—improve the ELBO by better capturing dependencies, leading to tighter bounds and better posterior approximations than fully factorized mean-field assumptions.

These ELBO applications in non-neural models offer distinct advantages over point-estimate methods like maximum likelihood, particularly for uncertainty quantification. By explicitly modeling posterior variability through the variational distribution, they enable principled propagation of epistemic and aleatoric uncertainties, which is crucial for tasks such as Bayesian optimization with GPs or forecasting with state-space models.
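
For GP regression with Gaussian noise of variance \sigma^2, the collapsed form of this sparse variational bound (following Titsias, 2009, with the Nyström approximation \mathbf{Q}_{nn} = \mathbf{K}_{nm} \mathbf{K}_{mm}^{-1} \mathbf{K}_{mn} built from m inducing points) can be written as \mathcal{L} = \log \mathcal{N}\left( \mathbf{y} \mid \mathbf{0},\, \mathbf{Q}_{nn} + \sigma^2 \mathbf{I} \right) - \frac{1}{2\sigma^2} \mathrm{tr}\left( \mathbf{K}_{nn} - \mathbf{Q}_{nn} \right), where the trace term penalizes the information discarded by the inducing-point approximation and vanishes when the inducing points reproduce the full kernel matrix, recovering the exact GP marginal likelihood.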

References

  1. [1]
    An Introduction to Variational Methods for Graphical Models
    This paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (Bayesian networks and Markov ...
  2. [2]
    Variational Inference: A Review for Statisticians
    In this article, we review variational inference (VI), a method from machine learning that approximates probability densities through optimization.
  3. [3]
    [1312.6114] Auto-Encoding Variational Bayes - arXiv
    Dec 20, 2013 · Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv preprint.
  4. [4]
    [PDF] Variational Inference: A Review for Statisticians - arXiv
    May 9, 2018 · ELBO(q) = E[log p(z, x)] − E[log q(z)]. This function is called the evidence lower bound (ELBO).
  5. [5]
    [PDF] An Introduction to Variational Methods for Graphical Models
    Abstract. This paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (Bayesian networks ...)
  6. [6]
    [PDF] Bayes Factors - Robert E. Kass; Adrian E. Raftery
    Oct 14, 2003 · The choice of these priors and the extent to which Bayes factors are sensitive to this choice is discussed in Section 5.
  7. [7]
    [PDF] The Selection of Prior Distributions by Formal Rules
    Nov 27, 2017 · Robert E. Kass and Larry Wasserman. Journal of the American Statistical Association, Vol. 91, No. 435 (Sep. 1996), pp. 1343-1370.
  8. [8]
    [1601.00670] Variational Inference: A Review for Statisticians - arXiv
    Jan 4, 2016 · In this paper, we review variational inference (VI), a method from machine learning that approximates probability densities through optimization.
  9. [9]
    [PDF] Stochastic Variational Inference
    It maximizes the evidence lower bound (ELBO), a lower bound on the logarithm of the marginal probability of the observations log p(x). The ELBO is equal ...
  10. [10]
    [PDF] Two problems with variational expectation maximisation for time ...
    First, the compactness property of variational inference leads to a failure to propagate posterior uncertainty through time. Second, the dependence of the ...
  11. [11]
    [PDF] Automatic Differentiation Variational Inference
    Maximizing the elbo minimizes the kl divergence (Jordan et al., 1999; Bishop, 2006). Optimizing the kl divergence implies a constraint that the support of the ...
  12. [12]
    Stochastic Backpropagation and Approximate Inference in Deep ...
    Jan 16, 2014 · Stochastic Backpropagation and Approximate Inference in Deep Generative Models. Authors:Danilo Jimenez Rezende, Shakir Mohamed, Daan Wierstra.
  13. [13]
    Variance reduction properties of the reparameterization trick - arXiv
    Sep 27, 2018 · We show that the marginal variances of the reparameterization gradient estimator are smaller than those of the score function gradient estimator.
  14. [14]
    [1509.00519] Importance Weighted Autoencoders - arXiv
    Sep 1, 2015 · We present the importance weighted autoencoder (IWAE), a generative model with the same architecture as the VAE, but which uses a strictly tighter log- ...
  15. [15]
    Bayesian Model Selection via Mean-Field Variational Approximation
    Dec 17, 2023 · Comparing to BIC, ELBO tends to incur smaller approximation error to the log-marginal likelihood (a.k.a. model evidence) due to a better ...
  16. [16]
    Elements of Information Theory | Wiley Online Books
    Elements of Information Theory. Thomas M. Cover and Joy A. Thomas. First published: 7 April 2005. Print ISBN: 9780471241959.
  17. [17]
    [1612.00410] Deep Variational Information Bottleneck - arXiv
    Dec 1, 2016 · We call this method "Deep Variational Information Bottleneck", or Deep VIB. We show that models trained with the VIB objective outperform those ...
  18. [18]
    [PDF] Inference Suboptimality in Variational Autoencoders
    Table 1. Summary of Gap Terms. The middle column refers to the general case where our variational objective is a lower bound on the marginal log-likelihood.
  19. [19]
    Importance Weighted Autoencoders (IWAE), arXiv:1509.00519
  20. [20]
    Rényi Divergence Variational Inference, arXiv:1602.02311
  21. [21]
    [PDF] Yes, but Did It Work?: Evaluating Variational Inference
    Jun 7, 2018 · In this paper we propose two diagnostic methods that assess, respectively, the quality of the entire variational posterior for a particular data ...
  22. [22]
    [PDF] Variational Learning of Inducing Variables in Sparse Gaussian ...
    Titsias, M. K. (2009). Variational Model Selection for Sparse Gaussian Process Regression. Technical report, School of Computer Science, University of ...
  23. [23]
    [PDF] Latent Dirichlet Allocation - Journal of Machine Learning Research
    Latent Dirichlet Allocation. David M. Blei. Journal of Machine Learning Research 3 (2003) 993-1022. Submitted 2/02; Published 1/03.
  24. [24]
    Black box variational inference for state space models - arXiv
    This paper introduces a 'black-box' approximate inference technique for latent variable models using a structured Gaussian variational approximate posterior.
  25. [25]
    [1401.0118] Black Box Variational Inference - arXiv
    Dec 31, 2013 · In this paper, we present a "black box" variational inference algorithm, one that can be quickly applied to many models with little additional derivation.