
Evidence lower bound

The Evidence Lower Bound (ELBO), also known as the variational lower bound, is a fundamental quantity in variational inference that provides a tractable lower bound on the log marginal likelihood—often called the evidence—of observed data in probabilistic models. It enables approximate posterior inference by optimizing an approximating distribution to closely match the true posterior, transforming intractable inference problems into solvable optimization tasks. Introduced in the context of graphical models, the ELBO has become essential for scaling Bayesian methods to complex, high-dimensional data.

The ELBO is derived from the non-negativity of the Kullback-Leibler (KL) divergence between a variational distribution q(\mathbf{z} \mid \mathbf{x}) and the true posterior p(\mathbf{z} \mid \mathbf{x}), yielding the identity \log p(\mathbf{x}) = \mathcal{L}(q) + \mathrm{KL}(q(\mathbf{z} \mid \mathbf{x}) \parallel p(\mathbf{z} \mid \mathbf{x})), where \mathcal{L}(q) is the ELBO defined as \mathcal{L}(q) = \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}[\log p(\mathbf{x}, \mathbf{z})] - \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})}[\log q(\mathbf{z} \mid \mathbf{x})]. The same bound is obtained by applying Jensen's inequality to the log evidence \log p(\mathbf{x}) = \log \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}, introducing the variational distribution to make the expectation computable. Maximizing the ELBO with respect to the parameters of q thus tightens the bound and improves the posterior approximation, with equality holding when q = p(\mathbf{z} \mid \mathbf{x}).

In practice, the ELBO facilitates efficient inference and learning in latent variable models, such as Bayesian networks and Markov random fields, by decoupling complex dependencies through mean-field approximations or structured variational families. It underpins algorithms like coordinate ascent variational inference and stochastic gradient variants, which are faster and more scalable than Markov chain Monte Carlo for large datasets. A prominent application is in variational autoencoders (VAEs), where the ELBO serves as the objective function to train neural networks for generative modeling, balancing reconstruction accuracy and regularization via the KL term. Beyond deep generative models, the ELBO has influenced fields like topic modeling (e.g., latent Dirichlet allocation) and hierarchical Bayesian computation, providing a unified framework for approximate inference that trades exactness for computational efficiency. Ongoing research continues to refine ELBO-based methods, including black-box variants and tighter bounds using alternative divergences, to enhance accuracy in diverse probabilistic modeling tasks.

Fundamentals

Definition

The evidence lower bound (ELBO), also known as the variational lower bound, serves as a tractable surrogate objective for approximating the intractable log marginal likelihood, or log evidence, in Bayesian models. In these models, the log evidence \log p(x) is the logarithm of the marginal probability of the observed data x, obtained by integrating over latent variables z and model parameters, which is often computationally infeasible due to high dimensionality. The ELBO provides a lower bound on this quantity, enabling efficient inference and learning by optimizing an approximate posterior distribution q(z \mid x) that balances data fit and prior knowledge.

Intuitively, the ELBO decomposes into two terms: an expected reconstruction likelihood that measures how well the model explains the observed data, and a KL divergence term that acts as a regularization penalty, encouraging the approximate posterior to remain close to the prior distribution and thus controlling model complexity. This promotes parsimonious models that generalize well, avoiding overfitting by penalizing overly complex approximations. Within the broader framework of variational inference, the ELBO facilitates scalable Bayesian computation by transforming posterior inference into a tractable optimization problem.

The variational lower bound was formalized in the context of variational methods for graphical models in the seminal work by Jordan et al. (1999). The specific term "evidence lower bound" (ELBO) emerged later in the variational inference literature, for instance in Blei (2008). This bound is expressed through the fundamental inequality \log p(x) \geq \mathrm{ELBO}(q), where equality holds when q(z \mid x) = p(z \mid x), the true posterior. The gap between the log evidence and the ELBO equals the KL divergence between q(z \mid x) and the true posterior, quantifying the approximation quality.

Mathematical notation

In the standard probabilistic setup for variational inference, the observed data is denoted by \mathbf{x}, while the latent variables are represented by \mathbf{z}. The joint distribution over the observed and latent variables is p(\mathbf{x}, \mathbf{z}), which factors as the product of the likelihood p(\mathbf{x} \mid \mathbf{z}) and the prior p(\mathbf{z}). The posterior distribution, which is typically intractable, is p(\mathbf{z} \mid \mathbf{x}) = \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{x})}, where the marginal likelihood, or Bayesian evidence, is given by p(\mathbf{x}) = \int p(\mathbf{x}, \mathbf{z}) \, d\mathbf{z}. To approximate the intractable posterior, a variational distribution q(\mathbf{z} \mid \mathbf{x}) is introduced, often parameterized by variational parameters \theta, such that q(\mathbf{z} \mid \mathbf{x}) serves as a tractable surrogate for p(\mathbf{z} \mid \mathbf{x}). The evidence lower bound (ELBO) associated with this variational distribution is expressed in integral form as \begin{align*} \mathcal{L}(q) &= \int q(\mathbf{z} \mid \mathbf{x}) \log \frac{p(\mathbf{x}, \mathbf{z})}{q(\mathbf{z} \mid \mathbf{x})} \, d\mathbf{z} \\ &= \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})} \left[ \log p(\mathbf{x} \mid \mathbf{z}) \right] - \mathrm{KL}\left( q(\mathbf{z} \mid \mathbf{x}) \parallel p(\mathbf{z}) \right), \end{align*} where \mathbb{E}_{q(\mathbf{z} \mid \mathbf{x})} [\cdot] denotes expectation under q(\mathbf{z} \mid \mathbf{x}), and \mathrm{KL}(\cdot \parallel \cdot) is the Kullback-Leibler divergence. This bound provides a tractable objective that lower-bounds the log marginal likelihood \log p(\mathbf{x}). The notation assumes a continuous latent space unless otherwise specified, with intractable integrals over \mathbf{z} motivating the use of variational approximations.
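
To make the notation concrete, the following Python sketch (assuming NumPy; the one-dimensional conjugate model with prior z ~ N(0, 1) and likelihood x | z ~ N(z, 1), together with the particular variational parameters, are illustrative choices rather than anything prescribed by the sources) estimates the ELBO by Monte Carlo and compares it with the exact log evidence, which is available in closed form for this model:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(v, mu, var):
    """Log-density of a univariate Gaussian N(mu, var) evaluated at v."""
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (v - mu) ** 2 / var

x = 1.3                 # observed data point
m, s2 = 0.4, 0.8        # parameters of the variational distribution q(z | x) = N(m, s2)

# Monte Carlo estimate of ELBO = E_q[ log p(x | z) + log p(z) - log q(z | x) ]
z = rng.normal(m, np.sqrt(s2), size=100_000)
elbo = np.mean(log_normal(x, z, 1.0)        # log p(x | z)
               + log_normal(z, 0.0, 1.0)    # log p(z)
               - log_normal(z, m, s2))      # log q(z | x)

# For this conjugate model the evidence is exact: p(x) = N(x; 0, 2).
log_evidence = log_normal(x, 0.0, 2.0)
print(f"Monte Carlo ELBO = {elbo:.4f}  <=  log p(x) = {log_evidence:.4f}")
```

Tightening the bound would amount to moving (m, s2) toward the exact posterior N(x/2, 1/2), at which point the ELBO equals the log evidence.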

Theoretical Background

Bayesian evidence

In Bayesian statistics, the evidence, also known as the marginal likelihood, is defined as the probability of the observed data \mathbf{x} under the model, obtained by integrating the joint distribution over the latent variables \mathbf{z}: p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z}) \, d\mathbf{z}. This integral marginalizes out the latent variables, yielding the predictive density of the data averaged over all possible latent configurations weighted by their prior probabilities.

The evidence plays a central role in Bayesian model selection, where it facilitates comparison between competing models through the Bayes factor, defined as the ratio of the evidences for two models M_1 and M_2: B_{12} = p(\mathbf{x} \mid M_1) / p(\mathbf{x} \mid M_2). A Bayes factor greater than 1 indicates that M_1 provides a higher predictive density for the data, favoring it as a better explanation; values exceeding 10 are often interpreted as strong evidence. Higher evidence values correspond to models that better balance fit to the data with complexity via prior integration, promoting parsimony without explicit penalties like those in frequentist criteria.

Computing the evidence is generally intractable because it requires evaluating high-dimensional integrals that lack closed-form solutions, particularly when the posterior p(\mathbf{z} \mid \mathbf{x}) = p(\mathbf{x}, \mathbf{z}) / p(\mathbf{x}) involves normalizing by the unknown evidence itself. This intractability is exacerbated in complex models, such as deep generative models where nonlinear transformations and high-dimensional latents make numerical integration or sampling inefficient. The concept of the marginal likelihood traces back to foundational work in Bayesian statistics, with formal discussions of its role in Bayes factors and model selection appearing in works such as those reviewing the selection of prior distributions by formal rules. However, computational challenges in evaluating the evidence became particularly prominent as models grew more sophisticated and required approximations to circumvent direct integration. In response to these difficulties, methods like the evidence lower bound have been developed to provide tractable estimates of \log p(\mathbf{x}).
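
As a worked illustration of the evidence and the Bayes factor, the sketch below (assuming NumPy and SciPy; the coin-flip data and the Beta(1, 1) prior are hypothetical choices) evaluates the closed-form marginal likelihood of a Beta-Bernoulli model and compares it with a fixed fair-coin model:

```python
import numpy as np
from scipy.special import betaln

# Hypothetical data: k heads observed in n coin tosses.
n, k = 20, 15

# Model M1: theta ~ Beta(a, b), x_i ~ Bernoulli(theta).
# For a specific observed sequence, p(x | M1) = B(a + k, b + n - k) / B(a, b).
a, b = 1.0, 1.0
log_evidence_m1 = betaln(a + k, b + n - k) - betaln(a, b)

# Model M2: a fair coin with theta fixed at 0.5 (nothing to integrate out).
log_evidence_m2 = n * np.log(0.5)

bayes_factor_12 = np.exp(log_evidence_m1 - log_evidence_m2)
print(f"log p(x | M1) = {log_evidence_m1:.3f}, log p(x | M2) = {log_evidence_m2:.3f}")
print(f"Bayes factor B12 = {bayes_factor_12:.2f}")
```

Here only the first model integrates over its parameter, so its evidence automatically reflects both fit and complexity, as described above.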

Variational approximation

Variational inference (VI) provides an efficient approach to approximating intractable posterior distributions in Bayesian models by optimizing a variational distribution q(z; \theta), drawn from a specified family of tractable densities, to closely match the true posterior p(z \mid x). The optimization typically minimizes the Kullback-Leibler (KL) divergence \mathrm{KL}(q(z; \theta) \parallel p(z \mid x)), which is equivalent to maximizing the evidence lower bound (ELBO) as the objective function. This method was introduced in the context of graphical models to enable scalable inference.

A common choice for the variational family is the mean-field approximation, where the latent variables are assumed independent, yielding q(z; \theta) = \prod_{i=1}^d q_i(z_i; \theta_i) with each factor being a simple parametric distribution such as a Gaussian. This assumption simplifies computations and parameter estimation, particularly in high-dimensional settings, and formed the basis of early applications. Structured variational families, which relax full independence by incorporating dependencies, have also been developed to improve approximation quality while maintaining tractability.

VI offers key advantages in scalability, especially for large datasets, through techniques like stochastic variational inference that leverage noisy, unbiased gradients for optimization, enabling processing of massive data without full recomputation. Unlike Markov chain Monte Carlo (MCMC) methods, which provide asymptotically exact samples but can be computationally intensive and slow to converge, VI delivers fast, deterministic approximations suitable for time-sensitive or high-throughput applications. Despite these benefits, VI can suffer from limitations such as underestimation of posterior variance, a consequence of the KL-based optimization, which tends to fit the variational family to a mode of the posterior rather than fully capturing its spread.
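
Under the mean-field factorization, the ELBO-optimal coordinate update for each factor has a standard closed form (see, e.g., the variational inference reviews cited in the references): \log q_j^{*}(z_j) = \mathbb{E}_{q_{-j}}\left[ \log p(\mathbf{x}, \mathbf{z}) \right] + \text{const}, where the expectation is taken with respect to all factors other than q_j. Cycling through these updates is the basis of the coordinate ascent procedure described in the optimization sections below.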

Derivation and Properties

Deriving the ELBO

In variational inference, the evidence lower bound (ELBO) arises as a consequence of bounding the log evidence using the Kullback-Leibler (KL) divergence between a variational distribution q(z \mid x) and the true posterior p(z \mid x). Specifically, the KL divergence is given by \mathrm{KL}(q(z \mid x) \parallel p(z \mid x)) = \mathbb{E}_{q(z \mid x)} \left[ \log \frac{q(z \mid x)}{p(z \mid x)} \right], which can be rewritten as \mathrm{KL}(q(z \mid x) \parallel p(z \mid x)) = -\mathrm{ELBO}(q) + \log p(x), where \log p(x) is the log evidence, demonstrating that the ELBO provides a lower bound on \log p(x) since the KL divergence is nonnegative. Rearranging yields \mathrm{ELBO}(q) = \log p(x) - \mathrm{KL}(q(z \mid x) \parallel p(z \mid x)), confirming that maximizing the ELBO minimizes the divergence and tightens the bound on the evidence.

The ELBO decomposes into two interpretable terms: \mathrm{ELBO}(q) = \mathbb{E}_{q(z \mid x)} [\log p(x \mid z)] - \mathrm{KL}(q(z \mid x) \parallel p(z)), where the first term is the expected log-likelihood under the variational distribution, encouraging good reconstructions of the observed data x, and the second term acts as a regularizer that penalizes deviations from the prior p(z).

To derive the bound directly, begin with the log evidence: \log p(x) = \log \int q(z \mid x) \frac{p(x, z)}{q(z \mid x)} \, dz. By the concavity of the logarithm, Jensen's inequality implies \log p(x) \geq \int q(z \mid x) \log \left[ \frac{p(x, z)}{q(z \mid x)} \right] dz = \mathbb{E}_{q(z \mid x)} \left[ \log \frac{p(x, z)}{q(z \mid x)} \right], which expands to the ELBO as shown above. Equality holds if and only if q(z \mid x) = p(z \mid x), meaning the bound is tight precisely when the variational approximation matches the true posterior.
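
The identity \log p(x) = \mathrm{ELBO}(q) + \mathrm{KL}(q \parallel p(z \mid x)) can be verified numerically on a toy model with a discrete latent variable, where every quantity is computable by exact summation. The following Python sketch assumes NumPy; the three-component model and the particular choice of q are arbitrary illustrative values:

```python
import numpy as np

# Toy model: discrete latent z in {0, 1, 2} with prior p(z); likelihood x | z ~ N(mean[z], 1).
prior = np.array([0.5, 0.3, 0.2])
means = np.array([-2.0, 0.0, 3.0])

def log_gauss(x, mu, sigma=1.0):
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

x = 0.7
log_joint = np.log(prior) + log_gauss(x, means)       # log p(x, z) for each value of z
log_evidence = np.logaddexp.reduce(log_joint)         # log p(x) by exact summation
posterior = np.exp(log_joint - log_evidence)          # p(z | x)

q = np.array([0.6, 0.3, 0.1])                         # an arbitrary variational distribution
elbo = np.sum(q * (log_joint - np.log(q)))            # E_q[ log p(x, z) - log q(z) ]
kl = np.sum(q * (np.log(q) - np.log(posterior)))      # KL( q || p(z | x) )

print(f"log p(x)  = {log_evidence:.6f}")
print(f"ELBO + KL = {elbo + kl:.6f}")                 # matches log p(x) up to rounding
```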

Jensen's inequality application

Jensen's inequality states that for a concave function f and a random variable X, f(\mathbb{E}[X]) \geq \mathbb{E}[f(X)], with equality if and only if X is constant almost surely. In the context of variational inference, the natural logarithm is a concave function, and the inequality is applied to the expectation under the variational distribution q(z) of the ratio p(x,z)/q(z), which serves as a weighted average. The log marginal likelihood can be expressed as \log p(x) = \log \mathbb{E}_{q(z)} \left[ \frac{p(x,z)}{q(z)} \right]. By Jensen's inequality, this yields \log p(x) \geq \mathbb{E}_{q(z)} \left[ \log \frac{p(x,z)}{q(z)} \right], where the right-hand side is the evidence lower bound (ELBO). Equality holds precisely when q(z) = p(z \mid x) almost surely. The concavity of the logarithm geometrically implies that the logarithm of an expectation is at least the expectation of the logarithm, resulting in a non-negative gap between \log p(x) and the ELBO that quantifies the looseness of the bound. This gap corresponds exactly to the Kullback-Leibler divergence \mathrm{KL}(q(z) \parallel p(z \mid x)), which vanishes only when the variational distribution matches the true posterior.
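
A minimal numerical illustration of the inequality itself, assuming NumPy (the uniform random variable is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(0.5, 2.0, size=100_000)         # a positive random variable W

print(f"log E[W] = {np.log(np.mean(w)):.4f}")   # log of the expectation
print(f"E[log W] = {np.mean(np.log(w)):.4f}")   # expectation of the log (smaller)
```

In the ELBO derivation, W plays the role of the ratio p(x, z)/q(z), and the resulting gap between \log p(x) and the ELBO is the KL divergence described above.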

Optimization Techniques

Maximizing the ELBO

The objective in variational inference is to maximize the evidence lower bound (ELBO) with respect to the variational parameters \theta, expressed as \mathcal{L}(\theta, \phi) = \mathbb{E}_{q(\mathbf{z};\theta)}[\log p(\mathbf{x}|\mathbf{z}; \phi)] - \mathrm{KL}(q(\mathbf{z}; \theta) \| p(\mathbf{z})), where \phi denotes the model parameters. This formulation separates the expected log-likelihood under the variational distribution q(\mathbf{z}; \theta) from the KL divergence measuring the discrepancy between q and the prior p(\mathbf{z}). Maximizing the ELBO tightens the variational approximation to the true posterior, as improvements in \mathcal{L} correspond to reductions in the KL divergence to the intractable posterior.

A common approach to optimization is coordinate ascent variational inference (CAVI), which iteratively maximizes the ELBO by updating one variational factor at a time while holding the others fixed. In the broader variational EM framework, this alternates between an inference step that optimizes \theta (tightening the bound by improving the variational posterior) and a learning step that optimizes \phi (maximizing the expected complete-data log-likelihood, and hence the observed-data ELBO). This coordinate-wise strategy exploits the structure of mean-field approximations, enabling closed-form updates in conjugate models.

For scalability to large datasets, stochastic optimization techniques approximate the expectations in the ELBO using Monte Carlo sampling. Specifically, the expected log-likelihood \mathbb{E}_{q(\mathbf{z}; \theta)}[\log p(\mathbf{x}|\mathbf{z}; \phi)] is estimated as \frac{1}{S} \sum_{s=1}^S \log p(\mathbf{x}|\mathbf{z}_s; \phi), where \{\mathbf{z}_s\}_{s=1}^S are samples drawn from q(\mathbf{z}; \theta). These noisy estimates enable stochastic gradient ascent on the ELBO, often using mini-batches to reduce computational cost while maintaining unbiased gradients.

The CAVI procedure guarantees a monotonic non-decrease in the ELBO at each update step, ensuring convergence to a local optimum of the objective, though the non-convexity of the ELBO landscape means different initializations can yield different local optima. In practice, optimization stops when changes in the ELBO fall below a tolerance threshold or when the ELBO evaluated on held-out data plateaus, providing a proxy for generalization performance.
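
As a concrete illustration of coordinate ascent, the following Python sketch (assuming NumPy) runs CAVI for the classic textbook case of a bivariate Gaussian target with a fully factorized Gaussian variational family; each coordinate update is available in closed form, the factor means converge to the true posterior means, and the fixed factor variances illustrate the variance underestimation discussed earlier. The particular mean vector and precision matrix are arbitrary:

```python
import numpy as np

# Target "posterior": p(z | x) = N(mu, Lambda^{-1}) with a known 2x2 precision matrix Lambda.
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.5]])

# Mean-field family q(z1) q(z2): for a Gaussian target, each optimal factor is Gaussian
# with fixed precision Lambda[i, i]; CAVI only needs to update the factor means.
m = np.zeros(2)
for _ in range(20):
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])   # update q(z1) with q(z2) fixed
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])   # update q(z2) with q(z1) fixed

print("CAVI factor means :", m, " (true posterior mean:", mu, ")")
print("q factor variances:", 1.0 / np.diag(Lam))
print("true marginal vars:", np.diag(np.linalg.inv(Lam)))    # larger than the q variances
```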

Reparameterization trick

In variational inference, optimizing the evidence lower bound (ELBO) via stochastic gradient methods often involves sampling latent variables z from the variational distribution q(z; \theta). However, computing gradients of the ELBO with respect to the variational parameters \theta using the score-function estimator—which differentiates the log-density of q rather than the samples themselves—results in high-variance estimates due to the stochastic nature of the samples. The reparameterization trick addresses this by re-expressing the sampled latent variable as a deterministic function of \theta and an auxiliary noise variable drawn from a fixed, parameter-free distribution. Specifically, z = g(\theta, \epsilon) with \epsilon \sim p(\epsilon) (e.g., a standard normal), allowing the stochasticity to be detached from the parameters. This transforms the gradient of the ELBO expectation into \nabla_\theta \mathbb{E}_{\epsilon \sim p(\epsilon)} [f(x, g(\theta, \epsilon); \phi)] = \mathbb{E}_{\epsilon \sim p(\epsilon)} [\nabla_\theta f(x, g(\theta, \epsilon); \phi)], where f denotes the log-joint or the relevant ELBO terms, enabling low-variance gradient estimates via reparameterization of the sampling process.

For a Gaussian variational distribution q(z; \theta) = \mathcal{N}(z; \mu(\theta), \sigma^2(\theta)), a common example is z = \mu(\theta) + \sigma(\theta) \epsilon with \epsilon \sim \mathcal{N}(0, I). Similar reparameterizations exist for other families, for example via inverse-CDF transforms or via scale mixtures and approximations. This technique was independently introduced by Kingma and Welling in 2013 and Rezende et al. in 2014, facilitating backpropagation through stochastic nodes in neural network-based models. In practice, the reparameterization trick often reduces the variance of gradient estimates by one to two orders of magnitude compared to score-function methods, leading to more stable and efficient optimization of the ELBO.
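
The variance reduction can be seen on a one-dimensional example. The Python sketch below (assuming NumPy; the objective \nabla_\theta \mathbb{E}_{z \sim \mathcal{N}(\theta, 1)}[z^2] is an illustrative stand-in for an ELBO term) compares the score-function and reparameterized estimators of the same gradient, whose exact value is 2\theta:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, S = 1.5, 10_000

# Score-function (REINFORCE) estimator: f(z) * d/dtheta log N(z; theta, 1) = z^2 * (z - theta)
z = rng.normal(theta, 1.0, size=S)
score_grads = z ** 2 * (z - theta)

# Reparameterized estimator: z = theta + eps with eps ~ N(0, 1), so the gradient is 2 * (theta + eps)
eps = rng.normal(size=S)
reparam_grads = 2.0 * (theta + eps)

print(f"exact gradient            : {2 * theta:.3f}")
print(f"score-function estimator  : mean {score_grads.mean():.3f}, variance {score_grads.var():.2f}")
print(f"reparameterized estimator : mean {reparam_grads.mean():.3f}, variance {reparam_grads.var():.2f}")
```

Both estimators are unbiased, but the reparameterized one has markedly smaller variance in this example.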

Advanced Forms and Bounds

Standard ELBO form

The evidence lower bound (ELBO) in its standard form provides a tractable objective for variational inference, expressed as \mathcal{L}(\theta, \phi) = \mathbb{E}_{q(z; \theta)} \left[ \log p(x \mid z; \phi) \right] - \mathrm{KL}\left( q(z; \theta) \| p(z) \right), where q(z; \theta) is the variational posterior distribution parameterized by \theta, p(x \mid z; \phi) is the likelihood under model parameters \phi, and p(z) is the prior over latent variables z. This formulation decomposes into an expected log-likelihood term and a Kullback-Leibler (KL) divergence regularization term.

The first term, \mathbb{E}_{q(z; \theta)} \left[ \log p(x \mid z; \phi) \right], known as the reconstruction or expected log-likelihood term, quantifies how well the model reconstructs the observed data x given samples from the approximate posterior; it encourages the generative model to fit the data effectively. The second term, -\mathrm{KL}\left( q(z; \theta) \| p(z) \right), penalizes deviations of the variational distribution from the prior, thereby preventing overfitting by promoting a structured latent space aligned with the model's inductive biases.

For scalability to large datasets, the standard ELBO is often approximated using mini-batches, where a scaling factor adjusts the estimate to account for subsampling. A related refinement is the importance weighted autoencoder (IWAE), which tightens the bound by averaging importance weights over multiple samples per data point, improving log-likelihood estimates and model performance. The ELBO also serves as a Bayesian analog to frequentist model-selection criteria such as the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), since maximizing it approximates the log marginal likelihood while accounting for model complexity through the KL term, with asymptotic agreement with the log evidence under certain conditions.
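
For a dataset of N points with per-datapoint (local) latent variables and a mini-batch \mathcal{B} of size M, one standard unbiased subsampling estimator (a common construction rather than one tied to a specific reference here) rescales the per-datapoint terms as \mathcal{L}(\theta, \phi) \approx \frac{N}{M} \sum_{i \in \mathcal{B}} \left( \mathbb{E}_{q(z_i; \theta)}\left[ \log p(x_i \mid z_i; \phi) \right] - \mathrm{KL}\left( q(z_i; \theta) \parallel p(z_i) \right) \right), so that the expectation of the mini-batch estimate over random batches equals the full-data ELBO.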

Data-processing inequality

The data-processing inequality (DPI) states that, for random variables forming a Markov chain X \to Z \to Y, where Y = f(Z) is obtained from Z by a (possibly stochastic) function f, the mutual information satisfies I(X; Z) \geq I(X; Y), with equality when f preserves all information about X, for instance when f is invertible on the relevant support. This result underscores that further processing of a representation cannot increase the information available about the source variable X.

In variational inference, particularly in variational information bottleneck methods, the DPI motivates limiting the mutual information between the input and latent variables, I_q(X; Z) \leq I_c, encouraging compressed representations that retain relevant information for downstream tasks while regularizing the ELBO. This is useful in hierarchical models, where intermediate latent layers act as information bottlenecks, allowing variational approximations to leverage coarser representations without recomputing full dependencies.

For the ELBO, the DPI highlights inherent limitations in tightness due to information bottlenecks in the inference network: the bound degrades as processing discards information about the data, reflecting a trade-off between compression and predictive fidelity. Because equality in the DPI requires that processing preserve all information about X, this consideration informs design choices in variational models that seek to avoid unnecessary information loss.
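
In the deep variational information bottleneck formulation cited above, the information-theoretic objective \max\, I(Z; Y) - \beta\, I(Z; X) is replaced by an ELBO-like variational objective; in one common presentation (the notation here is illustrative: q(y \mid z) is a variational decoder and r(z) a variational marginal), it reads \mathcal{L}_{\mathrm{VIB}} = \frac{1}{N} \sum_{n=1}^{N} \left( \mathbb{E}_{p(z \mid x_n)}\left[ \log q(y_n \mid z) \right] - \beta\, \mathrm{KL}\left( p(z \mid x_n) \parallel r(z) \right) \right), which mirrors the reconstruction-plus-KL structure of the ELBO while explicitly bounding the compression term I(X; Z) from above.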

Tightness and refinements

The looseness of the evidence lower bound (ELBO) primarily arises from the mismatch between the approximate posterior q_\phi(z \mid x) and the true posterior p(z \mid x), quantified by the KL divergence term in the ELBO decomposition, which measures how well the variational distribution captures the true conditional dependencies. In amortized neural variational inference, this looseness is exacerbated by the use of a shared inference network to parameterize q across the dataset, leading to suboptimal approximations for individual data points and an amortization gap that often dominates the total suboptimality.

Refinements to tighten the ELBO include importance-weighted autoencoders (IWAEs), which leverage multiple samples from the approximate posterior to derive a strictly tighter bound, the importance-weighted ELBO (IW-ELBO), defined as L_K(x) = \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z \mid x)} \left[ \log \left( \frac{1}{K} \sum_{k=1}^K \frac{p_\theta(x, z_k)}{q_\phi(z_k \mid x)} \right) \right], where using K > 1 samples improves tightness and the bound converges to the true log evidence as K \to \infty. Analyses of inference suboptimality decompose the total gap between the ELBO and the log evidence into an approximation gap (due to the variational family's limited expressiveness) and an amortization gap (due to the shared parameterization), enabling targeted improvements by increasing the inference network's capacity or using more expressive distributions such as normalizing flows.

Advanced bounds address ELBO limitations through alternatives like \alpha-divergences or Rényi bounds; for instance, the variational Rényi bound (VR bound) generalizes the ELBO using Rényi's \alpha-divergence, L_\alpha(q; D) = \frac{1}{1-\alpha} \log \mathbb{E}_q \left[ \left( \frac{p(\theta, D)}{q(\theta)} \right)^{1-\alpha} \right] for \alpha \neq 1, which recovers the ELBO as \alpha \to 1 but yields tighter approximations for other \alpha values by balancing mode-seeking and mass-covering behaviors.

Diagnostics for assessing ELBO quality involve comparing the bound to independent estimates of the true log evidence \log p(x), for example obtained via Markov chain Monte Carlo (MCMC), or applying Pareto-smoothed importance sampling (PSIS), which evaluates the reliability of the variational posterior by fitting the importance ratios to a generalized Pareto distribution and flagging unreliability when the estimated shape parameter \hat{k} > 0.7. These diagnostics quantify the variational gap and guide refinements, helping ensure the bound's tightness in practice.
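
The importance-weighted bound is straightforward to estimate with the log-sum-exp trick. The Python sketch below (assuming NumPy and SciPy) reuses a one-dimensional conjugate Gaussian toy model with a deliberately crude q, so that the bound can be seen tightening toward the exact log evidence as K grows; the model and the variational parameters are illustrative choices:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

def log_normal(v, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (v - mu) ** 2 / var

# Toy model: z ~ N(0, 1), x | z ~ N(z, 1), so log p(x) = log N(x; 0, 2) exactly.
x, m, s2 = 1.3, -0.5, 0.8          # observation and a deliberately poor q(z | x) = N(m, s2)

def iw_elbo(K, reps=5000):
    z = rng.normal(m, np.sqrt(s2), size=(reps, K))
    log_w = log_normal(x, z, 1.0) + log_normal(z, 0.0, 1.0) - log_normal(z, m, s2)
    return np.mean(logsumexp(log_w, axis=1) - np.log(K))   # E[ log (1/K) sum_k w_k ]

for K in (1, 5, 50):
    print(f"K = {K:3d}: IW-ELBO ≈ {iw_elbo(K):.4f}")
print(f"exact log p(x) = {log_normal(x, 0.0, 2.0):.4f}")
```

With K = 1 the estimator reduces to the standard ELBO; larger K monotonically tightens the bound in expectation.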

Applications

Variational autoencoders

Variational autoencoders (VAEs) represent a prominent application of the evidence lower bound (ELBO) in generative modeling, where the goal is to learn a latent representation of data that enables both reconstruction and sampling of new instances. In the VAE framework, an encoder parameterized by \theta approximates the posterior distribution q(z|x; \theta) over latent variables z given observed data x, while a decoder parameterized by \phi models the likelihood p(x|z; \phi). The prior over latents is typically a standard Gaussian p(z), and the ELBO for a single data point takes the form \mathcal{L}(x; \theta, \phi) = \mathbb{E}_{q(z|x; \theta)}[\log p(x|z; \phi)] - \mathrm{KL}(q(z|x; \theta) \| p(z)), which lower-bounds the log marginal likelihood \log p(x) and decomposes into a reconstruction term and a regularization term that encourages the approximate posterior to match the prior.

This setup facilitates amortized inference, where neural networks parameterize both the encoder and decoder, scaling the variational approach to high-dimensional data such as images or text and avoiding the need for per-data-point optimization as in traditional variational inference. By maximizing the ELBO with respect to \theta and \phi, VAEs learn to compress inputs into a continuous latent space while ensuring generated samples resemble the training distribution through probabilistic decoding. A notable extension is the \beta-VAE, which introduces a weighting factor \beta > 1 on the KL divergence term to enhance disentanglement in the latent space, promoting the discovery of independent factors of variation in the data. This modification balances reconstruction fidelity against latent structure, leading to more interpretable representations in visual domains. Training employs stochastic gradient ascent on the ELBO, leveraging the reparameterization trick to differentiate through the stochastic sampling of z.

Introduced in 2013 and published in 2014, VAEs have significantly advanced deep generative modeling by providing a scalable, end-to-end differentiable framework for probabilistic modeling, influencing subsequent developments in areas such as conditional generation and hierarchical latent variable models.
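
A minimal sketch of the VAE objective, assuming PyTorch is available; the layer sizes, the Bernoulli (binary cross-entropy) likelihood, and the random stand-in batch are illustrative choices rather than the original authors' exact setup. The code computes the negative per-batch ELBO, with an optional \beta weight as in the \beta-VAE, and takes a single optimization step via the reparameterization trick:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=200, z_dim=20):
        super().__init__()
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        self.dec_h = nn.Linear(z_dim, h_dim)
        self.dec_x = nn.Linear(h_dim, x_dim)

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized sample
        logits = self.dec_x(torch.relu(self.dec_h(z)))
        return logits, mu, logvar

def negative_elbo(x, logits, mu, logvar, beta=1.0):
    # Expected reconstruction term under a Bernoulli likelihood (negative log-likelihood).
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # Analytic KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl

model = TinyVAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(8, 784)      # stand-in batch; real inputs would be (binarized) images in [0, 1]
loss = negative_elbo(x, *model(x))
loss.backward()
optimizer.step()
print(f"negative ELBO for the batch: {loss.item():.1f}")
```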

Gaussian processes and beyond

In Gaussian processes (GPs), the evidence lower bound (ELBO) facilitates scalable approximate inference by addressing the computational challenges of exact posterior computation, which scales cubically with the data size. A seminal approach is the sparse variational GP framework, which introduces inducing points to form a low-rank approximation of the GP posterior. By maximizing the ELBO with respect to the variational distribution over the inducing variables, this method enables efficient training and prediction while preserving much of the GP's capacity for uncertainty quantification.

Beyond GPs, the ELBO underpins variational inference in other classical probabilistic models, such as topic models and state-space models. In topic models like latent Dirichlet allocation (LDA), mean-field variational inference approximates the posterior over topic distributions and topic assignments by optimizing the ELBO, providing a tractable alternative to MCMC for discovering latent topics in large text corpora. For state-space models, black-box variational inference employs structured Gaussian variational approximations to the posterior, enabling scalable inference in nonlinear and non-conjugate settings.

Extensions of ELBO-based methods broaden applicability to diverse model classes. Black-box variational inference automates ELBO optimization for arbitrary probabilistic models using stochastic gradients, requiring only forward sampling from the model and variational distributions without model-specific derivations. In hierarchical models, structured variational families—such as those incorporating partial correlations or non-factorized approximations—improve the ELBO by better capturing dependencies, leading to tighter bounds and better posterior approximations than fully factorized mean-field assumptions.

These ELBO applications in non-neural models offer distinct advantages over point-estimate methods like maximum likelihood, particularly for uncertainty quantification. By explicitly modeling posterior variability through the variational distribution, they enable principled propagation of epistemic and aleatoric uncertainties, which is crucial for tasks such as Bayesian optimization with GPs or forecasting with state-space models.
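
For GP regression with Gaussian noise of variance \sigma^2, the collapsed form of this sparse variational bound (following Titsias, 2009, with the Nyström approximation \mathbf{Q}_{nn} = \mathbf{K}_{nm} \mathbf{K}_{mm}^{-1} \mathbf{K}_{mn} built from m inducing points) can be written as \mathcal{L} = \log \mathcal{N}\left( \mathbf{y} \mid \mathbf{0},\, \mathbf{Q}_{nn} + \sigma^2 \mathbf{I} \right) - \frac{1}{2\sigma^2} \mathrm{tr}\left( \mathbf{K}_{nn} - \mathbf{Q}_{nn} \right), where the trace term penalizes the information discarded by the inducing-point approximation and vanishes when the inducing points reproduce the full kernel matrix, recovering the exact GP marginal likelihood.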

References

  1. [1]
    An Introduction to Variational Methods for Graphical Models
    This paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (Bayesian networks and Markov ...
  2. [2]
    Variational Inference: A Review for Statisticians
    In this article, we review variational inference (VI), a method from machine learning that approximates probability densities through optimization.
  3. [3]
    [1312.6114] Auto-Encoding Variational Bayes - arXiv
    Dec 20, 2013 · Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. arXiv preprint.
  4. [4]
    [PDF] Variational Inference: A Review for Statisticians - arXiv
    May 9, 2018 · ELBO(q) = E[log p(z, x)] − E[log q(z)]. This function is called the evidence lower bound (ELBO).
  5. [5]
    [PDF] An Introduction to Variational Methods for Graphical Models
    Abstract. This paper presents a tutorial introduction to the use of variational methods for inference and learning in graphical models (Bayesian networks ...)
  6. [6]
    [PDF] Bayes Factors - Robert E. Kass; Adrian E. Raftery
    Oct 14, 2003 · The choice of these priors and the extent to which Bayes factors are sensitive to this choice is discussed in Section 5.
  7. [7]
    [PDF] The Selection of Prior Distributions by Formal Rules
    Nov 27, 2017 · Robert E. Kass and Larry Wasserman. Journal of the American Statistical Association, Vol. 91, No. 435 (Sep. 1996), pp. 1343-1370.
  8. [8]
    [1601.00670] Variational Inference: A Review for Statisticians - arXiv
    Jan 4, 2016 · In this paper, we review variational inference (VI), a method from machine learning that approximates probability densities through optimization.
  9. [9]
    [PDF] Stochastic Variational Inference
    It maximizes the evidence lower bound (ELBO), a lower bound on the logarithm of the marginal probability of the observations log p(x). The ELBO is equal ...
  10. [10]
    [PDF] Two problems with variational expectation maximisation for time ...
    First, the compactness property of variational inference leads to a failure to propagate posterior uncertainty through time. Second, the dependence of the ...
  11. [11]
    [PDF] Automatic Differentiation Variational Inference
    Maximizing the elbo minimizes the kl divergence (Jordan et al., 1999; Bishop, 2006). Optimizing the kl divergence implies a constraint that the support of the ...
  12. [12]
    Stochastic Backpropagation and Approximate Inference in Deep ...
    Jan 16, 2014 · Stochastic Backpropagation and Approximate Inference in Deep Generative Models. Authors:Danilo Jimenez Rezende, Shakir Mohamed, Daan Wierstra.
  13. [13]
    Variance reduction properties of the reparameterization trick - arXiv
    Sep 27, 2018 · We show that the marginal variances of the reparameterization gradient estimator are smaller than those of the score function gradient estimator.
  14. [14]
    [1509.00519] Importance Weighted Autoencoders - arXiv
    Sep 1, 2015 · We present the importance weighted autoencoder (IWAE), a generative model with the same architecture as the VAE, but which uses a strictly tighter log- ...
  15. [15]
    Bayesian Model Selection via Mean-Field Variational Approximation
    Dec 17, 2023 · Comparing to BIC, ELBO tends to incur smaller approximation error to the log-marginal likelihood (a.k.a. model evidence) due to a better ...
  16. [16]
    Elements of Information Theory | Wiley Online Books
    Elements of Information Theory. Thomas M. Cover and Joy A. Thomas. First published: 7 April 2005. Print ISBN: 9780471241959.
  17. [17]
    [1612.00410] Deep Variational Information Bottleneck - arXiv
    Dec 1, 2016 · We call this method "Deep Variational Information Bottleneck", or Deep VIB. We show that models trained with the VIB objective outperform those ...
  18. [18]
    [PDF] Inference Suboptimality in Variational Autoencoders
    Table 1. Summary of Gap Terms. The middle column refers to the general case where our variational objective is a lower bound on the marginal log-likelihood.
  19. [19]
    Importance Weighted Autoencoders (IWAE), arXiv:1509.00519
  20. [20]
    Rényi Divergence Variational Inference, arXiv:1602.02311
  21. [21]
    [PDF] Yes, but Did It Work?: Evaluating Variational Inference
    Jun 7, 2018 · In this paper we propose two diagnostic methods that assess, respectively, the quality of the entire variational posterior for a particular data ...
  22. [22]
    [PDF] Variational Learning of Inducing Variables in Sparse Gaussian ...
    Titsias, M. K. (2009). Variational Model Selection for Sparse Gaussian Process Regression. Technical report, School of Computer Science, University of ...
  23. [23]
    [PDF] Latent Dirichlet Allocation - Journal of Machine Learning Research
    Latent Dirichlet Allocation. David M. Blei. Journal of Machine Learning Research 3 (2003) 993-1022. Submitted 2/02; Published 1/03.
  24. [24]
    Black box variational inference for state space models - arXiv
    This paper introduces a 'black-box' approximate inference technique for latent variable models using a structured Gaussian variational approximate posterior.
  25. [25]
    [1401.0118] Black Box Variational Inference - arXiv
    Dec 31, 2013 · In this paper, we present a "black box" variational inference algorithm, one that can be quickly applied to many models with little additional derivation.