
Energy-based model

An energy-based model (EBM) is a class of probabilistic models in machine learning that define an unnormalized probability distribution over data through a parametric energy function U_\theta(x), where the density is given by \rho_\theta(x) = \frac{1}{Z_\theta} e^{-U_\theta(x)}, with Z_\theta as the partition function, such that lower energy values correspond to higher probability configurations, drawing from the Boltzmann-Gibbs distribution in statistical physics. EBMs capture dependencies among variables by associating a scalar energy to each configuration of the variables, enabling flexible modeling without explicit normalization during model specification, which distinguishes them from traditional probabilistic graphical models like Bayesian networks or Markov random fields. Their roots trace back to statistical physics concepts from the late 19th century, evolving through early applications in the 1980s—such as Hopfield networks and Boltzmann machines—and gaining prominence in the 2000s with neural network integrations for tasks like discriminative training in vision and natural language processing. The term "energy-based model" was formalized in works by Hinton et al. in 2003, building on foundational contributions from LeCun and colleagues.

Key characteristics of EBMs include their ability to unify generative and discriminative learning paradigms, support for structured outputs via energy minimization, and compatibility with neural network architectures for approximating complex energy functions, often using techniques like graph transformer networks or conditional random fields. Learning typically involves gradient-based optimization of loss functions such as negative log-likelihood or contrastive divergence, though challenges arise from the intractability of the partition function and the need for Markov chain Monte Carlo (MCMC) sampling to estimate gradients and generate samples. Recent advancements address these issues through sampling-free methods like score matching and noise contrastive estimation, as well as non-equilibrium physics-inspired approaches using the Jarzynski equality to reduce bias in training.

EBMs have found applications in diverse domains, including image and sequence modeling in computer vision, protein structure prediction in biochemistry, molecular dynamics simulation, and text generation in natural language processing. Their explicit energy function provides interpretability advantages over implicit models like GANs or VAEs, and they relate closely to diffusion models and normalizing flows through shared objectives in log-likelihood maximization and score-based learning. Despite historical underuse due to computational hurdles, renewed interest since the late 2010s—driven by scalable training techniques—positions EBMs as a versatile framework for generative modeling in high-dimensional data. In 2025, further advancements include training EBMs directly as policies and unifying them with flow matching for improved generative modeling.

Overview

Definition and Motivation

Energy-based models (EBMs) are a class of probabilistic models that assign a scalar energy value to each possible configuration of variables in a system, such that configurations with lower energy are deemed more probable. This framework allows EBMs to represent complex joint probability distributions over variables by defining compatibility through the energy function alone, without requiring explicit factorization or conditional independence assumptions. The motivation for EBMs draws directly from analogies in statistical physics, particularly the Boltzmann distribution, which relates the probability of a state to the negative exponent of its energy scaled by temperature. This physical inspiration enables EBMs to model intricate dependencies in data flexibly and without the need for normalization constants during the model's initial definition, making them suitable for capturing complex or high-dimensional distributions that traditional forms might struggle with.

In contrast to conventional normalized probabilistic models, such as those based on exponential family distributions, EBMs avoid computing or approximating the partition function upfront, which often imposes computational burdens or restrictive structural assumptions. Instead, the probability distribution is implicitly defined up to normalization, allowing greater expressiveness at the cost of challenges in inference and learning. EBMs were inspired by earlier energy-based architectures like Boltzmann machines, which were introduced for binary variables to model associative memory and constraint satisfaction, but EBMs generalize this concept to continuous or mixed variable types for broader applications in machine learning.
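As a toy numerical illustration of the definition above, the following sketch (the quadratic energy and NumPy usage are arbitrary choices for this example, not drawn from the literature discussed here) shows how an energy function induces an unnormalized probability in which lower energy means a higher score, with normalization deferred.

```python
import numpy as np

def energy(x):
    """Toy quadratic energy; lowest near x = 0."""
    return 0.5 * np.sum(np.asarray(x) ** 2, axis=-1)

def unnormalized_prob(x):
    """p_tilde(x) = exp(-E(x)); the partition function is left uncomputed."""
    return np.exp(-energy(x))

x_low = np.array([0.1, 0.0])    # near the energy minimum
x_high = np.array([3.0, 3.0])   # far from the minimum
# Lower energy yields a larger unnormalized probability:
print(unnormalized_prob(x_low) > unnormalized_prob(x_high))  # True
```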

Core Principles

Energy-based models (EBMs) fundamentally rely on the principle of unnormalized modeling, defining the probability distribution over data points x as proportional to the exponential of the negative energy, p(x) \propto \exp(-E(x)). This formulation highlights the partition function's role in ensuring proper normalization but allows model specification without its immediate computation, sidestepping the intractability often encountered in traditional probabilistic models. A key operational principle is inference through energy minimization, where the model identifies configurations of variables that yield low energy values, thereby approximating high-likelihood states for tasks such as data generation or classification. By associating lower energies with observed or desired data configurations, EBMs enable efficient exploration of the data manifold without explicit probability normalization during inference. EBMs integrate seamlessly with neural networks, employing deep architectures to parameterize the energy function E(x; \theta), which facilitates handling high-dimensional inputs like images or sequences through learned representations. This compatibility leverages the expressive power of deep nets to capture intricate patterns while maintaining the model's probabilistic foundation. The generality of EBMs stems from their ability to model both continuous and discrete data domains without presupposing factorized structures, offering a unified approach to capturing dependencies across diverse modalities and tasks. This flexibility distinguishes EBMs from more restrictive generative paradigms, allowing adaptation to complex, non-independent distributions.
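A minimal sketch of these principles, assuming PyTorch and an illustrative two-layer network (the architecture, step size, and iteration count are arbitrary choices, not prescribed by any of the works cited here): the energy E(x; \theta) is parameterized by a neural network, and inference proceeds by gradient descent on the energy.

```python
import torch
import torch.nn as nn

# A small neural network that maps an input to a scalar energy.
energy_net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))

def energy(x):
    return energy_net(x).squeeze(-1)  # one scalar energy per input

# Inference as energy minimization: start from noise and descend the energy surface.
x = torch.randn(16, 2, requires_grad=True)
opt = torch.optim.SGD([x], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    energy(x).sum().backward()
    opt.step()
# x now lies in low-energy regions, i.e. configurations the model deems probable.
```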

Historical Development

Origins in Statistical Physics

The foundational concepts of energy-based models in statistical physics emerged from efforts to describe thermodynamic systems through probabilistic distributions, with roots in 19th-century thermodynamics and formalization in statistical mechanics. The Boltzmann distribution, central to this framework, was introduced by Ludwig Boltzmann in 1868 to model the equilibrium distribution of energies in systems like gases under thermal conditions, linking microscopic states to macroscopic probabilities via an exponential form dependent on energy and temperature. This distribution, building on earlier thermodynamic principles from figures like Rudolf Clausius and James Clerk Maxwell, provided a probabilistic interpretation of energy states, where lower-energy configurations are more probable, laying the groundwork for energy as a scalar function governing system behavior.

A pivotal early application of energy-based modeling came with the Ising model, proposed by Wilhelm Lenz in 1920 and solved by Ernst Ising in his 1925 dissertation. The model represents ferromagnetism as a lattice of spins interacting pairwise, with the total energy defined by the Hamiltonian H = -J \sum_{\langle i,j \rangle} s_i s_j - h \sum_i s_i, where J is the coupling constant, h the external field, and s_i = \pm 1 the spin states; Ising's exact solution for the one-dimensional case demonstrated no phase transition at finite temperatures, highlighting the role of energy minimization in collective phenomena. This pairwise energy formulation became a cornerstone for simulating interacting particle systems in magnetism and beyond, influencing subsequent statistical models.

The Ising framework inspired extensions into computational and neural-like systems, notably the Hopfield network introduced by John Hopfield in 1982. Hopfield modeled associative memory using a symmetric network of interconnected neurons, where states evolve to minimize a Lyapunov energy function E = -\frac{1}{2} \sum_{i,j} w_{ij} s_i s_j, with w_{ij} as synaptic weights and s_i binary states; this allowed pattern recall through gradient descent-like dynamics, connecting energy-based optimization to emergent computational abilities in physical systems. By analogy to spin glasses and the Ising model, Hopfield's work demonstrated how energy landscapes could store and retrieve information via local minima.

During the 1970s and 1980s, Markov random fields (MRFs) further advanced energy-based representations, generalizing pairwise interactions to graphical models for spatial and lattice data in statistical physics. The Hammersley-Clifford theorem, articulated in 1971, established that MRFs are equivalent to Gibbs distributions, where the joint probability factors as P(x) = \frac{1}{Z} \exp(-U(x)), with U(x) as the energy function summing clique potentials; this enabled modeling of complex dependencies in fields like image processing and phase transitions through energy minimization. Key developments, such as Julian Besag's 1974 work on spatial interactions, applied these energy functions to lattice systems, facilitating inference in irregular data structures.
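A small worked example may help connect these formulas to code. The sketch below (NumPy, a short 1D spin chain, and arbitrary parameter values are assumptions of this illustration) evaluates the Ising Hamiltonian given above and the corresponding unnormalized Boltzmann weight exp(-\beta H).

```python
import numpy as np

def ising_energy(spins, J=1.0, h=0.0):
    """1D Ising Hamiltonian with nearest-neighbour coupling J and field h."""
    return -J * np.sum(spins[:-1] * spins[1:]) - h * np.sum(spins)

def boltzmann_weight(spins, beta=1.0, J=1.0, h=0.0):
    """Unnormalized Gibbs weight exp(-beta * H); beta is the inverse temperature."""
    return np.exp(-beta * ising_energy(spins, J, h))

aligned = np.array([1, 1, 1, 1])     # ferromagnetic ground state for J > 0, h = 0
mixed = np.array([1, -1, 1, -1])     # alternating, higher-energy configuration
print(ising_energy(aligned), ising_energy(mixed))           # -3.0 vs 3.0
print(boltzmann_weight(aligned) > boltzmann_weight(mixed))  # aligned state is more probable
```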

Evolution in Machine Learning

The introduction of Boltzmann machines in 1985 by David Ackley, Geoffrey Hinton, and Terrence Sejnowski marked the first application of energy-based models (EBMs) in neural networks for unsupervised learning, drawing from statistical physics to model probability distributions over binary states via an energy function. These models enabled parallel constraint satisfaction and learning of underlying data constraints through a stochastic relaxation process, laying foundational groundwork for generative modeling in machine learning.

In 1986, Paul Smolensky proposed restricted Boltzmann machines (RBMs), a variant that restricts connections between layers to make inference tractable while preserving the energy-based formulation. Originally termed "Harmonium," RBMs facilitated efficient computation of marginal probabilities and were later trained effectively using approximations like contrastive divergence introduced by Hinton in 2002, which addressed the computational challenges of maximum likelihood estimation. This advancement made RBMs practical for feature learning and dimensionality reduction, influencing early deep learning architectures.

The deep learning revival in the late 2000s saw EBMs extended to multilayer structures with deep Boltzmann machines (DBMs) in 2009 by Ruslan Salakhutdinov and Geoffrey Hinton, which stacked RBMs to capture hierarchical representations and improved generative capabilities through variational inference and layer-wise pretraining. Concurrently, Yann LeCun and Fu-Jie Huang advanced EBMs for discriminative tasks in 2005 by developing loss functions that minimized energies for correct configurations while increasing them for incorrect ones, enabling applications in structured prediction and classification beyond pure generation. The term "energy-based model" was formalized in 2003 by Hinton et al. in their work on sparse overcomplete representations.

Recent developments from 2023 to 2025 have integrated EBMs with modern generative paradigms, addressing scalability in large-scale AI. For instance, the 2024 IRED framework combines EBMs with diffusion processes for iterative reasoning, modeling constraints as energy landscapes to enhance decision-making in complex tasks. Yann LeCun's 2023 work on latent variable EBMs (published 2024) proposes them as core components for autonomous intelligence, using latent spaces to predict world states and handle uncertainty in predictive architectures. Additionally, 2025 research unifies flow matching with EBMs via energy matching for efficient sampling and training in high-dimensional generative modeling. These advances, including methods like Wasserstein gradient flow corrections for stable optimization, have expanded EBMs' role in large-scale generative AI, such as improved density estimation and inverse problems. Emerging applications include enhanced reasoning in vision-language models and molecular design, stemming from these evolutions.

Mathematical Formulation

Energy Function Design

The energy function serves as the foundational component of energy-based models (EBMs), assigning a scalar value E(\mathbf{x}; \theta) to each input configuration \mathbf{x} parameterized by \theta, where lower values indicate more compatible or probable states. Commonly, it is formulated as E(\mathbf{x}; \theta) = -f_\theta(\mathbf{x}), with f_\theta(\mathbf{x}) typically implemented as a neural network that computes a scalar output representing the "goodness" or compatibility of \mathbf{x}. This negative formulation aligns the energy minimization with maximizing the network's output, facilitating intuitive design and optimization.

For discrete data, such as binary or categorical variables, the energy function is often structured as a sum of unary and pairwise potentials to model local and interactive dependencies: E(\mathbf{x}) = \sum_i \psi_i(x_i) + \sum_{i < j} \psi_{ij}(x_i, x_j), where \psi_i(x_i) captures individual node biases (unary terms) and \psi_{ij}(x_i, x_j) encodes interactions between pairs (pairwise terms). A classic parameterization appears in Boltzmann machines, employing a bilinear form E(\mathbf{v}, \mathbf{h}) = -\mathbf{b}^T \mathbf{v} - \mathbf{b}_h^T \mathbf{h} - \mathbf{v}^T \mathbf{W} \mathbf{h}, where \mathbf{v} and \mathbf{h} are visible and hidden binary states, \mathbf{b} and \mathbf{b}_h are bias vectors, and \mathbf{W} is the weight matrix governing pairwise connections. This design efficiently represents graphical model structures while remaining computationally tractable for moderate-sized networks.

In handling continuous data, such as images or time series, the energy function leverages architectures that capture spatial or temporal dependencies through hierarchical feature extraction. Convolutional neural networks (CNNs) are widely used for images, processing pixel grids to produce spatially aware energy landscapes, often incorporating residual blocks from ResNet designs to enable deep networks without degradation. For sequential or multimodal data, transformer architectures parameterize the energy function by applying self-attention mechanisms across tokens, effectively modeling long-range temporal dependencies in high-dimensional spaces. These choices allow EBMs to scale to complex inputs like CIFAR-10 or ImageNet datasets, generating coherent samples via gradient-based exploration.

Key design considerations emphasize producing a single scalar output for direct comparability, ensuring differentiability to support gradient descent in parameter updates, and promoting scalability to avoid exponential complexity in high dimensions. Architectural decisions, such as residual connections or normalization layers, further address challenges like vanishing gradients and entrapment in local minima, enhancing the robustness of the energy landscape for practical applications.
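The bilinear Boltzmann-machine energy above translates directly into code. The following sketch (NumPy, with randomly initialized, untrained parameters; the sizes are arbitrary) evaluates E(v, h) = -b^T v - b_h^T h - v^T W h for one binary configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # pairwise weights
b = np.zeros(n_visible)                                # visible biases
b_h = np.zeros(n_hidden)                               # hidden biases

def rbm_energy(v, h):
    """Bilinear energy E(v, h) = -b^T v - b_h^T h - v^T W h."""
    return -(b @ v) - (b_h @ h) - (v @ W @ h)

v = rng.integers(0, 2, n_visible)  # binary visible state
h = rng.integers(0, 2, n_hidden)   # binary hidden state
print(rbm_energy(v, h))
```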

Probability Distribution

In energy-based models (EBMs), the probability distribution over data points x is defined using the energy function E(x; \theta) through a Gibbs or Boltzmann distribution, where \theta denotes the model parameters. Specifically, the unnormalized probability is given by \exp(-E(x; \theta)), and the normalized probability density (or mass) function is p(x; \theta) = \frac{\exp(-E(x; \theta))}{Z(\theta)}, with the partition function Z(\theta) = \int \exp(-E(x; \theta)) \, dx serving as the normalizing constant that integrates to 1 over the data space. This formulation draws from statistical physics, assigning lower energies to more probable configurations and ensuring the distribution is properly normalized.

The partition function Z(\theta) is generally intractable to compute exactly, as it requires integration over the entire (often high-dimensional and continuous) space of possible x, resulting in an exponential computational cost that scales poorly with dimensionality. This intractability arises because direct evaluation of Z(\theta) involves summing or integrating unnormalized probabilities across all configurations, which is infeasible for complex energy functions parameterized by neural networks, necessitating approximations in practice without altering the definitional form.

For EBMs with visible variables x_v and hidden variables x_h, the conditional distribution over visible units given hidden ones is p(x_v | x_h; \theta) \propto \exp(-E(x_v, x_h; \theta)), where the proportionality holds because the partition function for this conditional marginalizes only over x_v for fixed x_h, often making it more tractable than the full Z(\theta). The log-likelihood of an observed data point x under the model is then \log p(x; \theta) = -E(x; \theta) - \log Z(\theta), which forms the basis for maximum likelihood estimation by maximizing the expected log-probability over the data distribution.

EBMs extend naturally to conditional settings, such as supervised learning, where the distribution p(y | x; \theta) over labels y given inputs x is modeled as p(y | x; \theta) \propto \exp(-E(y, x; \theta)), with the corresponding partition function Z(x; \theta) = \int \exp(-E(y, x; \theta)) \, dy now conditioned on x. This allows EBMs to capture dependencies in structured prediction tasks while inheriting the same normalization challenges.
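For intuition, the partition function becomes tractable when the configuration space is small enough to enumerate. The toy example below (a hand-picked pairwise energy over three binary variables, purely illustrative) computes Z by brute force and then the normalized probabilities and log-likelihood defined above.

```python
import itertools
import numpy as np

def energy(x, theta):
    """Simple pairwise energy on 3 binary variables; theta couples neighbours."""
    x = np.asarray(x, dtype=float)
    return -theta * (x[0] * x[1] + x[1] * x[2])

theta = 0.8
configs = list(itertools.product([0, 1], repeat=3))
unnorm = np.array([np.exp(-energy(x, theta)) for x in configs])
Z = unnorm.sum()                       # partition function by exhaustive enumeration
probs = unnorm / Z                     # properly normalized distribution over all 8 states
log_lik = -energy((1, 1, 1), theta) - np.log(Z)   # log p(x) = -E(x) - log Z
print(probs.sum(), log_lik)            # 1.0 and the log-likelihood of (1, 1, 1)
```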

Learning and Inference

Training Methods

The primary objective in training energy-based models (EBMs) is maximum likelihood estimation (MLE), which seeks to maximize the log-likelihood of the observed data under the model distribution, given by \sum_i \log p(x_i; \theta) = \sum_i [-E(x_i; \theta) - \log Z(\theta)], where E(\cdot; \theta) is the parameterized energy function, Z(\theta) is the intractable partition function, and \theta denotes the model parameters. This objective encourages the model to assign low energy to data samples while balancing the normalization imposed by Z(\theta).

The gradient of the log-likelihood with respect to \theta is \frac{\partial}{\partial \theta} \log p(x; \theta) = -\frac{\partial}{\partial \theta} E(x; \theta) + \frac{1}{Z(\theta)} \int \left[ \frac{\partial}{\partial \theta} E(y; \theta) \right] \exp(-E(y; \theta)) \, dy, which decomposes into a "positive phase" computed directly on data and a "negative phase" requiring integration over the model distribution. Exact computation of this gradient is infeasible due to the intractability of Z(\theta) and the integral, necessitating approximations via sampling from the model.

One seminal approximation method is contrastive divergence (CD-k), introduced for restricted Boltzmann machines (RBMs), which uses short Markov chain Monte Carlo (MCMC) chains of k steps—often k=1—to estimate the negative phase gradient, providing a computationally efficient surrogate for MLE. CD-k initializes chains from data points and approximates the model's expectations by running Gibbs sampling for a limited number of iterations, enabling scalable training despite introducing some bias.

To avoid explicit computation of Z(\theta) altogether, alternative objectives include noise-contrastive estimation (NCE), which frames parameter learning as binary classification between real data and noise samples, treating the model as a classifier and asymptotically recovering MLE as the number of noise samples grows. Similarly, score matching minimizes the expected squared difference between the model score function \nabla_x \log p(x; \theta) and the data score, reformulating MLE as a tractable Fisher divergence that requires only second-order derivatives and no sampling from the model.

Recent advancements address training stability and bias in EBMs through methods like the Wasserstein gradient flow (WGF), which simulates the continuous-time dynamics of probability measures under the Wasserstein metric to optimize the Kullback-Leibler divergence, yielding more stable updates than discrete MCMC approximations. Additionally, persistent contrastive divergence improves estimates of the negative phase in the likelihood gradient by maintaining persistent MCMC chains across iterations, reducing variance and improving convergence in high-dimensional settings. More recent work (as of 2025) has further developed WGF for direct optimization without MCMC.
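The two-phase gradient above is commonly approximated in code by lowering the energy of data while raising the energy of model samples. The sketch below (PyTorch; a short-run Langevin sampler initialized from noise, with arbitrary network, step sizes, and stand-in data) illustrates one such surrogate update; it is a schematic of the general recipe rather than a reproduction of any specific published algorithm.

```python
import torch
import torch.nn as nn

energy_net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(energy_net.parameters(), lr=1e-3)

def energy(x):
    return energy_net(x).squeeze(-1)

def langevin_samples(n, steps=30, eps=0.01):
    """Short-run Langevin chain used to approximate the negative phase."""
    x = torch.randn(n, 2)
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        x = x - 0.5 * eps * grad + eps ** 0.5 * torch.randn_like(x)
    return x.detach()

data = torch.randn(128, 2) * 0.5 + 1.0   # stand-in for a real training batch
neg = langevin_samples(128)
# Minimizing E(data) - E(neg) follows (an approximation of) the MLE gradient above.
loss = energy(data).mean() - energy(neg).mean()
opt.zero_grad()
loss.backward()
opt.step()
```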

Sampling Techniques

Sampling from energy-based models (EBMs) involves generating samples from the associated Boltzmann distribution, which is typically intractable and requires Markov chain Monte Carlo (MCMC) methods to approximate. These techniques aim to explore the low-energy regions of the energy landscape defined by the model, producing samples that reflect the target probability distribution proportional to the exponential of the negative energy function.

For continuous spaces, Langevin dynamics serves as a foundational MCMC method, iteratively updating samples by following the negative gradient of the energy function augmented with Gaussian noise. The update rule is given by \mathbf{x}_{t+1} = \mathbf{x}_t - \frac{\epsilon}{2} \nabla_{\mathbf{x}} E(\mathbf{x}_t; \theta) + \sqrt{\epsilon} \, \mathbf{n}_t, where \epsilon > 0 is a small step size, \theta parameterizes the energy function E, and \mathbf{n}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) is standard Gaussian noise. This process simulates overdamped Langevin diffusion, which converges to the target distribution under mild conditions, though practical implementations often use short-run chains to balance efficiency and quality.

In discrete or mixed spaces, Gibbs sampling is commonly employed, particularly for models with latent variables, by alternately sampling each variable conditioned on the others to minimize the joint energy. This block-wise update rule decomposes the high-dimensional sampling into tractable conditional distributions, making it suitable for structured data like images or text represented as binary or categorical variables. For instance, in restricted Boltzmann machines—a canonical architecture—Gibbs sampling alternates between visible and hidden units, though it can suffer from slow mixing in deep models.

To address mixing issues in MCMC for deep EBMs, persistent contrastive divergence (PCD) maintains a set of persistent Markov chains across training iterations, allowing chains to continue from previous states rather than reinitializing from data, which improves exploration of the energy landscape and reduces bias in gradient estimates used for sampling. Complementarily, parallel tempering enhances mixing by running multiple chains at different temperatures—scaling the energy function by inverse temperatures \beta_k—and periodically swapping states between chains to facilitate escape from local minima; this is particularly effective for multimodal distributions in high-dimensional spaces.

Recent advances include iterative reasoning through energy diffusion (IRED), introduced in 2024, which frames sampling as an iterative minimization process on reasoning trajectories, adapting steps based on task difficulty to generate high-quality samples more efficiently than traditional MCMC. Additionally, flow-based methods accelerate generation by parameterizing invertible flows driven by the energy gradient, avoiding iterative MCMC altogether; for example, variational potential flow Bayes constructs density homotopies matched to the data, enabling direct sampling via inversion.

Approximating the intractable partition function Z(\theta) = \int e^{-E(\mathbf{x}; \theta)} d\mathbf{x}, essential for normalizing probabilities during sampling, often relies on annealed importance sampling (AIS). AIS bridges an initial tractable distribution to the target via a sequence of intermediate distributions with gradually decreasing temperatures, estimating Z through weighted samples from forward and reverse Markov chains; this method provides unbiased estimates with variance controlled by the annealing schedule. Alternatively, variational bounds offer tractable lower or upper approximations to \log Z, such as those derived from Jensen's inequality or higher-order extensions, by optimizing a surrogate distribution to tighten the bound during inference.
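As a concrete instance of block Gibbs sampling, the sketch below (NumPy, random untrained parameters chosen for illustration) alternates between the standard sigmoid conditionals of a restricted Boltzmann machine derived from the bilinear energy given earlier; in practice the chain would use trained parameters and an appropriate burn-in.

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = rng.normal(scale=0.1, size=(n_v, n_h))   # pairwise weights
b, b_h = np.zeros(n_v), np.zeros(n_h)        # visible and hidden biases
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gibbs_step(v):
    """One alternation: sample hidden given visible, then visible given hidden."""
    h = (rng.random(n_h) < sigmoid(b_h + v @ W)).astype(float)
    v = (rng.random(n_v) < sigmoid(b + W @ h)).astype(float)
    return v

v = rng.integers(0, 2, n_v).astype(float)
for _ in range(1000):   # run the chain; early iterations serve as burn-in
    v = gibbs_step(v)
print(v)                # an approximate sample from the model's marginal over v
```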

Properties and Challenges

Key Advantages

Energy-based models (EBMs) offer significant flexibility in modeling complex distributions, as they do not require explicit factorization of the joint probability, unlike many traditional probabilistic approaches that assume conditional independence or other structural constraints. This allows EBMs to directly parameterize an unnormalized density over the entire input space, enabling the capture of arbitrary distributions without predefined architectural decompositions. For instance, in multimodal scenarios, EBMs can integrate diverse data types, such as text and images, by jointly optimizing a shared energy landscape that captures inter-modal dependencies holistically.

A key strength of EBMs lies in their interpretability, stemming from the intuitive energy landscape that governs the model's behavior. The energy function defines a scalar value for each input configuration, where lower energies correspond to more probable states, allowing researchers to visualize and analyze decision-making through the geometry of minima and basins in this landscape. This provides direct insights into model preferences and modes, such as identifying regions of high energy that repel unlikely samples, which is particularly useful for debugging and understanding high-dimensional representations in tasks like image generation.

EBMs provide a unified framework that seamlessly supports generative, discriminative, and hybrid tasks without necessitating changes to the underlying architecture. By defining joint energies over inputs and labels, EBMs can perform density estimation for generation while simultaneously enabling classification through energy minimization over output spaces, bridging the gap between probabilistic modeling and discriminative learning. This is evident in hybrid learning schemes where discriminative objectives refine generative capabilities, allowing a model to exploit both unlabeled and labeled data efficiently.

Advancements in EBMs have enhanced their scalability, particularly through integration with deep neural networks that leverage modern hardware like GPUs for parameterizing complex energy functions. Unlike earlier shallow formulations, contemporary deep EBMs employ multilayer architectures to capture intricate patterns in high-dimensional data, with parallelizable computations enabling efficient training on large datasets. This compatibility has made EBMs viable for real-world applications involving millions of parameters, where GPU acceleration facilitates gradient-based optimization of the energy surface.

In recent developments, EBMs demonstrate improved handling of out-of-distribution (OOD) data through explicit energy assignments that give high values to anomalous inputs, enhancing detection reliability without auxiliary components. This approach, refined in 2024 frameworks, exploits the energy score's ability to quantify deviation from the learned in-distribution manifold, providing a robust, parameter-efficient method for safety-critical systems.
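The energy-score view of OOD detection mentioned above can be sketched in a few lines (PyTorch assumed; the network is untrained and the 95th-percentile threshold is an arbitrary illustrative choice): inputs whose energy exceeds a threshold calibrated on in-distribution data are flagged as out-of-distribution.

```python
import torch
import torch.nn as nn

energy_net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))

@torch.no_grad()
def ood_flags(x, threshold):
    """High energy means the model considers x unlikely, so flag it as OOD."""
    return energy_net(x).squeeze(-1) > threshold

with torch.no_grad():
    in_dist = torch.randn(512, 2)                              # stand-in for validation data
    threshold = energy_net(in_dist).squeeze(-1).quantile(0.95)  # calibrate on in-distribution energies

print(ood_flags(torch.randn(4, 2) * 5.0, threshold))  # far-away inputs tend to be flagged
```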

Limitations

Energy-based models (EBMs) face significant challenges due to the intractability of the partition function, which normalizes the distribution and is computationally prohibitive to evaluate exactly in high-dimensional spaces. This intractability necessitates approximations such as contrastive divergence, which introduce biases in gradient estimates during training, leading to suboptimal convergence rates. In high dimensions, these biases exacerbate slow convergence, as the approximations fail to capture the full structure of the energy landscape, resulting in inefficient optimization.

Markov chain Monte Carlo (MCMC) methods, commonly used for sampling from EBMs, suffer from poor mixing times, hindering effective exploration of multimodal energy landscapes. This inadequate mixing often leads to mode collapse, where the model concentrates on limited regions of the data distribution, or protracted sampling times that undermine practical deployment. Such issues are particularly pronounced in deep EBMs, where the high dimensionality amplifies the challenges of chain equilibration.

Training EBMs is prone to instability, characterized by high variance in gradient estimates that can cause erratic updates and divergence in optimization. For architectures without explicit regularization, this variance intensifies, making stable learning difficult and often requiring careful hyperparameter tuning to mitigate. These instabilities stem from the reliance on noisy approximations of the partition function's gradient, which propagate errors throughout the training process.

Compared to variational autoencoders (VAEs), EBMs exhibit scalability gaps when handling very large datasets, as the computational overhead of MCMC sampling and partition function approximations renders them less efficient for massive-scale training. These computations involve elevated energy costs, though they remain secondary to the core sampling bottlenecks in most applications.

An open challenge for EBMs is the absence of built-in mechanisms for uncertainty quantification, unlike Bayesian methods that inherently provide epistemic and aleatoric estimates through posterior distributions. This limitation restricts EBM applications in safety-critical domains requiring reliable confidence measures, often necessitating post-hoc extensions to incorporate uncertainty estimates.

Applications

Generative Tasks

Energy-based models (EBMs) excel in generative tasks by learning an underlying probability distribution over data, enabling the synthesis of new samples through sampling from p(\mathbf{x}). In image generation, EBMs have been employed for unconditional density estimation on benchmark datasets such as CIFAR-10, utilizing convolutional architectures to define the energy function. A seminal approach is the energy-based generative adversarial network (EBGAN), proposed in 2016, which reframes the discriminator as an energy network that attributes low energies to regions of the data manifold corresponding to real images, thereby guiding the generator to produce high-fidelity samples. This method demonstrated effective learning of image distributions, indicating competitive sample quality relative to contemporary GANs.

Unconditional generation in EBMs involves directly sampling from the modeled p(\mathbf{x}), often via MCMC techniques like Langevin dynamics, to create novel data instances without external conditioning. This capability supports creative applications, such as art synthesis, where EBMs can produce diverse visual compositions by exploring low-energy regions of the learned manifold. For instance, EBMs trained on image datasets have generated coherent artistic styles and structures, leveraging the model's ability to capture global statistics for expressive outputs.

In text generation, sequence-level EBMs facilitate language modeling by defining energy functions over entire sequences, integrating seamlessly with transformer architectures in the 2020s. One notable example is the Electric model, which pre-trains transformers as energy-based cloze models to learn representations that support autoregressive and infilling generation tasks. This approach enhances coherence in generated text by minimizing energy for plausible sequences.

A recent development in multimodal generation is the 2024 Energy-Based CLIP (CLIP-JEM), which extends joint energy-based models to text-to-image synthesis by combining CLIP's contrastive embeddings with an EBM scoring mechanism. The model employs a joint energy function based on cosine similarity in CLIP's latent space, assigning low energies to aligned image-text pairs to guide iterative refinement during generation. CLIP-JEM achieves realistic outputs on datasets like MS-COCO, with strong performance in compositional reasoning, such as accurately rendering object relations described in prompts.

Evaluation of EBM-generated images often relies on the Fréchet inception distance (FID) metric to quantify sample realism and diversity. On CIFAR-10, EBM variants have yielded FID scores competitive with GANs, such as 27.5 in implicit EBM frameworks, underscoring their ability to match adversarial methods in visual fidelity while avoiding mode collapse. These results highlight EBMs' robustness in generative regimes, particularly where stable training and multimodal extensions are prioritized.

EBMs have also been applied to protein structure modeling, representing atomic-resolution conformations using energy functions trained on crystallized protein data. As of 2024, such models enable precise estimation of mutational effects and folding dynamics in biochemistry.

Hybrid and Discriminative Uses

Discriminative energy-based models (EBMs) model the conditional distribution p(y \mid x) by defining an energy function E(y, x; \theta) that assigns low values to correct label-input pairs and higher values otherwise, enabling inference via p(y \mid x) \propto \exp(-E(y, x; \theta)). This framework unifies various classifiers under an energy perspective, allowing standard discriminative models like logistic regression to be reinterpreted as joint EBMs for p(x, y). In vision tasks, such as image classification, discriminative EBMs have been trained using contrastive divergence or noise-contrastive estimation to minimize energy for correct labels while maximizing it for incorrect ones, outperforming traditional methods in handling complex decision boundaries.

Hybrid EBMs extend this by modeling the joint distribution p(x, y) through a shared energy function, facilitating semi-supervised learning in which unlabeled data refines embeddings and improves discriminative performance. For instance, LaplaceNet integrates graph-based energy terms with neural networks to propagate labels in semi-supervised classification, achieving reduced model complexity and higher accuracy on standard benchmarks with limited labels. These hybrids leverage the generative capabilities of EBMs to regularize discriminative training, enhancing robustness in low-data regimes.

In anomaly detection, EBMs identify outliers by assigning high energy to data points deviating from the learned data manifold, with low-energy regions defining normalcy. Deep structured EBMs, for example, use neural networks to parameterize the energy function over data manifolds, enabling effective detection in high-dimensional spaces. This approach flags anomalies in cybersecurity, such as intrusion patterns, and fraud detection, where energy-based restricted Boltzmann machines (RBMs) identify unseen fraudulent transactions in financial data.

Recent advancements in 2023 have incorporated EBMs into reinforcement learning (RL) by minimizing policy energy to balance exploration and exploitation in partially observable environments. Energy-based predictive representations, for instance, learn state abstractions via EBMs to improve performance on POMDPs, outperforming baseline methods in tasks requiring long-term memory.

A notable case study from the mid-2000s involves energy-based models for face detection and recognition, as in synergistic approaches combining detection and pose estimation. These EBMs, trained discriminatively on noisy image data, achieved superior accuracy compared to support vector machines (SVMs) by jointly optimizing energy for faces across pose variations, demonstrating robustness in real-world imaging scenarios.
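When the label set is finite, the conditional EBM above is directly computable, since the normalization over y is a finite sum. The sketch below (PyTorch; an arbitrary untrained network standing in for a classifier) reads the logits as negative energies, so that p(y \mid x) reduces to an ordinary softmax.

```python
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 10))  # 10 labels

def conditional_probs(x):
    logits = classifier(x)                 # interpreted as negative energies -E(y, x)
    return torch.softmax(logits, dim=-1)   # exp(-E) normalized by a finite sum over y

x = torch.randn(4, 2)
print(conditional_probs(x).sum(dim=-1))    # each row sums to 1
```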

Extensions and Variants

Latent Variable Models

Latent variable energy-based models (LV-EBMs) incorporate hidden variables z to enhance the expressiveness of standard EBMs, allowing for more complex data representations. The joint distribution over observed data x and latents z is defined as p(x, z; \theta) \propto \exp(-E(x, z; \theta)), where E(x, z; \theta) is the energy function parameterized by \theta. The marginal distribution over x is then obtained by integrating over the latents, p(x; \theta) = \int p(x, z; \theta) \, dz, enabling the model to capture underlying structures not directly observable in x.

Training LV-EBMs typically involves approximating the intractable posterior p(z \mid x; \theta) or marginal p(x; \theta) using methods such as variational inference or Markov chain Monte Carlo (MCMC) sampling over the latent space. For instance, deep LV-EBMs employ bi-level score matching, which optimizes a variational posterior q(z \mid x; \phi) to approximate the true posterior while minimizing a score-matching objective on the joint distribution. Implementation often draws from variational autoencoders (VAEs), utilizing encoder networks inspired by VAEs to amortize inference and generate latent samples, paired with energy-based decoders that define the generative process through the energy function.

A key advantage of LV-EBMs is their ability to model hierarchical structures in data, which facilitates better disentanglement of factors such as object pose and identity in images. For example, on image datasets, deep LV-EBMs trained with bi-level methods achieve improved posterior inference, with the Fisher divergence between variational and model posteriors decreasing to around 10^1 to 10^2.

In recent developments, Yann LeCun's work on LV-EBMs as part of a hierarchical joint embedding predictive architecture (H-JEPA) emphasizes their role in building world models for autonomous systems, such as robots or self-driving vehicles, by integrating predictive learning to reason and plan in dynamic environments. This framework positions LV-EBMs as foundational for scalable intelligence in such systems.
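The marginalization over latents can be approximated by Monte Carlo when no closed form exists. The sketch below (PyTorch; a toy joint energy and a Gaussian proposal, both arbitrary choices for illustration) estimates the unnormalized marginal \tilde{p}(x) = \int \exp(-E(x, z)) \, dz by importance sampling.

```python
import torch

def joint_energy(x, z):
    """Toy joint energy coupling a 2-D observation x and a 1-D latent z."""
    return 0.5 * ((x[..., :1] - z) ** 2).sum(-1) + 0.5 * (x ** 2).sum(-1)

def log_unnormalized_marginal(x, n_samples=10_000):
    q = torch.distributions.Normal(0.0, 2.0)     # proposal distribution over z
    z = q.sample((n_samples, 1))                 # (S, 1) latent samples
    # log of (1/S) * sum_s exp(-E(x, z_s)) / q(z_s), an importance-sampling estimate
    log_w = -joint_energy(x.unsqueeze(0), z) - q.log_prob(z).squeeze(-1)
    return log_w.logsumexp(0) - torch.log(torch.tensor(float(n_samples)))

x = torch.tensor([0.3, -0.1])
print(log_unnormalized_marginal(x))   # estimate of log p_tilde(x), still unnormalized in x
```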

Joint Energy-Based Models

Joint energy-based models (JEMs) represent an extension of energy-based models designed to capture distributions over multiple variables or modalities, such as paired data from different views like images and text. The model defines a joint energy function E(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n; \theta) that assigns lower energies to compatible configurations across the inputs, enabling the modeling of the joint distribution p(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n; \theta) \propto \exp(-E(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n; \theta)). This approach reinterprets standard classifiers, originally modeling conditional distributions p(y \mid x), as joint models for p(x, y) by treating the logit outputs as negative energies, thus unifying discriminative and generative objectives.

In applications such as multimodal tasks, JEMs leverage joint distributions for aligned outputs, such as in text-image systems. These capabilities exploit the flexibility of JEMs to handle cross-modal interactions without relying on separate encoders for each modality.

Training JEMs typically employs contrastive objectives to align modalities and learn the joint distribution, such as noise-contrastive estimation, which contrasts positive joint samples against negative ones to approximate the partition function without modeling marginals independently. This avoids the pitfalls of separate marginal modeling by directly optimizing the joint likelihood.

A primary challenge in JEMs is the increased dimensionality of joint energies, which can lead to unstable training and high computational costs, particularly for complex inputs. These issues are often addressed through techniques like balancing positive and negative sample contributions, along with regularization terms for more stable optimization.
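The logits-as-negative-energies reinterpretation can be made concrete for an input x and a discrete label y. In the sketch below (PyTorch; an arbitrary untrained network), the joint energy is E(x, y) = -f_\theta(x)[y], and summing the unnormalized weights over labels gives the unnormalized density of x alone via a log-sum-exp.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 10))  # 10 labels

def joint_energy(x, y):
    """E(x, y) = -logit_y(x): low energy for label-input pairs the network favors."""
    return -net(x).gather(-1, y.unsqueeze(-1)).squeeze(-1)

def log_unnormalized_density(x):
    """log sum_y exp(-E(x, y)) = logsumexp of the logits, the JEM density of x."""
    return net(x).logsumexp(dim=-1)

x = torch.randn(4, 2)
y = torch.randint(0, 10, (4,))
print(joint_energy(x, y).shape, log_unnormalized_density(x).shape)
```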

Other Variants

Energy-based models have been extended to integrate with continuous normalizing flows, enabling efficient training and sampling in high-dimensional spaces. These flow-based EBMs combine the flexibility of energy functions with invertible transformations for tractable likelihood computation.

Comparisons

With Generative Adversarial Networks

Generative Adversarial Networks (GANs), introduced in 2014, employ an adversarial framework in which a generator produces samples from random noise, inducing a model distribution p_g(x) intended to approximate the data, while a discriminator distinguishes real from generated samples through a min-max optimization game. This implicit generative approach enables GANs to capture complex distributions without directly modeling likelihoods, focusing instead on fooling the discriminator to minimize divergence from the true data distribution p_{data}(x).

In contrast to energy-based models (EBMs), which explicitly define a probability density via an energy function E(x) and a partition function for likelihood computation, GANs operate implicitly without tractable densities. EBMs mitigate mode collapse—a common GAN issue where the generator produces limited varieties—by modeling the full energy landscape, but they incur high computational costs from Markov chain Monte Carlo (MCMC) sampling during training and inference. GANs, while faster and capable of parallel sampling, often face training instability, such as vanishing gradients or non-convergence in the adversarial game.

Hybrid approaches bridge these paradigms by integrating energy-based components into GAN architectures. For instance, the Energy-based Generative Adversarial Network (EBGAN), introduced in 2016, reframes the discriminator as an energy function that assigns low energies to real data manifolds and higher energies elsewhere, using margin losses to stabilize training and improve sample diversity. More recent unifications, such as viewing GAN discriminators as implicit energy estimators, enable discriminator-driven latent sampling to enhance mode coverage and likelihood approximation in GANs.

Key trade-offs highlight EBMs' superiority in explicit density modeling, allowing log-likelihood evaluation for tasks requiring probabilistic inference, whereas GANs excel in generating high-fidelity samples due to their focus on perceptual quality over exact likelihoods. Empirically, EBMs achieve higher log-likelihood scores on tabular datasets, benefiting from their explicit modeling, while GANs were the preferred method for image generation prior to the rise of diffusion models, producing sharper and more realistic visuals despite lacking likelihood estimates.

With Diffusion Models

Diffusion models, first proposed in 2015, generate data by simulating a forward process that progressively adds Gaussian noise to samples from the data distribution, transforming them into isotropic noise, followed by a learned reverse process that iteratively removes noise to recover the data distribution. This reverse process is parameterized by estimating the score function, defined as the gradient of the log-density \nabla_x \log p_t(x), where t denotes the noise level, enabling the model to denoise step-by-step from pure noise.

Energy-based models (EBMs) and diffusion models share foundational similarities as score-based generative approaches, where the score function in EBMs corresponds to the negative gradient of the energy function, -\nabla_x E(x), allowing both to model unnormalized densities implicitly. However, they differ in their generative mechanisms: EBMs rely on a time-independent global energy function E(x) to define the joint distribution via a Boltzmann form, often requiring Markov chain Monte Carlo (MCMC) for sampling, whereas diffusion models employ a sequential, time-dependent iterative process with a fixed number of denoising steps, such as 1000, to approximate the reverse path. These distinctions make diffusion models more aligned with continuous-time dynamics, while EBMs emphasize thermodynamic principles for flexible modeling.

Recent advancements have highlighted overlaps, particularly in bridging the two paradigms. For instance, the 2024 Iterative Reasoning through Energy Diffusion (IRED) framework integrates EBMs with diffusion processes by learning energy functions that represent task constraints and using annealed energy landscapes for iterative optimization, akin to diffusion's noise scheduling but applied to reasoning problems such as Sudoku solving. This approach adapts the number of inference steps based on problem complexity, demonstrating superior performance in continuous- and discrete-space tasks compared to prior methods.

In terms of advantages, diffusion models facilitate faster post-training sampling due to their deterministic or stochastic reverse processes with predefined steps, reducing computational overhead once trained, which has driven their adoption in high-fidelity image and video generation. Conversely, EBMs provide greater flexibility for non-image domains, such as molecular modeling, owing to their roots in statistical physics and ability to incorporate explicit energy terms for physical constraints without relying on pixel-space noise addition.

Efforts to unify these models have emerged through flow-based extensions. A 2025 framework called Energy Matching combines flow matching—a simulation-free alternative to diffusion for learning continuous normalizing flows—with EBMs by introducing an entropic energy term that guides sampling toward a Boltzmann equilibrium distribution near the data manifold while following optimal transport paths far from it. This unification enables explicit likelihood evaluation and outperforms traditional EBMs on image benchmarks such as CIFAR-10 for sample fidelity, while relating to diffusion models by enhancing efficiency without time conditioning.

Addressing gaps in earlier formulations, 2025 developments in EBMs, such as Variational Potential Flow Bayes (VPFB), approximate diffusion-like paths by parameterizing potential flows with energy functions and matching density homotopies to the data distribution via variational principles, thereby avoiding MCMC and connecting EBMs directly to diffusion's homotopy-based generation. This approach yields competitive results in image generation and out-of-distribution detection, preserving EBM interpretability while leveraging diffusion-inspired trajectories.

Further advancements in 2025 include energy-based models for specialized tasks. For example, Energy-Based Diffusion Language Models (EDLM), presented at ICLR 2025, operate at the full sequence level to address token dependencies in discrete diffusion for text generation. Similarly, EnergyMoGen uses energy-based latent diffusion for compositional human motion generation, demonstrating improved compositionality at CVPR 2025. These works highlight the ongoing integration of EBM principles into diffusion frameworks as of late 2025.
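The score relation underlying this comparison, s(x) = \nabla_x \log p(x) = -\nabla_x E(x), is straightforward to compute with automatic differentiation, since the partition function does not depend on x and drops out of the gradient. A brief sketch (PyTorch; an arbitrary untrained energy network) follows.

```python
import torch
import torch.nn as nn

energy_net = nn.Sequential(nn.Linear(2, 64), nn.SiLU(), nn.Linear(64, 1))

def score(x):
    """Model score s(x) = -grad_x E(x); Z(theta) contributes nothing to this gradient."""
    x = x.detach().requires_grad_(True)
    e = energy_net(x).sum()
    return -torch.autograd.grad(e, x)[0]

x = torch.randn(8, 2)
print(score(x).shape)   # same shape as x: one score vector per sample
```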
