Generative model

A generative model is a class of statistical model in machine learning that learns the underlying probability distribution of a dataset in order to generate new, synthetic samples that resemble the observed data. These models represent the joint probability distribution over variables, enabling tasks such as sampling from the distribution or estimating likelihoods, in contrast to discriminative models that focus on conditional probabilities for classification or prediction. Generative models have evolved significantly since the early 2010s, powered by deep neural networks, and are foundational to modern applications including image synthesis, text generation, audio synthesis, and drug discovery. Key architectures include variational autoencoders (VAEs), which encode data into a probabilistic latent space for efficient sampling and reconstruction; generative adversarial networks (GANs), introduced in 2014, where a generator creates synthetic data while a discriminator evaluates it through adversarial training; and diffusion models, a recent advancement that iteratively adds and removes noise to learn data distributions, achieving high-fidelity outputs in domains like video generation. Other notable types encompass autoregressive models for sequential data, normalizing flows for invertible transformations, and energy-based models for flexible, unnormalized density modeling.

These models address challenges in modeling complex, high-dimensional data such as images, audio, and graphs, with applications spanning creative media, healthcare, and scientific simulations. Recent advances from 2023 to 2025 emphasize hybrid approaches combining GANs, VAEs, and diffusion models for improved stability and diversity, alongside efforts to mitigate issues like mode collapse, training instability, and ethical concerns such as bias amplification in generated content. For instance, diffusion models have outperformed GANs in generating diverse, high-resolution images, while scalable architectures enable real-time applications in robotics and healthcare.

Definition

Core Principles

Generative models are a class of statistical models in machine learning that learn the probability distribution P(X) over observed data variables X, or the joint distribution P(X, Y) when including associated labels or targets Y, to enable the synthesis of new data samples resembling those from the training distribution. By estimating this underlying distribution, these models can generate novel instances through sampling, capturing patterns such as correlations in high-dimensional data like images where certain features co-occur. This probabilistic framework allows generative models to produce realistic outputs by modeling the data-generating process itself, rather than merely classifying or predicting based on inputs.

The roots of generative models trace back to early probabilistic and statistical techniques from the 18th and 19th centuries, with significant advancements in statistics and machine learning from the mid-20th century, such as Gaussian mixture models and hidden Markov models, building on foundations like Bayes' theorem. A seminal early example is the Boltzmann machine, introduced in 1985, which uses neural networks to learn and sample from joint distributions over binary data, laying groundwork for unsupervised learning of complex patterns. These models emerged as a way to handle uncertainty and structure in data through energy-based formulations inspired by statistical physics.

Unlike approaches that rely on rote memorization or simple interpolation of examples, generative models seek to capture the underlying manifold—the low-dimensional geometric structure embedded in high-dimensional observations—enabling the creation of diverse, unseen samples that generalize beyond the input set. This distinguishes them from discriminative models, which focus on conditional distributions like P(Y \mid X) for boundary separation without generating new data. Intuitively, a generative model functions like a painter who studies numerous artworks to internalize artistic styles and compositions, then creates original pieces in those styles, in contrast to a classifier that merely identifies and labels existing styles without producing new art.

Mathematical Formulation

Generative models aim to learn a parametric distribution p(\mathbf{x}; \theta) over observed data \mathbf{x}, parameterized by \theta, typically by maximizing the log-likelihood of a training set \{\mathbf{x}_1, \dots, \mathbf{x}_N\}. This objective, known as maximum likelihood estimation (MLE), is formally expressed as \theta^* = \arg\max_\theta \frac{1}{N} \sum_{i=1}^N \log p(\mathbf{x}_i; \theta), where the average log-likelihood serves as an unbiased estimator of the expected log-likelihood under the true data distribution. In latent variable models, the data likelihood marginalizes over unobserved latent variables \mathbf{z}, yielding the marginal likelihood p(\mathbf{x}; \theta) = \int p(\mathbf{x}|\mathbf{z}; \theta) p(\mathbf{z}; \theta) \, d\mathbf{z}, which renders direct computation of the log-likelihood \log p(\mathbf{X}; \theta) = \sum_i \log p(\mathbf{x}_i; \theta) intractable for high-dimensional or complex distributions.

To optimize such models, variational inference approximates the intractable posterior p(\mathbf{z}|\mathbf{x}; \theta) with a tractable distribution q(\mathbf{z}|\mathbf{x}), deriving the evidence lower bound (ELBO) as a surrogate objective. The ELBO is given by \mathcal{L}(q, \theta) = \mathbb{E}_{q(\mathbf{z}|\mathbf{x})} \left[ \log p(\mathbf{x}|\mathbf{z}; \theta) \right] - D_{KL} \left( q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}; \theta) \right), where D_{KL} denotes the Kullback-Leibler divergence; the decomposition \log p(\mathbf{x}; \theta) = \mathcal{L}(q, \theta) + D_{KL} \left( q(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x}; \theta) \right) \geq \mathcal{L}(q, \theta) shows the bound is tight exactly when q matches the true posterior. Maximizing the ELBO with respect to \theta and the parameters of q therefore tightens the bound and approximates MLE.

Sampling from the learned p(\mathbf{x}; \theta) enables generation of new instances; for directed latent variable models, this is achieved via ancestral sampling by drawing \mathbf{z} \sim p(\mathbf{z}; \theta) from the prior and then \mathbf{x} \sim p(\mathbf{x}|\mathbf{z}; \theta). For undirected or complex models where direct sampling is challenging, Markov chain Monte Carlo (MCMC) methods construct a Markov chain whose stationary distribution approximates p(\mathbf{x}; \theta), yielding samples after sufficient iterations.
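
To make the ancestral sampling recipe concrete, the following NumPy sketch draws samples from a toy directed latent variable model; the linear-Gaussian decoder and its parameters W, b, and sigma are illustrative stand-ins for a learned p(\mathbf{x}|\mathbf{z}; \theta), not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy directed model: z ~ N(0, I), x | z ~ N(Wz + b, sigma^2 I).
# W, b, and sigma stand in for learned parameters theta.
latent_dim, data_dim, sigma = 2, 5, 0.1
W = rng.normal(size=(data_dim, latent_dim))
b = rng.normal(size=data_dim)

def ancestral_sample(n):
    """Draw x ~ p(x; theta) by sampling the prior, then the likelihood."""
    z = rng.normal(size=(n, latent_dim))        # z ~ p(z)
    mean = z @ W.T + b                          # decoder mean of p(x | z)
    return mean + sigma * rng.normal(size=(n, data_dim))

samples = ancestral_sample(1000)
print(samples.mean(axis=0))  # empirical mean approaches b as n grows
```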

Discriminative vs. Generative Approaches

Key Distinctions

Generative models and discriminative models differ fundamentally in their probabilistic objectives. Generative models seek to capture the joint distribution P(X, Y) between input features X and labels Y, or the marginal distribution P(X) in unsupervised settings, enabling both the generation of new samples and the estimation of data densities. Discriminative models, by contrast, directly model the conditional distribution P(Y \mid X), focusing on the decision boundary that separates classes without regard for the underlying distribution of the inputs.

This distinction leads to key capability gaps. Generative models can handle missing data by imputing values through the learned joint distribution and incorporate unlabeled samples in semi-supervised learning, as they explicitly model the data-generating process. Discriminative models require fully labeled data for training and cannot generate new instances or estimate densities, limiting their utility to prediction tasks like classification.

A concrete example illustrates these differences in classification, as sketched below. A generative approach, such as naive Bayes, assumes class-conditional densities (e.g., Gaussian for continuous features) to model P(X \mid Y) and applies Bayes' rule to infer P(Y \mid X). In contrast, a discriminative method like logistic regression parameterizes the posterior P(Y \mid X) directly as a linear separator in the feature space, bypassing any assumptions about the input distribution.

These paradigms also involve trade-offs in practice. Generative models impose stronger distributional assumptions, and accurately estimating the full joint distribution can demand larger samples to avoid poor approximations, but they provide versatility for applications beyond prediction, such as data synthesis. Discriminative models typically attain lower asymptotic error given ample labeled data, often yielding superior accuracy in supervised settings, though simple generative classifiers can approach their (higher) asymptotic error with fewer examples.
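
The naive Bayes versus logistic regression contrast can be reproduced in a few lines; this sketch uses scikit-learn's GaussianNB and LogisticRegression on synthetic data, where the dataset size and hyperparameters are arbitrary choices for demonstration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification task; both models see the same data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Generative: fits class priors P(Y) and class-conditional Gaussians P(X | Y),
# then classifies via Bayes' rule.
gen = GaussianNB().fit(X_tr, y_tr)

# Discriminative: parameterizes P(Y | X) directly as a linear function of X.
disc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("naive Bayes accuracy:        ", gen.score(X_te, y_te))
print("logistic regression accuracy:", disc.score(X_te, y_te))
```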

Practical Implications

In practical applications, the choice between generative and discriminative models hinges on the need for data generation versus direct prediction, influencing model selection in domains where privacy or data availability is paramount. Generative models are particularly valuable for synthesizing training data in privacy-sensitive fields such as medical imaging, where they produce synthetic images that preserve patient privacy while enabling model training without sharing sensitive records. In contrast, discriminative models are preferred for high-accuracy classification tasks, where they directly learn decision boundaries from labeled examples to achieve robust performance.

Performance considerations further highlight the divide: generative models facilitate semi-supervised learning by modeling the underlying data distribution, allowing unlabeled data to inform the joint probability and improve generalization when labeled samples are scarce. Discriminative models, however, typically dominate in fully supervised scenarios with ample labeled data, as they focus solely on conditional probabilities and converge to lower asymptotic error rates without estimating full densities.

Hybrid approaches bridge these strengths through semi-supervised techniques that integrate generative capabilities into discriminative pipelines, such as using generated synthetic samples to augment training sets and enhance classifier robustness. This combination leverages unlabeled data for broader coverage while maintaining discriminative efficiency in prediction. A key deployment challenge is computational cost: generative models often incur higher overhead due to the need for sampling or inference over the modeled distribution, limiting scalability in real-time applications compared to the direct forward passes of discriminative models.

Categories of Generative Models

Traditional Methods

Traditional generative models, developed primarily in statistics and early machine learning, focus on explicitly modeling the joint distribution of observed data and latent variables using tractable parametric assumptions. These approaches, often predating neural networks, enable density estimation, data generation, and inference through probabilistic reasoning, providing foundational techniques that emphasize interpretability and computational efficiency over scalability to high-dimensional data.

The Naive Bayes classifier exemplifies a simple generative model for classification tasks, where the goal is to model the joint distribution P(\mathbf{X}, Y) of features \mathbf{X} and class label Y. It assumes conditional independence among features given the class, leading to P(\mathbf{X} \mid Y) = \prod_{i=1}^d P(X_i \mid Y), which simplifies parameter estimation to computing class priors and marginal conditionals from training data. This independence assumption, while often violated in real data, enables efficient generation by first sampling a class from the prior and then independently sampling each feature from its conditional distribution, making it suitable for text categorization and spam detection where feature correlations are weak or ignorable. Despite its simplicity, Naive Bayes can outperform more complex models in high-dimensional settings due to its robustness to irrelevant features and low variance in estimates.

Gaussian Mixture Models (GMMs) offer a flexible parametric approach to unsupervised density estimation, representing the data distribution as p(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k), where \pi_k are mixing coefficients and \mathcal{N} denotes multivariate Gaussians with means \boldsymbol{\mu}_k and covariances \boldsymbol{\Sigma}_k. Parameter fitting proceeds via the Expectation-Maximization (EM) algorithm, which alternates between an E-step assigning soft cluster responsibilities \gamma_{nk} = P(z_n = k \mid \mathbf{x}_n) based on current parameters and an M-step updating the component parameters to maximize the expected complete-data log-likelihood. Introduced in the context of incomplete-data likelihood maximization, EM converges to a local maximum of the observed-data likelihood, making GMMs effective for clustering and background modeling in applications like speaker identification. GMMs generate new samples by first drawing a component from the mixing distribution and then sampling from the corresponding Gaussian (see the sketch at the end of this subsection), though they assume elliptical clusters and can suffer from singularity issues if components overlap excessively.

Hidden Markov Models (HMMs) extend generative modeling to sequential data, assuming an underlying Markov chain of hidden states Z_t that evolve via transition probabilities A_{ij} = P(Z_{t+1} = j \mid Z_t = i) and produce observations X_t through emission probabilities B_j(x) = P(X_t = x \mid Z_t = j), with initial state distribution \pi. The joint probability of a sequence is P(\mathbf{X}, \mathbf{Z}) = \pi_{z_1} B_{z_1}(x_1) \prod_{t=2}^T A_{z_{t-1} z_t} B_{z_t}(x_t), enabling generation by simulating state transitions and emissions forward in time. Parameter estimation and inference, such as computing marginal posteriors P(Z_t \mid \mathbf{X}), rely on the forward-backward algorithm, which efficiently calculates forward probabilities \alpha_t(i) = P(x_1 \dots x_t, Z_t = i) via dynamic programming and backward probabilities \beta_t(i) = P(x_{t+1} \dots x_T \mid Z_t = i) to avoid exponential complexity. This two-pass procedure, the core of the Baum-Welch training algorithm for HMMs, underpins applications in speech recognition and bioinformatics, where it maximizes sequence likelihoods despite latent-state intractability.

Boltzmann machines represent an early energy-based generative framework, modeling the joint distribution over binary visible and hidden units as P(\mathbf{v}, \mathbf{h}) \propto \exp(-E(\mathbf{v}, \mathbf{h})), where the energy E(\mathbf{v}, \mathbf{h}) = -\sum_{i<j} w_{ij} s_i s_j - \sum_i b_i s_i is defined over all units \mathbf{s} = (\mathbf{v}, \mathbf{h}) with pairwise weights w_{ij} and biases b_i. Sampling from this Boltzmann distribution, inspired by statistical physics, proceeds via Gibbs sampling or simulated annealing, sidestepping explicit computation of the intractable partition function. Learning adjusts weights to maximize data likelihood by contrasting positive-phase correlations, measured with the visible units clamped to data, against negative-phase expectations from model-generated samples. Introduced in the mid-1980s, these fully connected stochastic networks laid groundwork for unsupervised feature learning but faced training challenges due to intractable normalization, limiting scalability until restricted variants emerged.
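
The GMM fit-then-sample recipe described above can be illustrated with scikit-learn's GaussianMixture as the EM implementation; the two-component 1-D data below is a toy assumption chosen so the fitted means and weights are easy to check.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two-component 1-D mixture as ground truth.
data = np.concatenate([rng.normal(-2.0, 0.5, 500),
                       rng.normal(3.0, 1.0, 500)]).reshape(-1, 1)

# EM fitting: alternates soft assignments (E-step) with parameter updates (M-step).
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)

# Generation: draw a component k ~ pi, then x ~ N(mu_k, Sigma_k).
samples, components = gmm.sample(5)
print(gmm.means_.ravel(), gmm.weights_)   # recovered means and mixing weights
print(samples.ravel(), components)
```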

Deep Learning-Based Methods

Deep learning-based methods marked a paradigm shift in generative modeling by leveraging neural networks to capture highly complex, high-dimensional data distributions that traditional parametric approaches struggled with. Unlike earlier explicit density estimation techniques, neural networks enable implicit density modeling, where the generative process is defined through transformations of latent variables rather than direct probability computations, allowing for more flexible and scalable representations of data manifolds. This flexible parameterization, often using deep architectures like multilayer perceptrons or convolutional networks, facilitates learning intricate patterns in domains such as images and text without assuming predefined functional forms.

The scalability of these methods was propelled by foundational algorithms like backpropagation, which efficiently computes gradients for optimizing network parameters across vast parameter spaces, and by the advent of graphics processing units (GPUs) that accelerated the parallel computations essential for training on large datasets. Backpropagation, integral to stochastic gradient descent variants, enabled the optimization of deep generative models by propagating errors through layered architectures, making end-to-end learning feasible. Meanwhile, GPUs' parallel processing capabilities reduced training times from weeks to hours, igniting the deep learning boom in the mid-2010s and allowing generative models to handle datasets with millions of samples. This hardware-software synergy democratized access to powerful generative tools, transitioning the field from theoretical constructs to practical applications.

Key innovations in deep learning-based generative modeling include representation learning through autoencoder architectures, which compress data into low-dimensional latent spaces for reconstruction and sampling, and adversarial training paradigms that enhance output realism by pitting generator and discriminator networks against each other. Autoencoders, extended probabilistically in works like the variational autoencoder, learn disentangled representations that support diverse sample generation while incorporating latent variables to model uncertainty. Adversarial training, introduced in 2014, fosters competition that implicitly matches data distributions, yielding sharper and more realistic outputs compared to likelihood-based methods.

The evolution of these methods accelerated from the 2014 introduction of generative adversarial networks, which sparked widespread adoption, through hybrid approaches combining variational inference and adversarial objectives in the late 2010s, to the 2020s integration of multimodal capabilities that unify text, image, and audio generation within shared frameworks. This timeline reflects a progression toward ever-larger models trained on internet-scale data, with multimodal extensions enabling cross-domain synthesis, such as conditioning image generation on textual descriptions. By the early 2020s, these advancements had transformed generative modeling into a cornerstone of artificial intelligence, powering applications from creative content creation to scientific simulation.

Prominent Architectures

Variational Autoencoders

Variational autoencoders (VAEs) extend the autoencoder framework by incorporating probabilistic latent variables to model the underlying data distribution, enabling generative capabilities through variational inference. The architecture consists of an encoder network that approximates the posterior distribution q_\phi(z|x) over latent variables z given input data x, typically parameterized as a multivariate Gaussian with mean \mu and variance \sigma^2, and a decoder network that models the likelihood p_\theta(x|z) to reconstruct the input from the latent representation.

Training VAEs involves maximizing the evidence lower bound (ELBO), which serves as a tractable objective for the otherwise intractable marginal likelihood p_\theta(x) = \int p_\theta(x|z) p(z) \, dz:

\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z))

The first term encourages faithful reconstruction of x, while the second term regularizes the approximate posterior to match the prior p(z), often a standard normal distribution, promoting a structured latent space. This formulation allows VAEs to learn a compressed, probabilistic encoding of data while facilitating sampling from the prior to generate new instances.

A key innovation in VAEs is the reparameterization trick, which enables backpropagation through stochastic sampling by expressing z as a deterministic function of noise: z = \mu + \sigma \odot \epsilon, where \epsilon \sim \mathcal{N}(0, I). This transformation preserves differentiability, allowing gradients to flow from the decoder through the latent space during optimization, and scales efficiently to high-dimensional data like images; a minimal implementation sketch follows below.

The continuous latent space learned by VAEs supports applications such as interpolation, where linear paths between encoded points generate smooth transitions in the data manifold, useful for tasks like image morphing. Furthermore, modifications like the β-VAE, which scales the KL divergence term by a factor β > 1, enhance disentangled representations by encouraging independent latent factors that correspond to distinct data attributes, such as pose or lighting in images. Despite these strengths, VAEs often produce blurry outputs in generation tasks, particularly for images, because pixel-wise losses like mean squared error or binary cross-entropy average over pixel intensities and fail to capture high-frequency details.
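
A minimal PyTorch sketch of the VAE objective and reparameterization trick follows; the layer sizes, Bernoulli reconstruction likelihood, and random stand-in batch are illustrative assumptions rather than a reference implementation.

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    """Minimal VAE for flattened 28x28 inputs; sizes are illustrative."""
    def __init__(self, data_dim=784, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, data_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        eps = torch.randn_like(mu)              # reparameterization trick:
        z = mu + torch.exp(0.5 * logvar) * eps  # z = mu + sigma * eps
        return self.decoder(z), mu, logvar

def negative_elbo(x, recon, mu, logvar):
    # Reconstruction term (Bernoulli likelihood) plus analytic KL to N(0, I).
    rec = nn.functional.binary_cross_entropy_with_logits(recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl

model = TinyVAE()
x = torch.rand(8, 784)  # stand-in batch; real use would load e.g. MNIST
recon, mu, logvar = model(x)
print(negative_elbo(x, recon, mu, logvar).item())
```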

Generative Adversarial Networks

Generative Adversarial Networks (GANs) are a class of generative models, introduced in 2014, that frame data generation as an adversarial game between two neural networks: a generator and a discriminator. The generator G takes random noise z from a prior distribution p_z as input and produces samples G(z) intended to mimic the real data distribution. The discriminator D, a classifier, evaluates whether a given sample is real (from the true data distribution p_{data}) or fake (produced by the generator). This setup enables the generation of high-fidelity samples without explicitly modeling the data density, distinguishing GANs from likelihood-based approaches.

The training objective is formulated as a minimax game, where the discriminator aims to maximize its ability to distinguish real from fake samples, while the generator seeks to minimize the discriminator's success rate. This is captured by the value function:

\min_G \max_D \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

At equilibrium, the generator produces samples indistinguishable from real data, achieving a Nash equilibrium where D outputs 0.5 for both real and generated samples. Training alternates between updating D to improve classification accuracy and updating G to fool D, often using stochastic gradient descent, as in the sketch below. This adversarial dynamic has proven effective for generating realistic images, audio, and other data modalities.

Variants of GANs extend the basic framework to handle specific tasks. Conditional GANs incorporate class labels or other conditioning information into both the generator and discriminator inputs, enabling controlled generation such as producing images of specific digits from the MNIST dataset. CycleGAN, introduced in 2017, addresses unpaired image-to-image translation by training two generators and two discriminators with a cycle-consistency loss, allowing style transfer between domains, such as horses to zebras, without aligned training pairs.

Despite their success, GANs suffer from training instability and mode collapse, where the generator produces limited varieties of samples, failing to capture the full data diversity. These issues arise from the non-convex optimization and vanishing gradients in the original formulation. Wasserstein GANs (WGANs), proposed in 2017, mitigate these problems by replacing the Jensen-Shannon divergence with the Wasserstein-1 distance (Earth Mover's distance), which provides smoother gradients and better convergence, while enforcing a Lipschitz constraint on the discriminator via weight clipping or gradient penalties.
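
The alternating update scheme can be written compactly; this PyTorch sketch trains on toy 2-D data, where the network sizes, learning rates, and the commonly used non-saturating generator loss (maximizing \log D(G(z)) instead of minimizing \log(1 - D(G(z)))) are illustrative choices rather than the canonical formulation above.

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 8, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):
    # Stand-in "real" data: an offset 2-D Gaussian; real use would load a dataset.
    return torch.randn(n, data_dim) * 0.5 + torch.tensor([2.0, -1.0])

for step in range(200):
    x = real_batch()
    z = torch.randn(x.size(0), latent_dim)
    fake = G(z)

    # Discriminator step: push D(x) toward 1 and D(G(z)) toward 0.
    d_loss = bce(D(x), torch.ones(x.size(0), 1)) + \
             bce(D(fake.detach()), torch.zeros(x.size(0), 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step (non-saturating variant): push D(G(z)) toward 1.
    g_loss = bce(D(fake), torch.ones(x.size(0), 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```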

Diffusion Models

Diffusion models represent a class of generative models that produce samples through a two-stage process: a forward diffusion phase that progressively adds noise to data, and a reverse denoising phase that learns to reconstruct the original distribution. Popularized in the early 2020s, these models formalize generation as reversing a fixed chain of Gaussian perturbations, enabling high-fidelity synthesis in domains like images and audio.

In the forward process, starting from an original sample \mathbf{x}_0, Gaussian noise is incrementally added over T timesteps according to the conditional distribution q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t \mathbf{I}), where \beta_t is a variance schedule that increases with t, transforming the data into approximately isotropic Gaussian noise \mathbf{x}_T \approx \mathcal{N}(\mathbf{0}, \mathbf{I}). This process can be sampled directly at any timestep t using the closed-form marginal q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t) \mathbf{I}), with \bar{\alpha}_t = \prod_{s=1}^t (1 - \beta_s), allowing efficient training without simulating the full chain.

The reverse process, parameterized by a neural network, approximates the posterior p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)), iteratively denoising from \mathbf{x}_T back to \mathbf{x}_0 to generate new samples. Training involves optimizing a variational lower bound on the data likelihood, which simplifies to the mean-squared error objective L_{\text{simple}} = \mathbb{E}_{\mathbf{x}_0, \epsilon, t} \left[ \|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, t)\|^2 \right], where \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) is the noise added at timestep t and \epsilon_\theta is a neural network predicting that noise. This objective, derived by reparameterizing the reverse mean, facilitates stable gradient-based optimization using standard deep learning frameworks; a minimal sketch of the training loss appears below.

Compared to earlier generative approaches, diffusion models offer advantages in training stability and sample quality, as they avoid adversarial objectives and directly optimize a tractable likelihood proxy, leading to mode-covering behavior without collapse. A prominent application is Stable Diffusion, a latent diffusion model that operates in a compressed latent space for efficient text-to-image generation, achieving state-of-the-art results on benchmarks like MS COCO with FID scores around 6-10 while enabling inference on consumer hardware.

Variants include score-based generative models, which unify diffusion processes with stochastic differential equations (SDEs) and leverage score matching to estimate the gradient of the data log-density, connecting to sampling methods like Langevin dynamics for flexible perturbation kernels and reverse-time SDE solvers. These extensions, such as the variance-preserving SDE d\mathbf{x} = \mathbf{f}(\mathbf{x}, t) \, dt + g(t) \, d\mathbf{w}, allow continuous-time formulations that improve likelihood estimation and sample diversity in high dimensions.
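
The simplified training objective L_{\text{simple}} reduces to a few lines of PyTorch; in this sketch the linear variance schedule, toy 2-D data, and small MLP noise predictor (standing in for the U-Net used in practice) are illustrative assumptions.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear variance schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product \bar{alpha}_t

# Stand-in noise predictor eps_theta; real models use a U-Net over images.
eps_theta = nn.Sequential(nn.Linear(2 + 1, 128), nn.ReLU(), nn.Linear(128, 2))

def ddpm_loss(x0):
    """L_simple: predict the noise used to corrupt x0 at a random timestep."""
    t = torch.randint(0, T, (x0.size(0),))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # closed-form forward marginal
    inp = torch.cat([x_t, t.float().unsqueeze(1) / T], dim=1)
    return ((eps - eps_theta(inp)) ** 2).mean()

x0 = torch.randn(64, 2)  # toy 2-D data in place of images
print(ddpm_loss(x0).item())
```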

Autoregressive Models

Autoregressive models represent a class of generative models that generate sequential data by modeling the joint distribution through a chain of conditional probabilities, where each element is predicted based on all preceding elements. This approach factorizes the probability of a sequence \mathbf{x} = (x_1, x_2, \dots, x_n) as P(\mathbf{x}) = \prod_{i=1}^n P(x_i \mid x_{<i}), enabling the model to capture dependencies in ordered data such as text, audio, or images treated as raster scans. These models draw from earlier sequence modeling traditions, including hidden Markov models (HMMs), but leverage neural networks for more expressive representations.

Early autoregressive models for image generation, such as PixelRNN and PixelCNN introduced in 2016, apply this factorization to pixel sequences by predicting each pixel conditioned on previous ones. PixelRNN uses recurrent neural networks (RNNs) to model row-by-row dependencies, while PixelCNN employs masked convolutions to enforce the autoregressive ordering, allowing parallel computation during training and more efficient likelihood estimation compared to fully sequential RNNs. These architectures demonstrated competitive performance on density estimation tasks, such as modeling CIFAR-10 images with negative log-likelihoods around 3.14 bits per dimension for PixelCNN.

In natural language processing, transformer-based autoregressive models have become prominent, exemplified by the GPT series starting with GPT-1 in 2018. These models use a decoder-only transformer architecture to predict the next token in a sequence, scaling to billions of parameters in later iterations like GPT-3 (175 billion parameters) to generate coherent long-form text. The autoregressive formulation allows exact computation of the likelihood during training via teacher forcing, facilitating unsupervised pretraining on vast corpora before fine-tuning for specific tasks.

A key strength of autoregressive models is their ability to compute exact likelihoods, which supports reliable density estimation and principled training objectives like maximum likelihood, unlike implicit density models; the sketch below illustrates both exact likelihood evaluation and sequential sampling. However, inference remains slow due to the sequential nature of generation, requiring one forward pass per output element, which limits scalability for long sequences despite parallelizable training.
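
A minimal PyTorch sketch contrasting the two sides of the trade-off: exact log-likelihood in a single teacher-forced pass versus token-by-token sampling. The GRU architecture, vocabulary size, and random tokens are illustrative stand-ins for a trained transformer and real text.

```python
import torch
import torch.nn as nn

vocab, ctx = 50, 16
embed = nn.Embedding(vocab, 32)
rnn = nn.GRU(32, 64, batch_first=True)
head = nn.Linear(64, vocab)

def log_likelihood(tokens):
    """Exact log p(x) = sum_i log p(x_i | x_<i), in one teacher-forced pass."""
    h, _ = rnn(embed(tokens[:, :-1]))
    logits = head(h)                                 # predicts tokens[:, 1:]
    logp = torch.log_softmax(logits, dim=-1)
    return logp.gather(-1, tokens[:, 1:].unsqueeze(-1)).sum()

@torch.no_grad()
def sample(n_steps, start_token=0):
    """Sequential generation: one forward step per emitted token."""
    seq = torch.tensor([[start_token]])
    for _ in range(n_steps):
        h, _ = rnn(embed(seq))
        probs = torch.softmax(head(h[:, -1]), dim=-1)
        seq = torch.cat([seq, torch.multinomial(probs, 1)], dim=1)
    return seq

tokens = torch.randint(0, vocab, (4, ctx))  # stand-in batch of token ids
print(log_likelihood(tokens).item())
print(sample(10))
```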

Applications and Examples

Image and Video Generation

Generative adversarial networks have significantly advanced image generation by enabling the synthesis of high-fidelity visual content. StyleGAN, introduced by NVIDIA researchers in 2019, utilizes a style-based generator architecture that disentangles latent factors to produce photorealistic human faces with fine-grained control over attributes like age and expression. This model has demonstrated superior performance in generating 1024x1024-resolution images, achieving perceptual quality scores that surpass prior GAN variants on datasets like FFHQ. Similarly, BigGAN, developed by DeepMind in 2019, scales GAN training to large batch sizes and model capacities, enabling diverse, high-resolution image synthesis across thousands of classes and improving the Inception Score by over 100 points compared to earlier methods.

Diffusion models have emerged as a powerful alternative for conditional image generation, particularly in text-to-image tasks. DALL-E 2, released by OpenAI in 2022, employs a two-stage pipeline conditioned on CLIP embeddings to generate coherent, high-quality images from textual prompts, such as "a fox painting a picture of itself," with human evaluations rating its outputs as more diverse and faithful to descriptions than predecessors. Stable Diffusion, open-sourced by Stability AI in 2022, builds on latent diffusion techniques to produce 512x512 images efficiently on consumer GPUs, allowing widespread access and applications in creative workflows.

These advancements extend to video generation through spatiotemporal extensions of diffusion models. Make-A-Video, introduced by Meta AI in 2022, leverages text-to-image diffusion priors with temporal super-resolution to synthesize short video clips from prompts like "a panda playing guitar," achieving realistic motion without dedicated text-video training data.

The practical impact of these models is evident in artistic and industrial applications. In art, StyleGAN powers interactive tools like the website "This Person Does Not Exist," which generates indistinguishable synthetic faces to explore AI's creative potential. In film visual effects (VFX), generative models facilitate tasks such as actor de-aging in the 2024 film Here, where generative AI techniques transformed Tom Hanks' appearance across decades in real time. Similarly, Netflix used generative AI to produce VFX elements ten times faster than traditional methods in the series El Eternauta.

Text and Language Generation

Generative models have transformed text and language generation in natural language processing (NLP), with transformer-based architectures emerging as the dominant paradigm due to their parallelizable attention mechanisms that effectively capture contextual dependencies in sequential data. These models enable the production of coherent, contextually relevant text by learning probabilistic distributions over vocabulary tokens, powering applications from machine translation to conversational agents. Autoregressive approaches, which decompose text generation into sequential predictions, form the backbone of many leading systems, allowing for scalable training on vast corpora.

A seminal example is the GPT-3 model, introduced by OpenAI in 2020, featuring 175 billion parameters and demonstrating exceptional zero-shot performance on diverse tasks such as translation and summarization without any task-specific fine-tuning. GPT-3's ability to generalize across domains stems from its pretraining on internet-scale text, enabling emergent capabilities like in-context learning, where task instructions are provided in prompts. For specialized uses, such as chatbots, these autoregressive large language models (LLMs) are fine-tuned on dialogue datasets to enhance response relevance and safety, reducing hallucinations and improving user interaction. In practice, ChatGPT, launched by OpenAI in November 2022 as a fine-tuned iteration of the GPT-3.5 series, has popularized these models for everyday content creation, including drafting emails, generating code snippets, and brainstorming ideas.

Beyond autoregressive methods, diffusion models offer promising alternatives for text generation, particularly in achieving controllability. Diffusion-LM, developed in 2022, adapts continuous diffusion processes to text spaces, enabling non-autoregressive generation where attributes like syntactic structure or sentiment can be specified upfront for more flexible outputs. Building on this, recent 2024 innovations such as energy-based diffusion language models integrate sequence-level energy functions during denoising steps, improving global coherence and reducing inconsistencies in generated narratives compared to token-by-token autoregressive sampling.

Assessing these generative models requires both automated and human-centric metrics to capture their multifaceted performance. Perplexity, a standard intrinsic measure, quantifies a model's predictive likelihood on unseen text, with lower scores indicating better fluency; for instance, GPT-2 demonstrated strong zero-shot language modeling performance on benchmarks like WikiText-2, and the computation is sketched below. Complementing this, human evaluations gauge extrinsic qualities like coherence and relevance through Likert-scale ratings or pairwise comparisons, often revealing gaps in automated metrics for subjective aspects such as narrative flow in story generation. These combined approaches ensure rigorous validation, guiding iterative improvements in model design.
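
Perplexity follows directly from per-token log-probabilities, as in this minimal sketch; the placeholder log-probability values are synthetic and would normally come from a trained autoregressive language model scoring held-out text.

```python
import math
import torch

def perplexity(log_probs):
    """Perplexity = exp of the average negative log-likelihood per token.

    log_probs: 1-D tensor of log p(x_i | x_<i) for each token of a held-out
    text, as produced by any autoregressive language model.
    """
    return math.exp(-log_probs.mean().item())

# Illustrative values only: a model assigning ~0.2 probability to each token.
fake_log_probs = torch.log(torch.full((1000,), 0.2))
print(perplexity(fake_log_probs))  # ~5.0; lower is better
```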

Scientific and Multimodal Applications

Generative models have revolutionized drug discovery by enabling the design of novel molecules, particularly through variational autoencoders (VAEs) that leverage latent spaces to generate chemically valid structures. Inspired by AlphaFold's capabilities, post-2023 models integrate structural constraints into generative processes, allowing for the creation of drug-like compounds that target specific proteins with high affinity and low toxicity. For instance, VAEs trained on molecular datasets can sample from latent distributions to produce diverse scaffolds, accelerating hit identification and lead optimization in pharmaceutical pipelines.

In multimodal applications, generative models facilitate cross-domain integration, such as aligning text and images via CLIP-guided processes introduced in 2021. These models use contrastive learning from CLIP to condition diffusion-based generation, producing images that closely match textual descriptions while preserving semantic coherence. Building on this, Flamingo, a 2022 vision-language model, extends these capabilities by incorporating visual inputs into large language models for few-shot tasks, enabling joint reasoning over text and images in applications like visual question answering.

Generative models also enhance simulations in scientific domains by synthesizing realistic physical scenarios. In climate modeling, generative AI approaches such as diffusion models generate ensemble projections of atmospheric variables, capturing variability in sea-ice dynamics or global climate patterns with greater efficiency than traditional numerical simulations. For robotics, generative physical AI models simulate diverse environments and interactions—for example, MIT's Steerable Scene Generation method from 2025 creates realistic virtual training grounds for robots to practice tasks like object manipulation under varying conditions—allowing training on virtual scenarios that mimic real-world physics.

By 2025, hybrid generative models have emerged as a trend in multimodal applications, particularly for synchronized audio-video generation from textual prompts. OpenAI's Sora exemplifies this, employing diffusion transformer architectures to produce high-fidelity videos up to a minute long, with integrated audio that adheres to physical realism and prompt fidelity, supporting advancements in content creation and world simulation.

Challenges and Developments

Training and Evaluation Challenges

Training generative models, particularly those with deep architectures, encounters significant hurdles during optimization. The vanishing gradient problem arises when gradients diminish exponentially as they propagate through many layers of a deep network, slowing or halting learning in early layers and complicating the training of complex generative systems like variational autoencoders (VAEs) and GANs. This issue intensifies in adversarial settings, where an overly effective discriminator provides minimal learning signal to the generator, leading to stalled progress. Overfitting to specific modes of the data distribution further exacerbates training instability; in GANs, this often results in mode collapse, where the generator converges to producing a narrow subset of samples, failing to capture the full data manifold.

Assessing the performance of generative models requires robust evaluation metrics that balance sample quality, diversity, and fidelity to the target distribution. The Inception Score (IS) quantifies both quality and diversity by computing the KL divergence between the predicted class distribution of generated images and the marginal class distribution, with higher scores indicating more recognizable and varied outputs. Complementing IS, the Fréchet Inception Distance (FID) measures distributional similarity between real and generated samples using Gaussian assumptions on Inception network features, where lower FID values reflect closer alignment and higher fidelity; a computation sketch follows below. To address limitations in holistic metrics like FID, precision-recall frameworks evaluate coverage explicitly: precision assesses the realism of generated samples, while recall gauges how comprehensively the model spans the reference data modes, enabling disentangled diagnosis of quality versus diversity deficits.

Human-centric evaluations provide subjective insights into realism and utility, often employing Turing-like tests where evaluators attempt to differentiate generated from authentic samples, revealing perceptual indistinguishability in advanced models. Diversity scores, such as the Vendi Score, offer scalable, reference-free quantification of output variety by estimating entropy-like measures on sample manifolds, aiding detection of collapse or repetition without relying on ground-truth comparisons.

Scalability poses a core challenge, as training state-of-the-art generative models demands vast computational resources; large-scale systems with billions of parameters typically require distributed clusters of thousands of GPUs running for weeks to reach convergence. In response, distillation techniques have gained traction by 2025, compressing capabilities from cumbersome teacher models into efficient student variants, with multimodal dataset distillation methods addressing correlation and diversity issues to yield up to 18-fold training speedups on datasets like COCO.
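
Under its Gaussian assumption, FID reduces to a closed-form expression over feature means and covariances: \|\mu_1 - \mu_2\|^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}). The NumPy/SciPy sketch below computes it; real evaluations would extract the feature arrays from a pretrained Inception network, whereas here random features stand in.

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    """Fréchet Inception Distance between two sets of feature vectors.

    In practice the features come from a pretrained Inception network;
    any (n, d) arrays work here.
    """
    mu1, mu2 = feats_real.mean(0), feats_gen.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):   # trim small numerical imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, (2048, 64))
gen = rng.normal(0.1, 1.1, (2048, 64))  # slightly shifted "generated" stats
print(fid(real, gen))                   # lower is better; 0 for identical stats
```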

Recent Advances and Ethical Issues

In 2024, Stability AI released Stable Diffusion 3 (SD3), a scaled diffusion model that enhances text-to-image generation through a multimodal diffusion transformer architecture, achieving superior performance in handling complex prompts involving multiple subjects and improved typographic rendering compared to prior versions. This was followed by Stable Diffusion 3.5 in October 2024, which further optimized inference speed and accessibility for open-source applications while maintaining high-fidelity outputs. Concurrently, open-source multimodal models like LLaVA saw significant advancements; LLaVA-NeXT, introduced in January 2024, expanded visual resolution and reasoning capabilities for image-language tasks, while LLaVA-OneVision in 2025 integrated video understanding and cross-scene generalization into a fully open framework. These developments underscore a trend toward larger-scale, accessible generative systems that blend diffusion processes with transformer-based architectures for broader multimodal applications.

Energy-based models (EBMs) have experienced renewed interest in generative modeling, with 2024 research exploring their integration with other paradigms to address limitations in probability normalization and sample efficiency. This revival highlights EBMs' potential for flexible energy landscapes in modeling complex distributions, complementing diffusion and autoregressive approaches in hybrid frameworks.

Ethical concerns around generative models have intensified, particularly regarding bias in their outputs. A 2024 UNESCO study found that large language models, including those underpinning generative systems, exhibit regressive gender stereotypes, homophobia, and racial biases in generated text and images. For instance, text-to-image models like Stable Diffusion perpetuate stereotypical representations, such as associating certain professions with specific genders or races, as evidenced by analyses showing overrepresentation of white males in generated imagery. Deepfakes pose additional risks, with generative models enabling sophisticated fraud; Deloitte projected a 32% annual growth in U.S. fraud losses, from $12.3 billion in 2023 to $40 billion by 2027, driven by AI-generated audio and video manipulations. Intellectual property infringement remains a core issue, as training on copyrighted datasets without permission raises questions of reproduction-rights violations; the U.S. Copyright Office's 2025 report concluded that such ingestion can constitute infringement unless fair use applies, prompting lawsuits against providers such as OpenAI and Stability AI.

The European Union's AI Act, effective from 2024, classifies AI systems as high-risk if they pose significant threats to health, safety, or fundamental rights, imposing obligations like transparency and risk assessments on providers of general-purpose AI models with systemic impacts. This mandates documentation of training data and output safeguards for models exceeding computational thresholds, aiming to mitigate biases and misuse in deployment.

Looking ahead, alignment techniques such as reinforcement learning from human feedback (RLHF) are increasingly applied to generative models for safer outputs, aligning them to human preferences and reducing harmful generations. By 2025, RLHF evolutions like reinforcement learning from AI feedback (RLAIF) have enabled self-optimized alignment, enhancing ethical compliance in multimodal systems without exhaustive human annotation.
