
Softmax function

The softmax function, also known as softargmax, is a normalized exponential function that transforms a finite-dimensional vector of real numbers, often called logits or scores, into a probability distribution over the same number of categories, ensuring that the outputs are non-negative and sum to one. Mathematically, for a vector \mathbf{z} = (z_1, \dots, z_K) with K elements, the softmax is defined as \hat{y}_i = \frac{\exp(z_i)}{\sum_{j=1}^K \exp(z_j)} for each i = 1, \dots, K, where \exp(\cdot) denotes the exponential function. This formulation preserves the relative ordering of the input values while producing interpretable probabilities suitable for multi-class decision-making.

The term "softmax" was coined by John S. Bridle in 1989, introduced in the context of training stochastic model recognition algorithms as neural networks to achieve maximum mutual information estimation of parameters. In this original work, Bridle proposed the function as a multi-output generalization of the logistic function, enabling networks to output conditional probabilities directly and facilitating discrimination-based learning via a relative entropy (cross-entropy) loss. Although "softargmax" was later suggested as a more descriptive alternative to emphasize its relation to the argmax operation, "softmax" became the standard nomenclature in the machine learning literature.

Key properties of the softmax function include its shift invariance—adding a constant to all inputs does not change the output probabilities—and its reduction to the logistic (sigmoid) function for binary cases, where p = \frac{1}{1 + \exp(-(z_1 - z_2))}. It arises naturally as the maximum-entropy distribution subject to expected score constraints and as a discrete choice model under Gumbel-distributed noise added to scores, balancing exploration of all options with exploitation of the highest scores via a sharpness parameter \alpha (as \alpha \to \infty, it approaches a hard argmax). These attributes make it differentiable and computationally efficient, with the output odds between categories depending solely on score differences: \frac{p_i}{p_j} = \exp(\alpha (s_i - s_j)).

In modern applications, the softmax function serves as the canonical activation for the output layer in neural networks for multi-class classification, converting raw linear predictions into probabilities that can be optimized using cross-entropy loss. It is integral to softmax regression, a generalization of logistic regression, where model parameters are learned to maximize the likelihood of correct class assignments. Beyond classification, softmax appears in attention mechanisms (e.g., scaled dot-product attention in transformers), in reinforcement learning for policy parameterization, and in probabilistic modeling of categorical data, underscoring its versatility in converting unconstrained scores to interpretable distributions.

Definition

Mathematical Definition

The softmax function, first termed and applied in the context of the probabilistic interpretation of neural network outputs by Bridle (1989), is mathematically defined for a finite-dimensional input \mathbf{z} = (z_1, \dots, z_K)^\top \in \mathbb{R}^K (with K \geq 1) as the output vector \boldsymbol{\sigma}(\mathbf{z}) whose i-th component is given by \sigma(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^K \exp(z_j)}, \quad i = 1, \dots, K. The exponential function \exp(\cdot) plays a crucial role in this formulation by mapping each real-valued input z_i to a strictly positive value \exp(z_i) > 0, thereby ensuring all components of the output vector are positive before normalization. In the machine learning literature, the input vector is conventionally denoted \mathbf{z} to represent the pre-activation logits (unbounded real values produced by a linear layer), while \mathbf{x} typically denotes the original feature inputs to the model; this distinction highlights the softmax's role as an output activation applied to logits. For the scalar case K=1, the definition simplifies trivially to \sigma(z_1) = \frac{\exp(z_1)}{\exp(z_1)} = 1, yielding a constant output. When K=2, the softmax reduces to the binary logistic (sigmoid) function of the logit difference, as \sigma(z_1, z_2)_1 = \frac{1}{1 + \exp(z_2 - z_1)} = \frac{1}{1 + \exp(-(z_1 - z_2))}, i.e. the standard sigmoid applied to z_1 - z_2.
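
As a concrete illustration (not part of the original definition), the following minimal NumPy sketch evaluates the formula and checks the K = 2 reduction to the logistic sigmoid; the helper names softmax and sigmoid are chosen here purely for clarity.

```python
import numpy as np

def softmax(z):
    """Componentwise softmax: exp(z_i) / sum_j exp(z_j)."""
    e = np.exp(z)
    return e / e.sum()

def sigmoid(x):
    """Standard logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([1.2, -0.3])
# For K = 2, the first softmax component equals the sigmoid of the logit difference.
print(softmax(z)[0])         # ~0.8176
print(sigmoid(z[0] - z[1]))  # same value, ~0.8176
```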

Basic Interpretations

The softmax function serves as a normalized transformation that converts a vector of unbounded real-valued inputs, often called logits or scores, into a discrete probability distribution over multiple categories. By applying the exponential function to each input and dividing by the sum of exponentials across all inputs, it ensures that the outputs are strictly positive and sum to exactly one, thereby mapping the inputs onto the probability simplex. This makes the softmax particularly useful for interpreting raw model outputs as probabilities in multi-class settings, where the relative magnitudes of the inputs determine the likelihood assigned to each class.

The outputs of the softmax function directly parameterize a categorical distribution, where each component represents the probability of a specific category in a multinomial setting. This connection arises because the softmax enforces the constraints of a valid probability distribution—non-negativity and normalization—allowing it to model the probabilities of mutually exclusive and exhaustive outcomes. In statistical terms, if the inputs are the natural logarithms of unnormalized probabilities, the softmax recovers the normalized form, aligning with the parameterization used in log-linear models.

The use of exponentials in the softmax provides an intuitive amplification of differences among the input values, transforming subtle variations in scores into more pronounced probabilistic preferences. For instance, a larger input value leads to an exponentially higher output probability compared to smaller ones, which promotes decisive distributions where the highest-scoring category receives the majority of the probability mass, while still allowing for some uncertainty in closer cases. This non-linear scaling ensures that the function is sensitive to relative differences rather than absolute values, enhancing its effectiveness in representing confidence levels across categories.

A generalized variant of the softmax introduces a temperature parameter \tau > 0 to modulate the sharpness of the resulting distribution, defined as \sigma(\mathbf{z}; \tau)_i = \frac{\exp(z_i / \tau)}{\sum_j \exp(z_j / \tau)}. When \tau = 1, it recovers the standard softmax; lower values of \tau sharpen the distribution toward the maximum input (approaching a Dirac delta), while higher values flatten it toward uniformity, providing flexibility in controlling the trade-off between decisiveness and uncertainty in probabilistic outputs.
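
The temperature-scaled variant can be sketched in a few lines of NumPy; the function name softmax_temperature and the example logits below are illustrative, assuming the definition given above.

```python
import numpy as np

def softmax_temperature(z, tau=1.0):
    """Temperature-scaled softmax: exp(z_i / tau) / sum_j exp(z_j / tau)."""
    e = np.exp(np.asarray(z, dtype=float) / tau)
    return e / e.sum()

z = [2.0, 1.0, 0.1]
print(softmax_temperature(z, tau=1.0))   # standard softmax, ~[0.66, 0.24, 0.10]
print(softmax_temperature(z, tau=0.1))   # sharpened toward the argmax, ~[1.00, 0.00, 0.00]
print(softmax_temperature(z, tau=10.0))  # flattened toward uniform, ~[0.37, 0.33, 0.30]
```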

Advanced Interpretations

Smooth Approximation to Argmax

The argmax operation, denoted \arg\max_i z_i, selects the index i corresponding to the maximum value in a vector \mathbf{z} \in \mathbb{R}^K, producing a one-hot encoded vector where the entry at the maximum position is 1 and all others are 0. However, this operation is non-differentiable, which poses challenges for gradient-based optimization in deep learning, as it cannot be directly incorporated into differentiable computational graphs. The softmax function addresses this limitation by serving as a smooth, differentiable approximation to argmax, often referred to as "softargmax." Defined with a temperature parameter \tau > 0, the softmax \sigma(\mathbf{z}; \tau)_i = \frac{\exp(z_i / \tau)}{\sum_{j=1}^K \exp(z_j / \tau)} maps the input vector \mathbf{z} to a probability distribution over the K categories, where the probabilities concentrate more sharply on the largest entries as \tau decreases. In the limit of vanishing temperature, the softmax output converges pointwise to the one-hot vector aligned with the argmax: \lim_{\tau \to 0^+} \sigma(\mathbf{z}; \tau)_i = 1 if i = \arg\max_j z_j (assuming no ties in \mathbf{z}), and 0 otherwise. This smoothing property enables the use of gradient-based methods to approximate discrete decision-making processes that would otherwise rely on non-differentiable argmax operations. For instance, in techniques like straight-through estimators, the forward pass may employ a hard argmax for discrete selection, while the backward pass approximates gradients through a low-temperature softmax to propagate signals effectively during training.
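
A small NumPy sketch (illustrative names, assuming the temperature-scaled definition above) shows the probabilities concentrating on the argmax index as \tau shrinks.

```python
import numpy as np

def softargmax(z, tau=1.0):
    """Smooth approximation to argmax; approaches a one-hot vector as tau -> 0+."""
    z = np.asarray(z, dtype=float)
    e = np.exp((z - z.max()) / tau)   # subtract-max for numerical stability
    return e / e.sum()

z = np.array([1.0, 3.0, 2.0])
print(np.argmax(z))                   # 1 (hard, non-differentiable selection)
for tau in (1.0, 0.5, 0.1, 0.01):
    print(tau, softargmax(z, tau))    # mass concentrates on index 1 as tau shrinks
```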

Relation to Boltzmann Distribution

In statistical mechanics, the Boltzmann distribution describes the probability P_i of a system occupying a particular state i with energy E_i at thermal equilibrium temperature T, given by P_i = \frac{\exp(-E_i / kT)}{\sum_j \exp(-E_j / kT)}, where k is Boltzmann's constant and the sum in the denominator runs over all possible states j. This distribution was first formulated by Ludwig Boltzmann in 1868 as part of his foundational work on the statistical mechanics of gases, deriving the equilibrium probabilities through combinatorial arguments for particle distributions. The softmax function bears a direct mathematical resemblance to the Boltzmann distribution, arising from the mapping z_i = -E_i / kT, which transforms the energies into logits scaled by the inverse temperature 1/(kT); thus, softmax outputs precisely model the equilibrium probabilities of the canonical ensemble in statistical mechanics. Consequently, the softmax inherits key concepts from the Boltzmann framework, including the partition function (the normalizing denominator \sum_j \exp(-E_j / kT)) that ensures probabilities sum to unity, and the interpretation of inputs as energy-based scores for probabilistic state selection.
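
Assuming the mapping z_i = -E_i / kT described above, a brief NumPy sketch (with hypothetical energy values in arbitrary units) computes Boltzmann probabilities as a stabilized softmax.

```python
import numpy as np

def boltzmann_probabilities(energies, kT=1.0):
    """Equilibrium probabilities P_i proportional to exp(-E_i / kT), i.e. softmax of z_i = -E_i / kT."""
    z = -np.asarray(energies, dtype=float) / kT
    e = np.exp(z - z.max())            # stabilized softmax
    return e / e.sum()                 # the denominator plays the role of the partition function

energies = np.array([0.0, 1.0, 2.0])   # hypothetical state energies
print(boltzmann_probabilities(energies, kT=1.0))   # low-energy state dominates
print(boltzmann_probabilities(energies, kT=10.0))  # high temperature flattens toward uniform
```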

Properties

Key Mathematical Properties

The softmax function \sigma: \mathbb{R}^K \to (0,1)^K, defined componentwise as \sigma(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^K \exp(z_j)}, exhibits several mathematical properties that ensure it maps onto the interior of the probability simplex. A primary property is normalization, whereby the outputs sum to unity: \sum_{i=1}^K \sigma(\mathbf{z})_i = 1 for all \mathbf{z} \in \mathbb{R}^K. This follows directly from the definitional structure, since summing the numerators reproduces the denominator. Complementing this is non-negativity, with \sigma(\mathbf{z})_i > 0 for all i and \mathbf{z}, since exponentials are strictly positive and the denominator is a positive sum. These traits position softmax outputs as valid probability distributions over K categories.

The function is also strictly monotonic in each component: if z_i > z_j, then \sigma(\mathbf{z})_i > \sigma(\mathbf{z})_j. This order-preserving behavior arises because increasing z_i relative to z_j amplifies the corresponding exponential term more than the others, while normalization keeps the outputs summing to one. Additionally, softmax is invariant to translation of all inputs by a constant: \sigma(\mathbf{z} + c \mathbf{1}) = \sigma(\mathbf{z}) for any c \in \mathbb{R}, where \mathbf{1} is the all-ones vector. This holds because adding c to each input multiplies both numerator and denominator by \exp(c), which cancels out.

Finally, the softmax function is unique as the mapping from \mathbb{R}^K to the interior of the simplex that satisfies normalization, non-negativity, monotonicity, and translation invariance. This uniqueness stems from its characterization as the maximum-entropy distribution subject to a moment constraint on the expected input, derived via Lagrange multipliers. To see this, maximize the entropy H(p) = -\sum_{i=1}^K p_i \log p_i over p \in (0,1)^K subject to \sum_i p_i = 1 and \sum_i p_i z_i = \mu (for fixed mean \mu). The Lagrangian is L(p, \lambda, \beta) = H(p) + \lambda \left(\sum_i p_i z_i - \mu\right) + \beta \left(\sum_i p_i - 1\right). Setting partial derivatives to zero yields \frac{\partial L}{\partial p_i} = -\log p_i - 1 + \lambda z_i + \beta = 0, so p_i = \exp(\lambda z_i + \beta - 1). Applying the normalization constraint normalizes the exponentials, recovering the softmax form (with \lambda playing the role of an inverse temperature); strict convexity of the negative entropy ensures this solution is unique.
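
These properties are straightforward to verify numerically; the following NumPy sketch (illustrative, with a randomly drawn input) checks positivity, normalization, translation invariance, and monotonicity.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
z = rng.normal(size=5)

p = softmax(z)
assert np.all(p > 0)                    # non-negativity (in fact strict positivity)
assert np.isclose(p.sum(), 1.0)         # normalization
assert np.allclose(softmax(z + 3.7), p) # translation invariance
order = np.argsort(z)
assert np.all(np.diff(p[order]) > 0)    # monotonic: larger input -> larger output
print("all properties hold for", z)
```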

Gradient Computations

The gradients of the softmax function play a central role in backpropagation algorithms for training neural networks, enabling the efficient computation of how perturbations in the input logits \mathbf{z} \in \mathbb{R}^K propagate to changes in the output probabilities \boldsymbol{\sigma}(\mathbf{z}) \in \Delta^{K-1}, where \Delta^{K-1} denotes the (K-1)-dimensional probability simplex. Consider the component-wise definition \sigma_i(\mathbf{z}) = \frac{\exp(z_i)}{s}, where s = \sum_{k=1}^K \exp(z_k). To derive the partial derivatives, apply the quotient rule and the chain rule. For i \neq j, \frac{\partial \sigma_i}{\partial z_j} = \frac{\partial}{\partial z_j} \left( \frac{\exp(z_i)}{s} \right) = -\frac{\exp(z_i)}{s^2} \cdot \frac{\partial s}{\partial z_j} = -\frac{\exp(z_i) \exp(z_j)}{s^2} = -\sigma_i \sigma_j, since \frac{\partial s}{\partial z_j} = \exp(z_j). For the case i = j, \frac{\partial \sigma_i}{\partial z_i} = \frac{\exp(z_i) \cdot s - \exp(z_i) \cdot \exp(z_i)}{s^2} = \frac{\exp(z_i) (s - \exp(z_i))}{s^2} = \sigma_i \left(1 - \sigma_i \right), as the numerator derivative includes both the direct term from \exp(z_i) and the indirect term through s. Combining these yields the general component-wise form of the Jacobian entries: \frac{\partial \sigma_i}{\partial z_j} = \sigma_i (\delta_{ij} - \sigma_j), where \delta_{ij} is the Kronecker delta (\delta_{ij} = 1 if i = j, else 0).

In matrix notation, the full Jacobian J(\mathbf{z}) \in \mathbb{R}^{K \times K} is J(\mathbf{z}) = \operatorname{diag}(\boldsymbol{\sigma}(\mathbf{z})) - \boldsymbol{\sigma}(\mathbf{z}) \boldsymbol{\sigma}(\mathbf{z})^\top, which is symmetric and positive semidefinite with rank at most K-1. This structure admits a clear interpretation: the diagonal elements \sigma_i (1 - \sigma_i) \geq 0 capture self-reinforcement, where an increase in z_i boosts \sigma_i proportionally to its current value, while the off-diagonal elements -\sigma_i \sigma_j < 0 (for i \neq j) encode inter-class competition, as an increase in z_j diminishes \sigma_i to maintain the normalization \sum_i \sigma_i = 1. Consequently, each row (and column) of J sums to zero, preserving the simplex constraint under infinitesimal changes.

In practice, forming the explicit K \times K Jacobian requires O(K^2) space and time, which is prohibitive for large K. However, during backpropagation, only the Jacobian-vector product J \mathbf{v} is typically needed for a downstream gradient vector \mathbf{v} \in \mathbb{R}^K, and this can be evaluated in O(K) time via J \mathbf{v} = \operatorname{diag}(\boldsymbol{\sigma}) \mathbf{v} - \boldsymbol{\sigma} (\boldsymbol{\sigma}^\top \mathbf{v}), avoiding materialization of the full matrix and enabling scalable computation in deep learning frameworks.
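
A short NumPy sketch (function names illustrative) forms the full Jacobian \operatorname{diag}(\boldsymbol{\sigma}) - \boldsymbol{\sigma}\boldsymbol{\sigma}^\top and checks its structure and the O(K) Jacobian-vector product described above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def softmax_jacobian(z):
    """Full K x K Jacobian: diag(sigma) - sigma sigma^T (O(K^2) memory)."""
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)

def softmax_jvp(z, v):
    """Jacobian-vector product in O(K): sigma * v - sigma * (sigma^T v)."""
    s = softmax(z)
    return s * v - s * (s @ v)

z = np.array([2.0, 1.0, 0.1])
v = np.array([0.5, -1.0, 2.0])
J = softmax_jacobian(z)
assert np.allclose(J, J.T)              # symmetric
assert np.allclose(J.sum(axis=1), 0.0)  # rows sum to zero (simplex constraint)
assert np.allclose(J @ v, softmax_jvp(z, v))
print(J @ v)
```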

Numerical Considerations

Complexity and Challenges

The computation of the softmax function for an input vector \mathbf{z} \in \mathbb{R}^K involves exponentiating each of the K elements, computing their sum, and performing element-wise division for normalization, yielding a time complexity of O(K). This linear dependence on the dimension K poses challenges in high-dimensional settings, such as natural language processing where K corresponds to vocabulary sizes often exceeding 50,000, leading to substantial per-instance costs during inference and training. The space complexity is likewise O(K), required for storing the input vector, intermediate exponentials, and output probabilities, though computing in log-space via the log-sum-exp trick can avoid temporary storage of large exponential values, modestly reducing peak memory usage.

A primary numerical challenge stems from the exponential operation in the softmax formula, \sigma(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^K \exp(z_j)}, which is susceptible to overflow when any z_i is large and positive, causing \exp(z_i) to overflow toward infinity and rendering the normalized ratio undefined. Conversely, when all z_i are large and negative, underflow occurs as \exp(z_i) rounds to zero for every term, resulting in loss of precision and a denominator of zero. These instabilities can propagate to produce NaN values in the probabilities or degenerate distributions where one probability approaches 1 and the others 0, thereby distorting gradients during backpropagation, as outlined in the gradient computations section.
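
These failure modes are easy to reproduce with a naive implementation; the following NumPy sketch is purely illustrative (floating-point warnings are suppressed for the demonstration).

```python
import numpy as np

def naive_softmax(z):
    e = np.exp(z)                      # no stabilization
    return e / e.sum()

with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    # Overflow: exp(1000) -> inf, so the ratio becomes inf/inf -> nan.
    print(naive_softmax(np.array([1000.0, 999.0, 998.0])))
    # Underflow: every exp rounds to 0, so the ratio becomes 0/0 -> nan.
    print(naive_softmax(np.array([-1000.0, -999.0, -998.0])))
```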

Stable Numerical Methods

Computing the softmax function directly can lead to numerical overflow when input values are large, as the exponential terms grow rapidly. A standard technique to mitigate this is the subtract-max trick, which shifts all inputs by their maximum value before exponentiation. This ensures that all exponents are less than or equal to zero, bounding the terms and preventing overflow while preserving the original probabilities. The adjusted computation is given by \sigma(\mathbf{z})_i = \frac{\exp(z_i - m)}{\sum_j \exp(z_j - m)}, where m = \max_k z_k. This method is equivalent to the standard softmax because the shift factor \exp(-m) cancels out in the ratio.

For applications requiring the logarithm of softmax probabilities, such as in cross-entropy loss computations, the log-sum-exp (LSE) trick provides numerical stability. The log-softmax for each component is \log \sigma_i = z_i - \log \sum_j \exp(z_j). Direct evaluation of the sum can still overflow for large positive inputs or underflow for large negative ones, so a stabilized LSE incorporates the subtract-max: \log \sum_j \exp(z_j) = m + \log \sum_j \exp(z_j - m), where m = \max_k z_k. This formulation avoids both overflow in the exponentials and underflow in the summation, enabling accurate computation even for extreme input ranges. Stable implementations of logsumexp are essential in probabilistic modeling and optimization.

In high-dimensional settings, such as attention mechanisms in transformers with long sequences or output layers where the vocabulary size K reaches tens of thousands, full softmax computation becomes computationally prohibitive due to O(K) or quadratic scaling. To address this, alternatives like sparsemax replace the dense softmax with a sparse variant that sets small probabilities exactly to zero, producing a sparse probability distribution while maintaining differentiability. Sparsemax is particularly useful in multi-label classification and attention, as it focuses computation on the most relevant elements. Additionally, efficient attention methods, such as sparse transformers, approximate full softmax attention by evaluating only a subset of query-key pairs, while kernel-based and low-rank approximations reduce memory and time complexity to near-linear in sequence length. These techniques preserve much of the expressive power of full softmax for large-scale applications.

Major numerical libraries incorporate these stability measures into their softmax implementations. For instance, SciPy's scipy.special.softmax applies the subtract-max trick internally to handle a wide range of input scales reliably. Similarly, PyTorch's torch.nn.functional.softmax uses dimension-specific stable computation, subtracting the maximum along the specified axis to ensure robustness in deep learning workflows. These built-in functions allow practitioners to compute softmax without manual intervention for stability.
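
A minimal NumPy sketch of the stabilized log-sum-exp and log-softmax computations (function names illustrative; production code would typically rely on library routines such as those mentioned above).

```python
import numpy as np

def logsumexp(z):
    """Stabilized log(sum_j exp(z_j)) = m + log(sum_j exp(z_j - m))."""
    z = np.asarray(z, dtype=float)
    m = z.max()
    return m + np.log(np.exp(z - m).sum())

def log_softmax(z):
    """log sigma_i = z_i - logsumexp(z), computed without overflow."""
    return z - logsumexp(z)

z = np.array([1000.0, 999.0, 998.0])   # naive exp(z) would overflow
print(log_softmax(z))                   # ~[-0.408, -1.408, -2.408]
print(np.exp(log_softmax(z)).sum())     # ~1.0
```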

Applications

In Neural Networks

In neural networks, the softmax function serves as a key activation in the output layer for multi-class classification tasks, transforming a vector of raw scores, or logits, into a probability distribution over multiple classes that sums to one. This normalization enables the network to produce interpretable outputs representing the likelihood of each class, facilitating decision-making in downstream applications.

The softmax output is typically paired with the cross-entropy loss during training, which measures the divergence between the predicted probability distribution \sigma(\mathbf{z}) and the true one-hot encoded target \mathbf{y}. The loss is defined as -\sum_{i=1}^K y_i \log \sigma(\mathbf{z})_i, where K is the number of classes, and this combination yields computationally efficient gradients for backpropagation, specifically \frac{\partial L}{\partial z_j} = \sigma(\mathbf{z})_j - y_j for the j-th logit. This simplification arises because the derivatives of the softmax and the negative log-likelihood cancel in a manner that avoids explicit Jacobian computations, accelerating optimization in multi-class settings.

The adoption of softmax in neural networks gained prominence in the late 1980s and 1990s, as researchers sought probabilistic interpretations for feedforward classifiers amid the resurgence of connectionist models. John Bridle's work introduced the term "softmax" and advocated its use for modeling conditional probabilities in classification networks, bridging statistical pattern recognition with neural architectures. This era's emphasis on probabilistic outputs helped establish softmax as a standard for supervised learning paradigms.

A notable variant involves scaling the logits by a temperature parameter T > 0 before applying softmax, yielding \sigma(\mathbf{z}/T)_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, which controls the distribution's sharpness. When T > 1, the output softens, distributing probability more evenly across classes to aid in model calibration or in knowledge distillation from larger networks to smaller ones. In knowledge distillation, the softened teacher probabilities guide the student via a distillation loss, improving generalization while compressing model size, as demonstrated in seminal work on transferring knowledge across neural networks. For calibration, post-hoc temperature scaling adjusts overconfident predictions in trained models, enhancing reliability without retraining.
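
The gradient identity \sigma(\mathbf{z}) - \mathbf{y} can be checked directly; the following NumPy sketch (illustrative names, with a finite-difference verification) demonstrates the identity rather than a full training loop.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy_loss_and_grad(z, y_onehot):
    """Loss -sum_i y_i log sigma(z)_i and its gradient sigma(z) - y w.r.t. the logits."""
    p = softmax(z)
    loss = -np.sum(y_onehot * np.log(p))
    grad = p - y_onehot
    return loss, grad

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])            # true class is index 1
loss, grad = cross_entropy_loss_and_grad(z, y)

# Check the analytic gradient against central finite differences.
eps = 1e-6
num = np.array([(cross_entropy_loss_and_grad(z + eps * np.eye(3)[j], y)[0]
                 - cross_entropy_loss_and_grad(z - eps * np.eye(3)[j], y)[0]) / (2 * eps)
                for j in range(3)])
print(loss)                               # ~1.417
print(np.allclose(grad, num, atol=1e-5))  # True
```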

In Reinforcement Learning

In reinforcement learning, the softmax function plays a central role in parameterizing stochastic policies for discrete action spaces, enabling agents to select actions probabilistically based on estimated action values. Specifically, the policy \pi(a|s) is defined as \pi(a|s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{b} \exp(Q(s,b)/\tau)}, where Q(s,a) denotes the action-value function for state s and action a, and \tau > 0 is a temperature parameter that scales the logits before applying the softmax. This formulation ensures that the policy outputs a valid probability distribution over actions, with higher Q-values receiving proportionally greater probability mass.

The temperature parameter \tau governs the balance between exploration and exploitation in the policy. A high \tau flattens the distribution, promoting exploration by assigning more uniform probabilities to actions and encouraging the agent to try suboptimal options to discover better long-term rewards. Conversely, a low \tau sharpens the distribution toward the action with the maximum Q-value, favoring exploitation to maximize immediate expected returns. This adjustability allows softmax policies to adapt dynamically during training, often starting with a higher \tau for broad search and annealing to lower values for refinement.

Softmax policies are integral to several policy gradient algorithms, particularly actor-critic methods for discrete actions. In REINFORCE, a foundational policy gradient algorithm, the softmax parameterization facilitates direct optimization of the policy parameters via stochastic gradient ascent on the expected return, using complete episode trajectories to estimate gradients. Similarly, in proximal policy optimization (PPO), a widely adopted on-policy method, the policy network outputs logits that are passed through softmax to yield action probabilities, enabling clipped surrogate objectives for stable updates over multiple epochs in discrete-action environments.

The primary advantages of softmax parameterization in these algorithms stem from its differentiability, which permits efficient gradient-based optimization of the expected reward. This smoothness supports convergence guarantees under certain conditions and allows seamless integration with neural-network actors, making it suitable for high-dimensional state spaces.
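
A minimal sketch of a softmax policy over hypothetical Q-values, with the temperature controlling exploration (function name and values illustrative, not tied to any particular RL library).

```python
import numpy as np

def softmax_policy(q_values, tau=1.0):
    """pi(a|s) proportional to exp(Q(s,a)/tau); tau trades exploration for exploitation."""
    z = np.asarray(q_values, dtype=float) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

q = np.array([1.0, 2.0, 1.5])             # hypothetical action values for one state
rng = np.random.default_rng(42)

for tau in (5.0, 1.0, 0.1):
    probs = softmax_policy(q, tau)
    action = rng.choice(len(q), p=probs)  # sample an action from the stochastic policy
    print(tau, np.round(probs, 3), action)
```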

In Modern Architectures

In modern architectures, the softmax function plays a pivotal role in the self-attention mechanisms of transformer models, where it normalizes the similarities between query and key vectors to produce attention weights. Specifically, in scaled dot-product attention, the attention weights \alpha_{ij} are computed as \alpha_{ij} = \sigma\left(\frac{Q_i K_j^T}{\sqrt{d_k}}\right), where \sigma denotes the softmax operation (applied over all keys j for each query i), Q_i and K_j are the query and key vectors, and d_k is the dimensionality of the keys; this scaling by \sqrt{d_k} prevents high-dimensional dot products from growing so large in magnitude that they push the softmax into regions with vanishing gradients. This formulation enables the model to weigh and aggregate input representations dynamically, allowing it to capture long-range dependencies in sequences without relying on recurrent structures. The transformer architecture, which relies on this softmax-based attention as its core component, was introduced in the seminal work demonstrating its superiority over recurrent and convolutional models for sequence transduction tasks such as machine translation.

Despite its effectiveness, the quadratic complexity O(n^2) of softmax attention with respect to sequence length n—stemming from the need to compute pairwise similarities—poses significant challenges for processing long sequences, leading to high memory and computational demands in large-scale models. To address this, researchers have developed efficient variants such as sparse attention, which factorizes the attention matrix to focus on a subset of connections, reducing complexity to O(n \sqrt{n}) or better while preserving expressive power for tasks like sequence generation. Similarly, linear attention approximations replace the softmax with kernel-based formulations that enable associative rearrangements, achieving O(n) complexity and enabling faster autoregressive prediction on very long sequences without substantial performance degradation. These innovations have been crucial for scaling transformers to handle inputs exceeding thousands of tokens, where traditional softmax attention becomes prohibitive.

The integration of softmax attention has been central to the success of influential models like GPT and BERT, which have revolutionized natural language processing by leveraging transformer architectures for pre-training on vast corpora and fine-tuning across diverse tasks. For instance, GPT employs decoder-only transformers with softmax-normalized attention to generate coherent text autoregressively, achieving state-of-the-art results in unsupervised language modeling. BERT, using encoder-only transformers, applies bidirectional softmax attention to learn contextual embeddings, markedly improving performance on benchmarks like GLUE and SQuAD. As of 2025, softmax-based attention remains foundational in these and subsequent architectures, underpinning advancements in multimodal and long-context models while inspiring ongoing optimizations for efficiency and scalability.
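
A toy NumPy sketch of scaled dot-product attention (random matrices and illustrative dimensions; real transformer implementations add masking, batching, and multiple heads).

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, with a row-wise softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # pairwise query-key similarities
    weights = softmax(scores, axis=-1)  # each row is a probability distribution over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                   # toy sequence length and dimensions
Q, K, V = rng.normal(size=(n, d_k)), rng.normal(size=(n, d_k)), rng.normal(size=(n, d_v))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(axis=-1))        # (4, 8) and rows summing to 1
```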

Historical Context

Early Foundations

The foundational ideas underlying the softmax function emerged in the realm of statistical mechanics in the nineteenth century. In 1868, Ludwig Boltzmann developed a probabilistic framework for describing the distribution of energy among particles in a gas at thermal equilibrium. In his seminal paper "Studien über das Gleichgewicht der lebendigen Kraft zwischen bewegten materiellen Punkten," Boltzmann established that the relative probability of a particle occupying a state with energy E_i is proportional to e^{-E_i / kT}, where k is Boltzmann's constant and T is the absolute temperature; this exponential form arises from combinatorial arguments identifying the most probable distribution under energy constraints, providing the first rigorous basis for normalized exponential probabilities in physical systems. This distribution, now known as the Boltzmann distribution, laid the groundwork for later probabilistic normalizations by linking microscopic energy states to macroscopic equilibrium behaviors in gases and other systems.

By the mid-20th century, the exponential normalization concept transitioned into statistics, particularly through the evolution of logistic regression models. During the 1950s and 1960s, statisticians generalized the binary logit model—initially used for dichotomous outcomes—to handle multinomial cases with multiple categories. David Cox's 1958 paper "The Regression Analysis of Binary Sequences" introduced binary logistic regression, where the log-odds are linear in covariates. Extensions to multinomial models followed in the 1960s, including work by Cox in 1966, proposing formulations where the log-odds of category probabilities are linear in covariates, resulting in probabilities given by normalized exponentials: for categories j = 1, \dots, K, the probability is \frac{\exp(\mathbf{x}^T \boldsymbol{\beta}_j)}{\sum_{k=1}^K \exp(\mathbf{x}^T \boldsymbol{\beta}_k)}. This formulation allowed for the analysis of categorical data in fields like the social sciences, building on earlier models but favoring the exponential form for its mathematical tractability and interpretability. The approach gained traction as a tool for regression with categorical responses, emphasizing the normalization step to ensure probabilities sum to one.

These statistical developments found early applications in mathematical psychology and behavioral choice modeling, where the normalized exponential form proved useful for predicting selections among discrete alternatives. A key contribution came from R. Duncan Luce's 1959 work "Individual Choice Behavior: A Theoretical Analysis," which introduced the choice axiom stating that the probability of selecting an alternative is proportional to its inherent "scale value," independent of irrelevant options; this leads to choice probabilities of the form P(i|S) = \frac{v_i}{\sum_{j \in S} v_j}, where v_i > 0 represents the strength of alternative i in choice set S. Luce's model, often implemented with exponential scale values for positive homogeneity, was applied to empirical data on decision processes such as consumer preferences and perceptual judgments, and influenced subsequent work in discrete choice theory. In these contexts, the function was typically termed the "multinomial logit" or referred to simply as exponential normalization, without the designation "softmax," which emerged later in the neural network literature.

Development in AI

The softmax function emerged as a key component in neural networks during the late 1980s, building on earlier probabilistic interpretations to enable normalized outputs for classification tasks. Although the mathematical form predates these applications, its explicit adoption in neural network contexts began with precursors in multi-layer perceptrons, where similar exponential normalizations were implied for producing probability distributions over multiple classes. For instance, the influential 1986 work by Rumelhart, Hinton, and Williams on learning by backpropagating errors through networks implicitly relied on such normalizations to handle multi-class problems, marking an early integration into connectionist architectures despite the term not yet being coined.

The term "softmax" was formally introduced by John S. Bridle in his 1989 paper presented at the Neural Information Processing Systems conference, where he described it as a "normalized exponential" or "softmax" output stage for stochastic model recognition algorithms trained as networks. Bridle emphasized its role in maximizing mutual information between inputs and probabilistic outputs, positioning it as a natural extension of the logistic function for multi-class scenarios in probabilistic neural networks. This coinage solidified softmax as a standard activation for generating interpretable probability distributions at the output layer of networks, facilitating applications in speech and pattern recognition during the resurgence of connectionist approaches in the early 1990s.

Softmax gained widespread prominence during the deep learning renaissance post-2010, becoming integral to convolutional neural networks and subsequent architectures. Its pivotal role was highlighted in the 2012 AlexNet model by Krizhevsky, Sutskever, and Hinton, which employed softmax in the final layer to classify images into 1000 categories, contributing to a breakthrough error-rate reduction that catalyzed the adoption of deep networks in computer vision. By 2017, softmax was embedded in the transformer architecture introduced by Vaswani et al., where it normalizes attention scores to weigh token importance in sequence modeling, underpinning advancements in natural language processing and enabling scalable training of models like the BERT and GPT series.

As of 2025, while the core softmax formulation remains unchanged, ongoing research focuses on scalable and hardware-efficient variants to address computational bottlenecks in large language models with billions of parameters. Innovations include approximate softmax implementations that reduce division operations and memory access, as explored in hardware-oriented algorithms for transformer-based LLMs, ensuring compatibility with edge devices and massive-scale training without altering the function's probabilistic essence. Recent developments feature variants such as adaptive sparse softmax for efficient sampling in high-dimensional spaces and self-adjusting softmax for dynamic attention in long-sequence models. These developments reflect softmax's enduring centrality in deep learning, adapting to efficiency demands rather than fundamental redesign.

Practical Aspects

Examples

To illustrate the softmax function, consider a vector of input logits \mathbf{z} = [2, 1, 0.1]. The softmax output is computed as \sigma(\mathbf{z})_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}, yielding probabilities of approximately [0.659, 0.242, 0.099], which sum to 1 and emphasize the largest input value. For numerical stability, especially to avoid overflow in exponential computations, the subtract-max trick shifts the inputs by their maximum value: let m = \max(\mathbf{z}) = 2, so the adjusted vector is [0, -1, -1.9]. The exponentials are then \exp(0) = 1, \exp(-1) \approx 0.368, and \exp(-1.9) \approx 0.150, with their sum approximately 1.518. Dividing each exponential by this sum gives the same output [0.659, 0.242, 0.099]. This stable approach is commonly implemented in numerical libraries. In Python using NumPy, the function can be defined as:
```python
import numpy as np

def softmax(z):
    return np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))

# Test on the example
z = np.array([2, 1, 0.1])
print(softmax(z))  # Output: [0.65900114 0.24243295 0.09856591]
```
This implementation applies the subtract-max trick to ensure robustness across a wide range of input scales. Edge cases highlight the function's behavior. If all inputs are equal, such as \mathbf{z} = [1, 1, 1], the output is the uniform distribution [1/3, 1/3, 1/3], reflecting equal probabilities. Conversely, with one dominant input like \mathbf{z} = [10, 0, 0], the output approximates a one-hot vector, roughly [0.99991, 0.000045, 0.000045], concentrating almost all probability on the maximum.
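
These edge cases can be confirmed directly with the same stabilized implementation (a small illustrative check).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

print(softmax(np.array([1.0, 1.0, 1.0])))   # [0.33333333 0.33333333 0.33333333]
print(softmax(np.array([10.0, 0.0, 0.0])))  # ~[9.9991e-01 4.5396e-05 4.5396e-05]
```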

Alternatives and Variants

Temperature scaling serves as a simple variant of the softmax function, introducing a scalar T > 0 to modulate the distribution's sharpness without altering its core normalization. This adjustment, applied by scaling the input logits before softmax computation, sharpens the output for low T (emphasizing confident predictions) or flattens it for high T (promoting uniformity), aiding in tasks like model calibration where overconfident outputs need tempering. Sparsemax offers a sparse alternative to softmax, projecting input vectors onto the probability simplex while enforcing zeros in low-scoring components, which yields interpretable, non-uniform distributions particularly beneficial for attention mechanisms and multi-label classification. In binary or multi-label settings, the sigmoid activation function provides a direct analog to softmax for independent probability outputs per class, differing from softmax's mutual exclusivity. While no single activation fully replaces softmax for multi-class mutually exclusive problems, rectified linear unit (ReLU) variants are standard in hidden layers to introduce non-linearity, but they require softmax or a similar output activation for probabilistic interpretation. The Gumbel-softmax distribution extends softmax by incorporating Gumbel noise for reparameterization, enabling differentiable approximations of discrete categorical sampling essential for variational inference and reinforcement learning with discrete actions. Alternatives such as sparsemax and Gumbel-softmax are employed when softmax's dense outputs hinder interpretability, increase computational overhead on sparsity-prone data, or impede gradient flow through discrete selections, including low-precision hardware environments where zeroed computations reduce density.
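
A minimal sketch of drawing a relaxed Gumbel-softmax sample (illustrative function name, assuming standard Gumbel noise -\log(-\log U) with U uniform on (0, 1)).

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(low=1e-12, high=1.0, size=len(logits))
    g = -np.log(-np.log(u))                  # standard Gumbel(0, 1) noise
    z = (np.asarray(logits, dtype=float) + g) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
rng = np.random.default_rng(0)
print(gumbel_softmax_sample(logits, tau=1.0, rng=rng))  # soft sample over the categories
print(gumbel_softmax_sample(logits, tau=0.1, rng=rng))  # sharper, close to a one-hot vector
```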
