
Arg max

In mathematics and optimization, the arg max (or argmax) denotes the argument of the maximum, specifically the value or set of values in the domain of a function that achieve the function's global maximum output. For a function f defined on a domain D, \arg\max_{x \in D} f(x) = \{ x \in D \mid f(x) \geq f(y) \ \forall y \in D \}. This notation contrasts with the max operator, which returns the maximum value itself rather than the input that produces it. The arg max is a cornerstone of optimization theory, where it identifies optimal solutions in problems ranging from continuous to discrete domains, often subject to constraints. Key properties include invariance under positive scaling and strictly increasing transformations of the objective function, ensuring that \arg\max f(x) = \arg\max (\theta f(x)) for \theta > 0 and \arg\max g(f(x)) = \arg\max f(x) for strictly monotonic increasing g. In cases where multiple inputs yield the same maximum, arg max returns the entire set, though in practice it is often implemented to select a single representative, such as the first occurrence or the lowest index in settings like arrays. In statistics, arg max plays a central role in maximum likelihood estimation, where parameter estimates \hat{\theta} are computed as \hat{\theta} = \arg\max_{\theta} L(\theta \mid x), the value of \theta that maximizes the likelihood function L given observed data x. This approach leverages the monotonicity of the logarithm to simplify computations via the log-likelihood, and it extends to related methods like maximum a posteriori estimation by incorporating priors. In machine learning, arg max is central to tasks such as classification, where the predicted label is selected as \arg\max_y P(y \mid x), corresponding to the Bayes optimal classifier under the zero-one loss. Its applications span generative models, reinforcement learning, and algorithmic implementations in libraries like NumPy or PyTorch, underscoring its versatility across theoretical and computational contexts.

Definition and Notation

Formal Definition

In mathematics, the arg max (or argmax) of a function f: X \to Y over a subset S \subseteq X is defined as the set of all points in S that achieve the maximum value of f on S: \operatorname{argmax}_S f = \{ x \in S : f(s) \leq f(x) \ \forall \, s \in S \}. This formulation captures the points where f attains its global maximum relative to S. When the codomain Y is the extended real line [-\infty, \infty], the definition is equivalently expressed using the supremum to handle cases where a maximum may not be attained: \operatorname{argmax}_S f = \{ x \in S : f(x) = \sup_{s \in S} f(s) \}. In this setting, the set comprises the points attaining the supremum, which may be empty if the supremum is not achieved. The arg max operator formally yields a set, which could be empty, a singleton, or contain multiple elements if the maximum is achieved at several points. In practice, particularly in fields like optimization and machine learning, it is often treated as returning a single value, with uniqueness assumed or one maximizer selected arbitrarily when multiple exist.
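The set-valued definition can be made concrete for a finite domain with a short sketch; the helper argmax_set below is purely illustrative and not part of any standard library.

```python
# Minimal sketch (hypothetical helper): the arg max of f over a finite domain S,
# returned as the full set of maximizers rather than a single index.
def argmax_set(f, S):
    best = max(f(x) for x in S)
    return {x for x in S if f(x) == best}

# f(x) = -(x - 2)^2 on {0, ..., 4} has the unique maximizer 2, while a constant
# function is maximized everywhere, so the whole domain is returned.
print(argmax_set(lambda x: -(x - 2) ** 2, range(5)))   # {2}
print(argmax_set(lambda x: 0, range(5)))               # {0, 1, 2, 3, 4}
```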

Common Notations

The standard notation for the arg max of a function f over a set S is \arg\max_{x \in S} f(x), denoting the set of inputs x in S that achieve the maximum value of f. When the domain is implicit or understood from context, this simplifies to \arg\max_x f(x). Common variations include subscripted or superscripted forms to emphasize the optimization variable or constraints, such as x^* = \arg\max_x f(x) to denote a specific maximizer x^*. These notations tie back to the set-based definition by specifying the points where the function attains its supremum. In programming and computational contexts, arg max is implemented via functions that return indices of maximum values rather than sets. For instance, Python's NumPy library provides numpy.argmax(a, axis=None), which returns the index of the maximum element in array a along the specified axis. Similarly, PyTorch's torch.argmax(input, dim=None) computes the indices of the maximum values in a tensor input across a specified dimension.
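A brief sketch, assuming NumPy and PyTorch are installed, contrasts the two library calls; note that both return a single index (the first maximizer, in the case of ties) rather than the full arg max set.

```python
import numpy as np
import torch

a = np.array([0.1, 0.7, 0.7, 0.2])
# numpy.argmax returns the index of the first occurrence of the maximum value.
print(np.argmax(a))                     # 1

t = torch.tensor([[0.1, 0.7], [0.9, 0.2]])
# torch.argmax with dim returns the indices of the maxima along that dimension.
print(torch.argmax(t, dim=1))           # tensor([1, 0])
```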

Arg Min

The arg min operator, denoted as arg min or argmin, is the counterpart to arg max that identifies the input values achieving the minimum of a function over a specified domain. It plays a fundamental role in optimization and analysis, focusing on minimizers rather than maximizers. The formal definition of arg min for a function f: S \to \mathbb{R} over S is the set of minimizers given by \operatorname{argmin}_S f = \{ x \in S : f(s) \geq f(x) \ \forall \, s \in S \}. This set may be empty if no minimum exists, a singleton if the minimum is attained at a unique point, or contain multiple elements otherwise. Common notational variants parallel those of arg max, such as \operatorname{argmin}_{x \in S} f(x) to emphasize the optimization variable, or x^* = \arg\min_x f(x) for a specific minimizer. In computational contexts, implementations like numpy.argmin() in the NumPy library return the index of the minimum element in an array, facilitating numerical minimization tasks. For extended real-valued functions, where f may take values in [-\infty, +\infty], special cases arise: if f \equiv +\infty on S, then by convention \operatorname{argmin}_S f = \emptyset; otherwise, the set consists of points achieving the infimum \inf_{s \in S} f(s), and may still be empty if that infimum is not attained. This operator is symmetric to arg max as its dual under function negation, satisfying \operatorname{argmin} f = \operatorname{argmax} (-f). Arg min relates to the infimum concept by specifying the points where the function attains its greatest lower bound.
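The duality \operatorname{argmin} f = \operatorname{argmax}(-f) is easy to check numerically; the snippet below is a minimal sketch using NumPy.

```python
import numpy as np

values = np.array([3.0, -1.5, 2.0, -1.5])
# Index of the (first) minimum element.
i_min = np.argmin(values)        # 1
# Duality: negating the values and taking argmax yields the same index.
i_dual = np.argmax(-values)      # 1
assert i_min == i_dual
```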

Max and Supremum

In mathematics, the supremum of a subset S of the real numbers, denoted \sup S, is defined as the least upper bound of S, meaning it is the smallest real number (or +\infty if unbounded above) that is greater than or equal to every element in S. Formally, for a nonempty subset S \subseteq \mathbb{R}, \sup S is the infimum of the set of all upper bounds of S. This least upper bound always exists for nonempty sets bounded above, by the completeness axiom of the real numbers. The maximum of a set S, denoted \max S, is the largest element within S itself, if such an element exists. In this case, \max S = \sup S, as the maximum serves as both an upper bound and the least such bound. However, if no element of S achieves the supremum—such as when S is open or unbounded in a way that prevents attainment—the maximum does not exist, though the supremum remains well-defined (possibly as +\infty). For a function f: S \to Y where Y is an ordered set (typically \mathbb{R}), the supremum \sup_{x \in S} f(x) is the least upper bound of the image f(S), formally given by \sup_{x \in S} f(x) = \inf \{ y \in Y : y \geq f(x) \text{ for all } x \in S \}. This value may or may not be attained by f at any point in S; the maximum \max_{x \in S} f(x) exists only if there is some x_0 \in S such that f(x_0) = \sup_{x \in S} f(x), in which case it equals the supremum. Otherwise, the supremum provides the relevant bound without a corresponding maximum. In the context of the arg max operator, which identifies points in S where f achieves its maximum value, the supremum plays a foundational role: the arg max set is nonempty precisely when the supremum is attained as a maximum, allowing identification of maximizers; however, the supremum itself can always be computed and used as the optimal value even if the arg max is empty. This distinction ensures that optimization problems can reference the least upper bound regardless of attainability. A classic example of a non-attained supremum is the function f(x) = x over the open interval S = (0, 1), where \sup_{x \in (0,1)} x = 1, as 1 bounds the set from above and no smaller number does, but no x \in (0,1) equals 1, so the maximum does not exist and the arg max is empty.
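A small numerical sketch of this last example shows the effect: on ever finer grids strictly inside (0, 1), the sampled maximum approaches the supremum 1 but never reaches it.

```python
import numpy as np

# f(x) = x on the open interval (0, 1): the supremum is 1 but is never attained.
for n in (10, 100, 1000):
    x = np.linspace(0, 1, n + 2)[1:-1]   # interior grid points, excluding 0 and 1
    print(n, x[np.argmax(x)])            # 0.909..., 0.990..., 0.999... — never 1.0
```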

Mathematical Properties

Existence Conditions

The existence of the arg max set for a function f: S \to \mathbb{R}, defined as \arg\max_{x \in S} f(x) = \{x \in S \mid f(x) = \sup_{y \in S} f(y)\}, requires that the supremum value is attained at least at one point in the domain S. A fundamental result guaranteeing this is the extreme value theorem, which states that if f is continuous and S is a nonempty compact subset of \mathbb{R}^n, then f attains its maximum on S, so \arg\max_{x \in S} f(x) is nonempty (and in fact compact). This theorem extends the classical case for univariate functions, where the Weierstrass extreme value theorem ensures that a continuous function on a closed and bounded interval [a, b] attains both its maximum and minimum values. Nonexistence of \arg\max arises when these conditions fail, such as on unbounded domains. For instance, consider f(x) = x defined on S = \mathbb{R}; here, \sup_{x \in \mathbb{R}} f(x) = +\infty, which is not attained at any finite point, so \arg\max_{x \in \mathbb{R}} f(x) = \emptyset. Similarly, discontinuities on compact sets can prevent attainment of the supremum. A standard counterexample is the function f: [0, 1] \to \mathbb{R} given by f(x) = x for x \in [0, 1) and f(1) = \frac{1}{2}; this is bounded with \sup_{x \in [0,1]} f(x) = 1, but the value 1 is not attained anywhere in [0, 1], yielding \arg\max_{x \in [0,1]} f(x) = \emptyset. In more general settings, such as topological spaces, existence can be ensured under weaker continuity assumptions. Specifically, if S is a compact topological space and f: S \to \mathbb{R} is upper semicontinuous, then f attains its maximum on S, so \arg\max_{x \in S} f(x) is nonempty and compact. Upper semicontinuity replaces full continuity, allowing for functions that are "continuous from above" while still guaranteeing the supremum is achieved.
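The bounded-but-unattained counterexample can be checked numerically; the sketch below samples the function on a grid and shows the sampled maximum staying strictly below the supremum 1.

```python
import numpy as np

# f(x) = x on [0, 1) with f(1) = 1/2: bounded on a compact interval, yet the
# supremum 1 is never attained, so the arg max set is empty.
def f(x):
    return np.where(x == 1.0, 0.5, x)

x = np.linspace(0.0, 1.0, 10001)
print(f(x).max())           # 0.9999 — close to, but strictly below, the supremum 1
print(x[np.argmax(f(x))])   # the grid maximizer sits just below x = 1
```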

Uniqueness and Multiplicity

The arg max of a function f over a domain yields a unique point under certain regularity conditions on f. Specifically, if f is strictly quasiconcave on a convex domain, then, assuming a maximizer exists, the set \arg\max f is a singleton, as any two distinct points in the set would violate strict quasiconcavity by allowing a convex combination with a strictly higher value. This property ensures that local maxima coincide with the unique global maximum, distinguishing strict quasiconcavity from weaker forms that permit multiple points achieving the same value. In contrast, multiplicity arises when f exhibits plateaus, such as regions where f is constant over a subset of the domain. For instance, if f(x) = c for all x in the domain, then \arg\max f coincides with the entire domain, resulting in infinitely many maximizers whenever the domain is infinite. Such cases highlight how flat level sets can expand the arg max set beyond a single point, even when a maximum value is attained. Relations to convexity further elucidate the structure of \arg\max f. For a convex function over a compact convex set, any maximum, if it exists, is attained at an extreme point of the set, since convexity along line segments prevents an interior point from exceeding every boundary value. Moreover, if f is strictly concave—equivalently, if -f is strictly convex—then \arg\max f is unique when it exists, since the corresponding minimization of -f has a unique minimizer under strict convexity. In unbounded domains, the arg max set may be empty if f is unbounded above, or exhibit infinite multiplicity if f achieves its supremum constantly over an unbounded region. These scenarios underscore the interplay with existence conditions, such as compactness of the domain ensuring a nonempty, compact arg max under continuity.
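A plateau is easy to exhibit numerically; in the sketch below the trapezoid-shaped function is constant at its maximum on the whole subinterval [1, 2], so the grid arg max is a range of points rather than a single one.

```python
import numpy as np

# f(x) = min(x, 1, 3 - x) on [0, 3] equals its maximum value 1 on all of [1, 2].
x = np.linspace(0.0, 3.0, 301)
f = np.minimum.reduce([x, np.ones_like(x), 3.0 - x])
maximizers = x[np.isclose(f, f.max())]
print(maximizers.min(), maximizers.max())   # approximately 1.0 and 2.0
```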

Illustrative Examples

Univariate Functions

For univariate functions, the arg max operation identifies the input value(s) that achieve the highest output, providing insight into basic optimization behaviors. A simple example is the tent-shaped function f(x) = 1 - |x| defined over all real numbers \mathbb{R}. This increases to a peak of 1 at x = 0 and decreases symmetrically thereafter, yielding a unique maximizer: \arg\max_{x \in \mathbb{R}} f(x) = \{ 0 \}. Periodic functions can exhibit multiple points achieving the maximum. Consider f(x) = \cos x on the closed interval [0, 4\pi]. The cosine reaches its global maximum value of 1 wherever x is an integer multiple of the period 2\pi, here at the endpoints and the midpoint of the interval: \arg\max_{x \in [0, 4\pi]} f(x) = \{ 0, 2\pi, 4\pi \}. In contrast, some functions lack a maximum altogether. For the cubic function f(x) = x^3 over \mathbb{R}, the output grows without bound as x \to \infty while being strictly increasing everywhere, so no finite input attains the supremum and \arg\max_{x \in \mathbb{R}} f(x) = \emptyset. To compute the arg max for differentiable univariate functions, identify critical points by solving f'(x) = 0 (or where the derivative is undefined), then apply the second derivative test: if f''(c) < 0 at a critical point c, it is a local maximum. Graphing can visually confirm these points for intuition, especially in non-differentiable cases like the absolute value example. These univariate cases highlight how domain compactness influences existence and how smoothness affects uniqueness of the arg max.
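A short grid-based sketch illustrates the multiple-maximizer case and the fact that library routines report only one index; the sampling is chosen so the grid hits the peaks of the cosine exactly.

```python
import numpy as np

# f(x) = cos(x) on [0, 4*pi], sampled at multiples of pi/2 so the peaks lie on the grid.
x = np.linspace(0.0, 4.0 * np.pi, 9)
f = np.cos(x)

# numpy.argmax reports only the first maximizer (x = 0)...
print(x[np.argmax(f)])                     # 0.0
# ...whereas the full arg max set {0, 2*pi, 4*pi} needs all tying grid points.
print(x[np.isclose(f, f.max())] / np.pi)   # [0. 2. 4.]
```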

Multivariate Functions

For functions of multiple variables, the arg max extends the concept to vector-valued inputs, where the goal is to identify the point or set of points in the domain that maximize the function value. Unlike univariate cases, which involve scalar inputs and single derivatives, multivariate arg max requires analyzing partial derivatives and potentially higher-dimensional critical sets to locate maxima. This can lead to unique points, entire manifolds, or empty sets depending on the function and domain properties. A representative example of a unique arg max occurs with the quadratic f(x,y) = -(x-1)^2 - (y-2)^2 defined on \mathbb{R}^2. This describes an inverted paraboloid centered at (1,2), where the global maximum value of 0 is attained solely at the point (1,2). To compute this, set the gradient to zero: \nabla f = (-2(x-1), -2(y-2)) = (0,0), yielding the critical point (1,2). The Hessian matrix is H = \begin{pmatrix} -2 & 0 \\ 0 & -2 \end{pmatrix}, with determinant 4 > 0 and f_{xx} = -2 < 0, confirming a local (and global) maximum since the function decreases quadratically away from this point. Boundary checks are unnecessary here because the domain is all of \mathbb{R}^2 and the function approaches -\infty at infinity. In cases where the maximum is attained along a continuum, the arg max forms a set rather than a singleton. Consider f(x,y) = -(x - y)^2 on the compact square [0,1] \times [0,1]. This reaches its maximum value of 0 along the diagonal \{(t,t) : t \in [0,1]\}, as f(t,t) = 0 for all t, while f(x,y) < 0 off the diagonal. The gradient condition \nabla f = (-2(x-y), 2(x-y)) = (0,0) holds precisely when x = y, identifying the entire line as a critical set. The Hessian H = \begin{pmatrix} -2 & 2 \\ 2 & -2 \end{pmatrix} has determinant 0, indicating a degenerate case where the second derivative test is inconclusive; further analysis shows the function is flat along the line x = y and decreases perpendicular to it, verifying the maximum set. On this bounded domain, boundary evaluations confirm no higher values elsewhere. Computing arg max in the multivariate setting generally involves solving \nabla f = 0 for critical points interior to the domain, followed by the Hessian test for classification: if the Hessian is negative definite (all eigenvalues negative), it confirms a local maximum. For compact domains, boundary checks are essential, using techniques like Lagrange multipliers or parameterization. However, nonexistence arises on unbounded domains where the function has no upper bound. For instance, consider f(x,y) = x + y on the non-compact first quadrant \mathbb{R}^2_+ = \{(x,y) : x \geq 0, y \geq 0\}. As x or y tends to infinity, f(x,y) \to \infty, so the supremum is \infty and the arg max set is empty, with no finite point achieving a maximum.
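Both cases can be reproduced numerically; the sketch below (assuming SciPy is available) recovers the unique maximizer of the paraboloid by minimizing -f, and a coarse grid search exposes the diagonal of maximizers in the degenerate example.

```python
import numpy as np
from scipy.optimize import minimize

# Unique arg max of f(x, y) = -(x - 1)^2 - (y - 2)^2: maximize by minimizing -f.
neg_f = lambda p: (p[0] - 1.0) ** 2 + (p[1] - 2.0) ** 2
result = minimize(neg_f, x0=np.zeros(2))
print(result.x)                                        # approximately [1. 2.]

# Degenerate case f(x, y) = -(x - y)^2 on [0, 1]^2: every diagonal grid point ties.
g = np.linspace(0.0, 1.0, 11)
X, Y = np.meshgrid(g, g)
F = -(X - Y) ** 2
print(np.argwhere(np.isclose(F, F.max())).shape[0])    # 11 maximizers, one per diagonal point
```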

Applications

In Optimization

In mathematical optimization, the arg max operator denotes the set of points that achieve the global maximum of an objective function f(x), serving as the solution set to the maximization problem \max_x f(x). For unconstrained problems where f is continuous and the domain is compact, the extreme value theorem guarantees the existence of such a global arg max. In constrained settings, arg max solutions satisfy \max_x f(x) subject to equality constraints g(x) = 0 or inequality constraints g(x) \leq 0, often requiring specialized methods to identify candidates for evaluation. For equality-constrained optimization, the method of Lagrange multipliers identifies potential arg max points by constructing the Lagrangian \mathcal{L}(x, \lambda) = f(x) + \lambda^T g(x), where \lambda are the multipliers. Stationary points occur where the gradients satisfy \nabla f(x) = -\lambda^T \nabla g(x) and g(x) = 0, under suitable constraint qualifications such as linear independence of the gradients \nabla g(x). These points are then assessed by substituting into f to determine the global arg max among feasible candidates. When inequality constraints are present, as in problems of the form \max_x f(x) subject to g_i(x) \leq 0 for i = 1, \dots, m and equality constraints h_j(x) = 0 for j = 1, \dots, p, the Karush-Kuhn-Tucker (KKT) conditions provide necessary conditions for local optima, applicable under convexity or constraint qualifications such as Slater's condition. The KKT conditions include stationarity \nabla f(x) = \sum_{i=1}^m \mu_i \nabla g_i(x) + \sum_{j=1}^p \lambda_j \nabla h_j(x), primal feasibility g_i(x) \leq 0 and h_j(x) = 0, dual feasibility \mu_i \geq 0, and complementary slackness \mu_i g_i(x) = 0 for all i. These conditions generalize Lagrange multipliers and guide the identification of local arg max points, which are evaluated to find the global solution. Historically, arg max has played a key role in linear programming since the 1940s, particularly in George Dantzig's simplex method, where it selects the entering pivot variable as the nonbasic variable maximizing the rate of objective improvement via the most positive reduced cost. This iterative arg max selection drives the algorithm from an initial basic feasible solution toward the global optimum in polynomial average time for practical instances.
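In practice, constrained arg max problems are usually solved by minimizing the negated objective with a numerical solver; the sketch below (assuming SciPy is available) maximizes the illustrative objective f(x, y) = xy subject to x + y = 1, whose Lagrange-multiplier solution is (1/2, 1/2).

```python
import numpy as np
from scipy.optimize import minimize

# Maximize f(x, y) = x*y subject to x + y = 1 by minimizing -f under the constraint.
objective = lambda p: -(p[0] * p[1])
constraint = {"type": "eq", "fun": lambda p: p[0] + p[1] - 1.0}
result = minimize(objective, x0=np.array([0.2, 0.8]), constraints=[constraint])
print(result.x)   # approximately [0.5 0.5], matching the analytic Lagrange solution
```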

In Probability and Statistics

In probability and statistics, the arg max operation plays a central role in estimation and in characterizing distributional properties. One of the most prominent applications is maximum likelihood estimation (MLE), where the goal is to find the parameter value that maximizes the probability of observing the given data under a specified probabilistic model. Formally, for a likelihood function \mathcal{L}(\theta \mid \mathbf{x}) based on observed data \mathbf{x}, the MLE is given by \hat{\theta} = \arg\max_{\theta} \mathcal{L}(\theta \mid \mathbf{x}), where \theta represents the parameters of the model. Since the logarithm is a monotonically increasing function, this is equivalent to maximizing the log-likelihood: \hat{\theta} = \arg\max_{\theta} \log \mathcal{L}(\theta \mid \mathbf{x}). This approach, introduced by Ronald A. Fisher, provides a principled method for inference by selecting parameters that make the observed data most probable. Another key use of arg max arises in identifying the mode of a probability distribution, which is the point of highest density. For a continuous random variable with probability density function p(x), the mode is defined as m = \arg\max_x p(x), representing the value of x where the density achieves its global maximum. This measure of central tendency is particularly useful for multimodal or skewed distributions where the mean or median may not capture the peak concentration of probability mass. For instance, in the univariate Gaussian distribution with density p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), the mode coincides with the mean \mu, as the density is symmetric and unimodal. Empirical risk minimization (ERM) extends these ideas to model selection in statistical learning, framing estimation as an optimization over a loss function applied to the data. In this context, the empirical risk is the average loss over the sample, and the estimator is \hat{f} = \arg\min_f R_{\text{emp}}(f), where R_{\text{emp}}(f) = \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)) for observations (x_i, y_i) and loss L. Equivalently, for maximization-based formulations, it involves an arg max over the negative empirical risk. MLE itself is a special case of ERM when the loss is the negative log-likelihood, linking classical estimation to broader statistical procedures for predictive modeling. This principle ensures that the selected model performs well on the observed data, balancing fit and generalization under appropriate capacity constraints. The asymptotic properties of arg max-based estimators, such as those from MLE, provide theoretical justification for their reliability in large samples. Under standard regularity conditions—including differentiability of the log-likelihood, identifiability of the true parameter, and suitable moment bounds—the MLE \hat{\theta} is consistent, converging in probability to the true value \theta_0 as the sample size n \to \infty. This result, known as Cramér's consistency theorem, relies on the law of large numbers applied to the expected log-likelihood, ensuring that the maximizer of the sample criterion approaches the maximizer of the population criterion. Additionally, the MLE achieves asymptotic efficiency, attaining the Cramér-Rao lower bound on variance in the limit, which underscores its optimality among asymptotically unbiased estimators. These properties hold for a wide class of models, provided the parameter space is compact and the information matrix is positive definite.
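As a concrete sketch (assuming NumPy and SciPy are available), the MLE of a Gaussian mean with known unit variance can be obtained by numerically maximizing the log-likelihood, i.e., minimizing its negative; the result matches the closed-form answer, the sample mean.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Simulated data from a Gaussian with true mean 3.0 and standard deviation 1.0.
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=500)

# MLE as an arg max: theta_hat = argmax_theta log L(theta | data).
neg_log_lik = lambda mu: -norm.logpdf(data, loc=mu, scale=1.0).sum()
mu_hat = minimize(neg_log_lik, x0=np.array([0.0])).x[0]
print(mu_hat, data.mean())   # both close to 3.0; here the MLE equals the sample mean
```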

In Computer Science and Machine Learning

In computer science, the arg max operation plays a central role in discrete search algorithms for solving combinatorial optimization problems, where the goal is to find the input that maximizes an objective function over a finite set. In branch-and-bound algorithms, arg max is frequently used to select the most promising node from a priority queue based on upper bound estimates, guiding the search toward optimal solutions while pruning suboptimal branches. For instance, in solving integer linear programs, the algorithm iteratively chooses the node v = \arg\max \{ UB(\text{node}) : \text{node} \in \text{queue} \}, where UB denotes the upper bound, to efficiently explore the solution space. Similarly, dynamic programming leverages arg max in recursive formulations to compute optimal substructure, such as in the knapsack problem, where the value at each state is updated via dp[i][w] = \max(dp[i-1][w],\ v_i + dp[i-1][w - w_i]), with arg max implicitly selecting the better of the two item-inclusion decisions across subproblems. In machine learning, particularly in neural network classification, arg max is applied after the softmax function to determine the predicted class from output logits. The softmax converts raw scores z into a probability distribution P(y=i|x) = \frac{e^{z_i}}{\sum_j e^{z_j}}, ensuring the outputs sum to 1 and represent class probabilities. The predicted class is then \hat{y} = \arg\max_i P(y=i|x), selecting the index of the highest probability, which is standard in multi-class tasks like image recognition. This combination enables differentiable training via cross-entropy loss while yielding discrete decisions at inference. For approximating arg max in high-dimensional or continuous spaces, gradient-based methods like gradient ascent are employed, especially in policy optimization. In the REINFORCE algorithm, the policy parameters \theta are updated via \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta), where J(\theta) approximates the expected reward, effectively performing ascent toward the arg max policy without direct enumeration of the action or policy space. This approach handles large action spaces by sampling trajectories and estimating gradients, converging to near-optimal policies in complex environments. Computing the exact arg max for certain discrete problems is computationally challenging, as many are NP-hard; for example, finding the maximum independent set in a graph—equivalent to an arg max of set size over all pairwise non-adjacent vertex subsets—is NP-hard. Approximations often rely on metaheuristics like genetic algorithms, which evolve a population of candidate solutions through selection, crossover, and mutation to converge on high-quality arg max approximations for NP-hard combinatorial tasks such as the traveling salesman problem. These metaheuristics provide scalable solutions when exact methods are intractable, balancing exploration and exploitation to approach global optima.
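A minimal sketch of the softmax-then-arg-max pattern used at inference time; because softmax is strictly increasing in each logit, taking the arg max over the probabilities gives the same class as taking it directly over the logits.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.2, 0.3, 2.5, -0.7])
probs = softmax(logits)
predicted_class = np.argmax(probs)
print(probs.round(3), predicted_class)          # class 2 has the highest probability
print(np.argmax(logits) == predicted_class)     # True: arg max is invariant under softmax
```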