Loss function
A loss function, also known as a cost function, is a mathematical function that quantifies the discrepancy or penalty between a model's predicted output and the actual target value in optimization problems, serving as a core component in statistical estimation, machine learning, and decision theory.[1] In machine learning, particularly supervised learning, the loss function evaluates the performance of a hypothesis or model on training data, providing a measure of how "bad" the predictions are, with the objective of minimizing this loss to improve accuracy and generalization through iterative optimization techniques like gradient descent.[1] The choice of loss function is pivotal, as it influences the training dynamics, model convergence, and suitability for specific tasks such as regression or classification, ensuring the model learns meaningful patterns from data.

Common loss functions are tailored to problem types: for regression, where continuous outputs are predicted, the mean squared error (MSE) computes the average of squared differences between predictions and targets, while the mean absolute error (MAE) uses absolute differences for robustness to outliers; in classification, binary cross-entropy measures divergence for two-class problems, and categorical cross-entropy extends this to multiple classes, with hinge loss often used in support vector machines to maximize margins.[1] Specialized variants, such as Huber loss for combining MSE and MAE benefits or Dice loss for image segmentation, address domain-specific challenges like noisy data or imbalanced classes.[1]

The origins of loss functions trace back to the method of least squares, formalized by Adrien-Marie Legendre in 1805 for astronomical data fitting and independently developed by Carl Friedrich Gauss around 1809, which minimizes the sum of squared residuals as a foundational error measure.[2] This approach evolved through maximum likelihood estimation in the early 20th century[3] and was rigorously integrated into statistical decision theory by Abraham Wald in the 1940s, framing estimation and hypothesis testing as risk-minimization problems under defined losses.[4] In contemporary machine learning, especially deep learning since the 2010s, loss functions have advanced to include adversarial and robust forms, such as those in generative adversarial networks (GANs), reflecting ongoing innovations in handling complex, high-dimensional data.
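For concreteness, the common losses named above can be computed directly. The following sketch uses NumPy; the function names and sample values are illustrative rather than drawn from any particular library.

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: average of squared differences."""
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    """Mean absolute error: average of absolute differences."""
    return np.mean(np.abs(y - y_hat))

def binary_cross_entropy(y, p, eps=1e-12):
    """Cross-entropy for labels y in {0, 1} and predicted probabilities p."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def hinge(y, score):
    """Hinge loss for labels y in {-1, +1} and real-valued classifier scores."""
    return np.mean(np.maximum(0.0, 1.0 - y * score))

# Regression: MSE punishes the one large residual far more than MAE does.
y_true, y_pred = np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.8, 6.0])
print(mse(y_true, y_pred), mae(y_true, y_pred))

# Classification: confident wrong probabilities incur large cross-entropy.
labels, probs = np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(labels, probs))
print(hinge(np.array([1, -1, 1]), np.array([0.8, -0.5, -0.2])))
```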
Fundamentals

Definition
In statistical decision theory, a loss function L(\theta, a) is defined as a measure that quantifies the penalty or cost incurred by selecting an action a from the action space when the true state of nature is the parameter \theta from the parameter space.[5] In the specific contexts of estimation and prediction, this is commonly expressed as L(y, \hat{y}), where y represents the true value and \hat{y} the predicted or estimated value, thereby assessing the discrepancy between prediction and reality.[6] The function maps this discrepancy to a non-negative real number, typically achieving its minimum value (often zero) when \hat{y} = y, indicating no error.[7]

The concept of the loss function was formalized within statistical decision theory by Abraham Wald during the 1940s and 1950s, building on earlier ideas in statistics and economics to provide a rigorous framework for evaluating decisions under uncertainty.[8] Wald's work, particularly his 1950 monograph Statistical Decision Functions, established loss functions as a cornerstone for analyzing estimation and hypothesis testing problems.

Loss functions presuppose familiarity with basic mathematical functions and the principles of estimation, where one seeks to infer unknown parameters from observed data. They are essential for model evaluation and optimization because they transform abstract decision-making problems into concrete optimization tasks, allowing the selection of procedures that minimize penalties for errors in a principled manner.[6] By defining the "badness" of decisions, loss functions enable the comparison of competing estimators or predictors based on their performance against potential true states.[5]

Common general properties of loss functions include non-negativity and monotonic increase with the magnitude of error, ensuring that larger deviations incur higher costs. Many are convex, which supports efficient optimization techniques like gradient descent, though this is not universal. Symmetry, in the sense that L(y, \hat{y}) = L(\hat{y}, y), holds for some losses, such as squared error, but asymmetric variants exist to reflect directional penalties in certain applications. In stochastic environments, loss functions extend to expected loss, averaging over probabilistic outcomes to assess overall risk.[7]
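To make the expected-loss notion concrete, here is a small sketch (assuming NumPy; the helper names are invented for exposition) that computes the risk E[L(Y, a)] of candidate actions a under a discrete distribution for the true state. Under squared error, the risk-minimizing action is the distribution's mean, a standard result in decision theory.

```python
import numpy as np

def squared_loss(y, y_hat):
    """Squared error penalty for a single outcome."""
    return (y - y_hat) ** 2

def expected_loss(loss, action, outcomes, probs):
    """Risk of an action: E[L(Y, action)] over a discrete distribution for Y."""
    return sum(p * loss(y, action) for y, p in zip(outcomes, probs))

# True state Y takes the values 0, 1, 2 with these probabilities (E[Y] = 1.1).
outcomes, probs = [0.0, 1.0, 2.0], [0.2, 0.5, 0.3]

# Scan candidate actions; squared-error risk is minimized at the mean.
candidates = np.linspace(0.0, 2.0, 21)
risks = [expected_loss(squared_loss, a, outcomes, probs) for a in candidates]
print(candidates[int(np.argmin(risks))])  # 1.1
```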
Relation to Objective Functions

In optimization problems within machine learning and statistics, loss functions specifically quantify the discrepancy between a model's prediction and the true outcome for individual data points, serving as the core measure of error to be minimized. Objective functions, by contrast, represent the broader optimization target, typically integrating the loss across a dataset (often as an empirical average) while incorporating additional components such as regularization terms to enforce desirable properties like sparsity or smoothness. For instance, the general form of an objective function can be expressed as J(\theta) = \mathbb{E}[L(y, f_\theta(x))] + \lambda R(\theta), where L is the loss applied to the prediction f_\theta(x), R is the regularizer, \theta are the model parameters, and \lambda controls the trade-off.

This distinction underscores the role of objective functions in practical training: in machine learning, they often manifest as the empirical risk, which approximates the expected loss over training data, augmented by penalties to mitigate overfitting. In statistical contexts, such as linear regression, the objective evolves from a pure sum of squared errors (a quadratic loss aggregated over samples) to include L2 regularization, transforming it into ridge regression's criterion:

J(\mathbf{b}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^T \mathbf{b})^2 + \lambda \|\mathbf{b}\|^2,

where the first term captures the empirical quadratic loss and the second penalizes large coefficients; a numerical sketch of this criterion appears at the end of this subsection. This composite structure balances fidelity to the data against model complexity, a practice formalized in the 1970s to handle multicollinearity.[9]

The conceptual shift from standalone loss functions in early statistics to multifaceted objectives in modern machine learning emerged gradually, rooted in statistical decision theory's emphasis on minimizing expected loss (or risk) under uncertainty, as pioneered in the mid-20th century. By the late 20th century, with the formalization of empirical risk minimization and structural penalties in learning theory, objectives became standard for addressing generalization bounds in high-dimensional settings, marking a transition from pure error measures in classical statistics to regularized formulations prevalent post-1990s in fields like support vector machines and neural networks.
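Returning to the ridge criterion above, the sketch below (assuming NumPy; ridge_objective and ridge_fit are names invented for exposition) evaluates the composite objective and computes its minimizer via the standard closed form \mathbf{b} = (X^T X + \lambda I)^{-1} X^T \mathbf{y}.

```python
import numpy as np

def ridge_objective(b, X, y, lam):
    """Empirical quadratic loss plus an L2 penalty on the coefficients."""
    residuals = y - X @ b
    return residuals @ residuals + lam * (b @ b)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of the ridge criterion."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

b_ols = ridge_fit(X, y, lam=0.0)     # pure sum-of-squares objective
b_ridge = ridge_fit(X, y, lam=10.0)  # penalty shrinks the coefficients
print(b_ols)
print(b_ridge)
print(ridge_objective(b_ridge, X, y, lam=10.0))
```

Increasing \lambda trades a slightly worse data fit for smaller coefficients, which is exactly the fidelity-versus-complexity balance described above.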
Core Examples

0-1 Loss Function
The 0-1 loss function, also known as the misclassification loss or indicator loss, is a fundamental metric in classification tasks that penalizes incorrect predictions with a value of 1 and correct predictions with 0. Formally, for a true label y \in \{-1, 1\} and a predicted score f(x) from a classifier, it is defined as L(y, f(x)) = \mathbb{I}[y f(x) \leq 0], where \mathbb{I} is the indicator function that equals 1 if the condition holds and 0 otherwise; this equates to 1 for misclassified examples and 0 for correctly classified ones.[10] Equivalently, in terms of hard predicted labels \hat{y} = \operatorname{sign}(f(x)), the loss is L(y, \hat{y}) = 0 if y = \hat{y} and 1 otherwise, making it directly equivalent to the misclassification error rate.[10]

This loss function exhibits key properties that distinguish it from smoother alternatives: it is non-convex due to its step-like structure, discontinuous at decision boundaries, and uninformative to gradient methods, since its derivative is zero wherever it exists and undefined at the boundary y f(x) = 0.[10] These characteristics render direct optimization of the 0-1 loss computationally challenging, as standard gradient-based methods fail, and the minimization problem is NP-hard in general for binary classification.[10] Despite these issues, the 0-1 loss serves as the ideal evaluation criterion for classifiers, as minimizing it corresponds exactly to maximizing classification accuracy.

In applications, the 0-1 loss forms the basis for the accuracy metric, where a model's accuracy on a dataset is simply 1 minus the empirical average of the 0-1 losses over all examples. It is also pivotal in theoretical analyses of machine learning, particularly in statistical learning theory, where it underpins concepts like the Vapnik-Chervonenkis (VC) dimension, a measure of a hypothesis class's capacity defined as the size of the largest set of points that can be shattered (perfectly classified under all possible labelings) with respect to the 0-1 loss. The VC dimension enables generalization bounds that relate empirical 0-1 risk to true risk, ensuring learnability for classes with finite VC dimension.

A notable limitation of the 0-1 loss is its insensitivity to the confidence or margin of predictions: it assigns the same penalty to all misclassifications regardless of how far the predicted score is from the decision boundary, ignoring probabilistic outputs or soft predictions. This binary nature motivates the use of surrogate losses that approximate the 0-1 loss while remaining convex, such as the hinge loss used in support vector machines, which penalizes margin violations proportionally to their magnitude and provides a tight convex upper bound on the 0-1 error.
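The insensitivity to margins, and the way the hinge surrogate upper-bounds the 0-1 loss, can be checked numerically. The labels and scores in this sketch are illustrative.

```python
import numpy as np

def zero_one_loss(y, score):
    """0-1 loss: 1 when y * score <= 0 (misclassification), else 0."""
    return (y * score <= 0).astype(float)

def hinge_loss(y, score):
    """Hinge surrogate: convex, and pointwise at least the 0-1 loss."""
    return np.maximum(0.0, 1.0 - y * score)

y = np.array([+1, +1, -1, -1])
score = np.array([2.0, -0.1, -3.0, 0.4])

print(zero_one_loss(y, score))  # [0. 1. 0. 1.]   same penalty for both errors
print(hinge_loss(y, score))     # [0. 1.1 0. 1.4] grows with the margin violation
```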
Quadratic Loss Function

The quadratic loss function, also known as the squared error loss, measures the discrepancy between a true value y and a predicted value \hat{y} by squaring their difference, defined as

L(y, \hat{y}) = (y - \hat{y})^2.
This formulation penalizes larger errors more severely than smaller ones due to the quadratic term, making it particularly suitable for scenarios assuming Gaussian noise in the data. When averaged over a dataset of n observations, it yields the mean squared error (MSE),
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2,
which serves as a key metric for evaluating regression models under symmetric error distributions.[7]

Mathematically, the quadratic loss exhibits desirable properties for optimization: it is convex, ensuring that any local minimum is global, and continuously differentiable, facilitating gradient-based methods for minimization. Its convexity stems from the positive semi-definiteness of the associated Hessian matrix, while differentiability allows for straightforward computation of the gradient \frac{\partial L}{\partial \hat{y}} = -2(y - \hat{y}). Furthermore, for vector-valued residuals this loss equals the squared L2 norm, L(\mathbf{y}, \hat{\mathbf{y}}) = \| \mathbf{y} - \hat{\mathbf{y}} \|_2^2, linking it to Euclidean distance metrics in vector spaces. These attributes make it a foundational choice in parametric estimation where smooth, global optimization is prioritized.[11][7]

The quadratic loss gained prominence through its central role in the method of least squares, first formally published by Adrien-Marie Legendre in 1805 as an algebraic technique for fitting orbits in astronomy, where minimizing the sum of squared residuals provided an optimal solution under equal-weight assumptions. Carl Friedrich Gauss independently developed the approach around 1795 and published a probabilistic justification in 1809, arguing that it corresponds to maximum likelihood estimation under Gaussian errors, thus establishing its statistical rationale. This historical foundation solidified the quadratic loss as the cornerstone of linear regression and beyond.[12][13]

Despite its strengths, the quadratic loss is highly sensitive to outliers, as the squaring amplifies the influence of extreme residuals, potentially skewing estimates away from the true parameters. This vulnerability prompted the development of robust alternatives, such as the Huber loss function introduced by Peter J. Huber in 1964, which transitions from quadratic to linear penalties for large errors to mitigate outlier effects while retaining smoothness for small deviations.[14]
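To see the outlier sensitivity and the Huber remedy side by side, consider the sketch below (assuming NumPy). Note that the 0.5 r^2 branch follows Huber's conventional scaling, which differs from the unscaled (y - \hat{y})^2 above only by a constant factor.

```python
import numpy as np

def quadratic_loss(y, y_hat):
    """Squared error; its gradient in y_hat is -2 * (y - y_hat)."""
    return (y - y_hat) ** 2

def huber_loss(y, y_hat, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it (Huber, 1964)."""
    r = np.abs(y - y_hat)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

residuals = np.array([0.1, 1.0, 10.0])  # the last value mimics an outlier
y, y_hat = residuals, np.zeros_like(residuals)

print(quadratic_loss(y, y_hat))  # [0.01 1.0 100.0]  the outlier dominates
print(huber_loss(y, y_hat))      # [0.005 0.5 9.5]   linear tail caps its influence
```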