Loss function

A loss function, also known as a cost function, is a mathematical function that quantifies the discrepancy or penalty between a model's predicted output and the actual target value in optimization problems, serving as a core component in statistical estimation, machine learning, and decision theory. In machine learning, particularly supervised learning, the loss function evaluates the performance of a hypothesis or model on training data, providing a measure of how "bad" the predictions are, with the objective of minimizing this loss to improve accuracy and generalization through iterative optimization techniques like gradient descent. The choice of loss function is pivotal, as it influences the training dynamics, model convergence, and suitability for specific tasks such as regression or classification, ensuring the model learns meaningful patterns from data.

Common loss functions are tailored to problem types. For regression, where continuous outputs are predicted, the mean squared error (MSE) computes the average of squared differences between predictions and targets, while the mean absolute error (MAE) uses absolute differences for robustness to outliers. In classification, binary cross-entropy measures divergence for two-class problems, and categorical cross-entropy extends this to multiple classes, with hinge loss often used in support vector machines to maximize margins. Specialized variants, such as Huber loss for combining MSE and MAE benefits or Dice loss for image segmentation, address domain-specific challenges like noisy data or imbalanced classes.

The origins of loss functions trace back to the method of least squares, formalized by Adrien-Marie Legendre in 1805 for astronomical data fitting and developed independently by Carl Friedrich Gauss, who published his probabilistic justification in 1809; the method minimizes the sum of squared residuals as a foundational error measure. This approach evolved through maximum likelihood estimation in the early 20th century and was rigorously integrated into statistical decision theory by Abraham Wald in the 1940s, framing estimation and hypothesis testing as risk-minimization problems under defined losses. In contemporary machine learning, especially deep learning since the 2010s, loss functions have advanced to include adversarial and robust forms, such as those in generative adversarial networks (GANs), reflecting ongoing innovations in handling complex, high-dimensional data.

Fundamentals

Definition

In statistical decision theory, a loss function L(\theta, a) is defined as a measure that quantifies the penalty or cost incurred by selecting an action a from the action space when the true state of nature is the parameter \theta from the parameter space. In the specific contexts of estimation and prediction, this is commonly expressed as L(y, \hat{y}), where y represents the true value and \hat{y} the predicted or estimated value, thereby assessing the discrepancy between prediction and reality. The function maps this discrepancy to a non-negative real number, typically achieving its minimum value—often zero—when \hat{y} = y, indicating no error.

The concept of the loss function was formalized within statistical decision theory by Abraham Wald during the 1940s and 1950s, building on earlier ideas in statistics and economics to provide a rigorous framework for evaluating decisions under uncertainty. Wald's work, particularly in his 1950 monograph Statistical Decision Functions, established loss functions as a cornerstone for analyzing estimation and hypothesis testing problems.

Loss functions presuppose familiarity with basic mathematical functions and the principles of estimation, where one seeks to infer unknown parameters from observed data. They are essential for model evaluation and optimization because they transform abstract decision-making problems into concrete optimization tasks, allowing the selection of procedures that minimize penalties for errors in a principled manner. By defining the "badness" of decisions, loss functions enable the comparison of competing estimators or predictors based on their performance against potential true states.

Common general properties of loss functions include non-negativity and monotonic increase with the magnitude of error, ensuring that larger deviations incur higher costs. Many are convex, which supports efficient optimization techniques like gradient descent, though this is not universal. Symmetry—where L(y, \hat{y}) = L(\hat{y}, y)—holds for some, such as squared error losses, but asymmetric variants exist to reflect directional penalties in certain applications. In stochastic environments, loss functions extend to expected loss, averaging over probabilistic outcomes to assess overall risk.
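
These properties can be checked directly in code. The following is a minimal sketch (plain Python; the function names are illustrative, not from any standard library) of two losses discussed in this article and the general properties just listed:

```python
# Minimal sketch: two classic loss functions as plain Python callables,
# checked against the general properties described above.

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def absolute_loss(y, y_hat):
    return abs(y - y_hat)

y, y_hat = 3.0, 2.5
for loss in (squared_loss, absolute_loss):
    assert loss(y, y) == 0                   # minimum (zero) when prediction is exact
    assert loss(y, y_hat) >= 0               # non-negativity
    assert loss(y, y_hat) == loss(y_hat, y)  # symmetry (true for these two, not all losses)
```
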

Relation to Objective Functions

In optimization problems within machine learning and statistics, loss functions specifically quantify the discrepancy between a model's prediction and the true outcome for individual data points, serving as the core measure of error to be minimized. Objective functions, by contrast, represent the broader optimization target, typically integrating the loss across a dataset—often as an empirical average—while incorporating additional components such as regularization terms to enforce desirable properties like sparsity or smoothness. For instance, the general form of an objective function can be expressed as

J(\theta) = \mathbb{E}[L(y, f_\theta(x))] + \lambda R(\theta),

where L is the loss, f_\theta the model with parameters \theta, R the regularizer, and \lambda controls the trade-off.

This distinction underscores the role of objective functions in practical training: in machine learning, they often manifest as the empirical risk, which approximates the expected loss over training data, augmented by penalties to mitigate overfitting. In statistical contexts, such as linear regression, the objective evolves from a pure sum of squared errors (a quadratic loss aggregated over samples) to include L2 regularization, transforming it into ridge regression's criterion

J(\mathbf{b}) = \sum_{i=1}^n (y_i - \mathbf{x}_i^T \mathbf{b})^2 + \lambda \|\mathbf{b}\|^2,

where the first term captures the empirical quadratic loss and the second penalizes large coefficients. This composite structure balances fidelity to the data against model complexity, a practice formalized in the 1970s to handle multicollinearity.

The conceptual shift from standalone loss functions in early statistics to multifaceted objectives in modern machine learning emerged gradually, rooted in statistical decision theory's emphasis on minimizing expected loss (or risk) under uncertainty, as pioneered in the mid-20th century. By the late 20th century, with the formalization of empirical risk minimization and structural penalties in learning theory, such objectives became standard for addressing generalization in high-dimensional settings, marking a transition from pure error measures in classical statistics to the regularized formulations prevalent post-1990s in fields like support vector machines and neural networks.
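
As a concrete sketch of this composite structure (NumPy assumed; ridge_objective and ridge_fit are illustrative names, not a library API), the ridge criterion above can be evaluated and minimized in closed form:

```python
import numpy as np

def ridge_objective(b, X, y, lam):
    """J(b) = sum of squared residuals + lam * ||b||^2 (empirical loss + L2 penalty)."""
    residuals = y - X @ b
    return np.sum(residuals ** 2) + lam * np.sum(b ** 2)

def ridge_fit(X, y, lam):
    """Closed-form minimizer b = (X^T X + lam I)^{-1} X^T y of the ridge criterion."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
b_hat = ridge_fit(X, y, lam=0.1)   # coefficients shrunk toward zero relative to lam = 0
```

Setting lam to zero recovers the pure quadratic-loss objective of ordinary least squares.
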

Core Examples

0-1 Loss Function

The 0-1 loss function, also known as the misclassification loss or indicator loss, is a fundamental metric in classification tasks that penalizes incorrect predictions with a value of 1 and correct predictions with 0. Formally, for a true label y \in \{-1, 1\} and a predicted score f(x) from a classifier, it is defined as L(y, f(x)) = \mathbb{I}[y f(x) \leq 0], where \mathbb{I} is the indicator function that equals 1 if the condition holds and 0 otherwise; this equates to 1 for misclassified examples and 0 for correctly classified ones. Equivalently, in terms of hard predicted labels \hat{y} = \operatorname{sign}(f(x)), the loss is L(y, \hat{y}) = 0 if y = \hat{y} and 1 otherwise, making it directly equivalent to the misclassification error rate.

This loss function exhibits key properties that distinguish it from smoother alternatives: it is non-convex due to its step-like structure, discontinuous and non-differentiable at the decision boundary y f(x) = 0, and piecewise constant elsewhere, so its gradient is zero almost everywhere and carries no information for learning. These characteristics render direct optimization of the 0-1 loss computationally challenging, as standard gradient-based methods fail, and the problem is NP-hard in general for binary classification. Despite these issues, the 0-1 loss serves as the ideal evaluation criterion for classifiers, as minimizing it corresponds exactly to maximizing classification accuracy.

In applications, the 0-1 loss forms the basis for the accuracy metric, where a model's accuracy on a dataset is simply 1 minus the empirical average of the 0-1 losses over all examples. It is also pivotal in theoretical analyses of machine learning, particularly in statistical learning theory, where it underpins concepts like the Vapnik-Chervonenkis (VC) dimension, a measure of a hypothesis class's capacity defined as the size of the largest set of points that can be shattered (classified correctly under every possible labeling) using the 0-1 loss. The VC dimension enables generalization bounds that relate empirical 0-1 risk to true risk, ensuring learnability for classes with finite VC dimension.

A notable limitation of the 0-1 loss is its insensitivity to the confidence or margin of predictions: it assigns the same penalty to all misclassifications regardless of how far the predicted score is from the decision boundary, ignoring probabilistic outputs or soft predictions. This binary nature motivates the use of surrogate losses that upper-bound the 0-1 loss while being convex, such as the hinge loss introduced in support vector machines, which penalizes violations proportionally to their magnitude and provides a convex upper bound on the 0-1 error.
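
A brief NumPy sketch (toy labels and scores; names are illustrative) of the 0-1 loss and its relation to accuracy:

```python
import numpy as np

def zero_one_loss(y, score):
    """0-1 loss for labels y in {-1, +1} and raw scores f(x): 1 iff y * f(x) <= 0."""
    return (y * score <= 0).astype(float)

y = np.array([1, -1, 1, 1])
scores = np.array([0.7, 0.2, -1.5, 3.0])
losses = zero_one_loss(y, scores)        # [0., 1., 1., 0.]
accuracy = 1.0 - losses.mean()           # accuracy = 1 - empirical 0-1 risk
```
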

Quadratic Loss Function

The quadratic loss function, also known as the squared error loss, measures the discrepancy between a true value y and a predicted value \hat{y} by squaring their difference, defined as
L(y, \hat{y}) = (y - \hat{y})^2.
This formulation penalizes larger errors more severely than smaller ones due to the quadratic term, making it particularly suitable for scenarios assuming Gaussian noise in the data. When averaged over a dataset of n observations, it yields the mean squared error (MSE),
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2,
which serves as a key metric for evaluating regression models under symmetric error distributions.
Mathematically, the quadratic loss exhibits desirable properties for optimization: it is convex, ensuring that any local minimum is global, and continuously differentiable, facilitating gradient-based methods for minimization. Its convexity stems from the positive semi-definiteness of the associated Hessian, while differentiability allows straightforward computation of the gradient \frac{\partial L}{\partial \hat{y}} = -2(y - \hat{y}). Furthermore, for vector-valued predictions this loss is the squared L2 norm of the residual vector, L(\mathbf{y}, \hat{\mathbf{y}}) = \| \mathbf{y} - \hat{\mathbf{y}} \|_2^2, linking it to Euclidean distance in vector spaces. These attributes make it a foundational choice in parametric estimation where smooth, global optimization is prioritized.

The quadratic loss gained prominence through its central role in the method of least squares, first formally published by Adrien-Marie Legendre in 1805 as an algebraic technique for fitting orbits in astronomy, where minimizing the sum of squared residuals provided an optimal solution under equal-weight assumptions. Carl Friedrich Gauss independently developed and justified the approach probabilistically around 1795, publishing it in 1809 and arguing that it corresponds to maximum likelihood estimation under Gaussian errors, thus establishing its statistical rationale. This historical foundation solidified the quadratic loss as the cornerstone of linear regression and beyond.

Despite its strengths, the quadratic loss is highly sensitive to outliers, as the squaring amplifies the influence of extreme residuals, potentially skewing estimates away from the true parameters. This vulnerability prompted the development of robust alternatives, such as the Huber loss function introduced by Peter J. Huber in 1964, which transitions from quadratic to linear penalties for large errors to mitigate outlier effects while retaining smoothness for small deviations.
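
Because the quadratic loss is convex and differentiable, gradient descent on it behaves predictably. The sketch below (NumPy assumed; data made up) uses the gradient above to recover the sample mean, which is the MSE-minimizing constant prediction:

```python
import numpy as np

y = np.array([1.0, 2.0, 2.5, 10.0])

c = 0.0                              # constant prediction, refined by gradient descent
for _ in range(500):
    grad = -2.0 * np.mean(y - c)     # gradient of the MSE with respect to c
    c -= 0.05 * grad

assert np.isclose(c, y.mean())       # converges to the sample mean (3.875 here)
```
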

Absolute Value Loss Function

The absolute value loss function, also known as the L1 loss or absolute error loss, measures the deviation between a true value y and a predicted value \hat{y} as L(y, \hat{y}) = |y - \hat{y}|. In the context of estimation, the constant c that minimizes the expected absolute loss \mathbb{E}[|Y - c|] is the median of the distribution of Y, providing a robust measure of central tendency under this criterion. This loss function is convex, ensuring that any local minimum is global and facilitating optimization in various settings, though it is non-differentiable at zero, which requires subgradient methods or smooth approximations for gradient-based algorithms. In regularization contexts, such as Lasso regression, the L1 norm promotes sparsity by driving less important coefficients to exactly zero, enabling feature selection.

The absolute value loss serves as the foundation for least absolute deviations (LAD) regression, which minimizes the sum of absolute residuals \sum_{i=1}^n |y_i - \hat{y}_i| to fit linear models. Compared to quadratic loss, LAD is more robust to outliers because it does not square errors, preventing a single large deviation from dominating the optimization. Averaged over a dataset, this loss yields the mean absolute error (MAE), and it is equivalent to the Manhattan distance (L1 norm) between the vectors of predictions and observations.
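
The contrast with quadratic loss is easy to demonstrate numerically. In this sketch (NumPy assumed; data made up, with one outlier), the MSE-optimal constant is the mean, which the outlier drags upward, while the MAE-optimal constant stays at the median:

```python
import numpy as np

y = np.array([1.0, 1.1, 0.9, 1.05, 100.0])   # last point is an outlier

mse_minimizer = y.mean()       # 20.81, dominated by the outlier
mae_minimizer = np.median(y)   # 1.05, unaffected by the outlier's magnitude

# Brute-force check that the median minimizes the mean absolute error.
grid = np.linspace(0.0, 110.0, 100_001)
mae_curve = np.abs(y[:, None] - grid[None, :]).mean(axis=0)
assert np.isclose(grid[mae_curve.argmin()], np.median(y), atol=1e-2)
```
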

Mathematical Foundations

Deterministic Loss

In deterministic settings, a loss function provides a measure of error for a single prediction by quantifying the discrepancy between a true outcome and its estimate, without reference to probabilistic distributions. Formally, it is defined as a mapping L: \mathcal{Y} \times \hat{\mathcal{Y}} \to [0, \infty), where \mathcal{Y} denotes the space of possible true values and \hat{\mathcal{Y}} the space of predictions, with the property that L(y, y) = 0 for all y \in \mathcal{Y}, ensuring zero penalty when the prediction exactly matches the truth. This pointwise computation applies to fixed pairs (y, \hat{y}), treating the loss as a direct cost incurred for that specific instance.

Common forms of deterministic losses include symmetric distance-based measures, such as the Euclidean norm L(y, \hat{y}) = \| y - \hat{y} \|_2, which penalizes deviations proportionally to their magnitude regardless of direction. Asymmetric variants, however, impose different penalties for positive and negative errors to capture domain-specific imbalances; for instance, in estimation problems, a loss like L(y, \hat{y}) = a (\hat{y} - y)_+^2 + b (y - \hat{y})_+^2, where (z)_+ = \max(z, 0) and a > b, weights overestimation more heavily than underestimation, as seen in applications where false positives carry higher costs.

These losses play a central role in optimization through empirical risk minimization, where for a dataset of n observations \{(x_i, y_i)\}_{i=1}^n, the objective is to minimize the average pointwise loss \hat{R}(f) = \frac{1}{n} \sum_{i=1}^n L(y_i, f(x_i)) over a class of predictor functions f. This approach assumes a deterministic framework, relying solely on observed true-predicted pairs to guide model selection and training, without integrating expectations over random variables.
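
Empirical risk minimization with a pluggable pointwise loss takes only a few lines of code. The sketch below (NumPy assumed; function names are illustrative) averages an asymmetric squared loss of the form described above over a small dataset:

```python
import numpy as np

def asymmetric_squared(y, y_hat, a=4.0, b=1.0):
    """a * (y_hat - y)_+^2 + b * (y - y_hat)_+^2; a > b penalizes overestimation more."""
    over = max(y_hat - y, 0.0)
    under = max(y - y_hat, 0.0)
    return a * over ** 2 + b * under ** 2

def empirical_risk(loss, y_true, y_pred):
    """Average pointwise loss over the dataset: (1/n) * sum of L(y_i, f(x_i))."""
    return float(np.mean([loss(y, p) for y, p in zip(y_true, y_pred)]))

y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 1.5, 3.5]
risk = empirical_risk(asymmetric_squared, y_true, y_pred)
```
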

Expected Loss

The expected loss provides a probabilistic generalization of the pointwise loss by averaging over the uncertainty inherent in the data-generating process, thereby quantifying the average penalty incurred by a decision rule across possible outcomes. For a decision rule \delta that maps observations X to actions \delta(X), the expected loss, often termed the risk R(\delta), is defined as the expectation of the loss function L(Y, \delta(X)) taken with respect to the joint distribution of the random variables X and Y:

R(\delta) = \mathbb{E}[L(Y, \delta(X))],

where the expectation integrates over all possible realizations of (X, Y). In explicit integral form, for a joint distribution P(x, y), this becomes

R(\delta) = \int L(y, \delta(x)) \, dP(x, y).

By the law of total expectation, the overall risk decomposes into conditional expectations over the observation space, connecting the deterministic loss, which serves as the core integrand L(y, \delta(x)), to a comprehensive measure of performance under uncertainty in statistical decision theory.

A key property of the expected loss is that minimizing R(\delta) yields the Bayes optimal decision rule, which achieves the lowest possible average risk under the given distribution; for the quadratic loss L(y, a) = (y - a)^2, this optimal rule selects the posterior mean \mathbb{E}[Y \mid X] as the action \delta(X). Under quadratic loss, the expected loss further decomposes into interpretable components via the bias-variance tradeoff:

R(\delta) = \mathbb{E}[\mathrm{Var}(Y \mid X)] + \mathbb{E}[(\delta(X) - \mathbb{E}[Y \mid X])^2],

where the first term is the irreducible noise \mathbb{E}[\mathrm{Var}(Y \mid X)] and the second term is the excess risk of the predictor relative to the true regression function. When the predictor \delta is itself random, for example because it is fit on a random training sample, the excess risk can be further decomposed into the squared bias \mathbb{E}[(\mathbb{E}[\delta(X) \mid X] - \mathbb{E}[Y \mid X])^2] and the variance \mathbb{E}[\mathrm{Var}(\delta(X) \mid X)] of the predictor over the distribution of X.
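
The risk integral is rarely available in closed form but is straightforward to approximate by Monte Carlo. The sketch below (NumPy assumed; the toy model Y = 2X + noise is made up) estimates R(\delta) for two rules and confirms that the conditional-mean rule attains the smaller quadratic risk:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=100_000)
Y = 2.0 * X + rng.normal(scale=0.5, size=X.size)   # E[Y | X] = 2X, Var(Y | X) = 0.25

risk_bayes = np.mean((Y - 2.0 * X) ** 2)   # ~0.25: just the irreducible noise term
risk_other = np.mean((Y - 1.5 * X) ** 2)   # ~0.50: adds E[(0.5 X)^2] of excess risk
assert risk_bayes < risk_other
```
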
Bayes Risk

The Bayes risk quantifies the overall expected loss of a decision rule under a Bayesian framework, integrating both the sampling distribution of the data and the prior beliefs about the unknown parameter. Formally, for a prior distribution \pi over the parameter \theta and a decision rule \delta, the Bayes risk is defined as

r(\pi, \delta) = \mathbb{E}_{\theta} \left[ \mathbb{E}_{X \mid \theta} \, L(\theta, \delta(X)) \right],

where the inner expectation is the risk function R(\theta, \delta) conditional on \theta, and the outer expectation averages this risk over the prior \pi. This measure evaluates the performance of \delta by incorporating uncertainty in \theta via the prior, providing a single scalar that allows comparison and selection among decision rules.

The optimal Bayes rule \delta^* minimizes the Bayes risk, \delta^* = \arg\min_\delta r(\pi, \delta). This minimization yields the decision that achieves the lowest average loss under the specified prior, and in practice the Bayes rule can be found by minimizing the posterior expected loss for each observed dataset X. Unlike frequentist risk, which conditions solely on a fixed \theta and emphasizes behavior over repeated sampling, the Bayes risk explicitly integrates subjective or empirical priors to reflect belief updating, enabling a coherent framework for decision-making under uncertainty.

A key example arises under the 0-1 loss function, L(\theta, a) = 0 if a = \theta and 1 otherwise, where the Bayes rule \delta^*(X) selects the action that maximizes the posterior probability, corresponding to the maximum a posteriori (MAP) estimator. This choice minimizes the probability of error by favoring the most probable parameter value given the data and prior.

The concept of Bayes risk was formalized within Abraham Wald's foundational work on statistical decision functions in 1950, establishing the theoretical basis for evaluating rules under integrated risk. This framework was further developed and extended in Bayesian statistics by Morris H. DeGroot in his 1970 treatise on optimal statistical decisions, emphasizing prior incorporation for robust inference.
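
On a discrete parameter space, the Bayes rule under 0-1 loss can be computed directly. A small sketch (NumPy assumed; the prior, parameter grid, and observation are made up) that forms the posterior and takes its mode, i.e. the MAP estimate:

```python
import numpy as np

theta = np.array([0.0, 1.0, 2.0])      # candidate parameter values
prior = np.array([0.5, 0.3, 0.2])      # prior over theta
x = 1.4                                # one observation, assumed X | theta ~ N(theta, 1)

likelihood = np.exp(-0.5 * (x - theta) ** 2)   # Gaussian kernel; constants cancel
posterior = prior * likelihood
posterior /= posterior.sum()

map_estimate = theta[np.argmax(posterior)]     # Bayes action under 0-1 loss (here 1.0)
```
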
Applications in Statistics

Frequentist Expected Loss

In the frequentist paradigm of statistical decision theory, the expected loss, often termed the risk function, quantifies the average performance of a decision rule over the data-generating distribution parameterized by a fixed but unknown true parameter \theta. Formally, for a decision rule \delta that maps observations X to actions, and a loss function L(\theta, a) measuring the discrepancy between the true parameter and action a, the risk is defined as

R(\theta, \delta) = \mathbb{E}_{X \sim P_\theta} \left[ L(\theta, \delta(X)) \right],

where P_\theta denotes the distribution of the data under \theta. This formulation, introduced by Abraham Wald, evaluates the long-run frequency of loss for repeated sampling from P_\theta, without invoking probabilistic structure on \theta itself.

Since the true \theta is unknown, direct computation of R(\theta, \delta) is infeasible, necessitating estimators of the risk. In contexts where the loss can be directly evaluated using observed outcomes (e.g., prediction tasks with known responses Y_i), a primary approach is the empirical risk, which replaces the expectation with the sample average of observable losses over an independent validation dataset:

\hat{R}_n(\delta) = \frac{1}{n} \sum_{i=1}^n L(\delta(X_i), Y_i).

Under the i.i.d. assumption from P_\theta, this empirical risk serves as an unbiased estimator of the true risk, as its expectation equals R(\theta, \delta). In more general decision problems without directly observable losses, plug-in estimators provide an alternative by substituting a consistent estimator \hat{\theta} for \theta and approximating the risk R(\hat{\theta}, \delta) under the estimated distribution P_{\hat{\theta}}, often via Monte Carlo simulation or analytical methods in parametric models. These achieve consistency under regularity conditions like identifiability and compactness of the parameter space.

Key properties of these risk estimators include asymptotic unbiasedness and consistency. The bias of plug-in variants diminishes to zero as the sample size grows, provided the underlying estimator \hat{\theta} is consistent; for empirical risk in observable settings, consistency follows from the law of large numbers, which gives convergence in probability to the true risk under mild moment conditions on the loss. Furthermore, the risk framework connects to confidence intervals by interpreting loss thresholds: for instance, the set of actions a where the risk remains below some constant c can define asymptotic confidence regions, analogous to inverting tests based on expected loss bounds.

Despite these strengths, the frequentist risk is sensitive to model misspecification, as it assumes the data arise from some P_\theta in a specified family; if the true distribution lies outside this family, the computed risk fails to capture the actual expected loss, potentially leading to poor decision rules. This vulnerability prompted developments in robust statistics, starting with Peter Huber's work on minimizing maximum risk under contamination models, and extending post-1980s to influence functions and M-estimators that bound sensitivity to outliers or distributional deviations. In contrast to the Bayes risk, which averages losses over a prior distribution on \theta, the frequentist approach emphasizes performance under the true fixed distribution for long-run reliability.

Statistical Decision Rules

In statistical decision theory, decision rules provide a framework for selecting actions that minimize the risk associated with a loss function, where risk is defined as the expected loss under the frequentist perspective. A decision rule \delta maps observations to actions, and its performance is evaluated via the risk function R(\theta, \delta), which quantifies the average loss for each parameter \theta. Optimal rules aim to balance performance across possible \theta without prior assumptions about their distribution.

Admissibility serves as a foundational criterion for evaluating decision rules. A rule \delta is inadmissible if there exists another rule \delta' such that R(\theta, \delta') \leq R(\theta, \delta) for all \theta in the parameter space \Theta, with strict inequality holding for at least one \theta_0. Conversely, \delta is admissible if no such dominating rule exists, ensuring it is not uniformly outperformed. This concept eliminates suboptimal rules, focusing analysis on the class of admissible procedures, though in infinite-dimensional spaces a complete characterization remains challenging.

Minimax rules address uncertainty about \theta by prioritizing robustness against the worst-case scenario. Formally, a rule \delta^* is minimax if it solves

\delta^* = \arg\min_{\delta} \max_{\theta \in \Theta} R(\theta, \delta),

minimizing the maximum risk over all possible \theta. This approach is particularly valuable when no reliable prior information is available, as it guarantees a bounded worst-case loss, though it may sacrifice average performance compared to tailored rules.

Bayes rules connect to minimax criteria through limiting cases. A Bayes rule, which minimizes the average risk with respect to a prior \pi, becomes minimax when \pi is a least-favorable prior, often yielding an equalizer rule with constant risk. If a Bayes rule has constant risk equal to its Bayes risk, the rule is minimax, providing a Bayesian route to minimax solutions.

The development of these rules traces to foundational work from the 1930s through the 1950s. The Neyman-Pearson lemma, introduced in 1933, established optimality for hypothesis testing under loss, laying groundwork for broader decision-theoretic frameworks by identifying most powerful tests that minimize error risks. This was extended in the 1950s by Charles Stein, who demonstrated the inadmissibility of standard estimators in multivariate settings, prompting deeper exploration of admissibility and minimaxity in high-dimensional problems.
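
On a finite problem the minimax criterion reduces to a minimum over the row-wise maxima of a risk table. A toy sketch (NumPy assumed; the risk numbers are made up) in which the equalizer rule, with constant risk, is the minimax choice:

```python
import numpy as np

# Rows: candidate rules; columns: parameter values theta. Entries: R(theta, delta).
risk = np.array([
    [0.2, 0.4, 0.9, 0.3],   # rule 0: usually good, but bad worst case
    [0.5, 0.5, 0.5, 0.5],   # rule 1: equalizer rule (constant risk)
    [0.1, 0.8, 0.6, 0.7],   # rule 2
])

worst_case = risk.max(axis=1)             # max over theta for each rule
minimax_rule = int(worst_case.argmin())   # -> 1, the equalizer rule
```
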
Examples in Statistical Estimation

In statistical estimation, maximum likelihood estimation (MLE) serves as a foundational procedure where the negative log-likelihood functions implicitly as a loss measure. Introduced by Ronald A. Fisher in 1922, MLE seeks parameter values that maximize the likelihood of observing the given data under an assumed probability distribution, which is equivalent to minimizing the negative log-likelihood. This approach treats the negative log-likelihood as a loss function for parameter optimization, promoting estimates that assign high probability to the observed data while penalizing deviations akin to a logarithmic penalty on prediction errors. For instance, under a Gaussian assumption, MLE reduces to minimizing the sum of squared residuals, bridging it to other loss-based methods.

Least squares estimation exemplifies the use of quadratic loss in linear regression, where the objective is to minimize the sum of squared differences between observed and predicted values. Attributed to Carl Friedrich Gauss in his 1809 work Theoria Motus Corporum Coelestium, this method assumes errors follow a normal distribution, making the least squares estimator the maximum likelihood solution. The solution is obtained by solving the normal equations, derived from setting the partial derivatives of the quadratic loss to zero:

\mathbf{X}^T \mathbf{X} \boldsymbol{\beta} = \mathbf{X}^T \mathbf{y}.

This yields the ordinary least squares estimator \hat{\boldsymbol{\beta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}, which is unbiased and has minimum variance among linear unbiased estimators under the Gauss-Markov assumptions.

Robust estimation addresses sensitivity to outliers by employing absolute loss, as in the median, which minimizes the sum of absolute deviations. Unlike the mean, which corresponds to quadratic loss and has a breakdown point of 0, meaning a single outlier can bias it arbitrarily, the sample median achieves a breakdown point of 1/2, resisting contamination from up to nearly half the data points before failing. This property, formalized by Frank R. Hampel in 1968, quantifies robustness as the smallest fraction of corrupted data that can cause the estimator to break down, highlighting the median's suitability for non-normal distributions with heavy tails.

In hypothesis testing, loss functions manifest through the risks of Type I and Type II errors, where a Type I error (false rejection of a true null hypothesis) occurs with probability \alpha and a Type II error (failure to reject a false null) occurs with probability \beta. The Neyman-Pearson framework from 1933 optimizes tests by controlling \alpha while minimizing \beta, linking to the power function 1 - \beta(\theta), which measures the test's ability to detect true alternatives as a function of the parameter \theta. This error-based loss perspective evaluates decision rules for admissibility, ensuring no other rule dominates in expected loss across possible states.
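
A compact sketch of least squares via the normal equations (NumPy assumed; data simulated purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # intercept + two features
beta_true = np.array([0.5, 1.0, -2.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Solve X^T X beta = X^T y; np.linalg.lstsq is preferred in practice for stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```
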
Applications in Machine Learning

Regression Loss Functions

In regression tasks within machine learning, loss functions measure the discrepancy between predicted continuous values and actual outcomes, guiding model optimization toward accurate predictions. Unlike the quadratic loss, which squares errors and is highly sensitive to outliers, regression-specific losses often prioritize robustness or targeted distributional properties.

The mean absolute error (MAE), derived from the absolute loss, computes the average magnitude of errors without squaring, making it less affected by extreme values and thus more robust in datasets with outliers. Formally, for a dataset of n samples with true values y_i and predictions \hat{y}_i,

\text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|.

This L1-based metric promotes the conditional median as the optimal predictor under Laplacian error assumptions, enhancing stability in noisy environments like financial modeling.

The Huber loss addresses limitations of both quadratic and absolute losses by combining them in a piecewise manner, applying quadratic penalties to small errors for smoothness and linear penalties to large ones for outlier resistance. It features a threshold parameter \delta > 0 and is defined as

L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\ \delta |y - \hat{y}| - \frac{1}{2} \delta^2 & \text{if } |y - \hat{y}| > \delta. \end{cases}

Introduced in robust estimation contexts, this hybrid form balances differentiability and bounded influence, with \delta typically tuned via cross-validation to control the degree of robustness.

For scenarios requiring predictions beyond the mean, such as interval forecasts or asymmetric risks, the quantile loss—also known as pinball loss—enables quantile regression by penalizing errors asymmetrically around a quantile level \tau \in (0,1). The loss is

L_\tau(y, \hat{y}) = \begin{cases} \tau (y - \hat{y}) & \text{if } y \geq \hat{y} \\ (\tau - 1)(y - \hat{y}) & \text{if } y < \hat{y}. \end{cases}

This asymmetric formulation, originating from regression quantiles, yields the \tau-th conditional quantile as the minimizer, allowing models to capture varying uncertainty across the outcome distribution, such as in demand forecasting.

In modern deep learning applications, particularly for time series forecasting with recurrent networks like LSTMs since the 2010s, these losses enhance performance under non-stationary or outlier-prone data. For instance, Huber loss mitigates the effect of extreme market events in LSTM-based stock predictions, outperforming mean squared error by reducing sensitivity to volatility spikes. Similarly, quantile loss facilitates probabilistic LSTM forecasts, providing uncertainty bands for multi-step predictions in energy and financial time series.
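
These three regression losses translate directly into vectorized code. A minimal sketch (NumPy assumed; function names are illustrative):

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    r = np.abs(y - y_hat)
    return np.mean(np.where(r <= delta, 0.5 * r ** 2, delta * r - 0.5 * delta ** 2))

def pinball(y, y_hat, tau=0.9):
    r = y - y_hat
    return np.mean(np.where(r >= 0, tau * r, (tau - 1.0) * r))

y = np.array([1.0, 2.0, 3.0, 50.0])       # the last value acts as an outlier
y_hat = np.array([1.2, 1.8, 3.5, 4.0])
print(mae(y, y_hat), huber(y, y_hat), pinball(y, y_hat))
```
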
Classification Loss Functions

Classification loss functions are designed for tasks involving categorical predictions, where the goal is to assign inputs to discrete classes. Unlike the ideal 0-1 loss, which incurs a penalty of 1 for any misclassification and 0 otherwise, these functions provide differentiable surrogates that enable gradient-based optimization in machine learning models. They address the non-convexity and non-differentiability of the 0-1 loss by promoting margins or probabilistic confidence in predictions.

The cross-entropy loss, also known as log loss, is a cornerstone for probabilistic classifiers and measures the divergence between the true class distribution and the predicted probabilities. For a single example with true label encoded as a one-hot vector \mathbf{y} and predicted probabilities \mathbf{p}, it is defined as

L(\mathbf{y}, \mathbf{p}) = -\sum_{i=1}^C y_i \log(p_i),

where C is the number of classes. This loss encourages the model to assign high probability to the correct class while penalizing overconfident incorrect predictions, making it suitable for training neural networks with softmax outputs. It originates from information theory and has been widely adopted since the early days of logistic regression in machine learning.

Hinge loss, primarily used in support vector machines (SVMs), focuses on maximizing the margin between classes without assuming probabilistic outputs. For binary classification with true label y \in \{-1, 1\} and model score f(\mathbf{x}), the loss is

L(y, f(\mathbf{x})) = \max(0, 1 - y f(\mathbf{x})).

This formulation penalizes predictions whose margin falls below 1, promoting sparse solutions and robustness. Introduced in the context of SVMs, it has influenced large-margin classifiers and remains a benchmark for non-probabilistic approaches.

To handle class imbalance, where minority classes are underrepresented, the focal loss modifies the cross-entropy by down-weighting well-classified examples. Defined as

L(y, p_t) = -\alpha (1 - p_t)^\gamma \log(p_t),

where p_t is the predicted probability for the true class, \alpha balances class importance, and \gamma \geq 0 modulates focus on hard examples (with \gamma = 0 reducing to cross-entropy), it was proposed for object detection tasks. This loss reduces the relative loss for easy negatives, allowing the model to prioritize difficult cases and improve precision in imbalanced datasets.

For multi-class problems, the softmax cross-entropy extends the binary cross-entropy by applying the softmax function to raw scores \mathbf{z} to obtain probabilities p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}, then computing the loss over all classes as in the cross-entropy formula. This normalization across classes ensures mutual exclusivity in predictions and is standard in deep learning frameworks for tasks like image classification.
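
Minimal NumPy versions of these classification losses (an illustrative sketch, with probability clipping to avoid log(0)):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)       # numerically stabilized softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(y_onehot, p):
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(y_onehot * np.log(p), axis=-1)

def hinge(y, score):
    """y in {-1, +1}; score is the raw classifier output f(x)."""
    return np.maximum(0.0, 1.0 - y * score)

def focal(p_t, alpha=0.25, gamma=2.0):
    """p_t is the predicted probability of the true class; gamma = 0 recovers CE."""
    p_t = np.clip(p_t, 1e-12, 1.0)
    return -alpha * (1.0 - p_t) ** gamma * np.log(p_t)

logits = np.array([2.0, 0.5, -1.0])
y = np.array([1.0, 0.0, 0.0])                   # one-hot true label
loss = cross_entropy(y, softmax(logits))
```
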
Custom and Robust Loss Functions

In machine learning, robust loss functions are designed to mitigate the influence of outliers by bounding the impact of large residuals, thereby improving model stability in noisy datasets. The log-cosh loss, defined as \log(\cosh(y - \hat{y})) where y is the true value and \hat{y} the prediction, behaves quadratically for small residuals and approximately like the absolute error for large ones, while remaining smooth and twice differentiable, properties useful for gradient-based optimization. This function has been analyzed for its statistical properties, including unbiasedness and consistency in regression tasks under certain conditions.

Similarly, Tukey's biweight loss, an M-estimator from robust statistics, applies a redescending influence function that caps the contribution of residuals beyond a tuning constant c. With r = y - \hat{y}, the loss is

\rho(r) = \frac{c^2}{6} \left(1 - \left(1 - (r/c)^2\right)^3\right) \text{ for } |r| \leq c, \qquad \rho(r) = \frac{c^2}{6} \text{ for } |r| > c,

so residuals beyond c contribute a constant penalty and exert no further influence. Applied to deep regression via convolutional networks, it demonstrated superior performance on datasets with synthetic outliers, reducing mean pixel error by up to 12% compared to least squares on human pose estimation datasets.

Domain-specific custom losses adapt to task peculiarities, such as class imbalance or perceptual quality. In image segmentation, the Dice loss addresses foreground-background imbalance by optimizing the Dice similarity coefficient, formulated as 1 - \frac{2 |p \cap g|}{|p| + |g|}, where p and g are the predicted and ground truth segmentations. Proposed in the V-Net architecture for volumetric medical imaging, it outperformed cross-entropy on prostate MRI datasets, achieving a mean Dice score of 0.869 for prostate segmentation. For generative adversarial networks (GANs), perceptual losses enhance visual fidelity by comparing high-level features from pre-trained networks like VGG rather than pixel-wise differences; the loss is \|\phi(y) - \phi(\hat{y})\|_2^2, where \phi extracts features from intermediate layers. This approach, introduced in SRGAN for single-image super-resolution, improved perceptual quality as measured by mean opinion score (MOS), for example from 2.29 to 3.56 on the BSD100 dataset, compared to pixel-loss baselines evaluated on datasets such as Set5 and BSD100.
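
Sketch implementations of these robust and domain-specific losses (NumPy assumed; r denotes the residual y - \hat{y}, and the soft Dice version below treats masks as probabilities in [0, 1]):

```python
import numpy as np

def log_cosh(r):
    return np.log(np.cosh(r))   # ~ r^2 / 2 for small r, ~ |r| - log(2) for large r

def tukey_biweight(r, c=4.685):
    rho = (c ** 2 / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    return np.where(np.abs(r) <= c, rho, c ** 2 / 6.0)   # constant beyond c

def dice_loss(p, g, eps=1e-8):
    """Soft Dice loss for a predicted mask p and ground-truth mask g."""
    inter = np.sum(p * g)
    return 1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(g) + eps)
```
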
In learning to rank, listwise losses evaluate entire ranking lists to align with metrics like normalized discounted cumulative gain (NDCG), using approximations for differentiability. Pairwise hinge losses, such as \max(0, 1 - (s_i - s_j)) for a relevant document i ranked below an irrelevant document j with scores s, extend to listwise settings by aggregating over permutations, while more advanced methods approximate NDCG directly via softmax over ranked probabilities. The foundational ListNet framework treats ranking as listwise classification, minimizing cross-entropy between predicted and ground-truth permutation probabilities, which boosted NDCG@10 by 5-10% on benchmark corpora like TD2003 compared to pointwise baselines. Modern approximations, like those in NeuralNDCG, use smooth, differentiable approximations of the sorting operator to enable end-to-end optimization, yielding up to 8% relative improvements in offline NDCG on MSLR-Web30K.

Designing custom losses involves challenges in balancing differentiability for gradient descent, convexity for global optima, and interpretability for debugging model behavior. Non-convex losses like those approximating ranking metrics can lead to optimization instability, while ensuring smoothness requires careful tuning to avoid vanishing gradients. Interpretability is further complicated by composite forms, such as combining perceptual features with adversarial terms, necessitating tools like automatic differentiation in frameworks such as PyTorch to compute gradients efficiently. These trade-offs are evident in surveys highlighting that while custom losses excel in niche tasks (e.g., robust variants reducing outlier sensitivity by bounding influence), overly complex designs risk overfitting without validation on held-out metrics.

Applications in Economics and Decision Theory

Regret in Decision Making

In decision theory, regret serves as a measure of loss by quantifying the difference between the outcome of a selected action and the optimal action for a given state of the world. Formally, the regret R(a, s) for choosing action a in state s is defined as

R(a, s) = L(s, a) - L(s, a^*),

where L denotes the loss function and a^* = \arg\min_{a'} L(s, a') is the best action that minimizes loss in state s. This formulation captures the opportunity cost, the loss of "what might have been" had the superior action been known and chosen. The concept underpins the minimax regret criterion, which selects the action minimizing the maximum regret across all possible states, providing a conservative approach to decision-making under uncertainty. Leonard J. Savage introduced regret in his seminal 1951 paper on statistical decision theory, later expanded in his 1954 book The Foundations of Statistics, where it emerged as a tool for evaluating decisions without probabilistic priors.
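
A toy sketch of the minimax-regret criterion (NumPy assumed; the loss table is made up): the regret table is the loss table minus its row-wise minima, and the criterion picks the action with the smallest worst-case regret:

```python
import numpy as np

# L[state, action]: loss of each action in each state of the world.
L = np.array([
    [2.0, 5.0, 4.0],
    [6.0, 1.0, 3.0],
    [4.0, 4.0, 2.0],
])

regret = L - L.min(axis=1, keepdims=True)   # R(a, s) = L(s, a) - min over a' of L(s, a')
minimax_regret_action = int(regret.max(axis=0).argmin())   # -> action 2 here
```
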
In economics and behavioral economics, regret has become central to modeling human behavior, particularly through regret theory, which emphasizes how anticipated feelings of regret or rejoicing shape choices under risk. Pioneered by David Bell in 1982, this theory explains deviations from expected utility maximization, such as risk aversion in gains and risk-seeking in losses, by incorporating comparative evaluations of foregone alternatives.

In dynamic settings like online learning, regret is distinguished as instantaneous, reflecting suboptimality at a single timestep, or cumulative, aggregating these differences over a sequence of decisions. No-regret algorithms, which bound cumulative regret to grow sublinearly with time (ensuring average regret vanishes), include the multiplicative weights update method, a foundational approach from the 1990s that adjusts action probabilities based on past performance. In game theory, when agents apply such regret minimization, their long-run average strategies converge to Nash equilibria in two-player zero-sum games, as the low-regret property forces play toward mutual best responses. This convergence underpins applications in algorithmic game theory and reinforcement learning for strategic interactions.

Loss under Uncertainty

In economic models of decision-making under risk, loss is conceptualized within the utility framework as the disutility arising from unfavorable wealth outcomes. Specifically, for a given utility function u(w) that maps wealth w to satisfaction, the loss associated with transitioning from a preferred wealth state w to a worse state w' (where w > w') is defined as u(w) - u(w'), representing the forgone utility. This setup underpins the von Neumann-Morgenstern expected utility theory, which posits that rational agents evaluate risky prospects by maximizing the expected value of utility, \mathbb{E}[u(w)], over probability distributions of wealth outcomes; deviations from this maximum thus quantify expected loss under uncertainty.

Risk aversion in this context arises from the concavity of the utility function, leading agents to weigh potential losses more heavily than equivalent gains. For small risks, such as minor fluctuations in wealth, the expected utility loss can be approximated using a quadratic form: the certainty equivalent of a zero-mean risk \tilde{\epsilon} with variance \sigma^2 is roughly w - \frac{1}{2} \sigma^2 r_a(w), with associated risk premium \pi \approx \frac{1}{2} \sigma^2 r_a(w). The Arrow-Pratt absolute measure of risk aversion, r_a(w) = -\frac{u''(w)}{u'(w)}, captures the curvature of u(w) and directly links it to the magnitude of this loss, with higher values indicating greater aversion to small-scale uncertainty. This approximation facilitates analysis in settings where risks are localized, such as short-term financial perturbations.
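
The small-risk approximation is easy to verify numerically. A sketch (NumPy assumed; the numbers are made up) for log utility, where r_a(w) = 1/w, comparing the exact risk premium with \frac{1}{2}\sigma^2 r_a(w):

```python
import numpy as np

rng = np.random.default_rng(3)
w, sigma = 100.0, 5.0
eps = rng.normal(scale=sigma, size=1_000_000)    # zero-mean wealth risk

expected_utility = np.mean(np.log(w + eps))
certainty_equivalent = np.exp(expected_utility)  # u^{-1}(E[u(w + eps)]) for u = log
exact_premium = w - certainty_equivalent         # ~0.125
approx_premium = 0.5 * sigma ** 2 / w            # Arrow-Pratt approximation: 0.125
```
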
Applications of this framework appear prominently in portfolio optimization, where loss aversion distorts traditional expected utility maximization toward conservative allocations. Kahneman and Tversky's prospect theory extends the utility-loss model by introducing reference dependence and a loss aversion coefficient \lambda > 1, such that losses relative to a reference point (e.g., current wealth) are felt more acutely than gains; this leads investors to overweight downside risks, often resulting in under-diversification or disposition effects in asset holdings. Empirical assessments confirm that incorporating prospect-theoretic loss aversion improves models of observed portfolio behavior, yielding more realistic predictions of risk-taking under uncertainty.

To compare prospects without fully specifying the utility function, stochastic dominance criteria offer a robust method for identifying choices that minimize expected loss for broad classes of risk-averse agents. First-order stochastic dominance occurs when one wealth distribution F satisfies F(x) \leq G(x) for all x, implying that F is preferred by all agents with increasing utility functions. More relevant for risk aversion, second-order stochastic dominance holds if \int_{-\infty}^x [G(t) - F(t)] \, dt \geq 0 for all x, ensuring preference for F by all agents with increasing concave utilities, as it guarantees a lower expected loss in the mean-preserving-spread sense. These rules, derived from integral conditions on cumulative distributions, guide selections in uncertain environments like investment comparisons.

Construction and Selection

Building Loss Functions

Loss functions can be constructed by composing simpler base losses to address complex problem structures. A common approach involves taking weighted sums of individual losses, where each component captures a specific aspect of the error, such as accuracy and smoothness, allowing the overall function to balance multiple objectives during optimization. More generally, Bregman divergences provide a framework for generating families of loss functions from a strictly convex generator function F, defined as

D_F(p \parallel q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle,

which measures the difference between points p and q in a way that generalizes common losses such as the squared error (from F(x) = \frac{1}{2}\|x\|^2) and the Kullback-Leibler divergence (from F(x) = \sum_i x_i \log x_i). This construction unifies many statistical losses and ensures properties like non-negativity and convexity in the first argument, facilitating their use in regression and classification tasks.

To align loss functions with specific tasks, they can be derived from foundational principles in information theory or geometry. For instance, the cross-entropy loss, widely used in probabilistic modeling, arises from the Kullback-Leibler (KL) divergence, which quantifies the information loss when approximating one probability distribution P by another Q:

D_{\text{KL}}(P \parallel Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}.

The cross-entropy H(P, Q) = -\sum_i P(i) \log Q(i) equals the KL divergence plus the entropy of P, so minimizing cross-entropy over Q is equivalent to minimizing KL divergence up to a constant, making it suitable for tasks requiring distributional matching, such as classification.
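
The Bregman construction can be sketched generically (NumPy assumed; generator and gradient supplied by hand), recovering squared error and KL divergence from the two generators named above:

```python
import numpy as np

def bregman(F, gradF, p, q):
    """D_F(p || q) = F(p) - F(q) - <grad F(q), p - q>."""
    return F(p) - F(q) - np.dot(gradF(q), p - q)

F_sq = lambda x: 0.5 * np.dot(x, x)          # generator for squared error
grad_sq = lambda x: x
F_ent = lambda x: np.sum(x * np.log(x))      # generator for KL divergence
grad_ent = lambda x: np.log(x) + 1.0

p = np.array([0.2, 0.3, 0.5])
q = np.array([0.4, 0.4, 0.2])
assert np.isclose(bregman(F_sq, grad_sq, p, q), 0.5 * np.sum((p - q) ** 2))
kl = bregman(F_ent, grad_ent, p, q)   # equals sum p log(p/q) since both sum to 1
```
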
Geometrically, losses can be built using divergences that respect the manifold structure of the data space, as in Bregman constructions, which induce distances compatible with the geometry defined by F.

For data with inherent ordering, such as ordinal outcomes, asymmetric loss functions are designed to penalize over- and under-predictions differently, reflecting unequal costs. A quadratic asymmetric form weights errors based on direction, for example using w_1 (y - \hat{y})^2 for underprediction (y > \hat{y}) and w_2 (\hat{y} - y)^2 for overprediction (\hat{y} > y), with w_1 \neq w_2 to encode the unequal costs. This approach shifts the optimal predictor away from the mean toward medians or quantiles, aligning with the ranked nature of ordinal data.

Ensuring optimization tractability is a key consideration in loss function design, often achieved through convex analysis. Convex losses guarantee that local minima are global, enabling efficient algorithms like gradient descent to converge reliably. For robustness to outliers, the Huber loss combines quadratic behavior near zero with linear tails:

\rho(t) = \begin{cases} \frac{1}{2} t^2 & |t| \leq k \\ k |t| - \frac{1}{2} k^2 & |t| > k, \end{cases}

where k controls the transition, providing a tractable convex alternative to squared loss in contaminated distributions.
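
A numeric sketch of the quantile-shifting effect of asymmetry (NumPy assumed; weights made up): with w_1 > w_2, the risk-minimizing constant lies above the mean of a symmetric distribution:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(size=100_000)   # symmetric data with mean ~0

def risk(c, w1=4.0, w2=1.0):   # w1 penalizes underprediction (y > c) more heavily
    under = np.maximum(y - c, 0.0)
    over = np.maximum(c - y, 0.0)
    return np.mean(w1 * under ** 2 + w2 * over ** 2)

grid = np.linspace(-2.0, 2.0, 401)
best = grid[np.argmin([risk(c) for c in grid])]   # > 0: shifted above the mean
```
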
Criteria for Selecting Loss Functions

Selecting an appropriate loss function is crucial for aligning the optimization process with the specific objectives of the modeling task. When the goal involves probabilistic predictions, such as classification settings where calibrated probability outputs are desired, the log loss (cross-entropy) is preferred because it penalizes confident wrong predictions severely and encourages well-calibrated probabilities. In contrast, for tasks requiring robustness to outliers or noisy data, bounded loss functions are advantageous as they limit the influence of extreme errors, preventing them from dominating the training process.

Computational tractability plays a key role in loss function selection, particularly in gradient-based optimization frameworks prevalent in machine learning. Differentiable loss functions are essential for standard gradient descent algorithms, enabling efficient computation of gradients to update model parameters via backpropagation. For non-smooth losses, such as the hinge loss used in support vector machines, subgradient methods can be employed, though they may converge more slowly than methods applied to smooth counterparts.

Interpretability of the loss function enhances its integration with domain-specific knowledge, allowing practitioners to incorporate prior expertise into the model. In clinical trials, weighted loss functions are often selected to emphasize certain outcomes, such as prioritizing false negatives in diagnostic models to reflect the higher cost of missing disease cases, thereby aligning the loss with ethical and practical considerations in healthcare.

Empirical validation ensures the chosen loss function generalizes well beyond the training data. Cross-validation techniques, such as k-fold cross-validation, are commonly used to estimate the empirical risk associated with different loss functions, helping to select one that minimizes overfitting while maintaining performance on held-out data. Additionally, sensitivity analysis, which has gained prominence in machine learning since the 2000s, evaluates how variations in the loss function affect model stability and predictions under different data distributions. The expected loss under the true data distribution provides the theoretical basis for risk-based selection, though it is typically approximated empirically.

Optimization and Decision Rules

In machine learning, gradient-based optimization methods rely on loss functions to iteratively minimize the empirical risk, which approximates the expected loss over training data. Stochastic gradient descent (SGD) is a foundational algorithm that computes gradients of the loss with respect to model parameters using mini-batches of data, enabling efficient updates in large-scale settings, including for non-convex objectives. Adaptations like the Adam optimizer enhance SGD by incorporating adaptive learning rates and momentum terms based on first- and second-order moments of gradients, improving convergence speed and stability for deep learning tasks.

Decision-theoretic rules provide principled ways to select actions or parameters that minimize loss under uncertainty, extending statistical foundations to practical optimization. The plug-in rule estimates the expected loss by substituting empirical distributions or moments into the risk function and then minimizes the resulting estimate, offering a computationally tractable approach for high-dimensional problems. Bayes rules minimize the posterior expected loss given a prior distribution over parameters, averaging risks weighted by beliefs to achieve optimal decisions under probabilistic assumptions. Minimax rules, in contrast, select the decision that minimizes the maximum possible loss over the worst-case scenario, ensuring robustness without relying on priors and often coinciding with Bayes rules under least-favorable priors.

When multiple losses conflict, such as balancing predictive accuracy against fairness constraints in classification models, multi-objective optimization identifies Pareto fronts representing non-dominated trade-offs. These fronts consist of solutions where improving one loss (e.g., reducing accuracy disparity across demographic groups) cannot occur without worsening another (e.g., overall error rate), allowing practitioners to select models based on domain priorities via scalarization techniques like weighted sums.
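
A minimal mini-batch SGD loop for the quadratic loss (NumPy assumed; data simulated), illustrating the loss-driven parameter updates described at the start of this section:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -1.0, 2.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
lr, batch = 0.05, 32
for step in range(2000):
    idx = rng.integers(0, len(y), size=batch)               # sample a mini-batch
    grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / batch   # gradient of batch MSE
    w -= lr * grad                                          # SGD update
```
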
In online learning settings, algorithms perform loss-driven updates sequentially as data arrives, adapting models without batch recomputation. No-regret guarantees ensure that the average loss per round approaches that of the best fixed decision in hindsight, with algorithms like online gradient descent achieving sublinear regret bounds for convex losses over time.
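
A sketch of online gradient descent on a stream of convex losses l_t(w) = (w - z_t)^2 (NumPy assumed; the stream and the step-size schedule \eta_t = 1/(2\sqrt{t}) are illustrative choices consistent with standard O(\sqrt{T}) regret analyses):

```python
import numpy as np

rng = np.random.default_rng(6)
T = 10_000
z = rng.normal(loc=1.0, size=T)   # the arriving data stream

w, losses = 0.0, []
for t in range(1, T + 1):
    losses.append((w - z[t - 1]) ** 2)   # suffer the round's loss, then update
    grad = 2.0 * (w - z[t - 1])
    w -= grad * 0.5 / np.sqrt(t)

best_fixed = np.mean((z.mean() - z) ** 2)   # loss rate of the best fixed w in hindsight
avg_regret = np.mean(losses) - best_fixed   # shrinks toward 0 as T grows
```
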

References

  1. [1]
    What is Loss Function? | IBM
In machine learning, loss functions measure model performance by calculating the deviation of a model's predictions from the “ground truth” predictions.
  2. [2]
    [PDF] The Method of Least Squares - The University of Texas at Dallas
    1974, for a history and pre-history of LSM). The modern approach was first exposed in 1805 by the French mathematician Legendre in a now classic memoir, but ...
  3. [3]
    Predictions and Decision Theory (1) - Statistics & Data Science
    Feb 4, 2021 · A loss function ℓ(y,a): how much it hurts to take action a when the ... Abraham Wald: reformulates inference as decision problems ...
  4. [4]
    [PDF] Lecture 1 (Statistical Decision Theory) - People @EECS
    Aug 29, 2019 · In this lecture, we discuss a unified theoretical framework of statistics proposed by Abraham Wald, which ... Loss Function: L(θ, δ). The loss ...
  5. [5]
    [PDF] 5 Decision Theory: Basic Concepts - Purdue Department of Statistics
    It seems only natural that once we have specified a loss function, we should prefer decision procedures δ that have low risk. Lower risk is considered such ...
  6. [6]
    Loss function | Linear regression, statistics, machine learning
Typically, loss functions are increasing in the absolute value of the estimation error and they have convenient mathematical properties, such as ...
  7. [7]
    [PDF] Statistical Decision Functions - Gwern
A brief historical note on the developments leading up to the present stage of the theory is given in Section 1.7 of Chapter 1. The first chapter is devoted to ...
  8. [8]
    Ridge Regression: Biased Estimation for Nonorthogonal Problems
    [11] HOERL, A. E. and KENNARD, R. W. (1968). On regression analysis and biased estimation. Technometrics 10, 422-423. Abstract.
  9. [9]
    [PDF] Algorithms for Direct 0–1 Loss Optimization in Binary Classification
0–1 loss is robust to outliers since it is not affected by a misclassified point's distance from the margin, but this property also makes it non-convex; the ...
  10. [10]
    Loss Functions and Metrics in Deep Learning - arXiv
    Aug 8, 2024 · 2.1 Properties of loss functions​​ 1. Convexity: A loss function is convex if any local minimum is also the global minimum.
  11. [11]
    [PDF] Legendre On Least Squares - University of York
    His work on geometry, in which he rearranged the propositions of Euclid, is one of the most successful textbooks ever written. On the Method of Least Squares.
  12. [12]
    Gauss and the Invention of Least Squares - Project Euclid
    Abstract. The most famous priority dispute in the history of statistics is that between Gauss and Legendre, over the discovery of the method of least squares.
  13. [13]
    Robust Estimation of a Location Parameter - Project Euclid
    March, 1964 Robust Estimation of a Location Parameter. Peter J. Huber ... This paper contains a new approach toward a theory of robust estimation; it ...
  14. [14]
    [PDF] Analysis of least absolute deviation - HKUST Math Department
Unlike the LS method, the LAD method is not sensitive to outliers and produces robust estimates.
  15. [15]
    Proof: The median minimizes the mean absolute error
Sep 23, 2024 · The median is the sole critical point, so it must be a global minimum. Therefore, the median must minimize the mean absolute error, completing the proof.
  16. [16]
    [PDF] Least absolute deviation estimation of linear econometric models
    Feb 13, 2007 · The paper by Charnes et al. (1955) is considered to be a seminal paper for giving a new lease of life to L1 regression. Fisher (1961) showed ...
  17. [17]
    [PDF] STA 732: Inference Notes 10. Parameter Estimation from a Decision ...
    The two theorems together say that under some regularity conditions on the model and the loss function, every Bayes rule is admissible and every admissible rule ...
  18. [18]
    [PDF] 3 Statistical Decision Theory - University of Bristol
This is a large set of loss functions, which should satisfy most clients who do not have a specific loss function already in mind. For point estimators there ...
  19. [19]
    [PDF] Loss Functions, Bayes Risk and Posterior Summaries - Stat@Duke
Bayes risk is defined as the expected loss averaged over the posterior distribution; the Bayes optimal estimate is the estimator that has the lowest such risk.
  20. [20]
    Mean squared error of an estimator | Bias-variance decomposition
    The mean squared error (MSE) of an estimator is a measure of the expected losses generated by the estimator. In this page: we briefly review some concepts ...
  21. [21]
    [PDF] Lecture 2. Bayes Decision Theory
loss function: L(α(x),y) cost of making decision α(x) when true state is y. The risk function combines the loss function, the decision rule, and the ...
  22. [22]
    [PDF] Intro to Decision Theory - Stat@Duke
The Bayes action δ∗(x) for any fixed x is the decision δ(x) that minimizes the posterior risk. If the problem at hand is to estimate some unknown parameter θ, ...
  23. [23]
    [PDF] Bayesian Estimation & Information Theory - pillow lab @ princeton
Typical loss functions and Bayesian estimators: “zero-one” loss (1 unless the estimate equals the true value) • posterior maximum (or “mode”) • known as the maximum a posteriori (MAP) estimate ...
  24. [24]
    Optimal Statistical Decisions | Wiley Online Books
Optimal Statistical Decisions. Author: Morris H. DeGroot. First published: 16 April 2004. Print ISBN: 9780471680291. Online ISBN: ...
  25. [25]
    [PDF] Lecture 16: Learning Theory: Empirical Risk Minimization
The empirical risk ... (Cc) is an unbiased estimator of R(Cc). Thus, as long as we have a set of new data, the empirical risk on the new dataset is an unbiased ...
  26. [26]
    7.3 Asymptotic Properties of Estimators - Bookdown
    7.3 Asymptotic Properties of Estimators. Estimator bias and precision are finite sample properties. That is, they are properties that hold for a fixed ...
  27. [27]
    [PDF] Stein's Unbiased Risk Estimate - Statistics & Data Science
    We'll walk through Stein's univariate and multivariate lemmas on the normal distribution. Following this, we'll discuss how they apply to unbiased risk ...
  28. [28]
    [PDF] Minimizing Sensitivity to Model Misspecification
    Oct 9, 2018 · This paper relates to several branches of the literature in economet- rics and statistics on robustness and sensitivity analysis. As in the ...
  29. [29]
    [PDF] Admissibility - Stat@Duke
Jan 17, 2025 · The main objectives of statistical decision theory are to develop, evaluate and compare the statistical properties of decision rules. A ...
  30. [30]
    Larry Brown's Work on Admissibility - Project Euclid
Stein had shown that for squared error loss, the best invariant estimator was admissible for p = 1 (Stein, 1959) and for p = 2 (James and Stein, 1961), in both.
  31. [31]
    [PDF] Basics of Decision Theory - Andrew B. Nobel
A rule d ∈ D is said to be minimax if R_m(d) = R*_m. Definition: The optimal Bayes risk for a family of decision rules D under a prior π is R*_π = inf ...
  32. [32]
    [PDF] Lecture 17: October 7 17.1 Minimax Estimators through Bayes ...
    Minimax estimators have lowest maximum risk, while Bayes estimators have lowest average risk. Bayes estimators can be used to find minimax estimators.
  33. [33]
    [PDF] Stein-1956.pdf - Yale Statistics and Data Science
If the loss is the sum of squares of the errors, this estimator is admissible for n ≤ 2, but inadmissible for n ≥ 3. Since the usual estimator is best among ...
  34. [34]
    [PDF] On the Mathematical Foundations of Theoretical Statistics Author(s)
(17) R. A. Fisher (1922). "The Interpretation of χ² from Contingency Tables, and the Calculation of P," J.R.S.S., lxxxv, pp. 87–94. (18) K. Pearson ...
  35. [35]
    [PDF] Likelihood Inference - Kosuke Imai
Maximum likelihood estimation (MLE): θ̂_n ≡ argmax_{θ∈Θ} L(θ | Y_1, ..., Y_n). ... Bayesian Information Criterion: BIC = −2 · log-likelihood + K · log n.
  36. [36]
    [PDF] econstor
    Nov 26, 2021 · Abstract: Gauss' 1809 discussion of least squares, which can be viewed as the beginning of mathematical statistics, is reviewed. The general ...
  37. [37]
    [PDF] Ordinary Least Squares Linear Regression - cs.Princeton
Aug 27, 2018 · ... the most popular one for linear regression is the squared loss or quadratic loss: ℓ(ŷ, y) = (ŷ − y)². Figure 2a plots the squared loss ...
  38. [38]
    [PDF] THE BREAKDOWN POINT — EXAMPLES AND ...
    The breakdown point is one of the most popular measures of robustness of a statistical procedure. Originally introduced for location functionals (Hampel, 1968, ...
  39. [39]
    3.7 Robust estimators | A First Course on Statistical Inference
The breakdown point of the sample median is ⌊n/2⌋/n, with ... Finally, a robust alternative for estimating σ is the median absolute deviation.
  40. [40]
    25.2 - Power Functions | STAT 415
The power function of a hypothesis test depends on the parameter being investigated, and increases as the actual mean moves further from the null mean.
  41. [41]
    [PDF] A Comprehensive Survey of Regression Based Loss Functions for ...
    Nov 5, 2022 · This paper summarizes 14 regression loss functions used for time series forecasting, which are important for instigating the learning process ...
  42. [42]
  43. [43]
    (PDF) Robust Stock Market Forecasting Using Huber Loss
    Through experiments with sample stock market data, we compare the performance of LSTM models trained with Huber loss against those using traditional MSE loss.
  44. [44]
    Quantile deep learning models for multi-step ahead time series ...
    Nov 24, 2024 · Applying the quantile loss function to time series data allows for a broader range of predicted values and enables an overview of uncertainties.
  45. [45]
    Statistical Properties of the log-cosh Loss Function Used in Machine ...
Aug 9, 2022 · This paper analyzes a popular loss function used in machine learning called the log-cosh loss function. A number of papers have been published ...
  46. [46]
    [PDF] Robust Optimization for Deep Regression - CVF Open Access
In this work, we propose a regression model with ConvNets that achieves robustness to such outliers by minimizing Tukey's biweight function, an M-estimator ...
  47. [47]
    V-Net: Fully Convolutional Neural Networks for Volumetric Medical ...
    Jun 15, 2016 · In this work we propose an approach to 3D image segmentation based on a volumetric, fully convolutional, neural network.
  48. [48]
    [1609.04802] Photo-Realistic Single Image Super-Resolution Using ...
    Sep 15, 2016 · To achieve this, we propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes ...
  49. [49]
    [PDF] Listwise Approach to Learning to Rank - Theory and Algorithm
In this paper, we aim to investigate the listwise approach to learning to rank, particularly from the viewpoint of loss functions. Actually similar ...
  50. [50]
    A Survey of Loss Functions in Deep Learning - MDPI
This paper theoretically improves the knowledge system of the loss function ... log-cosh loss function used in machine learning. arXiv 2022, arXiv ...
  51. [51]
    A comprehensive survey of loss functions and metrics in deep learning
    Apr 11, 2025 · This paper presents a comprehensive review of loss functions and performance metrics in deep learning, highlighting key developments and practical insights
  52. [52]
    Risk Aversion in the Small and in the Large - jstor
    This paper concerns utility functions for money. A measure of risk aversion in the small, the risk premium or insurance premium for an arbitrary risk, and a ...
  53. [53]
    [PDF] Prospect Theory: An Analysis of Decision under Risk - MIT
By Daniel Kahneman and Amos Tversky. This paper presents a critique of expected utility theory as a descriptive model of decision making under risk, ...
  54. [54]
    Thirty Years of Prospect Theory in Economics: A Review and ...
    In 1979, Daniel Kahneman and Amos Tversky, published a paper in Econometrica titled "Prospect Theory: An Analysis of Decision under Risk." The paper ...
  55. [55]
  56. [56]
    [PDF] On the Design of Loss Functions for Classification
    The paper proposes deriving loss functions from the functional form of minimum conditional risk, not just the loss itself, and that convexity of the loss is ...
  57. [57]
    [PDF] New aspects of Bregman divergence in regression and classification ...
    This article aims to study new aspects of Bregman divergence (BD), a notion which unifies nearly all of the commonly used loss functions in regression and ...
  58. [58]
    [PDF] optimal prediction under - asymmetric loss
Theorem 2 shows that the optimal predictor under conditional normality is the conditional mean plus a function of the conditional prediction-error variance.
  59. [59]
    [1502.06254] The fundamental nature of the log loss function - arXiv
    Feb 22, 2015 · This note shows that the log loss function is most selective in that any prediction algorithm that is optimal for a given data sequence.
  60. [60]
    RoBoSS: A Robust, Bounded, Sparse, and Smooth Loss Function ...
    Sep 5, 2023 · In this paper, we address the aforementioned constraints by proposing a novel robust, bounded, sparse, and smooth (RoBoSS) loss function for supervised ...
  61. [61]
    Application of machine learning methods in clinical trials for ...
We aim to implement machine learning (ML) algorithms into the response-adaptive randomization (RAR) design and improve the treatment outcomes.
  62. [62]
    [PDF] Large-Scale Machine Learning with Stochastic Gradient Descent
Léon Bottou. Table 1 lists stochastic gradient algorithms for various learning systems, pairing each loss with its stochastic gradient algorithm, e.g., Adaline (Widrow and Hoff, 1960).
  63. [63]
    [1412.6980] Adam: A Method for Stochastic Optimization - arXiv
    Dec 22, 2014 · We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order ...
  64. [64]
    [2008.10797] The Fairness-Accuracy Pareto Front - arXiv
    Aug 25, 2020 · We put to use the concept of Pareto optimality from multi-objective optimization and seek the fairness-accuracy Pareto front of a neural network classifier.