
Scoring rule

A scoring rule is a statistical measure that evaluates the quality of a probabilistic forecast by assigning a numerical score based on the predicted distribution and the actual observed outcome, with the goal of incentivizing accurate and honest reporting of probabilities. These rules are particularly valuable in fields requiring reliable predictions, such as meteorology, economics, and machine learning, where they quantify forecast performance and promote alignment between reported beliefs and expected scores. A key subclass consists of proper scoring rules, which are designed such that the expected score is maximized (or minimized, depending on the convention) precisely when the forecaster reports their true subjective probabilities, thereby eliciting truthful forecasts without strategic manipulation. Strictly proper scoring rules further ensure that this optimum is unique, achieved only by the true distribution, enhancing their robustness for elicitation purposes. The concept originated in the context of weather forecasting, with Glenn W. Brier introducing the first prominent example—the Brier score—in 1950 as a measure of the mean squared difference between predicted probabilities and binary outcomes (0 or 1), which remains widely used for its simplicity and decomposability into calibration (reliability), refinement (resolution), and uncertainty components. Other notable examples include the logarithmic score, which rewards higher probabilities for correct outcomes via the log-likelihood and relates to concepts like Kullback-Leibler divergence, and the spherical score, a pseudospherical rule suitable for categorical forecasts. Scoring rules have broad applications beyond meteorology, including in economics, in machine learning, and in decision analysis, where they facilitate model comparison, calibration checks, and incentive-compatible mechanisms like prediction markets. Their theoretical foundation draws from convex analysis and information theory, with proper rules often derived from Bregman divergences or entropy measures, ensuring mathematical consistency and interpretability. Ongoing research extends these tools to multivariate, spatial, and functional forecasts, addressing challenges such as computational tractability in high dimensions and robustness to outliers.

Definitions and Fundamentals

Probabilistic Forecast

A probabilistic forecast provides a predictive probability distribution over possible future outcomes or events, rather than a single predicted value. This distribution quantifies the forecaster's uncertainty by assigning probabilities to each potential result, enabling a more complete representation of what is known or believed about the future. In contrast to point forecasts, which deliver a single deterministic value such as an expected mean, probabilistic forecasts emphasize the full spectrum of possibilities to better capture inherent uncertainties in complex systems. This approach is particularly valuable in fields where decisions depend on assessing risks and ranges of outcomes, allowing users to weigh alternatives based on likelihoods. The origins of probabilistic forecasting trace back to early applications in meteorology during the early 20th century, with initial explicit treatments appearing in works like W. E. Cooke's 1906 analysis of weather predictions. These ideas were further developed in decision theory and statistics around the mid-20th century, notably through von Neumann and Morgenstern's 1944 formulation of expected utility theory, which formalized decision-making under probabilistic uncertainty. Early meteorological uses, such as Anders Ångström's explorations of forecast value in the 1920s and 1930s, including his 1927 and 1933 papers on probability forecasting, highlighted practical needs for probability-based predictions in weather services. For discrete outcomes, a probabilistic forecast takes the form of a probability mass function, where probabilities sum to one across all possibilities. In continuous cases, it is represented by a probability density function or a cumulative distribution function. A classic example is a weather forecaster stating a 70% chance of rain tomorrow, implying a probability distribution over rainy and non-rainy scenarios that informs decisions like event planning. Scoring rules serve to evaluate the accuracy and calibration of such forecasts against observed outcomes.

Scoring Rule

A scoring rule is a function S(y, p) that evaluates the quality of a probabilistic forecast p given an observed outcome y, assigning a numerical score where higher values indicate greater accuracy in the forecast. These rules are applied in contexts such as weather forecasting, economics, and machine learning to quantify how well predicted probability distributions align with realized events. In mathematical terms, a scoring rule generally takes the form S(y, p) = g(p) - h(y, p), where g depends only on the forecast and h incorporates both the outcome and the forecast, allowing for a structured assessment of deviation from truth. This structure facilitates comparisons across different methods by normalizing the evaluation process. Scoring rules serve a critical role in elicitation settings, where they incentivize forecasters to report their true probabilistic beliefs rather than biased estimates, thereby promoting reliable information gathering in decision-making processes. The development of scoring rules traces back to the 1950s, pioneered by Glenn W. Brier in his work on verifying probability forecasts for weather events, building on earlier foundations in statistical decision theory.

Point Forecast

A point forecast provides a single predicted value for a future outcome, representing a deterministic estimate without any associated probabilities. For instance, in weather forecasting, a point forecast might predict a temperature of exactly 25°C for a given location and time, serving as a concise summary of the expected observation. Such forecasts are often derived from models that optimize for a specific functional, like the mean or median, to minimize error under a chosen loss function. In the framework of scoring rules, a point forecast is equivalent to a degenerate probabilistic forecast, where the predictive distribution places 100% probability mass on the single predicted value, akin to a Dirac distribution concentrated at that point. This equivalence allows proper scoring rules designed for probabilistic forecasts—such as the continuous ranked probability score (CRPS)—to be applied directly, reducing to simpler error measures like the absolute error for point predictions. Despite their simplicity, point forecasts have notable limitations, as they do not quantify uncertainty, potentially leading to overconfident predictions and poorer decisions in scenarios with high variability or incomplete information. This omission can mislead decision-makers by implying certainty, particularly in complex domains like climate or economics where outcomes are inherently uncertain. Point forecasts remain prevalent in deterministic modeling approaches, such as early numerical weather prediction systems or basic statistical models, where computational efficiency prioritizes a single output over distributional details. Scoring rules adapt to these use cases by treating the point estimate as an implicit degenerate distribution, enabling consistent evaluation alongside more expressive probabilistic forecasts.

Scoring Function

A scoring function constitutes the foundational mathematical construct for assessing the quality of probabilistic forecasts by comparing a predicted distribution to a realized outcome. Denoted typically as S(o, F), where o is the observed outcome and F is the forecasted distribution, it maps these inputs to a real number that quantifies their correspondence. This structure underpins the evaluation process in forecast verification, serving as the atomic unit from which broader assessment mechanisms are built. For outcomes over a finite sample space, the scoring function takes the basic form S(y, p), where y represents the realized outcome and p is the vector of forecasted probabilities assigned to each possible outcome. In the continuous setting, involving probability densities f, the function S(y, f(y)) evaluates the forecast at the outcome y, with the integral form \int S(y, f(y)) f(y) \, dy capturing its expected value under the forecasted density; this provides the theoretical basis for averaging scores across potential realizations. These algebraic expressions ensure the function's applicability across diverse probabilistic domains without presupposing specific distributional assumptions. Scoring functions possess general properties that enhance their utility in forecast evaluation, including monotonicity with respect to forecast accuracy—scores improve (non-decreasing in positive orientation or non-increasing in negative orientation) as the forecasted distribution aligns more closely with the true underlying probabilities. They are not inherently required to satisfy stricter conditions like propriety, allowing flexibility in design while prioritizing sensitivity to discrepancies between forecast and outcome. Orientation influences whether higher numerical values indicate better performance (positive) or worse (negative), a choice that standardizes interpretation in applications. In contrast to fully specified scoring rules, which involve the operational deployment of these functions—such as aggregation over samples or integration into decision frameworks—scoring functions remain the elemental components focused solely on the pairwise comparison of outcome and forecast. This distinction underscores their role as versatile building blocks rather than complete evaluation protocols.

Orientation of Scoring Rules

Scoring rules are classified by their orientation, which determines whether higher or lower numerical values indicate superior forecast performance. Positively oriented scoring rules reward accurate probabilistic forecasts with higher scores, such as the logarithmic scoring rule, where the score is the logarithm of the predicted probability assigned to the observed outcome, and forecasters aim to maximize the expected score. In contrast, negatively oriented scoring rules function as penalties, assigning lower values (often closer to zero or more negative) to better forecasts; the Brier score, defined as the mean squared difference between predicted probabilities and binary outcomes, exemplifies this by penalizing deviations and is minimized for optimal performance. The orientation of a scoring rule can be inverted without altering its relative evaluation of forecasts: multiplying the rule by −1 flips it from positive to negative (or vice versa), preserving the ordering of forecast quality since the transformation is monotonic. This equivalence ensures that core properties like propriety remain intact under sign reversal. The choice of orientation carries practical implications for applications. Positively oriented rules align with elicitation contexts, where forecasters maximize expected scores to reveal true beliefs, a principle rooted in decision theory. Negatively oriented rules predominate in optimization, treating scores as loss functions to minimize during model training, facilitating integration with gradient-based algorithms. This convention of distinguishing orientations emerged as standard in the literature since the 1970s, though usage varies by discipline—positive in statistics and economics, negative in computational fields.

Expected Score

The expected score serves as the theoretical performance metric for a scoring rule, quantifying the long-run average score a forecaster would receive if repeatedly issuing the same probabilistic forecast under a true underlying distribution. For a forecast distribution p and true distribution q over a discrete sample space, the expected score is defined as \text{ES}(p, q) = \sum_y q(y) \, S(y, p), where S(y, p) is the score assigned to the forecast p upon observing outcome y. This expectation represents the mean score over outcomes drawn from q. In the continuous case, the expected score takes the form \text{ES}(p, q) = \int q(y) \, S(y, p(y)) \, dy, where the integration is over the sample space, and p(y) denotes the forecasted density or mass function evaluated at y. This formulation extends the discrete case to handle outcomes from continuous distributions, maintaining the focus on average performance relative to the true distribution q. The expected score plays a central role in evaluating forecast quality, as it measures the anticipated long-run performance of a scoring rule. For proper scoring rules, the expected score is maximized when the forecast p equals the true distribution q, incentivizing forecasters to report their true beliefs to achieve the highest possible average score. Strict propriety further strengthens this by ensuring that the maximum is unique, attained only when p = q, which promotes unambiguous truthful reporting in elicitation settings.
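
The discrete definition can be checked numerically. The sketch below is a minimal illustration, assuming the positively oriented logarithmic score S(y, p) = log p(y): it computes the expected score of a forecast p under a true distribution q and shows that the truthful forecast attains the highest value.

```python
import numpy as np

def log_score(y, p):
    """Positively oriented logarithmic score: log of the probability assigned to outcome y."""
    return np.log(p[y])

def expected_score(p, q, score=log_score):
    """Expected score ES(p, q) = sum_y q(y) * S(y, p) over a discrete sample space."""
    return sum(q[y] * score(y, p) for y in range(len(q)))

q = np.array([0.2, 0.5, 0.3])           # true distribution
p_true = q                              # truthful forecast
p_off = np.array([0.4, 0.4, 0.2])       # mis-specified forecast

print(expected_score(p_true, q))        # about -1.03 (highest attainable)
print(expected_score(p_off, q))         # strictly lower, about -1.12
```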

Sample Average Score

The sample average score, also referred to as the empirical score, quantifies the performance of probabilistic forecasts by averaging the values of a scoring rule over a finite sample of forecast-observation pairs. For n pairs (y_i, p_i), where y_i denotes the observed outcome and p_i the forecast at occasion i, it is defined as \bar{S}_n = \frac{1}{n} \sum_{i=1}^n S(y_i, p_i), with S representing the scoring rule. This measure was introduced in early forecast verification studies within meteorology, notably in Brier's 1950 analysis of probabilistic weather predictions, where it served as a metric to assess forecast accuracy across multiple events. Assuming the pairs (y_i, p_i) are independent and identically distributed, the sample average score is an unbiased estimator of the expected score, meaning its expectation equals the true value. Furthermore, under standard regularity conditions such as finite variance, it converges to the expected score as the sample size n approaches infinity by the law of large numbers; this consistency property ensures that long-term empirical performance reliably reflects theoretical quality. In practice, the sample average score facilitates forecaster evaluation and model comparison using finite datasets, where competing methods are ranked by their aggregated scores—typically favoring those with minimal values for negatively oriented rules—to identify superior predictive systems without awaiting infinite data. This data-driven approach bridges theoretical expected scores to operational decision-making, though finite-sample variability necessitates careful sample size considerations for robust comparisons.
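
To illustrate the convergence described above, the following sketch (a toy example assuming the negatively oriented binary Brier score (p - y)^2) draws forecast-observation pairs from a known model and shows the sample average score approaching its expected value as n grows; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def brier(y, p):
    """Binary Brier score (negatively oriented): squared gap between forecast probability and outcome."""
    return (p - y) ** 2

# Simulate a forecaster who always reports the true event probability 0.7.
p_true = 0.7
for n in (10, 1_000, 100_000):
    y = rng.binomial(1, p_true, size=n)            # observed binary outcomes
    sample_avg = np.mean([brier(yi, p_true) for yi in y])
    print(n, round(sample_avg, 4))

# The expected score is p(1-p)^2 + (1-p)p^2 = p(1-p) = 0.21, which the
# sample average approaches by the law of large numbers.
```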

Properties and Theoretical Foundations

Proper Scoring Rules

A proper scoring rule is a mechanism for evaluating probabilistic forecasts such that the expected score is maximized when the forecaster reports their true distribution. Formally, for a scoring rule S(\mathbf{q}, x) that assigns a score based on reported probabilities \mathbf{q} and observed outcome x, and a true distribution P, the rule is proper if the expected score satisfies \mathbb{E}_{P}[S(P, X)] \geq \mathbb{E}_{P}[S(Q, X)] for all probability distributions Q. This property ensures that, under the true distribution, no other report yields a higher expected score than truthful reporting. Strict propriety strengthens this condition by guaranteeing a unique maximum at the true distribution, meaning the inequality is strict unless Q = P. This unique incentive aligns the forecaster's optimal report precisely with their true beliefs, eliminating any ambiguity in the reporting problem. The theoretical foundations of proper scoring rules trace back to subjective probability and decision theory in the mid-20th century, with Leonard Savage's 1971 work providing a seminal characterization linking them to expected utility maximization and subjective probability elicitation. Savage demonstrated that proper scoring rules correspond to convex functions on probability simplices, enabling their use as elicitation devices to infer personal probabilities without strategic distortion, building on earlier ideas from de Finetti's axioms. In practice, proper scoring rules facilitate truthful elicitation of beliefs in uncertain environments, preventing strategic misrepresentation by making dishonesty suboptimal in expectation. This has profound implications for applications like forecast verification and incentive-compatible mechanisms in economics and statistics.

Strictly Proper Scoring Rules

A strictly proper scoring rule is a refinement of a proper scoring rule, where the expected score under the true distribution p is uniquely maximized by reporting p itself. Formally, for a scoring rule S, it is strictly proper if \mathbb{E}_{Y \sim p} [S(p, Y)] > \mathbb{E}_{Y \sim p} [S(q, Y)] for all probability distributions q \neq p. This strict inequality ensures that any deviation from the true forecast p results in a strictly lower expected score, incentivizing precise calibration and sharpness in probabilistic forecasts. In contrast to merely proper scoring rules, which allow multiple forecasts to achieve the maximum expected score, strictly proper rules eliminate such ambiguity, making them particularly valuable for eliciting truthful and unique probabilistic predictions. Examples of proper but not strictly proper scoring rules include the zero-one score, which assigns a score of 1 if the forecast's most probable category matches the outcome and 0 otherwise; this is proper because the expected score is maximized by any forecast whose mode coincides with that of the true distribution, but it is not strictly proper due to this non-uniqueness. Similarly, the energy score with parameter \beta = 2 is proper for distributions with finite second moments but not strictly proper, as distinct forecasts sharing the same mean yield the same expected score. Such non-strict rules are rare in practice, as they often fail to distinguish between equally calibrated but differently sharp forecasts. Strict propriety is tied to strict convexity: by the Gneiting–Raftery characterization, a scoring rule S is strictly proper if and only if its expected score function G(p) = \mathbb{E}_{Y \sim p}[S(p, Y)] is strictly convex, with the rule expressible as S(q, y) = G(q) + \langle G'(q), \delta_y - q \rangle for a subgradient G' of G. The expected score shortfall then equals the Bregman divergence generated by G, \mathbb{E}_{Y \sim p}[S(p, Y)] - \mathbb{E}_{Y \sim p}[S(q, Y)] = G(p) - G(q) - \langle G'(q), p - q \rangle, which is strictly positive for q \neq p by strict convexity; this yields the unique maximization of the expected score at q = p. Most commonly used scoring rules in forecasting and statistics are strictly proper, including the logarithmic score and the Brier (quadratic) score. For instance, the logarithmic score, defined as S(p, y) = \log p(y), is strictly proper for both discrete and continuous distributions, uniquely rewarding the true probabilities. Likewise, the Brier score for categorical outcomes, S(p, y) = - \sum_k (p_k - \mathbb{I}\{y = k\})^2, is strictly proper, providing a quadratic penalty for deviations from the true distribution. These properties make strictly proper rules the standard for applications requiring robust evaluation of probabilistic forecasts.
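
A quick numerical check of strict propriety, assuming the categorical Brier score in its loss form: for a fixed true distribution q, sweeping over candidate reports shows the expected loss is smallest only at p = q. A minimal sketch under that assumption:

```python
import numpy as np

def brier_loss(p, y, K):
    """Categorical Brier loss: squared distance between forecast vector and the one-hot outcome."""
    o = np.eye(K)[y]
    return np.sum((p - o) ** 2)

def expected_brier_loss(p, q):
    K = len(q)
    return sum(q[y] * brier_loss(p, y, K) for y in range(K))

q = np.array([0.6, 0.3, 0.1])
candidates = {
    "truthful q": q,
    "hedged": np.array([1/3, 1/3, 1/3]),
    "overconfident": np.array([0.9, 0.05, 0.05]),
}
for name, p in candidates.items():
    print(name, round(expected_brier_loss(p, q), 4))
# The truthful report attains the strictly smallest expected loss (0.54 here).
```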

Consistent Scoring Functions

In statistics, a consistent scoring function provides a framework for evaluating point forecasts of specific distributional functionals, ensuring that the scoring rule incentivizes accurate estimation of those functionals. Formally, consider a space \mathcal{X} of possible outcomes, a space \mathcal{Y} of forecasts, and the set \mathcal{P}(\mathcal{X}) of probability distributions over \mathcal{X}. A functional T: \mathcal{P}(\mathcal{X}) \to \mathcal{Y} maps distributions to point estimates, such as the mean or a quantile. A scoring function S: \mathcal{Y} \times \mathcal{X} \to \mathbb{R} is consistent for T if, for any distribution F \in \mathcal{P}(\mathcal{X}) and random outcome Y \sim F, \mathbb{E}[S(T(F), Y)] = \inf_{y \in \mathcal{Y}} \mathbb{E}[S(y, Y)], with the infimum achieved uniquely at y = T(F) for strict consistency. This property implies that the expected score is minimized precisely when the forecast equals the true functional value, promoting truthful reporting. In practice, with an observed sample X_1, \dots, X_n \stackrel{\text{iid}}{\sim} F, the sample average score \bar{S}_n(y) = n^{-1} \sum_{i=1}^n S(y, X_i) serves as an estimator of the expected score. Under suitable regularity conditions, such as convexity of S and continuity of T, the minimizer \hat{y}_n = \arg\min_y \bar{S}_n(y) converges almost surely to T(F) as n \to \infty, making consistent scoring functions a basis for asymptotically valid point estimation. This convergence underpins their utility in empirical forecast verification, where the sample average approximates the population minimization. Proper scoring rules, which evaluate full probabilistic forecasts, represent a special case of consistent scoring functions where the functional T is the identity mapping, i.e., T(F) = F, directly eliciting the true distribution. Classic examples illustrate this framework. The squared error score S(y, x) = (y - x)^2 is strictly consistent for the mean functional T(F) = \mathbb{E}_F[X], as its expected value \mathbb{E}[(y - X)^2] = (y - \mu)^2 + \mathrm{Var}(X) (with \mu = \mathbb{E}[X]) is uniquely minimized at y = \mu. Similarly, the absolute error S(y, x) = |y - x| is consistent for the median functional T(F) = F^{-1}(1/2), since the expected absolute deviation is minimized at the median for any distribution. These examples extend to other functionals like quantiles and expectiles, where consistent scores can be constructed via integral representations. The general theory of consistent scoring functions was formalized in the statistics literature during the 2010s, building on earlier work in decision theory and forecast verification to address challenges in verifying point forecasts for complex functionals in fields like economics and meteorology. Seminal contributions, including characterizations via Choquet integrals and implications for forecast rankings, have established their role in ensuring robust evaluation beyond simple location parameters.
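
The squared error and absolute error examples can be verified empirically: minimizing the sample average of each score over a grid of candidate point forecasts recovers (approximately) the sample mean and the sample median, respectively. A minimal sketch under these assumptions, using skewed data so that the two functionals differ:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=20_000)   # skewed data: mean != median

grid = np.linspace(0.0, 5.0, 501)                     # candidate point forecasts y

sq_avg = [np.mean((y - x) ** 2) for y in grid]        # consistent for the mean
abs_avg = [np.mean(np.abs(y - x)) for y in grid]      # consistent for the median

print("argmin squared error:", grid[np.argmin(sq_avg)], "sample mean:", x.mean())
print("argmin absolute error:", grid[np.argmin(abs_avg)], "sample median:", np.median(x))
```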

Applications in Practice

Weather and Climate Forecasting

Scoring rules have played a pivotal role in the verification of probabilistic forecasts in meteorology since the mid-20th century, particularly for assessing the accuracy of predictions in weather and climate contexts. The Brier score, originally proposed in 1950, was developed specifically for evaluating probabilistic forecasts of binary events, such as the occurrence or non-occurrence of precipitation, providing a quadratic measure of forecast accuracy that penalizes deviations from observed outcomes. For continuous predictands like temperature or precipitation amount, the Continuous Ranked Probability Score (CRPS) emerged in the 1970s as a key metric, generalizing the absolute error to full predictive distributions and enabling fair comparisons between deterministic and ensemble forecasts. In operational settings, such as those at the European Centre for Medium-Range Weather Forecasts (ECMWF), scoring rules are routinely applied to evaluate ensemble prediction systems for variables including surface temperature and precipitation. The Discrete Ranked Probability Score (RPS) is used for categorical forecasts, measuring the cumulative differences between predicted and observed cumulative probabilities across ordered categories, while the CRPS assesses continuous forecasts by integrating the squared differences between predictive and empirical cumulative distribution functions derived from ensembles. These metrics are computed over verification periods using reanalysis datasets like ERA5, allowing forecasters to quantify performance across lead times from short-range to medium-range predictions. The primary benefits of these scoring rules in weather and climate forecasting lie in their ability to simultaneously evaluate calibration—the statistical reliability of predicted probabilities—and sharpness—the concentration of predictive distributions around expected outcomes—thus incentivizing forecasts that are both accurate and informative. By rewarding proper probabilistic reporting, they facilitate the construction of skill scores, such as the Ranked Probability Skill Score (RPSS), which normalize raw scores against climatological benchmarks to highlight relative improvements in forecast quality. By the 2020s, scoring rules had become embedded in the verification processes for large-scale intercomparisons, including Phase 6 of the Coupled Model Intercomparison Project (CMIP6), where the Brier Skill Score is applied to assess the performance of global climate models in simulating seasonal precipitation patterns against observational data. This integration supports model selection and bias correction in projections of future climate variability, enhancing the reliability of assessments for adaptation planning in sectors like agriculture and water resource management.

Economic and Decision Theory

In economic and decision theory, proper scoring rules play a crucial role in eliciting truthful subjective probabilities from individuals in settings such as prediction markets and surveys, ensuring that reported beliefs maximize the expected reward under risk neutrality. These rules incentivize honest reporting by assigning scores that peak in expectation when the forecaster's report matches their true beliefs, thereby facilitating the aggregation of dispersed information for collective decision-making. Applications of scoring rules in economics include prediction markets, where mechanisms like the logarithmic market scoring rule (LMSR) enable continuous trading and probability updates without requiring matched counterparties, promoting efficient information revelation. For instance, the Iowa Electronic Markets, a long-running real-money platform for forecasting elections and economic events, leverages market-based incentives akin to scoring rules to aggregate participant beliefs into accurate consensus probabilities. In cost-benefit analysis, scoring rules aid in quantifying uncertainty by eliciting probabilistic assessments of outcomes, allowing decision-makers to compute expected net benefits under alternative scenarios. A key theoretical link exists between scoring rules and utility maximization: for a risk-neutral agent, the expected score from a proper rule aligns precisely with the expected utility derived from acting on the true beliefs, making truthful reporting the optimal strategy in Bayesian decision frameworks. This connection underpins their use in econometrics for modeling subjective expectations. The prominence of scoring rules in economic applications grew in the 1990s and 2000s within econometrics, particularly through Charles Manski's work on measuring subjective probabilities via incentivized survey methods that address biases in responses. Manski emphasized their potential to reveal full subjective distributions rather than point estimates, enhancing econometric analysis of decision-making under uncertainty.

Machine Learning and Calibration

In machine learning, proper scoring rules serve as loss functions for training probabilistic classifiers and regressors, incentivizing models to output calibrated probability distributions. The logarithmic scoring rule, equivalent to the cross-entropy loss, is widely used in neural networks for multi-class classification, where it minimizes the negative log-likelihood of the true class under the predicted probability distribution. This loss function encourages the model to assign high probabilities to correct classes and low probabilities to incorrect ones, facilitating gradient-based optimization in deep learning frameworks. Similarly, the Brier score, or quadratic scoring rule, functions as a mean squared error loss for multi-class settings by penalizing deviations between predicted probabilities and one-hot encoded true labels across all classes. Both rules are strictly proper, ensuring that the model's expected loss is minimized only when its predictions match the true conditional distribution, which supports reliable probabilistic outputs in supervised learning tasks. Scoring rules are instrumental in evaluating and improving model calibration, a critical aspect of trustworthy artificial intelligence where predicted probabilities should reflect true empirical frequencies. The expected calibration error (ECE) quantifies miscalibration by binning predictions based on confidence and measuring the difference between average predicted probabilities and observed accuracies within each bin, often complemented by sample averages of proper scoring rules like the Brier or logarithmic score to estimate reliability. Poor calibration, detectable through high ECE values, can lead to overconfident or underconfident predictions, undermining applications in high-stakes domains such as medical diagnosis or autonomous systems; thus, post-hoc recalibration techniques, informed by these scores, adjust model outputs to align confidence with accuracy. In trustworthy AI, proper scoring rules provide a unified framework for assessing not just accuracy but also the sharpness and resolution of probabilistic forecasts, ensuring models are both precise and reliable. Recent developments in the 2020s have extended scoring rules to uncertainty quantification in large language models (LLMs), where they evaluate the decomposition of total uncertainty into aleatoric (data-inherent) and epistemic (model-induced) components. For instance, frameworks using proper scores like the Brier and logarithmic rules have improved uncertainty estimation in LLMs for tasks such as clinical text classification, reducing such scores by up to 74% and improving expected calibration error through Bayesian-inspired approximations of posterior distributions. Methods leveraging proper scoring rules also guide selective prediction and out-of-distribution detection by tailoring uncertainty estimates to task-specific losses for better performance in real-world deployment. A key theoretical link exists between proper scoring rules and Bayesian methods, where the expected logarithmic score under the true distribution equals the negative Shannon entropy minus the Kullback-Leibler divergence to the predicted distribution, implying that proper rules minimize this divergence when predictions align with the true posterior. This connection underpins their use in variational inference and Bayesian neural networks, promoting forecasts that are information-theoretically optimal.
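
To make the calibration discussion concrete, the sketch below computes a simple binned expected calibration error (ECE) and the average binary Brier score for a classifier's predicted probabilities; the binning scheme, synthetic data, and names are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def brier(probs, labels):
    """Mean binary Brier score (lower is better)."""
    return np.mean((probs - labels) ** 2)

def ece(probs, labels, n_bins=10):
    """Binned expected calibration error: weighted gap between mean confidence and observed frequency."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            total += mask.mean() * gap
    return total

rng = np.random.default_rng(2)
true_p = rng.uniform(size=10_000)
labels = rng.binomial(1, true_p)
calibrated = true_p                                        # well-calibrated probabilities
overconfident = np.clip(true_p * 1.5 - 0.25, 0.0, 1.0)     # probabilities pushed toward 0 and 1

for name, probs in [("calibrated", calibrated), ("overconfident", overconfident)]:
    print(name, "Brier:", round(brier(probs, labels), 4), "ECE:", round(ece(probs, labels), 4))
```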

Examples of Proper Scoring Rules

Logarithmic Score for Categorical and Continuous Variables

The logarithmic scoring rule, also known as the log score or ignorance score, evaluates probabilistic forecasts by assigning a score based on the logarithm of the predicted probability assigned to the observed outcome. For categorical variables with a finite set of outcomes, the score is defined as S(y, p) = \log p(y), where y is the realized outcome and p(y) is the forecaster's assigned probability to that outcome, assuming a positive orientation where higher scores indicate better performance. This formulation originates from early work on admissible probability measurement procedures, where it was identified as the unique local strictly proper scoring rule. For continuous variables, the logarithmic score extends naturally to probability densities, given by S(y, f) = \log f(y), where f denotes the predicted probability density function and y is the observed value. In both discrete and continuous cases, the score is undefined if the predicted probability or density at the outcome is zero, which underscores its strict requirement for positive support across possible outcomes. The logarithmic score is strictly proper, meaning that the expected score is uniquely maximized when the forecaster reports their true subjective probabilities, incentivizing honesty without strategic misrepresentation. It elicits geometric pooling in aggregation contexts: when combining multiple forecasts, the optimal combined probability corresponds to a normalized weighted geometric mean of the individual predictions under this rule. Additionally, the score is particularly sensitive to low-probability assignments; assigning a very small probability to the realized outcome results in a large negative score, heavily penalizing underestimation of events that actually occur. The derivation of the logarithmic score's propriety follows from information theory, where the expected score under true probabilities q for a forecast p is \mathbb{E}_q[S(y, p)] = \sum_y q(y) \log p(y) in the discrete case (or the integral analogue for continuous variables), which simplifies to the negative Shannon entropy minus the Kullback-Leibler divergence: \mathbb{E}_q[S(y, p)] = -H(q) - D_{KL}(q \| p). This expression is maximized uniquely when p = q, as the KL divergence is zero only for matching distributions, thereby linking the score's optimization to the minimization of Kullback-Leibler divergence.
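
A minimal sketch of the logarithmic score in both settings described above, using the positive orientation S = log p(y) for a categorical forecast and S = log f(y) for a Gaussian density forecast; the example distributions are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

def log_score_categorical(p, y):
    """Log of the probability the forecast assigns to the realized category y."""
    return np.log(p[y])            # diverges to -inf if p[y] == 0

def log_score_density(density, y):
    """Log of the predictive density evaluated at the observed value y."""
    return np.log(density(y))

p = np.array([0.1, 0.7, 0.2])
print(log_score_categorical(p, 1))            # log 0.7 ~ -0.36
print(log_score_categorical(p, 0))            # log 0.1 ~ -2.30 (low probability heavily penalized)

forecast_pdf = norm(loc=20.0, scale=2.0).pdf  # Gaussian density forecast
print(log_score_density(forecast_pdf, 21.0))  # log f(21) ~ -1.74
```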

Brier and Quadratic Scores

The Brier score is a quadratic scoring rule originally developed for evaluating probabilistic forecasts in meteorology. Introduced by Glenn W. Brier in 1950, it was proposed as a means to verify weather predictions expressed in terms of probabilities, such as the likelihood of precipitation, by measuring the squared difference between forecasted probabilities and observed outcomes. This score has since become a standard metric in forecast verification across various fields, rewarding forecasts that align closely with observed events while penalizing deviations. For a categorical prediction with K possible outcomes, let \mathbf{p} = (p_1, \dots, p_K) denote the forecasted probability vector, where \sum_{i=1}^K p_i = 1 and p_i \geq 0, and let y be the realized outcome. The indicator vector \mathbf{o} has o_i = I(y = i), which equals 1 if outcome i occurs and 0 otherwise. The Brier score is then given by BS(\mathbf{p}, y) = -\sum_{i=1}^K (p_i - I(y = i))^2 = -\|\mathbf{p} - \mathbf{o}\|^2, where \|\cdot\|^2 is the squared Euclidean norm. This negative orientation treats the score as a reward, with higher (less negative) values indicating superior forecast accuracy; equivalently, the positive version \|\mathbf{p} - \mathbf{o}\|^2 functions as a loss, minimized for accurate predictions. For a sequence of n independent forecasts, the overall score is the average \frac{1}{n} \sum_{t=1}^n BS(\mathbf{p}_t, y_t). The Brier score possesses key theoretical properties that underpin its utility. It is a strictly proper scoring rule: for any true distribution \mathbf{q}, the expected score \mathbb{E}[BS(\mathbf{p}, y) \mid \mathbf{q}] is uniquely maximized when \mathbf{p} = \mathbf{q}, incentivizing forecasters to report their true beliefs rather than hedging or biasing probabilities. Additionally, it is decomposable, permitting the expected score to be partitioned into terms capturing calibration (the reliability of predicted probabilities relative to observed frequencies) and refinement (the sharpness or variability of the forecasts, reflecting their informativeness). Specifically, in the loss orientation, the expected Brier score can be expressed as \mathbb{E}[BS] = U - R + C, where U is the uncertainty inherent in the observations (a fixed climatological term), R is the resolution or refinement (higher for sharper forecasts), and C is the calibration error (lower for well-calibrated forecasts); this decomposition, originally due to Murphy, highlights trade-offs in forecast quality. As a quadratic rule, the Brier score generalizes naturally to point forecasts. In the binary case, when the forecast is degenerate—assigning probability 1 to a single predicted outcome \hat{y} and 0 to the alternative—the single-term score simplifies to BS(\mathbf{p}, y) = -(\hat{y} - y)^2, the negative squared error between the point prediction and the actual outcome. This connection positions the Brier score as an extension of the classical quadratic loss function to probabilistic settings, bridging deterministic and uncertain forecasting paradigms.
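
The following sketch computes the categorical Brier score in the reward (negative) orientation defined above, together with the average over a small set of forecasts; the outcome encoding and values are illustrative.

```python
import numpy as np

def brier_reward(p, y):
    """Negatively signed squared distance to the one-hot outcome: higher (closer to 0) is better."""
    o = np.zeros_like(p)
    o[y] = 1.0
    return -np.sum((p - o) ** 2)

forecasts = [np.array([0.8, 0.1, 0.1]),
             np.array([0.2, 0.5, 0.3]),
             np.array([1/3, 1/3, 1/3])]
outcomes = [0, 1, 2]                      # realized categories

scores = [brier_reward(p, y) for p, y in zip(forecasts, outcomes)]
print(scores)                             # first forecast: -(0.2^2 + 0.1^2 + 0.1^2) = -0.06
print("average score:", np.mean(scores))
```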

Spherical Score

The spherical score is a strictly proper scoring rule designed for evaluating probabilistic forecasts over categorical outcomes, where the forecast is represented by a probability vector \mathbf{p} = (p_1, \dots, p_m) with \sum_{i=1}^m p_i = 1 and p_i \geq 0, and the outcome is indicated by a vector \mathbf{o} such that o_y = 1 and o_i = 0 for i \neq y. It can be expressed in vector form as the inner product of the forecast and outcome vectors, normalized by the Euclidean norm of the forecast vector: S(\mathbf{p}, \mathbf{o}) = \frac{\mathbf{p} \cdot \mathbf{o}}{\| \mathbf{p} \|_2} = \frac{p_y}{\sqrt{\sum_{i=1}^m p_i^2}}. An alternative formulation, which scales the score differently while preserving its propriety, is S(\mathbf{p}, y) = 1 - \frac{(1 - p_y)^2}{\sum_{i=1}^m p_i^2}. This rule assigns higher scores (closer to 1) for forecasts that align well with the observed outcome, with the maximum score of 1 achieved when \mathbf{p} is a Dirac delta at y. The spherical score is strictly proper, meaning that for any true probability distribution \mathbf{q}, the expected score is uniquely maximized when the forecast \mathbf{p} = \mathbf{q}. It belongs to a family of power-based scoring rules derived from Bregman divergences associated with the \ell_\alpha-norm for \alpha = 2, ensuring that deviations from the true distribution are penalized in a convex manner. Regarding transformations, the spherical score's expected value under the true distribution remains invariant in ordering under positive affine transformations of the score function, a property shared by all proper scoring rules, which preserves its incentive compatibility for eliciting honest forecasts. Geometrically, the spherical score interprets the forecast vector \mathbf{p} as a point in the probability simplex, projected onto the unit sphere in \mathbb{R}^m via normalization by \| \mathbf{p} \|_2. The score then measures the cosine of the angle between this normalized forecast and the outcome direction \mathbf{o}, rewarding forecasts that are both concentrated (higher \| \mathbf{p} \|_2) and correctly aligned with the outcome. This projection emphasizes directional accuracy over magnitude, making it particularly suitable for distinguishing forecasts in scenarios where probability mass is spread across many categories. Although less commonly applied than the logarithmic or Brier scores, the spherical score finds utility in high-dimensional categorical settings, such as multi-class classification or ensemble forecasting with numerous outcomes, where its normalization helps evaluate relative alignments without excessive sensitivity to the overall spread of the forecast distribution.
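
A minimal sketch of the spherical score in its vector form p_y / ||p||_2, with a degenerate forecast included to show the maximum score of 1; the example values are illustrative.

```python
import numpy as np

def spherical_score(p, y):
    """Probability assigned to the realized outcome, normalized by the Euclidean norm of the forecast."""
    return p[y] / np.linalg.norm(p)

print(spherical_score(np.array([0.7, 0.2, 0.1]), 0))   # 0.7 / sqrt(0.54) ~ 0.95
print(spherical_score(np.array([1/3, 1/3, 1/3]), 0))   # uniform forecast ~ 0.58
print(spherical_score(np.array([1.0, 0.0, 0.0]), 0))   # degenerate forecast, correct outcome: exactly 1
```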

Ranked Probability Score

The Ranked Probability Score (RPS) is a strictly proper scoring rule specifically developed for evaluating probabilistic forecasts of ordered categorical outcomes, such as severity levels or other ranked categories. Introduced by Edward Epstein in 1969, it extends the Brier score by incorporating the ordinal structure of categories, thereby addressing limitations in scores that ignore category ordering. This makes the RPS particularly useful for applications where the relative positions of categories matter, like intensity forecasts for weather phenomena or economic indicators on ordinal scales. The RPS is computed based on the cumulative distribution functions of the forecast and observation. For K ordered categories, let F_k denote the forecasted cumulative probability up to category k (i.e., the probability that the outcome is category k or lower), and O_k the corresponding observed cumulative indicator (0 if the outcome exceeds category k, 1 otherwise). The score is given by: \text{RPS} = \frac{1}{K-1} \sum_{k=1}^{K-1} (F_k - O_k)^2 Lower values indicate better forecast accuracy, with the minimum value of 0 achieved when the forecast perfectly matches the observation. This formulation normalizes the score to the range [0, 1], facilitating comparisons across different numbers of categories. As a strictly proper rule, the RPS incentivizes forecasters to report their true subjective probabilities, minimizing the expected score only when the forecast distribution matches the true one. It penalizes rank errors by quadratically weighting discrepancies in cumulative probabilities, with larger penalties for misplacements farther apart in the ordering—for instance, confusing the lowest and highest categories incurs a higher cost than confusing adjacent ones. Compared to the Brier score, which evaluates each category independently and thus underutilizes ordinal information, the RPS provides a more nuanced assessment for ordered data, improving interpretability in contexts like multi-level risk assessments. In practice, the RPS has found application in economics, particularly for verifying survey-based density forecasts of variables like inflation or GDP growth, where ordinal categorizations are common. Its sensitivity to cumulative misalignments helps highlight both miscalibration and lack of sharpness in such forecasts, though it assumes equal spacing between categories unless weights are incorporated. Seminal analyses confirm its robustness as an extension of quadratic scoring rules from the 1970s, balancing simplicity with ordinal awareness.
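
A minimal sketch of the normalized RPS formula above, using cumulative sums of the forecast and one-hot observation vectors; the example forecast over three ordered categories is an assumption.

```python
import numpy as np

def rps(p, y):
    """Normalized ranked probability score over K ordered categories (lower is better)."""
    K = len(p)
    F = np.cumsum(p)               # forecast CDF over categories
    O = np.cumsum(np.eye(K)[y])    # observed (step) CDF
    return np.sum((F[:-1] - O[:-1]) ** 2) / (K - 1)

p = np.array([0.2, 0.5, 0.3])               # forecast over ordered categories 0 < 1 < 2
print(rps(p, 1))                            # outcome in the most likely (middle) category: 0.065
print(rps(p, 2))                            # outcome one category away: larger penalty, 0.265
print(rps(np.array([0.0, 1.0, 0.0]), 1))    # perfect forecast: 0.0
```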

Continuous Ranked Probability Score

The Continuous Ranked Probability Score (CRPS) is a strictly proper scoring rule designed to evaluate univariate probabilistic forecasts expressed as cumulative distribution functions (CDFs). It measures the squared distance between the forecast CDF F and the step-function CDF corresponding to the observed value y, providing a comprehensive assessment of forecast sharpness and calibration. The CRPS is defined as \text{CRPS}(F, y) = \int_{-\infty}^{\infty} \left( F(z) - \mathbb{I}(z \geq y) \right)^2 \, dz, where \mathbb{I}(\cdot) is the indicator function. This integral form generalizes the Brier score to distributional forecasts, reducing to the absolute error | \mu - y | when F is a point mass at a point forecast \mu. As a strictly proper score, the CRPS is minimized in expectation when the forecast CDF matches the true conditional distribution, incentivizing honest probabilistic predictions. For computation, the CRPS lacks a universal closed-form expression but can be evaluated efficiently using numerical quadrature methods, and for certain parametric distributions explicit formulas exist; for the Gaussian case, involving the standard normal cumulative distribution function \Phi and density \phi, with z = (y - \mu)/\sigma: \text{CRPS}(\mathcal{N}(\mu, \sigma^2), y) = \sigma \left[ z \left( 2 \Phi(z) - 1 \right) + 2 \phi(z) - \frac{1}{\sqrt{\pi}} \right]. When forecasts are represented by samples, an O(n \log n) algorithm based on order statistics provides an unbiased estimate. Advanced methods link the CRPS to distances in reproducing kernel Hilbert spaces via kernel mean embeddings, enabling scalable computation for complex distributions. Since the early 2000s, the CRPS has become a standard metric in meteorology for verifying ensemble forecasts, decomposing into components of reliability, resolution, and uncertainty to diagnose forecast performance. In hydrology, it is widely applied to assess probabilistic predictions of variables like river discharge and precipitation, supporting model calibration and intercomparison.
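
The closed-form Gaussian expression and a sample-based estimate can be compared directly. The sketch below assumes the standard identity CRPS = E|X − y| − ½E|X − X′| estimated from an ensemble of draws; the function names and ensemble size are illustrative choices.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS for a Normal(mu, sigma^2) forecast."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

def crps_ensemble(x, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'| over ensemble members x."""
    x = np.asarray(x)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

mu, sigma, y = 0.0, 1.0, 0.8
print(crps_gaussian(mu, sigma, y))           # exact value, about 0.48

rng = np.random.default_rng(3)
ensemble = rng.normal(mu, sigma, size=2000)
print(crps_ensemble(ensemble, y))            # Monte Carlo estimate, close to the exact value
```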

Energy and Variogram Scores for Multivariate Cases

The energy and variogram scores provide essential tools for evaluating multivariate probabilistic forecasts of continuous joint distributions, capturing dependencies across dimensions that univariate scores cannot address. These scores are particularly valuable in fields requiring assessment of spatial or temporal correlations, such as climate modeling, where forecasts often involve vector-valued outcomes like temperatures or wind speeds at multiple locations. The energy score measures the discrepancy between a forecast distribution F and an observation y \in \mathbb{R}^d using Euclidean distances between random draws. It is given by \mathrm{ES}(F, y) = \mathbb{E}_{X \sim F} [ \|X - y\| ] - \frac{1}{2} \mathbb{E}_{X, X' \sim F} [ \|X - X'\| ], where \|\cdot\| denotes the Euclidean norm and the expectations are with respect to independent draws from F. This formulation ensures the score is strictly proper, as the expected score \mathbb{E}_{Y \sim G} [\mathrm{ES}(F, Y)] is minimized when F = G, the true distribution. In practice, the score is estimated via Monte Carlo integration using samples from F, making it suitable for ensemble-based forecasts. The energy score was introduced as a multivariate generalization of the continuous ranked probability score and has kernel representations linking it to negative definite functions. When comparing two distributions via samples X_1, \dots, X_n \sim F and Y_1, \dots, Y_m \sim G, an analogous quantity, the energy distance, is D(F, G) = \mathbb{E} [ \|X - Y\| ] - \frac{1}{2} \left( \mathbb{E} [ \|X - X'\| ] + \mathbb{E} [ \|Y - Y'\| ] \right), which quantifies the discrepancy between the two distributions and equals zero only when F = G (under mild conditions). For a single observation (m=1, Y_1 = y), this reduces to the energy score. The score penalizes deviations in the location of the forecast distribution as well as errors in its spread. Developed in the mid-2000s, the energy score saw increased adoption in the 2010s for verifying ensemble forecasts in meteorology and climate science. The variogram score, inspired by the semivariogram in geostatistics, assesses the dependence structure of a multivariate forecast by comparing powers of pairwise component differences. For an order \alpha > 0 (commonly between 0.5 and 2) and non-negative weights w_{ij}, it takes the form \mathrm{VS}_\alpha(F, y) = \sum_{i<j} w_{ij} \left( |y_i - y_j|^\alpha - \mathbb{E}_{X \sim F} |X_i - X_j|^\alpha \right)^2, where X_i denotes the i-th component of a random draw from F; the score compares the observed pairwise differences to the forecast's expected variogram of order \alpha. The variogram score is proper, though not strictly proper, since it depends on the forecast only through its expected pairwise differences, and it is particularly sensitive to misspecifications in variances and correlations, complementing the energy score. Introduced in the 2010s specifically for multivariate probabilistic forecasts in spatial and temporal contexts, it has been applied to climate variables like wind speeds across sites, aiding diagnosis of ensemble underdispersion or correlation errors. Both scores facilitate multivariate forecast evaluation by highlighting errors related to marginal distributions and dependence structures, though they require careful weighting and scaling for comparability across dimensions.
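
Sample-based estimates of both multivariate scores can be written compactly. The sketch below assumes an ensemble of d-dimensional draws from the forecast; the energy score uses the form given above, and the variogram score compares pairwise component differences of order α with unit weights, following the construction described above. Values and names are illustrative.

```python
import numpy as np

def energy_score(ens, y):
    """Sample energy score: E||X - y|| - 0.5 * E||X - X'|| over ensemble members (rows of ens)."""
    d1 = np.mean(np.linalg.norm(ens - y, axis=1))
    d2 = np.mean(np.linalg.norm(ens[:, None, :] - ens[None, :, :], axis=2))
    return d1 - 0.5 * d2

def variogram_score(ens, y, alpha=0.5):
    """Variogram score of order alpha with unit weights (proper, not strictly proper)."""
    d = len(y)
    vs = 0.0
    for i in range(d):
        for j in range(i + 1, d):
            obs = np.abs(y[i] - y[j]) ** alpha
            fct = np.mean(np.abs(ens[:, i] - ens[:, j]) ** alpha)
            vs += (obs - fct) ** 2
    return vs

rng = np.random.default_rng(4)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])            # forecast with strong positive correlation
ens = rng.multivariate_normal([0.0, 0.0], cov, size=1000)
y = np.array([1.0, -1.0])                           # observation at odds with that correlation

print("energy score:", round(energy_score(ens, y), 3))
print("variogram score:", round(variogram_score(ens, y), 3))
```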

Comparison and Interpretation of Proper Scoring Rules

Proper scoring rules, while all incentivizing honest reporting of probabilistic forecasts, differ in their sensitivity to various aspects of forecast quality, such as calibration, sharpness, and tail behavior. For instance, the logarithmic score is highly sensitive to extreme probabilities, penalizing overconfident predictions in the tails more severely than the Brier score, which emphasizes overall calibration and is more forgiving of errors in low-probability events. In contrast, the continuous ranked probability score (CRPS) is robust to outliers but less sensitive to multivariate structures compared to the energy score, which better detects errors in joint distributions. These differences affect interpretability, as scores like the Brier score provide decomposable insights into reliability and resolution, aiding diagnostic analysis. All strictly proper scoring rules are equivalent for the purpose of incentivizing truthful reporting—each yields the true distribution as the optimal forecast under expected score maximization—but they diverge in robustness and practical utility. For example, while the logarithmic score's sensitivity to low-probability events makes it ideal for emphasizing sharpness, the CRPS offers greater robustness in continuous settings by integrating over the full cumulative distribution function. This shared incentive property ensures consistent elicitation, yet differences in finite samples or under misspecification highlight the need for rule-specific interpretations, such as the energy score's strength in multivariate robustness despite higher computational demands. The choice of a proper scoring rule depends on the forecasting domain and data characteristics. In categorical settings, the Brier or logarithmic scores are preferred for their direct handling of discrete outcomes and ease of decomposition. For continuous variables, the CRPS suits univariate cases by evaluating the entire distribution, while the energy score excels in multivariate scenarios where dependencies matter. Ordered data may favor scores like the ranked probability score for preserving ordinal structure, whereas unordered multivariate data benefits from kernel-based approaches like the energy score.
Scoring Rule | Sensitivity Characteristics | Computability Notes | Typical Domain Suitability
Logarithmic Score | High sensitivity to tails and extreme probabilities; penalizes miscalibration in low-probability events severely. | Closed-form for densities; efficient for categorical outcomes. | Categorical forecasts; emphasizes sharpness.
Brier Score | Moderate sensitivity to calibration and sharpness; robust to extreme values compared to log score. | Simple calculation; decomposable. | Categorical and binary forecasts; diagnostic via decomposition.
CRPS | Sensitive to marginal distributions; less sensitive to dependencies; robust to outliers. | Integral over CDF, approximated via empirical CDF or closed-form expressions; feasible for univariate data. | Continuous univariate; full distribution evaluation.
Energy Score | Sensitive to joint dependencies and tail structures; robust in multivariate settings. | Sample-based via pairwise distances; Monte Carlo estimation, intensive for high dimensions. | Multivariate continuous; dependency-focused.
No single proper scoring rule is universally superior, as performance varies by context; instead, decomposition—such as the Brier score's breakdown into reliability, resolution, uncertainty, and binning effects—enables targeted diagnosis of forecast weaknesses like poor calibration or low discrimination. This approach, complemented by the Diebold-Mariano test for assessing the significance of score differences, guides rule selection by revealing trade-offs in sensitivity and interpretability.

Characteristics and Advanced Properties

Affine Transformations

An affine transformation of a scoring rule S is given by S'(p, x) = a S(p, x) + b, where a > 0 is a positive scaling factor and b \in \mathbb{R} is a constant shift. If S is a strictly proper scoring rule—meaning that the forecaster maximizes their expected score by reporting their true belief q—then S' is also strictly proper. This equivalence holds because the transformation does not alter the distribution p that maximizes the expected score under the true outcome distribution q. The preservation of strict propriety follows from the linearity of the expectation operator. Specifically, the expected score under S' is \mathbb{E}_{q}[S'(p, X)] = a \mathbb{E}_{q}[S(p, X)] + b. Since a > 0, the maximum occurs at the same p = q as for S, and the maximum is unique if S is strictly proper. Moreover, many proper scoring rules admit a Bregman representation, where the expected score relates to a divergence d(q, p) derived from a strictly convex function; affine transformations leave this divergence structure intact up to scaling, as the scaling and shift apply uniformly. These transformations have key implications for the design and application of scoring rules. They allow normalization, such as rescaling to a bounded range like [0, 1], without changing the incentives for truthful reporting, which is particularly useful for practical implementations where score ranges need standardization. Additionally, since expected scores under different forecasts maintain their relative orderings—\mathbb{E}_{q}[S'(p_1, X)] > \mathbb{E}_{q}[S'(p_2, X)] if and only if \mathbb{E}_{q}[S(p_1, X)] > \mathbb{E}_{q}[S(p_2, X)] for a > 0—affine invariance ensures consistent comparability across equivalent scoring rules.
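
A small numerical check of the ordering-preservation claim, assuming the categorical Brier reward score from earlier sections and an arbitrary positive affine transformation; the chosen constants are illustrative.

```python
import numpy as np

def brier_reward(p, y):
    """Categorical Brier score in reward orientation."""
    o = np.zeros_like(p)
    o[y] = 1.0
    return -np.sum((p - o) ** 2)

def expected(score, p, q):
    """Expected score of forecast p when outcomes follow the true distribution q."""
    return sum(q[y] * score(p, y) for y in range(len(q)))

a, b = 3.0, 7.0                                    # positive scaling and shift
transformed = lambda p, y: a * brier_reward(p, y) + b

q = np.array([0.5, 0.3, 0.2])                      # true distribution
p1, p2 = q, np.array([0.1, 0.6, 0.3])

print(expected(brier_reward, p1, q) > expected(brier_reward, p2, q))  # True: truthful beats off-target
print(expected(transformed, p1, q) > expected(transformed, p2, q))    # True: ordering preserved
```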

Locality and Decomposition

Locality in scoring rules refers to the property that the score assigned to a probabilistic forecast depends solely on the predictive probability or density evaluated at the observed outcome, without requiring information from the entire forecast distribution. This local dependence simplifies computation and interpretation, as seen in the logarithmic scoring rule, where the score is S(p, o) = -\log p(o), depending only on the forecast probability p(o) at the outcome o. In contrast, non-local rules like the Brier score for multi-category forecasts incorporate the full vector of probabilities, making them more computationally intensive but potentially more robust to distributional assumptions. Local scoring rules exhibit desirable local behavior, such as differentiability in smooth variants, which measures sensitivity to small perturbations in the forecast or outcome near the observation. For instance, the quadratic scoring rule, a strictly proper rule, is twice differentiable, allowing gradients to quantify how the score changes with nearby outcomes and aiding optimization in forecast elicitation. This differentiability supports local analysis of forecast accuracy, particularly in continuous settings where small errors in density estimation can be isolated. Decomposition of proper scoring rules breaks the overall score into interpretable components that reveal specific aspects of forecast quality, such as calibration (reliability), resolution, and uncertainty. For the Brier score in binary or multi-category settings, the score decomposes as \mathrm{BS} = \overline{U} - \overline{R} + \overline{C}, where \overline{U} is the average uncertainty (inherent variability in outcomes), \overline{R} is the resolution (the forecast's ability to distinguish outcomes), and \overline{C} is the calibration or reliability term (the match between forecast probabilities and observed frequencies). This framework, developed in the forecast verification literature by Murphy, Winkler, and others, generalizes to other strictly proper scoring rules, where the score separates into a reliability term (measuring calibration) and a refinement term (capturing sharpness or resolution), both of which are non-negative for proper rules. Such decompositions facilitate diagnosis of forecast weaknesses; for example, high reliability but low resolution indicates over-hedging, while poor reliability signals systematic biases in probability estimates. In proper scoring rules, the positive decomposition ensures that improvements in individual components directly reduce the overall score, providing a diagnostic tool for refining forecasting systems without altering the incentive for truthful reporting.
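
The Brier decomposition described above can be computed from binned forecast-observation pairs. A minimal sketch, assuming binary outcomes and binning forecasts by their issued probability values (the binning makes the identity hold only approximately for continuous-valued forecasts):

```python
import numpy as np

def brier_decomposition(probs, outcomes, n_bins=10):
    """Decomposition of the mean binary Brier score: BS ~ REL - RES + UNC."""
    probs, outcomes = np.asarray(probs), np.asarray(outcomes)
    base_rate = outcomes.mean()
    unc = base_rate * (1 - base_rate)          # uncertainty: climatological variance
    rel = res = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if mask.any():
            w = mask.mean()
            p_bar = probs[mask].mean()          # mean forecast in bin
            o_bar = outcomes[mask].mean()       # observed frequency in bin
            rel += w * (p_bar - o_bar) ** 2     # reliability (calibration error)
            res += w * (o_bar - base_rate) ** 2 # resolution
    return rel, res, unc

rng = np.random.default_rng(5)
p = rng.uniform(size=20_000)
y = rng.binomial(1, p)
rel, res, unc = brier_decomposition(p, y)
print("REL - RES + UNC:", round(rel - res + unc, 4))
print("mean Brier score:", round(np.mean((p - y) ** 2), 4))   # approximately equal
```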

Elicitability and Recent Developments

A functional T: \mathcal{F} \to A \subseteq \mathbb{R}^k on a class of probability distributions \mathcal{F} is elicitable if there exists a strictly consistent scoring function S such that, for any distribution F \in \mathcal{F}, the expected score \mathbb{E}_{Y \sim F} [S(T(F), Y)] is uniquely minimized at the true value T(F). This property ensures that forecasts of T can be incentivized truthfully through the scoring function. For instance, the mean functional is elicitable via the squared error S(r, y) = (r - y)^2, while quantiles are elicitable using asymmetric piecewise linear (pinball) loss functions. However, the variance alone is not elicitable, though the pair of mean and variance is jointly elicitable with a bivariate scoring function. Full probability distributions are generally not elicitable as single real-valued functionals, though they can be elicited component-wise for finite support cases. Proper scoring rules, which are strictly consistent for full distributions, relate to elicitability by providing expected score optimization at the true distribution; for elicitable functionals, consistent scoring functions serve a similar role but target summaries like moments or quantiles. This connection underpins the use of scoring rules in evaluating probabilistic forecasts beyond complete distributions, such as in financial risk management, where joint elicitability of quantile-based measures aids Value-at-Risk and expected shortfall assessment. In the 2020s, research has advanced kernel-based scoring rules for high-dimensional settings, leveraging rough path theory to handle spatio-temporal data such as weather forecasts, where traditional scores struggle with path dependencies. For applications in artificial intelligence, scoring rules have been adapted for calibrating large language models (LLMs), as in the Credence Calibration Game, which uses symmetric or exponential scoring strategies to align LLM confidence scores with empirical accuracy in long-form generations. Post-2020 work integrates scoring rules into conformal prediction, particularly through novel score functions that enhance coverage guarantees for prediction sets by quantifying nonconformity more robustly, with applications extending to regression and classification tasks.
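
The quantile example can be illustrated with the pinball (asymmetric piecewise linear) loss, which is consistent for the τ-quantile; the sketch below is a toy check under that assumption, recovering the 0.9-quantile of a sample by minimizing the sample average loss over a grid.

```python
import numpy as np

def pinball_loss(r, y, tau):
    """Asymmetric piecewise linear loss, consistent for the tau-quantile."""
    return np.where(y >= r, tau * (y - r), (1 - tau) * (r - y))

rng = np.random.default_rng(6)
x = rng.gamma(shape=2.0, scale=1.0, size=50_000)
tau = 0.9

grid = np.linspace(0.0, 10.0, 1001)                    # candidate quantile forecasts
avg_loss = [pinball_loss(r, x, tau).mean() for r in grid]

print("argmin of pinball loss:", grid[np.argmin(avg_loss)])
print("empirical 0.9-quantile:", np.quantile(x, tau))  # the two values nearly coincide
```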

  31. [31]
    Estimation of the Continuous Ranked Probability Score with Limited ...
    Nov 21, 2017 · The continuous ranked probability score (CRPS) is a much used measure of performance for probabilistic forecasts of a scalar observation.
  32. [32]
  33. [33]
    Performance Metrics for the Comparative Analysis of Clinical Risk ...
    Compared with the Brier score, the log-loss increases more rapidly for extreme values of predictions around 0 or 1 that are untrue. The choice between the.
  34. [34]
    [PDF] On Sensitive and Strictly Proper Scoring Rules - arXiv
    Oct 16, 2019 · We find that the energy score, which is probably the most widely used multivariate scoring rule, performs comparably well in detecting forecast ...<|control11|><|separator|>
  35. [35]
    Two Extra Components in the Brier Score Decomposition in
    The Brier score can be decomposed into the sum of three components: uncertainty, reliability, and resolution.Abstract · Introduction · The Brier score decomposition · b. Brier score decomposition
  36. [36]
    Beyond Strictly Proper Scoring Rules: The Importance of Being Local
    Dec 23, 2020 · The only local strictly proper scoring rules, the logarithmic score, has direct interpretations in terms of probabilities and bits of ...
  37. [37]
    Strictly Proper Scoring Rules, Prediction, and Estimation
    This article reviews and develops the theory of proper scoring rules on general probability spaces, and proposes and discusses examples thereof.
  38. [38]
    Beyond Strictly Proper Scoring Rules: The Importance of Being ...
    The only local strictly proper scoring rule, the logarithmic score, has direct interpretations in terms of probabilities and bits of information.
  39. [39]
    [PDF] Scoring Probabilistic Forecasts: The Importance of Being Proper - LSE
    Aug 3, 2006 · There is, effectively, only one proper, local score for probability forecasts of a continuous variable. It is also noted that operational.
  40. [40]
    Reliability, sufficiency, and the decomposition of proper scores
    Jul 14, 2009 · ... decomposition into terms related to the resolution and the reliability of a forecast. This fact is particularly well known for the Brier Score.
  41. [41]
    [PDF] Elicitability and its Application in Risk Management - arXiv
    Jul 30, 2017 · Definition 1.2 (Elicitability). A functional T : F → A ⊆ Rk is called elicitable if there exists a strictly F-consistent scoring function for T.
  42. [42]
    [PDF] Elicitability - Stanford AI Lab
    We can of course elicit the probability of each Ei separately (e.g., with a quadratic scoring rule) but it is not possible to elicit the probabilities of all ...
  43. [43]
    Distribution‐Based Model Evaluation and Diagnostics: Elicitability ...
    Jun 19, 2024 · Scoring rules have long been used to evaluate the accuracy of forecast probabilities after observing the occurrence, or nonoccurrence, of ...
  44. [44]
    [PDF] Elicitability and Encompassing for Volatility Forecasts by Bregman ...
    Sep 30, 2023 · Gneiting (2011) states that variance alone is not elicitable, but there exists a 2-elicitable functional of joint mean and variance. Brehmer ...
  45. [45]
    [PDF] From Classification Accuracy to Proper Scoring Rules: Elicitability of ...
    Point predictions are evaluated by means of consistent scoring or loss functions. Similar to proper scoring rules, a scoring function is consistent for a ...
  46. [46]
    [PDF] Signature Kernel Scoring Rule as Spatio-Temporal Diagnostic for ...
    Oct 21, 2025 · Since 2022, new data-driven machine learning weather prediction approaches (MLWP) boosted by GPU acceleration have seen tremendous development.
  47. [47]
    Credence Calibration Game? Calibrating Large Language Models ...
    Aug 20, 2025 · Table 1: Scoring rules of the Credence Calibration Game under symmetric and exponential scoring strategies. Report issue for preceding ...Missing: PLMs | Show results with:PLMs
  48. [48]
  49. [49]
    A Quantum Probability Approach to Improving Human–AI Decision ...
    This perspective paper explores how QPT may be applied to human–AI interactions and contributes by integrating these concepts into human-in-the-loop decision ...Missing: scoring | Show results with:scoring<|control11|><|separator|>