
Normal distribution

The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution for a real-valued random variable that is symmetric and bell-shaped, with values most concentrated around its mean and decreasing smoothly away from it. It is defined by two parameters: the mean μ, which specifies the center of the distribution, and the standard deviation σ, which measures the spread or width. The probability density function for a normal random variable X is given by
f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right),
where the total area under the curve equals 1, representing the total probability.
The normal distribution is the most widely used distribution in statistics because it approximates many natural phenomena so well, such as heights, test scores, and measurement errors, due to the influence of numerous small, independent random factors. This ubiquity is largely explained by the central limit theorem, which states that the sampling distribution of the sample mean from any population with finite mean and variance approaches a normal distribution as the sample size increases, regardless of the underlying population distribution. First introduced by Abraham de Moivre in 1733 as an approximation for binomial probabilities and later formalized by Carl Friedrich Gauss in 1809 in the context of astronomical error analysis, the normal distribution underpins much of statistical inference, including hypothesis testing and confidence intervals.

Definitions

Probability Density Function

The probability density function (PDF) of the normal distribution, also known as the Gaussian distribution, for a random variable X with mean \mu \in \mathbb{R} and positive variance \sigma^2 > 0 is defined as f(x \mid \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), where x \in \mathbb{R}. In this formulation, \mu serves as the location parameter, specifying the center or mean of the distribution, while \sigma acts as the scale parameter, determining the standard deviation and thus the dispersion around the mean. This PDF provides the relative likelihood of X taking on a specific value x, with the function being continuous and non-negative, integrating to 1 over the entire real line to satisfy the axioms of probability. The normalizing constant \frac{1}{\sigma \sqrt{2\pi}} ensures the total area under the curve equals 1, derived from the fundamental Gaussian integral \int_{-\infty}^{\infty} \exp\left(-\frac{u^2}{2}\right) \, du = \sqrt{2\pi}. To see this, substitute u = \frac{x - \mu}{\sigma} into the integral of the PDF, yielding \int_{-\infty}^{\infty} f(x \mid \mu, \sigma^2) \, dx = \frac{1}{\sigma \sqrt{2\pi}} \cdot \sigma \int_{-\infty}^{\infty} \exp\left(-\frac{u^2}{2}\right) \, du = 1. This integral, classically evaluated using techniques such as the polar coordinate transformation, underpins the validity of the normal PDF as a proper probability distribution. Graphically, the PDF produces a characteristic bell-shaped curve that is symmetric about \mu, with the peak occurring at x = \mu where the density is maximized at \frac{1}{\sigma \sqrt{2\pi}}. Shifting \mu translates the curve horizontally along the x-axis without altering its shape or width, while increasing \sigma flattens and broadens the curve, reducing the peak height and spreading the probability mass over a larger range; conversely, decreasing \sigma sharpens and narrows it. The normal distribution is a member of the exponential family of distributions, expressible in canonical form as f(x \mid \eta) = h(x) \exp\left( \eta_1 x + \eta_2 x^2 - A(\eta) \right), where h(x) = 1, the natural sufficient statistic is T(x) = (x, x^2), and the natural parameters are \eta_1 = \frac{\mu}{\sigma^2} and \eta_2 = -\frac{1}{2\sigma^2}. This parameterization highlights the distribution's flexibility and facilitates statistical inference, as the exponential family structure simplifies maximum likelihood estimation and Bayesian updates.
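As a numerical check of the normalization and peak-height claims above, the following Python sketch (illustrative only, using NumPy and SciPy with arbitrary parameter values) evaluates the density and integrates it over the real line.

```python
# Illustrative sketch: evaluate the normal PDF and verify numerically that it
# integrates to 1 and peaks at 1/(sigma*sqrt(2*pi)). Parameter values are arbitrary.
import numpy as np
from scipy.integrate import quad

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

mu, sigma = 1.5, 2.0
total, _ = quad(normal_pdf, -np.inf, np.inf, args=(mu, sigma))
print(total)                      # ~1.0: the density is properly normalized
print(normal_pdf(mu, mu, sigma))  # peak height 1/(sigma*sqrt(2*pi)) ~= 0.1995
```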

Cumulative Distribution Function

The cumulative distribution function (CDF) of a normal random variable X \sim \mathcal{N}(\mu, \sigma^2) is defined as F(x \mid \mu, \sigma^2) = \int_{-\infty}^x f(t \mid \mu, \sigma^2) \, dt, where f(t \mid \mu, \sigma^2) is the probability density function of the normal distribution. This integral represents the probability that X takes a value less than or equal to x. The CDF lacks a closed-form expression in terms of elementary functions, as the antiderivative of the Gaussian density cannot be expressed using basic operations like polynomials, exponentials, or logarithms; this limitation dates to the early development of the normal distribution and necessitates special functions for explicit representation. Specifically, it can be written using the error function \operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} \, dt, yielding F(x \mid \mu, \sigma^2) = \frac{1}{2} \left[ 1 + \operatorname{erf}\left( \frac{x - \mu}{\sigma \sqrt{2}} \right) \right]. The error function itself is a special function arising from the integral of the Gaussian kernel, providing a standardized way to tabulate and compute cumulative probabilities. For the standard normal distribution \mathcal{N}(0, 1), the CDF simplifies to \Phi(z) = F(z \mid 0, 1) = \frac{1}{2} \left[ 1 + \operatorname{erf}\left( \frac{z}{\sqrt{2}} \right) \right]. The general CDF relates to this via standardization: F(x \mid \mu, \sigma^2) = \Phi\left( \frac{x - \mu}{\sigma} \right). The function \Phi(z) is continuous and strictly increasing (monotonic), with limits \Phi(-\infty) = 0 and \Phi(\infty) = 1, ensuring it serves as a proper cumulative distribution function. Additionally, it exhibits symmetry about zero: \Phi(-z) = 1 - \Phi(z), reflecting the symmetric bell shape of the underlying density.
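The error-function representation can be checked directly against a library implementation; the sketch below is illustrative, with arbitrary parameters, and compares the formula with scipy.stats.norm.cdf while verifying the symmetry Φ(-z) = 1 - Φ(z).

```python
# Illustrative sketch: the normal CDF written via the error function,
# compared against scipy.stats.norm.cdf; parameters are arbitrary.
import numpy as np
from scipy.special import erf
from scipy.stats import norm

def normal_cdf(x, mu=0.0, sigma=1.0):
    """F(x | mu, sigma^2) = 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * np.sqrt(2.0))))

x = np.array([-1.0, 0.0, 2.5])
mu, sigma = 1.0, 2.0
print(normal_cdf(x, mu, sigma))
print(norm.cdf(x, loc=mu, scale=sigma))   # agrees to machine precision
print(normal_cdf(-x) + normal_cdf(x))     # symmetry: Phi(-z) + Phi(z) = 1
```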

Standard Normal Distribution

The standard normal distribution, also known as the z-distribution, is a specific case of the normal distribution with mean 0 and variance 1, serving as a foundational reference for theoretical and computational purposes in statistics. Its probability density function (PDF) is given by \phi(z) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z^2}{2} \right), which describes a symmetric bell-shaped curve centered at zero, with the total area under the curve equal to 1. The cumulative distribution function (CDF) of the standard normal distribution, denoted Φ(z), represents the probability that a standard normal random variable is less than or equal to z, and is computed as the integral of the PDF from negative infinity to z. Common values include Φ(1) ≈ 0.8413, Φ(2) ≈ 0.9772, and Φ(3) ≈ 0.9987, which illustrate the concentration of probability near the mean. These values underpin the empirical rule, or 68-95-99.7 rule, stating that approximately 68% of the distribution lies within ±1 standard deviation of the mean, 95% within ±2 standard deviations, and 99.7% within ±3 standard deviations. To transform a random variable X from a general normal distribution with mean μ and standard deviation σ to the standard normal form, the z-score is used: z = (X - μ) / σ. This standardization allows any normal distribution to be expressed in terms of the standard normal, facilitating the use of precomputed tables and simplifying hypothesis testing and probability calculations. In practice, z-score standardization enables comparisons across datasets with different scales and units, such as converting test scores or measurements to a common metric for relative performance evaluation. The general normal distribution can be viewed as a linear transformation of the standard normal, shifting and scaling it by μ and σ, respectively.
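The z-score transformation and the 68-95-99.7 rule can be reproduced in a few lines; the sketch below is illustrative (the IQ-style scale is just an example) and uses SciPy's standard normal CDF.

```python
# Illustrative sketch: standardization z = (X - mu) / sigma and the empirical rule.
from scipy.stats import norm

mu, sigma = 100.0, 15.0           # example scale (e.g., IQ-style scores)
x = 130.0
z = (x - mu) / sigma              # z-score: 2 standard deviations above the mean
print(z, norm.cdf(z))             # Phi(2) ~ 0.9772

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(k, round(coverage, 4))  # ~0.6827, 0.9545, 0.9973
```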

General Normal Distribution

The general normal distribution arises as a location-scale family derived from the standard normal distribution. If Z follows a standard normal distribution N(0,1), then the random variable X = \mu + \sigma Z follows a normal distribution N(\mu, \sigma^2), where \mu \in \mathbb{R} is the location parameter representing the mean and \sigma > 0 is the scale parameter representing the standard deviation. This transformation enables the normal distribution to flexibly model real-world data by shifting the center via \mu and adjusting the spread via \sigma, while preserving the bell-shaped symmetry of the standard form. An alternative parameterization replaces the variance \sigma^2 with the precision \tau = 1/\sigma^2, which is particularly useful in Bayesian inference due to its compatibility with conjugate priors. The probability density function under this parameterization is f(x \mid \mu, \tau) = \sqrt{\frac{\tau}{2\pi}} \exp\left( -\frac{\tau (x - \mu)^2}{2} \right), where \tau > 0 measures the concentration of the distribution, so that smaller \tau corresponds to wider distributions. This form highlights the relationship between precision and variance, facilitating computations in models involving multiple normals. The univariate normal extends naturally to multivariate settings as a precursor to joint distributions, where vectors of variables are characterized by mean vectors and covariance matrices that generalize the scalar \mu and \sigma^2. A key uniqueness property of the normal distribution is its closure under linear combinations: if independent random variables follow normal distributions, then any linear combination of them also follows a normal distribution, a characterization that distinguishes it from other families and underpins its role in statistical theory.

Properties

Moments and Symmetry

The normal distribution, denoted X \sim \mathcal{N}(\mu, \sigma^2), has a mean given by the first raw moment, E[X] = \mu. The higher-order moments can be expressed in terms of the mean and variance, but the central moments, defined as E[(X - \mu)^k], provide insight into the distribution's shape relative to its center. For odd k \geq 3, these central moments are zero, reflecting the distribution's symmetry. For even k = 2m where m \geq 1, the central moments are E[(X - \mu)^{2m}] = \sigma^{2m} (2m - 1)!!, with (2m - 1)!! denoting the double factorial, the product of all odd positive integers up to 2m - 1. In particular, the second central moment is the variance \sigma^2. The third central moment, normalized by \sigma^3, yields the skewness coefficient of zero, indicating perfect symmetry around the mean. This symmetry arises from the probability density function f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right), which satisfies f(\mu + x) = f(\mu - x) for all x, making the density an even function about \mu. Consequently, the median, mode, and mean all coincide at \mu. The fourth central moment, normalized appropriately, gives a kurtosis of 3, resulting in an excess kurtosis of zero; this mesokurtic property means the normal distribution has tails and peakedness that serve as the baseline against which other distributions are compared. These moments can be derived using the moment-generating function (MGF) M(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right), obtained by completing the square in the integral M(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f(x) \, dx. The k-th raw moment is then E[X^k] = M^{(k)}(0), the k-th derivative of the MGF evaluated at t = 0; central moments follow by shifting via the binomial expansion or direct computation. Alternatively, direct integration yields the central moments: for even powers, the Gaussian integral \int_{-\infty}^{\infty} x^{2m} e^{-x^2/2} \, dx = \sqrt{2\pi} (2m - 1)!! (after standardization and rescaling by \sigma) confirms the formulas, while odd powers vanish by symmetry.
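The central-moment formulas can be verified symbolically; the sketch below is illustrative, uses SymPy, and writes u for x - μ so that the integrals reproduce 0, σ², 0, 3σ⁴, 0, 15σ⁶ for k = 1, …, 6.

```python
# Illustrative sketch: symbolic check that odd central moments vanish and even
# ones equal sigma**(2m) * (2m - 1)!!; here u stands for x - mu.
import sympy as sp

u = sp.symbols('u', real=True)
sigma = sp.symbols('sigma', positive=True)
pdf = sp.exp(-u**2 / (2 * sigma**2)) / (sigma * sp.sqrt(2 * sp.pi))

for k in range(1, 7):
    m_k = sp.simplify(sp.integrate(u**k * pdf, (u, -sp.oo, sp.oo)))
    print(k, m_k)   # 0, sigma**2, 0, 3*sigma**4, 0, 15*sigma**6
```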

Generating Functions

The moment-generating function (MGF) of a normal random variable X \sim \mathcal{N}(\mu, \sigma^2) is defined as M_X(t) = \mathbb{E}[e^{tX}] and evaluates to M_X(t) = \exp(\mu t + \sigma^2 t^2 / 2) for all real t \in \mathbb{R}. This closed-form expression extends to an entire function on the complex plane, reflecting the normal distribution's infinite differentiability and providing a tool for deriving higher-order moments via differentiation at t = 0. The cumulant-generating function is the natural logarithm of the MGF, given by K_X(t) = \log M_X(t) = \mu t + \sigma^2 t^2 / 2. Its Taylor series expansion yields the cumulants, where the first cumulant \kappa_1 = \mu is the mean, the second \kappa_2 = \sigma^2 is the variance, and all higher-order cumulants \kappa_n = 0 for n > 2, underscoring the normal distribution's lack of skewness and excess kurtosis beyond the Gaussian form. The characteristic function \psi_X(t) = \mathbb{E}[e^{itX}], where i = \sqrt{-1}, serves as the Fourier transform of the probability density function and is \psi_X(t) = \exp(i \mu t - \sigma^2 t^2 / 2). This function uniquely determines the distribution and facilitates proofs of convergence in distribution, such as in the central limit theorem. The inversion formula allows recovery of the density from the characteristic function via the Fourier transform: the probability density function f_X(x) satisfies f_X(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx} \psi_X(t) \, dt, assuming integrability conditions hold, which they do for the normal distribution. This bidirectional relationship highlights the analytical utility of generating functions in characterizing and manipulating normal distributions.

Maximum Entropy Property

The differential entropy of a continuous random variable X with probability density function f(x) is defined as H(X) = -\int_{-\infty}^{\infty} f(x) \log f(x) \, dx, measuring the uncertainty or spread of the distribution. Among all continuous distributions on the real line with fixed mean \mu and variance \sigma^2, the normal distribution maximizes this entropy. The differential entropy of a normal random variable X \sim \mathcal{N}(\mu, \sigma^2) is H(X) = \frac{1}{2} \left[1 + \log(2\pi \sigma^2)\right], and this value is the upper bound for any distribution satisfying the same constraints, with equality holding uniquely for the normal distribution. To establish this, consider maximizing H(X) subject to the constraints \int f(x) \, dx = 1, \int x f(x) \, dx = \mu, and \int (x - \mu)^2 f(x) \, dx = \sigma^2. Using the method of Lagrange multipliers, introduce multipliers \lambda_0, \lambda_1, and \lambda_2 for these constraints, respectively. The functional to optimize is \mathcal{L} = -\int f(x) \log f(x) \, dx + \lambda_0 \left( \int f(x) \, dx - 1 \right) + \lambda_1 \left( \int x f(x) \, dx - \mu \right) + \lambda_2 \left( \int (x - \mu)^2 f(x) \, dx - \sigma^2 \right). Taking the functional derivative with respect to f(x) and setting it to zero yields \log f(x) + 1 - \lambda_0 - \lambda_1 x - \lambda_2 (x - \mu)^2 = 0, so f(x) = \exp(\lambda_0 - 1 + \lambda_1 x + \lambda_2 (x - \mu)^2). Completing the square in the exponent and applying the constraints determines \lambda_1 = 0 (due to symmetry around \mu) and \lambda_2 = -1/(2\sigma^2), resulting in the normal density f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right). This solution is unique under the given constraints, confirming the normal distribution as the entropy maximizer. This maximum entropy property underscores the normal distribution's role in information theory as the distribution of maximal uncertainty given constraints on the mean and variance, making it a natural choice for modeling phenomena where only the first two moments are specified.
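The entropy formula and the maximality claim can be illustrated numerically; the sketch below is illustrative, rescaling a uniform and a Laplace distribution to share the same variance and showing that the normal attains the largest differential entropy of the three.

```python
# Illustrative sketch: H = 0.5*(1 + log(2*pi*sigma^2)) for the normal, compared with
# the entropies of a uniform and a Laplace distribution scaled to the same variance.
import numpy as np
from scipy.stats import norm, uniform, laplace

sigma = 1.3
print(0.5 * (1 + np.log(2 * np.pi * sigma**2)))     # closed-form normal entropy
print(norm(scale=sigma).entropy())                  # same value from SciPy

a = sigma * np.sqrt(3)                              # uniform on [-a, a] has variance sigma^2
print(uniform(loc=-a, scale=2 * a).entropy())       # smaller than the normal's entropy
print(laplace(scale=sigma / np.sqrt(2)).entropy())  # also smaller
```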

Stein Equation and Operator

The Stein operator for the normal distribution N(\mu, \sigma^2) is a first-order linear differential operator defined by \mathcal{A} f(x) = f'(x) - \frac{x - \mu}{\sigma^2} f(x) for differentiable test functions f satisfying appropriate growth conditions, such as absolute continuity and bounded derivatives. A random variable X follows N(\mu, \sigma^2) if and only if \mathbb{E}[\mathcal{A} f(X)] = 0 for all such f, providing a characterizing equation unique to the normal distribution in one dimension. This characterization stems from Stein's lemma, which equates \mathbb{E}[(X - \mu) f(X)] = \sigma^2 \mathbb{E}[f'(X)], and holds due to the symmetry and moment-generating properties of the normal. The Stein equation arises by setting \mathcal{A} f(x) = h(x) - \mathbb{E}[h(Z)], where Z \sim N(\mu, \sigma^2) and h is a measurable function, typically bounded or with bounded variation to ensure solvability. For the indicator functions h_z(x) = \mathbf{1}\{x \leq z\} used to control the Kolmogorov distance, the solution f satisfies \|f\|_\infty \leq \sqrt{\pi/8} and \|f'\|_\infty \leq 1, bounds that hold independently of \mu and \sigma. No other univariate distribution satisfies this equation for all suitable h, confirming the normal's uniqueness. Stein's method leverages this framework for distributional approximation, bounding the distance between the law of a random variable W and N(\mu, \sigma^2) via |\mathbb{E}[h(W)] - \mathbb{E}[h(Z)]| = |\mathbb{E}[\mathcal{A} f(W)]| for solutions f to the Stein equation. This approach yields explicit error rates, notably in normal approximations for sums of random variables, including dependent cases. A key application is deriving Berry–Esseen-type bounds on the Kolmogorov distance, such as \sup_x |F_W(x) - \Phi(x)| \leq C \, \mathbb{E}[|X_1 - \mu|^3]/(\sigma^3 \sqrt{n}) for the standardized sum W of n i.i.d. variables X_i with finite third absolute moments, where C \approx 0.56 and \Phi is the standard normal CDF. These results extend classical central limit theorem error estimates to non-i.i.d. settings, with applications in statistical inference and risk analysis.
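Stein's lemma, the identity underlying the operator above, is easy to check by simulation; the sketch below is illustrative and uses f(x) = sin(x) as an arbitrary smooth, bounded test function.

```python
# Illustrative sketch: Monte Carlo check of E[(X - mu) f(X)] = sigma^2 E[f'(X)]
# for X ~ N(mu, sigma^2), with the arbitrary test function f(x) = sin(x).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.7, 1.8, 10**6
x = rng.normal(mu, sigma, size=n)

lhs = np.mean((x - mu) * np.sin(x))   # E[(X - mu) f(X)]
rhs = sigma**2 * np.mean(np.cos(x))   # sigma^2 E[f'(X)]
print(lhs, rhs)                       # agree up to Monte Carlo error
```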

Parameter Estimation

Point Estimates for Mean and Variance

In the frequentist framework, point estimation of the parameters of a normal distribution from an independent and identically distributed sample X_1, \dots, X_n \sim N(\mu, \sigma^2) relies on classical estimators built from the sample moments. The sample mean, defined as \bar{X} = \frac{1}{n} \sum_{i=1}^n X_i, serves as the primary estimator for the population mean \mu. This estimator is unbiased, meaning E[\bar{X}] = \mu, and its variance is \text{Var}(\bar{X}) = \sigma^2 / n. For samples from a normal distribution, \bar{X} follows an exact normal distribution: \bar{X} \sim N(\mu, \sigma^2 / n), and thus \sqrt{n} (\bar{X} - \mu) \sim N(0, \sigma^2). For the variance parameter \sigma^2, the unbiased sample variance is given by S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2, which corrects for the bias introduced by using the sample mean in place of the unknown population mean—an adjustment known as Bessel's correction. This estimator satisfies E[S^2] = \sigma^2, making it unbiased, and under normality, (n-1) S^2 / \sigma^2 \sim \chi^2_{n-1}, where \chi^2_{n-1} denotes the chi-squared distribution with n-1 degrees of freedom. The maximum likelihood estimators (MLEs), obtained by maximizing the normal likelihood, coincide with the sample mean for \mu but differ for \sigma^2: \hat{\mu}_{\text{ML}} = \bar{X}, \quad \hat{\sigma}^2_{\text{ML}} = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2. While \hat{\mu}_{\text{ML}} is unbiased, \hat{\sigma}^2_{\text{ML}} is biased downward, with E[\hat{\sigma}^2_{\text{ML}}] = \frac{n-1}{n} \sigma^2, though it remains consistent as n \to \infty. Under standard regularity conditions, the MLEs exhibit asymptotic normality: \sqrt{n} (\hat{\mu}_{\text{ML}} - \mu) \to_d N(0, \sigma^2) and \sqrt{n} (\hat{\sigma}^2_{\text{ML}} - \sigma^2) \to_d N(0, 2 \sigma^4), where \to_d denotes convergence in distribution. These properties ensure the estimators become increasingly reliable for large samples, facilitating inference such as the construction of confidence intervals.
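The bias comparison between S² and the MLE of σ² is straightforward to see by simulation; the sketch below is illustrative, with arbitrary values of μ, σ, and n.

```python
# Illustrative sketch: the unbiased sample variance (ddof=1) versus the
# downward-biased MLE (ddof=0), averaged over many simulated samples.
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, reps = 5.0, 2.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
print(samples.mean(axis=1).mean())          # ~ mu
print(samples.var(axis=1, ddof=1).mean())   # ~ sigma^2 = 4.0 (unbiased)
print(samples.var(axis=1, ddof=0).mean())   # ~ (n-1)/n * sigma^2 = 3.6 (MLE)
```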

Confidence Intervals

Confidence intervals for the parameters of a normal distribution, specifically the mean μ and variance σ², are constructed using pivotal quantities derived from the sampling distributions of the sample mean and sample variance under the assumption of normality. These intervals provide probabilistic guarantees about containing the true parameter values, with the sample mean \bar{X} and sample variance S² serving as the foundational estimators. For the mean μ when the variance σ² is known, the standardized sample mean Z = √n (\bar{X} - μ) / σ follows a standard normal distribution. Thus, a (1 - α) × 100% confidence interval is given by \bar{X} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, where z_{α/2} is the (1 - α/2) quantile of the standard normal distribution. This interval has an exact coverage probability of 1 - α when the population is normally distributed. When σ² is unknown, the t-statistic T = √n (\bar{X} - μ) / S follows a Student's t-distribution with n - 1 degrees of freedom, leading to the confidence interval \bar{X} \pm t_{n-1, \alpha/2} \frac{S}{\sqrt{n}}, where t_{n-1, α/2} is the (1 - α/2) quantile of the t-distribution with n - 1 degrees of freedom. This construction, originally developed for small samples from normal populations, also achieves exact coverage probability of 1 - α under normality. For the variance σ², the pivotal quantity (n - 1) S² / σ² follows a chi-squared distribution with n - 1 degrees of freedom. The corresponding (1 - α) × 100% confidence interval is \left( \frac{(n-1) S^2}{\chi^2_{n-1, \alpha/2}}, \frac{(n-1) S^2}{\chi^2_{n-1, 1 - \alpha/2}} \right), where \chi^2_{n-1, p} denotes the critical value with upper-tail probability p of the chi-squared distribution with n - 1 degrees of freedom. This interval likewise has exact coverage probability of 1 - α assuming normality. Under the normality assumption, all these intervals possess exact coverage probabilities equal to their nominal levels, meaning that in repeated sampling, the proportion of intervals containing the true parameter equals 1 - α precisely, without approximation.
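These pivotal-quantity constructions translate directly into code; the sketch below (illustrative data, α = 0.05) computes the t-interval for μ and the chi-squared interval for σ² using SciPy quantile functions.

```python
# Illustrative sketch: exact t-interval for the mean and chi-squared interval
# for the variance from one simulated normal sample.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=3.0, size=25)
n, alpha = len(x), 0.05
xbar, s2 = x.mean(), x.var(ddof=1)

t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
mean_ci = (xbar - t_crit * np.sqrt(s2 / n), xbar + t_crit * np.sqrt(s2 / n))

chi2_upper = stats.chi2.ppf(1 - alpha / 2, df=n - 1)   # large quantile -> lower bound
chi2_lower = stats.chi2.ppf(alpha / 2, df=n - 1)       # small quantile -> upper bound
var_ci = ((n - 1) * s2 / chi2_upper, (n - 1) * s2 / chi2_lower)

print(mean_ci)   # 95% confidence interval for mu
print(var_ci)    # 95% confidence interval for sigma^2
```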

Bayesian Estimation

In Bayesian estimation of the parameters of a normal distribution, priors are chosen to reflect prior beliefs about the mean \mu and variance \sigma^2, and the posterior distribution is obtained by updating these with the likelihood from observed data X_1, \dots, X_n \stackrel{\text{iid}}{\sim} N(\mu, \sigma^2). When \sigma^2 is known, the conjugate prior for \mu is normal: \mu \sim N(\mu_0, \sigma^2 / \kappa_0), where \mu_0 is the prior mean and \kappa_0 > 0 controls the prior sample size. The resulting posterior is also normal: \mu \mid \mathbf{X} \sim N\left( \frac{\kappa_0 \mu_0 + n \bar{X}}{\kappa_0 + n}, \frac{\sigma^2}{\kappa_0 + n} \right), where \bar{X} = n^{-1} \sum_{i=1}^n X_i. This form weights the prior and sample means by their respective precisions, shrinking the posterior toward \mu_0 more strongly when \kappa_0 is large relative to n. When both \mu and \sigma^2 are unknown, the conjugate prior is the normal-inverse-gamma distribution, which specifies \mu \mid \sigma^2 \sim N(\mu_0, \sigma^2 / \kappa_0) and \sigma^2 \sim \text{Inv-Gamma}(\nu_0 / 2, \nu_0 \sigma_0^2 / 2), with hyperparameters \mu_0, \kappa_0 > 0, \nu_0 > 0, and \sigma_0^2 > 0. The joint posterior remains normal-inverse-gamma with updated parameters: \kappa_n = \kappa_0 + n, \mu_n = (\kappa_0 \mu_0 + n \bar{X}) / \kappa_n, \nu_n = \nu_0 + n, and \nu_n \sigma_n^2 / 2 = \nu_0 \sigma_0^2 / 2 + (n-1) S^2 / 2 + \kappa_0 n (\bar{X} - \mu_0)^2 / (2 \kappa_n), where S^2 = (n-1)^{-1} \sum_{i=1}^n (X_i - \bar{X})^2 is the unbiased sample variance. The marginal posterior for \mu is then a Student's t-distribution with location \mu_n, scale \sigma_n / \sqrt{\kappa_n}, and \nu_n degrees of freedom. A common noninformative choice is the prior p(\mu, \sigma^2) \propto 1 / \sigma^2, often called the Jeffreys or reference prior in this setting, which is improper but leads to proper posteriors for n \geq 2. Under this prior, the conditional posterior is \mu \mid \sigma^2, \mathbf{X} \sim N(\bar{X}, \sigma^2 / n) and the marginal posterior is \sigma^2 \mid \mathbf{X} \sim \text{Inv-}\chi^2(n-1, S^2), while the marginal posterior for \mu is again Student's t with n-1 degrees of freedom, location \bar{X}, and scale S / \sqrt{n}. This prior is invariant under location-scale transformations and often motivates reference analyses. The posterior predictive distribution for a new observation X^* \mid \mathbf{X} integrates the normal likelihood over the posterior: p(X^* \mid \mathbf{X}) = \int p(X^* \mid \mu, \sigma^2) p(\mu, \sigma^2 \mid \mathbf{X}) \, d\mu \, d\sigma^2. Under the normal-inverse-gamma prior, this yields a Student's t-distribution: X^* \mid \mathbf{X} \sim t_{\nu_n} \left( \mu_n, \sigma_n^2 (1 + 1/\kappa_n) \right), accounting for both parameter uncertainty and sampling variability; the heavier tails compared to a normal reflect this uncertainty.
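A minimal sketch of the normal-inverse-gamma update described above follows; the helper name nig_posterior and the simulated data are hypothetical, and the hyperparameter names follow the text.

```python
# Illustrative sketch of the conjugate normal-inverse-gamma update; nig_posterior
# is a hypothetical helper, not a library routine.
import numpy as np

def nig_posterior(x, mu0, kappa0, nu0, sigma0_sq):
    """Return (mu_n, kappa_n, nu_n, sigma_n_sq) after observing data x."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    ss = ((x - xbar) ** 2).sum()               # equals (n - 1) * S^2
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    nu_n = nu0 + n
    nu_sigma_sq = nu0 * sigma0_sq + ss + kappa0 * n * (xbar - mu0) ** 2 / kappa_n
    return mu_n, kappa_n, nu_n, nu_sigma_sq / nu_n

rng = np.random.default_rng(3)
data = rng.normal(2.0, 1.5, size=50)
print(nig_posterior(data, mu0=0.0, kappa0=1.0, nu0=1.0, sigma0_sq=1.0))
```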

Hypothesis Testing and Normality Assessment

Normality Tests

Normality tests are statistical procedures used to assess whether a sample of observations comes from a normally distributed population, which is crucial for validating assumptions in many statistical methods. These tests typically compare the empirical distribution of the data to the theoretical normal distribution, either through graphical methods or quantitative test statistics that yield p-values for hypothesis testing. The null hypothesis is that the data are normally distributed, and rejection indicates deviation from normality. Common tests include those based on empirical distribution functions, order statistics, and higher moments, each with varying sensitivity to different types of departures from normality. The Shapiro-Wilk test is a powerful method for testing normality, particularly effective for small sample sizes (n ≤ 50). It computes a statistic W defined as W = \frac{\left( \sum_{i=1}^n a_i X_{(i)} \right)^2}{\sum_{i=1}^n (X_i - \bar{X})^2}, where X_{(i)} are the ordered sample values, \bar{X} is the sample mean, and the coefficients a_i are specifically chosen constants derived from the expected values of normal order statistics to maximize the test's power. Under the null hypothesis, W is close to 1; smaller values indicate non-normality. Critical values and p-values are tabulated for small n, with the test rejecting normality if W falls below a critical threshold. The test was introduced by Shapiro and Wilk in their seminal 1965 paper, which demonstrated its superior power compared to earlier methods like the chi-squared goodness-of-fit test. The Kolmogorov-Smirnov (K-S) test, when adapted for normality, evaluates the maximum deviation between the empirical cumulative distribution function (ECDF) F_n(x) of the sample and the CDF of the hypothesized normal distribution. The statistic is D = \sup_x \left| F_n(x) - \Phi\left( \frac{x - \mu}{\sigma} \right) \right|, where \mu and \sigma are estimated from the sample (often using Lilliefors' modification to account for parameter estimation). Large values of D suggest non-normality. This test is distribution-free under the null but loses some power when parameters are estimated from the data. Originally proposed by Kolmogorov in 1933 and extended by Smirnov in 1948, the normality version is widely implemented in statistical software for its simplicity and applicability to continuous data. The Anderson-Darling test enhances the K-S approach by placing greater emphasis on the tails of the distribution, where deviations from normality are often most pronounced. Its statistic is given by A^2 = -n - \sum_{i=1}^n \frac{2i-1}{n} \left[ \ln \Phi(z_i) + \ln \left(1 - \Phi(z_{n+1-i}) \right) \right], where z_i = (X_{(i)} - \bar{X})/s are standardized order statistics and \Phi is the standard normal CDF; the equivalent integral form weights the squared differences between the ECDF and \Phi by 1/[\Phi(z)(1 - \Phi(z))]. This weighting makes the test more sensitive to discrepancies in the tails. Critical values are available from tables or simulations. Developed by Anderson and Darling in 1952 and 1954, the test is recommended for its balance of power against various alternatives, including asymmetric and heavy-tailed distributions. The Jarque-Bera test assesses normality by examining deviations in skewness and kurtosis from their expected normal values of 0 and 3, respectively. The test statistic is JB = n \left( \frac{S^2}{6} + \frac{(K - 3)^2}{24} \right), where S is the sample skewness and K is the sample kurtosis; it is asymptotically distributed as chi-squared with 2 degrees of freedom under the null hypothesis. It is particularly useful for larger samples (n > 20) where moment estimates are reliable. Proposed by Jarque and Bera in 1980, this test is computationally simple and commonly used in econometric applications to detect non-normality due to asymmetry or peakedness.
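All four tests are available in SciPy; the sketch below is illustrative, applying the standard calls to one simulated normal sample.

```python
# Illustrative sketch: Shapiro-Wilk, Kolmogorov-Smirnov (with estimated parameters),
# Anderson-Darling, and Jarque-Bera applied to simulated data via scipy.stats.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(size=200)

w, p_sw = stats.shapiro(x)
print("Shapiro-Wilk:", w, p_sw)

# Plugging in estimated parameters makes the reported K-S p-value only approximate
# (the Lilliefors correction addresses this).
d, p_ks = stats.kstest(x, 'norm', args=(x.mean(), x.std(ddof=1)))
print("Kolmogorov-Smirnov:", d, p_ks)

ad = stats.anderson(x, dist='norm')
print("Anderson-Darling:", ad.statistic, ad.critical_values)

jb, p_jb = stats.jarque_bera(x)
print("Jarque-Bera:", jb, p_jb)
```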
Despite their utility, normality tests have limitations, including low power against certain alternatives, such as heavy-tailed distributions (e.g., Student's t with low degrees of freedom), where they may fail to detect deviations at moderate sample sizes. Additionally, no single test is universally most powerful across all alternatives, and results can be sensitive to sample size—large samples may reject normality for minor deviations that are irrelevant to practical analysis. Users are advised to complement these tests with visual inspections like Q-Q plots.

Power and Sample Size Considerations

In statistical hypothesis testing for normality, the power of a test is defined as the probability of correctly rejecting the null hypothesis of normality when the data are actually drawn from a non-normal alternative distribution. This power function varies with the significance level, sample size, and the nature of the alternative; for example, normality tests generally exhibit lower power against symmetric non-normal distributions (such as uniform or platykurtic alternatives) compared to asymmetric or heavy-tailed ones (like skewed or leptokurtic distributions). The Shapiro-Wilk test, in particular, demonstrates high sensitivity to departures due to skewness and long-tailedness, with empirical power around 0.5 against skewed alternatives for samples of size 20 at a 5% significance level. Determining the appropriate sample size to achieve a desired power level in normality testing often relies on adaptations of general power formulas used in parametric inference. A common approximation, originally derived for detecting differences in means under normality, is n \approx (z_{\alpha/2} + z_{\beta})^2 \frac{\sigma^2}{\delta^2}, where z_{\alpha/2} and z_{\beta} are the z-scores corresponding to the significance level \alpha and desired power 1 - \beta, \sigma is the standard deviation, and \delta represents the minimal detectable deviation from normality (e.g., a specified skewness or excess kurtosis). For normality tests, this formula is adapted by defining \delta in terms of distributional deviations, though exact closed-form solutions are rare due to the complexity of non-normal alternatives; dedicated power-analysis software implements such calculations for specific tests. When closed-form approximations are insufficient, simulation-based methods provide robust estimates of power and required sample sizes. Monte Carlo simulations generate large numbers of samples (often thousands) from target non-normal distributions and compute the proportion of rejections at the chosen significance level, allowing power evaluation for tests like the Shapiro-Wilk across various alternatives and sample sizes. For instance, such simulations have shown that the Shapiro-Wilk test maintains superior power over competitors like the Kolmogorov-Smirnov for sample sizes up to 500, with power approaching 1.0 as sample size increases for most non-normal cases. Practical trade-offs in power and sample size planning include the fact that larger samples enhance detection of subtle deviations from normality but escalate computational costs, especially in simulation-heavy assessments or large-scale data applications. Outliers further complicate this by inflating variability and reducing test power, as they exacerbate apparent non-normality even in moderately sized samples; studies indicate that contamination levels as low as 5% can halve the power of common normality tests against symmetric alternatives.
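A minimal version of such a power simulation is sketched below; the lognormal alternative, sample size, and replication count are arbitrary illustrative choices.

```python
# Illustrative sketch: Monte Carlo power of the Shapiro-Wilk test against a
# lognormal (skewed) alternative at significance level alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, reps, alpha = 20, 2000, 0.05

rejections = 0
for _ in range(reps):
    x = rng.lognormal(mean=0.0, sigma=1.0, size=n)   # non-normal data
    _, p = stats.shapiro(x)
    rejections += (p < alpha)

print("estimated power:", rejections / reps)
```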

Central Limit Theorem

The central limit theorem (CLT) establishes that, under suitable conditions, the distribution of the standardized sum of a large number of independent random variables approximates the standard normal distribution, providing a foundational justification for the ubiquity of the normal distribution in statistical inference. This theorem explains why many phenomena, even those arising from non-normal variables, tend toward normality as the number of terms increases, enabling the use of normal-based approximations in diverse fields. A classic version, known as the Lindeberg–Lévy CLT, applies to independent and identically distributed (i.i.d.) random variables X_1, X_2, \dots, X_n with finite mean \mu and positive finite variance \sigma^2. Let S_n = \sum_{i=1}^n X_i denote the partial sum. Then, the standardized sum \frac{S_n - n\mu}{\sigma \sqrt{n}} converges in distribution to a standard normal random variable Z \sim N(0,1) as n \to \infty. This result was established by Jarl Waldemar Lindeberg in 1922 for independent (not necessarily identically distributed) random variables under the Lindeberg condition—which includes the i.i.d. case with finite variance—with further developments by Lévy and Feller in 1935 using characteristic functions. To quantify the rate of convergence in the CLT, the Berry–Esseen theorem provides a uniform bound on the supremum difference between the cumulative distribution function (CDF) of the standardized sum and the standard normal CDF. For i.i.d. variables with finite third absolute moment \rho = E[|X_1 - \mu|^3] < \infty, the bound is \sup_x \left| P\left( \frac{S_n - n\mu}{\sigma \sqrt{n}} \leq x \right) - \Phi(x) \right| \leq C \frac{\rho}{\sigma^3 \sqrt{n}}, where \Phi is the standard normal CDF and C is a universal constant (originally bounded by 7.59, later refined to approximately 0.56). This theorem, independently developed by Andrew C. Berry in 1941 and Carl-Gustav Esseen in 1942, highlights the O(1/\sqrt{n}) convergence rate, depending on the third moment, and is crucial for assessing approximation accuracy in finite samples. Generalizations of the CLT extend beyond i.i.d. cases to independent but non-identically distributed variables satisfying the Lyapunov condition, which requires the existence of moments of order 2 + \delta for some \delta > 0. Specifically, for independent X_i with means \mu_i, variances \sigma_i^2 > 0, and \sum_{i=1}^n E[|X_i - \mu_i|^{2+\delta}] = o\left( \left( \sum_{i=1}^n \sigma_i^2 \right)^{(2+\delta)/2} \right) as n \to \infty, the standardized sum \frac{S_n - \sum \mu_i}{\sqrt{\sum \sigma_i^2}} converges in distribution to N(0,1). This condition, introduced by Aleksandr Lyapunov in 1901, allows for heterogeneous variances and is sufficient for asymptotic normality in many practical settings, such as regression residuals or weighted sums. The implications of the CLT are profound: it positions the normal distribution as a universal limiting law for sums of random variables with finite variance, underpinning asymptotic normality in estimators like sample means and enabling techniques such as confidence intervals and hypothesis tests to rely on normal approximations for large samples. This universality facilitates the application of normal theory across statistics, even when underlying distributions deviate from normality.
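The convergence, and its roughly O(1/√n) rate, can be visualized numerically; the sketch below is illustrative, using exponential summands with mean and variance 1 and measuring the maximum gap between the empirical CDF of the standardized sum and Φ.

```python
# Illustrative sketch: CLT for i.i.d. exponential summands; the Kolmogorov-type
# distance to the standard normal CDF shrinks roughly like 1/sqrt(n).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)
reps = 20_000
grid = np.linspace(-3, 3, 601)

for n in (5, 50, 500):
    sums = rng.exponential(scale=1.0, size=(reps, n)).sum(axis=1)
    z = (sums - n) / np.sqrt(n)                       # (S_n - n*mu)/(sigma*sqrt(n)), mu = sigma = 1
    ecdf = (z[:, None] <= grid).mean(axis=0)
    print(n, np.max(np.abs(ecdf - norm.cdf(grid))))   # decreases as n grows
```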

Operations on Normal Variables

The sum of independent normal random variables is itself normally distributed. Specifically, if X_1, X_2, \dots, X_n are independent random variables with X_i \sim \mathcal{N}(\mu_i, \sigma_i^2) for i = 1, \dots, n, then their sum S = \sum_{i=1}^n X_i follows \mathcal{N}\left( \sum_{i=1}^n \mu_i, \sum_{i=1}^n \sigma_i^2 \right). This result follows from the additivity of means and variances under independence, and the closure of the normal family under convolution. A special case is the difference of two independent normals: if X \sim \mathcal{N}(\mu_X, \sigma_X^2) and Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2) are independent, then X - Y \sim \mathcal{N}(\mu_X - \mu_Y, \sigma_X^2 + \sigma_Y^2). This arises directly from the sum property by considering -Y, which is \mathcal{N}(-\mu_Y, \sigma_Y^2). More generally, linear combinations of normal random variables preserve normality, even if the variables are correlated. For jointly normal X \sim \mathcal{N}(\mu_X, \sigma_X^2) and Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2) with correlation \rho, the combination aX + bY (where a, b are constants) is distributed as \mathcal{N}(a\mu_X + b\mu_Y, a^2 \sigma_X^2 + b^2 \sigma_Y^2 + 2ab \rho \sigma_X \sigma_Y). The variance term incorporates the covariance \operatorname{Cov}(X, Y) = \rho \sigma_X \sigma_Y, reflecting the joint dependence structure in the multivariate normal framework. In contrast, the product of two independent normal random variables does not follow a normal distribution; its density involves a modified Bessel function of the second kind and is symmetric around the product of the means if both are zero-mean, but generally skewed otherwise. Within the bivariate normal distribution, conditional distributions remain normal: given X = x, Y \mid X = x \sim \mathcal{N}(\mu_Y + \rho \frac{\sigma_Y}{\sigma_X} (x - \mu_X), \sigma_Y^2 (1 - \rho^2)).
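These closure properties are easy to confirm by simulation; the sketch below (illustrative parameters) draws correlated jointly normal pairs and checks the stated mean and variance of aX + bY.

```python
# Illustrative sketch: mean and variance of a linear combination of jointly
# normal variables match a*mu_x + b*mu_y and a^2 s_x^2 + b^2 s_y^2 + 2ab rho s_x s_y.
import numpy as np

rng = np.random.default_rng(7)
mu_x, mu_y, s_x, s_y, rho = 1.0, -2.0, 2.0, 0.5, 0.6
a, b = 3.0, -1.5

cov = [[s_x**2, rho * s_x * s_y], [rho * s_x * s_y, s_y**2]]
xy = rng.multivariate_normal([mu_x, mu_y], cov, size=200_000)
w = a * xy[:, 0] + b * xy[:, 1]

print(w.mean(), a * mu_x + b * mu_y)
print(w.var(), a**2 * s_x**2 + b**2 * s_y**2 + 2 * a * b * rho * s_x * s_y)
```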

Infinite Divisibility and Extensions

The normal distribution possesses the property of infinite divisibility, which means that for every positive integer n, its distribution function can be represented as the n-fold convolution of n identical distribution functions. In particular, if X \sim \mathcal{N}(\mu, \sigma^2), then X equals the sum \sum_{i=1}^n Y_i in distribution, where the Y_i are independent and identically distributed with each Y_i \sim \mathcal{N}(\mu/n, \sigma^2/n). This decomposition arises from the characteristic function of the normal distribution, \phi_X(t) = \exp(i \mu t - \sigma^2 t^2 / 2), which factors as [\phi_Y(t)]^n with \phi_Y(t) = \exp(i (\mu/n) t - (\sigma^2/n) t^2 / 2), confirming the infinite divisibility through the Lévy-Khinchin representation, in which the Gaussian component appears without a jump measure. A key characterization related to this property is provided by Cramér's theorem, which states that any infinitely divisible distribution with finite variance must be normal if all its cumulants of order higher than two are zero. This result underscores the uniqueness of the normal distribution among infinitely divisible laws with bounded moments, as the vanishing higher cumulants eliminate contributions from the Lévy measure in the cumulant generating function, leaving only the drift and diffusion terms characteristic of the Gaussian. The normal distribution extends naturally to the multivariate setting, defining a joint distribution over random vectors in \mathbb{R}^k. The probability density function of the multivariate normal distribution \mathcal{N}_k(\boldsymbol{\mu}, \Sigma) is f(\mathbf{x}) = \frac{1}{(2\pi)^{k/2} \det(\Sigma)^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right), where \boldsymbol{\mu} \in \mathbb{R}^k is the mean vector and \Sigma is the k \times k positive definite covariance matrix. This formulation generalizes the univariate case, capturing correlations through \Sigma. Marginal distributions of the multivariate normal are also normal; specifically, the marginal for any single component X_j is univariate normal \mathcal{N}(\mu_j, \Sigma_{jj}). Furthermore, conditional distributions within the multivariate normal framework remain normal. Given a partition of the vector into subvectors \mathbf{X}_1 and \mathbf{X}_2, the conditional distribution of \mathbf{X}_1 given \mathbf{X}_2 = \mathbf{x}_2 is multivariate normal with mean \boldsymbol{\mu}_1 + \Sigma_{12} \Sigma_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2) and covariance \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}, preserving the Gaussian structure under conditioning. This property facilitates applications in regression and prediction for correlated variables.
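The conditioning formulas can be implemented directly from the expressions above; the sketch below uses a hypothetical helper conditional_normal and an arbitrary 3-dimensional example.

```python
# Illustrative sketch: conditional mean and covariance of X1 | X2 = x2 for a
# partitioned multivariate normal; conditional_normal is a hypothetical helper.
import numpy as np

def conditional_normal(mu, Sigma, idx1, idx2, x2):
    mu = np.asarray(mu, dtype=float)
    Sigma = np.asarray(Sigma, dtype=float)
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    cond_mean = mu[idx1] + S12 @ np.linalg.solve(S22, np.asarray(x2) - mu[idx2])
    cond_cov = S11 - S12 @ np.linalg.solve(S22, S12.T)
    return cond_mean, cond_cov

mu = [0.0, 1.0, 2.0]
Sigma = [[2.0, 0.6, 0.3],
         [0.6, 1.0, 0.2],
         [0.3, 0.2, 1.5]]
print(conditional_normal(mu, Sigma, idx1=[0, 1], idx2=[2], x2=[3.0]))
```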

Applications

In Statistics and Probability

The normal distribution serves as a cornerstone in statistical theory, particularly through its role in deriving other fundamental distributions. The chi-squared distribution with k degrees of freedom arises as the sum of squares of k independent standard normal random variables, providing the basis for variance estimation and goodness-of-fit tests. The Student's t-distribution emerges from the ratio of a standard normal variable to the square root of an independent chi-squared variable divided by its degrees of freedom, which is essential for inference on means with unknown variance. Similarly, the F-distribution is the ratio of two independent chi-squared variables, each divided by their degrees of freedom, facilitating comparisons of variances across groups. In probabilistic modeling, the normal distribution underpins the asymptotic properties of key estimators. Under standard regularity conditions—such as differentiability of the log-likelihood and identifiability of parameters—the maximum likelihood estimator (MLE) converges in distribution to a normal random variable with mean equal to the true parameter and variance given by the inverse Fisher information matrix. Ordinary least squares estimators in linear regression models also exhibit asymptotic normality, justified by the central limit theorem for sums of independent random variables, enabling reliable large-sample inference. This asymptotic normality extends to many other estimators, supporting confidence intervals and hypothesis tests in applied statistics. Normal distributions enhance Monte Carlo methods through importance sampling, where they are frequently employed as proposal distributions to reduce estimator variance. By sampling from a normal approximation to the target distribution—such as a shifted or scaled normal for tail probabilities—importance sampling reweights samples to target the desired distribution, often achieving variance reduction compared to crude Monte Carlo. This technique is particularly effective for high-dimensional integrals where the target is roughly Gaussian, minimizing the variance of the resulting estimator. Post-2020 developments in Bayesian computation have further integrated normals into hierarchical modeling via probabilistic programming languages. In such frameworks, normal priors and likelihoods facilitate scalable inference for multilevel structures, such as varying intercepts across groups, by leveraging gradient-based samplers for efficient posterior sampling. For instance, bivariate hierarchical models combining summary measures often specify normal distributions for parameters, enabling robust inference in meta-analyses. These applications underscore the normal's versatility in modern statistics for complex, data-driven probabilistic frameworks.

In Natural and Social Sciences

In biology, heights within human populations are often approximately normally distributed, reflecting the additive effects of multiple genetic and environmental factors on growth. For instance, analyses of 19th-century military conscript data illustrate how adult male height distributions, while often approximated as normal, exhibit distortions due to growth patterns and environmental influences such as nutrition. Similarly, blood pressure measurements in healthy populations, such as systolic and diastolic values, are frequently modeled as normal distributions for statistical analysis, enabling the establishment of reference ranges based on age, sex, and height percentiles. This approximation facilitates the identification of clinical thresholds, as seen in large-scale epidemiological studies where population-level data exhibit bell-shaped curves centered around means like 120/80 mmHg. In physics, the normal distribution arises prominently in the modeling of measurement errors, where Gaussian noise represents random fluctuations in instruments due to thermal or quantum effects. For example, lock-in amplifiers used in precision experiments quantify noise as the standard deviation of voltage signals, providing a baseline for signal-to-noise ratios in precision measurement. This noise is assumed to follow a normal distribution because it results from the superposition of many independent random processes, as justified by the central limit theorem. Additionally, the limiting behavior of Brownian motion—the random movement of particles suspended in a fluid—yields normally distributed displacements over time, with the variance proportional to elapsed time; this underpins diffusion models in statistical physics and has been experimentally verified through colloidal particle tracking. The central limit theorem explains many of these empirical occurrences in natural phenomena, as sums of independent random variables tend toward normality regardless of their original distributions. In the social sciences, intelligence quotient (IQ) scores are deliberately standardized to follow a normal distribution with a mean of 100 and standard deviation of 15, allowing for consistent interpretation across populations and tests. This normalization, rooted in early 20th-century psychometrics, assumes that cognitive abilities aggregate from numerous factors to approximate normality, enabling rankings where about 68% of scores fall between 85 and 115. Income distributions, while typically right-skewed and better fit by lognormal or Pareto models overall, often have central portions and lower tails that can be approximated by normal distributions for certain analytical purposes, such as modeling middle-class earnings variability in econometric studies. In psychology, reaction times in cognitive tasks are positively skewed but can be transformed via the natural logarithm to approximate a normal distribution, improving the validity of parametric statistical tests. This log-transformation accounts for the multiplicative nature of processing speeds, where slower responses disproportionately affect the tail; empirical validations show that log(RT) yields distributions closer to normality, as demonstrated in analyses of choice reaction time experiments. Regarding climate data, the Intergovernmental Panel on Climate Change (IPCC) notes that normal distributions provide a reasonable approximation for temperature variability in many regions, facilitating assessments of extremes like heatwaves through standard deviation-based thresholds. However, for precipitation and other non-symmetric variables, such approximations are less reliable, particularly in arid areas, where debates over variability highlight the need for alternative models.

In Engineering and Computing

In signal processing and communications, the additive white Gaussian noise (AWGN) model assumes that noise in communication channels follows a normal distribution with zero mean and uniform power across frequencies, enabling the analysis of signal degradation and the design of optimal receivers. This model underpins channel capacity calculations and error rate predictions in digital communications systems, such as those using modulation schemes like QPSK or OFDM. For instance, the bit error probability for binary signaling in AWGN channels is derived under this normality assumption to evaluate system performance. In control and estimation, the Kalman filter relies on the assumption of normally distributed process and measurement errors to provide optimal recursive state estimation in linear dynamic systems. By modeling uncertainties as Gaussian, the filter minimizes the mean squared estimation error through prediction and update steps, making it essential for applications such as navigation in aerospace and robotics. This Gaussian framework ensures computational tractability and optimality under linearity and uncorrelated noise conditions. Machine learning leverages the normal distribution in Gaussian processes for non-parametric regression, where the prior over functions is defined by a mean function and a covariance kernel specifying smoothness and structure. This approach yields probabilistic predictions with uncertainty estimates, widely used in Bayesian optimization and spatial modeling. Additionally, Bayesian neural networks often employ normal priors on weights to regularize learning and capture epistemic uncertainty, facilitating scalable inference via variational methods. As of 2025, diffusion models have advanced image generation by iteratively adding and removing Gaussian noise, starting from data and reversing the process to sample from complex distributions, with enhancements in efficiency through flow-matching techniques. In statistical quality control, Shewhart control charts assume normally distributed process variations to monitor stability, using control limits set at three standard deviations from the mean to detect shifts. Under this normality assumption, process capability indices like C_p = \frac{USL - LSL}{6\sigma} and C_{pk} = \min\left( \frac{USL - \mu}{3\sigma}, \frac{\mu - LSL}{3\sigma} \right) quantify how well a process meets specification limits, guiding improvements in manufacturing. These indices highlight centering and spread relative to tolerances, assuming stable Gaussian output for reliable assessment.
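The capability indices above reduce to a few lines of code; the sketch below uses a hypothetical helper and simulated measurements with arbitrary specification limits.

```python
# Illustrative sketch: process capability indices Cp and Cpk under an assumed
# stable Gaussian process; capability() is a hypothetical helper.
import numpy as np

def capability(x, lsl, usl):
    mu, sigma = np.mean(x), np.std(x, ddof=1)
    cp = (usl - lsl) / (6 * sigma)
    cpk = min((usl - mu) / (3 * sigma), (mu - lsl) / (3 * sigma))
    return cp, cpk

rng = np.random.default_rng(8)
measurements = rng.normal(loc=10.02, scale=0.05, size=500)
print(capability(measurements, lsl=9.85, usl=10.15))   # Cp ~ 1.0, Cpk a bit lower
```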

Computational Methods

Random Number Generation

Generating random numbers from the normal distribution is essential for simulations and Monte Carlo methods, typically starting from uniform random variables on [0,1], which are readily available from pseudorandom number generators. One classical method is the Box-Muller transform, which produces a pair of independent standard normal random variables Z_0 and Z_1 from two independent uniform random variables U_1, U_2 \sim U(0,1). The transformation is given by: \begin{align*} Z_0 &= \sqrt{-2 \log U_1} \cos(2\pi U_2), \\ Z_1 &= \sqrt{-2 \log U_1} \sin(2\pi U_2). \end{align*} This method relies on the joint distribution of the radius and angle in polar coordinates to match the bivariate normal density. The Marsaglia polar method is a rejection sampling variant that avoids trigonometric functions for efficiency. It generates candidate pairs (V_1, V_2) uniformly in [-1,1] until their squared distance S = V_1^2 + V_2^2 < 1, then computes a standard normal pair as Z_0 = V_1 \sqrt{-2 \ln S / S} and Z_1 = V_2 \sqrt{-2 \ln S / S}. The expected number of candidate pairs per accepted pair is about 1.27 (the acceptance probability is \pi/4), making it faster than the original Box-Muller in practice. For even greater speed in computational applications, the Ziggurat algorithm approximates the normal density with a stack of horizontal rectangles (a "ziggurat") of equal area, accepting samples under the density via rejection sampling with high probability in the base layers. Tail regions are handled separately, often with exponential approximations. This method was reported to generate standard normals at rates exceeding 15 million per second on 400 MHz processors and is implemented in libraries such as Python's NumPy, whose Generator-based normal sampler uses it. To obtain normals with arbitrary mean \mu and standard deviation \sigma > 0, scale and shift a standard normal Z \sim N(0,1) as X = \mu + \sigma Z. This linear transformation preserves normality due to the distribution's stability under affine operations.
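Both transforms are short enough to implement directly; the sketch below is illustrative and is not meant to replace library generators, which typically use the faster Ziggurat approach.

```python
# Illustrative sketch: Box-Muller and Marsaglia polar generators built from
# uniform draws, plus the shift-and-scale step X = mu + sigma * Z.
import numpy as np

rng = np.random.default_rng(9)

def box_muller(n):
    """Return 2*n standard normal variates via the Box-Muller transform."""
    u1 = 1.0 - rng.random(n)                 # in (0, 1], avoids log(0)
    u2 = rng.random(n)
    r = np.sqrt(-2.0 * np.log(u1))
    return np.concatenate([r * np.cos(2 * np.pi * u2), r * np.sin(2 * np.pi * u2)])

def marsaglia_polar(n):
    """Return n standard normal variates via the Marsaglia polar method."""
    out = []
    while len(out) < n:
        v1, v2 = rng.uniform(-1, 1, 2)
        s = v1 * v1 + v2 * v2
        if 0 < s < 1:                        # reject points outside the unit disk
            factor = np.sqrt(-2.0 * np.log(s) / s)
            out.extend([v1 * factor, v2 * factor])
    return np.array(out[:n])

z = box_muller(100_000)
print(z.mean(), z.std())                     # ~0, ~1
x = 5.0 + 2.0 * marsaglia_polar(10_000)      # shift/scale to N(5, 4)
print(x.mean(), x.std())
```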

Approximations for CDF and Quantiles

The cumulative distribution function (CDF) of the standard normal distribution, denoted Φ(z), lacks a closed-form expression in elementary functions, necessitating numerical approximations for practical computation. These approximations are essential in statistical software, simulations, and real-time applications where table lookups are inefficient. The CDF relates directly to the error function via Φ(z) = 1/2 + (1/2) erf(z / √2), where erf(z) = (2/√π) ∫_0^z e^{-t^2} dt, making accurate approximations of erf(z) a foundational approach. Approximations for the error function often employ rational functions, series expansions, or continued fractions, as detailed in the seminal handbook by Abramowitz and Stegun. For moderate to large z, a continued fraction representation for the complementary error function erfc(z) = 1 - erf(z) provides high accuracy: erfc(z) = (e^{-z^2} / (√π z)) [1 / (1 + a_1 / z^2 + (a_2 / z^2) / (1 + a_3 / z^2 + ⋯))], with coefficients a_i specified up to seven terms for relative errors below 10^{-15} over z > 0. For small z, the power series erf(z) = (2/√π) ∑_{n=0}^∞ (-1)^n z^{2n+1} / (n! (2n+1)) converges rapidly, though it is less efficient for larger arguments. These methods achieve machine-precision accuracy and form the basis for implementations in numerical libraries. Direct approximations for the normal CDF Φ(z) bypass the error function for simplicity and speed. Pólya's approximation offers reasonable accuracy with maximum absolute error around 0.003 for all z, suitable for quick estimates. For tail probabilities, particularly 1 - Φ(z) with large positive z, rational minimax approximations minimize the maximum error over intervals: for 0 < z < ∞, forms such as 1 - Φ(z) ≈ (1 / √(2π)) e^{-z^2 / 2} (1/z) (1 / (1 + b_1 / z^2 + ⋯ + b_5 / z^{10})), with fitted coefficients b_i, yield relative errors under 7.5 × 10^{-8}. These approximations are particularly valuable in one-sided hypothesis testing and risk analysis. The quantile function, or probit function, Φ^{-1}(p), inverts the CDF to find z such that Φ(z) = p for p ∈ (0,1). No elementary inverse exists, so iterative methods like Newton's method are standard: initialize z_0 with a rational approximation to √2 erfinv(2p - 1), then iterate z_{n+1} = z_n - (Φ(z_n) - p) / φ(z_n), where φ is the standard normal PDF; convergence typically occurs in 3-5 steps to double precision, with safeguards for p near 0 or 1 to avoid divergence. This method's efficiency stems from the near-linearity of Φ near its median, and it is widely implemented due to its robustness. For high-precision needs, such as in scientific computing, Chebyshev polynomial expansions provide uniform approximation over finite intervals. Cody's rational Chebyshev approximations for erf(z) minimize maximum deviation using economized polynomials, achieving errors below 10^{-19} for |z| < 3 with degree-22 numerators and denominators. For large |z|, asymptotic expansions refine tail estimates: 1 - Φ(z) ∼ (φ(z) / z) (1 - 1/z^2 + 3/z^4 - 15/z^6 + ⋯) as z → ∞, with the series truncated at the term minimizing the remainder, yielding relative accuracy near 10^{-16} for z > 5. Modern numerical libraries incorporate these approximations, and GPU-accelerated array backends such as CuPy enable parallel evaluation of vectorized CDF and quantile computations for high-throughput applications.
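The Newton iteration for the quantile function is sketched below; it is illustrative only, uses SciPy's Φ and φ, and starts from a crude initial guess rather than the rational approximation a production implementation would use (so a few extra iterations may be needed near 0 or 1).

```python
# Illustrative sketch: Newton's method for the standard normal quantile
# Phi^{-1}(p), iterating z <- z - (Phi(z) - p) / phi(z).
from scipy.stats import norm

def probit_newton(p, tol=1e-14, max_iter=60):
    z = 0.0                                  # crude start (exact for p = 0.5)
    for _ in range(max_iter):
        step = (norm.cdf(z) - p) / norm.pdf(z)
        z -= step
        if abs(step) < tol:
            break
    return z

for p in (0.025, 0.5, 0.975, 0.999):
    print(p, probit_newton(p), norm.ppf(p))  # agree to near machine precision
```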

History

Development

The mathematical foundations of the normal distribution emerged in the early 18th century through efforts to approximate discrete probability distributions. In 1733, Abraham de Moivre derived an approximation for the binomial distribution using Stirling's formula for factorials, yielding the density function \frac{1}{\sqrt{2\pi npq}} \exp\left( -\frac{(x - np)^2}{2 npq} \right), where n is the number of trials, p the success probability, q = 1 - p, and x the number of successes; this was the first explicit formulation of the normal curve as a limiting case of the binomial distribution. This approximation gained prominence in the context of error analysis during the early 19th century. Carl Friedrich Gauss, in his 1809 work Theoria Motus Corporum Coelestium in Sectionibus Conicis Solem Ambientium, introduced the normal distribution to model errors in astronomical observations, deriving it as the law of errors under which the arithmetic mean of independent observations is the most probable value, a derivation closely tied to the method of least squares. Pierre-Simon Laplace extended these ideas in 1812 with his Théorie Analytique des Probabilités, generalizing the normal distribution to continuous error laws and providing integral evaluations that demonstrated its applicability to a broader class of probabilistic phenomena, including the superposition of multiple error sources. Subsequent milestones shifted focus toward applications in natural and social phenomena. In 1835, Adolphe Quetelet applied the normal distribution to measurements of human physical traits, such as height and weight, in his Sur l'Homme et le Développement de ses Facultés, ou Essai de Physique Sociale, positing the concept of "l'homme moyen" (the average man) as the center around which variations cluster normally. Building on this, Francis Galton, in his 1889 book Natural Inheritance, popularized the term "normal" for the distribution while studying hereditary traits and regression toward the mean, emphasizing its role in describing biological variability. The theoretical rigor of the normal distribution was solidified in the early 20th century through the central limit theorem (CLT). Andrey Lyapunov provided a general proof of the CLT in 1901, showing that the sum of independent random variables with finite variances converges in distribution to a normal under mild conditions, establishing the normal as a universal limit law in probability theory.

Naming and Standardization

The term "Gaussian distribution" derives from the contributions of , who formalized the in 1809 as part of his work on the method of for modeling errors in astronomical data. Quetelet applied the distribution—then known as the Gaussian or error law—to social and biological measurements in the 1830s, where he used it to characterize the "average man" (l'homme moyen) as the most probable type in large populations, implying a typical or ideal state in human statistics. The term "normal distribution" emerged later in the late 19th century; it was first used in this context by Charles S. Peirce in 1873 and Wilhelm Lexis in 1879, and popularized by in the 1870s and 1880s through his anthropometric studies and writings on , such as Natural Inheritance (1889), framing it as a curve of "normal variability" for traits like and physical characteristics. Alternative names have included informal descriptors like "bell curve," which highlights the symmetric, peaked shape but is discouraged in formal mathematical contexts for lacking specificity about parameters or properties. Early historical confusion arose with the (a double-exponential form), as referred to the normal as his "second law of errors" in the early , contrasting it with his "first law" for the ; this ambiguity was largely resolved by mid-century through attribution to Gauss and distinct mathematical characterizations. Standardization of notation for the normal distribution is encapsulated in the conventional form X \sim \mathcal{N}(\mu, \sigma^2), denoting a X with \mu and variance \sigma^2, as recommended in international statistical vocabularies for clarity in scientific communication. This notation, while not uniquely mandated by bodies like IUPAC or IUBMB (which reference it in contexts of ), aligns with ISO 3534-1:2006 for probability terms and has been reinforced in software implementations for consistency. The term "normal" has faced criticism for its implications during the eugenics era of the late 19th and early 20th centuries, when figures like and invoked the distribution to rank human traits hierarchically, portraying deviations from the mean as inferior or pathological, which fueled discriminatory policies. This historical baggage has prompted modern statistical literature to use the term cautiously, often preferring "Gaussian" in neutral or technical discussions to avoid connotations of normative superiority.

  20. [20]
    [PDF] Conjugate Bayesian analysis of the Gaussian distribution
    Oct 3, 2007 · The Gaussian or normal distribution is one of the most widely used in statistics. Estimating its parameters using. Bayesian inference and ...
  21. [21]
    [PDF] The normal distribution - MyWeb
    As noted previously, BUGS and JAGS parameterize the normal distribution in terms of the precision, as it is typically easier to work with in Bayesian ...
  22. [22]
    [PDF] Normal Distribution characterizations with applications
    such that for some scale factor κ and some location parameter α the distribution of X1 + X2 is the same as the distribution of κ(X1 + α). Then X1 is normal.
  23. [23]
    The Normal Distribution - Random Services
    The following theorem gives the skewness and kurtosis of the standard normal distribution. ... The ordinary (raw) moments of X can be computed from the central ...
  24. [24]
    [PDF] Moments and Absolute Moments of the Normal Distribution - arXiv
    In this section we give formulas for the raw/central (absolute) moments of a normal RV. If not noted otherwise, these results hold for ν > -1. • Raw moments:.
  25. [25]
    16.4 - Normal Properties | STAT 414 - STAT ONLINE
    We'll start by verifying that the normal p.d.f. is indeed a valid probability distribution. Then, we'll derive the moment-generating function ...
  26. [26]
    [PDF] 5.6: Moment Generating Functions
    The first moment of X is the mean of the distribution µ = E [X]. This describes the center or average value. 2. The second moment of X about µ is the variance ...
  27. [27]
    [PDF] Moments and the moment generating function Math 217 Probability ...
    A fairly flat distribution with long tails has a high kurtosis, while a short tailed distribution has a low kurtosis. A bimodal distri- bution has a very high ...
  28. [28]
    [PDF] Lecture 5: Moment generating functions
    In some cases, moments determine the distributions. The mgf, if it exists, determines a distribution. Theorem 2.3.11. Let X and Y be ...
  29. [29]
    [PDF] 1 Cumulants
    The normal distribution N(µ, σ2) has cumulant generating function ξµ+ ξ2σ2/2, a quadratic polynomial implying that all cumulants of order three and higher ...
  30. [30]
    [PDF] Moments and Generating Functions - Arizona Math
    • The cumulant generating function is defined to be. KX(t) = log MX(t). The k-th terms in the Taylor series expansion at 0, kn(X) = 1 k! dk dtk. KX(0) is ...
  31. [31]
    [PDF] Characteristic function of the Gaussian probability density
    The probability density of a Gaussian (or “normal distribution”) with mean µ and variance σ2 is p(x) = 1. /2πσ2 e− (x−µ)2. 2σ2 . (1). Its characteristic ...
  32. [32]
    [PDF] Lecture 8 Characteristic Functions
    Dec 8, 2013 · Hint: Prove that |ϕ(tn)| = 1 along a suit- ably chosen sequence tn → ∞, where ϕ is the characteristic function of the Can- tor distribution.
  33. [33]
    [PDF] Overview 1 Characteristic Functions
    Characteristic functions are essentially Fourier transformations of distribution functions, which provide a general and powerful tool to analyze probability ...
  34. [34]
    [PDF] Chapter 11: Distributions and the Fourier Transform - UC Davis Math
    The product of such characteristic functions is another function of the same form, in which the means and variances add together.Wژ onseLt uently, the sum of ...
  35. [35]
    [PDF] Probability distributions and maximum entropy - Keith Conrad
    ... Gaussian having the chosen fixed mean and variance. Theorem 7.7 does not tell us Π has a unique maximum entropy distribution, but rather that if it has one ...
  36. [36]
    [PDF] Lecture 8: Information Theory and Maximum Entropy
    The normal distribution is therefore the maximum entropy distribution for a distribution with known mean and variance. Yet another reason that Gaussians are ...
  37. [37]
    Fundamentals of Stein's method - Project Euclid
    Stein's method was initially conceived by Charles Stein in the seminal paper [54] to provide errors in the approximation by the normal distribution of the ...
  38. [38]
    A bound for the error in the normal approximation to the distribution ...
    6.2 | 1972 A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. Chapter Author(s) Charles Stein.
  39. [39]
    On the mathematical foundations of theoretical statistics - Journals
    A recent paper entitled "The Fundamental Problem of Practical Statistics," in which one of the most eminent of modern statisticians presents what purports to ...
  40. [40]
    [PDF] Outline of a Theory of Statistical Estimation Based on the Classical ...
    The theory of statistical estimation, based on classical probability, involves determining numerical values of parameters from experimental data, using a ...
  41. [41]
    [PDF] Student's 1908 paper
    Sep 25, 2006 · The 1908 paper, titled 'The Probable Error of a Mean', discusses the uncertainty in the mean of a sample, especially with small samples, and ...Missing: Gosset | Show results with:Gosset
  42. [42]
    Confidence Intervals for Normal Samples - Probability Course
    Here, we would like to discuss how to find interval estimators for the mean and the variance of a normal distribution.
  43. [43]
    [PDF] Bayesian Data Analysis Third edition (with errors fixed as of 20 ...
    This book is intended to have three roles and to serve three associated audiences: an introductory text on Bayesian inference starting from first principles, a ...
  44. [44]
    [PDF] Conjugate Bayesian analysis of the Gaussian distribution
    Oct 3, 2007 · The use of conjugate priors allows all the results to be derived in closed form.
  45. [45]
    Principles of sample size calculation - PMC - NIH
    (A) Sample size for one mean, normal distribution. n = Z α + Z β 2 × σ 2 d 2 · (B) Sample size for two means, quantitative data. n = Z α + Z β 2 × σ 2 d 2 · (C) ...
  46. [46]
    [PDF] Power Comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors ...
    Results show that Shapiro-Wilk test is the most powerful normality test, followed by Anderson-Darling test,. Lillie/ors test and Kolmogorov-Smirnov test.
  47. [47]
    Penalized power properties of the normality tests in the presence of ...
    It is important, since the outliers may increase the variability in the data set, they cause the decrease in the statistical power. In this study we show the ...
  48. [48]
    The Central Limit Theorem Around 1935 - Project Euclid
    The Central Limit Theorem around 1935 involved finding conditions for approximating sums of random variables by Gaussian distributions, with Feller and Levy's ...Missing: original | Show results with:original
  49. [49]
    Central Limit Theorems | SpringerLink
    J. Lindeberg, “Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung,” Math. Zeit. 15 (1922), 211–225. Article MathSciNet Google ...
  50. [50]
    26.1 - Sums of Independent Normal Random Variables | STAT 414
    follows the normal distribution: ... Therefore, by the uniqueness property of moment-generating functions, Y must ...
  51. [51]
    The Normal Distribution - Utah State University
    The normal distribution is widely used in probability theory and underlies much of statistical inference.
  52. [52]
    Probability Playground: The Normal Distribution
    The sum of n independent normal random variables with parameters (μ₁, σ₁²),..., (μₙ, σₙ²) is itself a normal random variable with parameters (μ₁ + ··· + μₙ, σ₁² ...
  53. [53]
    [PDF] The multivariate normal distribution - MyWeb
    • A very important property of the multivariate normal distribution is that its linear combinations are also normally distributed. • Theorem: Let b be a k ...
  54. [54]
    [PDF] Multivariate Distributions
    Linear combinations of multivariate normal random vectors remain normally distributed with mean vector and ... Generating Correlated Normal Random Variables.
  55. [55]
    Product of Two Gaussian PDFs - Stanford CCRMA
    The product density has mean and variance given by \begin{eqnarray*} \mu &=& \frac{\frac{\mu_1}{2\sigma_1^2} + \frac{\mu_2}{2\sigma_2^2}}{\frac{1}{2\
  56. [56]
    [PDF] Lecture 3 : Probability Theory - MIT OpenCourseWare
    (ii) Show that the product of two independent log-normal distributions is also a log-normal distribution. Some examples of other importatnt distributions ...
  57. [57]
    Lesson 21: Bivariate Normal Distributions | STAT 414
    To calculate such a conditional probability, we clearly first need to find the conditional distribution of Y given X = x . That's what we'll do in this lesson, ...
  58. [58]
    [PDF] of Infinitely Divisible Distributions - Purdue Department of Statistics
    Infinitely divisible (id) distributions, introduced in 1929, can be divided into n independent components. Examples include normal, Poisson, and Gamma  ...
  59. [59]
    Infinite Divisibility and Variance Mixtures of the Normal Distribution
    April, 1971 Infinite Divisibility and Variance Mixtures of the Normal Distribution. Douglas Kelker · DOWNLOAD PDF + SAVE TO MY LIBRARY. Ann. Math. Statist.Missing: reference | Show results with:reference
  60. [60]
    [PDF] Entropic instability of Cramer's characterization of the normal law
    A well-known theorem of Cramer (1936, [Cr]) indicates that, if the sum X+ Y of two independent random variables X and Y has a normal distribution, then neces- ...
  61. [61]
    [PDF] Three remarkable properties of the Normal distribution - arXiv
    The Levy Cramer theorem states that if the sum of two independent non- constant random variables X1 and X2 is normally distributed, then each of the summands ( ...
  62. [62]
    [PDF] Distributions Derived From the Normal Distribution
    Distributions Derived From the Normal Distribution. 1. Page 2. Distributions Derived from Normal Random Variables χ2 , t, and F Distributions. Statistics from ...
  63. [63]
    26.4 - Student's t Distribution | STAT 414 - STAT ONLINE
    The density curve looks like a standard normal curve, but the tails of the t -distribution are "heavier" than the tails of the normal distribution. That is, we ...
  64. [64]
    [PDF] Chi-square (χ 2) distribution. • t distri
    Distributions related to the normal distribution. Three important distributions: • Chi-square (χ2) distribution. • t distribution. • F distribution. Before ...Missing: deriving | Show results with:deriving
  65. [65]
    [PDF] Lecture 3 Properties of MLE: consistency, asymptotic normality ...
    In this section we will try to understand why MLEs are 'good'. Let us recall two facts from probability that we be used often throughout this course.Missing: citation | Show results with:citation
  66. [66]
    [PDF] Maximum Likelihood Estimation
    May 14, 2001 · Under regularity conditions, the MLE is consistent, asymptotically normally distrib-.
  67. [67]
    [PDF] Chapter 6 Importance sampling - Arizona Math
    Importance sampling rewrites the mean using a proposal distribution (q(x)) and a target distribution (p(x)), generating samples from q(x).
  68. [68]
    [PDF] Bivariate Hierarchical Bayesian Model for Combining Summary ...
    Sep 15, 2021 · normal distribution with mean vector (µθ = 10,µσ = 2)T and variance ... Our bivariate hierarchical model for combining summary measures and their uncer-.
  69. [69]
    [PDF] Stan Reference Manual
    This is the official reference manual for Stan's programming language for coding probability models, inference algorithms for fitting models and making ...
  70. [70]
    [PDF] A New Approach to Linear Filtering and Prediction Problems1
    Using a photo copy of R. E. Kalman's 1960 paper from an original of the ASME “Journal of Basic Engineering”, March. 1960 issue, I did my best to make an ...
  71. [71]
    Book webpage - Gaussian Processes for Machine Learning
    This book provides a long-needed systematic and unified treatment of theoretical and practical aspects of GPs in machine learning.Contents · Data · Errata · Order
  72. [72]
    [2006.11239] Denoising Diffusion Probabilistic Models - arXiv
    Access Paper: View a PDF of the paper titled Denoising Diffusion Probabilistic Models, by Jonathan Ho and 2 other authors. View PDF · TeX ...Missing: Song | Show results with:Song
  73. [73]
    A Note on the Generation of Random Normal Deviates - Project Euclid
    June, 1958 A Note on the Generation of Random Normal Deviates. G. E. P. Box, Mervin E. Muller · DOWNLOAD PDF + SAVE TO MY LIBRARY. Ann. Math. Statist.
  74. [74]
    A Convenient Method for Generating Normal Variables | SIAM Review
    Marsaglia, Improving the Polar Method for Generating a Pair of Random Variables, D1-82-0203, Boeing Sci. Res. Lab., 1962. Google Scholar. 4. G. Marsaglia ...
  75. [75]
    The Ziggurat Method for Generating Random Variables
    We provide a new version of our ziggurat method for generating a random variable from a given decreasing density. It is faster and simpler than the original.
  76. [76]
    Erf -- from Wolfram MathWorld
    erf(z) is the "error function" encountered in integrating the normal distribution (which is a normalized form of the Gaussian function).
  77. [77]
    A sharp Pólya-based approximation to the normal cumulative ...
    Apr 1, 2018 · In this paper, we provide a simple approximation for the standard normal CDF with only a few explicit coefficients obtained by precise mathematical calculation.
  78. [78]
    Computer Evaluation of the Normal and Inverse Normal Distribution ...
    FIGURnE 1-The error e in approximating the inverse normal distribution function by Newton-. Raphson iteration. logarithm, etc.) computer library functions.Missing: quantile | Show results with:quantile
  79. [79]
    [PDF] De Moivre on the Law of Normal Probability - University of York
    This paper gave the first statement of the formula for the “normal curve,” the first method of finding the probability of the occurrence of an error of a given ...
  80. [80]
    Theoria motus corporum coelestium in sectionibus conicis solem ...
    Nov 21, 2014 · Theoria motus corporum coelestium in sectionibus conicis solem ambientium. by: C. F. Gauss. Publication date: 1809.Missing: normal distribution
  81. [81]
    [PDF] THE ANALYTIC THEORY OF PROBABILITIES Third Edition Book I ...
    Many of the problems had been treated by Laplace in earlier memoirs. Consequently the TAP may be considered in one sense as a consolidation of his work in ...
  82. [82]
    Sur l'homme et le développement de ses facultés - Internet Archive
    Jan 26, 2009 · Sur l'homme et le développement de ses facultés : ou, Essai de physique sociale. by: Quetelet, Adolphe, 1796-1874.
  83. [83]
    [PDF] Natural Inheritance by Francis Galton (Macmillan, 1889)
    Ordinates to the Normal Curve of Distribution, when its 100. Grades run from 50%, through 0°, to + 50°. Ditto when the Grades run from 0° to 100 ...
  84. [84]
    [PDF] History of the Central Limit Theorem - AMS Tesi di Laurea
    The term itself was the title of a paper published in 1920 by George Pólya, in order to underline its central role in probability theory. Therefore, strictly ...
  85. [85]
    The Evolution of the Normal Distribution - jstor
    the distribution of measurement errors was Adolphe Quetelet (1796-1874). ... Why normal?. A word must be said about the origin of the term normal. Its ...
  86. [86]
    Quetelet and the emergence of the behavioral sciences - SpringerPlus
    Sep 4, 2015 · The other key notion concerned the distribution of observations, which takes a 'Gaussian' form (now called a 'normal distribution'). This he ...
  87. [87]
    NORMAL Distribution: Origin of the name - DePaul University
    The NORMAL distribution has been studied under various names for nearly 300 years. Some names were derived from ERROR, e.g. the law of error, the law of ...
  88. [88]
    [PDF] Quantities, Units and Symbols in Physical Chemistry - IUPAC
    When the normal distribution applies and uc is a reliable estimate of the standard deviation of y, U = 2uc (i.e., k = 2) defines an interval having a level ...
  89. [89]
    Statistics and eugenics: How the past will shape the future | BPS
    Sep 16, 2025 · The collection and analysis of eugenic data. Much of the story of modern mathematical statistics begins with Francis Galton, a relative of ...
  90. [90]
    Teaching the Difficult Past of Statistics to Improve the Future
    Jul 20, 2023 · An inquiry into the history of eugenics at University College London—where Galton, Pearson, and Fisher all worked—noted the importance of not ...