Shapiro–Wilk test
The Shapiro–Wilk test is a statistical hypothesis test designed to determine whether a given sample of data is drawn from a normally distributed population, with the null hypothesis stating that the data follow a normal distribution.[1] Developed by Samuel S. Shapiro and Martin B. Wilk in 1965,[2] it evaluates goodness of fit by comparing the ordered sample values to the expected values from a normal distribution.[2] The test produces a statistic W that ranges between 0 and 1, where values close to 1 are consistent with normality and smaller values suggest deviations such as skewness or excess kurtosis; the null hypothesis is rejected if W falls below a critical value at a chosen significance level (e.g., 0.05).[1]

The test statistic is computed as W = \frac{\left( \sum_{i=1}^{n} a_i x_{(i)} \right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, where x_{(i)} are the observations ordered from smallest to largest, \bar{x} is the sample mean, and the coefficients a_i are pre-determined constants derived from the means, variances, and covariances of order statistics from standard normal samples of size n.[1] These constants and critical values were originally obtained through Monte Carlo simulations for sample sizes up to 50,[2] with later extensions allowing use for larger n, up to 5,000 in modern software implementations.[3] The method is designed for complete, independent samples and is invariant to the unknown mean and variance of the hypothesized normal distribution under the null.[1]

One of the key strengths of the Shapiro–Wilk test is its superior statistical power compared to other normality tests, such as the Kolmogorov–Smirnov and Anderson–Darling tests, especially for detecting departures from normality in small to moderate sample sizes (typically n < 50).[4] It excels at identifying non-normality due to asymmetry or heavy tails, making it a preferred choice in fields like biostatistics, engineering, and quality control, where parametric assumptions underpin analyses such as t-tests or ANOVA.[3] Empirical studies have confirmed its robustness across various distributions, often outperforming competitors in power while maintaining appropriate type I error rates.[3]

Despite its advantages, the Shapiro–Wilk test has limitations, including reduced power for very small samples, where it may fail to detect non-normality reliably, and high sensitivity in large samples (n > 200), where even minor deviations from perfect normality can lead to rejection of the null hypothesis.[4] It is also computationally intensive for large n due to the need for ordering and coefficient calculations, though this is mitigated by statistical software such as R, SAS, and SPSS.[1] For multivariate data or tied observations, extensions or alternative tests may be necessary.[5]
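The following minimal sketch illustrates the decision rule described above, using SciPy's scipy.stats.shapiro (one of the software implementations discussed later); the sample, seed, and significance level are illustrative choices, not part of the test's definition.

```python
# Minimal usage sketch of the Shapiro-Wilk test via SciPy; the sample and
# alpha are illustrative choices, not part of the test's definition.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=40)  # normal by construction

result = stats.shapiro(sample)
print(f"W = {result.statistic:.4f}, p = {result.pvalue:.4f}")

# Reject the null hypothesis of normality when the p-value is below alpha.
alpha = 0.05
print("reject H0" if result.pvalue < alpha else "fail to reject H0")
```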
Overview
Definition and Purpose
The Shapiro–Wilk test is a statistical hypothesis test designed to assess whether a given random sample is drawn from a normally distributed population. It operates by comparing the ordered sample values to the expected values of order statistics from a standard normal distribution, quantifying the degree of conformity through a test statistic.[6] This method was introduced as an analysis of variance-based approach specifically for complete samples, making it particularly sensitive to deviations in skewness and kurtosis.[1]

The primary purpose of the Shapiro–Wilk test is to serve as a diagnostic tool in statistical analysis, verifying the normality assumption that underpins many parametric procedures, such as t-tests, analysis of variance (ANOVA), and linear regression. By identifying non-normal distributions early, it helps researchers decide whether to proceed with parametric methods or opt for robust alternatives or transformations. The test is especially valuable for preliminary data screening in fields like biology, engineering, and social sciences, where normality is often assumed but rarely guaranteed.[1]

At its core, the test evaluates two hypotheses: the null hypothesis (H₀) posits that the population from which the sample is drawn follows a normal distribution, while the alternative hypothesis (H₁) asserts that it does not. Rejection of H₀ occurs when the p-value falls below a chosen significance level (typically 0.05), indicating evidence of non-normality. Developed for univariate data, the test performs optimally for small to moderate sample sizes ranging from n = 3 to 50, though extensions such as the Royston formulation allow reliable application up to n = 2000.[7]
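As a short illustration of these two hypotheses, the sketch below (assuming SciPy's scipy.stats.shapiro as the implementation; the seed and sample sizes are arbitrary) applies the test to a sample drawn from a normal population and to one drawn from a right-skewed exponential population.

```python
# Contrasting an H0-true sample with an H0-false sample.
# Assumption: SciPy's shapiro() as the implementation; seed/sizes arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
samples = {
    "normal": rng.normal(size=30),            # drawn from a normal population
    "exponential": rng.exponential(size=30),  # right-skewed population
}

for name, data in samples.items():
    w, p = stats.shapiro(data)
    verdict = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"{name:12s} W={w:.3f} p={p:.4f} -> {verdict}")
```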
History and Development
The Shapiro–Wilk test was developed in 1965 by Samuel S. Shapiro, affiliated with General Electric Company, and Martin B. Wilk, affiliated with Bell Telephone Laboratories, Inc., as a response to the limitations of prior normality tests, such as the chi-squared goodness-of-fit method, which often lacked power for small sample sizes.[8] Their approach sought to create a more sensitive test by deriving an optimal linear combination of order statistics, leveraging the expected values and variances from the moments of the normal distribution to better detect deviations from normality in complete samples.[6]

The test was formally introduced in their seminal paper, "An Analysis of Variance Test for Normality (Complete Samples)," published in Biometrika, Volume 52, Issues 3-4, pages 591-611.[6] In this work, Shapiro and Wilk presented the test statistic W alongside precomputed coefficients and critical value tables for sample sizes ranging from 3 to 50, a practical necessity given the computational limitations of the time that precluded real-time calculation of the required normal order statistics.[9]

Subsequent extensions addressed the original test's scope limitations. In 1982, J. P. Royston proposed an algorithm extending the Shapiro–Wilk W test to larger samples, up to n = 2000, using approximations to maintain accuracy while enabling broader applicability. Royston further refined these approximations in 1992, providing efficient computational methods that facilitated the test's integration into statistical software packages, leading to its widespread adoption by the 1990s.[10]
Theoretical Foundation
Test Statistic
The Shapiro–Wilk test statistic, denoted W, quantifies the degree of normality in a sample by comparing the ordered observations to their expected values under a normal distribution. It is formally defined as W = \frac{\left( \sum_{i=1}^n a_i x_{(i)} \right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}, where x_{(1)} \leq \cdots \leq x_{(n)} are the ordered sample values, \bar{x} is the sample mean, and a_i (for i = 1, \dots, n) are predetermined constants derived from the properties of normal order statistics.[1]

The derivation of W relies on the high correlation between the ordered sample from a normal population and the corresponding expected normal order statistics. Specifically, the coefficient vector is \mathbf{a} = \mathbf{V}^{-1} \mathbf{m} / (\mathbf{m}^T \mathbf{V}^{-1} \mathbf{V}^{-1} \mathbf{m})^{1/2}, that is, \mathbf{V}^{-1} \mathbf{m} normalized to unit length, where \mathbf{m} is the vector of expected values of the order statistics from a standard normal distribution, and \mathbf{V} is the covariance matrix of those order statistics. With this choice, the numerator is proportional to the square of the best linear unbiased estimator of the scale parameter obtained by regressing the ordered observations on the expected normal order statistics, while the denominator is the total sum of squared deviations; the two estimate the same quantity when the data are normal, which maximizes the test's power to detect departures from normality. The a_i are precomputed for sample sizes up to n = 50 (and approximated for larger n) to facilitate practical application.

Under the null hypothesis of normality, W approaches 1, as the ordered sample closely matches the expected normal pattern; values substantially less than 1 indicate inconsistencies, such as skewness or kurtosis anomalies, with W bounded between 0 and 1. The optimality of the a_i coefficients for the normal distribution enhances the test's sensitivity compared to moment-based methods, which rely on skewness or kurtosis estimates and are less efficient for small samples.[1]
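Exact coefficients require \mathbf{m} and \mathbf{V}, which in practice come from tables or approximations. The sketch below is a deliberately crude illustration of the construction: it approximates \mathbf{m} with Blom's formula and, as a simplifying assumption, replaces \mathbf{V} with the identity matrix, so the resulting a_i are only rough stand-ins for the published values.

```python
# Illustrative (NOT exact) Shapiro-Wilk coefficients: m_i approximated by
# Blom's formula, and the covariance matrix V replaced by the identity,
# so a = m / ||m||. Real implementations use tables or Royston's method.
import numpy as np
from scipy import stats

def approx_coefficients(n: int) -> np.ndarray:
    i = np.arange(1, n + 1)
    m = stats.norm.ppf((i - 0.375) / (n + 0.25))  # Blom approximation to m_i
    return m / np.sqrt(m @ m)                     # normalize (V taken as I)

def w_statistic(x) -> float:
    x = np.sort(np.asarray(x, dtype=float))
    a = approx_coefficients(len(x))
    return (a @ x) ** 2 / np.sum((x - x.mean()) ** 2)

rng = np.random.default_rng(1)
print(w_statistic(rng.normal(size=25)))       # near 1 for normal data
print(w_statistic(rng.exponential(size=25)))  # noticeably smaller
```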
Hypotheses and Assumptions
The Shapiro–Wilk test evaluates the hypothesis that a given sample originates from a normally distributed population. The null hypothesis H_0 posits that the observations are drawn from a normal distribution with unspecified mean \mu and variance \sigma^2.[2] The alternative hypothesis H_1 states that the sample does not come from such a normal distribution, indicating some form of deviation from normality without specifying the nature of the departure (e.g., skewness or kurtosis).[4] The test rejects H_0 in favor of H_1 when the test statistic provides sufficient evidence against normality at the chosen significance level.

For the test to be valid, the sample must consist of independent and identically distributed (i.i.d.) random variables from the target population, ensuring that the observations are randomly selected without systematic biases or dependencies.[11] The data are required to be univariate and continuous, as the test relies on order statistics that assume distinct values; ties or discrete data can distort the results, and missing values are not accommodated in the standard formulation.[7] Additionally, the sample size n must be at least 3, with optimal performance and exact critical values available for n between 3 and 50. Larger samples require approximations, and the test becomes highly sensitive to minor deviations; some implementations limit computation to n \leq 5000 (e.g., in R).[7][12]

Under H_0, the test is distribution-free with respect to the specific parameters \mu and \sigma^2, as the sampling distribution of the test statistic depends only on normality, not on location or scale.[2] However, the procedure assumes the absence of outliers arising from data-collection errors, as such anomalies can mimic non-normality; it is particularly sensitive to violations of the independence assumption, which can arise from clustered or serially correlated data.[4]
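The large-sample sensitivity mentioned above can be seen in a small simulation. The sketch below (seed, degrees of freedom, and sample sizes are arbitrary choices) draws nearly normal data from a t-distribution with 10 degrees of freedom; rejection typically becomes more likely as n grows.

```python
# Illustration of large-sample sensitivity: a t(10) distribution is close
# to normal, yet large samples from it are often flagged as non-normal.
# Seed, df, and sizes are arbitrary; results vary with the seed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
for n in (50, 500, 5000):
    data = rng.standard_t(df=10, size=n)
    w, p = stats.shapiro(data)
    print(f"n={n:5d}  W={w:.4f}  p={p:.4g}")
```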
Implementation
Calculation Steps
The computation of the Shapiro–Wilk test statistic W proceeds through a series of steps that transform the original sample into ordered values and apply specialized coefficients to assess deviation from normality. These steps are designed for algorithmic implementation; manual execution is practical only for small sample sizes because of the complexity of generating the required coefficients. A runnable sketch combining the steps appears after the list.
- Sort the sample data: Arrange the n observations in non-decreasing order to obtain the ordered sample x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}. This step aligns the data with the expected order statistics under normality.
- Compute the sample mean and sum of squared deviations: Calculate the sample mean \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i and the total sum of squared deviations from the mean, \sum_{i=1}^n (x_i - \bar{x})^2, which serves as the denominator in the test statistic. These quantities provide a measure of the sample variance.
- Obtain the coefficients a_i: The coefficients a_i (for i = 1, \dots, n) are determined from the expected values and covariance structure of order statistics from a standard normal distribution. For sample sizes n \leq 50, they can be read directly from the precomputed tables provided in the original formulation. For larger n, the coefficients must be calculated algorithmically, in principle by solving a = V^{-1} m / \sqrt{m^T V^{-1} V^{-1} m}, where m is the vector of expected normal order statistics and V is their covariance matrix; because forming and inverting V costs O(n^3) operations, practical implementations instead rely on Royston's approximations to the coefficients, which avoid the exact matrix computation entirely.
- Calculate the test statistic W: Form the numerator as \left( \sum_{i=1}^n a_i x_{(i)} \right)^2 and divide by the sum of squared deviations computed in step 2 to yield W = \frac{\left( \sum_{i=1}^n a_i x_{(i)} \right)^2}{\sum_{i=1}^n (x_i - \bar{x})^2}. The resulting W ranges from 0 to 1, with values near 1 indicating closer conformity to normality. Note that the coefficients satisfy \sum_{i=1}^n a_i = 0, so the numerator can equivalently be written as \left( \sum_{i=1}^n a_i (x_{(i)} - \bar{x}) \right)^2.
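The sketch below combines these steps into one function. Because the exact a_i (step 3) come from the published tables or from Royston's approximations, the function takes the coefficients as an argument; the example feeds it the rough Blom-based stand-in from the earlier sketch, not the published values.

```python
# Step-by-step computation of W. The coefficients `a` are an input: use
# published table values for n <= 50, or an approximation. Here we reuse
# the rough Blom-based stand-in (V treated as identity) for illustration.
import numpy as np
from scipy import stats

def shapiro_wilk_w(x, a):
    x_ord = np.sort(np.asarray(x, dtype=float))   # step 1: order the sample
    ss = np.sum((x_ord - x_ord.mean()) ** 2)      # step 2: total sum of squares
    return (np.asarray(a) @ x_ord) ** 2 / ss      # step 4: squared weighted sum / SS

# Step 3: obtain coefficients (approximate stand-ins, not table values).
n = 20
m = stats.norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))
a = m / np.sqrt(m @ m)

rng = np.random.default_rng(3)
print(shapiro_wilk_w(rng.normal(size=n), a))  # expected to be close to 1
```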
Critical Values and Software
The Shapiro–Wilk test relies on precomputed critical values for the test statistic W to determine significance at common levels such as \alpha = 0.05 and 0.01, with tables available for sample sizes up to n = 50 as originally provided in the foundational work by Shapiro and Wilk.[13][14] These tables include the coefficients needed to compute W and the corresponding critical thresholds, allowing manual comparison: a value of W below the critical value leads to rejection of normality. For larger samples, exact critical values are impractical to tabulate due to computational demands, so approximations or simulation-based methods are employed to derive p-values instead.[1]

P-value computation involves comparing the observed W to its distribution under the null hypothesis of normality, often using exact methods for small n or asymptotic approximations for larger samples. A key advancement is Royston's algorithm, which extends the test to n up to 5000 by applying a normalizing transformation to W, enabling efficient p-value calculation without exhaustive tabulation.[10][15] Post-1995 refinements to this algorithm, implemented in various packages, support reliable p-values for samples as large as 5000 via these approximations.[12] The test rejects the null hypothesis if the p-value is less than the chosen significance level \alpha, such as 0.05, indicating evidence of non-normality.

Implementations are widely available in statistical software, facilitating practical use. In R, the shapiro.test() function takes a numeric vector as input and returns the W statistic along with the p-value, supporting sample sizes from 3 to 5000 using Royston's method.[12][16] In Python, scipy.stats.shapiro() from the SciPy library accepts an array-like input and outputs the test statistic and p-value; it is applicable for n up to approximately 5000 and issues warnings for larger samples due to precision limits.[17] SAS implements the test via PROC UNIVARIATE with the NORMAL option, providing W, the p-value, and diagnostic plots for datasets of varying sizes.[18] In SPSS, the Shapiro–Wilk test is accessible through the Explore procedure or syntax for small to moderate samples (typically n < 50), outputting W and the p-value, but SPSS switches to alternative normality tests such as Kolmogorov–Smirnov for larger n.[19][20]
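As a brief check of the SciPy behavior described above (the exact message text may vary between SciPy versions), the sketch below runs shapiro() on a sample larger than 5000 and records the accuracy warning it emits.

```python
# Demonstrating SciPy's warning for very large samples (n > 5000); the
# exact warning text may differ across SciPy versions.
import warnings
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
big = rng.normal(size=6000)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    w, p = stats.shapiro(big)

print(f"W = {w:.5f}, p = {p:.4f}")
for c in caught:
    print("warning:", c.message)  # e.g., p-value accuracy warning for N > 5000
```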