Friedman test
The Friedman test is a non-parametric statistical test developed by economist and statistician Milton Friedman in 1937 to assess differences in treatments across multiple matched blocks or repeated measures without relying on the assumption of normally distributed data.[1] It functions as a rank-based analog to the parametric repeated measures analysis of variance (ANOVA), particularly suitable for randomized block designs where observations are dependent within blocks, such as in longitudinal studies or matched subject experiments.[2] By converting raw data into ranks within each block, the test computes a statistic based on the variance of rank sums, which under the null hypothesis of no treatment effects approximates a chi-squared distribution with k-1 degrees of freedom, where k is the number of treatments.[3] The procedure begins by ranking the observations for each block from 1 (lowest) to k (highest), assigning average ranks in case of ties, and then summing these ranks across all n blocks for each treatment to obtain rank totals R_j. The test statistic F_r (or Q) is calculated as F_r = \frac{12}{n k (k+1)} \sum_{j=1}^k R_j^2 - 3 n (k+1),
which simplifies to assess whether the rank sums deviate significantly from their expected value under randomness.[3] Assumptions include ordinal or higher-level data that can be ranked, independent blocks, and no systematic interactions beyond the treatments of interest; it performs well with small sample sizes (n ≥ 5 recommended) but may require exact distribution tables for very small k.[2] Unlike parametric ANOVA, it is robust to outliers and non-normal distributions but has lower power when normality holds.[4] In practice, the Friedman test is widely applied in fields like psychology, medicine, and biology to analyze repeated measures data, such as comparing pain relief across multiple drugs in the same patients or evaluating performance under varying conditions in matched subjects.[5] If the test indicates significant differences (p < 0.05), post-hoc pairwise comparisons using procedures like the Wilcoxon signed-rank test with Bonferroni correction can identify specific treatment pairs driving the effect.[6] Its enduring relevance stems from Friedman's original emphasis on avoiding normality assumptions in variance analysis, making it a foundational tool in non-parametric statistics.[1]
Introduction
Definition and purpose
The Friedman test is a rank-based, non-parametric statistical procedure designed to detect differences in treatments across multiple test attempts or blocks in experimental designs.[7] Introduced as an alternative to parametric methods that assume normality, it applies ranks to the data within each block to avoid reliance on distributional assumptions, making it suitable for analyzing correlated or matched observations.[8] Its primary purpose is to compare three or more related samples, particularly in scenarios where the data violate the normality requirements of parametric tests such as repeated measures ANOVA.[4] The test evaluates whether there are significant overall differences among the groups or conditions, testing the null hypothesis that all population distributions are identical against the alternative that at least one tends to produce larger (or smaller) observations.[9] The Friedman test is appropriately applied in situations involving ordinal data, small sample sizes, or non-normal distributions in within-subjects designs, such as ranking preferences across multiple options or assessing treatment effects in matched blocks.[8] In practice, it ranks the observations within each block and derives a test statistic to quantify the consistency of these ranks across blocks, providing evidence of treatment effects without assuming equal intervals or Gaussian distributions.[10]
Historical development
The Friedman test originated from the work of economist and statistician Milton Friedman during his early career in the 1930s, as he explored alternatives to parametric methods that relied on normality assumptions. Developed in 1937 while Friedman was a research assistant at the National Bureau of Economic Research, the test addressed the need for robust analysis in experimental designs involving multiple related samples.[11] Friedman first described the procedure in his paper "The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance," published in the Journal of the American Statistical Association. In this work, he proposed replacing raw observations with ranks within blocks to perform a two-way analysis of variance, thereby extending rank-based techniques to handle repeated measures or matched designs without assuming underlying distributions. The method was directly inspired by Harold Hotelling and Margaret Pabst's 1936 paper on rank correlation coefficients, which Friedman encountered during his studies under Hotelling at Columbia University; it built on R. A. Fisher's foundational two-way ANOVA framework by adapting it for non-parametric use.[7][11][12] The test gained prominence in the mid-20th century alongside the broader expansion of non-parametric statistics, particularly in the 1940s and 1950s, as researchers in fields like psychology, biology, and social sciences increasingly favored distribution-free methods for handling ordinal or non-normal data in experimental settings. This period saw a surge in rank-based procedures, with Friedman's approach becoming a standard tool for analyzing blocked designs, as evidenced by its inclusion in seminal texts on non-parametric methods.[11]
Assumptions and data requirements
Non-parametric assumptions
The Friedman test, as a non-parametric alternative to repeated-measures ANOVA, does not require the assumption of normality in the underlying data distributions, making it suitable for ordinal or non-normal continuous data where parametric assumptions fail. Instead, its core assumption is that the observations within each block are identically distributed except for possible location shifts attributable to treatment effects, allowing the test to focus on differences in central tendency across treatments while controlling for block variability.[1][13] Regarding independence, the test is designed for related samples, where observations within each block are dependent due to matching or repeated measures on the same subjects, but the blocks themselves must be independent to ensure the validity of the overall analysis. This structure accounts for intra-block correlations without assuming independence within blocks, distinguishing it from tests for independent samples. The data are typically ordinal or continuous, transformed into ranks within each block for analysis; ties are accommodated by assigning average ranks to tied values, preserving the ordinal nature of the measurements.[14][15][16] Despite its robustness to outliers and non-normal distributions, the Friedman test has limitations, including the assumption of similar distributional shapes across treatments within blocks apart from location differences. If these assumptions are violated—such as through differing variances or shapes in the distributions—the test may fail to adequately control the Type I error rate, potentially leading to inflated false positive rates, though it generally maintains nominal levels when assumptions hold. In contrast to parametric tests, which impose stricter normality and homoscedasticity requirements, the Friedman test's relaxed assumptions enhance its applicability in diverse empirical settings.[17][18]
Data structure and prerequisites
The Friedman test requires data arranged in a blocked design, featuring n blocks (such as subjects or matched groups) and k treatments (such as conditions or time points), with exactly one observation per treatment within each block.[19] This configuration yields a rectangular n \times k data matrix, where rows represent blocks and columns represent treatments, ensuring a complete, unreplicated block structure.[20] The test mandates at least three treatments (k \geq 3) to detect differences among multiple groups, while the number of blocks should be at least five for adequate power, though typically n \geq 10 is recommended to ensure reliable p-values via the chi-square approximation; smaller n can be analyzed using exact permutation methods.[2][21][17] When ties occur within a block (identical observations across treatments), they are handled by assigning average ranks to the tied values, and statistical software typically applies a tie-correction factor to the test statistic to maintain test validity.[22][23] Key prerequisites include that observations within blocks must be related, such as repeated measures on the same subjects over time or carefully matched pairs/groups to control for inter-block variability.[14] Incomplete data poses challenges, as the test assumes fully observed blocks; missing values necessitate either listwise deletion (reducing n) or cautious imputation, though the latter risks biasing ranks and should be avoided when possible, with extensions like the Skillings-Mack test considered for substantial missingness.[24][25] For illustration, consider a dataset evaluating three treatments (A, B, C) on ten subjects, structured as follows:
| Subject | Treatment A | Treatment B | Treatment C |
|---|---|---|---|
| 1 | 5.2 | 6.1 | 4.8 |
| 2 | 4.9 | 5.5 | 5.0 |
| ... | ... | ... | ... |
| 10 | 6.0 | 7.2 | 5.9 |
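A minimal Python sketch (hypothetical values; NumPy assumed available) illustrates the required n \times k layout and the listwise-deletion handling of an incomplete block:

```python
import numpy as np

# Hypothetical blocked data: rows are blocks (subjects), columns are the k = 3
# treatments, with exactly one observation per cell; np.nan marks a missing value.
data = np.array([
    [5.2, 6.1, 4.8],
    [4.9, 5.5, 5.0],
    [6.0, np.nan, 5.9],   # incomplete block
    [5.5, 6.3, 5.1],
])
n, k = data.shape
assert k >= 3, "the Friedman test requires at least three treatments"

# Listwise deletion: drop any block containing a missing observation (reduces n).
complete = data[~np.isnan(data).any(axis=1)]
print(complete.shape)   # (3, 3)
```

Dropping the incomplete block discards information, which is why extensions such as the Skillings-Mack test are preferred when missingness is substantial.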
Test procedure
Step-by-step method
The Friedman test begins with organizing the data into a structured format suitable for analysis. The dataset consists of n blocks (also called subjects or rows), each containing k observations corresponding to k treatments or conditions (columns). This arrangement ensures that observations within each block are related, such as repeated measures on the same subjects.[27] To perform the test manually, follow these sequential steps:
- Organize the data: Arrange the observations into an n \times k table, where rows represent blocks and columns represent treatments. Ensure that the data meet the basic prerequisites, such as ordinal or continuous measurements without requiring normality.[22]
- Rank observations within each block: For each row independently, assign ranks to the k observations from 1 (lowest value) to k (highest value). If ties occur within a block, assign the average of the tied ranks to each tied observation; for example, two tied values for ranks 3 and 4 both receive rank 3.5. This ranking process is performed separately for every block to account for subject-specific variability.[28][27]
- Sum the ranks for each treatment: Calculate the total rank sum R_j for each treatment j (where j = 1 to k) by adding the ranks assigned to that treatment across all n blocks. These sums, R_1, R_2, \dots, R_k, represent the aggregated ranking for each treatment.[27]
- Verify block totals (optional but recommended): For each block, confirm that the sum of ranks equals \frac{k(k+1)}{2}. This sum is exact even with ties due to the use of average ranks. This step ensures the ranking process is accurate and complete.[22]
- Prepare for test statistic calculation: Use the rank sums R_j as inputs for computing the overall test statistic, with the detailed formula provided in the subsequent mathematical formulation.[27]
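The ranking and summation steps above, together with the tie correction described in the mathematical formulation, can be sketched as a short Python function (an illustrative implementation, not a reference one; the name friedman_statistic is our own, and SciPy is assumed available):

```python
import numpy as np
from scipy.stats import rankdata, chi2

def friedman_statistic(data):
    """Friedman Q (tie-corrected) and p-value for an (n blocks x k treatments) array."""
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    ranks = np.apply_along_axis(rankdata, 1, data)   # average ranks within each block
    R = ranks.sum(axis=0)                            # rank sum R_j per treatment
    q = 12.0 / (n * k * (k + 1)) * np.sum(R ** 2) - 3 * n * (k + 1)
    # Tie correction C = 1 - sum(t^3 - t) / (n k (k^2 - 1)), summed over tie groups
    # within each block; C = 1 when there are no ties.
    tie_term = sum(
        np.sum(counts ** 3 - counts)
        for row in ranks
        for counts in [np.unique(row, return_counts=True)[1]]
    )
    q /= 1.0 - tie_term / (n * k * (k ** 2 - 1))
    return q, chi2.sf(q, k - 1)   # p-value from the chi-squared approximation

q, p = friedman_statistic([[5.2, 6.1, 4.8],
                           [4.9, 5.5, 5.0],
                           [6.0, 7.2, 5.9]])
print(round(q, 2), round(p, 3))   # 4.67 0.097
```

On this toy three-block example the result matches scipy.stats.friedmanchisquare, which implements the same tie-corrected statistic.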
Mathematical formulation
The Friedman test statistic is based on the sums of ranks assigned to each of the k treatments across n blocks (or subjects). The rank sum for treatment j is defined as R_j = \sum_{i=1}^n r_{ij}, where r_{ij} is the rank of the i-th block's observation for treatment j, with ranks typically ranging from 1 to k within each block (using average ranks for any ties).[27] Under the null hypothesis that there are no differences among the treatments, the test statistic Q measures the variability in these rank sums and is given by Q = \frac{12}{n k (k+1)} \sum_{j=1}^k \left( R_j - \frac{n(k+1)}{2} \right)^2, where \frac{n(k+1)}{2} is the expected rank sum for each treatment. This is mathematically equivalent to Q = \frac{12}{n k (k+1)} \sum_{j=1}^k R_j^2 - 3 n (k+1).[27][20] The derivation of Q stems from the variance of the rank sums under the null hypothesis. Under this hypothesis, each R_j has an expected value of \frac{n(k+1)}{2} and variance \frac{n (k^2 - 1)}{12}, leading to a standardized measure of dispersion that scales to the given form; this normalization ensures the statistic approximates a chi-squared distribution when treatment effects are absent.[27] For large n, Q follows approximately a \chi^2 distribution with k-1 degrees of freedom, allowing critical values to be obtained from standard chi-squared tables for significance testing.[27][20] When ties occur within blocks, average ranks are assigned to tied values, and the test statistic is adjusted to account for the reduced variability. The adjusted statistic divides the uncorrected value by a correction factor: Q_{\text{adj}} = \frac{1}{C} \left( \frac{12}{n k (k+1)} \sum_{j=1}^k R_j^2 - 3 n (k+1) \right), where C = 1 - \frac{\sum_i (t_i^3 - t_i)}{n k (k+1)(k-1)}, with the sum taken over all sets of ties and t_i denoting the number of observations tied in the i-th set.
Because C \leq 1, this adjustment inflates Q relative to the uncorrected value, compensating for the reduced rank dispersion caused by ties.[28][29] For small samples, the chi-squared approximation may be inaccurate, so the exact distribution of Q is used instead, typically via precomputed critical value tables or permutation-based methods that enumerate all possible rank assignments under the null hypothesis.[29][10]
Interpretation and results
Test statistic and significance
The null hypothesis of the Friedman test posits that the probability distributions of the treatments are identical across blocks, which is commonly interpreted as the treatments having no differential effect, or equivalently, equal medians assuming identical distribution shapes.[1][30] To determine statistical significance, the test statistic Q (detailed in the Mathematical formulation section) is compared to the critical value from the chi-square distribution with k-1 degrees of freedom, where k is the number of treatments, at a chosen significance level such as \alpha = 0.05. Alternatively, statistical software computes the exact p-value using Monte Carlo simulation or enumeration for small samples, or the asymptotic chi-square approximation for larger ones. If Q exceeds the critical value, or if the p-value is less than \alpha, the null hypothesis is rejected, providing evidence that at least one treatment differs from the others in its distribution.[31] Standard reporting conventions include stating the value of Q, the degrees of freedom k-1, and the associated p-value, along with the sample size n (number of blocks) to contextualize the approximation's reliability. The chi-square approximation is generally valid when n > 10, though some guidelines recommend n > 15 or k > 4 for better accuracy; for smaller samples, exact methods are preferred to avoid inflated Type I error rates.[17] Regarding power, the Friedman test is sensitive to location shifts (differences in medians or central tendencies) but generally has lower power than parametric alternatives such as repeated-measures ANOVA when normality holds; it is not designed to detect differences in variance or distribution shape.[32][33]
Effect size measures
The Friedman test detects differences in treatments across multiple matched blocks, but assessing the magnitude of these differences requires effect size measures to evaluate practical significance. The primary effect size metric for the Friedman test is Kendall's coefficient of concordance, denoted as W, which quantifies the degree of agreement in rankings across treatments.[34] Introduced by Kendall and Babington Smith, W normalizes the test statistic to range from 0, indicating no agreement or effect, to 1, representing perfect concordance in ranks. W is calculated as W = \frac{Q}{n(k-1)}, where Q is the Friedman test statistic derived from the sum of squared rank totals, n is the number of blocks (subjects), and k is the number of treatments. This measure is computed directly from the rank sums assigned to each treatment, providing a straightforward way to report the proportion of variance in ranks attributable to treatment differences, which aids in interpreting the practical importance of results beyond statistical significance.[35] Guidelines for interpreting W, adapted from Cohen's conventions for related statistics, classify values of approximately 0.1 as small effects, 0.3 as medium effects, and 0.5 as large effects, though these thresholds should be contextualized by the study's domain. Alternative effect size measures include rank-based analogs to eta-squared, which estimate the percentage of total rank variance explained by the treatment factor, and rank biserial correlations adapted for multi-group comparisons, though these are less commonly applied to the overall test.[36] Despite its utility, Kendall's W assumes the absence of tied ranks; when ties occur, adjustments such as those proposed by Gwet are recommended to correct for bias. 
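The conversion from Q to W is a one-line computation; a small Python sketch (hypothetical data chosen so that every block orders the three treatments the same way, giving perfect concordance):

```python
from scipy.stats import friedmanchisquare

# Five blocks, three treatments; every block ranks a < b < c.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 3.0, 4.0, 5.0, 6.0]
c = [3.0, 4.0, 5.0, 6.0, 7.0]
q, p = friedmanchisquare(a, b, c)

n, k = len(a), 3
W = q / (n * (k - 1))            # Kendall's W: 0 = no agreement, 1 = perfect concordance
print(round(q, 1), round(W, 2))  # 10.0 1.0
```

With identical orderings in every block the rank sums are maximally spread, so W reaches its upper bound of 1.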
Additionally, for small numbers of blocks (e.g., n < 20), the underlying chi-square approximation for significance testing may inflate Type I errors, necessitating adjustments or exact methods that also affect W's reliability.[37]
Related tests
Parametric alternatives
The primary parametric alternative to the Friedman test is the repeated measures analysis of variance (ANOVA), which is suitable for comparing means across multiple related samples or conditions when the data meet parametric assumptions.[14] Repeated measures ANOVA evaluates differences in means directly, assuming the data are continuous and follow a normal distribution within each group, along with homogeneity of variances (sphericity for within-subjects factors).[38] In contrast, the Friedman test employs ranks rather than raw values, providing robustness against violations of normality and unequal variances, making it preferable for ordinal data or non-normal distributions.[39] The Friedman test was specifically developed by statistician Milton Friedman in 1937 as a non-parametric method to circumvent the normality assumption inherent in parametric ANOVA procedures.[39] Key differences lie in their statistical foundations: repeated measures ANOVA relies on the F-statistic to test for mean differences, while the Friedman test uses a chi-square distributed test statistic (Q) based on rank sums. 
Under conditions of normality, the ANOVA F-statistic relates asymptotically to the Friedman Q statistic, with the latter approximating (k-1) times the F-value for k treatments, reflecting their near-equivalence in large samples.[39] However, the asymptotic relative efficiency of the Friedman test relative to ANOVA under normality is approximately 0.955k/(k+1), indicating slightly lower power for the non-parametric approach when parametric assumptions hold.[39] Researchers should prefer repeated measures ANOVA when dealing with large sample sizes, continuous normally distributed data, and verified sphericity, as it offers higher statistical power to detect true differences in such scenarios.[40] Conversely, the Friedman test is more appropriate for smaller samples, skewed distributions, or ranked data, where its robustness prevents inflated Type I error rates associated with parametric violations.[41] This choice balances power gains from parametric methods against the reliability of non-parametric alternatives in real-world data often deviating from ideal assumptions.[42]
Other non-parametric alternatives
The Wilcoxon signed-rank test serves as a foundational non-parametric procedure for comparing two related samples, functioning as a direct precursor to the Friedman test when the number of treatments or conditions is limited to k=2.[43] Developed by Frank Wilcoxon in 1945, it assesses differences in paired observations by ranking the absolute differences and accounting for their signs, providing a robust alternative to the paired t-test under non-normality.[44] In scenarios with only two repeated measures per block, the Wilcoxon signed-rank test is preferred over the Friedman test due to its higher power and simpler computation, as the Friedman test reduces to a less efficient sign test in this case.[45] For designs involving independent samples rather than repeated measures, the Kruskal-Wallis test acts as the between-subjects analog to the Friedman test, extending the Mann-Whitney U test to k>2 groups.[46] Introduced by William Kruskal and W. Allen Wallis in 1952, it ranks all observations across groups and tests for differences in distribution medians without assuming block-wise dependencies.[47] Researchers should opt for the Kruskal-Wallis test when blocks (subjects) are unrelated, as it avoids the within-block ranking central to the Friedman test and better suits completely randomized designs.[48] Extensions of the Friedman framework address specialized data types. For binary or dichotomous repeated measures, Cochran's Q test provides a direct adaptation, evaluating consistency in proportions across k conditions while maintaining the block structure. Proposed by William G. Cochran in 1950, it simplifies the Friedman ranks to 0-1 assignments per cell, offering a non-parametric chi-square-like test for matched binary data.[49] When ordered alternatives are hypothesized (such as a monotonic trend in treatment effects), the Page test enhances the Friedman approach by weighting ranks according to their expected order.[50] Edward B. Page formalized this in 1963, computing a linear combination of rank sums to detect ordered differences with greater sensitivity than the omnibus Friedman test.[51] The Friedman test's specificity to repeated measures designs imposes limitations relative to these alternatives; for instance, it requires paired blocks and may underperform without them, whereas the Kruskal-Wallis test accommodates independence but loses power from unexploited pairings.[52] Similarly, while Cochran's Q and the Page test inherit the block-wise ranking, they are constrained to binary or ordered contexts, respectively, and cannot handle general continuous outcomes as flexibly as the Friedman test.[53]
Post-hoc analysis
Multiple comparisons overview
Following a significant result from the Friedman test, which indicates overall differences among the treatments or related samples, post-hoc multiple comparisons are employed to pinpoint which specific pairs of treatments differ from one another.[54] This step is essential because the Friedman test only detects the presence of at least one difference but does not identify the location or nature of those differences.[54] Such analyses are performed solely when the Friedman test's p-value is less than the chosen significance level (e.g., α = 0.05); if the overall test is not significant, no further pairwise investigations are warranted to avoid unnecessary error inflation.[54] A primary challenge in multiple comparisons arises from the increased risk of Type I errors—the false identification of differences—due to conducting numerous pairwise tests simultaneously.[54] To mitigate this, adjustments to the significance level are required, such as the Bonferroni correction, which divides the overall α by the number of comparisons to control the familywise error rate.[54] These safeguards ensure that the probability of at least one false positive across all tests remains at the desired level, preserving the integrity of the analysis in the context of repeated measures or blocked designs typical of the Friedman test.[54] General strategies for post-hoc analysis after the Friedman test rely on rank-based procedures that extend the nonparametric framework of the original test. 
Common approaches include pairwise comparisons using adapted versions of tests like Dunn's procedure, which operates on the within-block ranks to compare mean ranks while incorporating multiplicity adjustments.[55] For comprehensive all-pairs evaluations, methods such as the Nemenyi test or Conover's test are frequently applied, providing a structured way to assess differences based on rank sums or mean ranks across all treatment combinations.[56][55] These techniques maintain the robustness of nonparametric inference, making them suitable for ordinal or non-normal data in repeated measures settings. However, some statistical literature has criticized rank-based post-hoc tests like Nemenyi and Conover for relying on assumptions such as exchangeability of ranks that may not hold in all applications, potentially leading to invalid p-values.[57]
Specific post-hoc procedures
When the Friedman test indicates significant differences among the treatments, post-hoc procedures are employed to identify which specific pairs differ. One common approach is the pairwise Wilcoxon signed-rank test adjusted for multiple comparisons using the Bonferroni correction. For each pair of treatments, the signed-rank test is applied to the differences in observations across blocks, ranking the absolute differences and assigning signs based on direction. The significance level α is then divided by the number of pairwise comparisons, C(k, 2), where k is the number of treatments, to control the family-wise error rate. This method is suitable for targeted pairwise investigations but can be conservative with many comparisons.[6] The Nemenyi test provides a distribution-free multiple comparison procedure based on the ranks from the Friedman analysis. It compares all pairs simultaneously by computing the test statistic q = \frac{|\bar{R}_i - \bar{R}_j|}{\sqrt{\frac{k(k+1)}{6N}}}, where \bar{R}_i and \bar{R}_j are the mean ranks for treatments i and j, k is the number of treatments, and N is the number of blocks. The null hypothesis of no difference between the pair is rejected if q > q_{\alpha}, the critical value derived from the studentized range distribution for k groups with infinite degrees of freedom at significance level α. Ties in the original data are handled by assigning average ranks during the Friedman ranking step, which propagates to the mean ranks without further adjustment. Critical values for q_{\alpha} are tabulated in standard statistical references for various α and k.[58] This test is ideal for all-pairs comparisons without assuming an a priori order among treatments. Conover's test offers another pairwise procedure, adapting a t-like statistic to the mean ranks from the Friedman test.
For each pair, the statistic is t = \frac{\bar{R}_i - \bar{R}_j}{s \sqrt{\frac{2}{N}}}, where \bar{R}_i and \bar{R}_j are the mean ranks, s is the pooled standard deviation of the ranks across treatments, and N is the number of blocks; this t follows a Student's t-distribution with (k-1)(N-1) degrees of freedom under the null. P-values are adjusted for multiplicity, often via Bonferroni or other methods, and the pair is deemed significant if the adjusted p < α. Ties are accommodated through average ranking, with degrees of freedom unchanged.[55] The choice among these procedures depends on the analysis goals: the Nemenyi test is preferred for comprehensive all-pairs evaluations in unordered treatments due to its simultaneous control, while the Wilcoxon signed-rank with Bonferroni suits focused pairwise tests where computational simplicity is valued. Conover's test provides a parametric-like flavor on ranks, useful when t-distributions offer familiarity, but requires careful multiplicity adjustment.
Applications and implementation
Practical examples
One practical example of the Friedman test involves a sensory evaluation study where 10 tasters rated three different wines (A, B, and C) on a scale from 1 to 10, with higher scores indicating greater preference. The raw ratings are presented in the following table:
| Taster | Wine A | Wine B | Wine C |
|---|---|---|---|
| 1 | 8 | 6 | 4 |
| 2 | 8 | 6 | 4 |
| 3 | 8 | 6 | 4 |
| 4 | 8 | 6 | 4 |
| 5 | 8 | 6 | 4 |
| 6 | 8 | 6 | 4 |
| 7 | 8 | 6 | 4 |
| 8 | 8 | 6 | 4 |
| 9 | 6 | 8 | 4 |
| 10 | 6 | 4 | 8 |
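Assuming SciPy is available, the wine data above can be analyzed end to end: the Friedman omnibus test followed, if significant, by pairwise Wilcoxon signed-rank tests at a Bonferroni-adjusted level (a sketch of one common workflow, not a prescribed one):

```python
from itertools import combinations
from scipy.stats import friedmanchisquare, wilcoxon

# Ratings from the table above: 10 tasters x 3 wines.
wines = {
    "A": [8, 8, 8, 8, 8, 8, 8, 8, 6, 6],
    "B": [6, 6, 6, 6, 6, 6, 6, 6, 8, 4],
    "C": [4, 4, 4, 4, 4, 4, 4, 4, 4, 8],
}
stat, p = friedmanchisquare(*wines.values())
print(f"Friedman Q = {stat:.1f}, p = {p:.4f}")   # Q = 12.8, p = 0.0017

if p < 0.05:   # omnibus result significant: locate the differing pairs
    pairs = list(combinations(wines, 2))
    alpha = 0.05 / len(pairs)   # Bonferroni-adjusted level for 3 comparisons
    for x, y in pairs:
        w, wp = wilcoxon(wines[x], wines[y])
        print(f"{x} vs {y}: p = {wp:.4f} ({'significant' if wp < alpha else 'n.s.'} at {alpha:.4f})")
```

With these heavily tied ratings, SciPy falls back to a normal approximation for the signed-rank p-values and may emit a warning; exact methods would be preferable for such small, tied samples.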
Software and computational tools
The Friedman test is implemented in various statistical software packages, facilitating its application to repeated measures data. In R, the base stats package provides the friedman.test() function, which performs the rank sum test on unreplicated blocked data.[61] The function accepts a formula interface for specifying the data, treatments, and blocks, such as friedman.test(response ~ treatment | block, data = df), where response is the measurement variable, treatment defines the groups, and block identifies the subjects or blocks.[62] The output includes the Friedman chi-squared statistic (Q) and the associated p-value based on the asymptotic chi-squared approximation; for small samples, users can verify results manually due to potential discrepancies in tie handling.[61]
In Python, the scipy.stats module offers friedmanchisquare(*samples), which computes the test statistic and p-value for multiple related samples passed as separate arrays.[31] This function automatically averages ranks for ties, providing a straightforward interface for data structured in long format or as grouped arrays, and returns the chi-squared value along with the p-value under the null hypothesis of identical distributions.[31]
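For data recorded in long format, the per-treatment arrays that friedmanchisquare() expects can be regrouped with a few lines of standard Python (illustrative values; the block ordering must be identical in every array so that observations stay matched):

```python
from scipy.stats import friedmanchisquare

# Long-format records: (block, treatment, value), one row per observation.
records = [
    (1, "A", 5.2), (1, "B", 6.1), (1, "C", 4.8),
    (2, "A", 4.9), (2, "B", 5.5), (2, "C", 5.0),
    (3, "A", 6.0), (3, "B", 7.2), (3, "C", 5.9),
    (4, "A", 5.8), (4, "B", 6.6), (4, "C", 5.3),
    (5, "A", 5.1), (5, "B", 6.0), (5, "C", 4.9),
]
groups = {}
for block, treatment, value in sorted(records):   # sorting keeps blocks aligned
    groups.setdefault(treatment, []).append(value)

stat, p = friedmanchisquare(*groups.values())
print(round(stat, 1), round(p, 3))   # 8.4 0.015
```

Each resulting array holds one treatment's observations in block order, matching the grouped-array interface described above.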
SPSS supports the Friedman test through the Nonparametric Tests menu under Related Samples, where users select the Friedman option and specify the repeated measures variables for each block.[63] The procedure outputs the chi-squared statistic, degrees of freedom, and asymptotic significance, with options for exact tests via Monte Carlo simulation for smaller datasets. In SAS, the test is available in PROC FREQ using the CMH (Cochran-Mantel-Haenszel) option on stratified tables, computing Friedman's chi-squared for randomized complete block designs; this yields the statistic, p-value, and supports both asymptotic and exact computations.
For Microsoft Excel, no native function exists, but add-ins like XLSTAT or the Real Statistics Resource Pack enable the test by inputting repeated measures data and generating ranks, the chi-squared statistic, and p-value.[64][65] These tools often provide options for approximate or exact p-values, suitable for smaller samples. For post-hoc analyses following a significant Friedman test, the R packages PMCMR and PMCMRplus extend functionality with procedures such as posthoc.friedman.nemenyi.test() and frdAllPairsNemenyiTest(), which perform pairwise comparisons adjusted for multiple testing.[55] Users should ensure data are in the required blocked format, as outlined in prerequisites, and cross-validate outputs with manual calculations for datasets under 20 blocks to confirm accuracy.[61]