Spearman's rank correlation coefficient
Spearman's rank correlation coefficient, denoted as \rho_s for the population parameter or r_s for the sample statistic, is a nonparametric statistical measure that assesses the strength and direction of the monotonic relationship between two variables based on their ranks rather than raw values.[1] Introduced by British psychologist Charles Spearman in his 1904 paper "The Proof and Measurement of Association between Two Things," it provides a robust alternative to Pearson's product-moment correlation when data do not meet assumptions of normality or linearity, particularly for ordinal data or when outliers are present.[1] The coefficient is calculated by first assigning ranks to the values of each variable, then applying a formula analogous to Pearson's correlation but to these ranks: r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}, where d_i is the difference between the ranks of corresponding observations, and n is the number of observations; adjustments are made for tied ranks to ensure accuracy.[1] This method yields values ranging from -1 (perfect negative monotonic association) to +1 (perfect positive monotonic association), with 0 indicating no monotonic relationship.[1] Unlike Pearson's correlation, which assumes linearity and is sensitive to outliers, Spearman's rho focuses on rank order preservation and is widely used in fields such as psychology, medicine, and environmental science for exploratory data analysis and hypothesis testing.[1][2] Its distribution-free nature makes it suitable for small sample sizes or non-parametric settings, though significance testing often relies on approximations to the t-distribution or exact permutation methods.[3]
Fundamentals
Definition
Spearman's rank correlation coefficient, denoted as \rho, is a nonparametric measure of the strength and direction of the monotonic association between two variables. It evaluates how well the relationship between the variables can be described by a monotonic function, where an increase in one variable is associated with either an increase or a decrease in the other, without requiring the relationship to be linear.[4][5] The method relies on ranking the data points of each variable, which involves assigning ordinal values (ranks) to the observations based on their order from lowest to highest, handling ties by averaging ranks where necessary. This ranking process transforms the original data into a form suitable for assessing order-preserving relationships, making \rho particularly appropriate for ordinal data or continuous data that may not follow a normal distribution.[6][7] Spearman's \rho is mathematically equivalent to the Pearson product-moment correlation coefficient applied to these ranked data, providing a value between -1 and +1, where +1 indicates a perfect positive monotonic relationship, -1 a perfect negative one, and 0 no monotonic association. The formula for computing \rho is: \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} where d_i represents the difference in ranks for the i-th paired observation, and n is the number of observations.[8][9] In contrast to Pearson's r, which assumes linearity and normality to measure linear relationships, Spearman's \rho is robust to outliers and nonlinear but monotonic patterns, rendering it ideal for non-parametric analyses.[10][11]
Calculation
To compute Spearman's rank correlation coefficient, denoted as \rho, begin by ranking the values of each variable separately, assigning the lowest value a rank of 1, the next lowest a rank of 2, and so on, up to the highest value receiving rank n, where n is the number of observations.[12] For each paired observation i, calculate the difference in ranks d_i = rank of x_i minus rank of y_i. Square these differences to obtain d_i^2, and sum them across all pairs to get \sum d_i^2. The coefficient is then given by the formula: \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} This formula provides an estimate of the population parameter based on the sample ranks.[13] Consider a sample dataset with n=5 paired observations: x = (10, 20, 30, 40, 50) and y = (15, 25, 35, 45, 55). The ranks for x are (1, 2, 3, 4, 5), and for y are also (1, 2, 3, 4, 5), yielding d_i = (0, 0, 0, 0, 0) and \sum d_i^2 = 0. Substituting into the formula gives \rho = 1 - \frac{6 \times 0}{5(25 - 1)} = 1. Now, alter y to (55, 45, 35, 25, 15) for a perfect negative association; ranks for y become (5, 4, 3, 2, 1), so d_i = (-4, -2, 0, 2, 4) and \sum d_i^2 = 40. Then \rho = 1 - \frac{6 \times 40}{5(25 - 1)} = 1 - 2 = -1. For a case with moderate positive association, take y = (15, 25, 55, 35, 45); ranks for y are (1, 2, 5, 3, 4), d_i = (0, 0, -2, 1, 1), \sum d_i^2 = 6, and \rho = 1 - \frac{6 \times 6}{5(25 - 1)} = 1 - 0.3 = 0.7.[14] The calculation assumes paired observations are independent, the data are at least ordinal (allowing meaningful ranking), and n \geq 2. 
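The worked examples above can be reproduced with a short script. This is a minimal stdlib-Python sketch with illustrative function names; it assumes untied data, so no tie correction is needed (ties are covered in a later section):

```python
def simple_ranks(values):
    """Rank values from 1 (lowest) to n (highest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def spearman_rho(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    rx, ry = simple_ranks(x), simple_ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

x = [10, 20, 30, 40, 50]
print(spearman_rho(x, [15, 25, 35, 45, 55]))   # perfect positive: 1.0
print(spearman_rho(x, [55, 45, 35, 25, 15]))   # perfect negative: -1.0
print(spearman_rho(x, [15, 25, 55, 35, 45]))   # moderate positive: 0.7
```

The three calls reproduce the three cases computed by hand above.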
The sample coefficient r_s serves as an estimator of the underlying population rank correlation.[15] In edge cases, perfect positive correlation occurs when ranks match exactly (\sum d_i^2 = 0), yielding \rho = 1; perfect negative correlation arises when one variable's ranks are the inversion of the other's (e.g., ascending versus descending order), yielding \rho = -1; and no association typically results in \rho \approx 0, where rank differences are randomly distributed.[13]
Properties and Interpretation
Interpretation
Spearman's rank correlation coefficient, denoted as ρ, quantifies the strength and direction of the monotonic relationship between two ranked variables, ranging from -1 to +1. A value of +1 indicates a perfect positive monotonic association, where higher ranks in one variable correspond exactly to higher ranks in the other; -1 signifies a perfect negative monotonic association, with higher ranks in one corresponding to lower ranks in the other; and 0 suggests no monotonic association.[9] The absolute value of ρ provides a guideline for the strength of the association, though these thresholds are not absolute and depend on the context of the data and field of study. According to one common classification, |ρ| = 0.00–0.19 is very weak, 0.20–0.39 weak, 0.40–0.59 moderate, 0.60–0.79 strong, and 0.80–1.00 very strong, reflecting the degree to which the rankings align monotonically.[16][9] Unlike Pearson's product-moment correlation coefficient (r), which measures linear relationships and assumes normally distributed continuous data, Spearman's ρ assesses monotonic relationships, including non-linear trends, by using ranks rather than raw values, making it more robust to outliers and non-normality. However, ρ is less sensitive to the precise magnitude of linear trends compared to Pearson's r, as it focuses solely on rank order preservation.[16][9] A key limitation of ρ is its insensitivity to the actual differences in magnitude between data points, capturing only the ordinal structure and potentially overlooking important scale variations. Additionally, in small samples, ρ can exhibit a negative bias, underestimating the true association, which may lead to conservative interpretations.[17][9]
Related Quantities
Spearman's rank correlation coefficient, denoted as ρ, was developed by Charles Spearman in 1904 as a method to measure the association between two variables based on their ranks, improving upon earlier approaches to rank-based correlations by providing a standardized measure akin to Pearson's product-moment correlation but applicable to non-normal data. In comparison to Pearson's product-moment correlation coefficient (r), which assumes a linear relationship and normality of data to assess the strength of linear associations between continuous variables, Spearman's ρ is a non-parametric alternative that evaluates monotonic relationships by applying Pearson's formula to ranked data, making it robust to outliers and non-linear but strictly increasing or decreasing patterns.[18][19] Another prominent rank correlation measure is Kendall's tau (τ), introduced by Maurice Kendall in 1938, which quantifies the ordinal association by counting the number of concordant and discordant pairs in the rankings, differing from Spearman's ρ in that it treats all pairwise disagreements equally regardless of the magnitude of rank differences, whereas ρ gives greater weight to larger discrepancies through its squared rank differences.[20][21] Other related measures include Goodman and Kruskal's gamma (γ), proposed in 1954 for ordinal data with ties, which normalizes the difference between concordant and discordant pairs by their total, offering a symmetric alternative to τ that is particularly useful in contingency tables; and Somers' D, developed by Robert H. Somers in 1962 as a directional, asymmetric measure of rank association where one variable predicts the other, adjusting for ties in a manner similar to τ but emphasizing predictive strength.
Spearman's ρ is preferred over Pearson's r when data violate normality assumptions or exhibit monotonic but non-linear relationships, and over Kendall's τ or gamma when emphasizing the extent of rank differences is important, such as in psychological or educational rankings where larger rank discrepancies indicate stronger departures from independence.[18][19]
Applications
General Applications
Spearman's rank correlation coefficient, denoted as ρ, is primarily used to assess the strength and direction of monotonic relationships between two variables, particularly when data are ordinal or ranked rather than interval-scaled. This non-parametric measure is especially valuable for analyzing rankings without assuming a linear relationship or normal distribution of the data. In fields dealing with subjective assessments or ordered categories, such as psychology, it evaluates associations between ranked variables like performance scores on intelligence tests, where Spearman's original formulation in 1904 demonstrated its utility for measuring intellectual associations.[10] A key advantage of Spearman's ρ lies in its robustness to outliers and non-normal distributions, as it transforms raw data into ranks, mitigating the influence of extreme values that could distort parametric correlations like Pearson's. This property makes it suitable for hypothesis testing of associations in real-world datasets where distributional assumptions are violated, allowing researchers to detect monotonic trends without requiring linearity. For instance, in economics, it is applied to preference orders, such as ranking consumer choices or investment options, to quantify the consistency of ordinal preferences across individuals or groups. Similarly, in biology, Spearman's ρ analyzes species abundance ranks, correlating factors like habitat size with biodiversity metrics to identify ecological patterns.[22][23][24] In practical contexts, Spearman's correlation facilitates interdisciplinary applications, including market research where it ranks consumer preferences against product attributes to uncover monotonic trends in survey data. In environmental science, it examines pollutant rank correlations, such as associations between air quality indices and emission sources, aiding in the identification of environmental risk factors. 
Within social sciences, it is commonly employed for attitude scales, measuring the monotonic alignment between ordinal responses on Likert-type items and behavioral indicators. More recently, in machine learning, Spearman's ρ supports feature ranking by evaluating monotonic dependencies between input variables and outcomes, enhancing model interpretability in non-linear settings.[25][26][27][28]
Specialized Uses
In genomics, Spearman's rank correlation coefficient is employed to rank gene expression levels and identify coexpressed genes within biological pathways, particularly when data exhibit non-normal distributions or nonlinear relationships. For instance, it has been shown to effectively detect associations in coexpression networks for pathway analysis, outperforming some parametric methods in small datasets by focusing on monotonic trends rather than assuming linearity.[29][30] In finance, the coefficient assesses correlations between ranked asset returns to uncover non-linear dependencies that Pearson's correlation might miss, aiding in portfolio risk management and dependency modeling under non-normal market conditions. In ecology, Spearman's rank correlation facilitates spatial analyses of ranked environmental variables and biodiversity metrics, including correlations between species richness gradients and habitat factors across scales. It is particularly valuable for assessing relationships in non-parametric settings, such as biodiversity-disease dynamics moderated by spatial extent, where it reveals monotonic associations without assuming normality.[31] Post-2020 applications include its use in AI ethics for detecting bias in ranked model outputs, such as evaluating alignment between large language model ratings and human judgments on sensitive topics like news source credibility or fairness in decision rankings. In this context, Spearman correlations quantify monotonic biases in ordinal predictions, helping identify disparities in model performance across demographic groups.[32][33] In climate science, recent implementations leverage Spearman's rank correlation for trend ranking in time series data, such as analyzing committed economic damages from emissions or groundwater level changes relative to rainfall patterns. 
It proves robust for non-normal climate variables, enabling detection of monotonic trends in high-variability datasets like salinity-discharge relationships under changing scenarios.[34][35] Despite these advantages, Spearman's rank correlation exhibits reduced statistical power in high-dimensional data, where multiple testing and sparsity can inflate false positives or dilute signal detection compared to dimension-reduced alternatives. In such contexts, partial rank correlations are often preferred to control for confounding variables, though they lack strong theoretical backing for the Spearman variant and may require adjustments for censored or ultrahigh-dimensional cases.[36]
Statistical Analysis
Determining Significance
To determine the statistical significance of Spearman's rank correlation coefficient ρ, a hypothesis test is typically conducted. The null hypothesis states that there is no monotonic association between the two variables in the population, formally H₀: ρ = 0.[37] The alternative hypothesis can be two-sided (H₁: ρ ≠ 0, indicating any monotonic association) or one-sided (H₁: ρ > 0 or ρ < 0, specifying the direction).[38] For large sample sizes (typically n ≥ 10), the test statistic is given by t = \rho \sqrt{\frac{n-2}{1 - \rho^2}}, which approximately follows a t-distribution with n-2 degrees of freedom under the null hypothesis.[39] This approximation allows for the computation of a p-value by comparing the observed t to the critical values of the t-distribution or using statistical software.[40] For example, at a significance level of α = 0.05 (two-tailed), the exact critical value for |ρ| when n = 10 is 0.648, meaning correlations exceeding this threshold in absolute value are considered significant.[41] For small sample sizes (n < 10), the t-approximation may be unreliable, so exact permutation tests are preferred. These involve generating the full distribution of possible ρ values by permuting one variable's ranks while holding the other fixed, then computing the proportion of permutations yielding a |ρ| at least as extreme as the observed value to obtain the p-value.[42] Such exact tests are computationally feasible for small n and do not rely on distributional assumptions.[43] In scenarios involving multiple pairwise Spearman's correlations, such as high-dimensional data analysis, adjustments for multiple testing are essential to control the family-wise error rate. 
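Each such comparison still needs a per-test p-value; the exact permutation procedure described above can be sketched as follows (a stdlib-Python illustration with hypothetical function names; enumerating all n! permutations is feasible only for small n):

```python
from itertools import permutations

def spearman_from_ranks(rx, ry):
    """Spearman's rho computed directly from two untied rank vectors."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def exact_permutation_pvalue(rx, ry):
    """Two-sided exact p-value: the proportion of rank permutations of y
    whose |rho| is at least as extreme as the observed |rho|."""
    observed = abs(spearman_from_ranks(rx, ry))
    count = total = 0
    for perm in permutations(ry):
        total += 1
        if abs(spearman_from_ranks(rx, list(perm))) >= observed - 1e-12:
            count += 1
    return count / total

p = exact_permutation_pvalue([1, 2, 3, 4, 5], [1, 2, 3, 4, 5])
print(p)   # 2/120: only the identity and the full reversal reach |rho| = 1
```

For perfectly concordant ranks with n = 5, only 2 of the 120 permutations attain |ρ| = 1, giving an exact two-sided p-value of 2/120 ≈ 0.017.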
The Bonferroni correction divides the desired α level by the number of tests performed; for instance, with m = 20 correlations and α = 0.05, each test uses α' = 0.0025.[44] This conservative approach reduces the risk of false positives across the set of comparisons.[45] For data that violate independence assumptions (non-i.i.d. cases, such as time series), modern approaches incorporate bootstrapping to assess significance. The bootstrap method resamples the data with replacement—often using block bootstrapping to preserve dependence structure—to estimate the distribution of ρ under the null, yielding empirical p-values that are robust to non-normality and serial correlation.[46] This technique has gained prominence in recent analyses of dependent data, providing reliable inference where parametric tests fail.[47]
Confidence Intervals
Confidence intervals for Spearman's rank correlation coefficient \rho provide a range of plausible values for the population correlation, quantifying the uncertainty in the sample estimate. One common parametric method to construct these intervals applies the Fisher z-transformation to the observed \hat{\rho}, defined as z = \frac{1}{2} \ln \left( \frac{1 + \hat{\rho}}{1 - \hat{\rho}} \right), which approximately follows a normal distribution with variance \frac{1}{n-3} for sample size n > 3.[6][48] The 95% confidence interval for z is then z \pm 1.96 / \sqrt{n-3}, and the interval for \hat{\rho} is obtained by back-transforming the bounds using the hyperbolic tangent function: \tanh(z_{\text{lower}}) to \tanh(z_{\text{upper}}).[49] This approach assumes large samples and continuous data without ties, providing symmetric intervals on the z-scale but asymmetric ones on the \rho-scale.[50] For smaller samples, ordinal data, or when ties are present, non-parametric bootstrap resampling offers robust alternatives to the Fisher method, as it does not rely on normality assumptions.[51] In the percentile bootstrap, resamples are drawn with replacement from the original data, \hat{\rho} is computed for each (typically 1,000–10,000 iterations), and the 2.5th and 97.5th percentiles of the bootstrap distribution form the 95% interval.[52] The bias-corrected accelerated (BCa) bootstrap improves upon the basic percentile method by adjusting for bias and skewness in the sampling distribution, yielding more accurate coverage especially with non-normal data or small n.[46] Simulation studies show that BCa intervals often outperform analytic methods for ordinal variables, achieving nominal coverage probabilities closer to 95%.[51] For example, with an observed \hat{\rho} = 0.6 and n = 20, the Fisher z-transformation yields an approximate 95% confidence interval of [0.21, 0.82].[49] Bootstrap methods, such as BCa, may produce slightly narrower or adjusted intervals depending on the 
data's distribution, but both approaches highlight the estimate's precision.[46] Wider confidence intervals indicate greater uncertainty in the estimate of \rho, often due to small sample sizes or high variability, while narrower intervals suggest more precise estimation.[53] These intervals are useful in power analysis for determining required sample sizes to achieve desired precision, such as a specific interval width at 95% confidence.[53] With modern computational resources, BCa bootstrap has become a preferred method for its robustness in contemporary statistical practice.[51]
Examples and Illustrations
Basic Example
To illustrate the computation of Spearman's rank correlation coefficient, consider a hypothetical dataset from an educational study involving six students. The data consist of paired observations: weekly study hours (in hours) and corresponding exam scores (out of 100). The raw data are as follows:
| Student | Study Hours | Exam Score |
|---|---|---|
| A | 1 | 10 |
| B | 2 | 30 |
| C | 3 | 20 |
| D | 4 | 50 |
| E | 5 | 60 |
| F | 6 | 40 |
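Completing the computation from the table: the study hours are already ordered, giving ranks (1, 2, 3, 4, 5, 6), while the exam scores rank as (1, 3, 2, 5, 6, 4); thus \sum d_i^2 = 0 + 1 + 1 + 1 + 1 + 4 = 8 and \rho = 1 - \frac{6 \times 8}{6(36 - 1)} = 1 - \frac{48}{210} \approx 0.77, a strong positive monotonic association. A short stdlib-Python sketch confirms the arithmetic:

```python
hours  = [1, 2, 3, 4, 5, 6]          # weekly study hours (students A-F)
scores = [10, 30, 20, 50, 60, 40]    # corresponding exam scores

def ranks(values):
    """Rank from 1 (lowest) to n (highest); these data have no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

n = len(hours)
d2 = sum((a - b) ** 2 for a, b in zip(ranks(hours), ranks(scores)))
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(rho, 3))   # 8 0.771
```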
Handling Ties in Calculation
In the presence of tied values within the dataset, Spearman's rank correlation coefficient requires adjustments to ensure accurate ranking and computation. Tied observations are assigned the average of the ranks they would otherwise occupy. For instance, if two values are tied and would receive ranks 5 and 6 in an untied scenario, both are given the average rank of 5.5. This approach preserves the overall sum of ranks, which equals n(n+1)/2 regardless of ties, and reflects the reduced variability introduced by the ties.[54] Because ties reduce the variance of the ranks, the simple difference formula no longer applies exactly; the coefficient is instead obtained by applying Pearson's formula to the average ranks, or equivalently by the tie-corrected formula: \rho = \frac{\frac{n^3 - n}{6} - \sum d_i^2 - T_x - T_y}{2 \sqrt{\left( \frac{n^3 - n}{12} - T_x \right) \left( \frac{n^3 - n}{12} - T_y \right)}} where T_x = \sum_g (m_g^3 - m_g)/12 is summed over all tied groups in the first variable (with m_g denoting the size of the g-th tied group), and T_y is defined analogously for the second variable. This correction compensates for the tie-induced reduction in rank variance, preventing overestimation of the correlation strength. The method originates from early nonparametric developments and is detailed in standard references on the topic.[55][54] Consider a dataset with n=7 observations where ties occur in two pairs for each variable, illustrating the computation:
| Observation | X values | Rank X | Y values | Rank Y | d_i | d_i^2 |
|---|---|---|---|---|---|---|
| 1 | 10 | 1 | 12 | 1 | 0 | 0 |
| 2 | 20 | 2.5 | 22 | 2.5 | 0 | 0 |
| 3 | 20 | 2.5 | 22 | 2.5 | 0 | 0 |
| 4 | 30 | 4 | 32 | 4 | 0 | 0 |
| 5 | 40 | 5 | 42 | 5 | 0 | 0 |
| 6 | 50 | 6.5 | 52 | 6.5 | 0 | 0 |
| 7 | 50 | 6.5 | 52 | 6.5 | 0 | 0 |
Here every pair of average ranks matches, so \sum d_i^2 = 0; with T_x = T_y = 2 \times (2^3 - 2)/12 = 1, the corrected formula gives \rho = \frac{\frac{336}{6} - 0 - 1 - 1}{2 \sqrt{(28 - 1)(28 - 1)}} = \frac{54}{54} = 1, confirming perfect monotonic agreement despite the ties.
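With ties, the computation reduces to Pearson's correlation on the average ranks; a stdlib-Python sketch (with illustrative helper names) applied to the table's data:

```python
def average_ranks(values):
    """Assign ranks 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0          # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson's product-moment correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

x = [10, 20, 20, 30, 40, 50, 50]
y = [12, 22, 22, 32, 42, 52, 52]
print(average_ranks(x))                              # [1.0, 2.5, 2.5, 4.0, 5.0, 6.5, 6.5]
print(pearson(average_ranks(x), average_ranks(y)))   # 1.0
```

Since the two rank sequences coincide, the coefficient is exactly 1 despite the ties.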
Extensions
Correspondence Analysis
Correspondence analysis (CA) is a multivariate technique that explores associations between rows and columns of a contingency table using chi-square distances to represent categorical data in a low-dimensional space. In the context of Spearman's rank correlation coefficient, grade correspondence analysis (GCA) extends CA by incorporating Spearman's ρ to measure and maximize rank-based associations, particularly for ordinal or ranked data where monotonic relationships are of interest. This integration allows for the detection of trends in residuals or directly within ranked contingency tables, providing a nonparametric alternative to classical CA when data exhibit ordinal structure.[57] The procedure for applying Spearman's ρ in GCA begins with ranking the entries of the contingency table to transform the data into ordinal form. Row and column scores are then derived iteratively to maximize the value of ρ between these ranked scores, often using multi-start optimization to identify principal trends and avoid local maxima. This ranking step preserves the monotonic order while applying ρ to quantify the strength of association, enabling visualization of overrepresentation patterns in a joint plot similar to standard CA but optimized for rank correlations.[57] Applications of this integration appear in sociology, such as examining category rankings in questionnaire data on employment barriers for disabled individuals to uncover underlying social trends.[57] This approach clarifies connections to modern multidimensional scaling techniques, which embed rank-based distances for non-Euclidean visualizations, with the GradeStat software providing an implementation of GCA.[58]
Stream Approximation
The traditional computation of Spearman's rank correlation coefficient requires storing the entire dataset to assign ranks and calculate the sum of squared rank differences, which is infeasible for massive or unbounded data streams where memory and processing time are constrained.[59] To address this, streaming approximations maintain compact summaries of the data distribution, enabling incremental updates with constant time and space complexity per observation while providing probabilistic guarantees on accuracy. One prominent method employs a count matrix to track the joint frequency distribution of approximate ranks for paired observations in the stream. As each new pair (x_t, y_t) arrives, the algorithm discretizes the values into rank bins (e.g., via quantiles or uniform partitioning) and increments the corresponding entry in a low-dimensional matrix, allowing estimation of \sum d_i^2 (where d_i is the rank difference) by aggregating over the matrix without full rank recomputation. This approach achieves O(1) update time and space proportional to the number of bins, with approximation error bounded by O(1/\sqrt{n}) under mild assumptions on data distribution, where n is the stream length.[59] Another technique uses Hermite series expansions to sequentially estimate the bivariate probability density function underlying the ranks, from which Spearman's \rho is derived via integration. The algorithm maintains coefficients of the Hermite polynomials updated incrementally for each observation, supporting both stationary streams (with mean absolute error O(n^{-1/2})) and non-stationary ones via exponential weighting to handle concept drift. For the latter, a forgetting factor \lambda \in (0,1) controls recency, yielding standard error O(\lambda^{1/2}). 
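A toy version of the count-matrix idea can be sketched as follows. This is an illustrative stdlib-Python sketch, not the cited algorithm: it assumes the value range of both variables is known in advance so that fixed uniform bins can be used, whereas a production implementation would maintain quantile summaries instead:

```python
class StreamingSpearman:
    """Approximate Spearman's rho over a data stream via a fixed bin grid.

    Assumes the value range [lo, hi] of both variables is known up front;
    each update is O(1), and memory is O(bins^2) regardless of stream length.
    """

    def __init__(self, bins=32, lo=0.0, hi=1.0):
        self.bins, self.lo, self.hi = bins, lo, hi
        self.counts = [[0] * bins for _ in range(bins)]  # joint bin frequencies
        self.n = 0

    def _bin(self, v):
        b = int((v - self.lo) / (self.hi - self.lo) * self.bins)
        return min(max(b, 0), self.bins - 1)             # clip boundary values

    def update(self, x, y):
        self.counts[self._bin(x)][self._bin(y)] += 1
        self.n += 1

    def estimate(self):
        """Pearson's correlation of the bin mid-ranks, weighted by counts."""
        if self.n < 2:
            return 0.0
        fx = [sum(row) for row in self.counts]           # marginal counts, x
        fy = [sum(col) for col in zip(*self.counts)]     # marginal counts, y

        def midranks(freqs):
            # mid-rank of a bin = count strictly below + half its own count
            out, below = [], 0
            for c in freqs:
                out.append(below + c / 2.0)
                below += c
            return out

        rx, ry = midranks(fx), midranks(fy)
        mx = sum(f * r for f, r in zip(fx, rx)) / self.n
        my = sum(f * r for f, r in zip(fy, ry)) / self.n
        cov = sum(self.counts[i][j] * (rx[i] - mx) * (ry[j] - my)
                  for i in range(self.bins) for j in range(self.bins)) / self.n
        vx = sum(f * (r - mx) ** 2 for f, r in zip(fx, rx)) / self.n
        vy = sum(f * (r - my) ** 2 for f, r in zip(fy, ry)) / self.n
        return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


s = StreamingSpearman(bins=16, lo=0.0, hi=1.0)
for i in range(1000):
    s.update(i / 1000.0, i / 1000.0)   # perfectly concordant stream
print(round(s.estimate(), 4))          # close to 1.0 despite the binning
```

The approximation error here comes entirely from the binning: pairs that fall into the same bin are treated as tied, so more bins mean a closer match to the exact coefficient at the cost of quadratic memory in the bin count.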
In outline, a generic streaming algorithm for \rho involves: (1) maintaining summaries of marginal rank distributions (e.g., order statistics or density estimates); (2) updating the cross-term \sum (rank_x - rank_y)^2 via incremental rank approximations or pairwise counts; and (3) querying the current \rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} from the summaries at any time. These methods are particularly suited to use cases like sensor networks, where real-time correlation detection in IoT streams aids anomaly monitoring, and online analytics in finance or machine learning for feature selection without batch recomputation.[59]
Implementation
Software Implementations
Spearman's rank correlation coefficient is implemented in numerous statistical software packages and programming libraries, facilitating its computation in research, data analysis, and applied settings. These implementations typically handle ranking of data, tie adjustments, and associated statistical tests such as p-values and confidence intervals, making the coefficient accessible for both small-scale and large-scale analyses.[60][61] In the R programming language, the cor.test() function from the base stats package computes Spearman's ρ between two paired samples when specified with the method = "spearman" argument. This function not only returns the correlation coefficient but also provides a p-value for testing the null hypothesis of no monotonic association and, optionally, a confidence interval for the coefficient via the conf.level parameter. For example, the command cor.test(x, y, method = "spearman") ranks the input vectors x and y, applies the Spearman formula with tie corrections, and outputs the statistic alongside inferential details suitable for hypothesis testing.[60]
Python's SciPy library offers the scipy.stats.spearmanr() function in its stats module, which calculates the Spearman rank correlation coefficient and p-value for two arrays or sequences. This implementation automatically handles ties by assigning average ranks and supports handling of missing data through the nan_policy parameter (e.g., 'omit' to ignore NaNs) and specification of one- or two-sided tests via the alternative parameter. A typical usage is from scipy.stats import spearmanr; rho, p_value = spearmanr(x, y), yielding the coefficient rho and its significance, with the function designed for monotonic relationship assessment in datasets ranging from small samples to larger arrays.[61]
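For instance, reusing the five-observation dataset from the Calculation section (this sketch assumes SciPy is installed):

```python
from scipy.stats import spearmanr

x = [10, 20, 30, 40, 50]
y = [15, 25, 55, 35, 45]

rho, p_value = spearmanr(x, y)
print(round(rho, 3))   # 0.7, matching the hand calculation
print(p_value)         # two-sided p-value from the t-approximation
```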
MATLAB's Statistics and Machine Learning Toolbox includes the corr() function, which computes Spearman's rank correlation by specifying the 'Type','Spearman' option on input matrices or vectors. This method ranks the data internally and applies the correlation formula, supporting multiple variables for pairwise computations. P-values can be obtained by requesting additional outputs, such as [rho, pval] = corr(X, 'Type', 'Spearman'), and the 'Rows','pairwise' option manages incomplete observations by using available pairs. For instance, rho = corr(X, 'Type', 'Spearman') produces a correlation matrix based on ranks, enabling efficient analysis in engineering and scientific workflows.[62]
Microsoft Excel lacks a built-in function for Spearman's ρ, but it can be computed using add-ins such as the Real Statistics Resource Pack, which provides the RSPEARMAN() function for direct calculation on ranked data ranges. This add-in handles ties via average ranking and returns the coefficient, with manual p-value computation possible through integration with Excel's distribution functions; alternatively, users rank data with RANK.AVG() and apply CORREL() to the ranks for the core statistic. Such extensions make Spearman's test viable for spreadsheet-based analyses in business and education.[63]
In SAS, the PROC CORR procedure calculates Spearman's rank-order correlation using the SPEARMAN option, which ranks non-missing values and substitutes them into the Pearson formula while adjusting for ties. This yields the coefficient, along with p-values and confidence limits when requested via the ALPHA statement, as in PROC CORR DATA=dataset SPEARMAN; VAR x y; RUN;, supporting large datasets in enterprise environments.[64]
Julia's StatsBase.jl package implements Spearman's correlation through the corspearman(x, y) function, which performs ranking (including dense or ordinal options) and computes the coefficient with tie handling. This open-source tool integrates with Julia's ecosystem for high-performance computing, returning the ρ value suitable for scripting and integration with other statistical functions.[65]
For distributed computing environments, Apache Spark's MLlib library provides Spearman's correlation via the Statistics.corr() method with the "spearman" correlation type on RDDs or DataFrames, enabling scalable computation across clusters for big data applications. This implementation ranks distributed data partitions and aggregates results, as in Statistics.corr(rddX, rddY, "spearman"), making it relevant for processing massive datasets in modern analytics pipelines.