
Intraclass correlation

The intraclass correlation coefficient (ICC) is a statistical index that quantifies the degree of similarity or agreement among observations within the same group or class, relative to the variability between groups, typically ranging from 0 (no similarity) to 1 (perfect similarity). Introduced by Ronald A. Fisher in 1921 as an extension of the Pearson product-moment correlation to handle clustered or paired observations, such as familial resemblances in physical measurements, the ICC addresses scenarios where traditional correlations fail to account for within-group dependencies. In practice, the ICC is widely applied in fields such as psychology, medicine, and epidemiology to evaluate reliability in repeated measures, inter-rater assessments, and cluster-randomized trials, where it helps determine how much variation in outcomes is attributable to true differences between subjects versus random error within them. For instance, in clinical research it measures the consistency of diagnostic ratings across multiple observers, while in trial design it informs sample size adjustments by estimating clustering effects, as captured by the design-effect formula DEFF = 1 + (n-1)ICC, where n is the cluster size. Several forms of the ICC exist, depending on the study design and assumptions, including two-way random effects models for absolute agreement (ICC(A,1)), two-way mixed effects models for consistency (ICC(C,1)), and one-way random effects models for designs without a crossed rater factor. These are typically estimated using analysis of variance (ANOVA), with the general form for a one-way model given by ICC(1) = (MS_B - MS_W) / (MS_B + (k-1)MS_W), where MS_B is the mean square between subjects, MS_W is the mean square within subjects, and k is the number of measurements per subject. Interpretations vary by context, but common guidelines classify values below 0.5 as poor reliability, 0.5–0.75 as moderate, 0.75–0.9 as good, and above 0.9 as excellent, though confidence intervals are essential for assessing precision.

Historical Development

Early Definition

The intraclass correlation coefficient (ICC) was first introduced by Ronald A. Fisher in 1921, in his paper "On the 'Probable Error' of a Coefficient of Correlation Deduced from a Small Sample," published in Metron, in the context of agricultural experiments at Rothamsted Experimental Station and studies of familial resemblances, such as correlations among siblings for physical traits. Fisher developed the concept as a way to quantify the similarity of observations within the same class or group, extending Pearson's product-moment correlation to clustered data through the framework of analysis of variance (ANOVA). This approach addressed the need to partition total variance into components attributable to between-group and within-group sources, particularly relevant for randomized block designs in agriculture where plots or litters represent classes. In his seminal 1925 textbook, Statistical Methods for Research Workers, Fisher formalized the estimator for the ICC under a balanced one-way design, derived directly from the ANOVA table: \hat{\rho} = \frac{MS_B - MS_W}{MS_B + (k-1) MS_W}, where MS_B denotes the mean square between groups, MS_W the mean square within groups, and k the number of observations per group. This arises from the expected mean squares in the model: E(MS_B) = \sigma^2_W + k \sigma^2_B and E(MS_W) = \sigma^2_W, where \sigma^2_B is the between-group variance component and \sigma^2_W the within-group variance. Solving for \sigma^2_B yields (MS_B - MS_W)/k, and substituting into the ICC \rho = \sigma^2_B / (\sigma^2_B + \sigma^2_W) produces the estimator, ensuring unbiased estimation of the variance components in balanced designs. The primary purpose of this early formulation was to estimate the proportion of total variance explained by between-group differences, serving as a key metric for assessing group homogeneity in experimental data. Fisher emphasized its utility in hypothesis testing via the F-statistic (MS_B / MS_W) and in power calculations for experimental design, making it foundational for variance component analysis. To illustrate, consider hypothetical data from Fisher's era on wheat yields (in grams per plant) across four experimental plots, each with five replicate plants: Plot 1: 20, 22, 19, 21, 20; Plot 2: 25, 27, 24, 26, 25; Plot 3: 18, 20, 17, 19, 18; Plot 4: 23, 25, 22, 24, 23. The ANOVA yields MS_B \approx 48.3 and MS_W = 1.3 (with k=5). Thus, \hat{\rho} = (48.3 - 1.3) / (48.3 + 4 \times 1.3) \approx 47 / 53.5 \approx 0.88, suggesting that approximately 88% of the total variance in yields arises from differences between plots, indicative of substantial plot-to-plot variability in growing conditions or treatment effects.
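
To make the arithmetic concrete, the following minimal sketch (Python with NumPy; the data are simply the hypothetical yields above) reproduces the ANOVA mean squares and Fisher's estimator:

```python
import numpy as np

# Hypothetical wheat-yield data from the worked example: 4 plots, k = 5 plants each.
plots = np.array([
    [20, 22, 19, 21, 20],
    [25, 27, 24, 26, 25],
    [18, 20, 17, 19, 18],
    [23, 25, 22, 24, 23],
], dtype=float)

n, k = plots.shape                      # n = 4 groups, k = 5 replicates
grand_mean = plots.mean()
group_means = plots.mean(axis=1)

# One-way ANOVA decomposition of the total sum of squares.
ss_between = k * np.sum((group_means - grand_mean) ** 2)
ss_within = np.sum((plots - group_means[:, None]) ** 2)
ms_b = ss_between / (n - 1)             # mean square between groups
ms_w = ss_within / (n * (k - 1))        # mean square within groups

# Fisher's ANOVA estimator of the intraclass correlation.
icc = (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)
print(f"MS_B = {ms_b:.2f}, MS_W = {ms_w:.2f}, ICC = {icc:.3f}")
# Prints MS_B ≈ 48.33, MS_W = 1.30, ICC ≈ 0.879, matching the text.
```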

Modern Definitions

In the decades following its introduction, the intraclass correlation coefficient (ICC) gained widespread adoption in psychology and medicine for assessing reliability in measurement and rating studies, extending Ronald Fisher's early variance-based framework to practical applications in inter-rater agreement and test-retest scenarios. A key simplification emerged in the early 1950s with Robert L. Ebel's work, which proposed estimating the ICC as the ratio of between-subject variance to total variance using analysis of variance (ANOVA) components, making computation more accessible for researchers. This approach was refined in 1979 by Patrick E. Shrout and Joseph L. Fleiss, who outlined six distinct forms of the ICC to accommodate various study designs, such as single versus multiple raters and fixed versus random effects, thereby standardizing its use in reliability assessments. Their framework introduced nomenclature like ICC(1,1) for single-rater absolute agreement in a one-way random effects model, facilitating precise selection based on research objectives. The modern estimator, commonly expressed as \text{ICC} = \frac{\text{MS}_\text{between} - \text{MS}_\text{within}}{\text{MS}_\text{between} + (k-1) \text{MS}_\text{within}}, where \text{MS}_\text{between} is the mean square between groups, \text{MS}_\text{within} is the mean square within groups, and k is the number of raters, prioritizes ease of calculation via ANOVA. Contemporary guidelines, such as those by Terry K. Koo and Mae Y. Li in 2016, build on these developments by recommending specific ICC forms for clinical reliability studies and emphasizing reporting practices that ensure interpretability and reproducibility.

Mathematical Foundations

Relation to Pearson's Correlation

The intraclass correlation coefficient (ICC) generalizes Pearson's product-moment correlation coefficient (r) by extending its application to exchangeable observations within defined classes, such as repeated measurements on the same subjects or ratings by multiple raters, rather than treating variables as distinct. This conceptual link positions the ICC as a measure of similarity or agreement within groups, whereas Pearson's r quantifies linear association between two separate variables. In essence, the ICC captures the proportion of total variance attributable to differences between classes, providing a framework for reliability assessment in clustered data. A direct mathematical equivalence exists in the special case of two observations per class, such as two raters evaluating multiple subjects. Here, the ICC from a two-way random effects model assessing absolute agreement for a single rater, denoted ICC(2,1), equals Pearson's r computed between the two sets of ratings, assuming equal variances and random rater effects. This equivalence arises because both coefficients reduce to the formula \rho = \frac{\text{MS}_B - \text{MS}_W}{\text{MS}_B + \text{MS}_W}, where \text{MS}_B is the mean square between classes and \text{MS}_W is the mean square within classes from a one-way ANOVA, which matches the covariance-based structure of Pearson's r for paired data. However, the ICC extends this to multiple observations per class (k > 2), generalizing the denominator to \text{MS}_B + (k-1)\text{MS}_W to account for increased within-class variability, and it can incorporate adjustments for systematic rater biases absent in Pearson's r. The core differences stem from their foundational assumptions: Pearson's r is an interclass correlation suited to bivariate data with ordered variables (e.g., predictor and outcome), emphasizing covariance relative to individual variances. In contrast, the ICC is strictly intraclass, assuming observations within a class are interchangeable and focusing on variance partitioning to evaluate consistency or agreement. One interpretive bridge is that the ICC equals the expected Pearson's r between any two randomly drawn observations from the same class, reflecting within-class dependence. Equivalently, it can be derived as Pearson's r applied to the deviations of observations from their group means, which isolates within-class covariation after removing between-class effects: if Y_{ij} is the j-th observation in class i, then the correlation among (Y_{ij} - \bar{Y}_i) across paired selections yields the within-component structure underlying the ICC. Ronald A. Fisher originally developed the ICC in 1921, motivated by the limitations of Pearson's r for clustered or paired data where no natural distinction exists between independent and dependent variables, such as measurements on siblings or repeated assessments of the same entity. Fisher's innovation addressed the probable error estimation for such "intraclass" correlations, laying the groundwork for its use in biological and experimental contexts with grouped observations.
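
This relationship can be checked numerically. The sketch below simulates paired measurements with a shared subject effect (all parameters hypothetical) and shows Pearson's r and the one-way ICC with k = 2 converging on the same population value:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate n subjects each measured twice: a shared subject effect plus noise.
n = 500
subject_effect = rng.normal(0.0, 1.0, size=n)        # between-subject sd = 1
ratings = subject_effect[:, None] + rng.normal(0.0, 0.7, size=(n, 2))

# Pearson's r between the two measurement columns.
pearson_r = np.corrcoef(ratings[:, 0], ratings[:, 1])[0, 1]

# One-way ANOVA ICC with k = 2 observations per subject.
k = ratings.shape[1]
grand = ratings.mean()
subj_means = ratings.mean(axis=1)
ms_b = k * np.sum((subj_means - grand) ** 2) / (n - 1)
ms_w = np.sum((ratings - subj_means[:, None]) ** 2) / (n * (k - 1))
icc = (ms_b - ms_w) / (ms_b + ms_w)                  # denominator uses k - 1 = 1

print(f"Pearson r = {pearson_r:.3f}, ICC = {icc:.3f}")
# Both estimates approach the true value 1 / (1 + 0.7**2) ≈ 0.671.
```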

Variance Components and Models

The intraclass correlation coefficient (ICC) is derived from an analysis of variance (ANOVA) framework that partitions the total observed variance into between-group and within-group components. This quantifies the extent to which variability among observations is attributable to systematic differences between groups, such as subjects, clusters, or raters, relative to random variation within those groups. Formally, the ICC is expressed as the ratio \text{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}, where \sigma_b^2 represents the between-group variance and \sigma_w^2 the within-group variance; thus, the ICC measures the proportion of total variance explained by group membership. This formulation, rooted in early statistical work on variance partitioning, provides a measure of homogeneity or clustering in data, with values closer to 1 indicating strong group-level effects and values near 0 suggesting near-independence of observations within groups. The underlying model assumes a random-effects specification, typically represented as Y_{ij} = \mu + a_i + e_{ij}, where Y_{ij} is the j-th observation within the i-th group, \mu is the grand mean, a_i \sim N(0, \sigma_b^2) captures the random group effect, and e_{ij} \sim N(0, \sigma_w^2) denotes the residual error term, with a_i and e_{ij} uncorrelated across and within groups. Key assumptions include normality of the observations, independence of errors within groups to ensure no residual clustering beyond the modeled group effect, and treatment of groups as random samples from a larger population. These assumptions facilitate the use of ANOVA mean squares to estimate variance components in balanced designs, where each group has an equal number (k) of observations; in unbalanced designs, where group sizes vary, alternative estimation approaches such as restricted maximum likelihood (REML) are required to obtain unbiased variance component estimates and avoid bias in the ANOVA-based F-statistic. The ICC exhibits several important properties within this framework. It theoretically ranges from -\frac{1}{k-1} to 1, where k is the number of replicates or raters per group; the lower bound reflects scenarios in which observations within a group are more dissimilar than observations drawn from different groups, though negative estimates often arise from sampling variability when the true ICC is near zero. In large samples, the ANOVA-based estimator of the ICC is consistent, converging to the true value under the stated assumptions, which supports its reliability in assessing group-level consistency across diverse applications such as reliability studies and clustered sampling.
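
As an illustration of how the ANOVA mean squares recover these variance components, the following sketch simulates data from the model Y_{ij} = \mu + a_i + e_{ij} with known \sigma_b^2 and \sigma_w^2 (all values hypothetical) and compares the estimated ICC with the true ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate Y_ij = mu + a_i + e_ij with known variance components.
n_groups, k = 200, 6
sigma_b, sigma_w = 2.0, 1.5                      # true sd components
a = rng.normal(0.0, sigma_b, size=n_groups)      # random group effects
y = 10.0 + a[:, None] + rng.normal(0.0, sigma_w, size=(n_groups, k))

# ANOVA method-of-moments estimates of the variance components.
group_means = y.mean(axis=1)
ms_b = k * np.sum((group_means - y.mean()) ** 2) / (n_groups - 1)
ms_w = np.sum((y - group_means[:, None]) ** 2) / (n_groups * (k - 1))
var_b_hat = (ms_b - ms_w) / k                    # solves E(MS_B) = sigma_w^2 + k*sigma_b^2
var_w_hat = ms_w

icc_hat = var_b_hat / (var_b_hat + var_w_hat)
icc_true = sigma_b**2 / (sigma_b**2 + sigma_w**2)
print(f"estimated ICC = {icc_hat:.3f}, true ICC = {icc_true:.3f}")  # true ≈ 0.640
```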

Types of ICC

One-Way Random Effects Model

The one-way random effects model represents the foundational approach in intraclass correlation analysis for designs involving a single random factor, such as subjects or groups drawn randomly from a larger population. In this framework, all sources of variation beyond the random groups are treated as error, making it suitable for scenarios where measurements within groups are exchangeable and there is no fixed effect to consider. The model posits that the observed scores Y_{ij} for the j-th measurement on the i-th group follow Y_{ij} = \mu + \alpha_i + e_{ij}, where \mu is the overall mean, \alpha_i \sim N(0, \sigma_b^2) captures between-group variance, and e_{ij} \sim N(0, \sigma_w^2) represents within-group error variance, with groups and measurements assumed independent. This setup yields the intraclass correlation as the ratio \rho = \sigma_b^2 / (\sigma_b^2 + \sigma_w^2), emphasizing the proportion of total variance due to the random grouping factor. For estimation, the model typically relies on analysis of variance (ANOVA) mean squares under balanced designs, where each group has the same number k of measurements. The ICC for a single measure, ICC(1,1), is calculated as \text{ICC}(1,1) = \frac{\text{MS}_B - \text{MS}_W}{\text{MS}_B + (k-1) \text{MS}_W}, where \text{MS}_B is the between-groups mean square and \text{MS}_W is the within-groups mean square. This estimator reflects the reliability of individual ratings when raters or measurements are random and not fixed across groups. For the average of k measures per group, denoted ICC(1,k), the formula adjusts to \text{ICC}(1,k) = \frac{\text{MS}_B - \text{MS}_W}{\text{MS}_B}, providing a higher reliability estimate by averaging out within-group error. These forms are particularly relevant in single-facet designs, such as evaluating consistency across random samples of items or subjects without additional structured factors. Confidence intervals for these ICC estimates are commonly derived using F-test statistics from the ANOVA table, leveraging the ratio of mean squares F = \text{MS}_B / \text{MS}_W, which follows an F-distribution under the null hypothesis of no between-group variance. An approximate lower confidence limit can be obtained by substituting an adjusted F-value into a transformed form, such as L = \frac{F' - 1}{F' + (k-1)}, where F' is the observed F divided by the critical F-value at the desired confidence level and degrees of freedom (e.g., df_B = n-1, df_W = n(k-1), with n groups); upper bounds follow similarly by inverting the F-distribution. This method provides inferential bounds on the population ICC, allowing assessment of precision in reliability estimates. The model assumes normality of random effects and errors, independence within and between groups, and equal variances, which support the validity of ANOVA-based inference in balanced settings. In cases of unbalanced data, where k varies across groups, traditional ANOVA estimators can become biased; instead, restricted maximum likelihood (REML) is employed to yield unbiased estimates of the variance components \sigma_b^2 and \sigma_w^2, from which the ICC is derived as their ratio. REML accounts for the degrees of freedom lost to estimating fixed effects in the likelihood, making it robust for irregular designs while maintaining computational feasibility through iterative methods. This adjustment ensures reliable computation without requiring balanced replication. Use cases for the one-way random effects model are confined to situations with a solitary random factor, such as test-retest reliability assessments involving repeated measures on the same subjects over time (treating time points as random without fixed raters) or inter-item consistency in psychological scales where items form random groups.
It is ideal when the goal is to partition variance solely between these random units and residual error, avoiding the introduction of fixed effects that would necessitate more complex models.
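
A compact implementation of these formulas, including the F-based confidence interval for the single-measure ICC, might look like the following (a sketch in Python with NumPy and SciPy; the helper name icc_oneway and the simulated data are illustrative only):

```python
import numpy as np
from scipy.stats import f as f_dist

def icc_oneway(y, alpha=0.05):
    """ICC(1,1) and ICC(1,k) with an F-based confidence interval for the
    single-measure ICC in a balanced one-way random effects design.
    `y` is an (n groups) x (k measurements) array."""
    n, k = y.shape
    group_means = y.mean(axis=1)
    ms_b = k * np.sum((group_means - y.mean()) ** 2) / (n - 1)
    ms_w = np.sum((y - group_means[:, None]) ** 2) / (n * (k - 1))

    icc1 = (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)   # single measure
    icc_k = (ms_b - ms_w) / ms_b                     # average of k measures

    # Invert the F statistic: F' = F_obs / F_crit plugged into (F' - 1)/(F' + k - 1).
    f_obs = ms_b / ms_w
    df1, df2 = n - 1, n * (k - 1)
    f_lo = f_obs / f_dist.ppf(1 - alpha / 2, df1, df2)
    f_hi = f_obs / f_dist.ppf(alpha / 2, df1, df2)
    ci = ((f_lo - 1) / (f_lo + k - 1), (f_hi - 1) / (f_hi + k - 1))
    return icc1, icc_k, ci

# Simulated example: 30 subjects, 4 measurements each, true ICC = 0.5.
rng = np.random.default_rng(1)
subjects = rng.normal(0, 1, size=(30, 1))
data = subjects + rng.normal(0, 1, size=(30, 4))
print(icc_oneway(data))
```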

Two-Way and Mixed Effects Models

In two-way models for intraclass correlation, raters are incorporated as a second factor alongside subjects, enabling the evaluation of rater variability in designs where every subject is rated by multiple raters, typically in a fully crossed setup. These models distinguish between random and fixed effects for raters to address different inferential goals in reliability assessments. The two-way random effects model treats both subjects and raters as random factors, suitable when the selected raters represent a sample from a broader population of potential raters, allowing generalization of reliability estimates beyond the specific raters used. In this framework, ICC(2,1) quantifies absolute agreement for a single rating, capturing the proportion of total variance attributable to true subject differences while accounting for both rater and residual variance. The estimator is given by \text{ICC}(2,1) = \frac{\text{MS}_b - \text{MS}_w}{\text{MS}_b + (k-1) \text{MS}_w + \frac{k (\text{MS}_r - \text{MS}_w)}{n}}, where \text{MS}_b is the between-subjects mean square, \text{MS}_r is the raters mean square, \text{MS}_w is the residual mean square, k is the number of raters, and n is the number of subjects; the adjustment for rater variance (\text{MS}_r) ensures the estimate reflects systematic differences among raters as part of the error structure. The two-way mixed effects model, in contrast, treats raters as fixed effects and subjects as random, appropriate for studies where the particular raters are of specific interest and results are not intended to generalize to other raters, such as evaluating agreement among a fixed panel of experts. Here, ICC(3,1) measures consistency by focusing on the similarity in relative rankings of subjects across raters, deliberately excluding rater main effects (e.g., overall leniency or severity) from the variance components. The formula simplifies to \text{ICC}(3,1) = \frac{\text{MS}_b - \text{MS}_w}{\text{MS}_b + (k-1) \text{MS}_w}, omitting \text{MS}_r since fixed raters do not contribute random variance to the denominator, thus emphasizing agreement after adjusting for rater-specific offsets. These estimators are derived from two-way ANOVA mean squares under balanced designs, where the interaction term serves as the residual. For unbalanced data or more flexible specifications, such as varying numbers of ratings per subject-rater pair, two-way and mixed effects ICC models are generalized through linear mixed models (LMMs), which partition variance into fixed rater effects and random subject effects using iterative estimation. Variance components in LMMs are typically obtained via restricted maximum likelihood (REML) for unbiased estimates in random effects settings or maximum likelihood (ML) for model comparisons, accommodating missing observations and enabling hypothesis tests on rater effects. Model selection hinges on study design: employ the two-way random model for crossed raters with generalizability needs, as in broad inter-rater studies; choose the mixed model for fixed, study-specific raters, such as in clinical assessments with designated evaluators.
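
Under a balanced, fully crossed design, both estimators can be computed directly from the two-way ANOVA mean squares. The sketch below (the helper name icc_twoway and the example scores are hypothetical) illustrates how systematic rater differences lower ICC(2,1) but not ICC(3,1):

```python
import numpy as np

def icc_twoway(y):
    """ICC(2,1) (absolute agreement, raters random) and ICC(3,1)
    (consistency, raters fixed) from a fully crossed n x k ratings matrix."""
    n, k = y.shape
    grand = y.mean()
    ss_rows = k * np.sum((y.mean(axis=1) - grand) ** 2)   # subjects
    ss_cols = n * np.sum((y.mean(axis=0) - grand) ** 2)   # raters
    ss_err = np.sum((y - grand) ** 2) - ss_rows - ss_cols
    ms_b = ss_rows / (n - 1)
    ms_r = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))

    icc21 = (ms_b - ms_e) / (ms_b + (k - 1) * ms_e + k * (ms_r - ms_e) / n)
    icc31 = (ms_b - ms_e) / (ms_b + (k - 1) * ms_e)
    return icc21, icc31

# Three raters score six subjects; rater 3 is systematically more lenient.
scores = np.array([
    [9, 2, 5, 8, 2, 8],
    [10, 4, 5, 9, 3, 7],
    [11, 4, 7, 10, 5, 10],
], dtype=float).T                                         # shape (6 subjects, 3 raters)
icc21, icc31 = icc_twoway(scores)
print(f"ICC(2,1) = {icc21:.3f}, ICC(3,1) = {icc31:.3f}")  # consistency exceeds agreement
```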

Applications

Reliability Assessment

The intraclass correlation coefficient (ICC) serves as a fundamental index for evaluating the reliability of measurement instruments in observational settings, quantifying the proportion of total variance attributable to between-subject differences relative to within-subject variability. This approach is particularly valuable for assessing consistency in ratings or measurements where systematic errors from raters or time can influence outcomes. Derived from variance component models, ICC estimates help determine whether a tool produces consistent and reproducible results across repeated applications. In inter-rater reliability assessments, the ICC measures the agreement among multiple observers evaluating the same subjects, such as physicians diagnosing medical conditions based on symptom presentations. ICC(2,1), under a two-way random effects model, evaluates absolute agreement by treating raters as a random sample and accounting for both rater and residual variance, making it suitable when the goal is to ensure ratings are interchangeable. In contrast, ICC(3,1), using a two-way mixed effects model, focuses on relative consistency by treating raters as fixed effects, which is appropriate when specific raters are of interest and systematic rater differences are not penalized. These forms enable precise evaluation of observer concordance in fields like clinical diagnostics. For test-retest reliability, the ICC assesses the stability of measurements taken on the same subjects at different time points under similar conditions, while intra-rater reliability examines consistency within a single rater across multiple trials on the same subjects. Both scenarios commonly employ the single-measure ICC from a one-way random effects model for single administrations, which partitions variance into between-subject and residual components (including time or trial effects as random). This model is ideal for tools like psychological scales or clinical assessments where repeated measures stand in for rater variability. Interpretation guidelines for ICC values, popularized in psychology during the 1980s, provide benchmarks for reliability strength; for instance, values below 0.50 indicate poor reliability, 0.50–0.75 moderate, 0.75–0.90 good, and above 0.90 excellent, with confidence intervals recommended to account for estimation uncertainty. In clinical trials, such as those validating pain scales for patient self-reports or nurse assessments, the ICC is routinely applied to confirm tool dependability; representative studies report inter-rater ICC values around 0.80 for pain intensity ratings, supporting their use in outcome measurement. Compared to alternatives like pairwise Pearson correlations, which evaluate agreement between only two raters at a time and necessitate cumbersome averaging for groups, the ICC offers superior efficiency by simultaneously incorporating all raters into a single estimate, better capturing overall agreement while adjusting for both random and systematic error.
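
The contrast with averaged pairwise Pearson correlations can be demonstrated directly. In the hypothetical simulation below, raters agree on the rank ordering of subjects but differ by systematic offsets, so the mean pairwise r stays high while the absolute-agreement ICC(2,1) is penalized:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
n, k = 50, 4
truth = rng.normal(0, 1, size=(n, 1))
bias = np.array([0.0, 0.5, 1.0, 1.5])              # systematic rater offsets
ratings = truth + bias + rng.normal(0, 0.5, size=(n, k))

# Averaged pairwise Pearson correlations ignore the rater offsets entirely.
mean_r = np.mean([np.corrcoef(ratings[:, i], ratings[:, j])[0, 1]
                  for i, j in combinations(range(k), 2)])

# Absolute-agreement ICC(2,1) counts the offsets as error (two-way ANOVA).
grand = ratings.mean()
ms_b = k * np.sum((ratings.mean(axis=1) - grand) ** 2) / (n - 1)
ms_r = n * np.sum((ratings.mean(axis=0) - grand) ** 2) / (k - 1)
ms_e = (np.sum((ratings - grand) ** 2)
        - (n - 1) * ms_b - (k - 1) * ms_r) / ((n - 1) * (k - 1))
icc21 = (ms_b - ms_e) / (ms_b + (k - 1) * ms_e + k * (ms_r - ms_e) / n)

print(f"mean pairwise Pearson r = {mean_r:.3f}, ICC(2,1) = {icc21:.3f}")
# The ICC is noticeably lower because it penalizes the systematic rater bias.
```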

Clustered and Genetic Studies

The ICC has gained prominence in epidemiology and biostatistics for analyzing clustered data in randomized trials, marking a shift from earlier uses in variance partitioning toward practical applications in trial design and modeling. This development paralleled growing recognition of clustering effects in observational and experimental studies, evolving into a cornerstone of modern trial methodology by the late 20th century with advances in computational methods for multilevel analysis. Cluster randomized trials, common in public health interventions, rely on the ICC to account for within-cluster dependence, which reduces statistical efficiency compared to individual randomization. The design effect, calculated as 1 + (m-1)\rho, where m is the average cluster size and \rho is the ICC, inflates the required sample size to maintain power; for instance, an ICC of 0.05 with m=50 yields a design effect of approximately 3.5, necessitating roughly three and a half times more participants. In applied examples, such as community-level interventions aimed at increasing uptake of health services, ICC estimates from prior studies (often 0.01–0.05) guide sample size planning to detect modest effects while adjusting for clustering in households, clinics, or neighborhoods. In genetic studies, the ICC quantifies phenotypic resemblance among relatives, enabling heritability estimation as the proportion of trait variance due to genetic factors. In twin studies, heritability h^2 is approximated using Falconer's formula: h^2 = 2(r_{MZ} - r_{DZ}), where r_{MZ} and r_{DZ} are the intraclass correlations for monozygotic and dizygotic pairs, respectively, assuming equal shared environmental influences. This method has been applied to traits such as height or disease susceptibility, with h^2 values often ranging from 0.4 to 0.8 for highly heritable phenotypes. The ICC also underpins hierarchical linear modeling in diverse fields, partitioning variance across levels to reveal clustering effects. In education research, it measures school-level influences on student achievement, with ICCs around 0.10–0.20 highlighting between-school variation in outcomes like test scores after controlling for individual factors. Similarly, in ecology, multilevel models use the ICC to evaluate site-specific clustering in environmental data, such as tree biomass partitioning across forest stands, where site effects (ICC ≈ 0.05–0.15) account for heterogeneity in growth responses. Post-2000 advancements in software and theory have accelerated the ICC's integration into these multilevel frameworks, facilitating robust analysis of nested data structures.
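
Both formulas are simple enough to encode directly. The following sketch (function names are illustrative) computes the design effect, the resulting inflated sample size, and a Falconer heritability estimate:

```python
import math

def design_effect(m, icc):
    """Inflation factor for cluster-randomized trials: DEFF = 1 + (m - 1) * ICC."""
    return 1.0 + (m - 1) * icc

def inflated_n(n_individual, m, icc):
    """Sample size after adjusting an individually randomized design for clustering."""
    return math.ceil(n_individual * design_effect(m, icc))

def falconer_h2(r_mz, r_dz):
    """Falconer's heritability estimate from MZ and DZ twin intraclass correlations."""
    return 2.0 * (r_mz - r_dz)

print(design_effect(50, 0.05))        # 3.45: roughly 3.5x more participants needed
print(inflated_n(400, 50, 0.05))      # 1380 participants instead of 400
print(falconer_h2(0.80, 0.50))        # h^2 = 0.60
```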

Computation

Estimation Formulas

Estimation of the intraclass correlation coefficient (ICC) typically relies on analysis of variance (ANOVA) frameworks for balanced designs, where each cluster or target has the same number of observations, denoted as k. The procedure begins by performing a one-way ANOVA treating clusters as the factor, yielding the mean square between clusters (MS_B) and mean square within clusters (MS_W). These are computed from the sums of squares: the total sum of squares (SS_total) is partitioned into SS_between = \sum n_j (\bar{y}_j - \bar{y})^2 and SS_within = \sum \sum (y_{ij} - \bar{y}_j)^2, where n_j = k for balanced data with n clusters; then MS_B = SS_between / (n-1) and MS_W = SS_within / [n(k-1)]. The point estimate for the single-rater ICC, ICC(1,1), in a one-way random effects model is then given by \text{ICC}(1,1) = \frac{\text{MS}_B - \text{MS}_W}{\text{MS}_B + (k-1)\text{MS}_W}, which corresponds to the ratio of between-cluster variance to total variance and is a consistent estimator of the population ICC under the model assumptions. For the average-rater ICC, ICC(1,k), the formula simplifies to \text{ICC}(1,k) = \frac{\text{MS}_B - \text{MS}_W}{\text{MS}_B}, providing an estimate of reliability when averaging across k raters. In two-way random effects models, analogous formulas incorporate the mean square for the second factor (e.g., raters, MS_R), such as ICC(2,1) = (\text{MS}_B - \text{MS}_E) / [\text{MS}_B + (k-1)\text{MS}_E + k(\text{MS}_R - \text{MS}_E)/n], where MS_E is the error mean square, but the core procedure remains ANOVA decomposition of sums of squares into between, interaction, and error components. Confidence intervals for the ICC in balanced designs are commonly constructed using the F-distribution, leveraging the fact that F = \text{MS}_B / \text{MS}_W \sim F(n-1, n(k-1)) under the null. A standard approach inverts the observed F against critical values to bound the population ICC: defining F'_L = F / F_{1-\alpha/2}(n-1, n(k-1)) and F'_U = F / F_{\alpha/2}(n-1, n(k-1)), the (1-\alpha) confidence interval is \left( \frac{F'_L - 1}{F'_L + (k-1)}, \frac{F'_U - 1}{F'_U + (k-1)} \right). This interval attains its nominal coverage under the balanced one-way normal model, as it accounts for the sampling distribution of the variance ratio. For unbalanced designs, where cluster sizes vary, ANOVA mean squares are not directly applicable; instead, a non-parametric cluster bootstrap is recommended: resample clusters with replacement B times (e.g., B=1000), compute the ICC for each bootstrap sample using moment estimators or maximum likelihood, and take the \alpha/2 and 1-\alpha/2 percentiles of the resulting bootstrap distribution as the interval bounds. Alternative estimation in linear mixed models (LMMs) frames the ICC directly as the proportion of variance attributable to the random effect: \text{ICC} = \sigma^2_u / (\sigma^2_u + \sigma^2_e), where \sigma^2_u is the variance of the random intercept (between-cluster) and \sigma^2_e is the residual (within-cluster) variance. These components are estimated via restricted maximum likelihood (REML), which adjusts for fixed effects and provides unbiased variance estimates even in unbalanced data; for implementation, functions like lmer in R fit the model y_{ij} = \beta_0 + u_j + e_{ij} with u_j \sim N(0, \sigma^2_u) and e_{ij} \sim N(0, \sigma^2_e), yielding the ICC from the variance outputs.
For small samples, where degrees of freedom are limited (e.g., n < 10 or k < 5), the F-distribution-based intervals may have poor coverage; the Satterthwaite approximation addresses this by estimating effective degrees of freedom for the variance components as \tilde{\nu} = 2 \hat{\sigma}^4 / \text{Var}(\hat{\sigma}^2), using method-of-moments estimates for the denominator, and substituting into t- or F-distributions for interval construction in LMM or two-way ANOVA contexts. This yields more accurate confidence intervals by accounting for the approximate chi-squared distribution of the variance estimates.
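
As a sketch of the LMM route for unbalanced data, the following example uses statsmodels' MixedLM to obtain REML variance components, plus a percentile cluster bootstrap for the interval (simulated data; the small bootstrap count is for illustration only):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)

# Simulate unbalanced clusters: sizes vary between 2 and 12 observations.
rows = []
for cluster in range(60):
    u = rng.normal(0, 1.0)                         # between-cluster sd = 1
    for _ in range(rng.integers(2, 13)):
        rows.append({"cluster": cluster, "y": 5 + u + rng.normal(0, 1.5)})
df = pd.DataFrame(rows)

# REML fit of y_ij = b0 + u_j + e_ij; ICC = var(u) / (var(u) + var(e)).
fit = smf.mixedlm("y ~ 1", df, groups=df["cluster"]).fit(reml=True)
var_u, var_e = fit.cov_re.iloc[0, 0], fit.scale
icc = var_u / (var_u + var_e)
print(f"REML ICC = {icc:.3f} (true ≈ {1 / (1 + 1.5**2):.3f})")

# Percentile bootstrap over clusters for an interval in the unbalanced case.
clusters = df["cluster"].unique()
boot = []
for _ in range(200):                               # small B for illustration
    sample = pd.concat(
        [df[df["cluster"] == c].assign(cluster=i)  # relabel resampled clusters
         for i, c in enumerate(rng.choice(clusters, size=len(clusters)))],
        ignore_index=True)
    f = smf.mixedlm("y ~ 1", sample, groups=sample["cluster"]).fit(reml=True)
    boot.append(f.cov_re.iloc[0, 0] / (f.cov_re.iloc[0, 0] + f.scale))
print("95% bootstrap CI:", np.percentile(boot, [2.5, 97.5]))
```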

Software Packages

In the R programming language, the irr package provides the icc() function for computing basic single-score or average-score intraclass correlation coefficients (ICCs) in one-way and two-way models, including F-tests and confidence intervals based on ANOVA. For more comprehensive analysis supporting multiple ICC types (e.g., ICC(1,1), ICC(2,1), ICC(3,1)) as defined by Shrout and Fleiss, along with confidence intervals and p-values, the psych package's ICC() function is widely used, leveraging either ANOVA or linear mixed-effects models via lme4. An example of irr's syntax for a two-way model assessing average-score agreement among raters is icc(data_matrix, model = "twoway", type = "agreement", unit = "average"), where data_matrix is a data frame with rows as subjects and columns as raters; this outputs the ICC estimate, F-statistic, and 95% confidence interval. In SPSS, ICCs are computed through the Reliability Analysis procedure under Analyze > Scale > Reliability Analysis, where users select the variables, choose "Intraclass correlation coefficient" in the Statistics dialog, and specify the model (e.g., one-way random, two-way mixed) and type (consistency or absolute agreement). The output includes tables with ICC values, mean squares (MS) for rows, columns, and error, confidence intervals, and significance tests, facilitating interpretation of reliability. SAS supports ICC estimation primarily through PROC MIXED for linear mixed models, where random effects model the variance components (e.g., proc mixed data=dataset; class subject rater; model outcome = / solution; random intercept / subject=subject; run;), allowing manual computation of the ICC as the ratio of between-subject variance to total variance from the covariance parameter estimates table. PROC CORR can compute Pearson correlations but requires additional steps or macros (e.g., %ICC9) for full ICC functionality in clustered designs. In Python, the pingouin library offers the intraclass_corr() function for ICC computation across six types (ICC1, ICC2, ICC3 and their average-score counterparts) with confidence intervals and p-values, using a long-format DataFrame input specifying subjects, raters, and measurements; for instance, pg.intraclass_corr(data=df, targets='subject', raters='rater', ratings='score') yields ICC estimates and inference statistics. Key considerations for implementation include handling missing data, where psych::ICC() in R accommodates unbalanced designs and missing values via lmer = TRUE (requiring the lme4 package), while irr::icc() defaults to complete cases. Ongoing refinements to lme4-based estimation in newer package releases further improve support for complex models.
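
A minimal end-to-end pingouin example, assuming the package is installed and using a small hypothetical long-format dataset, is shown below:

```python
import pandas as pd
import pingouin as pg

# Long format: one row per (subject, rater) rating; values are hypothetical.
df = pd.DataFrame({
    "subject": [s for s in range(1, 6) for _ in range(3)],
    "rater":   ["A", "B", "C"] * 5,
    "score":   [8, 7, 9, 4, 5, 4, 6, 6, 7, 9, 9, 10, 3, 2, 4],
})

# Returns all six Shrout-Fleiss forms (ICC1, ICC2, ICC3, ICC1k, ICC2k, ICC3k).
icc = pg.intraclass_corr(data=df, targets="subject", raters="rater",
                         ratings="score")
print(icc[["Type", "ICC", "CI95%"]])
```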

Interpretation

Guidelines for Values

The interpretation of intraclass correlation coefficient (ICC) values depends on the context of the study, but general benchmarks provide a framework for assessing reliability and agreement. Commonly used guidelines proposed by Koo and Li (2016) classify values less than 0.50 as poor, indicating low consistency among raters or measurements; 0.50 to 0.75 as moderate reliability; 0.75 to 0.90 as good reliability; and values above 0.90 as excellent. These thresholds may require stricter criteria in clinical or other high-stakes applications, where ICC values exceeding 0.90 are often necessary to ensure sufficient precision for decision-making about individuals. An ICC value approaching 1 reflects high similarity within groups or between raters, meaning that the variance attributable to measurement error is minimal compared to true differences between subjects. In contrast, negative ICC values, which can occur in one-way random effects models, signify greater disagreement among observations than would be expected by chance alone, often arising when between-group variance is smaller than within-group variance. Standard reporting of ICC results should specify the model (e.g., one-way random, two-way mixed), the type (e.g., absolute agreement or consistency), and the unit of analysis, such as ICC(3,1) = 0.82 with a 95% confidence interval [0.75, 0.88]. Precision of the ICC estimate is influenced by sample size, with smaller samples leading to wider confidence intervals and less reliable inferences; thus, reporting both the point estimate and its interval is essential for transparency. For visualizing agreement beyond ICC values, Bland-Altman plots are recommended, as they graphically display the differences between paired measurements against their means, highlighting systematic bias and limits of agreement to complement quantitative reliability assessments.
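
A basic Bland-Altman plot takes only a few lines of matplotlib; the sketch below uses simulated paired measurements (all parameters hypothetical) and draws the mean difference alongside the 95% limits of agreement:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
true = rng.normal(100, 15, size=60)
method_a = true + rng.normal(0, 4, size=60)
method_b = true + 2 + rng.normal(0, 4, size=60)     # small systematic bias of +2

means = (method_a + method_b) / 2
diffs = method_a - method_b
bias = diffs.mean()
loa = 1.96 * diffs.std(ddof=1)                      # 95% limits of agreement

plt.scatter(means, diffs, s=15)
plt.axhline(bias, color="k", label=f"mean difference = {bias:.2f}")
plt.axhline(bias + loa, color="k", linestyle="--", label="±1.96 SD")
plt.axhline(bias - loa, color="k", linestyle="--")
plt.xlabel("Mean of the two measurements")
plt.ylabel("Difference between measurements")
plt.legend()
plt.show()
```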

Limitations and Considerations

The intraclass correlation coefficient (ICC) relies on several statistical assumptions, including normality of the data and homogeneity of variance across groups or raters. Violations of these assumptions can lead to unstable variance estimates and biased ICC values, particularly in small samples or when data exhibit skewness or heavy tails. For instance, heteroscedasticity (non-constant variance) inflates ICC estimates, as demonstrated in simulations where unadjusted heteroscedastic data increased the ICC from 0.609 to 0.640. To address these issues, robust methods such as Bayesian hierarchical regression with variance-function modeling can be employed, which relax assumptions through Markov chain Monte Carlo (MCMC) techniques and provide more accurate estimates under heterogeneous variances. Selecting the appropriate ICC model and type is prone to pitfalls that can systematically inflate or deflate reliability estimates. For example, opting for a consistency ICC (which ignores absolute differences between raters) instead of an absolute agreement ICC (which accounts for systematic biases) can overestimate reliability when rater biases are present, as a larger consistency value relative to agreement indicates non-negligible bias. Updated guidelines emphasize careful consideration of design completeness and rater effects to avoid such errors, recommending maximum likelihood estimation for incomplete observational designs and highlighting limitations in prior rules of thumb that overlook these factors. Adequate sample size is crucial for ICC estimation, as small numbers of subjects (n) or raters (k) result in low statistical power and wide confidence intervals, reducing the precision of reliability assessments. Recommendations, such as those by Koo and Li (2016), suggest a minimum of 30 subjects and 3 raters to achieve reasonable precision for moderate ICC values (e.g., around 0.5–0.7), though exact requirements vary by expected ICC and desired confidence interval width; smaller setups risk unreliable inferences. The ICC is not always suitable, particularly for ordinal or categorical data, where alternatives like weighted kappa better account for the ordered nature of responses by penalizing disagreements proportionally to their magnitude. In high-dimensional data contexts prevalent in post-2020 applications, such as high-throughput feature extraction, the ICC faces critiques for sensitivity to noise amplification and the curse of dimensionality, where sparse, high-feature datasets violate variance stability assumptions and yield unstable reliability estimates across numerous variables.
