Intraclass correlation
The intraclass correlation coefficient (ICC) is a statistical index that quantifies the degree of similarity or agreement among observations within the same group or class, relative to the variability between groups, typically ranging from 0 (no similarity) to 1 (perfect similarity).[1] Introduced by Ronald A. Fisher in 1921 as an extension of the Pearson correlation coefficient to grouped data, such as familial resemblances in physical measurements, the ICC addresses scenarios where traditional correlations fail to account for within-group dependencies.

In practice, the ICC is widely applied in fields such as psychology, medicine, and biology to evaluate reliability in repeated measures, inter-rater assessments, and cluster-randomized trials, where it helps determine how much variation in outcomes is attributable to true differences between clusters versus random error within them. For instance, in clinical research it measures the consistency of diagnostic ratings across multiple observers, while in trial design it informs sample size adjustments by estimating clustering effects, as captured by the design effect formula DEFF = 1 + (n-1)ICC, where n is the cluster size.[2]

Several forms of the ICC exist, depending on the study design and assumptions, including the one-way random effects model, two-way models that assess either absolute agreement (ICC(A,1)) or consistency (ICC(C,1)), and variants that treat raters as fixed or random.[3] These are typically estimated using analysis of variance (ANOVA), with the general form for a one-way model given by ICC(1) = (MS_B - MS_W) / (MS_B + (k-1)MS_W), where MS_B is the mean square between subjects, MS_W is the mean square within subjects, and k is the number of measurements per subject.[1] Interpretations vary by context, but common guidelines classify values below 0.5 as poor reliability, 0.5–0.75 as moderate, 0.75–0.9 as good, and above 0.9 as excellent, though confidence intervals are essential for assessing precision.

Historical Development
Early Definition
The intraclass correlation coefficient (ICC) was first introduced by Ronald A. Fisher in 1921, in his paper "On the 'Probable Error' of a Coefficient of Correlation Deduced from a Small Sample," published in Metron, in the context of agricultural experiments at Rothamsted Experimental Station and studies of familial resemblances, such as correlations among siblings for physical traits.[4][5] Fisher developed the concept as a way to quantify the similarity of observations within the same class or group, extending Pearson's product-moment correlation to clustered data through the framework of analysis of variance (ANOVA).[6] This approach addressed the need to partition total variance into components attributable to between-group and within-group sources, particularly relevant for randomized block designs in agriculture where plots or litters represent classes.[7]

In his seminal 1925 textbook, Fisher formalized the ANOVA-based estimator for the ICC under a balanced one-way random effects model, derived directly from the ANOVA table: \hat{\rho} = \frac{MS_B - MS_W}{MS_B + (k-1) MS_W}, where MS_B denotes the mean square between groups, MS_W the mean square within groups, and k the number of observations per group. This estimator arises from the expected mean squares in the model: E(MS_B) = \sigma^2_W + k \sigma^2_B and E(MS_W) = \sigma^2_W, where \sigma^2_B is the between-group variance component and \sigma^2_W the within-group variance. Solving for \sigma^2_B yields (MS_B - MS_W)/k, and substituting into the ICC definition \rho = \sigma^2_B / (\sigma^2_B + \sigma^2_W) produces the formula; the underlying variance component estimates are unbiased in balanced designs, and the resulting ICC estimator is consistent for the population value.[8]

The primary purpose of this early formulation was to estimate the proportion of total variance explained by between-group differences, serving as a key metric for assessing group homogeneity in experimental data.[5] Fisher emphasized its utility in hypothesis testing via the F-statistic (MS_B / MS_W) and in power calculations for experimental design, making it foundational for variance component analysis.[9]

To illustrate, consider hypothetical data from Fisher's era on wheat yields (in grams per plant) across four experimental plots, each with five replicate plants: Plot 1: 20, 22, 19, 21, 20; Plot 2: 25, 27, 24, 26, 25; Plot 3: 18, 20, 17, 19, 18; Plot 4: 23, 25, 22, 24, 23. The ANOVA yields MS_B \approx 48.3 and MS_W = 1.3 (with k=5). Thus, \hat{\rho} = (48.3 - 1.3) / (48.3 + 4 \times 1.3) \approx 47 / 53.5 \approx 0.88, suggesting that approximately 88% of the total variance in yields arises from differences between plots, indicative of substantial plot-to-plot variability in soil or treatment effects.
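The calculation can be reproduced with a few lines of code. Below is a minimal sketch in Python (using only NumPy; the plot yields are the hypothetical values from the example above) that builds the one-way ANOVA mean squares and applies the estimator \hat{\rho}:

```python
import numpy as np

# Hypothetical wheat-yield data from the example: 4 plots, 5 replicate plants each
plots = np.array([
    [20, 22, 19, 21, 20],
    [25, 27, 24, 26, 25],
    [18, 20, 17, 19, 18],
    [23, 25, 22, 24, 23],
], dtype=float)

n, k = plots.shape                         # n = 4 plots, k = 5 plants per plot
grand_mean = plots.mean()
plot_means = plots.mean(axis=1)

ss_between = k * np.sum((plot_means - grand_mean) ** 2)
ss_within = np.sum((plots - plot_means[:, None]) ** 2)

ms_b = ss_between / (n - 1)                # mean square between plots
ms_w = ss_within / (n * (k - 1))           # mean square within plots

icc = (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)
print(round(ms_b, 2), round(ms_w, 2), round(icc, 2))   # 48.33 1.3 0.88
```

Running it prints mean squares of roughly 48.33 and 1.3 and an ICC of about 0.88, matching the hand calculation.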
Modern Definitions

Following World War II, the intraclass correlation coefficient (ICC) gained widespread adoption in psychology and medicine for assessing reliability in measurement and rating studies, extending Ronald Fisher's early variance-based framework to practical applications in inter-rater agreement and test-retest scenarios. A key simplification emerged in the early 1950s with Robert L. Ebel's work, which proposed estimating the ICC as the ratio of between-subject variance to total variance using analysis of variance (ANOVA) components, making computation more accessible for researchers.

This approach was refined in 1979 by Patrick E. Shrout and Joseph L. Fleiss, who outlined six distinct forms of the ICC to accommodate various study designs, such as single versus multiple raters and fixed versus random effects, thereby standardizing its use in reliability assessments. Their framework introduced nomenclature like ICC(1,1) for single-rater absolute agreement in a one-way random effects model, facilitating precise selection based on research objectives.[10] The modern estimator, commonly expressed as \text{ICC} = \frac{\text{MS}_\text{between} - \text{MS}_\text{within}}{\text{MS}_\text{between} + (k-1) \text{MS}_\text{within}}, where \text{MS}_\text{between} is the mean square between groups, \text{MS}_\text{within} is the mean square within groups, and k is the number of raters, prioritizes ease of calculation via ANOVA.

Contemporary guidelines, such as those by Terry K. Koo and Mae Y. Li in 2016, build on these developments by recommending specific ICC forms for clinical reliability research and emphasizing reporting practices that ensure interpretability and reproducibility.

Mathematical Foundations
Relation to Pearson's Correlation
The intraclass correlation coefficient (ICC) generalizes Pearson's product-moment correlation coefficient (r) by extending its application to exchangeable observations within defined classes, such as repeated measurements on the same subjects or ratings by multiple raters, rather than treating variables as distinct.[10] This conceptual link positions the ICC as a measure of similarity or agreement within groups, where Pearson's r quantifies linear association between two separate variables.[11] In essence, the ICC captures the proportion of total variance attributable to differences between classes, providing a framework for reliability assessment in clustered data.[12]

A direct mathematical correspondence exists in the special case of two observations per class, such as two raters evaluating multiple subjects. Here, the ICC for a two-way random effects model assessing absolute agreement with a single rater, denoted ICC(2,1), coincides with Pearson's r computed between the two sets of ratings when the raters share the same mean and variance, that is, when there is no systematic rater bias.[13] The equivalence arises because both coefficients then reduce to the formula \rho = \frac{\text{MS}_B - \text{MS}_W}{\text{MS}_B + \text{MS}_W}, where \text{MS}_B is the mean square between classes and \text{MS}_W is the mean square within classes from a one-way ANOVA, which matches the covariance-based structure of Pearson's r for paired data.[12] However, the ICC extends this to multiple observations per class (k > 2), generalizing the denominator to \text{MS}_B + (k-1)\text{MS}_W to account for increased within-class variability, and it can incorporate adjustments for systematic rater biases absent in Pearson's r.[10]

The core differences stem from their foundational assumptions: Pearson's r is an interclass correlation suited to bivariate data with ordered variables (e.g., predictor and outcome), emphasizing covariance relative to individual variances.[11] In contrast, the ICC is strictly intraclass, assuming observations within a class are interchangeable and focusing on variance partitioning to evaluate consistency or agreement.[12] One interpretive bridge is that the ICC equals the expected Pearson's r between any two randomly drawn observations from the same class, reflecting within-class dependence.[14] Equivalently, it can be derived as Pearson's r applied to the deviations of observations from their group means, which isolates within-class covariation after removing between-class effects: if Y_{ij} is the j-th observation in class i, then the correlation among (Y_{ij} - \bar{Y}_i) across paired selections yields the within-component structure underlying the ICC.[11]

Ronald A. Fisher originally developed the ICC in 1921, motivated by the limitations of Pearson's r for clustered or paired data where no natural distinction exists between independent and dependent variables, such as measurements on siblings or repeated assessments of the same entity.[15] Fisher's innovation addressed the probable error estimation for such "intraclass" correlations, laying the groundwork for its use in biological and experimental contexts with grouped observations.[15]
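The interpretation of the ICC as the expected Pearson's r between two randomly drawn members of the same class can be checked numerically. The following sketch (a simulation under assumed variance components; it is illustrative rather than drawn from the cited sources) compares the ANOVA estimator ICC(1,1) with Pearson's r computed over all ordered within-class pairs, essentially Fisher's original pairwise formulation; with many classes the two agree closely:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate clustered data: n classes of k exchangeable observations, true ICC = 0.5
n, k = 200, 4
sigma_b2, sigma_w2 = 1.0, 1.0
y = rng.normal(0, np.sqrt(sigma_b2), (n, 1)) + rng.normal(0, np.sqrt(sigma_w2), (n, k))

# (1) ANOVA estimator ICC(1,1)
ms_b = k * np.sum((y.mean(axis=1) - y.mean()) ** 2) / (n - 1)
ms_w = np.sum((y - y.mean(axis=1, keepdims=True)) ** 2) / (n * (k - 1))
icc_anova = (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)

# (2) Pearson's r over all ordered within-class pairs (Fisher's pairwise view)
first, second = [], []
for row in y:
    for j in range(k):
        for l in range(k):
            if j != l:
                first.append(row[j])
                second.append(row[l])
icc_pairs = np.corrcoef(first, second)[0, 1]

print(round(icc_anova, 3), round(icc_pairs, 3))   # both close to the true value 0.5
```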
Variance Components and Models

The intraclass correlation coefficient (ICC) is derived from an analysis of variance (ANOVA) framework that partitions the total observed variance into between-group and within-group components. This decomposition quantifies the extent to which variability among observations is attributable to systematic differences between groups, such as subjects, clusters, or raters, relative to random variation within those groups. Formally, the ICC is expressed as the ratio \text{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}, where \sigma_b^2 represents the between-group variance and \sigma_w^2 the within-group variance; thus, the ICC measures the proportion of total variance explained by group membership. This formulation, rooted in early statistical work on variance partitioning, provides a measure of homogeneity or clustering in data, with values closer to 1 indicating strong group-level effects and values near 0 suggesting near-independence of observations within groups.[1]

The underlying model assumes a random-effects (variance components) structure, typically represented as Y_{ij} = \mu + a_i + e_{ij}, where Y_{ij} is the j-th observation within the i-th group, \mu is the grand mean, a_i \sim N(0, \sigma_b^2) captures the random group effect, and e_{ij} \sim N(0, \sigma_w^2) denotes the independent error term, with a_i and e_{ij} uncorrelated across and within groups. Key assumptions include normality of the group effects and errors, independence of errors within groups to ensure no residual clustering beyond the modeled group effect, and treatment of groups as random samples from a larger population. These assumptions allow ANOVA mean squares to estimate the variance components in balanced designs, where each group has an equal number (k) of observations; in unbalanced designs, where group sizes vary, alternative estimation approaches such as restricted maximum likelihood (REML) are required to obtain unbiased variance component estimates and avoid bias in the ANOVA-based F-statistic.[1][16][17]

The ICC exhibits several important properties within this framework. It theoretically ranges from -\frac{1}{k-1} to 1, where k is the number of replicates or raters per group; the lower bound reflects scenarios of greater within-group dispersion than under complete independence, though negative estimates often arise from sampling variability when the true ICC is near zero. In large samples, the ANOVA-based estimator of the ICC is consistent, converging to the true population value under the stated assumptions, which supports its use in assessing group-level consistency across diverse applications such as reliability studies and clustered sampling.[1]
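In practice, the variance components in this model are often estimated by fitting a random-intercept mixed model rather than by hand. A minimal sketch, assuming the statsmodels package and simulated data, is shown below; it fits the model by REML and forms the ICC from the reported between-group and residual variances:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Simulate an unbalanced one-way layout: random group effect plus residual error
rows = []
for i, size in enumerate(rng.integers(3, 9, size=30)):   # 30 groups of 3-8 observations
    a_i = rng.normal(0, 1.0)                              # sigma_b = 1
    for y in a_i + rng.normal(0, 1.0, size):              # sigma_w = 1, true ICC = 0.5
        rows.append({"group": i, "y": y})
df = pd.DataFrame(rows)

# Random-intercept model fitted by REML; ICC = sigma_b^2 / (sigma_b^2 + sigma_w^2)
fit = smf.mixedlm("y ~ 1", data=df, groups=df["group"]).fit(reml=True)
sigma_b2 = float(fit.cov_re.iloc[0, 0])   # between-group variance component
sigma_w2 = float(fit.scale)               # residual (within-group) variance
print(round(sigma_b2 / (sigma_b2 + sigma_w2), 3))   # estimate of the ICC
```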
Types of ICC

One-Way Random Effects Model
The one-way random effects model represents the foundational approach in intraclass correlation analysis for designs involving a single random factor, such as subjects or groups drawn randomly from a larger population. In this framework, all sources of variation beyond the random groups are treated as error, making it suitable for scenarios where measurements within groups are exchangeable and there is no fixed effect to consider. The model posits that the observed scores Y_{ij} for the j-th measurement on the i-th group follow Y_{ij} = \mu + \alpha_i + e_{ij}, where \mu is the overall mean, \alpha_i \sim N(0, \sigma_b^2) captures between-group variance, and e_{ij} \sim N(0, \sigma_w^2) represents within-group error variance, with groups and measurements assumed independent. This setup yields the intraclass correlation as the ratio \rho = \sigma_b^2 / (\sigma_b^2 + \sigma_w^2), emphasizing the proportion of total variance due to the random grouping factor.[18]

For estimation, the model typically relies on analysis of variance (ANOVA) mean squares under balanced designs, where each group has the same number k of measurements. The ICC for a single measure, ICC(1,1), is calculated as \text{ICC}(1,1) = \frac{\text{MS}_B - \text{MS}_W}{\text{MS}_B + (k-1) \text{MS}_W}, where \text{MS}_B is the between-groups mean square and \text{MS}_W is the within-groups mean square. This estimator reflects the reliability of individual ratings when raters or measurements are random rather than fixed across groups. For the average of k measures per group, denoted ICC(1,k), the formula adjusts to \text{ICC}(1,k) = \frac{\text{MS}_B - \text{MS}_W}{\text{MS}_B}, providing a higher reliability estimate by averaging out within-group error.[18] These forms are particularly relevant in single-facet designs, such as evaluating consistency across random samples of items or subjects without additional structured factors.

Confidence intervals for these ICC estimates are commonly derived from the F-statistic of the ANOVA table, F = \text{MS}_B / \text{MS}_W, which follows an F-distribution under the null hypothesis of no between-group variance. Bounds on the population ICC are obtained through the transformation L = \frac{F' - 1}{F' + (k-1)}, where F' is the observed F divided by (for the lower limit) or multiplied by (for the upper limit) the appropriate critical F-value at the desired confidence level, with degrees of freedom df_B = n-1 and df_W = n(k-1) for n groups (and the degrees of freedom reversed for the upper bound). This method provides inferential bounds on the population ICC, allowing assessment of precision in reliability estimates.

The model assumes normality of random effects and errors, independence within and between groups, and equal variances, which support the validity of ANOVA-based estimation in balanced settings. In cases of unbalanced data, where k varies across groups, traditional mean square estimators can become biased; instead, restricted maximum likelihood (REML) is employed to estimate the variance components \sigma_b^2 and \sigma_w^2, from which the ICC is derived as their ratio. REML accounts for the loss of degrees of freedom in likelihood estimation, making it robust for irregular designs while maintaining computational feasibility through iterative methods. This adjustment ensures reliable ICC computation without requiring balanced replication.
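To illustrate these formulas, the helper below (a sketch assuming NumPy and SciPy; the function name and example scores are illustrative) computes ICC(1,1), ICC(1,k), and the confidence-interval construction described above from a balanced n-by-k data matrix:

```python
import numpy as np
from scipy.stats import f as f_dist

def icc_one_way(x, alpha=0.05):
    """ICC(1,1), ICC(1,k), and an F-based confidence interval for ICC(1,1),
    computed from a balanced n-groups x k-observations matrix."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    ms_b = k * np.sum((x.mean(axis=1) - x.mean()) ** 2) / (n - 1)
    ms_w = np.sum((x - x.mean(axis=1, keepdims=True)) ** 2) / (n * (k - 1))
    icc1 = (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)
    icc1k = (ms_b - ms_w) / ms_b
    # Scale the observed F by critical values (one-way construction described above)
    F = ms_b / ms_w
    df_b, df_w = n - 1, n * (k - 1)
    f_low = F / f_dist.ppf(1 - alpha / 2, df_b, df_w)
    f_upp = F * f_dist.ppf(1 - alpha / 2, df_w, df_b)
    ci = ((f_low - 1) / (f_low + k - 1), (f_upp - 1) / (f_upp + k - 1))
    return icc1, icc1k, ci

# Example: three test-retest measurements on six subjects (hypothetical scores)
scores = [[7, 8, 8], [5, 5, 6], [9, 9, 8], [4, 5, 4], [6, 7, 6], [8, 8, 9]]
print(icc_one_way(scores))
```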
Use cases for the one-way random effects model are confined to situations with a solitary random factor, such as test-retest reliability assessments involving repeated measures on the same subjects over time (treating time points as random without fixed raters) or inter-item consistency in psychological scales where items form random groups. It is ideal when the goal is to partition variance solely between these random units and residual error, avoiding the introduction of fixed effects that would necessitate more complex models.

Two-Way and Mixed Effects Models
In two-way models for intraclass correlation, raters are incorporated as a second factor alongside subjects, enabling the evaluation of rater variability in designs where every subject is rated by multiple raters, typically in a fully crossed setup. These models distinguish between random and fixed effects for raters to address different inferential goals in reliability assessments.

The two-way random effects model treats both subjects and raters as random factors, suitable when the selected raters represent a sample from a broader population of potential raters, allowing generalization of reliability estimates beyond the specific raters used. In this framework, ICC(2,1) quantifies absolute agreement for a single rating, capturing the proportion of total variance attributable to true subject differences while accounting for both rater and residual variance. The estimator is given by \text{ICC}(2,1) = \frac{\text{MS}_b - \text{MS}_w}{\text{MS}_b + (k-1) \text{MS}_w + \frac{k (\text{MS}_r - \text{MS}_w)}{n}}, where \text{MS}_b is the between-subjects mean square, \text{MS}_r is the raters mean square, \text{MS}_w is the residual mean square, k is the number of raters, and n is the number of subjects; this adjustment for rater variance (\text{MS}_r) ensures the estimate reflects systematic differences among raters as part of the error structure.[18]

The two-way mixed effects model, in contrast, treats raters as fixed effects and subjects as random, appropriate for studies where the particular raters are of specific interest and results are not intended to generalize to other raters, such as evaluating agreement among a fixed panel of experts. Here, ICC(3,1) measures consistency by focusing on the similarity in relative rankings of subjects across raters, deliberately excluding rater main effects (e.g., overall bias) from the variance components. The formula simplifies to \text{ICC}(3,1) = \frac{\text{MS}_b - \text{MS}_w}{\text{MS}_b + (k-1) \text{MS}_w}, omitting \text{MS}_r since fixed raters do not contribute random variance to the denominator, thus emphasizing agreement after adjusting for rater-specific offsets. These estimators are derived from two-way ANOVA mean squares under balanced designs, where the interaction term serves as the residual.[18]

For unbalanced data or more flexible specifications, such as varying numbers of ratings per subject-rater pair, two-way and mixed effects ICC models are generalized through linear mixed models (LMMs), which partition variance into fixed rater effects and random subject effects using iterative estimation. Variance components in LMMs are typically obtained via restricted maximum likelihood (REML) for unbiased estimates in random effects settings or maximum likelihood (ML) for model comparisons, accommodating missing observations and enabling hypothesis tests on rater effects. Model selection hinges on study design: employ the two-way random model for crossed raters with generalizability needs, as in broad inter-rater studies; choose the mixed model for fixed, study-specific raters, such as in clinical assessments with designated evaluators.[18]
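A compact way to see how the two estimators differ is to compute both from the same ratings matrix. The sketch below (hypothetical ratings; the helper name is illustrative) performs the two-way ANOVA decomposition described above and returns ICC(2,1) and ICC(3,1):

```python
import numpy as np

def icc_two_way(x):
    """ICC(2,1) and ICC(3,1) from an n-subjects x k-raters matrix
    (balanced, fully crossed design), using the mean squares defined above."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)    # subjects
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)    # raters
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols  # residual/interaction
    ms_b = ss_rows / (n - 1)
    ms_r = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    icc21 = (ms_b - ms_e) / (ms_b + (k - 1) * ms_e + k * (ms_r - ms_e) / n)
    icc31 = (ms_b - ms_e) / (ms_b + (k - 1) * ms_e)
    return icc21, icc31

# Hypothetical ratings: 6 subjects scored by 3 raters
ratings = [[9, 2, 5], [6, 1, 3], [8, 4, 6], [7, 1, 2], [10, 5, 6], [6, 2, 4]]
print(icc_two_way(ratings))
```

When raters differ systematically in their average scores, ICC(2,1) falls below ICC(3,1), because the agreement form counts that rater variance as error while the consistency form does not.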
Applications

Reliability Assessment
The intraclass correlation coefficient (ICC) serves as a fundamental tool for evaluating the reliability of measurement instruments in observational settings, quantifying the proportion of total variance attributable to between-subject differences relative to within-subject variability. This approach is particularly valuable for assessing consistency in ratings or measurements where systematic errors from raters or time can influence outcomes. Derived from variance component models, ICC estimates help determine whether a tool produces stable and reproducible results across repeated applications.[18]

In inter-rater reliability assessments, the ICC measures agreement among multiple observers evaluating the same subjects, such as physicians diagnosing medical conditions based on symptom presentations. The ICC(2,1), under a two-way random effects model, evaluates absolute agreement by treating raters as a random sample and accounting for both rater and residual variance, making it suitable when the goal is to ensure ratings are interchangeable. In contrast, the ICC(3,1), using a two-way mixed effects model, focuses on relative consistency by treating raters as fixed effects, which is appropriate when specific raters are of interest and systematic rater differences are not penalized. These forms enable precise evaluation of observer concordance in fields like clinical diagnostics.[19][18]

For test-retest reliability, the ICC assesses the stability of measurements taken on the same subjects at different time points under similar conditions, while intra-rater reliability examines consistency within a single rater across multiple trials on the same subjects. Both scenarios commonly employ the ICC(1,1) from a one-way random effects model for single administrations, which partitions variance into between-subject and residual components (including time or trial effects as random). This model is ideal for tools like psychological scales or clinical assessments where repeated measures simulate rater variability.[19][18]

Interpretation guidelines for ICC values, popularized in psychology during the 1980s, provide benchmarks for reliability strength; for instance, values below 0.50 indicate poor reliability, 0.50–0.75 moderate, 0.75–0.90 good, and above 0.90 excellent, with confidence intervals recommended to account for estimation uncertainty. In clinical trials, such as those validating pain scales for patient self-reports or nurse assessments, the ICC is routinely applied to confirm tool dependability; representative studies report inter-rater ICC values around 0.80 for pain intensity ratings, supporting their use in outcome evaluation.[18][20]

Compared to alternatives like pairwise Pearson correlations, which evaluate agreement between only two raters at a time and necessitate cumbersome averaging for groups, the ICC offers superior efficiency by simultaneously incorporating all raters into a single estimate, better capturing overall measurement consistency while adjusting for both correlation and systematic bias.[19][18]

Clustered and Genetic Studies
The intraclass correlation coefficient (ICC) has gained prominence in epidemiology for analyzing clustered data in randomized trials, marking a shift from earlier uses in variance partitioning toward practical applications in trial design and biostatistical modeling.[21] This development paralleled growing recognition of clustering effects in observational and experimental studies, evolving into a cornerstone of modern biostatistics by the late 20th century with advances in computational methods for multilevel analysis.[21]

Cluster randomized trials, common in public health interventions, rely on the ICC to account for within-cluster dependence, which reduces statistical efficiency compared to individual randomization. The design effect, calculated as 1 + (m-1)\rho, where m is the average cluster size and \rho is the ICC, inflates the required sample size to maintain power; for instance, an ICC of 0.05 with m = 50 yields a design effect of about 3.5, necessitating roughly three and a half times more participants.[22] In public health examples, such as community-level interventions for smoking cessation or vaccination uptake, ICC estimates from prior studies (often 0.01–0.05) guide sample size planning to detect modest intervention effects while adjusting for clustering in households, schools, or neighborhoods.[23]

In genetic studies, the ICC quantifies phenotypic resemblance among relatives, enabling heritability estimation as the proportion of trait variance due to additive genetic effects. In twin studies, narrow-sense heritability h^2 is approximated using Falconer's formula: h^2 = 2(r_{MZ} - r_{DZ}), where r_{MZ} and r_{DZ} are the intraclass correlations for monozygotic and dizygotic pairs, respectively, assuming equal shared environmental influences.[24] This method has been applied in quantitative genetics to traits like height or disease susceptibility, with h^2 values often ranging from 0.4 to 0.8 for highly heritable phenotypes.

The ICC also underpins hierarchical linear modeling in diverse fields, partitioning variance across levels to reveal clustering effects. In education research, it measures school-level influences on student achievement, with ICCs around 0.10–0.20 highlighting between-school variation in outcomes like test scores after controlling for individual factors.[25] Similarly, in ecology, multilevel models use the ICC to evaluate site-specific clustering in environmental data, such as tree biomass partitioning across forest stands, where site effects (ICC ≈ 0.05–0.15) account for spatial heterogeneity in growth responses.[26] Post-2000 advancements in software and theory have accelerated the ICC's integration into these multilevel frameworks, facilitating robust analysis of nested data structures.[27]
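Both planning formulas are straightforward to script. The sketch below (helper names and the twin correlations are illustrative; the design-effect inputs are the values from the example above) computes the design effect, the implied participant total for a cluster trial, and Falconer's heritability estimate:

```python
import math

def design_effect(icc, m):
    """Variance inflation from clustering: 1 + (m - 1) * ICC for average cluster size m."""
    return 1 + (m - 1) * icc

def inflated_sample_size(n_unclustered, icc, m):
    """Participants needed under cluster randomization to match an individually
    randomized trial requiring n_unclustered participants."""
    return math.ceil(n_unclustered * design_effect(icc, m))

def falconer_h2(r_mz, r_dz):
    """Falconer's narrow-sense heritability estimate from twin intraclass correlations."""
    return 2 * (r_mz - r_dz)

print(round(design_effect(0.05, 50), 2))       # 3.45, the example in the text
print(inflated_sample_size(400, 0.05, 50))     # 1380 participants instead of 400
print(round(falconer_h2(0.80, 0.45), 2))       # 0.70 for hypothetical twin correlations
```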
Computation

Estimation Formulas
Estimation of the intraclass correlation coefficient (ICC) typically relies on analysis of variance (ANOVA) frameworks for balanced designs, where each cluster or target has the same number of observations, denoted as k. The procedure begins by performing a one-way ANOVA treating clusters as the factor, yielding the mean square between clusters (MS_B) and mean square within clusters (MS_W). These are computed from the sums of squares: the total sum of squares (SS_total) is partitioned into SS_between = \sum n_j (\bar{y}_j - \bar{y})^2 and SS_within = \sum \sum (y_{ij} - \bar{y}_j)^2, where n_j = k for balanced data with n clusters; then MS_B = SS_between / (n-1) and MS_W = SS_within / [n(k-1)].

The point estimate for the single-rater ICC, ICC(1,1), in a one-way random effects model is then given by \text{ICC}(1,1) = \frac{\text{MS}_B - \text{MS}_W}{\text{MS}_B + (k-1)\text{MS}_W}, which corresponds to the ratio of between-cluster variance to total variance and is a consistent estimator of the population ICC under the model assumptions. For the average-rater ICC, ICC(1,k), the formula simplifies to \text{ICC}(1,k) = \frac{\text{MS}_B - \text{MS}_W}{\text{MS}_B}, providing an estimate of reliability when averaging across k raters. In two-way random effects models, analogous formulas incorporate the mean square for the second factor (e.g., raters, MS_R), such as \text{ICC}(2,1) = (\text{MS}_B - \text{MS}_E) / [\text{MS}_B + (k-1)\text{MS}_E + k(\text{MS}_R - \text{MS}_E)/n], where MS_E is the error mean square, but the core procedure remains ANOVA decomposition of sums of squares into between, interaction, and error components.

Confidence intervals for the ICC in balanced designs are commonly constructed using the F-distribution, leveraging the fact that \text{MS}_B / \text{MS}_W, divided by 1 + k\sigma_b^2/\sigma_w^2, follows an F(n-1, n(k-1)) distribution. A standard approach scales the observed statistic F = \text{MS}_B / \text{MS}_W by critical values to bound the population ICC; specifically, let F_L = F / F_{1-\alpha/2}(n-1, n(k-1)) and F_U = F \cdot F_{1-\alpha/2}(n(k-1), n-1), where F_{1-\alpha/2}(\cdot,\cdot) denotes the upper \alpha/2 critical value of the F-distribution with the indicated degrees of freedom. The (1-\alpha) confidence interval is then \left( \frac{F_L - 1}{F_L + (k-1)}, \frac{F_U - 1}{F_U + (k-1)} \right). This construction accounts for the sampling distribution of the variance ratio and performs well for moderate sample sizes. For unbalanced designs, where cluster sizes vary, ANOVA mean squares are not directly applicable; instead, non-parametric bootstrapping is recommended: resample clusters with replacement B times (e.g., B = 1000), compute the ICC for each bootstrap sample using moment estimators or maximum likelihood, and take the \alpha/2 and 1-\alpha/2 percentiles of the resulting distribution as the interval bounds.

Alternative estimation in linear mixed models (LMMs) frames the ICC directly as the proportion of variance attributable to the random effect: \text{ICC} = \sigma^2_u / (\sigma^2_u + \sigma^2_e), where \sigma^2_u is the variance of the random intercept (between-cluster) and \sigma^2_e is the residual (within-cluster) variance. These components are estimated via restricted maximum likelihood (REML), which adjusts for fixed effects and reduces bias in the variance estimates even with unbalanced data; for implementation, functions like lmer() in R fit the model y_{ij} = \beta_0 + u_j + e_{ij} with u_j \sim N(0, \sigma^2_u) and e_{ij} \sim N(0, \sigma^2_e), yielding the ICC from the variance outputs.
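For the unbalanced case, the cluster bootstrap described above can be implemented directly. The following sketch uses simulated data and a method-of-moments ICC with the standard k0 adjustment for unequal cluster sizes (the estimator choice and helper names here are illustrative, not prescribed by the cited sources); it resamples whole clusters and reports a percentile interval:

```python
import numpy as np

rng = np.random.default_rng(3)

def icc_oneway_unbalanced(groups):
    """Method-of-moments ICC(1) for a list of 1-D arrays (one array per cluster),
    using the k0 adjustment for unequal cluster sizes."""
    sizes = np.array([len(g) for g in groups], dtype=float)
    n, N = len(groups), sizes.sum()
    grand = np.concatenate(groups).mean()
    means = np.array([g.mean() for g in groups])
    ms_b = np.sum(sizes * (means - grand) ** 2) / (n - 1)
    ms_w = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - n)
    k0 = (N - (sizes ** 2).sum() / N) / (n - 1)     # effective cluster size
    sigma_b2 = max((ms_b - ms_w) / k0, 0.0)
    return sigma_b2 / (sigma_b2 + ms_w)

def bootstrap_ci(groups, n_boot=1000, alpha=0.05):
    """Percentile interval from resampling whole clusters with replacement."""
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(groups), size=len(groups))
        stats.append(icc_oneway_unbalanced([groups[i] for i in idx]))
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

# Toy unbalanced data: 8 clusters of varying size
groups = [m + rng.normal(0, 1.0, size)
          for m, size in zip(rng.normal(0, 1.0, 8), [3, 5, 4, 6, 3, 7, 5, 4])]
print(icc_oneway_unbalanced(groups), bootstrap_ci(groups))
```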
For small samples, where degrees of freedom are limited (e.g., n < 10 or k < 5), the F-distribution-based intervals may have poor coverage; the Satterthwaite approximation addresses this by estimating effective degrees of freedom for the variance components as \tilde{\nu} = 2 \hat{\sigma}^4 / \text{Var}(\hat{\sigma}^2), using method-of-moments for the denominator, and substituting into t- or F-distributions for interval construction in LMM or two-way ANOVA contexts. This yields more accurate confidence intervals by accounting for the approximate chi-squared distribution of variance estimates.[28]
Software Packages
In the R programming language, the irr package provides the icc() function for computing single-score or average-score intraclass correlation coefficients (ICCs) in one-way and two-way models, including F-tests and confidence intervals based on ANOVA.[29] For more comprehensive analysis supporting multiple ICC types (e.g., ICC(1,1), ICC(2,1), ICC(3,1)) as defined by Shrout and Fleiss (1979), along with confidence intervals and p-values, the psych package's ICC() function is widely used, leveraging either ANOVA or linear mixed-effects models via lme4.[30] An example call for a two-way model assessing average-score agreement among raters is irr::icc(data_matrix, model = "twoway", type = "agreement", unit = "average"), where data_matrix is a data frame with rows as subjects and columns as raters; psych::ICC(data_matrix) instead reports all six Shrout–Fleiss forms at once, together with F-statistics and 95% confidence intervals.[30]
In SPSS, ICCs are computed through the Reliability Analysis procedure under Analyze > Scale > Reliability Analysis, where users select the variables, choose "Intraclass correlation coefficient" in the Statistics dialog, and specify the model (e.g., one-way random, two-way mixed) and type (consistency or absolute agreement).[31] The output includes tables with ICC values, mean squares (MS) for rows, columns, and error, F-statistics, and significance tests, facilitating interpretation of interrater reliability.[32]
SAS supports ICC estimation primarily through PROC MIXED for linear mixed models, where random effects capture the variance components (e.g., proc mixed data=dataset; class subject rater; model outcome = / solution; random intercept / subject=subject; run;), allowing manual computation of the ICC as the ratio of between-subject variance to total variance from the covariance parameter estimates table.[33] PROC CORR computes only Pearson correlations; obtaining full ICC functionality for clustered designs requires additional steps or dedicated macros (e.g., %ICC9).[34]
In Python, the pingouin library offers the intraclass_corr() function for ICC computation across six types (e.g., ICC(1), ICC(2), ICC(3)) with confidence intervals and p-values, using a long-format DataFrame input specifying subjects, raters, and measurements; for instance, pg.intraclass_corr(data=df, targets='subject', raters='rater', ratings='score') yields ICC estimates and inference statistics.[35]
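A fuller version of that call, on a small hypothetical long-format data set (column names and values are illustrative), looks like this:

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format data: 4 subjects, each scored by the same 3 raters
df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater":   ["A", "B", "C"] * 4,
    "score":   [8, 7, 8, 5, 6, 5, 9, 9, 8, 4, 5, 4],
})

# One row per ICC form (single- and average-measure variants), with F-tests,
# p-values, and 95% confidence intervals
icc_table = pg.intraclass_corr(data=df, targets="subject",
                               raters="rater", ratings="score")
print(icc_table.round(3))
```

Each row of the output corresponds to one ICC form, with its estimate, F-test, and confidence interval, so the appropriate form can be selected according to the study design.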
Key considerations for implementation include the handling of missing data: psych::ICC() in R accommodates unbalanced designs and missing values via lmer = TRUE (which requires the lme4 package), while irr::icc() defaults to complete cases but can be adjusted.[36] Because confidence interval computation depends on the underlying ANOVA or mixed-model fit, behavior can differ across package releases, so reporting the package versions used is advisable for complex models.