Point-biserial correlation coefficient
The point-biserial correlation coefficient (r_{pb}) is a statistical measure that assesses the strength and direction of the linear association between a dichotomous (binary) variable and a continuous variable, serving as a special case of the Pearson product-moment correlation coefficient when one variable takes only two distinct values, such as 0 and 1.[1][2] It is commonly applied in fields like psychology, education, and social sciences to evaluate relationships, for example, between gender (binary) and test performance (continuous), assuming the binary variable represents true categories rather than an artificial split of continuous data.[3][4] The coefficient is calculated using the formula
r_{pb} = \frac{\bar{Y}_1 - \bar{Y}_0}{s_Y} \sqrt{p(1 - p)},
where \bar{Y}_1 is the mean of the continuous variable for the group coded as 1, \bar{Y}_0 is the mean for the group coded as 0, s_Y is the standard deviation of the continuous variable across all observations (computed with divisor n, so that the expression equals Pearson's r exactly), and p is the proportion of observations in the group coded as 1 (with 1 - p for the other group).[3][2] This formula yields values ranging from -1 to 1, where positive values indicate that higher values of the continuous variable are associated with the category coded as 1, negative values suggest the opposite, and a value near 0 implies no linear relationship.[1] Interpretation mirrors that of Pearson's r, with magnitudes indicating weak (< 0.3), moderate (0.3–0.5), or strong (> 0.5) associations, though context-specific benchmarks apply.[5][3]
Key assumptions include the continuous variable being approximately normally distributed within each binary group, equal variances across groups (homoscedasticity), and linearity in the relationship; violations, such as non-normal data or artificial dichotomization, can lead to biased estimates, in which case alternatives like logistic regression may be preferable.[2][4] The point-biserial correlation is robust for large samples but requires random sampling and independence of observations for valid inference, with significance testing often performed via t-tests analogous to those for Pearson's r.[2] In practice, software like SPSS or R computes it directly by treating the binary variable as numeric (0/1); the analyst remains responsible for ensuring that the coding reflects true categorical distinctions, since the software cannot check this and the results may otherwise be misleading.[4]
Definition and Background
Definition
The point-biserial correlation coefficient is a statistical measure that assesses the strength and direction of the association between a binary (dichotomous) variable and a continuous variable, serving as a special case of the Pearson product-moment correlation coefficient.[6][7] It is particularly useful in scenarios where one variable represents two mutually exclusive categories, such as success/failure or presence/absence, and the other is measured on a numerical scale without discrete categories.[8] In this context, the binary variable is conventionally coded as 0 for one category and 1 for the other, ensuring it functions as a numerical indicator in the correlation calculation, while the continuous variable is typically interval or ratio scaled, such as income levels or exam scores.[7][9] The coefficient captures how variations in the continuous variable differ across the two groups defined by the binary variable, with positive values indicating that the continuous variable tends to be higher in the group coded as 1, and negative values showing the reverse pattern.[10]
The term "point-biserial" reflects the discrete, two-point structure of the binary variable on a measurement scale, distinguishing it from other correlation types like the biserial correlation, which assumes the binary variable arises from an underlying continuous distribution.[11] This naming emphasizes the point-like nature of the dichotomy in contrast to fully continuous associations.[12]
Historical Development
The point-biserial correlation coefficient originated as an extension of Karl Pearson's pioneering work on correlation measures in the early 1900s. Pearson, who introduced the product-moment correlation coefficient in 1895, extended his methods to handle variables of mixed types, including dichotomous ones, in biometric and statistical analyses. The specific term "point-biserial" was introduced by M. W. Richardson and J. M. Stalnaker in 1933, in their paper distinguishing it from the biserial correlation as a measure for truly dichotomous variables without assuming an underlying continuous distribution.[13] A key publication contributing to its foundations is Pearson's 1909 paper in Biometrika, where he derived the biserial correlation for estimating relationships assuming an underlying continuous distribution for the dichotomous variable; the point-biserial can be viewed as a simplified variant without that assumption.[14]
The coefficient's development advanced through further examinations of its properties, such as Joseph Lev's 1949 note in the Annals of Mathematical Statistics, which explored its sampling distribution.[14] It gained traction in psychometrics and social sciences in the mid-20th century, with statisticians like Chester W. Harris emphasizing its practical utility in educational measurement and test construction, including for analyzing item discrimination in binary-scored assessments.[15] During the mid-1900s, refinements focused on robust estimation for dichotomous variables in empirical studies. By the 1980s, it was routinely incorporated into major statistical software packages, such as SAS and SPSS, enabling efficient computation in large-scale analyses. In modern statistical practice, it continues to inform developments in item response theory, where it serves as a benchmark for validating latent trait models against observed binary responses.[2]
Mathematical Formulation
Formula Derivation
The point-biserial correlation coefficient arises as a special case of the Pearson product-moment correlation coefficient when one variable is dichotomous (coded as 0 or 1) and the other is continuous.[2] To derive its formula, begin with the general Pearson correlation:
r = \frac{\mathrm{Cov}(X, Y)}{s_X s_Y},
where X is the dichotomous variable, Y is the continuous variable, \mathrm{Cov}(X, Y) is the covariance between X and Y, and s_X and s_Y are the standard deviations of X and Y, respectively.[16] The covariance and both standard deviations are taken in the same form throughout (here with divisor n), which is what makes the final expression exact.
Assume X takes the value 1 with proportion P (and 0 with proportion 1 - P). The mean of X is \mu_X = P, and its variance is \sigma_X^2 = P(1 - P), so the standard deviation is s_X = \sqrt{P(1 - P)}. Because the product XY equals Y when X = 1 and 0 otherwise, its mean is P \cdot M_1, and the covariance can be expressed as
\mathrm{Cov}(X, Y) = P \cdot M_1 - \mu_Y \cdot P,
where M_1 is the mean of Y conditional on X = 1 and \mu_Y = P \cdot M_1 + (1 - P) \cdot M_0 is the overall mean of Y, with M_0 the mean of Y conditional on X = 0. Substituting \mu_Y yields
\mathrm{Cov}(X, Y) = P(1 - P)(M_1 - M_0).[16][2]
Plugging these into the Pearson formula gives:
r_{pb} = \frac{P(1 - P)(M_1 - M_0)}{\sqrt{P(1 - P)} \cdot s_Y} = \frac{(M_1 - M_0) \sqrt{P(1 - P)}}{s_Y},
where s_Y is the standard deviation of Y. This simplification shows that the point-biserial correlation r_{pb} is the standardized difference in group means scaled by the standard deviation of the binary variable.[16][2]
In the derived formula, M_1 - M_0 captures the mean difference between the two groups defined by the dichotomous variable, reflecting the strength of the association. The term \sqrt{P(1 - P)} is the standard deviation of the binary variable, which reaches its maximum of 0.5 at P = 0.5 and approaches zero as P nears 0 or 1, so the coefficient accounts for the proportion of each category. Finally, division by s_Y standardizes the difference relative to the variability in the continuous variable.[2] The derivation implicitly assumes linearity in the relationship between X and Y, as inherited from the Pearson correlation.
Relation to Pearson Correlation
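The covariance identity \mathrm{Cov}(X, Y) = P(1 - P)(M_1 - M_0) and the resulting agreement between the mean-difference formula and Pearson's r can be checked numerically. The following is a minimal Python sketch using NumPy; the data values and variable names are illustrative assumptions rather than material from the cited sources, and all moments are computed with divisor n.

```python
import numpy as np

# Hypothetical example data: x is the 0/1 indicator, y the continuous variable.
x = np.array([0, 1, 0, 1, 1, 0, 1, 0], dtype=float)
y = np.array([2.1, 3.5, 1.8, 4.2, 3.9, 2.6, 3.1, 2.4])

p = x.mean()             # proportion of observations coded 1
m1 = y[x == 1].mean()    # mean of y in the group coded 1
m0 = y[x == 0].mean()    # mean of y in the group coded 0

# Population-form (divisor n) covariance and standard deviation.
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
s_y = y.std(ddof=0)

# Right-hand side of the covariance identity.
cov_identity = p * (1 - p) * (m1 - m0)

# Mean-difference formula versus Pearson's r computed directly.
r_from_means = (m1 - m0) * np.sqrt(p * (1 - p)) / s_y
r_pearson = np.corrcoef(x, y)[0, 1]

print(cov_xy, cov_identity)
print(r_from_means, r_pearson)
```

Each printed pair should agree up to floating-point rounding. Swapping in `ddof=1` for the standard deviation alone would break the agreement, which is why the divisor has to be chosen consistently.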
The point-biserial correlation coefficient r_{pb} is mathematically equivalent to the Pearson product-moment correlation coefficient r when the dichotomous variable is coded numerically as 0 and 1.[17] This equivalence arises because the point-biserial formula is a direct substitution of the Pearson formula for the case where one variable is binary, specifically incorporating the variance of the dichotomous variable, which is p(1-p) where p is the proportion of cases in the "1" category.[17] To demonstrate this, recall the Pearson correlation formula:
r = \frac{\sum_{i=1}^N (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^N (X_i - \bar{X})^2} \cdot \sqrt{\sum_{i=1}^N (Y_i - \bar{Y})^2}},
where, in this section, X is the continuous variable, Y is the dichotomous variable coded as 0 or 1, \bar{X} and \bar{Y} are their means, and N is the sample size. With Y binary, \bar{Y} = p and the denominator's second square root simplifies to \sqrt{N p (1-p)}. The first square root equals \sqrt{N} \, s_X, where s_X is the standard deviation of X computed with divisor N. The numerator, which is N times the covariance, becomes N p (1-p) ( \bar{X}_1 - \bar{X}_0 ), where \bar{X}_1 and \bar{X}_0 are the means of X for Y=1 and Y=0, respectively. Substituting these terms yields:
r = \frac{ ( \bar{X}_1 - \bar{X}_0 ) \sqrt{p (1-p)} }{ s_X },
which is identical to the standard point-biserial formula, confirming the direct substitution and equivalence.[17]
Both the point-biserial and Pearson coefficients range from -1 to +1.[17] The extremes r_{pb} = 1 and r_{pb} = -1 are attained only when the continuous variable is constant within each group, so that it takes exactly two values and group membership determines it completely; non-overlapping groups alone do not produce \pm 1 if there is any variation within a group. A value of r_{pb} = 0 corresponds to equal group means, that is, no linear association. Although computationally identical under the 0/1 coding, the point-biserial coefficient is preferred over the general Pearson correlation for interpretability when one variable is explicitly dichotomous, as its formula highlights the role of group proportions and mean differences in a manner tailored to binary-continuous associations.[17]
Computation and Estimation
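Because r_{pb} is a Pearson correlation, its usual significance test, t = r_{pb} \sqrt{(n - 2) / (1 - r_{pb}^2)} with n - 2 degrees of freedom, coincides with the pooled-variance two-sample t-test comparing the two group means. The sketch below illustrates this with SciPy; the data are made up for illustration, and the variable names are assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: x is the 0/1 indicator, y the continuous variable.
x = np.array([0, 1, 0, 1, 1, 0, 1, 0])
y = np.array([2.1, 3.5, 1.8, 4.2, 3.9, 2.6, 3.1, 2.4])
n = len(y)

# Point-biserial correlation and its two-sided p-value.
r, p_r = stats.pointbiserialr(x, y)

# t statistic implied by r, with n - 2 degrees of freedom.
t_from_r = r * np.sqrt((n - 2) / (1 - r**2))

# Pooled-variance (equal_var=True) two-sample t-test: group 1 minus group 0.
t_pooled, p_t = stats.ttest_ind(y[x == 1], y[x == 0], equal_var=True)

print(t_from_r, t_pooled)
print(p_r, p_t)
```

The two t statistics and the two p-values should match up to rounding, so testing r_{pb} against zero and testing the group means for equality are the same inference under the equal-variance assumption.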
Step-by-Step Calculation
The point-biserial correlation coefficient is computed through a series of straightforward steps using its formula, which relates the difference in group means to the variability in the continuous variable, adjusted by the binary variable's distribution.
- Code the binary variable (X) as 0 for one category and 1 for the other, ensuring consistent assignment across the dataset.[3]
- Calculate the proportion P of observations where X = 1, given by P = (number of 1s) / n, where n is the total number of observations.[3]
- Compute the mean M_1 of the continuous variable (Y) for all observations where X = 1, and the mean M_0 for all observations where X = 0.[3]
- Determine the standard deviation S_y of the continuous variable Y across all n observations, using the divisor-n formula S_y = \sqrt{\frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{n}}, where \bar{Y} is the overall mean of Y.[3] If the sample standard deviation with divisor n - 1 is used instead, the factor \sqrt{P(1 - P)} in the next step must be replaced by \sqrt{\frac{n_1 n_0}{n(n - 1)}}, where n_1 and n_0 are the two group sizes, for the result to equal Pearson's r.
- Substitute the values into the point-biserial correlation formula: r_{pb} = \frac{M_1 - M_0}{S_y} \sqrt{P(1 - P)}.
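The steps above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions, not a reference one: the function name `point_biserial` is hypothetical, the data are invented, and the standard deviation uses divisor n so that the result coincides with Pearson's r. The variant with the divisor n - 1 standard deviation and the factor \sqrt{n_1 n_0 / (n(n - 1))} is included for comparison.

```python
import numpy as np
from scipy.stats import pointbiserialr


def point_biserial(x, y):
    """Point-biserial r from group means; x must be coded 0/1 (hypothetical helper)."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    if not np.isin(x, (0, 1)).all():
        raise ValueError("x must contain only 0 and 1")
    p = np.mean(x == 1)        # proportion coded 1
    m1 = y[x == 1].mean()      # mean of y where x == 1
    m0 = y[x == 0].mean()      # mean of y where x == 0
    s_y = y.std(ddof=0)        # divisor-n standard deviation of all y
    return (m1 - m0) / s_y * np.sqrt(p * (1 - p))


# Invented example data.
x = np.array([0, 1, 0, 1, 1])
y = np.array([2.1, 3.5, 1.8, 4.2, 3.9])

r_manual = point_biserial(x, y)
r_scipy = pointbiserialr(x, y)[0]

# Equivalent form using the divisor n - 1 standard deviation.
n, n1, n0 = len(y), int((x == 1).sum()), int((x == 0).sum())
r_alt = (y[x == 1].mean() - y[x == 0].mean()) / y.std(ddof=1) * np.sqrt(n1 * n0 / (n * (n - 1)))

print(r_manual, r_scipy, r_alt)
```

All three printed values should agree up to floating-point rounding.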
Software Implementation
The point-biserial correlation coefficient can be computed in R using the base function cor.test() with the method="pearson" argument after coding the binary variable as numeric values 0 and 1, as the point-biserial is equivalent to the Pearson correlation under this coding.[18] A sample code snippet is as follows:

```r
# Example data: binary_var (0/1) and continuous_var
binary_var <- c(0, 1, 0, 1, 1)  # Coded as numeric
continuous_var <- c(2.1, 3.5, 1.8, 4.2, 3.9)

# Using base R
result_base <- cor.test(binary_var, continuous_var, method = "pearson")
print(result_base$estimate)  # Point-biserial coefficient
```

In Python, the scipy.stats module includes the pointbiserialr() function to directly compute the point-biserial correlation between a binary variable (coded as 0 and 1) and a continuous variable, returning the coefficient and p-value.[19] The pingouin library's corr() function with method="pearson" gives the same coefficient on 0/1-coded data, along with additional statistical outputs such as a confidence interval. An executable code example is:

```python
import numpy as np
import pandas as pd
import pingouin as pg
from scipy.stats import pointbiserialr

# Example data
binary_var = np.array([0, 1, 0, 1, 1])  # Coded as 0 and 1
continuous_var = np.array([2.1, 3.5, 1.8, 4.2, 3.9])

# Using scipy
corr_scipy, p_value = pointbiserialr(binary_var, continuous_var)
print(f"Point-biserial coefficient: {corr_scipy}")

# Using pingouin: Pearson's r on the 0/1-coded variable
df = pd.DataFrame({'binary': binary_var, 'continuous': continuous_var})
result_pg = pg.corr(df['binary'], df['continuous'], method='pearson')
print(result_pg['r'].iloc[0])  # Point-biserial coefficient
```

For SPSS, the point-biserial correlation is obtained via the Bivariate Correlations procedure (Analyze > Correlate > Bivariate), selecting the Pearson option; the binary variable must be coded as 0 and 1.[4] The syntax example is:

```
CORRELATIONS
  /VARIABLES=binary_var continuous_var
  /MISSING=PAIRWISE.
```

In SAS, PROC CORR computes the point-biserial correlation as a Pearson correlation by specifying the binary variable (coded as 0 and 1) and the continuous variable in the VAR statement, with options for handling missing data such as NOMISS for listwise deletion. A syntax example is:

```sas
PROC CORR DATA=dataset;
  VAR binary_var continuous_var;
RUN;
```

In Microsoft Excel, the point-biserial correlation can be calculated using the =CORREL() function on ranges containing the continuous variable and the binary variable coded as 0 and 1, or via the Analysis ToolPak add-in by selecting the Correlation tool under Data Analysis to generate a correlation matrix including the point-biserial as a Pearson value. For example, if binary data is in A2:A6 and continuous in B2:B6, enter =CORREL(A2:A6, B2:B6).
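When implementing the point-biserial correlation in software, ensure the binary variable is properly coded as numeric 0 and 1 to treat it as dichotomous, as incorrect coding (e.g., as factors or text) may lead to errors or inappropriate computations.[18] Handling of missing data differs between tools. cor.test() in R drops incomplete pairs, and PROC CORR in SAS uses pairwise deletion by default, excluding only the pairs with missing values for each correlation; listwise deletion (for example, NOMISS in PROC CORR) restricts the analysis to complete cases, which may reduce sample size. With only two variables the two approaches give the same result; they differ when a correlation matrix over several variables is requested. SciPy's pointbiserialr() does not remove missing values on its own, so NaN entries should be dropped before calling it.[19]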
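The coding and missing-data points can be illustrated with a short Python sketch. It assumes that incomplete rows are removed explicitly before the correlation is computed, so the result does not depend on how a particular library treats missing values; the column names, category labels, and data are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import pointbiserialr

# Hypothetical data: a text-coded group and one missing score.
df = pd.DataFrame({
    "group": ["control", "treated", "control", "treated", "treated", "control"],
    "score": [2.1, 3.5, 1.8, np.nan, 3.9, 2.4],
})

# Explicit numeric 0/1 coding of the binary variable.
df["group01"] = df["group"].map({"control": 0, "treated": 1})

# Keep only rows where both variables are observed.
complete = df.dropna(subset=["group01", "score"])

r, p_value = pointbiserialr(complete["group01"], complete["score"])
print(f"n = {len(complete)}, r_pb = {r:.3f}, p = {p_value:.3f}")
```

Making the 0/1 mapping explicit also documents which category is coded as 1, which determines the sign of the coefficient.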