Bivariate analysis
Bivariate analysis is a fundamental statistical approach used to examine and describe the relationship between two variables, determining whether they are associated, dependent, or correlated, and assessing the strength, direction, and significance of that relationship.[1] This method serves as a foundational step in data analysis, bridging univariate descriptions of single variables to more complex multivariate explorations, and is essential in fields like social sciences, medicine, and economics for identifying patterns in real-world phenomena.[2] It typically involves techniques tailored to the measurement levels of the variables (nominal, ordinal, or interval/ratio), such as contingency tables for categorical data or scatterplots for continuous data, and describes covariation or interdependence without implying causation.[3]

The primary goal of bivariate analysis is to test hypotheses about variable relationships, often using inferential statistics to evaluate whether observed associations are due to chance, with results informing subsequent model building or policy decisions.[1] For categorical variables, common techniques include chi-square tests to assess independence and odds ratios to quantify association strength.[1] For interval or ratio variables, Pearson's correlation coefficient measures linear relationships, with values ranging from -1 to +1 indicating direction and magnitude, while simple linear regression models predict one variable from the other.[1] Additional methods include t-tests, which compare means between two groups (e.g., independent samples for nominal predictors and continuous outcomes), and analysis of variance (ANOVA), which extends this comparison to multiple categories, provided assumptions such as data normality are met for valid inferences.[2]

Bivariate analysis is particularly valuable in exploratory research, where it helps detect spurious correlations or confounders before advancing to controlled multivariate models, and its results are interpreted through p-values (typically ≤0.05 for significance) and effect sizes.[3] Overall, this approach provides concise insights into pairwise interactions, underpinning evidence-based conclusions across disciplines.[1]
Fundamentals
Definition and Scope
Bivariate analysis encompasses statistical methods designed to examine and describe the relationship between exactly two variables, assessing aspects such as the strength, direction, and form of their association.[4] This approach focuses on bivariate data, where one variable is often treated as independent (explanatory) and the other as dependent (outcome), enabling researchers to explore potential patterns without assuming causality.[5] The scope of bivariate analysis extends to various data types, including continuous, discrete, and categorical variables, making it versatile for applications across fields like social sciences, medicine, and economics.[3] It stands in contrast to univariate analysis, which describes the distribution or central tendencies of a single variable, and multivariate analysis, which handles interactions among three or more variables for more complex modeling.[6]

Historically, bivariate analysis originated in 19th-century statistics, with Francis Galton introducing key concepts like regression to the mean through studies on heredity in the 1880s, and Karl Pearson formalizing correlation measures around 1896 to quantify variable relationships.[7]

The primary purpose of bivariate analysis is to identify underlying patterns in data, test hypotheses regarding variable associations, and provide foundational insights that can inform subsequent predictive modeling, such as simple regression techniques.[3] By evaluating whether observed relationships are statistically significant or attributable to chance, it supports evidence-based conclusions while emphasizing that correlation does not imply causation.[4] Graphical tools, such as scatterplots, often complement these methods by visualizing associations.[6]
Types of Variables Involved
In bivariate analysis, variables are classified based on their measurement scales, which determine the appropriate analytical approaches. Quantitative variables include continuous types, which can take any value within a range, such as height in meters or temperature in Celsius (interval scale, where differences are meaningful but ratios are not due to the arbitrary zero point), and ratio scales like weight in kilograms, which have a true zero and allow for meaningful ratios.[8][9] Discrete variables, a subset of quantitative data, consist of countable integers, such as the number of children in a family or daily phone calls received.[10][8] Qualitative variables are categorical, divided into nominal, which lack inherent order (e.g., eye color or gender), and ordinal, which have a ranked order but unequal intervals (e.g., education levels from elementary to postgraduate or Likert scale responses from "strongly disagree" to "strongly agree").[8][10][9]

The pairings of these variable types shape bivariate analysis strategies. Continuous-continuous pairings, like temperature and ice cream sales, enable examination of linear relationships using methods such as correlation.[8][11] Continuous-categorical pairings, such as income (continuous) and gender (nominal), often involve group comparisons like t-tests for two categories or ANOVA for multiple.[11][10] Categorical-categorical pairings, for instance, smoking status (nominal) and disease presence (nominal) or voting preference (ordinal) and age group (ordinal), rely on contingency tables to assess associations.[8][11]

These classifications carry key implications for method selection: continuous variable pairs generally suit parametric techniques assuming normality and equal variances, while categorical pairs necessitate non-parametric approaches or contingency table methods to handle unordered or ranked data without assuming underlying distributions.[8][12] For example, Pearson correlation fits continuous pairs like height and weight, whereas chi-square tests apply to categorical pairs like gender and voting preference.[11][12]
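This mapping from variable-type pairing to analytical technique can be illustrated with a short sketch. The example below is a minimal illustration, assuming NumPy and SciPy are available; all datasets and variable names are hypothetical, generated only to show which routine corresponds to each pairing.

```python
# Minimal sketch: matching a bivariate technique to the variable-type pairing.
# All data below are synthetic and purely illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Continuous-continuous pairing (e.g., temperature and ice cream sales): Pearson correlation.
temperature = rng.normal(25, 5, 100)
sales = 30 + 2.0 * temperature + rng.normal(0, 5, 100)
r, p_r = stats.pearsonr(temperature, sales)

# Continuous-categorical pairing with two groups (e.g., income by gender): independent-samples t-test.
income_group_a = rng.normal(50_000, 8_000, 60)
income_group_b = rng.normal(52_000, 8_000, 60)
t, p_t = stats.ttest_ind(income_group_a, income_group_b)

# Categorical-categorical pairing (e.g., smoking status and disease presence): chi-square test.
contingency = np.array([[30, 70],
                        [10, 90]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(contingency)

print(f"Pearson r = {r:.2f} (p = {p_r:.3g})")
print(f"t = {t:.2f} (p = {p_t:.3g})")
print(f"chi-square = {chi2:.2f}, df = {dof} (p = {p_chi2:.3g})")
```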
Measures of Linear Association
Covariance
Covariance is a statistical measure that quantifies the extent to which two random variables vary together, capturing the direction and degree of their linear relationship. A positive covariance indicates that the variables tend to increase or decrease in tandem, a negative value signifies that one tends to increase as the other decreases, and a value of zero suggests no linear dependence between them.[13] This measure serves as a foundational building block for understanding bivariate associations, though it does not imply causation.[14]

The sample covariance between two variables X and Y, based on n observations, is given by the formula \operatorname{Cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}), where \bar{X} and \bar{Y} denote the sample means of X and Y, respectively.[15] This estimator is unbiased for the population covariance and uses the divisor n-1 to account for degrees of freedom in the sample.[16] The sign of the covariance reflects the direction of co-variation, but its magnitude is sensitive to the units and scales of the variables involved.[14]

In terms of interpretation, the units of covariance are the product of the units of the two variables (for example, if one variable is measured in inches and the other in pounds, the covariance would be in inch-pounds), making direct comparisons across different datasets challenging without normalization.[17] Consider a sample of adult heights (in inches) and weights (in pounds): taller individuals often weigh more, yielding a positive covariance value, illustrating how greater-than-average height deviations align with greater-than-average weight deviations.[18]

Despite its utility, covariance has notable limitations: it lacks a standardized range (unlike measures bounded between -1 and 1), so values cannot be directly interpreted in terms of strength without considering variable scales, and it is not comparable across studies with differing units or variances.[14] Additionally, while the sign indicates direction, the absolute value does not provide a scale-invariant assessment of association strength.[13]
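A minimal sketch of this formula, assuming NumPy is available and using a small hypothetical height-weight sample, computes the sample covariance directly and cross-checks it against the library routine:

```python
# Sample covariance of hypothetical height (inches) and weight (pounds) data.
import numpy as np

heights = np.array([63.0, 65.0, 68.0, 70.0, 72.0, 74.0])        # inches
weights = np.array([127.0, 140.0, 152.0, 165.0, 180.0, 195.0])  # pounds

n = len(heights)
cross_deviations = (heights - heights.mean()) * (weights - weights.mean())
cov_manual = cross_deviations.sum() / (n - 1)   # divisor n-1 gives the unbiased sample covariance

# Cross-check: off-diagonal entry of the 2x2 sample covariance matrix.
cov_numpy = np.cov(heights, weights, ddof=1)[0, 1]

print(f"manual: {cov_manual:.2f} inch-pounds, numpy: {cov_numpy:.2f} inch-pounds")
```

Because the result is expressed in inch-pounds, its magnitude depends on the chosen units, which motivates the standardized coefficient described next.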
Pearson Correlation Coefficient
The Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient, is a standardized measure of the strength and direction of the linear relationship between two continuous variables, ranging from -1 to +1, where -1 indicates a perfect negative linear association, +1 a perfect positive linear association, and 0 no linear association.[19][20] It was developed by Karl Pearson as an extension of earlier work on regression and inheritance, providing a scale-invariant alternative to covariance by normalizing the latter with the standard deviations of the variables.[19]

The formula for the sample Pearson correlation coefficient r is given by: r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}, where \bar{X} and \bar{Y} are the sample means of variables X and Y, \text{Cov}(X, Y) is the sample covariance, and \sigma_X and \sigma_Y are the sample standard deviations.[20] To calculate r, first compute the means \bar{X} and \bar{Y}; then determine the deviations (X_i - \bar{X}) and (Y_i - \bar{Y}) for each paired observation; next, sum the products of these deviations to obtain the numerator (covariance term) and sum the squared deviations separately for the denominator components; finally, divide the covariance by the product of the standard deviations.[20]

Interpretation focuses on the value of r: the absolute value |r| indicates the strength of the linear association, with values near 0 suggesting a weak or absent linear relationship and values near 1 suggesting a strong one, while the sign denotes direction (positive for direct, negative for inverse). For example, a strong positive correlation (r close to 1) between variables like study time and exam performance would indicate that higher values of one tend to associate with higher values of the other.

To assess statistical significance, a t-test is used under the null hypothesis of no population correlation (\rho = 0): t = r \sqrt{\frac{n-2}{1 - r^2}}, with degrees of freedom df = n - 2, where n is the sample size; the resulting t-value is compared to a t-distribution to obtain a p-value.[21] The method assumes a linear relationship between the variables, interval or ratio level data, and bivariate normality (i.e., each variable is normally distributed and their joint distribution is normal); homoscedasticity is also relevant to related inference, and violations tend to affect significance testing more than the coefficient itself.[22][20]
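The step-by-step calculation and the accompanying significance test can be sketched as follows; the study-time and exam-score values are hypothetical, and SciPy is assumed to be available for the two-sided p-value and a cross-check:

```python
# Pearson's r computed from deviations, plus its t-test (hypothetical study-time/exam data).
import numpy as np
from scipy import stats

study_hours = np.array([2.0, 3.5, 5.0, 6.0, 7.5, 9.0, 10.0, 12.0])
exam_scores = np.array([55.0, 60.0, 62.0, 70.0, 72.0, 80.0, 85.0, 90.0])

n = len(study_hours)
dx = study_hours - study_hours.mean()   # deviations from the mean of X
dy = exam_scores - exam_scores.mean()   # deviations from the mean of Y

# Numerator: sum of cross-products; denominator: product of root sums of squares.
r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# t-test of H0: rho = 0, with df = n - 2.
t = r * np.sqrt((n - 2) / (1 - r ** 2))
p_value = 2 * stats.t.sf(abs(t), df=n - 2)   # two-sided p-value

# Cross-check against SciPy's built-in routine.
r_check, p_check = stats.pearsonr(study_hours, exam_scores)
print(f"r = {r:.3f}, t = {t:.2f}, p = {p_value:.4f} (scipy: r = {r_check:.3f}, p = {p_check:.4f})")
```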
Non-Parametric and Categorical Measures
Spearman Rank Correlation
The Spearman rank correlation coefficient, denoted as \rho, is a nonparametric measure of the strength and direction of the monotonic association between two variables, assessing how well the relationship can be described by a monotonically increasing or decreasing function rather than assuming linearity.[23] Introduced by Charles Spearman in 1904, it operates by converting the original data into ranks, making it suitable for detecting associations where the raw data may not meet parametric assumptions.[23] The coefficient ranges from -1, indicating a perfect negative monotonic relationship where higher ranks in one variable correspond to lower ranks in the other, to +1, indicating a perfect positive monotonic relationship, with 0 signifying no monotonic association.[24]

The formula for the Spearman rank correlation coefficient is given by \rho = 1 - \frac{6 \sum_{i=1}^n d_i^2}{n(n^2 - 1)}, where d_i represents the difference between the ranks of the i-th paired observations from the two variables, and n is the number of observations.[23] To calculate \rho, the values of each variable are first converted to ranks, typically assigning rank 1 to the smallest value and rank n to the largest, with the process performed separately for each variable.[25] The rank differences d_i are then computed for each pair, squared, and summed before substitution into the formula.[26] In cases of tied values within a variable, the average of the tied ranks is assigned to each tied observation to maintain consistency; for example, if two values tie for second and third place, both receive a rank of 2.5.[25]

The interpretation of \rho is analogous to that of the Pearson correlation coefficient in terms of strength and direction but focuses on monotonic rather than strictly linear relationships, offering greater robustness to outliers and departures from normality since it relies on ranks rather than raw scores.[24] Statistical significance of \rho can be assessed through permutation tests, which reshuffle the paired ranks to generate an empirical null distribution, or by comparing the observed value to critical values from standard statistical tables.[27]

Spearman rank correlation is recommended for analyzing non-normally distributed continuous data, ordinal variables, or situations where a nonlinear but monotonic relationship is anticipated, as these conditions violate the assumptions of parametric alternatives like Pearson's method.[24] For example, in socioeconomic research, a \rho = 0.72 between ranked levels of education (e.g., high school, bachelor's, graduate) and income brackets might indicate a strong positive monotonic trend, where higher education consistently associates with higher income without assuming a straight-line relationship.[24]
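A brief sketch of the rank-difference formula, using hypothetical education-rank and income values with no ties and assuming SciPy for the ranking step and a cross-check, might look like this:

```python
# Spearman's rho from the rank-difference formula (hypothetical, tie-free data).
import numpy as np
from scipy import stats

education_level = np.array([1, 2, 3, 4, 5, 6, 7, 8])   # ordinal codes, e.g., increasing education
income = np.array([28_000, 35_000, 31_000, 45_000, 52_000, 50_000, 61_000, 75_000])

# Rank each variable separately (rank 1 = smallest); ties would receive averaged ranks.
rank_x = stats.rankdata(education_level)
rank_y = stats.rankdata(income)

n = len(income)
d = rank_x - rank_y                                   # rank differences d_i
rho = 1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1))     # rank-difference formula

# Cross-check against SciPy; with no ties the two values agree.
rho_check, p_check = stats.spearmanr(education_level, income)
print(f"rho (formula) = {rho:.3f}, rho (scipy) = {rho_check:.3f}, p = {p_check:.4f}")
```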
Chi-Square Test of Independence
The chi-square test of independence is a non-parametric statistical test used to assess whether there is a significant association between two categorical variables in a bivariate analysis. It evaluates the null hypothesis that the variables are independent, implying no relationship between their distributions, against the alternative hypothesis that they are dependent. This test is particularly suited for nominal data organized in contingency tables, where it compares observed frequencies to those expected under independence.[28]

The test statistic is computed using the formula \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}, where O_{ij} represents the observed frequency in the i-th row and j-th column of the contingency table, and E_{ij} is the expected frequency for that cell, calculated as E_{ij} = \frac{( \sum_j O_{ij} ) ( \sum_i O_{ij} ) }{ \sum_i \sum_j O_{ij} }, with the sums denoting the row and column marginal totals and the grand total, respectively. Under the null hypothesis, this statistic approximately follows a chi-square distribution with degrees of freedom (r-1)(c-1), where r and c are the number of rows and columns.[29]

To perform the test, the procedure is as follows:
- Construct a contingency table displaying the observed frequencies for the cross-classification of the two categorical variables, ensuring the data represent a random sample.
- Calculate the expected frequencies for each cell using the marginal totals and grand total as specified in the formula.
- Compute the chi-square statistic by summing the squared differences between observed and expected frequencies, each divided by the expected frequency.
- Determine the degrees of freedom and obtain the p-value from the chi-square distribution (typically using statistical software or tables for the right-tailed test).
- Compare the p-value to a significance level (e.g., \alpha = 0.05); if the p-value is less than \alpha, reject the null hypothesis of independence.[30]
For example, consider the following contingency table of observed frequencies for a random sample of 1,000 respondents, cross-classified by gender and political party preference (a worked computation based on this table is sketched below):

| Gender | Republican | Democrat | Independent | Total |
|---|---|---|---|---|
| Male | 200 | 150 | 50 | 400 |
| Female | 250 | 300 | 50 | 600 |
| Total | 450 | 450 | 100 | 1,000 |
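Using the observed counts in the table above, the expected frequencies, test statistic, degrees of freedom, and p-value can be computed directly. The sketch below assumes NumPy and SciPy are available; the final library call is only a cross-check.

```python
# Chi-square test of independence for the gender/party contingency table above.
import numpy as np
from scipy import stats

observed = np.array([[200, 150, 50],    # Male:   Republican, Democrat, Independent
                     [250, 300, 50]])   # Female: Republican, Democrat, Independent

row_totals = observed.sum(axis=1, keepdims=True)    # 400 and 600
col_totals = observed.sum(axis=0, keepdims=True)    # 450, 450, and 100
grand_total = observed.sum()                        # 1,000

# Expected counts under independence, e.g., E(Male, Republican) = 400 * 450 / 1000 = 180.
expected = row_totals * col_totals / grand_total

chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)   # (2 - 1)(3 - 1) = 2
p_value = stats.chi2.sf(chi2, df=dof)                     # right-tailed p-value

# Cross-check with SciPy's built-in routine.
chi2_check, p_check, dof_check, expected_check = stats.chi2_contingency(observed, correction=False)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
```

For these counts the statistic works out to approximately \chi^2 = 16.2 with 2 degrees of freedom and a p-value of about 0.0003, so at \alpha = 0.05 the null hypothesis of independence between gender and party preference would be rejected.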