Scree plot
A scree plot is a graphical tool in multivariate statistics: a line plot of the eigenvalues of principal components or factors, in decreasing order, against their component numbers. It helps determine the appropriate dimensionality of a dataset by showing how many components capture substantial variance.[1] Introduced by psychologist Raymond B. Cattell in 1966 as the "scree test" for factor analysis, it visualizes the point at which additional components yield diminishing returns in explained variance. The curve often resembles a slope of loose rocks, hence the name, borrowed from geological scree.[2] In principal component analysis (PCA), the plot shows eigenvalues ordered from largest to smallest on the y-axis and component indices on the x-axis, allowing analysts to retain the components that account for the bulk of the data's variability while discarding those that mostly reflect noise.[3]
The primary interpretation method is the "elbow rule": one looks for a sharp bend in the curve beyond which the eigenvalues level off, suggesting that further components add little information. For instance, if the plot flattens after the third component, the first three would be retained; in such a case they may together explain a large share of the variance (over 80%, say).[1] This criterion is subjective yet widely adopted and helps balance model parsimony and explanatory power, though alternatives such as parallel analysis or the Kaiser criterion (retaining eigenvalues greater than 1) are sometimes used alongside it to reduce ambiguity. Scree plots are implemented in statistical software such as Minitab and R, and their use extends beyond PCA to exploratory factor analysis, where they support dimensionality reduction for high-dimensional datasets in fields like psychometrics, genetics, and environmental science.[3]
Definition and Purpose
Definition
A scree plot is a graphical representation in multivariate statistics, typically drawn as a line plot with the component numbers on the x-axis and, on the y-axis, the eigenvalues (or singular values) derived from the covariance or correlation matrix of a dataset, arranged in descending order of magnitude.[1][4] The term "scree" originates from geology, where it refers to loose rock debris or rubble that accumulates at the base of a steep slope or cliff. Here it serves as a metaphor for the small trailing eigenvalues, which represent the residual variance left after the major principal components have been extracted.[5][6] In exploratory data analysis, the scree plot visualizes the variance explained by each successive component, aiding in the assessment of data structure without implying specific interpretive rules.[7] This tool plays a fundamental role in principal component analysis (PCA) for dimensionality reduction by highlighting the relative importance of components based on their associated eigenvalues.[8]
Purpose in Dimensionality Reduction
The scree plot serves as a visual diagnostic tool in dimensionality reduction techniques such as principal component analysis (PCA) and exploratory factor analysis (EFA). The eigenvalues are plotted in descending order, and the number of principal components or factors to retain is read from the point where the curve stops dropping steeply and begins to level off.[2] This approach enables analysts to retain only those components that capture the majority of the data's variability, thereby simplifying complex datasets while minimizing information loss.[2]
By facilitating the selection of a subset of components that account for substantial variance, the scree plot helps prevent overfitting in subsequent modeling tasks: retaining excessive minor components, which often represent random noise rather than meaningful structure, can lead to models that perform poorly on unseen data.[9] Dimensionality reduction via scree-guided component selection thus promotes more robust generalization, particularly in high-dimensional settings where noise dominates lower-variance directions.[10]
A key objective of the scree plot is to strike a balance between preserving a high proportion of total variance, typically aiming for 80-90% cumulative explained variance, and maintaining model parsimony to enhance interpretability and computational efficiency.[11] This trade-off ensures that the reduced representation captures essential patterns without unnecessary complexity, supporting reliable inference in downstream analyses.[12]
In practice, the scree plot is widely applied in psychometrics for refining latent variable models from questionnaire data with numerous items, and in bioinformatics for distilling gene expression profiles involving thousands of variables into manageable dimensions that highlight biological signals.[2][13]
Construction
Steps to Create a Scree Plot
To create a scree plot from a multivariate dataset, begin by preparing the data and following a structured computational process rooted in principal component analysis.
- Compute the covariance or correlation matrix of the dataset. Start with a data matrix of p variables and n observations, centering the variables by subtracting their means to place the data cloud at the origin. If the variables have differing scales or units, standardize them (divide by standard deviations) and compute the p \times p correlation matrix; otherwise, use the covariance matrix to capture raw variances. This matrix summarizes the pairwise relationships among variables and serves as the input for decomposition.[14][15]
- Perform eigenvalue decomposition on the matrix. Apply spectral decomposition to the covariance or correlation matrix to extract the eigenvalues \lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p and corresponding eigenvectors. These eigenvalues quantify the variance captured by each successive principal component, with the sum of all eigenvalues equaling the total variance in the data (trace of the matrix).[14][16]
- Plot the eigenvalues against component indices. Arrange the eigenvalues in descending order and create a graph with the component numbers (1 to p) on the x-axis and the eigenvalue magnitudes on the y-axis. Use a line plot connecting the points or a bar graph for visualization, which reveals the distribution of variance across components.[16][14]
- Optionally, incorporate cumulative variance or scree ratios. To enhance interpretability, overlay a line showing the cumulative proportion of total variance explained by the first k components (computed as \sum_{i=1}^k \lambda_i / \sum_{i=1}^p \lambda_i) or plot ratios of consecutive eigenvalues (\lambda_k / \lambda_{k+1}) to highlight drops in importance. These additions are not part of the core scree plot but aid in assessing dimensionality.[1]
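The steps above can be sketched in Python with NumPy. This is a minimal illustration, not a reference implementation: the function names `scree_values` and `plot_scree`, the synthetic dataset, and the 0.80 cumulative-variance threshold are all assumptions made for the example. Plotting uses Matplotlib, imported only inside `plot_scree` so that the numerical part runs without it.

```python
import numpy as np


def scree_values(X, standardize=True):
    """Return eigenvalues (descending), cumulative variance share, and consecutive ratios.

    X is an (n observations) x (p variables) array. The function name and
    signature are hypothetical, chosen for this sketch.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # center each variable
    if standardize:
        Xc = Xc / Xc.std(axis=0, ddof=1)       # unit variance, so S below is the correlation matrix
    S = np.cov(Xc, rowvar=False)               # p x p covariance (or correlation) matrix
    eigvals = np.linalg.eigvalsh(S)[::-1]      # eigvalsh returns ascending order; reverse it
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    ratios = eigvals[:-1] / eigvals[1:]        # lambda_k / lambda_{k+1}
    return eigvals, cumulative, ratios


def plot_scree(eigvals, cumulative=None):
    """Draw eigenvalues against component number, optionally with cumulative variance."""
    import matplotlib.pyplot as plt            # lazy import: only needed for plotting

    k = np.arange(1, len(eigvals) + 1)
    fig, ax = plt.subplots()
    ax.plot(k, eigvals, marker="o")
    ax.set_xlabel("Component number")
    ax.set_ylabel("Eigenvalue")
    ax.set_xticks(k)
    if cumulative is not None:
        ax2 = ax.twinx()                       # second y-axis for the cumulative proportion
        ax2.plot(k, cumulative, marker="s", linestyle="--", color="gray")
        ax2.set_ylabel("Cumulative proportion of variance")
        ax2.set_ylim(0, 1.05)
    return fig


# Synthetic example (an assumption): 6 observed variables driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.3 * rng.normal(size=(200, 6))

eigvals, cumulative, ratios = scree_values(X)

# Smallest k whose cumulative share reaches the illustrative 0.80 threshold.
k80 = int(np.searchsorted(cumulative, 0.80) + 1)

# To draw the plot: fig = plot_scree(eigvals, cumulative); fig.savefig("scree.png")
```

With `standardize=True` the eigenvalues come from the correlation matrix, so they sum to the number of variables p. Passing `standardize=False` uses the covariance matrix instead, which is appropriate only when the variables share a common scale.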