Biplot
A biplot is a graphical method in multivariate statistics that simultaneously displays the observations (rows) and variables (columns) of a data matrix in a low-dimensional plane, typically two dimensions: observations appear as points and variables as vectors, positioned so that the inner products of their coordinates approximate the original matrix in a least-squares sense. This visualization enables the assessment of relationships such as distances between observations, correlations among variables, and contributions of variables to observations, making it a powerful tool for exploratory data analysis.[1] The two-dimensional display is exact for matrices of rank two; for matrices of higher rank it approximates the dominant structure of the data.
Introduced by K. R. Gabriel in 1971,[2] the biplot was originally developed as a graphical extension of principal component analysis (PCA), where observations are projected onto principal components and variables are represented by their loadings. Gabriel's seminal work emphasized the biplot's utility in visually appraising the structure of large matrices, such as those in meteorological or biological data, by allowing direct interpretation of how variables influence observations through vector projections.[3] Over time, the method has been generalized to other techniques, including correspondence analysis for categorical data and additive main effects and multiplicative interaction (AMMI) models for multi-environment trials in agriculture.[4]
Biplots have found wide applications across disciplines, from genomics and ecology to economics and sensory analysis, due to their ability to reveal patterns like clustering of observations or variable redundancy in high-dimensional datasets.[5] Variants such as row-metric preserving (RM) and column-metric preserving (CM) biplots adjust the scaling to emphasize either observation distances or variable correlations, enhancing interpretability for specific analytical goals.[2] Modern implementations in statistical software, such as R and Stata, facilitate interactive exploration, though care must be taken to account for the approximation's limitations in higher ranks where only the first few components are visualized.[6]
Introduction
Definition
A biplot is an exploratory graphical display in statistics that generalizes the scatterplot by simultaneously representing both the rows (typically observations or samples) and columns (typically variables) of a multivariate data matrix on a single two-dimensional plot.[7] Introduced in the context of principal component analysis, it plots row markers as points to depict observations and column markers as vectors or arrows to depict variables, all overlaid on shared axes derived from the principal components of the data.[8] This dual representation allows for a compact visualization of high-dimensional data structures.[9]
The primary purpose of a biplot is to approximate the original data matrix using a low-rank (typically rank-two) decomposition, where each element of the matrix is reconstructed via the inner product (scalar product) of the corresponding row and column markers.[7] This approximation facilitates pattern recognition by enabling visual assessment of relationships, such as similarities among observations or correlations between variables, in complex multivariate datasets without needing to examine the full matrix.[8] Biplots are particularly useful in principal component analysis contexts for exploring data configurations, though they extend to other matrix decompositions.[10]
Historical Development
The biplot was introduced by K. Ruben Gabriel in 1971 through his seminal paper titled "The biplot graphic display of matrices with applications to principal component analysis," published in Biometrika. This innovation provided a graphical method to simultaneously approximate the rows (observations) and columns (variables) of a data matrix in a low-dimensional space, initially applied to principal component analysis (PCA) for visualizing variances, covariances, and distances.[11]
In the 1970s, biplots gained traction in multivariate statistical analysis, particularly for PCA applications in fields such as meteorology and biomedicine, where they facilitated the inspection of data structures and diagnosis of patterns in matrices.[12] During the 1980s and 1990s, the technique saw significant extensions to handle categorical data via integration with correspondence analysis, as detailed by Michael Greenacre, enabling biplots for contingency tables and profile data.[13] Concurrently, efforts to improve robustness against outliers emerged, with adaptations incorporating robust estimation methods to mitigate the influence of anomalous observations in multivariate displays.[14]
Modern advancements have further diversified biplot methodologies. The HJ-biplot, proposed by M.P. Galindo in 1986, optimized simultaneous Euclidean representations of rows and columns by balancing their goodness-of-fit.[5] In agronomy, the genotype plus genotype-by-environment (GGE) biplot, developed by Weikai Yan and Manjit S. Kang in 2002, became a standard tool for analyzing genotype-environment interactions and identifying stable cultivars across trials.[15] Recent developments from 2018 to 2025 include sparse variants, such as the sparse HJ-biplot using elastic net regularization to enhance interpretability in high-dimensional data, and iterative algorithms for correlation matrix biplots that improve approximations through column adjustments.[16][17]
Biplots have profoundly influenced diverse disciplines, including chemometrics for process monitoring and variable selection in analytical chemistry, and ecology for ordinating species distributions and environmental gradients.[18][19]
Theoretical Foundations
Relation to Principal Component Analysis
Principal component analysis (PCA) is a statistical technique for dimensionality reduction that transforms a set of possibly correlated variables into a smaller set of uncorrelated principal components, derived as the eigenvectors of the covariance matrix of the centered data. These eigenvectors are ordered by their corresponding eigenvalues, which quantify the amount of variance captured by each component, with the first principal component (PC1) explaining the maximum variance and subsequent components capturing progressively less while being orthogonal to the previous ones. The scores represent the coordinates of observations in the new principal component space, obtained by projecting the original data onto these eigenvectors, while the loadings describe how much each original variable contributes to a given component.
Biplots are intrinsically linked to PCA as a visualization method for its outputs, particularly for displaying the first two principal components (PC1 and PC2) to illustrate the structure underlying the data's variance. In this context, the biplot plots the scores of observations as points and the loadings of variables as vectors (arrows) in the plane spanned by PC1 and PC2, enabling a joint view of how observations relate to each other and to the variables. This approach, introduced by Gabriel, reveals patterns such as clustering among observations and the directions of maximum variance along the principal axes.[11]
A key distinction of biplots from conventional PCA visualizations lies in their integration: standard PCA plots typically separate score plots (showing observation positions) from loading plots (showing variable contributions), whereas biplots superimpose both elements on the same graph to facilitate the assessment of correlations between observations and variables in a unified framework.[11] Biplots presuppose that the input data has been centered by subtracting the mean from each variable to form the covariance matrix; optional standardization scales variables to unit variance, which is advisable when variables differ markedly in measurement units to prevent dominance by those with larger scales.
For example, applying PCA to Fisher's iris dataset—which comprises measurements of sepal and petal dimensions for 150 flowers across three species—yields a biplot where observations cluster distinctly by species along the PC1 and PC2 axes, with PC1 often aligning with petal length differences that separate the setosa species from the others.
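The score/loading duality described above can be checked numerically. The following numpy sketch (with synthetic data standing in for iris-style measurements, since the actual dataset is not loaded here) derives the same components two equivalent ways: via the eigendecomposition of the covariance matrix and via the SVD of the centered data.

```python
import numpy as np

# Synthetic stand-in for a measurements matrix (rows = observations, cols = variables).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
Z = X - X.mean(axis=0)                 # center each variable

# Route 1: eigenvectors of the covariance matrix are the loadings.
cov = np.cov(Z, rowvar=False)
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1]        # sort components by variance explained
evals, evecs = evals[order], evecs[:, order]
scores_eig = Z @ evecs                 # project observations onto the components

# Route 2: SVD of the centered matrix yields the same components (up to sign).
U, d, Vt = np.linalg.svd(Z, full_matrices=False)
scores_svd = U * d                     # scale left singular vectors by singular values
```

The eigenvalues of the covariance matrix equal the squared singular values divided by n − 1, which is why either route can supply the scores and loadings that a biplot displays.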
Singular Value Decomposition in Biplots
The singular value decomposition (SVD) provides the foundational mathematical framework for constructing biplots by decomposing a centered data matrix X of dimensions n \times p, where n represents the number of observations (rows) and p the number of variables (columns). The SVD expresses X as X = U D V^T, where U is an n \times k matrix with orthonormal columns containing the left singular vectors (also known as scores, representing the principal components for rows), D is a k \times k diagonal matrix with non-negative singular values d_1 \geq d_2 \geq \cdots \geq d_k \geq 0 on the diagonal (whose squares are proportional to the variance explained by each component), V is a p \times k matrix with orthonormal columns containing the right singular vectors (loadings, representing variable contributions), and k = \min(n, p). This decomposition captures the optimal low-rank structure of X in terms of least-squares approximation.[11][20]
For biplot visualization in two dimensions, a rank-2 low-rank approximation is used to project the data onto the plane spanned by the first two principal axes, yielding X \approx \sum_{i=1}^r d_i u_i v_i^T, where r = 2, u_i is the i-th column of U, and v_i is the i-th column of V. This approximation, Y = U_r D_r V_r^T with U_r and V_r being the first two columns of U and V, and D_r the corresponding 2x2 diagonal submatrix of D, minimizes the Frobenius norm \|X - Y\|_F among all rank-2 matrices and preserves the key structural properties of the data for graphical display. The singular values in D_r quantify the relative importance of these dimensions, with the proportion of total variance explained given by \frac{\sum_{i=1}^2 d_i^2}{\sum_{i=1}^k d_i^2}.[11][20]
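The rank-2 truncation and its explained-variance ratio amount to a few lines of numpy; the data matrix here is synthetic, used only to exercise the formulas.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
X = X - X.mean(axis=0)                      # centered data matrix

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-2 truncation: the best rank-2 approximation in the Frobenius norm.
Y = U[:, :2] @ np.diag(d[:2]) @ Vt[:2]

# Proportion of total variance carried by the two plotted dimensions.
explained = np.sum(d[:2] ** 2) / np.sum(d ** 2)

err_svd = np.linalg.norm(X - Y)             # Frobenius error of the truncation
```

By the Eckart–Young theorem, `err_svd` equals the square root of the sum of the discarded squared singular values, so no other rank-2 matrix can fit better.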
To represent observations and variables simultaneously in the biplot, a scaling parameter \alpha (where 0 \leq \alpha \leq 1) adjusts the relative emphasis between row and column markers. The coordinates for row markers (observations) are given by G_\alpha = U_r D_r^\alpha, an n \times 2 matrix whose i-th row provides the position (g_{\alpha,i1}, g_{\alpha,i2}) in the plane. The column markers (variables) are H_\alpha = V_r D_r^{1-\alpha}, a p \times 2 matrix whose j-th row defines a vector from the origin to (h_{\alpha,j1}, h_{\alpha,j2}). Common choices include \alpha = 1 for the row-metric biplot (emphasizing distances between observations), \alpha = 0 for the column-metric biplot (emphasizing correlations between variables), and \alpha = 0.5 for the symmetric biplot (balancing both). These marker coordinates ensure that the biplot axes align with the principal directions from the SVD.[11][20]
The quality of the biplot approximation is evident in the scalar products between row and column markers, which reconstruct the original data entries: the inner product of the i-th row of G_\alpha and the j-th row of H_\alpha satisfies g_{\alpha,i} \cdot h_{\alpha,j} \approx x_{ij} for the i-th observation and j-th variable, with the approximation improving as the retained singular values capture more of the total variance. This property allows the biplot to visually approximate the data matrix while highlighting relationships through projections and angles in the reduced space.[11][20]
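The marker construction and the fact that the inner-product reconstruction does not depend on the choice of \alpha can be sketched as follows; `markers` is an illustrative helper on synthetic data, not a library function.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(12, 4))
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
U2, d2, V2 = U[:, :2], d[:2], Vt[:2].T      # leading two components

def markers(alpha):
    """Row markers G (n x 2) and column markers H (p x 2) for a given alpha."""
    G = U2 * d2 ** alpha          # scale each column of U2 by d_i^alpha
    H = V2 * d2 ** (1.0 - alpha)  # scale each column of V2 by d_i^(1-alpha)
    return G, H

# G H^T reproduces the same rank-2 approximation of X for every alpha;
# only the display geometry (distances vs. angles) changes.
G, H = markers(0.5)
approx = G @ H.T
G1, H1 = markers(1.0)
```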
Constructing a Biplot
Data Preparation and Standardization
Before constructing a biplot, the input data matrix X, typically consisting of n observations (rows) and p variables (columns), must undergo preprocessing to ensure meaningful graphical representations that accurately reflect the underlying structure without distortion from scale or location differences.[20]
Centering is a fundamental step, involving the subtraction of the mean of each variable from its corresponding column to remove the effects of differing central tendencies and focus on variability around the origin. This transforms X into a centered matrix Z = X - \mathbf{1} \mu^T, where \mathbf{1} is a column vector of ones and \mu is the vector of column means; as a result, the row and column means of Z become zero, facilitating the approximation of inter-observation Euclidean distances and variable relationships in the biplot.[20] Centering is essential for classical biplots derived from principal component analysis, as it aligns the data with the assumptions of singular value decomposition by eliminating translation effects.[21]
Standardization, while optional, is often applied when variables exhibit disparate scales, such as measurements in different units, to prevent dominant variables from overshadowing others. This involves dividing each centered column by its standard deviation, yielding Z_{ij}^* = (X_{ij} - \bar{X}_j)/s_j, where s_j is the standard deviation of column j; the resulting matrix has unit variance per column. In biplots of standardized data—commonly termed correlation biplots—the lengths of variable arrows approximate the standard deviations (often equalized to unity), emphasizing correlations over absolute variances and improving interpretability of angular relationships between variables.[20] Without standardization, as in covariance biplots, arrow lengths reflect the original variances, which may distort visualizations if scales vary widely.[20] The choice between centered-only and standardized data influences the biplot's focus: the former preserves magnitude information, while the latter prioritizes relational patterns.[21]
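The two preprocessing steps above reduce to a couple of numpy expressions; the synthetic data here has deliberately mismatched scales to show why standardization matters.

```python
import numpy as np

# Three variables on very different scales (illustrative values).
rng = np.random.default_rng(3)
X = rng.normal(loc=[10.0, 0.1, 500.0], scale=[2.0, 0.05, 100.0], size=(25, 3))

# Centering: subtract column means (input for a covariance biplot).
Z = X - X.mean(axis=0)

# Optional standardization: divide by column standard deviations
# (input for a correlation biplot; puts variables on comparable scales).
Zs = Z / X.std(axis=0, ddof=1)
```

Without the second step, the third variable's large variance would dominate the leading singular vectors and hence the plot.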
Handling missing data is critical to avoid biased representations, with common approaches including exclusion of incomplete observations or imputation methods such as mean substitution or regression-based estimation to fill gaps before centering and standardization. For datasets with substantial missingness, advanced techniques like iterative imputation can preserve multivariate structure, ensuring the biplot's approximations remain robust.[22] When dealing with categorical variables, conversion to dummy variables (binary indicators for each category) is standard, allowing their inclusion in the quantitative framework of the biplot after appropriate centering; this enables visualization alongside continuous variables but requires omitting one category per set to avoid linear dependence in the matrix.[23]
The choice of distance metric further tailors data preparation to the data type: Euclidean distances are suitable for continuous variables post-centering and optional standardization, approximating squared distances in the biplot. For contingency tables involving categorical data, alternatives like the chi-squared metric are preferred, which inherently standardizes row and column profiles by their totals, better capturing associations in frequency data without explicit dummy coding.[24] Standardization's impact extends to biplot interpretability, as it equalizes variable contributions, making arrow lengths comparable and projections more reflective of correlations rather than scale differences, though it may obscure variance hierarchies if not desired.[20]
Algorithm for Generation
The algorithm for generating a biplot begins with the singular value decomposition (SVD) of the prepared data matrix Z, which has been centered and optionally standardized, to obtain a low-rank approximation that captures the principal structure of the data.[20] This decomposition forms the basis for positioning both observations and variables in a shared graphical space, as originally proposed by Gabriel.
The procedural steps are as follows:
1. Compute the SVD of the matrix Z, expressed as Z = U D V^T, where U is the matrix of left singular vectors, V is the matrix of right singular vectors, and D is the diagonal matrix containing the singular values in decreasing order.[20]
2. Select the first r = 2 components for a two-dimensional representation by extracting the leading 2 columns of U (denoted U_2) and V (denoted V_2), along with the 2×2 top-left submatrix of D (denoted D_2). The observation scores are then given by U_2 D_2^\alpha, and the variable loadings by V_2 D_2^{1-\alpha}, where \alpha (with 0 \leq \alpha \leq 1) is a scaling parameter that balances the emphasis between row-metric preservation (\alpha = 1) and column-metric preservation (\alpha = 0); a common choice is \alpha = 0.5 for symmetric scaling.[20][25]
3. Plot the observations as points in the two-dimensional plane, positioning each using its coordinates from the scores matrix along the first principal component (PC1) axis (horizontal) and second principal component (PC2) axis (vertical).[20][25]
4. Plot the variables as arrows extending from the origin to the endpoints defined by the loadings matrix, where the direction indicates the variable's association with the principal components and the length reflects its contribution; for \alpha = 0 (covariance biplot), these lengths are proportional to the standard deviations of the variables.[20]
5. Add labels to the observation points and variable arrows, label the PC1 and PC2 axes with the proportion of total variance each explains (computed from the squared singular values in D_2 divided by the total sum of squared singular values), and optionally add confidence ellipses around groups of observations.[26]
The output is a single 2D graph in which observations and variables occupy the same coordinate system, facilitating the approximation of data entries via inner products between points and arrow endpoints. Although the algorithm can extend to three dimensions by selecting r = 3, such representations are rare in practice due to increased complexity in visual interpretation.[20]
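Steps 1–3 of the algorithm can be collected into a small helper; the rendering steps (points, arrows, labels) are left to a plotting library such as matplotlib. `biplot_coords` is an illustrative name, not a standard API, and the data is synthetic.

```python
import numpy as np

def biplot_coords(X, alpha=0.5, r=2):
    """Compute biplot coordinates: SVD of the centered matrix, truncation to
    r components, and alpha-scaled scores/loadings. Plotting (points for
    rows, arrows for columns) is left to the caller."""
    Z = X - X.mean(axis=0)                        # centering assumed desired
    U, d, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :r] * d[:r] ** alpha            # observation points
    loadings = Vt[:r].T * d[:r] ** (1.0 - alpha)  # variable arrow endpoints
    explained = np.sum(d[:r] ** 2) / np.sum(d ** 2)
    return scores, loadings, explained

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 6))
scores, loadings, explained = biplot_coords(X, alpha=0.5)
```

A caller would then scatter `scores`, draw arrows from the origin to each row of `loadings`, and report `explained` on the axis labels.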
Interpreting Biplots
Representing Observations
In a biplot derived from principal component analysis (PCA), observations are depicted as points positioned according to their scores on the first two principal components, providing a low-dimensional projection of the multivariate data. These points capture the relative positions of the samples in the reduced space, where the coordinates reflect how each observation contributes to the principal components.[11]
The proximity of points to one another indicates similarity among the corresponding observations; clustered points suggest samples that share comparable profiles across the original variables, facilitating the visual identification of groups or patterns in the data.[20] For instance, in the analysis of the Iris dataset comprising measurements of sepal and petal traits for three species (Setosa, Versicolor, and Virginica), the biplot displays distinct clusters for each species, highlighting their separation based on morphological differences.[27]
Distances between observation points in the biplot approximate dissimilarities between the corresponding observations in the original data space, with the metric depending on the scaling: in row-metric preserving (RMP) configurations such as the JK-biplot, inter-point distances approximate Euclidean distances in the standardized data, whereas in the column-metric (GH) biplot they approximate Mahalanobis distances, which account for variable correlations.[28] Outliers manifest as points isolated far from the origin or main clusters, signaling observations that exhibit unusual combinations of variable values relative to the dataset.[20]
To estimate an observation's value for a specific variable, a perpendicular projection is drawn from the point onto the corresponding variable axis in the biplot; the distance along this axis provides an approximate standardized value, with accuracy depending on the variance explained by the selected components.[20]
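This projection rule is simply the inner-product reconstruction restricted to one observation and one variable, as the following numpy sketch on standardized synthetic data shows.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(15, 3))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized data

U, d, Vt = np.linalg.svd(Z, full_matrices=False)
points = U[:, :2] * d[:2]   # observation points (alpha = 1 display)
axes = Vt[:2].T             # variable axis directions (rows of V_2)

# The inner product of a point with a variable's axis vector equals the
# rank-2 estimate of that observation's standardized value: geometrically,
# the perpendicular projection length onto the axis times the axis length.
i, j = 0, 1
estimate = points[i] @ axes[j]
Z2 = points @ axes.T        # full rank-2 reconstruction for comparison
```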
The quality of these approximations for distances, similarities, and projections depends on the percentage of total variance explained by the plotted dimensions; typically, the first two principal components are used, but the explained variance should be checked to assess reliability.[27]
Representing Variables and Relationships
In a biplot, variables from the original data matrix are represented as arrows originating from the plot's origin, with each arrow's direction indicating the variable's correlation with the principal components used to construct the biplot. The length of each arrow approximates the standard deviation of the corresponding variable, such that the squared length reflects the variance explained by the principal components.[25] This scaling ensures that longer arrows denote variables that contribute more substantially to the overall data variance captured in the plot.[20]
The angle between two variable arrows provides an approximation of the correlation between those variables, where the cosine of the angle equals the correlation coefficient. An acute angle (less than 90 degrees) signifies a positive correlation, while an obtuse angle (greater than 90 degrees) indicates a negative one.[26] Arrows that are perpendicular to each other, forming a 90-degree angle, represent uncorrelated variables, highlighting their independence in the data structure.[29]
To approximate an individual observation's value for a specific variable, one computes the scalar product between the observation's point (as represented in the biplot) and the variable's arrow vector; this projection length yields the estimated original data value, with longer projections indicating higher values for that variable.[25] For instance, if an observation point projects far along a variable arrow, it suggests an elevated value for that variable in the original dataset.[20] In some biplot representations, optional contour lines perpendicular to each variable arrow delineate isocontours of exact predicted values, facilitating precise visual interpolation of data values across the plot.[20]
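The angle–correlation correspondence is exact when all components are retained under the column-metric scaling, and only approximate in the two-dimensional plot. A numpy check on standardized synthetic data:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized data

U, d, Vt = np.linalg.svd(Z, full_matrices=False)
H = Vt.T * d / np.sqrt(Z.shape[0] - 1)   # column markers, alpha = 0 scaling

# With all components kept, H H^T is exactly the correlation matrix, so the
# cosine of the angle between two arrows equals their correlation; the 2-D
# biplot keeps only the first two columns of H, making this approximate.
corr = np.corrcoef(Z, rowvar=False)
cos01 = H[0] @ H[1] / (np.linalg.norm(H[0]) * np.linalg.norm(H[1]))
```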
Types and Variations
Classical Biplot
The classical biplot, introduced by K. Ruben Gabriel in 1971, represents a symmetric graphical display of a multivariate data matrix derived from principal component analysis (PCA), where both row points (observations) and column markers (variables) are plotted on the same coordinate system using a scaling parameter α = 0.5.[7] This configuration employs square root scaling of the singular values from the singular value decomposition (SVD) of the data matrix, assigning the square roots equally to the row and column coordinates to achieve balanced representation.[30] The result is an approximation of the original data matrix X ≈ GH^T, where G contains the row coordinates and H the column coordinates, such that each data entry x_{ij} is approximated by the inner product of the corresponding row and column vectors.[30]
A key property of the classical biplot is its preservation of inner products, enabling the scalar product between a row point and a column marker to directly approximate the original data value, while distances between row points approximate their Euclidean distances in the row metric space and angles between column markers reflect variable correlations.[30] This symmetric approach is optimal when the number of observations (n) equals the number of variables (p), or when the metrics for rows and columns are interchangeable, as it treats both sets equivalently without favoring one over the other.[30] In visualization, observations appear as points and variables as arrows, both scaled equally to facilitate interpretation of projections (how well variables predict observations) and angles (correlations between variables), offering a concise yet informative summary of the data structure.[31][32]
However, the classical biplot has limitations, particularly when n ≠ p; for instance, in high-dimensional cases where p > n, the symmetric scaling can distort visual interpretations by stretching the representation of variables relative to observations, potentially misleading assessments of distances and relationships.[31] It assumes continuous data in a Euclidean space, relying on the linear assumptions of PCA, and may not perform well with non-Euclidean or discrete data without preprocessing.[32] For balanced datasets in exploratory analysis, the classical biplot remains the standard choice for PCA-based visualizations, though extensions like the HJ-biplot can address imbalances by adjusting scalings for unequal n and p.[30]
HJ-Biplot and Other Extensions
The HJ-biplot, introduced by Galindo-Villardón in 1986, extends the classical biplot by employing separate scalings for rows and columns to optimize the representation quality of both observations and variables independently.[33] Specifically, it aligns row markers with the JK-biplot configuration (equivalent to α=1, preserving Euclidean distances among observations) and column markers with the GH-biplot configuration (equivalent to α=0, preserving inner products among variables), allowing for a balanced low-dimensional projection even when the number of observations (n) and variables (p) differ substantially.[33] This approach minimizes distortions in row and column configurations separately, yielding superior fit metrics compared to symmetric scalings in the classical biplot, particularly for datasets with unequal n and p.[33]
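A minimal numpy sketch of the HJ-biplot's dual scaling, assuming the construction described above (row markers in JK principal coordinates, column markers in GH principal coordinates); the data and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(18, 5))
Z = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(Z, full_matrices=False)

# HJ-biplot: both marker sets carry the full singular values, so rows and
# columns are each represented in principal coordinates simultaneously.
rows_hj = U[:, :2] * d[:2]     # row markers, as in the JK-biplot (alpha = 1)
cols_hj = Vt[:2].T * d[:2]     # column markers, as in the GH-biplot (alpha = 0)
```

Note that with this scaling the product rows_hj @ cols_hj.T no longer reconstructs Z directly (it carries the singular values twice), which is the price paid for optimizing both representations at once.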
The GGE biplot, developed by Yan and colleagues in 2000, adapts biplot methodology for analyzing genotype main effects plus genotype-by-environment interaction (GGE) in multi-environment trials, a common setup in agronomy and plant breeding.[34] It partitions the phenotypic variation into genotype, genotype-by-environment, and residual components via singular value decomposition, emphasizing crossover interactions that reveal how genotype rankings change across environments. Unlike general biplots, the GGE variant uses polygon ("which-won-where") views and ideal-genotype targeting to identify stable high-performing genotypes and mega-environments, making it suited for experimental designs where environmental variability drives selection decisions.
Other extensions address challenges in specialized data contexts. Sparse biplots, introduced in 2021 through elastic net regularization, incorporate variable selection for high-dimensional datasets, reducing noise by shrinking irrelevant variable loadings to zero while maintaining biplot interpretability.[33] Robust biplots, developed in 1992, employ robust principal component analysis to mitigate the influence of outliers, ensuring stable representations in contaminated datasets by downweighting anomalous observations during decomposition.[35] These variants are selected based on data characteristics: HJ-biplots for imbalanced dimensions, GGE for structured trials, sparse for dimensionality reduction, and robust for outlier-prone scenarios, each offering tailored improvements in representational accuracy over the classical form.[33]
Applications
In Exploratory Data Analysis
Biplots play a central role in exploratory data analysis (EDA) of multivariate datasets, facilitating the visual detection of underlying patterns such as clusters of similar observations, outliers that deviate from the main data cloud, and correlations among variables through the angles and proximities in the plot.[2] Unlike hypothesis-driven methods, biplots enable an initial, non-parametric appraisal of data structure, where observations are represented as points and variables as directional vectors scaled to their contributions.[36] This approach stems from principal component analysis (PCA), projecting the data onto the principal components that capture the maximum variance.[2]
In EDA workflows, biplots emphasize the variance explained by the leading principal components, with axes typically labeled to show percentages (e.g., the first two components often accounting for 70-90% of total variance in well-structured datasets). They complement other unsupervised visualization tools, such as dendrograms for hierarchical clustering or heatmaps for similarity matrices, by jointly displaying both observations and variables to highlight relationships that might inform subsequent analyses like clustering. For instance, variable vectors pointing in similar directions indicate positive correlations, aiding quick insights into data dependencies without computational overhead.[36]
A key advantage of biplots in EDA is their intuitiveness, making complex multivariate relationships accessible to non-experts and allowing rapid revelation of data structure for guiding further investigation.[36] However, a common pitfall arises from over-reliance on the two-dimensional projection, which may obscure nuances in higher-dimensional spaces if the plotted components explain insufficient variance (e.g., less than 70%). Analysts must therefore verify the plot's adequacy through variance metrics to avoid misinterpretation.
Specific Fields and Examples
In ecology, biplots are widely applied to visualize relationships between species traits and environmental variables, facilitating the ordination of community structures. For instance, double-constrained correspondence analysis biplots have been used to analyze species abundances across sites in dune meadows, characterized by traits such as specific leaf area and seed mass, revealing how environmental variables like moisture content and manure quantity influence species distributions.[37] These representations highlight clusters of species adapted to similar environmental gradients, aiding in the identification of ecological niches.[38]
In agronomy, genotype plus genotype-by-environment (GGE) biplots are employed to assess crop yield stability across multiple trials, enabling breeders to select high-performing varieties. A study on rain-fed durum wheat genotypes tested in diverse Iranian environments utilized GGE biplots to partition variance into genotype, environment, and interaction effects, identifying stable high-yield genotypes like experimental line G8 and variety Saji that performed consistently across cold and warm conditions.[39] This approach visualizes mega-environments, where long arrows indicate discriminating sites and central positions denote ideal, stable performers, ultimately guiding recommendations for specific agroecological zones.[40]
Biplots in marketing support brand positioning by mapping consumer perceptions of attributes like quality and value, often through perceptual maps derived from principal component analysis. For example, biplots of customer satisfaction surveys have positioned brands relative to variables including service reliability and pricing, revealing competitive gaps such as opportunities for differentiation on affordability, informing strategic repositioning.[41]
In genetics, biplots extend to analyzing gene expression data by representing samples and genes in reduced dimensions, uncovering patterns of co-expression. Applications in genomics use biplots to group profiles from cancer studies, where points denote tissue samples and vectors indicate expressed genes, facilitating the identification of clusters associated with biological pathways.[42] This method enhances interpretability by quantifying the contribution of key genes to sample separation, supporting biomarker discovery.[5]
A classic illustrative example is the principal component analysis biplot of the Iris dataset, which plots the 150 flower measurements across three species using sepal length, sepal width, petal length, and petal width as variables. The biplot reveals clear separation of Setosa from Versicolor and Virginica along the first two principal components, with petal variables driving the primary axis of discrimination and demonstrating species-specific trait clustering.[43]
For financial applications, correlation biplots analyze stock return matrices to explore inter-asset relationships and portfolio risks. In a study of investment returns from major indices, biplots visualized covariance structures among equities, showing high positive correlations between technology stocks during volatile periods and aiding in diversification strategies by highlighting orthogonal assets.[44] Recent analyses as of 2025 have employed correlation biplots to account for market noise, identifying low-correlation pairs like utilities versus growth stocks for balanced portfolios.[1]
Advantages and Limitations
Benefits
Biplots offer a streamlined approach to visualizing multivariate data by integrating the representation of both observations and variables into a single graphical display, thereby reducing the cognitive load associated with interpreting multiple separate plots such as scatterplots or loading plots in principal component analysis (PCA). This simplicity arises from the use of singular value decomposition to project high-dimensional data into a low-dimensional space, typically two dimensions, while preserving key structural information with minimal loss, as demonstrated in applications like the inspection of large matrices where direct numerical examination would be cumbersome.[27][45]
The informativeness of biplots stems from their ability to simultaneously reveal correlations between variables—approximated by angles between arrows—clusters among observations, and the quality of the dimensional approximation through vector lengths and point proximities. For instance, in PCA contexts, biplots enable the identification of dominant patterns, such as collinearities or groupings, that would otherwise require cross-referencing multiple visualizations, providing a comprehensive overview that enhances exploratory data analysis efficiency.[27]
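The angle-correlation reading can be verified numerically. For standardized data, a column-metric-preserving factorization that retains all components makes the cosine of the angle between two variable markers equal the correlation between those variables exactly; a two-dimensional biplot then approximates this. A minimal NumPy sketch (variable names are illustrative, not from any package):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))  # correlated columns
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)         # standardize columns

# Full SVD of the standardized matrix
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Column markers scaled so their inner products reproduce correlations
H = (Vt.T * d) / np.sqrt(len(X) - 1)

norms = np.linalg.norm(H, axis=1)             # each equals 1 for standardized data
cosines = (H @ H.T) / np.outer(norms, norms)  # cosines of angles between markers

# cosines reproduces the correlation matrix of the columns of X
```

Keeping only the first two columns of `H`, as a 2-D biplot does, turns this exact identity into the approximation described above.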
Biplots exhibit flexibility through various extensions, such as the HJ-biplot for balanced row-column representations or adaptations for categorical and count data in correspondence analysis, allowing scalability across diverse data types without substantial reconfiguration. This adaptability supports their application in fields like ecology and bibliometrics, where they handle weighted or incomplete datasets effectively.[27][46]
A core strength lies in the interpretability afforded by calibrated axes and projections, which permit approximate recovery of original data values, such as element-wise scalar products, directly from the plot, reducing the need to consult the underlying table and enabling intuitive geometric readings such as Euclidean distances between points. Since their introduction by Gabriel in 1971, biplots have remained a staple for rapid insight in PCA and related techniques, with widespread adoption across more than five decades of multivariate research.[27][45]
Drawbacks and Considerations
Biplots, as low-dimensional projections of multivariate data, inherently involve a loss of information due to dimensionality reduction, typically to two or three principal components for visualization. This projection captures only the variance explained by the selected components; if the retained components explain less than 70% of the total variance, significant aspects of the data structure may be overlooked, leading to incomplete interpretations.[47]
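This guideline is straightforward to check before trusting a two-dimensional biplot; the Python sketch below (function name hypothetical) computes the fraction of total variance carried by the first k components directly from the singular values of the centered data:

```python
import numpy as np

def explained_variance_ratio(X, k=2):
    """Fraction of total variance captured by the first k principal components."""
    Xc = X - X.mean(axis=0)                      # center each column
    d = np.linalg.svd(Xc, compute_uv=False)      # singular values
    var = d ** 2                                 # proportional to component variances
    return var[:k].sum() / var.sum()

# Example: a rank-2 matrix is captured perfectly by two components
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 1)) @ rng.normal(size=(1, 6)) + \
    rng.normal(size=(50, 1)) @ rng.normal(size=(1, 6))
ratio = explained_variance_ratio(X, k=2)         # 1.0 up to rounding error
```

If the returned ratio falls below the roughly 70% threshold discussed above, the two-dimensional display should be interpreted with caution or more components examined.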
Classical biplots are particularly prone to distortions when the number of variables (p) greatly exceeds the number of observations (n), resulting in stretched representations of variables that poorly reflect true relationships and low explained variance, often below 30% in high-dimensional cases like genomic data with thousands of genes. Such distortions necessitate variants like the HJ-biplot, which better balance row and column representations by optimizing singular value distribution to mitigate these issues.[27]
The construction of biplots involves subjective choices in scaling parameters, such as the exponent α in the singular value decomposition, which determines the relative emphasis on observations versus variables and can substantially alter the visual appearance and interpretability of relationships. Standardization decisions, whether using covariance or correlation matrices, further introduce variability, as they yield different variance explanations (e.g., 95% versus 93.7% in ecological datasets), affecting the reliability of projections.[47][27]
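In symbols, the α-family of factorizations splits the rank-two truncated SVD as follows (row markers carry the singular values raised to α, column markers the remaining power 1−α):

```latex
X \;\approx\; U_2 D_2 V_2^{\top}
   \;=\; \underbrace{\left(U_2 D_2^{\alpha}\right)}_{G \text{ (row markers)}}
         \underbrace{\left(V_2 D_2^{\,1-\alpha}\right)^{\top}}_{H^{\top} \text{ (column markers)}},
   \qquad 0 \le \alpha \le 1 .
```

Every choice of α reproduces the same approximation to X, but it redistributes the singular values between the two marker sets, which is why the visual geometry of points and arrows changes with α even though the underlying fit does not.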
Overinterpretation poses a risk in biplots, as the displayed projections are approximations rather than exact depictions of distances or correlations, with angular representations particularly sensitive to noise and outliers that can exaggerate or obscure variable associations. In noisy datasets, small perturbations may lead to misleading angles between variable vectors, undermining the validity of inferred relationships.[47]
Key considerations for using biplots include always reporting the percentage of variance explained by the plotted components to contextualize the reliability of the visualization, as low percentages (e.g., under 70%) indicate substantial information loss. Biplots should be avoided, or adapted with sparsity-inducing methods, in very high-dimensional settings without inherent data sparsity, where dense representations become uninterpretable due to clutter and noise dominance. A 2018 critique in agronomy highlighted the dangers of axis stretching in genotype main effect plus genotype-by-environment interaction (GGE) biplots, which invalidates angular and distance interpretations unless the axes are equally scaled.[27][31]
Software and Implementation
Several software packages and tools support the creation and analysis of biplots across programming languages and statistical platforms, enabling visualization of multivariate data in reduced dimensions. In the R programming language, the BiplotGUI package, released in 2009, offers an interactive graphical user interface for constructing and exploring biplots, with features for real-time parameter adjustments and multiple biplot types. For principal component analysis (PCA)-based biplots, the factoextra package leverages ggplot2 to generate customizable, publication-ready plots, including enhancements for observation and variable labeling. Specialized R packages extend biplot functionality further: GGEBiplotGUI focuses on genotype-by-environment interactions in agricultural data, and SparseBiplots implements sparse approximations for high-dimensional datasets.[48]
In Python, while users commonly implement biplots by combining scikit-learn's PCA module for dimensionality reduction with matplotlib or seaborn for visualization, the PyBiplots package provides dedicated functions for various biplot methods, including GH-Biplot, JK-Biplot, and HJ-Biplot.[49]
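A minimal version of that do-it-yourself approach might look like the following sketch (assuming scikit-learn is installed; scaling the loadings by the square root of the explained variance is one common convention rather than the only choice):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

iris = load_iris()
X = scale(iris.data)                     # center and scale the four variables

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)                # observation points (150 x 2)
loadings = pca.components_.T * np.sqrt(pca.explained_variance_)  # variable arrows (4 x 2)

# To overlay both layers with matplotlib:
# import matplotlib.pyplot as plt
# plt.scatter(scores[:, 0], scores[:, 1], c=iris.target, s=15)
# for (x, y), name in zip(loadings, iris.feature_names):
#     plt.arrow(0, 0, x, y, color="k", head_width=0.05)
#     plt.text(1.1 * x, 1.1 * y, name)
# plt.gca().set_aspect("equal")
# plt.show()
```

The equal aspect ratio matters: without it, angles between arrows no longer reflect the approximated correlations.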
In SAS, biplots can be generated using the PROC PRINCOMP procedure with ODS Graphics to produce score plots and loading plots that can be overlaid for biplot visualization; this support has been available since SAS 9.2 (2008), with enhancements in later versions. Custom biplots can also be created using PROC IML.[50]
Among other tools, GraphPad Prism includes biplot options within its PCA module, oriented toward biomedical and life-science applications with a point-and-click interface. Displayr, a web-based analytics platform, supports interactive biplot creation for exploratory data visualization without coding. NIST's Dataplot, a free engineering-statistics package, features dedicated biplot commands for batch processing and scripting. MATLAB's Statistics and Machine Learning Toolbox offers a native biplot function that overlays observation scores and variable coefficients on the same axes, in either two or three dimensions.
Key features across these tools include interactivity in R's GUI-based packages for user-driven exploration and automation in SAS for handling enterprise-level datasets efficiently. Open-source R packages dominate for broad accessibility and community contributions, while commercial options like SAS, MATLAB, and GraphPad Prism provide robust support for professional and institutional use.
Practical Example in R
A practical example of constructing a biplot in R utilizes the built-in Fisher's iris dataset, which contains measurements of sepal and petal dimensions for 150 samples across three iris species: setosa, versicolor, and virginica. This dataset serves as a standard benchmark for illustrating multivariate visualization techniques.
The following R code demonstrates the process using the factoextra package, which simplifies PCA-based biplot generation. It loads the dataset, performs principal component analysis (PCA) on the four numeric variables after centering and scaling, and produces a biplot of the first two principal components with points colored by species and variable arrows overlaid.
```r
# Load required library for enhanced PCA visualization
library(factoextra)

# Load the built-in iris dataset
data(iris)

# Perform PCA: center and scale the four measurement variables
# (prcomp uses SVD internally for the decomposition)
res.pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Generate the biplot: points for observations (colored by species),
# arrows for variables; repel = TRUE avoids label overlap
fviz_pca_biplot(res.pca,
                col.ind = as.factor(iris$Species),  # Color points by species
                repel = TRUE,                       # Repel overlapping labels
                legend.title = "Species")           # Legend title
```
This code centers and scales the data to ensure variables contribute equally, computes the PCA via singular value decomposition (SVD), and plots observations as points (colored by species for group distinction) alongside arrows representing variable loadings.[51] The repel = TRUE option, powered by the ggrepel package, positions labels to minimize overlap.[52]
In the resulting biplot, observations form three distinct clusters aligned with the species: setosa samples separate clearly along the first principal component (PC1), while versicolor and virginica overlap somewhat along PC2. Arrow directions and lengths indicate variable contributions; for instance, the acute angle between petal length and petal width arrows reflects their strong positive correlation (approximately 0.96), as these traits load heavily on PC1 and drive species differentiation.[51] Longer arrows denote greater explanatory power for the displayed components.
For customization, such as adjusting the biplot's scaling via the α parameter (where row markers are scaled by singular values raised to α and column markers to 1-α), manual SVD computation allows fine control beyond package defaults (typically α = 0.5 for symmetric scaling).[27] The code below computes a symmetric biplot (α = 0.5) using base R functions and plots it with ggplot2 for flexibility; adapt α as needed (e.g., α = 1 for a distance biplot that preserves inter-observation distances, α = 0 for a covariance or correlation biplot that emphasizes relationships among variables).
```r
# Load libraries for plotting
library(ggplot2)
library(ggrepel)  # For label repulsion

# Prepare the scaled data matrix (same as the prcomp input)
X <- scale(iris[, 1:4])

# Compute the SVD manually
s <- svd(X)

# Define alpha for scaling (0.5 gives a symmetric biplot)
alpha <- 0.5

# Row markers (observations) and column markers (variables)
# for the first two dimensions: G = U D^alpha, H = V D^(1 - alpha)
row_scores   <- s$u[, 1:2] %*% diag(s$d[1:2] ^ alpha)
col_loadings <- s$v[, 1:2] %*% diag(s$d[1:2] ^ (1 - alpha))

# Create data frames for plotting
obs_df <- data.frame(PC1 = row_scores[, 1], PC2 = row_scores[, 2],
                     Species = iris$Species)
var_df <- data.frame(PC1 = col_loadings[, 1], PC2 = col_loadings[, 2],
                     Variable = colnames(X))

# Plot the biplot with points, arrows, and labels
ggplot() +
  geom_point(data = obs_df, aes(x = PC1, y = PC2, color = Species), size = 2) +
  geom_segment(data = var_df, aes(x = 0, y = 0, xend = PC1, yend = PC2),
               arrow = arrow(length = unit(0.3, "cm")),
               color = "black", linewidth = 0.8) +
  geom_text_repel(data = var_df, aes(x = PC1, y = PC2, label = Variable),
                  color = "black", size = 4) +
  labs(x = "PC1", y = "PC2", color = "Species") +
  theme_minimal() +
  coord_fixed()  # Equal aspect ratio so angles are not distorted
```
This manual approach reproduces the factoextra biplot up to overall scaling while enabling α adjustments; setting α = 1 places observations in principal coordinates, so Euclidean distances among plotted points approximate distances among observations, while α = 0 scales the variable markers by the full singular values so their inner products approximate the column covariance structure.[27] To export the plot as a PDF for publication or sharing, wrap the plotting call between pdf() and dev.off():
```r
# Export the biplot as a PDF (adjust filename and dimensions as needed)
pdf("iris_biplot.pdf", width = 8, height = 6)
fviz_pca_biplot(res.pca, col.ind = as.factor(iris$Species), repel = TRUE)
dev.off()
```
This saves the plot as a vector graphic, which scales to any size without loss of quality.