Compositional data refer to quantitative observations that describe the relative proportions of parts making up a whole, typically represented as vectors of non-negative components summing to a fixed constant, such as 1 for proportions or 100 for percentages, thereby conveying only relative rather than absolute information about the components.[1] These data arise naturally in fields where measurements are inherently constrained, such as the chemical compositions of rocks, market shares in economics, or microbial abundances in biology.[2]

The analysis of compositional data poses unique challenges for traditional statistical methods, as the constant-sum constraint induces dependencies among components, leading to spurious correlations if ignored; for instance, an increase in one proportion necessarily decreases others, distorting interpretations of relationships.[3] This issue was formally addressed by statistician John Aitchison in his seminal 1982 paper, which introduced a coherent framework based on the simplex—the geometric space of all possible compositions—and proposed log-ratio transformations to map compositional data into an unconstrained Euclidean space suitable for standard multivariate analysis.[3] Key transformations include the additive log-ratio (ALR), centered log-ratio (CLR), and isometric log-ratio (ILR), each preserving essential properties like scale invariance (where multiplying all components by a constant yields the same ratios) and subcompositional coherence (ensuring results for subsets align with the full composition).

Subsequent developments have expanded the theory to include probabilistic models, such as the Dirichlet distribution for generating compositions, and advanced techniques like regression, discriminant analysis, and clustering adapted for the simplex geometry.[4] Applications span diverse disciplines: in geochemistry, for analyzing mineral proportions in sediments; in economics, for studying budget allocations or trade balances; 
in environmental science, for pollutant mixtures; and in high-throughput biology, such as metagenomics for species relative abundances or immunology for immune cell profiles.[2] These methods support interpretable inferences free of constraint-induced artifacts, with ongoing research focusing on high-dimensional data and robust estimation under zero-inflated scenarios common in real-world compositions.
Definition and History
Definition
Compositional data are vectors of positive real numbers that represent the relative proportions of parts composing a whole, constrained to sum to a fixed constant, typically 1 (for proportions) or 100 (for percentages). This closure under a constant sum encodes only relative information, where the ratios between components convey the essential structure, rendering absolute scales irrelevant and introducing scale invariance.[5] The components are inherently non-negative and interdependent, as an increase in one part necessarily decreases others to maintain the total sum.

In mathematical terms, a D-part composition \mathbf{x} = (x_1, x_2, \dots, x_D) satisfies x_i > 0 for all i and \sum_{i=1}^D x_i = C, where C is the constant (often normalized to 1 without loss of generality). This structure positions compositional data within the interior of a (D-1)-dimensional simplex, a bounded sample space that reflects their relative nature. Standard multivariate statistical techniques, such as those assuming independence, are unsuitable for direct application, as the constraint can artifactually correlate variables that are otherwise unrelated.[6][7]

Examples of compositional data abound across disciplines: in geochemistry, the percentages of major oxides in rock samples; in economics, the shares of expenditures across categories in a household budget; and in microbiology, the relative abundances of bacterial taxa in a microbiome sample. These cases highlight how compositional data capture proportional relationships rather than absolute quantities, necessitating specialized analytical frameworks like log-ratio transformations to extract meaningful inferences.[5]
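The closure (normalization) step and scale invariance described above can be illustrated in a few lines. The following is a minimal sketch in Python/NumPy; the measurement values are hypothetical:

```python
import numpy as np

def closure(w, total=1.0):
    """Apply the closure operation C: rescale positive parts to sum to `total`."""
    w = np.asarray(w, dtype=float)
    return total * w / w.sum()

# Hypothetical raw measurements in arbitrary units; only the ratios matter.
raw = np.array([38.0, 12.0, 50.0])
comp = closure(raw)            # proportions summing to 1
scaled = closure(7.3 * raw)    # rescaling all parts leaves the composition unchanged
assert np.allclose(comp, scaled)
```

The final assertion is exactly the scale-invariance property: multiplying every part by the same positive constant yields the same composition after closure.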
Historical Development
The concept of spurious correlation in data constrained by a constant sum, such as proportions or percentages, was first formally identified by Karl Pearson in 1897, who demonstrated how such constraints could induce artificial negative correlations between components, misleading standard statistical analyses.[8] This issue, termed the "closure effect," highlighted the inadequacy of applying unconstrained multivariate methods to relative data, though it remained largely unaddressed in practice for decades.[9]

In the mid-20th century, geologists and biologists began critiquing these applications more systematically. Felix Chayes, in 1960, explicitly linked Pearson's observations to geochemical compositions, showing that product-moment correlations on percentage data from rock analyses produced biased results due to the unit-sum constraint, and he advocated for alternative approaches like ratios of ratios. Concurrently, Sarmanov and Vistelius (1959) examined correlations in geological percentages, while Mosimann (1962) extended similar concerns to biological proportions, emphasizing the need for methods that respect the relative nature of such data.[9] Despite these warnings, standard techniques persisted in fields like geochemistry and ecology until the late 1970s.

The modern field of compositional data analysis (CoDA) was established by John Aitchison in the early 1980s, who proposed log-ratio transformations to map compositional data into an unconstrained Euclidean space while preserving relative information. His seminal 1982 paper, "The Statistical Analysis of Compositional Data," presented the additive log-ratio (alr) and centered log-ratio (clr) transformations, applied to datasets in geology and economics, and stressed subcompositional coherence to ensure consistent inference across subsets. 
Building on earlier work with Shen (1980) on the logistic-normal distribution, Aitchison's 1986 monograph formalized CoDA principles, including perturbation and powering operations on the simplex, establishing a coherent framework that addressed longstanding analytical pitfalls.

Subsequent advancements in the 1990s and 2000s refined Aitchison's geometry. Vera Pawlowsky-Glahn and Juan José Egozcue introduced the isometric log-ratio (ilr) transformation in 2003, providing orthonormal coordinates that map the simplex isometrically to Euclidean space and facilitate advanced techniques like principal component analysis and kriging.[10] This led to the formalization of the Aitchison geometry on the simplex as a vector space of equivalence classes, enabling robust multivariate methods. The field gained momentum through international CoDaWork workshops starting in 2003, fostering applications in geosciences, environmetrics, and chemometrics, with software implementations like the R package compositions emerging by 2007.
Properties and Challenges
Key Properties
Compositional data consist of vectors of positive components that sum to a constant total, such as 1 or 100, representing proportions of a whole without conveying absolute magnitudes. This unit-sum constraint, known as closure, implies that the data carry only relative information, where the value of any component is meaningful only in relation to the others.[11] For instance, in geochemical analysis, the percentages of oxides in a rock sample sum to 100%, but scaling the entire vector by a positive constant yields an equivalent composition, emphasizing scale invariance as a core property.

The sample space for compositional data is the simplex S^{D-1} = \{ \mathbf{x} = (x_1, \dots, x_D) \mid x_i > 0, \sum_{i=1}^D x_i = 1 \}, a (D-1)-dimensional structure embedded in \mathbb{R}^D, where D is the number of parts.[12] This geometry restricts standard Euclidean operations, as distances and variances must respect the constraint; for example, the Aitchison distance measures dissimilarity via log-ratios:

d_a(\mathbf{x}, \mathbf{y}) = \sqrt{ \frac{1}{2D} \sum_{i=1}^D \sum_{j=1}^D \left[ \ln \left( \frac{x_i}{x_j} \right) - \ln \left( \frac{y_i}{y_j} \right) \right]^2 },

capturing relative differences coherently.[11] Perturbation (\mathbf{x} \oplus \mathbf{y} = C(x_1 y_1, \dots, x_D y_D), where C denotes closure) and powering (\alpha \odot \mathbf{x} = C(x_1^\alpha, \dots, x_D^\alpha)) serve as the vector addition and scalar multiplication analogs, forming an Aitchison inner product space.

A pivotal property is subcompositional coherence, ensuring that statistical inferences about a full composition align with those from any subcomposition obtained by merging or omitting parts. 
This requires methods invariant to such reductions, as ratios between components remain unchanged (e.g., x_i / x_j = s_i / s_j for subcomposition \mathbf{s}).[12] Violations arise in unconstrained analyses, leading to spurious correlations; thus, log-ratio transformations like the centered log-ratio (clr), defined as \text{clr}(\mathbf{x})_i = \ln(x_i / g(\mathbf{x})) with geometric mean g(\mathbf{x}) = (\prod_{i=1}^D x_i)^{1/D}, are essential to map the simplex isometrically to Euclidean space.[11]

These properties collectively imply that conventional statistical tools, such as Pearson correlation, distort interpretations by ignoring the relative scale and constraint, often inflating or deflating associations artificially.[12] Instead, compositional analysis prioritizes ratio-based metrics to preserve the data's intrinsic structure, as formalized in Aitchison's framework.
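A useful numerical check of these definitions is that the double-sum form of the Aitchison distance coincides with the ordinary Euclidean distance between clr-transformed vectors, and is invariant to rescaling of either composition. A minimal NumPy sketch (the example compositions are hypothetical):

```python
import numpy as np

def clr(x):
    """Centered log-ratio: log parts minus the log geometric mean."""
    lx = np.log(np.asarray(x, dtype=float))
    return lx - lx.mean()

def aitchison_distance(x, y):
    """Double-sum form: sqrt((1/2D) * sum_ij [ln(xi/xj) - ln(yi/yj)]^2)."""
    lx = np.log(np.asarray(x, dtype=float))
    ly = np.log(np.asarray(y, dtype=float))
    D = len(lx)
    diff = (lx[:, None] - lx[None, :]) - (ly[:, None] - ly[None, :])
    return np.sqrt((diff ** 2).sum() / (2 * D))

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])
# Equivalent to the Euclidean distance in clr coordinates:
assert np.isclose(aitchison_distance(x, y), np.linalg.norm(clr(x) - clr(y)))
# Scale invariance: the distance depends only on ratios within each vector.
assert np.isclose(aitchison_distance(2.0 * x, y), aitchison_distance(x, y))
```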
Analytical Challenges
Compositional data, consisting of proportions or parts that sum to a constant (typically 1 or 100%), present fundamental analytical challenges due to their inherent constraints, which violate key assumptions of classical statistical methods. The constant-sum constraint, or closure effect, ensures that an increase in one component necessarily decreases others, rendering the variables interdependent and precluding direct application of techniques like analysis of variance or regression that assume independence. This closure leads to spurious correlations, where observed associations between components are artifacts of the constraint rather than true relationships; for instance, in geochemical analyses, two minerals may appear negatively correlated simply because their proportions must balance to 100%, misleading interpretations of underlying processes.[13][14]

A core issue is the absence of a suitable framework for independence in the simplex sample space, the constrained space where compositions reside. Traditional notions of independence fail because no parametric families of distributions adequately capture the geometry of the simplex, complicating hypothesis testing and model fitting. John Aitchison highlighted this in his foundational work, noting that "the statistical analysis of such data has proved difficult because of a lack both of concepts of independence and of rich enough parametric classes of distributions in the simplex." Without transformations to an unconstrained space, such as log-ratios, standard multivariate methods yield biased estimates and invalid inferences.[15]

Zero values exacerbate these problems, as they are common in real-world datasets (e.g., absent taxa in microbiome studies or undetectable elements in spectrometry) and render log-ratio transformations undefined due to logarithms of zero. 
Zeros introduce sparsity and high dimensionality, particularly in modern high-throughput data where the number of components often exceeds the number of samples, amplifying multicollinearity and distorting distance metrics. Imputation strategies such as multiplicative replacement have been proposed to handle rounded zeros (values below detection limits) while preserving the compositional structure, but these remain imperfect for datasets with extensive sparsity, and essential zeros (true absences) resist imputation altogether.[14][16]

In fields like microbiome research, these challenges compound with unobserved absolute abundances, as sequencing yields only relative proportions, preventing recovery of true counts and necessitating specialized tools like centered log-ratio transformations to mitigate biases in diversity metrics or differential abundance tests. Failure to address compositionality can lead to paradoxical results, such as inflated type I errors in correlation analyses. Seminal advancements, including Aitchison's log-ratio geometry, have enabled rigorous analysis, but ongoing issues with zeros and high dimensionality continue to drive methodological innovations.[14][13]
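Multiplicative replacement imputes each zero with a small value and multiplicatively shrinks the nonzero parts so the total stays fixed, which preserves the ratios among the observed parts. A simplified sketch in NumPy; the choice of `delta` is application-dependent (e.g., below a detection limit) and the value here is hypothetical:

```python
import numpy as np

def multiplicative_replacement(x, delta=1e-5):
    """Impute zeros with `delta`, shrinking nonzero parts so the total stays 1.

    Ratios among the nonzero parts are preserved, the key requirement for
    subsequent log-ratio analysis.
    """
    x = np.asarray(x, dtype=float)
    zeros = x == 0
    out = np.where(zeros, delta, x * (1 - delta * zeros.sum()))
    return out / out.sum()   # re-close to guard against rounding error

x = np.array([0.6, 0.4, 0.0])
xr = multiplicative_replacement(x)   # all parts now strictly positive
```

Because all nonzero parts are scaled by the same factor, log-ratios such as ln(x1/x2) are unchanged by the replacement.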
The Simplicial Sample Space
The Composition Simplex
In compositional data analysis, the composition simplex serves as the fundamental sample space for representing D-part compositions, which are vectors of positive proportions summing to a constant total, typically unity. Formally, it is defined as the set
S^D = \{ \mathbf{x} = (x_1, \dots, x_D)^\top \in \mathbb{R}^D_{++} : \sum_{i=1}^D x_i = 1 \},
where \mathbb{R}^D_{++} denotes the D-dimensional positive orthant, ensuring all components are strictly positive to avoid zero values that complicate log-ratio transformations.[3] This structure was introduced by John Aitchison as the natural constrained space for analyzing relative information in proportions, distinguishing compositional data from unconstrained multivariate observations.[3]

Geometrically, the simplex S^D is a (D-1)-dimensional Euclidean simplex embedded within \mathbb{R}^D, forming the convex hull of the standard basis vectors scaled to sum to 1. For example, in the ternary case (D=3), it appears as an equilateral triangle in the plane x_1 + x_2 + x_3 = 1 with vertices at (1,0,0), (0,1,0), and (0,0,1), though interior points exclude the boundaries to maintain positivity.[17] The simplex is closed under the closure operation C(\mathbf{y}) = \left( \frac{y_1}{\sum_i y_i}, \dots, \frac{y_D}{\sum_i y_i} \right), which normalizes positive vectors to compositions, ensuring coherence when subsets of parts (subcompositions) are considered.[3]

The simplex's structure endows compositional data with specific invariance properties, such as scale invariance, where multiplying all parts by a positive constant yields the same composition after closure. 
This relative nature implies that absolute magnitudes are irrelevant; only ratios between parts convey information—a principle central to Aitchison's framework.[3] To enable standard statistical inference, the simplex is equipped with an algebraic-geometric structure known as Aitchison geometry, featuring operations like perturbation \mathbf{x} \oplus \mathbf{y} = C(x_1 y_1, \dots, x_D y_D) (analogous to vector addition) and powering \alpha \odot \mathbf{x} = C(x_1^\alpha, \dots, x_D^\alpha) (analogous to scalar multiplication), transforming it into a vector space isomorphic to \mathbb{R}^{D-1}.[17] This geometry, formalized by Pawlowsky-Glahn and Egozcue, includes a metric induced by log-ratio coordinates, with the Aitchison distance
d_a(\mathbf{x}, \mathbf{y}) = \sqrt{ \frac{1}{2D} \sum_{i=1}^D \sum_{j=1}^D \left( \ln \frac{x_i}{x_j} - \ln \frac{y_i}{y_j} \right)^2 },
which measures relative dissimilarities and satisfies subcompositional dominance: distances between subcompositions never exceed the corresponding distances between the full compositions.[17]

These properties address the inherent constraints of compositional data, preventing spurious correlations from the constant sum and enabling perturbation-invariant statistical models, such as the logistic-normal distribution on the simplex.[3] Seminal developments in this area stem from Aitchison's 1982 paper and 1986 monograph, with extensions by Pawlowsky-Glahn and Egozcue providing the rigorous Euclidean embedding that underpins modern CoDA methods.[3][17]
Structure of the Simplex
The composition simplex, denoted as S^D, is mathematically defined as the set of all D-part compositions \mathbf{x} = (x_1, \dots, x_D) where each x_i > 0 and \sum_{i=1}^D x_i = 1. This forms a bounded, convex subset of the positive orthant in \mathbb{R}^D, specifically a (D-1)-dimensional simplex embedded in D-dimensional space. Geometrically, it resembles an equilateral triangle for D=3 (the ternary case) or a tetrahedron for D=4, with vertices corresponding to the standard basis vectors \mathbf{e}_i (where one component is 1 and the others are 0), representing pure compositions of a single part. The interior points represent mixtures, and the faces (subsimplices) correspond to subcompositions obtained by setting one or more parts to zero. This structure inherently constrains the data, as the components are not free variables but interdependent proportions summing to unity.[9][18]

Algebraically, the simplex is endowed with a Euclidean vector space structure known as Aitchison geometry, which allows standard statistical operations while respecting the constant-sum constraint. The two fundamental operations are perturbation (\oplus) and powering (\odot). Perturbation, analogous to vector addition, is defined as \mathbf{x} \oplus \mathbf{y} = C(x_1 y_1, \dots, x_D y_D), where C is the closure operator normalizing the product vector to sum to 1; it forms an abelian group with the identity element \mathbf{e} = (1/D, \dots, 1/D). Powering, serving as scalar multiplication, is \alpha \odot \mathbf{x} = C(x_1^\alpha, \dots, x_D^\alpha) for \alpha \in \mathbb{R}. These operations satisfy the vector space axioms, including associativity, commutativity, and distributivity, enabling linear algebra on the simplex without leaving its boundaries. 
This framework ensures subcompositional coherence, meaning results for subcompositions are consistent with the full composition.[19][18]

The geometry further includes an inner product and a distance metric based on log-ratios to quantify similarities between compositions. The Aitchison inner product is given by

\langle \mathbf{x}, \mathbf{y} \rangle_a = \frac{1}{2D} \sum_{i=1}^D \sum_{j=1}^D \ln\left(\frac{x_i}{x_j}\right) \ln\left(\frac{y_i}{y_j}\right),

which induces the Aitchison norm \|\mathbf{x}\|_a = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle_a} and the distance

d_a(\mathbf{x}, \mathbf{y}) = \sqrt{\frac{1}{2D} \sum_{i=1}^D \sum_{j=1}^D \left[ \ln\left(\frac{x_i}{x_j}\right) - \ln\left(\frac{y_i}{y_j}\right) \right]^2}.

These satisfy the metric properties of non-negativity, symmetry, and the triangle inequality, and they are invariant under scaling of the parts, providing a principled way to measure distances on the simplex. The structure supports orthonormal bases via isometric log-ratio coordinates, facilitating multivariate analysis. This algebraic-geometric foundation, originally proposed by Aitchison, resolves the inherent dependencies in compositional data by treating relative information (log-ratios) as primary.[9][18]
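The group and vector space axioms stated above can be verified numerically. A small NumPy sketch checking the neutral element, perturbation inverses, and the distributivity of powering over perturbation (the compositions used are hypothetical):

```python
import numpy as np

def closure(w):
    """Closure operator C: normalize positive parts to sum to 1."""
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def perturb(x, y):
    """Perturbation x (+) y: componentwise product followed by closure."""
    return closure(np.asarray(x, dtype=float) * np.asarray(y, dtype=float))

def power(alpha, x):
    """Powering alpha (.) x: componentwise power followed by closure."""
    return closure(np.asarray(x, dtype=float) ** alpha)

x = closure([1.0, 2.0, 7.0])
y = closure([3.0, 1.0, 1.0])
n = np.full(3, 1 / 3)                        # neutral element (1/D, ..., 1/D)
x_inv = power(-1.0, x)                       # inverse of x under perturbation

assert np.allclose(perturb(x, n), x)         # identity
assert np.allclose(perturb(x, x_inv), n)     # inverse
assert np.allclose(power(2.0, perturb(x, y)),
                   perturb(power(2.0, x), power(2.0, y)))  # distributivity
```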
Visualization Techniques
Ternary Plots
Ternary plots, also known as ternary diagrams, are a fundamental visualization tool for representing three-part compositional data, where the components sum to a constant total, typically 1 or 100%. These plots project the two-dimensional simplex onto an equilateral triangle, with each vertex corresponding to a pure composition of one component (100% of that part and 0% of the others). A point's position within the triangle encodes the relative proportions of the three components, determined by its perpendicular distances to the opposite sides; by Viviani's theorem, these distances are proportional to the component values.[20]

The construction of a ternary plot involves normalizing the compositional data to ensure the parts sum to unity. For a composition \mathbf{x} = (x_1, x_2, x_3) with x_1 + x_2 + x_3 = 1, the coordinates are calculated such that the distance from the point to the side opposite vertex 1 is x_1, to the side opposite vertex 2 is x_2, and to the side opposite vertex 3 is x_3. Grid lines parallel to the sides facilitate reading values, allowing users to interpolate proportions directly from the plot. Software packages, such as those in R's compositions library, automate this process for accurate rendering.[21][22]

Historically, ternary diagrams emerged in the 18th century from studies in color theory, with James Clerk Maxwell advancing their use in 1857 for mixing primary colors. Around the turn of the 20th century, they were applied in physical chemistry by H.W.B. Roozeboom (1893) for phase equilibria and in geology by G.A. Rankin (1915) for silicate systems. 
In the context of compositional data analysis (CoDA), John Aitchison integrated ternary plots into his framework in 1986, emphasizing their role in visualizing the simplex structure and log-ratio transformations, though he cautioned against direct statistical analysis without accounting for the compositional constraint.[23][19]

In CoDA applications, ternary plots are widely used in geosciences to classify rock types (e.g., plotting quartz, feldspar, and mica proportions in sediments) and in environmental science for soil textures (sand, silt, clay). They enable the identification of atypical compositions, such as outliers near vertices indicating dominance of one part, and support exploratory analysis of subcompositions or perturbation effects. For instance, in health policy studies, they visualize authority allocations across local, regional, and national levels.[22][20][24]

Despite their utility, ternary plots have limitations inherent to three-part data. They cannot directly handle higher-dimensional compositions without reduction to subcompositions, potentially losing information, and zeros in data require special treatment, such as imputation, to avoid boundary issues. Interpretation demands familiarity, as the non-Euclidean geometry of the simplex can distort Euclidean intuitions unless log-ratio coordinates are overlaid. Advanced variants, like centered ternary balance schemes, enhance readability by encoding balances between components. Overall, ternary plots remain a cornerstone for intuitive visualization in CoDA, bridging classical applications in petrology and metallurgy with modern statistical methods.[24][25]
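Plotting a point on a ternary diagram amounts to a barycentric-to-Cartesian conversion. A minimal sketch, assuming the common plotting convention of vertices at (0, 0), (1, 0), and (1/2, √3/2) for the first, second, and third parts respectively:

```python
import numpy as np

def ternary_xy(comp):
    """Map a 3-part composition (a, b, c) to Cartesian coordinates inside an
    equilateral triangle with vertices A=(0,0) for a, B=(1,0) for b, and
    C=(0.5, sqrt(3)/2) for c.  Vertex placement is a plotting convention."""
    a, b, c = np.asarray(comp, dtype=float) / np.sum(comp)
    # Barycentric combination a*A + b*B + c*C:
    return b + 0.5 * c, (np.sqrt(3) / 2) * c

x, y = ternary_xy([0.2, 0.3, 0.5])   # an interior mixture point
```

A pure composition maps to its vertex (e.g., (0, 1, 0) lands on B), and the barycenter (1/3, 1/3, 1/3) lands on the triangle's centroid, matching how proportions are read off the diagram.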
Log-Ratio Based Visualizations
Log-ratio based visualizations transform compositional data into an unconstrained Euclidean space, enabling the application of standard multivariate plotting techniques while preserving the relative information inherent in compositions. These methods rely on log-ratio transformations such as the centered log-ratio (clr), additive log-ratio (alr), or isometric log-ratio (ilr), which map the simplex to \mathbb{R}^{D-1} (where D is the number of parts), allowing for interpretable displays of variation in log-ratios between components. Unlike ternary diagrams, which are limited to three-part compositions, log-ratio visualizations scale to high-dimensional data and facilitate the exploration of subcompositional coherence and principal modes of variation.

A primary approach is the clr-biplot, which applies principal component analysis (PCA) to clr-transformed data to generate a two-dimensional representation of samples and variables. The clr transformation is defined as \mathbf{y} = \text{clr}(\mathbf{x}) = \left( \ln \frac{x_1}{g(\mathbf{x})}, \dots, \ln \frac{x_D}{g(\mathbf{x})} \right), where g(\mathbf{x}) is the geometric mean of the composition \mathbf{x} = (x_1, \dots, x_D). The biplot is constructed via singular value decomposition (SVD) of the clr matrix \mathbf{Y}, yielding sample coordinates as principal component scores and variable arrows as loadings that represent clr coefficients, scaled to reflect the relative contributions of log-ratios. This setup allows interpretation of proximities: samples near a variable arrow indicate higher relative abundance of that component, while angles between arrows quantify log-ratio correlations. 
For example, in analyzing color compositions from abstract paintings, clr-biplots reveal dominant log-ratio variations, such as contrasts between warm and cool tones, outperforming direct simplex plots by capturing multi-dimensional structure.

Isometric log-ratio (ilr) transformations provide an orthonormal basis for biplots and balance visualizations, ensuring that Euclidean distances correspond directly to Aitchison distances in the simplex. The ilr coordinates are obtained via \mathbf{z} = \text{ilr}(\mathbf{x}) = \mathbf{V} \cdot \text{clr}(\mathbf{x}), where \mathbf{V} is a (D-1) \times D contrast matrix derived from a sequential binary partition (SBP) that hierarchically groups parts into numerator and denominator subcompositions. In ilr-biplots, the first two coordinates are plotted similarly to clr-biplots, but the orthonormal basis minimizes distortion and supports precise variance decomposition. Balance plots, a specialized ilr visualization, display the first k balances b_i = \sqrt{\frac{r s}{r + s}} \ln \frac{g(\mathbf{x}^+_i)}{g(\mathbf{x}^-_i)} (where r and s are the sizes of the positive and negative subcompositions, and g is the geometric mean) as a two-panel figure: one panel shows the SBP tree with group contrasts, and the other plots sample balance values on a common scale to compare distributions across balances ordered by variance. This method excels in high-dimensional settings, such as metagenomic data with hundreds of taxa, where it visualizes key contrasts (e.g., microbial phyla) without overlap, offering clearer insights than dendrograms by enabling direct subset analysis.[10]

Additive log-ratio (alr) visualizations, while less common due to their dependence on a fixed reference part, involve plotting alr coordinates \mathbf{y} = \text{alr}(\mathbf{x}) = \left( \ln \frac{x_1}{x_D}, \dots, \ln \frac{x_{D-1}}{x_D} \right) in scatterplots or biplots, often after PCA. 
These are useful for targeted analyses where one component (e.g., a baseline element in geochemistry) is held constant, but they lack the symmetry of clr or ilr, potentially biasing interpretations if the reference varies. Overall, log-ratio based methods prioritize the geometry of compositions, with clr-biplots favored for exploratory symmetry and ilr for rigorous distance-preserving displays in complex datasets.
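The clr-biplot construction described above reduces to a clr transform, column centering, and an SVD. A compact NumPy sketch using a hypothetical toy data set of five samples and four parts:

```python
import numpy as np

def clr(X):
    """Row-wise centered log-ratio of a samples-by-parts matrix."""
    L = np.log(X)
    return L - L.mean(axis=1, keepdims=True)

# Hypothetical compositional data: rows are samples summing to 1.
X = np.array([[0.20, 0.30, 0.40, 0.10],
              [0.10, 0.40, 0.40, 0.10],
              [0.30, 0.20, 0.30, 0.20],
              [0.25, 0.25, 0.25, 0.25],
              [0.40, 0.10, 0.30, 0.20]])

Y = clr(X)
Yc = Y - Y.mean(axis=0)                  # center columns before PCA
U, s, Vt = np.linalg.svd(Yc, full_matrices=False)

scores = U[:, :2] * s[:2]                # sample coordinates (form biplot)
loadings = Vt[:2].T                      # variable arrows (clr loadings)
explained = s[:2] ** 2 / (s ** 2).sum()  # variance share of the first two PCs
```

Plotting `scores` as points and `loadings` as arrows gives the two-dimensional clr-biplot; the `explained` values are typically reported on the axes.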
Aitchison Geometry
Foundations
The Aitchison geometry provides a Euclidean vector space structure to the simplex of compositional data, enabling standard statistical operations while respecting the relative nature of compositions. Introduced by John Aitchison, this framework addresses the inherent constraints of compositional data by defining operations and a metric that are invariant to scaling and preserve subcompositional coherence. The geometry is built upon log-ratio transformations, which map the simplex into a real vector space, allowing for meaningful algebraic manipulations.

Central to the Aitchison geometry are two operations: perturbation and powering. Perturbation, denoted \oplus, combines two compositions \mathbf{x} = (x_1, \dots, x_D) and \mathbf{y} = (y_1, \dots, y_D) in the simplex S_D as

\mathbf{x} \oplus \mathbf{y} = C(x_1 y_1, x_2 y_2, \dots, x_D y_D),

where C is the closure operator that normalizes the product vector to sum to 1. This operation is analogous to vector addition and forms an abelian group structure on the simplex, with the neutral element being the uniform composition (1/D, \dots, 1/D). Powering, denoted \odot, scales a composition by a real number \alpha as

\alpha \odot \mathbf{x} = C(x_1^\alpha, x_2^\alpha, \dots, x_D^\alpha),

which is distributive over perturbation and mimics scalar multiplication. These operations ensure that the simplex behaves like a vector space under the induced geometry.

The metric structure arises from an inner product tailored to log-ratios. 
The Aitchison inner product between compositions \mathbf{x} and \mathbf{y} is defined as

\langle \mathbf{x}, \mathbf{y} \rangle_A = \frac{1}{2D} \sum_{i=1}^D \sum_{j=1}^D \ln \left( \frac{x_i}{x_j} \right) \ln \left( \frac{y_i}{y_j} \right),

which can be equivalently expressed as \sum_{i=1}^D \mathrm{clr}(\mathbf{x})_i \, \mathrm{clr}(\mathbf{y})_i, where \mathrm{clr} is the centered log-ratio transformation, or as \frac{1}{D} \sum_{i<j} \ln \left( \frac{x_i}{x_j} \right) \ln \left( \frac{y_i}{y_j} \right). The associated norm \|\mathbf{x}\|_A = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle_A} induces the Aitchison distance

d_A(\mathbf{x}, \mathbf{y}) = \| \mathbf{x} \ominus \mathbf{y} \|_A = \sqrt{ \frac{1}{2D} \sum_{i=1}^D \sum_{j=1}^D \left[ \ln \left( \frac{x_i}{y_i} \right) - \ln \left( \frac{x_j}{y_j} \right) \right]^2 },

where \ominus denotes inverse perturbation. This distance is Euclidean, positive definite, and symmetric, turning the simplex into a metric space suitable for multivariate analysis. Key properties include orthogonality with respect to orthonormal bases derived from log-ratio transformations and coherence under subcompositions, ensuring the geometry remains consistent for subsets of parts.[10]
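The stated equivalence between the double-sum form and the clr dot product is easy to verify numerically. A short NumPy sketch (the example compositions are hypothetical):

```python
import numpy as np

def clr(x):
    """Centered log-ratio transform of a composition."""
    lx = np.log(np.asarray(x, dtype=float))
    return lx - lx.mean()

def aitchison_inner(x, y):
    """Double-sum form: (1/2D) * sum_ij ln(xi/xj) * ln(yi/yj)."""
    lx = np.log(np.asarray(x, dtype=float))
    ly = np.log(np.asarray(y, dtype=float))
    D = len(lx)
    return ((lx[:, None] - lx[None, :]) * (ly[:, None] - ly[None, :])).sum() / (2 * D)

x = np.array([0.1, 0.5, 0.4])
y = np.array([0.3, 0.3, 0.4])
# The double sum equals the Euclidean dot product of the clr vectors:
assert np.isclose(aitchison_inner(x, y), clr(x) @ clr(y))
```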
Orthonormal Bases
In Aitchison geometry, the simplex of compositional data is endowed with a Euclidean vector space structure, allowing the definition of orthonormal bases that facilitate statistical analysis by mapping compositions to real coordinates while preserving distances and inner products. These bases consist of D-1 linearly independent compositions \{e_1, \dots, e_{D-1}\} in the D-part simplex S^D, satisfying the orthonormality conditions \langle e_i, e_j \rangle_a = \delta_{ij} for i, j = 1, \dots, D-1, where \delta_{ij} is the Kronecker delta and \langle \cdot, \cdot \rangle_a denotes the Aitchison inner product defined as

\langle x, y \rangle_a = \frac{1}{2D} \sum_{i=1}^D \sum_{j=1}^D \ln\left(\frac{x_i}{x_j}\right) \ln\left(\frac{y_i}{y_j}\right).

This inner product induces a norm \|x\|_a = \sqrt{\langle x, x \rangle_a} and a distance d_a(x, y) = \|x \ominus y\|_a, where \ominus is the inverse perturbation operation.[10]

Orthonormal bases are essential for isometric embeddings of the simplex into \mathbb{R}^{D-1}, enabling the application of standard multivariate techniques such as principal component analysis directly on transformed coordinates without distorting the inherent geometry of compositions. The coordinates of a composition x \in S^D with respect to such a basis are given by the isometric log-ratio (ilr) transformation, \text{ilr}(x) = (z_1, \dots, z_{D-1}), where z_k = \langle x, e_k \rangle_a for k = 1, \dots, D-1. This transformation is an isometry, satisfying d_a(x, y) = \|\text{ilr}(x) - \text{ilr}(y)\|_2, where \|\cdot\|_2 is the Euclidean norm, thus preserving all metric properties of the simplex. 
The inverse mapping is x = C \left( \exp( \text{ilr}(x)\, \Psi ) \right), with C(\cdot) the closure operation and \Psi the (D-1) \times D contrast matrix whose rows are the centered log-ratio (clr) transforms of the basis vectors e_i.[10]

A practical construction of an orthonormal basis relies on a sequential binary partition (SBP) of the D parts into two non-overlapping subcompositions, iteratively defining D-1 balances that correspond to the basis vectors. Each balance

b_k = \sqrt{ \frac{r_k s_k}{r_k + s_k} } \log \frac{ \left( \prod_{i \in I_k} x_i \right)^{1/r_k} }{ \left( \prod_{j \in J_k} x_j \right)^{1/s_k} },

where I_k and J_k are the groups with sizes r_k and s_k, forms the k-th ilr coordinate and ensures orthogonality under the Aitchison metric. This SBP approach allows tailoring the basis to domain-specific partitions, enhancing interpretability; for instance, grouping geochemically similar elements. The resulting \Psi matrix satisfies \Psi \Psi^\top = I_{D-1}, confirming orthonormality.[10]

For a ternary composition (D=3) with parts x = (x_1, x_2, x_3), the Helmert-type orthonormal basis corresponding to the sequential binary partition first splitting \{1\} \mid \{2\} and then \{1,2\} \mid \{3\} yields

z_1 = \frac{1}{\sqrt{2}} \log \frac{x_1}{x_2}, \quad z_2 = \frac{1}{\sqrt{6}} \log \frac{x_1 x_2}{x_3^2},

with the basis vectors in clr space being the rows of

\Psi = \begin{pmatrix}
\sqrt{1/2} & -\sqrt{1/2} & 0 \\
\sqrt{1/6} & \sqrt{1/6} & -\sqrt{2/3}
\end{pmatrix}.

The inverse ilr recovers the original composition via \ln x_i = (\Psi^\top \mathbf{z})_i + \ln g(x), where g(x) is the geometric mean. This example illustrates how ilr coordinates simplify visualization and modeling, such as plotting z_1 versus z_2 in a standard Euclidean plane equivalent to a ternary diagram under the Aitchison distance.[10][19]

Orthonormal bases via ilr avoid the reference-part dependency of other log-ratio transformations like the additive log-ratio (alr) and yield full-rank coordinates in \mathbb{R}^{D-1}, which is crucial for high-dimensional analyses in fields like geochemistry and microbiome studies. They support decompositions such as simplicial singular value decomposition for dimension reduction, where the basis vectors represent principal directions of variability in compositional datasets.[10]
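The ternary example above can be coded directly: the rows of \Psi define the ilr coordinates, and \Psi^\top maps them back to clr space before exponentiating and closing. A minimal NumPy sketch:

```python
import numpy as np

# Contrast matrix for the SBP {1}|{2}, then {1,2}|{3}; rows are the clr
# transforms of the basis vectors, and Psi @ Psi.T = I confirms orthonormality.
Psi = np.array([[np.sqrt(1/2), -np.sqrt(1/2),  0.0],
                [np.sqrt(1/6),  np.sqrt(1/6), -np.sqrt(2/3)]])

def closure(w):
    w = np.asarray(w, dtype=float)
    return w / w.sum()

def clr(x):
    lx = np.log(np.asarray(x, dtype=float))
    return lx - lx.mean()

def ilr(x):
    """Coordinates z = (z1, z2) with respect to the SBP basis."""
    return Psi @ clr(x)

def ilr_inv(z):
    """Back to the simplex: Psi.T maps z to clr space, then exponentiate and close."""
    return closure(np.exp(Psi.T @ z))

x = closure([0.2, 0.5, 0.3])
z = ilr(x)   # z[0] equals log(x1/x2)/sqrt(2), as in the formula above
```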
Log-Ratio Transformations
Additive Log-Ratio Transformation
The additive log-ratio (ALR) transformation is a fundamental method in compositional data analysis for mapping proportions from the simplex to unconstrained real Euclidean space, enabling the application of standard statistical techniques. Introduced as part of the log-ratio approach to address the constraints of compositional data, it expresses each component relative to a chosen reference component via logarithms of ratios. The transformation was originally proposed by Aitchison in his foundational work on the additive logistic normal distribution for compositions.[26]

Mathematically, for a D-part composition \mathbf{x} = (x_1, x_2, \dots, x_D) with x_i > 0 and \sum_{i=1}^D x_i = 1, the ALR transformation selects one part, typically the last (x_D), as the denominator and yields a (D-1)-dimensional vector:

\text{alr}(\mathbf{x}) = \left( \ln \frac{x_1}{x_D}, \ln \frac{x_2}{x_D}, \dots, \ln \frac{x_{D-1}}{x_D} \right).[9]

This formulation is subcompositionally coherent provided the reference part is retained in the subcomposition: applied to such a subcomposition, it yields the corresponding subvector of the full ALR coordinates. The resulting coordinates operate on an additive scale, preserving the relative information in the original data while removing the sum constraint. However, the ALR is not an isometry in the Aitchison geometry, so Euclidean distances in the transformed space do not directly correspond to distances on the simplex.[27]

A key advantage of the ALR transformation lies in its interpretability: each coordinate directly represents the log-ratio of a part to the reference, facilitating straightforward analysis of relative abundances, for example in regression models or principal component analysis. It also serves as a basis for generating all pairwise log-ratios among components, since any pairwise log-ratio is a difference of two ALR coordinates, making it computationally simple and suitable for exploratory data analysis.
For instance, in analyzing fatty acid compositions in marine copepods, ALR coordinates revealed seasonal variability in lipid profiles, with total variation quantified at 0.2462, highlighting shifts in ratios to a reference fatty acid.[27]

Despite these strengths, the ALR has notable limitations. The choice of reference component influences the coordinates and subsequent interpretations, potentially introducing bias if the reference varies systematically across samples; this arbitrariness can complicate comparisons unless standardized. Additionally, it requires strictly positive values, rendering it sensitive to zero components common in real-world data like microbiome surveys, often necessitating imputation or substitution methods. In contrast to the centered log-ratio (CLR), which uses the geometric mean as a symmetric reference, or the isometric log-ratio (ILR), which provides an orthonormal basis preserving distances, the ALR is less flexible for advanced multivariate techniques but remains a practical starting point due to its simplicity.[27][9]
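As a minimal illustration of the definition above, the following numpy sketch (function names and the example composition are illustrative) implements the ALR with the last part as reference, together with its inverse, which appends the reference and applies closure:

```python
import numpy as np

def alr(x):
    """Additive log-ratio with the last part as reference: ln(x_i / x_D)."""
    x = np.asarray(x, dtype=float)
    return np.log(x[:-1] / x[-1])

def alr_inv(y):
    """Inverse alr: append the reference (whose log-ratio to itself is 0),
    exponentiate, and renormalise (closure)."""
    v = np.exp(np.append(y, 0.0))
    return v / v.sum()

x = np.array([0.1, 0.2, 0.3, 0.4])
y = alr(x)                       # 3 unconstrained coordinates for a 4-part composition
assert y.shape == (3,)
assert np.allclose(alr_inv(y), x)  # round trip recovers the composition
```

The round trip works because the ratios x_i/x_D carry all the relative information in the composition; only the overall scale, removed by closure, is lost.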
Centered Log-Ratio Transformation
The centered log-ratio (clr) transformation is a key method in compositional data analysis that maps compositions from the simplex to unconstrained real space, enabling the application of standard multivariate statistical techniques while preserving the relative nature of the data. Introduced by John Aitchison, it addresses the constraints inherent in compositional data by expressing each component relative to the geometric mean of all components.

Mathematically, for a D-part composition \mathbf{x} = (x_1, \dots, x_D) with x_i > 0 and \sum_{i=1}^D x_i = 1, the clr transformation is defined as

\text{clr}(\mathbf{x}) = \left( \ln \frac{x_1}{g(\mathbf{x})}, \dots, \ln \frac{x_D}{g(\mathbf{x})} \right),

where g(\mathbf{x}) = \left( \prod_{i=1}^D x_i \right)^{1/D} denotes the geometric mean of the components. This yields a vector in \mathbb{R}^D that lies in the hyperplane \{\boldsymbol{\xi} \in \mathbb{R}^D : \sum_{i=1}^D \xi_i = 0 \}, a zero-sum constraint that follows from taking logarithmic ratios to the geometric mean.

The transformation is an isomorphism between the Aitchison simplex and the clr coordinates, preserving perturbation (the group operation on compositions) and ensuring subcompositional coherence in a limited sense. More importantly, it acts as an isometry with respect to the Aitchison geometry, maintaining distances and inner products defined on the simplex: the Aitchison inner product \langle \mathbf{x}, \mathbf{y} \rangle_A = \sum_{i=1}^D \ln(x_i / g(\mathbf{x})) \ln(y_i / g(\mathbf{y})) corresponds directly to the Euclidean inner product in clr space.
This geometric fidelity allows clr-transformed data to support classical methods like principal component analysis, though the resulting covariance matrix is singular (rank D-1) due to the linear dependence of coordinates.[17]

The inverse clr transformation recovers the original composition via

x_i = \frac{\exp(\xi_i)}{\sum_{j=1}^D \exp(\xi_j)}, \quad i = 1, \dots, D,

where \boldsymbol{\xi} = \text{clr}(\mathbf{x}), effectively applying a softmax function to the coordinates. This bidirectional mapping facilitates iterative analyses, such as regression models on clr coordinates with back-transformation for interpretation. However, clr is not fully subcompositionally coherent for subsets of components, as the geometric mean changes, potentially complicating analyses of subcompositions compared to alternatives like the isometric log-ratio transformation.[28]

In practice, clr is widely used in fields like geochemistry and microbiome studies to handle closure effects and skewness, promoting multivariate normality in transformed space for robust inference. Its simplicity, requiring only the geometric mean, makes it computationally efficient, though care must be taken with zero values, often addressed via multiplicative replacement. Later developments building on Aitchison's framework, such as orthonormal bases in the simplex, further integrate clr into broader geometric approaches for compositional modeling.[17][28]
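The clr map and its softmax inverse are short enough to state directly in code. This numpy sketch (names are illustrative) also checks the zero-sum property of the coordinates:

```python
import numpy as np

def clr(x):
    """Centered log-ratio: ln(x_i / g(x)), with g(x) the geometric mean."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()   # subtracting the mean log equals dividing by g(x)

def clr_inv(xi):
    """Inverse clr: softmax of the coordinates."""
    e = np.exp(xi - xi.max())   # shift for numerical stability; cancels in the ratio
    return e / e.sum()

x = np.array([0.1, 0.2, 0.3, 0.4])
xi = clr(x)
assert np.isclose(xi.sum(), 0.0)   # coordinates lie in the zero-sum hyperplane
assert np.allclose(clr_inv(xi), x) # softmax recovers the composition
```

The max-shift in `clr_inv` is a standard softmax trick: it changes the numerator and denominator by the same factor and therefore leaves the closure unchanged.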
Isometric Log-Ratio Transformation
The isometric log-ratio (ILR) transformation is a method for mapping compositional data from the simplex to an unconstrained real vector space \mathbb{R}^{D-1}, where D is the number of components, while preserving the Aitchison geometry of the simplex. Introduced by Egozcue et al. (2003), it constructs an orthonormal basis that makes the transformation an isometry: Euclidean distances and inner products in the transformed space correspond exactly to the Aitchison distances and inner products in the original simplex. This allows standard multivariate statistical techniques, such as principal component analysis and regression, to be applied without distortion due to the compositional constraints.[10]

The ILR coordinates are typically generated using a sequential binary partition (SBP) of the D components into disjoint groups, resulting in D-1 balances that form the transformed vector. For a composition \mathbf{x} = (x_1, \dots, x_D) with \sum x_i = 1 and x_i > 0, the k-th balance is defined as

z_k(\mathbf{x}) = \sqrt{\frac{r_k s_k}{r_k + s_k}} \ln \left( \frac{\left( \prod_{i \in G_k} x_i \right)^{1/r_k}}{\left( \prod_{j \in H_k} x_j \right)^{1/s_k}} \right),

where G_k and H_k are the numerator and denominator groups in the k-th partition step, with sizes r_k = |G_k| and s_k = |H_k|, respectively. The scaling factor ensures orthonormality. The full ILR transformation can also be expressed as \mathbf{z} = \text{clr}(\mathbf{x}) \mathbf{V}^\top, where \text{clr}(\mathbf{x}) = \left( \ln \frac{x_1}{g(\mathbf{x})}, \dots, \ln \frac{x_D}{g(\mathbf{x})} \right) is the centered log-ratio transformation, g(\mathbf{x}) = \left( \prod x_i \right)^{1/D} is the geometric mean, and \mathbf{V} is a (D-1) \times D contrast matrix with orthonormal rows satisfying \mathbf{V} \mathbf{V}^\top = \mathbf{I}_{D-1}.
The inverse transformation recovers the composition via \mathbf{x} = C \left( \exp(\mathbf{z} \mathbf{V}) \right), where C denotes closure to the simplex.[10][9]

Compared to the additive log-ratio (ALR) and centered log-ratio (CLR) transformations, ILR offers distinct advantages in preserving metric structure and facilitating analysis. ALR, which uses one component as a fixed denominator (e.g., \text{alr}(\mathbf{x}) = \left( \ln \frac{x_1}{x_D}, \dots, \ln \frac{x_{D-1}}{x_D} \right)), is not isometric and depends on the arbitrary choice of denominator, leading to oblique coordinates that distort distances. CLR, while symmetric and isometric to the simplex, produces a singular covariance matrix because the coordinates sum to zero, complicating procedures like principal component analysis that assume full rank. In contrast, ILR yields orthogonal coordinates with a nonsingular covariance matrix, enabling direct application of Euclidean-based methods while maintaining subcompositional coherence, meaning that subcompositions transform consistently with the full composition.[10][29][9]

These properties make ILR particularly suitable for high-dimensional compositional data, as the orthonormal basis decomposes variance into independent components interpretable as contrasts between groups of parts. For instance, in geochemical analyses, balances can represent contrasts like major vs. trace elements, aiding in pattern recognition without redundancy. However, the choice of SBP affects interpretability, so partitions are often designed domain-specifically to align with substantive contrasts. Handling zeros requires imputation or replacement strategies, as with other log-ratio methods, to avoid undefined logarithms. Overall, ILR's isometric nature underpins its widespread adoption in fields like geosciences, environmental science, and metagenomics for robust statistical inference on relative abundances.[10][30]
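The isometry claim can be verified numerically. The sketch below (an illustrative numpy construction, not from the source) builds one particular orthonormal contrast matrix of Helmert type for D parts and checks that the Euclidean distance between ILR coordinate vectors equals the Aitchison distance, i.e. the Euclidean distance between clr vectors:

```python
import numpy as np

def helmert_contrasts(D):
    """One choice of (D-1) x D contrast matrix with orthonormal, zero-sum rows
    (a Helmert-type sequential binary partition {1}|{2}, {1,2}|{3}, ...)."""
    V = np.zeros((D - 1, D))
    for k in range(1, D):
        V[k - 1, :k] = 1.0 / np.sqrt(k * (k + 1))
        V[k - 1, k] = -k / np.sqrt(k * (k + 1))
    return V

def clr(x):
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()

def ilr(x, V):
    """ILR coordinates z = clr(x) @ V.T."""
    return clr(x) @ V.T

def aitchison_dist(x, y):
    """Aitchison distance = Euclidean distance between clr vectors."""
    return np.linalg.norm(clr(x) - clr(y))

x = np.array([0.1, 0.2, 0.3, 0.4])
y = np.array([0.25, 0.25, 0.3, 0.2])
V = helmert_contrasts(4)
assert np.allclose(V @ V.T, np.eye(3))  # orthonormal rows
# Isometry: distance in ILR coordinates equals the Aitchison distance.
assert np.isclose(np.linalg.norm(ilr(x, V) - ilr(y, V)), aitchison_dist(x, y))
```

The equality holds because clr differences lie in the zero-sum hyperplane, which the orthonormal rows of \mathbf{V} span exactly; projecting onto that basis preserves norms.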
Applications
Classical Applications
Compositional data analysis emerged in the late 19th and early 20th centuries primarily to address challenges in handling proportional data that sum to a constant, such as percentages or ratios, where standard statistical methods often led to misleading results like spurious correlations. Karl Pearson first highlighted this issue in 1897, demonstrating how correlations between derived indices, such as proportions of body parts in biological specimens or economic allocations, could arise artifactually due to the constant sum constraint rather than true relationships.[8] This foundational recognition spurred early applications across disciplines, emphasizing the need for transformations to mitigate closure effects before the formal log-ratio framework was established in the 1980s.

In geology, one of the earliest and most prominent classical applications involved analyzing rock and sediment compositions, where data on mineral abundances or grain size distributions (e.g., percentages of sand, silt, and clay) sum to 100%. Felix Chayes in 1960 formalized the problem of correlations among constant-sum variables, showing that the unit-sum constraint induces negative biases in covariance matrices, which invalidated direct application of multivariate techniques to geochemical data like major oxide concentrations in igneous rocks.[31] For instance, early studies examined petrologic variations in major oxide concentrations in igneous rocks such as basalts and granites, using log transforms to explore ratios and discriminate between rock types, addressing the closure effects that obscured patterns in relative abundances.
These applications were crucial for understanding geological processes like sedimentation and igneous differentiation, with representative analyses revealing patterns in relative abundances that standard statistics obscured.[32]

Economic applications classically focused on household budget surveys, treating expenditure shares across commodity groups (e.g., food, housing, services) as compositions summing to total income. John Aitchison's 1986 monograph detailed log-contrast modeling for such data, applying it to datasets from single-person households to estimate demand elasticities and test compositional invariance, finding, for example, that foodstuffs exhibited lower elasticities than services.[33] This approach resolved issues like spurious negative correlations between spending categories, enabling inferences about consumer behavior and income effects without violating the sum constraint. Similar techniques were extended to market share analyses in early econometric studies, prioritizing relative proportions over absolute values to model competitive dynamics.

Other classical fields included soil science, where particle size distributions (e.g., clay, silt, and sand fractions) were analyzed for fertility assessments, and biology, where species relative abundances were studied in ecological samples. In soil studies, early 20th-century work applied ratio-based adjustments to avoid closure-induced biases in texture classifications, influencing agricultural mapping. In biology, Mosimann's 1960s explorations of multinomial models for proportions in animal diets or populations built on Pearson's insights, using log-ratios to quantify diversity without spurious effects. These applications underscored the broad utility of preliminary transformation methods in the pre-Aitchison era, establishing CoDA as essential for relative-scale data across the natural and social sciences.[34]
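The closure effect that Pearson and Chayes described is easy to reproduce in simulation. The following sketch (an illustrative numpy example with arbitrary distribution parameters) generates three independent positive measurements, so their pairwise correlations are near zero, then closes each row to proportions and shows that the induced correlation between two parts becomes strongly negative even though no real association exists:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three independent, positive "open" measurements: no real association.
raw = rng.lognormal(mean=0.0, sigma=0.25, size=(5000, 3))
r_open = np.corrcoef(raw[:, 0], raw[:, 1])[0, 1]      # near zero

# Closure: divide each row by its sum so parts sum to 1.
comp = raw / raw.sum(axis=1, keepdims=True)
r_closed = np.corrcoef(comp[:, 0], comp[:, 1])[0, 1]  # spuriously negative

assert abs(r_open) < 0.1
assert r_closed < -0.3
```

For exchangeable parts the constant-sum constraint forces the population correlation between any two of D closed parts toward -1/(D-1) (here about -0.5), which is exactly the artifact that motivated the log-ratio framework.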
Modern Applications
In recent years, compositional data analysis (CoDA) has found prominent applications in health and epidemiology, particularly for analyzing 24-hour time-use behaviors such as sleep, sedentary time, light physical activity, and moderate-to-vigorous physical activity, which inherently sum to a fixed total of one day. This approach addresses the interdependence of these behaviors by employing log-ratio transformations to model reallocations of time across components, revealing associations with health outcomes like adiposity, cardiovascular risk, and mental health. For instance, studies have shown that replacing sedentary time with moderate-to-vigorous physical activity is linked to improved metabolic profiles, with compositional models providing more robust estimates than traditional methods that ignore the relative nature of the data.[35][36][37]In microbiome research, CoDA has become essential for interpreting high-throughput sequencing data, where microbial abundances are reported as relative proportions summing to a constant total due to sequencing depth constraints. Methods like centered log-ratio transformations enable differential abundance testing and clustering while accounting for compositional biases, such as spurious correlations from rare taxa. 
This has facilitated insights into gut microbiota diversity in relation to diet, disease, and environmental factors; for example, analyses of fecal samples from healthy cohorts have identified log-ratio-based biomarkers for conditions like inflammatory bowel disease, outperforming count-based approaches in sensitivity and interpretability.[38][14][39]

CoDA is increasingly integrated with machine learning for handling compositional predictors in supervised tasks, such as classification of glycomic profiles[40] or environmental samples, where kernel methods adapted via isometric log-ratio coordinates preserve the Aitchison geometry.[41] In environmental science, recent applications include geochemical water quality assessment, where CoDA coupled with principal component analysis derives pollution indices from ion proportions, as demonstrated in groundwater studies revealing spatial contamination patterns. These advancements, often leveraging Bayesian multilevel models for clustered data, underscore CoDA's role in high-dimensional, relative datasets across disciplines.[30][42]
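The time-use reallocation idea from the health applications above can be sketched compositionally. In this hypothetical numpy example (the minute values and behaviour categories are invented for illustration, not taken from any study), 30 minutes are moved from sedentary time to moderate-to-vigorous physical activity (MVPA), and the change is expressed in clr coordinates, the scale on which compositional regression models of time use are typically fitted:

```python
import numpy as np

def clr(x):
    """Centered log-ratio: ln(x_i / g(x)) with g the geometric mean."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()

# Hypothetical 24-hour day (minutes): sleep, sedentary, light PA, MVPA.
day = np.array([480.0, 600.0, 330.0, 30.0])
assert day.sum() == 1440  # the parts exhaust the whole day

# Reallocate 30 minutes from sedentary time to MVPA; the total stays fixed.
realloc = day + np.array([0.0, -30.0, 0.0, 30.0])

# Change in clr coordinates induced by the reallocation.
delta = clr(realloc / realloc.sum()) - clr(day / day.sum())
```

Because clr is scale invariant, working with minutes or with proportions of the day gives the same coordinates; the vector `delta` captures the purely relative shift (MVPA up, sedentary down) that a compositional model would associate with an outcome.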