Multiple factor analysis
Multiple factor analysis (MFA) is a multivariate statistical technique that extends principal component analysis (PCA) to datasets consisting of multiple blocks or groups of variables (typically a mix of quantitative and qualitative types) measured on the same set of observations. By normalizing each variable block individually (often by scaling to unit first eigenvalue) and then concatenating the blocks for a global PCA, MFA balances the influence of disparate groups, enabling the identification of common structures across blocks while assessing their individual contributions and relationships. The method provides factor scores, loadings, and visualizations that summarize complex, multi-table data in a unified framework, making it particularly suited for exploratory analysis where traditional PCA would be biased by imbalanced variable sets.[1][2]

Developed by statisticians Brigitte Escofier and Jérôme Pagès in the late 1980s and early 1990s, MFA emerged as a synthesis of earlier multivariate approaches, including canonical analysis, Procrustes rotation, and individual differences scaling (INDSCAL), to address the challenges of integrating heterogeneous data tables. Their seminal work, detailed in a 1994 publication, formalized MFA as a weighted factor analysis tool capable of processing both numerical and categorical variables on shared individuals, with implementations available in software packages like AFMULT. Subsequent refinements, such as extensions for incomplete data and hierarchical structures, have built on this foundation, enhancing its applicability in modern computational environments.[2][3]

At its core, MFA operates in two main stages: first, performing separate PCAs (or multiple correspondence analyses for categorical blocks) on each normalized data table to derive partial factor scores; second, aggregating these scores into a composite table for an unweighted global PCA, which yields overall dimensions representing a consensus across blocks. This process not only reveals how observations cluster but also quantifies the similarity between blocks through metrics such as RV coefficients, allowing researchers to evaluate whether certain variable groups align or diverge in explaining variance. Dual formulations of MFA exist for cases where the same variables are observed across different samples, further broadening its utility.[1][2]

MFA finds extensive use in fields requiring the integration of diverse data sources, such as sensory science (where it analyzes panels of descriptors alongside instrumental measurements for products like wines or foods) and marketing, where it combines consumer surveys with demographic profiles. It has also been applied in taxonomy to diagnose species relationships from morphological and genetic blocks, in environmental studies for multiblock ecological data, and in the social sciences for exploring multifaceted survey responses. These applications highlight MFA's strength in providing interpretable, balanced insights into complex systems without requiring prior assumptions about variable importance.[1][4][5]
Introduction

Definition and Objectives
Multiple factor analysis (MFA) is a principal component method designed for the simultaneous analysis of multiple groups of variables, which can be numerical and/or categorical, measured on the same set of observations. It aims to identify common underlying structures across these groups while evaluating the balance, or relative contributions, of each group to the overall analysis. The primary objectives of MFA include summarizing complex multi-table data in a lower-dimensional representation, detecting redundancies or complementarities between variable groups, and achieving a unified dimensionality reduction that accounts for the individual inertias of each group. By normalizing and weighting the groups appropriately, MFA facilitates the exploration of shared patterns without one group dominating the results due to scale differences.

This approach builds on principal component analysis (PCA) for continuous variables and multiple correspondence analysis (MCA) for categorical ones, adapting their principles to multi-group settings. Key benefits of MFA lie in its ability to handle mixed data types without requiring homogeneity across groups, to enable direct comparisons of the importance of different variable sets, and to support exploratory analyses in diverse fields such as sensory evaluation and multi-omics studies. In sensory evaluation, for instance, it allows integration of assessor ratings and physicochemical measurements to assess product perceptions holistically. Similarly, in multi-omics research, MFA integrates datasets such as genomics and proteomics to uncover coordinated biological variations.

The main outputs of MFA consist of a global factor map representing the compromise across all groups, partial factor maps illustrating each group's specific structure projected onto the global axes, and balance indicators that quantify the contributions and inertias of individual groups. These visualizations and metrics provide insights into both the consensus and the discrepancies among the data tables.
Relation to Other Factorial Methods

Multiple factor analysis (MFA) extends principal component analysis (PCA) to the analysis of multiple data tables describing the same set of observations, addressing the limitations of standard PCA on single-block data by incorporating a normalization step for each group to ensure balanced contributions. In PCA, the focus is on maximizing variance within a single table through eigenvalue decomposition. MFA instead first performs a PCA on each individual group (or block) of variables, scales each group's data by dividing by the square root of the first eigenvalue of that group's PCA to normalize inertia, and then concatenates the normalized tables for a global PCA. This adaptation prevents any single group with larger variance from dominating the analysis, allowing a joint representation that respects the structure of heterogeneous data sets.[6]

For groups involving categorical variables, MFA integrates principles from multiple correspondence analysis (MCA) by treating such data as contingency tables and adjusting for category frequencies to align the scaling with continuous variables, effectively performing MCA within each qualitative group before normalization. Unlike standalone MCA, which analyzes categorical data using chi-squared distances to handle the double contingency table inherent in multiple categories, MFA embeds this within the multi-group framework, representing categories by their centers of gravity rather than by disjunctive coding alone, which facilitates integration with quantitative groups without distorting the overall factor space. This hybrid approach ensures that categorical and continuous variables contribute comparably to the global factors after normalization by the first singular value of the group's MCA.[6][7]

MFA also distinguishes itself from other multi-table methods. STATIS seeks a compromise between tables by optimizing weights to maximize the similarity of observation factor scores across groups via RV coefficients, whereas MFA employs a fixed normalization scheme to promote balance without iterative weighting. In contrast to multi-block PCA variants such as SUM-PCA, which concatenate blocks after simple variance standardization and may allow dominant blocks to overshadow others, MFA's inertia-based normalization and its emphasis on eigenvalue ratios (comparing each group's first eigenvalue to that of the global principal component) explicitly assess and enforce equilibrium across groups, making it particularly suited to mixed data types. These differences position MFA as a balanced extension of factorial methods for multi-block settings, assuming prior familiarity with PCA's variance maximization and MCA's distance metrics.[6]
Data Structure and Preparation

Organization of Multiple Variable Groups
Multiple factor analysis (MFA) requires a multi-table data structure in which a set of I observations is described by K distinct groups of variables, with the k-th group comprising J_k variables organized as an I × J_k matrix.[8] These matrices are conceptually concatenated horizontally to form a global I × ∑J_k data set, though each group is analyzed separately in the initial stages to account for its internal structure.[4] For instance, in sensory analysis of food products, one group might include physical attributes such as pH and density, while another covers sensory attributes such as sweetness and bitterness.[4]

A fundamental requirement is that the same I observations appear across all K groups, ensuring comparability and alignment in the analysis.[8] Missing values can be handled within standard MFA: numerical variables are often imputed with column means, while categorical variables can be treated as an additional category or coded as absent in the disjunctive table. Complete data remain ideal for accuracy, and advanced imputation methods are available for more complex cases.[9] Groups must also be conceptually distinct, representing different aspects or domains of the observations (e.g., quantitative measurements versus qualitative descriptors), to facilitate the identification of shared and unique patterns.[8]

Preprocessing begins by centering all numerical variables within each group, subtracting the group-specific mean to remove location effects and focus on variance.[4] Categorical variables are transformed into disjunctive tables, or indicator matrices, where each category becomes a binary column (1 if present, 0 otherwise), enabling factorial treatment akin to multiple correspondence analysis (MCA).[4] If variables within a group exhibit differing scales, they may be scaled to unit variance prior to analysis to ensure equitable contributions during group factorization.[8]

When defining groups, practitioners should aim for a relatively balanced number of variables (J_k) across the K groups to prevent any single group from disproportionately influencing the global structure, although the subsequent normalization steps in MFA mitigate imbalances.[8] Each group is typically analyzed using principal component analysis (PCA) for quantitative variables or MCA for categorical ones, providing the foundation for integration.[4]
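To make this organization concrete, here is a minimal sketch in Python (NumPy and pandas) of a multi-table structure with hypothetical group and variable names; it illustrates the layout and preprocessing described above, not a reference implementation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20  # I observations (e.g., food products)

# Group 1: physical attributes (quantitative)
physical = pd.DataFrame({"pH": rng.normal(3.4, 0.1, n),
                         "density": rng.normal(0.995, 0.002, n)})

# Group 2: sensory attributes (quantitative)
sensory = pd.DataFrame({"sweetness": rng.uniform(1, 10, n),
                        "bitterness": rng.uniform(1, 10, n)})

# Group 3: a qualitative descriptor, expanded into a disjunctive table
flavor = pd.Series(rng.choice(["fruity", "oaky", "spicy"], n), name="flavor")
flavor_disj = pd.get_dummies(flavor, prefix="flavor").astype(float)

def center(df, scale=False):
    """Center a quantitative group; optionally scale to unit variance
    when variables within the group have differing scales."""
    out = df - df.mean()
    return out / df.std(ddof=0) if scale else out

# The same n rows appear in every group; each I x J_k table is kept
# separate for the group-wise analyses before concatenation.
groups = {"physical": center(physical, scale=True),
          "sensory": center(sensory),
          "flavor": flavor_disj}
for name, table in groups.items():
    print(name, table.shape)
```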
Handling Different Variable Types

In multiple factor analysis (MFA), data are organized into groups of variables, and the treatment of variable types within each group is essential for an equitable contribution to the overall analysis. Numerical variables are standardized by centering them to a mean of zero and scaling to unit variance, enabling the application of principal component analysis (PCA), which relies on Euclidean distances, to summarize the group's variability. This standardization gives all numerical variables comparable scales, preventing any single variable from dominating the group's principal components.[7]

Categorical variables require transformation into a disjunctive table, consisting of indicator columns for each category. Each indicator column is weighted by the proportion of individuals who do not possess that category (1 − f_i, where f_i is the frequency of the category), and multiple correspondence analysis (MCA) is then applied using chi-squared distances, which account for the relative frequencies and capture associations among categories. This approach balances the influence of categories with varying prevalences, aligning the categorical group's inertia with that of numerical groups.[7]

Groups with mixed variable types are rare in standard MFA, as the method assumes homogeneity within groups in order to apply the appropriate metric space consistently. In such cases, variables are often separated into homogeneous sub-groups for separate PCA or MCA before integration, or extensions incorporate hybrid distances that combine Euclidean metrics for numerical components with chi-squared metrics for categorical ones; this is particularly noted in applications to multi-omics data, where diverse modalities such as continuous expression levels and discrete mutations necessitate adaptive handling.[10]

Ordinal variables pose type-specific challenges, as their ordered categories can be treated either as categorical (via disjunctive coding and MCA, to respect the discrete levels) or as numerical if the scale is sufficiently granular to approximate continuous data, allowing standardization and PCA; both routes are illustrated in the sketch below. The decision hinges on the number of levels and the meaningfulness of the intervals, ensuring that the treatment aligns with the variable's measurement properties for compatibility in the global MFA framework.[11]

To validate the setup, groups should contain an adequate number of variables, such as more than five per group, to promote stability in the extracted factors and reliable estimation of group-specific inertias. Smaller groups risk unstable principal components and inflated variability in balance metrics.[8]
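As a concrete illustration of the two routes for an ordinal variable, the sketch below codes a hypothetical five-level rating both ways; the (1 − f_i) indicator weights follow the convention described above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
rating = pd.Series(rng.integers(1, 6, 20), name="rating")  # hypothetical 5-level scale

# Route 1: treat as categorical -- disjunctive coding for an MCA-style group.
disjunctive = pd.get_dummies(rating, prefix="rating").astype(float)
f = disjunctive.mean()        # category frequencies f_i
weights = 1.0 - f             # indicator weights (1 - f_i), as above

# Route 2: treat as numerical -- standardize for a PCA-style group,
# defensible only when the intervals between levels are meaningful.
standardized = (rating - rating.mean()) / rating.std(ddof=0)

print(weights.round(2).to_dict())
print(standardized.head().round(2).tolist())
```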
Core Methodology

Group Normalization and Weighting
In multiple factor analysis (MFA), group normalization and weighting begin with the separate analysis of each group of variables to ensure equitable contributions across diverse data sets. For numerical variable groups, principal component analysis (PCA) is performed on the centered and scaled data matrix X_k, while for categorical groups, multiple correspondence analysis (MCA) is applied, yielding the first eigenvalue \lambda_{1k} for each group k. This initial step captures the internal structure of each group independently, with \lambda_{1k} representing the maximum variance (or inertia) explained by the group's first principal component.[7][8]

Normalization follows, to balance the influence of groups that may differ in size or variability. Specifically, the data matrix X_k for group k is divided by the square root of its first eigenvalue \lambda_{1k}, producing the normalized matrix Z_k = \frac{1}{\sqrt{\lambda_{1k}}} X_k. This adjustment equalizes the maximum variance across groups, as the first dimension of each normalized group now accounts for unit variance. The rationale for this weighting is to prevent larger groups (those with more variables or higher overall inertia) from dominating the subsequent global analysis, thereby promoting a fair comparison of the typologies or structures within each group.[7][8][11]

The output of this normalization step consists of normalized partial factor maps for each group, where the coordinates in Z_k rescale the original principal components to a common scale. These maps preserve the relative positions within each group while mitigating scale disparities, preparing the data for integration into a unified framework. By design, this approach ensures that no single group can unilaterally define the primary axes of variation in the overall analysis.[7][8]
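A minimal sketch of this step, assuming centered group tables: the first eigenvalue of each group is obtained from its singular value decomposition (using the convention λ = s²/I), and the table is divided by its square root.

```python
import numpy as np

def first_eigenvalue(X):
    """First eigenvalue lambda_1k of the PCA of a centered I x J_k table,
    i.e., the squared first singular value of X / sqrt(I)."""
    s = np.linalg.svd(X / np.sqrt(X.shape[0]), compute_uv=False)
    return s[0] ** 2

def normalize_group(X):
    """Z_k = X_k / sqrt(lambda_1k): the first principal component of the
    normalized table then carries unit variance."""
    return X / np.sqrt(first_eigenvalue(X))

rng = np.random.default_rng(2)
X1 = rng.normal(size=(20, 5));  X1 -= X1.mean(axis=0)   # small group
X2 = rng.normal(size=(20, 12)); X2 -= X2.mean(axis=0)   # larger group

Z1, Z2 = normalize_group(X1), normalize_group(X2)
print(first_eigenvalue(Z1), first_eigenvalue(Z2))  # both 1.0 after normalization
```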
Global Data Set Construction

In multiple factor analysis (MFA), the global data set is assembled by horizontally concatenating the normalized matrices from each group of variables, enabling a unified principal component analysis (PCA) across all groups. Specifically, for K groups, each normalized matrix Z_k (of dimensions I \times J_k, where I is the number of observations and J_k the number of variables in group k) is bound side by side to form the global matrix Z = [Z_1 \ Z_2 \ \dots \ Z_K], resulting in an I \times \sum J_k matrix. This structure preserves the block-wise organization while allowing the extraction of compromise factors that balance contributions from all groups. The normalization of each Z_k, typically by dividing the original group matrix by its first singular value \phi_k = \sqrt{\lambda_{1k}} (where \lambda_{1k} is the first eigenvalue from a preliminary PCA or MCA on group k), ensures that no single group dominates due to scale differences.[7][8]

The primary purpose of this global construction is to perform PCA on Z, yielding factors that represent all groups equitably after normalization and reveal shared structures across variable sets while highlighting discrepancies. Unequal group sizes are implicitly addressed through the normalization step, since scaling by the first singular value equalizes the inertia of each group along its first dimension; however, when groups differ vastly in variable counts (e.g., one with 5 variables versus another with 50), subtle biases toward larger groups may remain, prompting extensions such as explicit group weighting in advanced implementations.
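Continuing the sketch, the normalized tables are stacked side by side and a single PCA is run on the global matrix (here via an SVD); the column slices keep track of which columns belong to which group. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

def normalize_group(X):
    s = np.linalg.svd(X / np.sqrt(X.shape[0]), compute_uv=False)
    return X / s[0]   # s[0] = sqrt(lambda_1k) under the lambda = s^2 convention

X1 = rng.normal(size=(20, 5));  X1 -= X1.mean(axis=0)
X2 = rng.normal(size=(20, 12)); X2 -= X2.mean(axis=0)
Z1, Z2 = normalize_group(X1), normalize_group(X2)

Z = np.hstack([Z1, Z2])                         # global I x (J_1 + J_2) matrix
slices = {"group1": slice(0, 5), "group2": slice(5, 17)}

n = Z.shape[0]
U, s, Vt = np.linalg.svd(Z / np.sqrt(n), full_matrices=False)
eigenvalues = s ** 2                            # global eigenvalues lambda_j
F = np.sqrt(n) * U * s                          # global factor scores for observations

print(eigenvalues[:3].round(3), F.shape)
```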
Factor Extraction and Coordinates

Once the global data set Z is constructed by concatenating the normalized group data tables Z_k, multiple factor analysis proceeds with a principal component analysis (PCA) applied to this aggregated matrix.[6] The PCA extracts the principal factors by decomposing the covariance structure of Z, yielding eigenvalues \lambda_j that quantify the variance explained by each successive factor j, along with the global principal coordinates F_j for the observations and the loadings for the variables.[6] These global coordinates F_j represent the positions of the observations in the compromise space, which synthesizes information across all groups while respecting their individual structures.[4]

The partial coordinates for each group k on factor j are then derived by projecting the group's normalized matrix Z_k onto the global eigenvectors V_j from the PCA of Z, given by the formula

F_{kj} = Z_k V_j

This projection captures how each group's variables contribute to the global factors without altering the overall compromise.[6] The resulting partial coordinates F_{kj} allow group-specific interpretations within the shared factor space.

The number of factors to retain is typically determined using criteria such as the scree plot of eigenvalues or the cumulative percentage of inertia explained, often selecting 2 to 5 dimensions for practical interpretability in applications like sensory analysis.[4] Total inertia in MFA is computed as \mathcal{I} = \frac{1}{I} \operatorname{tr}(Z^{\top} Z), where I denotes the number of observations, providing a measure of the overall variance across the balanced groups.[6]

Mathematically, this global PCA maximizes the explained variance across all normalized groups simultaneously, ensuring that no single group dominates the factor structure, thanks to the prior balancing.[4] The eigenvalues \lambda_j thus reflect the inertia along each principal axis of this unified space, with the sum of the retained \lambda_j indicating the proportion of total inertia captured.[6]
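The projection above can be written in a few lines; here V is the matrix of global eigenvectors and each group's partial coordinates come from its own column block. Note that some presentations additionally multiply the partial coordinates by the number of groups K so that they average to the global ones; the unscaled version shown here follows the formula as stated.

```python
import numpy as np

rng = np.random.default_rng(2)

def normalize_group(X):
    s = np.linalg.svd(X / np.sqrt(X.shape[0]), compute_uv=False)
    return X / s[0]

X1 = rng.normal(size=(20, 5));  X1 -= X1.mean(axis=0)
X2 = rng.normal(size=(20, 12)); X2 -= X2.mean(axis=0)
Z = np.hstack([normalize_group(X1), normalize_group(X2)])
blocks = [slice(0, 5), slice(5, 17)]

U, s, Vt = np.linalg.svd(Z / np.sqrt(Z.shape[0]), full_matrices=False)
V = Vt.T                                   # global eigenvectors, one column per factor

F_global = Z @ V                           # global coordinates of the observations
partials = [Z[:, b] @ V[b, :] for b in blocks]   # F_k = Z_k V_k, per group

# The block decomposition of the matrix product implies that the partial
# coordinates sum to the global ones:
print(np.allclose(sum(partials), F_global))      # True
```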
Balance Analysis

Metrics for Group Contributions
In multiple factor analysis, several metrics quantify the contributions of individual variable groups to the global factors, enabling researchers to assess relative importance, alignment, and potential imbalances after extraction.

The first-eigenvalue ratio for group k, denoted L_{gk} = \lambda_{1k} / \sum_{m=1}^K \lambda_{1m}, measures the group's relative importance prior to normalization, where \lambda_{1k} is the first eigenvalue from the separate principal component analysis (or multiple correspondence analysis) of group k and the denominator sums these values across all K groups; higher ratios indicate groups with stronger inherent structure that could dominate the analysis without adjustment.[7]

The contribution of group k to the inertia of global factor j is captured by C_{kj} = (\sum \text{variances in partial } F_{kj}) / \lambda_j, where the numerator sums the variances (or squared coordinates) from the partial factor map of group k on dimension j, and \lambda_j is the eigenvalue of the global factor; this proportion reveals how much each group supports the explanation of overall data variance along specific dimensions, with larger C_{kj} values highlighting influential groups.[8]

The quality of the coordinates for group k on factor j is evaluated using \cos^2_{kj} = (\text{inertia of partial } F_{kj}) / \lambda_{1k}, which expresses the fraction of the group's total variance (as given by its first eigenvalue \lambda_{1k}) explained by the global factor; values approaching 1 denote an excellent fit, meaning the global structure effectively captures the group's variability, while lower values suggest misalignment.[12]

For imbalance detection, the average \cos^2 across the initial factors (typically the first two or three) is computed for each group; persistently low averages, such as below 0.3, indicate inadequate representation and potential imbalance, where the group's structure deviates substantially from the global factors and may warrant further scrutiny or preprocessing.[12]
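The sketch below computes these metrics on synthetic groups, reusing the conventions of the earlier blocks. The group contribution is computed here from squared global loadings (one standard route to the inertia share described above), and the quality index follows the stated ratio, which reduces to the partial inertia once each λ_{1k} has been normalized to 1.

```python
import numpy as np

rng = np.random.default_rng(2)

def first_eigenvalue(X):
    s = np.linalg.svd(X / np.sqrt(X.shape[0]), compute_uv=False)
    return s[0] ** 2

tables = [rng.normal(size=(20, j)) for j in (5, 12, 8)]
tables = [X - X.mean(axis=0) for X in tables]

lam1 = np.array([first_eigenvalue(X) for X in tables])
L_g = lam1 / lam1.sum()                     # first-eigenvalue ratios L_gk

Zs = [X / np.sqrt(l) for X, l in zip(tables, lam1)]
Z = np.hstack(Zs)
n = Z.shape[0]
U, s, Vt = np.linalg.svd(Z / np.sqrt(n), full_matrices=False)
lam, V = s ** 2, Vt.T

edges = np.cumsum([0] + [z.shape[1] for z in Zs])
K, J = len(Zs), len(lam)
C = np.empty((K, J)); cos2 = np.empty((K, J))
for k, z in enumerate(Zs):
    b = slice(edges[k], edges[k + 1])
    C[k] = (V[b, :] ** 2).sum(axis=0)       # C_kj: group k's share of axis-j inertia
    Fk = z @ V[b, :]                        # partial coordinates of group k
    cos2[k] = (Fk ** 2).mean(axis=0) / first_eigenvalue(z)   # quality index

print("L_g:", L_g.round(3))
print("C per axis (columns sum to 1):", C[:, :2].round(3))
print("avg cos2 over first two axes:", cos2[:, :2].mean(axis=1).round(3))
```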
Interpreting Balance Across Groups

In multiple factor analysis (MFA), balance across groups is interpreted by evaluating the uniformity of the group inertias L_{gk} and the squared correlations \cos^2 between each group's principal components and the global factors. High uniformity in the L_{gk} values across groups, combined with \cos^2 > 0.5 for most groups on the primary dimensions, indicates that the groups capture similar aspects of the underlying global data structure, suggesting a harmonious integration in which no single group overly influences the analysis.[11] Disparities in these metrics, such as widely varying L_{gk} levels, signal potential imbalances that may warrant remedial steps such as removing outlier groups or adjusting weights to equalize their contributions.[7]

Decision rules for addressing imbalances rely on thresholds for these metrics. For instance, if C_{kj} for one group exceeds 0.5 on a given dimension, that group is considered dominant and may skew the global solution, prompting separate PCA analyses for that group or its exclusion from the MFA to avoid distortion.[11] Pairwise similarities between groups can be further assessed using the RV coefficient, which ranges from 0 (no structural similarity) to 1 (perfect homothety); values above 0.8 typically indicate strong alignment, while lower values suggest divergent information that could justify subgrouping or hierarchical extensions.[13]

The implications of balanced versus imbalanced MFA outcomes provide key insights into the data's underlying patterns. A well-balanced analysis reveals shared structures across groups, facilitating the identification of common factors that generalize across variable sets, such as consensus in sensory evaluations from multiple experts.[8] In contrast, imbalance highlights unique aspects within specific groups, allowing researchers to isolate group-specific variances that might otherwise be masked; this is particularly useful in exploratory studies where group disparities inform targeted follow-up analyses.[11] To address persistent imbalances, especially in hierarchically structured data, hierarchical MFA extends the method by balancing contributions at multiple levels, offering a more nuanced remedial approach than standard weighting.[11]

Despite their utility, balance metrics in MFA have notable limitations that affect interpretation. They inherently assume that all groups are equally relevant to the global structure, which may not hold if some groups are conceptually peripheral, leading to over- or under-emphasis.[7] They are also sensitive to imbalances in the number of variables per group, as larger groups can artificially inflate their first eigenvalues and thus their weights, potentially biasing the overall balance assessment even after normalization.[11]
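A minimal NumPy sketch of the RV coefficient mentioned above, between two column-centered tables sharing the same observations; the second comparison, against unrelated noise, illustrates the low end of the scale.

```python
import numpy as np

def rv_coefficient(X, Y):
    """RV = tr(XX'YY') / sqrt(tr((XX')^2) * tr((YY')^2)), in [0, 1]
    for column-centered tables X and Y with the same rows."""
    Sx, Sy = X @ X.T, Y @ Y.T
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5)); X -= X.mean(axis=0)
Y = X @ rng.normal(size=(5, 7)); Y -= Y.mean(axis=0)   # linearly related to X
N = rng.normal(size=(20, 7)); N -= N.mean(axis=0)      # unrelated noise

print(round(rv_coefficient(X, Y), 3))   # close to 1: strong structural similarity
print(round(rv_coefficient(X, N), 3))   # much lower: divergent information
```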
Visualization and Interpretation

Standard Factorial Graphics
In multiple factor analysis (MFA), standard factorial graphics visualize the global principal components derived from the concatenated and normalized data sets, giving an overview of the overall structure across all variable groups. These plots adapt classical principal component analysis (PCA) and multiple correspondence analysis (MCA) techniques to the MFA framework, projecting observations onto the global factor space to reveal patterns of similarity and variable contributions without emphasizing group-specific differences.[14]

The global factor map, often presented as a biplot, displays observations and variable loadings simultaneously on the first two global principal components, illustrating the primary axes of variation in the combined data. Observations are positioned according to their coordinates in this global space, while arrows or points represent the loadings of variables from all groups, typically color-coded by their originating group. This graphic highlights clusters of similar observations and the directions in which variable groups pull the structure, with the length of an arrow indicating the magnitude of its influence on the factors. In applications involving mixed variable types, quantitative variables appear as vectors and categorical modalities as points, all scaled to the global eigenvalues.[15]

A scree plot shows the eigenvalues of the global principal components against component number, to assess the dimensionality of the solution and the proportion of total inertia explained by each factor. In MFA, this plot often includes both the global eigenvalues and a comparison to the partial inertias from the individual group analyses, aiding the decision of how many dimensions to retain for interpretation (typically those before the eigenvalue curve flattens). The cumulative variance explained is marked, with the first few components often accounting for a substantial portion, such as over 60% in balanced data sets.

Individual factor maps extend the global view by projecting observations onto a single principal component or a specific pair beyond the first two, allowing deeper inspection of variance along isolated dimensions. These maps position observations by their global coordinates on the selected factors, often supplemented with confidence ellipses or color gradients based on squared correlations (cos²) that indicate how well individuals are represented. Such plots are useful for identifying outliers or subtle patterns not evident in the primary biplot.[15]

Correlation circles, akin to those in PCA, depict the correlations between variables (or modalities) and the global principal components, plotted within a unit circle to show angular relationships and strengths. In MFA, variables from different groups are included and color-coded accordingly, revealing how each group's elements align with or oppose the factors; for example, variables with correlations near 1 lie close to the corresponding axis. This graphic underscores the quality of variable representation, with points nearer the periphery of the circle indicating stronger associations with the component.
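A plotting sketch for two of these graphics, a scree plot with cumulative inertia and a correlation circle, using matplotlib on a synthetic stand-in for the global table; the variable names (v1, v2, ...) are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
Z = rng.normal(size=(20, 8)); Z -= Z.mean(axis=0)   # stand-in for the global table
n = Z.shape[0]
U, s, Vt = np.linalg.svd(Z / np.sqrt(n), full_matrices=False)
lam = s ** 2

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Scree plot: eigenvalues as bars, cumulative % of inertia on a second axis
comps = np.arange(1, len(lam) + 1)
ax1.bar(comps, lam)
ax1.set(xlabel="component", ylabel="eigenvalue", title="Scree plot")
ax1b = ax1.twinx()
ax1b.plot(comps, 100 * np.cumsum(lam) / lam.sum(), "ko-")
ax1b.set_ylabel("cumulative % of inertia")

# Correlation circle: correlation of each variable with the first two factors
F = Z @ Vt.T[:, :2]
for j in range(Z.shape[1]):
    x, y = (np.corrcoef(Z[:, j], F[:, d])[0, 1] for d in (0, 1))
    ax2.arrow(0, 0, x, y, head_width=0.02, length_includes_head=True)
    ax2.annotate(f"v{j + 1}", (x, y))
ax2.add_patch(plt.Circle((0, 0), 1, fill=False))
ax2.set(xlim=(-1.1, 1.1), ylim=(-1.1, 1.1), aspect="equal",
        xlabel="dim 1", ylabel="dim 2", title="Correlation circle")
plt.tight_layout()
plt.show()
```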
Unique MFA Visualizations

Multiple factor analysis (MFA) employs several specialized visualizations to evaluate the alignment and balance among variable groups, going beyond standard factorial plots by highlighting group-specific contributions and inter-group relationships.

Partial factor maps are superimposed representations that project each group's variables or individuals onto the global principal axes, often using transparency, color coding, or distinct symbols to reveal overlaps and discrepancies in how different groups structure the data. In sensory analysis applications, for instance, these maps allow researchers to compare the configuration of chemical attributes against sensory perceptions, identifying whether the groups capture similar patterns in the observations.[4] This visualization aids in assessing group balance by quantifying the proximity of partial points to the global compromise, where closer alignment indicates harmonious contributions across datasets.[1]

Group contribution bar plots further illuminate imbalances by displaying metrics such as the eigenvalue-based group weights L_{gk}, the contributions C_{kj} of each group to dimension j, and the average squared cosines \cos^2_{kj} (measuring the quality of representation of group k on dimension j) across the principal components. These horizontal or vertical bar charts, typically ordered by magnitude, highlight groups that dominate specific axes; for example, a group with a high contribution on a dimension disproportionately influences the global solution, potentially signaling the need for reweighting. Such plots are essential for detecting redundancies or underrepresentations, as low \cos^2_{kj} values imply poor alignment with the overall factors.[6] Recent implementations extend these to interactive formats, enhancing interpretability in complex multiblock studies.[16]

The between-group RV matrix, visualized as a heatmap, collects pairwise RV coefficients (a generalization of the squared correlation to matrices) to quantify structural similarities between groups, with values ranging from 0 (no similarity) to 1 (identical structure). In the heatmap, rows and columns represent groups, and color intensity (e.g., from blue for low to red for high) reveals complementary or redundant datasets; for example, high RV values between sensory and instrumental groups indicate convergent information. This tool is particularly useful for identifying clusters of aligned groups, aiding decisions on data fusion.[11] Additionally, dendrograms derived from hierarchical clustering on the RV matrix facilitate group clustering, with branches representing similarity levels and helping to organize groups into hierarchical structures of complementarity or redundancy. These advancements, integrated into modern software, enhance the analysis of group balance in diverse applications such as bioinformatics.[16]
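The sketch below builds a small RV heatmap and the associated dendrogram, clustering groups on the dissimilarity 1 − RV; the four group names and the shared latent structure are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

def rv(X, Y):
    Sx, Sy = X @ X.T, Y @ Y.T
    return np.trace(Sx @ Sy) / np.sqrt(np.trace(Sx @ Sx) * np.trace(Sy @ Sy))

rng = np.random.default_rng(4)
names = ["chemical", "sensory", "instrumental", "panel"]
latent = rng.normal(size=(20, 3))                   # shared structure
groups = [latent @ rng.normal(size=(3, j)) + 0.5 * rng.normal(size=(20, j))
          for j in (5, 6, 4, 7)]
groups = [G - G.mean(axis=0) for G in groups]

R = np.array([[rv(A, B) for B in groups] for A in groups])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
im = ax1.imshow(R, vmin=0, vmax=1, cmap="coolwarm")
ax1.set_xticks(range(len(names)), names, rotation=45)
ax1.set_yticks(range(len(names)), names)
fig.colorbar(im, ax=ax1)
ax1.set_title("Between-group RV matrix")

# Hierarchical clustering on 1 - RV as a dissimilarity
D = squareform(1 - R, checks=False)                 # condensed distance vector
dendrogram(linkage(D, method="average"), labels=names, ax=ax2)
ax2.set_title("Group clustering on 1 - RV")
plt.tight_layout()
plt.show()
```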
Examples and Applications

Introductory Worked Example
To illustrate the principles of multiple factor analysis (MFA), consider a hypothetical dataset of 20 wines, where each wine is described by three distinct groups of variables: chemical properties (numerical variables such as pH, alcohol content, and residual sugar), sensory attributes (numerical scores for aroma intensity, body, and aftertaste on a 1-10 scale), and tasting notes (categorical variables classifying the dominant flavor as fruity, oaky, or spicy). This setup allows MFA to integrate diverse data types while assessing their balanced contributions to a global structure.[2]

The analysis proceeds in steps, beginning with separate analyses of each group to normalize their scales. For the numerical groups (chemical and sensory), principal component analysis (PCA) is applied; for the categorical tasting-notes group, multiple correspondence analysis (MCA) is used to handle the qualitative data. The first eigenvalues from these separate analyses quantify each group's internal structure: λ₁ = 5.2 for the chemical group, λ₁ = 3.1 for the sensory group, and λ₁ = 2.8 for the tasting-notes group. To ensure comparability, each group's data matrix is scaled by dividing by the square root of its respective λ₁, effectively normalizing the first eigenvalue to 1 across groups and preventing any single group from dominating due to scale differences.[2][1]

The normalized matrices are then concatenated column-wise to form a global dataset, on which a single PCA is performed to extract common factors. The eigenvalues from this global PCA are summarized below, showing that the first two factors account for 60% of the total variance, providing a compact representation of the wines' shared patterns.

| Factor | Eigenvalue | Variance Explained (%) | Cumulative Variance (%) |
|---|---|---|---|
| 1 | 4.20 | 35.0 | 35.0 |
| 2 | 3.00 | 25.0 | 60.0 |
| 3 | 1.80 | 15.0 | 75.0 |
The quality of representation of each group on the first two global factors is summarized by its average squared cosine, indicating how well the global structure captures that group's variability:

| Group | cos² (First Two Factors) |
|---|---|
| Chemical | 0.70 |
| Sensory | 0.60 |
| Tasting Notes | 0.40 |
For reference, an excerpt of the raw data (one representative variable per group, first five wines) illustrates the mixed structure:

| Wine | Chemical (pH) | Sensory (Aroma Score) | Tasting Notes |
|---|---|---|---|
| 1 | 3.45 | 7.2 | Fruity |
| 2 | 3.60 | 6.8 | Oaky |
| 3 | 3.30 | 8.1 | Spicy |
| 4 | 3.50 | 7.5 | Fruity |
| 5 | 3.40 | 6.9 | Oaky |
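To reproduce the flow of this example end to end, the sketch below runs the pipeline on synthetic data shaped like the tables above (20 hypothetical wines, three groups). The categorical group is handled with centered indicator columns, a simplification of the full MCA treatment described earlier, so the resulting numbers will differ from the illustrative values in the text; dedicated implementations such as the MFA function in R's FactoMineR package offer the complete method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n = 20  # hypothetical wines

chemical = pd.DataFrame({"pH": rng.normal(3.45, 0.10, n),
                         "alcohol": rng.normal(13.0, 0.8, n),
                         "sugar": rng.normal(2.5, 0.7, n)})
sensory = pd.DataFrame({"aroma": rng.uniform(5, 9, n),
                        "body": rng.uniform(4, 9, n),
                        "aftertaste": rng.uniform(4, 9, n)})
notes = pd.get_dummies(pd.Series(rng.choice(["fruity", "oaky", "spicy"], n),
                                 name="notes")).astype(float)

def standardize(df):                       # center and scale a quantitative group
    return ((df - df.mean()) / df.std(ddof=0)).to_numpy()

def lam1(X):                               # first eigenvalue of the group's PCA
    return np.linalg.svd(X / np.sqrt(len(X)), compute_uv=False)[0] ** 2

tables = [standardize(chemical), standardize(sensory),
          (notes - notes.mean()).to_numpy()]          # simplified indicator group
Zs = [X / np.sqrt(lam1(X)) for X in tables]           # per-group normalization
Z = np.hstack(Zs)                                     # global table

s = np.linalg.svd(Z / np.sqrt(n), compute_uv=False)
lam = s ** 2
print("global eigenvalues:", lam[:3].round(2))
print("variance explained (%):", (100 * lam[:3] / lam.sum()).round(1))
```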