
ComBat

ComBat is a statistical method for adjusting batch effects in high-throughput genomic data, particularly microarray datasets, using an empirical Bayes framework to remove non-biological experimental variation while preserving signals of biological interest. Introduced in 2007 by W. Evan Johnson, Cheng Li, and Ariel Rabinovic, it addresses challenges in combining data from multiple batches or labs, where systematic biases can confound downstream analyses such as differential expression detection. The method employs parametric and non-parametric empirical Bayes approaches that are robust to outliers and effective even with small sample sizes per batch, unlike prior techniques requiring larger datasets. ComBat models batch effects as additive shifts in location and multiplicative shifts in scale, borrowing information across features (e.g., genes) via hierarchical priors to estimate adjustment parameters stably. This enables the integration of disparate datasets to boost statistical power in studies limited by sample availability, such as those involving sequential array hybridizations or multi-site collaborations. Since its inception, ComBat has become one of the most widely adopted batch correction tools in bioinformatics due to its simplicity, interpretability, and availability as open-source software. Subsequent extensions have broadened its applicability; for instance, ComBat-seq adapts the framework for count data using negative binomial regression to handle discrete, over-dispersed distributions and retain integer values compatible with tools like DESeq2 and edgeR. These variants adjust both mean and variance components of batch effects, enhancing control of false positives and increasing detection power in integrated analyses. Implementations in languages like R (via the sva package) and Python (pyComBat) further support its use across bulk and single-cell workflows.

Background

Batch Effects in Gene Expression Data

Batch effects in gene expression data refer to systematic, non-biological variations introduced during the experimental process, which can obscure true biological signals. These effects arise from technical factors unrelated to the biological conditions under study, such as differences in sample processing dates, laboratory protocols, or instrument calibration. Common sources of batch effects vary by technology. In microarray experiments, they often stem from array manufacturing batches, reagent lots, or operator handling differences, leading to inconsistencies across hybridization runs. In RNA sequencing (RNA-seq), additional contributors include library preparation kits, RNA extraction methods, sequencing platform variations, and even atmospheric conditions affecting sample degradation. For instance, samples processed on different dates may exhibit shifts due to gradual changes in reagent quality or equipment performance. The impact of batch effects is profound, often inflating false positives in differential expression analysis and masking genuine biological differences. Systematic shifts in expression levels across batches can lead to erroneous correlations between technical artifacts and phenotypes of interest, potentially resulting in incorrect biological interpretations. In severe cases, these effects dominate the overall variance, reducing statistical power and reproducibility in downstream analyses like biomarker discovery. Statistical detection of batch effects typically involves visualization techniques to identify clustering or shifts attributable to technical factors. Principal component analysis (PCA) is widely used, where the first few principal components often reveal batch-specific groupings if technical variation explains a large portion of the data's variance. Boxplots of expression distributions across batches can similarly highlight median shifts or increased variability, indicating non-biological influences as dominant components. Empirical Bayes approaches have been developed to address these effects, though their detailed application is covered elsewhere.
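As a concrete illustration of the PCA diagnostic, the following minimal R sketch simulates a batch-shifted expression matrix and inspects it; the objects expr and batch and the simulated offset are illustrative assumptions, not data from any published study.

# Minimal sketch: detecting a batch effect with PCA
set.seed(1)
expr <- matrix(rnorm(1000 * 20), nrow = 1000)        # 1000 genes x 20 samples
batch <- factor(rep(c("A", "B"), each = 10))
expr[, batch == "B"] <- expr[, batch == "B"] + 0.8   # simulated additive batch shift

pca <- prcomp(t(expr), scale. = TRUE)                # PCA on samples (rows = samples)
# Samples separating by batch along the leading PCs suggest technical,
# rather than biological, structure
plot(pca$x[, 1], pca$x[, 2], col = batch, pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(batch), col = 1:2, pch = 19)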

Historical Context of Batch Correction

Batch effects in gene expression data were first recognized in the late 1990s and early 2000s as microarray technology proliferated, with early experiments revealing systematic technical variations due to factors like reagent lots, processing dates, or instrument differences that confounded biological signals. These issues gained prominence as high-throughput microarray studies scaled up, prompting concerted efforts to assess reproducibility; notably, the MicroArray Quality Control (MAQC) project in 2006 analyzed reference RNA samples across multiple platforms and sites, demonstrating strong overall reproducibility but underscoring persistent batch-related inconsistencies in gene expression measurements that could bias differential expression analyses. The MAQC findings highlighted the need for standardized protocols to mitigate such technical artifacts, influencing subsequent quality control guidelines in bioinformatics. Prior to the introduction of more sophisticated corrections, initial approaches relied on simple normalization techniques to address batch-induced distributional shifts. Quantile normalization, proposed in 2003, adjusted probe intensities across arrays to make their empirical distributions identical, effectively reducing some intensity-based batch effects by assuming similar overall expression profiles between samples. However, this method had notable limitations, particularly when batches were associated with known covariates like treatment groups or tissues, as it could inadvertently homogenize biological differences or fail to account for unbalanced designs, leading to residual biases in downstream analyses. Distance-based adjustments, such as early applications of principal component analysis (PCA) or singular value decomposition (SVD) to detect and subtract batch-related variance components, served as precursors to later methods but often required manual intervention and struggled to preserve subtle biological signals amid technical noise. The emergence of model-based corrections in the mid-2000s marked a shift toward more structured handling of batch effects, with ANOVA-like linear models gaining traction for incorporating known batch factors as covariates. The limma package, introduced in 2004, enabled users to fit moderated t-tests within a linear modeling framework that explicitly adjusted for batch alongside biological variables of interest, thereby estimating and removing technical variance while aiming to retain true differential expression. These approaches emphasized the importance of preserving biological heterogeneity, such as condition-specific expression patterns, while targeting technical noise, yet they faced challenges in small sample sizes, where overfitting or over-correction could occur, especially if batches were confounded with phenotypes, and they often underperformed in scenarios with multiple or unknown batch sources. From 2000 to 2007, the proliferation of public repositories like the Gene Expression Omnibus (GEO), launched in 2000, amplified the challenges of batch effects, as researchers increasingly sought to integrate heterogeneous datasets for meta-analyses, only to encounter substantial bottlenecks from inconsistent processing across studies. Batch variations became a primary barrier to reliable cross-study inferences, with analyses revealing that uncorrected technical artifacts could dominate principal components of variation, overshadowing biological insights and reducing power in combined datasets. This period underscored the demand for robust, covariate-aware methods that could harmonize diverse microarray data without compromising scientific validity. ComBat addressed these gaps as a seminal empirical Bayes solution for adjusting known batch effects.

Development and Methodology

Original Publication and Authors

ComBat was originally introduced in a seminal 2007 paper titled "Adjusting batch effects in microarray expression data using empirical Bayes methods," published in the journal Biostatistics. The article, appearing in volume 8, issue 1, pages 118–127, presented the method as a robust solution for correcting non-biological variations in microarray data. The lead author, W. Evan Johnson, was affiliated with the Department of Biostatistics at the Harvard School of Public Health (now Harvard T.H. Chan School of Public Health) at the time, where he worked on statistical methods for high-throughput genomic data. Co-author Cheng Li, also at Harvard, contributed expertise in microarray analysis tools and integrative genomic data processing. Ariel Rabinovic, the third author, provided key insights into statistical modeling for batch adjustment. The development of ComBat was motivated by persistent challenges in integrating microarray datasets generated across different laboratories and experimental conditions, particularly for cancer genomics research aimed at identifying robust biomarkers through large-scale data combination. This work emerged within broader efforts to enable integrative analysis of high-throughput genomic data, addressing limitations in existing normalization techniques that failed to fully mitigate batch-induced artifacts. Upon publication, ComBat experienced rapid adoption within the bioinformatics community due to its empirical Bayes framework's effectiveness in small-sample scenarios. As of November 2025, the original paper has been cited more than 8,600 times, underscoring its high impact and influence on subsequent batch correction tools, including integrations with popular packages like limma and adaptations for differential expression analysis with edgeR.

Empirical Bayes Framework

Empirical Bayes methods represent a shrinkage estimation approach that borrows information across multiple parameters to stabilize estimates, integrating prior distributions with observed data in a data-driven manner. This framework treats parameters as random effects drawn from a common prior distribution, whose hyperparameters are empirically estimated from the data itself rather than specified subjectively. In essence, it combines Bayesian principles with frequentist estimation techniques, enabling robust inference by shrinking individual parameter estimates toward a common value informed by the ensemble of data. At its core, the empirical Bayes framework relies on hierarchical modeling, where lower-level parameters (e.g., gene-specific effects) are governed by higher-level priors whose hyperparameters are derived empirically to avoid overfitting, particularly in high-dimensional settings such as genomics with thousands of variables. Hyperparameters are typically estimated by maximizing the marginal likelihood of the observed data, which integrates out the parameters and leverages the full dataset to inform the prior structure. This empirical estimation reduces variability in high-dimensional contexts by pooling information across similar units, promoting more reliable predictions and variable selection. Compared to classical Bayesian methods, empirical Bayes offers computational feasibility for large datasets, as it circumvents the need for full posterior sampling (e.g., via MCMC) by directly optimizing hyperparameters through marginal likelihood maximization. It also sidesteps subjectivity in prior specification by deriving the prior from the data, making it particularly advantageous in scenarios with limited prior knowledge or vast parameter spaces, though it may slightly underestimate posterior uncertainty. In the context of batch correction, the empirical Bayes framework facilitates modeling of batch-specific means and variances while safeguarding underlying biological signals through moderated, shrinkage-based estimates that borrow strength across batches. This approach, as applied in methods like ComBat, ensures that adjustments for technical artifacts do not overly disrupt true differential expression patterns.
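To illustrate the shrinkage idea outside of any package, the following base R sketch fits the classic normal-normal empirical Bayes model by the method of moments; all quantities are simulated and the names are illustrative, not drawn from the sva package.

# Empirical Bayes shrinkage sketch (normal-normal model)
set.seed(42)
G <- 2000                                        # number of genes
true_effect <- rnorm(G, mean = 1, sd = 0.5)      # gene-level "batch shifts"
n <- 4                                           # samples per gene
obs <- true_effect + rnorm(G, sd = 1 / sqrt(n))  # noisy per-gene estimates

# Method-of-moments hyperparameters for the prior N(mu, tau^2)
s2 <- 1 / n                         # known sampling variance of each estimate
mu_hat <- mean(obs)
tau2_hat <- max(var(obs) - s2, 0)   # excess variance attributed to the prior

# Posterior mean shrinks each raw estimate toward the grand mean, with
# weights set by the relative precision of prior versus data
shrink <- (tau2_hat / (tau2_hat + s2)) * obs +
          (s2 / (tau2_hat + s2)) * mu_hat

# The shrunken estimates typically have lower mean squared error
mean((obs - true_effect)^2)      # raw MSE
mean((shrink - true_effect)^2)   # EB-shrunken MSE (smaller)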

Mathematical Model

The ComBat algorithm models batch effects in expression data under a location-scale framework, assuming that the observed expression values are influenced by both biological covariates and unwanted batch-specific shifts in location and scale. Specifically, for each gene g, the expression Y_{gij} of sample i in batch j is modeled as Y_{gij} = X_i \beta_g + \gamma_{gj} + \delta_{gj} \varepsilon_{gij}, where X_i represents the biological covariates for sample i, \beta_g is the vector of regression coefficients for gene g, \gamma_{gj} denotes the additive batch shift for batch j and gene g, \delta_{gj} captures the batch-specific variance scaling, and \varepsilon_{gij} \sim N(0,1) is the standardized error term, assuming the data have been pre-normalized per gene to have unit variance. To estimate the batch effects, ComBat employs an empirical Bayes approach by imposing priors on the batch parameters across genes. The additive effects follow \gamma_{gj} \sim N(\mu_j, \sigma_j^2), where \mu_j and \sigma_j^2 are hyperparameters representing the overall mean and variance of the additive effects in batch j, shared across all genes. For the scale parameters, \delta_{gj} \sim |N(0, \tau_j^2)|, a half-normal prior that ensures non-negativity, with \tau_j^2 controlling the variability of the batch scale factors; alternative parameterizations, such as an inverse gamma prior on the squared scales, are used in implementations, including the original publication. The hyperparameters \mu_j, \sigma_j^2, and \tau_j^2 are estimated from the data using the method of moments, leveraging the empirical distribution of the batch effects across all genes to borrow strength and stabilize estimates, particularly in cases with small sample sizes per batch. The adjustment proceeds by computing posterior estimates of the batch effects given the observed data, which incorporate shrinkage toward the priors. The corrected expression values are obtained by removing the posterior batch estimates and restoring the biological fit: \hat{Y}_{gij} = \frac{Y_{gij} - X_i \hat{\beta}_g - \gamma^*_{gj}}{\delta^*_{gj}} + X_i \hat{\beta}_g, where \gamma^*_{gj} and \delta^*_{gj} are the empirical Bayes posterior estimates of the batch parameters, effectively removing the estimated batch adjustment while preserving the biological signal. In practice, this involves iterative estimation: first standardizing the data within batches, then deriving posterior updates for \gamma_{gj} and \delta_{gj} from the priors, and finally back-transforming to the original scale. When biological covariates of interest, such as disease status or treatment groups, often termed "protected" variables, are present, ComBat accommodates them by including them in the design matrix X_i and estimating \beta_g via ordinary least squares without applying shrinkage, ensuring these signals are not attenuated by the empirical Bayes procedure, which is applied solely to the batch terms \gamma_{gj} and \delta_{gj}. This selective modeling prevents over-correction of relevant biological variation while targeting technical batch artifacts.
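To make the estimation steps concrete, here is a deliberately simplified, mean-only version of the adjustment written in base R from the equations above. This is a pedagogical sketch under stated assumptions (no covariates, no scale adjustment, normal priors only), not the sva::ComBat implementation.

# Simplified mean-only ComBat-style adjustment (illustrative sketch)
combat_mean_only <- function(expr, batch) {
  batch <- as.factor(batch)
  # 1. Standardize each gene using its overall mean and standard deviation
  grand_mean <- rowMeans(expr)
  pooled_sd  <- apply(expr, 1, sd)
  Z <- (expr - grand_mean) / pooled_sd

  adjusted <- Z
  for (b in levels(batch)) {
    idx <- batch == b
    gamma_hat <- rowMeans(Z[, idx, drop = FALSE])   # per-gene batch shift

    # 2. Method-of-moments hyperparameters for the prior on gamma
    mu_b   <- mean(gamma_hat)
    tau2_b <- var(gamma_hat)
    s2_b   <- apply(Z[, idx, drop = FALSE], 1, var) / sum(idx)  # sampling noise

    # 3. Empirical Bayes posterior mean: shrink gene-level shifts toward mu_b
    gamma_star <- (tau2_b * gamma_hat + s2_b * mu_b) / (tau2_b + s2_b)

    # 4. Subtract the shrunken batch effect
    adjusted[, idx] <- Z[, idx, drop = FALSE] - gamma_star
  }
  # 5. Back-transform to the original scale
  adjusted * pooled_sd + grand_mean
}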

Applications and Extensions

Use in Microarray Analysis

ComBat finds its primary application in correcting batch effects within microarray gene expression datasets, where non-biological variations arising from experimental procedures can confound analyses. The standard workflow involves providing ComBat with a gene expression matrix of normalized log-transformed intensities, typically pre-processed via methods like RMA for Affymetrix arrays or quantile normalization for Illumina platforms, together with a batch covariate vector indicating sample assignments to batches. ComBat then applies an empirical Bayes adjustment to estimate and subtract batch-specific additive and multiplicative effects, yielding a corrected expression matrix that retains biological signals for subsequent analyses, such as t-tests or ANOVA for identifying differentially expressed genes. This method proves particularly effective for platforms like Affymetrix HG-U133A or Illumina BeadArrays, where batch effects often dominate due to variations in hybridization, scanning, or reagent lots, enabling the integration of data from multiple experiments to enhance statistical power without requiring large sample sizes per batch. By leveraging empirical Bayes shrinkage across genes, ComBat stabilizes parameter estimates even in small batches (e.g., fewer than 10 samples), reducing overfitting risks and preserving relative expression differences that reflect true biological variation. A representative example of its use is in combining datasets from the original validation study, such as the human lung fibroblast cell line IMR90 profiled across three batches of four arrays each, where post-correction heatmaps demonstrated diminished batch-induced clustering while maintaining tissue-specific patterns. In broader contexts, similar integrations have been applied to pan-cancer compilations, like merging TCGA and external cohorts, resulting in reduced batch-driven separation in principal component analysis (PCA) plots and allowing for cross-study comparisons of oncogenic signatures. Evaluation of ComBat's performance in microarray settings typically focuses on two key metrics: effective removal of batch variance, assessed via ANOVA F-tests on batch factors pre- and post-correction (showing significant reductions in batch-associated significance), and preservation of biological variance, measured by correlations between adjusted expressions and known phenotypes (e.g., retaining >90% of original phenotype correlations while eliminating batch signals). Principal component analysis further visualizes success, with post-correction PCA plots exhibiting tighter clustering by biological groups rather than batches, as seen in Affymetrix evaluations where the share of variance explained by batch on the first principal component dropped from roughly 50% to under 10%. These metrics confirm ComBat's ability to enhance downstream differential analysis without introducing artifacts.
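A hedged sketch of the ANOVA F-test evaluation described above, on simulated data; the matrices and the size of the injected batch shift are illustrative assumptions.

# Quantifying batch-effect removal with per-gene ANOVA F-tests
library(sva)

set.seed(2)
batch <- factor(rep(c("A", "B"), each = 10))
expr_raw <- matrix(rnorm(1000 * 20, mean = 7), nrow = 1000)
expr_raw[, batch == "B"] <- expr_raw[, batch == "B"] + 1   # additive batch shift
expr_adj <- ComBat(dat = expr_raw, batch = batch)          # EB-corrected matrix

batch_p_values <- function(expr, batch) {
  apply(expr, 1, function(y) anova(lm(y ~ batch))[["Pr(>F)"]][1])
}

p_before <- batch_p_values(expr_raw, batch)
p_after  <- batch_p_values(expr_adj, batch)

# Successful correction leaves batch-associated p-values roughly uniform
mean(p_before < 0.05)   # large when batch effects dominate
mean(p_after < 0.05)    # should fall to near the nominal 5% level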

Adaptations for RNA-Seq Data

The original ComBat method, designed for continuous microarray data, assumes a Gaussian distribution for gene expression values, which is inappropriate for the skewed and overdispersed count data typical of RNA-seq experiments. This assumption can lead to erroneous batch adjustments, as RNA-seq counts often exhibit mean-variance dependence and cannot be adequately transformed to normality without losing integer integrity or introducing artifacts. Consequently, adaptations for RNA-seq have shifted to parametric models like the negative binomial distribution to better capture the discrete, overdispersed nature of the data while preserving counts suitable for downstream tools such as DESeq2 or edgeR. A key adaptation is ComBat-seq, introduced by Zhang et al. in 2020, which employs a negative binomial regression model to adjust raw count data directly. The model assumes that the count for gene g in sample j from batch i, denoted y_{gij}, follows a negative binomial distribution y_{gij} \sim \text{NB}(\mu_{gij}, \phi_{gi}), where \mu_{gij} is the mean and \phi_{gi} is the batch-specific dispersion parameter. Batch effects are incorporated into the mean via the log-link: \log(\mu_{gij}) = \alpha_g + X_j \beta_g + \log(N_j) + \gamma_{gi}, where \alpha_g is the baseline expression, X_j \beta_g captures biological covariates, \log(N_j) accounts for library size, and \gamma_{gi} represents the additive batch effect on the log-mean. Adjustment proceeds by estimating parameters using maximum likelihood within batches, then computing conditional expectations to obtain batch-free counts that retain integer values and empirical quantiles. ComBat-seq demonstrates superior performance over applying the original ComBat to log-transformed counts, particularly in preserving biological signals while removing batch effects. In simulation studies with batch-induced mean and dispersion shifts (e.g., 1.5-fold mean and 2-fold dispersion differences), ComBat-seq achieved a true positive rate of 0.89 for detecting differentially expressed genes, compared to 0.85 for ComBat on log-transformed counts, with better control of false positives (0.039 vs. 0.046). Evaluations on GTEx data further confirmed its efficacy in reducing batch variance without over-correcting tissue-specific signals, with higher power for preserving biological structure (e.g., 0.73 vs. 0.66 for the log-transformed approach under stronger batch effects). Other variants include ComBat-ref, a subsequent refinement by Zhang and colleagues, which builds on ComBat-seq by selecting a low-dispersion reference batch and aligning the other batches toward it using pooled dispersion estimates in a negative binomial model. This reference-based approach improves differential expression detection power by minimizing over-adjustment, outperforming ComBat-seq in simulations with fold changes of 2.4 and dispersion shifts of 4, where it detected true positives at rates comparable to batch-free data while maintaining false discovery rates below 0.05 in edgeR analyses. Real-data applications, such as on the GFRN and GeneLab datasets, highlighted its ability to identify key genes at low FDR (e.g., 0.0001), enhancing reliability in multi-batch studies.
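For reference, a brief usage sketch for the ComBat_seq function as distributed in the sva package; the counts matrix and the batch and group vectors are simulated for illustration.

# ComBat-seq usage sketch on simulated negative binomial counts
library(sva)

set.seed(7)
counts <- matrix(rnbinom(1000 * 12, mu = 50, size = 5), nrow = 1000)
batch  <- rep(c(1, 2), each = 6)
group  <- rep(c(0, 1), times = 6)   # biological condition to preserve

# ComBat_seq returns adjusted counts that remain non-negative integers,
# so the output can be passed directly to DESeq2 or edgeR
adj_counts <- ComBat_seq(counts, batch = batch, group = group)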

Case Studies in Genomics Research

In cancer genomics, ComBat has been instrumental in integrating multi-center microarray and RNA-seq data from The Cancer Genome Atlas (TCGA), particularly within the Pan-Cancer Atlas initiative launched in 2013, which analyzes molecular profiles across 33 tumor types to identify shared and distinct cancer subtypes. By correcting for batch effects arising from differences in sequencing platforms, tissue processing, and laboratory protocols across TCGA's contributing centers, ComBat enabled robust cross-cohort comparisons that revealed pan-cancer molecular aberrations, such as recurrent driver mutations and pathway alterations, facilitating the discovery of tumor subtypes like immune-hot and immune-cold phenotypes in multiple cancers. This integration was critical for the Pan-Cancer Atlas's identification of 299 significant driver genes and the elucidation of subtype-specific therapeutic vulnerabilities, as uncorrected batch effects could otherwise mask biologically relevant signals. In population-scale studies, ComBat has been applied to data from the Genotype-Tissue Expression (GTEx) project to support tissue-specific expression quantitative trait locus (eQTL) analysis by mitigating donor-specific and sequencing-related batch effects across 838 donors and 49 tissues. For instance, researchers have employed ComBat to harmonize RNA-seq data from GTEx's diverse postmortem samples, accounting for variables such as sequencing center, supporting the identification of tissue-specific eQTLs. This correction preserved genetic associations while reducing technical variance, enabling discoveries such as tissue-specific regulatory effects of variants near disease genes, including those linked to neurological disorders, and enhancing the reliability of eQTL maps for downstream analyses. A notable meta-analysis example comes from Leek et al. (2010), who reanalyzed public datasets from the Gene Expression Omnibus (GEO), demonstrating ComBat's efficacy in reducing false discoveries during cross-study validation of expression signatures. In their examination of nine GEO datasets, including population stratification studies, batch effects, often tied to processing dates or processing groups, accounted for up to 99.5% of variation and confounded up to 32.1% of features, leading to inflated false positive rates in signature validation; applying ComBat via empirical Bayes adjustment removed these artifacts, preserving biological signals and lowering spurious associations by aligning distributions across batches. This work underscored ComBat's role in enabling reliable meta-analyses, as evidenced by improved reproducibility of prognostic signatures in independent cohorts post-correction. These applications have yielded tangible outcomes, such as the identification of batch-robust biomarkers for Alzheimer's disease in the Religious Orders Study and Memory and Aging Project (ROSMAP) cohort. In a 2021 study integrating ROSMAP data with the Mount Sinai Brain Bank (MSBB) cohort, ComBat corrected for cohort-specific effects, revealing a novel Alzheimer's subtype characterized by upregulated synaptic genes and distinct neuropathological profiles, which correlated with cognitive decline and amyloid-beta pathology independent of technical biases. Such findings have advanced the development of resilient biomarkers, like microRNA-129-5p expression patterns, for early detection and subtyping in neurodegenerative research.

Implementations and Software

R Package Integration

The primary implementation of ComBat is available within the sva (Surrogate Variable Analysis) package for the R programming language, distributed through the Bioconductor repository. The sva package, which provides ComBat alongside tools for estimating surrogate variables to capture unknown sources of variation, has included the ComBat function since version 3.0, enabling researchers to adjust for known batch effects in high-dimensional genomic data (version 3.58.0 as of Bioconductor 3.19). Installation of the sva package requires R version 3.2 or higher and is performed using the BiocManager utility with the command BiocManager::install("sva"), ensuring access to the latest stable version from Bioconductor. The core function, ComBat, takes as input an expression matrix and batch information, applying the empirical Bayes adjustment framework originally described by Johnson et al. (2007). Its basic syntax is:
ComBat(dat, batch, mod = NULL, par.prior = TRUE, prior.plots = FALSE, mean.only = FALSE, ref.batch = NULL, BPPARAM = bpparam("SerialParam"))
Here, dat is the input data matrix with features (e.g., genes) as rows and samples as columns, typically normalized expression data; batch is a vector or factor specifying the batch assignment for each sample; and mod is an optional model matrix encoding biological covariates of interest, such as treatment groups, to preserve their effects during adjustment. The function assumes the input data has been pre-processed for quality and normalization, as batch correction is most effective on clean datasets. Key parameters control the adjustment's flexibility and assumptions. The par.prior argument (default: TRUE) selects parametric empirical Bayes priors, which assume standard distributional forms and estimate hyperparameters from the data; setting it to FALSE selects non-parametric priors for more robust handling of skewed distributions. Parametric priors are generally recommended for typical expression data but can be chosen based on data characteristics. Setting mean.only to TRUE performs adjustment only on location shifts (means) across batches while preserving variance structure, which is useful when batch effects primarily affect means rather than dispersions. Additional options like prior.plots generate diagnostic plots of the estimated priors, and ref.batch aligns all batches to a reference batch to facilitate comparisons. The output is a corrected expression matrix of the same dimensions as the input, with the estimated per-batch, per-gene location and scale effects removed. The sva package depends on Biobase for handling ExpressionSet objects, which are commonly used to store and manipulate genomic datasets in Bioconductor workflows, allowing seamless input of structured data into ComBat. Furthermore, the adjusted output from ComBat integrates directly with the limma package for downstream differential expression analysis, where the corrected matrix can be fitted into linear models while accounting for the original biological design via the mod matrix. This interoperability within the Bioconductor ecosystem facilitates end-to-end pipelines for batch-corrected genomic studies.
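A hedged end-to-end sketch of such a pipeline, assuming a pre-normalized log-scale matrix; the simulated objects expr, batch, and condition are illustrative.

# ComBat correction followed by limma differential expression
library(sva)
library(limma)

set.seed(3)
expr <- matrix(rnorm(2000 * 16, mean = 8), nrow = 2000)
batch <- factor(rep(c("B1", "B2"), each = 8))
condition <- factor(rep(c("ctrl", "trt"), times = 8))

# Protect the biological covariate via the model matrix so its signal
# is not removed along with the batch effect
mod <- model.matrix(~ condition)
expr_adj <- ComBat(dat = expr, batch = batch, mod = mod)

# Downstream differential expression on the corrected matrix
fit <- lmFit(expr_adj, design = mod)
fit <- eBayes(fit)
topTable(fit, coef = "conditiontrt", number = 5)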

Python and Other Language Tools

pyComBat is a Python package, first described in a 2021 preprint and formally published in 2023, that provides an implementation of the ComBat method for batch effect correction in high-throughput molecular data, including both microarray and RNA-seq datasets; the standalone repository was archived in March 2024 and merged into the InMoose package, so it is no longer maintained as a separate tool. It mirrors the functionality of the original R implementation while integrating with Python pipelines through wrapper functions, enabling easy incorporation into bioinformatics workflows. The package supports empirical Bayes adjustments for location and scale parameters across batches while preserving biological variability. For count-specific corrections, pyComBat includes a port of ComBat-seq, adapted for count data using negative binomial regression to handle the discrete nature of sequencing outputs. Additionally, the Scanpy library offers a built-in pp.combat function that implements ComBat for single-cell data stored in AnnData objects, facilitating batch harmonization in large-scale scRNA-seq analyses with vectorized operations for efficiency on datasets exceeding millions of cells. These Python tools leverage pandas for data manipulation and NumPy for numerical computations, supporting both bulk and single-cell applications via repositories like epigenelabs/pyComBat. In other languages, implementations of ComBat are available in specialized bioinformatics toolboxes, such as the MATLAB-based Matisse toolbox for in situ sequencing data analysis, which applies the empirical Bayes framework to correct batch effects in RNA-seq-like datasets. These versions emphasize integration with MATLAB's matrix operations for handling high-dimensional genomic data. Key features across these implementations include vectorized algorithms to process extensive datasets efficiently and compatibility with Scanpy for single-cell harmonization, reducing computational overhead in integrative analyses.

Usage Guidelines

Prior to applying ComBat, genomic data should be preprocessed through normalization to ensure comparability across samples and to stabilize variance, as unnormalized data can lead to biased batch adjustments. For microarray expression data, robust multi-array average (RMA) is recommended, which includes background correction, quantile normalization, and summarization to probe sets. In RNA-seq analyses, variance-stabilizing transformations such as those provided by DESeq2 (e.g., vst or the regularized log) should be applied to count data prior to ComBat to approximate normality and reduce heteroscedasticity, as sketched below. Accurate annotation of the batch covariate is essential, as misclassification can propagate errors into the correction process. Effective use of ComBat involves incorporating biological covariates into the model matrix to preserve relevant signals while adjusting for batches. The mod parameter allows specification of a design matrix for phenotypes or other factors of interest, preventing the removal of true biological variation. For datasets with small sample sizes per batch, parametric priors (par.prior=TRUE) are preferred over non-parametric ones, as they borrow information across genes to stabilize estimates and improve robustness. Visualization techniques, such as principal component analysis (PCA), should be employed to inspect data before and after correction, confirming reduction in batch clustering while maintaining separation by biological groups. To evaluate ComBat's performance, assess the reduction in batch effects through metrics like the correlation between batch labels and principal components, which should approach zero post-correction, and the preservation of biological signals via associations with phenotypes, such as increased explained variance for models including the outcome variable. Cross-validation can help detect over-correction by testing whether biological differences are unduly diminished. Common pitfalls include applying ComBat to scenarios with unknown batch factors, where surrogate variable analysis (SVA) is more appropriate for estimating latent variables. Additionally, when batch effects interact with biological covariates across multiple batches, the standard assumption of additive effects may not hold, potentially requiring extended models or alternative methods.
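The sketch below illustrates the recommended RNA-seq preprocessing order with DESeq2's vst, on simulated counts; it is one reasonable pipeline under these assumptions, not the only valid one.

# Variance-stabilize RNA-seq counts with DESeq2 before ComBat
library(DESeq2)
library(sva)

set.seed(11)
counts <- matrix(rnbinom(1000 * 12, mu = 100, size = 10), nrow = 1000)
coldata <- data.frame(batch = factor(rep(c("B1", "B2"), each = 6)),
                      condition = factor(rep(c("ctrl", "trt"), times = 6)))

dds <- DESeqDataSetFromMatrix(countData = counts,
                              colData = coldata,
                              design = ~ condition)
vsd <- vst(dds, blind = TRUE)   # approximately homoscedastic, log2-like scale

# Correct the transformed values, protecting the biological design
mod <- model.matrix(~ condition, data = coldata)
expr_adj <- ComBat(dat = assay(vsd), batch = coldata$batch, mod = mod)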

Limitations and Comparisons

Known Limitations

The original ComBat method relies on a location-scale model that assumes normally distributed values, an assumption frequently violated in RNA-seq data due to the inherent skewness, over-dispersion, and zero-inflation of count distributions, particularly for low-expression genes. This mismatch can result in invalid adjustments, such as the generation of negative expression values, which are incompatible with downstream analyses requiring non-negative counts, like differential expression testing in tools such as edgeR or DESeq2. A key practical limitation of ComBat is its potential for over-correction, which can inadvertently remove biologically relevant variation, especially in datasets where batch factors are confounded with phenotypes of interest. For example, in time-series experiments, if batches systematically align with temporal progression, a biological covariate, ComBat may treat temporal signals as technical artifacts, thereby distorting true developmental or dynamic patterns. This risk is heightened when unknown subtypes or unmodeled factors within batches lead the method to misinterpret biological heterogeneity as noise. Regarding scalability, the canonical implementation in the sva package can be computationally intensive for very large datasets; faster alternatives such as Python-based ports have been developed to mitigate these bottlenecks. Additionally, ComBat is designed for known batch covariates and performs suboptimally on latent or unknown batches unless augmented with surrogate variables from methods like SVA, which can introduce further estimation challenges in complex designs. Post-publication critiques from 2020s benchmarks highlight ComBat's inferior performance relative to dedicated single-cell integration approaches in intricate scenarios, such as datasets with nonlinear batch effects or integration at atlas scale. For instance, across evaluations of multiple real and simulated datasets, ComBat ranked in the lower tier for metrics like batch mixing (e.g., kBET scores) and cell-type preservation, while methods like scVI excelled by better balancing technical-effect removal with biological conservation. These shortcomings underscore ComBat's reliance on linear assumptions, which falter in the high-dimensional, sparse nature of single-cell data.

Alternative Batch Correction Methods

While ComBat remains a cornerstone for batch correction in microarray and bulk RNA-seq data, several normalization-based methods offer simpler alternatives when batch effects are less pronounced or when computational efficiency is prioritized. Quantile normalization, introduced as a preprocessing step for microarray data, adjusts the distribution of intensities across samples to make them comparable by ranking and replacing values with averages from a reference distribution. This approach is effective for reducing technical variability in homogeneous datasets but does not explicitly model known batch structures, potentially leading to over-correction or residual effects in multi-batch scenarios. Another normalization-based option is RUVSeq, which employs a residuals-based strategy to remove unwanted variation by factoring out surrogate variables derived from negative control genes or principal components of the data. RUVSeq is particularly useful for RNA-seq counts where batch effects confound biological signals, offering flexibility for both known and unknown confounders without assuming a batch model. Model-based alternatives to ComBat provide linear or factor-analytic frameworks that can be faster or more adaptable to hidden confounders. The removeBatchEffect function in the limma package uses linear modeling to adjust expression values by estimating batch means and subtracting them after fitting a linear model for the biological covariates, achieving correction without the empirical Bayes shrinkage that defines ComBat (see the sketch at the end of this section). This method is computationally lightweight and integrates seamlessly with differential expression analysis pipelines, making it preferable for large datasets where speed is essential, though it may underperform in preserving biological variance compared to shrinkage-based approaches. PEER (Probabilistic Estimation of Expression Residuals), on the other hand, applies Bayesian factor analysis to infer and remove hidden factors, including batch effects, by modeling expression as a combination of biological signals and latent variables. PEER excels in scenarios with unknown or complex confounding factors, such as eQTL studies, and has been widely adopted for its ability to handle high-dimensional data without requiring batch labels. For single-cell RNA sequencing (scRNA-seq), advanced integration methods have emerged that surpass ComBat's applicability due to their handling of cell-type manifolds and sparsity. Harmony corrects batch effects by iteratively clustering cells in a low-dimensional embedding and adjusting their positions to mix batches while preserving local structures. Published in 2019, Harmony is optimized for scRNA-seq integration, enabling rapid correction of datasets with thousands of cells and many batches, and is often preferred when visualizing or clustering heterogeneous single-cell data where ComBat's bulk-oriented assumptions fail. Similarly, fastMNN (fast Mutual Nearest Neighbors) integrates batches by projecting data into a common space based on shared cell neighborhoods, using mutual nearest neighbor pairs to estimate and correct batch-specific shifts. This method, from 2018, is efficient for large-scale scRNA-seq and balances correction with the retention of cell-type distinctions, making it suitable for cross-study integration. Surrogate Variable Analysis (SVA) offers a robust choice for unknown batch effects, estimating surrogate variables from the data to protect against over-correction of biological signals.
Developed in 2007, SVA is implemented in the sva package, is commonly used alongside limma, and is ideal for meta-analyses where batch metadata is incomplete, contrasting with ComBat's reliance on known batches. In summary, alternatives like Harmony or fastMNN are preferable for scRNA-seq manifold preservation, SVA for unidentified confounders, and simpler methods like quantile normalization for preliminary processing; ComBat, however, continues to excel in known-batch meta-analysis due to its empirical Bayes framework for robust shrinkage.
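As a point of comparison with ComBat, here is a hedged sketch of limma's removeBatchEffect on simulated data. Note that limma's documentation intends the returned matrix for visualization and exploration; for formal differential expression, including batch in the design matrix is recommended instead.

# limma::removeBatchEffect as a lightweight linear-model alternative
library(limma)

set.seed(5)
expr <- matrix(rnorm(500 * 12), nrow = 500)
batch <- factor(rep(c("B1", "B2"), each = 6))
condition <- factor(rep(c("ctrl", "trt"), times = 6))

# Batch means are estimated and subtracted after accounting for the
# biological design; no empirical Bayes shrinkage is applied
design <- model.matrix(~ condition)
expr_rbe <- removeBatchEffect(expr, batch = batch, design = design)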

References

  1. Adjusting batch effects in microarray expression data using empirical Bayes methods.
  2. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007 Jan;8(1):118–127. PubMed.
  3. ComBat-seq: batch effect adjustment for RNA-seq count data.
  4. pyComBat, a Python tool for batch effects correction in high-throughput molecular data. Dec 7, 2023.
  5. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics.
  6. Assessing and mitigating batch effects in large-scale omics studies. Oct 3, 2024.
  7. Overcoming the impacts of two-step batch effect correction on gene ...
  8. Adjusting for Batch Effects in DNA Methylation Microarray Data, a ... Mar 15, 2018.
  9. Batch effect removal methods for microarray gene expression data ... Jul 31, 2012.
  10. Cheng Li, Google Scholar: Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8(1), 118–127, 2007.
  11. Learning from a lot: Empirical Bayes for high-dimensional model ...
  12. Objective Priors in the Empirical Bayes Framework. arXiv. May 11, 2020.
  13. Empirical Bayes methods in classical and Bayesian inference.
  14. Compendiums of cancer transcriptomes for machine learning ... Oct 8, 2019.
  15. Methods that remove batch effects while retaining group differences ... Aug 13, 2015.
  16. Batch Correction Analysis. Griffith Lab, RNA-seq course.
  17. ComBat-seq: batch effect adjustment for RNA-seq count data. PMC. Sep 21, 2020.
  18. Highly effective batch effect correction method for RNA-seq count data (ComBat-ref).
  19. POIBM: batch correction of heterogeneous RNA-seq datasets ... Feb 23, 2022.
  20. TCGA Batch Effects Viewer. MD Anderson Bioinformatics.
  21. Detection, Diagnosis and Correction of Batch Effects in TCGA Data (PDF slides). MD Anderson MBatch R package.
  22. Batch correction evaluation framework using a-priori gene-gene ... May 28, 2019.
  23. Effect of uniform processing and batch effect removal on gene ...
  24. An ontology-based method for assessing batch effect adjustment ... Sep 8, 2018.
  25. Tackling the widespread and critical impact of batch effects in high-throughput data. Sep 14, 2010.
  26. Molecular subtyping of Alzheimer's disease using RNA sequencing ... Jan 6, 2021.
  27. miR-129-5p as a biomarker for pathology and cognitive decline in ... Jan 9, 2024.
  28. sva package documentation: ComBat function (version 3.0+).
  29. ComBat function. RDocumentation.
  30. ComBat: Adjust for batch effects using an empirical Bayes framework.
  31. epigenelabs/pyComBat. GitHub. Mar 5, 2024.
  32. Matisse: a MATLAB-based analysis toolbox for in situ sequencing ... Jul 31, 2021.
  33. sva package. Bioconductor.
  34. iComBat: An incremental framework for batch effect correction in ...
  35. reComBat: batch-effect removal in large-scale multi-source gene ...
  36. Batch Effects Correction with Unknown Subtypes.
  37. Thinking points for effective batch correction on biomedical data. Oct 13, 2024.
  38. Benchmarking atlas-level data integration in single-cell genomics. Dec 23, 2021.
  39. A benchmark of batch-effect correction methods for single-cell RNA ... Jan 16, 2020.