
Quantile normalization

Quantile normalization is a preprocessing technique in statistics and bioinformatics designed to align the distributions of multiple datasets, making their statistical properties identical by matching quantiles across samples, thereby removing systematic technical biases while preserving biological signals. Introduced in 2003 by Bolstad et al. in the context of high-density oligonucleotide microarray data, it assumes that most features (e.g., genes or probes) exhibit similar expression levels across samples, allowing adjustments that equalize probe intensities without altering relative differences within each dataset. The method operates on a data matrix in which rows represent features (e.g., genes) and columns represent samples: first, the values in each sample are sorted in ascending order; second, the average value is computed for each rank position across all sorted samples; third, these averages replace the sorted values in each sample; and finally, each sample is reordered to match the original feature order. This ensures that the resulting distributions have the same quantiles, such as identical medians, quartiles, and overall shapes, which facilitates downstream analyses like differential expression testing. Originally developed for GeneChip arrays to address technical variation from sample processing and array manufacturing, quantile normalization has become a standard tool in high-throughput studies, including RNA-sequencing and other genomic assays, where it effectively mitigates batch effects and technical variability in large-scale datasets. Its advantages include simplicity, computational efficiency, and superior variance reduction compared to methods like global scaling, particularly when few features are differentially expressed between conditions. However, it can introduce artifacts in scenarios with strong class effects (e.g., tumor vs. normal tissues), potentially masking true biological differences or generating false signals, necessitating careful application or variants like class-specific normalization.

Overview

Definition

Quantile normalization is a statistical technique designed to align multiple probability distributions by making them identical in shape, achieved by matching corresponding quantiles across the distributions while preserving the rank order of individual data points but adjusting their actual values. This method ensures that the empirical distributions of the data sets become indistinguishable in terms of their quantile profiles, effectively removing systematic differences in distributional form without altering the relative ordering within each sample. In quantile normalization, for a collection of samples, the value at each quantile position in a given sample is replaced by the value from the corresponding quantile of a reference distribution, typically constructed as the average across all samples to create a balanced target. This approach contrasts with other normalization methods, such as z-score standardization, which centers data around a mean of zero and scales it to unit variance, or min-max scaling, which linearly transforms data to a bounded interval like [0, 1]; quantile normalization uniquely targets the full shape of the distribution through non-linear adjustments rather than focusing solely on central tendency and spread. The technique is particularly valuable in high-throughput data analysis, where aligning distributions helps mitigate technical variations across experiments.

Historical Development

Quantile normalization was initially proposed by Ben Bolstad in a 2001 unpublished manuscript focused on probe-level data from high-density oligonucleotide arrays produced by Affymetrix. This work introduced the method as a technique to adjust for technical variations arising from factors such as sample preparation, labeling, and scanner differences, which could obscure biological signals in microarray experiments. The approach aimed to equalize the distributions of probe intensities across arrays without relying on a baseline array, making it suitable for multi-array studies. The method gained formal recognition through its publication in 2003 by Bolstad, Irizarry, Åstrand, and Speed in Bioinformatics, where it was presented alongside other normalization strategies and evaluated for variance and bias reduction in microarray data. In this seminal paper, quantile normalization was demonstrated to effectively mitigate non-linear differences between arrays, outperforming simpler methods in preserving the rank order of probe intensities while stabilizing overall distributions. Following its introduction, quantile normalization saw rapid early adoption in microarray analysis, particularly for addressing batch effects, i.e., systematic variations introduced by experimental processing across different runs or laboratories. It became a standard preprocessing step in analysis pipelines, integrated into tools like the Bioconductor suite. Post-2010, while major theoretical advancements have been limited, the technique has proliferated through enhanced computational implementations, including adaptations for RNA-sequencing and other high-throughput data in software such as R's preprocessCore package, facilitating broader use in large-scale genomic studies.

Methodology

Algorithm Steps

Quantile normalization is applicable to datasets with any number of samples n \geq 2, where each sample consists of multiple observations, such as expression levels across arrays in microarray experiments. For n = 2, the target quantiles are the averages of the corresponding sorted values from both samples, aligning both to this common distribution. The algorithm proceeds in the following steps (a code sketch follows the list):
  1. Sort each sample: For a matrix X of dimensions p \times n (where p is the number of observations per sample and n is the number of samples, with samples as columns), sort the values in each column in ascending order to obtain the sorted matrix X_{\text{sort}}. This ranks the observations within each sample.
  2. Compute target quantiles: Across the rows of X_{\text{sort}}, calculate the average value for each rank position (i.e., the mean across the n sorted samples at each of the p positions). Assign this average to every element in the corresponding row to form the target matrix X'_{\text{sort}}, where all columns are identical. When ties occur in the sorted values (multiple observations sharing the same rank within or across samples), average the target values for those tied ranks.
  3. Reassign to original order: For each original sample, replace its sorted values with the corresponding column from X'_{\text{sort}}, but rearrange them back to the original (unsorted) order of the observations in X. This yields the normalized matrix X_{\text{normalized}}, preserving the relative ordering within each sample while aligning their distributional shapes.
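The three steps map directly to a few lines of array code. The following is a minimal sketch in Python with NumPy, assuming an input matrix with features as rows and samples as columns; the function name quantile_normalize and the toy matrix are illustrative, and ties are simply broken by sort order rather than by the tie-averaging refinement noted in step 2.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize a p x n matrix (rows = features, columns = samples)."""
    X = np.asarray(X, dtype=float)
    # Step 1: sort each column (sample) in ascending order.
    order = np.argsort(X, axis=0)
    X_sort = np.take_along_axis(X, order, axis=0)
    # Step 2: the target quantile at each rank is the mean across sorted samples.
    target = X_sort.mean(axis=1)
    # Step 3: put target[k] back at the position holding each column's
    # k-th smallest value, restoring the original feature order.
    ranks = np.empty_like(order)
    np.put_along_axis(ranks, order, np.arange(X.shape[0])[:, None], axis=0)
    return target[ranks]

# Toy 4 x 2 example (features x samples).
X = np.array([[5.0, 4.0],
              [2.0, 1.0],
              [3.0, 3.0],
              [4.0, 2.0]])
print(quantile_normalize(X))
```

After normalization, every column contains the same multiset of values (the target quantiles), arranged according to that sample's original ordering.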

Mathematical Formulation

Quantile normalization operates on a dataset consisting of n samples, each with m features, represented by the matrix X = (X_{j,i}) where j = 1, \dots, m indexes the features (rows) and i = 1, \dots, n indexes the samples (columns). For each sample i, the order statistics are denoted X_{(1,i)} \leq X_{(2,i)} \leq \dots \leq X_{(m,i)}, obtained by sorting the values \{X_{j,i} \mid j=1,\dots,m\} in column i. The target distribution is defined by the average rank-specific values across all samples. Specifically, for each rank k = 1, \dots, m, \text{target}_k = \frac{1}{n} \sum_{i=1}^n X_{(k,i)}, which forms the reference quantile profile shared by all normalized samples. This applies directly even for n=2, using the average of the two sorted samples. The normalization step transforms each original sample while preserving the relative ordering within it. For sample i and feature j, let r_{j,i} be the rank of X_{j,i} in the sorted sample i, i.e., r_{j,i} = k if X_{j,i} = X_{(k,i)}. The normalized value is then Y_{j,i} = \text{target}_{r_{j,i}}. This assignment ensures that the sorted normalized values for every sample i are exactly \{\text{target}_1, \dots, \text{target}_m\}. The formulation derives its effectiveness from aligning the empirical cumulative distribution functions (ECDFs) of all samples. After normalization, for any sample i and threshold \text{target}_k, the probability under the normalized ECDF satisfies P(Y_i \leq \text{target}_k) = k/m, identical across all i, as the k smallest normalized values in each sample equal \text{target}_1, \dots, \text{target}_k. By construction, this equality holds exactly for the empirical distributions (assuming distinct values); agreement with a common underlying population distribution holds only asymptotically, under mild conditions on the underlying distributions.

Illustrative Example

To illustrate quantile normalization, consider a simple dataset with two samples, each containing four values (e.g., representing expression levels for four features). Sample A: [1, 3, 2, 4]; Sample B: [5, 7, 6, 8]. These samples display distributional bias, as Sample A has a mean of 2.5 and median of 2.5, while Sample B has a mean of 6.5 and median of 6.5. The process begins by sorting the values within each sample in ascending order: Sample A becomes [1, 2, 3, 4]; Sample B becomes [5, 6, 7, 8]. Next, calculate the average of the sorted values across samples at each corresponding rank position to obtain the target quantiles: rank 1: (1 + 5)/2 = 3; rank 2: (2 + 6)/2 = 4; rank 3: (3 + 7)/2 = 5; rank 4: (4 + 8)/2 = 6. These targets form the common reference distribution. Replace the sorted values in each sample with these target quantiles while preserving the order, yielding [3, 4, 5, 6] for both. Then, map these back to the original positions based on the ranks of the unsorted data. For Sample A, the original values [1, 3, 2, 4] correspond to ranks 1, 3, 2, 4, so the normalized sample is [3, 5, 4, 6]. For Sample B, the original values [5, 7, 6, 8] correspond to ranks 1, 3, 2, 4, producing the identical [3, 5, 4, 6]. Both samples now share the same empirical distribution, with mean 4.5 and median 4.5. Before normalization, boxplots for the samples would reveal distinct profiles: Sample A spanning the values 1 to 4 in a compact lower range; Sample B spanning 5 to 8 in a shifted higher range, indicating systematic bias. After normalization, the boxplots overlap exactly, both spanning the values 3 to 6, which visually confirms the alignment of distributions across samples. Quantile normalization handles ties by assigning the average of the relevant target quantiles to tied values. For example, if a sample has two values tying for rank 3 (e.g., both 5 in a sorted list [1, 4, 5, 5]), they would both receive the average of the rank-3 and rank-4 targets, such as (5 + 6)/2 = 5.5, ensuring consistent rank preservation.
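The arithmetic in this example, including the tie-handling rule, can be checked with a short script. This is a sketch rather than a reference implementation: it uses scipy.stats.rankdata with average ranks for ties and linear interpolation between target quantiles, and the function and variable names are illustrative.

```python
import numpy as np
from scipy.stats import rankdata

def quantile_normalize(X):
    """Quantile normalization with average-rank handling of ties (columns = samples)."""
    X = np.asarray(X, dtype=float)
    m = X.shape[0]
    # Target quantiles: mean of the column-wise sorted values at each rank.
    target = np.sort(X, axis=0).mean(axis=1)
    # Fractional ranks (1-based; ties receive the average of their ranks).
    ranks = np.apply_along_axis(rankdata, 0, X)
    # A tied rank such as 3.5 falls between integer positions, so interpolate
    # between the rank-3 and rank-4 targets.
    return np.interp(ranks - 1, np.arange(m), target)

A = [1, 3, 2, 4]
B = [5, 7, 6, 8]
print(quantile_normalize(np.column_stack([A, B])))
# Each column becomes [3, 5, 4, 6], matching the worked example.

# Ties: the two 5's in [1, 4, 5, 5] share ranks 3 and 4, so both receive
# the average of the rank-3 and rank-4 target quantiles.
print(quantile_normalize(np.column_stack([[1, 4, 5, 5], [5, 6, 7, 8]])))
```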

Properties

Advantages

Quantile normalization aligns the distributions of multiple samples to be identical in shape by matching their quantiles, thereby enabling direct and fair comparisons across datasets without requiring assumptions about the underlying distribution, such as normality. This property is particularly valuable in high-dimensional settings where samples may exhibit systematic shifts due to technical variations, allowing researchers to focus on biological differences rather than artifacts. In the seminal work introducing the method for microarray data, quantile normalization demonstrated superior performance in reducing variance compared to scaling approaches and comparable or slightly better results than more complex non-linear methods like cyclic loess. Unlike mean-based normalization techniques, quantile normalization is robust to outliers because it operates on ranks rather than raw values, preventing extreme measurements from disproportionately influencing the adjustment process. By sorting probe intensities within each sample and replacing them with average quantiles while preserving the original order, the method maintains the relative relationships among features in individual samples, avoiding the introduction of artificial correlations that could arise from value-based transformations. This rank-based approach ensures stability even in the presence of noisy or skewed data common in genomic experiments. The technique effectively mitigates technical biases, such as batch effects, in high-dimensional data by equalizing distributional properties across samples, which enhances the accuracy of downstream analyses like differential expression detection. For instance, in microarray studies, it has been shown to remove unwanted variations between arrays, leading to more reliable identification of biologically relevant signals. Quantile normalization is also straightforward to implement and computationally efficient, primarily involving sorting operations with a complexity of O(n m log m), where n is the number of samples and m is the number of features, making it suitable for large-scale datasets.

Disadvantages and Limitations

One key limitation of quantile normalization is its fundamental assumption that all samples should share identical underlying distributions after adjustment for technical artifacts. This premise can distort genuine biological heterogeneity, particularly in datasets involving diverse sample types, such as different tissues, where global distributional differences reflect meaningful physiological variations rather than biases. For instance, applying quantile normalization to multi-tissue data from the GTEx consortium can lead to over-normalization, skewing tissue-specific profiles toward the dominant tissue's distribution and inflating root mean squared errors in downstream analyses. Quantile normalization is also sensitive to imbalances in sample sizes across groups, where smaller cohorts contribute little to a reference distribution dominated by the larger groups, reducing the reliability of the normalization. In scenarios with few samples per group (e.g., fewer than 10), the method's reliance on averaging across limited data points exacerbates variance estimation errors, limiting the overall statistical power even for genes with large effect sizes. Furthermore, by excessively equalizing variances across samples, quantile normalization can diminish the power of differential expression tests, as the procedure mixes signals from non-differentially and differentially expressed genes, thereby attenuating detectable mean differences. The handling of tied values in discrete datasets, such as read counts, introduces additional challenges; quantile normalization typically resolves ties through arbitrary averaging of ranks, which can inadvertently smooth out subtle biological differences that might otherwise be preserved. The approach also assumes independent and identically distributed (i.i.d.) observations within samples, rendering it unsuitable for non-i.i.d. structures common in complex experimental designs. Similarly, when global shifts in expression levels (e.g., overall upregulation in certain conditions) convey important biological information, the method's enforcement of distributional uniformity can erase these signals, leading to biased interpretations.

Applications

In Genomics and Bioinformatics

Quantile normalization was initially developed and widely adopted for processing microarray data, particularly from Affymetrix arrays, to mitigate array-specific technical artifacts such as differences in probe hybridization efficiencies and scanning variations. In these applications, it equalizes the intensity distributions across arrays, enabling reliable comparisons of expression levels between samples. The method, as introduced in seminal work on high-density oligonucleotide arrays, effectively reduces between-array variance while preserving biological signals, and forms a core component of the Robust Multi-array Average (RMA) preprocessing pipeline commonly applied to Affymetrix data. In RNA-Seq analysis, quantile normalization facilitates between-sample normalization by aligning the empirical distributions of read counts, thereby correcting for variations in sequencing depth (library size) and, to some extent, compositional biases that can distort relative expression estimates. This approach is particularly useful when integrating datasets from different experiments or platforms, as it ensures comparable intensity profiles without assuming a specific form for the count distributions. Although specialized methods like the trimmed mean of M-values (TMM) or median-of-ratios approach are often preferred in tools such as edgeR or DESeq2, quantile normalization remains a viable option for exploratory analyses or when direct distributional matching is desired. For single-cell RNA-sequencing (scRNA-Seq), quantile normalization addresses technical noise introduced by variable capture efficiencies, amplification biases, and high dropout rates, where zero or low counts predominate due to limited mRNA input per cell. By forcing quantile equivalence across cells, it stabilizes variance and enhances clustering or differential expression detection, though it must be applied cautiously to avoid over-correction of sparse data. Benchmarks evaluating multiple normalization strategies highlight quantile normalization's effectiveness in reducing batch effects and improving reproducibility in scRNA-Seq workflows. Quantile normalization is integrated into established bioinformatics pipelines for differential gene expression analysis, such as the limma package, where the normalizeQuantiles function preprocesses expression or log-transformed data prior to linear modeling with empirical Bayes moderation. In the context of The Cancer Genome Atlas (TCGA) datasets, it has been extensively used since the project's early phases (post-2010) to ensure cross-batch comparability in multi-omic studies, often combined with loess-based corrections for intensity-dependent biases on certain array platforms. This combination enhances the removal of non-linear technical effects, supporting robust pan-cancer analyses of thousands of samples.
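As a rough sketch of this exploratory use, the snippet below log-transforms a simulated count matrix and quantile-normalizes it across samples so that all columns share the same distribution despite very different library sizes. The simulated counts, the pseudocount of 1, and the variable names are illustrative assumptions; for formal differential expression analysis, dedicated methods such as TMM or the median-of-ratios approach are generally preferred, as noted above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical counts for 1,000 genes in 4 samples whose sequencing depths
# (library sizes) differ by up to 8-fold; simulated purely for illustration.
depths = np.array([1.0, 2.5, 0.5, 4.0])
counts = rng.poisson(np.outer(rng.gamma(2.0, 50.0, size=1000), depths))

# Log-transform with a pseudocount, then quantile-normalize across samples.
logc = np.log2(counts + 1)
order = np.argsort(logc, axis=0)
target = np.take_along_axis(logc, order, axis=0).mean(axis=1)
ranks = np.empty_like(order)
np.put_along_axis(ranks, order, np.arange(logc.shape[0])[:, None], axis=0)
log_norm = target[ranks]

# Column medians differ before normalization and are identical afterwards.
print(np.median(logc, axis=0))
print(np.median(log_norm, axis=0))
```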

In Other Scientific Fields

Quantile normalization has found applications in mass spectrometry-based proteomics to standardize intensity distributions across multiple experimental runs, thereby mitigating systematic variations due to instrument drift and technical artifacts. This technique ensures that the empirical distributions of measured intensities match, facilitating more reliable comparisons of protein abundance profiles in quantitative workflows. For instance, evaluations of normalization methods in label-free proteomics have shown that quantile normalization effectively reduces non-biological variance while preserving biological signals, though it may sometimes underperform compared to specialized approaches like probabilistic quotient normalization in certain datasets. As of 2025, comparative assessments across omic layers report mixed results, with some identifying quantile normalization as less robust than alternatives such as probabilistic quotient normalization in multi-omics pipelines. In medical imaging and radiomics, quantile normalization is employed to standardize the distributions of radiomic features extracted from computed tomography (CT) and magnetic resonance imaging (MRI) scans, addressing variability introduced by different scanners and acquisition protocols. This standardization enhances the reproducibility of predictive models for tasks such as tumor characterization and outcome prediction, as demonstrated in studies from the mid-2010s that highlighted its role in reducing inter-scanner discrepancies in feature values. By aligning distributions across images, the method minimizes technical variability, enabling robust multi-center analyses without altering the underlying biological information. In economics and finance, quantile normalization is utilized to align the distributional shapes of curves, such as income or expenditure curves, allowing for fairer comparisons of metrics across regions or time periods. This approach corrects for systematic biases in survey or financial datasets, preserving the relative ordering of observations while equalizing distributional properties, as applied in analyses of yield curves and econometric models for asset returns. For example, in modeling crude oil price dynamics, it has been used to preprocess data for arbitrage pricing theory models, ensuring that variance related to market factors is accurately captured without distortion from uneven distributions. The technique has been applied in metabolomics, particularly with nuclear magnetic resonance (NMR) spectroscopy, since the early 2010s to correct batch effects in studies of metabolite profiles. In these contexts, quantile normalization adjusts spectral intensities across batches to remove technical variations from instrument calibration or sample handling, improving the detection of subtle biological changes in large-scale studies. Comparative assessments have confirmed its utility in NMR data, where it outperforms simpler scaling methods by maintaining the integrity of concentration rankings while homogenizing distributions. Emerging applications include climate and environmental monitoring, where quantile normalization standardizes readings from distributed sensor networks to account for inconsistencies in instrument calibration or environmental conditions. This is particularly valuable for ecological studies that use the method to align empirical distributions and enhance the reliability of trend analyses.

Variants and Extensions

Robust Quantile Normalization

Robust quantile normalization modifies the standard quantile normalization procedure to improve resistance to outliers and noise by altering the computation of the target distribution. Instead of using the mean of the sorted values across samples at each rank, it employs the median or a weighted mean, which downweights the influence of extreme values in individual samples. This variant also includes options to exclude extreme samples (based on variance or intensity) prior to normalization, effectively trimming outliers at the sample level. In detail, for a matrix with n samples and m features, each sample is sorted in ascending order to obtain X_{(i,1)} \leq X_{(i,2)} \leq \cdots \leq X_{(i,m)} for sample i = 1 to n. The target value for rank k is then computed as \text{target}_k = \operatorname{median}_{i=1,\dots,n} \left( X_{(i,k)} \right) when the median option is selected, rather than the average \frac{1}{n} \sum_{i=1}^n X_{(i,k)}. Alternatively, Winsorization-like trimming can be applied by removing high-variance or extreme-mean samples before calculating the targets, capping the impact of aberrant data points. Each sample is subsequently adjusted so that its sorted values match this robust target distribution, preserving rank order while mitigating outlier effects. This method offers advantages over standard quantile normalization in datasets prone to noise or outliers, as the median-based target reduces distortion from extreme values and better maintains underlying biological signals. It was developed as an extension of rank-based normalization techniques and is implemented in the preprocessCore R package as the normalize.quantiles.robust function.
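A minimal sketch of the median-target idea is shown below; it mirrors the standard algorithm but swaps the rank-wise mean for a median, and it omits the sample-exclusion, trimming, and chip-weighting options described above. The function name and the toy data are illustrative, and this is not the preprocessCore implementation.

```python
import numpy as np

def robust_quantile_normalize(X, use_median=True):
    """Quantile normalization with a robust, median-based target distribution.

    X is a p x n matrix (rows = features, columns = samples). The target at
    each rank is the median of the sorted columns rather than their mean,
    which limits the influence of a single aberrant sample.
    """
    X = np.asarray(X, dtype=float)
    order = np.argsort(X, axis=0)
    X_sort = np.take_along_axis(X, order, axis=0)
    target = np.median(X_sort, axis=1) if use_median else X_sort.mean(axis=1)
    ranks = np.empty_like(order)
    np.put_along_axis(ranks, order, np.arange(X.shape[0])[:, None], axis=0)
    return target[ranks]

# The outlying third sample barely shifts the median-based target quantiles.
X = np.column_stack([[1, 2, 3, 4], [2, 3, 4, 5], [10, 40, 70, 100]])
print(robust_quantile_normalize(X))          # median target = [2, 3, 4, 5]
print(robust_quantile_normalize(X, False))   # mean target is pulled toward the outlier
```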

Smooth Quantile Normalization

Smooth quantile normalization, often referred to as qsmooth, is a generalization of standard quantile normalization that incorporates smoothing to estimate reference distributions while preserving differences between predefined biological groups, such as tissue or cell types in genomic data. This variant addresses limitations in discrete rank-based methods by modeling empirical quantile functions across samples using linear regression with group covariates, then applying smoothing to the regression coefficients. The detailed process begins by estimating the empirical quantile function F_j^{-1}(u) for each sample j at quantile levels u \in \{1/(n_j+1), \dots, n_j/(n_j+1)\}, where n_j is the sample size. A linear model is fitted at each u: F_j^{-1}(u) = \beta_0(u) + \sum_g \beta_g(u) I(g = \text{group of } j) + \varepsilon_j(u), with coefficients smoothed via a rolling median filter over a window of width approximately 0.05 times the number of quantiles to produce continuous group-specific inverse cumulative distribution functions (CDFs), \hat{F}_g^{-1}(u). The target quantile function for a sample i in group g(i) is a weighted average, balancing the overall average inverse CDF \bar{F}^{-1}(u) = \frac{1}{J} \sum_j F_j^{-1}(u) and the group-specific \hat{F}_{g(i)}^{-1}(u), with weights w_u computed as the smoothed median of 1 - S_B(u)/S_T(u), where S_B(u) and S_T(u) measure between-group and total variability across quantiles, respectively. The normalized value \hat{y}_{ij} for an observation at rank-based quantile q in sample i is then \hat{y}_{ij} = w_q \, \bar{F}^{-1}(q) + (1 - w_q) \, \hat{F}_{g(i)}^{-1}(q), which maps the observation continuously via the estimated quantile functions, akin to F_{\text{avg}}^{-1}(F_i(x)) but adapted for group structure and smoothing. This approach estimates empirical quantiles via smoothing applied to the regression coefficients rather than direct rank averaging, enabling inversion of the CDFs for a continuous mapping that mitigates jumps arising from tied values or sparse data. It performs well with small sample sizes by stabilizing variance in the quantile estimates through the smoothing and weighting, reducing noise in downstream analyses compared to unsmoothed methods. Additionally, it is suited for continuous data where discrete ranks may introduce artifacts, as the smoothed quantile functions approximate underlying distributions more flexibly. The method was proposed in a 2018 Biostatistics paper to remove technical biases in genomic datasets while retaining biological group differences, such as tissue-specific expression patterns across distinct tissue types. It has also been applied to profiles from purified blood cell populations, improving the clustering of cell types by preserving subtle distributional shifts between groups.
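The central idea, blending an overall reference with a group-specific reference at each rank, can be sketched in a few lines. This is a deliberately simplified illustration under stated assumptions, not the published qsmooth algorithm: the blending weight here is a single fixed scalar rather than the smoothed, quantile-specific weight w_u described above, and coefficient smoothing and tie handling are omitted. All names are illustrative.

```python
import numpy as np

def group_aware_qn_sketch(X, groups, weight=0.8):
    """Simplified group-aware quantile normalization (qsmooth-like sketch).

    X: p x n matrix (features x samples); groups: length-n group labels.
    Each sample's rank-k value becomes a blend of the overall rank-k mean
    and its own group's rank-k mean, so group-level distributional
    differences are partially preserved instead of being removed.
    """
    X = np.asarray(X, dtype=float)
    groups = np.asarray(groups)
    order = np.argsort(X, axis=0)
    X_sort = np.take_along_axis(X, order, axis=0)
    overall = X_sort.mean(axis=1)                     # overall reference quantiles
    Y = np.empty_like(X)
    for g in np.unique(groups):
        cols = np.where(groups == g)[0]
        group_ref = X_sort[:, cols].mean(axis=1)      # group-specific reference
        target = weight * overall + (1 - weight) * group_ref
        for j in cols:
            ranks = np.argsort(order[:, j])           # rank of each original entry
            Y[:, j] = target[ranks]
    return Y

X = np.column_stack([[1, 3, 2, 4], [2, 4, 3, 5],            # group "a": low range
                     [10, 30, 20, 40], [12, 32, 22, 44]])   # group "b": high range
print(group_aware_qn_sketch(X, groups=["a", "a", "b", "b"]))
```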
