
Data binning

Data binning, also known as data discretization, is a preprocessing technique in statistics and data analysis that transforms continuous numerical data into a set of intervals or categories, known as bins, to simplify representation and analysis. This method involves sorting the data values and partitioning them into bins based on specified criteria, such as equal width or equal frequency, allowing for the reduction of data granularity while preserving essential distributional characteristics.

In practice, data binning serves multiple purposes across various domains, including smoothing noisy observations by replacing values with bin statistics like means or medians, mitigating the effects of outliers, and preparing datasets for algorithms that perform better with categorical features. For instance, equal-width binning divides the data range into uniform intervals regardless of data density, which can be sensitive to extreme values, whereas equal-frequency binning ensures each bin contains approximately the same number of data points, making it more robust for skewed distributions. These approaches are foundational in creating histograms for visualizing data distributions, where bin width critically influences the balance between detail and smoothness—too few bins obscure patterns, while too many introduce noise.

Beyond visualization, data binning plays a key role in exploratory analysis and modeling by enabling the identification of trends, such as multimodal distributions or clusters, and supporting subsequent statistical tests or predictive techniques. Optimal bin selection often relies on rules like the square root of the sample size or data-driven methods to minimize estimation error, ensuring the technique enhances interpretability without significant loss of information. Applications span fields such as statistics, where binning underpins histograms and density estimates for analyzing empirical data, and machine learning, where it supports feature preprocessing in predictive tasks.

Overview

Definition

Data binning, also known as discretization, quantization, or bucketing, is the process of transforming continuous numerical data into a set of intervals or "bins," where each data point is assigned to a bin based on its value. This technique quantizes the data by partitioning the range of values into finite, non-overlapping intervals, enabling the representation of potentially infinite continuous values as a manageable set of categories. The key components of data binning include the bins themselves, which are defined as finite intervals such as [a, b), along with bin edges (also called cut-points) that delineate the boundaries between intervals, and bin labels that represent each interval, such as the numerical midpoint or a categorical name. These elements ensure that the original data's structure is mapped systematically while reducing complexity. Mathematically, a data point x is assigned to bin k if it falls within the interval \text{bin}_k = [l_k, u_k), satisfying l_k \leq x < u_k, where l_k and u_k are the lower and upper edges of the bin, respectively. Unlike rounding, which approximates continuous values to the nearest representative point and thereby loses the interval-based grouping, binning preserves the interval structure to maintain contextual relationships among values within each bin. This process serves broader goals like simplifying analysis, as explored later.
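For illustration, the interval-membership rule can be sketched in a few lines of Python; the edges below are arbitrary examples, and NumPy's digitize is used only as one convenient way to express the assignment:

python
import numpy as np

# Hypothetical bin edges defining half-open intervals [0, 10), [10, 20), [20, 30)
edges = np.array([0.0, 10.0, 20.0, 30.0])

# np.digitize returns i such that edges[i-1] <= x < edges[i] (default right=False);
# subtracting 1 makes the first interval carry index 0
x = np.array([3.5, 10.0, 27.2])
bin_index = np.digitize(x, edges) - 1

print(bin_index)  # [0 1 2]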

Purpose and Motivation

Data binning addresses key challenges in handling continuous data by grouping values into intervals, primarily to reduce noise arising from minor observation errors or measurement inaccuracies. This process mitigates the effects of small perturbations in the data, leading to more stable and reliable analyses. Binning also facilitates outlier management by confining extreme values to edge intervals, preventing them from skewing overall results without complete removal. Furthermore, it transforms numerical features into categorical ones, enabling the application of techniques designed for categorical data, such as decision trees or rule induction in machine learning. In large-scale datasets, binning enhances computational efficiency by decreasing the cardinality of features, which reduces memory requirements and accelerates execution.

From a statistical perspective, binning provides a foundational method for approximating the underlying distribution of a variable, particularly through histogram construction, where bin counts reflect relative frequencies. It smooths inherent data distributions by aggregating points within bins, highlighting broader trends while dampening irrelevant variations. This approach supports non-parametric density estimation, allowing practitioners to infer distributional shapes without presupposing a parametric form such as normality. Practically, binning counters issues in high-dimensional spaces, where the curse of dimensionality causes sparsity and computational intractability; by discretizing features, it compresses the feature space and improves model performance on sparse data. The technique's roots trace to early statistical practices, exemplified by Herbert Sturges' rule for determining bin counts based on sample size, which aimed to balance resolution and generalization in frequency distributions. This historical motivation continues to drive its use in contemporary data science for robust preprocessing and analysis.

Binning Techniques

Equal-Width Binning

Equal-width binning, also known as equal-interval binning, is a straightforward technique that partitions the range of a continuous variable into a fixed number of intervals of identical width. This method begins by identifying the minimum and maximum values in the data, denoted as \min and \max, respectively, and then divides the total range \max - \min into k equal parts, where k is the specified number of bins. The width w of each bin is calculated as w = \frac{\max - \min}{k}. Data points are then assigned to bins based on their position within these intervals, promoting a uniform division regardless of data distribution density.

The assignment of a data point x to a bin follows the formula for the bin index i = \left\lfloor \frac{x - \min}{w} \right\rfloor, where i ranges from 0 to k-1. The bins are typically defined as half-open intervals: the i-th bin covers [\min + i \cdot w, \min + (i+1) \cdot w) for i = 0 to k-2, with the final bin [\min + (k-1) \cdot w, \max] closed on the right to include the maximum value. This ensures all points, including the extremes, are properly binned without overflow; if the floor computation yields i = k exactly (which can occur due to floating-point effects or an exact match at the upper boundary), the point is assigned to the last bin k-1. Ties at interior bin boundaries are conventionally assigned to the upper bin, consistent with the half-open scheme.

A key parameter in equal-width binning is the number of bins k, which determines the granularity of the discretization. Common heuristics for selecting k include the square root rule k \approx \sqrt{n}, where n is the number of data points, Sturges' rule k = \lceil \log_2 n + 1 \rceil, and Rice's rule k = \lceil 2 n^{1/3} \rceil, each balancing detail and smoothness based on sample size. Empty bins may arise if the data is clustered or skewed, as this method does not adjust for frequency; in such cases, the bins remain as defined without merging, preserving the uniform structure. This approach is particularly advantageous for its computational simplicity and interpretability, making it suitable for datasets assumed to be roughly uniformly distributed, where equal intervals align naturally with the data's spread.

For illustration, consider a dataset {1, 2, 3, 4, 5} with k = 2. Here, \min = 1, \max = 5, and w = \frac{5-1}{2} = 2, yielding bins [1, 3) and [3, 5]. The points 1 and 2 fall into the first bin (index 0), while 3, 4, and 5 fall into the second (index 1), as \left\lfloor \frac{5-1}{2} \right\rfloor = 2 is capped to 1 for the last bin.
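The procedure can be sketched directly in Python (an illustrative implementation, assuming NumPy); it reproduces the {1, 2, 3, 4, 5}, k = 2 example above, including the cap that keeps the maximum in the last bin:

python
import numpy as np

def equal_width_bins(x, k):
    """Assign each value to one of k equal-width bins; returns indices 0..k-1."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    w = (hi - lo) / k                      # bin width w = (max - min) / k
    idx = np.floor((x - lo) / w).astype(int)
    return np.clip(idx, 0, k - 1)          # cap so the maximum stays in the last bin

data = [1, 2, 3, 4, 5]
print(equal_width_bins(data, k=2))         # [0 0 1 1 1]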

Equal-Frequency Binning

Equal-frequency binning, also known as quantile binning or equal-depth partitioning, is a discretization technique that partitions a continuous dataset into a predefined number of bins such that each bin contains approximately the same number of data points. This approach is particularly useful in statistics and data preprocessing for handling non-uniform distributions by ensuring balanced sample sizes across bins.

The algorithm begins by sorting the data in ascending order. For a dataset of size n and k bins, the bin edges are determined using quantiles to divide the sorted values into groups of roughly n/k points each. Specifically, the edges q_i are set at the i \times (n/k)-th position in the sorted list, with adjustments such as rounding or ceiling functions to account for fractional divisions and ensure no bin is excessively small. A value x is then assigned to bin i if q_{i-1} \leq x < q_i, where q_0 = -\infty and q_k = +\infty. Key parameters include the number of bins k, which controls the level of detail in the discretization, and rules for managing ties or uneven splits, such as rounding the division points or assigning duplicate values to the lower bin to maintain approximate equality. The computational complexity is typically O(n \log n) due to the initial sorting step.

In mathematical terms, the bin edges can be expressed using the percentile function: q_i = \text{percentile}(\text{data}, i \times 100 / k) \quad \text{for} \quad i = 0, 1, \dots, k. This formulation leverages the empirical cumulative distribution to place boundaries at equal probability intervals in the observed data.

For skewed distributions, equal-frequency binning excels by adapting bin widths to the data density, allocating narrower intervals in dense regions and wider ones in sparse areas, thereby avoiding empty bins that plague methods assuming uniform spacing. Unlike equal-width binning, which uses fixed intervals and can result in imbalanced or vacant bins for non-uniform data, this quantile-based division ensures representative sampling across the range.

Consider a sorted dataset of five values: [1, 2, 3, 10, 20], with k = 2. The target size per bin is 5/2 = 2.5, so the first bin edge is at the ceiling-adjusted position of 3 (the median value), yielding bins [1, 3) containing 2 points (1, 2) and [3, 20] containing 3 points (3, 10, 20).
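A minimal Python sketch of the percentile-based edges (illustrative only; library implementations differ in their quantile interpolation conventions) reproduces the five-value example above:

python
import numpy as np

def equal_freq_edges(x, k):
    """Interior bin edges placed at the i*100/k percentiles of the observed data."""
    return [np.percentile(x, i * 100 / k) for i in range(1, k)]

data = [1, 2, 3, 10, 20]
edges = equal_freq_edges(data, k=2)   # [3.0] -- the median splits the data
print(edges)

# Assign x to bin i when q_{i-1} <= x < q_i, with q_0 = -inf and q_k = +inf
print(np.digitize(data, edges))       # [0 0 1 1 1]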

Adaptive and Advanced Methods

Adaptive binning techniques adjust bin widths dynamically based on the underlying distribution, often employing kernel density estimation (KDE) to determine optimal edges that reflect local density variations. In this approach, the density is estimated using a kernel function, such as a Gaussian kernel, where the bandwidth is adapted to the local density to avoid over-smoothing in sparse regions or under-smoothing in dense areas. For instance, adaptive KDE sets bin boundaries at points where the estimated density reaches predefined quantiles or inflection points, ensuring bins capture local features of the distribution more effectively than fixed-width methods. This method is particularly useful for datasets with varying densities, as demonstrated in multivariate adaptive binned quasi-interpolation density estimation, which uses a quadtree structure to refine bins iteratively based on KDE outputs.

Clustering-based binning leverages algorithms like K-means to form bins centered around cluster centroids, assigning data points to the nearest bin via the assignment rule: each point x_i is allocated to cluster k by minimizing \arg\min_k \| x_i - \mu_k \|^2, where \mu_k is the centroid of bin k. This results in irregular bin shapes and sizes that adapt to natural groupings in the data, improving representation for non-uniform distributions compared to quantile-based partitioning. K-means binning has been applied in histogram construction for image registration, where it replaces equidistant binning to better align intensity distributions by clustering values into meaningful intervals. The algorithm iteratively updates centroids and reassigns points until convergence, typically requiring an initial specification of the number of bins k.

Information-based methods, such as entropy minimization or chi-square tests, create bins by optimizing statistical criteria that preserve predictive power, especially in supervised settings. The ChiMerge algorithm, a seminal supervised technique, starts with each data point in its own interval and iteratively merges adjacent intervals if the chi-square test between their class distributions yields a p-value greater than a user-defined significance level (e.g., 0.05), effectively reducing information loss while maintaining class separability. This bottom-up merging process uses the statistic \chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}} to evaluate adjacency, where O_{ij} and E_{ij} are observed and expected frequencies for class j in interval i, ensuring bins are statistically homogeneous yet discriminative. ChiMerge has been widely adopted for preprocessing in machine learning, as it directly incorporates target variable information to guide bin formation.

Advanced variants extend these ideas to specific modeling contexts, such as optimal binning in decision trees like CART, where recursive splitting on continuous features effectively creates adaptive bins by selecting thresholds that maximize impurity reduction (e.g., Gini index) at each node. In CART, the optimal split for a feature is found by evaluating all possible cut points from the sorted feature values, leading to a tree that implicitly bins variables into decision regions without predefined bin counts. For probabilistic models, Bayesian binning approaches, such as Bayesian Binning into Quantiles (BBQ), calibrate probability outputs by non-parametrically partitioning the score space into quantiles and applying a Bayesian smoother to estimate event rates, addressing issues like overconfidence in classifiers. BBQ models bin probabilities as a mixture of beta distributions, updating posteriors to produce well-calibrated predictions across bins. These methods integrate prior knowledge or target supervision to refine bins for enhanced model reliability.
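For illustration of the ChiMerge-style merge criterion, the following simplified sketch (a single adjacency check for two classes, not the full algorithm) computes the chi-square statistic from per-class counts in two neighboring intervals and compares it against a significance threshold:

python
import numpy as np
from scipy.stats import chi2

def adjacency_chi2(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals given per-class counts."""
    observed = np.array([counts_a, counts_b], dtype=float)  # rows: intervals, cols: classes
    row = observed.sum(axis=1, keepdims=True)
    col = observed.sum(axis=0, keepdims=True)
    expected = row * col / observed.sum()
    return float(np.sum((observed - expected) ** 2 / expected))

# Two adjacent intervals with similar class mixes -> small statistic -> merge
stat = adjacency_chi2([8, 2], [7, 3])
threshold = chi2.ppf(0.95, df=1)      # one degree of freedom for two classes
print(round(stat, 3), "merge" if stat < threshold else "keep separate")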
A key challenge in these adaptive and advanced methods is computational complexity; for example, sorting-based initialization in ChiMerge or KDE requires O(n \log n) time for n data points, while K-means binning incurs O(n k t) complexity, where k is the number of bins and t is the number of iterations, potentially scaling poorly for large datasets without approximations.
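A clustering-based binning step can be sketched with scikit-learn's KMeans as follows; the synthetic bimodal data and the midpoint convention for deriving edges are illustrative assumptions rather than part of any cited method:

python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Bimodal one-dimensional data: a dense cluster near 0 and a sparser one near 10
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(10, 2, 100)]).reshape(-1, 1)

k = 4
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(x)

# Relabel clusters so that bin 0 covers the lowest values and bin k-1 the highest
order = np.argsort(km.cluster_centers_.ravel())
rank = np.empty(k, dtype=int)
rank[order] = np.arange(k)
bins = rank[km.labels_]

# Approximate bin edges as midpoints between adjacent sorted centroids
centers = np.sort(km.cluster_centers_.ravel())
edges = (centers[:-1] + centers[1:]) / 2
print("edges:", np.round(edges, 2))
print("bin counts:", np.bincount(bins))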

Applications

In Statistics and Visualization

Data binning serves as the foundational process in constructing histograms, a graphical representation of the distribution of numerical data where continuous values are grouped into intervals, or bins, and the height of each bar corresponds to the frequency or count of observations within that interval. This allows for the visualization of empirical distributions, enabling statisticians to identify patterns such as skewness, spread, and modality. The choice of the number of bins, denoted as k, is critical to accurately representing the underlying distribution without introducing excessive distortion; common rules of thumb include Sturges' rule, k \approx 1 + \log_2 n, where n is the sample size, which approximates to k = 1 + 3.322 \log_{10} n for practical computation. This rule assumes a roughly normal distribution and provides a baseline for bin selection in exploratory analysis.

In density estimation, binned histograms offer a piecewise constant approximation of the probability density function, where bin counts are normalized by the bin width and total sample size to estimate density. Unlike kernel density estimation (KDE), which applies a smooth kernel function to each point for a continuous estimate, binned approaches inherently reduce variance in the empirical distribution by aggregating observations within intervals, thereby mitigating the impact of sampling noise at the cost of introducing bias from the discretization. This variance reduction makes histograms particularly useful for initial assessments of distribution shape, though KDE is often preferred when smoother estimates are needed without fixed bin boundaries. The historical development of binning for such purposes traces back to Karl Pearson's work in the 1890s, where he employed grouped frequency data to fit theoretical curves to empirical distributions, laying groundwork for modern histogram-based density estimation.

Binning also plays a key role in hypothesis testing, particularly in the chi-square goodness-of-fit test, where continuous data must first be categorized into bins to create observed frequencies comparable to expected values under a hypothesized distribution, effectively treating the binned values as categorical proxies. This approach, originally formalized by Pearson in 1900, allows testing whether the empirical distribution aligns with a specified model, such as a normal distribution, by comparing binned counts against theoretical probabilities. For continuous variables, the number of bins is typically chosen to ensure sufficient expected frequencies per bin (often at least 5) while preserving distributional details.

In visualization, the selection of bin parameters directly influences interpretability; too few bins can lead to oversmoothing, masking multimodality or fine-grained features in the data, while too many bins result in undersmoothing, amplifying random noise and creating a jagged appearance that obscures the true signal. Optimal binning balances these effects to reveal underlying patterns without artifactual distortions, often guided by rules like Sturges' or adaptive methods that adjust based on data characteristics such as skewness. For instance, in datasets exhibiting clear peaks, increasing k beyond Sturges' recommendation may be necessary to avoid hiding multimodality.
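For illustration, the following NumPy sketch (synthetic normal data, names chosen here) applies Sturges' rule and normalizes the resulting bin counts into the piecewise-constant density estimate described above:

python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=500)   # sample assumed roughly normal

n = len(x)
k = int(np.ceil(1 + np.log2(n)))               # Sturges' rule: k = ceil(1 + log2 n)
print("Sturges' bin count:", k)                # 10 for n = 500

# Normalized histogram: counts / (n * bin width) gives a piecewise-constant density
counts, edges = np.histogram(x, bins=k)
width = edges[1] - edges[0]
density = counts / (n * width)
print("density integrates to", np.sum(density * width))  # 1.0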

In Machine Learning and Data Preprocessing

In machine learning pipelines, data binning serves as a key preprocessing technique for discretizing continuous features, enabling algorithms that perform better with categorical data or that require computational efficiency. For tree-based models such as decision trees and random forests, binning reduces the number of potential splits by grouping values into discrete intervals, which simplifies the search for split thresholds and enhances model interpretability without significant loss in accuracy. In gradient-boosting frameworks such as LightGBM and XGBoost, internal histogram binning approximates continuous splits by quantizing features into fixed bins (typically 255 or fewer), accelerating training through faster histogram construction while maintaining accuracy comparable to exact splits. This preprocessing step is often applied early in the pipeline, before imputation or scaling, to handle high-cardinality continuous variables and prevent issues like excessive memory usage in downstream tasks.

Supervised binning methods leverage target variable information to create bins that preserve monotonic relationships between features and outcomes, outperforming unsupervised approaches in predictive tasks. A prominent example is monotonic binning combined with Weight of Evidence (WoE) transformation, widely used in credit scoring to transform features into scores that reflect their separation power for good versus bad risks; WoE for a bin is calculated as \ln\left(\frac{\% \text{ of good outcomes in bin}}{\% \text{ of bad outcomes in bin}}\right), ensuring bins align with class distributions. This target-aware discretization minimizes information loss and supports models in regulatory-compliant applications like credit risk assessment.

In libraries such as scikit-learn, binning integrates seamlessly via tools like KBinsDiscretizer, which applies equal-width or equal-frequency strategies before feeding data into classifiers, addressing high-cardinality issues post-binning through ordinal or one-hot encoding. Similarly, in R preprocessing workflows, discretization typically precedes scaling to normalize binned categories. For specific algorithms, binning mitigates sparse probability estimates in Naive Bayes by reducing feature cardinality—Naive Bayes relies on probabilistic counts, and excessive unique values lead to sparse probabilities; supervised binning creates optimal boundaries using target-based splits to consolidate rare values. Feature selection can also employ bin-wise information gain, where each bin's entropy reduction relative to the target guides variable ranking, enhancing model sparsity in high-dimensional settings.

In modern AutoML frameworks of the 2020s, such as Auto-sklearn and H2O AutoML, automated discretization is embedded in pipeline optimization, dynamically selecting bin counts and strategies during hyperparameter search to balance model performance and complexity across datasets. This automation extends binning's utility beyond manual feature engineering, particularly for tabular data where continuous features dominate.
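As a sketch of these ideas, the following example combines scikit-learn's KBinsDiscretizer (quantile strategy) with a hand-rolled Weight of Evidence calculation using the formula above; the simulated income/default data and all variable names are assumptions for illustration only, not a reference credit-scoring implementation:

python
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(2)
income = rng.lognormal(mean=10, sigma=0.5, size=1000)
# Default probability decreases with income (purely synthetic relationship)
default = (rng.random(1000) < 1 / (1 + np.exp((income - 22000) / 5000))).astype(int)

# Equal-frequency (quantile) discretization into 5 ordinal bins
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
bins = disc.fit_transform(income.reshape(-1, 1)).ravel().astype(int)

# Weight of Evidence per bin: ln(% of goods in bin / % of bads in bin)
df = pd.DataFrame({"bin": bins, "bad": default})
goods = (1 - df["bad"]).groupby(df["bin"]).sum()
bads = df.groupby("bin")["bad"].sum()
woe = np.log((goods / goods.sum()) / (bads / bads.sum()))
print(woe)  # WoE is expected to rise across income bins for this synthetic data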

Examples

Numerical Example

Consider a small dataset of seven exam scores: 34, 45, 56, 67, 78, 82, and 91. These values represent continuous data with a minimum of 34 and a maximum of 91, spanning a range of 57 points. To apply equal-width binning with k=3 bins, first calculate the bin width as (91 - 34) / 3 = 19. The bins are then defined as [34, 53), [53, 72), and [72, 91]. Assigning the scores to these bins yields: 34 and 45 in the first bin (count: 2), 56 and 67 in the second bin (count: 2), and 78, 82, and 91 in the third bin (count: 3). This method partitions the data range into uniform intervals, regardless of data distribution.

In contrast, equal-frequency binning aims for approximately equal numbers of observations per bin. Sort the data: 34, 45, 56, 67, 78, 82, 91. With n=7 and k=3, each bin should hold about 7/3 \approx 2.33 values. The bin edges are set at the 33rd and 66th percentiles, at 56 and 78, resulting in bins [34, 56), [56, 78), and [78, 91]. Assigning the scores gives: 34 and 45 in the first bin (count: 2), 56 and 67 in the second bin (count: 2), and 78, 82, and 91 in the third bin (count: 3). This approach adjusts boundaries to balance counts, accommodating the data's density.

The following table summarizes the bin assignments for both methods:
Score   Equal-Width Bin   Equal-Frequency Bin
34      [34, 53)          [34, 56)
45      [34, 53)          [34, 56)
56      [53, 72)          [56, 78)
67      [53, 72)          [56, 78)
78      [72, 91]          [78, 91]
82      [72, 91]          [78, 91]
91      [72, 91]          [78, 91]
Counts  2, 2, 3           2, 2, 3
These binning results reveal patterns in the data, such as clustering in the higher-range scores (around 70–90), where three values fall in both methods' final bins, indicating a concentration of scores in that interval. An ASCII histogram for equal-width bins might appear as:
[34,53):  **
[53,72):  **
[72,91]: ***
This visualization highlights the relative frequencies, emphasizing the higher-range density without requiring advanced tools.

Practical Implementation

Implementing data binning in practice involves a structured workflow to ensure reproducibility and effectiveness. Begin with loading the data into a suitable structure, such as a DataFrame in Python or a data frame in R. Next, determine the number of bins (k) using data-driven methods; for instance, Knuth's algorithm optimizes k by maximizing the histogram's likelihood, balancing under- and over-binning for unimodal or multimodal distributions. Label the bins descriptively, such as using interval notations like "low," "medium," or "high," and validate the binning by visualizing bin counts via histograms or bar plots to check for uneven coverage or outliers.

In Python, the pandas library provides efficient functions for binning. For equal-width binning, use pandas.cut(), which divides the range of the data into intervals of equal size. For equal-frequency binning, employ pandas.qcut(), which creates bins with approximately equal numbers of observations based on quantiles. Consider the following example with a sample array of 20 values ranging from 10 to 105:
python
import pandas as pd
import numpy as np

# Sample data
data = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
                 15, 25, 35, 45, 55, 65, 75, 85, 95, 105])
df = pd.DataFrame({'values': data})

# Equal-width binning with 5 bins
df['equal_width'] = pd.cut(df['values'], bins=5, labels=['Very Low', 'Low', 'Medium', 'High', 'Very High'])

# Equal-frequency binning with 5 quantiles
df['equal_freq'] = pd.qcut(df['values'], q=5, labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])

print(df)
This produces a DataFrame where the original values are augmented with bin labels. The output shows, for equal-width bins, intervals of width 19, such as roughly (10, 29] for 'Very Low', ensuring uniform spacing across the data range, while equal-frequency bins adjust boundaries so that each holds exactly 4 observations, such as roughly (10, 29] for 'Q1' (the two methods nearly coincide here because the sample values are almost uniformly spaced).

In R, the base cut() function handles equal-width binning by specifying breaks as a sequence from the minimum to the maximum value. For equal-frequency binning, first compute quantiles using the quantile() function to define breaks, then apply cut(). Using the same sample data:
r
data <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
          15, 25, 35, 45, 55, 65, 75, 85, 95, 105)
df <- data.frame(values = data)

k <- 5
breaks_width <- seq(min(df$values), max(df$values), length.out = k + 1)
df$equal_width <- cut(df$values, breaks = breaks_width, labels = c("Very Low", "Low", "Medium", "High", "Very High"))

# Equal-frequency binning with 5 quantiles
probs <- seq(0, 1, length.out = k + 1)
breaks_freq <- quantile(df$values, probs = probs)
df$equal_freq <- cut(df$values, breaks = breaks_freq, labels = c("Q1", "Q2", "Q3", "Q4", "Q5"), include.lowest = TRUE)

print(df)
The resulting data frame appends factor columns with the labels, where the equal-width breaks span 10 to 29 for the first bin, and the equal-frequency breaks ensure each bin captures about 20% of the observations.

For handling real datasets, consider the Iris dataset, a standard multivariate dataset with 150 samples of sepal and petal measurements across three species. In Python, load it via scikit-learn and bin the sepal length into 3 equal-width intervals:
python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target_names[iris.target]

# Bin sepal length into 3 equal-width bins
df['sepal_length_bin'] = pd.cut(df['sepal length (cm)'], bins=3, labels=['Short', 'Medium', 'Long'])

# Summarize bin counts by species
summary = df.groupby(['sepal_length_bin', 'species']).size().unstack(fill_value=0)
print(summary)
This yields a summary table showing how the bins align with species: short sepals (roughly 4.3–5.5 cm) fall mostly in setosa, medium sepals (roughly 5.5–6.7 cm) are spread across versicolor and virginica, and long sepals (roughly 6.7–7.9 cm) are predominantly virginica, aiding in species differentiation. Best practices include selecting bin counts based on data characteristics to avoid bias, such as using Knuth's method for non-uniform data. In pandas version 2.0 and later, leverage IntervalIndex for binned columns to enable advanced operations like overlap checks or merging intervals efficiently. Always include edge cases in validation plots to confirm bin integrity, and document bin definitions for reproducibility in production pipelines.
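As a sketch of these practices (Knuth's rule itself is not built into NumPy or pandas, so the built-in 'fd' and 'sturges' rules stand in here; astropy.stats.knuth_bin_width offers Knuth's method if it is required), data-driven edges and the IntervalIndex behind a binned column can be inspected as follows:

python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
values = rng.gamma(shape=2.0, scale=10.0, size=200)   # skewed sample data

# Data-driven edge selection with NumPy's built-in rules
edges_fd = np.histogram_bin_edges(values, bins="fd")          # Freedman-Diaconis
edges_sturges = np.histogram_bin_edges(values, bins="sturges")
print(len(edges_fd) - 1, "FD bins;", len(edges_sturges) - 1, "Sturges bins")

# pd.cut stores its bins as an IntervalIndex, which supports containment checks
binned = pd.cut(values, bins=edges_fd, include_lowest=True)
interval_index = binned.categories                    # an IntervalIndex
print(interval_index.contains(25.0))                  # which interval holds 25.0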

Advantages and Limitations

Benefits

Data binning reduces noise and smooths data by aggregating values into bins, which averages out minor variations and enhances robustness, particularly in noisy environments such as sensor data collection. For instance, in imaging, spatially varying binning techniques improve performance by adaptively grouping values to mitigate random fluctuations. Similarly, in sparse datasets, wider bins in low-density regions help counteract sampling noise, leading to more stable statistical estimates.

One key advantage of data binning is its enhancement of interpretability, as it transforms continuous variables into categorical or ordinal ones, facilitating easier human comprehension and reporting of complex relationships. This process serves as an interpretable tool for revealing nonlinear dependencies between predictors and outcomes, making models more transparent without sacrificing essential patterns. Semantic binning, in particular, leverages meaningful real-world distinctions to boost both interpretability and model performance in downstream tasks.

Binning also improves computational efficiency by reducing data dimensionality and cardinality, which accelerates algorithms that operate on categories, such as enabling constant-time lookups in histograms or decision trees compared to continuous computations. Adaptive binning methods further optimize this by achieving accurate results with fewer bins and lower processing times. Additionally, binning enables specialized analyses, including ordinal modeling, where binned features preserve order for downstream tasks, and privacy-preserving techniques like differential privacy, where binning aggregates data to bound individual contributions while releasing useful statistics.

Empirical evidence supports these benefits, with studies demonstrating that binning via techniques like weight-of-evidence improves predictive accuracy in credit scoring models, especially on skewed datasets, by better handling nonlinearities and outliers compared to untreated continuous predictors.

Challenges and Considerations

One major challenge in data binning is the inherent loss of information, as the process replaces precise continuous values with bin representatives, such as midpoints or boundaries, thereby discarding variation within each bin. This can introduce bias in statistical estimates, for instance, shifting the computed mean in sparsely populated bins where the representative value poorly approximates the underlying values. The extent of this loss can be measured using divergence metrics, such as the Kullback-Leibler divergence, which quantifies the difference between the original and binned probability distributions.

Selecting the optimal number of bins presents another difficulty, as under-binning smooths out important variations leading to underfitting, while over-binning amplifies noise and risks overfitting to outliers. Data-driven rules, such as the Freedman-Diaconis rule, address this by estimating bin width based on the interquartile range and sample size to balance bias and variance in density estimation. Boundary artifacts arise from arbitrary cutoffs, which can create artificial discontinuities or clusters near edges, exacerbating edge effects in finite datasets and distorting visualizations or analyses. Overlapping bins or adaptive partitioning techniques can mitigate these issues by smoothing transitions and reducing sensitivity to exact boundary placement.

In domain-specific contexts like finance, binning may conflict with regulatory demands for high precision in reporting, as aggregation can obscure granular details essential for compliance with standards requiring exact transactional or risk data. When such preservation is paramount, alternatives like kernel density estimation are often favored over binning, offering continuous approximations that avoid discrete artifacts while retaining more distributional fidelity.
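For illustration, the Freedman-Diaconis width and a rough, discretized Kullback-Leibler measure of the information lost by coarse binning can be computed as follows (the synthetic data and the fine-grid construction are assumptions made here, not a standard routine):

python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(1000)

# Freedman-Diaconis bin width: h = 2 * IQR * n^(-1/3)
q75, q25 = np.percentile(x, [75, 25])
h = 2 * (q75 - q25) / len(x) ** (1 / 3)
k = int(np.ceil((x.max() - x.min()) / h))
print("FD width:", round(h, 3), "->", k, "bins")

# Rough information-loss check: KL divergence between a fine-grained histogram
# density and a coarse 5-bin density, both evaluated on the fine grid
fine_edges = np.histogram_bin_edges(x, bins=200)
p, _ = np.histogram(x, bins=fine_edges, density=True)
coarse_density, coarse_edges = np.histogram(x, bins=5, density=True)
mids = (fine_edges[:-1] + fine_edges[1:]) / 2
idx = np.clip(np.digitize(mids, coarse_edges) - 1, 0, len(coarse_density) - 1)
q = coarse_density[idx]
w = np.diff(fine_edges)
mask = (p > 0) & (q > 0)
kl = np.sum(p[mask] * np.log(p[mask] / q[mask]) * w[mask])
print("KL divergence (nats):", round(kl, 4))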
