
CatBoost

CatBoost is an open-source software library that implements gradient boosting on decision trees, with native support for categorical features without requiring preprocessing. Developed by researchers and engineers at Yandex, it was first introduced in 2017 as a successor to Yandex's MatrixNet system. The library is designed for high-performance tasks such as classification, regression, and ranking, and is available in multiple programming languages including Python, R, Java, and C++. CatBoost's key innovations include ordered boosting, which mitigates target leakage and prediction shift by processing training examples in a randomized order during each iteration, and ordered target statistics for handling categorical features to reduce overfitting. These techniques enable the library to achieve high accuracy with default parameters, often outperforming competitors like XGBoost and LightGBM in benchmarks on datasets with categorical variables. Additionally, it supports fast GPU training with multi-card configurations and provides tools for model interpretation and feature importance analysis. Widely adopted in industry and research, CatBoost is used by Yandex for applications in search engines, recommendation systems, autonomous vehicles, and weather forecasting, as well as by other organizations such as CERN and Cloudflare for diverse machine learning needs. Its open-source nature under the Apache 2.0 license has facilitated extensive community contributions and integrations with popular frameworks such as scikit-learn. As of 2025, recent developments include version 1.2.8, with enhanced scalability through support for distributed learning and advanced sampling methods such as minimal variance sampling.

Overview

Definition and Purpose

CatBoost, short for Categorical Boosting, is an open-source library developed by Yandex that implements gradient boosting on decision trees, with a particular optimization for handling categorical features natively without requiring manual preprocessing. This library enables the construction of high-performance models for tasks such as classification, regression, and ranking, leveraging decision trees as base learners in an ensemble framework. The primary purpose of CatBoost is to deliver superior predictive accuracy on structured or tabular datasets, while minimizing overfitting and enhancing ease of use relative to traditional gradient boosting implementations such as XGBoost or LightGBM. By addressing common challenges in tabular machine learning, such as the effective incorporation of categorical variables and the prevention of prediction shift, CatBoost allows data scientists to build robust models quickly using default parameters, making it suitable for real-world applications in search engines, recommendation systems, and weather forecasting.

At a high level, CatBoost's algorithm integrates iterative gradient boosting with symmetric (oblivious) tree construction, employing random permutations of the training data during each iteration to enable ordered learning and thereby avoid target leakage—a common issue in boosting algorithms that can lead to overly optimistic performance estimates. This design ensures unbiased gradient estimates and supports seamless handling of mixed data types, contributing to its efficiency on heterogeneous datasets. CatBoost was first introduced in 2017 by Yandex as an accessible tool for data scientists to develop accurate models with minimal tuning, serving as the successor to Yandex's internal MatrixNet and marking a shift toward open-source accessibility for advanced boosting techniques.

Core Components

In the Python implementation, CatBoost utilizes a core data structure called the Pool to represent datasets, which efficiently stores feature matrices, target values, and sample weights while supporting both dense and sparse data formats for versatile data handling. This structure allows seamless integration with common input sources such as NumPy arrays, pandas DataFrames, or external files, enabling robust preparation for training on diverse datasets, including those with categorical features specified via column indices or names. The booster object serves as the central class for model training in the Python API, with implementations like CatBoostClassifier for classification tasks and CatBoostRegressor for regression, encapsulating the ensemble of decision trees and overseeing the iterative training of the model. These classes provide intuitive methods for fitting models to Pool objects and generating predictions, while maintaining compatibility with scikit-learn's estimator interface for streamlined workflows.

Model serialization in CatBoost supports saving trained boosters to binary files in the .cbm format for compact, efficient storage and rapid loading, or to JSON for interpretable, platform-agnostic deployment across languages. The corresponding load_model method reconstructs the model from these files, preserving its full functionality without requiring access to the original training data.

CatBoost's hyperparameters are pivotal in tuning model complexity and convergence. The iterations parameter defines the total number of trees in the ensemble, directly impacting training duration and the model's expressive power. Depth controls the maximum depth of each decision tree, preventing excessive complexity that could lead to overfitting while allowing sufficient depth for capturing feature interactions. Learning_rate scales the contribution of each tree to the ensemble, facilitating finer adjustments that reduce variance and enhance generalization across iterations.
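
The following minimal Python sketch, using a tiny synthetic dataset and illustrative file names, shows these pieces together: a Pool with one categorical column, a CatBoostClassifier configured with iterations, depth, and learning_rate, and saving/loading in the .cbm and JSON formats.

```python
# Minimal sketch of the core Python API: building a Pool, training a
# CatBoostClassifier, and saving/loading the model. The synthetic data
# and file names here are illustrative only.
from catboost import CatBoostClassifier, Pool

# Two numeric features plus one categorical feature (column index 2).
X = [[1.0, 5.2, "red"],
     [0.3, 2.1, "blue"],
     [2.4, 7.7, "red"],
     [1.1, 3.3, "green"]]
y = [1, 0, 1, 0]

train_pool = Pool(data=X, label=y, cat_features=[2])

model = CatBoostClassifier(
    iterations=100,      # number of trees in the ensemble
    depth=4,             # maximum depth of each (symmetric) tree
    learning_rate=0.1,   # shrinkage applied to every tree's contribution
    verbose=False,
)
model.fit(train_pool)

# Binary .cbm format for compact storage, JSON for an inspectable export.
model.save_model("model.cbm")
model.save_model("model.json", format="json")

restored = CatBoostClassifier()
restored.load_model("model.cbm")
print(restored.predict([[0.5, 4.0, "blue"]]))
```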

Technical Background

Gradient Boosting Fundamentals

Gradient boosting is a machine learning ensemble method that constructs a strong predictive model by sequentially adding weak learners, most commonly shallow decision trees, to minimize a differentiable loss function through a process analogous to gradient descent in function space. Unlike bagging or random forests, which build models in parallel, gradient boosting fits each new learner to correct the errors of the previous ensemble by approximating the negative gradient of the loss function evaluated at the current predictions. This forward stagewise additive modeling approach allows for flexible handling of various tasks, including regression and classification, while leveraging the interpretability of trees.

The mathematical foundation of gradient boosting relies on building an additive expansion of the model. Starting with an initial guess F_0(x), typically a constant, the ensemble after m iterations is given by F_m(x) = F_{m-1}(x) + \nu h_m(x), where h_m(x) is the m-th weak learner fitted to the pseudo-residuals (negative gradients) r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F = F_{m-1}} for each training example i, and \nu \in (0, 1] is the learning rate or shrinkage parameter that scales the contribution of each new tree to control the learning pace. The optimal h_m is found by minimizing a squared-error fit to the pseudo-residuals, enabling the use of regression trees even for non-regression losses. This process continues until a specified number of iterations or stopping criteria are met, yielding a final predictor F_M(x).

Loss functions in gradient boosting define the optimization objective and determine the pseudo-residuals. For regression, the mean squared error (MSE) is a standard choice, expressed per example as L(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2, where the factor of \frac{1}{2} simplifies the negative gradient to the residual y - \hat{y}. In binary classification, the log loss (binomial deviance) L(y, \hat{y}) = -y \log p - (1-y) \log(1-p), with p = \sigma(F(x)) and \sigma the sigmoid function, produces pseudo-residuals that adjust the log-odds, facilitating probabilistic outputs via the logistic transformation. These losses must be smooth and differentiable to compute gradients effectively, though the framework generalizes to other convex criteria like the Poisson deviance for count data.

To mitigate overfitting, which can arise from the sequential nature of boosting leading to complex ensembles, standard techniques include shrinkage via the learning rate \nu, which reduces the step size and necessitates more trees for an equivalent fit but improves generalization by smoothing the model. Additionally, subsampling introduces stochasticity by fitting each tree to a random subset of the training data (typically 50-100% of samples), akin to bagging, which decreases variance, decorrelates trees, and accelerates convergence while further preventing overfitting to noise. These regularization strategies balance bias and variance without altering the core gradient-fitting mechanism.
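
As a concrete illustration of this loop (not CatBoost's own implementation), the sketch below fits shallow scikit-learn regression trees to the residuals of a squared-error loss and applies the shrinkage factor \nu at each update; the synthetic data and hyperparameters are arbitrary.

```python
# Generic gradient boosting loop for squared-error loss, shown only to
# illustrate pseudo-residual fitting and shrinkage. This is not CatBoost's
# implementation; it uses scikit-learn trees as the weak learners.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

nu = 0.1                      # learning rate (shrinkage)
n_trees = 200
F = np.full(500, y.mean())    # F_0: constant initial prediction
trees = []

for m in range(n_trees):
    # For L = 1/2 (y - F)^2 the negative gradient is simply the residual.
    residuals = y - F
    h = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    F = F + nu * h.predict(X)  # F_m = F_{m-1} + nu * h_m
    trees.append(h)

def predict(X_new):
    pred = np.full(len(X_new), y.mean())
    for h in trees:
        pred += nu * h.predict(X_new)
    return pred

print("train MSE:", np.mean((y - F) ** 2))
```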

Decision Trees in Boosting

In boosting frameworks, decision trees function as weak learners, each fitted to the negative gradients (pseudo-residuals) of the loss function from the ensemble built in prior iterations, thereby sequentially correcting errors to minimize the overall loss. This approach leverages the trees' ability to capture non-linear patterns and interactions in data, while their shallow depth ensures they remain simple, underfitting models suitable for boosting. Decision trees in this context are binary tree structures, where each non-leaf node defines a split on a single feature using a threshold value, directing instances to one of two child nodes based on whether the feature value meets or exceeds the threshold. Splits are selected to reduce impurity, measured by the Gini index for classification (which quantifies class mixture within a node) or by variance for regression (which minimizes the spread of target values). Leaf nodes assign constant values, typically the mean of the pseudo-residuals for regression or class probabilities for classification, derived from gradient fitting as detailed in the gradient boosting fundamentals.

Tree construction proceeds via a greedy algorithm that recursively evaluates potential splits across all features and thresholds, selecting the one that maximizes gain—the improvement in loss reduction post-split—to build the tree level by level until stopping criteria are met. For second-order gradient boosting methods, the gain at a node is computed as \text{Gain} = \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}, where G_L and G_R denote the summed first-order gradients (negative partial derivatives of the loss) over instances in the left and right child nodes, respectively; H_L and H_R are the corresponding summed second-order gradients (Hessians); and \lambda is a regularization term penalizing complex splits. This formulation approximates the loss reduction, enabling efficient split finding while incorporating regularization to favor simpler trees.

To manage model complexity and avoid overfitting, the maximum depth parameter limits the number of splits from root to leaf, typically keeping trees shallow (e.g., depth 3–6) with a corresponding cap on the number of leaves. Subsampling enhances ensemble diversity and generalization: for each tree, a random subset of features may be considered for splitting (e.g., via column subsampling), and training rows may be randomly sampled, akin to bagging, to introduce variability across trees.
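
A small, self-contained sketch of the gain formula above follows; the gradient and Hessian values and the split mask are illustrative inputs rather than output from any particular library.

```python
# Illustrative computation of the second-order split gain formula above,
# given per-instance gradients and Hessians and a boolean mask that sends
# each instance to the left child. Variable names are illustrative.
import numpy as np

def split_gain(grad, hess, goes_left, lam=1.0):
    G_L, H_L = grad[goes_left].sum(), hess[goes_left].sum()
    G_R, H_R = grad[~goes_left].sum(), hess[~goes_left].sum()
    return (G_L**2 / (H_L + lam)
            + G_R**2 / (H_R + lam)
            - (G_L + G_R)**2 / (H_L + H_R + lam))

# Example: log-loss gradients (p - y) and Hessians p(1 - p) for labels
# [1, 0, 1, 0] at predicted probabilities [0.6, 0.3, 0.8, 0.5],
# evaluating a split on "feature <= 0.5".
grad = np.array([-0.4, 0.3, -0.2, 0.5])
hess = np.array([0.24, 0.21, 0.16, 0.25])
feature = np.array([0.1, 0.7, 0.3, 0.9])
print(split_gain(grad, hess, feature <= 0.5))
```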

Algorithm Innovations

Ordered Target Statistics for Categoricals

Traditional methods for handling categorical features in gradient boosting models, such as one-hot encoding, often result in high-dimensional sparse representations that increase computational costs and can lead to information loss, particularly for features with high cardinality. CatBoost addresses these challenges through ordered boosting, which employs random permutations of the training data across boosting iterations to compute feature statistics using only preceding examples in the permutation order, thereby preventing target leakage, where future target values influence current predictions. The core mechanism is the calculation of ordered target statistics for each categorical value c, defined as \text{statistic}_c = \frac{\sum y_i + \text{prior}}{\sum w_i + 1}, where the sums are over prior examples with the same categorical value, y_i are the target values, w_i are sample weights (defaulting to 1), and the prior is a smoothing term, typically set to the dataset's mean target value multiplied by a small constant, that stabilizes estimates for rare categories. This randomized ordering ensures unbiased statistics by simulating a temporal sequence without actual time dependencies.

To manage high-cardinality categoricals, CatBoost uses parameters such as a border count for binning statistic values into numerical ranges during quantization, reducing dimensionality while preserving ordering, and supports combinations of categorical features up to a specified maximum (e.g., pairs or triples) to capture feature interactions without explicit encoding. These are configurable via options such as ctr_border_count for split borders and max_ctr_complexity for interaction depth, enhancing model expressiveness on datasets with complex categorical relationships.
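
The following Python sketch mirrors the idea behind this formula for a single categorical column: targets are revealed one at a time along a random permutation, so each example's statistic depends only on earlier examples with the same category. It is an illustration of the concept, not CatBoost's internal CTR code.

```python
# Sketch of ordered target statistics for one categorical column, following
# the formula above: each example's statistic uses only examples that come
# earlier in a random permutation and share its category.
import numpy as np

def ordered_target_statistics(categories, targets, prior, seed=0):
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(targets))
    sums, counts = {}, {}
    stats = np.empty(len(targets))
    for pos in perm:                              # walk the random permutation
        c = categories[pos]
        s, n = sums.get(c, 0.0), counts.get(c, 0.0)
        stats[pos] = (s + prior) / (n + 1.0)      # uses only preceding examples
        sums[c] = s + targets[pos]                # then reveal this target
        counts[c] = n + 1.0
    return stats

cats = ["red", "blue", "red", "red", "blue"]
y = np.array([1, 0, 1, 0, 1], dtype=float)
print(ordered_target_statistics(cats, y, prior=y.mean()))
```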

Symmetric Tree Structures

In CatBoost, symmetric tree structures represent a departure from traditional greedy decision trees, which grow asymmetrically by sequentially selecting the best split for individual nodes. Instead, symmetric trees construct each level of the tree simultaneously, applying the identical splitting condition—consisting of a single feature and threshold—to all nodes at that depth. This uniform approach optimizes the overall gain across the entire level rather than per node, resulting in balanced trees with exactly 2^d leaves for a tree of depth d.

The construction process begins at the root level and proceeds depth-wise: for each level, CatBoost evaluates potential splits across features and thresholds, selecting the one that maximizes the aggregate improvement in the objective function when applied universally to all current leaves from the previous level. Leaves are then partitioned left or right based on this shared condition, ensuring symmetry throughout the tree. This method contrasts with asymmetric growth by reducing the computational overhead of repeated split searches, as only one optimal split per level is computed. The depth parameter controls the number of such levels, limiting complexity while maintaining expressiveness.

Symmetric trees offer significant advantages in efficiency and scalability, particularly enabling high parallelism during both training and inference. By standardizing splits per level, they facilitate vectorized operations and GPU acceleration, where histogram-based computations can be parallelized without conflicts, achieving up to 10 times faster prediction speeds compared to non-symmetric trees. This structure also supports distributed training across multiple GPUs, scaling nearly linearly with hardware resources. In practice, symmetric growth is activated via the grow_policy='SymmetricTree' parameter in the CatBoost API and serves as the default mode for most applications due to its balance of speed and quality. These symmetric trees form the foundation for CatBoost's oblivious decision trees, which extend the concept by precomputing splits for reuse across multiple trees.
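
A brief example of selecting this policy through the Python package is given below; 'Depthwise' and 'Lossguide' are the alternative growing policies exposed by the same grow_policy parameter, and the toy data is illustrative.

```python
# Selecting the tree-growing policy. 'SymmetricTree' is the default;
# 'Depthwise' and 'Lossguide' are the alternatives.
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    iterations=200,
    depth=6,                      # 2**6 = 64 leaves per symmetric tree
    grow_policy="SymmetricTree",  # one shared split per tree level
    verbose=False,
)

# Illustrative toy data: a simple linear target over two features.
X = [[i, i % 3] for i in range(100)]
y = [row[0] * 0.5 + row[1] for row in X]
model.fit(X, y)
print(model.tree_count_)          # number of trees actually built
```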

Oblivious Decision Trees

Oblivious decision trees, also known as decision tables, form the core structure of base predictors in CatBoost, where the same splitting rule—consisting of a selected feature and threshold—is applied uniformly to all nodes at a given tree level. This design ensures a deterministic structure that is balanced and less susceptible to overfitting compared to traditional decision trees, as it avoids individualized splits per node. By fixing the split condition across an entire level, the tree paths for instances with similar feature values become identical early in the traversal, enabling efficient, vectorized evaluation.

The building process for these trees follows a top-down, level-wise approach: at each level, candidate splits are evaluated across all potential features and thresholds to identify the one that maximizes the gain in the objective function, such as the reduction in loss, and this optimal split is then applied uniformly to all nodes at that level before recursing to the next. This contrasts with non-oblivious trees, where splits are chosen independently for each node, potentially leading to more varied and computationally intensive structures; in oblivious trees, the uniform selection fixes branching paths more rigidly from the outset. Once the tree reaches the specified depth, leaf values are computed by averaging the gradients of the examples assigned to each leaf, incorporating random permutations to mitigate prediction shift.

Prediction with oblivious decision trees is highly efficient due to their symmetric nature, which allows for vectorized computations across batches of instances, as the identical splits per level permit leaf lookup without per-sample branching logic. This structure keeps the computational cost of prediction near-linear in the number of instances and trees, making it particularly suitable for large-scale deployments. In CatBoost, this growth policy is enabled by default via the grow_policy parameter set to 'SymmetricTree', which enforces the level-wise uniform splits characteristic of oblivious trees.
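
The sketch below is a toy illustration rather than CatBoost code, but it shows why this structure scores quickly: with one shared (feature, threshold) pair per level, an instance's leaf index is simply the integer formed by its d split outcomes, so a whole batch can be routed with vectorized comparisons and a single table lookup.

```python
# Toy illustration of oblivious-tree prediction: each level contributes one
# bit, so the leaf index is computed with vectorized comparisons over the
# whole batch and then used to index a table of 2**d leaf values.
import numpy as np

def predict_oblivious(X, split_features, split_thresholds, leaf_values):
    d = len(split_features)                      # tree depth
    leaf_index = np.zeros(len(X), dtype=np.int64)
    for level in range(d):
        f, t = split_features[level], split_thresholds[level]
        bit = (X[:, f] > t).astype(np.int64)     # same test for every node
        leaf_index = (leaf_index << 1) | bit
    return leaf_values[leaf_index]               # one lookup per instance

X = np.array([[0.2, 3.0], [1.5, 0.5], [2.0, 4.0]])
leaf_values = np.arange(4, dtype=float)          # depth-2 tree -> 4 leaves
print(predict_oblivious(X, [0, 1], [1.0, 2.0], leaf_values))
```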

Implementation Details

Training Process

The training process in CatBoost begins with data preparation and model initialization. Users load the training dataset, along with optional validation data, into a specialized Pool object that handles features, targets, weights, and categorical indicators efficiently. The booster is then initialized via the CatBoost class (or equivalents in other interfaces, such as R or the command line), specifying hyperparameters such as the number of iterations (trees), learning rate, tree depth, and handling of categorical features. This setup ensures the model is configured for the specific task, such as regression or classification, before entering the boosting loop.

The core of training occurs in an iterative boosting loop, where a sequence of decision trees is constructed sequentially to minimize the loss function. For each boosting round, CatBoost computes first-order gradients and, optionally, second-order derivatives (Hessians) based on the current ensemble's predictions and the target values. A new oblivious tree is then built using these gradients as pseudo-residuals, incorporating ordered target statistics to handle categorical features without leakage. The tree is fitted to the data, and the ensemble is updated by adding the new tree's contributions scaled by the learning rate, gradually reducing the overall loss. This process repeats for the specified number of iterations or until stopping criteria are met, with each tree focusing on correcting errors from the previous ensemble.

To prevent overfitting, CatBoost implements early stopping during training when a validation set is provided. The model evaluates a specified metric, such as RMSE for regression, on the validation set after each iteration. If the metric does not improve for the number of iterations specified by early_stopping_rounds (which must be explicitly set, as it defaults to disabled), training halts, and the best iteration's model is retained based on the optimal validation score. This mechanism balances model complexity and generalization without requiring manual intervention.

CatBoost also supports cross-validation to assess model performance robustly during the workflow. The built-in cv function (in Python) or equivalent functions in other interfaces perform k-fold cross-validation by partitioning the training data into k folds, training a model on k-1 folds, and evaluating on the held-out fold, aggregating results across all folds to compute the mean and standard deviation of metrics like Logloss or accuracy. This integrated approach allows for hyperparameter tuning and model assessment without separate validation splits.
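
The following sketch, using synthetic data and illustrative settings, walks through this workflow in the Python package: fitting against a validation Pool with early_stopping_rounds and use_best_model, then running the built-in cv function over the same parameters.

```python
# Sketch of training with a validation Pool, early stopping, and built-in
# cross-validation. Synthetic data; metric and fold choices are illustrative.
import numpy as np
from catboost import CatBoostRegressor, Pool, cv

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = X[:, 0] * 2.0 - X[:, 1] + rng.normal(scale=0.1, size=1000)

train_pool = Pool(X[:800], y[:800])
valid_pool = Pool(X[800:], y[800:])

model = CatBoostRegressor(iterations=1000, learning_rate=0.1,
                          loss_function="RMSE", verbose=False)
model.fit(train_pool,
          eval_set=valid_pool,
          early_stopping_rounds=50,   # stop if RMSE stalls for 50 rounds
          use_best_model=True)        # keep the best-iteration ensemble
print("trees kept:", model.tree_count_)

# 5-fold cross-validation over the same data and parameters.
cv_results = cv(pool=Pool(X, y),
                params={"iterations": 300, "learning_rate": 0.1,
                        "loss_function": "RMSE",
                        "logging_level": "Silent"},
                fold_count=5)
print(cv_results[["iterations", "test-RMSE-mean", "test-RMSE-std"]].tail(1))
```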

Supported Loss Functions and Objectives

CatBoost provides a range of built-in loss functions and objectives tailored for regression and classification tasks, enabling optimization for various target types and problem structures. For regression problems, CatBoost supports the root mean squared error (RMSE) objective, defined as \sqrt{\frac{\sum_{i=1}^N (a_i - t_i)^2 w_i}{\sum_{i=1}^N w_i}}, where a_i is the prediction, t_i the target, and w_i the weight for the i-th object; this measures the magnitude of errors in a squared form, making it suitable for datasets where larger errors should be penalized more heavily, such as in financial forecasting. The mean absolute error (MAE) objective, given by \frac{\sum_{i=1}^N w_i |a_i - t_i|}{\sum_{i=1}^N w_i}, computes the absolute differences between predictions and targets, offering robustness to outliers and applicability in scenarios like demand forecasting where median-based accuracy is preferred. The Quantile objective minimizes the quantile loss \frac{\sum_{i=1}^N (\alpha - I(t_i \leq a_i))(t_i - a_i) w_i}{\sum_{i=1}^N w_i}, with \alpha specifying the desired quantile (default 0.5); it is ideal for estimating conditional quantiles, such as in risk assessment or prediction intervals. Additionally, the Poisson objective, \frac{\sum_{i=1}^N w_i (e^{a_i} - a_i t_i)}{\sum_{i=1}^N w_i}, models count distributions assuming a Poisson process, commonly used in applications such as modeling event frequencies.

In classification tasks, the Logloss objective for binary problems is -\frac{\sum_{i=1}^N w_i [c_i \log(p_i) + (1 - c_i) \log(1 - p_i)]}{\sum_{i=1}^N w_i}, where c_i is the true label and p_i the predicted probability; this penalizes confident wrong predictions and is widely applied in decision-making tasks such as fraud detection. For multiclass classification, the MultiClass objective uses the softmax-based loss \frac{\sum_{i=1}^N w_i \log\left(\frac{e^{a_{i t_i}}}{\sum_{j=0}^{M-1} e^{a_{i j}}}\right)}{\sum_{i=1}^N w_i}, with t_i as the true class index and M classes; it optimizes probability distributions over multiple categories, suitable for tasks like image recognition or customer segmentation. CatBoost allows user-defined objectives through Python callbacks, where developers implement custom gradient and Hessian computations to tailor the loss to specific needs, such as domain-specific penalties in healthcare analytics. Evaluation metrics in CatBoost are distinct from optimization losses and serve for model monitoring during training; examples include the area under the ROC curve (AUC), which assesses ranking quality across thresholds for binary tasks, and the F1 score, which balances precision and recall for imbalanced datasets, as in fraud detection.
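
The snippet below shows how built-in objectives are selected by name (including parameterized forms such as Quantile with a chosen alpha) and sketches a user-defined log-loss-style objective following the calc_ders_range convention described in CatBoost's documentation for custom objectives; the custom class is an illustrative pattern, not production code.

```python
# Selecting built-in objectives by name, plus a sketch of a custom objective
# that returns per-example first and second derivatives via calc_ders_range.
import math
from catboost import CatBoostClassifier, CatBoostRegressor

# Built-in objectives; parameterized ones take a suffix.
quantile_model = CatBoostRegressor(loss_function="Quantile:alpha=0.9")
poisson_model = CatBoostRegressor(loss_function="Poisson")
clf = CatBoostClassifier(loss_function="Logloss", eval_metric="AUC")

class CustomLogloss:
    """Illustrative log-loss derivatives in the custom-objective format."""
    def calc_ders_range(self, approxes, targets, weights):
        result = []
        for i in range(len(targets)):
            p = 1.0 / (1.0 + math.exp(-approxes[i]))
            der1 = targets[i] - p          # first derivative
            der2 = -p * (1.0 - p)          # second derivative
            if weights is not None:
                der1 *= weights[i]
                der2 *= weights[i]
            result.append((der1, der2))
        return result

custom_model = CatBoostClassifier(loss_function=CustomLogloss(),
                                  eval_metric="Logloss")
```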

Performance Characteristics

Computational Efficiency

CatBoost achieves significant computational efficiency through optimizations tailored for both CPU and GPU environments. On GPUs, training leverages CUDA for accelerated computation, providing up to 40 times speedup compared to CPU training on large datasets; on one benchmark dataset with 36 million samples, a single V100 GPU reduced training time from 1060 seconds on CPU to 69.8 seconds. This GPU mode benefits from symmetric tree structures, which facilitate parallel construction of tree levels, enabling efficient scaling across multiple GPUs—for instance, eight GPUs can outperform hundreds of CPU cores. The GPU implementation requires CUDA-enabled hardware, with support for devices such as the V100, P100, and GTX 1080Ti.

Memory usage in CatBoost is optimized via the Pool data structure, which efficiently stores training data by supporting dense numerical arrays in float32 format for minimal overhead and sparse matrices through formats like compressed sparse row (CSR), significantly reducing the footprint for datasets with many zeros or missing values. Feature discretization into bins—defaulting to 255 borders—further compresses memory requirements, while bit-compressed perfect hashes for categorical features are held in CPU memory and streamed to the GPU as needed, keeping memory usage comparable to or better than competitors like XGBoost and LightGBM.

Parallelization is supported through multi-threading on CPUs, controlled by the thread_count parameter, which accelerates both tree construction and prediction by distributing computations across multiple cores, achieving significant speedups in scoring. On GPUs, feature-parallel learning across multiple devices enhances this, with non-deterministic floating-point operations handled to preserve model quality. CatBoost scales effectively to datasets with millions of rows, as demonstrated by its performance on the 36-million-sample benchmark above and on the Epsilon dataset with 400,000 samples and 2,000 features. Per-iteration time complexity for tree building approximates O(depth × features × samples / batch_size), benefiting from histogram-based approximations that reduce computation from O(s n²) to O(s n), where s represents the number of splits and n the sample size.
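
A brief configuration sketch follows, showing GPU training with multi-device selection alongside explicit CPU thread control; it assumes a CUDA-enabled CatBoost build and appropriate drivers, and the device indices and counts are illustrative.

```python
# Illustrative configuration for GPU training and CPU multi-threading.
# Fitting the GPU model requires a CUDA-enabled build and compatible GPUs.
from catboost import CatBoostClassifier

gpu_model = CatBoostClassifier(
    iterations=500,
    task_type="GPU",       # train with CUDA kernels
    devices="0:1",         # use GPU devices 0 through 1
    border_count=255,      # number of borders for feature discretization
    verbose=False,
)

cpu_model = CatBoostClassifier(
    iterations=500,
    task_type="CPU",
    thread_count=16,       # parallel tree construction across CPU cores
    verbose=False,
)
```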

Accuracy Benchmarks

CatBoost demonstrates strong predictive performance on standard UCI benchmark datasets, particularly those involving categorical features. On the Adult dataset, a binary classification task predicting whether an individual's income exceeds $50K, CatBoost achieves an accuracy of approximately 0.87 in baseline configurations. This performance is often competitive with or better than XGBoost and LightGBM in accuracy and F1-score on datasets with categorical variables, attributed to its native handling of categorical features without preprocessing. Similarly, on the Higgs dataset, a large-scale binary classification problem from particle physics with over 10 million samples, CatBoost attains an AUC of approximately 0.85, competitive with alternatives such as LightGBM in scenarios with high-cardinality features.

CatBoost's resistance to overfitting is a key strength, largely due to its ordered target statistics mechanism, which prevents leakage by computing statistics only from preceding examples in random permutations during training. Empirical evaluations show lower validation errors compared to standard gradient boosting methods, as evidenced by smoother learning curves in which training and validation losses converge more closely without divergence. This robustness reduces the need for extensive hyperparameter tuning to control overfitting, leading to more stable generalization across iterations.

Ablation studies highlight the impact of CatBoost's categorical handling on error rates. Disabling ordered target statistics and reverting to one-hot encoding increases log loss by up to 5-10% on datasets with mixed feature types, such as the Adult dataset, due to heightened sensitivity to noise and leakage. Conversely, enabling native categorical support lowers classification error rates relative to XGBoost's manual encoding approaches, demonstrating the feature's contribution to overall accuracy.

In recent benchmarks, CatBoost remains competitive on tabular tasks within frameworks like MLPerf-inspired evaluations. A 2024 comprehensive study across diverse tabular datasets, including UCI and real-world sets, positions CatBoost as a top performer in accuracy for classification and regression, often tying or exceeding deep learning models on non-image data while maintaining efficiency. As of 2025, CatBoost continues to be a strong performer in tabular data benchmarks, including comparisons with foundation models like TabPFN, where it excels in efficiency for medium-sized datasets.

Development History

Origins and Releases

CatBoost was developed by researchers and engineers at Yandex, a leading technology company, during 2016 and 2017 as a successor to the company's internal MatrixNet algorithm, specifically to tackle challenges in handling categorical features prevalent in search engines, recommendation systems, and other applications such as weather forecasting and ranking tasks. The library emerged from Yandex's need for a high-performance gradient boosting framework that could process categorical data natively without extensive preprocessing, reducing the prediction shift and target leakage issues common in traditional boosting methods.

The initial open-source release of CatBoost occurred on July 18, 2017, with version 0.1, made available under the Apache 2.0 license to encourage widespread adoption and contributions from the community. This marked a significant milestone, transitioning the technology from Yandex's proprietary tools to a freely accessible library, initially supporting Python, C++, and command-line interfaces for tasks including classification, regression, and ranking.

Subsequent updates rapidly expanded CatBoost's capabilities. In November 2017, version 0.3 introduced GPU acceleration, enabling up to 30 times faster training on supported hardware for large datasets while maintaining accuracy. By late 2018, release 0.9 enhanced model interpretability through deeper integration with SHAP (SHapley Additive exPlanations) for feature importance analysis, alongside support for feature combinations and improved handling of text data. The library reached version 1.0 in October 2021, stabilizing the core API for production use, adding robust error handling, and optimizing cross-platform compatibility across Python, R, Java, and C++. Further refinements came with the 1.2 series, starting in May 2023, which switched to a CMake-based build system, added support for Python 3.11 and new loss functions such as Focal loss on CPU, and improved GPU features for multi-class objectives. As of November 2025, the latest stable release is version 1.2.8 from April 2025, incorporating Python 3.13 compatibility, NumPy 2.x support, and fixes for GPU custom metrics and specific hardware architectures. CatBoost remains primarily maintained by the Yandex team, with ongoing development hosted on GitHub, where the open-source community contributes through pull requests, bug reports, and feature suggestions, fostering iterative improvements without any major forks.

Key Contributors and Evolution

CatBoost was primarily developed by a team at Yandex, with key contributions from Anna Veronika Dorogush, Vasily Ershov, and Andrey Gulin, who presented the initial framework at the Workshop on Machine Learning Systems at NeurIPS 2017. Their work focused on addressing challenges in handling categorical features within gradient boosting, laying the foundation for the library's core innovations. Subsequent advancements involved additional researchers, including Liudmila Prokhorenkova, Gleb Gusev, and Aleksandr Vorobev, who refined the algorithms in follow-up publications.

The library builds directly on Jerome H. Friedman's seminal 2001 introduction of gradient boosting machines, which established the paradigm of iteratively adding weak learners to minimize a loss function through functional gradient descent. It also draws from Prokhorenkova and colleagues' 2018 research on ordered boosting techniques and ordered target statistics for categorical features, which mitigate prediction shift and target leakage to improve model unbiasedness and generalization. These influences enabled CatBoost to extend gradient boosting's robustness while introducing specialized handling for real-world datasets with mixed feature types.

Over its development, CatBoost transitioned from an initial CPU-focused implementation to a versatile multi-platform library, with native support for Python and C++ from its 2017 release and later expansions to R, Java, and other languages through bindings and wrappers. GPU acceleration was integrated early to enhance training speed on large datasets, as noted in the foundational workshop paper. Updates have incorporated built-in tools for feature importance computation, partial dependence plots via plot_predictions, and integration with the SHAP framework for explaining individual predictions. By 2025, the open-source CatBoost repository on GitHub had amassed thousands of stars, underscoring its widespread adoption in the machine learning community. This growth has spurred extensions, such as integrations with federated learning frameworks like Flower, allowing privacy-preserving training across distributed datasets without centralizing data.

Practical Applications

Use Cases in Industry

CatBoost has found significant adoption in the financial sector, particularly for fraud detection in banking transactions. Its native support for categorical features enables efficient processing of heterogeneous data such as transaction types, merchant categories, and user profiles without extensive preprocessing, allowing banks to identify fraudulent activities more accurately than traditional methods. For instance, in credit card fraud detection models, CatBoost has demonstrated superior precision and F1 scores compared to baselines like XGBoost and LightGBM, achieving results of up to 93% on large-scale datasets. This capability has been applied in real-world financial systems to reduce false positives and enhance detection rates for the imbalanced datasets typical in fraud scenarios.

In e-commerce, CatBoost powers recommendation systems by handling user interaction categories, such as browsing history and product attributes, to improve item ranking and personalization. At Yandex, it serves as a gradient boosted trees ranker in recommendation surfaces for retargeting and personalization, where integrating similarity-based features from transformer models boosts offline evaluation metrics like nDCG. This approach facilitates quick adaptation to diverse categorical inputs, enabling platforms to deliver more relevant suggestions and increase user engagement without manual feature engineering.

Healthcare applications leverage CatBoost for predictive modeling on tabular electronic health record (EHR) data, focusing on outcomes like patient risk stratification and length-of-stay prediction. Its ability to manage categorical variables such as diagnosis codes and treatment histories allows for robust models that outperform other algorithms in ICU settings, with F1 scores reaching 89.2% for identifying high-risk patients. In type 2 diabetes care, CatBoost has been used to forecast treatment responses, aiding clinicians in personalized care planning through interpretable predictions on structured EHR inputs.

Beyond these domains, CatBoost supports ad tech applications in click-through rate (CTR) prediction, where it processes categorical ad features like targeting segments to optimize bidding and placement strategies. Studies highlight its effectiveness in enhancing CTR models for online advertising, balancing precision and recall in high-volume, imbalanced environments. In telecommunications, it excels in churn analysis by analyzing customer categorical data such as plan types and usage patterns, enabling operators to predict attrition and deploy retention interventions.

Integration with Other Tools

CatBoost provides seamless integration with the Python machine learning ecosystem through its Python package, which is designed to be compatible with scikit-learn. The CatBoostClassifier and CatBoostRegressor classes implement scikit-learn's estimator interface, enabling their use in standard scikit-learn workflows such as pipelines, cross-validation, and hyperparameter tuning with tools like GridSearchCV. The library extends support to other programming languages via official bindings. In R, the catboost package allows training and prediction with gradient boosted decision trees, offering equivalent functionality to the Python interface. For Java, the CatBoost package includes native libraries for model application, supporting integration in JVM-based environments. At its core, CatBoost is implemented in C++, providing a high-performance foundation that can be directly utilized or extended for custom applications.

For deployment, CatBoost models can be exported to the ONNX format, facilitating interoperability with other frameworks and runtime environments. This export capability supports containerized deployments using Docker, which is commonly used for scaling on cloud platforms such as AWS SageMaker and Azure Machine Learning. On AWS, CatBoost is natively available as a built-in algorithm in SageMaker, allowing training and hosting within Docker-based containers. Similarly, Azure Machine Learning integrates CatBoost through its AutoML runtime, enabling automated model selection and deployment in containerized setups.

CatBoost includes built-in visualization tools for model interpretation, such as plotting decision trees with plot_tree and generating feature importance charts via get_feature_importance. Additionally, it natively computes SHAP values through the get_feature_importance method with the 'ShapValues' type, ensuring compatibility with the SHAP library for advanced explainability visualizations like summary plots and dependence charts.
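
The sketch below, on synthetic data and with placeholder file names, touches the three integration points discussed here: scikit-learn's GridSearchCV over a CatBoostClassifier, ONNX export via save_model, and SHAP values obtained from get_feature_importance.

```python
# Sketch of three integration points: scikit-learn tooling, ONNX export,
# and SHAP values via get_feature_importance. Synthetic data; treat the
# file name as a placeholder.
import numpy as np
from sklearn.model_selection import GridSearchCV
from catboost import CatBoostClassifier, Pool

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# scikit-learn compatibility: the estimator works inside GridSearchCV.
search = GridSearchCV(
    CatBoostClassifier(iterations=100, verbose=False),
    param_grid={"depth": [4, 6], "learning_rate": [0.05, 0.1]},
    cv=3,
)
search.fit(X, y)
best = search.best_estimator_

# ONNX export for deployment in other runtimes (numeric-only features here).
best.save_model("model.onnx", format="onnx")

# Per-instance SHAP values: one column per feature plus the expected value.
shap_values = best.get_feature_importance(Pool(X, y), type="ShapValues")
print(shap_values.shape)   # (n_samples, n_features + 1)
```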
