
ID3 algorithm

The ID3 (Iterative Dichotomiser 3) algorithm is a foundational method for inducing decision trees from a training dataset consisting of examples with discrete attributes and class labels, enabling the classification of objects based on their attribute values. Developed by J. Ross Quinlan at the University of Sydney, it was first presented in the mid-1970s and iteratively refined, with its core implementation detailed in a 1986 publication. ID3 operates using a top-down, greedy strategy to construct the tree: at each node, it evaluates all candidate attributes and selects the one that maximizes information gain, defined as the reduction in entropy (a measure of class impurity) after partitioning the examples on that attribute. Entropy for a set with p positive and n negative examples is calculated as I(p, n) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}, and the gain for an attribute A is \text{gain}(A) = I(p, n) - \sum_{v \in \text{values}(A)} \frac{p_v + n_v}{p + n} I(p_v, n_v), where the sum weights the entropy of each partition by its share of the examples. The process recurses on child nodes until all examples in a subset share the same class or no attributes remain, producing a tree that classifies new instances by traversing from root to leaf. Originally designed to handle large datasets, such as a chess-endgame collection of 1.4 million positions, ID3 demonstrated predictive accuracies up to 84% on unseen data in early applications, assuming error-free training examples and categorical attributes. It incorporates an iterative mechanism using random subsets (windows) of the training data to build and refine trees, aiding convergence on complex problems. However, the algorithm exhibits biases toward attributes with many distinct values, which can lead to overly specific trees, and it does not natively support continuous attributes, missing values, or noisy data without extensions. ID3's influence extends to modern methods, serving as the direct precursor to Quinlan's C4.5, which addressed these limitations through enhancements such as gain ratio and pruning.

Introduction

Overview

The ID3 (Iterative Dichotomiser 3) algorithm is a supervised learning method for constructing decision trees from labeled training data. Its core goal is to create a model that classifies instances based on their feature values, enabling accurate prediction of target class labels. ID3 adopts a greedy, top-down approach to build the tree by iteratively selecting attributes that best reduce classification uncertainty at each step. The output is a decision tree where internal nodes represent tests on attribute values, branches indicate the possible outcomes of those tests, and leaf nodes assign the predicted class labels. Designed primarily for discrete attributes and categorical target variables, ID3 excels with nominal data and uses metrics like entropy and information gain for attribute selection.

History

The ID3 algorithm, standing for Iterative Dichotomiser 3, was developed by J. Ross Quinlan as the third iteration in a series of his rule-induction systems, building on earlier work such as his 1969 problem-solving learning system. It emerged within the broader context of early machine learning research during the 1970s and 1980s, which focused on inductive learning and the construction of expert systems to automate knowledge acquisition from data. Quinlan first introduced ID3 in 1979 through a chapter on discovering rules by induction from large collections of examples, motivated by a challenging chess endgame task posed by Donald Michie. This initial version replaced the cost-driven lookahead of prior systems like Hunt's Concept Learning System (CLS) from 1966 with an information-driven approach to attribute selection. The algorithm gained formal structure and wider recognition with Quinlan's 1986 paper, "Induction of Decision Trees," published in the journal Machine Learning, which detailed ID3's methodology, including the use of information gain as the criterion for selecting attributes to split data. This publication marked a key milestone, as it synthesized decision tree induction techniques that had been applied in various systems and demonstrated ID3's effectiveness on practical tasks, influencing the development of subsequent machine learning algorithms. Quinlan provided a freely available implementation of ID3 as part of the 1986 system description, which facilitated its adoption in early machine learning toolkits and research environments. ID3's evolution continued with refinements documented in Quinlan's 1983 work on learning efficient classification procedures, particularly for chess endgames, leading to related systems like ACLS in 1983 and ASSISTANT in 1984. Ultimately, ID3 was succeeded by the C4.5 algorithm in 1993, which Quinlan developed to address limitations such as the handling of continuous attributes and missing values, further advancing decision tree methodologies.

Core Concepts

Entropy

In the ID3 algorithm, entropy serves as a fundamental measure of the impurity or uncertainty present in a dataset, quantifying the average amount of information required to predict the class of an instance within that set. This concept draws from Claude Shannon's information theory, where entropy represents the expected information content or surprise associated with a random variable's outcome. The mathematical formulation of entropy for a dataset S with c distinct classes is given by: H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i), where p_i is the proportion of instances in S belonging to class i. This formula reflects the principle that entropy is zero when the dataset is pure (all instances belong to the same class, so p_i = 1 for one i and 0 otherwise), and it reaches its maximum value when the classes are uniformly distributed, indicating maximum uncertainty. For instance, consider a dataset with 50% positive and 50% negative instances (p_1 = p_2 = 0.5); the entropy is H(S) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 bit, representing the highest possible uncertainty for two classes. Entropy exhibits key properties: it is always non-negative, bounded above by \log_2(c) for c classes, and decreases as the dataset becomes more homogeneous or pure. In ID3, entropy is computed for the current set of examples to assess its impurity before any splits are considered. This measure is later incorporated into the evaluation of information gain for attribute selection, though the focus here remains on entropy itself as a standalone impurity measure.
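As a concrete illustration of this formula, the short Python sketch below computes H(S) from a list of class labels; the function name and sample data are illustrative rather than part of the original algorithm description.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a collection of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    # Sum -p * log2(p) over the proportion p of each class.
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))                # 1.0, the maximum for two classes
print(round(entropy(["play"] * 9 + ["no play"] * 5), 3))  # 0.94, matching the weather example below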

Information Gain

Information gain is a key metric in the ID3 algorithm used to evaluate and select the best attribute for splitting a dataset at each node. It quantifies the reduction in entropy, or uncertainty, about the target class after partitioning the data based on the values of a given attribute. By choosing the attribute that maximizes information gain, ID3 aims to create splits that most effectively separate the classes, leading to more homogeneous subsets. The mathematical formulation of information gain for an attribute A over a dataset S is given by: \text{Gain}(A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} \cdot H(S_v). Here, H(S) represents the entropy of the original dataset S, \text{Values}(A) are the possible values of attribute A, and S_v is the subset of S where attribute A takes value v. This formula subtracts the weighted average entropy of the resulting subsets from the parent entropy to yield the gain. Entropy H serves as the baseline measure of impurity in the dataset prior to the split. To compute information gain, first calculate the entropy of the full dataset S. Then, for each attribute A, partition S into subsets S_v based on A's values, compute the entropy H(S_v) for each subset, and determine the weighted average of these entropies using the proportion of instances in each subset (|S_v|/|S|). Subtract this average from H(S) to obtain \text{Gain}(A). The attribute with the highest gain is selected for the split, and this process is repeated recursively for subsequent nodes. A representative example illustrates the computation using a weather dataset for predicting whether to play tennis, with 14 instances and attributes including outlook (sunny, overcast, rain). The dataset has 9 positive (play) and 5 negative (no play) outcomes, yielding an initial entropy H(S) \approx 0.940 bits. The class counts for each value of outlook are shown below.
Outlook      Play   No Play
Sunny        2      3
Overcast     4      0
Rain         3      2
For outlook, the subsets have entropies of approximately 0.971 bits (sunny), 0 bits (overcast), and 0.971 bits (rain). The weighted average is (5/14) \times 0.971 + (4/14) \times 0 + (5/14) \times 0.971 \approx 0.693 bits, so \text{Gain}(\text{Outlook}) \approx 0.940 - 0.693 = 0.247 bits. Comparable calculations for other attributes such as temperature yield lower gains (e.g., 0.029 bits), making outlook the preferred split as it provides the highest entropy reduction. Higher information gain indicates a more effective attribute for reducing uncertainty and improving class separation, with ID3 selecting the maximum-gain attribute at each node to build the tree greedily. This approach promotes simpler trees that generalize well, as demonstrated by predictive accuracies around 84% on unseen chess positions in early evaluations. However, the gain measure has a bias toward attributes with many distinct values, as they tend to produce more subsets and lower weighted entropies even if they are not highly predictive; this issue is mitigated in subsequent algorithms like C4.5 through normalized measures such as gain ratio.
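The same calculation can be scripted. The Python sketch below, with an entropy helper like the one above and an illustrative encoding of the 14 instances reduced to the outlook column, reproduces the 0.247-bit gain.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Gain(A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)."""
    base = entropy([r[target] for r in rows])
    weighted = 0.0
    for v in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == v]
        weighted += len(subset) / len(rows) * entropy(subset)
    return base - weighted

# Outlook counts from the table above: sunny 2/3, overcast 4/0, rain 3/2.
rows = ([{"outlook": "sunny", "play": "yes"}] * 2 + [{"outlook": "sunny", "play": "no"}] * 3 +
        [{"outlook": "overcast", "play": "yes"}] * 4 +
        [{"outlook": "rain", "play": "yes"}] * 3 + [{"outlook": "rain", "play": "no"}] * 2)
print(round(information_gain(rows, "outlook", "play"), 3))  # 0.247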

Algorithm Description

Step-by-Step Process

The ID3 algorithm constructs a decision tree from a given training set through a top-down, recursive process that begins at the root node with the entire dataset. The root node represents the full set of training instances, each described by a vector of attribute values and associated with a class label. This initialization sets the stage for partitioning the data based on selected attributes to progressively refine class predictions. The core procedure is recursive and operates on the current subset of instances at each node. First, the algorithm checks whether all instances in the subset belong to the same class; if so, it designates the node as a leaf and labels it with that class, terminating further branching since no additional discrimination is needed. Second, if the subset contains instances from multiple classes but no remaining attributes are available for splitting, the node becomes a leaf labeled with the most common class in the subset to provide the most probable prediction. Third, in cases where the subset is empty (for example, when an attribute value has no corresponding instances), the node is labeled with the most common class from the parent subset or marked as a failure leaf, depending on the variant. These base cases ensure that tree construction halts appropriately, preventing infinite recursion. If no base case applies, the algorithm proceeds by evaluating all available attributes and selecting the one that maximizes information gain, a measure of how effectively the attribute reduces entropy in the class distribution of the current subset. It then creates a child node for each distinct value of the selected attribute, partitioning the subset into corresponding sub-subsets based on those values. The recursive process is then applied independently to each sub-subset at these child nodes, continuing to build branches until base cases are met throughout the tree. This step-wise partitioning refines the decision boundaries, with the tree growing depth-first until all leaves are pure or no further splits are possible. ID3 employs a greedy strategy in attribute selection, choosing the locally optimal split at each node without backtracking or considering global optimality, which promotes efficiency but may lead to suboptimal trees in some scenarios. In the original formulation, instances with unknown values for the selected attribute can be handled by splitting their weight among the branches in proportion to the number of instances with each known value in the current subset (see the sketch after this paragraph). Later systems such as C4.5 and CART introduced additional methods, for example fractional instances and surrogate splits, for more robust handling. The process concludes when the entire tree is built, yielding a structure in which paths from root to leaves encode classification rules.
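The proportional weighting just described can be sketched in Python as follows; the function names and dictionary-based bookkeeping are illustrative conveniences, not Quinlan's exact procedure.
from collections import Counter

def value_weights(examples, attribute):
    """Fraction of examples with a known value taking each value of the attribute."""
    known = [e[attribute] for e in examples if e.get(attribute) is not None]
    counts = Counter(known)
    return {v: c / len(known) for v, c in counts.items()}

def distribute_missing(instance_weight, weights):
    """Split an instance's weight across branches in proportion to observed value frequencies."""
    return {v: instance_weight * w for v, w in weights.items()}

# With 5 sunny, 4 overcast and 5 rain examples, a weight-1.0 instance whose outlook
# is unknown contributes 5/14, 4/14 and 5/14 to the respective branches.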

Pseudocode

The ID3 algorithm is formally described through pseudocode that captures its recursive, top-down construction of decision trees from a set of training examples. This representation assumes categorical (discrete) attributes and includes no discretization step, so continuous attributes must be preprocessed into discrete bins before application. The core procedure, originally presented in a Lisp-like notation in Quinlan's 1986 paper, takes as input a set of examples, the available attributes, and the target classification attribute; it returns a decision tree node. Modern renderings of this pseudocode are commonly expressed in languages such as Python or Java for educational and practical purposes. To support attribute selection, helper functions compute entropy and information gain, where entropy measures the impurity of a set of examples with respect to the target attribute, and information gain quantifies the reduction in entropy from partitioning on a given attribute.
function Entropy(examples, target_attribute):
    if len(examples) == 0:
        return 0
    proportions = compute frequency proportions of each target value in examples
    return -∑ (p * log₂(p)) for each proportion p in proportions
function InformationGain(examples, attribute, target_attribute):
    base_entropy = Entropy(examples, target_attribute)
    values = unique values of attribute in examples
    weighted_entropy = 0
    for each value v in values:
        sub_examples = {e in examples | e[attribute] == v}
        weight = len(sub_examples) / len(examples)
        weighted_entropy += weight * Entropy(sub_examples, target_attribute)
    return base_entropy - weighted_entropy
function ID3(examples, attributes, target_attribute):
    create root_node
    if len(examples) == 0:
        label root_node with most common target value in parent context
        return root_node
    if all examples share the same target value v:
        label root_node as leaf with v
        return root_node
    if attributes is empty:
        label root_node as leaf with most common target value in examples
        return root_node
    gains = {}
    for each attr in attributes:
        gains[attr] = InformationGain(examples, attr, target_attribute)
    best_attribute = argmax over attr in attributes of gains[attr]
    label root_node with best_attribute
    remaining_attributes = attributes \ {best_attribute}
    for each value v in unique values of best_attribute:
        sub_examples = {e in examples | e[best_attribute] == v}
        if len(sub_examples) == 0:
            create leaf child_node labeled with most common target value in examples
        else:
            child_node = ID3(sub_examples, remaining_attributes, target_attribute)
        attach child_node to root_node as branch for value v
    return root_node
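A compact, runnable Python rendering of this pseudocode is sketched below; names such as id3 and the nested-dictionary tree representation are implementation choices made for the example, not part of Quinlan's original description.
from collections import Counter
from math import log2

def entropy(examples, target):
    """Shannon entropy (bits) of the target attribute over a list of example dicts."""
    total = len(examples)
    if total == 0:
        return 0.0
    counts = Counter(e[target] for e in examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    """Reduction in entropy from partitioning the examples on the given attribute."""
    remainder = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == v]
        remainder += len(subset) / len(examples) * entropy(subset, target)
    return entropy(examples, target) - remainder

def majority_class(examples, target):
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def id3(examples, attributes, target):
    """Return a bare class label for a leaf, or
    {"attribute": name, "branches": {value: subtree, ...}} for an internal node."""
    classes = {e[target] for e in examples}
    if len(classes) == 1:                  # base case: pure subset
        return classes.pop()
    if not attributes:                     # base case: no attributes left
        return majority_class(examples, target)
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    # Branches are created only for values present in the current subset, so the
    # empty-subset base case of the pseudocode does not arise in this sketch.
    for v in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == v]
        branches[v] = id3(subset, remaining, target)
    return {"attribute": best, "branches": branches}

def classify(tree, instance):
    """Follow attribute tests from the root until a class label (leaf) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["attribute"]]]
    return tree
For example, calling id3(rows, ["outlook", "temperature", "humidity", "wind"], "play") on the full weather dataset would yield a tree rooted at outlook, and classify(tree, instance) traverses it to a predicted label.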

Properties and Analysis

Key Properties

The ID3 algorithm is inherently greedy: it selects the attribute that maximizes information gain at each node to make locally optimal splitting decisions, which promotes efficient tree construction but can result in trees that are suboptimal with respect to global structure. This greedy nature stems from its top-down, recursive approach, in which the choice at one level does not consider downstream impacts, allowing rapid development of decision rules without an exhaustive search over all possible trees. A prominent strength of ID3 lies in its interpretability, producing decision trees that are visually simple and easily comprehensible, enabling users to trace the logical paths from root to leaves and understand the hierarchical decision process. This readability facilitates applications where explaining the model's reasoning is crucial, such as domains requiring transparent rule extraction from data. In terms of efficiency, ID3 exhibits a time complexity of O(n \cdot m^2), where n is the number of training instances and m is the number of attributes, arising from the repeated computation of information gain across attributes at each of up to m levels in the tree; this renders it well suited to small and medium datasets but less scalable for very large ones. ID3 demonstrates a bias toward multi-valued attributes, as the gain metric inherently favors those with more distinct outcomes because of the way entropy is partitioned across branches, which may lead to the selection of attributes that split the data finely but risk overfitting if the additional values do not correlate strongly with the target class. Unlike later extensions, the original ID3 performs no post-pruning after tree construction but incorporates a pre-pruning mechanism using a chi-square test to halt splitting a node when there is insufficient evidence of dependence between the attribute and the class, resulting in a tree that fits the training data while attempting to avoid unnecessary complexity, though this can still lead to overfitting on noisy data. ID3 is specifically designed to process categorical attributes and discrete target classes, creating branches for each possible value of the selected attribute; it cannot directly handle continuous features, necessitating discretization techniques like binning in contemporary implementations to adapt it for mixed data types.
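As an illustration of the chi-square stopping criterion described above, the sketch below tests whether a candidate attribute and the class appear statistically dependent before allowing a split; the use of scipy and the significance threshold are assumptions for this example, not Quinlan's exact formulation.
from collections import Counter
from scipy.stats import chi2_contingency  # assumption: scipy is available

def split_is_significant(examples, attribute, target, alpha=0.01):
    """Pre-pruning check: allow a split only if a chi-square test rejects
    independence between the attribute's values and the class labels."""
    values = sorted({e[attribute] for e in examples})
    classes = sorted({e[target] for e in examples})
    if len(values) < 2 or len(classes) < 2:
        return False  # nothing to test: a single attribute value or a pure node
    counts = Counter((e[attribute], e[target]) for e in examples)
    table = [[counts[(v, c)] for c in classes] for v in values]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value < alpha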

Limitations

The ID3 algorithm is unable to directly handle continuous attributes, necessitating their discretization into discrete bins prior to tree construction, a process that can introduce bias by arbitrarily defining thresholds and result in information loss from the original data distribution. This limitation stems from the algorithm's design for nominal attributes only, as outlined in its foundational description. ID3 exhibits sensitivity to noisy data, lacking mechanisms for outlier detection or noise reduction beyond its chi-square pre-pruning, which often leads to overly complex trees that overfit the training set and perform poorly on unseen examples. The attribute selection process in ID3 favors attributes with a larger number of distinct values, as they tend to yield higher information gain regardless of their actual relevance to the target class, introducing a bias that can result in suboptimal feature choices. This issue arises from the reliance on raw information gain without normalization. In terms of computational efficiency, ID3 incurs a time complexity of O(n \cdot m^2) in the worst case, where m is the number of attributes and n is the number of training examples, which becomes prohibitive for large datasets because of repeated entropy calculations across subsets at each node; with extensive attribute sets and deep trees, scaling can be even worse in practice for exhaustive evaluations. Although the original 1986 formulation includes support for missing values through probabilistic fractional assignment (distributing instances proportionally across branches based on available data), the algorithm does not natively handle them in all basic implementations, which often rely on workarounds such as complete-case deletion that can reduce dataset size and introduce further bias. Finally, ID3 is designed exclusively for classification tasks with discrete target labels and cannot be applied to regression problems, where continuous outputs are predicted. Its greedy selection strategy, while efficient, may also trap the algorithm in local optima, yielding trees that are not globally optimal.

Applications and Extensions

Practical Usage

The ID3 algorithm finds practical application in domains requiring interpretable classification models from categorical data, such as medical diagnosis, where it aids in predicting diseases based on symptoms, for example in breast tumor detection from ultrasonic images. In finance, it supports credit scoring by classifying applicants' risk profiles using attributes such as income and credit history, leveraging its foundational role in decision tree methods. In game analysis, ID3 excels in tasks like chess position evaluation, processing large datasets of board attributes to classify positions with high accuracy on unseen examples. Prior to applying ID3, datasets must be prepared to align with its requirement for categorical attributes; continuous features are converted to categorical form via binning or threshold-based discretization, such as sorting values and selecting midpoints between adjacent pairs as split points. Since ID3 assumes no missing values, common preprocessing steps include removing affected instances or imputing them using methods such as the majority class or attribute mode, based on class distributions in the dataset. Implementations of ID3 are available in several tools, including the Id3 classifier in Weka, which handles nominal attributes without missing values. In Python's scikit-learn library, the DecisionTreeClassifier with the criterion='entropy' parameter approximates ID3's information gain for building trees; a short example follows below. R users can access ID3 through Weka wrappers such as RWeka, or through conditional inference trees in the party package that extend its principles. A classic example is Quinlan's PlayTennis dataset, which uses attributes like outlook (sunny, overcast, rainy), temperature, humidity, and wind to classify whether to play tennis; ID3 selects "outlook" as the root node due to its highest information gain, resulting in a tree that accurately predicts decisions for the 14 instances. Best practices for ID3 emphasize its suitability for datasets where interpretability and simple splits are prioritized, in order to prevent excessive tree complexity. To mitigate overfitting, integrate k-fold cross-validation during evaluation, partitioning data into folds for repeated training and testing to estimate generalization performance reliably. ID3 remains a staple in machine learning courses for its simplicity in illustrating entropy and decision tree induction concepts. It indirectly influences ensemble methods like random forests, where multiple decision trees inspired by ID3's structure aggregate predictions for robust classification. As of 2025, ID3 sees continued use in educational tools for interactive decision tree simulations, and recent research applications include sports performance management, path planning, and prediction systems.
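For the scikit-learn route mentioned above, a minimal sketch using a hypothetical subset of PlayTennis-style rows ordinal-encodes the categorical attributes and fits DecisionTreeClassifier with criterion='entropy'; note that scikit-learn grows binary CART-style trees, so this only approximates ID3's multi-way splits.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# A few PlayTennis-style rows: [outlook, temperature, humidity, wind] -> play?
X_raw = [
    ["sunny", "hot", "high", "weak"],
    ["sunny", "hot", "high", "strong"],
    ["overcast", "hot", "high", "weak"],
    ["rain", "mild", "high", "weak"],
    ["rain", "cool", "normal", "weak"],
    ["rain", "cool", "normal", "strong"],
]
y = ["no", "no", "yes", "yes", "yes", "no"]

encoder = OrdinalEncoder()            # map category strings to integer codes
X = encoder.fit_transform(X_raw)

# criterion="entropy" uses information gain, mirroring ID3's split measure.
clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(clf.predict(encoder.transform([["overcast", "mild", "high", "weak"]])))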
The ID3 algorithm has significantly influenced the development of subsequent decision tree methods, particularly through its emphasis on information-theoretic measures for attribute selection. One direct successor is the C4.5 algorithm, developed by J. Ross Quinlan in 1993 as an extension to address ID3's limitations with real-valued attributes and incomplete data. C4.5 introduces discretization thresholds for continuous features, probabilistic handling of missing values by distributing instances proportionally across branches, and post-pruning techniques to reduce overfitting by replacing subtrees with leaves based on error-rate estimates. Instead of ID3's information gain, C4.5 employs gain ratio, which normalizes gain by the intrinsic information of the attribute split to mitigate bias toward features with many values. This algorithm was later commercialized by Quinlan as C5.0 in 1998, incorporating further optimizations such as boosting for improved accuracy and rule-based representations alongside trees, and it remains available through RuleQuest Research.

Another influential contemporary is the CART (Classification and Regression Trees) algorithm, introduced by Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone in 1984. Unlike ID3's multi-way splits aligned with attribute values, CART enforces binary splits at each node, enabling both classification and regression tasks by predicting class labels or continuous outcomes, respectively. It uses Gini impurity as the default splitting criterion for classification, which measures the probability of misclassifying a randomly chosen instance, though it can incorporate information gain for consistency with ID3-like approaches. CART also includes cost-complexity pruning to balance tree depth against predictive error on validation data. A notable variant in the ID3 lineage is CHAID (Chi-squared Automatic Interaction Detection), proposed by Gordon V. Kass in 1980, which adapts the tree-induction idea for categorical data using statistical testing. CHAID employs the chi-squared test to evaluate multi-way splits, merging categories with non-significant differences (p > 0.05) to form statistically homogeneous groups, and it supports both classification and interaction detection without requiring entropy-based measures. This focus on significance testing makes CHAID particularly suited to exploratory analysis in the social sciences and market research. ID3's information gain concept has broadly inspired impurity measures in later algorithms, serving as a foundational criterion for quantifying attribute usefulness in decision trees. This influence extends to ensemble methods, where ID3-style trees form the building blocks; for instance, random forests, developed by Breiman in 2001, aggregate multiple CART-like trees trained on bootstrapped samples with random feature subsets to reduce variance and improve generalization. Similarly, gradient boosting frameworks, such as those outlined by Jerome H. Friedman in 2001, iteratively fit shallow decision trees (often binary, like CART's) to residuals, leveraging ID3-style partitioning logic to minimize loss functions in sequential ensembles. These advancements highlight ID3's role in enabling scalable, high-performance tree-based learning. In modern contexts, ID3-derived methods continue to evolve within automated machine learning (AutoML) frameworks. For example, H2O.ai's AutoML platform, updated through 2024, integrates tree-based learners such as gradient boosting machines and distributed random forests, rooted in ID3's principles of recursive partitioning, for automated model selection and hyperparameter tuning on large-scale datasets.
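To make the gain-ratio normalization used by C4.5 (described above) concrete, the short Python sketch below divides information gain by the split information of the attribute; the helper names are illustrative and this is not RuleQuest's implementation.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain_ratio(rows, attribute, target):
    """C4.5-style gain ratio: information gain divided by split information."""
    base = entropy([r[target] for r in rows])
    weighted, split_info = 0.0, 0.0
    for v in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == v]
        p = len(subset) / len(rows)
        weighted += p * entropy(subset)
        split_info -= p * log2(p)   # intrinsic information of the split itself
    gain = base - weighted
    return gain / split_info if split_info > 0 else 0.0
On the weather example above, the outlook gain of roughly 0.247 bits is divided by a split information of roughly 1.577 bits, giving a gain ratio of about 0.156, which dampens the advantage that many-valued attributes enjoy under raw information gain.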
