Isolation forest
The Isolation Forest is an unsupervised anomaly detection algorithm that identifies outliers in data by explicitly isolating them through random partitioning, rather than profiling normal instances.[1] Developed by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou in 2008, it constructs an ensemble of isolation trees (iTrees), where each tree recursively splits a random subsample of the data using randomly selected features and split values until instances are isolated.[2] Anomalies are detected based on their shorter average path lengths in these trees, as they require fewer splits to isolate due to their distinctiveness from the majority of normal points.[1]
This approach leverages the principle that anomalies are "few and different," enabling efficient detection without assuming a data distribution or computing distance measures.[1] The algorithm has linear time complexity for training (O(t ψ log ψ), where t is the number of trees and ψ is the subsample size) and a per-instance evaluation cost that is independent of the training set size, making it scalable to large, high-dimensional datasets where irrelevant features are common.[2] By default, it uses 100 trees and subsamples of 256 instances; subsampling mitigates the "swamping" effect (normal points being wrongly flagged because they lie close to anomalies) and the "masking" effect (clustered anomalies concealing one another) by isolating points within small, sparse samples.[1] Anomaly scores are computed as s(x, n) = 2^{-E(h(x))/c(n)}, where E(h(x)) is the average path length for an instance x and c(n) normalizes it by the expected path length for n instances; scores approaching 1 signal anomalies, scores well below 0.5 indicate normal points, and scores near 0.5 across the whole sample suggest no distinct anomalies.[1]
Isolation Forest has demonstrated superior performance over methods like Local Outlier Factor (LOF) and One-Class Support Vector Machines (OC-SVM) in terms of area under the ROC curve (AUC) and execution speed on benchmark datasets, particularly large or high-dimensional ones with many irrelevant attributes.[2] It is particularly useful in applications such as fraud detection, network intrusion monitoring, and system fault diagnosis, where training data may lack labeled anomalies.[1] Extensions, such as the Extended Isolation Forest, have further improved the consistency of its anomaly scores by replacing axis-parallel splits with randomly oriented hyperplanes.[3]
Introduction and History
History
The Isolation Forest algorithm was introduced in 2008 by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou as an unsupervised method for anomaly detection, leveraging random partitioning to isolate outliers more efficiently than traditional density-based or distance-based approaches.[2] The core idea, detailed in their seminal paper presented at the Eighth IEEE International Conference on Data Mining, emphasized the algorithm's linear time complexity and scalability for high-dimensional data without requiring distance computations.[1]
In 2010, Liu, Ting, and Zhou proposed SCiForest as an evolution of Isolation Forest tailored for detecting clustered anomalies in high-dimensional spaces, replacing axis-parallel splits with hyperplanes defined over small random subsets of attributes and choosing among candidate hyperplanes with a dispersion-based split-selection criterion, while preserving computational efficiency. This variant addressed limitations in handling local, clustered anomalies that the purely random splits of the original algorithm can miss amid dense normal clusters.[4]
A key extension for dynamic environments came in 2013 with the development of iForestASD by Z. Ding and M. Fei, adapting Isolation Forest for streaming data through a sliding window framework that accommodates concept drift by periodically rebuilding trees to reflect evolving data distributions.[5] This approach maintained the algorithm's isolation principle while enabling real-time processing and adaptation to non-stationary streams, such as sensor data or network traffic.
Post-2015, Isolation Forest saw widespread industry adoption due to its robustness and ease of implementation, with notable integrations including its addition to the scikit-learn library in version 0.18 (released in 2016), facilitating broader use in machine learning pipelines for fraud detection, cybersecurity, and quality control.[6] By 2018, enhancements in version 0.20 further stabilized its behavior for production environments.[7]
As of 2025, recent advancements have focused on hybrid models that combine Isolation Forest with deep learning techniques for enhanced real-time anomaly detection, such as integrating it with long short-term memory (LSTM) networks for sequence modeling in ransomware detection via network traffic analysis. These hybrids leverage Isolation Forest's efficiency for initial isolation alongside neural architectures for capturing temporal dependencies, achieving superior accuracy in dynamic settings without excessive computational overhead.[8]
Overview
Isolation Forest is an unsupervised anomaly detection algorithm designed to identify outliers in datasets by exploiting the principle that anomalies are rare and distinct from the majority of data points. Rather than profiling normal instances, as in many traditional methods, it explicitly isolates anomalies through random partitioning of the data space. This approach was first introduced in a 2008 paper by Liu, Ting, and Zhou.[1]
The core intuition of Isolation Forest relies on the observation that anomalies, being "few and different," require fewer partitions to be separated from the rest of the data compared to normal points, which tend to cluster and thus take longer to isolate.[1] By constructing an ensemble of isolation trees—each a decision tree built via random splits on features and values—the algorithm aggregates path lengths from the root to leaf nodes across trees to determine anomaly likelihood, providing robustness through this collective decision-making process.[1]
As an unsupervised method, Isolation Forest requires no labeled examples of anomalies during training, distinguishing it from supervised techniques like Support Vector Machines (SVM), which depend on labeled data for classification.[1] Thanks to subsampling, training runs in O(t \psi \log \psi) time, where t is the number of trees and \psi is the subsample size, with memory bounded by the subsamples rather than the full dataset, enabling efficient processing of large-scale datasets.[1]
Isolation Forest excels in high-dimensional spaces and in settings where the training data themselves contain anomalies or many irrelevant attributes, as it makes no assumptions about data distributions—unlike density-based methods such as Local Outlier Factor (LOF)—allowing it to handle complex, real-world data without preprocessing for normality.[1] This scalability and distribution-agnostic nature contribute to its superior performance in terms of both accuracy and speed over baselines like one-class SVM and random forests in empirical evaluations.[1]
Algorithm Fundamentals
Isolation Tree Construction
The isolation tree, a core component of the Isolation Forest algorithm, is constructed as a binary tree through a process of random recursive partitioning of a data subsample. This method leverages randomness to isolate data points efficiently, assuming that anomalies are more susceptible to isolation due to their distinct characteristics. The construction begins with a randomly selected subsample of size \psi from the original dataset, typically set to 256 or 512 points to balance computational efficiency and randomness.[1]
At each internal node, the partitioning step involves randomly selecting a feature q from the set of available attributes in the current subsample X. A split value p is then chosen uniformly at random from the observed range of q in X, specifically p \in [\min_q, \max_q], where \min_q and \max_q are the minimum and maximum values of q in the subsample. This randomization in both feature selection and split value ensures that the tree structure avoids overfitting to specific patterns in the data, promoting generalization across the dataset. The subsample X is subsequently partitioned into two child subsets: the left subset X_l containing points where q < p, and the right subset X_r containing points where q \geq p.[1]
The partitioning process is recursive, with each node splitting its subsample until one of the stopping conditions is met: the subsample size |X| reduces to 1 (achieving isolation), all points in X share identical attribute values (indicating no further meaningful split), or the tree height reaches the predefined limit l = \lceil \log_2 \psi \rceil. This height limit prevents excessive depth and maintains computational tractability, as deeper trees would require more resources without proportional benefits in isolation. External nodes, which are leaves, store the size of the subsample that reached them, aiding in subsequent path length computations, though the focus here remains on the growth mechanism.[1]
The algorithm for constructing an isolation tree can be formalized as follows in pseudocode:
iTree(X, e, l)
  Input: subsample X, current height e, height limit l
  Output: isolation tree structure
  if |X| ≤ 1 or e ≥ l then
    return external node with size |X|
  else
    randomly select feature q ∈ Q (attributes in X)
    randomly select split value p ∈ [min_q, max_q]
    partition X into X_l (q < p) and X_r (q ≥ p)
    left child = iTree(X_l, e+1, l)
    right child = iTree(X_r, e+1, l)
    return internal node (q, p, left child, right child)
This recursive procedure ensures that each isolation tree is built independently and efficiently, with an average construction time complexity of O(\psi \log \psi) due to the balanced expected height from random splits.[1]
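For concreteness, the following Python sketch mirrors the pseudocode above. It is an illustrative, non-optimized rendering; the class and function names (such as build_itree) are chosen for this example rather than taken from any library, and the subsample is assumed to be a 2-D NumPy array.

```python
import numpy as np

class ExternalNode:
    """Leaf of an isolation tree; stores how many points reached it."""
    def __init__(self, size):
        self.size = size

class InternalNode:
    """Internal node; stores split attribute q, split value p, and children."""
    def __init__(self, q, p, left, right):
        self.q, self.p, self.left, self.right = q, p, left, right

def build_itree(X, e, limit, rng):
    """Recursively build an isolation tree on subsample X.

    e is the current depth and limit = ceil(log2(psi)) is the height limit.
    """
    if e >= limit or X.shape[0] <= 1:
        return ExternalNode(size=X.shape[0])
    q = rng.integers(X.shape[1])                 # random attribute
    lo, hi = X[:, q].min(), X[:, q].max()
    if lo == hi:                                 # all values identical: no meaningful split
        return ExternalNode(size=X.shape[0])
    p = rng.uniform(lo, hi)                      # random split value in [min_q, max_q]
    mask = X[:, q] < p
    return InternalNode(q, p,
                        build_itree(X[mask], e + 1, limit, rng),
                        build_itree(X[~mask], e + 1, limit, rng))
```

The extra check for identical attribute values implements the second stopping condition described above, which the compact pseudocode leaves implicit.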
Isolation Forest Ensemble
The Isolation Forest algorithm constructs an ensemble of t isolation trees, denoted as iTrees, to enhance the robustness of anomaly isolation through collective decision-making.[1] Each iTree in the ensemble is built independently using a random subsample of size \psi drawn without replacement from the original dataset, with a default subsample size of \psi = 256 to balance computational efficiency and detection performance.[1][9] This random subsampling introduces diversity among the trees by exposing each iTree to a small, distinct subset of the data, thereby mitigating the swamping effect (normal points being wrongly identified as anomalies when they lie close to true anomalies) and the masking effect (clustered anomalies concealing one another from isolation).[9]
The training of the Isolation Forest ensemble involves generating these t subsamples and constructing an iTree on each via recursive random partitioning of the feature space, typically until individual instances are isolated or a predefined tree height is reached.[1] Since the construction of each iTree is independent, the ensemble can be built in parallel across multiple processors, achieving a time complexity of O(t \psi \log \psi) and enabling efficient scaling to large datasets.[1] In the ensemble, each iTree isolates data points independently based on its random partitions, and the results are aggregated to form a unified model that reduces the variance inherent in any single tree's isolation behavior.[1]
The hyperparameter t, representing the number of trees, plays a critical role in the ensemble's stability; a default value of t = 100 is commonly used, as performance metrics stabilize well before this point, with increasing t progressively reducing variance in the isolation outcomes across trees.[9] This variance reduction arises from the averaging effect of the ensemble, where diverse subsamples ensure that the collective isolation paths capture a more representative view of the data's structure without overfitting to noise in individual subsamples.[9]
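Building on the build_itree sketch in the previous subsection, the ensemble step amounts to drawing t independent subsamples of size ψ and growing one tree on each. The sketch below is illustrative only; build_iforest is a name invented for this example.

```python
import math
import numpy as np

def build_iforest(X, t=100, psi=256, seed=0):
    """Train an isolation forest: t trees, each grown on a subsample of size psi."""
    rng = np.random.default_rng(seed)
    psi = min(psi, X.shape[0])
    limit = math.ceil(math.log2(psi))            # height limit l = ceil(log2(psi))
    forest = []
    for _ in range(t):
        # Subsample without replacement; each tree sees a different subset of the data.
        idx = rng.choice(X.shape[0], size=psi, replace=False)
        forest.append(build_itree(X[idx], 0, limit, rng))
    return forest
```

Because the trees are independent, the loop body could equally be dispatched to parallel workers without changing the result, which is where the O(t \psi \log \psi) training cost and easy parallelization come from.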
Anomaly Detection Process
The anomaly detection process in Isolation Forest evaluates test instances against a pre-trained ensemble of isolation trees to identify outliers based on their susceptibility to isolation. For a new data point x, the inference begins by passing it through each of the t trees in the forest, typically with t = 100 for stable results.[1]
In each tree, x is traversed from the root node to an external node, recording the path length h(x) as the number of edges encountered during this descent plus c(s) if the external node size s > 1, where c(s) is the average path length for s instances. The average path length across all trees, denoted E(h(x)), then serves as the primary isolation measure for x.[9]
Anomalies are determined through thresholding: points with shorter average path lengths—indicating easier isolation—are classified as anomalies, as they deviate from the denser clusters of normal data that require longer paths to separate. This approach leverages the fact that outliers are fewer and more distinct, allowing them to be isolated with fewer splits.[1]
Isolation Forest primarily operates in batch detection mode, where a collection of test points is scored against the fixed ensemble; the cost per point depends only on the number of trees and their height limit, not on the size of the training data. Extensions, such as streaming and hybrid models, enable online detection by periodically rebuilding or incrementally updating the forest as new data arrive.[9][10]
Post-processing typically involves ranking instances by their average path lengths in ascending order, facilitating the identification and prioritization of the most easily isolated points as top anomalies. This ranking supports downstream tasks like visualization or alerting without altering the core detection logic.[9]
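The inference step can be sketched as follows, reusing the node classes from the construction sketch above; the c(·) adjustment for unresolved external nodes is the normalizer defined formally in the next section, and all names here are illustrative.

```python
import math

def c_factor(m):
    """Average path length of an unsuccessful BST search over m points
    (piecewise definition given in the scoring section)."""
    if m > 2:
        return 2.0 * (math.log(m - 1) + 0.5772156649) - 2.0 * (m - 1) / m
    return 1.0 if m == 2 else 0.0

def path_length(x, node, e=0):
    """h(x): edges from the root to the terminating external node, plus
    c(size) when that external node still holds more than one point."""
    if isinstance(node, ExternalNode):
        return e + c_factor(node.size)
    if x[node.q] < node.p:
        return path_length(x, node.left, e + 1)
    return path_length(x, node.right, e + 1)

def average_path_length(x, forest):
    """E[h(x)]: mean path length over all trees; shorter means easier to isolate."""
    return sum(path_length(x, tree) for tree in forest) / len(forest)
```

Ranking test points by average_path_length in ascending order then surfaces the most easily isolated points first, as described above.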
Anomaly Scoring and Properties
Anomaly Score Calculation
In the Isolation Forest algorithm, the path length h(x) for an instance x in a single isolation tree is defined as the number of edges traversed from the root node to the external (leaf) node where the isolation of x terminates.[9] This measure captures how quickly an instance can be isolated by random splits, with shorter paths indicating instances that are easier to separate from the majority of the data.[9]
To normalize path lengths across trees and datasets, the average path length c(\psi) is used, where \psi is the subsample size employed in tree construction. The formula is given by
c(\psi) =
\begin{cases}
2H(\psi - 1) - \frac{2(\psi - 1)}{\psi} & \text{for } \psi > 2, \\
1 & \text{for } \psi = 2, \\
0 & \text{otherwise},
\end{cases}
with H(i) denoting the i-th harmonic number, approximated as H(i) \approx \ln(i) + \gamma (where \gamma \approx 0.5772156649 is Euler's constant).[9] This normalization, derived from the expected path length in a binary search tree for unsuccessful searches, adjusts for varying subsample sizes to ensure comparable anomaly measures regardless of dataset scale.[9]
The anomaly score s(x, n) aggregates path lengths across the ensemble of isolation trees through the expected path length E(h(x)), the average of h(x) over all trees. It is computed as
s(x, n) = 2^{-\frac{E(h(x))}{c(n)}},
where n is the number of instances used to build each tree (the subsample size \psi), so that c(n) plays the same normalizing role as c(\psi) above.[9] This exponential form maps the normalized path length onto a score between 0 and 1, giving a graded measure of abnormality.
The score interpretation hinges on its value relative to 0.5: scores approaching 1 indicate anomalies, as E(h(x)) is much shorter than c(n), signifying easy isolation; scores near 0.5 suggest instances neither clearly anomalous nor normal, occurring when E(h(x)) \approx c(n); scores below 0.5 (approaching 0 for very long paths) identify inliers, as these instances require many splits to isolate and blend with the data majority.[9] The use of c(n) inherently normalizes scores for different dataset sizes, making the method robust to variations in data volume without additional rescaling.[9]
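A minimal, self-contained sketch of these formulas (the function names are illustrative), together with a small numeric check of the score ranges discussed above:

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Expected path length of an unsuccessful BST search over n points."""
    if n > 2:
        return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n
    return 1.0 if n == 2 else 0.0

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n)); n is the subsample size used per tree."""
    return 2.0 ** (-mean_path_length / c(n))

# With psi = 256, c(256) is roughly 10.24.  A point isolated after ~4 splits on
# average scores about 2**(-4/10.24)  ~ 0.76  (anomalous), while a point needing
# ~10 splits scores about 2**(-10/10.24) ~ 0.51 (indeterminate / closer to normal).
```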
Key Properties
Isolation Forest exhibits several inherent characteristics that distinguish it from traditional anomaly detection methods, primarily due to its isolation-based approach rather than density or distance profiling. A key property is its isolation efficiency, where anomalies are isolated more rapidly than normal instances because of their sparsity and distinct attribute values in the feature space. This results in shorter path lengths for anomalies in isolation trees, as they require fewer random partitions to be separated from the majority of data points.[1]
The algorithm demonstrates strong scalability: training takes O(t \psi \log \psi) time, where t is the number of trees and \psi is the subsampling size, so the training cost does not grow with the size of the full dataset. Evaluation scales linearly with the number of instances scored, and because \psi is fixed at a small constant such as 256, memory usage remains low, typically O(t \psi).[1][9]
Robustness to irrelevant features is another notable property, achieved through random attribute selection during tree construction, which mitigates the impact of high-dimensional or noisy data without requiring explicit feature engineering. This random partitioning helps in avoiding overfitting to irrelevant dimensions and maintains performance in sparse or high-dimensional spaces.[1][9]
Despite these strengths, Isolation Forest has limitations, including sensitivity to the contamination parameter, which encodes the expected proportion of anomalies and directly sets the threshold applied to anomaly scores; misestimating it can lead to suboptimal detection rates. The method is also less effective for clustered anomalies, where outliers lying close together can mask one another and nearby normal points in dense regions can be swamped, i.e., wrongly flagged.[1][9]
Theoretically, these properties are underpinned by mass-volume analysis, which demonstrates that outliers are isolated faster than normal points because anomalies occupy smaller volumes in the data space and thus require fewer splits to isolate. This analysis provides a probabilistic foundation for why path lengths serve as a reliable indicator of abnormality, with shorter average paths corresponding to higher isolation likelihood for outliers.[1]
Parameter Tuning and Implementation
Parameter Selection
The selection of hyperparameters in the Isolation Forest algorithm is essential for achieving effective anomaly detection while managing computational resources and adapting to dataset characteristics such as dimensionality and anomaly prevalence. The primary parameters include the contamination rate, number of trees, subsample size, and maximum features per split, each influencing the model's isolation mechanism and output scores.
The contamination parameter represents the expected fraction of anomalies in the data, with a common default value of 0.1 in practical implementations; it determines the threshold applied to anomaly scores, where values closer to 0.5 allow for higher anomaly proportions but risk over-detection. In the scikit-learn implementation, the default is 'auto', which computes the threshold using the expected path length from the original formulation, though users can override it with a float in (0, 0.5] based on domain knowledge to adjust sensitivity.[6]
The number of trees, denoted as t, defaults to 100 and controls the ensemble size, offering a trade-off between detection accuracy and training speed; typical values range from 100 to 500, as higher counts enhance stability but increase runtime linearly. The original authors note that average path lengths converge effectively at t \approx 100, making this a robust starting point for most applications without unnecessary overhead.[1]
The subsample size, often symbolized as \psi, specifies the number of instances drawn randomly for constructing each isolation tree and defaults to 256 (or the minimum of 256 and the dataset size); values of 256 or 512 are standard, as they balance bias-variance trade-offs by promoting diverse isolations while keeping memory usage low. Empirical evaluations in the seminal work recommend restricting \psi to 128–512, beyond which processing time rises disproportionately without accuracy improvements.[9]
The max_features parameter defaults to 1.0, indicating the use of all features for splits in each tree, but can be set to a fraction (e.g., 0.8) to subsample features randomly, which adds diversity in high-dimensional data at the cost of extended computation. This setting, when reduced below 1.0, helps mitigate overfitting in feature-rich environments but requires careful calibration to avoid diluting isolation effectiveness.[6]
Tuning these parameters typically involves grid search or random search over predefined ranges, often using cross-validation on a validation set if partial labels are available to optimize supervised metrics like AUC-ROC. In purely unsupervised contexts, intrinsic metrics such as the silhouette score can evaluate the separation between normal points and detected anomalies, guiding selection via tools like scikit-learn's GridSearchCV with custom scorers. Such methods refine the anomaly scores by aligning isolation paths more closely with data structure, though they demand computational investment proportional to the search space.[11]
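As a sketch of the semi-supervised tuning route mentioned above—assuming a partially labeled set y with 1 marking known anomalies, and using scikit-learn's GridSearchCV with a custom AUC scorer—one might write:

```python
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def auc_scorer(estimator, X, y):
    # score_samples is higher for more normal points, so negate it to obtain an
    # anomaly score before computing AUC against the (partial) labels.
    return roc_auc_score(y, -estimator.score_samples(X))

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_samples": [128, 256, 512],
    "max_features": [0.8, 1.0],
}

search = GridSearchCV(
    IsolationForest(random_state=0),
    param_grid,
    scoring=auc_scorer,
    cv=StratifiedKFold(n_splits=3),   # keep some labeled anomalies in every fold
)
# X: feature matrix; y: labels used only for scoring (1 = anomaly, 0 = normal).
# search.fit(X, y)
# print(search.best_params_, search.best_score_)
```

The labels are never used by IsolationForest itself during fitting; they only drive the scoring of candidate parameter combinations.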
Open-Source Implementations
One of the most widely used open-source implementations of the Isolation Forest algorithm is available in the scikit-learn library, a popular machine learning toolkit in Python. The IsolationForest class was introduced in version 0.18 of scikit-learn and provides methods such as fit for training the model on data and predict for scoring new observations as anomalies or inliers.[6] It supports key parameters like contamination, which estimates the proportion of outliers in the dataset to adjust decision thresholds.[6]
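A minimal usage sketch of this class (the data here are synthetic placeholders):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = rng.randn(1000, 2)                                             # mostly "normal" points
X_test = np.vstack([rng.randn(10, 2), rng.uniform(4, 6, size=(5, 2))])   # plus a few obvious outliers

clf = IsolationForest(
    n_estimators=100,      # number of isolation trees
    max_samples=256,       # subsample size psi per tree
    contamination="auto",  # or a float in (0, 0.5] if the outlier fraction is known
    random_state=42,
)
clf.fit(X_train)

labels = clf.predict(X_test)          # +1 for inliers, -1 for outliers
scores = -clf.score_samples(X_test)   # larger values indicate more anomalous points
```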
PyOD is another Python library specialized for outlier and anomaly detection, offering an enhanced variant of Isolation Forest through its IForest class. This implementation wraps scikit-learn's IsolationForest but adds extended functionalities, such as integration with other detection algorithms and built-in benchmarking tools to evaluate performance across datasets. PyOD's IForest is designed for seamless use in pipelines, supporting parameters like the number of trees while providing utilities for model comparison and visualization of results.
For large-scale data processing, a distributed implementation of Isolation Forest is provided by the isolation-forest library, which operates on Apache Spark for scalable anomaly detection. This Scala-based tool enables training on clusters, handling big data volumes through parallel processing of isolation trees, and supports features like ONNX export for model portability across environments.[12]
In the R programming language, the isotree package offers a fast, multi-threaded implementation of Isolation Forest, suitable for outlier detection in tabular data. It includes core functions for model fitting and prediction, along with utilities for generating diagnostic plots to visualize isolation paths and anomaly scores.
| Library | Language | Key Features | Scalability |
|---|---|---|---|
| scikit-learn | Python | fit/predict methods; contamination parameter; ensemble integration | Single-machine; handles moderate datasets |
| PyOD | Python | Enhanced wrapper; benchmarking tools; pipeline support | Single-machine; optimized for anomaly tasks |
| isolation-forest (Spark) | Scala | Distributed training; ONNX export; parallel tree building | Cluster-based; big data via Spark |
| isotree | R | Multi-threaded; plotting utilities; extended variants | Single-machine; multi-core parallelization |
Variants
SCiForest
SCiForest addresses the limitations of the original Isolation Forest in detecting clustered anomalies within high-dimensional data, where the curse of dimensionality causes points to become sparse and nearly equidistant, producing a masking effect that hides local, clustered anomalies visible only in certain subspaces.[13] This variant is particularly motivated by sparse high-dimensional datasets, such as text or genomics data, where traditional methods struggle to isolate anomalous clusters without assuming global sparsity.[13]
In SCiForest, subspace selection occurs by generating random hyperplanes defined over a small number of attributes q (with q \ll d, typically q=2) to focus on informative subspaces that better separate anomalies from normal points.[13] These hyperplanes are non-axis-parallel, and the best one is chosen from multiple candidates (\tau=10) using an Sd gain criterion, which measures the separation between data distributions on either side of the hyperplane.[13] The algorithm builds an ensemble of isolation trees (t=100) on subsamples (\psi=256), where each tree is constructed by recursively applying the selected hyperplane to split the data until subsets are small (|X'| \leq 2).[13]
For anomaly scoring, SCiForest computes the path length h(x) for a test instance x through each tree, averaging these across the ensemble to obtain the expected path length E(h(x)), with shorter paths indicating anomalies.[13] The score is normalized using the average path length c(\psi) for subsamples of size \psi, and thresholds are set based on these subspace-adjusted averages to classify points, penalizing those with out-of-range paths.[13] Key hyperparameters include the subspace size q (e.g., 2, tunable based on dimensionality), number of hyperplane trials \tau, number of trees t, and subsample size \psi.[13]
This approach offers advantages in sparse high-dimensional data by leveraging hyperplanes and Sd gain to detect local clustered anomalies more effectively than the original Isolation Forest, with empirical superiority on datasets like KDDCUP 1999 and synthetic high-dimensional benchmarks, while maintaining linear time complexity.[13]
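The hyperplane-selection step can be sketched as below. This is a simplified illustration: it omits the normalization and acceptable-range bookkeeping of the full method, and the function names and candidate-split enumeration are assumptions made for this example rather than the published procedure.

```python
import numpy as np

def sd_gain(y, split):
    """Sd gain of splitting projected values y at 'split': the reduction in
    averaged standard deviation relative to the unsplit dispersion."""
    left, right = y[y < split], y[y >= split]
    if len(left) == 0 or len(right) == 0 or y.std() == 0:
        return 0.0
    return (y.std() - (left.std() + right.std()) / 2.0) / y.std()

def best_random_hyperplane(X, q=2, trials=10, rng=None):
    """Try 'trials' random hyperplanes, each over q randomly chosen attributes
    with random coefficients, and return the candidate with the highest Sd gain."""
    rng = rng or np.random.default_rng()
    best = (None, None, None, -np.inf)          # (attributes, coefficients, split, gain)
    for _ in range(trials):
        attrs = rng.choice(X.shape[1], size=q, replace=False)
        coefs = rng.uniform(-1, 1, size=q)
        y = X[:, attrs] @ coefs                 # projection onto the hyperplane normal
        ys = np.sort(y)
        for split in (ys[:-1] + ys[1:]) / 2.0:  # candidate splits between sorted projections
            g = sd_gain(y, split)
            if g > best[3]:
                best = (attrs, coefs, split, g)
    return best
```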
Extended Isolation Forest
The Extended Isolation Forest (EIF) is an enhancement to the original Isolation Forest that addresses inconsistencies in anomaly score assignment caused by the axis-parallel splits in standard isolation trees, which can lead to artifacts in score heatmaps and biased variance along certain directions.[3] Developed by Sahand Hariri, Matias Carrasco Kind, and Robert J. Brunner in 2018, EIF improves the robustness and reliability of anomaly detection by incorporating randomly oriented hyperplanes for data partitioning, reducing score variance for points on constant level sets.[14]
EIF operates in a manner similar to the original algorithm but modifies the tree construction to use non-axis-aligned splits. In the preferred approach, each split employs a hyperplane with a random slope, selected to isolate instances more uniformly and avoid directional biases inherent in random feature and value selections. An alternative method involves randomly transforming the data before building each tree to average out biases, though the hyperplane method is recommended for better performance. The ensemble of such extended isolation trees (eTrees) is used to compute anomaly scores based on average path lengths, normalized similarly to the original, where shorter paths indicate anomalies.[3]
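A sketch of the extended splitting rule (illustrative names; a fuller implementation also constrains the intercept to the data range along each branch as the tree deepens):

```python
import numpy as np

def extended_split(X, rng):
    """One EIF-style split: draw a random normal vector n and an intercept point p
    inside the bounding box of X; points go left when (x - p) . n <= 0."""
    d = X.shape[1]
    n = rng.normal(size=d)                         # random slope / hyperplane normal
    p = rng.uniform(X.min(axis=0), X.max(axis=0))  # random intercept inside the data range
    left_mask = (X - p) @ n <= 0
    return n, p, left_mask
```

Because the hyperplane orientation is random rather than axis-parallel, repeated splits do not accumulate the rectangular "banding" artifacts that bias scores along the coordinate axes in the original method.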
This variant maintains the linear time complexity and scalability of Isolation Forest while providing more consistent anomaly scores, particularly beneficial in high-dimensional spaces where split directions matter. Evaluations on synthetic datasets and real-world benchmarks demonstrate comparable AUROC and AUPRC to the original method, with reduced variance in score assignments and no increase in computation time.[14] EIF has been applied in areas such as astronomy data analysis and general outlier detection tasks requiring precise scoring.[3]
Applications
General Use Cases
Isolation Forest has found wide applicability in unsupervised anomaly detection across diverse domains due to its efficiency in handling high-dimensional data and scalability to large datasets. In network security, it is employed to identify intrusions and distributed denial-of-service (DDoS) attacks by analyzing patterns in network traffic data, where anomalies manifest as unusual traffic volumes or protocols that deviate from normal behavior.[15][16] For instance, the algorithm isolates aberrant packets or flows that could indicate malicious activities, enabling real-time threat mitigation without labeled training data.[17]
In manufacturing, Isolation Forest aids in detecting faulty sensors within Internet of Things (IoT) streams by flagging outliers in sensor readings that signal equipment malfunctions or environmental interferences. This approach is particularly suited for continuous monitoring in industrial settings, where it processes streaming data to predict maintenance needs and prevent production downtime.[18][19] Similarly, in finance, beyond specific fraud scenarios, it detects outliers in transaction volumes, such as sudden spikes or drops in market activity that may indicate economic irregularities or system errors. By leveraging random partitioning, it efficiently ranks these deviations in high-volume financial datasets.[20][21]
Healthcare applications include monitoring anomalous patient vitals in real-time systems, where Isolation Forest identifies irregular heart rates, blood pressure, or oxygen levels that could signal critical health events. This unsupervised method integrates seamlessly with wearable devices and electronic health records to prioritize alerts for medical intervention.[22][23] In environmental monitoring, it uncovers unusual patterns in climate sensor data, such as aberrant temperature or humidity readings that deviate from expected seasonal trends, supporting early detection of ecological disruptions. Anomalies are typically ranked using the isolation-based scores from the ensemble of trees.[24] Recent extensions as of 2025 have applied it to real-time IoT security for environmental parameters and aero-engine fault detection in manufacturing.[16][25]
Performance in these unsupervised settings is commonly evaluated using metrics like the area under the receiver operating characteristic curve (AUC-ROC) to assess ranking quality across thresholds, or precision-recall curves to handle imbalanced data where anomalies are rare.[26][27]
Credit Card Fraud Detection Case Study
The Kaggle Credit Card Fraud Detection dataset, provided by the ULB Machine Learning Group, consists of anonymized transactions from European cardholders over two days in September 2013, totaling 284,807 records with 492 confirmed frauds (0.172% prevalence), resulting in a highly imbalanced distribution. The features include 28 principal component analysis (PCA)-transformed variables (V1 to V28), along with 'Time' (seconds elapsed since first transaction) and 'Amount' (transaction value), with the target 'Class' labeling fraud as 1 and legitimate as 0.[28]
Preprocessing involves scaling the non-PCA features to handle outliers in transaction amounts, using a robust scaler that reduces sensitivity to extreme values by relying on medians and quartiles rather than means.[29] To address class imbalance without relying solely on the model's parameters, subsampling of the majority (legitimate) class is applied, selecting a subset to balance training data while preserving all fraud instances. Feature selection follows, using random forest importance scores to retain the top five most discriminative features (e.g., V14 with importance 0.191), reducing dimensionality and focusing on key anomaly indicators.[28]
The Isolation Forest model is trained using scikit-learn's implementation, with key parameters set to n_estimators=100 (number of trees), max_samples=128 (subsampling size per tree), and contamination=0.001 (approximating the known fraud proportion of ~0.0017).[28] The dataset is split into 70% training and 30% testing sets, with further validation partitioning (60:40) to tune thresholds; training isolates anomalies by constructing random isolation trees on the preprocessed features, leveraging the algorithm's efficiency for high-dimensional, imbalanced data.
Evaluation emphasizes metrics suited to imbalance, including precision, recall, F1-score, ROC-AUC, and AUCPR (area under the precision-recall curve). On the test set, the model achieves precision of 0.807, recall of 0.764 (detecting 76.4% of frauds), and F1-score of 0.785; ROC-AUC reaches 0.974, while AUCPR is 0.759, indicating strong discrimination at low false positive rates.[28] A confusion matrix reveals effective separation, with high true negatives for legitimate transactions and minimized false positives among the vast majority class.
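A sketch of the workflow described above, assuming the Kaggle CSV is available as creditcard.csv with columns Time, V1–V28, Amount, and Class; the majority-class subsampling, feature selection, and threshold-tuning steps from the study are omitted here, so the exact figures reported above will not be reproduced verbatim.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

df = pd.read_csv("creditcard.csv")
# Robust scaling of the non-PCA features reduces the influence of extreme amounts.
df[["Time", "Amount"]] = RobustScaler().fit_transform(df[["Time", "Amount"]])

X = df.drop(columns="Class")
y = df["Class"]                                      # 1 = fraud, 0 = legitimate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

model = IsolationForest(n_estimators=100, max_samples=128,
                        contamination=0.001, random_state=42)
model.fit(X_train)

scores = -model.score_samples(X_test)                # higher = more anomalous
y_pred = (model.predict(X_test) == -1).astype(int)   # map -1 (outlier) to fraud label 1

print(classification_report(y_test, y_pred, digits=3))
print("ROC-AUC:", roc_auc_score(y_test, scores))
print("AUCPR:", average_precision_score(y_test, scores))
```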
Visualizations include precision-recall and ROC curves, demonstrating the model's ability to maintain high recall (>0.90) at precision thresholds above 0.50, highlighting its robustness to imbalance without explicit labeling assumptions. Path length distributions from the isolation trees further illustrate shorter paths for fraud instances, confirming their outlier status.[28] These strengths enable Isolation Forest to handle the dataset's skewness effectively, outperforming baselines like Local Outlier Factor (F1-score ~0.00 on similar unsupervised setups) by isolating anomalies faster in high dimensions.[30]
Overall results show a fraud detection rate exceeding 90% achievable at tuned thresholds with low false positives (<1% of legitimate transactions flagged), surpassing Local Outlier Factor's performance in recall and computational efficiency on this benchmark.[28][31]