Isolation Forest outlier detection on KDD Cup ‘99 dataset
Method
Isolation forests (IF) are tree-based models designed specifically for outlier detection. An IF isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. The number of splits required to isolate a sample is equal to the path length from the root node to the terminating node. This path length, averaged over a forest of random trees, is a measure of normality and is used to define an anomaly score. Outliers can typically be isolated more quickly, leading to shorter paths.
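The isolation idea can be illustrated with a minimal, library-free sketch. It is not the implementation used by the detector in this notebook: the helper names `isolation_path_length` and `mean_path_length`, the toy Gaussian data, and the sample size are all assumptions made for illustration.

```python
import random


def isolation_path_length(point, sample, rng, max_attempts=200):
    """Count the random splits needed to separate `point` from `sample`."""
    subset = list(sample) + [point]
    depth = 0
    for _ in range(max_attempts):
        if len(subset) <= 1:
            break  # the point is isolated
        feature = rng.randrange(len(point))
        values = [row[feature] for row in subset]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue  # this feature cannot split the subset; try another one
        split = rng.uniform(lo, hi)
        # Keep only the rows on the same side of the split as `point`.
        if point[feature] < split:
            subset = [row for row in subset if row[feature] < split]
        else:
            subset = [row for row in subset if row[feature] >= split]
        depth += 1
    return depth


def mean_path_length(point, data, n_trees=100, sample_size=128, seed=0):
    """Average the path length over several random subsamples ("trees")."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, sample_size)
        total += isolation_path_length(point, sample, rng)
    return total / n_trees


# Toy data (assumption): a 2D Gaussian cloud, one central and one distant point.
data_rng = random.Random(42)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(500)]
inlier, outlier = (0.0, 0.0), (6.0, 6.0)

inlier_len = mean_path_length(inlier, data)
outlier_len = mean_path_length(outlier, data)
print(f"inlier: {inlier_len:.2f} splits, outlier: {outlier_len:.2f} splits")
```

The distant point should need fewer splits on average than the central one, which is the property the anomaly score builds on.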
Dataset
The outlier detector needs to detect computer network intrusions using TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. A connection is a sequence of TCP packets starting and ending at well-defined times, during which data flows between a source IP address and a target IP address under a well-defined protocol. Each connection is labeled either as normal or as an attack.
There are 4 types of attacks in the dataset:
DOS: denial-of-service, e.g. SYN flood;
R2L: unauthorized access from a remote machine, e.g. guessing password;
U2R: unauthorized access to local superuser (root) privileges;
probing: surveillance and other probing, e.g., port scanning.
The dataset contains about 5 million connection records.
There are 3 types of features:
basic features of individual connections, e.g. duration of connection
content features within a connection, e.g. number of failed login attempts
traffic features within a 2-second window, e.g. number of connections to the same host as the current connection
This notebook requires the seaborn package for visualization, which can be installed via pip.
Load dataset
We keep only a subset of the continuous features (18 of the 41 features).
Assume that a model is trained on normal instances of the dataset (not outliers) and standardization is applied:
Apply standardization:
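As a rough sketch of this step, the statistics are computed on the normal training data only and then reused for any later batch. The helpers `fit_standardizer` and `standardize`, the toy values, and the use of the population standard deviation are assumptions for illustration, not the notebook's own code.

```python
from statistics import fmean, pstdev


def fit_standardizer(rows):
    """Return per-feature means and standard deviations of `rows`."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    # Fall back to 1.0 for constant features to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def standardize(rows, means, stds):
    """Apply (x - mean) / std feature by feature."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy training data (assumption): three normal instances with two features.
X_train = [[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]]
means, stds = fit_standardizer(X_train)
X_train_std = standardize(X_train, means, stds)

# New batches reuse the training statistics rather than their own.
X_new_std = standardize([[2.0, 30.0]], means, stds)
print(means, X_new_std)
```

Reusing the training mean and standard deviation keeps the scores of later batches comparable with those seen when the threshold was set.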
Define outlier detector
We train an outlier detector from scratch:
The warning tells us we still need to set the outlier threshold. This can be done with the infer_threshold method. We need to pass a batch of instances and specify what percentage of those we consider to be normal via threshold_perc. Let's assume we have some data which we know contains around 5% outliers. The percentage of outliers can be set with perc_outlier in the create_outlier_batch function.
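Conceptually, treating a given percentage of a batch as normal amounts to placing the threshold at that percentile of the batch's outlier scores. The sketch below illustrates this under stated assumptions: the `percentile` helper and the toy scores are hypothetical, higher scores are taken to mean "more anomalous", and whether the library computes the threshold in exactly this way is not confirmed here.

```python
def percentile(values, perc):
    """Linearly interpolated percentile of `values`, with 0 <= perc <= 100."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * perc / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


# Toy scores (assumption): 19 normal instances and 1 outlier, i.e. 5% outliers.
scores = [i / 100 for i in range(19)] + [0.9]

perc_outlier = 5
threshold_perc = 100 - perc_outlier  # share of the batch considered normal
threshold = percentile(scores, threshold_perc)

# Instances scoring above the threshold are flagged as outliers.
flags = [int(s > threshold) for s in scores]
print(f"threshold={threshold:.3f}, flagged={sum(flags)}")
```

With 5% outliers in the batch, the 95th percentile separates the single high score from the rest in this toy example.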
Let's save the outlier detector with the updated threshold:
Detect outliers
We now generate a batch of data with 10% outliers and detect the outliers in the batch.
Predict outliers:
Display results
F1 score and confusion matrix:
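For reference, both metrics can be derived from the four confusion counts. This is a library-free sketch with made-up labels (1 = outlier, 0 = normal), not the results of this notebook; `confusion_counts` and `f1` are hypothetical helper names.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 marks an outlier."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp


def f1(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no positives."""
    tn, fp, fn, tp = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels (assumption): 16 normal instances followed by 4 outliers.
y_true = [0] * 16 + [1] * 4
y_pred = [0] * 15 + [1] + [1, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
score = f1(y_true, y_pred)
print(f"confusion matrix: [[{tn}, {fp}], [{fn}, {tp}]], F1 = {score:.2f}")
```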
Plot instance level outlier scores vs. the outlier threshold:
We can see that the isolation forest does not do a good job of detecting one type of outlier, whose outlier scores lie around 0. This makes it hard to infer a good threshold without explicit knowledge about the outliers. Setting the threshold just below 0 would lead to significantly better detector performance for the outliers in the dataset. This is also reflected by the ROC curve:
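The effect of moving the threshold can be sketched with a small threshold sweep. The scores below are invented to mimic the situation described (one group of outliers scoring just below 0) and are not taken from the notebook; `recall_at`, `roc_points`, and `auc` are hypothetical helpers.

```python
def recall_at(y_true, scores, threshold):
    """Fraction of true outliers whose score exceeds `threshold`."""
    n_pos = sum(y_true)
    hits = sum(1 for t, s in zip(y_true, scores) if t == 1 and s > threshold)
    return hits / n_pos


def roc_points(y_true, scores):
    """(FPR, TPR) pairs obtained by sweeping the threshold over all scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= thr)
        fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= thr)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy scores (assumption): 6 normal instances, then 4 outliers, two near 0.
y_true = [0] * 6 + [1] * 4
scores = [-0.20, -0.15, -0.12, -0.10, -0.08, -0.05, 0.10, 0.12, -0.02, -0.01]

recall_at_zero = recall_at(y_true, scores, 0.0)
recall_below_zero = recall_at(y_true, scores, -0.03)
area = auc(roc_points(y_true, scores))
print(recall_at_zero, recall_below_zero, round(area, 3))
```

In this toy setting the ranking of scores is perfect, so the area under the curve is high even though a threshold at 0 misses half of the outliers; only the choice of threshold limits the detector.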