Mahalanobis outlier detection on KDD Cup ‘99 dataset

Method

The Mahalanobis online outlier detector aims to predict anomalies in tabular data. The algorithm calculates an outlier score, which is a measure of distance from the center of the feature distribution (Mahalanobis distance). If this outlier score is higher than a user-defined threshold, the observation is flagged as an outlier. The algorithm is online, which means that it starts without knowledge about the distribution of the features and learns as requests arrive. Consequently, you should expect the predictions to be poor at the start and to improve over time.
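
For reference, the Mahalanobis distance of an observation x from a distribution with mean μ and covariance matrix Σ is defined as d(x) = √((x − μ)ᵀ Σ⁻¹ (x − μ)). Below is a minimal NumPy sketch of the batch (offline) computation, for illustration only; the online detector instead updates the mean and covariance incrementally as data arrives:

```python
import numpy as np

def mahalanobis_distance(X: np.ndarray, mu: np.ndarray, cov: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of each row of X from a distribution with mean mu and covariance cov."""
    delta = X - mu                # deviations from the distribution center
    inv_cov = np.linalg.inv(cov)  # inverse covariance matrix
    # row-wise sqrt(delta_i^T . inv_cov . delta_i) without forming the full n x n matrix
    return np.sqrt(np.einsum('ij,jk,ik->i', delta, inv_cov, delta))
```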

Dataset

The outlier detector needs to detect computer network intrusions using TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. A connection is a sequence of TCP packets starting and ending at some well-defined times, during which data flows from a source IP address to a target IP address under some well-defined protocol. Each connection is labeled as either normal or as an attack.

There are 4 types of attacks in the dataset:

  • DOS: denial-of-service, e.g. syn flood;

  • R2L: unauthorized access from a remote machine, e.g. guessing password;

  • U2R: unauthorized access to local superuser (root) privileges;

  • probing: surveillance and other probing, e.g., port scanning.

The dataset contains about 5 million connection records.

There are 3 types of features:

  • basic features of individual connections, e.g. duration of connection

  • content features within a connection, e.g. number of failed login attempts

  • traffic features within a 2-second window, e.g. number of connections to the same host as the current connection

This notebook requires the seaborn package for visualization, which can be installed via pip:

```python
!pip install seaborn
```

Load dataset

We only keep a number of continuous features (18 out of 41).

Assume that a machine learning model is trained on normal instances of the dataset (not outliers) and standardization is applied:
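
A sketch of the loading and standardization steps, assuming the fetch_kdd helper from alibi_detect.datasets; the exact arguments and label convention are illustrative:

```python
import numpy as np
from alibi_detect.datasets import fetch_kdd

kddcup = fetch_kdd(percent10=True)  # load 10% of the dataset
X = kddcup.data.astype(np.float32)  # continuous features
y = kddcup.target                   # assumption: 1 for an attack, 0 for normal

# statistics of the normal instances, used to standardize all later batches
mean, stdev = X[y == 0].mean(axis=0), X[y == 0].std(axis=0)
```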

Define outlier detector

We train an outlier detector from scratch.
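
A minimal initialization sketch; the hyperparameter values below are illustrative defaults, not tuned settings:

```python
from alibi_detect.od import Mahalanobis

od = Mahalanobis(threshold=None,   # no threshold set yet, which triggers a warning
                 n_components=3,   # number of principal components used
                 std_clip=3,       # clip values beyond 3 stdevs when updating mean/covariance
                 start_clip=100)   # number of observations seen before clipping kicks in
```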

Be aware that Mahalanobis is an online, stateful outlier detector. Saving or loading a Mahalanobis detector therefore also saves and loads the state of the detector. This allows the user to warm up the detector before deploying it into production.

The warning tells us we still need to set the outlier threshold. This can be done with the infer_threshold method. We need to pass a batch of instances and specify what percentage of those we consider to be normal via threshold_perc. Let's assume we have some data which we know contains around 5% outliers. The percentage of outliers can be set with perc_outlier in the create_outlier_batch function.
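
A sketch of the threshold inference, assuming the create_outlier_batch helper lives in alibi_detect.utils.data as in the alibi-detect examples:

```python
from alibi_detect.utils.data import create_outlier_batch

np.random.seed(0)
# sample a batch of which ~5% are outliers and standardize it
threshold_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=1000, perc_outlier=5)
X_threshold = ((threshold_batch.data - mean) / stdev).astype(np.float32)

# flag the top 5% of instance scores in the batch as outliers
od.infer_threshold(X_threshold, threshold_perc=95)
print('New threshold: {:.3f}'.format(od.threshold))
```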

Detect outliers

We now generate a batch of data with 10% outliers, standardize those with the mean and stdev values obtained from the normal data (inliers) and detect the outliers in the batch.
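
A sketch of the batch generation, reusing the create_outlier_batch helper and the mean/stdev computed on the normal data:

```python
np.random.seed(1)
outlier_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=1000, perc_outlier=10)
X_outlier = ((outlier_batch.data - mean) / stdev).astype(np.float32)
y_outlier = outlier_batch.target  # ground truth: 1 for outlier, 0 for inlier
```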

Predict outliers:
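
For instance:

```python
# returns a dict with the outlier flags and instance scores under od_preds['data']
od_preds = od.predict(X_outlier, return_instance_score=True)
```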

We can now save the warmed up outlier detector:
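
A sketch using the save/load helpers; in recent alibi-detect versions these live in alibi_detect.saving (older versions expose them via alibi_detect.utils.saving), and the filepath is illustrative:

```python
from alibi_detect.saving import save_detector, load_detector

filepath = './od_mahalanobis/'  # illustrative path
save_detector(od, filepath)     # persists the detector including its online state
od = load_detector(filepath)
```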

Display results

F1 score and confusion matrix:
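
A sketch with scikit-learn metrics and the seaborn heatmap the notebook's dependency hints at:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, f1_score

y_pred = od_preds['data']['is_outlier']
print('F1 score: {:.4f}'.format(f1_score(y_outlier, y_pred)))

cm = confusion_matrix(y_outlier, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['normal', 'outlier'], yticklabels=['normal', 'outlier'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
```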

Plot instance level outlier scores vs. the outlier threshold:
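
A minimal matplotlib sketch (alibi-detect also ships plotting utilities in alibi_detect.utils.visualize):

```python
scores = od_preds['data']['instance_score']
plt.scatter(np.arange(len(scores)), scores, c=y_outlier, cmap='coolwarm', s=10)
plt.axhline(od.threshold, color='k', linestyle='--', label='threshold')
plt.xlabel('Instance')
plt.ylabel('Outlier score')
plt.legend()
plt.show()
```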

We can also plot the ROC curve for the outlier scores of the detector:
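
E.g. with scikit-learn:

```python
from sklearn.metrics import auc, roc_curve

fpr, tpr, _ = roc_curve(y_outlier, scores)
plt.plot(fpr, tpr, label='AUC = {:.3f}'.format(auc(fpr, tpr)))
plt.plot([0, 1], [0, 1], 'k--')  # chance level
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```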

Include categorical variables

So far we only tracked continuous variables. We can however also include categorical variables. The fit step first computes pairwise distances between the categories of each categorical variable. The pairwise distances are based on either the model predictions (MVDM method) or the context provided by the other variables in the dataset (ABDM method). For MVDM, we use the difference between the conditional model prediction probabilities of each category. This method is based on the Modified Value Difference Metric (MVDM) by Cost and Salzberg (1993). ABDM stands for Association-Based Distance Metric, a categorical distance measure introduced by Le and Ho (2005). ABDM infers context from the presence of other variables in the data and computes a dissimilarity measure based on the Kullback-Leibler divergence. Both methods can also be combined as ABDM-MVDM. We then apply multidimensional scaling to project the pairwise distances into Euclidean space.

Load and transform data

Create a dictionary whose keys are the categorical columns and whose values are the number of categories for each variable in the dataset. This dictionary will later be used in the fit step of the outlier detector.
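
A sketch, assuming the dataset has been reloaded with the categorical columns kept (e.g. via fetch_kdd with keep_cols including them) so that X now contains protocol_type, service and flag in its first three columns:

```python
cat_cols = [0, 1, 2]  # assumed positions of protocol_type, service and flag
cat_vars_ord = {col: len(np.unique(X[:, col])) for col in cat_cols}
print(cat_vars_ord)
```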

Fit an ordinal encoder on the categorical data:
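
For example with scikit-learn:

```python
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
X_cat = enc.fit_transform(X[:, cat_cols])  # map each category to an integer code
```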

Combine the scaled numerical and ordinal features. X_fit will be used to infer distances between categorical features later. For simplicity, we already transform the whole dataset, including the outliers that need to be detected later; this is for illustrative purposes only:
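
A sketch; the numerical statistics are assumed to be recomputed on the normal instances of this version of the dataset:

```python
num_cols = list(range(len(cat_cols), X.shape[1]))  # remaining columns are numerical
X_num = X[:, num_cols].astype(np.float32)
mean, stdev = X_num[y == 0].mean(axis=0), X_num[y == 0].std(axis=0)  # stats from normal data
X_fit = np.c_[X_cat, (X_num - mean) / stdev].astype(np.float32)
```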

Initialize and fit outlier detector

We use the same threshold as for the continuous data. This will likely not result in optimal performance. Alternatively, you can infer the threshold again.
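
A sketch reusing the previously inferred threshold and passing the categorical variable map at initialization:

```python
threshold = od.threshold  # threshold inferred on the continuous data
od = Mahalanobis(threshold=threshold, cat_vars=cat_vars_ord)
```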

Set fit parameters:
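
For instance; the parameter names follow the alibi-detect Mahalanobis docs, and the values are illustrative:

```python
fit_kwargs = {
    'd_type': 'abdm',           # categorical distance: 'abdm', 'mvdm' or 'abdm-mvdm'
    'disc_perc': [25, 50, 75]   # percentiles used to bin numerical features as ABDM context
}
```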

Apply fit method to find numerical values for categorical variables:
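
Then:

```python
od.fit(X_fit, **fit_kwargs)
```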

The numerical values for the categorical features are stored in the attribute od.d_abs. This is a dictionary whose keys are the columns of the categorical features and whose values are the numerical equivalents of the categories:

Another option would be to set d_type to 'mvdm' and y to kddcup.target to infer the numerical values for categorical variables from the model labels (or alternatively the predictions).

Run outlier detector and display results

Generate batch of data with 10% outliers:
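
Mirroring the continuous case:

```python
np.random.seed(1)
outlier_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=1000, perc_outlier=10)
y_outlier = outlier_batch.target
```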

Preprocess the outlier batch:
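
A sketch, reusing the fitted ordinal encoder and the numerical statistics from above:

```python
X_outlier_cat = enc.transform(outlier_batch.data[:, cat_cols])
X_outlier_num = (outlier_batch.data[:, num_cols].astype(np.float32) - mean) / stdev
X_outlier = np.c_[X_outlier_cat, X_outlier_num].astype(np.float32)
```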

Predict outliers:

F1 score and confusion matrix:

Plot instance level outlier scores vs. the outlier threshold:

Use OHE instead of ordinal encoding for the categorical variables

Since we will apply one-hot encoding (OHE) to the categorical variables, we convert cat_vars_ord from the ordinal to the OHE format. alibi_detect.utils.mapping contains utility functions to do this. The keys in cat_vars_ohe now represent the first column index of each one-hot encoded categorical variable. This dictionary will later be used in the fit step of the outlier detector.
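
The conversion can also be spelled out by hand; the sketch below assumes the ordinal categorical columns are contiguous and come first, so the corresponding OHE blocks come first as well:

```python
# first OHE column index -> number of categories for that variable
cat_vars_ohe = {}
first_col = 0
for ord_col in sorted(cat_vars_ord):
    cat_vars_ohe[first_col] = cat_vars_ord[ord_col]
    first_col += cat_vars_ord[ord_col]
print(cat_vars_ohe)
```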

Fit a one-hot encoder on the categorical data:
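
E.g. with scikit-learn:

```python
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categories='auto')
ohe.fit(X_cat)
```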

Transform X_fit to OHE:
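
A sketch, reattaching the scaled numerical columns after the one-hot encoded ones:

```python
X_fit_ohe = np.c_[ohe.transform(X_cat).toarray(), (X_num - mean) / stdev].astype(np.float32)
```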

Initialize and fit outlier detector

Initialize:
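
For instance:

```python
od = Mahalanobis(threshold=threshold,
                 cat_vars=cat_vars_ohe,
                 ohe=True)  # flag that the categorical variables are one-hot encoded
```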

Apply fit method:
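
Using the same illustrative fit parameters as before:

```python
od.fit(X_fit_ohe, **fit_kwargs)
```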

Run outlier detector and display results

Transform outlier batch to OHE:
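
A sketch:

```python
X_outlier_ohe = np.c_[ohe.transform(X_outlier_cat).toarray(), X_outlier_num].astype(np.float32)
```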

Predict outliers:

F1 score and confusion matrix:

Plot instance level outlier scores vs. the outlier threshold:
