1 of 80

docs-master

Alibi Detect

Getting Started

Installation

PyPI

conda-forge

conda install mamba -n base -c conda-forge

mamba can then be used to install alibi-detect in a conda enviroment:

mamba install -c conda-forge alibi-detect

Features

To get a list of respectively the latest outlier, adversarial and drift detection algorithms, you can type:

import alibi_detect
# View all the Outlier Detection (od) algorithms available
alibi_detect.od.__all__

['OutlierAEGMM',
 'IForest',
 'Mahalanobis',
 'OutlierAE',
 'OutlierVAE',
 'OutlierVAEGMM',
 'OutlierProphet',
 'OutlierSeq2Seq',
 'SpectralResidual',
 'LLR']

# View all the Adversarial Detection (ad) algorithms available
alibi_detect.ad.__all__

['AdversarialAE',
'ModelDistillation']

# View all the Concept Drift (cd) detection algorithms available
alibi_detect.cd.__all__

['ChiSquareDrift',
 'ClassifierDrift',
 'ClassifierUncertaintyDrift',
 'ContextMMDDrift',
 'CVMDrift',
 'FETDrift',
 'KSDrift',
 'LearnedKernelDrift',
 'LSDDDrift',
 'LSDDDriftOnline',
 'MMDDrift',
 'MMDDriftOnline',
 'RegressorUncertaintyDrift',
 'SpotTheDiffDrift',
 'TabularDrift']

For detailed information on the outlier detectors:

Similar for adversarial detection:

And data drift:

Basic Usage

First, we import the detector:

from alibi_detect.od import OutlierVAE

Then we initialize it by passing it the necessary arguments:

od = OutlierVAE(
    threshold=0.1,
    encoder_net=encoder_net,
    decoder_net=decoder_net,
    latent_dim=1024
)

Some detectors require an additional .fit step using training data:

od.fit(X_train)

preds = od.predict(X_test)

The predictions are returned in a dictionary with as keys meta and data. meta contains the detector's metadata while data is in itself a dictionary with the actual predictions (and other relevant values). It has either is_outlier, is_adversarial or is_drift (filled with 0's and 1's) as well as optional instance_score, feature_score or p_value as keys with numpy arrays as values.

Algorithm Overview

The following tables summarize the advised use cases for the current algorithms. Please consult the method specific pages for a more detailed breakdown of each method. The column Feature Level indicates whether the detection can be done and returned at the feature level, e.g. per pixel for an image.

Outlier Detection

Detector

Tabular

Image

Time Series

Text

Categorical Features

Online

Feature Level

✔

Adversarial Detection

Detector

Tabular

Image

Time Series

Text

Categorical Features

Online

Feature Level

✔

Drift Detection

Detector

Tabular

Image

Time Series

Text

Categorical Features

Online

Feature Level

✔

Saving and Loading

Alibi Detect includes support for saving and loading detectors to disk. To save a detector, simply call the save_detector method and provide a path to a directory (a new one will be created if it doesn't exist):

from alibi_detect.od import OutlierVAE
from alibi_detect.saving import save_detector

od = OutlierVAE(...) 

filepath = './my_detector/'
save_detector(od, filepath)

To load a previously saved detector, use the load_detector method and provide it with the path to the detector's directory:

from alibi_detect.saving import load_detector

filepath = './my_detector/'
od = load_detector(filepath)

Note: When loading a saved detector, a warning will be issued if the runtime alibi-detect version is different from the version used to save the detector. It is highly recommended to use the same alibi-detect, Python and dependency versions as were used to save the detector to avoid potential bugs and incompatibilities.

Formats

Detectors can be saved using two formats:

Supported detectors


```{tab-item} Drift detectors
| Detector                                                                       | Legacy save/load | Config save/load |
|:-------------------------------------------------------------------------------|:----------------:|:----------------:|
| [Kolmogorov-Smirnov](../cd/methods/ksdrift.ipynb)                              |        ✅         |        ✅         |
| [Cramér-von Mises](../cd/methods/cvmdrift.ipynb)                               |        ❌         |        ✅         |
| [Fisher's Exact Test](../cd/methods/fetdrift.ipynb)                            |        ❌         |        ✅         |
| [Least-Squares Density Difference](../cd/methods/lsdddrift.ipynb)              |        ❌         |        ✅         |
| [Maximum Mean Discrepancy](../cd/methods/mmddrift.ipynb)                       |        ✅         |        ✅         |
| [Learned Kernel MMD](../cd/methods/learnedkerneldrift.ipynb)                   |        ❌         |        ✅         |
| [Chi-Squared](../cd/methods/chisquaredrift.ipynb)                              |        ✅         |        ✅         |
| [Mixed-type tabular](../cd/methods/tabulardrift.ipynb)                         |        ✅         |        ✅         |
| [Classifier](../cd/methods/classifierdrift.ipynb)                              |        ✅         |        ✅         |
| [Spot-the-diff](../cd/methods/spotthediffdrift.ipynb)                          |        ❌         |        ✅         |
| [Classifier Uncertainty](../cd/methods/modeluncdrift.ipynb)                    |        ❌         |        ✅         |
| [Regressor Uncertainty](../cd/methods/modeluncdrift.ipynb)                     |        ❌         |        ✅         |
| [Online Cramér-von Mises](../cd/methods/onlinecvmdrift.ipynb)                  |        ❌         |        ✅         |
| [Online Fisher's Exact Test](../cd/methods/onlinefetdrift.ipynb)               |        ❌         |        ✅         |
| [Online Least-Squares Density Difference](../cd/methods/onlinelsdddrift.ipynb) |        ❌         |        ✅         |
| [Online Maximum Mean Discrepancy](../cd/methods/onlinemmddrift.ipynb)          |        ❌         |        ✅         |
```

```{tab-item} Outlier detectors
| Detector                                                | Legacy save/load | Config save/load |
|:--------------------------------------------------------|:----------------:|:----------------:|
| [Isolation Forest](../od/methods/iforest.ipynb)         |         ✅       |       ❌          |         
| [Mahalanobis Distance](../od/methods/mahalanobis.ipynb) |         ✅       |       ❌          |
| [AE](../od/methods/ae.ipynb)                            |         ✅       |       ❌          |
| [VAE](../od/methods/vae.ipynb)                          |         ✅       |       ❌          |
| [AEGMM](../od/methods/aegmm.ipynb)                      |         ✅       |       ❌          |
| [VAEGMM](../od/methods/vaegmm.ipynb)                    |         ✅       |       ❌          |
| [Likelihood Ratios](../od/methods/llr.ipynb)            |         ✅       |       ❌          |
| [Prophet](../od/methods/prophet.ipynb)                  |         ✅       |       ❌          |
| [Spectral Residual](../od/methods/sr.ipynb)             |         ✅       |       ❌          |
| [Seq2Seq](../od/methods/seq2seq.ipynb)                  |         ✅       |       ❌          |

```

```{tab-item} Adversarial detectors
| Detector                                                    | Legacy save/load | Config save/load |
|:------------------------------------------------------------|:----------------:|:----------------:|
| [Adversarial AE](../ad/methods/adversarialae.ipynb)         |        ✅        |        ❌         |
| [Model distillation](../ad/methods/modeldistillation.ipynb) |        ✅        |        ❌         |
```

Detector

Legacy save/load

Config save/load

Adversarial AE

✅

❌

Model distillation

✅

❌

Supported ML models

model = ... # A TensorFlow model
preprocess_fn = partial(preprocess_drift, model=model, batch_size=128)
cd = MMDDrift(x_ref, backend='tensorflow', p_val=.05, preprocess_fn=preprocess_fn)

cd = ClassifierDrift(x_ref, model, backend='sklearn', p_val=.05, preds_type='probs')

Online detectors

from alibi_detect.cd import LSDDDriftOnline
from alibi_detect.saving import save_detector, load_detector

# Init detector (t=0)
dd = LSDDDriftOnline(x_ref, window_size=10, ert=50)

# Run 2 predictions
pred_1 = dd.predict(x_1)  # t=1 
pred_2 = dd.predict(x_2)  # t=2

# Save detector (state will be saved since t>0)
save_detector(dd, filepath)

# Load detector
dd_new = load_detector(filepath)  # detector will start at t=2

To save a clean (stateless) detector, it should be reset before saving:

dd.reset_state()  # reset to t=0
save_detector(dd, filepath)  # save the detector without state

Detector Configuration Files

For advanced use cases, Alibi Detect features powerful configuration file based functionality. As shown below, Drift detectors can be specified with a configuration file named config.toml (adversarial and outlier detectors coming soon!), which can then be passed to {func}~alibi_detect.saving.load_detector:


````{grid-item-card}
:shadow: md
:margin: 1
:padding: 1
:columns: auto


**Standard instantiation**
^^^

```python
import numpy as np
from alibi_detect.cd import MMDDrift

x_ref = np.load('detector_directory/x_ref.npy')
detector = MMDDrift(x_ref, p_val=0.05)
```
````

````{grid-item-card}
:shadow: md
:margin: 1
:padding: 1
:columns: auto

**Config-driven instantiation**
^^^

<p class="codeblock-label">config.toml</p>

```{code-block} toml

name = "MMDDrift"
x_ref = "x_ref.npy"
p_val = 0.05
```

```python
from alibi_detect.saving import load_detector
filepath = 'detector_directory/'
detector = load_detector(filepath)
```
````

Compared to standard instantiation, config-driven instantiation has a number of advantages:

Human readable: The config.toml files are human-readable (and editable!), providing a readily accessible record of previously created detectors.

Configuration file layout

````{tab-item} Config-driven instantiation

<p class="codeblock-label">config.toml</p>

```toml
name = "KSDrift"
x_ref = "x_ref.npy"
p_val = 0.05
preprocess_fn = "function.dill"
```

```python
from alibi_detect.saving import load_detector
detector = load_detector('detector_directory/')
```
````

````{tab-item} Standard instantiation

```python
import numpy as np
from alibi_detect.cd import KSDrift

x_ref = np.load('detector_directory/x_ref.npy')
preprocess_fn = dill.load('detector_directory/function.dill')
detector = MMDDrift(x_ref, p_val=0.05, preprocess_fn=preprocess_fn)
```
````

In the  above example `config.toml`, `x_ref` and `preprocess_fn` are stored in `detector_directory/`, but this directory
isn't included in the config file. This is because in the config file, **relative directories are relative to the 
location of the config.toml file**. Filepaths may be absolute, or include nested directories, but **must be POSIX 
paths** i.e. use `/` path separators instead of `\`.

Sometimes, fields representing kwargs need to be set to `None`. However, unspecified fields are set to a detector's 
default kwargs (or for [Artefact dictionaries](dictionaries), the defaults shown in the tables). To set 
fields as `None`, specify them as the string `"None"`.

(complex_fields)=

Specifying artefacts

When specifying a detector via a config.toml file, the locally stored reference data x_ref must be specified. In addition, many detectors also require (or allow) additional artefacts, such as kernels, functions and models. Depending on their type, artefacts can be specified in config.toml in a number of ways:

The following table shows the allowable formats for all possible config file artefacts.

:name: all-artefacts-table

|Field                     |.npy file  |.dill file  |[Registry](registering_artefacts)|[Artefact Dictionary](dictionaries)                                                                         | 
|:-------------------------|:---------:|:----------:|:-------------------------------:|:----------------------------------------------------------------------------------------------------------:|
|`x_ref`                   |✔          |            |                                 |                                                                                                            |
|`c_ref`                   |✔          |            |                                 |                                                                                                            |
|`reg_loss_fn`             |           |✔           |✔                                |                                                                                                            |
|`dataset`                 |           |✔           |✔                                |                                                                                                            |
|`initial_diffs`           |✔          |            |                                 |                                                                                                            |
|`model`/`proj`            |           |            |✔                                |{class}`~alibi_detect.saving.schemas.ModelConfig`                                                           |
|`preprocess_fn`           |           |✔           |✔                                |{class}`~alibi_detect.saving.schemas.PreprocessConfig`                                                      |
|`preprocess_batch_fn`     |           |✔           |✔                                |                                                                                                            |
|`embedding`               |           |            |✔                                |{class}`~alibi_detect.saving.schemas.EmbeddingConfig`                                                       |
|`tokenizer`               |           |            |✔                                |{class}`~alibi_detect.saving.schemas.TokenizerConfig`                                                       |
|`kernel`                  |           |            |✔                                |{class}`~alibi_detect.saving.schemas.KernelConfig` or {class}`~alibi_detect.saving.schemas.DeepKernelConfig`|
|`kernel_a`/`kernel_b`     |           |            |✔                                |{class}`~alibi_detect.saving.schemas.KernelConfig`                                                          |
|`optimizer`               |           |✔           |✔                                | {class}`~alibi_detect.saving.schemas.OptimizerConfig`                                                      |

(dictionaries)=

Artefact dictionaries

Simple artefacts, for example a simple preprocessing function serialized in a dill file, can be specified directly: preprocess_fn = "function.dill". However, if more complex, they can be specified as an artefact dictionary:

config.toml (excerpt)

[preprocess_fn]
src = "function.dill"
kwargs = {'kwarg1'=42, 'kwarg2'=false}

Other config fields in the {ref}all-artefacts-table table can be specified via artefact dictionaries in a similar way. For example, the model and proj fields can be set as TensorFlow or PyTorch models via the {class}~alibi_detect.saving.schemas.ModelConfig dictionary. Often an artefact dictionary may itself contain nested artefact dictionaries, as is the case in in the following example, where a preprocess_fn is specified with a TensorFlow model.

config.toml (excerpt)

[preprocess_fn]
src = "@cd.tensorflow.preprocess.preprocess_drift"
batch_size = 32

[preprocess_fn.model]
src = "model/"

(registering_artefacts)=

Registering artefacts


````{grid-item-card}
:shadow: md
:margin: 1
:padding: 1
:columns: auto

**Registering a function**
^^^

```python
import numpy as np
from alibi_detect.saving import registry, load_detector

# Register a simple function
@registry.register('my_function.v1')
def my_function(x: np.ndarray) -> np.ndarray:
    "A custom function to normalise input data."
    return (x - x.mean()) / x.std()

# Load detector with config.toml file referencing "@my_function.v1"    
detector = load_detector(filepath)
```
````

````{grid-item-card}
:shadow: md
:margin: 1
:padding: 1
:columns: auto

**Specifying in a config.toml**
^^^

<p class="codeblock-label">config.toml</p>

```toml
name = "MMDDrift"
x_ref = "x_ref.npy"
preprocess_fn = "@my_function.v1"
```
````

Once the custom function has been registered, it can be specified in config.toml files via its reference string (with @ prepended), for example "@my_function.v1" in this case. Other objects, such as custom tensorflow or pytorch models, can also be registered by using the register function directly. For example, to register a tensorflow encoder model:

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Dense, Flatten, InputLayer
from alibi_detect.saving import registry

encoder_net = tf.keras.Sequential(
  [
      InputLayer(input_shape=(32, 32, 3)),
      Conv2D(64, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(128, 4, strides=2, padding='same', activation=tf.nn.relu),
      Flatten(),
      Dense(32,)
  ]
)
registry.register("my_encoder.v1", func=encoder_net)

Examining registered artefacts

A registered object's metadata can be obtained with registry.find(), and all currently registered objects can be listed with registry.get_all(). For example, registry.find("my_function.v1") returns the following:

{'module': '__main__', 'file': 'test.py', 'line_no': 3, 'docstring': 'A custom function to normalise input data.'}

Pre-registered utility functions/objects

For convenience, Alibi Detect also pre-registers a number of commonly used utility functions and objects.

Function/Class

Registry reference*

Tensorflow

Pytorch

{func}~alibi_detect.cd.tensorflow.preprocess.preprocess_drift

'@cd.[backend].preprocess.preprocess_drift'

✔

{class}~alibi_detect.utils.tensorflow.kernels.GaussianRBF

'@utils.[backend].kernels.GaussianRBF'

✔

{class}~alibi_detect.utils.tensorflow.data.TFDataset

'@utils.tensorflow.data.TFDataset'

✔

*For backend-specific functions/classes, [backend] should be replaced the desired backend e.g. tensorflow or pytorch.

(examples)=

Example config files

% To demonstrate the config-driven functionality, example detector configurations are presented in this section.

% To download a config file and its related artefacts, click on the Run Me tabs, copy the Python code, and run it % in your local Python shell.

(imdb_example)=

Drift detection on text data

%````{tabbed} Config file %:new-group:

config.toml

x_ref = "x_ref.npy"
name = "MMDDrift"

[preprocess_fn]
src = "@cd.tensorflow.preprocess.preprocess_drift"
batch_size = 32
max_len = 100
tokenizer.src = "tokenizer/"

[preprocess_fn.model]
src = "model/"

[preprocess_fn.embedding]
src = "embedding/"
type = "hidden_state"
layers = [-1, -2, -3, -4, -5, -6, -7, -8]

% % %{tabbed} Run Me % %python %from alibi_detect.utils.fetching import fetch_config %from alibi_detect.saving import load_detector %filepath = 'IMDB_example_MMD/' %fetch_config('imdb_mmd', filepath) %detector = load_detector(filepath) % %````

% TODO: Add a second example demo-ing loading of state (once implemented). e.g. for online or learned kernel.

%## Advanced usage

(validation)=

Validating config files

When {func}~alibi_detect.saving.load_detector is called, the {func}~alibi_detect.saving.validate_config utility function is used internally to validate the given detector configuration. This allows any problems with the configuration to be detected prior to sometimes time-consuming operations of loading artefacts and instantiating the detector. {func}~alibi_detect.saving.validate_config can also be used by devs working with Alibi Detect config dictionaries.

Under-the-hood, {func}~alibi_detect.saving.load_detector parses the config.toml file into a unresolved config dictionary. It then passes this dict through {func}~alibi_detect.saving.validate_config, to check for errors such as incorrectly named fields, and incorrect types. If working directly with config dictionaries, the same process can be done explicitly, for example:

from alibi_detect.saving import validate_config

# Define a simple config dict
cfg = {
    'name': 'MMDDrift',
    'x_ref': 'x_ref.npy',
    'p_val': [0.05],
    'bad_field': 'oops!'
}

# Validate the config
validate_config(cfg)

This will return a ValidationError because p_val is expected to be float not a list, and bad_field isn't a recognised field for the MMDDrift detector:

ValidationError: 2 validation errors for MMDDriftConfig
p_val
  value is not a valid float (type=type_error.float)
bad_field
  extra fields not permitted (type=value_error.extra)

Validating at this stage is useful as errors can be caught before the sometimes time-consuming operation of resolving the config dictionary, which involves loading each artefact in the dictionary ({func}~alibi_detect.saving.read_config and {func}~alibi_detect.saving.resolve_config can be used to manually read and resolve a config for debugging). The resolved config dictionary is then also passed through {func}~alibi_detect.saving.validate_config, and this second validation can also be done explicitly:

import numpy as np
from alibi_detect.saving import validate_config

# Create some reference data
x_ref = np.random.normal(size=(100,5))

# Define a simple config dict
cfg = {
    'name': 'MMDDrift',
    'x_ref': x_ref,
    'p_val': 0.05
}

# Validate the config
validate_config(cfg, resolved=True)

Note that since resolved=True, {func}~alibi_detect.saving.validate_config is now expecting x_ref to be a Numpy ndarray instead of a string. This second level of validation can be useful as it helps detect problems with loaded artefacts before attempting the sometimes time-consuming operation of instantiating the detector.

Roadmap

Alibi Detect aims to be the go-to library for outlier, adversarial and drift detection in Python using both the TensorFlow and PyTorch backends.

This means that the algorithms in the library need to handle:

Online detection with often stateful detectors.
Offline detection, where the detector is trained on a batch of unsupervised or semi-supervised data. This assumption resembles a lot of real-world settings where labels are hard to come by.

The algorithms will cover the following data types:

Tabular, including both numerical and categorical data.
Images
Time series, both univariate and multivariate.
Text
Graphs

It will also be possible to combine different algorithms in ensemble detectors.

The library currently covers both online and offline outlier detection algorithms for tabular data, images and time series as well as offline adversarial detectors for tabular data and images. Current drift detection capabilities cover almost any data modality such as mixed type tabular data, text, images or graphs, both in the online and offline setting. Furthermore, Alibi Detect provides supervised drift and context-aware drift detectors.

The near term focus will be on extending save/load functionality for PyTorch detectors, and adding outlier detectors for text and mixed data types.

Outlier Detection

Methods

Mahalanobis Distance

Overview

Usage

Initialize

Parameters:

threshold: Mahalanobis distance threshold above which the instance is flagged as an outlier.
n_components: number of principal components used.
std_clip: feature-wise standard deviation used to clip the observations before updating the mean and covariance matrix.
start_clip: number of observations before clipping is applied.
max_n: algorithm behaves as if it has seen at most max_n points.
cat_vars: dictionary with as keys the categorical columns and as values the number of categories per categorical variable. Only needed if categorical variables are present.
ohe: boolean whether the categorical variables are one-hot encoded (OHE) or not. If not OHE, they are assumed to have ordinal encodings.
data_type: can specify data type added to metadata. E.g. 'tabular' or 'image'.

Initialized outlier detector example:

from alibi_detect.od import Mahalanobis

od = Mahalanobis(
    threshold=10.,
    n_components=2,
    std_clip=3,
    start_clip=100
)

Fit

We only need to fit the outlier detector if there are categorical variables present in the data. The following parameters can be specified:

X: training batch as a numpy array.
y: model class predictions or ground truth labels for X. Used for 'mvdm' and 'abdm-mvdm' pairwise distance metrics. Not needed for 'abdm'.
d_type: pairwise distance metric used for categorical variables. Currently, 'abdm', 'mvdm' and 'abdm-mvdm' are supported. 'abdm' infers context from the other variables while 'mvdm' uses the model predictions. 'abdm-mvdm' is a weighted combination of the two metrics.
w: weight on 'abdm' (between 0. and 1.) distance if d_type equals 'abdm-mvdm'.
disc_perc: list with percentiles used in binning of numerical features used for the 'abdm' and 'abdm-mvdm' pairwise distance measures.
standardize_cat_vars: standardize numerical values of categorical variables if True.
feature_range: tuple with min and max ranges to allow for numerical values of categorical variables. Min and max ranges can be floats or numpy arrays with dimension (1, number of features) for feature-wise ranges.
smooth: smoothing exponent between 0 and 1 for the distances. Lower values will smooth the difference in distance metric between different features.
center: whether to center the scaled distance measures. If False, the min distance for each feature except for the feature with the highest raw max distance will be the lower bound of the feature range, but the upper bound will be below the max feature range.

od.fit(
    X_train,
    d_type='abdm',
    disc_perc=[25, 50, 75]
)

It is often hard to find a good threshold value. If we have a batch of normal and outlier data and we know approximately the percentage of normal data in the batch, we can infer a suitable threshold:

od.infer_threshold(
    X, 
    threshold_perc=95
)

Beware though that the outlier detector is stateful and every call to the score function will update the mean and covariance matrix, even when inferring the threshold.

Detect

We detect outliers by simply calling predict on a batch of instances X to compute the instance level Mahalanobis distances. We can also return the instance level outlier score by setting return_instance_score to True.

The prediction takes the form of a dictionary with meta and data keys. meta contains the detector's metadata while data is also a dictionary which contains the actual predictions stored in the following keys:

is_outlier: boolean whether instances are above the threshold and therefore outlier instances. The array is of shape (batch size,).
instance_score: contains instance level scores if return_instance_score equals True.

preds = od.predict(
    X,
    return_instance_score=True
)

Examples

Tabular

Isolation Forest

Overview

Usage

Initialize

Parameters:

threshold: threshold value for the outlier score above which the instance is flagged as an outlier.
n_estimators: number of base estimators in the ensemble. Defaults to 100.
max_samples: number of samples to draw from the training data to train each base estimator. If int, draw max_samples samples. If float, draw max_samples times number of features samples. If 'auto', max_samples = min(256, number of samples).
max_features: number of features to draw from the training data to train each base estimator. If int, draw max_features features. If float, draw max_features times number of features features.
bootstrap: whether to fit individual trees on random subsets of the training data, sampled with replacement.
n_jobs: number of jobs to run in parallel for fit and predict.
data_type: can specify data type added to metadata. E.g. 'tabular' or 'image'.

Initialized outlier detector example:

Fit

We then need to train the outlier detector. The following parameters can be specified:

X: training batch as a numpy array.
sample_weight: array with shape (batch size,) used to assign different weights to each instance during training. Defaults to None.

It is often hard to find a good threshold value. If we have a batch of normal and outlier data and we know approximately the percentage of normal data in the batch, we can infer a suitable threshold:

Detect

We detect outliers by simply calling predict on a batch of instances X to compute the instance level outlier scores. We can also return the instance level outlier score by setting return_instance_score to True.

is_outlier: boolean whether instances are above the threshold and therefore outlier instances. The array is of shape (batch size,).
instance_score: contains instance level scores if return_instance_score equals True.

Examples

Tabular

Examples

Methods

Offline

Online

Examples

Adversarial Detection

Methods

Examples

Detector Configuration Files


````{grid-item-card}
:shadow: md
:margin: 1
:padding: 1
:columns: auto


**Standard instantiation**
^^^

```python
import numpy as np
from alibi_detect.cd import MMDDrift

x_ref = np.load('detector_directory/x_ref.npy')
detector = MMDDrift(x_ref, p_val=0.05)
```
````

````{grid-item-card}
:shadow: md
:margin: 1
:padding: 1
:columns: auto

**Config-driven instantiation**
^^^

<p class="codeblock-label">config.toml</p>

```{code-block} toml

name = "MMDDrift"
x_ref = "x_ref.npy"
p_val = 0.05
```

```python
from alibi_detect.saving import load_detector
filepath = 'detector_directory/'
detector = load_detector(filepath)
```
````

Compared to standard instantiation, config-driven instantiation has a number of advantages:

Human readable: The config.toml files are human-readable (and editable!), providing a readily accessible record of previously created detectors.
Flexible artefact specification: Artefacts such as datasets and models can be specified as locally serialized objects, or as runtime registered objects (see ). Multiple detectors can share the same artefacts, and they can be easily swapped.
Inbuilt validation: The {func}~alibi_detect.saving.load_detector function uses to validate detector configurations.

To get a general idea of the expected layout of a config file, see the . Alternatively, to obtain a fully populated config file for reference, users can run one of the and generate a config file by passing an instantiated detector to {func}~alibi_detect.saving.save_detector.

Configuration file layout

All detector configuration files follow a consistent layout, simplifying the process of writing simple config files by hand. For example, a {class}~alibi_detect.cd.KSDrift detector with a serialized function to preprocess reference and test data can be specified as:

````{tab-item} Config-driven instantiation

<p class="codeblock-label">config.toml</p>

```toml
name = "KSDrift"
x_ref = "x_ref.npy"
p_val = 0.05
preprocess_fn = "function.dill"
```

```python
from alibi_detect.saving import load_detector
detector = load_detector('detector_directory/')
```
````

````{tab-item} Standard instantiation

```python
import numpy as np
from alibi_detect.cd import KSDrift

x_ref = np.load('detector_directory/x_ref.npy')
preprocess_fn = dill.load('detector_directory/function.dill')
detector = MMDDrift(x_ref, p_val=0.05, preprocess_fn=preprocess_fn)
```
````

The name field should always be the name of the detector, for example KSDrift or SpotTheDiffDrift. The remaining fields are the args/kwargs to pass to the detector (see the {mod}alibi_detect.cd docs for a full list of permissible args/kwargs for each detector). All config fields follow this convention, however as discussed in , some fields can be more complex than others.

In the  above example `config.toml`, `x_ref` and `preprocess_fn` are stored in `detector_directory/`, but this directory
isn't included in the config file. This is because in the config file, **relative directories are relative to the 
location of the config.toml file**. Filepaths may be absolute, or include nested directories, but **must be POSIX 
paths** i.e. use `/` path separators instead of `\`.

Sometimes, fields representing kwargs need to be set to `None`. However, unspecified fields are set to a detector's 
default kwargs (or for [Artefact dictionaries](dictionaries), the defaults shown in the tables). To set 
fields as `None`, specify them as the string `"None"`.

(complex_fields)=

Specifying artefacts

Local files: Simple functions and/or models can be specified as locally stored files, whilst data arrays are specified as locally stored numpy files.
Function/object registry: As discussed in , functions and other objects defined at runtime can be registered using {func}alibi_detect.saving.registry, allowing them to be specified in the config file without having to serialise them. For convenience a number of Alibi Detect functions such as {func}~alibi_detect.cd.tensorflow.preprocess.preprocess_drift are also pre-registered.
Dictionaries: More complex artefacts are specified via nested dictionaries, usually containing a src field and additional option/setting fields. Sometimes these fields may be nested artefact dictionaries themselves. See for further details.

The following table shows the allowable formats for all possible config file artefacts.

:name: all-artefacts-table

|Field                     |.npy file  |.dill file  |[Registry](registering_artefacts)|[Artefact Dictionary](dictionaries)                                                                         | 
|:-------------------------|:---------:|:----------:|:-------------------------------:|:----------------------------------------------------------------------------------------------------------:|
|`x_ref`                   |✔          |            |                                 |                                                                                                            |
|`c_ref`                   |✔          |            |                                 |                                                                                                            |
|`reg_loss_fn`             |           |✔           |✔                                |                                                                                                            |
|`dataset`                 |           |✔           |✔                                |                                                                                                            |
|`initial_diffs`           |✔          |            |                                 |                                                                                                            |
|`model`/`proj`            |           |            |✔                                |{class}`~alibi_detect.saving.schemas.ModelConfig`                                                           |
|`preprocess_fn`           |           |✔           |✔                                |{class}`~alibi_detect.saving.schemas.PreprocessConfig`                                                      |
|`preprocess_batch_fn`     |           |✔           |✔                                |                                                                                                            |
|`embedding`               |           |            |✔                                |{class}`~alibi_detect.saving.schemas.EmbeddingConfig`                                                       |
|`tokenizer`               |           |            |✔                                |{class}`~alibi_detect.saving.schemas.TokenizerConfig`                                                       |
|`kernel`                  |           |            |✔                                |{class}`~alibi_detect.saving.schemas.KernelConfig` or {class}`~alibi_detect.saving.schemas.DeepKernelConfig`|
|`kernel_a`/`kernel_b`     |           |            |✔                                |{class}`~alibi_detect.saving.schemas.KernelConfig`                                                          |
|`optimizer`               |           |✔           |✔                                | {class}`~alibi_detect.saving.schemas.OptimizerConfig`                                                      |

(dictionaries)=

Artefact dictionaries

config.toml (excerpt)

[preprocess_fn]
src = "function.dill"
kwargs = {'kwarg1'=42, 'kwarg2'=false}

Here, the preprocess_fn field is a {class}~alibi_detect.saving.schemas.PreprocessConfig artefact dictionary. In this example, specifying the preprocess_fn function as a dictionary allows us to specify additional kwarg's to be passed to the function upon loading. This example also demonstrates the flexibility of the TOML format, with dictionaries able to be specified with {} brackets or by sections demarcated with [] brackets (see the for more details on the TOML format).

config.toml (excerpt)

[preprocess_fn]
src = "@cd.tensorflow.preprocess.preprocess_drift"
batch_size = 32

[preprocess_fn.model]
src = "model/"

Each artefact dictionary has an associated pydantic model which is used for . The for these pydantic models provides a description of the permissible fields for each artefact dictionary. For examples of how the artefact dictionaries can be used in practice, see {ref}examples.

(registering_artefacts)=

Registering artefacts

Custom artefacts defined in Python code may be specified in the config file without the need to serialise them, by first adding them to the Alibi Detect artefact registry using the {mod}alibi_detect.saving.registry submodule. This submodule harnesses the library to allow functions to be registered with a decorator syntax:


````{grid-item-card}
:shadow: md
:margin: 1
:padding: 1
:columns: auto

**Registering a function**
^^^

```python
import numpy as np
from alibi_detect.saving import registry, load_detector

# Register a simple function
@registry.register('my_function.v1')
def my_function(x: np.ndarray) -> np.ndarray:
    "A custom function to normalise input data."
    return (x - x.mean()) / x.std()

# Load detector with config.toml file referencing "@my_function.v1"    
detector = load_detector(filepath)
```
````

````{grid-item-card}
:shadow: md
:margin: 1
:padding: 1
:columns: auto

**Specifying in a config.toml**
^^^

<p class="codeblock-label">config.toml</p>

```toml
name = "MMDDrift"
x_ref = "x_ref.npy"
preprocess_fn = "@my_function.v1"
```
````

import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Dense, Flatten, InputLayer
from alibi_detect.saving import registry

encoder_net = tf.keras.Sequential(
  [
      InputLayer(input_shape=(32, 32, 3)),
      Conv2D(64, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(128, 4, strides=2, padding='same', activation=tf.nn.relu),
      Flatten(),
      Dense(32,)
  ]
)
registry.register("my_encoder.v1", func=encoder_net)

Examining registered artefacts

{'module': '__main__', 'file': 'test.py', 'line_no': 3, 'docstring': 'A custom function to normalise input data.'}

Pre-registered utility functions/objects

For convenience, Alibi Detect also pre-registers a number of commonly used utility functions and objects.

Function/Class

Registry reference*

Tensorflow

Pytorch

{func}~alibi_detect.cd.tensorflow.preprocess.preprocess_drift

'@cd.[backend].preprocess.preprocess_drift'

✔

{class}~alibi_detect.utils.tensorflow.kernels.GaussianRBF

'@utils.[backend].kernels.GaussianRBF'

✔

{class}~alibi_detect.utils.tensorflow.data.TFDataset

'@utils.tensorflow.data.TFDataset'

✔

*For backend-specific functions/classes, [backend] should be replaced the desired backend e.g. tensorflow or pytorch.

These can be used in config.toml files. Of particular importance are the preprocess_drift utility functions, which allows models, tokenizers and embeddings to be easily specified for preprocessing, as demonstrated in the .

(examples)=

Example config files

% To demonstrate the config-driven functionality, example detector configurations are presented in this section.

% To download a config file and its related artefacts, click on the Run Me tabs, copy the Python code, and run it % in your local Python shell.

(imdb_example)=

Drift detection on text data

This example presents a configuration for the {class}~alibi_detect.cd.MMDDrift detector used in . The detector will pass the input text data through a preprocess_fn step consisting of a tokenizer, embedding and model. An model is included in order to reduce the dimensionality of the embedding space, which consists of a 768-dimensional vector for each instance.

%````{tabbed} Config file %:new-group:

config.toml

x_ref = "x_ref.npy"
name = "MMDDrift"

[preprocess_fn]
src = "@cd.tensorflow.preprocess.preprocess_drift"
batch_size = 32
max_len = 100
tokenizer.src = "tokenizer/"

[preprocess_fn.model]
src = "model/"

[preprocess_fn.embedding]
src = "embedding/"
type = "hidden_state"
layers = [-1, -2, -3, -4, -5, -6, -7, -8]

% TODO: Add a second example demo-ing loading of state (once implemented). e.g. for online or learned kernel.

%## Advanced usage

(validation)=

Validating config files

from alibi_detect.saving import validate_config

# Define a simple config dict
cfg = {
    'name': 'MMDDrift',
    'x_ref': 'x_ref.npy',
    'p_val': [0.05],
    'bad_field': 'oops!'
}

# Validate the config
validate_config(cfg)

This will return a ValidationError because p_val is expected to be float not a list, and bad_field isn't a recognised field for the MMDDrift detector:

ValidationError: 2 validation errors for MMDDriftConfig
p_val
  value is not a valid float (type=type_error.float)
bad_field
  extra fields not permitted (type=value_error.extra)

import numpy as np
from alibi_detect.saving import validate_config

# Create some reference data
x_ref = np.random.normal(size=(100,5))

# Define a simple config dict
cfg = {
    'name': 'MMDDrift',
    'x_ref': x_ref,
    'p_val': 0.05
}

# Validate the config
validate_config(cfg, resolved=True)

%### Detector specification schemas % %Validation of detector config files is performed with . Each %detector's unresolved configuration is represented by a pydantic model, stored in DETECTOR_CONFIGS. Information on %a detector config's permitted fields and their types can be obtained via the config model's schema method %(or schema_json if a json formatted string is preferred). For example, for the KSDrift detector: % %python %from alibi_detect.saving.schemas import DETECTOR_CONFIGS %schema = DETECTOR_CONFIGS['KSDrift'].schema() % % % %returns a dictionary with the keys ['title', 'type', 'properties', 'required', 'additionalProperties', 'definitions']. %The 'properties' item is a dictionary containing all the possible fields for the detector: % %python %{'name': {'title': 'Name', 'type': 'string'}, % 'version': {'title': 'Version', 'default': '0.8.1dev', 'type': 'string'}, % 'backend': {'title': 'Backend', % 'default': 'tensorflow', % 'enum': ['tensorflow', 'pytorch'], % 'type': 'string'}, % 'x_ref': {'title': 'X Ref', 'default': 'x_ref.npy', 'type': 'string'}, % 'p_val': {'title': 'P Val', 'default': 0.05, 'type': 'number'}, % 'x_ref_preprocessed': {'title': 'X Ref Preprocessed', % 'default': False, % 'type': 'boolean'}, % ... % } % % %Compulsory fields are listed in 'required', whilst additional pydantic model definitions such as %ModelConfig are stored in 'definitions'. These additional models define the schemas for the %. % %Similarly, resolved detector configurations are represented by pydantic models stored in %DETECTOR_CONFIGS_RESOLVED. The difference being that for these models, artefacts are expected to have their %resolved types, for example x_ref should be a NumPy ndarray. The schema and schema_json methods can be applied %to these models in the same way.

Getting Started

Installation

Alibi Detect can be installed from or by following the instructions below.

PyPI

Alibi Detect can be installed from with pip. We provide optional dependency buckets for several modules that are large or sometimes tricky to install. Many detectors are supported out of the box with the default install but some detectors require a specific optional dependency installation to use. For instance, the OutlierProphet detector requires the prophet installation. Other detectors have a choice of backend. For instance, the LSDDDrift detector has a choice of tensorflow or pytorch backends. The tabs below list the full set of detector functionality provided by each optional dependency.

Default installation.

pip install alibi-detect

The default installation provides out the box support for the following detectors:

If you are unsure which detector to use, or wish to have access to as many as possible the recommended installation is:

pip install alibi-detect[tensorflow,prophet]

If you would rather use pytorch backends then you can use:

pip install alibi-detect[torch,prophet]

However, the following detectors do not have pytorch backend support:

Alternatively you can install all the dependencies using (this will include both tensorflow and pytorch):

pip install alibi-detect[all]

Note

If you wish to use the GPU version of PyTorch, or are installing on Windows, it is recommended to prior to installing alibi-detect.

Note

If using torch version 2.0.0 or 2.0.1 along with some versions of tensorflow you may experience hanging depending on the order you import each of these libraries. This is fixed in torch 2.1.0 onwards.

pip install alibi-detect[torch]

The PyTorch installation is required to use the PyTorch backend for the following detectors:

Note

pip install alibi-detect[tensorflow]

The TensorFlow installation is required to use the TensorFlow backend for the following detectors:

The TensorFlow installation is required to use the following detectors:

pip install alibi-detect[keops]

The KeOps installation is required to use the KeOps backend for the following detectors:

Note

pip install alibi-detect[prophet]

conda-forge

To install the conda-forge version it is recommended to use , which can be installed to the baseconda enviroment with:

conda install mamba -n base -c conda-forge

mamba can then be used to install alibi-detect in a conda enviroment:

mamba install -c conda-forge alibi-detect

Features

is an open source Python library focused on outlier, adversarial and drift detection. The package aims to cover both online and offline detectors for tabular data, text, images and time series. TensorFlow, PyTorch and (where applicable) backends are supported for drift detection. Alibi-Detect does not install these as default. See for more details.

To get a list of respectively the latest outlier, adversarial and drift detection algorithms, you can type:

import alibi_detect
# View all the Outlier Detection (od) algorithms available
alibi_detect.od.__all__

['OutlierAEGMM',
 'IForest',
 'Mahalanobis',
 'OutlierAE',
 'OutlierVAE',
 'OutlierVAEGMM',
 'OutlierProphet',
 'OutlierSeq2Seq',
 'SpectralResidual',
 'LLR']

# View all the Adversarial Detection (ad) algorithms available
alibi_detect.ad.__all__

['AdversarialAE',
'ModelDistillation']

# View all the Concept Drift (cd) detection algorithms available
alibi_detect.cd.__all__

['ChiSquareDrift',
 'ClassifierDrift',
 'ClassifierUncertaintyDrift',
 'ContextMMDDrift',
 'CVMDrift',
 'FETDrift',
 'KSDrift',
 'LearnedKernelDrift',
 'LSDDDrift',
 'LSDDDriftOnline',
 'MMDDrift',
 'MMDDriftOnline',
 'RegressorUncertaintyDrift',
 'SpotTheDiffDrift',
 'TabularDrift']

Summary tables highlighting the practical use cases for all the algorithms can be found .

For detailed information on the outlier detectors:

Similar for adversarial detection:

And data drift:

Basic Usage

We will use the to illustrate the usage of outlier and adversarial detectors in alibi-detect.

First, we import the detector:

from alibi_detect.od import OutlierVAE

Then we initialize it by passing it the necessary arguments:

od = OutlierVAE(
    threshold=0.1,
    encoder_net=encoder_net,
    decoder_net=decoder_net,
    latent_dim=1024
)

Some detectors require an additional .fit step using training data:

od.fit(X_train)

The detectors can be saved or loaded as described in . Finally, we can make predictions on test data and detect outliers or adversarial examples.

preds = od.predict(X_test)

The exact details will vary slightly from method to method, so we encourage the reader to become familiar with the in alibi-detect.

Likelihood Ratio Outlier Detection on Genomic Sequences

Method

The perturbations are added using an independent and identical Bernoulli distribution with rate $\mu$ which substitutes a feature with one of the other possible feature values with equal probability. Each feature in the genome dataset can take 4 values (one of the ACGT nucleobases). This means that a perturbed feature is swapped with one of the other nucleobases. The generative model used in the example is a simple LSTM network.

Dataset

This notebook requires the seaborn package for visualization which can be installed via pip:

!pip install seaborn

#| scrolled: true
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, LSTM

from alibi_detect.od import LLR
from alibi_detect.datasets import fetch_genome
from alibi_detect.utils.fetching import fetch_detector
from alibi_detect.saving import save_detector, load_detector
from alibi_detect.utils.visualize import plot_roc

Load genome data

X represents the genome sequences and y whether they are outliers ($1$) or not ($0$).

(X_train, y_train), (X_val, y_val), (X_test, y_test) = \
        fetch_genome(return_X_y=True, return_labels=False)
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape)

There are no outliers in the training set and a majority of outliers (compared to the training data) in the validation and test sets:

print('Fraction of outliers in train, val and test sets: '
      '{:.2f}, {:.2f} and {:.2f}'.format(y_train.mean(), y_val.mean(), y_test.mean()))

Define model

We need to define a generative model which models the genome sequences. We follow the paper and opt for a simple LSTM. Note that we don't actually need to define the model below if we simply load the pretrained detector later on:

genome_dim = 249  # not 250 b/c we use 1->249 as input and 2->250 as target
input_dim = 4  # ACGT nucleobases
hidden_dim = 2000

inputs = Input(shape=(genome_dim,), dtype=tf.int8)
x = tf.one_hot(tf.cast(inputs, tf.int32), input_dim)
x = LSTM(hidden_dim, return_sequences=True)(x)
logits = Dense(input_dim, activation=None)(x)
model = tf.keras.Model(inputs=inputs, outputs=logits, name='LlrLSTM')

We also need to define our loss function which we can utilize to evaluate the log-likelihood for the outlier detector:

def loss_fn(y, x):
    y = tf.one_hot(tf.cast(y, tf.int32), 4)  # ACGT on-hot encoding
    return tf.nn.softmax_cross_entropy_with_logits(y, x, axis=-1)

def likelihood_fn(y, x):
    return -loss_fn(y, x)

Load or train the outlier detector

load_pretrained = True

#| scrolled: false
filepath = os.path.join(os.getcwd(), 'my_path')  # change to download directory
detector_type = 'outlier'
dataset = 'genome'
detector_name = 'LLR'
filepath = os.path.join(filepath, detector_name)
if load_pretrained:  # load pretrained outlier detector
    od = fetch_detector(filepath, detector_type, dataset, detector_name)
else:
    # initialize detector
    od = LLR(threshold=None, model=model, log_prob=likelihood_fn, sequential=True)
    
    # train
    od.fit(
        X_train,
        mutate_fn_kwargs=dict(rate=.2, feature_range=(0,3)),
        mutate_batch_size=1000,
        loss_fn=loss_fn,
        optimizer=tf.keras.optimizers.Adam(learning_rate=5e-4),
        epochs=20,
        batch_size=100,
        verbose=False
    )
    
    # save the trained outlier detector
    save_detector(od, filepath)

Compare the log likelihoods

Let's compare the log likelihoods of the inliers vs. the outlier test set data under the semantic and background models. We randomly sample $100,000$ instances from both distributions since the full test set contains $7,000,000$ genomic sequences. The histograms show that the generative model does not distinguish well between inliers and outliers.

idx_in, idx_ood = np.where(y_test == 0)[0], np.where(y_test == 1)[0]
n_in, n_ood = idx_in.shape[0], idx_ood.shape[0]
n_sample = 100000  # sample 100k inliers and outliers each
sample_in = np.random.choice(n_in, size=n_sample, replace=False)
sample_ood = np.random.choice(n_ood, size=n_sample, replace=False)
X_test_in, X_test_ood = X_test[idx_in[sample_in]], X_test[idx_ood[sample_ood]]
y_test_in, y_test_ood = y_test[idx_in[sample_in]], y_test[idx_ood[sample_ood]]
X_test_sample = np.concatenate([X_test_in, X_test_ood])
y_test_sample = np.concatenate([y_test_in, y_test_ood])
print(X_test_in.shape, X_test_ood.shape)

# semantic model
logp_s_in = od.logp_alt(od.dist_s, X_test_in, batch_size=100)
logp_s_ood = od.logp_alt(od.dist_s, X_test_ood, batch_size=100)
logp_s = np.concatenate([logp_s_in, logp_s_ood])
# background model
logp_b_in = od.logp_alt(od.dist_b, X_test_in, batch_size=100)
logp_b_ood = od.logp_alt(od.dist_b, X_test_ood, batch_size=100)
logp_b = np.concatenate([logp_b_in, logp_b_ood])

# show histograms
plt.hist(logp_s_in, bins=100, label='in');
plt.hist(logp_s_ood, bins=100, label='ood');
plt.title('Semantic Log Probabilities')
plt.legend()
plt.show()

plt.hist(logp_b_in, bins=100, label='in');
plt.hist(logp_b_ood, bins=100, label='ood');
plt.title('Background Log Probabilities')
plt.legend()
plt.show()

This is because of the background-effect which is in this case the GC-content in the genomic sequences. This effect is partially reduced when taking the likelihood ratio:

llr_in = logp_s_in - logp_b_in
llr_ood = logp_s_ood - logp_b_ood

plt.hist(llr_in, bins=100, label='in');
plt.hist(llr_ood, bins=100, label='ood');
plt.title('Likelihood Ratio')
plt.legend()
plt.show()

llr = np.concatenate([llr_in, llr_ood])
roc_data = {'LLR': {'scores': -llr, 'labels': y_test_sample}}
plot_roc(roc_data)

Detect outliers

We follow the same procedure with the outlier detector. First we need to set an outlier threshold with infer_threshold. We need to pass a batch of instances and specify what percentage of those we consider to be normal via threshold_perc. Let's assume we have a small batch of data with roughly $30$% outliers but we don't know exactly which ones.

n, frac_outlier = 1000, .3
perc_outlier = 100 * frac_outlier
n_sample_in, n_sample_ood = int(n * (1 - frac_outlier)), int(n * frac_outlier)
idx_in, idx_ood = np.where(y_val == 0)[0], np.where(y_val == 1)[0]
n_in, n_ood = idx_in.shape[0], idx_ood.shape[0]
sample_in = np.random.choice(n_in, size=n_sample_in, replace=False)
sample_ood = np.random.choice(n_ood, size=n_sample_ood, replace=False)
X_thr_in, X_thr_ood = X_val[idx_in[sample_in]], X_val[idx_ood[sample_ood]]
X_threshold = np.concatenate([X_thr_in, X_thr_ood])
print(X_threshold.shape)

od.infer_threshold(X_threshold, threshold_perc=perc_outlier, batch_size=100)
print('New threshold: {}'.format(od.threshold))

Let's save the outlier detector with updated threshold:

save_detector(od, filepath)

Let'spredict outliers on a sample of the test set:

od_preds = od.predict(X_test_sample, batch_size=100)

Display results

F1 score, accuracy, precision, recall and confusion matrix:

y_pred = od_preds['data']['is_outlier']
labels = ['normal', 'outlier']
f1 = f1_score(y_test_sample, y_pred)
acc = accuracy_score(y_test_sample, y_pred)
prec = precision_score(y_test_sample, y_pred)
rec = recall_score(y_test_sample, y_pred)
print('F1 score: {:.3f} -- Accuracy: {:.3f} -- Precision: {:.3f} '
      '-- Recall: {:.3f}'.format(f1, acc, prec, rec))
cm = confusion_matrix(y_test_sample, y_pred)
df_cm = pd.DataFrame(cm, index=labels, columns=labels)
sns.heatmap(df_cm, annot=True, cbar=True, linewidths=.5)
plt.show()

We can also plot the ROC curve based on the instance level outlier scores:

roc_data = {'LLR': {'scores': od_preds['data']['instance_score'], 'labels': y_test_sample}}
plot_roc(roc_data)

AEGMM and VAEGMM outlier detection on KDD Cup ‘99 dataset

Method

Dataset

The outlier detector needs to detect computer network intrusions using TCP dump data for a local-area network (LAN) simulating a typical U.S. Air Force LAN. A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol. Each connection is labeled as either normal, or as an attack.

There are 4 types of attacks in the dataset:

DOS: denial-of-service, e.g. syn flood;
R2L: unauthorized access from a remote machine, e.g. guessing password;
U2R: unauthorized access to local superuser (root) privileges;
probing: surveillance and other probing, e.g., port scanning.

The dataset contains about 5 million connection records.

There are 3 types of features:

basic features of individual connections, e.g. duration of connection
content features within a connection, e.g. number of failed log in attempts
traffic features within a 2 second window, e.g. number of connections to the same host as the current connection

This notebook requires the seaborn package for visualization which can be installed via pip:

!pip install seaborn

import os
import logging
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix, f1_score
import tensorflow as tf
tf.keras.backend.clear_session()
from tensorflow.keras.layers import Dense, InputLayer

from alibi_detect.datasets import fetch_kdd
from alibi_detect.models.tensorflow import eucl_cosim_features
from alibi_detect.od import OutlierAEGMM, OutlierVAEGMM
from alibi_detect.utils.data import create_outlier_batch
from alibi_detect.utils.fetching import fetch_detector
from alibi_detect.saving import save_detector, load_detector
from alibi_detect.utils.visualize import plot_instance_score, plot_feature_outlier_tabular, plot_roc

logger = tf.get_logger()
logger.setLevel(logging.ERROR)

Load dataset

We only keep a number of continuous (18 out of 41) features.

kddcup = fetch_kdd(percent10=True)  # only load 10% of the dataset
print(kddcup.data.shape, kddcup.target.shape)

Assume that a model is trained on normal instances of the dataset (not outliers) and standardization is applied:

np.random.seed(0)
normal_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=400000, perc_outlier=0)
X_train, y_train = normal_batch.data.astype('float32'), normal_batch.target
print(X_train.shape, y_train.shape)
print('{}% outliers'.format(100 * y_train.mean()))

mean, stdev = X_train.mean(axis=0), X_train.std(axis=0)

Apply standardization:

X_train = (X_train - mean) / stdev

Load or define AEGMM outlier detector

load_outlier_detector = True

filepath = 'my_path'  # change to directory (absolute path) where model is downloaded
detector_type = 'outlier'
dataset = 'kddcup'
detector_name = 'OutlierAEGMM'
filepath = os.path.join(filepath, detector_name)
if load_outlier_detector:  # load pretrained outlier detector
    od = fetch_detector(filepath, detector_type, dataset, detector_name)
else:  # define model, initialize, train and save outlier detector
    # the model defined here is similar to the one defined in the original paper
    n_features = X_train.shape[1]
    latent_dim = 1
    n_gmm = 2  # nb of components in GMM

    encoder_net = tf.keras.Sequential(
    [
        InputLayer(input_shape=(n_features,)),
        Dense(60, activation=tf.nn.tanh),
        Dense(30, activation=tf.nn.tanh),
        Dense(10, activation=tf.nn.tanh),
        Dense(latent_dim, activation=None)
    ])

    decoder_net = tf.keras.Sequential(
    [
        InputLayer(input_shape=(latent_dim,)),
        Dense(10, activation=tf.nn.tanh),
        Dense(30, activation=tf.nn.tanh),
        Dense(60, activation=tf.nn.tanh),
        Dense(n_features, activation=None)
    ])

    gmm_density_net = tf.keras.Sequential(
    [
        InputLayer(input_shape=(latent_dim + 2,)),
        Dense(10, activation=tf.nn.tanh),
        Dense(n_gmm, activation=tf.nn.softmax)
    ])
    
    # initialize outlier detector
    od = OutlierAEGMM(threshold=None,  # threshold for outlier score
                      encoder_net=encoder_net,         # can also pass AEGMM model instead
                      decoder_net=decoder_net,         # of separate encoder, decoder
                      gmm_density_net=gmm_density_net, # and gmm density net
                      n_gmm=n_gmm,
                      recon_features=eucl_cosim_features)  # fn used to derive features
                                                           # from the reconstructed
                                                           # instances based on cosine 
                                                           # similarity and Eucl distance 
    
    # train
    od.fit(X_train,
           epochs=50,
           batch_size=1024,
           verbose=True)
    
    # save the trained outlier detector
    save_detector(od, filepath)

The warning tells us we still need to set the outlier threshold. This can be done with the infer_threshold method. We need to pass a batch of instances and specify what percentage of those we consider to be normal via threshold_perc. Let's assume we have some data which we know contains around 5% outliers. The percentage of outliers can be set with perc_outlier in the create_outlier_batch function.

np.random.seed(0)
perc_outlier = 5
threshold_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=1000, perc_outlier=perc_outlier)
X_threshold, y_threshold = threshold_batch.data.astype('float32'), threshold_batch.target
X_threshold = (X_threshold - mean) / stdev
print('{}% outliers'.format(100 * y_threshold.mean()))

od.infer_threshold(X_threshold, threshold_perc=100-perc_outlier)
print('New threshold: {}'.format(od.threshold))

Save outlier detector with updated threshold:

save_detector(od, filepath)

Detect outliers

We now generate a batch of data with 10% outliers and detect the outliers in the batch.

np.random.seed(1)
outlier_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=1000, perc_outlier=10)
X_outlier, y_outlier = outlier_batch.data.astype('float32'), outlier_batch.target
X_outlier = (X_outlier - mean) / stdev
print(X_outlier.shape, y_outlier.shape)
print('{}% outliers'.format(100 * y_outlier.mean()))

Predict outliers:

od_preds = od.predict(X_outlier, return_instance_score=True)

Display results

F1 score and confusion matrix:

labels = outlier_batch.target_names
y_pred = od_preds['data']['is_outlier']
f1 = f1_score(y_outlier, y_pred)
print('F1 score: {:.4f}'.format(f1))
cm = confusion_matrix(y_outlier, y_pred)
df_cm = pd.DataFrame(cm, index=labels, columns=labels)
sns.heatmap(df_cm, annot=True, cbar=True, linewidths=.5)
plt.show()

Plot instance level outlier scores vs. the outlier threshold:

plot_instance_score(od_preds, y_outlier, labels, od.threshold, ylim=(None, None))

We can also plot the ROC curve for the outlier scores of the detector:

roc_data = {'AEGMM': {'scores': od_preds['data']['instance_score'], 'labels': y_outlier}}
plot_roc(roc_data)

Investigate results

We can visualize the encodings of the instances in the latent space and the features derived from the instance reconstructions by the decoder. The encodings and features are then fed into the GMM density network.

enc = od.aegmm.encoder(X_outlier)  # encoding
X_recon = od.aegmm.decoder(enc)  # reconstructed instances
recon_features = od.aegmm.recon_features(X_outlier, X_recon)  # reconstructed features

df = pd.DataFrame(dict(enc=enc[:, 0].numpy(), 
                       cos=recon_features[:, 0].numpy(), 
                       eucl=recon_features[:, 1].numpy(), 
                       label=y_outlier))

groups = df.groupby('label')
fig, ax = plt.subplots()
for name, group in groups:
    ax.plot(group.enc, group.cos, marker='o', 
            linestyle='', ms=6, label=labels[name])
plt.title('Encoding vs. Cosine Similarity')
plt.xlabel('Encoding')
plt.ylabel('Cosine Similarity')
ax.legend()
plt.show()

fig, ax = plt.subplots()
for name, group in groups:
    ax.plot(group.enc, group.eucl, marker='o', 
            linestyle='', ms=6, label=labels[name])
plt.title('Encoding vs. Relative Euclidean Distance')
plt.xlabel('Encoding')
plt.ylabel('Relative Euclidean Distance')
ax.legend()
plt.show()

A lot of the outliers are already separated well in the latent space.

Use VAEGMM outlier detector

load_outlier_detector = True

filepath = 'my_path'  # change to directory (absolute path) where model is downloaded
detector_type = 'outlier'
dataset = 'kddcup'
detector_name = 'OutlierVAEGMM'
filepath = os.path.join(filepath, detector_name)
if load_outlier_detector:  # load pretrained outlier detector
    od = fetch_detector(filepath, detector_type, dataset, detector_name)
else:  # define model, initialize, train and save outlier detector
    # the model defined here is similar to the one defined in
    # the OutlierVAE notebook
    n_features = X_train.shape[1]
    latent_dim = 2
    n_gmm = 2

    encoder_net = tf.keras.Sequential(
    [
        InputLayer(input_shape=(n_features,)),
        Dense(20, activation=tf.nn.relu),
        Dense(15, activation=tf.nn.relu),
        Dense(7, activation=tf.nn.relu)
    ])

    decoder_net = tf.keras.Sequential(
    [
        InputLayer(input_shape=(latent_dim,)),
        Dense(7, activation=tf.nn.relu),
        Dense(15, activation=tf.nn.relu),
        Dense(20, activation=tf.nn.relu),
        Dense(n_features, activation=None)
    ])

    gmm_density_net = tf.keras.Sequential(
    [
        InputLayer(input_shape=(latent_dim + 2,)),
        Dense(10, activation=tf.nn.relu),
        Dense(n_gmm, activation=tf.nn.softmax)
    ])
    
    
    # initialize outlier detector
    od = OutlierVAEGMM(threshold=None,
                       encoder_net=encoder_net,
                       decoder_net=decoder_net,
                       gmm_density_net=gmm_density_net,
                       n_gmm=n_gmm,
                       latent_dim=latent_dim,
                       samples=10,
                       recon_features=eucl_cosim_features)
    
    # train
    od.fit(X_train,
           epochs=50,
           batch_size=1024,
           cov_elbo=dict(sim=.0025),  # standard deviation assumption
           verbose=True)           # for elbo training
    
    # save the trained outlier detector
    save_detector(od, filepath)

Need to infer the threshold again:

od.infer_threshold(X_threshold, threshold_perc=100-perc_outlier)
print('New threshold: {}'.format(od.threshold))

Save outlier detector with updated threshold:

save_detector(od, filepath)

Detect outliers and display results

Predict:

od_preds = od.predict(X_outlier, return_instance_score=True)

F1 score and confusion matrix:

labels = outlier_batch.target_names
y_pred = od_preds['data']['is_outlier']
f1 = f1_score(y_outlier, y_pred)
print('F1 score: {:.4f}'.format(f1))
cm = confusion_matrix(y_outlier, y_pred)
df_cm = pd.DataFrame(cm, index=labels, columns=labels)
sns.heatmap(df_cm, annot=True, cbar=True, linewidths=.5)
plt.show()

Plot instance level outlier scores vs. the outlier threshold:

plot_instance_score(od_preds, y_outlier, labels, od.threshold, ylim=(None, None))

You can zoom in by adjusting the min and max values in ylim. We can also compare the VAEGMM ROC curve with AEGMM:

roc_data['VAEGMM'] = {'scores': od_preds['data']['instance_score'], 'labels': y_outlier}
plot_roc(roc_data)

VAE outlier detection for income prediction

Method

Dataset

!pip install alibi seaborn

import os
import alibi
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
from sklearn.preprocessing import OneHotEncoder
import tensorflow as tf
tf.keras.backend.clear_session()
from tensorflow.keras.layers import Dense, InputLayer

from alibi_detect.od import OutlierVAE
from alibi_detect.utils.perturbation import inject_outlier_tabular
from alibi_detect.utils.fetching import fetch_detector
from alibi_detect.saving import save_detector, load_detector
from alibi_detect.utils.visualize import plot_instance_score

def set_seed(s=0):
    np.random.seed(s)
    tf.random.set_seed(s)

Load adult dataset

The fetch_adult function returns a Bunch object containing the features, the targets, the feature names and a mapping of the categories in each categorical variable.

adult = alibi.datasets.fetch_adult()
X, y = adult.data, adult.target
feature_names = adult.feature_names
category_map_tmp = adult.category_map

Shuffle data:

set_seed(0)
Xy_perm = np.random.permutation(np.c_[X, y])
X, y = Xy_perm[:,:-1], Xy_perm[:,-1]

Reorganize data so categorical features come first, remove some features and adjust feature_names and category_map accordingly:

keep_cols = [2, 3, 5, 0, 8, 9, 10]
feature_names = feature_names[2:4] + feature_names[5:6] + feature_names[0:1] + feature_names[8:11]
print(feature_names)

X = X[:, keep_cols]
print(X.shape)

category_map = {}
i = 0
for k, v in category_map_tmp.items():
    if k in keep_cols:
        category_map[i] = v
        i += 1

Preprocess data

Normalize numerical features or scale numerical between -1 and 1:

minmax = False

X_num = X[:, -4:].astype(np.float32, copy=False)
if minmax:
    xmin, xmax = X_num.min(axis=0), X_num.max(axis=0)
    rng = (-1., 1.)
    X_num_scaled = (X_num - xmin) / (xmax - xmin) * (rng[1] - rng[0]) + rng[0]
else:  # normalize
    mu, sigma = X_num.mean(axis=0), X_num.std(axis=0)
    X_num_scaled = (X_num - mu) / sigma

Fit OHE to categorical variables:

X_cat = X[:, :-4].copy()
ohe = OneHotEncoder(categories='auto')
ohe.fit(X_cat)

Combine numerical and categorical data:

X = np.c_[X_cat, X_num_scaled].astype(np.float32, copy=False)

Define train, validation (to find outlier threshold) and test set:

n_train = 25000
n_valid = 5000
X_train, y_train = X[:n_train,:], y[:n_train]
X_valid, y_valid = X[n_train:n_train+n_valid,:], y[n_train:n_train+n_valid]
X_test, y_test = X[n_train+n_valid:,:], y[n_train+n_valid:]
print(X_train.shape, y_train.shape,
      X_valid.shape, y_valid.shape,
      X_test.shape, y_test.shape)

Create outliers

Inject outliers in the numerical features. First we need to know the features for each kind:

cat_cols = list(category_map.keys())
num_cols = [col for col in range(X.shape[1]) if col not in cat_cols]
print(cat_cols, num_cols)

Numerical

Now we can add outliers to the validation (or threshold) and test sets. For the numerical data, we need to specify the numerical columns (cols), the percentage of outliers (perc_outlier), the strength (n_std) and the minimum size of the perturbation (min_std). The outliers are distributed evenly across the numerical features:

perc_outlier = 10
data = inject_outlier_tabular(X_valid, num_cols, perc_outlier, n_std=8., min_std=6.)
X_threshold, y_threshold = data.data, data.target
X_threshold_, y_threshold_ = X_threshold.copy(), y_threshold.copy()  # store for comparison later
outlier_perc = 100 * y_threshold.sum() / len(y_threshold)
print('{:.2f}% outliers'.format(outlier_perc))

Let's inspect an instance that was changed:

outlier_idx = np.where(y_threshold != 0)[0]
vdiff = X_threshold[outlier_idx[0]] - X_valid[outlier_idx[0]]
fdiff = np.where(vdiff != 0)[0]
print('{} changed by {:.2f}.'.format(feature_names[fdiff[0]], vdiff[fdiff[0]]))

Same thing for the test set:

data = inject_outlier_tabular(X_test, num_cols, perc_outlier, n_std=8., min_std=6.)
X_outlier, y_outlier = data.data, data.target
print('{:.2f}% outliers'.format(100 * y_outlier.sum() / len(y_outlier)))

Apply one-hot encoding

OHE to train, threshold and outlier sets:

X_train_ohe = ohe.transform(X_train[:, :-4].copy())
X_threshold_ohe = ohe.transform(X_threshold[:, :-4].copy())
X_outlier_ohe = ohe.transform(X_outlier[:, :-4].copy())
print(X_train_ohe.shape, X_threshold_ohe.shape, X_outlier_ohe.shape)

X_train = np.c_[X_train_ohe.toarray(), X_train[:, -4:]].astype(np.float32, copy=False)
X_threshold = np.c_[X_threshold_ohe.toarray(), X_threshold[:, -4:]].astype(np.float32, copy=False)
X_outlier = np.c_[X_outlier_ohe.toarray(), X_outlier[:, -4:]].astype(np.float32, copy=False)
print(X_train.shape, X_threshold.shape, X_outlier.shape)

Load or define outlier detector

load_outlier_detector = True

filepath = './models/'  # change to directory where model is downloaded
if load_outlier_detector:  # load pretrained outlier detector
    detector_type = 'outlier'
    dataset = 'adult'
    detector_name = 'OutlierVAE'
    od = fetch_detector(filepath, detector_type, dataset, detector_name)
else:  # define model, initialize, train and save outlier detector
    n_features = X_train.shape[1]
    latent_dim = 2

    encoder_net = tf.keras.Sequential(
      [
          InputLayer(input_shape=(n_features,)),
          Dense(25, activation=tf.nn.relu),
          Dense(10, activation=tf.nn.relu),
          Dense(5, activation=tf.nn.relu)
      ])

    decoder_net = tf.keras.Sequential(
      [
          InputLayer(input_shape=(latent_dim,)),
          Dense(5, activation=tf.nn.relu),
          Dense(10, activation=tf.nn.relu),
          Dense(25, activation=tf.nn.relu),
          Dense(n_features, activation=None)
      ])

    # initialize outlier detector
    od = OutlierVAE(threshold=None,  # threshold for outlier score
                    score_type='mse',  # use MSE of reconstruction error for outlier detection
                    encoder_net=encoder_net,  # can also pass VAE model instead
                    decoder_net=decoder_net,  # of separate encoder and decoder
                    latent_dim=latent_dim,
                    samples=5)

    # train
    od.fit(X_train,
           loss_fn=tf.keras.losses.mse,
           epochs=5,
           verbose=True)

    # save the trained outlier detector
    save_detector(od, filepath)

od.infer_threshold(X_threshold, threshold_perc=100-outlier_perc, outlier_perc=100)
print('New threshold: {}'.format(od.threshold))

Let’s save the outlier detector with updated threshold:

save_detector(od, filepath)

Detect outliers

od_preds = od.predict(X_outlier,
                      outlier_type='instance',
                      return_feature_score=True,
                      return_instance_score=True)

Display results

F1 score and confusion matrix:

labels = data.target_names
y_pred = od_preds['data']['is_outlier']
f1 = f1_score(y_outlier, y_pred)
acc = accuracy_score(y_outlier, y_pred)
prec = precision_score(y_outlier, y_pred)
rec = recall_score(y_outlier, y_pred)
print('F1 score: {:.2f} -- Accuracy: {:.2f} -- Precision: {:.2f} -- Recall: {:.2f}'.format(f1, acc, prec, rec))
cm = confusion_matrix(y_outlier, y_pred)
df_cm = pd.DataFrame(cm, index=labels, columns=labels)
sns.heatmap(df_cm, annot=True, cbar=True, linewidths=.5)
plt.show()

Plot instance level outlier scores vs. the outlier threshold:

plot_instance_score(od_preds, y_outlier.astype(int), labels, od.threshold, ylim=(0, 25))

AE outlier detection on CIFAR10

Method

The Auto-Encoder (AE) outlier detector is first trained on a batch of unlabeled, but normal (inlier) data. Unsupervised training is desireable since labeled data is often scarce. The AE detector tries to reconstruct the input it receives. If the input data cannot be reconstructed well, the reconstruction error is high and the data can be flagged as an outlier. The reconstruction error is measured as the mean squared error (MSE) between the input and the reconstructed instance.

Dataset

import logging
import matplotlib.pyplot as plt
import numpy as np
import os
import tensorflow as tf
tf.keras.backend.clear_session()
from tensorflow.keras.layers import Conv2D, Conv2DTranspose, \
    Dense, Layer, Reshape, InputLayer, Flatten
from tqdm import tqdm

from alibi_detect.od import OutlierAE
from alibi_detect.utils.fetching import fetch_detector
from alibi_detect.utils.perturbation import apply_mask
from alibi_detect.saving import save_detector, load_detector
from alibi_detect.utils.visualize import plot_instance_score, plot_feature_outlier_image

logger = tf.get_logger()
logger.setLevel(logging.ERROR)

Load CIFAR10 data

train, test = tf.keras.datasets.cifar10.load_data()
X_train, y_train = train
X_test, y_test = test

X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

Load or define outlier detector

load_outlier_detector = True

#| scrolled: true
filepath = 'my_path'  # change to (absolute) directory where model is downloaded
detector_type = 'outlier'
dataset = 'cifar10'
detector_name = 'OutlierAE'
filepath = os.path.join(filepath, detector_name)
if load_outlier_detector:  # load pretrained outlier detector
    od = fetch_detector(filepath, detector_type, dataset, detector_name)
else:  # define model, initialize, train and save outlier detector
    encoding_dim = 1024
    
    encoder_net = tf.keras.Sequential(
      [
          InputLayer(input_shape=(32, 32, 3)),
          Conv2D(64, 4, strides=2, padding='same', activation=tf.nn.relu),
          Conv2D(128, 4, strides=2, padding='same', activation=tf.nn.relu),
          Conv2D(512, 4, strides=2, padding='same', activation=tf.nn.relu),
          Flatten(),
          Dense(encoding_dim,)
      ])

    decoder_net = tf.keras.Sequential(
      [
          InputLayer(input_shape=(encoding_dim,)),
          Dense(4*4*128),
          Reshape(target_shape=(4, 4, 128)),
          Conv2DTranspose(256, 4, strides=2, padding='same', activation=tf.nn.relu),
          Conv2DTranspose(64, 4, strides=2, padding='same', activation=tf.nn.relu),
          Conv2DTranspose(3, 4, strides=2, padding='same', activation='sigmoid')
      ])
    
    # initialize outlier detector
    od = OutlierAE(threshold=.015,  # threshold for outlier score
                    encoder_net=encoder_net,  # can also pass AE model instead
                    decoder_net=decoder_net,  # of separate encoder and decoder
                    )
    # train
    od.fit(X_train,
           epochs=50,
           verbose=True)
    
    # save the trained outlier detector
    save_detector(od, filepath)

Check quality AE model

idx = 8
X = X_train[idx].reshape(1, 32, 32, 3)
X_recon = od.ae(X)

plt.imshow(X.reshape(32, 32, 3))
plt.axis('off')
plt.show()

plt.imshow(X_recon.numpy().reshape(32, 32, 3))
plt.axis('off')
plt.show()

Check outliers on original CIFAR images

X = X_train[:500]
print(X.shape)

od_preds = od.predict(X,
                      outlier_type='instance',    # use 'feature' or 'instance' level
                      return_feature_score=True,  # scores used to determine outliers
                      return_instance_score=True)
print(list(od_preds['data'].keys()))

Plot instance level outlier scores

target = np.zeros(X.shape[0],).astype(int)  # all normal CIFAR10 training instances
labels = ['normal', 'outlier']
plot_instance_score(od_preds, target, labels, od.threshold)

Visualize predictions

X_recon = od.ae(X).numpy()
plot_feature_outlier_image(od_preds, 
                           X, 
                           X_recon=X_recon,
                           instance_ids=[8, 60, 100, 330],  # pass a list with indices of instances to display
                           max_instances=5,  # max nb of instances to display
                           outliers_only=False)  # only show outlier predictions

Predict outliers on perturbed CIFAR images

We perturb CIFAR images by adding random noise to patches (masks) of the image. For each mask size in n_mask_sizes, sample n_masks and apply those to each of the n_imgs images. Then we predict outliers on the masked instances:

# nb of predictions per image: n_masks * n_mask_sizes 
n_mask_sizes = 10
n_masks = 20
n_imgs = 50

Define masks and get images:

mask_sizes = [(2*n,2*n) for n in range(1,n_mask_sizes+1)]
print(mask_sizes)
img_ids = np.arange(n_imgs)
X_orig = X[img_ids].reshape(img_ids.shape[0], 32, 32, 3)
print(X_orig.shape)

Calculate instance level outlier scores:

#| scrolled: true
all_img_scores = []
for i in tqdm(range(X_orig.shape[0])):
    img_scores = np.zeros((len(mask_sizes),))
    for j, mask_size in enumerate(mask_sizes):
        # create masked instances
        X_mask, mask = apply_mask(X_orig[i].reshape(1, 32, 32, 3),
                                  mask_size=mask_size,
                                  n_masks=n_masks,
                                  channels=[0,1,2],
                                  mask_type='normal',
                                  noise_distr=(0,1),
                                  clip_rng=(0,1))
        # predict outliers
        od_preds_mask = od.predict(X_mask)
        score = od_preds_mask['data']['instance_score']
        # store average score over `n_masks` for a given mask size
        img_scores[j] = np.mean(score)
    all_img_scores.append(img_scores)

Visualize outlier scores vs. mask sizes

x_plt = [mask[0] for mask in mask_sizes]

for ais in all_img_scores:
    plt.plot(x_plt, ais)
    plt.xticks(x_plt)
plt.title('Outlier Score All Images for Increasing Mask Size')
plt.xlabel('Mask size')
plt.ylabel('Outlier Score')
plt.show()

ais_np = np.zeros((len(all_img_scores), all_img_scores[0].shape[0]))
for i, ais in enumerate(all_img_scores):
    ais_np[i, :] = ais
ais_mean = np.mean(ais_np, axis=0)
plt.title('Mean Outlier Score All Images for Increasing Mask Size')
plt.xlabel('Mask size')
plt.ylabel('Outlier score')
plt.plot(x_plt, ais_mean)
plt.xticks(x_plt)
plt.show()

Investigate instance level outlier

i = 8  # index of instance to look at

plt.plot(x_plt, all_img_scores[i])
plt.xticks(x_plt)
plt.title('Outlier Scores Image {} for Increasing Mask Size'.format(i))
plt.xlabel('Mask size')
plt.ylabel('Outlier score')
plt.show()

Reconstruction of masked images and outlier scores per channel:

#| scrolled: true
all_X_mask = []
X_i = X_orig[i].reshape(1, 32, 32, 3)
all_X_mask.append(X_i)
# apply masks
for j, mask_size in enumerate(mask_sizes):
    # create masked instances
    X_mask, mask = apply_mask(X_i,
                              mask_size=mask_size,
                              n_masks=1,  # just 1 for visualization purposes
                              channels=[0,1,2],
                              mask_type='normal',
                              noise_distr=(0,1),
                              clip_rng=(0,1))
    all_X_mask.append(X_mask)
all_X_mask = np.concatenate(all_X_mask, axis=0)
all_X_recon = od.ae(all_X_mask).numpy()
od_preds = od.predict(all_X_mask)

Visualize:

plot_feature_outlier_image(od_preds, 
                           all_X_mask, 
                           X_recon=all_X_recon, 
                           max_instances=all_X_mask.shape[0], 
                           n_channels=3)

Predict outliers on a subset of features

The sensitivity of the outlier detector can not only be controlled via the threshold, but also by selecting the percentage of the features used for the instance level outlier score computation. For instance, we might want to flag outliers if 40% of the features (pixels for images) have an average outlier score above the threshold. This is possible via the outlier_perc argument in the predict function. It specifies the percentage of the features that are used for outlier detection, sorted in descending outlier score order.

perc_list = [20, 40, 60, 80, 100]

all_perc_scores = []
for perc in perc_list:
    od_preds_perc = od.predict(all_X_mask, outlier_perc=perc)
    iscore = od_preds_perc['data']['instance_score']
    all_perc_scores.append(iscore)

Visualize outlier scores vs. mask sizes and percentage of features used:

x_plt = [0] + x_plt
for aps in all_perc_scores:
    plt.plot(x_plt, aps)
    plt.xticks(x_plt)
plt.legend(perc_list)
plt.title('Outlier Score for Increasing Mask Size and Different Feature Subsets')
plt.xlabel('Mask Size')
plt.ylabel('Outlier Score')
plt.show()

Infer outlier threshold value

Finding good threshold values can be tricky since they are typically not easy to interpret. The infer_threshold method helps finding a sensible value. We need to pass a batch of instances X and specify what percentage of those we consider to be normal via threshold_perc.

print('Current threshold: {}'.format(od.threshold))
od.infer_threshold(X, threshold_perc=99)  # assume 1% of the training data are outliers
print('New threshold: {}'.format(od.threshold))

Drift Detection

1. What is drift?

Although powerful, modern machine learning models can be sensitive. Seemingly subtle changes in a data distribution can destroy the performance of otherwise state-of-the art models, which can be especially problematic when ML models are deployed in production. Typically, ML models are tested on held out data in order to estimate their future performance. Crucially, this assumes that the process underlying the input data $\mathbf{X}$ and output data $\mathbf{Y}$ remains constant.

Drift is said to occur when the process underlying $\mathbf{X}$ and $\mathbf{Y}$ at test time differs from the process that generated the training data. In this case, we can no longer expect the model’s performance on test data to match that observed on held out training data. At test time we always observe features $\mathbf{X}$, and the ground truth then refers to a corresponding label $\mathbf{Y}$. If ground truths are available at test time, supervised drift detection can be performed, with the model’s predictive performance monitored directly. However, in many scenarios, such as the binary classification example below, ground truths are not available and unsupervised drift detection methods are required.

To explore the different types of drift, consider the common scenario where we deploy a model $f: \boldsymbol{x} \mapsto y$ on input data $\mathbf{X}$ and output data $\mathbf{Y}$, jointly distributed according to $P(\mathbf{X},\mathbf{Y})$. The model is trained on training data drawn from a distribution $P_{ref}(\mathbf{X},\mathbf{Y})$. Drift is said to have occurred when $P(\mathbf{X},\mathbf{Y}) \ne P_{ref}(\mathbf{X},\mathbf{Y})$. Writing the joint distribution as

we can classify drift under a number of types:

Covariate drift: Also referred to as input drift, this occurs when the distribution of the input data has shifted $P(\mathbf{X}) \ne P_{ref}(\mathbf{X})$, whilst $P(\mathbf{Y}|\mathbf{X})$ = $P_{ref}(\mathbf{Y}|\mathbf{X})$. This may result in the model giving unreliable predictions.
Prior drift: Also referred to as label drift, this occurs when the distribution of the outputs has shifted $P(\mathbf{Y}) \ne P_{ref}(\mathbf{Y})$, whilst $P(\mathbf{X}|\mathbf{Y})=P_{ref}(\mathbf{X}|\mathbf{Y})$. This can affect the model's decision boundary, as well as the model's performance metrics.
Concept drift: This occurs when the process generating $y$ from $x$ has changed, such that $P(\mathbf{Y}|\mathbf{X}) \ne P_{ref}(\mathbf{Y}|\mathbf{X})$. It is possible that the model might no longer give a suitable approximation of the true process.

Below, the different types of drift are visualised for a simple two-dimensional classification problem. It is possible for a drift to fall under more than one category, for example the prior drift below also happens to be a case of covariate drift.

:align: center
:alt: 2D drift example

It is relatively easy to spot drift by eyeballing these figures here. However, the task becomes considerably harder for high-dimensional real problems, especially since real-time ground truths are not typically available. Some types of drift, such as prior and concept drift, are especially difficult to detect without access to ground truths. As a workaround proxies are required, for example a model’s predictions can be monitored to check for prior drift.

2. Detecting drift

Due to natural randomness in the process being modelled, we don’t necessarily expect observations $\mathbf{z}1,\dots,\mathbf{z}N$ drawn from $P(\mathbf{z})$ to be identical to $\mathbf{z}^{ref}1,\dots,\mathbf{z}^{ref}M$ drawn from $P{ref}(\mathbf{z})$. To decide whether differences between $P(\mathbf{z})$ and $P{ref}(\mathbf{z})$ are due to drift or just natural randomness in the data, statistical two-sample hypothesis testing is used, with the null hypothesis $P(\mathbf{z})=P{ref}(\mathbf{z})$. If the $p$-value obtained is below a given threshold, the null is rejected and the alternative hypothesis $P(\mathbf{z}) \ne P{ref}(\mathbf{z})$ is accepted, suggesting drift is occurring.

Since $\mathbf{z}$ is often high-dimensional (even a 200 x 200 greyscale image has 40k dimensions!), performing hypothesis testing in the full-space is often either computationally intractable, or unsuitable for the chosen statistical test. Instead, the pipeline below is often used, with dimension reduction as a pre-processing step.

:::{figure} images/drift_pipeline.png :align: center :alt: Drift detection pipeline

Figure inspired by Figure 1 in Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. :::

Hypothesis testing

Hypothesis testing involves first choosing a test statistic $S(\mathbf{z})$, which is expected to be small if the null hypothesis $H_0$ is true, and large if the alternative $H_a$ is true. For observed data $\mathbf{z}$, $S(\mathbf{z})$ is computed, followed by a $p$-value $\hat{p} = P(\text{such an extreme } S(\mathbf{z}) | H_0)$. In other words, $\hat{p}$ represents the probability of such an extreme value of $S(\mathbf{z})$ occurring given that $H_0$ is true. When $\hat{p}\le \alpha$, results are said to be statistically significant, and the null $P(\mathbf{z})=P_{ref}(\mathbf{z})$ is rejected. Conveniently, the threshold $\alpha$ represents the desired false positive rate.

Univariate:
Multivariate:

Dimension reduction

Given an input dataset $\mathbf{X}\in \mathbb{R}^{N\times d}$, where $N$ is the number of observations and $d$ the number of dimensions, the aim is to reduce the data dimensionality from $d$ to $K$, where $K\ll d$. A drift detector can then be applied to the lower dimensional data $\hat{\mathbf{X}}\in \mathbb{R}^{N\times K}$, where distances more meaningfully capture notions of similarity/dissimilarity between instances. Dimension reduction approaches can be broadly categorised under:

Linear projections
Non-linear projections
Feature maps (from ML model)
Model uncertainty

Linear projections

pca = PCA(2)
pca.fit(X_train)
detector = MMDDrift(X_ref, backend='tensorflow', p_val=.05, preprocess_fn=pca.transform)

:::{admonition} Note 1: Disjoint training and reference data sets

Non-linear projections

encoder_net = torch.nn.Sequential(...)
preprocess_fn = partial(preprocess_drift, model=encoder_net, batch_size=512)
detector = MMDDrift(X_ref, backend='pytorch', p_val=.05, preprocess_fn=preprocess_fn)

Feature maps

:::{figure} images/BBSD.png :align: center :alt: Black box shift detection

Figure inspired by this MNIST classification example from the timeserio package. :::

clf = fetch_tf_model('cifar10', 'resnet32')
preprocess_fn = partial(preprocess_drift, model=HiddenOutput(clf, layer=-1), batch_size=128)
detector = MMDDrift(X_ref, backend='tensorflow', p_val=.05,preprocess_fn=preprocess_fn)

Model uncertainty

The model uncertainty-based detectors are classed under the dimension reduction category since a model's uncertainty is by definition one-dimensional. However, the syntax for the uncertainty-based detectors is different to the other detectors. Instead of passing a pre-processing step to a detector via a preprocess_fn (or similar) argument, the dimension reduction (in this case computing a notion of uncertainty) is performed internally by these detectors.

reg =  # pytorch regression model with at least 1 dropout layer
detector = RegressorUncertaintyDrift(x_ref, reg, backend='pytorch', 
                                     p_val=.05, uncertainty_type='mc_dropout')

Input preprocessing

Dimension reduction is a common preprocessing task (e.g. for covariate drift detection on tabular or image data), but some modalities of data (e.g. text and graph data) require other forms of preprocessing in order for drift detection to be performed effectively.

Text data

When dealing with text data, performing drift detection on raw strings or tokenized data is not effective since they don’t represent the semantics of the input. Instead, we extract contextual embeddings from language transformer models and detect drift on those. This procedure has a significant impact on the type of drift we detect. Strictly speaking we are not detecting covariate/input drift anymore since the entire training procedure (objective function, training data etc) for the (pre)trained embeddings has an impact on the embeddings we extract.

:::{figure} images/BERT.png :align: center :alt: The DistilBERT language representation model

Graph data

:align: center
:alt: A graph embedding
:width: 550px

Simple example

The true model/process is defined as:

where the slope $s$ is set as $s=-1$.

def true_model(X,slope=-1):
    z = slope*X[:,0]
    idx = np.argwhere(X[:,1]>z)
    y = np.zeros(X.shape[0])
    y[idx] = 1
    return y
    
true_slope = -1

The reference distribution is defined as a mixture of two Normal distributions:

with the standard deviation set at $\sigma=0.8$, and the weights set to $\phi_1=\phi_2=0.5$. Reference data $\mathbf{X}^{ref}$ and training data $\mathbf{X}^{train}$ (see Note 1) can be generated by sampling from this distribution. The corresponding labels $\mathbf{Y}^{ref}$ and $\mathbf{Y}^{train}$ are obtained by evaluating true_model().

# Reference distribution
sigma = 0.8
phi1 = 0.5
phi2 = 0.5
ref_norm_0 = multivariate_normal([-1,-1], np.eye(2)*sigma**2)
ref_norm_1 = multivariate_normal([ 1, 1], np.eye(2)*sigma**2)

# Reference data (to initialise the detectors)
N_ref = 240
X_0 = ref_norm_0.rvs(size=int(N_ref*phi1),random_state=1)
X_1 = ref_norm_1.rvs(size=int(N_ref*phi2),random_state=1)
X_ref = np.vstack([X_0, X_1])
y_ref = true_model(X_ref,true_slope)

# Training data (to train the classifer)
N_train = 240
X_0 = ref_norm_0.rvs(size=int(N_train*phi1),random_state=0)
X_1 = ref_norm_1.rvs(size=int(N_train*phi2),random_state=0)
X_train = np.vstack([X_0, X_1])
y_train = true_model(X_train,true_slope)

For a model, we choose the well-known decision tree classifier. As well as training the model, this is a good time to initialise the MMD detector with the held-out reference data $\mathbf{X}^{ref}$ by calling:

detector = MMDDrift(X_ref, backend='pytorch', p_val=.05)

The significance threshold is set at $\alpha=0.05$, meaning the detector will flag results as drift detected when the computed $p$-value is less than this i.e. $\hat{p}< \alpha$.

# Fit decision tree classifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(max_depth=20)
clf.fit(X_train, y_train)

# Plot with a pre-defined helper function
plot(X_ref,y_ref,true_slope,clf=clf)

# Classifier accuracy
print('Mean training accuracy %.2f%%' %(100*clf.score(X_ref,y_ref)))

# Fit a drift detector to the training data
from alibi_detect.cd import MMDDrift
detector = MMDDrift(X_ref, backend='pytorch', p_val=.05)

.. parsed-literal::

    Mean training accuracy 99.17%
    No GPU detected, fall back on CPU.

No drift

Before introducing drift, we first examine the case where no drift is present. We resample from the same mixture of Gaussian distributions to generate test data $\mathbf{X}$. The individual data observations are different, but the underlying distributions are unchanged, hence no true drift is present.

N_test = 120
X_0 = ref_norm_0.rvs(size=int(N_test*phi1),random_state=2)
X_1 = ref_norm_1.rvs(size=int(N_test*phi2),random_state=2)
X_test = np.vstack([X_0, X_1])

# Plot
y_test = true_model(X_test,true_slope)
plot(X_test,y_test,true_slope,clf=clf)

# Classifier accuracy
print('Mean test accuracy %.2f%%' %(100*clf.score(X_test,y_test)))

.. parsed-literal::

    Mean test accuracy 95.00%

Unsurprisingly, the model’s mean test accuracy is relatively high. To run the detector on test data the .predict() is used:

detector.predict(X_test)

.. parsed-literal::

    {'data': {'is_drift': 0,
      'distance': 0.0023595122654528344,
      'p_val': 0.30000001192092896,
      'threshold': 0.05,
      'distance_threshold': 0.008109889},
     'meta': {'name': 'MMDDriftTorch',
      'detector_type': 'offline',
      'data_type': None,
      'backend': 'pytorch'}}

Covariate and prior drift

To impose covariate drift, we apply a shift to the mean of one of the normal distributions:

shift_norm_0 = multivariate_normal([2, -4], np.eye(2)*sigma**2)
X_0 = shift_norm_0.rvs(size=N_test*phi1,random_state=2)
X_1 = ref_norm_1.rvs(size=N_test*phi2,random_state=2)
X_test = np.vstack([X_0, X_1])

# Plot
y_test = true_model(X_test,slope)
plot(X_test,y_test,slope,clf=clf)

# Classifier accuracy
print('Mean test accuracy %.2f%%' %(100*clf.score(X_test,y_test)))

# Check for drift in covariates
pred = detector.predict(X_test)
labels = ['No','Yes']
print('Is drift? %s!' %labels[pred['data']['is_drift']])

.. parsed-literal::

    Mean test accuracy 66.67%
    Is drift? Yes!

The test data has drifted into a previously unseen region of feature space, and the model is now misclassifying a number of test observations. If true test labels are available, this is easily detectable by monitoring the test accuracy. However, labels are not always available at test time, in which case a drift detector monitoring the covariates comes in handy. In this case, the MMD detector successfully detects the covariate drift.

In a similar manner, a proxy for prior drift can be monitored by initialising a detector on labels from the reference set, and then feeding it a model’s predicted labels:

label_detector = MMDDrift(y_ref.reshape(-1,1), backend='tensorflow', p_val=.05)
y_pred = clf.predict(X_test)
label_detector.predict(y_pred.reshape(-1,1))

3. Learned drift detection

It can often be challenging to specify a test statistic $S(\mathbf{z})$ that is large when drift is present and small otherwise. Alibi Detect offers a number of learned detectors, which try to explicitly learn a test statistic which satisfies this property:

These detectors can be highly effective, but require training hence potentially increasing data requirements and set-up time. Similarly to when training preprocessing steps, it is important that the learned detectors are trained on training data which is held-out from the reference data set (see Note 1). A brief overview of these detectors is given below. For more details, see the detectors’ respective pages.

Learned kernel

where $\Phi$ is a learnable projection, $k_a$ and $k_b$ are simple characteristic kernels (such as a Gaussian RBF), and $\epsilon>0$ is a small constant. By letting $\Phi$ be very flexible we can learn powerful kernels in this manner.

The figure below compares the use of a Gaussian and a learned kernel for identifying differences between two distributions $P$ and $P_{ref}$. The distributions are each equal mixtures of nine Gaussians with the same modes, but each component of $P_{ref}$ is an isotropic Gaussian, whereas the covariance of $P$ differs in each component. The Gaussian kernel (c) treats points isotropically throughout the space, based upon $\lVert \mathbf{z} - \mathbf{z}^{ref} \rVert$ only. The learned kernel (d) behaves differently in different regions of the space, adapting to local structure and therefore allowing better detection of differences between $P$ and $P_{ref}$.

:::{figure} images/deep_kernel.png :align: center :alt: Gaussian and deep kernels

Original image source: Liu et al., 2020. Captions modified to match notation used elsewhere on this page. :::

Classifier

Spot-the-diff

4. Online drift detection

So far, we have discussed drift detection in an offline context, with the entire test set ${\mathbf{z}i}{i=1}^{N}$ compared to the reference dataset ${\mathbf{z}^{ref}i}{i=1}^{M}$. However, at test time, data sometimes arrives sequentially. Here it is desirable to detect drift in an online fashion, allowing us to respond as quickly as possible and limit the damage it might cause.

:align: center
:alt: Online drift detection
:width: 350px

:align: center
:alt: Offline detector with W=2
:width: 700px

:align: center
:alt: Offline detector with W=20
:width: 700px

Online drift detectors

online_detector = MMDDriftOnline(X_ref, ert, window_size, backend='tensorflow', n_bootstraps=5000)

But, in addition to providing the detector with reference data, the expected run-time (see below), and size of the sliding window must also be defined. Another important difference is that the online detectors make predictions on single data instances:

result = online_detector.predict(X[i])

This can be seen in the animation below, where the online detector considers each incoming observation/sample individually, instead of considering a batch of observations like the offline detectors.

:align: center
:alt: Online detector
:width: 700px

Expected run-time and detection delay

Unlike offline detectors which require the specification of a threshold $p$-value, which is equivalent to a false positive rate (FPR), the online detectors in alibi-detect require the specification of an expected run-time (ERT) (an inverted FPR). This is the number of time-steps that we insist our detectors, on average, should run for in the absence of drift, before making a false detection.

VAE outlier detection on KDD Cup ‘99 dataset

Method

Dataset

There are 4 types of attacks in the dataset:

DOS: denial-of-service, e.g. syn flood;
R2L: unauthorized access from a remote machine, e.g. guessing password;
U2R: unauthorized access to local superuser (root) privileges;
probing: surveillance and other probing, e.g., port scanning.

The dataset contains about 5 million connection records.

There are 3 types of features:

basic features of individual connections, e.g. duration of connection
content features within a connection, e.g. number of failed log in attempts
traffic features within a 2 second window, e.g. number of connections to the same host as the current connection

This notebook requires the seaborn package for visualization which can be installed via pip:

!pip install seaborn

import os
import logging
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix, f1_score
import tensorflow as tf
tf.keras.backend.clear_session()
from tensorflow.keras.layers import Dense, InputLayer

from alibi_detect.datasets import fetch_kdd
from alibi_detect.models.tensorflow import elbo
from alibi_detect.od import OutlierVAE
from alibi_detect.utils.data import create_outlier_batch
from alibi_detect.utils.fetching import fetch_detector
from alibi_detect.saving import save_detector, load_detector
from alibi_detect.utils.visualize import plot_instance_score, plot_feature_outlier_tabular, plot_roc

logger = tf.get_logger()
logger.setLevel(logging.ERROR)

Load dataset

We only keep a number of continuous (18 out of 41) features.

kddcup = fetch_kdd(percent10=True)  # only load 10% of the dataset
print(kddcup.data.shape, kddcup.target.shape)

Assume that a model is trained on normal instances of the dataset (not outliers) and standardization is applied:

np.random.seed(0)
normal_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=400000, perc_outlier=0)
X_train, y_train = normal_batch.data.astype('float'), normal_batch.target
print(X_train.shape, y_train.shape)
print('{}% outliers'.format(100 * y_train.mean()))

mean, stdev = X_train.mean(axis=0), X_train.std(axis=0)

Apply standardization:

X_train = (X_train - mean) / stdev

Load or define outlier detector

load_outlier_detector = True

#| scrolled: true
filepath = 'my_dir'  # change to directory (absolute path) where model is downloaded
detector_type = 'outlier'
dataset = 'kddcup'
detector_name = 'OutlierVAE'
filepath = os.path.join(filepath, detector_name)
if load_outlier_detector:  # load pretrained outlier detector
    od = fetch_detector(filepath, detector_type, dataset, detector_name)
else:  # define model, initialize, train and save outlier detector
    n_features = X_train.shape[1]
    latent_dim = 2
    
    encoder_net = tf.keras.Sequential(
      [
          InputLayer(input_shape=(n_features,)),
          Dense(20, activation=tf.nn.relu),
          Dense(15, activation=tf.nn.relu),
          Dense(7, activation=tf.nn.relu)
      ])

    decoder_net = tf.keras.Sequential(
      [
          InputLayer(input_shape=(latent_dim,)),
          Dense(7, activation=tf.nn.relu),
          Dense(15, activation=tf.nn.relu),
          Dense(20, activation=tf.nn.relu),
          Dense(n_features, activation=None)
      ])
    
    # initialize outlier detector
    od = OutlierVAE(threshold=None,  # threshold for outlier score
                    score_type='mse',  # use MSE of reconstruction error for outlier detection
                    encoder_net=encoder_net,  # can also pass VAE model instead
                    decoder_net=decoder_net,  # of separate encoder and decoder
                    latent_dim=latent_dim,
                    samples=5)
    # train
    od.fit(X_train,
           loss_fn=elbo,
           cov_elbo=dict(sim=.01),
           epochs=30,
           verbose=True)
    
    # save the trained outlier detector
    save_detector(od, filepath)

np.random.seed(0)
perc_outlier = 5
threshold_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=1000, perc_outlier=perc_outlier)
X_threshold, y_threshold = threshold_batch.data.astype('float'), threshold_batch.target
X_threshold = (X_threshold - mean) / stdev
print('{}% outliers'.format(100 * y_threshold.mean()))

od.infer_threshold(X_threshold, threshold_perc=100-perc_outlier)
print('New threshold: {}'.format(od.threshold))

We could have also inferred the threshold from the normal training data by setting threshold_perc e.g. at 99 and adding a bit of margin on top of the inferred threshold. Let's save the outlier detector with updated threshold:

save_detector(od, filepath)

Detect outliers

We now generate a batch of data with 10% outliers and detect the outliers in the batch.

np.random.seed(1)
outlier_batch = create_outlier_batch(kddcup.data, kddcup.target, n_samples=1000, perc_outlier=10)
X_outlier, y_outlier = outlier_batch.data.astype('float'), outlier_batch.target
X_outlier = (X_outlier - mean) / stdev
print(X_outlier.shape, y_outlier.shape)
print('{}% outliers'.format(100 * y_outlier.mean()))

Predict outliers:

od_preds = od.predict(X_outlier,
                      outlier_type='instance',    # use 'feature' or 'instance' level
                      return_feature_score=True,  # scores used to determine outliers
                      return_instance_score=True)
print(list(od_preds['data'].keys()))

Display results

F1 score and confusion matrix:

labels = outlier_batch.target_names
y_pred = od_preds['data']['is_outlier']
f1 = f1_score(y_outlier, y_pred)
print('F1 score: {:.4f}'.format(f1))
cm = confusion_matrix(y_outlier, y_pred)
df_cm = pd.DataFrame(cm, index=labels, columns=labels)
sns.heatmap(df_cm, annot=True, cbar=True, linewidths=.5)
plt.show()

Plot instance level outlier scores vs. the outlier threshold:

plot_instance_score(od_preds, y_outlier, labels, od.threshold)

We can clearly see that some outliers are very easy to detect while others have outlier scores closer to the normal data. We can also plot the ROC curve for the outlier scores of the detector:

roc_data = {'VAE': {'scores': od_preds['data']['instance_score'], 'labels': y_outlier}}
plot_roc(roc_data)

Investigate instance level outlier

We can now take a closer look at some of the individual predictions on X_outlier.

X_recon = od.vae(X_outlier).numpy()  # reconstructed instances by the VAE

#| scrolled: false
plot_feature_outlier_tabular(od_preds,
                             X_outlier,
                             X_recon=X_recon,
                             threshold=od.threshold,
                             instance_ids=None,  # pass a list with indices of instances to display
                             max_instances=5,  # max nb of instances to display
                             top_n=5,  # only show top_n features ordered by outlier score
                             outliers_only=False,  # only show outlier predictions
                             feature_names=kddcup.feature_names,  # add feature names
                             figsize=(20, 30))

The srv_count feature is responsible for a lot of the displayed outliers.

Spot-the-diff

Overview

If the detector flags drift and $b_i >0$ then we know that it reached its decision by considering how similar each instance is to the instance $w_i$, with those being more similar being more likely to be test instances than reference instances. Alternatively if $b_i < 0$ then instances more similar to $w_i$ were deemed more likely to be reference instances.

As with the standard classifier-based approach, we should specify the proportion of data to use for training and testing respectively as well as training arguments such as the learning rate and batch size. Note that classifier is trained for each test set that is passed for detection.

Usage

Initialize

Arguments:

x_ref: Data used as reference distribution.

Keyword arguments:

backend: Specify the backend (tensorflow or pytorch) to use for defining the kernel and training the test locations/differences.
p_val: p-value threshold used for the significance of the test.
preprocess_fn: Function to preprocess the data before computing the data drift metrics.
kernel: A differentiable TensorFlow or PyTorch module that takes two instances as input and returns a scalar notion of similarity os output. Defaults to a Gaussian radial basis function.
n_diffs: The number of test locations to use, each corresponding to an interpretable difference.
initial_diffs: Array used to initialise the diffs that will be learned. Defaults to Gaussian for each feature with equal variance to that of reference data.
l1_reg: Strength of l1 regularisation to apply to the differences.
binarize_preds: Whether to test for discrepency on soft (e.g. probs/logits) model predictions directly with a K-S test or binarise to 0-1 prediction errors and apply a binomial test.
train_size: Optional fraction (float between 0 and 1) of the dataset used to train the classifier. The drift is detected on 1 - train_size. Cannot be used in combination with n_folds.
n_folds: Optional number of stratified folds used for training. The model preds are then calculated on all the out-of-fold instances. This allows to leverage all the reference and test data for drift detection at the expense of longer computation. If both train_size and n_folds are specified, n_folds is prioritized.
retrain_from_scratch: Whether the classifier should be retrained from scratch for each set of test data or whether it should instead continue training from where it left off on the previous set.
seed: Optional random seed for fold selection.
optimizer: Optimizer used during training of the kernel. From torch.optim for PyTorch and tf.keras.optimizers for TensorFlow.
learning_rate: Learning rate for the optimizer.
batch_size: Batch size used during training of the kernel.
preprocess_batch_fn: Optional batch preprocessing function. For example to convert a list of generic objects to a tensor which can be processed by the kernel.
epochs: Number of training epochs for the kernel.
verbose: Verbosity level during the training of the kernel. 0 is silent and 1 prints a progress bar.
train_kwargs: Optional additional kwargs for the built-in TensorFlow (from alibi_detect.models.tensorflow import trainer) or PyTorch (from alibi_detect.models.pytorch import trainer) trainer functions.
dataset: Dataset object used during training of the classifier. Defaults to alibi_detect.utils.pytorch.TorchDataset (an instance of torch.utils.data.Dataset) for the PyTorch backend and alibi_detect.utils.tensorflow.TFDataset (an instance of tf.keras.utils.Sequence) for the TensorFlow backend. For PyTorch, the dataset should only take the data x and the array of labels y as input, so when e.g. TorchDataset is passed to the detector at initialisation, during training TorchDataset(x, y) is used. For TensorFlow, the dataset is an instance of tf.keras.utils.Sequence, so when e.g. TFDataset is passed to the detector at initialisation, during training TFDataset(x, y, batch_size=batch_size, shuffle=True) is used. x can be of type np.ndarray or List[Any] while y is of type np.ndarray.
input_shape: Shape of input data.
data_type: Optionally specify the data type (e.g. tabular, image or time-series). Added to metadata.

Additional PyTorch keyword arguments:

device: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
dataloader: Dataloader object used during training of the classifier. Defaults to torch.utils.data.DataLoader. The dataloader is not initialized yet, this is done during init off the detector using the batch_size. Custom dataloaders can be passed as well, e.g. for graph data we can use torch_geometric.data.DataLoader.

Defining the kernel

Instantiating the detector

Instantiating the detector is as simple as passing your reference data and selecting a backend, but you should also consider the number of "diffs" you would like your model to use to discriminate reference from test instances and the strength of regularisation you would like to apply to them.

Using n_diffs=1 is the simplest to interpret and seems to work well in practice. Using more diffs may result in stronger detection power but the diffs may be harder to interpret due to intereactions and conditional dependencies.

The strength of the regularisation (l1_reg) to apply to the diffs should also be specified. Stronger regularisation results in sparser diffs as the classifier is encouraged to discriminate using fewer features. This may make the diff more interpretable but may again come at the cost of detection power.

Alternatively we could have used the TensorFlow backend and defined a deep kernel with a convolutional structure:

Detect Drift

We detect data drift by simply calling predict on a batch of instances x. return_p_val equal to True will also return the p-value of the test, return_distance equal to True will return a notion of strength of the drift, return_probs equals True returns the out-of-fold classifier model prediction probabilities on the reference and test data (0 = reference data, 1 = test data) as well as the associated out-of-fold reference and test instances, and return_kernel equals True will also return the trained kernel.

is_drift: 1 if the sample tested has drifted from the reference data and 0 otherwise.
diffs: a numpy array containing the diffs used to discriminate reference from test instances.
diff_coeffs a coefficient correspond to each diff where a coeffient greater than zero implies that the corresponding diff makes the average reference instances more similar to a test instance on average and less than zero implies less similar.
threshold: the user-defined p-value threshold defining the significance of the test
p_val: the p-value of the test if return_p_val equals True.
distance: a notion of strength of the drift if return_distance equals True. Equal to the K-S test statistic assuming binarize_preds equals False or the relative error reduction over the baseline error expected under the null if binarize_preds equals True.
probs_ref: the instance level prediction probability for the reference data x_ref (0 = reference data, 1 = test data) if return_probs is True.
probs_test: the instance level prediction probability for the test data x if return_probs is true.
x_ref_oof: the instances associated with probs_ref if return_probs equals True.
x_test_oof: the instances associated with probs_test if return_probs equals True.
kernel: The trained kernel if return_kernel equals True.

Examples

Classifier

Overview

Usage

Initialize

Arguments:

x_ref: Data used as reference distribution.
model: Binary classification model used for drift detection. TensorFlow, PyTorch and Sklearn models are supported.

Keyword arguments:

backend: Specify the backend (tensorflow, pytorch or sklearn). This depends on the framework of the model. Defaults to tensorflow.
p_val: p-value threshold used for the significance of the test.
preprocess_at_init: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed: Whether or not the reference data x_ref has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict.
preprocess_fn: Function to preprocess the data before computing the data drift metrics.
preds_type: Whether the model outputs 'probs' (probabilities - for 'tensorflow', 'pytorch', 'sklearn' models), 'logits' (for 'pytorch', 'tensorflow' models), 'scores' (for 'sklearn' models if decision_function is supported).
binarize_preds: Whether to test for discrepancy on soft (e.g. probs/logits/scores) model predictions directly with a K-S test or binarise to 0-1 prediction errors and apply a binomial test. Defaults to False and therefore applies the K-S test.
train_size: Optional fraction (float between 0 and 1) of the dataset used to train the classifier. The drift is detected on 1 - train_size. Cannot be used in combination with n_folds.
n_folds: Optional number of stratified folds used for training. The model preds are then calculated on all the out-of-fold predictions. This allows to leverage all the reference and test data for drift detection at the expense of longer computation. If both train_size and n_folds are specified, n_folds is prioritized.
seed: Optional random seed for fold selection.
optimizer: Optimizer used during training of the classifier. From torch.optim for PyTorch and tf.keras.optimizers for TensorFlow.
learning_rate: Learning rate for the optimizer. Only relevant for tensorflow and pytorch backends.
batch_size: Batch size used during training of the classifier.Only relevant for tensorflow and pytorch backends.
epochs: Number of training epochs for the classifier. Applies to each fold if n_folds is specified. Only relevant for tensorflow and pytorch backends.
verbose: Verbosity level during the training of the classifier. 0 is silent and 1 prints a progress bar. Only relevant for tensorflow and pytorch backends.
train_kwargs: Optional additional kwargs for the built-in TensorFlow (from alibi_detect.models.tensorflow import trainer) or PyTorch (from alibi_detect.models.pytorch import trainer) trainer functions.
dataset: Dataset object used during training of the classifier. Defaults to alibi_detect.utils.pytorch.TorchDataset (an instance of torch.utils.data.Dataset) for the PyTorch backend and alibi_detect.utils.tensorflow.TFDataset (an instance of tf.keras.utils.Sequence) for the TensorFlow backend. For PyTorch, the dataset should only take the data x and the array of labels y as input, so when e.g. TorchDataset is passed to the detector at initialisation, during training TorchDataset(x, y) is used. For TensorFlow, the dataset is an instance of tf.keras.utils.Sequence, so when e.g. TFDataset is passed to the detector at initialisation, during training TFDataset(x, y, batch_size=batch_size, shuffle=True) is used. x can be of type np.ndarray or List[Any] while y is of type np.ndarray.
input_shape: Shape of input data.
data_type: Optionally specify the data type (e.g. tabular, image or time-series). Added to metadata.

Additional PyTorch keyword arguments:

device: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
dataloader: Dataloader object used during training of the model. Defaults to torch.utils.data.DataLoader. The dataloader is not initialized yet, this is done during init off the detector using the batch_size. Custom dataloaders can be passed as well, e.g. for graph data we can use torch_geometric.data.DataLoader.

Additional Sklearn keyword arguments:

use_calibration : Whether to use calibration. Calibration can be used on top of any model. Only relevant for 'sklearn' backend.
calibration_kwargs : Optional additional kwargs for calibration. Only relevant for 'sklearn' backend. See https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html for more details.
use_oob : Whether to use out-of-bag(OOB) predictions. Supported only for RandomForestClassifier.

Initialized TensorFlow drift detector example:

A similar detector using PyTorch:

Detect Drift

We detect data drift by simply calling predict on a batch of instances x. return_p_val equal to True will also return the p-value of the test, return_distance equal to True will return a notion of strength of the drift and return_probs equals True also returns the out-of-fold classifier model prediction probabilities on the reference and test data (0 = reference data, 1 = test data) as well as the associated out-of-fold reference and test instances.

is_drift: 1 if the sample tested has drifted from the reference data and 0 otherwise.
threshold: the user-defined threshold defining the significance of the test
p_val: the p-value of the test if return_p_val equals True.
distance: a notion of strength of the drift if return_distance equals True. Equal to the K-S test statistic assuming binarize_preds equals False or the relative error reduction over the baseline error expected under the null if binarize_preds equals True.
probs_ref: the instance level prediction probability for the reference data x_ref (0 = reference data, 1 = test data) if return_probs is True.
probs_test: the instance level prediction probability for the test data x if return_probs is true.
x_ref_oof: the instances associated with probs_ref if return_probs equals True.
x_test_oof: the instances associated with probs_test if return_probs equals True.

Examples

Learned Kernel

Overview

As with the classifier-based approach, we should specify the proportion of data to use for training and testing respectively as well as training arguments such as the learning rate and batch size. Note that a new kernel is trained for each test set that is passed for detection.

Usage

Initialize

Arguments:

x_ref: Data used as reference distribution.
kernel: A differentiable TensorFlow or PyTorch module that takes two sets of instances as inputs and returns a kernel similarity matrix as output.

Keyword arguments:

p_val: p-value threshold used for the significance of the test.
preprocess_at_init: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed: Whether or not the reference data x_ref has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict.
preprocess_fn: Function to preprocess the data before computing the data drift metrics.
n_permutations: The number of permutations to use in the permutation test once the MMD has been computed.
var_reg: Constant added to the estimated variance of the MMD for stability.
reg_loss_fn: The regularisation term reg_loss_fn(kernel) is added to the loss function being optimized.
train_size: Optional fraction (float between 0 and 1) of the dataset used to train the classifier. The drift is detected on 1 - train_size.
retrain_from_scratch: Whether the kernel should be retrained from scratch for each set of test data or whether it should instead continue training from where it left off on the previous set. Defaults to True.
optimizer: Optimizer used during training of the kernel. From torch.optim for PyTorch and tf.keras.optimizers for TensorFlow.
learning_rate: Learning rate for the optimizer.
batch_size: Batch size used during training of the kernel.
batch_size_predict: Batch size used for the trained drift detector predictions.
preprocess_batch_fn: Optional batch preprocessing function. For example to convert a list of generic objects to a tensor which can be processed by the kernel.
epochs: Number of training epochs for the kernel.
verbose: Verbosity level during the training of the kernel. 0 is silent and 1 prints a progress bar.
train_kwargs: Optional additional kwargs for the built-in TensorFlow (from alibi_detect.models.tensorflow import trainer) or PyTorch (from alibi_detect.models.pytorch import trainer) trainer functions.
dataset: Dataset object used during training of the kernel. Defaults to alibi_detect.utils.pytorch.TorchDataset (an instance of torch.utils.data.Dataset) for the PyTorch and KeOps backends and alibi_detect.utils.tensorflow.TFDataset (an instance of tf.keras.utils.Sequence) for the TensorFlow backend. For PyTorch or KeOps, the dataset should only take the windows x_ref and x_test as input, so when e.g. TorchDataset is passed to the detector at initialisation, during training TorchDataset(x_ref, x_test) is used. For TensorFlow, the dataset is an instance of tf.keras.utils.Sequence, so when e.g. TFDataset is passed to the detector at initialisation, during training TFDataset(x_ref, x_test, batch_size=batch_size, shuffle=True) is used. x_ref and x_test can be of type np.ndarray or List[Any].
input_shape: Shape of input data.
data_type: Optionally specify the data type (e.g. tabular, image or time-series). Added to metadata.

Additional PyTorch and KeOps keyword arguments:

device: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
dataloader: Dataloader object used during training of the kernel. Defaults to torch.utils.data.DataLoader. The dataloader is not initialized yet, this is done during init off the detector using the batch_size. Custom dataloaders can be passed as well, e.g. for graph data we can use torch_geometric.data.DataLoader.
num_workers: The number of workers used by the DataLoader. The default (num_workers=0) means multi-process data loading is disabled. Setting num_workers>0 may be unreliable on Windows.

Additional KeOps only keyword arguments:

batch_size_permutations: KeOps computes the n_permutations of the MMD^2 statistics in chunks of batch_size_permutations. Defaults to 1,000,000.

Defining the kernel

This is easily implemented using the DeepKernel class provided in alibi_detect. We demonstrate below how we might define a convolutional kernel for images using Pytorch. By default GaussianRBF kernels are used for $k_a$ and $k_b$ and here we specify $\epsilon=0.01$, but we could alternatively set eps='trainable'.

It is important to note that, if retrain_from_scratch=True and we have not initialised the kernel bandwidth sigma for the default GaussianRBF kernel $k_a$ and optionally also for $k_b$, we will initialise sigma using a median (PyTorch and TensorFlow) or mean (KeOps) bandwidth heuristic for every detector prediction. For KeOps detectors specifically, this could form a computational bottleneck and should be avoided by already specifying a bandwidth in advance. To do this, we can leverage the library's built-in heuristics:

Instantiating the detector

Instantiating the detector is then as simple as passing the reference data and the kernel as follows:

We could have alternatively defined the kernel and instantiated the detector using KeOps:

Or by using TensorFlow as the backend:

Detect Drift

is_drift: 1 if the sample tested has drifted from the reference data and 0 otherwise.
threshold: the user-defined p-value threshold defining the significance of the test
p_val: the p-value of the test if return_p_val equals True.
distance: MMD^2 metric between the reference data and the new batch if return_distance equals True.
distance_threshold: MMD^2 metric value from the permutation test which corresponds to the the p-value threshold if return_distance equals True.
kernel: The trained kernel if return_kernel equals True.

Examples

Graph

Image

Tabular

Online Fisher’s Exact Test

Online Fisher's Exact Test

Overview

Threshold configuration

Online detectors assume the reference data is large and fixed and operate on single data points at a time (rather than batches). These data points are passed into the test-windows, and a two-sample test-statistic (in this case $F=1-\hat{p}$) between the reference data and test-window is computed at each time-step. When the test-statistic exceeds a preconfigured threshold, drift is detected. Configuration of the thresholds requires specification of the expected run-time (ERT) which specifies how many time-steps that the detector, on average, should run for in the absence of drift before making a false detection.

F_t = (1-\lambda)F_{t-1} + \lambda F_t

For a window size of $W$, at time $t$ the value of the statistic $F_t$ depends on more than just the previous $W$ values. If $\lambda$, set by lam, is too small, thresholds may keep decreasing well past $2W - 1$ timesteps. To avoid this, the default lam is set to a high value of $\lambda=0.99$, meaning that discreteness is still broken, but the value of the test statistic depends (almost) solely on the last $W$ observations. If more smoothing is desired, the t_max parameter can be manually set at a larger value.

Note

The detector must configure thresholds for each window size and each feature. This can be a time consuming process if the number of features is high. For high-dimensional data users are recommended to apply a dimension reduction step via preprocess_fn.

Window sizes

Specification of test-window sizes (the detector accepts multiple windows of different size $W$) is also required, with smaller windows allowing faster response to severe drift and larger windows allowing more power to detect slight drift. Since this detector requires a window to be full to function, the ERT is measured from t = min(window_sizes)-1.

Multivariate data

Although this detector is primarly intended for univariate data, it can also be applied to multivariate data. In this case, the detector makes a correction similar to the Bonferroni correction used for the offline detector. Given $d$ features, the detector configures thresholds by targeting the $1-\beta$ quantile of test statistics over the simulated streams, where $\beta = 1 - (1-(1/ERT))^{(1/d)}$. For the univariate case, this simplifies to $\beta = 1/ERT$. At prediction time, drift is flagged if the test statistic of any feature stream exceed the thresholds.

Note

In the multivariate case, for the ERT to be accurately targeted the feature streams must be independent.

Usage

Initialize

Arguments:

x_ref: Data used as reference distribution.
ert: The expected run-time in the absence of drift, starting from t=min(windows_sizes).
window_sizes: The sizes of the sliding test-windows used to compute the test-statistics. Smaller windows focus on responding quickly to severe drift, larger windows focus on ability to detect slight drift.

Keyword arguments:

preprocess_fn: Function to preprocess the data before computing the data drift metrics.
n_bootstraps: The number of bootstrap simulations used to configure the thresholds. The larger this is the more accurately the desired ERT will be targeted. Should ideally be at least an order of magnitude larger than the ERT.
t_max: Length of streams to simulate when configuring thresholds. If None, this is set to 2 * max(window_sizes) - 1.
alternative: Defines the alternative hypothesis. Options are 'greater' (default) or 'less', corresponding to an increase or decrease in the mean of the Bernoulli stream.
lam: Smoothing coefficient used for exponential moving average. If heavy smoothing is applied (lam<<1), a larger t_max may be necessary in order to ensure the thresholds have converged.
n_features: Number of features used in the FET test. No need to pass it if no preprocessing takes place. In case of a preprocessing step, this can also be inferred automatically but could be more expensive to compute.
verbose: Whether or not to print progress during configuration.
input_shape: Shape of input data.
data_type: Optionally specify the data type (tabular, image or time-series). Added to metadata.

Initialized drift detector example:

from alibi_detect.cd import FETDriftOnline

ert = 150
window_sizes = [20,40]
cd = FETDriftOnline(x_ref, ert, window_sizes)

Detect Drift

We detect data drift by sequentially calling predict on single instances x_t (no batch dimension) as they each arrive. We can return the test-statistic and the threshold by setting return_test_stat to True.

is_drift: 1 if any of the test-windows have drifted from the reference data and 0 otherwise.
time: The number of observations that have been so far passed to the detector as test instances.
ert: The expected run-time the detector was configured to run at in the absence of drift.
test_stat: FET test-statistics (1-p_val) between the reference data and the test_windows if return_test_stat equals True.
threshold: The values the test-statsitics are required to exceed for drift to be detected if return_test_stat equals True.

preds = cd.predict(x_t, return_test_stat=True)

Managing State

The detector's state may be saved with the save_state method:

cd = FETDriftOnline(x_ref, ert, window_sizes)  # Instantiate detector at t=0
cd.predict(x_1)  # t=1
cd.save_state('checkpoint_t1')  # Save state at t=1
cd.predict(x_2)  # t=2

The previously saved state may then be loaded via the load_state method:

# Load state at t=1
cd.load_state('checkpoint_t1')

References

Online Least-Squares Density Difference

Overview

Online detectors assume the reference data is large and fixed and operate on single data points at a time (rather than batches). These data points are passed into the test-window and a two-sample test-statistic (in this case an estimate of LSDD) between the reference data and test-window is computed at each time-step. When the test-statistic exceeds a preconfigured threshold, drift is detected. Configuration of the thresholds requires specification of the expected run-time (ERT) which specifies how many time-steps that the detector, on average, should run for in the absence of drift before making a false detection. It also requires specification of a test-window size, with smaller windows allowing faster response to severe drift and larger windows allowing more power to detect slight drift.

Usage

Initialize

Arguments:

x_ref: Data used as reference distribution.
ert: The expected run-time in the absence of drift, starting from t=0.
window_size: The size of the sliding test-window used to compute the test-statistic. Smaller windows focus on responding quickly to severe drift, larger windows focus on ability to detect slight drift.

Keyword arguments:

backend: Backend used for the LSDD implementation and configuration.
preprocess_fn: Function to preprocess the data before computing the data drift metrics.
sigma: Optionally set the bandwidth of the Gaussian kernel used in estimating the LSDD. Can also pass multiple bandwidth values as an array. The kernel evaluation is then averaged over those bandwidths. If sigma is not specified, the 'median heuristic' is adopted whereby sigma is set as the median pairwise distance between reference samples.
n_bootstraps: The number of bootstrap simulations used to configure the thresholds. The larger this is the more accurately the desired ERT will be targeted. Should ideally be at least an order of magnitude larger than the ERT.
n_kernel_centers: The number of reference samples to use as centers in the Gaussian kernel model used to estimate LSDD. Defaults to 2*window_size.
lambda_rd_max: The maximum relative difference between two estimates of LSDD that the regularization parameter lambda is allowed to cause. Defaults to 0.2 as in the paper.
verbose: Whether or not to print progress during configuration.
input_shape: Shape of input data.
data_type: Optionally specify the data type (tabular, image or time-series). Added to metadata.

Additional PyTorch keyword arguments:

device: Device type used. The default None tries to use the GPU and falls back on CPU if needed. Can be specified by passing either 'cuda', 'gpu' or 'cpu'. Only relevant for 'pytorch' backend.

Initialized drift detector example:

from alibi_detect.cd import LSDDDriftOnline

cd = LSDDDriftOnline(x_ref, ert, window_size, backend='tensorflow')

The same detector in PyTorch:

cd = LSDDDriftOnline(x_ref, ert, window_size, backend='pytorch')

We can also easily add preprocessing functions for both frameworks. The following example uses a randomly initialized image encoder in PyTorch:

from functools import partial
import torch
import torch.nn as nn
from alibi_detect.cd.pytorch import preprocess_drift

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# define encoder
encoder_net = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(128, 512, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(2048, 32)
).to(device).eval()

# define preprocessing function
preprocess_fn = partial(preprocess_drift, model=encoder_net, device=device, batch_size=512)

cd = LSDDDriftOnline(x_ref, ert, window_size, backend='pytorch', preprocess_fn=preprocess_fn)

The same functionality is supported in TensorFlow and the main difference is that you would import from alibi_detect.cd.tensorflow import preprocess_drift. Other preprocessing steps such as the output of hidden layers of a model or extracted text embeddings using transformer models can be used in a similar way in both frameworks. TensorFlow example for the hidden layer output:

from alibi_detect.cd.tensorflow import HiddenOutput, preprocess_drift

model = # TensorFlow model; tf.keras.Model or tf.keras.Sequential
preprocess_fn = partial(preprocess_drift, model=HiddenOutput(model, layer=-1), batch_size=128)

cd = LSDDDriftOnline(x_ref, ert, window_size, backend='tensorflow', preprocess_fn=preprocess_fn)

import torch
import torch.nn as nn
from transformers import AutoTokenizer
from alibi_detect.cd.pytorch import preprocess_drift
from alibi_detect.models.pytorch import TransformerEmbedding

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_name = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

embedding_type = 'hidden_state'
layers = [5, 6, 7]
embed = TransformerEmbedding(model_name, embedding_type, layers)
model = nn.Sequential(embed, nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, enc_dim)).to(device).eval()
preprocess_fn = partial(preprocess_drift, model=model, tokenizer=tokenizer, max_len=512, batch_size=32)

# initialise drift detector
cd = LSDDDriftOnline(x_ref, ert, window_size, backend='pytorch', preprocess_fn=preprocess_fn)

Again the same functionality is supported in TensorFlow but with from alibi_detect.cd.tensorflow import preprocess_drift and from alibi_detect.models.tensorflow import TransformerEmbedding imports.

Detect Drift

is_drift: 1 if the test-window (of the most recent window_size observations) has drifted from the reference data and 0 otherwise.
time: The number of observations that have been so far passed to the detector as test instances.
ert: The expected run-time the detector was configured to run at in the absence of drift.
test_stat: LSDD metric between the reference data and the test_window if return_test_stat equals True.
threshold: The value the test-statsitic is required to exceed for drift to be detected if return_test_stat equals True.

preds = cd.predict(x_t, return_test_stat=True)

Managing State

The detector's state may be saved with the save_state method:

cd = LSDDDriftOnline(x_ref, ert, window_size)  # Instantiate detector at t=0
cd.predict(x_1)  # t=1
cd.save_state('checkpoint_t1')  # Save state at t=1
cd.predict(x_2)  # t=2

The previously saved state may then be loaded via the load_state method:

# Load state at t=1
cd.load_state('checkpoint_t1')

Examples

Online Maximum Mean Discrepancy

Overview

MMD(F, p, q) = || \mu_{p} - \mu_{q} ||^2_{F}

Online detectors assume the reference data is large and fixed and operate on single data points at a time (rather than batches). These data points are passed into the test-window and a two-sample test-statistic (in this case squared MMD) between the reference data and test-window is computed at each time-step. When the test-statistic exceeds a preconfigured threshold, drift is detected. Configuration of the thresholds requires specification of the expected run-time (ERT) which specifies how many time-steps that the detector, on average, should run for in the absence of drift before making a false detection. It also requires specification of a test-window size, with smaller windows allowing faster response to severe drift and larger windows allowing more power to detect slight drift.

Usage

Initialize

Arguments:

x_ref: Data used as reference distribution.
ert: The expected run-time in the absence of drift, starting from t=0.
window_size: The size of the sliding test-window used to compute the test-statistic. Smaller windows focus on responding quickly to severe drift, larger windows focus on ability to detect slight drift.

Keyword arguments:

backend: Backend used for the MMD implementation and configuration.
preprocess_fn: Function to preprocess the data before computing the data drift metrics.
kernel: Kernel used for the MMD computation, defaults to Gaussian RBF kernel.
sigma: Optionally set the GaussianRBF kernel bandwidth. Can also pass multiple bandwidth values as an array. The kernel evaluation is then averaged over those bandwidths. If sigma is not specified, the 'median heuristic' is adopted whereby sigma is set as the median pairwise distance between reference samples.
n_bootstraps: The number of bootstrap simulations used to configure the thresholds. The larger this is the more accurately the desired ERT will be targeted. Should ideally be at least an order of magnitude larger than the ERT.
verbose: Whether or not to print progress during configuration.
input_shape: Shape of input data.
data_type: Optionally specify the data type (tabular, image or time-series). Added to metadata.

Additional PyTorch keyword arguments:

device: Device type used. The default None tries to use the GPU and falls back on CPU if needed. Can be specified by passing either 'cuda', 'gpu' or 'cpu'. Only relevant for 'pytorch' backend.

Initialized drift detector example:

from alibi_detect.cd import MMDDriftOnline

cd = MMDDriftOnline(x_ref, ert, window_size, backend='tensorflow')

The same detector in PyTorch:

cd = MMDDriftOnline(x_ref, ert, window_size, backend='pytorch')

We can also easily add preprocessing functions for both frameworks. The following example uses a randomly initialized image encoder in PyTorch:

from functools import partial
import torch
import torch.nn as nn
from alibi_detect.cd.pytorch import preprocess_drift

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# define encoder
encoder_net = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(128, 512, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(2048, 32)
).to(device).eval()

# define preprocessing function
preprocess_fn = partial(preprocess_drift, model=encoder_net, device=device, batch_size=512)

cd = MMDDriftOnline(x_ref, ert, window_size, backend='pytorch', preprocess_fn=preprocess_fn)

from alibi_detect.cd.tensorflow import HiddenOutput, preprocess_drift

model = # TensorFlow model; tf.keras.Model or tf.keras.Sequential
preprocess_fn = partial(preprocess_drift, model=HiddenOutput(model, layer=-1), batch_size=128)

cd = MMDDriftOnline(x_ref, ert, window_size, backend='tensorflow', preprocess_fn=preprocess_fn)

import torch
import torch.nn as nn
from transformers import AutoTokenizer
from alibi_detect.cd.pytorch import preprocess_drift
from alibi_detect.models.pytorch import TransformerEmbedding

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model_name = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

embedding_type = 'hidden_state'
layers = [5, 6, 7]
embed = TransformerEmbedding(model_name, embedding_type, layers)
model = nn.Sequential(embed, nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, enc_dim)).to(device).eval()
preprocess_fn = partial(preprocess_drift, model=model, tokenizer=tokenizer, max_len=512, batch_size=32)

# initialise drift detector
cd = MMDDriftOnline(x_ref, ert, window_size, backend='pytorch', preprocess_fn=preprocess_fn)

Detect Drift

is_drift: 1 if the test-window (of the most recent window_size observations) has drifted from the reference data and 0 otherwise.
time: The number of observations that have been so far passed to the detector as test instances.
ert: The expected run-time the detector was configured to run at in the absence of drift.
test_stat: MMD^2 metric between the reference data and the test_window if return_test_stat equals True.
threshold: The value the test-statsitic is required to exceed for drift to be detected if return_test_stat equals True.

preds = cd.predict(x_t, return_test_stat=True)

Managing State

The detector's state may be saved with the save_state method:

cd = MMDDriftOnline(x_ref, ert, window_size)  # Instantiate detector at t=0
cd.predict(x_1)  # t=1
cd.save_state('checkpoint_t1')  # Save state at t=1
cd.predict(x_2)  # t=2

The previously saved state may then be loaded via the load_state method:

# Load state at t=1
cd.load_state('checkpoint_t1')

Examples

Context-aware drift detection on news articles

Introduction

In our news topics example, each different topic such as politics, sports or weather represents a subpopulation of the data. Our context-aware drift detector can then detect changes in the data distribution which cannot be attributed to a change in the relative prevalences of these subpopulations, which we deem permissible. As a cherry on the cake, the context-aware detector allows you to understand which subpopulations are present in both the reference and test data. This allows you to obtain deep insights into the distribution underlying the test data.

Useful context (or conditioning) variables for the context-aware drift detector include but are not limited to:

Domain or application specific contexts such as the time of day or the weather.
Conditioning on the relative prevalences of known subpopulations, such as the frequency of political articles. It is important to note that while the relative frequency of each subpopulation might change, the distribution underlying each subpopulation cannot change.
Conditioning on model predictions. Assume we trained a classifier which tries to figure out which news topic an article belongs to. Given our model predictions we then want to understand whether our test data follows the same underlying distribution as reference instances with similar model predictions. This conditioning would also be useful in case of trending news topics which cause the model prediction distribution to shift but not necessarily the distribution within each of the news topics.
Conditioning on model uncertainties which would allow increases in model uncertainty due to drift into familiar regions of high aleatoric uncertainty (often fine) to be distinguished from that into unfamiliar regions of high epistemic uncertainty (often problematic).

The following settings will be illustrated throughout the notebook:

A change in the prevalences of subpopulations (i.e. news topics) relative to their prevalences in the training data. Contrary to traditional drift detection approaches, the context-aware detector does not flag drift as this change in frequency of news topics is permissible given the context provided (e.g. more political news articles around elections).
A change in the underlying distribution of one or more subpopulations takes place. While we allow changes in the prevalence of the subpopulations accounted for by the context variable, we do not allow changes of the subpopulations themselves. Let's assume that a newspaper usually has a certain tone (e.g. more conservative) when it comes to politics. If this tone changes (to less conservative) around elections (increased frequency of political news articles), then we want to flag it as drift since the change cannot be attributed to the context given to the detector.
A change in the distribution as we observe a previously unseen news topic. A newspaper might for instance add a classified ads section, which was not present in the reference data.

Under setting 1. we want our detector to be well-calibrated (a controlled False Positive Rate (FPR) and more generally a p-value which is uniformly distributed between 0 and 1) while under settings 2. and 3. we want our detector to be powerful and flag the drift. Lastly, we show how the detector can help you to understand the connection between the reference and test data distributions better.

Data

Requirements

The notebook requires the umap-learn, torch, sentence-transformers, statsmodels, seaborn and datasets packages to be installed, which can be done via pip:

Before we start let's fix the random seeds for reproducibility:

Load data

First we load the data, show which classes (news topics) are present and what an instance looks like.

Let's take a look at an instance from the dataset:

Define models and train a classifier

We define respectively a generic clustering model using UMAP, a model to embed the text input using pre-trained SentenceTransformers embeddings, a text classifier and a utility function to place the data on the right device.

First we train a classifier on a small subset of the data. The aim of the classifier is to predict the news topic of each instance. Below we define a few simple training and evaluation functions.

We now split the data in 2 sets. The first set (x_train) we will use to train our text classifier, and the second set (x_drift) is held out to test our drift detector on.

Let's train our classifier. The classifier consists of a simple MLP head on top of a pre-trained SentenceTransformer model as the backbone. The SentenceTransformer remains frozen during training and only the MLP head is finetuned.

Detector calibration under no change

We start with an example where no drift occurs and the reference and test data are both sampled randomly from all news topics. Under this scenario, we expect no drift to be detected by either a normal MMD detector or by the context-aware MMD detector.

First we define some helper functions. The first one visualises the clustered text data while the second function samples disjoint reference and test sets with a specified number of instances per class (i.e. per news topic).

We first define the embedding model using the pre-trained SentenceTransformer embeddings and then embed both the reference and test sets.

By applying UMAP clustering on the SentenceTransformer embeddings, we can visually inspect the various news topic clusters. Note that we fit the clustering model on the held out data first, and then make predictions on the reference and test sets.

We can visually see that the reference and test set are made up of similar clusters of data, grouped by news topic. As a result, we would not expect drift to be flagged. If the data distribution did not change, we can expect the p-value distribution of our statistical test to be uniformly distributed between 0 and 1. So let's see if this assumption holds.

Importantly, first we need to define our context variable for the context-aware MMD detector. In our experiments we allow the relative prevalences of subpopulations to vary while the distributions underlying each of the subpopulations remain unchanged. To achieve this we condition on the prediction probabilities of the classifier we trained earlier to distinguish each of the 20 different news topics. We can do this because the prediction probabilities can account for the frequency of occurrence of each of the topics (be it imperfectly given our classifier makes the occasional mistake).

Before we set off our experiments, we embed all the instances in x_drift and compute all contexts c_drift so we don't have to call our transformer model every single pass in the for loop.

As expected we can see that both the normal MMD and the context-aware MMD detectors are well-calibrated.

Changing the relative subpopulation prevalence

We now focus our attention on a more realistic problem where the relative frequency of one or more subpopulations (i.e. news topics) is changing in a way which can be attributed to external events. Importantly, the distribution underlying each subpopulation (e.g. the distribution of hockey news itself) remains unchanged, only its frequency changes.

In our example we assume that the World Series and Stanley Cup coincide on the calendar leading to a spike in news articles on respectively baseball and hockey. Furthermore, there is not too much news on Mac or Windows since there are no new releases or products planned anytime soon.

While the context-aware detector remains well calibrated, the MMD detector consistently flags drift (low p-values). Note that this is the expected behaviour since the vanilla MMD detector cannot take any external context into account and correctly detects that the reference and test data do not follow the same underlying distribution.

We can also easily see this on the plot below where the p-values of the context-aware detector are uniformly distributed while the MMD detector's p-values are consistently close to 0. Note that we limited the y-axis range to make the plot easier to read.

Changing the subpopulation distribution

In the following example we change the distribution of one or more of the underlying subpopulations. Notice that now we do want to flag drift since our context variable, which permits changes in relative subpopulation prevalences, can no longer explain the change in distribution.

Imagine our news topic classification model is not as granular as before and instead of the 20 categories only predicts the 6 super classes, organised by subject matter:

Computers: comp.graphics; comp.os.ms-windows.misc; comp.sys.ibm.pc.hardware; comp.sys.mac.hardware; comp.windows.x
Recreation: rec.autos; rec.motorcycles; rec.sport.baseball; rec.sport.hockey
Science: sci.crypt; sci.electronics; sci.med; sci.space
Miscellaneous: misc.forsale
Politics: talk.politics.misc; talk.politics.guns; talk.politics.mideast
Religion: talk.religion.misc; talk.atheism; soc.religion.christian

What if baseball and hockey become less popular and the distribution underlying the Recreation class changes? We will want to detect this as the change in distributions of the subpopulations (the 6 super classes) cannot be explained anymore by the context variable.

In order to reuse our pretrained classifier for the super classes, we add the following helper function to map the predictions on the super classes and return one-hot encoded predictions over the 6 super classes. Note that our context variable now changes from a probability distribution over the 20 news topics to a one-hot encoded representation over the 6 super classes.

We can see that the context-aware detector is powerful to detect changes in the distributions of the subpopulations.

Detect unseen topics

Next we illustrate the effectiveness of the context-aware detector to detect new topics which are not present in the reference data. Obviously we also want to flag drift in this case. As an example we introduce movie reviews in the test data.

Changing the context variable

So far we have conditioned the context-aware detector on the model predictions. There are however many other useful contexts possible. One such example would be to condition on the predictions of an unsupervised clustering algorithm. To facilitate this, we first apply kernel PCA on the embedding vectors, followed by a Gaussian mixture model which clusters the data into 6 classes (same as the super classes). We will test both the calibration under the null hypothesis (no distribution change) as well as the power when a new topic (movie reviews) is injected.

Next we change the number of instances in each cluster between the reference and test sets. Note that we do not alter the underlying distribution of each of the clusters, just the frequency.

Now we run the experiment and show the context-aware detector's calibration when changing the cluster frequencies. We also show how the usual MMD detector will consistently flag drift. Furthermore, we inject instances from the movie reviews dataset and illustrate that the context-aware detector remains powerful when the underlying cluster distribution changes (by including a previously unseen topic).

Interpretability of the context-aware detector

The test statistic $\hat{t}$ of the context-aware MMD detector can be formulated as follows: $\hat{t} = \langle K_{0,0}, W_{0,0} \rangle + \langle K_{1,1}, W_{1,1} \rangle -2\langle K_{0,1}, W_{0,1}\rangle$ where $0$ refers to the reference data, $1$ to the test data, and $W_{.,.}$ and $K_{.,.}$ are the weight and kernel matrices, respectively. The weight matrices $W_{.,.}$ allow us to focus on the distribution's subpopulations of interest. Reference instances which have similar contexts as the test data will have higher values for their entries in $W_{0,1}$ than instances with dissimilar contexts. We can therefore interpret $W_{0,1}$ as the coupling matrix between instances in the reference and the test sets. This allows us to investigate which subpopulations from the reference set are present and which are missing in the test data. If we also have a good understanding of the model performance on various subpopulations of the reference data, we could even try and use this coupling matrix to roughly proxy model performance on the unlabeled test instances. Note that in this case we would require labels from the reference data and make sure the reference instances come from the validation, not the training set.

In the following example we only pick 2 classes to be present in the test set while all 20 are present in the reference set. We can then investigate via the coupling matrix whether the test statistic $\hat{t}$ focused on the right classes in the reference data via $W_{0,1}$. More concretely, we can sum over the columns (the test instances) of $W_{0,1}$ and check which reference instances obtained the highest weights.

Online Drift Detection on the Wine Quality Dataset

In the context of deployed models, data (model queries) usually arrive sequentially and we wish to detect it as soon as possible after its occurence. One approach is to perform a test for drift every $W$ time-steps, using the $W$ samples that have arrived since the last test. Such a strategy could be implemented using any of the offline detectors implemented in alibi-detect, but being both sensitive to slight drift and responsive to severe drift is difficult. If the window size $W$ is too small then slight drift will be undetectable. If it is too large then the delay between test-points hampers responsiveness to severe drift.

An alternative strategy is to perform a test each time data arrives. However the usual offline methods are not applicable because the process for computing p-values is too expensive and doesn't account for correlated test outcomes when using overlapping windows of test data.

Online detectors instead work by computing the test-statistic once using the first $W$ data points and then updating the test-statistic sequentially at low cost. When no drift has occured the test-statistic fluctuates around its expected value and once drift occurs the test-statistic starts to drift upwards. When it exceeds some preconfigured threshold value, drift is detected.

Unlike offline detectors which require the specification of a threshold p-value (a false positive rate), the online detectors in alibi-detect require the specification of an expected run-time (ERT) (an inverted FPR). This is the number of time-steps that we insist our detectors, on average, should run for in the absense of drift before making a false detection. Usually we would like the ERT to be large, however this results in insensitive detectors which are slow to respond when drift does occur. There is a tradeoff between the expected run time and the expected detection delay.

To target the desired ERT, thresholds are configured during an initial configuration phase via simulation. This configuration process is only suitable when the amount reference data (most likely the training data of the model of interest) is relatively large (ideally around an order of magnitude larger than the desired ERT). Configuration can be expensive (less so with a GPU) but allows the detector to operate at low-cost during deployment.

This notebook demonstrates online drift detection using two different two-sample distance metrics for the test-statistic, the maximum mean discrepency (MMD) and least-squared density difference (LSDD), both of which can be updated sequentially at low cost.

Backend

Dataset

Online detection with MMD and Pytorch

The Maximum Mean Discepency (MMD) is a distance-based measure between 2 distributions p and q based on the mean embeddings $\mu_{p}$ and $\mu_{q}$ in a reproducing kernel Hilbert space $F$:

MMD(F, p, q) = || \mu_{p} - \mu_{q} ||^2_{F}

Given reference samples ${X_i}{i=1}^{N}$ and test samples ${Y_i}{i=t}^{t+W}$ we may compute an unbiased estimate $\widehat{MMD}^2(F, {X_i}{i=1}^N, {Y_i}{i=t}^{t+W})$ of the squared MMD between the two underlying distributions. Depending on the size of the reference and test windows, $N$ and $W$ respectively, this can be relatively expensive. However, once computed it is possible to update the statistic to estimate to the squared MMD between the distributions underlying ${X_i}{i=1}^{N}$ and ${Y_i}{i=t+1}^{t+1+W}$ at a very low cost, making it suitable for online drift detection.

import matplotlib.pyplot as plt
import numpy as np
import torch
import tensorflow as tf
import pandas as pd
import scipy
from sklearn.decomposition import PCA

np.random.seed(0)
torch.manual_seed(0)
tf.random.set_seed(0)

Load data

First we load in the data:

red = pd.read_csv(
    "https://storage.googleapis.com/seldon-datasets/wine_quality/winequality-red.csv", sep=';'
)
white = pd.read_csv(
    "https://storage.googleapis.com/seldon-datasets/wine_quality/winequality-white.csv", sep=';'
)
white.describe()

We can see that the data for both red and white wine samples take the same format.

red.describe()

We shuffle and normalise the data such that each feature takes a value in [0,1], as does the quality we seek to predict. We assue that our model was trained on white wine samples, which therefore forms the reference distribution, and that red wine samples can be considered to be drawn from a drifted distribution.

white, red = np.asarray(white, np.float32), np.asarray(red, np.float32)
n_white, n_red = white.shape[0], red.shape[0]

col_maxes = white.max(axis=0)
white, red = white / col_maxes, red / col_maxes
white, red = white[np.random.permutation(n_white)], red[np.random.permutation(n_red)]
X = white[:, :-1]
X_corr = red[:, :-1]

X_train = X[:(n_white//2)]
X_ref = X[(n_white//2):(3*n_white//4)]
X_h0 = X[(3*n_white//4):]

Now we define a PCA object to be used as a preprocessing function to project the 11-D data onto a 2-D representation. We learn the first 2 principal components on the training split of the reference data.

pca = PCA(2)
pca.fit(X_train)

Hopefully the learned preprocessing step has learned a projection such that in the lower dimensional space the two samples are distinguishable.

enc_h0 = pca.transform(X_h0)
enc_h1 = pca.transform(X_corr)

plt.scatter(enc_h0[:,0], enc_h0[:,1], alpha=0.2, color='green', label='white wine')
plt.scatter(enc_h1[:,0], enc_h1[:,1], alpha=0.2, color='red', label='red wine')
plt.legend(loc='upper right')
plt.show()

Now we can define our online drift detector. We specify an expected run-time (in the absence of drift) of 50 time-steps, and a window size of 10 time-steps. Upon initialising the detector thresholds will be computed using 2500 boostrap samples. These values of ert, window_size and n_bootstraps are lower than a typical use-case in order to demonstrate the average behaviour of the detector over a large number of runs in a reasonable time.

from alibi_detect.cd import MMDDriftOnline

ert = 50
window_size = 10

cd = MMDDriftOnline(
    X_ref, ert, window_size, backend='pytorch', preprocess_fn=pca.transform, n_bootstraps=2500
)

We now define a function which will simulate a single run and return the run-time. Note how the detector acts on single instances at a time, the run-time is considered as the time elapsed after the test-window has been filled, and that the detector is stateful and must be reset between detections.

def time_run(cd, X, window_size):
    n = X.shape[0]
    perm = np.random.permutation(n)
    t = 0
    cd.reset_state()
    while True:
        pred = cd.predict(X[perm[t%n]])
        if pred['data']['is_drift'] == 1:
            return t
        else:
            t += 1

n_runs = 250
times_h0 = [time_run(cd, X_h0, window_size) for _ in range(n_runs)]
print(f"Average run-time under no-drift: {np.mean(times_h0)}")
_ = scipy.stats.probplot(np.array(times_h0), dist=scipy.stats.geom, sparams=1/ert, plot=plt)

If we run the detector in an identical manner but on data from the drifted distribution of red wine samples the average run-time is much lower.

n_runs = 250
times_h1 = [time_run(cd, X_corr, window_size) for _ in range(n_runs)]
print(f"Average run-time under drift: {np.mean(times_h1)}")

Online detection with LSDD and TensorFlow

We additionally show that TensorFlow can also be used as the backend and that sometimes it is not necessary to perform preprocessing, making definition of the drift detector simpler. Moreover, in the absence of a learned preprocessing stage we may use all of the reference data available.

X_ref = np.concatenate([X_train, X_ref], axis=0)

And now we define the LSDD-based online drift detector, again with an ert of 50 and window_size of 10.

from alibi_detect.cd import LSDDDriftOnline

cd = LSDDDriftOnline(
    X_ref, ert, window_size, backend='tensorflow', n_bootstraps=2500,
)

We run this new detector on the held out reference data and again see that in the absence of drift the distribution of run-times follows a Geometric distribution with mean ert.

n_runs = 250
times_h0 = [time_run(cd, X_h0, window_size) for _ in range(n_runs)]
print(f"Average run-time under no-drift: {np.mean(times_h0)}")
_ = scipy.stats.probplot(np.array(times_h0), dist=scipy.stats.geom, sparams=1/ert, plot=plt)

And when drift has occured the detector is very fast to respond.

n_runs = 250
times_h1 = [time_run(cd, X_corr, window_size) for _ in range(n_runs)]
print(f"Average run-time under drift: {np.mean(times_h1)}")

Model uncertainty based drift detection on CIFAR-10 and Wine-Quality datasets

Method

It is important that the detector uses a reference set that is disjoint from the model's training set (on which the model's confidence may be higher).

Backend

Classifier uncertainty based drift detection

We start by demonstrating how to leverage model uncertainty to detect malicious drift when the model of interest is a classifer.

Dataset

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import tensorflow as tf
import torch
from torch import nn

from alibi_detect.cd import ClassifierUncertaintyDrift, RegressorUncertaintyDrift
from alibi_detect.models.tensorflow import scale_by_instance
from alibi_detect.utils.fetching import fetch_tf_model, fetch_detector
from alibi_detect.saving import save_detector, load_detector
from alibi_detect.datasets import fetch_cifar10c, corruption_types_cifar10c
from alibi_detect.models.pytorch import trainer
from alibi_detect.cd.utils import encompass_batching

Original CIFAR-10 data:

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
y_train = y_train.astype('int64').reshape(-1,)
y_test = y_test.astype('int64').reshape(-1,)

For CIFAR-10-C, we can select from the following corruption types at 5 severity levels:

corruptions = corruption_types_cifar10c()
print(corruptions)

Let's pick a subset of the corruptions at corruption level 5. Each corruption type consists of perturbations on all of the original test set images.

corruption = ['gaussian_noise', 'motion_blur', 'brightness', 'pixelate']
X_corr, y_corr = fetch_cifar10c(corruption=corruption, severity=5, return_X_y=True)
X_corr = X_corr.astype('float32') / 255

We split the original test set in a reference dataset and a dataset which should not be rejected under the no-change null H0. We also split the corrupted data by corruption type:

np.random.seed(0)
n_test = X_test.shape[0]
idx = np.random.choice(n_test, size=n_test // 2, replace=False)
idx_h0 = np.delete(np.arange(n_test), idx, axis=0)
X_ref,y_ref = X_test[idx], y_test[idx]
X_h0, y_h0 = X_test[idx_h0], y_test[idx_h0]
print(X_ref.shape, X_h0.shape)

# check that the classes are more or less balanced
classes, counts_ref = np.unique(y_ref, return_counts=True)
counts_h0 = np.unique(y_h0, return_counts=True)[1]
print('Class Ref H0')
for cl, cref, ch0 in zip(classes, counts_ref, counts_h0):
    assert cref + ch0 == n_test // 10
    print('{}     {} {}'.format(cl, cref, ch0))

n_corr = len(corruption)
X_c = [X_corr[i * n_test:(i + 1) * n_test] for i in range(n_corr)]

We can visualise the same instance for each corruption type:

#| tags: [hide_input]
i = 1

n_test = X_test.shape[0]
plt.title('Original')
plt.axis('off')
plt.imshow(X_test[i])
plt.show()
for _ in range(len(corruption)):
    plt.title(corruption[_])
    plt.axis('off')
    plt.imshow(X_corr[n_test * _+ i])
    plt.show()

We can also verify that the performance of a classification model on CIFAR-10 drops significantly on this perturbed dataset:

dataset = 'cifar10'
model = 'resnet32'
clf = fetch_tf_model(dataset, model)
acc = clf.evaluate(scale_by_instance(X_test), y_test, batch_size=128, verbose=0)[1]
print('Test set accuracy:')
print('Original {:.4f}'.format(acc))
clf_accuracy = {'original': acc}
for _ in range(len(corruption)):
    acc = clf.evaluate(scale_by_instance(X_c[_]), y_test, batch_size=128, verbose=0)[1]
    clf_accuracy[corruption[_]] = acc
    print('{} {:.4f}'.format(corruption[_], acc))

Given the drop in performance, it is important that we detect the harmful data drift!

Detect drift

Unlike many other approaches we needn't specify a dimension-reducing preprocessing step as the detector operates directly on the data as it is input to the model of interest. In fact, the two-stage projection input -> prediction -> uncertainty can be thought of as the projection from the input space onto the real line, ready to perform the test.

#| scrolled: false
cd = ClassifierUncertaintyDrift(
  X_ref, model=clf, backend='tensorflow', p_val=0.05, preds_type='probs'
)

Let's check whether the detector thinks drift occurred on the different test sets and time the prediction calls:

from timeit import default_timer as timer

labels = ['No!', 'Yes!']

def make_predictions(cd, x_h0, x_corr, corruption):
    t = timer()
    preds = cd.predict(x_h0)
    dt = timer() - t
    print('No corruption')
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print('Feature-wise p-values:')
    print(preds['data']['p_val'])
    print(f'Time (s) {dt:.3f}')
    
    if isinstance(x_corr, list):
        for x, c in zip(x_corr, corruption):
            t = timer()
            preds = cd.predict(x)
            dt = timer() - t
            print('')
            print(f'Corruption type: {c}')
            print('Drift? {}'.format(labels[preds['data']['is_drift']]))
            print('Feature-wise p-values:')
            print(preds['data']['p_val'])
            print(f'Time (s) {dt:.3f}')

make_predictions(cd, X_h0, X_c, corruption)

Note here how drift is only detected for the corrupted datasets on which the model's performance is significantly degraded. For the 'brightness' corruption, for which the model maintains 89% classification accuracy, the change in model uncertainty is not deemed significant (p-value 0.11, above the 0.05 threshold). For the other corruptions which signficiantly hamper model performance, the malicious drift is detected.

Regressor uncertainty based drift detection

Dataset

First we load in the data.

red = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';'
)
white = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=';'
)
red.describe()

We can see that the data for both red and white wine samples take the same format.

white.describe()

We shuffle and normalise the data such that each feature takes a value in [0,1], as does the quality we seek to predict.

red, white = np.asarray(red, np.float32), np.asarray(white, np.float32)
n_red, n_white = red.shape[0], white.shape[0]

col_maxes = red.max(axis=0)
red, white = red / col_maxes, white / col_maxes
red, white = red[np.random.permutation(n_red)], white[np.random.permutation(n_white)]
X, y = red[:, :-1], red[:, -1:]
X_corr, y_corr = white[:, :-1], white[:, -1:]

We split the red wine data into a set on which to train the model, a reference set with which to instantiate the detector and a set which the detector should not flag drift. We then instantiate a DataLoader to pass the training data to a PyTorch model in batches.

X_train, y_train = X[:(n_red//2)], y[:(n_red//2)]
X_ref, y_ref = X[(n_red//2):(3*n_red//4)], y[(n_red//2):(3*n_red//4)]
X_h0, y_h0 = X[(3*n_red//4):], y[(3*n_red//4):]

X_train_ds = torch.utils.data.TensorDataset(torch.tensor(X_train), torch.tensor(y_train))
X_train_dl = torch.utils.data.DataLoader(X_train_ds, batch_size=32, shuffle=True, drop_last=True)

Regression model

We now define the regression model that we'll train to predict the quality from the features. The exact details aren't important other than the presence of at least one dropout layer. We then train the model for 20 epochs to optimise the mean square error on the training data.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

reg = nn.Sequential(
    nn.Linear(11, 16),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(32, 1)
).to(device)

trainer(reg, nn.MSELoss(), X_train_dl, device, torch.optim.Adam, learning_rate=0.001, epochs=30)

We now evaluate the trained model on both unseen samples of red wine and white wine. We see that, unsurprisingly, the model is better able to predict the quality of unseen red wine samples.

reg = reg.eval()
reg_fn = encompass_batching(reg, backend='pytorch', batch_size=32)
preds_ref = reg_fn(X_ref)
preds_corr = reg_fn(X_corr)

ref_mse = np.square(preds_ref - y_ref).mean()
corr_mse = np.square(preds_corr - y_corr).mean()

print(f'MSE when predicting the quality of unseen red wine samples: {ref_mse}')
print(f'MSE when predicting the quality of unseen white wine samples: {corr_mse}')

Detect drift

We now look at whether a regressor-uncertainty detector would have picked up on this malicious drift. We instantiate the detector and obtain drift predictions on both the held-out red-wine samples and the white-wine samples. We specify uncertainty_type='mc_dropout' in this case, but alternatively we could have trained an ensemble model that for each instance outputs a vector of multiple independent predictions and specified uncertainty_type='ensemble'.

cd = RegressorUncertaintyDrift(
    X_ref, model=reg, backend='pytorch', p_val=0.05, uncertainty_type='mc_dropout', n_evals=100
)
preds_h0 = cd.predict(X_h0)
preds_h1 = cd.predict(X_corr)

print(f"Drift detected on unseen red wine samples? {'yes' if preds_h0['data']['is_drift']==1 else 'no'}")
print(f"Drift detected on white wine samples? {'yes' if preds_h1['data']['is_drift']==1 else 'no'}")

print(f"p-value on unseen red wine samples? {preds_h0['data']['p_val']}")
print(f"p-value on white wine samples? {preds_h1['data']['p_val']}")

Maximum Mean Discrepancy drift detector on CIFAR-10

Method

MMD(F, p, q) = || \mu_{p} - \mu_{q} ||^2_{F}

Backend

Dataset

from functools import partial
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from alibi_detect.cd import MMDDrift
from alibi_detect.models.tensorflow import scale_by_instance
from alibi_detect.utils.fetching import fetch_tf_model
from alibi_detect.saving import save_detector, load_detector
from alibi_detect.datasets import fetch_cifar10c, corruption_types_cifar10c

Load data

Original CIFAR-10 data:

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
y_train = y_train.astype('int64').reshape(-1,)
y_test = y_test.astype('int64').reshape(-1,)

For CIFAR-10-C, we can select from the following corruption types at 5 severity levels:

corruptions = corruption_types_cifar10c()
print(corruptions)

Let's pick a subset of the corruptions at corruption level 5. Each corruption type consists of perturbations on all of the original test set images.

corruption = ['gaussian_noise', 'motion_blur', 'brightness', 'pixelate']
X_corr, y_corr = fetch_cifar10c(corruption=corruption, severity=5, return_X_y=True)
X_corr = X_corr.astype('float32') / 255

We split the original test set in a reference dataset and a dataset which should not be rejected under the H0 of the MMD test. We also split the corrupted data by corruption type:

np.random.seed(0)
n_test = X_test.shape[0]
idx = np.random.choice(n_test, size=n_test // 2, replace=False)
idx_h0 = np.delete(np.arange(n_test), idx, axis=0)
X_ref,y_ref = X_test[idx], y_test[idx]
X_h0, y_h0 = X_test[idx_h0], y_test[idx_h0]
print(X_ref.shape, X_h0.shape)

# check that the classes are more or less balanced
classes, counts_ref = np.unique(y_ref, return_counts=True)
counts_h0 = np.unique(y_h0, return_counts=True)[1]
print('Class Ref H0')
for cl, cref, ch0 in zip(classes, counts_ref, counts_h0):
    assert cref + ch0 == n_test // 10
    print('{}     {} {}'.format(cl, cref, ch0))

n_corr = len(corruption)
X_c = [X_corr[i * n_test:(i + 1) * n_test] for i in range(n_corr)]

We can visualise the same instance for each corruption type:

#| tags: [hide_input]
i = 4

n_test = X_test.shape[0]
plt.title('Original')
plt.axis('off')
plt.imshow(X_test[i])
plt.show()
for _ in range(len(corruption)):
    plt.title(corruption[_])
    plt.axis('off')
    plt.imshow(X_corr[n_test * _+ i])
    plt.show()

We can also verify that the performance of a classification model on CIFAR-10 drops significantly on this perturbed dataset:

dataset = 'cifar10'
model = 'resnet32'
clf = fetch_tf_model(dataset, model)
acc = clf.evaluate(scale_by_instance(X_test), y_test, batch_size=128, verbose=0)[1]
print('Test set accuracy:')
print('Original {:.4f}'.format(acc))
clf_accuracy = {'original': acc}
for _ in range(len(corruption)):
    acc = clf.evaluate(scale_by_instance(X_c[_]), y_test, batch_size=128, verbose=0)[1]
    clf_accuracy[corruption[_]] = acc
    print('{} {:.4f}'.format(corruption[_], acc))

Given the drop in performance, it is important that we detect the harmful data drift!

Detect drift with TensorFlow backend

First we try a drift detector using the TensorFlow framework for both the preprocessing and the MMD computation steps.

Random encoder

First we try the randomly initialized encoder:

#| scrolled: false
from tensorflow.keras.layers import Conv2D, Dense, Flatten, InputLayer, Reshape
from alibi_detect.cd.tensorflow import preprocess_drift

tf.random.set_seed(0)

# define encoder
encoding_dim = 32
encoder_net = tf.keras.Sequential(
  [
      InputLayer(input_shape=(32, 32, 3)),
      Conv2D(64, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(128, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(512, 4, strides=2, padding='same', activation=tf.nn.relu),
      Flatten(),
      Dense(encoding_dim,)
  ]
)

# define preprocessing function
preprocess_fn = partial(preprocess_drift, model=encoder_net, batch_size=512)

# initialise drift detector
cd = MMDDrift(X_ref, backend='tensorflow', p_val=.05, 
              preprocess_fn=preprocess_fn, n_permutations=100)

# we can also save/load an initialised detector
filepath = 'detector_tf'  # change to directory where detector is saved
save_detector(cd, filepath)
cd = load_detector(filepath)

Let's check whether the detector thinks drift occurred on the different test sets and time the prediction calls:

from timeit import default_timer as timer

labels = ['No!', 'Yes!']

def make_predictions(cd, x_h0, x_corr, corruption):
    t = timer()
    preds = cd.predict(x_h0)
    dt = timer() - t
    print('No corruption')
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print(f'p-value: {preds["data"]["p_val"]:.3f}')
    print(f'Time (s) {dt:.3f}')
    
    if isinstance(x_corr, list):
        for x, c in zip(x_corr, corruption):
            t = timer()
            preds = cd.predict(x)
            dt = timer() - t
            print('')
            print(f'Corruption type: {c}')
            print('Drift? {}'.format(labels[preds['data']['is_drift']]))
            print(f'p-value: {preds["data"]["p_val"]:.3f}')
            print(f'Time (s) {dt:.3f}')

#| scrolled: false
make_predictions(cd, X_h0, X_c, corruption)

As expected, drift was only detected on the corrupted datasets.

BBSDs

X_ref_bbsds = scale_by_instance(X_ref)
X_h0_bbsds = scale_by_instance(X_h0)
X_c_bbsds = [scale_by_instance(X_c[i]) for i in range(n_corr)]

Initialisation of the drift detector. Here we use the output of the softmax layer to detect the drift, but other hidden layers can be extracted as well by setting 'layer' to the index of the desired hidden layer in the model:

from alibi_detect.cd.tensorflow import HiddenOutput

# define preprocessing function
preprocess_fn = partial(preprocess_drift, model=HiddenOutput(clf, layer=-1), batch_size=128)

# initialise drift detector
cd = MMDDrift(X_ref_bbsds, backend='tensorflow', p_val=.05, 
              preprocess_fn=preprocess_fn, n_permutations=100)

make_predictions(cd, X_h0_bbsds, X_c_bbsds, corruption)

Again drift is only flagged on the perturbed data.

Detect drift with PyTorch backend

We can do the same thing using the PyTorch backend. We illustrate this using the randomly initialized encoder as preprocessing step:

import torch
import torch.nn as nn

# set random seed and device
seed = 0
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

Since our PyTorch encoder expects the images in a (batch size, channels, height, width) format, we transpose the data:

def permute_c(x):
    return np.transpose(x.astype(np.float32), (0, 3, 1, 2))

X_ref_pt = permute_c(X_ref)
X_h0_pt = permute_c(X_h0)
X_c_pt = [permute_c(xc) for xc in X_c]
print(X_ref_pt.shape, X_h0_pt.shape, X_c_pt[0].shape)

from alibi_detect.cd.pytorch import preprocess_drift

# define encoder
encoder_net = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(128, 512, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(2048, encoding_dim)
).to(device).eval()

# define preprocessing function
preprocess_fn = partial(preprocess_drift, model=encoder_net, device=device, batch_size=512)

# initialise drift detector
cd = MMDDrift(X_ref_pt, backend='pytorch', p_val=.05, 
              preprocess_fn=preprocess_fn, n_permutations=100)

# we can also save/load an initialised PyTorch based detector
filepath = 'detector_pt'  # change to directory where detector is saved
save_detector(cd, filepath)
cd = load_detector(filepath)

make_predictions(cd, X_h0_pt, X_c_pt, corruption)

The drift detector will attempt to use the GPU if available and otherwise falls back on the CPU. We can also explicitly specify the device. Let's compare the GPU speed up with the CPU implementation:

device = torch.device('cpu')
preprocess_fn = partial(preprocess_drift, model=encoder_net.to(device), 
                        device=device, batch_size=512)

cd = MMDDrift(X_ref_pt, backend='pytorch', preprocess_fn=preprocess_fn, device='cpu')

make_predictions(cd, X_h0_pt, X_c_pt, corruption)

Notice the over 30x acceleration provided by the GPU.

Similar to the TensorFlow implementation, PyTorch can also use the hidden layer output from a pretrained model for the preprocessing step via:

from alibi_detect.cd.pytorch import HiddenOutput

Scaling up drift detection with KeOps

Introduction

from alibi_detect.cd import MMDDrift

detector_torch = MMDDrift(x_ref, backend='pytorch')
detector_keops = MMDDrift(x_ref, backend='keops')

In this notebook we will run a few simple benchmarks to illustrate the speed and memory improvements from using KeOps over vanilla PyTorch on the GPU (1x RTX 2080 Ti) for both the standard MMD and learned kernel MMD detectors.

Data

We randomly sample points from the standard normal distribution and run the detectors with PyTorch and KeOps backends for the following settings:

$N_\text{ref}, N_\text{test} = [2, 5, 10, 20, 50, 100]$ (batch sizes in '000s)
$D = [2, 10, 50]$

Where $D$ denotes the number of features.

Requirements

!pip install pykeops

Before we start let’s fix the random seeds for reproducibility:

import numpy as np
import torch

def set_seed(seed: int) -> None:
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    np.random.seed(seed)

set_seed(2022)

Vanilla PyTorch vs. KeOps comparison

Utility functions

First we define some utility functions to run the experiments:

from alibi_detect.cd import MMDDrift, LearnedKernelDrift
from alibi_detect.utils.keops.kernels import DeepKernel as DeepKernelKeops
from alibi_detect.utils.keops.kernels import GaussianRBF as GaussianRBFKeops
from alibi_detect.utils.pytorch.kernels import DeepKernel as DeepKernelTorch
from alibi_detect.utils.pytorch.kernels import GaussianRBF as GaussianRBFTorch
import matplotlib.pyplot as plt
from scipy.stats import kstest
from timeit import default_timer as timer
import torch.nn as nn
import torch.nn.functional as F


class Projection(nn.Module):
    def __init__(self, d_in: int, d_out: int = 2):
        super().__init__()
        self.lin1 = nn.Linear(d_in, d_out)
        self.lin2 = nn.Linear(d_out, d_out)
    
    def forward(self, x):
        return self.lin2(F.relu(self.lin1(x)))
    

def eval_detector(p_vals: np.ndarray, threshold: float, is_drift: bool, t_mean: float, t_std: float) -> dict:
    """ In case of drifted data (ground truth) it returns the detector's power.
    In case of no drift, it computes the false positive rate (FPR) and whether the p-values
    are uniformly distributed U[0,1] which is checked via a KS test. """
    results = {'power': None, 'fpr': None, 'ks': None}
    below_p_val_threshold = (p_vals <= threshold).mean()
    if is_drift:
        results['power'] = below_p_val_threshold
    else:
        results['fpr'] = below_p_val_threshold
        stat_ks, p_val_ks = kstest(p_vals, 'uniform')
        results['ks'] = {'p_val': p_val_ks, 'stat': stat_ks}
    results['p_vals'] = p_vals
    results['time'] = {'mean': t_mean, 'stdev': t_std}
    return results


def experiment(detector: str, backend: str, n_runs: int, n_ref: int, n_test: int, n_features: int, 
               mu: float = 0.) -> dict:
    """ Runs the experiment n_runs times, each time with newly sampled reference and test data.
    Returns the p-values for each test as well as the mean and standard deviations of the runtimes. """
    p_vals, t_detect = [], []
    for _ in range(n_runs):
        # Sample reference and test data
        x_ref = np.random.randn(*(n_ref, n_features)).astype(np.float32)
        x_test = np.random.randn(*(n_test, n_features)).astype(np.float32) + mu
        
        # Initialise detector, make and log predictions
        p_val = .05
        dd_kwargs = dict(p_val=p_val, backend=backend, n_permutations=100)
        if detector == 'mmd':
            dd = MMDDrift(x_ref, **dd_kwargs)
        elif detector == 'learned_kernel':
            d_out, sigma = 2, .1
            proj = Projection(n_features, d_out)
            Kernel = GaussianRBFKeops if backend == 'keops' else GaussianRBFTorch
            kernel_a = Kernel(trainable=True, sigma = torch.Tensor([sigma]))
            kernel_b = Kernel(trainable=True, sigma = torch.Tensor([sigma]))
            device = torch.device('cuda')
            DeepKernel = DeepKernelKeops if backend == 'keops' else DeepKernelTorch
            deep_kernel = DeepKernel(proj, kernel_a, kernel_b, eps=.01).to(device)
            if backend == 'pytorch' and n_ref + n_test > 20000:
                batch_size = 10000
                batch_size_predict = 10000 
            else:
                batch_size = 1000000
                batch_size_predict = 1000000
            dd_kwargs.update(
                dict(
                    epochs=2, train_size=.75, batch_size=batch_size, batch_size_predict=batch_size_predict
                )
            )
            dd = LearnedKernelDrift(x_ref, deep_kernel, **dd_kwargs)
        start = timer()
        pred = dd.predict(x_test)
        end = timer()
        
        if _ > 0:  # first run reserved for KeOps compilation
            t_detect.append(end - start)
            p_vals.append(pred['data']['p_val'])
            
        del dd, x_ref, x_test
        torch.cuda.empty_cache()
    
    p_vals = np.array(p_vals)
    t_mean, t_std = np.array(t_detect).mean(), np.array(t_detect).std()
    results = eval_detector(p_vals, p_val, mu != 0., t_mean, t_std)
    return results


def format_results(experiments: dict, n_features: list, backends: list, max_batch_size: int = 1e10) -> dict:
    T = {'batch_size': None, 'keops': None, 'pytorch': None}
    T['batch_size'] = np.unique([experiments['keops'][_]['n_ref'] for _ in experiments['keops'].keys()])
    T['batch_size'] = list(T['batch_size'][T['batch_size'] <= max_batch_size])
    T['keops'] = {f: [] for f in n_features}
    T['pytorch'] = {f: [] for f in n_features}

    for backend in backends:
        for f in T[backend].keys():
            for bs in T['batch_size']:
                for k, v in experiments[backend].items():
                    if f == v['n_features'] and bs == v['n_ref']:
                        T[backend][f].append(results[backend][k]['time']['mean'])

    for k, v in T['keops'].items():  # apply padding
        n_pad = len(v) - len(T['pytorch'][k])
        T['pytorch'][k] += [np.nan for _ in range(n_pad)]
    return T


def plot_absolute_time(experiments: dict, results: dict, n_features: list, y_scale: str = 'linear', 
                       detector: str = 'MMD', max_batch_size: int = 1e10):
    T = format_results(experiments, n_features, ['keops', 'pytorch'], max_batch_size)
    colors = ['b', 'g', 'r', 'c', 'm', 'y', 'b']
    legend, n_c = [], 0
    for f in n_features:
        plt.plot(T['batch_size'], T['keops'][f], linestyle='solid', color=colors[n_c]);
        legend.append(f'keops - {f}')
        plt.plot(T['batch_size'], T['pytorch'][f], linestyle='dashed', color=colors[n_c]);
        legend.append(f'pytorch - {f}')
        n_c += 1
    plt.title(f'{detector} drift detection time for 100 permutations')
    plt.legend(legend, loc=(1.1,.1));
    plt.xlabel('Batch size');
    plt.ylabel('Time (s)');
    plt.yscale(y_scale);
    plt.show();


def plot_relative_time(experiments: dict, results: dict, n_features: list, y_scale: str = 'linear',
                       detector: str = 'MMD', max_batch_size: int = 1e10):
    T = format_results(experiments, n_features, ['keops', 'pytorch'], max_batch_size)
    colors = ['b', 'g', 'r', 'c', 'm', 'y', 'b']
    legend, n_c = [], 0
    for f in n_features:
        t_keops, t_torch = T['keops'][f], T['pytorch'][f]
        ratio = [tt / tk for tt, tk in zip(t_torch, t_keops)]
        plt.plot(T['batch_size'], ratio, linestyle='solid', color=colors[n_c]);
        legend.append(f'pytorch/keops - {f}')
        n_c += 1
    plt.title(f'{detector} drift detection pytorch/keops time ratio for 100 permutations')
    plt.legend(legend, loc=(1.1,.1));
    plt.xlabel('Batch size');
    plt.ylabel('time pytorch / keops');
    plt.yscale(y_scale);
    plt.show();

As detailed earlier, we will compare the PyTorch with the KeOps implementation of the MMD and learned kernel MMD detectors for a variety of reference and test data batch sizes as well as different feature dimensions. Note that for the PyTorch implementation, the portion of the kernel matrix for the reference data itself can already be computed at initialisation of the detector. This computation will not be included when we record the detector's prediction time. Since use cases where $N_\text{ref} >> N_\text{test}$ are quite common, we will also test for this specific setting. The key reason is that we cannot amortise this computation for the KeOps detector since we are working with lazily evaluated symbolic matrices.

MMD detector

1. $N_\text{ref} = N_\text{test}$

Note that for KeOps we could further increase the number of instances in the reference and test sets (e.g. to 500,000) without running into memory issues.

experiments_eq = {
    'keops': {
        0: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 5, 'n_features': 2},
        1: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 5, 'n_features': 2},
        2: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 5, 'n_features': 2},
        3: {'n_ref': 20000, 'n_test': 20000, 'n_runs': 5, 'n_features': 2},
        4: {'n_ref': 50000, 'n_test': 50000, 'n_runs': 5, 'n_features': 2},
        5: {'n_ref': 100000, 'n_test': 100000, 'n_runs': 5, 'n_features': 2},
        6: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 5, 'n_features': 10},
        7: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 5, 'n_features': 10},
        8: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 5, 'n_features': 10},
        9: {'n_ref': 20000, 'n_test': 20000, 'n_runs': 5, 'n_features': 10},
        10: {'n_ref': 50000, 'n_test': 50000, 'n_runs': 5, 'n_features': 10},
        11: {'n_ref': 100000, 'n_test': 100000, 'n_runs': 5, 'n_features': 10},
        12: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 5, 'n_features': 50},
        13: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 5, 'n_features': 50},
        14: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 5, 'n_features': 50},
        15: {'n_ref': 20000, 'n_test': 20000, 'n_runs': 5, 'n_features': 50},
        16: {'n_ref': 50000, 'n_test': 50000, 'n_runs': 5, 'n_features': 50},
        17: {'n_ref': 100000, 'n_test': 100000, 'n_runs': 5, 'n_features': 50}
    },
    'pytorch': {  # runs OOM after 10k instances in ref and test sets
        0: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 5, 'n_features': 2},
        1: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 5, 'n_features': 2},
        2: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 5, 'n_features': 2},
        3: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 5, 'n_features': 10},
        4: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 5, 'n_features': 10},
        5: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 5, 'n_features': 10},
        6: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 5, 'n_features': 50},
        7: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 5, 'n_features': 50},
        8: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 5, 'n_features': 50}
    }
}

#| scrolled: true
backends = ['keops', 'pytorch']
results = {backend: {} for backend in backends}

for backend in backends:
    exps = experiments_eq[backend]
    for i, exp in exps.items():
        results[backend][i] = experiment(
            'mmd', backend, exp['n_runs'], exp['n_ref'], exp['n_test'], exp['n_features']
        )

Below we visualise the runtimes of the different experiments. We can make the following observations:

The relative speed improvements of KeOps over vanilla PyTorch increase with increasing batch size.
Due to the explicit kernel computation and storage, the PyTorch detector runs out-of-memory after a little over 10,000 instances in each of the reference and test sets while KeOps keeps scaling up without any issues.
The relative speed improvements decline with growing feature dimension. Note however that we would not recommend using a (untrained) MMD detector on very high-dimensional data in the first place.

The plots show both the absolute and relative (PyTorch / KeOps) mean prediction times for the MMD drift detector for different feature dimensions $[2, 10, 50]$.

n_features = [2, 10, 50]
max_batch_size = 100000

plot_absolute_time(experiments_eq, results, n_features, max_batch_size=max_batch_size)

plot_relative_time(experiments_eq, results, n_features, max_batch_size=max_batch_size)

The difference between KeOps and PyTorch is even more striking when we only look at $[2, 10]$ features:

plot_absolute_time(experiments_eq, results, [2, 10], max_batch_size=max_batch_size)

2. $N_\text{ref} >> N_\text{test}$

Now we check whether the speed improvements still hold when $N_\text{ref} >> N_\text{test}$ ($N_\text{ref} / N_\text{test} = 10$) and a large part of the kernel can already be computed at initialisation time of the PyTorch (but not the KeOps) detector.

experiments_neq = {
    'keops': {
        0: {'n_ref': 2000, 'n_test': 200, 'n_runs': 10, 'n_features': 2},
        1: {'n_ref': 5000, 'n_test': 500, 'n_runs': 10, 'n_features': 2},
        2: {'n_ref': 10000, 'n_test': 1000, 'n_runs': 10, 'n_features': 2},
        3: {'n_ref': 20000, 'n_test': 2000, 'n_runs': 10, 'n_features': 2},
        4: {'n_ref': 50000, 'n_test': 5000, 'n_runs': 10, 'n_features': 2},
        5: {'n_ref': 100000, 'n_test': 10000, 'n_runs': 10, 'n_features': 2}
    },
    'pytorch': {
        0: {'n_ref': 2000, 'n_test': 200, 'n_runs': 10, 'n_features': 2},
        1: {'n_ref': 5000, 'n_test': 500, 'n_runs': 10, 'n_features': 2},
        2: {'n_ref': 10000, 'n_test': 1000, 'n_runs': 10, 'n_features': 2}
    }
}

results = {backend: {} for backend in backends}

for backend in backends:
    exps = experiments_neq[backend]
    for i, exp in exps.items():
        results[backend][i] = experiment(
            'mmd', backend, exp['n_runs'], exp['n_ref'], exp['n_test'], exp['n_features']
        )

The below plots illustrate that KeOps indeed still provides large speed ups over PyTorch. The x-axis shows the reference batch size $N_\text{ref}$. Note that $N_\text{ref} / N_\text{test} = 10$.

plot_absolute_time(experiments_neq, results, [2], max_batch_size=max_batch_size)

plot_relative_time(experiments_neq, results, [2], max_batch_size=max_batch_size)

Learned kernel MMD detector

We conduct similar experiments as for the MMD detector for $N_\text{ref} = N_\text{test}$ and n_features=50. We use a deep learned kernel with an MLP followed by Gaussian RBF kernels and project the input features on a d_out=2-dimensional space. Since the learned kernel detector computes the kernel matrix in a batch-wise manner, we can also scale up the number of instances for the PyTorch backend without running out-of-memory.

experiments_eq = {
    'keops': {
        0: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 3, 'n_features': 50},
        1: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 3, 'n_features': 50},
        2: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 3, 'n_features': 50},
        3: {'n_ref': 20000, 'n_test': 20000, 'n_runs': 3, 'n_features': 50},
        4: {'n_ref': 50000, 'n_test': 50000, 'n_runs': 3, 'n_features': 50},
        5: {'n_ref': 100000, 'n_test': 100000, 'n_runs': 3, 'n_features': 50}
    },
    'pytorch': {
        0: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 3, 'n_features': 50},
        1: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 3, 'n_features': 50},
        2: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 3, 'n_features': 50},
        3: {'n_ref': 20000, 'n_test': 20000, 'n_runs': 3, 'n_features': 50},
        4: {'n_ref': 50000, 'n_test': 50000, 'n_runs': 3, 'n_features': 50},
        5: {'n_ref': 100000, 'n_test': 100000, 'n_runs': 3, 'n_features': 50}
    }
}

#| scrolled: true
results = {backend: {} for backend in backends}

for backend in backends:
    exps = experiments_eq[backend]
    for i, exp in exps.items():
        results[backend][i] = experiment(
            'learned_kernel', backend, exp['n_runs'], exp['n_ref'], exp['n_test'], exp['n_features']
        )

We again plot the absolute and relative (PyTorch / KeOps) mean prediction times for the learned kernel MMD drift detector for different feature dimensions:

max_batch_size = 100000

plot_absolute_time(experiments_eq, results, [50], max_batch_size=max_batch_size)

plot_relative_time(experiments_eq, results, [50], max_batch_size=max_batch_size)

Conclusion

As illustrated in the experiments, KeOps allows you to drastically speed up and scale up drift detection to larger datasets without running into memory issues. The speed benefit of KeOps over the PyTorch (or TensorFlow) MMD detectors decrease as the number of features increases. Note though that it is not advised to apply the (untrained) MMD detector to very high-dimensional data in the first place and that we can apply dimensionality reduction via the deep kernel for the learned kernel MMD detector.

Kolmogorov-Smirnov data drift detector on CIFAR-10

Method

Backend

Dataset

import matplotlib.pyplot as plt
import numpy as np
import os
import tensorflow as tf

from alibi_detect.cd import KSDrift
from alibi_detect.models.tensorflow import scale_by_instance
from alibi_detect.utils.fetching import fetch_tf_model, fetch_detector
from alibi_detect.saving import save_detector, load_detector
from alibi_detect.datasets import fetch_cifar10c, corruption_types_cifar10c

Load data

Original CIFAR-10 data:

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
y_train = y_train.astype('int64').reshape(-1,)
y_test = y_test.astype('int64').reshape(-1,)

For CIFAR-10-C, we can select from the following corruption types at 5 severity levels:

corruptions = corruption_types_cifar10c()
print(corruptions)

Let's pick a subset of the corruptions at corruption level 5. Each corruption type consists of perturbations on all of the original test set images.

corruption = ['gaussian_noise', 'motion_blur', 'brightness', 'pixelate']
X_corr, y_corr = fetch_cifar10c(corruption=corruption, severity=5, return_X_y=True)
X_corr = X_corr.astype('float32') / 255

We split the original test set in a reference dataset and a dataset which should not be rejected under the H0 of the K-S test. We also split the corrupted data by corruption type:

np.random.seed(0)
n_test = X_test.shape[0]
idx = np.random.choice(n_test, size=n_test // 2, replace=False)
idx_h0 = np.delete(np.arange(n_test), idx, axis=0)
X_ref,y_ref = X_test[idx], y_test[idx]
X_h0, y_h0 = X_test[idx_h0], y_test[idx_h0]
print(X_ref.shape, X_h0.shape)

# check that the classes are more or less balanced
classes, counts_ref = np.unique(y_ref, return_counts=True)
counts_h0 = np.unique(y_h0, return_counts=True)[1]
print('Class Ref H0')
for cl, cref, ch0 in zip(classes, counts_ref, counts_h0):
    assert cref + ch0 == n_test // 10
    print('{}     {} {}'.format(cl, cref, ch0))

n_corr = len(corruption)
X_c = [X_corr[i * n_test:(i + 1) * n_test] for i in range(n_corr)]

We can visualise the same instance for each corruption type:

#| tags: [hide_input]
i = 1

n_test = X_test.shape[0]
plt.title('Original')
plt.axis('off')
plt.imshow(X_test[i])
plt.show()
for _ in range(len(corruption)):
    plt.title(corruption[_])
    plt.axis('off')
    plt.imshow(X_corr[n_test * _+ i])
    plt.show()

We can also verify that the performance of a classification model on CIFAR-10 drops significantly on this perturbed dataset:

dataset = 'cifar10'
model = 'resnet32'
clf = fetch_tf_model(dataset, model)
acc = clf.evaluate(scale_by_instance(X_test), y_test, batch_size=128, verbose=0)[1]
print('Test set accuracy:')
print('Original {:.4f}'.format(acc))
clf_accuracy = {'original': acc}
for _ in range(len(corruption)):
    acc = clf.evaluate(scale_by_instance(X_c[_]), y_test, batch_size=128, verbose=0)[1]
    clf_accuracy[corruption[_]] = acc
    print('{} {:.4f}'.format(corruption[_], acc))

Given the drop in performance, it is important that we detect the harmful data drift!

Detect drift

Random encoder

First we try the randomly initialized encoder:

#| scrolled: false
from functools import partial
from tensorflow.keras.layers import Conv2D, Dense, Flatten, InputLayer, Reshape
from alibi_detect.cd.tensorflow import preprocess_drift

tf.random.set_seed(0)

# define encoder
encoding_dim = 32
encoder_net = tf.keras.Sequential(
  [
      InputLayer(input_shape=(32, 32, 3)),
      Conv2D(64, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(128, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(512, 4, strides=2, padding='same', activation=tf.nn.relu),
      Flatten(),
      Dense(encoding_dim,)
  ]
)

# define preprocessing function
preprocess_fn = partial(preprocess_drift, model=encoder_net, batch_size=512)

# initialise drift detector
p_val = .05
cd = KSDrift(X_ref, p_val=p_val, preprocess_fn=preprocess_fn)

# we can also save/load an initialised detector
filepath = 'my_path'  # change to directory where detector is saved
save_detector(cd, filepath)
cd = load_detector(filepath)

assert cd.p_val / cd.n_features == p_val / encoding_dim

Let's check whether the detector thinks drift occurred on the different test sets and time the prediction calls:

from timeit import default_timer as timer

labels = ['No!', 'Yes!']

def make_predictions(cd, x_h0, x_corr, corruption):
    t = timer()
    preds = cd.predict(x_h0)
    dt = timer() - t
    print('No corruption')
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print('Feature-wise p-values:')
    print(preds['data']['p_val'])
    print(f'Time (s) {dt:.3f}')
    
    if isinstance(x_corr, list):
        for x, c in zip(x_corr, corruption):
            t = timer()
            preds = cd.predict(x)
            dt = timer() - t
            print('')
            print(f'Corruption type: {c}')
            print('Drift? {}'.format(labels[preds['data']['is_drift']]))
            print('Feature-wise p-values:')
            print(preds['data']['p_val'])
            print(f'Time (s) {dt:.3f}')

make_predictions(cd, X_h0, X_c, corruption)

As expected, drift was only detected on the corrupted datasets. The feature-wise p-values for each univariate K-S test per (encoded) feature before multivariate correction show that most of them are well above the $0.05$ threshold for H0 and below for the corrupted datasets.

BBSDs

X_train = scale_by_instance(X_train)
X_test = scale_by_instance(X_test)
X_ref = scale_by_instance(X_ref)
X_h0 = scale_by_instance(X_h0)
X_c = [scale_by_instance(X_c[i]) for i in range(n_corr)]

Now we initialize the detector. Here we use the output of the softmax layer to detect the drift, but other hidden layers can be extracted as well by setting 'layer' to the index of the desired hidden layer in the model:

from alibi_detect.cd.tensorflow import HiddenOutput

# define preprocessing function, we use the
preprocess_fn = partial(preprocess_drift, model=HiddenOutput(clf, layer=-1), batch_size=128)

cd = KSDrift(X_ref, p_val=p_val, preprocess_fn=preprocess_fn)

assert cd.p_val / cd.n_features == p_val / 10

There is no drift on the original held out test set:

make_predictions(cd, X_h0, X_c, corruption)

Label drift

We can also check what happens when we introduce class imbalances between the reference data X_ref and the tested data X_imb. The reference data will use $75$% of the instances of the first 5 classes and only $25$% of the last 5. The data used for drift testing then uses respectively $25$% and $75$% of the test instances for the first and last 5 classes.

np.random.seed(0)
# get index for each class in the test set
num_classes = len(np.unique(y_test))
idx_by_class = [np.where(y_test == c)[0] for c in range(num_classes)]
# sample imbalanced data for different classes for X_ref and X_imb
perc_ref = .75
perc_ref_by_class = [perc_ref if c < 5 else 1 - perc_ref for c in range(num_classes)]
n_by_class = n_test // num_classes
X_ref = []
X_imb, y_imb = [], []
for _ in range(num_classes):
    idx_class_ref = np.random.choice(n_by_class, size=int(perc_ref_by_class[_] * n_by_class), replace=False)
    idx_ref = idx_by_class[_][idx_class_ref]
    idx_class_imb = np.delete(np.arange(n_by_class), idx_class_ref, axis=0)
    idx_imb = idx_by_class[_][idx_class_imb]
    assert not np.array_equal(idx_ref, idx_imb)
    X_ref.append(X_test[idx_ref])
    X_imb.append(X_test[idx_imb])
    y_imb.append(y_test[idx_imb])
X_ref = np.concatenate(X_ref)
X_imb = np.concatenate(X_imb)
y_imb = np.concatenate(y_imb)
print(X_ref.shape, X_imb.shape, y_imb.shape)

Update reference dataset for the detector and make predictions. Note that we store the preprocessed reference data since the preprocess_at_init kwarg is by default True:

cd.x_ref = cd.preprocess_fn(X_ref)

preds_imb = cd.predict(X_imb)
print('Drift? {}'.format(labels[preds_imb['data']['is_drift']]))
print(preds_imb['data']['p_val'])

Update reference data

N = 7500
cd = KSDrift(X_ref, p_val=.05, preprocess_fn=preprocess_fn, update_x_ref={'reservoir_sampling': N})

The reference data is now updated with each predict call. Say we start with our imbalanced reference set and make a prediction on the remaining test set data X_imb, then the drift detector will figure out data drift has occurred.

preds_imb = cd.predict(X_imb)
print('Drift? {}'.format(labels[preds_imb['data']['is_drift']]))

We can now see that the reference data consists of N instances, obtained through reservoir sampling.

assert cd.x_ref.shape[0] == N

We then draw a random sample from the training set and compare it with the updated reference data. This still highlights that there is data drift but will update the reference data again:

np.random.seed(0)
perc_train = .5
n_train = X_train.shape[0]
idx_train = np.random.choice(n_train, size=int(perc_train * n_train), replace=False)

preds_train = cd.predict(X_train[idx_train])
print('Drift? {}'.format(labels[preds_train['data']['is_drift']]))

When we draw a new sample from the training set, it highlights that it is not drifting anymore against the reservoir in X_ref.

np.random.seed(1)
perc_train = .1
idx_train = np.random.choice(n_train, size=int(perc_train * n_train), replace=False)
preds_train = cd.predict(X_train[idx_train])
print('Drift? {}'.format(labels[preds_train['data']['is_drift']]))

Multivariate correction mechanism

cd = KSDrift(X_ref, p_val=.05, preprocess_fn=preprocess_fn, correction='fdr')

preds_imb = cd.predict(X_imb)
print('Drift? {}'.format(labels[preds_imb['data']['is_drift']]))

Adversarial autoencoder as a malicious drift detector

load_pretrained = True

#| scrolled: true
from tensorflow.keras.regularizers import l1
from tensorflow.keras.layers import Conv2DTranspose
from alibi_detect.ad import AdversarialAE

# change filepath to (absolute) directory where model is downloaded
filepath = os.path.join(os.getcwd(), 'my_path')
detector_type = 'adversarial'
detector_name = 'base'
filepath = os.path.join(filepath, detector_name)
if load_pretrained:
    ad = fetch_detector(filepath, detector_type, dataset, detector_name, model=model)
else:  # train detector from scratch
    # define encoder and decoder networks
    encoder_net = tf.keras.Sequential(
            [
                InputLayer(input_shape=(32, 32, 3)),
                Conv2D(32, 4, strides=2, padding='same', 
                       activation=tf.nn.relu, kernel_regularizer=l1(1e-5)),
                Conv2D(64, 4, strides=2, padding='same', 
                       activation=tf.nn.relu, kernel_regularizer=l1(1e-5)),
                Conv2D(256, 4, strides=2, padding='same', 
                       activation=tf.nn.relu, kernel_regularizer=l1(1e-5)),
                Flatten(),
                Dense(40)
            ]
        )
    
    decoder_net = tf.keras.Sequential(
        [
                InputLayer(input_shape=(40,)),
                Dense(4 * 4 * 128, activation=tf.nn.relu),
                Reshape(target_shape=(4, 4, 128)),
                Conv2DTranspose(256, 4, strides=2, padding='same', 
                                activation=tf.nn.relu, kernel_regularizer=l1(1e-5)),
                Conv2DTranspose(64, 4, strides=2, padding='same', 
                                activation=tf.nn.relu, kernel_regularizer=l1(1e-5)),
                Conv2DTranspose(3, 4, strides=2, padding='same', 
                                activation=None, kernel_regularizer=l1(1e-5))
            ]
        )
    
    # initialise and train detector
    ad = AdversarialAE(encoder_net=encoder_net, decoder_net=decoder_net, model=clf)
    ad.fit(X_train, epochs=50, batch_size=128, verbose=True)
    
    # save the trained adversarial detector
    save_detector(ad, filepath)

Initialise the drift detector:

np.random.seed(0)
idx = np.random.choice(n_test, size=n_test // 2, replace=False)
X_ref = scale_by_instance(X_test[idx])

# adversarial score fn = preprocess step
preprocess_fn = partial(ad.score, batch_size=128)

cd = KSDrift(X_ref, p_val=.05, preprocess_fn=preprocess_fn)

Make drift predictions on the original test set and corrupted data:

clf_accuracy['h0'] = clf.evaluate(X_h0, y_h0, batch_size=128, verbose=0)[1]
preds_h0 = cd.predict(X_h0)
print('H0: Accuracy {:.4f} -- Drift? {}'.format(
    clf_accuracy['h0'], labels[preds_h0['data']['is_drift']]))
clf_accuracy['imb'] = clf.evaluate(X_imb, y_imb, batch_size=128, verbose=0)[1]
preds_imb = cd.predict(X_imb)
print('imbalance: Accuracy {:.4f} -- Drift? {}'.format(
    clf_accuracy['imb'], labels[preds_imb['data']['is_drift']]))
for x, c in zip(X_c, corruption):
    preds = cd.predict(x)
    print('{}: Accuracy {:.4f} -- Drift? {}'.format(
        c, clf_accuracy[c],labels[preds['data']['is_drift']]))

While X_imb clearly exhibits input data drift due to the introduced class imbalances, it is not flagged by the adversarial drift detector since the performance of the classifier is not affected and the drift is not malicious. We can visualise this by plotting the adversarial scores together with the harmfulness of the data corruption as reflected by the drop in classifier accuracy:

adv_scores = {}
score = ad.score(X_ref, batch_size=128)
adv_scores['original'] = {'mean': score.mean(), 'std': score.std()}
score = ad.score(X_h0, batch_size=128)
adv_scores['h0'] = {'mean': score.mean(), 'std': score.std()}
score = ad.score(X_imb, batch_size=128)
adv_scores['imb'] = {'mean': score.mean(), 'std': score.std()}

for x, c in zip(X_c, corruption):
    score_x = ad.score(x, batch_size=128)
    adv_scores[c] = {'mean': score_x.mean(), 'std': score_x.std()}

mu = [v['mean'] for _, v in adv_scores.items()]
stdev = [v['std'] for _, v in adv_scores.items()]
xlabels = list(adv_scores.keys())
acc = [clf_accuracy[label] for label in xlabels]
xticks = np.arange(len(mu))

width = .35

fig, ax = plt.subplots()
ax2 = ax.twinx()

p1 = ax.bar(xticks, mu, width, yerr=stdev, capsize=2)
color = 'tab:red'
p2 = ax2.bar(xticks + width, acc, width, color=color)

ax.set_title('Adversarial Scores and Accuracy by Corruption Type')
ax.set_xticks(xticks + width / 2)
ax.set_xticklabels(xlabels, rotation=45)
ax.legend((p1[0], p2[0]), ('Score', 'Accuracy'), loc='upper right', ncol=2)
ax.set_ylabel('Adversarial Score')

color = 'tab:red'
ax2.set_ylabel('Accuracy')
ax2.set_ylim((-.26,1.2))
ax.set_ylim((-2,9))

plt.show()

We can therefore use the scores of the detector itself to quantify the harmfulness of the drift! We can generalise this to all the corruptions at each severity level in CIFAR-10-C:

def accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return (y_true == y_pred).astype(int).sum() / y_true.shape[0]

from alibi_detect.utils.tensorflow import predict_batch

severities = [1, 2, 3, 4, 5]

score_drift = {
    1: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
    2: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
    3: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
    4: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
    5: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
}

y_pred = predict_batch(X_test, clf, batch_size=256).argmax(axis=1)
score_x = ad.score(X_test, batch_size=256)

for s in severities:
    print('\nSeverity: {} of {}'.format(s, len(severities)))
    
    print('Loading corrupted dataset...')
    X_corr, y_corr = fetch_cifar10c(corruption=corruptions, severity=s, return_X_y=True)
    X_corr = X_corr.astype('float32')
    
    print('Preprocess data...')
    X_corr = scale_by_instance(X_corr)
    
    print('Make predictions on corrupted dataset...')
    y_pred_corr = predict_batch(X_corr, clf, batch_size=256).argmax(axis=1)
    
    print('Compute adversarial scores on corrupted dataset...')
    score_corr = ad.score(X_corr, batch_size=256)
    
    print('Get labels for malicious corruptions...')
    labels_corr = np.zeros(score_corr.shape[0])
    repeat = y_corr.shape[0] // y_test.shape[0]
    y_pred_repeat = np.tile(y_pred, (repeat,))
    # malicious/harmful corruption: original prediction correct but 
    # prediction on corrupted data incorrect
    idx_orig_right = np.where(y_pred_repeat == y_corr)[0]
    idx_corr_wrong = np.where(y_pred_corr != y_corr)[0]
    idx_harmful = np.intersect1d(idx_orig_right, idx_corr_wrong)
    labels_corr[idx_harmful] = 1
    labels = np.concatenate([np.zeros(X_test.shape[0]), labels_corr]).astype(int)
    # harmless corruption: original prediction correct and prediction
    # on corrupted data correct
    idx_corr_right = np.where(y_pred_corr == y_corr)[0]
    idx_harmless = np.intersect1d(idx_orig_right, idx_corr_right)
    
    score_drift[s]['all'] = score_corr
    score_drift[s]['harm'] = score_corr[idx_harmful]
    score_drift[s]['noharm'] = score_corr[idx_harmless]
    score_drift[s]['acc'] = accuracy(y_corr, y_pred_corr)

We now compute mean scores and standard deviations per severity level and plot the results. The plot shows the mean adversarial scores (lhs) and ResNet-32 accuracies (rhs) for increasing data corruption severity levels. Level 0 corresponds to the original test set. Harmful scores are scores from instances which have been flipped from the correct to an incorrect prediction because of the corruption. Not harmful means that the prediction was unchanged after the corruption.

mu_noharm, std_noharm = [], []
mu_harm, std_harm = [], []
acc = [clf_accuracy['original']]
for k, v in score_drift.items():
    mu_noharm.append(v['noharm'].mean())
    std_noharm.append(v['noharm'].std())
    mu_harm.append(v['harm'].mean())
    std_harm.append(v['harm'].std())
    acc.append(v['acc'])

plot_labels = ['0', '1', '2', '3', '4', '5']

N = 6
ind = np.arange(N)
width = .35

fig_bar_cd, ax = plt.subplots()
ax2 = ax.twinx()

p0 = ax.bar(ind[0], score_x.mean(), yerr=score_x.std(), capsize=2)
p1 = ax.bar(ind[1:], mu_noharm, width, yerr=std_noharm, capsize=2)
p2 = ax.bar(ind[1:] + width, mu_harm, width, yerr=std_harm, capsize=2)

ax.set_title('Adversarial Scores and Accuracy by Corruption Severity')
ax.set_xticks(ind + width / 2)
ax.set_xticklabels(plot_labels)
ax.set_ylim((-1,6))
ax.legend((p1[0], p2[0]), ('Not Harmful', 'Harmful'), loc='upper right', ncol=2)
ax.set_ylabel('Score')
ax.set_xlabel('Corruption Severity')

color = 'tab:red'
ax2.set_ylabel('Accuracy', color=color)
ax2.plot(acc, color=color)
ax2.tick_params(axis='y', labelcolor=color)

plt.show()

Drift detection on molecular graphs

Methods

We illustrate drift detection on molecular graphs using a variety of detectors:

Dataset

We will train a classification model and detect drift on the ogbg-molhiv dataset. The dataset contains molecular graphs with both atom features (atomic number-1, chirality, node degree, formal charge, number of H bonds, number of radical electrons, hybridization, aromatic?, in a ring?) and bond level properties (bond type (e.g. single or double), bond stereo code, conjugated?). The goal is to predict whether a molecule inhibits HIV virus replication or not, so the task is binary classification.

The dataset is split using the scaffold splitting procedure. This means that the molecules are split based on their 2D structural framework. Structurally different molecules are grouped into different subsets (train, validation, test) which could mean that there is drift between the splits.

Dependencies

import numpy as np
import os
import torch

def set_seed(seed: int) -> None:
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    np.random.seed(seed)

set_seed(0)

Load and analyze data

from ogb.graphproppred import PygGraphPropPredDataset
from torch_geometric.data import DataLoader

#| scrolled: true
dataset_name = 'ogbg-molhiv'
batch_size = 32
dataset = PygGraphPropPredDataset(name=dataset_name)
split_idx = dataset.get_idx_split()

We set some samples apart to serve as the reference data for our drift detectors. Note that the allowed format of the reference data is very flexible and can be np.ndarray or List[Any]:

n_ref = 1000
n_h0 = 500

idx_tr = split_idx['train']
idx_sample = np.random.choice(idx_tr.numpy(), size=n_ref + n_h0, replace=False)
idx_ref, idx_h0 = idx_sample[:n_ref], idx_sample[n_ref:]
x_ref = [dataset[i] for i in idx_ref]
x_h0 = [dataset[i] for i in idx_h0]
idx_tr = torch.from_numpy(np.setdiff1d(idx_tr, idx_sample))
print(f'Number of reference instances: {len(x_ref)}')
print(f'Number of H0 instances: {len(x_h0)}')

dl_tr = DataLoader(dataset[idx_tr], batch_size=batch_size, shuffle=True)
dl_val = DataLoader(dataset[split_idx['valid']], batch_size=batch_size, shuffle=False)
dl_te = DataLoader(dataset[split_idx['test']], batch_size=batch_size, shuffle=False)
print(f'Number of train, val and test batches: {len(dl_tr)}, {len(dl_val)} and {len(dl_te)}')

ds = dataset
print()
print(f'Dataset: {ds}:')
print('=============================================================')
print(f'Number of graphs: {len(ds)}')
print(f'Number of node features: {ds.num_node_features}')
print(f'Number of edge features: {ds.num_edge_features}')
print(f'Number of classes: {ds.num_classes}')

i = 0
d = ds[i]

print(f'\nExample: {d}')
print('=============================================================')

print(f'Number of nodes: {d.num_nodes}')
print(f'Number of edges: {d.num_edges}')
print(f'Average node degree: {d.num_edges / d.num_nodes:.2f}')
print(f'Contains isolated nodes: {d.contains_isolated_nodes()}')
print(f'Contains self-loops: {d.contains_self_loops()}')
print(f'Is undirected: {d.is_undirected()}')

Let's plot some graph summary statistics such as the distribution of the node degrees, number of nodes and edges as well as the clustering coefficients:

#| scrolled: true
import matplotlib.pyplot as plt
import networkx as nx
from networkx.algorithms.cluster import clustering
from torch_geometric.utils import degree, to_networkx
from tqdm import tqdm
from typing import Tuple


def degrees_and_clustering(loader: DataLoader) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    degrees, c_coeff, num_nodes, num_edges = [], [], [], []
    for data in tqdm(loader):
        row, col = data.edge_index
        deg = degree(row, data.x.size(0), dtype=data.x.dtype)
        degrees.append(deg.numpy())
        g = to_networkx(data, node_attrs=['x'], edge_attrs=['edge_attr'], to_undirected=True)
        c = list(clustering(g).values())
        c_coeff.append(c)
        num_nodes += [d.num_nodes for d in data.to_data_list()]
        num_edges += [d.num_edges for d in data.to_data_list()]
    degrees = np.concatenate(degrees, axis=0)
    c_coeff = np.concatenate(c_coeff, axis=0)
    return degrees, c_coeff, np.array(num_nodes), np.array(num_edges)


# x: nodes, edges, degree, cluster
def plot_histogram(x: str, bins: int = None, log: bool = True) -> None:
    if x == 'nodes':
        vals = [num_nodes_tr, num_nodes_val, num_nodes_te]
    elif x == 'edges':
        vals = [num_edges_tr, num_edges_val, num_edges_te]
    elif x == 'degree':
        vals = [degree_tr, degree_val, degree_te]
    elif x == 'cluster':
        vals = [cluster_tr, cluster_val, cluster_te]
    labels = ['train', 'val', 'test']
    for v, l in zip(vals, labels):
        plt.hist(v, density=True, log=log, label=l, bins=bins)
    plt.title(f'{x} distribution')
    plt.legend()
    plt.show()

degree_tr, cluster_tr, num_nodes_tr, num_edges_tr = degrees_and_clustering(dl_tr)
degree_val, cluster_val, num_nodes_val, num_edges_val = degrees_and_clustering(dl_val)
degree_te, cluster_te, num_nodes_te, num_edges_te = degrees_and_clustering(dl_te)

print('Average number and stdev of nodes, edges, degree and clustering coefficients:')
print('\nTrain...')
print(f'Nodes: {num_nodes_tr.mean():.1f} +- {num_nodes_tr.std():.1f}')
print(f'Edges: {num_edges_tr.mean():.1f} +- {num_edges_tr.std():.1f}')
print(f'Degree: {degree_tr.mean():.1f} +- {degree_tr.std():.1f}')
print(f'Clustering: {cluster_tr.mean():.3f} +- {cluster_tr.std():.3f}')

print('\nValidation...')
print(f'Nodes: {num_nodes_val.mean():.1f} +- {num_nodes_val.std():.1f}')
print(f'Edges: {num_edges_val.mean():.1f} +- {num_edges_val.std():.1f}')
print(f'Degree: {degree_val.mean():.1f} +- {degree_val.std():.1f}')
print(f'Clustering: {cluster_val.mean():.3f} +- {cluster_val.std():.3f}')

print('\nTest...')
print(f'Nodes: {num_nodes_te.mean():.1f} +- {num_nodes_te.std():.1f}')
print(f'Edges: {num_edges_te.mean():.1f} +- {num_edges_te.std():.1f}')
print(f'Degree: {degree_te.mean():.1f} +- {degree_te.std():.1f}')
print(f'Clustering: {cluster_te.mean():.3f} +- {cluster_te.std():.3f}')

plot_histogram('nodes', bins=50)
plot_histogram('edges', bins=50)
plot_histogram('degree')
plot_histogram('cluster')

While the average number of nodes and edges are similar across the splits, the histograms show that the tails are slightly heavier for the training graphs.

Plot molecules

def draw_molecule(g, edge_mask=None, draw_edge_labels=False):
    g = g.copy().to_undirected()
    node_labels = {}
    for u, data in g.nodes(data=True):
        node_labels[u] = data['name']
    pos = nx.planar_layout(g)
    pos = nx.spring_layout(g, pos=pos)
    if edge_mask is None:
        edge_color = 'black'
        widths = None
    else:
        edge_color = [edge_mask[(u, v)] for u, v in g.edges()]
        widths = [x * 10 for x in edge_color]
    nx.draw(g, pos=pos, labels=node_labels, width=widths,
            edge_color=edge_color, edge_cmap=plt.cm.Blues,
            node_color='azure')
    
    if draw_edge_labels and edge_mask is not None:
        edge_labels = {k: ('%.2f' % v) for k, v in edge_mask.items()}    
        nx.draw_networkx_edge_labels(g, pos, edge_labels=edge_labels,
                                    font_color='red')
    plt.show()


def to_molecule(data):
    ATOM_MAP = ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P',
                'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn',
                'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y']
    g = to_networkx(data, node_attrs=['x'])
    for u, data in g.nodes(data=True):
        data['name'] = ATOM_MAP[data['x'][0]]
        del data['x']
    return g

i = 0
mol = to_molecule(dataset[i])
plt.figure(figsize=(10, 5))
draw_molecule(mol)

Train and evaluate a GNN classification model

import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data.batch import Batch
from torch_geometric.nn import MessagePassing, global_add_pool, global_max_pool, global_mean_pool, LayerNorm
from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder


class GINConv(MessagePassing):
    def __init__(self, emb_dim: int) -> None:
        super().__init__(aggr='add')
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, 2 * emb_dim),
            nn.BatchNorm1d(2 * emb_dim),
            nn.ReLU(),
            nn.Linear(2 * emb_dim, emb_dim)
        )
        self.eps = nn.Parameter(torch.Tensor([0.]))
        self.bond_encoder = BondEncoder(emb_dim=emb_dim)  # encode edge features

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor, edge_attr: torch.Tensor) -> torch.Tensor:
        edge_emb = self.bond_encoder(edge_attr)
        return self.mlp((1 + self.eps) * x + self.propagate(edge_index, x=x, edge_attr=edge_emb))

    def message(self, x_j: torch.Tensor, edge_attr: torch.Tensor) -> torch.Tensor:
        return x_j + edge_attr

    def update(self, aggr_out: torch.Tensor) -> torch.Tensor:
        return aggr_out


class GIN(nn.Module):
    def __init__(self, n_layer: int = 5, emb_dim: int = 64, n_out: int = 2, dropout: float = .5,
                 jk: bool = True, residual: bool = True, pool: str = 'add', norm: str = 'batch') -> None:
        super().__init__()
        self.n_layer = n_layer
        self.jk = jk  # jumping-knowledge
        self.residual = residual  # residual/skip connections
        self.atom_encoder = AtomEncoder(emb_dim=emb_dim)  # encode node features
        self.convs = nn.ModuleList([GINConv(emb_dim) for _ in range(n_layer)])
        norm = nn.BatchNorm1d if norm == 'batch' else LayerNorm
        self.bns = nn.ModuleList([norm(emb_dim) for _ in range(n_layer)])
        if pool == 'mean':
            self.pool = global_mean_pool
        elif pool == 'add':
            self.pool = global_add_pool
        elif pool == 'max':
            self.pool = global_max_pool
        pool_dim = (n_layer + 1) * emb_dim if jk else emb_dim
        self.linear = nn.Linear(pool_dim, n_out)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, data: Batch) -> torch.Tensor:
        x, edge_index, edge_attr, batch = data.x, data.edge_index, data.edge_attr, data.batch
        # node embeddings
        hs = [self.atom_encoder(x)]
        for layer in range(self.n_layer):
            h = self.convs[layer](hs[layer], edge_index, edge_attr)
            h = self.bns[layer](h)
            if layer < self.n_layer - 1:
                h = F.relu(h)
            if self.residual:
                h += hs[layer]
            hs += [h]
        # graph embedding and prediction
        if self.jk:
            h = torch.cat([h for h in hs], -1)
        h_pool = self.pool(h, batch)
        h_drop = self.dropout(h_pool)
        h_out = self.linear(h_drop)
        return h_out

#| scrolled: true
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'device: {device}')

n_layer = 5
emb_dim = 300
n_out = 1
dropout = .5
jk = True
residual = False
pool = 'mean'
norm = 'batch'

model = GIN(n_layer, emb_dim, n_out, dropout, jk, residual, pool, norm).to(device)

load_path = 'gnn'  # set to None if no pretrained model available

#| scrolled: true
from ogb.graphproppred import Evaluator
from tqdm import tqdm

criterion = nn.BCEWithLogitsLoss()
optim = torch.optim.Adam(model.parameters(), lr=.001)
evaluator = Evaluator(name=dataset_name)  # ROC-AUC for ogbg-molhiv


def train(loader: DataLoader, verbose: bool = False) -> None:
    dl = tqdm(loader, total=len(loader)) if verbose else loader
    model.train()
    for data in dl:
        data = data.to(device)
        optim.zero_grad()
        y_hat = model(data)
        is_labeled = data.y == data.y
        loss = criterion(y_hat[is_labeled], data.y[is_labeled].float())
        loss.backward()
        optim.step()
        if verbose:
            dl.set_postfix(dict(loss=loss.item()))


def evaluate(loader: DataLoader, split: str, verbose: bool = False) -> float:
    dl = tqdm(loader, total=len(loader)) if verbose else loader
    model.eval()
    y_pred, y_true = [], []
    for data in dl:
        data = data.to(device)
        with torch.no_grad():
            y_hat = model(data)
        y_pred.append(y_hat.cpu())
        y_true.append(data.y.float().cpu())
    y_true = torch.cat(y_true, dim=0)
    y_pred = torch.cat(y_pred, dim=0)
    loss = criterion(y_pred, y_true)
    input_dict = dict(y_true=y_true, y_pred=y_pred)
    result_dict = evaluator.eval(input_dict)
    print(f'{split} ROC-AUC: {result_dict["rocauc"]:.3f} -- loss: {loss:.3f}')
    return result_dict["rocauc"]


if load_path is None or not os.path.isdir(load_path):
    epochs = 150
    rocauc_best = 0.
    save_path = 'gnn'
    for epoch in range(epochs):
        print(f'\nEpoch {epoch + 1} / {epochs}')
        train(dl_tr)
        _ = evaluate(dl_tr, 'train')
        rocauc = evaluate(dl_val, 'val')
        if rocauc > rocauc_best and os.path.isdir(save_path):
            print('Saving new best model.')
            rocauc_best = rocauc
            torch.save(model.state_dict(), os.path.join(save_path, 'model.dict'))
        _ = evaluate(dl_te, 'test')
    load_path = save_path


# load (best) model
model.load_state_dict(torch.load(os.path.join(load_path, 'model.dict')))
_ = evaluate(dl_tr, 'train')
_ = evaluate(dl_val, 'val')
_ = evaluate(dl_te, 'test')

Detect drift

Prediction distribution drift

from torch_geometric.data import Batch, Data
from typing import Dict, List, Union


labels = ['No!', 'Yes!']
def make_predictions(dd, xs: Dict[str, List[Data]]) -> None:
    for split, x in xs.items():
        preds = dd.predict(x)
        dl = DataLoader(x, batch_size=32, shuffle=False)
        _ = evaluate(dl, split)
        print('Drift? {}'.format(labels[preds['data']['is_drift']]))
        if isinstance(preds["data"]["p_val"], (list, np.ndarray)):
            print(f'p-value: {preds["data"]["p_val"]}')
        else:
            print(f'p-value: {preds["data"]["p_val"]:.3f}')
        print('')
        

def sample(split: str, n: int, ) -> List[Data]:
    idx = np.random.choice(split_idx[split].numpy(), size=n, replace=False)
    return [dataset[i] for i in idx]

Because we pass lists with torch_geometric.data.Data objects to the detector, we need to preprocess the data using the batch_fn into torch_geometric.data.Batch objects which can be fed to the model. Then we detect drift on the model prediction distribution.

from alibi_detect.cd import KSDrift
from alibi_detect.utils.pytorch import predict_batch
from functools import partial


def batch_fn(data: Union[List[Data], Batch]) -> Batch:
    if isinstance(data, Batch):
        return data
    else:
        return Batch().from_data_list(data)


preprocess_fn = partial(predict_batch, model=model, device=device, preprocess_fn=batch_fn, batch_size=32)
dd = KSDrift(x_ref, p_val=.05, preprocess_fn=preprocess_fn)

Since the dataset is heavily imbalanced, we will test the detectors on a sample which oversamples from the minority class (molecules which inhibit HIV virus replication):

split = 'test'
x_imb = sample(split, 500)
n = 0
for i in split_idx[split]:
    if dataset[i].y[0].item() == 1:
        x_imb.append(dataset[i])
        n += 1
print(f'# instances: {len(x_imb)} -- # class 1: {n}')

xs = {'H0': x_h0, 'test sample': sample('test', 500), 'imbalanced sample': x_imb}
make_predictions(dd, xs)

As expected, prediction distribution shift is detected for the imbalanced sample but not for the random test sample with similar label distribution as the reference data.

Prediction uncertainty drift

#| scrolled: false
from alibi_detect.cd import RegressorUncertaintyDrift

dd = RegressorUncertaintyDrift(x_ref, model=model, backend='pytorch', p_val=.05, n_evals=100,
                               uncertainty_type='mc_dropout', preprocess_batch_fn=batch_fn)

make_predictions(dd, xs)

Although we didn't pick up drift in the GIN model prediction distribution for the test sample, we can see that the model is less certain about the predictions on the test set, illustrated by the lower ROC-AUC.

Input data drift using the MMD detector

class Encoder(nn.Module):
    def __init__(self, n_layer: int = 1, emb_dim: int = 64, jk: bool = True, 
                 residual: bool = True, pool: str = 'add', norm: str = 'batch') -> None:
        super().__init__()
        self.n_layer = n_layer
        self.jk = jk  # jumping-knowledge
        self.residual = residual  # residual/skip connections
        self.atom_encoder = AtomEncoder(emb_dim=emb_dim)  # encode node features
        self.convs = nn.ModuleList([GINConv(emb_dim) for _ in range(n_layer)])
        norm = nn.BatchNorm1d if norm == 'batch' else LayerNorm
        self.bns = nn.ModuleList([norm(emb_dim) for _ in range(n_layer)])
        self.pool = global_add_pool

    def forward(self, data: Batch) -> torch.Tensor:
        x, edge_index, edge_attr, batch = data.x, data.edge_index, data.edge_attr, data.batch
        # node embeddings
        hs = [self.atom_encoder(x)]
        for layer in range(self.n_layer):
            h = self.convs[layer](hs[layer], edge_index, edge_attr)
            h = self.bns[layer](h)
            if layer < self.n_layer - 1:
                h = F.relu(h)
            if self.residual:
                h += hs[layer]
            hs += [h]
        # graph embedding and prediction
        if self.jk:
            h = torch.cat([h for h in hs], -1)
        h_out = self.pool(h, batch)
        return h_out

from alibi_detect.cd import MMDDrift

enc = Encoder(n_layer=1).to(device)
preprocess_fn = partial(predict_batch, model=enc, device=device, preprocess_fn=batch_fn, batch_size=32)
dd = MMDDrift(x_ref, backend='pytorch', p_val=.05, n_permutations=1000, preprocess_fn=preprocess_fn)

make_predictions(dd, xs)

Input data drift using a learned kernel

from alibi_detect.cd import LearnedKernelDrift
from alibi_detect.utils.pytorch import DeepKernel

kernel = DeepKernel(enc, kernel_b=None)  # use the already defined random encoder in the deep kernel
dd = LearnedKernelDrift(x_ref, kernel, backend='pytorch', p_val=.05, dataloader=DataLoader, 
                        preprocess_batch_fn=batch_fn, epochs=2)

make_predictions(dd, xs)

Since the molecular scaffolds are different across the train, validation and test sets, we expect that this type of data shift is picked up in the input data (technically not the input but the graph embedding).

Drift on graph statistics

# return number of nodes, edges and average clustering coefficient per graph
def graph_stats(data: List[Data]) -> np.ndarray:
    num_nodes = np.array([d.num_nodes for d in data])
    num_edges = np.array([d.num_edges for d in data])
    c = np.array([np.array(list(clustering(to_networkx(d)).values())).mean() for d in data])
    return np.concatenate([num_nodes[:, None], num_edges[:, None], c[:, None]], axis=-1)

dd = KSDrift(x_ref, p_val=.05, preprocess_fn=graph_stats)

make_predictions(dd, xs)

The 3 returned p-values correspond to respectively the p-values for the number of nodes, edges and clustering coefficient. We already saw in the EDA that the distributions of the node, edge and clustering coefficients look similar across the train, validation and test sets except for the tails. This is confirmed by running the drift detector on the graph statistics which cannot seem to pick up on the differences in molecular scaffolds between the datasets, unless we heavily oversample from the minority class where the number of nodes and edges but not the clustering coefficient significantly differ.