1 of 18

Examples

Categorical and mixed type data drift detection on income prediction

Method

The drift detector applies feature-wise two-sample Kolmogorov-Smirnov (K-S) tests for the continuous numerical features and Chi-Squared tests for the categorical features. For multivariate data, the obtained p-values for each feature are aggregated either via the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls for the probability of at least one false positive. The FDR correction on the other hand allows for an expected fraction of false positives to occur.

Dataset

The instances contain a person's characteristics like age, marital status or education while the label represents whether the person makes more or less than $50k per year. The dataset consists of a mixture of numerical and categorical features. It is fetched using the Alibi library, which can be installed with pip:

!pip install alibi

import alibi
import matplotlib.pyplot as plt
import numpy as np

from alibi_detect.cd import ChiSquareDrift, TabularDrift
from alibi_detect.saving import save_detector, load_detector

Load income prediction dataset

The fetch_adult function returns a Bunch object containing the instances, the targets, the feature names and a dictionary with as keys the column indices of the categorical features and as values the possible categories for each categorical variable.

adult = alibi.datasets.fetch_adult()
X, y = adult.data, adult.target
feature_names = adult.feature_names
category_map = adult.category_map
X.shape, y.shape

We split the data in a reference set and 2 test sets on which we test the data drift:

n_ref = 10000
n_test = 10000

X_ref, X_t0, X_t1 = X[:n_ref], X[n_ref:n_ref + n_test], X[n_ref + n_test:n_ref + 2 * n_test]
X_ref.shape, X_t0.shape, X_t1.shape

Detect drift

We need to provide the drift detector with the columns which contain categorical features so it knows which features require the Chi-Squared and which ones require the K-S univariate test. We can either provide a dict with as keys the column indices and as values the number of possible categories or just set the values to None and let the detector infer the number of categories from the reference data as in the example below:

categories_per_feature = {f: None for f in list(category_map.keys())}

Initialize the detector:

cd = TabularDrift(X_ref, p_val=.05, categories_per_feature=categories_per_feature)

We can also save/load an initialised detector:

filepath = 'my_path'  # change to directory where detector is saved
save_detector(cd, filepath)
cd = load_detector(filepath)

Now we can check whether the 2 test sets are drifting from the reference data:

preds = cd.predict(X_t0)
labels = ['No!', 'Yes!']
print('Drift? {}'.format(labels[preds['data']['is_drift']]))

Let's take a closer look at each of the features. The preds dictionary also returns the K-S or Chi-Squared test statistics and p-value for each feature:

for f in range(cd.n_features):
    stat = 'Chi2' if f in list(categories_per_feature.keys()) else 'K-S'
    fname = feature_names[f]
    stat_val, p_val = preds['data']['distance'][f], preds['data']['p_val'][f]
    print(f'{fname} -- {stat} {stat_val:.3f} -- p-value {p_val:.3f}')

None of the feature-level p-values are below the threshold:

preds['data']['threshold']

If you are interested in individual feature-wise drift, this is also possible:

fpreds = cd.predict(X_t0, drift_type='feature')

for f in range(cd.n_features):
    stat = 'Chi2' if f in list(categories_per_feature.keys()) else 'K-S'
    fname = feature_names[f]
    is_drift = fpreds['data']['is_drift'][f]
    stat_val, p_val = fpreds['data']['distance'][f], fpreds['data']['p_val'][f]
    print(f'{fname} -- Drift? {labels[is_drift]} -- {stat} {stat_val:.3f} -- p-value {p_val:.3f}')

What about the second test set?

preds = cd.predict(X_t1)
labels = ['No!', 'Yes!']
print('Drift? {}'.format(labels[preds['data']['is_drift']]))

We can again investigate the individual features:

for f in range(cd.n_features):
    stat = 'Chi2' if f in list(categories_per_feature.keys()) else 'K-S'
    fname = feature_names[f]
    is_drift = (preds['data']['p_val'][f] < preds['data']['threshold']).astype(int)
    stat_val, p_val = preds['data']['distance'][f], preds['data']['p_val'][f]
    print(f'{fname} -- Drift? {labels[is_drift]} -- {stat} {stat_val:.3f} -- p-value {p_val:.3f}')

It seems like there is little divergence in the distributions of the features between the reference and test set. Let's visualize this:

def plot_categories(idx: int) -> None:
    # reference data
    x_ref_count = {f: [(X_ref[:, f] == v).sum() for v in vals] 
                   for f, vals in cd.x_ref_categories.items()}
    fref_drift = {cat: x_ref_count[idx][i] for i, cat in enumerate(category_map[idx])}
    
    # test set
    cats = {f: list(np.unique(X_t1[:, f])) for f in categories_per_feature.keys()}
    X_count = {f: [(X_t1[:, f] == v).sum() for v in vals] for f, vals in cats.items()}
    fxt1_drift = {cat: X_count[idx][i] for i, cat in enumerate(category_map[idx])}
    
    # plot bar chart
    plot_labels = list(fxt1_drift.keys())
    ind = np.arange(len(plot_labels))
    width = .35
    fig, ax = plt.subplots()
    p1 = ax.bar(ind, list(fref_drift.values()), width)
    p2 = ax.bar(ind + width, list(fxt1_drift.values()), width)
    ax.set_title(f'Counts per category for {feature_names[idx]} feature')
    ax.set_xticks(ind + width / 2)
    ax.set_xticklabels(plot_labels)
    ax.legend((p1[0], p2[0]), ('Reference', 'Test'), loc='upper right', ncol=2)
    ax.set_ylabel('Counts')
    ax.set_xlabel('Categories')
    plt.xticks(list(np.arange(len(plot_labels))), plot_labels, rotation='vertical')
    plt.show()

plot_categories(2)
plot_categories(3)
plot_categories(4)

Categorical data drift

While the TabularDrift detector works fine with numerical or categorical features only, we can also directly use a categorical drift detector. In this case, we don't need to specify the categorical feature columns. First we construct a categorical-only dataset and then use the ChiSquareDrift detector:

cols = list(category_map.keys())
cat_names = [feature_names[_] for _ in list(category_map.keys())]
X_ref_cat, X_t0_cat = X_ref[:, cols], X_t0[:, cols]
X_ref_cat.shape, X_t0_cat.shape

cd = ChiSquareDrift(X_ref_cat, p_val=.05)
preds = cd.predict(X_t0_cat)
print('Drift? {}'.format(labels[preds['data']['is_drift']]))

print(f"Threshold {preds['data']['threshold']}")
for f in range(cd.n_features):
    fname = cat_names[f]
    is_drift = (preds['data']['p_val'][f] < preds['data']['threshold']).astype(int)
    stat_val, p_val = preds['data']['distance'][f], preds['data']['p_val'][f]
    print(f'{fname} -- Drift? {labels[is_drift]} -- {stat} {stat_val:.3f} -- p-value {p_val:.3f}')

Learned drift detectors on Adult Census

Under the hood, drift detectors leverage a function (also known as a test-statistic) that is expected to take a large value if drift has occurred and a low value if not. The power of the detector is partly determined by how well the function satisfies this property. However, specifying such a function in advance can be very difficult.

Detecting drift with a learned classifier

The classifier-based drift detector simply tries to correctly distinguish instances from the reference data vs. the test set. The classifier is trained to output the probability that a given instance belongs to the test set. If the probabilities it assigns to unseen tests instances are significantly higher (as determined by a Kolmogorov-Smirnov test) than those it assigns to unseen reference instances then the test set must differ from the reference set and drift is flagged. To leverage all the available reference and test data, stratified cross-validation can be applied and the out-of-fold predictions are used for the significance test. Note that a new classifier is trained for each test set or even each fold within the test set.

Backend

The method works with both the PyTorch, TensorFlow, and Sklearn frameworks. We will focus exclusively on the Sklearn backend in this notebook.

Dataset

Adult dataset consists of 32,561 distributed over 2 classes based on whether the annual income is >50K. We evaluate drift on particular subsets of the data which are constructed based on the education level. As we will further discuss, our reference dataset will consist of people having a low education level, while our test dataset will consist of people having a high education level.

Note: we need to install alibi to fetch the adult dataset.

!pip install alibi

import numpy as np
import pandas as pd
from typing import List, Tuple, Dict, Callable

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split

from alibi.datasets import fetch_adult
from alibi_detect.cd import ClassifierDrift

Load Adult Census Dataset

# fetch adult dataset
adult = fetch_adult()

# separate columns in numerical and categorical.
categorical_names = [adult.feature_names[i] for i in adult.category_map.keys()]
categorical_ids = list(adult.category_map.keys())

numerical_names = [name for i, name in enumerate(adult.feature_names) if i not in adult.category_map.keys()]
numerical_ids = [i for i in range(len(adult.feature_names)) if i not in adult.category_map.keys()]

X = adult.data

We split the dataset in two based on the education level. We define a low_education level consisting of: 'Dropout', 'High School grad', 'Bachelors', and a high_education level consisting of: 'Bachelors', 'Masters', 'Doctorate'. Intentionally we included an overlap between the two distributions consisting of people that have a Bachelors degree. Our goal is to detect that the two distributions are different.

education_col = adult.feature_names.index('Education')
education = adult.category_map[education_col]
print(education)

# define low education
low_education = [
    education.index('Dropout'),
    education.index('High School grad'),
    education.index('Bachelors')
    
]
# define high education
high_education = [
    education.index('Bachelors'),
    education.index('Masters'),
    education.index('Doctorate')
]
print("Low education:", [education[i] for i in low_education])
print("High education:", [education[i] for i in high_education])

# select instances for low and high education
low_education_mask = pd.Series(X[:, education_col]).isin(low_education).to_numpy()
high_education_mask = pd.Series(X[:, education_col]).isin(high_education).to_numpy()
X_low, X_high = X[low_education_mask], X[high_education_mask]

We sample our reference dataset from the low_education level. In addition, we sample two other datasets:

x_h0 - sampled from the low_education level to support the null hypothesis (i.e., the two distributions are identical);
x_h1 - sampled from the high_education level to support the alternative hypothesis (i.e., the two distributions are different);

size = 1000
np.random.seed(0)

# define reference and H0 dataset
idx_low = np.random.choice(np.arange(X_low.shape[0]), size=2*size, replace=False)
x_ref, x_h0 = train_test_split(X_low[idx_low], test_size=0.5, random_state=5, shuffle=True)

# define reference and H1 dataset
idx_high = np.random.choice(np.arange(X_high.shape[0]), size=size, replace=False)
x_h1 = X_high[idx_high]

Define dataset pre-processor

# define numerical standard scaler.
num_transf = StandardScaler()

# define categorical one-hot encoder.
cat_transf = OneHotEncoder(
    categories=[range(len(x)) for x in adult.category_map.values()],
    handle_unknown="ignore"
)

# Define column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", cat_transf, categorical_ids),
        ("num", num_transf, numerical_ids),
    ],
    sparse_threshold=0
)

# fit preprocessor.
preprocessor = preprocessor.fit(np.concatenate([x_ref, x_h0, x_h1]))

Utils

labels = ['No!', 'Yes!']

def print_preds(preds: dict, preds_name: str) -> None:
    print(preds_name)
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print(f'p-value: {preds["data"]["p_val"]:.3f}')
    print('')

Drift detection

We perform a binomial test using a RandomForestClassifier.

# define classifier
model = RandomForestClassifier()

# define drift detector with binarize prediction
detector = ClassifierDrift(
    x_ref=x_ref,
    model=model,
    backend='sklearn',
    preprocess_fn=preprocessor.transform,
    binarize_preds=True,
    n_folds=2,
)

# print results
print_preds(detector.predict(x=x_h0), "H0")
print_preds(detector.predict(x=x_h1), "H1")

As expected, when testing against x_h0, we fail to reject $H_0$, while for the second case there is enough evidence to reject $H_0$ and flag that the data has drifted.

For the classifiers that do not support predict_proba but offer support for decision_function, we can perform a K-S test on the scores by setting preds_type='scores'.

# define model
model = GradientBoostingClassifier()


# define drift detector
detector = ClassifierDrift(
    x_ref=x_ref,
    model=model,
    backend='sklearn',
    preprocess_fn=preprocessor.transform,
    preds_type='scores',
    binarize_preds=False,
    n_folds=2,
)

# print results
print_preds(detector.predict(x=x_h0), "H0")
print_preds(detector.predict(x=x_h1), "H1")

Some models can return a poor estimate of the class label probability or some might not even support probability predictions. We can add calibration on top of each classifier to obtain better probability estimates and perform a K-S test. For demonstrative purposes, we will calibrate a LinearSVC which does not support predict_proba, but any other classifier would work.

# define model - does not support predict_proba
model = LinearSVC(max_iter=10000)

# define drift detector
detector = ClassifierDrift(
    x_ref=x_ref,
    model=model,
    backend='sklearn',
    preprocess_fn=preprocessor.transform,
    binarize_preds=False,
    n_folds=2,
    use_calibration=True,
    calibration_kwargs={'method': 'isotonic'}
)

# print results
print_preds(detector.predict(x=x_h0), "H0")
print_preds(detector.predict(x=x_h1), "H1")

Speeding things up

In order to use the entire dataset and obtain unbiased predictions required to perform the statistical test, the ClassifierDrift detector has the option to perform a n_folds split. Although appealing due to its data efficiency, this method can be slow since it is required to train a number of n_folds classifiers.

For the RandomForestClassifier we can avoid retraining n_folds classifiers by using the out-of-bag predictions. In a RandomForestClassifier each tree is trained on a separate dataset obtained by sampling with replacement the original training set, a method known as bagging. On average, only 63% unique samples from the original dataset are used to train each tree (Bostrom). Thus, for each tree, we can obtain predictions for the remaining out-of-bag samples (i.e., the rest of 37%). By cumulating the out-of-bag predictions across all the trees we can eventually obtain a prediction for each sample in the original dataset. Note that we used the word 'eventually' because if the number of trees is too small, covering the entire original dataset might be unlikely.

For demonstrative purposes, we will compare the running time of the ClassifierDrift detector when using a RandomForestClassifier in two setups: n_folds=5, use_oob=False and use_oob=True.

n_estimators = 400
n_folds = 5

%%time
# define drift detector
detector_rf = ClassifierDrift(
    x_ref=x_ref,
    model=RandomForestClassifier(n_estimators=n_estimators),
    backend='sklearn',
    preprocess_fn=preprocessor.transform,
    binarize_preds=False,
    n_folds=n_folds
)

# print results
print_preds(detector_rf.predict(x=x_h0), "H0")
print_preds(detector_rf.predict(x=x_h1), "H1")

%%time
# define drift detector
detector_rf_oob = ClassifierDrift(
    x_ref=x_ref,
    model=RandomForestClassifier(n_estimators=n_estimators),
    backend='sklearn',
    preprocess_fn=preprocessor.transform,
    binarize_preds=False,
    use_oob=True
)

# print results
print_preds(detector_rf_oob.predict(x=x_h0), "H0")
print_preds(detector_rf_oob.predict(x=x_h1), "H1")

We can observe that in this particular setting, using the out-of-bag prediction can speed up the procedure up to almost x4.

Learned drift detectors on CIFAR-10

Under the hood drift detectors leverage a function (also known as a test-statistic) that is expected to take a large value if drift has occurred and a low value if not. The power of the detector is partly determined by how well the function satisfies this property. However, specifying such a function in advance can be very difficult. In this example notebook we consider two ways in which a portion of the available data may be used to learn such a function before then applying it on the held out portion of the data to test for drift.

Detecting drift with a learned classifier

The classifier-based drift detector simply tries to correctly distinguish instances from the reference data vs. the test set. The classifier is trained to output the probability that a given instance belongs to the test set. If the probabilities it assigns to unseen tests instances are significantly higher (as determined by a Kolmogorov-Smirnov test) to those it assigns to unseen reference instances then the test set must differ from the reference set and drift is flagged. To leverage all the available reference and test data, stratified cross-validation can be applied and the out-of-fold predictions are used for the significance test. Note that a new classifier is trained for each test set or even each fold within the test set.

Backend

The method works with both the PyTorch and TensorFlow frameworks. Alibi Detect does however not install PyTorch for you. Check the PyTorch docs how to do this.

Dataset

CIFAR10 consists of 60,000 32 by 32 RGB images equally distributed over 10 classes. We evaluate the drift detector on the CIFAR-10-C dataset (Hendrycks & Dietterich, 2019). The instances in CIFAR-10-C have been corrupted and perturbed by various types of noise, blur, brightness etc. at different levels of severity, leading to a gradual decline in the classification model performance. We also check for drift against the original test set with class imbalances.

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from alibi_detect.cd import ClassifierDrift
from alibi_detect.datasets import fetch_cifar10c, corruption_types_cifar10c

Load data

Original CIFAR-10 data:

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
y_train = y_train.astype('int64').reshape(-1,)
y_test = y_test.astype('int64').reshape(-1,)

For CIFAR-10-C, we can select from the following corruption types at 5 severity levels:

corruptions = corruption_types_cifar10c()
print(corruptions)

Let's pick a subset of the corruptions at corruption level 5. Each corruption type consists of perturbations on all of the original test set images.

corruption = ['gaussian_noise', 'motion_blur', 'brightness', 'pixelate']
X_corr, y_corr = fetch_cifar10c(corruption=corruption, severity=5, return_X_y=True)
X_corr = X_corr.astype('float32') / 255

We split the original test set in a reference dataset and a dataset which should not be flagged as drift. We also split the corrupted data by corruption type:

np.random.seed(0)
n_test = X_test.shape[0]
idx = np.random.choice(n_test, size=n_test // 2, replace=False)
idx_h0 = np.delete(np.arange(n_test), idx, axis=0)
X_ref,y_ref = X_test[idx], y_test[idx]
X_h0, y_h0 = X_test[idx_h0], y_test[idx_h0]
print(X_ref.shape, X_h0.shape)

n_corr = len(corruption)
X_c = [X_corr[i * n_test:(i + 1) * n_test] for i in range(n_corr)]

We can visualise the same instance for each corruption type:

i = 6

n_test = X_test.shape[0]
plt.title('Original')
plt.axis('off')
plt.imshow(X_test[i])
plt.show()
for _ in range(len(corruption)):
    plt.title(corruption[_])
    plt.axis('off')
    plt.imshow(X_corr[n_test * _+ i])
    plt.show()

Detect drift with a TensorFlow classifier

Single fold

We use a simple classification model and try to distinguish between the reference data and the corrupted test sets. The detector defaults to binarize=False which means a Kolmogorov-Smirnov test will be used to test for significant disparity between continuous model predictions (e.g. probabilities or logits). Initially we'll test at a significance level of $p=0.05$, use $75$% of the shuffled reference and test data for training and evaluate the detector on the remaining $25$%. We only train for 1 epoch.

from tensorflow.keras.layers import Conv2D, Dense, Flatten, Input

tf.random.set_seed(0)

model = tf.keras.Sequential(
  [
      Input(shape=(32, 32, 3)),
      Conv2D(8, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(16, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(32, 4, strides=2, padding='same', activation=tf.nn.relu),
      Flatten(),
      Dense(2, activation='softmax')
  ]
)

cd = ClassifierDrift(X_ref, model, p_val=.05, train_size=.75, epochs=1)

If needed, the detector can be saved and loaded with save_detector and load_detector:

from alibi_detect.saving import save_detector, load_detector
# Save detector
filepath = 'tf_detector'
save_detector(cd, filepath)

# Load detector
cd = load_detector(filepath)

Let's check whether the detector thinks drift occurred on the different test sets and time the prediction calls:

from timeit import default_timer as timer

labels = ['No!', 'Yes!']

def make_predictions(cd, x_h0, x_corr, corruption):
    t = timer()
    preds = cd.predict(x_h0)
    dt = timer() - t
    print('No corruption')
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print(f'p-value: {preds["data"]["p_val"]:.3f}')
    print(f'Time (s) {dt:.3f}')
    
    if isinstance(x_corr, list):
        for x, c in zip(x_corr, corruption):
            t = timer()
            preds = cd.predict(x)
            dt = timer() - t
            print('')
            print(f'Corruption type: {c}')
            print('Drift? {}'.format(labels[preds['data']['is_drift']]))
            print(f'p-value: {preds["data"]["p_val"]:.3f}')
            print(f'Time (s) {dt:.3f}')

make_predictions(cd, X_h0, X_c, corruption)

As expected, drift was only detected on the corrupted datasets and the classifier could easily distinguish the corrupted from the reference data.

Use all the available data via cross-validation

So far we've only used $25$% of the data to detect the drift since $75$% is used for training purposes. At the cost of additional training time we can however leverage all the data via stratified cross-validation. We just need to set the number of folds and keep everything else the same. So for each test set n_folds models are trained, and the out-of-fold predictions combined for the significance test:

cd = ClassifierDrift(X_ref, model, p_val=.05, n_folds=5, epochs=1)

#| scrolled: true
make_predictions(cd, X_h0, X_c, corruption)

Detecting drift with a learned kernel

An alternative to training a classifier to output high probabilities for instances from the test window and low probabilities for instances from the reference window is to learn a kernel that outputs high similarities between instances from the same window and low similarities between instances from different windows. The kernel may then be used within an MMD-test for drift. Liu et al. (2020) propose this learned approach and note that it is in fact a generalisation of the above classifier-based method. However, in this case we can train the kernel to directly optimise an estimate of the detector's power, which can result in superior performance.

Detect drift with a learned PyTorch kernel

This can be implemented as shown below. We use Pytorch instead of TensorFlow this time for the sake of variety. Because we are dealing with images we give our projection $\Phi$ a convolutional architecture.

import torch
import torch.nn as nn

# set random seed and device
seed = 0
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# define the projection
proj = nn.Sequential(
    nn.Conv2d(3, 8, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(8, 16, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(16, 32, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Flatten(),
).to(device)

We may then specify a DeepKernel in the following manner. By default GaussianRBF kernels are used for $k_a$ and $k_b$ and here we specify $\epsilon=0.01$, but we could alternatively set eps='trainable'.

from alibi_detect.utils.pytorch.kernels import DeepKernel
kernel = DeepKernel(proj, eps=0.01)

Since our PyTorch encoder expects the images in a (batch size, channels, height, width) format, we transpose the data. Note that this step could also be passed to the drift detector via the preprocess_fn kwarg:

def permute_c(x):
    return np.transpose(x.astype(np.float32), (0, 3, 1, 2))

X_ref_pt = permute_c(X_ref)
X_h0_pt = permute_c(X_h0)
X_c_pt = [permute_c(xc) for xc in X_c]
print(X_ref_pt.shape, X_h0_pt.shape, X_c_pt[0].shape)

We then pass the kernel to the LearnedKernelDrift detector. By default $75%$ of the data is used to train the kernel and the MMD-test is performed on the other $25%$.

from alibi_detect.cd import LearnedKernelDrift
cd = LearnedKernelDrift(X_ref_pt, kernel, backend='pytorch', p_val=.05, epochs=1)

Again, the detector can be saved and loaded:

from alibi_detect.saving import save_detector, load_detector
# Save detector
filepath = 'torch_detector'
save_detector(cd, filepath)

# Load detector
cd = load_detector(filepath)

Finally, lets make some predictions with the detector:

make_predictions(cd, X_h0_pt, X_c_pt, corruption)

Context-aware drift detection on news articles

Introduction

In this notebook we show how to detect drift on text data given a specific context using the (). Consider the following simple example: the upcoming elections result in an increase of political news articles compared to other topics such as sports or science. Given the context (the elections), it is however not surprising that we observe this uptick. Moreover, assume we have a machine learning model which is trained to classify news topics, and this model performs well on political articles. So given that we fully expect this uptick to occur given the context, and that our model performs fine on the political news articles, we do not want to flag this type of drift in the data. This setting corresponds more closely to many real-life settings than traditional drift detection where we make the assumption that both the reference and test data are i.i.d. samples from their underlying distributions.

In our news topics example, each different topic such as politics, sports or weather represents a subpopulation of the data. Our context-aware drift detector can then detect changes in the data distribution which cannot be attributed to a change in the relative prevalences of these subpopulations, which we deem permissible. As a cherry on the cake, the context-aware detector allows you to understand which subpopulations are present in both the reference and test data. This allows you to obtain deep insights into the distribution underlying the test data.

Useful context (or conditioning) variables for the context-aware drift detector include but are not limited to:

Domain or application specific contexts such as the time of day or the weather.
Conditioning on the relative prevalences of known subpopulations, such as the frequency of political articles. It is important to note that while the relative frequency of each subpopulation might change, the distribution underlying each subpopulation cannot change.
Conditioning on model predictions. Assume we trained a classifier which tries to figure out which news topic an article belongs to. Given our model predictions we then want to understand whether our test data follows the same underlying distribution as reference instances with similar model predictions. This conditioning would also be useful in case of trending news topics which cause the model prediction distribution to shift but not necessarily the distribution within each of the news topics.
Conditioning on model uncertainties which would allow increases in model uncertainty due to drift into familiar regions of high aleatoric uncertainty (often fine) to be distinguished from that into unfamiliar regions of high epistemic uncertainty (often problematic).

The following settings will be illustrated throughout the notebook:

A change in the prevalences of subpopulations (i.e. news topics) relative to their prevalences in the training data. Contrary to traditional drift detection approaches, the context-aware detector does not flag drift as this change in frequency of news topics is permissible given the context provided (e.g. more political news articles around elections).
A change in the underlying distribution of one or more subpopulations takes place. While we allow changes in the prevalence of the subpopulations accounted for by the context variable, we do not allow changes of the subpopulations themselves. Let's assume that a newspaper usually has a certain tone (e.g. more conservative) when it comes to politics. If this tone changes (to less conservative) around elections (increased frequency of political news articles), then we want to flag it as drift since the change cannot be attributed to the context given to the detector.
A change in the distribution as we observe a previously unseen news topic. A newspaper might for instance add a classified ads section, which was not present in the reference data.

Under setting 1. we want our detector to be well-calibrated (a controlled False Positive Rate (FPR) and more generally a p-value which is uniformly distributed between 0 and 1) while under settings 2. and 3. we want our detector to be powerful and flag the drift. Lastly, we show how the detector can help you to understand the connection between the reference and test data distributions better.

Data

We use the which contains about 18,000 newsgroups post across 20 topics, including politics, science sports or religion.

Requirements

The notebook requires the umap-learn, torch, sentence-transformers, statsmodels, seaborn and datasets packages to be installed, which can be done via pip:

Before we start let's fix the random seeds for reproducibility:

Load data

First we load the data, show which classes (news topics) are present and what an instance looks like.

Let's take a look at an instance from the dataset:

Define models and train a classifier

We define respectively a generic clustering model using UMAP, a model to embed the text input using pre-trained SentenceTransformers embeddings, a text classifier and a utility function to place the data on the right device.

First we train a classifier on a small subset of the data. The aim of the classifier is to predict the news topic of each instance. Below we define a few simple training and evaluation functions.

We now split the data in 2 sets. The first set (x_train) we will use to train our text classifier, and the second set (x_drift) is held out to test our drift detector on.

Let's train our classifier. The classifier consists of a simple MLP head on top of a pre-trained SentenceTransformer model as the backbone. The SentenceTransformer remains frozen during training and only the MLP head is finetuned.

Detector calibration under no change

We start with an example where no drift occurs and the reference and test data are both sampled randomly from all news topics. Under this scenario, we expect no drift to be detected by either a normal MMD detector or by the context-aware MMD detector.

First we define some helper functions. The first one visualises the clustered text data while the second function samples disjoint reference and test sets with a specified number of instances per class (i.e. per news topic).

We first define the embedding model using the pre-trained SentenceTransformer embeddings and then embed both the reference and test sets.

By applying UMAP clustering on the SentenceTransformer embeddings, we can visually inspect the various news topic clusters. Note that we fit the clustering model on the held out data first, and then make predictions on the reference and test sets.

We can visually see that the reference and test set are made up of similar clusters of data, grouped by news topic. As a result, we would not expect drift to be flagged. If the data distribution did not change, we can expect the p-value distribution of our statistical test to be uniformly distributed between 0 and 1. So let's see if this assumption holds.

Importantly, first we need to define our context variable for the context-aware MMD detector. In our experiments we allow the relative prevalences of subpopulations to vary while the distributions underlying each of the subpopulations remain unchanged. To achieve this we condition on the prediction probabilities of the classifier we trained earlier to distinguish each of the 20 different news topics. We can do this because the prediction probabilities can account for the frequency of occurrence of each of the topics (be it imperfectly given our classifier makes the occasional mistake).

Before we set off our experiments, we embed all the instances in x_drift and compute all contexts c_drift so we don't have to call our transformer model every single pass in the for loop.

As expected we can see that both the normal MMD and the context-aware MMD detectors are well-calibrated.

Changing the relative subpopulation prevalence

We now focus our attention on a more realistic problem where the relative frequency of one or more subpopulations (i.e. news topics) is changing in a way which can be attributed to external events. Importantly, the distribution underlying each subpopulation (e.g. the distribution of hockey news itself) remains unchanged, only its frequency changes.

In our example we assume that the World Series and Stanley Cup coincide on the calendar leading to a spike in news articles on respectively baseball and hockey. Furthermore, there is not too much news on Mac or Windows since there are no new releases or products planned anytime soon.

While the context-aware detector remains well calibrated, the MMD detector consistently flags drift (low p-values). Note that this is the expected behaviour since the vanilla MMD detector cannot take any external context into account and correctly detects that the reference and test data do not follow the same underlying distribution.

We can also easily see this on the plot below where the p-values of the context-aware detector are uniformly distributed while the MMD detector's p-values are consistently close to 0. Note that we limited the y-axis range to make the plot easier to read.

Changing the subpopulation distribution

In the following example we change the distribution of one or more of the underlying subpopulations. Notice that now we do want to flag drift since our context variable, which permits changes in relative subpopulation prevalences, can no longer explain the change in distribution.

Imagine our news topic classification model is not as granular as before and instead of the 20 categories only predicts the 6 super classes, organised by subject matter:

Computers: comp.graphics; comp.os.ms-windows.misc; comp.sys.ibm.pc.hardware; comp.sys.mac.hardware; comp.windows.x
Recreation: rec.autos; rec.motorcycles; rec.sport.baseball; rec.sport.hockey
Science: sci.crypt; sci.electronics; sci.med; sci.space
Miscellaneous: misc.forsale
Politics: talk.politics.misc; talk.politics.guns; talk.politics.mideast
Religion: talk.religion.misc; talk.atheism; soc.religion.christian

What if baseball and hockey become less popular and the distribution underlying the Recreation class changes? We will want to detect this as the change in distributions of the subpopulations (the 6 super classes) cannot be explained anymore by the context variable.

In order to reuse our pretrained classifier for the super classes, we add the following helper function to map the predictions on the super classes and return one-hot encoded predictions over the 6 super classes. Note that our context variable now changes from a probability distribution over the 20 news topics to a one-hot encoded representation over the 6 super classes.

We can see that the context-aware detector is powerful to detect changes in the distributions of the subpopulations.

Detect unseen topics

Next we illustrate the effectiveness of the context-aware detector to detect new topics which are not present in the reference data. Obviously we also want to flag drift in this case. As an example we introduce movie reviews in the test data.

Changing the context variable

So far we have conditioned the context-aware detector on the model predictions. There are however many other useful contexts possible. One such example would be to condition on the predictions of an unsupervised clustering algorithm. To facilitate this, we first apply kernel PCA on the embedding vectors, followed by a Gaussian mixture model which clusters the data into 6 classes (same as the super classes). We will test both the calibration under the null hypothesis (no distribution change) as well as the power when a new topic (movie reviews) is injected.

Next we change the number of instances in each cluster between the reference and test sets. Note that we do not alter the underlying distribution of each of the clusters, just the frequency.

Now we run the experiment and show the context-aware detector's calibration when changing the cluster frequencies. We also show how the usual MMD detector will consistently flag drift. Furthermore, we inject instances from the movie reviews dataset and illustrate that the context-aware detector remains powerful when the underlying cluster distribution changes (by including a previously unseen topic).

Interpretability of the context-aware detector

The test statistic $\hat{t}$ of the context-aware MMD detector can be formulated as follows: $\hat{t} = \langle K_{0,0}, W_{0,0} \rangle + \langle K_{1,1}, W_{1,1} \rangle -2\langle K_{0,1}, W_{0,1}\rangle$ where $0$ refers to the reference data, $1$ to the test data, and $W_{.,.}$ and $K_{.,.}$ are the weight and kernel matrices, respectively. The weight matrices $W_{.,.}$ allow us to focus on the distribution's subpopulations of interest. Reference instances which have similar contexts as the test data will have higher values for their entries in $W_{0,1}$ than instances with dissimilar contexts. We can therefore interpret $W_{0,1}$ as the coupling matrix between instances in the reference and the test sets. This allows us to investigate which subpopulations from the reference set are present and which are missing in the test data. If we also have a good understanding of the model performance on various subpopulations of the reference data, we could even try and use this coupling matrix to roughly proxy model performance on the unlabeled test instances. Note that in this case we would require labels from the reference data and make sure the reference instances come from the validation, not the training set.

In the following example we only pick 2 classes to be present in the test set while all 20 are present in the reference set. We can then investigate via the coupling matrix whether the test statistic $\hat{t}$ focused on the right classes in the reference data via $W_{0,1}$. More concretely, we can sum over the columns (the test instances) of $W_{0,1}$ and check which reference instances obtained the highest weights.

Context-aware drift detection on ECGs

Introduction

In this notebook we show how to detect drift on ECG data given a specific context using the context-aware MMD detector (Cobb and Van Looveren, 2022). Consider the following simple example: we have a heatbeat monitoring system which is trained on a wide variety of heartbeats sampled from people of all ages across a variety of activities (e.g. rest or running). Then we deploy the system to monitor individual people during certain activities. The distribution of the heartbeats monitored during deployment will then be drifting against the reference data which resembles the full training distribution, simply because only individual people in a specific setting are being tracked. However, this does not mean that the system is not working and requires re-training. We are instead interested in flagging drift given the relevant context such as the person's characteristics (e.g. age or medical history) and the activity. Traditional drift detectors cannot flexibly deal with this setting since they rely on the i.i.d. assumption when sampling the reference and test sets. The context-aware detector however allows us to pass this context to the detector and flag drift appropriately. More generally, the context-aware drift detector detects changes in the data distribution which cannot be attributed to a permissible change in the context variable. On top of that, the detector allows you to understand which subpopulations are present in both the reference and test data which provides deeper insights into the distribution underlying the test data.

Useful context (or conditioning) variables for the context-aware drift detector include but are not limited to:

Domain or application specific contexts such as the time of day or the activity (e.g. running or resting).
Conditioning on the relative prevalences of known subpopulations, such as the frequency of different types of heartbeats. It is important to note that while the relative frequency of each subpopulation (e.g. the different heartbeat types) might change, the distribution underlying each individual subpopulation (e.g. each specific type of heartbeat) cannot change.
Conditioning on model predictions. Assume we trained a classifier which detects arrhythmia, then we can provide the classifier model predictions as context and understand if, given the model prediction, the data comes from the same underlying distribution as the reference data or not.
Conditioning on model uncertainties which would allow increases in model uncertainty due to drift into familiar regions of high aleatoric uncertainty (often fine) to be distinguished from that into unfamiliar regions of high epistemic uncertainty (often problematic).

The following settings will be showcased throughout the notebook:

A change in the prevalences of subpopulations (i.e. different types of heartbeats as determined by an unsupervised clustering model or an ECG classifier) which are also present in the reference data is observed. Contrary to traditional drift detection approaches, the context-aware detector does not flag drift as this change in frequency of various heartbeats is permissible given the context provided.
A change in the underlying distribution underlying one or more subpopulations takes place. While we allow changes in the prevalences of the subpopulations accounted for by the context variable, we do not allow changes of the subpopulations themselves. If for instance the ECGs are corrupted by noise on the sensor measurements, we want to flag drift.

We also show how to condition the detector on different context variables such as the ECG classifier model predictions, cluster membership by an unsupervised clustering algorithm and timestamps.

Under setting 1. we want our detector to be well-calibrated (a controlled False Positive Rate (FPR) and more generally a p-value which is uniformly distributed between 0 and 1) while under setting 2. we want our detector to be powerful and flag drift. Lastly, we show how the detector can help you to understand the connection between the reference and test data distributions better.

Data

The dataset contains 5000 ECG’s, originally obtained from Physionet from the BIDMC Congestive Heart Failure Database, record chf07. The data has been pre-processed in 2 steps: first each heartbeat is extracted, and then each beat is made equal length via interpolation. The data is labeled and contains 5 classes. The first class $N$ which contains almost 60% of the observations is seen as normal while the others are supraventricular ectopic beats ($S$), ventricular ectopic beats ($V$), fusion beats ($F$) and unknown beats ($Q$).

Requirements

The notebook requires the torch and statsmodels packages to be installed, which can be done via pip:

!pip install torch statsmodels

Before we start let's fix the random seeds for reproducibility:

import numpy as np
import torch

def set_seed(seed: int) -> None:
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    np.random.seed(seed)

set_seed(2022)

Load data

First we load the data, show the distribution across the ECG classes and visualise some ECGs from each class.

from alibi_detect.datasets import fetch_ecg
import matplotlib.pyplot as plt

(x_train, y_train), (x_test, y_test) = fetch_ecg(return_X_y=True)
y_train -= 1  # classes start at 1
y_test -= 1
x_all = np.concatenate([x_train, x_test], 0)
y_all = np.concatenate([y_train, y_test], 0)

n_total = x_train.shape[0] + x_test.shape[0]
n_class = len(np.unique(y_test))
x_by_class = {c: [] for c in range(n_class)}

# check number of instances per class
for c in range(n_class):
    idx_tr, idx_te = np.where(y_train == c)[0], np.where(y_test == c)[0]
    x_c = np.concatenate([x_train[idx_tr], x_test[idx_te]], axis=0)
    x_by_class[c] = x_c

# plot breakdown of all instances
plt.figure(figsize=(14,7))
labels = ['N','Q','V','S','F']
plt.pie([v.shape[0] for v in x_by_class.values()], labels=labels, 
         colors=['red','green','blue','skyblue','orange'], autopct='%1.1f%%')
p = plt.gcf()
p.gca().add_artist(plt.Circle((0,0), 0.7, color='white'))
plt.title(f'Breakdown of all {n_total} instances by type of heartbeat')
plt.show()

# visualise an instance from each class
for k, v in x_by_class.items():
    plt.plot(v[0], label=labels[k])
plt.title('ECGs of Different Classes')
plt.xlabel('Time step')
plt.legend()
plt.show()

We can see that most heartbeats can be classified as normal, followed by the unknown class. We will now sample 500 heartbeats to train a simple ECG classifier. Importantly, we leave out the $F$ and $V$ classes which are used to detect drift. First we define a helper function to sample data.

def split_data(x, y, n1, n2, seed=None):
        
    if seed:
        np.random.seed(seed)
    
    # split data by class
    cs = np.unique(y)
    n_c = len(np.unique(y))
    idx_c = {_: np.where(y == _)[0] for _ in cs}

    # convert nb instances per class to a list if needed
    n1_c = [n1] * n_c if isinstance(n1, int) else n1
    n2_c = [n2] * n_c if isinstance(n2, int) else n2
    
    # sample reference, test and held out data
    idx1, idx2 = [], []
    for _, c in enumerate(cs):
        idx = np.random.choice(idx_c[c], size=len(idx_c[c]), replace=False)
        idx1.append(idx[:n1_c[_]])
        idx2.append(idx[n1_c[_]:n1_c[_] + n2_c[_]])
    idx1 = np.concatenate(idx1)
    idx2 = np.concatenate(idx2)    
    x1, y1 = x[idx1], y[idx1]
    x2, y2 = x[idx2], y[idx2]
    return (x1, y1), (x2, y2)

We use a prop_train fraction of all samples to train the classifier and then remove instances from the $F$ and $V$ classes. The rest of the data is used by our drift detectors.

prop_train = .15
n_train_c = [int(prop_train * len(v)) for v in x_by_class.values()]
n_train_c[2], n_train_c[4] = 0, 0  # remove F and V classes from the training data
# the remainder of the data is used by the drift detectors
n_drift_c = [len(v) - n_train_c[_] for _, v in enumerate(x_by_class.values())]
(x_train, y_train), (x_drift, y_drift) = split_data(x_all, y_all, n_train_c, n_drift_c, seed=0)
print('train:', x_train.shape, 'drift detection:', x_drift.shape)

Train an ECG classifier

Now we define and train our classifier on the training set.

import torch.nn as nn
import torch.nn.functional as F


class Classifier(nn.Module):
    def __init__(self, dim_in: int = 140, dim_hidden: int = 128, dim_out: int = 5) -> None:
        super().__init__()
        self.lin_in = nn.Linear(dim_in, dim_hidden)
        self.bn1 = nn.BatchNorm1d(dim_hidden)
        self.lin_hidden = nn.Linear(dim_hidden, dim_hidden)
        self.bn2 = nn.BatchNorm1d(dim_hidden)
        self.lin_out = nn.Linear(dim_hidden, dim_out)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.leaky_relu(self.bn1(self.lin_in(x)))
        x = F.leaky_relu(self.bn2(self.lin_hidden(x)))
        return self.lin_out(x)

from torch.utils.data import TensorDataset, DataLoader
from alibi_detect.models.pytorch import trainer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

ds_train = TensorDataset(torch.from_numpy(x_train), torch.from_numpy(y_train).long())
dl_train = DataLoader(ds_train, batch_size=32, shuffle=True, drop_last=True)
model = Classifier().to(device)
trainer(model, nn.CrossEntropyLoss(), dl_train, device, torch.optim.Adam, learning_rate=.001, epochs=5)

Let's evaluate out classifier on both the training and drift portions of the datasets.

model.eval()
with torch.no_grad():
    y_pred_train = model(torch.from_numpy(x_train).to(device)).argmax(-1).cpu().numpy()
    y_pred_drift = model(torch.from_numpy(x_drift).to(device)).argmax(-1).cpu().numpy()
acc_train = (y_pred_train == y_train).mean()
acc_drift = (y_pred_drift == y_drift).mean()
print(f'Model accuracy: train {acc_train:.2f} - drift {acc_drift:.2f}')

Detector calibration under no change

We start with an example where no drift occurs and the reference and test data are both sampled randomly from all classes present in the reference data (classes 0, 1 and 3). Under this scenario, we expect no drift to be detected by either a normal MMD detector or by the context-aware MMD detector.

Before we can start using the context-aware drift detector, first we need to define our context variable. In our experiments we allow the relative prevalences of subpopulations (i.e. the relative frequency of different types of hearbeats also present in the reference data) to vary while the distributions underlying each of the subpopulations remain unchanged. To achieve this we condition on the prediction probabilities of the classifier we trained earlier to distinguish the different types of ECGs. We can do this because the prediction probabilities can account for the frequency of occurrence of each of the heartbeat types (be it imperfectly given our classifier makes the occasional mistake).

from scipy.special import softmax


def context(x: np.ndarray) -> np.ndarray:
    """ Condition on classifier prediction probabilities. """
    model.eval()
    with torch.no_grad():
        logits = model(torch.from_numpy(x).to(device)).cpu().numpy()
    return softmax(logits, -1)

from alibi_detect.cd import MMDDrift, ContextMMDDrift
from tqdm import tqdm

n_ref, n_test = 200, 200
n_drift = x_drift.shape[0]

# filter out classes not in training set
idx_filter = np.concatenate([np.where(y_drift == _)[0] for _ in np.unique(y_train)])

n_runs = 300  # number of drift detection runs, each with a different reference and test sample

p_vals_mmd, p_vals_cad = [], []
for _ in tqdm(range(n_runs)):
    
    # sample data
    idx = np.random.choice(idx_filter, size=len(idx_filter), replace=False)
    idx_ref, idx_test = idx[:n_ref], idx[n_ref:n_ref+n_test]
    x_ref = x_drift[idx_ref]
    x_test = x_drift[idx_test]
    
    # mmd drift detector
    dd_mmd = MMDDrift(x_ref, p_val=.05, n_permutations=100, backend='pytorch')
    preds_mmd = dd_mmd.predict(x_test)
    p_vals_mmd.append(preds_mmd['data']['p_val'])
    
    # context-aware mmd drift detector 
    c_ref = context(x_ref)
    c_test = context(x_test)
    dd_cad = ContextMMDDrift(x_ref, c_ref, p_val=.05, n_permutations=100, backend='pytorch')
    preds_cad = dd_cad.predict(x_test, c_test)
    p_vals_cad.append(preds_cad['data']['p_val'])
    
p_vals_mmd = np.array(p_vals_mmd)
p_vals_cad = np.array(p_vals_cad)

The below figure of the Q-Q (Quantile-Quantile) plots of a random sample from the uniform distribution U[0,1] against the obtained p-values from the vanilla and context-aware MMD detectors illustrate how well both detectors are calibrated. A perfectly calibrated detector should have a Q-Q plot which closely follows the diagonal. Only the middle plot in the grid shows the detector's p-values. The other plots correspond to n_runs p-values actually sampled from U[0,1] to contextualise how well the central plot follows the diagonal given the limited number of samples.

As expected we can see that both the normal MMD and the context-aware MMD detectors are well-calibrated.

import statsmodels.api as sm
from scipy.stats import uniform


def plot_p_val_qq(p_vals: np.ndarray, title: str) -> None:
    fig, axes = plt.subplots(nrows=3, ncols=3, sharex=True, sharey=True, figsize=(12,10))
    fig.suptitle(title)
    n = len(p_vals)
    for i in range(9):
        unifs = p_vals if i==4 else np.random.rand(n)
        sm.qqplot(unifs, uniform(), line='45', ax=axes[i//3,i%3])
        if i//3 < 2:
            axes[i//3,i%3].set_xlabel('')
        if i%3 != 0:
            axes[i//3,i%3].set_ylabel('')

plot_p_val_qq(p_vals_mmd, 'Q-Q plot MMD detector')
plot_p_val_qq(p_vals_cad, 'Q-Q plot Context-Aware MMD detector')

Changing the relative subpopulation prevalences

We now focus our attention on a more realistic problem where the relative frequency of one or more subpopulations (i.e. types of hearbeats) is changing while the underlying subpopulation distribution stays the same. This would be the expected setting when we monitor the heartbeat of a specific person (e.g. only normal heartbeats) and we don't want to flag drift.

n_ref_c = 400
# only 3 classes in train set and class 0 contains the normal heartbeats
n_test_c = [200, 0, 0]

x_c_train, y_c_train = x_drift[idx_filter], y_drift[idx_filter]

n_runs = 300

p_vals_mmd, p_vals_cad = [], []
for _ in tqdm(range(n_runs)):
    
    # sample data
    (x_ref, y_ref), (x_test, y_test) = split_data(x_c_train, y_c_train, n_ref_c, n_test_c, seed=_)
    
    # mmd drift detector
    dd_mmd = MMDDrift(x_ref, p_val=.05, n_permutations=100, backend='pytorch')
    preds_mmd = dd_mmd.predict(x_test)
    p_vals_mmd.append(preds_mmd['data']['p_val'])
    
    # context-aware mmd drift detector 
    c_ref = context(x_ref)
    c_test = context(x_test)
    dd_cad = ContextMMDDrift(x_ref, c_ref, p_val=.05, n_permutations=100, backend='pytorch')
    preds_cad = dd_cad.predict(x_test, c_test)
    p_vals_cad.append(preds_cad['data']['p_val'])
    
p_vals_mmd = np.array(p_vals_mmd)
p_vals_cad = np.array(p_vals_cad)

While the usual MMD detector only returns very low p-values (mostly 0), the context-aware MMD detector remains calibrated.

plot_p_val_qq(p_vals_mmd, 'Q-Q plot MMD detector')
plot_p_val_qq(p_vals_cad, 'Q-Q plot Context-Aware MMD detector')

Changing the subpopulation distribution

In the following example we change the distribution of one or more of the underlying subpopulations (i.e. the different types of heartbeats). Notice that now we do want to flag drift since our context variable, which permits changes in relative subpopulation prevalences, can no longer explain the change in distribution.

We will again sample from the normal heartbeats, but now we will add random noise to a fraction of the extracted heartbeats to change the distribution. This could be the result of an error with some of the sensors. The perturbation is illustrated below:

i = 0

plt.plot(x_train[i], label='original')
plt.plot(x_train[i] + np.random.normal(size=140), label='noise')
plt.title('Original vs. perturbed ECG')
plt.xlabel('Time step')
plt.legend()
plt.show()

noise_frac = .5  # 50% of the test set samples are corrupted, the rest stays in-distribution

n_runs = 300

p_vals_cad = []
for _ in tqdm(range(n_runs)):
    
    # sample data
    (x_ref, y_ref), (x_test, y_test) = split_data(x_c_train, y_c_train, n_ref_c, n_test_c, seed=_)
    
    # perturb a fraction of the test data
    n_test, n_features = x_test.shape
    n_noise = int(noise_frac * n_test)
    x_noise = np.random.normal(size=n_noise * n_features).reshape(n_noise, n_features)
    idx_noise = np.random.choice(n_test, size=n_noise, replace=False)
    x_test[idx_noise] += x_noise
    
    # cad drift detector 
    c_ref = context(x_ref)
    c_test = context(x_test)
    dd_cad = ContextMMDDrift(x_ref, c_ref, p_val=.05, n_permutations=100, backend='pytorch')
    preds_cad = dd_cad.predict(x_test, c_test)
    p_vals_cad.append(preds_cad['data']['p_val'])
    
p_vals_cad = np.array(p_vals_cad)

As we can see from the Q-Q and power of the detector, the changes in the subpopulation are easily detected:

threshold = .05
print(f'Power at {threshold * 100}% significance level')
print(f'Context-aware MMD: {(p_vals_cad < threshold).mean():.3f}')

plot_p_val_qq(p_vals_cad, 'Q-Q plot Context-Aware MMD detector')

Changing the context variable

We now use the cluster membership probabilities of a Gaussian mixture model which is fit on the training instances as context variables instead of the model predictions. We will test both the calibration when the frequency of the subpopulations (the cluster memberships) changes as well as the power when the $F$ and $V$ heartbeats are included.

from sklearn.mixture import GaussianMixture

n_clusters = 2  # normal heartbeats + S/Q which look fairly similar as illustrated earlier
gmm = GaussianMixture(n_components=n_clusters, covariance_type='full', random_state=2022)
gmm.fit(x_train)

# compute all contexts
c_all_proba = gmm.predict_proba(x_drift)
c_all_class = gmm.predict(x_drift)

n_ref_c = [200, 200]
n_test_c = [100, 25]

def sample_from_clusters():
    idx_ref, idx_test = [], []
    for _, (i_ref, i_test) in enumerate(zip(n_ref_c, n_test_c)):
        idx_c = np.where(c_all_class == _)[0]
        idx_shuffle = np.random.choice(idx_c, size=len(idx_c), replace=False)
        idx_ref.append(idx_shuffle[:i_ref])
        idx_test.append(idx_shuffle[i_ref:i_ref+i_test])
    idx_ref = np.concatenate(idx_ref, 0)
    idx_test = np.concatenate(idx_test, 0)
    c_ref = c_all_proba[idx_ref]
    c_test = c_all_proba[idx_test]
    x_ref = x_drift[idx_ref]
    x_test = x_drift[idx_test]
    return c_ref, c_test, x_ref, x_test

#| scrolled: true
n_runs = 300

p_vals_null, p_vals_new = [], []
for _ in tqdm(range(n_runs)):
    
    # sample data
    c_ref, c_test_null, x_ref, x_test_null = sample_from_clusters()
        
    # previously unseen classes
    x_test_new = np.concatenate([x_drift[y_drift == 2], x_drift[y_drift == 4]], 0)
    c_test_new = gmm.predict_proba(x_test_new)
    
    # detect drift
    dd = ContextMMDDrift(x_ref, c_ref, p_val=.05, n_permutations=100, backend='pytorch')
    preds_null = dd.predict(x_test_null, c_test_null)
    preds_new = dd.predict(x_test_new, c_test_new)
    p_vals_null.append(preds_null['data']['p_val'])
    p_vals_new.append(preds_new['data']['p_val'])
    
p_vals_null = np.array(p_vals_null)
p_vals_new = np.array(p_vals_new)

plot_p_val_qq(p_vals_null, 'Q-Q plot Context-Aware MMD detector when changing the subpopulation prevalence')
threshold = .05
print(f'Power at {threshold * 100}% significance level')
print(f'Context-aware MMD on F and V classes: {(p_vals_new < threshold).mean():.3f}')

Interpretability of the context-aware detector

In the following example we only pick 1 type of heartbeat (the normal one) to be present in the test set while 3 types are present in the reference set. We can then investigate via the coupling matrix whether the test statistic $\hat{t}$ focused on the right types of heartbeats in the reference data via $W_{0,1}$. More concretely, we can sum over the columns (the test instances) of $W_{0,1}$ and check which reference instances obtained the highest weights.

n_ref_c = 400
n_test_c = [200, 0, 0]

(x_ref, y_ref), (x_test, y_test) = split_data(x_c_train, y_c_train, n_ref_c, n_test_c)

# condition using the model pred
c_ref = context(x_ref)
c_test = context(x_test)

# initialise detector and make predictions
dd = ContextMMDDrift(x_ref, c_ref, p_val=.05, n_permutations=100, backend='pytorch')
preds = dd.predict(x_test, c_test, return_coupling=True)

# no drift is detected since the distribution of 
# the subpopulations in the test set remain the same
print(f'p-value: {preds["data"]["p_val"]:.3f}')

# extract coupling matrix between reference and test data
W_01 = preds['data']['coupling_xy']

# sum over test instances
w_ref = W_01.sum(1)

As expected no drift was detected since the test set only contains normal heartbeats. We now sort the weights of w_ref in descending order. We expect the top 400 entries to be fairly high and consistent since these represent the normal heartbeats in the reference set. Afterwards, the weight attribution to the other instances in the reference set should be low. The plot below confirms that this is indeed what happens.

inds_ref_sort = np.argsort(w_ref)[::-1]
plt.plot(w_ref[inds_ref_sort]);
plt.title('Sorted reference weights from the coupling matrix W_01');
plt.ylabel('Reference instance weight in W_01');
plt.xlabel('Instances sorted by weight in W_01');
plt.show()

Time conditioning

The dataset consists of nicely extracted and aligned ECGs of 140 data points for each observation. However in reality it is likely that we will continuously or periodically observe instances which are not nicely aligned. We could however assign a timestamp to the data (e.g. starting from a peak) and use time as the context variable. This is illustrated in the example below.

First we create a new dataset where we split each instance in slices of non-overlapping ECG segments. Each of the segments will have an associated timestamp as context variable. Then we can check the calibration under no change (besides the time-varying behaviour which is accounted for) as well as the power for ECG segments where we add incorrect time stamps to some of the segments.

# filter out normal heartbeats
idx_normal = np.where(y_drift == 0)[0]
x_normal, y_normal = x_drift[idx_normal], y_drift[idx_normal]
n_normal = len(x_normal)

# determine segment length and starting points in each original ECG
segment_len = 40
n_segments = 3
max_start = n_features - n_segments * segment_len
idx_start = np.random.choice(max_start, size=n_normal, replace=True)

# split original ECGs in segments
x_split = np.concatenate(
    [
        np.concatenate(
            [x_normal[_, idx+i*segment_len:idx+(i+1)*segment_len][None, :] for i in range(n_segments)], 0
        ) for _, idx in enumerate(idx_start)
    ], 0
)

# time-varying context, standardised
c_split = np.repeat(idx_start, n_segments).astype(np.float32)
c_add = np.tile(np.array([i*segment_len for i in range(n_segments)]), len(idx_start)).astype(np.float32)
c_split += c_add
c_split = (c_split - c_split.mean()) / c_split.std()
c_split = c_split[:, None]

n_ref = 500
n_test = 500

mismatch_frac = .4  # fraction of instances where the time stamps are incorrect given the segment
n_mismatch = int(mismatch_frac * n_test)

n_runs = 300

p_vals_null, p_vals_alt = [], []
for _ in tqdm(range(n_runs)):
    
    # sample data
    # no change
    idx = np.random.choice(n_normal, size=n_normal, replace=False)    
    idx_ref, idx_test = idx[:n_ref], idx[n_ref:n_ref+n_test]
    x_ref = x_split[idx_ref]
    x_test_null = x_split[idx_test]
    x_test_alt = x_test_null
    
    # context
    c_ref, c_test_null = c_split[idx_ref], c_split[idx_test]
    
    # mismatched time stamps
    c_test_alt = c_test_null.copy()
    idx_mismatch = np.random.choice(n_test-1, size=n_mismatch, replace=False)
    c_test_alt[idx_mismatch] = c_test_alt[idx_mismatch+1]  # shift 1 spot to the right
    
    # detect drift
    dd = ContextMMDDrift(x_ref, c_ref, p_val=.05, n_permutations=100, backend='pytorch')
    preds_null = dd.predict(x_test_null, c_test_null)
    preds_alt = dd.predict(x_test_alt, c_test_alt)
    p_vals_null.append(preds_null['data']['p_val'])
    p_vals_alt.append(preds_alt['data']['p_val'])
    
p_vals_null = np.array(p_vals_null)
p_vals_alt = np.array(p_vals_alt)

#| scrolled: false
plot_p_val_qq(p_vals_null, 'Q-Q plot Context-Aware MMD detector under no change')
threshold = .05
print(f'Power at {threshold * 100}% significance level')
print(f'Context-aware MMD with mismatched time stamps: {(p_vals_alt < threshold).mean():.3f}')

Model Distillation drift detector on CIFAR-10

Method

Model distillation is a technique that is used to transfer knowledge from a large network to a smaller network. Typically, it consists of training a second model with a simplified architecture on soft targets (the output distributions or the logits) obtained from the original model.

Here, we apply model distillation to obtain harmfulness scores, by comparing the output distributions of the original model with the output distributions of the distilled model, in order to detect adversarial data, malicious data drift or data corruption. We use the following definition of harmful and harmless data points:

Harmful data points are defined as inputs for which the model's predictions on the uncorrupted data are correct while the model's predictions on the corrupted data are wrong.
Harmless data points are defined as inputs for which the model's predictions on the uncorrupted data are correct and the model's predictions on the corrupted data remain correct.

Analogously to the adversarial AE detector, which is also part of the library, the model distillation detector picks up drift that reduces the performance of the classification model.

Moreover, in this example a drift detector that applies two-sample Kolmogorov-Smirnov (K-S) tests to the scores is employed. The p-values obtained are used to assess the harmfulness of the data.

Dataset

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import tensorflow as tf
from alibi_detect.cd import KSDrift
from alibi_detect.ad import ModelDistillation

from alibi_detect.models.tensorflow import scale_by_instance
from alibi_detect.utils.fetching import fetch_tf_model, fetch_detector
from alibi_detect.utils.tensorflow import predict_batch
from alibi_detect.saving import save_detector
from alibi_detect.datasets import fetch_cifar10c, corruption_types_cifar10c

Load data

Original CIFAR-10 data:

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype('float32') / 255
X_train = scale_by_instance(X_train)
y_train = y_train.astype('int64').reshape(-1,)
X_test = X_test.astype('float32') / 255
y_test = y_test.astype('int64').reshape(-1,)

For CIFAR-10-C, we can select from the following corruption types at 5 severity levels:

corruptions = corruption_types_cifar10c()
print(corruptions)

Let's pick a subset of the corruptions at corruption level 5. Each corruption type consists of perturbations on all of the original test set images.

corruption = ['gaussian_noise', 'motion_blur', 'brightness', 'pixelate']
X_corr, y_corr = fetch_cifar10c(corruption=corruption, severity=5, return_X_y=True)
X_corr = X_corr.astype('float32') / 255

We split the corrupted data by corruption type:

X_c = []
n_corr = len(corruption)
n_test = X_test.shape[0]
for i in range(n_corr):
    X_c.append(X_corr[i * n_test:(i + 1) * n_test])

We can visualise the same instance for each corruption type:

#| tags: [hide_input]
i = 1

n_test = X_test.shape[0]
plt.title('Original')
plt.axis('off')
plt.imshow(X_test[i])
plt.show()
for _ in range(len(corruption)):
    plt.title(corruption[_])
    plt.axis('off')
    plt.imshow(X_corr[n_test * _+ i])
    plt.show()

We can also verify that the performance of a classification model on CIFAR-10 drops significantly on this perturbed dataset:

dataset = 'cifar10'
model = 'resnet32'
clf = fetch_tf_model(dataset, model)
acc = clf.evaluate(scale_by_instance(X_test), y_test, batch_size=128, verbose=0)[1]
print('Test set accuracy:')
print('Original {:.4f}'.format(acc))
clf_accuracy = {'original': acc}
for _ in range(len(corruption)):
    acc = clf.evaluate(scale_by_instance(X_c[_]), y_test, batch_size=128, verbose=0)[1]
    clf_accuracy[corruption[_]] = acc
    print('{} {:.4f}'.format(corruption[_], acc))

Model distillation as a malicious drift detector

Analogously to the adversarial AE detector, which uses an autoencoder to reproduce the output distribution of a classifier and produce adversarial scores, the model distillation detector achieves the same goal by using a simple classifier in place of the autoencoder. This approach is more flexible since it bypasses the instance's generation step, and it can be applied in a straightforward way to a variety of data sets such as text or time series.

We can use the adversarial scores produced by the Model Distillation detector in the context of drift detection. The score function of the detector becomes the preprocessing function for the drift detector. The K-S test is then a simple univariate test between the adversarial scores of the reference batch and the test data. Higher adversarial scores indicate more harmful drift. Importantly, a harmfulness detector flags malicious data drift. We can fetch the pretrained model distillation detector from a Google Cloud Bucket or train one from scratch:

Definition and training of the distilled model

from tensorflow.keras.layers import Conv2D, Dense, Flatten, InputLayer
from tensorflow.keras.regularizers import l1

def distilled_model_cifar10(clf, nb_conv_layers=3, nb_filters1=256, nb_dense=40,
                            kernel1=4, kernel2=4, kernel3=4, ae_arch=False):
    print('Define distilled model')
    nb_filters1 = int(nb_filters1)
    nb_filters2 = int(nb_filters1 / 2)
    nb_filters3 = int(nb_filters1 / 4)
    layers = [InputLayer(input_shape=(32, 32, 3)),
              Conv2D(nb_filters1, kernel1, strides=2, padding='same')]
    if nb_conv_layers > 1:
        layers.append(Conv2D(nb_filters2, kernel2, strides=2, padding='same',
                             activation=tf.nn.relu, kernel_regularizer=l1(1e-5)))
    if nb_conv_layers > 2:
        layers.append(Conv2D(nb_filters3, kernel3, strides=2, padding='same',
                             activation=tf.nn.relu, kernel_regularizer=l1(1e-5)))
    layers.append(Flatten())
    layers.append(Dense(nb_dense))
    layers.append(Dense(clf.output_shape[1], activation='softmax'))
    distilled_model = tf.keras.Sequential(layers)
    return distilled_model

def accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return (y_true == y_pred).astype(int).sum() / y_true.shape[0]

load_pretrained = True

filepath = 'my_path' # change to (absolute) directory where model is downloaded
detector_type = 'adversarial'
detector_name = 'model_distillation'
filepath = os.path.join(filepath, detector_name)
if load_pretrained:
    ad = fetch_detector(filepath, detector_type, dataset, detector_name, model=model)
else:
    distilled_model = distilled_model_cifar10(clf)
    print(distilled_model.summary())
    ad = ModelDistillation(distilled_model=distilled_model, model=clf)
    ad.fit(X_train, epochs=50, batch_size=128, verbose=True)
    save_detector(ad, filepath)

Scores and p-values calculation

Here we initialize the K-S drift detector using the harmfulness scores as a preprocessing function. The KS test is performed on these scores.

batch_size = 100
nb_batches = 100
severities = [1, 2, 3, 4, 5]

def sample_batch(x_orig, x_corr, batch_size, p):
    nb_orig = int(batch_size * (1 - p))
    nb_corr = batch_size - nb_orig
    perc = np.round(nb_corr / batch_size, 2)
    
    idx_orig = np.random.choice(range(x_orig.shape[0]), nb_orig)
    x_sample_orig = x_orig[idx_orig]    
    
    idx_corr = np.random.choice(range(x_corr.shape[0]), nb_corr)
    x_sample_corr = x_corr[idx_corr]
    
    x_batch = np.concatenate([x_sample_orig, x_sample_corr])
    return x_batch, perc

Initialise the drift detector:

from functools import partial

np.random.seed(0)
n_ref = 1000
idx_ref = np.random.choice(range(X_test.shape[0]), n_ref)
X_test = scale_by_instance(X_test)
X_ref = X_test[idx_ref]
labels = ['No!', 'Yes!']

# adversarial score fn = preprocess step
preprocess_fn = partial(ad.score, batch_size=128)

# initialize the drift detector
cd = KSDrift(X_ref, p_val=.05, preprocess_fn=preprocess_fn)

Calculate scores. We split the corrupted data into harmful and harmless data and visualize the harmfulness scores for various values of corruption severity.

dfs = {}
score_drift = {
    1: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
    2: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
    3: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
    4: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
    5: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
}
y_pred = predict_batch(X_test, clf, batch_size=256).argmax(axis=1)
score_x = ad.score(X_test, batch_size=256)

for s in severities:
    print('Loading corrupted data. Severity = {}'.format(s))
    X_corr, y_corr = fetch_cifar10c(corruption=corruptions, severity=s, return_X_y=True)
    print('Preprocess data...')
    X_corr = X_corr.astype('float32') / 255
    X_corr = scale_by_instance(X_corr)
    
    print('Make predictions on corrupted dataset...')
    y_pred_corr = predict_batch(X_corr, clf, batch_size=1000).argmax(axis=1)
    
    print('Compute adversarial scores on corrupted dataset...')
    score_corr = ad.score(X_corr, batch_size=256)
    
    labels_corr = np.zeros(score_corr.shape[0])
    repeat = y_corr.shape[0] // y_test.shape[0]
    y_pred_repeat = np.tile(y_pred, (repeat,))
    
    # malicious/harmful corruption: original prediction correct but
    # prediction on corrupted data incorrect
    idx_orig_right = np.where(y_pred_repeat == y_corr)[0]
    idx_corr_wrong = np.where(y_pred_corr != y_corr)[0]
    idx_harmful = np.intersect1d(idx_orig_right, idx_corr_wrong)
    
    # harmless corruption: original prediction correct and prediction
    # on corrupted data correct
    labels_corr[idx_harmful] = 1
    labels = np.concatenate([np.zeros(X_test.shape[0]), labels_corr]).astype(int)
    idx_corr_right = np.where(y_pred_corr == y_corr)[0]
    idx_harmless = np.intersect1d(idx_orig_right, idx_corr_right)
    
    # Split corrupted inputs in harmful and harmless
    X_corr_harm = X_corr[idx_harmful]
    X_corr_noharm = X_corr[idx_harmless]

    # Store adversarial scores for harmful and harmless data
    score_drift[s]['all'] = score_corr
    score_drift[s]['harm'] = score_corr[idx_harmful]
    score_drift[s]['noharm'] = score_corr[idx_harmless]
    score_drift[s]['acc'] = accuracy(y_corr, y_pred_corr)
    
    print('Compute p-values')
    for j in range(nb_batches):
        ps = []
        pvs_harm = []
        pvs_noharm = []
        for p in np.arange(0, 1, 0.1):
            # Sampling a batch of size `batch_size` where a fraction p of the data
            # is corrupted harmful data and a fraction 1 - p is non-corrupted data
            X_batch_harm, _ = sample_batch(X_test, X_corr_harm, batch_size, p)
            
            # Sampling a batch of size `batch_size` where a fraction p of the data
            # is corrupted harmless data and a fraction 1 - p is non-corrupted data
            X_batch_noharm, perc = sample_batch(X_test, X_corr_noharm, batch_size, p)
            
            # Calculating p-values for the harmful and harmless data by applying
            # K-S test on the adversarial scores
            pv_harm = cd.score(X_batch_harm)
            pv_noharm = cd.score(X_batch_noharm)
            ps.append(perc * 100)
            pvs_harm.append(pv_harm[0])
            pvs_noharm.append(pv_noharm[0])
        if j == 0:
            df = pd.DataFrame({'p': ps})
        df['pvalue_harm_{}'.format(j)] = pvs_harm
        df['pvalue_noharm_{}'.format(j)] = pvs_noharm 

    for name in ['pvalue_harm', 'pvalue_noharm']:
        df[name + '_mean'] = df[[col for col in df.columns if name in col]].mean(axis=1)
        df[name + '_std'] = df[[col for col in df.columns if name in col]].std(axis=1)
        df[name + '_max'] = df[[col for col in df.columns if name in col]].max(axis=1)
        df[name + '_min'] = df[[col for col in df.columns if name in col]].min(axis=1)
    df.set_index('p', inplace=True)
    dfs[s] = df

Plot scores

We now plot the mean scores and standard deviations per severity level. The plot shows the mean harmfulness scores (lhs) and ResNet-32 accuracies (rhs) for increasing data corruption severity levels. Level 0 corresponds to the original test set. Harmful scores are scores from instances which have been flipped from the correct to an incorrect prediction because of the corruption. Not harmful means that a correct prediction was unchanged after the corruption.

mu_noharm, std_noharm = [], []
mu_harm, std_harm = [], []
acc = [clf_accuracy['original']]
for k, v in score_drift.items():
    mu_noharm.append(v['noharm'].mean())
    std_noharm.append(v['noharm'].std())
    mu_harm.append(v['harm'].mean())
    std_harm.append(v['harm'].std())
    acc.append(v['acc'])

plot_labels = ['0', '1', '2', '3', '4', '5']

N = 6
ind = np.arange(N)
width = .35

fig_bar_cd, ax = plt.subplots()
ax2 = ax.twinx()

p0 = ax.bar(ind[0], score_x.mean(), yerr=score_x.std(), capsize=2)
p1 = ax.bar(ind[1:], mu_noharm, width, yerr=std_noharm, capsize=2)
p2 = ax.bar(ind[1:] + width, mu_harm, width, yerr=std_harm, capsize=2)

ax.set_title('Harmfullness Scores and Accuracy by Corruption Severity')
ax.set_xticks(ind + width / 2)
ax.set_xticklabels(plot_labels)
ax.set_ylim((-2))
ax.legend((p1[0], p2[0]), ('Not Harmful', 'Harmful'), loc='upper right', ncol=2)
ax.set_ylabel('Score')
ax.set_xlabel('Corruption Severity')

color = 'tab:red'
ax2.set_ylabel('Accuracy', color=color)
ax2.plot(acc, color=color)
ax2.tick_params(axis='y', labelcolor=color)

plt.show()

Plot p-values for contaminated batches

In order to simulate a realistic scenario, we perform a K-S test on batches of instance which are increasingly contaminated with corrupted data. The following steps are implemented:

We randomly pick n_ref=1000 samples from the non-currupted test set to be used as a reference set in the initialization of the K-S drift detector.
We sample batches of data of size batch_size=100 contaminated with an increasing number of harmful corrupted data and harmless corrupted data.
The K-S detector predicts whether drift occurs between the contaminated batches and the reference data and returns the p-values of the test.
We observe that contamination of the batches with harmful data reduces the p-values much faster than contamination with harmless data. In the latter case, the p-values remain above the detection threshold even when the batch is heavily contaminated

We repeat the test for 100 randomly sampled batches and we plot the mean and the maximum p-values for each level of severity and contamination below. We can see from the plot that the detector is able to clearly detect a batch contaminated with harmful data compared to a batch contaminated with harmless data when the percentage of currupted data reaches 20%-30%.

#| scrolled: false
for s in severities:
    nrows = 1
    ncols = 2
    figsize = (15, 8)
    fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=figsize)
    title0 = ('Mean p-values for various percentages of corrupted data. \n' 
             ' Nb of batches = {}, batch size = {}, severity = {}'.format(
                 nb_batches, batch_size, s))
    title1 = ('Maximum p-values for various  percentages of corrupted data. \n' 
             ' Nb of batches = {}, batch size = {}, severity = {}'.format(
                 nb_batches, batch_size, s))
    dfs[s][['pvalue_harm_mean', 'pvalue_noharm_mean']].plot(ax=ax[0], title=title0)
    dfs[s][['pvalue_harm_max', 'pvalue_noharm_max']].plot(ax=ax[1], title=title1)
    for a in ax:
        a.set_xlabel('Percentage of corrupted data')
        a.set_ylabel('p-value')

Kolmogorov-Smirnov data drift detector on CIFAR-10

Method

The drift detector applies feature-wise two-sample Kolmogorov-Smirnov (K-S) tests. For multivariate data, the obtained p-values for each feature are aggregated either via the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls for the probability of at least one false positive. The FDR correction on the other hand allows for an expected fraction of false positives to occur.

For high-dimensional data, we typically want to reduce the dimensionality before computing the feature-wise univariate K-S tests and aggregating those via the chosen correction method. Following suggestions in Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift, we incorporate Untrained AutoEncoders (UAE) and black-box shift detection using the classifier's softmax outputs (BBSDs) as out-of-the box preprocessing methods and note that PCA can also be easily implemented using scikit-learn. Preprocessing methods which do not rely on the classifier will usually pick up drift in the input data, while BBSDs focuses on label shift. The adversarial detector which is part of the library can also be transformed into a drift detector picking up drift that reduces the performance of the classification model. We can therefore combine different preprocessing techniques to figure out if there is drift which hurts the model performance, and whether this drift can be classified as input drift or label shift.

Backend

The method works with both the PyTorch and TensorFlow frameworks for the optional preprocessing step. Alibi Detect does however not install PyTorch for you. Check the PyTorch docs how to do this.

Dataset

import matplotlib.pyplot as plt
import numpy as np
import os
import tensorflow as tf

from alibi_detect.cd import KSDrift
from alibi_detect.models.tensorflow import scale_by_instance
from alibi_detect.utils.fetching import fetch_tf_model, fetch_detector
from alibi_detect.saving import save_detector, load_detector
from alibi_detect.datasets import fetch_cifar10c, corruption_types_cifar10c

Load data

Original CIFAR-10 data:

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
y_train = y_train.astype('int64').reshape(-1,)
y_test = y_test.astype('int64').reshape(-1,)

For CIFAR-10-C, we can select from the following corruption types at 5 severity levels:

corruptions = corruption_types_cifar10c()
print(corruptions)

Let's pick a subset of the corruptions at corruption level 5. Each corruption type consists of perturbations on all of the original test set images.

corruption = ['gaussian_noise', 'motion_blur', 'brightness', 'pixelate']
X_corr, y_corr = fetch_cifar10c(corruption=corruption, severity=5, return_X_y=True)
X_corr = X_corr.astype('float32') / 255

We split the original test set in a reference dataset and a dataset which should not be rejected under the H0 of the K-S test. We also split the corrupted data by corruption type:

np.random.seed(0)
n_test = X_test.shape[0]
idx = np.random.choice(n_test, size=n_test // 2, replace=False)
idx_h0 = np.delete(np.arange(n_test), idx, axis=0)
X_ref,y_ref = X_test[idx], y_test[idx]
X_h0, y_h0 = X_test[idx_h0], y_test[idx_h0]
print(X_ref.shape, X_h0.shape)

# check that the classes are more or less balanced
classes, counts_ref = np.unique(y_ref, return_counts=True)
counts_h0 = np.unique(y_h0, return_counts=True)[1]
print('Class Ref H0')
for cl, cref, ch0 in zip(classes, counts_ref, counts_h0):
    assert cref + ch0 == n_test // 10
    print('{}     {} {}'.format(cl, cref, ch0))

n_corr = len(corruption)
X_c = [X_corr[i * n_test:(i + 1) * n_test] for i in range(n_corr)]

We can visualise the same instance for each corruption type:

#| tags: [hide_input]
i = 1

n_test = X_test.shape[0]
plt.title('Original')
plt.axis('off')
plt.imshow(X_test[i])
plt.show()
for _ in range(len(corruption)):
    plt.title(corruption[_])
    plt.axis('off')
    plt.imshow(X_corr[n_test * _+ i])
    plt.show()

We can also verify that the performance of a classification model on CIFAR-10 drops significantly on this perturbed dataset:

dataset = 'cifar10'
model = 'resnet32'
clf = fetch_tf_model(dataset, model)
acc = clf.evaluate(scale_by_instance(X_test), y_test, batch_size=128, verbose=0)[1]
print('Test set accuracy:')
print('Original {:.4f}'.format(acc))
clf_accuracy = {'original': acc}
for _ in range(len(corruption)):
    acc = clf.evaluate(scale_by_instance(X_c[_]), y_test, batch_size=128, verbose=0)[1]
    clf_accuracy[corruption[_]] = acc
    print('{} {:.4f}'.format(corruption[_], acc))

Given the drop in performance, it is important that we detect the harmful data drift!

Detect drift

First we try a drift detector using the TensorFlow framework for the preprocessing step. We are trying to detect data drift on high-dimensional (32x32x3) data using feature-wise univariate tests. It therefore makes sense to apply dimensionality reduction first. Some dimensionality reduction methods also used in Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift are readily available: a randomly initialized encoder (UAE or Untrained AutoEncoder in the paper), BBSDs (black-box shift detection using the classifier's softmax outputs) and PCA.

Random encoder

First we try the randomly initialized encoder:

#| scrolled: false
from functools import partial
from tensorflow.keras.layers import Conv2D, Dense, Flatten, InputLayer, Reshape
from alibi_detect.cd.tensorflow import preprocess_drift

tf.random.set_seed(0)

# define encoder
encoding_dim = 32
encoder_net = tf.keras.Sequential(
  [
      InputLayer(input_shape=(32, 32, 3)),
      Conv2D(64, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(128, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(512, 4, strides=2, padding='same', activation=tf.nn.relu),
      Flatten(),
      Dense(encoding_dim,)
  ]
)

# define preprocessing function
preprocess_fn = partial(preprocess_drift, model=encoder_net, batch_size=512)

# initialise drift detector
p_val = .05
cd = KSDrift(X_ref, p_val=p_val, preprocess_fn=preprocess_fn)

# we can also save/load an initialised detector
filepath = 'my_path'  # change to directory where detector is saved
save_detector(cd, filepath)
cd = load_detector(filepath)

The p-value used by the detector for the multivariate data with encoding_dim features is equal to p_val / encoding_dim because of the Bonferroni correction.

assert cd.p_val / cd.n_features == p_val / encoding_dim

Let's check whether the detector thinks drift occurred on the different test sets and time the prediction calls:

from timeit import default_timer as timer

labels = ['No!', 'Yes!']

def make_predictions(cd, x_h0, x_corr, corruption):
    t = timer()
    preds = cd.predict(x_h0)
    dt = timer() - t
    print('No corruption')
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print('Feature-wise p-values:')
    print(preds['data']['p_val'])
    print(f'Time (s) {dt:.3f}')
    
    if isinstance(x_corr, list):
        for x, c in zip(x_corr, corruption):
            t = timer()
            preds = cd.predict(x)
            dt = timer() - t
            print('')
            print(f'Corruption type: {c}')
            print('Drift? {}'.format(labels[preds['data']['is_drift']]))
            print('Feature-wise p-values:')
            print(preds['data']['p_val'])
            print(f'Time (s) {dt:.3f}')

make_predictions(cd, X_h0, X_c, corruption)

As expected, drift was only detected on the corrupted datasets. The feature-wise p-values for each univariate K-S test per (encoded) feature before multivariate correction show that most of them are well above the $0.05$ threshold for H0 and below for the corrupted datasets.

BBSDs

For BBSDs, we use the classifier's softmax outputs for black-box shift detection. This method is based on Detecting and Correcting for Label Shift with Black Box Predictors. The ResNet classifier is trained on data standardised by instance so we need to rescale the data.

X_train = scale_by_instance(X_train)
X_test = scale_by_instance(X_test)
X_ref = scale_by_instance(X_ref)
X_h0 = scale_by_instance(X_h0)
X_c = [scale_by_instance(X_c[i]) for i in range(n_corr)]

Now we initialize the detector. Here we use the output of the softmax layer to detect the drift, but other hidden layers can be extracted as well by setting 'layer' to the index of the desired hidden layer in the model:

from alibi_detect.cd.tensorflow import HiddenOutput

# define preprocessing function, we use the
preprocess_fn = partial(preprocess_drift, model=HiddenOutput(clf, layer=-1), batch_size=128)

cd = KSDrift(X_ref, p_val=p_val, preprocess_fn=preprocess_fn)

Again we can see that the p-value used by the detector for the multivariate data with 10 features (number of CIFAR-10 classes) is equal to p_val / 10 because of the Bonferroni correction.

assert cd.p_val / cd.n_features == p_val / 10

There is no drift on the original held out test set:

make_predictions(cd, X_h0, X_c, corruption)

Label drift

We can also check what happens when we introduce class imbalances between the reference data X_ref and the tested data X_imb. The reference data will use $75$% of the instances of the first 5 classes and only $25$% of the last 5. The data used for drift testing then uses respectively $25$% and $75$% of the test instances for the first and last 5 classes.

np.random.seed(0)
# get index for each class in the test set
num_classes = len(np.unique(y_test))
idx_by_class = [np.where(y_test == c)[0] for c in range(num_classes)]
# sample imbalanced data for different classes for X_ref and X_imb
perc_ref = .75
perc_ref_by_class = [perc_ref if c < 5 else 1 - perc_ref for c in range(num_classes)]
n_by_class = n_test // num_classes
X_ref = []
X_imb, y_imb = [], []
for _ in range(num_classes):
    idx_class_ref = np.random.choice(n_by_class, size=int(perc_ref_by_class[_] * n_by_class), replace=False)
    idx_ref = idx_by_class[_][idx_class_ref]
    idx_class_imb = np.delete(np.arange(n_by_class), idx_class_ref, axis=0)
    idx_imb = idx_by_class[_][idx_class_imb]
    assert not np.array_equal(idx_ref, idx_imb)
    X_ref.append(X_test[idx_ref])
    X_imb.append(X_test[idx_imb])
    y_imb.append(y_test[idx_imb])
X_ref = np.concatenate(X_ref)
X_imb = np.concatenate(X_imb)
y_imb = np.concatenate(y_imb)
print(X_ref.shape, X_imb.shape, y_imb.shape)

Update reference dataset for the detector and make predictions. Note that we store the preprocessed reference data since the preprocess_at_init kwarg is by default True:

cd.x_ref = cd.preprocess_fn(X_ref)

preds_imb = cd.predict(X_imb)
print('Drift? {}'.format(labels[preds_imb['data']['is_drift']]))
print(preds_imb['data']['p_val'])

Update reference data

So far we have kept the reference data the same throughout the experiments. It is possible however that we want to test a new batch against the last N instances or against a batch of instances of fixed size where we give each instance we have seen up until now the same chance of being in the reference batch (reservoir sampling). The update_x_ref argument allows you to change the reference data update rule. It is a Dict which takes as key the update rule ('last' for last N instances or 'reservoir_sampling') and as value the batch size N of the reference data. You can also save the detector after the prediction calls to save the updated reference data.

N = 7500
cd = KSDrift(X_ref, p_val=.05, preprocess_fn=preprocess_fn, update_x_ref={'reservoir_sampling': N})

The reference data is now updated with each predict call. Say we start with our imbalanced reference set and make a prediction on the remaining test set data X_imb, then the drift detector will figure out data drift has occurred.

preds_imb = cd.predict(X_imb)
print('Drift? {}'.format(labels[preds_imb['data']['is_drift']]))

We can now see that the reference data consists of N instances, obtained through reservoir sampling.

assert cd.x_ref.shape[0] == N

We then draw a random sample from the training set and compare it with the updated reference data. This still highlights that there is data drift but will update the reference data again:

np.random.seed(0)
perc_train = .5
n_train = X_train.shape[0]
idx_train = np.random.choice(n_train, size=int(perc_train * n_train), replace=False)

preds_train = cd.predict(X_train[idx_train])
print('Drift? {}'.format(labels[preds_train['data']['is_drift']]))

When we draw a new sample from the training set, it highlights that it is not drifting anymore against the reservoir in X_ref.

np.random.seed(1)
perc_train = .1
idx_train = np.random.choice(n_train, size=int(perc_train * n_train), replace=False)
preds_train = cd.predict(X_train[idx_train])
print('Drift? {}'.format(labels[preds_train['data']['is_drift']]))

Multivariate correction mechanism

Instead of the Bonferroni correction for multivariate data, we can also use the less conservative False Discovery Rate (FDR) correction. See here or here for nice explanations. While the Bonferroni correction controls the probability of at least one false positive, the FDR correction controls for an expected amount of false positives. The p_val argument at initialisation time can be interpreted as the acceptable q-value when the FDR correction is applied.

cd = KSDrift(X_ref, p_val=.05, preprocess_fn=preprocess_fn, correction='fdr')

preds_imb = cd.predict(X_imb)
print('Drift? {}'.format(labels[preds_imb['data']['is_drift']]))

Adversarial autoencoder as a malicious drift detector

We can leverage the adversarial scores obtained from an adversarial autoencoder trained on normal data and transform it into a data drift detector. The score function of the adversarial autoencoder becomes the preprocessing function for the drift detector. The K-S test is then a simple univariate test on the adversarial scores. Importantly, an adversarial drift detector flags malicious data drift. We can fetch the pretrained adversarial detector from a Google Cloud Bucket or train one from scratch:

load_pretrained = True

#| scrolled: true
from tensorflow.keras.regularizers import l1
from tensorflow.keras.layers import Conv2DTranspose
from alibi_detect.ad import AdversarialAE

# change filepath to (absolute) directory where model is downloaded
filepath = os.path.join(os.getcwd(), 'my_path')
detector_type = 'adversarial'
detector_name = 'base'
filepath = os.path.join(filepath, detector_name)
if load_pretrained:
    ad = fetch_detector(filepath, detector_type, dataset, detector_name, model=model)
else:  # train detector from scratch
    # define encoder and decoder networks
    encoder_net = tf.keras.Sequential(
            [
                InputLayer(input_shape=(32, 32, 3)),
                Conv2D(32, 4, strides=2, padding='same', 
                       activation=tf.nn.relu, kernel_regularizer=l1(1e-5)),
                Conv2D(64, 4, strides=2, padding='same', 
                       activation=tf.nn.relu, kernel_regularizer=l1(1e-5)),
                Conv2D(256, 4, strides=2, padding='same', 
                       activation=tf.nn.relu, kernel_regularizer=l1(1e-5)),
                Flatten(),
                Dense(40)
            ]
        )
    
    decoder_net = tf.keras.Sequential(
        [
                InputLayer(input_shape=(40,)),
                Dense(4 * 4 * 128, activation=tf.nn.relu),
                Reshape(target_shape=(4, 4, 128)),
                Conv2DTranspose(256, 4, strides=2, padding='same', 
                                activation=tf.nn.relu, kernel_regularizer=l1(1e-5)),
                Conv2DTranspose(64, 4, strides=2, padding='same', 
                                activation=tf.nn.relu, kernel_regularizer=l1(1e-5)),
                Conv2DTranspose(3, 4, strides=2, padding='same', 
                                activation=None, kernel_regularizer=l1(1e-5))
            ]
        )
    
    # initialise and train detector
    ad = AdversarialAE(encoder_net=encoder_net, decoder_net=decoder_net, model=clf)
    ad.fit(X_train, epochs=50, batch_size=128, verbose=True)
    
    # save the trained adversarial detector
    save_detector(ad, filepath)

Initialise the drift detector:

np.random.seed(0)
idx = np.random.choice(n_test, size=n_test // 2, replace=False)
X_ref = scale_by_instance(X_test[idx])

# adversarial score fn = preprocess step
preprocess_fn = partial(ad.score, batch_size=128)

cd = KSDrift(X_ref, p_val=.05, preprocess_fn=preprocess_fn)

Make drift predictions on the original test set and corrupted data:

clf_accuracy['h0'] = clf.evaluate(X_h0, y_h0, batch_size=128, verbose=0)[1]
preds_h0 = cd.predict(X_h0)
print('H0: Accuracy {:.4f} -- Drift? {}'.format(
    clf_accuracy['h0'], labels[preds_h0['data']['is_drift']]))
clf_accuracy['imb'] = clf.evaluate(X_imb, y_imb, batch_size=128, verbose=0)[1]
preds_imb = cd.predict(X_imb)
print('imbalance: Accuracy {:.4f} -- Drift? {}'.format(
    clf_accuracy['imb'], labels[preds_imb['data']['is_drift']]))
for x, c in zip(X_c, corruption):
    preds = cd.predict(x)
    print('{}: Accuracy {:.4f} -- Drift? {}'.format(
        c, clf_accuracy[c],labels[preds['data']['is_drift']]))

While X_imb clearly exhibits input data drift due to the introduced class imbalances, it is not flagged by the adversarial drift detector since the performance of the classifier is not affected and the drift is not malicious. We can visualise this by plotting the adversarial scores together with the harmfulness of the data corruption as reflected by the drop in classifier accuracy:

adv_scores = {}
score = ad.score(X_ref, batch_size=128)
adv_scores['original'] = {'mean': score.mean(), 'std': score.std()}
score = ad.score(X_h0, batch_size=128)
adv_scores['h0'] = {'mean': score.mean(), 'std': score.std()}
score = ad.score(X_imb, batch_size=128)
adv_scores['imb'] = {'mean': score.mean(), 'std': score.std()}

for x, c in zip(X_c, corruption):
    score_x = ad.score(x, batch_size=128)
    adv_scores[c] = {'mean': score_x.mean(), 'std': score_x.std()}

mu = [v['mean'] for _, v in adv_scores.items()]
stdev = [v['std'] for _, v in adv_scores.items()]
xlabels = list(adv_scores.keys())
acc = [clf_accuracy[label] for label in xlabels]
xticks = np.arange(len(mu))

width = .35

fig, ax = plt.subplots()
ax2 = ax.twinx()

p1 = ax.bar(xticks, mu, width, yerr=stdev, capsize=2)
color = 'tab:red'
p2 = ax2.bar(xticks + width, acc, width, color=color)

ax.set_title('Adversarial Scores and Accuracy by Corruption Type')
ax.set_xticks(xticks + width / 2)
ax.set_xticklabels(xlabels, rotation=45)
ax.legend((p1[0], p2[0]), ('Score', 'Accuracy'), loc='upper right', ncol=2)
ax.set_ylabel('Adversarial Score')

color = 'tab:red'
ax2.set_ylabel('Accuracy')
ax2.set_ylim((-.26,1.2))
ax.set_ylim((-2,9))

plt.show()

We can therefore use the scores of the detector itself to quantify the harmfulness of the drift! We can generalise this to all the corruptions at each severity level in CIFAR-10-C:

def accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return (y_true == y_pred).astype(int).sum() / y_true.shape[0]

from alibi_detect.utils.tensorflow import predict_batch

severities = [1, 2, 3, 4, 5]

score_drift = {
    1: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
    2: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
    3: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
    4: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
    5: {'all': [], 'harm': [], 'noharm': [], 'acc': 0},
}

y_pred = predict_batch(X_test, clf, batch_size=256).argmax(axis=1)
score_x = ad.score(X_test, batch_size=256)

for s in severities:
    print('\nSeverity: {} of {}'.format(s, len(severities)))
    
    print('Loading corrupted dataset...')
    X_corr, y_corr = fetch_cifar10c(corruption=corruptions, severity=s, return_X_y=True)
    X_corr = X_corr.astype('float32')
    
    print('Preprocess data...')
    X_corr = scale_by_instance(X_corr)
    
    print('Make predictions on corrupted dataset...')
    y_pred_corr = predict_batch(X_corr, clf, batch_size=256).argmax(axis=1)
    
    print('Compute adversarial scores on corrupted dataset...')
    score_corr = ad.score(X_corr, batch_size=256)
    
    print('Get labels for malicious corruptions...')
    labels_corr = np.zeros(score_corr.shape[0])
    repeat = y_corr.shape[0] // y_test.shape[0]
    y_pred_repeat = np.tile(y_pred, (repeat,))
    # malicious/harmful corruption: original prediction correct but 
    # prediction on corrupted data incorrect
    idx_orig_right = np.where(y_pred_repeat == y_corr)[0]
    idx_corr_wrong = np.where(y_pred_corr != y_corr)[0]
    idx_harmful = np.intersect1d(idx_orig_right, idx_corr_wrong)
    labels_corr[idx_harmful] = 1
    labels = np.concatenate([np.zeros(X_test.shape[0]), labels_corr]).astype(int)
    # harmless corruption: original prediction correct and prediction
    # on corrupted data correct
    idx_corr_right = np.where(y_pred_corr == y_corr)[0]
    idx_harmless = np.intersect1d(idx_orig_right, idx_corr_right)
    
    score_drift[s]['all'] = score_corr
    score_drift[s]['harm'] = score_corr[idx_harmful]
    score_drift[s]['noharm'] = score_corr[idx_harmless]
    score_drift[s]['acc'] = accuracy(y_corr, y_pred_corr)

We now compute mean scores and standard deviations per severity level and plot the results. The plot shows the mean adversarial scores (lhs) and ResNet-32 accuracies (rhs) for increasing data corruption severity levels. Level 0 corresponds to the original test set. Harmful scores are scores from instances which have been flipped from the correct to an incorrect prediction because of the corruption. Not harmful means that the prediction was unchanged after the corruption.

mu_noharm, std_noharm = [], []
mu_harm, std_harm = [], []
acc = [clf_accuracy['original']]
for k, v in score_drift.items():
    mu_noharm.append(v['noharm'].mean())
    std_noharm.append(v['noharm'].std())
    mu_harm.append(v['harm'].mean())
    std_harm.append(v['harm'].std())
    acc.append(v['acc'])

plot_labels = ['0', '1', '2', '3', '4', '5']

N = 6
ind = np.arange(N)
width = .35

fig_bar_cd, ax = plt.subplots()
ax2 = ax.twinx()

p0 = ax.bar(ind[0], score_x.mean(), yerr=score_x.std(), capsize=2)
p1 = ax.bar(ind[1:], mu_noharm, width, yerr=std_noharm, capsize=2)
p2 = ax.bar(ind[1:] + width, mu_harm, width, yerr=std_harm, capsize=2)

ax.set_title('Adversarial Scores and Accuracy by Corruption Severity')
ax.set_xticks(ind + width / 2)
ax.set_xticklabels(plot_labels)
ax.set_ylim((-1,6))
ax.legend((p1[0], p2[0]), ('Not Harmful', 'Harmful'), loc='upper right', ncol=2)
ax.set_ylabel('Score')
ax.set_xlabel('Corruption Severity')

color = 'tab:red'
ax2.set_ylabel('Accuracy', color=color)
ax2.plot(acc, color=color)
ax2.tick_params(axis='y', labelcolor=color)

plt.show()

Maximum Mean Discrepancy drift detector on CIFAR-10

Method

The Maximum Mean Discrepancy (MMD) detector is a kernel-based method for multivariate 2 sample testing. The MMD is a distance-based measure between 2 distributions p and q based on the mean embeddings $\mu_{p}$ and $\mu_{q}$ in a reproducing kernel Hilbert space $F$:

MMD(F, p, q) = || \mu_{p} - \mu_{q} ||^2_{F}

We can compute unbiased estimates of $MMD^2$ from the samples of the 2 distributions after applying the kernel trick. We use by default a radial basis function kernel, but users are free to pass their own kernel of preference to the detector. We obtain a $p$-value via a permutation test on the values of $MMD^2$. This method is also described in Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift.

Backend

The method is implemented in both the PyTorch and TensorFlow frameworks with support for CPU and GPU. Various preprocessing steps are also supported out-of-the box in Alibi Detect for both frameworks and illustrated throughout the notebook. Alibi Detect does however not install PyTorch for you. Check the PyTorch docs how to do this.

Dataset

from functools import partial
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from alibi_detect.cd import MMDDrift
from alibi_detect.models.tensorflow import scale_by_instance
from alibi_detect.utils.fetching import fetch_tf_model
from alibi_detect.saving import save_detector, load_detector
from alibi_detect.datasets import fetch_cifar10c, corruption_types_cifar10c

Load data

Original CIFAR-10 data:

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
y_train = y_train.astype('int64').reshape(-1,)
y_test = y_test.astype('int64').reshape(-1,)

For CIFAR-10-C, we can select from the following corruption types at 5 severity levels:

corruptions = corruption_types_cifar10c()
print(corruptions)

Let's pick a subset of the corruptions at corruption level 5. Each corruption type consists of perturbations on all of the original test set images.

corruption = ['gaussian_noise', 'motion_blur', 'brightness', 'pixelate']
X_corr, y_corr = fetch_cifar10c(corruption=corruption, severity=5, return_X_y=True)
X_corr = X_corr.astype('float32') / 255

We split the original test set in a reference dataset and a dataset which should not be rejected under the H0 of the MMD test. We also split the corrupted data by corruption type:

np.random.seed(0)
n_test = X_test.shape[0]
idx = np.random.choice(n_test, size=n_test // 2, replace=False)
idx_h0 = np.delete(np.arange(n_test), idx, axis=0)
X_ref,y_ref = X_test[idx], y_test[idx]
X_h0, y_h0 = X_test[idx_h0], y_test[idx_h0]
print(X_ref.shape, X_h0.shape)

# check that the classes are more or less balanced
classes, counts_ref = np.unique(y_ref, return_counts=True)
counts_h0 = np.unique(y_h0, return_counts=True)[1]
print('Class Ref H0')
for cl, cref, ch0 in zip(classes, counts_ref, counts_h0):
    assert cref + ch0 == n_test // 10
    print('{}     {} {}'.format(cl, cref, ch0))

n_corr = len(corruption)
X_c = [X_corr[i * n_test:(i + 1) * n_test] for i in range(n_corr)]

We can visualise the same instance for each corruption type:

#| tags: [hide_input]
i = 4

n_test = X_test.shape[0]
plt.title('Original')
plt.axis('off')
plt.imshow(X_test[i])
plt.show()
for _ in range(len(corruption)):
    plt.title(corruption[_])
    plt.axis('off')
    plt.imshow(X_corr[n_test * _+ i])
    plt.show()

We can also verify that the performance of a classification model on CIFAR-10 drops significantly on this perturbed dataset:

dataset = 'cifar10'
model = 'resnet32'
clf = fetch_tf_model(dataset, model)
acc = clf.evaluate(scale_by_instance(X_test), y_test, batch_size=128, verbose=0)[1]
print('Test set accuracy:')
print('Original {:.4f}'.format(acc))
clf_accuracy = {'original': acc}
for _ in range(len(corruption)):
    acc = clf.evaluate(scale_by_instance(X_c[_]), y_test, batch_size=128, verbose=0)[1]
    clf_accuracy[corruption[_]] = acc
    print('{} {:.4f}'.format(corruption[_], acc))

Given the drop in performance, it is important that we detect the harmful data drift!

Detect drift with TensorFlow backend

First we try a drift detector using the TensorFlow framework for both the preprocessing and the MMD computation steps.

We are trying to detect data drift on high-dimensional (32x32x3) data using a multivariate MMD permutation test. It therefore makes sense to apply dimensionality reduction first. Some dimensionality reduction methods also used in Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift are readily available: a randomly initialized encoder (UAE or Untrained AutoEncoder in the paper), BBSDs (black-box shift detection using the classifier's softmax outputs) and PCA (using scikit-learn).

Random encoder

First we try the randomly initialized encoder:

#| scrolled: false
from tensorflow.keras.layers import Conv2D, Dense, Flatten, InputLayer, Reshape
from alibi_detect.cd.tensorflow import preprocess_drift

tf.random.set_seed(0)

# define encoder
encoding_dim = 32
encoder_net = tf.keras.Sequential(
  [
      InputLayer(input_shape=(32, 32, 3)),
      Conv2D(64, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(128, 4, strides=2, padding='same', activation=tf.nn.relu),
      Conv2D(512, 4, strides=2, padding='same', activation=tf.nn.relu),
      Flatten(),
      Dense(encoding_dim,)
  ]
)

# define preprocessing function
preprocess_fn = partial(preprocess_drift, model=encoder_net, batch_size=512)

# initialise drift detector
cd = MMDDrift(X_ref, backend='tensorflow', p_val=.05, 
              preprocess_fn=preprocess_fn, n_permutations=100)

# we can also save/load an initialised detector
filepath = 'detector_tf'  # change to directory where detector is saved
save_detector(cd, filepath)
cd = load_detector(filepath)

Let's check whether the detector thinks drift occurred on the different test sets and time the prediction calls:

from timeit import default_timer as timer

labels = ['No!', 'Yes!']

def make_predictions(cd, x_h0, x_corr, corruption):
    t = timer()
    preds = cd.predict(x_h0)
    dt = timer() - t
    print('No corruption')
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print(f'p-value: {preds["data"]["p_val"]:.3f}')
    print(f'Time (s) {dt:.3f}')
    
    if isinstance(x_corr, list):
        for x, c in zip(x_corr, corruption):
            t = timer()
            preds = cd.predict(x)
            dt = timer() - t
            print('')
            print(f'Corruption type: {c}')
            print('Drift? {}'.format(labels[preds['data']['is_drift']]))
            print(f'p-value: {preds["data"]["p_val"]:.3f}')
            print(f'Time (s) {dt:.3f}')

#| scrolled: false
make_predictions(cd, X_h0, X_c, corruption)

As expected, drift was only detected on the corrupted datasets.

BBSDs

X_ref_bbsds = scale_by_instance(X_ref)
X_h0_bbsds = scale_by_instance(X_h0)
X_c_bbsds = [scale_by_instance(X_c[i]) for i in range(n_corr)]

Initialisation of the drift detector. Here we use the output of the softmax layer to detect the drift, but other hidden layers can be extracted as well by setting 'layer' to the index of the desired hidden layer in the model:

from alibi_detect.cd.tensorflow import HiddenOutput

# define preprocessing function
preprocess_fn = partial(preprocess_drift, model=HiddenOutput(clf, layer=-1), batch_size=128)

# initialise drift detector
cd = MMDDrift(X_ref_bbsds, backend='tensorflow', p_val=.05, 
              preprocess_fn=preprocess_fn, n_permutations=100)

make_predictions(cd, X_h0_bbsds, X_c_bbsds, corruption)

Again drift is only flagged on the perturbed data.

Detect drift with PyTorch backend

We can do the same thing using the PyTorch backend. We illustrate this using the randomly initialized encoder as preprocessing step:

import torch
import torch.nn as nn

# set random seed and device
seed = 0
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

Since our PyTorch encoder expects the images in a (batch size, channels, height, width) format, we transpose the data:

def permute_c(x):
    return np.transpose(x.astype(np.float32), (0, 3, 1, 2))

X_ref_pt = permute_c(X_ref)
X_h0_pt = permute_c(X_h0)
X_c_pt = [permute_c(xc) for xc in X_c]
print(X_ref_pt.shape, X_h0_pt.shape, X_c_pt[0].shape)

from alibi_detect.cd.pytorch import preprocess_drift

# define encoder
encoder_net = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(128, 512, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(2048, encoding_dim)
).to(device).eval()

# define preprocessing function
preprocess_fn = partial(preprocess_drift, model=encoder_net, device=device, batch_size=512)

# initialise drift detector
cd = MMDDrift(X_ref_pt, backend='pytorch', p_val=.05, 
              preprocess_fn=preprocess_fn, n_permutations=100)

# we can also save/load an initialised PyTorch based detector
filepath = 'detector_pt'  # change to directory where detector is saved
save_detector(cd, filepath)
cd = load_detector(filepath)

make_predictions(cd, X_h0_pt, X_c_pt, corruption)

The drift detector will attempt to use the GPU if available and otherwise falls back on the CPU. We can also explicitly specify the device. Let's compare the GPU speed up with the CPU implementation:

device = torch.device('cpu')
preprocess_fn = partial(preprocess_drift, model=encoder_net.to(device), 
                        device=device, batch_size=512)

cd = MMDDrift(X_ref_pt, backend='pytorch', preprocess_fn=preprocess_fn, device='cpu')

make_predictions(cd, X_h0_pt, X_c_pt, corruption)

Notice the over 30x acceleration provided by the GPU.

Similar to the TensorFlow implementation, PyTorch can also use the hidden layer output from a pretrained model for the preprocessing step via:

from alibi_detect.cd.pytorch import HiddenOutput

Scaling up drift detection with KeOps

Introduction

A number of convenient and powerful kernel-based drift detectors such as the MMD detector (Gretton et al., 2012) or the learned kernel MMD detector (Liu et al., 2020) do not scale favourably with increasing dataset size $n$, leading to quadratic complexity $\mathcal{O}(n^2)$ for naive implementations. As a result, we can quickly run into memory issues by having to store the $[N_\text{ref} + N_\text{test}, N_\text{ref} + N_\text{test}]$ kernel matrix (on the GPU if applicable) used for an efficient implementation of the permutation test. Note that $N_\text{ref}$ is the reference data size and $N_\text{test}$ the test data size.

We can however drastically speed up and scale up kernel-based drift detectors to large dataset sizes by working with symbolic kernel matrices instead and leverage the KeOps library to do so. For the user of $\texttt{Alibi Detect}$ the only thing that changes is the specification of the detector's backend, e.g. for the MMD detector:

from alibi_detect.cd import MMDDrift

detector_torch = MMDDrift(x_ref, backend='pytorch')
detector_keops = MMDDrift(x_ref, backend='keops')

In this notebook we will run a few simple benchmarks to illustrate the speed and memory improvements from using KeOps over vanilla PyTorch on the GPU (1x RTX 2080 Ti) for both the standard MMD and learned kernel MMD detectors.

Data

We randomly sample points from the standard normal distribution and run the detectors with PyTorch and KeOps backends for the following settings:

$N_\text{ref}, N_\text{test} = [2, 5, 10, 20, 50, 100]$ (batch sizes in '000s)
$D = [2, 10, 50]$

Where $D$ denotes the number of features.

Requirements

The notebook requires PyTorch and KeOps to be installed. Once PyTorch is installed, KeOps can be installed via pip:

!pip install pykeops

Before we start let’s fix the random seeds for reproducibility:

import numpy as np
import torch

def set_seed(seed: int) -> None:
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    np.random.seed(seed)

set_seed(2022)

Vanilla PyTorch vs. KeOps comparison

Utility functions

First we define some utility functions to run the experiments:

from alibi_detect.cd import MMDDrift, LearnedKernelDrift
from alibi_detect.utils.keops.kernels import DeepKernel as DeepKernelKeops
from alibi_detect.utils.keops.kernels import GaussianRBF as GaussianRBFKeops
from alibi_detect.utils.pytorch.kernels import DeepKernel as DeepKernelTorch
from alibi_detect.utils.pytorch.kernels import GaussianRBF as GaussianRBFTorch
import matplotlib.pyplot as plt
from scipy.stats import kstest
from timeit import default_timer as timer
import torch.nn as nn
import torch.nn.functional as F


class Projection(nn.Module):
    def __init__(self, d_in: int, d_out: int = 2):
        super().__init__()
        self.lin1 = nn.Linear(d_in, d_out)
        self.lin2 = nn.Linear(d_out, d_out)
    
    def forward(self, x):
        return self.lin2(F.relu(self.lin1(x)))
    

def eval_detector(p_vals: np.ndarray, threshold: float, is_drift: bool, t_mean: float, t_std: float) -> dict:
    """ In case of drifted data (ground truth) it returns the detector's power.
    In case of no drift, it computes the false positive rate (FPR) and whether the p-values
    are uniformly distributed U[0,1] which is checked via a KS test. """
    results = {'power': None, 'fpr': None, 'ks': None}
    below_p_val_threshold = (p_vals <= threshold).mean()
    if is_drift:
        results['power'] = below_p_val_threshold
    else:
        results['fpr'] = below_p_val_threshold
        stat_ks, p_val_ks = kstest(p_vals, 'uniform')
        results['ks'] = {'p_val': p_val_ks, 'stat': stat_ks}
    results['p_vals'] = p_vals
    results['time'] = {'mean': t_mean, 'stdev': t_std}
    return results


def experiment(detector: str, backend: str, n_runs: int, n_ref: int, n_test: int, n_features: int, 
               mu: float = 0.) -> dict:
    """ Runs the experiment n_runs times, each time with newly sampled reference and test data.
    Returns the p-values for each test as well as the mean and standard deviations of the runtimes. """
    p_vals, t_detect = [], []
    for _ in range(n_runs):
        # Sample reference and test data
        x_ref = np.random.randn(*(n_ref, n_features)).astype(np.float32)
        x_test = np.random.randn(*(n_test, n_features)).astype(np.float32) + mu
        
        # Initialise detector, make and log predictions
        p_val = .05
        dd_kwargs = dict(p_val=p_val, backend=backend, n_permutations=100)
        if detector == 'mmd':
            dd = MMDDrift(x_ref, **dd_kwargs)
        elif detector == 'learned_kernel':
            d_out, sigma = 2, .1
            proj = Projection(n_features, d_out)
            Kernel = GaussianRBFKeops if backend == 'keops' else GaussianRBFTorch
            kernel_a = Kernel(trainable=True, sigma = torch.Tensor([sigma]))
            kernel_b = Kernel(trainable=True, sigma = torch.Tensor([sigma]))
            device = torch.device('cuda')
            DeepKernel = DeepKernelKeops if backend == 'keops' else DeepKernelTorch
            deep_kernel = DeepKernel(proj, kernel_a, kernel_b, eps=.01).to(device)
            if backend == 'pytorch' and n_ref + n_test > 20000:
                batch_size = 10000
                batch_size_predict = 10000 
            else:
                batch_size = 1000000
                batch_size_predict = 1000000
            dd_kwargs.update(
                dict(
                    epochs=2, train_size=.75, batch_size=batch_size, batch_size_predict=batch_size_predict
                )
            )
            dd = LearnedKernelDrift(x_ref, deep_kernel, **dd_kwargs)
        start = timer()
        pred = dd.predict(x_test)
        end = timer()
        
        if _ > 0:  # first run reserved for KeOps compilation
            t_detect.append(end - start)
            p_vals.append(pred['data']['p_val'])
            
        del dd, x_ref, x_test
        torch.cuda.empty_cache()
    
    p_vals = np.array(p_vals)
    t_mean, t_std = np.array(t_detect).mean(), np.array(t_detect).std()
    results = eval_detector(p_vals, p_val, mu != 0., t_mean, t_std)
    return results


def format_results(experiments: dict, n_features: list, backends: list, max_batch_size: int = 1e10) -> dict:
    T = {'batch_size': None, 'keops': None, 'pytorch': None}
    T['batch_size'] = np.unique([experiments['keops'][_]['n_ref'] for _ in experiments['keops'].keys()])
    T['batch_size'] = list(T['batch_size'][T['batch_size'] <= max_batch_size])
    T['keops'] = {f: [] for f in n_features}
    T['pytorch'] = {f: [] for f in n_features}

    for backend in backends:
        for f in T[backend].keys():
            for bs in T['batch_size']:
                for k, v in experiments[backend].items():
                    if f == v['n_features'] and bs == v['n_ref']:
                        T[backend][f].append(results[backend][k]['time']['mean'])

    for k, v in T['keops'].items():  # apply padding
        n_pad = len(v) - len(T['pytorch'][k])
        T['pytorch'][k] += [np.nan for _ in range(n_pad)]
    return T


def plot_absolute_time(experiments: dict, results: dict, n_features: list, y_scale: str = 'linear', 
                       detector: str = 'MMD', max_batch_size: int = 1e10):
    T = format_results(experiments, n_features, ['keops', 'pytorch'], max_batch_size)
    colors = ['b', 'g', 'r', 'c', 'm', 'y', 'b']
    legend, n_c = [], 0
    for f in n_features:
        plt.plot(T['batch_size'], T['keops'][f], linestyle='solid', color=colors[n_c]);
        legend.append(f'keops - {f}')
        plt.plot(T['batch_size'], T['pytorch'][f], linestyle='dashed', color=colors[n_c]);
        legend.append(f'pytorch - {f}')
        n_c += 1
    plt.title(f'{detector} drift detection time for 100 permutations')
    plt.legend(legend, loc=(1.1,.1));
    plt.xlabel('Batch size');
    plt.ylabel('Time (s)');
    plt.yscale(y_scale);
    plt.show();


def plot_relative_time(experiments: dict, results: dict, n_features: list, y_scale: str = 'linear',
                       detector: str = 'MMD', max_batch_size: int = 1e10):
    T = format_results(experiments, n_features, ['keops', 'pytorch'], max_batch_size)
    colors = ['b', 'g', 'r', 'c', 'm', 'y', 'b']
    legend, n_c = [], 0
    for f in n_features:
        t_keops, t_torch = T['keops'][f], T['pytorch'][f]
        ratio = [tt / tk for tt, tk in zip(t_torch, t_keops)]
        plt.plot(T['batch_size'], ratio, linestyle='solid', color=colors[n_c]);
        legend.append(f'pytorch/keops - {f}')
        n_c += 1
    plt.title(f'{detector} drift detection pytorch/keops time ratio for 100 permutations')
    plt.legend(legend, loc=(1.1,.1));
    plt.xlabel('Batch size');
    plt.ylabel('time pytorch / keops');
    plt.yscale(y_scale);
    plt.show();

As detailed earlier, we will compare the PyTorch with the KeOps implementation of the MMD and learned kernel MMD detectors for a variety of reference and test data batch sizes as well as different feature dimensions. Note that for the PyTorch implementation, the portion of the kernel matrix for the reference data itself can already be computed at initialisation of the detector. This computation will not be included when we record the detector's prediction time. Since use cases where $N_\text{ref} >> N_\text{test}$ are quite common, we will also test for this specific setting. The key reason is that we cannot amortise this computation for the KeOps detector since we are working with lazily evaluated symbolic matrices.

MMD detector

1. $N_\text{ref} = N_\text{test}$

Note that for KeOps we could further increase the number of instances in the reference and test sets (e.g. to 500,000) without running into memory issues.

experiments_eq = {
    'keops': {
        0: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 5, 'n_features': 2},
        1: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 5, 'n_features': 2},
        2: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 5, 'n_features': 2},
        3: {'n_ref': 20000, 'n_test': 20000, 'n_runs': 5, 'n_features': 2},
        4: {'n_ref': 50000, 'n_test': 50000, 'n_runs': 5, 'n_features': 2},
        5: {'n_ref': 100000, 'n_test': 100000, 'n_runs': 5, 'n_features': 2},
        6: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 5, 'n_features': 10},
        7: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 5, 'n_features': 10},
        8: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 5, 'n_features': 10},
        9: {'n_ref': 20000, 'n_test': 20000, 'n_runs': 5, 'n_features': 10},
        10: {'n_ref': 50000, 'n_test': 50000, 'n_runs': 5, 'n_features': 10},
        11: {'n_ref': 100000, 'n_test': 100000, 'n_runs': 5, 'n_features': 10},
        12: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 5, 'n_features': 50},
        13: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 5, 'n_features': 50},
        14: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 5, 'n_features': 50},
        15: {'n_ref': 20000, 'n_test': 20000, 'n_runs': 5, 'n_features': 50},
        16: {'n_ref': 50000, 'n_test': 50000, 'n_runs': 5, 'n_features': 50},
        17: {'n_ref': 100000, 'n_test': 100000, 'n_runs': 5, 'n_features': 50}
    },
    'pytorch': {  # runs OOM after 10k instances in ref and test sets
        0: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 5, 'n_features': 2},
        1: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 5, 'n_features': 2},
        2: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 5, 'n_features': 2},
        3: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 5, 'n_features': 10},
        4: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 5, 'n_features': 10},
        5: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 5, 'n_features': 10},
        6: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 5, 'n_features': 50},
        7: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 5, 'n_features': 50},
        8: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 5, 'n_features': 50}
    }
}

#| scrolled: true
backends = ['keops', 'pytorch']
results = {backend: {} for backend in backends}

for backend in backends:
    exps = experiments_eq[backend]
    for i, exp in exps.items():
        results[backend][i] = experiment(
            'mmd', backend, exp['n_runs'], exp['n_ref'], exp['n_test'], exp['n_features']
        )

Below we visualise the runtimes of the different experiments. We can make the following observations:

The relative speed improvements of KeOps over vanilla PyTorch increase with increasing batch size.
Due to the explicit kernel computation and storage, the PyTorch detector runs out-of-memory after a little over 10,000 instances in each of the reference and test sets while KeOps keeps scaling up without any issues.
The relative speed improvements decline with growing feature dimension. Note however that we would not recommend using a (untrained) MMD detector on very high-dimensional data in the first place.

The plots show both the absolute and relative (PyTorch / KeOps) mean prediction times for the MMD drift detector for different feature dimensions $[2, 10, 50]$.

n_features = [2, 10, 50]
max_batch_size = 100000

plot_absolute_time(experiments_eq, results, n_features, max_batch_size=max_batch_size)

plot_relative_time(experiments_eq, results, n_features, max_batch_size=max_batch_size)

The difference between KeOps and PyTorch is even more striking when we only look at $[2, 10]$ features:

plot_absolute_time(experiments_eq, results, [2, 10], max_batch_size=max_batch_size)

2. $N_\text{ref} >> N_\text{test}$

Now we check whether the speed improvements still hold when $N_\text{ref} >> N_\text{test}$ ($N_\text{ref} / N_\text{test} = 10$) and a large part of the kernel can already be computed at initialisation time of the PyTorch (but not the KeOps) detector.

experiments_neq = {
    'keops': {
        0: {'n_ref': 2000, 'n_test': 200, 'n_runs': 10, 'n_features': 2},
        1: {'n_ref': 5000, 'n_test': 500, 'n_runs': 10, 'n_features': 2},
        2: {'n_ref': 10000, 'n_test': 1000, 'n_runs': 10, 'n_features': 2},
        3: {'n_ref': 20000, 'n_test': 2000, 'n_runs': 10, 'n_features': 2},
        4: {'n_ref': 50000, 'n_test': 5000, 'n_runs': 10, 'n_features': 2},
        5: {'n_ref': 100000, 'n_test': 10000, 'n_runs': 10, 'n_features': 2}
    },
    'pytorch': {
        0: {'n_ref': 2000, 'n_test': 200, 'n_runs': 10, 'n_features': 2},
        1: {'n_ref': 5000, 'n_test': 500, 'n_runs': 10, 'n_features': 2},
        2: {'n_ref': 10000, 'n_test': 1000, 'n_runs': 10, 'n_features': 2}
    }
}

results = {backend: {} for backend in backends}

for backend in backends:
    exps = experiments_neq[backend]
    for i, exp in exps.items():
        results[backend][i] = experiment(
            'mmd', backend, exp['n_runs'], exp['n_ref'], exp['n_test'], exp['n_features']
        )

The below plots illustrate that KeOps indeed still provides large speed ups over PyTorch. The x-axis shows the reference batch size $N_\text{ref}$. Note that $N_\text{ref} / N_\text{test} = 10$.

plot_absolute_time(experiments_neq, results, [2], max_batch_size=max_batch_size)

plot_relative_time(experiments_neq, results, [2], max_batch_size=max_batch_size)

Learned kernel MMD detector

We conduct similar experiments as for the MMD detector for $N_\text{ref} = N_\text{test}$ and n_features=50. We use a deep learned kernel with an MLP followed by Gaussian RBF kernels and project the input features on a d_out=2-dimensional space. Since the learned kernel detector computes the kernel matrix in a batch-wise manner, we can also scale up the number of instances for the PyTorch backend without running out-of-memory.

experiments_eq = {
    'keops': {
        0: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 3, 'n_features': 50},
        1: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 3, 'n_features': 50},
        2: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 3, 'n_features': 50},
        3: {'n_ref': 20000, 'n_test': 20000, 'n_runs': 3, 'n_features': 50},
        4: {'n_ref': 50000, 'n_test': 50000, 'n_runs': 3, 'n_features': 50},
        5: {'n_ref': 100000, 'n_test': 100000, 'n_runs': 3, 'n_features': 50}
    },
    'pytorch': {
        0: {'n_ref': 2000, 'n_test': 2000, 'n_runs': 3, 'n_features': 50},
        1: {'n_ref': 5000, 'n_test': 5000, 'n_runs': 3, 'n_features': 50},
        2: {'n_ref': 10000, 'n_test': 10000, 'n_runs': 3, 'n_features': 50},
        3: {'n_ref': 20000, 'n_test': 20000, 'n_runs': 3, 'n_features': 50},
        4: {'n_ref': 50000, 'n_test': 50000, 'n_runs': 3, 'n_features': 50},
        5: {'n_ref': 100000, 'n_test': 100000, 'n_runs': 3, 'n_features': 50}
    }
}

#| scrolled: true
results = {backend: {} for backend in backends}

for backend in backends:
    exps = experiments_eq[backend]
    for i, exp in exps.items():
        results[backend][i] = experiment(
            'learned_kernel', backend, exp['n_runs'], exp['n_ref'], exp['n_test'], exp['n_features']
        )

We again plot the absolute and relative (PyTorch / KeOps) mean prediction times for the learned kernel MMD drift detector for different feature dimensions:

max_batch_size = 100000

plot_absolute_time(experiments_eq, results, [50], max_batch_size=max_batch_size)

plot_relative_time(experiments_eq, results, [50], max_batch_size=max_batch_size)

Conclusion

As illustrated in the experiments, KeOps allows you to drastically speed up and scale up drift detection to larger datasets without running into memory issues. The speed benefit of KeOps over the PyTorch (or TensorFlow) MMD detectors decrease as the number of features increases. Note though that it is not advised to apply the (untrained) MMD detector to very high-dimensional data in the first place and that we can apply dimensionality reduction via the deep kernel for the learned kernel MMD detector.

Model uncertainty based drift detection on CIFAR-10 and Wine-Quality datasets

Method

Model-uncertainty drift detectors aim to directly detect drift that's likely to effect the performance of a model of interest. The approach is to test for change in the number of instances falling into regions of the input space on which the model is uncertain in its predictions. For each instance in the reference set the detector obtains the model's prediction and some associated notion of uncertainty. For example for a classifier this may be the entropy of the predicted label probabilities or for a regressor with dropout layers dropout Monte Carlo can be used to provide a notion of uncertainty. The same is done for the test set and if significant differences in uncertainty are detected (via a Kolmogorov-Smirnoff test) then drift is flagged.

It is important that the detector uses a reference set that is disjoint from the model's training set (on which the model's confidence may be higher).

Backend

For models that require batch evaluation both PyTorch and TensorFlow frameworks are supported. Alibi Detect does however not install PyTorch for you. Check the PyTorch docs how to do this.

Classifier uncertainty based drift detection

We start by demonstrating how to leverage model uncertainty to detect malicious drift when the model of interest is a classifer.

Dataset

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import os
import tensorflow as tf
import torch
from torch import nn

from alibi_detect.cd import ClassifierUncertaintyDrift, RegressorUncertaintyDrift
from alibi_detect.models.tensorflow import scale_by_instance
from alibi_detect.utils.fetching import fetch_tf_model, fetch_detector
from alibi_detect.saving import save_detector, load_detector
from alibi_detect.datasets import fetch_cifar10c, corruption_types_cifar10c
from alibi_detect.models.pytorch import trainer
from alibi_detect.cd.utils import encompass_batching

Original CIFAR-10 data:

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.cifar10.load_data()
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
y_train = y_train.astype('int64').reshape(-1,)
y_test = y_test.astype('int64').reshape(-1,)

For CIFAR-10-C, we can select from the following corruption types at 5 severity levels:

corruptions = corruption_types_cifar10c()
print(corruptions)

Let's pick a subset of the corruptions at corruption level 5. Each corruption type consists of perturbations on all of the original test set images.

corruption = ['gaussian_noise', 'motion_blur', 'brightness', 'pixelate']
X_corr, y_corr = fetch_cifar10c(corruption=corruption, severity=5, return_X_y=True)
X_corr = X_corr.astype('float32') / 255

We split the original test set in a reference dataset and a dataset which should not be rejected under the no-change null H0. We also split the corrupted data by corruption type:

np.random.seed(0)
n_test = X_test.shape[0]
idx = np.random.choice(n_test, size=n_test // 2, replace=False)
idx_h0 = np.delete(np.arange(n_test), idx, axis=0)
X_ref,y_ref = X_test[idx], y_test[idx]
X_h0, y_h0 = X_test[idx_h0], y_test[idx_h0]
print(X_ref.shape, X_h0.shape)

# check that the classes are more or less balanced
classes, counts_ref = np.unique(y_ref, return_counts=True)
counts_h0 = np.unique(y_h0, return_counts=True)[1]
print('Class Ref H0')
for cl, cref, ch0 in zip(classes, counts_ref, counts_h0):
    assert cref + ch0 == n_test // 10
    print('{}     {} {}'.format(cl, cref, ch0))

n_corr = len(corruption)
X_c = [X_corr[i * n_test:(i + 1) * n_test] for i in range(n_corr)]

We can visualise the same instance for each corruption type:

#| tags: [hide_input]
i = 1

n_test = X_test.shape[0]
plt.title('Original')
plt.axis('off')
plt.imshow(X_test[i])
plt.show()
for _ in range(len(corruption)):
    plt.title(corruption[_])
    plt.axis('off')
    plt.imshow(X_corr[n_test * _+ i])
    plt.show()

We can also verify that the performance of a classification model on CIFAR-10 drops significantly on this perturbed dataset:

dataset = 'cifar10'
model = 'resnet32'
clf = fetch_tf_model(dataset, model)
acc = clf.evaluate(scale_by_instance(X_test), y_test, batch_size=128, verbose=0)[1]
print('Test set accuracy:')
print('Original {:.4f}'.format(acc))
clf_accuracy = {'original': acc}
for _ in range(len(corruption)):
    acc = clf.evaluate(scale_by_instance(X_c[_]), y_test, batch_size=128, verbose=0)[1]
    clf_accuracy[corruption[_]] = acc
    print('{} {:.4f}'.format(corruption[_], acc))

Given the drop in performance, it is important that we detect the harmful data drift!

Detect drift

Unlike many other approaches we needn't specify a dimension-reducing preprocessing step as the detector operates directly on the data as it is input to the model of interest. In fact, the two-stage projection input -> prediction -> uncertainty can be thought of as the projection from the input space onto the real line, ready to perform the test.

We simply pass the model to the detector and inform it that the predictions should be interpreted as 'probs' rather than 'logits' (i.e. a softmax has already been applied). By default uncertainty_type='entropy' is used as the notion of uncertainty for classifier predictions, however uncertainty_type='margin' can be specified to deem the classifier's prediction uncertain if they fall within a margin (e.g. in [0.45,0.55] for binary classifier probabilities) (similar to Sethi and Kantardzic (2017)).

#| scrolled: false
cd = ClassifierUncertaintyDrift(
  X_ref, model=clf, backend='tensorflow', p_val=0.05, preds_type='probs'
)

Let's check whether the detector thinks drift occurred on the different test sets and time the prediction calls:

from timeit import default_timer as timer

labels = ['No!', 'Yes!']

def make_predictions(cd, x_h0, x_corr, corruption):
    t = timer()
    preds = cd.predict(x_h0)
    dt = timer() - t
    print('No corruption')
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print('Feature-wise p-values:')
    print(preds['data']['p_val'])
    print(f'Time (s) {dt:.3f}')
    
    if isinstance(x_corr, list):
        for x, c in zip(x_corr, corruption):
            t = timer()
            preds = cd.predict(x)
            dt = timer() - t
            print('')
            print(f'Corruption type: {c}')
            print('Drift? {}'.format(labels[preds['data']['is_drift']]))
            print('Feature-wise p-values:')
            print(preds['data']['p_val'])
            print(f'Time (s) {dt:.3f}')

make_predictions(cd, X_h0, X_c, corruption)

Note here how drift is only detected for the corrupted datasets on which the model's performance is significantly degraded. For the 'brightness' corruption, for which the model maintains 89% classification accuracy, the change in model uncertainty is not deemed significant (p-value 0.11, above the 0.05 threshold). For the other corruptions which signficiantly hamper model performance, the malicious drift is detected.

Regressor uncertainty based drift detection

We now demonstrate how to leverage model uncertainty to detect malicious drift when the model of interest is a regressor. This is a less general approach as regressors often make point-predictions with no associated notion of uncertainty. However, if the model makes its predictions by ensembling the predicitons of sub-models then we can consider the variation in the sub-model predictions as a notion of uncertainty. RegressorUncertaintyDetector facilitates models that output a vector of such sub-model predictions (uncertainty_type='ensemble') or deep learning models that include dropout layers and can therefore (as noted by Gal and Ghahramani 2016) be considered as an ensemble (uncertainty_type='mc_dropout', the default option).

Dataset

The Wine Quality Data Set consists of 1599 and 4898 samples of red and white wine respectively. Each sample has an associated quality (as determined by experts) and 11 numeric features indicating its acidity, density, pH etc. We consider the regression problem of tring to predict the quality of red wine sample given these features. We will then consider whether the model remains suitable for predicting the quality of white wine samples or whether the associated change in the underlying distribution should be considered as malicious drift.

First we load in the data.

red = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';'
)
white = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=';'
)
red.describe()

We can see that the data for both red and white wine samples take the same format.

white.describe()

We shuffle and normalise the data such that each feature takes a value in [0,1], as does the quality we seek to predict.

red, white = np.asarray(red, np.float32), np.asarray(white, np.float32)
n_red, n_white = red.shape[0], white.shape[0]

col_maxes = red.max(axis=0)
red, white = red / col_maxes, white / col_maxes
red, white = red[np.random.permutation(n_red)], white[np.random.permutation(n_white)]
X, y = red[:, :-1], red[:, -1:]
X_corr, y_corr = white[:, :-1], white[:, -1:]

We split the red wine data into a set on which to train the model, a reference set with which to instantiate the detector and a set which the detector should not flag drift. We then instantiate a DataLoader to pass the training data to a PyTorch model in batches.

X_train, y_train = X[:(n_red//2)], y[:(n_red//2)]
X_ref, y_ref = X[(n_red//2):(3*n_red//4)], y[(n_red//2):(3*n_red//4)]
X_h0, y_h0 = X[(3*n_red//4):], y[(3*n_red//4):]

X_train_ds = torch.utils.data.TensorDataset(torch.tensor(X_train), torch.tensor(y_train))
X_train_dl = torch.utils.data.DataLoader(X_train_ds, batch_size=32, shuffle=True, drop_last=True)

Regression model

We now define the regression model that we'll train to predict the quality from the features. The exact details aren't important other than the presence of at least one dropout layer. We then train the model for 20 epochs to optimise the mean square error on the training data.

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

reg = nn.Sequential(
    nn.Linear(11, 16),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Dropout(0.5),
    nn.Linear(32, 1)
).to(device)

trainer(reg, nn.MSELoss(), X_train_dl, device, torch.optim.Adam, learning_rate=0.001, epochs=30)

We now evaluate the trained model on both unseen samples of red wine and white wine. We see that, unsurprisingly, the model is better able to predict the quality of unseen red wine samples.

reg = reg.eval()
reg_fn = encompass_batching(reg, backend='pytorch', batch_size=32)
preds_ref = reg_fn(X_ref)
preds_corr = reg_fn(X_corr)

ref_mse = np.square(preds_ref - y_ref).mean()
corr_mse = np.square(preds_corr - y_corr).mean()

print(f'MSE when predicting the quality of unseen red wine samples: {ref_mse}')
print(f'MSE when predicting the quality of unseen white wine samples: {corr_mse}')

Detect drift

We now look at whether a regressor-uncertainty detector would have picked up on this malicious drift. We instantiate the detector and obtain drift predictions on both the held-out red-wine samples and the white-wine samples. We specify uncertainty_type='mc_dropout' in this case, but alternatively we could have trained an ensemble model that for each instance outputs a vector of multiple independent predictions and specified uncertainty_type='ensemble'.

cd = RegressorUncertaintyDrift(
    X_ref, model=reg, backend='pytorch', p_val=0.05, uncertainty_type='mc_dropout', n_evals=100
)
preds_h0 = cd.predict(X_h0)
preds_h1 = cd.predict(X_corr)

print(f"Drift detected on unseen red wine samples? {'yes' if preds_h0['data']['is_drift']==1 else 'no'}")
print(f"Drift detected on white wine samples? {'yes' if preds_h1['data']['is_drift']==1 else 'no'}")

print(f"p-value on unseen red wine samples? {preds_h0['data']['p_val']}")
print(f"p-value on white wine samples? {preds_h1['data']['p_val']}")

Drift detection on molecular graphs

Methods

We illustrate drift detection on molecular graphs using a variety of detectors:

Kolmogorov-Smirnov detector on the output of the binary classification Graph Isomorphism Network to detect prediction distribution shift.
Model Uncertainty detector which leverages a measure of uncertainty on the model predictions (in this case MC dropout) to detect drift which could lead to degradation of model performance.
Maximum Mean Discrepancy detector on graph embeddings to flag drift in the input data.
Learned Kernel detector which flags drift in the input data using a (deep) learned kernel. The method trains a (deep) kernel on part of the data to maximise an estimate of the test power. Once the kernel is learned a permutation test is performed in the usual way on the value of the Maximum Mean Discrepancy (MMD) on the held out test set.
Kolmogorov-Smirnov detector to see if drift occurred on graph level statistics such as the number of nodes, edges and the average clustering coefficient.

Dataset

We will train a classification model and detect drift on the ogbg-molhiv dataset. The dataset contains molecular graphs with both atom features (atomic number-1, chirality, node degree, formal charge, number of H bonds, number of radical electrons, hybridization, aromatic?, in a ring?) and bond level properties (bond type (e.g. single or double), bond stereo code, conjugated?). The goal is to predict whether a molecule inhibits HIV virus replication or not, so the task is binary classification.

The dataset is split using the scaffold splitting procedure. This means that the molecules are split based on their 2D structural framework. Structurally different molecules are grouped into different subsets (train, validation, test) which could mean that there is drift between the splits.

The dataset is retrieved from the Open Graph Benchmark dataset collection.

Dependencies

Besides alibi-detect, this example notebook also uses PyTorch Geometric and OGB, both of which can be installed via pip/conda.

import numpy as np
import os
import torch

def set_seed(seed: int) -> None:
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    np.random.seed(seed)

set_seed(0)

Load and analyze data

from ogb.graphproppred import PygGraphPropPredDataset
from torch_geometric.data import DataLoader

#| scrolled: true
dataset_name = 'ogbg-molhiv'
batch_size = 32
dataset = PygGraphPropPredDataset(name=dataset_name)
split_idx = dataset.get_idx_split()

We set some samples apart to serve as the reference data for our drift detectors. Note that the allowed format of the reference data is very flexible and can be np.ndarray or List[Any]:

n_ref = 1000
n_h0 = 500

idx_tr = split_idx['train']
idx_sample = np.random.choice(idx_tr.numpy(), size=n_ref + n_h0, replace=False)
idx_ref, idx_h0 = idx_sample[:n_ref], idx_sample[n_ref:]
x_ref = [dataset[i] for i in idx_ref]
x_h0 = [dataset[i] for i in idx_h0]
idx_tr = torch.from_numpy(np.setdiff1d(idx_tr, idx_sample))
print(f'Number of reference instances: {len(x_ref)}')
print(f'Number of H0 instances: {len(x_h0)}')

dl_tr = DataLoader(dataset[idx_tr], batch_size=batch_size, shuffle=True)
dl_val = DataLoader(dataset[split_idx['valid']], batch_size=batch_size, shuffle=False)
dl_te = DataLoader(dataset[split_idx['test']], batch_size=batch_size, shuffle=False)
print(f'Number of train, val and test batches: {len(dl_tr)}, {len(dl_val)} and {len(dl_te)}')

ds = dataset
print()
print(f'Dataset: {ds}:')
print('=============================================================')
print(f'Number of graphs: {len(ds)}')
print(f'Number of node features: {ds.num_node_features}')
print(f'Number of edge features: {ds.num_edge_features}')
print(f'Number of classes: {ds.num_classes}')

i = 0
d = ds[i]

print(f'\nExample: {d}')
print('=============================================================')

print(f'Number of nodes: {d.num_nodes}')
print(f'Number of edges: {d.num_edges}')
print(f'Average node degree: {d.num_edges / d.num_nodes:.2f}')
print(f'Contains isolated nodes: {d.contains_isolated_nodes()}')
print(f'Contains self-loops: {d.contains_self_loops()}')
print(f'Is undirected: {d.is_undirected()}')

Let's plot some graph summary statistics such as the distribution of the node degrees, number of nodes and edges as well as the clustering coefficients:

#| scrolled: true
import matplotlib.pyplot as plt
import networkx as nx
from networkx.algorithms.cluster import clustering
from torch_geometric.utils import degree, to_networkx
from tqdm import tqdm
from typing import Tuple


def degrees_and_clustering(loader: DataLoader) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    degrees, c_coeff, num_nodes, num_edges = [], [], [], []
    for data in tqdm(loader):
        row, col = data.edge_index
        deg = degree(row, data.x.size(0), dtype=data.x.dtype)
        degrees.append(deg.numpy())
        g = to_networkx(data, node_attrs=['x'], edge_attrs=['edge_attr'], to_undirected=True)
        c = list(clustering(g).values())
        c_coeff.append(c)
        num_nodes += [d.num_nodes for d in data.to_data_list()]
        num_edges += [d.num_edges for d in data.to_data_list()]
    degrees = np.concatenate(degrees, axis=0)
    c_coeff = np.concatenate(c_coeff, axis=0)
    return degrees, c_coeff, np.array(num_nodes), np.array(num_edges)


# x: nodes, edges, degree, cluster
def plot_histogram(x: str, bins: int = None, log: bool = True) -> None:
    if x == 'nodes':
        vals = [num_nodes_tr, num_nodes_val, num_nodes_te]
    elif x == 'edges':
        vals = [num_edges_tr, num_edges_val, num_edges_te]
    elif x == 'degree':
        vals = [degree_tr, degree_val, degree_te]
    elif x == 'cluster':
        vals = [cluster_tr, cluster_val, cluster_te]
    labels = ['train', 'val', 'test']
    for v, l in zip(vals, labels):
        plt.hist(v, density=True, log=log, label=l, bins=bins)
    plt.title(f'{x} distribution')
    plt.legend()
    plt.show()

degree_tr, cluster_tr, num_nodes_tr, num_edges_tr = degrees_and_clustering(dl_tr)
degree_val, cluster_val, num_nodes_val, num_edges_val = degrees_and_clustering(dl_val)
degree_te, cluster_te, num_nodes_te, num_edges_te = degrees_and_clustering(dl_te)

print('Average number and stdev of nodes, edges, degree and clustering coefficients:')
print('\nTrain...')
print(f'Nodes: {num_nodes_tr.mean():.1f} +- {num_nodes_tr.std():.1f}')
print(f'Edges: {num_edges_tr.mean():.1f} +- {num_edges_tr.std():.1f}')
print(f'Degree: {degree_tr.mean():.1f} +- {degree_tr.std():.1f}')
print(f'Clustering: {cluster_tr.mean():.3f} +- {cluster_tr.std():.3f}')

print('\nValidation...')
print(f'Nodes: {num_nodes_val.mean():.1f} +- {num_nodes_val.std():.1f}')
print(f'Edges: {num_edges_val.mean():.1f} +- {num_edges_val.std():.1f}')
print(f'Degree: {degree_val.mean():.1f} +- {degree_val.std():.1f}')
print(f'Clustering: {cluster_val.mean():.3f} +- {cluster_val.std():.3f}')

print('\nTest...')
print(f'Nodes: {num_nodes_te.mean():.1f} +- {num_nodes_te.std():.1f}')
print(f'Edges: {num_edges_te.mean():.1f} +- {num_edges_te.std():.1f}')
print(f'Degree: {degree_te.mean():.1f} +- {degree_te.std():.1f}')
print(f'Clustering: {cluster_te.mean():.3f} +- {cluster_te.std():.3f}')

plot_histogram('nodes', bins=50)
plot_histogram('edges', bins=50)
plot_histogram('degree')
plot_histogram('cluster')

While the average number of nodes and edges are similar across the splits, the histograms show that the tails are slightly heavier for the training graphs.

Plot molecules

We borrow code from the PyTorch Geometric GNN explanation example to visualize molecules from the graph objects.

def draw_molecule(g, edge_mask=None, draw_edge_labels=False):
    g = g.copy().to_undirected()
    node_labels = {}
    for u, data in g.nodes(data=True):
        node_labels[u] = data['name']
    pos = nx.planar_layout(g)
    pos = nx.spring_layout(g, pos=pos)
    if edge_mask is None:
        edge_color = 'black'
        widths = None
    else:
        edge_color = [edge_mask[(u, v)] for u, v in g.edges()]
        widths = [x * 10 for x in edge_color]
    nx.draw(g, pos=pos, labels=node_labels, width=widths,
            edge_color=edge_color, edge_cmap=plt.cm.Blues,
            node_color='azure')
    
    if draw_edge_labels and edge_mask is not None:
        edge_labels = {k: ('%.2f' % v) for k, v in edge_mask.items()}    
        nx.draw_networkx_edge_labels(g, pos, edge_labels=edge_labels,
                                    font_color='red')
    plt.show()


def to_molecule(data):
    ATOM_MAP = ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P',
                'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn',
                'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y']
    g = to_networkx(data, node_attrs=['x'])
    for u, data in g.nodes(data=True):
        data['name'] = ATOM_MAP[data['x'][0]]
        del data['x']
    return g

i = 0
mol = to_molecule(dataset[i])
plt.figure(figsize=(10, 5))
draw_molecule(mol)

Train and evaluate a GNN classification model

As our classifier we use a variation of a Graph Isomorphism Network incorporating edge (bond) as well as node (atom) features.

import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data.batch import Batch
from torch_geometric.nn import MessagePassing, global_add_pool, global_max_pool, global_mean_pool, LayerNorm
from ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder


class GINConv(MessagePassing):
    def __init__(self, emb_dim: int) -> None:
        super().__init__(aggr='add')
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, 2 * emb_dim),
            nn.BatchNorm1d(2 * emb_dim),
            nn.ReLU(),
            nn.Linear(2 * emb_dim, emb_dim)
        )
        self.eps = nn.Parameter(torch.Tensor([0.]))
        self.bond_encoder = BondEncoder(emb_dim=emb_dim)  # encode edge features

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor, edge_attr: torch.Tensor) -> torch.Tensor:
        edge_emb = self.bond_encoder(edge_attr)
        return self.mlp((1 + self.eps) * x + self.propagate(edge_index, x=x, edge_attr=edge_emb))

    def message(self, x_j: torch.Tensor, edge_attr: torch.Tensor) -> torch.Tensor:
        return x_j + edge_attr

    def update(self, aggr_out: torch.Tensor) -> torch.Tensor:
        return aggr_out


class GIN(nn.Module):
    def __init__(self, n_layer: int = 5, emb_dim: int = 64, n_out: int = 2, dropout: float = .5,
                 jk: bool = True, residual: bool = True, pool: str = 'add', norm: str = 'batch') -> None:
        super().__init__()
        self.n_layer = n_layer
        self.jk = jk  # jumping-knowledge
        self.residual = residual  # residual/skip connections
        self.atom_encoder = AtomEncoder(emb_dim=emb_dim)  # encode node features
        self.convs = nn.ModuleList([GINConv(emb_dim) for _ in range(n_layer)])
        norm = nn.BatchNorm1d if norm == 'batch' else LayerNorm
        self.bns = nn.ModuleList([norm(emb_dim) for _ in range(n_layer)])
        if pool == 'mean':
            self.pool = global_mean_pool
        elif pool == 'add':
            self.pool = global_add_pool
        elif pool == 'max':
            self.pool = global_max_pool
        pool_dim = (n_layer + 1) * emb_dim if jk else emb_dim
        self.linear = nn.Linear(pool_dim, n_out)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, data: Batch) -> torch.Tensor:
        x, edge_index, edge_attr, batch = data.x, data.edge_index, data.edge_attr, data.batch
        # node embeddings
        hs = [self.atom_encoder(x)]
        for layer in range(self.n_layer):
            h = self.convs[layer](hs[layer], edge_index, edge_attr)
            h = self.bns[layer](h)
            if layer < self.n_layer - 1:
                h = F.relu(h)
            if self.residual:
                h += hs[layer]
            hs += [h]
        # graph embedding and prediction
        if self.jk:
            h = torch.cat([h for h in hs], -1)
        h_pool = self.pool(h, batch)
        h_drop = self.dropout(h_pool)
        h_out = self.linear(h_drop)
        return h_out

#| scrolled: true
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f'device: {device}')

n_layer = 5
emb_dim = 300
n_out = 1
dropout = .5
jk = True
residual = False
pool = 'mean'
norm = 'batch'

model = GIN(n_layer, emb_dim, n_out, dropout, jk, residual, pool, norm).to(device)

Train and evaluate the model. Evaluation is done using ROC-AUC. If you already have a trained model saved, you can directly load it by specifying the load_path:

load_path = 'gnn'  # set to None if no pretrained model available

#| scrolled: true
from ogb.graphproppred import Evaluator
from tqdm import tqdm

criterion = nn.BCEWithLogitsLoss()
optim = torch.optim.Adam(model.parameters(), lr=.001)
evaluator = Evaluator(name=dataset_name)  # ROC-AUC for ogbg-molhiv


def train(loader: DataLoader, verbose: bool = False) -> None:
    dl = tqdm(loader, total=len(loader)) if verbose else loader
    model.train()
    for data in dl:
        data = data.to(device)
        optim.zero_grad()
        y_hat = model(data)
        is_labeled = data.y == data.y
        loss = criterion(y_hat[is_labeled], data.y[is_labeled].float())
        loss.backward()
        optim.step()
        if verbose:
            dl.set_postfix(dict(loss=loss.item()))


def evaluate(loader: DataLoader, split: str, verbose: bool = False) -> float:
    dl = tqdm(loader, total=len(loader)) if verbose else loader
    model.eval()
    y_pred, y_true = [], []
    for data in dl:
        data = data.to(device)
        with torch.no_grad():
            y_hat = model(data)
        y_pred.append(y_hat.cpu())
        y_true.append(data.y.float().cpu())
    y_true = torch.cat(y_true, dim=0)
    y_pred = torch.cat(y_pred, dim=0)
    loss = criterion(y_pred, y_true)
    input_dict = dict(y_true=y_true, y_pred=y_pred)
    result_dict = evaluator.eval(input_dict)
    print(f'{split} ROC-AUC: {result_dict["rocauc"]:.3f} -- loss: {loss:.3f}')
    return result_dict["rocauc"]


if load_path is None or not os.path.isdir(load_path):
    epochs = 150
    rocauc_best = 0.
    save_path = 'gnn'
    for epoch in range(epochs):
        print(f'\nEpoch {epoch + 1} / {epochs}')
        train(dl_tr)
        _ = evaluate(dl_tr, 'train')
        rocauc = evaluate(dl_val, 'val')
        if rocauc > rocauc_best and os.path.isdir(save_path):
            print('Saving new best model.')
            rocauc_best = rocauc
            torch.save(model.state_dict(), os.path.join(save_path, 'model.dict'))
        _ = evaluate(dl_te, 'test')
    load_path = save_path


# load (best) model
model.load_state_dict(torch.load(os.path.join(load_path, 'model.dict')))
_ = evaluate(dl_tr, 'train')
_ = evaluate(dl_val, 'val')
_ = evaluate(dl_te, 'test')

Detect drift

Prediction distribution drift

We will first detect drift on the prediction distribution of the GIN model. Since the binary classification model returns continuous numerical univariate predictions, we use the Kolmogorov-Smirnov drift detector. First we define some utility functions:

from torch_geometric.data import Batch, Data
from typing import Dict, List, Union


labels = ['No!', 'Yes!']
def make_predictions(dd, xs: Dict[str, List[Data]]) -> None:
    for split, x in xs.items():
        preds = dd.predict(x)
        dl = DataLoader(x, batch_size=32, shuffle=False)
        _ = evaluate(dl, split)
        print('Drift? {}'.format(labels[preds['data']['is_drift']]))
        if isinstance(preds["data"]["p_val"], (list, np.ndarray)):
            print(f'p-value: {preds["data"]["p_val"]}')
        else:
            print(f'p-value: {preds["data"]["p_val"]:.3f}')
        print('')
        

def sample(split: str, n: int, ) -> List[Data]:
    idx = np.random.choice(split_idx[split].numpy(), size=n, replace=False)
    return [dataset[i] for i in idx]

Because we pass lists with torch_geometric.data.Data objects to the detector, we need to preprocess the data using the batch_fn into torch_geometric.data.Batch objects which can be fed to the model. Then we detect drift on the model prediction distribution.

from alibi_detect.cd import KSDrift
from alibi_detect.utils.pytorch import predict_batch
from functools import partial


def batch_fn(data: Union[List[Data], Batch]) -> Batch:
    if isinstance(data, Batch):
        return data
    else:
        return Batch().from_data_list(data)


preprocess_fn = partial(predict_batch, model=model, device=device, preprocess_fn=batch_fn, batch_size=32)
dd = KSDrift(x_ref, p_val=.05, preprocess_fn=preprocess_fn)

Since the dataset is heavily imbalanced, we will test the detectors on a sample which oversamples from the minority class (molecules which inhibit HIV virus replication):

split = 'test'
x_imb = sample(split, 500)
n = 0
for i in split_idx[split]:
    if dataset[i].y[0].item() == 1:
        x_imb.append(dataset[i])
        n += 1
print(f'# instances: {len(x_imb)} -- # class 1: {n}')

xs = {'H0': x_h0, 'test sample': sample('test', 500), 'imbalanced sample': x_imb}
make_predictions(dd, xs)

As expected, prediction distribution shift is detected for the imbalanced sample but not for the random test sample with similar label distribution as the reference data.

Prediction uncertainty drift

The model uncertainty drift detector can pick up when the model predictions drift into areas of changed uncertainty compared to the reference data. This can be a good proxy for drift which results in model performance degradation. The uncertainty is estimated via a Monte Carlo estimate (MC dropout). We use the RegressorUncertaintyDrift detector since our binary classification model returns 1D logits.

#| scrolled: false
from alibi_detect.cd import RegressorUncertaintyDrift

dd = RegressorUncertaintyDrift(x_ref, model=model, backend='pytorch', p_val=.05, n_evals=100,
                               uncertainty_type='mc_dropout', preprocess_batch_fn=batch_fn)

make_predictions(dd, xs)

Although we didn't pick up drift in the GIN model prediction distribution for the test sample, we can see that the model is less certain about the predictions on the test set, illustrated by the lower ROC-AUC.

Input data drift using the MMD detector

We can also more detect drift on the input data by encoding the data with a randomly initialized GNN to extract graph embeddings. Then we apply our detector of choice, e.g. the MMD detector on the extracted embeddings.

class Encoder(nn.Module):
    def __init__(self, n_layer: int = 1, emb_dim: int = 64, jk: bool = True, 
                 residual: bool = True, pool: str = 'add', norm: str = 'batch') -> None:
        super().__init__()
        self.n_layer = n_layer
        self.jk = jk  # jumping-knowledge
        self.residual = residual  # residual/skip connections
        self.atom_encoder = AtomEncoder(emb_dim=emb_dim)  # encode node features
        self.convs = nn.ModuleList([GINConv(emb_dim) for _ in range(n_layer)])
        norm = nn.BatchNorm1d if norm == 'batch' else LayerNorm
        self.bns = nn.ModuleList([norm(emb_dim) for _ in range(n_layer)])
        self.pool = global_add_pool

    def forward(self, data: Batch) -> torch.Tensor:
        x, edge_index, edge_attr, batch = data.x, data.edge_index, data.edge_attr, data.batch
        # node embeddings
        hs = [self.atom_encoder(x)]
        for layer in range(self.n_layer):
            h = self.convs[layer](hs[layer], edge_index, edge_attr)
            h = self.bns[layer](h)
            if layer < self.n_layer - 1:
                h = F.relu(h)
            if self.residual:
                h += hs[layer]
            hs += [h]
        # graph embedding and prediction
        if self.jk:
            h = torch.cat([h for h in hs], -1)
        h_out = self.pool(h, batch)
        return h_out

from alibi_detect.cd import MMDDrift

enc = Encoder(n_layer=1).to(device)
preprocess_fn = partial(predict_batch, model=enc, device=device, preprocess_fn=batch_fn, batch_size=32)
dd = MMDDrift(x_ref, backend='pytorch', p_val=.05, n_permutations=1000, preprocess_fn=preprocess_fn)

make_predictions(dd, xs)

Input data drift using a learned kernel

Instead of applying the MMD detector on the pooling output of a randomly initialized GNN encoder, we use the Learned Kernel detector which trains the encoder and kernel on part of the data to maximise an estimate of the detector's test power. Once the kernel is learned a permutation test is performed in the usual way on the value of the MMD on the held out test set.

from alibi_detect.cd import LearnedKernelDrift
from alibi_detect.utils.pytorch import DeepKernel

kernel = DeepKernel(enc, kernel_b=None)  # use the already defined random encoder in the deep kernel
dd = LearnedKernelDrift(x_ref, kernel, backend='pytorch', p_val=.05, dataloader=DataLoader, 
                        preprocess_batch_fn=batch_fn, epochs=2)

make_predictions(dd, xs)

Since the molecular scaffolds are different across the train, validation and test sets, we expect that this type of data shift is picked up in the input data (technically not the input but the graph embedding).

Drift on graph statistics

We could also compute graph-level statistics such as the number of nodes, edges and clustering coefficient and detect drift on those statistics using the Kolmogorov-Smirnov test with multivariate correction (e.g. Bonferroni). First we define a preprocessing step to extract the summary statistics from the graphs:

# return number of nodes, edges and average clustering coefficient per graph
def graph_stats(data: List[Data]) -> np.ndarray:
    num_nodes = np.array([d.num_nodes for d in data])
    num_edges = np.array([d.num_edges for d in data])
    c = np.array([np.array(list(clustering(to_networkx(d)).values())).mean() for d in data])
    return np.concatenate([num_nodes[:, None], num_edges[:, None], c[:, None]], axis=-1)

dd = KSDrift(x_ref, p_val=.05, preprocess_fn=graph_stats)

make_predictions(dd, xs)

The 3 returned p-values correspond to respectively the p-values for the number of nodes, edges and clustering coefficient. We already saw in the EDA that the distributions of the node, edge and clustering coefficients look similar across the train, validation and test sets except for the tails. This is confirmed by running the drift detector on the graph statistics which cannot seem to pick up on the differences in molecular scaffolds between the datasets, unless we heavily oversample from the minority class where the number of nodes and edges but not the clustering coefficient significantly differ.

Online drift detection for Camelyon17 medical imaging dataset

This notebook demonstrates a typical workflow for applying online drift detectors to streams of image data. For those unfamiliar with how the online drift detectors operate in alibi_detect we recommend first checking out the more introductory example Online Drift Detection on the Wine Quality Dataset where online drift detection is performed for the wine quality dataset.

This notebook requires the wilds, torch and torchvision packages which can be installed via pip:

!pip install wilds torch torchvision

from typing import Tuple, Generator, Callable, Optional
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
import torchvision.transforms as transforms
from wilds.common.data_loaders import get_train_loader
from wilds import get_dataset

torch.manual_seed(0)
np.random.seed(0)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
%matplotlib inline

Data

We will use the Camelyon17 dataset, one of the WILDS datasets of Koh et al, (2020) that represent "in-the-wild" distribution shifts for various data modalities. It contains tissue scans to be classificatied as benign or cancerous. The pre-change distribution corresponds to scans from across three hospitals and the post-change distribution corresponds to scans from a new fourth hospital.

Koh et al, (2020) show that models trained on scans from the pre-change distribution achieve an accuracy of 93.2% on unseen scans from same distribution, but only 70.3% accuracy on scans from the post-change distribution.

First we create a function that converts the Camelyon dataset to a stream in order to simulate a live deployment environment. We extract N instances to act as the reference set on which a model of interest was trained. We then consider a stream of images from the pre-change (same) distribution and a stream of images from the post-change (drifted) distribution.

WILDS_PATH = './data/wilds'
DOWNLOAD = False  # set to True for first run
N = 2000  # size of reference set

The following cell will download the Camelyon dataset (if DOWNLOAD=True). The download size is ~10GB and size on disk is ~15GB.

def stream_camelyon(
    split: str='train', 
    img_size: Tuple[int]=(96,96), 
    root_dir: str=None, 
    download: bool=False
) -> Generator:

    camelyon = get_dataset('camelyon17', root_dir=root_dir, download=download)
    ds = camelyon.get_subset(
        split, 
        transform=transforms.Compose([transforms.Resize(img_size), transforms.ToTensor()])
    )
    ds_iter = iter(get_train_loader('standard', ds, batch_size=1))

    while True:
        try:
            img = next(ds_iter)[0][0]
        except Exception:
            ds_iter = iter(get_train_loader('standard', ds, batch_size=1))
            img = next(ds_iter)[0][0]
        yield img.numpy()

stream_p = stream_camelyon(split='train', root_dir=WILDS_PATH, download=DOWNLOAD)
x_ref = np.stack([next(stream_p) for _ in range(N)], axis=0)

stream_q_h0 = stream_camelyon(split='id_val', root_dir=WILDS_PATH, download=DOWNLOAD)
stream_q_h1 = stream_camelyon(split='test', root_dir=WILDS_PATH, download=DOWNLOAD)

Shown below are samples from the pre-change distribution:

#| scrolled: true
fig, axs = plt.subplots(nrows=1, ncols=6, figsize=(15,4))
for i in range(6):
    axs[i].imshow(np.transpose(next(stream_p), (1,2,0)))
    axs[i].axis('off')

And samples from the post-change distribution:

fig, axs = plt.subplots(nrows=1, ncols=6, figsize=(15,4))
for i in range(6):
    axs[i].imshow(np.transpose(next(stream_q_h1), (1,2,0)))
    axs[i].axis('off')

Kernel Projection

The images are of dimension 96x96x3. We train an autoencoder in order to define a more structured representational space of lower dimension. This projection can be thought of as an extension of the kernel. It is important that trained preprocessing components are trained on a split of data that doesn't then form part of the reference data passed to the drift detector.

ENC_DIM = 32
BATCH_SIZE = 32
EPOCHS = 5
LEARNING_RATE = 1e-3

encoder = nn.Sequential(
    nn.Conv2d(3, 8, 5, stride=3, padding=1),    # [batch, 8, 32, 32]
    nn.ReLU(),
    nn.Conv2d(8, 12, 4, stride=2, padding=1),   # [batch, 12, 16, 16]
    nn.ReLU(),
    nn.Conv2d(12, 16, 4, stride=2, padding=1),   # [batch, 16, 8, 8]
    nn.ReLU(),
    nn.Conv2d(16, 20, 4, stride=2, padding=1),   # [batch, 20, 4, 4]
    nn.ReLU(),
    nn.Conv2d(20, ENC_DIM, 4, stride=1, padding=0),   # [batch, enc_dim, 1, 1]
    nn.Flatten(), 
)
decoder = nn.Sequential(
    nn.Unflatten(1, (ENC_DIM, 1, 1)),
    nn.ConvTranspose2d(ENC_DIM, 20, 4, stride=1, padding=0),  # [batch, 20, 4, 4]
    nn.ReLU(),
    nn.ConvTranspose2d(20, 16, 4, stride=2, padding=1),  # [batch, 16, 8, 8]
    nn.ReLU(),
    nn.ConvTranspose2d(16, 12, 4, stride=2, padding=1),  # [batch, 12, 16, 16]
    nn.ReLU(),
    nn.ConvTranspose2d(12, 8, 4, stride=2, padding=1),  # [batch, 8, 32, 32]
    nn.ReLU(),
    nn.ConvTranspose2d(8, 3, 5, stride=3, padding=1),   # [batch, 3, 96, 96]
    nn.Sigmoid(),
)
ae = nn.Sequential(encoder, decoder).to(device)

x_fit, x_ref = np.split(x_ref, [len(x_ref)//2])
x_fit = torch.as_tensor(x_fit)
x_fit_dl = DataLoader(TensorDataset(x_fit, x_fit), BATCH_SIZE, shuffle=True)

We can train the autoencoder using a helper function provided for convenience in alibi-detect.

from alibi_detect.models.pytorch import trainer

trainer(ae, nn.MSELoss(), x_fit_dl, device, learning_rate=LEARNING_RATE, epochs=EPOCHS)

The preprocessing/projection functions are expected to map numpy arrays to numpy array, so we wrap the encoder within the function below.

def encoder_fn(x: np.ndarray) -> np.ndarray:
    x = torch.as_tensor(x).to(device)
    with torch.no_grad():
        x_proj = encoder(x)
    return x_proj.cpu().numpy()

Drift Detection

alibi-detect's online drift detectors window the stream of data in an 'overlapping window' manner such that a test is performed at every time step. We will use an estimator of MMD as the test statistic. The estimate is updated incrementally at low cost. The thresholds are configured via simulation in an initial configuration phase to target the desired expected runtime (ERT) in the absence of change. For a detailed description of this calibration procedure see Cobb et al, 2021.

ERT = 150  # expected run-time in absence of change
W = 20  # size of test window
B = 50_000  # number of simulations to configure threshold

from alibi_detect.cd import MMDDriftOnline

dd = MMDDriftOnline(x_ref, ERT, W, backend='pytorch', preprocess_fn=encoder_fn)

We define a function which will apply the detector to the streams and return the time at which drift was detected.

def compute_runtime(detector: Callable, stream: Generator) -> int:

    t = 0
    detector.reset_state()
    detected = False

    while not detected:
        t += 1
        z = next(stream)
        pred = detector.predict(z)
        detected = pred['data']['is_drift']
    print(t)
    return t

First we apply the detector multiple times to the pre-change stream where the distribution is unchanged.

#| scrolled: true
times_h0 = [compute_runtime(dd, stream_p) for i in range(15)]
print(f"Average runtime in absence of change: {np.array(times_h0).mean()}")

We see that the average runtime in the absence of change is close to the desired ERT, as expected. We can inspect the detector's test_stats and thresholds properties to see how the test statistic varied over time and how close it got to exceeding the threshold.

ts = np.arange(dd.t)
plt.plot(ts, dd.test_stats, label='Test statistic')
plt.plot(ts, dd.thresholds, label='Thresholds')
plt.xlabel('t', fontsize=16)
plt.ylabel('$T_t$', fontsize=16)
plt.legend(loc='upper right', fontsize=14)
plt.show()

Now we apply it to the post-change stream where the images are from a drifted distribution.

times_h1 = [compute_runtime(dd, stream_q_h1) for i in range(15)]
print(f"Average detection delay following change: {np.array(times_h1).mean()}")

We see that the detector is quick to flag drift when it has occured.

ts = np.arange(dd.t)
plt.plot(ts, dd.test_stats, label='Test statistic')
plt.plot(ts, dd.thresholds, label='Thresholds')
plt.xlabel('t', fontsize=16)
plt.ylabel('$T_t$', fontsize=16)
plt.legend(loc='upper right', fontsize=14)
plt.show()

Online Drift Detection on the Wine Quality Dataset

In the context of deployed models, data (model queries) usually arrive sequentially and we wish to detect it as soon as possible after its occurence. One approach is to perform a test for drift every $W$ time-steps, using the $W$ samples that have arrived since the last test. Such a strategy could be implemented using any of the offline detectors implemented in alibi-detect, but being both sensitive to slight drift and responsive to severe drift is difficult. If the window size $W$ is too small then slight drift will be undetectable. If it is too large then the delay between test-points hampers responsiveness to severe drift.

An alternative strategy is to perform a test each time data arrives. However the usual offline methods are not applicable because the process for computing p-values is too expensive and doesn't account for correlated test outcomes when using overlapping windows of test data.

Online detectors instead work by computing the test-statistic once using the first $W$ data points and then updating the test-statistic sequentially at low cost. When no drift has occured the test-statistic fluctuates around its expected value and once drift occurs the test-statistic starts to drift upwards. When it exceeds some preconfigured threshold value, drift is detected.

Unlike offline detectors which require the specification of a threshold p-value (a false positive rate), the online detectors in alibi-detect require the specification of an expected run-time (ERT) (an inverted FPR). This is the number of time-steps that we insist our detectors, on average, should run for in the absense of drift before making a false detection. Usually we would like the ERT to be large, however this results in insensitive detectors which are slow to respond when drift does occur. There is a tradeoff between the expected run time and the expected detection delay.

To target the desired ERT, thresholds are configured during an initial configuration phase via simulation. This configuration process is only suitable when the amount reference data (most likely the training data of the model of interest) is relatively large (ideally around an order of magnitude larger than the desired ERT). Configuration can be expensive (less so with a GPU) but allows the detector to operate at low-cost during deployment.

This notebook demonstrates online drift detection using two different two-sample distance metrics for the test-statistic, the maximum mean discrepency (MMD) and least-squared density difference (LSDD), both of which can be updated sequentially at low cost.

Backend

The online detectors are implemented in both the PyTorch and TensorFlow frameworks with support for CPU and GPU. Various preprocessing steps are also supported out-of-the box in Alibi Detect for both frameworks and an example will be given in this notebook. Alibi Detect does however not install PyTorch for you. Check the PyTorch docs how to do this.

Dataset

The Wine Quality Data Set consists of 4898 and 1599 samples of white and red wine respectively. Each sample has an associated quality (as determined by experts) and 11 numeric features indicating its acidity, density, pH etc. We consider the regression problem of tring to predict the quality of white wine samples given these features. We will then consider whether the model remains suitable for predicting the quality of red wine samples or whether the associated change in the underlying distribution should be considered as drift.

Online detection with MMD and Pytorch

The Maximum Mean Discepency (MMD) is a distance-based measure between 2 distributions p and q based on the mean embeddings $\mu_{p}$ and $\mu_{q}$ in a reproducing kernel Hilbert space $F$:

MMD(F, p, q) = || \mu_{p} - \mu_{q} ||^2_{F}

Given reference samples ${X_i}{i=1}^{N}$ and test samples ${Y_i}{i=t}^{t+W}$ we may compute an unbiased estimate $\widehat{MMD}^2(F, {X_i}{i=1}^N, {Y_i}{i=t}^{t+W})$ of the squared MMD between the two underlying distributions. Depending on the size of the reference and test windows, $N$ and $W$ respectively, this can be relatively expensive. However, once computed it is possible to update the statistic to estimate to the squared MMD between the distributions underlying ${X_i}{i=1}^{N}$ and ${Y_i}{i=t+1}^{t+1+W}$ at a very low cost, making it suitable for online drift detection.

By default we use a radial basis function kernel, but users are free to pass their own kernel of preference to the detector.

import matplotlib.pyplot as plt
import numpy as np
import torch
import tensorflow as tf
import pandas as pd
import scipy
from sklearn.decomposition import PCA

np.random.seed(0)
torch.manual_seed(0)
tf.random.set_seed(0)

Load data

First we load in the data:

red = pd.read_csv(
    "https://storage.googleapis.com/seldon-datasets/wine_quality/winequality-red.csv", sep=';'
)
white = pd.read_csv(
    "https://storage.googleapis.com/seldon-datasets/wine_quality/winequality-white.csv", sep=';'
)
white.describe()

We can see that the data for both red and white wine samples take the same format.

red.describe()

We shuffle and normalise the data such that each feature takes a value in [0,1], as does the quality we seek to predict. We assue that our model was trained on white wine samples, which therefore forms the reference distribution, and that red wine samples can be considered to be drawn from a drifted distribution.

white, red = np.asarray(white, np.float32), np.asarray(red, np.float32)
n_white, n_red = white.shape[0], red.shape[0]

col_maxes = white.max(axis=0)
white, red = white / col_maxes, red / col_maxes
white, red = white[np.random.permutation(n_white)], red[np.random.permutation(n_red)]
X = white[:, :-1]
X_corr = red[:, :-1]

Although it may not be necessary on this relatively low-dimensional data for which individual features are semantically meaningful, we demonstrate how principle component analysis (PCA) can be performed as a preprocessing stage to project raw data onto a lower dimensional representation which more concisely captures the factors of variation in the data. As not to bias the detector it is necessary to fit the projection using a split of the data which isn't then passed as reference data. We additionally split off some white wine samples to act as undrifted data during deployment.

X_train = X[:(n_white//2)]
X_ref = X[(n_white//2):(3*n_white//4)]
X_h0 = X[(3*n_white//4):]

Now we define a PCA object to be used as a preprocessing function to project the 11-D data onto a 2-D representation. We learn the first 2 principal components on the training split of the reference data.

pca = PCA(2)
pca.fit(X_train)

Hopefully the learned preprocessing step has learned a projection such that in the lower dimensional space the two samples are distinguishable.

enc_h0 = pca.transform(X_h0)
enc_h1 = pca.transform(X_corr)

plt.scatter(enc_h0[:,0], enc_h0[:,1], alpha=0.2, color='green', label='white wine')
plt.scatter(enc_h1[:,0], enc_h1[:,1], alpha=0.2, color='red', label='red wine')
plt.legend(loc='upper right')
plt.show()

Now we can define our online drift detector. We specify an expected run-time (in the absence of drift) of 50 time-steps, and a window size of 10 time-steps. Upon initialising the detector thresholds will be computed using 2500 boostrap samples. These values of ert, window_size and n_bootstraps are lower than a typical use-case in order to demonstrate the average behaviour of the detector over a large number of runs in a reasonable time.

from alibi_detect.cd import MMDDriftOnline

ert = 50
window_size = 10

cd = MMDDriftOnline(
    X_ref, ert, window_size, backend='pytorch', preprocess_fn=pca.transform, n_bootstraps=2500
)

We now define a function which will simulate a single run and return the run-time. Note how the detector acts on single instances at a time, the run-time is considered as the time elapsed after the test-window has been filled, and that the detector is stateful and must be reset between detections.

def time_run(cd, X, window_size):
    n = X.shape[0]
    perm = np.random.permutation(n)
    t = 0
    cd.reset_state()
    while True:
        pred = cd.predict(X[perm[t%n]])
        if pred['data']['is_drift'] == 1:
            return t
        else:
            t += 1

Now we look at the distribution of run-times when operating on the held-out data from the reference distribution of white wine samples. We report the average run-time, however note that the targeted run-time distribution, a Geometric distribution with mean ert, is very high variance so the empirical average may not be that close to ert over a relatively small number of runs. We can see that the detector accurately targets the desired Geometric distribution however by inspecting the linearity of a Q-Q plot.

n_runs = 250
times_h0 = [time_run(cd, X_h0, window_size) for _ in range(n_runs)]
print(f"Average run-time under no-drift: {np.mean(times_h0)}")
_ = scipy.stats.probplot(np.array(times_h0), dist=scipy.stats.geom, sparams=1/ert, plot=plt)

If we run the detector in an identical manner but on data from the drifted distribution of red wine samples the average run-time is much lower.

n_runs = 250
times_h1 = [time_run(cd, X_corr, window_size) for _ in range(n_runs)]
print(f"Average run-time under drift: {np.mean(times_h1)}")

Online detection with LSDD and TensorFlow

We additionally show that TensorFlow can also be used as the backend and that sometimes it is not necessary to perform preprocessing, making definition of the drift detector simpler. Moreover, in the absence of a learned preprocessing stage we may use all of the reference data available.

X_ref = np.concatenate([X_train, X_ref], axis=0)

And now we define the LSDD-based online drift detector, again with an ert of 50 and window_size of 10.

from alibi_detect.cd import LSDDDriftOnline

cd = LSDDDriftOnline(
    X_ref, ert, window_size, backend='tensorflow', n_bootstraps=2500,
)

We run this new detector on the held out reference data and again see that in the absence of drift the distribution of run-times follows a Geometric distribution with mean ert.

n_runs = 250
times_h0 = [time_run(cd, X_h0, window_size) for _ in range(n_runs)]
print(f"Average run-time under no-drift: {np.mean(times_h0)}")
_ = scipy.stats.probplot(np.array(times_h0), dist=scipy.stats.geom, sparams=1/ert, plot=plt)

And when drift has occured the detector is very fast to respond.

n_runs = 250
times_h1 = [time_run(cd, X_corr, window_size) for _ in range(n_runs)]
print(f"Average run-time under drift: {np.mean(times_h1)}")

Interpretable drift detection with the spot-the-diff detector on MNIST and Wine-Quality datasets

Under the hood drift detectors leverage a function of the data that is expected to be large when drift has occured and small when it hasn't. In the Learned drift detectors on CIFAR-10 example notebook we note that we can learn a function satisfying this property by training a classifer to distinguish reference and test samples. However we now additionally note that if the classifier is specified in a certain way then when drift is detected we can inspect the weights of the classifier to shine light on exactly which features of the data were used to distinguish reference from test samples and therefore caused drift to be detected.

The SpotTheDiffDrift detector is designed to make this process straightforward. Like the ClassifierDrift detector, it uses a portion of the available data to train a classifier to discriminate between reference and test instances. Letting $\hat{p}_T(x)$ represent the probability assigned by the classifier that the instance $x$ is from the test set rather than reference set, the difference here is that we use a classifier of the form $\text{logit}(\hat{p}_T) = b_0 + b_1 k(x,w_1) + ... + b_Jk(x,w_J),$ where $k(\cdot,\cdot)$ is a kernel specifying a notion of similarity between instances, $w_i$ are learnable test locations and $b_i$ are learnable regression coefficients.

The idea here is that if the detector flags drift and $b_i >0$ then we know that it reached its decision by considering how similar each instance is to the instance $w_i$, with those being more similar being more likely to be test instances than reference instances. Alternatively if $b_i < 0$ then instances more similar to $w_i$ were deemed more likely to be reference instances.

In order to provide less noisy and therefore more interpretable results, we define each test location as $w_i = \bar{x}+d_i$ where $\bar{x}$ is the mean reference instance. We may then interpret $d_i$ as the additive transformation deemed to make the average reference more ($b_i>0$) or less ($b_i<0$) similar to a test instance. Defining the test locations in this way allows us to instead learn the difference $d_i$ and apply regularisation such that non-zero values must be justified by improved classification performance. This allows us to more clearly identify which features any detected drift should be attributed to.

This approach to interpretable drift detection is inspired by the work of Jitkrittum et al. (2016), however several major adaptations have been made.

Backend

The method works with both the PyTorch and TensorFlow frameworks. Alibi Detect does however not install PyTorch for you. Check the PyTorch docs how to do this.

Dataset

We start with an image example in order to provide a visual illustration of how the detector works. For this prupose we use the MNIST dataset of 28 by 28 grayscale handwritten digits. To represent the common problem of new classes emerging during the deployment phase we consider a reference set of ~9,000 instances containing only the digits 1-9 and a test set of 10,000 instances containing all of the digits 0-9. We would like drift to be detected in this scenario because a model trained of the reference instances will not know how to process instances from the new class.

This notebook requires the torchvision package which can be installed via pip:

!pip install torchvision

import torch
import tensorflow as tf
import torchvision
import numpy as np
import matplotlib.pyplot as plt
from alibi_detect.cd import SpotTheDiffDrift

np.random.seed(0)
torch.manual_seed(0)
tf.random.set_seed(0)
%matplotlib inline

MNIST_PATH = 'my_path'
DOWNLOAD = True
MISSING_NUMBER = 0
N = 10000

# Load and shuffle data
mnist_train_ds = torchvision.datasets.MNIST(MNIST_PATH, train=True, download=DOWNLOAD)
all_x, all_y = mnist_train_ds.data, mnist_train_ds.targets
perm = np.random.permutation(len(all_x))
all_x, all_y = all_x[perm], all_y[perm]
all_x = all_x[:, None, : , :].numpy().astype(np.float32)/255.

# Create a reference and test set
x_ref = all_x[:N]
x = all_x[N:2*N]

# Remove a class from reference set
x_ref = x_ref[all_y[:10000] != MISSING_NUMBER]

When instantiating the detector we should specify the number of "diffs" we would like it to use to discriminate reference from test instances. Here there is a trade off. Using n_diffs=1 is the simplest to interpret and seems to work well in practice. Using more diffs may result in stronger detection power but the diffs may be harder to interpret due to intereactions and conditional dependencies.

The strength of the regularisation (l1_reg) to apply to the diffs should also be specified. Stronger regularisation results in sparser diffs as the classifier is encouraged to discriminate using fewer features. This may make the diff more interpretable but may again come at the cost of detection power.

We should also specify how the classifier should be trained with standard arguments such as learning_rate, epochs and batch_size. By default a Gaussian RBF is used for the kernel but alternatives can be specified via the kernel kwarg. Additionally the classifier can be initialised with any desired diffs by passing them with the initial_diffs kwarg -- by default they are initialised with Gaussian noise with standard deviation equal to that observed in the reference data.

cd = SpotTheDiffDrift(
    x_ref,
    n_diffs=1,
    l1_reg=1e-4,
    backend='tensorflow',
    verbose=1,
    learning_rate=1e-2, 
    epochs=5, 
    batch_size=64,
)

When we then call the detector to detect drift on the deployment/test set it trains the classifier (thereby learning the diffs) and the usual is_drift and p_val properties can be inspected in the usual way:

preds = cd.predict(x)

print(f"Drift? {'Yes' if preds['data']['is_drift'] else 'No'}")
print(f"p-value: {preds['data']['p_val']}")

As expected, the drift was detected. However we may now additionally look at the learned diffs and corresponding coefficients to determine how the detector reached this decision.

print(f"Diff coeff: {preds['data']['diff_coeffs']}")
diff = preds['data']['diffs'][0,0]
plt.imshow(diff, cmap='RdBu', vmin=-np.max(np.abs(diff)), vmax=np.max(np.abs(diff)))
plt.colorbar()

The detector has identified the zero that was missing from the reference data -- it realised that test instances were on average more (coefficient > 0) simmilar to an instance with below average middle pixel values and above average zero-region pixel values than reference instances were. It used this information to determine that drift had occured.

Interpretable Drift Detection on the Wine Quality Dataset

To provide an example on tabular data we consider the Wine Quality Data Set consisting of 4898 and 1599 samples of white and red wine respectively. Each sample has an associated quality (as determined by experts) and 11 numeric features indicating its acidity, density, pH etc. To represent the problem of a model being trained on one distribution and deployed on a subtly different one, we take as a reference set the samples of white wine and consider the red wine samples to form a 'corrupted' deployment set.

import pandas as pd
red_df = pd.read_csv(
    "https://storage.googleapis.com/seldon-datasets/wine_quality/winequality-red.csv", sep=';'
)
white_df = pd.read_csv(
    "https://storage.googleapis.com/seldon-datasets/wine_quality/winequality-white.csv", sep=';'
)
white_df.describe()

We can see that the data for both red and white wine samples take the same format.

red_df.describe()

We extract the features and shuffle and normalise them such that they take values in [0,1].

white, red = np.asarray(white_df, np.float32)[:, :-1], np.asarray(red_df, np.float32)[:, :-1]
n_white, n_red = white.shape[0], red.shape[0]

col_maxes = white.max(axis=0)
white, red = white / col_maxes, red / col_maxes
white, red = white[np.random.permutation(n_white)], red[np.random.permutation(n_red)]
x, x_corr = white, red

We then split off half of the reference set to act as an unseen sample from the same underlying distribution for which drift should not be detected.

x_ref = x[:len(x)//2]
x_h0 = x[len(x)//2:]

We instantiate our detector in the same way as we do above, but this time using the Pytorch backend for the sake of variety. We then get the predictions of the detector on both the undrifted and corrupted test sets.

cd = SpotTheDiffDrift(
    x_ref,
    n_diffs=1,
    l1_reg=1e-4,
    backend='pytorch',
    verbose=1,
    learning_rate=1e-2, 
    epochs=5, 
    batch_size=64,
)

preds_h0 = cd.predict(x_h0)
preds_corr = cd.predict(x_corr)

print(f"Drift on h0? {'Yes' if preds_h0['data']['is_drift'] else 'No'}")
print(f"p-value on h0: {preds_h0['data']['p_val']}")
print(f"Drift on corrupted? {'Yes' if preds_corr['data']['is_drift'] else 'No'}")
print(f"p-value on corrupted:: {preds_corr['data']['p_val']}")

As expected drift is detected on the red wine samples but not the held out white wine samples from the same distribution. Now we can inspect the returned diff to determine how the detector reached its decision

diff = preds_corr['data']['diffs'][0]
print(f"Diff coeff: {preds_corr['data']['diff_coeffs']}")
plt.barh(white_df.columns[:-1], diff)
plt.xlim((-1.1*np.max(np.abs(diff)), 1.1*np.max(np.abs(diff))))
plt.axvline(0, linestyle='--', color='black')
plt.show()

We see that the detector was able to discriminate the corrupted (red) wine samples from the reference (white) samples by noting that on average reference samples (coeff < 0) typically contain more sulfur dioxide and residual sugars but have less sulphates and chlorides and have lower pH and volatile and fixed acidity.

Supervised drift detection on the penguins dataset

Method

When true outputs/labels are available, we can perform supervised drift detection; monitoring the model's performance directly in order to check for harmful drift. Two detectors ideal for this application are the Fisher’s Exact Test (FET) detector and Cramér-von Mises (CVM) detector detectors.

The FET detector is designed for use on binary data, such as the instance level performance indicators from a classifier (i.e. 0/1 for each incorrect/correct classification). The CVM detector is designed use on continuous data, such as a regressor's instance level loss or error scores.

In this example we will use the offline versions of these detectors, which are suitable for use on batches of data. In many cases data may arrive sequentially, and the user may wish to perform drift detection as the data arrives to ensure it is detected as soon as possible. In this case, the online versions of the FET and CVM detectors can be used, as will be explored in a future example.

Dataset

The palmerpenguins dataset consists of data on 344 penguins from 3 islands in the Palmer Archipelago, Antarctica. There are 3 different species of penguin in the dataset, and a common task is to classify the the species of each penguin based upon two features, the length and depth of the peguin's bill, or beak.

Artwork by Allison Horst

This notebook requires the seaborn package for visualization and the palmerpenguins package to load data. Thse can be installed via pip:

!pip install palmerpenguins
!pip install seaborn

from functools import partial
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
import seaborn as sns 

# construct cmap
sns.set_style('whitegrid')
sns.set(font_scale = 1.2)

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from alibi_detect.cd import MMDDrift, FETDrift, CVMDrift

# Set color pallette to match palmerpenguins
mypalette = sns.color_palette(["#ff7300","#008b8b", "#c15bcb"], as_cmap=True)
sns.set_palette(mypalette)
my_cmap = ListedColormap(mypalette)

Load data

To download the dataset we use the palmerpenguins package:

from palmerpenguins import load_penguins

data = load_penguins().dropna()
data.head()

The data consists of 333 rows (one row is removed as contains a NaN), one for each penguin. There are 8 features describing the peguins' physical characteristics, their species and sex, the island each resides on, and the year measurements were taken.

Classification example

For our first example use case, we will perform the popular species classification task. Here we wish the classify the species based on only bill_length_mm and bill_depth_mm. To start we remove the other features and visualise those that remain.

data = data.drop(['island', 'flipper_length_mm', 'body_mass_g', 'sex', 'year'], axis=1)
y = data['species']

pairplot_figure = sns.pairplot(data, hue='species')
pairplot_figure.fig.set_size_inches(9, 6.5)

The above plot shows that the Adeilie species can primarily be identified by looking at bill length. Then to further distinguish between Gentoo and Chinstrap, we can look at the bill depth.

Next we separate the data into inputs and outputs, and encoder the species data to integers. Finally, we now split into three data sets; one to train the classifier, one to act a reference set when testing for drift, and one to test for drift on.

X = data[['bill_length_mm', 'bill_depth_mm']]
y = data['species']
mymap = {'Adelie':0, 'Gentoo':1, 'Chinstrap':2}
y = y.map(mymap)

X_train, X_ref, y_train, y_ref = train_test_split(X.to_numpy(), y.to_numpy(), train_size=60, random_state=42)
X_ref, X_test, y_ref, y_test = train_test_split(X_ref, y_ref, train_size=0.5, random_state=42)

Train a classifier

For this dataset, a relatively shallow decision tree classifier should be sufficient, and so we train an sklearn one on the training data.

clf = DecisionTreeClassifier(max_depth=5, random_state=42)
clf = clf.fit(X_train, y_train)

print('Training accuracy = %.1f %%' % (100*clf.score(X_train, y_train)))
print('Test accuracy = %.1f %%' % (100*clf.score(X_test, y_test)))

As expected, the decision tree is able to give acceptable classification accuracy on the train and test sets.

Shift data

In order to demonstrate use of the drift detectors, we first need to add some artificial drift to the test data X_test. We add two types of drift here; to create covariate drift we subtract 5mm from the bill length of all the Gentoo penguins. $P(y|\mathbf{X})$ is unchanged here, but clearly we have introduced a delta $\Delta P(\mathbf{X})$. To create concept drift, we switch the labels of the Gentoo and Chinstrap penguins, so that the underlying process $P(y|\mathbf{X})$ is changed.

X_covar, y_covar = X_test.copy(), y_test.copy()
X_concept, y_concept = X_test.copy(), y_test.copy()

# Apply covariate drift by altering the bill depth of the Gentoo species
idx1 = np.argwhere(y_test==1)
X_covar[idx1,1] -= 5

# Apply concept drift by switching two species
idx2 = np.argwhere(y_test==2)
y_concept[idx1] = 2
y_concept[idx2] = 1

Xs = {'No drift': X_test, 'Covariate drift': X_covar, 'Concept drift': X_concept}

We now define a utility function to plot the classifier's decision boundaries, and we use this to visualise the reference data set, the test set, and the two new data sets where drift is present.

def plot_decision_boundaries(X, y, clf, ax=None, title=None):
    """
    Helper function to visualize a classifier's decision boundaries. 
    """
    if ax is None:
        f, ax = plt.subplots(figsize=(6, 6))
    ax.set_xlabel('Bill length (mm)')
    ax.set_ylabel('Bill Depth (mm)')
    
    # Plotting decision regions
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1), np.arange(y_min, y_max, 0.1))

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.2, cmap=my_cmap)
    ax.scatter(X[:, 0], X[:, 1], c=y, s=80, edgecolor="k", cmap=my_cmap)
    
    ax.text(0.02, 0.98, 'Model accuracy = %.1f %%' % (100*clf.score(X, y)),
           ha='left', va='top', transform=ax.transAxes, fontweight='bold')
    
    if title is not None:
        ax.set_title(title)

fig, axs = plt.subplots(2, 2, figsize=(12,12))
plot_decision_boundaries(X_ref, y_ref, clf, ax=axs[0,0], title='Reference data')
plot_decision_boundaries(X_test, y_test, clf, ax=axs[0,1], title='No drift')
plot_decision_boundaries(X_covar, y_covar, clf, ax=axs[1,0], title='Covariate drift')
plot_decision_boundaries(X_concept, y_concept, clf, ax=axs[1,1], title='Concept drift')
plt.subplots_adjust(wspace=0.3, hspace=0.3)

These plots serve as a visualisation of the differences between covariate drift and concept drift. Importantly, the model accuracies shown above also highlight the fact that not all drift is necessarily malicious, in the sense that even relatively significant drift does not always lead to degradation in a model's performance indicators. For example, the model actually gives a slightly higher accuracy on the covariate drift data set than on the no drift set in this case. Conversely, the concept drift unsuprisingly leads to severely degraded model performance.

Unsupervised drift detection

Before getting to the main task in this example, monitoring malicious drift with a supervised drift detector, we will first use the MMD detector to check for covariate drift. To do this we initialise it in an unsupervised manner by passing it the input data X_ref.

cd_mmd = MMDDrift(X_ref, p_val = 0.05)

Applying this detector on the no drift, covariate drift and concept drift data sets, we see that the detector only detects drift in the covariate drift case. Not detecting drift in the no drift case is desirable, but not detecting drift in the concept drift case is potentially problematic.

labels = ['No!', 'Yes!']
for name, Xarr in Xs.items():
    print('\n%s' % name)
    np.random.seed(0)  # Set the seed used in the MMD permutation test (only for notebook reproducibility)
    preds = cd_mmd.predict(Xarr)
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print('p-value: {}'.format(preds['data']['p_val']))

Supervised drift detection

The fact that the unsupervised detector above does not detect the severe concept drift demonstrates the motivation for using supervised drift detectors that directly check for malicious drift, which can include malicious concept drift.

To perform supervised drift detection we first need to compute the model's performance indicators. Since this is a classification task, a suitable performance indicator is the instance level binary losses, which are computed below.

loss_ref = (clf.predict(X_ref) == y_ref).astype(int)
loss_test = (clf.predict(X_test) == y_test).astype(int)
loss_covar = (clf.predict(X_covar) == y_covar).astype(int)
loss_concept = (clf.predict(X_concept) == y_concept).astype(int)
losses = {'No drift': loss_test, 'Covariate drift': loss_covar, 'Concept drift': loss_concept}

print(loss_ref)

As seen above, these losses are binary data, where 0 represents an incorrect classification for each instance, and 1 represents a correct classification.

Since this is binary data, the FET detector is chosen, and initialised on the reference loss data. The alternative hypothesis is set to less, meaning we will only flag drift if the proportion of 1s to 0s is reduced compared to the reference data. In other words, we only flag drift if the model's performance has degraded.

cd_fet = FETDrift(loss_ref, p_val=0.05, alternative='less')

Applying this detector to the same three data sets, we see that malicious drift isn't detected in the no drift or covariate drift cases, which is unsurprising since the model performance isn't degraded in these cases. However, with this supervised detector, we now detect the malicious concept drift as desired.

labels = ['No!', 'Yes!']
for name, loss_arr in losses.items():
    print('\n%s' % name)
    preds = cd_fet.predict(loss_arr)
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print('p-value: {}'.format(preds['data']['p_val'][0]))

Regression example

To provide a short example of supervised detection in a regression setting, we now rework the dataset into a regression task, and use the CVM detector on the model's squared error.

Warning: Must have scipy >= 1.7.0 installed for this example.

Load data and train model

For a regression task, we take the penguins' flipper length and sex as inputs, and aim to predict the penguins' body mass. Looking at a scatter plot of these features, we can see there is substantial correlation between the chosen inputs and outputs.

data_r = load_penguins().dropna()
Xr = data_r[['flipper_length_mm', 'sex']].replace({'sex': {'female': 1, 'male': 0}})
yr = data_r['body_mass_g']

_ = sns.scatterplot(data=data_r, x='flipper_length_mm', y='body_mass_g', hue='sex')

Again, we split the dataset into the same three sets; a training set, reference set and test set.

Xr_train, Xr_ref, yr_train, yr_ref = train_test_split(Xr.to_numpy(), yr.to_numpy(), 
                                                          train_size=60, random_state=42)
Xr_ref, Xr_test, yr_ref, yr_test = train_test_split(Xr_ref, yr_ref, train_size=0.5, random_state=42)

This time we train a linear regressor on the training data, and find that it gives acceptable training and test accuracy.

reg = LinearRegression()
reg.fit(Xr_train, yr_train)
print('Training RMS error = %.3f' % np.sqrt(np.mean((reg.predict(Xr_train)-yr_train)**2)))
print('Test RMS error = %.3f' % np.sqrt(np.mean((reg.predict(Xr_test)-yr_test)**2)))

Shift data

To generate a copy of the test data with concept drift added, we use the model to create new output data, with a multiplicative factor and some Gaussian noise added. The quality of our synthetic output data is of course affected by the accuracy of the model, but it serves to demonstrate the behavior of the model (and detector) when $P(y|\mathbf{X})$ is changed.

Xr_concept = Xr_test.copy()
yr_concept = reg.predict(Xr_concept)*1.1 + np.random.normal(0, 100, size=len(yr_test))

Unsurprisingly, the covariate drift leads to degradation in the model accuracy.

reg.score(Xr_concept, yr_concept)
print('Test RMS error = %.3f' % np.sqrt(np.mean((reg.predict(Xr_concept)-yr_concept)**2)))

Supervised drift detection

As in the classification example, in order to perform supervised drift detection we need to compute the models performance indicators. For this regression example, the instance level squared errors are used.

lossr_ref = (reg.predict(Xr_ref) - yr_ref)**2
lossr_test = (reg.predict(Xr_test) - yr_test)**2
lossr_concept = (reg.predict(Xr_concept) - yr_concept)**2

lossesr = {'No drift': lossr_test, 'Concept drift': lossr_concept}

The CVM detector is trained on the reference losses:

cd_cvm = CVMDrift(lossr_ref, p_val=0.05)

As desired, the CVM detector does not detect drift on the no drift data, but does on covariate drift data.

labels = ['No!', 'Yes!']
for name, loss_arr in lossesr.items():
    print('\n%s' % name)
    preds = cd_cvm.predict(loss_arr)
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print('p-value: {}'.format(preds['data']['p_val'][0]))

Drift detection on Amazon reviews

Methods

We illustrate drift detection on text data using the following detectors:

Maximum Mean Discrepancy (MMD) detector using pre-trained transformers to flag drift in the embedding space.
Classifier drift detector to detect drift in the input space.

Dataset

The Amazon dataset contains product reviews with a star rating. We will test whether drift can be detected if the ratings start to drift. For more information, check the WILDS documentation page.

Dependencies

Besides alibi-detect, this example notebook also uses the Amazon dataset through the WILDS package. WILDS is a curated collection of benchmark datasets that represent distribution shifts faced in the wild and can be installed via pip:

!pip install wilds

Throughout the notebook we use detectors with both PyTorch and TensorFlow backends.

import numpy as np
import torch

def set_seed(seed: int) -> None:
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    np.random.seed(seed)

seed = 1234
set_seed(seed)

Load and prepare data

We first load the dataset and create reference data, data which should not be rejected under the null of the test (H0) and data which should exhibit drift (H1). The drift is introduced later by specifying a specific star rating for the test instances.

AMAZON_PATH = './data/amazon' # path to save data
DOWNLOAD = False  # set to True for first run

The following cell will download the Amazon dataset (if DOWNLOAD=True). The download size is ~7GB and size on disk is ~7GB.

#| scrolled: true
from functools import partial
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, Subset
from wilds import get_dataset
from wilds.common.data_loaders import get_train_loader

ds = get_dataset(dataset='amazon', root_dir=AMAZON_PATH, download=DOWNLOAD)
ds_tr = ds.get_subset('train')
idx_ref, idx_h0 = train_test_split(np.arange(len(ds_tr)), train_size=.5, random_state=seed, shuffle=True)
ds_ref = Subset(ds_tr, idx_ref)
ds_h0 = Subset(ds_tr, idx_h0)
ds_h1 = ds.get_subset('test')
dl = partial(DataLoader, shuffle=True, batch_size=100, collate_fn=ds.collate, num_workers=2)
dl_ref, dl_h0, dl_h1 = dl(ds_ref), dl(ds_h0), dl(ds_h1)

Detect drift

MMD detector on transformer embeddings

First we embed instances using a pretrained transformer model and detect data drift using the MMD detector on the embeddings.

Helper functions:

from typing import List


def update_flat_list(x: List[list]):
    return [item for sublist in x for item in sublist]


def accumulate_sample(dataloader: DataLoader, sample_size: int, stars: int = None):
    """ Create batches of data from dataloaders. """
    batch_count, stars_count = 0, 0
    x_out, y_out, meta_out = [], [], []
    for x, y, meta in dataloader:
        y, meta = y.numpy(), meta.numpy()
        if isinstance(stars, int):
            idx_stars = np.where(y == stars)[0]
            y, meta = y[idx_stars], meta[idx_stars]
            x = tuple([x[idx] for idx in idx_stars])
        n_batch = y.shape[0]
        idx = min(sample_size - batch_count, n_batch)
        batch_count += n_batch
        x_out += [x[:idx]]
        y_out += [y[:idx]]
        meta_out += [meta[:idx]]
        if batch_count >= sample_size:
            break
    x_out = update_flat_list(x_out)
    y_out = np.concatenate(y_out, axis=0)
    meta_out = np.concatenate(meta_out, axis=0)
    return x_out, y_out, meta_out

Define the transformer embedding preprocessing step:

#| scrolled: true
from alibi_detect.cd import MMDDrift
from alibi_detect.cd.pytorch import preprocess_drift
from alibi_detect.models.pytorch import TransformerEmbedding
from functools import partial
from transformers import AutoTokenizer

emb_type = 'hidden_state'  # pooler_output, last_hidden_state or hidden_state
# layers to extract hidden states from for the embedding used in drift detection
# only relevant for emb_type = 'hidden_state'
n_layers = 8
layers = [-_ for _ in range(1, n_layers + 1)]
max_len = 100  # max length for the tokenizer

model_name = 'bert-base-cased'  # a model supported by the transformers library
tokenizer = AutoTokenizer.from_pretrained(model_name)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
embedding = TransformerEmbedding(model_name, emb_type, layers).to(device).eval()
preprocess_fn = partial(preprocess_drift, model=embedding, tokenizer=tokenizer, max_len=max_len, batch_size=32)

Define a function which will for a specified number of iterations (n_sample):

Configure the MMDDrift detector with a new reference data sample
Detect drift on the H0 and H1 splits

labels = ['No!', 'Yes!']


def print_preds(preds: dict, preds_name: str) -> None:
    print(preds_name)
    print('Drift? {}'.format(labels[preds['data']['is_drift']]))
    print(f'p-value: {preds["data"]["p_val"]:.3f}')
    print('') 
    

def make_predictions(ref_size: int, test_size: int, n_sample: int, stars_h1: int = 4) -> None:
    """ Create drift MMD detector, init, sample data and make predictions. """
    for _ in range(n_sample):
        # sample data
        x_ref, y_ref, meta_ref = accumulate_sample(dl_ref, ref_size)
        x_h0, y_h0, meta_h0 = accumulate_sample(dl_h0, test_size)
        x_h1, y_h1, meta_h1 = accumulate_sample(dl_h1, test_size, stars=stars_h1)
        # init and run detector
        dd = MMDDrift(x_ref, backend='pytorch', p_val=.05, preprocess_fn=preprocess_fn, n_permutations=1000)
        preds_h0 = dd.predict(x_h0)
        preds_h1 = dd.predict(x_h1)
        print_preds(preds_h0, 'H0')
        print_preds(preds_h1, 'H1')

#| scrolled: false
make_predictions(ref_size=1000, test_size=1000, n_sample=2, stars_h1=4)

Classifier drift detector

Now we will use the ClassifierDrift detector which uses a binary classification model to try and distinguish the reference from the test (H0 or H1) data. Drift is then detected on the difference between the prediction distributions on out-of-fold reference vs. test instances using a Kolmogorov-Smirnov 2 sample test on the prediction probabilities or via a binomial test on the binarized predictions. We use a pretrained transformer model but freeze its weights and only train the head which consists of 2 dense layers with a leaky ReLU non-linearity:

import torch.nn as nn
from transformers import DistilBertModel

model_name = 'distilbert-base-uncased'

class Classifier(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.lm = DistilBertModel.from_pretrained(model_name)
        for param in self.lm.parameters():  # freeze language model weights
            param.requires_grad = False
        self.head = nn.Sequential(nn.Linear(768, 512), nn.LeakyReLU(.1), nn.Linear(512, 2))
    
    def forward(self, tokens) -> torch.Tensor:
        h = self.lm(**tokens).last_hidden_state
        h = nn.MaxPool1d(kernel_size=100)(h.permute(0, 2, 1)).squeeze(-1)
        return self.head(h)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = Classifier()

from alibi_detect.cd import ClassifierDrift
from alibi_detect.utils.prediction import tokenize_transformer


def make_predictions(model, backend: str, ref_size: int, test_size: int, n_sample: int, stars_h1: int = 4) -> None:
    """ Create drift Classifier detector, init, sample data and make predictions. """
    
    # batch_fn tokenizes each batch of instances of the reference and test set during training
    b = 'pt' if backend == 'pytorch' else 'tf'
    batch_fn = partial(tokenize_transformer, tokenizer=tokenizer, max_len=max_len, backend=b)
    
    for _ in range(n_sample):
        # sample data
        x_ref, y_ref, meta_ref = accumulate_sample(dl_ref, ref_size)
        x_h0, y_h0, meta_h0 = accumulate_sample(dl_h0, test_size)
        x_h1, y_h1, meta_h1 = accumulate_sample(dl_h1, test_size, stars=stars_h1)
        # init and run detector
        # since our classifier returns logits, we set preds_type to 'logits'
        # n_folds determines the number of folds used for cross-validation, this makes sure all 
        #   test data is used but only out-of-fold predictions taken into account for the drift detection
        #   alternatively we can set train_size to a fraction between 0 and 1 and not apply cross-validation
        # epochs specifies how many epochs the classifier will be trained for each sample or fold
        # preprocess_batch_fn is applied to each batch of instances and translates the text into tokens
        dd = ClassifierDrift(x_ref, model, backend=backend, p_val=.05, preds_type='logits', 
                             n_folds=3, epochs=2, preprocess_batch_fn=batch_fn, train_size=None)
        preds_h0 = dd.predict(x_h0)
        preds_h1 = dd.predict(x_h1)
        print_preds(preds_h0, 'H0')
        print_preds(preds_h1, 'H1')

#| scrolled: true
make_predictions(model, 'pytorch', ref_size=1000, test_size=1000, n_sample=2, stars_h1=4)

TensorFlow drift detector

We can do the same using TensorFlow instead of PyTorch as backend. We first define the classifier again and then simply run the detector:

import tensorflow as tf
from tensorflow.keras.layers import Dense, LeakyReLU, MaxPool1D
from transformers import TFDistilBertModel

class ClassifierTF(tf.keras.Model):
    def __init__(self) -> None:
        super(ClassifierTF, self).__init__()
        self.lm = TFDistilBertModel.from_pretrained(model_name)
        self.lm.trainable = False  # freeze language model weights
        self.head = tf.keras.Sequential([Dense(512), LeakyReLU(alpha=.1), Dense(2)])
    
    def call(self, tokens) -> tf.Tensor:
        h = self.lm(**tokens).last_hidden_state
        h = tf.squeeze(MaxPool1D(pool_size=100)(h), axis=1)
        return self.head(h)
    
    @classmethod
    def from_config(cls, config):  # not needed for sequential/functional API models
        return cls(**config)

model = ClassifierTF()

#| scrolled: false
make_predictions(model, 'tensorflow', ref_size=1000, test_size=1000, n_sample=2, stars_h1=4)

Text drift detection on IMDB movie reviews

Method

We detect drift on text data using both the and detectors. In this example notebook we will focus on detecting covariate shift $\Delta p(x)$ as detecting predicted label distribution drift does not differ from other modalities (check and drift on CIFAR-10).

It becomes however a little bit more involved when we want to pick up input data drift $\Delta p(x)$. When we deal with tabular or image data, we can either directly apply the two sample hypothesis test on the input or do the test after a preprocessing step with for instance a randomly initialized encoder as proposed in (they call it an Untrained AutoEncoder or UAE). It is not as straightforward when dealing with text, both in string or tokenized format as they don't directly represent the semantics of the input.

As a result, we extract (contextual) embeddings for the text and detect drift on those. This procedure has a significant impact on the type of drift we detect. Strictly speaking we are not detecting $\Delta p(x)$ anymore since the whole training procedure (objective function, training data etc) for the (pre)trained embeddings has an impact on the embeddings we extract.

The library contains functionality to leverage pre-trained embeddings from but also allows you to easily use your own embeddings of choice. Both options are illustrated with examples in this notebook.

Note

As is done in this example, it is recommended to pass text data to detectors as a list of strings (List[str]). This allows for seamless integration with HuggingFace's transformers library.

One exception to the above is when custom embeddings are used. Here, it is important to ensure that the data is passed to the custom embedding model in a compatible format. In , a preprocess_batch_fn is defined in order to convert list's to the np.ndarray's expected by the custom TensorFlow embedding.

Backend

The method works with both the PyTorch and TensorFlow frameworks for the statistical tests and preprocessing steps. Alibi Detect does however not install PyTorch for you. Check the how to do this.

Dataset

Binary sentiment classification containing $25,000$ movie reviews for training and $25,000$ for testing. Install the nlp library to fetch the dataset:

Load tokenizer

Load data

Let's take a look at respectively a negative and positive review:

We split the original test set in a reference dataset and a dataset which should not be rejected under the H0 of the statistical test. We also create imbalanced datasets and inject selected words in the reference set.

Reference, H0 and imbalanced data:

Inject words in reference data:

Preprocessing

First we need to specify the type of embedding we want to extract from the BERT model. We can extract embeddings from the ...

pooler_output: Last layer hidden-state of the first token of the sequence (classification token; CLS) further processed by a Linear layer and a Tanh activation function. The Linear layer weights are trained from the next sentence prediction (classification) objective during pre-training. Note: this output is usually not a good summary of the semantic content of the input, you’re often better with averaging or pooling the sequence of hidden-states for the whole input sequence.
last_hidden_state: Sequence of hidden states at the output of the last layer of the model, averaged over the tokens.
hidden_state: Hidden states of the model at the output of each layer, averaged over the tokens.
hidden_state_cls: See hidden_state but use the CLS token output.

If hidden_state or hidden_state_cls is used as embedding type, you also need to pass the layer numbers used to extract the embedding from. As an example we extract embeddings from the last 8 hidden states.

Let's check what an embedding looks like:

So the BERT model's embedding space used by the drift detector consists of a $768$-dimensional vector for each instance. We will therefore first apply a dimensionality reduction step with an Untrained AutoEncoder (UAE) before conducting the statistical hypothesis test. We use the embedding model as the input for the UAE which then projects the embedding on a lower dimensional space.

Let's test this again:

K-S detector

Initialize

Detect drift

Let’s first check if drift occurs on a similar sample from the training set as the reference data.

Detect drift on imbalanced and perturbed datasets:

MMD TensorFlow detector

Initialize

Detect drift

H0:

Imbalanced data:

Perturbed data:

MMD PyTorch detector

Initialize

We can run the same detector with PyTorch backend for both the preprocessing step and MMD implementation:

Detect drift

H0:

Imbalanced data:

Perturbed data:

Train embeddings from scratch

So far we used pre-trained embeddings from a BERT model. We can however also use embeddings from a model trained from scratch. First we define and train a simple classification model consisting of an embedding and LSTM layer in TensorFlow.

Load data and train model

Load and tokenize data:

Let's check out an instance:

Define and train a simple model:

Extract the embedding layer from the trained model and combine with UAE preprocessing step:

Again, create reference, H0 and perturbed datasets. Also test against the Reuters news topic classification dataset.

Initialize detector and detect drift

H0:

Perturbed data:

The detector is not as sensitive as the Transformer-based K-S drift detector. The embeddings trained from scratch only trained on a small dataset and a simple model with cross-entropy loss function for 2 epochs. The pre-trained BERT model on the other hand captures semantics of the data better.

Sample from the Reuters dataset: