Drift Detection
Although powerful, modern machine learning models can be sensitive. Seemingly subtle changes in a data distribution can destroy the performance of otherwise state-of-the-art models, which can be especially problematic when ML models are deployed in production. Typically, ML models are tested on held-out data in order to estimate their future performance. Crucially, this assumes that the process underlying the input data $\mathbf{X}$ and output data $\mathbf{Y}$ remains constant.
Drift is said to occur when the process underlying $\mathbf{X}$ and $\mathbf{Y}$ at test time differs from the process that generated the training data. In this case, we can no longer expect the model's performance on test data to match that observed on held-out training data. At test time we always observe the features $\mathbf{X}$; the ground truth refers to the corresponding label $\mathbf{Y}$. If ground truths are available at test time, supervised drift detection can be performed, with the model's predictive performance monitored directly. However, in many scenarios, such as the binary classification example below, ground truths are not available and unsupervised drift detection methods are required.
To explore the different types of drift, consider the common scenario where we deploy a model $f: \boldsymbol{x} \mapsto y$ on input data $\mathbf{X}$ and output data $\mathbf{Y}$, jointly distributed according to $P(\mathbf{X},\mathbf{Y})$. The model is trained on training data drawn from a distribution $P_{ref}(\mathbf{X},\mathbf{Y})$. Drift is said to have occurred when $P(\mathbf{X},\mathbf{Y}) \ne P_{ref}(\mathbf{X},\mathbf{Y})$. Writing the joint distribution as

$$
P(\mathbf{X},\mathbf{Y}) = P(\mathbf{Y}|\mathbf{X})P(\mathbf{X}) = P(\mathbf{X}|\mathbf{Y})P(\mathbf{Y}),
$$

we can classify drift under a number of types:
Covariate drift: Also referred to as input drift, this occurs when the distribution of the input data has shifted $P(\mathbf{X}) \ne P_{ref}(\mathbf{X})$, whilst $P(\mathbf{Y}|\mathbf{X}) = P_{ref}(\mathbf{Y}|\mathbf{X})$. This may result in the model giving unreliable predictions.
Prior drift: Also referred to as label drift, this occurs when the distribution of the outputs has shifted $P(\mathbf{Y}) \ne P_{ref}(\mathbf{Y})$, whilst $P(\mathbf{X}|\mathbf{Y})=P_{ref}(\mathbf{X}|\mathbf{Y})$. This can affect the model's decision boundary, as well as the model's performance metrics.
Concept drift: This occurs when the process generating $y$ from $x$ has changed, such that $P(\mathbf{Y}|\mathbf{X}) \ne P_{ref}(\mathbf{Y}|\mathbf{X})$. It is possible that the model might no longer give a suitable approximation of the true process.
Note that a change in one of the conditional probabilities $P(\mathbf{X}|\mathbf{Y})$ and $P(\mathbf{Y}|\mathbf{X})$ does not necessarily imply a change in the other. For example, consider the pneumonia prediction example from Lipton et al., whereby a classification model $f$ is trained to predict $y$, the occurrence (or not) of pneumonia, given a list of symptoms $\boldsymbol{x}$. During a pneumonia outbreak, $P(\mathbf{Y}|\mathbf{X})$ (e.g. pneumonia given cough) might rise, but the manifestations of the disease $P(\mathbf{X}|\mathbf{Y})$ might not change. In many cases, knowledge of the underlying causal structure of the problem can be used to deduce that one of the conditionals will remain unchanged.
Below, the different types of drift are visualised for a simple two-dimensional classification problem. It is possible for a drift to fall under more than one category, for example the prior drift below also happens to be a case of covariate drift.
It is relatively easy to spot drift by eyeballing these figures. However, the task becomes considerably harder for high-dimensional real problems, especially since real-time ground truths are not typically available. Some types of drift, such as prior and concept drift, are especially difficult to detect without access to ground truths. As a workaround, proxies are required; for example, a model's predictions can be monitored to check for prior drift.
Alibi Detect offers a wide array of methods for detecting drift (see here), some of which are examined in the NeurIPS 2019 paper Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift. Generally, these aim to determine whether the distribution $P(\mathbf{z})$ has drifted from a reference distribution $P_{ref}(\mathbf{z})$, where $\mathbf{z}$ may represent input data $\mathbf{X}$, true output data $\mathbf{Y}$, or some form of model output, depending on what type of drift we wish to detect.
Due to natural randomness in the process being modelled, we don't necessarily expect observations $\mathbf{z}_1,\dots,\mathbf{z}_N$ drawn from $P(\mathbf{z})$ to be identical to $\mathbf{z}^{ref}_1,\dots,\mathbf{z}^{ref}_M$ drawn from $P_{ref}(\mathbf{z})$. To decide whether differences between $P(\mathbf{z})$ and $P_{ref}(\mathbf{z})$ are due to drift or just natural randomness in the data, statistical two-sample hypothesis testing is used, with the null hypothesis $P(\mathbf{z})=P_{ref}(\mathbf{z})$. If the $p$-value obtained is below a given threshold, the null is rejected and the alternative hypothesis $P(\mathbf{z}) \ne P_{ref}(\mathbf{z})$ is accepted, suggesting drift is occurring.
Since $\mathbf{z}$ is often high-dimensional (even a 200 x 200 greyscale image has 40k dimensions!), performing hypothesis testing in the full-space is often either computationally intractable, or unsuitable for the chosen statistical test. Instead, the pipeline below is often used, with dimension reduction as a pre-processing step.
:::{figure} images/drift_pipeline.png
:align: center
:alt: Drift detection pipeline

Figure inspired by Figure 1 in Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift.
:::
Hypothesis testing involves first choosing a test statistic $S(\mathbf{z})$, which is expected to be small if the null hypothesis $H_0$ is true, and large if the alternative $H_a$ is true. For observed data $\mathbf{z}$, $S(\mathbf{z})$ is computed, followed by a $p$-value $\hat{p} = P(\text{such an extreme } S(\mathbf{z}) | H_0)$. In other words, $\hat{p}$ represents the probability of such an extreme value of $S(\mathbf{z})$ occurring given that $H_0$ is true. When $\hat{p}\le \alpha$, results are said to be statistically significant, and the null $P(\mathbf{z})=P_{ref}(\mathbf{z})$ is rejected. Conveniently, the threshold $\alpha$ represents the desired false positive rate.
The test statistics available in Alibi Detect can be broadly split into two categories, univariate and multivariate tests:
Univariate:
Chi-Squared (for categorical data)
Fisher's Exact Test (for binary data)
When applied to multidimensional data with dimension $d$, the univariate tests are applied in a feature-wise manner. The obtained $p$-values for each feature are aggregated either via the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls for the probability of at least one false positive. The FDR correction, on the other hand, allows for an expected fraction of false positives to occur. If the tests (i.e. each feature dimension) are independent, these corrections preserve the desired false positive rate (FPR). However, usually this is not the case, resulting in FPRs up to $d$ times lower than desired, which becomes especially problematic when $d$ is large. Additionally, since the univariate tests examine the feature-wise marginal distributions, they may miss drift in cases where the joint distribution over all $d$ features has changed, but the marginals have not. The multivariate tests avoid these problems, at the cost of greater complexity.
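As a minimal sketch of the feature-wise approach, the library's Kolmogorov-Smirnov detector can be configured with either correction via its `correction` argument (`X_ref` and `X_test` are assumed NumPy arrays of continuous features):

```python
import numpy as np
from alibi_detect.cd import KSDrift

# Feature-wise Kolmogorov-Smirnov tests on each of the d features, with the
# resulting p-values aggregated via the Bonferroni correction ('fdr' also accepted)
cd = KSDrift(X_ref, p_val=.05, correction='bonferroni')
preds = cd.predict(X_test)
```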
Given an input dataset $\mathbf{X}\in \mathbb{R}^{N\times d}$, where $N$ is the number of observations and $d$ the number of dimensions, the aim is to reduce the data dimensionality from $d$ to $K$, where $K\ll d$. A drift detector can then be applied to the lower dimensional data $\hat{\mathbf{X}}\in \mathbb{R}^{N\times K}$, where distances more meaningfully capture notions of similarity/dissimilarity between instances. Dimension reduction approaches can be broadly categorised under:
Linear projections
Non-linear projections
Feature maps (from ML model)
Model uncertainty
Alibi Detect allows for a high degree of flexibility here, with a user's chosen dimension reduction technique able to be incorporated into their chosen detector via the `preprocess_fn` argument (and sometimes `preprocess_batch_fn` and `preprocess_at_init`, depending on the detector). In the following sections, these categories of techniques are briefly introduced. Alibi Detect offers the following functionality using either TensorFlow or PyTorch backends and preprocessing utilities. For more details, see the examples.
This includes dimension reduction techniques such as principal component analysis (PCA) and sparse random projections (SRP). These techniques involve using a transformation or projection matrix $\mathbf{R}$ to reduce the dimensionality of a given data matrix $\mathbf{X}$, such that $\hat{\mathbf{X}} = \mathbf{XR}$. A straightforward way to include such techniques as a pre-processing stage is to pass them to the detectors via the `preprocess_fn` argument, for example for the `scikit-learn` library's `PCA` class:
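A minimal sketch of this pattern, assuming NumPy arrays `X_train`, `X_ref` and `X_test` drawn as described in Note 1 below:

```python
from sklearn.decomposition import PCA
from alibi_detect.cd import MMDDrift

# Fit PCA on training data held out from the reference set (see Note 1)
pca = PCA(n_components=2)
pca.fit(X_train)

# Initialise the MMD detector on the reference data, with the fitted PCA
# transform applied to all data the detector sees
detector = MMDDrift(X_ref, backend='tensorflow', p_val=.05, preprocess_fn=pca.transform)

# Check for drift between the (projected) reference and test data
preds = detector.predict(X_test)
```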
:::{admonition} Note 1: Disjoint training and reference data sets
Astute readers may have noticed that in the code snippet above, the data `X_train` is used to "train" the `PCA` model, but the `MMDDrift` detector is initialised with `X_ref`. This is a subtle yet important point. If a detector's preprocessor (a dimension reduction or other input preprocessing step) is trained on the reference data (`X_ref`), any over-fitting to this data may make the resulting detector overly sensitive to differences between the reference and test data sets.
To avoid an overly discriminative detector, it is customary to draw two disjoint datasets from $P_{ref}(\mathbf{z})$: a training set and a held-out reference set. The training data is used to train any input preprocessing steps, and the detector is then initialised on the reference set and used to detect drift between the reference and test sets. This also applies to the learned drift detectors, which should be trained on the training set, not the reference set.
:::
A common strategy for obtaining non-linear dimension-reducing representations is to use an autoencoder, but other non-linear techniques can also be used. Autoencoders consist of an encoder function $\phi : \mathcal{X} \mapsto \mathcal{H}$ and a decoder function $\psi : \mathcal{H} \mapsto \mathcal{X}$, where the latent space $\mathcal{H}$ has lower dimensionality than the input space $\mathcal{X}$. The output of the encoder $\hat{\mathbf{X}} \in \mathcal{H}$ can then be monitored by the drift detector. Training involves learning both the encoding function $\phi$ and the decoding function $\psi$ in order to reduce the reconstruction loss, e.g. if MSE is used: $\phi, \psi = \arg\min_{\phi, \psi} \lVert \mathbf{X}-(\psi \circ \phi)\mathbf{X}\rVert^2$. However, untrained (randomly initialised) autoencoders can also be used. For example, a `pytorch` autoencoder can be incorporated into a detector by packaging it as a callable function using {func}`~alibi_detect.cd.pytorch.preprocess.preprocess_drift` and {func}`~functools.partial`:
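A minimal sketch, assuming 3x32x32 image inputs in `X_ref` and a purely illustrative (untrained) encoder architecture:

```python
from functools import partial
import torch.nn as nn
from alibi_detect.cd import MMDDrift
from alibi_detect.cd.pytorch import preprocess_drift

# Untrained (randomly initialised) encoder: 3x32x32 inputs -> 32-d latent space
encoder_net = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2),
    nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(128 * 6 * 6, 32)   # assumes 3x32x32 inputs
).eval()

# Package the encoder as a callable preprocessing function
preprocess_fn = partial(preprocess_drift, model=encoder_net, batch_size=512)

# Initialise the detector with the encoder as its preprocessing step
detector = MMDDrift(X_ref, backend='pytorch', p_val=.05, preprocess_fn=preprocess_fn)
```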
Following Detecting and Correcting for Label Shift with Black Box Predictors, feature maps can be extracted from existing pre-trained black-box models such as the image classifier shown below. Instead of using the latent space as the dimensionality-reducing representation, other layers of the model such as the softmax outputs or predicted class-labels can also be extracted and monitored. Since different layers yield different output dimensions, different hypothesis tests are required for each.
:::{figure} images/BBSD.png
:align: center
:alt: Black box shift detection

Figure inspired by this MNIST classification example from the timeserio package.
:::
Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift shows that extracting feature maps from existing models can be an effective technique, which is encouraging since this allows the user to repurpose existing black-box models for use as drift detectors. The syntax for incorporating existing models into drift detectors is similar to the previous autoencoder example, with the added step of using {class}`~alibi_detect.cd.tensorflow.preprocess.HiddenOutput` to select the model's network layer to extract outputs from. The code snippet below is borrowed from Maximum Mean Discrepancy drift detector on CIFAR-10, where the softmax layer of the well-known ResNet-32 model is fed into an `MMDDrift` detector.
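A condensed sketch of that snippet, where `clf` is assumed to be the pre-trained ResNet-32 classifier and `X_ref` a batch of CIFAR-10-like images:

```python
from functools import partial
from alibi_detect.cd import MMDDrift
from alibi_detect.cd.tensorflow import HiddenOutput, preprocess_drift

# Extract the classifier's final (softmax) layer outputs as the representation
preprocess_fn = partial(preprocess_drift, model=HiddenOutput(clf, layer=-1), batch_size=128)

# Initialise the MMD detector on the softmax outputs of the reference data
cd = MMDDrift(X_ref, backend='tensorflow', p_val=.05, preprocess_fn=preprocess_fn)
```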
The model uncertainty-based drift detector uses the ML model of interest itself to detect drift. These detectors aim to directly detect drift that is likely to affect the performance of the model of interest. The approach is to test for change in the number of instances falling into regions of the input space on which the model is uncertain in its predictions. For each instance in the reference set, the detector obtains the model's prediction and some associated notion of uncertainty. The same is done for the test set and, if significant differences in uncertainty are detected (via a Kolmogorov-Smirnov test), drift is flagged. The model's notion of uncertainty depends on the type of model. For a classifier this may be the entropy of the predicted label probabilities. For a regressor with dropout layers, dropout Monte Carlo can be used to provide a notion of uncertainty.
The model uncertainty-based detectors are classed under the dimension reduction category since a model's uncertainty is by definition one-dimensional. However, the syntax for the uncertainty-based detectors is different to that of the other detectors. Instead of passing a pre-processing step to a detector via a `preprocess_fn` (or similar) argument, the dimension reduction (in this case computing a notion of uncertainty) is performed internally by these detectors.
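For example, a sketch using the classifier uncertainty detector, assuming `clf` is a TensorFlow classifier of interest that outputs class probabilities:

```python
from alibi_detect.cd import ClassifierUncertaintyDrift

# The model of interest is passed directly to the detector; uncertainty is
# computed internally as the entropy of the predicted probabilities
cd = ClassifierUncertaintyDrift(X_ref, model=clf, backend='tensorflow',
                                p_val=.05, preds_type='probs')
preds = cd.predict(X_test)
```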
Dimension reduction is a common preprocessing task (e.g. for covariate drift detection on tabular or image data), but some modalities of data (e.g. text and graph data) require other forms of preprocessing in order for drift detection to be performed effectively.
When dealing with text data, performing drift detection on raw strings or tokenized data is not effective since they don't represent the semantics of the input. Instead, we extract contextual embeddings from language transformer models and detect drift on those. This procedure has a significant impact on the type of drift we detect. Strictly speaking, we are no longer detecting covariate/input drift, since the entire training procedure (objective function, training data, etc.) for the (pre)trained embeddings has an impact on the embeddings we extract.
:::{figure} images/BERT.png
:align: center
:alt: The DistilBERT language representation model

Figure based on Jay Alammar's excellent visual guide to the BERT model
:::
Alibi Detect contains functionality to leverage pre-trained embeddings from HuggingFace's transformers package. Popular models such as BERT or DistilBERT (shown above) can be used, but Alibi Detect also allows you to easily use your own embeddings of choice. A subsequent dimension reduction step can also be applied if necessary, as is done in the Text drift detection on IMDB movie reviews example, where the 768-dimensional embeddings from the BERT model are passed through an untrained AutoEncoder to reduce their dimensionality. Alibi Detect allows various types of embeddings to be extracted from transformer models, using {class}`~alibi_detect.models.tensorflow.embedding.TransformerEmbedding`.
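A sketch of extracting hidden-state embeddings from a pre-trained model (the model name and layer selection are illustrative choices):

```python
from alibi_detect.models.tensorflow import TransformerEmbedding

# Hidden-state embeddings averaged over the last 8 layers of a pre-trained
# DistilBERT model; the resulting callable can be wired into a detector's
# preprocessing step
model_name = 'distilbert-base-uncased'
embedding = TransformerEmbedding(model_name, embedding_type='hidden_state',
                                 layers=[-i for i in range(1, 9)])
```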
In a similar manner to text data, graph data requires preprocessing before drift detection can be performed. This can be done by extracting graph embeddings from graph neural network (GNN) encoders, as shown below, and demonstrated in the Drift detection on molecular graphs example.
For a simple example, we'll use the MMD detector to check for drift on the two-dimensional binary classification problem shown previously (see notebook). The MMD detector is a kernel-based method for multivariate two-sample testing. Since the number of dimensions is already low, a dimension reduction step is not necessary here. For a more advanced example using the MMD detector with dimension reduction, check out the Maximum Mean Discrepancy drift detector on CIFAR-10 example.
The true model/process is defined as:
where the slope $s$ is set as $s=-1$.
The reference distribution is defined as a mixture of two Normal distributions:

$$
P_{ref}(\mathbf{X}) = \phi_1\,\mathcal{N}\!\left(\boldsymbol{\mu}_1, \sigma^2\mathbf{I}\right) + \phi_2\,\mathcal{N}\!\left(\boldsymbol{\mu}_2, \sigma^2\mathbf{I}\right),
$$
with the standard deviation set at $\sigma=0.8$, and the weights set to $\phi_1=\phi_2=0.5$. Reference data $\mathbf{X}^{ref}$ and training data $\mathbf{X}^{train}$ (see Note 1) can be generated by sampling from this distribution. The corresponding labels $\mathbf{Y}^{ref}$ and $\mathbf{Y}^{train}$ are obtained by evaluating `true_model()`.
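A minimal sketch of this data-generation step. The component means and the stand-in for `true_model()` below are illustrative placeholders; the exact definitions live in the accompanying notebook:

```python
import numpy as np

sigma, phi = 0.8, np.array([0.5, 0.5])
means = np.array([[-1., -1.], [1., 1.]])   # hypothetical component means

def true_model(X, s=-1.):
    # Hypothetical stand-in for the true labelling process defined in the
    # notebook: label 1 when x_2 lies above the line x_2 = s * x_1
    return (X[:, 1] > s * X[:, 0]).astype(int)

def sample_mixture(n, rng):
    # Pick a mixture component per sample, then draw from the corresponding
    # isotropic Gaussian
    comp = rng.choice(2, size=n, p=phi)
    return means[comp] + sigma * rng.standard_normal((n, 2))

rng = np.random.default_rng(0)
X_ref, X_train = sample_mixture(500, rng), sample_mixture(500, rng)   # disjoint sets (Note 1)
y_ref, y_train = true_model(X_ref), true_model(X_train)
```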
For a model, we choose the well-known decision tree classifier. As well as training the model, this is a good time to initialise the MMD detector with the held-out reference data $\mathbf{X}^{ref}$ by calling:
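A sketch of this step, reusing the arrays generated above:

```python
from sklearn.tree import DecisionTreeClassifier
from alibi_detect.cd import MMDDrift

# Fit the decision tree on the training split
clf = DecisionTreeClassifier().fit(X_train, y_train)

# Initialise the MMD detector on the held-out reference data
detector = MMDDrift(X_ref, backend='tensorflow', p_val=.05)
```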
The significance threshold is set at $\alpha=0.05$, meaning the detector will flag results as drift detected when the computed $p$-value is less than this i.e. $\hat{p}< \alpha$.
Before introducing drift, we first examine the case where no drift is present. We resample from the same mixture of Gaussian distributions to generate test data $\mathbf{X}$. The individual data observations are different, but the underlying distributions are unchanged, hence no true drift is present.
Unsurprisingly, the model's mean test accuracy is relatively high. To run the detector on test data, the `.predict()` method is used:
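For example, continuing the sketch above with fresh samples from the unchanged reference distribution:

```python
X_test = sample_mixture(500, rng)    # test data drawn from the same mixture
preds = detector.predict(X_test)
print(preds['data']['is_drift'])     # 0: no drift detected
print(preds['data']['p_val'])        # p-value, compared against the 0.05 threshold
```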
For the test statistic $S(\mathbf{X})$ (we write $S(\mathbf{X})$ instead of $S(\mathbf{z})$ since the detector is operating on input data), the MMD detector uses the kernel trick to compute unbiased estimates of $\text{MMD}^2$. The $\text{MMD}$ is a distance-based measure between the two distributions $P$ and $P_{ref}$, based on the mean embeddings $\mu$ and $\mu_{ref}$ in a reproducing kernel Hilbert space $F$:

$$
\text{MMD}^2(F, P, P_{ref}) = \lVert \mu - \mu_{ref} \rVert^2_{F}
$$
A $p$-value is then obtained via a permutation test on the estimates of $\text{MMD}^2$. As expected, since we are sampling from the reference distribution $P_{ref}(\mathbf{X})$, the detector's prediction is `'is_drift': 0` here, indicating that drift is not detected. More specifically, the detector's $p$-value (`p_val`) is above the threshold of $\alpha=0.05$ (`threshold`), indicating that no statistically significant drift has been detected. The `.predict()` method also returns $\hat{S}(\mathbf{X})$ (`distance_threshold`), which is the threshold in terms of the test statistic $S(\mathbf{X})$, i.e. when $S(\mathbf{X})\ge \hat{S}(\mathbf{X})$, statistically significant drift has been detected.
To impose covariate drift, we apply a shift to the mean of one of the normal distributions:
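Continuing the sketch, with an illustrative shift applied to the second component's mean:

```python
# Shift the mean of one mixture component to impose covariate drift
means_drift = np.array([[-1., -1.], [3., 3.]])   # hypothetical shifted means

comp = rng.choice(2, size=500, p=phi)
X_drift = means_drift[comp] + sigma * rng.standard_normal((500, 2))

preds = detector.predict(X_drift)
print(preds['data']['is_drift'])    # 1: covariate drift detected
```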
The test data has drifted into a previously unseen region of feature space, and the model is now misclassifying a number of test observations. If true test labels are available, this is easily detectable by monitoring the test accuracy. However, labels are not always available at test time, in which case a drift detector monitoring the covariates comes in handy. In this case, the MMD detector successfully detects the covariate drift.
In a similar manner, a proxy for prior drift can be monitored by initialising a detector on labels from the reference set, and then feeding it a model’s predicted labels:
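One possible sketch of this, using a Chi-squared detector on the categorical labels (with `y_ref`, `clf` and `X_drift` as in the sketches above):

```python
from alibi_detect.cd import ChiSquareDrift

# Initialise on the reference labels, then test the model's predicted labels
label_detector = ChiSquareDrift(y_ref.reshape(-1, 1), p_val=.05)
label_preds = label_detector.predict(clf.predict(X_drift).reshape(-1, 1))
```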
It can often be challenging to specify a test statistic $S(\mathbf{z})$ that is large when drift is present and small otherwise. Alibi Detect offers a number of learned detectors, which try to explicitly learn a test statistic which satisfies this property:
Learned kernel
Classifier
Spot-the-diff(erence)
These detectors can be highly effective, but they require training, which potentially increases data requirements and set-up time. Similarly to when training preprocessing steps, it is important that the learned detectors are trained on training data that is held out from the reference data set (see Note 1). A brief overview of these detectors is given below. For more details, see the detectors' respective pages.
The MMD detector uses a kernel $k(\mathbf{z},\mathbf{z}^{ref})$ to compute unbiased estimates of $\text{MMD}^2$. The user is free to provide their own kernel, but by default a Gaussian RBF kernel is used. The Learned kernel drift detector (Liu et al., 2020) extends this approach by training a kernel to maximise an estimate of the resulting test power. The learned kernel is defined as

$$
k(\mathbf{z},\mathbf{z}^{ref}) = \left(1-\epsilon\right)k_a\!\left(\Phi(\mathbf{z}), \Phi(\mathbf{z}^{ref})\right) + \epsilon\, k_b\!\left(\mathbf{z},\mathbf{z}^{ref}\right),
$$
where $\Phi$ is a learnable projection, $k_a$ and $k_b$ are simple characteristic kernels (such as a Gaussian RBF), and $\epsilon>0$ is a small constant. By letting $\Phi$ be very flexible we can learn powerful kernels in this manner.
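A sketch of configuring such a detector with a PyTorch backend. The projection architecture and input dimensionality `d` are illustrative:

```python
import torch.nn as nn
from alibi_detect.cd import LearnedKernelDrift
from alibi_detect.utils.pytorch.kernels import DeepKernel

d = 32   # assumed input dimensionality

# Learnable projection Phi (illustrative architecture)
proj = nn.Sequential(nn.Linear(d, 128), nn.ReLU(), nn.Linear(128, 32))

# Deep kernel: k_a on the projected inputs plus eps * k_b on the raw inputs,
# with Gaussian RBF kernels used for k_a and k_b by default
kernel = DeepKernel(proj, eps=0.01)

# A portion of the data passed to the detector is used to train the kernel at init
cd = LearnedKernelDrift(X_ref, kernel, backend='pytorch', p_val=.05, epochs=2)
```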
The figure below compares the use of a Gaussian and a learned kernel for identifying differences between two distributions $P$ and $P_{ref}$. The distributions are each equal mixtures of nine Gaussians with the same modes, but each component of $P_{ref}$ is an isotropic Gaussian, whereas the covariance of $P$ differs in each component. The Gaussian kernel (c) treats points isotropically throughout the space, based upon $\lVert \mathbf{z} - \mathbf{z}^{ref} \rVert$ only. The learned kernel (d) behaves differently in different regions of the space, adapting to local structure and therefore allowing better detection of differences between $P$ and $P_{ref}$.
:::{figure} images/deep_kernel.png :align: center :alt: Gaussian and deep kernels
Original image source: Liu et al., 2020. Captions modified to match notation used elsewhere on this page. :::
The classifier-based drift detector (Lopez-Paz and Oquab, 2017) attempts to detect drift by explicitly training a classifier to discriminate between data from the reference and test sets. The statistical test used depends on whether the classifier outputs probabilities or binarized (0 or 1) predictions, but the general idea is to determine whether the classifier's performance is statistically different from random chance. If the classifier can learn to discriminate better than randomly (in a generalisable manner) then drift must have occurred.
Liu et al. show that a classifier-based drift detector is actually a special case of the learned kernel. An important difference is that to train a classifier we maximise its accuracy (or a cross-entropy proxy), while for a learned kernel we maximise the test power directly. Liu et al. show that the latter approach is empirically superior.
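A sketch of the classifier-based detector with a PyTorch backend. The discriminating classifier's architecture and the input dimensionality `d` are illustrative:

```python
import torch.nn as nn
from alibi_detect.cd import ClassifierDrift

d = 32   # assumed input dimensionality

# Classifier trained to discriminate reference from test instances (binary output)
model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 2))

cd = ClassifierDrift(X_ref, model, backend='pytorch', p_val=.05,
                     preds_type='logits', train_size=.75, epochs=2)
preds = cd.predict(X_test)
```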
The spot-the-diff(erence) drift detector is an extension of the Classifier drift detector, where the classifier is specified in a manner that makes detections interpretable at the feature level when they occur. The detector is inspired by the work of Jitkrittum et al. (2016) but various major adaptations have been made.
As with the usual classifier-based approach, a portion of the available data is used to train a classifier that can discriminate reference instances from test instances. However, the spot-the-diff detector is specified such that when drift is detected, we can inspect the weights of the classifier to shine light on exactly which features of the data were used to distinguish reference from test samples, and therefore caused drift to be detected. The Interpretable drift detection with the spot-the-diff detector on MNIST and Wine-Quality datasets example demonstrates this capability.
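A sketch of configuring the detector (the hyperparameter values are illustrative):

```python
from alibi_detect.cd import SpotTheDiffDrift

# Learn n_diffs "difference" instances whose weights indicate which features
# drove a detection; l1 regularisation encourages sparse, interpretable diffs
cd = SpotTheDiffDrift(X_ref, backend='pytorch', p_val=.05, n_diffs=1, l1_reg=1e-3)
preds = cd.predict(X_test)
```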
So far, we have discussed drift detection in an offline context, with the entire test set $\{\mathbf{z}_i\}_{i=1}^{N}$ compared to the reference dataset $\{\mathbf{z}^{ref}_i\}_{i=1}^{M}$. However, at test time, data sometimes arrives sequentially. Here it is desirable to detect drift in an online fashion, allowing us to respond as quickly as possible and limit the damage it might cause.
One approach is to perform a test for drift every $W$ time-steps, using the $W$ samples that have arrived since the last test; in other words, to compare $\{\mathbf{z}_i\}_{i=t-W+1}^{t}$ to $\{\mathbf{z}^{ref}_i\}_{i=1}^{M}$. Such a strategy could be implemented using any of the offline detectors implemented in alibi-detect, but being both sensitive to slight drift and responsive to severe drift is difficult. If the window size $W$ is too large, the delay between consecutive statistical tests hampers responsiveness to severe drift, but an overly small window is unreliable. This is demonstrated below, where the offline MMD detector is used to monitor drift in data $\mathbf{X}$ sampled from a normal distribution $\mathcal{N}\left(\mu,\sigma^2 \right)$ over time $t$, with the mean starting to drift from $\mu=0$ to $\mu=0.5$ at $t=40$.
An alternative strategy is to perform a test each time data arrives. However, the usual offline methods are not applicable because the process for computing $p$-values is too expensive. Additionally, they don't account for correlated test outcomes when using overlapping windows of test data, leading to miscalibrated detectors operating at an unknown False Positive Rate (FPR). Well-calibrated FPRs are crucial for judging the significance of a drift detection. In the absence of calibration, drift detection can be useless since there is no way of knowing what fraction of detections are false positives. To tackle this problem, Alibi Detect offers specialist online drift detectors:
These detectors leverage the calibration method introduced by Cobb et al. (2021) in order to ensure they are well calibrated when used in a sequential manner. The detectors compute a test statistic $S(\mathbf{z})$ during the configuration phase. Then, at test time, the test statistic is updated sequentially at a low cost. When no drift has occurred the test statistic fluctuates around its expected value, and once drift occurs the test statistic starts to drift upwards. When it exceeds some preconfigured threshold value, drift is detected. The online detectors are constructed in a similar manner to the offline detectors, for example for the online MMD detector:
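A sketch of this construction, where the ERT, window size and number of bootstraps are illustrative values:

```python
from alibi_detect.cd import MMDDriftOnline

# Configure the online MMD detector with an expected run-time (ERT) of
# 200 time-steps and a sliding window of 10 instances
cd = MMDDriftOnline(X_ref, ert=200, window_size=10,
                    backend='tensorflow', n_bootstraps=2500)
```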
In addition to providing the detector with reference data, the expected run-time (see below) and the size of the sliding window must also be defined. Another important difference is that the online detectors make predictions on single data instances:
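For example, assuming `stream` is an iterable yielding single observations as they arrive:

```python
# Process instances one at a time as they arrive
for t, x_t in enumerate(stream):
    pred = cd.predict(x_t)
    if pred['data']['is_drift']:
        print(f'Drift detected at time-step {t}')
        break
```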
This can be seen in the animation below, where the online detector considers each incoming observation/sample individually, instead of considering a batch of observations like the offline detectors.
Unlike offline detectors which require the specification of a threshold $p$-value, which is equivalent to a false positive rate (FPR), the online detectors in alibi-detect require the specification of an expected run-time (ERT) (an inverted FPR). This is the number of time-steps that we insist our detectors, on average, should run for in the absence of drift, before making a false detection.
Usually we would like the ERT to be large; however, this results in insensitive detectors which are slow to respond when drift does occur. Hence, there is a trade-off between the expected run-time and the expected detection delay (the time taken for the detector to respond to drift in the data). To target the desired ERT, thresholds are configured during an initial configuration phase via simulation (`n_bootstraps` sets the number of bootstrap simulations used here). This configuration process is only suitable when the amount of reference data is relatively large (ideally around an order of magnitude larger than the desired ERT). Configuration can be expensive (less so with a GPU), but allows the detector to operate at low cost at test time. For a more in-depth explanation, see Drift Detection: An Introduction with Seldon.