The FET drift detector is a non-parametric drift detector. It applies Fisher's Exact Test (FET) to each feature, and is intended for application to Bernoulli distributions, with binary univariate data consisting of either (True, False)
or (0, 1)
. This detector is ideal for use in a supervised setting, monitoring drift in a model's instance level accuracy (i.e. correct prediction = 0, and incorrect prediction = 1).
The detector is primarily intended for univariate data, but can also be used in a multivariate setting. For multivariate data, the obtained p-values for each feature are aggregated either via the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls for the probability of at least one false positive. The FDR correction on the other hand allows for an expected fraction of false positives to occur. As with other univariate detectors such as the Kolmogorov-Smirnov detector, for high-dimensional data, we typically want to reduce the dimensionality before computing the feature-wise univariate FET tests and aggregating those via the chosen correction method. See Dimension Reduction for more guidance on this.
For the $j^{th}$ feature, the FET detector considers the 2x2 contingency table between the reference data $x_j^{ref}$ and test data $x_j$ for that feature:

|  | 1's | 0's |
| --- | --- | --- |
| $x_j$ | $N_1$ | $N_0$ |
| $x_j^{ref}$ | $N^{ref}_1$ | $N^{ref}_0$ |
where $N^{ref}_1$ represents the number of 1's in the reference data (for the $j^{th}$ feature), $N^{ref}_0$ the number of 0's, and so on. These values can be used to define an odds ratio:

$$\widehat{OR} = \frac{N_1/N_0}{N^{ref}_1/N^{ref}_0}$$
The null hypothesis is $H_0: \widehat{OR}=1$. In other words, the proportion of 1's to 0's is unchanged between the test and reference distributions, such that the odds of 1's vs 0's is independent of whether the data is drawn from the reference or test distribution. The offline FET detector can perform one-sided or two-sided tests, with the alternative hypothesis set by the alternative
keyword argument:
If alternative='greater'
, the alternative hypothesis is $H_a: \widehat{OR}>1$ i.e. proportion of 1's versus 0's has increased compared to the reference distribution.
If alternative='less'
, the alternative hypothesis is $H_a: \widehat{OR}<1$ i.e. the proportion of 1's versus 0's has decreased compared to the reference distribution.
If alternative='two-sided'
, the alternative hypothesis is $H_a: \widehat{OR} \ne 1$ i.e. the proportion of 1's versus 0's has changed compared to the reference distribution.
The p-value returned by the detector is then the probability of obtaining an odds ratio at least as extreme as that observed (in the direction specified by alternative
), assuming the null hypothesis is true.
Arguments:
x_ref
: Data used as reference distribution. Note this should be the raw data, for example np.array([0, 0, 1, 0, 0, 0])
, not the 2x2 contingency table.
Keyword arguments:
p_val
: p-value used for significance of the FET test. If the FDR correction method is used, this corresponds to the acceptable q-value.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics. Typically a dimensionality reduction technique.
correction
: Correction type for multivariate data. Either 'bonferroni' or 'fdr' (False Discovery Rate).
alternative
: Defines the alternative hypothesis. Options are 'greater' (default), 'less' or 'two-sided'.
n_features
: Number of features used in the FET test. No need to pass it if no preprocessing takes place. In case of a preprocessing step, this can also be inferred automatically but could be more expensive to compute.
input_shape
: Shape of input data.
data_type
: can specify data type added to metadata. E.g. 'tabular' or 'image'.
Initialized drift detector example:
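A minimal sketch of initializing the detector (the toy reference data and parameter values are illustrative, not taken from the original docs):

```python
import numpy as np
from alibi_detect.cd import FETDrift

# reference window of instance-level prediction outcomes:
# 0 = correct prediction, 1 = incorrect prediction (illustrative toy data)
x_ref = np.random.binomial(1, 0.1, size=1000)

cd = FETDrift(x_ref, p_val=0.05, alternative='greater')
```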
We detect data drift by simply calling predict
on a batch of instances x
. We can return the feature-wise p-values before the multivariate correction by setting return_p_val
to True. The drift can also be detected at the feature level by setting drift_type
to 'feature'. No multivariate correction will take place since we return the output of n_features univariate tests. For drift detection on all the features combined with the correction, use 'batch'. return_p_val
equal to True will also return the threshold used by the detector (either for the univariate case or after the multivariate correction).
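A short usage sketch continuing the example above (the test window is again illustrative toy data):

```python
# test window with a higher error rate, so drift should be flagged
x = np.random.binomial(1, 0.25, size=500)

preds = cd.predict(x, drift_type='batch', return_p_val=True, return_distance=True)
print(preds['data']['is_drift'])   # 1 if drift was detected, 0 otherwise
print(preds['data']['p_val'])      # p-value(s) of the FET test(s)
print(preds['data']['threshold'])  # threshold used by the detector
```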
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains feature-level p-values if return_p_val
equals True.
threshold
: for feature-level drift detection the threshold equals the p-value used for the significance of the FET test. Otherwise the threshold after the multivariate correction (either bonferroni or fdr) is returned.
distance
: Feature-wise test statistics between the reference data and the new batch if return_distance
equals True. In this case, the test statistics correspond to the odds ratios.
The Kolmogorov-Smirnov (K-S) drift detector applies feature-wise two-sample K-S tests. For multivariate data, the obtained p-values for each feature are aggregated either via the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls for the probability of at least one false positive. The FDR correction on the other hand allows for an expected fraction of false positives to occur.
For high-dimensional data, we typically want to reduce the dimensionality before computing the feature-wise univariate K-S tests and aggregating those via the chosen correction method. Following suggestions in Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift, we incorporate Untrained AutoEncoders (UAE) and black-box shift detection using the classifier's softmax outputs (BBSDs) as out-of-the box preprocessing methods and note that PCA can also be easily implemented using scikit-learn. Preprocessing methods which do not rely on the classifier will usually pick up drift in the input data, while BBSDs focuses on label shift. The adversarial detector which is part of the library can also be transformed into a drift detector picking up drift that reduces the performance of the classification model. We can therefore combine different preprocessing techniques to figure out if there is drift which hurts the model performance, and whether this drift can be classified as input drift or label shift.
Detecting input data drift (covariate shift) $\Delta p(x)$ for text data requires a custom preprocessing step. We can pick up changes in the semantics of the input by extracting (contextual) embeddings and detect drift on those. Strictly speaking we are not detecting $\Delta p(x)$ anymore since the whole training procedure (objective function, training data etc) for the (pre)trained embeddings has an impact on the embeddings we extract. The library contains functionality to leverage pre-trained embeddings from HuggingFace's transformer package but also allows you to easily use your own embeddings of choice. Both options are illustrated with examples in the Text drift detection on IMDB movie reviews notebook.
Arguments:
x_ref
: Data used as reference distribution.
Keyword arguments:
p_val
: p-value used for significance of the K-S test. If the FDR correction method is used, this corresponds to the acceptable q-value.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics. Typically a dimensionality reduction technique.
correction
: Correction type for multivariate data. Either 'bonferroni' or 'fdr' (False Discovery Rate).
alternative
: Defines the alternative hypothesis. Options are 'two-sided' (default), 'less' or 'greater'.
n_features
: Number of features used in the K-S test. No need to pass it if no preprocessing takes place. In case of a preprocessing step, this can also be inferred automatically but could be more expensive to compute.
input_shape
: Shape of input data.
data_type
: can specify data type added to metadata. E.g. 'tabular' or 'image'.
Initialized drift detector example:
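A minimal sketch (the toy continuous reference data is illustrative):

```python
import numpy as np
from alibi_detect.cd import KSDrift

x_ref = np.random.randn(500, 10).astype(np.float32)  # toy reference data
cd = KSDrift(x_ref, p_val=0.05)
```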
We detect data drift by simply calling predict
on a batch of instances x
. We can return the feature-wise p-values before the multivariate correction by setting return_p_val
to True. The drift can also be detected at the feature level by setting drift_type
to 'feature'. No multivariate correction will take place since we return the output of n_features univariate tests. For drift detection on all the features combined with the correction, use 'batch'. return_p_val
equal to True will also return the threshold used by the detector (either for the univariate case or after the multivariate correction).
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains feature-level p-values if return_p_val
equals True.
threshold
: for feature-level drift detection the threshold equals the p-value used for the significance of the K-S test. Otherwise the threshold after the multivariate correction (either bonferroni or fdr) is returned.
distance
: feature-wise K-S statistics between the reference data and the new batch if return_distance
equals True.
The Chi-Squared drift detector applies feature-wise Chi-Squared tests for the categorical features. For multivariate data, the obtained p-values for each feature are aggregated either via the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls for the probability of at least one false positive. The FDR correction on the other hand allows for an expected fraction of false positives to occur. Similarly to the other drift detectors, a preprocessing step can be applied, but the output features need to be categorical.
Arguments:
x_ref
: Data used as reference distribution.
Keyword arguments:
p_val
: p-value used for significance of the Chi-Squared test. If the FDR correction method is used, this corresponds to the acceptable q-value.
categories_per_feature
: Optional dictionary with as keys the feature column index and as values the number of possible categorical values for that feature or a list with the possible values. If you know how many categories are present for a given feature you could pass this in the categories_per_feature
dict in the Dict[int, int] format, e.g. {0: 3, 3: 2}. If you pass N categories this will assume the possible values for the feature are [0, ..., N-1]. You can also explicitly pass the possible categories in the Dict[int, List[int]] format, e.g. {0: [0, 1, 2], 3: [0, 55]}. Note that the categories can be arbitrary int values. If it is not specified, categories_per_feature
is inferred from x_ref
.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics. Typically a dimensionality reduction technique. Needs to return categorical features for the Chi-Squared detector.
correction
: Correction type for multivariate data. Either 'bonferroni' or 'fdr' (False Discovery Rate).
n_features
: Number of features used in the Chi-Squared test. No need to pass it if no preprocessing takes place. In case of a preprocessing step, this can also be inferred automatically but could be more expensive to compute.
data_type
: can specify data type added to metadata. E.g. 'tabular'.
Initialized drift detector example:
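A minimal sketch (the toy categorical reference data and category counts are illustrative):

```python
import numpy as np
from alibi_detect.cd import ChiSquareDrift

# 2 categorical features with 3 and 2 possible categories respectively
x_ref = np.stack([np.random.choice(3, size=500),
                  np.random.choice(2, size=500)], axis=1)

cd = ChiSquareDrift(x_ref, p_val=0.05, categories_per_feature={0: 3, 1: 2})
```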
We detect data drift by simply calling predict
on a batch of instances x
. We can return the feature-wise p-values before the multivariate correction by setting return_p_val
to True. The drift can also be detected at the feature level by setting drift_type
to 'feature'. No multivariate correction will take place since we return the output of n_features univariate tests. For drift detection on all the features combined with the correction, use 'batch'. return_p_val
equal to True will also return the threshold used by the detector (either for the univariate case or after the multivariate correction).
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains feature-level p-values if return_p_val
equals True.
threshold
: for feature-level drift detection the threshold equals the p-value used for the significance of the Chi-Square test. Otherwise the threshold after the multivariate correction (either bonferroni or fdr) is returned.
distance
: feature-wise Chi-Square test statistics between the reference data and the new batch if return_distance
equals True.
The Maximum Mean Discrepancy (MMD) detector is a kernel-based method for multivariate 2 sample testing. The MMD is a distance-based measure between 2 distributions p and q based on the mean embeddings $\mu_{p}$ and $\mu_{q}$ in a reproducing kernel Hilbert space $F$:

$$MMD(F, p, q) = || \mu_{p} - \mu_{q} ||^2_{F}$$
We can compute unbiased estimates of $MMD^2$ from the samples of the 2 distributions after applying the kernel trick. We use by default a radial basis function kernel, but users are free to pass their own kernel of preference to the detector. We obtain a $p$-value via a permutation test on the values of $MMD^2$.
For high-dimensional data, we typically want to reduce the dimensionality before computing the permutation test. Following suggestions in Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift, we incorporate Untrained AutoEncoders (UAE) and black-box shift detection using the classifier's softmax outputs (BBSDs) as out-of-the box preprocessing methods and note that PCA can also be easily implemented using scikit-learn
. Preprocessing methods which do not rely on the classifier will usually pick up drift in the input data, while BBSDs focuses on label shift.
Detecting input data drift (covariate shift) $\Delta p(x)$ for text data requires a custom preprocessing step. We can pick up changes in the semantics of the input by extracting (contextual) embeddings and detect drift on those. Strictly speaking we are not detecting $\Delta p(x)$ anymore since the whole training procedure (objective function, training data etc) for the (pre)trained embeddings has an impact on the embeddings we extract. The library contains functionality to leverage pre-trained embeddings from HuggingFace's transformer package but also allows you to easily use your own embeddings of choice. Both options are illustrated with examples in the Text drift detection on IMDB movie reviews notebook.
Arguments:
x_ref
: Data used as reference distribution.
Keyword arguments:
backend
: TensorFlow, PyTorch and KeOps implementations of the MMD detector are available. Specify the backend (tensorflow, pytorch or keops). Defaults to tensorflow.
p_val
: p-value used for significance of the permutation test.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics. Typically a dimensionality reduction technique.
kernel
: Kernel used when computing the MMD. Defaults to a Gaussian RBF kernel (from alibi_detect.utils.pytorch import GaussianRBF
, from alibi_detect.utils.tensorflow import GaussianRBF
or from alibi_detect.utils.keops import GaussianRBF
dependent on the backend used). Note that for the KeOps backend, the diagonal entries of the kernel matrices kernel(x_ref, x_ref)
and kernel(x_test, x_test)
should be equal to 1. This is compliant with the default Gaussian RBF kernel.
sigma
: Optional bandwidth for the kernel as a np.ndarray
. We can also average over a number of different bandwidths, e.g. np.array([.5, 1., 1.5])
.
configure_kernel_from_x_ref
: If sigma
is not specified, the detector can infer it via a heuristic and set sigma
to the median (TensorFlow and PyTorch) or the mean pairwise distance between 2 samples (KeOps) by default. If configure_kernel_from_x_ref
is True, we can already set sigma
at initialization of the detector by inferring it from x_ref
, speeding up the prediction step. If set to False, sigma
is computed separately for each test batch at prediction time.
n_permutations
: Number of permutations used in the permutation test.
input_shape
: Optionally pass the shape of the input data.
data_type
: can specify data type added to the metadata. E.g. 'tabular' or 'image'.
Additional PyTorch keyword arguments:
device
: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
Additional KeOps keyword arguments:
batch_size_permutations
: KeOps computes the n_permutations
of the MMD^2 statistics in chunks of batch_size_permutations
. Defaults to 1,000,000.
Initialized drift detector examples for each of the available backends:
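A minimal sketch for each backend (toy reference data; the KeOps backend additionally requires pykeops to be installed):

```python
import numpy as np
from alibi_detect.cd import MMDDrift

x_ref = np.random.randn(500, 10).astype(np.float32)  # toy reference data

cd_tf = MMDDrift(x_ref, backend='tensorflow', p_val=.05)
cd_torch = MMDDrift(x_ref, backend='pytorch', p_val=.05)
cd_keops = MMDDrift(x_ref, backend='keops', p_val=.05)
```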
We can also easily add preprocessing functions for the TensorFlow and PyTorch frameworks. Note that we can also combine for instance a PyTorch preprocessing step with a KeOps detector. The following example uses a randomly initialized image encoder in PyTorch:
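A minimal sketch with a randomly initialized convolutional encoder (the architecture and 3x32x32 input shape are illustrative assumptions):

```python
from functools import partial
import numpy as np
import torch
import torch.nn as nn
from alibi_detect.cd import MMDDrift
from alibi_detect.cd.pytorch import preprocess_drift

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# untrained encoder mapping 3x32x32 images to 32-d representations
encoder_net = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=0), nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=0), nn.ReLU(),
    nn.Conv2d(128, 512, 4, stride=2, padding=0), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(2048, 32)
).to(device).eval()

preprocess_fn = partial(preprocess_drift, model=encoder_net, device=device, batch_size=512)

x_ref = np.random.randn(500, 3, 32, 32).astype(np.float32)  # toy image reference data
cd = MMDDrift(x_ref, backend='pytorch', p_val=.05, preprocess_fn=preprocess_fn)
```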
The same functionality is supported in TensorFlow and the main difference is that you would import from alibi_detect.cd.tensorflow import preprocess_drift
. Other preprocessing steps such as the output of hidden layers of a model or extracted text embeddings using transformer models can be used in a similar way in both frameworks. TensorFlow example for the hidden layer output:
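A minimal sketch using the output of a hidden layer of a classifier as the drift detection representation (the untrained classifier below is purely a stand-in for the model of interest):

```python
from functools import partial
import numpy as np
import tensorflow as tf
from alibi_detect.cd import MMDDrift
from alibi_detect.cd.tensorflow import HiddenOutput, preprocess_drift

# stand-in classifier; in practice this would be the trained model of interest
clf = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(8, 4, strides=2, padding='same', activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# use the penultimate (hidden) layer output as the representation
preprocess_fn = partial(preprocess_drift, model=HiddenOutput(clf, layer=-2), batch_size=128)

x_ref = np.random.randn(500, 32, 32, 3).astype(np.float32)  # toy image reference data
cd = MMDDrift(x_ref, backend='tensorflow', p_val=.05, preprocess_fn=preprocess_fn)
```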
Check out the Drift detection on CIFAR10 example for more details.
Alibi Detect also includes custom text preprocessing steps in both TensorFlow and PyTorch based on Huggingface's transformers package:
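A minimal PyTorch sketch (the model name, layer selection and projection head are illustrative choices):

```python
from functools import partial
import numpy as np
import torch
import torch.nn as nn
from transformers import AutoTokenizer
from alibi_detect.cd import MMDDrift
from alibi_detect.cd.pytorch import preprocess_drift
from alibi_detect.models.pytorch import TransformerEmbedding

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model_name = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)

# embedding extracted from a selection of hidden states of the transformer
embedding = TransformerEmbedding(model_name, embedding_type='hidden_state', layers=[-5, -4, -3, -2, -1])
model = nn.Sequential(embedding, nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 32)).to(device).eval()
preprocess_fn = partial(preprocess_drift, model=model, tokenizer=tokenizer, max_len=100, batch_size=32)

x_ref = np.array(['an illustrative reference sentence', 'another reference sentence'] * 50)  # toy text data
cd = MMDDrift(x_ref, backend='pytorch', p_val=.05, preprocess_fn=preprocess_fn)
```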
Again the same functionality is supported in TensorFlow but with from alibi_detect.cd.tensorflow import preprocess_drift
and from alibi_detect.models.tensorflow import TransformerEmbedding
imports. Check out the Text drift detection on IMDB movie reviews example for more information.
We detect data drift by simply calling predict
on a batch of instances x
. We can return the p-value and the threshold of the permutation test by setting return_p_val
to True and the maximum mean discrepancy metric and threshold by setting return_distance
to True.
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains the p-value if return_p_val
equals True.
threshold
: p-value threshold if return_p_val
equals True.
distance
: MMD^2 metric between the reference data and the new batch if return_distance
equals True.
distance_threshold
: MMD^2 metric value from the permutation test which corresponds to the p-value threshold.
Drift detection on molecular graphs
Scaling up drift detection with KeOps
The least-squares density difference (LSDD) detector is a method for multivariate 2 sample testing. The LSDD between two distributions $p$ and $q$ on $\mathcal{X}$ is defined as

$$LSDD(p, q) = \int_{\mathcal{X}} \big(p(x) - q(x)\big)^2 \, dx$$
Given two samples we can compute an estimate of the $LSDD$ between the two underlying distributions and use it as a test statistic. We then obtain a $p$-value via a permutation test on the values of the $LSDD$ estimates. In practice we actually estimate the LSDD scaled by a factor that maintains numerical stability when dimensionality is high.
Note
$LSDD$ is based on the assumption that a probability density exists for both distributions and hence is only suitable for continuous data. If you are working with tabular data containing categorical variables, we recommend using the TabularDrift detector instead.
For high-dimensional data, we typically want to reduce the dimensionality before computing the permutation test. Following suggestions in Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift, we incorporate Untrained AutoEncoders (UAE) and black-box shift detection using the classifier's softmax outputs (BBSDs) as out-of-the box preprocessing methods and note that PCA can also be easily implemented using scikit-learn
. Preprocessing methods which do not rely on the classifier will usually pick up drift in the input data, while BBSDs focuses on label shift.
Detecting input data drift (covariate shift) $\Delta p(x)$ for text data requires a custom preprocessing step. We can pick up changes in the semantics of the input by extracting (contextual) embeddings and detect drift on those. Strictly speaking we are not detecting $\Delta p(x)$ anymore since the whole training procedure (objective function, training data etc) for the (pre)trained embeddings has an impact on the embeddings we extract. The library contains functionality to leverage pre-trained embeddings from HuggingFace's transformer package but also allows you to easily use your own embeddings of choice. Both options are illustrated with examples in the Text drift detection on IMDB movie reviews notebook.
Arguments:
x_ref
: Data used as reference distribution.
Keyword arguments:
backend
: Both TensorFlow and PyTorch implementations of the LSDD detector as well as various preprocessing steps are available. Specify the backend (tensorflow or pytorch). Defaults to tensorflow.
p_val
: p-value used for significance of the permutation test.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics. Typically a dimensionality reduction technique.
sigma
: Optionally set the bandwidth of the Gaussian kernel used in estimating the LSDD. Can also pass multiple bandwidth values as an array. The kernel evaluation is then averaged over those bandwidths. If sigma
is not specified, the 'median heuristic' is adopted whereby sigma
is set as the median pairwise distance between reference samples.
n_permutations
: Number of permutations used in the permutation test.
n_kernel_centers
: The number of reference samples to use as centers in the Gaussian kernel model used to estimate LSDD. Defaults to 1/20th of the reference data.
lambda_rd_max
: The maximum relative difference between two estimates of LSDD that the regularization parameter lambda is allowed to cause. Defaults to 0.2 as in the paper.
input_shape
: Optionally pass the shape of the input data.
data_type
: can specify data type added to the metadata. E.g. 'tabular' or 'image'.
Additional PyTorch keyword arguments:
device
: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
Initialized drift detector example:
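A minimal sketch (the toy continuous reference data is illustrative):

```python
import numpy as np
from alibi_detect.cd import LSDDDrift

x_ref = np.random.randn(500, 10).astype(np.float32)  # toy reference data
cd = LSDDDrift(x_ref, backend='tensorflow', p_val=.05)
```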
The same detector in PyTorch:
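A one-line sketch with the PyTorch backend, reusing the same toy x_ref:

```python
cd = LSDDDrift(x_ref, backend='pytorch', p_val=.05)
```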
We can also easily add preprocessing functions for both frameworks. The following example uses a randomly initialized image encoder in PyTorch:
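A minimal sketch mirroring the MMD preprocessing example above (the encoder architecture and 3x32x32 input shape are illustrative assumptions):

```python
from functools import partial
import numpy as np
import torch
import torch.nn as nn
from alibi_detect.cd import LSDDDrift
from alibi_detect.cd.pytorch import preprocess_drift

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

encoder_net = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=0), nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=0), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(128 * 6 * 6, 32)   # feature sizes assume 3x32x32 inputs
).to(device).eval()

preprocess_fn = partial(preprocess_drift, model=encoder_net, device=device, batch_size=512)

x_ref = np.random.randn(500, 3, 32, 32).astype(np.float32)  # toy image reference data
cd = LSDDDrift(x_ref, backend='pytorch', p_val=.05, preprocess_fn=preprocess_fn)
```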
The same functionality is supported in TensorFlow and the main difference is that you would import from alibi_detect.cd.tensorflow import preprocess_drift
. Other preprocessing steps such as the output of hidden layers of a model or extracted text embeddings using transformer models can be used in a similar way in both frameworks. TensorFlow example for the hidden layer output:
The LSDDDrift
detector can be used in exactly the same way as the MMDDrift
detector which is further demonstrated in the Drift detection on CIFAR10 example.
Alibi Detect also includes custom text preprocessing steps in both TensorFlow and PyTorch based on Huggingface's transformers package:
Again the same functionality is supported in TensorFlow but with from alibi_detect.cd.tensorflow import preprocess_drift
and from alibi_detect.models.tensorflow import TransformerEmbedding
imports. Check out the Text drift detection on IMDB movie reviews example for more information.
We detect data drift by simply calling predict
on a batch of instances x
. We can return the p-value and the threshold of the permutation test by setting return_p_val
to True and the LSDD metric and threshold by setting return_distance
to True.
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains the p-value if return_p_val
equals True.
threshold
: p-value threshold if return_p_val
equals True.
distance
: LSDD metric between the reference data and the new batch if return_distance
equals True.
distance_threshold
: LSDD metric value from the permutation test which corresponds to the p-value threshold.
See also the Drift detection on CIFAR10 example for the related MMDDrift
detector.
The learned-kernel drift detector (Liu et al., 2020) is an extension of the Maximum Mean Discrepancy drift detector where the kernel used to define the MMD is trained using a portion of the data to maximise an estimate of the resulting test power. Once the kernel has been learned a permutation test is performed in the usual way on the value of the MMD.
This method is closely related to the classifier drift detector which trains a classifier to discriminate between instances from the reference window and instances from the test window. The difference here is that we train a kernel to output high similarity on instances from the same window and low similarity between instances from different windows. If this is possible in a generalisable manner then drift must have occurred.
As with the classifier-based approach, we should specify the proportion of data to use for training and testing respectively as well as training arguments such as the learning rate and batch size. Note that a new kernel is trained for each test set that is passed for detection.
Arguments:
x_ref
: Data used as reference distribution.
kernel
: A differentiable TensorFlow or PyTorch module that takes two sets of instances as inputs and returns a kernel similarity matrix as output.
Keyword arguments:
backend
: TensorFlow, PyTorch and KeOps implementations of the learned kernel detector are available. The backend can be specified as tensorflow, pytorch or keops. Defaults to tensorflow.
p_val
: p-value threshold used for the significance of the test.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed. If the input data type is of type List[Any]
then update_x_ref
needs to be set to None and the reference set remains fixed.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics.
n_permutations
: The number of permutations to use in the permutation test once the MMD has been computed.
var_reg
: Constant added to the estimated variance of the MMD for stability.
reg_loss_fn
: The regularisation term reg_loss_fn(kernel) is added to the loss function being optimized.
train_size
: Optional fraction (float between 0 and 1) of the dataset used to train the classifier. The drift is detected on 1 - train_size.
retrain_from_scratch
: Whether the kernel should be retrained from scratch for each set of test data or whether it should instead continue training from where it left off on the previous set. Defaults to True.
optimizer
: Optimizer used during training of the kernel. From torch.optim
for PyTorch and tf.keras.optimizers
for TensorFlow.
learning_rate
: Learning rate for the optimizer.
batch_size
: Batch size used during training of the kernel.
batch_size_predict
: Batch size used for the trained drift detector predictions.
preprocess_batch_fn
: Optional batch preprocessing function. For example to convert a list of generic objects to a tensor which can be processed by the kernel.
epochs
: Number of training epochs for the kernel.
verbose
: Verbosity level during the training of the kernel. 0 is silent and 1 prints a progress bar.
train_kwargs
: Optional additional kwargs for the built-in TensorFlow (from alibi_detect.models.tensorflow import trainer
) or PyTorch (from alibi_detect.models.pytorch import trainer
) trainer functions.
dataset
: Dataset object used during training of the kernel. Defaults to alibi_detect.utils.pytorch.TorchDataset
(an instance of torch.utils.data.Dataset
) for the PyTorch and KeOps backends and alibi_detect.utils.tensorflow.TFDataset
(an instance of tf.keras.utils.Sequence
) for the TensorFlow backend. For PyTorch or KeOps, the dataset should only take the windows x_ref and x_test as input, so when e.g. TorchDataset is passed to the detector at initialisation, during training TorchDataset(x_ref, x_test) is used. For TensorFlow, the dataset is an instance of tf.keras.utils.Sequence
, so when e.g. TFDataset is passed to the detector at initialisation, during training TFDataset(x_ref, x_test, batch_size=batch_size, shuffle=True) is used. x_ref and x_test can be of type np.ndarray or List[Any].
input_shape
: Shape of input data.
data_type
: Optionally specify the data type (e.g. tabular, image or time-series). Added to metadata.
Additional PyTorch and KeOps keyword arguments:
device
: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
dataloader
: Dataloader object used during training of the kernel. Defaults to torch.utils.data.DataLoader
. The dataloader is not initialized yet; this is done during initialization of the detector using the batch_size
. Custom dataloaders can be passed as well, e.g. for graph data we can use torch_geometric.data.DataLoader
.
num_workers
: The number of workers used by the DataLoader
. The default (num_workers=0
) means multi-process data loading is disabled. Setting num_workers>0
may be unreliable on Windows.
Additional KeOps only keyword arguments:
batch_size_permutations
: KeOps computes the n_permutations
of the MMD^2 statistics in chunks of batch_size_permutations
. Defaults to 1,000,000.
Any differentiable PyTorch or TensorFlow module that takes as input two instances and outputs a scalar (representing similarity) can be used as the kernel for this drift detector. However, in order to ensure that MMD=0 implies no-drift the kernel should satisfy a characteristic property. This can be guaranteed by defining a kernel as
$$k(x,y) = (1-\epsilon)\, k_a(\Phi(x), \Phi(y)) + \epsilon\, k_b(x,y),$$
where $\Phi$ is a learnable projection, $k_a$ and $k_b$ are simple characteristic kernels (such as a Gaussian RBF), and $\epsilon>0$ is a small constant. By letting $\Phi$ be very flexible we can learn powerful kernels in this manner.
This is easily implemented using the DeepKernel
class provided in alibi_detect
. We demonstrate below how we might define a convolutional kernel for images using PyTorch. By default GaussianRBF
kernels are used for $k_a$ and $k_b$ and here we specify $\epsilon=0.01$, but we could alternatively set eps='trainable'
.
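A minimal PyTorch sketch of such a deep kernel (the projection architecture and 3x32x32 input shape are illustrative assumptions):

```python
import torch.nn as nn
from alibi_detect.utils.pytorch import DeepKernel

# learnable projection Phi mapping images to a feature space
proj = nn.Sequential(
    nn.Conv2d(3, 8, 4, stride=2, padding=0), nn.ReLU(),
    nn.Conv2d(8, 16, 4, stride=2, padding=0), nn.ReLU(),
    nn.Conv2d(16, 32, 4, stride=2, padding=0), nn.ReLU(),
    nn.Flatten(),
)

# deep kernel with default GaussianRBF kernels for k_a and k_b and fixed eps
kernel = DeepKernel(proj, eps=0.01)
```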
It is important to note that, if retrain_from_scratch=True
and we have not initialised the kernel bandwidth sigma
for the default GaussianRBF
kernel $k_a$ and optionally also for $k_b$, we will initialise sigma
using a median (PyTorch and TensorFlow) or mean (KeOps) bandwidth heuristic for every detector prediction. For KeOps detectors specifically, this could form a computational bottleneck and should be avoided by already specifying a bandwidth in advance. To do this, we can leverage the library's built-in heuristics:
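A minimal sketch of fixing the bandwidth up front; the median pairwise distance is computed manually here rather than via a library helper, so treat it as an illustrative stand-in for the built-in heuristics (it reuses the proj module from the sketch above):

```python
import numpy as np
import torch
from alibi_detect.utils.pytorch import DeepKernel, GaussianRBF

x_ref = np.random.randn(500, 3, 32, 32).astype(np.float32)  # toy reference data

# median pairwise distance on the (flattened) reference data as the bandwidth
x_flat = torch.as_tensor(x_ref).flatten(1)
pdist = torch.cdist(x_flat, x_flat)
sigma = pdist[pdist > 0].median().reshape(1)

kernel_a = GaussianRBF(sigma=sigma)          # bandwidth fixed in advance
kernel = DeepKernel(proj, kernel_a=kernel_a, eps=0.01)
```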
Instantiating the detector is then as simple as passing the reference data and the kernel as follows:
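A minimal sketch, reusing the toy x_ref and the kernel defined above (training hyperparameters are illustrative):

```python
from alibi_detect.cd import LearnedKernelDrift

cd = LearnedKernelDrift(x_ref, kernel, backend='pytorch', p_val=.05, epochs=10, batch_size=32)
```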
We could have alternatively defined the kernel and instantiated the detector using KeOps:
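A minimal KeOps sketch under the same illustrative assumptions (requires pykeops):

```python
import torch.nn as nn
from alibi_detect.cd import LearnedKernelDrift
from alibi_detect.utils.keops import DeepKernel

proj = nn.Sequential(
    nn.Conv2d(3, 8, 4, stride=2, padding=0), nn.ReLU(),
    nn.Conv2d(8, 16, 4, stride=2, padding=0), nn.ReLU(),
    nn.Flatten(),
)
kernel = DeepKernel(proj, eps=0.01)

cd = LearnedKernelDrift(x_ref, kernel, backend='keops', p_val=.05, epochs=10, batch_size=32)
```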
Or by using TensorFlow as the backend:
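A minimal TensorFlow sketch (architecture and data shapes are again illustrative):

```python
import numpy as np
import tensorflow as tf
from alibi_detect.cd import LearnedKernelDrift
from alibi_detect.utils.tensorflow import DeepKernel

proj = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(8, 4, strides=2, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(16, 4, strides=2, padding='same', activation='relu'),
    tf.keras.layers.Flatten()
])
kernel = DeepKernel(proj, eps=0.01)

x_ref = np.random.randn(500, 32, 32, 3).astype(np.float32)  # toy reference data (channels-last)
cd = LearnedKernelDrift(x_ref, kernel, backend='tensorflow', p_val=.05, epochs=10, batch_size=32)
```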
We detect data drift by simply calling predict
on a batch of instances x
. return_p_val
equal to True will also return the p-value of the test, return_distance
equal to True will return a notion of strength of the drift and return_kernel
equals True will also return the trained kernel.
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
threshold
: the user-defined p-value threshold defining the significance of the test
p_val
: the p-value of the test if return_p_val
equals True.
distance
: MMD^2 metric between the reference data and the new batch if return_distance
equals True.
distance_threshold
: MMD^2 metric value from the permutation test which corresponds to the p-value threshold if return_distance
equals True.
kernel
: The trained kernel if return_kernel
equals True.
Drift detection on molecular graphs
The classifier-based drift detector (Lopez-Paz and Oquab, 2017) simply tries to correctly distinguish instances from the reference set vs. the test set. The classifier is trained to output the probability that a given instance belongs to the test set. If the probabilities it assigns to unseen test instances are significantly higher (as determined by a Kolmogorov-Smirnov test) than those it assigns to unseen reference instances, then the test set must differ from the reference set and drift is flagged. Alternatively, the detector also allows the classifier predictions to be binarized (0 or 1) and a binomial test to be applied to the binarized predictions of the reference vs. the test data. To leverage all the available reference and test data, stratified cross-validation can be applied and the out-of-fold predictions are used for the significance test. Note that a new classifier is trained for each test set or even each fold within the test set.
Arguments:
x_ref
: Data used as reference distribution.
model
: Binary classification model used for drift detection. TensorFlow, PyTorch and Sklearn models are supported.
Keyword arguments:
backend
: Specify the backend (tensorflow, pytorch or sklearn). This depends on the framework of the model
. Defaults to tensorflow.
p_val
: p-value threshold used for the significance of the test.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed. If the input data type is of type List[Any]
then update_x_ref
needs to be set to None and the reference set remains fixed.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics.
preds_type
: Whether the model outputs 'probs' (probabilities - for 'tensorflow', 'pytorch', 'sklearn' models), 'logits' (for 'pytorch', 'tensorflow' models), 'scores' (for 'sklearn' models if decision_function
is supported).
binarize_preds
: Whether to test for discrepancy on soft (e.g. probs/logits/scores) model predictions directly with a K-S test or binarise to 0-1 prediction errors and apply a binomial test. Defaults to False and therefore applies the K-S test.
train_size
: Optional fraction (float between 0 and 1) of the dataset used to train the classifier. The drift is detected on 1 - train_size. Cannot be used in combination with n_folds
.
n_folds
: Optional number of stratified folds used for training. The model preds are then calculated on all the out-of-fold predictions. This allows to leverage all the reference and test data for drift detection at the expense of longer computation. If both train_size
and n_folds
are specified, n_folds
is prioritized.
seed
: Optional random seed for fold selection.
optimizer
: Optimizer used during training of the classifier. From torch.optim
for PyTorch and tf.keras.optimizers
for TensorFlow.
learning_rate
: Learning rate for the optimizer. Only relevant for tensorflow and pytorch backends.
batch_size
: Batch size used during training of the classifier. Only relevant for tensorflow and pytorch backends.
epochs
: Number of training epochs for the classifier. Applies to each fold if n_folds
is specified. Only relevant for tensorflow and pytorch backends.
verbose
: Verbosity level during the training of the classifier. 0 is silent and 1 prints a progress bar. Only relevant for tensorflow and pytorch backends.
train_kwargs
: Optional additional kwargs for the built-in TensorFlow (from alibi_detect.models.tensorflow import trainer
) or PyTorch (from alibi_detect.models.pytorch import trainer
) trainer functions.
dataset
: Dataset object used during training of the classifier. Defaults to alibi_detect.utils.pytorch.TorchDataset
(an instance of torch.utils.data.Dataset
) for the PyTorch backend and alibi_detect.utils.tensorflow.TFDataset
(an instance of tf.keras.utils.Sequence
) for the TensorFlow backend. For PyTorch, the dataset should only take the data x and the array of labels y as input, so when e.g. TorchDataset is passed to the detector at initialisation, during training TorchDataset(x, y) is used. For TensorFlow, the dataset is an instance of tf.keras.utils.Sequence
, so when e.g. TFDataset is passed to the detector at initialisation, during training TFDataset(x, y, batch_size=batch_size, shuffle=True) is used. x can be of type np.ndarray or List[Any] while y is of type np.ndarray.
input_shape
: Shape of input data.
data_type
: Optionally specify the data type (e.g. tabular, image or time-series). Added to metadata.
Additional PyTorch keyword arguments:
device
: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
dataloader
: Dataloader object used during training of the model. Defaults to torch.utils.data.DataLoader
. The dataloader is not initialized yet; this is done during initialization of the detector using the batch_size
. Custom dataloaders can be passed as well, e.g. for graph data we can use torch_geometric.data.DataLoader
.
Additional Sklearn keyword arguments:
use_calibration
: Whether to use calibration. Calibration can be used on top of any model. Only relevant for 'sklearn' backend.
calibration_kwargs
: Optional additional kwargs for calibration. Only relevant for 'sklearn' backend. See https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html for more details.
use_oob
: Whether to use out-of-bag (OOB) predictions. Supported only for RandomForestClassifier
.
Initialized TensorFlow drift detector example:
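A minimal sketch (the classifier architecture, toy data and training settings are illustrative assumptions):

```python
import numpy as np
import tensorflow as tf
from alibi_detect.cd import ClassifierDrift

x_ref = np.random.randn(500, 32, 32, 3).astype(np.float32)  # toy image reference data

# binary classifier discriminating reference (class 0) from test (class 1) instances
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(8, 4, strides=2, padding='same', activation='relu'),
    tf.keras.layers.Conv2D(16, 4, strides=2, padding='same', activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(2, activation='softmax')
])

cd = ClassifierDrift(x_ref, model, backend='tensorflow', p_val=.05,
                     preds_type='probs', n_folds=5, epochs=2)
```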
A similar detector using PyTorch:
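A minimal PyTorch sketch under the same illustrative assumptions (channels-first data, logit outputs):

```python
import numpy as np
import torch.nn as nn
from alibi_detect.cd import ClassifierDrift

x_ref = np.random.randn(500, 3, 32, 32).astype(np.float32)  # toy image reference data

model = nn.Sequential(
    nn.Conv2d(3, 8, 4, stride=2, padding=0), nn.ReLU(),
    nn.Conv2d(8, 16, 4, stride=2, padding=0), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 6 * 6, 2)   # logits for the 2 classes; sizes assume 32x32 inputs
)

cd = ClassifierDrift(x_ref, model, backend='pytorch', p_val=.05, preds_type='logits')
```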
We detect data drift by simply calling predict
on a batch of instances x
. return_p_val
equal to True will also return the p-value of the test, return_distance
equal to True will return a notion of strength of the drift and return_probs
equals True also returns the out-of-fold classifier model prediction probabilities on the reference and test data (0 = reference data, 1 = test data) as well as the associated out-of-fold reference and test instances.
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
threshold
: the user-defined threshold defining the significance of the test
p_val
: the p-value of the test if return_p_val
equals True.
distance
: a notion of strength of the drift if return_distance
equals True. Equal to the K-S test statistic assuming binarize_preds
equals False or the relative error reduction over the baseline error expected under the null if binarize_preds
equals True.
probs_ref
: the instance level prediction probability for the reference data x_ref
(0 = reference data, 1 = test data) if return_probs
is True.
probs_test
: the instance level prediction probability for the test data x
if return_probs
is true.
x_ref_oof
: the instances associated with probs_ref
if return_probs
equals True.
x_test_oof
: the instances associated with probs_test
if return_probs
equals True.
The context-aware maximum mean discrepancy drift detector (Cobb and Van Looveren, 2022) is a kernel based method for detecting drift in a manner that can take relevant context into account. A normal drift detector detects when the distributions underlying two sets of samples $\{x^0_i\}_{i=1}^{n_0}$ and $\{x^1_i\}_{i=1}^{n_1}$ differ. A context-aware drift detector only detects differences that cannot be attributed to a corresponding difference between sets of associated context variables $\{c^0_i\}_{i=1}^{n_0}$ and $\{c^1_i\}_{i=1}^{n_1}$.
Context-aware drift detectors afford practitioners the flexibility to specify their desired context variable. It could be a transformation of the data, such as a subset of features, or an unrelated indexing quantity, such as the time or weather. Everything that the practitioner wishes to allow to change between the reference window and test window should be captured within the context variable.
On a technical level, the method operates in a manner similar to the MMD detector. However, instead of using an estimate of the squared difference between kernel mean embeddings of $X_{\text{ref}}$ and $X_{\text{test}}$ as the test statistic, we now use an estimate of the expected squared difference between the kernel mean embeddings of $X_{\text{ref}}|C$ and $X_{\text{test}}|C$. As well as the kernel defined on the space of data $X$ required to define the test statistic, estimating the statistic additionally requires a kernel defined on the space of the context variable $C$. For any given realisation of the test statistic an associated p-value is then computed using a conditional permutation test.
The detector is designed for cases where the training data contains a rich variety of contexts and individual test windows may cover a much more limited subset. It is assumed that the test contexts remain within the support of those observed in the reference set.
Arguments:
x_ref
: Data used as reference distribution.
c_ref
: Context for the reference distribution.
Keyword arguments:
backend
: Both TensorFlow and PyTorch implementations of the context-aware MMD detector as well as various preprocessing steps are available. Specify the backend (tensorflow or pytorch). Defaults to tensorflow.
p_val
: p-value used for significance of the permutation test.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data x_ref
at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
update_ref
: Reference data can optionally be updated to the last N instances seen by the detector. The parameter should be passed as a dictionary {'last': N}.
preprocess_fn
: Function to preprocess the data (x_ref
and x
) before computing the data drift metrics. Typically a dimensionality reduction technique. NOTE: Preprocessing is not applied to the context data.
x_kernel
: Kernel defined on the data x_*
. Defaults to a Gaussian RBF kernel (from alibi_detect.utils.pytorch import GaussianRBF
or from alibi_detect.utils.tensorflow import GaussianRBF
dependent on the backend used).
c_kernel
: Kernel defined on the context c_*
. Defaults to a Gaussian RBF kernel (from alibi_detect.utils.pytorch import GaussianRBF
or from alibi_detect.utils.tensorflow import GaussianRBF
dependent on the backend used).
n_permutations
: Number of permutations used in the conditional permutation test.
prop_c_held
: Proportion of contexts held out to condition on.
n_folds
: Number of cross-validation folds used when tuning the regularisation parameters.
batch_size
: If not None
, then compute batches of MMDs at a time rather than all at once which could lead to memory issues.
input_shape
: Optionally pass the shape of the input data.
data_type
: can specify data type added to the metadata. E.g. 'tabular' or 'image'.
verbose
: Whether or not to print progress during configuration.
Additional PyTorch keyword arguments:
device
: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
Initialized drift detector example with the PyTorch backend:
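A minimal sketch (the toy data and the 1-d context variable are purely illustrative):

```python
import numpy as np
from alibi_detect.cd import ContextMMDDrift

# toy data where x depends on a 1-d context c
c_ref = np.random.uniform(0, 1, size=(500, 1)).astype(np.float32)
x_ref = (c_ref + 0.1 * np.random.randn(500, 1)).astype(np.float32)

cd = ContextMMDDrift(x_ref, c_ref, backend='pytorch', p_val=.05)

# at prediction time the test window's context is passed alongside the data:
# preds = cd.predict(x, c)
```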
The same detector in TensorFlow:
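The equivalent one-line sketch with the TensorFlow backend, reusing the same toy x_ref and c_ref:

```python
cd = ContextMMDDrift(x_ref, c_ref, backend='tensorflow', p_val=.05)
```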
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains the p-value if return_p_val
equals True.
threshold
: p-value threshold if return_p_val
equals True.
distance
: conditional MMD^2 metric between the reference data and the new batch if return_distance
equals True.
distance_threshold
: conditional MMD^2 metric value from the permutation test which corresponds to the p-value threshold.
coupling_xx
: coupling matrix $W_\text{ref,ref}$ for the reference data.
coupling_yy
: coupling matrix $W_\text{test,test}$ for the test data.
coupling_xy
: coupling matrix $W_\text{ref,test}$ between the reference and test data.
The spot-the-diff drift detector is an extension of the Classifier drift detector, where the classifier is specified in a manner that makes detections interpretable at the feature level when they occur. The detector is inspired by the work of Jitkrittum et al. (2016) but various major adaptations have been made.
As with the usual classifier-based approach, a portion of the available data is used to train a classifier that can discriminate reference instances from test instances. If the classifier can learn to discriminate in a generalisable manner then drift must have occurred. Here we additionally enforce that the classifier takes the form
$$\text{logit}\big(\hat{p}_T(x)\big) = b_0 + \sum_{i=1}^{J} b_i\, k(x, w_i),$$
where $\hat{p}_T(x)$ is the predicted probability that instance $x$ is from the test window (rather than reference), $k(\cdot,\cdot)$ is a kernel specifying a notion of similarity between instances, $w_i$ are learnable test locations and $b_i$ are learnable regression coefficients.
If the detector flags drift and $b_i >0$ then we know that it reached its decision by considering how similar each instance is to the instance $w_i$, with those being more similar being more likely to be test instances than reference instances. Alternatively if $b_i < 0$ then instances more similar to $w_i$ were deemed more likely to be reference instances.
In order to provide less noisy and therefore more interpretable results, we define each test location as $w_i = \bar{x} + d_i$, where $\bar{x}$ is the mean reference instance. We may then interpret $d_i$ as the additive transformation deemed to make the average reference instance more ($b_i>0$) or less ($b_i<0$) similar to a test instance. Defining the test locations in this way allows us to instead learn the difference $d_i$ and apply regularisation such that non-zero values must be justified by improved classification performance. This allows us to more clearly identify which features any detected drift should be attributed to.
As with the standard classifier-based approach, we should specify the proportion of data to use for training and testing respectively as well as training arguments such as the learning rate and batch size. Note that a new classifier is trained for each test set that is passed for detection.
Arguments:
x_ref
: Data used as reference distribution.
Keyword arguments:
backend
: Specify the backend (tensorflow or pytorch) to use for defining the kernel and training the test locations/differences.
p_val
: p-value threshold used for the significance of the test.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics.
kernel
: A differentiable TensorFlow or PyTorch module that takes two instances as input and returns a scalar notion of similarity as output. Defaults to a Gaussian radial basis function.
n_diffs
: The number of test locations to use, each corresponding to an interpretable difference.
initial_diffs
: Array used to initialise the diffs that will be learned. Defaults to Gaussian for each feature with equal variance to that of reference data.
l1_reg
: Strength of l1 regularisation to apply to the differences.
binarize_preds
: Whether to test for discrepancy on soft (e.g. probs/logits) model predictions directly with a K-S test or binarise to 0-1 prediction errors and apply a binomial test.
train_size
: Optional fraction (float between 0 and 1) of the dataset used to train the classifier. The drift is detected on 1 - train_size. Cannot be used in combination with n_folds
.
n_folds
: Optional number of stratified folds used for training. The model preds are then calculated on all the out-of-fold instances. This allows to leverage all the reference and test data for drift detection at the expense of longer computation. If both train_size
and n_folds
are specified, n_folds
is prioritized.
retrain_from_scratch
: Whether the classifier should be retrained from scratch for each set of test data or whether it should instead continue training from where it left off on the previous set.
seed
: Optional random seed for fold selection.
optimizer
: Optimizer used during training of the kernel. From torch.optim
for PyTorch and tf.keras.optimizers
for TensorFlow.
learning_rate
: Learning rate for the optimizer.
batch_size
: Batch size used during training of the kernel.
preprocess_batch_fn
: Optional batch preprocessing function. For example to convert a list of generic objects to a tensor which can be processed by the kernel.
epochs
: Number of training epochs for the kernel.
verbose
: Verbosity level during the training of the kernel. 0 is silent and 1 prints a progress bar.
train_kwargs
: Optional additional kwargs for the built-in TensorFlow (from alibi_detect.models.tensorflow import trainer
) or PyTorch (from alibi_detect.models.pytorch import trainer
) trainer functions.
dataset
: Dataset object used during training of the classifier. Defaults to alibi_detect.utils.pytorch.TorchDataset
(an instance of torch.utils.data.Dataset
) for the PyTorch backend and alibi_detect.utils.tensorflow.TFDataset
(an instance of tf.keras.utils.Sequence
) for the TensorFlow backend. For PyTorch, the dataset should only take the data x and the array of labels y as input, so when e.g. TorchDataset is passed to the detector at initialisation, during training TorchDataset(x, y) is used. For TensorFlow, the dataset is an instance of tf.keras.utils.Sequence
, so when e.g. TFDataset is passed to the detector at initialisation, during training TFDataset(x, y, batch_size=batch_size, shuffle=True) is used. x can be of type np.ndarray or List[Any] while y is of type np.ndarray.
input_shape
: Shape of input data.
data_type
: Optionally specify the data type (e.g. tabular, image or time-series). Added to metadata.
Additional PyTorch keyword arguments:
device
: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
dataloader
: Dataloader object used during training of the classifier. Defaults to torch.utils.data.DataLoader
. The dataloader is not initialized yet; this is done during init of the detector using the batch_size
. Custom dataloaders can be passed as well, e.g. for graph data we can use torch_geometric.data.DataLoader
.
Instantiating the detector is as simple as passing your reference data and selecting a backend, but you should also consider the number of "diffs" you would like your model to use to discriminate reference from test instances and the strength of regularisation you would like to apply to them.
Using n_diffs=1
is the simplest to interpret and seems to work well in practice. Using more diffs may result in stronger detection power but the diffs may be harder to interpret due to interactions and conditional dependencies.
The strength of the regularisation (l1_reg
) to apply to the diffs should also be specified. Stronger regularisation results in sparser diffs as the classifier is encouraged to discriminate using fewer features. This may make the diff more interpretable but may again come at the cost of detection power.
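For example, a minimal instantiation sketch assuming x_ref is a NumPy array of reference instances and the PyTorch backend is available; the n_diffs, l1_reg and training values below are illustrative rather than recommended settings:

```python
from alibi_detect.cd import SpotTheDiffDrift

cd = SpotTheDiffDrift(
    x_ref,                # reference data
    backend='pytorch',    # or 'tensorflow'
    p_val=.05,
    n_diffs=1,            # a single diff is easiest to interpret
    l1_reg=1e-3,          # sparsity-inducing regularisation on the diffs
    epochs=10,
    batch_size=32
)
```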
Alternatively we could have used the TensorFlow backend and defined a deep kernel with a convolutional structure:
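A sketch of what this could look like for image data; the convolutional projection, input shape and DeepKernel settings below are illustrative assumptions rather than required values:

```python
import tensorflow as tf
from alibi_detect.cd import SpotTheDiffDrift
from alibi_detect.utils.tensorflow import DeepKernel

# projection with a convolutional structure, here for (32, 32, 3) image inputs
proj = tf.keras.Sequential([
    tf.keras.layers.InputLayer((32, 32, 3)),
    tf.keras.layers.Conv2D(8, 4, strides=2, padding='same', activation=tf.nn.relu),
    tf.keras.layers.Conv2D(16, 4, strides=2, padding='same', activation=tf.nn.relu),
    tf.keras.layers.Flatten()
])
kernel = DeepKernel(proj, eps=0.01)  # deep kernel wrapping the projection

cd = SpotTheDiffDrift(x_ref, backend='tensorflow', kernel=kernel, n_diffs=1, l1_reg=1e-3)
```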
We detect data drift by simply calling predict
on a batch of instances x
. return_p_val
equal to True will also return the p-value of the test, return_distance
equal to True will return a notion of strength of the drift, return_probs
equals True returns the out-of-fold classifier model prediction probabilities on the reference and test data (0 = reference data, 1 = test data) as well as the associated out-of-fold reference and test instances, and return_kernel
equals True will also return the trained kernel.
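For instance (the flag names follow the description above; x is an assumed batch of test instances):

```python
preds = cd.predict(x, return_p_val=True, return_distance=True,
                   return_probs=True, return_kernel=True)
print(preds['data']['is_drift'], preds['data']['p_val'])
print(preds['data']['diffs'])   # the learned, interpretable diffs
```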
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
diffs
: a numpy array containing the diffs used to discriminate reference from test instances.
diff_coeffs
: a coefficient corresponding to each diff, where a coefficient greater than zero implies that the corresponding diff makes the average reference instance more similar to a test instance on average, and a coefficient less than zero implies less similar.
threshold
: the user-defined p-value threshold defining the significance of the test.
p_val
: the p-value of the test if return_p_val
equals True.
distance
: a notion of strength of the drift if return_distance
equals True. Equal to the K-S test statistic assuming binarize_preds
equals False or the relative error reduction over the baseline error expected under the null if binarize_preds
equals True.
probs_ref
: the instance level prediction probability for the reference data x_ref
(0 = reference data, 1 = test data) if return_probs
is True.
probs_test
: the instance level prediction probability for the test data x
if return_probs
is True.
x_ref_oof
: the instances associated with probs_ref
if return_probs
equals True.
x_test_oof
: the instances associated with probs_test
if return_probs
equals True.
kernel
: The trained kernel if return_kernel
equals True.
Model-uncertainty drift detectors aim to directly detect drift that's likely to affect the performance of a model of interest. The approach is to test for change in the number of instances falling into regions of the input space on which the model is uncertain in its predictions. For each instance in the reference set the detector obtains the model's prediction and some associated notion of uncertainty. For example, for a classifier this may be the entropy of the predicted label probabilities, while for a regressor with dropout layers, Monte Carlo dropout can be used to provide a notion of uncertainty. The same is done for the test set and if significant differences in uncertainty are detected (via a Kolmogorov-Smirnov test) then drift is flagged. The detector's reference set should be disjoint from the model's training set (on which the model's confidence may be higher).
ClassifierUncertaintyDrift
should be used with classification models whereas RegressorUncertaintyDrift
should be used with regression models. They are used in much the same way.
By default ClassifierUncertaintyDrift
uses uncertainty_type='entropy'
as the notion of uncertainty for classifier predictions and a two-sample test is performed on these continuous values. However uncertainty_type='margin'
can also be specified to deem the classifier's predictions uncertain if they fall within a margin (e.g. in [0.45, 0.55] for binary classifier probabilities), in which case a two-sample test is performed on these 0-1 flags of uncertainty.
By default RegressorUncertaintyDrift
uses uncertainty_type='mc_dropout'
and assumes a PyTorch or TensorFlow model with dropout layers as the regressor. This evaluates the model under multiple dropout configurations and uses the variation as the notion of uncertainty. Alternatively a model that outputs (for each instance) a vector of independent model predictions can be passed and uncertainty_type='ensemble'
can be specified. Again the variation is taken as the notion of uncertainty and in both cases a Kolmogorov-Smirnov two-sample test is performed on the continuous notions of uncertainty.
Arguments:
x_ref
: Data used as reference distribution. Should be disjoint from the model's training set.
model
: The model of interest whose performance we'd like to remain constant.
Keyword arguments:
p_val
: p-value used for the significance of the test.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed.
input_shape
: Optionally pass the shape of the input data.
data_type
: Optionally specify the data type (e.g. tabular, image or time-series). Added to metadata.
ClassifierUncertaintyDrift
-specific keyword arguments:
preds_type
: Type of prediction output by the model. Options are 'probs' (in [0,1]) or 'logits' (in [-inf,inf]).
uncertainty_type
: Method for determining the model's uncertainty for a given instance. Options are 'entropy' or 'margin'.
margin_width
: Width of the margin if uncertainty_type = 'margin'. The model is considered uncertain on an instance if the highest two class probabilities it assigns to the instance differ by less than this.
RegressorUncertaintyDrift
-specific keyword arguments:
uncertainty_type
: Method for determining the model's uncertainty for a given instance. Options are 'mc_dropout' or 'ensemble'. For the former the model should have dropout layers and output a scalar per instance. For the latter the model should output a vector of predictions per instance.
n_evals
: The number of times to evaluate the model under different dropout configurations. Only relevant when using the 'mc_dropout' uncertainty type.
Additional arguments if batch prediction required:
backend
: Framework that was used to define model. Options are 'tensorflow' or 'pytorch'.
batch_size
: Batch size to use to evaluate model. Defaults to 32.
device
: Device type to use. The default None tries to use the GPU and falls back on CPU if needed. Can be specified by passing either 'cuda', 'gpu' or 'cpu'. Only relevant for 'pytorch' backend.
Additional arguments for NLP models:
tokenizer
: Tokenizer to use before passing data to model.
max_len
: Max length to be used by tokenizer.
Drift detector for a TensorFlow classifier outputting probabilities:
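A minimal sketch, assuming clf is a trained tf.keras classifier that outputs class probabilities:

```python
from alibi_detect.cd import ClassifierUncertaintyDrift

cd = ClassifierUncertaintyDrift(
    x_ref, model=clf, backend='tensorflow',
    p_val=.05, preds_type='probs', uncertainty_type='entropy'
)
```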
Drift detector for a PyTorch regressor (with dropout layers) outputting scalars:
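And a sketch for PyTorch, assuming reg is a trained regression model with dropout layers (see the module sketch after the note below); the n_evals value is illustrative:

```python
from alibi_detect.cd import RegressorUncertaintyDrift

cd = RegressorUncertaintyDrift(
    x_ref, model=reg, backend='pytorch',
    p_val=.05, uncertainty_type='mc_dropout', n_evals=100
)
```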
Note that for the PyTorch RegressorUncertaintyDrift detector the dropout layers need to be defined within the nn.Module
init to be able to set them to train mode when computing the uncertainty estimates, e.g.:
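A minimal sketch of such a module (the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class Regressor(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.dense1 = nn.Linear(32, 16)
        # dropout defined in __init__ so the detector can set it to train mode
        # when computing the Monte Carlo dropout uncertainty estimates
        self.dropout = nn.Dropout(0.5)
        self.dense2 = nn.Linear(16, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.dense1(x))
        x = self.dropout(x)
        return self.dense2(x)
```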
We detect data drift by simply calling predict
on a batch of instances x
. return_p_val
equal to True will also return the p-value of the test and return_distance
equal to True will return the test-statistic.
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
threshold
: the user-defined threshold defining the significance of the test.
p_val
: the p-value of the test if return_p_val
equals True.
distance
: the test-statistic if return_distance
equals True.
The online CVM detector is a non-parametric method for online drift detection on continuous data. Like the offline CVM detector, it applies a univariate Cramér-von Mises (CVM) test to each feature. This detector is an adaptation of the approach proposed by Ross et al.
Warning
This detector is multi-threaded, with Numba used to parallelise over the simulated streams. There is a known issue on MacOS, where Numba's default OpenMP threading layer causes segfaults. A workaround is to use the slightly less performant workqueue threading layer on MacOS by setting the NUMBA_THREADING_LAYER environment variable or running:
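For instance, before importing alibi_detect (a sketch using Numba's configuration API; setting the environment variable achieves the same):

```python
from numba import config

# use the workqueue threading layer instead of the default OpenMP layer
config.THREADING_LAYER = 'workqueue'
```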
Online detectors assume the reference data is large and fixed and operate on single data points at a time (rather than batches). These data points are passed into the test-windows, and a two-sample test-statistic between the reference data and test-window is computed at each time-step. When the test-statistic exceeds a preconfigured threshold, drift is detected. Configuration of the thresholds requires specification of the expected run-time (ERT) which specifies how many time-steps that the detector, on average, should run for in the absence of drift before making a false detection. Thresholds are then configured to target this ERT by simulating n_bootstraps
number of streams of length t_max = 2*max(window_sizes) - 1
. Conveniently, the non-parametric nature of the detector means that thresholds depend only on $M$, the length of the reference data set. Therefore, for multivariate data, configuration is only as costly as the univariate case.
Note
In order to reduce the memory requirements of the threshold configuration process, streams are simulated in batches of size $N_{batch}$, set with the batch_size
keyword argument. However, the memory requirements still scale with $O(M^2N_{batch})$. If configuration is requiring too much memory (or time), then consider subsampling the reference data. The quadratic growth of the cost with respect to the number of reference instances $M$, combined with the diminishing increase in test power, often makes this a worthwhile tradeoff.
Specification of test-window sizes (the detector accepts multiple windows of different size $W$) is also required, with smaller windows allowing faster response to severe drift and larger windows allowing more power to detect slight drift. Since this detector requires the windows to be full to function, the ERT is measured from t = min(window_sizes)-1
.
Although this detector is primarily intended for univariate data, it can also be applied to multivariate data. In this case, the detector makes a correction similar to the Bonferroni correction used for the offline detector. Given $d$ features, the detector configures thresholds by targeting the $1-\beta$ quantile of test statistics over the simulated streams, where $\beta = 1 - (1-(1/ERT))^{(1/d)}$. For the univariate case, this simplifies to $\beta = 1/ERT$. At prediction time, drift is flagged if the test statistic of any feature stream exceeds its threshold.
Note
In the multivariate case, for the ERT's upper bound to be accurate, the feature streams must be independent. Regardless of independence, the ERT will still be properly lower bounded.
Arguments:
x_ref
: Data used as reference distribution.
ert
: The expected run-time in the absence of drift, starting from t=min(window_sizes).
window_sizes
: The sizes of the sliding test-windows used to compute the test-statistics. Smaller windows focus on responding quickly to severe drift, larger windows focus on ability to detect slight drift.
Keyword arguments:
preprocess_fn
: Function to preprocess the data before computing the data drift metrics.
n_bootstraps
: The number of bootstrap simulations used to configure the thresholds. The larger this is the more accurately the desired ERT will be targeted. Should ideally be at least an order of magnitude larger than the ERT.
batch_size
: The maximum number of bootstrap simulations to compute in each batch when configuring thresholds. A smaller batch size reduces memory requirements, but can result in a longer configuration run time.
n_features
: Number of features used in the CVM test. No need to pass it if no preprocessing takes place. In case of a preprocessing step, this can also be inferred automatically but could be more expensive to compute.
verbose
: Whether or not to print progress during configuration.
input_shape
: Shape of input data.
data_type
: Optionally specify the data type (tabular, image or time-series). Added to metadata.
Initialized drift detector example:
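A minimal sketch (the ERT, window sizes and number of bootstraps are illustrative):

```python
from alibi_detect.cd import CVMDriftOnline

ert = 150                  # expected run-time in the absence of drift
window_sizes = [20, 40]    # multiple test-window sizes are supported
cd = CVMDriftOnline(x_ref, ert, window_sizes, n_bootstraps=5000)
```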
We detect data drift by sequentially calling predict
on single instances x_t
(no batch dimension) as they each arrive. We can return the test-statistic and the threshold by setting return_test_stat
to True.
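For example, in a streaming loop (x_stream is a hypothetical iterable yielding one instance per time-step):

```python
for x_t in x_stream:
    pred = cd.predict(x_t, return_test_stat=True)
    if pred['data']['is_drift']:
        print(f"Drift detected at t={pred['data']['time']}")
        break
```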
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if any of the test-windows have drifted from the reference data and 0 otherwise.
time
: The number of observations that have so far been passed to the detector as test instances.
ert
: The expected run-time the detector was configured to run at in the absence of drift.
test_stat
: CVM test-statistics between the reference data and the test_windows if return_test_stat
equals True.
threshold
: The values the test-statistics are required to exceed for drift to be detected if return_test_stat
equals True.
The detector's state may be saved with the save_state
method:
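For example (the checkpoint directory is illustrative):

```python
cd.save_state('./detector_state')   # save thresholds, windows and current time-step
```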
The previously saved state may then be loaded via the load_state
method:
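For example, on a freshly initialised detector (same illustrative path as above):

```python
cd.load_state('./detector_state')   # resume from the previously saved time-step
```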
The online Least Squares Density Difference (LSDD) detector is a non-parametric method for online drift detection. The LSDD between two distributions $p$ and $q$ on $\mathcal{X}$ is defined as $LSDD(p,q) = \int_{\mathcal{X}} (p(x)-q(x))^2 \, dx$, and it has an empirical estimate $\widehat{LSDD}(\{X_i\}_{i=1}^N, \{Y_i\}_{i=t}^{t+W})$ that can be updated at low cost as the test window is updated to $\{Y_i\}_{i=t+1}^{t+1+W}$. The detector is motivated by, but is a modified version of, .
Online detectors assume the reference data is large and fixed and operate on single data points at a time (rather than batches). These data points are passed into the test-window and a two-sample test-statistic (in this case an estimate of LSDD) between the reference data and test-window is computed at each time-step. When the test-statistic exceeds a preconfigured threshold, drift is detected. Configuration of the thresholds requires specification of the expected run-time (ERT) which specifies how many time-steps that the detector, on average, should run for in the absence of drift before making a false detection. It also requires specification of a test-window size, with smaller windows allowing faster response to severe drift and larger windows allowing more power to detect slight drift.
For high-dimensional data, we typically want to reduce the dimensionality before passing it to the detector. Following suggestions in , we incorporate Untrained AutoEncoders (UAE) and black-box shift detection using the classifier's softmax outputs () as out-of-the box preprocessing methods and note that can also be easily implemented using scikit-learn
. Preprocessing methods which do not rely on the classifier will usually pick up drift in the input data, while BBSDs focuses on label shift.
Detecting input data drift (covariate shift) $\Delta p(x)$ for text data requires a custom preprocessing step. We can pick up changes in the semantics of the input by extracting (contextual) embeddings and detect drift on those. Strictly speaking we are not detecting $\Delta p(x)$ anymore since the whole training procedure (objective function, training data etc) for the (pre)trained embeddings has an impact on the embeddings we extract. The library contains functionality to leverage pre-trained embeddings from but also allows you to easily use your own embeddings of choice. Both options are illustrated with examples in the notebook.
Arguments:
x_ref
: Data used as reference distribution.
ert
: The expected run-time in the absence of drift, starting from t=0.
window_size
: The size of the sliding test-window used to compute the test-statistic. Smaller windows focus on responding quickly to severe drift, larger windows focus on ability to detect slight drift.
Keyword arguments:
backend
: Backend used for the LSDD implementation and configuration.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics.
sigma
: Optionally set the bandwidth of the Gaussian kernel used in estimating the LSDD. Can also pass multiple bandwidth values as an array. The kernel evaluation is then averaged over those bandwidths. If sigma
is not specified, the 'median heuristic' is adopted whereby sigma
is set as the median pairwise distance between reference samples.
n_bootstraps
: The number of bootstrap simulations used to configure the thresholds. The larger this is the more accurately the desired ERT will be targeted. Should ideally be at least an order of magnitude larger than the ERT.
n_kernel_centers
: The number of reference samples to use as centers in the Gaussian kernel model used to estimate LSDD. Defaults to 2*window_size.
lambda_rd_max
: The maximum relative difference between two estimates of LSDD that the regularization parameter lambda is allowed to cause. Defaults to 0.2 as in the paper.
verbose
: Whether or not to print progress during configuration.
input_shape
: Shape of input data.
data_type
: Optionally specify the data type (tabular, image or time-series). Added to metadata.
Additional PyTorch keyword arguments:
device
: Device type used. The default None tries to use the GPU and falls back on CPU if needed. Can be specified by passing either 'cuda', 'gpu' or 'cpu'. Only relevant for 'pytorch' backend.
Initialized drift detector example:
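A minimal sketch with the TensorFlow backend (the ERT and window size are illustrative):

```python
from alibi_detect.cd import LSDDDriftOnline

cd = LSDDDriftOnline(x_ref, ert=200, window_size=50, backend='tensorflow')
```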
The same detector in PyTorch:
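Sketched with the same illustrative settings:

```python
cd = LSDDDriftOnline(x_ref, ert=200, window_size=50, backend='pytorch')
```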
We can also easily add preprocessing functions for both frameworks. The following example uses a randomly initialized image encoder in PyTorch:
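A sketch along those lines, assuming (3, 32, 32) image inputs; the encoder architecture and encoding dimension are illustrative:

```python
from functools import partial
import torch
import torch.nn as nn
from alibi_detect.cd import LSDDDriftOnline
from alibi_detect.cd.pytorch import preprocess_drift

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# randomly initialised encoder mapping (3, 32, 32) images to 32-dimensional vectors
encoder_net = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=0), nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=0), nn.ReLU(),
    nn.Conv2d(128, 512, 4, stride=2, padding=0), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(2048, 32)
).to(device).eval()

preprocess_fn = partial(preprocess_drift, model=encoder_net, device=device, batch_size=512)
cd = LSDDDriftOnline(x_ref, ert=200, window_size=50, backend='pytorch', preprocess_fn=preprocess_fn)
```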
The same functionality is supported in TensorFlow and the main difference is that you would import from alibi_detect.cd.tensorflow import preprocess_drift
. Other preprocessing steps such as the output of hidden layers of a model or extracted text embeddings using transformer models can be used in a similar way in both frameworks. TensorFlow example for the hidden layer output:
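For example, assuming clf is a trained tf.keras classifier (the layer index is illustrative):

```python
from functools import partial
from alibi_detect.cd import LSDDDriftOnline
from alibi_detect.cd.tensorflow import HiddenOutput, preprocess_drift

# use the output of one of the classifier's layers as the preprocessing step
preprocess_fn = partial(preprocess_drift, model=HiddenOutput(clf, layer=-1), batch_size=128)
cd = LSDDDriftOnline(x_ref, ert=200, window_size=50, backend='tensorflow', preprocess_fn=preprocess_fn)
```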
Again the same functionality is supported in TensorFlow but with from alibi_detect.cd.tensorflow import preprocess_drift
and from alibi_detect.models.tensorflow import TransformerEmbedding
imports.
We detect data drift by sequentially calling predict
on single instances x_t
(no batch dimension) as they each arrive. We can return the test-statistic and the threshold by setting return_test_stat
to True.
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the test-window (of the most recent window_size
observations) has drifted from the reference data and 0 otherwise.
time
: The number of observations that have so far been passed to the detector as test instances.
ert
: The expected run-time the detector was configured to run at in the absence of drift.
test_stat
: LSDD metric between the reference data and the test_window if return_test_stat
equals True.
threshold
: The value the test-statistic is required to exceed for drift to be detected if return_test_stat
equals True.
The detector's state may be saved with the save_state
method:
The previously saved state may then be loaded via the load_state
method:
The online Maximum Mean Discrepancy (MMD) detector is a kernel-based method for online drift detection. The MMD is a distance-based measure between 2 distributions $p$ and $q$ based on the mean embeddings $\mu_{p}$ and $\mu_{q}$ in a reproducing kernel Hilbert space $F$:
$MMD(F, p, q) = || \mu_{p} - \mu_{q} ||^2_{F}$
Given reference samples $\{X_i\}_{i=1}^{N}$ and test samples $\{Y_i\}_{i=t}^{t+W}$ we may compute an unbiased estimate $\widehat{MMD}^2(F, \{X_i\}_{i=1}^N, \{Y_i\}_{i=t}^{t+W})$ of the squared MMD between the two underlying distributions. The estimate can be updated at low cost as new data points enter into the test-window. We use a Gaussian RBF kernel by default, but users are free to pass their own kernel of preference to the detector.
Online detectors assume the reference data is large and fixed and operate on single data points at a time (rather than batches). These data points are passed into the test-window and a two-sample test-statistic (in this case squared MMD) between the reference data and test-window is computed at each time-step. When the test-statistic exceeds a preconfigured threshold, drift is detected. Configuration of the thresholds requires specification of the expected run-time (ERT) which specifies how many time-steps that the detector, on average, should run for in the absence of drift before making a false detection. It also requires specification of a test-window size, with smaller windows allowing faster response to severe drift and larger windows allowing more power to detect slight drift.
For high-dimensional data, we typically want to reduce the dimensionality before passing it to the detector. Following suggestions in , we incorporate Untrained AutoEncoders (UAE) and black-box shift detection using the classifier's softmax outputs () as out-of-the box preprocessing methods and note that can also be easily implemented using scikit-learn
. Preprocessing methods which do not rely on the classifier will usually pick up drift in the input data, while BBSDs focuses on label shift.
Detecting input data drift (covariate shift) $\Delta p(x)$ for text data requires a custom preprocessing step. We can pick up changes in the semantics of the input by extracting (contextual) embeddings and detect drift on those. Strictly speaking we are not detecting $\Delta p(x)$ anymore since the whole training procedure (objective function, training data etc) for the (pre)trained embeddings has an impact on the embeddings we extract. The library contains functionality to leverage pre-trained embeddings from but also allows you to easily use your own embeddings of choice. Both options are illustrated with examples in the notebook.
Arguments:
x_ref
: Data used as reference distribution.
ert
: The expected run-time in the absence of drift, starting from t=0.
window_size
: The size of the sliding test-window used to compute the test-statistic. Smaller windows focus on responding quickly to severe drift, larger windows focus on ability to detect slight drift.
Keyword arguments:
backend
: Backend used for the MMD implementation and configuration.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics.
kernel
: Kernel used for the MMD computation, defaults to Gaussian RBF kernel.
sigma
: Optionally set the GaussianRBF kernel bandwidth. Can also pass multiple bandwidth values as an array. The kernel evaluation is then averaged over those bandwidths. If sigma
is not specified, the 'median heuristic' is adopted whereby sigma
is set as the median pairwise distance between reference samples.
n_bootstraps
: The number of bootstrap simulations used to configure the thresholds. The larger this is the more accurately the desired ERT will be targeted. Should ideally be at least an order of magnitude larger than the ERT.
verbose
: Whether or not to print progress during configuration.
input_shape
: Shape of input data.
data_type
: Optionally specify the data type (tabular, image or time-series). Added to metadata.
Additional PyTorch keyword arguments:
device
: Device type used. The default None tries to use the GPU and falls back on CPU if needed. Can be specified by passing either 'cuda', 'gpu' or 'cpu'. Only relevant for 'pytorch' backend.
Initialized drift detector example:
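A minimal sketch with the TensorFlow backend (the ERT, window size and number of bootstraps are illustrative):

```python
from alibi_detect.cd import MMDDriftOnline

cd = MMDDriftOnline(x_ref, ert=200, window_size=50, backend='tensorflow', n_bootstraps=2500)
```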
The same detector in PyTorch:
We can also easily add preprocessing functions for both frameworks. The following example uses a randomly initialized image encoder in PyTorch:
The same functionality is supported in TensorFlow and the main difference is that you would import from alibi_detect.cd.tensorflow import preprocess_drift
. Other preprocessing steps such as the output of hidden layers of a model or extracted text embeddings using transformer models can be used in a similar way in both frameworks. TensorFlow example for the hidden layer output:
Again the same functionality is supported in TensorFlow but with from alibi_detect.cd.tensorflow import preprocess_drift
and from alibi_detect.models.tensorflow import TransformerEmbedding
imports.
We detect data drift by sequentially calling predict
on single instances x_t
(no batch dimension) as they each arrive. We can return the test-statistic and the threshold by setting return_test_stat
to True.
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the test-window (of the most recent window_size
observations) has drifted from the reference data and 0 otherwise.
time
: The number of observations that have so far been passed to the detector as test instances.
ert
: The expected run-time the detector was configured to run at in the absence of drift.
test_stat
: MMD^2 metric between the reference data and the test_window if return_test_stat
equals True.
threshold
: The value the test-statistic is required to exceed for drift to be detected if return_test_stat
equals True.
The detector's state may be saved with the save_state
method:
The previously saved state may then be loaded via the load_state
method:
The drift detector applies feature-wise two-sample Kolmogorov-Smirnov (K-S) tests for the continuous numerical features and Chi-Squared tests for the categorical features. For multivariate data, the obtained p-values for each feature are aggregated either via the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls for the probability of at least one false positive. The FDR correction on the other hand allows for an expected fraction of false positives to occur. Similarly to the other drift detectors, a preprocessing step could be applied, but the output features need to be categorical.
Arguments:
x_ref
: Data used as reference distribution.
Keyword arguments:
p_val
: p-value used for significance of the K-S and Chi-Squared test across all features. If the FDR correction method is used, this corresponds to the acceptable q-value.
categories_per_feature
: Dictionary with as keys the column indices of the categorical features and optionally as values the number of possible categorical values for that feature or a list with the possible values. If you know which features are categorical and simply want to infer the possible values of the categorical feature from the reference data you can pass a Dict[int, NoneType] such as {0: None, 3: None} if features 0 and 3 are categorical. If you also know how many categories are present for a given feature you could pass this in the categories_per_feature
dict in the Dict[int, int] format, e.g. {0: 3, 3: 2}. If you pass N categories this will assume the possible values for the feature are [0, ..., N-1]. You can also explicitly pass the possible categories in the Dict[int, List[int]] format, e.g. {0: [0, 1, 2], 3: [0, 55]}. Note that the categories can be arbitrary int values.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics. Typically a dimensionality reduction technique.
correction
: Correction type for multivariate data. Either 'bonferroni' or 'fdr' (False Discovery Rate).
alternative
: Defines the alternative hypothesis for the K-S tests. Options are 'two-sided' (default), 'less' or 'greater'. Make sure to use 'two-sided' when mixing categorical and numerical features.
n_features
: Number of features used in the K-S and Chi-Squared tests. No need to pass it if no preprocessing takes place. In case of a preprocessing step, this can also be inferred automatically but could be more expensive to compute.
data_type
: can specify data type added to metadata. E.g. 'tabular'.
Initialized drift detector example:
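A minimal sketch, assuming features 0 and 3 of x_ref are categorical and their possible values should be inferred from the reference data:

```python
from alibi_detect.cd import TabularDrift

categories_per_feature = {0: None, 3: None}   # infer categories from x_ref
cd = TabularDrift(x_ref, p_val=.05, categories_per_feature=categories_per_feature)
```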
We detect data drift by simply calling predict
on a batch of instances x
. We can return the feature-wise p-values before the multivariate correction by setting return_p_val
to True. The drift can also be detected at the feature level by setting drift_type
to 'feature'. No multivariate correction will take place since we return the output of n_features univariate tests. For drift detection on all the features combined with the correction, use 'batch'. return_p_val
equal to True will also return the threshold used by the detector (either for the univariate case or after the multivariate correction).
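For instance (x is an assumed batch of test instances):

```python
# batch-level decision after the multivariate correction
preds = cd.predict(x, drift_type='batch', return_p_val=True, return_distance=True)
print(preds['data']['is_drift'])

# feature-level decisions: one p-value and statistic per feature, no correction applied
preds_feature = cd.predict(x, drift_type='feature')
```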
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains feature-level p-values if return_p_val
equals True.
threshold
: for feature-level drift detection the threshold equals the p-value used for the significance of the K-S and Chi-Squared tests. Otherwise the threshold after the multivariate correction (either bonferroni or fdr) is returned.
distance
: feature-wise K-S or Chi-Squared statistics between the reference data and the new batch if return_distance
equals True.
The online FET detector is a non-parametric method for online drift detection. Like the offline detector, it applies a Fisher's Exact Test (FET) to each feature. It is intended for application to Bernoulli streams, with binary data consisting of either (True, False)
or (0, 1)
. This detector is ideal for use in a supervised setting, monitoring drift in a model's instance level accuracy (i.e. correct prediction = 0, and incorrect prediction = 1).
Online detectors assume the reference data is large and fixed and operate on single data points at a time (rather than batches). These data points are passed into the test-windows, and a two-sample test-statistic (in this case $F=1-\hat{p}$) between the reference data and test-window is computed at each time-step. When the test-statistic exceeds a preconfigured threshold, drift is detected. Configuration of the thresholds requires specification of the expected run-time (ERT) which specifies how many time-steps that the detector, on average, should run for in the absence of drift before making a false detection.
In a similar manner to that proposed by Ross et al. [1], thresholds are configured by simulating n_bootstraps
Bernoulli streams. The length of streams can be set with the t_max
parameter. Since the thresholds are expected to converge after t_max = 2*max(window_sizes) - 1
time steps, we only need to simulate trajectories and estimate thresholds up to this point, and t_max
is set to this value by default. Following , the test statistics are smoothed using an exponential moving average to remove their discreteness, allowing more precise quantiles to be targeted:
$\tilde{F}_t = (1-\lambda)\tilde{F}_{t-1} + \lambda F_t$
For a window size of $W$, at time $t$ the value of the statistic $F_t$ depends on more than just the previous $W$ values. If $\lambda$, set by lam
, is too small, thresholds may keep decreasing well past $2W - 1$ timesteps. To avoid this, the default lam
is set to a high value of $\lambda=0.99$, meaning that discreteness is still broken, but the value of the test statistic depends (almost) solely on the last $W$ observations. If more smoothing is desired, the t_max
parameter can be manually set at a larger value.
Note
The detector must configure thresholds for each window size and each feature. This can be a time consuming process if the number of features is high. For high-dimensional data users are recommended to apply a dimension reduction step via preprocess_fn
.
Specification of test-window sizes (the detector accepts multiple windows of different size $W$) is also required, with smaller windows allowing faster response to severe drift and larger windows allowing more power to detect slight drift. Since this detector requires a window to be full to function, the ERT is measured from t = min(window_sizes)-1
.
Although this detector is primarily intended for univariate data, it can also be applied to multivariate data. In this case, the detector makes a correction similar to the Bonferroni correction used for the offline detector. Given $d$ features, the detector configures thresholds by targeting the $1-\beta$ quantile of test statistics over the simulated streams, where $\beta = 1 - (1-(1/ERT))^{(1/d)}$. For the univariate case, this simplifies to $\beta = 1/ERT$. At prediction time, drift is flagged if the test statistic of any feature stream exceeds its threshold.
Note
In the multivariate case, for the ERT to be accurately targeted the feature streams must be independent.
Arguments:
x_ref
: Data used as reference distribution.
ert
: The expected run-time in the absence of drift, starting from t=min(window_sizes).
window_sizes
: The sizes of the sliding test-windows used to compute the test-statistics. Smaller windows focus on responding quickly to severe drift, larger windows focus on ability to detect slight drift.
Keyword arguments:
preprocess_fn
: Function to preprocess the data before computing the data drift metrics.
n_bootstraps
: The number of bootstrap simulations used to configure the thresholds. The larger this is the more accurately the desired ERT will be targeted. Should ideally be at least an order of magnitude larger than the ERT.
t_max
: Length of streams to simulate when configuring thresholds. If None, this is set to 2 * max(window_sizes
) - 1.
alternative
: Defines the alternative hypothesis. Options are 'greater' (default) or 'less', corresponding to an increase or decrease in the mean of the Bernoulli stream.
lam
: Smoothing coefficient used for exponential moving average. If heavy smoothing is applied (lam
<<1), a larger t_max
may be necessary in order to ensure the thresholds have converged.
n_features
: Number of features used in the FET test. No need to pass it if no preprocessing takes place. In case of a preprocessing step, this can also be inferred automatically but could be more expensive to compute.
verbose
: Whether or not to print progress during configuration.
input_shape
: Shape of input data.
data_type
: Optionally specify the data type (tabular, image or time-series). Added to metadata.
Initialized drift detector example:
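For instance, monitoring a stream of 0-1 prediction errors (the ERT and window sizes are illustrative):

```python
from alibi_detect.cd import FETDriftOnline

cd = FETDriftOnline(x_ref, ert=200, window_sizes=[20, 50], alternative='greater')
```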
We detect data drift by sequentially calling predict
on single instances x_t
(no batch dimension) as they each arrive. We can return the test-statistic and the threshold by setting return_test_stat
to True.
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if any of the test-windows have drifted from the reference data and 0 otherwise.
time
: The number of observations that have so far been passed to the detector as test instances.
ert
: The expected run-time the detector was configured to run at in the absence of drift.
test_stat
: FET test-statistics (1-p_val
) between the reference data and the test_windows if return_test_stat
equals True.
threshold
: The values the test-statistics are required to exceed for drift to be detected if return_test_stat
equals True.
The detector's state may be saved with the save_state
method:
The previously saved state may then be loaded via the load_state
method:
The CVM drift detector is a non-parametric drift detector, which applies feature-wise two-sample Cramér-von Mises (CVM) tests. For two empirical distributions $F(z)$ and $F_{ref}(z)$, the CVM test statistic is defined as
$W = \sum_{z \in k} \left| F(z) - F_{ref}(z) \right|^2$
where $k$ is the joint sample. The CVM test is an alternative to the Kolmogorov-Smirnov (K-S) two-sample test, which uses the maximum distance between two empirical distributions $F(z)$ and $F_{ref}(z)$. By using the full joint sample, the CVM test can exhibit greater power against shifts in higher moments, such as variance changes.
For multivariate data, the detector applies a separate CVM test to each feature, and the p-values obtained for each feature are aggregated either via the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls for the probability of at least one false positive. The FDR correction on the other hand allows for an expected fraction of false positives to occur. As with other univariate detectors such as the Kolmogorov-Smirnov detector, for high-dimensional data, we typically want to reduce the dimensionality before computing the feature-wise univariate CVM tests and aggregating those via the chosen correction method. See Dimension Reduction for more guidance on this.
Arguments:
x_ref
: Data used as reference distribution.
Keyword arguments:
p_val
: p-value used for significance of the CVM test. If the FDR correction method is used, this corresponds to the acceptable q-value.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics.
correction
: Correction type for multivariate data. Either 'bonferroni' or 'fdr' (False Discovery Rate).
n_features
: Number of features used in the CVM test. No need to pass it if no preprocessing takes place. In case of a preprocessing step, this can also be inferred automatically but could be more expensive to compute.
input_shape
: Shape of input data.
data_type
: can specify data type added to metadata. E.g. 'tabular' or 'image'.
Initialized drift detector example:
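A minimal sketch, assuming continuous reference data x_ref:

```python
from alibi_detect.cd import CVMDrift

cd = CVMDrift(x_ref, p_val=.05)
```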
We detect data drift by simply calling predict
on a batch of instances x
. We can return the feature-wise p-values before the multivariate correction by setting return_p_val
to True. The drift can also be detected at the feature level by setting drift_type
to 'feature'. No multivariate correction will take place since we return the output of n_features univariate tests. For drift detection on all the features combined with the correction, use 'batch'. return_p_val
equal to True will also return the threshold used by the detector (either for the univariate case or after the multivariate correction).
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains feature-level p-values if return_p_val
equals True.
threshold
: for feature-level drift detection the threshold equals the p-value used for the significance of the CVM test. Otherwise the threshold after the multivariate correction (either bonferroni or fdr) is returned.
distance
: feature-wise CVM statistics between the reference data and the new batch if return_distance
equals True.
We detect data drift by simply calling predict
on a batch of test or deployment instances x
and contexts c
. We can return the p-value and the threshold of the permutation test by setting return_p_val
to True and the context-aware maximum mean discrepancy metric and threshold by setting return_distance
to True. We can also set return_coupling
to True which additionally returns the coupling matrices $W_\text{ref,test}$, $W_\text{ref,ref}$ and $W_\text{test,test}$. As illustrated in the examples, this can provide deep insights into where the reference and test distributions are similar and where they differ.
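A sketch of such a call, assuming cd is an initialised context-aware MMD detector and x, c are the test instances and their associated contexts:

```python
preds = cd.predict(
    x, c,
    return_p_val=True,
    return_distance=True,
    return_coupling=True   # also returns W_{ref,test}, W_{ref,ref} and W_{test,test}
)
print(preds['data']['is_drift'], preds['data']['p_val'])
```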
Any differentiable PyTorch or TensorFlow module that takes as input two instances and outputs a scalar (representing similarity) can be used as the kernel for this drift detector. By default a simple kernel is used. Keeping the kernel simple can aid interpretability, but alternatively a "deep kernel" of the form $k(x,y) = (1-\epsilon) \, k_a(\Phi(x), \Phi(y)) + \epsilon \, k_b(x,y)$, where $\Phi$ is a (differentiable) projection, $k_a$ and $k_b$ are simple kernels (such as a Gaussian RBF) and $\epsilon>0$ is a small constant, can be used. The DeepKernel
class found in either alibi_detect.utils.tensorflow
or alibi_detect.utils.pytorch
aims to make defining such kernels straightforward. You should not allow too many learnable parameters however as we would like the classifier to discriminate using the test locations rather than kernel parameters.
At any point, the state may be reset to t=0
with the reset_state
method. When saving the detector with save_detector
, the state will be saved, unless t=0
(see ).
Check out the example for more details.
Alibi Detect also includes custom text preprocessing steps in both TensorFlow and PyTorch based on Huggingface's transformers package:
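As a sketch (the model name and layer selection are illustrative, and the exact TransformerEmbedding arguments may differ between versions):

```python
from transformers import AutoTokenizer
from alibi_detect.models.tensorflow import TransformerEmbedding

model_name = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
# pool the hidden states of the last few transformer layers into an embedding
embedding = TransformerEmbedding(model_name, embedding_type='hidden_state',
                                 layers=[-5, -4, -3, -2, -1])
```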
At any point, the state may be reset to t=0
with the reset_state
method. When saving the detector with save_detector
, the state will be saved, unless t=0
(see ).
Check out the example for more details.
Alibi Detect also includes custom text preprocessing steps in both TensorFlow and PyTorch based on Huggingface's transformers package:
At any point, the state may be reset to t=0
with the reset_state
method. When saving the detector with save_detector
, the state will be saved, unless t=0
(see ).
At any point, the state may be reset to t=0
with the reset_state
method. When saving the detector with save_detector
, the state will be saved, unless t=0
(see ).
[1] Ross, G.J., Tasoulis, D.K. & Adams, N.M. Sequential monitoring of a Bernoulli sequence when the pre-change parameter is unknown. Comput Stat 28, 463–479 (2013). doi: . arXiv: .