The Chi-Squared drift detector applies feature-wise Chi-Squared tests for the categorical features. For multivariate data, the obtained p-values for each feature are aggregated either via the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls for the probability of at least one false positive. The FDR correction on the other hand allows for an expected fraction of false positives to occur. Similarly to the other drift detectors, an optional preprocessing step can be applied, but the output features need to be categorical.
Arguments:
x_ref
: Data used as reference distribution.
Keyword arguments:
p_val
: p-value used for significance of the Chi-Squared test. If the FDR correction method is used, this corresponds to the acceptable q-value.
categories_per_feature
: Optional dictionary with as keys the feature column index and as values the number of possible categorical values for that feature or a list with the possible values. If you know how many categories are present for a given feature you could pass this in the categories_per_feature
dict in the Dict[int, int] format, e.g. {0: 3, 3: 2}. If you pass N categories this will assume the possible values for the feature are [0, ..., N-1]. You can also explicitly pass the possible categories in the Dict[int, List[int]] format, e.g. {0: [0, 1, 2], 3: [0, 55]}. Note that the categories can be arbitrary int values. If it is not specified, categories_per_feature
is inferred from x_ref
.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics. Typically a dimensionality reduction technique. Needs to return categorical features for the Chi-Squared detector.
correction
: Correction type for multivariate data. Either 'bonferroni' or 'fdr' (False Discovery Rate).
n_features
: Number of features used in the Chi-Squared test. No need to pass it if no preprocessing takes place. In case of a preprocessing step, this can also be inferred automatically but could be more expensive to compute.
data_type
: can specify data type added to metadata. E.g. 'tabular'.
Initialized drift detector example:
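A minimal sketch, assuming x_ref is a 2-D NumPy array of categorical features encoded as integers:

```python
from alibi_detect.cd import ChiSquareDrift

# x_ref: reference batch of categorical features encoded as ints, shape (n, n_features)
cd = ChiSquareDrift(x_ref, p_val=.05)
```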
We detect data drift by simply calling predict
on a batch of instances x
. We can return the feature-wise p-values before the multivariate correction by setting return_p_val
to True. The drift can also be detected at the feature level by setting drift_type
to 'feature'. No multivariate correction will take place since we return the output of n_features univariate tests. For drift detection on all the features combined with the correction, use 'batch'. return_p_val
equal to True will also return the threshold used by the detector (either for the univariate case or after the multivariate correction).
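For example, assuming cd is the detector initialized above and x is a batch of new instances:

```python
preds = cd.predict(x, drift_type='batch', return_p_val=True, return_distance=True)
print(preds['data']['is_drift'])  # 1 if drift was detected, 0 otherwise
```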
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains feature-level p-values if return_p_val
equals True.
threshold
: for feature-level drift detection the threshold equals the p-value used for the significance of the Chi-Square test. Otherwise the threshold after the multivariate correction (either bonferroni or fdr) is returned.
distance
: feature-wise Chi-Square test statistics between the reference data and the new batch if return_distance
equals True.
The Maximum Mean Discrepancy (MMD) detector is a kernel-based method for multivariate two-sample testing. The MMD is a distance-based measure between 2 distributions p and q based on the mean embeddings $\mu_{p}$ and $\mu_{q}$ in a reproducing kernel Hilbert space $F$:

$$MMD(F, p, q) = \| \mu_{p} - \mu_{q} \|^2_{F}$$
We can compute unbiased estimates of $MMD^2$ from the samples of the 2 distributions after applying the kernel trick. We use by default a Gaussian RBF kernel, but users are free to pass their own kernel of preference to the detector. We obtain a $p$-value via a permutation test on the values of $MMD^2$.
For high-dimensional data, we typically want to reduce the dimensionality before computing the permutation test. Following suggestions in Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift, we incorporate Untrained AutoEncoders (UAE) and black-box shift detection using the classifier's softmax outputs (BBSDs) as out-of-the-box preprocessing methods and note that PCA can also be easily implemented using scikit-learn
. Preprocessing methods which do not rely on the classifier will usually pick up drift in the input data, while BBSDs focuses on label shift.
Detecting input data drift (covariate shift) $\Delta p(x)$ for text data requires a custom preprocessing step. We can pick up changes in the semantics of the input by extracting (contextual) embeddings and detect drift on those. Strictly speaking we are not detecting $\Delta p(x)$ anymore since the whole training procedure (objective function, training data etc) for the (pre)trained embeddings has an impact on the embeddings we extract. The library contains functionality to leverage pre-trained embeddings from HuggingFace's transformer package but also allows you to easily use your own embeddings of choice. Both options are illustrated with examples in the Text drift detection on IMDB movie reviews notebook.
Arguments:
x_ref
: Data used as reference distribution.
Keyword arguments:
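backend
: TensorFlow, PyTorch and KeOps implementations of the MMD detector are available. Specify the backend (tensorflow, pytorch or keops). Defaults to tensorflow.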
p_val
: p-value used for significance of the permutation test.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
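update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed.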
preprocess_fn
: Function to preprocess the data before computing the data drift metrics. Typically a dimensionality reduction technique.
kernel
: Kernel used when computing the MMD. Defaults to a Gaussian RBF kernel (from alibi_detect.utils.pytorch import GaussianRBF
, from alibi_detect.utils.tensorflow import GaussianRBF
or from alibi_detect.utils.keops import GaussianRBF
dependent on the backend used). Note that for the KeOps backend, the diagonal entries of the kernel matrices kernel(x_ref, x_ref)
and kernel(x_test, x_test)
should be equal to 1. This is compliant with the default Gaussian RBF kernel.
sigma
: Optional bandwidth for the kernel as a np.ndarray
. We can also average over a number of different bandwidths, e.g. np.array([.5, 1., 1.5])
.
configure_kernel_from_x_ref
: If sigma
is not specified, the detector can infer it via a heuristic and set sigma
to the median (TensorFlow and PyTorch) or the mean pairwise distance between 2 samples (KeOps) by default. If configure_kernel_from_x_ref
is True, we can already set sigma
at initialization of the detector by inferring it from x_ref
, speeding up the prediction step. If set to False, sigma
is computed separately for each test batch at prediction time.
n_permutations
: Number of permutations used in the permutation test.
input_shape
: Optionally pass the shape of the input data.
data_type
: can specify data type added to the metadata. E.g. 'tabular' or 'image'.
Additional PyTorch keyword arguments:
device
: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
Additional KeOps keyword arguments:
batch_size_permutations
: KeOps computes the n_permutations
of the MMD^2 statistics in chunks of batch_size_permutations
. Defaults to 1,000,000.
Initialized drift detector examples for each of the available backends:
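Minimal sketches for each backend, assuming x_ref is a NumPy array and the backend-specific dependencies are installed:

```python
from alibi_detect.cd import MMDDrift

cd = MMDDrift(x_ref, backend='tensorflow', p_val=.05)
cd = MMDDrift(x_ref, backend='pytorch', p_val=.05)
cd = MMDDrift(x_ref, backend='keops', p_val=.05)
```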
We can also easily add preprocessing functions for the TensorFlow and PyTorch frameworks. Note that we can also combine for instance a PyTorch preprocessing step with a KeOps detector. The following example uses a randomly initialized image encoder in PyTorch:
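A sketch along those lines, assuming 32x32x3 image inputs; the encoder architecture and encoding dimension are illustrative:

```python
from functools import partial
import torch
import torch.nn as nn
from alibi_detect.cd import MMDDrift
from alibi_detect.cd.pytorch import preprocess_drift

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# randomly initialized image encoder used purely for dimensionality reduction
encoder_net = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(128, 512, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(2048, 32)
).to(device).eval()

# preprocessing function applied to both the reference and test batches
preprocess_fn = partial(preprocess_drift, model=encoder_net, device=device, batch_size=512)

# KeOps detector combined with the PyTorch preprocessing step
cd = MMDDrift(x_ref, backend='keops', p_val=.05, preprocess_fn=preprocess_fn)
```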
The same functionality is supported in TensorFlow and the main difference is that you would import from alibi_detect.cd.tensorflow import preprocess_drift
. Other preprocessing steps such as the output of hidden layers of a model or extracted text embeddings using transformer models can be used in a similar way in both frameworks. TensorFlow example for the hidden layer output:
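A sketch of this variant, where clf stands in for an already trained tf.keras classifier and the layer index and batch size are illustrative:

```python
from functools import partial
from alibi_detect.cd import MMDDrift
from alibi_detect.cd.tensorflow import HiddenOutput, preprocess_drift

# use the output of a hidden (or the softmax) layer of an existing classifier as the preprocessing step
preprocess_fn = partial(preprocess_drift, model=HiddenOutput(clf, layer=-1), batch_size=128)

cd = MMDDrift(x_ref, backend='tensorflow', p_val=.05, preprocess_fn=preprocess_fn)
```

Check out the Drift detection on CIFAR10 example for more details. Alibi Detect also includes custom text preprocessing steps in both TensorFlow and PyTorch based on HuggingFace's transformers package; the TensorFlow variants are available via the from alibi_detect.cd.tensorflow import preprocess_drift and from alibi_detect.models.tensorflow import TransformerEmbedding imports. Check out the Text drift detection on IMDB movie reviews example for more information.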
We detect data drift by simply calling predict
on a batch of instances x
. We can return the p-value and the threshold of the permutation test by setting return_p_val
to True and the maximum mean discrepancy metric and threshold by setting return_distance
to True.
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains the p-value if return_p_val
equals True.
threshold
: p-value threshold if return_p_val
equals True.
distance
: MMD^2 metric between the reference data and the new batch if return_distance
equals True.
distance_threshold
: MMD^2 metric value from the permutation test which corresponds to the p-value threshold.
The CVM drift detector is a non-parametric drift detector which applies feature-wise two-sample Cramér-von Mises (CVM) tests. For two empirical distributions $F(z)$ and $F_{ref}(z)$, the CVM test statistic is defined as

$$W = \sum_{z \in k} \left| F(z) - F_{ref}(z) \right|^2$$

where $k$ is the joint sample. The CVM test is an alternative to the Kolmogorov-Smirnov (K-S) two-sample test, which uses the maximum distance between two empirical distributions $F(z)$ and $F_{ref}(z)$. By using the full joint sample, the CVM test can exhibit greater power against shifts in higher moments, such as variance changes.
For multivariate data, the detector applies a separate CVM test to each feature, and the p-values obtained for each feature are aggregated either via the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls for the probability of at least one false positive. The FDR correction on the other hand allows for an expected fraction of false positives to occur. As with other univariate detectors such as the K-S detector, for high-dimensional data we typically want to reduce the dimensionality before computing the feature-wise univariate CVM tests and aggregating those via the chosen correction method. See the K-S detector section for more guidance on this.
Arguments:
x_ref
: Data used as reference distribution.
Keyword arguments:
p_val
: p-value used for significance of the CVM test. If the FDR correction method is used, this corresponds to the acceptable q-value.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics.
correction
: Correction type for multivariate data. Either 'bonferroni' or 'fdr' (False Discovery Rate).
n_features
: Number of features used in the CVM test. No need to pass it if no preprocessing takes place. In case of a preprocessing step, this can also be inferred automatically but could be more expensive to compute.
input_shape
: Shape of input data.
data_type
: can specify data type added to metadata. E.g. 'tabular' or 'image'.
Initialized drift detector example:
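A minimal sketch, assuming x_ref is a NumPy array of continuous features:

```python
from alibi_detect.cd import CVMDrift

cd = CVMDrift(x_ref, p_val=.05)
```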
We detect data drift by simply calling predict
on a batch of instances x
. We can return the feature-wise p-values before the multivariate correction by setting return_p_val
to True. The drift can also be detected at the feature level by setting drift_type
to 'feature'. No multivariate correction will take place since we return the output of n_features univariate tests. For drift detection on all the features combined with the correction, use 'batch'. return_p_val
equal to True will also return the threshold used by the detector (either for the univariate case or after the multivariate correction).
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains feature-level p-values if return_p_val
equals True.
threshold
: for feature-level drift detection the threshold equals the p-value used for the significance of the CVM test. Otherwise the threshold after the multivariate correction (either bonferroni or fdr) is returned.
distance
: feature-wise CVM statistics between the reference data and the new batch if return_distance
equals True.
The FET drift detector is a non-parametric drift detector. It applies Fisher's Exact Test (FET) to each feature, and is intended for application to Bernoulli distributions, with binary univariate data consisting of either (True, False)
or (0, 1)
. This detector is ideal for use in a supervised setting, monitoring drift in a model's instance level accuracy (i.e. correct prediction = 0, and incorrect prediction = 1).
The detector is primarily intended for univariate data, but can also be used in a multivariate setting. For multivariate data, the obtained p-values for each feature are aggregated either via the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls for the probability of at least one false positive. The FDR correction on the other hand allows for an expected fraction of false positives to occur. As with other univariate detectors such as the K-S detector, for high-dimensional data we typically want to reduce the dimensionality before computing the feature-wise univariate FET tests and aggregating those via the chosen correction method. See the K-S detector section for more guidance on this.
For the $j^{th}$ feature, the FET detector considers the 2x2 contingency table between the reference data $x_j^{ref}$ and test data $x_j$ for that feature:
|  | True (1) | False (0) |
|---|---|---|
| $x_j$ | $N_1$ | $N_0$ |
| $x_j^{ref}$ | $N^{ref}_1$ | $N^{ref}_0$ |
where $N^{ref}_1$ represents the number of 1's in the reference data (for the $j^{th}$ feature), $N^{ref}_0$ the number of 0's, and so on. These values can be used to define an odds ratio:

$$\widehat{OR} = \frac{N_1 / N_0}{N^{ref}_1 / N^{ref}_0}$$
The null hypothesis is $H_0: \widehat{OR}=1$. In other words, the proportion of 1's to 0's is unchanged between the test and reference distributions, such that the odds of 1's vs 0's is independent of whether the data is drawn from the reference or test distribution. The offline FET detector can perform one-sided or two-sided tests, with the alternative hypothesis set by the alternative
keyword argument:
If alternative='greater'
, the alternative hypothesis is $H_a: \widehat{OR}>1$ i.e. proportion of 1's versus 0's has increased compared to the reference distribution.
If alternative='less'
, the alternative hypothesis is $H_a: \widehat{OR}<1$ i.e. the proportion of 1's versus 0's has decreased compared to the reference distribution.
If alternative='two-sided'
, the alternative hypothesis is $H_a: \widehat{OR} \ne 1$ i.e. the proportion of 1's versus 0's has changed compared to the reference distribution.
The p-value returned by the detector is then the probability of obtaining an odds ratio at least as extreme as that observed (in the direction specified by alternative
), assuming the null hypothesis is true.
Arguments:
x_ref
: Data used as reference distribution. Note this should be the raw data, for example np.array([0, 0, 1, 0, 0, 0])
, not the 2x2 contingency table.
Keyword arguments:
p_val
: p-value used for significance of the FET test. If the FDR correction method is used, this corresponds to the acceptable q-value.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics. Typically a dimensionality reduction technique.
correction
: Correction type for multivariate data. Either 'bonferroni' or 'fdr' (False Discovery Rate).
alternative
: Defines the alternative hypothesis. Options are 'greater' (default), 'less' or 'two-sided'.
n_features
: Number of features used in the FET test. No need to pass it if no preprocessing takes place. In case of a preprocessing step, this can also be inferred automatically but could be more expensive to compute.
input_shape
: Shape of input data.
data_type
: can specify data type added to metadata. E.g. 'tabular' or 'image'.
Initialized drift detector example:
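A minimal sketch on binary data, e.g. monitoring instance-level prediction errors:

```python
from alibi_detect.cd import FETDrift

# x_ref: binary (0/1) reference data, e.g. instance-level prediction errors
cd = FETDrift(x_ref, p_val=.05, alternative='greater')
```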
We detect data drift by simply calling predict
on a batch of instances x
. We can return the feature-wise p-values before the multivariate correction by setting return_p_val
to True. The drift can also be detected at the feature level by setting drift_type
to 'feature'. No multivariate correction will take place since we return the output of n_features univariate tests. For drift detection on all the features combined with the correction, use 'batch'. return_p_val
equal to True will also return the threshold used by the detector (either for the univariate case or after the multivariate correction).
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains feature-level p-values if return_p_val
equals True.
threshold
: for feature-level drift detection the threshold equals the p-value used for the significance of the FET test. Otherwise the threshold after the multivariate correction (either bonferroni or fdr) is returned.
distance
: Feature-wise test statistics between the reference data and the new batch if return_distance
equals True. In this case, the test statistics correspond to the odds ratios.
The drift detector applies feature-wise two-sample Kolmogorov-Smirnov (K-S) tests. For multivariate data, the obtained p-values for each feature are aggregated either via the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls for the probability of at least one false positive. The FDR correction on the other hand allows for an expected fraction of false positives to occur.
For high-dimensional data, we typically want to reduce the dimensionality before computing the feature-wise univariate K-S tests and aggregating those via the chosen correction method. Following suggestions in Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift, we incorporate Untrained AutoEncoders (UAE) and black-box shift detection using the classifier's softmax outputs (BBSDs) as out-of-the box preprocessing methods and note that PCA can also be easily implemented using scikit-learn
. Preprocessing methods which do not rely on the classifier will usually pick up drift in the input data, while BBSDs focuses on label shift. The adversarial detector which is part of the library can also be transformed into a drift detector picking up drift that reduces the performance of the classification model. We can therefore combine different preprocessing techniques to figure out if there is drift which hurts the model performance, and whether this drift can be classified as input drift or label shift.
Detecting input data drift (covariate shift) $\Delta p(x)$ for text data requires a custom preprocessing step. We can pick up changes in the semantics of the input by extracting (contextual) embeddings and detect drift on those. Strictly speaking we are not detecting $\Delta p(x)$ anymore since the whole training procedure (objective function, training data etc) for the (pre)trained embeddings has an impact on the embeddings we extract. The library contains functionality to leverage pre-trained embeddings from HuggingFace's transformer package but also allows you to easily use your own embeddings of choice. Both options are illustrated with examples in the Text drift detection on IMDB movie reviews notebook.
Arguments:
x_ref
: Data used as reference distribution.
Keyword arguments:
p_val
: p-value used for significance of the K-S test. If the FDR correction method is used, this corresponds to the acceptable q-value.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics. Typically a dimensionality reduction technique.
correction
: Correction type for multivariate data. Either 'bonferroni' or 'fdr' (False Discovery Rate).
alternative
: Defines the alternative hypothesis. Options are 'two-sided' (default), 'less' or 'greater'.
n_features
: Number of features used in the K-S test. No need to pass it if no preprocessing takes place. In case of a preprocessing step, this can also be inferred automatically but could be more expensive to compute.
input_shape
: Shape of input data.
data_type
: can specify data type added to metadata. E.g. 'tabular' or 'image'.
Initialized drift detector example:
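A minimal sketch, with the optional preprocessing step left out for brevity:

```python
from alibi_detect.cd import KSDrift

cd = KSDrift(x_ref, p_val=.05)
```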
We detect data drift by simply calling predict
on a batch of instances x
. We can return the feature-wise p-values before the multivariate correction by setting return_p_val
to True. The drift can also be detected at the feature level by setting drift_type
to 'feature'. No multivariate correction will take place since we return the output of n_features univariate tests. For drift detection on all the features combined with the correction, use 'batch'. return_p_val
equal to True will also return the threshold used by the detector (either for the univariate case or after the multivariate correction).
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains feature-level p-values if return_p_val
equals True.
threshold
: for feature-level drift detection the threshold equals the p-value used for the significance of the K-S test. Otherwise the threshold after the multivariate correction (either bonferroni or fdr) is returned.
distance
: feature-wise K-S statistics between the reference data and the new batch if return_distance
equals True.
Drift detection on molecular graphs
Model-uncertainty drift detectors aim to directly detect drift that is likely to affect the performance of a model of interest. The approach is to test for change in the number of instances falling into regions of the input space on which the model is uncertain in its predictions. For each instance in the reference set the detector obtains the model's prediction and some associated notion of uncertainty. For example, for a classifier this may be the entropy of the predicted label probabilities, while for a regressor with dropout layers Monte Carlo dropout can be used to provide a notion of uncertainty. The same is done for the test set and if significant differences in uncertainty are detected (via a Kolmogorov-Smirnov test) then drift is flagged. The detector's reference set should be disjoint from the model's training set (on which the model's confidence may be higher).
ClassifierUncertaintyDrift
should be used with classification models whereas RegressorUncertaintyDrift
should be used with regression models. They are used in much the same way.
By default ClassifierUncertaintyDrift
uses uncertainty_type='entropy'
as the notion of uncertainty for classifier predictions and a Kolmogorov-Smirnov two-sample test is performed on these continuous values. However uncertainty_type='margin'
can also be specified to deem the classifier's predictions uncertain if they fall within a margin (e.g. in [0.45, 0.55] for binary classifier probabilities), similar to Sethi and Kantardzic (2017), and a Chi-Squared two-sample test is performed on these 0-1 flags of uncertainty.
By default RegressorUncertaintyDrift
uses uncertainty_type='mc_dropout'
and assumes a PyTorch or TensorFlow model with dropout layers as the regressor. This evaluates the model under multiple dropout configurations and uses the variation as the notion of uncertainty. Alternatively a model that outputs (for each instance) a vector of independent model predictions can be passed and uncertainty_type='ensemble'
can be specified. Again the variation is taken as the notion of uncertainty and in both cases a Kolmogorov-Smirnov two-sample test is performed on the continuous notions of uncertainty.
Arguments:
x_ref
: Data used as reference distribution. Should be disjoint from the model's training set.
model
: The model of interest whose performance we'd like to remain constant.
Keyword arguments:
p_val
: p-value used for the significance of the test.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed.
input_shape
: Optionally pass the shape of the input data.
data_type
: Optionally specify the data type (e.g. tabular, image or time-series). Added to metadata.
ClassifierUncertaintyDrift
-specific keyword arguments:
preds_type
: Type of prediction output by the model. Options are 'probs' (in [0,1]) or 'logits' (in [-inf,inf]).
uncertainty_type
: Method for determining the model's uncertainty for a given instance. Options are 'entropy' or 'margin'.
margin_width
: Width of the margin if uncertainty_type = 'margin'. The model is considered uncertain on an instance if the highest two class probabilities it assigns to the instance differ by less than this.
RegressorUncertaintyDrift
-specific keyword arguments:
uncertainty_type
: Method for determining the model's uncertainty for a given instance. Options are 'mc_dropout' or 'ensemble'. For the former the model should have dropout layers and output a scalar per instance. For the latter the model should output a vector of predictions per instance.
n_evals
: The number of times to evaluate the model under different dropout configurations. Only relevant when using the 'mc_dropout' uncertainty type.
Additional arguments if batch prediction required:
backend
: Framework that was used to define model. Options are 'tensorflow' or 'pytorch'.
batch_size
: Batch size to use to evaluate model. Defaults to 32.
device
: Device type to use. The default None tries to use the GPU and falls back on CPU if needed. Can be specified by passing either 'cuda', 'gpu' or 'cpu'. Only relevant for 'pytorch' backend.
Additional arguments for NLP models:
tokenizer
: Tokenizer to use before passing data to model.
max_len
: Max length to be used by tokenizer.
Drift detector for a TensorFlow classifier outputting probabilities:
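A sketch, where clf stands in for a trained tf.keras classifier outputting probabilities:

```python
from alibi_detect.cd import ClassifierUncertaintyDrift

cd = ClassifierUncertaintyDrift(
    x_ref, model=clf, backend='tensorflow', p_val=.05,
    preds_type='probs', uncertainty_type='entropy'
)
```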
Drift detector for a PyTorch regressor (with dropout layers) outputting scalars:
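A sketch, where reg stands in for a PyTorch regressor with dropout layers that outputs a scalar per instance:

```python
from alibi_detect.cd import RegressorUncertaintyDrift

cd = RegressorUncertaintyDrift(
    x_ref, model=reg, backend='pytorch', p_val=.05,
    uncertainty_type='mc_dropout', n_evals=100
)
```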
Note that for the PyTorch RegressorUncertaintyDrift detector the dropout layers need to be defined within the nn.Module
init to be able to set them to train mode when computing the uncertainty estimates, e.g.:
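A sketch of such a regressor; the layer sizes are illustrative:

```python
import torch.nn as nn

class Regressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.dense1 = nn.Linear(32, 64)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.25)  # defined in __init__ so it can be switched to train mode
        self.dense2 = nn.Linear(64, 1)

    def forward(self, x):
        x = self.relu(self.dense1(x))
        x = self.dropout(x)
        return self.dense2(x)
```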
We detect data drift by simply calling predict
on a batch of instances x
. return_p_val
equal to True will also return the p-value of the test and return_distance
equal to True will return the test-statistic.
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
threshold
: the user-defined threshold defining the significance of the test.
p_val
: the p-value of the test if return_p_val
equals True.
distance
: the test-statistic if return_distance
equals True.
Drift detection on molecular graphs
The mixed-type tabular drift detector applies feature-wise two-sample Kolmogorov-Smirnov (K-S) tests for the continuous numerical features and Chi-Squared tests for the categorical features. For multivariate data, the obtained p-values for each feature are aggregated either via the Bonferroni or the False Discovery Rate (FDR) correction. The Bonferroni correction is more conservative and controls for the probability of at least one false positive. The FDR correction on the other hand allows for an expected fraction of false positives to occur. Similarly to the other drift detectors, an optional preprocessing step can be applied, but the output features need to remain either categorical or numerical.
Arguments:
x_ref
: Data used as reference distribution.
Keyword arguments:
p_val
: p-value used for significance of the K-S and Chi-Squared test across all features. If the FDR correction method is used, this corresponds to the acceptable q-value.
categories_per_feature
: Dictionary with as keys the column indices of the categorical features and optionally as values the number of possible categorical values for that feature or a list with the possible values. If you know which features are categorical and simply want to infer the possible values of the categorical feature from the reference data you can pass a Dict[int, NoneType] such as {0: None, 3: None} if features 0 and 3 are categorical. If you also know how many categories are present for a given feature you could pass this in the categories_per_feature
dict in the Dict[int, int] format, e.g. {0: 3, 3: 2}. If you pass N categories this will assume the possible values for the feature are [0, ..., N-1]. You can also explicitly pass the possible categories in the Dict[int, List[int]] format, e.g. {0: [0, 1, 2], 3: [0, 55]}. Note that the categories can be arbitrary int values.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics. Typically a dimensionality reduction technique.
correction
: Correction type for multivariate data. Either 'bonferroni' or 'fdr' (False Discovery Rate).
alternative
: Defines the alternative hypothesis for the K-S tests. Options are 'two-sided' (default), 'less' or 'greater'. Make sure to use 'two-sided' when mixing categorical and numerical features.
n_features
: Number of features used in the K-S and Chi-Squared tests. No need to pass it if no preprocessing takes place. In case of a preprocessing step, this can also be inferred automatically but could be more expensive to compute.
data_type
: can specify data type added to metadata. E.g. 'tabular'.
Initialized drift detector example:
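A minimal sketch, assuming features 0 and 3 are categorical and their possible values should be inferred from the reference data:

```python
from alibi_detect.cd import TabularDrift

# infer the categories of features 0 and 3 from x_ref; all other features are treated as numerical
categories_per_feature = {0: None, 3: None}
cd = TabularDrift(x_ref, p_val=.05, categories_per_feature=categories_per_feature)
```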
We detect data drift by simply calling predict
on a batch of instances x
. We can return the feature-wise p-values before the multivariate correction by setting return_p_val
to True. The drift can also be detected at the feature level by setting drift_type
to 'feature'. No multivariate correction will take place since we return the output of n_features univariate tests. For drift detection on all the features combined with the correction, use 'batch'. return_p_val
equal to True will also return the threshold used by the detector (either for the univariate case or after the multivariate correction).
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains feature-level p-values if return_p_val
equals True.
threshold
: for feature-level drift detection the threshold equals the p-value used for the significance of the K-S and Chi-Squared tests. Otherwise the threshold after the multivariate correction (either bonferroni or fdr) is returned.
distance
: feature-wise K-S or Chi-Squared statistics between the reference data and the new batch if return_distance
equals True.
The classifier-based drift detector (Lopez-Paz and Oquab, 2017) simply tries to correctly distinguish instances from the reference set vs. the test set. The classifier is trained to output the probability that a given instance belongs to the test set. If the probabilities it assigns to unseen test instances are significantly higher (as determined by a Kolmogorov-Smirnov test) than those it assigns to unseen reference instances then the test set must differ from the reference set and drift is flagged. Alternatively, the detector can also binarize the classifier predictions (0 or 1) and apply a binomial test on the binarized predictions of the reference vs. the test data. To leverage all the available reference and test data, stratified cross-validation can be applied and the out-of-fold predictions are used for the significance test. Note that a new classifier is trained for each test set or even each fold within the test set.
Arguments:
x_ref
: Data used as reference distribution.
model
: Binary classification model used for drift detection. TensorFlow, PyTorch and Sklearn models are supported.
Keyword arguments:
backend
: Specify the backend (tensorflow, pytorch or sklearn). This depends on the framework of the model
. Defaults to tensorflow.
p_val
: p-value threshold used for the significance of the test.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed. If the input data type is of type List[Any]
then update_x_ref
needs to be set to None and the reference set remains fixed.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics.
preds_type
: Whether the model outputs 'probs' (probabilities - for 'tensorflow', 'pytorch', 'sklearn' models), 'logits' (for 'pytorch', 'tensorflow' models), 'scores' (for 'sklearn' models if decision_function
is supported).
binarize_preds
: Whether to test for discrepancy on soft (e.g. probs/logits/scores) model predictions directly with a K-S test or binarise to 0-1 prediction errors and apply a binomial test. Defaults to False and therefore applies the K-S test.
train_size
: Optional fraction (float between 0 and 1) of the dataset used to train the classifier. The drift is detected on 1 - train_size. Cannot be used in combination with n_folds
.
n_folds
: Optional number of stratified folds used for training. The model preds are then calculated on all the out-of-fold predictions. This allows to leverage all the reference and test data for drift detection at the expense of longer computation. If both train_size
and n_folds
are specified, n_folds
is prioritized.
seed
: Optional random seed for fold selection.
optimizer
: Optimizer used during training of the classifier. From torch.optim
for PyTorch and tf.keras.optimizers
for TensorFlow.
learning_rate
: Learning rate for the optimizer. Only relevant for tensorflow and pytorch backends.
batch_size
: Batch size used during training of the classifier. Only relevant for tensorflow and pytorch backends.
epochs
: Number of training epochs for the classifier. Applies to each fold if n_folds
is specified. Only relevant for tensorflow and pytorch backends.
verbose
: Verbosity level during the training of the classifier. 0 is silent and 1 prints a progress bar. Only relevant for tensorflow and pytorch backends.
train_kwargs
: Optional additional kwargs for the built-in TensorFlow (from alibi_detect.models.tensorflow import trainer
) or PyTorch (from alibi_detect.models.pytorch import trainer
) trainer functions.
dataset
: Dataset object used during training of the classifier. Defaults to alibi_detect.utils.pytorch.TorchDataset
(an instance of torch.utils.data.Dataset
) for the PyTorch backend and alibi_detect.utils.tensorflow.TFDataset
(an instance of tf.keras.utils.Sequence
) for the TensorFlow backend. For PyTorch, the dataset should only take the data x and the array of labels y as input, so when e.g. TorchDataset is passed to the detector at initialisation, during training TorchDataset(x, y) is used. For TensorFlow, the dataset is an instance of tf.keras.utils.Sequence
, so when e.g. TFDataset is passed to the detector at initialisation, during training TFDataset(x, y, batch_size=batch_size, shuffle=True) is used. x can be of type np.ndarray or List[Any] while y is of type np.ndarray.
input_shape
: Shape of input data.
data_type
: Optionally specify the data type (e.g. tabular, image or time-series). Added to metadata.
Additional PyTorch keyword arguments:
device
: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
dataloader
: Dataloader object used during training of the model. Defaults to torch.utils.data.DataLoader
. The dataloader is not initialized yet; this is done during init of the detector using the batch_size
. Custom dataloaders can be passed as well, e.g. for graph data we can use torch_geometric.data.DataLoader
.
Additional Sklearn keyword arguments:
use_calibration
: Whether to use calibration. Calibration can be used on top of any model. Only relevant for 'sklearn' backend.
calibration_kwargs
: Optional additional kwargs for calibration. Only relevant for 'sklearn' backend. See https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html for more details.
use_oob
: Whether to use out-of-bag (OOB) predictions. Supported only for RandomForestClassifier
.
Initialized TensorFlow drift detector example:
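A sketch, assuming 32x32x3 image inputs; the classifier architecture and training settings are illustrative:

```python
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Dense, Flatten, Input
from alibi_detect.cd import ClassifierDrift

# binary classifier discriminating reference from test instances
model = tf.keras.Sequential([
    Input(shape=(32, 32, 3)),
    Conv2D(8, 4, strides=2, padding='same', activation=tf.nn.relu),
    Conv2D(16, 4, strides=2, padding='same', activation=tf.nn.relu),
    Flatten(),
    Dense(2, activation='softmax')
])

cd = ClassifierDrift(x_ref, model, backend='tensorflow', p_val=.05,
                     preds_type='probs', n_folds=5, epochs=2)
```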
A similar detector using PyTorch:
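A sketch of the PyTorch equivalent, again with an illustrative architecture:

```python
import torch.nn as nn
from alibi_detect.cd import ClassifierDrift

model = nn.Sequential(
    nn.Conv2d(3, 8, 4, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(8, 16, 4, stride=2, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 2)  # raw logits for the 2 classes
)

cd = ClassifierDrift(x_ref, model, backend='pytorch', p_val=.05,
                     preds_type='logits', n_folds=5, epochs=2)
```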
We detect data drift by simply calling predict
on a batch of instances x
. return_p_val
equal to True will also return the p-value of the test, return_distance
equal to True will return a notion of strength of the drift and return_probs
equals True also returns the out-of-fold classifier model prediction probabilities on the reference and test data (0 = reference data, 1 = test data) as well as the associated out-of-fold reference and test instances.
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
threshold
: the user-defined threshold defining the significance of the test
p_val
: the p-value of the test if return_p_val
equals True.
distance
: a notion of strength of the drift if return_distance
equals True. Equal to the K-S test statistic assuming binarize_preds
equals False or the relative error reduction over the baseline error expected under the null if binarize_preds
equals True.
probs_ref
: the instance level prediction probability for the reference data x_ref
(0 = reference data, 1 = test data) if return_probs
is True.
probs_test
: the instance level prediction probability for the test data x
if return_probs
is true.
x_ref_oof
: the instances associated with probs_ref
if return_probs
equals True.
x_test_oof
: the instances associated with probs_test
if return_probs
equals True.
The least-squares density difference (LSDD) detector is a method for multivariate two-sample testing. The LSDD between two distributions $p$ and $q$ on $\mathcal{X}$ is defined as

$$LSDD(p, q) = \int_{\mathcal{X}} \big(p(x) - q(x)\big)^2 \, dx$$
Given two samples we can compute an estimate of the $LSDD$ between the two underlying distributions and use it as a test statistic. We then obtain a $p$-value via a permutation test on the values of the $LSDD$ estimates. In practice we actually estimate the LSDD scaled by a factor that maintains numerical stability when dimensionality is high.
Note
$LSDD$ is based on the assumption that a probability density exists for both distributions and hence is only suitable for continuous data. If you are working with tabular data containing categorical variables, we recommend using the TabularDrift detector instead.
For high-dimensional data, we typically want to reduce the dimensionality before computing the permutation test. Following suggestions in Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift, we incorporate Untrained AutoEncoders (UAE) and black-box shift detection using the classifier's softmax outputs (BBSDs) as out-of-the box preprocessing methods and note that PCA can also be easily implemented using scikit-learn
. Preprocessing methods which do not rely on the classifier will usually pick up drift in the input data, while BBSDs focuses on label shift.
Detecting input data drift (covariate shift) $\Delta p(x)$ for text data requires a custom preprocessing step. We can pick up changes in the semantics of the input by extracting (contextual) embeddings and detect drift on those. Strictly speaking we are not detecting $\Delta p(x)$ anymore since the whole training procedure (objective function, training data etc) for the (pre)trained embeddings has an impact on the embeddings we extract. The library contains functionality to leverage pre-trained embeddings from HuggingFace's transformer package but also allows you to easily use your own embeddings of choice. Both options are illustrated with examples in the Text drift detection on IMDB movie reviews notebook.
Arguments:
x_ref
: Data used as reference distribution.
Keyword arguments:
backend
: Both TensorFlow and PyTorch implementations of the LSDD detector as well as various preprocessing steps are available. Specify the backend (tensorflow or pytorch). Defaults to tensorflow.
p_val
: p-value used for significance of the permutation test.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics. Typically a dimensionality reduction technique.
sigma
: Optionally set the bandwidth of the Gaussian kernel used in estimating the LSDD. Can also pass multiple bandwidth values as an array. The kernel evaluation is then averaged over those bandwidths. If sigma
is not specified, the 'median heuristic' is adopted whereby sigma
is set as the median pairwise distance between reference samples.
n_permutations
: Number of permutations used in the permutation test.
n_kernel_centers
: The number of reference samples to use as centers in the Gaussian kernel model used to estimate LSDD. Defaults to 1/20th of the reference data.
lambda_rd_max
: The maximum relative difference between two estimates of LSDD that the regularization parameter lambda is allowed to cause. Defaults to 0.2 as in the paper.
input_shape
: Optionally pass the shape of the input data.
data_type
: can specify data type added to the metadata. E.g. 'tabular' or 'image'.
Additional PyTorch keyword arguments:
device
: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
Initialized drift detector example:
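A minimal sketch with the TensorFlow backend:

```python
from alibi_detect.cd import LSDDDrift

cd = LSDDDrift(x_ref, backend='tensorflow', p_val=.05)
```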
The same detector in PyTorch:
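And a minimal sketch with the PyTorch backend:

```python
from alibi_detect.cd import LSDDDrift

cd = LSDDDrift(x_ref, backend='pytorch', p_val=.05)
```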
We can also easily add preprocessing functions for both frameworks. The following example uses a randomly initialized image encoder in PyTorch:
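A sketch mirroring the MMD preprocessing example above, assuming 32x32x3 image inputs; the encoder architecture is illustrative:

```python
from functools import partial
import torch
import torch.nn as nn
from alibi_detect.cd import LSDDDrift
from alibi_detect.cd.pytorch import preprocess_drift

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# randomly initialized image encoder used for dimensionality reduction
encoder_net = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(128, 512, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(2048, 32)
).to(device).eval()

preprocess_fn = partial(preprocess_drift, model=encoder_net, device=device, batch_size=512)

cd = LSDDDrift(x_ref, backend='pytorch', p_val=.05, preprocess_fn=preprocess_fn)
```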
The same functionality is supported in TensorFlow and the main difference is that you would import from alibi_detect.cd.tensorflow import preprocess_drift
. Other preprocessing steps such as the output of hidden layers of a model or extracted text embeddings using transformer models can be used in a similar way in both frameworks. TensorFlow example for the hidden layer output:
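A sketch of the TensorFlow hidden-layer variant, with clf standing in for an existing tf.keras classifier:

```python
from functools import partial
from alibi_detect.cd import LSDDDrift
from alibi_detect.cd.tensorflow import HiddenOutput, preprocess_drift

preprocess_fn = partial(preprocess_drift, model=HiddenOutput(clf, layer=-1), batch_size=128)
cd = LSDDDrift(x_ref, backend='tensorflow', p_val=.05, preprocess_fn=preprocess_fn)
```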
The LSDDDrift
detector can be used in exactly the same way as the MMDDrift
detector which is further demonstrated in the Drift detection on CIFAR10 example.
Alibi Detect also includes custom text preprocessing steps in both TensorFlow and PyTorch based on Huggingface's transformers package:
Again the same functionality is supported in TensorFlow but with from alibi_detect.cd.tensorflow import preprocess_drift
and from alibi_detect.models.tensorflow import TransformerEmbedding
imports. Check out the Text drift detection on IMDB movie reviews example for more information.
We detect data drift by simply calling predict
on a batch of instances x
. We can return the p-value and the threshold of the permutation test by setting return_p_val
to True and the maximum mean discrepancy metric and threshold by setting return_distance
to True.
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains the p-value if return_p_val
equals True.
threshold
: p-value threshold if return_p_val
equals True.
distance
: LSDD metric between the reference data and the new batch if return_distance
equals True.
distance_threshold
: LSDD metric value from the permutation test which corresponds to the p-value threshold.
See also the related MMDDrift detector.
The learned-kernel drift detector (Liu et al., 2020) is an extension of the Maximum Mean Discrepancy drift detector where the kernel used to define the MMD is trained using a portion of the data to maximise an estimate of the resulting test power. Once the kernel has been learned a permutation test is performed in the usual way on the value of the MMD.
This method is closely related to the classifier drift detector which trains a classifier to discriminate between instances from the reference window and instances from the test window. The difference here is that we train a kernel to output high similarity on instances from the same window and low similarity between instances from different windows. If this is possible in a generalisable manner then drift must have occurred.
As with the classifier-based approach, we should specify the proportion of data to use for training and testing respectively as well as training arguments such as the learning rate and batch size. Note that a new kernel is trained for each test set that is passed for detection.
Arguments:
x_ref
: Data used as reference distribution.
kernel
: A differentiable TensorFlow or PyTorch module that takes two sets of instances as inputs and returns a kernel similarity matrix as output.
Keyword arguments:
backend
: TensorFlow, PyTorch and KeOps implementations of the learned kernel detector are available. The backend can be specified as tensorflow, pytorch or keops. Defaults to tensorflow.
p_val
: p-value threshold used for the significance of the test.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
x_ref_preprocessed
: Whether or not the reference data x_ref
has already been preprocessed. If True, the reference data will be skipped and preprocessing will only be applied to the test data passed to predict
.
update_x_ref
: Reference data can optionally be updated to the last N instances seen by the detector or via reservoir sampling with size N. For the former, the parameter equals {'last': N} while for reservoir sampling {'reservoir_sampling': N} is passed. If the input data type is of type List[Any]
then update_x_ref
needs to be set to None and the reference set remains fixed.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics.
n_permutations
: The number of permutations to use in the permutation test once the MMD has been computed.
var_reg
: Constant added to the estimated variance of the MMD for stability.
reg_loss_fn
: The regularisation term reg_loss_fn(kernel) is added to the loss function being optimized.
train_size
: Optional fraction (float between 0 and 1) of the dataset used to train the classifier. The drift is detected on 1 - train_size.
retrain_from_scratch
: Whether the kernel should be retrained from scratch for each set of test data or whether it should instead continue training from where it left off on the previous set. Defaults to True.
optimizer
: Optimizer used during training of the kernel. From torch.optim
for PyTorch and tf.keras.optimizers
for TensorFlow.
learning_rate
: Learning rate for the optimizer.
batch_size
: Batch size used during training of the kernel.
batch_size_predict
: Batch size used for the trained drift detector predictions.
preprocess_batch_fn
: Optional batch preprocessing function. For example to convert a list of generic objects to a tensor which can be processed by the kernel.
epochs
: Number of training epochs for the kernel.
verbose
: Verbosity level during the training of the kernel. 0 is silent and 1 prints a progress bar.
train_kwargs
: Optional additional kwargs for the built-in TensorFlow (from alibi_detect.models.tensorflow import trainer
) or PyTorch (from alibi_detect.models.pytorch import trainer
) trainer functions.
dataset
: Dataset object used during training of the kernel. Defaults to alibi_detect.utils.pytorch.TorchDataset
(an instance of torch.utils.data.Dataset
) for the PyTorch and KeOps backends and alibi_detect.utils.tensorflow.TFDataset
(an instance of tf.keras.utils.Sequence
) for the TensorFlow backend. For PyTorch or KeOps, the dataset should only take the windows x_ref and x_test as input, so when e.g. TorchDataset is passed to the detector at initialisation, during training TorchDataset(x_ref, x_test) is used. For TensorFlow, the dataset is an instance of tf.keras.utils.Sequence
, so when e.g. TFDataset is passed to the detector at initialisation, during training TFDataset(x_ref, x_test, batch_size=batch_size, shuffle=True) is used. x_ref and x_test can be of type np.ndarray or List[Any].
input_shape
: Shape of input data.
data_type
: Optionally specify the data type (e.g. tabular, image or time-series). Added to metadata.
Additional PyTorch and KeOps keyword arguments:
device
: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
dataloader
: Dataloader object used during training of the kernel. Defaults to torch.utils.data.DataLoader
. The dataloader is not initialized yet; this is done during the init of the detector using the batch_size
. Custom dataloaders can be passed as well, e.g. for graph data we can use torch_geometric.data.DataLoader
.
num_workers
: The number of workers used by the DataLoader
. The default (num_workers=0
) means multi-process data loading is disabled. Setting num_workers>0
may be unreliable on Windows.
Additional KeOps only keyword arguments:
batch_size_permutations
: KeOps computes the n_permutations
of the MMD^2 statistics in chunks of batch_size_permutations
. Defaults to 1,000,000.
Any differentiable PyTorch or TensorFlow module that takes as input two instances and outputs a scalar (representing similarity) can be used as the kernel for this drift detector. However, in order to ensure that MMD=0 implies no drift, the kernel should satisfy a characteristic property. This can be guaranteed by defining a kernel as $k(x,y) = (1-\epsilon)\,k_a(\Phi(x), \Phi(y)) + \epsilon\,k_b(x,y)$, where $\Phi$ is a learnable projection, $k_a$ and $k_b$ are simple characteristic kernels (such as a Gaussian RBF), and $\epsilon>0$ is a small constant. By letting $\Phi$ be very flexible we can learn powerful kernels in this manner.
This is easily implemented using the DeepKernel
class provided in alibi_detect
. We demonstrate below how we might define a convolutional kernel for images using PyTorch. By default GaussianRBF
kernels are used for $k_a$ and $k_b$ and here we specify $\epsilon=0.01$, but we could alternatively set eps='trainable'
.
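For instance, a minimal sketch of such a convolutional kernel with the PyTorch backend could look as follows (the input shape of (3, 32, 32) and the projection architecture are assumptions for illustration):

```python
import torch.nn as nn
from alibi_detect.utils.pytorch import DeepKernel

# Learnable projection Phi: a small convnet mapping images to feature vectors.
# The channel/spatial dimensions below assume inputs of shape (3, 32, 32).
proj = nn.Sequential(
    nn.Conv2d(3, 8, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(8, 16, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Conv2d(16, 32, 4, stride=2, padding=0),
    nn.ReLU(),
    nn.Flatten(),
)

# Deep kernel with the default GaussianRBF kernels for k_a and k_b and eps=0.01.
kernel = DeepKernel(proj, eps=0.01)
```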
It is important to note that, if retrain_from_scratch=True
and we have not initialised the kernel bandwidth sigma
for the default GaussianRBF
kernel $k_a$ and optionally also for $k_b$, we will initialise sigma
using a median (PyTorch and TensorFlow) or mean (KeOps) bandwidth heuristic for every detector prediction. For KeOps detectors specifically, this could form a computational bottleneck and should be avoided by already specifying a bandwidth in advance. To do this, we can leverage the library's built-in heuristics:
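As a rough, self-contained sketch of the idea, one could pre-compute a bandwidth from a subsample of the reference data and pass it to the default GaussianRBF kernel. Note that the median_sigma helper below is our own illustration of such a heuristic rather than a specific library function, and proj and x_ref are assumed to be defined as above:

```python
import numpy as np
import torch
from alibi_detect.utils.pytorch import DeepKernel, GaussianRBF

def median_sigma(x: np.ndarray, n: int = 500) -> torch.Tensor:
    """Illustrative median bandwidth heuristic computed on a random subsample of x."""
    idx = np.random.choice(len(x), min(n, len(x)), replace=False)
    x_sub = torch.as_tensor(x[idx], dtype=torch.float32).flatten(1)
    d2 = torch.pdist(x_sub) ** 2           # pairwise squared distances
    return (.5 * d2.median()).sqrt().reshape(1)  # sigma such that 2 * sigma^2 equals the median squared distance

sigma = median_sigma(x_ref)
kernel_a = GaussianRBF(sigma=sigma, trainable=True)  # bandwidth fixed up front, still trainable
kernel = DeepKernel(proj, kernel_a=kernel_a, eps=0.01)
```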
Instantiating the detector is then as simple as passing the reference data and the kernel as follows:
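A minimal sketch with the PyTorch backend, assuming x_ref and the kernel defined above (the parameter values are illustrative):

```python
from alibi_detect.cd import LearnedKernelDrift

cd = LearnedKernelDrift(x_ref, kernel, backend='pytorch', p_val=.05, epochs=10, batch_size=32)
```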
We could have alternatively defined the kernel and instantiated the detector using KeOps:
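A sketch under the assumption that the optional KeOps dependency (pykeops) is installed, reusing the projection proj from above:

```python
from alibi_detect.cd import LearnedKernelDrift
from alibi_detect.utils.keops import DeepKernel

kernel = DeepKernel(proj, eps=0.01)
cd = LearnedKernelDrift(x_ref, kernel, backend='keops', p_val=.05, epochs=10, batch_size=32)
```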
Or by using TensorFlow as the backend:
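A corresponding TensorFlow sketch, assuming channels-last image inputs (the projection architecture is again illustrative):

```python
import tensorflow as tf
from alibi_detect.cd import LearnedKernelDrift
from alibi_detect.utils.tensorflow import DeepKernel

# Learnable projection Phi as a small Keras convnet.
proj = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 4, strides=2, padding='same', activation=tf.nn.relu),
    tf.keras.layers.Conv2D(16, 4, strides=2, padding='same', activation=tf.nn.relu),
    tf.keras.layers.Flatten(),
])
kernel = DeepKernel(proj, eps=0.01)
cd = LearnedKernelDrift(x_ref, kernel, backend='tensorflow', p_val=.05, epochs=10, batch_size=32)
```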
We detect data drift by simply calling predict
on a batch of instances x
. Setting return_p_val
to True will also return the p-value of the test, setting return_distance
to True will return a notion of the strength of the drift, and setting return_kernel
to True will also return the trained kernel.
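A minimal sketch of such a call, assuming x is a batch of test instances:

```python
preds = cd.predict(x, return_p_val=True, return_distance=True, return_kernel=True)
```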
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
threshold
: the user-defined p-value threshold defining the significance of the test.
p_val
: the p-value of the test if return_p_val
equals True.
distance
: MMD^2 metric between the reference data and the new batch if return_distance
equals True.
distance_threshold
: MMD^2 metric value from the permutation test which corresponds to the p-value threshold if return_distance
equals True.
kernel
: The trained kernel if return_kernel
equals True.
Drift detection on molecular graphs
The spot-the-diff drift detector is an extension of the Classifier drift detector where the classifier is specified in a manner that makes detections interpretable at the feature level when they occur. The detector is inspired by the work of Jitkrittum et al. (2016) but various major adaptations have been made.
As with the usual classifier-based approach, a portion of the available data is used to train a classifier that can discriminate reference instances from test instances. If the classifier can learn to discriminate in a generalisable manner then drift must have occurred. Here we additionally enforce that the classifier takes the form $\mathrm{logit}\big(\hat{p}_T(x)\big) = b_0 + \sum_{i} b_i\, k(x, w_i)$, where $\hat{p}_T$ is the predicted probability that instance $x$ is from the test window (rather than reference), $k(\cdot,\cdot)$ is a kernel specifying a notion of similarity between instances, $w_i$ are learnable test locations and $b_i$ are learnable regression coefficients.
If the detector flags drift and $b_i >0$ then we know that it reached its decision by considering how similar each instance is to the instance $w_i$, with those being more similar being more likely to be test instances than reference instances. Alternatively if $b_i < 0$ then instances more similar to $w_i$ were deemed more likely to be reference instances.
In order to provide less noisy and therefore more interpretable results, we define each test location as $w_i = \bar{x} + d_i$, where $\bar{x}$ is the mean reference instance. We may then interpret $d_i$ as the additive transformation deemed to make the average reference instance more ($b_i>0$) or less ($b_i<0$) similar to a test instance. Defining the test locations in this way allows us to instead learn the difference $d_i$ and apply regularisation such that non-zero values must be justified by improved classification performance. This allows us to more clearly identify which features any detected drift should be attributed to.
As with the standard classifier-based approach, we should specify the proportion of data to use for training and testing respectively as well as training arguments such as the learning rate and batch size. Note that the classifier is trained for each test set that is passed for detection.
Arguments:
x_ref
: Data used as reference distribution.
Keyword arguments:
backend
: Specify the backend (tensorflow or pytorch) to use for defining the kernel and training the test locations/differences.
p_val
: p-value threshold used for the significance of the test.
preprocess_fn
: Function to preprocess the data before computing the data drift metrics.
kernel
: A differentiable TensorFlow or PyTorch module that takes two instances as input and returns a scalar notion of similarity as output. Defaults to a Gaussian radial basis function.
n_diffs
: The number of test locations to use, each corresponding to an interpretable difference.
initial_diffs
: Array used to initialise the diffs that will be learned. Defaults to Gaussian for each feature with equal variance to that of reference data.
l1_reg
: Strength of l1 regularisation to apply to the differences.
binarize_preds
: Whether to test for discrepancy on soft (e.g. probs/logits) model predictions directly with a K-S test or binarise to 0-1 prediction errors and apply a binomial test.
train_size
: Optional fraction (float between 0 and 1) of the dataset used to train the classifier. The drift is detected on 1 - train_size. Cannot be used in combination with n_folds
.
n_folds
: Optional number of stratified folds used for training. The model predictions are then calculated on all the out-of-fold instances. This allows us to leverage all the reference and test data for drift detection at the expense of longer computation. If both train_size
and n_folds
are specified, n_folds
is prioritized.
retrain_from_scratch
: Whether the classifier should be retrained from scratch for each set of test data or whether it should instead continue training from where it left off on the previous set.
seed
: Optional random seed for fold selection.
optimizer
: Optimizer used during training of the kernel. From torch.optim
for PyTorch and tf.keras.optimizers
for TensorFlow.
learning_rate
: Learning rate for the optimizer.
batch_size
: Batch size used during training of the kernel.
preprocess_batch_fn
: Optional batch preprocessing function. For example to convert a list of generic objects to a tensor which can be processed by the kernel.
epochs
: Number of training epochs for the kernel.
verbose
: Verbosity level during the training of the kernel. 0 is silent and 1 prints a progress bar.
train_kwargs
: Optional additional kwargs for the built-in TensorFlow (from alibi_detect.models.tensorflow import trainer
) or PyTorch (from alibi_detect.models.pytorch import trainer
) trainer functions.
dataset
: Dataset object used during training of the classifier. Defaults to alibi_detect.utils.pytorch.TorchDataset
(an instance of torch.utils.data.Dataset
) for the PyTorch backend and alibi_detect.utils.tensorflow.TFDataset
(an instance of tf.keras.utils.Sequence
) for the TensorFlow backend. For PyTorch, the dataset should only take the data x and the array of labels y as input, so when e.g. TorchDataset is passed to the detector at initialisation, during training TorchDataset(x, y) is used. For TensorFlow, the dataset is an instance of tf.keras.utils.Sequence
, so when e.g. TFDataset is passed to the detector at initialisation, during training TFDataset(x, y, batch_size=batch_size, shuffle=True) is used. x can be of type np.ndarray or List[Any] while y is of type np.ndarray.
input_shape
: Shape of input data.
data_type
: Optionally specify the data type (e.g. tabular, image or time-series). Added to metadata.
Additional PyTorch keyword arguments:
device
: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
dataloader
: Dataloader object used during training of the classifier. Defaults to torch.utils.data.DataLoader
. The dataloader is not initialized yet; this is done during the init of the detector using the batch_size
. Custom dataloaders can be passed as well, e.g. for graph data we can use torch_geometric.data.DataLoader
.
Any differentiable PyTorch or TensorFlow module that takes as input two instances and outputs a scalar (representing similarity) can be used as the kernel for this drift detector. By default a simple Gaussian RBF kernel is used. Keeping the kernel simple can aid interpretability, but alternatively a "deep kernel" of the form $k(x,y) = (1-\epsilon)\,k_a(\Phi(x), \Phi(y)) + \epsilon\,k_b(x,y)$ can be used, where $\Phi$ is a (differentiable) projection, $k_a$ and $k_b$ are simple kernels (such as a Gaussian RBF) and $\epsilon>0$ is a small constant. The DeepKernel
class found in either alibi_detect.utils.tensorflow
or alibi_detect.utils.pytorch
aims to make defining such kernels straightforward. However, you should not allow too many learnable parameters, as we would like the classifier to discriminate using the test locations rather than the kernel parameters.
Instantiating the detector is as simple as passing your reference data and selecting a backend, but you should also consider the number of "diffs" you would like your model to use to discriminate reference from test instances and the strength of regularisation you would like to apply to them.
Using n_diffs=1
is the simplest to interpret and seems to work well in practice. Using more diffs may result in stronger detection power but the diffs may be harder to interpret due to interactions and conditional dependencies.
The strength of the regularisation (l1_reg
) to apply to the diffs should also be specified. Stronger regularisation results in sparser diffs as the classifier is encouraged to discriminate using fewer features. This may make the diff more interpretable but may again come at the cost of detection power.
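Putting these choices together, a minimal sketch of instantiating the detector with the PyTorch backend might look as follows (the parameter values are illustrative):

```python
from alibi_detect.cd import SpotTheDiffDrift

cd = SpotTheDiffDrift(
    x_ref,
    backend='pytorch',
    p_val=.05,
    n_diffs=1,        # a single interpretable diff
    l1_reg=1e-3,      # regularisation strength applied to the diff
    epochs=10,
    batch_size=32
)
```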
Alternatively we could have used the TensorFlow backend and defined a deep kernel with a convolutional structure:
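A sketch of this alternative, assuming channels-last image inputs (the projection architecture is illustrative):

```python
import tensorflow as tf
from alibi_detect.cd import SpotTheDiffDrift
from alibi_detect.utils.tensorflow import DeepKernel

# Differentiable projection Phi with a convolutional structure.
proj = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 4, strides=2, padding='same', activation=tf.nn.relu),
    tf.keras.layers.Conv2D(16, 4, strides=2, padding='same', activation=tf.nn.relu),
    tf.keras.layers.Flatten(),
])
kernel = DeepKernel(proj, eps=0.01)

cd = SpotTheDiffDrift(
    x_ref,
    backend='tensorflow',
    kernel=kernel,
    p_val=.05,
    n_diffs=1,
    l1_reg=1e-3,
    epochs=10,
    batch_size=32
)
```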
We detect data drift by simply calling predict
on a batch of instances x
. Setting return_p_val
to True will also return the p-value of the test, setting return_distance
to True will return a notion of the strength of the drift, setting return_probs
to True will return the out-of-fold classifier model prediction probabilities on the reference and test data (0 = reference data, 1 = test data) as well as the associated out-of-fold reference and test instances, and setting return_kernel
to True will also return the trained kernel.
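A minimal sketch of such a call, assuming x is a batch of test instances and using the return flags described above:

```python
preds = cd.predict(x, return_p_val=True, return_distance=True, return_probs=True, return_kernel=True)
```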
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
diffs
: a numpy array containing the diffs used to discriminate reference from test instances.
diff_coeffs
: a coefficient corresponding to each diff, where a coefficient greater than zero implies that the corresponding diff makes the average reference instance more similar to a test instance on average, and less than zero implies less similar.
threshold
: the user-defined p-value threshold defining the significance of the test.
p_val
: the p-value of the test if return_p_val
equals True.
distance
: a notion of strength of the drift if return_distance
equals True. Equal to the K-S test statistic assuming binarize_preds
equals False or the relative error reduction over the baseline error expected under the null if binarize_preds
equals True.
probs_ref
: the instance level prediction probability for the reference data x_ref
(0 = reference data, 1 = test data) if return_probs
is True.
probs_test
: the instance level prediction probability for the test data x
if return_probs
is True.
x_ref_oof
: the instances associated with probs_ref
if return_probs
equals True.
x_test_oof
: the instances associated with probs_test
if return_probs
equals True.
kernel
: The trained kernel if return_kernel
equals True.
Interpretable Drift detection on MNIST and the Wine Quality dataset
The context-aware maximum mean discrepancy drift detector (Cobb and Van Looveren, 2022) is a kernel based method for detecting drift in a manner that can take relevant context into account. A normal drift detector detects when the distributions underlying two sets of samples $\{x^0_i\}_{i=1}^{n_0}$ and $\{x^1_i\}_{i=1}^{n_1}$ differ. A context-aware drift detector only detects differences that can not be attributed to a corresponding difference between sets of associated context variables $\{c^0_i\}_{i=1}^{n_0}$ and $\{c^1_i\}_{i=1}^{n_1}$.
Context-aware drift detectors afford practitioners the flexibility to specify their desired context variable. It could be a transformation of the data, such as a subset of features, or an unrelated indexing quantity, such as the time or weather. Everything that the practitioner wishes to allow to change between the reference window and test window should be captured within the context variable.
On a technical level, the method operates in a manner similar to the maximum mean discrepancy detector. However, instead of using an estimate of the squared difference between kernel mean embeddings of $X_{\text{ref}}$ and $X_{\text{test}}$ as the test statistic, we now use an estimate of the expected squared difference between the kernel conditional mean embeddings of $X_{\text{ref}}|C$ and $X_{\text{test}}|C$. As well as the kernel defined on the space of data $X$ required to define the test statistic, estimating the statistic additionally requires a kernel defined on the space of the context variable $C$. For any given realisation of the test statistic an associated p-value is then computed using a conditional permutation test.
The detector is designed for cases where the training data contains a rich variety of contexts and individual test windows may cover a much more limited subset. It is assumed that the test contexts remain within the support of those observed in the reference set.
Arguments:
x_ref
: Data used as reference distribution.
c_ref
: Context for the reference distribution.
Keyword arguments:
backend
: Both TensorFlow and PyTorch implementations of the context-aware MMD detector as well as various preprocessing steps are available. Specify the backend (tensorflow or pytorch). Defaults to tensorflow.
p_val
: p-value used for significance of the permutation test.
preprocess_at_init
: Whether to already apply the (optional) preprocessing step to the reference data x_ref
at initialization and store the preprocessed data. Dependent on the preprocessing step, this can reduce the computation time for the predict step significantly, especially when the reference dataset is large. Defaults to True. It is possible that it needs to be set to False if the preprocessing step requires statistics from both the reference and test data, such as the mean or standard deviation.
update_ref
: Reference data can optionally be updated to the last N instances seen by the detector. The parameter should be passed as a dictionary {'last': N}.
preprocess_fn
: Function to preprocess the data (x_ref
and x
) before computing the data drift metrics. Typically a dimensionality reduction technique. NOTE: Preprocessing is not applied to the context data.
x_kernel
: Kernel defined on the data x_*
. Defaults to a Gaussian RBF kernel (from alibi_detect.utils.pytorch import GaussianRBF
or from alibi_detect.utils.tensorflow import GaussianRBF
dependent on the backend used).
c_kernel
: Kernel defined on the context c_*
. Defaults to a Gaussian RBF kernel (from alibi_detect.utils.pytorch import GaussianRBF
or from alibi_detect.utils.tensorflow import GaussianRBF
dependent on the backend used).
n_permutations
: Number of permutations used in the conditional permutation test.
prop_c_held
: Proportion of contexts held out to condition on.
n_folds
: Number of cross-validation folds used when tuning the regularisation parameters.
batch_size
: If not None
, then the MMDs are computed in batches of the given size rather than all at once, which could otherwise lead to memory issues.
input_shape
: Optionally pass the shape of the input data.
data_type
: can specify data type added to the metadata. E.g. 'tabular' or 'image'.
verbose
: Whether or not to print progress during configuration.
Additional PyTorch keyword arguments:
device
: cuda or gpu to use the GPU and cpu for the CPU. If the device is not specified, the detector will try to leverage the GPU if possible and otherwise fall back on CPU.
Initialized drift detector example with the PyTorch backend:
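A minimal sketch, assuming x_ref and c_ref as described above:

```python
from alibi_detect.cd import ContextMMDDrift

cd = ContextMMDDrift(x_ref, c_ref, backend='pytorch', p_val=.05)
```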
The same detector in TensorFlow:
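The equivalent sketch with the TensorFlow backend:

```python
from alibi_detect.cd import ContextMMDDrift

cd = ContextMMDDrift(x_ref, c_ref, backend='tensorflow', p_val=.05)
```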
We detect data drift by simply calling predict
on a batch of test or deployment instances x
and contexts c
. We can return the p-value and the threshold of the permutation test by setting return_p_val
to True and the context-aware maximum mean discrepancy metric and threshold by setting return_distance
to True. We can also set return_coupling
to True which additionally returns the coupling matrices $W_\text{ref,test}$, $W_\text{ref,ref}$ and $W_\text{test,test}$. As illustrated in the examples (text, ECGs) this can provide deep insights into where the reference and test distributions are similar and where they differ.
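A minimal sketch of such a call, assuming x and c are the test instances and their associated contexts:

```python
preds = cd.predict(x, c, return_p_val=True, return_distance=True, return_coupling=True)
```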
The prediction takes the form of a dictionary with meta
and data
keys. meta
contains the detector's metadata while data
is also a dictionary which contains the actual predictions stored in the following keys:
is_drift
: 1 if the sample tested has drifted from the reference data and 0 otherwise.
p_val
: contains the p-value if return_p_val
equals True.
threshold
: p-value threshold if return_p_val
equals True.
distance
: conditional MMD^2 metric between the reference data and the new batch if return_distance
equals True.
distance_threshold
: conditional MMD^2 metric value from the permutation test which corresponds to the p-value threshold.
coupling_xx
: coupling matrix $W_\text{ref,ref}$ for the reference data.
coupling_yy
: coupling matrix $W_\text{test,test}$ for the test data.
coupling_xy
: coupling matrix $W_\text{ref,test}$ between the reference and test data.
Context-aware drift detection on news articles