Learned drift detectors on Adult Census
Under the hood, drift detectors leverage a function (also known as a test-statistic) that is expected to take a large value if drift has occurred and a low value if not. The power of the detector is partly determined by how well the function satisfies this property. However, specifying such a function in advance can be very difficult.
Detecting drift with a learned classifier
The classifier-based drift detector simply tries to correctly distinguish instances from the reference data vs. the test set. The classifier is trained to output the probability that a given instance belongs to the test set. If the probabilities it assigns to unseen test instances are significantly higher (as determined by a Kolmogorov-Smirnov test) than those it assigns to unseen reference instances, then the test set must differ from the reference set and drift is flagged. To leverage all the available reference and test data, stratified cross-validation can be applied and the out-of-fold predictions are used for the significance test. Note that a new classifier is trained for each test set or even each fold within the test set.
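To make this concrete, below is a minimal, self-contained sketch of the classifier two-sample test idea, assuming numerical arrays `x_ref` and `x_test`. This is only an illustration of the mechanism, not the detector's actual implementation.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def classifier_two_sample_test(x_ref, x_test, n_folds=5, seed=0):
    """K-S p-value of out-of-fold probabilities on reference vs. test instances."""
    x = np.concatenate([x_ref, x_test])
    y = np.concatenate([np.zeros(len(x_ref)), np.ones(len(x_test))])  # 1 = "belongs to test set"
    probs = np.zeros(len(x))
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, val_idx in skf.split(x, y):
        clf = RandomForestClassifier(random_state=seed)  # a new classifier per fold
        clf.fit(x[train_idx], y[train_idx])
        probs[val_idx] = clf.predict_proba(x[val_idx])[:, 1]  # out-of-fold predictions
    # compare probabilities assigned to unseen reference vs. unseen test instances
    return ks_2samp(probs[y == 0], probs[y == 1]).pvalue
```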
Backend
The method works with the PyTorch, TensorFlow, and Sklearn frameworks. We will focus exclusively on the Sklearn backend in this notebook.
Dataset
The Adult dataset consists of 32,561 instances distributed over 2 classes, based on whether the annual income is >50K. We evaluate drift on particular subsets of the data constructed based on the education level. As discussed below, our reference dataset will consist of people with a low education level, while our test dataset will consist of people with a high education level.
Note: we need to install `alibi` to fetch the adult dataset.
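A minimal sketch of installing `alibi` and loading the data with its `fetch_adult` helper:

```python
# !pip install alibi

from alibi.datasets import fetch_adult

adult = fetch_adult()
X, y = adult.data, adult.target        # features and income labels (>50K or not)
feature_names = adult.feature_names    # column names
category_map = adult.category_map      # {feature index: list of category names}
```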
Load Adult Census Dataset
We split the dataset in two based on the education level. We define a `low_education` level consisting of `'Dropout'`, `'High School grad'`, `'Bachelors'`, and a `high_education` level consisting of `'Bachelors'`, `'Masters'`, `'Doctorate'`. We intentionally included an overlap between the two distributions, consisting of people with a `Bachelors` degree. Our goal is to detect that the two distributions are different.
We sample our reference dataset from the `low_education` level. In addition, we sample two other datasets:

* `x_h0` - sampled from the `low_education` level to support the null hypothesis (i.e., the two distributions are identical);
* `x_h1` - sampled from the `high_education` level to support the alternative hypothesis (i.e., the two distributions are different).
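One possible way to build these subsets, assuming `X`, `feature_names` and `category_map` from the loading step above (the sample size `n` is an arbitrary choice for illustration):

```python
import numpy as np

education_col = feature_names.index('Education')
education_categories = category_map[education_col]  # integer code -> category name

def education_mask(X, levels):
    codes = [i for i, name in enumerate(education_categories) if name in levels]
    return np.isin(X[:, education_col], codes)

low_education = ['Dropout', 'High School grad', 'Bachelors']
high_education = ['Bachelors', 'Masters', 'Doctorate']

X_low = X[education_mask(X, low_education)]
X_high = X[education_mask(X, high_education)]

rng = np.random.default_rng(0)
n = 1000  # illustrative sample size
x_ref = X_low[rng.choice(len(X_low), n, replace=False)]    # reference set
x_h0 = X_low[rng.choice(len(X_low), n, replace=False)]     # supports H0
x_h1 = X_high[rng.choice(len(X_high), n, replace=False)]   # supports H1
```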
Define dataset pre-processor
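One reasonable pre-processor, sketched below (not the only valid choice): one-hot encode the categorical features and standardize the numerical ones.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

categorical_ids = list(category_map.keys())
numerical_ids = [i for i in range(len(feature_names)) if i not in categorical_ids]

preprocessor = ColumnTransformer([
    # handle_unknown='ignore' since the test sets contain education levels
    # absent from the reference set; use sparse=False on scikit-learn < 1.2
    ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_ids),
    ('num', StandardScaler(), numerical_ids),
])
preprocessor.fit(x_ref)  # fit on the reference data only
```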
Utils
Drift detection
We perform a binomial test using a `RandomForestClassifier`.
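A sketch of how the detector could be set up with the sklearn backend: setting `binarize_preds=True` replaces the K-S test on probabilities with a binomial test on the binarized predictions, and the pre-processor from above is passed as `preprocess_fn`.

```python
from sklearn.ensemble import RandomForestClassifier
from alibi_detect.cd import ClassifierDrift

model = RandomForestClassifier()
cd = ClassifierDrift(
    x_ref, model, backend='sklearn', p_val=.05,
    preprocess_fn=preprocessor.transform,
    n_folds=5, binarize_preds=True,  # binarized predictions -> binomial test
)

print(cd.predict(x_h0)['data']['is_drift'])  # 0 expected: same distribution as x_ref
print(cd.predict(x_h1)['data']['is_drift'])  # 1 expected: different distribution
```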
As expected, when testing against `x_h0`, we fail to reject $H_0$, while for the second case there is enough evidence to reject $H_0$ and flag that the data has drifted.
For classifiers that do not support `predict_proba` but offer support for `decision_function`, we can perform a K-S test on the scores by setting `preds_type='scores'`.
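For example (a sketch), a `LinearSVC` only exposes `decision_function`, so the test is run on its scores:

```python
from sklearn.svm import LinearSVC
from alibi_detect.cd import ClassifierDrift

cd = ClassifierDrift(
    x_ref, LinearSVC(max_iter=10000), backend='sklearn', p_val=.05,
    preprocess_fn=preprocessor.transform,
    n_folds=5, preds_type='scores',  # K-S test on decision_function scores
)
print(cd.predict(cd and x_h1)['data']['is_drift']) if False else None  # see usage note below
print(cd.predict(x_h1)['data']['is_drift'])
```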
Some models can return a poor estimate of the class label probability, or might not support probability predictions at all. We can add calibration on top of each classifier to obtain better probability estimates and perform a K-S test. For demonstrative purposes, we will calibrate a `LinearSVC`, which does not support `predict_proba`, but any other classifier would work.
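A sketch of the calibrated setup. The `use_calibration` and `calibration_kwargs` options are assumed to be available in the sklearn backend; if they are not in your version, wrapping the model in scikit-learn's `CalibratedClassifierCV` achieves a similar effect.

```python
from sklearn.svm import LinearSVC
from alibi_detect.cd import ClassifierDrift

cd = ClassifierDrift(
    x_ref, LinearSVC(max_iter=10000), backend='sklearn', p_val=.05,
    preprocess_fn=preprocessor.transform, n_folds=5,
    use_calibration=True,                       # assumed sklearn-backend option
    calibration_kwargs={'method': 'isotonic'},  # passed to the internal calibrator
)
print(cd.predict(x_h1)['data']['is_drift'])
```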
Speeding things up
In order to use the entire dataset and obtain the unbiased predictions required to perform the statistical test, the `ClassifierDrift` detector has the option to perform an `n_folds` split. Although appealing due to its data efficiency, this method can be slow since it requires training `n_folds` classifiers.
For the `RandomForestClassifier` we can avoid retraining `n_folds` classifiers by using the out-of-bag predictions. In a `RandomForestClassifier`, each tree is trained on a separate dataset obtained by sampling with replacement from the original training set, a method known as bagging. On average, only 63% of the unique samples in the original dataset are used to train each tree (Bostrom). Thus, for each tree, we can obtain predictions for the remaining out-of-bag samples (i.e., the other 37%). By accumulating the out-of-bag predictions across all the trees we can eventually obtain a prediction for each sample in the original dataset. We say 'eventually' because if the number of trees is too small, covering the entire original dataset might not happen.
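For illustration, scikit-learn exposes these out-of-bag estimates directly on a fitted forest (`X_train` and `y_train` are hypothetical training data here):

```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=400, oob_score=True, bootstrap=True)
clf.fit(X_train, y_train)
# out-of-bag P(class=1) for each training sample; rows may contain NaN
# if a sample was never left out-of-bag (i.e., too few trees)
oob_probs = clf.oob_decision_function_[:, 1]
```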
For demonstrative purposes, we will compare the running time of the `ClassifierDrift` detector when using a `RandomForestClassifier` in two setups: `n_folds=5, use_oob=False` and `use_oob=True`.
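A sketch of the timing comparison (`n_estimators` and the timing helper are illustrative; `use_oob` is the sklearn-backend option referred to above):

```python
import time
from sklearn.ensemble import RandomForestClassifier
from alibi_detect.cd import ClassifierDrift

def time_drift_check(**kwargs):
    cd = ClassifierDrift(
        x_ref, RandomForestClassifier(n_estimators=400), backend='sklearn',
        p_val=.05, preprocess_fn=preprocessor.transform, **kwargs,
    )
    t0 = time.time()
    cd.predict(x_h1)
    return time.time() - t0

print('n_folds=5, use_oob=False:', time_drift_check(n_folds=5, use_oob=False))
print('use_oob=True            :', time_drift_check(use_oob=True))
```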
We can observe that in this particular setting, using the out-of-bag predictions can speed up the procedure by a factor of almost 4.