Likelihood Ratio Outlier Detection with PixelCNN++
Method
The outlier detector described by Ren et al. (2019) in Likelihood Ratios for Out-of-Distribution Detection uses the likelihood ratio between two generative models as the outlier score. One model is trained on the original data while the other is trained on a perturbed version of the dataset. This is based on the observation that the likelihood of an instance under a generative model can be heavily affected by population-level background statistics. The second generative model is therefore trained to capture the background statistics that are still present in the perturbed data, while the semantic features have been erased by the perturbations.
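Concretely, following Ren et al. (2019), the score for an instance $x$ is the log-likelihood ratio

$$LLR(x) = \log p_{\theta}(x) - \log p_{\theta_0}(x)$$

where $p_{\theta}$ is the semantic model trained on the original data and $p_{\theta_0}$ is the background model trained on the perturbed data. In-distribution instances obtain a high likelihood under the semantic model relative to the background model, while for out-of-distribution instances both likelihoods are dominated by the shared background statistics and the ratio is low.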
The perturbations are sampled from an i.i.d. Bernoulli distribution with rate $\mu$ and substitute a feature with one of the other possible feature values, each with equal probability. For images, this means replacing a pixel with a different value sampled uniformly at random from the $0$ to $255$ pixel range.
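A minimal NumPy sketch of such a perturbation, purely for illustration (this is not the implementation used by the detector):

import numpy as np

def perturb(x: np.ndarray, mu: float = .2) -> np.ndarray:
    """Resample each pixel with probability mu, uniformly over the 0-255 range.
    For simplicity a pixel may occasionally be resampled to its original value."""
    mask = np.random.rand(*x.shape) < mu              # i.i.d. Bernoulli(mu) per pixel
    noise = np.random.randint(0, 256, size=x.shape)   # uniform over 0-255
    return np.where(mask, noise, x)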
The generative model used in the example is a PixelCNN++, adapted from the official TensorFlow Probability implementation and available as a standalone model via from alibi_detect.models.tensorflow import PixelCNN.
Dataset
The Fashion-MNIST training set consists of 60,000 28 by 28 grayscale images distributed over 10 classes. The classes represent items of clothing such as shirts or trousers. At test time, we want to distinguish the Fashion-MNIST test set from MNIST, which consists of 28 by 28 grayscale images of handwritten digits from 0 to 9.
This notebook requires the seaborn package for visualization which can be installed via pip:
!pip install seaborn
import os
from functools import partial
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import tensorflow as tf
from alibi_detect.od import LLR
from alibi_detect.models.tensorflow import PixelCNN
from alibi_detect.utils.fetching import fetch_detector
from alibi_detect.saving import save_detector, load_detector
from alibi_detect.utils.tensorflow import predict_batch
from alibi_detect.utils.visualize import plot_roc
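With the dependencies imported, the Fashion-MNIST data and the MNIST test set can be loaded directly from tf.keras.datasets. A sketch of the loading step; the variable names and the casting/reshaping are illustrative and may need to be adapted to the model's expected input shape:

# load Fashion-MNIST (inliers) and the MNIST test set (used as outliers at test time)
(X_train, y_train), (X_fashion, y_fashion) = tf.keras.datasets.fashion_mnist.load_data()
X_mnist, y_mnist = tf.keras.datasets.mnist.load_data()[1]

# add a channel dimension and cast to float32
X_train = X_train.astype(np.float32).reshape(-1, 28, 28, 1)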
The main arguments of the PixelCNN model are the following:
num_hierarchies: number of blocks separated by expansions or contractions of dimensions; see Fig. 2 of the PixelCNN++ paper.
num_filters: number of convolutional filters.
num_logistic_mix: number of components in the logistic mixture distribution.
receptive_field_dims: height and width in pixels of the receptive field above and to the left of a given pixel.
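A sketch of how the model could be instantiated with these arguments, where image_shape is the input shape and the values shown are purely illustrative rather than tuned settings:

input_shape = (28, 28, 1)
model = PixelCNN(
    image_shape=input_shape,
    num_hierarchies=2,
    num_filters=32,
    num_logistic_mix=1,
    receptive_field_dims=(3, 3)
)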
Optionally, a different model can be passed to the detector with argument model_background. The Likelihood Ratio paper mentions that additional $L2$-regularization (l2_weight) for the background model could improve detection performance.
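A hedged sketch of initializing, training and saving the detector; the training settings and file path are illustrative, and the perturbed dataset used to fit the background model is created internally during fit:

od = LLR(threshold=None, model=model)  # no threshold yet, inferred later with infer_threshold

# fit the semantic model and the background model (trained on a perturbed copy of the data)
od.fit(X_train, epochs=10, batch_size=32)

# persist the trained detector
filepath = 'od_llr_fashion_mnist'  # hypothetical path
save_detector(od, filepath)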
We can load our saved detector again by defining the PixelCNN architectures for the semantic and background models as well as providing the shape of the input data. To check what the detector has learned, we can sample from the fitted models. Most of the instances sampled from the semantic model look like they represent the dataset well. When we do the same for the background model, we can see the injected background noise:
X_sample = od.dist_b.sample(n_sample).numpy()
plot_grid_img(X_sample)
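The plot_grid_img helper and n_sample are not part of alibi-detect; a minimal version of the helper could look as follows:

def plot_grid_img(X: np.ndarray, nrows: int = 4, ncols: int = 4) -> None:
    """Display a grid of grayscale images."""
    fig, axes = plt.subplots(nrows, ncols, figsize=(ncols, nrows))
    for ax in axes.ravel():
        ax.axis('off')
    for ax, img in zip(axes.ravel(), X):
        ax.imshow(img.squeeze(), cmap='gray')
    plt.show()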
Compare the log likelihoods
Let's compare the log likelihoods of the inlier vs. the outlier data under the semantic and background models. Although MNIST data look very distinct from Fashion-MNIST, the generative model does not distinguish well between the two datasets, as histograms of the log likelihoods show.
This is due to the dominance of the background, which is similar for both datasets (basically lots of $0$-valued pixels). If we take the likelihood ratio however, the MNIST data are detected as outliers, and this is exactly what the outlier detector does as well.
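A sketch of this comparison, assuming the fitted detector exposes the semantic model as od.dist_s analogously to od.dist_b used above (the attribute name is an assumption), using small subsets of the Fashion-MNIST and MNIST test sets:

def log_prob_fn(x: np.ndarray, dist) -> np.ndarray:
    x = x.astype(np.float32).reshape(-1, 28, 28, 1)
    return dist.log_prob(x).numpy()

# log likelihoods under both models for inliers (Fashion-MNIST) and outliers (MNIST)
logp_s_in, logp_b_in = log_prob_fn(X_fashion[:500], od.dist_s), log_prob_fn(X_fashion[:500], od.dist_b)
logp_s_out, logp_b_out = log_prob_fn(X_mnist[:500], od.dist_s), log_prob_fn(X_mnist[:500], od.dist_b)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(logp_s_in, bins=50, alpha=.5, label='Fashion-MNIST')
axes[0].hist(logp_s_out, bins=50, alpha=.5, label='MNIST')
axes[0].set_title('log likelihood (semantic model)')
axes[1].hist(logp_s_in - logp_b_in, bins=50, alpha=.5, label='Fashion-MNIST')
axes[1].hist(logp_s_out - logp_b_out, bins=50, alpha=.5, label='MNIST')
axes[1].set_title('log likelihood ratio')
for ax in axes:
    ax.legend()
plt.show()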
We follow the same procedure with the outlier detector. First we need to set an outlier threshold with infer_threshold. We need to pass a batch of instances and specify what percentage of those we consider to be normal via threshold_perc. Let's assume we have a small batch of data with roughly $50$% outliers but we don't know exactly which ones.
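A sketch of constructing such a batch and inferring the threshold; the 50/50 split and batch size are illustrative:

# build a mixed batch of roughly 50% Fashion-MNIST inliers and 50% MNIST outliers
n = 500
X_test = np.concatenate([X_fashion[:n], X_mnist[:n]]).astype(np.float32).reshape(-1, 28, 28, 1)
y_true = np.concatenate([np.zeros(n), np.ones(n)])  # 1 = outlier (MNIST)

# threshold chosen such that roughly 50% of this batch is considered normal
od.infer_threshold(X_test, threshold_perc=50, batch_size=32)

We can then detect the outliers in the batch: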
od_preds = od.predict(X_test,
batch_size=32,
outlier_type='instance', # use 'feature' or 'instance' level
return_feature_score=True, # scores used to determine outliers
return_instance_score=True)
Display results
F1 score, accuracy, precision, recall and confusion matrix:
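A sketch of computing these from the detector's predictions, assuming the instance-level outlier flags are returned under od_preds['data']['is_outlier'] and reusing the y_true labels constructed above:

y_pred = od_preds['data']['is_outlier']

print(f"F1 score:  {f1_score(y_true, y_pred):.3f}")
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")

cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, cbar=False, fmt='d')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()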
To understand why the likelihood ratio works to detect outliers but the raw log likelihoods don't, it is helpful to look at the pixel-wise log likelihoods of both the semantic and background models.
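The feature-level scores returned by predict (because return_feature_score=True) contain the pixel-wise likelihood ratios. A sketch of visualizing them next to the original instances; the exact array shape may require reshaping, and the per-model pixel-wise log likelihoods discussed below would additionally require per-pixel log probabilities from the underlying models, which is omitted here:

fscore = od_preds['data']['feature_score']

idx_in, idx_out = 0, len(X_test) - 1  # one Fashion-MNIST inlier and one MNIST outlier
fig, axes = plt.subplots(2, 2, figsize=(6, 6))
for row, i in enumerate([idx_in, idx_out]):
    axes[row, 0].imshow(X_test[i].squeeze(), cmap='gray')
    axes[row, 0].set_title('instance')
    axes[row, 1].imshow(fscore[i].reshape(28, 28))
    axes[row, 1].set_title('pixel-wise log likelihood ratio')
    for ax in axes[row]:
        ax.axis('off')
plt.show()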
It is clear that both the semantic and background models attach high probabilities to the background pixels, and this effect is cancelled out in the likelihood ratio. The same applies to the out-of-distribution instances.