Likelihood Ratio Outlier Detection on Genomic Sequences
Method
The outlier detector described by Ren et al. (2019) in Likelihood Ratios for Out-of-Distribution Detection uses the likelihood ratio between two generative models as the outlier score. One model is trained on the original data, while the other is trained on a perturbed version of the dataset. This is based on the observation that the likelihood of an instance under a generative model can be heavily affected by population-level background statistics. The second generative model is therefore trained to capture only these background statistics, which remain present in the perturbed data after the perturbations have erased the semantic features.
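The ratio itself is simple to compute once both models return per-position log-probabilities. The following is a minimal NumPy sketch with made-up numbers, not the detector's implementation; the function name `likelihood_ratio` and the toy arrays are illustrative assumptions.

```python
import numpy as np

def likelihood_ratio(logp_semantic, logp_background):
    """Log-likelihood ratio per sequence from per-position log-probabilities."""
    logp_semantic = np.asarray(logp_semantic, dtype=float)
    logp_background = np.asarray(logp_background, dtype=float)
    # log p_semantic(x) - log p_background(x), summed over sequence positions
    return logp_semantic.sum(axis=-1) - logp_background.sum(axis=-1)

# Toy per-position log-probabilities for 2 sequences of length 3 (made-up numbers).
logp_s = np.log([[0.5, 0.5, 0.5], [0.25, 0.25, 0.25]])
logp_b = np.log([[0.25, 0.25, 0.25], [0.25, 0.25, 0.25]])
llr = likelihood_ratio(logp_s, logp_b)
```

In this toy case the second sequence gets a ratio of zero: the semantic model explains it no better than the background model does, so nothing beyond background statistics supports it being in-distribution.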
The perturbations are sampled independently for each feature from a Bernoulli distribution with rate $\mu$. A perturbed feature is substituted with one of the other possible feature values with equal probability. Each feature in the genome dataset can take 4 values (one of the ACGT nucleobases), so a perturbed feature is swapped with one of the 3 other nucleobases. The generative model used in the example is a simple LSTM network.
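The perturbation step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the library's own code: the function name `perturb`, the integer encoding 0 to 3 for the nucleobases, and the random toy data are all hypothetical.

```python
import numpy as np

def perturb(X, mu, n_values=4, seed=0):
    """Substitute each feature with probability mu by a different value, chosen uniformly."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    # Bernoulli(mu) mask, sampled independently for every position.
    mask = rng.random(X.shape) < mu
    # An offset in {1, ..., n_values - 1} modulo n_values always yields a
    # different value, and each of the other values is equally likely.
    offset = rng.integers(1, n_values, size=X.shape)
    X_pert = np.where(mask, (X + offset) % n_values, X)
    return X_pert.astype(X.dtype)

# Toy data: 1000 random sequences of length 249 (assumed integer-encoded).
X = np.random.default_rng(1).integers(0, 4, size=(1000, 249)).astype(np.int8)
X_pert = perturb(X, mu=0.2)
```

With `mu=0.2`, roughly 20% of the positions in `X_pert` differ from `X`, and every changed position holds one of the 3 other values.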
Dataset
The bacteria genomics dataset for out-of-distribution detection was released as part of the Likelihood Ratios for Out-of-Distribution Detection paper. From the original TL;DR: The dataset contains genomic sequences of 250 base pairs from 10 in-distribution bacteria classes for training, 60 OOD bacteria classes for validation, and another 60 different OOD bacteria classes for test. There are respectively 1, 7 and again 7 million sequences in the training, validation and test sets. For detailed info on the dataset check the README.
This notebook requires the seaborn package for visualization, which can be installed via pip:
!pip install seaborn
import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input, LSTM
from alibi_detect.od import LLR
from alibi_detect.datasets import fetch_genome
from alibi_detect.utils.fetching import fetch_detector
from alibi_detect.saving import save_detector, load_detector
from alibi_detect.utils.visualize import plot_roc
Load genome data
X represents the genome sequences and y whether they are outliers ($1$) or not ($0$).
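The splits used in the rest of the notebook can be obtained with `fetch_genome`, which is imported above. The sketch below assumes that `fetch_genome` accepts the `return_X_y` and `return_labels` keyword arguments and returns three `(X, y)` tuples when `return_X_y=True`; `load_genome` is a hypothetical wrapper, used here only so that defining it does not trigger the download.

```python
def load_genome():
    """Fetch the genome dataset as (X, y) pairs for the train, validation and test splits."""
    # Imported lazily so that defining this helper does not start the download.
    from alibi_detect.datasets import fetch_genome
    # Assumed keyword arguments; check the alibi-detect documentation for your version.
    return fetch_genome(return_X_y=True, return_labels=False)

# Usage (downloads the full dataset, so it is left commented out here):
# (X_train, y_train), (X_val, y_val), (X_test, y_test) = load_genome()
```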
There are no outliers in the training set, whereas the majority of sequences in the validation and test sets are outliers, drawn from bacteria classes that do not appear in the training data:
print('Fraction of outliers in train, val and test sets: '
      '{:.2f}, {:.2f} and {:.2f}'.format(y_train.mean(), y_val.mean(), y_test.mean()))
Define model
We need to define a generative model which models the genome sequences. We follow the paper and opt for a simple LSTM. Note that we don't actually need to define the model below if we simply load the pretrained detector later on:
genome_dim = 249  # not 250: positions 1->249 are the input and positions 2->250 the target
input_dim = 4 # ACGT nucleobases
hidden_dim = 2000
inputs = Input(shape=(genome_dim,), dtype=tf.int8)
x = tf.one_hot(tf.cast(inputs, tf.int32), input_dim)
x = LSTM(hidden_dim, return_sequences=True)(x)
logits = Dense(input_dim, activation=None)(x)
model = tf.keras.Model(inputs=inputs, outputs=logits, name='LlrLSTM')
We also need to define a loss function, which we can reuse to evaluate the log-likelihood for the outlier detector.
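For an autoregressive model that outputs logits over the 4 nucleobases, the natural loss is the softmax cross-entropy between the logits and the next-position targets, and the log-likelihood is its negative. The following NumPy sketch illustrates that relationship. It is an assumption-laden stand-in, not the notebook's TensorFlow code: the names `log_likelihood` and `loss` and the toy sequence are hypothetical.

```python
import numpy as np

def log_likelihood(targets, logits):
    """Per-position log-probability of each target token under softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=int)
    # Numerically stable log-softmax over the last (nucleobase) axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Select the log-probability assigned to the observed token at each position.
    return np.take_along_axis(log_probs, targets[..., None], axis=-1).squeeze(-1)

def loss(targets, logits):
    """Softmax cross-entropy, i.e. the negative log-likelihood per position."""
    return -log_likelihood(targets, logits)

# Toy sequence of length 5: positions 1..4 are the input, positions 2..5 the target.
sequence = np.array([[0, 1, 2, 3, 0]])
inputs, targets = sequence[:, :-1], sequence[:, 1:]
logits = np.zeros((1, 4, 4))  # stand-in for model(inputs): uniform over 4 values
ll = log_likelihood(targets, logits)
```

With uniform logits, every position has log-likelihood `log(0.25)`. Summing over positions gives the sequence log-likelihood that enters the likelihood ratio.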
Let's compare the log-likelihoods of the inlier and outlier test set data under the semantic and background models. We randomly sample $100,000$ instances from each group since the full test set contains $7,000,000$ genomic sequences. The histograms show that the generative model on its own does not distinguish well between inliers and outliers.
This is because of the background effect, which in this case is the GC content of the genomic sequences. The effect is partially reduced when taking the likelihood ratio.
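GC content is the fraction of positions in a sequence occupied by G or C. A short sketch of how it could be computed for integer-encoded sequences follows. The encoding A=0, C=1, G=2, T=3 is an assumption made for illustration and may differ from the one used in the dataset; `gc_content` is a hypothetical helper.

```python
import numpy as np

def gc_content(X, c_index=1, g_index=2):
    """Fraction of positions per sequence that hold C or G (assumed encoding)."""
    X = np.asarray(X)
    return ((X == c_index) | (X == g_index)).mean(axis=-1)

# Toy sequences under the assumed encoding A=0, C=1, G=2, T=3.
X = np.array([[0, 1, 2, 3],   # ACGT
              [1, 1, 2, 2],   # CCGG
              [0, 0, 3, 3]])  # AATT
gc = gc_content(X)
```

Plotting the log-likelihood against such a statistic is one way to check how strongly a model's score tracks the background rather than the semantic content.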
We follow the same procedure with the outlier detector. First, we need to set an outlier threshold with infer_threshold. We pass a batch of instances and specify via threshold_perc the percentage of those instances we consider to be normal. Let's assume we have a small batch of data with roughly $30$% outliers, but we don't know exactly which ones.
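Conceptually, with roughly 30% outliers the threshold sits at the 70th percentile of the outlier scores in the batch. The sketch below illustrates that idea on synthetic scores. It assumes that higher scores mean "more outlying" and that the threshold is a plain percentile; `infer_threshold_sketch` is a hypothetical stand-in, not the detector's method.

```python
import numpy as np

def infer_threshold_sketch(scores, threshold_perc):
    """Threshold below which threshold_perc percent of the scores fall."""
    return np.percentile(scores, threshold_perc)

# Synthetic outlier scores: 70% "normal" and 30% "outlying" instances.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 700), rng.normal(6, 1, 300)])

threshold = infer_threshold_sketch(scores, threshold_perc=70.0)
is_outlier = scores > threshold  # instances above the threshold are flagged
```

By construction, about 30% of the batch is flagged. How well those flags match the true outliers depends on how cleanly the scores separate the two groups.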