ProtoSelect on Adult Census and CIFAR10
Bien and Tibshirani (2012) proposed ProtoSelect, a prototype selection method that aims to construct both a condensed view of a dataset and an interpretable model (applicable to classification only). Prototypes can be defined as instances that are representative of the entire training data distribution. Formally, consider a dataset of training points $\mathcal{X} = \{x_1, ..., x_n\} \subset \mathbf{R}^p$ and their corresponding labels $\mathcal{Y} = \{y_1, ..., y_n\}$, where $y_i \in \{1, 2, ..., L\}$. ProtoSelect finds sets $\mathcal{P}_{l} \subseteq \mathcal{X}$ for each class $l$ such that the union of $\mathcal{P}_1, \mathcal{P}_2, ..., \mathcal{P}_L$ provides a distilled view of the training dataset $(\mathcal{X}, \mathcal{Y})$.
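Given the sets of prototypes, one can construct a simple interpretable classifier given by:

$$\hat{c}(x) = \underset{l \in \{1, ..., L\}}{\arg\min} \; \min_{z \in \mathcal{P}_l} d(x, z)$$

where $d(x, z)$ is the dissimilarity between $x$ and $z$ (for example, the Euclidean distance).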
Note that the classifier defined in the equation above is equivalent to 1-KNN if each set $\mathcal{P}_l$ consists only of instances belonging to class $l$.
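The nearest-prototype rule can be sketched in a few lines of NumPy. This is an illustrative sketch, not Alibi code: the function name `nearest_prototype_predict`, the dictionary layout of the prototype sets, and the choice of Euclidean distance are all assumptions made for the example, and every class is assumed to have at least one prototype.

```python
import numpy as np

def nearest_prototype_predict(X, prototype_sets):
    """Assign each row of X to the class whose prototype set holds the closest point.

    prototype_sets: dict mapping a class label to an array of shape (m_l, p).
    (Hypothetical layout chosen for this sketch.)
    """
    labels = list(prototype_sets.keys())
    # Column k holds, for each query point, the distance to its nearest
    # prototype in the set of class labels[k].
    min_dists = np.stack(
        [
            np.linalg.norm(X[:, None, :] - prototype_sets[l][None, :, :], axis=-1).min(axis=1)
            for l in labels
        ],
        axis=1,
    )
    # Pick the class with the smallest nearest-prototype distance.
    return np.array(labels)[min_dists.argmin(axis=1)]

prototype_sets = {
    0: np.array([[0.0, 0.0], [1.0, 0.0]]),
    1: np.array([[5.0, 5.0]]),
}
X_query = np.array([[0.2, 0.1], [4.0, 4.5], [1.2, -0.3]])
predictions = nearest_prototype_predict(X_query, prototype_sets)
```

Here the first and third query points fall closest to a class-0 prototype and the second to the class-1 prototype.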
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from typing import List, Dict, Tuple
import tensorflow as tf
import tensorflow.keras as keras
import tensorflow.keras.layers as layers
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from alibi.api.interfaces import Explanation
from alibi.datasets import fetch_adult
from alibi.prototypes import ProtoSelect, visualize_image_prototypes
from alibi.utils.kernel import EuclideanDistance
from alibi.prototypes.protoselect import cv_protoselect_euclidean
Utils
Utility function to display the tabular data in a human-readable format.
Utility function to display image prototypes.
Adult Census dataset
Load Adult Census dataset
Fetch the Adult Census dataset and perform a train-test-validation split. In this example, for demonstrative purposes, each split contains only 1000 instances. One can increase the number of instances in each set, but should be aware of the memory limitation, since the kernel matrix used for prototype selection is precomputed and stored in memory.
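To get a feel for the memory limitation, the size of a dense dissimilarity matrix can be estimated with simple arithmetic. The helper below is a hypothetical illustration; it assumes 4-byte floating point values, which may not match how Alibi stores the matrix.

```python
def kernel_matrix_megabytes(n_train, n_protos, bytes_per_value=4):
    """Approximate size in MB of a dense (n_train x n_protos) dissimilarity matrix.

    bytes_per_value=4 assumes 32-bit floats (an assumption of this sketch).
    """
    return n_train * n_protos * bytes_per_value / 1e6

small = kernel_matrix_megabytes(1_000, 1_000)    # the reduced split used in this example
large = kernel_matrix_megabytes(30_000, 30_000)  # a hypothetical, much larger split
```

Under these assumptions the matrix grows from about 4 MB to about 3600 MB, because the cost is quadratic in the number of instances.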
Preprocessing function
Because the tabular dataset has low dimensionality, we can use a simple preprocessing function: numerical features are standardized and categorical features are one-hot encoded. The kernel dissimilarity used for prototype selection will operate on the preprocessed representation.
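As an illustration of the idea, the sketch below standardizes numerical columns and one-hot encodes categorical columns using only NumPy. It is a stand-in for a scikit-learn pipeline, not the code used in this example; `fit_preprocessor` and its arguments are hypothetical names, and categorical columns are assumed to be integer-coded.

```python
import numpy as np

def fit_preprocessor(X, numerical_ids, categorical_ids):
    """Collect per-column statistics on the training data and return a transform."""
    X = np.asarray(X, dtype=float)
    mean = X[:, numerical_ids].mean(axis=0)
    std = X[:, numerical_ids].std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    # Distinct codes seen during fitting, one array per categorical column.
    categories = [np.unique(X[:, j]) for j in categorical_ids]

    def preprocess(X_new):
        X_new = np.asarray(X_new, dtype=float)
        scaled = (X_new[:, numerical_ids] - mean) / std
        # A code not seen during fitting yields an all-zero one-hot block.
        one_hot = [
            (X_new[:, [j]] == cats[None, :]).astype(float)
            for j, cats in zip(categorical_ids, categories)
        ]
        return np.hstack([scaled] + one_hot)

    return preprocess

# Toy data: column 0 is numerical, column 1 is an integer-coded category.
X_toy = np.array([[20.0, 0], [40.0, 1], [60.0, 2]])
preprocess_fn = fit_preprocessor(X_toy, numerical_ids=[0], categorical_ids=[1])
X_ppf = preprocess_fn(X_toy)
```

The resulting array has one standardized column followed by three one-hot columns, and distances computed on it treat every categorical mismatch equally.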
Prototypes selection
As with every kernel-based method, the performance of ProtoSelect is sensitive to the choice of kernel and to a predefined $\epsilon$-radius, which characterizes the neighborhood of an instance $x$ as a hyper-sphere of radius $\epsilon$ centered at $x$, denoted $B(x, \epsilon)$. Other kernel dissimilarities (e.g., Gaussian RBF) may require some tuning, in which case we would have to search jointly for the optimum $\epsilon$ and kernel parameters. In our case, we use a simple Euclidean distance metric that does not require any tuning, so we only need to search for the optimum $\epsilon$-radius to be used by ProtoSelect. Alibi already supports a grid-based search for the optimum value of $\epsilon$ when using a Euclidean distance metric.
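To make the role of the $\epsilon$-balls concrete, here is a simplified greedy selection in the spirit of the set-cover heuristic of Bien and Tibshirani (2012). It is a sketch under stated assumptions, not Alibi's implementation: the function name `greedy_prototypes` is hypothetical, the per-prototype penalty is fixed to $1/n$, and candidates are drawn from the training set itself. At each step it picks the instance and class whose ball covers the most not-yet-covered points of that class while containing the fewest points of other classes.

```python
import numpy as np

def greedy_prototypes(X, y, eps):
    """Simplified greedy prototype selection over Euclidean eps-balls (illustrative)."""
    n = len(X)
    classes = np.unique(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    in_ball = dist <= eps                      # in_ball[i, j]: x_i lies in B(x_j, eps)
    same = y[:, None] == classes[None, :]      # same[i, c]: x_i has label classes[c]
    covered = np.zeros(n, dtype=bool)          # x_i covered by a prototype of its own class
    # wrong[j, c]: points of other classes inside B(x_j, eps).
    wrong = in_ball.T.astype(int) @ (~same).astype(int)
    chosen = []
    while True:
        # gain[j, c]: uncovered points of class c that B(x_j, eps) would newly cover.
        uncovered_same = same & ~covered[:, None]
        gain = in_ball.T.astype(int) @ uncovered_same.astype(int)
        score = gain - wrong - 1.0 / n         # assumed penalty of 1/n per prototype
        j, c = np.unravel_index(score.argmax(), score.shape)
        if score[j, c] <= 0:                   # no candidate improves the objective
            break
        chosen.append((int(j), classes[c].item()))
        covered |= in_ball[:, j] & same[:, c]
    return chosen

# Two well-separated clusters, one per class.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                  [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
prototypes = greedy_prototypes(X_toy, y_toy, eps=0.5)
```

With `eps=0.5` each cluster fits inside a single ball, so one prototype per class is enough. A much smaller radius would force more prototypes, while a much larger one would make every ball contain points of the wrong class, which is why the radius has to be tuned.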
To search for the optimum $\epsilon$-radius, we call the cv_protoselect_euclidean method with a training dataset, an optional prototype dataset (the training dataset is used by default if no prototype dataset is provided), and a validation set. In the absence of a validation dataset, the method performs cross-validation on the training dataset.
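One plausible way to build the candidate grid for such a search is to take evenly spaced radii between two quantiles of the pairwise distances. The sketch below illustrates that idea only; `eps_grid_from_quantiles` and its parameters are hypothetical and are not claimed to mirror Alibi's internals.

```python
import numpy as np

def eps_grid_from_quantiles(X, quantiles=(0.0, 0.5), grid_size=5):
    """Evenly spaced candidate radii between two quantiles of the pairwise distances."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Use each unordered pair once and skip the zero diagonal.
    pairwise = dist[np.triu_indices(len(X), k=1)]
    low, high = np.quantile(pairwise, quantiles)
    return np.linspace(low, high, num=grid_size)

X_toy = np.array([[0.0], [1.0], [3.0], [6.0]])
grid = eps_grid_from_quantiles(X_toy, quantiles=(0.0, 1.0), grid_size=6)
```

Each candidate radius would then be scored by selecting prototypes with it, fitting a 1-KNN classifier on them, and measuring accuracy on the validation set; the radius with the best score is kept.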
Once we have the optimum value of $\epsilon$, we can instantiate ProtoSelect as follows:
Display prototypes
Let us inspect the returned prototypes:
| | Age | Workclass | Education | Marital Status | Occupation | Relationship | Race | Sex | Capital Gain | Capital Loss | Hours per week | Country | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33 | Private | High School grad | Never-Married | Blue-Collar | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 31 | Private | High School grad | Never-Married | Service | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |
| 2 | 60 | Private | Associates | Separated | Service | Not-in-family | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 3 | 61 | ? | High School grad | Married | ? | Husband | White | Male | 0 | 0 | 6 | United-States | <=50K |
| 4 | 20 | ? | High School grad | Never-Married | ? | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |
| 5 | 31 | Private | High School grad | Never-Married | Service | Own-child | White | Male | 0 | 1721 | 16 | United-States | <=50K |
| 6 | 27 | ? | High School grad | Separated | ? | Own-child | Black | Female | 0 | 0 | 40 | United-States | <=50K |
| 7 | 63 | Private | Dropout | Widowed | Service | Not-in-family | White | Female | 0 | 0 | 31 | United-States | <=50K |
| 8 | 67 | Federal-gov | Bachelors | Widowed | Admin | Not-in-family | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 9 | 25 | Private | Dropout | Never-Married | Service | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <=50K |
| 10 | 38 | Self-emp-not-inc | Bachelors | Never-Married | Blue-Collar | Not-in-family | Amer-Indian-Eskimo | Male | 0 | 0 | 30 | United-States | <=50K |
| 11 | 31 | Private | High School grad | Separated | Blue-Collar | Unmarried | White | Female | 0 | 2238 | 40 | United-States | <=50K |
| 12 | 21 | Private | High School grad | Never-Married | Sales | Own-child | Asian-Pac-Islander | Male | 0 | 0 | 30 | British-Commonwealth | <=50K |
| 13 | 17 | ? | Dropout | Never-Married | ? | Own-child | White | Male | 0 | 0 | 20 | South-America | <=50K |
| 14 | 61 | Local-gov | Masters | Married | White-Collar | Husband | White | Male | 7298 | 0 | 60 | United-States | >50K |
| 15 | 51 | Self-emp-inc | Doctorate | Married | Professional | Husband | White | Male | 15024 | 0 | 40 | United-States | >50K |
| 16 | 55 | Private | Masters | Married | White-Collar | Husband | White | Male | 0 | 1977 | 40 | United-States | >50K |
| 17 | 33 | Private | Bachelors | Married | Professional | Husband | White | Male | 15024 | 0 | 75 | United-States | >50K |
| 18 | 49 | Self-emp-inc | Prof-School | Married | Professional | Husband | White | Male | 99999 | 0 | 37 | United-States | >50K |
| 19 | 90 | Private | Prof-School | Married | Professional | Husband | White | Male | 20051 | 0 | 72 | United-States | >50K |
By inspecting the prototypes, we can observe that features like Education and Marital Status reveal some patterns. People with a lower education level (e.g., High School grad, Dropout) and without a partner (e.g., Never-Married, Separated, Widowed) tend to be classified as $\leq 50K$. On the other hand, people with a higher education level (e.g., Bachelors, Masters, Doctorate) and with a partner (e.g., Married) tend to be classified as $>50K$.
Train 1-KNN
A standard procedure to check the quality of the prototypes is to train a 1-KNN classifier and evaluate its performance.
To verify that ProtoSelect returns better prototypes than a simple random selection, we randomly sample multiple prototype sets, train a 1-KNN for each set, evaluate the classifiers, and return the average accuracy score.
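The random baseline described above can be sketched with NumPy alone. This is an illustrative stand-in for a scikit-learn based evaluation; the function names, the number of trials, and the toy data are assumptions made for the example.

```python
import numpy as np

def one_nn_accuracy(X_proto, y_proto, X_test, y_test):
    """Accuracy of a 1-nearest-neighbour classifier fitted on the prototypes."""
    dist = np.linalg.norm(X_test[:, None, :] - X_proto[None, :, :], axis=-1)
    y_pred = y_proto[dist.argmin(axis=1)]
    return float((y_pred == y_test).mean())

def random_baseline_accuracy(X_train, y_train, X_test, y_test,
                             num_prototypes, n_trials=20, seed=0):
    """Mean 1-NN accuracy over several randomly drawn prototype sets."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        idx = rng.choice(len(X_train), size=num_prototypes, replace=False)
        scores.append(one_nn_accuracy(X_train[idx], y_train[idx], X_test, y_test))
    return float(np.mean(scores))

# Toy data: two separated 1-D clusters, ten points per class.
X_toy = np.vstack([np.linspace(0, 1, 10)[:, None], np.linspace(5, 6, 10)[:, None]])
y_toy = np.repeat([0, 1], 10)

# One hand-picked prototype per class versus two random prototypes.
selected = np.array([0, 10])
selected_acc = one_nn_accuracy(X_toy[selected], y_toy[selected], X_toy, y_toy)
baseline_acc = random_baseline_accuracy(X_toy, y_toy, X_toy, y_toy, num_prototypes=2)
```

On this toy data a random draw sometimes picks both prototypes from the same class, which drags the averaged baseline down relative to a selection that covers both classes.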
Compare the results returned by ProtoSelect and by the random sampling.
We can observe that ProtoSelect chooses more representative instances than a naive random selection. The gap between the two should narrow as we increase the number of requested prototypes.
CIFAR10 dataset
Load dataset
Fetch the CIFAR10 dataset, perform a train-test-validation split, and apply standard preprocessing. For demonstrative purposes, we use a reduced dataset; as before, keep in mind the memory cost of precomputing and storing the kernel matrix.
Preprocessing function
For CIFAR10, we use the output of a hidden layer from a pre-trained network as the feature representation of the input images. The network was trained on the CIFAR10 dataset. Note that one can use any feature representation of choice (e.g., one learned on a self-supervised task).
Prototypes selection
To obtain the best results, we apply the same grid-search procedure as in the case of the tabular dataset.
Once we have the optimum value of $\epsilon$ we can instantiate ProtoSelect as follows:
Display prototypes
We can visualize and understand the importance of a prototype in a 2D image scatter plot. Alibi provides a helper function which fits a 1-KNN classifier on the prototypes and computes their importance as the logarithm of the number of assigned training instances correctly classified according to the 1-KNN classifier. Thus, the larger the image, the more important the prototype is.
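The importance score described above can be written down directly. The sketch below is not the Alibi helper; `prototype_importance` is a hypothetical name, Euclidean distance is assumed, and prototypes with no correctly classified training instances are given an importance of 0, which is a choice made only for this example.

```python
import numpy as np

def prototype_importance(X_train, y_train, X_proto, y_proto):
    """Log of the number of training points a prototype attracts and labels correctly."""
    dist = np.linalg.norm(X_train[:, None, :] - X_proto[None, :, :], axis=-1)
    nearest = dist.argmin(axis=1)          # closest prototype for each training point
    correct = y_proto[nearest] == y_train  # the 1-KNN prediction matches the true label
    counts = np.bincount(nearest[correct], minlength=len(X_proto))
    # log(0) is undefined, so this sketch clips counts at 1 (importance 0).
    return np.log(np.maximum(counts, 1))

X_train_toy = np.array([[0.0], [0.2], [0.4], [5.0], [5.1]])
y_train_toy = np.array([0, 0, 0, 1, 1])
X_proto_toy = np.array([[0.1], [5.0]])
y_proto_toy = np.array([0, 1])
importance = prototype_importance(X_train_toy, y_train_toy, X_proto_toy, y_proto_toy)
```

The first prototype correctly accounts for three training points and the second for two, so the first receives the larger score and would be drawn larger in the plot.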

Besides visualizing and understanding prototype importance (i.e., larger images correspond to more important prototypes), one can also understand the diversity of each class by simply displaying all of its prototypes.

For example, within the current setup, we can observe that only two prototypes are required to cover the subset of airplane instances. On the other hand, for the subset of car instances, we need at least six prototypes. The visualization suggests that the car class has more diversity in the feature representation and therefore requires more prototypes to cover its instances.
Train 1-KNN
As before, we train a 1-KNN classifier to verify the quality of the prototypes returned by ProtoSelect against a random sampling.

