Feature Distributions

Distributions monitoring lets you view the statistics and distributions of your model's features and predictions over any given time range. It also lets you compare model predictions across different feature combinations, cohorts, and time slices. This is a vital part of the model monitoring cycle: it shows whether the deployed model has the desired prediction characteristics at different times and for different cohorts.

This demo uses a model trained to predict high or low income based on demographic features from a 1996 US census.

In this demo, we will:

- Register an income classifier model with the relevant predictions schema
- Launch a Seldon ML pipeline with the income classifier model
- Run a Batch Job to send predictions using the model pipeline
- Observe the feature distributions of the live predictions
- Filter distributions by time or predictions and feature level filters

Prerequisites

Under the production installation, the following must be installed:

- Elasticsearch (required) - used for storage of live predictions and reference data
- Metadata storage (required) - used for storage of model metadata (which includes the prediction schema)
- Argo Workflows (required for demo) - allows for batch jobs to be run
- MinIO (optional) - allows for easy storage of and access to downloaded datasets

In addition to the prerequisites, this demo needs the request logger to connect to Seldon Enterprise Platform in order to fetch the model-level predictions schema, which requires specific request logger configuration. This feature is supported with the Open Inference Protocol (OIP) only. JSON data, string data, bytes payloads, and multi-node graph use cases are not supported yet.

Register an income classifier model

In the Model Catalog page, click Register a new model.
In the Register New Model wizard, enter the following information, then click Register Model:
- Model Name: income-classifier
- URI: gs://seldon-models/scv2/samples/mlserver_1.6.0/income-sklearn/classifier/
- Artifact Type: SciKit Learn
- Version: v1

Configure predictions schema for classifier

Edit the model metadata to update the prediction schema for the model. The prediction schema is a generic schema structure for machine learning model predictions. It is a definition of feature inputs and output targets from the model prediction. Learn more about the predictions schema at the ML Predictions Schema open source repository. Use the income classifier model predictions schema income-classifier-prediction-schema.json to edit and save the model level metadata.

Click the model income-classifier that you registered.
Click Edit Metadata to update the Prediction schema field associated with the model using the contents of prediction schema income-classifier-prediction-schema.json.
Click Save Metadata.

Launch a Seldon ML pipeline

Deploy the income classifier model from the catalog into an appropriate namespace

In the Model Catalog, select Deploy from the Action dropdown.
Enter the deployment details in the deployment creation wizard and click Next:
- Name: income-distributions-demo
- Namespace: seldon
- Type: Seldon ML Pipeline
The relevant details are pre-filled from the model registry. Click Next.
Click Next for the remaining steps, then click Launch.

Run a Batch Job to send predictions to the model pipeline

To observe the predictions and feature distributions, we first need to send some predictions to the model. To simulate this, we run a Batch Job that takes an input file of predictions formatted according to the Open Inference Protocol. Follow the steps below:

Download the predictions data file predictions.txt.

This dataset contains 60 predictions. The first few lines of predictions.txt look like this:

{"inputs": [{"name": "income", "datatype": "INT64", "shape": [1, 12], "data": [30, 4, 4, 0, 2, 0, 4, 1, 5013, 0, 40, 9]}]}
{"inputs": [{"name": "income", "datatype": "INT64", "shape": [1, 12], "data": [30, 4, 1, 0, 6, 0, 4, 1, 2407, 0, 40, 9]}]}
{"inputs": [{"name": "income", "datatype": "INT64", "shape": [1, 12], "data": [32, 0, 3, 2, 0, 1, 4, 1, 0, 0, 40, 0]}]}

Upload the data to a bucket store of your choice. This demo uses MinIO and stores the data at the bucket path minio://predictions-data/predictions.txt. Refer to the batch request demo for an example of how this can be done using the MinIO browser.
Open the Deployment Dashboard of your deployment by clicking on it in the Overview page.
Click Batch Jobs in the left pane.
Create a Batch Job with the following details:

- Input Data Location: minio://predictions-data/predictions.txt
- Output Data Location: minio://predictions-data/output-data-{{workflow.name}}.txt
- Number of Workers: 1
- Number of Retries: 3
- Batch Size: 1
- Minimum Batch Wait Interval (sec): 5
- Method: Predict
- Transport Protocol: REST
- Input Data Type: Open Inference Protocol (OIP)
- Object Store Secret Name: minio-bucket-envvars

This will create a new Batch Job that will use the predictions data, located at minio://predictions-data/predictions.txt, to send requests to the model. The output data will be stored at minio://predictions-data/output-data-{{workflow.name}}.txt. The workflow will send requests to the model every 5 seconds using the REST transport protocol, and those will be formatted according to the Open Inference Protocol. We also specify the secret name minio-bucket-envvars which contains the credentials required for access to the MinIO bucket store and the predictions data.

Note: If the file containing the predictions data is uploaded to a private bucket storage, ensure that you have configured your storage access credentials secret.
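
The input file for the Batch Job holds one Open Inference Protocol request per line, so a malformed line may only surface once the job is running. As a minimal sketch (the helper name `validate_oip_line` is hypothetical and not part of any Seldon tooling), the following checks that a line parses as JSON and that each tensor's `data` length agrees with its `shape`:

```python
import json


def validate_oip_line(line: str) -> int:
    """Validate one OIP request line; return the total number of values it carries.

    Hypothetical helper for illustration only, not a Seldon API.
    """
    request = json.loads(line)
    total = 0
    for tensor in request["inputs"]:
        # The product of the shape dimensions is the number of expected values.
        expected = 1
        for dim in tensor["shape"]:
            expected *= dim
        actual = len(tensor["data"])
        if actual != expected:
            raise ValueError(
                f"tensor {tensor['name']!r}: shape {tensor['shape']} "
                f"expects {expected} values, got {actual}"
            )
        total += expected
    return total


# First line of the predictions.txt sample shown in this demo.
sample = (
    '{"inputs": [{"name": "income", "datatype": "INT64", "shape": [1, 12], '
    '"data": [30, 4, 4, 0, 2, 0, 4, 1, 5013, 0, 40, 9]}]}'
)
print(validate_oip_line(sample))
```

Running the helper over every line of predictions.txt before uploading is one way to catch truncated or mis-shaped rows early.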

Observe predictions and feature distributions

Select the income classifier deployment and go to the monitor section to view the predictions and feature distributions.

Filter distributions by time or feature level filters

Filter distributions by time, or by predictions and feature level filters, to compare different cohorts and carry out further analysis. For example, let's look at the predictions for all individuals in the Age group 25-50, filter by Marital Status to include only Married and Never-Married, and see how the average prediction frequency changes for this cohort.
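
To make the cohort idea concrete, here is a minimal sketch of the same filter applied to a handful of records in plain Python. The records, field names, and the inclusive age bounds are all assumptions made up for illustration; they are not taken from the demo data or from how the platform evaluates filters.

```python
from statistics import mean

# Hypothetical toy records; values are invented for illustration only.
records = [
    {"age": 30, "marital_status": "Married", "prediction": 1},
    {"age": 45, "marital_status": "Never-Married", "prediction": 0},
    {"age": 62, "marital_status": "Married", "prediction": 1},
    {"age": 28, "marital_status": "Separated", "prediction": 0},
    {"age": 38, "marital_status": "Married", "prediction": 0},
]


def cohort(rows, min_age, max_age, statuses):
    """Keep rows whose age is within [min_age, max_age] and whose status is allowed."""
    return [
        row
        for row in rows
        if min_age <= row["age"] <= max_age and row["marital_status"] in statuses
    ]


selected = cohort(records, 25, 50, {"Married", "Never-Married"})
cohort_rate = mean(row["prediction"] for row in selected)
overall_rate = mean(row["prediction"] for row in records)
print(len(selected), cohort_rate, overall_rate)
```

Comparing `cohort_rate` with `overall_rate` mirrors what the filtered and unfiltered charts let you compare visually.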

Configuring parameters

Distributions parameters configuration allows you to configure your charts for further analysis. For example, let's look at the charts in the Age group and change the Histogram interval to 15 and the Number of time buckets to 30.
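
As a rough illustration of what these two parameters control, the sketch below groups values into fixed-width histogram buckets and splits a time range into equal time buckets. It assumes buckets are aligned to multiples of the interval and that time buckets are of equal width; the platform's actual bucketing may differ, and the ages used are invented.

```python
from collections import Counter
from datetime import datetime, timedelta


def histogram(values, interval):
    """Count values per bucket, keyed by the lower bound of each bucket."""
    counts = Counter((value // interval) * interval for value in values)
    return dict(sorted(counts.items()))


def time_bucket_width(start, end, buckets):
    """Width of each bucket when [start, end) is split into equal parts."""
    return (end - start) / buckets


ages = [30, 30, 32, 47, 61, 19]  # hypothetical ages
print(histogram(ages, 15))

window_start = datetime(2024, 1, 1, 0, 0)
window_end = window_start + timedelta(hours=1)
print(time_bucket_width(window_start, window_end, 30))
```

A larger histogram interval gives fewer, coarser bars, while a larger number of time buckets gives a finer-grained view over time.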

Reference Data Distributions Comparison

Note: This feature is experimental and has very limited functionality. Only tabular reference data is supported.

A useful feature for monitoring feature distributions is the ability to compare live features against the same features from a reference dataset. Feature level comparisons let users visually gauge which features are actually drifting.

Here, we define reference data as a set of data where the distribution of the data is a useful representation of the expected live predictions. Typically, this would be a sampled subset of the training data used to create the inference model.
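
If you need to build such a subset yourself, a seeded random sample keeps the result reproducible. This is a minimal sketch using only the standard library; the function name and the stand-in rows are hypothetical, and the reference file provided for this demo is already prepared.

```python
import random


def sample_reference_rows(rows, k, seed=0):
    """Return k rows drawn without replacement, reproducibly for a given seed."""
    if k >= len(rows):
        return list(rows)
    return random.Random(seed).sample(rows, k)


training_rows = [[i, i * 2] for i in range(100)]  # stand-in for real training rows
reference_rows = sample_reference_rows(training_rows, 10)
print(len(reference_rows))
```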

In this section of the demo, we will extend the distributions monitoring demo and:

- Show how reference data needs to be prepared before it can be inserted into Seldon Enterprise Platform
- Create a bucket and upload some reference data to MinIO
- Trigger a retrieval job via the Seldon Enterprise Platform UI
- Toggle the comparison of live predictions and reference data feature distributions

Preprocess Reference Data

Note:
- Only tabular data that has been processed and saved in CSV format can be retrieved and inserted into Seldon Enterprise Platform as reference data.
- There are no PROBA or ONE_HOT features in the income dataset used in this demo, so it needs no preprocessing. The examples in this section use the Iris dataset instead to illustrate the preprocessing.

Currently, there is a strict requirement for the types of data that can be inserted into Seldon Enterprise Platform that is dependent on the prediction schema.

The number of columns in the reference data must match the number of expected columns from the prediction schema. This means that some feature types (i.e. ONE_HOT and PROBA) may need to be split into dummy/indicator values. For example, a PROBA feature (e.g. the Iris Species from the Iris dataset) might have 3 categories in the schema:
```
{
   "name": "Iris Species",
   "type": "PROBA",
   "dataType": "FLOAT",
   "schema": [
     {
       "name": "Setosa"
     },
     {
       "name": "Versicolor"
     },
     {
       "name": "Virginica"
     }
   ]
}
```
However, the raw reference dataset represents this feature as a single column where the values are its categories. This is typically the case for output features where the reference data has the expected real output, while the model returns a probability distribution of the different possible outputs:

| Iris Species |
| --- |
| Versicolor |
| Setosa |
| Virginica |
| Virginica |
| Setosa |

The processed data should have the following format:

| Setosa | Versicolor | Virginica |
| --- | --- | --- |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 0 | 0 | 1 |
| 0 | 0 | 1 |
| 1 | 0 | 0 |

The order of columns must match the prediction schema
Both "input" and "output" features must be in the reference dataset, where "input" features come before "output" features.
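
The splitting of a single categorical column into indicator columns can be sketched in a few lines of Python. This is an illustrative helper (the name `expand_categorical` is hypothetical), and it takes the category order as an argument so that the resulting columns follow the order given in the prediction schema:

```python
def expand_categorical(rows, column, categories):
    """Replace `column` with one 0/1 indicator column per category.

    The indicator columns are emitted in the order of `categories` and take
    the position of the original column, so the column order is preserved.
    """
    expanded = []
    for row in rows:
        if row[column] not in categories:
            raise ValueError(f"unexpected category {row[column]!r} in {column!r}")
        new_row = {}
        for key, value in row.items():
            if key == column:
                for category in categories:
                    new_row[category] = 1 if value == category else 0
            else:
                new_row[key] = value
        expanded.append(new_row)
    return expanded


# The raw single-column example from this section.
raw = [
    {"Iris Species": species}
    for species in ["Versicolor", "Setosa", "Virginica", "Virginica", "Setosa"]
]
processed = expand_categorical(
    raw, "Iris Species", ["Setosa", "Versicolor", "Virginica"]
)
for row in processed:
    print(row)
```

Each output row contains exactly one 1, in the column of the category that the raw row held.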

Reference Data Storage Bucket

Download the income classifier reference dataset income-reference-data.csv. Store it in a bucket of your choice.

In this demo, we use MinIO and create a bucket called reference-data, to which we upload the income-reference-data.csv file.
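
Before uploading a reference file, it can be worth confirming that its column count agrees with the prediction schema. The sketch below assumes each schema feature is a dictionary shaped like the Iris Species example (a `name`, plus an optional `schema` list of categories); the function names and the `Sepal Length` input feature are hypothetical, and the check only inspects the first row of the CSV.

```python
import csv
import io


def expected_column_names(features):
    """List the columns a set of schema features expands to, in order."""
    names = []
    for feature in features:
        categories = feature.get("schema")
        if categories:
            # Features with categories contribute one column per category.
            names.extend(category["name"] for category in categories)
        else:
            names.append(feature["name"])
    return names


def column_count_matches(csv_text, input_features, output_features):
    """Compare the first CSV row's width with the schema (inputs before outputs)."""
    expected = expected_column_names(input_features) + expected_column_names(
        output_features
    )
    first_row = next(csv.reader(io.StringIO(csv_text)))
    return len(first_row) == len(expected)


inputs = [{"name": "Sepal Length", "type": "REAL", "dataType": "FLOAT"}]  # hypothetical
outputs = [
    {
        "name": "Iris Species",
        "type": "PROBA",
        "dataType": "FLOAT",
        "schema": [{"name": "Setosa"}, {"name": "Versicolor"}, {"name": "Virginica"}],
    }
]
print(expected_column_names(inputs + outputs))
print(column_count_matches("5.1,1,0,0\n", inputs, outputs))
print(column_count_matches("5.1,Setosa\n", inputs, outputs))
```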

Trigger `Add Reference Data` Job

In your deployment, navigate to the Monitor page and open the Distributions tab. An Add reference data button should be available.

This button opens a wizard where the bucket path and secret can be specified. For this demo, use the following values:

Bucket Path: minio://reference-data/
Bucket Secret: minio-bucket-envvars

Note: Here minio-bucket-envvars is a pre-created secret in the namespace containing env vars.

After confirming, the job will start running, and the Add reference data button will change to a Retrieving data... status.

Toggle Reference Data Feature Distributions Comparison

Give the job a few minutes to finish. Once it has finished, the Retrieving data... button changes to Reference data available. The Toggle reference data toggle then becomes available for every feature, and you can compare the live prediction distributions against the reference data.

Use the filters to filter both live predictions and reference data.
