# Feature Distributions

Distributions monitoring lets you view the statistics and distributions of your model's features and predictions over any given time range. It also lets you compare model predictions across different feature combinations, cohorts, and time slices. This is a vital part of the model monitoring cycle: it helps you understand whether the deployed model has the desired prediction characteristics at different times and for different cohorts.

This demo uses a model trained to predict high or low income based on [demographic features from the 1994 US census](https://doi.org/10.24432/C5GP7S).

In this demo, we will:

* Register an income classifier model with the relevant predictions schema
* Launch a Seldon ML pipeline with the income classifier model
* Run a Batch Job to send predictions using the model pipeline
* Observe the feature distributions of the live predictions
* Filter distributions by time, or by prediction-level and feature-level filters

## Prerequisites

Under the production installation, the following must be installed:

* [Elasticsearch](/seldon-enterprise-platform/production-environment/elasticsearch.md) (required) - used for storage of live predictions and reference data
* [Metadata storage](/seldon-enterprise-platform/production-environment/postgresql.md) (required) - used for storage of model metadata (which includes the prediction schema)
* [Argo Workflows](/seldon-enterprise-platform/production-environment/argo-workflows.md) (required for demo) - allows for batch jobs to be run
* [MinIO](/seldon-enterprise-platform/production-environment/minio.md) (optional) - allows for easy storage of and access to downloaded datasets

In addition to the prerequisites, this demo needs the request logger to connect to Seldon Enterprise Platform in order to fetch the model-level predictions schema. This requires [specific request logger configuration](/seldon-enterprise-platform/production-environment/request-logging.md#ml-metadata). Also, this feature is supported with the Open Inference Protocol (OIP) only. JSON data, string data, bytes payloads, and multi-node graph use cases are not yet supported.

## Register an income classifier model

Register a pre-trained income classifier SKLearn model.

1. In the `Model Catalog` page, click `Register a new model`:

   !["Register a new model" button on the Model Catalog page](/files/f40YduRjaeZW0pvnAmij)
2. In the `Register New Model` wizard, enter the following information, then click `Register Model`:

   * *Model Name*: `income-classifier`
   * *URI*: `gs://seldon-models/scv2/samples/mlserver_1.6.0/income-sklearn/classifier/`
   * *Artifact Type*: `SciKit Learn`
   * *Version*: `v1`

   ![Model configuration wizard](/files/JW50768QuvhqNKagvKha)

## Configure predictions schema for classifier

Edit the model metadata to update the prediction schema for the model. The prediction schema is a generic schema structure for machine learning model predictions. It defines the feature inputs and output targets of a model prediction. Learn more about the predictions schema at the [ML Predictions Schema](https://github.com/SeldonIO/ml-prediction-schema) open source repository. Use the income classifier model predictions schema `income-classifier-prediction-schema.json` to edit and save the model-level metadata.

{% file src="/files/KarkDsbbSlDBd4oKaUuy" %}

1. Click the model `income-classifier` that you registered.

   ![Select "income-classifier" model on the Model Catalog page](/files/wp6bml9poTIBCtgT3G2h)
2. Click `Edit Metadata` to update the **Prediction schema** field associated with the model using the contents of prediction schema `income-classifier-prediction-schema.json`.

   ![Model's metadata wizard](/files/V1P3kl66ymxfPb1anFpx)
3. Click `Save Metadata`.

## Launch a Seldon ML pipeline

Deploy the income classifier model from the catalog into an appropriate namespace.

1. In the **Model catalog**, select **Deploy** from the **Action** dropdown.

   ![deploy model](/files/i0adewhXXW3T6A44aK8Z)
2. Enter the deployment details in the deployment creation wizard and click `Next`:

   * *Name*: `income-distributions-demo`
   * *Namespace*: `seldon`
   * *Type*: `Seldon ML Pipeline`

   ![deployment creation wizard 1](/files/b7bh2rqU1xaJaw3zVkcS)
3. All relevant details will be pre-filled from the model registry. Click `Next`:

   ![income creation wizard](/files/HoccQffjhwAg6NOedEVn)
4. Click `Next` for the remaining steps, then click `Launch`.

## Run a Batch Job to send predictions to the model pipeline

To observe the predictions and feature distributions, we first need to send some predictions to the model. To simulate this, we will run a [Batch Job](/seldon-enterprise-platform/demos/seldon-core-v2/batch-requests.md) that takes an input file of prediction requests formatted according to the [Open Inference Protocol](https://docs.seldon.io/projects/seldon-core/en/v2/contents/apis/inference/v2.html). Follow the steps below:

1. Download the predictions data file `predictions.txt`.

{% file src="/files/zCfZGC1QB8gmVY2XnSmM" %}

This dataset contains 60 predictions. The first few lines of the input file `predictions.txt` have the following format:

```
{"inputs": [{"name": "income", "datatype": "INT64", "shape": [1, 12], "data": [30, 4, 4, 0, 2, 0, 4, 1, 5013, 0, 40, 9]}]}
{"inputs": [{"name": "income", "datatype": "INT64", "shape": [1, 12], "data": [30, 4, 1, 0, 6, 0, 4, 1, 2407, 0, 40, 9]}]}
{"inputs": [{"name": "income", "datatype": "INT64", "shape": [1, 12], "data": [32, 0, 3, 2, 0, 1, 4, 1, 0, 0, 40, 0]}]}
```

2. Upload the data to a bucket store of your choice. This demo uses [MinIO](/seldon-enterprise-platform/production-environment/minio.md) and stores the data at bucket path `minio://predictions-data/predictions.txt`. Refer to the [batch request demo](/seldon-enterprise-platform/demos/seldon-core-v2/batch-requests.md#setup-input-data) for an example of how this can be done using the MinIO browser.
3. Open the `Deployment Dashboard` of your deployment by clicking on it in the `Overview` page.
4. Click `Batch Jobs` in the left pane.
5. Create a Batch Job with the following details:

```
Input Data Location: minio://predictions-data/predictions.txt
Output Data Location: minio://predictions-data/output-data-{{workflow.name}}.txt
Number of Workers: 1
Number of Retries: 3
Batch Size: 1
Minimum Batch Wait Interval (sec): 5
Method: Predict
Transport Protocol: REST
Input Data Type: Open Inference Protocol (OIP)
Object Store Secret Name: minio-bucket-envvars
```

This creates a new Batch Job that uses the predictions data, located at `minio://predictions-data/predictions.txt`, to send requests to the model. The output data is stored at `minio://predictions-data/output-data-{{workflow.name}}.txt`. The workflow sends requests to the model every 5 seconds using the REST transport protocol, formatted according to the [Open Inference Protocol](https://docs.seldon.io/projects/seldon-core/en/v2/contents/apis/inference/v2.html). We also specify the secret name `minio-bucket-envvars`, which contains the credentials required to access the MinIO bucket store and the predictions data.

{% hint style="info" %}
**Note**: If the file containing the predictions data is uploaded to a private bucket storage, ensure that you [have configured your storage access credentials secret](/seldon-enterprise-platform/operations/storage-initializers.md#configuration).
{% endhint %}
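
If you want to build a similar input file from your own rows, the line format shown in step 1 can be reproduced with a few lines of Python. This is a minimal sketch rather than part of the demo: it assumes each row holds the 12 integer features in the order the model expects, and the helper names `to_oip_line` and `write_predictions_file` are hypothetical.

```python
import json
from typing import Iterable, List

N_FEATURES = 12  # matches the shape [1, 12] used in predictions.txt


def to_oip_line(row: List[int], tensor_name: str = "income") -> str:
    """Serialise one feature row as a single-line OIP inference request."""
    if len(row) != N_FEATURES:
        raise ValueError(f"expected {N_FEATURES} features, got {len(row)}")
    payload = {
        "inputs": [
            {
                "name": tensor_name,
                "datatype": "INT64",
                "shape": [1, N_FEATURES],
                "data": list(row),
            }
        ]
    }
    return json.dumps(payload)


def write_predictions_file(rows: Iterable[List[int]], path: str) -> int:
    """Write one request per line and return the number of lines written."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(to_oip_line(row) + "\n")
            count += 1
    return count
```

For example, `to_oip_line([30, 4, 4, 0, 2, 0, 4, 1, 5013, 0, 40, 9])` produces the first line of the sample shown in step 1. The length check is there so that a malformed row fails before the file is uploaded rather than during the Batch Job.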

## Observe predictions and feature distributions

Select the income classifier deployment and go to the `Monitor` page to view the predictions and feature distributions.

![observe-distributions](/files/RSwl80l4bZmxywLhsDSl)

## Filter distributions by time or feature level filters

Filter distributions by time, or by prediction-level and feature-level filters, to compare different cohorts and support further analysis. For example, look at the predictions for all individuals in the `Age` group 25-50, filter `Marital Status` to `Married` and `Never-Married` only, and see how the average prediction frequency changes for this cohort.

![filter-distributions](/files/yEJjeV5XDyCijck86uZD)

## Configuring parameters

The distributions parameters let you configure your charts for further analysis. For example, look at the charts in the `Age` group and change the `Histogram interval` to `15` and the `Number of time buckets` to `30` to see how the charts change.

![configure-parameters](/files/CScnZWhezwWvNMbLBRmj)
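
To build intuition for what a `Histogram interval` of `15` does to the `Age` chart, the sketch below groups values into fixed-width buckets. It assumes buckets are aligned to multiples of the interval, which is the convention used by Elasticsearch histogram aggregations; how the platform buckets values internally is not described on this page. The function names and the sample ages are illustrative only.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def bucket_start(value: float, interval: float) -> float:
    """Return the lower edge of the fixed-width bucket that contains value."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return math.floor(value / interval) * interval


def histogram(values: Iterable[float], interval: float) -> Dict[float, int]:
    """Count values per bucket, keyed by each bucket's lower edge."""
    counts = Counter(bucket_start(v, interval) for v in values)
    return dict(sorted(counts.items()))


ages = [30, 30, 32, 44, 45, 61]  # illustrative values only, not demo output
print(histogram(ages, 15))  # prints {30: 4, 45: 1, 60: 1}
```

A larger interval produces fewer, wider bars, which smooths the chart but can hide differences between neighbouring age groups.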

## Reference Data Distributions Comparison

{% hint style="info" %}
**Note**: This feature is experimental and has very limited functionality. **Only tabular reference data** is supported.
{% endhint %}

A useful feature for monitoring feature distributions is the ability to compare live features against the same features from a reference dataset. Feature-level comparisons let users visually gauge which features are actually drifting.

Here, we define reference data as a set of data where the distribution of the data is a useful representation of the expected live predictions. Typically, this would be a sampled subset of the training data used to create the inference model.

In this section of the demo, we will extend the distributions monitoring demo and:

* Show how reference data needs to be prepared before it can be inserted into Seldon Enterprise Platform
* Create a bucket and upload some reference data to MinIO
* Trigger a retrieval job via the Seldon Enterprise Platform UI
* Toggle the comparison of live predictions and reference data feature distributions

## Preprocess Reference Data

{% hint style="info" %}
Note:\
\- Only tabular data that has been **processed** and saved in **CSV** format can be retrieved and inserted into Seldon Enterprise Platform as reference data.\
\- There are no `PROBA` or `ONE_HOT` features in the income dataset used in this demo, so no preprocessing is required. Instead, the example below uses the Iris dataset to illustrate the preprocessing.
{% endhint %}

Currently, there are **strict** requirements on the data that can be inserted into Seldon Enterprise Platform, and they depend on the [*prediction schema*](#configure-predictions-schema-for-classifier).

1. The number of columns in the reference data must match the number of expected columns from the prediction schema. This means that some feature types (i.e. `ONE_HOT` and `PROBA`) may need to be split into dummy/indicator values. For example, a `PROBA` feature (e.g. the Iris Species from the [Iris Dataset](https://archive.ics.uci.edu/dataset/53/iris)) might have 3 categories in the schema:

   ```json
   {
      "name": "Iris Species",
      "type": "PROBA",
      "dataType": "FLOAT",
      "schema": [
        {
          "name": "Setosa"
        },
        {
          "name": "Versicolor"
        },
        {
          "name": "Virginica"
        }
      ]
   }
   ```

   However, the raw reference dataset represents this feature as a single column where the values are its categories. This is typically the case for output features where the reference data has the expected real output, while the model returns a probability distribution of the different possible outputs:

   | Iris Species |
   | ------------ |
   | Versicolor   |
   | Setosa       |
   | Virginica    |
   | Virginica    |
   | Setosa       |

   The processed data should have the following format:

   | Setosa | Versicolor | Virginica |
   | ------ | ---------- | --------- |
   | 0      | 1          | 0         |
   | 1      | 0          | 0         |
   | 0      | 0          | 1         |
   | 0      | 0          | 1         |
   | 1      | 0          | 0         |
2. The order of columns must match the prediction schema.
3. Both "input" and "output" features must be in the reference dataset, where "input" features come before "output" features.
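
The three requirements above can be sketched in Python. This is only an illustration under stated assumptions: features are passed as a list of dictionaries shaped like the `Iris Species` entry above, already ordered with input features before output features, and the helper names `expected_columns`, `expand_indicator`, and `preprocess_rows` are hypothetical rather than part of any Seldon API.

```python
from typing import Dict, List

# Feature definition copied from the PROBA example above.
IRIS_SPECIES = {
    "name": "Iris Species",
    "type": "PROBA",
    "dataType": "FLOAT",
    "schema": [{"name": "Setosa"}, {"name": "Versicolor"}, {"name": "Virginica"}],
}

SPLIT_TYPES = ("PROBA", "ONE_HOT")  # feature types that span several columns


def expected_columns(features: List[Dict]) -> List[str]:
    """List the columns the reference data needs, in schema order."""
    columns: List[str] = []
    for feature in features:
        if feature["type"] in SPLIT_TYPES:
            columns.extend(item["name"] for item in feature["schema"])
        else:
            columns.append(feature["name"])
    return columns


def expand_indicator(value: str, feature: Dict) -> List[int]:
    """Turn one category label into 0/1 indicator values in schema order."""
    categories = [item["name"] for item in feature["schema"]]
    if value not in categories:
        raise ValueError(f"unknown category {value!r} for {feature['name']}")
    return [1 if value == category else 0 for category in categories]


def preprocess_rows(rows: List[Dict[str, str]], features: List[Dict]) -> List[List]:
    """Rewrite raw rows so that their columns match expected_columns()."""
    processed = []
    for row in rows:
        out: List = []
        for feature in features:  # iterate in schema order, not input order
            value = row[feature["name"]]
            if feature["type"] in SPLIT_TYPES:
                out.extend(expand_indicator(value, feature))
            else:
                out.append(value)
        processed.append(out)
    return processed


raw = [{"Iris Species": s} for s in
       ["Versicolor", "Setosa", "Virginica", "Virginica", "Setosa"]]
print(expected_columns([IRIS_SPECIES]))
print(preprocess_rows(raw, [IRIS_SPECIES]))
```

Run on the five raw rows from the table above, this yields the same indicator rows as the processed table. Writing the result out as CSV is left to the caller.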

## Reference Data Storage Bucket

Download the income classifier reference dataset `income-reference-data.csv`. Store it in a bucket of your choice.

{% file src="/files/rEuaMQis3X1LN4OodOpc" %}

In this demo, we will use [MinIO](/seldon-enterprise-platform/production-environment/minio.md) and create a bucket called `reference-data`, to which we can upload the `income-reference-data.csv` file.

## Trigger `Add Reference Data` Job

In your deployment, navigate to the `Monitor` page and open the `Distributions` tab. An `Add reference data` button should be available.

This button opens a wizard where the bucket path and secret can be specified. For this demo, use the following values:

* *Bucket Path*: `minio://reference-data/`
* *Bucket Secret*: `minio-bucket-envvars`

{% hint style="info" %}
**Note**: Here, `minio-bucket-envvars` is a pre-created secret in the namespace that contains the bucket credentials as environment variables.
{% endhint %}

After confirming, the job will start running, and the `Add reference data` button will change to a `Retrieving data...` status.

![Add reference data](/files/GjwPiLYulERYgvPggIEh)

## Toggle Reference Data Feature Distributions Comparison

Give the job a few minutes to finish. Once finished, the `Retrieving data...` button will change to `Reference data available`. The `Toggle reference data` toggle then becomes available for every feature, and you can view comparisons of the live prediction distributions against the reference data.

Use the filters to filter **both** live predictions and reference data.

![Reference data](/files/sSdFMLSdWMl8KVran21w)

