We use Rclone to copy model artifacts from a storage location to the model servers. This allows users to take advantage of Rclone's support for over 40 cloud storage backends, including Amazon S3, Google Cloud Storage and many others.
For local storage while developing, see here.
For the authorization needed for cloud storage when running on Kubernetes, see here.
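For illustration, here is a minimal sketch of a Model spec using an Rclone-style `storageUri` (the bucket names and paths are hypothetical placeholders):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: example-model                  # hypothetical name
spec:
  # Any Rclone-supported URI can be used; these locations are placeholders.
  storageUri: "gs://my-bucket/models/example"
  # storageUri: "s3://my-bucket/models/example"
  # storageUri: "/mnt/models/example"  # local (mounted) folder while developing
  requirements:
  - sklearn
```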
The Model specification allows parameters to be passed to the loaded model for customization. For example:
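A hedged sketch of such a spec, assuming `spec.parameters` takes a list of name/value pairs as described below (the model name, artifact location, and parameter names are placeholders):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: custom-model                   # hypothetical name
spec:
  storageUri: "gs://my-bucket/models/custom"   # placeholder artifact location
  requirements:
  - mlserver
  - python
  parameters:                          # named keys/values passed to the loaded model
  - name: foo
    value: bar
  - name: threshold
    value: "0.5"
```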
This capability is only available for MLServer custom model runtimes. The named keys and values will be added to the `model-settings.json` file for the provided model in the `parameters.extra` dict. MLServer models are able to read these values in their `load` method.
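For illustration, a minimal sketch of how an MLServer custom runtime could read these values in its `load` method (the class and parameter names are hypothetical):

```python
from mlserver import MLModel


class CustomModel(MLModel):
    async def load(self) -> bool:
        # The Model spec's parameters are surfaced in model-settings.json
        # under parameters.extra.
        extra = self._settings.parameters.extra or {}
        self.foo = extra.get("foo")                     # hypothetical key
        self.threshold = float(extra.get("threshold", 0.5))
        return True
```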
This model allows a Pandas query to be run on the input to select rows. An example is shown below:
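A hedged sketch of such an invocation (the model name and artifact location are placeholders; the `query` parameter holds the Pandas query to run):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: filter                         # hypothetical name
spec:
  storageUri: "gs://my-bucket/models/pandasquery"   # placeholder artifact location
  requirements:
  - mlserver
  - python
  parameters:
  - name: query
    value: "A == 1"                    # select rows where tensor A has value 1
```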
This invocation filters for tensor `A` having the value 1. The model also returns a tensor called `status`, which indicates the operation run and whether it was a success. If no rows satisfy the query then only a `status` tensor output will be returned.
Further details on Pandas queries can be found here.
This model can be useful for conditional Pipelines. For example, you could have two invocations of this model, `choice-is-one` and `choice-is-two`, each filtering on a different value. By including these in a Pipeline as sketched below, we can define conditional routes:
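A hedged sketch of what this could look like, assuming `mul10` and `add10` Models are already defined (artifact locations, tensor names and the exact step wiring are placeholders to be checked against the referenced notebook):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: choice-is-one
spec:
  storageUri: "gs://my-bucket/models/pandasquery"   # placeholder artifact location
  requirements: [mlserver, python]
  parameters:
  - name: query
    value: "choice == 1"
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: choice-is-two
spec:
  storageUri: "gs://my-bucket/models/pandasquery"   # placeholder artifact location
  requirements: [mlserver, python]
  parameters:
  - name: query
    value: "choice == 2"
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: choice-routing                 # hypothetical pipeline name
spec:
  steps:
  - name: choice-is-one
  - name: mul10
    inputs:
    - choice-is-one.outputs.choice     # placeholder tensor; step runs only if rows were selected
  - name: choice-is-two
  - name: add10
    inputs:
    - choice-is-two.outputs.choice     # placeholder tensor; step runs only if rows were selected
  output:
    steps:
    - mul10
    - add10
    stepsJoin: any                     # return whichever branch produced an output
```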
Here the `mul10` model will be called if the `choice-is-one` model succeeds, and the `add10` model will be called if the `choice-is-two` model succeeds.
The full notebook can be found here.
To run your model inside Seldon you must supply an inference artifact that can be downloaded and run on one of the MLServer or Triton inference servers. The supported artifacts are listed below in alphabetical order.
Each artifact type, the server that runs it, the requirement tag to use, and an example are listed in the table below.
For many machine learning artifacts you can simply save them to a folder and load them into Seldon Core 2. Details are given below, along with links for creating a custom model settings file where needed.
For MLServer-targeted models you can create a `model-settings.json` file to help MLServer load your model, and place this alongside your artifact. See the MLServer project for details.
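A minimal sketch of such a file (the model name and implementation class here are illustrative; see the MLServer docs for the full schema):

```json
{
  "name": "iris",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parameters": {
    "uri": "./model.joblib",
    "version": "v0.1.0"
  }
}
```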
For Triton inference server models you can create a `config.pbtxt` configuration file alongside your artifact.
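For example, a minimal sketch for an ONNX model (tensor names, types and shapes are placeholders; see the Triton docs for the full schema):

```protobuf
name: "mymodel"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 3 ]
  }
]
```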
The `tag` field represents the tag you need to add to the `requirements` part of the Model spec for your artifact to be loaded on a compatible server, e.g. for an SKLearn model:
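A hedged sketch (the model name and artifact location are placeholders):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: my-sklearn-model               # hypothetical name
spec:
  storageUri: "gs://my-bucket/models/my-sklearn-model"   # placeholder artifact location
  requirements:
  - sklearn                            # tag matched against a compatible Server
```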
Multi-model serving is an architecture pattern where one ML inference server hosts multiple models at the same time. This means that, within a single instance of the server, you can serve multiple models under different paths. This is a feature provided out of the box by Nvidia Triton and Seldon MLServer, currently the two inference servers that are integrated in Seldon Core 2.
This deployment pattern allows the system to handle a large number of deployed models by letting them share the hardware resources allocated to inference servers (e.g. GPUs). For example, if a single model inference server is deployed on a node with one GPU, the models loaded on this inference server instance can effectively share that GPU. This is in contrast to a single-model-per-server deployment pattern, where only one model can use the allocated GPU.
Multi-model serving is enabled by design in Seldon Core 2. Based on the requirements specified by the user on a given `Model`, the Scheduler will find an appropriate model inference server instance to load the model onto.
In the example below, given that the model is a `tensorflow` model, the system will deploy it onto a `triton` server instance (matching the `Server` labels). Additionally, as the model `memory` requirement is `100Ki`, the system will pick a server instance that has enough (memory) capacity to host this model alongside any other models already loaded.
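A hedged sketch of such a Model (the artifact location is a placeholder; the `requirements` and `memory` fields drive the scheduling decision described above):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: tfsimple                       # hypothetical name
spec:
  storageUri: "gs://my-bucket/models/tfsimple"   # placeholder artifact location
  requirements:
  - tensorflow                         # matched against the Server labels/capabilities
  memory: 100Ki                        # the scheduler picks a server with enough free capacity
```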
All models are loaded and active on this model server. Inference requests for these models are served concurrently and the hardware resources are shared to fulfil these inference requests.
Overcommit allows shared servers to handle more models than can fit in memory. This is done by keeping highly utilized models in memory and evicting other ones to disk using a least-recently-used (LRU) cache mechanism. From a user perspective these models are all registered and "ready" to serve inference requests. If an inference request comes for a model that is unloaded/evicted to disk, the system will reload the model first before forwarding the request to the inference server.
Overcommit is enabled by setting `SELDON_OVERCOMMIT_PERCENTAGE` on shared servers; it is set by default to 10%. In other words, a given model inference server instance can register models with a total memory requirement of up to `MEMORY_REQUEST * (1 + SELDON_OVERCOMMIT_PERCENTAGE / 100)`.
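For example, a server replica with a memory request of 2Gi and the default 10% overcommit can register models whose memory requirements total up to 2Gi * 1.1 = 2.2Gi, even though only a subset of those models will be resident in memory at any one time.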
The Seldon Agent (a sidecar next to each model inference server deployment) keeps track of inference request times for the different models. The models are kept sorted by the time of their last request, and this data structure is used to evict the least recently used model in order to make room for another incoming model. This happens in two scenarios:
- A new model load request arrives that goes beyond the active memory capacity of the inference server.
- An inference request arrives for a registered model that is not loaded in memory (i.e. previously evicted).
This is seamless to users. In particular, when a model is reloaded onto the inference server to respond to an inference request, the model artifact is cached on disk, which allows a faster reload (no remote artifact fetch). We therefore expect the extra latency of reloading a model during an inference request to be acceptable in many cases (with a lower bound of ~100ms).
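As a purely conceptual sketch (this is not the Seldon Agent's actual implementation), the eviction policy can be pictured as follows:

```python
import time


class LRUModelCache:
    """Toy illustration of least-recently-used eviction keyed on last request time."""

    def __init__(self, capacity_bytes: int):
        self.capacity_bytes = capacity_bytes
        self.loaded = {}  # model name -> (memory_bytes, last_request_time)

    def _evict_until_fits(self, needed_bytes: int) -> None:
        # Evict the models with the oldest last-request time until the new model fits.
        while self.loaded and (
            sum(mem for mem, _ in self.loaded.values()) + needed_bytes > self.capacity_bytes
        ):
            lru_name = min(self.loaded, key=lambda name: self.loaded[name][1])
            self.loaded.pop(lru_name)  # in Seldon the artifact stays cached on disk

    def load(self, name: str, memory_bytes: int) -> None:
        # Scenario 1: a new model load request beyond the active memory capacity.
        self._evict_until_fits(memory_bytes)
        self.loaded[name] = (memory_bytes, time.monotonic())

    def infer(self, name: str, memory_bytes: int) -> None:
        # Scenario 2: an inference request for a registered model that was evicted.
        if name not in self.loaded:
            self.load(name, memory_bytes)  # reload before forwarding the request
        mem, _ = self.loaded[name]
        self.loaded[name] = (mem, time.monotonic())
```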
Overcommit can be disabled by setting `SELDON_OVERCOMMIT_PERCENTAGE` to 0 for a given shared server.
Note: currently the system uses the memory requirement values specified by the user on the Server and Model side. In the future we are looking at how to make the system handle memory management automatically.
Check this notebook for a local example.
Models provide the atomic building blocks of Seldon. They represent machine learning models, drift detectors, outlier detectors, explainers, feature transformations, and more complex routing models such as multi-armed bandits.
Seldon can handle a wide range of inference artifacts. Artifacts can be stored on any of the 40 or more cloud storage technologies supported by Rclone, as well as in a local (mounted) folder, as discussed above.
A Kubernetes YAML example is shown below for an SKLearn iris classification model:
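A hedged sketch of such a resource (the artifact location is a placeholder for a trained iris classifier saved with joblib):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://my-bucket/models/iris-sklearn"   # placeholder; any rclone URI works
  requirements:
  - sklearn
```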
Its Kubernetes `spec` has two core requirements:

- A `storageUri` specifying the location of the artifact. This can be any rclone URI specification.
- A `requirements` list which provides tags that need to be matched by the Server that can run this artifact type. By default, when you install Seldon we provide a set of Servers that cover a range of artifact types.
You can also load models directly over the scheduler gRPC service. An example using the grpcurl tool is shown below:
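A hedged sketch of what this can look like; the proto path, port, and message fields below are assumptions to be checked against the scheduler proto definitions referenced later on this page:

```bash
grpcurl \
  -d '{
        "model": {
          "meta": {"name": "iris"},
          "modelSpec": {
            "uri": "gs://my-bucket/models/iris-sklearn",
            "requirements": ["sklearn"],
            "memoryBytes": 500000
          },
          "deploymentSpec": {"replicas": 1}
        }
      }' \
  -plaintext \
  -proto apis/mlops/scheduler/scheduler.proto \
  0.0.0.0:9004 seldon.mlops.scheduler.Scheduler/LoadModel
```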
Multi-model serving is an architecture pattern where one ML inference server hosts multiple models at the same time. It is a feature provided out of the box by Nvidia Triton and Seldon MLServer. Multi-model serving reduces infrastructure hardware requirements (e.g. expensive GPUs), which enables the deployment of a large number of models while making it efficient to operate the system at scale.
Seldon Core 2 leverages multi-model serving by design and it is the default option for deploying models. The system will find an appropriate server to load the model onto based on the requirements that the user defines in the `Model` deployment definition.
Moreover, in many cases demand patterns allow for further overcommit of resources. Seldon Core 2 is able to register more models than can be served by the provisioned (memory) infrastructure and will dynamically swap models in and out according to least recent use, without adding significant latency overhead to inference workloads.
The proto buffer definitions for the scheduler are outlined here.
See here for more information.
See here for a discussion of autoscaling of models.
Type | Server | Tag | Example |
---|---|---|---|
Alibi-Detect | MLServer | `alibi-detect` | |
Alibi-Explain | MLServer | `alibi-explain` | |
DALI | Triton | `dali` | TBC |
Huggingface | MLServer | `huggingface` | |
LightGBM | MLServer | `lightgbm` | |
MLFlow | MLServer | `mlflow` | |
ONNX | Triton | `onnx` | |
OpenVino | Triton | `openvino` | TBC |
Custom Python | MLServer | `python`, `mlserver` | |
Custom Python | Triton | `python`, `triton` | |
PyTorch | Triton | `pytorch` | |
SKLearn | MLServer | `sklearn` | |
Spark Mlib | MLServer | `spark-mlib` | TBC |
Tensorflow | Triton | `tensorflow` | |
TensorRT | Triton | `tensorrt` | TBC |
Triton FIL | Triton | `fil` | TBC |
XGBoost | MLServer | `xgboost` | |
Type | Notes | Custom Model Settings |
---|---|---|
Alibi-Detect | | |
Alibi-Explain | | |
DALI | Follow the Triton docs to create a `config.pbtxt` and a model folder with the artifact. | |
Huggingface | Create an MLServer `model-settings.json` with the required Huggingface model. | |
LightGBM | Save the model to a file with extension `.bst`. | |
MLFlow | Use the `artifacts/model` folder created by your training run. | |
ONNX | Save your model with the name `model.onnx`. | |
OpenVino | Follow the Triton docs to create your model artifacts. | |
Custom MLServer Python | Create a Python file with a class that extends `MLModel`. | |
Custom Triton Python | Follow the Triton docs to create your `config.pbtxt` and associated Python files. | |
PyTorch | Create a Triton `config.pbtxt` describing inputs and outputs, and place the traced TorchScript model in the folder as `model.pt`. | |
SKLearn | Save the model via joblib to a file with extension `.joblib`, or with pickle to a file with extension `.pkl` or `.pickle`. | |
Spark Mlib | Follow the MLServer docs. | |
Tensorflow | Save the model in "Saved Model" format as `model.savedmodel`. If using the graphdef format you will need to create a Triton `config.pbtxt` and place your model in a numbered sub folder. HDF5 is not supported. | |
TensorRT | Follow the Triton docs to create your model artifacts. | |
Triton FIL | Follow the Triton docs to create your model artifacts. | |
XGBoost | Save the model to a file with extension `.bst` or `.json`. | |