MLServer is used as the core Python inference server in Seldon Core. Therefore, it should be straightforward to deploy your models either by using one of the built-in pre-packaged servers or by pointing to a custom image of MLServer.
This section assumes a basic knowledge of Seldon Core and Kubernetes, as well as access to a working Kubernetes cluster with Seldon Core installed. To learn more about Seldon Core or how to install it, please visit the Seldon Core documentation.
Out of the box, Seldon Core comes with a few MLServer runtimes pre-configured to run straight away. This allows you to deploy an MLServer instance by just pointing to where your model artifact is stored and specifying which ML framework was used to train it.
To let Seldon Core know what framework was used to train your model, you can use the `implementation` field of your `SeldonDeployment` manifest. For example, to deploy a Scikit-Learn artifact stored remotely in GCS, one could do:
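(The snippet below is a minimal sketch; the deployment name and `modelUri` path are illustrative placeholders.)

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-sklearn-model                        # illustrative name
spec:
  protocol: kfserving                           # serve the model using the V2 inference protocol
  predictors:
    - name: default
      graph:
        name: classifier
        implementation: SKLEARN_SERVER          # pre-packaged server backed by MLServer's SKLearn runtime
        modelUri: gs://my-bucket/sklearn-model  # illustrative GCS path to the model artifact
```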
As you can see in the example above, all that we need to specify is that:
- Our inference deployment should use the V2 inference protocol, which is done by setting the `protocol` field to `kfserving`.
- Our model artifact is a serialised Scikit-Learn model; therefore, it should be served using the MLServer SKLearn runtime, which is done by setting the `implementation` field to `SKLEARN_SERVER`.
Note that, while the `protocol` field should always be set to `kfserving` (i.e. so that models are served using the V2 inference protocol), the value of the `implementation` field will depend on your ML framework. The valid values of the `implementation` field are pre-determined by Seldon Core. However, it should also be possible to configure and add new ones (e.g. to support a custom MLServer runtime).
Once you have your `SeldonDeployment` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, by running:
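```bash
# Illustrative filename; adjust to wherever you saved the SeldonDeployment manifest above.
kubectl apply -f my-sklearn-deployment.yaml
```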
To consult the supported values of the `implementation` field where MLServer is used, you can check the support table below.
As mentioned above, pre-packaged servers come built into Seldon Core. Therefore, only a pre-determined subset of them will be supported for a given release of Seldon Core.
The table below lists the currently supported values of the `implementation` field. Each row also shows which ML framework it corresponds to, as well as which MLServer runtime will be enabled internally on your model deployment when used.
Note that, on top of the ones listed in the table below (i.e. those backed by MLServer), Seldon Core also provides a wider set of pre-packaged servers. To check the full list, please visit the Seldon Core documentation.
There could be cases where the pre-packaged MLServer runtimes supported out of the box in Seldon Core are not enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images become self-contained model servers with your custom runtime, and Seldon Core makes it just as easy to deploy them into your serving infrastructure.
The `componentSpecs` field of the `SeldonDeployment` manifest will allow us to let Seldon Core know what image should be used to serve a custom model. For example, if we assume that our custom image has been tagged as `my-custom-server:0.1.0`, we could write our `SeldonDeployment` manifest as follows:
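(The snippet below is a minimal sketch; apart from the `my-custom-server:0.1.0` tag mentioned above, all names are illustrative placeholders.)

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-custom-model                         # illustrative name
spec:
  protocol: v2                                  # serve the model using the V2 inference protocol
  predictors:
    - name: default
      graph:
        name: classifier                        # must match the container name below
        type: MODEL
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                image: my-custom-server:0.1.0   # our custom MLServer image
```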
As we can see in the snippet above, all that's needed to deploy a custom MLServer image is:
- Letting Seldon Core know that the model deployment will be served through the V2 inference protocol, by setting the `protocol` field to `v2`.
- Pointing our model container to our custom MLServer image, by specifying it in the `image` field of the `componentSpecs` section of the manifest.
Once you have your `SeldonDeployment` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, by running:
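```bash
# Illustrative filename; adjust to wherever you saved the SeldonDeployment manifest above.
kubectl apply -f my-custom-deployment.yaml
```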
Framework | MLServer Runtime | Seldon Core Pre-packaged Server | Documentation |
---|---|---|---|
Scikit-Learn | `mlserver-sklearn` | `SKLEARN_SERVER` | |
XGBoost | `mlserver-xgboost` | `XGBOOST_SERVER` | |
MLflow | `mlserver-mlflow` | `MLFLOW_SERVER` | |
MLServer is used as the core Python inference server in KServe (formerly known as KFServing). This allows for a straightforward avenue to deploy your models into a scalable serving infrastructure backed by Kubernetes.
This section assumes a basic knowledge of KServe and Kubernetes, as well as access to a working Kubernetes cluster with KServe installed. To learn more about KServe or how to install it, please visit the KServe documentation.
KServe provides built-in serving runtimes to deploy models trained in common ML frameworks. These allow you to deploy your models into a robust infrastructure by just pointing to where the model artifacts are stored remotely.
Some of these runtimes leverage MLServer as the core inference server. Therefore, it should be straightforward to move from your local testing to your serving infrastructure.
To use any of the built-in serving runtimes offered by KServe, it should be enough to select the relevant one in your `InferenceService` manifest. For example, to serve a Scikit-Learn model, you could use a manifest like the one below:
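(The snippet below is a minimal sketch; the name and `storageUri` are illustrative placeholders.)

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-sklearn-model                        # illustrative name
spec:
  predictor:
    sklearn:
      protocolVersion: v2                       # serve the model using the V2 inference protocol
      storageUri: gs://my-bucket/sklearn-model  # illustrative path to the model artifact
```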
As you can see in the example above, the `InferenceService` manifest only needs to specify the following points:
- The model artifact is a Scikit-Learn model. Therefore, we will use the `sklearn` serving runtime to deploy it.
- The model will be served using the V2 inference protocol, which can be enabled by setting the `protocolVersion` field to `v2`.
Once you have your `InferenceService` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, by running:
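```bash
# Illustrative filename; adjust to wherever you saved the InferenceService manifest above.
kubectl apply -f my-sklearn-isvc.yaml
```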
As mentioned above, KServe offers support for built-in serving runtimes, some of which leverage MLServer as the inference server. Below you can find a table listing these runtimes, and the MLServer inference runtime that they correspond to.
Note that, on top of the ones listed in the table below (i.e. those backed by MLServer), KServe also provides a wider set of serving runtimes. To see the full list, please visit the KServe documentation.
Sometimes, the serving runtimes built into KServe may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images become self-contained model servers with your custom runtime, and it's easy to deploy them into your serving infrastructure by leveraging KServe's support for custom runtimes.
The `InferenceService` manifest gives you full control over the containers used to deploy your machine learning model. This can be leveraged to point your deployment to the custom MLServer image containing your custom logic. For example, if we assume that our custom image has been tagged as `my-custom-server:0.1.0`, we could write an `InferenceService` manifest like the one below:
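(The snippet below is a minimal sketch; apart from the `my-custom-server:0.1.0` tag mentioned above, the name is an illustrative placeholder, and the `PROTOCOL` environment variable and port value are assumptions that may need adjusting for your KServe version and custom image configuration.)

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-custom-model                 # illustrative name
spec:
  predictor:
    containers:
      - name: classifier
        image: my-custom-server:0.1.0   # our custom MLServer image
        env:
          - name: PROTOCOL              # assumption: request the V2 inference protocol
            value: v2
        ports:
          - containerPort: 8080         # assumption: MLServer's default HTTP port
            protocol: TCP
```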
As we can see in the example above, the main points that we'll need to take into account are:
- Pointing to our custom MLServer image in the custom container section of our `InferenceService`.
- Explicitly choosing the V2 inference protocol to serve our model.
- Letting KServe know which port our custom container exposes for inference requests.
Once you have your `InferenceService` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, by running:
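```bash
# Illustrative filename; adjust to wherever you saved the InferenceService manifest above.
kubectl apply -f my-custom-isvc.yaml
```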
Framework | MLServer Runtime | KServe Serving Runtime | Documentation |
---|---|---|---|
Scikit-Learn | `mlserver-sklearn` | `sklearn` | |
XGBoost | `mlserver-xgboost` | `xgboost` | |

MLServer is currently used as the core Python inference server in some of the most popular Kubernetes-native serving frameworks, including Seldon Core and KServe (formerly known as KFServing). This allows MLServer users to leverage the usability and maturity of these frameworks to take their model deployments to the next level of their MLOps journey, ensuring that they are served in a robust and scalable infrastructure.

In general, it should be possible to deploy models using MLServer into any serving engine compatible with the V2 protocol. Alternatively, it's also possible to manage MLServer deployments manually as regular processes (i.e. in a non-Kubernetes-native way). However, this may be more involved and highly dependent on the deployment infrastructure.
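For instance, a model folder prepared for MLServer (i.e. one containing a `model-settings.json`) can be served as a plain local process with the MLServer CLI; the sketch below assumes such a folder exists in the current directory:

```bash
# Start MLServer as a regular process, serving the models defined in the current folder
mlserver start .
```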