KServe

MLServer is used as the core Python inference server in KServe (formerly known as KFServing). This provides a straightforward avenue to deploy your models into a scalable serving infrastructure backed by Kubernetes.

This section assumes a basic knowledge of KServe and Kubernetes, as well as access to a working Kubernetes cluster with KServe installed. To learn more about KServe or how to install it, please visit the KServe documentation.

Serving Runtimes

KServe provides built-in serving runtimes to deploy models trained in common ML frameworks. These let you deploy your models into a robust infrastructure by simply pointing to where the model artifacts are stored remotely.

Some of these runtimes leverage MLServer as the core inference server. Therefore, it should be straightforward to move from your local testing to your serving infrastructure.
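For reference, "local testing" with MLServer usually amounts to starting the server directly against your model folder. A minimal sketch, assuming a Scikit-Learn model and a folder that already contains a model-settings.json:

# Install MLServer plus the SKLearn runtime, then serve the model in the current folder
pip install mlserver mlserver-sklearn
mlserver start .

The same folder can later be uploaded to remote storage and referenced from KServe without any changes to the model itself.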

Usage

To use any of the built-in serving runtimes offered by KServe, it should be enough to select the relevant one in your InferenceService manifest.

For example, to serve a Scikit-Learn model, you could use a manifest like the one below:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    sklearn:
      protocolVersion: v2
      storageUri: gs://seldon-models/sklearn/iris

As you can see in the manifest above, the InferenceService only needs to specify the following points:

  • The model artifact is a Scikit-Learn model. Therefore, we will use the sklearn serving runtime to deploy it.

  • The model will be served using the V2 inference protocol, which can be enabled by setting the protocolVersion field to v2.

Once your InferenceService manifest is ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through kubectl, by running:

kubectl apply -f my-inferenceservice-manifest.yaml
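After applying the manifest, it can be useful to confirm that KServe has rolled the model out and to note the URL it assigns. One way to check, assuming the my-model name from the example above and the current namespace:

# Show the InferenceService, including its URL and READY status
kubectl get inferenceservice my-model

# Optionally block until the service reports Ready (gives up after 5 minutes)
kubectl wait --for=condition=Ready inferenceservice/my-model --timeout=300s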

Supported Serving Runtimes

As mentioned above, KServe offers support for built-in serving runtimes, some of which leverage MLServer as the inference server. Below you can find a table listing these runtimes, and the MLServer inference runtime that they correspond to.

Framework       MLServer Runtime    KServe Serving Runtime    Documentation
Scikit-Learn    MLServer SKLearn    sklearn                   SKLearn Serving Runtime
XGBoost         MLServer XGBoost    xgboost                   XGBoost Serving Runtime

Note that, on top of the ones shown above (backed by MLServer), KServe also provides a wider set of serving runtimes. To see the full list, please visit the KServe documentation.

Custom Runtimes

Sometimes, the serving runtimes built into KServe may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then be packaged up as images. These images become self-contained model servers with your custom runtime, so it's easy to deploy them into your serving infrastructure by leveraging KServe's support for custom runtimes.
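As a rough sketch of that packaging step, assuming your custom runtime code and its model-settings.json live in the current folder, the MLServer CLI can build the image for you (the my-custom-server:0.1.0 tag matches the manifest used below):

# Build a self-contained image bundling MLServer, your custom runtime and your model
mlserver build . -t my-custom-server:0.1.0

The resulting image then needs to be pushed to a registry that your cluster can pull from.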

Usage

The InferenceService manifest gives you full control over the containers used to deploy your machine learning model. This can be leveraged to point your deployment to the custom MLServer image containing your custom logic. For example, if we assume that our custom image has been tagged as my-custom-server:0.1.0, we could write an InferenceService manifest like the one below:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    containers:
      - name: classifier
        image: my-custom-server:0.1.0
        env:
          - name: PROTOCOL
            value: v2
        ports:
          - containerPort: 8080
            protocol: TCP

As we can see in the manifest above, the main points that we'll need to take into account are:

  • Pointing to our custom MLServer image in the custom container section of our InferenceService.

  • Explicitly choosing the V2 inference protocol to serve our model.

  • Letting KServe know which port our custom container exposes for inference requests.

Once your InferenceService manifest is ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through kubectl, by running:

kubectl apply -f my-inferenceservice-manifest.yaml
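Whether the model is served through a built-in runtime or a custom image, it speaks the V2 inference protocol, so you can smoke-test it with a plain HTTP request once the InferenceService is ready. A sketch along these lines, where the ingress host and port, the my-model name and the input name, shape and datatype are assumptions that depend on your cluster setup and model:

# Resolve the hostname KServe assigned to the InferenceService
SERVICE_HOSTNAME=$(kubectl get inferenceservice my-model -o jsonpath='{.status.url}' | cut -d/ -f3)

# Send a V2 inference request through the cluster ingress
curl -s \
  -H "Host: ${SERVICE_HOSTNAME}" \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "input-0", "shape": [1, 4], "datatype": "FP32", "data": [[0.1, 0.2, 0.3, 0.4]]}]}' \
  http://<INGRESS_HOST>:<INGRESS_PORT>/v2/models/my-model/infer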
