Out of the box, mlserver supports the deployment and serving of xgboost models. By default, it will assume that these models have been serialised using the bst.save_model() method.
In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.
The first step will be to train a simple xgboost model. For that, we will use the mushrooms example from the xgboost Getting Started guide.
To save our trained model, we will serialise it using bst.save_model() and the JSON format. This is the approach recommended by the XGBoost project.
Our model will be persisted as a file named mushroom-xgboost.json.
Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
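As a reference, the contents of these two files could look roughly like the sketch below, written from Python for convenience. The field names follow the MLServer settings schema, and the debug flag and version string are illustrative assumptions.

```python
# Sketch: write a minimal settings.json and model-settings.json pair.
import json

settings = {"debug": True}  # server-wide settings (ports, log level, etc.)

model_settings = {
    "name": "mushroom-xgboost",
    "implementation": "mlserver_xgboost.XGBoostModel",
    "parameters": {"uri": "./mushroom-xgboost.json", "version": "v0.1.0"},
}

with open("settings.json", "w") as f:
    json.dump(settings, f, indent=2)

with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)
```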
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
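For example, a manually built REST request could look roughly like the sketch below. The port and path follow the V2 inference protocol defaults, and X_test refers to the test split from the training step.

```python
# Sketch: send a test inference request over REST (V2 inference protocol).
import requests

x_0 = X_test.iloc[0:1].values.tolist()  # a single test instance

inference_request = {
    "inputs": [
        {
            "name": "predict",
            "shape": [1, len(x_0[0])],
            "datatype": "FP32",
            "data": x_0,
        }
    ]
}

endpoint = "http://localhost:8080/v2/models/mushroom-xgboost/versions/v0.1.0/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json())
```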
As we can see above, the model predicted the input as close to 0, which matches what's on the test set.
Out of the box, mlserver supports the deployment and serving of lightgbm models. By default, it will assume that these models have been serialised using the bst.save_model() method.
In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.
To test the LightGBM Server, first we need to generate a simple LightGBM model using Python.
Our model will be persisted as a file named iris-lightgbm.bst.
Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
As we can see above, the model predicted the probability for each class, and the probability of class 1 is the highest, at close to 0.99, which matches what's on the test set.
The mlserver package comes with inference runtime implementations for scikit-learn and xgboost models. However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.
In this example, we will train a numpyro model. The numpyro library streamlines the implementation of probabilistic models, abstracting away advanced inference and training algorithms.
Out of the box, mlserver doesn't provide an inference runtime for numpyro. However, through this example we will see how easy it is to develop our own.
The first step will be to train our model. This will be a very simple Bayesian regression model, based on an example provided in the numpyro docs.
Since this is a probabilistic model, during training we will compute an approximation to the posterior distribution of our model using MCMC.
Now that we have trained our model, the next step will be to save it so that it can be loaded afterwards at serving-time. Note that, since this is a probabilistic model, we will only need to save the traces that approximate the posterior distribution over latent parameters.
This will get saved in a numpyro-divorce.json file.
The next step will be to serve our model using mlserver. For that, we will first implement an extension which will serve as the runtime to perform inference using our custom numpyro model.
Our custom inference wrapper should be responsible for (a sketch follows the list):
Loading the model from the set samples we saved previously.
Running inference using our model structure, and the posterior approximated from the samples.
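A condensed sketch of such a runtime is shown below. It leans on MLServer's MLModel base class and NumpyCodec, reuses the regression structure from the training step, and uses numpyro's Predictive to approximate the posterior predictive from the saved samples; treat the exact model function and output naming as assumptions.

```python
# Sketch of a custom MLServer runtime for the numpyro model.
import json

import numpy as np
import numpyro
import numpyro.distributions as dist
from jax import random
from numpyro.infer import Predictive

from mlserver import MLModel
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest, InferenceResponse


def model_fn(marriage=None, obs=None):
    # Same simple regression structure used at training time (sketch)
    a = numpyro.sample("a", dist.Normal(0.0, 0.2))
    bM = numpyro.sample("bM", dist.Normal(0.0, 0.5))
    sigma = numpyro.sample("sigma", dist.Exponential(1.0))
    mu = a + bM * marriage
    numpyro.sample("obs", dist.Normal(mu, sigma), obs=obs)


class NumpyroModel(MLModel):
    async def load(self) -> bool:
        # Load the posterior samples saved as numpyro-divorce.json
        with open(self.settings.parameters.uri) as f:
            raw_samples = json.load(f)
        self._samples = {k: np.array(v) for k, v in raw_samples.items()}
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the input (marriage rate) and run the posterior predictive
        marriage = NumpyCodec.decode_input(payload.inputs[0])
        predictive = Predictive(model_fn, self._samples)
        predictions = predictive(random.PRNGKey(0), marriage=marriage)

        obs_mean = np.asarray(predictions["obs"]).mean(axis=0)
        return InferenceResponse(
            model_name=self.name,
            outputs=[NumpyCodec.encode_output(name="obs_mean", payload=obs_mean)],
        )
```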
The next step will be to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Now that we have written and tested our custom model, the next step is to deploy it. With that goal in mind, the rough outline of steps will be to first build a custom image containing our code, and then deploy it.
MLServer will automatically find your requirements.txt file and install the necessary Python packages.
MLServer offers helpers to build a custom Docker image containing your code. In this example, we will use the mlserver build subcommand to create an image, which we'll be able to deploy later.
Note that this section expects that Docker is available and running in the background, as well as a functional cluster with Seldon Core installed and some familiarity with kubectl.
To ensure that the image is fully functional, we can spin up a container and then send a test request. To start the container, you can run something along the following lines in a separate terminal:
As we should be able to see, the server running within our Docker image responds as expected.
Now that we've built a custom image and verified that it works as expected, we can move to the next step and deploy it. There is a large number of tools out there to deploy images. However, for our example, we will focus on deploying it to a cluster running Seldon Core.
For that, we will need to create a SeldonDeployment resource which instructs Seldon Core to deploy a model embedded within our custom image and compliant with the V2 Inference Protocol. This can be achieved by applying (i.e. kubectl apply) a SeldonDeployment manifest to the cluster, similar to the one below:
Out of the box, mlserver supports the deployment and serving of alibi_detect models. Alibi Detect is an open source Python library focused on outlier, adversarial and drift detection. In this example, we will cover how we can create a detector configuration to then serve it using mlserver.
The first step will be to fetch the reference data and other relevant metadata for an alibi-detect model.
For that, we will use the alibi library to get the adult dataset with demographic features from a 1996 US census.
This example is based on the Categorical and mixed type data drift detection on income prediction tabular data example from the alibi-detect documentation.
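A rough sketch of this step is shown below: it fetches the adult dataset, fits a TabularDrift detector on a reference slice, and persists it with save_detector. The category handling in the original example is more involved, and the save_detector import path varies between alibi-detect versions.

```python
# Sketch: fetch the adult census data and persist a TabularDrift detector.
from alibi.datasets import fetch_adult
from alibi_detect.cd import TabularDrift
from alibi_detect.saving import save_detector

adult = fetch_adult()
X, y = adult.data, adult.target

# Use the first chunk of the data as the reference distribution
X_ref = X[:1000]

# Mark the categorical columns (None lets the detector infer the categories)
categories_per_feature = {i: None for i in adult.category_map.keys()}

detector = TabularDrift(
    X_ref, p_val=0.05, categories_per_feature=categories_per_feature
)
save_detector(detector, "./alibi-detect-artifact")
```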
Now that we have the reference data and other configuration parameters, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running the mlserver start command from the same directory where our config files are (or pointing it to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our alibi-detect model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Out of the box, MLServer supports the deployment and serving of HuggingFace Transformer models with the following features:
Loading of Transformer Model artifacts from the Hugging Face Hub.
Model quantization & optimization using the Hugging Face Optimum library.
Request batching for GPU optimization (via adaptive batching and request batching).
In this example, we will showcase some of these features using an example model.
Since we're using a pretrained model, we can skip straight to serving.
model-settings.json
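For reference, a model-settings.json for the HuggingFace runtime could look roughly like the sketch below (written from Python). The extra.task and pretrained_model fields follow the mlserver-huggingface runtime conventions, and the exact parameter layout may differ between MLServer versions.

```python
# Sketch: model-settings.json for a text-generation Transformer model.
import json

model_settings = {
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "parameters": {
        "extra": {
            "task": "text-generation",
            "pretrained_model": "distilgpt2",  # illustrative model choice
        }
    },
}

with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)
```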
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We can also leverage the Optimum library, which allows us to access quantized and optimized models.
We can download pretrained optimized models from the hub, if available, by enabling the optimum_model flag:
Once again, you are able to run the model using the MLServer CLI. As before, this needs to either be run from the same directory where our config files are, or point to the folder where they are.
The request can now be sent using the same request structure, but using optimized models for better performance.
MLServer supports many other Transformer tasks beyond text generation; below are examples for a few of the other supported tasks.
Once again, you are able to run the model using the MLServer CLI.
Once again, you are able to run the model using the MLServer CLI.
We can also evaluate GPU acceleration by comparing the speed on CPU vs GPU using the following parameters.
We first test the time taken with device=-1, which configures the CPU by default.
Once again, you are able to run the model using the MLServer CLI.
We can see that it takes 81 seconds, which is 8 times longer than the GPU example below.
IMPORTANT: Running the code below requires a machine with a GPU configured correctly for TensorFlow/PyTorch.
Now we'll run the benchmark with the GPU configured, which we can do by setting device=0.
We can see that the elapsed time is 8 times less than the CPU version!
We can also see how the adaptive batching capabilities can allow for GPU acceleration, by grouping multiple incoming requests so they get processed on the GPU as a single batch.
In our case, we can enable adaptive batching with the max_batch_size parameter, which we will set to 128.
We will also configure max_batch_time, which specifies the maximum amount of time the MLServer orchestrator will wait before sending the batch for inference (a configuration sketch follows below).
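A sketch of the corresponding model-settings.json, written from Python; max_batch_size and max_batch_time are standard MLServer model settings, while the extra fields are assumptions carried over from the earlier HuggingFace sketch.

```python
# Sketch: enable adaptive batching in model-settings.json.
import json

model_settings = {
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "max_batch_size": 128,   # group up to 128 requests into a single batch
    "max_batch_time": 0.5,   # wait at most 0.5s before dispatching a batch
    "parameters": {"extra": {"task": "text-generation", "device": 0}},
}

with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)
```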
In order to achieve the required throughput of 50 requests per second, we will use the load-testing tool vegeta.
We can now see that the requests are batched and we receive 100% success, even though the requests are sent one by one.
Out of the box, MLServer supports the deployment and serving of MLflow models with the following features:
Loading of MLflow Model artifacts.
Support of dataframes, dict-of-tensors and tensor inputs.
In this example, we will showcase some of these features using an example model.
The first step will be to train and serialise an MLflow model. For that, we will use the linear regression example from the MLflow docs.
The training script will also serialise our trained model, leveraging the MLflow Model format. By default, we should be able to find the saved artifact under the mlruns folder.
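A compressed sketch of that training script is shown below. It assumes the wine-quality CSV from the MLflow example is available locally as wine-quality.csv, and uses an ElasticNet regressor as in the original example.

```python
# Sketch: train a wine-quality regressor and log it with MLflow.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

# Assumes the wine-quality CSV from the MLflow example is available locally
data = pd.read_csv("wine-quality.csv")
X = data.drop(columns=["quality"])
y = data["quality"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = ElasticNet(alpha=0.5, l1_ratio=0.5, random_state=42)
    model.fit(X_train, y_train)

    # The serialised artifact ends up under the local ./mlruns folder
    mlflow.sklearn.log_model(model, artifact_path="model")
```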
Now that we have trained and serialised our model, we are ready to start serving it. For that, the initial step will be to set up a model-settings.json that instructs MLServer to load our artifact using the MLflow Inference Runtime.
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Note that the request specifies the value pd as its content type, whereas every input specifies the content type np. These parameters will instruct MLServer to do the following (a sketch of such a request follows the list):
Convert every input value to a NumPy array, using the data type and shape information provided.
Group all the different inputs into a Pandas DataFrame, using their names as the column names.
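A sketch of such a request is shown below; the feature names and values are illustrative, and the wine-classifier model name matches the metadata endpoint used later in this example.

```python
# Sketch: a V2 request using the "pd" request codec and "np" input codecs.
import requests

inference_request = {
    "parameters": {"content_type": "pd"},
    "inputs": [
        {
            "name": "fixed acidity",
            "shape": [1],
            "datatype": "FP32",
            "data": [7.4],
            "parameters": {"content_type": "np"},
        },
        {
            "name": "alcohol",
            "shape": [1],
            "datatype": "FP32",
            "data": [9.4],
            "parameters": {"content_type": "np"},
        },
        # ...one input per DataFrame column expected by the model
    ],
}

endpoint = "http://localhost:8080/v2/models/wine-classifier/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json())
```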
To learn more about how MLServer uses content type parameters, you can check this worked out example.
As we can see in the output above, the predicted quality score for our input wine was 5.57.
MLflow currently ships with a scoring server with its own protocol. In order to provide a drop-in replacement, the MLflow runtime in MLServer also exposes a custom endpoint which matches the signature of MLflow's /invocations endpoint.
As an example, we can try to send the same request that we sent previously, but using MLflow's protocol. Note that, in both cases, the request will be handled by the same MLServer instance.
As we can see above, the predicted quality for our input is 5.57, matching the prediction we obtained above.
MLflow lets users define a model signature, where they can specify what types of inputs the model accepts and what types of outputs it returns. Similarly, the V2 inference protocol employed by MLServer defines a metadata endpoint which can be used to query what inputs and outputs the model accepts. However, even though they serve similar functions, the data schemas used by each of them are not compatible with each other.
To solve this, if your model defines an MLflow model signature, MLServer will convert this signature on the fly to a metadata schema compatible with the V2 Inference Protocol. This will also include specifying any extra content type that is required to correctly decode / encode your data.
As an example, we can first have a look at the model signature saved for our MLflow model. This can be seen directly in the MLModel file saved by our model.
We can then query the metadata endpoint, to see the model metadata inferred by MLServer from our test model's signature. For this, we will use the /v2/models/wine-classifier/ endpoint.
As we should be able to see, the model metadata now matches the information contained in our model signature, including any extra content types necessary to decode our data correctly.
MLServer supports loading and unloading models dynamically from a models repository. This allows you to enable and disable the models accessible by MLServer on demand. This extension builds on top of the support for Multi-Model Serving, letting you change at runtime which models MLServer is currently serving.
The API to manage the model repository is modelled after Triton's Model Repository extension to the V2 Dataplane and is thus fully compatible with it.
This notebook will walk you through an example using the Model Repository API.
First of all, we will need to train some models. For that, we will re-use the models we trained previously in the Multi-Model Serving example. You can check the details on how they are trained following that notebook.
Next up, we will start our mlserver inference server. Note that, by default, this will load all our models.
Now that we've got our inference server up and running, and serving 2 different models, we can start using the Model Repository API. To get us started, we will first list all available models in the repository.
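Listing the repository is a plain POST against the repository index endpoint defined by Triton's model repository extension, for example:

```python
# Sketch: list every model available in the model repository.
import requests

response = requests.post("http://localhost:8080/v2/repository/index", json={})
for model in response.json():
    print(model["name"], model.get("state"))
```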
As we can see, the repository lists 2 models (i.e. mushroom-xgboost and mnist-svm). Note that the state for both is set to READY. This means that both models are loaded, and thus ready for inference.
Unloading our mushroom-xgboost model
We will now try to unload one of the 2 models, mushroom-xgboost. This will unload the model from the inference server, but will keep it available on our model repository.
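The unload call is another POST against the repository API; the /load counterpart used further below is symmetrical.

```python
# Sketch: unload the mushroom-xgboost model (it stays in the repository).
import requests

requests.post("http://localhost:8080/v2/repository/models/mushroom-xgboost/unload")

# Loading it back later is the symmetric call:
# requests.post("http://localhost:8080/v2/repository/models/mushroom-xgboost/load")
```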
If we now try to list the models available in our repository, we will see that the mushroom-xgboost model is flagged as UNAVAILABLE. This means that it's present in the repository but it's not loaded for inference.
Loading our mushroom-xgboost model back
We will now load our model back into our inference server.
If we now try to list the models again, we will see that our mushroom-xgboost is back again, ready for inference.
The mlserver package comes with inference runtime implementations for scikit-learn and xgboost models. However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.
In this example, we create a simple Hello World JSON model that parses and modifies a JSON data chunk. This is often useful as a means to quickly bootstrap existing models that utilize JSON-based model inputs.
The next step will be to serve our model using mlserver. For that, we will first implement an extension which will serve as the runtime to perform inference using our custom Hello World JSON model.
This is a trivial model to demonstrate how to conceptually work with JSON inputs / outputs. In this example, we do the following (a sketch follows the list):
Parse the JSON input from the client
Create a JSON response echoing back the client request as well as a server generated message
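A sketch of such a runtime is shown below, using MLServer's StringCodec to move the raw JSON string in and out of the V2 payload; the output name and the server message are illustrative.

```python
# Sketch: a custom runtime that echoes back a JSON request plus a message.
import json

from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class JsonHelloWorldModel(MLModel):
    async def load(self) -> bool:
        # Nothing to load for this toy model
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Parse the JSON string sent by the client
        raw = StringCodec.decode_input(payload.inputs[0])[0]
        request_json = json.loads(raw)

        # Echo the request back, together with a server-generated message
        response_json = {
            "request": request_json,
            "server_response": "Got your request. Hello from the server.",
        }

        return InferenceResponse(
            model_name=self.name,
            outputs=[
                StringCodec.encode_output(
                    name="echo_response", payload=[json.dumps(response_json)]
                )
            ],
        )
```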
The next step will be to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Utilizing string data with the gRPC interface can be a bit tricky, so we need to make sure that inputs and outputs are encoded and decoded correctly.
For simplicity, in this case we leverage the Python types that mlserver provides out of the box. Alternatively, the gRPC stubs can be regenerated from the V2 specification directly, for use by non-Python as well as Python clients.
MLServer has been built with Multi-Model Serving (MMS) in mind. This means that, within a single instance of MLServer, you can serve multiple models under different paths. This also includes multiple versions of the same model.
This notebook shows an example of how you can leverage MMS with MLServer.
We will first start by training 2 different models:
mnist-svm model
mushroom-xgboost model
The next step will be serving both our models within the same MLServer instance. For that, we will just need to create a model-settings.json file local to each of our models and a server-wide settings.json. That is,
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
models/mnist-svm/model-settings.json: holds the configuration specific to our mnist-svm model (e.g. input type, runtime to use, etc.).
models/mushroom-xgboost/model-settings.json: holds the configuration specific to our mushroom-xgboost model (e.g. input type, runtime to use, etc.).
settings.json
models/mnist-svm/model-settings.json
models/mushroom-xgboost/model-settings.json
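The contents of these files could look roughly like the sketch below, written from Python for convenience; the runtime class paths come from the mlserver-sklearn and mlserver-xgboost packages, and the version strings are illustrative.

```python
# Sketch: write one model-settings.json per model, plus a shared settings.json.
import json
import os

configs = {
    "settings.json": {"debug": True},
    "models/mnist-svm/model-settings.json": {
        "name": "mnist-svm",
        "implementation": "mlserver_sklearn.SKLearnModel",
        "parameters": {"uri": "./model.joblib", "version": "v0.1.0"},
    },
    "models/mushroom-xgboost/model-settings.json": {
        "name": "mushroom-xgboost",
        "implementation": "mlserver_xgboost.XGBoostModel",
        "parameters": {"uri": "./model.json", "version": "v0.1.0"},
    },
}

for path, config in configs.items():
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
```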
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
By this point, we should have both our models getting served by MLServer. To make sure that everything is working as expected, let's send a request from each test set.
For that, we can use the Python types that the mlserver package provides out of the box, or we can build our request manually.
mnist-svm model
mushroom-xgboost model
Out of the box, MLServer provides support to receive inference requests from Kafka. The Kafka server can run side-by-side with the REST and gRPC ones, and adds a new interface to interact with your model. The inference responses coming back from your model will also get written back to their own output topic.
In this example, we will showcase the integration with Kafka by serving a model through Kafka.
We are going to start by running a simple local docker deployment of kafka that we can test against. This will be a minimal cluster that will consist of a single zookeeper node and a single broker.
You need to have Java installed in order for it to work correctly.
Now you can just run it with the following command on a separate terminal:
Now we can create the required input and output topics.
Our model will be persisted as a file named mnist-svm.joblib.
Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Note that the settings.json file will contain our Kafka configuration, including the address of the Kafka broker and the input / output topics that will be used for inference.
settings.json
model-settings.json
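A sketch of the Kafka-related settings is shown below, written from Python; the field names follow the MLServer settings schema, but treat the exact names as assumptions if you are on a different MLServer version.

```python
# Sketch: settings.json enabling the Kafka server alongside REST / gRPC.
import json

settings = {
    "debug": True,
    "kafka_enabled": True,
    "kafka_servers": "localhost:9092",
    "kafka_topic_input": "mlserver-input",
    "kafka_topic_output": "mlserver-output",
}

with open("settings.json", "w") as f:
    json.dump(settings, f, indent=2)
```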
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Now that we have verified that our server is accepting REST requests, we will try to send a new inference request through Kafka. For this, we just need to send a request to the mlserver-input topic (which is the default input topic):
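A sketch of this step using the kafka-python client is shown below. The payload follows the same V2 request structure used over REST, and the mlserver-model header and message key used to route the message to the mnist-svm model should be treated as assumptions.

```python
# Sketch: publish a V2 inference request to the mlserver-input topic.
import json

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# A single digit from the test set, flattened (as in the earlier REST request)
x_0 = X_test[0:1].tolist()

inference_request = {
    "inputs": [
        {"name": "predict", "shape": [1, 64], "datatype": "FP32", "data": x_0}
    ]
}

producer.send(
    "mlserver-input",
    key=b"mnist-svm",
    value=json.dumps(inference_request).encode("utf-8"),
    headers=[("mlserver-model", b"mnist-svm")],
)
producer.flush()
```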
Once the message has gone into the queue, the Kafka server running within MLServer should receive this message and run inference. The prediction output should then get posted into an output queue, which will be named mlserver-output by default.
As we should be able to see above, the results of our inference request are now visible in the output Kafka queue.
It's not unusual that model runtimes require extra dependencies that are not direct dependencies of MLServer. This is the case when we want to use custom runtimes, but also when our model artifacts are the output of older versions of a toolkit (e.g. models trained with an older version of SKLearn).
In these cases, since these dependencies (or dependency versions) are not known in advance by MLServer, they won't be included in the default seldonio/mlserver Docker image. To cover these cases, the seldonio/mlserver Docker image allows you to load custom environments before starting the server itself.
This example will walk you through how to create and save a custom environment, so that it can be loaded in MLServer without any extra change to the seldonio/mlserver Docker image.
For this example, we will create a custom environment to serve a model trained with an older version of Scikit-Learn. The first step will be to define this environment using an environment.yml file.
Note that these environments can also be created on the fly as we go, and then serialised later.
To illustrate the point, we will train a Scikit-Learn model using our older environment.
The first step will be to create and activate an environment which reflects what's outlined in our environment.yml file.
NOTE: If you are running this from a Jupyter Notebook, you will need to restart your Jupyter instance so that it runs from this environment.
We can now train and save a Scikit-Learn model using the older version of our environment. This model will be serialised as model.joblib.
This tool (conda-pack) will save a portable version of our environment as a .tar.gz file, also known as a tarball.
Now that we have defined our environment (and we've got a sample artifact trained in that environment), we can move to serving our model.
To do that, we will first need to select the right runtime through a model-settings.json config file.
We can then spin up our model, using our custom environment, leveraging MLServer's Docker image. Keep in mind that you will need Docker installed on your machine to run this example.
Our Docker command will need to take into account the following points:
Mount the example's folder as a volume so that it can be accessed from within the container.
Let MLServer know that our custom environment's tarball can be found as old-sklearn.tar.gz.
Expose port 8080 so that we can send requests from the outside.
From the command line, this can be done using Docker's CLI as:
Note that we need to keep the server running in the background while we send requests. Therefore, it's best to run this command on a separate terminal session.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
This tutorial walks through the steps required to take a python ML model from your machine to a production deployment on Kubernetes. More specifically we'll cover:
Running the model locally
Turning the ML model into an API
Containerizing the model
Storing the container in a registry
Deploying the model to Kubernetes (with Seldon Core)
Scaling the model
The tutorial comes with an accompanying video which you might find useful as you work through the steps:
The slides used in the video can be found .
For this tutorial, we're going to use the cassava dataset available from the Tensorflow Catalog. This dataset includes leaf images from the cassava plant. Each plant can be classified as either "healthy" or as having one of four diseases (Mosaic Disease, Bacterial Blight, Green Mite, Brown Streak Disease).
If you've already cloned the MLServer repository, you can also find it in docs/examples/cassava.
Once you've done that, you can just run:
And it'll set you up with all the libraries required to run the code.
The starting point for this tutorial is the python script app.py. This is typical of the kind of python code we'd run standalone or in a jupyter notebook. Let's familiarise ourselves with the code:
First up, we're importing a couple of functions from our helpers.py file:
plot provides the visualisation of the samples, labels and predictions.
preprocess is used to resize images to 224x224 pixels and normalize the RGB values.
The rest of the code is fairly self-explanatory from the comments. We load the model and dataset, select some examples, make predictions and then plot the results.
Try it yourself by running:
The problem with running our code like we did earlier is that it's not accessible to anyone who doesn't have the python script (and all of its dependencies). A good way to solve this is to turn our model into an API.
In order to get our model ready to run on MLServer we need to wrap it in a single python class with two methods, load() and predict(). Let's take a look at the code (found in model/serve-model.py):
The load() method is used to define any logic required to set up our model for inference. In our case, we're loading the model weights into self._model. The predict() method is where we include all of our prediction logic.
You may notice that we've slightly modified our code from earlier (in app.py). The biggest change is that it is now wrapped in a single class, CassavaModel.
The only other task we need to do to run our model on MLServer is to specify a model-settings.json file:
This is a simple configuration file that tells MLServer how to handle our model. In our case, we've provided a name for our model and told MLServer where to look for our model class (serve-model.CassavaModel).
We're now ready to serve our model with MLServer. To do that we can simply run:
MLServer will now start up, load our cassava model and provide access through both a REST and gRPC API.
Now that our API is up and running, open a new terminal window and navigate back to the root of this repository. We can then send predictions to our api using the test.py file, by running:
Taking our model and packaging it into a container manually can be a pretty tricky process and requires knowledge of writing Dockerfiles. Thankfully MLServer removes this complexity and provides us with a simple build command.
Before we run this command, we need to provide our dependencies in either a requirements.txt or a conda.env file. The requirements file we'll use for this example is stored in model/requirements.txt:
Notice that we didn't need to include mlserver in our requirements? That's because the builder image has mlserver included already.
We're now ready to build our container image using:
Make sure you replace YOUR_CONTAINER_REGISTRY and IMAGE_NAME with your dockerhub username and a suitable name e.g. "bobsmith/cassava".
MLServer will now build the model into a container image for us. We can check the output of this by running:
Finally, we want to send this container image to be stored in our container registry. We can do this by running:
Now that we've turned our model into a production-ready API, containerized it and pushed it to a registry, it's time to deploy our model.
To create our deployment with Seldon Core we need to create a small configuration file that looks like this:
You can find this file named deployment.yaml in the base folder of this tutorial's repository.
Make sure you replace YOUR_CONTAINER_REGISTRY and IMAGE_NAME with your dockerhub username and a suitable name e.g. "bobsmith/cassava".
We can apply this configuration file to our Kubernetes cluster just like we would for any other Kubernetes object using:
To check our deployment is up and running we can run:
We should see STATUS = Running once our deployment has finalized.
Now that our model is up and running on a Kubernetes cluster (via Seldon Core), we can send some test inference requests to make sure it's working.
To do this, we simply run the test.py file in the following way:
This script will randomly select some test samples, send them to the cluster, gather the predictions and then plot them for us.
Kubernetes and Seldon Core make this really easy to do by simply running:
We can replace the --replicas=3 with any number we want to scale to.
To watch the servers scaling out we can run:
To see MLServer in action you can check out the examples below. These are end-to-end notebooks, showing how to serve models with MLServer.
If you are interested in how MLServer interacts with particular model frameworks, you can check the following examples. These focus on showcasing the different inference runtimes that ship with MLServer out of the box. Note that, for advanced use cases, you can also write your own custom inference runtime (see the custom runtime examples).
To see some of the advanced features included in MLServer (e.g. multi-model serving), check out the examples below.
Tutorials are designed to be beginner-friendly and walk through accomplishing a series of tasks using MLServer (and other tools).
The first step will be to train a simple scikit-learn model. For that, we will use the MNIST example from the scikit-learn documentation, which trains an SVM model.
To save our trained model, we will serialise it using joblib. While this is not a perfect approach, it's currently the recommended method to persist models to disk in the scikit-learn documentation.
You can find more details of this process in the .
Lastly, we will need to serialise our environment in the format expected by MLServer. To do that, we will use a tool called conda-pack.
We won't go through the steps of training the classifier. Instead, we'll be using a pre-trained one available on TensorFlow Hub. You can find the .
The easiest way to run this example is to clone the repository:
Here's what our setup currently looks like:
Typically people turn to popular python web servers like Flask or FastAPI. This is a good approach and gives us lots of flexibility, but it also requires us to do a lot of the work ourselves. We need to implement routes, set up logging, capture metrics and define an API schema, among other things. A simpler way to tackle this problem is to use an inference server. For this tutorial we're going to use the open source MLServer framework.
MLServer supports a bunch of inference runtimes out of the box, but it also supports custom python code, which is what we'll use for our Tensorflow model.
Our setup has now evolved and looks like this:
Containers are an easy way to package our application together with its runtime and dependencies. More importantly, containerizing our model allows it to run in a variety of different environments.
Note: you will need Docker installed to run this section of the tutorial. You'll also need a Docker Hub account or another container registry.
Our setup now looks like this. Where our model has been packaged and sent to a container registry:
We're going to use a popular open source framework called Seldon Core to deploy our model. Seldon Core is great because it combines all of the awesome cloud-native features we get from Kubernetes, but it also adds machine-learning specific features.
This tutorial assumes you already have a Seldon Core cluster up and running. If that's not the case, head over to the Seldon Core documentation and get set up first. You'll also need to install the kubectl command line interface.
A note on running this yourself: This example is set up to connect to a kubernetes cluster running locally on your machine. If yours is local too, you'll need to make sure you port-forward before sending requests. If your cluster is remote, you'll need to change the inference_url variable on line 21 of test.py.
Having deployed our model to kubernetes and tested it, our setup now looks like this:
Our model is now running in a production environment and able to handle requests from external sources. This is awesome but what happens as the number of requests being sent to our model starts to increase? Eventually, we'll reach the limit of what a single server can handle. Thankfully, we can get around this problem by scaling our model horizontally.
Once the new replicas have finished rolling out, our setup now looks like this:
In this tutorial we've scaled the model out manually to show how it works. In a real environment we'd want to set up automatic scaling to make sure our prediction API is always online and performing as expected.
mnist-svm (scikit-learn): ./models/mnist-svm/model.joblib
mushroom-xgboost (xgboost): ./models/mushroom-xgboost/model.json
Out of the box, mlserver supports the deployment and serving of scikit-learn models. By default, it will assume that these models have been serialised using joblib.
In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.
The first step will be to train a simple scikit-learn model. For that, we will use the MNIST example from the scikit-learn documentation which trains an SVM model.
To save our trained model, we will serialise it using joblib. While this is not a perfect approach, it's currently the recommended method to persist models to disk in the scikit-learn documentation.
Our model will be persisted as a file named mnist-svm.joblib.
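A minimal sketch of this training step, following the scikit-learn digits example:

```python
# Sketch: train an SVM classifier on the digits dataset and persist it with joblib.
import joblib
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))  # flatten the 8x8 images

X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.5, shuffle=False
)

classifier = svm.SVC(gamma=0.001)
classifier.fit(X_train, y_train)

joblib.dump(classifier, "mnist-svm.joblib")
```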
Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
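Using the mlserver Python types and the NumpyCodec, the request could be built roughly like this (sending it over REST with requests; the model name matches the model-settings.json above):

```python
# Sketch: build a V2 request with mlserver's types and send it over REST.
import requests
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest

x_0 = X_test[0:1]  # a single digit from the test set

inference_request = InferenceRequest(
    inputs=[NumpyCodec.encode_input(name="predict", payload=x_0)]
)

endpoint = "http://localhost:8080/v2/models/mnist-svm/infer"
response = requests.post(endpoint, json=inference_request.dict())
print(response.json())
```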
As we can see above, the model predicted the input as the number 8, which matches what's on the test set.
MLServer extends the V2 inference protocol by adding support for a content_type annotation. This annotation can be provided either through the model metadata parameters, or through the input parameters. By leveraging the content_type annotation, we can provide the necessary information to MLServer so that it can decode the input payload from the "wire" V2 protocol to something meaningful to the model / user (e.g. a NumPy array).
This example will walk you through some examples which illustrate how this works, and how it can be extended.
To start with, we will write a dummy runtime which just prints the input, the decoded input and returns it. This will serve as a testbed to showcase how the content_type support works.
Later on, we will extend this runtime by adding custom codecs that will decode our V2 payload to custom types.
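A sketch of what that dummy runtime could look like:

```python
# Sketch: a dummy runtime that prints each input, decodes it and echoes it back.
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


class EchoRuntime(MLModel):
    async def load(self) -> bool:
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        outputs = []
        for request_input in payload.inputs:
            # self.decode() resolves the right codec from the content_type
            decoded_input = self.decode(request_input)
            print(f"------ Encoded Input ({request_input.name}) ------")
            print(request_input)
            print(f"------ Decoded input ({request_input.name}) ------")
            print(decoded_input)

            outputs.append(
                ResponseOutput(
                    name=request_input.name,
                    datatype=request_input.datatype,
                    shape=request_input.shape,
                    data=request_input.data,
                )
            )

        return InferenceResponse(model_name=self.name, outputs=outputs)
```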
As you can see above, this runtime will decode the incoming payloads by calling the self.decode() helper method. This method will check which content type is the right one for each input, in the following order:
Is there any content type defined in the inputs[].parameters.content_type field within the request payload?
Is there any content type defined in the inputs[].parameters.content_type field within the model metadata?
Is there any default content type that should be assumed?
In order to enable this runtime, we will also create a model-settings.json file. This file should be present in (or accessible from) the folder where we run the mlserver start . command.
Our initial step will be to decide the content type based on the incoming inputs[].parameters field. For this, we will start our MLServer in the background (e.g. by running mlserver start .).
As you've probably already noticed, writing request payloads compliant with the V2 Inference Protocol requires a certain knowledge of both the V2 spec and the structure expected by each content type. To account for this and simplify usage, the MLServer package exposes a set of utilities which will help you interact with your models via the V2 protocol.
These helpers are mainly shaped as "codecs". That is, abstractions which know how to "encode" and "decode" arbitrary Python datatypes to and from the V2 Inference Protocol.
Generally, we recommend using the existing set of codecs to generate your V2 payloads. This will ensure that requests and responses follow the right structure, and should provide a more seamless experience.
Following on from our previous example, the same code could be rewritten using codecs as:
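A sketch of that rewrite is shown below; the payloads are illustrative, and the model name is whatever you used in model-settings.json.

```python
# Sketch: build the same request using MLServer's codecs instead of raw JSON.
import numpy as np
import requests
from mlserver.codecs import NumpyCodec, StringCodec
from mlserver.types import InferenceRequest

inference_request = InferenceRequest(
    inputs=[
        # A NumPy array, encoded by NumpyCodec
        NumpyCodec.encode_input(name="foo", payload=np.array([[1, 2], [3, 4]])),
        # A list of strings, encoded by StringCodec
        StringCodec.encode_input(name="bar", payload=["first", "second"]),
    ]
)

endpoint = "http://localhost:8080/v2/models/content-type-example/infer"
response = requests.post(endpoint, json=inference_request.dict())
print(response.json())
```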
Note that the rewritten snippet now makes use of the built-in InferenceRequest class, which represents a V2 inference request. On top of that, it also uses the NumpyCodec and StringCodec implementations, which know how to encode a Numpy array and a list of strings into V2-compatible request inputs.
Our next step will be to define the expected content type through the model metadata. This can be done by extending the model-settings.json file, and adding a section on inputs.
After adding this metadata, we will re-start MLServer (e.g. mlserver start .) and we will send a new request without any explicit parameters.
As you should be able to see in the server logs, MLServer will cross-reference the input names against the model metadata to find the right content type.
There may be cases where a custom inference runtime needs to encode / decode to custom datatypes. As an example, we can think of computer vision models which may only operate with pillow image objects.
In these scenarios, it's possible to extend the Codec interface to write our custom encoding logic. A Codec is simply an object which defines decode() and encode() methods. To illustrate how this would work, we will extend our custom runtime to add a custom PillowCodec.
We should now be able to restart our instance of MLServer (i.e. with the mlserver start . command), to send a few test requests.
As you should be able to see in the MLServer logs, the server is now able to decode the payload into a Pillow image. This example also illustrates how Codec objects can be compatible with multiple datatype values (e.g. tensor and BYTES in this case).
So far, we've seen how you can specify codecs so that they get applied at the input level. However, it is also possible to use request-wide codecs that aggregate multiple inputs to decode the payload. This is usually relevant for cases where the models expect a multi-column input type, like a Pandas DataFrame.
To illustrate this, we will first tweak our EchoRuntime so that it prints the decoded contents at the request level.
We should now be able to restart our instance of MLServer (i.e. with the mlserver start . command), to send a few test requests.
The mlserver package comes with built-in support for streaming data. This allows you to process data in real-time, without having to wait for the entire response to be available. It supports both REST and gRPC APIs.
In this example, we create a simple Identity Text Model which simply splits the input text into words and returns them one by one. We will use this model to demonstrate how to stream the response from the server to the client. This particular example can provide a good starting point for building more complex streaming models such as the ones based on Large Language Models (LLMs), where streaming is an essential feature to hide the latency of the model.
The next step will be to serve our model using mlserver. For that, we will first implement an extension that serves as the runtime to perform inference using our custom TextModel.
This is a trivial model to demonstrate streaming support. The model simply splits the input text into words and returns them one by one (a sketch follows the list below). In this example we do the following:
split the text into words using the white space as the delimiter.
wait 0.5 seconds between each word to simulate a slow model.
return each word one by one.
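A sketch of such a runtime is shown below; the StringCodec usage and the output name mirror the other custom runtime examples and should be treated as assumptions.

```python
# Sketch: a runtime that streams the words of the input text one by one.
import asyncio
from typing import AsyncIterator

from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class TextModel(MLModel):
    async def predict_stream(
        self, payloads: AsyncIterator[InferenceRequest]
    ) -> AsyncIterator[InferenceResponse]:
        # For REST, a single request is received; take the first payload
        payload = [_ async for _ in payloads][0]
        text = StringCodec.decode_input(payload.inputs[0])[0]

        # Split the text into words and stream them back one at a time
        for word in text.split(" "):
            await asyncio.sleep(0.5)  # simulate a slow model
            yield InferenceResponse(
                model_name=self.name,
                outputs=[StringCodec.encode_output(name="output", payload=[word])],
            )
```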
As can be seen, the predict_stream method receives as input an AsyncIterator of InferenceRequest and returns an AsyncIterator of InferenceResponse. This definition covers all types of possible input-output combinations for streaming: unary-stream, stream-unary, stream-stream. It is up to the client and server to send/receive the appropriate number of requests/responses, which should be known a priori.
Note that although unary-unary can be covered by the predict_stream method as well, mlserver already covers that through the predict method.
One important limitation to keep in mind is that for the REST API, the client will not be able to send a stream of requests. The client will have to send a single request with the entire input text. The server will then stream the response back to the client. gRPC API, on the other hand, supports all types of streaming listed above.
The next step will be to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Note that the streaming support in MLServer currently has the following limitations (a settings sketch follows this list):
distributed workers are not supported (i.e. the parallel_workers setting should be set to 0);
the gzip middleware is not supported for REST (i.e. the gzip_enabled setting should be set to false).
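A sketch of the corresponding settings.json, written from Python:

```python
# Sketch: settings.json reflecting the streaming limitations above.
import json

settings = {
    "debug": True,
    "parallel_workers": 0,   # distributed workers not supported with streaming
    "gzip_enabled": False,   # gzip middleware not supported for REST streaming
}

with open("settings.json", "w") as f:
    json.dump(settings, f, indent=2)
```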
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
To test our model, we will use the following inference request:
To send a REST streaming request to the server, we will use the following Python code:
To send a gRPC streaming request to the server, we will use the following Python code:
Note that for gRPC, the request is transformed into an async generator which is then passed to the ModelStreamInfer method. The response is also an async generator which can be iterated over to get the responses.