MLServer is used as the core Python inference server in KServe (formerly known as KFServing). This provides a straightforward path to deploy your models into a scalable serving infrastructure backed by Kubernetes.
This section assumes a basic knowledge of KServe and Kubernetes, as well as access to a working Kubernetes cluster with KServe installed. To learn more about KServe or how to install it, please visit the KServe documentation.
KServe provides built-in serving runtimes to deploy models trained in common ML frameworks. These allow you to deploy your models into a robust infrastructure by just pointing to where the model artifacts are stored remotely.
Some of these runtimes leverage MLServer as the core inference server. Therefore, it should be straightforward to move from your local testing to your serving infrastructure.
To use any of the built-in serving runtimes offered by KServe, it should be enough to select the relevant one in your `InferenceService` manifest.
For example, to serve a Scikit-Learn model, you could use a manifest like the one below:
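A minimal sketch of such a manifest could look as follows (the deployment name and `storageUri` are placeholders for your own values):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-sklearn-model
spec:
  predictor:
    sklearn:
      # Serve over the V2 inference protocol using the sklearn serving runtime
      protocolVersion: v2
      storageUri: gs://my-bucket/sklearn/model
```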
As you can see highlighted above, the `InferenceService` manifest will only need to specify the following points:

- The model artifact is a Scikit-Learn model. Therefore, we will use the `sklearn` serving runtime to deploy it.
- The model will be served using the V2 inference protocol, which can be enabled by setting the `protocolVersion` field to `v2`.
Once you have your `InferenceService` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, by running:
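For example, assuming the manifest above was saved as `my-model.yaml` (a placeholder name):

```bash
kubectl apply -f my-model.yaml
```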
As mentioned above, KServe offers support for built-in serving runtimes, some of which leverage MLServer as the inference server. Below you can find a table listing these runtimes, and the MLServer inference runtime that they correspond to.
Note that, on top of the ones shown above (backed by MLServer), KServe also provides a wider set of serving runtimes. To see the full list, please visit the KServe documentation.
Sometimes, the serving runtimes built into KServe may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, it's easy to deploy them into your serving infrastructure leveraging KServe support for custom runtimes.
The `InferenceService` manifest gives you full control over the containers used to deploy your machine learning model. This can be leveraged to point your deployment to the custom MLServer image containing your custom logic. For example, if we assume that our custom image has been tagged as `my-custom-server:0.1.0`, we could write an `InferenceService` manifest like the one below:
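A rough sketch of such a manifest is shown below; the exact fields used to select the protocol and port can vary across KServe versions, so treat this as an illustration rather than a definitive spec:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-custom-model
spec:
  predictor:
    containers:
      - name: classifier
        image: my-custom-server:0.1.0
        env:
          # Explicitly select the V2 inference protocol
          - name: PROTOCOL
            value: v2
        ports:
          # Port exposed by the custom MLServer container
          - containerPort: 8080
            protocol: TCP
```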
As we can see highlighted above, the main points that we'll need to take into account are:

- Pointing to our custom MLServer image in the custom container section of our `InferenceService`.
- Explicitly choosing the V2 inference protocol to serve our model.
- Letting KServe know what port will be exposed by our custom container to send inference requests.
Once you have your `InferenceService` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, using the same `kubectl apply` command as above.
Out of the box, MLServer includes support to offload inference workloads to a pool of workers running in separate processes. This allows MLServer to scale out beyond the limitations of the Python interpreter. To learn more about why this can be beneficial, you can check the concurrency section below.
By default, MLServer will spin up a pool with only one worker process to run inference. All models will be loaded uniformly across the inference pool workers. To read more about advanced settings, please see the usage section below.
The Global Interpreter Lock (GIL) is a mutex lock that exists in most Python interpreters (e.g. CPython). Its main purpose is to lock Python's execution so that it only runs on a single processor at the same time. This simplifies certain things for the interpreter. However, it also adds the limitation that a single Python process will never be able to leverage multiple cores.
When we think about MLServer's support for Multi-Model Serving (MMS), this could lead to scenarios where a heavily-used model starves the other models running within the same MLServer instance. Similarly, even if we don’t take MMS into account, the GIL also makes it harder to scale inference for a single model.
To work around this limitation, MLServer offloads the model inference to a pool of workers, where each worker is a separate Python process (and thus has its own separate GIL). This means that we can get full access to the underlying hardware.
Managing the Inter-Process Communication (IPC) between the main MLServer process and the inference pool workers brings in some overhead. Under the hood, MLServer uses the `multiprocessing` library to implement the distributed processing management, which has been shown to offer the smallest possible overhead when implementing these types of distributed strategies {cite}`zhiFiberPlatformEfficient2020`.
The extra overhead introduced by other libraries is usually brought in as a trade-off in exchange for other advanced features for complex distributed processing scenarios. However, MLServer's use case is simple enough to not require any of these.
Even though this overhead is minimised, it can still be particularly noticeable for lightweight inference methods, where the extra IPC overhead can account for a large percentage of the overall time. In these cases (which can only be assessed on a model-by-model basis), the user has the option to disable the parallel inference feature.
For regular models where inference can take a bit more time, this overhead is usually offset by the benefit of having multiple cores to compute inference on.
By default, MLServer will always create an inference pool with one single worker. The number of workers (i.e. the size of the inference pool) can be adjusted globally through the server-level `parallel_workers` setting.
`parallel_workers`

The `parallel_workers` field of the `settings.json` file (or alternatively, the `MLSERVER_PARALLEL_WORKERS` global environment variable) controls the size of MLServer's inference pool. The expected values are:

- `N`, where `N > 0`, will create a pool of `N` workers.
- `0`, will disable the parallel inference feature. In other words, inference will happen within the main MLServer process.
Jiale Zhi, Rui Wang, Jeff Clune, and Kenneth O. Stanley. Fiber: A Platform for Efficient Development and Distributed Training for Reinforcement Learning and Population-Based Methods. arXiv:2003.11164 [cs, stat], March 2020.
An open source inference server for your machine learning models.
MLServer aims to provide an easy way to start serving your machine learning models through a REST and gRPC interface, fully compliant with KFServing's V2 Dataplane spec. Watch a quick video introducing the project here.
Multi-model serving, letting users run multiple models within the same process.
Ability to run inference in parallel for vertical scaling across multiple models through a pool of inference workers.
Support for adaptive batching, to group inference requests together on the fly.
Scalability with deployment in Kubernetes native frameworks, including Seldon Core and KServe (formerly known as KFServing), where MLServer is the core Python inference server used to serve machine learning models.
Support for the standard V2 Inference Protocol on both the gRPC and REST flavours, which has been standardised and adopted by various model serving frameworks.
You can read more about the goals of this project on the initial design document.
You can install the `mlserver` package by running:
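```bash
pip install mlserver
```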
Note that to use any of the optional inference runtimes, you'll need to install the relevant package. For example, to serve a `scikit-learn` model, you would need to install the `mlserver-sklearn` package:
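```bash
pip install mlserver-sklearn
```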
For further information on how to use MLServer, you can check any of the available examples.
Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice. You can read more about inference runtimes in their documentation page.
Out of the box, MLServer comes with a set of pre-packaged runtimes which let you interact with a subset of common frameworks. This allows you to start serving models saved in these frameworks straight away. However, it's also possible to write custom runtimes.
Out of the box, MLServer provides support for:
🔴 Unsupported
🟠 Deprecated: To be removed in a future version
🟢 Supported
🔵 Untested
To see MLServer in action, check out our full list of examples. You can find below a few selected examples showcasing how you can leverage MLServer to start serving your machine learning models.
Both the main mlserver
package and the inference runtimes packages try to follow the same versioning schema. To bump the version across all of them, you can use the ./hack/update-version.sh
script.
We generally keep the version as a placeholder for an upcoming version.
For example:
To run all of the tests for MLServer and the runtimes, use:
To run tests for a single file, use something like:
MLServer follows the Open Inference Protocol (previously known as the "V2 Protocol"). You can find the full OpenAPI spec for the Open Inference Protocol in the links below:
Name | Description | OpenAPI Spec |
---|---|---|
On top of the OpenAPI spec above, MLServer also autogenerates a Swagger UI which can be used to interact dynamically with the Open Inference Protocol.
The autogenerated Swagger UI can be accessed under the `/v2/docs` endpoint.

Besides the Swagger UI, you can also access the raw OpenAPI spec through the `/v2/docs/dataplane.json` endpoint.
Alongside the general API documentation, MLServer will also autogenerate a Swagger UI tailored to individual models, showing the endpoints available for each one.
The model-specific autogenerated Swagger UI can be accessed under the following endpoints:

- `/v2/models/{model_name}/docs`
- `/v2/models/{model_name}/versions/{model_version}/docs`

Besides the Swagger UI, you can also access the model-specific raw OpenAPI spec through the following endpoints:

- `/v2/models/{model_name}/docs/dataplane.json`
- `/v2/models/{model_name}/versions/{model_version}/docs/dataplane.json`
This guide will help you get started creating machine learning microservices with MLServer in less than 30 minutes. Our use case will be to create a service that helps us compare the similarity between two documents. Think about whenever you are choosing a book, news article, blog post, or tutorial to read next: wouldn't it be great to have a way to compare it with similar ones that you have already read and liked (without having to rely on a recommendation system)? That's what we'll focus on in this guide: creating a document similarity service. 📜 + 📃 = 😎👌🔥
The code is showcased as if it were cells inside a notebook but you can run each of the steps inside Python files with minimal effort.
MLServer is an open-source Python library for building production-ready asynchronous APIs for machine learning models.
The first step is to install `mlserver`, the `spacy` library, and the language model `spacy` will need for our use case. We will also download the `wikipedia-api` library to test our use case with a few fun summaries.
If you've never heard of spaCy before, it is an open-source Python library for advanced natural language processing that excels at large-scale information extraction and retrieval tasks, among many others. The model we'll use is a pre-trained model on English text from the web. This model will help us get started with our use case faster than if we had to train a model from scratch for our use case.
Let's first install these libraries.
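```bash
pip install mlserver spacy wikipedia-api
```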
We will also need to download the language model separately once we have spaCy inside our virtual environment.
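The specific model used in the guide isn't pinned down here; the medium English web model is a reasonable stand-in:

```bash
python -m spacy download en_core_web_md
```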
If you're going over this guide inside a notebook, don't forget to add an exclamation mark !
in front of the two commands above. If you are in VSCode, you can keep them as they are and change the cell type to bash.
At its core, MLServer requires that users give it 3 things: a `model-settings.json` file with information about the model, an (optional) `settings.json` file with information related to the server you are about to set up, and a `.py` file with the load-predict recipe for your model (as shown in the picture above).
Let's create a directory for our model.
Before we create a service that allows us to compare the similarity between two documents, it is good practice to first test that our solution works, especially when we're using a pre-trained model and/or a pipeline.
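For example, assuming the medium English model downloaded above:

```python
import spacy

nlp = spacy.load("en_core_web_md")
```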
Now that we have our model loaded, let's look at the similarity of the abstracts of Barbieheimer using the `wikipedia-api` Python library. The main requirement of the API is that we pass into the main class, `Wikipedia()`, a project name, an email and the language we want information to be returned in. After that, we can search for the movie summaries we want by passing the title of the movie to the `.page()` method and accessing the summary of it with the `.summary` attribute.
Feel free to change the movies for other topics you might be interested in.

You can run the following lines inside a notebook or, conversely, add them to an `app.py` file.
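A sketch of those lines; the project name and email in the user agent are placeholders:

```python
import wikipediaapi

# Wikipedia() takes a user agent (project name + contact email) and a language
wiki = wikipediaapi.Wikipedia("SimilarityService (hello@example.com)", "en")

barbie = wiki.page("Barbie (film)").summary
oppenheimer = wiki.page("Oppenheimer (film)").summary

print(barbie[:150])
print(oppenheimer[:150])
```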
If you created an app.py
file with the code above, make sure you run python app.py
from the terminal.
Now that we have our two summaries, let's compare them using spacy.
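For instance, reusing the `nlp` pipeline loaded earlier:

```python
doc1 = nlp(barbie)
doc2 = nlp(oppenheimer)

print(doc1.similarity(doc2))
```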
Notice that both summaries have information about the other movie, about "films" in general, and about the dates each aired on (which are the same). The reality is that the model hasn't seen any of these movies, so it might be generalizing to the context of each article, "movies," rather than their content, "dolls as humans and the atomic bomb."
You should, of course, play around with different pages and see if what you get back is coherent with what you would expect.
Time to create a machine learning API for our use-case. 😎
MLServer allows us to wrap machine learning models into APIs and build microservices with replicas of a single model, or different models all together.
To create a service with MLServer, we will define a class with two asynchronous functions, one that loads the model and another one to run inference (or predict) with. The former will load the `spacy` model we tested in the last section, and the latter will take in a list with the two documents we want to compare. Lastly, our function will return a `numpy` array with a single value, our similarity score. We'll write the file to our `similarity_model` directory and call it `my_model.py`.
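A sketch of what `my_model.py` could look like; the class name and the spaCy model are placeholders you can adapt:

```python
# similarity_model/my_model.py
import numpy as np
import spacy

from mlserver import MLModel
from mlserver.codecs import NumpyCodec, StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class SimilarityModel(MLModel):
    async def load(self) -> bool:
        # Load the pre-trained spaCy pipeline once, at start-up
        self._model = spacy.load("en_core_web_md")
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Expect a single input containing the two documents to compare
        docs = StringCodec.decode_input(payload.inputs[0])
        doc1, doc2 = self._model(docs[0]), self._model(docs[1])
        score = np.array([doc1.similarity(doc2)])

        return InferenceResponse(
            model_name=self.name,
            outputs=[NumpyCodec.encode_output(name="similarity", payload=score)],
        )
```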
Now that we have our model file ready to go, the last piece of the puzzle is to tell MLServer a bit of info about it. In particular, it wants (or needs) to know the name of the model and how to implement it. The former can be anything you want (and it will be part of the URL of your API), and the latter will follow the recipe of `name_of_py_file_with_your_model.class_with_your_model`.
Let's create the `model-settings.json` file MLServer is expecting inside our `similarity_model` directory and add the name and the implementation of our model to it.
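A sketch of that file, matching the placeholder names used above:

```json
{
  "name": "doc-sim-model",
  "implementation": "my_model.SimilarityModel"
}
```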
Now that everything is in place, we can start serving predictions locally to test how things would play out for our future users. We'll initiate our server via the command line, and later on we'll see how to do the same via Python files. Here's where we are at right now in the process of developing microservices with MLServer.
As you can see in the image, our server will be initialized with three entry points, one for HTTP requests, another for gRPC, and another for the metrics. To learn more about the powerful metrics feature of MLServer, please visit the relevant docs page here. To learn more about gRPC, please see this tutorial here.
To start our service, open up a terminal and run the following command.
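```bash
mlserver start similarity_model/
```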
Note: If this is a fresh terminal, make sure you activate your environment before you run the command above. If you run the command above from your notebook (e.g. `!mlserver start similarity_model/`), you will have to send the request below from another notebook or terminal, since the cell will continue to run until you turn it off.
Time to become a client of our service and test it. For this, we'll set up the payload we'll send to our service and use the `requests` library to POST our request.

Please note that the request below uses the variables we created earlier with the summaries of Barbie and Oppenheimer. If you are sending this POST request from a fresh Python file, make sure you move those earlier lines of code into your request file.
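A sketch of such a request, assuming the server runs locally on the default HTTP port and the model is named `doc-sim-model` as above:

```python
import requests

inference_request = {
    "inputs": [
        {
            "name": "documents",
            "shape": [2],
            "datatype": "BYTES",
            "parameters": {"content_type": "str"},
            "data": [barbie, oppenheimer],  # the two summaries fetched earlier
        }
    ]
}

response = requests.post(
    "http://localhost:8080/v2/models/doc-sim-model/infer",
    json=inference_request,
)
print(response.json())
```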
Let's decompose what just happened.
The `URL` for our service might seem a bit odd if you've never heard of the V2/Open Inference Protocol (OIP). This protocol is a set of specifications that allows machine learning models to be shared and deployed in a standardized way. This protocol enables the use of machine learning models on a variety of platforms and devices without requiring changes to the model or its code. The OIP is useful because it allows us to integrate machine learning into a wide range of applications in a standard way.
All URLs you create with MLServer will have the following structure.
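In rough terms:

```
http://<host>:<port>/v2/models/<model_name>/infer
```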
This kind of protocol is a standard adopted by different companies like NVIDIA, Tensorflow Serving, KServe, and others, to keep everyone on the same page. If you think about driving cars globally, your country has to apply a standard for driving on a particular side of the road, and this ensures you and everyone else stays on the left (or the right depending on where you are at). Adopting this means that you won't have to wonder where the next driver is going to come out of when you are driving and are about to take a turn, instead, you can focus on getting to where you're going to without much worrying.
Let's describe what each of the components of our `inference_request` does.

- `name`: this maps one-to-one to the name of the parameter in your `predict()` function.
- `shape`: represents the shape of the elements in our `data`. In our case, it is a list with `[2]` strings.
- `datatype`: the different data types expected by the server, e.g. str, numpy array, pandas dataframe, bytes, etc.
- `parameters`: allows us to specify the `content_type` beyond the data types.
- `data`: the inputs to our predict function.
To learn more about the OIP and how MLServer content types work, please have a look at their docs page here.
Say you need to meet the demand of a high number of users, and one model might not be enough or is not using all of the resources of the virtual machine instance it was allocated to. What we can do in this case is create multiple replicas of our model to increase the throughput of the requests that come in. This can be particularly useful at the peak times of our server. To do this, we need to tweak the settings of our server via the `settings.json` file. In it, we'll add the number of independent models we want to have via the parameter `"parallel_workers": 3`.
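For example, a minimal `settings.json` (placed in the directory you point `mlserver start` at) could be:

```json
{
  "parallel_workers": 3
}
```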
Let's stop our server, change the settings of it, start it again, and test it.
As you can see in the output of the terminal in the picture above, we now have 3 models running in parallel. The reason you might see 4 lines is that, by default, MLServer prints the name of the initialised model when one or more models are loaded, plus one line for each of the replicas specified in the settings.
Let's get a few more twin films examples to test our server. Get as creative as you'd like. 💡
Let's first test that the function works as intended.
Now let's map three POST requests at the same time.
We can also test it one by one.
For the last step of this guide, we are going to package our model and service into a docker image that we can reuse in another project or share with colleagues immediately. This step requires that we have docker installed and configured on our machines, so if you need to set up docker, you can do so by following the instructions in the documentation here.
The first step is to create a `requirements.txt` file with all of our dependencies and add it to the directory we've been using for our service (`similarity_model`).
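A sketch of that file (pin versions as needed):

```
mlserver
spacy
wikipedia-api
```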
The next step is to build a docker image with our model, its dependencies and our server. If you've never heard of docker images before, here's a short description.
A Docker image is a lightweight, standalone, and executable package that includes everything needed to run a piece of software, including code, libraries, dependencies, and settings. It's like a carry-on bag for your application, containing everything it needs to travel safely and run smoothly in different environments. Just as a carry-on bag allows you to bring your essentials with you on a trip, a Docker image enables you to transport your application and its requirements across various computing environments, ensuring consistent and reliable deployment.
MLServer has a convenient function that lets us create docker images with our services. Let's use it.
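For example (the image tag below is a placeholder):

```bash
mlserver build similarity_model/ -t doc-sim-server:0.1.0
```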
We can check that our image was successfully built not only by looking at the logs of the previous command but also with the `docker images` command.
Let's test that our image works as intended with the following command. Make sure you have closed your previous server by using `CTRL + C` in your terminal.
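For instance, assuming the placeholder tag used above:

```bash
docker run -it --rm -p 8080:8080 doc-sim-server:0.1.0
```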
Now that you have a packaged and fully-functioning microservice with our model, we could deploy our container to a production serving platform like Seldon Core, or via different offerings available through the many cloud providers out there (e.g. AWS Lambda, Google Cloud Run, etc.). You could also run this image on KServe, a Kubernetes native tool for model serving, or anywhere else where you can bring your docker image with you.
To learn more about MLServer and the different ways in which you can use it, head over to the examples section or the user guide. To learn about some of the deployment options available, head over to the docs here.
To keep up to date with what we are up to at Seldon, make sure you join our Slack community.
MLServer includes support to batch requests together transparently on-the-fly. We refer to this as "adaptive batching", although it can also be known as "predictive batching".
There are usually two main reasons to adopt adaptive batching:
Maximise resource usage. Usually, inference operations are “vectorised” (i.e. are designed to operate across batches). For example, a GPU is designed to operate on multiple data points at the same time. Therefore, to make sure that it’s used at maximum capacity, we need to run inference across batches.
Minimise any inference overhead. Usually, all models will have to “pay” a constant overhead when running any type of inference. This can be something like IO to communicate with the GPU or some kind of processing in the incoming data. Up to a certain size, this overhead tends to not scale linearly with the number of data points. Therefore, it’s in our interest to send as large batches as we can without deteriorating performance.
However, these benefits will usually scale only up to a certain point, which is usually determined by either the infrastructure, the machine learning framework used to train your model, or a combination of both. Therefore, to maximise the performance improvements brought in by adaptive batching, it will be important to configure it with the appropriate values for your model. Since these values are usually found through experimentation, MLServer won't enable adaptive batching by default on newly loaded models.
MLServer lets you configure adaptive batching independently for each model, through two main parameters:

- Maximum batch size, that is, how many requests you want to group together.
- Maximum batch time, that is, how much time we should wait for new requests until we reach our maximum batch size.
`max_batch_size`

The `max_batch_size` field of the `model-settings.json` file (or alternatively, the `MLSERVER_MODEL_MAX_BATCH_SIZE` global environment variable) controls the maximum number of requests that should be grouped together on each batch. The expected values are:

- `N`, where `N > 1`, will create batches of up to `N` elements.
- `0` or `1`, will disable adaptive batching.
`max_batch_time`

The `max_batch_time` field of the `model-settings.json` file (or alternatively, the `MLSERVER_MODEL_MAX_BATCH_TIME` global environment variable) controls the time that MLServer should wait for new requests to come in until we reach our maximum batch size.

The expected format is in seconds, but it will take fractional values. That is, 500ms could be expressed as `0.5`.

The expected values are:

- `T`, where `T > 0`, will wait `T` seconds at most.
- `0`, will disable adaptive batching.
MLServer allows adding custom parameters to the `parameters` field of the requests. When several requests are grouped together into a batch, these parameters are received inside the server as a merged list of the individual requests' parameters. In the same way, if the server sends back a batched response, it will be split up again so that each response is returned to its client with its own parameters.
Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice.
Out of the box, MLServer comes with a set of pre-packaged runtimes which let you interact with a subset of common ML frameworks. This allows you to start serving models saved in these frameworks straight away. To avoid bringing in dependencies for frameworks that you don't need to use, these runtimes are implemented as independent (and optional) Python packages. This mechanism also allows you to roll out your own custom runtimes very easily.
To pick which runtime you want to use for your model, you just need to make sure that the right package is installed, and then point to the correct runtime class in your `model-settings.json` file.
MLServer is used as the core Python inference server in Seldon Core. Therefore, it should be straightforward to deploy your models either by using one of the pre-packaged model servers or by pointing to a custom MLServer image.

This section assumes a basic knowledge of Seldon Core and Kubernetes, as well as access to a working Kubernetes cluster with Seldon Core installed. To learn more about Seldon Core or how to install it, please visit the Seldon Core documentation.
Out of the box, Seldon Core comes with a few MLServer runtimes pre-configured to run straight away. This allows you to deploy an MLServer instance by just pointing to where your model artifact is and specifying what ML framework was used to train it.
To let Seldon Core know what framework was used to train your model, you can use the `implementation` field of your `SeldonDeployment` manifest. For example, to deploy a Scikit-Learn artifact stored remotely in GCS, one could do:
As you can see highlighted above, all that we need to specify is that:
Once you have your `SeldonDeployment` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl` (i.e. `kubectl apply -f <manifest-file>`).
To consult the supported values of the `implementation` field where MLServer is used, you can check the support table below.
As mentioned above, pre-packaged servers come built into Seldon Core. Therefore, only a pre-determined subset of them will be supported for a given release of Seldon Core.
The table below shows a list of the currently supported values of the `implementation` field. Each row also shows what ML framework it corresponds to and what MLServer runtime will be enabled internally on your model deployment when used.
The `componentSpecs` field of the `SeldonDeployment` manifest will allow us to let Seldon Core know what image should be used to serve a custom model. For example, if we assume that our custom image has been tagged as `my-custom-server:0.1.0`, we could write our `SeldonDeployment` manifest as follows:
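A rough sketch of such a manifest:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-custom-model
spec:
  protocol: v2
  predictors:
    - name: default
      graph:
        name: classifier
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                image: my-custom-server:0.1.0
```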
As we can see highlighted on the snippet above, all that's needed to deploy a custom MLServer image is:

- Pointing our model container to use our custom MLServer image, by specifying it on the `image` field of the `componentSpecs` section of the manifest.
Once you have your `SeldonDeployment` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl` (i.e. `kubectl apply -f <manifest-file>`).
This package provides an MLServer runtime compatible with XGBoost.

You can install the runtime, alongside `mlserver`, as:
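```bash
pip install mlserver mlserver-xgboost
```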
For further information on how to use MLServer with XGBoost, you can check out this worked-out example.
The XGBoost inference runtime will expect that your model is serialised via one of the following methods:
Extension | Docs | Example |
---|---|---|
The XGBoost inference runtime exposes a number of outputs depending on the model type. These outputs match the `predict` and `predict_proba` methods of the XGBoost model.
By default, the runtime will only return the output of `predict`. However, you are able to control which outputs you want back through the `outputs` field of your {class}`InferenceRequest <mlserver.types.InferenceRequest>` payload.
For example, to only return the model's `predict_proba` output, you could define a payload such as:
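A sketch of such a payload; the input tensor is a placeholder:

```json
{
  "inputs": [
    {
      "name": "my-input",
      "datatype": "FP32",
      "shape": [1, 3],
      "data": [0.1, 0.2, 0.3]
    }
  ],
  "outputs": [
    {"name": "predict_proba"}
  ]
}
```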
This package provides an MLServer runtime compatible with LightGBM.

You can install the runtime, alongside `mlserver`, as:
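```bash
pip install mlserver mlserver-lightgbm
```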
For further information on how to use MLServer with LightGBM, you can check out this worked-out example.

If no content type is present on the request or metadata, the LightGBM runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
This package provides an MLServer runtime compatible with Scikit-Learn.

You can install the runtime, alongside `mlserver`, as:
For further information on how to use MLServer with Scikit-Learn, you can check out this worked-out example.

If no content type is present on the request or metadata, the Scikit-Learn runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
The Scikit-Learn inference runtime exposes a number of outputs depending on the model type. These outputs match the `predict`, `predict_proba` and `transform` methods of the Scikit-Learn model.
Output | Returned By Default | Availability |
---|---|---|
By default, the runtime will only return the output of `predict`. However, you are able to control which outputs you want back through the `outputs` field of your {class}`InferenceRequest <mlserver.types.InferenceRequest>` payload.
For example, to only return the model's `predict_proba` output, you could define a payload analogous to the XGBoost one shown above.
This package provides an MLServer runtime compatible with Alibi-Detect models.

You can install the `mlserver-alibi-detect` runtime, alongside `mlserver`, as:
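```bash
pip install mlserver mlserver-alibi-detect
```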
For further information on how to use MLServer with Alibi-Detect, you can check out this worked-out example.

If no content type is present on the request or metadata, the Alibi-Detect runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
The Alibi Detect runtime exposes a couple of setting flags which can be used to customise how the runtime behaves. These settings can be added under the `parameters.extra` section of your `model-settings.json` file, e.g.
You can find the full reference of the accepted extra settings for the Alibi Detect runtime below:
Machine learning models generally expect their inputs to be passed down as a particular Python type. Most commonly, this type ranges from "general purpose" NumPy arrays or Pandas DataFrames to more granular definitions, like `datetime` objects, `Pillow` images, etc. Unfortunately, the definition of the V2 Inference Protocol doesn't cover any of these specific use cases. This protocol can be thought of as a wider "lower level" spec, which only defines what fields a payload should have.
To account for this gap, MLServer introduces support for content types, which offer a way to let MLServer know how it should "decode" V2-compatible payloads. When shaped in the right way, these payloads should "encode" all the information required to extract the higher level Python type that will be required for a model.
To illustrate the above, we can think of a Scikit-Learn pipeline, which takes in a Pandas DataFrame and returns a NumPy Array. Without the use of content types, the V2 payload itself would probably lack information about how this payload should be treated by MLServer. Likewise, the Scikit-Learn pipeline wouldn't know how to treat a raw V2 payload. In this scenario, the use of content types allows us to specify what the actual "higher level" information encoded within the V2 protocol payloads is.
To let MLServer know that a particular payload must be decoded / encoded as a different Python data type (e.g. NumPy Array, Pandas DataFrame, etc.), you can specify it through the `content_type` field of the `parameters` section of your request.
As an example, we can consider the following dataframe, containing two columns: Age and First Name.
This table could be specified in the V2 protocol as the following payload, where we declare that:

- The whole set of inputs should be decoded as a Pandas DataFrame (i.e. setting the content type as `pd`).
- The First Name column should be decoded as a UTF-8 string (i.e. setting the content type as `str`).
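A sketch of that payload, using the example rows shown elsewhere on this page (Joanne / 34 and Michael / 22):

```json
{
  "parameters": {"content_type": "pd"},
  "inputs": [
    {
      "name": "First Name",
      "datatype": "BYTES",
      "parameters": {"content_type": "str"},
      "shape": [2],
      "data": ["Joanne", "Michael"]
    },
    {
      "name": "Age",
      "datatype": "INT32",
      "shape": [2],
      "data": [34, 22]
    }
  ]
}
```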
It's important to keep in mind that content types can be specified at both the request level and the input level. The former will apply to the entire set of inputs, whereas the latter will only apply to a particular input of the payload.
Under the hood, the conversion between content types is implemented using codecs. In the MLServer architecture, codecs are an abstraction which know how to encode and decode high-level Python types to and from the V2 Inference Protocol.
Depending on the high-level Python type, encoding / decoding operations may require access to multiple input or output heads. For example, a Pandas Dataframe would need to aggregate all of the input-/output-heads present in a V2 Inference Protocol response.
However, a Numpy array or a list of strings, could be encoded directly as an input head within a larger request.
To account for this, codecs can work at either the request- / response-level (known as request codecs), or the input- / output-level (known as input codecs). Each of these codecs exposes the following public interface, where `Any` represents a high-level Python datatype (e.g. a Pandas DataFrame, a Numpy Array, etc.):
Request Codecs
encode_request() <mlserver.codecs.RequestCodec.encode_request>
decode_request() <mlserver.codecs.RequestCodec.decode_request>
encode_response() <mlserver.codecs.RequestCodec.encode_response>
decode_response() <mlserver.codecs.RequestCodec.decode_response>
Input Codecs
encode_input() <mlserver.codecs.InputCodec.encode_input>
decode_input() <mlserver.codecs.InputCodec.decode_input>
encode_output() <mlserver.codecs.InputCodec.encode_output>
decode_output() <mlserver.codecs.InputCodec.decode_output>
Note that, these methods can also be used as helpers to encode requests and decode responses on the client side. This can help to abstract away from the user most of the details about the underlying structure of V2-compatible payloads.
For example, in the example above, we could use codecs to encode the DataFrame into a V2-compatible request simply as:
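For instance, a minimal sketch using the Pandas request codec:

```python
import pandas as pd

from mlserver.codecs import PandasCodec

dataframe = pd.DataFrame({"First Name": ["Joanne", "Michael"], "Age": [34, 22]})

# Encode the DataFrame into a V2-compatible InferenceRequest
inference_request = PandasCodec.encode_request(dataframe)
print(inference_request)
```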
When using MLServer's request codecs, the output of encoding payloads will always be one of the classes within the `mlserver.types` package (i.e. `InferenceRequest <mlserver.types.InferenceRequest>` or `InferenceResponse <mlserver.types.InferenceResponse>`). Therefore, if you want to use them with `requests` (or another package outside of MLServer) you will need to convert them to a Python dict or a JSON string.
For example, if we want to send an inference request to model `foo`, we could do something along the following lines:
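A sketch of those lines, assuming a local MLServer instance on the default HTTP port:

```python
import pandas as pd
import requests

from mlserver.codecs import PandasCodec

dataframe = pd.DataFrame({"First Name": ["Joanne", "Michael"], "Age": [34, 22]})
inference_request = PandasCodec.encode_request(dataframe)

# Serialise the Pydantic model to JSON before sending it over HTTP
response = requests.post(
    "http://localhost:8080/v2/models/foo/infer",
    data=inference_request.model_dump_json(),
    headers={"Content-Type": "application/json"},
)
print(response.json())
```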
The NaN (Not a Number) value is used in Numpy and other scientific libraries to describe an invalid or missing value (e.g. a division by zero). In some scenarios, it may be desirable to let your models receive and / or output NaN values (e.g. these can be useful sometimes with GBTs, like XGBoost models). This is why MLServer supports encoding NaN values on your request / response payloads under some conditions.
In order to send / receive NaN values, you must ensure that:
- You are using the `REST` interface.
- The input / output entry containing NaN values uses either the `FP16`, `FP32` or `FP64` datatypes.

Assuming those conditions are satisfied, any `null` value within your tensor payload will be converted to NaN.
For example, if you take the following Numpy array:
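```python
import numpy as np

# An example array containing a NaN entry
foo = np.array([[1.0, 2.0], [3.0, np.nan]])
```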
We could encode it as:
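The exact values are illustrative; note the `null` entry standing in for NaN:

```json
{
  "inputs": [
    {
      "name": "foo",
      "datatype": "FP64",
      "shape": [2, 2],
      "data": [1.0, 2.0, 3.0, null]
    }
  ]
}
```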
It's important to keep in mind that content types passed explicitly as part of the request will always take precedence over the model's metadata. Therefore, we can leverage this to override the model's metadata when needed.
Out of the box, MLServer supports the following list of content types. However, this can be extended through the use of 3rd-party or custom runtimes.
The `np` content type will decode / encode V2 payloads to a NumPy Array, taking into account the following:

- The `shape` field will be used to reshape the flattened array expected by the V2 protocol into the expected tensor shape.

By default, MLServer will always assume that an array with a single-dimensional shape, e.g. `[N]`, is equivalent to `[N, 1]`. That is, each entry will be treated like a single one-dimensional data point (i.e. instead of a `[1, D]` array, where the full array is a single `D`-dimensional data point). To avoid any ambiguity, where possible, the Numpy codec will always explicitly encode `[N]` arrays as `[N, 1]`.
For example, if we think of the following NumPy Array:
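```python
import numpy as np

# An example 2x2 array
foo = np.array([[1, 2], [3, 4]])
```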
We could encode it as the input `foo` in a V2 protocol request as:
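```json
{
  "inputs": [
    {
      "name": "foo",
      "parameters": {"content_type": "np"},
      "datatype": "INT64",
      "shape": [2, 2],
      "data": [1, 2, 3, 4]
    }
  ]
}
```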
When using the NumPy Array content type at the request level, it will decode the entire request by considering only the first `input` element. This can be used as a helper for models which only expect a single tensor.
The `pd` content type can be stacked with other content types. This allows the user to use a different set of content types to decode each of the columns.
The `pd` content type will decode / encode a V2 request into a Pandas DataFrame. For this, it will expect that the DataFrame is shaped in a columnar way. That is:

- Each entry of the `inputs` list (or `outputs`, in the case of responses) will represent a column of the DataFrame.
- Each of these entries will contain all the row elements for that particular column.
- The `shape` field of each `input` (or `output`) entry will contain (at least) the amount of rows included in the dataframe.
For example, if we consider the following dataframe:
We could encode it to the V2 Inference Protocol as:
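Using the example `A` / `B` / `C` dataframe shown further below (rows `a1`–`c4`):

```json
{
  "parameters": {"content_type": "pd"},
  "inputs": [
    {
      "name": "A",
      "datatype": "BYTES",
      "shape": [4],
      "data": ["a1", "a2", "a3", "a4"]
    },
    {
      "name": "B",
      "datatype": "BYTES",
      "shape": [4],
      "data": ["b1", "b2", "b3", "b4"]
    },
    {
      "name": "C",
      "datatype": "BYTES",
      "shape": [4],
      "data": ["c1", "c2", "c3", "c4"]
    }
  ]
}
```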
The `str` content type lets you encode / decode a V2 input into a UTF-8 Python string, taking into account the following:

- The expected `datatype` is `BYTES`.
- The `shape` field represents the number of "strings" that are encoded in the payload (e.g. the `["hello world", "one more time"]` payload will have a shape of 2 elements).
For example, we could encode the list of strings `["hello world", "one more time"]` to the V2 Inference Protocol as:
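```json
{
  "parameters": {"content_type": "str"},
  "inputs": [
    {
      "name": "foo",
      "datatype": "BYTES",
      "shape": [2],
      "data": ["hello world", "one more time"]
    }
  ]
}
```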
When using the `str` content type at the request level, it will decode the entire request by considering only the first `input` element. This can be used as a helper for models which only expect a single string or a set of strings.
The `base64` content type will decode a binary V2 payload into a Base64-encoded string (and vice versa), taking into account the following:

- The expected `datatype` is `BYTES`.
- The `data` field should contain the base64-encoded binary strings.
- The `shape` field represents the number of binary strings that are encoded in the payload.
For example, if we think of the following "bytes array":
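```python
# An example "bytes array"
foo = b"Python is fun"
```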
We could encode it as the input `foo` of a V2 request as:
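Here the `data` entry is the Base64 encoding of the bytes above:

```json
{
  "inputs": [
    {
      "name": "foo",
      "parameters": {"content_type": "base64"},
      "datatype": "BYTES",
      "shape": [1],
      "data": ["UHl0aG9uIGlzIGZ1bg=="]
    }
  ]
}
```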
- The expected `datatype` is `BYTES`.
- The `shape` field represents the number of datetimes that are encoded in the payload.
For example, if we think of the following `datetime` object:
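```python
import datetime

# An example datetime object
foo = datetime.datetime(2022, 8, 2, 9, 30)
```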
We could encode it as the input `foo` of a V2 request as:
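```json
{
  "inputs": [
    {
      "name": "foo",
      "parameters": {"content_type": "datetime"},
      "datatype": "BYTES",
      "shape": [1],
      "data": ["2022-08-02T09:30:00"]
    }
  ]
}
```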
This package provides an MLServer runtime compatible with CatBoost's `CatboostClassifier`.

You can install the runtime, alongside `mlserver`, as:
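```bash
pip install mlserver mlserver-catboost
```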
For further information on how to use MLServer with CatBoost, you can check out this worked-out example.

If no content type is present on the request or metadata, the CatBoost runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
This package provides an MLServer runtime compatible with HuggingFace Transformers.

You can install the runtime, alongside `mlserver`, as:
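```bash
pip install mlserver mlserver-huggingface
```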
For further information on how to use MLServer with HuggingFace, you can check out this worked-out example.

The HuggingFace runtime will always decode the input request using its own built-in codec. Therefore, content type annotations at the request level will be ignored. Note that this doesn't include input-level annotations, which will be respected as usual.
The HuggingFace runtime exposes a couple of extra parameters which can be used to customise how the runtime behaves. These settings can be added under the `parameters.extra` section of your `model-settings.json` file, e.g.
It is possible to load a local model into a HuggingFace pipeline by specifying the model artefact folder path in `parameters.uri` in `model-settings.json`.

Models in the HuggingFace hub can be loaded by specifying their name in `parameters.extra.pretrained_model` in `model-settings.json`.
You can find the full reference of the accepted extra settings for the HuggingFace runtime below:
The `MLModel` class is the base class for all inference runtimes. It exposes the main interface that MLServer will use to interact with ML models.
The bulk of its public interface are the {func}`load() <mlserver.MLModel.load>`, {func}`unload() <mlserver.MLModel.unload>` and {func}`predict() <mlserver.MLModel.predict>` methods. However, it also contains helpers with encoding / decoding of requests and responses, as well as properties to access the most common bits of the model's metadata.
When writing custom runtimes, this class should be extended to implement your own load and predict logic.

There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.

To learn more about how you can write custom runtimes with MLServer, check out the custom runtimes user guide. Alternatively, you can also see this end-to-end example which walks through the process of writing a custom runtime.
To see MLServer in action you can check out the examples below. These are end-to-end notebooks, showing how to serve models with MLServer.
If you are interested in how MLServer interacts with particular model frameworks, you can check the following examples. These focus on showcasing the different inference runtimes that ship with MLServer out of the box. Note that, for advanced use cases, you can also write your own custom inference runtime (see the custom runtimes section).
To see some of the advanced features included in MLServer (e.g. multi-model serving), check out the examples below.
Tutorials are designed to be beginner-friendly and walk through accomplishing a series of tasks using MLServer (and other tools).
MLServer can be configured through a `settings.json` file on the root folder from where MLServer is started. Note that these are server-wide settings (e.g. gRPC or HTTP port) which are separate from the per-model settings. Alternatively, this configuration can also be passed through environment variables prefixed with `MLSERVER_` (e.g. `MLSERVER_GRPC_PORT`).
Framework | MLServer Runtime | KServe Serving Runtime | Documentation |
---|---|---|---|
Framework | Supported | Documentation |
---|---|---|
Python Version | Status |
---|---|
Framework | Package Name | Implementation Class | Example | Documentation |
---|---|---|---|---|
- Our inference deployment should use the V2 inference protocol, which is done by setting the `protocol` field to `kfserving`.
- Our model artifact is a serialised Scikit-Learn model; therefore, it should be served using the pre-packaged SKLearn server, which is done by setting the `implementation` field to `SKLEARN_SERVER`.

Note that, while the `protocol` should always be set to `kfserving` (i.e. so that models are served using the V2 inference protocol), the value of the `implementation` field will be dependent on your ML framework. The valid values of the `implementation` field are pre-determined by Seldon Core. However, it should also be possible to configure and add new ones (e.g. to support a custom inference runtime).
Framework | MLServer Runtime | Seldon Core Pre-packaged Server | Documentation |
---|---|---|---|
Note that, on top of the ones shown above (backed by MLServer), Seldon Core also provides a wider set of pre-packaged servers. To check the full list, please visit the Seldon Core documentation.

There could be cases where the pre-packaged MLServer runtimes supported out-of-the-box in Seldon Core may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, Seldon Core makes it just as easy to deploy them into your serving infrastructure.
- Letting Seldon Core know that the model deployment will be served through the V2 inference protocol, by setting the `protocol` field to `v2`.
If no content type is present on the request or metadata, the XGBoost runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.

Output | Returned By Default | Availability |
---|---|---|
Some inference runtimes may apply a content type by default if none is present. To learn more about each runtime's defaults, please check the docs of the relevant inference runtime.
First Name | Age |
---|---|
To learn more about the available content types and how to use them, you can see all the available ones in the section below.
For a full end-to-end example on how content types and codecs work under the hood, feel free to check out this worked-out example.
Luckily, these classes leverage Pydantic under the hood. Therefore you can just call the `.model_dump()` or `.model_dump_json()` method to convert them. Likewise, to read them back from JSON, we can always pass the JSON fields as kwargs to the class' constructor (or use any of the helpers available within Pydantic).
You are either using the NumPy codec or the Pandas codec.
Content types can also be defined as part of the model's metadata. This lets the user pre-configure what content types a model should use by default to decode / encode its requests / responses, without the need to specify it on each request.

For example, to configure the content type values of the dataframe example above, one could create a `model-settings.json` file like the one below:
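A sketch of such a file for the First Name / Age dataframe (field values are illustrative):

```json
{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "First Name",
      "datatype": "BYTES",
      "shape": [-1],
      "parameters": {"content_type": "str"}
    },
    {
      "name": "Age",
      "datatype": "INT32",
      "shape": [-1]
    }
  ]
}
```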
Python Type | Content Type | Request Level | Request Codec | Input Level | Input Codec |
---|---|---|---|---|---|
MLServer allows you to extend the supported content types by adding custom ones. To learn more about how to write your own custom content types, you can check this example. You can also learn more about building custom extensions for MLServer on the custom runtimes section of the docs.
The V2 Inference Protocol expects that the `data` of each input is sent as a flat array. Therefore, the `np` content type will expect that tensors are sent flattened. The information in the `shape` field will then be used to reshape the vector into the right dimensions.

The `datatype` field will be matched to the closest NumPy dtype.
A | B | C |
---|---|---|
The `datetime` content type will decode a V2 input into a Python `datetime.datetime` object, taking into account the following:

The `data` field should contain the dates serialised following the ISO 8601 standard.
Scikit-Learn | ✅ |
XGBoost | ✅ |
Spark MLlib | ✅ |
LightGBM | ✅ |
CatBoost | ✅ |
Tempo | ✅ |
MLflow | ✅ |
Alibi-Detect | ✅ |
Alibi-Explain | ✅ |
HuggingFace | ✅ |
3.7 | 🔴 |
3.8 | 🔴 |
3.9 | 🟢 |
3.10 | 🟢 |
3.11 | 🔵 |
3.12 | 🔵 |
Scikit-Learn | sklearn |
XGBoost | xgboost |
| predict | ✅ | Available on all XGBoost models. |
| predict_proba | ❌ | Only available on non-regressor models (i.e. `XGBClassifier` models). |
Joanne | 34 |
Michael | 22 |
a1 | b1 | c1 |
a2 | b2 | c2 |
a3 | b3 | c3 |
a4 | b4 | c4 |
Out of the box, `mlserver` supports the deployment and serving of `scikit-learn` models. By default, it will assume that these models have been serialised using `joblib`.

In this example, we will cover how we can train and serialise a simple model, to then serve it using `mlserver`.
The first step will be to train a simple `scikit-learn` model. For that, we will use the MNIST example from the `scikit-learn` documentation, which trains an SVM model.

To save our trained model, we will serialise it using `joblib`. While this is not a perfect approach, it's currently the recommended method to persist models to disk in the `scikit-learn` documentation.

Our model will be persisted as a file named `mnist-svm.joblib`.
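A condensed sketch of both the training and serialisation steps described above, based on scikit-learn's digits example:

```python
import joblib
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Load the digits dataset and flatten the images into feature vectors
digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))

X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.5, shuffle=False
)

# Train a simple SVM classifier
classifier = svm.SVC(gamma=0.001)
classifier.fit(X_train, y_train)

# Persist the trained model to disk with joblib
joblib.dump(classifier, "mnist-svm.joblib")
```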
Now that we have trained and saved our model, the next step will be to serve it using mlserver
. For that, we will need to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
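A minimal version of this file could simply enable debug logging:

```json
{
  "debug": true
}
```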
model-settings.json
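A sketch of this file, assuming the artefact saved above (the version label is optional):

```json
{
  "name": "mnist-svm",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parameters": {
    "uri": "./mnist-svm.joblib",
    "version": "v0.1.0"
  }
}
```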
Now that we have our config in place, we can start the server by running `mlserver start .`. This needs to either be run from the same directory where our config files are or point to the folder where they are.

Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.
We now have our model being served by `mlserver`. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that `mlserver` provides out of the box, or we can build our request manually.
As we can see above, the model predicted the input as the number 8
, which matches what's on the test set.
Out of the box, `mlserver` supports the deployment and serving of `lightgbm` models. By default, it will assume that these models have been serialised using the `bst.save_model()` method.

In this example, we will cover how we can train and serialise a simple model, to then serve it using `mlserver`.
To test the LightGBM Server, first we need to generate a simple LightGBM model using Python.
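A sketch of such a training script, using the Iris dataset referenced by the artifact name below:

```python
import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset and split it into train / test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

# Train a simple multiclass LightGBM model
dtrain = lgb.Dataset(X_train, label=y_train)
params = {"objective": "multiclass", "num_class": 3, "verbose": -1}
bst = lgb.train(params, dtrain)

# Persist the booster to disk
bst.save_model("iris-lightgbm.bst")
```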
Our model will be persisted as a file named `iris-lightgbm.bst`.
Now that we have trained and saved our model, the next step will be to serve it using mlserver
. For that, we will need to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running `mlserver start .`. This needs to either be run from the same directory where our config files are or point to the folder where they are.

Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.
We now have our model being served by `mlserver`. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that `mlserver` provides out of the box, or we can build our request manually.
As we can see above, the model predicted the probability for each class, and the probability of class `1` is the biggest, close to `0.99`, which matches what's on the test set.
Out of the box, mlserver
supports the deployment and serving of alibi_detect models. Alibi Detect is an open source Python library focused on outlier, adversarial and drift detection. In this example, we will cover how we can create a detector configuration to then serve it using mlserver
.
The first step will be to fetch a reference dataset and other relevant metadata for an `alibi-detect` model. For that, we will use the alibi library to get the adult dataset with demographic features from a 1996 US census.
This example is based on the Categorical and mixed type data drift detection on income prediction tabular data from the alibi-detect documentation.
Now that we have the reference data and other configuration parameters, the next step will be to serve it using mlserver
. For that, we will need to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running the `mlserver start .` command. This needs to either be run from the same directory where our config files are or point to the folder where they are.

Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.
We now have our alibi-detect model being served by `mlserver`. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that `mlserver` provides out of the box, or we can build our request manually.
The `mlserver` package comes with inference runtime implementations for `scikit-learn` and `xgboost` models. However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.
In this example, we will train a `numpyro` model. The `numpyro` library streamlines the implementation of probabilistic models, abstracting away advanced inference and training algorithms.

Out of the box, `mlserver` doesn't provide an inference runtime for `numpyro`. However, through this example we will see how easy it is to develop our own.
The first step will be to train our model. This will be a very simple bayesian regression model, based on an example provided in the numpyro
docs.
Since this is a probabilistic model, during training we will compute an approximation to the posterior distribution of our model using MCMC.
Now that we have trained our model, the next step will be to save it so that it can be loaded afterwards at serving-time. Note that, since this is a probabilistic model, we will only need to save the traces that approximate the posterior distribution over latent parameters. These will get saved in a `numpyro-divorce.json` file.
The next step will be to serve our model using `mlserver`. For that, we will first implement an extension which serves as the runtime to perform inference using our custom `numpyro` model.

Our custom inference wrapper should be responsible for:

- Loading the model from the set of samples we saved previously.
- Running inference using our model structure, and the posterior approximated from the samples.
The next step will be to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running `mlserver start .`. This needs to either be run from the same directory where our config files are or point to the folder where they are.

Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.
We now have our model being served by `mlserver`. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that `mlserver` provides out of the box, or we can build our request manually.
Now that we have written and tested our custom model, the next step is to deploy it. With that goal in mind, the rough outline of steps will be to first build a custom image containing our code, and then deploy it.
MLServer will automatically find your `requirements.txt` file and install the necessary Python packages.
MLServer offers helpers to build a custom Docker image containing your code. In this example, we will use the mlserver build
subcommand to create an image, which we'll be able to deploy later.
Note that this section expects that Docker is available and running in the background, as well as a functional cluster with Seldon Core installed and some familiarity with kubectl
.
To ensure that the image is fully functional, we can spin up a container and then send a test request. To start the container, you can run something along the following lines in a separate terminal:
As we should be able to see, the server running within our Docker image responds as expected.
Now that we've built a custom image and verified that it works as expected, we can move to the next step and deploy it. There is a large number of tools out there to deploy images. However, for our example, we will focus on deploying it to a cluster running Seldon Core.
For that, we will need to create a SeldonDeployment
resource which instructs Seldon Core to deploy a model embedded within our custom image and compliant with the V2 Inference Protocol. This can be achieved by applying (i.e. kubectl apply
) a SeldonDeployment
manifest to the cluster, similar to the one below:
The MLServer package exposes a set of methods that let you register and track custom metrics. This can be used within your own custom inference runtimes. To learn more about how to expose custom metrics, check out the metrics usage guide.
Out of the box, MLServer supports the deployment and serving of HuggingFace Transformer models with the following features:
Loading of Transformer Model artifacts from the Hugging Face Hub.
Model quantization & optimization using the Hugging Face Optimum library
Request batching for GPU optimization (via adaptive batching and request batching)
In this example, we will showcase some of these features using an example model.
Since we're using a pretrained model, we can skip straight to serving.
model-settings.json
Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
We can also leverage the Optimum library that allows us to access quantized and optimized models.
We can download pretrained optimized models from the hub if available by enabling the optimum_model
flag:
Once again, we can run the model using the MLServer CLI. As before, this needs to either be run from the same directory where our config files are, or point to the folder where they are.
The request can now be sent using the same request structure but using optimized models for better performance.
Beyond text generation, many other Transformer tasks are supported; below are examples for a few of them.
Once again, you are able to run the model using the MLServer CLI.
Once again, you are able to run the model using the MLServer CLI.
We can also evaluate GPU acceleration by comparing inference speed on CPU vs GPU using the following parameters.
We first measure the time taken with device=-1, which configures the model to run on CPU by default.
Once again, you are able to run the model using the MLServer CLI.
We can see that it takes 81 seconds, which is 8 times longer than the GPU example below.
IMPORTANT: Running the code below requires a machine with a GPU correctly configured for TensorFlow/PyTorch.
Now we'll run the benchmark with the GPU enabled, which we can do by setting device=0.
We can see that the elapsed time is 8 times less than the CPU version!
We can also see how the adaptive batching capabilities allow for GPU acceleration by grouping multiple incoming requests so they get processed as a single GPU batch.
In our case, we can enable adaptive batching with the max_batch_size parameter, which we will set to 128.
We will also configure max_batch_time, which specifies the maximum amount of time the MLServer orchestrator will wait before sending the accumulated batch for inference.
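As a hedged sketch, a model-settings.json enabling adaptive batching could look roughly like the one generated below; the model name, task and device values are assumptions for illustration:

```python
import json

model_settings = {
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    # Group up to 128 incoming requests into a single batch...
    "max_batch_size": 128,
    # ...waiting at most 100ms before dispatching whatever has accumulated
    "max_batch_time": 0.1,
    "parameters": {"extra": {"task": "text-generation", "device": 0}},
}

with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)
```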
In order to achieve the throughput required of 50 requests per second, we will use the tool vegeta
which performs load testing.
We can now see that the requests are batched and that we receive a 100% success rate, even though the requests are sent one by one.
Out of the box, mlserver
supports the deployment and serving of xgboost
models. By default, it will assume that these models have been serialised using the bst.save_model()
method.
In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver
.
The first step will be to train a simple xgboost
model. For that, we will use the mushrooms example from the xgboost
Getting Started guide.
To save our trained model, we will serialise it using bst.save_model() and the JSON format. This is the approach recommended by the XGBoost project.
Our model will be persisted as a file named mushroom-xgboost.json.
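A minimal sketch of this training and serialisation step is shown below; the file path and hyperparameters are illustrative (the original example uses the mushroom / agaricus dataset shipped with the XGBoost repository):

```python
import xgboost as xgb

# Paths and hyperparameters are illustrative
dtrain = xgb.DMatrix("agaricus.txt.train?format=libsvm")
params = {"max_depth": 2, "eta": 1, "objective": "binary:logistic"}
bst = xgb.train(params, dtrain, num_boost_round=2)

# Persist the booster using the JSON format
bst.save_model("mushroom-xgboost.json")
```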
Now that we have trained and saved our model, the next step will be to serve it using mlserver
. For that, we will need to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
As we can see above, the model predicted the input as close to 0
, which matches what's on the test set.
Codecs are used to encapsulate the logic required to encode / decode payloads following the Open Inference Protocol into high-level Python types. You can read more about the high-level concepts behind codecs, as well as how to use them, in the relevant section of the docs.
All the codecs within MLServer extend from either the {class}InputCodec <mlserver.codecs.base.InputCodec> or the {class}RequestCodec <mlserver.codecs.base.RequestCodec> base classes. These define the interfaces to deal with inputs (and outputs) and requests (and responses), respectively.
The mlserver package includes a set of built-in codecs to cover common conversions. You can learn more about these in the relevant section of the docs.
In MLServer, each loaded model can be configured separately. This configuration will include model information (e.g. metadata about the accepted inputs), but also model-specific settings (e.g. number of parallel workers to run inference).
This configuration will usually be provided through a model-settings.json
file which sits next to the model artifacts. However, it's also possible to provide this through environment variables prefixed with MLSERVER_MODEL_ (e.g. MLSERVER_MODEL_IMPLEMENTATION). Note that, in the latter case, these environment variables will be shared across all loaded models (unless they get overridden by a model-settings.json file). Additionally, if no model-settings.json file is found, MLServer will also try to load a "default" model from these environment variables.
Out of the box, MLServer supports the deployment and serving of MLflow models with the following features:
Loading of MLflow Model artifacts.
Support of dataframes, dict-of-tensors and tensor inputs.
In this example, we will showcase some of these features using an example model.
The first step will be to train and serialise an MLflow model. For that, we will use the linear regression example from the MLflow docs.
The training script will also serialise our trained model, leveraging the MLflow Model format. By default, we should be able to find the saved artifact under the mlruns
folder.
Now that we have trained and serialised our model, we are ready to start serving it. For that, the initial step will be to set up a model-settings.json
that instructs MLServer to load our artifact using the MLflow Inference Runtime.
Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Note that the request specifies the value pd as its content type, whereas every input specifies the content type np. These parameters will instruct MLServer to:
Convert every input value to a NumPy array, using the data type and shape information provided.
Group all the different inputs into a Pandas DataFrame, using their names as the column names.
To learn more about how MLServer uses content type parameters, you can check this worked out example.
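As an illustration, a request shaped along those lines could be built and sent as follows; the column names and values shown are only a small, placeholder subset of the wine features the model actually expects:

```python
import requests

# Request-level content type "pd" + per-input content type "np"
# (column names and values are illustrative placeholders)
inference_request = {
    "parameters": {"content_type": "pd"},
    "inputs": [
        {
            "name": "fixed acidity",
            "datatype": "FP32",
            "shape": [1],
            "data": [7.4],
            "parameters": {"content_type": "np"},
        },
        {
            "name": "alcohol",
            "datatype": "FP32",
            "shape": [1],
            "data": [9.4],
            "parameters": {"content_type": "np"},
        },
    ],
}

endpoint = "http://localhost:8080/v2/models/wine-classifier/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json())
```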
As we can see in the output above, the predicted quality score for our input wine was 5.57
.
MLflow currently ships with a scoring server with its own protocol. In order to provide a drop-in replacement, the MLflow runtime in MLServer also exposes a custom endpoint which matches the signature of MLflow's /invocations endpoint.
As an example, we can try to send the same request that we sent previously, but using MLflow's protocol. Note that, in both cases, the request will be handled by the same MLServer instance.
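A sketch of such a call is shown below; it uses the dataframe_split payload format from MLflow's scoring protocol, and the columns listed are an illustrative subset (the real wine model expects the full set of feature columns):

```python
import requests

# Illustrative payload following MLflow's scoring protocol
payload = {
    "dataframe_split": {
        "columns": ["fixed acidity", "alcohol"],
        "data": [[7.4, 9.4]],
    }
}

response = requests.post(
    "http://localhost:8080/invocations",
    json=payload,
    headers={"Content-Type": "application/json"},
)
print(response.json())
```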
As we can see above, the predicted quality for our input is 5.57
, matching the prediction we obtained above.
MLflow lets users define a model signature, where they can specify what types of inputs the model accepts and what types of outputs it returns. Similarly, the V2 inference protocol employed by MLServer defines a metadata endpoint which can be used to query what inputs and outputs the model accepts. However, even though they serve similar purposes, the data schemas used by each of them are not compatible with each other.
To solve this, if your model defines an MLflow model signature, MLServer will convert this signature on the fly into a metadata schema compatible with the V2 Inference Protocol. This will also include specifying any extra content types required to correctly decode / encode your data.
As an example, we can first have a look at the model signature saved for our MLflow model. This can be seen directly on the MLModel
file saved by our model.
We can then query the metadata endpoint, to see the model metadata inferred by MLServer from our test model's signature. For this, we will use the /v2/models/wine-classifier/
endpoint.
As we should be able to see, the model metadata now matches the information contained in our model signature, including any extra content types necessary to decode our data correctly.
It's not unusual for model runtimes to require extra dependencies that are not direct dependencies of MLServer. This is the case when we want to use custom runtimes, but also when our model artifacts are the output of older versions of a toolkit (e.g. models trained with an older version of SKLearn).
In these cases, since these dependencies (or dependency versions) are not known in advance by MLServer, they won't be included in the default seldonio/mlserver
Docker image. To cover these cases, the seldonio/mlserver
Docker image allows you to load custom environments before starting the server itself.
This example will walk you through how to create and save a custom environment, so that it can be loaded in MLServer without any extra change to the seldonio/mlserver Docker image.
For this example, we will create a custom environment to serve a model trained with an older version of Scikit-Learn. The first step will be to define this environment, using an environment.yml file.
Note that these environments can also be created on the fly as we go, and then serialised later.
To illustrate the point, we will train a Scikit-Learn model using our older environment.
The first step will be to create and activate an environment which reflects what's outlined in our environment.yml
file.
NOTE: If you are running this from a Jupyter Notebook, you will need to restart your Jupyter instance so that it runs from this environment.
We can now train and save a Scikit-Learn model using the older version of our environment. This model will be serialised as model.joblib
.
You can find more details of this process in the Scikit-Learn example.
Lastly, we will need to serialise our environment in the format expected by MLServer. To do that, we will use a tool called conda-pack.
This tool will save a portable version of our environment as a .tar.gz file, also known as a tarball.
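conda-pack can be driven from the command line or from Python; as a sketch (the environment name and output file are assumptions matching this example):

```python
import conda_pack

# Pack the "old-sklearn" environment into a portable tarball
conda_pack.pack(
    name="old-sklearn",
    output="old-sklearn.tar.gz",
    force=True,  # overwrite the tarball if it already exists
)
```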
Now that we have defined our environment (and we've got a sample artifact trained in that environment), we can move to serving our model.
To do that, we will first need to select the right runtime through a model-settings.json
config file.
We can then spin up our model, using our custom environment, leveraging MLServer's Docker image. Keep in mind that you will need Docker installed on your machine to run this example.
Our Docker command will need to take into account the following points:
Mount the example's folder as a volume so that it can be accessed from within the container.
Let MLServer know that our custom environment's tarball can be found as old-sklearn.tar.gz
.
Expose port 8080
so that we can send requests from the outside.
From the command line, this can be done using Docker's CLI as:
Note that we need to keep the server running in the background while we send requests. Therefore, it's best to run this command on a separate terminal session.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
The mlserver
package comes with built-in support for streaming data. This allows you to process data in real-time, without having to wait for the entire response to be available. It supports both REST and gRPC APIs.
In this example, we create a simple Identity Text Model
which simply splits the input text into words and returns them one by one. We will use this model to demonstrate how to stream the response from the server to the client. This particular example can provide a good starting point for building more complex streaming models such as the ones based on Large Language Models (LLMs) where streaming is an essential feature to hide the latency of the model.
The next step will be to serve our model using mlserver
. For that, we will first implement an extension that serves as the runtime to perform inference using our custom TextModel
.
This is a trivial model to demonstrate streaming support. The model simply splits the input text into words and returns them one by one. In this example we do the following (a sketch along these lines is shown right after this list):
split the text into words using the white space as the delimiter.
wait 0.5 seconds between each word to simulate a slow model.
return each word one by one.
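The sketch below illustrates what such a runtime could look like; it assumes the text arrives as the first input of the first (and only) request in the stream, and the input / output names are illustrative:

```python
import asyncio
from typing import AsyncIterator

from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class TextModel(MLModel):
    async def predict_stream(
        self, payloads: AsyncIterator[InferenceRequest]
    ) -> AsyncIterator[InferenceResponse]:
        # For this toy model we only expect a single request in the input stream
        payload = [req async for req in payloads][0]
        text = StringCodec.decode_input(payload.inputs[0])[0]

        # Emit one word at a time, simulating a slow model
        for word in text.split(" "):
            await asyncio.sleep(0.5)
            yield InferenceResponse(
                model_name=self.name,
                outputs=[StringCodec.encode_output("output", [word])],
            )
```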
As can be seen, the predict_stream method receives as input an AsyncIterator of InferenceRequest and returns an AsyncIterator of InferenceResponse. This definition covers all types of possible input-output combinations for streaming: unary-stream, stream-unary, stream-stream. It is up to the client and server to send/receive the appropriate number of requests/responses, which should be known a priori.
Note that although the unary-unary case can be covered by the predict_stream method as well, mlserver already covers it through the predict method.
One important limitation to keep in mind is that for the REST API, the client will not be able to send a stream of requests. The client will have to send a single request with the entire input text. The server will then stream the response back to the client. gRPC API, on the other hand, supports all types of streaming listed above.
The next step will be to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Note that the streaming support in MLServer currently has the following main limitations:
distributed workers are not supported (i.e., the parallel_workers
setting should be set to 0
)
gzip
middleware is not supported for REST (i.e., gzip_enabled
setting should be set to false
)
Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
To test our model, we will use the following inference request:
To send a REST streaming request to the server, we will use the following Python code:
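A sketch along these lines, assuming the REST stream is exposed as server-sent events and using the httpx and httpx-sse packages (the model name and port are also assumptions), could look as follows:

```python
import httpx
from httpx_sse import connect_sse

from mlserver import types
from mlserver.codecs import StringCodec

# Illustrative request: a single string input named "prompt"
inference_request = types.InferenceRequest(
    inputs=[StringCodec.encode_input("prompt", ["hello streaming world"], use_bytes=False)]
)

with httpx.Client() as client:
    with connect_sse(
        client,
        "POST",
        "http://localhost:8080/v2/models/text-model/infer_stream",
        json=inference_request.dict(),
    ) as event_source:
        # Each server-sent event carries one InferenceResponse
        for sse in event_source.iter_sse():
            response = types.InferenceResponse.parse_raw(sse.data)
            print(StringCodec.decode_output(response.outputs[0]))
```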
To send a gRPC streaming request to the server, we will use the following Python code:
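Again as a hedged sketch (the module paths under mlserver.grpc, the default gRPC port 8081 and the model name are assumptions), the gRPC client could look roughly like this:

```python
import asyncio

import grpc

import mlserver.grpc.converters as converters
import mlserver.grpc.dataplane_pb2_grpc as dataplane
from mlserver import types
from mlserver.codecs import StringCodec


async def infer_stream():
    inference_request = types.InferenceRequest(
        inputs=[StringCodec.encode_input("prompt", ["hello streaming world"], use_bytes=True)]
    )

    # The stub expects a stream of requests, so wrap the single request
    # in an async generator
    async def request_stream():
        yield converters.ModelInferRequestConverter.from_types(
            inference_request, model_name="text-model", model_version=None
        )

    async with grpc.aio.insecure_channel("localhost:8081") as channel:
        stub = dataplane.GRPCInferenceServiceStub(channel)
        async for raw_response in stub.ModelStreamInfer(request_stream()):
            response = converters.ModelInferResponseConverter.to_types(raw_response)
            print(StringCodec.decode_output(response.outputs[0]))


asyncio.run(infer_stream())
```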
Note that for gRPC, the request is transformed into an async generator which is then passed to the ModelStreamInfer
method. The response is also an async generator which can be iterated over to get the response.
The mlserver package comes with inference runtime implementations for scikit-learn and xgboost models. However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.
In this example, we create a simple Hello World JSON
model that parses and modifies a JSON data chunk. This is often useful as a means to quickly bootstrap existing models that utilize JSON based model inputs.
The next step will be to serve our model using mlserver. For that, we will first implement an extension which serves as the runtime to perform inference using our custom Hello World JSON model.
This is a trivial model to demonstrate how to conceptually work with JSON inputs / outputs. In this example, we:
Parse the JSON input from the client.
Create a JSON response echoing back the client request, as well as a server-generated message.
A sketch of such a runtime is shown right after this list.
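The sketch below assumes the client sends the JSON document as a single BYTES (or string) element in the first input; the class, input and output names are illustrative:

```python
import json

from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


class JsonHelloWorldModel(MLModel):
    async def load(self) -> bool:
        # Nothing to load for this toy model
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Parse the raw JSON sent by the client
        raw = payload.inputs[0].data[0]
        request_json = json.loads(raw)

        # Echo the request back, together with a server-generated message
        response_json = {
            "request": request_json,
            "server_response": "Got your request. Hello from the server.",
        }
        serialised = json.dumps(response_json).encode("UTF-8")

        return InferenceResponse(
            model_name=self.name,
            outputs=[
                ResponseOutput(
                    name="echo_response",
                    shape=[len(serialised)],
                    datatype="BYTES",
                    data=[serialised],
                )
            ],
        )
```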
The next step will be to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Utilizing string data with the gRPC interface can be a bit tricky, so we need to make sure that inputs and outputs are encoded and decoded correctly.
For simplicity, in this case we leverage the Python types that mlserver provides out of the box. Alternatively, the gRPC stubs can be regenerated directly from the V2 specification for use by both Python and non-Python clients.
Out of the box, MLServer provides support to receive inference requests from Kafka. The Kafka server can run side-by-side with the REST and gRPC ones, and adds a new interface to interact with your model. The inference responses coming back from your model will also get written back to their own output topic.
In this example, we will showcase the integration with Kafka by serving a Scikit-Learn model through Kafka.
We are going to start by running a simple local Docker deployment of Kafka that we can test against. This will be a minimal cluster consisting of a single ZooKeeper node and a single broker.
You need to have Java installed for it to work correctly.
You can then start it with the following command in a separate terminal:
Now we can create the input and output topics required.
The first step will be to train a simple scikit-learn
model. For that, we will use the MNIST example from the scikit-learn
documentation which trains an SVM model.
To save our trained model, we will serialise it using joblib
. While this is not a perfect approach, it's currently the recommended method to persist models to disk in the scikit-learn
documentation.
Our model will be persisted as a file named mnist-svm.joblib
Now that we have trained and saved our model, the next step will be to serve it using mlserver
. For that, we will need to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Note that the settings.json file will contain our Kafka configuration, including the address of the Kafka broker and the input / output topics that will be used for inference.
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Now that we have verified that our server is accepting REST requests, we will try to send a new inference request through Kafka. For this, we just need to send a request to the mlserver-input
topic (which is the default input topic):
Once the message has gone into the queue, the Kafka server running within MLServer should receive this message and run inference. The prediction output should then get posted into an output queue, which will be named mlserver-output
by default.
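As an illustration, the sketch below uses the kafka-python package (an assumption; any Kafka client would do) to publish a request and read the prediction back. The mlserver-model header used to route the message to a model and the 8x8-digit payload are also assumptions based on this example's MNIST model:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Illustrative payload: a single flattened 8x8 digit (zeros as a placeholder)
inference_request = {
    "inputs": [
        {"name": "predict", "datatype": "FP32", "shape": [1, 64], "data": [0.0] * 64}
    ]
}

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send(
    "mlserver-input",
    value=json.dumps(inference_request).encode("utf-8"),
    headers=[("mlserver-model", b"mnist-svm")],
)
producer.flush()

# Read the prediction back from the output topic
consumer = KafkaConsumer(
    "mlserver-output",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(json.loads(message.value))
    break
```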
As we can see above, the results of our inference request are now visible in the output Kafka topic.
MLServer extends the V2 inference protocol by adding support for a content_type
annotation. This annotation can be provided either through the model metadata parameters
, or through the input parameters
. By leveraging the content_type
annotation, we can provide the necessary information to MLServer so that it can decode the input payload from the "wire" V2 protocol to something meaningful to the model / user (e.g. a NumPy array).
This example will walk you through some examples which illustrate how this works, and how it can be extended.
To start with, we will write a dummy runtime which just prints the input, the decoded input and returns it. This will serve as a testbed to showcase how the content_type
support works.
Later on, we will extend this runtime by adding custom codecs that will decode our V2 payload to custom types.
As you can see above, this runtime will decode the incoming payloads by calling the self.decode() helper method. This method will check which content type is the right one for each input, in the following order:
Is there any content type defined in the inputs[].parameters.content_type
field within the request payload?
Is there any content type defined in the inputs[].parameters.content_type
field within the model metadata?
Is there any default content type that should be assumed?
In order to enable this runtime, we will also create a model-settings.json file. This file should be present in (or accessible from) the folder where we run the mlserver start . command.
Our initial step will be to decide the content type based on the incoming inputs[].parameters
field. For this, we will start our MLServer in the background (e.g. running mlserver start .
)
As you've probably already noticed, writing request payloads compliant with the V2 Inference Protocol requires a certain knowledge of both the V2 spec and the structure expected by each content type. To account for this and simplify usage, the MLServer package exposes a set of utilities which will help you interact with your models via the V2 protocol.
These helpers are mainly shaped as "codecs". That is, abstractions which know how to "encode" and "decode" arbitrary Python datatypes to and from the V2 Inference Protocol.
Generally, we recommend using the existing set of codecs to generate your V2 payloads. This will ensure that requests and responses follow the right structure, and should provide a more seamless experience.
Following on from our previous example, the same code could be rewritten using codecs as:
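A version along those lines might look like the sketch below; the input names and values are illustrative:

```python
import numpy as np

from mlserver.codecs import NumpyCodec, StringCodec
from mlserver.types import InferenceRequest

inference_request = InferenceRequest(
    inputs=[
        # Encode a NumPy array and a list of strings into V2-compatible inputs
        NumpyCodec.encode_input("foo", np.array([[1, 2], [3, 4]], dtype=np.int32)),
        StringCodec.encode_input("bar", ["first", "second"]),
    ]
)
print(inference_request)
```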
Note that the rewritten snippet now makes use of the built-in InferenceRequest
class, which represents a V2 inference request. On top of that, it also uses the NumpyCodec
and StringCodec
implementations, which know how to encode a Numpy array and a list of strings into V2-compatible request inputs.
Our next step will be to define the expected content type through the model metadata. This can be done by extending the model-settings.json
file, and adding a section on inputs.
After adding this metadata, we will re-start MLServer (e.g. mlserver start .
) and we will send a new request without any explicit parameters
.
As you should be able to see in the server logs, MLServer will cross-reference the input names against the model metadata to find the right content type.
There may be cases where a custom inference runtime may need to encode / decode to custom datatypes. As an example, we can think of computer vision models which may only operate with pillow
image objects.
In these scenarios, it's possible to extend the Codec interface to write our custom encoding logic. A Codec is simply an object which defines decode() and encode() methods. To illustrate how this would work, we will extend our custom runtime to add a custom PillowCodec.
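The exact codec interface has evolved across MLServer versions; the sketch below follows the InputCodec-style interface and should be read as an illustration rather than the exact code used in this example:

```python
import io

from PIL import Image

from mlserver.codecs import InputCodec, register_input_codec
from mlserver.types import RequestInput, ResponseOutput


@register_input_codec
class PillowCodec(InputCodec):
    ContentType = "img"
    DefaultMode = "L"

    @classmethod
    def can_encode(cls, payload) -> bool:
        return isinstance(payload, Image.Image)

    @classmethod
    def encode_output(cls, name: str, payload: Image.Image, **kwargs) -> ResponseOutput:
        # Serialise the image as PNG bytes
        byte_array = io.BytesIO()
        payload.save(byte_array, format="PNG")
        return ResponseOutput(
            name=name, shape=[1], datatype="BYTES", data=[byte_array.getvalue()]
        )

    @classmethod
    def encode_input(cls, name: str, payload: Image.Image, **kwargs) -> RequestInput:
        output = cls.encode_output(name, payload)
        return RequestInput(
            name=output.name,
            shape=output.shape,
            datatype=output.datatype,
            data=output.data,
        )

    @classmethod
    def decode_input(cls, request_input: RequestInput) -> Image.Image:
        if request_input.datatype == "BYTES":
            # Raw image bytes (e.g. an encoded PNG / JPEG)
            return Image.open(io.BytesIO(request_input.data[0]))

        # Otherwise, rebuild a greyscale image from a plain tensor of pixel
        # values, assuming shape == [height, width]
        height, width = request_input.shape
        return Image.frombytes(
            mode=cls.DefaultMode, size=(width, height), data=bytes(request_input.data)
        )
```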
We should now be able to restart our instance of MLServer (i.e. with the mlserver start .
command), to send a few test requests.
As you should be able to see in the MLServer logs, the server is now able to decode the payload into a Pillow image. This example also illustrates how Codec
objects can be compatible with multiple datatype
values (e.g. tensor and BYTES
in this case).
So far, we've seen how you can specify codecs so that they get applied at the input level. However, it is also possible to use request-wide codecs that aggregate multiple inputs to decode the payload. This is usually relevant for cases where the models expect a multi-column input type, like a Pandas DataFrame.
To illustrate this, we will first tweak our EchoRuntime
so that it prints the decoded contents at the request level.
We should now be able to restart our instance of MLServer (i.e. with the mlserver start .
command), to send a few test requests.
MLServer supports loading and unloading models dynamically from a models repository. This allows you to enable and disable the models accessible by MLServer on demand. This extension builds on top of the support for Multi-Model Serving, letting you change, at runtime, which models MLServer is currently serving.
The API to manage the model repository is modelled after Triton's Model Repository extension to the V2 Dataplane and is thus fully compatible with it.
This notebook will walk you through an example using the Model Repository API.
First of all, we will need to train some models. For that, we will re-use the models we trained previously in the Multi-Model Serving example. You can check the details on how they are trained following that notebook.
Next up, we will start our mlserver
inference server. Note that, by default, this will load all our models.
Now that we've got our inference server up and running, and serving 2 different models, we can start using the Model Repository API. To get us started, we will first list all available models in the repository.
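Assuming MLServer's default REST port, the repository can be driven with plain HTTP requests along the lines of the sketch below; the same endpoints cover the unload and load steps discussed next:

```python
import requests

base = "http://localhost:8080/v2/repository"

# List every model in the repository, together with its current state
index = requests.post(f"{base}/index", json={})
print(index.json())

# Unload a model from the inference server (it stays in the repository)...
requests.post(f"{base}/models/mushroom-xgboost/unload")

# ...and load it back again
requests.post(f"{base}/models/mushroom-xgboost/load")
```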
As we can see, the repository lists 2 models (i.e. mushroom-xgboost and mnist-svm). Note that the state for both is set to READY. This means that both models are loaded, and thus ready for inference.
Unloading the mushroom-xgboost model
We will now try to unload one of the 2 models, mushroom-xgboost. This will unload the model from the inference server, but will keep it available in our model repository.
If we now try to list the models available in our repository, we will see that the mushroom-xgboost
model is flagged as UNAVAILABLE
. This means that it's present in the repository but it's not loaded for inference.
Loading the mushroom-xgboost model back
We will now load our model back into our inference server.
If we now try to list the models again, we will see that our mushroom-xgboost
is back again, ready for inference.
[Table of supported frameworks: Scikit-Learn | XGBoost | Spark MLlib | LightGBM | CatBoost | MLflow | Alibi-Detect]
[Table of supported frameworks: Scikit-Learn | XGBoost | MLflow]
[Feature support matrix (✅ / ❌); note: only available on non-regressor models]
MLServer supports Pydantic V2.
MLServer supports streaming data to and from your models.
Streaming support is available for both the REST and gRPC servers:
For the REST server, streaming is limited to server streaming. This means that the client sends a single request to the server, and the server responds with a stream of data.
For the gRPC server, streaming is available for both client and server streaming. This means that the client sends a stream of data to the server, and the server responds with a stream of data.
See our docs and example for more details.
fix(ci): fix typo in CI name by @sakoush in https://github.com/SeldonIO/MLServer/pull/1623
Update CHANGELOG by @github-actions in https://github.com/SeldonIO/MLServer/pull/1624
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1634
Fix mlserver_huggingface settings device type by @geodavic in https://github.com/SeldonIO/MLServer/pull/1486
Update README.md w licensing clarification by @paulb-seldon in https://github.com/SeldonIO/MLServer/pull/1636
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1642
fix(ci): optimise disk space for GH workers by @sakoush in https://github.com/SeldonIO/MLServer/pull/1644
build: Update maintainers by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1659
fix: Missing f-string directives by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1677
build: Add Catboost runtime to Dependabot by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1689
Fix JSON input shapes by @ReveStobinson in https://github.com/SeldonIO/MLServer/pull/1679
build(deps): bump alibi-detect from 0.11.5 to 0.12.0 by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1702
build(deps): bump alibi from 0.9.5 to 0.9.6 by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1704
Docs correction - Updated README.md in mlflow to match column names order by @vivekk0903 in https://github.com/SeldonIO/MLServer/pull/1703
fix(runtimes): Remove unused Pydantic dependencies by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1725
test: Detect generate failures by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1729
build: Add granularity in types generation by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1749
Migrate to Pydantic v2 by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1748
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1753
Revert "build(deps): bump uvicorn from 0.28.0 to 0.29.0" by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1758
refactor(pydantic): Remaining migrations for deprecated functions by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1757
Fixed openapi dataplane.yaml by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1752
fix(pandas): Use Pydantic v2 compatible type by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1760
Fix Pandas codec decoding from numpy arrays by @lhnwrk in https://github.com/SeldonIO/MLServer/pull/1751
build: Bump versions for Read the Docs by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1761
docs: Remove quotes around local TOC by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1764
Spawn worker in custom environment by @lhnwrk in https://github.com/SeldonIO/MLServer/pull/1739
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1767
basic contributing guide on contributing and opening a PR by @bohemia420 in https://github.com/SeldonIO/MLServer/pull/1773
Inference streaming support by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1750
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1779
build: Lock GitHub runners' OS by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1765
Removed text-model form benchmarking by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1790
Bumped mlflow to 2.13.1 and gunicorn to 22.0.0 by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1791
Build(deps): Update to poetry version 1.8.3 in docker build by @sakoush in https://github.com/SeldonIO/MLServer/pull/1792
Bumped werkzeug to 3.0.3 by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1793
Docs streaming by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1789
Bump uvicorn 0.30.1 by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1795
Fixes for all-runtimes by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1794
Fix BaseSettings import for pydantic v2 by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1798
Bumped preflight version to 1.9.7 by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1797
build: Install dependencies only in Tox environments by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1785
Bumped to 1.6.0.dev2 by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1803
Fix CI/CD macos-huggingface by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1805
Fixed macos kafka CI by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1807
Update poetry lock by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1808
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1813
Fix/macos all runtimes by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1823
fix: Update stale reviewer in licenses.yml workflow by @sakoush in https://github.com/SeldonIO/MLServer/pull/1824
ci: Merge changes from master to release branch by @sakoush in https://github.com/SeldonIO/MLServer/pull/1825
@paulb-seldon made their first contribution in https://github.com/SeldonIO/MLServer/pull/1636
@ReveStobinson made their first contribution in https://github.com/SeldonIO/MLServer/pull/1679
@vivekk0903 made their first contribution in https://github.com/SeldonIO/MLServer/pull/1703
@RobertSamoilescu made their first contribution in https://github.com/SeldonIO/MLServer/pull/1752
@lhnwrk made their first contribution in https://github.com/SeldonIO/MLServer/pull/1751
@bohemia420 made their first contribution in https://github.com/SeldonIO/MLServer/pull/1773
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.5.0...1.6.0
Update CHANGELOG by @github-actions in https://github.com/SeldonIO/MLServer/pull/1592
build: Migrate away from Node v16 actions by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1596
build: Bump version and improve release doc by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1602
build: Upgrade stale packages (fastapi, starlette, tensorflow, torch) by @sakoush in https://github.com/SeldonIO/MLServer/pull/1603
fix(ci): tests and security workflow fixes by @sakoush in https://github.com/SeldonIO/MLServer/pull/1608
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1612
fix(ci): Missing quote in CI test for all_runtimes by @sakoush in https://github.com/SeldonIO/MLServer/pull/1617
build(docker): Bump dependencies by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1618
docs: List supported Python versions by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1591
fix(ci): Have separate smaller tasks for release by @sakoush in https://github.com/SeldonIO/MLServer/pull/1619
We removed support for Python 3.8; check https://github.com/SeldonIO/MLServer/pull/1603 for more info. Docker images for mlserver are already using Python 3.10.
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.4.0...1.5.0
Free up some space for GH actions by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1282
Introduce tracing with OpenTelemetry by @vtaskow in https://github.com/SeldonIO/MLServer/pull/1281
Update release CI to use Poetry by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1283
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1284
Add support for white-box explainers to alibi-explain runtime by @ascillitoe in https://github.com/SeldonIO/MLServer/pull/1279
Update CHANGELOG by @github-actions in https://github.com/SeldonIO/MLServer/pull/1294
Fix build-wheels.sh error when copying to output path by @lc525 in https://github.com/SeldonIO/MLServer/pull/1286
Fix typo by @strickvl in https://github.com/SeldonIO/MLServer/pull/1289
feat(logging): Distinguish logs from different models by @vtaskow in https://github.com/SeldonIO/MLServer/pull/1302
Make sure we use our Response class by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1314
Adding Quick-Start Guide to docs by @ramonpzg in https://github.com/SeldonIO/MLServer/pull/1315
feat(logging): Provide JSON-formatted structured logging as option by @vtaskow in https://github.com/SeldonIO/MLServer/pull/1308
Bump in conda version and mamba solver by @dtpryce in https://github.com/SeldonIO/MLServer/pull/1298
feat(huggingface): Merge model settings by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1337
feat(huggingface): Load local artefacts in HuggingFace runtime by @vtaskow in https://github.com/SeldonIO/MLServer/pull/1319
Document and test behaviour around NaN by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1346
Address flakiness on 'mlserver build' tests by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1363
Bump Poetry and lockfiles by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1369
Bump Miniforge3 to 23.3.1 by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1372
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1373
Improved huggingface batch logic by @ajsalow in https://github.com/SeldonIO/MLServer/pull/1336
Add inference params support to MLFlow's custom invocation endpoint (… by @M4nouel in https://github.com/SeldonIO/MLServer/pull/1375
Increase build space for runtime builds by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1385
Fix minor typo in sklearn
README by @krishanbhasin-gc in https://github.com/SeldonIO/MLServer/pull/1402
Add catboost classifier support by @krishanbhasin-gc in https://github.com/SeldonIO/MLServer/pull/1403
added model_kwargs to huggingface model by @nanbo-liu in https://github.com/SeldonIO/MLServer/pull/1417
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1456
Local response cache implementation by @SachinVarghese in https://github.com/SeldonIO/MLServer/pull/1440
fix link to custom runtimes by @kretes in https://github.com/SeldonIO/MLServer/pull/1467
Improve typing on Environment
class by @krishanbhasin-gc in https://github.com/SeldonIO/MLServer/pull/1469
build(dependabot): Change reviewers by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1548
MLServer changes from internal fork - deps and CI updates by @sakoush in https://github.com/SeldonIO/MLServer/pull/1588
@vtaskow made their first contribution in https://github.com/SeldonIO/MLServer/pull/1281
@lc525 made their first contribution in https://github.com/SeldonIO/MLServer/pull/1286
@strickvl made their first contribution in https://github.com/SeldonIO/MLServer/pull/1289
@ramonpzg made their first contribution in https://github.com/SeldonIO/MLServer/pull/1315
@jesse-c made their first contribution in https://github.com/SeldonIO/MLServer/pull/1337
@ajsalow made their first contribution in https://github.com/SeldonIO/MLServer/pull/1336
@M4nouel made their first contribution in https://github.com/SeldonIO/MLServer/pull/1375
@nanbo-liu made their first contribution in https://github.com/SeldonIO/MLServer/pull/1417
@kretes made their first contribution in https://github.com/SeldonIO/MLServer/pull/1467
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.3.5...1.4.0
Rename HF codec to hf
by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1268
Publish is_drift metric to Prom by @joshsgoldstein in https://github.com/SeldonIO/MLServer/pull/1263
@joshsgoldstein made their first contribution in https://github.com/SeldonIO/MLServer/pull/1263
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.3.4...1.3.5
Silent logging by @dtpryce in https://github.com/SeldonIO/MLServer/pull/1230
Fix mlserver infer
with BYTES
by @RafalSkolasinski in https://github.com/SeldonIO/MLServer/pull/1213
@dtpryce made their first contribution in https://github.com/SeldonIO/MLServer/pull/1230
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.3.3...1.3.4
Add default LD_LIBRARY_PATH env var by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1120
Adding cassava tutorial (mlserver + seldon core) by @edshee in https://github.com/SeldonIO/MLServer/pull/1156
Add docs around converting to / from JSON by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1165
Document SKLearn available outputs by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1167
Fix minor typo in alibi-explain
tests by @ascillitoe in https://github.com/SeldonIO/MLServer/pull/1170
Add support for .ubj
models and improve XGBoost docs by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1168
Fix content type annotations for pandas codecs by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1162
Added option to configure the grpc histogram by @cristiancl25 in https://github.com/SeldonIO/MLServer/pull/1143
Add OS classifiers to project's metadata by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1171
Don't use qsize
for parallel worker queue by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1169
Fix small typo in Python API docs by @krishanbhasin-gc in https://github.com/SeldonIO/MLServer/pull/1174
Fix star import in mlserver.codecs.*
by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1172
@cristiancl25 made their first contribution in https://github.com/SeldonIO/MLServer/pull/1143
@krishanbhasin-gc made their first contribution in https://github.com/SeldonIO/MLServer/pull/1174
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.3.2...1.3.3
Use default initialiser if not using a custom env by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1104
Add support for online drift detectors by @ascillitoe in https://github.com/SeldonIO/MLServer/pull/1108
added intera and inter op parallelism parameters to the hugggingface … by @saeid93 in https://github.com/SeldonIO/MLServer/pull/1081
Fix settings reference in runtime docs by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1109
Bump Alibi libs requirements by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1121
Add default LD_LIBRARY_PATH env var by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1120
Ignore both .metrics and .envs folders by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1132
@ascillitoe made their first contribution in https://github.com/SeldonIO/MLServer/pull/1108
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.3.1...1.3.2
Move OpenAPI schemas into Python package (#1095)
More often than not, your custom runtimes will depend on external 3rd party dependencies which are not included within the main MLServer package - or different versions of the same package (e.g. scikit-learn==1.1.0 vs scikit-learn==1.2.0). In these cases, to load your custom runtime, MLServer will need access to these dependencies.
In MLServer 1.3.0
, it is now possible to load this custom set of dependencies by providing them, through an environment tarball, whose path can be specified within your model-settings.json
file. This custom environment will get provisioned on the fly after loading a model - alongside the default environment and any other custom environments.
Under the hood, each of these environments will run their own separate pool of workers.
The MLServer framework now includes a simple interface that allows you to register and keep track of any custom metrics:
[mlserver.register()](https://mlserver.readthedocs.io/en/latest/reference/api/metrics.html#mlserver.register)
: Register a new metric.
[mlserver.log()](https://mlserver.readthedocs.io/en/latest/reference/api/metrics.html#mlserver.log)
: Log a new set of metric / value pairs.
Custom metrics will generally be registered in the [load()](https://mlserver.readthedocs.io/en/latest/reference/api/model.html#mlserver.MLModel.load)
method and then used in the [predict()](https://mlserver.readthedocs.io/en/latest/reference/api/model.html#mlserver.MLModel.predict)
method of your custom runtime. These metrics can then be polled and queried via Prometheus.
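As a brief sketch (the metric name and value are illustrative):

```python
import mlserver
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Register the custom metric once, at load time
        mlserver.register("my_custom_metric", "This is a custom metric example")
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Log a new value for the metric on every inference call
        mlserver.log(my_custom_metric=34)
        # ...run inference as usual and return an InferenceResponse...
        return InferenceResponse(model_name=self.name, outputs=[])
```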
MLServer 1.3.0
now includes an autogenerated Swagger UI which can be used to interact dynamically with the Open Inference Protocol.
The autogenerated Swagger UI can be accessed under the /v2/docs
endpoint.
Alongside the general API documentation, MLServer also exposes now a set of API docs tailored to individual models, showing the specific endpoints available for each one.
The model-specific autogenerated Swagger UI can be accessed under the following endpoints:
/v2/models/{model_name}/docs
/v2/models/{model_name}/versions/{model_version}/docs
MLServer now includes improved Codec support for all the main types that can be returned by HuggingFace models - ensuring that the values returned via the Open Inference Protocol are more semantic and meaningful.
Massive thanks to @pepesi for taking the lead on improving the HuggingFace runtime!
Internally, MLServer leverages a Model Repository implementation which is used to discover and find different models (and their versions) available to load. The latest version of MLServer will now allow you to swap this for your own model repository implementation - letting you integrate against your own model repository workflows.
This is exposed via the model_repository_implementation flag of your settings.json
configuration file.
Thanks to @jgallardorama (aka @jgallardorama-itx ) for his effort contributing this feature!
MLServer 1.3.0
introduces a new set of metrics to increase visibility around two of its internal queues:
Adaptive batching queue: used to accumulate request batches on the fly.
Parallel inference queue: used to send over requests to the inference worker pool.
Many thanks to @alvarorsant for taking the time to implement this highly requested feature!
The latest version of MLServer includes a few optimisations around image size, which help reduce the size of the official set of images by more than ~60% - making them more convenient to use and integrate within your workloads. In the case of the full seldonio/mlserver:1.3.0
image (including all runtimes and dependencies), this means going from 10GB down to ~3GB.
Alongside its built-in inference runtimes, MLServer also exposes a Python framework that you can use to extend MLServer and write your own codecs and inference runtimes. The MLServer official docs now include a reference page documenting the main components of this framework in more detail.
@rio made their first contribution in https://github.com/SeldonIO/MLServer/pull/864
@pepesi made their first contribution in https://github.com/SeldonIO/MLServer/pull/692
@jgallardorama made their first contribution in https://github.com/SeldonIO/MLServer/pull/849
@alvarorsant made their first contribution in https://github.com/SeldonIO/MLServer/pull/860
@gawsoftpl made their first contribution in https://github.com/SeldonIO/MLServer/pull/950
@stephen37 made their first contribution in https://github.com/SeldonIO/MLServer/pull/1033
@sauerburger made their first contribution in https://github.com/SeldonIO/MLServer/pull/1064
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.2.3...1.2.4
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.2.2...1.2.3
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.2.1...1.2.2
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.2.0...1.2.1
MLServer now exposes an alternative “simplified” interface which can be used to write custom runtimes. This interface can be enabled by decorating your predict() method with the mlserver.codecs.decode_args
decorator, and it lets you specify in the method signature both how you want your request payload to be decoded and how to encode the response back.
Based on the information provided in the method signature, MLServer will automatically decode the request payload into the different inputs specified as keyword arguments. Under the hood, this is implemented through MLServer’s codecs and content types system.
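For illustration, a custom runtime using this simplified interface might look like the following sketch (the input names and logic are placeholders):

```python
from typing import List

import numpy as np

from mlserver import MLModel
from mlserver.codecs import decode_args


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Load any weights / artifacts here
        return True

    @decode_args
    async def predict(self, questions: List[str], context: np.ndarray) -> np.ndarray:
        # MLServer decodes `questions` into a list of strings and `context` into
        # a NumPy array based on the type hints, and encodes the returned NumPy
        # array back into a V2-compliant response
        return np.array([len(q) for q in questions])
```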
To make it easier to write your own custom runtimes, MLServer now ships with a mlserver init
command that will generate a templated project. This project will include a skeleton with folders, unit tests, Dockerfiles, etc. for you to fill.
MLServer now lets you load custom runtimes dynamically into a running instance of MLServer. Once you have your custom runtime ready, all you need to do is to move it to your model folder, next to your model-settings.json
configuration file.
For example, if we assume a flat model repository where each folder represents a model, you would end up with a folder structure like the one below:
This release of MLServer introduces a new mlserver infer command, which will let you run inference over a large batch of input data on the client side. Under the hood, this command will stream a large set of inference requests from the specified input file, arrange them in microbatches, orchestrate the request / response lifecycle, and finally write the obtained responses back into an output file.
The 1.2.0 release of MLServer includes a number of fixes around the parallel inference pool, focused on improving the architecture to optimise memory usage and reduce latency. These changes include (but are not limited to):
The main MLServer process won’t load an extra replica of the model anymore. Instead, all computing will occur on the parallel inference pool.
The worker pool will now ensure that all requests are executed on each worker’s AsyncIO loop, thus optimising compute time vs IO time.
Several improvements around logging from the inference workers.
MLServer has now dropped support for Python 3.7
. Going forward, only 3.8
, 3.9
and 3.10
will be supported (with 3.8
being used in our official set of images).
The official set of MLServer images has now moved to use UBI 9 as a base image. This ensures support to run MLServer in OpenShift clusters, as well as a well-maintained baseline for our images.
In line with MLServer’s close relationship with the MLflow team, this release of MLServer introduces support for the recently released MLflow 2.0. This introduces changes to the drop-in MLflow “scoring protocol” support, in the MLflow runtime for MLServer, to ensure it’s aligned with MLflow 2.0.
MLServer is also shipped as a dependency of MLflow, therefore you can try it out today by installing MLflow as:
To learn more about how to use MLServer directly from the MLflow CLI, check out the MLflow docs.
@johnpaulett made their first contribution in https://github.com/SeldonIO/MLServer/pull/633
@saeid93 made their first contribution in https://github.com/SeldonIO/MLServer/pull/711
@RafalSkolasinski made their first contribution in https://github.com/SeldonIO/MLServer/pull/720
@dumaas made their first contribution in https://github.com/SeldonIO/MLServer/pull/742
@Salehbigdeli made their first contribution in https://github.com/SeldonIO/MLServer/pull/776
@regen100 made their first contribution in https://github.com/SeldonIO/MLServer/pull/839
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.1.0...1.2.0
Out of the box, MLServer includes support for streaming data to your models. Streaming support is available for both the REST and gRPC servers.
Streaming support for the REST server is limited only to server streaming. This means that the client sends a single request to the server, and the server responds with a stream of data.
The streaming endpoints are available for both the infer
and generate
methods through the following endpoints:
/v2/models/{model_name}/versions/{model_version}/infer_stream
/v2/models/{model_name}/infer_stream
/v2/models/{model_name}/versions/{model_version}/generate_stream
/v2/models/{model_name}/generate_stream
Note that for REST, the generate
and generate_stream
endpoints are aliases for the infer
and infer_stream
endpoints, respectively. Those names are used to better reflect the nature of the operation for Large Language Models (LLMs).
Streaming support for the gRPC server is available for both client and server streaming. This means that the client sends a stream of data to the server, and the server responds with a stream of data.
The two streams operate independently, so the client and the server can read and write data however they want (e.g., the server could either wait to receive all the client messages before sending a response or it can send a response after each message). Note that bi-directional streaming covers all the possible combinations of client and server streaming: unary-stream, stream-unary, and stream-stream. The unary-unary case can be covered as well by the bi-directional streaming, but mlserver
already has the predict
method dedicated to this use case. The logic for how the requests are received and processed, and how the responses are sent back, should be built into the runtime logic.
The stub method for streaming to be used by the client is ModelStreamInfer
.
The streaming support in MLServer currently has the following main limitations:
the parallel_workers
setting should be set to 0
to disable distributed workers (to be addressed in future releases)
Open Inference Protocol: the main data plane for inference, health and metadata.
Model Repository Extension: an extension to the protocol providing a control plane which lets you load / unload models dynamically.
MLServer has been built with Multi-Model Serving (MMS) in mind. This means that, within a single instance of MLServer, you can serve multiple models under different paths. This also includes multiple versions of the same model.
This notebook shows an example of how you can leverage MMS with MLServer.
We will first start by training 2 different models:
Name | Framework | Source | Trained Model Path |
---|---|---|---|
mnist-svm model
mushroom-xgboost model
The next step will be serving both our models within the same MLServer instance. For that, we will just need to create a model-settings.json file local to each of our models and a server-wide settings.json. That is,
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
models/mnist-svm/model-settings.json
: holds the configuration specific to our mnist-svm
model (e.g. input type, runtime to use, etc.).
models/mushroom-xgboost/model-settings.json
: holds the configuration specific to our mushroom-xgboost
model (e.g. input type, runtime to use, etc.).
settings.json
models/mnist-svm/model-settings.json
models/mushroom-xgboost/model-settings.json
Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
By this point, we should have both our models getting served by MLServer. To make sure that everything is working as expected, let's send a request from each test set.
For that, we can use the Python types that the `mlserver` package provides out of the box, or we can build our request manually, as sketched below.
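For instance, a manually-built request against the `mnist-svm` model could look roughly like this. The default HTTP port `8080` is assumed, and the input name, shape and values are placeholders that should be adjusted to your trained model and test set:

```python
import requests

# Placeholder instance: swap in a real row from your test set.
x_0 = [0.0] * 64

inference_request = {
    "inputs": [
        {
            "name": "predict",
            "shape": [1, 64],
            "datatype": "FP32",
            "data": x_0,
        }
    ]
}

endpoint = "http://localhost:8080/v2/models/mnist-svm/infer"
response = requests.post(endpoint, json=inference_request)

print(response.json())
```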
`mnist-svm` model

`mushroom-xgboost` model

This tutorial walks through the steps required to take a Python ML model from your machine to a production deployment on Kubernetes. More specifically, we'll cover:
Running the model locally
Turning the ML model into an API
Containerizing the model
Storing the container in a registry
Deploying the model to Kubernetes (with Seldon Core)
Scaling the model
The tutorial comes with an accompanying video which you might find useful as you work through the steps:
The slides used in the video can be found here.
For this tutorial, we're going to use the Cassava dataset available from the TensorFlow Catalog. This dataset includes leaf images from the cassava plant. Each plant can be classified as either "healthy" or as having one of four diseases (Mosaic Disease, Bacterial Blight, Green Mite, Brown Streak Disease).
We won't go through the steps of training the classifier. Instead, we'll be using a pre-trained one available on TensorFlow Hub. You can find the model details here.
The easiest way to run this example is to clone the repository located here:
If you've already cloned the MLServer repository, you can also find it in `docs/examples/cassava`.
Once you've done that, you can just run:
And it'll set you up with all the libraries required to run the code.
The starting point for this tutorial is the Python script `app.py`. This is typical of the kind of Python code we'd run standalone or in a Jupyter notebook. Let's familiarise ourselves with the code:
First up, we're importing a couple of functions from our `helpers.py` file:

- `plot` provides the visualisation of the samples, labels and predictions.
- `preprocess` is used to resize images to 224x224 pixels and normalize the RGB values.
The rest of the code is fairly self-explanatory from the comments. We load the model and dataset, select some examples, make predictions and then plot the results.
Try it yourself by running:
The problem with running our code like we did earlier is that it's not accessible to anyone who doesn't have the Python script (and all of its dependencies). A good way to solve this is to turn our model into an API.

Typically, people turn to popular Python web servers like Flask or FastAPI. This is a good approach and gives us lots of flexibility, but it also requires us to do a lot of the work ourselves. We need to implement routes, set up logging, capture metrics and define an API schema, among other things. A simpler way to tackle this problem is to use an inference server. For this tutorial we're going to use the open source MLServer framework.
MLServer supports a bunch of inference runtimes out of the box, but it also supports custom python code which is what we'll use for our Tensorflow model.
In order to get our model ready to run on MLServer, we need to wrap it in a single Python class with two methods, `load()` and `predict()`. Let's take a look at the code (found in `model/serve-model.py`):
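The original file isn't reproduced here, but a rough sketch of its shape could look like the snippet below. The TF Hub handle is a placeholder, and the exact preprocessing and codec choices are assumptions:

```python
import numpy as np
import tensorflow_hub as hub

from mlserver import MLModel
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest, InferenceResponse

from helpers import preprocess  # resizes to 224x224 and normalises RGB values


class CassavaModel(MLModel):
    async def load(self) -> bool:
        # Placeholder handle: swap in the real cassava classifier from TF Hub.
        self._model = hub.KerasLayer("https://tfhub.dev/<cassava-classifier-handle>")
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the incoming V2 payload into a numpy batch of images.
        images = NumpyCodec.decode_input(payload.inputs[0])
        images = preprocess(images)  # assumed to accept a numpy batch

        probabilities = self._model(images)
        predictions = np.argmax(probabilities, axis=-1)

        return InferenceResponse(
            model_name=self.name,
            outputs=[NumpyCodec.encode_output("predictions", predictions)],
        )
```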
The `load()` method is used to define any logic required to set up our model for inference. In our case, we're loading the model weights into `self._model`. The `predict()` method is where we include all of our prediction logic.
You may notice that we've slightly modified our code from earlier (in `app.py`). The biggest change is that it is now wrapped in a single class, `CassavaModel`.
The only other task we need to do to run our model on MLServer is to specify a `model-settings.json` file:
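The file in the tutorial repository may contain a few more fields, but at a minimum it needs a model name (the value `cassava` below is an assumption) and the import path of the model class:

```json
{
  "name": "cassava",
  "implementation": "serve-model.CassavaModel"
}
```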
This is a simple configuration file that tells MLServer how to handle our model. In our case, we've provided a name for our model and told MLServer where to look for our model class (`serve-model.CassavaModel`).
We're now ready to serve our model with MLServer. To do that we can simply run:
MLServer will now start up, load our cassava model and provide access through both a REST and gRPC API.
Now that our API is up and running, open a new terminal window and navigate back to the root of this repository. We can then send predictions to our API using the `test.py` file by running:
Containers are an easy way to package our application together with its runtime and dependencies. More importantly, containerizing our model allows it to run in a variety of different environments.

Note: you will need Docker installed to run this section of the tutorial. You'll also need a Docker Hub account or access to another container registry.
Taking our model and packaging it into a container manually can be a pretty tricky process and requires knowledge of writing Dockerfiles. Thankfully, MLServer removes this complexity and provides us with a simple `build` command.

Before we run this command, we need to provide our dependencies in either a `requirements.txt` or a `conda.env` file. The requirements file we'll use for this example is stored in `model/requirements.txt`:
Notice that we didn't need to include `mlserver` in our requirements? That's because the builder image already includes it.
We're now ready to build our container image using:
Make sure you replace `YOUR_CONTAINER_REGISTRY` and `IMAGE_NAME` with your Docker Hub username and a suitable image name, e.g. `bobsmith/cassava`.
MLServer will now build the model into a container image for us. We can check the output of this by running:
Finally, we want to send this container image to be stored in our container registry. We can do this by running:
Now that we've turned our model into a production-ready API, containerized it and pushed it to a registry, it's time to deploy our model.
We're going to use a popular open source framework called Seldon Core to deploy our model. Seldon Core is great because it combines all of the awesome cloud-native features we get from Kubernetes but it also adds machine-learning specific features.
This tutorial assumes you already have a Seldon Core cluster up and running. If that's not the case, head over to the installation instructions and get set up first. You'll also need to install the `kubectl` command line interface.
To create our deployment with Seldon Core we need to create a small configuration file that looks like this:
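The real `deployment.yaml` ships with the tutorial repository; a rough sketch of such a manifest, assuming a Seldon Core version whose `SeldonDeployment` CRD supports `protocol: v2` and using placeholder names, is shown below:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: cassava
spec:
  protocol: v2            # serve over the V2 inference protocol spoken by MLServer
  predictors:
    - name: default
      graph:
        name: cassava     # must match the container name below
        type: MODEL
      componentSpecs:
        - spec:
            containers:
              - name: cassava
                image: YOUR_CONTAINER_REGISTRY/IMAGE_NAME
```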
You can find this file named `deployment.yaml` in the base folder of this tutorial's repository.
Make sure you replace `YOUR_CONTAINER_REGISTRY` and `IMAGE_NAME` with your Docker Hub username and a suitable image name, e.g. `bobsmith/cassava`.
We can apply this configuration file to our Kubernetes cluster just like we would for any other Kubernetes object using:
To check our deployment is up and running we can run:
We should see `STATUS = Running` once our deployment has finalized.
Now that our model is up and running on a Kubernetes cluster (via Seldon Core), we can send some test inference requests to make sure it's working.
To do this, we simply run the `test.py` file in the following way:
This script will randomly select some test samples, send them to the cluster, gather the predictions and then plot them for us.
A note on running this yourself: this example is set up to connect to a Kubernetes cluster running locally on your machine. If yours is local too, you'll need to make sure you port-forward before sending requests. If your cluster is remote, you'll need to change the `inference_url` variable on line 21 of `test.py`.
Our model is now running in a production environment and able to handle requests from external sources. This is awesome but what happens as the number of requests being sent to our model starts to increase? Eventually, we'll reach the limit of what a single server can handle. Thankfully, we can get around this problem by scaling our model horizontally.
Kubernetes and Seldon Core make this really easy to do by simply running:
We can replace `--replicas=3` with any number we want to scale to.
To watch the servers scaling out we can run:
In this tutorial we've scaled the model out manually to show how it works. In a real environment we'd want to set up auto-scaling to make sure our prediction API is always online and performing as expected.
MLServer is currently used as the core Python inference server in some of the most popular Kubernetes-native serving frameworks, including Seldon Core and KServe. This allows MLServer users to leverage the usability and maturity of these frameworks to take their model deployments to the next level of their MLOps journey, ensuring that they are served in a robust and scalable infrastructure.

In general, it should be possible to deploy models using MLServer into any serving engine compatible with the V2 protocol. Alternatively, it's also possible to manage MLServer deployments manually as regular processes (i.e. in a non-Kubernetes-native way). However, this may be more involved and highly dependent on the deployment infrastructure.
This package provides an MLServer runtime compatible with MLflow models. You can install the runtime, alongside `mlserver`, as:

The MLflow inference runtime introduces a new `dict` content type, which decodes an incoming V2 request as a dictionary. This is useful for certain MLflow-serialised models, which will expect that the model inputs are serialised in this format.
Out-of-the-box, MLServer exposes a set of metrics that help you monitor your machine learning workloads in production. These include standard metrics like number of requests and latency.
On top of these, you can also register and track your own custom metrics as part of your custom inference runtimes.

By default, MLServer will expose metrics around inference requests (count and error rate) and the status of its internal request queues. These internal queues are used for adaptive batching and parallel inference.
| Metric Name | Description |
| --- | --- |
On top of the default set of metrics, MLServer's REST server will also expose a set of metrics specific to REST.

The prefix for the REST-specific metrics will depend on the `metrics_rest_server_prefix` flag from the MLServer settings.
| Metric Name | Description |
| --- | --- |
On top of the default set of metrics, MLServer's gRPC server will also expose a set of metrics specific to gRPC.
MLServer allows you to register custom metrics within your custom inference runtimes. This can be done through the `mlserver.register()` and `mlserver.log()` methods.

- `mlserver.register`: registers a new metric.
- `mlserver.log`: logs a new set of metric / value pairs. If there's any unregistered metric, it will get registered on-the-fly.
Under the hood, metrics logged through the `mlserver.log` method will get exposed to Prometheus as a Histogram.
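As a small sketch of how the two calls fit into a custom runtime (the metric name and logged value below are arbitrary examples, and the response body is left as a placeholder):

```python
import mlserver
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Register the metric once, up front (unregistered metrics would also
        # get registered on-the-fly the first time they are logged).
        mlserver.register("my_custom_metric", "Example of a custom metric")
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Log a value for the metric as part of handling the request.
        mlserver.log(my_custom_metric=34)

        # ...run the actual inference here and build a proper response...
        return InferenceResponse(model_name=self.name, outputs=[])
```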
Below, you can find the list of standardised labels that you will be able to find on model-specific metrics. If these labels are not present on a specific metric, it means that the metric can't be sliced at the model level.
There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.
This page covers some of the bigger points that need to be taken into account when extending MLServer. You can also see the end-to-end custom runtime example, which walks through the process of writing a custom runtime.
MLServer is designed as an easy-to-extend framework, encouraging users to write their own custom runtimes easily. The starting point for this is the `MLModel` abstract class, whose main methods are:

- `load()`: responsible for loading any artifacts related to a model (e.g. model weights, pickle files, etc.).
- `unload()`: responsible for unloading the model, freeing any resources (e.g. GPU memory, etc.).
- `predict()`: responsible for using a model to perform inference on an incoming data point.
Therefore, the "one-line version" of how to write a custom runtime is to write a custom class extending from MLModel <mlserver.MLModel>
, and then overriding those methods with your custom logic.
MLServer exposes an alternative "simplified" interface which can be used to write custom runtimes. This interface can be enabled by decorating your `predict()` method with the `mlserver.codecs.decode_args` decorator. This will let you specify in the method signature both how you want your request payload to be decoded and how to encode the response back.
As an example of the above, let's assume a model which:

- takes two lists of strings as inputs: `questions`, containing multiple questions to ask our model, and `context`, containing multiple contexts for each of the questions;
- returns a Numpy array with some predictions as the output.
Leveraging MLServer's simplified notation, we can represent the above as the following custom runtime:
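The runtime itself could look roughly like this (the `load_my_custom_model` helper is hypothetical, standing in for whatever loads your actual model):

```python
from typing import List

import numpy as np

from mlserver import MLModel
from mlserver.codecs import decode_args


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Hypothetical helper: load whatever artifacts your model needs.
        self._model = load_my_custom_model()
        return True

    @decode_args
    async def predict(self, questions: List[str], context: List[str]) -> np.ndarray:
        # Both inputs arrive already decoded into lists of strings, and the
        # returned numpy array is encoded back into a V2 response automatically.
        return self._model(questions, context)
```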
Note that the method signature of our `predict` method now specifies:

- the input names that we should be looking for in the request payload (i.e. `questions` and `context`);
- the expected content type for each of the request inputs (i.e. `List[str]` in both cases);
- the expected content type of the response outputs (i.e. `np.ndarray`).
The `headers` field within the `parameters` section of the request / response is managed by MLServer. Therefore, incoming payloads where this field has been explicitly modified will be overridden.
There are occasions where custom logic must be made conditional on extra information sent by the client outside of the payload. To allow for these use cases, MLServer will map all incoming HTTP headers (in the case of REST) or metadata (in the case of gRPC) into the `headers` field of the `parameters` object within the `InferenceRequest` instance.

Similarly, to return any HTTP headers (in the case of REST) or metadata (in the case of gRPC), you can append any values to the `headers` field within the `parameters` object of the returned `InferenceResponse` instance.
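A sketch of what this can look like in practice (the header names are arbitrary examples, and the exact shape of the `Parameters` object should be checked against your MLServer version):

```python
from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse, Parameters


class CustomHeadersRuntime(MLModel):
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Incoming HTTP headers (REST) or metadata (gRPC) are mapped in here.
        request_headers = {}
        if payload.parameters is not None:
            request_headers = getattr(payload.parameters, "headers", None) or {}

        customer_id = request_headers.get("x-customer-id", "anonymous")

        return InferenceResponse(
            model_name=self.name,
            outputs=[StringCodec.encode_output("echo", [customer_id])],
            # Anything appended here is sent back as HTTP headers / gRPC metadata.
            parameters=Parameters(headers={"x-served-by": self.name}),
        )
```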
MLServer lets you load custom runtimes dynamically into a running instance of MLServer. Once you have your custom runtime ready, all you need to do is move it to your model folder, next to your `model-settings.json` configuration file.
For example, if we assume a flat model repository where each folder represents a model, you would end up with a folder structure like the one below:
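The layout itself isn't reproduced here, but it would look roughly like this (the `my-custom-model` folder name is a placeholder):

```
models/
└── my-custom-model/
    ├── model-settings.json
    └── models.py
```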
Note that, from the example above, we are assuming that:

- your custom runtime code lives in the `models.py` file;
- the `implementation` field of your `model-settings.json` configuration file contains the import path of your custom runtime (e.g. `models.MyCustomRuntime`).
More often than not, your custom runtimes will depend on external 3rd party dependencies which are not included within the main MLServer package. In these cases, to load your custom runtime, MLServer will need access to these dependencies.
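A sketch of the corresponding folder layout, extending the one above with a pre-packaged environment tarball (names are placeholders):

```
models/
└── my-custom-model/
    ├── environment.tar.gz
    ├── model-settings.json
    └── models.py
```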
Note that, in the folder layout above, we are assuming that:

- the `environment.tar.gz` tarball contains a pre-packaged version of your custom environment;
- the `environment_tarball` field of your `model-settings.json` configuration file points to your pre-packaged custom environment (i.e. `./environment.tar.gz`).
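As a sketch, such a `model-settings.json` could look like the snippet below (assuming the tarball path is declared under the model's `parameters`):

```json
{
  "name": "my-custom-model",
  "implementation": "models.MyCustomRuntime",
  "parameters": {
    "environment_tarball": "./environment.tar.gz"
  }
}
```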
The `mlserver build` command expects that a Docker runtime is available and running in the background.
To leverage these, we can use the `mlserver build` command. Assuming that we're currently in the folder containing our custom inference runtime, we should be able to just run:

The output will be a Docker image named `my-custom-server`, ready to be used.
Default setting values can still be overridden by external environment variables or model-specific `model-settings.json` files.
Out-of-the-box, the `mlserver build` subcommand leverages a default `Dockerfile` which takes into account a number of requirements, like:

- Supporting arbitrary user IDs.
Available on most models, but not in .
Only available on .
WARNING: The `1.3.0` release has been yanked from PyPI due to a packaging issue. This has now been resolved in `>= 1.3.1`.
(The original tutorial includes diagrams showing how the setup evolves at each stage: running locally, serving as an API, pushing to a container registry, deploying to Kubernetes, and scaling out the replicas.)
| Metric Name | Description |
| --- | --- |
Custom metrics will generally be registered in the `load()` method and then used in the `predict()` method of your custom runtime.

For metrics specific to a model (e.g. request counts, custom metrics, etc.), MLServer will always label these with the model name and model version. Downstream, this will allow you to aggregate and query metrics per model.
| Label Name | Description |
| --- | --- |
MLServer will expose metric values through a metrics endpoint exposed on its own metrics server. This endpoint can be polled by Prometheus or other Prometheus-compatible backends.

Below you can find the settings available to control the behaviour of the metrics server:
| Setting | Description | Default |
| --- | --- | --- |
Based on the information provided in the method signature, MLServer will automatically decode the request payload into the different inputs specified as keyword arguments. Under the hood, this is implemented through MLServer's codecs and content types system.

MLServer's "simplified" interface aims to cover use cases where encoding / decoding can be done through one of the codecs built into the MLServer package. However, there are instances where this may not be enough (e.g. variable number of inputs, variable content types, etc.). For these types of cases, please use MLServer's lower-level interface (implementing `predict()` against the raw `InferenceRequest` / `InferenceResponse` types), where you will have full control over the full encoding / decoding process.
It is possible to load this custom set of dependencies by providing them through an environment tarball, whose path can be specified within your `model-settings.json` file.

To load a custom environment, parallel inference must be enabled.

The main MLServer process communicates with the workers running in custom environments via inter-process communication using pickled objects. Custom environments must therefore use the same version of MLServer, and a version of Python compatible with the main process. Consult the tables below for environment compatibility.
| Status | Description |
| --- | --- |
| 🔴 | Unsupported |
| 🟢 | Supported |
| 🔵 | Untested |

| Worker Python \ Server Python | 3.9 | 3.10 | 3.11 |
| --- | --- | --- | --- |
| 3.9 | 🟢 | 🟢 | 🔵 |
| 3.10 | 🟢 | 🟢 | 🔵 |
| 3.11 | 🔵 | 🔵 | 🔵 |
If we take the above as a reference, we could extend it to include our custom environment as:
MLServer offers built-in utilities to help you build a custom MLServer image. This image can contain any custom code (including custom inference runtimes), as well as any custom environment, provided either through a Conda environment file or a `requirements.txt` file.
The subcommand will search for any Conda environment file (i.e. named either `environment.yaml` or `conda.yaml`) and / or any `requirements.txt` present in your root folder. These can be used to tell MLServer what Python environment is required in the final Docker image.
The environment built by `mlserver build` will be global to the whole MLServer image (i.e. every loaded model will, by default, use that custom environment). For Multi-Model Serving scenarios, it may be better to use per-model custom environments instead, which will allow you to run multiple custom environments at the same time.
The `mlserver build` subcommand will treat any `settings.json` or `model-settings.json` files present in your root folder as the default settings that must be set in your final image. Therefore, these files can be used to configure things like the default inference runtime to be used, or even to include embedded models that will always be present within your custom image.
- Building your custom environment on the fly.
- Configuring a set of default setting values for the image.
However, there may be occasions where you need to customise your `Dockerfile` even further. This may be the case, for example, when you need to provide extra environment variables or when you need to customise your Docker build process (e.g. by using other "Docker-less" build tools).
To account for these cases, MLServer also includes an `mlserver dockerfile` subcommand which will just generate a `Dockerfile` (and optionally a `.dockerignore` file) exactly like the one used by the `mlserver build` command. This `Dockerfile` can then be customised according to your needs.
The base `Dockerfile` requires Docker BuildKit to be enabled. To ensure BuildKit is used, you can set the `DOCKER_BUILDKIT=1` environment variable, e.g.
| Number of gRPC requests, labelled by gRPC code and method. |
| Number of in-flight gRPC requests. |
| Model Name (e.g. |
| Model Version (e.g. |
| Number of successful inference requests. |
| Number of failed inference requests. |
|
|
| Number of REST requests, labelled by endpoint and status code. |
| Latency of REST requests. |
| Number of in-flight REST requests. |
| Path under which the metrics endpoint will be exposed. |
|
| Port used to serve the metrics server. |
|
| Prefix used for metric names specific to MLServer's REST inference interface. |
|
| MLServer's current working directory (i.e. |
Queue size for the queue.
Queue size for the queue.
Directory used to store internal metric files (used to support metrics sharing across ). This is equivalent to Prometheus' env var.