Out of the box, MLServer includes support for streaming data to your models. Streaming support is available for both the REST and gRPC servers.
Streaming support for the REST server is limited only to server streaming. This means that the client sends a single request to the server, and the server responds with a stream of data.
The streaming endpoints are available for both the `infer` and `generate` methods through the following endpoints:

- `/v2/models/{model_name}/versions/{model_version}/infer_stream`
- `/v2/models/{model_name}/infer_stream`
- `/v2/models/{model_name}/versions/{model_version}/generate_stream`
- `/v2/models/{model_name}/generate_stream`

Note that for REST, the `generate` and `generate_stream` endpoints are aliases for the `infer` and `infer_stream` endpoints, respectively. Those names are used to better reflect the nature of the operation for Large Language Models (LLMs).
Streaming support for the gRPC server is available for both client and server streaming. This means that the client sends a stream of data to the server, and the server responds with a stream of data.
The two streams operate independently, so the client and the server can read and write data however they want (e.g., the server could either wait to receive all the client messages before sending a response, or it can send a response after each message). Note that bi-directional streaming covers all the possible combinations of client and server streaming: unary-stream, stream-unary, and stream-stream. The unary-unary case could be covered by bi-directional streaming as well, but `mlserver` already has the `predict` method dedicated to that use case. The logic for how requests are received and processed, and how responses are sent back, should be built into the runtime logic.

The stub method to be used by the client for streaming is `ModelStreamInfer`.
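For illustration, a server-streaming runtime could look roughly like the sketch below. The `predict_stream()` hook and the `StreamingEchoModel` class are assumptions based on recent MLServer releases, so double-check the streaming API available in your MLServer version:

```python
from typing import AsyncIterator

from mlserver import MLModel
from mlserver.codecs import StringCodec, StringRequestCodec
from mlserver.types import InferenceRequest, InferenceResponse


class StreamingEchoModel(MLModel):
    async def load(self) -> bool:
        return True

    # Assumed streaming hook: receives an async iterator of requests and
    # yields responses back to the client as they become available.
    async def predict_stream(
        self, payloads: AsyncIterator[InferenceRequest]
    ) -> AsyncIterator[InferenceResponse]:
        async for request in payloads:
            texts = StringCodec.decode_input(request.inputs[0])
            for text in texts:
                for word in text.split():
                    # Stream one response per token instead of waiting for
                    # the full output to be ready.
                    yield StringRequestCodec.encode_response(
                        model_name=self.name, payload=[word]
                    )
```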
There are two main limitations of the streaming support in MLServer:

- the `parallel_workers` setting should be set to `0` to disable distributed workers (to be addressed in future releases);
- for REST, the `gzip_enabled` setting should be set to `false` to disable GZIP compression, as streaming is not compatible with GZIP compression.
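For reference, a minimal `settings.json` honouring both constraints could look like this (any other settings you need would sit alongside these two fields):

```json
{
  "parallel_workers": 0,
  "gzip_enabled": false
}
```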
MLServer is used as the core Python inference server in KServe (formerly known as KFServing). This allows for a straightforward avenue to deploy your models into a scalable serving infrastructure backed by Kubernetes.
This section assumes a basic knowledge of KServe and Kubernetes, as well as access to a working Kubernetes cluster with KServe installed. To learn more about KServe or how to install it, please visit the KServe documentation.
KServe provides built-in serving runtimes to deploy models trained in common ML frameworks. These allow you to deploy your models into a robust infrastructure by just pointing to where the model artifacts are stored remotely.
Some of these runtimes leverage MLServer as the core inference server. Therefore, it should be straightforward to move from your local testing to your serving infrastructure.
To use any of the built-in serving runtimes offered by KServe, it should be enough to select the relevant one in your `InferenceService` manifest.
For example, to serve a Scikit-Learn model, you could use a manifest like the one below:
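A sketch of such a manifest is shown below; the resource name and `storageUri` are placeholders to replace with your own values:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-sklearn-model
spec:
  predictor:
    sklearn:
      # Serve the model through the V2 inference protocol (backed by MLServer)
      protocolVersion: v2
      # Placeholder: point this to where your model artifacts are stored
      storageUri: gs://my-bucket/my-sklearn-model
```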
As you can see highlighted above, the `InferenceService` manifest will only need to specify the following points:

- The model artifact is a Scikit-Learn model. Therefore, we will use the `sklearn` serving runtime to deploy it.
- The model will be served using the V2 inference protocol, which can be enabled by setting the `protocolVersion` field to `v2`.
Once you have your `InferenceService` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, by running:
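For example, assuming the manifest above is saved as `my-sklearn-model.yaml` (an illustrative file name):

```bash
kubectl apply -f my-sklearn-model.yaml
```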
As mentioned above, KServe offers support for built-in serving runtimes, some of which leverage MLServer as the inference server. Below you can find a table listing these runtimes, and the MLServer inference runtime that they correspond to.

| Framework | MLServer Runtime | KServe Serving Runtime | Documentation |
|---|---|---|---|
| Scikit-Learn | MLServer SKLearn | `sklearn` | |
| XGBoost | MLServer XGBoost | `xgboost` | |
Note that, on top of the ones shown above (backed by MLServer), KServe also provides a wider set of serving runtimes. To see the full list, please visit the KServe documentation.
Sometimes, the serving runtimes built into KServe may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, it's easy to deploy them into your serving infrastructure leveraging KServe support for custom runtimes.
The `InferenceService` manifest gives you full control over the containers used to deploy your machine learning model. This can be leveraged to point your deployment to the custom MLServer image containing your custom logic. For example, if we assume that our custom image has been tagged as `my-custom-server:0.1.0`, we could write an `InferenceService` manifest like the one below:
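A sketch of such a manifest; the resource name, container name and port are illustrative, and the `PROTOCOL` environment variable is shown as one possible way to signal the V2 protocol to KServe (check the KServe docs for the mechanism supported by your version):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-custom-model
spec:
  predictor:
    containers:
      - name: classifier
        # Point the deployment to our custom MLServer image
        image: my-custom-server:0.1.0
        env:
          # Illustrative: explicitly choose the V2 inference protocol
          - name: PROTOCOL
            value: v2
        ports:
          # Let KServe know which port the custom container exposes
          # (MLServer's REST port defaults to 8080)
          - containerPort: 8080
            protocol: TCP
```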
As we can see highlighted above, the main points that we'll need to take into account are:

- Pointing to our custom MLServer image in the custom container section of our `InferenceService`.
- Explicitly choosing the V2 inference protocol to serve our model.
- Letting KServe know what port will be exposed by our custom container to send inference requests.

Once you have your `InferenceService` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, by running:
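For example, assuming the manifest above is saved as `my-custom-model.yaml` (an illustrative file name):

```bash
kubectl apply -f my-custom-model.yaml
```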
MLServer is used as the core Python inference server in Seldon Core. Therefore, it should be straightforward to deploy your models either by using one of the built-in pre-packaged servers or by pointing to a custom image of MLServer.
This section assumes a basic knowledge of Seldon Core and Kubernetes, as well as access to a working Kubernetes cluster with Seldon Core installed. To learn more about Seldon Core or how to install it, please visit the Seldon Core documentation.
Out of the box, Seldon Core comes with a few MLServer runtimes pre-configured to run straight away. This allows you to deploy an MLServer instance by just pointing to where your model artifact is and specifying what ML framework was used to train it.
To let Seldon Core know what framework was used to train your model, you can use the `implementation` field of your `SeldonDeployment` manifest. For example, to deploy a Scikit-Learn artifact stored remotely in GCS, one could do:
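A sketch of such a manifest; the deployment name and `modelUri` are placeholders:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-sklearn-model
spec:
  # Serve the model using the V2 inference protocol
  protocol: kfserving
  predictors:
    - name: default
      graph:
        name: classifier
        # Use Seldon Core's pre-packaged MLServer SKLearn server
        implementation: SKLEARN_SERVER
        # Placeholder: point this to where your model artifact is stored
        modelUri: gs://my-bucket/my-sklearn-model
```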
As you can see highlighted above, all that we need to specify is that:

- Our inference deployment should use the V2 inference protocol, which is done by setting the `protocol` field to `kfserving`.
- Our model artifact is a serialised Scikit-Learn model; therefore, it should be served using the MLServer SKLearn runtime, which is done by setting the `implementation` field to `SKLEARN_SERVER`.
Note that, while the `protocol` should always be set to `kfserving` (i.e. so that models are served using the V2 inference protocol), the value of the `implementation` field will be dependent on your ML framework. The valid values of the `implementation` field are pre-determined by Seldon Core. However, it should also be possible to configure and add new ones (e.g. to support a custom MLServer runtime).

Once you have your `SeldonDeployment` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, by running:
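For example, assuming the manifest above is saved as `my-sklearn-model.yaml` (an illustrative file name):

```bash
kubectl apply -f my-sklearn-model.yaml
```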
To consult the supported values of the `implementation` field where MLServer is used, you can check the support table below.
As mentioned above, pre-packaged servers come built-in into Seldon Core. Therefore, only a pre-determined subset of them will be supported for a given release of Seldon Core.
The table below shows a list of the currently supported values of the `implementation` field. Each row also shows what ML framework they correspond to and what MLServer runtime will be enabled internally on your model deployment when used.

| Framework | MLServer Runtime | Seldon Core Pre-packaged Server | Documentation |
|---|---|---|---|
| Scikit-Learn | MLServer SKLearn | `SKLEARN_SERVER` | |
| XGBoost | MLServer XGBoost | `XGBOOST_SERVER` | |
| MLflow | MLServer MLflow | `MLFLOW_SERVER` | |
Note that, on top of the ones shown above (backed by MLServer), Seldon Core also provides a wider set of pre-packaged servers. To check the full list, please visit the Seldon Core documentation.
There could be cases where the pre-packaged MLServer runtimes supported out-of-the-box in Seldon Core may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, Seldon Core makes it just as easy to deploy them into your serving infrastructure.
The `componentSpecs` field of the `SeldonDeployment` manifest allows us to let Seldon Core know what image should be used to serve a custom model. For example, if we assume that our custom image has been tagged as `my-custom-server:0.1.0`, we could write our `SeldonDeployment` manifest as follows:
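A sketch of such a manifest; the deployment and container names are illustrative:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-custom-model
spec:
  # Serve the model deployment using the V2 inference protocol
  protocol: v2
  predictors:
    - name: default
      graph:
        name: classifier
        type: MODEL
      componentSpecs:
        - spec:
            containers:
              # Point the model container to our custom MLServer image
              - name: classifier
                image: my-custom-server:0.1.0
```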
As we can see highlighted on the snippet above, all that's needed to deploy a custom MLServer image is:

- Letting Seldon Core know that the model deployment will be served through the V2 inference protocol, by setting the `protocol` field to `v2`.
- Pointing our model container to our custom MLServer image, by specifying it on the `image` field of the `componentSpecs` section of the manifest.

Once you have your `SeldonDeployment` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, by running:
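For example, assuming the manifest above is saved as `my-custom-model.yaml` (an illustrative file name):

```bash
kubectl apply -f my-custom-model.yaml
```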
There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.
This page covers some of the bigger points that need to be taken into account when extending MLServer. You can also see this end-to-end example which walks through the process of writing a custom runtime.
MLServer is designed as an easy-to-extend framework, encouraging users to write their own custom runtimes easily. The starting point for this is the `mlserver.MLModel` abstract class, whose main methods are:

- `load()`: Responsible for loading any artifacts related to a model (e.g. model weights, pickle files, etc.).
- `unload()`: Responsible for unloading the model, freeing any resources (e.g. GPU memory, etc.).
- `predict()`: Responsible for using a model to perform inference on an incoming data point.

Therefore, the "one-line version" of how to write a custom runtime is to write a custom class extending from `mlserver.MLModel`, and then overriding those methods with your custom logic.
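For illustration, a bare-bones custom runtime could look like the sketch below. The joblib artifact path, the model logic and the use of the NumPy codecs are assumptions to adapt to your own model:

```python
import joblib

from mlserver import MLModel
from mlserver.codecs import NumpyCodec, NumpyRequestCodec
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Placeholder: load whatever artifacts your model needs
        self._model = joblib.load("./my-model.joblib")
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the first input head into a NumPy array, run inference and
        # encode the result back into a V2-compatible response
        data = NumpyCodec.decode_input(payload.inputs[0])
        prediction = self._model.predict(data)
        return NumpyRequestCodec.encode_response(
            model_name=self.name, payload=prediction
        )
```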
MLServer exposes an alternative "simplified" interface which can be used to write custom runtimes. This interface can be enabled by decorating your `predict()` method with the `mlserver.codecs.decode_args` decorator. This will let you specify in the method signature both how you want your request payload to be decoded and how to encode the response back.
Based on the information provided in the method signature, MLServer will automatically decode the request payload into the different inputs specified as keyword arguments. Under the hood, this is implemented through MLServer's codecs and content types system.
MLServer's "simplified" interface aims to cover use cases where encoding / decoding can be done through one of the codecs built into the MLServer package. However, there are instances where this may not be enough (e.g. variable number of inputs, variable content types, etc.). For these types of cases, please use MLServer's "advanced" interface, where you will have full control over the full encoding / decoding process.
As an example of the above, let's assume a model which:

- Takes two lists of strings as inputs:
  - `questions`, containing multiple questions to ask our model.
  - `context`, containing multiple contexts for each of the questions.
- Returns a Numpy array with some predictions as the output.

Leveraging MLServer's simplified notation, we can represent the above as the following custom runtime:
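A sketch of such a runtime; the model stub loaded below is a placeholder for your own loading logic:

```python
from typing import List

import numpy as np

from mlserver import MLModel
from mlserver.codecs import decode_args


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Placeholder: swap this stub for your own model-loading logic
        self._model = lambda questions, context: np.zeros(len(questions))
        return True

    @decode_args
    async def predict(
        self, questions: List[str], context: List[str]
    ) -> np.ndarray:
        # MLServer decodes the `questions` and `context` inputs into lists of
        # strings, and encodes the returned NumPy array back into a V2 response
        return self._model(questions, context)
```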
Note that the method signature of our `predict` method now specifies:

- The input names that we should be looking for in the request payload (i.e. `questions` and `context`).
- The expected content type for each of the request inputs (i.e. `List[str]` in both cases).
- The expected content type of the response outputs (i.e. `np.ndarray`).
The `headers` field within the `parameters` section of the request / response is managed by MLServer. Therefore, incoming payloads where this field has been explicitly modified will be overridden.

There are occasions where custom logic must be made conditional on extra information sent by the client outside of the payload. To allow for these use cases, MLServer will map all incoming HTTP headers (in the case of REST) or metadata (in the case of gRPC) into the `headers` field of the `parameters` object within the `InferenceRequest` instance.

Similarly, to return any HTTP headers (in the case of REST) or metadata (in the case of gRPC), you can append any values to the `headers` field within the `parameters` object of the returned `InferenceResponse` instance.
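For illustration, a `predict()` method could read and set these headers roughly as in the sketch below; the header names are illustrative:

```python
from mlserver import MLModel
from mlserver.codecs import StringRequestCodec
from mlserver.types import InferenceRequest, InferenceResponse, Parameters


class HeaderAwareRuntime(MLModel):
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Read an incoming HTTP header / gRPC metadata entry (illustrative name)
        tenant = ""
        if payload.parameters and payload.parameters.headers:
            tenant = payload.parameters.headers.get("x-tenant-id", "")

        response = StringRequestCodec.encode_response(
            model_name=self.name, payload=[f"hello {tenant}"]
        )
        # Append a custom header / metadata entry to the response
        response.parameters = Parameters(headers={"x-handled-by": self.name})
        return response
```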
MLServer lets you load custom runtimes dynamically into a running instance of MLServer. Once you have your custom runtime ready, all you need to do is move it to your model folder, next to your `model-settings.json` configuration file.

For example, if we assume a flat model repository where each folder represents a model, you would end up with a folder structure like the one below:
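A layout along these lines (model and file names are illustrative):

```
models/
└── my-custom-model/
    ├── model-settings.json
    └── models.py
```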
Note that, from the example above, we are assuming that:

- Your custom runtime code lives in the `models.py` file.
- The `implementation` field of your `model-settings.json` configuration file contains the import path of your custom runtime (e.g. `models.MyCustomRuntime`).
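For reference, the matching `model-settings.json` could look like this (the model name is illustrative):

```json
{
  "name": "my-custom-model",
  "implementation": "models.MyCustomRuntime"
}
```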
More often than not, your custom runtimes will depend on external 3rd-party dependencies which are not included within the main MLServer package. In these cases, to load your custom runtime, MLServer will need access to these dependencies.

It is possible to load this custom set of dependencies by providing them through an environment tarball, whose path can be specified within your `model-settings.json` file.

To load a custom environment, parallel inference must be enabled.

The main MLServer process communicates with workers in custom environments via `multiprocessing.Queue` using pickled objects. Custom environments therefore must use the same version of MLServer and a compatible version of Python with the same default pickle protocol as the main process. Consult the tables below for environment compatibility.

| Status | Description |
|---|---|
| 🔴 | Unsupported |
| 🟢 | Supported |
| 🔵 | Untested |

| Worker Python \ Server Python | 3.9 | 3.10 | 3.11 |
|---|---|---|---|
| 3.9 | 🟢 | 🟢 | 🔵 |
| 3.10 | 🟢 | 🟢 | 🔵 |
| 3.11 | 🔵 | 🔵 | 🔵 |
If we take the previous example above as a reference, we could extend it to include our custom environment as:
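An extended layout could then look like this (again, names are illustrative):

```
models/
└── my-custom-model/
    ├── environment.tar.gz
    ├── model-settings.json
    └── models.py
```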
Note that, in the folder layout above, we are assuming that:

- The `environment.tar.gz` tarball contains a pre-packaged version of your custom environment.
- The `environment_tarball` field of your `model-settings.json` configuration file points to your pre-packaged custom environment (i.e. `./environment.tar.gz`).
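For reference, the matching `model-settings.json` could then look like the sketch below, assuming the `environment_tarball` field lives under the model's `parameters` (as in recent MLServer versions):

```json
{
  "name": "my-custom-model",
  "implementation": "models.MyCustomRuntime",
  "parameters": {
    "environment_tarball": "./environment.tar.gz"
  }
}
```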
The `mlserver build` command expects that a Docker runtime is available and running in the background.

MLServer offers built-in utilities to help you build a custom MLServer image. This image can contain any custom code (including custom inference runtimes), as well as any custom environment, provided either through a Conda environment file or a `requirements.txt` file.

To leverage these, we can use the `mlserver build` command. Assuming that we're currently in the folder containing our custom inference runtime, we should be able to just run:
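For example (the trailing `.` points at the current folder, and the tag matches the image name mentioned below):

```bash
mlserver build . -t my-custom-server
```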
The output will be a Docker image named `my-custom-server`, ready to be used.

The `mlserver build` subcommand will search for any Conda environment file (i.e. named either `environment.yaml` or `conda.yaml`) and / or any `requirements.txt` present in your root folder. These can be used to tell MLServer what Python environment is required in the final Docker image.

The environment built by the `mlserver build` subcommand will be global to the whole MLServer image (i.e. every loaded model will, by default, use that custom environment). For Multi-Model Serving scenarios, it may be better to use per-model custom environments instead, which will allow you to run multiple custom environments at the same time.

The `mlserver build` subcommand will treat any `settings.json` or `model-settings.json` files present in your root folder as the default settings that must be set in your final image. Therefore, these files can be used to configure things like the default inference runtime to be used, or to even include embedded models that will always be present within your custom image.

Default setting values can still be overridden by external environment variables or model-specific `model-settings.json` files.
Out of the box, the `mlserver build` subcommand leverages a default `Dockerfile` which takes into account a number of requirements, like:

- Supporting arbitrary user IDs.
- Building your base custom environment on the fly.
- Configuring a set of default setting values.

However, there may be occasions where you need to customise your `Dockerfile` even further. This may be the case, for example, when you need to provide extra environment variables or when you need to customise your Docker build process (e.g. by using other "Docker-less" tools, like Kaniko or Buildah).

To account for these cases, MLServer also includes an `mlserver dockerfile` subcommand which will just generate a `Dockerfile` (and optionally a `.dockerignore` file) exactly like the one used by the `mlserver build` command. This `Dockerfile` can then be customised according to your needs.

The base `Dockerfile` requires Docker's BuildKit to be enabled. To ensure BuildKit is used, you can set the `DOCKER_BUILDKIT=1` environment variable, e.g.
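For example (the image tag is illustrative):

```bash
DOCKER_BUILDKIT=1 docker build . -t my-custom-server
```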
Out of the box, MLServer includes support to offload inference workloads to a pool of workers running in separate processes. This allows MLServer to scale out beyond the limitations of the Python interpreter. To learn more about why this can be beneficial, you can check the concurrency section below.
By default, MLServer will spin up a pool with only one worker process to run inference. All models will be loaded uniformly across the inference pool workers. To read more about advanced settings, please see the usage section below.
The Global Interpreter Lock (GIL) is a mutex lock that exists in most Python interpreters (e.g. CPython). Its main purpose is to lock Python's execution so that it only runs on a single processor at the same time. This simplifies certain things for the interpreter. However, it also adds the limitation that a single Python process will never be able to leverage multiple cores.
When we think about MLServer's support for Multi-Model Serving (MMS), this could lead to scenarios where a heavily-used model starves the other models running within the same MLServer instance. Similarly, even if we don’t take MMS into account, the GIL also makes it harder to scale inference for a single model.
To work around this limitation, MLServer offloads the model inference to a pool of workers, where each worker is a separate Python process (and thus has its own separate GIL). This means that we can get full access to the underlying hardware.
Managing the Inter-Process Communication (IPC) between the main MLServer process and the inference pool workers brings in some overhead. Under the hood, MLServer uses the `multiprocessing` library to implement the distributed processing management, which has been shown to offer the smallest possible overhead when implementing these types of distributed strategies (Zhi et al., 2020).

The extra overhead introduced by other libraries is usually brought in as a trade-off in exchange for other advanced features for complex distributed processing scenarios. However, MLServer's use case is simple enough to not require any of these.

Despite the above, even though this overhead is minimised, it can still be particularly noticeable for lightweight inference methods, where the extra IPC overhead can take up a large percentage of the overall time. In these cases (which can only be assessed on a model-by-model basis), the user has the option to disable the parallel inference feature.

For regular models where inference can take a bit more time, this overhead is usually offset by the benefit of having multiple cores to compute inference on.
By default, MLServer will always create an inference pool with one single worker. The number of workers (i.e. the size of the inference pool) can be adjusted globally through the server-level `parallel_workers` setting.

`parallel_workers`

The `parallel_workers` field of the `settings.json` file (or alternatively, the `MLSERVER_PARALLEL_WORKERS` global environment variable) controls the size of MLServer's inference pool. The expected values are:

- `N`, where `N > 0`, will create a pool of `N` workers.
- `0` will disable the parallel inference feature. In other words, inference will happen within the main MLServer process.
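For example, a `settings.json` requesting a pool of four workers could look like this:

```json
{
  "parallel_workers": 4
}
```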
Jiale Zhi, Rui Wang, Jeff Clune, and Kenneth O. Stanley. Fiber: A Platform for Efficient Development and Distributed Training for Reinforcement Learning and Population-Based Methods. arXiv:2003.11164 [cs, stat], March 2020.
MLServer follows the Open Inference Protocol (previously known as the "V2 Protocol"). You can find the full OpenAPI spec for the Open Inference Protocol in the links below:
| Name | Description | OpenAPI Spec |
|---|---|---|
| Open Inference Protocol | Main dataplane for inference, health and metadata | |
| Model Repository Extension | Extension to the protocol to provide a control plane which lets you load / unload models dynamically | |

On top of the OpenAPI spec above, MLServer also autogenerates a Swagger UI which can be used to interact dynamically with the Open Inference Protocol.

The autogenerated Swagger UI can be accessed under the `/v2/docs` endpoint.

Besides the Swagger UI, you can also access the raw OpenAPI spec through the `/v2/docs/dataplane.json` endpoint.
Alongside the general API documentation, MLServer will also autogenerate a Swagger UI tailored to individual models, showing the endpoints available for each one.
The model-specific autogenerated Swagger UI can be accessed under the following endpoints:

- `/v2/models/{model_name}/docs`
- `/v2/models/{model_name}/versions/{model_version}/docs`

Besides the Swagger UI, you can also access the model-specific raw OpenAPI spec through the following endpoints:

- `/v2/models/{model_name}/docs/dataplane.json`
- `/v2/models/{model_name}/versions/{model_version}/docs/dataplane.json`
Out-of-the-box, MLServer exposes a set of metrics that help you monitor your machine learning workloads in production. These include standard metrics like number of requests and latency.
On top of these, you can also register and track your own custom metrics as part of your custom inference runtimes.
By default, MLServer will expose metrics around inference requests (count and error rate) and the status of its internal requests queues. These internal queues are used for adaptive batching and communication with the inference workers.
| Metric Name | Description |
|---|---|
| `model_infer_request_success` | Number of successful inference requests. |
| `model_infer_request_failure` | Number of failed inference requests. |
| `batch_request_queue` | Queue size for the adaptive batching queue. |
| `parallel_request_queue` | Queue size for the inference workers queue. |
On top of the default set of metrics, MLServer's REST server will also expose a set of metrics specific to REST.
The prefix for the REST-specific metrics will be dependent on the `metrics_rest_server_prefix` flag from the MLServer settings.

| Metric Name | Description |
|---|---|
| `[rest_server]_requests` | Number of REST requests, labelled by endpoint and status code. |
| `[rest_server]_requests_duration_seconds` | Latency of REST requests. |
| `[rest_server]_requests_in_progress` | Number of in-flight REST requests. |

On top of the default set of metrics, MLServer's gRPC server will also expose a set of metrics specific to gRPC.

| Metric Name | Description |
|---|---|
| `grpc_server_handled` | Number of gRPC requests, labelled by gRPC code and method. |
| `grpc_server_started` | Number of in-flight gRPC requests. |

MLServer allows you to register custom metrics within your custom inference runtimes. This can be done through the `mlserver.register()` and `mlserver.log()` methods.

- `mlserver.register`: Register a new metric.
- `mlserver.log`: Log a new set of metric / value pairs. If there's any unregistered metric, it will get registered on-the-fly.
Under the hood, metrics logged through the `mlserver.log` method will get exposed to Prometheus as a Histogram.

Custom metrics will generally be registered in the `load()` method and then used in the `predict()` method of your custom runtime.
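A sketch of this pattern; the metric name and logged value are illustrative:

```python
import mlserver

from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Register the custom metric up front, so it is exposed even before
        # the first value gets logged
        mlserver.register("my_custom_metric", "This is a custom metric example")
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Log a new value for the custom metric on every inference request
        mlserver.log(my_custom_metric=34)
        return InferenceResponse(model_name=self.name, outputs=[])
```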
For metrics specific to a model (e.g. custom metrics, request counts, etc.), MLServer will always label these with the model name and model version. Downstream, this will allow you to aggregate and query metrics per model.

If these labels are not present on a specific metric, it means that those metrics can't be sliced at the model level.

Below, you can find the list of standardised labels that you will be able to find on model-specific metrics:

| Label Name | Description |
|---|---|
| `model_name` | Model Name (e.g. `my-custom-model`) |
| `model_version` | Model Version (e.g. `v1.2.3`) |

MLServer will expose metric values through a metrics endpoint exposed on its own metrics server. This endpoint can be polled by Prometheus or other OpenMetrics-compatible backends.

Below you can find the settings available to control the behaviour of the metrics server:

| Setting | Description | Default |
|---|---|---|
| `metrics_endpoint` | Path under which the metrics endpoint will be exposed. | `/metrics` |
| `metrics_port` | Port used to serve the metrics server. | `8082` |
| `metrics_rest_server_prefix` | Prefix used for metric names specific to MLServer's REST inference interface. | `rest_server` |
| `metrics_dir` | Directory used to store internal metric files (used to support metrics sharing across inference workers). This is equivalent to Prometheus' `$PROMETHEUS_MULTIPROC_DIR` env var. | MLServer's current working directory (i.e. `$PWD`) |

MLServer is currently used as the core Python inference server in some of the most popular Kubernetes-native serving frameworks, including Seldon Core and KServe (formerly known as KFServing). This allows MLServer users to leverage the usability and maturity of these frameworks to take their model deployments to the next level of their MLOps journey, ensuring that they are served in a robust and scalable infrastructure.

In general, it should be possible to deploy models using MLServer into any serving engine compatible with the V2 protocol. Alternatively, it's also possible to manage MLServer deployments manually as regular processes (i.e. in a non-Kubernetes-native way). However, this may be more involved and highly dependent on the deployment infrastructure.
MLServer includes support to batch requests together transparently on-the-fly. We refer to this as "adaptive batching", although it can also be known as "predictive batching".
There are usually two main reasons to adopt adaptive batching:
Maximise resource usage. Usually, inference operations are “vectorised” (i.e. are designed to operate across batches). For example, a GPU is designed to operate on multiple data points at the same time. Therefore, to make sure that it’s used at maximum capacity, we need to run inference across batches.
Minimise any inference overhead. Usually, all models will have to “pay” a constant overhead when running any type of inference. This can be something like IO to communicate with the GPU or some kind of processing in the incoming data. Up to a certain size, this overhead tends to not scale linearly with the number of data points. Therefore, it’s in our interest to send as large batches as we can without deteriorating performance.
However, these benefits will usually scale only up to a certain point, which is usually determined by either the infrastructure, the machine learning framework used to train your model, or a combination of both. Therefore, to maximise the performance improvements brought in by adaptive batching, it is important to configure it with the appropriate values for your model. Since these values are usually found through experimentation, MLServer won't enable adaptive batching by default on newly loaded models.

MLServer lets you configure adaptive batching independently for each model, through two main parameters:

- Maximum batch size, that is, how many requests you want to group together.
- Maximum batch time, that is, how much time we should wait for new requests until we reach our maximum batch size.
`max_batch_size`

The `max_batch_size` field of the `model-settings.json` file (or alternatively, the `MLSERVER_MODEL_MAX_BATCH_SIZE` global environment variable) controls the maximum number of requests that should be grouped together on each batch. The expected values are:

- `N`, where `N > 1`, will create batches of up to `N` elements.
- `0` or `1` will disable adaptive batching.

`max_batch_time`

The `max_batch_time` field of the `model-settings.json` file (or alternatively, the `MLSERVER_MODEL_MAX_BATCH_TIME` global environment variable) controls the time that MLServer should wait for new requests to come in until we reach our maximum batch size.

The expected format is in seconds, but it will take fractional values. That is, 500ms could be expressed as `0.5`.

The expected values are:

- `T`, where `T > 0`, will wait `T` seconds at most.
- `0` will disable adaptive batching.
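For example, a `model-settings.json` enabling batches of up to 10 requests with a 100ms window could look like this (the model name is illustrative):

```json
{
  "name": "my-model",
  "max_batch_size": 10,
  "max_batch_time": 0.1
}
```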
MLServer allows adding custom parameters to the `parameters` field of the requests. These parameters are received as a merged list of parameters inside the server, e.g.

is received as follows in the batched request in the server:

In the same way, if the request is sent back from the server as a batched request,

it will be returned unbatched from the server as follows:
Machine learning models generally expect their inputs to be passed down as a particular Python type. Most commonly, this type ranges from "general purpose" NumPy arrays or Pandas DataFrames to more granular definitions, like `datetime` objects, `Pillow` images, etc. Unfortunately, the definition of the V2 Inference Protocol doesn't cover any of these specific use cases. This protocol can be thought of as a wider "lower level" spec, which only defines what fields a payload should have.

To account for this gap, MLServer introduces support for content types, which offer a way to let MLServer know how it should "decode" V2-compatible payloads. When shaped in the right way, these payloads should "encode" all the information required to extract the higher level Python type that will be required for a model.

To illustrate the above, we can think of a Scikit-Learn pipeline, which takes in a Pandas DataFrame and returns a NumPy Array. Without the use of content types, the V2 payload itself would probably lack information about how this payload should be treated by MLServer. Likewise, the Scikit-Learn pipeline wouldn't know how to treat a raw V2 payload. In this scenario, the use of content types allows us to specify the actual "higher level" information encoded within the V2 protocol payloads.

Some inference runtimes may apply a content type by default if none is present. To learn more about each runtime's defaults, please check the documentation of each inference runtime.

To learn more about the available content types and how to use them, you can see all the available ones in the section below.

For a full end-to-end example of how content types and codecs work under the hood, feel free to check out the worked example in the MLServer documentation.

To let MLServer know that a particular payload must be decoded / encoded as a different Python data type (e.g. NumPy Array, Pandas DataFrame, etc.), you can specify it through the `content_type` field of the `parameters` section of your request.

As an example, we can consider the following dataframe, containing two columns: Age and First Name.

| First Name | Age |
|---|---|
| Joanne | 34 |
| Michael | 22 |
This table could be specified in the V2 protocol as the following payload, where we declare that:

- The whole set of inputs should be decoded as a Pandas DataFrame (i.e. setting the content type as `pd`).
- The First Name column should be decoded as a UTF-8 string (i.e. setting the content type as `str`).
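A sketch of such a payload, mirroring the table above:

```json
{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "First Name",
      "datatype": "BYTES",
      "parameters": {
        "content_type": "str"
      },
      "shape": [2],
      "data": ["Joanne", "Michael"]
    },
    {
      "name": "Age",
      "datatype": "INT32",
      "shape": [2],
      "data": [34, 22]
    }
  ]
}
```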
It's important to keep in mind that content types can be specified at both the request level and the input level. The former will apply to the entire set of inputs, whereas the latter will only apply to a particular input of the payload.
Under the hood, the conversion between content types is implemented using codecs. In the MLServer architecture, codecs are an abstraction which knows how to encode and decode high-level Python types to and from the V2 Inference Protocol.

Depending on the high-level Python type, encoding / decoding operations may require access to multiple input or output heads. For example, a Pandas DataFrame would need to aggregate all of the input- / output-heads present in a V2 Inference Protocol response. However, a NumPy array or a list of strings could be encoded directly as an input head within a larger request.

To account for this, codecs can work at either the request- / response-level (known as request codecs), or the input- / output-level (known as input codecs). Each of these codecs exposes the following public interface, where `Any` represents a high-level Python data type (e.g. a Pandas DataFrame, a NumPy Array, etc.):

Request Codecs

- `encode_request()`
- `decode_request()`
- `encode_response()`
- `decode_response()`

Input Codecs

- `encode_input()`
- `decode_input()`
- `encode_output()`
- `decode_output()`
Note that these methods can also be used as helpers to encode requests and decode responses on the client side. This can help to abstract away from the user most of the details about the underlying structure of V2-compatible payloads.

For example, in the example above, we could use codecs to encode the DataFrame into a V2-compatible request simply as:
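A sketch of this, using the `PandasCodec` request codec on the dataframe above:

```python
import pandas as pd

from mlserver.codecs import PandasCodec

dataframe = pd.DataFrame({"First Name": ["Joanne", "Michael"], "Age": [34, 22]})

# Encode the whole DataFrame into a V2-compatible InferenceRequest
inference_request = PandasCodec.encode_request(dataframe)
print(inference_request)
```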
When using MLServer's request codecs, the output of encoding payloads will always be one of the classes within the `mlserver.types` package (i.e. `InferenceRequest` or `InferenceResponse`). Therefore, if you want to use them with `requests` (or any other package outside of MLServer) you will need to convert them to a Python dict or a JSON string.

Luckily, these classes leverage Pydantic under the hood. Therefore, you can just call the `.model_dump()` or `.model_dump_json()` methods to convert them. Likewise, to read them back from JSON, we can always pass the JSON fields as kwargs to the class' constructor (or use any of the helpers available within Pydantic).

For example, if we want to send an inference request to model `foo`, we could do something along the following lines:
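A sketch; the endpoint URL and model name are illustrative:

```python
import pandas as pd
import requests

from mlserver.codecs import PandasCodec

dataframe = pd.DataFrame({"First Name": ["Joanne", "Michael"], "Age": [34, 22]})

inference_request = PandasCodec.encode_request(dataframe)

# Serialise the Pydantic model into raw JSON before sending it over HTTP
raw_json = inference_request.model_dump_json()
response = requests.post(
    "http://localhost:8080/v2/models/foo/infer",
    data=raw_json,
    headers={"Content-Type": "application/json"},
)
print(response.json())
```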
The NaN (Not a Number) value is used in NumPy and other scientific libraries to describe an invalid or missing value (e.g. a division by zero). In some scenarios, it may be desirable to let your models receive and / or output NaN values (e.g. these can be useful sometimes with GBTs, like XGBoost models). This is why MLServer supports encoding NaN values on your request / response payloads under some conditions.

In order to send / receive NaN values, you must ensure that:

- You are using the REST interface.
- The input / output entry containing NaN values uses either the `FP16`, `FP32` or `FP64` datatypes.

Assuming those conditions are satisfied, any `null` value within your tensor payload will be converted to NaN.
For example, if you take the following Numpy array:
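A sketch of such an array:

```python
import numpy as np

foo = np.array([[1.0, 2.0], [3.0, np.nan]])
```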
We could encode it as:
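A corresponding REST payload could look like the sketch below, with `null` entries standing in for NaN:

```json
{
  "inputs": [
    {
      "name": "foo",
      "datatype": "FP64",
      "shape": [2, 2],
      "data": [1.0, 2.0, 3.0, null]
    }
  ]
}
```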
Content types can also be defined as part of the model's metadata. This lets the user pre-configure what content types a model should use by default to decode / encode its requests / responses, without the need to specify it on each request.

It's important to keep in mind that content types passed explicitly as part of the request will always take precedence over the model's metadata. Therefore, we can leverage this to override the model's metadata when needed.

For example, to configure the content type values of the dataframe example above, one could create a `model-settings.json` file like the one below:
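A sketch, assuming the column metadata lives under the model's `inputs` field (the column names match the dataframe example):

```json
{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "First Name",
      "datatype": "BYTES",
      "shape": [-1],
      "parameters": {
        "content_type": "str"
      }
    },
    {
      "name": "Age",
      "datatype": "INT32",
      "shape": [-1]
    }
  ]
}
```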
Out of the box, MLServer supports the following list of content types. However, this can be extended through the use of 3rd-party or custom runtimes.

| Python Type | Content Type | Request Level | Request Codec | Input Level | Input Codec |
|---|---|---|---|---|---|
| NumPy Array | `np` | ✅ | `NumpyRequestCodec` | ✅ | `NumpyCodec` |
| Pandas DataFrame | `pd` | ✅ | `PandasCodec` | ❌ | |
| UTF-8 String | `str` | ✅ | `StringRequestCodec` | ✅ | `StringCodec` |
| Base64 | `base64` | ❌ | | ✅ | `Base64Codec` |
| Datetime | `datetime` | ❌ | | ✅ | `DatetimeCodec` |

MLServer allows you to extend the supported content types by adding custom ones. To learn more about how to write your own custom content types, you can check the MLServer documentation. You can also learn more about building custom extensions for MLServer in the custom runtimes section of the docs.
The `np` content type will decode / encode V2 payloads to a NumPy Array, taking into account the following:

- The `datatype` field will be matched to the closest NumPy `dtype`.
- The `shape` field will be used to reshape the flattened array expected by the V2 protocol into the expected tensor shape.

The V2 Inference Protocol expects that the `data` of each input is sent as a flat array. Therefore, the `np` content type will expect that tensors are sent flattened. The information in the `shape` field will then be used to reshape the vector into the right dimensions.

By default, MLServer will always assume that an array with a single-dimensional shape, e.g. `[N]`, is equivalent to `[N, 1]`. That is, each entry will be treated like a single one-dimensional data point (i.e. instead of a `[1, D]` array, where the full array is a single `D`-dimensional data point). To avoid any ambiguity, where possible, the NumPy codec will always explicitly encode `[N]` arrays as `[N, 1]`.
For example, if we think of the following NumPy Array:
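A sketch of such an array:

```python
import numpy as np

foo = np.array([[1, 2], [3, 4]])
```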
We could encode it as the input `foo` in a V2 protocol request as:
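A corresponding payload could look like this, with the flattened `data` and the original dimensions carried in `shape`:

```json
{
  "inputs": [
    {
      "name": "foo",
      "datatype": "INT32",
      "shape": [2, 2],
      "parameters": {
        "content_type": "np"
      },
      "data": [1, 2, 3, 4]
    }
  ]
}
```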
When using the NumPy Array content type at the request level, it will decode the entire request by considering only the first `input` element. This can be used as a helper for models which only expect a single tensor.

The `pd` content type can be stacked with other content types. This allows the user to use a different set of content types to decode each of the columns.

The `pd` content type will decode / encode a V2 request into a Pandas DataFrame. For this, it will expect that the DataFrame is shaped in a columnar way. That is:

- Each entry of the `inputs` list (or `outputs`, in the case of responses) will represent a column of the DataFrame.
- Each of these entries will contain all the row elements for that particular column.
- The `shape` field of each `input` (or `output`) entry will contain (at least) the amount of rows included in the dataframe.
For example, if we consider the following dataframe:

| A | B | C |
|---|---|---|
| a1 | b1 | c1 |
| a2 | b2 | c2 |
| a3 | b3 | c3 |
| a4 | b4 | c4 |

We could encode it to the V2 Inference Protocol as:
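A sketch of the columnar payload, with one input per column:

```json
{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "A",
      "datatype": "BYTES",
      "shape": [4],
      "data": ["a1", "a2", "a3", "a4"]
    },
    {
      "name": "B",
      "datatype": "BYTES",
      "shape": [4],
      "data": ["b1", "b2", "b3", "b4"]
    },
    {
      "name": "C",
      "datatype": "BYTES",
      "shape": [4],
      "data": ["c1", "c2", "c3", "c4"]
    }
  ]
}
```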
The `str` content type lets you encode / decode a V2 input into a UTF-8 Python string, taking into account the following:

- The expected `datatype` is `BYTES`.
- The `shape` field represents the number of "strings" that are encoded in the payload (e.g. the `["hello world", "one more time"]` payload will have a shape of 2 elements).
For example, if we consider the following list of strings:
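A sketch of such a list:

```python
foo = ["bar", "bar2"]
```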
We could encode it to the V2 Inference Protocol as:
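A corresponding payload could look like this:

```json
{
  "parameters": {
    "content_type": "str"
  },
  "inputs": [
    {
      "name": "foo",
      "datatype": "BYTES",
      "shape": [2],
      "data": ["bar", "bar2"]
    }
  ]
}
```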
When using the `str` content type at the request level, it will decode the entire request by considering only the first `input` element. This can be used as a helper for models which only expect a single string or a set of strings.

The `base64` content type will decode a binary V2 payload into a Base64-encoded string (and vice versa), taking into account the following:

- The expected `datatype` is `BYTES`.
- The `data` field should contain the Base64-encoded binary strings.
- The `shape` field represents the number of binary strings that are encoded in the payload.
For example, if we think of the following "bytes array":
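A sketch of such a value:

```python
foo = b"Python is fun"
```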
We could encode it as the input `foo` of a V2 request as:
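A corresponding payload, with the bytes Base64-encoded:

```json
{
  "inputs": [
    {
      "name": "foo",
      "datatype": "BYTES",
      "shape": [1],
      "parameters": {
        "content_type": "base64"
      },
      "data": ["UHl0aG9uIGlzIGZ1bg=="]
    }
  ]
}
```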
The `datetime` content type will decode a V2 input into a Python `datetime.datetime` object, taking into account the following:

- The expected `datatype` is `BYTES`.
- The `data` field should contain the dates serialised following the ISO 8601 standard.
- The `shape` field represents the number of datetimes that are encoded in the payload.

For example, if we think of the following `datetime` object:
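A sketch of such an object:

```python
import datetime

foo = datetime.datetime(2022, 1, 11, 11, 0, 0)
```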
We could encode it as the input `foo` of a V2 request as:
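A corresponding payload, with the datetime serialised as an ISO 8601 string:

```json
{
  "inputs": [
    {
      "name": "foo",
      "datatype": "BYTES",
      "shape": [1],
      "parameters": {
        "content_type": "datetime"
      },
      "data": ["2022-01-11T11:00:00"]
    }
  ]
}
```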