
Custom Inference Runtimes

The inference runtimes that MLServer offers out of the box may not always be enough: you may need custom functionality that MLServer does not include (e.g. custom codecs). To cover these cases, MLServer lets you write your own custom runtimes.

Writing a custom inference runtime

MLServer is designed to be easy to extend, so that users can write their own custom runtimes. The starting point is the MLModel <mlserver.MLModel> abstract class, whose main methods are:

load() <mlserver.MLModel.load>: Responsible for loading any artifacts related to a model (e.g. model weights, pickle files, etc.).
unload() <mlserver.MLModel.unload>: Responsible for unloading the model, freeing any resources (e.g. GPU memory, etc.).
predict() <mlserver.MLModel.predict>: Responsible for using a model to perform inference on an incoming data point.

Therefore, the "one-line version" of how to write a custom runtime is: write a custom class extending MLModel <mlserver.MLModel>, and override those methods with your custom logic.

Simplified interface

MLServer exposes an alternative "simplified" interface which can be used to write custom runtimes. This interface can be enabled by decorating your predict() method with the mlserver.codecs.decode_args decorator. This will let you specify in the method signature both how you want your request payload to be decoded and how to encode the response back.

As an example of the above, let's assume a model which:

Takes two lists of strings as inputs:
- questions, containing multiple questions to ask our model.
- context, containing multiple contexts for each of the questions.
Returns a NumPy array with some predictions as the output.

Leveraging MLServer's simplified notation, we can represent the above as the following custom runtime:

Note that the method signature of our predict method now specifies:

The input names that we should be looking for in the request payload (i.e. questions and context).
The expected content type for each of the request inputs (i.e. List[str] in both cases).
The expected content type of the response outputs (i.e. np.ndarray).

Read and write headers

The headers field within the parameters section of the request / response is managed by MLServer. Therefore, incoming payloads where this field has been explicitly modified will be overridden.

There are occasions where custom logic must be conditional on extra information sent by the client outside of the payload. To allow for these use cases, MLServer will map all incoming HTTP headers (in the case of REST) or metadata (in the case of gRPC) into the headers field of the parameters object within the InferenceRequest instance.

Similarly, to return any HTTP headers (in the case of REST) or metadata (in the case of gRPC), you can append any values to the headers field within the parameters object of the returned InferenceResponse instance.

from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, Parameters

class CustomHeadersRuntime(MLModel):

  ...

  async def predict(self, payload: InferenceRequest) -> InferenceResponse:
    ...
    return InferenceResponse(
      # model_name is a required field of InferenceResponse
      model_name=self.name,
      # Include any actual outputs from inference
      outputs=[],
      # Headers (REST) or metadata (gRPC) to send back to the client
      parameters=Parameters(headers={"foo": "bar"})
    )

Loading a custom MLServer runtime

MLServer lets you load custom runtimes dynamically into a running instance of MLServer. Once your custom runtime is ready, all you need to do is move it to your model folder, next to your model-settings.json configuration file.

For example, if we assume a flat model repository where each folder represents a model, you would end up with a folder structure like the one below:

.
└── models
    └── sum-model
        ├── model-settings.json
        └── models.py

Note that the example above assumes that:

Your custom runtime code lives in the models.py file.
The implementation field of your model-settings.json configuration file contains the import path of your custom runtime (e.g. models.MyCustomRuntime).
```
{
  "model": "sum-model",
  "implementation": "models.MyCustomRuntime"
}
```
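If you prefer to create this layout programmatically, the following sketch uses only the Python standard library. The helper name scaffold_model_folder is hypothetical, and the settings keys simply mirror the JSON shown above.

```python
import json
from pathlib import Path


def scaffold_model_folder(repo_root: Path, model_name: str, implementation: str) -> Path:
    """Create <repo_root>/<model_name>/ with model-settings.json and a models.py stub."""
    model_dir = repo_root / model_name
    model_dir.mkdir(parents=True, exist_ok=True)

    # Same keys as the model-settings.json example in this guide.
    settings = {"model": model_name, "implementation": implementation}
    (model_dir / "model-settings.json").write_text(json.dumps(settings, indent=2))

    # Do not clobber an existing runtime implementation.
    runtime_file = model_dir / "models.py"
    if not runtime_file.exists():
        runtime_file.write_text("# Define your custom runtime class here.\n")

    return model_dir


if __name__ == "__main__":
    created = scaffold_model_folder(Path("models"), "sum-model", "models.MyCustomRuntime")
    print(f"Created {created}")
```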

Loading a custom Python environment

More often than not, your custom runtimes will depend on external third-party dependencies which are not included within the main MLServer package. In these cases, to load your custom runtime, MLServer will need access to these dependencies.

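| Status | Description |
| ------ | ----------- |
| 🔴     | Unsupported |
| 🟢     | Supported   |
| 🔵     | Untested    |

| Worker Python \ Server Python | 3.9 | 3.10 | 3.11 |
| ----------------------------- | --- | ---- | ---- |
| 3.9                           | 🟢  | 🟢   | 🔵   |
| 3.10                          | 🟢  | 🟢   | 🔵   |
| 3.11                          | 🔵  | 🔵   | 🔵   |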

.
└── models
    └── sum-model
        ├── environment.tar.gz
        ├── model-settings.json
        └── models.py

Note that, in the folder layout above, we are assuming that:

The environment.tar.gz tarball contains a pre-packaged version of your custom environment.

The environment_tarball field of your model-settings.json configuration file points to your pre-packaged custom environment (i.e. ./environment.tar.gz).

{
  "model": "sum-model",
  "implementation": "models.MyCustomRuntime",
  "parameters": {
    "environment_tarball": "./environment.tar.gz"
  }
}
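Before starting the server, it can be useful to check that the configured tarball is actually where the settings say it is. The sketch below is a standalone pre-flight check, not part of MLServer; the helper name resolve_environment_tarball is hypothetical, and it assumes relative paths are resolved against the model folder.

```python
import json
import tarfile
from pathlib import Path


def resolve_environment_tarball(model_dir: Path) -> Path:
    """Return the tarball path configured in model-settings.json, or raise."""
    settings = json.loads((model_dir / "model-settings.json").read_text())

    tarball = settings.get("parameters", {}).get("environment_tarball")
    if tarball is None:
        raise KeyError("parameters.environment_tarball is not set")

    # Assumption: relative paths are relative to the model folder.
    path = (model_dir / tarball).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Environment tarball not found: {path}")
    if not tarfile.is_tarfile(str(path)):
        raise ValueError(f"Not a readable tar archive: {path}")

    return path
```

For the layout above, calling resolve_environment_tarball(Path("models/sum-model")) would return the absolute path to environment.tar.gz.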

Building a custom MLServer image

The mlserver build command expects that a Docker runtime is available and running in the background.

MLServer offers built-in utilities to help you build a custom MLServer image. This image can contain any custom code (including custom inference runtimes), as well as any custom environment, provided either through a Conda environment file or a requirements.txt file.

To leverage these, we can use the mlserver build command. Assuming that we're currently in the folder containing our custom inference runtime, we should be able to just run:

mlserver build . -t my-custom-server

The output will be a Docker image named my-custom-server, ready to be used.

Custom Environment

The mlserver build subcommand will search for any Conda environment file (i.e. named either as environment.yaml or conda.yaml) and / or any requirements.txt present in your root folder. These can be used to tell MLServer what Python environment is required in the final Docker image.
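To preview which environment files a folder would offer to the build, the lookup described above can be sketched as follows. This is an illustration of the naming rules only, not MLServer's own implementation; the helper name find_environment_files is hypothetical, and preferring environment.yaml over conda.yaml when both exist is an assumption.

```python
from pathlib import Path
from typing import Dict, Optional

# File names taken from the description above.
CONDA_FILE_NAMES = ("environment.yaml", "conda.yaml")
REQUIREMENTS_FILE_NAME = "requirements.txt"


def find_environment_files(root: Path) -> Dict[str, Optional[Path]]:
    """Locate a Conda environment file and / or requirements.txt in root."""
    conda_file = next(
        (root / name for name in CONDA_FILE_NAMES if (root / name).is_file()),
        None,
    )
    requirements = root / REQUIREMENTS_FILE_NAME
    return {
        "conda": conda_file,
        "requirements": requirements if requirements.is_file() else None,
    }
```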

Default Settings

The mlserver build subcommand will treat any settings.json or model-settings.json files present in your root folder as the default settings that must be set in your final image. Therefore, these files can be used to configure things like the default inference runtime, or even to include embedded models that will always be present within your custom image.

Default setting values can still be overridden by external environment variables or by a model-specific model-settings.json.

Custom Dockerfile

Out of the box, the mlserver build subcommand leverages a default Dockerfile which takes into account a number of requirements, like:

Supporting arbitrary user IDs.
Building your base custom environment on the fly.
Configuring a set of default setting values.

However, there may be occasions where you need to customise your Dockerfile even further. This may be the case, for example, when you need to provide extra environment variables or when you need to customise your Docker build process (e.g. by using other "Docker-less" tools, like Kaniko or Buildah).

To account for these cases, MLServer also includes a mlserver dockerfile subcommand which will just generate a Dockerfile (and optionally a .dockerignore file) exactly like the one used by the mlserver build command. This Dockerfile can then be customised according to your needs.

The base Dockerfile requires Docker's BuildKit to be enabled. To ensure BuildKit is used, you can set the DOCKER_BUILDKIT=1 environment variable, e.g.

DOCKER_BUILDKIT=1 docker build . -t my-custom-runtime:0.1.0