User Guide

Metrics

Out-of-the-box, MLServer exposes a set of metrics that help you monitor your machine learning workloads in production. These include standard metrics like number of requests and latency.

On top of these, you can also register and track your own custom metrics as part of your custom inference runtimes.

Default Metrics

By default, MLServer will expose metrics around inference requests (count and error rate) and the status of its internal request queues. These internal queues are used for adaptive batching and communication with the inference workers.

Metric Name
Description

REST Server Metrics

On top of the default set of metrics, MLServer's REST server will also expose a set of metrics specific to REST.

Metric Name
Description

gRPC Server Metrics

On top of the default set of metrics, MLServer's gRPC server will also expose a set of metrics specific to gRPC.

Metric Name
Description

Custom Metrics

MLServer allows you to register custom metrics within your custom inference runtimes. This can be done through the mlserver.register() and mlserver.log() methods.

  • mlserver.register: Register a new metric.

  • mlserver.log: Log a new set of metric / value pairs. If there's any unregistered metric, it will get registered on-the-fly.

Custom metrics will generally be registered in the load() method (mlserver.MLModel.load) and then used in the predict() method (mlserver.MLModel.predict) of your custom runtime.

Metrics Labelling

For metrics specific to a model (e.g. custom metrics, request counts, etc.), MLServer will always label these with the model name and model version. Downstream, this will allow you to aggregate and query metrics per model.

Below, you can find the list of standardised labels that you will be able to find on model-specific metrics:

Label Name
Description

Settings

MLServer will expose metric values through a metrics endpoint exposed on its own metrics server. This endpoint can be polled by Prometheus or other OpenMetrics-compatible backends.

Below you can find the settings available to control the behaviour of the metrics server:

  • metrics_rest_server_prefix: Prefix used for metric names specific to MLServer's REST inference interface. Default: rest_server.

  • metrics_dir: Directory used to store internal metric files (used to support metrics sharing across inference workers). This is equivalent to Prometheus' $PROMETHEUS_MULTIPROC_DIR env var. Default: MLServer's current working directory (i.e. $PWD).

Default metrics:

  • model_infer_request_success: Number of successful inference requests.

  • model_infer_request_failure: Number of failed inference requests.

  • batch_request_queue: Queue size for the adaptive batching queue.

  • parallel_request_queue: Queue size for the inference workers queue.

REST server metrics:

  • [rest_server]_requests: Number of REST requests, labelled by endpoint and status code.

  • [rest_server]_requests_duration_seconds: Latency of REST requests.

  • [rest_server]_requests_in_progress: Number of in-flight REST requests.

gRPC server metrics:

  • grpc_server_handled: Number of gRPC requests, labelled by gRPC code and method.

  • grpc_server_started: Number of in-flight gRPC requests.

import mlserver

from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(mlserver.MLModel):
    async def load(self) -> bool:
        # load_my_custom_model() is a placeholder for your own loading logic
        self._model = load_my_custom_model()
        # Register the metric once, when the model gets loaded
        mlserver.register("my_custom_metric", "This is a custom metric example")
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Log a new value for the metric on every inference request
        mlserver.log(my_custom_metric=34)
        # TODO: Replace for custom logic to run inference
        return self._model.predict(payload)

Model-specific labels:

  • model_name: Model Name (e.g. my-custom-model)

  • model_version: Model Version (e.g. v1.2.3)

Metrics server settings:

  • metrics_endpoint: Path under which the metrics endpoint will be exposed. Default: /metrics.

  • metrics_port: Port used to serve the metrics server. Default: 8082.

The prefix for the REST-specific metrics will be dependent on the metrics_rest_server_prefix flag from the MLServer settings.

Under the hood, metrics logged through the mlserver.log method will get exposed to Prometheus as a Histogram.
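To make the histogram behaviour more concrete, the sketch below mimics how a Prometheus-style histogram accumulates observations into cumulative buckets. It is a toy stand-in written with the standard library only: the SimpleHistogram class and the bucket boundaries are assumptions made for illustration, and they do not reflect the buckets that MLServer or the Prometheus client actually use.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence


class SimpleHistogram:
    """Toy stand-in for a Prometheus-style histogram (illustration only)."""

    def __init__(self, upper_bounds: Sequence[float]) -> None:
        # Bucket upper bounds, sorted, plus a final +Inf bucket that
        # catches every observation above the largest bound.
        self.upper_bounds: List[float] = sorted(upper_bounds) + [float("inf")]
        self._counts: List[int] = [0] * len(self.upper_bounds)
        self.total: float = 0.0
        self.count: int = 0

    def observe(self, value: float) -> None:
        # Pick the first bucket whose upper bound is >= value.
        index = bisect_left(self.upper_bounds, value)
        self._counts[index] += 1
        self.total += value
        self.count += 1

    def cumulative_buckets(self) -> Dict[float, int]:
        # Buckets are exposed cumulatively: each one counts every
        # observation less than or equal to its upper bound.
        running = 0
        result: Dict[float, int] = {}
        for bound, bucket_count in zip(self.upper_bounds, self._counts):
            running += bucket_count
            result[bound] = running
        return result


# Hypothetical bucket boundaries and logged values.
histogram = SimpleHistogram(upper_bounds=[10, 50, 100])
for logged_value in (34, 34, 7, 120):
    histogram.observe(logged_value)

buckets = histogram.cumulative_buckets()
```

Each logged value therefore contributes to a bucket count, to the running sum and to the observation count, rather than overwriting a single gauge-like value.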

If these labels are not present on a specific metric, this means that those metrics can't be sliced at the model level.
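As a rough illustration of what slicing at the model level means downstream, the sketch below sums a metric per model_name from a text payload. The payload is hand-written for this example (it is not captured from a real MLServer instance), and sum_by_model is a hypothetical helper; in practice this kind of aggregation would usually be done with a query in your metrics backend.

```python
import re
from collections import defaultdict
from typing import Dict

# Hypothetical scrape output, written by hand for illustration only.
SAMPLE_SCRAPE = """\
# HELP model_infer_request_success Number of successful inference requests.
model_infer_request_success{model_name="iris",model_version="v1"} 10
model_infer_request_success{model_name="iris",model_version="v2"} 5
model_infer_request_success{model_name="mnist",model_version="v1"} 7
batch_request_queue 3
"""

# name{label="value",...} value
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][\w:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')


def sum_by_model(scrape: str, metric_name: str) -> Dict[str, float]:
    """Sum the samples of one metric, grouped by their model_name label."""
    totals: Dict[str, float] = defaultdict(float)
    for line in scrape.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP / TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None or match.group("name") != metric_name:
            continue
        labels = dict(LABEL_PAIR.findall(match.group("labels") or ""))
        if "model_name" not in labels:
            continue  # this sample can't be sliced at the model level
        totals[labels["model_name"]] += float(match.group("value"))
    return dict(totals)


per_model = sum_by_model(SAMPLE_SCRAPE, "model_infer_request_success")
unlabelled = sum_by_model(SAMPLE_SCRAPE, "batch_request_queue")
```

Here the two versions of the iris model are aggregated under a single model name, while the unlabelled queue metric yields no per-model result.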


OpenAPI Support

MLServer follows the Open Inference Protocol (previously known as the "V2 Protocol"). You can find the full OpenAPI spec for the Open Inference Protocol in the links below:

  • Open Inference Protocol: Main dataplane for inference, health and metadata (OpenAPI spec: dataplane.json)

On top of the OpenAPI spec above, MLServer also autogenerates a Swagger UI which can be used to interact dynamically with the Open Inference Protocol.

The autogenerated Swagger UI can be accessed under the /v2/docs endpoint.

Model Swagger UI

Alongside the general API documentation, MLServer will also autogenerate a Swagger UI tailored to individual models, showing the endpoints available for each one.

The model-specific autogenerated Swagger UI can be accessed under the following endpoints:

  • /v2/models/{model_name}/docs

  • /v2/models/{model_name}/versions/{model_version}/docs
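The endpoint patterns above can be assembled programmatically, for instance to print the documentation links for a deployed model. The helper below is a minimal sketch: swagger_ui_urls is a hypothetical function, and the base URL (http://localhost:8080) and the model name and version are assumptions chosen for the example.

```python
from typing import List, Optional
from urllib.parse import quote


def swagger_ui_urls(
    base_url: str, model_name: str, model_version: Optional[str] = None
) -> List[str]:
    """Build the Swagger UI URLs for the server and for a given model."""
    root = base_url.rstrip("/")
    # Escape the path segments in case names contain special characters.
    name = quote(model_name, safe="")
    urls = [f"{root}/v2/docs", f"{root}/v2/models/{name}/docs"]
    if model_version is not None:
        version = quote(model_version, safe="")
        urls.append(f"{root}/v2/models/{name}/versions/{version}/docs")
    return urls


# Assumed base URL, model name and version, for illustration only.
urls = swagger_ui_urls("http://localhost:8080", "my-custom-model", "v1.2.3")
```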

Streaming

Out of the box, MLServer includes support for streaming data to your models. Streaming support is available for both the REST and gRPC servers.

REST Server

Streaming support for the REST server is limited only to server streaming. This means that the client sends a single request to the server, and the server responds with a stream of data.

The streaming endpoints are available for both the infer and generate methods through the following endpoints:

  • /v2/models/{model_name}/versions/{model_version}/infer_stream

  • /v2/models/{model_name}/infer_stream

  • /v2/models/{model_name}/versions/{model_version}/generate_stream

Note that for REST, the generate and generate_stream endpoints are aliases for the infer and infer_stream endpoints, respectively. Those names are used to better reflect the nature of the operation for Large Language Models (LLMs).

gRPC Server

Streaming support for the gRPC server is available for both client and server streaming. This means that the client sends a stream of data to the server, and the server responds with a stream of data.

The two streams operate independently, so the client and the server can read and write data however they want (e.g. the server could either wait to receive all the client messages before sending a response, or it could send a response after each message). Note that bi-directional streaming covers all the possible combinations of client and server streaming: unary-stream, stream-unary, and stream-stream. The unary-unary case could be covered by bi-directional streaming as well, but MLServer already has the predict method dedicated to this use case. The logic for how the requests are received and processed, and how the responses are sent back, should be built into the runtime logic.

The stub method for streaming to be used by the client is ModelStreamInfer.

Limitations

The streaming support in MLServer currently has the following main limitations:

  • the parallel_workers setting should be set to 0 to disable distributed workers (to be addressed in future releases)

  • for REST, the gzip_enabled setting should be set to false to disable GZIP compression, as streaming is not compatible with GZIP compression (see the linked issue)

Parallel Inference

Out of the box, MLServer includes support to offload inference workloads to a pool of workers running in separate processes. This allows MLServer to scale out beyond the limitations of the Python interpreter. To learn more about why this can be beneficial, you can check the concurrency section below.

By default, MLServer will spin up a pool with only one worker process to run inference. All models will be loaded uniformly across the inference pool workers. To read more about advanced settings, please see the usage section below.

REST streaming is also exposed through the following endpoint:

  • /v2/models/{model_name}/generate_stream

The OpenAPI specs also cover the following extension:

  • Model Repository Extension: Extension to the protocol to provide a control plane which lets you load / unload models dynamically (OpenAPI spec: model_repository.json)

Besides the Swagger UI, you can also access the raw OpenAPI spec through the /v2/docs/dataplane.json endpoint.

Besides the Swagger UI, you can also access the model-specific raw OpenAPI spec through the following endpoints:

  • /v2/models/{model_name}/docs/dataplane.json

  • /v2/models/{model_name}/versions/{model_version}/docs/dataplane.json

Concurrency in Python

The Global Interpreter Lock (GIL) is a mutex lock that exists in most Python interpreters (e.g. CPython). Its main purpose is to lock Python's execution so that it only runs on a single processor at the same time. This simplifies certain things for the interpreter. However, it also adds the limitation that a single Python process will never be able to leverage multiple cores.

When we think about MLServer's support for Multi-Model Serving (MMS), this could lead to scenarios where a heavily-used model starves the other models running within the same MLServer instance. Similarly, even if we don’t take MMS into account, the GIL also makes it harder to scale inference for a single model.

To work around this limitation, MLServer offloads the model inference to a pool of workers, where each worker is a separate Python process (and thus has its own separate GIL). This means that we can get full access to the underlying hardware.

Overhead

Managing the Inter-Process Communication (IPC) between the main MLServer process and the inference pool workers brings in some overhead. Under the hood, MLServer uses the multiprocessing library to implement the distributed processing management, which has been shown to offer the smallest possible overhead when implementing this type of distributed strategy (Zhi et al., 2020).

The extra overhead introduced by other libraries is usually a trade-off in exchange for other advanced features for complex distributed processing scenarios. However, MLServer's use case is simple enough to not require any of these.

Even though this overhead is minimised, it can still be particularly noticeable for lightweight inference methods, where the extra IPC overhead can take a large percentage of the overall time. In these cases (which can only be assessed on a model-by-model basis), the user has the option to disable the parallel inference feature.

For regular models where inference can take a bit more time, this overhead is usually offset by the benefit of having multiple cores to compute inference on.

Usage

By default, MLServer will always create an inference pool with one single worker. The number of workers (i.e. the size of the inference pool) can be adjusted globally through the server-level parallel_workers setting.

parallel_workers

The parallel_workers field of the settings.json file (or alternatively, the MLSERVER_PARALLEL_WORKERS global environment variable) controls the size of MLServer's inference pool. The expected values are:

  • N, where N > 0, will create a pool of N workers.

  • 0, will disable the parallel inference feature. In other words, inference will happen within the main MLServer process.
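The two ways of setting the pool size described above can be sketched as follows. The snippet only builds the settings.json content and the equivalent environment variable, and uses a hypothetical helper (inference_pool_mode) to spell out how each value is interpreted; the value 4 is an arbitrary example, not a recommendation.

```python
import json
from typing import Dict


def inference_pool_mode(parallel_workers: int) -> str:
    """Describe how a given parallel_workers value is interpreted."""
    if parallel_workers < 0:
        raise ValueError("parallel_workers must be 0 or a positive integer")
    if parallel_workers == 0:
        # Parallel inference disabled: inference runs in the main process.
        return "in-process"
    return f"pool of {parallel_workers} worker(s)"


# Content of a settings.json file requesting 4 workers (arbitrary example).
server_settings: Dict[str, int] = {"parallel_workers": 4}
settings_json = json.dumps(server_settings, indent=2)

# Equivalent configuration through the global environment variable.
env_override = {
    "MLSERVER_PARALLEL_WORKERS": str(server_settings["parallel_workers"])
}

mode = inference_pool_mode(server_settings["parallel_workers"])
```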

inference_pool_gid

The inference_pool_gid field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_INFERENCE_POOL_GID global environment variable) allows you to load models on a dedicated inference pool based on the group ID (GID), to prevent starvation behaviour.

Complementing the inference_pool_gid, if the autogenerate_inference_pool_gid field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_AUTOGENERATE_INFERENCE_POOL_GID global environment variable) is set to True, a UUID is automatically generated, and a dedicated inference pool will load the given model. This option is useful if the user wants to load a single model on a dedicated inference pool without having to manage the GID themselves.

References

Jiale Zhi, Rui Wang, Jeff Clune, and Kenneth O. Stanley. Fiber: A Platform for Efficient Development and Distributed Training for Reinforcement Learning and Population-Based Methods. arXiv:2003.11164 [cs, stat], March 2020. arXiv:2003.11164.


Seldon Core

MLServer is used as the core Python inference server in Seldon Core. Therefore, it should be straightforward to deploy your models either by using one of the built-in pre-packaged servers or by pointing to a custom image of MLServer.

This section assumes a basic knowledge of Seldon Core and Kubernetes, as well as access to a working Kubernetes cluster with Seldon Core installed. To learn more about Seldon Core or how to install it, please visit the Seldon Core documentation.

Pre-packaged Servers

Out of the box, Seldon Core comes with a few MLServer runtimes pre-configured to run straight away. This allows you to deploy an MLServer instance by just pointing to where your model artifact is and specifying what ML framework was used to train it.

Usage

To let Seldon Core know what framework was used to train your model, you can use the implementation field of your SeldonDeployment manifest. For example, to deploy a Scikit-Learn artifact stored remotely in GCS, one could do:

As the manifest for this example shows, all that we need to specify is that:

  • Our inference deployment should use the V2 inference protocol, which is done by setting the protocol field to v2.

  • Our model artifact is a serialised Scikit-Learn model, therefore it should be served using the MLServer SKLearn runtime, which is done by setting the implementation field to SKLEARN_SERVER.

Note that, while the protocol should always be set to v2 (i.e. so that models are served using the V2 inference protocol), the value of the implementation field will be dependent on your ML framework. The valid values of the implementation field are pre-determined by Seldon Core. However, it should also be possible to configure and add new ones (e.g. to support a custom MLServer runtime).

Once you have your SeldonDeployment manifest ready, then the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to just apply it directly through kubectl, by running:

To consult the supported values of the implementation field where MLServer is used, you can check the support table below.

Supported Pre-packaged Servers

As mentioned above, pre-packaged servers come built into Seldon Core. Therefore, only a pre-determined subset of them will be supported for a given release of Seldon Core.

The table below shows a list of the currently supported values of the implementation field. Each row will also show what ML framework they correspond to and also what MLServer runtime will be enabled internally on your model deployment when used.

Framework
MLServer Runtime
Seldon Core Pre-packaged Server
Documentation

Note that, on top of the ones shown above (backed by MLServer), Seldon Core also provides a wider set of pre-packaged servers. To check the full list, please visit the Seldon Core documentation.

Custom Runtimes

There could be cases where the pre-packaged MLServer runtimes supported out-of-the-box in Seldon Core may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, Seldon Core makes it easy to deploy them into your serving infrastructure.

Usage

The componentSpecs field of the SeldonDeployment manifest will allow us to let Seldon Core know what image should be used to serve a custom model. For example, if we assume that our custom image has been tagged as my-custom-server:0.1.0, we could write our SeldonDeployment manifest as follows:

As we can see highlighted on the snippet above, all that's needed to deploy a custom MLServer image is:

  • Letting Seldon Core know that the model deployment will be served through the V2 inference protocol, by setting the protocol field to v2.

  • Pointing our model container to use our custom MLServer image, by specifying it on the image field of the componentSpecs section of the manifest.

Once you have your SeldonDeployment manifest ready, then the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to just apply it directly through kubectl, by running:

Example SeldonDeployment manifest for a pre-packaged Scikit-Learn server:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-model
spec:
  protocol: v2
  predictors:
    - name: default
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: gs://seldon-models/sklearn/iris

kubectl apply -f my-seldondeployment-manifest.yaml

Supported pre-packaged servers (Framework: MLServer Runtime, Seldon Core Pre-packaged Server, Documentation):

  • Scikit-Learn: MLServer SKLearn, SKLEARN_SERVER, SKLearn Server

  • XGBoost: MLServer XGBoost, XGBOOST_SERVER, XGBoost Server

  • MLflow: MLServer MLflow, MLFLOW_SERVER, MLflow Server

Example SeldonDeployment manifest for a custom MLServer image:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-model
spec:
  protocol: v2
  predictors:
    - name: default
      graph:
        name: classifier
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                image: my-custom-server:0.1.0

kubectl apply -f my-seldondeployment-manifest.yaml

KServe

MLServer is used as the core Python inference server in KServe (formerly known as KFServing). This allows for a straightforward avenue to deploy your models into a scalable serving infrastructure backed by Kubernetes.

This section assumes a basic knowledge of KServe and Kubernetes, as well as access to a working Kubernetes cluster with KServe installed. To learn more about KServe or how to install it, please visit the KServe documentation.

Serving Runtimes

KServe provides built-in serving runtimes to deploy models trained in common ML frameworks. These allow you to deploy your models into a robust infrastructure by just pointing to where the model artifacts are stored remotely.

Some of these runtimes leverage MLServer as the core inference server. Therefore, it should be straightforward to move from your local testing to your serving infrastructure.

Usage

To use any of the built-in serving runtimes offered by KServe, it should be enough to select the relevant one in your InferenceService manifest.

For example, to serve a Scikit-Learn model, you could use a manifest like the one below:

As you can see highlighted above, the InferenceService manifest will only need to specify the following points:

  • The model artifact is a Scikit-Learn model. Therefore, we will use the sklearn serving runtime to deploy it.

  • The model will be served using the V2 inference protocol, which can be enabled by setting the protocolVersion field to v2.

Once you have your InferenceService manifest ready, then the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to just apply it directly through kubectl, by running:

Supported Serving Runtimes

As mentioned above, KServe offers support for built-in serving runtimes, some of which leverage MLServer as the inference server. Below you can find a table listing these runtimes, and the MLServer inference runtime that they correspond to.

Framework
MLServer Runtime
KServe Serving Runtime
Documentation

Note that, on top of the ones shown above (backed by MLServer), KServe also provides a wider set of serving runtimes. To see the full list, please visit the KServe documentation.

Custom Runtimes

Sometimes, the serving runtimes built into KServe may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, it's easy to deploy them into your serving infrastructure leveraging KServe support for custom runtimes.

Usage

The InferenceService manifest gives you full control over the containers used to deploy your machine learning model. This can be leveraged to point your deployment to the custom MLServer image containing your custom logic. For example, if we assume that our custom image has been tagged as my-custom-server:0.1.0, we could write an InferenceService manifest like the one below:

As the manifest for this example shows, the main points that we'll need to take into account are:

  • Pointing to our custom MLServer image in the custom container section of our InferenceService.

  • Explicitly choosing the V2 inference protocol to serve our model.

  • Letting KServe know what port will be exposed by our custom container to send inference requests.

Once you have your InferenceService manifest ready, then the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to just apply it directly through kubectl, by running:

Adaptive Batching

MLServer includes support to batch requests together transparently on-the-fly. We refer to this as "adaptive batching", although it can also be known as "predictive batching".

Benefits

There are usually two main reasons to adopt adaptive batching:


Example InferenceService manifest for a Scikit-Learn model:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    sklearn:
      protocolVersion: v2
      storageUri: gs://seldon-models/sklearn/iris

kubectl apply -f my-inferenceservice-manifest.yaml

Supported serving runtimes (Framework: MLServer Runtime, KServe Serving Runtime, Documentation):

  • Scikit-Learn: MLServer SKLearn, sklearn, SKLearn Serving Runtime

  • XGBoost: MLServer XGBoost, xgboost, XGBoost Serving Runtime

Example InferenceService manifest for a custom MLServer image:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    containers:
      - name: classifier
        image: my-custom-server:0.1.0
        env:
          - name: PROTOCOL
            value: v2
        ports:
          - containerPort: 8080
            protocol: TCP

kubectl apply -f my-inferenceservice-manifest.yaml

  • Maximise resource usage. Usually, inference operations are “vectorised” (i.e. are designed to operate across batches). For example, a GPU is designed to operate on multiple data points at the same time. Therefore, to make sure that it’s used at maximum capacity, we need to run inference across batches.

  • Minimise any inference overhead. Usually, all models will have to “pay” a constant overhead when running any type of inference. This can be something like IO to communicate with the GPU or some kind of processing of the incoming data. Up to a certain size, this overhead tends to not scale linearly with the number of data points. Therefore, it’s in our interest to send batches as large as we can without deteriorating performance.

However, these benefits will usually scale only up to a certain point, which is usually determined by either the infrastructure, the machine learning framework used to train your model, or a combination of both. Therefore, to maximise the performance improvements brought in by adaptive batching, it will be important to configure it with the appropriate values for your model. Since these values are usually found through experimentation, MLServer won't enable adaptive batching by default on newly loaded models.

    Usage

    MLServer lets you configure adaptive batching independently for each model through two main parameters:

    • Maximum batch size, that is how many requests you want to group together.

    • Maximum batch time, that is how much time we should wait for new requests until we reach our maximum batch size.
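To illustrate how these two parameters interact, the sketch below implements a toy batcher with asyncio: it waits for a first request, then keeps collecting requests until either the maximum batch size is reached or the maximum batch time has elapsed. This is not MLServer's implementation; the AdaptiveBatcher class, the double_all model function and the parameter values are all assumptions made for the example.

```python
import asyncio
from typing import Any, Callable, List, Tuple


class AdaptiveBatcher:
    """Toy batcher, for illustration only (not MLServer's implementation)."""

    def __init__(
        self,
        predict_batch: Callable[[List[Any]], List[Any]],
        max_batch_size: int,
        max_batch_time: float,
    ) -> None:
        self._predict_batch = predict_batch
        self._max_batch_size = max_batch_size
        self._max_batch_time = max_batch_time
        self._queue: asyncio.Queue = asyncio.Queue()

    async def predict(self, item: Any) -> Any:
        # Queue the single request and wait until its batch is processed.
        future: asyncio.Future = asyncio.get_running_loop().create_future()
        await self._queue.put((item, future))
        return await future

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first request of the next batch arrives.
            batch: List[Tuple[Any, asyncio.Future]] = [await self._queue.get()]
            deadline = loop.time() + self._max_batch_time
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break  # maximum batch time reached
                try:
                    batch.append(
                        await asyncio.wait_for(self._queue.get(), remaining)
                    )
                except asyncio.TimeoutError:
                    break  # no new request arrived in time
            # Run inference once for the whole batch, then split the results.
            outputs = self._predict_batch([item for item, _ in batch])
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


async def demo() -> Tuple[List[int], List[int]]:
    batch_sizes: List[int] = []

    def double_all(items: List[int]) -> List[int]:
        # Hypothetical "model" that records the size of each batch it sees.
        batch_sizes.append(len(items))
        return [2 * item for item in items]

    batcher = AdaptiveBatcher(double_all, max_batch_size=3, max_batch_time=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.predict(i) for i in range(5)))
    worker.cancel()
    return list(results), batch_sizes


results, batch_sizes = asyncio.run(demo())
```

With five concurrent requests and a maximum batch size of 3, the first batch fills up immediately, while the second one is flushed once the maximum batch time expires.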

    max_batch_size

    The max_batch_size field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_MAX_BATCH_SIZE global environment variable) controls the maximum number of requests that should be grouped together on each batch. The expected values are:

    • N, where N > 1, will create batches of up to N elements.

    • 0 or 1, will disable adaptive batching.

    max_batch_time

    The max_batch_time field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_MAX_BATCH_TIME global environment variable) controls the time that MLServer should wait for new requests to come in until we reach our maximum batch size.

    The expected format is in seconds, but it will take fractional values. That is, 500ms could be expressed as 0.5.

    The expected values are:

    • T, where T > 0, will wait T seconds at most.

    • 0, will disable adaptive batching.
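Putting both fields together, a model-settings.json enabling adaptive batching could be generated as in the sketch below. The model name, the implementation path and the batching values are placeholders chosen for illustration, and batching_enabled is a hypothetical helper that simply restates the rules listed above.

```python
import json
from typing import Any, Dict

# Placeholder values: tune max_batch_size and max_batch_time per model.
model_settings: Dict[str, Any] = {
    "name": "my-model",
    "implementation": "models.MyCustomRuntime",
    "max_batch_size": 16,   # group up to 16 requests per batch
    "max_batch_time": 0.5,  # wait at most 500ms for a batch to fill up
}


def batching_enabled(settings: Dict[str, Any]) -> bool:
    """Adaptive batching needs max_batch_size > 1 and max_batch_time > 0."""
    return (
        settings.get("max_batch_size", 0) > 1
        and settings.get("max_batch_time", 0) > 0
    )


model_settings_json = json.dumps(model_settings, indent=2)
enabled = batching_enabled(model_settings)
disabled_by_size = batching_enabled({**model_settings, "max_batch_size": 1})
disabled_by_time = batching_enabled({**model_settings, "max_batch_time": 0})
```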

    Merge and split of custom parameters

    MLServer allows adding custom parameters to the parameters field of the requests. Inside the server, these parameters are received as a merged list of parameters: the individual requests are combined into a single batched request, whose parameters hold one value per request.

    In the same way, if the model sends back a batched response, it will be returned unbatched from the server, with its parameters split back per request.

    from mlserver import types

    # Individual requests sent by the clients.
    # (custom_param stands in for the name of any custom parameter.)

    # Request 1
    types.RequestInput(
        name="parameters-np",
        shape=[1],
        datatype="BYTES",
        data=[],
        parameters=types.Parameters(
            custom_param="value-1",
        ),
    )

    # Request 2
    types.RequestInput(
        name="parameters-np",
        shape=[1],
        datatype="BYTES",
        data=[],
        parameters=types.Parameters(
            custom_param="value-2",
        ),
    )

    # Batched request, as received inside the server
    types.RequestInput(
        name="parameters-np",
        shape=[2],
        datatype="BYTES",
        data=[],
        parameters=types.Parameters(
            custom_param=["value-1", "value-2"],
        ),
    )

    # Batched response, as returned by the model
    types.ResponseOutput(
        name="foo",
        datatype="INT32",
        shape=[3, 3],
        data=[1, 2, 3, 4, 5, 6, 7, 8, 9],
        parameters=types.Parameters(
            content_type="np",
            foo=["foo_1", "foo_2"],
            bar=["bar_1", "bar_2", "bar_3"],
        ),
    )

    # Unbatched responses, as returned by the server

    # Request 1
    types.ResponseOutput(
        name="foo",
        datatype="INT32",
        shape=[1, 3],
        data=[1, 2, 3],
        parameters=types.Parameters(
            content_type="np", foo="foo_1", bar="bar_1"
        ),
    )

    # Request 2
    types.ResponseOutput(
        name="foo",
        datatype="INT32",
        shape=[1, 3],
        data=[4, 5, 6],
        parameters=types.Parameters(
            content_type="np", foo="foo_2", bar="bar_2"
        ),
    )

    # Request 3
    types.ResponseOutput(
        name="foo",
        datatype="INT32",
        shape=[1, 3],
        data=[7, 8, 9],
        parameters=types.Parameters(content_type="np", bar="bar_3"),
    )
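The merge and split behaviour shown in these snippets can be approximated with plain dictionaries, as in the sketch below. It is an illustration of the idea rather than MLServer's actual logic: merge_parameters and split_parameters are hypothetical helpers, and the merge step assumes that every request carries the same parameter names.

```python
from typing import Any, Dict, List


def merge_parameters(per_request: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Merge the parameters of several requests into lists, one per name."""
    merged: Dict[str, List[Any]] = {}
    for params in per_request:
        for key, value in params.items():
            merged.setdefault(key, []).append(value)
    return merged


def split_parameters(batched: Dict[str, Any], batch_size: int) -> List[Dict[str, Any]]:
    """Split batched parameters back into one dictionary per request."""
    per_request: List[Dict[str, Any]] = [{} for _ in range(batch_size)]
    for key, value in batched.items():
        for index, params in enumerate(per_request):
            if not isinstance(value, list):
                params[key] = value  # scalar values are shared by all requests
            elif index < len(value):
                params[key] = value[index]  # one list element per request
    return per_request


merged = merge_parameters(
    [{"custom_param": "value-1"}, {"custom_param": "value-2"}]
)

unbatched = split_parameters(
    {
        "content_type": "np",
        "foo": ["foo_1", "foo_2"],
        "bar": ["bar_1", "bar_2", "bar_3"],
    },
    batch_size=3,
)
```

Note how the third unbatched response only receives the bar parameter, since the foo list holds just two values.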

    Deployment

    MLServer is currently used as the core Python inference server in some of the most popular Kubernetes-native serving frameworks, including Seldon Core and KServe (formerly known as KFServing). This allows MLServer users to leverage the usability and maturity of these frameworks to take their model deployments to the next level of their MLOps journey, ensuring that they are served in a robust and scalable infrastructure.

    In general, it should be possible to deploy models using MLServer into any serving engine compatible with the V2 protocol. Alternatively, it's also possible to manage MLServer deployments manually as regular processes (i.e. in a non-Kubernetes-native way). However, this may be more involved and highly dependent on the deployment infrastructure.


    Custom Inference Runtimes

    There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.

    This page covers some of the bigger points that need to be taken into account when extending MLServer. You can also see this end-to-end example which walks through the process of writing a custom runtime.

    Writing a custom inference runtime

    MLServer is designed as an easy-to-extend framework, encouraging users to write their own custom runtimes easily. The starting point for this is the mlserver.MLModel abstract class, whose main methods are:

    • load() (mlserver.MLModel.load): Responsible for loading any artifacts related to a model (e.g. model weights, pickle files, etc.).

    • unload() (mlserver.MLModel.unload): Responsible for unloading the model, freeing any resources (e.g. GPU memory, etc.).

    • predict() (mlserver.MLModel.predict): Responsible for using a model to perform inference on an incoming data point.

    Therefore, the "one-line version" of how to write a custom runtime is to write a custom class extending mlserver.MLModel, and then override those methods with your custom logic.

    Simplified interface

    MLServer exposes an alternative "simplified" interface which can be used to write custom runtimes. This interface can be enabled by decorating your predict() method with the mlserver.codecs.decode_args decorator. This will let you specify in the method signature both how you want your request payload to be decoded and how to encode the response back.

    Based on the information provided in the method signature, MLServer will automatically decode the request payload into the different inputs specified as keyword arguments. Under the hood, this is implemented through MLServer's codecs.

    As an example of the above, let's assume a model which

    • Takes two lists of strings as inputs:

      • questions, containing multiple questions to ask our model.

      • context, containing multiple contexts for each of the questions.

    Leveraging MLServer's simplified notation, we can represent the above as the following custom runtime:

    Note that the method signature of our predict method now specifies:

    • The input names that we should be looking for in the request payload (i.e. questions and context).

    • The expected content type for each of the request inputs (i.e. List[str] on both cases).

    • The expected content type of the response outputs (i.e.

    Read and write headers

    There are occasions where custom logic must be made conditional to extra information sent by the client outside of the payload. To allow for these use cases, MLServer will map all incoming HTTP headers (in the case of REST) or metadata (in the case of gRPC) into the headers field of the parameters object within the InferenceRequest instance.

    Similarly, to return any HTTP headers (in the case of REST) or metadata (in the case of gRPC), you can append any values to the headers field within the parameters object of the returned InferenceResponse instance.

    Loading a custom MLServer runtime

    MLServer lets you load custom runtimes dynamically into a running instance of MLServer. Once you have your custom runtime ready, all you need to do is move it to your model folder, next to your model-settings.json configuration file.

    For example, if we assume a flat model repository where each folder represents a model, you would end up with a folder structure like the one below:

    Note that, from the example above, we are assuming that:

    • Your custom runtime code lives in the models.py file.

    • The implementation field of your model-settings.json configuration file contains the import path of your custom runtime (e.g. models.MyCustomRuntime).
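As a rough sketch of the layout described above, the snippet below scaffolds a model folder containing a models.py file next to its model-settings.json. The repository path, the model name (my-model) and the scaffold_model_folder helper are assumptions made for illustration; the models.py file is left as a stub where your runtime class would live.

```python
import json
import tempfile
from pathlib import Path


def scaffold_model_folder(repository_root: Path, model_name: str) -> Path:
    """Create a model folder with a runtime stub and its settings file."""
    model_dir = repository_root / model_name
    model_dir.mkdir(parents=True, exist_ok=True)

    # Stub for the custom runtime code (your MLModel subclass goes here).
    (model_dir / "models.py").write_text("# MyCustomRuntime goes here\n")

    # The implementation field holds the import path of the custom runtime.
    settings = {"name": model_name, "implementation": "models.MyCustomRuntime"}
    (model_dir / "model-settings.json").write_text(json.dumps(settings, indent=2))
    return model_dir


# Build the layout in a temporary directory, for demonstration purposes.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = scaffold_model_folder(Path(tmp) / "models", "my-model")
    created_files = sorted(path.name for path in model_dir.iterdir())
    saved_settings = json.loads((model_dir / "model-settings.json").read_text())
```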

    Loading a custom Python environment

    More often than not, your custom runtimes will depend on external third-party dependencies which are not included within the main MLServer package. In these cases, to load your custom runtime, MLServer will need access to these dependencies.

    It is possible to load this custom set of dependencies by providing them through an environment tarball, whose path can be specified within your model-settings.json file.

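    Status | Description
    🔴 | Unsupported
    🟢 | Supported
    🔵 | Untested

    Worker Python \ Server Python | 3.9 | 3.10 | 3.11
    3.9 | 🟢 | 🟢 | 🔵
    3.10 | 🟢 | 🟢 | 🔵
    3.11 | 🔵 | 🔵 | 🔵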
    If we take the previous example above as a reference, we could extend it to include our custom environment as:

    Note that, in the folder layout above, we are assuming that:

    • The environment.tar.gz tarball contains a pre-packaged version of your custom environment.

    • The environment_tarball field of your model-settings.json configuration file points to your pre-packaged custom environment (i.e. ./environment.tar.gz).

    Building a custom MLServer image

    MLServer offers built-in utilities to help you build a custom MLServer image. This image can contain any custom code (including custom inference runtimes), as well as any custom environment, provided either through a Conda environment file or a requirements.txt file.

    To leverage these, we can use the mlserver build command. Assuming that we're currently on the folder containing our custom inference runtime, we should be able to just run:

    The output will be a Docker image named my-custom-server, ready to be used.

    Custom Environment

    The mlserver build subcommand will search for any Conda environment file (i.e. named either environment.yaml or conda.yaml) and / or any requirements.txt present in your root folder. These can be used to tell MLServer what Python environment is required in the final Docker image.

    Default Settings

    The mlserver build subcommand will treat any settings.json or model-settings.json files present in your root folder as the default settings that must be set in your final image. Therefore, these files can be used to configure things like the default inference runtime to be used, or even to include embedded models that will always be present within your custom image.

    Custom Dockerfile

    Out-of-the-box, the mlserver build subcommand leverages a default Dockerfile which takes into account a number of requirements, like:

    • Supporting arbitrary user IDs.

    • Building your base custom environment on the fly.

    • Configuring a set of default setting values.

    However, there may be occasions where you need to customise your Dockerfile even further. This may be the case, for example, when you need to provide extra environment variables or when you need to customise your Docker build process (e.g. by using other "Docker-less" tools, like Kaniko or Buildah).

    To account for these cases, MLServer also includes a mlserver dockerfile subcommand which will just generate a Dockerfile (and optionally a .dockerignore file) exactly like the one used by the mlserver build command. This Dockerfile can then be customised according to your needs.

    Returns a Numpy array with some predictions as the output.

    np.ndarray
    ).

    🔵

    3.11

    🔵

    🔵

    🔵

    🔴

    Unsupported

    🟢

    Supported

    🔵

    Untested

    3.9

    🟢

    🟢

    🔵

    3.10

    🟢

    MLServer's "simplified" interface aims to cover use cases where encoding / decoding can be done through one of the codecs built into the MLServer package. However, there are instances where this may not be enough (e.g. variable number of inputs, variable content types, etc.). For these types of cases, please use MLServer's "advanced" interface, where you will have full control over the full encoding / decoding process.

    The headers field within the parameters section of the request / response is managed by MLServer. Therefore, incoming payloads where this field has been explicitly modified will be overridden.

    To load a custom environment, parallel inference must be enabled.

    The main MLServer process communicates with workers in custom environments via multiprocessing.Queue using pickled objects. Custom environments therefore must use the same version of MLServer and a compatible version of Python with the same default pickle protocol as the main process. Consult the tables below for environment compatibility.
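    As a rough aid for checking this compatibility, the sketch below collects the interpreter details that matter for pickling, using only the Python standard library. Running it in both the server environment and the custom environment and comparing the results is one way to spot a mismatch early. The helper name describe_pickle_compat is hypothetical and not part of MLServer.

```python
import pickle
import sys


def describe_pickle_compat() -> dict:
    """Collect the interpreter details that affect pickle compatibility.

    Hypothetical helper for illustration; MLServer does not ship it.
    """
    return {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "default_pickle_protocol": pickle.DEFAULT_PROTOCOL,
        "highest_pickle_protocol": pickle.HIGHEST_PROTOCOL,
    }


if __name__ == "__main__":
    # Compare this output between the main process and the custom environment
    print(describe_pickle_compat())
```

    If the default_pickle_protocol values differ between the two environments, the environments are likely to be incompatible under the constraint described above.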

    The mlserver build command expects that a Docker runtime is available and running in the background.

    The environment built by the mlserver build command will be global to the whole MLServer image (i.e. every loaded model will, by default, use that custom environment). For Multi-Model Serving scenarios, it may be better to use per-model custom environments instead, which will allow you to run multiple custom environments at the same time.

    Default setting values can still be overridden by external environment variables or model-specific model-settings.json.

    The base Dockerfile requires Docker's BuildKit to be enabled. To ensure BuildKit is used, you can use the DOCKER_BUILDKIT=1 environment variable, e.g.


    🟢

    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse
    
    class MyCustomRuntime(MLModel):
    
      async def load(self) -> bool:
        # TODO: Replace for custom logic to load a model artifact
        self._model = load_my_custom_model()
        return True
    
      async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # TODO: Replace for custom logic to run inference
        return self._model.predict(payload)
    import numpy as np
    
    from mlserver import MLModel
    from mlserver.codecs import decode_args
    from typing import List
    
    class MyCustomRuntime(MLModel):
    
      async def load(self) -> bool:
        # TODO: Replace for custom logic to load a model artifact
        self._model = load_my_custom_model()
        return True
    
      @decode_args
      async def predict(self, questions: List[str], context: List[str]) -> np.ndarray:
        # TODO: Replace for custom logic to run inference
        return self._model.predict(questions, context)
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse
    
    class CustomHeadersRuntime(MLModel):
    
      ...
    
      async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        if payload.parameters and payload.parameters.headers:
          # These are all the incoming HTTP headers / gRPC metadata
          print(payload.parameters.headers)
        ...
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse, Parameters
    
    class CustomHeadersRuntime(MLModel):
    
      ...
    
      async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        ...
        return InferenceResponse(
          # `model_name` is a required field of InferenceResponse
          model_name=self.name,
          # Include any actual outputs from inference
          outputs=[],
          parameters=Parameters(headers={"foo": "bar"})
        )
    .
    └── models
        └── sum-model
            ├── model-settings.json
            ├── models.py
    {
      "name": "sum-model",
      "implementation": "models.MyCustomRuntime"
    }
    .
    └── models
        └── sum-model
            ├── environment.tar.gz
            ├── model-settings.json
            ├── models.py
    {
      "name": "sum-model",
      "implementation": "models.MyCustomRuntime",
      "parameters": {
        "environment_tarball": "./environment.tar.gz"
      }
    }
    mlserver build . -t my-custom-server
    DOCKER_BUILDKIT=1 docker build . -t my-custom-runtime:0.1.0

    Content Types (and Codecs)

    Machine learning models generally expect their inputs to be passed down as a particular Python type. Most commonly, this type ranges from "general purpose" NumPy arrays or Pandas DataFrames to more granular definitions, like datetime objects, Pillow images, etc. Unfortunately, the definition of the V2 Inference Protocol doesn't cover any of these specific use cases. This protocol can be thought of as a wider "lower level" spec, which only defines what fields a payload should have.

    To account for this gap, MLServer introduces support for content types, which offer a way to let MLServer know how it should "decode" V2-compatible payloads. When shaped in the right way, these payloads should "encode" all the information required to extract the higher level Python type that will be required for a model.

    To illustrate the above, we can think of a Scikit-Learn pipeline, which takes in a Pandas DataFrame and returns a NumPy Array. Without the use of content types, the V2 payload itself would probably lack information about how this payload should be treated by MLServer. Likewise, the Scikit-Learn pipeline wouldn't know how to treat a raw V2 payload. In this scenario, the use of content types allows us to specify what the actual "higher level" information encoded within the V2 protocol payloads is.

    Usage

    To let MLServer know that a particular payload must be decoded / encoded as a different Python data type (e.g. NumPy Array, Pandas DataFrame, etc.), you can specify it through the content_type field of the parameters section of your request.

    As an example, we can consider the following dataframe, containing two columns: Age and First Name.

    First Name | Age
    Joanne | 34
    Michael | 22

    This table could be specified in the V2 protocol as the following payload, where we declare that:

    • The whole set of inputs should be decoded as a Pandas Dataframe (i.e. setting the content type as pd).

    • The First Name column should be decoded as a UTF-8 string (i.e. setting the content type as str).

    To learn more about the available content types and how to use them, you can see all the available ones in the section below.

    Codecs

    Under the hood, the conversion between content types is implemented using codecs. In the MLServer architecture, codecs are an abstraction which know how to encode and decode high-level Python types to and from the V2 Inference Protocol.

    Depending on the high-level Python type, encoding / decoding operations may require access to multiple input or output heads. For example, a Pandas Dataframe would need to aggregate all of the input-/output-heads present in a V2 Inference Protocol response.

    However, a Numpy array or a list of strings, could be encoded directly as an input head within a larger request.

    To account for this, codecs can work at either the request- / response-level (known as request codecs), or the input- / output-level (known as input codecs). Each of these codecs exposes the following public interface, where Any represents a high-level Python datatype (e.g. a Pandas Dataframe, a Numpy Array, etc.):

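    • Request Codecs

      • encode_request() <mlserver.codecs.RequestCodec.encode_request>

      • decode_request() <mlserver.codecs.RequestCodec.decode_request>

      • encode_response() <mlserver.codecs.RequestCodec.encode_response>

      • decode_response() <mlserver.codecs.RequestCodec.decode_response>

    • Input Codecs

      • encode_input() <mlserver.codecs.InputCodec.encode_input>

      • decode_input() <mlserver.codecs.InputCodec.decode_input>

      • encode_output() <mlserver.codecs.InputCodec.encode_output>

      • decode_output() <mlserver.codecs.InputCodec.decode_output>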

    Note that these methods can also be used as helpers to encode requests and decode responses on the client side. This can help to abstract away from the user most of the details about the underlying structure of V2-compatible payloads.

    For instance, in the example above, we could use codecs to encode the DataFrame into a V2-compatible request simply as:

    For a full end-to-end example on how content types and codecs work under the hood, feel free to check out this Content Type Decoding example.

    Converting to / from JSON

    When using MLServer's request codecs, the output of encoding payloads will always be one of the classes within the mlserver.types package (i.e. InferenceRequest <mlserver.types.InferenceRequest> or InferenceResponse <mlserver.types.InferenceResponse>). Therefore, if you want to use them with requests (or any other package outside of MLServer) you will need to convert them to a Python dict or a JSON string.

    Luckily, these classes leverage Pydantic under the hood. Therefore, you can just call the .model_dump() or .model_dump_json() method to convert them. Likewise, to read them back from JSON, we can always pass the JSON fields as kwargs to the class' constructor (or use any of the other methods available within Pydantic).

    For example, if we want to send an inference request to model foo, we could do something along the following lines:

    Support for NaN values

    The NaN (Not a Number) value is used in Numpy and other scientific libraries to describe an invalid or missing value (e.g. a division by zero). In some scenarios, it may be desirable to let your models receive and / or output NaN values (e.g. these can be useful sometimes with GBTs, like XGBoost models). This is why MLServer supports encoding NaN values on your request / response payloads under some conditions.

    In order to send / receive NaN values, you must ensure that:

    • You are using the REST interface.

    • The input / output entry containing NaN values uses either the FP16, FP32 or FP64 datatypes.

    • You are either using the Pandas codec or the Numpy codec.

    Assuming those conditions are satisfied, any null value within your tensor payload will be converted to NaN.
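    To make this conversion concrete, here is a minimal standalone sketch of the idea using only the standard library. It is not MLServer's implementation: the helper name nulls_to_nan is hypothetical, and it only handles a flat list of floating-point values.

```python
import json
import math


def nulls_to_nan(values):
    """Replace JSON nulls (None in Python) with NaN in a flat list of floats."""
    return [float("nan") if value is None else float(value) for value in values]


# A single FP64 input entry as it could arrive over REST
raw_entry = '{"name": "foo", "datatype": "FP64", "shape": [2, 2], "data": [1.2, 2.3, null, 4.5]}'

entry = json.loads(raw_entry)
decoded = nulls_to_nan(entry["data"])

# NaN never compares equal to itself, so use math.isnan to detect it
print([math.isnan(value) for value in decoded])
```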

    For example, if you take the following Numpy array:

    We could encode it as:

    Model Metadata

    Content types can also be defined as part of the model's metadata. This lets the user pre-configure which content types a model should use by default to decode / encode its requests / responses, without the need to specify them on each request.

    For example, to configure the content type values of the example above, one could create a model-settings.json file like the one below:

    It's important to keep in mind that content types passed explicitly as part of the request will always take precedence over the model's metadata. Therefore, we can leverage this to override the model's metadata when needed.
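    The precedence rule can be pictured with the simplified sketch below. It only models a single input entry as a plain dictionary and is not how MLServer resolves content types internally; the function name resolve_content_type is an assumption made for illustration.

```python
from typing import Optional


def resolve_content_type(request_input: dict, metadata_input: dict) -> Optional[str]:
    """Pick the content type for one input (hypothetical helper).

    A content type set explicitly on the request wins; otherwise we fall
    back to the one declared in the model's metadata, if any.
    """
    explicit = (request_input.get("parameters") or {}).get("content_type")
    if explicit is not None:
        return explicit
    return (metadata_input.get("parameters") or {}).get("content_type")


metadata_input = {"name": "First Name", "parameters": {"content_type": "str"}}

# No content type on the request: the metadata default applies
default_choice = resolve_content_type({"name": "First Name"}, metadata_input)

# Explicit content type on the request: it overrides the metadata
override_choice = resolve_content_type(
    {"name": "First Name", "parameters": {"content_type": "base64"}},
    metadata_input,
)
```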

    Available Content Types

    Out of the box, MLServer supports the following list of content types. However, this can be extended through the use of 3rd-party or custom runtimes.

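    Python Type | Content Type | Request Level | Request Codec | Input Level | Input Codec
    NumPy Array | np | ✅ | mlserver.codecs.NumpyRequestCodec | ✅ | mlserver.codecs.NumpyCodec
    Pandas DataFrame | pd | ✅ | mlserver.codecs.PandasCodec | ❌ |
    UTF-8 String | str | ✅ | mlserver.codecs.string.StringRequestCodec | ✅ | mlserver.codecs.StringCodec
    Base64 | base64 | ❌ | | ✅ | mlserver.codecs.Base64Codec
    Datetime | datetime | ❌ | | ✅ | mlserver.codecs.DatetimeCodec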

    NumPy Array

    The np content type will decode / encode V2 payloads to a NumPy Array, taking into account the following:

    • The datatype field will be matched to the closest NumPy dtype.

    • The shape field will be used to reshape the flattened array expected by the V2 protocol into the expected tensor shape.

    For example, if we think of the following NumPy Array:

    We could encode it as the input foo in a V2 protocol request as:

    When using the NumPy Array content type at the request-level, it will decode the entire request by considering only the first input element. This can be used as a helper for models which only expect a single tensor.

    Pandas DataFrame

    The pd content type will decode / encode a V2 request into a Pandas DataFrame. For this, it will expect that the DataFrame is shaped in a columnar way. That is,

    • Each entry of the inputs list (or outputs, in the case of responses), will represent a column of the DataFrame.

    • Each of these entries will contain all the row elements for that particular column.

    • The shape field of each input (or output) entry will contain (at least) the amount of rows included in the dataframe.
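    The columnar layout described above can be sketched in plain Python as follows. This is an illustration of the layout only, not the PandasCodec implementation: the helper name rows_to_columnar_inputs is hypothetical, and the datatype field is deliberately left out to keep the example short.

```python
from typing import Any, Dict, List


def rows_to_columnar_inputs(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn row-oriented records into one V2-style input entry per column."""
    if not rows:
        return []

    inputs = []
    for column in rows[0]:
        # Each entry holds every row value for a single column
        values = [row[column] for row in rows]
        inputs.append({"name": column, "shape": [len(values)], "data": values})
    return inputs


rows = [
    {"A": "a1", "B": "b1"},
    {"A": "a2", "B": "b2"},
]
inputs = rows_to_columnar_inputs(rows)
```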

    For example, if we consider the following dataframe:

    A | B | C
    a1 | b1 | c1
    a2 | b2 | c2
    a3 | b3 | c3
    a4 | b4 | c4

    We could encode it to the V2 Inference Protocol as:

    UTF-8 String

    The str content type lets you encode / decode a V2 input into a UTF-8 Python string, taking into account the following:

    • The expected datatype is BYTES.

    • The shape field represents the number of "strings" that are encoded in the payload (e.g. the ["hello world", "one more time"] payload will have a shape of 2 elements).
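    As a rough illustration of these two rules, the sketch below encodes a list of Python strings as UTF-8 bytes and decodes them back. The helper names encode_strings and decode_strings are hypothetical; MLServer's own StringCodec should be preferred in real code.

```python
from typing import Any, Dict, List


def encode_strings(name: str, strings: List[str]) -> Dict[str, Any]:
    """Build a V2-style entry: BYTES datatype, shape = number of strings."""
    return {
        "name": name,
        "datatype": "BYTES",
        "shape": [len(strings)],
        "data": [text.encode("utf-8") for text in strings],
    }


def decode_strings(entry: Dict[str, Any]) -> List[str]:
    """Decode each element back to str, accepting bytes or already-decoded text."""
    return [
        item.decode("utf-8") if isinstance(item, bytes) else item
        for item in entry["data"]
    ]


entry = encode_strings("foo", ["hello world", "one more time"])
round_trip = decode_strings(entry)
```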

    For example, if we consider the following list of strings:

    We could encode it to the V2 Inference Protocol as:

    When using the str content type at the request-level, it will decode the entire request by considering only the first input element. This can be used as a helper for models which only expect a single string or a set of strings.

    Base64

    The base64 content type will decode a binary V2 payload into a Base64-encoded string (and vice versa), taking into account the following:

    • The expected datatype is BYTES.

    • The data field should contain the base64-encoded binary strings.

    • The shape field represents the number of binary strings that are encoded in the payload.
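    These rules can be checked with the standard library alone. The sketch below builds a V2-style entry by hand; the helper name to_base64_entry is hypothetical and this is not the Base64Codec implementation.

```python
import base64
from typing import Any, Dict, List


def to_base64_entry(name: str, blobs: List[bytes]) -> Dict[str, Any]:
    """Build a V2-style entry whose data holds Base64-encoded binary strings."""
    return {
        "name": name,
        "parameters": {"content_type": "base64"},
        "datatype": "BYTES",
        "shape": [len(blobs)],
        "data": [base64.b64encode(blob).decode("ascii") for blob in blobs],
    }


entry = to_base64_entry("foo", [b"Python is fun"])

# Decoding the data field recovers the original bytes
decoded = [base64.b64decode(item) for item in entry["data"]]
```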

    For example, if we think of the following "bytes array":

    We could encode it as the input foo of a V2 request as:

    Datetime

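    The datetime content type will decode a V2 input into a Python datetime.datetime object, taking into account the following:

    • The expected datatype is BYTES.

    • The data field should contain the dates serialised following the ISO 8601 standard.

    • The shape field represents the number of datetimes that are encoded in the payload.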
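    The serialisation rule can be reproduced with the standard library, as in the sketch below. The helper name to_datetime_entry is hypothetical and the code is not the DatetimeCodec implementation; it only shows what an ISO 8601 round trip looks like.

```python
import datetime
from typing import Any, Dict, List


def to_datetime_entry(name: str, values: List[datetime.datetime]) -> Dict[str, Any]:
    """Build a V2-style entry whose data holds ISO 8601 strings."""
    return {
        "name": name,
        "parameters": {"content_type": "datetime"},
        "datatype": "BYTES",
        "shape": [len(values)],
        "data": [value.isoformat() for value in values],
    }


entry = to_datetime_entry("foo", [datetime.datetime(2022, 1, 11, 11, 0, 0)])

# Parsing the ISO 8601 strings recovers the original datetime objects
decoded = [datetime.datetime.fromisoformat(item) for item in entry["data"]]
```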

    For example, if we think of the following datetime object:

    We could encode it as the input foo of a V2 request as:

    encode_response() <mlserver.codecs.RequestCodec.encode_response>

  • decode_response() <mlserver.codecs.RequestCodec.decode_response>

  • Input Codecs

    • encode_input() <mlserver.codecs.InputCodec.encode_input>

    • decode_input() <mlserver.codecs.InputCodec.decode_input>

    • encode_output() <mlserver.codecs.InputCodec.encode_output>

    • decode_output() <mlserver.codecs.InputCodec.decode_output>

  • You are either using the Pandas codec or the Numpy codec.

    mlserver.codecs.NumpyCodec

    pd

    ✅

    mlserver.codecs.PandasCodec

    ❌

    str

    ✅

    mlserver.codecs.string.StringRequestCodec

    ✅

    mlserver.codecs.StringCodec

    base64

    ❌

    ✅

    mlserver.codecs.Base64Codec

    datetime

    ❌

    ✅

    mlserver.codecs.DatetimeCodec

    (or
    output
    ) entry will contain (at least) the amount of rows included in the dataframe.

    b3

    c3

    a4

    b4

    c4

    field represents the number of binary strings that are encoded in the payload.
    field represents the number of datetimes that are encoded in the payload.

    Joanne

    34

    Michael

    22

    {
      "parameters": {
        "content_type": "pd"
      },
      "inputs": [
        {
          "name": "First Name",
          "datatype": "BYTES",
          "parameters": {
            "content_type": "str"
          },
          "shape": [2],
          "data": ["Joanne", "Michael"]
        },
        {
          "name": "Age",
          "datatype": "INT32",
          "shape": [2],
          "data": [34, 22]
        }
      ]
    }
    import pandas as pd
    
    from mlserver.codecs import PandasCodec
    
    dataframe = pd.DataFrame({'First Name': ["Joanne", "Michael"], 'Age': [34, 22]})
    
    inference_request = PandasCodec.encode_request(dataframe)
    print(inference_request)
    import pandas as pd
    import requests
    
    from mlserver.codecs import PandasCodec
    from mlserver.types import InferenceResponse
    
    dataframe = pd.DataFrame({'First Name': ["Joanne", "Michael"], 'Age': [34, 22]})
    
    inference_request = PandasCodec.encode_request(dataframe)
    
    # raw_request will be a Python dictionary compatible with `requests`'s `json` kwarg
    raw_request = inference_request.model_dump()
    
    # The URL needs an explicit scheme (e.g. http://) for `requests` to accept it
    response = requests.post("http://localhost:8080/v2/models/foo/infer", json=raw_request)
    
    # raw_response will be a dictionary (loaded from the response's JSON),
    # therefore we can pass it as the InferenceResponse constructors' kwargs
    raw_response = response.json()
    inference_response = InferenceResponse(**raw_response)
    import numpy as np
    
    foo = np.array([[1.2, 2.3], [np.nan, 4.5]])
    {
      "inputs": [
        {
          "name": "foo",
          "parameters": {
            "content_type": "np"
          },
          "data": [1.2, 2.3, null, 4.5],
          "datatype": "FP64",
          "shape": [2, 2]
        }
      ]
    }
    {
      "parameters": {
        "content_type": "pd"
      },
      "inputs": [
        {
          "name": "First Name",
          "datatype": "BYTES",
          "parameters": {
            "content_type": "str"
          },
          "shape": [-1]
        },
        {
          "name": "Age",
          "datatype": "INT32",
          "shape": [-1]
        }
      ]
    }

    NumPy Array

    np

    ✅

    mlserver.codecs.NumpyRequestCodec

    import numpy as np
    
    foo = np.array([[1, 2], [3, 4]])
    {
      "inputs": [
        {
          "name": "foo",
          "parameters": {
            "content_type": "np"
          },
          "data": [1, 2, 3, 4],
          "datatype": "INT32",
          "shape": [2, 2]
        }
      ]
    }
    from mlserver.codecs import NumpyRequestCodec
    
    # Encode an entire V2 request
    inference_request = NumpyRequestCodec.encode_request(foo)
    from mlserver.types import InferenceRequest
    from mlserver.codecs import NumpyCodec
    
    # We can use the `NumpyCodec` to encode a single input head with name `foo`
    # within a larger request
    inference_request = InferenceRequest(
      inputs=[
        NumpyCodec.encode_input("foo", foo)
      ]
    )

    a1

    b1

    c1

    a2

    b2

    c2

    {
      "parameters": {
        "content_type": "pd"
      },
      "inputs": [
        {
          "name": "A",
          "data": ["a1", "a2", "a3", "a4"],
          "datatype": "BYTES",
          "shape": [4]
        },
        {
          "name": "B",
          "data": ["b1", "b2", "b3", "b4"],
          "datatype": "BYTES",
          "shape": [4]
        },
        {
          "name": "C",
          "data": ["c1", "c2", "c3", "c4"],
          "datatype": "BYTES",
          "shape": [4]
        }
      ]
    }
    import pandas as pd
    
    from mlserver.codecs import PandasCodec
    
    foo = pd.DataFrame({
      "A": ["a1", "a2", "a3", "a4"],
      "B": ["b1", "b2", "b3", "b4"],
      "C": ["c1", "c2", "c3", "c4"]
    })
    
    inference_request = PandasCodec.encode_request(foo)
    foo = ["bar", "bar2"]
    {
      "parameters": {
        "content_type": "str"
      },
      "inputs": [
        {
          "name": "foo",
          "data": ["bar", "bar2"],
          "datatype": "BYTES",
          "shape": [2]
        }
      ]
    }
    from mlserver.codecs.string import StringRequestCodec
    
    # Encode an entire V2 request
    inference_request = StringRequestCodec.encode_request(foo, use_bytes=False)
    from mlserver.types import InferenceRequest
    from mlserver.codecs import StringCodec
    
    # We can use the `StringCodec` to encode a single input head with name `foo`
    # within a larger request
    inference_request = InferenceRequest(
      inputs=[
        StringCodec.encode_input("foo", foo, use_bytes=False)
      ]
    )
    foo = b"Python is fun"
    {
      "inputs": [
        {
          "name": "foo",
          "parameters": {
            "content_type": "base64"
          },
          "data": ["UHl0aG9uIGlzIGZ1bg=="],
          "datatype": "BYTES",
          "shape": [1]
        }
      ]
    }
    from mlserver.types import InferenceRequest
    from mlserver.codecs import Base64Codec
    
    # We can use the `Base64Codec` to encode a single input head with name `foo`
    # within a larger request
    inference_request = InferenceRequest(
      inputs=[
        Base64Codec.encode_input("foo", foo, use_bytes=False)
      ]
    )
    import datetime
    
    foo = datetime.datetime(2022, 1, 11, 11, 0, 0)
    {
      "inputs": [
        {
          "name": "foo",
          "parameters": {
            "content_type": "datetime"
          },
          "data": ["2022-01-11T11:00:00"],
          "datatype": "BYTES",
          "shape": [1]
        }
      ]
    }
    from mlserver.types import InferenceRequest
    from mlserver.codecs import DatetimeCodec
    
    # We can use the `DatetimeCodec` to encode a single input head with name `foo`
    # within a larger request
    inference_request = InferenceRequest(
      inputs=[
        DatetimeCodec.encode_input("foo", foo, use_bytes=False)
      ]
    )

    Some inference runtimes may apply a content type by default if none is present. To learn more about each runtime's defaults, please check the relevant inference runtime's docs.

    It's important to keep in mind that content types can be specified at both the request level and the input level. The former will apply to the entire set of inputs, whereas the latter will only apply to a particular input of the payload.

    MLServer allows you to extend the supported content types by adding custom ones. To learn more about how to write your own custom content types, you can check this full end-to-end example. You can also learn more about building custom extensions for MLServer on the Custom Inference Runtime section of the docs.

    The V2 Inference Protocol expects that the data of each input is sent as a flat array. Therefore, the np content type will expect that tensors are sent flattened. The information in the shape field will then be used to reshape the vector into the right dimensions.
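    The flatten-then-reshape convention can be illustrated without any dependencies. The sketch below assumes row-major ordering and rebuilds nested lists from a flat data list and a shape; the function name reshape is hypothetical and stands in for what a NumPy reshape would do.

```python
from typing import Any, List


def reshape(flat: List[Any], shape: List[int]) -> Any:
    """Rebuild nested lists of the given shape from a flat, row-major list."""
    expected = 1
    for dim in shape:
        expected *= dim
    if expected != len(flat):
        raise ValueError(f"cannot reshape {len(flat)} elements into shape {shape}")

    # Zero or one dimension left: the flat list is already the result
    if len(shape) <= 1:
        return list(flat)

    rows = shape[0]
    if rows == 0:
        return []

    # Split the flat list into `rows` equal chunks and recurse on each one
    step = len(flat) // rows
    return [reshape(flat[i * step:(i + 1) * step], shape[1:]) for i in range(rows)]


matrix = reshape([1, 2, 3, 4], [2, 2])
column = reshape([1, 2, 3, 4], [4, 1])
```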

    By default, MLServer will always assume that an array with a single-dimensional shape, e.g. [N], is equivalent to [N, 1]. That is, each entry will be treated like a single one-dimensional data point (i.e. instead of a [1, D] array, where the full array is a single D-dimensional data point). To avoid any ambiguity, where possible, the Numpy codec will always explicitly encode [N] arrays as [N, 1].

    The pd content type can be stacked with other content types. This allows the user to use a different set of content types to decode each of the columns.


    ✅

    a3

    Pandas DataFrame
    UTF-8 String
    Base64
    Datetime