MLServer

User Guide

Seldon Core

MLServer is used as the core Python inference server in Seldon Core. Therefore, it should be straightforward to deploy your models either by using one of the built-in pre-packaged servers or by pointing to a custom image of MLServer.

This section assumes a basic knowledge of Seldon Core and Kubernetes, as well as access to a working Kubernetes cluster with Seldon Core installed. To learn more about Seldon Core or how to install it, please visit the Seldon Core documentation.

Pre-packaged Servers

Out of the box, Seldon Core comes with a few MLServer runtimes pre-configured to run straight away. This allows you to deploy an MLServer instance by just pointing to where your model artifact is and specifying what ML framework was used to train it.

Usage

To let Seldon Core know what framework was used to train your model, you can use the implementation field of your SeldonDeployment manifest. For example, to deploy a Scikit-Learn artifact stored remotely in GCS, one could do:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-model
spec:
  protocol: v2
  predictors:
    - name: default
      graph:
        name: classifier
        implementation: SKLEARN_SERVER
        modelUri: gs://seldon-models/sklearn/iris

As you can see above, all that we need to specify is that:

  • Our inference deployment should use the V2 inference protocol, which is done by setting the protocol field to v2.

  • Our model artifact is a serialised Scikit-Learn model, therefore it should be served using the MLServer SKLearn runtime, which is done by setting the implementation field to SKLEARN_SERVER.

Note that, while the protocol should always be set to v2 (i.e. so that models are served using the V2 inference protocol), the value of the implementation field will depend on your ML framework. The valid values of the implementation field are pre-determined by Seldon Core. However, it should also be possible to configure and add new ones (e.g. to support a custom MLServer runtime).

Once you have your SeldonDeployment manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through kubectl, by running:

kubectl apply -f my-seldondeployment-manifest.yaml

To consult the supported values of the implementation field where MLServer is used, you can check the support table below.

Supported Pre-packaged Servers

As mentioned above, pre-packaged servers come built into Seldon Core. Therefore, only a pre-determined subset of them will be supported for a given release of Seldon Core.

The table below lists the currently supported values of the implementation field. Each row also shows what ML framework the value corresponds to and what MLServer runtime will be enabled internally on your model deployment when it is used.

| Framework | MLServer Runtime | Seldon Core Pre-packaged Server | Documentation |
| --- | --- | --- | --- |
| Scikit-Learn | MLServer SKLearn | SKLEARN_SERVER | SKLearn Server |
| XGBoost | MLServer XGBoost | XGBOOST_SERVER | XGBoost Server |
| MLflow | MLServer MLflow | MLFLOW_SERVER | MLflow Server |

Note that, on top of the ones shown above (backed by MLServer), Seldon Core also provides a wider set of pre-packaged servers. To check the full list, please visit the Seldon Core documentation.

Custom Runtimes

There could be cases where the pre-packaged MLServer runtimes supported out-of-the-box in Seldon Core may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime, and Seldon Core makes it just as easy to deploy them into your serving infrastructure.

Usage

The componentSpecs field of the SeldonDeployment manifest lets Seldon Core know what image should be used to serve a custom model. For example, if we assume that our custom image has been tagged as my-custom-server:0.1.0, we could write our SeldonDeployment manifest as follows:

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-model
spec:
  protocol: v2
  predictors:
    - name: default
      graph:
        name: classifier
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                image: my-custom-server:0.1.0

As we can see in the snippet above, all that's needed to deploy a custom MLServer image is:

  • Letting Seldon Core know that the model deployment will be served through the V2 inference protocol, by setting the protocol field to v2.

  • Pointing our model container to use our custom MLServer image, by specifying it on the image field of the componentSpecs section of the manifest.

Once you have your SeldonDeployment manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through kubectl, by running:

kubectl apply -f my-seldondeployment-manifest.yaml
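Once a deployment like the ones above is running, the model can be queried over the V2 inference protocol. The sketch below only builds the request and does not send it. It assumes a hypothetical ingress address, the default namespace, and a `/seldon/<namespace>/<deployment>/v2/models/<model>/infer` route, all of which may differ in your cluster. The input name and the four-feature row are purely illustrative.

```python
import json


def build_v2_infer_request(host, namespace, deployment, model, rows):
    """Build the URL and JSON body for a V2 inference request.

    The URL layout is an assumption about how Seldon Core exposes
    deployments through its ingress; adjust it to match your cluster.
    """
    url = (
        f"http://{host}/seldon/{namespace}/{deployment}"
        f"/v2/models/{model}/infer"
    )
    payload = {
        "inputs": [
            {
                "name": "predict",  # illustrative input name
                "shape": [len(rows), len(rows[0])],
                "datatype": "FP32",
                # V2 tensors can be sent flattened in row-major order
                "data": [value for row in rows for value in row],
            }
        ]
    }
    return url, payload


url, payload = build_v2_infer_request(
    host="localhost:8003",  # hypothetical ingress address
    namespace="default",
    deployment="my-model",
    model="classifier",
    rows=[[5.1, 3.5, 1.4, 0.2]],
)
print(url)
print(json.dumps(payload, indent=2))
```

The resulting URL and payload could then be passed to any HTTP client as a POST request with a JSON body.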

MLServer

An open source inference server for your machine learning models.

Overview

MLServer aims to provide an easy way to start serving your machine learning models through a REST and gRPC interface, fully compliant with KFServing's V2 Dataplane spec. Watch a quick video introducing the project here.

  • Multi-model serving, letting users run multiple models within the same process.

  • Ability to run inference in parallel for vertical scaling across multiple models through a pool of inference workers.

  • Support for adaptive batching, to group inference requests together on the fly.

  • Scalability with deployment in Kubernetes native frameworks, including Seldon Core and KServe (formerly known as KFServing), where MLServer is the core Python inference server used to serve machine learning models.

  • Support for the standard V2 Inference Protocol on both the gRPC and REST flavours, which has been standardised and adopted by various model serving frameworks.

You can read more about the goals of this project on the initial design document.

Usage

You can install the mlserver package by running:

pip install mlserver

Note that to use any of the optional inference runtimes, you'll need to install the relevant package. For example, to serve a scikit-learn model, you would need to install the mlserver-sklearn package:

pip install mlserver-sklearn

For further information on how to use MLServer, you can check any of the available examples.

Inference Runtimes

Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice. You can read more about inference runtimes in their documentation page.

Out of the box, MLServer comes with a set of pre-packaged runtimes which let you interact with a subset of common frameworks. This allows you to start serving models saved in these frameworks straight away. However, it's also possible to write custom runtimes.

Out of the box, MLServer provides support for:

| Framework | Supported | Documentation |
| --- | --- | --- |
| Scikit-Learn | ✅ | MLServer SKLearn |
| XGBoost | ✅ | MLServer XGBoost |
| Spark MLlib | ✅ | MLServer MLlib |
| LightGBM | ✅ | MLServer LightGBM |
| CatBoost | ✅ | MLServer CatBoost |
| Tempo | ✅ | github.com/SeldonIO/tempo |
| MLflow | ✅ | MLServer MLflow |
| Alibi-Detect | ✅ | MLServer Alibi Detect |
| Alibi-Explain | ✅ | MLServer Alibi Explain |
| HuggingFace | ✅ | MLServer HuggingFace |

Supported Python Versions

  • 🔴 Unsupported

  • 🟠 Deprecated: To be removed in a future version

  • 🟢 Supported

  • 🔵 Untested

| Python Version | Status |
| --- | --- |
| 3.7 | 🔴 |
| 3.8 | 🔴 |
| 3.9 | 🟢 |
| 3.10 | 🟢 |
| 3.11 | 🟢 |
| 3.12 | 🟢 |
| 3.13 | 🔴 |

Examples

To see MLServer in action, check out our full list of examples. You can find below a few selected examples showcasing how you can leverage MLServer to start serving your machine learning models.

  • Serving a scikit-learn model

  • Serving a xgboost model

  • Serving a lightgbm model

  • Serving a catboost model

  • Serving a tempo pipeline

  • Serving a custom model

  • Serving an alibi-detect model

  • Serving a HuggingFace model

  • Multi-Model Serving with multiple frameworks

  • Loading / unloading models from a model repository

Developer Guide

Versioning

Both the main mlserver package and the inference runtimes packages try to follow the same versioning schema. To bump the version across all of them, you can use the ./hack/update-version.sh script.

We generally keep the version as a placeholder for an upcoming version.

For example:

./hack/update-version.sh 0.2.0.dev1

Testing

To run all of the tests for MLServer and the runtimes, use:

make test

To run tests for a single file, use something like:

tox -e py3 -- tests/batch_processing/test_rest.py
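As a rough illustration of how one of the inference runtimes listed in this overview gets selected for a model, the sketch below writes a model-settings.json file that points MLServer at the SKLearn runtime. The model name, the temporary folder and the artifact path are hypothetical, and the `mlserver_sklearn.SKLearnModel` implementation path is assumed to be the class provided by the mlserver-sklearn package.

```python
import json
import tempfile
from pathlib import Path


def write_model_settings(folder, name, implementation, uri):
    """Write a model-settings.json file into `folder` and return its path."""
    settings = {
        "name": name,
        # Dotted import path of the runtime class that will serve the model
        "implementation": implementation,
        # Location of the serialised model artifact
        "parameters": {"uri": uri},
    }
    path = Path(folder) / "model-settings.json"
    path.write_text(json.dumps(settings, indent=2))
    return path


# A temporary folder stands in for a real model folder in this sketch
model_folder = tempfile.mkdtemp()
settings_path = write_model_settings(
    folder=model_folder,
    name="iris-classifier",  # hypothetical model name
    implementation="mlserver_sklearn.SKLearnModel",
    uri="./model.joblib",  # hypothetical artifact path
)
print(settings_path.read_text())
```

With a file like this next to the model artifact, starting MLServer from that folder (for example with `mlserver start .`) should pick up the model.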

KServe

MLServer is used as the core Python inference server in KServe (formerly known as KFServing). This allows for a straightforward avenue to deploy your models into a scalable serving infrastructure backed by Kubernetes.

This section assumes a basic knowledge of KServe and Kubernetes, as well as access to a working Kubernetes cluster with KServe installed. To learn more about KServe or how to install it, please visit the KServe documentation.

Serving Runtimes

KServe provides built-in serving runtimes to deploy models trained in common ML frameworks. These allow you to deploy your models into a robust infrastructure by just pointing to where the model artifacts are stored remotely.

Some of these runtimes leverage MLServer as the core inference server. Therefore, it should be straightforward to move from your local testing to your serving infrastructure.

Usage

To use any of the built-in serving runtimes offered by KServe, it should be enough to select the relevant one in your InferenceService manifest.

For example, to serve a Scikit-Learn model, you could use a manifest like the one below:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    sklearn:
      protocolVersion: v2
      storageUri: gs://seldon-models/sklearn/iris

As you can see above, the InferenceService manifest only needs to specify the following points:

  • The model artifact is a Scikit-Learn model. Therefore, we will use the sklearn serving runtime to deploy it.

  • The model will be served using the V2 inference protocol, which can be enabled by setting the protocolVersion field to v2.

Once you have your InferenceService manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through kubectl, by running:

kubectl apply -f my-inferenceservice-manifest.yaml

Supported Serving Runtimes

As mentioned above, KServe offers support for built-in serving runtimes, some of which leverage MLServer as the inference server. Below you can find a table listing these runtimes, and the MLServer inference runtime that they correspond to.

| Framework | MLServer Runtime | KServe Serving Runtime | Documentation |
| --- | --- | --- | --- |
| Scikit-Learn | MLServer SKLearn | sklearn | SKLearn Serving Runtime |
| XGBoost | MLServer XGBoost | xgboost | XGBoost Serving Runtime |

Note that, on top of the ones shown above (backed by MLServer), KServe also provides a wider set of serving runtimes. To see the full list, please visit the KServe documentation.

Custom Runtimes

Sometimes, the serving runtimes built into KServe may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, it's easy to deploy them into your serving infrastructure leveraging KServe support for custom runtimes.

Usage

The InferenceService manifest gives you full control over the containers used to deploy your machine learning model. This can be leveraged to point your deployment to the custom MLServer image containing your custom logic. For example, if we assume that our custom image has been tagged as my-custom-server:0.1.0, we could write an InferenceService manifest like the one below:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    containers:
      - name: classifier
        image: my-custom-server:0.1.0
        env:
          - name: PROTOCOL
            value: v2
        ports:
          - containerPort: 8080
            protocol: TCP

As we can see above, the main points that we'll need to take into account are:

  • Pointing to our custom MLServer image in the custom container section of our InferenceService.

  • Explicitly choosing the V2 inference protocol to serve our model.

  • Letting KServe know which port our custom container exposes for inference requests.

Once you have your InferenceService manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through kubectl, by running:

kubectl apply -f my-inferenceservice-manifest.yaml

Streaming

Out of the box, MLServer includes support for streaming data to your models. Streaming support is available for both the REST and gRPC servers.

REST Server

Streaming support for the REST server is limited to server streaming. This means that the client sends a single request to the server, and the server responds with a stream of data.

The streaming endpoints are available for both the infer and generate methods through the following endpoints:

  • /v2/models/{model_name}/versions/{model_version}/infer_stream

  • /v2/models/{model_name}/infer_stream

  • /v2/models/{model_name}/versions/{model_version}/generate_stream

  • /v2/models/{model_name}/generate_stream

Note that for REST, the generate and generate_stream endpoints are aliases for the infer and infer_stream endpoints, respectively. Those names are used to better reflect the nature of the operation for Large Language Models (LLMs).

gRPC Server

Streaming support for the gRPC server is available for both client and server streaming. This means that the client sends a stream of data to the server, and the server responds with a stream of data.

The two streams operate independently, so the client and the server can read and write data however they want (e.g., the server could either wait to receive all the client messages before sending a response, or it could send a response after each message). Note that bi-directional streaming covers all the possible combinations of client and server streaming: unary-stream, stream-unary, and stream-stream. The unary-unary case can be covered by bi-directional streaming as well, but mlserver already has the predict method dedicated to this use case. The logic for how requests are received and processed, and how responses are sent back, should be built into the runtime logic.

The stub method for streaming to be used by the client is ModelStreamInfer.

Limitation

The main limitations of the streaming support in MLServer are:

  • the parallel_workers setting should be set to 0 to disable distributed workers (to be addressed in future releases)

  • for REST, the gzip_enabled setting should be set to false to disable GZIP compression, as streaming is not compatible with GZIP compression
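The REST streaming endpoints and the settings limitations described in the Streaming section can be sketched as two small helpers. This is a minimal illustration rather than part of MLServer: the function names are hypothetical, the paths mirror the endpoint list, and the settings dictionary only contains the two fields named under the limitations.

```python
import json


def stream_endpoints(model_name, model_version=None):
    """Return the REST streaming paths for a model.

    generate_stream is listed as an alias of infer_stream, so both
    paths share the same base.
    """
    base = f"/v2/models/{model_name}"
    if model_version is not None:
        base = f"{base}/versions/{model_version}"
    return {
        "infer_stream": f"{base}/infer_stream",
        "generate_stream": f"{base}/generate_stream",
    }


def streaming_settings():
    """Server settings that work around the streaming limitations."""
    return {
        "parallel_workers": 0,  # disable distributed workers
        "gzip_enabled": False,  # streaming is not compatible with GZIP
    }


# "my-llm" and "v1" are hypothetical model name and version
print(stream_endpoints("my-llm"))
print(stream_endpoints("my-llm", "v1"))
print(json.dumps(streaming_settings(), indent=2))
```

The JSON printed by the last line could be used as the contents of a settings.json file for a server that needs streaming.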

Deployment

MLServer is currently used as the core Python inference server in some of the most popular Kubernetes-native serving frameworks, including Seldon Core and KServe (formerly known as KFServing). This allows MLServer users to leverage the usability and maturity of these frameworks to take their model deployments to the next level of their MLOps journey, ensuring that they are served in a robust and scalable infrastructure.

In general, it should be possible to deploy models using MLServer into any serving engine compatible with the V2 protocol. Alternatively, it's also possible to manage MLServer deployments manually as regular processes (i.e. in a non-Kubernetes-native way). However, this may be more involved and highly dependent on the deployment infrastructure.

  • Seldon Core

  • KServe

MLflow

This package provides an MLServer runtime compatible with MLflow models.

Usage

You can install the runtime, alongside mlserver, as:

pip install mlserver mlserver-mlflow

Content Types

The MLflow inference runtime introduces a new dict content type, which decodes an incoming V2 request as a dictionary of tensors. This is useful for certain MLflow-serialised models, which will expect that the model inputs are serialised in this format.

The `dict` content type can be _stacked_ with other content types, like [`np`](../../docs/user-guide/content-type). This allows the user to use a different set of content types to decode each of the dict entries.

LightGBM

This package provides an MLServer runtime compatible with LightGBM.

Usage

You can install the runtime, alongside mlserver, as:

pip install mlserver mlserver-lightgbm

For further information on how to use MLServer with LightGBM, you can check out this worked out example.

Content Types

If no content type is present on the request or metadata, the LightGBM runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.

CatBoost

This package provides an MLServer runtime compatible with CatBoost's CatBoostClassifier.

Usage

You can install the runtime, alongside mlserver, as:

pip install mlserver mlserver-catboost

For further information on how to use MLServer with CatBoost, you can check out this worked out example.

Content Types

If no content type is present on the request or metadata, the CatBoost runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.

Spark MLlib

This package provides an MLServer runtime compatible with Spark MLlib.

Usage

You can install the runtime, alongside mlserver, as:

pip install mlserver mlserver-mllib

For further information on how to use MLServer with Spark MLlib, you can check out the MLServer repository.

Custom

There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.

To learn more about how you can write custom runtimes with MLServer, check out the Custom Runtimes user guide. Alternatively, you can also see this end-to-end example, which walks through the process of writing a custom runtime.

Alibi-Explain

This package provides an MLServer runtime compatible with Alibi-Explain.

Usage

You can install the runtime, alongside mlserver, as:

pip install mlserver mlserver-alibi-explain

Metrics

MetricsServer

Methods

on_worker_stop()

on_worker_stop(worker: Worker) -> None

start()

start()

stop()

stop(sig: Optional[int] = None)

configure_metrics()

configure_metrics(settings: Settings)

No description available.

log()

log(metrics)

Logs a new set of metric values. Each kwarg of this method will be treated as a separate metric / value pair. If any of the metrics does not exist, a new one will be created with a default description.

register()

register(name: str, description: str) -> Histogram

Registers a new metric with its description. If the metric already exists, it will just return the existing one.

Adaptive Batching

MLServer includes support to batch requests together transparently on-the-fly. We refer to this as "adaptive batching", although it can also be known as "predictive batching".

Benefits

There are usually two main reasons to adopt adaptive batching:

  • Maximise resource usage. Usually, inference operations are “vectorised” (i.e. are designed to operate across batches). For example, a GPU is designed to operate on multiple data points at the same time. Therefore, to make sure that it’s used at maximum capacity, we need to run inference across batches.

  • Minimise any inference overhead. Usually, all models will have to “pay” a constant overhead when running any type of inference. This can be something like IO to communicate with the GPU or some kind of processing of the incoming data. Up to a certain size, this overhead tends to not scale linearly with the number of data points. Therefore, it’s in our interest to send batches as large as we can without deteriorating performance.

However, these benefits will usually scale only up to a certain point, which is usually determined by either the infrastructure, the machine learning framework used to train your model, or a combination of both. Therefore, to maximise the performance improvements brought in by adaptive batching, it is important to configure it with the appropriate values for your model. Since these values are usually found through experimentation, MLServer won't enable adaptive batching by default on newly loaded models.

Usage

MLServer lets you configure adaptive batching independently for each model through two main parameters:

  • Maximum batch size, that is how many requests you want to group together.

  • Maximum batch time, that is how much time we should wait for new requests until we reach our maximum batch size.

max_batch_size

The max_batch_size field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_MAX_BATCH_SIZE global environment variable) controls the maximum number of requests that should be grouped together on each batch. The expected values are:

  • N, where N > 1, will create batches of up to N elements.

  • 0 or 1, will disable adaptive batching.

max_batch_time

The max_batch_time field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_MAX_BATCH_TIME global environment variable) controls the time that MLServer should wait for new requests to come in until we reach our maximum batch size.

The expected format is in seconds, but it will take fractional values. That is, 500ms could be expressed as 0.5.

The expected values are:

  • T, where T > 0, will wait T seconds at most.

  • 0, will disable adaptive batching.

Merge and split of custom parameters

MLServer allows adding custom parameters to the parameters field of the requests. These parameters are received as a merged list of parameters inside the server, e.g.

# request 1
types.RequestInput(
    name="parameters-np",
    shape=[1],
    datatype="BYTES",
    data=[],
    parameters=types.Parameters(
        custom_param="value-1",
    )
)

# request 2
types.RequestInput(
    name="parameters-np",
    shape=[1],
    datatype="BYTES",
    data=[],
    parameters=types.Parameters(
        custom_param="value-2",
    )
)

is received as follows in the batched request in the server:

types.RequestInput(
    name="parameters-np",
    shape=[2],
    datatype="BYTES",
    data=[],
    parameters=types.Parameters(
        custom_param=["value-1", "value-2"],
    )
)

In the same way, if the response is sent back by the model as a batched response

types.ResponseOutput(
    name="foo",
    datatype="INT32",
    shape=[3, 3],
    data=[1, 2, 3, 4, 5, 6, 7, 8, 9],
    parameters=types.Parameters(
        content_type="np",
        foo=["foo_1", "foo_2"],
        bar=["bar_1", "bar_2", "bar_3"],
    ),
)

it will be returned unbatched from the server as follows:

# Request 1
types.ResponseOutput(
    name="foo",
    datatype="INT32",
    shape=[1, 3],
    data=[1, 2, 3],
    parameters=types.Parameters(
        content_type="np", foo="foo_1", bar="bar_1"
    ),
)

# Request 2
types.ResponseOutput(
    name="foo",
    datatype="INT32",
    shape=[1, 3],
    data=[4, 5, 6],
    parameters=types.Parameters(
        content_type="np", foo="foo_2", bar="bar_2"
    ),
)

# Request 3
types.ResponseOutput(
    name="foo",
    datatype="INT32",
    shape=[1, 3],
    data=[7, 8, 9],
    parameters=types.Parameters(content_type="np", bar="bar_3"),
)
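The merge and split behaviour described for custom parameters can be mimicked with plain dictionaries. The sketch below is not MLServer's implementation; it is a simplified model of the behaviour, with hypothetical helper names. It assumes that list-valued parameters are split positionally, that scalar parameters such as content_type are shared by every response, and that a request which did not send a parameter is marked with None when merging.

```python
def merge_parameters(per_request):
    """Merge one parameter dict per request into a dict of lists."""
    keys = []
    for params in per_request:
        for key in params:
            if key not in keys:
                keys.append(key)
    # None marks requests that did not send a given parameter (assumption)
    return {key: [params.get(key) for params in per_request] for key in keys}


def split_parameters(merged, n_requests):
    """Split batched parameters back into one dict per request."""
    unbatched = []
    for index in range(n_requests):
        params = {}
        for key, value in merged.items():
            if isinstance(value, list):
                # Lists are split by position; short lists leave gaps
                if index < len(value) and value[index] is not None:
                    params[key] = value[index]
            else:
                # Scalars are copied to every request
                params[key] = value
        unbatched.append(params)
    return unbatched


batched_request = merge_parameters(
    [{"custom_param": "value-1"}, {"custom_param": "value-2"}]
)
print(batched_request)

batched_response = {
    "content_type": "np",
    "foo": ["foo_1", "foo_2"],
    "bar": ["bar_1", "bar_2", "bar_3"],
}
for params in split_parameters(batched_response, 3):
    print(params)
```

Note how the third response ends up without a foo entry, since the batched foo list only holds two values.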

    Python API

    MLServer exposes a Python framework to build custom inference runtimes, define request/response types, plug codecs for payload conversion, and emit metrics. This page provides a high-level overview and links to the API docs.

    • MLModel

      • Base class to implement custom inference runtimes.

      • Core lifecycle: load(), predict(), unload().

      • Helpers for encoding/decoding requests and responses.

      • Access to model metadata and settings.

      • Extend this class to implement your own model logic.

      • Data structures and enums for the V2 inference protocol.

      • Includes Pydantic models like InferenceRequest, InferenceResponse, RequestInput, ResponseOutput

      • Encode/decode payloads between Open Inference Protocol types and Python types.

      • Base classes: InputCodec (inputs/outputs) and RequestCodec (requests/responses).

      • Emit and configure metrics within MLServer.

      • Use log() to record custom metrics; see server lifecycle hooks and utilities.

    SKLearn

    This package provides a MLServer runtime compatible with Scikit-Learn.

    You can install the runtime, alongside mlserver, as:

    For further information on how to use MLServer with Scikit-Learn, you can check out this .

    If no is present on the request or metadata, the Scikit-Learn runtime will try to decode the payload as a . To avoid this, either send a different content type explicitly, or define the correct one as part of your .

    The Scikit-Learn inference runtime exposes a number of outputs depending on the model type. These outputs match to the predict, predict_proba and transform

    Examples

    To see MLServer in action you can check out the examples below. These are end-to-end notebooks, showing how to serve models with MLServer.

    If you are interested in how MLServer interacts with particular model frameworks, you can check the following examples. These focus on showcasing the different that ship with MLServer out of the box. Note that, for advanced use cases, you can also write your own custom inference runtime (see the ).

    Parallel Inference

    Out of the box, MLServer includes support to offload inference workloads to a pool of workers running in separate processes. This allows MLServer to scale out beyond the limitations of the Python interpreter. To learn more about why this can be beneficial, you can check the below.

    By default, MLServer will spin up a pool with only one worker process to run inference. All models will be loaded uniformly across the inference pool workers. To read more about advanced settings, please see the .

    The is a mutex lock that exists in most Python interpreters (e.g. CPython). Its main purpose is to lock Python’s execution so that it only runs on a single processor at the same time. This simplifies certain things to the interpreter. However, it also adds the limitation that a single Python process will never be able to leverage multiple cores.

    When we think about MLServer's support for , this could lead to scenarios where a heavily-used model starves the other models running within the same MLServer instance. Similarly, even if we don’t take MMS into account, the GIL also makes it harder to scale inference for a single model.

    API Reference

    This page links to the key reference docs for configuring and using MLServer.

    Server-wide configuration (e.g., HTTP/GRPC ports) loaded from a settings.json in the working directory. Settings can also be provided via environment variables prefixed with MLSERVER_ (e.g., MLSERVER_GRPC_PORT).

    • Scope: server-wide (independent from model-specific settings)

    MLModel

    Abstract inference runtime which exposes the main interface to interact with ML models.

    Helper to decode a request input into its corresponding high-level Python object. This method will find the most appropiate :doc:input codec </user-guide/content-type> based on the model's metadata and the input's content type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.

    Helper to decode an inference request into its corresponding high-level Python object. This method will find the most appropiate :doc:request codec </user-guide/content-type> based on the model's metadata and the requests's content type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.

    HuggingFace

    This package provides a MLServer runtime compatible with HuggingFace Transformers.

    You can install the runtime, alongside mlserver, as:

    For further information on how to use MLServer with HuggingFace, you can check out this .

    The HuggingFace runtime will always decode the input request using its own built-in codec. Therefore, at the request level will be ignored. Note that this doesn't include annotations, which will be respected as usual.

    The HuggingFace runtime exposes a couple extra parameters which can be used to customise how the runtime behaves. These settings can be added under the parameters.extra section of your model-settings.json

    Model Repository API

    MLServer supports loading and unloading models dynamically from a models repository. This allows you to enable and disable the models accessible by MLServer on demand. This extension builds on top of the support for , letting you change at runtime which models is MLServer currently serving.

    The API to manage the model repository is modelled after to the V2 Dataplane and is thus fully compatible with it.

    This notebook will walk you through an example using the Model Repository API.

    First of all, we will need to train some models. For that, we will re-use the models we trained previously in the . You can check the details on how they are trained following that notebook.

    Next up, we will start our mlserver inference server. Note that, by default, this will load all our models.

  • See model fields (type and default) and JSON Schemas in the docs.

  • Built-ins include codecs such as NumpyCodec, Base64Codec, StringCodec, etc.

    When creating a custom runtime, start by subclassing MLModel, use the structures from Types for requests/responses, pick or implement the appropriate Codecs, and optionally emit Metrics from your model code.

    Types
    Codecs
    Metrics
    To work around this limitation, MLServer offloads the model inference to a pool of workers, where each worker is a separate Python process (and thus has its own separate GIL). This means that we can get full access to the underlying hardware.

    Managing the Inter-Process Communication (IPC) between the main MLServer process and the inference pool workers brings in some overhead. Under the hood, MLServer uses the multiprocessing library to implement the distributed processing management, which has been shown to offer the smallest possible overhead when implementing these types of distributed strategies (Zhi et al., 2020).

    The extra overhead introduced by other libraries is usually a trade-off in exchange for other advanced features for complex distributed processing scenarios. However, MLServer's use case is simple enough to not require any of these.

    Even though this overhead is minimised, it can still be particularly noticeable for lightweight inference methods, where the extra IPC overhead can take a large percentage of the overall time. In these cases (which can only be assessed on a model-by-model basis), the user has the option to disable the parallel inference feature.

    For regular models where inference can take a bit more time, this overhead is usually offset by the benefit of having multiple cores to compute inference on.
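
To make this trade-off concrete, the sketch below estimates what share of the total request time is spent on IPC. The timings are hypothetical numbers chosen only for illustration; they are not measurements of MLServer.

```python
def ipc_overhead_share(inference_ms: float, ipc_ms: float) -> float:
    """Fraction of the total request time spent on IPC rather than inference."""
    return ipc_ms / (inference_ms + ipc_ms)


# Hypothetical timings, in milliseconds (not measured on MLServer).
IPC_MS = 0.5
lightweight = ipc_overhead_share(inference_ms=0.5, ipc_ms=IPC_MS)
heavy = ipc_overhead_share(inference_ms=49.5, ipc_ms=IPC_MS)

print(f"lightweight model: {lightweight:.0%} of the time is IPC")
print(f"heavy model: {heavy:.0%} of the time is IPC")
```

With the same fixed IPC cost, the lightweight model spends half of each request on IPC, while for the heavier model the cost is close to negligible. Measuring your own model is the only reliable way to decide.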

    By default, MLServer will always create an inference pool with one single worker. The number of workers (i.e. the size of the inference pool) can be adjusted globally through the server-level parallel_workers setting.

    The parallel_workers field of the settings.json file (or alternatively, the MLSERVER_PARALLEL_WORKERS global environment variable) controls the size of MLServer's inference pool. The expected values are:

    • N, where N > 0, will create a pool of N workers.

    • 0, will disable the parallel inference feature. In other words, inference will happen within the main MLServer process.
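
As a rough illustration of the values above, the sketch below builds the contents of a settings.json file and describes how the chosen value would be interpreted. The describe_inference_pool helper is hypothetical (it is not part of MLServer), and rejecting negative values is an assumption of this sketch.

```python
import json


def describe_inference_pool(parallel_workers: int) -> str:
    """Explain how a given parallel_workers value is interpreted."""
    if parallel_workers < 0:
        # Assumption of this sketch: negative values are treated as invalid.
        raise ValueError("parallel_workers must be 0 or a positive integer")
    if parallel_workers == 0:
        return "parallel inference disabled; inference runs in the main process"
    return f"inference pool with {parallel_workers} worker process(es)"


# Contents that could be saved as settings.json (4 is an arbitrary example value).
settings = {"parallel_workers": 4}
settings_json = json.dumps(settings, indent=2)

print(settings_json)
print(describe_inference_pool(settings["parallel_workers"]))
```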

    The inference_pool_gid field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_INFERENCE_POOL_GID global environment variable) lets you load models on a dedicated inference pool, identified by its group ID (GID), to prevent starvation.

    Complementing the inference_pool_gid, if the autogenerate_inference_pool_gid field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_AUTOGENERATE_INFERENCE_POOL_GID global environment variable) is set to True, a UUID is automatically generated, and a dedicated inference pool will load the given model. This option is useful if the user wants to load a single model on a dedicated inference pool without having to manage the GID themselves.

    Jiale Zhi, Rui Wang, Jeff Clune, and Kenneth O. Stanley. Fiber: A Platform for Efficient Development and Distributed Training for Reinforcement Learning and Population-Based Methods. arXiv:2003.11164 [cs, stat], March 2020. arXiv:2003.11164.

    Concurrency in Python

    concurrency section
    usage section below
    Global Interpreter Lock (GIL)
    Multi-Model Serving (MMS)

    Overhead

    Usage

    parallel_workers

    inference_pool_gid

    References

    Sources: settings.json or env vars MLSERVER_*

    Read the full reference →

    Each model has its own configuration (metadata, parallelism, etc.). Typically provided via a model-settings.json next to the model artifacts. Alternatively, use env vars prefixed with MLSERVER_MODEL_ (e.g., MLSERVER_MODEL_IMPLEMENTATION). If no model-settings.json is found, MLServer will try to load a default model from these env vars. Note: these env vars are shared across models unless overridden by model-settings.json.

    • Scope: per-model

    • Sources: model-settings.json or env vars MLSERVER_MODEL_*

    Read the full reference →
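
The sketch below illustrates the naming convention only: it collects variables with the MLSERVER_MODEL_ prefix and maps them to lower-cased setting names. It is a simplified stand-in, not MLServer's own loading logic, and it does not handle nested fields. MLSERVER_MODEL_NAME and the example values are assumptions made for the illustration.

```python
from typing import Dict, Mapping

MODEL_ENV_PREFIX = "MLSERVER_MODEL_"


def model_settings_from_env(environ: Mapping[str, str]) -> Dict[str, str]:
    """Collect MLSERVER_MODEL_* variables as lower-cased setting names."""
    return {
        key[len(MODEL_ENV_PREFIX):].lower(): value
        for key, value in environ.items()
        if key.startswith(MODEL_ENV_PREFIX)
    }


# Hypothetical environment; in practice you would pass os.environ.
example_env = {
    "MLSERVER_MODEL_NAME": "my-model",
    "MLSERVER_MODEL_IMPLEMENTATION": "mlserver_sklearn.SKLearnModel",
    "PATH": "/usr/bin",
}
print(model_settings_from_env(example_env))
```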

    The mlserver CLI helps with common model lifecycle tasks (build images, init projects, start serving, etc.). For a quick overview:

    • Commands include: build, dockerfile, infer (deprecated), init, start

    • Each command lists its options, arguments, and examples

    Read the full CLI reference →

    Build custom runtimes and integrate with MLServer using Python:

    • MLModel: base class for custom inference runtimes

    • Types: request/response schemas and enums (Pydantic)

    • Codecs: payload conversions between protocol types and Python types

    • Metrics: emit and configure metrics

    Browse the Python API →

    MLServer Settings

    Model Settings

    MLServer CLI

    Python API

    Helper to encode a high-level Python object into its corresponding response output. This method will find the most appropriate input codec based on the model's metadata, request output's content type or payload's type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.

    Helper to encode a high-level Python object into its corresponding inference response. This method will find the most appropriate request codec based on the payload's type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.

    Method responsible for loading the model from a model artefact. This method will be called on each of the parallel workers (when parallel inference is enabled). Its return value will represent the model's readiness status. A return value of True will mean the model is ready.

    This method can be overridden to implement your custom load logic.

    No description available.

    Method responsible for running inference on the model.

    This method can be overridden to implement your custom inference logic.

    Method responsible for running generation on the model, streaming a set of responses back to the client.

    This method can be overridden to implement your custom inference logic.

    Method responsible for unloading the model, freeing any resources (e.g. CPU memory, GPU memory, etc.). This method will be called on each of the parallel workers (when parallel inference is enabled). A return value of True will mean the model is now unloaded.

    This method can be overridden to implement your custom unload logic.

    Methods

    decode()

    decode_request()

    encode()

    encode_response()

    load()

    metadata()

    predict()

    predict_stream()

    unload()

    file, e.g.

    It is possible to load a local model into a HuggingFace pipeline by specifying the model artefact folder path in parameters.uri in model-settings.json.

    Models in the HuggingFace hub can be loaded by specifying their name in parameters.extra.pretrained_model in model-settings.json.

    You can find the full reference of the accepted extra settings for the HuggingFace runtime below:

    Usage

    Content Types

    Settings

    worked out example
    content type annotations
    input-level content type

    Loading models

    Local models

    HuggingFace models

    Reference

    Now that we've got our inference server up and running, and serving 2 different models, we can start using the Model Repository API. To get us started, we will first list all available models in the repository.

    As we can see, the repository lists 2 models (i.e. mushroom-xgboost and mnist-svm). Note that the state for both is set to READY. This means that both models are loaded, and thus ready for inference.

    We will now try to unload one of the 2 models, mushroom-xgboost. This will unload the model from the inference server but will keep it available on our model repository.

    If we now try to list the models available in our repository, we will see that the mushroom-xgboost model is flagged as UNAVAILABLE. This means that it's present in the repository but it's not loaded for inference.

    We will now load our model back into our inference server.

    If we now try to list the models again, we will see that our mushroom-xgboost is back again, ready for inference.
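
When checking these states programmatically, it can help to group the index entries by state. The sketch below assumes that the index response is a list of objects with name and state fields; the sample payload is hypothetical and only mirrors the situation described above, after unloading mushroom-xgboost.

```python
from collections import defaultdict
from typing import Dict, List


def group_models_by_state(index: List[dict]) -> Dict[str, List[str]]:
    """Group model names from a repository index response by their state."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for entry in index:
        grouped[entry["state"]].append(entry["name"])
    return dict(grouped)


# Hypothetical index payload, as it might look after unloading mushroom-xgboost.
sample_index = [
    {"name": "mushroom-xgboost", "state": "UNAVAILABLE", "reason": ""},
    {"name": "mnist-svm", "state": "READY", "reason": ""},
]
print(group_models_by_state(sample_index))
```

In a live session, the list passed to the helper would be the parsed JSON body returned by the repository index endpoint.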

    Training

    Serving

    List available models

    Multi-Model Serving
    Triton's Model Repository extension
    Multi-Model Serving example

    Unloading our mushroom-xgboost model

    Loading our mushroom-xgboost model back

    mlserver --help
    decode(request_input: RequestInput, default_codec: Union[Type[InputCodec], InputCodec, None] = None) -> Any
    decode_request(inference_request: InferenceRequest, default_codec: Union[Type[RequestCodec], RequestCodec, None] = None) -> Any
    encode(payload: Any, request_output: RequestOutput, default_codec: Union[Type[InputCodec], InputCodec, None] = None) -> ResponseOutput
    encode_response(payload: Any, default_codec: Union[Type[RequestCodec], RequestCodec, None] = None) -> InferenceResponse
    load() -> bool
    metadata() -> MetadataModelResponse
    predict(payload: InferenceRequest) -> InferenceResponse
    predict_stream(payloads: AsyncIterator[InferenceRequest]) -> AsyncIterator[InferenceResponse]
    unload() -> bool
    pip install mlserver mlserver-huggingface
    {
      "name": "qa",
      "implementation": "mlserver_huggingface.HuggingFaceRuntime",
      "parameters": {
        "extra": {
          "task": "question-answering",
          "optimum_model": true
        }
      }
    }
    These settings can also be injected through environment variables prefixed with `MLSERVER_MODEL_HUGGINGFACE_`, e.g.
    
    ```bash
    MLSERVER_MODEL_HUGGINGFACE_TASK="question-answering"
    MLSERVER_MODEL_HUGGINGFACE_OPTIMUM_MODEL=true
    ```
    If `parameters.extra.pretrained_model` is specified, it takes precedence over `parameters.uri`.
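
The precedence rule can be sketched as a small helper. This is an illustration of the rule as stated here, not the runtime's own code; the resolve_model_source name and the example values are hypothetical.

```python
from typing import Optional


def resolve_model_source(parameters: dict) -> Optional[str]:
    """Return the model reference to load, or None if neither field is set."""
    extra = parameters.get("extra") or {}
    # pretrained_model wins over uri when both are present.
    return extra.get("pretrained_model") or parameters.get("uri")


# Hypothetical values for illustration only.
both = {
    "uri": "./local-model-folder",
    "extra": {"task": "question-answering", "pretrained_model": "some-org/some-model"},
}
local_only = {"uri": "./local-model-folder", "extra": {"task": "question-answering"}}

print(resolve_model_source(both))
print(resolve_model_source(local_only))
```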
    
    .. autopydantic_settings:: mlserver_huggingface.settings.HuggingFaceSettings
    !cp -r ../mms/models/* ./models
    mlserver start .
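    import requests
    
    # List all models available in the repository
    response = requests.post("http://localhost:8080/v2/repository/index", json={})
    response.json()
    # Unload the mushroom-xgboost model
    requests.post("http://localhost:8080/v2/repository/models/mushroom-xgboost/unload")
    # List the models again: mushroom-xgboost should now be UNAVAILABLE
    response = requests.post("http://localhost:8080/v2/repository/index", json={})
    response.json()
    # Load the mushroom-xgboost model back
    requests.post("http://localhost:8080/v2/repository/models/mushroom-xgboost/load")
    # List the models once more: mushroom-xgboost should be READY again
    response = requests.post("http://localhost:8080/v2/repository/index", json={})
    response.json()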
    methods of the Scikit-Learn model.
    Output
    Returned By Default
    Availability

    predict

    ✅

    Available on most models, but not in Scikit-Learn pipelines.

    predict_proba

    ❌

    Only available on non-regressor models.

    By default, the runtime will only return the output of predict. However, you are able to control which outputs you want back through the outputs field of your InferenceRequest payload.

    For example, to only return the model's predict_proba output, you could define a payload such as:

    pip install mlserver mlserver-sklearn

    Usage

    Content Types

    Model Outputs

    worked out example
    content type
    NumPy Array
    model's metadata
    {
      "inputs": [
        {
          "name": "my-input",
          "datatype": "INT32",
          "shape": [2, 2],
          "data": [1, 2, 3, 4]
        }
      ],
      "outputs": [
        { "name": "predict_proba" }
      ]
    }
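
The same payload can be assembled in Python before sending it. The sketch below is a minimal helper under the assumption of a single INT32 input named my-input, as in the payload above; build_inference_request is a hypothetical name, not an MLServer function.

```python
from typing import List, Sequence


def build_inference_request(
    data: Sequence[int], shape: List[int], outputs: Sequence[str]
) -> dict:
    """Build a V2 inference request asking only for the named outputs."""
    return {
        "inputs": [
            {"name": "my-input", "datatype": "INT32", "shape": shape, "data": list(data)}
        ],
        "outputs": [{"name": name} for name in outputs],
    }


request_body = build_inference_request([1, 2, 3, 4], [2, 2], ["predict_proba"])
print(request_body)
```

The resulting dictionary can then be posted as JSON to the model's infer endpoint, in the same way as the requests shown in the serving examples.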

    Serving LightGBM models

  • Serving CatBoost models

  • Serving MLflow models

  • Serving custom models

  • Serving Alibi Detect models

  • Serving HuggingFace models

  • To see some of the advanced features included in MLServer (e.g. multi-model serving), check out the examples below.

    • Multi-Model Serving with multiple frameworks

    • Loading / unloading models from a model repository

    • Content-Type Decoding

    • Custom Conda environment

    Tutorials are designed to be beginner-friendly and walk through accomplishing a series of tasks using MLServer (and other tools).

    • Deploying a Custom Tensorflow Model with MLServer and Seldon Core

    Inference Runtimes

    inference runtimes
    example below on custom models
    Serving Scikit-Learn models
    Serving XGBoost models

    MLServer Features

    Tutorials

    Alibi-Detect

    This package provides a MLServer runtime compatible with alibi-detect models.

    Usage

    You can install the mlserver-alibi-detect runtime, alongside mlserver, as:

    pip install mlserver mlserver-alibi-detect

    For further information on how to use MLServer with Alibi-Detect, you can check out this worked out example.

    Content Types

    If no content type is present on the request or metadata, the Alibi-Detect runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.

    Settings

    The Alibi Detect runtime exposes a couple of setting flags which can be used to customise how the runtime behaves. These settings can be added under the parameters.extra section of your model-settings.json file, e.g.

    You can find the full reference of the accepted extra settings for the Alibi Detect runtime below:

    Metrics

    Out-of-the-box, MLServer exposes a set of metrics that help you monitor your machine learning workloads in production. These include standard metrics like number of requests and latency.

    On top of these, you can also register and track your own custom metrics as part of your custom inference runtimes.

    By default, MLServer will expose metrics around inference requests (count and error rate) and the status of its internal request queues. These internal queues are used for adaptive batching and communication with the inference workers.

    Metric Name
    Description

    Inference Runtimes

    Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice.

    Out of the box, MLServer comes with a set of pre-packaged runtimes which let you interact with a subset of common ML frameworks. This allows you to start serving models saved in these frameworks straight away. To avoid bringing in dependencies for frameworks that you don't need to use, these runtimes are implemented as independent (and optional) Python packages. This mechanism also allows you to roll out your own custom runtimes very easily.

    To pick which runtime you want to use for your model, you just need to make sure that the right package is installed, and then point to the correct runtime class in your model-settings.json file.
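
As a minimal sketch of the second step, the snippet below writes a model-settings.json that points to the Scikit-Learn runtime class. The model name and the uri are hypothetical placeholders, and the file is written to a temporary folder purely for illustration.

```python
import json
import tempfile
from pathlib import Path

# Runtime class from the mlserver-sklearn package; the model name and uri
# below are hypothetical placeholders.
model_settings = {
    "name": "my-sklearn-model",
    "implementation": "mlserver_sklearn.SKLearnModel",
    "parameters": {"uri": "./model.joblib"},
}

model_dir = Path(tempfile.mkdtemp())
settings_path = model_dir / "model-settings.json"
settings_path.write_text(json.dumps(model_settings, indent=2))

print(settings_path.read_text())
```

Switching to a different runtime would then only require installing its package and changing the implementation value.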

    Framework

    XGBoost

    This package provides a MLServer runtime compatible with XGBoost.

    You can install the runtime, alongside mlserver, as:

    For further information on how to use MLServer with XGBoost, you can check out this worked out example.

    The XGBoost inference runtime will expect that your model is serialised via one of the following methods:

    Extension
    Docs
    Example

    Model Parameters

    Attribute
    Type
    Default

    Serving LightGBM models

    Out of the box, mlserver supports the deployment and serving of lightgbm models. By default, it will assume that these models have been serialised using the bst.save_model() method.

    In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.

    To test the LightGBM Server, first we need to generate a simple LightGBM model using Python.

    Our model will be persisted as a file named iris-lightgbm.bst.

    Now that we have trained and saved our model, the next step will be to serve it using

    Serving custom models requiring JSON inputs or outputs
    Serving models through Kafka
    Streaming inference

    transform

    ❌

    Only available on Scikit-Learn pipelines.

    Scikit-Learn pipelines

    Reference

    mlserver
    . For that, we will need to create 2 configuration files:
    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    Now that we have our config in place, we can start the server by running the mlserver start . command. This must either be run from the directory that contains our config files, or be pointed at the folder where they live.

    Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or in a separate terminal.

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

    For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    As we can see above, the model predicted the probability for each class, and the probability of class 1 is the biggest, close to 0.99, which matches what's on the test set.
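
To turn the returned probabilities into a class label, we can take the index of the largest value. The sketch below assumes that the first output holds a flat list of per-class probabilities for a single row; the sample response and its numbers are made up for illustration.

```python
def predicted_class(response: dict) -> int:
    """Return the index of the most probable class for a single-row response."""
    probabilities = response["outputs"][0]["data"]
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Hypothetical response body with made-up probabilities for three classes.
sample_response = {
    "model_name": "iris-lgb",
    "outputs": [
        {"name": "predict", "shape": [1, 3], "datatype": "FP64", "data": [0.004, 0.99, 0.006]}
    ],
}
print(predicted_class(sample_response))
```

The resulting index can then be compared against the matching entry of y_test.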

    Training

    Serving

    serialised using the bst.save_model() method

    settings.json

    model-settings.json

    Start serving our model

    Send test inference request

    {
      "name": "drift-detector",
      "implementation": "mlserver_alibi_detect.AlibiDetectRuntime",
      "parameters": {
        "uri": "./alibi-detect-artifact/",
        "extra": {
          "batch_size": 5
        }
      }
    }
    
    .. autopydantic_settings:: mlserver_alibi_detect.runtime.AlibiDetectSettings
    import lightgbm as lgb
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    import os
    
    model_dir = "."
    BST_FILE = "iris-lightgbm.bst"
    
    iris = load_iris()
    y = iris['target']
    X = iris['data']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
    dtrain = lgb.Dataset(X_train, label=y_train)
    
    params = {
        'objective':'multiclass', 
        'metric':'softmax',
        'num_class': 3
    }
    lgb_model = lgb.train(params=params, train_set=dtrain)
    model_file = os.path.join(model_dir, BST_FILE)
    lgb_model.save_model(model_file)
    %%writefile settings.json
    {
        "debug": "true"
    }
    %%writefile model-settings.json
    {
        "name": "iris-lgb",
        "implementation": "mlserver_lightgbm.LightGBMModel",
        "parameters": {
            "uri": "./iris-lightgbm.bst",
            "version": "v0.1.0"
        }
    }
    mlserver start .
    import requests
    
    x_0 = X_test[0:1]
    inference_request = {
        "inputs": [
            {
              "name": "predict-prob",
              "shape": x_0.shape,
              "datatype": "FP32",
              "data": x_0.tolist()
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/iris-lgb/versions/v0.1.0/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    y_test[0]

    Number of successful inference requests.

    model_infer_request_failure

    Number of failed inference requests.

    batch_request_queue

    Queue size for the adaptive batching queue.

    parallel_request_queue

    Queue size for the inference workers queue.

    On top of the default set of metrics, MLServer's REST server will also expose a set of metrics specific to REST.

    | Metric Name | Description |
    | --- | --- |
    | [rest_server]_requests | Number of REST requests, labelled by endpoint and status code. |
    | [rest_server]_requests_duration_seconds | Latency of REST requests. |
    | [rest_server]_requests_in_progress | Number of in-flight REST requests. |

    On top of the default set of metrics, MLServer's gRPC server will also expose a set of metrics specific to gRPC.

    | Metric Name | Description |
    | --- | --- |
    | grpc_server_handled | Number of gRPC requests, labelled by gRPC code and method. |
    | grpc_server_started | Number of in-flight gRPC requests. |

    MLServer allows you to register custom metrics within your custom inference runtimes. This can be done through the mlserver.register() and mlserver.log() methods.

    • mlserver.register: Register a new metric.

    • mlserver.log: Log a new set of metric / value pairs. If there's any unregistered metric, it will get registered on-the-fly.

    Custom metrics will generally be registered in the load() <mlserver.MLModel.load> method and then used in the predict() <mlserver.MLModel.predict> method of your custom runtime.

    For metrics specific to a model (e.g. custom metrics, request counts, etc.), MLServer will always label these with the model name and model version. Downstream, this allows you to aggregate and query metrics per model.

    Below, you can find the list of standardised labels that you will be able to find on model-specific metrics:

    | Label Name | Description |
    | --- | --- |
    | model_name | Model Name (e.g. my-custom-model) |
    | model_version | Model Version (e.g. v1.2.3) |

    MLServer will expose metric values through a metrics endpoint exposed on its own metric server. This endpoint can be polled by Prometheus or other OpenMetrics-compatible backends.

    Below you can find the settings available to control the behaviour of the metrics server:

    | Setting Name | Description | Default |
    | --- | --- | --- |
    | metrics_endpoint | Path under which the metrics endpoint will be exposed. | /metrics |
    | metrics_port | Port used to serve the metrics server. | 8082 |
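
As a rough sketch of how a client might consume this endpoint, the snippet below builds the metrics URL from the default settings and parses a few lines in the Prometheus text format. The sample text is made up (real output may use different metric names or suffixes), and the parser is deliberately simplified: it assumes there are no trailing timestamps on the sample lines.

```python
from typing import Dict


def metrics_url(host: str = "localhost", port: int = 8082, endpoint: str = "/metrics") -> str:
    """Build the metrics URL from the metrics_port and metrics_endpoint defaults."""
    return f"http://{host}:{port}{endpoint}"


def parse_samples(text: str) -> Dict[str, float]:
    """Map each 'name{labels}' series to its value, skipping comments and blanks."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last whitespace-separated token on the line.
        series, _, value = line.rpartition(" ")
        samples[series.strip()] = float(value)
    return samples


# Made-up sample of what a scrape could return.
SAMPLE = """# TYPE model_infer_request_success counter
model_infer_request_success{model_name="my-custom-model",model_version="v1.2.3"} 12.0
model_infer_request_failure{model_name="my-custom-model",model_version="v1.2.3"} 1.0
"""

print(metrics_url())
print(parse_samples(SAMPLE))
```

In practice, a Prometheus server would be configured to scrape this URL directly; the parser above is only meant for quick manual checks.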

    Default Metrics

    custom metrics
    custom inference runtimes
    adaptive batching
    communication with the inference workers

    model_infer_request_success

    import mlserver
    
    from mlserver.types import InferenceRequest, InferenceResponse
    
    
    class MyCustomRuntime(mlserver.MLModel):
        async def load(self) -> bool:
            # load_my_custom_model() is a placeholder for your own loading logic
            self._model = load_my_custom_model()
            # Register the metric once, when the model is loaded
            mlserver.register("my_custom_metric", "This is a custom metric example")
            return True
    
        async def predict(self, payload: InferenceRequest) -> InferenceResponse:
            # Log a new value for the metric on every inference request
            mlserver.log(my_custom_metric=34)
            # TODO: Replace with custom logic to run inference
            return self._model.predict(payload)

    REST Server Metrics

    The prefix for the REST-specific metrics will depend on the metrics_rest_server_prefix flag from the MLServer settings.

    gRPC Server Metrics

    Custom Metrics

    Under the hood, metrics logged through the mlserver.log method will get exposed to Prometheus as a Histogram.

    Metrics Labelling

    If these labels are not present on a specific metric, this means that those metrics can't be sliced at the model level.

    Settings

    Package Name
    Implementation Class
    Example
    Documentation

    Scikit-Learn

    mlserver-sklearn

    mlserver_sklearn.SKLearnModel

    XGBoost

    mlserver-xgboost

    mlserver_xgboost.XGBoostModel

    Spark MLlib

    Included Inference Runtimes

    own custom runtimes

    *.json

    booster.save_model("model.json")

    *.ubj

    booster.save_model("model.ubj")

    *.bst

    booster.save_model("model.bst")

    If no content type is present on the request or metadata, the XGBoost runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.

    The XGBoost inference runtime exposes a number of outputs depending on the model type. These outputs correspond to the predict and predict_proba methods of the XGBoost model.

    | Output | Returned By Default | Availability |
    | --- | --- | --- |
    | predict | ✅ | Available on all XGBoost models. |
    | predict_proba | ❌ | Only available on non-regressor models (i.e. XGBClassifier models). |

    By default, the runtime will only return the output of predict. However, you are able to control which outputs you want back through the outputs field of your InferenceRequest payload.

    For example, to only return the model's predict_proba output, you could define a payload such as:

    pip install mlserver mlserver-xgboost

    Usage

    XGBoost Artifact Type

    worked out example
    By default, the runtime will look for a file called `model.[json | ubj | bst]`.
    However, this can be modified through the `parameters.uri` field of your
    {class}`ModelSettings <mlserver.settings.ModelSettings>` config (see the
    section on [Model Settings](../../docs/reference/model-settings.md) for more
    details).
    
    ```{code-block} json
    ---
    emphasize-lines: 3-5
    ---
    {
      "name": "foo",
      "parameters": {
        "uri": "./my-own-model-filename.json"
      }
    }
    ```
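
The lookup described above can be sketched as follows. This is an illustration rather than the runtime's own code: the find_model_file helper is hypothetical, and the order in which the default file names are tried is an assumption.

```python
import tempfile
from pathlib import Path
from typing import Optional

# Default file names from the text above; the lookup order is an assumption.
DEFAULT_MODEL_FILES = ("model.json", "model.ubj", "model.bst")


def find_model_file(model_dir: Path, uri: Optional[str] = None) -> Optional[Path]:
    """Return the explicit uri if given, else the first default file that exists."""
    if uri is not None:
        return Path(uri)
    for file_name in DEFAULT_MODEL_FILES:
        candidate = model_dir / file_name
        if candidate.is_file():
            return candidate
    return None


demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "model.ubj").write_bytes(b"")  # empty stand-in, not a real model
print(find_model_file(demo_dir))
print(find_model_file(demo_dir, uri="./my-own-model-filename.json"))
```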
    {
      "inputs": [
        {
          "name": "my-input",
          "datatype": "INT32",
          "shape": [2, 2],
          "data": [1, 2, 3, 4]
        }
      ],
      "outputs": [
        { "name": "predict_proba" }
      ]
    }

    Content Types

    Model Outputs

    env_file

    str

    ".env"

    protected_namespaces

    tuple

    ('model_', 'settings_')

    Field
    Type
    Default
    Description

    autogenerate_inference_pool_gid

    bool

    False

    Flag to autogenerate the inference pool group id for this model.

    content_type

    Optional[str]

    extra

    str

    "allow"

    env_prefix

    str

    Config

    "MLSERVER_MODEL_"

    Fields

    Model Settings

    Config

    Attribute
    Type
    Default

    extra

    str

    "ignore"

    env_prefix

    str

    Field
    Type
    Default
    Description

    Serving Scikit-Learn models

    Out of the box, mlserver supports the deployment and serving of scikit-learn models. By default, it will assume that these models have been serialised using joblib.

    In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.

    Training

    The first step will be to train a simple scikit-learn model. For that, we will use the MNIST example from the scikit-learn documentation which trains an SVM model.

    # Original source code and more details can be found in:
    # https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
    
    # Import datasets, classifiers and performance metrics
    from sklearn import datasets, svm, metrics
    from sklearn.model_selection import train_test_split
    
    # The digits dataset
    digits = datasets.load_digits()
    
    # To apply a classifier on this data, we need to flatten the image, to
    # turn the data in a (samples, feature) matrix:
    n_samples = len(digits.images)
    data = digits.images.reshape((n_samples, -1))
    
    # Create a classifier: a support vector classifier
    classifier = svm.SVC(gamma=0.001)
    
    # Split data into train and test subsets
    X_train, X_test, y_train, y_test = train_test_split(
        data, digits.target, test_size=0.5, shuffle=False)
    
    # We learn the digits on the first half of the digits
    classifier.fit(X_train, y_train)

    Saving our trained model

    To save our trained model, we will serialise it using joblib. While this is not a perfect approach, it's currently the recommended method to persist models to disk in the scikit-learn documentation.

    Our model will be persisted as a file named mnist-svm.joblib.

    Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    Now that we have our config in place, we can start the server by running the mlserver start . command. This must either be run from the directory that contains our config files, or be pointed at the folder where they live.

    Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or in a separate terminal.

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

    For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    As we can see above, the model predicted the input as the number 8, which matches what's on the test set.

    Serving XGBoost models

    Out of the box, mlserver supports the deployment and serving of xgboost models. By default, it will assume that these models have been serialised using the bst.save_model() method.

    In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.

    Training

    The first step will be to train a simple xgboost model. For that, we will use the mushrooms example from the xgboost Getting Started guide.

    # Original code and extra details can be found in:
    # https://xgboost.readthedocs.io/en/latest/get_started.html#python
    
    import os
    import xgboost as xgb
    import requests
    
    from urllib.parse import urlparse
    from sklearn.datasets import load_svmlight_file
    
    
    TRAIN_DATASET_URL = 'https://raw.githubusercontent.com/dmlc/xgboost/master/demo/data/agaricus.txt.train'
    TEST_DATASET_URL = 'https://raw.githubusercontent.com/dmlc/xgboost/master/demo/data/agaricus.txt.test'
    
    
    def _download_file(url: str) -> str:
        parsed = urlparse(url)
        file_name = os.path.basename(parsed.path)
        file_path = os.path.join(os.getcwd(), file_name)
        
        res = requests.get(url)
        
        with open(file_path, 'wb') as file:
            file.write(res.content)
        
        return file_path
    
    train_dataset_path = _download_file(TRAIN_DATASET_URL)
    test_dataset_path = _download_file(TEST_DATASET_URL)
    
    # NOTE: Workaround to load SVMLight files from the XGBoost example
    X_train, y_train = load_svmlight_file(train_dataset_path)
    X_test, y_test = load_svmlight_file(test_dataset_path)
    X_train = X_train.toarray()
    X_test = X_test.toarray()
    
    # read in data
    dtrain = xgb.DMatrix(data=X_train, label=y_train)
    
    # specify parameters via map
    param = {'max_depth':2, 'eta':1, 'objective':'binary:logistic' }
    num_round = 2
    bst = xgb.train(param, dtrain, num_round)
    
    bst

    Saving our trained model

    To save our trained model, we will serialise it using bst.save_model() and the JSON format. This is the approach recommended by the XGBoost project.

    Our model will be persisted as a file named mushroom-xgboost.json.

    Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    Now that we have our config in-place, we can start the server by running mlserver start .. This needs to either be ran from the same directory where our config files are or pointing to the folder where they are.

    Since this command will start the server and block the terminal, waiting for requests, this will need to be ran in the background on a separate terminal.

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

    For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    As we can see above, the model predicted the input as close to 0, which matches what's on the test set.

    Serving models through Kafka

    Out of the box, MLServer provides support to receive inference requests from Kafka. The Kafka server can run side-by-side with the REST and gRPC ones, and adds a new interface to interact with your model. The inference responses coming back from your model will also get written back to their own output topic.

    In this example, we will showcase the integration with Kafka by serving a model through Kafka.

    We are going to start by running a simple local deployment of Kafka that we can test against. This will be a minimal cluster consisting of a single broker running in KRaft mode (i.e. without ZooKeeper).

    You need to have Java installed in order for it to work correctly.

    Now you can just run it with the following command outside the terminal:

    Now we can create the required input and output topics:

    The first step will be to train a simple

    Custom Conda environments in MLServer

    It's not unusual that model runtimes require extra dependencies that are not direct dependencies of MLServer. This is the case when we want to use custom runtimes, but also when our model artifacts are the output of older versions of a toolkit (e.g. models trained with an older version of SKLearn).

    In these cases, since these dependencies (or dependency versions) are not known in advance by MLServer, they won't be included in the default seldonio/mlserver Docker image. To cover these cases, the seldonio/mlserver Docker image allows you to load custom environments before starting the server itself.

    This example will walk you through how to create and save a custom environment, so that it can be loaded in MLServer without any extra change to the seldonio/mlserver Docker image.

    None

    Default content type to use for requests and responses.

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | environment_path | Optional[str] | None | Path to a directory that contains the python environment to be used to load this model. |
    | environment_tarball | Optional[str] | None | Path to the environment tarball which should be used to load this model. |
    | extra | Optional[dict] | <factory> | Arbitrary settings, dependent on the inference runtime implementation. |
    | format | Optional[str] | None | Format of the model (only available on certain runtimes). |
    | inference_pool_gid | Optional[str] | None | Inference pool group id to be used to serve this model. |
    | uri | Optional[str] | None | URI where the model artifacts can be found. This path must be either absolute or relative to where MLServer is running. |
    | version | Optional[str] | None | Version of the model. |

    mlserver-mllib

    mlserver_mllib.MLlibModel

    MLlib example

    MLServer MLlib

    | Framework | Package Name | Implementation Class | Example | Documentation |
    | --- | --- | --- | --- | --- |
    | LightGBM | mlserver-lightgbm | mlserver_lightgbm.LightGBMModel | LightGBM example | MLServer LightGBM |
    | CatBoost | mlserver-catboost | mlserver_catboost.CatboostModel | CatBoost example | MLServer CatBoost |
    | MLflow | mlserver-mlflow | mlserver_mlflow.MLflowRuntime | MLflow example | MLServer MLflow |
    | Alibi-Detect | mlserver-alibi-detect | mlserver_alibi_detect.AlibiDetectRuntime | Alibi-detect example | MLServer Alibi-Detect |

    Scikit-Learn example
    MLServer SKLearn
    XGBoost example
    MLServer XGBoost
    JSON Format
    Binary JSON Format
    (Old) Binary Format

    metrics_rest_server_prefix

    Prefix used for metric names specific to MLServer's REST inference interface.

    rest_server

    metrics_dir

    Directory used to store internal metric files (used to support metrics sharing across inference workers). This is equivalent to Prometheus' $PROMETHEUS_MULTIPROC_DIR env var.

    MLServer's current working directory (i.e. $PWD)

    settings
    adaptive batching
    inference workers
    MLServer settings

    Python path to the inference runtime to use to serve this model (e.g. mlserver_sklearn.SKLearnModel).

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | inputs | List[MetadataTensor] | <factory> | Metadata about the inputs accepted by the model. |
    | max_batch_size | int | 0 | When adaptive batching is enabled, maximum number of requests to group together in a single batch. |
    | max_batch_time | float | 0.0 | When adaptive batching is enabled, maximum amount of time (in seconds) to wait for enough requests to build a full batch. |
    | name | str | '' | Name of the model. |
    | outputs | List[MetadataTensor] | <factory> | Metadata about the outputs returned by the model. |
    | parallel_workers | Optional[int] | None | Use the parallel_workers field in the server-wide settings instead. |
    | parameters | Optional[ModelParameters] | None | Extra parameters for each instance of this model. |
    | platform | str | '' | Framework used to train and serialise the model (e.g. sklearn). |
    | versions | List[str] | <factory> | Versions of dependencies used to train the model (e.g. sklearn/0.20.1). |
    | warm_workers | bool | False | Inference workers will now always be warmed up at start time. |

    "MLSERVER_MODEL_"

    env_file

    str

    ".env"

    protected_namespaces

    tuple

    ('model_', 'settings_')

    cache_enabled

    bool

    False

    Enable caching for a specific model. This parameter can be used to disable cache for a specific model, if the server-level caching is enabled. If the server-level caching is disabled, this parameter value will have no effect.

    implementation_

    str

    Fields

    -

    Serving

    settings.json

    model-settings.json

    Start serving our model

    Send test inference request

    scikit-learn documentation

    Serving

    settings.json

    model-settings.json

    Start serving our model

    Send test inference request

    scikit-learn
    model. For that, we will use the
    which trains an SVM model.

    To save our trained model, we will serialise it using joblib. While this is not a perfect approach, it's currently the recommended method to persist models to disk in the scikit-learn documentation.

    Our model will be persisted as a file named mnist-svm.joblib

    Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    Note that the settings.json file will contain our Kafka configuration, including the address of the Kafka broker and the input / output topics that will be used for inference.

    Now that we have our config in place, we can start the server by running mlserver start . (the trailing dot points to the current directory). This command must either be run from the directory that contains our config files or be given the path to that folder.

    Since this command will start the server and block the terminal while it waits for requests, it needs to be run in the background or in a separate terminal.

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

    For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    Now that we have verified that our server is accepting REST requests, we will try to send a new inference request through Kafka. For this, we just need to send a request to the mlserver-input topic (which is the default input topic):

    Once the message has gone into the queue, the Kafka server running within MLServer should receive this message and run inference. The prediction output should then get posted into an output queue, which will be named mlserver-output by default.

    The results of our inference request should now be visible in the output Kafka queue.

    Run Kafka

    Run the no-zookeeper kafka broker

    Create Topics

    Training

    Scikit-Learn

    Saving our trained model

    Serving

    settings.json

    model-settings.json

    Start serving our model

    Send test inference request

    Send inference request through Kafka

    MNIST example from the scikit-learn documentation
    For this example, we will create a custom environment to serve a model trained with an older version of Scikit-Learn. The first step will be to define this environment using an environment.yml file.

    Note that these environments can also be created on the fly as we go, and then serialised later.

    To illustrate the point, we will train a Scikit-Learn model using our older environment.

    The first step will be to create and activate an environment which reflects what's outlined in our environment.yml file.

    NOTE: If you are running this from a Jupyter Notebook, you will need to restart your Jupyter instance so that it runs from this environment.

    We can now train and save a Scikit-Learn model using the older version of our environment. This model will be serialised as model.joblib.

    You can find more details of this process in the Scikit-Learn example.

    Lastly, we will need to serialise our environment in the format expected by MLServer. To do that, we will use a tool called conda-pack.

    This tool will save a portable version of our environment as a .tar.gz file, also known as a tarball.

    Now that we have defined our environment (and we've got a sample artifact trained in that environment), we can move to serving our model.

    To do that, we will first need to select the right runtime through a model-settings.json config file.

    We can then spin up our model, using our custom environment, by leveraging MLServer's Docker image. Keep in mind that you will need Docker installed on your machine to run this example.

    Our Docker command will need to take into account the following points:

    • Mount the example's folder as a volume so that it can be accessed from within the container.

    • Let MLServer know that our custom environment's tarball can be found as old-sklearn.tar.gz.

    • Expose port 8080 so that we can send requests from the outside.

    From the command line, this can be done using Docker's CLI as:

    Note that we need to keep the server running in the background while we send requests. Therefore, it's best to run this command on a separate terminal session.

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

    For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    Define our environment

    custom runtimes

    Train model in our custom environment

    Serialise our custom environment

    Serving

    Send test inference request

    import joblib
    
    model_file_name = "mnist-svm.joblib"
    joblib.dump(classifier, model_file_name)
    %%writefile settings.json
    {
        "debug": "true"
    }
    %%writefile model-settings.json
    {
        "name": "mnist-svm",
        "implementation": "mlserver_sklearn.SKLearnModel",
        "parameters": {
            "uri": "./mnist-svm.joblib",
            "version": "v0.1.0"
        }
    }
    mlserver start .
    import requests
    
    x_0 = X_test[0:1]
    inference_request = {
        "inputs": [
            {
              "name": "predict",
              "shape": x_0.shape,
              "datatype": "FP32",
              "data": x_0.tolist()
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/mnist-svm/versions/v0.1.0/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    y_test[0]
    model_file_name = 'mushroom-xgboost.json'
    bst.save_model(model_file_name)
    %%writefile settings.json
    {
        "debug": "true"
    }
    %%writefile model-settings.json
    {
        "name": "mushroom-xgboost",
        "implementation": "mlserver_xgboost.XGBoostModel",
        "parameters": {
            "uri": "./mushroom-xgboost.json",
            "version": "v0.1.0"
        }
    }
    mlserver start .
    import requests
    
    x_0 = X_test[0:1]
    inference_request = {
        "inputs": [
            {
              "name": "predict",
              "shape": x_0.shape,
              "datatype": "FP32",
              "data": x_0.tolist()
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/mushroom-xgboost/versions/v0.1.0/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    y_test[0]
    !wget https://apache.mirrors.nublue.co.uk/kafka/2.8.0/kafka_2.12-2.8.0.tgz
    !tar -zxvf kafka_2.12-2.8.0.tgz
    !./kafka_2.12-2.8.0/bin/kafka-storage.sh format -t OXn8RTSlQdmxwjhKnSB_6A -c ./kafka_2.12-2.8.0/config/kraft/server.properties
    !./kafka_2.12-2.8.0/bin/kafka-server-start.sh ./kafka_2.12-2.8.0/config/kraft/server.properties
    !./kafka_2.12-2.8.0/bin/kafka-topics.sh --create --topic mlserver-input --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092
    !./kafka_2.12-2.8.0/bin/kafka-topics.sh --create --topic mlserver-output --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092
    # Original source code and more details can be found in:
    # https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
    
    # Import datasets, classifiers and performance metrics
    from sklearn import datasets, svm, metrics
    from sklearn.model_selection import train_test_split
    
    # The digits dataset
    digits = datasets.load_digits()
    
    # To apply a classifier on this data, we need to flatten the image, to
    # turn the data in a (samples, feature) matrix:
    n_samples = len(digits.images)
    data = digits.images.reshape((n_samples, -1))
    
    # Create a classifier: a support vector classifier
    classifier = svm.SVC(gamma=0.001)
    
    # Split data into train and test subsets
    X_train, X_test, y_train, y_test = train_test_split(
        data, digits.target, test_size=0.5, shuffle=False)
    
    # We learn the digits on the first half of the digits
    classifier.fit(X_train, y_train)
    import joblib
    
    model_file_name = "mnist-svm.joblib"
    joblib.dump(classifier, model_file_name)
    %%writefile settings.json
    {
        "debug": "true",
        "kafka_enabled": "true"
    }
    %%writefile model-settings.json
    {
        "name": "mnist-svm",
        "implementation": "mlserver_sklearn.SKLearnModel",
        "parameters": {
            "uri": "./mnist-svm.joblib",
            "version": "v0.1.0"
        }
    }
    mlserver start .
    import requests
    
    x_0 = X_test[0:1]
    inference_request = {
        "inputs": [
            {
              "name": "predict",
              "shape": x_0.shape,
              "datatype": "FP32",
              "data": x_0.tolist()
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/mnist-svm/versions/v0.1.0/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    import json
    from kafka import KafkaProducer
    
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    
    headers = {
        "mlserver-model": b"mnist-svm",
        "mlserver-version": b"v0.1.0",
    }
    
    producer.send(
        "mlserver-input",
        json.dumps(inference_request).encode("utf-8"),
        headers=list(headers.items()))
    from kafka import KafkaConsumer
    
    consumer = KafkaConsumer(
        "mlserver-output",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest")
    
    for msg in consumer:
        print(f"key: {msg.key}")
        print(f"value: {msg.value}\n")
        break
    %%writefile environment.yml
    
    name: old-sklearn
    channels:
        - conda-forge
    dependencies:
        - python == 3.8
        - scikit-learn == 0.24.2
        - joblib == 0.17.0
        - requests
        - pip
        - pip:
            - mlserver == 1.1.0
            - mlserver-sklearn == 1.1.0
    !conda env create --force -f environment.yml
    !conda activate old-sklearn
    # Original source code and more details can be found in:
    # https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
    
    # Import datasets, classifiers and performance metrics
    from sklearn import datasets, svm, metrics
    from sklearn.model_selection import train_test_split
    
    # The digits dataset
    digits = datasets.load_digits()
    
    # To apply a classifier on this data, we need to flatten the image, to
    # turn the data in a (samples, feature) matrix:
    n_samples = len(digits.images)
    data = digits.images.reshape((n_samples, -1))
    
    # Create a classifier: a support vector classifier
    classifier = svm.SVC(gamma=0.001)
    
    # Split data into train and test subsets
    X_train, X_test, y_train, y_test = train_test_split(
        data, digits.target, test_size=0.5, shuffle=False)
    
    # We learn the digits on the first half of the digits
    classifier.fit(X_train, y_train)
    import joblib
    
    model_file_name = "model.joblib"
    joblib.dump(classifier, model_file_name)
    !conda pack --force -n old-sklearn -o old-sklearn.tar.gz
    %%writefile model-settings.json
    {
        "name": "mnist-svm",
        "implementation": "mlserver_sklearn.SKLearnModel"
    }
    docker run -it --rm \
        -v "$PWD":/mnt/models \
        -e "MLSERVER_ENV_TARBALL=/mnt/models/old-sklearn.tar.gz" \
        -p 8080:8080 \
        seldonio/mlserver:1.1.0-slim
    import requests
    
    x_0 = X_test[0:1]
    inference_request = {
        "inputs": [
            {
              "name": "predict",
              "shape": x_0.shape,
              "datatype": "FP32",
              "data": x_0.tolist()
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/mnist-svm/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()

    Custom Inference Runtimes

    There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.

    This page covers some of the bigger points that need to be taken into account when extending MLServer. You can also see this end-to-end example which walks through the process of writing a custom runtime.

    Writing a custom inference runtime

    MLServer is designed as an easy-to-extend framework, encouraging users to write their own custom runtimes easily. The starting point for this is the mlserver.MLModel abstract class, whose main methods are:

    • mlserver.MLModel.load(): Responsible for loading any artifacts related to a model (e.g. model weights, pickle files, etc.).

    • mlserver.MLModel.unload(): Responsible for unloading the model, freeing any resources (e.g. GPU memory, etc.).

    • mlserver.MLModel.predict(): Responsible for using a model to perform inference on an incoming data point.

    Therefore, the "one-line version" of how to write a custom runtime is to write a custom class extending mlserver.MLModel, and then override those methods with your custom logic.
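The load / predict / unload lifecycle described above can be sketched without MLServer installed. In the sketch below, RuntimeSketch is a hypothetical stand-in for mlserver.MLModel (it is not MLServer's real class), SumRuntime is a made-up runtime, and the async signatures and plain-list payloads are simplifying assumptions. In a real runtime you would subclass mlserver.MLModel and work with MLServer's request and response types instead.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import List


class RuntimeSketch(ABC):
    """Hypothetical stand-in for mlserver.MLModel (NOT the real class)."""

    async def load(self) -> bool:
        # Default: nothing to load.
        return True

    async def unload(self) -> bool:
        # Default: nothing to free.
        return True

    @abstractmethod
    async def predict(self, payload: List[float]) -> float:
        """Run inference on an incoming data point."""


class SumRuntime(RuntimeSketch):
    """Made-up runtime whose 'model' adds up its inputs."""

    async def load(self) -> bool:
        # Stands in for loading weights or other artifacts from disk.
        self._offset = 0.0
        self.ready = True
        return True

    async def unload(self) -> bool:
        # Stands in for freeing resources such as GPU memory.
        self.ready = False
        return True

    async def predict(self, payload: List[float]) -> float:
        return sum(payload) + self._offset


async def main() -> float:
    runtime = SumRuntime()
    await runtime.load()
    prediction = await runtime.predict([1.0, 2.0, 3.0])
    await runtime.unload()
    return prediction


result = asyncio.run(main())
print(result)
```

The point of the sketch is only the division of responsibilities: artifacts are loaded once in load(), used on every call to predict(), and released in unload().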

    MLServer exposes an alternative "simplified" interface which can be used to write custom runtimes. This interface can be enabled by decorating your predict() method with the mlserver.codecs.decode_args decorator. This will let you specify in the method signature both how you want your request payload to be decoded and how to encode the response back.

    Based on the information provided in the method signature, MLServer will automatically decode the request payload into the different inputs specified as keyword arguments. Under the hood, this is implemented through .

    As an example of the above, let's assume a model which

    • Takes two lists of strings as inputs:

      • questions, containing multiple questions to ask our model.

      • context, containing multiple contexts for each of the questions.

    Leveraging MLServer's simplified notation, we can represent the above as the following custom runtime:

    Note that the method signature of our predict method now specifies:

    • The input names that we should be looking for in the request payload (i.e. questions and context).

    • The expected content type for each of the request inputs (i.e. List[str] in both cases).

    • The expected content type of the response outputs (i.e.

    There are occasions where custom logic must be made conditional to extra information sent by the client outside of the payload. To allow for these use cases, MLServer will map all incoming HTTP headers (in the case of REST) or metadata (in the case of gRPC) into the headers field of the parameters object within the InferenceRequest instance.

    Similarly, to return any HTTP headers (in the case of REST) or metadata (in the case of gRPC), you can append any values to the headers field within the parameters object of the returned InferenceResponse instance.

    MLServer lets you load custom runtimes dynamically into a running instance of MLServer. Once you have your custom runtime ready, all you need to do is move it to your model folder, next to your model-settings.json configuration file.

    For example, if we assume a flat model repository where each folder represents a model, you would end up with a folder structure like the one below:

    Note that, from the example above, we are assuming that:

    • Your custom runtime code lives in the models.py file.

    • The implementation field of your model-settings.json configuration file contains the import path of your custom runtime (e.g. models.MyCustomRuntime).
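As a rough illustration of the layout described above, the sketch below creates such a model folder in a temporary directory. The model name sum-model and the placeholder contents of models.py are assumptions made for this example; only the models.py and model-settings.json file names and the models.MyCustomRuntime import path come from the text.

```python
import json
import tempfile
from pathlib import Path


def create_model_folder(repo_root: Path, model_name: str) -> Path:
    """Create <repo_root>/<model_name>/ holding models.py and model-settings.json."""
    model_dir = repo_root / model_name
    model_dir.mkdir(parents=True, exist_ok=True)

    # Placeholder only: a real models.py would define the MyCustomRuntime class.
    (model_dir / "models.py").write_text("# custom runtime code goes here\n")

    model_settings = {
        "name": model_name,
        # Import path of the custom runtime, resolved from the model folder.
        "implementation": "models.MyCustomRuntime",
    }
    (model_dir / "model-settings.json").write_text(
        json.dumps(model_settings, indent=2)
    )
    return model_dir


with tempfile.TemporaryDirectory() as tmp:
    model_dir = create_model_folder(Path(tmp), "sum-model")  # hypothetical name
    layout = sorted(p.name for p in model_dir.iterdir())
    settings = json.loads((model_dir / "model-settings.json").read_text())

print(layout)
```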

    More often than not, your custom runtimes will depend on external third-party dependencies which are not included within the main MLServer package. In these cases, to load your custom runtime, MLServer will need access to these dependencies.

    It is possible to load this custom set of dependencies by providing them through an environment tarball, whose path can be specified within your model-settings.json file.

    Status
    Description
    Worker Python \ Server Python
    3.9
    3.10
    3.11

    If we take the above as a reference, we could extend it to include our custom environment as:

    Note that, in the folder layout above, we are assuming that:

    • The environment.tar.gz tarball contains a pre-packaged version of your custom environment.

    • The environment_tarball field of your model-settings.json configuration file points to your pre-packaged custom environment (i.e. ./environment.tar.gz).
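A model-settings.json matching these assumptions could be generated as in the sketch below. The model name sum-model is hypothetical, and placing environment_tarball under parameters is an assumption based on the field reference, which lists it alongside uri and version.

```python
import json

model_settings = {
    "name": "sum-model",  # hypothetical model name
    "implementation": "models.MyCustomRuntime",
    "parameters": {
        # Path to the pre-packaged environment, relative to the model folder.
        "environment_tarball": "./environment.tar.gz",
    },
}

rendered = json.dumps(model_settings, indent=2)
print(rendered)
```

Writing the rendered string to model-settings.json, next to models.py and environment.tar.gz, would complete the layout.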

    MLServer offers built-in utilities to help you build a custom MLServer image. This image can contain any custom code (including custom inference runtimes), as well as any custom environment, provided either through a or a requirements.txt file.

    To leverage these, we can use the mlserver build command. Assuming that we're currently on the folder containing our custom inference runtime, we should be able to just run:

    The output will be a Docker image named my-custom-server, ready to be used.

    The subcommand will search for any Conda environment file (i.e. named either as environment.yaml or conda.yaml) and / or any requirements.txt present in your root folder. These can be used to tell MLServer what Python environment is required in the final Docker image.

    The mlserver build subcommand will treat any settings.json or model-settings.json files present in your root folder as the default settings that must be set in your final image. Therefore, these files can be used to configure things like the default inference runtime to be used, or even to include embedded models that will always be present within your custom image.

    Out-of-the-box, the mlserver build subcommand leverages a default Dockerfile which takes into account a number of requirements, like

    • Supporting arbitrary user IDs.

    • Building your on the fly.

    • Configure a set of .

    However, there may be occasions where you need to customise your Dockerfile even further. This may be the case, for example, when you need to provide extra environment variables or when you need to customise your Docker build process (e.g. by using other "Docker-less" tools, like or ).

    To account for these cases, MLServer also includes a subcommand which will just generate a Dockerfile (and optionally a .dockerignore file) exactly like the one used by the mlserver build command. This Dockerfile can then be customised according to your needs.

    Serving Alibi-Detect models

    Out of the box, mlserver supports the deployment and serving of alibi_detect models. Alibi Detect is an open source Python library focused on outlier, adversarial and drift detection. In this example, we will cover how we can create a detector configuration to then serve it using mlserver.

    Fetch reference data

    The first step will be to fetch the reference data and other relevant metadata for an alibi-detect model.

    For that, we will use the alibi library to get the adult dataset with demographic features from a 1996 US census.

    Install the `alibi` library (for the dataset dependencies) and the `alibi_detect` library (for the detector configuration) from PyPI:
    ```python
    !pip install alibi alibi_detect
    ```
    import alibi
    import matplotlib.pyplot as plt
    import numpy as np
    adult = alibi.datasets.fetch_adult()
    X, y = adult.data, adult.target
    feature_names = adult.feature_names
    category_map = adult.category_map
    n_ref = 10000
    n_test = 10000
    
    X_ref, X_t0, X_t1 = X[:n_ref], X[n_ref:n_ref + n_test], X[n_ref + n_test:n_ref + 2 * n_test]
    categories_per_feature = {f: None for f in list(category_map.keys())}

    Drift Detector Configuration

    This example is based on the Categorical and mixed type data drift detection on income prediction tabular data from the alibi-detect documentation.

    Creating detector and saving configuration

    from alibi_detect.cd import TabularDrift
    cd_tabular = TabularDrift(X_ref, p_val=.05, categories_per_feature=categories_per_feature)
    from alibi_detect.utils.saving import save_detector
    filepath = "alibi-detector-artifacts"
    save_detector(cd_tabular, filepath)

    Detecting data drift directly

    preds = cd_tabular.predict(X_t0,drift_type="feature")
    
    labels = ['No!', 'Yes!']
    print(f"Threshold {preds['data']['threshold']}")
    for f in range(cd_tabular.n_features):
        fname = feature_names[f]
        is_drift = (preds['data']['p_val'][f] < preds['data']['threshold']).astype(int)
        stat_val, p_val = preds['data']['distance'][f], preds['data']['p_val'][f]
        print(f'{fname} -- Drift? {labels[is_drift]} -- Chi2 {stat_val:.3f} -- p-value {p_val:.3f}')
    Threshold 0.05
    Age -- Drift? No! -- Chi2 0.012 -- p-value 0.508
    Workclass -- Drift? No! -- Chi2 8.487 -- p-value 0.387
    Education -- Drift? No! -- Chi2 4.753 -- p-value 0.576
    Marital Status -- Drift? No! -- Chi2 3.160 -- p-value 0.368
    Occupation -- Drift? No! -- Chi2 8.194 -- p-value 0.415
    Relationship -- Drift? No! -- Chi2 0.485 -- p-value 0.993
    Race -- Drift? No! -- Chi2 0.587 -- p-value 0.965
    Sex -- Drift? No! -- Chi2 0.217 -- p-value 0.641
    Capital Gain -- Drift? No! -- Chi2 0.002 -- p-value 1.000
    Capital Loss -- Drift? No! -- Chi2 0.002 -- p-value 1.000
    Hours per week -- Drift? No! -- Chi2 0.012 -- p-value 0.508
    Country -- Drift? No! -- Chi2 9.991 -- p-value 0.441

    Serving

    Now that we have the reference data and other configuration parameters, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
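A minimal sketch of these two files, written from Python, is shown below. The model name income-tabular-drift and the version are hypothetical. The implementation path is the Alibi-Detect runtime class listed in the runtimes table, and the uri is assumed to point at the alibi-detector-artifacts folder saved earlier.

```python
import json
from pathlib import Path


def write_configs(folder: Path) -> None:
    """Write settings.json and model-settings.json into the given folder."""
    settings = {"debug": True}
    model_settings = {
        "name": "income-tabular-drift",  # hypothetical model name
        "implementation": "mlserver_alibi_detect.AlibiDetectRuntime",
        "parameters": {
            # Folder produced by save_detector() earlier in this example.
            "uri": "./alibi-detector-artifacts",
            "version": "v0.1.0",  # hypothetical version
        },
    }
    (folder / "settings.json").write_text(json.dumps(settings, indent=4))
    (folder / "model-settings.json").write_text(json.dumps(model_settings, indent=4))


write_configs(Path("."))
loaded = json.loads(Path("model-settings.json").read_text())
print(loaded["implementation"])
```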

    Now that we have our config in place, we can start the server by running the mlserver start command. This command must either be run from the directory that contains our config files or be given the path to that folder.

    Since this command will start the server and block the terminal while it waits for requests, it needs to be run in the background or in a separate terminal.

    We now have our alibi-detect model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

    For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
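To build the request manually, a small helper along these lines may help. It follows the payload shape used in the other examples of this guide (a single input named predict with shape, datatype and data fields). The helper name and the placeholder row of twelve zeros (one value per feature of the adult dataset) are assumptions made for illustration, so substitute a real row such as X_t0[0:1].tolist().

```python
import json
from typing import List


def build_inference_request(
    rows: List[List[float]],
    input_name: str = "predict",
    datatype: str = "FP32",
) -> dict:
    """Wrap a 2D batch of rows into a single-input inference request."""
    if not rows or any(len(row) != len(rows[0]) for row in rows):
        raise ValueError("rows must be a non-empty, rectangular 2D list")
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": [len(rows), len(rows[0])],
                "datatype": datatype,
                "data": rows,
            }
        ]
    }


# Placeholder row with 12 values, one per feature of the adult dataset.
x_0 = [[0.0] * 12]
inference_request = build_inference_request(x_0)
payload = json.dumps(inference_request)
print(inference_request["inputs"][0]["shape"])
```

The resulting dictionary can then be posted with requests.post(endpoint, json=inference_request) to the model's infer endpoint, as in the other examples.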

    OpenAPI Support

    MLServer follows the Open Inference Protocol (previously known as the "V2 Protocol"). You can find the full OpenAPI spec for the Open Inference Protocol in the links below:

    Name
    Description
    OpenAPI Spec

    Open Inference Protocol

    Main dataplane for inference, health and metadata

    On top of the OpenAPI spec above, MLServer also autogenerates a Swagger UI which can be used to interact dynamically with the Open Inference Protocol.

    The autogenerated Swagger UI can be accessed under the /v2/docs endpoint.

    Alongside the general API documentation, MLServer will also autogenerate a Swagger UI tailored to individual models, showing the endpoints available for each one.

    The model-specific autogenerated Swagger UI can be accessed under the following endpoints:

    • /v2/models/{model_name}/docs

    • /v2/models/{model_name}/versions/{model_version}/docs
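The two endpoint patterns above can be assembled with a small helper like the one below. The helper name is hypothetical, and the base URL and model details are taken from the other examples in this guide, so they are assumptions about your deployment.

```python
from typing import Optional
from urllib.parse import quote


def model_docs_url(
    base_url: str, model_name: str, model_version: Optional[str] = None
) -> str:
    """Build the URL of the model-specific Swagger UI."""
    path = f"/v2/models/{quote(model_name, safe='')}"
    if model_version is not None:
        path += f"/versions/{quote(model_version, safe='')}"
    return f"{base_url.rstrip('/')}{path}/docs"


unversioned = model_docs_url("http://localhost:8080", "mnist-svm")
versioned = model_docs_url("http://localhost:8080", "mnist-svm", "v0.1.0")
print(unversioned)
print(versioned)
```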

    Multi-Model Serving

    MLServer has been built with Multi-Model Serving (MMS) in mind. This means that, within a single instance of MLServer, you can serve multiple models under different paths. This also includes multiple versions of the same model.

    This notebook shows an example of how you can leverage MMS with MLServer.

    We will first start by training 2 different models:

    Name
    Framework
    Source
    Trained Model Path

    Serving a custom model with JSON serialization

    The mlserver package comes with inference runtime implementations for scikit-learn and xgboost models. However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.

    In this example, we create a simple Hello World JSON model that parses and modifies a JSON data chunk. This is often useful as a means to quickly bootstrap existing models that utilize JSON based model inputs.
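Before wiring anything into MLServer, the "parse and modify" step can be sketched in plain Python. The function name, the response keys and the greeting text below are hypothetical and only illustrate the idea. This is not the runtime used by the example.

```python
import json


def hello_world(raw: bytes) -> bytes:
    """Parse a JSON payload, wrap it with a greeting, and re-serialise it."""
    request = json.loads(raw.decode("utf-8"))
    response = {
        "request": request,  # echo back what the client sent
        "server_response": "Hello from the server.",
    }
    return json.dumps(response).encode("utf-8")


reply = json.loads(hello_world(b'{"name": "Foo Bar"}'))
print(reply["request"])
```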

    The next step will be to serve our model using mlserver. For that, we will first implement an extension which serves as the runtime to perform inference using our custom

    Model Repository Extension

    Extension to the protocol to provide a control plane which lets you load / unload models dynamically

    model_repository.json

    Swagger UI

    Besides the Swagger UI, you can also access the raw OpenAPI spec through the /v2/docs/dataplane.json endpoint.

    Model Swagger UI

    Besides the Swagger UI, you can also access the model-specific raw OpenAPI spec through the following endpoints:

    • /v2/models/{model_name}/docs/dataplane.json

    • /v2/models/{model_name}/versions/{model_version}/docs/dataplane.json

    general API documentation
    dataplane.json

    settings.json

    model-settings.json

    Start serving our model

    Send test inference request

    View model response

    Returns a Numpy array with some predictions as the output.

    np.ndarray
    ).

    🔵

    3.11

    🔵

    🔵

    🔵

    🔴

    Unsupported

    🟢

    Supported

    🔵

    Untested

    3.9

    🟢

    🟢

    🔵

    3.10

    🟢

    Simplified interface

    MLServer's "simplified" interface aims to cover use cases where encoding / decoding can be done through one of the codecs built into the MLServer package. However, there are instances where this may not be enough (e.g. variable number of inputs, variable content types, etc.). For these types of cases, please use MLServer's "advanced" interface, where you will have full control over the whole encoding / decoding process.

    Read and write headers

    The headers field within the parameters section of the request / response is managed by MLServer. Therefore, incoming payloads where this field has been explicitly modified will be overridden.

    Loading a custom MLServer runtime

    Loading a custom Python environment

    To load a custom environment, parallel inference must be enabled.

    The main MLServer process communicates with workers in custom environments via multiprocessing.Queue using pickled objects. Custom environments therefore must use the same version of MLServer and a compatible version of Python with the same default pickle protocol as the main process. Consult the tables below for environment compatibility.
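One quick way to compare two environments before consulting the compatibility tables is to print the Python version and pickle protocol numbers in each of them. The sketch below uses only the standard library; the function name environment_fingerprint is hypothetical, and it does not check the MLServer version, which also has to match.

```python
import pickle
import platform


def environment_fingerprint() -> dict:
    """Collect the values that matter for pickle compatibility."""
    return {
        "python_version": platform.python_version(),
        # Protocol used by pickle.dumps() when none is given explicitly.
        "default_pickle_protocol": pickle.DEFAULT_PROTOCOL,
        "highest_pickle_protocol": pickle.HIGHEST_PROTOCOL,
    }


print(environment_fingerprint())
```

Running this once in the main environment and once inside the custom environment gives two dictionaries whose default_pickle_protocol entries should agree.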

    Building a custom MLServer image

    The mlserver build command expects that a Docker runtime is available and running in the background.

    Custom Environment

    The environment built by mlserver build will be global to the whole MLServer image (i.e. every loaded model will, by default, use that custom environment). For Multi-Model Serving scenarios, it may be better to use per-model custom environments instead, which will allow you to run multiple custom environments at the same time.

    Default Settings

    Default setting values can still be overridden by external environment variables or model-specific model-settings.json.
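The precedence described above can be sketched in a few lines. This is only an illustration of the idea, not MLServer's actual settings loader: it assumes the "MLSERVER_" environment prefix listed in the settings reference, and the helper resolve_setting is hypothetical. Note that values read from the environment are returned as strings here, with no type coercion.

```python
import os
from typing import Any, Mapping, Optional

# Prefix assumed from the env_prefix entry in the settings reference.
ENV_PREFIX = "MLSERVER_"


def resolve_setting(
    name: str,
    defaults: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> Any:
    """Return the environment override if present, else the default."""
    env = os.environ if environ is None else environ
    key = ENV_PREFIX + name.upper()
    if key in env:
        return env[key]
    return defaults[name]


defaults = {"http_port": 8080, "debug": True}
print(resolve_setting("http_port", defaults, {}))
print(resolve_setting("http_port", defaults, {"MLSERVER_HTTP_PORT": "9000"}))
```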

    Custom Dockerfile

    The base Dockerfile requires Docker's BuildKit to be enabled. To ensure BuildKit is used, you can set the DOCKER_BUILDKIT=1 environment variable, e.g.

    MLServer's codecs and content types system
    environment tarball
    previous example
    Conda environment file
    base custom environment
    default setting values
    Kaniko
    Buildah

    🟢

    mlserver build
    settings.json
    model-settings.json
    mlserver dockerfile
    Hello World JSON
    model.

    This is a trivial model to demonstrate how to conceptually work with JSON inputs / outputs. In this example:

    • Parse the JSON input from the client

    • Create a JSON response echoing back the client request as well as a server generated message

    The next step will be to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    Now that we have our config in place, we can start the server by running mlserver start . (note the trailing dot). This needs to be run either from the same directory where our config files are, or by pointing to the folder where they are.

    Since this command will start the server and block the terminal while waiting for requests, it will need to be run in the background or in a separate terminal.

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

    For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    Utilizing string data with the gRPC interface can be a bit tricky, so we need to make sure that inputs and outputs are handled correctly.

    For simplicity in this case, we leverage the Python types that mlserver provides out of the box. Alternatively, the gRPC stubs can be generated from the V2 specification directly for use by non-Python as well as Python clients.
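Part of what makes string data tricky is that the number of characters in a string and the number of bytes in its encoded form can differ. The standalone sketch below (standard library only, with a made-up payload) shows the difference for a JSON document containing a non-ASCII character.

```python
import json

# Hypothetical payload; "\u00eb" is a single character that needs two UTF-8 bytes.
payload = {"name": "Zo\u00eb", "message": "Hello from Client!"}

as_text = json.dumps(payload, ensure_ascii=False)
as_bytes = as_text.encode("utf-8")

print(len(as_text), len(as_bytes))

# Decoding with the same encoding recovers the original payload.
round_trip = json.loads(as_bytes.decode("utf-8"))
print(round_trip == payload)
```

Encoding explicitly to bytes on the client, and decoding with the same encoding on the other side, keeps both ends in agreement about what was sent.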

    Overview

    Serving

    Custom inference runtime

    Settings files

    settings.json

    model-settings.json

    Start serving our model

    Send test inference request (REST)

    Send test inference request (gRPC)

    %%writefile settings.json
    {
        "debug": "true"
    }
    Overwriting settings.json
    %%writefile model-settings.json
    {
      "name": "income-tabular-drift",
      "implementation": "mlserver_alibi_detect.AlibiDetectRuntime",
      "parameters": {
        "uri": "./alibi-detector-artifacts",
        "version": "v0.1.0",
        "extra": {
          "predict_parameters":{
            "drift_type": "feature"
          }
        }
      }
    }
    Overwriting model-settings.json
    mlserver start .
    import requests
    
    inference_request = {
        "inputs": [
            {
                "name": "predict",
                "shape": X_t0.shape,
                "datatype": "FP32",
                "data": X_t0.tolist(),
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/income-tabular-drift/versions/v0.1.0/infer"
    response = requests.post(endpoint, json=inference_request)
    import json
    response_dict = json.loads(response.text)
    
    labels = ['No!', 'Yes!']
    for f in range(cd_tabular.n_features):
        stat = 'Chi2' if f in list(categories_per_feature.keys()) else 'K-S'
        fname = feature_names[f]
        is_drift = response_dict['outputs'][0]['data'][f]
        stat_val, p_val = response_dict['outputs'][1]['data'][f], response_dict['outputs'][2]['data'][f]
        print(f'{fname} -- Drift? {labels[is_drift]} -- Chi2 {stat_val:.3f} -- p-value {p_val:.3f}')
    Age -- Drift? No! -- Chi2 0.012 -- p-value 0.508
    Workclass -- Drift? No! -- Chi2 8.487 -- p-value 0.387
    Education -- Drift? No! -- Chi2 4.753 -- p-value 0.576
    Marital Status -- Drift? No! -- Chi2 3.160 -- p-value 0.368
    Occupation -- Drift? No! -- Chi2 8.194 -- p-value 0.415
    Relationship -- Drift? No! -- Chi2 0.485 -- p-value 0.993
    Race -- Drift? No! -- Chi2 0.587 -- p-value 0.965
    Sex -- Drift? No! -- Chi2 0.217 -- p-value 0.641
    Capital Gain -- Drift? No! -- Chi2 0.002 -- p-value 1.000
    Capital Loss -- Drift? No! -- Chi2 0.002 -- p-value 1.000
    Hours per week -- Drift? No! -- Chi2 0.012 -- p-value 0.508
    Country -- Drift? No! -- Chi2 9.991 -- p-value 0.441
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse
    
    class MyCustomRuntime(MLModel):
    
      async def load(self) -> bool:
        # TODO: Replace for custom logic to load a model artifact
        self._model = load_my_custom_model()
        return True
    
      async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # TODO: Replace for custom logic to run inference
        return self._model.predict(payload)
    import numpy as np
    from mlserver import MLModel
    from mlserver.codecs import decode_args
    from typing import List
    
    class MyCustomRuntime(MLModel):
    
      async def load(self) -> bool:
        # TODO: Replace for custom logic to load a model artifact
        self._model = load_my_custom_model()
        return True
    
      @decode_args
      async def predict(self, questions: List[str], context: List[str]) -> np.ndarray:
        # TODO: Replace for custom logic to run inference
        return self._model.predict(questions, context)
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse
    
    class CustomHeadersRuntime(MLModel):
    
      ...
    
      async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        if payload.parameters and payload.parameters.headers:
          # These are all the incoming HTTP headers / gRPC metadata
          print(payload.parameters.headers)
        ...
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse, Parameters
    
    class CustomHeadersRuntime(MLModel):
    
      ...
    
      async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        ...
        return InferenceResponse(
          # Include any actual outputs from inference
          outputs=[],
          parameters=Parameters(headers={"foo": "bar"})
        )
    .
    └── models
        └── sum-model
            ├── model-settings.json
            └── models.py
    {
      "name": "sum-model",
      "implementation": "models.MyCustomRuntime"
    }
    .
    └── models
        └── sum-model
            ├── environment.tar.gz
            ├── model-settings.json
            └── models.py
    {
      "name": "sum-model",
      "implementation": "models.MyCustomRuntime",
      "parameters": {
        "environment_tarball": "./environment.tar.gz"
      }
    }
    mlserver build . -t my-custom-server
    DOCKER_BUILDKIT=1 docker build . -t my-custom-runtime:0.1.0
    %%writefile jsonmodels.py
    import json
    
    from typing import Dict, Any
    from mlserver import MLModel, types
    from mlserver.codecs import StringCodec
    
    
    class JsonHelloWorldModel(MLModel):
        async def load(self) -> bool:
            # Perform additional custom initialization here.
            print("Initialize model")
    
            # Set readiness flag for model
            return await super().load()
    
        async def predict(self, payload: types.InferenceRequest) -> types.InferenceResponse:
            request = self._extract_json(payload)
            response = {
                "request": request,
                "server_response": "Got your request. Hello from the server.",
            }
            response_bytes = json.dumps(response).encode("UTF-8")
    
            return types.InferenceResponse(
                id=payload.id,
                model_name=self.name,
                model_version=self.version,
                outputs=[
                    types.ResponseOutput(
                        name="echo_response",
                        shape=[len(response_bytes)],
                        datatype="BYTES",
                        data=[response_bytes],
                        parameters=types.Parameters(content_type="str"),
                    )
                ],
            )
    
        def _extract_json(self, payload: types.InferenceRequest) -> Dict[str, Any]:
            inputs = {}
            for inp in payload.inputs:
                inputs[inp.name] = json.loads(
                    "".join(self.decode(inp, default_codec=StringCodec))
                )
    
            return inputs
    
    %%writefile settings.json
    {
        "debug": "true"
    }
    %%writefile model-settings.json
    {
        "name": "json-hello-world",
        "implementation": "jsonmodels.JsonHelloWorldModel"
    }
    mlserver start .
    import requests
    import json
    from mlserver.types import InferenceResponse
    from mlserver.codecs.string import StringRequestCodec
    from pprint import PrettyPrinter
    
    pp = PrettyPrinter(indent=1)
    
    inputs = {"name": "Foo Bar", "message": "Hello from Client (REST)!"}
    
    # NOTE: this uses characters rather than encoded bytes. It is recommended that you use the `mlserver` types to assist in the correct encoding.
    inputs_string = json.dumps(inputs)
    
    inference_request = {
        "inputs": [
            {
                "name": "echo_request",
                "shape": [len(inputs_string)],
                "datatype": "BYTES",
                "data": [inputs_string],
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/json-hello-world/infer"
    response = requests.post(endpoint, json=inference_request)
    
    print(f"full response:\n")
    print(response)
    # retrieve text output as a dictionary
    inference_response = InferenceResponse.parse_raw(response.text)
    raw_json = StringRequestCodec.decode_response(inference_response)
    output = json.loads(raw_json[0])
    print(f"\ndata part:\n")
    pp.pprint(output)
    import requests
    import json
    import grpc
    from mlserver.codecs.string import StringRequestCodec
    import mlserver.grpc.converters as converters
    import mlserver.grpc.dataplane_pb2_grpc as dataplane
    import mlserver.types as types
    from pprint import PrettyPrinter
    
    pp = PrettyPrinter(indent=1)
    
    model_name = "json-hello-world"
    inputs = {"name": "Foo Bar", "message": "Hello from Client (gRPC)!"}
    inputs_bytes = json.dumps(inputs).encode("UTF-8")
    
    inference_request = types.InferenceRequest(
        inputs=[
            types.RequestInput(
                name="echo_request",
                shape=[len(inputs_bytes)],
                datatype="BYTES",
                data=[inputs_bytes],
                parameters=types.Parameters(content_type="str"),
            )
        ]
    )
    
    inference_request_g = converters.ModelInferRequestConverter.from_types(
        inference_request, model_name=model_name, model_version=None
    )
    
    grpc_channel = grpc.insecure_channel("localhost:8081")
    grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)
    
    response = grpc_stub.ModelInfer(inference_request_g)
    
    print(f"full response:\n")
    print(response)
    # retrieve text output as a dictionary
    inference_response = converters.ModelInferResponseConverter.to_types(response)
    raw_json = StringRequestCodec.decode_response(inference_response)
    output = json.loads(raw_json[0])
    print(f"\ndata part:\n")
    pp.pprint(output)

    scikit-learn

    ./models/mnist-svm/model.joblib

    mushroom-xgboost

    xgboost

    ./models/mushroom-xgboost/model.json

    The next step will be serving both our models within the same MLServer instance. For that, we will just need to create a model-settings.json file local to each of our models and a server-wide settings.json. That is,

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • models/mnist-svm/model-settings.json: holds the configuration specific to our mnist-svm model (e.g. input type, runtime to use, etc.).

    • models/mushroom-xgboost/model-settings.json: holds the configuration specific to our mushroom-xgboost model (e.g. input type, runtime to use, etc.).
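The layout above, with one model-settings.json per model folder, can be checked with a short script. The sketch below is a standalone illustration using only the standard library; it is not how MLServer itself discovers models, and the helper find_model_settings is hypothetical. It creates a temporary folder structure mirroring the one in this example and lists the models it finds.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def find_model_settings(root: Path) -> Dict[str, Path]:
    """Map each model name to the model-settings.json that declares it."""
    found: Dict[str, Path] = {}
    for settings_path in sorted(root.rglob("model-settings.json")):
        with settings_path.open() as f:
            config = json.load(f)
        # Fall back to the folder name if no "name" field is present.
        name = config.get("name", settings_path.parent.name)
        found[name] = settings_path
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for model_name in ("mnist-svm", "mushroom-xgboost"):
        model_dir = root / "models" / model_name
        model_dir.mkdir(parents=True)
        settings = {"name": model_name}
        (model_dir / "model-settings.json").write_text(json.dumps(settings))

    discovered = find_model_settings(root)
    print(sorted(discovered))
```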

    Now that we have our config in place, we can start the server by running mlserver start . (note the trailing dot). This needs to be run either from the same directory where our config files are, or by pointing to the folder where they are.

    Since this command will start the server and block the terminal while waiting for requests, it will need to be run in the background or in a separate terminal.

    By this point, we should have both our models getting served by MLServer. To make sure that everything is working as expected, let's send a request from each test set.

    For that, we can use the Python types that the mlserver package provides out of the box, or we can build our request manually.

    Training

    Multi-Model Serving (MMS)

    mnist-svm

    # Original source code and more details can be found in:
    # https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
    
    # Import datasets, classifiers and performance metrics
    from sklearn import datasets, svm, metrics
    from sklearn.model_selection import train_test_split
    
    # The digits dataset
    digits = datasets.load_digits()
    
    # To apply a classifier on this data, we need to flatten the image, to
    # turn the data in a (samples, feature) matrix:
    n_samples = len(digits.images)
    data = digits.images.reshape((n_samples, -1))
    
    # Create a classifier: a support vector classifier
    classifier = svm.SVC(gamma=0.001)
    
    # Split data into train and test subsets
    X_train, X_test_digits, y_train, y_test_digits = train_test_split(
        data, digits.target, test_size=0.5, shuffle=False)
    
    # We learn the digits on the first half of the digits
    classifier.fit(X_train, y_train)
    import joblib
    import os
    
    mnist_svm_path = os.path.join("models", "mnist-svm")
    os.makedirs(mnist_svm_path, exist_ok=True)
    
    mnist_svm_model_path = os.path.join(mnist_svm_path, "model.joblib")
    joblib.dump(classifier, mnist_svm_model_path)
    # Original code and extra details can be found in:
    # https://xgboost.readthedocs.io/en/latest/get_started.html#python
    
    import os
    import xgboost as xgb
    import requests
    
    from urllib.parse import urlparse
    from sklearn.datasets import load_svmlight_file
    
    
    TRAIN_DATASET_URL = 'https://raw.githubusercontent.com/dmlc/xgboost/master/demo/data/agaricus.txt.train'
    TEST_DATASET_URL = 'https://raw.githubusercontent.com/dmlc/xgboost/master/demo/data/agaricus.txt.test'
    
    
    def _download_file(url: str) -> str:
        parsed = urlparse(url)
        file_name = os.path.basename(parsed.path)
        file_path = os.path.join(os.getcwd(), file_name)
        
        res = requests.get(url)
        
        with open(file_path, 'wb') as file:
            file.write(res.content)
        
        return file_path
    
    train_dataset_path = _download_file(TRAIN_DATASET_URL)
    test_dataset_path = _download_file(TEST_DATASET_URL)
    
    # NOTE: Workaround to load SVMLight files from the XGBoost example
    X_train, y_train = load_svmlight_file(train_dataset_path)
    X_test_agar, y_test_agar = load_svmlight_file(test_dataset_path)
    X_train = X_train.toarray()
    X_test_agar = X_test_agar.toarray()
    
    # read in data
    dtrain = xgb.DMatrix(data=X_train, label=y_train)
    
    # specify parameters via map
    param = {'max_depth':2, 'eta':1, 'objective':'binary:logistic' }
    num_round = 2
    bst = xgb.train(param, dtrain, num_round)
    
    bst
    import os
    
    mushroom_xgboost_path = os.path.join("models", "mushroom-xgboost")
    os.makedirs(mushroom_xgboost_path, exist_ok=True)
    
    mushroom_xgboost_model_path = os.path.join(mushroom_xgboost_path, "model.json")
    bst.save_model(mushroom_xgboost_model_path)
    %%writefile settings.json
    {
        "debug": "true"
    }
    %%writefile models/mnist-svm/model-settings.json
    {
        "name": "mnist-svm",
        "implementation": "mlserver_sklearn.SKLearnModel",
        "parameters": {
            "version": "v0.1.0"
        }
    }
    %%writefile models/mushroom-xgboost/model-settings.json
    {
        "name": "mushroom-xgboost",
        "implementation": "mlserver_xgboost.XGBoostModel",
        "parameters": {
            "version": "v0.1.0"
        }
    }
    
    mlserver start .
    import requests
    
    x_0 = X_test_digits[0:1]
    inference_request = {
        "inputs": [
            {
              "name": "predict",
              "shape": x_0.shape,
              "datatype": "FP32",
              "data": x_0.tolist()
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/mnist-svm/versions/v0.1.0/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    import requests
    
    x_0 = X_test_agar[0:1]
    inference_request = {
        "inputs": [
            {
              "name": "predict",
              "shape": x_0.shape,
              "datatype": "FP32",
              "data": x_0.tolist()
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/mushroom-xgboost/versions/v0.1.0/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()

    Training our mnist-svm model

    Training our mushroom-xgboost model

    Serving

    settings.json

    models/mnist-svm/model-settings.json

    models/mushroom-xgboost/model-settings.json

    Start serving our model

    Testing

    Testing our mnist-svm model

    Testing our mushroom-xgboost model

    MLServer Settings

    Config

    Attribute
    Type
    Default

    extra

    str

    "ignore"

    env_prefix

    str

    Field
    Type
    Default
    Description

    Codecs

    Codec that converts to / from a base64 input.

    Evaluate whether the codec can encode (decode) the payload.

    Decode a request input into a high-level Python type.

    Decode a response output into a high-level Python type.

    Encode the given payload into a RequestInput.

    Encode the given payload into a response output.

    Exception.add_note(note) -- add a note to the exception

    MNIST example from the scikit-learn documentation
    Mushrooms example from the xgboost Getting Started guide

    Cache size to be used if caching is enabled.

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | cors_settings | Optional[CORSSettings] | None | - |
    | debug | bool | True | - |
    | environments_dir | str | '-' | - |
    | extensions | List[str] | [] | - |
    | grpc_max_message_length | Optional[int] | None | - |
    | grpc_port | int | 8081 | - |
    | gzip_enabled | bool | True | Enable GZipMiddleware. |
    | host | str | '0.0.0.0' | - |
    | http_port | int | 8080 | - |
    | kafka_enabled | bool | False | Enable Kafka integration for the server. |
    | kafka_servers | str | 'localhost:9092' | Comma-separated list of Kafka servers. |
    | kafka_topic_input | str | 'mlserver-input' | Kafka topic for input messages. |
    | kafka_topic_output | str | 'mlserver-output' | Kafka topic for output messages. |
    | load_models_at_startup | bool | True | - |
    | logging_settings | Union[str, Dict[Any, Any], None] | None | Path to logging config file or dictionary configuration. |
    | metrics_dir | str | '-' | Directory used to share metrics across parallel workers. Equivalent to the PROMETHEUS_MULTIPROC_DIR env var in prometheus-client. Note that this won't be used if the parallel_workers flag is disabled. By default, the .metrics folder of the current working directory will be used. |
    | metrics_endpoint | Optional[str] | '/metrics' | Endpoint used to expose Prometheus metrics. Alternatively, can be set to None to disable it. |
    | metrics_port | int | 8082 | Port used to expose metrics endpoint. |
    | metrics_rest_server_prefix | str | 'rest_server' | Metrics rest server string prefix to be exported. |
    | model_repository_implementation | Optional[ImportString] | None | - |
    | model_repository_implementation_args | dict | {} | - |
    | model_repository_root | str | '.' | - |
    | parallel_workers | int | 1 | - |
    | parallel_workers_timeout | int | 5 | - |
    | root_path | str | '' | - |
    | server_name | str | 'mlserver' | - |
    | server_version | str | '1.7.0.dev0' | - |
    | tracing_server | Optional[str] | None | Server name used to export OpenTelemetry tracing to collector service. |
    | use_structured_logging | bool | False | Use JSON-formatted structured logging instead of default format. |

    "MLSERVER_"

    env_file

    str

    ".env"

    protected_namespaces

    tuple

    ()

    cache_enabled

    bool

    False

    Enable caching for the model predictions.

    cache_size

    int

    Fields

    100

    Exception.with_traceback(tb) -- set self.__traceback__ to tb and return self.

    Codec that converts to / from a datetime input.

    Evaluate whether the codec can encode (decode) the payload.

    Decode a request input into a high-level Python type.

    Decode a response output into a high-level Python type.

    Encode the given payload into a RequestInput.

    Encode the given payload into a response output.

    The InputCodec interface lets you define type conversions of your raw input data to / from the Open Inference Protocol. Note that this codec applies at the individual input (output) level.

    For request-wide transformations (e.g. dataframes), use the RequestCodec interface instead.
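The difference between the two levels can be sketched with plain dictionaries shaped like Open Inference Protocol payloads. This is a conceptual illustration only: it does not use or reproduce MLServer's codec classes, and both function names are hypothetical.

```python
from typing import Any, Dict, List


def decode_string_input(request_input: Dict[str, Any]) -> List[str]:
    """Input-level: turn one BYTES input into a list of Python strings."""
    return [
        item.decode("utf-8") if isinstance(item, bytes) else str(item)
        for item in request_input["data"]
    ]


def decode_request_as_columns(request: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Request-level: treat every input of the request as one named column."""
    return {inp["name"]: list(inp["data"]) for inp in request["inputs"]}


request = {
    "inputs": [
        {"name": "age", "shape": [2], "datatype": "INT64", "data": [31, 47]},
        {"name": "city", "shape": [2], "datatype": "BYTES", "data": [b"Leeds", "York"]},
    ]
}

# An input-level conversion only ever sees a single input...
print(decode_string_input(request["inputs"][1]))
# ...whereas a request-level conversion sees all of them at once.
print(decode_request_as_columns(request))
```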

    Evaluate whether the codec can encode (decode) the payload.

    Decode a request input into a high-level Python type.

    Decode a response output into a high-level Python type.

    Encode the given payload into a RequestInput.

    Encode the given payload into a response output.

    Decodes a request input (response output) as a NumPy array.

    Evaluate whether the codec can encode (decode) the payload.

    Decode a request input into a high-level Python type.

    Decode a response output into a high-level Python type.

    Encode the given payload into a RequestInput.

    Encode the given payload into a response output.

    Decodes the first input (output) of request (response) as a NumPy array. This codec can be useful for cases where the whole payload is a single NumPy tensor.

    Evaluate whether the codec can encode (decode) the payload.

    Decode an inference request into a high-level Python object.

    Decode an inference response into a high-level Python object.

    Encode the given payload into an inference request.

    Encode the given payload into an inference response.

    Decodes a request (response) into a Pandas DataFrame, assuming each input (output) head corresponds to a column of the DataFrame.

    Evaluate whether the codec can encode (decode) the payload.

    Decode an inference request into a high-level Python object.

    Decode an inference response into a high-level Python object.

    Encode the given payload into an inference request.

    Encode the given payload into an inference response.

    The RequestCodec interface lets you define request-level conversions between high-level Python types and the Open Inference Protocol. This can be useful where the encoding of your payload encompasses multiple input heads (e.g. dataframes, where each column can be thought of as a separate input head).

    For individual input-level encoding / decoding, use the InputCodec interface instead.

    Evaluate whether the codec can encode (decode) the payload.

    Decode an inference request into a high-level Python object.

    Decode an inference response into a high-level Python object.

    Encode the given payload into an inference request.

    Encode the given payload into an inference response.

    Encodes a list of Python strings as a BYTES input (output).

    Evaluate whether the codec can encode (decode) the payload.

    Decode a request input into a high-level Python type.

    Decode a response output into a high-level Python type.

    Encode the given payload into a RequestInput.

    Encode the given payload into a response output.

    Decodes the first input (output) of request (response) as a list of strings. This codec can be useful for cases where the whole payload is a single list of strings.

    Evaluate whether the codec can encode (decode) the payload.

    Decode an inference request into a high-level Python object.

    Decode an inference response into a high-level Python object.

    Encode the given payload into an inference request.

    Encode the given payload into an inference response.

    No description available.

    No description available.

    No description available.

    No description available.

    No description available.

    No description available.

    No description available.

    No description available.

    No description available.

    No description available.

    Base64Codec

    Methods

    can_encode()

    decode_input()

    decode_output()

    encode_input()

    encode_output()

    CodecError

    Methods

    add_note()

    with_traceback()

    DatetimeCodec

    Methods

    can_encode()

    decode_input()

    decode_output()

    encode_input()

    encode_output()

    InputCodec

    Methods

    can_encode()

    decode_input()

    decode_output()

    encode_input()

    encode_output()

    NumpyCodec

    Methods

    can_encode()

    decode_input()

    decode_output()

    encode_input()

    encode_output()

    NumpyRequestCodec

    Methods

    can_encode()

    decode_request()

    decode_response()

    encode_request()

    encode_response()

    PandasCodec

    Methods

    can_encode()

    decode_request()

    decode_response()

    encode_outputs()

    encode_request()

    encode_response()

    RequestCodec

    Methods

    can_encode()

    decode_request()

    decode_response()

    encode_request()

    encode_response()

    StringCodec

    Methods

    can_encode()

    decode_input()

    decode_output()

    encode_input()

    encode_output()

    StringRequestCodec

    Methods

    can_encode()

    decode_request()

    decode_response()

    encode_request()

    encode_response()

    decode_args()

    decode_inference_request()

    decode_request_input()

    encode_inference_response()

    encode_response_output()

    get_decoded()

    get_decoded_or_raw()

    has_decoded()

    register_input_codec()

    register_request_codec()

    can_encode(payload: Any) -> bool
    decode_input(request_input: RequestInput) -> List[bytes]
    decode_output(response_output: ResponseOutput) -> List[bytes]
    encode_input(name: str, payload: List[bytes], use_bytes: bool = True, **kwargs) -> RequestInput
    encode_output(name: str, payload: List[bytes], use_bytes: bool = True, **kwargs) -> ResponseOutput
    add_note(...)
    with_traceback(...)
    can_encode(payload: Any) -> bool
    decode_input(request_input: RequestInput) -> List[datetime]
    decode_output(response_output: ResponseOutput) -> List[datetime]
    encode_input(name: str, payload: List[Union[str, datetime]], use_bytes: bool = True, **kwargs) -> RequestInput
    encode_output(name: str, payload: List[Union[str, datetime]], use_bytes: bool = True, **kwargs) -> ResponseOutput
    can_encode(payload: Any) -> bool
    decode_input(request_input: RequestInput) -> Any
    decode_output(response_output: ResponseOutput) -> Any
    encode_input(name: str, payload: Any, **kwargs) -> RequestInput
    encode_output(name: str, payload: Any, **kwargs) -> ResponseOutput
    can_encode(payload: Any) -> bool
    decode_input(request_input: RequestInput) -> ndarray
    decode_output(response_output: ResponseOutput) -> ndarray
    encode_input(name: str, payload: ndarray, **kwargs) -> RequestInput
    encode_output(name: str, payload: ndarray, **kwargs) -> ResponseOutput
    can_encode(payload: Any) -> bool
    decode_request(request: InferenceRequest) -> Any
    decode_response(response: InferenceResponse) -> Any
    encode_request(payload: Any, **kwargs) -> InferenceRequest
    encode_response(model_name: str, payload: Any, model_version: Optional[str] = None, **kwargs) -> InferenceResponse
    can_encode(payload: Any) -> bool
    decode_request(request: InferenceRequest) -> DataFrame
    decode_response(response: InferenceResponse) -> DataFrame
    encode_outputs(payload: DataFrame, use_bytes: bool = True) -> List[ResponseOutput]
    encode_request(payload: DataFrame, use_bytes: bool = True, **kwargs) -> InferenceRequest
    encode_response(model_name: str, payload: DataFrame, model_version: Optional[str] = None, use_bytes: bool = True, **kwargs) -> InferenceResponse
    can_encode(payload: Any) -> bool
    decode_request(request: InferenceRequest) -> Any
    decode_response(response: InferenceResponse) -> Any
    encode_request(payload: Any, **kwargs) -> InferenceRequest
    encode_response(model_name: str, payload: Any, model_version: Optional[str] = None, **kwargs) -> InferenceResponse
    can_encode(payload: Any) -> bool
    decode_input(request_input: RequestInput) -> List[str]
    decode_output(response_output: ResponseOutput) -> List[str]
    encode_input(name: str, payload: List[str], use_bytes: bool = True, **kwargs) -> RequestInput
    encode_output(name: str, payload: List[str], use_bytes: bool = True, **kwargs) -> ResponseOutput
    can_encode(payload: Any) -> bool
    decode_request(request: InferenceRequest) -> Any
    decode_response(response: InferenceResponse) -> Any
    encode_request(payload: Any, kwargs) -> InferenceRequest
    encode_response(model_name: str, payload: Any, model_version: Optional[str] = None, kwargs) -> InferenceResponse
    decode_args(predict: Callable) -> Callable[[ForwardRef('MLModel'), <class 'mlserver.types.dataplane.InferenceRequest'>], Coroutine[Any, Any, InferenceResponse]]
    decode_inference_request(inference_request: InferenceRequest, model_settings: Optional[ModelSettings] = None, metadata_inputs: Dict[str, MetadataTensor] = {}) -> Optional[Any]
    decode_request_input(request_input: RequestInput, metadata_inputs: Dict[str, MetadataTensor] = {}) -> Optional[Any]
    encode_inference_response(payload: Any, model_settings: ModelSettings) -> Optional[InferenceResponse]
    encode_response_output(payload: Any, request_output: RequestOutput, metadata_outputs: Dict[str, MetadataTensor] = {}) -> Optional[ResponseOutput]
    get_decoded(parametrised_obj: Union[InferenceRequest, RequestInput, RequestOutput, ResponseOutput, InferenceResponse]) -> Any
    get_decoded_or_raw(parametrised_obj: Union[InferenceRequest, RequestInput, RequestOutput, ResponseOutput, InferenceResponse]) -> Any
    has_decoded(parametrised_obj: Union[InferenceRequest, RequestInput, RequestOutput, ResponseOutput, InferenceResponse]) -> bool
    register_input_codec(CodecKlass: Union[type[InputCodec], InputCodec])
    register_request_codec(CodecKlass: Union[type[RequestCodec], RequestCodec])

    Deploying a Custom Tensorflow Model with MLServer and Seldon Core

    Background

    Intro

    This tutorial walks through the steps required to take a python ML model from your machine to a production deployment on Kubernetes. More specifically we'll cover:

    • Running the model locally

    • Turning the ML model into an API

    • Containerizing the model

    • Storing the container in a registry

    • Deploying the model to Kubernetes (with Seldon Core)

    • Scaling the model

    The tutorial comes with an accompanying video which you might find useful as you work through the steps:

    The slides used in the video can be found here.

    For this tutorial, we're going to use the Cassava dataset available from the TensorFlow Catalog. This dataset includes leaf images from the cassava plant. Each plant can be classified as either "healthy" or as having one of four diseases (Mosaic Disease, Bacterial Blight, Green Mite, Brown Streak Disease).

    We won't go through the steps of training the classifier. Instead, we'll be using a pre-trained one available on TensorFlow Hub. You can find the model details here.

    The easiest way to run this example is to clone the repository located here:

    If you've already cloned the MLServer repository, you can also find it in docs/examples/cassava.

    Once you've done that, you can just run:

    And it'll set you up with all the libraries required to run the code.

    The starting point for this tutorial is the python script app.py. This is typical of the kind of python code we'd run standalone or in a jupyter notebook. Let's familiarise ourselves with the code:

    First up, we're importing a couple of functions from our helpers.py file:

    • plot provides the visualisation of the samples, labels and predictions.

    • preprocess is used to resize images to 224x224 pixels and normalize the RGB values.

    The rest of the code is fairly self-explanatory from the comments. We load the model and dataset, select some examples, make predictions and then plot the results.

    Try it yourself by running:

    Here's what our setup currently looks like:

    The problem with running our code like we did earlier is that it's not accessible to anyone who doesn't have the python script (and all of its dependencies). A good way to solve this is to turn our model into an API.

    Typically people turn to popular python web servers like Flask or FastAPI. This is a good approach and gives us lots of flexibility, but it also requires us to do a lot of the work ourselves. We need to implement routes, set up logging, capture metrics and define an API schema, among other things. A simpler way to tackle this problem is to use an inference server. For this tutorial we're going to use the open source MLServer framework.

    MLServer supports a bunch of inference runtimes out of the box, but it also supports custom python code, which is what we'll use for our Tensorflow model.

    In order to get our model ready to run on MLServer we need to wrap it in a single python class with two methods, load() and predict(). Let's take a look at the code (found in model/serve-model.py):

    The load() method is used to define any logic required to set up our model for inference. In our case, we're loading the model weights into self._model. The predict() method is where we include all of our prediction logic.

    You may notice that we've slightly modified our code from earlier (in app.py). The biggest change is that it is now wrapped in a single class CassavaModel.

    The only other task we need to do to run our model on MLServer is to specify a model-settings.json file:

    This is a simple configuration file that tells MLServer how to handle our model. In our case, we've provided a name for our model and told MLServer where to look for our model class (serve-model.CassavaModel).

    We're now ready to serve our model with MLServer. To do that we can simply run:

    MLServer will now start up, load our cassava model and provide access through both a REST and gRPC API.

    Now that our API is up and running, open a new terminal window and navigate back to the root of this repository. We can then send predictions to our API using the test.py file by running:

    Our setup has now evolved and looks like this:

    Containers are an easy way to package our application together with its runtime and dependencies. More importantly, containerizing our model allows it to run in a variety of different environments.

    Note: you will need Docker installed to run this section of the tutorial. You'll also need a Docker Hub account or another container registry.

    Taking our model and packaging it into a container manually can be a pretty tricky process and requires knowledge of writing Dockerfiles. Thankfully MLServer removes this complexity and provides us with a simple build command.

    Before we run this command, we need to provide our dependencies in either a requirements.txt or a conda.env file. The requirements file we'll use for this example is stored in model/requirements.txt:

    Notice that we didn't need to include mlserver in our requirements? That's because the builder image has mlserver included already.

    We're now ready to build our container image using:

    Make sure you replace YOUR_CONTAINER_REGISTRY and IMAGE_NAME with your dockerhub username and a suitable name e.g. "bobsmith/cassava".

    MLServer will now build the model into a container image for us. We can check the output of this by running:

    Finally, we want to send this container image to be stored in our container registry. We can do this by running:

    Our setup now looks like this, with our model packaged and sent to a container registry:

    Now that we've turned our model into a production-ready API, containerized it and pushed it to a registry, it's time to deploy our model.

    We're going to use a popular open source framework called Seldon Core to deploy our model. Seldon Core is great because it combines all of the awesome cloud-native features we get from Kubernetes with machine-learning specific features.

    This tutorial assumes you already have a Seldon Core cluster up and running. If that's not the case, head over to the installation instructions and get set up first. You'll also need to install the kubectl command line interface.

    To create our deployment with Seldon Core we need to create a small configuration file that looks like this:

    You can find this file named deployment.yaml in the base folder of this tutorial's repository.

    Make sure you replace YOUR_CONTAINER_REGISTRY and IMAGE_NAME with your dockerhub username and a suitable name e.g. "bobsmith/cassava".

    We can apply this configuration file to our Kubernetes cluster just like we would for any other Kubernetes object using:

    To check our deployment is up and running we can run:

    We should see STATUS = Running once our deployment has finalized.

    Now that our model is up and running on a Kubernetes cluster (via Seldon Core), we can send some test inference requests to make sure it's working.

    To do this, we simply run the test.py file in the following way:

    This script will randomly select some test samples, send them to the cluster, gather the predictions and then plot them for us.
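As a rough illustration of the "gather the predictions" step, the sketch below turns a V2 Inference Protocol response body into readable labels. It is not the code from test.py. The `outputs`, `shape`, `datatype` and `data` fields follow the V2 protocol, but the helper name, the output name `output-0` and, in particular, the ordering of `CLASS_NAMES` are assumptions made for this example; the real ordering comes from the dataset metadata.

```python
# Hypothetical label ordering, for illustration only. The real order comes
# from info.features['label'].names in the dataset metadata.
CLASS_NAMES = [
    "bacterial blight",
    "brown streak disease",
    "green mite",
    "mosaic disease",
    "healthy",
    "unknown",
]


def labels_from_response(response: dict, class_names: list) -> list:
    """Map the class indices in the first output tensor to class names."""
    output = response["outputs"][0]
    return [class_names[int(index)] for index in output["data"]]


# A hand-written response body shaped like a V2 inference response.
example_response = {
    "model_name": "cassava",
    "outputs": [
        {"name": "output-0", "shape": [3], "datatype": "INT64", "data": [4, 0, 3]}
    ],
}

predicted_labels = labels_from_response(example_response, CLASS_NAMES)
print(predicted_labels)
```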

    A note on running this yourself: This example is set up to connect to a kubernetes cluster running locally on your machine. If yours is local too, you'll need to make sure you port forward before sending requests. If your cluster is remote, you'll need to change the inference_url variable on line 21 of test.py.
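To make the local versus remote distinction more concrete, here is a small sketch of how such an inference URL could be assembled. The `/v2/models/<name>/infer` suffix matches the endpoints used elsewhere in these docs. The `/seldon/<namespace>/<deployment>` prefix is an assumption about how a Seldon Core ingress routes traffic, and the helper name, the `default` namespace and the example host are all hypothetical, so check them against your own cluster.

```python
from typing import Optional


def build_inference_url(
    host: str,
    model_name: str,
    namespace: Optional[str] = None,
    deployment: Optional[str] = None,
) -> str:
    """Build a V2 inference URL, optionally routed through a Seldon Core ingress."""
    base = f"http://{host}"
    if namespace and deployment:
        # Assumed ingress routing prefix; verify it for your installation.
        base += f"/seldon/{namespace}/{deployment}"
    return f"{base}/v2/models/{model_name}/infer"


# MLServer reached directly, e.g. when started with `mlserver start`.
direct_url = build_inference_url("localhost:8080", "cassava")

# A deployment reached through an ingress (host and namespace are placeholders).
cluster_url = build_inference_url(
    "203.0.113.10", "cassava", namespace="default", deployment="cassava"
)

print(direct_url)
print(cluster_url)
```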

    Having deployed our model to kubernetes and tested it, our setup now looks like this:

    Our model is now running in a production environment and able to handle requests from external sources. This is awesome, but what happens as the number of requests being sent to our model starts to increase? Eventually, we'll reach the limit of what a single server can handle. Thankfully, we can get around this problem by scaling our model horizontally.

    Kubernetes and Seldon Core make this really easy to do by simply running:

    We can replace the --replicas=3 with any number we want to scale to.

    To watch the servers scaling out we can run:

    Once the new replicas have finished rolling out, our setup now looks like this:

    In this tutorial we've scaled the model out manually to show how it works. In a real environment we'd want to set up auto-scaling to make sure our prediction API is always online and performing as expected.

    MLServer CLI

    The MLServer package includes a mlserver CLI designed to help with common tasks in a model’s lifecycle. You can see a high-level outline at any time via:

    mlserver --help

    mlserver

    Command-line interface to manage MLServer models.

    mlserver [OPTIONS] COMMAND [ARGS]...

    Options

    • --version (Default: False) Show the version and exit.

    build

    Build a Docker image for a custom MLServer runtime.

    mlserver build [OPTIONS] FOLDER

    Options

    • -t, --tag <text>

    • --no-cache (Default: False)

    • FOLDER Required argument

    Generate a Dockerfile

    • -i, --include-dockerignore (Default: False)

    • FOLDER Required argument

    Deprecated: This experimental feature will be removed in future work. Execute batch inference requests against V2 inference server.

    Deprecated: This experimental feature will be removed in future work.

    • --url, -u <text> (Default: localhost:8080; Env: MLSERVER_INFER_URL) URL of the MLServer to send inference requests to. Should not contain http or https.

    • --model-name, -m

    Generate a base project template

    • -t, --template <text> (Default: https://github.com/EthicalML/sml-security/)

    Start serving a machine learning model with MLServer.

    • FOLDER Required argument

    Serving a custom model

    The mlserver package comes with inference runtime implementations for scikit-learn and xgboost models. However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.

    In this example, we will train a numpyro model. The numpyro library streamlines the implementation of probabilistic models, abstracting away advanced inference and training algorithms.

    Out of the box, mlserver doesn't provide an inference runtime for numpyro.

  • --model-name, -m <text> (Required; Env: MLSERVER_INFER_MODEL_NAME) Name of the model to send inference requests to.

  • --input-data-path, -i <path> (Required; Env: MLSERVER_INFER_INPUT_DATA_PATH) Local path to the input file containing inference requests to be processed.

  • --output-data-path, -o <path> (Required; Env: MLSERVER_INFER_OUTPUT_DATA_PATH) Local path to the output file for the inference responses to be written to.

  • --workers, -w <integer> (Default: 10; Env: MLSERVER_INFER_WORKERS)

  • --retries, -r <integer> (Default: 3; Env: MLSERVER_INFER_RETRIES)

  • --batch-size, -s <integer> (Default: 1; Env: MLSERVER_INFER_BATCH_SIZE) Send inference requests grouped together as micro-batches.

  • --binary-data, -b (Default: False; Env: MLSERVER_INFER_BINARY_DATA) Send inference requests as binary data (not fully supported).

  • --verbose, -v (Default: False; Env: MLSERVER_INFER_VERBOSE) Verbose mode.

  • --extra-verbose, -vv (Default: False; Env: MLSERVER_INFER_EXTRA_VERBOSE) Extra verbose mode (shows detailed requests and responses).

  • --transport, -t <choice> (Options: rest | grpc; Default: rest; Env: MLSERVER_INFER_TRANSPORT) Transport type to use to send inference requests. Can be 'rest' or 'grpc' (not yet supported).

  • --request-headers, -H <text> (Env: MLSERVER_INFER_REQUEST_HEADERS) Headers to be set on each inference request sent to the server. Multiple options are allowed as: -H 'Header1: Val1' -H 'Header2: Val2'. When set through the environment variable, provide them as 'Header1:Val1 Header2:Val2'.

  • --timeout <integer> (Default: 60; Env: MLSERVER_INFER_CONNECTION_TIMEOUT) Connection timeout to be passed to tritonclient.

  • --batch-interval <float> (Default: 0; Env: MLSERVER_INFER_BATCH_INTERVAL) Minimum time interval (in seconds) between requests made by each worker.

  • --batch-jitter <float> (Default: 0; Env: MLSERVER_INFER_BATCH_JITTER) Maximum random jitter (in seconds) added to batch interval between requests.

  • --use-ssl (Default: False; Env: MLSERVER_INFER_USE_SSL) Use SSL in communications with inference server.

  • --insecure (Default: False; Env: MLSERVER_INFER_INSECURE) Disable SSL verification in communications. Use with caution.

  • Arguments
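The --batch-size, --batch-interval and --batch-jitter options above describe micro-batching and request pacing. The sketch below is one way to picture that behaviour and is not the CLI's actual implementation: the function names are hypothetical, and drawing the jitter uniformly between zero and the configured maximum is an assumption.

```python
import random
from typing import Iterator, List, Sequence


def micro_batches(items: Sequence, batch_size: int) -> Iterator[List]:
    """Yield consecutive groups of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def next_delay(batch_interval: float, batch_jitter: float, rng: random.Random) -> float:
    """Seconds to wait before the next request: interval plus random jitter."""
    # Assumption: jitter is drawn uniformly from [0, batch_jitter].
    return batch_interval + rng.uniform(0.0, batch_jitter)


requests_to_send = ["req-0", "req-1", "req-2", "req-3", "req-4"]
batches = list(micro_batches(requests_to_send, batch_size=2))
print(batches)

delay = next_delay(batch_interval=1.0, batch_jitter=0.5, rng=random.Random(0))
print(f"wait {delay:.3f}s before the next micro-batch")
```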

    dockerfile

    Options

    Arguments

    infer

    Options

    init

    Options

    start

    Arguments

    However, through this example we will see how easy it is to develop our own.

    The first step will be to train our model. This will be a very simple Bayesian regression model, based on an example provided in the numpyro docs.
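Read off from the model function in the training code, the regression can be written down roughly as follows. The symbols are introduced here for illustration: $M_i$, $A_i$ and $D_i$ stand for the standardised marriage rate, median age at marriage and divorce rate, and the second argument of each normal distribution is a standard deviation, following numpyro's `dist.Normal(loc, scale)` convention.

```latex
\begin{aligned}
a &\sim \mathcal{N}(0,\ 0.2) \\
b_M,\ b_A &\sim \mathcal{N}(0,\ 0.5) \\
\sigma &\sim \mathrm{Exponential}(1) \\
\mu_i &= a + b_M M_i + b_A A_i \\
D_i &\sim \mathcal{N}(\mu_i,\ \sigma)
\end{aligned}
```

When a covariate is not passed to the model, its term is dropped from $\mu_i$. The training run in this example only passes the marriage rate, so $b_A$ is not sampled there.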

    Since this is a probabilistic model, during training we will compute an approximation to the posterior distribution of our model using MCMC.

    Now that we have trained our model, the next step will be to save it so that it can be loaded afterwards at serving-time. Note that, since this is a probabilistic model, we will only need to save the traces that approximate the posterior distribution over latent parameters.

    This will get saved in a numpyro-divorce.json file.

    The next step will be to serve our model using mlserver. For that, we will first implement an extension which serves as the runtime to perform inference using our custom numpyro model.

    Our custom inference wrapper should be responsible for:

    • Loading the model from the set of samples we saved previously.

    • Running inference using our model structure, and the posterior approximated from the samples.

    The next step will be to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    Now that we have our config in-place, we can start the server by running mlserver start . (with a trailing dot for the current folder). This needs to be run either from the directory where our config files are, or by pointing it to the folder that contains them.

    Since this command will start the server and block the terminal while waiting for requests, it will need to be run in the background or in a separate terminal.

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

    For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
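The manual route can be sketched with nothing but the standard library. The field names (`name`, `shape`, `datatype`, `data`) follow the V2 Inference Protocol. The helper `build_v2_request` is hypothetical, and `FP64` is assumed here as the datatype for a float64 input.

```python
import json
from typing import List


def build_v2_request(name: str, values: List[float], datatype: str = "FP64") -> dict:
    """Build a V2 Inference Protocol request body holding one 1-D tensor."""
    return {
        "inputs": [
            {
                "name": name,
                "shape": [len(values)],
                "datatype": datatype,
                "data": list(values),
            }
        ]
    }


# Same input as in the codec-based example: a single marriage rate value.
payload = build_v2_request("marriage", [28.0])
body = json.dumps(payload)
print(body)
```

The resulting dictionary can be sent as the JSON body of a POST request to the model's infer endpoint, for example with `requests.post(endpoint, json=payload)`.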

    Now that we have written and tested our custom model, the next step is to deploy it. With that goal in mind, the rough outline of steps will be to first build a custom image containing our code, and then deploy it.

    MLServer will automatically find your requirements.txt file and install the necessary python packages.

    MLServer offers helpers to build a custom Docker image containing your code. In this example, we will use the mlserver build subcommand to create an image, which we'll be able to deploy later.

    Note that this section expects that Docker is available and running in the background, as well as a functional cluster with Seldon Core installed and some familiarity with kubectl.

    To ensure that the image is fully functional, we can spin up a container and then send a test request. To start the container, you can run something along the following lines in a separate terminal:

    As we should be able to see, the server running within our Docker image responds as expected.

    Now that we've built a custom image and verified that it works as expected, we can move to the next step and deploy it. There is a large number of tools out there to deploy images. However, for our example, we will focus on deploying it to a cluster running Seldon Core.

    For that, we will need to create a SeldonDeployment resource which instructs Seldon Core to deploy a model embedded within our custom image and compliant with the V2 Inference Protocol. This can be achieved by applying (i.e. kubectl apply) a SeldonDeployment manifest to the cluster, similar to the one below:

    Overview

    numpyro model

    Training

    Saving our trained model

    Serving

    Custom inference runtime

    Settings files

    settings.json

    model-settings.json

    Start serving our model

    Send test inference request

    Deployment

    Specifying requirements

    Building a custom image

    Deploying our custom image

    mlserver dockerfile [OPTIONS] FOLDER
    mlserver infer [OPTIONS]
    mlserver init [OPTIONS]
    mlserver start [OPTIONS] FOLDER
    # Original source code and more details can be found in:
    # https://nbviewer.jupyter.org/github/pyro-ppl/numpyro/blob/master/notebooks/source/bayesian_regression.ipynb
    
    
    import numpyro
    import numpy as np
    import pandas as pd
    
    from numpyro import distributions as dist
    from jax import random
    from numpyro.infer import MCMC, NUTS
    
    DATASET_URL = "https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/WaffleDivorce.csv"
    dset = pd.read_csv(DATASET_URL, sep=";")
    
    standardize = lambda x: (x - x.mean()) / x.std()
    
    dset["AgeScaled"] = dset.MedianAgeMarriage.pipe(standardize)
    dset["MarriageScaled"] = dset.Marriage.pipe(standardize)
    dset["DivorceScaled"] = dset.Divorce.pipe(standardize)
    
    
    def model(marriage=None, age=None, divorce=None):
        a = numpyro.sample("a", dist.Normal(0.0, 0.2))
        M, A = 0.0, 0.0
        if marriage is not None:
            bM = numpyro.sample("bM", dist.Normal(0.0, 0.5))
            M = bM * marriage
        if age is not None:
            bA = numpyro.sample("bA", dist.Normal(0.0, 0.5))
            A = bA * age
        sigma = numpyro.sample("sigma", dist.Exponential(1.0))
        mu = a + M + A
        numpyro.sample("obs", dist.Normal(mu, sigma), obs=divorce)
    
    
    # Start from this source of randomness. We will split keys for subsequent operations.
    rng_key = random.PRNGKey(0)
    rng_key, rng_key_ = random.split(rng_key)
    
    num_warmup, num_samples = 1000, 2000
    
    # Run NUTS.
    kernel = NUTS(model)
    mcmc = MCMC(kernel, num_warmup=num_warmup, num_samples=num_samples)
    mcmc.run(
        rng_key_, marriage=dset.MarriageScaled.values, divorce=dset.DivorceScaled.values
    )
    mcmc.print_summary()
    import json
    
    samples = mcmc.get_samples()
    serialisable = {}
    for k, v in samples.items():
        serialisable[k] = np.asarray(v).tolist()
    
    model_file_name = "numpyro-divorce.json"
    with open(model_file_name, "w") as model_file:
        json.dump(serialisable, model_file)
    # %load models.py
    import json
    import numpyro
    import numpy as np
    
    from jax import random
    from mlserver import MLModel
    from mlserver.codecs import decode_args
    from mlserver.utils import get_model_uri
    from numpyro.infer import Predictive
    from numpyro import distributions as dist
    from typing import Optional
    
    
    class NumpyroModel(MLModel):
        async def load(self) -> bool:
            model_uri = await get_model_uri(self._settings)
            with open(model_uri) as model_file:
                raw_samples = json.load(model_file)
    
            self._samples = {}
            for k, v in raw_samples.items():
                self._samples[k] = np.array(v)
    
            self._predictive = Predictive(self._model, self._samples)
    
            return True
    
        @decode_args
        async def predict(
            self,
            marriage: Optional[np.ndarray] = None,
            age: Optional[np.ndarray] = None,
            divorce: Optional[np.ndarray] = None,
        ) -> np.ndarray:
            predictions = self._predictive(
                rng_key=random.PRNGKey(0), marriage=marriage, age=age, divorce=divorce
            )
    
            obs = predictions["obs"]
            obs_mean = obs.mean()
    
            return np.asarray(obs_mean)
    
        def _model(self, marriage=None, age=None, divorce=None):
            a = numpyro.sample("a", dist.Normal(0.0, 0.2))
            M, A = 0.0, 0.0
            if marriage is not None:
                bM = numpyro.sample("bM", dist.Normal(0.0, 0.5))
                M = bM * marriage
            if age is not None:
                bA = numpyro.sample("bA", dist.Normal(0.0, 0.5))
                A = bA * age
            sigma = numpyro.sample("sigma", dist.Exponential(1.0))
            mu = a + M + A
            numpyro.sample("obs", dist.Normal(mu, sigma), obs=divorce)
    
    # %load settings.json
    {
        "debug": "true"
    }
    
    # %load model-settings.json
    {
        "name": "numpyro-divorce",
        "implementation": "models.NumpyroModel",
        "parameters": {
            "uri": "./numpyro-divorce.json"
        }
    }
    
    mlserver start .
    import requests
    import numpy as np
    
    from mlserver.types import InferenceRequest
    from mlserver.codecs import NumpyCodec
    
    x_0 = np.array([28.0])
    inference_request = InferenceRequest(
        inputs=[
            NumpyCodec.encode_input(name="marriage", payload=x_0)
        ]
    )
    
    endpoint = "http://localhost:8080/v2/models/numpyro-divorce/infer"
    response = requests.post(endpoint, json=inference_request.model_dump())
    
    response.json()
    # %load requirements.txt
    numpy==1.22.4
    numpyro==0.8.0
    jax==0.2.24
    jaxlib==0.3.7
    
    This section expects that Docker is available and running in the background.
    %%bash
    mlserver build . -t 'my-custom-numpyro-server:0.1.0'
    docker run -it --rm -p 8080:8080 my-custom-numpyro-server:0.1.0
    import requests
    import numpy as np
    
    from mlserver.types import InferenceRequest
    from mlserver.codecs import NumpyCodec
    
    x_0 = np.array([28.0])
    inference_request = InferenceRequest(
        inputs=[
            NumpyCodec.encode_input(name="marriage", payload=x_0)
        ]
    )
    
    endpoint = "http://localhost:8080/v2/models/numpyro-divorce/infer"
    response = requests.post(endpoint, json=inference_request.model_dump())
    
    response.json()
    This section expects access to a functional Kubernetes cluster with Seldon Core installed and some familiarity with `kubectl`.
    Also consider that, depending on your Kubernetes installation, Seldon Core might expect to pull the container image from a public container registry like [Docker hub](https://hub.docker.com/) or [Google Container Registry](https://cloud.google.com/container-registry). In that case, you need an extra step: push the container to the registry using `docker tag <image name> <container registry>/<image name>` and `docker push <container registry>/<image name>`, and then update the `image` section of the yaml file to `<container registry>/<image name>`.
    %%writefile seldondeployment.yaml
    apiVersion: machinelearning.seldon.io/v1
    kind: SeldonDeployment
    metadata:
      name: numpyro-model
    spec:
      protocol: v2
      predictors:
        - name: default
          graph:
            name: numpyro-divorce
            type: MODEL
          componentSpecs:
            - spec:
                containers:
                  - name: numpyro-divorce
                    image: my-custom-numpyro-server:0.1.0
    git clone https://github.com/SeldonIO/cassava-example.git
    cd cassava-example/
    pip install -r requirements.txt
    from helpers import plot, preprocess
    import tensorflow as tf
    import tensorflow_datasets as tfds
    import tensorflow_hub as hub
    
    # Fixes an issue with Jax and TF competing for GPU
    tf.config.experimental.set_visible_devices([], 'GPU')
    
    # Load the model
    model_path = './model'
    classifier = hub.KerasLayer(model_path)
    
    # Load the dataset and store the class names
    dataset, info = tfds.load('cassava', with_info=True)
    class_names = info.features['label'].names + ['unknown']
    
    # Select a batch of examples and plot them
    batch_size = 9
    batch = dataset['validation'].map(preprocess).batch(batch_size).as_numpy_iterator()
    examples = next(batch)
    plot(examples, class_names)
    
    # Generate predictions for the batch and plot them against their labels
    predictions = classifier(examples['image'])
    predictions_max = tf.argmax(predictions, axis=-1)
    print(predictions_max)
    plot(examples, class_names, predictions_max)
    python app.py
    from mlserver import MLModel
    from mlserver.codecs import decode_args
    import numpy as np
    import tensorflow as tf
    import tensorflow_hub as hub
    
    # Define a class for our Model, inheriting the MLModel class from MLServer
    class CassavaModel(MLModel):
    
      # Load the model into memory
      async def load(self) -> bool:
        tf.config.experimental.set_visible_devices([], 'GPU')
        model_path = '.'
        self._model = hub.KerasLayer(model_path)
        self.ready = True
        return self.ready
    
      # Logic for making predictions against our model
      @decode_args
      async def predict(self, payload: np.ndarray) -> np.ndarray:
        # convert payload to tf.tensor
        payload_tensor = tf.constant(payload)
    
        # Make predictions
        predictions = self._model(payload_tensor)
        predictions_max = tf.argmax(predictions, axis=-1)
    
        # convert predictions to np.ndarray
        response_data = np.array(predictions_max)
    
        return response_data
    {
        "name": "cassava",
        "implementation": "serve-model.CassavaModel"
    }
    mlserver start model/
    python test.py --local
    tensorflow==2.12.0
    tensorflow-hub==0.13.0
    mlserver build model/ -t [YOUR_CONTAINER_REGISTRY]/[IMAGE_NAME]
    docker images
    docker push [YOUR_CONTAINER_REGISTRY]/[IMAGE_NAME]
    apiVersion: machinelearning.seldon.io/v1
    kind: SeldonDeployment
    metadata:
      name: cassava
    spec:
      protocol: v2
      predictors:
        - componentSpecs:
            - spec:
                containers:
                  - image: YOUR_CONTAINER_REGISTRY/IMAGE_NAME
                    name: cassava
                    imagePullPolicy: Always
          graph:
            name: cassava
            type: MODEL
          name: cassava
    kubectl create -f deployment.yaml
    kubectl get pods
    python test.py --remote
    kubectl scale sdep cassava --replicas=3
    kubectl get pods --watch

    The Use Case

    Getting Set Up

    Running The Python App

    Creating an API for The Model

    Setting Things Up

    Serving The Model

    Making Predictions Using The API

    Containerizing The Model

    Deploying to Kubernetes

    Creating the Deployment

    Testing the Deployment

    Scaling the Model

    here
    Cassava dataset
    model details here
    here
    Flask
    FastAPI
    MLServer
    inference runtimes
    custom python code
    Containers
    Docker
    docker hub
    Seldon Core
    Kubernetes
    installation instructions
    port forward
    horizontally
    auto-scaling

    Serving MLflow models

    Out of the box, MLServer supports the deployment and serving of MLflow models with the following features:

    • Loading of MLflow Model artifacts.

    • Support of dataframes, dict-of-tensors and tensor inputs.
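To give a feel for what a dataframe input can look like on the wire, here is a standard-library sketch that turns a column-oriented table into a V2 request with one input tensor per column. The helper name is hypothetical, the request-level `content_type` of `pd` is an assumption about how the server is told to decode the inputs as a dataframe, and the column values are illustrative only.

```python
from typing import Dict, List


def columns_to_v2_request(columns: Dict[str, List[float]]) -> dict:
    """Encode each column as its own named V2 input tensor."""
    inputs = []
    for name, values in columns.items():
        inputs.append(
            {
                "name": name,
                "shape": [len(values)],
                "datatype": "FP64",  # assumed: all columns hold floats
                "data": list(values),
            }
        )
    # Assumed hint asking the server to decode the inputs as a dataframe.
    return {"parameters": {"content_type": "pd"}, "inputs": inputs}


# Two feature columns from the wine-quality dataset, with illustrative values.
request_body = columns_to_v2_request(
    {"fixed acidity": [7.4, 7.8], "volatile acidity": [0.7, 0.88]}
)
print(request_body)
```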

    In this example, we will showcase some of these features using an example model.

    from IPython.core.magic import register_line_cell_magic
    
    @register_line_cell_magic
    def writetemplate(line, cell):
        with open(line, 'w') as f:
            f.write(cell.format(**globals()))

    Training

    The first step will be to train and serialise an MLflow model. For that, we will use the linear regression example from the MLflow docs.

    # %load src/train.py
    # Original source code and more details can be found in:
    # https://www.mlflow.org/docs/latest/tutorials-and-examples/tutorial.html
    
    # The data set used in this example is from
    # http://archive.ics.uci.edu/ml/datasets/Wine+Quality
    # P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
    # Modeling wine preferences by data mining from physicochemical properties.
    # In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
    
    import warnings
    import sys
    
    import pandas as pd
    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import ElasticNet
    from urllib.parse import urlparse
    import mlflow
    import mlflow.sklearn
    from mlflow.models.signature import infer_signature
    
    import logging
    
    logging.basicConfig(level=logging.WARN)
    logger = logging.getLogger(__name__)
    
    
    def eval_metrics(actual, pred):
        rmse = np.sqrt(mean_squared_error(actual, pred))
        mae = mean_absolute_error(actual, pred)
        r2 = r2_score(actual, pred)
        return rmse, mae, r2
    
    
    if __name__ == "__main__":
        warnings.filterwarnings("ignore")
        np.random.seed(40)
    
        # Read the wine-quality csv file from the URL
        csv_url = (
            "http://archive.ics.uci.edu/ml"
            "/machine-learning-databases/wine-quality/winequality-red.csv"
        )
        try:
            data = pd.read_csv(csv_url, sep=";")
        except Exception as e:
            logger.exception(
                "Unable to download training & test CSV, "
                "check your internet connection. Error: %s",
                e,
            )
            # Re-raise: without the data the rest of the script cannot run.
            raise
    
        # Split the data into training and test sets. (0.75, 0.25) split.
        train, test = train_test_split(data)
    
        # The predicted column is "quality" which is a scalar from [3, 9]
        train_x = train.drop(["quality"], axis=1)
        test_x = test.drop(["quality"], axis=1)
        train_y = train[["quality"]]
        test_y = test[["quality"]]
    
        alpha = float(sys.argv[1]) if len(sys.argv) > 1 else 0.5
        l1_ratio = float(sys.argv[2]) if len(sys.argv) > 2 else 0.5
    
        with mlflow.start_run():
            lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
            lr.fit(train_x, train_y)
    
            predicted_qualities = lr.predict(test_x)
    
            (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
    
            print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
            print("  RMSE: %s" % rmse)
            print("  MAE: %s" % mae)
            print("  R2: %s" % r2)
    
            mlflow.log_param("alpha", alpha)
            mlflow.log_param("l1_ratio", l1_ratio)
            mlflow.log_metric("rmse", rmse)
            mlflow.log_metric("r2", r2)
            mlflow.log_metric("mae", mae)
    
            tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme
            model_signature = infer_signature(train_x, train_y)
    
            # Model registry does not work with file store
            if tracking_url_type_store != "file":
    
                # Register the model
                # There are other ways to use the Model Registry,
                # which depends on the use case,
                # please refer to the doc for more information:
                # https://mlflow.org/docs/latest/model-registry.html#api-workflow
                mlflow.sklearn.log_model(
                    lr,
                    "model",
                    registered_model_name="ElasticnetWineModel",
                    signature=model_signature,
                )
            else:
                mlflow.sklearn.log_model(lr, "model", signature=model_signature)
    
    !python src/train.py

    The training script will also serialise our trained model, leveraging the MLflow Model format. By default, we should be able to find the saved artifact under the mlruns folder.

    Now that we have trained and serialised our model, we are ready to start serving it. For that, the initial step will be to set up a model-settings.json that instructs MLServer to load our artifact using the MLflow Inference Runtime.
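
    Outside a notebook, the same settings file can be produced with plain Python. The sketch below is a minimal illustration: the helper names build_model_settings and write_model_settings are hypothetical, and the artifact path is a placeholder that must be replaced with the folder of your own run.

```python
import json
from pathlib import Path
from typing import Any, Dict


def build_model_settings(model_uri: str) -> Dict[str, Any]:
    """Return the settings that point the MLflow runtime at a saved artifact."""
    return {
        "name": "wine-classifier",
        "implementation": "mlserver_mlflow.MLflowRuntime",
        "parameters": {"uri": model_uri},
    }


def write_model_settings(folder: str, model_uri: str) -> Path:
    """Write model-settings.json into `folder` and return its path."""
    target = Path(folder) / "model-settings.json"
    target.write_text(json.dumps(build_model_settings(model_uri), indent=4))
    return target


# Placeholder path (assumption): substitute the artifact folder of your run.
settings = build_model_settings("./mlruns/0/<run-id>/artifacts/model")
print(json.dumps(settings, indent=4))
```

    Calling write_model_settings(".", model_uri) from the folder where the server will be started places the file where mlserver start . expects to find it.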

    Now that we have our config in place, we can start the server by running mlserver start . This command needs to be run either from the directory where our config files are, or with the path to that folder as its argument.

    Since this command will start the server and block the terminal while waiting for requests, it will need to be run in the background or in a separate terminal.

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    Note that the request specifies the value pd as its content type, whereas every input specifies the content type np. These parameters will instruct MLServer to:

    • Convert every input value to a NumPy array, using the data type and shape information provided.

    • Group all the different inputs into a Pandas DataFrame, using their names as the column names.
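
    The sketch below shows one way to attach these two content types when building the request by hand. The helper name build_v2_request is hypothetical, and only three of the wine features are used to keep the example short; the payload layout follows the request shown in this example.

```python
from typing import Any, Dict


def build_v2_request(row: Dict[str, float]) -> Dict[str, Any]:
    """Build a V2 inference request with one input per feature column."""
    inputs = [
        {
            "name": name,
            "shape": [1],
            "datatype": "FP32",
            "data": [value],
            # Per-input content type: decode this input as a NumPy array.
            "parameters": {"content_type": "np"},
        }
        for name, value in row.items()
    ]
    # Request-level content type: group all inputs into a Pandas DataFrame,
    # using the input names as column names.
    return {"parameters": {"content_type": "pd"}, "inputs": inputs}


request = build_v2_request(
    {"fixed acidity": 7.4, "volatile acidity": 0.7, "alcohol": 9.4}
)
print(request["parameters"], [i["name"] for i in request["inputs"]])
```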

    To learn more about how MLServer uses content type parameters, you can check this worked out example.

    As we can see in the output above, the predicted quality score for our input wine was 5.57.

    MLflow currently ships with a scoring server with its own protocol. In order to provide a drop-in replacement, the MLflow runtime in MLServer also exposes a custom endpoint which matches the signature of MLflow's /invocations endpoint.

    As an example, we can try to send the same request that we sent previously, but using MLflow's protocol. Note that, in both cases, the request will be handled by the same MLServer instance.
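
    To make the relationship between the two payloads concrete, the following sketch converts a V2 request with one input per column into the dataframe_split layout used in this example. The function name v2_to_dataframe_split is hypothetical, and the sketch assumes every input is a one-dimensional column of the same length.

```python
from typing import Any, Dict, List


def v2_to_dataframe_split(v2_request: Dict[str, Any]) -> Dict[str, Any]:
    """Convert column-wise V2 inputs into MLflow's dataframe_split payload."""
    inputs: List[Dict[str, Any]] = v2_request["inputs"]
    columns = [inp["name"] for inp in inputs]
    n_rows = len(inputs[0]["data"]) if inputs else 0
    # Each row takes the i-th value of every column, in column order.
    rows = [[inp["data"][i] for inp in inputs] for i in range(n_rows)]
    return {"dataframe_split": {"columns": columns, "data": rows}}


v2_request = {
    "inputs": [
        {"name": "pH", "shape": [1], "datatype": "FP32", "data": [3.51]},
        {"name": "alcohol", "shape": [1], "datatype": "FP32", "data": [9.4]},
    ]
}
print(v2_to_dataframe_split(v2_request))
```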

    As we can see above, the predicted quality for our input is 5.57, matching the prediction we obtained above.

    MLflow lets users define a model signature, where they can specify what types of inputs the model accepts and what types of outputs it returns. Similarly, the V2 inference protocol employed by MLServer defines a metadata endpoint which can be used to query what inputs and outputs the model accepts. However, even though they serve similar functions, the data schemas used by each of them are not compatible with each other.

    To solve this, if your model defines an MLflow model signature, MLServer will convert this signature on the fly to a metadata schema compatible with the V2 Inference Protocol. This will also include specifying any extra content type that is required to correctly decode / encode your data.

    As an example, we can first have a look at the model signature saved for our MLflow model. This can be seen directly in the MLmodel file saved by our model.
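
    As a rough picture of what such a conversion involves, the sketch below maps a column-based signature (written as a JSON list of name and type entries) to V2-style metadata entries. The type table, the shape value and the function name signature_to_metadata are all assumptions made for illustration; they are not taken from MLServer's implementation, whose actual mapping may differ.

```python
import json
from typing import Any, Dict, List

# Illustrative mapping from MLflow column types to (V2 datatype, content type).
# This table is an assumption for the sketch, not MLServer's own mapping.
_TYPE_MAP = {
    "boolean": ("BOOL", "np"),
    "integer": ("INT32", "np"),
    "long": ("INT64", "np"),
    "float": ("FP32", "np"),
    "double": ("FP64", "np"),
    "string": ("BYTES", "str"),
}


def signature_to_metadata(signature_json: str) -> List[Dict[str, Any]]:
    """Translate a JSON list of signature columns into V2 metadata entries."""
    metadata = []
    for column in json.loads(signature_json):
        datatype, content_type = _TYPE_MAP[column["type"]]
        metadata.append(
            {
                "name": column["name"],
                "datatype": datatype,
                "shape": [-1],  # assumed: variable number of rows per column
                "parameters": {"content_type": content_type},
            }
        )
    return metadata


signature = '[{"name": "fixed acidity", "type": "double"}, {"name": "alcohol", "type": "double"}]'
print(json.dumps(signature_to_metadata(signature), indent=2))
```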

    We can then query the metadata endpoint, to see the model metadata inferred by MLServer from our test model's signature. For this, we will use the /v2/models/wine-classifier/ endpoint.

    As we should be able to see, the model metadata now matches the information contained in our model signature, including any extra content types necessary to decode our data correctly.

    Serving HuggingFace Transformer Models

    Out of the box, MLServer supports the deployment and serving of HuggingFace Transformer models with the following features:

    • Loading of Transformer Model artifacts from the Hugging Face Hub.

    • Model quantization & optimization using the Hugging Face Optimum library

    • Request batching for GPU optimization (via adaptive batching and request batching)

    In this example, we will showcase some of these features using an example model.

    Since we're using a pretrained model, we can skip straight to serving.

    Now that we have our config in place, we can start the server by running mlserver start . This command needs to be run either from the directory where our config files are, or with the path to that folder as its argument.

    Since this command will start the server and block the terminal while waiting for requests, it will need to be run in the background or in a separate terminal.

    We can also leverage the Optimum library that allows us to access quantized and optimized models.

    We can download pretrained optimized models from the hub if available by enabling the optimum_model flag:

    Once again, you are able to run the model using the MLServer CLI. As before, this needs to be run either from the directory where our config files are, or with the path to that folder as its argument.

    The request can now be sent using the same request structure but using optimized models for better performance.
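
    The responses shown in this example return each result as a JSON-encoded string inside the data field of an output tagged with the hg_jsonlist content type. A minimal sketch for turning those strings back into Python objects follows; the helper name parse_hf_outputs is hypothetical, and the sample response is trimmed from one of the outputs in this example.

```python
import json
from typing import Any, Dict, List


def parse_hf_outputs(response: Dict[str, Any]) -> List[Any]:
    """Decode every JSON-encoded entry found in the response outputs."""
    parsed = []
    for output in response["outputs"]:
        for item in output["data"]:
            parsed.append(json.loads(item))
    return parsed


sample_response = {
    "model_name": "transformer",
    "outputs": [
        {
            "name": "output",
            "shape": [1, 1],
            "datatype": "BYTES",
            "parameters": {"content_type": "hg_jsonlist"},
            "data": ['{"label": "NEGATIVE", "score": 0.9996137022972107}'],
        }
    ],
}
print(parse_hf_outputs(sample_response))
```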

    MLServer supports several other transformer tasks besides text generation. Below are examples for a few of the supported tasks.

    Once again, you are able to run the model using the MLServer CLI.

    Once again, you are able to run the model using the MLServer CLI.

    We can also evaluate GPU acceleration by comparing the speed on CPU and GPU using the following parameters.

    We first measure the time taken with device=-1, which runs the model on CPU (the default).

    Once again, you are able to run the model using the MLServer CLI.

    We can see that it takes about 66 seconds, which is roughly 6 times longer than the GPU example below.

    IMPORTANT: Running the code below requires a machine with a GPU configured correctly to work with TensorFlow/PyTorch.

    Now we'll run the benchmark with GPU configured, which we can do by setting device=0

    We can see that the elapsed time is roughly 6 times shorter than in the CPU version!

    We can also see how the adaptive batching capabilities allow for GPU acceleration by grouping multiple incoming requests so that they get processed on the GPU as a single batch.

    We can enable adaptive batching through the max_batch_size setting, which we will set to 128.

    We will also configure max_batch_time, which specifies the maximum amount of time the MLServer orchestrator will wait before sending a batch for inference.

    To generate the required load of 50 requests per second, we will use vegeta, a load-testing tool.

    We can now see that the requests are batched and that we receive 100% success, even though the requests are sent one by one.
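
    To build intuition for how the two settings interact, the toy simulation below groups request arrival times into batches. It is a simplified model written for this explanation, not MLServer's actual batching logic: a batch is flushed either when it is full or when a request arrives once the time window of the batch has elapsed. Times are expressed in integer milliseconds to keep the arithmetic exact, and the function name group_into_batches is hypothetical.

```python
from typing import List


def group_into_batches(
    arrivals_ms: List[int], max_batch_size: int, max_batch_time_ms: int
) -> List[List[int]]:
    """Group sorted arrival times into batches (toy model of adaptive batching)."""
    batches: List[List[int]] = []
    current: List[int] = []
    for t in arrivals_ms:
        batch_full = len(current) >= max_batch_size
        # The window is measured from the first request of the open batch.
        window_over = bool(current) and t - current[0] >= max_batch_time_ms
        if batch_full or window_over:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches


# 150 requests at 50 requests per second: one request every 20 ms.
arrivals = [i * 20 for i in range(150)]
batches = group_into_batches(arrivals, max_batch_size=128, max_batch_time_ms=1000)
print([len(batch) for batch in batches])
```

    Under this toy model, a 1 second window at 50 requests per second closes each batch after 50 requests, well before the size limit of 128 is reached, so the load is served as three batches of 50.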


    Serving

    Send test inference request

    MLflow Scoring Protocol

    MLflow Model Signature

    MLflow Model format
    worked out example
    scoring server with its own protocol
    model signature
    V2 inference protocol
    metadata endpoint
    content type

    Serving

    model-settings.json

    Send test inference request

    Using Optimum Optimized Models

    Send Test Request to Optimum Optimized Model

    Testing Supported Tasks

    Question Answering

    Sentiment Analysis

    GPU Acceleration

    Testing with CPU

    Testing with GPU

    Adaptive Batching with GPU

    import os
    
    [experiment_file_path] = !ls -td ./mlruns/0/* | head -1
    model_path = os.path.join(experiment_file_path, "artifacts", "model")
    print(model_path)
    !ls {model_path} 
    %%writetemplate ./model-settings.json
    {{
        "name": "wine-classifier",
        "implementation": "mlserver_mlflow.MLflowRuntime",
        "parameters": {{
            "uri": "{model_path}"
        }}
    }}
    mlserver start .
    import requests
    
    inference_request = {
        "inputs": [
            {
              "name": "fixed acidity",
              "shape": [1],
              "datatype": "FP32",
              "data": [7.4],
            },
            {
              "name": "volatile acidity",
              "shape": [1],
              "datatype": "FP32",
              "data": [0.7000],
            },
            {
              "name": "citric acid",
              "shape": [1],
              "datatype": "FP32",
              "data": [0],
            },
            {
              "name": "residual sugar",
              "shape": [1],
              "datatype": "FP32",
              "data": [1.9],
            },
            {
              "name": "chlorides",
              "shape": [1],
              "datatype": "FP32",
              "data": [0.076],
            },
            {
              "name": "free sulfur dioxide",
              "shape": [1],
              "datatype": "FP32",
              "data": [11],
            },
            {
              "name": "total sulfur dioxide",
              "shape": [1],
              "datatype": "FP32",
              "data": [34],
            },
            {
              "name": "density",
              "shape": [1],
              "datatype": "FP32",
              "data": [0.9978],
            },
            {
              "name": "pH",
              "shape": [1],
              "datatype": "FP32",
              "data": [3.51],
            },
            {
              "name": "sulphates",
              "shape": [1],
              "datatype": "FP32",
              "data": [0.56],
            },
            {
              "name": "alcohol",
              "shape": [1],
              "datatype": "FP32",
              "data": [9.4],
            },
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/wine-classifier/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    import requests
    
    inference_request = {
        "dataframe_split": {
            "columns": [
                "fixed acidity",
                "volatile acidity",
                "citric acid",
                "residual sugar",
                "chlorides",
                "free sulfur dioxide",
                "total sulfur dioxide",
                "density",
                "pH",
                "sulphates",
                "alcohol",
            ],
            "data": [[7.4,0.7,0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4]]
        }
    }
    
    endpoint = "http://localhost:8080/invocations"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    !cat {model_path}/MLmodel
    import requests
    
    
    endpoint = "http://localhost:8080/v2/models/wine-classifier"
    response = requests.get(endpoint)
    
    response.json()
    # Import required dependencies
    import requests
    %%writefile ./model-settings.json
    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "parameters": {
            "extra": {
                "task": "text-generation",
                "pretrained_model": "distilgpt2"
            }
        }
    }
    Overwriting ./model-settings.json
    mlserver start .
    inference_request = {
        "inputs": [
            {
                "name": "args",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["this is a test"],
            }
        ]
    }
    
    requests.post(
        "http://localhost:8080/v2/models/transformer/infer", json=inference_request
    ).json()
    {'model_name': 'transformer',
     'id': 'eb160c6b-8223-4342-ad92-6ac301a9fa5d',
     'parameters': {},
     'outputs': [{'name': 'output',
       'shape': [1, 1],
       'datatype': 'BYTES',
       'parameters': {'content_type': 'hg_jsonlist'},
       'data': ['{"generated_text": "this is a testnet with 1-3,000-bit nodes as nodes."}']}]}
    %%writefile ./model-settings.json
    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "parameters": {
            "extra": {
                "task": "text-generation",
                "pretrained_model": "distilgpt2",
                "optimum_model": true
            }
        }
    }
    Overwriting ./model-settings.json
    mlserver start .
    inference_request = {
        "inputs": [
            {
                "name": "args",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["this is a test"],
            }
        ]
    }
    
    requests.post(
        "http://localhost:8080/v2/models/transformer/infer", json=inference_request
    ).json()
    {'model_name': 'transformer',
     'id': '9c482c8d-b21e-44b1-8a42-7650a9dc01ef',
     'parameters': {},
     'outputs': [{'name': 'output',
       'shape': [1, 1],
       'datatype': 'BYTES',
       'parameters': {'content_type': 'hg_jsonlist'},
       'data': ['{"generated_text": "this is a test of the \\"safe-code-safe-code-safe-code\\" approach. The method only accepts two parameters as parameters: the code. The parameter \'unsafe-code-safe-code-safe-code\' should"}']}]}
    %%writefile ./model-settings.json
    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "parameters": {
            "extra": {
                "task": "question-answering"
            }
        }
    }
    Overwriting ./model-settings.json
    mlserver start .
    inference_request = {
        "inputs": [
            {
                "name": "question",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["what is your name?"],
            },
            {
                "name": "context",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["Hello, I am Seldon, how is it going"],
            },
        ]
    }
    
    requests.post(
        "http://localhost:8080/v2/models/transformer/infer", json=inference_request
    ).json()
    {'model_name': 'transformer',
     'id': '4efac938-86d8-41a1-b78f-7690b2dcf197',
     'parameters': {},
     'outputs': [{'name': 'output',
       'shape': [1, 1],
       'datatype': 'BYTES',
       'parameters': {'content_type': 'hg_jsonlist'},
       'data': ['{"score": 0.9869915843009949, "start": 12, "end": 18, "answer": "Seldon"}']}]}
    %%writefile ./model-settings.json
    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "parameters": {
            "extra": {
                "task": "text-classification"
            }
        }
    }
    Overwriting ./model-settings.json
    mlserver start .
    inference_request = {
        "inputs": [
            {
                "name": "args",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["This is terrible!"],
            }
        ]
    }
    
    requests.post(
        "http://localhost:8080/v2/models/transformer/infer", json=inference_request
    ).json()
    {'model_name': 'transformer',
     'id': '835eabbd-daeb-4423-a64f-a7c4d7c60a9b',
     'parameters': {},
     'outputs': [{'name': 'output',
       'shape': [1, 1],
       'datatype': 'BYTES',
       'parameters': {'content_type': 'hg_jsonlist'},
       'data': ['{"label": "NEGATIVE", "score": 0.9996137022972107}']}]}
    %%writefile ./model-settings.json
    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "max_batch_size": 128,
        "max_batch_time": 1,
        "parameters": {
            "extra": {
                "task": "text-generation",
                "device": -1
            }
        }
    }
    Overwriting ./model-settings.json
    mlserver start .
    inference_request = {
        "inputs": [
            {
                "name": "text_inputs",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["This is a generation for the work" for i in range(512)],
            }
        ]
    }
    
    # Benchmark time
    import time
    
    start_time = time.monotonic()
    
    requests.post(
        "http://localhost:8080/v2/models/transformer/infer", json=inference_request
    )
    
    print(f"Elapsed time: {time.monotonic() - start_time}")
    Elapsed time: 66.42268538899953
    %%writefile ./model-settings.json
    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "parameters": {
            "extra": {
                "task": "text-generation",
                "device": 0
            }
        }
    }
    Overwriting ./model-settings.json
    inference_request = {
        "inputs": [
            {
                "name": "text_inputs",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["This is a generation for the work" for i in range(512)],
            }
        ]
    }
    
    # Benchmark time
    import time
    
    start_time = time.monotonic()
    
    requests.post(
        "http://localhost:8080/v2/models/transformer/infer", json=inference_request
    )
    
    print(f"Elapsed time: {time.monotonic() - start_time}")
    Elapsed time: 11.27933280000434
    %%writefile ./model-settings.json
    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "max_batch_size": 128,
        "max_batch_time": 1,
        "parameters": {
            "extra": {
                "task": "text-generation",
                "pretrained_model": "distilgpt2",
                "device": 0
            }
        }
    }
    Overwriting ./model-settings.json
    %%bash
    jq -ncM '{"method": "POST", "header": {"Content-Type": ["application/json"] }, "url": "http://localhost:8080/v2/models/transformer/infer", "body": "{\"inputs\":[{\"name\":\"text_inputs\",\"shape\":[1],\"datatype\":\"BYTES\",\"data\":[\"test\"]}]}" | @base64 }' \
              | vegeta \
                    -cpus="2" \
                    attack \
                    -duration="3s" \
                    -rate="50" \
                    -format=json \
              | vegeta \
                    report \
                    -type=text
    Requests      [total, rate, throughput]         150, 50.34, 22.28
    Duration      [total, attack, wait]             6.732s, 2.98s, 3.753s
    Latencies     [min, mean, 50, 90, 95, 99, max]  1.975s, 3.168s, 3.22s, 4.065s, 4.183s, 4.299s, 4.318s
    Bytes In      [total, mean]                     60978, 406.52
    Bytes Out     [total, mean]                     12300, 82.00
    Success       [ratio]                           100.00%
    Status Codes  [code:count]                      200:150  
    Error Set:

    Content Type Decoding

    MLServer extends the V2 inference protocol by adding support for a content_type annotation. This annotation can be provided either through the model metadata parameters, or through the input parameters. By leveraging the content_type annotation, we can provide the necessary information to MLServer so that it can decode the input payload from the "wire" V2 protocol to something meaningful to the model / user (e.g. a NumPy array).

    This walkthrough goes through a few cases which illustrate how this works, and how it can be extended.

    Echo Inference Runtime

    To start with, we will write a dummy runtime which just prints the input, the decoded input and returns it. This will serve as a testbed to showcase how the content_type support works.

    Later on, we will extend this runtime by adding custom codecs that will decode our V2 payload to custom types.

    As you can see above, this runtime will decode the incoming payloads by calling the self.decode() helper method. This method will work out the right content type for each input by checking the following, in order:

    1. Is there any content type defined in the inputs[].parameters.content_type field within the request payload?

    2. Is there any content type defined in the inputs[].parameters.content_type field within the model metadata?

    3. Is there any default content type that should be assumed?
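
    The lookup order above can be sketched as a small function over plain dictionaries. This is an illustration of the precedence only, not the code MLServer runs; the function name resolve_content_type and the dictionary layout of model_inputs (metadata entries keyed by input name) are assumptions made for the sketch.

```python
from typing import Any, Dict, Optional


def resolve_content_type(
    request_input: Dict[str, Any],
    model_inputs: Dict[str, Dict[str, Any]],
    default: Optional[str] = None,
) -> Optional[str]:
    """Pick a content type following the request > metadata > default order."""
    # 1. Content type set on the request input itself.
    from_request = (request_input.get("parameters") or {}).get("content_type")
    if from_request:
        return from_request

    # 2. Content type declared for the same input name in the model metadata.
    metadata = model_inputs.get(request_input["name"], {})
    from_metadata = (metadata.get("parameters") or {}).get("content_type")
    if from_metadata:
        return from_metadata

    # 3. Fall back to the default, if any.
    return default


metadata = {"metadata-str": {"parameters": {"content_type": "str"}}}
print(resolve_content_type({"name": "metadata-str"}, metadata))
```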

    In order to enable this runtime, we will also create a model-settings.json file. This file should be present in (or accessible from) the folder where we run mlserver start .

    Our initial step will be to decide the content type based on the incoming inputs[].parameters field. For this, we will start our MLServer in the background (e.g. running mlserver start .)

    As you've probably already noticed, writing request payloads compliant with the V2 Inference Protocol requires some knowledge of both the V2 spec and the structure expected by each content type. To account for this and simplify usage, the MLServer package exposes a set of utilities which will help you interact with your models via the V2 protocol.

    These helpers are mainly shaped as "codecs". That is, abstractions which know how to "encode" and "decode" arbitrary Python datatypes to and from the V2 Inference Protocol.

    Generally, we recommend using the existing set of codecs to generate your V2 payloads. This will ensure that requests and responses follow the right structure, and should provide a more seamless experience.

    Following with our previous example, the same code could be rewritten using codecs as:

    Note that the rewritten snippet now makes use of the built-in InferenceRequest class, which represents a V2 inference request. On top of that, it also uses the NumpyCodec and StringCodec implementations, which know how to encode a NumPy array and a list of strings into V2-compatible request inputs.
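
    To show what "encoding" and "decoding" mean at the payload level, here is a dependency-free toy version of the idea for a 2-D list of integers. It is only a sketch of the flatten-and-reshape round trip; the function names encode_matrix and decode_matrix are hypothetical, the INT32 datatype is hard-coded to mirror the handwritten payload in this example, and the real NumpyCodec handles many more cases.

```python
from typing import Any, Dict, List


def encode_matrix(name: str, matrix: List[List[int]]) -> Dict[str, Any]:
    """Flatten a 2-D list into a V2-style input with shape and flat data."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    return {
        "name": name,
        "datatype": "INT32",
        "shape": [rows, cols],
        "data": [value for row in matrix for value in row],
        "parameters": {"content_type": "np"},
    }


def decode_matrix(v2_input: Dict[str, Any]) -> List[List[int]]:
    """Rebuild the 2-D list from the flat data using the declared shape."""
    rows, cols = v2_input["shape"]
    data = v2_input["data"]
    return [data[r * cols:(r + 1) * cols] for r in range(rows)]


encoded = encode_matrix("parameters-np", [[1, 2], [3, 4]])
print(encoded)
print(decode_matrix(encoded))
```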

    Our next step will be to define the expected content type through the model metadata. This can be done by extending the model-settings.json file, and adding a section on inputs.

    After adding this metadata, we will re-start MLServer (e.g. mlserver start .) and we will send a new request without any explicit parameters.

    As you should be able to see in the server logs, MLServer will cross-reference the input names against the model metadata to find the right content type.

    There may be cases where a custom inference runtime needs to encode / decode custom datatypes. As an example, we can think of computer vision models which may only operate on Pillow image objects.

    In these scenarios, it's possible to extend the Codec interface to write our custom encoding logic. A Codec is simply an object which defines decode() and encode() methods. To illustrate how this would work, we will extend our custom runtime to add a custom PillowCodec.

    We should now be able to restart our instance of MLServer (i.e. with the mlserver start . command), to send a few test requests.

    As you should be able to see in the MLServer logs, the server is now able to decode the payload into a Pillow image. This example also illustrates how Codec objects can be compatible with multiple datatype values (e.g. tensor and BYTES in this case).

    So far, we've seen how you can specify codecs so that they get applied at the input level. However, it is also possible to use request-wide codecs that aggregate multiple inputs to decode the payload. This is usually relevant for cases where the models expect a multi-column input type, like a Pandas DataFrame.

    To illustrate this, we will first tweak our EchoRuntime so that it prints the decoded contents at the request level.

    We should now be able to restart our instance of MLServer (i.e. with the mlserver start . command), to send a few test requests.

    Model Settings

    Request Inputs

    Codecs

    Model Metadata

    Custom Codecs

    Request Codecs

    %%writefile runtime.py
    import json
    
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput
    from mlserver.codecs import DecodedParameterName
    
    _to_exclude = {
        "parameters": {DecodedParameterName, "headers"},
        'inputs': {"__all__": {"parameters": {DecodedParameterName, "headers"}}}
    }
    
    class EchoRuntime(MLModel):
        async def predict(self, payload: InferenceRequest) -> InferenceResponse:
            outputs = []
            for request_input in payload.inputs:
                decoded_input = self.decode(request_input)
                print(f"------ Encoded Input ({request_input.name}) ------")
                as_dict = request_input.dict(exclude=_to_exclude)  # type: ignore
                print(json.dumps(as_dict, indent=2))
                print(f"------ Decoded input ({request_input.name}) ------")
                print(decoded_input)
    
                outputs.append(
                    ResponseOutput(
                        name=request_input.name,
                        datatype=request_input.datatype,
                        shape=request_input.shape,
                        data=request_input.data
                    )
                )
    
            return InferenceResponse(model_name=self.name, outputs=outputs)
    
    %%writefile model-settings.json
    
    {
        "name": "content-type-example",
        "implementation": "runtime.EchoRuntime"
    }
    import requests
    
    payload = {
        "inputs": [
            {
                "name": "parameters-np",
                "datatype": "INT32",
                "shape": [2, 2],
                "data": [1, 2, 3, 4],
                "parameters": {
                    "content_type": "np"
                }
            },
            {
                "name": "parameters-str",
                "datatype": "BYTES",
                "shape": [1],
                "data": "hello world 😁",
                "parameters": {
                    "content_type": "str"
                }
            }
        ]
    }
    
    response = requests.post(
        "http://localhost:8080/v2/models/content-type-example/infer",
        json=payload
    )
    import requests
    import numpy as np
    
    from mlserver.types import InferenceRequest, InferenceResponse
    from mlserver.codecs import NumpyCodec, StringCodec
    
    parameters_np = np.array([[1, 2], [3, 4]])
    parameters_str = ["hello world 😁"]
    
    payload = InferenceRequest(
        inputs=[
            NumpyCodec.encode_input("parameters-np", parameters_np),
            # The `use_bytes=False` flag will ensure that the encoded payload is JSON-compatible
            StringCodec.encode_input("parameters-str", parameters_str, use_bytes=False),
        ]
    )
    
    response = requests.post(
        "http://localhost:8080/v2/models/content-type-example/infer",
        json=payload.model_dump()
    )
    
    response_payload = InferenceResponse.parse_raw(response.text)
    print(NumpyCodec.decode_output(response_payload.outputs[0]))
    print(StringCodec.decode_output(response_payload.outputs[1]))
    %%writefile model-settings.json
    
    {
        "name": "content-type-example",
        "implementation": "runtime.EchoRuntime",
        "inputs": [
            {
                "name": "metadata-np",
                "datatype": "INT32",
                "shape": [2, 2],
                "parameters": {
                    "content_type": "np"
                }
            },
            {
                "name": "metadata-str",
                "datatype": "BYTES",
                "shape": [11],
                "parameters": {
                    "content_type": "str"
                }
            }
        ]
    }
    import requests
    
    payload = {
        "inputs": [
            {
                "name": "metadata-np",
                "datatype": "INT32",
                "shape": [2, 2],
                "data": [1, 2, 3, 4],
            },
            {
                "name": "metadata-str",
                "datatype": "BYTES",
                "shape": [11],
                "data": "hello world 😁",
            }
        ]
    }
    
    response = requests.post(
        "http://localhost:8080/v2/models/content-type-example/infer",
        json=payload
    )
    %%writefile runtime.py
    import io
    import json
    
    from PIL import Image
    
    from mlserver import MLModel
    from mlserver.types import (
        InferenceRequest,
        InferenceResponse,
        RequestInput,
        ResponseOutput,
    )
    from mlserver.codecs import NumpyCodec, register_input_codec, DecodedParameterName
    from mlserver.codecs.utils import InputOrOutput
    
    
    _to_exclude = {
        "parameters": {DecodedParameterName},
        "inputs": {"__all__": {"parameters": {DecodedParameterName}}},
    }
    
    
    @register_input_codec
    class PillowCodec(NumpyCodec):
        ContentType = "img"
        DefaultMode = "L"
    
        @classmethod
        def can_encode(cls, payload: Image.Image) -> bool:
            # `Image` is the PIL module; the image class itself is `Image.Image`
            return isinstance(payload, Image.Image)
    
        @classmethod
        def _decode(cls, input_or_output: InputOrOutput) -> Image:
            if input_or_output.datatype != "BYTES":
                # If not bytes, assume it's an array
                image_array = super().decode_input(input_or_output)  # type: ignore
                return Image.fromarray(image_array, mode=cls.DefaultMode)
    
            encoded = input_or_output.data
            if isinstance(encoded, str):
                encoded = encoded.encode()
    
            return Image.frombytes(
                mode=cls.DefaultMode, size=input_or_output.shape, data=encoded
            )
    
        @classmethod
        def encode_output(cls, name: str, payload: Image) -> ResponseOutput:  # type: ignore
            byte_array = io.BytesIO()
            payload.save(byte_array, mode=cls.DefaultMode)
    
            return ResponseOutput(
                name=name, shape=payload.size, datatype="BYTES", data=byte_array.getvalue()
            )
    
        @classmethod
        def decode_output(cls, response_output: ResponseOutput) -> Image:
            return cls._decode(response_output)
    
        @classmethod
        def encode_input(cls, name: str, payload: Image) -> RequestInput:  # type: ignore
            output = cls.encode_output(name, payload)
            return RequestInput(
                name=output.name,
                shape=output.shape,
                datatype=output.datatype,
                data=output.data,
            )
    
        @classmethod
        def decode_input(cls, request_input: RequestInput) -> Image:
            return cls._decode(request_input)
    
    
    class EchoRuntime(MLModel):
        async def predict(self, payload: InferenceRequest) -> InferenceResponse:
            outputs = []
            for request_input in payload.inputs:
                decoded_input = self.decode(request_input)
                print(f"------ Encoded Input ({request_input.name}) ------")
                as_dict = request_input.dict(exclude=_to_exclude)  # type: ignore
                print(json.dumps(as_dict, indent=2))
                print(f"------ Decoded input ({request_input.name}) ------")
                print(decoded_input)
    
                outputs.append(
                    ResponseOutput(
                        name=request_input.name,
                        datatype=request_input.datatype,
                        shape=request_input.shape,
                        data=request_input.data,
                    )
                )
    
            return InferenceResponse(model_name=self.name, outputs=outputs)
    import requests
    
    payload = {
        "inputs": [
            {
                "name": "image-int32",
                "datatype": "INT32",
                "shape": [8, 8],
                "data": [
                    1, 0, 1, 0, 1, 0, 1, 0,
                    1, 0, 1, 0, 1, 0, 1, 0,
                    1, 0, 1, 0, 1, 0, 1, 0,
                    1, 0, 1, 0, 1, 0, 1, 0,
                    1, 0, 1, 0, 1, 0, 1, 0,
                    1, 0, 1, 0, 1, 0, 1, 0,
                    1, 0, 1, 0, 1, 0, 1, 0,
                    1, 0, 1, 0, 1, 0, 1, 0
                ],
                "parameters": {
                    "content_type": "img"
                }
            },
            {
                "name": "image-bytes",
                "datatype": "BYTES",
                "shape": [8, 8],
                "data": (
                    "10101010"
                    "10101010"
                    "10101010"
                    "10101010"
                    "10101010"
                    "10101010"
                    "10101010"
                    "10101010"
                ),
                "parameters": {
                    "content_type": "img"
                }
            }
        ]
    }
    
    response = requests.post(
        "http://localhost:8080/v2/models/content-type-example/infer",
        json=payload
    )
    %%writefile runtime.py
    import json
    
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput
    from mlserver.codecs import DecodedParameterName
    
    _to_exclude = {
        "parameters": {DecodedParameterName},
        'inputs': {"__all__": {"parameters": {DecodedParameterName}}}
    }
    
    class EchoRuntime(MLModel):
        async def predict(self, payload: InferenceRequest) -> InferenceResponse:
            print("------ Encoded Input (request) ------")
            as_dict = payload.dict(exclude=_to_exclude)  # type: ignore
            print(json.dumps(as_dict, indent=2))
            print("------ Decoded input (request) ------")
            decoded_request = None
            if payload.parameters:
                decoded_request = getattr(payload.parameters, DecodedParameterName)
            print(decoded_request)
    
            outputs = []
            for request_input in payload.inputs:
                outputs.append(
                    ResponseOutput(
                        name=request_input.name,
                        datatype=request_input.datatype,
                        shape=request_input.shape,
                        data=request_input.data
                    )
                )
    
            return InferenceResponse(model_name=self.name, outputs=outputs)
    
    import requests
    
    payload = {
        "inputs": [
            {
                "name": "parameters-np",
                "datatype": "INT32",
                "shape": [2, 2],
                "data": [1, 2, 3, 4],
                "parameters": {
                    "content_type": "np"
                }
            },
            {
                "name": "parameters-str",
                "datatype": "BYTES",
                "shape": [2, 11],
                "data": ["hello world 😁", "bye bye 😁"],
                "parameters": {
                    "content_type": "str"
                }
            }
        ],
        "parameters": {
            "content_type": "pd"
        }
    }
    
    response = requests.post(
        "http://localhost:8080/v2/models/content-type-example/infer",
        json=payload
    )

    Content Types (and Codecs)

    Machine learning models generally expect their inputs to be passed down as a particular Python type. Most commonly, this type ranges from "general purpose" NumPy arrays or Pandas DataFrames to more granular definitions, like datetime objects, Pillow images, etc. Unfortunately, the definition of the V2 Inference Protocol doesn't cover any of these specific use cases. This protocol can be thought of as a wider "lower level" spec, which only defines what fields a payload should have.

    To account for this gap, MLServer introduces support for content types, which offer a way to let MLServer know how it should "decode" V2-compatible payloads. When shaped in the right way, these payloads should "encode" all the information required to extract the higher level Python type that will be required for a model.

    To illustrate the above, we can think of a Scikit-Learn pipeline, which takes in a Pandas DataFrame and returns a NumPy Array. Without the use of content types, the V2 payload itself would probably lack information about how this payload should be treated by MLServer. Likewise, the Scikit-Learn pipeline wouldn't know how to treat a raw V2 payload. In this scenario, the use of content types allows us to specify the actual "higher level" information encoded within the V2 protocol payloads.

    To let MLServer know that a particular payload must be decoded / encoded as a different Python data type (e.g. NumPy Array, Pandas DataFrame, etc.), you can specify it through the content_type field of the parameters section of your request.

    As an example, we can consider the following dataframe, containing two columns: Age and First Name.

    First Name | Age
    Joanne | 34
    Michael | 22

    This table could be specified in the V2 protocol as the following payload, where we declare that:

    • The whole set of inputs should be decoded as a Pandas Dataframe (i.e. setting the content type as pd).

    • The First Name column should be decoded as a UTF-8 string (i.e. setting the content type as str).

    To learn more about the available content types and how to use them, you can see the Available Content Types section below.

    Under the hood, the conversion between content types is implemented using codecs. In the MLServer architecture, codecs are an abstraction which knows how to encode and decode high-level Python types to and from the V2 Inference Protocol.

    Depending on the high-level Python type, encoding / decoding operations may require access to multiple input or output heads. For example, a Pandas Dataframe would need to aggregate all of the input-/output-heads present in a V2 Inference Protocol response.

    However, a NumPy array or a list of strings could be encoded directly as an input head within a larger request.

    To account for this, codecs can work at either the request- / response-level (known as request codecs), or the input- / output-level (known as input codecs). Each of these codecs exposes the following public interface, where Any represents a high-level Python datatype (e.g. a Pandas DataFrame, a NumPy Array, etc.):

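    • Request Codecs

      • encode_request() <mlserver.codecs.RequestCodec.encode_request>

      • decode_request() <mlserver.codecs.RequestCodec.decode_request>

      • encode_response() <mlserver.codecs.RequestCodec.encode_response>

      • decode_response() <mlserver.codecs.RequestCodec.decode_response>

    • Input Codecs

      • encode_input() <mlserver.codecs.InputCodec.encode_input>

      • decode_input() <mlserver.codecs.InputCodec.decode_input>

      • encode_output() <mlserver.codecs.InputCodec.encode_output>

      • decode_output() <mlserver.codecs.InputCodec.decode_output>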

    Note that these methods can also be used as helpers to encode requests and decode responses on the client side. This can help to abstract away from the user most of the details about the underlying structure of V2-compatible payloads.

    For instance, in the example above, we could use codecs to encode the DataFrame into a V2-compatible request simply as:

    For a full end-to-end example on how content types and codecs work under the hood, feel free to check out this Content Type Decoding example.

    When using MLServer's request codecs, the output of encoding payloads will always be one of the classes within the mlserver.types package (i.e. InferenceRequest <mlserver.types.InferenceRequest> or InferenceResponse <mlserver.types.InferenceResponse>). Therefore, if you want to use them with requests (or any other package outside of MLServer), you will need to convert them to a Python dict or a JSON string.

    Luckily, these classes leverage Pydantic under the hood. Therefore, you can just call the .model_dump() or .model_dump_json() method to convert them. Likewise, to read them back from JSON, we can always pass the JSON fields as kwargs to the class' constructor (or use any of the other methods available within Pydantic).

    For example, if we want to send an inference request to model foo, we could do something along the following lines:

    The NaN (Not a Number) value is used in Numpy and other scientific libraries to describe an invalid or missing value (e.g. a division by zero). In some scenarios, it may be desirable to let your models receive and / or output NaN values (e.g. these can be useful sometimes with GBTs, like XGBoost models). This is why MLServer supports encoding NaN values on your request / response payloads under some conditions.

    In order to send / receive NaN values, you must ensure that:

    • You are using the REST interface.

    • The input / output entry containing NaN values uses either the FP16, FP32 or FP64 datatypes.

    • You are either using the Pandas codec or the Numpy codec.

    Assuming those conditions are satisfied, any null value within your tensor payload will be converted to NaN.
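
    The null-to-NaN conversion can be pictured with a small standard-library sketch. This is only an illustration of the idea, not MLServer's implementation; the helper name decode_nullable_floats is hypothetical, and the payload is handled as a plain dictionary.

```python
import json
import math

# Hypothetical helper (not part of MLServer): mimic how `null` entries in a
# floating-point tensor payload can be read back as NaN values.
def decode_nullable_floats(data):
    return [math.nan if value is None else float(value) for value in data]

# A V2-style input entry, as it could arrive over the REST interface
raw_input = json.loads(
    '{"name": "foo", "datatype": "FP64", "shape": [2, 2], '
    '"data": [1.2, 2.3, null, 4.5]}'
)

decoded = decode_nullable_floats(raw_input["data"])
print(decoded)
```

    JSON has no literal for NaN, which is why the missing value travels as null and is only turned back into NaN once the payload is decoded.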

    For example, if you take the following Numpy array:

    We could encode it as:

    Content types can also be defined as part of the model's metadata. This lets the user pre-configure which content types a model should use by default to decode / encode its requests / responses, without the need to specify them on each request.

    For example, to configure the content type values of the example above, one could create a model-settings.json file like the one below:

    It's important to keep in mind that content types passed explicitly as part of the request will always take precedence over the model's metadata. Therefore, we can leverage this to override the model's metadata when needed.
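
    The precedence rule can be summarised with a small sketch. The resolve_content_type helper below is hypothetical and only models the lookup order described above (explicit request value first, model metadata second); it is not MLServer's actual code.

```python
from typing import Optional

# Hypothetical helper: pick the content type for one input, preferring the
# value sent explicitly with the request over the model's metadata default.
def resolve_content_type(
    request_parameters: Optional[dict], metadata_parameters: Optional[dict]
) -> Optional[str]:
    for parameters in (request_parameters, metadata_parameters):
        if parameters and parameters.get("content_type"):
            return parameters["content_type"]
    return None

# The metadata declares `str`, but the request overrides it with `base64`
overridden = resolve_content_type({"content_type": "base64"}, {"content_type": "str"})
# No explicit content type on the request: fall back to the metadata
fallback = resolve_content_type(None, {"content_type": "str"})
print(overridden, fallback)
```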

    Out of the box, MLServer supports the following list of content types. However, this can be extended through the use of 3rd-party or custom runtimes.

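    Python Type | Content Type | Request Level | Request Codec | Input Level | Input Codec
    NumPy Array | np | ✅ | mlserver.codecs.NumpyRequestCodec | ✅ | mlserver.codecs.NumpyCodec
    Pandas DataFrame | pd | ✅ | mlserver.codecs.PandasCodec | ❌ | -
    UTF-8 String | str | ✅ | mlserver.codecs.string.StringRequestCodec | ✅ | mlserver.codecs.StringCodec
    Base64 | base64 | ❌ | - | ✅ | mlserver.codecs.Base64Codec
    Datetime | datetime | ❌ | - | ✅ | mlserver.codecs.DatetimeCodec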

    The np content type will decode / encode V2 payloads to a NumPy Array, taking into account the following:

    • The datatype field will be matched to the closest NumPy dtype.

    • The shape field will be used to reshape the flattened array expected by the V2 protocol into the expected tensor shape.

    For example, if we think of the following NumPy Array:

    We could encode it as the input foo in a V2 protocol request as:

    When using the NumPy Array content type at the request-level, it will decode the entire request by considering only the first input element. This can be used as a helper for models which only expect a single tensor.

    The pd content type will decode / encode a V2 request into a Pandas DataFrame. For this, it will expect that the DataFrame is shaped in a columnar way. That is,

    • Each entry of the inputs list (or outputs, in the case of responses), will represent a column of the DataFrame.

    • Each of these entries will contain all the row elements for that particular column.

    • The shape field of each input (or output) entry will contain (at least) the number of rows included in the dataframe.

    For example, if we consider the following dataframe:

    A | B | C
    a1 | b1 | c1
    a2 | b2 | c2
    a3 | b3 | c3
    a4 | b4 | c4

    We could encode it to the V2 Inference Protocol as:

    The str content type lets you encode / decode a V2 input into a UTF-8 Python string, taking into account the following:

    • The expected datatype is BYTES.

    • The shape field represents the number of "strings" that are encoded in the payload (e.g. the ["hello world", "one more time"] payload will have a shape of 2 elements).
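
    The two rules above can be sketched with plain dictionaries. The encode_strings and decode_strings helpers are hypothetical names used only for illustration; they are not MLServer's StringCodec.

```python
# Hypothetical helper: build a V2-style input entry for a list of strings.
def encode_strings(name, strings):
    return {
        "name": name,
        "datatype": "BYTES",
        # one entry per encoded string
        "shape": [len(strings)],
        "parameters": {"content_type": "str"},
        "data": list(strings),
    }

# Hypothetical helper: read the strings back from a V2-style entry.
def decode_strings(entry):
    # Entries may arrive as raw UTF-8 bytes instead of `str`
    return [
        item.decode("utf-8") if isinstance(item, bytes) else item
        for item in entry["data"]
    ]

entry = encode_strings("foo", ["hello world", "one more time"])
print(entry["shape"])
print(decode_strings(entry))
```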

    For example, if we consider the following list of strings:

    We could encode it to the V2 Inference Protocol as:

    When using the str content type at the request-level, it will decode the entire request by considering only the first input element. This can be used as a helper for models which only expect a single string or a set of strings.

    The base64 content type will decode a binary V2 payload into a Base64-encoded string (and vice versa), taking into account the following:

    • The expected datatype is BYTES.

    • The data field should contain the base64-encoded binary strings.

    • The shape field represents the number of binary strings that are encoded in the payload.

    For example, if we think of the following "bytes array":

    We could encode it as the input foo of a V2 request as:
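
    For reference, the Base64 step itself can be reproduced with Python's standard library alone. The following is a minimal sketch that builds the input by hand as a plain dictionary, rather than going through MLServer's codec:

```python
import base64

foo = b"Python is fun"

# Base64-encode the raw bytes and wrap them in a V2-style input entry
entry = {
    "name": "foo",
    "datatype": "BYTES",
    "shape": [1],
    "parameters": {"content_type": "base64"},
    "data": [base64.b64encode(foo).decode("ascii")],
}
print(entry["data"])

# Decoding reverses the process and recovers the original bytes
decoded = [base64.b64decode(item) for item in entry["data"]]
print(decoded)
```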

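    The datetime content type will decode a V2 input into a Python datetime.datetime object, taking into account the following:

    • The expected datatype is BYTES.

    • The data field should contain the dates serialised following the ISO 8601 standard.

    • The shape field represents the number of datetimes that are encoded in the payload.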

    For example, if we think of the following datetime object:

    We could encode it as the input foo of a V2 request as:
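
    The ISO 8601 serialisation can likewise be reproduced with the standard library. This is a hand-built sketch using a plain dictionary, not MLServer's DatetimeCodec:

```python
import datetime

foo = datetime.datetime(2022, 1, 11, 11, 0, 0)

entry = {
    "name": "foo",
    "datatype": "BYTES",
    "shape": [1],
    "parameters": {"content_type": "datetime"},
    # `isoformat()` serialises the value following ISO 8601
    "data": [foo.isoformat()],
}
print(entry["data"])

# Parsing the ISO 8601 strings recovers the original datetime objects
decoded = [datetime.datetime.fromisoformat(item) for item in entry["data"]]
```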

    encode_response() <mlserver.codecs.RequestCodec.encode_response>

  • decode_response() <mlserver.codecs.RequestCodec.decode_response>

  • Input Codecs

    • encode_input() <mlserver.codecs.InputCodec.encode_input>

    • decode_input() <mlserver.codecs.InputCodec.decode_input>

    • encode_output() <mlserver.codecs.InputCodec.encode_output>

    • decode_output() <mlserver.codecs.InputCodec.decode_output>

  • You are either using the Pandas codec or the Numpy codec.

    mlserver.codecs.NumpyCodec

    pd

    ✅

    mlserver.codecs.PandasCodec

    ❌

    str

    ✅

    mlserver.codecs.string.StringRequestCodec

    ✅

    mlserver.codecs.StringCodec

    base64

    ❌

    ✅

    mlserver.codecs.Base64Codec

    datetime

    ❌

    ✅

    mlserver.codecs.DatetimeCodec

    (or
    output
    ) entry will contain (at least) the amount of rows included in the dataframe.

    b3

    c3

    a4

    b4

    c4

    field represents the number of binary strings that are encoded in the payload.
    field represents the number of datetimes that are encoded in the payload.

    Joanne

    34

    Michael

    22

    {
      "parameters": {
        "content_type": "pd"
      },
      "inputs": [
        {
          "name": "First Name",
          "datatype": "BYTES",
          "parameters": {
            "content_type": "str"
          },
          "shape": [2],
          "data": ["Joanne", "Michael"]
        },
        {
          "name": "Age",
          "datatype": "INT32",
          "shape": [2],
          "data": [34, 22]
        }
      ]
    }
    import pandas as pd
    
    from mlserver.codecs import PandasCodec
    
    dataframe = pd.DataFrame({'First Name': ["Joanne", "Michael"], 'Age': [34, 22]})
    
    inference_request = PandasCodec.encode_request(dataframe)
    print(inference_request)
    import pandas as pd
    import requests
    
    from mlserver.codecs import PandasCodec
    from mlserver.types import InferenceResponse
    
    dataframe = pd.DataFrame({'First Name': ["Joanne", "Michael"], 'Age': [34, 22]})
    
    inference_request = PandasCodec.encode_request(dataframe)
    
    # raw_request will be a Python dictionary compatible with `requests`'s `json` kwarg
    raw_request = inference_request.model_dump()
    
    # the URL needs an explicit scheme for `requests` to accept it
    response = requests.post("http://localhost:8080/v2/models/foo/infer", json=raw_request)
    
    # raw_response will be a dictionary (loaded from the response's JSON),
    # therefore we can pass it as the InferenceResponse constructor's kwargs
    raw_response = response.json()
    inference_response = InferenceResponse(**raw_response)
    import numpy as np
    
    foo = np.array([[1.2, 2.3], [np.nan, 4.5]])
    {
      "inputs": [
        {
          "name": "foo",
          "parameters": {
            "content_type": "np"
          },
          "data": [1.2, 2.3, null, 4.5],
          "datatype": "FP64",
          "shape": [2, 2]
        }
      ]
    }
    {
      "parameters": {
        "content_type": "pd"
      },
      "inputs": [
        {
          "name": "First Name",
          "datatype": "BYTES",
          "parameters": {
            "content_type": "str"
          },
          "shape": [-1]
        },
        {
          "name": "Age",
          "datatype": "INT32",
          "shape": [-1]
        }
      ]
    }

    NumPy Array

    np

    ✅

    mlserver.codecs.NumpyRequestCodec

    import numpy as np
    
    foo = np.array([[1, 2], [3, 4]])
    {
      "inputs": [
        {
          "name": "foo",
          "parameters": {
            "content_type": "np"
          },
          "data": [1, 2, 3, 4],
          "datatype": "INT32",
          "shape": [2, 2]
        }
      ]
    }
    from mlserver.codecs import NumpyRequestCodec
    
    # Encode an entire V2 request
    inference_request = NumpyRequestCodec.encode_request(foo)
    from mlserver.types import InferenceRequest
    from mlserver.codecs import NumpyCodec
    
    # We can use the `NumpyCodec` to encode a single input head with name `foo`
    # within a larger request
    inference_request = InferenceRequest(
      inputs=[
        NumpyCodec.encode_input("foo", foo)
      ]
    )

    a1

    b1

    c1

    a2

    b2

    c2

    {
      "parameters": {
        "content_type": "pd"
      },
      "inputs": [
        {
          "name": "A",
          "data": ["a1", "a2", "a3", "a4"],
          "datatype": "BYTES",
          "shape": [4]
        },
        {
          "name": "B",
          "data": ["b1", "b2", "b3", "b4"],
          "datatype": "BYTES",
          "shape": [4]
        },
        {
          "name": "C",
          "data": ["c1", "c2", "c3", "c4"],
          "datatype": "BYTES",
          "shape": [4]
        }
      ]
    }
    import pandas as pd
    
    from mlserver.codecs import PandasCodec
    
    foo = pd.DataFrame({
      "A": ["a1", "a2", "a3", "a4"],
      "B": ["b1", "b2", "b3", "b4"],
      "C": ["c1", "c2", "c3", "c4"]
    })
    
    inference_request = PandasCodec.encode_request(foo)
    foo = ["bar", "bar2"]
    {
      "parameters": {
        "content_type": "str"
      },
      "inputs": [
        {
          "name": "foo",
          "data": ["bar", "bar2"],
          "datatype": "BYTES",
          "shape": [2]
        }
      ]
    }
    from mlserver.codecs.string import StringRequestCodec
    
    # Encode an entire V2 request
    inference_request = StringRequestCodec.encode_request(foo, use_bytes=False)
    from mlserver.types import InferenceRequest
    from mlserver.codecs import StringCodec
    
    # We can use the `StringCodec` to encode a single input head with name `foo`
    # within a larger request
    inference_request = InferenceRequest(
      inputs=[
        StringCodec.encode_input("foo", foo, use_bytes=False)
      ]
    )
    foo = b"Python is fun"
    {
      "inputs": [
        {
          "name": "foo",
          "parameters": {
            "content_type": "base64"
          },
          "data": ["UHl0aG9uIGlzIGZ1bg=="],
          "datatype": "BYTES",
          "shape": [1]
        }
      ]
    }
    from mlserver.types import InferenceRequest
    from mlserver.codecs import Base64Codec
    
    # We can use the `Base64Codec` to encode a single input head with name `foo`
    # within a larger request
    inference_request = InferenceRequest(
      inputs=[
        Base64Codec.encode_input("foo", foo, use_bytes=False)
      ]
    )
    import datetime
    
    foo = datetime.datetime(2022, 1, 11, 11, 0, 0)
    {
      "inputs": [
        {
          "name": "foo",
          "parameters": {
            "content_type": "datetime"
          },
          "data": ["2022-01-11T11:00:00"],
          "datatype": "BYTES",
          "shape": [1]
        }
      ]
    }
    from mlserver.types import InferenceRequest
    from mlserver.codecs import DatetimeCodec
    
    # We can use the `DatetimeCodec` to encode a single input head with name `foo`
    # within a larger request
    inference_request = InferenceRequest(
      inputs=[
        DatetimeCodec.encode_input("foo", foo, use_bytes=False)
      ]
    )

    Usage

    Some inference runtimes may apply a content type by default if none is present. To learn more about each runtime's defaults, please check the relevant inference runtime's docs.

    It's important to keep in mind that content types can be specified at both the request level and the input level. The former will apply to the entire set of inputs, whereas the latter will only apply to a particular input of the payload.

    Codecs

    Converting to / from JSON

    Support for NaN values

    Model Metadata

    Available Content Types

    MLServer allows you to extend the supported content types by adding custom ones. To learn more about how to write your own custom content types, you can check this full end-to-end example. You can also learn more about building custom extensions for MLServer in the Custom Inference Runtime section of the docs.

    NumPy Array

    The V2 Inference Protocol expects that the data of each input is sent as a flat array. Therefore, the np content type will expect that tensors are sent flattened. The information in the shape field will then be used to reshape the vector into the right dimensions.

    By default, MLServer will always assume that an array with a single-dimensional shape, e.g. [N], is equivalent to [N, 1]. That is, each entry will be treated like a single one-dimensional data point (i.e. instead of a [1, D] array, where the full array is a single D-dimensional data point). To avoid any ambiguity, where possible, the Numpy codec will always explicitly encode [N] arrays as [N, 1].
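
    The flattening, reshaping and [N] to [N, 1] conventions described above can be sketched with plain Python lists. The flatten, reshape and expand_single_dimension helpers are hypothetical names for illustration only; MLServer itself relies on NumPy for this.

```python
from math import prod

# Hypothetical helpers, written with plain lists to avoid any dependency.
def flatten(nested):
    # Recursively flatten a nested list into the flat `data` vector
    if not isinstance(nested, list):
        return [nested]
    return [item for sub in nested for item in flatten(sub)]

def reshape(flat, shape):
    # Rebuild nested lists from a flat vector using the `shape` field
    if prod(shape) != len(flat):
        raise ValueError("shape does not match the number of elements")
    if len(shape) == 1:
        return list(flat)
    if shape[0] == 0:
        return []
    step = len(flat) // shape[0]
    return [reshape(flat[i * step:(i + 1) * step], shape[1:]) for i in range(shape[0])]

def expand_single_dimension(shape):
    # Mirror the [N] -> [N, 1] convention described above
    return [shape[0], 1] if len(shape) == 1 else list(shape)

flat = flatten([[1, 2], [3, 4]])
print(flat)
print(reshape(flat, [2, 2]))
print(reshape([5, 6, 7], expand_single_dimension([3])))
```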

    Pandas DataFrame

    The pd content type can be stacked with other content types. This allows the user to use a different set of content types to decode each of the columns.
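
    As a rough illustration of this stacking, the sketch below builds a columnar request by hand, tagging the whole request as pd and selected columns with their own content type. The encode_columns helper and its arguments are hypothetical; in practice PandasCodec takes care of this.

```python
# Hypothetical helper: turn column-oriented data into a V2-style request
# where the request is tagged as `pd` and selected columns get their own
# content type.
def encode_columns(columns, datatypes, column_content_types=None):
    column_content_types = column_content_types or {}
    inputs = []
    for name, values in columns.items():
        entry = {
            "name": name,
            "datatype": datatypes[name],
            "shape": [len(values)],
            "data": list(values),
        }
        if name in column_content_types:
            entry["parameters"] = {"content_type": column_content_types[name]}
        inputs.append(entry)
    return {"parameters": {"content_type": "pd"}, "inputs": inputs}

request = encode_columns(
    {"First Name": ["Joanne", "Michael"], "Age": [34, 22]},
    datatypes={"First Name": "BYTES", "Age": "INT32"},
    column_content_types={"First Name": "str"},
)
print(request)
```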

    UTF-8 String

    Base64

    Datetime

    Available Content Types
    Content Type Decoding example
    Pydantic
    other methods
    model's metadata
    example above
    NumPy dtype
    Python datetime.datetime object
    ISO 8601 standard
    Content Types
    Request Codecs
    Input Codecs

    ✅

    a3

    Pandas DataFrame
    UTF-8 String
    Base64
    Datetime

    Streaming

    The mlserver package comes with built-in support for streaming data. This allows you to process data in real-time, without having to wait for the entire response to be available. It supports both REST and gRPC APIs.

    Overview

    In this example, we create a simple Identity Text Model which simply splits the input text into words and returns them one by one. We will use this model to demonstrate how to stream the response from the server to the client. This particular example can provide a good starting point for building more complex streaming models such as the ones based on Large Language Models (LLMs) where streaming is an essential feature to hide the latency of the model.

    Serving

    The next step will be to serve our model using mlserver. For that, we will first implement an extension that serves as the runtime to perform inference using our custom TextModel.

    Custom inference runtime

    This is a trivial model to demonstrate streaming support. The model simply splits the input text into words and returns them one by one. In this example we do the following:

    • split the text into words using the white space as the delimiter.

    • wait 0.5 seconds between each word to simulate a slow model.

    • return each word one by one.

    As can be seen, the predict_stream method receives as an input an AsyncIterator of InferenceRequest and returns an AsyncIterator of InferenceResponse. This definition covers all types of possible input-output combinations for streaming: unary-stream, stream-unary, stream-stream. It is up to the client and server to send / receive the appropriate number of requests / responses, which should be known a priori.
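
    The iterator-in, iterator-out shape of this signature can be tried out without MLServer. The sketch below uses plain strings in place of InferenceRequest and InferenceResponse, and the function names are hypothetical; it only illustrates the unary-stream case, where a single request produces several responses.

```python
import asyncio
from typing import AsyncIterator, List

# Hypothetical stand-in for `predict_stream`: it consumes an async iterator of
# requests (plain strings here) and yields one response per word.
async def split_words_stream(payloads: AsyncIterator[str]) -> AsyncIterator[str]:
    async for text in payloads:
        for word in text.split(" "):
            await asyncio.sleep(0)  # placeholder for slow model work
            yield word

async def single_request(text: str) -> AsyncIterator[str]:
    # unary-stream: the client sends exactly one request
    yield text

async def collect(text: str) -> List[str]:
    return [word async for word in split_words_stream(single_request(text))]

words = asyncio.run(collect("What is the capital of France?"))
print(words)
```

    A stream-stream variant would simply have single_request yield more than one item; the consuming side stays the same.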

    Note that although unary-unary can be covered by predict_stream method as well, mlserver already covers that through the predict method.

    One important limitation to keep in mind is that, for the REST API, the client will not be able to send a stream of requests. The client will have to send a single request with the entire input text. The server will then stream the response back to the client. The gRPC API, on the other hand, supports all types of streaming listed above.

    The next step will be to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    Note that the streaming support in MLServer currently has the following main limitations:

    • distributed workers are not supported (i.e., the parallel_workers setting should be set to 0)

    • gzip middleware is not supported for REST (i.e., gzip_enabled setting should be set to false)

    Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.

    Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.

    To test our model, we will use the following inference request:

    To send a REST streaming request to the server, we will use the following Python code:

    To send a gRPC streaming request to the server, we will use the following Python code:

    Note that for gRPC, the request is transformed into an async generator which is then passed to the ModelStreamInfer method. The response is also an async generator which can be iterated over to get the response.

    Settings file

    settings.json

    model-settings.json

    Start serving the model

    Inference request

    Send test generate stream request (REST)

    Send test generate stream request (gRPC)

    %%writefile text_model.py
    
    import asyncio
    from typing import AsyncIterator
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse
    from mlserver.codecs import StringCodec
    
    
    class TextModel(MLModel):
    
        async def predict_stream(
            self, payloads: AsyncIterator[InferenceRequest]
        ) -> AsyncIterator[InferenceResponse]:
            payload = [_ async for _ in payloads][0]
            text = StringCodec.decode_input(payload.inputs[0])[0]
            words = text.split(" ")
    
            split_text = []
            for i, word in enumerate(words):
                split_text.append(word if i == 0 else " " + word)
    
            for word in split_text:
                await asyncio.sleep(0.5)
                yield InferenceResponse(
                    model_name=self._settings.name,
                    outputs=[
                        StringCodec.encode_output(
                            name="output",
                            payload=[word],
                            use_bytes=True,
                        ),
                    ],
                )
    
    %%writefile settings.json
    
    {
      "debug": false,
      "parallel_workers": 0,
      "gzip_enabled": false
    }
    
    %%writefile model-settings.json
    
    {
      "name": "text-model",
    
      "implementation": "text_model.TextModel",
      
      "versions": ["text-model/v1.2.3"],
      "platform": "mlserver",
      "inputs": [
        {
          "datatype": "BYTES",
          "name": "prompt",
          "shape": [1]
        }
      ],
      "outputs": [
        {
          "datatype": "BYTES",
          "name": "output",
          "shape": [1]
        }
      ]
    }
    mlserver start .
    %%writefile generate-request.json
    
    {
        "inputs": [
            {
                "name": "prompt",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["What is the capital of France?"],
                "parameters": {
                "content_type": "str"
                }
            }
        ],
        "outputs": [
            {
              "name": "output"
            }
        ]
    }
    import httpx
    from httpx_sse import connect_sse
    from mlserver import types
    from mlserver.codecs import StringCodec
    
    inference_request = types.InferenceRequest.parse_file("./generate-request.json")
    
    with httpx.Client() as client:
        with connect_sse(client, "POST", "http://localhost:8080/v2/models/text-model/generate_stream", json=inference_request.dict()) as event_source:
            for sse in event_source.iter_sse():
                response = types.InferenceResponse.parse_raw(sse.data)
                print(StringCodec.decode_output(response.outputs[0]))
    
    import grpc
    import mlserver.types as types
    from mlserver.codecs import StringCodec
    from mlserver.grpc.converters import ModelInferResponseConverter
    import mlserver.grpc.converters as converters
    import mlserver.grpc.dataplane_pb2_grpc as dataplane
    
    inference_request = types.InferenceRequest.parse_file("./generate-request.json")
    
    # need to convert from string to bytes for grpc
    inference_request.inputs[0] = StringCodec.encode_input("prompt", inference_request.inputs[0].data.root)
    inference_request_g = converters.ModelInferRequestConverter.from_types(
        inference_request, model_name="text-model", model_version=None
    )
    
    async def get_inference_request_stream(inference_request):
        yield inference_request
    
    async with grpc.aio.insecure_channel("localhost:8081") as grpc_channel:
        grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)
        inference_request_stream = get_inference_request_stream(inference_request_g)
        
        async for response in grpc_stub.ModelStreamInfer(inference_request_stream):
            response = ModelInferResponseConverter.to_types(response)
            print(StringCodec.decode_output(response.outputs[0]))

    Getting Started

    This guide will help you get started creating machine learning microservices with MLServer in less than 30 minutes. Our use case will be to create a service that helps us compare the similarity between two documents. Think about whenever you are choosing a book, news article, blog post, or tutorial to read next: wouldn't it be great to have a way to compare it with similar ones that you have already read and liked (without having to rely on a recommendation system)? That's what we'll focus on in this guide: creating a document similarity service. 📜 + 📃 = 😎👌🔥

    The code is showcased as if it were cells inside a notebook but you can run each of the steps inside Python files with minimal effort.

    00 What is MLServer?

    MLServer is an open-source Python library for building production-ready asynchronous APIs for machine learning models.

    01 Dependencies

    The first step is to install mlserver, the spacy library, and the language model spacy will need for our use case. We will also download the wikipedia-api library to test our use case with a few fun summaries.

    If you've never heard of spaCy before, it is an open-source Python library for advanced natural language processing that excels at large-scale information extraction and retrieval tasks, among many others. The model we'll use is pre-trained on English text from the web. This model will help us get started with our use case faster than if we had to train a model from scratch.

    Let's first install these libraries.

    We will also need to download the language model separately once we have spaCy inside our virtual environment.

    If you're going over this guide inside a notebook, don't forget to add an exclamation mark ! in front of the two commands above. If you are in VSCode, you can keep them as they are and change the cell type to bash.

    At its core, MLServer requires that users give it three things: a model-settings.json file with information about the model, an (optional) settings.json file with information related to the server you are about to set up, and a .py file with the load-predict recipe for your model (as shown in the picture above).

    Let's create a directory for our model.

    Before we create a service that allows us to compare the similarity between two documents, it is good practice to first test that our solution works, especially if we're using a pre-trained model and/or a pipeline.

    Now that we have our model loaded, let's look at the similarity of the abstracts of Barbieheimer using the wikipedia-api Python library. The main requirement of the API is that we pass into the main class, Wikipedia(), a project name, an email, and the language we want information to be returned in. After that, we can search for the movie summaries we want by passing the title of the movie to the .page() method and accessing its summary with the .summary attribute.

    Feel free to change the movies for other topics you might be interested in.

    You can run the following lines inside a notebook or, alternatively, add them to an app.py file.

    If you created an app.py file with the code above, make sure you run python app.py from the terminal.

    Now that we have our two summaries, let's compare them using spacy.

    Notice that both summaries have information about the other movie, about "films" in general, and about the dates each aired on (which are the same). The reality is that the model hasn't seen any of these movies, so it might be generalizing to the context of each article, "movies," rather than their content, "dolls as humans and the atomic bomb."
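
    To build some intuition for why two quite different films can score so high, recall that spaCy's default Doc.similarity is the cosine similarity between document vectors, and a document vector is the average of its token vectors. The sketch below reproduces that idea with NumPy on made-up 3-dimensional "word vectors" (the vectors and names here are illustrative assumptions, not real spaCy data): the tokens the two documents share pull their averages toward each other, even when the distinctive tokens are unrelated.

```python
import numpy as np


def doc_vector(token_vectors: np.ndarray) -> np.ndarray:
    """Average the token vectors, as spaCy does by default for a Doc."""
    return token_vectors.mean(axis=0)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical toy vectors: two tokens shared by both documents
# (think "film" and "2023") plus one distinctive token each.
shared = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
doc_a = np.vstack([shared, [[0.0, 1.0, 0.0]]])  # distinctive token A
doc_b = np.vstack([shared, [[0.0, 0.0, 1.0]]])  # distinctive token B

# The distinctive tokens alone are orthogonal (similarity 0.0) ...
distinct_score = cosine_similarity(doc_a[-1], doc_b[-1])
# ... but the averaged document vectors are much closer together.
score = cosine_similarity(doc_vector(doc_a), doc_vector(doc_b))
print(distinct_score, score)
```

    The more vocabulary two summaries share, the closer this average-based score gets to 1, which is consistent with the high value we saw above.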

    You should, of course, play around with different pages and see if what you get back is coherent with what you would expect.

    Time to create a machine learning API for our use-case. 😎

    MLServer allows us to wrap machine learning models into APIs and build microservices with replicas of a single model, or different models all together.

    To create a service with MLServer, we will define a class with two asynchronous functions, one that loads the model and another one to run inference (or predict) with. The former will load the spacy model we tested in the last section, and the latter will take in a list with the two documents we want to compare. Lastly, our function will return a numpy array with a single value, our similarity score. We'll write the file to our similarity_model directory and call it my_model.py.

    Now that we have our model file ready to go, the last piece of the puzzle is to tell MLServer a bit of info about it. In particular, it wants (or needs) to know the name of the model and how to implement it. The former can be anything you want (and it will be part of the URL of your API), and the latter will follow the recipe of name_of_py_file_with_your_model.class_with_your_model.
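
    To see what a module.ClassName recipe means in practice, here is a minimal sketch of how a dotted path can be turned into a Python class using the standard library. The helper resolve_implementation is a hypothetical name used for illustration only; MLServer's own loading logic may differ in its details.

```python
import importlib


def resolve_implementation(dotted_path: str):
    """Split 'module.ClassName' and import the class it points to."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# A standard-library class stands in for "my_model.MyKulModel" here,
# so the sketch runs without the tutorial files being present.
resolved = resolve_implementation("collections.OrderedDict")
print(resolved)
```

    With the tutorial's files in place, the same idea applied to "my_model.MyKulModel" would import my_model.py and pick out the MyKulModel class.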

    Let's create the model-settings.json file MLServer is expecting inside our similarity_model directory and add the name and the implementation of our model to it.

    Now that everything is in place, we can start serving predictions locally to test how things would play out for our future users. We'll initiate our server via the command line, and later on we'll see how to do the same via Python files. Here's where we are at right now in the process of developing microservices with MLServer.

    As you can see in the image, our server will be initialized with three entry points: one for HTTP requests, another for gRPC, and another for the metrics. To learn more about the powerful metrics feature of MLServer, please visit the relevant docs page. To learn more about gRPC, please see this tutorial.

    To start our service, open up a terminal and run the following command.

    Note: If this is a fresh terminal, make sure you activate your environment before you run the command above. If you run the command above from your notebook (e.g. !mlserver start similarity_model/), you will have to send the request below from another notebook or terminal since the cell will continue to run until you turn it off.

    Time to become a client of our service and test it. For this, we'll set up the payload we'll send to our service and use the requests library to POST our request.

    Please note that the request below uses the variables we created earlier with the summaries of Barbie and Oppenheimer. If you are sending this POST request from a fresh Python file, make sure you move those lines of code above into your request file.

    Let's decompose what just happened.

    The URL for our service might seem a bit odd if you've never heard of the V2/Open Inference Protocol (OIP). This protocol is a set of specifications that allows machine learning models to be shared and deployed in a standardized way. This protocol enables the use of machine learning models on a variety of platforms and devices without requiring changes to the model or its code. The OIP is useful because it allows us to integrate machine learning into a wide range of applications in a standard way.

    All URLs you create with MLServer will have the following structure.
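
    As a rough sketch of that structure, the inference endpoint in the V2/Open Inference Protocol is /v2/models/{model_name}/infer, with an optional /versions/{model_version} segment before /infer. The helper below (infer_url is a hypothetical name, and the default port 8080 is taken from the request used in this guide) assembles such a URL.

```python
from typing import Optional


def infer_url(
    host: str,
    model_name: str,
    version: Optional[str] = None,
    port: int = 8080,
) -> str:
    """Build the HTTP inference URL for a model served over the OIP."""
    url = f"http://{host}:{port}/v2/models/{model_name}"
    if version is not None:
        # The version segment is optional in the protocol.
        url += f"/versions/{version}"
    return f"{url}/infer"


print(infer_url("0.0.0.0", "doc-sim-model"))
print(infer_url("0.0.0.0", "doc-sim-model", version="v1"))
```

    The first call yields the same URL we posted to earlier; the version "v1" in the second call is only an example value.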

    This kind of protocol is a standard adopted by different companies and projects like NVIDIA, TensorFlow Serving, KServe, and others, to keep everyone on the same page. Think about driving: each country applies a standard for driving on a particular side of the road, which ensures that you and everyone else stay on the left (or the right, depending on where you are). Adopting this standard means that you won't have to wonder where the next driver is going to come from when you are about to take a turn; instead, you can focus on getting where you're going without much worry.

    Let's describe what each of the components of our inference_request does.

    • name: this maps one-to-one to the name of the parameter in your predict() function.

    • shape: represents the shape of the elements in our data. In our case, it is a list with 2 strings.

    To learn more about the OIP and how MLServer content types work, please have a look at their docs page.
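
    To make these components concrete, the following sketch builds the same kind of payload by hand, without the StringCodec helper. The field values mirror the printed inference_request shown in this guide (BYTES datatype, content_type of str, and a [n, 1] shape); build_inference_request is a hypothetical helper name and the two short strings are placeholders for real documents.

```python
import json
from typing import Any, Dict, List


def build_inference_request(name: str, docs: List[str]) -> Dict[str, Any]:
    """Assemble an OIP-style request with a single string input."""
    return {
        "inputs": [
            {
                "name": name,                 # must match the predict() parameter
                "shape": [len(docs), 1],      # one row per document
                "datatype": "BYTES",          # strings travel as BYTES
                "parameters": {"content_type": "str"},
                "data": docs,                 # the inputs themselves
            }
        ]
    }


payload = build_inference_request("docs", ["first document", "second document"])
print(json.dumps(payload, indent=2))
```

    Building the dictionary manually can be handy when the client cannot install mlserver, although the codec helpers remain the less error-prone route.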

    Say you need to meet the demand of a high number of users and one model might not be enough, or is not using all of the resources of the virtual machine instance it was allocated to. What we can do in this case is to create multiple replicas of our model to increase the throughput of the requests that come in. This can be particularly useful at the peak times of our server. To do this, we need to tweak the settings of our server via the settings.json file. In it, we'll add the number of independent models we want to have to the parameter "parallel_workers": 3.

    Let's stop our server, change its settings, start it again, and test it.

    As you can see in the output of the terminal in the picture above, we now have 3 models running in parallel. The reason you might see 4 is because, by default, MLServer will print the name of the initialized model if it is one or more, and it will also print one for each of the replicas specified in the settings.

    Let's get a few more twin films examples to test our server. Get as creative as you'd like. 💡

    Let's first test that the function works as intended.

    Now let's map three POST requests at the same time.
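
    Note that Python's built-in map calls a function once per item, one after another. If you want the requests to really be in flight at the same time, one option is a thread pool. The sketch below is an illustration under stated assumptions: score_pairs_concurrently is a hypothetical helper, and fake_score is a stand-in scoring function (word overlap) used so the example runs without a server. In practice you would pass in the function that posts to the service instead.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def score_pairs_concurrently(
    score_fn: Callable[[str, str], float],
    pairs: Sequence[Tuple[str, str]],
    max_workers: int = 3,
) -> List[float]:
    """Score each (doc1, doc2) pair in its own thread, preserving order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda pair: score_fn(*pair), pairs))


def fake_score(doc1: str, doc2: str) -> float:
    """Stand-in for a request to the service: Jaccard word overlap."""
    words1, words2 = set(doc1.split()), set(doc2.split())
    return len(words1 & words2) / len(words1 | words2)


pairs = [("a b c", "a b d"), ("x y", "x y"), ("p", "q")]
scores = score_pairs_concurrently(fake_score, pairs)
print(scores)
```

    Setting max_workers to the number of parallel workers configured on the server is a reasonable starting point, since each worker can then handle one request.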

    We can also test it one by one.

    For the last step of this guide, we are going to package our model and service into a docker image that we can reuse in another project or share with colleagues immediately. This step requires that we have docker installed and configured on our machines, so if you need to set up docker, you can do so by following the instructions in the documentation.

    The first step is to create a requirements.txt file with all of our dependencies and add it to the directory we've been using for our service (similarity_model).

    The next step is to build a docker image with our model, its dependencies and our server. If you've never heard of docker images before, here's a short description.

    A Docker image is a lightweight, standalone, and executable package that includes everything needed to run a piece of software, including code, libraries, dependencies, and settings. It's like a carry-on bag for your application, containing everything it needs to travel safely and run smoothly in different environments. Just as a carry-on bag allows you to bring your essentials with you on a trip, a Docker image enables you to transport your application and its requirements across various computing environments, ensuring consistent and reliable deployment.

    MLServer has a convenient function that lets us create docker images with our services. Let's use it.

    We can check that our image was successfully built not only by looking at the logs of the previous command but also with the docker images command.

    Let's test that our image works as intended with the following command. Make sure you have closed your previous server by using CTRL + C in your terminal.

    Now that you have a packaged and fully-functioning microservice with our model, we could deploy our container to a production serving platform like Seldon Core, or via different offerings available through the many cloud providers out there (e.g. AWS Lambda, Google Cloud Run, etc.). You could also run this image on KServe, a Kubernetes native tool for model serving, or anywhere else where you can bring your docker image with you.

    To learn more about MLServer and the different ways in which you can use it, head over to the examples section or the user guide. To learn about some of the deployment options available, head over to the docs.

    To keep up to date with what we are up to at Seldon, make sure you join our Slack community.

    • datatype: the different data types expected by the server, e.g., str, numpy array, pandas dataframe, bytes, etc.

    • parameters: allows us to specify the content_type beyond the data types.

    • data: the inputs to our predict function.

    pip install mlserver spacy wikipedia-api
    python -m spacy download en_core_web_lg
    mkdir -p similarity_model
    import spacy
    nlp = spacy.load("en_core_web_lg")
    import wikipediaapi
    wiki_wiki = wikipediaapi.Wikipedia('MyMovieEval ([email protected])', 'en')
    barbie = wiki_wiki.page('Barbie_(film)').summary
    oppenheimer = wiki_wiki.page('Oppenheimer_(film)').summary
    
    print(barbie)
    print()
    print(oppenheimer)
    Barbie is a 2023 American fantasy comedy film directed by Greta Gerwig and written by Gerwig and Noah Baumbach. Based on the Barbie fashion dolls by Mattel, it is the first live-action Barbie film after numerous computer-animated direct-to-video and streaming television films. The film stars Margot Robbie as Barbie and Ryan Gosling as Ken, and follows the two on a journey of self-discovery following an existential crisis. The film also features an ensemble cast that includes America Ferrera, Kate McKinnon, Issa Rae, Rhea Perlman, and Will Ferrell...
    
    Oppenheimer is a 2023 biographical thriller film written and directed by Christopher Nolan. Based on the 2005 biography American Prometheus by Kai Bird and Martin J. Sherwin, the film chronicles the life of J. Robert Oppenheimer, a theoretical physicist who was pivotal in developing the first nuclear weapons as part of the Manhattan Project, and thereby ushering in the Atomic Age. Cillian Murphy stars as Oppenheimer, with Emily Blunt as Oppenheimer's wife Katherine "Kitty" Oppenheimer; Matt Damon as General Leslie Groves, director of the Manhattan Project; and Robert Downey Jr. as Lewis Strauss, a senior member of the United States Atomic Energy Commission. The ensemble supporting cast includes Florence Pugh, Josh Hartnett, Casey Affleck, Rami Malek, Gary Oldman and Kenneth Branagh...
    doc1 = nlp(barbie)
    doc2 = nlp(oppenheimer)
    doc1.similarity(doc2)
    0.9866910567224084
    # similarity_model/my_model.py
    
    from mlserver.codecs import decode_args
    from mlserver import MLModel
    from typing import List
    import numpy as np
    import spacy
    
    class MyKulModel(MLModel):
    
        async def load(self):
            self.model = spacy.load("en_core_web_lg")
    
        @decode_args
        async def predict(self, docs: List[str]) -> np.ndarray:
    
            doc1 = self.model(docs[0])
            doc2 = self.model(docs[1])
    
            return np.array(doc1.similarity(doc2))
    # similarity_model/model-settings.json
    
    {
        "name": "doc-sim-model",
        "implementation": "my_model.MyKulModel"
    }
    mlserver start similarity_model/
    from mlserver.codecs import StringCodec
    import requests
    inference_request = {
        "inputs": [
            StringCodec.encode_input(name='docs', payload=[barbie, oppenheimer], use_bytes=False).model_dump()
        ]
    }
    print(inference_request)
    {'inputs': [{'name': 'docs',
       'shape': [2, 1],
       'datatype': 'BYTES',
       'parameters': {'content_type': 'str'},
       'data': [
            'Barbie is a 2023 American fantasy comedy...',
            'Oppenheimer is a 2023 biographical thriller...'
            ]
        }]
    }
    r = requests.post('http://0.0.0.0:8080/v2/models/doc-sim-model/infer', json=inference_request)
    r.json()
    {'model_name': 'doc-sim-model',
        'id': 'a4665ddb-1868-4523-bd00-a25902d9b124',
        'parameters': {},
        'outputs': [{'name': 'output-0',
        'shape': [1],
        'datatype': 'FP64',
        'parameters': {'content_type': 'np'},
        'data': [0.9866910567224084]}]}
    print(f"Our movies are {round(r.json()['outputs'][0]['data'][0] * 100, 4)}% similar!")
    Our movies are 98.6691% similar!
    # similarity_model/settings.json
    
    {
        "parallel_workers": 3
    }
    mlserver start similarity_model
    deep_impact    = wiki_wiki.page('Deep_Impact_(film)').summary
    armageddon     = wiki_wiki.page('Armageddon_(1998_film)').summary
    
    antz           = wiki_wiki.page('Antz').summary
    a_bugs_life    = wiki_wiki.page("A_Bug's_Life").summary
    
    the_dark_night = wiki_wiki.page('The_Dark_Knight').summary
    mamma_mia      = wiki_wiki.page('Mamma_Mia!_(film)').summary
    def get_sim_score(movie1, movie2):
        response = requests.post(
            'http://0.0.0.0:8080/v2/models/doc-sim-model/infer',
            json={
                "inputs": [
                    StringCodec.encode_input(name='docs', payload=[movie1, movie2], use_bytes=False).model_dump()
                ]
            })
        return response.json()['outputs'][0]['data'][0]
    get_sim_score(deep_impact, armageddon)
    0.9569279450151813
    results = list(
        map(get_sim_score, (deep_impact, antz, the_dark_night), (armageddon, a_bugs_life, mamma_mia))
    )
    results
    [0.9569279450151813, 0.9725374771538605, 0.9626173937217876]
    for movie1, movie2 in zip((deep_impact, antz, the_dark_night), (armageddon, a_bugs_life, mamma_mia)):
        print(get_sim_score(movie1, movie2))
    0.9569279450151813
    0.9725374771538605
    0.9626173937217876
    # similarity_model/requirements.txt
    
    mlserver
    spacy==3.6.0
    https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.6.0/en_core_web_lg-3.6.0-py3-none-any.whl
    mlserver build similarity_model/ -t 'fancy_ml_service'
    docker images
    docker run -it --rm -p 8080:8080 fancy_ml_service

    02 Set Up

    03 Building a Service

    04 Testing our Service

    05 Creating Model Replicas

    06 Packaging our Service


    Types

    Datatype

    An enumeration.

    InferenceErrorResponse

    Field
    Type
    Default
    Description

    error

    Optional[str]

    None

    -

    JSON Schema

    InferenceRequest

    Field
    Type
    Default
    Description

    An enumeration.

    Field
    Type
    Default
    Description

    -

    outputs

    Optional[List[RequestOutput]]

    None

    -

    parameters

    Optional[Parameters]

    None

    -

    -

    model_version

    Optional[str]

    None

    -

    outputs

    List[ResponseOutput]

    -

    -

    parameters

    Optional[Parameters]

    None

    -

    -

    outputs

    Optional[List[MetadataTensor]]

    None

    -

    parameters

    Optional[Parameters]

    None

    -

    platform

    str

    -

    -

    versions

    Optional[List[str]]

    None

    -

    -

    version

    str

    -

    -

    -

    parameters

    Optional[Parameters]

    None

    -

    shape

    List[int]

    -

    -

    -

    -

    state

    State

    -

    -

    version

    Optional[str]

    None

    -

    -

    name

    str

    -

    -

    parameters

    Optional[Parameters]

    None

    -

    shape

    List[int]

    -

    -

    -

    -

    name

    str

    -

    -

    parameters

    Optional[Parameters]

    None

    -

    shape

    List[int]

    -

    -

    id

    Optional[str]

    None

    -

    inputs

    List[RequestInput]

    id

    Optional[str]

    None

    -

    model_name

    str

    error

    str

    -

    -

    inputs

    Optional[List[MetadataTensor]]

    None

    -

    name

    str

    error

    str

    -

    -

    extensions

    List[str]

    -

    -

    name

    str

    datatype

    Datatype

    -

    -

    name

    str

    content_type

    Optional[str]

    None

    -

    headers

    Optional[Dict[str, Any]]

    ready

    Optional[bool]

    None

    -

    root

    List[RepositoryIndexResponseItem]

    -

    -

    name

    str

    -

    -

    reason

    str

    error

    Optional[str]

    None

    -

    error

    Optional[str]

    None

    -

    data

    TensorData

    -

    -

    datatype

    Datatype

    name

    str

    -

    -

    parameters

    Optional[Parameters]

    data

    TensorData

    -

    -

    datatype

    Datatype

    root

    Union[List[Any], Any]

    -

    -

    JSON Schema
    
    {
      "$defs": {
    

    InferenceResponse

    JSON Schema
    
    {
      "$defs": {
        "Datatype": {
          "enum": [
            "BOOL",
            "UINT8",
            "UINT16",
            "UINT32",
            "UINT64",
            "INT8",
            "INT16",
            "INT32",
            "INT64",
            "FP16",
            "FP32",
            "FP64",
            "BYTES"
          ],
          "title": "Datatype",
          "type": "string"
        },
        "Parameters": {
          "additionalProperties": true,
          "properties": {
            "content_type": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Content Type"
            },
            "headers": {
              "anyOf": [
                {
                  "type": "object"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Headers"
            }
          },
          "title": "Parameters",
          "type": "object"
        },
        "ResponseOutput": {
          "properties": {
            "name": {
              "title": "Name",
              "type": "string"
            },
            "shape": {
              "items": {
                "type": "integer"
              },
              "title": "Shape",
              "type": "array"
            },
            "datatype": {
              "$ref": "#/$defs/Datatype"
            },
            "parameters": {
              "anyOf": [
                {
                  "$ref": "#/$defs/Parameters"
                },
                {
                  "type": "null"
                }
              ],
              "default": null
            },
            "data": {
              "$ref": "#/$defs/TensorData"
            }
          },
          "required": [
            "name",
            "shape",
            "datatype",
            "data"
          ],
          "title": "ResponseOutput",
          "type": "object"
        },
        "TensorData": {
          "anyOf": [
            {
              "items": {},
              "type": "array"
            },
            {}
          ],
          "title": "TensorData"
        }
      },
      "properties": {
        "model_name": {
          "title": "Model Name",
          "type": "string"
        },
        "model_version": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Model Version"
        },
        "id": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Id"
        },
        "parameters": {
          "anyOf": [
            {
              "$ref": "#/$defs/Parameters"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        },
        "outputs": {
          "items": {
            "$ref": "#/$defs/ResponseOutput"
          },
          "title": "Outputs",
          "type": "array"
        }
      },
      "required": [
        "model_name",
        "outputs"
      ],
      "title": "InferenceResponse",
      "type": "object"
    }
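
    As a small illustration of how the "required" lists in this schema can be used, the sketch below checks a response dictionary for the required top-level fields (model_name, outputs), the required fields of each output (name, shape, datatype, data), and the Datatype enumeration. It is a hand-rolled check for illustration, not a full JSON Schema validator, and validate_inference_response is a hypothetical helper name.

```python
from typing import Any, Dict, List

# Values of the Datatype enumeration listed in the schema above.
DATATYPES = {
    "BOOL", "UINT8", "UINT16", "UINT32", "UINT64",
    "INT8", "INT16", "INT32", "INT64",
    "FP16", "FP32", "FP64", "BYTES",
}


def validate_inference_response(response: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for key in ("model_name", "outputs"):
        if key not in response:
            problems.append(f"missing top-level field: {key}")
    for i, output in enumerate(response.get("outputs", [])):
        for key in ("name", "shape", "datatype", "data"):
            if key not in output:
                problems.append(f"outputs[{i}] missing field: {key}")
        if "datatype" in output and output["datatype"] not in DATATYPES:
            problems.append(f"outputs[{i}] has unknown datatype: {output['datatype']}")
    return problems


# Response taken from the Getting Started guide.
example = {
    "model_name": "doc-sim-model",
    "id": "a4665ddb-1868-4523-bd00-a25902d9b124",
    "parameters": {},
    "outputs": [
        {
            "name": "output-0",
            "shape": [1],
            "datatype": "FP64",
            "parameters": {"content_type": "np"},
            "data": [0.9866910567224084],
        }
    ],
}
problems = validate_inference_response(example)
print(problems)
```

    For anything beyond a quick sanity check, validating against the published schema with a dedicated JSON Schema library is likely the more robust choice.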
    

    MetadataModelErrorResponse

    JSON Schema
    
    {
      "properties": {
        "error": {
          "title": "Error",
          "type": "string"
        }
      },
      "required": [
        "error"
      ],
      "title": "MetadataModelErrorResponse",
      "type": "object"
    }
    

    MetadataModelResponse

    JSON Schema
    
    {
      "$defs": {
        "Datatype": {
          "enum": [
            "BOOL",
            "UINT8",
            "UINT16",
            "UINT32",
            "UINT64",
            "INT8",
            "INT16",
            "INT32",
            "INT64",
            "FP16",
            "FP32",
            "FP64",
            "BYTES"
          ],
          "title": "Datatype",
          "type": "string"
        },
        "MetadataTensor": {
          "properties": {
            "name": {
              "title": "Name",
              "type": "string"
            },
            "datatype": {
              "$ref": "#/$defs/Datatype"
            },
            "shape": {
              "items": {
                "type": "integer"
              },
              "title": "Shape",
              "type": "array"
            },
            "parameters": {
              "anyOf": [
                {
                  "$ref": "#/$defs/Parameters"
                },
                {
                  "type": "null"
                }
              ],
              "default": null
            }
          },
          "required": [
            "name",
            "datatype",
            "shape"
          ],
          "title": "MetadataTensor",
          "type": "object"
        },
        "Parameters": {
          "additionalProperties": true,
          "properties": {
            "content_type": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Content Type"
            },
            "headers": {
              "anyOf": [
                {
                  "type": "object"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Headers"
            }
          },
          "title": "Parameters",
          "type": "object"
        }
      },
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "versions": {
          "anyOf": [
            {
              "items": {
                "type": "string"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Versions"
        },
        "platform": {
          "title": "Platform",
          "type": "string"
        },
        "inputs": {
          "anyOf": [
            {
              "items": {
                "$ref": "#/$defs/MetadataTensor"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Inputs"
        },
        "outputs": {
          "anyOf": [
            {
              "items": {
                "$ref": "#/$defs/MetadataTensor"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Outputs"
        },
        "parameters": {
          "anyOf": [
            {
              "$ref": "#/$defs/Parameters"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        }
      },
      "required": [
        "name",
        "platform"
      ],
      "title": "MetadataModelResponse",
      "type": "object"
    }
    

    MetadataServerErrorResponse

    JSON Schema
    
    {
      "properties": {
        "error": {
          "title": "Error",
          "type": "string"
        }
      },
      "required": [
        "error"
      ],
      "title": "MetadataServerErrorResponse",
      "type": "object"
    }
    

    MetadataServerResponse

    JSON Schema
    
    {
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "version": {
          "title": "Version",
          "type": "string"
        },
        "extensions": {
          "items": {
            "type": "string"
          },
          "title": "Extensions",
          "type": "array"
        }
      },
      "required": [
        "name",
        "version",
        "extensions"
      ],
      "title": "MetadataServerResponse",
      "type": "object"
    }
    

    MetadataTensor

    JSON Schema
    
    {
      "$defs": {
        "Datatype": {
          "enum": [
            "BOOL",
            "UINT8",
            "UINT16",
            "UINT32",
            "UINT64",
            "INT8",
            "INT16",
            "INT32",
            "INT64",
            "FP16",
            "FP32",
            "FP64",
            "BYTES"
          ],
          "title": "Datatype",
          "type": "string"
        },
        "Parameters": {
          "additionalProperties": true,
          "properties": {
            "content_type": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Content Type"
            },
            "headers": {
              "anyOf": [
                {
                  "type": "object"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Headers"
            }
          },
          "title": "Parameters",
          "type": "object"
        }
      },
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "datatype": {
          "$ref": "#/$defs/Datatype"
        },
        "shape": {
          "items": {
            "type": "integer"
          },
          "title": "Shape",
          "type": "array"
        },
        "parameters": {
          "anyOf": [
            {
              "$ref": "#/$defs/Parameters"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        }
      },
      "required": [
        "name",
        "datatype",
        "shape"
      ],
      "title": "MetadataTensor",
      "type": "object"
    }
    

    Parameters

    JSON Schema
    
    {
      "additionalProperties": true,
      "properties": {
        "content_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Content Type"
        },
        "headers": {
          "anyOf": [
            {
              "type": "object"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Headers"
        }
      },
      "title": "Parameters",
      "type": "object"
    }
    

    RepositoryIndexRequest

    JSON Schema
    
    {
      "properties": {
        "ready": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Ready"
        }
      },
      "title": "RepositoryIndexRequest",
      "type": "object"
    }
    

    RepositoryIndexResponse

    JSON Schema
    
    {
      "$defs": {
        "RepositoryIndexResponseItem": {
          "properties": {
            "name": {
              "title": "Name",
              "type": "string"
            },
            "version": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Version"
            },
            "state": {
              "$ref": "#/$defs/State"
            },
            "reason": {
              "title": "Reason",
              "type": "string"
            }
          },
          "required": [
            "name",
            "state",
            "reason"
          ],
          "title": "RepositoryIndexResponseItem",
          "type": "object"
        },
        "State": {
          "enum": [
            "UNKNOWN",
            "READY",
            "UNAVAILABLE",
            "LOADING",
            "UNLOADING"
          ],
          "title": "State",
          "type": "string"
        }
      },
      "items": {
        "$ref": "#/$defs/RepositoryIndexResponseItem"
      },
      "title": "RepositoryIndexResponse",
      "type": "array"
    }
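As a minimal sketch of how a client might consume this array, the snippet below parses a hand-written example body and keeps only the items whose `state` is `READY`. The model names, versions, and the `ready_models` helper are hypothetical; only the field names and the `State` values are taken from the schema above.

```python
import json

# Hand-written example body; the model names and versions are made up.
body = """
[
  {"name": "iris-sklearn", "version": "v1", "state": "READY", "reason": ""},
  {"name": "mnist-xgboost", "version": null, "state": "UNAVAILABLE", "reason": "unloaded"}
]
"""

# Required fields and allowed states, copied from the schema above.
REQUIRED = ("name", "state", "reason")
STATES = {"UNKNOWN", "READY", "UNAVAILABLE", "LOADING", "UNLOADING"}


def ready_models(raw: str) -> list[str]:
    """Return the names of items whose state is READY (hypothetical helper)."""
    names = []
    for item in json.loads(raw):
        missing = [key for key in REQUIRED if key not in item]
        if missing:
            raise ValueError(f"missing required fields: {missing}")
        if item["state"] not in STATES:
            raise ValueError(f"unknown state: {item['state']}")
        if item["state"] == "READY":
            names.append(item["name"])
    return names


print(ready_models(body))
```

Note that `version` may be `null`, so client code should not assume it is always a string.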
    

    RepositoryIndexResponseItem

    JSON Schema
    
    {
      "$defs": {
        "State": {
          "enum": [
            "UNKNOWN",
            "READY",
            "UNAVAILABLE",
            "LOADING",
            "UNLOADING"
          ],
          "title": "State",
          "type": "string"
        }
      },
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "version": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Version"
        },
        "state": {
          "$ref": "#/$defs/State"
        },
        "reason": {
          "title": "Reason",
          "type": "string"
        }
      },
      "required": [
        "name",
        "state",
        "reason"
      ],
      "title": "RepositoryIndexResponseItem",
      "type": "object"
    }
    

    RepositoryLoadErrorResponse

    JSON Schema
    
    {
      "properties": {
        "error": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Error"
        }
      },
      "title": "RepositoryLoadErrorResponse",
      "type": "object"
    }
    

    RepositoryUnloadErrorResponse

    JSON Schema
    
    {
      "properties": {
        "error": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Error"
        }
      },
      "title": "RepositoryUnloadErrorResponse",
      "type": "object"
    }
    

    RequestInput

    JSON Schema
    
    {
      "$defs": {
        "Datatype": {
          "enum": [
            "BOOL",
            "UINT8",
            "UINT16",
            "UINT32",
            "UINT64",
            "INT8",
            "INT16",
            "INT32",
            "INT64",
            "FP16",
            "FP32",
            "FP64",
            "BYTES"
          ],
          "title": "Datatype",
          "type": "string"
        },
        "Parameters": {
          "additionalProperties": true,
          "properties": {
            "content_type": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Content Type"
            },
            "headers": {
              "anyOf": [
                {
                  "type": "object"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Headers"
            }
          },
          "title": "Parameters",
          "type": "object"
        },
        "TensorData": {
          "anyOf": [
            {
              "items": {},
              "type": "array"
            },
            {}
          ],
          "title": "TensorData"
        }
      },
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "shape": {
          "items": {
            "type": "integer"
          },
          "title": "Shape",
          "type": "array"
        },
        "datatype": {
          "$ref": "#/$defs/Datatype"
        },
        "parameters": {
          "anyOf": [
            {
              "$ref": "#/$defs/Parameters"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        },
        "data": {
          "$ref": "#/$defs/TensorData"
        }
      },
      "required": [
        "name",
        "shape",
        "datatype",
        "data"
      ],
      "title": "RequestInput",
      "type": "object"
    }
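To make the required fields concrete, here is a small sketch that builds a `RequestInput`-shaped dictionary using only the standard library. The `make_request_input` helper, the tensor name, and the `content_type` value are illustrative assumptions. The length check is also an assumption of this sketch (flat numeric data); the schema itself places no constraint on `data`.

```python
import json
from math import prod

# Datatype values copied from the schema above.
DATATYPES = {
    "BOOL", "UINT8", "UINT16", "UINT32", "UINT64",
    "INT8", "INT16", "INT32", "INT64",
    "FP16", "FP32", "FP64", "BYTES",
}


def make_request_input(name, shape, datatype, data, parameters=None):
    """Build a RequestInput-shaped dict (hypothetical helper)."""
    if datatype not in DATATYPES:
        raise ValueError(f"unsupported datatype: {datatype}")
    # Assumption of this sketch, not a schema rule: numeric data is passed
    # as a flat list whose length equals the product of the shape.
    if datatype != "BYTES" and len(data) != prod(shape):
        raise ValueError(f"expected {prod(shape)} elements, got {len(data)}")
    payload = {
        "name": name,
        "shape": list(shape),
        "datatype": datatype,
        "data": list(data),
    }
    # "parameters" is optional, so it is only emitted when provided.
    if parameters is not None:
        payload["parameters"] = parameters
    return payload


request_input = make_request_input(
    "input-0", [2, 2], "FP32", [1.0, 2.0, 3.0, 4.0],
    parameters={"content_type": "np"},
)
print(json.dumps(request_input, indent=2))
```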
    

    RequestOutput

    JSON Schema
    
    {
      "$defs": {
        "Parameters": {
          "additionalProperties": true,
          "properties": {
            "content_type": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Content Type"
            },
            "headers": {
              "anyOf": [
                {
                  "type": "object"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Headers"
            }
          },
          "title": "Parameters",
          "type": "object"
        }
      },
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "parameters": {
          "anyOf": [
            {
              "$ref": "#/$defs/Parameters"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        }
      },
      "required": [
        "name"
      ],
      "title": "RequestOutput",
      "type": "object"
    }
    

    ResponseOutput

    JSON Schema
    
    {
      "$defs": {
        "Datatype": {
          "enum": [
            "BOOL",
            "UINT8",
            "UINT16",
            "UINT32",
            "UINT64",
            "INT8",
            "INT16",
            "INT32",
            "INT64",
            "FP16",
            "FP32",
            "FP64",
            "BYTES"
          ],
          "title": "Datatype",
          "type": "string"
        },
        "Parameters": {
          "additionalProperties": true,
          "properties": {
            "content_type": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Content Type"
            },
            "headers": {
              "anyOf": [
                {
                  "type": "object"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Headers"
            }
          },
          "title": "Parameters",
          "type": "object"
        },
        "TensorData": {
          "anyOf": [
            {
              "items": {},
              "type": "array"
            },
            {}
          ],
          "title": "TensorData"
        }
      },
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "shape": {
          "items": {
            "type": "integer"
          },
          "title": "Shape",
          "type": "array"
        },
        "datatype": {
          "$ref": "#/$defs/Datatype"
        },
        "parameters": {
          "anyOf": [
            {
              "$ref": "#/$defs/Parameters"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        },
        "data": {
          "$ref": "#/$defs/TensorData"
        }
      },
      "required": [
        "name",
        "shape",
        "datatype",
        "data"
      ],
      "title": "ResponseOutput",
      "type": "object"
    }
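When reading a `ResponseOutput`, clients often want the tensor back in its nested form. The sketch below assumes `data` arrives as a flat, row-major list, which the schema permits but does not require; the `unflatten` helper and the example output are hypothetical.

```python
from math import prod


def unflatten(data, shape):
    """Rebuild nested lists from flat row-major data (hypothetical helper)."""
    if prod(shape) != len(data):
        raise ValueError(f"shape {shape} does not match {len(data)} elements")
    if len(shape) <= 1:
        return list(data)
    if shape[0] == 0:
        return []
    # Each slice along the first axis holds this many elements.
    step = len(data) // shape[0]
    return [
        unflatten(data[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


# Made-up output tensor with the four required fields.
response_output = {
    "name": "predict",
    "shape": [2, 3],
    "datatype": "INT64",
    "data": [1, 2, 3, 4, 5, 6],
}

nested = unflatten(response_output["data"], response_output["shape"])
print(nested)
```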
    

    State

    TensorData

    JSON Schema
    
    {
      "anyOf": [
        {
          "items": {},
          "type": "array"
        },
        {}
      ],
      "title": "TensorData"
    }
    
    
    InferenceErrorResponse

    JSON Schema
    
    {
      "properties": {
        "error": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Error"
        }
      },
      "title": "InferenceErrorResponse",
      "type": "object"
    }
    

    InferenceRequest

    JSON Schema
    
    {
      "$defs": {
        "Datatype": {
          "enum": [
            "BOOL",
            "UINT8",
            "UINT16",
            "UINT32",
            "UINT64",
            "INT8",
            "INT16",
            "INT32",
            "INT64",
            "FP16",
            "FP32",
            "FP64",
            "BYTES"
          ],
          "title": "Datatype",
          "type": "string"
        },
        "Parameters": {
          "additionalProperties": true,
          "properties": {
            "content_type": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Content Type"
            },
            "headers": {
              "anyOf": [
                {
                  "type": "object"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Headers"
            }
          },
          "title": "Parameters",
          "type": "object"
        },
        "RequestInput": {
          "properties": {
            "name": {
              "title": "Name",
              "type": "string"
            },
            "shape": {
              "items": {
                "type": "integer"
              },
              "title": "Shape",
              "type": "array"
            },
            "datatype": {
              "$ref": "#/$defs/Datatype"
            },
            "parameters": {
              "anyOf": [
                {
                  "$ref": "#/$defs/Parameters"
                },
                {
                  "type": "null"
                }
              ],
              "default": null
            },
            "data": {
              "$ref": "#/$defs/TensorData"
            }
          },
          "required": [
            "name",
            "shape",
            "datatype",
            "data"
          ],
          "title": "RequestInput",
          "type": "object"
        },
        "RequestOutput": {
          "properties": {
            "name": {
              "title": "Name",
              "type": "string"
            },
            "parameters": {
              "anyOf": [
                {
                  "$ref": "#/$defs/Parameters"
                },
                {
                  "type": "null"
                }
              ],
              "default": null
            }
          },
          "required": [
            "name"
          ],
          "title": "RequestOutput",
          "type": "object"
        },
        "TensorData": {
          "anyOf": [
            {
              "items": {},
              "type": "array"
            },
            {}
          ],
          "title": "TensorData"
        }
      },
      "properties": {
        "id": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Id"
        },
        "parameters": {
          "anyOf": [
            {
              "$ref": "#/$defs/Parameters"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        },
        "inputs": {
          "items": {
            "$ref": "#/$defs/RequestInput"
          },
          "title": "Inputs",
          "type": "array"
        },
        "outputs": {
          "anyOf": [
            {
              "items": {
                "$ref": "#/$defs/RequestOutput"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Outputs"
        }
      },
      "required": [
        "inputs"
      ],
      "title": "InferenceRequest",
      "type": "object"
    }
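For orientation, the following sketch shows what a payload matching the `InferenceRequest` schema might look like, together with a small check of the required fields. The request id, tensor names, values, and the `missing_fields` helper are hypothetical; only `inputs` is required at the top level, and each input needs `name`, `shape`, `datatype`, and `data`.

```python
import json

# Hypothetical payload: the id, tensor names and values are made up.
inference_request = {
    "id": "request-0",
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [5.1, 3.5, 1.4, 0.2],
        }
    ],
    "outputs": [{"name": "predict"}],
}


def missing_fields(payload):
    """List required fields absent from an InferenceRequest-shaped dict."""
    if "inputs" not in payload:
        return ["inputs"]
    missing = []
    for index, tensor in enumerate(payload["inputs"]):
        for key in ("name", "shape", "datatype", "data"):
            if key not in tensor:
                missing.append(f"inputs[{index}].{key}")
    # "outputs" is optional and may be null, hence the `or []`.
    for index, output in enumerate(payload.get("outputs") or []):
        if "name" not in output:
            missing.append(f"outputs[{index}].name")
    return missing


print(missing_fields(inference_request))
print(json.dumps(inference_request))
```

This only checks for the presence of required keys; it does not validate types or enum values, for which a JSON Schema validator would be a better fit.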