
MLServer


User Guide

Custom

There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.

To learn more about how you can write custom runtimes with MLServer, check out the Custom Runtimes user guide. Alternatively, you can also see this end-to-end example which walks through the process of writing a custom runtime.
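As a quick orientation, a custom runtime is just a Python class that subclasses MLModel and overrides load() and predict(). The sketch below is illustrative only (the "model" is a placeholder that sums each row, and the artifact loading is left as a comment), not a drop-in implementation:

```python
# Illustrative sketch of a custom runtime; replace load() and the call inside
# predict() with your own framework-specific logic.
from mlserver import MLModel
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Load your model artifact here (e.g. from self.settings.parameters.uri).
        self._model = lambda data: data.sum(axis=1, keepdims=True)
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the first input into a NumPy array, run inference, encode back.
        decoded = self.decode(payload.inputs[0], default_codec=NumpyCodec)
        result = self._model(decoded)
        return InferenceResponse(
            model_name=self.name,
            outputs=[NumpyCodec.encode_output(name="output-0", payload=result)],
        )
```

To serve a class like this, you would point the implementation field of your model-settings.json at it (e.g. my_package.MyCustomRuntime, assuming that is where the class lives).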

Metrics

MetricsServer

Methods

on_worker_stop(worker: Worker) -> None

start()

stop(sig: Optional[int] = None)

configure_metrics(settings: Settings)

No description available.

log(metrics)

Logs a new set of metric values. Each kwarg of this method will be treated as a separate metric / value pair. If any of the metrics does not exist, a new one will be created with a default description.

register(name: str, description: str) -> Histogram

Registers a new metric with its description. If the metric already exists, it will just return the existing one.

MLFlow

This package provides an MLServer runtime compatible with MLflow models.

Usage

You can install the runtime, alongside mlserver, as:

pip install mlserver mlserver-mlflow

Content Types

The MLflow inference runtime introduces a new dict content type, which decodes an incoming V2 request as a dictionary of tensors. This is useful for certain MLflow-serialised models, which will expect that the model inputs are serialised in this format.

The `dict` content type can be stacked with other content types, like `np`. This allows the user to use a different set of content types to decode each of the dict entries.
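For illustration, a request using the dict content type might look like the sketch below, assuming a model served locally under the hypothetical name my-mlflow-model; each named input becomes one entry of the decoded dictionary:

```python
# Sketch: each named input becomes an entry of the decoded dictionary of tensors.
# The model name and port are assumptions for a locally running MLServer.
import requests

inference_request = {
    "parameters": {"content_type": "dict"},
    "inputs": [
        {
            "name": "a",
            "datatype": "FP32",
            "shape": [1, 3],
            "data": [1.0, 2.0, 3.0],
            "parameters": {"content_type": "np"},
        },
        {
            "name": "b",
            "datatype": "FP32",
            "shape": [1, 3],
            "data": [4.0, 5.0, 6.0],
            "parameters": {"content_type": "np"},
        },
    ],
}

response = requests.post(
    "http://localhost:8080/v2/models/my-mlflow-model/infer",
    json=inference_request,
)
print(response.json())
```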

OpenAPI Support

MLServer follows the Open Inference Protocol (previously known as the "V2 Protocol"). You can find the full OpenAPI spec for the Open Inference Protocol in the links below:

| Name | Description | OpenAPI Spec |
| --- | --- | --- |
| Open Inference Protocol | Main dataplane for inference, health and metadata | dataplane.json |
| Model Repository Extension | Extension to the protocol to provide a control plane which lets you load / unload models dynamically | model_repository.json |

Deployment

MLServer is currently used as the core Python inference server in some of the most popular Kubernetes-native serving frameworks, including Seldon Core and KServe (formerly known as KFServing). This allows MLServer users to leverage the usability and maturity of these frameworks to take their model deployments to the next level of their MLOps journey, ensuring that they are served in a robust and scalable infrastructure.

In general, it should be possible to deploy models using MLServer into any serving engine compatible with the V2 protocol. Alternatively, it's also possible to manage MLServer deployments manually as regular processes (i.e. in a non-Kubernetes-native way). However, this may be more involved and highly dependent on the deployment infrastructure.

Streaming

Out of the box, MLServer includes support for streaming data to your models. Streaming support is available for both the REST and gRPC servers.

REST Server

Streaming support for the REST server is limited only to server streaming. This means that the client sends a single request to the server, and the server responds with a stream of data.

The streaming endpoints are available for both the infer and generate methods through the following endpoints:

Python API

MLServer exposes a Python framework to build custom inference runtimes, define request/response types, plug codecs for payload conversion, and emit metrics. This page provides a high-level overview and links to the API docs.

• Base class to implement custom inference runtimes.

• Core lifecycle: load(), predict(), unload().

Examples

To see MLServer in action you can check out the examples below. These are end-to-end notebooks, showing how to serve models with MLServer.

Inference Runtimes

If you are interested in how MLServer interacts with particular model frameworks, you can check the following examples. These focus on showcasing the different inference runtimes that ship with MLServer out of the box. Note that, for advanced use cases, you can also write your own custom inference runtime (see the example below on custom models).

HuggingFace

This package provides an MLServer runtime compatible with HuggingFace Transformers.

Usage

You can install the runtime, alongside mlserver, as:

pip install mlserver mlserver-huggingface

For further information on how to use MLServer with HuggingFace, you can check out this worked out example.

Catboost

This package provides an MLServer runtime compatible with CatBoost's CatboostClassifier.

Usage

You can install the runtime, alongside mlserver, as:

pip install mlserver mlserver-catboost

For further information on how to use MLServer with CatBoost, you can check out this worked out example.

Content Types

If no content type is present on the request or metadata, the CatBoost runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.

Alibi-Explain

This package provides an MLServer runtime compatible with Alibi-Explain.

Usage

You can install the runtime, alongside mlserver, as:

pip install mlserver mlserver-alibi-explain

• /v2/models/{model_name}/infer_stream

• /v2/models/{model_name}/versions/{model_version}/infer_stream

• /v2/models/{model_name}/generate_stream

• /v2/models/{model_name}/versions/{model_version}/generate_stream

Note that for REST, the generate and generate_stream endpoints are aliases for the infer and infer_stream endpoints, respectively. Those names are used to better reflect the nature of the operation for Large Language Models (LLMs).

    gRPC Server

    Streaming support for the gRPC server is available for both client and server streaming. This means that the client sends a stream of data to the server, and the server responds with a stream of data.

    The two streams operate independently, so the client and the server can read and write data however they want (e.g., the server could either wait to receive all the client messages before sending a response or it can send a response after each message). Note that bi-directional streaming covers all the possible combinations of client and server streaming: unary-stream, stream-unary, and stream-stream. The unary-unary case can be covered as well by the bi-directional streaming, but mlserver already has the predict method dedicated to this use case. The logic for how the requests are received, and processed, and the responses are sent back should be built into the runtime logic.

    The stub method for streaming to be used by the client is ModelStreamInfer.

    Limitation

    There are three main limitations of the streaming support in MLServer:

    • the parallel_workers setting should be set to 0 to disable distributed workers (to be addressed in future releases)

    • for REST, the gzip_enabled setting should be set to false to disable GZIP compression, as streaming is not compatible with GZIP compression (see issue here)
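For example, a settings.json compatible with streaming could combine both flags, as in this sketch (merge it with the rest of your server-level configuration):

```json
{
  "parallel_workers": 0,
  "gzip_enabled": false
}
```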

    Swagger UI

On top of the OpenAPI spec above, MLServer also autogenerates a Swagger UI which can be used to interact dynamically with the Open Inference Protocol.

    The autogenerated Swagger UI can be accessed under the /v2/docs endpoint.

    Besides the Swagger UI, you can also access the raw OpenAPI spec through the /v2/docs/dataplane.json endpoint.
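For example, you could pull the raw spec from a locally running server and inspect it programmatically (a sketch, assuming the default HTTP port 8080):

```python
# Fetch the autogenerated OpenAPI spec from a locally running MLServer.
import requests

spec = requests.get("http://localhost:8080/v2/docs/dataplane.json").json()
print(spec["info"]["title"], spec["info"]["version"])
```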

    Model Swagger UI

    Alongside the general API documentation, MLServer will also autogenerate a Swagger UI tailored to individual models, showing the endpoints available for each one.

    The model-specific autogenerated Swagger UI can be accessed under the following endpoints:

    • /v2/models/{model_name}/docs

    • /v2/models/{model_name}/versions/{model_version}/docs

    Besides the Swagger UI, you can also access the model-specific raw OpenAPI spec through the following endpoints:

    • /v2/models/{model_name}/docs/dataplane.json

    • /v2/models/{model_name}/versions/{model_version}/docs/dataplane.json

  • Helpers for encoding/decoding requests and responses.

  • Access to model metadata and settings.

  • Extend this class to implement your own model logic.

  • Types

    • Data structures and enums for the V2 inference protocol.

    • Includes Pydantic models like InferenceRequest, InferenceResponse, RequestInput, ResponseOutput.

    • See model fields (type and default) and JSON Schemas in the docs.

  • Codecs

    • Encode/decode payloads between Open Inference Protocol types and Python types.

    • Base classes: InputCodec (inputs/outputs) and RequestCodec (requests/responses).

    • Built-ins include codecs such as NumpyCodec, Base64Codec, StringCodec, etc.

  • Metrics

    • Emit and configure metrics within MLServer.

    • Use log() to record custom metrics; see server lifecycle hooks and utilities.

  • When creating a custom runtime, start by subclassing MLModel, use the structures from Types for requests/responses, pick or implement the appropriate Codecs, and optionally emit Metrics from your model code.
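As a small illustration of the codec layer, the built-in NumpyCodec can round-trip a NumPy array to and from the V2 protocol types (a sketch; the input name is arbitrary):

```python
# Round-trip a NumPy array through the built-in NumpyCodec.
import numpy as np
from mlserver.codecs import NumpyCodec

payload = np.array([[1, 2], [3, 4]])
request_input = NumpyCodec.encode_input(name="my-input", payload=payload)
decoded = NumpyCodec.decode_input(request_input)
assert (decoded == payload).all()
```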


    Spark MlLib

This package provides an MLServer runtime compatible with Spark MLlib.

    Usage

    You can install the runtime, alongside mlserver, as:

    pip install mlserver mlserver-mllib

    For further information on how to use MLServer with Spark MLlib, you can check out the MLServer repository.

    Content Types

    The HuggingFace runtime will always decode the input request using its own built-in codec. Therefore, content type annotations at the request level will be ignored. Note that this doesn't include input-level content type annotations, which will be respected as usual.

    Settings

The HuggingFace runtime exposes a couple of extra parameters which can be used to customise how the runtime behaves. These settings can be added under the parameters.extra section of your model-settings.json file, e.g.

{
  "name": "qa",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "question-answering",
      "optimum_model": true
    }
  }
}

These settings can also be injected through environment variables prefixed with MLSERVER_MODEL_HUGGINGFACE_, e.g.

MLSERVER_MODEL_HUGGINGFACE_TASK="question-answering"
MLSERVER_MODEL_HUGGINGFACE_OPTIMUM_MODEL=true

    Loading models

    Local models

    It is possible to load a local model into a HuggingFace pipeline by specifying the model artefact folder path in parameters.uri in model-settings.json.

    HuggingFace models

Models in the HuggingFace hub can be loaded by specifying their name in parameters.extra.pretrained_model in model-settings.json. If parameters.extra.pretrained_model is specified, it takes precedence over parameters.uri.
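For instance, a model-settings.json along these lines (the model name is purely illustrative) would pull a model straight from the hub:

```json
{
  "name": "text-generator",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "text-generation",
      "pretrained_model": "distilgpt2"
    }
  }
}
```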

    Reference

You can find the full reference of the accepted extra settings for the HuggingFace runtime in the mlserver_huggingface.settings.HuggingFaceSettings class.

• Serving Scikit-Learn models

  • Serving XGBoost models

  • Serving LightGBM models

  • Serving CatBoost models

  • Serving MLflow models

  • Serving custom models

  • Serving Alibi Detect models

  • Serving HuggingFace models

  • MLServer Features

    To see some of the advanced features included in MLServer (e.g. multi-model serving), check out the examples below.

    • Multi-Model Serving with multiple frameworks

    • Loading / unloading models from a model repository

    • Content-Type Decoding

• Custom Conda environment

    • Serving custom models requiring JSON inputs or outputs

    • Serving models through Kafka

    • Streaming inference

    Tutorials

    Tutorials are designed to be beginner-friendly and walk through accomplishing a series of tasks using MLServer (and other tools).

    • Deploying a Custom Tensorflow Model with MLServer and Seldon Core


    LightGBM

This package provides an MLServer runtime compatible with LightGBM.

    Usage

    You can install the runtime, alongside mlserver, as:

    pip install mlserver mlserver-lightgbm

    For further information on how to use MLServer with LightGBM, you can check out this worked out example.

    Content Types

If no content type is present on the request or metadata, the LightGBM runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.

    MLModel

    Abstract inference runtime which exposes the main interface to interact with ML models.

    Methods

decode(request_input: RequestInput, default_codec: Union[type[ForwardRef('InputCodec')], ForwardRef('InputCodec'), None] = None) -> Any

Helper to decode a request input into its corresponding high-level Python object. This method will find the most appropriate input codec based on the model's metadata and the input's content type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.

decode_request(inference_request: InferenceRequest, default_codec: Union[type[ForwardRef('RequestCodec')], ForwardRef('RequestCodec'), None] = None) -> Any

Helper to decode an inference request into its corresponding high-level Python object. This method will find the most appropriate request codec based on the model's metadata and the request's content type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.

encode(payload: Any, request_output: RequestOutput, default_codec: Union[type[ForwardRef('InputCodec')], ForwardRef('InputCodec'), None] = None) -> ResponseOutput

Helper to encode a high-level Python object into its corresponding response output. This method will find the most appropriate input codec based on the model's metadata, the request output's content type or the payload's type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.

encode_response(payload: Any, default_codec: Union[type[ForwardRef('RequestCodec')], ForwardRef('RequestCodec'), None] = None) -> InferenceResponse

Helper to encode a high-level Python object into its corresponding inference response. This method will find the most appropriate request codec based on the payload's type. Otherwise, it will fall back to the codec specified in the default_codec kwarg.

load() -> bool

Method responsible for loading the model from a model artefact. This method will be called on each of the parallel workers (when parallel inference is enabled). Its return value will represent the model's readiness status. A return value of True will mean the model is ready.

This method can be overridden to implement your custom load logic.

metadata() -> MetadataModelResponse

No description available.

predict(payload: InferenceRequest) -> InferenceResponse

Method responsible for running inference on the model.

This method can be overridden to implement your custom inference logic.

predict_stream(payloads: AsyncIterator[InferenceRequest]) -> AsyncIterator[InferenceResponse]

Method responsible for running generation on the model, streaming a set of responses back to the client.

This method can be overridden to implement your custom inference logic.

unload() -> bool

Method responsible for unloading the model, freeing any resources (e.g. CPU memory, GPU memory, etc.). This method will be called on each of the parallel workers (when parallel inference is enabled). A return value of True will mean the model is now unloaded.

This method can be overridden to implement your custom unload logic.

    Parallel Inference

Out of the box, MLServer includes support to offload inference workloads to a pool of workers running in separate processes. This allows MLServer to scale out beyond the limitations of the Python interpreter. To learn more about why this can be beneficial, you can check the concurrency section below.

By default, MLServer will spin up a pool with only one worker process to run inference. All models will be loaded uniformly across the inference pool workers. To read more about advanced settings, please see the usage section below.

    Concurrency in Python

The Global Interpreter Lock (GIL) is a mutex lock that exists in most Python interpreters (e.g. CPython). Its main purpose is to lock Python's execution so that it only runs on a single processor at the same time. This simplifies certain things for the interpreter. However, it also adds the limitation that a single Python process will never be able to leverage multiple cores.

    API Reference

    This page links to the key reference docs for configuring and using MLServer.

    MLServer Settings

    Server-wide configuration (e.g., HTTP/GRPC ports) loaded from a settings.json in the working directory. Settings can also be provided via environment variables prefixed with MLSERVER_ (e.g., MLSERVER_GRPC_PORT).
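For example, the HTTP and gRPC ports could be configured either through a settings.json like the sketch below, or through the equivalent environment variables (field names here are assumed to follow the MLSERVER_ prefix convention; check the full settings reference for the complete list):

```json
{
  "http_port": 8080,
  "grpc_port": 8081
}
```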

    Alibi-Detect

This package provides an MLServer runtime compatible with alibi-detect models.

Usage

You can install the mlserver-alibi-detect runtime, alongside mlserver, as:

pip install mlserver mlserver-alibi-detect

For further information on how to use MLServer with Alibi-Detect, you can check out this worked out example.

    Model Repository API

MLServer supports loading and unloading models dynamically from a models repository. This allows you to enable and disable the models accessible by MLServer on demand. This extension builds on top of the support for Multi-Model Serving, letting you change at runtime which models MLServer is currently serving.

The API to manage the model repository is modelled after Triton's Model Repository extension to the V2 Dataplane and is thus fully compatible with it.

    This notebook will walk you through an example using the Model Repository API.

    Training

First of all, we will need to train some models. For that, we will re-use the models we trained previously in the Multi-Model Serving example. You can check the details on how they are trained following that notebook.

    SKLearn

This package provides an MLServer runtime compatible with Scikit-Learn.

Usage

You can install the runtime, alongside mlserver, as:

pip install mlserver mlserver-sklearn

For further information on how to use MLServer with Scikit-Learn, you can check out this worked out example.


    When we think about MLServer's support for Multi-Model Serving (MMS), this could lead to scenarios where a heavily-used model starves the other models running within the same MLServer instance. Similarly, even if we don’t take MMS into account, the GIL also makes it harder to scale inference for a single model.

    To work around this limitation, MLServer offloads the model inference to a pool of workers, where each worker is a separate Python process (and thus has its own separate GIL). This means that we can get full access to the underlying hardware.

    Overhead

Managing the Inter-Process Communication (IPC) between the main MLServer process and the inference pool workers brings in some overhead. Under the hood, MLServer uses the multiprocessing library to implement the distributed processing management, which has been shown to offer the smallest possible overhead when implementing these types of distributed strategies (Zhi et al., 2020).

The extra overhead introduced by other libraries is usually brought in as a trade-off in exchange for other advanced features for complex distributed processing scenarios. However, MLServer's use case is simple enough to not require any of these.

Despite this, even though the overhead is minimised, it can still be particularly noticeable for lightweight inference methods, where the extra IPC overhead can take a large percentage of the overall time. In these cases (which can only be assessed on a model-by-model basis), the user has the option to disable the parallel inference feature.

    For regular models where inference can take a bit more time, this overhead is usually offset by the benefit of having multiple cores to compute inference on.

    Usage

    By default, MLServer will always create an inference pool with one single worker. The number of workers (i.e. the size of the inference pool) can be adjusted globally through the server-level parallel_workers setting.

    parallel_workers

    The parallel_workers field of the settings.json file (or alternatively, the MLSERVER_PARALLEL_WORKERS global environment variable) controls the size of MLServer's inference pool. The expected values are:

    • N, where N > 0, will create a pool of N workers.

    • 0, will disable the parallel inference feature. In other words, inference will happen within the main MLServer process.
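For instance, the following settings.json sketch would create an inference pool with four workers:

```json
{
  "parallel_workers": 4
}
```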

    inference_pool_gid

The inference_pool_gid field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_INFERENCE_POOL_GID global environment variable) allows you to load models on a dedicated inference pool based on the group ID (GID), to prevent starvation behaviour.

Complementing the inference_pool_gid, if the autogenerate_inference_pool_gid field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_AUTOGENERATE_INFERENCE_POOL_GID global environment variable) is set to True, a UUID is automatically generated and a dedicated inference pool will load the given model. This option is useful if the user wants to load a single model on a dedicated inference pool without having to manage the GID themselves.
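As a sketch, a model could be placed on its own dedicated pool with a model-settings.json such as the one below (the runtime and URI are illustrative):

```json
{
  "name": "my-isolated-model",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parameters": {
    "uri": "./model.joblib",
    "autogenerate_inference_pool_gid": true
  }
}
```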

    References

    Jiale Zhi, Rui Wang, Jeff Clune, and Kenneth O. Stanley. Fiber: A Platform for Efficient Development and Distributed Training for Reinforcement Learning and Population-Based Methods. arXiv:2003.11164 [cs, stat], March 2020. arXiv:2003.11164.


    Scope: server-wide (independent from model-specific settings)

  • Sources: settings.json or env vars MLSERVER_*

  • Read the full reference →

    Model Settings

    Each model has its own configuration (metadata, parallelism, etc.). Typically provided via a model-settings.json next to the model artifacts. Alternatively, use env vars prefixed with MLSERVER_MODEL_ (e.g., MLSERVER_MODEL_IMPLEMENTATION). If no model-settings.json is found, MLServer will try to load a default model from these env vars. Note: these env vars are shared across models unless overridden by model-settings.json.

    • Scope: per-model

    • Sources: model-settings.json or env vars MLSERVER_MODEL_*

    Read the full reference →

    MLServer CLI

    The mlserver CLI helps with common model lifecycle tasks (build images, init projects, start serving, etc.). For a quick overview:

    • Commands include: build, dockerfile, infer (deprecated), init, start

    • Each command lists its options, arguments, and examples

    Read the full CLI reference →

    Python API

    Build custom runtimes and integrate with MLServer using Python:

    • MLModel: base class for custom inference runtimes

    • Types: request/response schemas and enums (Pydantic)

    • Codecs: payload conversions between protocol types and Python types

    • Metrics: emit and configure metrics

    Browse the Python API →

    Content Types

    If no content type is present on the request or metadata, the Alibi-Detect runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.

    Settings

The Alibi Detect runtime exposes a couple of setting flags which can be used to customise how the runtime behaves. These settings can be added under the parameters.extra section of your model-settings.json file, e.g.

{
  "name": "drift-detector",
  "implementation": "mlserver_alibi_detect.AlibiDetectRuntime",
  "parameters": {
    "uri": "./alibi-detect-artifact/",
    "extra": {
      "batch_size": 5
    }
  }
}

    Reference

You can find the full reference of the accepted extra settings for the Alibi Detect runtime in the mlserver_alibi_detect.runtime.AlibiDetectSettings class.

    Serving

    Next up, we will start our mlserver inference server. Note that, by default, this will load all our models.

    List available models

    Now that we've got our inference server up and running, and serving 2 different models, we can start using the Model Repository API. To get us started, we will first list all available models in the repository.

As we can see, the repository lists 2 models (i.e. mushroom-xgboost and mnist-svm). Note that the state for both is set to READY. This means that both models are loaded, and thus ready for inference.

    Unloading our mushroom-xgboost model

    We will now try to unload one of the 2 models, mushroom-xgboost. This will unload the model from the inference server but will keep it available on our model repository.

    If we now try to list the models available in our repository, we will see that the mushroom-xgboost model is flagged as UNAVAILABLE. This means that it's present in the repository but it's not loaded for inference.

    Loading our mushroom-xgboost model back

    We will now load our model back into our inference server.

    If we now try to list the models again, we will see that our mushroom-xgboost is back again, ready for inference.

    !cp -r ../mms/models/* ./models
    mlserver start .
    import requests
    
    response = requests.post("http://localhost:8080/v2/repository/index", json={})
    response.json()
    requests.post("http://localhost:8080/v2/repository/models/mushroom-xgboost/unload")
    response = requests.post("http://localhost:8080/v2/repository/index", json={})
    response.json()
    requests.post("http://localhost:8080/v2/repository/models/mushroom-xgboost/load")
    response = requests.post("http://localhost:8080/v2/repository/index", json={})
    response.json()
    Content Types

    If no content type is present on the request or metadata, the Scikit-Learn runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.

    Model Outputs

    The Scikit-Learn inference runtime exposes a number of outputs depending on the model type. These outputs match to the predict, predict_proba and transform methods of the Scikit-Learn model.

| Output | Returned By Default | Availability |
| --- | --- | --- |
| predict | ✅ | Available on most models, but not in Scikit-Learn pipelines. |
| predict_proba | ❌ | Only available on non-regressor models. |
| transform | ❌ | Only available on Scikit-Learn pipelines. |

By default, the runtime will only return the output of predict. However, you are able to control which outputs you want back through the outputs field of your InferenceRequest payload.

For example, to only return the model's predict_proba output, you could define a payload such as:

{
  "inputs": [
    {
      "name": "my-input",
      "datatype": "INT32",
      "shape": [2, 2],
      "data": [1, 2, 3, 4]
    }
  ],
  "outputs": [
    { "name": "predict_proba" }
  ]
}


Model Parameters

Config

| Attribute | Type | Default |
| --- | --- | --- |
| extra | str | "allow" |
| env_prefix | str | "MLSERVER_MODEL_" |
| env_file | str | ".env" |
| protected_namespaces | tuple | ('model_', 'settings_') |

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| content_type | Optional[str] | None | Default content type to use for requests and responses. |
| environment_path | Optional[str] | None | Path to a directory that contains the python environment to be used to load this model. |
| environment_tarball | Optional[str] | None | Path to the environment tarball which should be used to load this model. |
| extra | Optional[dict] | <factory> | Arbitrary settings, dependent on the inference runtime implementation. |
| format | Optional[str] | None | Format of the model (only available on certain runtimes). |
| inference_pool_gid | Optional[str] | None | Inference pool group id to be used to serve this model. |
| autogenerate_inference_pool_gid | bool | False | Flag to autogenerate the inference pool group id for this model. |
| uri | Optional[str] | None | URI where the model artifacts can be found. This path must be either absolute or relative to where MLServer is running. |
| version | Optional[str] | None | Version of the model. |

    Serving LightGBM models

    Out of the box, mlserver supports the deployment and serving of lightgbm models. By default, it will assume that these models have been serialised using the bst.save_model() method.

    In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.

    Training

    To test the LightGBM Server, first we need to generate a simple LightGBM model using Python.

    Our model will be persisted as a file named iris-lightgbm.bst.

    Serving

    Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    settings.json

    model-settings.json

    Start serving our model

Now that we have our config in-place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are, or point to the folder where they are.

Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.

    Send test inference request

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

As we can see above, the model predicted the probability for each class, and the probability of class 1 is the highest, close to 0.99, which matches what's on the test set.

    Inference Runtimes

    Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice.

Out of the box, MLServer comes with a set of pre-packaged runtimes which let you interact with a subset of common ML frameworks. This allows you to start serving models saved in these frameworks straight away. To avoid bringing in dependencies for frameworks that you don't need to use, these runtimes are implemented as independent (and optional) Python packages. This mechanism also allows you to roll out your own custom runtimes very easily.

    To pick which runtime you want to use for your model, you just need to make sure that the right package is installed, and then point to the correct runtime class in your model-settings.json file.
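For example, to serve a Scikit-Learn model with the mlserver-sklearn package installed, a model-settings.json like this sketch would point MLServer at the right runtime class:

```json
{
  "name": "my-sklearn-model",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parameters": {
    "uri": "./mnist-svm.joblib"
  }
}
```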

    Adaptive Batching

    MLServer includes support to batch requests together transparently on-the-fly. We refer to this as "adaptive batching", although it can also be known as "predictive batching".

    Benefits

    There are usually two main reasons to adopt adaptive batching:

    • Maximise resource usage. Usually, inference operations are “vectorised” (i.e. are designed to operate across batches). For example, a GPU is designed to operate on multiple data points at the same time. Therefore, to make sure that it’s used at maximum capacity, we need to run inference across batches.

    MLServer CLI

The MLServer package includes a mlserver CLI designed to help with common tasks in a model’s lifecycle. You can see a high-level outline at any time via:

mlserver --help

mlserver

Command-line interface to manage MLServer models.

mlserver [OPTIONS] COMMAND [ARGS]...

    Options

Model Settings

Config

| Attribute | Type | Default |
| --- | --- | --- |
| extra | str | "ignore" |
| env_prefix | str | "MLSERVER_MODEL_" |
| env_file | str | ".env" |
| protected_namespaces | tuple | ('model_', 'settings_') |

Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| cache_enabled | bool | False | Enable caching for a specific model. This parameter can be used to disable cache for a specific model, if the server-level caching is enabled. If the server-level caching is disabled, this parameter value will have no effect. |
| implementation_ | str | - | Python path to the inference runtime to use to serve this model (e.g. mlserver_sklearn.SKLearnModel). |
| inputs | List[MetadataTensor] | <factory> | Metadata about the inputs accepted by the model. |
| max_batch_size | int | 0 | When adaptive batching is enabled, maximum number of requests to group together in a single batch. |
| max_batch_time | float | 0.0 | When adaptive batching is enabled, maximum amount of time (in seconds) to wait for enough requests to build a full batch. |
| name | str | '' | Name of the model. |
| outputs | List[MetadataTensor] | <factory> | Metadata about the outputs returned by the model. |
| parallel_workers | Optional[int] | None | Use the parallel_workers field in the server-wide settings instead. |
| parameters | Optional[ModelParameters] | None | Extra parameters for each instance of this model. |

    KServe

MLServer is used as the core Python inference server in KServe (formerly known as KFServing). This allows for a straightforward avenue to deploy your models into a scalable serving infrastructure backed by Kubernetes.

This section assumes a basic knowledge of KServe and Kubernetes, as well as access to a working Kubernetes cluster with KServe installed. To learn more about KServe or how to install it, please visit the KServe documentation.

    import lightgbm as lgb
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    import os
    
    model_dir = "."
    BST_FILE = "iris-lightgbm.bst"
    
    iris = load_iris()
    y = iris['target']
    X = iris['data']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
    dtrain = lgb.Dataset(X_train, label=y_train)
    
    params = {
        'objective':'multiclass', 
        'metric':'softmax',
        'num_class': 3
    }
    lgb_model = lgb.train(params=params, train_set=dtrain)
    model_file = os.path.join(model_dir, BST_FILE)
    lgb_model.save_model(model_file)

    • --version (Default: False) Show the version and exit.

    build

Build a Docker image for a custom MLServer runtime.

mlserver build [OPTIONS] FOLDER

    Options

    • -t, --tag <text>

    • --no-cache (Default: False)

    Arguments

    • FOLDER Required argument

    dockerfile

Generate a Dockerfile

mlserver dockerfile [OPTIONS] FOLDER

    Options

    • -i, --include-dockerignore (Default: False)

    Arguments

    • FOLDER Required argument

    infer

Execute batch inference requests against a V2 inference server.

Deprecated: This experimental feature will be removed in future work.

mlserver infer [OPTIONS]

    Options

    • --url, -u <text> (Default: localhost:8080; Env: MLSERVER_INFER_URL) URL of the MLServer to send inference requests to. Should not contain http or https.

    • --model-name, -m <text> (Required; Env: MLSERVER_INFER_MODEL_NAME) Name of the model to send inference requests to.

    • --input-data-path, -i <path> (Required; Env: MLSERVER_INFER_INPUT_DATA_PATH) Local path to the input file containing inference requests to be processed.

    • --output-data-path, -o <path> (Required; Env: MLSERVER_INFER_OUTPUT_DATA_PATH) Local path to the output file for the inference responses to be written to.

    • --workers, -w <integer> (Default: 10; Env: MLSERVER_INFER_WORKERS)

    • --retries, -r <integer> (Default: 3; Env: MLSERVER_INFER_RETRIES)

    • --batch-size, -s <integer> (Default: 1; Env: MLSERVER_INFER_BATCH_SIZE) Send inference requests grouped together as micro-batches.

    • --binary-data, -b (Default: False; Env: MLSERVER_INFER_BINARY_DATA) Send inference requests as binary data (not fully supported).

    • --verbose, -v (Default: False; Env: MLSERVER_INFER_VERBOSE) Verbose mode.

    • --extra-verbose, -vv (Default: False; Env: MLSERVER_INFER_EXTRA_VERBOSE) Extra verbose mode (shows detailed requests and responses).

    • --transport, -t <choice> (Options: rest | grpc; Default: rest; Env: MLSERVER_INFER_TRANSPORT) Transport type to use to send inference requests. Can be 'rest' or 'grpc' (not yet supported).

• --request-headers, -H <text> (Env: MLSERVER_INFER_REQUEST_HEADERS) Headers to be set on each inference request sent to the server. Multiple options are allowed as: -H 'Header1: Val1' -H 'Header2: Val2'. When set through the environment variable, provide them as 'Header1:Val1 Header2:Val2'.

    • --timeout <integer> (Default: 60; Env: MLSERVER_INFER_CONNECTION_TIMEOUT) Connection timeout to be passed to tritonclient.

    • --batch-interval <float> (Default: 0; Env: MLSERVER_INFER_BATCH_INTERVAL) Minimum time interval (in seconds) between requests made by each worker.

    • --batch-jitter <float> (Default: 0; Env: MLSERVER_INFER_BATCH_JITTER) Maximum random jitter (in seconds) added to batch interval between requests.

    • --use-ssl (Default: False; Env: MLSERVER_INFER_USE_SSL) Use SSL in communications with inference server.

    • --insecure (Default: False; Env: MLSERVER_INFER_INSECURE) Disable SSL verification in communications. Use with caution.

    init

Generate a base project template

mlserver init [OPTIONS]

    Options

    • -t, --template <text> (Default: https://github.com/EthicalML/sml-security/)

    start

Start serving a machine learning model with MLServer.

mlserver start [OPTIONS] FOLDER

    Arguments

    • FOLDER Required argument

    %%writefile settings.json
    {
        "debug": "true"
    }
    %%writefile model-settings.json
    {
        "name": "iris-lgb",
        "implementation": "mlserver_lightgbm.LightGBMModel",
        "parameters": {
            "uri": "./iris-lightgbm.bst",
            "version": "v0.1.0"
        }
    }
    mlserver start .
    import requests
    
    x_0 = X_test[0:1]
    inference_request = {
        "inputs": [
            {
              "name": "predict-prob",
              "shape": x_0.shape,
              "datatype": "FP32",
              "data": x_0.tolist()
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/iris-lgb/versions/v0.1.0/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    y_test[0]
Included Inference Runtimes

| Framework | Package Name | Implementation Class | Example | Documentation |
| --- | --- | --- | --- | --- |
| Scikit-Learn | mlserver-sklearn | mlserver_sklearn.SKLearnModel | | |
| XGBoost | mlserver-xgboost | mlserver_xgboost.XGBoostModel | | |
  • Minimise any inference overhead. Usually, all models will have to “pay” a constant overhead when running any type of inference. This can be something like IO to communicate with the GPU or some kind of processing in the incoming data. Up to a certain size, this overhead tends to not scale linearly with the number of data points. Therefore, it’s in our interest to send as large batches as we can without deteriorating performance.

• However, these benefits will usually scale only up to a certain point, which is usually determined by either the infrastructure, the machine learning framework used to train your model, or a combination of both. Therefore, to maximise the performance improvements brought in by adaptive batching, it's important to configure it with the appropriate values for your model. Since these values are usually found through experimentation, MLServer won't enable adaptive batching by default on newly loaded models.

    Usage

    MLServer lets you configure adaptive batching independently for each model through two main parameters:

    • Maximum batch size, that is how many requests you want to group together.

    • Maximum batch time, that is how much time we should wait for new requests until we reach our maximum batch size.

    max_batch_size

    The max_batch_size field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_MAX_BATCH_SIZE global environment variable) controls the maximum number of requests that should be grouped together on each batch. The expected values are:

    • N, where N > 1, will create batches of up to N elements.

    • 0 or 1, will disable adaptive batching.

    max_batch_time

    The max_batch_time field of the model-settings.json file (or alternatively, the MLSERVER_MODEL_MAX_BATCH_TIME global environment variable) controls the time that MLServer should wait for new requests to come in until we reach our maximum batch size.

    The expected format is in seconds, but it will take fractional values. That is, 500ms could be expressed as 0.5.

    The expected values are:

    • T, where T > 0, will wait T seconds at most.

    • 0, will disable adaptive batching.
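For example, the following model-settings.json sketch would group together up to 8 requests, waiting at most 500ms (0.5 seconds) for the batch to fill up (runtime and model name are illustrative):

```json
{
  "name": "my-model",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "max_batch_size": 8,
  "max_batch_time": 0.5
}
```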

    Merge and split of custom parameters

MLServer allows adding custom parameters to the parameters field of the requests. These parameters are received as a merged list of parameters inside the server, e.g.

from mlserver import types

# request 1
types.RequestInput(
    name="parameters-np",
    shape=[1],
    datatype="BYTES",
    data=[],
    # hyphenated parameter names need to be passed via dict unpacking
    parameters=types.Parameters(**{"custom-param": "value-1"}),
)

# request 2
types.RequestInput(
    name="parameters-np",
    shape=[1],
    datatype="BYTES",
    data=[],
    parameters=types.Parameters(**{"custom-param": "value-2"}),
)

is received as follows in the batched request in the server:

types.RequestInput(
    name="parameters-np",
    shape=[2],
    datatype="BYTES",
    data=[],
    parameters=types.Parameters(**{"custom-param": ["value-1", "value-2"]}),
)

In the same way, if the response is sent back from the server as a batched response

types.ResponseOutput(
    name="foo",
    datatype="INT32",
    shape=[3, 3],
    data=[1, 2, 3, 4, 5, 6, 7, 8, 9],
    parameters=types.Parameters(
        content_type="np",
        foo=["foo_1", "foo_2"],
        bar=["bar_1", "bar_2", "bar_3"],
    ),
)

it will be returned unbatched from the server as follows:

# Request 1
types.ResponseOutput(
    name="foo",
    datatype="INT32",
    shape=[1, 3],
    data=[1, 2, 3],
    parameters=types.Parameters(
        content_type="np", foo="foo_1", bar="bar_1"
    ),
)

# Request 2
types.ResponseOutput(
    name="foo",
    datatype="INT32",
    shape=[1, 3],
    data=[4, 5, 6],
    parameters=types.Parameters(
        content_type="np", foo="foo_2", bar="bar_2"
    ),
)

# Request 3
types.ResponseOutput(
    name="foo",
    datatype="INT32",
    shape=[1, 3],
    data=[7, 8, 9],
    parameters=types.Parameters(content_type="np", bar="bar_3"),
)


    Serving Runtimes

    KServe provides built-in serving runtimes to deploy models trained in common ML frameworks. These allow you to deploy your models into a robust infrastructure by just pointing to where the model artifacts are stored remotely.

    Some of these runtimes leverage MLServer as the core inference server. Therefore, it should be straightforward to move from your local testing to your serving infrastructure.

    Usage

To use any of the built-in serving runtimes offered by KServe, it should be enough to select the relevant one in your InferenceService manifest.

For example, to serve a Scikit-Learn model, you could use a manifest like the one below:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    sklearn:
      protocolVersion: v2
      storageUri: gs://seldon-models/sklearn/iris

    As you can see highlighted above, the InferenceService manifest will only need to specify the following points:

    • The model artifact is a Scikit-Learn model. Therefore, we will use the sklearn serving runtime to deploy it.

    • The model will be served using the V2 inference protocol, which can be enabled by setting the protocolVersion field to v2.

Once you have your InferenceService manifest ready, then the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to just apply it directly through kubectl, by running:

kubectl apply -f my-inferenceservice-manifest.yaml

    Supported Serving Runtimes

    As mentioned above, KServe offers support for built-in serving runtimes, some of which leverage MLServer as the inference server. Below you can find a table listing these runtimes, and the MLServer inference runtime that they correspond to.

| Framework | MLServer Runtime | KServe Serving Runtime | Documentation |
| --- | --- | --- | --- |
| Scikit-Learn | mlserver-sklearn | sklearn | |
| XGBoost | mlserver-xgboost | xgboost | |

    Note that, on top of the ones shown above (backed by MLServer), KServe also provides a wider set of serving runtimes. To see the full list, please visit the KServe documentation.

    Custom Runtimes

    Sometimes, the serving runtimes built into KServe may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, it's easy to deploy them into your serving infrastructure leveraging KServe support for custom runtimes.

    Usage

The InferenceService manifest gives you full control over the containers used to deploy your machine learning model. This can be leveraged to point your deployment to the custom MLServer image containing your custom logic. For example, if we assume that our custom image has been tagged as my-custom-server:0.1.0, we could write an InferenceService manifest like the one below:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-model
spec:
  predictor:
    containers:
      - name: classifier
        image: my-custom-server:0.1.0
        env:
          - name: PROTOCOL
            value: v2
        ports:
          - containerPort: 8080
            protocol: TCP

    As we can see highlighted above, the main points that we'll need to take into account are:

    • Pointing to our custom MLServer image in the custom container section of our InferenceService.

    • Explicitly choosing the V2 inference protocol to serve our model.

• Letting KServe know what port will be exposed by our custom container to send inference requests.

Once you have your InferenceService manifest ready, then the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to just apply it directly through kubectl, by running:

kubectl apply -f my-inferenceservice-manifest.yaml


    Serving Alibi-Detect models

    Out of the box, mlserver supports the deployment and serving of alibi_detect models. Alibi Detect is an open source Python library focused on outlier, adversarial and drift detection. In this example, we will cover how we can create a detector configuration to then serve it using mlserver.

    Fetch reference data

The first step will be to fetch reference data and other relevant metadata for an alibi-detect model.

For that, we will use the alibi library to get the adult dataset.

    Drift Detector Configuration

This example is based on the Categorical and mixed type data drift detection on income prediction tabular data example from the alibi-detect documentation.

    Creating detector and saving configuration

    Detecting data drift directly

    Serving

    Now that we have the reference data and other configuration parameters, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    settings.json

    model-settings.json

    Start serving our model

Now that we have our config in-place, we can start the server by running the mlserver start command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.

Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.

    Send test inference request

    We now have our alibi-detect model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    View model response

    XGBoost

This package provides an MLServer runtime compatible with XGBoost.

Usage

You can install the runtime, alongside mlserver, as:

pip install mlserver mlserver-xgboost

    For further information on how to use MLServer with XGBoost, you can check out this worked out example.

    XGBoost Artifact Type

    The XGBoost inference runtime will expect that your model is serialised via one of the following methods:

    Extension
    Docs
    Example

    Content Types

If no content type is present on the request or metadata, the XGBoost runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.

    Model Outputs

    The XGBoost inference runtime exposes a number of outputs depending on the model type. These outputs match to the predict and predict_proba methods of the XGBoost model.

    Output
    Returned By Default
    Availability

By default, the runtime will only return the output of predict. However, you are able to control which outputs you want back through the outputs field of your InferenceRequest payload.

For example, to only return the model's predict_proba output, you could define a payload such as:
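The following sketch mirrors the Scikit-Learn example above (input values are illustrative):

```json
{
  "inputs": [
    {
      "name": "my-input",
      "datatype": "INT32",
      "shape": [2, 2],
      "data": [1, 2, 3, 4]
    }
  ],
  "outputs": [
    { "name": "predict_proba" }
  ]
}
```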

    Serving XGBoost models

    Out of the box, mlserver supports the deployment and serving of xgboost models. By default, it will assume that these models have been serialised using the bst.save_model() method.

    In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.

    Training

The first step will be to train a simple xgboost model. For that, we will use the mushrooms dataset.

    Saving our trained model

To save our trained model, we will serialise it using bst.save_model() and the JSON format. This is the format recommended by the XGBoost project.

    Our model will be persisted as a file named mushroom-xgboost.json.

    Serving

    Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    settings.json

    model-settings.json

    Start serving our model

Now that we have our config in-place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are, or point to the folder where they are.

Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.

    Send test inference request

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    As we can see above, the model predicted the input as close to 0, which matches what's on the test set.

    Metrics

    Out-of-the-box, MLServer exposes a set of metrics that help you monitor your machine learning workloads in production. These include standard metrics like number of requests and latency.

    On top of these, you can also register and track your own custom metrics as part of your custom inference runtimes.

    Default Metrics

    By default, MLServer will expose metrics around inference requests (count and error rate) and the status of its internal requests queues. These internal queues are used for adaptive batching and communication with the inference workers.

    Metric Name
    Description

    REST Server Metrics

    On top of the default set of metrics, MLServer's REST server will also expose a set of metrics specific to REST.

The prefix for the REST-specific metrics will be dependent on the metrics_rest_server_prefix flag from the MLServer settings.

    Metric Name
    Description

    gRPC Server Metrics

    On top of the default set of metrics, MLServer's gRPC server will also expose a set of metrics specific to gRPC.

    Metric Name
    Description

    Custom Metrics

    MLServer allows you to register custom metrics within your custom inference runtimes. This can be done through the mlserver.register() and mlserver.log() methods.

    • mlserver.register: Register a new metric.

    • mlserver.log: Log a new set of metric / value pairs. If there's any unregistered metric, it will get registered on-the-fly.

    Under the hood, metrics logged through the mlserver.log method will get exposed to Prometheus as a Histogram.

Custom metrics will generally be registered in the load() method and then used in the predict() method of your custom runtime.
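As an illustrative sketch (the metric name is arbitrary), a custom runtime could register a metric at load time and log values on every prediction:

```python
# Sketch: registering and logging a custom metric from a custom runtime.
import mlserver
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        mlserver.register(
            name="my_custom_metric", description="Values logged per prediction"
        )
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Each kwarg is treated as a separate metric / value pair.
        mlserver.log(my_custom_metric=34)
        # ... run inference and build the actual response here ...
        return InferenceResponse(model_name=self.name, outputs=[])
```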

    Metrics Labelling

For metrics specific to a model (e.g. request counts), MLServer will always label these with the model name and model version. Downstream, this will allow you to aggregate and query metrics per model.

    If these labels are not present on a specific metric, this means that those metrics can't be sliced at the model level.

    Below, you can find the list of standardised labels that you will be able to find on model-specific metrics:

    Label Name
    Description

    Settings

MLServer will expose metric values through a metrics endpoint exposed on its own metric server. This endpoint can be polled by Prometheus or other Prometheus-compatible backends.

Below you can find the available settings to control the behaviour of the metrics server:

    Label Name
    Description
    Default

    Serving Scikit-Learn models

    Out of the box, mlserver supports the deployment and serving of scikit-learn models. By default, it will assume that these models have been serialised using joblib.

    In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.

    Training

The first step will be to train a simple scikit-learn model. For that, we will use the MNIST example which trains an SVM model.

    Saving our trained model

To save our trained model, we will serialise it using joblib. While this is not a perfect approach, it's currently the recommended method to persist models to disk in the Scikit-Learn documentation.

Our model will be persisted as a file named mnist-svm.joblib.

    Serving

    Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    settings.json

    model-settings.json

    Start serving our model

Now that we have our config in-place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are, or point to the folder where they are.

Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.

    Send test inference request

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    As we can see above, the model predicted the input as the number 8, which matches what's on the test set.

    Custom Conda environments in MLServer

It's not unusual that model runtimes require extra dependencies that are not direct dependencies of MLServer. This is the case when we want to use custom runtimes, but also when our model artifacts are the output of older versions of a toolkit (e.g. models trained with an older version of SKLearn).

    In these cases, since these dependencies (or dependency versions) are not known in advance by MLServer, they won't be included in the default seldonio/mlserver Docker image. To cover these cases, the seldonio/mlserver Docker image allows you to load custom environments before starting the server itself.

This example will walk you through how to create and save a custom environment, so that it can be loaded in MLServer without any extra change to the seldonio/mlserver Docker image.

    Seldon Core

    MLServer is used as the core Python inference server in Seldon Core. Therefore, it should be straightforward to deploy your models either by using one of the built-in pre-packaged servers or by pointing to a custom image of MLServer.

    This section assumes a basic knowledge of Seldon Core and Kubernetes, as well as access to a working Kubernetes cluster with Seldon Core installed. To learn more about Seldon Core or how to install it, please visit the Seldon Core documentation.

    # request 1
    types.RequestInput(
        name="parameters-np",
        shape=[1],
        datatype="BYTES",
        data=[],
        parameters=types.Parameters(**{"custom-param": "value-1"}),
    )
    
    # request 2
    types.RequestInput(
        name="parameters-np",
        shape=[1],
        datatype="BYTES",
        data=[],
        parameters=types.Parameters(**{"custom-param": "value-2"}),
    )
    types.RequestInput(
        name="parameters-np",
        shape=[2],
        datatype="BYTES",
        data=[],
        parameters=types.Parameters(**{"custom-param": ["value-1", "value-2"]}),
    )
    types.ResponseOutput(
        name="foo",
        datatype="INT32",
        shape=[3, 3],
        data=[1, 2, 3, 4, 5, 6, 7, 8, 9],
        parameters=types.Parameters(
            content_type="np",
            foo=["foo_1", "foo_2"],
            bar=["bar_1", "bar_2", "bar_3"],
        ),
    )
    # Request 1
    types.ResponseOutput(
        name="foo",
        datatype="INT32",
        shape=[1, 3],
        data=[1, 2, 3],
        parameters=types.Parameters(
            content_type="np", foo="foo_1", bar="'bar_1"
        ),
    )
    
    # Request 2
    types.ResponseOutput(
        name="foo",
        datatype="INT32",
        shape=[1, 3],
        data=[4, 5, 6],
        parameters=types.Parameters(
            content_type="np", foo="foo_2", bar="bar_2"
        ),
    )
    
    # Request 3
    types.ResponseOutput(
        name="foo",
        datatype="INT32",
        shape=[1, 3],
        data=[7, 8, 9],
        parameters=types.Parameters(content_type="np", bar="bar_3"),
    )
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: my-model
    spec:
      predictor:
        sklearn:
          protocolVersion: v2
          storageUri: gs://seldon-models/sklearn/iris
    kubectl apply -f my-inferenceservice-manifest.yaml
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: my-model
    spec:
      predictor:
        containers:
          - name: classifier
            image: my-custom-server:0.1.0
            env:
              - name: PROTOCOL
                value: v2
            ports:
              - containerPort: 8080
                protocol: TCP
    kubectl apply -f my-inferenceservice-manifest.yaml
    pip install mlserver mlserver-xgboost

    | Name | Type | Default | Description |
    | --- | --- | --- | --- |
    | inputs | List[MetadataTensor] | `<factory>` | Metadata about the inputs accepted by the model. |
    | max_batch_size | int | 0 | When adaptive batching is enabled, maximum number of requests to group together in a single batch. |
    | max_batch_time | float | 0.0 | When adaptive batching is enabled, maximum amount of time (in seconds) to wait for enough requests to build a full batch. |
    | name | str | '' | Name of the model. |
    | outputs | List[MetadataTensor] | `<factory>` | Metadata about the outputs returned by the model. |
    | parallel_workers | Optional[int] | None | Use the parallel_workers field in the server-wide settings instead. |
    | parameters | Optional[ModelParameters] | None | Extra parameters for each instance of this model. |
    | platform | str | '' | Framework used to train and serialise the model (e.g. sklearn). |
    | versions | List[str] | `<factory>` | Versions of dependencies used to train the model (e.g. sklearn/0.20.1). |
    | warm_workers | bool | False | Inference workers will now always be warmed up at start time. |
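
    For example, adaptive batching could be enabled for a model through its model-settings.json, as in the sketch below (the model name, runtime, and thresholds are purely illustrative):

    ```json
    {
      "name": "my-model",
      "implementation": "mlserver_sklearn.SKLearnModel",
      "max_batch_size": 16,
      "max_batch_time": 0.5
    }
    ```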

    | Framework | Package Name | Implementation Class | Example | Documentation |
    | --- | --- | --- | --- | --- |
    | Spark MLlib | mlserver-mllib | mlserver_mllib.MLlibModel | MLlib example | MLServer MLlib |
    | LightGBM | mlserver-lightgbm | mlserver_lightgbm.LightGBMModel | LightGBM example | MLServer LightGBM |
    | CatBoost | mlserver-catboost | mlserver_catboost.CatboostModel | CatBoost example | MLServer CatBoost |
    | MLflow | mlserver-mlflow | mlserver_mlflow.MLflowRuntime | MLflow example | MLServer MLflow |
    | Alibi-Detect | mlserver-alibi-detect | mlserver_alibi_detect.AlibiDetectRuntime | Alibi-detect example | MLServer Alibi-Detect |

    Scikit-Learn example
    MLServer SKLearn
    XGBoost example
    MLServer XGBoost
    MLServer SKLearn
    SKLearn Serving Runtime
    MLServer XGBoost
    XGBoost Serving Runtime
    demographic features from a 1996 US census
    alibi-detect

    | Extension | Docs | Example |
    | --- | --- | --- |
    | `*.json` | JSON Format | `booster.save_model("model.json")` |
    | `*.ubj` | Binary JSON Format | `booster.save_model("model.ubj")` |
    | `*.bst` | (Old) Binary Format | `booster.save_model("model.bst")` |

    | Output | Returned By Default | Availability |
    | --- | --- | --- |
    | predict | ✅ | Available on all XGBoost models. |
    | predict_proba | ❌ | Only available on non-regressor models (i.e. XGBClassifier models). |

    content type
    NumPy Array
    model's metadata
    mushrooms example from the xgboost Getting Started guide
    approach by the XGBoost project

    | Metric Name | Description |
    | --- | --- |
    | model_infer_request_success | Number of successful inference requests. |
    | model_infer_request_failure | Number of failed inference requests. |
    | batch_request_queue | Queue size for the adaptive batching queue. |
    | parallel_request_queue | Queue size for the inference workers queue. |
    | [rest_server]_requests | Number of REST requests, labelled by endpoint and status code. |
    | [rest_server]_requests_duration_seconds | Latency of REST requests. |
    | [rest_server]_requests_in_progress | Number of in-flight REST requests. |
    | grpc_server_handled | Number of gRPC requests, labelled by gRPC code and method. |
    | grpc_server_started | Number of in-flight gRPC requests. |
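
    As a quick sanity check, these metrics can be inspected by polling the metrics endpoint manually. The sketch below assumes a local MLServer instance running with the default metrics settings (port 8082 and the /metrics path):

    ```python
    import requests

    # Fetch the Prometheus-formatted metrics exposed by MLServer's metrics server
    response = requests.get("http://localhost:8082/metrics")
    print(response.text)
    ```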

    | Label Name | Description |
    | --- | --- |
    | model_name | Model Name (e.g. my-custom-model) |
    | model_version | Model Version (e.g. v1.2.3) |

    | Setting Name | Description | Default |
    | --- | --- | --- |
    | metrics_endpoint | Path under which the metrics endpoint will be exposed. | /metrics |
    | metrics_port | Port used to serve the metrics server. | 8082 |
    | metrics_rest_server_prefix | Prefix used for metric names specific to MLServer's REST inference interface. | rest_server |
    | metrics_dir | Directory used to store internal metric files (used to support metrics sharing across inference workers). This is equivalent to Prometheus' $PROMETHEUS_MULTIPROC_DIR env var. | MLServer's current working directory (i.e. $PWD) |

    MLServer settings
    custom runtime
    custom metrics
    Prometheus
    OpenMetrics
    settings


    MNIST example from the scikit-learn documentation
    scikit-learn documentation
    Define our environment

    For this example, we will create a custom environment to serve a model trained with an older version of Scikit-Learn. The first step will be to define this environment using an environment.yml file.

    Note that these environments can also be created on the fly as we go, and then serialised later.

    Train model in our custom environment

    To illustrate the point, we will train a Scikit-Learn model using our older environment.

    The first step will be to create and activate an environment which reflects what's outlined in our environment.yml file.

    NOTE: If you are running this from a Jupyter Notebook, you will need to restart your Jupyter instance so that it runs from this environment.

    We can now train and save a Scikit-Learn model using the older version of our environment. This model will be serialised as model.joblib.

    You can find more details of this process in the Scikit-Learn example.

    Serialise our custom environment

    Lastly, we will need to serialise our environment in the format expected by MLServer. To do that, we will use a tool called conda-pack.

    This tool will save a portable version of our environment as a .tar.gz file, also known as a tarball.

    Serving

    Now that we have defined our environment (and we've got a sample artifact trained in that environment), we can move to serving our model.

    To do that, we will first need to select the right runtime through a model-settings.json config file.

    We can then spin up our model, using our custom environment, leveraging MLServer's Docker image. Keep in mind that you will need Docker installed on your machine to run this example.

    Our Docker command will need to take into account the following points:

    • Mount the example's folder as a volume so that it can be accessed from within the container.

    • Let MLServer know that our custom environment's tarball can be found as old-sklearn.tar.gz.

    • Expose port 8080 so that we can send requests from the outside.

    From the command line, this can be done using Docker's CLI as:

    Note that we need to keep the server running in the background while we send requests. Therefore, it's best to run this command on a separate terminal session.

    Send test inference request

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

    For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    custom runtimes
    Install the `alibi` library for dataset dependencies and the `alibi_detect` library for detector configuration from PyPI
    ```python
    !pip install alibi alibi_detect
    ```
    import alibi
    import matplotlib.pyplot as plt
    import numpy as np
    adult = alibi.datasets.fetch_adult()
    X, y = adult.data, adult.target
    feature_names = adult.feature_names
    category_map = adult.category_map
    n_ref = 10000
    n_test = 10000
    
    X_ref, X_t0, X_t1 = X[:n_ref], X[n_ref:n_ref + n_test], X[n_ref + n_test:n_ref + 2 * n_test]
    categories_per_feature = {f: None for f in list(category_map.keys())}
    from alibi_detect.cd import TabularDrift
    cd_tabular = TabularDrift(X_ref, p_val=.05, categories_per_feature=categories_per_feature)
    from alibi_detect.utils.saving import save_detector
    filepath = "alibi-detector-artifacts"
    save_detector(cd_tabular, filepath)
    preds = cd_tabular.predict(X_t0,drift_type="feature")
    
    labels = ['No!', 'Yes!']
    print(f"Threshold {preds['data']['threshold']}")
    for f in range(cd_tabular.n_features):
        fname = feature_names[f]
        is_drift = (preds['data']['p_val'][f] < preds['data']['threshold']).astype(int)
        stat_val, p_val = preds['data']['distance'][f], preds['data']['p_val'][f]
        print(f'{fname} -- Drift? {labels[is_drift]} -- Chi2 {stat_val:.3f} -- p-value {p_val:.3f}')
    Threshold 0.05
    Age -- Drift? No! -- Chi2 0.012 -- p-value 0.508
    Workclass -- Drift? No! -- Chi2 8.487 -- p-value 0.387
    Education -- Drift? No! -- Chi2 4.753 -- p-value 0.576
    Marital Status -- Drift? No! -- Chi2 3.160 -- p-value 0.368
    Occupation -- Drift? No! -- Chi2 8.194 -- p-value 0.415
    Relationship -- Drift? No! -- Chi2 0.485 -- p-value 0.993
    Race -- Drift? No! -- Chi2 0.587 -- p-value 0.965
    Sex -- Drift? No! -- Chi2 0.217 -- p-value 0.641
    Capital Gain -- Drift? No! -- Chi2 0.002 -- p-value 1.000
    Capital Loss -- Drift? No! -- Chi2 0.002 -- p-value 1.000
    Hours per week -- Drift? No! -- Chi2 0.012 -- p-value 0.508
    Country -- Drift? No! -- Chi2 9.991 -- p-value 0.441
    %%writefile settings.json
    {
        "debug": "true"
    }
    Overwriting settings.json
    %%writefile model-settings.json
    {
      "name": "income-tabular-drift",
      "implementation": "mlserver_alibi_detect.AlibiDetectRuntime",
      "parameters": {
        "uri": "./alibi-detector-artifacts",
        "version": "v0.1.0",
        "extra": {
          "predict_parameters":{
            "drift_type": "feature"
          }
        }
      }
    }
    Overwriting model-settings.json
    mlserver start .
    import requests
    
    inference_request = {
        "inputs": [
            {
                "name": "predict",
                "shape": X_t0.shape,
                "datatype": "FP32",
                "data": X_t0.tolist(),
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/income-tabular-drift/versions/v0.1.0/infer"
    response = requests.post(endpoint, json=inference_request)
    import json
    response_dict = json.loads(response.text)
    
    labels = ['No!', 'Yes!']
    for f in range(cd_tabular.n_features):
        stat = 'Chi2' if f in list(categories_per_feature.keys()) else 'K-S'
        fname = feature_names[f]
        is_drift = response_dict['outputs'][0]['data'][f]
        stat_val, p_val = response_dict['outputs'][1]['data'][f], response_dict['outputs'][2]['data'][f]
        print(f'{fname} -- Drift? {labels[is_drift]} -- Chi2 {stat_val:.3f} -- p-value {p_val:.3f}')
    Age -- Drift? No! -- Chi2 0.012 -- p-value 0.508
    Workclass -- Drift? No! -- Chi2 8.487 -- p-value 0.387
    Education -- Drift? No! -- Chi2 4.753 -- p-value 0.576
    Marital Status -- Drift? No! -- Chi2 3.160 -- p-value 0.368
    Occupation -- Drift? No! -- Chi2 8.194 -- p-value 0.415
    Relationship -- Drift? No! -- Chi2 0.485 -- p-value 0.993
    Race -- Drift? No! -- Chi2 0.587 -- p-value 0.965
    Sex -- Drift? No! -- Chi2 0.217 -- p-value 0.641
    Capital Gain -- Drift? No! -- Chi2 0.002 -- p-value 1.000
    Capital Loss -- Drift? No! -- Chi2 0.002 -- p-value 1.000
    Hours per week -- Drift? No! -- Chi2 0.012 -- p-value 0.508
    Country -- Drift? No! -- Chi2 9.991 -- p-value 0.441
    By default, the runtime will look for a file called `model.[json | ubj | bst]`.
    However, this can be modified through the `parameters.uri` field of your
    {class}`ModelSettings <mlserver.settings.ModelSettings>` config (see the
    section on [Model Settings](../../docs/reference/model-settings.md) for more
    details).
    
    ```{code-block} json
    ---
    emphasize-lines: 3-5
    ---
    {
      "name": "foo",
      "parameters": {
        "uri": "./my-own-model-filename.json"
      }
    }
    ```
    ```{code-block} json
    ---
    emphasize-lines: 10-12
    ---
    {
      "inputs": [
        {
          "name": "my-input",
          "datatype": "INT32",
          "shape": [2, 2],
          "data": [1, 2, 3, 4]
        }
      ],
      "outputs": [
        { "name": "predict_proba" }
      ]
    }
    ```
    # Original code and extra details can be found in:
    # https://xgboost.readthedocs.io/en/latest/get_started.html#python
    
    import os
    import xgboost as xgb
    import requests
    
    from urllib.parse import urlparse
    from sklearn.datasets import load_svmlight_file
    
    
    TRAIN_DATASET_URL = 'https://raw.githubusercontent.com/dmlc/xgboost/master/demo/data/agaricus.txt.train'
    TEST_DATASET_URL = 'https://raw.githubusercontent.com/dmlc/xgboost/master/demo/data/agaricus.txt.test'
    
    
    def _download_file(url: str) -> str:
        parsed = urlparse(url)
        file_name = os.path.basename(parsed.path)
        file_path = os.path.join(os.getcwd(), file_name)
        
        res = requests.get(url)
        
        with open(file_path, 'wb') as file:
            file.write(res.content)
        
        return file_path
    
    train_dataset_path = _download_file(TRAIN_DATASET_URL)
    test_dataset_path = _download_file(TEST_DATASET_URL)
    
    # NOTE: Workaround to load SVMLight files from the XGBoost example
    X_train, y_train = load_svmlight_file(train_dataset_path)
    X_test, y_test = load_svmlight_file(test_dataset_path)
    X_train = X_train.toarray()
    X_test = X_test.toarray()
    
    # read in data
    dtrain = xgb.DMatrix(data=X_train, label=y_train)
    
    # specify parameters via map
    param = {'max_depth':2, 'eta':1, 'objective':'binary:logistic' }
    num_round = 2
    bst = xgb.train(param, dtrain, num_round)
    
    bst
    model_file_name = 'mushroom-xgboost.json'
    bst.save_model(model_file_name)
    %%writefile settings.json
    {
        "debug": "true"
    }
    %%writefile model-settings.json
    {
        "name": "mushroom-xgboost",
        "implementation": "mlserver_xgboost.XGBoostModel",
        "parameters": {
            "uri": "./mushroom-xgboost.json",
            "version": "v0.1.0"
        }
    }
    mlserver start .
    import requests
    
    x_0 = X_test[0:1]
    inference_request = {
        "inputs": [
            {
              "name": "predict",
              "shape": x_0.shape,
              "datatype": "FP32",
              "data": x_0.tolist()
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/mushroom-xgboost/versions/v0.1.0/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    y_test[0]
    import mlserver
    
    from mlserver.types import InferenceRequest, InferenceResponse
    
    class MyCustomRuntime(mlserver.MLModel):
      async def load(self) -> bool:
        self._model = load_my_custom_model()
        mlserver.register("my_custom_metric", "This is a custom metric example")
        return True
    
      async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        mlserver.log(my_custom_metric=34)
        # TODO: Replace for custom logic to run inference
        return self._model.predict(payload)
    # Original source code and more details can be found in:
    # https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
    
    # Import datasets, classifiers and performance metrics
    from sklearn import datasets, svm, metrics
    from sklearn.model_selection import train_test_split
    
    # The digits dataset
    digits = datasets.load_digits()
    
    # To apply a classifier on this data, we need to flatten the image, to
    # turn the data in a (samples, feature) matrix:
    n_samples = len(digits.images)
    data = digits.images.reshape((n_samples, -1))
    
    # Create a classifier: a support vector classifier
    classifier = svm.SVC(gamma=0.001)
    
    # Split data into train and test subsets
    X_train, X_test, y_train, y_test = train_test_split(
        data, digits.target, test_size=0.5, shuffle=False)
    
    # We learn the digits on the first half of the digits
    classifier.fit(X_train, y_train)
    import joblib
    
    model_file_name = "mnist-svm.joblib"
    joblib.dump(classifier, model_file_name)
    %%writefile settings.json
    {
        "debug": "true"
    }
    %%writefile model-settings.json
    {
        "name": "mnist-svm",
        "implementation": "mlserver_sklearn.SKLearnModel",
        "parameters": {
            "uri": "./mnist-svm.joblib",
            "version": "v0.1.0"
        }
    }
    mlserver start .
    import requests
    
    x_0 = X_test[0:1]
    inference_request = {
        "inputs": [
            {
              "name": "predict",
              "shape": x_0.shape,
              "datatype": "FP32",
              "data": x_0.tolist()
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/mnist-svm/versions/v0.1.0/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    y_test[0]
    %%writefile environment.yml
    
    name: old-sklearn
    channels:
        - conda-forge
    dependencies:
        - python == 3.8
        - scikit-learn == 0.24.2
        - joblib == 0.17.0
        - requests
        - pip
        - pip:
            - mlserver == 1.1.0
            - mlserver-sklearn == 1.1.0
    !conda env create --force -f environment.yml
    !conda activate old-sklearn
    # Original source code and more details can be found in:
    # https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
    
    # Import datasets, classifiers and performance metrics
    from sklearn import datasets, svm, metrics
    from sklearn.model_selection import train_test_split
    
    # The digits dataset
    digits = datasets.load_digits()
    
    # To apply a classifier on this data, we need to flatten the image, to
    # turn the data in a (samples, feature) matrix:
    n_samples = len(digits.images)
    data = digits.images.reshape((n_samples, -1))
    
    # Create a classifier: a support vector classifier
    classifier = svm.SVC(gamma=0.001)
    
    # Split data into train and test subsets
    X_train, X_test, y_train, y_test = train_test_split(
        data, digits.target, test_size=0.5, shuffle=False)
    
    # We learn the digits on the first half of the digits
    classifier.fit(X_train, y_train)
    import joblib
    
    model_file_name = "model.joblib"
    joblib.dump(classifier, model_file_name)
    !conda pack --force -n old-sklearn -o old-sklearn.tar.gz
    %%writefile model-settings.json
    {
        "name": "mnist-svm",
        "implementation": "mlserver_sklearn.SKLearnModel"
    }
    docker run -it --rm \
        -v "$PWD":/mnt/models \
        -e "MLSERVER_ENV_TARBALL=/mnt/models/old-sklearn.tar.gz" \
        -p 8080:8080 \
        seldonio/mlserver:1.1.0-slim
    import requests
    
    x_0 = X_test[0:1]
    inference_request = {
        "inputs": [
            {
              "name": "predict",
              "shape": x_0.shape,
              "datatype": "FP32",
              "data": x_0.tolist()
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/mnist-svm/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    Pre-packaged Servers

    Out of the box, Seldon Core comes with a few MLServer runtimes pre-configured to run straight away. This allows you to deploy an MLServer instance by just pointing to where your model artifact is and specifying what ML framework was used to train it.

    Usage

    To let Seldon Core know what framework was used to train your model, you can use the implementation field of your SeldonDeployment manifest. For example, to deploy a Scikit-Learn artifact stored remotely in GCS, one could do:

    As you can see highlighted above, all that we need to specify is that:

    • Our inference deployment should use the V2 inference protocol, which is done by setting the protocol field to v2.

    • Our model artifact is a serialised Scikit-Learn model, therefore it should be served using the MLServer SKLearn runtime, which is done by setting the implementation field to SKLEARN_SERVER.

    Note that, while the protocol should always be set to v2 (i.e. so that models are served using the V2 inference protocol), the value of the implementation field will be dependent on your ML framework. The valid values of the implementation field are pre-determined by Seldon Core. However, it should also be possible to configure and add new ones (e.g. to support a custom MLServer runtime).

    Once you have your SeldonDeployment manifest ready, then the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to just apply it directly through kubectl, by running:

    To consult the supported values of the implementation field where MLServer is used, you can check the support table below.

    Supported Pre-packaged Servers

    As mentioned above, pre-packaged servers come built into Seldon Core. Therefore, only a pre-determined subset of them will be supported for a given release of Seldon Core.

    The table below shows a list of the currently supported values of the implementation field. Each row also shows what ML framework it corresponds to and what MLServer runtime will be enabled internally on your model deployment when used.

    | Framework | MLServer Runtime | Seldon Core Pre-packaged Server | Documentation |
    | --- | --- | --- | --- |
    | Scikit-Learn | MLServer SKLearn | SKLEARN_SERVER | SKLearn Server |
    | XGBoost | MLServer XGBoost | XGBOOST_SERVER | XGBoost Server |
    | MLflow | MLServer MLflow | MLFLOW_SERVER | MLflow Server |

    Note that, on top of the ones shown above (backed by MLServer), Seldon Core also provides a wider set of pre-packaged servers. To check the full list, please visit the Seldon Core documentation.

    Custom Runtimes

    There could be cases where the pre-packaged MLServer runtimes supported out-of-the-box in Seldon Core may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images become self-contained model servers with your custom runtime, and Seldon Core makes it easy to deploy them into your serving infrastructure.

    Usage

    The componentSpecs field of the SeldonDeployment manifest will allow us to let Seldon Core know what image should be used to serve a custom model. For example, if we assume that our custom image has been tagged as my-custom-server:0.1.0, we could write our SeldonDeployment manifest as follows:

    As we can see highlighted on the snippet above, all that's needed to deploy a custom MLServer image is:

    • Letting Seldon Core know that the model deployment will be served through the V2 inference protocol by setting the protocol field to v2.

    • Pointing our model container to use our custom MLServer image, by specifying it on the image field of the componentSpecs section of the manifest.

    Once you have your SeldonDeployment manifest ready, then the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to just apply it directly through kubectl, by running:

    core Python inference server
    Seldon Core
    built-in pre-packaged servers
    custom image of MLServer
    Seldon Core
    how to install it
    Seldon Core documentation

    Streaming

    The mlserver package comes with built-in support for streaming data. This allows you to process data in real-time, without having to wait for the entire response to be available. It supports both REST and gRPC APIs.

    Overview

    In this example, we create a simple Identity Text Model which simply splits the input text into words and returns them one by one. We will use this model to demonstrate how to stream the response from the server to the client. This particular example can provide a good starting point for building more complex streaming models such as the ones based on Large Language Models (LLMs) where streaming is an essential feature to hide the latency of the model.

    Serving

    The next step will be to serve our model using mlserver. For that, we will first implement an extension that serves as the runtime to perform inference using our custom TextModel.

    Custom inference runtime

    This is a trivial model to demonstrate streaming support. The model simply splits the input text into words and returns them one by one. In this example we do the following:

    • split the text into words using the white space as the delimiter.

    • wait 0.5 seconds between each word to simulate a slow model.

    • return each word one by one.

    As can be seen, the predict_stream method receives as an input an AsyncIterator of InferenceRequest and returns an AsyncIterator of InferenceResponse. This definition covers all types of possible input-output combinations for streaming: unary-stream, stream-unary, stream-stream. It is up to the client and server to send/receive the appropriate number of requests/responses, which should be known a priori.

    Note that, although unary-unary can be covered by the predict_stream method as well, mlserver already covers that through the predict method.

    One important limitation to keep in mind is that for the REST API, the client will not be able to send a stream of requests. The client will have to send a single request with the entire input text. The server will then stream the response back to the client. gRPC API, on the other hand, supports all types of streaming listed above.

    Settings file

    The next step will be to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    settings.json

    Note that the streaming support in MLServer currently has the following main limitations:

    • distributed workers are not supported (i.e., the parallel_workers setting should be set to 0)

    • gzip middleware is not supported for REST (i.e., gzip_enabled setting should be set to false)

    model-settings.json

    Start serving the model

    Now that we have our config in-place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are or point to the folder where they are.

    Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.

    Inference request

    To test our model, we will use the following inference request:

    Send test generate stream request (REST)

    To send a REST streaming request to the server, we will use the following Python code:

    Send test generate stream request (gRPC)

    To send a gRPC streaming request to the server, we will use the following Python code:

    Note that for gRPC, the request is transformed into an async generator which is then passed to the ModelStreamInfer method. The response is also an async generator which can be iterated over to get the response.

    Serving a custom model with JSON serialization

    The mlserver package comes with inference runtime implementations for scikit-learn and xgboost models. However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.

    Overview

    In this example, we create a simple Hello World JSON model that parses and modifies a JSON data chunk. This is often useful as a means to quickly bootstrap existing models that utilize JSON based model inputs.

    Serving

    The next step will be to serve our model using mlserver. For that, we will first implement an extension which serves as the runtime to perform inference using our custom Hello World JSON model.

    Custom inference runtime

    This is a trivial model to demonstrate how to conceptually work with JSON inputs / outputs. In this example:

    • Parse the JSON input from the client

    • Create a JSON response echoing back the client request as well as a server generated message

    Settings files

    The next step will be to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    settings.json

    model-settings.json

    Start serving our model

    Now that we have our config in-place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are or point to the folder where they are.

    Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.

    Send test inference request (REST)

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a test request.

    For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    Send test inference request (gRPC)

    Utilizing string data with the gRPC interface can be a bit tricky, so we need to make sure that inputs and outputs are encoded and decoded correctly.

    For simplicity, in this case we leverage the Python types that mlserver provides out of the box. Alternatively, the gRPC stubs can be regenerated from the V2 specification directly for use by non-Python as well as Python clients.

    MLServer

    An open source inference server for your machine learning models.

    Overview

    MLServer aims to provide an easy way to start serving your machine learning models through a REST and gRPC interface, fully compliant with the V2 Inference Protocol spec. You can also watch a quick video introducing the project.

    • Multi-model serving, letting users run multiple models within the same process.

    Multi-Model Serving

    MLServer has been built with Multi-Model Serving (MMS) in mind. This means that, within a single instance of MLServer, you can serve multiple models under different paths. This also includes multiple versions of the same model.
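
    For instance, a model repository could look roughly like the hypothetical layout below, where each model lives in its own folder with its own model-settings.json and all of them get loaded by a single MLServer instance (the folder and model names are purely illustrative):

    ```
    .
    ├── settings.json                  # server-wide settings
    ├── mnist-svm/
    │   └── model-settings.json        # e.g. "implementation": "mlserver_sklearn.SKLearnModel"
    └── mushroom-xgboost/
        └── model-settings.json        # e.g. "implementation": "mlserver_xgboost.XGBoostModel"
    ```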

    This notebook shows an example of how you can leverage MMS with MLServer.

    Training

    We will first start by training 2 different models:

    Name
    Framework
    Source

    Codecs

    Base64Codec

    Codec that converts to / from a base64 input.

    Methods

    Serving models through Kafka

    Out of the box, MLServer provides support to receive inference requests from Kafka. The Kafka server can run side-by-side with the REST and gRPC ones, and adds a new interface to interact with your model. The inference responses coming back from your model will also get written back to their own output topic.

    In this example, we will showcase the integration with Kafka by serving a model through Kafka.

    Run Kafka

    We are going to start by running a simple local deployment of Kafka that we can test against. This will be a minimal cluster, consisting of a single broker running in KRaft mode (i.e. without Zookeeper).

    You need to have Java installed in order for it to work correctly.

    apiVersion: machinelearning.seldon.io/v1
    kind: SeldonDeployment
    metadata:
      name: my-model
    spec:
      protocol: v2
      predictors:
        - name: default
          graph:
            name: classifier
            implementation: SKLEARN_SERVER
            modelUri: gs://seldon-models/sklearn/iris
    kubectl apply -f my-seldondeployment-manifest.yaml
    apiVersion: machinelearning.seldon.io/v1
    kind: SeldonDeployment
    metadata:
      name: my-model
    spec:
      protocol: v2
      predictors:
        - name: default
          graph:
            name: classifier
          componentSpecs:
            - spec:
                containers:
                  - name: classifier
                    image: my-custom-server:0.1.0
    kubectl apply -f my-seldondeployment-manifest.yaml

    can_encode()

    Evaluate whether the codec can encode (decode) the payload.

    decode_input()

    Decode a request input into a high-level Python type.

    decode_output()

    Decode a response output into a high-level Python type.

    encode_input()

    Encode the given payload into a RequestInput.

    encode_output()

    Encode the given payload into a response output.

    CodecError

    Methods

    add_note()

    Exception.add_note(note) -- add a note to the exception

    with_traceback()

    Exception.with_traceback(tb) -- set self.__traceback__ to tb and return self.

    DatetimeCodec

    Codec that converts to / from a datetime input.

    Methods

    can_encode()

    Evaluate whether the codec can encode (decode) the payload.

    decode_input()

    Decode a request input into a high-level Python type.

    decode_output()

    Decode a response output into a high-level Python type.

    encode_input()

    Encode the given payload into a RequestInput.

    encode_output()

    Encode the given payload into a response output.

    InputCodec

    The InputCodec interface lets you define type conversions of your raw input data to / from the Open Inference Protocol. Note that this codec applies at the individual input (output) level.

    For request-wide transformations (e.g. dataframes), use the RequestCodec interface instead.

    Methods

    can_encode()

    Evaluate whether the codec can encode (decode) the payload.

    decode_input()

    Decode a request input into a high-level Python type.

    decode_output()

    Decode a response output into a high-level Python type.

    encode_input()

    Encode the given payload into a RequestInput.

    encode_output()

    Encode the given payload into a response output.

    NumpyCodec

    Decodes a request input (response output) as a NumPy array.

    Methods

    can_encode()

    Evaluate whether the codec can encode (decode) the payload.

    decode_input()

    Decode a request input into a high-level Python type.

    decode_output()

    Decode a response output into a high-level Python type.

    encode_input()

    Encode the given payload into a RequestInput.

    encode_output()

    Encode the given payload into a response output.

    NumpyRequestCodec

    Decodes the first input (output) of request (response) as a NumPy array. This codec can be useful for cases where the whole payload is a single NumPy tensor.

    Methods

    can_encode()

    Evaluate whether the codec can encode (decode) the payload.

    decode_request()

    Decode an inference request into a high-level Python object.

    decode_response()

    Decode an inference response into a high-level Python object.

    encode_request()

    Encode the given payload into an inference request.

    encode_response()

    Encode the given payload into an inference response.

    PandasCodec

    Decodes a request (response) into a Pandas DataFrame, assuming each input (output) head corresponds to a column of the DataFrame.

    Methods

    can_encode()

    Evaluate whether the codec can encode (decode) the payload.

    decode_request()

    Decode an inference request into a high-level Python object.

    decode_response()

    Decode an inference response into a high-level Python object.

    encode_outputs()

    encode_request()

    Encode the given payload into an inference request.

    encode_response()

    Encode the given payload into an inference response.
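
    As a brief illustration of how this codec could be used directly (assuming pandas is installed), a DataFrame can be encoded into a V2 inference request and decoded back again:

    ```python
    import pandas as pd
    from mlserver.codecs import PandasCodec

    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

    # Encode the DataFrame into a V2 inference request (one input per column)
    inference_request = PandasCodec.encode_request(df)

    # Decode the request back into a DataFrame
    decoded = PandasCodec.decode_request(inference_request)
    print(decoded)
    ```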

    RequestCodec

    The RequestCodec interface lets you define request-level conversions between high-level Python types and the Open Inference Protocol. This can be useful where the encoding of your payload encompasses multiple input heads (e.g. dataframes, where each column can be thought of as a separate input head).

    For individual input-level encoding / decoding, use the InputCodec interface instead.

    Methods

    can_encode()

    Evaluate whether the codec can encode (decode) the payload.

    decode_request()

    Decode an inference request into a high-level Python object.

    decode_response()

    Decode an inference response into a high-level Python object.

    encode_request()

    Encode the given payload into an inference request.

    encode_response()

    Encode the given payload into an inference response.

    StringCodec

    Encodes a list of Python strings as a BYTES input (output).

    Methods

    can_encode()

    Evaluate whether the codec can encode (decode) the payload.

    decode_input()

    Decode a request input into a high-level Python type.

    decode_output()

    Decode a response output into a high-level Python type.

    encode_input()

    Encode the given payload into a RequestInput.

    encode_output()

    Encode the given payload into a response output.

    StringRequestCodec

    Decodes the first input (output) of request (response) as a list of strings. This codec can be useful for cases where the whole payload is a single list of strings.

    Methods

    can_encode()

    Evaluate whether the codec can encode (decode) the payload.

    decode_request()

    Decode an inference request into a high-level Python object.

    decode_response()

    Decode an inference response into a high-level Python object.

    encode_request()

    Encode the given payload into an inference request.

    encode_response()

    Encode the given payload into an inference response.

    decode_args()

    No description available.

    decode_inference_request()

    No description available.

    decode_request_input()

    No description available.

    encode_inference_response()

    No description available.

    encode_response_output()

    No description available.

    get_decoded()

    No description available.

    get_decoded_or_raw()

    No description available.

    has_decoded()

    No description available.

    register_input_codec()

    No description available.

    register_request_codec()

    No description available.

    Run the no-zookeeper kafka broker

    Now you can just run it with the following command on a separate terminal:

    Create Topics

    Now we can create the input and output topics required

    Training

    The first step will be to train a simple scikit-learn model. For that, we will use the MNIST example from the scikit-learn documentation which trains an SVM model.

    Saving our trained model

    To save our trained model, we will serialise it using joblib. While this is not a perfect approach, it's currently the recommended method to persist models to disk in the scikit-learn documentation.

    Our model will be persisted as a file named mnist-svm.joblib

    Serving

    Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    Note that the settings.json file will contain our Kafka configuration, including the address of the Kafka broker and the input / output topics that will be used for inference.

    settings.json

    model-settings.json

    Start serving our model

    Now that we have our config in-place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are or point to the folder where they are.

    Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.

    Send test inference request

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

    For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    Send inference request through Kafka

    Now that we have verified that our server is accepting REST requests, we will try to send a new inference request through Kafka. For this, we just need to send a request to the mlserver-input topic (which is the default input topic):

    Once the message has gone into the queue, the Kafka server running within MLServer should receive this message and run inference. The prediction output should then get posted into an output queue, which will be named mlserver-output by default.

    As we can see above, the results of our inference request are now visible in the output Kafka queue.

    Scikit-Learn
    %%writefile text_model.py
    
    import asyncio
    from typing import AsyncIterator
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse
    from mlserver.codecs import StringCodec
    
    
    class TextModel(MLModel):
    
        async def predict_stream(
            self, payloads: AsyncIterator[InferenceRequest]
        ) -> AsyncIterator[InferenceResponse]:
            payload = [_ async for _ in payloads][0]
            text = StringCodec.decode_input(payload.inputs[0])[0]
            words = text.split(" ")
    
            split_text = []
            for i, word in enumerate(words):
                split_text.append(word if i == 0 else " " + word)
    
            for word in split_text:
                await asyncio.sleep(0.5)
                yield InferenceResponse(
                    model_name=self._settings.name,
                    outputs=[
                        StringCodec.encode_output(
                            name="output",
                            payload=[word],
                            use_bytes=True,
                        ),
                    ],
                )
    
    %%writefile settings.json
    
    {
      "debug": false,
      "parallel_workers": 0,
      "gzip_enabled": false
    }
    
    %%writefile model-settings.json
    
    {
      "name": "text-model",
    
      "implementation": "text_model.TextModel",
      
      "versions": ["text-model/v1.2.3"],
      "platform": "mlserver",
      "inputs": [
        {
          "datatype": "BYTES",
          "name": "prompt",
          "shape": [1]
        }
      ],
      "outputs": [
        {
          "datatype": "BYTES",
          "name": "output",
          "shape": [1]
        }
      ]
    }
    mlserver start .
    %%writefile generate-request.json
    
    {
        "inputs": [
            {
                "name": "prompt",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["What is the capital of France?"],
                "parameters": {
                "content_type": "str"
                }
            }
        ],
        "outputs": [
            {
              "name": "output"
            }
        ]
    }
    import httpx
    from httpx_sse import connect_sse
    from mlserver import types
    from mlserver.codecs import StringCodec
    
    inference_request = types.InferenceRequest.parse_file("./generate-request.json")
    
    with httpx.Client() as client:
        with connect_sse(client, "POST", "http://localhost:8080/v2/models/text-model/generate_stream", json=inference_request.dict()) as event_source:
            for sse in event_source.iter_sse():
                response = types.InferenceResponse.parse_raw(sse.data)
                print(StringCodec.decode_output(response.outputs[0]))
    
    import grpc
    import mlserver.types as types
    from mlserver.codecs import StringCodec
    from mlserver.grpc.converters import ModelInferResponseConverter
    import mlserver.grpc.converters as converters
    import mlserver.grpc.dataplane_pb2_grpc as dataplane
    
    inference_request = types.InferenceRequest.parse_file("./generate-request.json")
    
    # need to convert from string to bytes for grpc
    inference_request.inputs[0] = StringCodec.encode_input("prompt", inference_request.inputs[0].data.root)
    inference_request_g = converters.ModelInferRequestConverter.from_types(
        inference_request, model_name="text-model", model_version=None
    )
    
    async def get_inference_request_stream(inference_request):
        yield inference_request
    
    async with grpc.aio.insecure_channel("localhost:8081") as grpc_channel:
        grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)
        inference_request_stream = get_inference_request_stream(inference_request_g)
        
        async for response in grpc_stub.ModelStreamInfer(inference_request_stream):
            response = ModelInferResponseConverter.to_types(response)
            print(StringCodec.decode_output(response.outputs[0]))
    %%writefile jsonmodels.py
    import json
    
    from typing import Dict, Any
    from mlserver import MLModel, types
    from mlserver.codecs import StringCodec
    
    
    class JsonHelloWorldModel(MLModel):
        async def load(self) -> bool:
            # Perform additional custom initialization here.
            print("Initialize model")
    
            # Set readiness flag for model
            return await super().load()
    
        async def predict(self, payload: types.InferenceRequest) -> types.InferenceResponse:
            request = self._extract_json(payload)
            response = {
                "request": request,
                "server_response": "Got your request. Hello from the server.",
            }
            response_bytes = json.dumps(response).encode("UTF-8")
    
            return types.InferenceResponse(
                id=payload.id,
                model_name=self.name,
                model_version=self.version,
                outputs=[
                    types.ResponseOutput(
                        name="echo_response",
                        shape=[len(response_bytes)],
                        datatype="BYTES",
                        data=[response_bytes],
                        parameters=types.Parameters(content_type="str"),
                    )
                ],
            )
    
        def _extract_json(self, payload: types.InferenceRequest) -> Dict[str, Any]:
            inputs = {}
            for inp in payload.inputs:
                inputs[inp.name] = json.loads(
                    "".join(self.decode(inp, default_codec=StringCodec))
                )
    
            return inputs
    
    %%writefile settings.json
    {
        "debug": "true"
    }
    %%writefile model-settings.json
    {
        "name": "json-hello-world",
        "implementation": "jsonmodels.JsonHelloWorldModel"
    }
    mlserver start .
    import requests
    import json
    from mlserver.types import InferenceResponse
    from mlserver.codecs.string import StringRequestCodec
    from pprint import PrettyPrinter
    
    pp = PrettyPrinter(indent=1)
    
    inputs = {"name": "Foo Bar", "message": "Hello from Client (REST)!"}
    
    # NOTE: this uses characters rather than encoded bytes. It is recommended that you use the `mlserver` types to assist in the correct encoding.
    inputs_string = json.dumps(inputs)
    
    inference_request = {
        "inputs": [
            {
                "name": "echo_request",
                "shape": [len(inputs_string)],
                "datatype": "BYTES",
                "data": [inputs_string],
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/json-hello-world/infer"
    response = requests.post(endpoint, json=inference_request)
    
    print(f"full response:\n")
    print(response)
    # retrieve text output as dictionary
    inference_response = InferenceResponse.parse_raw(response.text)
    raw_json = StringRequestCodec.decode_response(inference_response)
    output = json.loads(raw_json[0])
    print(f"\ndata part:\n")
    pp.pprint(output)
    import requests
    import json
    import grpc
    from mlserver.codecs.string import StringRequestCodec
    import mlserver.grpc.converters as converters
    import mlserver.grpc.dataplane_pb2_grpc as dataplane
    import mlserver.types as types
    from pprint import PrettyPrinter
    
    pp = PrettyPrinter(indent=1)
    
    model_name = "json-hello-world"
    inputs = {"name": "Foo Bar", "message": "Hello from Client (gRPC)!"}
    inputs_bytes = json.dumps(inputs).encode("UTF-8")
    
    inference_request = types.InferenceRequest(
        inputs=[
            types.RequestInput(
                name="echo_request",
                shape=[len(inputs_bytes)],
                datatype="BYTES",
                data=[inputs_bytes],
                parameters=types.Parameters(content_type="str"),
            )
        ]
    )
    
    inference_request_g = converters.ModelInferRequestConverter.from_types(
        inference_request, model_name=model_name, model_version=None
    )
    
    grpc_channel = grpc.insecure_channel("localhost:8081")
    grpc_stub = dataplane.GRPCInferenceServiceStub(grpc_channel)
    
    response = grpc_stub.ModelInfer(inference_request_g)
    
    print(f"full response:\n")
    print(response)
    # retrieve text output as dictionary
    inference_response = converters.ModelInferResponseConverter.to_types(response)
    raw_json = StringRequestCodec.decode_response(inference_response)
    output = json.loads(raw_json[0])
    print(f"\ndata part:\n")
    pp.pprint(output)
    can_encode(payload: Any) -> bool
    decode_input(request_input: RequestInput) -> List[bytes]
    decode_output(response_output: ResponseOutput) -> List[bytes]
    encode_input(name: str, payload: List[bytes], use_bytes: bool = True, **kwargs) -> RequestInput
    encode_output(name: str, payload: List[bytes], use_bytes: bool = True, **kwargs) -> ResponseOutput
    add_note(...)
    with_traceback(...)
    can_encode(payload: Any) -> bool
    decode_input(request_input: RequestInput) -> List[datetime]
    decode_output(response_output: ResponseOutput) -> List[datetime]
    encode_input(name: str, payload: List[Union[str, datetime]], use_bytes: bool = True, **kwargs) -> RequestInput
    encode_output(name: str, payload: List[Union[str, datetime]], use_bytes: bool = True, **kwargs) -> ResponseOutput
    can_encode(payload: Any) -> bool
    decode_input(request_input: RequestInput) -> Any
    decode_output(response_output: ResponseOutput) -> Any
    encode_input(name: str, payload: Any, **kwargs) -> RequestInput
    encode_output(name: str, payload: Any, **kwargs) -> ResponseOutput
    can_encode(payload: Any) -> bool
    decode_input(request_input: RequestInput) -> ndarray
    decode_output(response_output: ResponseOutput) -> ndarray
    encode_input(name: str, payload: ndarray, **kwargs) -> RequestInput
    encode_output(name: str, payload: ndarray, **kwargs) -> ResponseOutput
    can_encode(payload: Any) -> bool
    decode_request(request: InferenceRequest) -> Any
    decode_response(response: InferenceResponse) -> Any
    encode_request(payload: Any, **kwargs) -> InferenceRequest
    encode_response(model_name: str, payload: Any, model_version: Optional[str] = None, **kwargs) -> InferenceResponse
    can_encode(payload: Any) -> bool
    decode_request(request: InferenceRequest) -> DataFrame
    decode_response(response: InferenceResponse) -> DataFrame
    encode_outputs(payload: DataFrame, use_bytes: bool = True) -> List[ResponseOutput]
    encode_request(payload: DataFrame, use_bytes: bool = True, **kwargs) -> InferenceRequest
    encode_response(model_name: str, payload: DataFrame, model_version: Optional[str] = None, use_bytes: bool = True, **kwargs) -> InferenceResponse
    can_encode(payload: Any) -> bool
    decode_request(request: InferenceRequest) -> Any
    decode_response(response: InferenceResponse) -> Any
    encode_request(payload: Any, **kwargs) -> InferenceRequest
    encode_response(model_name: str, payload: Any, model_version: Optional[str] = None, **kwargs) -> InferenceResponse
    can_encode(payload: Any) -> bool
    decode_input(request_input: RequestInput) -> List[str]
    decode_output(response_output: ResponseOutput) -> List[str]
    encode_input(name: str, payload: List[str], use_bytes: bool = True, **kwargs) -> RequestInput
    encode_output(name: str, payload: List[str], use_bytes: bool = True, **kwargs) -> ResponseOutput
    can_encode(payload: Any) -> bool
    decode_request(request: InferenceRequest) -> Any
    decode_response(response: InferenceResponse) -> Any
    encode_request(payload: Any, **kwargs) -> InferenceRequest
    encode_response(model_name: str, payload: Any, model_version: Optional[str] = None, **kwargs) -> InferenceResponse
    decode_args(predict: Callable) -> Callable[["MLModel", InferenceRequest], Coroutine[Any, Any, InferenceResponse]]
    decode_inference_request(inference_request: InferenceRequest, model_settings: Optional[ModelSettings] = None, metadata_inputs: Dict[str, MetadataTensor] = {}) -> Optional[Any]
    decode_request_input(request_input: RequestInput, metadata_inputs: Dict[str, MetadataTensor] = {}) -> Optional[Any]
    encode_inference_response(payload: Any, model_settings: ModelSettings) -> Optional[InferenceResponse]
    encode_response_output(payload: Any, request_output: RequestOutput, metadata_outputs: Dict[str, MetadataTensor] = {}) -> Optional[ResponseOutput]
    get_decoded(parametrised_obj: Union[InferenceRequest, RequestInput, RequestOutput, ResponseOutput, InferenceResponse]) -> Any
    get_decoded_or_raw(parametrised_obj: Union[InferenceRequest, RequestInput, RequestOutput, ResponseOutput, InferenceResponse]) -> Any
    has_decoded(parametrised_obj: Union[InferenceRequest, RequestInput, RequestOutput, ResponseOutput, InferenceResponse]) -> bool
    register_input_codec(CodecKlass: Union[type[InputCodec], InputCodec])
    register_request_codec(CodecKlass: Union[type[RequestCodec], RequestCodec])
    !wget https://apache.mirrors.nublue.co.uk/kafka/2.8.0/kafka_2.12-2.8.0.tgz
    !tar -zxvf kafka_2.12-2.8.0.tgz
    !./kafka_2.12-2.8.0/bin/kafka-storage.sh format -t OXn8RTSlQdmxwjhKnSB_6A -c ./kafka_2.12-2.8.0/config/kraft/server.properties
    !./kafka_2.12-2.8.0/bin/kafka-server-start.sh ./kafka_2.12-2.8.0/config/kraft/server.properties
    !./kafka_2.12-2.8.0/bin/kafka-topics.sh --create --topic mlserver-input --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092
    !./kafka_2.12-2.8.0/bin/kafka-topics.sh --create --topic mlserver-output --partitions 1 --replication-factor 1 --bootstrap-server localhost:9092
    # Original source code and more details can be found in:
    # https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
    
    # Import datasets, classifiers and performance metrics
    from sklearn import datasets, svm, metrics
    from sklearn.model_selection import train_test_split
    
    # The digits dataset
    digits = datasets.load_digits()
    
    # To apply a classifier on this data, we need to flatten the image, to
    # turn the data in a (samples, feature) matrix:
    n_samples = len(digits.images)
    data = digits.images.reshape((n_samples, -1))
    
    # Create a classifier: a support vector classifier
    classifier = svm.SVC(gamma=0.001)
    
    # Split data into train and test subsets
    X_train, X_test, y_train, y_test = train_test_split(
        data, digits.target, test_size=0.5, shuffle=False)
    
    # We learn the digits on the first half of the digits
    classifier.fit(X_train, y_train)
    import joblib
    
    model_file_name = "mnist-svm.joblib"
    joblib.dump(classifier, model_file_name)
    %%writefile settings.json
    {
        "debug": "true",
        "kafka_enabled": "true"
    }
    %%writefile model-settings.json
    {
        "name": "mnist-svm",
        "implementation": "mlserver_sklearn.SKLearnModel",
        "parameters": {
            "uri": "./mnist-svm.joblib",
            "version": "v0.1.0"
        }
    }
    mlserver start .
    import requests
    
    x_0 = X_test[0:1]
    inference_request = {
        "inputs": [
            {
              "name": "predict",
              "shape": x_0.shape,
              "datatype": "FP32",
              "data": x_0.tolist()
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/mnist-svm/versions/v0.1.0/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    import json
    from kafka import KafkaProducer
    
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    
    headers = {
        "mlserver-model": b"mnist-svm",
        "mlserver-version": b"v0.1.0",
    }
    
    producer.send(
        "mlserver-input",
        json.dumps(inference_request).encode("utf-8"),
        headers=list(headers.items()))
    from kafka import KafkaConsumer
    
    consumer = KafkaConsumer(
        "mlserver-output",
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest")
    
    for msg in consumer:
        print(f"key: {msg.key}")
        print(f"value: {msg.value}\n")
        break
  • Ability to run inference in parallel for vertical scaling across multiple models through a pool of inference workers.

  • Support for adaptive batching, to group inference requests together on the fly.

  • Scalability with deployment in Kubernetes native frameworks, including Seldon Core and KServe (formerly known as KFServing), where MLServer is the core Python inference server used to serve machine learning models.

  • Support for the standard V2 Inference Protocol on both the gRPC and REST flavours, which has been standardised and adopted by various model serving frameworks.

  • You can read more about the goals of this project on the initial design document.

    Usage

    You can install the mlserver package by running:

    Note that to use any of the optional inference runtimes, you'll need to install the relevant package. For example, to serve a scikit-learn model, you would need to install the mlserver-sklearn package:

    For further information on how to use MLServer, you can check any of the available examples.

    Inference Runtimes

    Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice. You can read more about inference runtimes in their documentation page.

    Out of the box, MLServer comes with a set of pre-packaged runtimes which let you interact with a subset of common frameworks. This allows you to start serving models saved in these frameworks straight away. However, it's also possible to write custom runtimes.

    Out of the box, MLServer provides support for:

    | Framework | Supported | Documentation |
    | --- | --- | --- |
    | Scikit-Learn | ✅ | MLServer SKLearn |
    | XGBoost | ✅ | MLServer XGBoost |
    | Spark MLlib | ✅ | MLServer MLlib |
    | LightGBM | ✅ | MLServer LightGBM |
    | CatBoost | ✅ | MLServer CatBoost |
    | Tempo | ✅ | github.com/SeldonIO/tempo |
    | MLflow | ✅ | MLServer MLflow |
    | Alibi-Detect | ✅ | MLServer Alibi Detect |
    | Alibi-Explain | ✅ | MLServer Alibi Explain |
    | HuggingFace | ✅ | MLServer HuggingFace |

    Supported Python Versions

    🔴 Unsupported

    🟠 Deprecated: To be removed in a future version

    🟢 Supported

    🔵 Untested

    | Python Version | Status |
    | --- | --- |
    | 3.7 | 🔴 |
    | 3.8 | 🔴 |
    | 3.9 | 🟢 |
    | 3.10 | 🟢 |
    | 3.11 | 🟢 |
    | 3.12 | 🟢 |
    | 3.13 | 🔴 |

    Examples

    To see MLServer in action, check out our full list of examples. You can find below a few selected examples showcasing how you can leverage MLServer to start serving your machine learning models.

    • Serving a scikit-learn model

    • Serving a xgboost model

    • Serving a lightgbm model

    • Serving a catboost model

    • Serving a tempo pipeline

    • Serving a custom model

    • Serving an alibi-detect model

    • Serving a HuggingFace model

    • Multi-Model Serving with multiple frameworks

    • Loading / unloading models from a model repository

    Developer Guide

    Versioning

    Both the main mlserver package and the inference runtimes packages try to follow the same versioning schema. To bump the version across all of them, you can use the ./hack/update-version.sh script.

    We generally keep the version as a placeholder for an upcoming version.

    For example:

    Testing

    To run all of the tests for MLServer and the runtimes, use:

    To run tests for a single file, use something like:

    | Name | Framework | Source | Trained Model Path |
    | --- | --- | --- | --- |
    | mnist-svm | scikit-learn | MNIST example from the scikit-learn documentation | ./models/mnist-svm/model.joblib |
    | mushroom-xgboost | xgboost | Mushrooms example from the xgboost Getting Started guide | ./models/mushroom-xgboost/model.json |

    Training our mnist-svm model

    Training our mushroom-xgboost model

    Serving

    The next step will be serving both our models within the same MLServer instance. For that, we will just need to create a model-settings.json file local to each of our models and a server-wide settings.json. That is,

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • models/mnist-svm/model-settings.json: holds the configuration specific to our mnist-svm model (e.g. input type, runtime to use, etc.).

    • models/mushroom-xgboost/model-settings.json: holds the configuration specific to our mushroom-xgboost model (e.g. input type, runtime to use, etc.).

    settings.json

    models/mnist-svm/model-settings.json

    models/mushroom-xgboost/model-settings.json

    Start serving our model

    Now that we have our config in place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are, or by pointing to the folder where they are.

    Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.

    Testing

    By this point, we should have both our models getting served by MLServer. To make sure that everything is working as expected, let's send a request from each test set.

    For that, we can use the Python types that the mlserver package provides out of the box, or we can build our request manually.

    Testing our mnist-svm model

    Testing our mushroom-xgboost model

    Multi-Model Serving (MMS)

    MLServer Settings

    Config

    | Attribute | Type | Default |
    | --- | --- | --- |
    | extra | str | "ignore" |
    | env_prefix | str | "MLSERVER_" |
    | env_file | str | ".env" |
    | protected_namespaces | tuple | () |

    Fields

    Serving a custom model

    The mlserver package comes with inference runtime implementations for scikit-learn and xgboost models. However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.

    Overview

    In this example, we will train a numpyro model. The numpyro library streamlines the implementation of probabilistic models, abstracting away advanced inference and training algorithms.

    Out of the box, mlserver doesn't provide an inference runtime for numpyro. However, through this example we will see how easy it is to develop our own.

    Training

    The first step will be to train our model. This will be a very simple bayesian regression model, based on an example provided in the numpyro docs.

    Since this is a probabilistic model, during training we will compute an approximation to the posterior distribution of our model using MCMC.

    Saving our trained model

    Now that we have trained our model, the next step will be to save it so that it can be loaded afterwards at serving-time. Note that, since this is a probabilistic model, we will only need to save the traces that approximate the posterior distribution over latent parameters.

    This will get saved in a numpyro-divorce.json file.

    Serving

    The next step will be to serve our model using mlserver. For that, we will first implement an extension which serves as the runtime to perform inference using our custom numpyro model.

    Custom inference runtime

    Our custom inference wrapper should be responsible for:

    • Loading the model from the set of samples we saved previously.

    • Running inference using our model structure, and the posterior approximated from the samples.

    Settings files

    The next step will be to create 2 configuration files:

    • settings.json: holds the configuration of our server (e.g. ports, log level, etc.).

    • model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).

    settings.json

    model-settings.json

    Start serving our model

    Now that we have our config in place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are, or by pointing to the folder where they are.

    Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.

    Send test inference request

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.

    For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    Deployment

    Now that we have written and tested our custom model, the next step is to deploy it. With that goal in mind, the rough outline of steps will be to first build a custom image containing our code, and then deploy it.

    Specifying requirements

    MLServer will automatically find your requirements.txt file and install the necessary Python packages.

    Building a custom image

    MLServer offers helpers to build a custom Docker image containing your code. In this example, we will use the mlserver build subcommand to create an image, which we'll be able to deploy later.

    Note that this section expects that Docker is available and running in the background, as well as a functional cluster with Seldon Core installed and some familiarity with kubectl.

    To ensure that the image is fully functional, we can spin up a container and then send a test request. To start the container, you can run something along the following lines in a separate terminal:

    As we should be able to see, the server running within our Docker image responds as expected.

    Deploying our custom image

    Now that we've built a custom image and verified that it works as expected, we can move to the next step and deploy it. There is a large number of tools out there to deploy images. However, for our example, we will focus on deploying it to a cluster running Seldon Core.

    For that, we will need to create a SeldonDeployment resource which instructs Seldon Core to deploy a model embedded within our custom image and compliant with the V2 Inference Protocol. This can be achieved by applying (i.e. kubectl apply) a SeldonDeployment manifest to the cluster, similar to the one below:

    Custom Inference Runtimes

    There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.

    This page covers some of the bigger points that need to be taken into account when extending MLServer. You can also see this end-to-end example which walks through the process of writing a custom runtime.

    Writing a custom inference runtime

    MLServer is designed as an easy-to-extend framework, encouraging users to write their own custom runtimes easily. The starting point for this is the MLModel abstract class, whose main methods are:

    pip install mlserver
    pip install mlserver-sklearn
    ./hack/update-version.sh 0.2.0.dev1
    make test
    tox -e py3 -- tests/batch_processing/test_rest.py
    # Original source code and more details can be found in:
    # https://scikit-learn.org/stable/auto_examples/classification/plot_digits_classification.html
    
    # Import datasets, classifiers and performance metrics
    from sklearn import datasets, svm, metrics
    from sklearn.model_selection import train_test_split
    
    # The digits dataset
    digits = datasets.load_digits()
    
    # To apply a classifier on this data, we need to flatten the image, to
    # turn the data in a (samples, feature) matrix:
    n_samples = len(digits.images)
    data = digits.images.reshape((n_samples, -1))
    
    # Create a classifier: a support vector classifier
    classifier = svm.SVC(gamma=0.001)
    
    # Split data into train and test subsets
    X_train, X_test_digits, y_train, y_test_digits = train_test_split(
        data, digits.target, test_size=0.5, shuffle=False)
    
    # We learn the digits on the first half of the digits
    classifier.fit(X_train, y_train)
    import joblib
    import os
    
    mnist_svm_path = os.path.join("models", "mnist-svm")
    os.makedirs(mnist_svm_path, exist_ok=True)
    
    mnist_svm_model_path = os.path.join(mnist_svm_path, "model.joblib")
    joblib.dump(classifier, mnist_svm_model_path)
    # Original code and extra details can be found in:
    # https://xgboost.readthedocs.io/en/latest/get_started.html#python
    
    import os
    import xgboost as xgb
    import requests
    
    from urllib.parse import urlparse
    from sklearn.datasets import load_svmlight_file
    
    
    TRAIN_DATASET_URL = 'https://raw.githubusercontent.com/dmlc/xgboost/master/demo/data/agaricus.txt.train'
    TEST_DATASET_URL = 'https://raw.githubusercontent.com/dmlc/xgboost/master/demo/data/agaricus.txt.test'
    
    
    def _download_file(url: str) -> str:
        parsed = urlparse(url)
        file_name = os.path.basename(parsed.path)
        file_path = os.path.join(os.getcwd(), file_name)
        
        res = requests.get(url)
        
        with open(file_path, 'wb') as file:
            file.write(res.content)
        
        return file_path
    
    train_dataset_path = _download_file(TRAIN_DATASET_URL)
    test_dataset_path = _download_file(TEST_DATASET_URL)
    
    # NOTE: Workaround to load SVMLight files from the XGBoost example
    X_train, y_train = load_svmlight_file(train_dataset_path)
    X_test_agar, y_test_agar = load_svmlight_file(test_dataset_path)
    X_train = X_train.toarray()
    X_test_agar = X_test_agar.toarray()
    
    # read in data
    dtrain = xgb.DMatrix(data=X_train, label=y_train)
    
    # specify parameters via map
    param = {'max_depth':2, 'eta':1, 'objective':'binary:logistic' }
    num_round = 2
    bst = xgb.train(param, dtrain, num_round)
    
    bst
    import os
    
    mushroom_xgboost_path = os.path.join("models", "mushroom-xgboost")
    os.makedirs(mushroom_xgboost_path, exist_ok=True)
    
    mushroom_xgboost_model_path = os.path.join(mushroom_xgboost_path, "model.json")
    bst.save_model(mushroom_xgboost_model_path)
    %%writefile settings.json
    {
        "debug": "true"
    }
    %%writefile models/mnist-svm/model-settings.json
    {
        "name": "mnist-svm",
        "implementation": "mlserver_sklearn.SKLearnModel",
        "parameters": {
            "version": "v0.1.0"
        }
    }
    %%writefile models/mushroom-xgboost/model-settings.json
    {
        "name": "mushroom-xgboost",
        "implementation": "mlserver_xgboost.XGBoostModel",
        "parameters": {
            "version": "v0.1.0"
        }
    }
    
    mlserver start .
    import requests
    
    x_0 = X_test_digits[0:1]
    inference_request = {
        "inputs": [
            {
              "name": "predict",
              "shape": x_0.shape,
              "datatype": "FP32",
              "data": x_0.tolist()
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/mnist-svm/versions/v0.1.0/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    import requests
    
    x_0 = X_test_agar[0:1]
    inference_request = {
        "inputs": [
            {
              "name": "predict",
              "shape": x_0.shape,
              "datatype": "FP32",
              "data": x_0.tolist()
            }
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/mushroom-xgboost/versions/v0.1.0/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | debug | bool | True | - |
    | environments_dir | str | '-' | - |
    | extensions | List[str] | [] | - |
    | grpc_max_message_length | Optional[int] | None | - |
    | grpc_port | int | 8081 | - |
    | gzip_enabled | bool | True | Enable GZipMiddleware. |
    | host | str | '0.0.0.0' | - |
    | http_port | int | 8080 | - |
    | kafka_enabled | bool | False | Enable Kafka integration for the server. |
    | kafka_servers | str | 'localhost:9092' | Comma-separated list of Kafka servers. |
    | kafka_topic_input | str | 'mlserver-input' | Kafka topic for input messages. |
    | kafka_topic_output | str | 'mlserver-output' | Kafka topic for output messages. |
    | load_models_at_startup | bool | True | - |
    | logging_settings | Union[str, Dict[Any, Any], None] | None | Path to logging config file or dictionary configuration. |
    | metrics_dir | str | '-' | Directory used to share metrics across parallel workers. Equivalent to the PROMETHEUS_MULTIPROC_DIR env var in prometheus-client. Note that this won't be used if the parallel_workers flag is disabled. By default, the .metrics folder of the current working directory will be used. |
    | metrics_endpoint | Optional[str] | '/metrics' | Endpoint used to expose Prometheus metrics. Alternatively, can be set to None to disable it. |
    | metrics_port | int | 8082 | Port used to expose metrics endpoint. |
    | metrics_rest_server_prefix | str | 'rest_server' | Metrics rest server string prefix to be exported. |
    | model_repository_implementation | Optional[ImportString] | None | - |
    | model_repository_implementation_args | dict | {} | - |
    | model_repository_root | str | '.' | - |
    | parallel_workers | int | 1 | - |
    | parallel_workers_timeout | int | 5 | - |
    | root_path | str | '' | - |
    | server_name | str | 'mlserver' | - |
    | server_version | str | '1.7.0.dev0' | - |
    | tracing_server | Optional[str] | None | Server name used to export OpenTelemetry tracing to collector service. |
    | use_structured_logging | bool | False | Use JSON-formatted structured logging instead of default format. |
    | cache_enabled | bool | False | Enable caching for the model predictions. |
    | cache_size | int | 100 | Cache size to be used if caching is enabled. |
    | cors_settings | Optional[CORSSettings] | None | - |

    # Original source code and more details can be found in:
    # https://nbviewer.jupyter.org/github/pyro-ppl/numpyro/blob/master/notebooks/source/bayesian_regression.ipynb
    
    
    import numpyro
    import numpy as np
    import pandas as pd
    
    from numpyro import distributions as dist
    from jax import random
    from numpyro.infer import MCMC, NUTS
    
    DATASET_URL = "https://raw.githubusercontent.com/rmcelreath/rethinking/master/data/WaffleDivorce.csv"
    dset = pd.read_csv(DATASET_URL, sep=";")
    
    standardize = lambda x: (x - x.mean()) / x.std()
    
    dset["AgeScaled"] = dset.MedianAgeMarriage.pipe(standardize)
    dset["MarriageScaled"] = dset.Marriage.pipe(standardize)
    dset["DivorceScaled"] = dset.Divorce.pipe(standardize)
    
    
    def model(marriage=None, age=None, divorce=None):
        a = numpyro.sample("a", dist.Normal(0.0, 0.2))
        M, A = 0.0, 0.0
        if marriage is not None:
            bM = numpyro.sample("bM", dist.Normal(0.0, 0.5))
            M = bM * marriage
        if age is not None:
            bA = numpyro.sample("bA", dist.Normal(0.0, 0.5))
            A = bA * age
        sigma = numpyro.sample("sigma", dist.Exponential(1.0))
        mu = a + M + A
        numpyro.sample("obs", dist.Normal(mu, sigma), obs=divorce)
    
    
    # Start from this source of randomness. We will split keys for subsequent operations.
    rng_key = random.PRNGKey(0)
    rng_key, rng_key_ = random.split(rng_key)
    
    num_warmup, num_samples = 1000, 2000
    
    # Run NUTS.
    kernel = NUTS(model)
    mcmc = MCMC(kernel, num_warmup=num_warmup, num_samples=num_samples)
    mcmc.run(
        rng_key_, marriage=dset.MarriageScaled.values, divorce=dset.DivorceScaled.values
    )
    mcmc.print_summary()
    import json
    
    samples = mcmc.get_samples()
    serialisable = {}
    for k, v in samples.items():
        serialisable[k] = np.asarray(v).tolist()
    
    model_file_name = "numpyro-divorce.json"
    with open(model_file_name, "w") as model_file:
        json.dump(serialisable, model_file)
    # %load models.py
    import json
    import numpyro
    import numpy as np
    
    from jax import random
    from mlserver import MLModel
    from mlserver.codecs import decode_args
    from mlserver.utils import get_model_uri
    from numpyro.infer import Predictive
    from numpyro import distributions as dist
    from typing import Optional
    
    
    class NumpyroModel(MLModel):
        async def load(self) -> bool:
            model_uri = await get_model_uri(self._settings)
            with open(model_uri) as model_file:
                raw_samples = json.load(model_file)
    
            self._samples = {}
            for k, v in raw_samples.items():
                self._samples[k] = np.array(v)
    
            self._predictive = Predictive(self._model, self._samples)
    
            return True
    
        @decode_args
        async def predict(
            self,
            marriage: Optional[np.ndarray] = None,
            age: Optional[np.ndarray] = None,
            divorce: Optional[np.ndarray] = None,
        ) -> np.ndarray:
            predictions = self._predictive(
                rng_key=random.PRNGKey(0), marriage=marriage, age=age, divorce=divorce
            )
    
            obs = predictions["obs"]
            obs_mean = obs.mean()
    
            return np.asarray(obs_mean)
    
        def _model(self, marriage=None, age=None, divorce=None):
            a = numpyro.sample("a", dist.Normal(0.0, 0.2))
            M, A = 0.0, 0.0
            if marriage is not None:
                bM = numpyro.sample("bM", dist.Normal(0.0, 0.5))
                M = bM * marriage
            if age is not None:
                bA = numpyro.sample("bA", dist.Normal(0.0, 0.5))
                A = bA * age
            sigma = numpyro.sample("sigma", dist.Exponential(1.0))
            mu = a + M + A
            numpyro.sample("obs", dist.Normal(mu, sigma), obs=divorce)
    
    # %load settings.json
    {
        "debug": "true"
    }
    
    # %load model-settings.json
    {
        "name": "numpyro-divorce",
        "implementation": "models.NumpyroModel",
        "parameters": {
            "uri": "./numpyro-divorce.json"
        }
    }
    
    mlserver start .
    import requests
    import numpy as np
    
    from mlserver.types import InferenceRequest
    from mlserver.codecs import NumpyCodec
    
    x_0 = np.array([28.0])
    inference_request = InferenceRequest(
        inputs=[
            NumpyCodec.encode_input(name="marriage", payload=x_0)
        ]
    )
    
    endpoint = "http://localhost:8080/v2/models/numpyro-divorce/infer"
    response = requests.post(endpoint, json=inference_request.model_dump())
    
    response.json()
    # %load requirements.txt
    numpy==1.22.4
    numpyro==0.8.0
    jax==0.2.24
    jaxlib==0.3.7
    
    This section expects that Docker is available and running in the background.
    %%bash
    mlserver build . -t 'my-custom-numpyro-server:0.1.0'
    docker run -it --rm -p 8080:8080 my-custom-numpyro-server:0.1.0
    import requests
    import numpy as np
    
    from mlserver.types import InferenceRequest
    from mlserver.codecs import NumpyCodec
    
    x_0 = np.array([28.0])
    inference_request = InferenceRequest(
        inputs=[
            NumpyCodec.encode_input(name="marriage", payload=x_0)
        ]
    )
    
    endpoint = "http://localhost:8080/v2/models/numpyro-divorce/infer"
    response = requests.post(endpoint, json=inference_request.model_dump())
    
    response.json()
    This section expects access to a functional Kubernetes cluster with Seldon Core installed and some familiarity with `kubectl`.
    Also consider that, depending on your Kubernetes installation, Seldon Core might expect to get the container image from a public container registry like [Docker hub](https://hub.docker.com/) or [Google Container Registry](https://cloud.google.com/container-registry). For that, you need the extra step of pushing the container to the registry using `docker tag <image name> <container registry>/<image name>` and `docker push <container registry>/<image name>`, and also updating the `image` section of the yaml file to `<container registry>/<image name>`.
    %%writefile seldondeployment.yaml
    apiVersion: machinelearning.seldon.io/v1
    kind: SeldonDeployment
    metadata:
      name: numpyro-model
    spec:
      protocol: v2
      predictors:
        - name: default
          graph:
            name: numpyro-divorce
            type: MODEL
          componentSpecs:
            - spec:
                containers:
                  - name: numpyro-divorce
                    image: my-custom-numpyro-server:0.1.0
  • load(): Responsible for loading any artifacts related to a model (e.g. model weights, pickle files, etc.).

  • unload(): Responsible for unloading the model, freeing any resources (e.g. GPU memory, etc.).

  • predict(): Responsible for using a model to perform inference on an incoming data point.

    Therefore, the "one-line version" of how to write a custom runtime is to write a custom class extending from MLModel, and then overriding those methods with your custom logic.

    Simplified interface

    MLServer exposes an alternative "simplified" interface which can be used to write custom runtimes. This interface can be enabled by decorating your predict() method with the mlserver.codecs.decode_args decorator. This will let you specify in the method signature both how you want your request payload to be decoded and how to encode the response back.

    Based on the information provided in the method signature, MLServer will automatically decode the request payload into the different inputs specified as keyword arguments. Under the hood, this is implemented through MLServer's codecs and content types system.

    MLServer's "simplified" interface aims to cover use cases where encoding / decoding can be done through one of the codecs built into the MLServer package. However, there are instances where this may not be enough (e.g. variable number of inputs, variable content types, etc.). For these types of cases, please use MLServer's "advanced" interface, where you will have full control over the encoding / decoding process.

    As an example of the above, let's assume a model which

    • Takes two lists of strings as inputs:

      • questions, containing multiple questions to ask our model.

      • context, containing multiple contexts for each of the questions.

    • Returns a Numpy array with some predictions as the output.

    Leveraging MLServer's simplified notation, we can represent the above as the following custom runtime:

    Note that the method signature of our predict method now specifies:

    • The input names that we should be looking for in the request payload (i.e. questions and context).

    • The expected content type for each of the request inputs (i.e. List[str] in both cases).

    • The expected content type of the response outputs (i.e. np.ndarray).

    Read and write headers

    The headers field within the parameters section of the request / response is managed by MLServer. Therefore, incoming payloads where this field has been explicitly modified will be overridden.

    There are occasions where custom logic must be made conditional to extra information sent by the client outside of the payload. To allow for these use cases, MLServer will map all incoming HTTP headers (in the case of REST) or metadata (in the case of gRPC) into the headers field of the parameters object within the InferenceRequest instance.

    Similarly, to return any HTTP headers (in the case of REST) or metadata (in the case of gRPC), you can append any values to the headers field within the parameters object of the returned InferenceResponse instance.

    Loading a custom MLServer runtime

    MLServer lets you load custom runtimes dynamically into a running instance of MLServer. Once you have your custom runtime ready, all you need to do is move it to your model folder, next to your model-settings.json configuration file.

    For example, if we assume a flat model repository where each folder represents a model, you would end up with a folder structure like the one below:

    Note that, from the example above, we are assuming that:

    • Your custom runtime code lives in the models.py file.

    • The implementation field of your model-settings.json configuration file contains the import path of your custom runtime (e.g. models.MyCustomRuntime).

    Loading a custom Python environment

    More often than not, your custom runtimes will depend on external 3rd party dependencies which are not included within the main MLServer package. In these cases, to load your custom runtime, MLServer will need access to these dependencies.

    It is possible to load this custom set of dependencies by providing them through an environment tarball, whose path can be specified within your model-settings.json file.
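    As a sketch of how such a tarball could be produced (assuming conda and the conda-pack tool are available; the environment name and file below are hypothetical):

    # Create the custom environment from an environment file
    conda env create -n custom-runtime-env -f environment.yml

    # Package it as a portable tarball that MLServer can unpack at load time
    conda pack -n custom-runtime-env -o environment.tar.gz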

    To load a custom environment, parallel inference must be enabled.

    The main MLServer process communicates with workers in custom environments via multiprocessing.Queue using pickled objects. Custom environments therefore must use the same version of MLServer and a compatible version of Python with the same default pickle protocol as the main process. Consult the tables below for environment compatibility.

    | Status | Description |
    | --- | --- |
    | 🔴 | Unsupported |
    | 🟢 | Supported |
    | 🔵 | Untested |

    | Worker Python \ Server Python | 3.9 | 3.10 | 3.11 |
    | --- | --- | --- | --- |
    | 3.9 | 🟢 | 🟢 | 🔵 |
    | 3.10 | 🟢 | 🟢 | 🔵 |
    | 3.11 | 🔵 | 🔵 | 🔵 |
    If we take the previous example above as a reference, we could extend it to include our custom environment as:

    Note that, in the folder layout above, we are assuming that:

    • The environment.tar.gz tarball contains a pre-packaged version of your custom environment.

    • The environment_tarball field of your model-settings.json configuration file points to your pre-packaged custom environment (i.e. ./environment.tar.gz).

    Building a custom MLServer image

    The mlserver build command expects that a Docker runtime is available and running in the background.

    MLServer offers built-in utilities to help you build a custom MLServer image. This image can contain any custom code (including custom inference runtimes), as well as any custom environment, provided either through a Conda environment file or a requirements.txt file.

    To leverage these, we can use the mlserver build command. Assuming that we're currently on the folder containing our custom inference runtime, we should be able to just run:

    The output will be a Docker image named my-custom-server, ready to be used.

    Custom Environment

    The mlserver build subcommand will search for any Conda environment file (i.e. named either as environment.yaml or conda.yaml) and / or any requirements.txt present in your root folder. These can be used to tell MLServer what Python environment is required in the final Docker image.
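    For instance, a minimal environment.yaml picked up by mlserver build could look along these lines (the Python version and package pins are purely illustrative):

    name: custom-runtime-env
    channels:
      - conda-forge
    dependencies:
      - python=3.10
      - pip
      - pip:
          - numpyro==0.8.0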

    The environment built by the mlserver build subcommand will be global to the whole MLServer image (i.e. every loaded model will, by default, use that custom environment). For Multi-Model Serving scenarios, it may be better to use per-model custom environments instead, which will allow you to run multiple custom environments at the same time.

    Default Settings

    The mlserver build subcommand will treat any settings.json or model-settings.json files present on your root folder as the default settings that must be set in your final image. Therefore, these files can be used to configure things like the default inference runtime to be used, or to even include embedded models that will always be present within your custom image.

    Default setting values can still be overridden by external environment variables or model-specific model-settings.json files.
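    For example, since server settings map to environment variables prefixed with MLSERVER_, a default baked into the image could be overridden at runtime along these lines (the port value and image name are illustrative):

    docker run -e MLSERVER_HTTP_PORT=9090 -p 9090:9090 my-custom-server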

    Custom Dockerfile

    Out-of-the-box, the mlserver build subcommand leverages a default Dockerfile which takes into account a number of requirements, like

    • Supporting arbitrary user IDs.

    • Building your base custom environment on the fly.

    • Configuring a set of default setting values.

    However, there may be occasions where you need to customise your Dockerfile even further. This may be the case, for example, when you need to provide extra environment variables or when you need to customise your Docker build process (e.g. by using other "Docker-less" tools, like Kaniko or Buildah).

    To account for these cases, MLServer also includes a mlserver dockerfile subcommand which will just generate a Dockerfile (and optionally a .dockerignore file) exactly like the one used by the mlserver build command. This Dockerfile can then be customised according to your needs.

    The base Dockerfile requires Docker's BuildKit to be enabled. To ensure BuildKit is used, you can set the DOCKER_BUILDKIT=1 environment variable, e.g.


    Deploying a Custom Tensorflow Model with MLServer and Seldon Core

    Background

    Intro

    This tutorial walks through the steps required to take a Python ML model from your machine to a production deployment on Kubernetes. More specifically, we'll cover:

    • Running the model locally

    • Turning the ML model into an API

    • Containerizing the model

    • Storing the container in a registry

    • Deploying the model to Kubernetes (with Seldon Core)

    • Scaling the model

    The tutorial comes with an accompanying video which you might find useful as you work through the steps:

    The slides used in the video can be found here.

    The Use Case

    For this tutorial, we're going to use the Cassava dataset available from the Tensorflow Catalog. This dataset includes leaf images from the cassava plant. Each plant can be classified as either "healthy" or as having one of four diseases (Mosaic Disease, Bacterial Blight, Green Mite, Brown Streak Disease).

    We won't go through the steps of training the classifier. Instead, we'll be using a pre-trained one available on TensorFlow Hub. You can find the model details here.

    Getting Set Up

    The easiest way to run this example is to clone the repository located here:

    If you've already cloned the MLServer repository, you can also find it in docs/examples/cassava.

    Once you've done that, you can just run:

    And it'll set you up with all the libraries required to run the code.

    Running The Python App

    The starting point for this tutorial is the Python script app.py. This is typical of the kind of Python code we'd run standalone or in a jupyter notebook. Let's familiarise ourselves with the code:

    First up, we're importing a couple of functions from our helpers.py file:

    • plot provides the visualisation of the samples, labels and predictions.

    • preprocess is used to resize images to 224x224 pixels and normalize the RGB values.

    The rest of the code is fairly self-explanatory from the comments. We load the model and dataset, select some examples, make predictions and then plot the results.

    Try it yourself by running:

    Here's what our setup currently looks like:

    Creating an API for The Model

    The problem with running our code like we did earlier is that it's not accessible to anyone who doesn't have the Python script (and all of its dependencies). A good way to solve this is to turn our model into an API.

    Typically people turn to popular Python web servers like Flask or FastAPI. This is a good approach and gives us lots of flexibility, but it also requires us to do a lot of the work ourselves. We need to implement routes, set up logging, capture metrics and define an API schema, among other things. A simpler way to tackle this problem is to use an inference server. For this tutorial we're going to use the open source MLServer framework.

    MLServer supports a bunch of inference runtimes out of the box, but it also supports custom Python code, which is what we'll use for our Tensorflow model.

    Setting Things Up

    In order to get our model ready to run on MLServer we need to wrap it in a single python class with two methods, load() and predict(). Let's take a look at the code (found in model/serve-model.py):

    The load() method is used to define any logic required to set up our model for inference. In our case, we're loading the model weights into self._model. The predict() method is where we include all of our prediction logic.

    You may notice that we've slightly modified our code from earlier (in app.py). The biggest change is that it is now wrapped in a single class CassavaModel.

    The only other task we need to do to run our model on MLServer is to specify a model-settings.json file:

    This is a simple configuration file that tells MLServer how to handle our model. In our case, we've provided a name for our model and told MLServer where to look for our model class (serve-model.CassavaModel).

    Serving The Model

    We're now ready to serve our model with MLServer. To do that we can simply run:

    MLServer will now start up, load our cassava model and provide access through both a REST and gRPC API.

    Making Predictions Using The API

    Now that our API is up and running, open a new terminal window and navigate back to the root of this repository. We can then send predictions to our API using the test.py file by running:

    Our setup has now evolved and looks like this:

    Containerizing The Model

    Containers are an easy way to package our application together with its runtime and dependencies. More importantly, containerizing our model allows it to run in a variety of different environments.

    Note: you will need Docker installed to run this section of the tutorial. You'll also need a Docker Hub account or another container registry.

    Taking our model and packaging it into a container manually can be a pretty tricky process and requires knowledge of writing Dockerfiles. Thankfully MLServer removes this complexity and provides us with a simple build command.

    Before we run this command, we need to provide our dependencies in either a requirements.txt or a conda.env file. The requirements file we'll use for this example is stored in model/requirements.txt:

    Notice that we didn't need to include mlserver in our requirements. That's because the builder image has mlserver included already.

    We're now ready to build our container image using:

    Make sure you replace YOUR_CONTAINER_REGISTRY and IMAGE_NAME with your dockerhub username and a suitable name e.g. "bobsmith/cassava".

    MLServer will now build the model into a container image for us. We can check the output of this by running:

    Finally, we want to send this container image to be stored in our container registry. We can do this by running:

    Our setup now looks like this, with our model packaged and sent to a container registry:

    Deploying to Kubernetes

    Now that we've turned our model into a production-ready API, containerized it and pushed it to a registry, it's time to deploy our model.

    We're going to use a popular open source framework called Seldon Core to deploy our model. Seldon Core is great because it combines all of the awesome cloud-native features we get from Kubernetes, but it also adds machine-learning specific features.

    This tutorial assumes you already have a Seldon Core cluster up and running. If that's not the case, head over to the installation instructions and get set up first. You'll also need to install the kubectl command line interface.

    Creating the Deployment

    To create our deployment with Seldon Core we need to create a small configuration file that looks like this:

    You can find this file named deployment.yaml in the base folder of this tutorial's repository.

    Make sure you replace YOUR_CONTAINER_REGISTRY and IMAGE_NAME with your dockerhub username and a suitable name e.g. "bobsmith/cassava".

    We can apply this configuration file to our Kubernetes cluster just like we would for any other Kubernetes object using:

    To check our deployment is up and running we can run:

    We should see STATUS = Running once our deployment has finalized.

    Testing the Deployment

    Now that our model is up and running on a Kubernetes cluster (via Seldon Core), we can send some test inference requests to make sure it's working.

    To do this, we simply run the test.py file in the following way:

    This script will randomly select some test samples, send them to the cluster, gather the predictions and then plot them for us.

    A note on running this yourself: This example is set up to connect to a Kubernetes cluster running locally on your machine. If yours is local too, you'll need to make sure you port forward before sending requests. If your cluster is remote, you'll need to change the inference_url variable on line 21 of test.py.

    Having deployed our model to kubernetes and tested it, our setup now looks like this:

    Scaling the Model

    Our model is now running in a production environment and able to handle requests from external sources. This is awesome, but what happens as the number of requests being sent to our model starts to increase? Eventually, we'll reach the limit of what a single server can handle. Thankfully, we can get around this problem by scaling our model horizontally.

    Kubernetes and Seldon Core make this really easy to do by simply running:

    We can replace the --replicas=3 with any number we want to scale to.

    To watch the servers scaling out we can run:

    Once the new replicas have finished rolling out, our setup now looks like this:

    In this tutorial we've scaled the model out manually to show how it works. In a real environment we'd want to set up auto-scaling to make sure our prediction API is always online and performing as expected.

    Serving HuggingFace Transformer Models

    Out of the box, MLServer supports the deployment and serving of HuggingFace Transformer models with the following features:

    • Loading of Transformer Model artifacts from the Hugging Face Hub.

    • Model quantization & optimization using the Hugging Face Optimum library

    • Request batching for GPU optimization (via adaptive batching and request batching)

    In this example, we will showcase some of these features using an example model.

    Serving MLflow models

    Out of the box, MLServer supports the deployment and serving of MLflow models with the following features:

    • Loading of MLflow Model artifacts.

    • Support of dataframes, dict-of-tensors and tensor inputs.

    In this example, we will showcase some of these features using an example model.

    {
      "name": "sum-model",
      "implementation": "models.MyCustomRuntime"
    }
    {
      "name": "sum-model",
      "implementation": "models.MyCustomRuntime",
      "parameters": {
        "environment_tarball": "./environment.tar.gz"
      }
    }
    DOCKER_BUILDKIT=1 docker build . -t my-custom-runtime:0.1.0
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse
    
    class MyCustomRuntime(MLModel):
    
      async def load(self) -> bool:
        # TODO: Replace for custom logic to load a model artifact
        self._model = load_my_custom_model()
        return True
    
      async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # TODO: Replace for custom logic to run inference
        return self._model.predict(payload)
    import numpy as np

    from mlserver import MLModel
    from mlserver.codecs import decode_args
    from typing import List
    
    class MyCustomRuntime(MLModel):
    
      async def load(self) -> bool:
        # TODO: Replace for custom logic to load a model artifact
        self._model = load_my_custom_model()
        return True
    
      @decode_args
      async def predict(self, questions: List[str], context: List[str]) -> np.ndarray:
        # TODO: Replace for custom logic to run inference
        return self._model.predict(questions, context)
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse
    
    class CustomHeadersRuntime(MLModel):
    
      ...
    
      async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        if payload.parameters and payload.parameters.headers:
          # These are all the incoming HTTP headers / gRPC metadata
          print(payload.parameters.headers)
        ...
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse, Parameters
    
    class CustomHeadersRuntime(MLModel):
    
      ...
    
      async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        ...
        return InferenceResponse(
          # Include any actual outputs from inference
          outputs=[],
          parameters=Parameters(headers={"foo": "bar"})
        )
    .
    └── models
        └── sum-model
            ├── model-settings.json
            ├── models.py
    .
    └── models
        └── sum-model
            ├── environment.tar.gz
            ├── model-settings.json
            ├── models.py
    mlserver build . -t my-custom-server

    Serving

    Since we're using a pretrained model, we can skip straight to serving.

    model-settings.json

    Now that we have our config in place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are, or by pointing to the folder where they are.

    Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.

    Send test inference request

    Using Optimum Optimized Models

    We can also leverage the Optimum library that allows us to access quantized and optimized models.

    We can download pretrained optimized models from the hub if available by enabling the optimum_model flag:

    Once again, you are able to run the model using the MLServer CLI. As before, this needs to either be run from the same directory where our config files are, or by pointing to the folder where they are.

    Send Test Request to Optimum Optimized Model

    The request can now be sent using the same request structure but using optimized models for better performance.

    Testing Supported Tasks

    The runtime supports many transformer tasks beyond text generation; below are examples for a few of the other supported tasks.

    Question Answering
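    As a sketch, a model-settings.json for a question-answering pipeline could look like the following (relying on the task's default pretrained model; adjust as needed):

    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "parameters": {
            "extra": {
                "task": "question-answering"
            }
        }
    }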

    Once again, you are able to run the model using the MLServer CLI.

    Sentiment Analysis

    Once again, you are able to run the model using the MLServer CLI.

    GPU Acceleration

    We can also evaluate GPU acceleration by comparing the speed on CPU vs GPU using the following parameters.

    Testing with CPU

    We first test the time taken with device=-1, which configures the CPU by default.

    Once again, you are able to run the model using the MLServer CLI.

    We can see that it takes 81 seconds, which is 8 times longer than the GPU example below.

    Testing with GPU

    IMPORTANT: Running the code below requires a machine with a GPU configured correctly to work for Tensorflow/Pytorch.

    Now we'll run the benchmark with GPU configured, which we can do by setting device=0

    We can see that the elapsed time is 8 times less than the CPU version!

    Adaptive Batching with GPU

    We can also see how the adaptive batching capabilities can allow for GPU acceleration by grouping multiple incoming requests so they get processed on the GPU as a single batch.

    In our case, we can enable adaptive batching with the max_batch_size parameter, which we will set to 128.

    We will also configure max_batch_time, which specifies the maximum amount of time the MLServer orchestrator will wait before sending the batch for inference.
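    Putting those two settings together, a model-settings.json along these lines would enable adaptive batching (the max_batch_time value shown is illustrative):

    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "max_batch_size": 128,
        "max_batch_time": 1,
        "parameters": {
            "extra": {
                "task": "text-generation",
                "pretrained_model": "distilgpt2"
            }
        }
    }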

    In order to achieve the required throughput of 50 requests per second, we will use the load-testing tool vegeta.

    We can now see that the requests are batched and we receive 100% success, even though the requests are sent one-by-one.

    Training

    The first step will be to train and serialise an MLflow model. For that, we will use the linear regression example from the MLflow docs.

    The training script will also serialise our trained model, leveraging the MLflow Model format. By default, we should be able to find the saved artifact under the mlruns folder.

    Serving

    Now that we have trained and serialised our model, we are ready to start serving it. For that, the initial step will be to set up a model-settings.json that instructs MLServer to load our artifact using the MLflow Inference Runtime.
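    A minimal model-settings.json for this could look like the following sketch, assuming the serialised artifact lives under a local mlruns run folder (the exact run path will differ on your machine):

    {
        "name": "wine-classifier",
        "implementation": "mlserver_mlflow.MLflowRuntime",
        "parameters": {
            "uri": "./mlruns/0/{run_id}/artifacts/model"
        }
    }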

    Now that we have our config in place, we can start the server by running mlserver start .. This needs to either be run from the same directory where our config files are, or by pointing to the folder where they are.

    Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.

    Send test inference request

    We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.

    Note that the request specifies the value pd as its content type, whereas every input specifies the content type np. These parameters will instruct MLServer to:

    • Convert every input value to a NumPy array, using the data type and shape information provided.

    • Group all the different inputs into a Pandas DataFrame, using their names as the column names.
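    As an illustration of the above, a request built along these lines combines the pd request-level content type with np inputs (only two of the wine features are shown; the real request includes one input per column the model was trained on):

    import requests

    inference_request = {
        "parameters": {"content_type": "pd"},
        "inputs": [
            {
                "name": "fixed acidity",
                "shape": [1],
                "datatype": "FP32",
                "data": [7.4],
                "parameters": {"content_type": "np"},
            },
            {
                "name": "alcohol",
                "shape": [1],
                "datatype": "FP32",
                "data": [9.4],
                "parameters": {"content_type": "np"},
            },
        ],
    }

    endpoint = "http://localhost:8080/v2/models/wine-classifier/infer"
    response = requests.post(endpoint, json=inference_request)
    response.json()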

    To learn more about how MLServer uses content type parameters, you can check this worked out example.

    As we can see in the output above, the predicted quality score for our input wine was 5.57.

    MLflow Scoring Protocol

    MLflow currently ships with a scoring server with its own protocol. In order to provide a drop-in replacement, the MLflow runtime in MLServer also exposes a custom endpoint which matches the signature of MLflow's /invocations endpoint.

    As an example, we can try to send the same request that we sent previously, but using MLflow's protocol. Note that, in both cases, the request will be handled by the same MLServer instance.
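    For instance, a call against that endpoint could be sketched as follows (the payload uses MLflow's dataframe_split format; only two illustrative columns are shown and the exact schema may differ across MLflow versions):

    import requests

    invocations_request = {
        "dataframe_split": {
            "columns": ["fixed acidity", "alcohol"],
            "data": [[7.4, 9.4]],
        }
    }

    endpoint = "http://localhost:8080/invocations"
    response = requests.post(endpoint, json=invocations_request)
    response.json()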

    As we can see above, the predicted quality for our input is 5.57, matching the prediction we obtained above.

    MLflow Model Signature

    MLflow lets users define a model signature, where they can specify what types of inputs the model accepts and what types of outputs it returns. Similarly, the V2 inference protocol employed by MLServer defines a metadata endpoint which can be used to query what inputs and outputs the model accepts. However, even though they serve similar functions, the data schemas used by each of them are not compatible with each other.

    To solve this, if your model defines an MLflow model signature, MLServer will convert this signature on the fly to a metadata schema compatible with the V2 Inference Protocol. This will also include specifying any extra content type that is required to correctly decode / encode your data.

    As an example, we can first have a look at the model signature saved for our MLflow model. This can be seen directly on the MLModel file saved by our model.

    We can then query the metadata endpoint, to see the model metadata inferred by MLServer from our test model's signature. For this, we will use the /v2/models/wine-classifier/ endpoint.
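    A sketch of that metadata query:

    import requests

    endpoint = "http://localhost:8080/v2/models/wine-classifier/"
    response = requests.get(endpoint)
    response.json()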

    As we should be able to see, the model metadata now matches the information contained in our model signature, including any extra content types necessary to decode our data correctly.

    git clone https://github.com/SeldonIO/cassava-example.git
    cd cassava-example/
    pip install -r requirements.txt
    from helpers import plot, preprocess
    import tensorflow as tf
    import tensorflow_datasets as tfds
    import tensorflow_hub as hub
    
    # Fixes an issue with Jax and TF competing for GPU
    tf.config.experimental.set_visible_devices([], 'GPU')
    
    # Load the model
    model_path = './model'
    classifier = hub.KerasLayer(model_path)
    
    # Load the dataset and store the class names
    dataset, info = tfds.load('cassava', with_info=True)
    class_names = info.features['label'].names + ['unknown']
    
    # Select a batch of examples and plot them
    batch_size = 9
    batch = dataset['validation'].map(preprocess).batch(batch_size).as_numpy_iterator()
    examples = next(batch)
    plot(examples, class_names)
    
    # Generate predictions for the batch and plot them against their labels
    predictions = classifier(examples['image'])
    predictions_max = tf.argmax(predictions, axis=-1)
    print(predictions_max)
    plot(examples, class_names, predictions_max)
    python app.py
    from mlserver import MLModel
    from mlserver.codecs import decode_args
    import numpy as np
    import tensorflow as tf
    import tensorflow_hub as hub
    
    # Define a class for our Model, inheriting the MLModel class from MLServer
    class CassavaModel(MLModel):
    
      # Load the model into memory
      async def load(self) -> bool:
        tf.config.experimental.set_visible_devices([], 'GPU')
        model_path = '.'
        self._model = hub.KerasLayer(model_path)
        self.ready = True
        return self.ready
    
      # Logic for making predictions against our model
      @decode_args
      async def predict(self, payload: np.ndarray) -> np.ndarray:
        # convert payload to tf.tensor
        payload_tensor = tf.constant(payload)
    
        # Make predictions
        predictions = self._model(payload_tensor)
        predictions_max = tf.argmax(predictions, axis=-1)
    
        # convert predictions to np.ndarray
        response_data = np.array(predictions_max)
    
        return response_data
    {
        "name": "cassava",
        "implementation": "serve-model.CassavaModel"
    }
    mlserver start model/
    python test.py --local
    tensorflow==2.12.0
    tensorflow-hub==0.13.0
    mlserver build model/ -t [YOUR_CONTAINER_REGISTRY]/[IMAGE_NAME]
    docker images
    docker push [YOUR_CONTAINER_REGISTRY]/[IMAGE_NAME]
    apiVersion: machinelearning.seldon.io/v1
    kind: SeldonDeployment
    metadata:
      name: cassava
    spec:
      protocol: v2
      predictors:
        - componentSpecs:
            - spec:
                containers:
                  - image: YOUR_CONTAINER_REGISTRY/IMAGE_NAME
                    name: cassava
                    imagePullPolicy: Always
          graph:
            name: cassava
            type: MODEL
          name: cassava
    kubectl create -f deployment.yaml
    kubectl get pods
    python test.py --remote
    kubectl scale sdep cassava --replicas=3
    kubectl get pods --watch
    # Import required dependencies
    import requests
    %%writefile ./model-settings.json
    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "parameters": {
            "extra": {
                "task": "text-generation",
                "pretrained_model": "distilgpt2"
            }
        }
    }
    Overwriting ./model-settings.json
    mlserver start .
    inference_request = {
        "inputs": [
            {
                "name": "args",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["this is a test"],
            }
        ]
    }
    
    requests.post(
        "http://localhost:8080/v2/models/transformer/infer", json=inference_request
    ).json()
    {'model_name': 'transformer',
     'id': 'eb160c6b-8223-4342-ad92-6ac301a9fa5d',
     'parameters': {},
     'outputs': [{'name': 'output',
       'shape': [1, 1],
       'datatype': 'BYTES',
       'parameters': {'content_type': 'hg_jsonlist'},
       'data': ['{"generated_text": "this is a testnet with 1-3,000-bit nodes as nodes."}']}]}
    %%writefile ./model-settings.json
    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "parameters": {
            "extra": {
                "task": "text-generation",
                "pretrained_model": "distilgpt2",
                "optimum_model": true
            }
        }
    }
    Overwriting ./model-settings.json
    mlserver start .
    inference_request = {
        "inputs": [
            {
                "name": "args",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["this is a test"],
            }
        ]
    }
    
    requests.post(
        "http://localhost:8080/v2/models/transformer/infer", json=inference_request
    ).json()
    {'model_name': 'transformer',
     'id': '9c482c8d-b21e-44b1-8a42-7650a9dc01ef',
     'parameters': {},
     'outputs': [{'name': 'output',
       'shape': [1, 1],
       'datatype': 'BYTES',
       'parameters': {'content_type': 'hg_jsonlist'},
       'data': ['{"generated_text": "this is a test of the \\"safe-code-safe-code-safe-code\\" approach. The method only accepts two parameters as parameters: the code. The parameter \'unsafe-code-safe-code-safe-code\' should"}']}]}
    %%writefile ./model-settings.json
    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "parameters": {
            "extra": {
                "task": "question-answering"
            }
        }
    }
    Overwriting ./model-settings.json
    mlserver start .
    inference_request = {
        "inputs": [
            {
                "name": "question",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["what is your name?"],
            },
            {
                "name": "context",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["Hello, I am Seldon, how is it going"],
            },
        ]
    }
    
    requests.post(
        "http://localhost:8080/v2/models/transformer/infer", json=inference_request
    ).json()
    {'model_name': 'transformer',
     'id': '4efac938-86d8-41a1-b78f-7690b2dcf197',
     'parameters': {},
     'outputs': [{'name': 'output',
       'shape': [1, 1],
       'datatype': 'BYTES',
       'parameters': {'content_type': 'hg_jsonlist'},
       'data': ['{"score": 0.9869915843009949, "start": 12, "end": 18, "answer": "Seldon"}']}]}
    %%writefile ./model-settings.json
    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "parameters": {
            "extra": {
                "task": "text-classification"
            }
        }
    }
    Overwriting ./model-settings.json
    mlserver start .
    inference_request = {
        "inputs": [
            {
                "name": "args",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["This is terrible!"],
            }
        ]
    }
    
    requests.post(
        "http://localhost:8080/v2/models/transformer/infer", json=inference_request
    ).json()
    {'model_name': 'transformer',
     'id': '835eabbd-daeb-4423-a64f-a7c4d7c60a9b',
     'parameters': {},
     'outputs': [{'name': 'output',
       'shape': [1, 1],
       'datatype': 'BYTES',
       'parameters': {'content_type': 'hg_jsonlist'},
       'data': ['{"label": "NEGATIVE", "score": 0.9996137022972107}']}]}
    %%writefile ./model-settings.json
    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "max_batch_size": 128,
        "max_batch_time": 1,
        "parameters": {
            "extra": {
                "task": "text-generation",
                "device": -1
            }
        }
    }
    Overwriting ./model-settings.json
    mlserver start .
    inference_request = {
        "inputs": [
            {
                "name": "text_inputs",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["This is a generation for the work" for i in range(512)],
            }
        ]
    }
    
    # Benchmark time
    import time
    
    start_time = time.monotonic()
    
    requests.post(
        "http://localhost:8080/v2/models/transformer/infer", json=inference_request
    )
    
    print(f"Elapsed time: {time.monotonic() - start_time}")
    Elapsed time: 66.42268538899953
    %%writefile ./model-settings.json
    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "parameters": {
            "extra": {
                "task": "text-generation",
                "device": 0
            }
        }
    }
    Overwriting ./model-settings.json
    inference_request = {
        "inputs": [
            {
                "name": "text_inputs",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["This is a generation for the work" for i in range(512)],
            }
        ]
    }
    
    # Benchmark time
    import time
    
    start_time = time.monotonic()
    
    requests.post(
        "http://localhost:8080/v2/models/transformer/infer", json=inference_request
    )
    
    print(f"Elapsed time: {time.monotonic() - start_time}")
    Elapsed time: 11.27933280000434
    %%writefile ./model-settings.json
    {
        "name": "transformer",
        "implementation": "mlserver_huggingface.HuggingFaceRuntime",
        "max_batch_size": 128,
        "max_batch_time": 1,
        "parameters": {
            "extra": {
                "task": "text-generation",
                "pretrained_model": "distilgpt2",
                "device": 0
            }
        }
    }
    Overwriting ./model-settings.json
    %%bash
    jq -ncM '{"method": "POST", "header": {"Content-Type": ["application/json"] }, "url": "http://localhost:8080/v2/models/transformer/infer", "body": "{\"inputs\":[{\"name\":\"text_inputs\",\"shape\":[1],\"datatype\":\"BYTES\",\"data\":[\"test\"]}]}" | @base64 }' \
              | vegeta \
                    -cpus="2" \
                    attack \
                    -duration="3s" \
                    -rate="50" \
                    -format=json \
              | vegeta \
                    report \
                    -type=text
    Requests      [total, rate, throughput]         150, 50.34, 22.28
    Duration      [total, attack, wait]             6.732s, 2.98s, 3.753s
    Latencies     [min, mean, 50, 90, 95, 99, max]  1.975s, 3.168s, 3.22s, 4.065s, 4.183s, 4.299s, 4.318s
    Bytes In      [total, mean]                     60978, 406.52
    Bytes Out     [total, mean]                     12300, 82.00
    Success       [ratio]                           100.00%
    Status Codes  [code:count]                      200:150  
    Error Set:
    from IPython.core.magic import register_line_cell_magic
    
    @register_line_cell_magic
    def writetemplate(line, cell):
        with open(line, 'w') as f:
            f.write(cell.format(**globals()))
    # %load src/train.py
    # Original source code and more details can be found in:
    # https://www.mlflow.org/docs/latest/tutorials-and-examples/tutorial.html
    
    # The data set used in this example is from
    # http://archive.ics.uci.edu/ml/datasets/Wine+Quality
    # P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
    # Modeling wine preferences by data mining from physicochemical properties.
    # In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
    
    import warnings
    import sys
    
    import pandas as pd
    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import ElasticNet
    from urllib.parse import urlparse
    import mlflow
    import mlflow.sklearn
    from mlflow.models.signature import infer_signature
    
    import logging
    
    logging.basicConfig(level=logging.WARN)
    logger = logging.getLogger(__name__)
    
    
    def eval_metrics(actual, pred):
        rmse = np.sqrt(mean_squared_error(actual, pred))
        mae = mean_absolute_error(actual, pred)
        r2 = r2_score(actual, pred)
        return rmse, mae, r2
    
    
    if __name__ == "__main__":
        warnings.filterwarnings("ignore")
        np.random.seed(40)
    
        # Read the wine-quality csv file from the URL
        csv_url = (
            "http://archive.ics.uci.edu/ml"
            "/machine-learning-databases/wine-quality/winequality-red.csv"
        )
        try:
            data = pd.read_csv(csv_url, sep=";")
        except Exception as e:
            logger.exception(
                "Unable to download training & test CSV, "
                "check your internet connection. Error: %s",
                e,
            )
    
        # Split the data into training and test sets. (0.75, 0.25) split.
        train, test = train_test_split(data)
    
        # The predicted column is "quality" which is a scalar from [3, 9]
        train_x = train.drop(["quality"], axis=1)
        test_x = test.drop(["quality"], axis=1)
        train_y = train[["quality"]]
        test_y = test[["quality"]]
    
        alpha = float(sys.argv[1]) if len(sys.argv) > 1 else 0.5
        l1_ratio = float(sys.argv[2]) if len(sys.argv) > 2 else 0.5
    
        with mlflow.start_run():
            lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
            lr.fit(train_x, train_y)
    
            predicted_qualities = lr.predict(test_x)
    
            (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
    
            print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
            print("  RMSE: %s" % rmse)
            print("  MAE: %s" % mae)
            print("  R2: %s" % r2)
    
            mlflow.log_param("alpha", alpha)
            mlflow.log_param("l1_ratio", l1_ratio)
            mlflow.log_metric("rmse", rmse)
            mlflow.log_metric("r2", r2)
            mlflow.log_metric("mae", mae)
    
            tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme
            model_signature = infer_signature(train_x, train_y)
    
            # Model registry does not work with file store
            if tracking_url_type_store != "file":
    
                # Register the model
                # There are other ways to use the Model Registry,
                # which depends on the use case,
                # please refer to the doc for more information:
                # https://mlflow.org/docs/latest/model-registry.html#api-workflow
                mlflow.sklearn.log_model(
                    lr,
                    "model",
                    registered_model_name="ElasticnetWineModel",
                    signature=model_signature,
                )
            else:
                mlflow.sklearn.log_model(lr, "model", signature=model_signature)
    
    !python src/train.py
    import os
    
    [experiment_file_path] = !ls -td ./mlruns/0/* | head -1
    model_path = os.path.join(experiment_file_path, "artifacts", "model")
    print(model_path)
    !ls {model_path} 
    %%writetemplate ./model-settings.json
    {{
        "name": "wine-classifier",
        "implementation": "mlserver_mlflow.MLflowRuntime",
        "parameters": {{
            "uri": "{model_path}"
        }}
    }}
    mlserver start .
    import requests
    
    inference_request = {
        "inputs": [
            {
              "name": "fixed acidity",
              "shape": [1],
              "datatype": "FP32",
              "data": [7.4],
            },
            {
              "name": "volatile acidity",
              "shape": [1],
              "datatype": "FP32",
              "data": [0.7000],
            },
            {
              "name": "citric acid",
              "shape": [1],
              "datatype": "FP32",
              "data": [0],
            },
            {
              "name": "residual sugar",
              "shape": [1],
              "datatype": "FP32",
              "data": [1.9],
            },
            {
              "name": "chlorides",
              "shape": [1],
              "datatype": "FP32",
              "data": [0.076],
            },
            {
              "name": "free sulfur dioxide",
              "shape": [1],
              "datatype": "FP32",
              "data": [11],
            },
            {
              "name": "total sulfur dioxide",
              "shape": [1],
              "datatype": "FP32",
              "data": [34],
            },
            {
              "name": "density",
              "shape": [1],
              "datatype": "FP32",
              "data": [0.9978],
            },
            {
              "name": "pH",
              "shape": [1],
              "datatype": "FP32",
              "data": [3.51],
            },
            {
              "name": "sulphates",
              "shape": [1],
              "datatype": "FP32",
              "data": [0.56],
            },
            {
              "name": "alcohol",
              "shape": [1],
              "datatype": "FP32",
              "data": [9.4],
            },
        ]
    }
    
    endpoint = "http://localhost:8080/v2/models/wine-classifier/infer"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    import requests
    
    inference_request = {
        "dataframe_split": {
            "columns": [
                "fixed acidity",
                "volatile acidity",
                "citric acid",
                "residual sugar",
                "chlorides",
                "free sulfur dioxide",
                "total sulfur dioxide",
                "density",
                "pH",
                "sulphates",
                "alcohol",
            ],
            "data": [[7.4,0.7,0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4]]
        }
    }
    
    endpoint = "http://localhost:8080/invocations"
    response = requests.post(endpoint, json=inference_request)
    
    response.json()
    !cat {model_path}/MLmodel
    import requests
    
    
    endpoint = "http://localhost:8080/v2/models/wine-classifier"
    response = requests.get(endpoint)
    
    response.json()

    Getting Started

    This guide will help you get started creating machine learning microservices with MLServer in less than 30 minutes. Our use case will be to create a service that helps us compare the similarity between two documents. Think about whenever you are choosing a book, news article, blog post, or tutorial to read next: wouldn't it be great to have a way to compare it with similar ones that you have already read and liked (without having to rely on a recommendation system)? That's what we'll focus on in this guide: creating a document similarity service. 📜 + 📃 = 😎👌🔥

    The code is showcased as if it were cells inside a notebook but you can run each of the steps inside Python files with minimal effort.

    00 What is MLServer?

    MLServer is an open-source Python library for building production-ready asynchronous APIs for machine learning models.

    01 Dependencies

    The first step is to install mlserver, the spacy library, and the language model spacy will need for our use case. We will also download the wikipedia-api library to test our use case with a few fun summaries.

    If you've never heard of spaCy before, it is an open-source Python library for advanced natural language processing that excels at large-scale information extraction and retrieval tasks, among many others. The model we'll use is one pre-trained on English text from the web, which will help us get started much faster than if we had to train a model from scratch for our use case.

    Let's first install these libraries.

    We will also need to download the language model separately once we have spaCy inside our virtual environment.

    If you're going over this guide inside a notebook, don't forget to add an exclamation mark ! in front of the two commands above. If you are in VSCode, you can keep them as they are and change the cell type to bash.

    02 Set Up

    At its core, MLServer requires that users give it 3 things: a model-settings.json file with information about the model, an (optional) settings.json file with information related to the server you are about to set up, and a .py file with the load-predict recipe for your model (as shown in the picture above).

    Let's create a directory for our model.

    Before we create a service that allows us to compare the similarity between two documents, it is good practice to first test that our solution works, especially if we're using a pre-trained model and/or a pipeline.

    Now that we have our model loaded, let's look at the similarity of the abstracts of Barbieheimer using the wikipedia-api Python library. The main requirement of the API is that we pass into the main class, Wikipedia(), a project name, an email and the language we want information to be returned in. After that, we can search for the movie summaries we want by passing the title of the movie to the .page() method and accessing its summary with the .summary attribute.

    Feel free to change the movies for other topics you might be interested in.

    You can run the following lines inside a notebook or, conversely, add them to an app.py file.

    If you created an app.py file with the code above, make sure you run python app.py from the terminal.

    Now that we have our two summaries, let's compare them using spacy.

    Notice that both summaries have information about the other movie, about "films" in general, and about the dates each aired on (which are the same). The reality is that the model hasn't seen any of these movies, so it might be generalizing to the context of each article, "movies," rather than their content, "dolls as humans and the atomic bomb."

    You should, of course, play around with different pages and see if what you get back is coherent with what you would expect.

    Time to create a machine learning API for our use-case. 😎

    03 Building a Service

    MLServer allows us to wrap machine learning models into APIs and build microservices with replicas of a single model, or different models all together.

    To create a service with MLServer, we will define a class with two asynchronous functions, one that loads the model and another one to run inference (or predict) with. The former will load the spacy model we tested in the last section, and the latter will take in a list with the two documents we want to compare. Lastly, our function will return a numpy array with a single value, our similarity score. We'll write the file to our similarity_model directory and call it my_model.py.

    Now that we have our model file ready to go, the last piece of the puzzle is to tell MLServer a bit of info about it. In particular, it wants (or needs) to know the name of the model and how to implement it. The former can be anything you want (and it will be part of the URL of your API), and the latter will follow the recipe of name_of_py_file_with_your_model.class_with_your_model.

    Let's create the model-settings.json file MLServer is expecting inside our similarity_model directory and add the name and the implementation of our model to it.

    Now that everything is in place, we can start serving predictions locally to test how things would play out for our future users. We'll initiate our server via the command line, and later on we'll see how to do the same via Python files. Here's where we are at right now in the process of developing microservices with MLServer.

    As you can see in the image, our server will be initialized with three entry points, one for HTTP requests, another for gRPC, and another for the metrics. To learn more about the powerful metrics feature of MLServer, please visit the relevant docs page here. To learn more about gRPC, please see this tutorial here.

    To start our service, open up a terminal and run the following command.

    Note: If this is a fresh terminal, make sure you activate your environment before you run the command above. If you run the command above from your notebook (e.g. !mlserver start similarity_model/), you will have to send the request below from another notebook or terminal since the cell will continue to run until you turn it off.

    04 Testing our Service

    Time to become a client of our service and test it. For this, we'll set up the payload we'll send to our service and use the requests library to POST our request.

    Please note that the request below uses the variables we created earlier with the summaries of Barbie and Oppenheimer. If you are sending this POST request from a fresh Python file, make sure you move those earlier lines of code into your request file.

    Let's decompose what just happened.

    The URL for our service might seem a bit odd if you've never heard of the V2/Open Inference Protocol (OIP). This protocol is a set of specifications that allows machine learning models to be shared and deployed in a standardized way. This protocol enables the use of machine learning models on a variety of platforms and devices without requiring changes to the model or its code. The OIP is useful because it allows us to integrate machine learning into a wide range of applications in a standard way.

    All URLs you create with MLServer will have the following structure.
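
    As a reference, here is a sketch of that structure. The host and port below assume MLServer's default HTTP settings, and the model name is whatever you declared in model-settings.json:

    http://<host>:<port>/v2/models/<model_name>/infer

    For the model in this guide, that resolves to http://0.0.0.0:8080/v2/models/doc-sim-model/infer, which is exactly the URL we will POST our requests to.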

    This kind of protocol is a standard adopted by different companies like NVIDIA, Tensorflow Serving, KServe, and others, to keep everyone on the same page. If you think about driving globally, each country adopts a standard for which side of the road to drive on, and this ensures you and everyone else stay on the left (or the right, depending on where you are). Adopting such a standard means you won't have to wonder where the next driver is going to come from when you are about to take a turn; instead, you can focus on getting where you're going without worrying as much.

    Let's describe what each of the components of our inference_request does.

    • name: this maps one-to-one to the name of the parameter in your predict() function.

    • shape: represents the shape of the elements in our data. In our case, it is a list with [2] strings.

    • datatype: the different data types expected by the server, e.g., str, numpy array, pandas dataframe, bytes, etc.

    • parameters: allows us to specify the content_type beyond the data types

    • data: the inputs to our predict function.

    To learn more about the OIP and how MLServer content types work, please have a look at their docs page here.

    05 Creating Model Replicas

    Say you need to meet the demand of a high number of users and one model might not be enough, or is not using all of the resources of the virtual machine instance it was allocated to. What we can do in this case is to create multiple replicas of our model to increase the throughput of the requests that come in. This can be particularly useful at the peak times of our server. To do this, we need to tweak the settings of our server via the settings.json file. In it, we'll add the number of independent models we want to have to the parameter "parallel_workers": 3.

    Let's stop our server, change the settings of it, start it again, and test it.

    As you can see in the output of the terminal in the picture above, we now have 3 models running in parallel. The reason you might see 4 entries is that, by default, MLServer prints the name of the initialized model once, and it also prints one entry for each of the replicas specified in the settings.

    Let's get a few more twin-film examples to test our server. Get as creative as you'd like. 💡

    Let's first test that the function works as intended.

    Now let's map three POST requests at the same time.

    We can also test it one by one.

    06 Packaging our Service

    For the last step of this guide, we are going to package our model and service into a docker image that we can reuse in another project or share with colleagues immediately. This step requires that we have docker installed and configured on our machines, so if you need to set up docker, you can do so by following the instructions in the documentation here.

    The first step is to create a requirements.txt file with all of our dependencies and add it to the directory we've been using for our service (similarity_model).

    The next step is to build a docker image with our model, its dependencies and our server. If you've never heard of docker images before, here's a short description.

    A Docker image is a lightweight, standalone, and executable package that includes everything needed to run a piece of software, including code, libraries, dependencies, and settings. It's like a carry-on bag for your application, containing everything it needs to travel safely and run smoothly in different environments. Just as a carry-on bag allows you to bring your essentials with you on a trip, a Docker image enables you to transport your application and its requirements across various computing environments, ensuring consistent and reliable deployment.

    MLServer has a convenient function that lets us create docker images with our services. Let's use it.

    We can check that our image was successfully built not only by looking at the logs of the previous command but also with the docker images command.

    Let's test that our image works as intended with the following command. Make sure you have closed your previous server by using CTRL + C in your terminal.

    Now that you have a packaged and fully-functioning microservice with our model, we could deploy our container to a production serving platform like Seldon Core, or via different offerings available through the many cloud providers out there (e.g. AWS Lambda, Google Cloud Run, etc.). You could also run this image on KServe, a Kubernetes native tool for model serving, or anywhere else where you can bring your docker image with you.

    To learn more about MLServer and the different ways in which you can use it, head over to the examples section or the user guide. To learn about some of the deployment options available, head over to the docs here.

    To keep up to date with what we are up to at Seldon, make sure you join our Slack community.

    pip install mlserver spacy wikipedia-api
    python -m spacy download en_core_web_lg
    mkdir -p similarity_model
    import spacy
    nlp = spacy.load("en_core_web_lg")
    import wikipediaapi
    wiki_wiki = wikipediaapi.Wikipedia('MyMovieEval ([email protected])', 'en')
    barbie = wiki_wiki.page('Barbie_(film)').summary
    oppenheimer = wiki_wiki.page('Oppenheimer_(film)').summary
    
    print(barbie)
    print()
    print(oppenheimer)
    Barbie is a 2023 American fantasy comedy film directed by Greta Gerwig and written by Gerwig and Noah Baumbach. Based on the Barbie fashion dolls by Mattel, it is the first live-action Barbie film after numerous computer-animated direct-to-video and streaming television films. The film stars Margot Robbie as Barbie and Ryan Gosling as Ken, and follows the two on a journey of self-discovery following an existential crisis. The film also features an ensemble cast that includes America Ferrera, Kate McKinnon, Issa Rae, Rhea Perlman, and Will Ferrell...
    
    Oppenheimer is a 2023 biographical thriller film written and directed by Christopher Nolan. Based on the 2005 biography American Prometheus by Kai Bird and Martin J. Sherwin, the film chronicles the life of J. Robert Oppenheimer, a theoretical physicist who was pivotal in developing the first nuclear weapons as part of the Manhattan Project, and thereby ushering in the Atomic Age. Cillian Murphy stars as Oppenheimer, with Emily Blunt as Oppenheimer's wife Katherine "Kitty" Oppenheimer; Matt Damon as General Leslie Groves, director of the Manhattan Project; and Robert Downey Jr. as Lewis Strauss, a senior member of the United States Atomic Energy Commission. The ensemble supporting cast includes Florence Pugh, Josh Hartnett, Casey Affleck, Rami Malek, Gary Oldman and Kenneth Branagh...
    doc1 = nlp(barbie)
    doc2 = nlp(oppenheimer)
    doc1.similarity(doc2)
    0.9866910567224084
    # similarity_model/my_model.py
    
    from mlserver.codecs import decode_args
    from mlserver import MLModel
    from typing import List
    import numpy as np
    import spacy
    
    class MyKulModel(MLModel):
    
        async def load(self):
            self.model = spacy.load("en_core_web_lg")
    
        @decode_args
        async def predict(self, docs: List[str]) -> np.ndarray:
    
            doc1 = self.model(docs[0])
            doc2 = self.model(docs[1])
    
            return np.array(doc1.similarity(doc2))
    # similarity_model/model-settings.json
    
    {
        "name": "doc-sim-model",
        "implementation": "my_model.MyKulModel"
    }
    mlserver start similarity_model/
    from mlserver.codecs import StringCodec
    import requests
    inference_request = {
        "inputs": [
            StringCodec.encode_input(name='docs', payload=[barbie, oppenheimer], use_bytes=False).model_dump()
        ]
    }
    print(inference_request)
    {'inputs': [{'name': 'docs',
       'shape': [2, 1],
       'datatype': 'BYTES',
       'parameters': {'content_type': 'str'},
       'data': [
            'Barbie is a 2023 American fantasy comedy...',
            'Oppenheimer is a 2023 biographical thriller...'
            ]
        }]
    }
    r = requests.post('http://0.0.0.0:8080/v2/models/doc-sim-model/infer', json=inference_request)
    r.json()
    {'model_name': 'doc-sim-model',
        'id': 'a4665ddb-1868-4523-bd00-a25902d9b124',
        'parameters': {},
        'outputs': [{'name': 'output-0',
        'shape': [1],
        'datatype': 'FP64',
        'parameters': {'content_type': 'np'},
        'data': [0.9866910567224084]}]}
    print(f"Our movies are {round(r.json()['outputs'][0]['data'][0] * 100, 4)}% similar!")
    Our movies are 98.6691% similar!
    # similarity_model/settings.json
    
    {
        "parallel_workers": 3
    }
    mlserver start similarity_model
    deep_impact    = wiki_wiki.page('Deep_Impact_(film)').summary
    armageddon     = wiki_wiki.page('Armageddon_(1998_film)').summary
    
    antz           = wiki_wiki.page('Antz').summary
    a_bugs_life    = wiki_wiki.page("A_Bug's_Life").summary
    
    the_dark_night = wiki_wiki.page('The_Dark_Knight').summary
    mamma_mia      = wiki_wiki.page('Mamma_Mia!_(film)').summary
    def get_sim_score(movie1, movie2):
        response = requests.post(
            'http://0.0.0.0:8080/v2/models/doc-sim-model/infer',
            json={
                "inputs": [
                    StringCodec.encode_input(name='docs', payload=[movie1, movie2], use_bytes=False).model_dump()
                ]
            })
        return response.json()['outputs'][0]['data'][0]
    get_sim_score(deep_impact, armageddon)
    0.9569279450151813
    results = list(
        map(get_sim_score, (deep_impact, antz, the_dark_night), (armageddon, a_bugs_life, mamma_mia))
    )
    results
    [0.9569279450151813, 0.9725374771538605, 0.9626173937217876]
    for movie1, movie2 in zip((deep_impact, antz, the_dark_night), (armageddon, a_bugs_life, mamma_mia)):
        print(get_sim_score(movie1, movie2))
    0.9569279450151813
    0.9725374771538605
    0.9626173937217876
    # similarity_model/requirements.txt
    
    mlserver
    spacy==3.6.0
    https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.6.0/en_core_web_lg-3.6.0-py3-none-any.whl
    mlserver build similarity_model/ -t 'fancy_ml_service'
    docker images
    docker run -it --rm -p 8080:8080 fancy_ml_service

    Content Type Decoding

    MLServer extends the V2 inference protocol by adding support for a content_type annotation. This annotation can be provided either through the model metadata parameters, or through the input parameters. By leveraging the content_type annotation, we can provide the necessary information to MLServer so that it can decode the input payload from the "wire" V2 protocol to something meaningful to the model / user (e.g. a NumPy array).

    This example will walk you through a few scenarios which illustrate how this works, and how it can be extended.

    Echo Inference Runtime

    To start with, we will write a dummy runtime which just prints the input, the decoded input and returns it. This will serve as a testbed to showcase how the content_type support works.

    Later on, we will extend this runtime by adding custom codecs that will decode our V2 payload to custom types.

    As you can see above, this runtime will decode the incoming payloads by calling the self.decode() helper method. This method will determine the right content type for each input by checking, in the following order:

    1. Is there any content type defined in the inputs[].parameters.content_type field within the request payload?

    2. Is there any content type defined in the inputs[].parameters.content_type field within the model metadata?

    3. Is there any default content type that should be assumed?

    Model Settings

    In order to enable this runtime, we will also create a model-settings.json file. This file should be present in (or accessible from) the folder where we run mlserver start ..

    Request Inputs

    Our initial step will be to decide the content type based on the incoming inputs[].parameters field. For this, we will start our MLServer in the background (e.g. running mlserver start .)

    Codecs

    As you've probably already noticed, writing request payloads compliant with the V2 Inference Protocol requires a certain knowledge of both the V2 spec and the structure expected by each content type. To account for this and simplify usage, the MLServer package exposes a set of utilities which will help you interact with your models via the V2 protocol.

    These helpers are mainly shaped as "codecs". That is, abstractions which know how to "encode" and "decode" arbitrary Python datatypes to and from the V2 Inference Protocol.

    Generally, we recommend using the existing set of codecs to generate your V2 payloads. This will ensure that requests and responses follow the right structure, and should provide a more seamless experience.

    Following on from our previous example, the same code could be rewritten using codecs as:

    Note that the rewritten snippet now makes use of the built-in InferenceRequest class, which represents a V2 inference request. On top of that, it also uses the NumpyCodec and StringCodec implementations, which know how to encode a Numpy array and a list of strings into V2-compatible request inputs.

    Model Metadata

    Our next step will be to define the expected content type through the model metadata. This can be done by extending the model-settings.json file, and adding a section on inputs.

    After adding this metadata, we will re-start MLServer (e.g. mlserver start .) and we will send a new request without any explicit parameters.

    As you should be able to see in the server logs, MLServer will cross-reference the input names against the model metadata to find the right content type.

    Custom Codecs

    There may be cases where a custom inference runtime may need to encode / decode to custom datatypes. As an example, we can think of computer vision models which may only operate with Pillow image objects.

    In these scenarios, it's possible to extend the Codec interface to write our custom encoding logic. A Codec is simply an object which defines decode() and encode() methods. To illustrate how this would work, we will extend our custom runtime to add a custom PillowCodec.

    We should now be able to restart our instance of MLServer (i.e. with the mlserver start . command), to send a few test requests.

    As you should be able to see in the MLServer logs, the server is now able to decode the payload into a Pillow image. This example also illustrates how Codec objects can be compatible with multiple datatype values (e.g. tensor and BYTES in this case).

    Request Codecs

    So far, we've seen how you can specify codecs so that they get applied at the input level. However, it is also possible to use request-wide codecs that aggregate multiple inputs to decode the payload. This is usually relevant for cases where the models expect a multi-column input type, like a Pandas DataFrame.

    To illustrate this, we will first tweak our EchoRuntime so that it prints the decoded contents at the request level.

    We should now be able to restart our instance of MLServer (i.e. with the mlserver start . command), to send a few test requests.

    %%writefile runtime.py
    import json
    
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput
    from mlserver.codecs import DecodedParameterName
    
    _to_exclude = {
        "parameters": {DecodedParameterName, "headers"},
        'inputs': {"__all__": {"parameters": {DecodedParameterName, "headers"}}}
    }
    
    class EchoRuntime(MLModel):
        async def predict(self, payload: InferenceRequest) -> InferenceResponse:
            outputs = []
            for request_input in payload.inputs:
                decoded_input = self.decode(request_input)
                print(f"------ Encoded Input ({request_input.name}) ------")
                as_dict = request_input.dict(exclude=_to_exclude)  # type: ignore
                print(json.dumps(as_dict, indent=2))
                print(f"------ Decoded input ({request_input.name}) ------")
                print(decoded_input)
    
                outputs.append(
                    ResponseOutput(
                        name=request_input.name,
                        datatype=request_input.datatype,
                        shape=request_input.shape,
                        data=request_input.data
                    )
                )
    
            return InferenceResponse(model_name=self.name, outputs=outputs)
    
    %%writefile model-settings.json
    
    {
        "name": "content-type-example",
        "implementation": "runtime.EchoRuntime"
    }
    import requests
    
    payload = {
        "inputs": [
            {
                "name": "parameters-np",
                "datatype": "INT32",
                "shape": [2, 2],
                "data": [1, 2, 3, 4],
                "parameters": {
                    "content_type": "np"
                }
            },
            {
                "name": "parameters-str",
                "datatype": "BYTES",
                "shape": [1],
                "data": "hello world 😁",
                "parameters": {
                    "content_type": "str"
                }
            }
        ]
    }
    
    response = requests.post(
        "http://localhost:8080/v2/models/content-type-example/infer",
        json=payload
    )
    import requests
    import numpy as np
    
    from mlserver.types import InferenceRequest, InferenceResponse
    from mlserver.codecs import NumpyCodec, StringCodec
    
    parameters_np = np.array([[1, 2], [3, 4]])
    parameters_str = ["hello world 😁"]
    
    payload = InferenceRequest(
        inputs=[
            NumpyCodec.encode_input("parameters-np", parameters_np),
            # The `use_bytes=False` flag will ensure that the encoded payload is JSON-compatible
            StringCodec.encode_input("parameters-str", parameters_str, use_bytes=False),
        ]
    )
    
    response = requests.post(
        "http://localhost:8080/v2/models/content-type-example/infer",
        json=payload.model_dump()
    )
    
    response_payload = InferenceResponse.parse_raw(response.text)
    print(NumpyCodec.decode_output(response_payload.outputs[0]))
    print(StringCodec.decode_output(response_payload.outputs[1]))
    %%writefile model-settings.json
    
    {
        "name": "content-type-example",
        "implementation": "runtime.EchoRuntime",
        "inputs": [
            {
                "name": "metadata-np",
                "datatype": "INT32",
                "shape": [2, 2],
                "parameters": {
                    "content_type": "np"
                }
            },
            {
                "name": "metadata-str",
                "datatype": "BYTES",
                "shape": [11],
                "parameters": {
                    "content_type": "str"
                }
            }
        ]
    }
    import requests
    
    payload = {
        "inputs": [
            {
                "name": "metadata-np",
                "datatype": "INT32",
                "shape": [2, 2],
                "data": [1, 2, 3, 4],
            },
            {
                "name": "metadata-str",
                "datatype": "BYTES",
                "shape": [11],
                "data": "hello world 😁",
            }
        ]
    }
    
    response = requests.post(
        "http://localhost:8080/v2/models/content-type-example/infer",
        json=payload
    )
    %%writefile runtime.py
    import io
    import json
    
    from PIL import Image
    
    from mlserver import MLModel
    from mlserver.types import (
        InferenceRequest,
        InferenceResponse,
        RequestInput,
        ResponseOutput,
    )
    from mlserver.codecs import NumpyCodec, register_input_codec, DecodedParameterName
    from mlserver.codecs.utils import InputOrOutput
    
    
    _to_exclude = {
        "parameters": {DecodedParameterName},
        "inputs": {"__all__": {"parameters": {DecodedParameterName}}},
    }
    
    
    @register_input_codec
    class PillowCodec(NumpyCodec):
        ContentType = "img"
        DefaultMode = "L"
    
        @classmethod
        def can_encode(cls, payload: Image.Image) -> bool:
            # PIL.Image is a module; the actual image class is PIL.Image.Image
            return isinstance(payload, Image.Image)
    
        @classmethod
        def _decode(cls, input_or_output: InputOrOutput) -> Image:
            if input_or_output.datatype != "BYTES":
                # If not bytes, assume it's an array
                image_array = super().decode_input(input_or_output)  # type: ignore
                return Image.fromarray(image_array, mode=cls.DefaultMode)
    
            encoded = input_or_output.data
            if isinstance(encoded, str):
                encoded = encoded.encode()
    
            return Image.frombytes(
                mode=cls.DefaultMode, size=input_or_output.shape, data=encoded
            )
    
        @classmethod
        def encode_output(cls, name: str, payload: Image) -> ResponseOutput:  # type: ignore
            byte_array = io.BytesIO()
            payload.save(byte_array, mode=cls.DefaultMode)
    
            return ResponseOutput(
                name=name, shape=payload.size, datatype="BYTES", data=byte_array.getvalue()
            )
    
        @classmethod
        def decode_output(cls, response_output: ResponseOutput) -> Image:
            return cls._decode(response_output)
    
        @classmethod
        def encode_input(cls, name: str, payload: Image) -> RequestInput:  # type: ignore
            output = cls.encode_output(name, payload)
            return RequestInput(
                name=output.name,
                shape=output.shape,
                datatype=output.datatype,
                data=output.data,
            )
    
        @classmethod
        def decode_input(cls, request_input: RequestInput) -> Image:
            return cls._decode(request_input)
    
    
    class EchoRuntime(MLModel):
        async def predict(self, payload: InferenceRequest) -> InferenceResponse:
            outputs = []
            for request_input in payload.inputs:
                decoded_input = self.decode(request_input)
                print(f"------ Encoded Input ({request_input.name}) ------")
                as_dict = request_input.dict(exclude=_to_exclude)  # type: ignore
                print(json.dumps(as_dict, indent=2))
                print(f"------ Decoded input ({request_input.name}) ------")
                print(decoded_input)
    
                outputs.append(
                    ResponseOutput(
                        name=request_input.name,
                        datatype=request_input.datatype,
                        shape=request_input.shape,
                        data=request_input.data,
                    )
                )
    
            return InferenceResponse(model_name=self.name, outputs=outputs)
    import requests
    
    payload = {
        "inputs": [
            {
                "name": "image-int32",
                "datatype": "INT32",
                "shape": [8, 8],
                "data": [
                    1, 0, 1, 0, 1, 0, 1, 0,
                    1, 0, 1, 0, 1, 0, 1, 0,
                    1, 0, 1, 0, 1, 0, 1, 0,
                    1, 0, 1, 0, 1, 0, 1, 0,
                    1, 0, 1, 0, 1, 0, 1, 0,
                    1, 0, 1, 0, 1, 0, 1, 0,
                    1, 0, 1, 0, 1, 0, 1, 0,
                    1, 0, 1, 0, 1, 0, 1, 0
                ],
                "parameters": {
                    "content_type": "img"
                }
            },
            {
                "name": "image-bytes",
                "datatype": "BYTES",
                "shape": [8, 8],
                "data": (
                    "10101010"
                    "10101010"
                    "10101010"
                    "10101010"
                    "10101010"
                    "10101010"
                    "10101010"
                    "10101010"
                ),
                "parameters": {
                    "content_type": "img"
                }
            }
        ]
    }
    
    response = requests.post(
        "http://localhost:8080/v2/models/content-type-example/infer",
        json=payload
    )
    %%writefile runtime.py
    import json
    
    from mlserver import MLModel
    from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput
    from mlserver.codecs import DecodedParameterName
    
    _to_exclude = {
        "parameters": {DecodedParameterName},
        'inputs': {"__all__": {"parameters": {DecodedParameterName}}}
    }
    
    class EchoRuntime(MLModel):
        async def predict(self, payload: InferenceRequest) -> InferenceResponse:
            print("------ Encoded Input (request) ------")
            as_dict = payload.dict(exclude=_to_exclude)  # type: ignore
            print(json.dumps(as_dict, indent=2))
            print("------ Decoded input (request) ------")
            decoded_request = None
            if payload.parameters:
                decoded_request = getattr(payload.parameters, DecodedParameterName)
            print(decoded_request)
    
            outputs = []
            for request_input in payload.inputs:
                outputs.append(
                    ResponseOutput(
                        name=request_input.name,
                        datatype=request_input.datatype,
                        shape=request_input.shape,
                        data=request_input.data
                    )
                )
    
            return InferenceResponse(model_name=self.name, outputs=outputs)
    
    import requests
    
    payload = {
        "inputs": [
            {
                "name": "parameters-np",
                "datatype": "INT32",
                "shape": [2, 2],
                "data": [1, 2, 3, 4],
                "parameters": {
                    "content_type": "np"
                }
            },
            {
                "name": "parameters-str",
                "datatype": "BYTES",
                "shape": [2, 11],
                "data": ["hello world 😁", "bye bye 😁"],
                "parameters": {
                    "content_type": "str"
                }
            }
        ],
        "parameters": {
            "content_type": "pd"
        }
    }
    
    response = requests.post(
        "http://localhost:8080/v2/models/content-type-example/infer",
        json=payload
    )

    Content Types (and Codecs)

    Machine learning models generally expect their inputs to be passed down as a particular Python type. Most commonly, this type ranges from "general purpose" NumPy arrays or Pandas DataFrames to more granular definitions, like datetime objects, Pillow images, etc. Unfortunately, the definition of the V2 Inference Protocol doesn't cover any of these specific use cases. This protocol can be thought of as a wider, "lower level" spec, which only defines what fields a payload should have.

    To account for this gap, MLServer introduces support for content types, which offer a way to let MLServer know how it should "decode" V2-compatible payloads. When shaped in the right way, these payloads should "encode" all the information required to extract the higher level Python type that will be required for a model.

    To illustrate the above, we can think of a Scikit-Learn pipeline, which takes in a Pandas DataFrame and returns a NumPy Array. Without the use of content types, the V2 payload itself would probably lack information about how this payload should be treated by MLServer. Likewise, the Scikit-Learn pipeline wouldn't know how to treat a raw V2 payload. In this scenario, the use of content types allows us to specify what the actual "higher level" information encoded within the V2 protocol payloads is.

    Usage

    Some inference runtimes may apply a content type by default if none is present. To learn more about each runtime's defaults, please check the relevant runtime's docs.

    To let MLServer know that a particular payload must be decoded / encoded as a different Python data type (e.g. NumPy Array, Pandas DataFrame, etc.), you can specify it through the content_type field of the parameters section of your request.

    As an example, we can consider the following dataframe, containing two columns: Age and First Name.

    First Name
    Age

    This table could be specified in the V2 protocol as the following payload, where we declare that:

    • The whole set of inputs should be decoded as a Pandas Dataframe (i.e. setting the content type as pd).

    • The First Name column should be decoded as a UTF-8 string (i.e. setting the content type as str).
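
    For illustration, below is a minimal sketch of such a payload. The two rows and their values are made up for this example; only the structure (request-level pd, input-level str) matters:

    {
        "parameters": {
            "content_type": "pd"
        },
        "inputs": [
            {
                "name": "First Name",
                "datatype": "BYTES",
                "shape": [2],
                "data": ["Joanne", "Michael"],
                "parameters": {
                    "content_type": "str"
                }
            },
            {
                "name": "Age",
                "datatype": "INT32",
                "shape": [2],
                "data": [34, 22]
            }
        ]
    }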

    To learn more about the available content types and how to use them, you can check the section below.

    It's important to keep in mind that content types can be specified at both the request level and the input level. The former will apply to the entire set of inputs, whereas the latter will only apply to a particular input of the payload.

    Codecs

    Under the hood, the conversion between content types is implemented using codecs. In the MLServer architecture, codecs are abstractions which know how to encode and decode high-level Python types to and from the V2 Inference Protocol.

    Depending on the high-level Python type, encoding / decoding operations may require access to multiple input or output heads. For example, a Pandas Dataframe would need to aggregate all of the input-/output-heads present in a V2 Inference Protocol response.

    However, a Numpy array or a list of strings could be encoded directly as an input head within a larger request.

    To account for this, codecs can work at either the request / response level (known as request codecs) or the input / output level (known as input codecs). Each of these codecs exposes the following public interface, where Any represents a high-level Python datatype (e.g. a Pandas Dataframe, a Numpy Array, etc.):

    • Request Codecs

      • encode_request()

      • decode_request()

    Note that these methods can also be used as helpers to encode requests and decode responses on the client side. This can help abstract away from the user most of the details about the underlying structure of V2-compatible payloads.

    Following the dataframe example above, we could use codecs to encode the DataFrame into a V2-compatible request simply as:
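
    As a minimal sketch, assuming the same two-column dataframe as above (the row values are illustrative), the request-level Pandas codec can build the whole V2 request for us:

    import pandas as pd
    from mlserver.codecs import PandasCodec

    df = pd.DataFrame({"First Name": ["Joanne", "Michael"], "Age": [34, 22]})

    # Encode the whole DataFrame as a single V2 inference request
    inference_request = PandasCodec.encode_request(df)
    print(inference_request)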

    For a full end-to-end example of how content types and codecs work under the hood, feel free to check out this worked-out example on content type decoding.

    Converting to / from JSON

    When using MLServer's request codecs, the output of encoding payloads will always be one of the classes within the mlserver.types package (i.e. InferenceRequest or InferenceResponse). Therefore, if you want to use them with requests (or any other package outside of MLServer) you will need to convert them to a Python dict or a JSON string.

    Luckily, these classes leverage Pydantic under the hood. Therefore you can just call the .model_dump() or .model_dump_json() method to convert them. Likewise, to read them back from JSON, we can always pass the JSON fields as kwargs to the class' constructor (or use any of the parsing helpers available within Pydantic).

    For example, if we want to send an inference request to model foo, we could do something along the following lines:
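
    A sketch of what that could look like, reusing the codecs shown elsewhere in these docs (the payload values, the input name and the localhost:8080 address are assumptions made for this example):

    import requests
    import numpy as np

    from mlserver.types import InferenceRequest, InferenceResponse
    from mlserver.codecs import NumpyCodec

    # Build a V2-compatible request and convert it into a plain dict for `requests`
    foo_input = np.array([[1, 2], [3, 4]])
    inference_request = InferenceRequest(
        inputs=[NumpyCodec.encode_input("foo-input", foo_input)]
    )

    response = requests.post(
        "http://localhost:8080/v2/models/foo/infer",
        json=inference_request.model_dump(),
    )

    # Read the JSON fields back into an InferenceResponse object
    inference_response = InferenceResponse(**response.json())
    print(NumpyCodec.decode_output(inference_response.outputs[0]))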

    Support for NaN values

    The NaN (Not a Number) value is used in Numpy and other scientific libraries to describe an invalid or missing value (e.g. a division by zero). In some scenarios, it may be desirable to let your models receive and / or output NaN values (e.g. these can be useful sometimes with GBTs, like XGBoost models). This is why MLServer supports encoding NaN values on your request / response payloads under some conditions.

    In order to send / receive NaN values, you must ensure that:

    • You are using the REST interface.

    • The input / output entry containing NaN values uses either the FP16, FP32 or FP64 datatypes.

    • You are either using the Pandas codec or the NumPy codec.

    Assuming those conditions are satisfied, any null value within your tensor payload will be converted to NaN.

    For example, if you take the following Numpy array:

    We could encode it as:

    Model Metadata

    Content types can also be defined as part of the model's metadata. This lets the user pre-configure which content types a model should use by default to decode / encode its requests / responses, without the need to specify it on each request.

    For example, to configure the content type values of the example above, one could create a model-settings.json file like the one below:

    It's important to keep in mind that content types passed explicitly as part of the request will always take precedence over the model's metadata. Therefore, we can leverage this to override the model's metadata when needed.
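
    For instance, a hypothetical request could force the NumPy codec on an input named foo, regardless of what the model's metadata declares for that input:

    from mlserver.types import InferenceRequest, Parameters, RequestInput

    # The content type set here takes precedence over the model's metadata
    inference_request = InferenceRequest(
      inputs=[
        RequestInput(
          name="foo",
          datatype="FP32",
          shape=[2, 2],
          data=[1.0, 2.0, 3.0, 4.0],
          parameters=Parameters(content_type="np"),
        )
      ]
    )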

    Available Content Types

    Out of the box, MLServer supports the following list of content types. However, this can be extended through the use of 3rd-party or custom runtimes.

    | Python Type | Content Type | Request Level | Request Codec | Input Level | Input Codec |
    | --- | --- | --- | --- | --- | --- |
    | NumPy Array | np | ✅ | mlserver.codecs.NumpyRequestCodec | ✅ | mlserver.codecs.NumpyCodec |
    | Pandas DataFrame | pd | ✅ | mlserver.codecs.PandasCodec | ❌ | - |
    | UTF-8 String | str | ✅ | mlserver.codecs.string.StringRequestCodec | ✅ | mlserver.codecs.StringCodec |
    | Base64 | base64 | ❌ | - | ✅ | mlserver.codecs.Base64Codec |
    | Datetime | datetime | ❌ | - | ✅ | mlserver.codecs.DatetimeCodec |

    MLServer allows you to extend the supported content types by adding custom ones. To learn more about how to write your own custom content types, you can check this full end-to-end example. You can also learn more about building custom extensions for MLServer in the Custom Inference Runtime section of the docs.

    NumPy Array

    The V2 Inference Protocol expects that the data of each input is sent as a flat array. Therefore, the np content type will expect that tensors are sent flattened. The information in the shape field will then be used to reshape the vector into the right dimensions.

    The np content type will decode / encode V2 payloads to a NumPy Array, taking into account the following:

    • The datatype field will be matched to the closest NumPy dtype.

    • The shape field will be used to reshape the flattened array expected by the V2 protocol into the expected tensor shape.

    By default, MLServer will always assume that an array with a single-dimensional shape, e.g. [N], is equivalent to [N, 1]. That is, each entry will be treated like a single one-dimensional data point (i.e. instead of a [1, D] array, where the full array is a single D-dimensional data point). To avoid any ambiguity, where possible, the Numpy codec will always explicitly encode [N] arrays as [N, 1].
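
    A small sketch of this convention, assuming the behaviour described above and using the input-level NumPy codec:

    import numpy as np

    from mlserver.codecs import NumpyCodec

    # A one-dimensional array of shape [3]...
    foo = np.array([1, 2, 3])

    # ...is encoded explicitly as a [3, 1] tensor to avoid any ambiguity
    request_input = NumpyCodec.encode_input("foo", foo)
    print(request_input.shape)  # expected to print [3, 1]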

    For example, if we think of the following NumPy Array:

    We could encode it as the input foo in a V2 protocol request as:

    When using the NumPy Array content type at the request-level, it will decode the entire request by considering only the first input element. This can be used as a helper for models which only expect a single tensor.
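
    For instance, a minimal round-trip sketch with the request-level NumPy codec (the array values are arbitrary):

    import numpy as np

    from mlserver.codecs import NumpyRequestCodec

    # Encode an entire V2 request from a single tensor...
    foo = np.array([[1, 2], [3, 4]])
    inference_request = NumpyRequestCodec.encode_request(foo)

    # ...and decode it back; only the first input of the request is considered
    decoded = NumpyRequestCodec.decode_request(inference_request)
    print(decoded)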

    Pandas DataFrame

    The pd content type can be stacked with other content types. This allows the user to use a different set of content types to decode each of the columns.

    The pd content type will decode / encode a V2 request into a Pandas DataFrame. For this, it will expect that the DataFrame is shaped in a columnar way. That is,

    • Each entry of the inputs list (or outputs, in the case of responses), will represent a column of the DataFrame.

    • Each of these entries will contain all the row elements for that particular column.

    • The shape field of each input (or output) entry will contain (at least) the amount of rows included in the dataframe.

    For example, if we consider the following dataframe:

    | A | B | C |
    | --- | --- | --- |
    | a1 | b1 | c1 |
    | a2 | b2 | c2 |
    | a3 | b3 | c3 |
    | a4 | b4 | c4 |

    We could encode it to the V2 Inference Protocol as:

    UTF-8 String

    The str content type lets you encode / decode a V2 input into a UTF-8 Python string, taking into account the following:

    • The expected datatype is BYTES.

    • The shape field represents the number of "strings" that are encoded in the payload (e.g. the ["hello world", "one more time"] payload will have a shape of 2 elements).

    For example, if we consider the following list of strings:

    We could encode it to the V2 Inference Protocol as:

    When using the str content type at the request-level, it will decode the entire request by considering only the first input element. This can be used as a helper for models which only expect a single string or a set of strings.
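
    For instance, a minimal round-trip sketch with the string request codec (the payload below is arbitrary):

    from mlserver.codecs.string import StringRequestCodec

    # Encode a list of strings as an entire V2 request...
    foo = ["hello world", "one more time"]
    inference_request = StringRequestCodec.encode_request(foo, use_bytes=False)

    # ...and decode it back; only the first input of the request is considered
    decoded = StringRequestCodec.decode_request(inference_request)
    print(decoded)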

    Base64

    The base64 content type will decode a binary V2 payload into a Base64-encoded string (and vice versa), taking into account the following:

    • The expected datatype is BYTES.

    • The data field should contain the base64-encoded binary strings.

    • The shape field represents the number of binary strings that are encoded in the payload.

    For example, if we think of the following "bytes array":

    We could encode it as the input foo of a V2 request as:

    Datetime

    The datetime content type will decode a V2 input into a Python datetime.datetime object, taking into account the following:

    • The expected datatype is BYTES.

    • The data field should contain the dates serialised following the ISO 8601 standard.

    • The shape field represents the number of datetimes that are encoded in the payload.

    For example, if we think of the following datetime object:

    We could encode it as the input foo of a V2 request as:




    {
      "parameters": {
        "content_type": "pd"
      },
      "inputs": [
        {
          "name": "First Name",
          "datatype": "BYTES",
          "parameters": {
            "content_type": "str"
          },
          "shape": [2],
          "data": ["Joanne", "Michael"]
        },
        {
          "name": "Age",
          "datatype": "INT32",
          "shape": [2],
          "data": [34, 22]
        }
      ]
    }
    import pandas as pd
    
    from mlserver.codecs import PandasCodec
    
    dataframe = pd.DataFrame({'First Name': ["Joanne", "Michael"], 'Age': [34, 22]})
    
    inference_request = PandasCodec.encode_request(dataframe)
    print(inference_request)
    import pandas as pd
    import requests
    
    from mlserver.codecs import PandasCodec
    from mlserver.types import InferenceResponse
    
    dataframe = pd.DataFrame({'First Name': ["Joanne", "Michael"], 'Age': [34, 22]})
    
    inference_request = PandasCodec.encode_request(dataframe)
    
    # raw_request will be a Python dictionary compatible with `requests`'s `json` kwarg
    raw_request = inference_request.model_dump()
    
    response = requests.post("http://localhost:8080/v2/models/foo/infer", json=raw_request)
    
    # raw_response will be a dictionary (loaded from the response's JSON),
    # therefore we can pass it as the InferenceResponse constructors' kwargs
    raw_response = response.json()
    inference_response = InferenceResponse(**raw_response)
    import numpy as np
    
    foo = np.array([[1.2, 2.3], [np.nan, 4.5]])
    {
      "inputs": [
        {
          "name": "foo",
          "parameters": {
            "content_type": "np"
          },
          "data": [1.2, 2.3, null, 4.5]
          "datatype": "FP64",
          "shape": [2, 2],
        }
      ]
    }
    {
      "parameters": {
        "content_type": "pd"
      },
      "inputs": [
        {
          "name": "First Name",
          "datatype": "BYTES",
          "parameters": {
            "content_type": "str"
          },
          "shape": [-1],
        },
        {
          "name": "Age",
          "datatype": "INT32",
          "shape": [-1],
        },
      ]
    }
    import numpy as np
    
    foo = np.array([[1, 2], [3, 4]])
    {
      "inputs": [
        {
          "name": "foo",
          "parameters": {
            "content_type": "np"
          },
          "data": [1, 2, 3, 4]
          "datatype": "INT32",
          "shape": [2, 2],
        }
      ]
    }
    from mlserver.codecs import NumpyRequestCodec
    
    # Encode an entire V2 request
    inference_request = NumpyRequestCodec.encode_request(foo)
    from mlserver.types import InferenceRequest
    from mlserver.codecs import NumpyCodec
    
    # We can use the `NumpyCodec` to encode a single input head with name `foo`
    # within a larger request
    inference_request = InferenceRequest(
      inputs=[
        NumpyCodec.encode_input("foo", foo)
      ]
    )
    {
      "parameters": {
        "content_type": "pd"
      },
      "inputs": [
        {
          "name": "A",
          "data": ["a1", "a2", "a3", "a4"]
          "datatype": "BYTES",
          "shape": [4],
        },
        {
          "name": "B",
          "data": ["b1", "b2", "b3", "b4"]
          "datatype": "BYTES",
          "shape": [4],
        },
        {
          "name": "C",
          "data": ["c1", "c2", "c3", "c4"]
          "datatype": "BYTES",
          "shape": [4],
        },
      ]
    }
    import pandas as pd
    
    from mlserver.codecs import PandasCodec
    
    foo = pd.DataFrame({
      "A": ["a1", "a2", "a3", "a4"],
      "B": ["b1", "b2", "b3", "b4"],
      "C": ["c1", "c2", "c3", "c4"]
    })
    
    inference_request = PandasCodec.encode_request(foo)
    foo = ["bar", "bar2"]
    {
      "parameters": {
        "content_type": "str"
      },
      "inputs": [
        {
          "name": "foo",
          "data": ["bar", "bar2"]
          "datatype": "BYTES",
          "shape": [2],
        }
      ]
    }
    from mlserver.codecs.string import StringRequestCodec
    
    # Encode an entire V2 request
    inference_request = StringRequestCodec.encode_request(foo, use_bytes=False)
    from mlserver.types import InferenceRequest
    from mlserver.codecs import StringCodec
    
    # We can use the `StringCodec` to encode a single input head with name `foo`
    # within a larger request
    inference_request = InferenceRequest(
      inputs=[
        StringCodec.encode_input("foo", foo, use_bytes=False)
      ]
    )
    foo = b"Python is fun"
    {
      "inputs": [
        {
          "name": "foo",
          "parameters": {
            "content_type": "base64"
          },
          "data": ["UHl0aG9uIGlzIGZ1bg=="]
          "datatype": "BYTES",
          "shape": [1],
        }
      ]
    }
    from mlserver.types import InferenceRequest
    from mlserver.codecs import Base64Codec
    
    # We can use the `Base64Codec` to encode a single input head with name `foo`
    # within a larger request
    inference_request = InferenceRequest(
      inputs=[
        Base64Codec.encode_input("foo", foo, use_bytes=False)
      ]
    )
    import datetime
    
    foo = datetime.datetime(2022, 1, 11, 11, 0, 0)
    {
      "inputs": [
        {
          "name": "foo",
          "parameters": {
            "content_type": "datetime"
          },
          "data": ["2022-01-11T11:00:00"]
          "datatype": "BYTES",
          "shape": [1],
        }
      ]
    }
    from mlserver.types import InferenceRequest
    from mlserver.codecs import DatetimeCodec
    
    # We can use the `DatetimeCodec` to encode a single input head with name `foo`
    # within a larger request
    inference_request = InferenceRequest(
      inputs=[
        DatetimeCodec.encode_input("foo", foo, use_bytes=False)
      ]
    )

    Types

    Datatype

    An enumeration of the tensor datatypes supported by the V2 Inference Protocol: BOOL, UINT8, UINT16, UINT32, UINT64, INT8, INT16, INT32, INT64, FP16, FP32, FP64 and BYTES.

    InferenceErrorResponse

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | error | Optional[str] | None | - |

    InferenceRequest

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | id | Optional[str] | None | - |
    | parameters | Optional[Parameters] | None | - |
    | inputs | List[RequestInput] | - | - |
    | outputs | Optional[List[RequestOutput]] | None | - |

    InferenceResponse

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | model_name | str | - | - |
    | model_version | Optional[str] | None | - |
    | id | Optional[str] | None | - |
    | parameters | Optional[Parameters] | None | - |
    | outputs | List[ResponseOutput] | - | - |

    MetadataModelErrorResponse

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | error | str | - | - |

    MetadataModelResponse

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | name | str | - | - |
    | versions | Optional[List[str]] | None | - |
    | platform | str | - | - |
    | inputs | Optional[List[MetadataTensor]] | None | - |
    | outputs | Optional[List[MetadataTensor]] | None | - |
    | parameters | Optional[Parameters] | None | - |

    MetadataServerErrorResponse

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | error | str | - | - |

    MetadataServerResponse

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | name | str | - | - |
    | version | str | - | - |
    | extensions | List[str] | - | - |

    MetadataTensor

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | name | str | - | - |
    | datatype | Datatype | - | - |
    | shape | List[int] | - | - |
    | parameters | Optional[Parameters] | None | - |

    Parameters

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | content_type | Optional[str] | None | - |
    | headers | Optional[Dict[str, Any]] | None | - |

    RepositoryIndexRequest

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | ready | Optional[bool] | None | - |

    RepositoryIndexResponse

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | root | List[RepositoryIndexResponseItem] | - | - |

    RepositoryIndexResponseItem

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | name | str | - | - |
    | version | Optional[str] | None | - |
    | state | State | - | - |
    | reason | str | - | - |

    RepositoryLoadErrorResponse

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | error | Optional[str] | None | - |

    RepositoryUnloadErrorResponse

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | error | Optional[str] | None | - |

    RequestInput

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | name | str | - | - |
    | shape | List[int] | - | - |
    | datatype | Datatype | - | - |
    | parameters | Optional[Parameters] | None | - |
    | data | TensorData | - | - |

    RequestOutput

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | name | str | - | - |
    | parameters | Optional[Parameters] | None | - |

    ResponseOutput

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | name | str | - | - |
    | shape | List[int] | - | - |
    | datatype | Datatype | - | - |
    | parameters | Optional[Parameters] | None | - |
    | data | TensorData | - | - |

    State

    An enumeration of the possible model states in the repository: UNKNOWN, READY, UNAVAILABLE, LOADING and UNLOADING.

    TensorData

    | Field | Type | Default | Description |
    | --- | --- | --- | --- |
    | root | Union[List[Any], Any] | - | - |


    
    {
      "properties": {
        "error": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Error"
        }
      },
      "title": "InferenceErrorResponse",
      "type": "object"
    }
    
    
    {
      "$defs": {
        "Datatype": {
          "enum": [
            "BOOL",
            "UINT8",
            "UINT16",
            "UINT32",
            "UINT64",
            "INT8",
            "INT16",
            "INT32",
            "INT64",
            "FP16",
            "FP32",
            "FP64",
            "BYTES"
          ],
          "title": "Datatype",
          "type": "string"
        },
        "Parameters": {
          "additionalProperties": true,
          "properties": {
            "content_type": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Content Type"
            },
            "headers": {
              "anyOf": [
                {
                  "type": "object"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Headers"
            }
          },
          "title": "Parameters",
          "type": "object"
        },
        "RequestInput": {
          "properties": {
            "name": {
              "title": "Name",
              "type": "string"
            },
            "shape": {
              "items": {
                "type": "integer"
              },
              "title": "Shape",
              "type": "array"
            },
            "datatype": {
              "$ref": "#/$defs/Datatype"
            },
            "parameters": {
              "anyOf": [
                {
                  "$ref": "#/$defs/Parameters"
                },
                {
                  "type": "null"
                }
              ],
              "default": null
            },
            "data": {
              "$ref": "#/$defs/TensorData"
            }
          },
          "required": [
            "name",
            "shape",
            "datatype",
            "data"
          ],
          "title": "RequestInput",
          "type": "object"
        },
        "RequestOutput": {
          "properties": {
            "name": {
              "title": "Name",
              "type": "string"
            },
            "parameters": {
              "anyOf": [
                {
                  "$ref": "#/$defs/Parameters"
                },
                {
                  "type": "null"
                }
              ],
              "default": null
            }
          },
          "required": [
            "name"
          ],
          "title": "RequestOutput",
          "type": "object"
        },
        "TensorData": {
          "anyOf": [
            {
              "items": {},
              "type": "array"
            },
            {}
          ],
          "title": "TensorData"
        }
      },
      "properties": {
        "id": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Id"
        },
        "parameters": {
          "anyOf": [
            {
              "$ref": "#/$defs/Parameters"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        },
        "inputs": {
          "items": {
            "$ref": "#/$defs/RequestInput"
          },
          "title": "Inputs",
          "type": "array"
        },
        "outputs": {
          "anyOf": [
            {
              "items": {
                "$ref": "#/$defs/RequestOutput"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Outputs"
        }
      },
      "required": [
        "inputs"
      ],
      "title": "InferenceRequest",
      "type": "object"
    }
    


    
    {
      "$defs": {
        "Datatype": {
          "enum": [
            "BOOL",
            "UINT8",
            "UINT16",
            "UINT32",
            "UINT64",
            "INT8",
            "INT16",
            "INT32",
            "INT64",
            "FP16",
            "FP32",
            "FP64",
            "BYTES"
          ],
          "title": "Datatype",
          "type": "string"
        },
        "Parameters": {
          "additionalProperties": true,
          "properties": {
            "content_type": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Content Type"
            },
            "headers": {
              "anyOf": [
                {
                  "type": "object"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Headers"
            }
          },
          "title": "Parameters",
          "type": "object"
        },
        "ResponseOutput": {
          "properties": {
            "name": {
              "title": "Name",
              "type": "string"
            },
            "shape": {
              "items": {
                "type": "integer"
              },
              "title": "Shape",
              "type": "array"
            },
            "datatype": {
              "$ref": "#/$defs/Datatype"
            },
            "parameters": {
              "anyOf": [
                {
                  "$ref": "#/$defs/Parameters"
                },
                {
                  "type": "null"
                }
              ],
              "default": null
            },
            "data": {
              "$ref": "#/$defs/TensorData"
            }
          },
          "required": [
            "name",
            "shape",
            "datatype",
            "data"
          ],
          "title": "ResponseOutput",
          "type": "object"
        },
        "TensorData": {
          "anyOf": [
            {
              "items": {},
              "type": "array"
            },
            {}
          ],
          "title": "TensorData"
        }
      },
      "properties": {
        "model_name": {
          "title": "Model Name",
          "type": "string"
        },
        "model_version": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Model Version"
        },
        "id": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Id"
        },
        "parameters": {
          "anyOf": [
            {
              "$ref": "#/$defs/Parameters"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        },
        "outputs": {
          "items": {
            "$ref": "#/$defs/ResponseOutput"
          },
          "title": "Outputs",
          "type": "array"
        }
      },
      "required": [
        "model_name",
        "outputs"
      ],
      "title": "InferenceResponse",
      "type": "object"
    }
    
    
    {
      "properties": {
        "error": {
          "title": "Error",
          "type": "string"
        }
      },
      "required": [
        "error"
      ],
      "title": "MetadataModelErrorResponse",
      "type": "object"
    }
    
    
    {
      "$defs": {
        "Datatype": {
          "enum": [
            "BOOL",
            "UINT8",
            "UINT16",
            "UINT32",
            "UINT64",
            "INT8",
            "INT16",
            "INT32",
            "INT64",
            "FP16",
            "FP32",
            "FP64",
            "BYTES"
          ],
          "title": "Datatype",
          "type": "string"
        },
        "MetadataTensor": {
          "properties": {
            "name": {
              "title": "Name",
              "type": "string"
            },
            "datatype": {
              "$ref": "#/$defs/Datatype"
            },
            "shape": {
              "items": {
                "type": "integer"
              },
              "title": "Shape",
              "type": "array"
            },
            "parameters": {
              "anyOf": [
                {
                  "$ref": "#/$defs/Parameters"
                },
                {
                  "type": "null"
                }
              ],
              "default": null
            }
          },
          "required": [
            "name",
            "datatype",
            "shape"
          ],
          "title": "MetadataTensor",
          "type": "object"
        },
        "Parameters": {
          "additionalProperties": true,
          "properties": {
            "content_type": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Content Type"
            },
            "headers": {
              "anyOf": [
                {
                  "type": "object"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Headers"
            }
          },
          "title": "Parameters",
          "type": "object"
        }
      },
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "versions": {
          "anyOf": [
            {
              "items": {
                "type": "string"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Versions"
        },
        "platform": {
          "title": "Platform",
          "type": "string"
        },
        "inputs": {
          "anyOf": [
            {
              "items": {
                "$ref": "#/$defs/MetadataTensor"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Inputs"
        },
        "outputs": {
          "anyOf": [
            {
              "items": {
                "$ref": "#/$defs/MetadataTensor"
              },
              "type": "array"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Outputs"
        },
        "parameters": {
          "anyOf": [
            {
              "$ref": "#/$defs/Parameters"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        }
      },
      "required": [
        "name",
        "platform"
      ],
      "title": "MetadataModelResponse",
      "type": "object"
    }
    
    
    {
      "properties": {
        "error": {
          "title": "Error",
          "type": "string"
        }
      },
      "required": [
        "error"
      ],
      "title": "MetadataServerErrorResponse",
      "type": "object"
    }
    
    
    {
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "version": {
          "title": "Version",
          "type": "string"
        },
        "extensions": {
          "items": {
            "type": "string"
          },
          "title": "Extensions",
          "type": "array"
        }
      },
      "required": [
        "name",
        "version",
        "extensions"
      ],
      "title": "MetadataServerResponse",
      "type": "object"
    }
    
    
    {
      "$defs": {
        "Datatype": {
          "enum": [
            "BOOL",
            "UINT8",
            "UINT16",
            "UINT32",
            "UINT64",
            "INT8",
            "INT16",
            "INT32",
            "INT64",
            "FP16",
            "FP32",
            "FP64",
            "BYTES"
          ],
          "title": "Datatype",
          "type": "string"
        },
        "Parameters": {
          "additionalProperties": true,
          "properties": {
            "content_type": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Content Type"
            },
            "headers": {
              "anyOf": [
                {
                  "type": "object"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Headers"
            }
          },
          "title": "Parameters",
          "type": "object"
        }
      },
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "datatype": {
          "$ref": "#/$defs/Datatype"
        },
        "shape": {
          "items": {
            "type": "integer"
          },
          "title": "Shape",
          "type": "array"
        },
        "parameters": {
          "anyOf": [
            {
              "$ref": "#/$defs/Parameters"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        }
      },
      "required": [
        "name",
        "datatype",
        "shape"
      ],
      "title": "MetadataTensor",
      "type": "object"
    }
    
    
    {
      "additionalProperties": true,
      "properties": {
        "content_type": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Content Type"
        },
        "headers": {
          "anyOf": [
            {
              "type": "object"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Headers"
        }
      },
      "title": "Parameters",
      "type": "object"
    }
    
    
    {
      "properties": {
        "ready": {
          "anyOf": [
            {
              "type": "boolean"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Ready"
        }
      },
      "title": "RepositoryIndexRequest",
      "type": "object"
    }
    
    
    {
      "$defs": {
        "RepositoryIndexResponseItem": {
          "properties": {
            "name": {
              "title": "Name",
              "type": "string"
            },
            "version": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Version"
            },
            "state": {
              "$ref": "#/$defs/State"
            },
            "reason": {
              "title": "Reason",
              "type": "string"
            }
          },
          "required": [
            "name",
            "state",
            "reason"
          ],
          "title": "RepositoryIndexResponseItem",
          "type": "object"
        },
        "State": {
          "enum": [
            "UNKNOWN",
            "READY",
            "UNAVAILABLE",
            "LOADING",
            "UNLOADING"
          ],
          "title": "State",
          "type": "string"
        }
      },
      "items": {
        "$ref": "#/$defs/RepositoryIndexResponseItem"
      },
      "title": "RepositoryIndexResponse",
      "type": "array"
    }
    
    
    {
      "$defs": {
        "State": {
          "enum": [
            "UNKNOWN",
            "READY",
            "UNAVAILABLE",
            "LOADING",
            "UNLOADING"
          ],
          "title": "State",
          "type": "string"
        }
      },
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "version": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Version"
        },
        "state": {
          "$ref": "#/$defs/State"
        },
        "reason": {
          "title": "Reason",
          "type": "string"
        }
      },
      "required": [
        "name",
        "state",
        "reason"
      ],
      "title": "RepositoryIndexResponseItem",
      "type": "object"
    }
    
    
    {
      "properties": {
        "error": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Error"
        }
      },
      "title": "RepositoryLoadErrorResponse",
      "type": "object"
    }
    
    
    {
      "properties": {
        "error": {
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "null"
            }
          ],
          "default": null,
          "title": "Error"
        }
      },
      "title": "RepositoryUnloadErrorResponse",
      "type": "object"
    }
    
    
    {
      "$defs": {
        "Datatype": {
          "enum": [
            "BOOL",
            "UINT8",
            "UINT16",
            "UINT32",
            "UINT64",
            "INT8",
            "INT16",
            "INT32",
            "INT64",
            "FP16",
            "FP32",
            "FP64",
            "BYTES"
          ],
          "title": "Datatype",
          "type": "string"
        },
        "Parameters": {
          "additionalProperties": true,
          "properties": {
            "content_type": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Content Type"
            },
            "headers": {
              "anyOf": [
                {
                  "type": "object"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Headers"
            }
          },
          "title": "Parameters",
          "type": "object"
        },
        "TensorData": {
          "anyOf": [
            {
              "items": {},
              "type": "array"
            },
            {}
          ],
          "title": "TensorData"
        }
      },
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "shape": {
          "items": {
            "type": "integer"
          },
          "title": "Shape",
          "type": "array"
        },
        "datatype": {
          "$ref": "#/$defs/Datatype"
        },
        "parameters": {
          "anyOf": [
            {
              "$ref": "#/$defs/Parameters"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        },
        "data": {
          "$ref": "#/$defs/TensorData"
        }
      },
      "required": [
        "name",
        "shape",
        "datatype",
        "data"
      ],
      "title": "RequestInput",
      "type": "object"
    }
    
    
    {
      "$defs": {
        "Parameters": {
          "additionalProperties": true,
          "properties": {
            "content_type": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Content Type"
            },
            "headers": {
              "anyOf": [
                {
                  "type": "object"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Headers"
            }
          },
          "title": "Parameters",
          "type": "object"
        }
      },
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "parameters": {
          "anyOf": [
            {
              "$ref": "#/$defs/Parameters"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        }
      },
      "required": [
        "name"
      ],
      "title": "RequestOutput",
      "type": "object"
    }
    
    
    {
      "$defs": {
        "Datatype": {
          "enum": [
            "BOOL",
            "UINT8",
            "UINT16",
            "UINT32",
            "UINT64",
            "INT8",
            "INT16",
            "INT32",
            "INT64",
            "FP16",
            "FP32",
            "FP64",
            "BYTES"
          ],
          "title": "Datatype",
          "type": "string"
        },
        "Parameters": {
          "additionalProperties": true,
          "properties": {
            "content_type": {
              "anyOf": [
                {
                  "type": "string"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Content Type"
            },
            "headers": {
              "anyOf": [
                {
                  "type": "object"
                },
                {
                  "type": "null"
                }
              ],
              "default": null,
              "title": "Headers"
            }
          },
          "title": "Parameters",
          "type": "object"
        },
        "TensorData": {
          "anyOf": [
            {
              "items": {},
              "type": "array"
            },
            {}
          ],
          "title": "TensorData"
        }
      },
      "properties": {
        "name": {
          "title": "Name",
          "type": "string"
        },
        "shape": {
          "items": {
            "type": "integer"
          },
          "title": "Shape",
          "type": "array"
        },
        "datatype": {
          "$ref": "#/$defs/Datatype"
        },
        "parameters": {
          "anyOf": [
            {
              "$ref": "#/$defs/Parameters"
            },
            {
              "type": "null"
            }
          ],
          "default": null
        },
        "data": {
          "$ref": "#/$defs/TensorData"
        }
      },
      "required": [
        "name",
        "shape",
        "datatype",
        "data"
      ],
      "title": "ResponseOutput",
      "type": "object"
    }
    
    
    {
      "anyOf": [
        {
          "items": {},
          "type": "array"
        },
        {}
      ],
      "title": "TensorData"
    }