This package provides an MLServer runtime compatible with LightGBM.
You can install the runtime, alongside `mlserver`, as:
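```bash
pip install mlserver mlserver-lightgbm
```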
For further information on how to use MLServer with LightGBM, you can check out this worked out example.
If no content type is present on the request or metadata, the LightGBM runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
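For instance, a minimal `model-settings.json` along the following lines (the model name, `uri` and tensor metadata are illustrative placeholders) could declare the expected content type as part of the model's metadata:

```json
{
  "name": "my-lightgbm-model",
  "implementation": "mlserver_lightgbm.LightGBMModel",
  "parameters": {
    "uri": "./model.bst"
  },
  "inputs": [
    {
      "name": "input-0",
      "datatype": "FP32",
      "shape": [-1, 4],
      "parameters": {
        "content_type": "np"
      }
    }
  ]
}
```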
This package provides an MLServer runtime compatible with Spark MLlib.
You can install the runtime, alongside `mlserver`, as:
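```bash
pip install mlserver mlserver-mllib
```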
For further information on how to use MLServer with Spark MLlib, you can check out the MLServer repository.
This package provides an MLServer runtime compatible with MLflow models.
You can install the runtime, alongside `mlserver`, as:
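```bash
pip install mlserver mlserver-mlflow
```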
The MLflow inference runtime introduces a new `dict` content type, which decodes an incoming V2 request as a dictionary of tensors. This is useful for certain MLflow-serialised models, which will expect that the model inputs are serialised in this format.
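As a sketch of what this might look like (the input names and values below are made up), a V2 request could set the `dict` content type at the request level:

```json
{
  "parameters": {
    "content_type": "dict"
  },
  "inputs": [
    {
      "name": "sepal_length",
      "datatype": "FP64",
      "shape": [1],
      "data": [5.1]
    },
    {
      "name": "sepal_width",
      "datatype": "FP64",
      "shape": [1],
      "data": [3.5]
    }
  ]
}
```

The runtime would then decode this into a dictionary mapping each input name to its corresponding tensor before handing it to the MLflow model.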
Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice.
Out of the box, MLServer comes with a set of pre-packaged runtimes which let you interact with a subset of common ML frameworks. This allows you to start serving models saved in these frameworks straight away. To avoid bringing in dependencies for frameworks that you don't need to use, these runtimes are implemented as independent (and optional) Python packages. This mechanism also allows you to roll out your own custom runtimes very easily.
To pick which runtime you want to use for your model, you just need to make sure that the right package is installed, and then point to the correct runtime class in your `model-settings.json` file.
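For example, to serve a model with the Scikit-Learn runtime listed below, a minimal `model-settings.json` (the model name and `uri` are placeholders) could look like:

```json
{
  "name": "my-sklearn-model",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parameters": {
    "uri": "./model.joblib"
  }
}
```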
| Framework | Package Name | Implementation Class |
| --- | --- | --- |
| Scikit-Learn | `mlserver-sklearn` | `mlserver_sklearn.SKLearnModel` |
| XGBoost | `mlserver-xgboost` | `mlserver_xgboost.XGBoostModel` |
| Spark MLlib | `mlserver-mllib` | `mlserver_mllib.MLlibModel` |
| LightGBM | `mlserver-lightgbm` | `mlserver_lightgbm.LightGBMModel` |
| CatBoost | `mlserver-catboost` | `mlserver_catboost.CatboostModel` |
| MLflow | `mlserver-mlflow` | `mlserver_mlflow.MLflowRuntime` |
| Alibi-Detect | `mlserver-alibi-detect` | `mlserver_alibi_detect.AlibiDetectRuntime` |
This package provides an MLServer runtime compatible with CatBoost's `CatboostClassifier`.
You can install the runtime, alongside `mlserver`, as:
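```bash
pip install mlserver mlserver-catboost
```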
For further information on how to use MLServer with CatBoost, you can check out this worked out example.
If no content type is present on the request or metadata, the CatBoost runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
This package provides an MLServer runtime compatible with alibi-detect models.
You can install the `mlserver-alibi-detect` runtime, alongside `mlserver`, as:
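```bash
pip install mlserver mlserver-alibi-detect
```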
For further information on how to use MLServer with Alibi-Detect, you can check out this worked out example.
If no content type is present on the request or metadata, the Alibi-Detect runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
The Alibi Detect runtime exposes a couple of settings which can be used to customise how the runtime behaves. These settings can be added under the `parameters.extra` section of your `model-settings.json` file, e.g.
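A configuration along these lines (the detector name, `uri` and `batch_size` value are illustrative; see the settings reference mentioned below for the accepted fields) shows where these extra settings go:

```json
{
  "name": "drift-detector",
  "implementation": "mlserver_alibi_detect.AlibiDetectRuntime",
  "parameters": {
    "uri": "./drift-detector-artefact/",
    "extra": {
      "batch_size": 5
    }
  }
}
```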
You can find the full reference of the accepted extra settings for the Alibi Detect runtime below:
This package provides an MLServer runtime compatible with XGBoost.
You can install the runtime, alongside `mlserver`, as:
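```bash
pip install mlserver mlserver-xgboost
```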
For further information on how to use MLServer with XGBoost, you can check out this worked out example.
The XGBoost inference runtime will expect that your model is serialised via one of the following methods:
| Extension | Example |
| --- | --- |
| `*.json` | `booster.save_model("model.json")` |
| `*.ubj` | `booster.save_model("model.ubj")` |
| `*.bst` | `booster.save_model("model.bst")` |
If no content type is present on the request or metadata, the XGBoost runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
The XGBoost inference runtime exposes a number of outputs depending on the model type. These outputs correspond to the `predict` and `predict_proba` methods of the XGBoost model.
| Output | Returned By Default | Availability |
| --- | --- | --- |
| `predict` | ✅ | Available on all XGBoost models. |
| `predict_proba` | ❌ | Only available on non-regressor models (i.e. `XGBClassifier` models). |
By default, the runtime will only return the output of `predict`. However, you are able to control which outputs you want back through the `outputs` field of your `InferenceRequest` (`mlserver.types.InferenceRequest`) payload.
For example, to only return the model's `predict_proba` output, you could define a payload such as:
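```json
{
  "inputs": [
    {
      "name": "my-input",
      "datatype": "FP32",
      "shape": [1, 4],
      "data": [0.1, 0.2, 0.3, 0.4]
    }
  ],
  "outputs": [
    { "name": "predict_proba" }
  ]
}
```

The input tensor above is just a placeholder; the key part is the `outputs` field, which requests `predict_proba` only.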
This package provides an MLServer runtime compatible with HuggingFace Transformers.
You can install the runtime, alongside `mlserver`, as:
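```bash
pip install mlserver mlserver-huggingface
```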
For further information on how to use MLServer with HuggingFace, you can check out this worked out example.
The HuggingFace runtime will always decode the input request using its own built-in codec. Therefore, content type annotations at the request level will be ignored. Note that this doesn't include input-level content type annotations, which will be respected as usual.
The HuggingFace runtime exposes a couple of extra parameters which can be used to customise how the runtime behaves. These settings can be added under the `parameters.extra` section of your `model-settings.json` file, e.g.
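A configuration roughly like the following (the runtime class name and `task` value are the usual ones for this runtime, but treat them as illustrative) shows where these extra parameters live:

```json
{
  "name": "transformer",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "extra": {
      "task": "text-generation",
      "pretrained_model": "distilgpt2"
    }
  }
}
```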
It is possible to load a local model into a HuggingFace pipeline by specifying the model artefact folder path in `parameters.uri` in `model-settings.json`.
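For instance (the path and `task` value below are placeholders; other extra settings may still apply):

```json
{
  "name": "local-transformer",
  "implementation": "mlserver_huggingface.HuggingFaceRuntime",
  "parameters": {
    "uri": "./my-model-artefacts/",
    "extra": {
      "task": "text-classification"
    }
  }
}
```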
Models in the HuggingFace hub can be loaded by specifying their name in `parameters.extra.pretrained_model` in `model-settings.json`.
You can find the full reference of the accepted extra settings for the HuggingFace runtime below:
This package provides an MLServer runtime compatible with Scikit-Learn.
You can install the runtime, alongside `mlserver`, as:
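```bash
pip install mlserver mlserver-sklearn
```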
For further information on how to use MLServer with Scikit-Learn, you can check out this worked out example.
If no content type is present on the request or metadata, the Scikit-Learn runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
The Scikit-Learn inference runtime exposes a number of outputs depending on the model type. These outputs correspond to the `predict`, `predict_proba` and `transform` methods of the Scikit-Learn model.
By default, the runtime will only return the output of `predict`. However, you are able to control which outputs you want back through the `outputs` field of your `InferenceRequest` (`mlserver.types.InferenceRequest`) payload.
For example, to only return the model's `predict_proba` output, you could define a payload such as:
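```json
{
  "inputs": [
    {
      "name": "my-input",
      "datatype": "FP32",
      "shape": [1, 4],
      "data": [0.1, 0.2, 0.3, 0.4]
    }
  ],
  "outputs": [
    { "name": "predict_proba" }
  ]
}
```

As with the XGBoost example above, the input tensor is a placeholder; the `outputs` field is what restricts the response to `predict_proba`.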
There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.
To learn more about how you can write custom runtimes with MLServer, check out the custom runtimes guide. Alternatively, you can also see this worked out example, which walks through the process of writing a custom runtime.
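As a rough sketch (the `NumpyRequestCodec` decoding and the `load_my_model` helper are illustrative choices, not a prescribed pattern), a custom runtime amounts to subclassing `MLModel` and implementing `load()` and `predict()`:

```python
from mlserver import MLModel
from mlserver.codecs import NumpyRequestCodec
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Load your model artefact here; `load_my_model` is a placeholder for
        # whatever deserialisation your framework requires.
        self._model = load_my_model(self.settings.parameters.uri)
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the V2 request into a NumPy array, run the model and encode
        # the result back into a V2 response.
        input_data = self.decode_request(payload, default_codec=NumpyRequestCodec)
        output = self._model.predict(input_data)
        return NumpyRequestCodec.encode_response(self.name, output, self.version)
```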
| Output | Returned By Default | Availability |
| --- | --- | --- |
| `predict` | ✅ | Available on most models, but not on Scikit-Learn pipelines. |
| `predict_proba` | ❌ | Only available on non-regressor models. |
| `transform` | ❌ | Only available on Scikit-Learn pipelines. |