Out of the box, MLServer includes support for streaming data to your models. Streaming support is available for both the REST and gRPC servers.
Streaming support for the REST server is limited only to server streaming. This means that the client sends a single request to the server, and the server responds with a stream of data.
The streaming endpoints are available for both the `infer` and `generate` methods through the following endpoints:

- `/v2/models/{model_name}/versions/{model_version}/infer_stream`
- `/v2/models/{model_name}/infer_stream`
- `/v2/models/{model_name}/versions/{model_version}/generate_stream`
- `/v2/models/{model_name}/generate_stream`

Note that for REST, the `generate` and `generate_stream` endpoints are aliases for the `infer` and `infer_stream` endpoints, respectively. Those names are used to better reflect the nature of the operation for Large Language Models (LLMs).
Streaming support for the gRPC server is available for both client and server streaming. This means that the client sends a stream of data to the server, and the server responds with a stream of data.
The two streams operate independently, so the client and the server can read and write data however they want (e.g., the server could either wait to receive all the client messages before sending a response, or it can send a response after each message). Note that bi-directional streaming covers all the possible combinations of client and server streaming: unary-stream, stream-unary, and stream-stream. The unary-unary case could be covered by bi-directional streaming as well, but `mlserver` already has the `predict` method dedicated to that use case. The logic for how requests are received and processed, and how responses are sent back, should be built into the runtime logic.

The stub method to be used by the client for streaming is `ModelStreamInfer`.
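For illustration, a server-streaming runtime could look roughly like the sketch below. The `predict_stream()` hook and the `StreamingEchoModel` class are assumptions based on recent MLServer releases, so double-check the streaming API available in your MLServer version:

```python
from typing import AsyncIterator

from mlserver import MLModel
from mlserver.codecs import StringCodec, StringRequestCodec
from mlserver.types import InferenceRequest, InferenceResponse


class StreamingEchoModel(MLModel):
    async def load(self) -> bool:
        return True

    # Assumed streaming hook: receives an async iterator of requests and
    # yields responses back to the client as they become available.
    async def predict_stream(
        self, payloads: AsyncIterator[InferenceRequest]
    ) -> AsyncIterator[InferenceResponse]:
        async for request in payloads:
            texts = StringCodec.decode_input(request.inputs[0])
            for text in texts:
                for word in text.split():
                    # Stream one response per token instead of waiting for
                    # the full output to be ready.
                    yield StringRequestCodec.encode_response(
                        model_name=self.name, payload=[word]
                    )
```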
There are two main limitations of the streaming support in MLServer:

- the `parallel_workers` setting should be set to `0` to disable distributed workers (to be addressed in future releases);
- for REST, the `gzip_enabled` setting should be set to `false` to disable GZIP compression, as streaming is not compatible with GZIP compression.
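For reference, a minimal `settings.json` honouring both constraints could look like this (any other settings you need would sit alongside these two fields):

```json
{
  "parallel_workers": 0,
  "gzip_enabled": false
}
```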
MLServer is used as the core Python inference server in KServe (formerly known as KFServing). This allows for a straightforward avenue to deploy your models into a scalable serving infrastructure backed by Kubernetes.
This section assumes a basic knowledge of KServe and Kubernetes, as well as access to a working Kubernetes cluster with KServe installed. To learn more about KServe or how to install it, please visit the KServe documentation.
KServe provides built-in serving runtimes to deploy models trained in common ML frameworks. These allow you to deploy your models into a robust infrastructure by just pointing to where the model artifacts are stored remotely.
Some of these runtimes leverage MLServer as the core inference server. Therefore, it should be straightforward to move from your local testing to your serving infrastructure.
To use any of the built-in serving runtimes offered by KServe, it should be enough to select the relevant one in your `InferenceService` manifest.
For example, to serve a Scikit-Learn model, you could use a manifest like the one below:
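A sketch of such a manifest is shown below; the resource name and `storageUri` are placeholders to replace with your own values:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-sklearn-model
spec:
  predictor:
    sklearn:
      # Serve the model through the V2 inference protocol (backed by MLServer)
      protocolVersion: v2
      # Placeholder: point this to where your model artifacts are stored
      storageUri: gs://my-bucket/my-sklearn-model
```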
As you can see highlighted above, the `InferenceService` manifest will only need to specify the following points:

- The model artifact is a Scikit-Learn model. Therefore, we will use the `sklearn` serving runtime to deploy it.
- The model will be served using the V2 inference protocol, which can be enabled by setting the `protocolVersion` field to `v2`.
Once you have your `InferenceService` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, by running:
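For example, assuming the manifest above is saved as `my-sklearn-model.yaml` (an illustrative file name):

```bash
kubectl apply -f my-sklearn-model.yaml
```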
As mentioned above, KServe offers support for built-in serving runtimes, some of which leverage MLServer as the inference server. Below you can find a table listing these runtimes, and the MLServer inference runtime that they correspond to.

| Framework | MLServer Runtime | KServe Serving Runtime | Documentation |
|---|---|---|---|
| Scikit-Learn | MLServer SKLearn | `sklearn` | |
| XGBoost | MLServer XGBoost | `xgboost` | |
Note that, on top of the ones shown above (backed by MLServer), KServe also provides a wider set of serving runtimes. To see the full list, please visit the KServe documentation.
Sometimes, the serving runtimes built into KServe may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, it's easy to deploy them into your serving infrastructure leveraging KServe support for custom runtimes.
The `InferenceService` manifest gives you full control over the containers used to deploy your machine learning model. This can be leveraged to point your deployment to the custom MLServer image containing your custom logic. For example, if we assume that our custom image has been tagged as `my-custom-server:0.1.0`, we could write an `InferenceService` manifest like the one below:
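A sketch of such a manifest; the resource name, container name and port are illustrative, and the `PROTOCOL` environment variable is shown as one possible way to signal the V2 protocol to KServe (check the KServe docs for the mechanism supported by your version):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-custom-model
spec:
  predictor:
    containers:
      - name: classifier
        # Point the deployment to our custom MLServer image
        image: my-custom-server:0.1.0
        env:
          # Illustrative: explicitly choose the V2 inference protocol
          - name: PROTOCOL
            value: v2
        ports:
          # Let KServe know which port the custom container exposes
          # (MLServer's REST port defaults to 8080)
          - containerPort: 8080
            protocol: TCP
```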
As we can see highlighted above, the main points that we'll need to take into account are:

- Pointing to our custom MLServer image in the custom container section of our `InferenceService`.
- Explicitly choosing the V2 inference protocol to serve our model.
- Letting KServe know what port will be exposed by our custom container to send inference requests.

Once you have your `InferenceService` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, by running:
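For example, assuming the manifest above is saved as `my-custom-model.yaml` (an illustrative file name):

```bash
kubectl apply -f my-custom-model.yaml
```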
MLServer is used as the core Python inference server in Seldon Core. Therefore, it should be straightforward to deploy your models either by using one of the built-in pre-packaged servers or by pointing to a custom image of MLServer.
This section assumes a basic knowledge of Seldon Core and Kubernetes, as well as access to a working Kubernetes cluster with Seldon Core installed. To learn more about Seldon Core or how to install it, please visit the Seldon Core documentation.
Out of the box, Seldon Core comes with a few MLServer runtimes pre-configured to run straight away. This allows you to deploy an MLServer instance by just pointing to where your model artifact is and specifying what ML framework was used to train it.
To let Seldon Core know what framework was used to train your model, you can use the `implementation` field of your `SeldonDeployment` manifest. For example, to deploy a Scikit-Learn artifact stored remotely in GCS, one could do:
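A sketch of such a manifest; the deployment name and `modelUri` are placeholders:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-sklearn-model
spec:
  # Serve the model using the V2 inference protocol
  protocol: kfserving
  predictors:
    - name: default
      graph:
        name: classifier
        # Use Seldon Core's pre-packaged MLServer SKLearn server
        implementation: SKLEARN_SERVER
        # Placeholder: point this to where your model artifact is stored
        modelUri: gs://my-bucket/my-sklearn-model
```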
As you can see highlighted above, all that we need to specify is that:

- Our inference deployment should use the V2 inference protocol, which is done by setting the `protocol` field to `kfserving`.
- Our model artifact is a serialised Scikit-Learn model; therefore, it should be served using the MLServer SKLearn runtime, which is done by setting the `implementation` field to `SKLEARN_SERVER`.
Note that, while the `protocol` should always be set to `kfserving` (i.e. so that models are served using the V2 inference protocol), the value of the `implementation` field will be dependent on your ML framework. The valid values of the `implementation` field are pre-determined by Seldon Core. However, it should also be possible to configure and add new ones (e.g. to support a custom MLServer runtime).

Once you have your `SeldonDeployment` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, by running:
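For example, assuming the manifest above is saved as `my-sklearn-model.yaml` (an illustrative file name):

```bash
kubectl apply -f my-sklearn-model.yaml
```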
To consult the supported values of the `implementation` field where MLServer is used, you can check the support table below.
As mentioned above, pre-packaged servers come built-in into Seldon Core. Therefore, only a pre-determined subset of them will be supported for a given release of Seldon Core.
The table below shows a list of the currently supported values of the `implementation` field. Each row also shows what ML framework they correspond to and what MLServer runtime will be enabled internally on your model deployment when used.

| Framework | MLServer Runtime | Seldon Core Pre-packaged Server | Documentation |
|---|---|---|---|
| Scikit-Learn | MLServer SKLearn | `SKLEARN_SERVER` | |
| XGBoost | MLServer XGBoost | `XGBOOST_SERVER` | |
| MLflow | MLServer MLflow | `MLFLOW_SERVER` | |
Note that, on top of the ones shown above (backed by MLServer), Seldon Core also provides a wider set of pre-packaged servers. To check the full list, please visit the Seldon Core documentation.
There could be cases where the pre-packaged MLServer runtimes supported out-of-the-box in Seldon Core may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, Seldon Core makes it just as easy to deploy them into your serving infrastructure.
The `componentSpecs` field of the `SeldonDeployment` manifest allows us to let Seldon Core know what image should be used to serve a custom model. For example, if we assume that our custom image has been tagged as `my-custom-server:0.1.0`, we could write our `SeldonDeployment` manifest as follows:
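A sketch of such a manifest; the deployment and container names are illustrative:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-custom-model
spec:
  # Serve the model deployment using the V2 inference protocol
  protocol: v2
  predictors:
    - name: default
      graph:
        name: classifier
        type: MODEL
      componentSpecs:
        - spec:
            containers:
              # Point the model container to our custom MLServer image
              - name: classifier
                image: my-custom-server:0.1.0
```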
As we can see highlighted on the snippet above, all that's needed to deploy a custom MLServer image is:

- Letting Seldon Core know that the model deployment will be served through the V2 inference protocol, by setting the `protocol` field to `v2`.
- Pointing our model container to our custom MLServer image, by specifying it on the `image` field of the `componentSpecs` section of the manifest.

Once you have your `SeldonDeployment` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, by running:
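For example, assuming the manifest above is saved as `my-custom-model.yaml` (an illustrative file name):

```bash
kubectl apply -f my-custom-model.yaml
```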
There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.
This page covers some of the bigger points that need to be taken into account when extending MLServer. You can also see this end-to-end example which walks through the process of writing a custom runtime.
MLServer is designed as an easy-to-extend framework, encouraging users to write their own custom runtimes easily. The starting point for this is the `mlserver.MLModel` abstract class, whose main methods are:

- `load()`: Responsible for loading any artifacts related to a model (e.g. model weights, pickle files, etc.).
- `unload()`: Responsible for unloading the model, freeing any resources (e.g. GPU memory, etc.).
- `predict()`: Responsible for using a model to perform inference on an incoming data point.

Therefore, the "one-line version" of how to write a custom runtime is to write a custom class extending from `mlserver.MLModel`, and then overriding those methods with your custom logic.
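For illustration, a bare-bones custom runtime could look like the sketch below. The joblib artifact path, the model logic and the use of the NumPy codecs are assumptions to adapt to your own model:

```python
import joblib

from mlserver import MLModel
from mlserver.codecs import NumpyCodec, NumpyRequestCodec
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Placeholder: load whatever artifacts your model needs
        self._model = joblib.load("./my-model.joblib")
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the first input head into a NumPy array, run inference and
        # encode the result back into a V2-compatible response
        data = NumpyCodec.decode_input(payload.inputs[0])
        prediction = self._model.predict(data)
        return NumpyRequestCodec.encode_response(
            model_name=self.name, payload=prediction
        )
```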
MLServer exposes an alternative "simplified" interface which can be used to write custom runtimes. This interface can be enabled by decorating your `predict()` method with the `mlserver.codecs.decode_args` decorator. This will let you specify in the method signature both how you want your request payload to be decoded and how to encode the response back.
Based on the information provided in the method signature, MLServer will automatically decode the request payload into the different inputs specified as keyword arguments. Under the hood, this is implemented through MLServer's codecs and content types system.
MLServer's "simplified" interface aims to cover use cases where encoding / decoding can be done through one of the codecs built into the MLServer package. However, there are instances where this may not be enough (e.g. variable number of inputs, variable content types, etc.). For these types of cases, please use MLServer's "advanced" interface, where you will have full control over the full encoding / decoding process.
As an example of the above, let's assume a model which:

- Takes two lists of strings as inputs:
  - `questions`, containing multiple questions to ask our model.
  - `context`, containing multiple contexts for each of the questions.
- Returns a Numpy array with some predictions as the output.

Leveraging MLServer's simplified notation, we can represent the above as the following custom runtime:
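A sketch of such a runtime; the model stub loaded below is a placeholder for your own loading logic:

```python
from typing import List

import numpy as np

from mlserver import MLModel
from mlserver.codecs import decode_args


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Placeholder: swap this stub for your own model-loading logic
        self._model = lambda questions, context: np.zeros(len(questions))
        return True

    @decode_args
    async def predict(
        self, questions: List[str], context: List[str]
    ) -> np.ndarray:
        # MLServer decodes the `questions` and `context` inputs into lists of
        # strings, and encodes the returned NumPy array back into a V2 response
        return self._model(questions, context)
```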
Note that the method signature of our `predict` method now specifies:

- The input names that we should be looking for in the request payload (i.e. `questions` and `context`).
- The expected content type for each of the request inputs (i.e. `List[str]` in both cases).
- The expected content type of the response outputs (i.e. `np.ndarray`).
The `headers` field within the `parameters` section of the request / response is managed by MLServer. Therefore, incoming payloads where this field has been explicitly modified will be overridden.

There are occasions where custom logic must be made conditional on extra information sent by the client outside of the payload. To allow for these use cases, MLServer will map all incoming HTTP headers (in the case of REST) or metadata (in the case of gRPC) into the `headers` field of the `parameters` object within the `InferenceRequest` instance.

Similarly, to return any HTTP headers (in the case of REST) or metadata (in the case of gRPC), you can append any values to the `headers` field within the `parameters` object of the returned `InferenceResponse` instance.
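For illustration, a `predict()` method could read and set these headers roughly as in the sketch below; the header names are illustrative:

```python
from mlserver import MLModel
from mlserver.codecs import StringRequestCodec
from mlserver.types import InferenceRequest, InferenceResponse, Parameters


class HeaderAwareRuntime(MLModel):
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Read an incoming HTTP header / gRPC metadata entry (illustrative name)
        tenant = ""
        if payload.parameters and payload.parameters.headers:
            tenant = payload.parameters.headers.get("x-tenant-id", "")

        response = StringRequestCodec.encode_response(
            model_name=self.name, payload=[f"hello {tenant}"]
        )
        # Append a custom header / metadata entry to the response
        response.parameters = Parameters(headers={"x-handled-by": self.name})
        return response
```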
MLServer lets you load custom runtimes dynamically into a running instance of MLServer. Once you have your custom runtime ready, all you need to do is move it to your model folder, next to your `model-settings.json` configuration file.

For example, if we assume a flat model repository where each folder represents a model, you would end up with a folder structure like the one below:
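A layout along these lines (model and file names are illustrative):

```
models/
└── my-custom-model/
    ├── model-settings.json
    └── models.py
```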
Note that, from the example above, we are assuming that:

- Your custom runtime code lives in the `models.py` file.
- The `implementation` field of your `model-settings.json` configuration file contains the import path of your custom runtime (e.g. `models.MyCustomRuntime`).
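For reference, the matching `model-settings.json` could look like this (the model name is illustrative):

```json
{
  "name": "my-custom-model",
  "implementation": "models.MyCustomRuntime"
}
```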
More often than not, your custom runtimes will depend on external 3rd-party dependencies which are not included within the main MLServer package. In these cases, to load your custom runtime, MLServer will need access to these dependencies.

It is possible to load this custom set of dependencies by providing them through an environment tarball, whose path can be specified within your `model-settings.json` file.

To load a custom environment, parallel inference must be enabled.

The main MLServer process communicates with workers in custom environments via `multiprocessing.Queue` using pickled objects. Custom environments therefore must use the same version of MLServer and a compatible version of Python with the same default pickle protocol as the main process. Consult the tables below for environment compatibility.

| Status | Description |
|---|---|
| 🔴 | Unsupported |
| 🟢 | Supported |
| 🔵 | Untested |

| Worker Python \ Server Python | 3.9 | 3.10 | 3.11 |
|---|---|---|---|
| 3.9 | 🟢 | 🟢 | 🔵 |
| 3.10 | 🟢 | 🟢 | 🔵 |
| 3.11 | 🔵 | 🔵 | 🔵 |
If we take the previous example above as a reference, we could extend it to include our custom environment as:
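An extended layout could then look like this (again, names are illustrative):

```
models/
└── my-custom-model/
    ├── environment.tar.gz
    ├── model-settings.json
    └── models.py
```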
Note that, in the folder layout above, we are assuming that:

- The `environment.tar.gz` tarball contains a pre-packaged version of your custom environment.
- The `environment_tarball` field of your `model-settings.json` configuration file points to your pre-packaged custom environment (i.e. `./environment.tar.gz`).
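For reference, the matching `model-settings.json` could then look like the sketch below, assuming the `environment_tarball` field lives under the model's `parameters` (as in recent MLServer versions):

```json
{
  "name": "my-custom-model",
  "implementation": "models.MyCustomRuntime",
  "parameters": {
    "environment_tarball": "./environment.tar.gz"
  }
}
```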
The `mlserver build` command expects that a Docker runtime is available and running in the background.

MLServer offers built-in utilities to help you build a custom MLServer image. This image can contain any custom code (including custom inference runtimes), as well as any custom environment, provided either through a Conda environment file or a `requirements.txt` file.

To leverage these, we can use the `mlserver build` command. Assuming that we're currently in the folder containing our custom inference runtime, we should be able to just run:
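For example (the trailing `.` points at the current folder, and the tag matches the image name mentioned below):

```bash
mlserver build . -t my-custom-server
```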
The output will be a Docker image named `my-custom-server`, ready to be used.

The `mlserver build` subcommand will search for any Conda environment file (i.e. named either `environment.yaml` or `conda.yaml`) and / or any `requirements.txt` present in your root folder. These can be used to tell MLServer what Python environment is required in the final Docker image.

The environment built by the `mlserver build` subcommand will be global to the whole MLServer image (i.e. every loaded model will, by default, use that custom environment). For Multi-Model Serving scenarios, it may be better to use per-model custom environments instead, which will allow you to run multiple custom environments at the same time.

The `mlserver build` subcommand will treat any `settings.json` or `model-settings.json` files present in your root folder as the default settings that must be set in your final image. Therefore, these files can be used to configure things like the default inference runtime to be used, or to even include embedded models that will always be present within your custom image.

Default setting values can still be overridden by external environment variables or model-specific `model-settings.json` files.
Out of the box, the `mlserver build` subcommand leverages a default `Dockerfile` which takes into account a number of requirements, like:

- Supporting arbitrary user IDs.
- Building your base custom environment on the fly.
- Configuring a set of default setting values.

However, there may be occasions where you need to customise your `Dockerfile` even further. This may be the case, for example, when you need to provide extra environment variables or when you need to customise your Docker build process (e.g. by using other "Docker-less" tools, like Kaniko or Buildah).

To account for these cases, MLServer also includes an `mlserver dockerfile` subcommand which will just generate a `Dockerfile` (and optionally a `.dockerignore` file) exactly like the one used by the `mlserver build` command. This `Dockerfile` can then be customised according to your needs.

The base `Dockerfile` requires Docker's BuildKit to be enabled. To ensure BuildKit is used, you can set the `DOCKER_BUILDKIT=1` environment variable, e.g.
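For example (the image tag is illustrative):

```bash
DOCKER_BUILDKIT=1 docker build . -t my-custom-server
```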
Out of the box, MLServer includes support to offload inference workloads to a pool of workers running in separate processes. This allows MLServer to scale out beyond the limitations of the Python interpreter. To learn more about why this can be beneficial, you can check the concurrency section below.
By default, MLServer will spin up a pool with only one worker process to run inference. All models will be loaded uniformly across the inference pool workers. To read more about advanced settings, please see the usage section below.
The Global Interpreter Lock (GIL) is a mutex lock that exists in most Python interpreters (e.g. CPython). Its main purpose is to lock Python's execution so that it only runs on a single processor at the same time. This simplifies certain things for the interpreter. However, it also adds the limitation that a single Python process will never be able to leverage multiple cores.
When we think about MLServer's support for Multi-Model Serving (MMS), this could lead to scenarios where a heavily-used model starves the other models running within the same MLServer instance. Similarly, even if we don’t take MMS into account, the GIL also makes it harder to scale inference for a single model.
To work around this limitation, MLServer offloads the model inference to a pool of workers, where each worker is a separate Python process (and thus has its own separate GIL). This means that we can get full access to the underlying hardware.
Managing the Inter-Process Communication (IPC) between the main MLServer process and the inference pool workers brings in some overhead. Under the hood, MLServer uses the `multiprocessing` library to implement the distributed processing management, which has been shown to offer the smallest possible overhead when implementing these types of distributed strategies (Zhi et al., 2020).

The extra overhead introduced by other libraries is usually brought in as a trade-off in exchange for other advanced features for complex distributed processing scenarios. However, MLServer's use case is simple enough to not require any of these.

Despite the above, even though this overhead is minimised, it can still be particularly noticeable for lightweight inference methods, where the extra IPC overhead can take up a large percentage of the overall time. In these cases (which can only be assessed on a model-by-model basis), the user has the option to disable the parallel inference feature.

For regular models where inference can take a bit more time, this overhead is usually offset by the benefit of having multiple cores to compute inference on.
By default, MLServer will always create an inference pool with one single worker. The number of workers (i.e. the size of the inference pool) can be adjusted globally through the server-level `parallel_workers` setting.

`parallel_workers`

The `parallel_workers` field of the `settings.json` file (or alternatively, the `MLSERVER_PARALLEL_WORKERS` global environment variable) controls the size of MLServer's inference pool. The expected values are:

- `N`, where `N > 0`, will create a pool of `N` workers.
- `0` will disable the parallel inference feature. In other words, inference will happen within the main MLServer process.
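For example, a `settings.json` requesting a pool of four workers could look like this:

```json
{
  "parallel_workers": 4
}
```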
Jiale Zhi, Rui Wang, Jeff Clune, and Kenneth O. Stanley. Fiber: A Platform for Efficient Development and Distributed Training for Reinforcement Learning and Population-Based Methods. arXiv:2003.11164 [cs, stat], March 2020.
MLServer follows the Open Inference Protocol (previously known as the "V2 Protocol"). You can find the full OpenAPI spec for the Open Inference Protocol in the links below:
| Name | Description | OpenAPI Spec |
|---|---|---|
| Open Inference Protocol | Main dataplane for inference, health and metadata | |
| Model Repository Extension | Extension to the protocol to provide a control plane which lets you load / unload models dynamically | |

On top of the OpenAPI spec above, MLServer also autogenerates a Swagger UI which can be used to interact dynamically with the Open Inference Protocol.

The autogenerated Swagger UI can be accessed under the `/v2/docs` endpoint.

Besides the Swagger UI, you can also access the raw OpenAPI spec through the `/v2/docs/dataplane.json` endpoint.
Alongside the general API documentation, MLServer will also autogenerate a Swagger UI tailored to individual models, showing the endpoints available for each one.
The model-specific autogenerated Swagger UI can be accessed under the following endpoints:

- `/v2/models/{model_name}/docs`
- `/v2/models/{model_name}/versions/{model_version}/docs`

Besides the Swagger UI, you can also access the model-specific raw OpenAPI spec through the following endpoints:

- `/v2/models/{model_name}/docs/dataplane.json`
- `/v2/models/{model_name}/versions/{model_version}/docs/dataplane.json`
Out-of-the-box, MLServer exposes a set of metrics that help you monitor your machine learning workloads in production. These include standard metrics like number of requests and latency.
On top of these, you can also register and track your own custom metrics as part of your custom inference runtimes.
By default, MLServer will expose metrics around inference requests (count and error rate) and the status of its internal requests queues. These internal queues are used for adaptive batching and communication with the inference workers.
| Metric Name | Description |
|---|---|
| `model_infer_request_success` | Number of successful inference requests. |
| `model_infer_request_failure` | Number of failed inference requests. |
| `batch_request_queue` | Queue size for the adaptive batching queue. |
| `parallel_request_queue` | Queue size for the inference workers queue. |
On top of the default set of metrics, MLServer's REST server will also expose a set of metrics specific to REST.
The prefix for the REST-specific metrics will be dependent on the `metrics_rest_server_prefix` flag from the MLServer settings.

| Metric Name | Description |
|---|---|
| `[rest_server]_requests` | Number of REST requests, labelled by endpoint and status code. |
| `[rest_server]_requests_duration_seconds` | Latency of REST requests. |
| `[rest_server]_requests_in_progress` | Number of in-flight REST requests. |

On top of the default set of metrics, MLServer's gRPC server will also expose a set of metrics specific to gRPC.

| Metric Name | Description |
|---|---|
| `grpc_server_handled` | Number of gRPC requests, labelled by gRPC code and method. |
| `grpc_server_started` | Number of in-flight gRPC requests. |

MLServer allows you to register custom metrics within your custom inference runtimes. This can be done through the `mlserver.register()` and `mlserver.log()` methods.

- `mlserver.register`: Register a new metric.
- `mlserver.log`: Log a new set of metric / value pairs. If there's any unregistered metric, it will get registered on-the-fly.
Under the hood, metrics logged through the `mlserver.log` method will get exposed to Prometheus as a Histogram.

Custom metrics will generally be registered in the `load()` method and then used in the `predict()` method of your custom runtime.
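A sketch of this pattern; the metric name and logged value are illustrative:

```python
import mlserver

from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Register the custom metric up front, so it is exposed even before
        # the first value gets logged
        mlserver.register("my_custom_metric", "This is a custom metric example")
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Log a new value for the custom metric on every inference request
        mlserver.log(my_custom_metric=34)
        return InferenceResponse(model_name=self.name, outputs=[])
```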
For metrics specific to a model (e.g. custom metrics, request counts, etc.), MLServer will always label these with the model name and model version. Downstream, this will allow you to aggregate and query metrics per model.

If these labels are not present on a specific metric, it means that those metrics can't be sliced at the model level.

Below, you can find the list of standardised labels that you will be able to find on model-specific metrics:

| Label Name | Description |
|---|---|
| `model_name` | Model Name (e.g. `my-custom-model`) |
| `model_version` | Model Version (e.g. `v1.2.3`) |

MLServer will expose metric values through a metrics endpoint exposed on its own metrics server. This endpoint can be polled by Prometheus or other OpenMetrics-compatible backends.

Below you can find the settings available to control the behaviour of the metrics server:

| Setting | Description | Default |
|---|---|---|
| `metrics_endpoint` | Path under which the metrics endpoint will be exposed. | `/metrics` |
| `metrics_port` | Port used to serve the metrics server. | `8082` |
| `metrics_rest_server_prefix` | Prefix used for metric names specific to MLServer's REST inference interface. | `rest_server` |
| `metrics_dir` | Directory used to store internal metric files (used to support metrics sharing across inference workers). This is equivalent to Prometheus' `$PROMETHEUS_MULTIPROC_DIR` env var. | MLServer's current working directory (i.e. `$PWD`) |

MLServer is currently used as the core Python inference server in some of the most popular Kubernetes-native serving frameworks, including Seldon Core and KServe (formerly known as KFServing). This allows MLServer users to leverage the usability and maturity of these frameworks to take their model deployments to the next level of their MLOps journey, ensuring that they are served in a robust and scalable infrastructure.

In general, it should be possible to deploy models using MLServer into any serving engine compatible with the V2 protocol. Alternatively, it's also possible to manage MLServer deployments manually as regular processes (i.e. in a non-Kubernetes-native way). However, this may be more involved and highly dependent on the deployment infrastructure.
MLServer includes support to batch requests together transparently on-the-fly. We refer to this as "adaptive batching", although it can also be known as "predictive batching".
There are usually two main reasons to adopt adaptive batching:
Maximise resource usage. Usually, inference operations are “vectorised” (i.e. are designed to operate across batches). For example, a GPU is designed to operate on multiple data points at the same time. Therefore, to make sure that it’s used at maximum capacity, we need to run inference across batches.
Minimise any inference overhead. Usually, all models will have to “pay” a constant overhead when running any type of inference. This can be something like IO to communicate with the GPU or some kind of processing in the incoming data. Up to a certain size, this overhead tends to not scale linearly with the number of data points. Therefore, it’s in our interest to send as large batches as we can without deteriorating performance.
However, these benefits will usually scale only up to a certain point, which is usually determined by either the infrastructure, the machine learning framework used to train your model, or a combination of both. Therefore, to maximise the performance improvements brought in by adaptive batching, it is important to configure it with the appropriate values for your model. Since these values are usually found through experimentation, MLServer won't enable adaptive batching by default on newly loaded models.

MLServer lets you configure adaptive batching independently for each model, through two main parameters:

- Maximum batch size, that is, how many requests you want to group together.
- Maximum batch time, that is, how much time we should wait for new requests until we reach our maximum batch size.
`max_batch_size`

The `max_batch_size` field of the `model-settings.json` file (or alternatively, the `MLSERVER_MODEL_MAX_BATCH_SIZE` global environment variable) controls the maximum number of requests that should be grouped together on each batch. The expected values are:

- `N`, where `N > 1`, will create batches of up to `N` elements.
- `0` or `1` will disable adaptive batching.

`max_batch_time`

The `max_batch_time` field of the `model-settings.json` file (or alternatively, the `MLSERVER_MODEL_MAX_BATCH_TIME` global environment variable) controls the time that MLServer should wait for new requests to come in until we reach our maximum batch size.

The expected format is in seconds, but it will take fractional values. That is, 500ms could be expressed as `0.5`.

The expected values are:

- `T`, where `T > 0`, will wait `T` seconds at most.
- `0` will disable adaptive batching.
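For example, a `model-settings.json` enabling batches of up to 10 requests with a 100ms window could look like this (the model name is illustrative):

```json
{
  "name": "my-model",
  "max_batch_size": 10,
  "max_batch_time": 0.1
}
```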
MLServer allows adding custom parameters to the `parameters` field of the requests. These parameters are received as a merged list of parameters inside the server, e.g.

is received as follows in the batched request in the server:

In the same way, if the request is sent back from the server as a batched request,

it will be returned unbatched from the server as follows:
Machine learning models generally expect their inputs to be passed down as a particular Python type. Most commonly, this type ranges from "general purpose" NumPy arrays or Pandas DataFrames to more granular definitions, like `datetime` objects, `Pillow` images, etc. Unfortunately, the definition of the V2 Inference Protocol doesn't cover any of these specific use cases. This protocol can be thought of as a wider "lower level" spec, which only defines what fields a payload should have.

To account for this gap, MLServer introduces support for content types, which offer a way to let MLServer know how it should "decode" V2-compatible payloads. When shaped in the right way, these payloads should "encode" all the information required to extract the higher level Python type that will be required for a model.

To illustrate the above, we can think of a Scikit-Learn pipeline, which takes in a Pandas DataFrame and returns a NumPy Array. Without the use of content types, the V2 payload itself would probably lack information about how this payload should be treated by MLServer. Likewise, the Scikit-Learn pipeline wouldn't know how to treat a raw V2 payload. In this scenario, the use of content types allows us to specify the actual "higher level" information encoded within the V2 protocol payloads.

Some inference runtimes may apply a content type by default if none is present. To learn more about each runtime's defaults, please check the documentation of each inference runtime.

To learn more about the available content types and how to use them, you can see all the available ones in the section below.

For a full end-to-end example of how content types and codecs work under the hood, feel free to check out the worked example in the MLServer documentation.

To let MLServer know that a particular payload must be decoded / encoded as a different Python data type (e.g. NumPy Array, Pandas DataFrame, etc.), you can specify it through the `content_type` field of the `parameters` section of your request.

As an example, we can consider the following dataframe, containing two columns: Age and First Name.

| First Name | Age |
|---|---|
| Joanne | 34 |
| Michael | 22 |
This table could be specified in the V2 protocol as the following payload, where we declare that:

- The whole set of inputs should be decoded as a Pandas DataFrame (i.e. setting the content type as `pd`).
- The First Name column should be decoded as a UTF-8 string (i.e. setting the content type as `str`).
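A sketch of such a payload, mirroring the table above:

```json
{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "First Name",
      "datatype": "BYTES",
      "parameters": {
        "content_type": "str"
      },
      "shape": [2],
      "data": ["Joanne", "Michael"]
    },
    {
      "name": "Age",
      "datatype": "INT32",
      "shape": [2],
      "data": [34, 22]
    }
  ]
}
```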
It's important to keep in mind that content types can be specified at both the request level and the input level. The former will apply to the entire set of inputs, whereas the latter will only apply to a particular input of the payload.
Under the hood, the conversion between content types is implemented using codecs. In the MLServer architecture, codecs are an abstraction which knows how to encode and decode high-level Python types to and from the V2 Inference Protocol.

Depending on the high-level Python type, encoding / decoding operations may require access to multiple input or output heads. For example, a Pandas DataFrame would need to aggregate all of the input- / output-heads present in a V2 Inference Protocol response. However, a NumPy array or a list of strings could be encoded directly as an input head within a larger request.

To account for this, codecs can work at either the request- / response-level (known as request codecs), or the input- / output-level (known as input codecs). Each of these codecs exposes the following public interface, where `Any` represents a high-level Python data type (e.g. a Pandas DataFrame, a NumPy Array, etc.):

Request Codecs

- `encode_request()`
- `decode_request()`
- `encode_response()`
- `decode_response()`

Input Codecs

- `encode_input()`
- `decode_input()`
- `encode_output()`
- `decode_output()`
Note that these methods can also be used as helpers to encode requests and decode responses on the client side. This can help to abstract away from the user most of the details about the underlying structure of V2-compatible payloads.

For example, in the example above, we could use codecs to encode the DataFrame into a V2-compatible request simply as:
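A sketch of this, using the `PandasCodec` request codec on the dataframe above:

```python
import pandas as pd

from mlserver.codecs import PandasCodec

dataframe = pd.DataFrame({"First Name": ["Joanne", "Michael"], "Age": [34, 22]})

# Encode the whole DataFrame into a V2-compatible InferenceRequest
inference_request = PandasCodec.encode_request(dataframe)
print(inference_request)
```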
When using MLServer's request codecs, the output of encoding payloads will always be one of the classes within the `mlserver.types` package (i.e. `InferenceRequest` or `InferenceResponse`). Therefore, if you want to use them with `requests` (or any other package outside of MLServer) you will need to convert them to a Python dict or a JSON string.

Luckily, these classes leverage Pydantic under the hood. Therefore, you can just call the `.model_dump()` or `.model_dump_json()` methods to convert them. Likewise, to read them back from JSON, we can always pass the JSON fields as kwargs to the class' constructor (or use any of the helpers available within Pydantic).

For example, if we want to send an inference request to model `foo`, we could do something along the following lines:
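A sketch; the endpoint URL and model name are illustrative:

```python
import pandas as pd
import requests

from mlserver.codecs import PandasCodec

dataframe = pd.DataFrame({"First Name": ["Joanne", "Michael"], "Age": [34, 22]})

inference_request = PandasCodec.encode_request(dataframe)

# Serialise the Pydantic model into raw JSON before sending it over HTTP
raw_json = inference_request.model_dump_json()
response = requests.post(
    "http://localhost:8080/v2/models/foo/infer",
    data=raw_json,
    headers={"Content-Type": "application/json"},
)
print(response.json())
```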
The NaN (Not a Number) value is used in NumPy and other scientific libraries to describe an invalid or missing value (e.g. a division by zero). In some scenarios, it may be desirable to let your models receive and / or output NaN values (e.g. these can be useful sometimes with GBTs, like XGBoost models). This is why MLServer supports encoding NaN values on your request / response payloads under some conditions.

In order to send / receive NaN values, you must ensure that:

- You are using the REST interface.
- The input / output entry containing NaN values uses either the `FP16`, `FP32` or `FP64` datatypes.

Assuming those conditions are satisfied, any `null` value within your tensor payload will be converted to NaN.
For example, if you take the following Numpy array:
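A sketch of such an array:

```python
import numpy as np

foo = np.array([[1.0, 2.0], [3.0, np.nan]])
```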
We could encode it as:
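A corresponding REST payload could look like the sketch below, with `null` entries standing in for NaN:

```json
{
  "inputs": [
    {
      "name": "foo",
      "datatype": "FP64",
      "shape": [2, 2],
      "data": [1.0, 2.0, 3.0, null]
    }
  ]
}
```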
Content types can also be defined as part of the model's metadata. This lets the user pre-configure what content types a model should use by default to decode / encode its requests / responses, without the need to specify it on each request.

It's important to keep in mind that content types passed explicitly as part of the request will always take precedence over the model's metadata. Therefore, we can leverage this to override the model's metadata when needed.

For example, to configure the content type values of the dataframe example above, one could create a `model-settings.json` file like the one below:
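A sketch, assuming the column metadata lives under the model's `inputs` field (the column names match the dataframe example):

```json
{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "First Name",
      "datatype": "BYTES",
      "shape": [-1],
      "parameters": {
        "content_type": "str"
      }
    },
    {
      "name": "Age",
      "datatype": "INT32",
      "shape": [-1]
    }
  ]
}
```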
Out of the box, MLServer supports the following list of content types. However, this can be extended through the use of 3rd-party or custom runtimes.

| Python Type | Content Type | Request Level | Request Codec | Input Level | Input Codec |
|---|---|---|---|---|---|
| NumPy Array | `np` | ✅ | `NumpyRequestCodec` | ✅ | `NumpyCodec` |
| Pandas DataFrame | `pd` | ✅ | `PandasCodec` | ❌ | |
| UTF-8 String | `str` | ✅ | `StringRequestCodec` | ✅ | `StringCodec` |
| Base64 | `base64` | ❌ | | ✅ | `Base64Codec` |
| Datetime | `datetime` | ❌ | | ✅ | `DatetimeCodec` |

MLServer allows you to extend the supported content types by adding custom ones. To learn more about how to write your own custom content types, you can check the MLServer documentation. You can also learn more about building custom extensions for MLServer in the custom runtimes section of the docs.
The `np` content type will decode / encode V2 payloads to a NumPy Array, taking into account the following:

- The `datatype` field will be matched to the closest NumPy `dtype`.
- The `shape` field will be used to reshape the flattened array expected by the V2 protocol into the expected tensor shape.

The V2 Inference Protocol expects that the `data` of each input is sent as a flat array. Therefore, the `np` content type will expect that tensors are sent flattened. The information in the `shape` field will then be used to reshape the vector into the right dimensions.

By default, MLServer will always assume that an array with a single-dimensional shape, e.g. `[N]`, is equivalent to `[N, 1]`. That is, each entry will be treated like a single one-dimensional data point (i.e. instead of a `[1, D]` array, where the full array is a single `D`-dimensional data point). To avoid any ambiguity, where possible, the NumPy codec will always explicitly encode `[N]` arrays as `[N, 1]`.
For example, if we think of the following NumPy Array:
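A sketch of such an array:

```python
import numpy as np

foo = np.array([[1, 2], [3, 4]])
```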
We could encode it as the input `foo` in a V2 protocol request as:
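A corresponding payload could look like this, with the flattened `data` and the original dimensions carried in `shape`:

```json
{
  "inputs": [
    {
      "name": "foo",
      "datatype": "INT32",
      "shape": [2, 2],
      "parameters": {
        "content_type": "np"
      },
      "data": [1, 2, 3, 4]
    }
  ]
}
```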
When using the NumPy Array content type at the request level, it will decode the entire request by considering only the first `input` element. This can be used as a helper for models which only expect a single tensor.

The `pd` content type can be stacked with other content types. This allows the user to use a different set of content types to decode each of the columns.

The `pd` content type will decode / encode a V2 request into a Pandas DataFrame. For this, it will expect that the DataFrame is shaped in a columnar way. That is:

- Each entry of the `inputs` list (or `outputs`, in the case of responses) will represent a column of the DataFrame.
- Each of these entries will contain all the row elements for that particular column.
- The `shape` field of each `input` (or `output`) entry will contain (at least) the amount of rows included in the dataframe.
For example, if we consider the following dataframe:

| A | B | C |
|---|---|---|
| a1 | b1 | c1 |
| a2 | b2 | c2 |
| a3 | b3 | c3 |
| a4 | b4 | c4 |

We could encode it to the V2 Inference Protocol as:
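A sketch of the columnar payload, with one input per column:

```json
{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "A",
      "datatype": "BYTES",
      "shape": [4],
      "data": ["a1", "a2", "a3", "a4"]
    },
    {
      "name": "B",
      "datatype": "BYTES",
      "shape": [4],
      "data": ["b1", "b2", "b3", "b4"]
    },
    {
      "name": "C",
      "datatype": "BYTES",
      "shape": [4],
      "data": ["c1", "c2", "c3", "c4"]
    }
  ]
}
```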
The `str` content type lets you encode / decode a V2 input into a UTF-8 Python string, taking into account the following:

- The expected `datatype` is `BYTES`.
- The `shape` field represents the number of "strings" that are encoded in the payload (e.g. the `["hello world", "one more time"]` payload will have a shape of 2 elements).
For example, if we consider the following list of strings:
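A sketch of such a list:

```python
foo = ["bar", "bar2"]
```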
We could encode it to the V2 Inference Protocol as:
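A corresponding payload could look like this:

```json
{
  "parameters": {
    "content_type": "str"
  },
  "inputs": [
    {
      "name": "foo",
      "datatype": "BYTES",
      "shape": [2],
      "data": ["bar", "bar2"]
    }
  ]
}
```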
When using the `str` content type at the request level, it will decode the entire request by considering only the first `input` element. This can be used as a helper for models which only expect a single string or a set of strings.

The `base64` content type will decode a binary V2 payload into a Base64-encoded string (and vice versa), taking into account the following:

- The expected `datatype` is `BYTES`.
- The `data` field should contain the Base64-encoded binary strings.
- The `shape` field represents the number of binary strings that are encoded in the payload.
For example, if we think of the following "bytes array":
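A sketch of such a value:

```python
foo = b"Python is fun"
```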
We could encode it as the input `foo` of a V2 request as:
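A corresponding payload, with the bytes Base64-encoded:

```json
{
  "inputs": [
    {
      "name": "foo",
      "datatype": "BYTES",
      "shape": [1],
      "parameters": {
        "content_type": "base64"
      },
      "data": ["UHl0aG9uIGlzIGZ1bg=="]
    }
  ]
}
```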
The `datetime` content type will decode a V2 input into a Python `datetime.datetime` object, taking into account the following:

- The expected `datatype` is `BYTES`.
- The `data` field should contain the dates serialised following the ISO 8601 standard.
- The `shape` field represents the number of datetimes that are encoded in the payload.

For example, if we think of the following `datetime` object:
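A sketch of such an object:

```python
import datetime

foo = datetime.datetime(2022, 1, 11, 11, 0, 0)
```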
We could encode it as the input `foo` of a V2 request as:
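A corresponding payload, with the datetime serialised as an ISO 8601 string:

```json
{
  "inputs": [
    {
      "name": "foo",
      "datatype": "BYTES",
      "shape": [1],
      "parameters": {
        "content_type": "datetime"
      },
      "data": ["2022-01-11T11:00:00"]
    }
  ]
}
```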