MLServer extends the V2 inference protocol by adding support for a content_type annotation. This annotation can be provided either through the model metadata parameters, or through the input parameters. By leveraging the content_type annotation, we can provide the necessary information to MLServer so that it can decode the input payload from the "wire" V2 protocol to something meaningful to the model / user (e.g. a NumPy array).
This example walks you through a few scenarios which illustrate how this works, and how it can be extended.
Echo Inference Runtime
To start with, we will write a dummy runtime which prints both the raw and the decoded input, and then echoes the input back. This will serve as a testbed to showcase how the content_type support works.
Later on, we will extend this runtime by adding custom codecs that will decode our V2 payload to custom types.
As you can see above, this runtime decodes the incoming payloads by calling the self.decode() helper method. This method determines the right content type for each input by checking, in the following order:
1. Is there any content type defined in the inputs[].parameters.content_type field within the request payload?
2. Is there any content type defined in the inputs[].parameters.content_type field within the model metadata?
3. Is there any default content type that should be assumed?
Model Settings
In order to enable this runtime, we will also create a model-settings.json file. This file should be present in (or accessible from) the folder where we run mlserver start .
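A minimal model-settings.json could look like the following, assuming the runtime above lives in a local runtime.py module (the model name and module path are choices of this example):

```json
{
    "name": "content-type-example",
    "implementation": "runtime.EchoRuntime"
}
```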
Our initial step will be to decide the content type based on the incoming inputs[].parameters field. For this, we will start MLServer in the background (e.g. by running mlserver start .)
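A request which annotates each input through inputs[].parameters.content_type could be built as sketched below. The model name and input names are the ones assumed in this example; "np" and "str" are MLServer's built-in content types for NumPy arrays and lists of strings.

```python
# Each input carries its own `content_type` annotation in its `parameters`,
# telling MLServer how to decode it ("np" -> NumPy array, "str" -> list of strings)
payload = {
    "inputs": [
        {
            "name": "parameters-np",
            "datatype": "INT32",
            "shape": [2, 2],
            "data": [1, 2, 3, 4],
            "parameters": {"content_type": "np"},
        },
        {
            "name": "parameters-str",
            "datatype": "BYTES",
            "shape": [1],
            "data": ["hello world 😁"],
            "parameters": {"content_type": "str"},
        },
    ]
}

# With MLServer running locally (e.g. `mlserver start .`), this payload could
# then be sent with:
#   import requests
#   requests.post(
#       "http://localhost:8080/v2/models/content-type-example/infer",
#       json=payload,
#   )
```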
As you've probably already noticed, writing request payloads compliant with the V2 Inference Protocol requires a certain knowledge of both the V2 spec and the structure expected by each content type. To account for this and simplify usage, the MLServer package exposes a set of utilities which will help you interact with your models via the V2 protocol.
These helpers are mainly shaped as "codecs". That is, abstractions which know how to "encode" and "decode" arbitrary Python datatypes to and from the V2 Inference Protocol.
Generally, we recommend using the existing set of codecs to generate your V2 payloads. This will ensure that requests and responses follow the right structure, and should provide a more seamless experience.
Following on from our previous example, the same request could be rewritten using codecs as:
```python
import requests
import numpy as np

from mlserver.types import InferenceRequest, InferenceResponse
from mlserver.codecs import NumpyCodec, StringCodec

parameters_np = np.array([[1, 2], [3, 4]])
parameters_str = ["hello world 😁"]

payload = InferenceRequest(
    inputs=[
        NumpyCodec.encode_input("parameters-np", parameters_np),
        # The `use_bytes=False` flag will ensure that the encoded payload
        # is JSON-compatible
        StringCodec.encode_input("parameters-str", parameters_str, use_bytes=False),
    ]
)

response = requests.post(
    "http://localhost:8080/v2/models/content-type-example/infer",
    json=payload.model_dump(),
)
response_payload = InferenceResponse.parse_raw(response.text)

print(NumpyCodec.decode_output(response_payload.outputs[0]))
print(StringCodec.decode_output(response_payload.outputs[1]))
```
Note that the rewritten snippet now makes use of the built-in InferenceRequest class, which represents a V2 inference request. On top of that, it also uses the NumpyCodec and StringCodec implementations, which know how to encode a NumPy array and a list of strings into V2-compatible request inputs.
Model Metadata
Our next step will be to define the expected content type through the model metadata. This can be done by extending the model-settings.json file, and adding a section on inputs.
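The extended model-settings.json could declare the content type of each input as sketched below. The input names, datatypes and shapes are illustrative choices for this example:

```json
{
    "name": "content-type-example",
    "implementation": "runtime.EchoRuntime",
    "inputs": [
        {
            "name": "metadata-np",
            "datatype": "INT32",
            "shape": [2, 2],
            "parameters": {"content_type": "np"}
        },
        {
            "name": "metadata-str",
            "datatype": "BYTES",
            "shape": [11],
            "parameters": {"content_type": "str"}
        }
    ]
}
```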
As you should be able to see in the server logs, MLServer will cross-reference the input names against the model metadata to find the right content type.
Custom Codecs
There may be cases where a custom inference runtime needs to encode / decode custom datatypes. As an example, we can think of computer vision models which may only operate with Pillow image objects.
In these scenarios, it's possible to extend the Codec interface to write our custom encoding logic. A Codec is simply an object which defines decode() and encode() methods. To illustrate how this would work, we will extend our custom runtime to add a custom PillowCodec.
As you should be able to see in the MLServer logs, the server is now able to decode the payload into a Pillow image. This example also illustrates how Codec objects can be compatible with multiple datatype values (e.g. tensor and BYTES in this case).
Request Codecs
So far, we've seen how you can specify codecs so that they get applied at the input level. However, it is also possible to use request-wide codecs that aggregate multiple inputs to decode the payload. This is usually relevant for cases where the models expect a multi-column input type, like a Pandas DataFrame.
To illustrate this, we will first tweak our EchoRuntime so that it prints the decoded contents at the request level.