To account for this gap, MLServer introduces support for content types, which offer a way to let MLServer know how it should "decode" V2-compatible payloads. When shaped in the right way, these payloads should "encode" all the information required to extract the higher level Python type that will be required for a model.
To illustrate the above, we can think of a Scikit-Learn pipeline, which takes in a Pandas DataFrame and returns a NumPy array. Without the use of content types, the V2 payload itself would probably lack information about how it should be treated by MLServer. Likewise, the Scikit-Learn pipeline wouldn't know how to treat a raw V2 payload. In this scenario, the use of content types allows us to specify what "higher level" information is actually encoded within the V2 protocol payloads.
To let MLServer know that a particular payload must be decoded / encoded as a different Python data type (e.g. NumPy array, Pandas DataFrame, etc.), you can specify it through the `content_type` field of the `parameters` section of your request.
As an example, we can consider the following dataframe, containing two columns: Age and First Name.

| First Name | Age |
| ---------- | --- |
| Joanne     | 34  |
| Michael    | 22  |
This table could be specified in the V2 protocol as the following payload, where we declare that:

- The whole set of inputs should be decoded as a Pandas DataFrame (i.e. setting the content type as `pd`).
- The First Name column should be decoded as a UTF-8 string (i.e. setting the content type as `str`).
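A sketch of what that payload could look like, written as a plain Python dict (the input names, datatypes and shapes are illustrative):

```python
# Hypothetical V2 payload for the table above: a request-level "pd" content
# type, with the "First Name" column additionally tagged as "str".
inference_request = {
    "parameters": {"content_type": "pd"},
    "inputs": [
        {
            "name": "First Name",
            "datatype": "BYTES",
            "parameters": {"content_type": "str"},
            "shape": [2],
            "data": ["Joanne", "Michael"],
        },
        {
            "name": "Age",
            "datatype": "INT32",
            "shape": [2],
            "data": [34, 22],
        },
    ],
}
```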
Under the hood, the conversion between content types is implemented using codecs. In the MLServer architecture, codecs are an abstraction which knows how to encode and decode high-level Python types to and from the V2 Inference Protocol.

Depending on the high-level Python type, encoding / decoding operations may require access to multiple input or output heads. For example, a Pandas DataFrame would need to aggregate all of the input / output heads present in a V2 Inference Protocol response. However, a NumPy array or a list of strings could be encoded directly as an input head within a larger request.

To account for this, codecs can work at either the request / response level (known as request codecs) or the input / output level (known as input codecs). Each of these codecs exposes the following public interface, where `Any` represents a high-level Python data type (e.g. a Pandas DataFrame, a NumPy array, etc.):
Request Codecs

- `encode_request() <mlserver.codecs.RequestCodec.encode_request>`
- `decode_request() <mlserver.codecs.RequestCodec.decode_request>`
- `encode_response() <mlserver.codecs.RequestCodec.encode_response>`
- `decode_response() <mlserver.codecs.RequestCodec.decode_response>`

Input Codecs

- `encode_input() <mlserver.codecs.InputCodec.encode_input>`
- `decode_input() <mlserver.codecs.InputCodec.decode_input>`
- `encode_output() <mlserver.codecs.InputCodec.encode_output>`
- `decode_output() <mlserver.codecs.InputCodec.decode_output>`
Note that these methods can also be used as helpers to encode requests and decode responses on the client side. This can help abstract away most of the details about the underlying structure of V2-compatible payloads from the user.

For instance, going back to the dataframe above, we could use codecs to encode it into a V2-compatible request as shown in the sketch below.
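A minimal sketch using `PandasCodec` (the DataFrame contents mirror the table above):

```python
import pandas as pd
from mlserver.codecs import PandasCodec

df = pd.DataFrame({"First Name": ["Joanne", "Michael"], "Age": [34, 22]})

# Encode the DataFrame into a V2-compatible InferenceRequest object
inference_request = PandasCodec.encode_request(df)
```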
When using MLServer's request codecs, the output of encoding payloads will always be one of the classes within the `mlserver.types` package (i.e. `InferenceRequest <mlserver.types.InferenceRequest>` or `InferenceResponse <mlserver.types.InferenceResponse>`). Therefore, if you want to use them with `requests` (or any other package outside of MLServer), you will need to convert them to a Python dict or a JSON string.

For example, if we want to send an inference request to a model named `foo`, we could do something along the following lines.
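A hedged sketch, assuming MLServer is reachable locally on its default REST port (8080) and that the model is called `foo`:

```python
import pandas as pd
import requests
from mlserver.codecs import PandasCodec

df = pd.DataFrame({"First Name": ["Joanne", "Michael"], "Age": [34, 22]})
inference_request = PandasCodec.encode_request(df)

# Convert the Pydantic model into a plain dict before handing it to `requests`
response = requests.post(
    "http://localhost:8080/v2/models/foo/infer",
    json=inference_request.model_dump(),
)
print(response.json())
```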
The NaN (Not a Number) value is used in NumPy and other scientific libraries to describe an invalid or missing value (e.g. the result of a division by zero). In some scenarios, it may be desirable to let your models receive and / or output NaN values (e.g. these can be useful with GBTs, like XGBoost models). This is why MLServer supports encoding NaN values on your request / response payloads under some conditions.

In order to send / receive NaN values, you must ensure that:

- You are using the REST interface.
- The input / output entry containing NaN values uses either the `FP16`, `FP32` or `FP64` datatypes.

Assuming those conditions are satisfied, any `null` value within your tensor payload will be converted to NaN.
For example, if you take the following NumPy array, you could encode it as shown in the sketch below.
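A sketch using the request-level NumPy codec (array values are illustrative):

```python
import numpy as np
from mlserver.codecs import NumpyRequestCodec

foo = np.array([[1.2, 2.3], [np.nan, 4.5]])

# Encode the array as a full V2 request (FP64 datatype); over the REST
# interface, the NaN entry travels as a `null` value in the flattened data.
inference_request = NumpyRequestCodec.encode_request(foo)
```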
It's important to keep in mind that content types passed explicitly as part of the request will always take precedence over the model's metadata. Therefore, we can leverage this to override the model's metadata when needed.
Out of the box, MLServer supports the following list of content types. However, this can be extended through the use of 3rd-party or custom runtimes.
| Content Type | Request Level | Request Codec | Input Level | Input Codec |
| ------------ | ------------- | ------------- | ----------- | ----------- |
| `np` | ✅ | `mlserver.codecs.NumpyRequestCodec` | ✅ | `mlserver.codecs.NumpyCodec` |
| `pd` | ✅ | `mlserver.codecs.PandasCodec` | ❌ | |
| `str` | ✅ | `mlserver.codecs.string.StringRequestCodec` | ✅ | `mlserver.codecs.StringCodec` |
| `base64` | ❌ | | ✅ | `mlserver.codecs.Base64Codec` |
| `datetime` | ❌ | | ✅ | `mlserver.codecs.DatetimeCodec` |
The `np` content type will decode / encode V2 payloads to a NumPy array, taking into account the following:

- The `shape` field will be used to reshape the flattened array expected by the V2 protocol into the expected tensor shape.
For example, if we think of the following NumPy array, we could encode it as the input `foo` in a V2 protocol request as shown below.
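A sketch using `NumpyCodec` (array values are illustrative):

```python
import numpy as np
from mlserver.codecs import NumpyCodec

foo = np.array([[1, 2], [3, 4]])

# Encode the 2x2 tensor as the input "foo": the data gets flattened to
# [1, 2, 3, 4], while the original dimensions are preserved in `shape`.
request_input = NumpyCodec.encode_input(name="foo", payload=foo)
print(request_input.model_dump_json())
```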
When using the NumPy array content type at the request level, it will decode the entire request by considering only the first `input` element. This can be used as a helper for models which only expect a single tensor.
The `pd` content type will decode / encode a V2 request into a Pandas DataFrame. For this, it will expect the DataFrame to be shaped in a columnar way. That is:

- Each entry of the `inputs` list (or `outputs`, in the case of responses) will represent a column of the DataFrame.
- Each of these entries will contain all the row elements for that particular column.
- The `shape` field of each `input` (or `output`) entry will contain (at least) the amount of rows included in the dataframe.
For example, if we consider the following dataframe:

| A  | B  | C  |
| -- | -- | -- |
| a1 | b1 | c1 |
| a2 | b2 | c2 |
| a3 | b3 | c3 |
| a4 | b4 | c4 |
We could encode it to the V2 Inference Protocol as:
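A sketch using `PandasCodec` (column names match the table above):

```python
import pandas as pd
from mlserver.codecs import PandasCodec

df = pd.DataFrame({
    "A": ["a1", "a2", "a3", "a4"],
    "B": ["b1", "b2", "b3", "b4"],
    "C": ["c1", "c2", "c3", "c4"],
})

# Encode the DataFrame as a full V2 request: one input entry per column,
# each carrying the four row values for that column.
inference_request = PandasCodec.encode_request(df)
```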
The `str` content type lets you encode / decode a V2 input into a UTF-8 Python string, taking into account the following:

- The expected `datatype` is `BYTES`.
- The `shape` field represents the number of "strings" that are encoded in the payload (e.g. the `["hello world", "one more time"]` payload will have a shape of 2 elements).
For example, if we consider the list of strings `["hello world", "one more time"]`, we could encode it to the V2 Inference Protocol as shown below.
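A sketch using `StringCodec` at the input level (the input name and the `use_bytes` flag are assumptions):

```python
from mlserver.codecs import StringCodec

strings = ["hello world", "one more time"]

# Encode the list of strings as a single V2 input named "foo", using the
# BYTES datatype and the `str` content type.
request_input = StringCodec.encode_input(name="foo", payload=strings, use_bytes=False)
print(request_input.model_dump_json())
```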
When using the `str` content type at the request level, it will decode the entire request by considering only the first `input` element. This can be used as a helper for models which only expect a single string or a set of strings.
The `base64` content type will decode a binary V2 payload into a Base64-encoded string (and vice versa), taking into account the following:

- The expected `datatype` is `BYTES`.
- The `data` field should contain the base64-encoded binary strings.
- The `shape` field represents the number of binary strings that are encoded in the payload.
For example, if we think of the following "bytes array", we could encode it as the input `foo` of a V2 request as shown below.
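A sketch using `Base64Codec` (the payload and the `use_bytes` flag are illustrative assumptions):

```python
from mlserver.codecs import Base64Codec

# A single binary payload (i.e. a "bytes array")
raw = [b"Python is fun"]

# Encode it as the input "foo" of a V2 request; the raw bytes are sent as a
# base64-encoded string within the `data` field.
request_input = Base64Codec.encode_input(name="foo", payload=raw, use_bytes=False)
print(request_input.model_dump_json())
```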
The `datetime` content type will decode / encode a V2 input into a Python `datetime.datetime` object, taking into account the following:

- The expected `datatype` is `BYTES`.
- The `shape` field represents the number of datetimes that are encoded in the payload.

For example, if we think of the following `datetime` object, we could encode it as the input `foo` of a V2 request as shown below.
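A sketch using `DatetimeCodec` (the date and the `use_bytes` flag are illustrative assumptions):

```python
import datetime
from mlserver.codecs import DatetimeCodec

date = datetime.datetime(2022, 1, 11, 11, 0, 0)

# Encode the datetime as the input "foo"; the value is serialised as an
# ISO 8601 string under the BYTES datatype.
request_input = DatetimeCodec.encode_input(name="foo", payload=[date], use_bytes=False)
print(request_input.model_dump_json())
```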
Machine learning models generally expect their inputs to be passed down as a particular Python type. Most commonly, this type ranges from "general purpose" NumPy arrays or Pandas DataFrames to more granular definitions, like `datetime` objects, Pillow images, etc. Unfortunately, the definition of the V2 Inference Protocol doesn't cover any of these specific use cases. This protocol can be thought of as a wider, "lower level" spec, which only defines what fields a payload should have.
Some inference runtimes may apply a content type by default if none is present. To learn more about each runtime's defaults, please check the docs of the relevant inference runtime.
To learn more about the available content types and how to use them, you can see all the available ones in the section below.
For a full end-to-end example of how content types and codecs work under the hood, feel free to check out the worked example in the MLServer docs.
Luckily, these classes leverage Pydantic under the hood. Therefore, you can just call the `.model_dump()` or `.model_dump_json()` methods to convert them. Likewise, to read them back from JSON, we can always pass the JSON fields as kwargs to the class constructor (or use any of the other helper methods available within Pydantic).
You are either using the or the .
Content types can also be defined as part of the model's metadata. This lets the user pre-configure which content types a model should use by default to decode / encode its requests / responses, without the need to specify it on each request.

For example, to configure the content type values of the dataframe example above, one could create a model-settings.json file along the lines shown below.
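A sketch of such a file, written here as Python that dumps the equivalent JSON (the model name and tensor shapes are illustrative):

```python
import json

model_settings = {
    "name": "my-model",
    # Request-level content type: decode the whole payload as a DataFrame
    "parameters": {"content_type": "pd"},
    # Input-level content type: decode this column as UTF-8 strings
    "inputs": [
        {
            "name": "First Name",
            "datatype": "BYTES",
            "shape": [-1],
            "parameters": {"content_type": "str"},
        },
        {"name": "Age", "datatype": "INT32", "shape": [-1]},
    ],
}

with open("model-settings.json", "w") as settings_file:
    json.dump(model_settings, settings_file, indent=2)
```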
MLServer allows you to extend the supported content types by adding custom ones. To learn more about how to write your own custom content types, as well as about building other custom extensions for MLServer, check the custom runtimes section of the docs.
The V2 Inference Protocol expects that the `data` of each input is sent as a flat array. Therefore, the `np` content type will expect tensors to be sent flattened. The information in the `shape` field will then be used to reshape the vector into the right dimensions.

The `datatype` field will be matched to the closest NumPy dtype.

The `datetime` content type will decode a V2 input into a Python `datetime.datetime` object, taking into account the following:

- The `data` field should contain the dates serialised following the ISO 8601 standard.
MLServer includes support to batch requests together transparently on-the-fly. We refer to this as "adaptive batching", although it can also be known as "predictive batching".
There are usually two main reasons to adopt adaptive batching:
- Maximise resource usage. Usually, inference operations are "vectorised" (i.e. designed to operate across batches). For example, a GPU is designed to operate on multiple data points at the same time. Therefore, to make sure that it's used at maximum capacity, we need to run inference across batches.
- Minimise any inference overhead. Usually, all models will have to "pay" a constant overhead when running any type of inference. This can be something like IO to communicate with the GPU or some kind of processing of the incoming data. Up to a certain size, this overhead tends not to scale linearly with the number of data points. Therefore, it's in our interest to send batches as large as we can without deteriorating performance.

MLServer lets you configure adaptive batching independently for each model, through two main parameters:

- Maximum batch size, that is, how many requests you want to group together.
- Maximum batch time, that is, how much time we should wait for new requests until we reach our maximum batch size.
max_batch_size

The `max_batch_size` field of the `model-settings.json` file (or alternatively, the `MLSERVER_MODEL_MAX_BATCH_SIZE` global environment variable) controls the maximum number of requests that should be grouped together on each batch. The expected values are:

- `N`, where `N > 1`, will create batches of up to `N` elements.
- `0` or `1`, will disable adaptive batching.
max_batch_time

The `max_batch_time` field of the `model-settings.json` file (or alternatively, the `MLSERVER_MODEL_MAX_BATCH_TIME` global environment variable) controls the time that MLServer should wait for new requests to come in until we reach our maximum batch size.

The expected format is in seconds, but fractional values are allowed (e.g. 500ms could be expressed as `0.5`).

The expected values are:

- `T`, where `T > 0`, will wait `T` seconds at most.
- `0`, will disable adaptive batching.
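As an illustration, a sketch of a `model-settings.json` enabling adaptive batching, written as Python that dumps the equivalent JSON (the model name and implementation are illustrative):

```python
import json

model_settings = {
    "name": "my-model",
    "implementation": "mlserver_sklearn.SKLearnModel",
    # Group up to 10 requests per batch...
    "max_batch_size": 10,
    # ...waiting at most 100ms for a batch to fill up
    "max_batch_time": 0.1,
}

with open("model-settings.json", "w") as settings_file:
    json.dump(model_settings, settings_file, indent=2)
```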
MLServer allows adding custom parameters to the `parameters` field of the requests. Inside the server, the parameters of the individual requests are received as a merged list within the batched request. The same applies in the other direction: a batched response produced by the server is split back up, so each response is returned unbatched with its own parameters. The sketch below illustrates both directions.
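An illustrative sketch of the idea (the exact field layout is an assumption, not taken from the MLServer source; only the relevant fields are shown):

```python
# Two incoming requests, each carrying its own custom parameter
request_1 = {"inputs": [...], "parameters": {"custom-param": "value-1"}}
request_2 = {"inputs": [...], "parameters": {"custom-param": "value-2"}}

# Inside the server, the batched request carries the merged list of values
batched_request = {"inputs": [...], "parameters": {"custom-param": ["value-1", "value-2"]}}

# Conversely, a batched response whose parameters contain a list of values...
batched_response = {"outputs": [...], "parameters": {"custom-param": ["value-1", "value-2"]}}

# ...is split back into per-request responses on the way out
response_1 = {"outputs": [...], "parameters": {"custom-param": "value-1"}}
response_2 = {"outputs": [...], "parameters": {"custom-param": "value-2"}}
```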
Some of these runtimes leverage MLServer as the core inference server. Therefore, it should be straightforward to move from your local testing to your serving infrastructure.
To use any of the built-in serving runtimes offered by KServe, it should be enough to select the relevant one in your `InferenceService` manifest.
For example, to serve a Scikit-Learn model, you could use a manifest like the one below:
As you can see highlighted above, the `InferenceService` manifest will only need to specify the following points:

- The model artifact is a Scikit-Learn model. Therefore, we will use the `sklearn` serving runtime to deploy it.
Once you have your `InferenceService` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to just apply it directly through `kubectl`, by running:
As mentioned above, KServe offers support for built-in serving runtimes, some of which leverage MLServer as the inference server. Below you can find a table listing these runtimes, and the MLServer inference runtime that they correspond to.
As we can see highlighted above, the main points that we'll need to take into account are:

- Pointing to our custom MLServer image in the custom container section of our `InferenceService`.
- Letting KServe know what port will be exposed by our custom container to send inference requests.
Once you have your `InferenceService` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to just apply it directly through `kubectl`, by running:
To work around this limitation, MLServer offloads the model inference to a pool of workers, where each worker is a separate Python process (and thus has its own separate GIL). This means that we can get full access to the underlying hardware.
Managing the Inter-Process Communication (IPC) between the main MLServer process and the inference pool workers brings in some overhead. Under the hood, MLServer uses the `multiprocessing` library to implement the distributed processing management, which has been shown to offer the smallest possible overhead when implementing these types of distributed strategies {cite}`zhiFiberPlatformEfficient2020`.

The extra overhead introduced by other libraries is usually brought in as a trade-off in exchange for other advanced features for complex distributed processing scenarios. However, MLServer's use case is simple enough to not require any of these.
For regular models where inference can take a bit more time, this overhead is usually offset by the benefit of having multiple cores to compute inference on.
By default, MLServer will always create an inference pool with a single worker. The number of workers (i.e. the size of the inference pool) can be adjusted globally through the server-level `parallel_workers` setting.

parallel_workers

The `parallel_workers` field of the `settings.json` file (or alternatively, the `MLSERVER_PARALLEL_WORKERS` global environment variable) controls the size of MLServer's inference pool. The expected values are:

- `N`, where `N > 0`, will create a pool of `N` workers.
- `0`, will disable the parallel inference feature. In other words, inference will happen within the main MLServer process.
inference_pool_gid

The `inference_pool_gid` field of the `model-settings.json` file (or alternatively, the `MLSERVER_MODEL_INFERENCE_POOL_GID` global environment variable) allows models to be loaded on a dedicated inference pool based on a group ID (GID), preventing starvation behaviour.

Complementing the `inference_pool_gid`, if the `autogenerate_inference_pool_gid` field of the `model-settings.json` file (or alternatively, the `MLSERVER_MODEL_AUTOGENERATE_INFERENCE_POOL_GID` global environment variable) is set to `True`, a UUID is automatically generated and a dedicated inference pool will load the given model. This option is useful if the user wants to load a single model on a dedicated inference pool without having to manage the GID themselves.
There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.
MLServer is designed as an easy-to-extend framework, encouraging users to write their own custom runtimes. The starting point for this is the `MLModel <mlserver.MLModel>` abstract class, whose main methods are:

- `load() <mlserver.MLModel.load>`: Responsible for loading any artifacts related to a model (e.g. model weights, pickle files, etc.).
- `unload() <mlserver.MLModel.unload>`: Responsible for unloading the model, freeing any resources (e.g. GPU memory, etc.).
- `predict() <mlserver.MLModel.predict>`: Responsible for using a model to perform inference on an incoming data point.

Therefore, the "one-line version" of how to write a custom runtime is: write a custom class extending from `MLModel <mlserver.MLModel>`, and override those methods with your custom logic.
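A bare-bones sketch of such a runtime, where an identity function stands in for real inference logic:

```python
import numpy as np

from mlserver import MLModel
from mlserver.codecs import NumpyRequestCodec
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Load model artifacts here (e.g. weights, pickle files); this sketch has none
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the incoming V2 request into a NumPy array...
        model_input = NumpyRequestCodec.decode_request(payload)

        # ...run the "model" (here, just an identity function)...
        model_output = np.asarray(model_input)

        # ...and encode the result back into a V2 response
        return NumpyRequestCodec.encode_response(self.name, model_output)
```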
MLServer exposes an alternative "simplified" interface which can be used to write custom runtimes. This interface can be enabled by decorating your `predict()` method with the `mlserver.codecs.decode_args` decorator. This will let you specify in the method signature both how you want your request payload to be decoded and how to encode the response back.

As an example of the above, let's assume a model which:

- Takes two lists of strings as inputs:
  - `questions`, containing multiple questions to ask our model.
  - `context`, containing multiple contexts for each of the questions.
- Returns a NumPy array with some predictions as the output.
Leveraging MLServer's simplified notation, we can represent the above as the following custom runtime:
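A sketch of what that runtime could look like (the class name and model logic are placeholders):

```python
from typing import List

import numpy as np

from mlserver import MLModel
from mlserver.codecs import decode_args


class QuestionAnsweringRuntime(MLModel):
    async def load(self) -> bool:
        # Load the underlying question-answering model here (omitted in this sketch)
        return True

    @decode_args
    async def predict(self, questions: List[str], context: List[str]) -> np.ndarray:
        # `questions` and `context` are decoded from the request inputs with
        # matching names; the returned array is encoded back into the response.
        return np.zeros(len(questions))  # placeholder predictions
```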
Note that the method signature of our `predict` method now specifies:

- The input names that we should be looking for in the request payload (i.e. `questions` and `context`).
- The expected content type for each of the request inputs (i.e. `List[str]` in both cases).
- The expected content type of the response outputs (i.e. `np.ndarray`).
There are occasions where custom logic must be made conditional on extra information sent by the client outside of the payload. To allow for these use cases, MLServer will map all incoming HTTP headers (in the case of REST) or metadata (in the case of gRPC) into the `headers` field of the `parameters` object within the `InferenceRequest` instance.

Similarly, to return any HTTP headers (in the case of REST) or metadata (in the case of gRPC), you can append any values to the `headers` field within the `parameters` object of the returned `InferenceResponse` instance.
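A sketch of both directions (the header names are illustrative, and the `getattr` guard is a defensive assumption about the `Parameters` model):

```python
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, Parameters


class CustomHeadersRuntime(MLModel):
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Read any incoming HTTP headers / gRPC metadata
        incoming = {}
        if payload.parameters:
            incoming = getattr(payload.parameters, "headers", None) or {}

        # Return an (empty) response, attaching a custom header / metadata entry
        return InferenceResponse(
            model_name=self.name,
            outputs=[],
            parameters=Parameters(headers={"x-echoed-id": incoming.get("x-request-id", "unknown")}),
        )
```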
MLServer lets you load custom runtimes dynamically into a running instance of MLServer. Once you have your custom runtime ready, all you need to do is move it to your model folder, next to your `model-settings.json` configuration file.

For example, if we assume a flat model repository where each folder represents a model, you would end up with a folder structure like the one below.
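For instance, a layout along these lines (folder and file names are illustrative):

```
models/
└── my-custom-model/
    ├── model-settings.json
    └── models.py
```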
Note that, from the example above, we are assuming that:

- Your custom runtime code lives in the `models.py` file.
- The `implementation` field of your `model-settings.json` configuration file contains the import path of your custom runtime (e.g. `models.MyCustomRuntime`).
More often than not, your custom runtimes will depend on external 3rd-party dependencies which are not included within the main MLServer package. In these cases, to load your custom runtime, MLServer will need access to these dependencies.
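Extending the layout above with a pre-packaged environment tarball could look as follows (names are illustrative):

```
models/
└── my-custom-model/
    ├── environment.tar.gz
    ├── model-settings.json
    └── models.py
```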
Note that, in the folder layout above, we are assuming that:

- The `environment.tar.gz` tarball contains a pre-packaged version of your custom environment.
- The `environment_tarball` field of your `model-settings.json` configuration file points to your pre-packaged custom environment (i.e. `./environment.tar.gz`).
To leverage these, we can use the `mlserver build` command. Assuming that we're currently in the folder containing our custom inference runtime, we should be able to just run the command below.
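For instance (the image name is illustrative):

```
mlserver build . -t my-custom-server
```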
The output will be a Docker image named `my-custom-server`, ready to be used.

Out of the box, the `mlserver build` subcommand leverages a default `Dockerfile` which takes into account a number of requirements, like:

- Supporting arbitrary user IDs.
MLServer is currently used as the core Python inference server in some of the most popular Kubernetes-native serving frameworks, including Seldon Core and KServe. This allows MLServer users to leverage the usability and maturity of these frameworks to take their model deployments to the next level of their MLOps journey, ensuring that they are served in a robust and scalable infrastructure.

However, these benefits will usually scale only up to a certain point, which is usually determined by either the infrastructure, the machine learning framework used to train your model, or a combination of both. Therefore, to maximise the performance improvements brought in by adaptive batching, it will be important to tune these parameters to your own models and environment. Since these values are usually found through experimentation, MLServer won't enable adaptive batching by default on newly loaded models.
MLServer is used as the core Python inference server in KServe. This allows for a straightforward avenue to deploy your models into a scalable serving infrastructure backed by Kubernetes.

This section assumes a basic knowledge of KServe and Kubernetes, as well as access to a working Kubernetes cluster with KServe installed. To learn more about KServe or how to install it, please visit the KServe documentation.

KServe provides built-in serving runtimes to deploy models trained in common ML frameworks. These allow you to deploy your models into a robust infrastructure by just pointing to where the model artifacts are stored remotely.
The model will be served using the V2 Inference Protocol, which can be enabled by setting the `protocolVersion` field to `v2`.
Note that, on top of the ones shown above (backed by MLServer), KServe also provides a wider set of serving runtimes. To see the full list, please visit the KServe documentation.
Sometimes, the serving runtimes built into KServe may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, it's easy to deploy them into your serving infrastructure leveraging KServe's support for custom containers.
The `InferenceService` manifest gives you full control over the containers used to deploy your machine learning model. This can be leveraged to point your deployment to the custom MLServer image containing your custom runtime. For example, if we assume that our custom image has been tagged as `my-custom-server:0.1.0`, we could write an `InferenceService` manifest like the one below:
Explicitly choosing the V2 Inference Protocol to serve our model.
Out of the box, MLServer includes support to offload inference workloads to a pool of workers running in separate processes. This allows MLServer to scale out beyond the limitations of the Python interpreter. To learn more about why this can be beneficial, you can check the section below.

By default, MLServer will spin up a pool with only one worker process to run inference. All models will be loaded uniformly across the inference pool workers. To read more about advanced settings, please see the settings described below.
The Global Interpreter Lock (GIL) is a mutex lock that exists in most Python interpreters (e.g. CPython). Its main purpose is to lock Python's execution so that it only runs on a single processor at the same time. This simplifies certain things for the interpreter. However, it also adds the limitation that a single Python process will never be able to leverage multiple cores.

When we think about MLServer's support for Multi-Model Serving (MMS), this could lead to scenarios where a heavily used model starves the other models running within the same MLServer instance. Similarly, even if we don't take MMS into account, the GIL also makes it harder to scale inference for a single model.

Despite the above, even though this overhead is minimised, it can still be particularly noticeable for lightweight inference methods, where the extra IPC overhead can take a large percentage of the overall time. In these cases (which can only be assessed on a model-by-model basis), the user has the option of disabling parallel inference.
Jiale Zhi, Rui Wang, Jeff Clune, and Kenneth O. Stanley. Fiber: A Platform for Efficient Development and Distributed Training for Reinforcement Learning and Population-Based Methods. arXiv:2003.11164 [cs, stat], March 2020.
This page covers some of the bigger points that need to be taken into account when extending MLServer. You can also see this worked example, which walks through the process of writing a custom runtime.

Based on the information provided in the method signature, MLServer will automatically decode the request payload into the different inputs specified as keyword arguments. Under the hood, this is implemented through MLServer's codecs and content types.

MLServer's "simplified" interface aims to cover use cases where encoding / decoding can be done through one of the codecs built into the MLServer package. However, there are instances where this may not be enough (e.g. variable number of inputs, variable content types, etc.). For these types of cases, please use MLServer's standard interface (i.e. overriding `predict()` directly), where you will have full control over the encoding / decoding process.
It is possible to load this custom set of dependencies by providing them through an environment tarball, whose path can be specified within your `model-settings.json` file.
To load a custom environment, parallel inference must be enabled.

The main MLServer process communicates with workers in custom environments via inter-process communication using pickled objects. Custom environments must therefore use the same version of MLServer and a compatible version of Python, with the same pickle protocol, as the main process. Consult the tables below for environment compatibility.
If we take the above as a reference, we could extend it to include our custom environment as:
MLServer offers built-in utilities to help you build a custom MLServer image. This image can contain any custom code (including custom inference runtimes), as well as any custom environment, provided either through a Conda environment file or a `requirements.txt` file.

The `mlserver build` subcommand will search for any Conda environment file (i.e. named either `environment.yaml` or `conda.yaml`) and / or any `requirements.txt` present in your root folder. These can be used to tell MLServer what Python environment is required in the final Docker image.
The environment built by the `mlserver build` subcommand will be global to the whole MLServer image (i.e. every loaded model will, by default, use that custom environment). For Multi-Model Serving scenarios, it may be better to use per-model custom environments instead, which will allow you to run multiple custom environments at the same time.

The `mlserver build` subcommand will treat any `settings.json` or `model-settings.json` files present in your root folder as the default settings that must be set in your final image. Therefore, these files can be used to configure things like the default inference runtime to be used, or to even include embedded models that will always be present within your custom image.
- Building your custom environment on the fly.
- Configuring a set of default settings.

However, there may be occasions where you need to customise your `Dockerfile` even further. This may be the case, for example, when you need to provide extra environment variables or when you need to customise your Docker build process (e.g. by using other "Docker-less" tools, like Kaniko or Buildah).
To account for these cases, MLServer also includes an `mlserver dockerfile` subcommand which will just generate a `Dockerfile` (and optionally a `.dockerignore` file) exactly like the one used by the `mlserver build` command. This `Dockerfile` can then be customised according to your needs.
The base `Dockerfile` requires Docker BuildKit to be enabled. To ensure BuildKit is used, you can use the `DOCKER_BUILDKIT=1` environment variable, e.g.
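For instance (the image tag is illustrative):

```
DOCKER_BUILDKIT=1 docker build . -t my-custom-server:0.1.0
```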
| Symbol | Meaning |
| ------ | ------- |
| 🔴 | Unsupported |
| 🟢 | Supported |
| 🔵 | Untested |

| Python version |    |    |    |
| -------------- | -- | -- | -- |
| 3.9  | 🟢 | 🟢 | 🔵 |
| 3.10 | 🟢 | 🟢 | 🔵 |
| 3.11 | 🔵 | 🔵 | 🔵 |
| Framework    | Serving runtime |
| ------------ | --------------- |
| Scikit-Learn | `sklearn`       |
| XGBoost      | `xgboost`       |
MLServer follows the Open Inference Protocol (previously known as the "V2 Protocol"). You can find the full OpenAPI spec for the Open Inference Protocol in the links below:
| Document | Description |
| -------- | ----------- |
| Open Inference Protocol | Main dataplane for inference, health and metadata |
| Model Repository Extension | Extension to the protocol to provide a control plane which lets you load / unload models dynamically |

On top of the OpenAPI spec above, MLServer also autogenerates a Swagger UI which can be used to interact dynamically with the Open Inference Protocol.

The autogenerated Swagger UI can be accessed under the `/v2/docs` endpoint.

The model-specific autogenerated Swagger UI can be accessed under the following endpoints:

- `/v2/models/{model_name}/docs`
- `/v2/models/{model_name}/versions/{model_version}/docs`
Out-of-the-box, MLServer exposes a set of metrics that help you monitor your machine learning workloads in production. These include standard metrics like number of requests and latency.
| Metric | Description |
| ------ | ----------- |
| `model_infer_request_success` | Number of successful inference requests. |
| `model_infer_request_failure` | Number of failed inference requests. |
| `batch_request_queue` | Queue size for the adaptive batching queue. |
| `parallel_request_queue` | Queue size for the parallel inference queue. |
On top of the default set of metrics, MLServer's REST server will also expose a set of metrics specific to REST.

| Metric | Description |
| ------ | ----------- |
| `[rest_server]_requests` | Number of REST requests, labelled by endpoint and status code. |
| `[rest_server]_requests_duration_seconds` | Latency of REST requests. |
| `[rest_server]_requests_in_progress` | Number of in-flight REST requests. |

On top of the default set of metrics, MLServer's gRPC server will also expose a set of metrics specific to gRPC.

| Metric | Description |
| ------ | ----------- |
| `grpc_server_handled` | Number of gRPC requests, labelled by gRPC code and method. |
| `grpc_server_started` | Number of in-flight gRPC requests. |
MLServer allows you to register custom metrics within your custom inference runtimes. This can be done through the `mlserver.register()` and `mlserver.log()` methods.

- `mlserver.register`: Register a new metric.
- `mlserver.log`: Log a new set of metric / value pairs. If there's any unregistered metric, it will get registered on-the-fly.
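A sketch of a runtime tracking a custom metric (the metric name and logged value are illustrative):

```python
import mlserver
from mlserver import MLModel
from mlserver.codecs import NumpyRequestCodec
from mlserver.types import InferenceRequest, InferenceResponse


class CustomMetricsRuntime(MLModel):
    async def load(self) -> bool:
        # Register the custom metric once, at load time
        mlserver.register(name="my_custom_metric", description="An illustrative custom metric")
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Log a new value for the custom metric on every inference call
        mlserver.log(my_custom_metric=34)

        model_input = NumpyRequestCodec.decode_request(payload)
        return NumpyRequestCodec.encode_response(self.name, model_input)
```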
Below, you can find the list of standardised labels that you will be able to find on model-specific metrics:

| Label | Description |
| ----- | ----------- |
| `model_name` | Model Name (e.g. `my-custom-model`) |
| `model_version` | Model Version (e.g. `v1.2.3`) |

| Setting | Description | Default |
| ------- | ----------- | ------- |
| `metrics_endpoint` | Path under which the metrics endpoint will be exposed. | `/metrics` |
| `metrics_port` | Port used to serve the metrics server. | `8082` |
| `metrics_rest_server_prefix` | Prefix used for metric names specific to MLServer's REST inference interface. | `rest_server` |
| `metrics_dir` | Directory used to store internal metric files. | MLServer's current working directory (i.e. `$PWD`) |
Alongside the server-wide Swagger UI, MLServer will also autogenerate a Swagger UI tailored to individual models, showing the endpoints available for each one.

On top of these, you can also register and track your own custom metrics as part of your custom inference runtimes.

By default, MLServer will expose metrics around inference requests (count and error rate) and the status of its internal requests queues. These internal queues are used for adaptive batching and parallel inference.
The prefix for the REST-specific metrics will be dependent on the `metrics_rest_server_prefix` flag from the MLServer settings.

Custom metrics will generally be registered in the `load() <mlserver.MLModel.load>` method and then used in the `predict() <mlserver.MLModel.predict>` method of your custom runtime.

For metrics specific to a model (e.g. custom metrics, request counts, etc.), MLServer will always label these with the model name and model version. Downstream, this will allow you to aggregate and query metrics per model.

MLServer will expose metric values through a metrics endpoint exposed on its own metrics server. This endpoint can be polled by Prometheus or other Prometheus-compatible backends.

Below you can find the available settings to control the behaviour of the metrics server:

Directory used to store internal metric files (used to support metrics sharing across inference pool workers). This is equivalent to Prometheus' `prometheus_multiproc_dir` env var.
Out of the box, MLServer includes support for streaming data to your models. Streaming support is available for both the REST and gRPC servers.
Streaming support for the REST server is limited only to server streaming. This means that the client sends a single request to the server, and the server responds with a stream of data.
The streaming endpoints are available for both the `infer` and `generate` methods through the following endpoints:

- `/v2/models/{model_name}/versions/{model_version}/infer_stream`
- `/v2/models/{model_name}/infer_stream`
- `/v2/models/{model_name}/versions/{model_version}/generate_stream`
- `/v2/models/{model_name}/generate_stream`

Note that for REST, the `generate` and `generate_stream` endpoints are aliases for the `infer` and `infer_stream` endpoints, respectively. Those names are used to better reflect the nature of the operation for Large Language Models (LLMs).
Streaming support for the gRPC server is available for both client and server streaming. This means that the client sends a stream of data to the server, and the server responds with a stream of data.
The two streams operate independently, so the client and the server can read and write data however they want (e.g. the server could either wait to receive all the client messages before sending a response, or send a response after each message). Note that bi-directional streaming covers all the possible combinations of client and server streaming: unary-stream, stream-unary, and stream-stream. The unary-unary case could be covered by bi-directional streaming as well, but `mlserver` already has the `predict` method dedicated to this use case. The logic for how requests are received and processed, and how responses are sent back, should be built into the runtime itself.

The stub method for streaming to be used by the client is `ModelStreamInfer`.

There are three main limitations of the streaming support in MLServer:

- the `parallel_workers` setting should be set to `0` to disable distributed workers (to be addressed in future releases)
- for REST, the `gzip_enabled` setting should be set to `false` to disable GZIP compression, as streaming is not compatible with GZIP compression (see issue )
Out of the box, Seldon Core comes with a few MLServer runtimes pre-configured to run straight away. This allows you to deploy an MLServer instance by just pointing to where your model artifact is and specifying what ML framework was used to train it.

To let Seldon Core know what framework was used to train your model, you can use the `implementation` field of your `SeldonDeployment` manifest. For example, to deploy a Scikit-Learn artifact stored remotely in GCS, one could do:
As you can see highlighted above, all that we need to specify is that:
Once you have your `SeldonDeployment` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to just apply it directly through `kubectl`, by running:

To consult the supported values of the `implementation` field where MLServer is used, you can check the support table below.
As mentioned above, pre-packaged servers come built into Seldon Core. Therefore, only a pre-determined subset of them will be supported for a given release of Seldon Core.

The table below shows a list of the currently supported values of the `implementation` field. Each row also shows what ML framework they correspond to and what MLServer runtime will be enabled internally on your model deployment when used.
| ML Framework | `implementation` value | MLServer runtime |
| ------------ | ---------------------- | ---------------- |
| Scikit-Learn | `SKLEARN_SERVER` | `mlserver-sklearn` |
| XGBoost | `XGBOOST_SERVER` | `mlserver-xgboost` |
| MLflow | `MLFLOW_SERVER` | `mlserver-mlflow` |
The `componentSpecs` field of the `SeldonDeployment` manifest will allow us to let Seldon Core know what image should be used to serve a custom model. For example, if we assume that our custom image has been tagged as `my-custom-server:0.1.0`, we could write our `SeldonDeployment` manifest as follows:

As we can see highlighted on the snippet above, all that's needed to deploy a custom MLServer image is:

- Pointing our model container to use our custom MLServer image, by specifying it on the `image` field of the `componentSpecs` section of the manifest.

Once you have your `SeldonDeployment` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to just apply it directly through `kubectl`, by running:
MLServer is used as the core Python inference server in Seldon Core. Therefore, it should be straightforward to deploy your models either by using one of the pre-packaged servers or by pointing to a custom MLServer image.

This section assumes a basic knowledge of Seldon Core and Kubernetes, as well as access to a working Kubernetes cluster with Seldon Core installed. To learn more about Seldon Core or how to install it, please visit the Seldon Core documentation.

Our inference deployment should use the V2 Inference Protocol, which is done by setting the `protocol` field to `kfserving`.
Our model artifact is a serialised Scikit-Learn model, therefore it should be served using the Scikit-Learn pre-packaged server, which is done by setting the `implementation` field to `SKLEARN_SERVER`.
Note that, while the `protocol` should always be set to `kfserving` (i.e. so that models are served using the V2 Inference Protocol), the value of the `implementation` field will be dependent on your ML framework. The valid values of the `implementation` field are listed in the support table below. However, it should also be possible to deploy your own custom runtimes (e.g. to support a framework not covered by the pre-packaged servers).
Note that, on top of the ones shown above (backed by MLServer), Seldon Core also provides a wider set of pre-packaged servers. To check the full list, please visit the Seldon Core documentation.
There could be cases where the pre-packaged MLServer runtimes supported out-of-the-box in Seldon Core may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, Seldon Core makes it easy to deploy them into your serving infrastructure.
Letting Seldon Core know that the model deployment will be served through the V2 Inference Protocol by setting the `protocol` field to `v2`.