MLServer is used as the core Python inference server in KServe (formerly known as KFServing). This provides a straightforward path to deploy your models into a scalable serving infrastructure backed by Kubernetes.
This section assumes a basic knowledge of KServe and Kubernetes, as well as access to a working Kubernetes cluster with KServe installed. To learn more about KServe or how to install it, please visit the KServe documentation.
KServe provides built-in serving runtimes to deploy models trained in common ML frameworks. These allow you to deploy your models into a robust infrastructure by just pointing to where the model artifacts are stored remotely.
Some of these runtimes leverage MLServer as the core inference server. Therefore, it should be straightforward to move from your local testing to your serving infrastructure.
To use any of the built-in serving runtimes offered by KServe, it should be enough to select the relevant one in your `InferenceService` manifest.
For example, to serve a Scikit-Learn model, you could use a manifest like the one below:
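A minimal sketch of such a manifest could look as follows (the deployment name and `storageUri` are placeholders for your own values):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-sklearn-model
spec:
  predictor:
    sklearn:
      # Serve over the V2 inference protocol using the sklearn serving runtime
      protocolVersion: v2
      storageUri: gs://my-bucket/sklearn/model
```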
As you can see highlighted above, the `InferenceService` manifest will only need to specify the following points:

- The model artifact is a Scikit-Learn model. Therefore, we will use the `sklearn` serving runtime to deploy it.
- The model will be served using the V2 inference protocol, which can be enabled by setting the `protocolVersion` field to `v2`.
Once you have your `InferenceService` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, by running:
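For example, assuming the manifest above was saved as `my-model.yaml` (a placeholder name):

```bash
kubectl apply -f my-model.yaml
```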
As mentioned above, KServe offers support for built-in serving runtimes, some of which leverage MLServer as the inference server. Below you can find a table listing these runtimes, and the MLServer inference runtime that they correspond to.
Note that, on top of the ones shown above (backed by MLServer), KServe also provides a wider set of serving runtimes. To see the full list, please visit the KServe documentation.
Sometimes, the serving runtimes built into KServe may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, it's easy to deploy them into your serving infrastructure leveraging KServe support for custom runtimes.
The `InferenceService` manifest gives you full control over the containers used to deploy your machine learning model. This can be leveraged to point your deployment to the custom MLServer image containing your custom logic. For example, if we assume that our custom image has been tagged as `my-custom-server:0.1.0`, we could write an `InferenceService` manifest like the one below:
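A rough sketch of such a manifest is shown below; the exact fields used to select the protocol and port can vary across KServe versions, so treat this as an illustration rather than a definitive spec:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: my-custom-model
spec:
  predictor:
    containers:
      - name: classifier
        image: my-custom-server:0.1.0
        env:
          # Explicitly select the V2 inference protocol
          - name: PROTOCOL
            value: v2
        ports:
          # Port exposed by the custom MLServer container
          - containerPort: 8080
            protocol: TCP
```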
As we can see highlighted above, the main points that we'll need to take into account are:

- Pointing to our custom MLServer image in the custom container section of our `InferenceService`.
- Explicitly choosing the V2 inference protocol to serve our model.
- Letting KServe know what port will be exposed by our custom container to send inference requests.
Once you have your `InferenceService` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl`, using the same `kubectl apply` command as above.
Out of the box, MLServer includes support to offload inference workloads to a pool of workers running in separate processes. This allows MLServer to scale out beyond the limitations of the Python interpreter. To learn more about why this can be beneficial, you can check the concurrency section below.
By default, MLServer will spin up a pool with only one worker process to run inference. All models will be loaded uniformly across the inference pool workers. To read more about advanced settings, please see the usage section below.
The Global Interpreter Lock (GIL) is a mutex lock that exists in most Python interpreters (e.g. CPython). Its main purpose is to lock Python's execution so that it only runs on a single processor at the same time. This simplifies certain things for the interpreter. However, it also adds the limitation that a single Python process will never be able to leverage multiple cores.
When we think about MLServer's support for Multi-Model Serving (MMS), this could lead to scenarios where a heavily-used model starves the other models running within the same MLServer instance. Similarly, even if we don’t take MMS into account, the GIL also makes it harder to scale inference for a single model.
To work around this limitation, MLServer offloads the model inference to a pool of workers, where each worker is a separate Python process (and thus has its own separate GIL). This means that we can get full access to the underlying hardware.
Managing the Inter-Process Communication (IPC) between the main MLServer process and the inference pool workers brings in some overhead. Under the hood, MLServer uses the `multiprocessing` library to implement the distributed processing management, which has been shown to offer the smallest possible overhead when implementing these types of distributed strategies {cite}`zhiFiberPlatformEfficient2020`.
The extra overhead introduced by other libraries is usually brought in as a trade-off in exchange for other advanced features for complex distributed processing scenarios. However, MLServer's use case is simple enough to not require any of these.
Even though this overhead is minimised, it can still be particularly noticeable for lightweight inference methods, where the extra IPC overhead can account for a large percentage of the overall time. In these cases (which can only be assessed on a model-by-model basis), the user has the option to disable the parallel inference feature.
For regular models where inference can take a bit more time, this overhead is usually offset by the benefit of having multiple cores to compute inference on.
By default, MLServer will always create an inference pool with one single worker. The number of workers (i.e. the size of the inference pool) can be adjusted globally through the server-level `parallel_workers` setting.
`parallel_workers`

The `parallel_workers` field of the `settings.json` file (or alternatively, the `MLSERVER_PARALLEL_WORKERS` global environment variable) controls the size of MLServer's inference pool. The expected values are:

- `N`, where `N > 0`, will create a pool of `N` workers.
- `0`, will disable the parallel inference feature. In other words, inference will happen within the main MLServer process.
Jiale Zhi, Rui Wang, Jeff Clune, and Kenneth O. Stanley. Fiber: A Platform for Efficient Development and Distributed Training for Reinforcement Learning and Population-Based Methods. arXiv:2003.11164 [cs, stat], March 2020.
An open source inference server for your machine learning models.
MLServer aims to provide an easy way to start serving your machine learning models through a REST and gRPC interface, fully compliant with KFServing's V2 Dataplane spec. Watch a quick video introducing the project here.
Multi-model serving, letting users run multiple models within the same process.
Ability to run inference in parallel for vertical scaling across multiple models through a pool of inference workers.
Support for adaptive batching, to group inference requests together on the fly.
Scalability with deployment in Kubernetes native frameworks, including Seldon Core and KServe (formerly known as KFServing), where MLServer is the core Python inference server used to serve machine learning models.
Support for the standard V2 Inference Protocol on both the gRPC and REST flavours, which has been standardised and adopted by various model serving frameworks.
You can read more about the goals of this project on the initial design document.
You can install the `mlserver` package by running:
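```bash
pip install mlserver
```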
Note that to use any of the optional inference runtimes, you'll need to install the relevant package. For example, to serve a `scikit-learn` model, you would need to install the `mlserver-sklearn` package:
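```bash
pip install mlserver-sklearn
```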
For further information on how to use MLServer, you can check any of the available examples.
Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice. You can read more about inference runtimes in their documentation page.
Out of the box, MLServer comes with a set of pre-packaged runtimes which let you interact with a subset of common frameworks. This allows you to start serving models saved in these frameworks straight away. However, it's also possible to write custom runtimes.
Out of the box, MLServer provides support for:
🔴 Unsupported
🟠 Deprecated: To be removed in a future version
🟢 Supported
🔵 Untested
To see MLServer in action, check out our full list of examples. You can find below a few selected examples showcasing how you can leverage MLServer to start serving your machine learning models.
Both the main mlserver
package and the inference runtimes packages try to follow the same versioning schema. To bump the version across all of them, you can use the ./hack/update-version.sh
script.
We generally keep the version as a placeholder for an upcoming version.
For example:
To run all of the tests for MLServer and the runtimes, use:
To run tests for a single file, use something like:
MLServer follows the Open Inference Protocol (previously known as the "V2 Protocol"). You can find the full OpenAPI spec for the Open Inference Protocol in the links below:
Name | Description | OpenAPI Spec |
---|---|---|
On top of the OpenAPI spec above, MLServer also autogenerates a Swagger UI which can be used to interact dynamically with the Open Inference Protocol.
The autogenerated Swagger UI can be accessed under the `/v2/docs` endpoint.

Besides the Swagger UI, you can also access the raw OpenAPI spec through the `/v2/docs/dataplane.json` endpoint.
Alongside the general API documentation, MLServer will also autogenerate a Swagger UI tailored to individual models, showing the endpoints available for each one.
The model-specific autogenerated Swagger UI can be accessed under the following endpoints:

- `/v2/models/{model_name}/docs`
- `/v2/models/{model_name}/versions/{model_version}/docs`

Besides the Swagger UI, you can also access the model-specific raw OpenAPI spec through the following endpoints:

- `/v2/models/{model_name}/docs/dataplane.json`
- `/v2/models/{model_name}/versions/{model_version}/docs/dataplane.json`
This guide will help you get started creating machine learning microservices with MLServer in less than 30 minutes. Our use case will be to create a service that helps us compare the similarity between two documents. Think about whenever you are choosing a book, news article, blog post, or tutorial to read next: wouldn't it be great to have a way to compare it with similar ones that you have already read and liked (without having to rely on a recommendation system)? That's what we'll focus on in this guide: creating a document similarity service. 📜 + 📃 = 😎👌🔥
The code is showcased as if it were cells inside a notebook but you can run each of the steps inside Python files with minimal effort.
MLServer is an open-source Python library for building production-ready asynchronous APIs for machine learning models.
The first step is to install `mlserver`, the `spacy` library, and the language model `spacy` will need for our use case. We will also download the `wikipedia-api` library to test our use case with a few fun summaries.
If you've never heard of spaCy before, it is an open-source Python library for advanced natural language processing that excels at large-scale information extraction and retrieval tasks, among many others. The model we'll use is a pre-trained model on English text from the web. This model will help us get started with our use case faster than if we had to train a model from scratch for our use case.
Let's first install these libraries.
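```bash
pip install mlserver spacy wikipedia-api
```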
We will also need to download the language model separately once we have spaCy inside our virtual environment.
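The specific model used in the guide isn't pinned down here; the medium English web model is a reasonable stand-in:

```bash
python -m spacy download en_core_web_md
```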
If you're going over this guide inside a notebook, don't forget to add an exclamation mark !
in front of the two commands above. If you are in VSCode, you can keep them as they are and change the cell type to bash.
At its core, MLServer requires that users give it 3 things: a `model-settings.json` file with information about the model, an (optional) `settings.json` file with information related to the server you are about to set up, and a `.py` file with the load-predict recipe for your model (as shown in the picture above).
Let's create a directory for our model.
Before we create a service that allows us to compare the similarity between two documents, it is good practice to first test that our solution works, especially when we're using a pre-trained model and/or a pipeline.
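For example, assuming the medium English model downloaded above:

```python
import spacy

nlp = spacy.load("en_core_web_md")
```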
Now that we have our model loaded, let's look at the similarity of the abstracts of Barbieheimer using the `wikipedia-api` Python library. The main requirement of the API is that we pass into the main class, `Wikipedia()`, a project name, an email and the language we want information to be returned in. After that, we can search for the movie summaries we want by passing the title of the movie to the `.page()` method and accessing the summary of it with the `.summary` attribute.
Feel free to change the movies for other topics you might be interested in.

You can run the following lines inside a notebook or, conversely, add them to an `app.py` file.
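A sketch of those lines; the project name and email in the user agent are placeholders:

```python
import wikipediaapi

# Wikipedia() takes a user agent (project name + contact email) and a language
wiki = wikipediaapi.Wikipedia("SimilarityService (hello@example.com)", "en")

barbie = wiki.page("Barbie (film)").summary
oppenheimer = wiki.page("Oppenheimer (film)").summary

print(barbie[:150])
print(oppenheimer[:150])
```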
If you created an app.py
file with the code above, make sure you run python app.py
from the terminal.
Now that we have our two summaries, let's compare them using spacy.
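For instance, reusing the `nlp` pipeline loaded earlier:

```python
doc1 = nlp(barbie)
doc2 = nlp(oppenheimer)

print(doc1.similarity(doc2))
```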
Notice that both summaries have information about the other movie, about "films" in general, and about the dates each aired on (which are the same). The reality is that the model hasn't seen any of these movies, so it might be generalizing to the context of each article, "movies," rather than their content, "dolls as humans and the atomic bomb."
You should, of course, play around with different pages and see if what you get back is coherent with what you would expect.
Time to create a machine learning API for our use-case. 😎
MLServer allows us to wrap machine learning models into APIs and build microservices with replicas of a single model, or different models all together.
To create a service with MLServer, we will define a class with two asynchronous functions, one that loads the model and another one to run inference (or predict) with. The former will load the `spacy` model we tested in the last section, and the latter will take in a list with the two documents we want to compare. Lastly, our function will return a `numpy` array with a single value, our similarity score. We'll write the file to our `similarity_model` directory and call it `my_model.py`.
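A sketch of what `my_model.py` could look like; the class name and the spaCy model are placeholders you can adapt:

```python
# similarity_model/my_model.py
import numpy as np
import spacy

from mlserver import MLModel
from mlserver.codecs import NumpyCodec, StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class SimilarityModel(MLModel):
    async def load(self) -> bool:
        # Load the pre-trained spaCy pipeline once, at start-up
        self._model = spacy.load("en_core_web_md")
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Expect a single input containing the two documents to compare
        docs = StringCodec.decode_input(payload.inputs[0])
        doc1, doc2 = self._model(docs[0]), self._model(docs[1])
        score = np.array([doc1.similarity(doc2)])

        return InferenceResponse(
            model_name=self.name,
            outputs=[NumpyCodec.encode_output(name="similarity", payload=score)],
        )
```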
Now that we have our model file ready to go, the last piece of the puzzle is to tell MLServer a bit of info about it. In particular, it wants (or needs) to know the name of the model and how to implement it. The former can be anything you want (and it will be part of the URL of your API), and the latter will follow the recipe of `name_of_py_file_with_your_model.class_with_your_model`.
Let's create the `model-settings.json` file MLServer is expecting inside our `similarity_model` directory and add the name and the implementation of our model to it.
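A sketch of that file, matching the placeholder names used above:

```json
{
  "name": "doc-sim-model",
  "implementation": "my_model.SimilarityModel"
}
```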
Now that everything is in place, we can start serving predictions locally to test how things would play out for our future users. We'll initiate our server via the command line, and later on we'll see how to do the same via Python files. Here's where we are at right now in the process of developing microservices with MLServer.
As you can see in the image, our server will be initialized with three entry points, one for HTTP requests, another for gRPC, and another for the metrics. To learn more about the powerful metrics feature of MLServer, please visit the relevant docs page here. To learn more about gRPC, please see this tutorial here.
To start our service, open up a terminal and run the following command.
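```bash
mlserver start similarity_model/
```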
Note: If this is a fresh terminal, make sure you activate your environment before you run the command above. If you run the command above from your notebook (e.g. `!mlserver start similarity_model/`), you will have to send the request below from another notebook or terminal, since the cell will continue to run until you turn it off.
Time to become a client of our service and test it. For this, we'll set up the payload we'll send to our service and use the `requests` library to POST our request.

Please note that the request below uses the variables we created earlier with the summaries of Barbie and Oppenheimer. If you are sending this POST request from a fresh Python file, make sure you move those earlier lines of code into your request file.
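A sketch of such a request, assuming the server runs locally on the default HTTP port and the model is named `doc-sim-model` as above:

```python
import requests

inference_request = {
    "inputs": [
        {
            "name": "documents",
            "shape": [2],
            "datatype": "BYTES",
            "parameters": {"content_type": "str"},
            "data": [barbie, oppenheimer],  # the two summaries fetched earlier
        }
    ]
}

response = requests.post(
    "http://localhost:8080/v2/models/doc-sim-model/infer",
    json=inference_request,
)
print(response.json())
```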
Let's decompose what just happened.
The `URL` for our service might seem a bit odd if you've never heard of the V2/Open Inference Protocol (OIP). This protocol is a set of specifications that allows machine learning models to be shared and deployed in a standardized way. This protocol enables the use of machine learning models on a variety of platforms and devices without requiring changes to the model or its code. The OIP is useful because it allows us to integrate machine learning into a wide range of applications in a standard way.
All URLs you create with MLServer will have the following structure.
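In rough terms:

```
http://<host>:<port>/v2/models/<model_name>/infer
```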
This kind of protocol is a standard adopted by different companies like NVIDIA, Tensorflow Serving, KServe, and others, to keep everyone on the same page. If you think about driving cars globally, your country has to apply a standard for driving on a particular side of the road, and this ensures you and everyone else stays on the left (or the right depending on where you are at). Adopting this means that you won't have to wonder where the next driver is going to come out of when you are driving and are about to take a turn, instead, you can focus on getting to where you're going to without much worrying.
Let's describe what each of the components of our `inference_request` does.

- `name`: this maps one-to-one to the name of the parameter in your `predict()` function.
- `shape`: represents the shape of the elements in our `data`. In our case, it is a list with `[2]` strings.
- `datatype`: the different data types expected by the server, e.g. str, numpy array, pandas dataframe, bytes, etc.
- `parameters`: allows us to specify the `content_type` beyond the data types.
- `data`: the inputs to our predict function.
To learn more about the OIP and how MLServer content types work, please have a look at their docs page here.
Say you need to meet the demand of a high number of users, and one model might not be enough or is not using all of the resources of the virtual machine instance it was allocated to. What we can do in this case is create multiple replicas of our model to increase the throughput of the requests that come in. This can be particularly useful at the peak times of our server. To do this, we need to tweak the settings of our server via the `settings.json` file. In it, we'll add the number of independent models we want to have via the parameter `"parallel_workers": 3`.
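For example, a minimal `settings.json` (placed in the directory you point `mlserver start` at) could be:

```json
{
  "parallel_workers": 3
}
```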
Let's stop our server, change the settings of it, start it again, and test it.
As you can see in the output of the terminal in the picture above, we now have 3 models running in parallel. The reason you might see 4 lines is that, by default, MLServer prints the name of the initialised model when one or more models are loaded, plus one line for each of the replicas specified in the settings.
Let's get a few more twin films examples to test our server. Get as creative as you'd like. 💡
Let's first test that the function works as intended.
Now let's map three POST requests at the same time.
We can also test it one by one.
For the last step of this guide, we are going to package our model and service into a docker image that we can reuse in another project or share with colleagues immediately. This step requires that we have docker installed and configured on our machines, so if you need to set up docker, you can do so by following the instructions in the documentation here.
The first step is to create a `requirements.txt` file with all of our dependencies and add it to the directory we've been using for our service (`similarity_model`).
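A sketch of that file (pin versions as needed):

```
mlserver
spacy
wikipedia-api
```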
The next step is to build a docker image with our model, its dependencies and our server. If you've never heard of docker images before, here's a short description.
A Docker image is a lightweight, standalone, and executable package that includes everything needed to run a piece of software, including code, libraries, dependencies, and settings. It's like a carry-on bag for your application, containing everything it needs to travel safely and run smoothly in different environments. Just as a carry-on bag allows you to bring your essentials with you on a trip, a Docker image enables you to transport your application and its requirements across various computing environments, ensuring consistent and reliable deployment.
MLServer has a convenient function that lets us create docker images with our services. Let's use it.
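For example (the image tag below is a placeholder):

```bash
mlserver build similarity_model/ -t doc-sim-server:0.1.0
```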
We can check that our image was successfully built not only by looking at the logs of the previous command but also with the `docker images` command.
Let's test that our image works as intended with the following command. Make sure you have closed your previous server by using `CTRL + C` in your terminal.
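For instance, assuming the placeholder tag used above:

```bash
docker run -it --rm -p 8080:8080 doc-sim-server:0.1.0
```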
Now that you have a packaged and fully-functioning microservice with our model, we could deploy our container to a production serving platform like Seldon Core, or via different offerings available through the many cloud providers out there (e.g. AWS Lambda, Google Cloud Run, etc.). You could also run this image on KServe, a Kubernetes native tool for model serving, or anywhere else where you can bring your docker image with you.
To learn more about MLServer and the different ways in which you can use it, head over to the examples section or the user guide. To learn about some of the deployment options available, head over to the docs here.
To keep up to date with what we are up to at Seldon, make sure you join our Slack community.
MLServer includes support to batch requests together transparently on-the-fly. We refer to this as "adaptive batching", although it can also be known as "predictive batching".
There are usually two main reasons to adopt adaptive batching:
Maximise resource usage. Usually, inference operations are “vectorised” (i.e. are designed to operate across batches). For example, a GPU is designed to operate on multiple data points at the same time. Therefore, to make sure that it’s used at maximum capacity, we need to run inference across batches.
Minimise any inference overhead. Usually, all models will have to “pay” a constant overhead when running any type of inference. This can be something like IO to communicate with the GPU or some kind of processing in the incoming data. Up to a certain size, this overhead tends to not scale linearly with the number of data points. Therefore, it’s in our interest to send as large batches as we can without deteriorating performance.
However, these benefits will usually scale only up to a certain point, which is usually determined by either the infrastructure, the machine learning framework used to train your model, or a combination of both. Therefore, to maximise the performance improvements brought in by adaptive batching, it will be important to configure it with the appropriate values for your model. Since these values are usually found through experimentation, MLServer won't enable adaptive batching by default on newly loaded models.
MLServer lets you configure adaptive batching independently for each model, through two main parameters:

- Maximum batch size, that is, how many requests you want to group together.
- Maximum batch time, that is, how much time we should wait for new requests until we reach our maximum batch size.
`max_batch_size`

The `max_batch_size` field of the `model-settings.json` file (or alternatively, the `MLSERVER_MODEL_MAX_BATCH_SIZE` global environment variable) controls the maximum number of requests that should be grouped together on each batch. The expected values are:

- `N`, where `N > 1`, will create batches of up to `N` elements.
- `0` or `1`, will disable adaptive batching.
`max_batch_time`

The `max_batch_time` field of the `model-settings.json` file (or alternatively, the `MLSERVER_MODEL_MAX_BATCH_TIME` global environment variable) controls the time that MLServer should wait for new requests to come in until we reach our maximum batch size.

The expected format is in seconds, but it will take fractional values. That is, 500ms could be expressed as `0.5`.

The expected values are:

- `T`, where `T > 0`, will wait `T` seconds at most.
- `0`, will disable adaptive batching.
MLServer allows adding custom parameters to the `parameters` field of the requests. When several requests are grouped together into a batch, these parameters are received inside the server as a merged list of the individual requests' parameters. In the same way, if the server sends back a batched response, it will be split up again so that each response is returned to its client with its own parameters.
Inference runtimes allow you to define how your model should be used within MLServer. You can think of them as the backend glue between MLServer and your machine learning framework of choice.
Out of the box, MLServer comes with a set of pre-packaged runtimes which let you interact with a subset of common ML frameworks. This allows you to start serving models saved in these frameworks straight away. To avoid bringing in dependencies for frameworks that you don't need to use, these runtimes are implemented as independent (and optional) Python packages. This mechanism also allows you to roll out your own custom runtimes very easily.
To pick which runtime you want to use for your model, you just need to make sure that the right package is installed, and then point to the correct runtime class in your `model-settings.json` file.
MLServer is used as the core Python inference server in Seldon Core. Therefore, it should be straightforward to deploy your models either by using one of the pre-packaged model servers or by pointing to a custom MLServer image.

This section assumes a basic knowledge of Seldon Core and Kubernetes, as well as access to a working Kubernetes cluster with Seldon Core installed. To learn more about Seldon Core or how to install it, please visit the Seldon Core documentation.
Out of the box, Seldon Core comes with a few MLServer runtimes pre-configured to run straight away. This allows you to deploy an MLServer instance by just pointing to where your model artifact is and specifying what ML framework was used to train it.
To let Seldon Core know what framework was used to train your model, you can use the `implementation` field of your `SeldonDeployment` manifest. For example, to deploy a Scikit-Learn artifact stored remotely in GCS, one could do:
As you can see highlighted above, all that we need to specify is that:
Once you have your `SeldonDeployment` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl` (i.e. `kubectl apply -f <manifest-file>`).
To consult the supported values of the `implementation` field where MLServer is used, you can check the support table below.
As mentioned above, pre-packaged servers come built into Seldon Core. Therefore, only a pre-determined subset of them will be supported for a given release of Seldon Core.
The table below shows a list of the currently supported values of the `implementation` field. Each row also shows what ML framework it corresponds to and what MLServer runtime will be enabled internally on your model deployment when used.
The `componentSpecs` field of the `SeldonDeployment` manifest will allow us to let Seldon Core know what image should be used to serve a custom model. For example, if we assume that our custom image has been tagged as `my-custom-server:0.1.0`, we could write our `SeldonDeployment` manifest as follows:
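A rough sketch of such a manifest:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: my-custom-model
spec:
  protocol: v2
  predictors:
    - name: default
      graph:
        name: classifier
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                image: my-custom-server:0.1.0
```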
As we can see highlighted on the snippet above, all that's needed to deploy a custom MLServer image is:

- Pointing our model container to use our custom MLServer image, by specifying it on the `image` field of the `componentSpecs` section of the manifest.
Once you have your `SeldonDeployment` manifest ready, the next step is to apply it to your cluster. There are multiple ways to do this, but the simplest is probably to apply it directly through `kubectl` (i.e. `kubectl apply -f <manifest-file>`).
This package provides an MLServer runtime compatible with XGBoost.

You can install the runtime, alongside `mlserver`, as:
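```bash
pip install mlserver mlserver-xgboost
```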
For further information on how to use MLServer with XGBoost, you can check out this worked-out example.
The XGBoost inference runtime will expect that your model is serialised via one of the following methods:
Extension | Docs | Example |
---|---|---|
The XGBoost inference runtime exposes a number of outputs depending on the model type. These outputs match the `predict` and `predict_proba` methods of the XGBoost model.
By default, the runtime will only return the output of `predict`. However, you are able to control which outputs you want back through the `outputs` field of your {class}`InferenceRequest <mlserver.types.InferenceRequest>` payload.
For example, to only return the model's `predict_proba` output, you could define a payload such as:
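A sketch of such a payload; the input tensor is a placeholder:

```json
{
  "inputs": [
    {
      "name": "my-input",
      "datatype": "FP32",
      "shape": [1, 3],
      "data": [0.1, 0.2, 0.3]
    }
  ],
  "outputs": [
    {"name": "predict_proba"}
  ]
}
```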
This package provides an MLServer runtime compatible with LightGBM.

You can install the runtime, alongside `mlserver`, as:
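```bash
pip install mlserver mlserver-lightgbm
```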
For further information on how to use MLServer with LightGBM, you can check out this worked-out example.

If no content type is present on the request or metadata, the LightGBM runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
This package provides an MLServer runtime compatible with Scikit-Learn.

You can install the runtime, alongside `mlserver`, as:
For further information on how to use MLServer with Scikit-Learn, you can check out this worked-out example.

If no content type is present on the request or metadata, the Scikit-Learn runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
The Scikit-Learn inference runtime exposes a number of outputs depending on the model type. These outputs match the `predict`, `predict_proba` and `transform` methods of the Scikit-Learn model.
Output | Returned By Default | Availability |
---|---|---|
By default, the runtime will only return the output of `predict`. However, you are able to control which outputs you want back through the `outputs` field of your {class}`InferenceRequest <mlserver.types.InferenceRequest>` payload.
For example, to only return the model's `predict_proba` output, you could define a payload analogous to the XGBoost one shown above.
This package provides an MLServer runtime compatible with Alibi-Detect models.

You can install the `mlserver-alibi-detect` runtime, alongside `mlserver`, as:
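```bash
pip install mlserver mlserver-alibi-detect
```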
For further information on how to use MLServer with Alibi-Detect, you can check out this worked-out example.

If no content type is present on the request or metadata, the Alibi-Detect runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
The Alibi Detect runtime exposes a couple of setting flags which can be used to customise how the runtime behaves. These settings can be added under the `parameters.extra` section of your `model-settings.json` file, e.g.
You can find the full reference of the accepted extra settings for the Alibi Detect runtime below:
Machine learning models generally expect their inputs to be passed down as a particular Python type. Most commonly, this type ranges from "general purpose" NumPy arrays or Pandas DataFrames to more granular definitions, like `datetime` objects, `Pillow` images, etc. Unfortunately, the definition of the V2 Inference Protocol doesn't cover any of these specific use cases. This protocol can be thought of as a wider "lower level" spec, which only defines what fields a payload should have.
To account for this gap, MLServer introduces support for content types, which offer a way to let MLServer know how it should "decode" V2-compatible payloads. When shaped in the right way, these payloads should "encode" all the information required to extract the higher level Python type that will be required for a model.
To illustrate the above, we can think of a Scikit-Learn pipeline, which takes in a Pandas DataFrame and returns a NumPy Array. Without the use of content types, the V2 payload itself would probably lack information about how this payload should be treated by MLServer. Likewise, the Scikit-Learn pipeline wouldn't know how to treat a raw V2 payload. In this scenario, the use of content types allows us to specify what the actual "higher level" information encoded within the V2 protocol payloads is.
To let MLServer know that a particular payload must be decoded / encoded as a different Python data type (e.g. NumPy Array, Pandas DataFrame, etc.), you can specify it through the `content_type` field of the `parameters` section of your request.
As an example, we can consider the following dataframe, containing two columns: Age and First Name.
This table could be specified in the V2 protocol as the following payload, where we declare that:

- The whole set of inputs should be decoded as a Pandas DataFrame (i.e. setting the content type as `pd`).
- The First Name column should be decoded as a UTF-8 string (i.e. setting the content type as `str`).
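A sketch of that payload, using the example rows shown elsewhere on this page (Joanne / 34 and Michael / 22):

```json
{
  "parameters": {"content_type": "pd"},
  "inputs": [
    {
      "name": "First Name",
      "datatype": "BYTES",
      "parameters": {"content_type": "str"},
      "shape": [2],
      "data": ["Joanne", "Michael"]
    },
    {
      "name": "Age",
      "datatype": "INT32",
      "shape": [2],
      "data": [34, 22]
    }
  ]
}
```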
It's important to keep in mind that content types can be specified at both the request level and the input level. The former will apply to the entire set of inputs, whereas the latter will only apply to a particular input of the payload.
Under the hood, the conversion between content types is implemented using codecs. In the MLServer architecture, codecs are an abstraction which know how to encode and decode high-level Python types to and from the V2 Inference Protocol.
Depending on the high-level Python type, encoding / decoding operations may require access to multiple input or output heads. For example, a Pandas Dataframe would need to aggregate all of the input-/output-heads present in a V2 Inference Protocol response.
However, a Numpy array or a list of strings, could be encoded directly as an input head within a larger request.
To account for this, codecs can work at either the request- / response-level (known as request codecs), or the input- / output-level (known as input codecs). Each of these codecs exposes the following public interface, where `Any` represents a high-level Python datatype (e.g. a Pandas DataFrame, a Numpy Array, etc.):
Request Codecs
encode_request() <mlserver.codecs.RequestCodec.encode_request>
decode_request() <mlserver.codecs.RequestCodec.decode_request>
encode_response() <mlserver.codecs.RequestCodec.encode_response>
decode_response() <mlserver.codecs.RequestCodec.decode_response>
Input Codecs
encode_input() <mlserver.codecs.InputCodec.encode_input>
decode_input() <mlserver.codecs.InputCodec.decode_input>
encode_output() <mlserver.codecs.InputCodec.encode_output>
decode_output() <mlserver.codecs.InputCodec.decode_output>
Note that, these methods can also be used as helpers to encode requests and decode responses on the client side. This can help to abstract away from the user most of the details about the underlying structure of V2-compatible payloads.
For example, in the example above, we could use codecs to encode the DataFrame into a V2-compatible request simply as:
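For instance, a minimal sketch using the Pandas request codec:

```python
import pandas as pd

from mlserver.codecs import PandasCodec

dataframe = pd.DataFrame({"First Name": ["Joanne", "Michael"], "Age": [34, 22]})

# Encode the DataFrame into a V2-compatible InferenceRequest
inference_request = PandasCodec.encode_request(dataframe)
print(inference_request)
```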
When using MLServer's request codecs, the output of encoding payloads will always be one of the classes within the `mlserver.types` package (i.e. `InferenceRequest <mlserver.types.InferenceRequest>` or `InferenceResponse <mlserver.types.InferenceResponse>`). Therefore, if you want to use them with `requests` (or another package outside of MLServer) you will need to convert them to a Python dict or a JSON string.
For example, if we want to send an inference request to model `foo`, we could do something along the following lines:
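A sketch of those lines, assuming a local MLServer instance on the default HTTP port:

```python
import pandas as pd
import requests

from mlserver.codecs import PandasCodec

dataframe = pd.DataFrame({"First Name": ["Joanne", "Michael"], "Age": [34, 22]})
inference_request = PandasCodec.encode_request(dataframe)

# Serialise the Pydantic model to JSON before sending it over HTTP
response = requests.post(
    "http://localhost:8080/v2/models/foo/infer",
    data=inference_request.model_dump_json(),
    headers={"Content-Type": "application/json"},
)
print(response.json())
```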
The NaN (Not a Number) value is used in Numpy and other scientific libraries to describe an invalid or missing value (e.g. a division by zero). In some scenarios, it may be desirable to let your models receive and / or output NaN values (e.g. these can be useful sometimes with GBTs, like XGBoost models). This is why MLServer supports encoding NaN values on your request / response payloads under some conditions.
In order to send / receive NaN values, you must ensure that:
- You are using the `REST` interface.
- The input / output entry containing NaN values uses either the `FP16`, `FP32` or `FP64` datatypes.

Assuming those conditions are satisfied, any `null` value within your tensor payload will be converted to NaN.
For example, if you take the following Numpy array:
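```python
import numpy as np

# An example array containing a NaN entry
foo = np.array([[1.0, 2.0], [3.0, np.nan]])
```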
We could encode it as:
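The exact values are illustrative; note the `null` entry standing in for NaN:

```json
{
  "inputs": [
    {
      "name": "foo",
      "datatype": "FP64",
      "shape": [2, 2],
      "data": [1.0, 2.0, 3.0, null]
    }
  ]
}
```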
It's important to keep in mind that content types passed explicitly as part of the request will always take precedence over the model's metadata. Therefore, we can leverage this to override the model's metadata when needed.
Out of the box, MLServer supports the following list of content types. However, this can be extended through the use of 3rd-party or custom runtimes.
The `np` content type will decode / encode V2 payloads to a NumPy Array, taking into account the following:

- The `shape` field will be used to reshape the flattened array expected by the V2 protocol into the expected tensor shape.

By default, MLServer will always assume that an array with a single-dimensional shape, e.g. `[N]`, is equivalent to `[N, 1]`. That is, each entry will be treated like a single one-dimensional data point (i.e. instead of a `[1, D]` array, where the full array is a single `D`-dimensional data point). To avoid any ambiguity, where possible, the Numpy codec will always explicitly encode `[N]` arrays as `[N, 1]`.
For example, if we think of the following NumPy Array:
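```python
import numpy as np

# An example 2x2 array
foo = np.array([[1, 2], [3, 4]])
```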
We could encode it as the input `foo` in a V2 protocol request as:
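```json
{
  "inputs": [
    {
      "name": "foo",
      "parameters": {"content_type": "np"},
      "datatype": "INT64",
      "shape": [2, 2],
      "data": [1, 2, 3, 4]
    }
  ]
}
```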
When using the NumPy Array content type at the request level, it will decode the entire request by considering only the first `input` element. This can be used as a helper for models which only expect a single tensor.
The `pd` content type can be stacked with other content types. This allows the user to use a different set of content types to decode each of the columns.
The `pd` content type will decode / encode a V2 request into a Pandas DataFrame. For this, it will expect that the DataFrame is shaped in a columnar way. That is:

- Each entry of the `inputs` list (or `outputs`, in the case of responses) will represent a column of the DataFrame.
- Each of these entries will contain all the row elements for that particular column.
- The `shape` field of each `input` (or `output`) entry will contain (at least) the amount of rows included in the dataframe.
For example, if we consider the following dataframe:
We could encode it to the V2 Inference Protocol as:
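Using the example `A` / `B` / `C` dataframe shown further below (rows `a1`–`c4`):

```json
{
  "parameters": {"content_type": "pd"},
  "inputs": [
    {
      "name": "A",
      "datatype": "BYTES",
      "shape": [4],
      "data": ["a1", "a2", "a3", "a4"]
    },
    {
      "name": "B",
      "datatype": "BYTES",
      "shape": [4],
      "data": ["b1", "b2", "b3", "b4"]
    },
    {
      "name": "C",
      "datatype": "BYTES",
      "shape": [4],
      "data": ["c1", "c2", "c3", "c4"]
    }
  ]
}
```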
The `str` content type lets you encode / decode a V2 input into a UTF-8 Python string, taking into account the following:

- The expected `datatype` is `BYTES`.
- The `shape` field represents the number of "strings" that are encoded in the payload (e.g. the `["hello world", "one more time"]` payload will have a shape of 2 elements).
For example, we could encode the list of strings `["hello world", "one more time"]` to the V2 Inference Protocol as:
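```json
{
  "parameters": {"content_type": "str"},
  "inputs": [
    {
      "name": "foo",
      "datatype": "BYTES",
      "shape": [2],
      "data": ["hello world", "one more time"]
    }
  ]
}
```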
When using the `str` content type at the request level, it will decode the entire request by considering only the first `input` element. This can be used as a helper for models which only expect a single string or a set of strings.
The `base64` content type will decode a binary V2 payload into a Base64-encoded string (and vice versa), taking into account the following:

- The expected `datatype` is `BYTES`.
- The `data` field should contain the base64-encoded binary strings.
- The `shape` field represents the number of binary strings that are encoded in the payload.
For example, if we think of the following "bytes array":
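```python
# An example "bytes array"
foo = b"Python is fun"
```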
We could encode it as the input `foo` of a V2 request as:
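Here the `data` entry is the Base64 encoding of the bytes above:

```json
{
  "inputs": [
    {
      "name": "foo",
      "parameters": {"content_type": "base64"},
      "datatype": "BYTES",
      "shape": [1],
      "data": ["UHl0aG9uIGlzIGZ1bg=="]
    }
  ]
}
```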
- The expected `datatype` is `BYTES`.
- The `shape` field represents the number of datetimes that are encoded in the payload.
For example, if we think of the following `datetime` object:
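```python
import datetime

# An example datetime object
foo = datetime.datetime(2022, 8, 2, 9, 30)
```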
We could encode it as the input `foo` of a V2 request as:
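```json
{
  "inputs": [
    {
      "name": "foo",
      "parameters": {"content_type": "datetime"},
      "datatype": "BYTES",
      "shape": [1],
      "data": ["2022-08-02T09:30:00"]
    }
  ]
}
```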
This package provides an MLServer runtime compatible with CatBoost's `CatboostClassifier`.

You can install the runtime, alongside `mlserver`, as:
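```bash
pip install mlserver mlserver-catboost
```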
For further information on how to use MLServer with CatBoost, you can check out this worked-out example.

If no content type is present on the request or metadata, the CatBoost runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.
This package provides an MLServer runtime compatible with HuggingFace Transformers.

You can install the runtime, alongside `mlserver`, as:
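```bash
pip install mlserver mlserver-huggingface
```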
For further information on how to use MLServer with HuggingFace, you can check out this worked-out example.

The HuggingFace runtime will always decode the input request using its own built-in codec. Therefore, content type annotations at the request level will be ignored. Note that this doesn't include input-level annotations, which will be respected as usual.
The HuggingFace runtime exposes a couple of extra parameters which can be used to customise how the runtime behaves. These settings can be added under the `parameters.extra` section of your `model-settings.json` file, e.g.
It is possible to load a local model into a HuggingFace pipeline by specifying the model artefact folder path in `parameters.uri` in `model-settings.json`.

Models in the HuggingFace hub can be loaded by specifying their name in `parameters.extra.pretrained_model` in `model-settings.json`.
You can find the full reference of the accepted extra settings for the HuggingFace runtime below:
The `MLModel` class is the base class for all inference runtimes. It exposes the main interface that MLServer will use to interact with ML models.
The bulk of its public interface are the {func}`load() <mlserver.MLModel.load>`, {func}`unload() <mlserver.MLModel.unload>` and {func}`predict() <mlserver.MLModel.predict>` methods. However, it also contains helpers with encoding / decoding of requests and responses, as well as properties to access the most common bits of the model's metadata.
When writing custom runtimes, this class should be extended to implement your own load and predict logic.

There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.

To learn more about how you can write custom runtimes with MLServer, check out the custom runtimes user guide. Alternatively, you can also see this end-to-end example which walks through the process of writing a custom runtime.
To see MLServer in action you can check out the examples below. These are end-to-end notebooks, showing how to serve models with MLServer.
If you are interested in how MLServer interacts with particular model frameworks, you can check the following examples. These focus on showcasing the different inference runtimes that ship with MLServer out of the box. Note that, for advanced use cases, you can also write your own custom inference runtime (see the custom runtimes section).
To see some of the advanced features included in MLServer (e.g. multi-model serving), check out the examples below.
Tutorials are designed to be beginner-friendly and walk through accomplishing a series of tasks using MLServer (and other tools).
MLServer can be configured through a `settings.json` file on the root folder from where MLServer is started. Note that these are server-wide settings (e.g. gRPC or HTTP port) which are separate from the per-model settings. Alternatively, this configuration can also be passed through environment variables prefixed with `MLSERVER_` (e.g. `MLSERVER_GRPC_PORT`).
Framework | MLServer Runtime | KServe Serving Runtime | Documentation |
---|---|---|---|
Framework | Supported | Documentation |
---|---|---|
Python Version | Status |
---|---|
Framework | Package Name | Implementation Class | Example | Documentation |
---|---|---|---|---|
- Our inference deployment should use the V2 inference protocol, which is done by setting the `protocol` field to `kfserving`.
- Our model artifact is a serialised Scikit-Learn model; therefore, it should be served using the pre-packaged SKLearn server, which is done by setting the `implementation` field to `SKLEARN_SERVER`.

Note that, while the `protocol` should always be set to `kfserving` (i.e. so that models are served using the V2 inference protocol), the value of the `implementation` field will be dependent on your ML framework. The valid values of the `implementation` field are pre-determined by Seldon Core. However, it should also be possible to configure and add new ones (e.g. to support a custom inference runtime).
Framework | MLServer Runtime | Seldon Core Pre-packaged Server | Documentation |
---|---|---|---|
Note that, on top of the ones shown above (backed by MLServer), Seldon Core also provides a wider set of pre-packaged servers. To check the full list, please visit the Seldon Core documentation.

There could be cases where the pre-packaged MLServer runtimes supported out-of-the-box in Seldon Core may not be enough for our use case. The framework provided by MLServer makes it easy to write custom runtimes, which can then get packaged up as images. These images then become self-contained model servers with your custom runtime. Therefore, Seldon Core makes it just as easy to deploy them into your serving infrastructure.
- Letting Seldon Core know that the model deployment will be served through the V2 inference protocol, by setting the `protocol` field to `v2`.
If no content type is present on the request or metadata, the XGBoost runtime will try to decode the payload as a NumPy Array. To avoid this, either send a different content type explicitly, or define the correct one as part of your model's metadata.

Output | Returned By Default | Availability |
---|---|---|
Some inference runtimes may apply a content type by default if none is present. To learn more about each runtime's defaults, please check the docs of the relevant inference runtime.
First Name | Age |
---|---|
To learn more about the available content types and how to use them, you can see all the available ones in the section below.
For a full end-to-end example on how content types and codecs work under the hood, feel free to check out this worked-out example.
Luckily, these classes leverage Pydantic under the hood. Therefore you can just call the `.model_dump()` or `.model_dump_json()` method to convert them. Likewise, to read them back from JSON, we can always pass the JSON fields as kwargs to the class' constructor (or use any of the helpers available within Pydantic).
You are either using the NumPy codec or the Pandas codec.
Content types can also be defined as part of the model's metadata. This lets the user pre-configure what content types a model should use by default to decode / encode its requests / responses, without the need to specify it on each request.

For example, to configure the content type values of the dataframe example above, one could create a `model-settings.json` file like the one below:
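A sketch of such a file for the First Name / Age dataframe (field values are illustrative):

```json
{
  "parameters": {
    "content_type": "pd"
  },
  "inputs": [
    {
      "name": "First Name",
      "datatype": "BYTES",
      "shape": [-1],
      "parameters": {"content_type": "str"}
    },
    {
      "name": "Age",
      "datatype": "INT32",
      "shape": [-1]
    }
  ]
}
```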
Python Type | Content Type | Request Level | Request Codec | Input Level | Input Codec |
---|---|---|---|---|---|
MLServer allows you to extend the supported content types by adding custom ones. To learn more about how to write your own custom content types, you can check this example. You can also learn more about building custom extensions for MLServer on the custom runtimes section of the docs.
The V2 Inference Protocol expects that the `data` of each input is sent as a flat array. Therefore, the `np` content type will expect that tensors are sent flattened. The information in the `shape` field will then be used to reshape the vector into the right dimensions.

The `datatype` field will be matched to the closest NumPy dtype.
A | B | C |
---|---|---|
The `datetime` content type will decode a V2 input into a Python `datetime.datetime` object, taking into account the following:

The `data` field should contain the dates serialised following the ISO 8601 standard.
Scikit-Learn | ✅ |
XGBoost | ✅ |
Spark MLlib | ✅ |
LightGBM | ✅ |
CatBoost | ✅ |
Tempo | ✅ |
MLflow | ✅ |
Alibi-Detect | ✅ |
Alibi-Explain | ✅ |
HuggingFace | ✅ |
3.7 | 🔴 |
3.8 | 🔴 |
3.9 | 🟢 |
3.10 | 🟢 |
3.11 | 🔵 |
3.12 | 🔵 |
Scikit-Learn | sklearn |
XGBoost | xgboost |
| predict | ✅ | Available on all XGBoost models. |
| predict_proba | ❌ | Only available on non-regressor models (i.e. `XGBClassifier` models). |
Joanne | 34 |
Michael | 22 |
a1 | b1 | c1 |
a2 | b2 | c2 |
a3 | b3 | c3 |
a4 | b4 | c4 |
Out of the box, `mlserver` supports the deployment and serving of `scikit-learn` models. By default, it will assume that these models have been serialised using `joblib`.

In this example, we will cover how we can train and serialise a simple model, to then serve it using `mlserver`.
The first step will be to train a simple `scikit-learn` model. For that, we will use the MNIST example from the `scikit-learn` documentation, which trains an SVM model.

To save our trained model, we will serialise it using `joblib`. While this is not a perfect approach, it's currently the recommended method to persist models to disk in the `scikit-learn` documentation.

Our model will be persisted as a file named `mnist-svm.joblib`.
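A condensed sketch of both the training and serialisation steps described above, based on scikit-learn's digits example:

```python
import joblib
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Load the digits dataset and flatten the images into feature vectors
digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))

X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.5, shuffle=False
)

# Train a simple SVM classifier
classifier = svm.SVC(gamma=0.001)
classifier.fit(X_train, y_train)

# Persist the trained model to disk with joblib
joblib.dump(classifier, "mnist-svm.joblib")
```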
Now that we have trained and saved our model, the next step will be to serve it using mlserver
. For that, we will need to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
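A minimal version of this file could simply enable debug logging:

```json
{
  "debug": true
}
```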
model-settings.json
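A sketch of this file, assuming the artefact saved above (the version label is optional):

```json
{
  "name": "mnist-svm",
  "implementation": "mlserver_sklearn.SKLearnModel",
  "parameters": {
    "uri": "./mnist-svm.joblib",
    "version": "v0.1.0"
  }
}
```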
Now that we have our config in place, we can start the server by running `mlserver start .`. This needs to either be run from the same directory where our config files are or point to the folder where they are.

Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.
We now have our model being served by `mlserver`. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that `mlserver` provides out of the box, or we can build our request manually.
As we can see above, the model predicted the input as the number 8
, which matches what's on the test set.
Out of the box, `mlserver` supports the deployment and serving of `lightgbm` models. By default, it will assume that these models have been serialised using the `bst.save_model()` method.

In this example, we will cover how we can train and serialise a simple model, to then serve it using `mlserver`.
To test the LightGBM Server, first we need to generate a simple LightGBM model using Python.
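A sketch of such a training script, using the Iris dataset referenced by the artifact name below:

```python
import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset and split it into train / test sets
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)

# Train a simple multiclass LightGBM model
dtrain = lgb.Dataset(X_train, label=y_train)
params = {"objective": "multiclass", "num_class": 3, "verbose": -1}
bst = lgb.train(params, dtrain)

# Persist the booster to disk
bst.save_model("iris-lightgbm.bst")
```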
Our model will be persisted as a file named `iris-lightgbm.bst`.
Now that we have trained and saved our model, the next step will be to serve it using mlserver
. For that, we will need to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running `mlserver start .`. This needs to either be run from the same directory where our config files are or point to the folder where they are.

Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.
We now have our model being served by `mlserver`. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that `mlserver` provides out of the box, or we can build our request manually.
As we can see above, the model predicted the probability for each class, and the probability of class `1` is the biggest, close to `0.99`, which matches what's on the test set.
Out of the box, mlserver
supports the deployment and serving of alibi_detect models. Alibi Detect is an open source Python library focused on outlier, adversarial and drift detection. In this example, we will cover how we can create a detector configuration to then serve it using mlserver
.
The first step will be to fetch a reference dataset and other relevant metadata for an `alibi-detect` model. For that, we will use the alibi library to get the adult dataset with demographic features from a 1996 US census.
This example is based on the Categorical and mixed type data drift detection on income prediction tabular data from the alibi-detect documentation.
Now that we have the reference data and other configuration parameters, the next step will be to serve it using mlserver
. For that, we will need to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running the `mlserver start .` command. This needs to either be run from the same directory where our config files are or point to the folder where they are.

Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.
We now have our alibi-detect model being served by `mlserver`. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that `mlserver` provides out of the box, or we can build our request manually.
The `mlserver` package comes with inference runtime implementations for `scikit-learn` and `xgboost` models. However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.
In this example, we will train a `numpyro` model. The `numpyro` library streamlines the implementation of probabilistic models, abstracting away advanced inference and training algorithms.

Out of the box, `mlserver` doesn't provide an inference runtime for `numpyro`. However, through this example we will see how easy it is to develop our own.
The first step will be to train our model. This will be a very simple bayesian regression model, based on an example provided in the numpyro
docs.
Since this is a probabilistic model, during training we will compute an approximation to the posterior distribution of our model using MCMC.
Now that we have trained our model, the next step will be to save it so that it can be loaded afterwards at serving-time. Note that, since this is a probabilistic model, we will only need to save the traces that approximate the posterior distribution over latent parameters. These will get saved in a `numpyro-divorce.json` file.
The next step will be to serve our model using `mlserver`. For that, we will first implement an extension which serves as the runtime to perform inference using our custom `numpyro` model.

Our custom inference wrapper should be responsible for:

- Loading the model from the set of samples we saved previously.
- Running inference using our model structure, and the posterior approximated from the samples.
The next step will be to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running `mlserver start .`. This needs to either be run from the same directory where our config files are or point to the folder where they are.

Since this command will start the server and block the terminal, waiting for requests, this will need to be run in the background on a separate terminal.
We now have our model being served by `mlserver`. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that `mlserver` provides out of the box, or we can build our request manually.
Now that we have written and tested our custom model, the next step is to deploy it. With that goal in mind, the rough outline of steps will be to first build a custom image containing our code, and then deploy it.
MLServer will automatically find your `requirements.txt` file and install the necessary Python packages.
MLServer offers helpers to build a custom Docker image containing your code. In this example, we will use the mlserver build
subcommand to create an image, which we'll be able to deploy later.
Note that this section expects that Docker is available and running in the background, as well as a functional cluster with Seldon Core installed and some familiarity with kubectl
.
To ensure that the image is fully functional, we can spin up a container and then send a test request. To start the container, you can run something along the following lines in a separate terminal:
As we should be able to see, the server running within our Docker image responds as expected.
Now that we've built a custom image and verified that it works as expected, we can move to the next step and deploy it. There is a large number of tools out there to deploy images. However, for our example, we will focus on deploying it to a cluster running Seldon Core.
For that, we will need to create a SeldonDeployment
resource which instructs Seldon Core to deploy a model embedded within our custom image and compliant with the V2 Inference Protocol. This can be achieved by applying (i.e. kubectl apply
) a SeldonDeployment
manifest to the cluster, similar to the one below:
The MLServer package exposes a set of methods that let you register and track custom metrics. This can be used within your own custom inference runtimes. To learn more about how to expose custom metrics, check out the metrics usage guide.
Out of the box, MLServer supports the deployment and serving of HuggingFace Transformer models with the following features:
Loading of Transformer Model artifacts from the Hugging Face Hub.
Model quantization & optimization using the Hugging Face Optimum library
Request batching for GPU optimization (via adaptive batching and request batching)
In this example, we will showcase some of these features using an example model.
Since we're using a pretrained model, we can skip straight to serving.
model-settings.json
Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
We can also leverage the Optimum library that allows us to access quantized and optimized models.
We can download pretrained optimized models from the hub if available by enabling the optimum_model
flag:
Once again, we can run the model using the MLServer CLI. As before, this needs to either be run from the same directory where our config files are, or point to the folder where they are.
The request can now be sent using the same request structure but using optimized models for better performance.
Beyond text generation, many other Transformer tasks are supported; below are examples for a few of them.
Once again, you are able to run the model using the MLServer CLI.
Once again, you are able to run the model using the MLServer CLI.
We can also evaluate GPU acceleration by comparing inference speed on CPU vs GPU using the following parameters.
We first measure the time taken with device=-1, which configures the model to run on CPU by default.
Once again, you are able to run the model using the MLServer CLI.
We can see that it takes 81 seconds, which is 8 times longer than the GPU example below.
IMPORTANT: Running the code below requires a machine with a GPU correctly configured for TensorFlow/PyTorch.
Now we'll run the benchmark with the GPU enabled, which we can do by setting device=0.
We can see that the elapsed time is 8 times less than the CPU version!
We can also see how the adaptive batching capabilities allow for GPU acceleration by grouping multiple incoming requests so they get processed as a single GPU batch.
In our case, we can enable adaptive batching with the max_batch_size parameter, which we will set to 128.
We will also configure max_batch_time, which specifies the maximum amount of time the MLServer orchestrator will wait before sending the accumulated batch for inference.
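As a hedged sketch, a model-settings.json enabling adaptive batching could look roughly like the one generated below; the model name, task and device values are assumptions for illustration:

```python
import json

model_settings = {
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    # Group up to 128 incoming requests into a single batch...
    "max_batch_size": 128,
    # ...waiting at most 100ms before dispatching whatever has accumulated
    "max_batch_time": 0.1,
    "parameters": {"extra": {"task": "text-generation", "device": 0}},
}

with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)
```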
In order to achieve the throughput required of 50 requests per second, we will use the tool vegeta
which performs load testing.
We can now see that the requests are batched and that we receive a 100% success rate, even though the requests are sent one by one.
Out of the box, mlserver
supports the deployment and serving of xgboost
models. By default, it will assume that these models have been serialised using the bst.save_model()
method.
In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver
.
The first step will be to train a simple xgboost
model. For that, we will use the mushrooms example from the xgboost
Getting Started guide.
To save our trained model, we will serialise it using bst.save_model() and the JSON format. This is the approach recommended by the XGBoost project.
Our model will be persisted as a file named mushroom-xgboost.json.
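A minimal sketch of this training and serialisation step is shown below; the file path and hyperparameters are illustrative (the original example uses the mushroom / agaricus dataset shipped with the XGBoost repository):

```python
import xgboost as xgb

# Paths and hyperparameters are illustrative
dtrain = xgb.DMatrix("agaricus.txt.train?format=libsvm")
params = {"max_depth": 2, "eta": 1, "objective": "binary:logistic"}
bst = xgb.train(params, dtrain, num_boost_round=2)

# Persist the booster using the JSON format
bst.save_model("mushroom-xgboost.json")
```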
Now that we have trained and saved our model, the next step will be to serve it using mlserver
. For that, we will need to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
As we can see above, the model predicted the input as close to 0
, which matches what's on the test set.
Codecs are used to encapsulate the logic required to encode / decode payloads following the Open Inference Protocol into high-level Python types. You can read more about the high-level concepts behind codecs, as well as how to use them, in the relevant section of the docs.
All the codecs within MLServer extend from either the {class}InputCodec <mlserver.codecs.base.InputCodec> or the {class}RequestCodec <mlserver.codecs.base.RequestCodec> base classes. These define the interfaces to deal with inputs (and outputs) and requests (and responses), respectively.
The mlserver package includes a set of built-in codecs to cover common conversions. You can learn more about these in the relevant section of the docs.
In MLServer, each loaded model can be configured separately. This configuration will include model information (e.g. metadata about the accepted inputs), but also model-specific settings (e.g. number of parallel workers to run inference).
This configuration will usually be provided through a model-settings.json
file which sits next to the model artifacts. However, it's also possible to provide this through environment variables prefixed with MLSERVER_MODEL_ (e.g. MLSERVER_MODEL_IMPLEMENTATION). Note that, in the latter case, these environment variables will be shared across all loaded models (unless they get overridden by a model-settings.json file). Additionally, if no model-settings.json file is found, MLServer will also try to load a "default" model from these environment variables.
Out of the box, MLServer supports the deployment and serving of MLflow models with the following features:
Loading of MLflow Model artifacts.
Support of dataframes, dict-of-tensors and tensor inputs.
In this example, we will showcase some of these features using an example model.
The first step will be to train and serialise an MLflow model. For that, we will use the linear regression example from the MLflow docs.
The training script will also serialise our trained model, leveraging the MLflow Model format. By default, we should be able to find the saved artifact under the mlruns
folder.
Now that we have trained and serialised our model, we are ready to start serving it. For that, the initial step will be to set up a model-settings.json
that instructs MLServer to load our artifact using the MLflow Inference Runtime.
Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Note that the request specifies the value pd as its content type, whereas every input specifies the content type np. These parameters will instruct MLServer to:
Convert every input value to a NumPy array, using the data type and shape information provided.
Group all the different inputs into a Pandas DataFrame, using their names as the column names.
To learn more about how MLServer uses content type parameters, you can check this worked out example.
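As an illustration, a request shaped along those lines could be built and sent as follows; the column names and values shown are only a small, placeholder subset of the wine features the model actually expects:

```python
import requests

# Request-level content type "pd" + per-input content type "np"
# (column names and values are illustrative placeholders)
inference_request = {
    "parameters": {"content_type": "pd"},
    "inputs": [
        {
            "name": "fixed acidity",
            "datatype": "FP32",
            "shape": [1],
            "data": [7.4],
            "parameters": {"content_type": "np"},
        },
        {
            "name": "alcohol",
            "datatype": "FP32",
            "shape": [1],
            "data": [9.4],
            "parameters": {"content_type": "np"},
        },
    ],
}

endpoint = "http://localhost:8080/v2/models/wine-classifier/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json())
```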
As we can see in the output above, the predicted quality score for our input wine was 5.57
.
MLflow currently ships with a scoring server with its own protocol. In order to provide a drop-in replacement, the MLflow runtime in MLServer also exposes a custom endpoint which matches the signature of MLflow's /invocations endpoint.
As an example, we can try to send the same request that we sent previously, but using MLflow's protocol. Note that, in both cases, the request will be handled by the same MLServer instance.
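A sketch of such a call is shown below; it uses the dataframe_split payload format from MLflow's scoring protocol, and the columns listed are an illustrative subset (the real wine model expects the full set of feature columns):

```python
import requests

# Illustrative payload following MLflow's scoring protocol
payload = {
    "dataframe_split": {
        "columns": ["fixed acidity", "alcohol"],
        "data": [[7.4, 9.4]],
    }
}

response = requests.post(
    "http://localhost:8080/invocations",
    json=payload,
    headers={"Content-Type": "application/json"},
)
print(response.json())
```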
As we can see above, the predicted quality for our input is 5.57
, matching the prediction we obtained above.
MLflow lets users define a model signature, where they can specify what types of inputs the model accepts and what types of outputs it returns. Similarly, the V2 inference protocol employed by MLServer defines a metadata endpoint which can be used to query what inputs and outputs the model accepts. However, even though they serve similar purposes, the data schemas used by each of them are not compatible with each other.
To solve this, if your model defines an MLflow model signature, MLServer will convert this signature on the fly into a metadata schema compatible with the V2 Inference Protocol. This will also include specifying any extra content types required to correctly decode / encode your data.
As an example, we can first have a look at the model signature saved for our MLflow model. This can be seen directly on the MLModel
file saved by our model.
We can then query the metadata endpoint, to see the model metadata inferred by MLServer from our test model's signature. For this, we will use the /v2/models/wine-classifier/
endpoint.
As we should be able to see, the model metadata now matches the information contained in our model signature, including any extra content types necessary to decode our data correctly.
It's not unusual for model runtimes to require extra dependencies that are not direct dependencies of MLServer. This is the case when we want to use custom runtimes, but also when our model artifacts are the output of older versions of a toolkit (e.g. models trained with an older version of SKLearn).
In these cases, since these dependencies (or dependency versions) are not known in advance by MLServer, they won't be included in the default seldonio/mlserver
Docker image. To cover these cases, the seldonio/mlserver
Docker image allows you to load custom environments before starting the server itself.
This example will walk you through how to create and save a custom environment, so that it can be loaded in MLServer without any extra change to the seldonio/mlserver Docker image.
For this example, we will create a custom environment to serve a model trained with an older version of Scikit-Learn. The first step will be to define this environment, using an environment.yml file.
Note that these environments can also be created on the fly as we go, and then serialised later.
To illustrate the point, we will train a Scikit-Learn model using our older environment.
The first step will be to create and activate an environment which reflects what's outlined in our environment.yml
file.
NOTE: If you are running this from a Jupyter Notebook, you will need to restart your Jupyter instance so that it runs from this environment.
We can now train and save a Scikit-Learn model using the older version of our environment. This model will be serialised as model.joblib
.
You can find more details of this process in the Scikit-Learn example.
Lastly, we will need to serialise our environment in the format expected by MLServer. To do that, we will use a tool called conda-pack.
This tool will save a portable version of our environment as a .tar.gz file, also known as a tarball.
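conda-pack can be driven from the command line or from Python; as a sketch (the environment name and output file are assumptions matching this example):

```python
import conda_pack

# Pack the "old-sklearn" environment into a portable tarball
conda_pack.pack(
    name="old-sklearn",
    output="old-sklearn.tar.gz",
    force=True,  # overwrite the tarball if it already exists
)
```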
Now that we have defined our environment (and we've got a sample artifact trained in that environment), we can move to serving our model.
To do that, we will first need to select the right runtime through a model-settings.json
config file.
We can then spin up our model, using our custom environment, leveraging MLServer's Docker image. Keep in mind that you will need Docker installed on your machine to run this example.
Our Docker command will need to take into account the following points:
Mount the example's folder as a volume so that it can be accessed from within the container.
Let MLServer know that our custom environment's tarball can be found as old-sklearn.tar.gz
.
Expose port 8080
so that we can send requests from the outside.
From the command line, this can be done using Docker's CLI as:
Note that we need to keep the server running in the background while we send requests. Therefore, it's best to run this command on a separate terminal session.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
The mlserver
package comes with built-in support for streaming data. This allows you to process data in real-time, without having to wait for the entire response to be available. It supports both REST and gRPC APIs.
In this example, we create a simple Identity Text Model
which simply splits the input text into words and returns them one by one. We will use this model to demonstrate how to stream the response from the server to the client. This particular example can provide a good starting point for building more complex streaming models such as the ones based on Large Language Models (LLMs) where streaming is an essential feature to hide the latency of the model.
The next step will be to serve our model using mlserver
. For that, we will first implement an extension that serves as the runtime to perform inference using our custom TextModel
.
This is a trivial model to demonstrate streaming support. The model simply splits the input text into words and returns them one by one. In this example we do the following (a sketch along these lines is shown right after this list):
split the text into words using the white space as the delimiter.
wait 0.5 seconds between each word to simulate a slow model.
return each word one by one.
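The sketch below illustrates what such a runtime could look like; it assumes the text arrives as the first input of the first (and only) request in the stream, and the input / output names are illustrative:

```python
import asyncio
from typing import AsyncIterator

from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class TextModel(MLModel):
    async def predict_stream(
        self, payloads: AsyncIterator[InferenceRequest]
    ) -> AsyncIterator[InferenceResponse]:
        # For this toy model we only expect a single request in the input stream
        payload = [req async for req in payloads][0]
        text = StringCodec.decode_input(payload.inputs[0])[0]

        # Emit one word at a time, simulating a slow model
        for word in text.split(" "):
            await asyncio.sleep(0.5)
            yield InferenceResponse(
                model_name=self.name,
                outputs=[StringCodec.encode_output("output", [word])],
            )
```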
As can be seen, the predict_stream method receives as input an AsyncIterator of InferenceRequest and returns an AsyncIterator of InferenceResponse. This definition covers all types of possible input-output combinations for streaming: unary-stream, stream-unary, stream-stream. It is up to the client and server to send/receive the appropriate number of requests/responses, which should be known a priori.
Note that although the unary-unary case can be covered by the predict_stream method as well, mlserver already covers it through the predict method.
One important limitation to keep in mind is that for the REST API, the client will not be able to send a stream of requests. The client will have to send a single request with the entire input text. The server will then stream the response back to the client. gRPC API, on the other hand, supports all types of streaming listed above.
The next step will be to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Note that the streaming support in MLServer currently has the following main limitations:
distributed workers are not supported (i.e., the parallel_workers
setting should be set to 0
)
gzip
middleware is not supported for REST (i.e., gzip_enabled
setting should be set to false
)
Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
To test our model, we will use the following inference request:
To send a REST streaming request to the server, we will use the following Python code:
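A sketch along these lines, assuming the REST stream is exposed as server-sent events and using the httpx and httpx-sse packages (the model name and port are also assumptions), could look as follows:

```python
import httpx
from httpx_sse import connect_sse

from mlserver import types
from mlserver.codecs import StringCodec

# Illustrative request: a single string input named "prompt"
inference_request = types.InferenceRequest(
    inputs=[StringCodec.encode_input("prompt", ["hello streaming world"], use_bytes=False)]
)

with httpx.Client() as client:
    with connect_sse(
        client,
        "POST",
        "http://localhost:8080/v2/models/text-model/infer_stream",
        json=inference_request.dict(),
    ) as event_source:
        # Each server-sent event carries one InferenceResponse
        for sse in event_source.iter_sse():
            response = types.InferenceResponse.parse_raw(sse.data)
            print(StringCodec.decode_output(response.outputs[0]))
```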
To send a gRPC streaming request to the server, we will use the following Python code:
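Again as a hedged sketch (the module paths under mlserver.grpc, the default gRPC port 8081 and the model name are assumptions), the gRPC client could look roughly like this:

```python
import asyncio

import grpc

import mlserver.grpc.converters as converters
import mlserver.grpc.dataplane_pb2_grpc as dataplane
from mlserver import types
from mlserver.codecs import StringCodec


async def infer_stream():
    inference_request = types.InferenceRequest(
        inputs=[StringCodec.encode_input("prompt", ["hello streaming world"], use_bytes=True)]
    )

    # The stub expects a stream of requests, so wrap the single request
    # in an async generator
    async def request_stream():
        yield converters.ModelInferRequestConverter.from_types(
            inference_request, model_name="text-model", model_version=None
        )

    async with grpc.aio.insecure_channel("localhost:8081") as channel:
        stub = dataplane.GRPCInferenceServiceStub(channel)
        async for raw_response in stub.ModelStreamInfer(request_stream()):
            response = converters.ModelInferResponseConverter.to_types(raw_response)
            print(StringCodec.decode_output(response.outputs[0]))


asyncio.run(infer_stream())
```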
Note that for gRPC, the request is transformed into an async generator which is then passed to the ModelStreamInfer
method. The response is also an async generator which can be iterated over to get the response.
The mlserver package comes with inference runtime implementations for scikit-learn and xgboost models. However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.
In this example, we create a simple Hello World JSON
model that parses and modifies a JSON data chunk. This is often useful as a means to quickly bootstrap existing models that utilize JSON based model inputs.
The next step will be to serve our model using mlserver. For that, we will first implement an extension which serves as the runtime to perform inference using our custom Hello World JSON model.
This is a trivial model to demonstrate how to conceptually work with JSON inputs / outputs. In this example, we:
Parse the JSON input from the client.
Create a JSON response echoing back the client request, as well as a server-generated message.
A sketch of such a runtime is shown right after this list.
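The sketch below assumes the client sends the JSON document as a single BYTES (or string) element in the first input; the class, input and output names are illustrative:

```python
import json

from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


class JsonHelloWorldModel(MLModel):
    async def load(self) -> bool:
        # Nothing to load for this toy model
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Parse the raw JSON sent by the client
        raw = payload.inputs[0].data[0]
        request_json = json.loads(raw)

        # Echo the request back, together with a server-generated message
        response_json = {
            "request": request_json,
            "server_response": "Got your request. Hello from the server.",
        }
        serialised = json.dumps(response_json).encode("UTF-8")

        return InferenceResponse(
            model_name=self.name,
            outputs=[
                ResponseOutput(
                    name="echo_response",
                    shape=[len(serialised)],
                    datatype="BYTES",
                    data=[serialised],
                )
            ],
        )
```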
The next step will be to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Utilizing string data with the gRPC interface can be a bit tricky, so we need to make sure that inputs and outputs are encoded and decoded correctly.
For simplicity, in this case we leverage the Python types that mlserver provides out of the box. Alternatively, the gRPC stubs can be regenerated directly from the V2 specification for use by both Python and non-Python clients.
Out of the box, MLServer provides support to receive inference requests from Kafka. The Kafka server can run side-by-side with the REST and gRPC ones, and adds a new interface to interact with your model. The inference responses coming back from your model will also get written back to their own output topic.
In this example, we will showcase the integration with Kafka by serving a Scikit-Learn model through Kafka.
We are going to start by running a simple local Docker deployment of Kafka that we can test against. This will be a minimal cluster consisting of a single ZooKeeper node and a single broker.
You need to have Java installed for it to work correctly.
You can then start it with the following command in a separate terminal:
Now we can create the input and output topics required.
The first step will be to train a simple scikit-learn
model. For that, we will use the MNIST example from the scikit-learn
documentation which trains an SVM model.
To save our trained model, we will serialise it using joblib
. While this is not a perfect approach, it's currently the recommended method to persist models to disk in the scikit-learn
documentation.
Our model will be persisted as a file named mnist-svm.joblib
Now that we have trained and saved our model, the next step will be to serve it using mlserver
. For that, we will need to create 2 configuration files:
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json
: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Note that the settings.json file will contain our Kafka configuration, including the address of the Kafka broker and the input / output topics that will be used for inference.
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Now that we have verified that our server is accepting REST requests, we will try to send a new inference request through Kafka. For this, we just need to send a request to the mlserver-input
topic (which is the default input topic):
Once the message has gone into the queue, the Kafka server running within MLServer should receive this message and run inference. The prediction output should then get posted into an output queue, which will be named mlserver-output
by default.
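As an illustration, the sketch below uses the kafka-python package (an assumption; any Kafka client would do) to publish a request and read the prediction back. The mlserver-model header used to route the message to a model and the 8x8-digit payload are also assumptions based on this example's MNIST model:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Illustrative payload: a single flattened 8x8 digit (zeros as a placeholder)
inference_request = {
    "inputs": [
        {"name": "predict", "datatype": "FP32", "shape": [1, 64], "data": [0.0] * 64}
    ]
}

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send(
    "mlserver-input",
    value=json.dumps(inference_request).encode("utf-8"),
    headers=[("mlserver-model", b"mnist-svm")],
)
producer.flush()

# Read the prediction back from the output topic
consumer = KafkaConsumer(
    "mlserver-output",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(json.loads(message.value))
    break
```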
As we can see above, the results of our inference request are now visible in the output Kafka topic.
MLServer extends the V2 inference protocol by adding support for a content_type
annotation. This annotation can be provided either through the model metadata parameters
, or through the input parameters
. By leveraging the content_type
annotation, we can provide the necessary information to MLServer so that it can decode the input payload from the "wire" V2 protocol to something meaningful to the model / user (e.g. a NumPy array).
This example will walk you through some examples which illustrate how this works, and how it can be extended.
To start with, we will write a dummy runtime which just prints the input, the decoded input and returns it. This will serve as a testbed to showcase how the content_type
support works.
Later on, we will extend this runtime by adding custom codecs that will decode our V2 payload to custom types.
As you can see above, this runtime will decode the incoming payloads by calling the self.decode() helper method. This method will check which content type is the right one for each input, in the following order:
Is there any content type defined in the inputs[].parameters.content_type
field within the request payload?
Is there any content type defined in the inputs[].parameters.content_type
field within the model metadata?
Is there any default content type that should be assumed?
In order to enable this runtime, we will also create a model-settings.json file. This file should be present in (or accessible from) the folder where we run the mlserver start . command.
Our initial step will be to decide the content type based on the incoming inputs[].parameters
field. For this, we will start our MLServer in the background (e.g. running mlserver start .
)
As you've probably already noticed, writing request payloads compliant with the V2 Inference Protocol requires a certain knowledge of both the V2 spec and the structure expected by each content type. To account for this and simplify usage, the MLServer package exposes a set of utilities which will help you interact with your models via the V2 protocol.
These helpers are mainly shaped as "codecs". That is, abstractions which know how to "encode" and "decode" arbitrary Python datatypes to and from the V2 Inference Protocol.
Generally, we recommend using the existing set of codecs to generate your V2 payloads. This will ensure that requests and responses follow the right structure, and should provide a more seamless experience.
Following on from our previous example, the same code could be rewritten using codecs as:
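A version along those lines might look like the sketch below; the input names and values are illustrative:

```python
import numpy as np

from mlserver.codecs import NumpyCodec, StringCodec
from mlserver.types import InferenceRequest

inference_request = InferenceRequest(
    inputs=[
        # Encode a NumPy array and a list of strings into V2-compatible inputs
        NumpyCodec.encode_input("foo", np.array([[1, 2], [3, 4]], dtype=np.int32)),
        StringCodec.encode_input("bar", ["first", "second"]),
    ]
)
print(inference_request)
```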
Note that the rewritten snippet now makes use of the built-in InferenceRequest
class, which represents a V2 inference request. On top of that, it also uses the NumpyCodec
and StringCodec
implementations, which know how to encode a Numpy array and a list of strings into V2-compatible request inputs.
Our next step will be to define the expected content type through the model metadata. This can be done by extending the model-settings.json
file, and adding a section on inputs.
After adding this metadata, we will re-start MLServer (e.g. mlserver start .
) and we will send a new request without any explicit parameters
.
As you should be able to see in the server logs, MLServer will cross-reference the input names against the model metadata to find the right content type.
There may be cases where a custom inference runtime may need to encode / decode to custom datatypes. As an example, we can think of computer vision models which may only operate with pillow
image objects.
In these scenarios, it's possible to extend the Codec interface to write our custom encoding logic. A Codec is simply an object which defines decode() and encode() methods. To illustrate how this would work, we will extend our custom runtime to add a custom PillowCodec.
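The exact codec interface has evolved across MLServer versions; the sketch below follows the InputCodec-style interface and should be read as an illustration rather than the exact code used in this example:

```python
import io

from PIL import Image

from mlserver.codecs import InputCodec, register_input_codec
from mlserver.types import RequestInput, ResponseOutput


@register_input_codec
class PillowCodec(InputCodec):
    ContentType = "img"
    DefaultMode = "L"

    @classmethod
    def can_encode(cls, payload) -> bool:
        return isinstance(payload, Image.Image)

    @classmethod
    def encode_output(cls, name: str, payload: Image.Image, **kwargs) -> ResponseOutput:
        # Serialise the image as PNG bytes
        byte_array = io.BytesIO()
        payload.save(byte_array, format="PNG")
        return ResponseOutput(
            name=name, shape=[1], datatype="BYTES", data=[byte_array.getvalue()]
        )

    @classmethod
    def encode_input(cls, name: str, payload: Image.Image, **kwargs) -> RequestInput:
        output = cls.encode_output(name, payload)
        return RequestInput(
            name=output.name,
            shape=output.shape,
            datatype=output.datatype,
            data=output.data,
        )

    @classmethod
    def decode_input(cls, request_input: RequestInput) -> Image.Image:
        if request_input.datatype == "BYTES":
            # Raw image bytes (e.g. an encoded PNG / JPEG)
            return Image.open(io.BytesIO(request_input.data[0]))

        # Otherwise, rebuild a greyscale image from a plain tensor of pixel
        # values, assuming shape == [height, width]
        height, width = request_input.shape
        return Image.frombytes(
            mode=cls.DefaultMode, size=(width, height), data=bytes(request_input.data)
        )
```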
We should now be able to restart our instance of MLServer (i.e. with the mlserver start .
command), to send a few test requests.
As you should be able to see in the MLServer logs, the server is now able to decode the payload into a Pillow image. This example also illustrates how Codec
objects can be compatible with multiple datatype
values (e.g. tensor and BYTES
in this case).
So far, we've seen how you can specify codecs so that they get applied at the input level. However, it is also possible to use request-wide codecs that aggregate multiple inputs to decode the payload. This is usually relevant for cases where the models expect a multi-column input type, like a Pandas DataFrame.
To illustrate this, we will first tweak our EchoRuntime
so that it prints the decoded contents at the request level.
We should now be able to restart our instance of MLServer (i.e. with the mlserver start .
command), to send a few test requests.
MLServer supports loading and unloading models dynamically from a models repository. This allows you to enable and disable the models accessible by MLServer on demand. This extension builds on top of the support for Multi-Model Serving, letting you change, at runtime, which models MLServer is currently serving.
The API to manage the model repository is modelled after Triton's Model Repository extension to the V2 Dataplane and is thus fully compatible with it.
This notebook will walk you through an example using the Model Repository API.
First of all, we will need to train some models. For that, we will re-use the models we trained previously in the Multi-Model Serving example. You can check the details on how they are trained following that notebook.
Next up, we will start our mlserver
inference server. Note that, by default, this will load all our models.
Now that we've got our inference server up and running, and serving 2 different models, we can start using the Model Repository API. To get us started, we will first list all available models in the repository.
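Assuming MLServer's default REST port, the repository can be driven with plain HTTP requests along the lines of the sketch below; the same endpoints cover the unload and load steps discussed next:

```python
import requests

base = "http://localhost:8080/v2/repository"

# List every model in the repository, together with its current state
index = requests.post(f"{base}/index", json={})
print(index.json())

# Unload a model from the inference server (it stays in the repository)...
requests.post(f"{base}/models/mushroom-xgboost/unload")

# ...and load it back again
requests.post(f"{base}/models/mushroom-xgboost/load")
```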
As we can see, the repository lists 2 models (i.e. mushroom-xgboost and mnist-svm). Note that the state for both is set to READY. This means that both models are loaded, and thus ready for inference.
Unloading the mushroom-xgboost model
We will now try to unload one of the 2 models, mushroom-xgboost. This will unload the model from the inference server, but will keep it available in our model repository.
If we now try to list the models available in our repository, we will see that the mushroom-xgboost
model is flagged as UNAVAILABLE
. This means that it's present in the repository but it's not loaded for inference.
Loading the mushroom-xgboost model back
We will now load our model back into our inference server.
If we now try to list the models again, we will see that our mushroom-xgboost
is back again, ready for inference.
[Table of supported frameworks: Scikit-Learn | XGBoost | Spark MLlib | LightGBM | CatBoost | MLflow | Alibi-Detect]
[Table of supported frameworks: Scikit-Learn | XGBoost | MLflow]
[Feature support matrix (✅ / ❌); note: only available on non-regressor models]
MLServer supports Pydantic V2.
MLServer supports streaming data to and from your models.
Streaming support is available for both the REST and gRPC servers:
For the REST server, streaming is limited to server streaming. This means that the client sends a single request to the server, and the server responds with a stream of data.
For the gRPC server, streaming is available for both client and server streaming. This means that the client sends a stream of data to the server, and the server responds with a stream of data.
See our docs and example for more details.
fix(ci): fix typo in CI name by @sakoush in https://github.com/SeldonIO/MLServer/pull/1623
Update CHANGELOG by @github-actions in https://github.com/SeldonIO/MLServer/pull/1624
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1634
Fix mlserver_huggingface settings device type by @geodavic in https://github.com/SeldonIO/MLServer/pull/1486
Update README.md w licensing clarification by @paulb-seldon in https://github.com/SeldonIO/MLServer/pull/1636
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1642
fix(ci): optimise disk space for GH workers by @sakoush in https://github.com/SeldonIO/MLServer/pull/1644
build: Update maintainers by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1659
fix: Missing f-string directives by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1677
build: Add Catboost runtime to Dependabot by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1689
Fix JSON input shapes by @ReveStobinson in https://github.com/SeldonIO/MLServer/pull/1679
build(deps): bump alibi-detect from 0.11.5 to 0.12.0 by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1702
build(deps): bump alibi from 0.9.5 to 0.9.6 by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1704
Docs correction - Updated README.md in mlflow to match column names order by @vivekk0903 in https://github.com/SeldonIO/MLServer/pull/1703
fix(runtimes): Remove unused Pydantic dependencies by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1725
test: Detect generate failures by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1729
build: Add granularity in types generation by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1749
Migrate to Pydantic v2 by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1748
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1753
Revert "build(deps): bump uvicorn from 0.28.0 to 0.29.0" by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1758
refactor(pydantic): Remaining migrations for deprecated functions by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1757
Fixed openapi dataplane.yaml by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1752
fix(pandas): Use Pydantic v2 compatible type by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1760
Fix Pandas codec decoding from numpy arrays by @lhnwrk in https://github.com/SeldonIO/MLServer/pull/1751
build: Bump versions for Read the Docs by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1761
docs: Remove quotes around local TOC by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1764
Spawn worker in custom environment by @lhnwrk in https://github.com/SeldonIO/MLServer/pull/1739
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1767
basic contributing guide on contributing and opening a PR by @bohemia420 in https://github.com/SeldonIO/MLServer/pull/1773
Inference streaming support by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1750
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1779
build: Lock GitHub runners' OS by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1765
Removed text-model form benchmarking by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1790
Bumped mlflow to 2.13.1 and gunicorn to 22.0.0 by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1791
Build(deps): Update to poetry version 1.8.3 in docker build by @sakoush in https://github.com/SeldonIO/MLServer/pull/1792
Bumped werkzeug to 3.0.3 by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1793
Docs streaming by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1789
Bump uvicorn 0.30.1 by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1795
Fixes for all-runtimes by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1794
Fix BaseSettings import for pydantic v2 by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1798
Bumped preflight version to 1.9.7 by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1797
build: Install dependencies only in Tox environments by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1785
Bumped to 1.6.0.dev2 by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1803
Fix CI/CD macos-huggingface by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1805
Fixed macos kafka CI by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1807
Update poetry lock by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1808
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1813
Fix/macos all runtimes by @RobertSamoilescu in https://github.com/SeldonIO/MLServer/pull/1823
fix: Update stale reviewer in licenses.yml workflow by @sakoush in https://github.com/SeldonIO/MLServer/pull/1824
ci: Merge changes from master to release branch by @sakoush in https://github.com/SeldonIO/MLServer/pull/1825
@paulb-seldon made their first contribution in https://github.com/SeldonIO/MLServer/pull/1636
@ReveStobinson made their first contribution in https://github.com/SeldonIO/MLServer/pull/1679
@vivekk0903 made their first contribution in https://github.com/SeldonIO/MLServer/pull/1703
@RobertSamoilescu made their first contribution in https://github.com/SeldonIO/MLServer/pull/1752
@lhnwrk made their first contribution in https://github.com/SeldonIO/MLServer/pull/1751
@bohemia420 made their first contribution in https://github.com/SeldonIO/MLServer/pull/1773
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.5.0...1.6.0
Update CHANGELOG by @github-actions in https://github.com/SeldonIO/MLServer/pull/1592
build: Migrate away from Node v16 actions by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1596
build: Bump version and improve release doc by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1602
build: Upgrade stale packages (fastapi, starlette, tensorflow, torch) by @sakoush in https://github.com/SeldonIO/MLServer/pull/1603
fix(ci): tests and security workflow fixes by @sakoush in https://github.com/SeldonIO/MLServer/pull/1608
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1612
fix(ci): Missing quote in CI test for all_runtimes by @sakoush in https://github.com/SeldonIO/MLServer/pull/1617
build(docker): Bump dependencies by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1618
docs: List supported Python versions by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1591
fix(ci): Have separate smaller tasks for release by @sakoush in https://github.com/SeldonIO/MLServer/pull/1619
We removed support for Python 3.8; check https://github.com/SeldonIO/MLServer/pull/1603 for more info. Docker images for mlserver are already using Python 3.10.
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.4.0...1.5.0
Free up some space for GH actions by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1282
Introduce tracing with OpenTelemetry by @vtaskow in https://github.com/SeldonIO/MLServer/pull/1281
Update release CI to use Poetry by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1283
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1284
Add support for white-box explainers to alibi-explain runtime by @ascillitoe in https://github.com/SeldonIO/MLServer/pull/1279
Update CHANGELOG by @github-actions in https://github.com/SeldonIO/MLServer/pull/1294
Fix build-wheels.sh error when copying to output path by @lc525 in https://github.com/SeldonIO/MLServer/pull/1286
Fix typo by @strickvl in https://github.com/SeldonIO/MLServer/pull/1289
feat(logging): Distinguish logs from different models by @vtaskow in https://github.com/SeldonIO/MLServer/pull/1302
Make sure we use our Response class by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1314
Adding Quick-Start Guide to docs by @ramonpzg in https://github.com/SeldonIO/MLServer/pull/1315
feat(logging): Provide JSON-formatted structured logging as option by @vtaskow in https://github.com/SeldonIO/MLServer/pull/1308
Bump in conda version and mamba solver by @dtpryce in https://github.com/SeldonIO/MLServer/pull/1298
feat(huggingface): Merge model settings by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1337
feat(huggingface): Load local artefacts in HuggingFace runtime by @vtaskow in https://github.com/SeldonIO/MLServer/pull/1319
Document and test behaviour around NaN by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1346
Address flakiness on 'mlserver build' tests by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1363
Bump Poetry and lockfiles by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1369
Bump Miniforge3 to 23.3.1 by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1372
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1373
Improved huggingface batch logic by @ajsalow in https://github.com/SeldonIO/MLServer/pull/1336
Add inference params support to MLFlow's custom invocation endpoint (… by @M4nouel in https://github.com/SeldonIO/MLServer/pull/1375
Increase build space for runtime builds by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1385
Fix minor typo in sklearn
README by @krishanbhasin-gc in https://github.com/SeldonIO/MLServer/pull/1402
Add catboost classifier support by @krishanbhasin-gc in https://github.com/SeldonIO/MLServer/pull/1403
added model_kwargs to huggingface model by @nanbo-liu in https://github.com/SeldonIO/MLServer/pull/1417
Re-generate License Info by @github-actions in https://github.com/SeldonIO/MLServer/pull/1456
Local response cache implementation by @SachinVarghese in https://github.com/SeldonIO/MLServer/pull/1440
fix link to custom runtimes by @kretes in https://github.com/SeldonIO/MLServer/pull/1467
Improve typing on Environment
class by @krishanbhasin-gc in https://github.com/SeldonIO/MLServer/pull/1469
build(dependabot): Change reviewers by @jesse-c in https://github.com/SeldonIO/MLServer/pull/1548
MLServer changes from internal fork - deps and CI updates by @sakoush in https://github.com/SeldonIO/MLServer/pull/1588
@vtaskow made their first contribution in https://github.com/SeldonIO/MLServer/pull/1281
@lc525 made their first contribution in https://github.com/SeldonIO/MLServer/pull/1286
@strickvl made their first contribution in https://github.com/SeldonIO/MLServer/pull/1289
@ramonpzg made their first contribution in https://github.com/SeldonIO/MLServer/pull/1315
@jesse-c made their first contribution in https://github.com/SeldonIO/MLServer/pull/1337
@ajsalow made their first contribution in https://github.com/SeldonIO/MLServer/pull/1336
@M4nouel made their first contribution in https://github.com/SeldonIO/MLServer/pull/1375
@nanbo-liu made their first contribution in https://github.com/SeldonIO/MLServer/pull/1417
@kretes made their first contribution in https://github.com/SeldonIO/MLServer/pull/1467
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.3.5...1.4.0
Rename HF codec to hf
by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1268
Publish is_drift metric to Prom by @joshsgoldstein in https://github.com/SeldonIO/MLServer/pull/1263
@joshsgoldstein made their first contribution in https://github.com/SeldonIO/MLServer/pull/1263
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.3.4...1.3.5
Silent logging by @dtpryce in https://github.com/SeldonIO/MLServer/pull/1230
Fix mlserver infer
with BYTES
by @RafalSkolasinski in https://github.com/SeldonIO/MLServer/pull/1213
@dtpryce made their first contribution in https://github.com/SeldonIO/MLServer/pull/1230
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.3.3...1.3.4
Add default LD_LIBRARY_PATH env var by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1120
Adding cassava tutorial (mlserver + seldon core) by @edshee in https://github.com/SeldonIO/MLServer/pull/1156
Add docs around converting to / from JSON by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1165
Document SKLearn available outputs by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1167
Fix minor typo in alibi-explain
tests by @ascillitoe in https://github.com/SeldonIO/MLServer/pull/1170
Add support for .ubj
models and improve XGBoost docs by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1168
Fix content type annotations for pandas codecs by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1162
Added option to configure the grpc histogram by @cristiancl25 in https://github.com/SeldonIO/MLServer/pull/1143
Add OS classifiers to project's metadata by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1171
Don't use qsize
for parallel worker queue by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1169
Fix small typo in Python API docs by @krishanbhasin-gc in https://github.com/SeldonIO/MLServer/pull/1174
Fix star import in mlserver.codecs.*
by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1172
@cristiancl25 made their first contribution in https://github.com/SeldonIO/MLServer/pull/1143
@krishanbhasin-gc made their first contribution in https://github.com/SeldonIO/MLServer/pull/1174
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.3.2...1.3.3
Use default initialiser if not using a custom env by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1104
Add support for online drift detectors by @ascillitoe in https://github.com/SeldonIO/MLServer/pull/1108
added intera and inter op parallelism parameters to the hugggingface … by @saeid93 in https://github.com/SeldonIO/MLServer/pull/1081
Fix settings reference in runtime docs by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1109
Bump Alibi libs requirements by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1121
Add default LD_LIBRARY_PATH env var by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1120
Ignore both .metrics and .envs folders by @adriangonz in https://github.com/SeldonIO/MLServer/pull/1132
@ascillitoe made their first contribution in https://github.com/SeldonIO/MLServer/pull/1108
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.3.1...1.3.2
Move OpenAPI schemas into Python package (#1095)
More often than not, your custom runtimes will depend on external 3rd party dependencies which are not included within the main MLServer package - or different versions of the same package (e.g. scikit-learn==1.1.0 vs scikit-learn==1.2.0). In these cases, to load your custom runtime, MLServer will need access to these dependencies.
In MLServer 1.3.0
, it is now possible to load this custom set of dependencies by providing them, through an environment tarball, whose path can be specified within your model-settings.json
file. This custom environment will get provisioned on the fly after loading a model - alongside the default environment and any other custom environments.
Under the hood, each of these environments will run their own separate pool of workers.
The MLServer framework now includes a simple interface that allows you to register and keep track of any custom metrics:
[mlserver.register()](https://mlserver.readthedocs.io/en/latest/reference/api/metrics.html#mlserver.register)
: Register a new metric.
[mlserver.log()](https://mlserver.readthedocs.io/en/latest/reference/api/metrics.html#mlserver.log)
: Log a new set of metric / value pairs.
Custom metrics will generally be registered in the [load()](https://mlserver.readthedocs.io/en/latest/reference/api/model.html#mlserver.MLModel.load)
method and then used in the [predict()](https://mlserver.readthedocs.io/en/latest/reference/api/model.html#mlserver.MLModel.predict)
method of your custom runtime. These metrics can then be polled and queried via Prometheus.
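As a brief sketch (the metric name and value are illustrative):

```python
import mlserver
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Register the custom metric once, at load time
        mlserver.register("my_custom_metric", "This is a custom metric example")
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Log a new value for the metric on every inference call
        mlserver.log(my_custom_metric=34)
        # ...run inference as usual and return an InferenceResponse...
        return InferenceResponse(model_name=self.name, outputs=[])
```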
MLServer 1.3.0
now includes an autogenerated Swagger UI which can be used to interact dynamically with the Open Inference Protocol.
The autogenerated Swagger UI can be accessed under the /v2/docs
endpoint.
Alongside the general API documentation, MLServer also exposes now a set of API docs tailored to individual models, showing the specific endpoints available for each one.
The model-specific autogenerated Swagger UI can be accessed under the following endpoints:
/v2/models/{model_name}/docs
/v2/models/{model_name}/versions/{model_version}/docs
MLServer now includes improved Codec support for all the main types that can be returned by HuggingFace models - ensuring that the values returned via the Open Inference Protocol are more semantic and meaningful.
Massive thanks to @pepesi for taking the lead on improving the HuggingFace runtime!
Internally, MLServer leverages a Model Repository implementation which is used to discover and find different models (and their versions) available to load. The latest version of MLServer will now allow you to swap this for your own model repository implementation - letting you integrate against your own model repository workflows.
This is exposed via the model_repository_implementation flag of your settings.json
configuration file.
Thanks to @jgallardorama (aka @jgallardorama-itx ) for his effort contributing this feature!
MLServer 1.3.0
introduces a new set of metrics to increase visibility around two of its internal queues:
Adaptive batching queue: used to accumulate request batches on the fly.
Parallel inference queue: used to send over requests to the inference worker pool.
Many thanks to @alvarorsant for taking the time to implement this highly requested feature!
The latest version of MLServer includes a few optimisations around image size, which help reduce the size of the official set of images by more than ~60% - making them more convenient to use and integrate within your workloads. In the case of the full seldonio/mlserver:1.3.0
image (including all runtimes and dependencies), this means going from 10GB down to ~3GB.
Alongside its built-in inference runtimes, MLServer also exposes a Python framework that you can use to extend MLServer and write your own codecs and inference runtimes. The MLServer official docs now include a reference page documenting the main components of this framework in more detail.
@rio made their first contribution in https://github.com/SeldonIO/MLServer/pull/864
@pepesi made their first contribution in https://github.com/SeldonIO/MLServer/pull/692
@jgallardorama made their first contribution in https://github.com/SeldonIO/MLServer/pull/849
@alvarorsant made their first contribution in https://github.com/SeldonIO/MLServer/pull/860
@gawsoftpl made their first contribution in https://github.com/SeldonIO/MLServer/pull/950
@stephen37 made their first contribution in https://github.com/SeldonIO/MLServer/pull/1033
@sauerburger made their first contribution in https://github.com/SeldonIO/MLServer/pull/1064
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.2.3...1.2.4
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.2.2...1.2.3
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.2.1...1.2.2
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.2.0...1.2.1
MLServer now exposes an alternative “simplified” interface which can be used to write custom runtimes. This interface can be enabled by decorating your predict() method with the mlserver.codecs.decode_args
decorator, and it lets you specify in the method signature both how you want your request payload to be decoded and how to encode the response back.
Based on the information provided in the method signature, MLServer will automatically decode the request payload into the different inputs specified as keyword arguments. Under the hood, this is implemented through MLServer’s codecs and content types system.
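For illustration, a custom runtime using this simplified interface might look like the following sketch (the input names and logic are placeholders):

```python
from typing import List

import numpy as np

from mlserver import MLModel
from mlserver.codecs import decode_args


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Load any weights / artifacts here
        return True

    @decode_args
    async def predict(self, questions: List[str], context: np.ndarray) -> np.ndarray:
        # MLServer decodes `questions` into a list of strings and `context` into
        # a NumPy array based on the type hints, and encodes the returned NumPy
        # array back into a V2-compliant response
        return np.array([len(q) for q in questions])
```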
To make it easier to write your own custom runtimes, MLServer now ships with a mlserver init
command that will generate a templated project. This project will include a skeleton with folders, unit tests, Dockerfiles, etc. for you to fill.
MLServer now lets you load custom runtimes dynamically into a running instance of MLServer. Once you have your custom runtime ready, all you need to do is to move it to your model folder, next to your model-settings.json
configuration file.
For example, if we assume a flat model repository where each folder represents a model, you would end up with a folder structure like the one below:
This release of MLServer introduces a new mlserver infer command, which will let you run inference over a large batch of input data on the client side. Under the hood, this command will stream a large set of inference requests from the specified input file, arrange them in microbatches, orchestrate the request / response lifecycle, and finally write the obtained responses back into an output file.
The 1.2.0 release of MLServer includes a number of fixes around the parallel inference pool, focused on improving the architecture to optimise memory usage and reduce latency. These changes include (but are not limited to):
The main MLServer process won’t load an extra replica of the model anymore. Instead, all computing will occur on the parallel inference pool.
The worker pool will now ensure that all requests are executed on each worker’s AsyncIO loop, thus optimising compute time vs IO time.
Several improvements around logging from the inference workers.
MLServer has now dropped support for Python 3.7
. Going forward, only 3.8
, 3.9
and 3.10
will be supported (with 3.8
being used in our official set of images).
The official set of MLServer images has now moved to use UBI 9 as a base image. This ensures support to run MLServer in OpenShift clusters, as well as a well-maintained baseline for our images.
In line with MLServer’s close relationship with the MLflow team, this release of MLServer introduces support for the recently released MLflow 2.0. This introduces changes to the drop-in MLflow “scoring protocol” support, in the MLflow runtime for MLServer, to ensure it’s aligned with MLflow 2.0.
MLServer is also shipped as a dependency of MLflow, therefore you can try it out today by installing MLflow as:
To learn more about how to use MLServer directly from the MLflow CLI, check out the MLflow docs.
@johnpaulett made their first contribution in https://github.com/SeldonIO/MLServer/pull/633
@saeid93 made their first contribution in https://github.com/SeldonIO/MLServer/pull/711
@RafalSkolasinski made their first contribution in https://github.com/SeldonIO/MLServer/pull/720
@dumaas made their first contribution in https://github.com/SeldonIO/MLServer/pull/742
@Salehbigdeli made their first contribution in https://github.com/SeldonIO/MLServer/pull/776
@regen100 made their first contribution in https://github.com/SeldonIO/MLServer/pull/839
Full Changelog: https://github.com/SeldonIO/MLServer/compare/1.1.0...1.2.0
Out of the box, MLServer includes support for streaming data to your models. Streaming support is available for both the REST and gRPC servers.
Streaming support for the REST server is limited only to server streaming. This means that the client sends a single request to the server, and the server responds with a stream of data.
The streaming endpoints are available for both the infer
and generate
methods through the following endpoints:
/v2/models/{model_name}/versions/{model_version}/infer_stream
/v2/models/{model_name}/infer_stream
/v2/models/{model_name}/versions/{model_version}/generate_stream
/v2/models/{model_name}/generate_stream
Note that for REST, the generate
and generate_stream
endpoints are aliases for the infer
and infer_stream
endpoints, respectively. Those names are used to better reflect the nature of the operation for Large Language Models (LLMs).
Streaming support for the gRPC server is available for both client and server streaming. This means that the client sends a stream of data to the server, and the server responds with a stream of data.
The two streams operate independently, so the client and the server can read and write data however they want (e.g., the server could either wait to receive all the client messages before sending a response or it can send a response after each message). Note that bi-directional streaming covers all the possible combinations of client and server streaming: unary-stream, stream-unary, and stream-stream. The unary-unary case can be covered as well by the bi-directional streaming, but mlserver
already has the predict
method dedicated to this use case. The logic for how the requests are received and processed, and how the responses are sent back, should be built into the runtime logic.
The stub method for streaming to be used by the client is ModelStreamInfer
.
The streaming support in MLServer currently has the following main limitations:
the parallel_workers
setting should be set to 0
to disable distributed workers (to be addressed in future releases)
Open Inference Protocol: the main data plane for inference, health and metadata.
Model Repository Extension: an extension to the protocol providing a control plane which lets you load / unload models dynamically.
MLServer has been built with Multi-Model Serving (MMS) in mind. This means that, within a single instance of MLServer, you can serve multiple models under different paths. This also includes multiple versions of the same model.
This notebook shows an example of how you can leverage MMS with MLServer.
We will first start by training 2 different models:
Name | Framework | Source | Trained Model Path |
---|---|---|---|
mnist-svm model
mushroom-xgboost model
The next step will be serving both our models within the same MLServer instance. For that, we will just need to create a model-settings.json file local to each of our models and a server-wide settings.json. That is,
settings.json
: holds the configuration of our server (e.g. ports, log level, etc.).
models/mnist-svm/model-settings.json
: holds the configuration specific to our mnist-svm
model (e.g. input type, runtime to use, etc.).
models/mushroom-xgboost/model-settings.json
: holds the configuration specific to our mushroom-xgboost
model (e.g. input type, runtime to use, etc.).
settings.json
models/mnist-svm/model-settings.json
models/mushroom-xgboost/model-settings.json
Now that we have our config in place, we can start the server by running the mlserver start . command. This needs to either be run from the same directory where our config files are, or point to the folder where they are.
Since this command will start the server and block the terminal, waiting for requests, it will need to be run in the background on a separate terminal.
By this point, we should have both our models getting served by MLServer. To make sure that everything is working as expected, let's send a request from each test set.
For that, we can use the Python types that the `mlserver` package provides out of the box, or we can build our request manually, as sketched below.
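For instance, a manually-built request against the `mnist-svm` model could look roughly like this. The default HTTP port `8080` is assumed, and the input name, shape and values are placeholders that should be adjusted to your trained model and test set:

```python
import requests

# Placeholder instance: swap in a real row from your test set.
x_0 = [0.0] * 64

inference_request = {
    "inputs": [
        {
            "name": "predict",
            "shape": [1, 64],
            "datatype": "FP32",
            "data": x_0,
        }
    ]
}

endpoint = "http://localhost:8080/v2/models/mnist-svm/infer"
response = requests.post(endpoint, json=inference_request)

print(response.json())
```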
`mnist-svm` model

`mushroom-xgboost` model

This tutorial walks through the steps required to take a Python ML model from your machine to a production deployment on Kubernetes. More specifically, we'll cover:
Running the model locally
Turning the ML model into an API
Containerizing the model
Storing the container in a registry
Deploying the model to Kubernetes (with Seldon Core)
Scaling the model
The tutorial comes with an accompanying video which you might find useful as you work through the steps:
The slides used in the video can be found here.
For this tutorial, we're going to use the Cassava dataset available from the TensorFlow Catalog. This dataset includes leaf images from the cassava plant. Each plant can be classified as either "healthy" or as having one of four diseases (Mosaic Disease, Bacterial Blight, Green Mite, Brown Streak Disease).
We won't go through the steps of training the classifier. Instead, we'll be using a pre-trained one available on TensorFlow Hub. You can find the model details here.
The easiest way to run this example is to clone the repository located here:
If you've already cloned the MLServer repository, you can also find it in `docs/examples/cassava`.
Once you've done that, you can just run:
And it'll set you up with all the libraries required to run the code.
The starting point for this tutorial is the Python script `app.py`. This is typical of the kind of Python code we'd run standalone or in a Jupyter notebook. Let's familiarise ourselves with the code:
First up, we're importing a couple of functions from our `helpers.py` file:

- `plot` provides the visualisation of the samples, labels and predictions.
- `preprocess` is used to resize images to 224x224 pixels and normalize the RGB values.
The rest of the code is fairly self-explanatory from the comments. We load the model and dataset, select some examples, make predictions and then plot the results.
Try it yourself by running:
The problem with running our code like we did earlier is that it's not accessible to anyone who doesn't have the Python script (and all of its dependencies). A good way to solve this is to turn our model into an API.

Typically, people turn to popular Python web servers like Flask or FastAPI. This is a good approach and gives us lots of flexibility, but it also requires us to do a lot of the work ourselves. We need to implement routes, set up logging, capture metrics and define an API schema, among other things. A simpler way to tackle this problem is to use an inference server. For this tutorial we're going to use the open source MLServer framework.
MLServer supports a bunch of inference runtimes out of the box, but it also supports custom python code which is what we'll use for our Tensorflow model.
In order to get our model ready to run on MLServer, we need to wrap it in a single Python class with two methods, `load()` and `predict()`. Let's take a look at the code (found in `model/serve-model.py`):
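The original file isn't reproduced here, but a rough sketch of its shape could look like the snippet below. The TF Hub handle is a placeholder, and the exact preprocessing and codec choices are assumptions:

```python
import numpy as np
import tensorflow_hub as hub

from mlserver import MLModel
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest, InferenceResponse

from helpers import preprocess  # resizes to 224x224 and normalises RGB values


class CassavaModel(MLModel):
    async def load(self) -> bool:
        # Placeholder handle: swap in the real cassava classifier from TF Hub.
        self._model = hub.KerasLayer("https://tfhub.dev/<cassava-classifier-handle>")
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the incoming V2 payload into a numpy batch of images.
        images = NumpyCodec.decode_input(payload.inputs[0])
        images = preprocess(images)  # assumed to accept a numpy batch

        probabilities = self._model(images)
        predictions = np.argmax(probabilities, axis=-1)

        return InferenceResponse(
            model_name=self.name,
            outputs=[NumpyCodec.encode_output("predictions", predictions)],
        )
```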
The `load()` method is used to define any logic required to set up our model for inference. In our case, we're loading the model weights into `self._model`. The `predict()` method is where we include all of our prediction logic.
You may notice that we've slightly modified our code from earlier (in `app.py`). The biggest change is that it is now wrapped in a single class, `CassavaModel`.
The only other task we need to do to run our model on MLServer is to specify a `model-settings.json` file:
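The file in the tutorial repository may contain a few more fields, but at a minimum it needs a model name (the value `cassava` below is an assumption) and the import path of the model class:

```json
{
  "name": "cassava",
  "implementation": "serve-model.CassavaModel"
}
```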
This is a simple configuration file that tells MLServer how to handle our model. In our case, we've provided a name for our model and told MLServer where to look for our model class (`serve-model.CassavaModel`).
We're now ready to serve our model with MLServer. To do that we can simply run:
MLServer will now start up, load our cassava model and provide access through both a REST and gRPC API.
Now that our API is up and running, open a new terminal window and navigate back to the root of this repository. We can then send predictions to our API using the `test.py` file by running:
Containers are an easy way to package our application together with its runtime and dependencies. More importantly, containerizing our model allows it to run in a variety of different environments.

Note: you will need Docker installed to run this section of the tutorial. You'll also need a Docker Hub account or access to another container registry.
Taking our model and packaging it into a container manually can be a pretty tricky process and requires knowledge of writing Dockerfiles. Thankfully, MLServer removes this complexity and provides us with a simple `build` command.

Before we run this command, we need to provide our dependencies in either a `requirements.txt` or a `conda.env` file. The requirements file we'll use for this example is stored in `model/requirements.txt`:
Notice that we didn't need to include `mlserver` in our requirements? That's because the builder image already includes it.
We're now ready to build our container image using:
Make sure you replace `YOUR_CONTAINER_REGISTRY` and `IMAGE_NAME` with your Docker Hub username and a suitable image name, e.g. `bobsmith/cassava`.
MLServer will now build the model into a container image for us. We can check the output of this by running:
Finally, we want to send this container image to be stored in our container registry. We can do this by running:
Now that we've turned our model into a production-ready API, containerized it and pushed it to a registry, it's time to deploy our model.
We're going to use a popular open source framework called Seldon Core to deploy our model. Seldon Core is great because it combines all of the awesome cloud-native features we get from Kubernetes but it also adds machine-learning specific features.
This tutorial assumes you already have a Seldon Core cluster up and running. If that's not the case, head over to the installation instructions and get set up first. You'll also need to install the `kubectl` command line interface.
To create our deployment with Seldon Core we need to create a small configuration file that looks like this:
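The real `deployment.yaml` ships with the tutorial repository; a rough sketch of such a manifest, assuming a Seldon Core version whose `SeldonDeployment` CRD supports `protocol: v2` and using placeholder names, is shown below:

```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: cassava
spec:
  protocol: v2            # serve over the V2 inference protocol spoken by MLServer
  predictors:
    - name: default
      graph:
        name: cassava     # must match the container name below
        type: MODEL
      componentSpecs:
        - spec:
            containers:
              - name: cassava
                image: YOUR_CONTAINER_REGISTRY/IMAGE_NAME
```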
You can find this file named `deployment.yaml` in the base folder of this tutorial's repository.
Make sure you replace `YOUR_CONTAINER_REGISTRY` and `IMAGE_NAME` with your Docker Hub username and a suitable image name, e.g. `bobsmith/cassava`.
We can apply this configuration file to our Kubernetes cluster just like we would for any other Kubernetes object using:
To check our deployment is up and running we can run:
We should see `STATUS = Running` once our deployment has finalized.
Now that our model is up and running on a Kubernetes cluster (via Seldon Core), we can send some test inference requests to make sure it's working.
To do this, we simply run the `test.py` file in the following way:
This script will randomly select some test samples, send them to the cluster, gather the predictions and then plot them for us.
A note on running this yourself: this example is set up to connect to a Kubernetes cluster running locally on your machine. If yours is local too, you'll need to make sure you port-forward before sending requests. If your cluster is remote, you'll need to change the `inference_url` variable on line 21 of `test.py`.
Our model is now running in a production environment and able to handle requests from external sources. This is awesome but what happens as the number of requests being sent to our model starts to increase? Eventually, we'll reach the limit of what a single server can handle. Thankfully, we can get around this problem by scaling our model horizontally.
Kubernetes and Seldon Core make this really easy to do by simply running:
We can replace `--replicas=3` with any number we want to scale to.
To watch the servers scaling out we can run:
In this tutorial we've scaled the model out manually to show how it works. In a real environment we'd want to set up auto-scaling to make sure our prediction API is always online and performing as expected.
MLServer is currently used as the core Python inference server in some of the most popular Kubernetes-native serving frameworks, including Seldon Core and KServe. This allows MLServer users to leverage the usability and maturity of these frameworks to take their model deployments to the next level of their MLOps journey, ensuring that they are served in a robust and scalable infrastructure.

In general, it should be possible to deploy models using MLServer into any serving engine compatible with the V2 protocol. Alternatively, it's also possible to manage MLServer deployments manually as regular processes (i.e. in a non-Kubernetes-native way). However, this may be more involved and highly dependent on the deployment infrastructure.
This package provides an MLServer runtime compatible with MLflow models. You can install the runtime, alongside `mlserver`, as:

The MLflow inference runtime introduces a new `dict` content type, which decodes an incoming V2 request as a dictionary. This is useful for certain MLflow-serialised models, which will expect that the model inputs are serialised in this format.
Out-of-the-box, MLServer exposes a set of metrics that help you monitor your machine learning workloads in production. These include standard metrics like number of requests and latency.
On top of these, you can also register and track your own custom metrics as part of your custom inference runtimes.

By default, MLServer will expose metrics around inference requests (count and error rate) and the status of its internal request queues. These internal queues are used for adaptive batching and parallel inference.
| Metric Name | Description |
| --- | --- |
On top of the default set of metrics, MLServer's REST server will also expose a set of metrics specific to REST.

The prefix for the REST-specific metrics will depend on the `metrics_rest_server_prefix` flag from the MLServer settings.
| Metric Name | Description |
| --- | --- |
On top of the default set of metrics, MLServer's gRPC server will also expose a set of metrics specific to gRPC.
MLServer allows you to register custom metrics within your custom inference runtimes. This can be done through the `mlserver.register()` and `mlserver.log()` methods.

- `mlserver.register`: registers a new metric.
- `mlserver.log`: logs a new set of metric / value pairs. If there's any unregistered metric, it will get registered on-the-fly.
Under the hood, metrics logged through the `mlserver.log` method will get exposed to Prometheus as a Histogram.
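As a small sketch of how the two calls fit into a custom runtime (the metric name and logged value below are arbitrary examples, and the response body is left as a placeholder):

```python
import mlserver
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Register the metric once, up front (unregistered metrics would also
        # get registered on-the-fly the first time they are logged).
        mlserver.register("my_custom_metric", "Example of a custom metric")
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Log a value for the metric as part of handling the request.
        mlserver.log(my_custom_metric=34)

        # ...run the actual inference here and build a proper response...
        return InferenceResponse(model_name=self.name, outputs=[])
```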
Below, you can find the list of standardised labels that you will be able to find on model-specific metrics. If these labels are not present on a specific metric, it means that the metric can't be sliced at the model level.
There may be cases where the inference runtimes offered out-of-the-box by MLServer may not be enough, or where you may need extra custom functionality which is not included in MLServer (e.g. custom codecs). To cover these cases, MLServer lets you create custom runtimes very easily.
This page covers some of the bigger points that need to be taken into account when extending MLServer. You can also see the end-to-end custom runtime example, which walks through the process of writing a custom runtime.
MLServer is designed as an easy-to-extend framework, encouraging users to write their own custom runtimes easily. The starting point for this is the `MLModel` abstract class, whose main methods are:

- `load()`: responsible for loading any artifacts related to a model (e.g. model weights, pickle files, etc.).
- `unload()`: responsible for unloading the model, freeing any resources (e.g. GPU memory, etc.).
- `predict()`: responsible for using a model to perform inference on an incoming data point.
Therefore, the "one-line version" of how to write a custom runtime is to write a custom class extending from MLModel <mlserver.MLModel>
, and then overriding those methods with your custom logic.
MLServer exposes an alternative "simplified" interface which can be used to write custom runtimes. This interface can be enabled by decorating your `predict()` method with the `mlserver.codecs.decode_args` decorator. This will let you specify in the method signature both how you want your request payload to be decoded and how to encode the response back.
As an example of the above, let's assume a model which:

- takes two lists of strings as inputs: `questions`, containing multiple questions to ask our model, and `context`, containing multiple contexts for each of the questions;
- returns a Numpy array with some predictions as the output.
Leveraging MLServer's simplified notation, we can represent the above as the following custom runtime:
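The runtime itself could look roughly like this (the `load_my_custom_model` helper is hypothetical, standing in for whatever loads your actual model):

```python
from typing import List

import numpy as np

from mlserver import MLModel
from mlserver.codecs import decode_args


class MyCustomRuntime(MLModel):
    async def load(self) -> bool:
        # Hypothetical helper: load whatever artifacts your model needs.
        self._model = load_my_custom_model()
        return True

    @decode_args
    async def predict(self, questions: List[str], context: List[str]) -> np.ndarray:
        # Both inputs arrive already decoded into lists of strings, and the
        # returned numpy array is encoded back into a V2 response automatically.
        return self._model(questions, context)
```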
Note that the method signature of our `predict` method now specifies:

- the input names that we should be looking for in the request payload (i.e. `questions` and `context`);
- the expected content type for each of the request inputs (i.e. `List[str]` in both cases);
- the expected content type of the response outputs (i.e. `np.ndarray`).
The `headers` field within the `parameters` section of the request / response is managed by MLServer. Therefore, incoming payloads where this field has been explicitly modified will be overridden.
There are occasions where custom logic must be made conditional on extra information sent by the client outside of the payload. To allow for these use cases, MLServer will map all incoming HTTP headers (in the case of REST) or metadata (in the case of gRPC) into the `headers` field of the `parameters` object within the `InferenceRequest` instance.

Similarly, to return any HTTP headers (in the case of REST) or metadata (in the case of gRPC), you can append any values to the `headers` field within the `parameters` object of the returned `InferenceResponse` instance.
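A sketch of what this can look like in practice (the header names are arbitrary examples, and the exact shape of the `Parameters` object should be checked against your MLServer version):

```python
from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse, Parameters


class CustomHeadersRuntime(MLModel):
    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Incoming HTTP headers (REST) or metadata (gRPC) are mapped in here.
        request_headers = {}
        if payload.parameters is not None:
            request_headers = getattr(payload.parameters, "headers", None) or {}

        customer_id = request_headers.get("x-customer-id", "anonymous")

        return InferenceResponse(
            model_name=self.name,
            outputs=[StringCodec.encode_output("echo", [customer_id])],
            # Anything appended here is sent back as HTTP headers / gRPC metadata.
            parameters=Parameters(headers={"x-served-by": self.name}),
        )
```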
MLServer lets you load custom runtimes dynamically into a running instance of MLServer. Once you have your custom runtime ready, all you need to do is move it to your model folder, next to your `model-settings.json` configuration file.
For example, if we assume a flat model repository where each folder represents a model, you would end up with a folder structure like the one below:
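The layout itself isn't reproduced here, but it would look roughly like this (the `my-custom-model` folder name is a placeholder):

```
models/
└── my-custom-model/
    ├── model-settings.json
    └── models.py
```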
Note that, from the example above, we are assuming that:

- your custom runtime code lives in the `models.py` file;
- the `implementation` field of your `model-settings.json` configuration file contains the import path of your custom runtime (e.g. `models.MyCustomRuntime`).
More often than not, your custom runtimes will depend on external 3rd party dependencies which are not included within the main MLServer package. In these cases, to load your custom runtime, MLServer will need access to these dependencies.
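A sketch of the corresponding folder layout, extending the one above with a pre-packaged environment tarball (names are placeholders):

```
models/
└── my-custom-model/
    ├── environment.tar.gz
    ├── model-settings.json
    └── models.py
```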
Note that, in the folder layout above, we are assuming that:

- the `environment.tar.gz` tarball contains a pre-packaged version of your custom environment;
- the `environment_tarball` field of your `model-settings.json` configuration file points to your pre-packaged custom environment (i.e. `./environment.tar.gz`).
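As a sketch, such a `model-settings.json` could look like the snippet below (assuming the tarball path is declared under the model's `parameters`):

```json
{
  "name": "my-custom-model",
  "implementation": "models.MyCustomRuntime",
  "parameters": {
    "environment_tarball": "./environment.tar.gz"
  }
}
```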
The `mlserver build` command expects that a Docker runtime is available and running in the background.
To leverage these, we can use the `mlserver build` command. Assuming that we're currently in the folder containing our custom inference runtime, we should be able to just run:

The output will be a Docker image named `my-custom-server`, ready to be used.
Default setting values can still be overridden by external environment variables or model-specific `model-settings.json` files.
Out-of-the-box, the `mlserver build` subcommand leverages a default `Dockerfile` which takes into account a number of requirements, like:

- Supporting arbitrary user IDs.
Available on most models, but not in .
Only available on .
WARNING: The `1.3.0` release has been yanked from PyPI due to a packaging issue. This has now been resolved in `>= 1.3.1`.
(The original tutorial includes diagrams showing how the setup evolves at each stage: running locally, serving as an API, pushing to a container registry, deploying to Kubernetes, and scaling out the replicas.)
| Metric Name | Description |
| --- | --- |
Custom metrics will generally be registered in the `load()` method and then used in the `predict()` method of your custom runtime.

For metrics specific to a model (e.g. request counts, custom metrics, etc.), MLServer will always label these with the model name and model version. Downstream, this will allow you to aggregate and query metrics per model.
| Label Name | Description |
| --- | --- |
MLServer will expose metric values through a metrics endpoint exposed on its own metrics server. This endpoint can be polled by Prometheus or other Prometheus-compatible backends.

Below you can find the settings available to control the behaviour of the metrics server:
| Setting | Description | Default |
| --- | --- | --- |
Based on the information provided in the method signature, MLServer will automatically decode the request payload into the different inputs specified as keyword arguments. Under the hood, this is implemented through MLServer's codecs and content types system.

MLServer's "simplified" interface aims to cover use cases where encoding / decoding can be done through one of the codecs built into the MLServer package. However, there are instances where this may not be enough (e.g. variable number of inputs, variable content types, etc.). For these types of cases, please use MLServer's lower-level interface (implementing `predict()` against the raw `InferenceRequest` / `InferenceResponse` types), where you will have full control over the full encoding / decoding process.
It is possible to load this custom set of dependencies by providing them through an environment tarball, whose path can be specified within your `model-settings.json` file.

To load a custom environment, parallel inference must be enabled.

The main MLServer process communicates with the workers running in custom environments via inter-process communication using pickled objects. Custom environments must therefore use the same version of MLServer, and a version of Python compatible with the main process. Consult the tables below for environment compatibility.
| Status | Description |
| --- | --- |
| 🔴 | Unsupported |
| 🟢 | Supported |
| 🔵 | Untested |

| Worker Python \ Server Python | 3.9 | 3.10 | 3.11 |
| --- | --- | --- | --- |
| 3.9 | 🟢 | 🟢 | 🔵 |
| 3.10 | 🟢 | 🟢 | 🔵 |
| 3.11 | 🔵 | 🔵 | 🔵 |
If we take the above as a reference, we could extend it to include our custom environment as:
MLServer offers built-in utilities to help you build a custom MLServer image. This image can contain any custom code (including custom inference runtimes), as well as any custom environment, provided either through a Conda environment file or a `requirements.txt` file.
The subcommand will search for any Conda environment file (i.e. named either `environment.yaml` or `conda.yaml`) and / or any `requirements.txt` present in your root folder. These can be used to tell MLServer what Python environment is required in the final Docker image.
The environment built by `mlserver build` will be global to the whole MLServer image (i.e. every loaded model will, by default, use that custom environment). For Multi-Model Serving scenarios, it may be better to use per-model custom environments instead, which will allow you to run multiple custom environments at the same time.
The `mlserver build` subcommand will treat any `settings.json` or `model-settings.json` files present in your root folder as the default settings that must be set in your final image. Therefore, these files can be used to configure things like the default inference runtime to be used, or even to include embedded models that will always be present within your custom image.
- Building your custom environment on the fly.
- Configuring a set of default setting values for the image.
However, there may be occasions where you need to customise your `Dockerfile` even further. This may be the case, for example, when you need to provide extra environment variables or when you need to customise your Docker build process (e.g. by using other "Docker-less" build tools).
To account for these cases, MLServer also includes an `mlserver dockerfile` subcommand which will just generate a `Dockerfile` (and optionally a `.dockerignore` file) exactly like the one used by the `mlserver build` command. This `Dockerfile` can then be customised according to your needs.
The base `Dockerfile` requires Docker BuildKit to be enabled. To ensure BuildKit is used, you can set the `DOCKER_BUILDKIT=1` environment variable, e.g.
| Number of gRPC requests, labelled by gRPC code and method. |
| Number of in-flight gRPC requests. |
| Model Name (e.g. |
| Model Version (e.g. |
| Number of successful inference requests. |
| Number of failed inference requests. |
|
|
| Number of REST requests, labelled by endpoint and status code. |
| Latency of REST requests. |
| Number of in-flight REST requests. |
| Path under which the metrics endpoint will be exposed. |
|
| Port used to serve the metrics server. |
|
| Prefix used for metric names specific to MLServer's REST inference interface. |
|
| MLServer's current working directory (i.e. |
Queue size for the queue.
Queue size for the queue.
Directory used to store internal metric files (used to support metrics sharing across ). This is equivalent to Prometheus' env var.