Out of the box, mlserver supports the deployment and serving of xgboost models. By default, it will assume that these models have been serialised using the bst.save_model() method.
In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.
The first step will be to train a simple xgboost model. For that, we will use the mushrooms example from the xgboost Getting Started guide.
To save our trained model, we will serialise it using bst.save_model() and the JSON format. This is the approach recommended by the XGBoost project.
Our model will be persisted as a file named mushroom-xgboost.json.
Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
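As a reference, the contents of these two files could look roughly like the sketch below, written from Python for convenience. The field names follow the MLServer settings schema, and the debug flag and version string are illustrative assumptions.

```python
# Sketch: write a minimal settings.json and model-settings.json pair.
import json

settings = {"debug": True}  # server-wide settings (ports, log level, etc.)

model_settings = {
    "name": "mushroom-xgboost",
    "implementation": "mlserver_xgboost.XGBoostModel",
    "parameters": {"uri": "./mushroom-xgboost.json", "version": "v0.1.0"},
}

with open("settings.json", "w") as f:
    json.dump(settings, f, indent=2)

with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)
```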
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
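For example, a manually built REST request could look roughly like the sketch below. The port and path follow the V2 inference protocol defaults, and X_test refers to the test split from the training step.

```python
# Sketch: send a test inference request over REST (V2 inference protocol).
import requests

x_0 = X_test.iloc[0:1].values.tolist()  # a single test instance

inference_request = {
    "inputs": [
        {
            "name": "predict",
            "shape": [1, len(x_0[0])],
            "datatype": "FP32",
            "data": x_0,
        }
    ]
}

endpoint = "http://localhost:8080/v2/models/mushroom-xgboost/versions/v0.1.0/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json())
```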
As we can see above, the model predicted the input as close to 0, which matches what's on the test set.
Out of the box, mlserver supports the deployment and serving of lightgbm models. By default, it will assume that these models have been serialised using the bst.save_model() method.
In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.
To test the LightGBM Server, first we need to generate a simple LightGBM model using Python.
Our model will be persisted as a file named iris-lightgbm.bst.
Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
As we can see above, the model predicted the probability for each class, and the probability of class 1 is the highest, at close to 0.99, which matches what's on the test set.
The mlserver package comes with inference runtime implementations for scikit-learn and xgboost models. However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.
In this example, we will train a numpyro model. The numpyro library streamlines the implementation of probabilistic models, abstracting away advanced inference and training algorithms.
Out of the box, mlserver doesn't provide an inference runtime for numpyro. However, through this example we will see how easy it is to develop our own.
The first step will be to train our model. This will be a very simple Bayesian regression model, based on an example provided in the numpyro docs.
Since this is a probabilistic model, during training we will compute an approximation to the posterior distribution of our model using MCMC.
Now that we have trained our model, the next step will be to save it so that it can be loaded afterwards at serving-time. Note that, since this is a probabilistic model, we will only need to save the traces that approximate the posterior distribution over latent parameters.
This will get saved in a numpyro-divorce.json file.
The next step will be to serve our model using mlserver. For that, we will first implement an extension which will serve as the runtime to perform inference using our custom numpyro model.
Our custom inference wrapper should be responsible for (a sketch follows the list):
Loading the model from the set samples we saved previously.
Running inference using our model structure, and the posterior approximated from the samples.
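A condensed sketch of such a runtime is shown below. It leans on MLServer's MLModel base class and NumpyCodec, reuses the regression structure from the training step, and uses numpyro's Predictive to approximate the posterior predictive from the saved samples; treat the exact model function and output naming as assumptions.

```python
# Sketch of a custom MLServer runtime for the numpyro model.
import json

import numpy as np
import numpyro
import numpyro.distributions as dist
from jax import random
from numpyro.infer import Predictive

from mlserver import MLModel
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest, InferenceResponse


def model_fn(marriage=None, obs=None):
    # Same simple regression structure used at training time (sketch)
    a = numpyro.sample("a", dist.Normal(0.0, 0.2))
    bM = numpyro.sample("bM", dist.Normal(0.0, 0.5))
    sigma = numpyro.sample("sigma", dist.Exponential(1.0))
    mu = a + bM * marriage
    numpyro.sample("obs", dist.Normal(mu, sigma), obs=obs)


class NumpyroModel(MLModel):
    async def load(self) -> bool:
        # Load the posterior samples saved as numpyro-divorce.json
        with open(self.settings.parameters.uri) as f:
            raw_samples = json.load(f)
        self._samples = {k: np.array(v) for k, v in raw_samples.items()}
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the input (marriage rate) and run the posterior predictive
        marriage = NumpyCodec.decode_input(payload.inputs[0])
        predictive = Predictive(model_fn, self._samples)
        predictions = predictive(random.PRNGKey(0), marriage=marriage)

        obs_mean = np.asarray(predictions["obs"]).mean(axis=0)
        return InferenceResponse(
            model_name=self.name,
            outputs=[NumpyCodec.encode_output(name="obs_mean", payload=obs_mean)],
        )
```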
The next step will be to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Now that we have written and tested our custom model, the next step is to deploy it. With that goal in mind, the rough outline of steps will be to first build a custom image containing our code, and then deploy it.
MLServer will automatically find your requirements.txt file and install the necessary Python packages.
MLServer offers helpers to build a custom Docker image containing your code. In this example, we will use the mlserver build subcommand to create an image, which we'll be able to deploy later.
Note that this section expects that Docker is available and running in the background, as well as a functional cluster with Seldon Core installed and some familiarity with kubectl.
To ensure that the image is fully functional, we can spin up a container and then send a test request. To start the container, you can run something along the following lines in a separate terminal:
As we should be able to see, the server running within our Docker image responds as expected.
Now that we've built a custom image and verified that it works as expected, we can move to the next step and deploy it. There is a large number of tools out there to deploy images. However, for our example, we will focus on deploying it to a cluster running Seldon Core.
For that, we will need to create a SeldonDeployment resource which instructs Seldon Core to deploy a model embedded within our custom image and compliant with the V2 Inference Protocol. This can be achieved by applying (i.e. kubectl apply) a SeldonDeployment manifest to the cluster, similar to the one below:
Out of the box, mlserver supports the deployment and serving of alibi_detect models. Alibi Detect is an open source Python library focused on outlier, adversarial and drift detection. In this example, we will cover how we can create a detector configuration to then serve it using mlserver.
The first step will be to fetch the reference data and other relevant metadata for an alibi-detect model.
For that, we will use the alibi library to get the adult dataset with demographic features from a 1996 US census.
This example is based on the Categorical and mixed type data drift detection on income prediction tabular data example from the alibi-detect documentation.
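A rough sketch of this step is shown below: it fetches the adult dataset, fits a TabularDrift detector on a reference slice, and persists it with save_detector. The category handling in the original example is more involved, and the save_detector import path varies between alibi-detect versions.

```python
# Sketch: fetch the adult census data and persist a TabularDrift detector.
from alibi.datasets import fetch_adult
from alibi_detect.cd import TabularDrift
from alibi_detect.saving import save_detector

adult = fetch_adult()
X, y = adult.data, adult.target

# Use the first chunk of the data as the reference distribution
X_ref = X[:1000]

# Mark the categorical columns (None lets the detector infer the categories)
categories_per_feature = {i: None for i in adult.category_map.keys()}

detector = TabularDrift(
    X_ref, p_val=0.05, categories_per_feature=categories_per_feature
)
save_detector(detector, "./alibi-detect-artifact")
```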
Now that we have the reference data and other configuration parameters, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running the mlserver start command from the same directory where our config files are (or pointing it to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our alibi-detect model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Out of the box, MLServer supports the deployment and serving of HuggingFace Transformer models with the following features:
Loading of Transformer Model artifacts from the Hugging Face Hub.
Model quantization & optimization using the Hugging Face Optimum library.
Request batching for GPU optimization (via adaptive batching and request batching).
In this example, we will showcase some of these features using an example model.
Since we're using a pretrained model, we can skip straight to serving.
model-settings.json
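For reference, a model-settings.json for the HuggingFace runtime could look roughly like the sketch below (written from Python). The extra.task and pretrained_model fields follow the mlserver-huggingface runtime conventions, and the exact parameter layout may differ between MLServer versions.

```python
# Sketch: model-settings.json for a text-generation Transformer model.
import json

model_settings = {
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "parameters": {
        "extra": {
            "task": "text-generation",
            "pretrained_model": "distilgpt2",  # illustrative model choice
        }
    },
}

with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)
```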
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We can also leverage the Optimum library, which allows us to access quantized and optimized models.
We can download pretrained optimized models from the hub, if available, by enabling the optimum_model flag:
Once again, you are able to run the model using the MLServer CLI. As before, this needs to either be run from the same directory where our config files are, or point to the folder where they are.
The request can now be sent using the same request structure, but using optimized models for better performance.
MLServer supports many other Transformer tasks beyond text generation; below are examples for a few of the other supported tasks.
Once again, you are able to run the model using the MLServer CLI.
Once again, you are able to run the model using the MLServer CLI.
We can also evaluate GPU acceleration by comparing the speed on CPU vs GPU using the following parameters.
We first test the time taken with device=-1, which configures the CPU by default.
Once again, you are able to run the model using the MLServer CLI.
We can see that it takes 81 seconds, which is 8 times longer than the GPU example below.
IMPORTANT: Running the code below requires a machine with a GPU configured correctly for TensorFlow/PyTorch.
Now we'll run the benchmark with the GPU configured, which we can do by setting device=0.
We can see that the elapsed time is 8 times less than the CPU version!
We can also see how the adaptive batching capabilities can allow for GPU acceleration, by grouping multiple incoming requests so they get processed on the GPU as a single batch.
In our case, we can enable adaptive batching with the max_batch_size parameter, which we will set to 128.
We will also configure max_batch_time, which specifies the maximum amount of time the MLServer orchestrator will wait before sending the batch for inference (a configuration sketch follows below).
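A sketch of the corresponding model-settings.json, written from Python; max_batch_size and max_batch_time are standard MLServer model settings, while the extra fields are assumptions carried over from the earlier HuggingFace sketch.

```python
# Sketch: enable adaptive batching in model-settings.json.
import json

model_settings = {
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "max_batch_size": 128,   # group up to 128 requests into a single batch
    "max_batch_time": 0.5,   # wait at most 0.5s before dispatching a batch
    "parameters": {"extra": {"task": "text-generation", "device": 0}},
}

with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)
```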
In order to achieve the required throughput of 50 requests per second, we will use the load-testing tool vegeta.
We can now see that the requests are batched and we receive 100% success, even though the requests are sent one by one.
Out of the box, MLServer supports the deployment and serving of MLflow models with the following features:
Loading of MLflow Model artifacts.
Support of dataframes, dict-of-tensors and tensor inputs.
In this example, we will showcase some of these features using an example model.
The first step will be to train and serialise an MLflow model. For that, we will use the linear regression example from the MLflow docs.
The training script will also serialise our trained model, leveraging the MLflow Model format. By default, we should be able to find the saved artifact under the mlruns folder.
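A compressed sketch of that training script is shown below. It assumes the wine-quality CSV from the MLflow example is available locally as wine-quality.csv, and uses an ElasticNet regressor as in the original example.

```python
# Sketch: train a wine-quality regressor and log it with MLflow.
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

# Assumes the wine-quality CSV from the MLflow example is available locally
data = pd.read_csv("wine-quality.csv")
X = data.drop(columns=["quality"])
y = data["quality"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    model = ElasticNet(alpha=0.5, l1_ratio=0.5, random_state=42)
    model.fit(X_train, y_train)

    # The serialised artifact ends up under the local ./mlruns folder
    mlflow.sklearn.log_model(model, artifact_path="model")
```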
Now that we have trained and serialised our model, we are ready to start serving it. For that, the initial step will be to set up a model-settings.json that instructs MLServer to load our artifact using the MLflow Inference Runtime.
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set. For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Note that the request specifies the value pd as its content type, whereas every input specifies the content type np. These parameters will instruct MLServer to do the following (a sketch of such a request follows the list):
Convert every input value to a NumPy array, using the data type and shape information provided.
Group all the different inputs into a Pandas DataFrame, using their names as the column names.
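A sketch of such a request is shown below; the feature names and values are illustrative, and the wine-classifier model name matches the metadata endpoint used later in this example.

```python
# Sketch: a V2 request using the "pd" request codec and "np" input codecs.
import requests

inference_request = {
    "parameters": {"content_type": "pd"},
    "inputs": [
        {
            "name": "fixed acidity",
            "shape": [1],
            "datatype": "FP32",
            "data": [7.4],
            "parameters": {"content_type": "np"},
        },
        {
            "name": "alcohol",
            "shape": [1],
            "datatype": "FP32",
            "data": [9.4],
            "parameters": {"content_type": "np"},
        },
        # ...one input per DataFrame column expected by the model
    ],
}

endpoint = "http://localhost:8080/v2/models/wine-classifier/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json())
```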
To learn more about how MLServer uses content type parameters, you can check this worked out example.
As we can see in the output above, the predicted quality score for our input wine was 5.57.
MLflow currently ships with a scoring server with its own protocol. In order to provide a drop-in replacement, the MLflow runtime in MLServer also exposes a custom endpoint which matches the signature of MLflow's /invocations endpoint.
As an example, we can try to send the same request that we sent previously, but using MLflow's protocol. Note that, in both cases, the request will be handled by the same MLServer instance.
As we can see above, the predicted quality for our input is 5.57, matching the prediction we obtained above.
MLflow lets users define a model signature, where they can specify what types of inputs the model accepts and what types of outputs it returns. Similarly, the V2 inference protocol employed by MLServer defines a metadata endpoint which can be used to query what inputs and outputs the model accepts. However, even though they serve similar functions, the data schemas used by each of them are not compatible with each other.
To solve this, if your model defines an MLflow model signature, MLServer will convert this signature on the fly to a metadata schema compatible with the V2 Inference Protocol. This will also include specifying any extra content type that is required to correctly decode / encode your data.
As an example, we can first have a look at the model signature saved for our MLflow model. This can be seen directly in the MLModel file saved by our model.
We can then query the metadata endpoint, to see the model metadata inferred by MLServer from our test model's signature. For this, we will use the /v2/models/wine-classifier/ endpoint.
As we should be able to see, the model metadata now matches the information contained in our model signature, including any extra content types necessary to decode our data correctly.
MLServer supports loading and unloading models dynamically from a models repository. This allows you to enable and disable the models accessible by MLServer on demand. This extension builds on top of the support for Multi-Model Serving, letting you change at runtime which models MLServer is currently serving.
The API to manage the model repository is modelled after Triton's Model Repository extension to the V2 Dataplane and is thus fully compatible with it.
This notebook will walk you through an example using the Model Repository API.
First of all, we will need to train some models. For that, we will re-use the models we trained previously in the Multi-Model Serving example. You can check the details on how they are trained following that notebook.
Next up, we will start our mlserver inference server. Note that, by default, this will load all our models.
Now that we've got our inference server up and running, and serving 2 different models, we can start using the Model Repository API. To get us started, we will first list all available models in the repository.
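Listing the repository is a plain POST against the repository index endpoint defined by Triton's model repository extension, for example:

```python
# Sketch: list every model available in the model repository.
import requests

response = requests.post("http://localhost:8080/v2/repository/index", json={})
for model in response.json():
    print(model["name"], model.get("state"))
```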
As we can see, the repository lists 2 models (i.e. mushroom-xgboost and mnist-svm). Note that the state for both is set to READY. This means that both models are loaded, and thus ready for inference.
Unloading our mushroom-xgboost model
We will now try to unload one of the 2 models, mushroom-xgboost. This will unload the model from the inference server, but will keep it available on our model repository.
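The unload call is another POST against the repository API; the /load counterpart used further below is symmetrical.

```python
# Sketch: unload the mushroom-xgboost model (it stays in the repository).
import requests

requests.post("http://localhost:8080/v2/repository/models/mushroom-xgboost/unload")

# Loading it back later is the symmetric call:
# requests.post("http://localhost:8080/v2/repository/models/mushroom-xgboost/load")
```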
If we now try to list the models available in our repository, we will see that the mushroom-xgboost model is flagged as UNAVAILABLE. This means that it's present in the repository but it's not loaded for inference.
Loading our mushroom-xgboost model back
We will now load our model back into our inference server.
If we now try to list the models again, we will see that our mushroom-xgboost is back again, ready for inference.
The mlserver package comes with inference runtime implementations for scikit-learn and xgboost models. However, sometimes we may also need to roll out our own inference server, with custom logic to perform inference. To support this scenario, MLServer makes it really easy to create your own extensions, which can then be containerised and deployed in a production environment.
In this example, we create a simple Hello World JSON model that parses and modifies a JSON data chunk. This is often useful as a means to quickly bootstrap existing models that utilize JSON-based model inputs.
The next step will be to serve our model using mlserver. For that, we will first implement an extension which will serve as the runtime to perform inference using our custom Hello World JSON model.
This is a trivial model to demonstrate how to conceptually work with JSON inputs / outputs. In this example, we do the following (a sketch follows the list):
Parse the JSON input from the client
Create a JSON response echoing back the client request as well as a server generated message
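A sketch of such a runtime is shown below, using MLServer's StringCodec to move the raw JSON string in and out of the V2 payload; the output name and the server message are illustrative.

```python
# Sketch: a custom runtime that echoes back a JSON request plus a message.
import json

from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class JsonHelloWorldModel(MLModel):
    async def load(self) -> bool:
        # Nothing to load for this toy model
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Parse the JSON string sent by the client
        raw = StringCodec.decode_input(payload.inputs[0])[0]
        request_json = json.loads(raw)

        # Echo the request back, together with a server-generated message
        response_json = {
            "request": request_json,
            "server_response": "Got your request. Hello from the server.",
        }

        return InferenceResponse(
            model_name=self.name,
            outputs=[
                StringCodec.encode_output(
                    name="echo_response", payload=[json.dumps(response_json)]
                )
            ],
        )
```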
The next step will be to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Utilizing string data with the gRPC interface can be a bit tricky, so we need to make sure that inputs and outputs are encoded and decoded correctly.
For simplicity, in this case we leverage the Python types that mlserver provides out of the box. Alternatively, the gRPC stubs can be regenerated from the V2 specification directly, for use by non-Python as well as Python clients.
MLServer has been built with Multi-Model Serving (MMS) in mind. This means that, within a single instance of MLServer, you can serve multiple models under different paths. This also includes multiple versions of the same model.
This notebook shows an example of how you can leverage MMS with MLServer.
We will first start by training 2 different models:
mnist-svm model
mushroom-xgboost model
The next step will be serving both our models within the same MLServer instance. For that, we will just need to create a model-settings.json file local to each of our models and a server-wide settings.json. That is,
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
models/mnist-svm/model-settings.json: holds the configuration specific to our mnist-svm model (e.g. input type, runtime to use, etc.).
models/mushroom-xgboost/model-settings.json: holds the configuration specific to our mushroom-xgboost model (e.g. input type, runtime to use, etc.).
settings.json
models/mnist-svm/model-settings.json
models/mushroom-xgboost/model-settings.json
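The contents of these files could look roughly like the sketch below, written from Python for convenience; the runtime class paths come from the mlserver-sklearn and mlserver-xgboost packages, and the version strings are illustrative.

```python
# Sketch: write one model-settings.json per model, plus a shared settings.json.
import json
import os

configs = {
    "settings.json": {"debug": True},
    "models/mnist-svm/model-settings.json": {
        "name": "mnist-svm",
        "implementation": "mlserver_sklearn.SKLearnModel",
        "parameters": {"uri": "./model.joblib", "version": "v0.1.0"},
    },
    "models/mushroom-xgboost/model-settings.json": {
        "name": "mushroom-xgboost",
        "implementation": "mlserver_xgboost.XGBoostModel",
        "parameters": {"uri": "./model.json", "version": "v0.1.0"},
    },
}

for path, config in configs.items():
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
```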
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
By this point, we should have both our models getting served by MLServer. To make sure that everything is working as expected, let's send a request from each test set.
For that, we can use the Python types that the mlserver package provides out of the box, or we can build our request manually.
mnist-svm model
mushroom-xgboost model
Out of the box, MLServer provides support to receive inference requests from Kafka. The Kafka server can run side-by-side with the REST and gRPC ones, and adds a new interface to interact with your model. The inference responses coming back from your model will also get written back to their own output topic.
In this example, we will showcase the integration with Kafka by serving a model through Kafka.
We are going to start by running a simple local docker deployment of kafka that we can test against. This will be a minimal cluster that will consist of a single zookeeper node and a single broker.
You need to have Java installed in order for it to work correctly.
Now you can just run it with the following command on a separate terminal:
Now we can create the required input and output topics.
Our model will be persisted as a file named mnist-svm.joblib.
Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Note that the settings.json file will contain our Kafka configuration, including the address of the Kafka broker and the input / output topics that will be used for inference.
settings.json
model-settings.json
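A sketch of the Kafka-related settings is shown below, written from Python; the field names follow the MLServer settings schema, but treat the exact names as assumptions if you are on a different MLServer version.

```python
# Sketch: settings.json enabling the Kafka server alongside REST / gRPC.
import json

settings = {
    "debug": True,
    "kafka_enabled": True,
    "kafka_servers": "localhost:9092",
    "kafka_topic_input": "mlserver-input",
    "kafka_topic_output": "mlserver-output",
}

with open("settings.json", "w") as f:
    json.dump(settings, f, indent=2)
```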
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
Now that we have verified that our server is accepting REST requests, we will try to send a new inference request through Kafka. For this, we just need to send a request to the mlserver-input topic (which is the default input topic):
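A sketch of this step using the kafka-python client is shown below. The payload follows the same V2 request structure used over REST, and the mlserver-model header and message key used to route the message to the mnist-svm model should be treated as assumptions.

```python
# Sketch: publish a V2 inference request to the mlserver-input topic.
import json

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# A single digit from the test set, flattened (as in the earlier REST request)
x_0 = X_test[0:1].tolist()

inference_request = {
    "inputs": [
        {"name": "predict", "shape": [1, 64], "datatype": "FP32", "data": x_0}
    ]
}

producer.send(
    "mlserver-input",
    key=b"mnist-svm",
    value=json.dumps(inference_request).encode("utf-8"),
    headers=[("mlserver-model", b"mnist-svm")],
)
producer.flush()
```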
Once the message has gone into the queue, the Kafka server running within MLServer should receive this message and run inference. The prediction output should then get posted into an output queue, which will be named mlserver-output by default.
As we should be able to see above, the results of our inference request are now visible in the output Kafka queue.
It's not unusual that model runtimes require extra dependencies that are not direct dependencies of MLServer. This is the case when we want to use custom runtimes, but also when our model artifacts are the output of older versions of a toolkit (e.g. models trained with an older version of SKLearn).
In these cases, since these dependencies (or dependency versions) are not known in advance by MLServer, they won't be included in the default seldonio/mlserver Docker image. To cover these cases, the seldonio/mlserver Docker image allows you to load custom environments before starting the server itself.
This example will walk you through how to create and save a custom environment, so that it can be loaded in MLServer without any extra change to the seldonio/mlserver Docker image.
For this example, we will create a custom environment to serve a model trained with an older version of Scikit-Learn. The first step will be to define this environment using an environment.yml file.
Note that these environments can also be created on the fly as we go, and then serialised later.
To illustrate the point, we will train a Scikit-Learn model using our older environment.
The first step will be to create and activate an environment which reflects what's outlined in our environment.yml file.
NOTE: If you are running this from a Jupyter Notebook, you will need to restart your Jupyter instance so that it runs from this environment.
We can now train and save a Scikit-Learn model using the older version of our environment. This model will be serialised as model.joblib.
This tool (conda-pack) will save a portable version of our environment as a .tar.gz file, also known as a tarball.
Now that we have defined our environment (and we've got a sample artifact trained in that environment), we can move to serving our model.
To do that, we will first need to select the right runtime through a model-settings.json config file.
We can then spin up our model, using our custom environment, leveraging MLServer's Docker image. Keep in mind that you will need Docker installed on your machine to run this example.
Our Docker command will need to take into account the following points:
Mount the example's folder as a volume so that it can be accessed from within the container.
Let MLServer know that our custom environment's tarball can be found as old-sklearn.tar.gz.
Expose port 8080 so that we can send requests from the outside.
From the command line, this can be done using Docker's CLI as:
Note that we need to keep the server running in the background while we send requests. Therefore, it's best to run this command on a separate terminal session.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
This tutorial walks through the steps required to take a python ML model from your machine to a production deployment on Kubernetes. More specifically we'll cover:
Running the model locally
Turning the ML model into an API
Containerizing the model
Storing the container in a registry
Deploying the model to Kubernetes (with Seldon Core)
Scaling the model
The tutorial comes with an accompanying video which you might find useful as you work through the steps:
The slides used in the video can be found .
For this tutorial, we're going to use the cassava dataset available from the Tensorflow Catalog. This dataset includes leaf images from the cassava plant. Each plant can be classified as either "healthy" or as having one of four diseases (Mosaic Disease, Bacterial Blight, Green Mite, Brown Streak Disease).
If you've already cloned the MLServer repository, you can also find it in docs/examples/cassava.
Once you've done that, you can just run:
And it'll set you up with all the libraries required to run the code.
The starting point for this tutorial is the python script app.py. This is typical of the kind of python code we'd run standalone or in a jupyter notebook. Let's familiarise ourselves with the code:
First up, we're importing a couple of functions from our helpers.py file:
plot provides the visualisation of the samples, labels and predictions.
preprocess is used to resize images to 224x224 pixels and normalize the RGB values.
The rest of the code is fairly self-explanatory from the comments. We load the model and dataset, select some examples, make predictions and then plot the results.
Try it yourself by running:
The problem with running our code like we did earlier is that it's not accessible to anyone who doesn't have the python script (and all of its dependencies). A good way to solve this is to turn our model into an API.
In order to get our model ready to run on MLServer we need to wrap it in a single python class with two methods, load() and predict(). Let's take a look at the code (found in model/serve-model.py):
The load() method is used to define any logic required to set up our model for inference. In our case, we're loading the model weights into self._model. The predict() method is where we include all of our prediction logic.
You may notice that we've slightly modified our code from earlier (in app.py). The biggest change is that it is now wrapped in a single class, CassavaModel.
The only other task we need to do to run our model on MLServer is to specify a model-settings.json file:
This is a simple configuration file that tells MLServer how to handle our model. In our case, we've provided a name for our model and told MLServer where to look for our model class (serve-model.CassavaModel).
We're now ready to serve our model with MLServer. To do that we can simply run:
MLServer will now start up, load our cassava model and provide access through both a REST and gRPC API.
Now that our API is up and running, open a new terminal window and navigate back to the root of this repository. We can then send predictions to our api using the test.py file, by running:
Taking our model and packaging it into a container manually can be a pretty tricky process and requires knowledge of writing Dockerfiles. Thankfully MLServer removes this complexity and provides us with a simple build command.
Before we run this command, we need to provide our dependencies in either a requirements.txt or a conda.env file. The requirements file we'll use for this example is stored in model/requirements.txt:
Notice that we didn't need to include mlserver in our requirements? That's because the builder image has mlserver included already.
We're now ready to build our container image using:
Make sure you replace YOUR_CONTAINER_REGISTRY and IMAGE_NAME with your dockerhub username and a suitable name e.g. "bobsmith/cassava".
MLServer will now build the model into a container image for us. We can check the output of this by running:
Finally, we want to send this container image to be stored in our container registry. We can do this by running:
Now that we've turned our model into a production-ready API, containerized it and pushed it to a registry, it's time to deploy our model.
To create our deployment with Seldon Core we need to create a small configuration file that looks like this:
You can find this file named deployment.yaml in the base folder of this tutorial's repository.
Make sure you replace YOUR_CONTAINER_REGISTRY and IMAGE_NAME with your dockerhub username and a suitable name e.g. "bobsmith/cassava".
We can apply this configuration file to our Kubernetes cluster just like we would for any other Kubernetes object using:
To check our deployment is up and running we can run:
We should see STATUS = Running once our deployment has finalized.
Now that our model is up and running on a Kubernetes cluster (via Seldon Core), we can send some test inference requests to make sure it's working.
To do this, we simply run the test.py file in the following way:
This script will randomly select some test samples, send them to the cluster, gather the predictions and then plot them for us.
Kubernetes and Seldon Core make this really easy to do by simply running:
We can replace the --replicas=3 with any number we want to scale to.
To watch the servers scaling out we can run:
To see MLServer in action you can check out the examples below. These are end-to-end notebooks, showing how to serve models with MLServer.
If you are interested in how MLServer interacts with particular model frameworks, you can check the following examples. These focus on showcasing the different inference runtimes that ship with MLServer out of the box. Note that, for advanced use cases, you can also write your own custom inference runtime (see the custom runtime examples).
To see some of the advanced features included in MLServer (e.g. multi-model serving), check out the examples below.
Tutorials are designed to be beginner-friendly and walk through accomplishing a series of tasks using MLServer (and other tools).
The first step will be to train a simple scikit-learn model. For that, we will use the MNIST example from the scikit-learn documentation, which trains an SVM model.
To save our trained model, we will serialise it using joblib. While this is not a perfect approach, it's currently the recommended method to persist models to disk in the scikit-learn documentation.
You can find more details of this process in the .
Lastly, we will need to serialise our environment in the format expected by MLServer. To do that, we will use a tool called conda-pack.
We won't go through the steps of training the classifier. Instead, we'll be using a pre-trained one available on TensorFlow Hub. You can find the .
The easiest way to run this example is to clone the repository:
Here's what our setup currently looks like:
Typically people turn to popular python web servers like Flask or FastAPI. This is a good approach and gives us lots of flexibility, but it also requires us to do a lot of the work ourselves. We need to implement routes, set up logging, capture metrics and define an API schema, among other things. A simpler way to tackle this problem is to use an inference server. For this tutorial we're going to use the open source MLServer framework.
MLServer supports a bunch of inference runtimes out of the box, but it also supports custom python code, which is what we'll use for our Tensorflow model.
Our setup has now evolved and looks like this:
Containers are an easy way to package our application together with its runtime and dependencies. More importantly, containerizing our model allows it to run in a variety of different environments.
Note: you will need Docker installed to run this section of the tutorial. You'll also need a Docker Hub account or another container registry.
Our setup now looks like this. Where our model has been packaged and sent to a container registry:
We're going to use a popular open source framework called Seldon Core to deploy our model. Seldon Core is great because it combines all of the awesome cloud-native features we get from Kubernetes, but it also adds machine-learning specific features.
This tutorial assumes you already have a Seldon Core cluster up and running. If that's not the case, head over to the Seldon Core documentation and get set up first. You'll also need to install the kubectl command line interface.
A note on running this yourself: This example is set up to connect to a kubernetes cluster running locally on your machine. If yours is local too, you'll need to make sure you port-forward before sending requests. If your cluster is remote, you'll need to change the inference_url variable on line 21 of test.py.
Having deployed our model to kubernetes and tested it, our setup now looks like this:
Our model is now running in a production environment and able to handle requests from external sources. This is awesome but what happens as the number of requests being sent to our model starts to increase? Eventually, we'll reach the limit of what a single server can handle. Thankfully, we can get around this problem by scaling our model horizontally.
Once the new replicas have finished rolling out, our setup now looks like this:
In this tutorial we've scaled the model out manually to show how it works. In a real environment we'd want to set up automatic scaling to make sure our prediction API is always online and performing as expected.
mnist-svm (scikit-learn): ./models/mnist-svm/model.joblib
mushroom-xgboost (xgboost): ./models/mushroom-xgboost/model.json
Out of the box, mlserver supports the deployment and serving of scikit-learn models. By default, it will assume that these models have been serialised using joblib.
In this example, we will cover how we can train and serialise a simple model, to then serve it using mlserver.
The first step will be to train a simple scikit-learn model. For that, we will use the MNIST example from the scikit-learn documentation which trains an SVM model.
To save our trained model, we will serialise it using joblib. While this is not a perfect approach, it's currently the recommended method to persist models to disk in the scikit-learn documentation.
Our model will be persisted as a file named mnist-svm.joblib.
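A minimal sketch of this training step, following the scikit-learn digits example:

```python
# Sketch: train an SVM classifier on the digits dataset and persist it with joblib.
import joblib
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))  # flatten the 8x8 images

X_train, X_test, y_train, y_test = train_test_split(
    X, digits.target, test_size=0.5, shuffle=False
)

classifier = svm.SVC(gamma=0.001)
classifier.fit(X_train, y_train)

joblib.dump(classifier, "mnist-svm.joblib")
```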
Now that we have trained and saved our model, the next step will be to serve it using mlserver. For that, we will need to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
settings.json
model-settings.json
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
We now have our model being served by mlserver. To make sure that everything is working as expected, let's send a request from our test set.
For that, we can use the Python types that mlserver provides out of the box, or we can build our request manually.
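Using the mlserver Python types and the NumpyCodec, the request could be built roughly like this (sending it over REST with requests; the model name matches the model-settings.json above):

```python
# Sketch: build a V2 request with mlserver's types and send it over REST.
import requests
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest

x_0 = X_test[0:1]  # a single digit from the test set

inference_request = InferenceRequest(
    inputs=[NumpyCodec.encode_input(name="predict", payload=x_0)]
)

endpoint = "http://localhost:8080/v2/models/mnist-svm/infer"
response = requests.post(endpoint, json=inference_request.dict())
print(response.json())
```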
As we can see above, the model predicted the input as the number 8, which matches what's on the test set.
MLServer extends the V2 inference protocol by adding support for a content_type annotation. This annotation can be provided either through the model metadata parameters, or through the input parameters. By leveraging the content_type annotation, we can provide the necessary information to MLServer so that it can decode the input payload from the "wire" V2 protocol to something meaningful to the model / user (e.g. a NumPy array).
This example will walk you through some examples which illustrate how this works, and how it can be extended.
To start with, we will write a dummy runtime which just prints the input, the decoded input and returns it. This will serve as a testbed to showcase how the content_type support works.
Later on, we will extend this runtime by adding custom codecs that will decode our V2 payload to custom types.
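A sketch of what that dummy runtime could look like:

```python
# Sketch: a dummy runtime that prints each input, decodes it and echoes it back.
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


class EchoRuntime(MLModel):
    async def load(self) -> bool:
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        outputs = []
        for request_input in payload.inputs:
            # self.decode() resolves the right codec from the content_type
            decoded_input = self.decode(request_input)
            print(f"------ Encoded Input ({request_input.name}) ------")
            print(request_input)
            print(f"------ Decoded input ({request_input.name}) ------")
            print(decoded_input)

            outputs.append(
                ResponseOutput(
                    name=request_input.name,
                    datatype=request_input.datatype,
                    shape=request_input.shape,
                    data=request_input.data,
                )
            )

        return InferenceResponse(model_name=self.name, outputs=outputs)
```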
As you can see above, this runtime will decode the incoming payloads by calling the self.decode() helper method. This method will check which content type is the right one for each input, in the following order:
Is there any content type defined in the inputs[].parameters.content_type field within the request payload?
Is there any content type defined in the inputs[].parameters.content_type field within the model metadata?
Is there any default content type that should be assumed?
In order to enable this runtime, we will also create a model-settings.json file. This file should be present in (or accessible from) the folder where we run the mlserver start . command.
Our initial step will be to decide the content type based on the incoming inputs[].parameters field. For this, we will start our MLServer in the background (e.g. by running mlserver start .).
As you've probably already noticed, writing request payloads compliant with the V2 Inference Protocol requires a certain knowledge of both the V2 spec and the structure expected by each content type. To account for this and simplify usage, the MLServer package exposes a set of utilities which will help you interact with your models via the V2 protocol.
These helpers are mainly shaped as "codecs". That is, abstractions which know how to "encode" and "decode" arbitrary Python datatypes to and from the V2 Inference Protocol.
Generally, we recommend using the existing set of codecs to generate your V2 payloads. This will ensure that requests and responses follow the right structure, and should provide a more seamless experience.
Following on from our previous example, the same code could be rewritten using codecs as:
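A sketch of that rewrite is shown below; the payloads are illustrative, and the model name is whatever you used in model-settings.json.

```python
# Sketch: build the same request using MLServer's codecs instead of raw JSON.
import numpy as np
import requests
from mlserver.codecs import NumpyCodec, StringCodec
from mlserver.types import InferenceRequest

inference_request = InferenceRequest(
    inputs=[
        # A NumPy array, encoded by NumpyCodec
        NumpyCodec.encode_input(name="foo", payload=np.array([[1, 2], [3, 4]])),
        # A list of strings, encoded by StringCodec
        StringCodec.encode_input(name="bar", payload=["first", "second"]),
    ]
)

endpoint = "http://localhost:8080/v2/models/content-type-example/infer"
response = requests.post(endpoint, json=inference_request.dict())
print(response.json())
```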
Note that the rewritten snippet now makes use of the built-in InferenceRequest class, which represents a V2 inference request. On top of that, it also uses the NumpyCodec and StringCodec implementations, which know how to encode a Numpy array and a list of strings into V2-compatible request inputs.
Our next step will be to define the expected content type through the model metadata. This can be done by extending the model-settings.json file, and adding a section on inputs.
After adding this metadata, we will re-start MLServer (e.g. mlserver start .) and we will send a new request without any explicit parameters.
As you should be able to see in the server logs, MLServer will cross-reference the input names against the model metadata to find the right content type.
There may be cases where a custom inference runtime needs to encode / decode to custom datatypes. As an example, we can think of computer vision models which may only operate with pillow image objects.
In these scenarios, it's possible to extend the Codec interface to write our custom encoding logic. A Codec is simply an object which defines decode() and encode() methods. To illustrate how this would work, we will extend our custom runtime to add a custom PillowCodec.
We should now be able to restart our instance of MLServer (i.e. with the mlserver start . command), to send a few test requests.
As you should be able to see in the MLServer logs, the server is now able to decode the payload into a Pillow image. This example also illustrates how Codec objects can be compatible with multiple datatype values (e.g. tensor and BYTES in this case).
So far, we've seen how you can specify codecs so that they get applied at the input level. However, it is also possible to use request-wide codecs that aggregate multiple inputs to decode the payload. This is usually relevant for cases where the models expect a multi-column input type, like a Pandas DataFrame.
To illustrate this, we will first tweak our EchoRuntime so that it prints the decoded contents at the request level.
We should now be able to restart our instance of MLServer (i.e. with the mlserver start . command), to send a few test requests.
The mlserver package comes with built-in support for streaming data. This allows you to process data in real-time, without having to wait for the entire response to be available. It supports both REST and gRPC APIs.
In this example, we create a simple Identity Text Model which simply splits the input text into words and returns them one by one. We will use this model to demonstrate how to stream the response from the server to the client. This particular example can provide a good starting point for building more complex streaming models such as the ones based on Large Language Models (LLMs), where streaming is an essential feature to hide the latency of the model.
The next step will be to serve our model using mlserver. For that, we will first implement an extension that serves as the runtime to perform inference using our custom TextModel.
This is a trivial model to demonstrate streaming support. The model simply splits the input text into words and returns them one by one (a sketch follows the list below). In this example we do the following:
split the text into words using the white space as the delimiter.
wait 0.5 seconds between each word to simulate a slow model.
return each word one by one.
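A sketch of such a runtime is shown below; the StringCodec usage and the output name mirror the other custom runtime examples and should be treated as assumptions.

```python
# Sketch: a runtime that streams the words of the input text one by one.
import asyncio
from typing import AsyncIterator

from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class TextModel(MLModel):
    async def predict_stream(
        self, payloads: AsyncIterator[InferenceRequest]
    ) -> AsyncIterator[InferenceResponse]:
        # For REST, a single request is received; take the first payload
        payload = [_ async for _ in payloads][0]
        text = StringCodec.decode_input(payload.inputs[0])[0]

        # Split the text into words and stream them back one at a time
        for word in text.split(" "):
            await asyncio.sleep(0.5)  # simulate a slow model
            yield InferenceResponse(
                model_name=self.name,
                outputs=[StringCodec.encode_output(name="output", payload=[word])],
            )
```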
As can be seen, the predict_stream method receives as input an AsyncIterator of InferenceRequest and returns an AsyncIterator of InferenceResponse. This definition covers all types of possible input-output combinations for streaming: unary-stream, stream-unary, stream-stream. It is up to the client and server to send/receive the appropriate number of requests/responses, which should be known a priori.
Note that although unary-unary can be covered by the predict_stream method as well, mlserver already covers that through the predict method.
One important limitation to keep in mind is that for the REST API, the client will not be able to send a stream of requests. The client will have to send a single request with the entire input text. The server will then stream the response back to the client. gRPC API, on the other hand, supports all types of streaming listed above.
The next step will be to create 2 configuration files:
settings.json: holds the configuration of our server (e.g. ports, log level, etc.).
model-settings.json: holds the configuration of our model (e.g. input type, runtime to use, etc.).
Note that the streaming support in MLServer currently has the following limitations (a settings sketch follows this list):
distributed workers are not supported (i.e. the parallel_workers setting should be set to 0);
the gzip middleware is not supported for REST (i.e. the gzip_enabled setting should be set to false).
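A sketch of the corresponding settings.json, written from Python:

```python
# Sketch: settings.json reflecting the streaming limitations above.
import json

settings = {
    "debug": True,
    "parallel_workers": 0,   # distributed workers not supported with streaming
    "gzip_enabled": False,   # gzip middleware not supported for REST streaming
}

with open("settings.json", "w") as f:
    json.dump(settings, f, indent=2)
```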
Now that we have our config in place, we can start the server by running mlserver start . from the same directory where our config files are (or pointing the command to the folder where they are).
Since this command will start the server and block the terminal while it waits for requests, it will need to be run in the background or on a separate terminal.
To test our model, we will use the following inference request:
To send a REST streaming request to the server, we will use the following Python code:
To send a gRPC streaming request to the server, we will use the following Python code:
Note that for gRPC, the request is transformed into an async generator which is then passed to the ModelStreamInfer method. The response is also an async generator which can be iterated over to get the responses.