Seldon inference is built from atomic Model components. Models as shown here cover a wide range of artifacts including:
Core machine learning models, e.g. a PyTorch model.
Feature transformations that might be built with custom python code.
Drift detectors.
Outlier detectors.
Explainers.
Adversarial detectors.
A typical workflow for a production machine learning setup might be as follows:
You create a TensorFlow model for your core application use case and test this model in isolation to validate it.
You create an SKLearn feature-transformation component that runs before your model to convert the input into the correct form. You also create drift and outlier detectors using Seldon's open-source Alibi Detect library and test these in isolation.
You join these components together into a Pipeline for the final production setup.
These steps are shown in the diagram below:
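Alongside the diagram, a minimal sketch of what the manifests for such a setup might look like is shown below. All names, storage URIs, and requirements are placeholders, and the drift/outlier detectors are omitted for brevity; check the Seldon Core 2 reference for the full schema.

```bash
# Hypothetical Model manifests for the feature transform and the core model.
cat > transform.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: income-transform            # placeholder SKLearn feature transform
spec:
  storageUri: "gs://my-bucket/income-transform"   # placeholder artifact location
  requirements:
  - sklearn
EOF

cat > classifier.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: income-classifier           # placeholder core TensorFlow model
spec:
  storageUri: "gs://my-bucket/income-classifier"  # placeholder artifact location
  requirements:
  - tensorflow
EOF

# Hypothetical Pipeline joining the components into one flow.
cat > pipeline.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: income-pipeline
spec:
  steps:
  - name: income-transform
  - name: income-classifier
    inputs:
    - income-transform
  output:
    steps:
    - income-classifier
EOF

seldon model load -f transform.yaml
seldon model load -f classifier.yaml
seldon pipeline load -f pipeline.yaml
```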
This section provides some examples of operating Seldon so you can test running your own models, experiments, pipelines, and explainers.
We use a simple SKLearn iris classification model; a sketch of the corresponding CLI calls follows the steps below.
Load the model
Wait for the model to be ready
Do a REST inference call
Do a gRPC inference call
Unload the model
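As a rough sketch (not verbatim from the docs), the steps above could look like this with the Seldon CLI. The model name iris, the placeholder storageUri, and the predict input tensor name are assumptions based on the standard SKLearn iris sample.

```bash
# Hypothetical Model manifest for an SKLearn iris classifier.
cat > iris.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://my-bucket/iris-sklearn"   # placeholder artifact location
  requirements:
  - sklearn
EOF

seldon model load -f iris.yaml               # load the model
seldon model status iris -w ModelAvailable   # wait until it is ready

# REST inference call (Open Inference Protocol payload)
seldon model infer iris \
  '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'

# gRPC inference call (note the protobuf-style "contents" field)
seldon model infer iris --inference-mode grpc \
  '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}'

seldon model unload iris                     # unload the model
```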
We run a simple TensorFlow model. Note the requirements section specifying tensorflow; a sketch of the manifest and CLI calls follows the steps below.
Load the model.
Wait for the model to be ready.
Get model metadata
Do a REST inference call.
Do a gRPC inference call
Unload the model
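A sketch of the equivalent flow for the TensorFlow model. The model name tfsimple1, the placeholder storageUri, and the INT32 [1,16] input tensors are assumptions based on the Triton "simple" example; the gRPC call follows the same pattern as the iris example above.

```bash
# Hypothetical Model manifest; note the requirements section specifying tensorflow.
cat > tfsimple1.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: tfsimple1
spec:
  storageUri: "gs://my-bucket/tfsimple"   # placeholder artifact location
  requirements:
  - tensorflow
EOF

seldon model load -f tfsimple1.yaml
seldon model status tfsimple1 -w ModelAvailable

# Get the model metadata (Open Inference Protocol metadata via the CLI).
seldon model metadata tfsimple1

# REST inference call: two INT32 tensors of shape [1,16].
seldon model infer tfsimple1 \
  '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'

seldon model unload tfsimple1
```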
We will use two SKLearn Iris classification models to illustrate an experiment; a sketch of the experiment manifest and calls follows the steps below.
Load both models.
Wait for both models to be ready.
Create an experiment that modifies the iris model to add a second model splitting traffic 50/50 between the two.
Start the experiment.
Wait for the experiment to be ready.
Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.
Run one more request
Use the sticky session key passed by the last inference request to ensure the same route is taken each time. We will test REST and gRPC.
Stop the experiment
Show that all requests now go to the original model.
Unload both models.
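A sketch of the experiment flow above. The experiment name and the candidates/weight field names are assumptions based on the Experiment examples in the Seldon docs, so verify them against your installed CRD version.

```bash
# Hypothetical Experiment manifest splitting traffic 50/50 between iris and iris2.
cat > ab-experiment.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-sample
spec:
  default: iris            # requests addressed to iris are subject to the experiment
  candidates:
  - name: iris
    weight: 50
  - name: iris2
    weight: 50
EOF

seldon experiment start -f ab-experiment.yaml
seldon experiment status experiment-sample -w | jq -M .

# Run a batch of calls and count which model served each one; expect a rough 50/50 split.
for i in $(seq 1 50); do
  seldon model infer iris \
    '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' \
    | jq -r .model_name
done | sort | uniq -c

seldon experiment stop experiment-sample
```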
Examples of various model artifact types from different frameworks running under Seldon Core 2:
SKLearn
TensorFlow
XGBoost
ONNX
LightGBM
MLflow
PyTorch
Python requirements in model-zoo-requirements.txt
The training code for this model can be found at scripts/models/iris in the SCv2 repo.
The training code for this model can be found at ./scripts/models/income-xgb
This is a pretrained model, as defined in the ./scripts/models/Makefile target mnist-onnx.
The training code for this model can be found at ./scripts/models/income-lgb
The training code for this model can be found at ./scripts/models/wine-mlflow
This example model is downloaded and trained in the ./scripts/models/Makefile target mnist-pytorch.
This notebook illustrates a series of Pipelines showing different ways of combining flows of data and conditional logic. We assume you have Seldon Core 2 running locally.
Other models can be found at https://github.com/SeldonIO/triton-python-examples
Chain the output of one model into the next. This also shows changing the tensor names via tensorMap to conform to the expected input tensor names of the second model.
The pipeline below chains the output of tfsimple1 into tfsimple2. As these models have compatible shapes and data types this can be done. However, the output tensor names from tfsimple1 need to be renamed to match the input tensor names for tfsimple2. We do this with the tensorMap feature.
The output of the Pipeline is the output from tfsimple2.
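Assuming the models are named tfsimple1 and tfsimple2 as above, a Pipeline manifest along the following lines expresses this chaining; check the Pipeline reference for the exact schema of your installed version.

```bash
cat > tfsimples.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tfsimples              # chains tfsimple1 into tfsimple2
spec:
  steps:
  - name: tfsimple1
  - name: tfsimple2
    inputs:
    - tfsimple1
    tensorMap:                 # rename tfsimple1 outputs to tfsimple2's expected inputs
      tfsimple1.outputs.OUTPUT0: INPUT0
      tfsimple1.outputs.OUTPUT1: INPUT1
  output:
    steps:
    - tfsimple2                # the pipeline output is tfsimple2's output
EOF
seldon pipeline load -f tfsimples.yaml
```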
We use the Seldon CLI pipeline inspect feature to look at the data for all steps of the pipeline for the last data item passed through the pipeline (the default). This can be useful for debugging.
Next, we get the output as JSON and use the jq tool to extract just one value.
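A sketch of those two calls, assuming the pipeline is named tfsimples as above; the JSON output flag and the exact JSON structure consumed by jq are assumptions to be checked against your CLI version.

```bash
# Inspect the most recent data that flowed through each pipeline step (useful for debugging).
seldon pipeline inspect tfsimples

# Assumed JSON output mode and message structure; adjust the jq path to what inspect returns.
seldon pipeline inspect tfsimples --format json | jq -M '.topics[0].msgs[0].value'
```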
Chain the output of one model into the next. This shows how step inputs and outputs can be used and combined.
Join two flows of data from two models as input to a third model. This shows how individual flows of data can be combined.
In the pipeline below, the input to tfsimple3 joins one output tensor from each of the two previous models, tfsimple1 and tfsimple2. We need to use the tensorMap feature to rename each output tensor to one of the expected input tensors for the tfsimple3 model.
The output is the sequence "2,4,6...", which conforms to the logic of this model (addition and subtraction) when fed the outputs of the first two models.
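Assuming the model names above, such a joining pipeline might be expressed as follows (schema per your installed Pipeline CRD).

```bash
cat > tfsimples-join.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tfsimples-join
spec:
  steps:
  - name: tfsimple1
  - name: tfsimple2
  - name: tfsimple3
    inputs:                    # join one output tensor from each upstream model
    - tfsimple1.outputs.OUTPUT0
    - tfsimple2.outputs.OUTPUT1
    tensorMap:                 # rename them to tfsimple3's expected input tensors
      tfsimple1.outputs.OUTPUT0: INPUT0
      tfsimple2.outputs.OUTPUT1: INPUT1
  output:
    steps:
    - tfsimple3
EOF
seldon pipeline load -f tfsimples-join.yaml
```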
Shows conditional data flows: one of two models is run based on the output tensors from the first.
Here we assume the conditional model can output two tensors, OUTPUT0 and OUTPUT1, but only outputs the former if the CHOICE input tensor is set to 0; otherwise it outputs tensor OUTPUT1. By this means only one of the two downstream models will receive data and run. The output step does an any join from both models, and whichever data appears first will be sent as the output of the pipeline. As only one of the two models add10 and mul10 runs in this case, we will receive its output.
The mul10 model will run, as the CHOICE tensor is set to 0.
The add10 model will run, as the CHOICE tensor is not set to zero.
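A sketch of such a conditional pipeline, assuming the step and tensor names above; the stepsJoin field name is taken from the Pipeline examples and should be verified against your installed CRD version.

```bash
cat > tfsimple-conditional.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tfsimple-conditional
spec:
  steps:
  - name: conditional
  - name: mul10
    inputs:
    - conditional.outputs.OUTPUT0   # produced only when CHOICE is 0
  - name: add10
    inputs:
    - conditional.outputs.OUTPUT1   # produced otherwise
  output:
    steps:
    - mul10
    - add10
    stepsJoin: any                  # forward whichever branch produces data
EOF
seldon pipeline load -f tfsimple-conditional.yaml
```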
Access to individual tensors in pipeline inputs
This pipeline shows how we can access pipeline inputs INPUT0 and INPUT1 from different steps.
Shows how joins can be used for triggers as well.
Here we require tensors named ok1 or ok2 to exist on the pipeline inputs to run the mul10 model, but require tensor ok3 to exist on the pipeline inputs to run the add10 model. The logic on mul10 is handled by a trigger join of any, meaning either of these input tensors can exist to satisfy the trigger join.
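A sketch of that trigger logic. The pipeline name, the INPUT tensor name, and the triggers/triggersJoinType field names are assumptions to be checked against the Pipeline reference for your version.

```bash
cat > trigger-joins.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: trigger-joins
spec:
  steps:
  - name: mul10
    inputs:
    - trigger-joins.inputs.INPUT
    triggers:                        # runs if ok1 OR ok2 is present
    - trigger-joins.inputs.ok1
    - trigger-joins.inputs.ok2
    triggersJoinType: any
  - name: add10
    inputs:
    - trigger-joins.inputs.INPUT
    triggers:                        # runs only if ok3 is present
    - trigger-joins.inputs.ok3
  output:
    steps:
    - mul10
    - add10
    stepsJoin: any
EOF
seldon pipeline load -f trigger-joins.yaml
```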
We will use two SKLearn Iris classification models to illustrate experiments.
Load both models.
Wait for both models to be ready.
Create an experiment that modifies the iris model to add a second model splitting traffic 50/50 between the two.
Start the experiment.
Wait for the experiment to be ready.
Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.
Show the sticky session header x-seldon-route that is returned.
Use the sticky session key passed by the last inference request to ensure the same route is taken each time, as sketched below.
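A sketch of how the sticky session key might be reused, assuming a local install exposing the Seldon mesh on localhost:9000 and that the returned x-seldon-route header is honoured when passed back on later requests.

```bash
# First request: capture the x-seldon-route header identifying the chosen candidate.
ROUTE=$(curl -si http://localhost:9000/v2/models/iris/infer \
  -H "Content-Type: application/json" -H "Seldon-Model: iris" \
  -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' \
  | grep -i '^x-seldon-route:' | awk '{print $2}' | tr -d '\r')
echo "sticky route: ${ROUTE}"

# Reuse the route on subsequent requests so they hit the same model.
curl -s http://localhost:9000/v2/models/iris/infer \
  -H "Content-Type: application/json" -H "Seldon-Model: iris" \
  -H "x-seldon-route: ${ROUTE}" \
  -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
```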
Stop the experiment
Unload both models.
We will use two SKLearn Iris classification models to illustrate a model with a mirror.
Load both models.
Wait for both models to be ready.
Create an experiment that modifies the iris model so that traffic to iris is also mirrored to iris2 (a sketch of such a manifest follows these steps).
Start the experiment.
Wait for the experiment to be ready.
We get responses from iris, but all requests will also have been mirrored to iris2.
We can check the local Prometheus port on the agent to validate that requests went to iris2.
Stop the experiment
Unload both models.
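The mirror experiment used in the steps above might be expressed roughly as follows; the mirror/percent/weight field names follow the Experiment examples in the Seldon docs and should be verified against your installed CRD version.

```bash
# Hypothetical Experiment manifest mirroring traffic for iris to iris2.
# Mirrored requests do not affect the response returned to the caller.
cat > mirror-experiment.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: sklearn-mirror
spec:
  default: iris
  candidates:
  - name: iris
    weight: 100          # all responses come from iris
  mirror:
    name: iris2
    percent: 100         # every request is also copied to iris2
EOF
seldon experiment start -f mirror-experiment.yaml
```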
Let's check that the mul10 model was called.
Let's do an HTTP call and check the two models again.
This notebook will show how we can update running experiments.
We will use three SKLearn Iris classification models to illustrate experiment updates.
Load all models.
Let's call all three models individually first.
We will start an experiment to change the iris endpoint to split traffic with the iris2 model.
Now when we call the iris model we should see a roughly 50/50 split between the two models.
Now we update the experiment to change to a split with the iris3 model.
Now we should see a split with the iris3 model.
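One way to express the update is to re-apply an Experiment manifest with the same name but a changed candidate list; the names and field names here are assumptions based on the examples above.

```bash
# Same experiment name, but the iris2 candidate is swapped for iris3.
cat > experiment.yaml <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-sample
spec:
  default: iris
  candidates:
  - name: iris
    weight: 50
  - name: iris3          # previously iris2
    weight: 50
EOF
seldon experiment start -f experiment.yaml
seldon experiment status experiment-sample -w | jq -M .
```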
Now that the experiment has been stopped, we check everything is as before.
Here we test changing the model whose traffic we want to split. We will use three SKLearn Iris classification models to illustrate this.
Let's call all three models to verify initial conditions.
Now we start an experiment to change calls to the iris model to split with the iris2 model.
Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.
Now let's change the model the experiment modifies to the iris3 model, splitting between that and iris2.
Let's check that the iris model now behaves as before, but the iris3 model has its traffic split.
Finally, let's check that now the experiment has stopped everything is as at the start.
To install tritonclient:
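The client libraries are published on PyPI; a typical install looks like the following (the [all] extra is an assumption that you want both the HTTP and gRPC clients).

```bash
pip install "tritonclient[all]"
```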
Note: binary data support in HTTP is blocked by https://github.com/SeldonIO/seldon-core-v2/issues/475
This example runs you through a series of batch inference requests made to both models and pipelines running on Seldon Core locally.
Deprecated: The MLServer CLI infer feature is experimental and will be removed in future work.
First, let's jump into the samples folder where we'll find some sample models and pipelines we can use:
Let's take a look at a sample model before we deploy it:
Let's now deploy that model using the Seldon CLI:
We see that this pipeline only has one step, which is to call the iris model we deployed earlier. We can create the pipeline by running:
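A sketch of the two deploy commands, assuming the sample manifests sit under models/ and pipelines/ in the samples folder (adjust the paths to your checkout).

```bash
seldon model load -f models/sklearn-iris-gs.yaml   # deploy the iris model
seldon pipeline load -f pipelines/iris.yaml        # deploy the one-step pipeline
```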
The tensorflow model takes two arrays as inputs and returns two arrays as outputs. The first output is the addition of the two inputs and the second output is the value of (first input - second input).
Let's deploy the model:
Just as we did for the scikit-learn model, we'll deploy a simple pipeline for our tensorflow model:
Inspect the pipeline manifest:
and deploy it:
Once we've deployed a model or pipeline to Seldon Core, we can list them and check their status.
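With the Seldon CLI this looks something like:

```bash
# List what is currently deployed and its status.
seldon model list
seldon pipeline list
```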
Your models and pipelines should be showing a state of ModelAvailable and PipelineReady respectively.
Before we run a large batch job of predictions through our models and pipelines, let's quickly check that they work with a single standalone inference request. We can do this using the seldon model infer command.
The prediction request body needs to be an Open Inference Protocol compatible payload and must also match the expected inputs for the model you've deployed. In this case, the iris model expects data of shape [1, 4] and of type FP32.
You'll notice that the prediction results for this request come back on outputs[0].data.
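A sketch of such a request; the predict input tensor name is an assumption based on the standard SKLearn iris sample.

```bash
# Single REST inference request against the deployed iris model.
seldon model infer iris \
  '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
```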
You'll notice that the inputs for our tensorflow model look different from the ones we sent to the iris model. This time, we're sending two arrays of shape [1,16]. When sending an inference request, we can optionally choose which outputs we want back by including an {"outputs":...} object.
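A sketch of such a request, assuming the model is deployed under the name tfsimple and takes INT32 tensors as in the Triton "simple" example; the outputs object restricts the response to OUTPUT0.

```bash
# Two [1,16] input tensors plus an explicit outputs selection.
seldon model infer tfsimple \
  '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}],"outputs":[{"name":"OUTPUT0"}]}'
```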
In the samples folder there is a batch request input file, batch-inputs/iris-input.txt, which contains 100 input payloads for our iris model. Let's take a look at the first line in that file:
To run a batch inference job we'll use the MLServer CLI. If you don't already have it installed you can install it using:
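MLServer is published on PyPI, so an install along these lines should work:

```bash
pip install mlserver
```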
The inference job can be executed by running the following command:
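A command along the following lines should work, using the input file, worker count, and output path described below; check mlserver infer --help for the exact flags of your MLServer version.

```bash
# Batch inference against the locally running iris model over the V2/OIP endpoint.
mlserver infer -u localhost:9000 -m iris \
  -i batch-inputs/iris-input.txt \
  -o /tmp/iris-output.txt \
  --workers 5
```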
The mlserver batch component will take your input file batch-inputs/iris-input.txt, distribute those payloads across 5 different workers (--workers 5), collect the responses, and write them to a file /tmp/iris-output.txt. For a full set of options, check out the MLServer CLI Reference.
We can check the inference responses by looking at the contents of the output file:
We can run the same batch job for our iris pipeline and store the outputs in a different file:
We can check the inference responses by looking at the contents of the output file:
The samples folder contains an example batch input for the tensorflow model, just as it did for the scikit-learn model. You can find it at batch-inputs/tfsimple-input.txt. Let's take a look at the first inference request in the file:
As before, we can run the inference batch job using the mlserver infer command:
We can check the inference responses by looking at the contents of the output file:
You should get the following response:
We can check the inference responses by looking at the contents of the output file:
Now that we've run our batch examples, let's remove the models and pipelines we created:
And finally let's spin down our local instance of Seldon Core:
In this example we create a Pipeline to chain two HuggingFace models to provide speech-to-sentiment functionality, and add an explainer to understand the result.
This example also illustrates how explainers can target pipelines to allow complex explanations flows.
This example requires the ffmpeg package to be installed locally. Run make install-requirements for the Python dependencies.
Create a method to load speech from the recorder, transform it into mp3, and send it as base64 data. When the result returns, extract and show the text and sentiment.
We will load two Huggingface models for speech to text and text to sentiment.
To allow Alibi-Explain to more easily explain the sentiment, we will need:
Input and output transforms that take the dict values consumed and produced by the HuggingFace sentiment model and turn them into values that Alibi-Explain can easily understand, namely the core values we want to explain and the outputs from the sentiment model.
A separate Pipeline to allow us to join the sentiment model with the output transform.
These transform models are MLServer custom runtimes as shown below:
We can now create the final pipeline that will take speech and generate sentiment, along with an explanation of why that sentiment was predicted.
We will wait for the explanation, which runs asynchronously to the functional output from the Pipeline above.
To run this notebook you need the inference data. This can be acquired in two ways:
Run make train, or
gsutil cp -R gs://seldon-models/scv2/examples/income/infer-data .
Show predictions from the reference set. There should be no drift or outliers.
Show predictions from the drift data. There should be drift but probably no outliers.
Show predictions from the outlier data. There should be outliers but probably no drift.