Batch Inference examples (local)
This example runs you through a series of batch inference requests made to both models and pipelines running on Seldon Core locally.
Deprecated: the MLServer CLI infer feature is experimental and will be removed in a future release.
Setup
If you haven't already, you'll need to clone the Seldon Core repository and run it locally before you run through this example.
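If you need a starting point, here is a minimal sketch of that setup. The repository URL is the standard one, but the branch and the make target are assumptions and may differ for your Seldon Core version:
# Clone the Seldon Core repository (the v2 branch is an assumption; pick the version you need)
git clone https://github.com/SeldonIO/seldon-core --branch=v2
cd seldon-core
# Start Seldon Core locally; this target is assumed as the counterpart of the
# `make undeploy-local` used at the end of this example
make deploy-local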
Deploy Models and Pipelines
First, let's jump into the samples folder, where we'll find some sample models and pipelines we can use:
cd samples/
Deploy the Iris Model
Let's take a look at a sample model before we deploy it:
cat models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/mlserver/iris"
  requirements:
  - sklearn
  memory: 100Ki
The above manifest will deploy a simple scikit-learn model based on the iris dataset.
Let's now deploy that model using the Seldon CLI:
seldon model load -f models/sklearn-iris-gs.yaml
Deploy the Iris Pipeline
Now that we've deployed our iris model, let's create a pipeline around the model.
cat pipelines/iris.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: iris-pipeline
spec:
  steps:
  - name: iris
  output:
    steps:
    - iris
We see that this pipeline only has one step, which is to call the iris model we deployed earlier. We can create the pipeline by running:
seldon pipeline load -f pipelines/iris.yaml
Deploy the Tensorflow Model
To demonstrate batch inference requests to different types of models, we'll also deploy a simple tensorflow model:
cat models/tfsimple1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: tfsimple1
spec:
  storageUri: "gs://seldon-models/triton/simple"
  requirements:
  - tensorflow
  memory: 100Ki
The tensorflow model takes two arrays as inputs and returns two arrays as outputs: the first output is the element-wise sum of the two inputs, and the second is the element-wise difference (first input minus second input).
Let's deploy the model:
seldon model load -f models/tfsimple1.yaml
Deploy the Tensorflow Pipeline
Just as we did for the scikit-learn model, we'll deploy a simple pipeline for our tensorflow model:
Inspect the pipeline manifest:
cat pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tfsimple
spec:
  steps:
  - name: tfsimple1
  output:
    steps:
    - tfsimple1
and deploy it:
seldon pipeline load -f pipelines/tfsimple.yaml
Check Model and Pipeline Status
Once we've deployed a model or pipeline to Seldon Core, we can list them and check their status by running:
seldon model list
and
seldon pipeline list
Your models and pipelines should show a state of ModelAvailable and PipelineReady respectively.
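If they aren't ready yet, you can also wait on a specific model or pipeline to reach the desired state. This is a small sketch, assuming your version of the Seldon CLI supports the -w flag on the status commands:
# Block until the iris model reports ModelAvailable
seldon model status iris -w ModelAvailable | jq -M .
# Block until the iris-pipeline reports PipelineReady
seldon pipeline status iris-pipeline -w PipelineReady | jq -M .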
Test Predictions
Before we run a large batch job of predictions through our models and pipelines, let's quickly check that they work with a single standalone inference request. We can do this using the seldon model infer command.
Scikit-learn Model
seldon model infer iris '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' | jq
{
"model_name": "iris_1",
"model_version": "1",
"id": "a67233c2-2f8c-4fbc-a87e-4e4d3d034c9f",
"parameters": {
"content_type": null,
"headers": null
},
"outputs": [
{
"name": "predict",
"shape": [
1
],
"datatype": "INT64",
"parameters": null,
"data": [
2
]
}
]
}
The prediction request body needs to be an Open Inference Protocol compatible payload and must also match the expected inputs for the model you've deployed. In this case, the iris model expects data of shape [1, 4] and of type FP32.
You'll notice that the prediction results for this request come back on outputs[0].data.
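The seldon CLI is just a convenience here: the same Open Inference Protocol request can be sent over plain HTTP. The sketch below assumes the default local setup, with the inference gateway listening on localhost:9000 and requests routed to the model by a Seldon-Model header; adjust both for your own deployment:
curl -s http://localhost:9000/v2/models/iris/infer \
  -H "Content-Type: application/json" \
  -H "Seldon-Model: iris" \
  -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' | jq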
Scikit-learn Pipeline
seldon pipeline infer iris-pipeline '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' | jq
{
"model_name": "",
"outputs": [
{
"data": [
2
],
"name": "predict",
"shape": [
1
],
"datatype": "INT64"
}
]
}
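Because pipeline requests flow through Kafka, you can also look at the data that passed through each pipeline step. This is a sketch assuming your version of the Seldon CLI provides the pipeline inspect command:
# Show the most recent request and response seen on the pipeline's step topics
seldon pipeline inspect iris-pipeline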
Tensorflow Model
seldon model infer tfsimple1 '{"outputs":[{"name":"OUTPUT0"}], "inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq
{
"model_name": "tfsimple1_1",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
]
}
You'll notice that the inputs for our tensorflow model look different from the ones we sent to the iris model. This time, we're sending two arrays of shape [1,16]. When sending an inference request, we can optionally choose which outputs we want back by including an {"outputs":...} object.
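For example, to get back only the subtraction result, we could request OUTPUT1 instead; this uses the same CLI and payload as above, with only the requested output changed:
seldon model infer tfsimple1 '{"outputs":[{"name":"OUTPUT1"}], "inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq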
Tensorflow Pipeline
seldon pipeline infer tfsimple '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
Running the Scikit-Learn Batch Job
In the samples folder there is a batch request input file: batch-inputs/iris-input.txt. It contains
100 input payloads for our iris model. Let's take a look at the first line in that file:
cat batch-inputs/iris-input.txt | head -n 1 | jq
{
"inputs": [
{
"name": "predict",
"data": [
0.38606369295833043,
0.006894049558299753,
0.6104082981607108,
0.3958954239450676
],
"datatype": "FP64",
"shape": [
1,
4
]
}
]
}
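Each line in the file is a self-contained Open Inference Protocol payload, so we can confirm the size of the batch by counting lines:
# Should report the 100 requests mentioned above, one per line
wc -l batch-inputs/iris-input.txt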
To run a batch inference job we'll use the MLServer CLI. If you don't already have it installed you can install it using:
pip install mlserver
Iris Model
The inference job can be executed by running the following command:
mlserver infer -u localhost:9000 -m iris -i batch-inputs/iris-input.txt -o /tmp/iris-output.txt --workers 5
2023-01-22 18:24:17,272 [mlserver] INFO - Using asyncio event-loop policy: uvloop
2023-01-22 18:24:17,273 [mlserver] INFO - server url: localhost:9000
2023-01-22 18:24:17,273 [mlserver] INFO - model name: iris
2023-01-22 18:24:17,273 [mlserver] INFO - request headers: {}
2023-01-22 18:24:17,273 [mlserver] INFO - input file path: batch-inputs/iris-input.txt
2023-01-22 18:24:17,273 [mlserver] INFO - output file path: /tmp/iris-output.txt
2023-01-22 18:24:17,273 [mlserver] INFO - workers: 5
2023-01-22 18:24:17,273 [mlserver] INFO - retries: 3
2023-01-22 18:24:17,273 [mlserver] INFO - batch interval: 0.0
2023-01-22 18:24:17,274 [mlserver] INFO - batch jitter: 0.0
2023-01-22 18:24:17,274 [mlserver] INFO - connection timeout: 60
2023-01-22 18:24:17,274 [mlserver] INFO - micro-batch size: 1
2023-01-22 18:24:17,420 [mlserver] INFO - Finalizer: processed instances: 100
2023-01-22 18:24:17,421 [mlserver] INFO - Total processed instances: 100
2023-01-22 18:24:17,421 [mlserver] INFO - Time taken: 0.15 seconds
The mlserver batch component will take your input file batch-inputs/iris-input.txt, distribute those payloads across 5 different workers (--workers 5), collect the responses, and write them to the file /tmp/iris-output.txt. For a full set of options, check out the MLServer CLI Reference.
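Because the requests are processed concurrently, the responses in the output file are not guaranteed to arrive in input order; each response should carry a batch_index parameter (you can see it in the tensorflow output later in this example) that maps it back to its input line. A quick sanity check, as a sketch:
# One response per line; we expect 100 of them
wc -l /tmp/iris-output.txt
# See which input each of the first few responses corresponds to
head -n 5 /tmp/iris-output.txt | jq '.parameters.batch_index'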
Checking the Output
We can check the inference responses by looking at the contents of the output file:
cat /tmp/iris-output.txt | head -n 1 | jq
Iris Pipeline
We can run the same batch job for our iris pipeline and store the outputs in a different file. Note that we target the pipeline by passing its name with a .pipeline suffix (iris-pipeline.pipeline) as the model name:
mlserver infer -u localhost:9000 -m iris-pipeline.pipeline -i batch-inputs/iris-input.txt -o /tmp/iris-pipeline-output.txt --workers 5
2023-01-22 18:25:18,651 [mlserver] INFO - Using asyncio event-loop policy: uvloop
2023-01-22 18:25:18,653 [mlserver] INFO - server url: localhost:9000
2023-01-22 18:25:18,653 [mlserver] INFO - model name: iris-pipeline.pipeline
2023-01-22 18:25:18,653 [mlserver] INFO - request headers: {}
2023-01-22 18:25:18,653 [mlserver] INFO - input file path: batch-inputs/iris-input.txt
2023-01-22 18:25:18,653 [mlserver] INFO - output file path: /tmp/iris-pipeline-output.txt
2023-01-22 18:25:18,653 [mlserver] INFO - workers: 5
2023-01-22 18:25:18,653 [mlserver] INFO - retries: 3
2023-01-22 18:25:18,653 [mlserver] INFO - batch interval: 0.0
2023-01-22 18:25:18,653 [mlserver] INFO - batch jitter: 0.0
2023-01-22 18:25:18,653 [mlserver] INFO - connection timeout: 60
2023-01-22 18:25:18,653 [mlserver] INFO - micro-batch size: 1
2023-01-22 18:25:18,963 [mlserver] INFO - Finalizer: processed instances: 100
2023-01-22 18:25:18,963 [mlserver] INFO - Total processed instances: 100
2023-01-22 18:25:18,963 [mlserver] INFO - Time taken: 0.31 seconds
Checking the Output
We can check the inference responses by looking at the contents of the output file:
cat /tmp/iris-pipeline-output.txt | head -n 1 | jq
Running the Tensorflow Batch Job
The samples folder contains an example batch input for the tensorflow model, just as it did for
the scikit-learn model. You can find it at batch-inputs/tfsimple-input.txt. Let's take a look
at the first inference request in the file:
cat batch-inputs/tfsimple-input.txt | head -n 1 | jq
{
"inputs": [
{
"name": "INPUT0",
"data": [
75,
39,
9,
44,
32,
97,
99,
40,
13,
27,
25,
36,
18,
77,
62,
60
],
"datatype": "INT32",
"shape": [
1,
16
]
},
{
"name": "INPUT1",
"data": [
39,
7,
14,
58,
13,
88,
98,
66,
97,
57,
49,
3,
49,
63,
37,
12
],
"datatype": "INT32",
"shape": [
1,
16
]
}
]
}
Tensorflow Model
As before, we can run the inference batch job using the mlserver infer command:
mlserver infer -u localhost:9000 -m tfsimple1 -i batch-inputs/tfsimple-input.txt -o /tmp/tfsimple-output.txt --workers 10
2023-01-23 14:56:10,870 [mlserver] INFO - Using asyncio event-loop policy: uvloop
2023-01-23 14:56:10,872 [mlserver] INFO - server url: localhost:9000
2023-01-23 14:56:10,872 [mlserver] INFO - model name: tfsimple1
2023-01-23 14:56:10,872 [mlserver] INFO - request headers: {}
2023-01-23 14:56:10,872 [mlserver] INFO - input file path: batch-inputs/tfsimple-input.txt
2023-01-23 14:56:10,872 [mlserver] INFO - output file path: /tmp/tfsimple-output.txt
2023-01-23 14:56:10,872 [mlserver] INFO - workers: 10
2023-01-23 14:56:10,872 [mlserver] INFO - retries: 3
2023-01-23 14:56:10,872 [mlserver] INFO - batch interval: 0.0
2023-01-23 14:56:10,872 [mlserver] INFO - batch jitter: 0.0
2023-01-23 14:56:10,872 [mlserver] INFO - connection timeout: 60
2023-01-23 14:56:10,872 [mlserver] INFO - micro-batch size: 1
2023-01-23 14:56:11,077 [mlserver] INFO - Finalizer: processed instances: 100
2023-01-23 14:56:11,077 [mlserver] INFO - Total processed instances: 100
2023-01-23 14:56:11,078 [mlserver] INFO - Time taken: 0.21 seconds
Checking the Output
We can check the inference responses by looking at the contents of the output file:
cat /tmp/tfsimple-output.txt | head -n 1 | jq
You should get the following response:
{
"model_name": "tfsimple1_1",
"model_version": "1",
"id": "54e6c237-8356-4c3c-96b5-2dca4596dbe9",
"parameters": {
"batch_index": 0,
"inference_id": "54e6c237-8356-4c3c-96b5-2dca4596dbe9"
},
"outputs": [
{
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32",
"parameters": {},
"data": [
114,
46,
23,
102,
45,
185,
197,
106,
110,
84,
74,
39,
67,
140,
99,
72
]
},
{
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32",
"parameters": {},
"data": [
36,
32,
-5,
-14,
19,
9,
1,
-26,
-84,
-30,
-24,
33,
-31,
14,
25,
48
]
}
]
}
Tensorflow Pipeline
As before, we can run the same batch job against the tfsimple pipeline, again using the pipeline name with the .pipeline suffix as the model name:
mlserver infer -u localhost:9000 -m tfsimple.pipeline -i batch-inputs/tfsimple-input.txt -o /tmp/tfsimple-pipeline-output.txt --workers 10
2023-01-23 14:56:10,870 [mlserver] INFO - Using asyncio event-loop policy: uvloop
2023-01-23 14:56:10,872 [mlserver] INFO - server url: localhost:9000
2023-01-23 14:56:10,872 [mlserver] INFO - model name: tfsimple.pipeline
2023-01-23 14:56:10,872 [mlserver] INFO - request headers: {}
2023-01-23 14:56:10,872 [mlserver] INFO - input file path: batch-inputs/tfsimple-input.txt
2023-01-23 14:56:10,872 [mlserver] INFO - output file path: /tmp/tfsimple-pipeline-output.txt
2023-01-23 14:56:10,872 [mlserver] INFO - workers: 10
2023-01-23 14:56:10,872 [mlserver] INFO - retries: 3
2023-01-23 14:56:10,872 [mlserver] INFO - batch interval: 0.0
2023-01-23 14:56:10,872 [mlserver] INFO - batch jitter: 0.0
2023-01-23 14:56:10,872 [mlserver] INFO - connection timeout: 60
2023-01-23 14:56:10,872 [mlserver] INFO - micro-batch size: 1
2023-01-23 14:56:11,077 [mlserver] INFO - Finalizer: processed instances: 100
2023-01-23 14:56:11,077 [mlserver] INFO - Total processed instances: 100
2023-01-23 14:56:11,078 [mlserver] INFO - Time taken: 0.25 seconds
Checking the Output
We can check the inference responses by looking at the contents of the output file:
cat /tmp/tfsimple-pipeline-output.txt | head -n 1 | jq
Cleaning Up
Now that we've run our batch examples, let's remove the models and pipelines we created:
seldon model unload iris
seldon model unload tfsimple1
seldon pipeline unload iris-pipeline
seldon pipeline unload tfsimple
And finally let's spin down our local instance of Seldon Core:
cd ../ && make undeploy-local