Now that we have our config in place, we can start the server with the mlserver start command shown below. It needs to be run either from the same directory that contains our config files or by pointing to that folder.
mlserver start .
Since this command starts the server and blocks the terminal while waiting for requests, it will need to be run in the background or in a separate terminal.
Once again, you are able to run the model using the MLServer CLI. As before, this needs to be run either from the same directory that contains our config files or by pointing to that folder.
mlserver start .
Send Test Request to Optimum Optimized Model
The request can now be sent using the same request structure, but this time against the Optimum-optimized model for better performance.
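For reference, a request along these lines could be sent with Python's requests library. This is only a sketch: the port 8080 is MLServer's default HTTP port, and the prompt string is an assumption inferred from the response shown below.

import requests

# Sketch of an inference request against the V2 inference endpoint.
# The model name "transformer" matches the response shown below.
inference_request = {
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["this is a test"],
        }
    ]
}

response = requests.post(
    "http://localhost:8080/v2/models/transformer/infer", json=inference_request
)
print(response.json())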
{'model_name': 'transformer',
'id': '9c482c8d-b21e-44b1-8a42-7650a9dc01ef',
'parameters': {},
'outputs': [{'name': 'output',
'shape': [1, 1],
'datatype': 'BYTES',
'parameters': {'content_type': 'hg_jsonlist'},
'data': ['{"generated_text": "this is a test of the \\"safe-code-safe-code-safe-code\\" approach. The method only accepts two parameters as parameters: the code. The parameter \'unsafe-code-safe-code-safe-code\' should"}']}]}
Testing Supported Tasks
The Hugging Face runtime supports several tasks beyond text generation; below are examples for a few of the other supported tasks.
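As one sketch, a question-answering model could be configured with a model-settings.json along these lines, assuming the same mlserver_huggingface.HuggingFaceRuntime implementation used earlier and the Hugging Face pipeline task names:

{
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "parameters": {
        "extra": {
            "task": "question-answering"
        }
    }
}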
Once again, you are able to run the model using the MLServer CLI.
mlserver start .
inference_request = {
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["This is a generation for the work" for i in range(512)],
        }
    ]
}
# Benchmark time
import time

import requests

start_time = time.monotonic()
requests.post(
    "http://localhost:8080/v2/models/transformer/infer", json=inference_request
)
print(f"Elapsed time: {time.monotonic() - start_time}")
Elapsed time: 66.42268538899953
We can see that it takes around 66 seconds, which is roughly 6 times longer than the GPU example below.
Testing with GPU
IMPORTANT: Running the code below requires a machine with a GPU correctly configured for TensorFlow/PyTorch.
Now we'll run the benchmark with the GPU configured, which we can do by setting device=0 in the model's configuration.
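A sketch of how this might look in model-settings.json, assuming the Hugging Face runtime accepts a device option under parameters.extra alongside the task used earlier:

{
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "parameters": {
        "extra": {
            "task": "text-generation",
            "device": 0
        }
    }
}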
inference_request = {
    "inputs": [
        {
            "name": "text_inputs",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["This is a generation for the work" for i in range(512)],
        }
    ]
}
# Benchmark time
import time

start_time = time.monotonic()
requests.post(
    "http://localhost:8080/v2/models/transformer/infer", json=inference_request
)
print(f"Elapsed time: {time.monotonic() - start_time}")
Elapsed time: 11.27933280000434
We can see that the elapsed time is roughly 6 times lower than the CPU version!
Adaptive Batching with GPU
We can also see how the adaptive batching capabilities allow for better GPU utilization by grouping multiple incoming requests so they get processed on the GPU as a single batch.
In our case we can enable adaptive batching with the max_batch_size setting, which we will set to 128.
We will also configure max_batch_time, which specifies the maximum amount of time the MLServer orchestrator will wait before sending the accumulated batch for inference.
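As a sketch, these two settings could be added to the model-settings.json used above; max_batch_size and max_batch_time are standard MLServer model settings, while the max_batch_time value of 1 second shown here is an assumption for illustration:

{
    "name": "transformer",
    "implementation": "mlserver_huggingface.HuggingFaceRuntime",
    "max_batch_size": 128,
    "max_batch_time": 1,
    "parameters": {
        "extra": {
            "task": "text-generation",
            "device": 0
        }
    }
}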