Multimodel serving
The following demonstrates how to use the multimodel serving (MMS) capabilities of the Local runtime on top of Seldon Core 2. We will showcase how to deploy two models on a single GPU, using the transformers and the deepspeed backends. To run this example you will need Core 2 up and running, with a Local Runtime deployed on it, and a GPU with compute capability >= 8.0 for the deepspeed backend. Please check our installation tutorial to see how you can do so.
We begin with the transformers backend, for which we deploy a gpt2 model. The associated model-settings.json file is the following:
!cat models/local-gpt2-transformers/model-settings.json
{
  "name": "local-gpt2-transformers",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "transformers",
      "config": {
        "model_type": "completions",
        "model_settings": {
          "model": "gpt2",
          "device": "cuda",
          "gpu_memory_utilization": 0.3
        }
      }
    }
  }
}
Note that besides the model name and the device to be used, we also specified the gpu_memory_utilization. This is an indication for the scheduler to use approximately 30% of the available GPU memory when performing inference. For the transformers backend no memory is allocated a priori, but the scheduler will try its best to use the specified amount of GPU memory.
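As a rough illustration of what this setting implies, the fraction simply scales the total memory of the card; the 24 GB figure below is a hypothetical example, not something assumed by the deployment above:
# Hypothetical illustration of the gpu_memory_utilization setting above.
# The 24 GB total is an assumed example card, not a requirement of this demo.
total_gpu_memory_gb = 24
gpu_memory_utilization = 0.3  # value from model-settings.json

budget_gb = total_gpu_memory_gb * gpu_memory_utilization
print(f"Approximate GPU memory budget for gpt2: {budget_gb:.1f} GB")  # ~7.2 GB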
To deploy the model on Core 2, run the following command:
!kubectl apply -f manifests/local-gpt2-transformers.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/local-gpt2-transformers created
model.mlops.seldon.io/local-gpt2-transformers condition met
We can now move on to our second model, opt-125m, which will be deployed using the deepspeed backend. The associated model-settings.json file is the following:
!cat models/local-opt-125m-deepspeed/model-settings.json
{
  "name": "local-opt-125m-deepspeed",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "deepspeed",
      "config": {
        "model_type": "completions",
        "model_settings": {
          "model": "facebook/opt-125m",
          "inference_engine_config": {
            "state_manager": {
              "memory_config": {
                "mode": "allocate",
                "size": 50
              }
            }
          }
        }
      }
    }
  }
}
Memory specification for the deepspeed backend is a bit more complex. In the settings above we specify the allocation of 50 blocks of KV-cache. Each block may store the associated KV-cache for 32 or 64 tokens, depending on the model. We will make our best effort to standardize the GPU memory size specification across all backends in the near future. For more information about allocating KV-cache memory for the deepspeed backend, please consult the docs (TODO: add link to the docs).
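As a rough sketch of what the block count implies, the number of tokens whose KV-cache can be held at once is bounded by the number of blocks times the model-dependent tokens per block; the figures below simply restate the values mentioned above:
# Rough KV-cache capacity implied by the memory_config above.
# tokens_per_block is model-dependent (32 or 64, as noted in the text).
kv_cache_blocks = 50
tokens_per_block = 64

max_cached_tokens = kv_cache_blocks * tokens_per_block
print(f"KV-cache capacity: up to {max_cached_tokens} tokens")  # 3200 at 64 tokens/block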
We can now deploy opt-125m on Core 2 by running the following command:
!kubectl apply -f manifests/local-opt-125m-deepspeed.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/local-opt-125m-deepspeed created
Both our models are now deployed and we are ready to send an inference request:
import pprint
import requests
import subprocess
def get_mesh_ip():
    cmd = f"kubectl get svc seldon-mesh -n seldon -o jsonpath='{{.status.loadBalancer.ingress[0].ip}}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')


inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["user"]
        },
        {
            "name": "prompt",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["What is the capital of France?"],
        },
    ],
    "parameters": {
        "kwargs": {
            "temperature": 0.0,
        }
    }
}
We can now send the request to the gpt2 model:
endpoint = f"http://{get_mesh_ip()}/v2/models/local-gpt2-transformers/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': 'd1781451-f266-49fc-bcb8-507969da8da4',
'model_name': 'local-gpt2-transformers_1',
'outputs': [{'data': ['\n'
'\n'
'The capital of France is Paris.\n'
'\n'
'The capital of France is Paris.\n'
'\n'],
'datatype': 'BYTES',
'name': 'text',
'parameters': {'content_type': 'str'},
'shape': [1, 1]}],
'parameters': {}}
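If you only care about the generated text rather than the full V2 payload, it can be pulled out of the response above; this is a minimal sketch assuming the single text output shown in the printout:
# Extract the generated text from the V2 response printed above.
# Assumes a single "text" output carrying one string, as in this example.
generated_text = response.json()["outputs"][0]["data"][0]
print(generated_text)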
And to the opt-125m model:
endpoint = f"http://{get_mesh_ip()}/v2/models/local-opt-125m-deepspeed/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': 'aaaac187-80ad-4fe2-aa6e-8ef595fea5b7',
'model_name': 'local-opt-125m-deepspeed_1',
'outputs': [{'data': ['\n'
'France is the capital of the French Republic.\n'
"I'm not sure what you mean.\n"],
'datatype': 'BYTES',
'name': 'text',
'parameters': {'content_type': 'str'},
'shape': [1, 1]}],
'parameters': {}}
To unload the models, run the following commands:
!kubectl delete -f manifests/local-gpt2-transformers.yaml -n seldon
!kubectl delete -f manifests/local-opt-125m-deepspeed.yaml -n seldon
model.mlops.seldon.io "local-gpt2-transformers" deleted
model.mlops.seldon.io "local-opt-125m-deepspeed" deleted