Multimodel serving
The following demonstrates how to use the multimodel serving (MMS) capabilities of the Local runtime on top of Seldon Core 2. We will showcase how to deploy two models on a single GPU, using the transformers and the deepspeed backends. To run this example you will need Core 2 up and running with a Local runtime deployed on it, and a GPU with compute capability >= 8.0 for the deepspeed backend. Please check our installation tutorial to see how you can do so.
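If you are unsure whether your GPU qualifies, you can query its compute capability directly. This is an optional check, assuming a reasonably recent NVIDIA driver that supports the compute_cap query field:
!nvidia-smi --query-gpu=name,compute_cap --format=csv
Any value of 8.0 or above (for example an A100, A10, L4 or H100) is sufficient for the deepspeed backend.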
We begin with the transformers backend for which we deploy a gpt2 model. The associated model-settings.json file is the following:
!cat models/local-gpt2-transformers/model-settings.json
{
  "name": "local-gpt2-transformers",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "transformers",
      "config": {
        "model_type": "completions",
        "model_settings": {
          "model": "gpt2",
          "device": "cuda",
          "gpu_memory_utilization": 0.3
        }
      }
    }
  }
}
Note that besides the model name and the device to be used, we also specified gpu_memory_utilization. This is an indication for the scheduler to use approximately 30% of the available GPU memory when performing inference. For the transformers backend no memory is allocated a priori, but the scheduler will try its best to use the specified amount of GPU memory.
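The manifest applied below is not listed in this example, but a Seldon Core 2 Model resource for it typically looks like the minimal sketch below. The storageUri value and the llm-local requirement are placeholders chosen for illustration: point storageUri at the artifact location holding the model-settings.json above, and use whatever capability your Local runtime server advertises.
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-gpt2-transformers
spec:
  storageUri: "gs://<your-bucket>/models/local-gpt2-transformers"
  requirements:
  - llm-local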
To deploy the model on Core 2, run the following command:
!kubectl apply -f manifests/local-gpt2-transformers.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/local-gpt2-transformers created
model.mlops.seldon.io/local-gpt2-transformers condition met
We can now move on to our second model, opt-125m, which we deploy using the deepspeed backend. The associated model-settings.json file is the following:
!cat models/local-opt-125m-deepspeed/model-settings.json
{
  "name": "local-opt-125m-deepspeed",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "deepspeed",
      "config": {
        "model_type": "completions",
        "model_settings": {
          "model": "facebook/opt-125m",
          "inference_engine_config": {
            "state_manager": {
              "memory_config": {
                "mode": "allocate",
                "size": 50
              }
            }
          }
        }
      }
    }
  }
}
Memory specification for the deepspeed backend is a bit more complex. In the settings above we specify the allocation of 50 blocks of KV-cache. Each block may store the associated KV-cache for 32 or 64 tokens, depending on the model. We will make our best effort to standardize the GPU memory size specification across all backends in the near future. For more information about allocating KV-cache memory for the deepspeed backend, please consult the docs (TODO: add link to the docs).
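As a rough back-of-the-envelope sketch of what those 50 blocks amount to, the calculation below assumes 64-token blocks, fp16 values, and opt-125m dimensions (12 layers, hidden size 768); the actual allocation depends on the model and on DeepSpeed's internal layout:
# Rough KV-cache size implied by "size": 50 above (all constants are assumptions)
num_blocks = 50
tokens_per_block = 64              # may be 32 for some models
num_layers, hidden_size = 12, 768  # opt-125m dimensions
bytes_per_value = 2                # fp16

# each token stores one key and one value vector of hidden_size per layer
bytes_per_token = 2 * num_layers * hidden_size * bytes_per_value
total_mib = num_blocks * tokens_per_block * bytes_per_token / 1024**2
print(f"~{total_mib:.0f} MiB of KV-cache")  # about 112 MiB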
We can now deploy opt-125m on Core 2 by running the following command:
!kubectl apply -f manifests/local-opt-125m-deepspeed.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/local-opt-125m-deepspeed created
Both our models are now deployed and we are ready to send an inference request:
import pprint
import requests
import subprocess
def get_mesh_ip():
    # Fetch the external IP of the seldon-mesh LoadBalancer service exposed by Core 2
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')
inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["user"]
        },
        {
            "name": "prompt",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["What is the capital of France?"],
        },
    ],
    "parameters": {
        "kwargs": {
            "temperature": 0.0,
        }
    }
}
We can now send the request to the gpt2 model:
endpoint = f"http://{get_mesh_ip()}/v2/models/local-gpt2-transformers/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': 'd1781451-f266-49fc-bcb8-507969da8da4',
'model_name': 'local-gpt2-transformers_1',
'outputs': [{'data': ['\n'
'\n'
'The capital of France is Paris.\n'
'\n'
'The capital of France is Paris.\n'
'\n'],
'datatype': 'BYTES',
'name': 'text',
'parameters': {'content_type': 'str'},
'shape': [1, 1]}],
'parameters': {}}
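If you only need the generated text rather than the full Open Inference Protocol payload, it can be read directly from the single BYTES output named text shown above:
generated_text = response.json()["outputs"][0]["data"][0]
print(generated_text)
And to the opt-125m model: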
endpoint = f"http://{get_mesh_ip()}/v2/models/local-opt-125m-deepspeed/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': 'aaaac187-80ad-4fe2-aa6e-8ef595fea5b7',
'model_name': 'local-opt-125m-deepspeed_1',
'outputs': [{'data': ['\n'
'France is the capital of the French Republic.\n'
"I'm not sure what you mean.\n"],
'datatype': 'BYTES',
'name': 'text',
'parameters': {'content_type': 'str'},
'shape': [1, 1]}],
'parameters': {}}
To unload the models, run the following commands:
!kubectl delete -f manifests/local-gpt2-transformers.yaml -n seldon
!kubectl delete -f manifests/local-opt-125m-deepspeed.yaml -n seldon
model.mlops.seldon.io "local-gpt2-transformers" deleted
model.mlops.seldon.io "local-opt-125m-deepspeed" deletedLast updated