Multimodel serving

The following demonstrates how to use the multimodel serving (MMS) capabilities of the Local runtime on top of Seldon Core 2. We will showcase how to deploy two models on a single GPU, one using the transformers backend and the other using the deepspeed backend. To run this example you will need Core 2 up and running with a Local Runtime deployed on it, and a GPU with compute capability >= 8.0 for the deepspeed backend. Please check our installation tutorial to see how to set this up.
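
As a quick sanity check, you can verify your GPU's compute capability with PyTorch (a minimal sketch; it assumes PyTorch with CUDA support is available in your environment):

import torch

# The deepspeed backend requires a GPU with compute capability >= 8.0
# (e.g., Ampere-generation cards such as A100 or A10).
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
assert (major, minor) >= (8, 0), "the deepspeed backend requires compute capability >= 8.0"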

Multimodel serving is only supported for a single GPU. Any attempt to use MMS with multiple GPUs (e.g., for tensor parallelism) might result in unexpected behaviour, depending on the backend you are using. Thus, you should update the server requirements to request only one GPU (i.e., nvidia.com/gpu: 1).

We begin with the transformers backend, for which we deploy a gpt2 model. The associated model-settings.json file is the following:

!cat models/local-gpt2-transformers/model-settings.json
{
    "name": "local-gpt2-transformers",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "config": {
                "model_type": "completions",
                "model_settings": {
                    "model": "gpt2",
                    "device": "cuda",
                    "gpu_memory_utilization": 0.3
                }
            }
        }
    }
}

Note that besides the model name and the device to be used, we also specified the gpu_memory_utilization. This instructs the scheduler to use approximately 30% of the available GPU memory when performing inference. For the transformers backend no memory is allocated a priori, but the scheduler will do its best to respect the specified amount of GPU memory.
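
To get a feel for what this budget means on your hardware, you can query the device memory with PyTorch (a rough sketch, assuming the model is placed on GPU 0; the exact accounting is up to the scheduler):

import torch

# gpu_memory_utilization=0.3 asks the scheduler to target roughly 30% of the
# device's total memory at inference time; nothing is reserved up front.
free_bytes, total_bytes = torch.cuda.mem_get_info()
budget_bytes = 0.3 * total_bytes
print(f"GPU total: {total_bytes / 1024**3:.1f} GiB, "
      f"approximate budget for this model: {budget_bytes / 1024**3:.1f} GiB")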

To deploy the model on Core 2, run the following command:

!kubectl apply -f manifests/local-gpt2-transformers.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/local-gpt2-transformers created
model.mlops.seldon.io/local-gpt2-transformers condition met

We can now move on to our second model, opt-125m, which we deploy using the deepspeed backend. The associated model-settings.json file is the following:

!cat models/local-opt-125m-deepspeed/model-settings.json
{
    "name": "local-opt-125m-deepspeed",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "deepspeed",
            "config": {
                "model_type": "completions",
                "model_settings": {
                    "model": "facebook/opt-125m",
                    "inference_engine_config": {
                        "state_manager": {
                            "memory_config": {
                                "mode": "allocate",
                                "size": 50
                            }
                        }
                    }
                }
            }
        }
    }
}

Memory specification for the deepspeed backend is a bit more complex. In the settings above we specify the allocation of 50 blocks of KV-cache. Each block may store the associated KV-cache for 32 or 64 tokens, depending on the model. We will make our best effort to standardize the GPU memory size specification across all backends in the near future. For more information about allocating KV-cache memory for the deepspeed backend, please consult the docs (TODO: add link to the docs).
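
As a back-of-the-envelope illustration of the capacity this implies (the block size below is an assumption; the actual value depends on the model):

# 50 KV-cache blocks were requested in the memory_config above.
num_blocks = 50
# Each block stores the KV-cache for 32 or 64 tokens, depending on the model;
# we assume 64 here purely for illustration.
tokens_per_block = 64
print(f"KV-cache capacity: roughly {num_blocks * tokens_per_block} tokens")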

We can now deploy opt-125m on Core 2 by running the following command:

!kubectl apply -f manifests/local-opt-125m-deepspeed.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/local-opt-125m-deepspeed created

Both our models are now deployed and we are ready to send an inference request:

import pprint
import requests
import subprocess

def get_mesh_ip():
    # Fetch the external IP of the seldon-mesh LoadBalancer service.
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')


inference_request = {
    "inputs": [
            {
                "name": "role", 
                "shape": [1], 
                "datatype": "BYTES", 
                "data": ["user"]
            },
            {
                "name": "prompt",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["What is the capital of France?"],
            },
        ],
    "parameters": {
        "kwargs": {
            "temperature": 0.0,
        }
    }
}

We can now send the request to the gpt2 model:

endpoint = f"http://{get_mesh_ip()}/v2/models/local-gpt2-transformers/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': 'd1781451-f266-49fc-bcb8-507969da8da4',
 'model_name': 'local-gpt2-transformers_1',
 'outputs': [{'data': ['\n'
                       '\n'
                       'The capital of France is Paris.\n'
                       '\n'
                       'The capital of France is Paris.\n'
                       '\n'],
              'datatype': 'BYTES',
              'name': 'text',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}

And to the opt-125m model:

endpoint = f"http://{get_mesh_ip()}/v2/models/local-opt-125m-deepspeed/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': 'aaaac187-80ad-4fe2-aa6e-8ef595fea5b7',
 'model_name': 'local-opt-125m-deepspeed_1',
 'outputs': [{'data': ['\n'
                       'France is the capital of the French Republic.\n'
                       "I'm not sure what you mean.\n"],
              'datatype': 'BYTES',
              'name': 'text',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
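
The generated text can be pulled out of the V2 response as follows (a minimal sketch based on the response structure shown above):

# The completion is returned as a BYTES tensor named "text"; the string itself
# is the first element of the "data" field of the first output.
completion = response.json()["outputs"][0]["data"][0]
print(completion)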

To unload the models, run the following commands:

!kubectl delete -f manifests/local-gpt2-transformers.yaml -n seldon 
!kubectl delete -f manifests/local-opt-125m-deepspeed.yaml -n seldon
model.mlops.seldon.io "local-gpt2-transformers" deleted
model.mlops.seldon.io "local-opt-125m-deepspeed" deleted
