# Multimodel serving

The following demonstrates how to use the multimodel serving (MMS) capabilities of the Local runtime on top of Seldon Core 2. We will showcase how to deploy two models on a single GPU, one using the `transformers` backend and one using the `deepspeed` backend. To run this example you will need Core 2 up and running, with a Local Runtime deployed on it, and a GPU with compute capability >= 8.0 for the `deepspeed` backend. Please check our [installation tutorial](/llm-module/introduction/installation.md) to see how you can do so.

{% hint style="info" %}
Multimodel serving is only supported on a single GPU. Any attempt to use MMS with multiple GPUs (e.g., for tensor parallelism) might result in unexpected behaviour, depending on the backend you are using. Thus, you can change the server requirements to request only one GPU (i.e., `nvidia.com/gpu: 1`).
{% endhint %}

We begin with the `transformers` backend for which we deploy a `gpt2` model. The associated `model-settings.json` file is the following:

```python
!cat models/local-gpt2-transformers/model-settings.json
```

```
{
    "name": "local-gpt2-transformers",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "config": {
                "model_type": "completions",
                "model_settings": {
                    "model": "gpt2",
                    "device": "cuda",
                    "gpu_memory_utilization": 0.3
                }
            }
        }
    }
}
```

Note that besides the model name and the device to be used, we also specified `gpu_memory_utilization`. This tells the scheduler to use approximately 30% of the available GPU memory when performing inference. For the `transformers` backend no memory is allocated a priori, but the scheduler will try its best to use the specified amount of GPU memory.
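As a rough illustration of what the 0.3 setting means in absolute terms, the arithmetic can be sketched as follows. The helper `memory_budget_gib` and the 24 GiB card are hypothetical and chosen only for the example; this is not how the runtime's scheduler is implemented.

```python
def memory_budget_gib(total_gib: float, gpu_memory_utilization: float) -> float:
    """Approximate GPU memory budget implied by a utilization fraction."""
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in the interval (0, 1]")
    return total_gib * gpu_memory_utilization


# Assumption: a GPU with 24 GiB of memory, and the 0.3 value from the settings above.
budget = memory_budget_gib(24.0, 0.3)
print(f"Approximate budget: {budget:.1f} GiB")
```

The remaining memory is what other models sharing the same GPU can draw on, which is why each model in an MMS setup should declare only a fraction of the device.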

To deploy the model on Core 2, run the following command:

```python
!kubectl apply -f manifests/local-gpt2-transformers.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

```
model.mlops.seldon.io/local-gpt2-transformers created
model.mlops.seldon.io/local-gpt2-transformers condition met
```

We can now move on to our second model, `opt-125m`, which we deploy using the `deepspeed` backend. The associated `model-settings.json` file is the following:

```python
!cat models/local-opt-125m-deepspeed/model-settings.json
```

```
{
    "name": "local-opt-125m-deepspeed",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "deepspeed",
            "config": {
                "model_type": "completions",
                "model_settings": {
                    "model": "facebook/opt-125m",
                    "inference_engine_config": {
                        "state_manager": {
                            "memory_config": {
                                "mode": "allocate",
                                "size": 50
                            }
                        }
                    }
                }
            }
        }
    }
}
```

Memory specification for the `deepspeed` backend is a bit more complex. In the settings above we specify the allocation of 50 blocks of KV-cache. Each block may store the associated KV-cache for 32 or 64 tokens, depending on the model. We will make our best effort to standardize the GPU memory size specification across all backends in the near future. For more information about allocating KV-cache memory for the `deepspeed` backend, please consult the docs (TODO: add link to the docs).
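To make the block count more concrete: 50 blocks hold the KV-cache for 50 × 32 = 1,600 or 50 × 64 = 3,200 tokens, depending on the block size. The sketch below reproduces that arithmetic and adds a back-of-the-envelope size estimate. The helper names are hypothetical. The model dimensions (12 layers, hidden size 768) and the 2-byte fp16 elements are assumptions for `opt-125m` that you should check against the model's own configuration. The memory the backend actually allocates may differ.

```python
def kv_cache_token_capacity(num_blocks: int, tokens_per_block: int) -> int:
    """Number of tokens whose KV-cache fits in the allocated blocks."""
    return num_blocks * tokens_per_block


def kv_cache_bytes(num_tokens: int, num_layers: int, hidden_size: int,
                   bytes_per_element: int = 2) -> int:
    """Rough KV-cache size: one key and one value vector per layer per token."""
    return 2 * num_layers * hidden_size * bytes_per_element * num_tokens


NUM_BLOCKS = 50  # "size" in the model settings above

for tokens_per_block in (32, 64):
    tokens = kv_cache_token_capacity(NUM_BLOCKS, tokens_per_block)
    # Assumed opt-125m dimensions: 12 layers, hidden size 768, fp16 values.
    size_mib = kv_cache_bytes(tokens, num_layers=12, hidden_size=768) / 2**20
    print(f"{tokens_per_block} tokens/block: {tokens} tokens, about {size_mib:.2f} MiB")
```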

We can now deploy `opt-125m` on Core 2 by running the following command:

```python
!kubectl apply -f manifests/local-opt-125m-deepspeed.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

```
model.mlops.seldon.io/local-opt-125m-deepspeed created
```

Both our models are now deployed and we are ready to send an inference request:

```python
import pprint
import requests
import subprocess

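def get_mesh_ip():
    # Look up the external IP of the seldon-mesh LoadBalancer service.
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    # Strip surrounding whitespace so the value can be used directly in a URL.
    return subprocess.check_output(cmd, shell=True).decode('utf-8').strip()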


inference_request = {
    "inputs": [
            {
                "name": "role", 
                "shape": [1], 
                "datatype": "BYTES", 
                "data": ["user"]
            },
            {
                "name": "prompt",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["What is the capital of France?"],
            },
        ],
    "parameters": {
        "kwargs": {
            "temperature": 0.0,
        }
    }
}
```
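If you want to try other prompts, the same payload can be built by a small helper. This is a minimal sketch: the function name `build_inference_request` is hypothetical, and it reproduces only the fields used in the request above.

```python
def build_inference_request(prompt: str, role: str = "user",
                            temperature: float = 0.0) -> dict:
    """Build a request with the same structure as the payload above."""
    return {
        "inputs": [
            {"name": "role", "shape": [1], "datatype": "BYTES", "data": [role]},
            {"name": "prompt", "shape": [1], "datatype": "BYTES", "data": [prompt]},
        ],
        "parameters": {"kwargs": {"temperature": temperature}},
    }


# Produces the same content as the request defined above.
example_request = build_inference_request("What is the capital of France?")
```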

We can now send the request to the `gpt2` model:

```python
endpoint = f"http://{get_mesh_ip()}/v2/models/local-gpt2-transformers/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
```

```
{'id': 'd1781451-f266-49fc-bcb8-507969da8da4',
 'model_name': 'local-gpt2-transformers_1',
 'outputs': [{'data': ['\n'
                       '\n'
                       'The capital of France is Paris.\n'
                       '\n'
                       'The capital of France is Paris.\n'
                       '\n'],
              'datatype': 'BYTES',
              'name': 'text',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
```

And to the `opt-125m` model:

```python
endpoint = f"http://{get_mesh_ip()}/v2/models/local-opt-125m-deepspeed/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
```

```
{'id': 'aaaac187-80ad-4fe2-aa6e-8ef595fea5b7',
 'model_name': 'local-opt-125m-deepspeed_1',
 'outputs': [{'data': ['\n'
                       'France is the capital of the French Republic.\n'
                       "I'm not sure what you mean.\n"],
              'datatype': 'BYTES',
              'name': 'text',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
```
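Both responses share the same structure, so the generated text can be pulled out with one helper. The sketch below assumes the response layout shown above, where the output is named `text`. The name `extract_text` is hypothetical, and the sample dictionary is a trimmed copy of the `opt-125m` response.

```python
def extract_text(response_json: dict, output_name: str = "text") -> list:
    """Return the data of the named output tensor from an inference response."""
    for output in response_json.get("outputs", []):
        if output.get("name") == output_name:
            return list(output.get("data", []))
    raise KeyError(f"no output named {output_name!r} in response")


# Trimmed copy of the opt-125m response shown above.
sample_response = {
    "model_name": "local-opt-125m-deepspeed_1",
    "outputs": [
        {
            "name": "text",
            "datatype": "BYTES",
            "shape": [1, 1],
            "data": ["\nFrance is the capital of the French Republic.\n"
                     "I'm not sure what you mean.\n"],
        }
    ],
}

generated = extract_text(sample_response)
print(generated[0].strip())
```

In the notebook you would call `extract_text(response.json())` after either request.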

To unload the models, run the following commands:

```python
!kubectl delete -f manifests/local-gpt2-transformers.yaml -n seldon 
!kubectl delete -f manifests/local-opt-125m-deepspeed.yaml -n seldon
```

```
model.mlops.seldon.io "local-gpt2-transformers" deleted
model.mlops.seldon.io "local-opt-125m-deepspeed" deleted
```

