# Chat Bot

In this example, we're going to build a chatbot using the LLM Module runtimes. You'll see how each of these components can be leveraged within a Seldon Core 2 pipeline to build out an LLM application: without writing any backend code, we can put together a production-ready chat application in almost no time. This tutorial relies on [Kubernetes](https://kubernetes.io/) and [Seldon Core 2](https://docs.seldon.io/projects/seldon-core/en/v2/contents/getting-started/), so an understanding of these tools will be useful. To follow along you'll also need the LLM Module set up on your cluster, with both the OpenAI Runtime and the Conversational Memory Runtime deployed. For this tutorial we assume they're deployed under a `seldon` namespace, but this doesn't have to be the case.

The app we're going to build is a simple question-answer chatbot with a persistent memory store. To build it we're going to use the OpenAI runtime via the [API runtime](https://github.com/SeldonIO/llm-runtimes/blob/master/runtimes/api/openai/README.md) and the [Conversational Memory Runtime](https://github.com/SeldonIO/llm-runtimes/blob/master/runtimes/conversational_memory/README.md). The following illustrates an interaction between the chat bot and a user.

![chatbot example](https://1351131837-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FcsWxN0Xouw6OXkKNUoAe%2Fuploads%2Fgit-blob-2c11d547e0d9b6822c06df53eeba4f38baf4d361%2Fimage.png?alt=media)

## Model deployments

To build this, we first need to do two things:

1. Create the `model-settings.json` files for the memory and LLM models.
2. Create the model and pipeline specifications for Seldon Core 2.

### Model-settings configs

Let's start with the large language model. We will use the `gpt-3.5-turbo` model with the `chat.completions` endpoint. For more explanation of the available options, see [this](https://github.com/SeldonIO/llm-runtimes/blob/master/docs-gb/api/openai_runtimes/README.md) example.

```python
!cat models/chatgpt/model-settings.json
```

```
{
    "name": "chatgpt",
    "implementation": "mlserver_llm_api.LLMRuntime",
    "parameters": {
        "extra": {
            "provider_id": "openai",
            "config": {
                "model_id": "gpt-3.5-turbo",
                "model_type": "chat.completions"
            }
        }
    }
}
```

For the memory runtime, we'll use the `filesys` database backend, a `window_size` of 100, and set the `tensor_names` of the tensors we want to store to `"role"`, `"content"`, and `"type"`. We choose these tensor names because they match both the input and output tensor names of the large language model. As you'll see later, this means we can feed the output of the LLM straight into the memory model and vice versa.

```python
!cat models/memory/model-settings.json
```

```
{
    "name": "memory",
    "implementation": "mlserver_memory.ConversationalMemory",
    "parameters": {
        "extra": {
            "database": "filesys",
            "config": {
                "window_size": 100,
                "tensor_names": ["content", "role", "type"]
            }
        }
    }
}
```
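Conceptually, the memory model keeps at most `window_size` messages per conversation, appending each new message to the stored history. A rough sketch of that behaviour (illustrative only, using a plain in-memory dict rather than the runtime's actual `filesys` backend):

```python
from collections import defaultdict

# Illustrative in-memory stand-in for the filesys backend.
store = defaultdict(list)

def append_message(memory_id, message, window_size=100):
    """Append a message to a conversation and trim to the last window_size entries."""
    history = store[memory_id]
    history.append(message)
    # Keep only the most recent window_size messages.
    del history[:-window_size]
    return list(history)

append_message("session-1", {"role": "user", "content": "Hi"}, window_size=2)
append_message("session-1", {"role": "assistant", "content": "Hello!"}, window_size=2)
history = append_message("session-1", {"role": "user", "content": "Bye"}, window_size=2)
# Only the two most recent messages survive the window.
```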

### Seldon CRDs

Now that we've defined the model configs, we need to create the custom resource definitions (CRDs) that tell Seldon to deploy those models. Before this step, we first need to upload the model settings to remote storage, from which the cluster will fetch them on deployment. When developing locally this step may seem redundant, but in production the cluster fetches the model settings from remote storage, not from your local environment. We recommend MinIO for remote storage, but for this demo we've placed the `model-settings.json` files in a [Google bucket](https://console.cloud.google.com/storage/browser/seldon-models/llm-runtimes/examples/pipelines/chat_bot/models).

The following model YAML tells Seldon Core 2 and Kubernetes how many models to deploy, their names, where their specifications live, and any requirements they have. In particular, the `memory` and `openai` requirements tell Kubernetes to schedule each model onto a server running the matching MLServer runtime.

```python
!cat manifests/models/models-api.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: combine-question
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/chat-bot/models/memory"
  requirements:
  - memory
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: combine-answer
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/chat-bot/models/memory"
  requirements:
  - memory
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: chatgpt
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/chat-bot/models/chatgpt"
  requirements:
  - openai
```

## Deploying the models

To deploy the models, we first need to create the secret containing the API key and set up servers for our memory and API deployments. Check our [installation tutorial](https://docs.seldon.ai/llm-module/introduction/installation) to see how you can create the secret and deploy the API and Memory Runtime servers.

Once you have all of the above up and running, we are ready to deploy our models.

To deploy the models, run the following:

```python
!kubectl apply -f manifests/models/models-api.yaml -n seldon
!kubectl wait --for condition=ready --timeout=300s model --all -n seldon
```

```
model.mlops.seldon.io/combine-question unchanged
model.mlops.seldon.io/combine-answer unchanged
model.mlops.seldon.io/chatgpt created
model.mlops.seldon.io/chatgpt condition met
model.mlops.seldon.io/combine-answer condition met
model.mlops.seldon.io/combine-question condition met
```

We can now query the LLM by sending a request. Note that this interaction is stateless: a follow-up question would confuse the model, because it has no memory of having just answered about the Burj Khalifa.

```python
import pprint
import subprocess
import requests

def get_mesh_ip():
    cmd = f"kubectl get svc seldon-mesh -n seldon -o jsonpath='{{.status.loadBalancer.ingress[0].ip}}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')


inference_request = {
    "inputs": [
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Hi, what's the tallest building in the world?"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "role",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["user"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "type",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["text"],
            "parameters": {"content_type": "str"},
        }
    ],
}

endpoint = f"http://{get_mesh_ip()}/v2/models/chatgpt/infer"

response_with_params = requests.post(
    endpoint,
    json=inference_request,
)

pprint.pprint(response_with_params.json()['outputs'][:2])
```

```
[{'data': ['assistant'],
  'datatype': 'BYTES',
  'name': 'role',
  'parameters': {'content_type': 'str'},
  'shape': [1, 1]},
 {'data': ['As of now, the tallest building in the world is the Burj Khalifa '
           'in Dubai, United Arab Emirates. It stands at a height of 828 '
           'meters (2,717 feet).'],
  'datatype': 'BYTES',
  'name': 'content',
  'parameters': {'content_type': 'str'},
  'shape': [1, 1]}]
```
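The response follows the Open Inference Protocol, returning outputs as a list of named tensors. A small convenience helper (a sketch based on the response shape shown above) makes them easier to index by name:

```python
def outputs_by_name(response_json):
    """Map each output tensor's name to its data list."""
    return {out["name"]: out["data"] for out in response_json["outputs"]}

# Example, mirroring the structure of the response above:
example = {
    "outputs": [
        {"name": "role", "datatype": "BYTES", "shape": [1, 1], "data": ["assistant"]},
        {"name": "content", "datatype": "BYTES", "shape": [1, 1], "data": ["The Burj Khalifa."]},
    ]
}
tensors = outputs_by_name(example)
print(tensors["content"][0])  # -> The Burj Khalifa.
```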

## Deploying the Seldon Core 2 Pipeline

To maintain the context of the conversation we will integrate the memory runtime. To do this we need to define a [pipeline](https://docs.seldon.io/projects/seldon-core/en/v2/contents/pipelines/index.html) that tells Seldon Core 2 how to wire together the memory and LLM runtimes. The following diagram illustrates the layout of the pipeline.

{% @mermaid/diagram content="flowchart LR
input(\[input])
output(\[output])
filesys\[(FILE SYSTEM)]
combine-question
combine-answer
OAI\["OpenAI"]
input --> combine-question --> OAI --> output
filesys <--> combine-question
combine-answer --> filesys
combine-answer --> output
OAI --> combine-answer" %}

The pipeline consists of three MLServer runtimes: `OpenAI` is the OpenAI model served via the [API Runtime](https://github.com/SeldonIO/llm-runtimes/blob/master/runtimes/api/openai/README.md) while `combine-question` and `combine-answer` are both [conversational memory runtimes](https://github.com/SeldonIO/llm-runtimes/blob/master/runtimes/conversational_memory/README.md).

1. Imagine some conversation has already taken place, so the conversational memory holds a non-empty list of messages.
2. The user sends a question to the `combine-question` memory model.
3. This model appends the question to the stored conversational history and returns the full history, question included.
4. The history tensor is then passed to the LLM (OpenAI).
5. The LLM's output (the answer) is passed to the `combine-answer` memory model, which does the same thing as `combine-question` but with the answer instead of the question.
6. The answer is returned to the user along with the history. The history can be ignored; it only serves as confirmation that the new message has been committed to the database.

Note that the input to the pipeline also includes a `memory_id`. This is a session id that should be managed by the user and is responsible for identifying the conversation history in the data store.
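The turn-by-turn flow above can be sketched in plain Python (a toy simulation with a stubbed LLM, not the actual runtimes):

```python
# Prior conversation already in memory.
history = [{"role": "assistant", "content": "Hello! How can I help?"}]

def fake_llm(messages):
    """Stand-in for the OpenAI step: answers based on the full history."""
    return {"role": "assistant", "content": f"(answer given {len(messages)} messages of context)"}

def chat_turn(question):
    # combine-question: append the user message and pass the history to the LLM.
    history.append({"role": "user", "content": question})
    answer = fake_llm(history)
    # combine-answer: append the LLM's answer so the next turn sees it.
    history.append(answer)
    return answer

answer = chat_turn("What's the tallest building in the world?")
```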

```python
!cat manifests/pipelines/pipeline-api.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: chat-app
spec:
  steps:
    - name: combine-question
      inputs:
        - chat-app.inputs.memory_id
        - chat-app.inputs.role
        - chat-app.inputs.content
        - chat-app.inputs.type
    - name: chatgpt
      inputs:
        - combine-question.outputs.history
    - name: combine-answer
      inputs:
      - chat-app.inputs.memory_id
      - chatgpt.outputs.role
      - chatgpt.outputs.content
      - chatgpt.outputs.type
  output:
    steps:
    - chatgpt
    - combine-answer
```

We can now deploy the pipeline using the following terminal command.

```python
!kubectl apply -f manifests/pipelines/pipeline-api.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s pipeline --all -n seldon
```

```
pipeline.mlops.seldon.io/chat-app created
pipeline.mlops.seldon.io/chat-app condition met
```

## Using the chatbot

The chatbot is now deployed and ready to use. The following code defines a small API controller and an ipywidgets UI: each question the user submits is sent to the chatbot and the response is displayed, until the user types `exit`.

```python
import subprocess
import requests
import uuid

def get_mesh_ip():
    cmd = f"kubectl get svc seldon-mesh -n seldon -o jsonpath='{{.status.loadBalancer.ingress[0].ip}}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')



class APIController():
    def __init__(self, target='chat-app'):
        self.session = str(uuid.uuid4())
        self.target = target
        self.endpoint = f"http://{get_mesh_ip()}/v2/pipelines/{target}/infer"

    def _build_request(self, question):
        inference_request = {
            "inputs": [
                {
                    "name": "role", 
                    "shape": [1, 1],
                    "datatype": "BYTES", 
                    "data": ['user'],
                    "parameters": {"content_type": "str"},
                },
                {
                    "name": "content", 
                    "shape": [1, 1],
                    "datatype": "BYTES", 
                    "data": [question],
                    "parameters": {"content_type": "str"},
                },
                {
                    "name": "type", 
                    "shape": [1, 1],
                    "datatype": "BYTES", 
                    "data": ['text'],
                    "parameters": {"content_type": "str"},
                },
                {
                    "name": "memory_id", 
                    "shape": [1, 1],
                    "datatype": "BYTES", 
                    "data": [self.session],
                    "parameters": {"content_type": "str"},
                }
            ]
        }
        headers = {
            "Content-Type": "application/json",
            "seldon-model": f"{self.target}.pipeline"
        }
        return inference_request, headers

    def send(self, text):
        inference_request, headers = self._build_request(text)
        return requests.post(self.endpoint, json=inference_request, headers=headers)
```

```python
!pip install ipywidgets -q
```

```python
import ipywidgets as widgets
from IPython.display import display

def display_ui(target: str = 'chat-app'):
    
    # Create text input and button widgets
    text_input = widgets.Text(description="Enter text:")
    button = widgets.Button(description="Submit")
    output = widgets.Output()

    # Create an instance of the API controller
    api = APIController(target)
    print('session:', api.session)

    # Define what happens on button click
    def on_button_click(b):
        text = text_input.value
        if text == 'exit':
            with output:
                print("Exiting...")
            button.disabled = True
        else:
            with output:
                print("User:", text)
                response = api.send(text)
                
                print("Assistant:", response.json()['outputs'][1]['data'][0])
        
        text_input.value = ""  # Clear input after each submission

    # Bind the click event
    button.on_click(on_button_click)

    # Display widgets and output
    display(text_input, button, output)
```

```python
display_ui()
```

```
session: ebe4b1f9-12a7-4c63-bd85-e1df933978b5



Text(value='', description='Enter text:')



Button(description='Submit', style=ButtonStyle())



Output()
```

Using the `memory_id` that we printed out at the top of the conversation, we can query the memory runtime for the history of that conversation.

```python

inference_request = {
    "inputs": [
        {
            "name": "memory_id",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["ebe4b1f9-12a7-4c63-bd85-e1df933978b5"],
            "parameters": {"content_type": "str"},
        },
    ],
    "parameters": {
        "memory_parameters": {"window_size": 10}
    },
}

endpoint = f"http://{get_mesh_ip()}/v2/models/combine-question/infer"
response_with_params = requests.post(
    endpoint,
    json=inference_request,
)

pprint.pprint(response_with_params.json()['outputs'][0]['data'])
```

```
['{"role": "user", "content": [{"type": "text", "value": "Whats the tallest '
 'building in the world?"}]}',
 '{"role": "assistant", "content": [{"type": "text", "value": "As of October '
 '2021, the tallest building in the world is the Burj Khalifa in Dubai, United '
 'Arab Emirates. It stands at a height of 828 meters (2,717 feet)."}]}',
 '{"role": "user", "content": [{"type": "text", "value": "And what about the '
 'second?"}]}',
 '{"role": "assistant", "content": [{"type": "text", "value": "The second '
 'tallest building in the world is the Shanghai Tower in Shanghai, China. It '
 'has a height of 632 meters (2,073 feet)."}]}']
```
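Each entry in the returned history is a JSON-encoded message, so it can be decoded back into structured form, e.g. to build a plain transcript (a sketch using the message format shown above; the sample strings are abbreviated):

```python
import json

raw_history = [
    '{"role": "user", "content": [{"type": "text", "value": "Whats the tallest building in the world?"}]}',
    '{"role": "assistant", "content": [{"type": "text", "value": "The Burj Khalifa, at 828 meters."}]}',
]

transcript = []
for entry in raw_history:
    message = json.loads(entry)
    # Each message holds a list of content parts; join the text ones.
    text = " ".join(part["value"] for part in message["content"] if part["type"] == "text")
    transcript.append(f'{message["role"]}: {text}')

print("\n".join(transcript))
```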

## Deploying a Local Model with GPU support

As well as the API runtime, we also provide the [Local Runtime](https://github.com/SeldonIO/llm-runtimes/blob/master/runtimes/local/README.md), which allows developers to deploy and serve their large language models on GPUs. To get this working you'll need to provision a node with GPU support, and the server will need to request GPU resources.

For this demo, we're going to deploy the [`mistralai/Mistral-7B-Instruct-v0.2`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2) model because it is small but performant. It's also a gated model on Hugging Face, which means you'll need to create an account and request access. You'll then need to create an access token for your account and store it in a Kubernetes secret. Check our [installation tutorial](https://docs.seldon.ai/llm-module/introduction/installation) to see how you can create the secret and deploy the Local Runtime server.

When running this example we used a single GPU node with four NVIDIA GeForce RTX 2080 Ti GPUs, but alternative infrastructure setups are also possible.

### Model Settings

Take a look at the following `model-settings.json` file:

```python
!cat models/mistral/model-settings.json
```

```
{
    "name": "mistral",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "mistralai/Mistral-7B-Instruct-v0.2",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "max_model_len": 4096,
                    "default_generate_kwargs": {
                        "max_tokens": 1024
                    }
                }
            }
        }
    }
}
```

For more details on the settings here please view the [local runtime](https://github.com/SeldonIO/llm-runtimes/blob/master/runtimes/local/README.md) documentation page. However, a couple of key settings are worth pointing out:

1. The `dtype` is set to `float16` because of the type of GPU we're using. This won't be necessary for other GPUs, for instance A100s.
2. We specify the backend as `vllm`, but this can also be `transformers`, although that will be much less performant. `deepspeed` is not available here because the NVIDIA GeForce RTX 2080 Ti GPUs don't support it. If you wish to use `deepspeed`, consider using an L4 or A100 GPU.
3. By default the device is set to `cuda`. This can also be set to `cpu` if you wish, although this setting is only valid for the `transformers` backend.
4. `tensor_parallel_size` is set to 4 because we're splitting the model over four GPUs.
5. `max_model_len` represents the maximum number of tokens that the LLM engine can handle.
6. `max_tokens` under `default_generate_kwargs` sets the maximum number of tokens that can be generated per response.
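As a rough sanity check on these settings (back-of-envelope only; real usage adds activations and KV cache on top), a 7B-parameter model in `float16` needs about 2 bytes per parameter, and `tensor_parallel_size: 4` shards the weights across the four cards:

```python
params = 7e9            # Mistral-7B parameter count (approximate)
bytes_per_param = 2     # float16
gpus = 4                # tensor_parallel_size

total_gb = params * bytes_per_param / 1e9
per_gpu_gb = total_gb / gpus
print(f"~{total_gb:.0f} GB of weights, ~{per_gpu_gb:.1f} GB per GPU")
# -> ~14 GB of weights, ~3.5 GB per GPU
```

This fits comfortably in the 11 GB of memory on each RTX 2080 Ti, leaving headroom for the KV cache implied by `max_model_len: 4096`.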

## Deploying the Mistral 7B Model

We deploy the model using the following `yaml`:

```python
!cat manifests/models/models-local.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: mistral
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/chat-bot/models/mistral"
  requirements:
  - llm-local
```

Let's deploy!

```python
!kubectl apply -f manifests/models/models-local.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

```
model.mlops.seldon.io/mistral created
model.mlops.seldon.io/chatgpt condition met
model.mlops.seldon.io/combine-answer condition met
model.mlops.seldon.io/combine-question condition met
model.mlops.seldon.io/mistral condition met
```

Finally, we can swap the OpenAI and Mistral models in and out of the pipeline. To do so, we just replace the `chatgpt` model in the pipeline YAML definition with `mistral`:

```python
!cat manifests/pipelines/pipeline-local.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: chat-app-local
spec:
  steps:
    - name: combine-question
      inputs:
        - chat-app-local.inputs.memory_id
        - chat-app-local.inputs.role
        - chat-app-local.inputs.content
        - chat-app-local.inputs.type
    - name: mistral
      inputs:
        - combine-question.outputs.history
    - name: combine-answer
      inputs:
      - chat-app-local.inputs.memory_id
      - mistral.outputs.role
      - mistral.outputs.content
      - mistral.outputs.type
  output:
    steps:
    - mistral
    - combine-answer
```

And deploy!

```python
!kubectl apply -f manifests/pipelines/pipeline-local.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s pipeline --all -n seldon
```

```
pipeline.mlops.seldon.io/chat-app-local created
pipeline.mlops.seldon.io/chat-app condition met
pipeline.mlops.seldon.io/chat-app-local condition met
```

Let's test it:

```python
display_ui('chat-app-local')
```

```
session: 2e4f8b5f-b7a4-488a-814a-84ad20cdd71b



Text(value='', description='Enter text:')



Button(description='Submit', style=ButtonStyle())



Output()
```

## Undeploying

If you want to undeploy the chatbot you can easily do so using:

```python
!kubectl delete -f manifests/pipelines/pipeline-api.yaml -n seldon
!kubectl delete -f manifests/pipelines/pipeline-local.yaml -n seldon
```

```
pipeline.mlops.seldon.io "chat-app" deleted
pipeline.mlops.seldon.io "chat-app-local" deleted
```

```python
!kubectl delete -f manifests/models/models-api.yaml -n seldon
!kubectl delete -f manifests/models/models-local.yaml -n seldon
```

```
model.mlops.seldon.io "combine-question" deleted
model.mlops.seldon.io "combine-answer" deleted
model.mlops.seldon.io "chatgpt" deleted
model.mlops.seldon.io "mistral" deleted
```
