Chat Bot

In this example, we're going to build a chatbot using the LLM Module runtimes. You'll see how each of these components can be leveraged within a Seldon Core 2 pipeline to build out an LLM application, and how, without writing any backend code, we can assemble a production-ready chat application in very little time. This tutorial relies on Kubernetes and Seldon Core 2, so an understanding of these tools will be useful. To follow along you'll also need the LLM Module set up on your cluster, with both the OpenAI Runtime and the Conversational Memory Runtime deployed. For this tutorial I've assumed they're deployed in the seldon namespace, but this doesn't have to be the case.

The app we're going to build is a simple question-answer chatbot with a persistent memory store. To build it we're going to use OpenAI models served via the API Runtime, together with the Conversational Memory Runtime. The following illustrates an interaction between the chatbot and a user.

chatbot example

Model deployments

To build the above we first have to do two things:

  1. Create the model-settings.json files for the memory and LLM models.

  2. Create the model and pipeline specifications for Seldon Core 2.

Model-settings configs

Let's start with the large language model. We will use the gpt-3.5-turbo model with the chat.completions endpoint. For more explanation of the available options, see this example.

!cat models/chatgpt/model-settings.json
{
    "name": "chatgpt",
    "implementation": "mlserver_llm_api.LLMRuntime",
    "parameters": {
        "extra": {
            "provider_id": "openai",
            "config": {
                "model_id": "gpt-3.5-turbo",
                "model_type": "chat.completions"
            }
        }
    }
}

For the memory runtime, we'll use the filesys database backend, a window_size of 100, and set tensor_names (the names of the tensors we want to store) to "role", "content", and "type". We choose these names because they match both the input and output tensor names of the large language model. As you'll see later, this means we can pass the output of the LLM straight into the memory and vice versa.

!cat models/memory/model-settings.json
{
    "name": "memory",
    "implementation": "mlserver_memory.ConversationalMemory",
    "parameters": {
        "extra": {
            "database": "filesys",
            "config": {
                "window_size": 100,
                "tensor_names": ["content", "role", "type"]
            }
        }
    }
}

Seldon CRDs

Now that we've defined the model configs, we need to create the custom resources that tell Seldon Core 2 to deploy those models. Before we can do this, we first need to upload the model settings to remote storage, from which the cluster will fetch them on deployment. When developing locally this step can seem redundant, but in production the cluster fetches the model settings from remote storage rather than from your local environment. We recommend MinIO for remote storage, but for this demo I've placed the model-settings.json files in a Google Cloud Storage bucket.
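
For example, assuming you have a bucket you control (the path below is just a placeholder), the upload might look like this with gsutil, or with the MinIO client (mc cp --recursive) if you're using MinIO:

!gsutil cp -r models/chatgpt models/memory gs://<your-bucket>/chat-bot/models/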

The following model manifest tells Seldon Core 2 which models to deploy, their names, where their model settings are stored, and any requirements they have. In particular, the memory and openai requirements ensure each model is scheduled onto a server running the corresponding MLServer runtime (the Conversational Memory Runtime and the API Runtime respectively).

!cat manifests/models/models-api.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: combine-question
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/chat-bot/models/memory"
  requirements:
  - memory
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: combine-answer
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/chat-bot/models/memory"
  requirements:
  - memory
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: chatgpt
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/chat-bot/models/chatgpt"
  requirements:
  - openai

Deploying the models

To deploy the models, we first need to create the secret containing the API key and set up servers for our memory and API deployments. Check our installation tutorial to see how to create the secret and deploy the API and Memory Runtime servers.
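
As a rough sketch only (the secret name and key below are placeholders; use the names given in the installation tutorial), creating an API-key secret looks something like this:

!kubectl create secret generic openai-api-key --from-literal=openai-api-key=<your-api-key> -n seldon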

Once you have all of the above up and running, we're ready to deploy the models defined in the manifest shown above.

To deploy the models, run the following:

!kubectl apply -f manifests/models/models-api.yaml -n seldon
!kubectl wait --for condition=ready --timeout=300s model --all -n seldon
model.mlops.seldon.io/combine-question unchanged
model.mlops.seldon.io/combine-answer unchanged
model.mlops.seldon.io/chatgpt created
model.mlops.seldon.io/chatgpt condition met
model.mlops.seldon.io/combine-answer condition met
model.mlops.seldon.io/combine-question condition met

We can now query the LLM directly by sending a request. Note that the following interaction is stateless: follow-up questions will confuse the model because it has no memory of having just answered a question about the Burj Khalifa.

import pprint
import subprocess
import requests

def get_mesh_ip():
    # Fetch the external IP of the seldon-mesh LoadBalancer service.
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')


inference_request = {
    "inputs": [
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Hi, what's the tallest building in the world?"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "role",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["user"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "type",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["text"],
            "parameters": {"content_type": "str"},
        }
    ],
}

endpoint = f"http://{get_mesh_ip()}/v2/models/chatgpt/infer"

response_with_params = requests.post(
    endpoint,
    json=inference_request,
)

pprint.pprint(response_with_params.json()['outputs'][:2])
[{'data': ['assistant'],
  'datatype': 'BYTES',
  'name': 'role',
  'parameters': {'content_type': 'str'},
  'shape': [1, 1]},
 {'data': ['As of now, the tallest building in the world is the Burj Khalifa '
           'in Dubai, United Arab Emirates. It stands at a height of 828 '
           'meters (2,717 feet).'],
  'datatype': 'BYTES',
  'name': 'content',
  'parameters': {'content_type': 'str'},
  'shape': [1, 1]}]
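
You can see this statelessness for yourself by sending a follow-up question directly to the model; with no conversation history attached, it has no way to resolve what "the second" refers to. A minimal sketch reusing the request above:

# Same request as above, but with a follow-up question and no prior context.
follow_up = {
    "inputs": [
        {**inp, "data": ["And what about the second tallest?"]} if inp["name"] == "content" else inp
        for inp in inference_request["inputs"]
    ]
}

# With no history attached, the model cannot know the question refers to buildings.
response = requests.post(endpoint, json=follow_up)
pprint.pprint(response.json()['outputs'][:2])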

Deploying the Seldon Core 2 Pipeline

To maintain the context of the conversation we will integrate the memory runtime. To do this we need to define a pipeline that tells Seldon Core 2 how to wire together the memory and LLM runtimes. The following diagram illustrates the layout of the pipeline.

The pipeline consists of three MLServer models: chatgpt is the OpenAI model served via the API Runtime, while combine-question and combine-answer are both Conversational Memory Runtime models.

  1. Imagine there has already been some conversation. In this case, the conversational memory is a non-empty list of messages.

  2. The user first sends a question to the combine-question memory model.

  3. This model appends the question to the conversational history and returns the updated history.

  4. The updated history tensor is then passed to the LLM (chatgpt).

  5. The output (answer) of the LLM is passed to the combine-answer memory model, which does the same thing as combine-question but with the answer instead of the question.

  6. The answer is returned to the user along with the history. The history can be ignored; it only serves as confirmation that the new message has been committed to the database.

Note that the input to the pipeline also includes a memory_id. This is a session id, managed by the user, which identifies the conversation history in the data store.

!cat manifests/pipelines/pipeline-api.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: chat-app
spec:
  steps:
    - name: combine-question
      inputs:
        - chat-app.inputs.memory_id
        - chat-app.inputs.role
        - chat-app.inputs.content
        - chat-app.inputs.type
    - name: chatgpt
      inputs:
        - combine-question.outputs.history
    - name: combine-answer
      inputs:
      - chat-app.inputs.memory_id
      - chatgpt.outputs.role
      - chatgpt.outputs.content
      - chatgpt.outputs.type
  output:
    steps:
    - chatgpt
    - combine-answer

We can now deploy the pipeline using the following terminal commands.

!kubectl apply -f manifests/pipelines/pipeline-api.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s pipeline --all -n seldon
pipeline.mlops.seldon.io/chat-app created
pipeline.mlops.seldon.io/chat-app condition met

Using the chatbot

The chatbot is now deployed and ready to use. The following code defines a small client class for the pipeline and an ipywidgets UI that sends each question you submit to the chatbot and prints the response. This repeats until you type exit.

import subprocess
import requests
import uuid

def get_mesh_ip():
    # Fetch the external IP of the seldon-mesh LoadBalancer service.
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')



class APIController():
    def __init__(self, target='chat-app'):
        self.session = str(uuid.uuid4())
        self.target = target
        self.endpoint = f"http://{get_mesh_ip()}/v2/pipelines/{target}/infer"

    def _build_request(self, question):
        inference_request = {
            "inputs": [
                {
                    "name": "role", 
                    "shape": [1, 1],
                    "datatype": "BYTES", 
                    "data": ['user'],
                    "parameters": {"content_type": "str"},
                },
                {
                    "name": "content", 
                    "shape": [1, 1],
                    "datatype": "BYTES", 
                    "data": [question],
                    "parameters": {"content_type": "str"},
                },
                {
                    "name": "type", 
                    "shape": [1, 1],
                    "datatype": "BYTES", 
                    "data": ['text'],
                    "parameters": {"content_type": "str"},
                },
                {
                    "name": "memory_id", 
                    "shape": [1, 1],
                    "datatype": "BYTES", 
                    "data": [self.session],
                    "parameters": {"content_type": "str"},
                }
            ]
        }
        headers = {
            "Content-Type": "application/json",
            "seldon-model": f"{self.target}.pipeline"
        }
        return inference_request, headers

    def send(self, text):
        inference_request, headers = self._build_request(text)
        return requests.post(self.endpoint, json=inference_request, headers=headers)

!pip install ipywidgets -q
import ipywidgets as widgets
from IPython.display import display

def display_ui(target: str = 'chat-app'):
    
    # Create text input and button widgets
    text_input = widgets.Text(description="Enter text:")
    button = widgets.Button(description="Submit")
    output = widgets.Output()

    # Create an instance of the API controller
    api = APIController(target)
    print('session:', api.session)

    # Define what happens on button click
    def on_button_click(b):
        text = text_input.value
        if text == 'exit':
            with output:
                print("Exiting...")
            button.disabled = True
        else:
            with output:
                print("User:", text)
                response = api.send(text)
                
                print("Assistant:", response.json()['outputs'][1]['data'][0])
        
        text_input.value = ""  # Clear input after each submission

    # Bind the click event
    button.on_click(on_button_click)

    # Display widgets and output
    display(text_input, button, output)

display_ui()
session: ebe4b1f9-12a7-4c63-bd85-e1df933978b5
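
If you're running outside a notebook (or prefer not to use widgets), a minimal sketch of the same loop using input() works just as well; type exit to stop:

def chat_loop(target: str = 'chat-app'):
    # Plain terminal version of the widget UI above.
    api = APIController(target)
    print('session:', api.session)
    while True:
        text = input("User: ")
        if text == 'exit':
            break
        response = api.send(text)
        # Index 1 is the 'content' output tensor, matching the index used in display_ui above.
        print("Assistant:", response.json()['outputs'][1]['data'][0])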

Using the memory_id that we printed out at the top of the conversation, we can query the memory runtime for the history of that conversation.


inference_request = {
    "inputs": [
        {
            "name": "memory_id",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["ebe4b1f9-12a7-4c63-bd85-e1df933978b5"],
            "parameters": {"content_type": "str"},
        },
    ],
    "parameters": {
        "memory_parameters": {"window_size": 10}
    },
}

endpoint = f"http://{get_mesh_ip()}/v2/models/combine-question/infer"
response_with_params = requests.post(
    endpoint,
    json=inference_request,
)

pprint.pprint(response_with_params.json()['outputs'][0]['data'])
['{"role": "user", "content": [{"type": "text", "value": "Whats the tallest '
 'building in the world?"}]}',
 '{"role": "assistant", "content": [{"type": "text", "value": "As of October '
 '2021, the tallest building in the world is the Burj Khalifa in Dubai, United '
 'Arab Emirates. It stands at a height of 828 meters (2,717 feet)."}]}',
 '{"role": "user", "content": [{"type": "text", "value": "And what about the '
 'second?"}]}',
 '{"role": "assistant", "content": [{"type": "text", "value": "The second '
 'tallest building in the world is the Shanghai Tower in Shanghai, China. It '
 'has a height of 632 meters (2,073 feet)."}]}']
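
Each element of the returned data is a JSON-encoded message, so if you want to work with the history programmatically you can decode it with the standard json module:

import json

# Each stored message has a role and a list of typed content parts.
history = [json.loads(m) for m in response_with_params.json()['outputs'][0]['data']]
for message in history:
    print(message['role'], '->', message['content'][0]['value'])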

Deploying a Local Model with GPU support

As well as the API Runtime, we also provide the Local Runtime, which allows developers to deploy and serve large language models on GPUs. To get this working you'll need to provision a node with GPU support, and the server deployment should request GPU resources.
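
One way to confirm that your nodes expose GPUs to Kubernetes (assuming the NVIDIA device plugin is installed) is to list the allocatable nvidia.com/gpu resource per node:

!kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'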

For this demo, we're going to deploy the mistralai/Mistral-7B-Instruct-v0.2 model because it is small but performant. It's also a gated model on Hugging Face, which means you'll need to create an account and request access. You'll then need to create an access token for your account and turn it into a Kubernetes secret. Check our installation tutorial to see how to create the secret and deploy the Local Runtime server.

When running this example we used a single GPU node with four NVIDIA GeForce RTX 2080 Ti GPUs, but alternative infrastructure setups are also possible.

Model Settings

Take a look at the following model-settings.json file:

!cat models/mistral/model-settings.json
{
    "name": "mistral",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "mistralai/Mistral-7B-Instruct-v0.2",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "max_model_len": 4096,
                    "default_generate_kwargs": {
                        "max_tokens": 1024
                    }
                }
            }
        }
    }
}

For more details on the settings here, please see the Local Runtime documentation page. However, a few key settings are worth pointing out:

  1. The dtype is set to float16 because of the type of GPU we're using. This won't be necessary for other GPUs, for instance A100s.

  2. We specify the backend as vllm, but this can also be transformers, although that will be much less performant. deepspeed is not available here because the NVIDIA GeForce RTX 2080 Ti GPUs don't support it. If you wish to use deepspeed, consider using an L4 or A100 GPU.

  3. By default the device is set to cuda. This can also be set to cpu if you wish; however, this setting is only valid for the transformers backend.

  4. tensor_parallel_size is set to 4 because we're splitting the model over four GPUs (see the back-of-envelope memory estimate after this list).

  5. max_model_len represents the maximum number of tokens that the LLM engine can handle.

  6. max_tokens under default_generate_kwargs represents the maximum number of tokens that can be generated in a response.
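
As a rough back-of-envelope check on why these settings fit the hardware described above (approximate numbers, for illustration only):

params = 7.3e9          # approximate parameter count of Mistral-7B
bytes_per_param = 2     # float16
gpus = 4                # tensor_parallel_size

weights_gb = params * bytes_per_param / 1e9
print(f"weights: ~{weights_gb:.0f} GB total, ~{weights_gb / gpus:.1f} GB per GPU")
# Roughly 15 GB of weights split over four 11 GB RTX 2080 Ti cards leaves
# a few GB per card for activations and the KV cache.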

Deploying the Mistral 7B Model

We deploy the model using the following yaml:

!cat manifests/models/models-local.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: mistral
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/chat-bot/models/mistral"
  requirements:
  - llm-local

Let's deploy!

!kubectl apply -f manifests/models/models-local.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/mistral created
model.mlops.seldon.io/chatgpt condition met
model.mlops.seldon.io/combine-answer condition met
model.mlops.seldon.io/combine-question condition met
model.mlops.seldon.io/mistral condition met
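
Before wiring Mistral into a pipeline, you can sanity-check the model directly. Since it's also configured as a chat.completions model, the request format used for chatgpt earlier should work here too (a sketch, assuming the same role/content/type tensor layout):

# Reuse the inference_request defined earlier, but point at the mistral model.
mistral_endpoint = f"http://{get_mesh_ip()}/v2/models/mistral/infer"
response = requests.post(mistral_endpoint, json=inference_request)
pprint.pprint(response.json()['outputs'][:2])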

Finally, we can swap the OpenAI and Mistral models in and out of the pipeline. To do so, we just replace the chatgpt step in the pipeline definition with mistral. See the following:

!cat manifests/pipelines/pipeline-local.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: chat-app-local
spec:
  steps:
    - name: combine-question
      inputs:
        - chat-app-local.inputs.memory_id
        - chat-app-local.inputs.role
        - chat-app-local.inputs.content
        - chat-app-local.inputs.type
    - name: mistral
      inputs:
        - combine-question.outputs.history
    - name: combine-answer
      inputs:
      - chat-app-local.inputs.memory_id
      - mistral.outputs.role
      - mistral.outputs.content
      - mistral.outputs.type
  output:
    steps:
    - mistral
    - combine-answer

And deploy!

!kubectl apply -f manifests/pipelines/pipeline-local.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s pipeline --all -n seldon
pipeline.mlops.seldon.io/chat-app-local created
pipeline.mlops.seldon.io/chat-app condition met
pipeline.mlops.seldon.io/chat-app-local condition met

Let's test it:

display_ui('chat-app-local')
session: 2e4f8b5f-b7a4-488a-814a-84ad20cdd71b

Undeploying

If you want to undeploy the chatbot you can easily do so using:

!kubectl delete -f manifests/pipelines/pipeline-api.yaml -n seldon
!kubectl delete -f manifests/pipelines/pipeline-local.yaml -n seldon
pipeline.mlops.seldon.io "chat-app" deleted
pipeline.mlops.seldon.io "chat-app-local" deleted
!kubectl delete -f manifests/models/models-api.yaml -n seldon
!kubectl delete -f manifests/models/models-local.yaml -n seldon
model.mlops.seldon.io "combine-question" deleted
model.mlops.seldon.io "combine-answer" deleted
model.mlops.seldon.io "chatgpt" deleted
model.mlops.seldon.io "mistral" deleted
