Chat Bot
In this example, we're going to build a chatbot using the LLM Module runtimes. You'll see how each of these components can be leveraged within a Seldon Core 2 pipeline to build an LLM application: without writing any backend code, we can put together a production-ready chat application in almost no time. This tutorial relies on Kubernetes and Seldon Core 2, so some familiarity with both tools will be useful. To follow along you'll also need the LLM Module set up on your cluster, with images of both the OpenAI Runtime and the Conversational Memory Runtime deployed. For this tutorial I've assumed they're deployed under a seldon-mesh namespace, but this doesn't have to be the case.
The app we're going to build is a simple question-answer chatbot with a persistent memory store. To build it we'll use an OpenAI model served via the API Runtime, together with the Conversational Memory Runtime. The following illustrates an interaction between the chatbot and a user.

Model deployments
To build the above we have to first do two things:
1. Create the model-settings.json files for the memory and LLM models.
2. Create the model and pipeline specifications for Seldon Core 2.
Model-settings configs
Let's start with the large language model. We will use the gpt-3.5-turbo model with the chat.completions endpoint. For more explanation of the available options, see this example.
!cat models/chatgpt/model-settings.json
{
  "name": "chatgpt",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "gpt-3.5-turbo",
        "model_type": "chat.completions"
      }
    }
  }
}
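For reference, the provider_id and model_type here map onto OpenAI's chat.completions endpoint. The runtime makes this call for you, but a rough stand-alone equivalent (a sketch assuming the openai Python client is installed and an OPENAI_API_KEY is set in your environment) looks like this:

# Illustrative only: the direct OpenAI call that this model-settings
# configuration corresponds to. The runtime handles this for you.
from openai import OpenAI

client = OpenAI()
completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hi, what's the tallest building in the world?"}],
)
print(completion.choices[0].message.content)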
For the memory runtime, we'll use the filesys database backend, a window_size of 100, and we'll set the tensor_names of the tensors we want to store to "role", "content", and "type". Note that we choose these tensor names because they match both the input and output tensor names of the large language model. As you'll see later, this means we can use the output of the LLM directly as input to our memory and vice versa.
!cat models/memory/model-settings.json
{
  "name": "memory",
  "implementation": "mlserver_memory.ConversationalMemory",
  "parameters": {
    "extra": {
      "database": "filesys",
      "config": {
        "window_size": 100,
        "tensor_names": ["content", "role", "type"]
      }
    }
  }
}
Seldon CRDs
Now that we've defined the model configs, we need to create the custom resources that tell Seldon to deploy those models. Before we can do this, we first need to upload the model settings to remote storage, from which the cluster will fetch them at deployment time. When developing locally this step can seem redundant, but in production the cluster fetches the model settings from the remote storage, not from your local environment. We recommend using MinIO for the remote storage, but for this demo I've placed the model-settings.json files in a Google bucket.
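If you go the MinIO route, the upload is just an object copy of each model-settings.json into a bucket. The sketch below uses the minio Python client; the endpoint, credentials, and bucket layout are placeholders rather than part of this tutorial's setup:

# A minimal sketch of uploading the model-settings files to MinIO.
# The endpoint, credentials, and bucket name are illustrative placeholders;
# substitute the values for your own MinIO deployment.
from minio import Minio

client = Minio(
    "minio.minio-system.svc.cluster.local:9000",  # assumed in-cluster endpoint
    access_key="minioadmin",                      # placeholder credentials
    secret_key="minioadmin",
    secure=False,
)

# Ensure the target bucket exists, then upload each runtime's settings file.
if not client.bucket_exists("models"):
    client.make_bucket("models")

for model in ["chatgpt", "memory"]:
    client.fput_object(
        "models",
        f"chat-bot/models/{model}/model-settings.json",
        f"models/{model}/model-settings.json",
    )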
The following model yaml tells Seldon Core 2 and Kubernetes which models to deploy, their names, where their specifications live, and any requirements they have. In particular, the memory and openai requirements ensure each model is scheduled onto a server running the corresponding MLServer runtime.
!cat manifests/models/models-api.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: combine-question
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/chat-bot/models/memory"
  requirements:
    - memory
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: combine-answer
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/chat-bot/models/memory"
  requirements:
    - memory
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: chatgpt
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/chat-bot/models/chatgpt"
  requirements:
    - openai
Deploying the models
To deploy the models, we first need to create the secret containing the API key and set up servers for our memory and API deployments. Check our installation tutorial to see how you can create the secret and deploy the API and Memory Runtime servers.
Once you have all of the above up and running, we are ready to deploy our models using the manifest shown above.
To deploy the models, run the following:
!kubectl apply -f manifests/models/models-api.yaml -n seldon
!kubectl wait --for condition=ready --timeout=300s model --all -n seldon
model.mlops.seldon.io/combine-question unchanged
model.mlops.seldon.io/combine-answer unchanged
model.mlops.seldon.io/chatgpt created
model.mlops.seldon.io/chatgpt condition met
model.mlops.seldon.io/combine-answer condition met
model.mlops.seldon.io/combine-question condition met
We can now query the LLM by sending a request. Note that this interaction is stateless: follow-up questions will confuse the model because it won't have the context of having just answered a question about the Burj Khalifa.
import pprint
import subprocess

import requests


def get_mesh_ip():
    # Look up the external IP of the seldon-mesh LoadBalancer service.
    cmd = f"kubectl get svc seldon-mesh -n seldon -o jsonpath='{{.status.loadBalancer.ingress[0].ip}}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')


# Open Inference Protocol request: the message is sent as three BYTES
# tensors -- content, role and type -- matching the LLM's input tensor names.
inference_request = {
    "inputs": [
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Hi, what's the tallest building in the world?"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "role",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["user"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "type",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["text"],
            "parameters": {"content_type": "str"},
        }
    ],
}

endpoint = f"http://{get_mesh_ip()}/v2/models/chatgpt/infer"
response_with_params = requests.post(
    endpoint,
    json=inference_request,
)
pprint.pprint(response_with_params.json()['outputs'][:2])
[{'data': ['assistant'],
'datatype': 'BYTES',
'name': 'role',
'parameters': {'content_type': 'str'},
'shape': [1, 1]},
{'data': ['As of now, the tallest building in the world is the Burj Khalifa '
'in Dubai, United Arab Emirates. It stands at a height of 828 '
'meters (2,717 feet).'],
'datatype': 'BYTES',
'name': 'content',
'parameters': {'content_type': 'str'},
'shape': [1, 1]}]
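To see this statelessness for yourself, you can send a follow-up question to the same endpoint. Since the request carries no conversation history, the model has nothing to resolve "it" against (a quick check, reusing the request object from above):

# Send a context-free follow-up to the same stateless endpoint.
# With no conversation history in the request, the model is unlikely
# to connect "it" to the Burj Khalifa from the previous question.
inference_request["inputs"][0]["data"] = ["How long did it take to build it?"]

follow_up_response = requests.post(endpoint, json=inference_request)
pprint.pprint(follow_up_response.json()["outputs"][:2])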
Deploying the Seldon Core 2 Pipeline
To maintain the context of the conversation we will integrate the memory runtime. To do this we need to define a pipeline that tells Seldon Core 2 how to wire together the memory and LLM runtimes. The following diagram illustrates the layout of the pipeline.
The pipeline consists of three MLServer runtimes: chatgpt is the OpenAI model served via the API Runtime, while combine-question and combine-answer are both Conversational Memory runtimes.
Imagine there has been some conversation already. In this case, the conversational memory is a non-empty list of messages.
1. The user first sends a question to the combine-question memory model.
2. This model appends the question to the conversational history and returns the history with the question included.
3. The conversational history tensor is then passed to the LLM (chatgpt).
4. The output (the answer) of the LLM is passed to the combine-answer memory model, which does the same thing as combine-question but with the answer instead of the question.
5. The answer is returned to the user along with the history. The history can be ignored, since it only serves as confirmation that the new message has been committed to the database.
Note that the input to the pipeline also includes a memory_id. This is a session id that should be managed by the user; it identifies the conversation history in the data store.
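In practice the memory_id travels as just another BYTES input tensor alongside role, content and type. A minimal sketch of how a client might mint one per conversation (the chatbot client later in this tutorial does exactly this, generating a fresh UUID per session):

import uuid

# One memory_id per conversation: reusing the same id on subsequent requests
# tells the memory runtime to append to (and return) the same history.
memory_id = str(uuid.uuid4())

memory_id_input = {
    "name": "memory_id",
    "shape": [1, 1],
    "datatype": "BYTES",
    "data": [memory_id],
    "parameters": {"content_type": "str"},
}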
!cat manifests/pipelines/pipeline-api.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: chat-app
spec:
  steps:
    - name: combine-question
      inputs:
        - chat-app.inputs.memory_id
        - chat-app.inputs.role
        - chat-app.inputs.content
        - chat-app.inputs.type
    - name: chatgpt
      inputs:
        - combine-question.outputs.history
    - name: combine-answer
      inputs:
        - chat-app.inputs.memory_id
        - chatgpt.outputs.role
        - chatgpt.outputs.content
        - chatgpt.outputs.type
  output:
    steps:
      - chatgpt
      - combine-answer
We can now deploy the pipeline using the following terminal commands.
!kubectl apply -f manifests/pipelines/pipeline-api.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s pipeline --all -n seldon
pipeline.mlops.seldon.io/chat-app created
pipeline.mlops.seldon.io/chat-app condition met
Using the chatbot
The chatbot is now deployed and can be used. The following code builds a small ipywidgets interface that asks the user to input a question, sends it to the chatbot, and displays the response. This repeats until the user types exit.
import subprocess
import uuid

import requests


def get_mesh_ip():
    # Look up the external IP of the seldon-mesh LoadBalancer service.
    cmd = f"kubectl get svc seldon-mesh -n seldon -o jsonpath='{{.status.loadBalancer.ingress[0].ip}}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')


class APIController():
    def __init__(self, target='chat-app'):
        # Each controller owns a fresh session id (the memory_id) so that
        # its requests map to a single conversation history.
        self.session = str(uuid.uuid4())
        self.target = target
        self.endpoint = f"http://{get_mesh_ip()}/v2/pipelines/{target}/infer"

    def _build_request(self, question):
        # Open Inference Protocol request: role, content, type and memory_id
        # are each passed as separate BYTES tensors.
        inference_request = {
            "inputs": [
                {
                    "name": "role",
                    "shape": [1, 1],
                    "datatype": "BYTES",
                    "data": ['user'],
                    "parameters": {"content_type": "str"},
                },
                {
                    "name": "content",
                    "shape": [1, 1],
                    "datatype": "BYTES",
                    "data": [question],
                    "parameters": {"content_type": "str"},
                },
                {
                    "name": "type",
                    "shape": [1, 1],
                    "datatype": "BYTES",
                    "data": ['text'],
                    "parameters": {"content_type": "str"},
                },
                {
                    "name": "memory_id",
                    "shape": [1, 1],
                    "datatype": "BYTES",
                    "data": [self.session],
                    "parameters": {"content_type": "str"},
                }
            ]
        }
        headers = {
            "Content-Type": "application/json",
            "seldon-model": f"{self.target}.pipeline"
        }
        return inference_request, headers

    def send(self, text):
        inference_request, headers = self._build_request(text)
        return requests.post(self.endpoint, json=inference_request, headers=headers)
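Before wiring this into a UI, the controller can be smoke-tested directly against the deployed pipeline. The second output tensor holds the assistant's reply (the same index the widget code below uses):

# Quick smoke test of the chat-app pipeline: one question, one answer.
api = APIController()
reply = api.send("Hi, what's the tallest building in the world?")
print(reply.json()["outputs"][1]["data"][0])  # the assistant's content tensor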
!pip install ipywidgets -q
import ipywidgets as widgets
from IPython.display import display


def display_ui(target: str = 'chat-app'):
    # Create text input and button widgets
    text_input = widgets.Text(description="Enter text:")
    button = widgets.Button(description="Submit")
    output = widgets.Output()

    # Create an instance of the API controller
    api = APIController(target)
    print('session:', api.session)

    # Define what happens on button click
    def on_button_click(b):
        text = text_input.value
        if text == 'exit':
            with output:
                print("Exiting...")
            button.disabled = True
        else:
            with output:
                print("User:", text)
                response = api.send(text)
                print("Assistant:", response.json()['outputs'][1]['data'][0])
        text_input.value = ""  # Clear input after each submission

    # Bind the click event
    button.on_click(on_button_click)

    # Display widgets and output
    display(text_input, button, output)
display_ui()
session: ebe4b1f9-12a7-4c63-bd85-e1df933978b5
Text(value='', description='Enter text:')
Button(description='Submit', style=ButtonStyle())
Output()
Using the memory_id (the session id printed at the top of the conversation), we can query the memory runtime for the history of that conversation.
inference_request = {
    "inputs": [
        {
            "name": "memory_id",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["ebe4b1f9-12a7-4c63-bd85-e1df933978b5"],
            "parameters": {"content_type": "str"},
        },
    ],
    "parameters": {
        "memory_parameters": {"window_size": 10}
    },
}

endpoint = f"http://{get_mesh_ip()}/v2/models/combine-question/infer"
response_with_params = requests.post(
    endpoint,
    json=inference_request,
)
pprint.pprint(response_with_params.json()['outputs'][0]['data'])
['{"role": "user", "content": [{"type": "text", "value": "Whats the tallest '
'building in the world?"}]}',
'{"role": "assistant", "content": [{"type": "text", "value": "As of October '
'2021, the tallest building in the world is the Burj Khalifa in Dubai, United '
'Arab Emirates. It stands at a height of 828 meters (2,717 feet)."}]}',
'{"role": "user", "content": [{"type": "text", "value": "And what about the '
'second?"}]}',
'{"role": "assistant", "content": [{"type": "text", "value": "The second '
'tallest building in the world is the Shanghai Tower in Shanghai, China. It '
'has a height of 632 meters (2,073 feet)."}]}']
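Each entry in the returned history is a JSON-encoded message, so it's straightforward to decode the raw strings back into structured messages. A small post-processing sketch over the response above:

import json

# Decode each history entry and print a compact "role: text" transcript.
history = [json.loads(m) for m in response_with_params.json()["outputs"][0]["data"]]
for message in history:
    text = " ".join(part["value"] for part in message["content"] if part.get("type") == "text")
    print(f"{message['role']}: {text}")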
Deploying a Local Model with GPU support
As well as the API Runtime, we also provide the Local Runtime, which allows developers to deploy and serve large language models on GPUs. To get this working, you will need to provision a node with GPU support, and the server should request GPU resources.
For this demo, we're going to deploy the mistralai/Mistral-7B-Instruct-v0.2 model because it is small but performant. It's also a gated model on Hugging Face, which means you'll need to create an account and request access. You'll then need to create an access token for your account and turn it into a Kubernetes secret. Check our installation tutorial to see how you can create the secret and deploy the Local Runtime server.
When running this example we used a single GPU node with four NVIDIA GeForce RTX 2080 Ti GPUs, but alternative infrastructure setups are also possible.
Model Settings
Take a look at the following model-settings.json file:
!cat models/mistral/model-settings.json
{
  "name": "mistral",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "vllm",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "mistralai/Mistral-7B-Instruct-v0.2",
          "tensor_parallel_size": 4,
          "dtype": "float16",
          "max_model_len": 4096,
          "default_generate_kwargs": {
            "max_tokens": 1024
          }
        }
      }
    }
  }
}
For more details on the settings here please view the local runtime documentation page. However, a couple of key settings are worth pointing out:
- The dtype is set to float16 because of the type of GPU we're using. This won't be necessary for other GPUs, for instance A100s.
- We specify the backend as vllm, but this can also be transformers, although it will be much less performant. deepspeed is not available here because the NVIDIA GeForce RTX 2080 Ti GPUs don't support it; if you wish to use deepspeed, consider using an L4 or A100 GPU.
- By default the device is set to cuda. This can also be set to cpu if you wish, however this setting is only valid for the transformers backend.
- tensor_parallel_size is set to 4 because we're splitting the model over four GPUs.
- max_model_len represents the maximum number of tokens that the LLM engine can handle.
- max_tokens under default_generate_kwargs represents the maximum number of tokens that can be generated. These settings map onto vLLM engine arguments, as sketched below.
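If you want to sanity-check these settings outside the runtime, they correspond roughly to vLLM's offline API as in the sketch below. This is illustrative only; it assumes vLLM is installed on the GPU node, four GPUs are visible, and you have access to the gated model. The Local Runtime does this wiring for you:

# Illustrative mapping of the model-settings config onto vLLM's offline API.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    tensor_parallel_size=4,   # shard the model across four GPUs
    dtype="float16",          # required on RTX 2080 Ti (no bfloat16 support)
    max_model_len=4096,       # maximum number of tokens the engine will handle
)

sampling_params = SamplingParams(max_tokens=1024)  # cap on generated tokens
outputs = llm.generate(["What is the tallest building in the world?"], sampling_params)
print(outputs[0].outputs[0].text)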
Deploying the Mistral 7B Model
We deploy the model using the following yaml:
!cat manifests/models/models-local.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: mistral
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/chat-bot/models/mistral"
  requirements:
    - llm-local
Let's deploy!
!kubectl apply -f manifests/models/models-local.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/mistral created
model.mlops.seldon.io/chatgpt condition met
model.mlops.seldon.io/combine-answer condition met
model.mlops.seldon.io/combine-question condition met
model.mlops.seldon.io/mistral condition met
Finally, we can directly swap the OpenAI and Mistral models in and out of the pipeline. To do so, we just replace the chatgpt model in the pipeline yaml definition with mistral. See the following:
!cat manifests/pipelines/pipeline-local.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: chat-app-local
spec:
  steps:
    - name: combine-question
      inputs:
        - chat-app-local.inputs.memory_id
        - chat-app-local.inputs.role
        - chat-app-local.inputs.content
        - chat-app-local.inputs.type
    - name: mistral
      inputs:
        - combine-question.outputs.history
    - name: combine-answer
      inputs:
        - chat-app-local.inputs.memory_id
        - mistral.outputs.role
        - mistral.outputs.content
        - mistral.outputs.type
  output:
    steps:
      - mistral
      - combine-answer
And deploy!
!kubectl apply -f manifests/pipelines/pipeline-local.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s pipeline --all -n seldon
pipeline.mlops.seldon.io/chat-app-local created
pipeline.mlops.seldon.io/chat-app condition met
pipeline.mlops.seldon.io/chat-app-local condition met
Let's test it:
display_ui('chat-app-local')
session: 2e4f8b5f-b7a4-488a-814a-84ad20cdd71b
Text(value='', description='Enter text:')
Button(description='Submit', style=ButtonStyle())
Output()
Undeploying
If you want to undeploy the chatbot, you can easily do so using:
!kubectl delete -f manifests/pipelines/pipeline-api.yaml -n seldon
!kubectl delete -f manifests/pipelines/pipeline-local.yaml -n seldon
pipeline.mlops.seldon.io "chat-app" deleted
pipeline.mlops.seldon.io "chat-app-local" deleted
!kubectl delete -f manifests/models/models-api.yaml -n seldon
!kubectl delete -f manifests/models/models-local.yaml -n seldon
model.mlops.seldon.io "combine-question" deleted
model.mlops.seldon.io "combine-answer" deleted
model.mlops.seldon.io "chatgpt" deleted
model.mlops.seldon.io "mistral" deleted