Example
The following demonstrates how to run a Conversational Memory Runtime MLServer instance locally and illustrates the different ways it can be used. (Note that this example only showcases the Conversational Memory as a stand-alone component; for a more integrated example, check out the chatbot example.)
To get up and running we need to pull the runtime docker image. To pull the docker image, you must be authenticated. Check our installation tutorial to see how you can authenticate with the Docker CLI.
docker pull \
europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-memory:0.7.0
Before we can start the runtime, we need to create the model-settings.json file, which tells it what model to run:
!cat models/filesys-memory/model-settings.json
{
"name": "filesys-memory",
"implementation": "mlserver_memory.ConversationalMemory",
"parameters": {
"extra": {
"database": "filesys",
"config": {
"window_size": 50,
"tensor_names": ["role", "content", "type"]
}
}
}
}
See the ConversationalMemory specification for an explanation of the "config" parameters, as well as the file system backend specification. We currently support two backends: filesys and sql. The filesys backend uses the local file system of the runtime and is intended for development purposes only. Because it's so rudimentary it doesn't require any extra setup, so we'll use it for the remainder of this example. The sql backend can be used with a selection of SQL databases, namely sqlite, postgresql, mysql, mssql and oracle. These all require some configuration. To see how to use the memory runtime with a postgresql or mysql database, see this example.
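The sql backend is configured through a connection url (shown in full in the PostgreSQL example later on this page). The URLs follow the SQLAlchemy format driver://user:password@host:port/database; the sketch below lists illustrative values only, and the drivers, hosts and credentials for the non-PostgreSQL databases are assumptions to adapt to your own setup rather than requirements of the runtime:

# Illustrative SQLAlchemy-style connection URLs for the supported SQL backends.
# Only the PostgreSQL entry is taken from this example; the other drivers,
# hosts and credentials are placeholders for your own configuration.
example_urls = {
    "sqlite": "sqlite:///memory.db",
    "postgresql": "postgresql+pg8000://admin:admin@postgres:5432/db",
    "mysql": "mysql+pymysql://admin:admin@mysql:3306/db",
    "mssql": "mssql+pymssql://admin:admin@mssql:1433/db",
    "oracle": "oracle+oracledb://admin:admin@oracle:1521/db",
}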
File System runtime
Now that the model-settings.json
file is ready, we can start the runtime using:
docker run -it --rm -p 8080:8080 \
-v ${PWD}/models/filesys-memory:/filesys-memory \
europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-memory:0.7.0 \
mlserver start /filesys-memory
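Once the container is running, you can check that the model has loaded before sending any requests; a quick sketch using the standard V2 readiness endpoint (the port and model name match the settings above):

import requests

# Standard V2 inference protocol readiness check for the loaded model.
ready = requests.get("http://localhost:8080/v2/models/filesys-memory/ready")
print(ready.status_code)  # 200 once the model is ready to serve requests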
There are two intended workflows for using the memory runtime: in the first, the developer updates the memory with new messages from the conversation; in the second, the developer retrieves messages from the memory. Every request to the runtime must include a memory_id tensor, which the runtime uses to identify the relevant conversation. Regardless of whether or not the history is updated, the runtime returns up to window_size messages from the memory.
As a more concrete example, imagine we want to get the conversational history for a specific conversation. A developer can do this by sending only the memory_id tensor, in which case the runtime returns the last window_size messages it has stored.
Suppose instead that we want to update the memory with a new message and return the last window_size - 1 messages with the new message appended to the list. To do so we'd send a memory_id tensor as well as three other tensors: role, content, and type. The runtime would then update the memory with the new message, read the last window_size messages (including the new one) and return these. This makes the typical flow of a conversation easy to implement with the runtime. Hopefully the next examples will make this even clearer:
Unimodal inputs
To start with, let's send a request with a memory_id and some content to start a conversational store. The memory_id is generated at random using uuid4.
import requests
from typing import List


def send_inference_request(
    role: List[str],
    content: List[str],
    type: List[str],
    memory_id: str,
    endpoint: str = "http://localhost:8080/v2/models/filesys-memory/infer",
):
    # Build a V2 inference request containing the conversation identifier
    # plus the new message tensors (role, content, type).
    inference_request = {
        "inputs": [
            {
                "name": "memory_id",
                "shape": [1],
                "datatype": "BYTES",
                "data": [memory_id],
                "parameters": {"content_type": "str"},
            },
            {
                "name": "role",
                "shape": [len(role)],
                "datatype": "BYTES",
                "data": role,
                "parameters": {"content_type": "str"},
            },
            {
                "name": "content",
                "shape": [len(content)],
                "datatype": "BYTES",
                "data": content,
                "parameters": {"content_type": "str"},
            },
            {
                "name": "type",
                "shape": [len(type)],
                "datatype": "BYTES",
                "data": type,
                "parameters": {"content_type": "str"},
            },
        ]
    }
    return requests.post(endpoint, json=inference_request)
from uuid import uuid4
memory_id = str(uuid4())
# simulate a conversation
send_inference_request(role=["user"], content=["Hey how are you?"], type=["text"], memory_id=memory_id);
send_inference_request(role=["assistant"], content=["I'm good, how are you?"], type=["text"], memory_id=memory_id);
send_inference_request(role=["user"], content=["I'm good too. Can I ask you a question?"], type=["text"], memory_id=memory_id);
send_inference_request(role=["assistant"], content=["Sure, what's your question?"], type=["text"], memory_id=memory_id);
send_inference_request(role=["user"], content=["What is the capital of France?"], type=["text"], memory_id=memory_id);
We should now be able to retrieve the conversation above by sending a get request (i.e., a request containing only the memory_id tensor).
import pprint
import requests
def get_inference_request(
memory_id: str,
endpoint: str = "http://localhost:8080/v2/models/filesys-memory/infer",
):
inference_request = {
"inputs": [
{
"name": "memory_id",
"shape": [1],
"datatype": "BYTES",
"data": [memory_id],
"parameters": {"content_type": "str"},
}
]
}
return requests.post(endpoint, json=inference_request)
response = get_inference_request(memory_id=memory_id)
pprint.pprint(response.json())
{'id': '07d4c4c9-e337-49b8-a446-bbbe972d7b9c',
'model_name': 'filesys-memory',
'outputs': [{'data': ['846424c3-d6ab-4e80-9422-644d97f4a934'],
'datatype': 'BYTES',
'name': 'memory_id',
'parameters': {'content_type': 'str'},
'shape': [1, 1]},
{'data': ['{"role": "user", "content": [{"type": "text", "value": '
'"Hey how are you?"}]}',
'{"role": "assistant", "content": [{"type": "text", '
'"value": "I\'m good, how are you?"}]}',
'{"role": "user", "content": [{"type": "text", "value": '
'"I\'m good too. Can I ask you a question?"}]}',
'{"role": "assistant", "content": [{"type": "text", '
'"value": "Sure, what\'s your question?"}]}',
'{"role": "user", "content": [{"type": "text", "value": '
'"What is the capital of France?"}]}'],
'datatype': 'BYTES',
'name': 'history',
'parameters': {'content_type': 'str'},
'shape': [5, 1]}],
'parameters': {}}
As expected, it returns all the previous messages we just sent. Remember that you can control the number of messages returned by changing the window_size parameter: if we had set window_size=2, only the last two messages would have been returned.
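Each element of the history output is a JSON-encoded message. A minimal sketch of decoding the response above into Python dictionaries (for example, to pass the history on to a chat model) might look like this:

import json

# Pull the "history" output out of the V2 response and decode each
# JSON-encoded message into a Python dict.
outputs = {o["name"]: o for o in response.json()["outputs"]}
history = [json.loads(message) for message in outputs["history"]["data"]]

for message in history:
    print(message["role"], "->", message["content"])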
Multimodal inputs
Multimodal input messages can be sent to the memory runtime. As an example:
memory_id = str(uuid4())
request_response = send_inference_request(
role=["user"],
content=["This is the first input", "This is the second input"],
type=["text", "text"],
memory_id=memory_id
)
pprint.pprint(request_response.json())
{'id': '618090fa-e326-4516-addf-b31057664088',
'model_name': 'filesys-memory',
'outputs': [{'data': ['1323ae57-d516-408a-92f2-f09f42b694b3'],
'datatype': 'BYTES',
'name': 'memory_id',
'parameters': {'content_type': 'str'},
'shape': [1, 1]},
{'data': ['{"role": "user", "content": [{"type": "text", "value": '
'"This is the first input"}, {"type": "text", "value": '
'"This is the second input"}]}'],
'datatype': 'BYTES',
'name': 'history',
'parameters': {'content_type': 'str'},
'shape': [1, 1]}],
'parameters': {}}
The above writes both inputs under the same entry. This is useful when deploying multimodal models that accept multiple inputs at the same time (e.g., (text, text), (text, image), (text, audio)).
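As an illustration, a mixed (text, image) message could be stored in the same way, assuming the runtime's type tensor accepts an image type and the image is passed as, e.g., a URL or base64 string; the values below are hypothetical, so check the ConversationalMemory specification for the content types your model expects:

# Hypothetical mixed-modality message: the "image" type and the URL value
# are illustrative only -- consult the ConversationalMemory specification
# for the content types supported by your deployment.
send_inference_request(
    role=["user"],
    content=["What is shown in this picture?", "https://example.com/cat.png"],
    type=["text", "image"],
    memory_id=memory_id,
)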
The above showcases the memory runtime as a stand-alone server. In production the runtime is intended to be used with a SQL database instead of the file system. To see an example of configuring the memory runtime to work with postgresql or mysql databases within Kubernetes, see this example. The chatbot example showcases how to use the memory runtime as a component within a Seldon Core 2 pipeline along with other components such as large language models.
Deploying on Seldon Core 2
We are now going to show how to deploy the conversational memory runtime with an SQL backend. This tutorial assumes that you have:
A Kubernetes cluster set up with Seldon Core 2 installed in the seldon namespace
A PostgreSQL database deployment and service in the seldon namespace
The memory runtime server up and running
While the runtime image can be used as a standalone server, in most cases you'll want to deploy it as part of a Kubernetes cluster. In order to start serving memory models, you need to deploy the Memory Runtime; check our installation tutorial to see how you can deploy the Memory Runtime server.
Once the Memory Runtime server is up and running, we are now ready to deploy memory models.
PostgreSQL
We begin with the following PostgreSQL deployment yaml:
!cat manifests/psql-deployment.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: postgres
labels:
app: postgres
spec:
replicas: 1
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: postgres:latest
env:
- name: POSTGRES_USER
value: admin
- name: POSTGRES_PASSWORD
value: admin
- name: POSTGRES_DB
value: db
ports:
- containerPort: 5432
readinessProbe:
exec:
command: ["pg_isready", "-U", "admin", "-d", "db"]
initialDelaySeconds: 5
periodSeconds: 1
---
apiVersion: v1
kind: Service
metadata:
name: postgres
spec:
type: LoadBalancer
selector:
app: postgres
ports:
- protocol: TCP
port: 5432
targetPort: 5432
The above gives us a PostgreSQL database called db with a user admin whose password is admin. You should not use these credentials in production, and you should name the database something more informative than db. Note that if you change any of these values you'll need to update the url string in the model-settings.json to reflect them.
The following is an example of the model-settings.json
file we'll use:
!cat models/psql-memory/model-settings.json
{
"name": "psql-memory",
"implementation": "mlserver_memory.ConversationalMemory",
"parameters": {
"extra": {
"database": "sql",
"config": {
"url": "postgresql+pg8000://admin:admin@postgres:5432/db",
"connection_options": {},
"window_size": 50,
"tensor_names": ["role", "content", "type"]
}
}
}
}
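The url above follows the SQLAlchemy format driver://user:password@host:port/database, using the pg8000 driver. As a rough sketch, it maps onto the PostgreSQL manifest like this (the variable names are just for illustration):

# How the "url" above maps onto the PostgreSQL manifest: POSTGRES_USER,
# POSTGRES_PASSWORD and POSTGRES_DB from the Deployment, plus the Service
# name and port. The variable names here are illustrative only.
user, password = "admin", "admin"              # POSTGRES_USER / POSTGRES_PASSWORD
host, port, database = "postgres", 5432, "db"  # Service name / port / POSTGRES_DB

url = f"postgresql+pg8000://{user}:{password}@{host}:{port}/{database}"
assert url == "postgresql+pg8000://admin:admin@postgres:5432/db"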
Like all model deployments, the above configuration file needs to be uploaded to a Google Cloud Storage bucket or MinIO storage; in our case, we're using a Google Cloud Storage bucket. Once this is done, we create the following Model manifest.
!cat manifests/psql-memory.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: psql-memory
spec:
storageUri: "gs://seldon-models/llm-runtimes-settings/memory/example/models/psql-memory"
requirements:
- memory
We are now ready to apply the deployment and the model to our k8s cluster. We begin by applying the database deployment:
!kubectl apply -f manifests/psql-deployment.yaml -n seldon
!kubectl rollout status deployment/postgres -n seldon --timeout=600s
deployment.apps/postgres created
service/postgres created
Waiting for deployment "postgres" rollout to finish: 0 of 1 updated replicas are available...
deployment "postgres" successfully rolled out
Once the PostgreSQL server is deployed, we can apply our model:
!kubectl apply -f manifests/psql-memory.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/psql-memory created
model.mlops.seldon.io/psql-memory condition met
We can test it as before by sending multiple requests to our model. The only difference is that the endpoint has changed.
import subprocess
from uuid import uuid4


def get_mesh_ip():
    # Fetch the external IP of the seldon-mesh LoadBalancer service, which
    # exposes the Core 2 inference endpoints.
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode("utf-8")


endpoint = f"http://{get_mesh_ip()}/v2/models/psql-memory/infer"
memory_id = str(uuid4())
send_inference_request(role=["system"], content=["You are a helpful assistant"], type=["text"], memory_id=memory_id, endpoint=endpoint);
send_inference_request(role=["user"], content=["Hi! How are you doing today?"], type=["text"], memory_id=memory_id, endpoint=endpoint);
send_inference_request(role=["assistant"], content=["I'm doing great, how about you?"], type=["text"], memory_id=memory_id, endpoint=endpoint);
send_inference_request(role=["user"], content=["I'm doing well too. I have a question for you."], type=["text"], memory_id=memory_id, endpoint=endpoint);
send_inference_request(role=["assistant"], content=["Sure, what's your question?"], type=["text"], memory_id=memory_id, endpoint=endpoint);
As before, we can retrieve the entire history by sending a get
request.
response = get_inference_request(memory_id=memory_id, endpoint=endpoint)
pprint.pprint(response.json())
{'id': '80893a7d-443d-406c-ade9-2a953156626b',
'model_name': 'psql-memory_1',
'model_version': '1',
'outputs': [{'data': ['7e69002d-de89-4fac-928a-002889a09b65'],
'datatype': 'BYTES',
'name': 'memory_id',
'parameters': {'content_type': 'str'},
'shape': [1, 1]},
{'data': ['{"role": "system", "content": [{"type": "text", '
'"value": "You are a helpful assistant"}]}',
'{"role": "assistant", "content": [{"type": "text", '
'"value": "I\'m doing great, how about you?"}]}',
'{"role": "user", "content": [{"type": "text", "value": '
'"I\'m doing well too. I have a question for you."}]}',
'{"role": "assistant", "content": [{"type": "text", '
'"value": "Sure, what\'s your question?"}]}'],
'datatype': 'BYTES',
'name': 'history',
'parameters': {'content_type': 'str'},
'shape': [4, 1]}],
'parameters': {}}
To clean up the cluster (removing the psql-memory model and the PostgreSQL deployment), run the following commands:
!kubectl delete -f manifests/psql-memory.yaml -n seldon
!kubectl delete -f manifests/psql-deployment.yaml -n seldon
model.mlops.seldon.io "psql-memory" deleted
deployment.apps "postgres" deleted
service "postgres" deleted