Example

The following demonstrates how to run a Conversational Memory Runtime MLServer instance locally and illustrates the different ways it can be used. (Note that this example only showcases the Conversational Memory as a stand-alone component; for a more integrated example, check out the chatbot example.)

To get up and running we first need to pull the runtime Docker image. Pulling the image requires authentication; check our installation tutorial to see how to authenticate with the Docker CLI.

docker pull \
    europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-memory:0.7.0

Before we can start the runtime, we need to create the model-settings.json file, which tells it what model to run:

!cat models/filesys-memory/model-settings.json
{
  "name": "filesys-memory",
  "implementation": "mlserver_memory.ConversationalMemory",
  "parameters": {
    "extra": {
      "database": "filesys",
      "config": {
        "window_size": 50,
        "tensor_names": ["role", "content", "type"]
      }
    }
  }
}

See the ConversationalMemory specification for an explanation of the "config" parameters, as well as the file system backend specification. Currently we support the following backends: filesys and sql. filesys uses the local file system of the runtime and is intended for development purposes only. Because it is so rudimentary it doesn't require any extra setup, and we'll use it for the remainder of this example. The sql backend can be used with a selection of SQL databases, namely: sqlite, postgresql, mysql, mssql and oracle. These all require some configuration. To see how to use the memory runtime with a postgresql or mysql database, see this example.
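
For illustration, the sql backend reuses the same config block with a database URL. Below is a minimal sketch of what an sqlite-backed model-settings.json might look like, written out from Python. The sqlite:/// URL format mirrors the SQLAlchemy-style URL used in the PostgreSQL example later on this page; treat the exact URL, model name and paths as assumptions rather than a tested configuration.

import json
import os

# Hedged sketch of an sqlite-flavoured config for the "sql" backend.
# The "sqlite:///<file>" URL is an assumed SQLAlchemy-style connection string;
# the model name and file paths are illustrative only.
sqlite_settings = {
    "name": "sqlite-memory",
    "implementation": "mlserver_memory.ConversationalMemory",
    "parameters": {
        "extra": {
            "database": "sql",
            "config": {
                "url": "sqlite:///conversations.db",
                "connection_options": {},
                "window_size": 50,
                "tensor_names": ["role", "content", "type"],
            },
        }
    },
}

os.makedirs("models/sqlite-memory", exist_ok=True)
with open("models/sqlite-memory/model-settings.json", "w") as f:
    json.dump(sqlite_settings, f, indent=2)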

File System runtime

Now that the model-settings.json file is ready, we can start the runtime using:

docker run -it --rm -p 8080:8080 \
  -v ${PWD}/models/filesys-memory:/filesys-memory \
  europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-memory:0.7.0 \
  mlserver start /filesys-memory
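
Once the container is running, it can be useful to confirm that the model has loaded before sending any requests. The snippet below polls the standard V2 readiness endpoint exposed by MLServer, assuming the -p 8080:8080 port mapping from the command above:

import time

import requests

# Poll MLServer's V2 readiness endpoint until the filesys-memory model reports ready.
# Assumes the port mapping from the docker run command above (localhost:8080).
ready_url = "http://localhost:8080/v2/models/filesys-memory/ready"

for _ in range(30):
    try:
        if requests.get(ready_url, timeout=1).status_code == 200:
            print("filesys-memory is ready")
            break
    except requests.ConnectionError:
        pass  # the server is not accepting connections yet
    time.sleep(1)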

There are two intended workflows for using the memory runtime: in the first, the developer updates the memory with new messages from the conversation; in the second, the developer retrieves messages from the memory. Every request sent to the runtime must include a memory_id tensor, which the runtime uses to identify the relevant conversation. Regardless of whether the history is updated, the runtime returns up to window_size messages from the memory.

As a more concrete example, imagine we want to retrieve the history of a specific conversation. A developer can do this by sending just the memory_id tensor; the runtime will then return the last window_size messages it has stored.

Suppose instead that we want to update the memory with a new message and get back the last window_size - 1 messages with the new message appended to the list. To do so we'd send a memory_id tensor as well as three other tensors: role, content, and type. The runtime would then update the memory with the new message, read the last window_size messages (including the new one) and return them. This makes the typical flow of a conversation easy to implement with the runtime. The following examples should make this clearer:

The window_size parameter is intended to control the size of requests being sent through the pipelines in which the memory runtime is deployed. It is not intended to control the number of tokens sent to the LLM; token bounds can be configured in the Local Runtime.

Unimodal inputs

To start with, let's send a request containing a memory_id and some content to create a new conversation in the store. We generate the memory_id at random using uuid4.

import requests
from typing import List


def send_inference_request(
    role: List[str],
    content: List[str],
    type: List[str],
    memory_id: str,
    endpoint: str = "http://localhost:8080/v2/models/filesys-memory/infer",
):
    """Append a message to the conversation identified by memory_id and return the response."""
    inference_request = {
        "inputs": [
            {
                "name": "memory_id",
                "shape": [1],
                "datatype": "BYTES",
                "data": [memory_id],
                "parameters": {"content_type": "str"},
            },
            {
                "name": "role",
                # the declared shape must match the number of elements sent
                "shape": [len(role)],
                "datatype": "BYTES",
                "data": role,
                "parameters": {"content_type": "str"},
            },
            {
                "name": "content",
                "shape": [len(content)],
                "datatype": "BYTES",
                "data": content,
                "parameters": {"content_type": "str"},
            },
            {
                "name": "type",
                "shape": [len(type)],
                "datatype": "BYTES",
                "data": type,
                "parameters": {"content_type": "str"},
            },
        ]
    }
    return requests.post(endpoint, json=inference_request)


from uuid import uuid4

memory_id = str(uuid4())

# simulate a conversation
send_inference_request(role=["user"], content=["Hey how are you?"], type=["text"], memory_id=memory_id);
send_inference_request(role=["assistant"], content=["I'm good, how are you?"], type=["text"], memory_id=memory_id);
send_inference_request(role=["user"], content=["I'm good too. Can I ask you a question?"], type=["text"], memory_id=memory_id);
send_inference_request(role=["assistant"], content=["Sure, what's your question?"], type=["text"], memory_id=memory_id);
send_inference_request(role=["user"], content=["What is the capital of France?"], type=["text"], memory_id=memory_id);

We should now be able to retrieve the conversation above by sending a get request (i.e., a request containing only the memory_id tensor).

import pprint
import requests

def get_inference_request(
    memory_id: str,
    endpoint: str = "http://localhost:8080/v2/models/filesys-memory/infer",
):
    inference_request = {
        "inputs": [
            {
                "name": "memory_id",
                "shape": [1],
                "datatype": "BYTES",
                "data": [memory_id],
                "parameters": {"content_type": "str"},
            }
        ]
    }
    return requests.post(endpoint, json=inference_request)


response = get_inference_request(memory_id=memory_id)
pprint.pprint(response.json())
{'id': '07d4c4c9-e337-49b8-a446-bbbe972d7b9c',
 'model_name': 'filesys-memory',
 'outputs': [{'data': ['846424c3-d6ab-4e80-9422-644d97f4a934'],
              'datatype': 'BYTES',
              'name': 'memory_id',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['{"role": "user", "content": [{"type": "text", "value": '
                       '"Hey how are you?"}]}',
                       '{"role": "assistant", "content": [{"type": "text", '
                       '"value": "I\'m good, how are you?"}]}',
                       '{"role": "user", "content": [{"type": "text", "value": '
                       '"I\'m good too. Can I ask you a question?"}]}',
                       '{"role": "assistant", "content": [{"type": "text", '
                       '"value": "Sure, what\'s your question?"}]}',
                       '{"role": "user", "content": [{"type": "text", "value": '
                       '"What is the capital of France?"}]}'],
              'datatype': 'BYTES',
              'name': 'history',
              'parameters': {'content_type': 'str'},
              'shape': [5, 1]}],
 'parameters': {}}

As expected, it returns all the previous messages we just sent. Remember that you can control the number of messages returned by changing the window_size parameter: if we had set window_size=2, only the last two messages would have been returned.
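
Each entry in the history output is a JSON-encoded message, so it can be decoded back into Python dictionaries with a small helper. A minimal sketch, assuming the response layout shown above (a memory_id output followed by a history output):

import json

def parse_history(response_json):
    # Pick out the "history" output from the V2 response and decode each
    # JSON-encoded message back into a Python dict.
    history = next(
        output for output in response_json["outputs"] if output["name"] == "history"
    )
    return [json.loads(message) for message in history["data"]]

messages = parse_history(response.json())
print(messages[-1]["content"][0]["value"])  # "What is the capital of France?"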

Multimodal inputs

Multimodal input messages can be sent to the memory runtime. As an example:

memory_id = str(uuid4())
request_response = send_inference_request(
    role=["user"],
    content=["This is the first input", "This is the second input"],
    type=["text", "text"],
    memory_id=memory_id
)

pprint.pprint(request_response.json())
{'id': '618090fa-e326-4516-addf-b31057664088',
 'model_name': 'filesys-memory',
 'outputs': [{'data': ['1323ae57-d516-408a-92f2-f09f42b694b3'],
              'datatype': 'BYTES',
              'name': 'memory_id',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['{"role": "user", "content": [{"type": "text", "value": '
                       '"This is the first input"}, {"type": "text", "value": '
                       '"This is the second input"}]}'],
              'datatype': 'BYTES',
              'name': 'history',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}

The above writes both inputs under the same entry. This is useful when deploying multimodal models that accept multiple inputs at the same time (e.g., (text, text), (text, image), (text, audio)).
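
For instance, the same pattern could pair a text prompt with an image reference in a single entry. The sketch below is illustrative only: the "image" type value and the convention of passing the image as a URL (or base64 string) in the content tensor are assumptions, and what is appropriate depends on the downstream model.

# Hedged sketch: pairing a text prompt with an image reference in one entry.
# The "image" type value and the URL-as-content convention are assumptions here.
memory_id = str(uuid4())
send_inference_request(
    role=["user"],
    content=["What is shown in this picture?", "https://example.com/picture.png"],
    type=["text", "image"],
    memory_id=memory_id,
)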

The above showcases the memory runtime as a stand-alone server. In production, the runtime is intended to be used with an SQL database instead of the file system. To see an example of configuring the memory runtime to work with postgresql or mysql databases within Kubernetes, see this example. The chatbot example showcases how to use the memory runtime as a component within a Seldon Core 2 pipeline along with other components such as large language models.

Deploying on Seldon Core 2

We are now going to show how to deploy the conversational memory runtime with an SQL backend. This tutorial assumes that you have:

  1. A Kubernetes cluster set up with Core 2 installed

  2. A PostgreSQL database deployment and service under the namespace seldon.

  3. The memory runtime server up and running.

While the runtime image can be used as a standalone server, in most cases you'll want to deploy it as part of a Kubernetes cluster with Seldon Core 2 installed in the seldon namespace. To start serving memory models, you need to deploy the Memory Runtime server; check our installation tutorial to see how to do this.

Once the Memory Runtime server is up and running, we are now ready to deploy memory models.

PostgreSQL

We begin with the following PostgreSQL deployment yaml:

!cat manifests/psql-deployment.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:latest
        env:
        - name: POSTGRES_USER
          value: admin
        - name: POSTGRES_PASSWORD
          value: admin
        - name: POSTGRES_DB
          value: db
        ports:
        - containerPort: 5432
        readinessProbe:
          exec:
            command: ["pg_isready", "-U", "admin", "-d", "db"]
          initialDelaySeconds: 5
          periodSeconds: 1
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  type: LoadBalancer
  selector:
    app: postgres
  ports:
  - protocol: TCP
    port: 5432
    targetPort: 5432

The above gives us a PostgreSQL database called db with a user called admin and password admin. You should not use these credentials in production, and you should name the database something more informative than db. Note that you'll need to update the URL string in the model-settings.json to reflect any such changes.

The following is an example of the model-settings.json file we'll use:

!cat models/psql-memory/model-settings.json
{
    "name": "psql-memory",
    "implementation": "mlserver_memory.ConversationalMemory",
    "parameters": {
        "extra": {
            "database": "sql",
            "config": {
                "url": "postgresql+pg8000://admin:admin@postgres:5432/db",
                "connection_options": {},
                "window_size": 50,
                "tensor_names": ["role", "content", "type"]
            }
        }
    }
}

The database URL indicates that we're using pg8000 as the PostgreSQL driver, which comes packaged with the memory runtime. You can also use other drivers, but you'll have to package them with the model artifact alongside the model-settings.json; the MySQL example showcases how this is done.
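
To make the mapping between the deployment values and the connection string explicit, here is a small illustration of how the URL above is assembled, assuming the SQLAlchemy-style <dialect>+<driver>://<user>:<password>@<host>:<port>/<database> format shown in the config:

# Illustration: how the connection URL maps onto the values from the PostgreSQL deployment.
user, password = "admin", "admin"                # POSTGRES_USER / POSTGRES_PASSWORD
host, port, database = "postgres", 5432, "db"    # Service name, port and POSTGRES_DB

url = f"postgresql+pg8000://{user}:{password}@{host}:{port}/{database}"
print(url)  # postgresql+pg8000://admin:admin@postgres:5432/db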

Here we're setting the database URL directly in the model-settings.json file, but we can also set it as an environment variable called MLSERVER_MODEL_MEMORY_URL via a Kubernetes secret. Doing so may be preferable, as it avoids storing credentials in the model-settings.json file.

Like all model deployments, the above configuration file needs to be uploaded to object storage such as a Google Cloud Storage bucket or MinIO. In our case, we're using a Google Cloud Storage bucket. Once this is done, we create the following Model manifest:

!cat manifests/psql-memory.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: psql-memory
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/memory/example/models/psql-memory"
  requirements:
  - memory

We are now ready to apply the deployment and the model to our k8s cluster. We begin by applying the database deployment:

!kubectl apply -f manifests/psql-deployment.yaml -n seldon
!kubectl rollout status deployment/postgres -n seldon --timeout=600s
deployment.apps/postgres created
service/postgres created
Waiting for deployment "postgres" rollout to finish: 0 of 1 updated replicas are available...
deployment "postgres" successfully rolled out

Once the psql server is deployed, we can apply our model:

!kubectl apply -f manifests/psql-memory.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/psql-memory created
model.mlops.seldon.io/psql-memory condition met

We can test it as before by sending multiple requests to our model. The only difference is that the endpoint has changed.

import subprocess
from uuid import uuid4

def get_mesh_ip():
    # Look up the external IP of the seldon-mesh LoadBalancer service.
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode("utf-8")

endpoint = f"http://{get_mesh_ip()}/v2/models/psql-memory/infer"
memory_id = str(uuid4())


send_inference_request(role=["system"], content=["You are a helpful assistant"], type=["text"], memory_id=memory_id, endpoint=endpoint);
send_inference_request(role=["user"], content=["Hi! How are you doing today?"], type=["text"], memory_id=memory_id, endpoint=endpoint);
send_inference_request(role=["assistant"], content=["I'm doing great, how about you?"], type=["text"], memory_id=memory_id, endpoint=endpoint);
send_inference_request(role=["user"], content=["I'm doing well too. I have a question for you."], type=["text"], memory_id=memory_id, endpoint=endpoint);
send_inference_request(role=["assistant"], content=["Sure, what's your question?"], type=["text"], memory_id=memory_id, endpoint=endpoint);

As before, we can retrieve the entire history by sending a get request.

response = get_inference_request(memory_id=memory_id, endpoint=endpoint)
pprint.pprint(response.json())
{'id': '80893a7d-443d-406c-ade9-2a953156626b',
 'model_name': 'psql-memory_1',
 'model_version': '1',
 'outputs': [{'data': ['7e69002d-de89-4fac-928a-002889a09b65'],
              'datatype': 'BYTES',
              'name': 'memory_id',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['{"role": "system", "content": [{"type": "text", '
                       '"value": "You are a helpful assistant"}]}',
                       '{"role": "assistant", "content": [{"type": "text", '
                       '"value": "I\'m doing great, how about you?"}]}',
                       '{"role": "user", "content": [{"type": "text", "value": '
                       '"I\'m doing well too. I have a question for you."}]}',
                       '{"role": "assistant", "content": [{"type": "text", '
                       '"value": "Sure, what\'s your question?"}]}'],
              'datatype': 'BYTES',
              'name': 'history',
              'parameters': {'content_type': 'str'},
              'shape': [4, 1]}],
 'parameters': {}}

To clean up the cluster (i.e., remove the psql model and the PostgreSQL deployment), run the following commands:

!kubectl delete -f manifests/psql-memory.yaml -n seldon
!kubectl delete -f manifests/psql-deployment.yaml -n seldon
model.mlops.seldon.io "psql-memory" deleted
deployment.apps "postgres" deleted
service "postgres" deleted
