# Example

The following demonstrates how to run a Conversational Memory Runtime MLServer instance locally and illustrates the different ways it can be used. (Note that this example only showcases the Conversational Memory as a stand-alone component; for a more integrated example, check out the [chatbot](https://github.com/SeldonIO/llm-runtimes/blob/master/docs-gb/chat_bot/README.md) example.)

To get up and running, we first need to pull the runtime Docker image. Pulling the image requires authentication; check our [installation tutorial](https://docs.seldon.ai/llm-module/introduction/installation) to see how to authenticate with the Docker CLI.

```sh
docker pull \
    europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-memory:0.7.0
```

Before we can start the runtime we need to create the `model-settings.json` which will tell it what model to run:

```python
!cat models/filesys-memory/model-settings.json
```

```
{
  "name": "filesys-memory",
  "implementation": "mlserver_memory.ConversationalMemory",
  "parameters": {
    "extra": {
      "database": "filesys",
      "config": {
        "window_size": 50,
        "tensor_names": ["role", "content", "type"]
      }
    }
  }
}
```

See the ConversationalMemory specification for an explanation of the `"config"` parameters, as well as the file system backend specification. We currently support two backends: `filesys` and `sql`. The `filesys` backend uses the runtime's local file system and is intended for development purposes only. Because it is so rudimentary it doesn't require any extra setup, so we'll use it for the remainder of this example. The `sql` backend can be used with a selection of SQL databases, namely `sqlite`, `postgresql`, `mysql`, `mssql`, and `oracle`; these all require some additional configuration. To see how to use the memory runtime with a `postgresql` or `mysql` database, see [this](https://github.com/SeldonIO/llm-runtimes/blob/master/docs-gb/conversational_memory/sql_memory/README.md) example.

## File System runtime

Now that the `model-settings.json` file is ready, we can start the runtime using:

```sh
docker run -it --rm -p 8080:8080 \
  -v ${PWD}/filesys-memory:/filesys-memory \
  europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-memory:0.7.0 \
  mlserver start /filesys-memory
```

There are two intended workflows for using the memory runtime: in the first, the developer updates the memory with new messages from the conversation; in the second, the developer retrieves messages from the memory. Every request to the runtime must include a `memory_id` tensor, which the runtime uses to identify the relevant conversation. Regardless of whether or not the history is updated, the runtime returns up to `window_size` messages from the memory.

As a more concrete example, imagine we want to get the history of a specific conversation. A developer can do this by sending only the `memory_id` tensor; the runtime will then return the last `window_size` messages it has stored.

Suppose we want to update the memory with a new message and get back the last `window_size - 1` stored messages with the new message appended. To do so, we'd send the `memory_id` tensor as well as three other tensors: `role`, `content`, and `type`. The runtime then updates the memory with the new `content` message, reads the last `window_size` messages (including the new one), and returns them. This is done to make the typical flow of a conversation easy to implement with the runtime. Hopefully the next examples will make this even clearer:

{% hint style="info" %}
The `window_size` parameter is intended to control the size of requests being sent through the pipelines in which the memory runtime is deployed. It's not intended to be used to control the number of tokens sent to the LLM. Token bounds can be configured in the [Local Runtime](https://github.com/SeldonIO/llm-runtimes/blob/master/runtimes/local.md).
{% endhint %}

### Unimodal inputs

To start with, let's send a few requests with a `memory_id` and some content to start a conversational store. We generate the `memory_id` at random.

```python
import requests
from typing import List


def send_inference_request(
    role: List[str],
    content: List[str],
    type: List[str],
    memory_id: str,
    endpoint: str = "http://localhost:8080/v2/models/filesys-memory/infer",
):
    """Append a message to the conversation identified by `memory_id`."""
    inference_request = {
        "inputs": [
            {
                "name": "memory_id",
                "shape": [1],
                "datatype": "BYTES",
                "data": [memory_id],
                "parameters": {"content_type": "str"},
            },
            {
                "name": "role",
                "shape": [len(role)],
                "datatype": "BYTES",
                "data": role,
                "parameters": {"content_type": "str"},
            },
            {
                "name": "content",
                "shape": [len(content)],
                "datatype": "BYTES",
                "data": content,
                "parameters": {"content_type": "str"},
            },
            {
                "name": "type",
                "shape": [len(type)],
                "datatype": "BYTES",
                "data": type,
                "parameters": {"content_type": "str"},
            },
        ]
    }
    return requests.post(endpoint, json=inference_request)

```

```python
from uuid import uuid4
memory_id = str(uuid4())

# simulate a conversation
send_inference_request(role=["user"], content=["Hey how are you?"], type=["text"], memory_id=memory_id);
send_inference_request(role=["assistant"], content=["I'm good, how are you?"], type=["text"], memory_id=memory_id);
send_inference_request(role=["user"], content=["I'm good too. Can I ask you a question?"], type=["text"], memory_id=memory_id);
send_inference_request(role=["assistant"], content=["Sure, what's your question?"], type=["text"], memory_id=memory_id);
send_inference_request(role=["user"], content=["What is the capital of France?"], type=["text"], memory_id=memory_id);
```

We should now be able to retrieve the conversation above by sending a `get` request (i.e., a request containing only the `memory_id` tensor).

```python
import pprint
import requests

def get_inference_request(
    memory_id: str,
    endpoint: str = "http://localhost:8080/v2/models/filesys-memory/infer",
):
    inference_request = {
        "inputs": [
            {
                "name": "memory_id",
                "shape": [1],
                "datatype": "BYTES",
                "data": [memory_id],
                "parameters": {"content_type": "str"},
            }
        ]
    }
    return requests.post(endpoint, json=inference_request)
```

```python
response = get_inference_request(memory_id=memory_id)
pprint.pprint(response.json())
```

```
{'id': '07d4c4c9-e337-49b8-a446-bbbe972d7b9c',
 'model_name': 'filesys-memory',
 'outputs': [{'data': ['846424c3-d6ab-4e80-9422-644d97f4a934'],
              'datatype': 'BYTES',
              'name': 'memory_id',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['{"role": "user", "content": [{"type": "text", "value": '
                       '"Hey how are you?"}]}',
                       '{"role": "assistant", "content": [{"type": "text", '
                       '"value": "I\'m good, how are you?"}]}',
                       '{"role": "user", "content": [{"type": "text", "value": '
                       '"I\'m good too. Can I ask you a question?"}]}',
                       '{"role": "assistant", "content": [{"type": "text", '
                       '"value": "Sure, what\'s your question?"}]}',
                       '{"role": "user", "content": [{"type": "text", "value": '
                       '"What is the capital of France?"}]}'],
              'datatype': 'BYTES',
              'name': 'history',
              'parameters': {'content_type': 'str'},
              'shape': [5, 1]}],
 'parameters': {}}
```

As expected, it returns all the previous messages we just sent. Remember that you can control the number of messages returned by changing the `window_size` parameter; if we had set `window_size=2`, only the last two messages would have been returned.
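To make the window semantics concrete, here is a minimal client-side sketch (not part of the runtime) that decodes `history` entries shaped like the response above and applies a hypothetical window of 2:

```python
import json

# Raw `history` entries as returned by the runtime: each element is a
# JSON-encoded message with a role and a list of content parts.
history = [
    '{"role": "user", "content": [{"type": "text", "value": "Hey how are you?"}]}',
    '{"role": "assistant", "content": [{"type": "text", "value": "Sure!"}]}',
    '{"role": "user", "content": [{"type": "text", "value": "What is the capital of France?"}]}',
]

# Decode each entry into a plain dict.
messages = [json.loads(entry) for entry in history]

# A window_size of 2 keeps only the last two messages.
window_size = 2
windowed = messages[-window_size:]

for message in windowed:
    print(message["role"], "->", message["content"][0]["value"])
```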

### Multimodal inputs

Multimodal input messages can be sent to the memory runtime. As an example:

```python
memory_id = str(uuid4())
request_response = send_inference_request(
    role=["user"],
    content=["This is the first input", "This is the second input"],
    type=["text", "text"],
    memory_id=memory_id
)

pprint.pprint(request_response.json())
```

```
{'id': '618090fa-e326-4516-addf-b31057664088',
 'model_name': 'filesys-memory',
 'outputs': [{'data': ['1323ae57-d516-408a-92f2-f09f42b694b3'],
              'datatype': 'BYTES',
              'name': 'memory_id',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['{"role": "user", "content": [{"type": "text", "value": '
                       '"This is the first input"}, {"type": "text", "value": '
                       '"This is the second input"}]}'],
              'datatype': 'BYTES',
              'name': 'history',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
```

The above will write both inputs under the same entry. This is useful when deploying multimodal models which accept multiple inputs at the same time (e.g., (text, text), (text, image), (text, audio)).
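Conceptually, the stored entry groups all content parts under a single message. A minimal sketch of that grouping (the runtime's exact storage format is internal; this simply mirrors the `history` string shown above):

```python
import json

# How a multimodal request's inputs group into one stored entry:
# one role, several (type, value) content parts.
roles = ["user"]
contents = ["This is the first input", "This is the second input"]
types = ["text", "text"]

entry = {
    "role": roles[0],
    "content": [{"type": t, "value": v} for t, v in zip(types, contents)],
}

print(json.dumps(entry))
```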

The above showcases the memory runtime as a stand-alone server. In production the runtime is intended to be used with an SQL database instead of the file system; the next section shows how to configure the memory runtime to work with a `postgresql` database within Kubernetes. The [chatbot](https://github.com/SeldonIO/llm-runtimes/blob/master/docs-gb/chat_bot/README.md) example showcases how to use the memory runtime as a component within a Seldon Core V2 pipeline, along with other components such as large language models.

## Deploying on Seldon Core 2

We are now going to show how to deploy the conversational memory runtime with an SQL backend. This tutorial assumes that you have:

1. A Kubernetes cluster set up with [Core 2](https://docs.seldon.io/projects/seldon-core/en/v2/contents/getting-started/) installed
2. A PostgreSQL database deployment and service under the namespace `seldon`.
3. The memory runtime server up and running.

While the runtime image can be used as a standalone server, in most cases you'll want to deploy it as part of a Kubernetes cluster. This section assumes you have a Kubernetes cluster running with Seldon Core 2 installed in the `seldon` namespace. In order to start serving memory models, you need to deploy the Memory Runtime; check our [installation tutorial](https://docs.seldon.ai/llm-module/introduction/installation) to see how you can deploy the Memory Runtime server.

Once the Memory Runtime server is up and running, we are now ready to deploy memory models.

### PostgreSQL

We begin with the following PostgreSQL deployment yaml:

```python
!cat manifests/psql-deployment.yaml
```

```
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:latest
        env:
        - name: POSTGRES_USER
          value: admin
        - name: POSTGRES_PASSWORD
          value: admin
        - name: POSTGRES_DB
          value: db
        ports:
        - containerPort: 5432
        readinessProbe:
          exec:
            command: ["pg_isready", "-U", "admin", "-d", "db"]
          initialDelaySeconds: 5
          periodSeconds: 1
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
spec:
  type: ClusterIP
  selector:
    app: postgres
  ports:
  - protocol: TCP
    port: 5432
    targetPort: 5432
```

The above gives us a PostgreSQL database called `db`, with a user `admin` and password `admin`. You should not use these credentials in production, and you should name the database something more informative than `db`. Note that you'll need to update the `url` string in the `model-settings.json` to reflect any such changes.
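For reference, the `url` value follows the standard SQLAlchemy URL format, `dialect+driver://user:password@host:port/database`. A minimal sketch of assembling it from the deployment values above:

```python
# Assemble the SQLAlchemy-style database URL from the demo-only
# credentials defined in the PostgreSQL manifest above.
user = "admin"
password = "admin"
host = "postgres"  # the Kubernetes Service name
port = 5432
database = "db"

url = f"postgresql+pg8000://{user}:{password}@{host}:{port}/{database}"
print(url)
```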

The following is an example of the `model-settings.json` file we'll use:

```python
!cat models/psql-memory/model-settings.json
```

```
{
    "name": "psql-memory",
    "implementation": "mlserver_memory.ConversationalMemory",
    "parameters": {
        "extra": {
            "database": "sql",
            "config": {
                "url": "postgresql+pg8000://admin:admin@postgres:5432/db",
                "connection_options": {},
                "window_size": 50,
                "tensor_names": ["role", "content", "type"]
            }
        }
    }
}
```

{% hint style="info" %}
The database URL indicates we're using `pg8000` as the PostgreSQL driver, which comes packaged with the memory runtime. You can also use other drivers, but you'll have to package them via the `model-settings.json`. The MySQL example showcases how this is done.
{% endhint %}

{% hint style="info" %}
Here we're setting the database URL directly in the `model-settings.json` file but we can also set this as an environment variable called `MLSERVER_MODEL_MEMORY_URL` via a Kubernetes secret. Doing so may be preferable to avoid storing credentials in the `model-settings.json` file.
{% endhint %}
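As a sketch of the environment-variable approach, you could store the URL in a Kubernetes Secret (the secret name `memory-db-url` below is hypothetical) and expose it to the runtime pod via `env.valueFrom.secretKeyRef`:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: memory-db-url   # hypothetical name
  namespace: seldon
type: Opaque
stringData:
  MLSERVER_MODEL_MEMORY_URL: "postgresql+pg8000://admin:admin@postgres:5432/db"
```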

Like all model deployments, the above configuration file needs to be uploaded to a Google bucket or MinIO storage. In our case, we're using a Google bucket. Once this is done we create the following deployment yaml.

```python
!cat manifests/psql-memory.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: psql-memory
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/memory/example/models/psql-memory"
  requirements:
  - memory
```

We are now ready to apply the deployment and the model to our k8s cluster. We begin by applying the database deployment:

```python
!kubectl apply -f manifests/psql-deployment.yaml -n seldon
!kubectl rollout status deployment/postgres -n seldon --timeout=600s
```

```
deployment.apps/postgres created
service/postgres created
Waiting for deployment "postgres" rollout to finish: 0 of 1 updated replicas are available...
deployment "postgres" successfully rolled out
```

Once the `psql` server is deployed we can apply our model:

```python
!kubectl apply -f manifests/psql-memory.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

```
model.mlops.seldon.io/psql-memory created
model.mlops.seldon.io/psql-memory condition met
```

We can test it as before by sending multiple requests to our model. The only difference is that the endpoint has changed.

```python
import subprocess
from uuid import uuid4

def get_mesh_ip():
    # Fetch the external IP of the seldon-mesh LoadBalancer service.
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode("utf-8")

endpoint = f"http://{get_mesh_ip()}/v2/models/psql-memory/infer"
memory_id = str(uuid4())


send_inference_request(role=["system"], content=["You are a helpful assistant"], type=["text"], memory_id=memory_id, endpoint=endpoint);
send_inference_request(role=["user"], content=["Hi! How are you doing today?"], type=["text"], memory_id=memory_id, endpoint=endpoint);
send_inference_request(role=["assistant"], content=["I'm doing great, how about you?"], type=["text"], memory_id=memory_id, endpoint=endpoint);
send_inference_request(role=["user"], content=["I'm doing well too. I have a question for you."], type=["text"], memory_id=memory_id, endpoint=endpoint);
send_inference_request(role=["assistant"], content=["Sure, what's your question?"], type=["text"], memory_id=memory_id, endpoint=endpoint);
```

As before, we can retrieve the entire history by sending a `get` request.

```python
response = get_inference_request(memory_id=memory_id, endpoint=endpoint)
pprint.pprint(response.json())
```

```
{'id': '80893a7d-443d-406c-ade9-2a953156626b',
 'model_name': 'psql-memory_1',
 'model_version': '1',
 'outputs': [{'data': ['7e69002d-de89-4fac-928a-002889a09b65'],
              'datatype': 'BYTES',
              'name': 'memory_id',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['{"role": "system", "content": [{"type": "text", '
                       '"value": "You are a helpful assistant"}]}',
                       '{"role": "assistant", "content": [{"type": "text", '
                       '"value": "I\'m doing great, how about you?"}]}',
                       '{"role": "user", "content": [{"type": "text", "value": '
                       '"I\'m doing well too. I have a question for you."}]}',
                       '{"role": "assistant", "content": [{"type": "text", '
                       '"value": "Sure, what\'s your question?"}]}'],
              'datatype': 'BYTES',
              'name': 'history',
              'parameters': {'content_type': 'str'},
              'shape': [4, 1]}],
 'parameters': {}}
```

To clean up the cluster (the psql model and the database deployment), run the following commands:

```python
!kubectl delete -f manifests/psql-memory.yaml -n seldon
!kubectl delete -f manifests/psql-deployment.yaml -n seldon
```

```
model.mlops.seldon.io "psql-memory" deleted
deployment.apps "postgres" deleted
service "postgres" deleted
```
