# Memory

The Conversational Memory runtime is built on top of MLServer and it is intended to be used within a [Seldon Core 2 pipeline](https://docs.seldon.io/projects/seldon-core/en/v2/contents/pipelines/index.html) to persist content tensors corresponding to conversations between users and Large Language models (LLMs). To understand this runtime better, it is important to first highlight how LLMs work. Typically, LLMs accept input in the form of text and return a new string as a response to such input. The LLM will generate new responses based on the input it receives and, if we don't include old questions as context alongside the new question we just asked, it will not recall topics or responses previously raised and it will provide a new response each time. Therefore, in order to have a conversation with an LLM, we must provide a conversation history from a persistent storage solution.

The conversational memory component allows developers to store the interactions of a user with an LLM in a persistent storage, while indicating the number of messages in a conversation you'd like to append to new requests. To illustrate how this runtime works, consider the following diagram:

{% @mermaid/diagram content="flowchart LR
input(\[input])
output(\[output])
filesys\[(FILE SYSTEM)]
memory\_1
memory\_2
OAI\["OpenAI"]

```
input --> memory_1 --> OAI --> output
filesys <--> memory_1
memory_2 --> filesys
OAI --> memory_2

%% Styling for OpenAI node
style memory_1 fill:#407,stroke:#333,stroke-width:2px,color:#fff
style memory_2 fill:#407,stroke:#333,stroke-width:2px,color:#fff
```

" %}

This diagram illustrates an LLM chat app built with a [Seldon Core 2 pipeline](https://docs.seldon.io/projects/seldon-core/en/v2/contents/pipelines/index.html) taken from the [chatbot demo](/llm-module/use-cases/chat-bot.md). It consists of three MLServer runtimes: an `OpenAI` endpoint is set up via the [API runtime](/llm-module/components/models/api.md) while `memory_1` and `memory_2` are both conversational memory runtimes. When the user sends a message to the pipeline, the first step, `memory_1`, is where the message is appended to any conversational history available in either a memory store or elsewhere. The message is also written to a persistent store for later use in the same conversation or, potentially, a different one. All of this is then sent to the `OpenAI` model which generates a response that is passed to and persisted by the `memory_2` runtime. Now, subsequent requests sent to the pipeline will have a new message and response as input to provide context from different conversations. Excluding the memory runtimes in the above application would mean that the user could query the LLM but this will have no context of what has been said in the past, meaning, you would have to start each prompt from scratch and pass all the context necessary for the LLM to understand each new prompt.

The Conversational Memory component is designed to slot into any Core 2 pipeline with minimal set-up and configuration, and it can also be run as a stand alone runtime for convenience. As an example, have a look at the following configuration inside a `model-settings.json` file.

```json
{
    "name": "conversational_memory",
    "implementation": "mlserver_memory.ConversationalMemory",
    "parameters": {
        "extra": {
            "database": "filesys",
            "config": {
                "window_size": 25,
                "tensor_names": ["role", "content", "type"]
            }
        }
    }
}
```

The two key features of the above are the `database` and `config`. The `database` contains a specification for whatever choice of database you'd like to use with your application (currently only `filesys`, but more will be added soon!), and `config` contains parameters relating to the specific behaviour of the conversational memory runtime itself.

In the above `model-settings.json`, we've chosen the file system backend. This backend uses Python's inbuilt `shelve` library and the local file system to store conversations within a running MLServer instance (please note that this should only be used for development and testing purposes and this store will be ephemeral). We then selected a `window_size` of `25` which means that the runtime will prepend the last 25 messages sent in the conversation to any new prompt that we send to the LLM. Finally, we selected `message` as the `tensor_names` to tell the runtime the names of the tensors you, the developer, will want to store when messages are sent to the runtime. For more details on why this parameter is required, and for a further introduction to the memory runtime, see the [chat app example](/llm-module/use-cases/chat-bot.md).

{% hint style="info" %}
Most LLM conversations include a system prompt, which tells the LLM to behave in a certain way. For example, you may want to have the following system prompt: `"You are a cat, your name is Neko"` so that your LLM agent includes phrases like `"*Blinks slowly, tail twitching* Meow? *Nudges a paw at your leg, curious*` in their response. When a system prompt is included in the conversation history, the Conversational Memory runtime will look for the last system prompt. The last system prompt will be considered the reference for the beginning of the conversation. This means that all the messages before the system prompt will be ignored. In addition, it is always guaranteed that the system prompt will be included in the returned conversation history. This means the first message in the conversation history will be the system prompt, followed by the last `window_size - 1` message. In this way, we ensure that the model will preserve the context and continue to act as we want (e.g., responding like a cat).
{% endhint %}


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.seldon.ai/llm-module/components/memory.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
