Memory

The Conversational Memory runtime is built on top of MLServer and is intended to be used within a Seldon Core 2 pipeline to persist content tensors corresponding to conversations between users and Large Language Models (LLMs). To understand this runtime, it helps to first recall how LLMs work. An LLM typically accepts text as input and returns a new string in response. It is stateless: if we don't include earlier messages as context alongside the new question we just asked, it will not recall topics or responses previously raised and will answer each prompt in isolation. Therefore, in order to have a conversation with an LLM, we must provide the conversation history from a persistent storage solution on every request.
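
To make this concrete, the sketch below shows why the history must be re-sent with every prompt. It uses the OpenAI Python client; the model name and the contents of the history list are purely illustrative.

# Illustration only: `history` is exactly the kind of data a persistent
# memory component would supply on each turn.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

history = [
    {"role": "user", "content": "My name is Ada."},
    {"role": "assistant", "content": "Nice to meet you, Ada!"},
]
new_prompt = {"role": "user", "content": "What is my name?"}

# Without `history`, the model has no way of knowing the user's name.
response = client.chat.completions.create(
    model="gpt-4o-mini",               # illustrative model name
    messages=history + [new_prompt],   # the history is re-sent on every call
)
print(response.choices[0].message.content)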

The conversational memory component lets developers store a user's interactions with an LLM in persistent storage, while specifying how many past messages from the conversation should be attached to each new request. To illustrate how this runtime works, consider the following diagram:

This diagram illustrates an LLM chat app built with a Seldon Core 2 pipeline, taken from the chatbot demo. It consists of three MLServer runtimes: an OpenAI endpoint is set up via the API runtime, while memory_1 and memory_2 are both conversational memory runtimes. When the user sends a message to the pipeline, the first step, memory_1, appends the message to any conversation history already held in the persistent store, and also writes it back to that store for later use in the same conversation or, potentially, a different one. The combined history is then sent to the OpenAI model, which generates a response that is passed to, and persisted by, the memory_2 runtime. Subsequent requests to the pipeline therefore carry the earlier messages and responses as context. Without the memory runtimes, the user could still query the LLM, but it would have no knowledge of what had been said before: each prompt would have to start from scratch and include all the context the LLM needs to understand it.
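
As a rough illustration of how a client might call such a pipeline, the sketch below sends a single new message using the Open Inference (V2) protocol. The address, pipeline name, routing header, and the conversation_id parameter are assumptions made for illustration, not values taken from the demo; note that only the new message is sent, since memory_1 takes care of attaching the stored history.

# Hedged sketch: adapt the URL, header and parameter names to your deployment.
import requests

PIPELINE_URL = "http://localhost:9000/v2/models/chat-pipeline.pipeline/infer"  # illustrative

payload = {
    "inputs": [
        {"name": "role", "shape": [1], "datatype": "BYTES", "data": ["user"]},
        {"name": "content", "shape": [1], "datatype": "BYTES", "data": ["Hello there!"]},
    ],
    # A conversation/session identifier tells the memory runtimes which
    # history to read and extend (the parameter name here is hypothetical).
    "parameters": {"conversation_id": "demo-conversation-1"},
}

response = requests.post(
    PIPELINE_URL,
    json=payload,
    headers={"Seldon-Model": "chat-pipeline.pipeline"},  # routes to the pipeline
)
print(response.json())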

The Conversational Memory component is designed to slot into any Core 2 pipeline with minimal setup and configuration, and it can also be run as a standalone runtime for convenience. As an example, have a look at the following configuration inside a model-settings.json file.

{
    "name": "conversational_memory",
    "implementation": "mlserver_memory.ConversationalMemory",
    "parameters": {
        "extra": {
            "database": "filesys",
            "config": {
                "window_size": 25,
                "tensor_names": ["role", "content", "type"]
            }
        }
    }
}

The two key sections of the above are database and config. The database section specifies which storage backend you'd like to use with your application (currently only filesys, but more will be added soon!), while config holds the parameters that control the behaviour of the conversational memory runtime itself.

In the above model-settings.json, we've chosen the file system backend. This backend uses Python's built-in shelve library and the local file system to store conversations within a running MLServer instance (please note that this store is ephemeral and should only be used for development and testing purposes). We then set a window_size of 25, which means the runtime will prepend the last 25 messages of the conversation to any new prompt that we send to the LLM. Finally, we set tensor_names to role, content, and type to tell the runtime the names of the tensors you, the developer, want to store when messages are sent to the runtime. For more details on why this parameter is required, and for a further introduction to the memory runtime, see the chat app example.
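
As a rough mental model of what the filesys backend does, the sketch below stores conversations with shelve and returns only the most recent window of messages. It is an illustration of the idea, not the runtime's actual implementation; the file name and conversation identifier are made up.

# Illustration only: a shelve-backed store with a message window.
import shelve

def append_and_window(db_path, conversation_id, new_messages, window_size=25):
    """Append new messages to a conversation and return the last `window_size`."""
    with shelve.open(db_path) as db:
        history = db.get(conversation_id, [])
        history.extend(new_messages)
        db[conversation_id] = history
        return history[-window_size:]

# Each message mirrors the configured tensor_names: role, content and type.
windowed = append_and_window(
    "conversations",            # file(s) created on the local file system
    "demo-conversation-1",
    [{"role": "user", "content": "Hi!", "type": "text"}],
)
print(windowed)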

Most LLM conversations include a system prompt, which tells the LLM to behave in a certain way. For example, you may want the system prompt "You are a cat, your name is Neko" so that your LLM agent includes phrases like "*Blinks slowly, tail twitching* Meow? *Nudges a paw at your leg, curious*" in its responses. When a system prompt is included in the conversation history, the Conversational Memory runtime looks for the last system prompt and treats it as the start of the conversation: all messages before it are ignored. The system prompt is also guaranteed to be included in the returned conversation history, so the first message returned is the system prompt, followed by the last window_size - 1 messages. In this way, we ensure that the model preserves its context and continues to act as we want (e.g., responding like a cat).
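
The selection rule can be summarised with the following sketch, which is an illustration of the behaviour described above rather than the runtime's code:

def window_with_system_prompt(history, window_size):
    """Return the last system prompt plus the last `window_size - 1` messages after it."""
    # Find the most recent system prompt; everything before it is ignored.
    last_system = next(
        (i for i in range(len(history) - 1, -1, -1) if history[i]["role"] == "system"),
        None,
    )
    if last_system is None:
        return history[-window_size:]
    tail = history[last_system + 1:]
    recent = tail[-(window_size - 1):] if window_size > 1 else []
    return [history[last_system]] + recent

history = [
    {"role": "user", "content": "An old message, sent before the prompt below."},
    {"role": "system", "content": "You are a cat, your name is Neko"},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "*Blinks slowly* Meow?"},
]
print(window_with_system_prompt(history, window_size=3))
# -> the system prompt followed by the last two messages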
