Chat

The following demonstrates how to run the Local MLServer runtime on your own machine to perform inference with LLMs deployed on your own infrastructure. It also illustrates the different ways it can be used.

This example only showcases the Local runtime as a stand-alone component; for a more integrated example, check out the chatbot demo.

To get up and running, we first need to pull the runtime Docker image. Pulling the image requires authentication; check our installation tutorial to see how to authenticate with the Docker CLI.

docker pull \
    europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-local:0.7.0

Before we can start the runtime, we need to create the model-settings.json file, which defines our model of choice:

!cat models/local-chat-completions/model-settings.json
{
    "name": "local-chat-completions",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 1024,
                    "default_generate_kwargs": {
                        "max_tokens": 256
                    }
                }
            }
        }
    }
}

The runtime configuration is specified under the extra section of the parameters field:

  1. backend - currently, the supported backends are "transformers", "vllm", and "deepspeed".

  2. model - here we've chosen the microsoft/Phi-3.5-mini-instruct model, which will be pulled from Hugging Face.

  3. model_type - the endpoint to use, in this case the chat.completions API.

  4. tensor_parallel_size - the number of GPUs to split the model across (if relevant).

  5. dtype - the data type used for the model weights. Here we've used float16 due to hardware limitations.

  6. max_tokens - caps the number of generated tokens at 256.

  7. max_model_len - sets the maximum sequence length to 1024.

These settings can be adjusted to suit your particular hardware setup.

If the model to be deployed is stored in a local bucket, the model field in model-settings.json should give the path relative to where the model-settings.json file is stored. Since the model artefact should be stored alongside this file, . can be used as the model path, with the full URI defined in the Model Custom Resource if deploying with Core 2.
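For example, if the model artefact sits in the same folder as the settings file, the model entry shown above would simply become (other settings omitted for brevity):

"model_settings": {
    "model": "."
}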

Starting the Runtime

Finally, to start the server run:
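A minimal sketch of such a command is shown below. It assumes the models directory created above is mounted into the container, MLServer's default HTTP port 8080 is published, and the image accepts an mlserver start command; adjust paths, ports and GPU flags to your setup, and see the installation tutorial for the exact invocation:

docker run -it --rm --gpus all \
    -p 8080:8080 \
    -v ${PWD}/models:/models \
    europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-local:0.7.0 \
    mlserver start /models/local-chat-completions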

Sending Requests

To send our first request to the chat.completions endpoint that we are now serving via MLServer, we use the following:
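A sketch of such a request using Python's requests library is shown below. It assumes the runtime is reachable at localhost:8080 and follows MLServer's Open Inference Protocol; the exact tensor shapes and the encoding of multi-message conversations may differ between runtime versions, so treat it as illustrative:

import requests

inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [2],
            "datatype": "BYTES",
            "data": ["system", "user"],
        },
        {
            "name": "content",
            "shape": [2],
            "datatype": "BYTES",
            # for requests carrying several messages, the content entries may
            # need to be JSON-encoded (see the note in the next section)
            "data": [
                "You are a helpful AI assistant.",
                "What about solving a 2x + 3 = 7 equation?",
            ],
        },
        {
            "name": "type",
            "shape": [2],
            "datatype": "BYTES",
            "data": ["text", "text"],
        },
    ]
}

# Open Inference Protocol endpoint served by MLServer for our model
endpoint = "http://localhost:8080/v2/models/local-chat-completions/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json())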

Note that we've sent three tensors: a role, a content, and a type tensor. The role tensor tells the model who is speaking; in this case it includes a system role and a user role. The system role is used to set the context of the interaction, and the user role indicates that the matching content was sent by a user. In the above, the system content is "You are a helpful AI assistant." and the user content is "What about solving a 2x + 3 = 7 equation?". The type tensor indicates that the content we sent is text.

The endpoint responds with its own role, content and type tensors. Its role is given as "assistant" and the content it returns contains the solution of the equation: "... So, the solution to the equation 2x + 3 = 7 is x = 2.".

Requests with Parameters

We can add parameters to a request to specify attributes that change the desired inference behaviour, such as the number of generations and the temperature. For a list of all available parameters, see the Local documentation (TODO: add link). The following sets the temperature, the maximum number of tokens to generate and the number of generations to return:
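As a sketch, again assuming the local MLServer endpoint, these generation settings can be passed in the request-level parameters field (the parameter names below follow vLLM-style sampling arguments and their exact placement may differ between runtime versions):

import requests

inference_request = {
    "parameters": {
        # per-request generation settings
        "temperature": 0,
        "max_tokens": 128,
        "n": 2,
    },
    "inputs": [
        {"name": "role", "shape": [1], "datatype": "BYTES", "data": ["user"]},
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            # a single message, so content and type are plain strings, not JSON
            "data": ["What about solving a 2x + 3 = 7 equation?"],
        },
        {"name": "type", "shape": [1], "datatype": "BYTES", "data": ["text"]},
    ],
}

endpoint = "http://localhost:8080/v2/models/local-chat-completions/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json())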

Since the temperature is set to 0, the response will always be deterministic. You can test that yourself by sending the same request multiple times. Note that if you are sending a single message, you must not encode the content and type as JSON.

Adding Prompts

Prompting is a pivotal part of using large language models. It allows developers to embed messages sent by a user within other textual contexts that provide more information about the task they want the model to perform. For instance, we use the following prompt to give the model more context about the kind of question it is expected to answer:
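The exact prompt from the original example is not reproduced here; an illustrative template of the kind described, with a {question} placeholder for the user's content, might look like this:

You are a maths tutor. Answer the question below step by step,
show your working, and state the final answer on its own line.

Question: {question}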

In the above, the content sent by the user is inserted in place of the {question} variable. A developer specifies which content to insert by giving the corresponding tensor the name "question". To start with, we need to create a new model-settings.json file. This will be the same as the previous one, but in addition it specifies the prompt template to be used through the prompt_utils settings.

Unload the previous model and load this new one. We can test it using the following request (note that if we send a single tensor named "question", its content is what will be inserted):
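As a sketch, assuming the redeployed model keeps the name local-chat-completions and is still reachable at localhost:8080:

import requests

inference_request = {
    "inputs": [
        {
            "name": "question",
            "shape": [1],
            "datatype": "BYTES",
            # this content is substituted into the {question} slot of the prompt
            "data": ["How do I solve the equation 2x + 3 = 7?"],
        }
    ]
}

endpoint = "http://localhost:8080/v2/models/local-chat-completions/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json())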

We can see that the model is able to solve our problem correctly.

Deploying on Seldon Core 2

We will now demonstrate how to deploy the chat completions model on Seldon Core 2.

While the runtime image can be used as a standalone server, in most cases you'll want to deploy it as part of a Kubernetes cluster. This section assumes you have a Kubernetes cluster running with Seldon Core 2 installed in the seldon namespace. In order to start serving LLMs, you may first need to create a secret for the Hugging Face token and deploy the Local runtime server. Please check our installation tutorial to see how you can do so.

To deploy the chat completions model, we will need to create the associated manifest file.
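A sketch of such a manifest is shown below; the storageUri placeholder and the llm-local requirement are assumptions and should be replaced with the bucket holding your model folder and the capability advertised by your Local runtime server:

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-chat-completions
  namespace: seldon
spec:
  # folder containing model-settings.json and the model artefact
  storageUri: gs://<your-bucket>/models/local-chat-completions
  requirements:
  - llm-local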

To load the model in Seldon Core 2, run:
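Assuming the manifest above is saved as local-chat-completions.yaml:

kubectl apply -f local-chat-completions.yaml -n seldon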

Before sending the actual request, we need to get the mesh IP. The following utility function will help you retrieve the correct IP:
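A minimal sketch of such a helper, assuming Seldon Core 2's seldon-mesh service is exposed via a LoadBalancer in the seldon namespace:

import subprocess

def get_mesh_ip(namespace: str = "seldon") -> str:
    """Return the external IP of the seldon-mesh service."""
    cmd = (
        "kubectl get svc seldon-mesh "
        f"-n {namespace} "
        "-o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    )
    return subprocess.check_output(cmd, shell=True).decode().strip().strip("'")

mesh_ip = get_mesh_ip()
print(mesh_ip)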

As before, we can now send a request to the model:
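For example, reusing the helper above and routing the request through the mesh (Seldon Core 2 routes on the Seldon-Model header; the header and path below assume the model is named local-chat-completions):

import requests

inference_request = {
    "inputs": [
        {"name": "role", "shape": [1], "datatype": "BYTES", "data": ["user"]},
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["What about solving a 2x + 3 = 7 equation?"],
        },
        {"name": "type", "shape": [1], "datatype": "BYTES", "data": ["text"]},
    ]
}

endpoint = f"http://{mesh_ip}/v2/models/local-chat-completions/infer"
headers = {"Seldon-Model": "local-chat-completions"}
response = requests.post(endpoint, json=inference_request, headers=headers)
print(response.json())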

You now have a deployed model in Seldon Core 2, ready and available for requests! To unload the model, you can run the following command.
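For example, assuming the same manifest file as before:

kubectl delete -f local-chat-completions.yaml -n seldon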
