Chat

The following demonstrates how to run a Local MLServer runtime instance on your own machine to perform inference with LLMs deployed on your own infrastructure. It also illustrates the different ways the runtime can be used.

This example only showcases the Local Runtime as a stand-alone component; for a more integrated example, check out the chatbot demo.

To get up and running, we need to pull the runtime Docker image. You must be authenticated to pull it; check our installation tutorial to see how to authenticate with the Docker CLI.

docker pull \
    europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-local:0.7.0

Before we can start the runtime, we need to create the model-settings.json file, which defines our model of choice:

!cat models/local-chat-completions/model-settings.json

{
    "name": "local-chat-completions",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 1024,
                    "default_generate_kwargs": {
                        "max_tokens": 256
                    }
                }
            }
        }
    }
}

The runtime config is specified in the parameters JSON field:

backend selects the inference backend; currently, the supported backends are "transformers", "vllm", and "deepspeed".
model and model_type select the microsoft/Phi-3.5-mini-instruct model and the chat.completions API.
tensor_parallel_size splits the model across 4 GPUs.
dtype sets the weight representation to float16 because of hardware limitations.
max_tokens limits the number of generated tokens to 256, and max_model_len limits the maximum sequence length to 1024.

These are configurations that can be modified to suit a particular hardware setup.
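
As a rough sketch of how these settings could be adapted to different hardware, the following builds the same configuration in Python and writes it to disk. The helper names build_model_settings and write_model_settings are hypothetical and not part of the runtime; only the keys and values already shown above are used.

```python
import json
from pathlib import Path


def build_model_settings(tensor_parallel_size=4, dtype="float16",
                         max_model_len=1024, max_tokens=256):
    """Return a dict mirroring the model-settings.json shown above.

    Hypothetical helper: only the hardware-related fields are exposed
    as arguments, everything else is copied from the example file.
    """
    return {
        "name": "local-chat-completions",
        "implementation": "mlserver_llm_local.runtime.Local",
        "parameters": {
            "extra": {
                "backend": "vllm",
                "config": {
                    "model_type": "chat.completions",
                    "model_settings": {
                        "model": "microsoft/Phi-3.5-mini-instruct",
                        "tensor_parallel_size": tensor_parallel_size,
                        "dtype": dtype,
                        "gpu_memory_utilization": 0.8,
                        "max_model_len": max_model_len,
                        "default_generate_kwargs": {"max_tokens": max_tokens},
                    },
                },
            }
        },
    }


def write_model_settings(settings, model_dir):
    """Write settings to <model_dir>/model-settings.json and return the path."""
    path = Path(model_dir) / "model-settings.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=4))
    return path
```

For example, on a machine with a single GPU one might call write_model_settings(build_model_settings(tensor_parallel_size=1), "models/local-chat-completions"). Whether the model fits on one GPU depends on your hardware.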

Starting the Runtime

Finally, to start the server run:

docker run -it --rm -p 8080:8080 \
  -v ${PWD}/models:/models \
  -e HF_TOKEN=<your_hf_token> \
  --shm-size=1g \
  europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-local:0.7.0 \
  mlserver start /models/local-chat-completions
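
Loading the model can take a while, so it may help to wait until the server reports that the model is ready before sending requests. The sketch below polls the model readiness endpoint defined by the V2 inference protocol that MLServer serves. The function names, timeout, and polling interval are illustrative assumptions.

```python
import time
import urllib.request


def is_ready(url, opener=urllib.request.urlopen):
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with opener(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors and non-2xx answers
        # (urllib.error.URLError and HTTPError derive from OSError).
        return False


def wait_until_ready(url, timeout_s=600.0, interval_s=5.0,
                     check=is_ready, sleep=time.sleep):
    """Poll url until it is ready or timeout_s seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Assumed to match the port mapping and model name used above.
READY_URL = "http://localhost:8080/v2/models/local-chat-completions/ready"
```

Calling wait_until_ready(READY_URL) returns True once the model is loaded, or False if the timeout expires first.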

Sending Requests

To send our first request to the chat.completions endpoint that we are now serving via mlserver, we use the following:

import json
import pprint
import requests

inference_request = {
    "inputs": [
        {
            "name": "role", 
            "shape": [4], 
            "datatype": "BYTES", 
            "data": ["system", "user", "assistant", "user"]
        },
        {
            "name": "content",
            "shape": [4],
            "datatype": "BYTES",
            "data": [
                json.dumps(["You are a helpful AI assistant."]),
                json.dumps(["Can you provide ways to eat combinations of bananas and dragonfruits?"]),
                json.dumps(["Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."]),
                json.dumps(["What about solving an 2x + 3 = 7 equation?"]) 
            ]
        },
        {
            "name": "type",
            "shape": [4],
            "datatype": "BYTES",
            "data": [
                json.dumps(["text"]),
                json.dumps(["text"]),
                json.dumps(["text"]),
                json.dumps(["text"])
            ]
        }
    ]
}

endpoint = "http://localhost:8080/v2/models/local-chat-completions/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)

{'id': '8dba16dd-4b8d-4876-94d2-e542debe21ab',
 'model_name': 'local-chat-completions',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': [" To solve the equation 2x + 3 = 7, you'll want to "
                       "isolate the variable x. Here's how you can do it step "
                       'by step:\n'
                       '\n'
                       '1. Subtract 3 from both sides of the equation:\n'
                       '\n'
                       '   2x + 3 - 3 = 7 - 3\n'
                       '\n'
                       '   This simplifies to:\n'
                       '\n'
                       '   2x = 4\n'
                       '\n'
                       '2. Now, divide both sides of the equation by 2:\n'
                       '\n'
                       '   2x / 2 = 4 / 2\n'
                       '\n'
                       '   This gives you:\n'
                       '\n'
                       '   x = 2\n'
                       '\n'
                       'So, the solution to the equation 2x + 3 = 7 is x = 2. '
                       '<|end|>'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}

Note that we've sent three tensors: role, content, and type. The role tensor tells the model who is speaking. In this case, it includes the system, user, and assistant roles. The system role sets the context of the interaction, the user role indicates that the matching content was sent by a user, and the assistant role marks an earlier reply from the model. In the above, the system content is "You are a helpful AI assistant." and the last user content is "What about solving an 2x + 3 = 7 equation?". The type tensor indicates that the content we sent is text.

The endpoint responds with its own role, content, and type tensors. Its role is "assistant", and its content contains the solution of the equation: "... So, the solution to the equation 2x + 3 = 7 is x = 2.".
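
To work with the reply programmatically, the output tensors can be indexed by name. The following is a minimal sketch based on the response structure shown above; the helper names are hypothetical, and stripping the trailing "<|end|>" token seen in the output is an optional convenience, not something the runtime requires.

```python
def outputs_by_name(response_json):
    """Map each output tensor name to its list of values."""
    return {out["name"]: out["data"] for out in response_json["outputs"]}


def assistant_reply(response_json, end_token="<|end|>"):
    """Return (role, text) for the first generated message.

    The end-of-sequence marker is removed if it is present at the
    end of the text; end_token is an assumption taken from the
    Phi-3.5 output shown above.
    """
    tensors = outputs_by_name(response_json)
    text = tensors["content"][0].strip()
    if text.endswith(end_token):
        text = text[: -len(end_token)].rstrip()
    return tensors["role"][0], text
```

With the response above, role, text = assistant_reply(response.json()) gives the role "assistant" and the solution text without the trailing marker.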

Requests with Parameters

We can add parameters to a request to change the inference behaviour, such as the number of generations and the temperature. For a list of all available parameters, see the Local documentation. The following sets the temperature and the maximum number of tokens to generate:

import pprint
import requests


inference_request = {
    "inputs": [
        {
            "name": "role", 
            "shape": [1], 
            "datatype": "BYTES", 
            "data": ["user"]
        },
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Does money buy happiness?"],
        },
        {
            "name": "type",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["text"]
        }
    ],
    "parameters": {
        "kwargs": {
            "temperature": 0.0,
            "max_tokens": 20,
        }
    },
}

endpoint = "http://localhost:8080/v2/models/local-chat-completions/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)

{'id': '58c21f8d-f217-4fb1-97f5-83b6865d9d1f',
 'model_name': 'local-chat-completions',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': [' The relationship between money and happiness is a '
                       'complex and nuanced topic that has been studied across '
                       'various discipl'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}

Since the temperature is set to 0, the response is deterministic. You can verify this yourself by sending the same request multiple times. The reply is also cut off mid-word because max_tokens is set to 20. Note that if you are sending a single message, you must not encode the content and type as JSON.
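
The encoding rule can be captured in a small request builder. This is a sketch under the assumption that all messages are plain text: a single message is sent as-is, while multiple messages have each content and type entry JSON-encoded as a one-element list, as in the first example. The function build_chat_request is hypothetical and not part of the runtime.

```python
import json


def build_chat_request(messages, **kwargs):
    """Build an inference request from (role, content) pairs.

    Keyword arguments (for example temperature or max_tokens) are
    forwarded under parameters.kwargs, as in the request above.
    """
    if not messages:
        raise ValueError("at least one message is required")

    single = len(messages) == 1

    def encode(value):
        # A single message must not be JSON-encoded; several must be.
        return value if single else json.dumps([value])

    n = len(messages)
    request = {
        "inputs": [
            {"name": "role", "shape": [n], "datatype": "BYTES",
             "data": [role for role, _ in messages]},
            {"name": "content", "shape": [n], "datatype": "BYTES",
             "data": [encode(content) for _, content in messages]},
            {"name": "type", "shape": [n], "datatype": "BYTES",
             "data": [encode("text") for _ in messages]},
        ]
    }
    if kwargs:
        request["parameters"] = {"kwargs": kwargs}
    return request
```

For instance, build_chat_request([("user", "Does money buy happiness?")], temperature=0.0, max_tokens=20) reproduces the request body used in this section.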

Adding prompts

Prompting is a pivotal part of using large language models. It allows developers to embed a user's message within additional text that gives the model more information about the task to perform. For instance, we use the following prompt to give the model more context about the kind of question it is expected to answer:

!cat models/local-prompt-chat-completions/cot.jinja

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: {{ question[0] }}
A:

In the above, the content sent by the user is inserted into the {{ question[0] }} placeholder. The developer specifies which content to insert by naming the corresponding tensor "question". To start, we need to create a new model-settings.json file. It is the same as the previous one, but additionally specifies the prompt template to use through the prompt_utils settings.

!cat models/local-prompt-chat-completions/model-settings.json

{
    "name": "local-prompt-chat-completions",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 1024,
                    "default_generate_kwargs": {
                        "max_tokens": 256
                    }
                }
            },
            "prompt_utils": {
                "prompt_options": {
                    "uri": "cot.jinja"
                }
            }
        }
    }
}
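
To preview what the cot.jinja template produces for a given question, it can be rendered locally with the jinja2 package. This sketch only illustrates the substitution: it assumes the tensor data is exposed to the template as a list under the tensor's name, which is consistent with the question[0] indexing, and the way the runtime itself renders the template may differ.

```python
from jinja2 import Template

# Contents of cot.jinja, copied from the listing above.
COT_TEMPLATE = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is "
    "6 tennis balls. 5 + 6 = 11. The answer is 11.\n"
    "\n"
    "Q: {{ question[0] }}\n"
    "A:"
)


def render_prompt(question_tensor):
    """Render the template with the data of a tensor named 'question'."""
    return Template(COT_TEMPLATE).render(question=question_tensor)
```

Calling render_prompt(["How many apples do they have?"]) returns the worked example followed by the new question and an open "A:" line for the model to complete.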

Unload the previous model and load this new one. We can test it using the following request. Note that because we send a tensor named "question", its content is what gets inserted into the template:

import pprint
import requests

inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["user"],
        },
        {
            "name": "question",
            "shape": [1],
            "datatype": "BYTES",
            "data": [
                "The cafeteria had 23 apples originally. If they used 20 "
                "to make lunch and bought 6 more, how many apples do they have?"
            ],
        },
    ]
}

endpoint = "http://localhost:8080/v2/models/local-prompt-chat-completions/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)

{'id': 'a72b641c-0831-4bc4-9881-ce8d67b60817',
 'model_name': 'local-prompt-chat-completions',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': [' The cafeteria used 20 apples, so they had 23 - 20 = 3 '
                       'apples left. Then they bought 6 more apples. Now, they '
                       'have 3 + 6 = 9 apples. The answer is 9. <|end|>'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}

We can see that the model solves our problem correctly.

Deploying on Seldon Core 2

We will now demonstrate how to deploy the chat completions model on Seldon Core 2.

While the runtime image can be used as a standalone server, in most cases you'll want to deploy it as part of a Kubernetes cluster. This section assumes you have a Kubernetes cluster running with Seldon Core 2 installed in the seldon namespace. In order to start serving LLMs, you may first have to create a secret for the Hugging Face token and deploy the Local Runtime server. Please check our installation tutorial to see how to do so.

To deploy the chat completions model, we will need to create the associated manifest file.

!cat manifests/local-chat-completions.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-chat-completions
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/local/chat/models/local-chat-completions"
  requirements:
  - llm-local

To load the model in Seldon Core 2, run:

!kubectl apply -f manifests/local-chat-completions.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon

model.mlops.seldon.io/local-chat-completions created
model.mlops.seldon.io/local-chat-completions condition met

Before sending the actual request, we need to get the mesh IP. The following utility function retrieves it:

import subprocess

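def get_mesh_ip():
    # Query the external IP of the seldon-mesh LoadBalancer service.
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    # Strip surrounding whitespace so the value can be used directly in a URL.
    return subprocess.check_output(cmd, shell=True).decode("utf-8").strip()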

As before, we can now send a request to the model:

import json
import pprint
import requests


inference_request = {
    "inputs": [
        {
            "name": "role", 
            "shape": [2], 
            "datatype": "BYTES", 
            "data": ["system", "user"]
        },
        {
            "name": "content",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["You are a helpful assistant"]), 
                json.dumps(["Write a python function to compute the factorial of a number."]),
            ],
        },
        {
            "name": "type",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["text"]), 
                json.dumps(["text"]),
            ]
        }
    ],
}

endpoint = f"http://{get_mesh_ip()}/v2/models/local-chat-completions/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)

{'id': '05bd3a73-c490-47c8-aef4-5fbd03811b8c',
 'model_name': 'local-chat-completions_1',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': [' Certainly! Below is a Python function to compute the '
                       'factorial of a non-negative integer. The factorial of '
                       'a number `n` is denoted as `n!` and is the product of '
                       'all positive integers less than or equal to `n`.\n'
                       '\n'
                       '```python\n'
                       'def factorial(n):\n'
                       '    """\n'
                       '    Compute the factorial of a non-negative integer '
                       'n.\n'
                       '\n'
                       '    Parameters:\n'
                       '    n (int): A non-negative integer whose factorial is '
                       'to be computed.\n'
                       '\n'
                       '    Returns:\n'
                       '    int: The factorial of n.\n'
                       '\n'
                       '    Raises:\n'
                       '    ValueError: If n is negative.\n'
                       '    """\n'
                       '    if n < 0:\n'
                       '        raise ValueError("Factorial is not defined for '
                       'negative numbers")\n'
                       '    \n'
                       '    result = 1\n'
                       '    for i in range(1, n + 1):\n'
                       '        result *= i\n'
                       '    return result\n'
                       '\n'
                       '# Example usage\n'
                       'try:\n'
                       '    number = 5\n'
                       '    print(f"The factorial of {number} is '
                       '{factorial(number)}")\n'
                       'except ValueError as e:\n'
                       '    print(e)\n'
                       '```\n'
                       '\n'
                       'This function checks if the input'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}

You now have a deployed model in Seldon Core 2, ready and available for requests! To unload the model, you can run the following command.

!kubectl delete -f manifests/local-chat-completions.yaml -n seldon

model.mlops.seldon.io "local-chat-completions" deleted

