Prompting

The following demonstrates how to run a Local MLServer runtime instance to serve LLMs deployed on your own infrastructure, and how to configure the chat template that you want to use.

This tutorial assumes that you have a Kubernetes cluster running with Seldon Core 2 installed in the seldon namespace. Before you can start serving LLMs, you may first need to create a secret for the Hugging Face token and deploy the Local Runtime server. Please check our installation tutorial to see how to do so.

Default chat template

By default, when you load a model using the Local runtime, the server applies the default chat template defined in the model's tokenizer_config.json on the model page (see here). An example of a model-settings.json using the default chat template is the following:

!cat models/local-default-template/model-settings.json
{
    "name": "local-default-template",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 1024,
                    "default_generate_kwargs": {
                        "max_tokens": 256
                    }
                }
            }
        }
    }
}
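
If you want to double-check which chat template the model ships with, you can inspect its tokenizer locally. The snippet below is an optional sketch that assumes the transformers package is installed and that you can download the model from Hugging Face:

# Print the default chat template bundled with the model's tokenizer configuration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")
print(tokenizer.chat_template)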

To load the model, we need to apply the following manifest file:

!cat manifests/local-default-template.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-default-template
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/prompting/models/local-default-template"
  requirements:
  - llm-local

To apply the manifest file to your cluster and wait for the model to become ready, run the following commands:

!kubectl apply -f manifests/local-default-template.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/local-default-template created
model.mlops.seldon.io/local-default-template condition met

Before sending the actual request, we need to get the mesh IP. The following utility function will help you retrieve the correct IP:

import subprocess

def get_mesh_ip():
    # Fetch the external IP of the seldon-mesh LoadBalancer service.
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')
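
Depending on your cluster, the load balancer may expose a hostname instead of an IP (for example on some managed cloud providers). In that case, a variant along the following lines can be used instead (a minimal sketch; adjust it to your environment):

def get_mesh_host():
    # Fall back to the hostname field when the load balancer is addressed by DNS name.
    cmd = ("kubectl get svc seldon-mesh -n seldon "
           "-o jsonpath='{.status.loadBalancer.ingress[0].hostname}'")
    return subprocess.check_output(cmd, shell=True).decode('utf-8')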

We can now send a request to the model:

import json
import pprint
import requests


# Each message is described by three parallel tensors: "role", "content" and
# "type". The "content" and "type" entries are JSON-encoded lists with one
# element per content part of the corresponding message.
inference_request = {
    "inputs": [
        {
            "name": "role", 
            "shape": [2], 
            "datatype": "BYTES", 
            "data": ["system", "user"]
        },
        {
            "name": "content",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["You are a helpful assistant"]), 
                json.dumps(["What is the tallest building in the world?"]),
            ],
        },
        {
            "name": "type",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["text"]), 
                json.dumps(["text"]),
            ]
        }
    ],
}

endpoint = f"http://{get_mesh_ip()}/v2/models/local-default-template/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': 'f0390442-2d6b-4a0c-911b-1ea426d6dda7',
 'model_name': 'local-default-template_1',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': [' As of my knowledge cutoff in 2023, the tallest '
                       'building in the world is the Burj Khalifa, located in '
                       'Dubai, United Arab Emirates. Completed in 2010, it '
                       'stands at a staggering height of 828 meters (2,717 '
                       'feet). The Burj Khalifa’s design includes a mix of '
                       'architectural concepts and features a Y-shaped floor '
                       'plan to minimize wind forces on the exterior. It '
                       'serves not only as a symbol of urban development but '
                       'also houses residential and hotel spaces, an '
                       'observation deck, and various offices and facilities. '
                       'Please note, new structures may have been constructed '
                       'since my knowledge cutoff date that could have '
                       'surpassed its title. <|end|>'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
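
The generated text is returned in the content output tensor. A small helper to pull it out, reusing the response object from the request above:

# Extract the assistant's reply from the Open Inference (V2) response payload.
result = response.json()
content = next(out for out in result["outputs"] if out["name"] == "content")
print(content["data"][0])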

To unload the model, run the following command:

!kubectl delete -f manifests/local-default-template.yaml -n seldon
model.mlops.seldon.io "local-default-template" deleted

Predefined template

Some models come with multiple predefined templates. This is the case for CohereForAI/c4ai-command-r-v01-4bit, which has two additional templates that can be used: tool_use and rag (see tokenizer_config.json here). To configure the model to use either of those, specify its label in the chat_template parameter. A model-settings.json example using the rag template is the following:

!cat models/local-predefined-template/model-settings.json
{
    "name": "local-predefined-template",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "chat_template": "rag",
                "model_settings": {
                    "model": "CohereForAI/c4ai-command-r-v01-4bit",
                    "dtype": "float16",
                    "quantization": "bitsandbytes", 
                    "load_format": "bitsandbytes"
                }
            }
        }
    }
}

Deploying the model is the same as above; just ensure that you also send the documents tensor in your inference requests. A minimal sketch of that tensor is shown below, and the full request format is demonstrated in the custom template example that follows.
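
The sketch below shows only the additional documents tensor; the surrounding request follows the same structure as the other examples in this tutorial, and the document keys used here are illustrative and depend on what the template expects:

import json

# Sketch of the additional "documents" tensor: one JSON-encoded document per element.
documents_input = {
    "name": "documents",
    "shape": [2],
    "datatype": "BYTES",
    "data": [
        json.dumps({"text": "Document number one."}),
        json.dumps({"text": "Document number two."}),
    ],
}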

Custom template

In this example we will show how you can use a custom Jinja template for a RAG application with the Phi-3.5 model.

The model-settings.json file looks as follows:

!cat models/local-custom-rag-template/model-settings.json
{
    "name": "local-custom-rag-template",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "chat_template": "custom_rag.jinja",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 1024,
                    "default_generate_kwargs": {
                        "max_tokens": 256
                    }
                }
            }
        }
    }
}

Alongside the model-settings.json, we also create the Jinja template referenced above, custom_rag.jinja, in the same directory:

{%- for message in messages %}
    {%- if message['role'] == 'system' and message['content'] %}
        {{- '<|system|>\n' + message['content'] + '<|end|>\n' }}
    {%- elif message['role'] == 'user' %}
        {{- '<|user|>\n' }}
        {#- Inject the documents into the final user turn, before its content -#}
        {%- if loop.index == messages | length %}
            {%- for document in documents %}
                {{- '\nDocument: ' }}{{ loop.index0 }}{{ ' ' }} 
                {%- for key, value in document.items() %}
                    {{- key }}: {{value}}
                {%- endfor %}
            {%- endfor %}
        {%- endif %}
        {{- message['content'] + '<|end|>\n' }}
    {%- elif message['role'] == 'assistant' %}
        {{- '<|assistant|>\n' + message['content'] + '<|end|>\n' }}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|assistant|>\n' }}
{%- else %}
    {{- eos_token }}
{%- endif %}
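
Before deploying, you can render the template locally to preview the final prompt. The snippet below is only an offline approximation (it passes plain-string message contents and requires the jinja2 package), not what the runtime itself executes:

# Render the custom template locally to preview the resulting prompt.
from jinja2 import Template

with open("models/local-custom-rag-template/custom_rag.jinja") as f:
    template = Template(f.read())

prompt = template.render(
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What is the tallest building in the world?"},
    ],
    documents=[{"text": "Building A is the tallest building in the world."}],
    add_generation_prompt=True,
    eos_token="<|end|>",
)
print(prompt)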

We load the model as before:

!kubectl apply -f manifests/local-custom-rag-template.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/local-custom-rag-template created
model.mlops.seldon.io/local-custom-rag-template condition met

Once the model is loaded, we can send inference requests. What differs now is that we specify an additional tensor called documents, which contains the context on which the answer will be based. Note that the runtime looks specifically for a tensor named documents, so the name cannot be changed.

import json
import pprint
import requests


inference_request = {
    "inputs": [
        {
            "name": "role", 
            "shape": [2], 
            "datatype": "BYTES", 
            "data": ["system", "user"]
        },
        {
            "name": "content",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["You are a helpful assistant"]), 
                json.dumps(["What is the tallest building in the world?"]),
            ],
        },
        {
            "name": "type",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["text"]), 
                json.dumps(["text"]),
            ]
        },
        {
            "name": "documents",
            "shape": [3],
            "datatype": "BYTES",
            "data": [
                json.dumps({"text": "Building A is the tallest building in the world."}),
                json.dumps({"text": "Building B is the second tallest building in the world."}),
                json.dumps({"text": "Building C is the third tallest building in the world."}),
            ],
        }
    ],
}

endpoint = f"http://{get_mesh_ip()}/v2/models/local-custom-rag-template/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': '634764fd-d01f-4c15-a22a-ffc0661a8673',
 'model_name': 'local-custom-rag-template_1',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': [' The tallest building in the world, according to the '
                       'documents provided, is Building A. It is stated '
                       'directly in Document 0: "Building A is the tallest '
                       'building in the world." This confirms that Building A '
                       'surpasses both Buildings B and C in height, as per '
                       'their rankings in the subsequent documents. Building B '
                       'is the second tallest, and Building C is the third '
                       'tallest. Hence, Building A consistently holds the '
                       'position as the tallest in these documents. <|end|>'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}

Note that in this case, the answer is constructed from the context we provided, stating that Building A is the tallest building in the world.

To unload the model, run the following command:

!kubectl delete -f manifests/local-custom-rag-template.yaml -n seldon
model.mlops.seldon.io "local-custom-rag-template" deleted
