Prompting

The following demonstrates how to run a Local MLServer runtime instance to serve LLMs deployed on your own infrastructure and how to configure the chat template you want to use.

In this tutorial we assume that you have a Kubernetes cluster running with Seldon Core 2 installed in the seldon namespace. Before serving LLMs, you may first need to create a secret for the Hugging Face token and deploy the Local Runtime server. Please check our installation tutorial to see how to do so.

Default chat template

By default, when you load a model using the Local runtime, the server applies the default chat template located under tokenizer_config.json on the model page (see here). An example of a model-settings.json using the default chat template is the following:

!cat models/local-default-template/model-settings.json
{
    "name": "local-default-template",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 1024,
                    "default_generate_kwargs": {
                        "max_tokens": 256
                    }
                }
            }
        }
    }
}

To load the model, we need to apply the following manifest file:
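A minimal sketch of such a manifest is shown below. It assumes the model-settings folder above has been uploaded to object storage; the storageUri and the llm-local requirement are illustrative and should match your artifact location and server capabilities:

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-default-template
spec:
  # Illustrative location of the folder containing model-settings.json
  storageUri: "gs://my-bucket/models/local-default-template"
  requirements:
  - llm-local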

To apply the manifest file onto your cluster, run the following command:
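For example, assuming the manifest above was saved as manifests/local-default-template.yaml (the path is illustrative), apply it and wait for the model to become ready:

!kubectl apply -f manifests/local-default-template.yaml -n seldon
!kubectl wait --for condition=ready --timeout=300s model/local-default-template -n seldon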

Before sending the actual request, we need to get the mesh IP. The following util function will help you retrieve the correct IP:
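A sketch of such a helper is shown below. It assumes Seldon Core 2 exposes its data plane through a LoadBalancer service called seldon-mesh in the seldon namespace; adjust the service name and namespace to your installation:

import subprocess

def get_mesh_ip(namespace: str = "seldon") -> str:
    """Return the external IP of the seldon-mesh LoadBalancer service."""
    cmd = (
        f"kubectl get svc seldon-mesh -n {namespace} "
        "-o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    )
    return subprocess.check_output(cmd, shell=True).decode().strip()

mesh_ip = get_mesh_ip()
mesh_ip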

We can now send a request to the model:
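A hedged example using the requests library is shown below. It assumes the model is reachable over the Open Inference Protocol through the mesh and that the chat.completions model type accepts role and content BYTES tensors; adjust the tensor names and payload if your runtime version differs:

import json
import requests

# Tensor names follow the chat.completions convention; adjust if your runtime differs.
inference_request = {
    "inputs": [
        {"name": "role", "shape": [1], "datatype": "BYTES", "data": ["user"]},
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["What is the capital of France?"],
        },
    ]
}

endpoint = f"http://{mesh_ip}/v2/models/local-default-template/infer"
response = requests.post(
    endpoint,
    json=inference_request,
    headers={"Seldon-Model": "local-default-template"},
)
print(json.dumps(response.json(), indent=2))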

To unload the model, run the following command:
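For example, deleting the resource created from the manifest above:

!kubectl delete -f manifests/local-default-template.yaml -n seldon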

Predefined template

Some models come with multiple predefined templates. This is the case for CohereForAI/c4ai-command-r-v01-4bit, which provides two additional templates: tool_use and rag (see tokenizer_config.json here). To configure the model to use either of those, specify their label in the chat_template parameter. A model-settings.json example using the rag template is the following:
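A minimal sketch of such a configuration is shown below; the model name local-rag-template is illustrative and the exact placement of the chat_template field may differ between runtime versions:

{
    "name": "local-rag-template",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "CohereForAI/c4ai-command-r-v01-4bit",
                    "chat_template": "rag",
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 1024,
                    "default_generate_kwargs": {
                        "max_tokens": 256
                    }
                }
            }
        }
    }
}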

Deploying the model is the same as above; just ensure that you also send the documents tensor. We show how in the following example.

Custom template

In this example we will show how you can use a custom Jinja template for a RAG application with the Phi-3.5 model.

The model-settings.json file looks as follows:
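A sketch of what this could look like is shown below. It assumes the runtime accepts the name of a Jinja file placed next to the model-settings.json in a chat_template field; the field placement and the template.jinja file name are assumptions:

{
    "name": "local-custom-template",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "chat_template": "template.jinja",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 1024,
                    "default_generate_kwargs": {
                        "max_tokens": 256
                    }
                }
            }
        }
    }
}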

Alongside the model-settings.json, we also create the jinja template (in the same directory):
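A simplified sketch of such a template (template.jinja) is shown below. It renders the documents tensor into the user turn using the Phi-3.5 chat markers; a production template would follow the model's full prompt format more closely:

{%- for message in messages %}
{%- if message['role'] == 'user' %}
<|user|>
{%- if documents %}
Answer the question using only the following documents:
{%- for doc in documents %}
- {{ doc }}
{%- endfor %}
{%- endif %}
{{ message['content'] }}<|end|>
{%- elif message['role'] == 'assistant' %}
<|assistant|>
{{ message['content'] }}<|end|>
{%- endif %}
{%- endfor %}
<|assistant|>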

We load the model as before:
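For example, assuming a manifest analogous to the one above saved as manifests/local-custom-template.yaml:

!kubectl apply -f manifests/local-custom-template.yaml -n seldon
!kubectl wait --for condition=ready --timeout=300s model/local-custom-template -n seldon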

Once the model is loaded, we can send inference requests. What differs now is that we specify an additional tensor called documents, which contains the context from which the answer will be constructed. Note that the runtime looks specifically for a tensor named documents, so the name cannot be changed.
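A hedged example is shown below, reusing the role and content tensors from before and adding the documents tensor; the documents here are plain strings, but your template may expect a different structure:

import json
import requests

inference_request = {
    "inputs": [
        {"name": "role", "shape": [1], "datatype": "BYTES", "data": ["user"]},
        {
            "name": "content",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["Which is the tallest building in the world?"],
        },
        # The runtime looks specifically for a tensor named `documents`.
        {
            "name": "documents",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                "Building A is 1,000 meters tall and is the tallest building in the world.",
                "Building B is 500 meters tall.",
            ],
        },
    ]
}

endpoint = f"http://{mesh_ip}/v2/models/local-custom-template/infer"
response = requests.post(
    endpoint,
    json=inference_request,
    headers={"Seldon-Model": "local-custom-template"},
)
print(json.dumps(response.json(), indent=2))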

Note that in this case, the answer is constructed from the context we provided, returning that Building A is the tallest building in the world.

To unload the model, run the following command:
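For example:

!kubectl delete -f manifests/local-custom-template.yaml -n seldon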