# Prompting

The following demonstrates how to run a Local MLServer runtime instance to serve inference with LLMs deployed on your own infrastructure, and how to configure the chat template that you want to use.

This tutorial assumes that you have a Kubernetes cluster running with Seldon Core 2 installed in the `seldon` namespace. Before serving LLMs, you may first need to create a secret for the Hugging Face token and deploy the Local Runtime server. Please check our [installation tutorial](/llm-module/introduction/installation.md) to see how to do so.

## Default chat template

By default, when you load a model using the Local runtime, the server applies the default chat template defined in `tokenizer_config.json` on the model page (see [here](https://huggingface.co/CohereForAI/c4ai-command-r-v01-4bit/tree/main) for an example). An example of a `model-settings.json` using the default chat template is the following:

```python
!cat models/local-default-template/model-settings.json
```

```
{
    "name": "local-default-template",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 1024,
                    "default_generate_kwargs": {
                        "max_tokens": 256
                    }
                }
            }
        }
    }
}
```

To load the model, we need to apply the following manifest file:

```python
!cat manifests/local-default-template.yaml
```

```
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-default-template
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/models/prompting/models/local-default-template"
  requirements:
  - llm-local
```

To apply the manifest file onto your cluster, run the following command:

```python
!kubectl apply -f manifests/local-default-template.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

```
model.mlops.seldon.io/local-default-template created
model.mlops.seldon.io/local-default-template condition met
```

Before sending the actual request, we need to get the mesh IP. The following utility function retrieves it from the `seldon-mesh` service:

```python
import subprocess

def get_mesh_ip():
    cmd = f"kubectl get svc seldon-mesh -n seldon -o jsonpath='{{.status.loadBalancer.ingress[0].ip}}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')
```

We can now send a request to the model:

```python
import json
import pprint
import requests


inference_request = {
    "inputs": [
        {
            "name": "role", 
            "shape": [2], 
            "datatype": "BYTES", 
            "data": ["system", "user"]
        },
        {
            "name": "content",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["You are a helpful assistant"]), 
                json.dumps(["What is the tallest building in the world?"]),
            ],
        },
        {
            "name": "type",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["text"]), 
                json.dumps(["text"]),
            ]
        }
    ],
}

endpoint = f"http://{get_mesh_ip()}/v2/models/local-default-template/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
```

```
{'id': 'f0390442-2d6b-4a0c-911b-1ea426d6dda7',
 'model_name': 'local-default-template_1',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': [' As of my knowledge cutoff in 2023, the tallest '
                       'building in the world is the Burj Khalifa, located in '
                       'Dubai, United Arab Emirates. Completed in 2010, it '
                       'stands at a staggering height of 828 meters (2,717 '
                       'feet). The Burj Khalifa’s design includes a mix of '
                       'architectural concepts and features a Y-shaped floor '
                       'plan to minimize wind forces on the exterior. It '
                       'serves not only as a symbol of urban development but '
                       'also houses residential and hotel spaces, an '
                       'observation deck, and various offices and facilities. '
                       'Please note, new structures may have been constructed '
                       'since my knowledge cutoff date that could have '
                       'surpassed its title. <|end|>'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
```
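The response is a list of named output tensors, and the generated text may end with the model's stop token (`<|end|>` in the output above). As a minimal sketch, the helper below pulls one tensor out by name and trims that token. The function name `get_output` and the `example` payload are illustrative and not part of the runtime.

```python
def get_output(response_json, name, stop_token="<|end|>"):
    """Return the first string of the output tensor called `name`."""
    for output in response_json.get("outputs", []):
        if output["name"] == name:
            text = output["data"][0].strip()
            # Drop a trailing stop token if the model emitted one.
            if text.endswith(stop_token):
                text = text[: -len(stop_token)].rstrip()
            return text
    raise KeyError(f"no output tensor named {name!r}")


# Hypothetical, shortened response with the same layout as the one above.
example = {
    "outputs": [
        {"name": "role", "data": ["assistant"]},
        {"name": "content", "data": [" The Burj Khalifa. <|end|>"]},
    ]
}
print(get_output(example, "content"))
```

With a real response you would call `get_output(response.json(), "content")`.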

To unload the model, run the following command:

```python
!kubectl delete -f manifests/local-default-template.yaml -n seldon
```

```
model.mlops.seldon.io "local-default-template" deleted
```

## Predefined template

Some models come with multiple predefined templates. This is the case for `CohereForAI/c4ai-command-r-v01-4bit`, which has two additional templates: `tool_use` and `rag` (see `tokenizer_config.json` [here](https://huggingface.co/CohereForAI/c4ai-command-r-v01-4bit/blob/main/tokenizer_config.json)). To configure the model to use either of them, specify its label in the `chat_template` parameter. A `model-settings.json` example that uses the `rag` template is the following:

```python
!cat models/local-predefined-template/model-settings.json
```

```
{
    "name": "local-predefined-template",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "chat_template": "rag",
                "model_settings": {
                    "model": "CohereForAI/c4ai-command-r-v01-4bit",
                    "dtype": "float16",
                    "quantization": "bitsandbytes", 
                    "load_format": "bitsandbytes"
                }
            }
        }
    }
}
```
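To find out which labels are valid for `chat_template`, you can inspect the model's `tokenizer_config.json`. The sketch below assumes the layout commonly used on the Hugging Face Hub, where `chat_template` is either a single string or a list of `{"name": ..., "template": ...}` entries. The helper `list_chat_templates` and the `sample` dictionary are illustrative only.

```python
def list_chat_templates(tokenizer_config):
    """Map template label -> template source from a parsed tokenizer_config.json."""
    chat_template = tokenizer_config.get("chat_template")
    if chat_template is None:
        return {}
    if isinstance(chat_template, str):
        # A single unnamed template; "default" is a label chosen for this sketch.
        return {"default": chat_template}
    return {entry["name"]: entry["template"] for entry in chat_template}


# Stand-in for json.load(open("tokenizer_config.json")); the templates are placeholders.
sample = {
    "chat_template": [
        {"name": "default", "template": "..."},
        {"name": "tool_use", "template": "..."},
        {"name": "rag", "template": "..."},
    ]
}
print(sorted(list_chat_templates(sample)))
```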

Deploying the model works the same way as above; just make sure that you also send the `documents` tensor. The next section shows how.
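As a rough sketch, a request that carries documents could be assembled with a small helper. It mirrors the tensor layout used by the requests on this page (`role`, `content`, `type`, plus an optional `documents` tensor). The names `bytes_tensor` and `build_chat_request` are hypothetical conveniences, not part of the runtime.

```python
import json


def bytes_tensor(name, data):
    """Build a one-dimensional BYTES input tensor."""
    return {"name": name, "shape": [len(data)], "datatype": "BYTES", "data": data}


def build_chat_request(messages, documents=None):
    """Turn chat messages (and optional documents) into an inference request."""
    inputs = [
        bytes_tensor("role", [m["role"] for m in messages]),
        # Each message body is a JSON-encoded list, as in the examples on this page.
        bytes_tensor("content", [json.dumps([m["content"]]) for m in messages]),
        bytes_tensor("type", [json.dumps(["text"]) for _ in messages]),
    ]
    if documents:
        # Each document is sent as one JSON-encoded object.
        inputs.append(bytes_tensor("documents", [json.dumps(d) for d in documents]))
    return {"inputs": inputs}


request = build_chat_request(
    [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "What is the tallest building in the world?"},
    ],
    documents=[{"text": "Building A is the tallest building in the world."}],
)
print([tensor["name"] for tensor in request["inputs"]])
```

The resulting dictionary can be passed to `requests.post(endpoint, json=request)`.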

## Custom template

In this example we will show how you can use a custom `jinja` template for a `rag` application with the `phi-3.5` model.

The `model-settings.json` file looks as follows:

```python
!cat models/local-custom-rag-template/model-settings.json
```

```
{
    "name": "local-custom-rag-template",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "chat_template": "custom_rag.jinja",
                "model_settings": {
                    "model": "microsoft/Phi-3.5-mini-instruct",
                    "tensor_parallel_size": 4,
                    "dtype": "float16",
                    "gpu_memory_utilization": 0.8,
                    "max_model_len": 1024,
                    "default_generate_kwargs": {
                        "max_tokens": 256
                    }
                }
            }
        }
    }
}
```

Alongside the `model-settings.json`, we also create the `jinja` template (in the same directory):

```
{%- for message in messages %}
    {%- if message['role'] == 'system' and message['content'] %}
        {{- '<|system|>\n' + message['content'] + '<|end|>\n' }}
    {%- elif message['role'] == 'user' %}
        {{- '<|user|>\n' }}
        {%- if loop.index == messages | length %}
            {%- for document in documents %}
                {{- '\nDocument: ' }}{{ loop.index0 }}{{ ' ' }} 
                {%- for key, value in document.items() %}
                    {{- key }}: {{value}}
                {%- endfor %}
            {%- endfor %}
        {%- endif %}
        {{- message['content'] + '<|end|>\n' }}
    {%- elif message['role'] == 'assistant' %}
        {{- '<|assistant|>\n' + message['content'] + '<|end|>\n' }}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|assistant|>\n' }}
{%- else %}
    {{- eos_token }}
{%- endif %}
```
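To get a feel for the prompt this template produces, the following is a plain-Python approximation of its branching. It does not use the Jinja engine, and the whitespace handling reflects one reading of the `{%-` and `{{-` trim markers. It also assumes that each message's `content` reaches the template as a plain string. The function name `render_custom_rag` is illustrative.

```python
def render_custom_rag(messages, documents, add_generation_prompt=True, eos_token=""):
    """Approximate the output of custom_rag.jinja in plain Python."""
    parts = []
    last_index = len(messages) - 1
    for i, message in enumerate(messages):
        role, content = message["role"], message["content"]
        if role == "system" and content:
            parts.append("<|system|>\n" + content + "<|end|>\n")
        elif role == "user":
            parts.append("<|user|>\n")
            # Documents are injected only when the user turn is the final message.
            if i == last_index:
                for doc_index, document in enumerate(documents):
                    parts.append(f"\nDocument: {doc_index} ")
                    for key, value in document.items():
                        parts.append(f"{key}: {value}")
            parts.append(content + "<|end|>\n")
        elif role == "assistant":
            parts.append("<|assistant|>\n" + content + "<|end|>\n")
    # Either open the assistant turn or close the sequence.
    parts.append("<|assistant|>\n" if add_generation_prompt else eos_token)
    return "".join(parts)


prompt = render_custom_rag(
    [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Q?"},
    ],
    documents=[{"text": "A"}, {"text": "B"}],
)
print(prompt)
```

Under this reading, the documents are numbered from 0 and placed inside the last user turn, directly before the question.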

We load the model as before:

```python
!kubectl apply -f manifests/local-custom-rag-template.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
```

```
model.mlops.seldon.io/local-custom-rag-template created
model.mlops.seldon.io/local-custom-rag-template condition met
```

Once the model is loaded, we can send inference requests. What differs now is that we specify an additional tensor called `documents`, which contains the context the answer will be based on. Note that the runtime looks specifically for a tensor named `documents`, so the name cannot be changed.

```python
import json
import pprint
import requests


inference_request = {
    "inputs": [
        {
            "name": "role", 
            "shape": [2], 
            "datatype": "BYTES", 
            "data": ["system", "user"]
        },
        {
            "name": "content",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["You are a helpful assistant"]), 
                json.dumps(["What is the tallest building in the world?"]),
            ],
        },
        {
            "name": "type",
            "shape": [2],
            "datatype": "BYTES",
            "data": [
                json.dumps(["text"]), 
                json.dumps(["text"]),
            ]
        },
        {
            "name": "documents",
            "shape": [3],
            "datatype": "BYTES",
            "data": [
                json.dumps({"text": "Building A is the tallest building in the world."}),
                json.dumps({"text": "Building B is the second tallest building in the world."}),
                json.dumps({"text": "Building C is the third tallest building in the world."}),
            ],
        }
    ],
}

endpoint = f"http://{get_mesh_ip()}/v2/models/local-custom-rag-template/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
```

```
{'id': '634764fd-d01f-4c15-a22a-ffc0661a8673',
 'model_name': 'local-custom-rag-template_1',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': [' The tallest building in the world, according to the '
                       'documents provided, is Building A. It is stated '
                       'directly in Document 0: "Building A is the tallest '
                       'building in the world." This confirms that Building A '
                       'surpasses both Buildings B and C in height, as per '
                       'their rankings in the subsequent documents. Building B '
                       'is the second tallest, and Building C is the third '
                       'tallest. Hence, Building A consistently holds the '
                       'position as the tallest in these documents. <|end|>'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
```

Note that in this case the answer is constructed from the context we provided, stating that Building A is the tallest in the world.

To unload the model, run the following command:

```python
!kubectl delete -f manifests/local-custom-rag-template.yaml -n seldon
```

```
model.mlops.seldon.io "local-custom-rag-template" deleted
```

