Example
In the LocalRuntime example, we saw how to deploy a model on our own infrastructure and how to define prompts tailored to that particular model. By defining a prompt at deployment time, we simplified the user's experience: the user no longer has to compile a prompt locally and send it with each request. Defining a prompt also enables more complex pipelines that look only for specific tensors and use their content to compile the prompt. The one caveat of that example is that the prompt is bound to the model and compiled locally; in other words, we can only have one prompt per LLM deployment. Such an approach is optimal when we have a single use case for the LLM, but in many situations we would like to reuse the same model for multiple tasks. Redeploying another copy of the model is not a solution, since LLMs tend to be large and demand a lot of hardware. This is where the PromptRuntime comes into the picture.
The PromptRuntime allows you to deploy multiple prompt models, each with a single job: compile a prompt from the input tensors and send the compiled prompt to the LLM for completion. In this setup, the LLM is a general-purpose completion model that generates a completion for any given prompt.
Before starting this tutorial, please ensure that both the LocalRuntime and the PromptRuntime are up and running; check our installation guidelines for how to do so.
We also require a helper function to get the mesh IP. This function will help us find the correct URL to send the request to.
import subprocess

def get_mesh_ip():
    # Fetch the external IP of the seldon-mesh service, used to build request URLs.
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')
We can now deploy the LLM. The associated model-settings.json file is the following:
!cat models/tiny-llama/model-settings.json
{
  "name": "tiny-llama",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "vllm",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
          "dtype": "float16"
        }
      },
      "prompt_utils": {
        "model_type": "compiled"
      }
    }
  }
}
There are a couple of things to note in the settings file:
- "model_type": "chat.completions" under "config" means that the model internally uses the default chat-completion prompt (i.e., the one present in the config.json file from the model hub).
- "model_type": "compiled" under "prompt_utils" means that the model expects an already compiled prompt in the request.
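To make the "compiled" notion more concrete, here is a small local illustration (not part of the deployment) of what compiling a chat conversation into a plain-text prompt looks like. It assumes the transformers library is installed and uses the model's own chat template; the runtime performs an equivalent step server-side, so the exact template it applies may differ:

# Illustration only: compile a chat conversation into a plain-text prompt,
# roughly what "compiling" a prompt means in this tutorial.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "What is the capital of Romania?"},
]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))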
The associated manifest file for this model is:
!cat manifests/tiny-llama.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: tiny-llama
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/prompting/models/tiny-llama"
  requirements:
  - llm-local
We can now deploy the model by running the following commands:
!kubectl apply -f manifests/tiny-llama.yaml -n seldon
!kubectl wait --for=condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/tiny-llama created
model.mlops.seldon.io/tiny-llama condition met
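As an optional sanity check before moving on, you can inspect the deployed Model resource directly:
!kubectl get model tiny-llama -n seldon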
Next, we deploy some prompt models that will compile their prompts and forward them to the LLM for completion.
We begin with a classical chat-completions prompt. The associated model-settings.json file is the following:
!cat models/chat-completions/model-settings.json
{
  "name": "chat-completions",
  "implementation": "mlserver_prompt_utils.runtime.PromptRuntime",
  "parameters": {
    "extra": {
      "prompt_utils": {
        "model_type": "chat.completions"
      }
    }
  }
}
The corresponding manifest file is the following:
!cat manifests/chat-completions.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: chat-completions
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/prompting/models/chat-completions"
  llm:
    modelRef: tiny-llama
  requirements:
  - prompt
As you can see in the manifest file, we have specified the modelRef field, which points to the LLM model. In this way, the prompt model will internally call the tiny-llama LLM we deployed earlier.
We can deploy the prompt model by running the following commands:
!kubectl apply -f manifests/chat-completions.yaml -n seldon
!kubectl wait --for=condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/chat-completions created
model.mlops.seldon.io/chat-completions condition met
model.mlops.seldon.io/tiny-llama condition met
We now define a second prompt model, which uses a custom Jinja prompt template to compile the prompt. The model-settings.json file looks like this:
!cat models/chat-completions-prompt/model-settings.json
{
  "name": "prompt-chat-completions",
  "implementation": "mlserver_prompt_utils.runtime.PromptRuntime",
  "parameters": {
    "extra": {
      "prompt_utils": {
        "model_type": "chat.completions",
        "prompt_options": {
          "uri": "cot.jinja"
        }
      }
    }
  }
}
Note that we still use the chat.completions model type, and we've now also included the path to the local Jinja template via the uri field under prompt_options. The Jinja template used for this example is the following:
!cat models/chat-completions-prompt/cot.jinja
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: {{ question[0] }}
A:
As you may have guessed, our inference request will contain a question tensor which will replace the placeholder above and result in a compiled prompt. The compiled prompt will then be sent to the LLM for completion to answer the given question.
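For intuition, you can render the template locally with Jinja (assuming the jinja2 package is available) to see roughly what the compiled prompt will look like; in the actual deployment, the PromptRuntime performs this step server-side:

# Illustration only: substitute the `question` tensor into the template,
# mimicking what the prompt model does before calling the LLM.
from jinja2 import Template

with open("models/chat-completions-prompt/cot.jinja") as f:
    template = Template(f.read())

print(template.render(question=["The cafeteria had 23 apples. If they used 20. How many apples are left?"]))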
The corresponding manifest file is the following:
!cat manifests/chat-completions-prompt.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: chat-completions-prompt
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/prompting/models/chat-completions-prompt"
  llm:
    modelRef: tiny-llama
  requirements:
  - prompt
We can now deploy the model by running the following commands:
!kubectl apply -f manifests/chat-completions-prompt.yaml -n seldon
!kubectl wait --for=condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/chat-completions-prompt created
model.mlops.seldon.io/chat-completions condition met
model.mlops.seldon.io/chat-completions-prompt condition met
model.mlops.seldon.io/tiny-llama condition met
With both of our models deployed, we can send requests for completion.
We begin by sending a request to the regular chat-completions model by running the following piece of code:
import json
import pprint
import requests
inference_request = {
"inputs": [
{
"name": "role",
"shape": [2],
"datatype": "BYTES",
"data": ["system", "user"]
},
{
"name": "content",
"shape": [2],
"datatype": "BYTES",
"data": [
json.dumps(["You are a helpful assistant"]),
json.dumps(["What is the capital of Romania?"]),
],
},
{
"name": "type",
"shape": [2],
"datatype": "BYTES",
"data": [
json.dumps(["text"]),
json.dumps(["text"]),
]
}
],
}
endpoint = f"http://{get_mesh_ip()}/v2/models/chat-completions/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': '8d44cdea-8b9d-46d0-bf1b-5e2174eb21f3',
 'model_name': 'tiny-llama_1',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['The capital of Romania is Bucharest.'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
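Since the V2 inference response packs everything into named output tensors, a small convenience helper (purely illustrative, not part of the runtime) makes it easier to pull out just the generated text:

# Convenience helper: map output tensor names to their data.
def outputs_to_dict(response_json):
    return {output["name"]: output["data"] for output in response_json["outputs"]}

print(outputs_to_dict(response.json())["content"][0])
# The capital of Romania is Bucharest.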
As you can see, we successfully queried the first model. Now we can do the same for the second model by including the question tensor in our request; this time we also pass generation options (max_tokens and temperature) through the kwargs field in the request parameters:
import pprint
import requests
inference_request = {
"inputs": [
{
"name": "role",
"shape": [1],
"datatype": "BYTES",
"data": ["user"],
},
{
"name": "question",
"shape": [1],
"datatype": "BYTES",
"data": [
"The cafeteria had 23 apples. If they used 20. How many apples are left?"
],
},
],
"parameters": {
"kwargs": {
"max_tokens": 1024,
"temperature": 0.0,
}
}
}
endpoint = f"http://{get_mesh_ip()}/v2/models/chat-completions-prompt/infer"
response = requests.post(endpoint, json=inference_request)
pprint.pprint(response.json(), depth=4)
{'id': '5cd5de7a-f3ae-4bdd-9ef6-2acadd10f818',
 'model_name': 'tiny-llama_1',
 'outputs': [{'data': ['assistant'],
              'datatype': 'BYTES',
              'name': 'role',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['The given text states that the cafeteria had 23 '
                       'apples. If they used 20 apples, the answer is 23 - 20 '
                       '= 3 apples left.'],
              'datatype': 'BYTES',
              'name': 'content',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]},
             {'data': ['text'],
              'datatype': 'BYTES',
              'name': 'type',
              'parameters': {'content_type': 'str'},
              'shape': [1, 1]}],
 'parameters': {}}
As before, the model can successfully respond to our question.
To unload the models, run the following commands:
!kubectl delete -f manifests/chat-completions-prompt.yaml -n seldon
!kubectl delete -f manifests/chat-completions.yaml -n seldon
!kubectl delete -f manifests/tiny-llama.yaml -n seldon
model.mlops.seldon.io "chat-completions-prompt" deleted
model.mlops.seldon.io "chat-completions" deleted
model.mlops.seldon.io "tiny-llama" deleted
Congrats! You've successfully served two use cases using the same LLM deployment.