Example

In the LocalRuntime example, we saw how to deploy a model on our infrastructure and how to create specific prompts for that particular model. By defining a prompt, we simplified the user's experience: the user no longer has to compile a prompt locally and send it with each request. Defining a prompt also allows the creation of more complex pipelines that look only for specific tensors and use their content to compile the prompt. The one caveat of that example is that the prompt is bound to the model and compiled locally, which means we can have only one prompt per LLM deployment. This approach is optimal when we have a single use case for that LLM, but in many situations we would like to reuse the same model for multiple tasks. Deploying another copy of the model is not a solution, since LLMs tend to be large and demand a lot of hardware. This is where the PromptRuntime comes into the picture.

The PromptRuntime allows you to deploy multiple prompt models, each with a single job: compile a prompt from the input tensors and send the compiled prompt to the LLM for completion. In this case, the LLM will be a general-purpose completion model that generates a completion given a prompt.

Before starting this tutorial, please ensure you have both the LocalRuntime and the PromptRuntime up and running. Please check our installation guidelines for instructions on how to do so.

We also require a helper function to get the mesh IP. This function will help us find the correct URL to send the request to.

import subprocess

def get_mesh_ip():
    # Return the external IP of the seldon-mesh LoadBalancer service,
    # which is where inference requests are sent.
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8').strip()

We can now deploy the LLM. The associated model-settings.json file is the following:

!cat models/tiny-llama/model-settings.json
{
    "name": "tiny-llama",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "vllm",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                    "dtype": "float16"
                }
            },
            "prompt_utils": {
                "model_type": "compiled"
            }
        }
    }
}

There are a couple of things to note in the settings file:

  • "model_type": "chat.completions" under "config" - means that the model internally uses the default chat-completion prompt template (i.e., the chat template shipped with the model on the model hub, typically in its tokenizer_config.json).

  • "model_type": "compiled" under "prompt_utils" - means that the model will expect a compiled prompt in the request.

The associated manifest file for this model is:
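
A minimal sketch of such a manifest, using the standard Seldon Core 2 Model resource; the storageUri and requirements values below are assumptions and should match where you store the model-settings.json above and which server capabilities you have configured:

# Sketch only: storageUri and the requirements entry are assumptions for illustration.
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: tiny-llama
  namespace: seldon
spec:
  storageUri: gs://my-bucket/models/tiny-llama
  requirements:
  - llm-local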

We can now deploy the model by running the following commands:
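
For example, assuming the manifest above is saved as manifests/tiny-llama.yaml (a hypothetical path), deployment is a kubectl apply followed by waiting for the model to become ready:

# Hypothetical manifest path; adjust to your repository layout.
kubectl apply -f manifests/tiny-llama.yaml -n seldon
kubectl wait --for condition=ready --timeout=600s model tiny-llama -n seldon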

Next, we can deploy some prompt models that will compile their prompts and forward them to the LLM for completion.

We begin with a classical chat-completions prompt. The associated model-settings.json file is the following:

The corresponding manifest file is the following:
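
A sketch of what that manifest might look like. The storageUri and requirements entries are assumptions, and the exact place where modelRef sits in the Model spec may differ across Seldon Core 2 versions; the important part is that it names the tiny-llama model deployed above. The chat-prompt name is our hypothetical choice for this prompt model:

# Sketch only: storageUri, requirements and the exact location of modelRef are assumptions.
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: chat-prompt
  namespace: seldon
spec:
  storageUri: gs://my-bucket/models/chat-prompt
  requirements:
  - prompt
  llm:
    modelRef: tiny-llama   # points the prompt model at the LLM deployed earlier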

As you can see in the manifest file, we have specified the modelRef field, which points to the LLM model. In this way, the prompt model will internally call the tiny-llama LLM we deployed earlier.

We can deploy the prompt model by running the following command:
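
Assuming the manifest above is saved as manifests/chat-prompt.yaml (again a hypothetical path):

kubectl apply -f manifests/chat-prompt.yaml -n seldon
kubectl wait --for condition=ready --timeout=300s model chat-prompt -n seldon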

We define a second prompt model, which will use a custom Jinja prompt template to compile the prompt. The model-settings.json file looks like this:

Note that we still have the reference to the model, and we've included the path to the local Jinja template as well. The Jinja template used for this example is the following:
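
A minimal sketch of such a template, assuming the placeholder is named after the question tensor described below; the surrounding wording is purely illustrative:

{# Illustrative template: the placeholder name is assumed to match the "question" input tensor. #}
You are a helpful assistant. Answer the question below as accurately and concisely as you can.

Question: {{ question }}

Answer: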

As you may have guessed, our inference request will contain a question tensor which will replace the placeholder above and result in a compiled prompt. The compiled prompt will then be sent to the LLM for completion to answer the given question.

The corresponding manifest file is the following:
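
A sketch along the same lines as the previous manifest, with the assumed values again called out in comments; question-prompt is a hypothetical name for this second prompt model:

# Sketch only: storageUri, requirements and the exact location of modelRef are assumptions.
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: question-prompt
  namespace: seldon
spec:
  storageUri: gs://my-bucket/models/question-prompt
  requirements:
  - prompt
  llm:
    modelRef: tiny-llama   # the same tiny-llama LLM serves this prompt model too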

We can now deploy the model by running the following commands:
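
As before, assuming the manifest is saved as manifests/question-prompt.yaml:

kubectl apply -f manifests/question-prompt.yaml -n seldon
kubectl wait --for condition=ready --timeout=300s model question-prompt -n seldon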

With both prompt models deployed, we can send requests for completion.

We begin by sending a request to the regular chat-completion model by running the following piece of code:
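
A sketch of such a request over the Open Inference Protocol, using the get_mesh_ip helper defined earlier. The chat-prompt model name matches the hypothetical manifests above, and the role/content tensor names are assumptions about the inputs the chat-completions prompt expects; the Seldon-Model header tells the mesh which model to route the request to:

import requests

# Hypothetical model name, matching the manifest sketches above.
model_name = "chat-prompt"
url = f"http://{get_mesh_ip()}/v2/models/{model_name}/infer"

# The "role"/"content" tensor names are assumptions about the chat-completions prompt inputs.
payload = {
    "inputs": [
        {"name": "role", "shape": [1], "datatype": "BYTES", "data": ["user"]},
        {"name": "content", "shape": [1], "datatype": "BYTES", "data": ["What is the capital of France?"]},
    ]
}

response = requests.post(url, json=payload, headers={"Seldon-Model": model_name})
print(response.json())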

As you can see, we successfully queried the first model. Now we can do the same for the second model, by including the question tensor in our request:
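
The request for the second model has the same shape, but carries the question tensor that fills the Jinja template (question-prompt again matches the hypothetical manifests above):

# Hypothetical model name, matching the manifest sketches above.
model_name = "question-prompt"
url = f"http://{get_mesh_ip()}/v2/models/{model_name}/infer"

# The "question" tensor fills the placeholder in the Jinja template.
payload = {
    "inputs": [
        {"name": "question", "shape": [1], "datatype": "BYTES", "data": ["Why is the sky blue?"]},
    ]
}

response = requests.post(url, json=payload, headers={"Seldon-Model": model_name})
print(response.json())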

As before, the model can successfully respond to our question.

To unload the models, run the following commands:
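
Assuming the hypothetical manifests used throughout this example, unloading amounts to deleting the Model resources:

kubectl delete -f manifests/tiny-llama.yaml -n seldon
kubectl delete -f manifests/chat-prompt.yaml -n seldon
kubectl delete -f manifests/question-prompt.yaml -n seldon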

Congrats! You've successfully managed to serve two use cases using the same LLM.
