Guardrails

In this tutorial, we will show how to implement a local LLM guardrails pipeline. We will deploy the mistralai/Mistral-7B-Instruct-v0.2 model locally using the Local Runtime, together with two additional prompt models. The first prompt model accepts or rejects a request based on whether the content of the query aligns with our filtering criteria. The second is a standard chat template prompt, which generates the answer if the query passes our filtering checks. In addition, we will deploy a custom MLServer model that generates a warning message in case the query does not pass our checks.

Thus, for this tutorial, you need to have the Local Runtime, the Prompt Runtime, and an instance of MLServer up and running. Please consult our installation page for more details on how to set up those servers.

We begin by defining the custom MLServer model. The implementation of this model is the following:

from mlserver.model import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class MessageModel(MLModel):
    async def load(self) -> bool:
        # Nothing to load: the model is stateless and always returns a fixed message.
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Ignore the incoming payload and return a fixed chat-style message,
        # encoded as the "role", "content", and "type" string tensors.
        return InferenceResponse(
            model_name=self.settings.name,
            model_version=self.settings.version,
            outputs=[
                StringCodec.encode_output(
                    name="role",
                    payload=["assistant"],
                ),
                StringCodec.encode_output(
                    name="content",
                    payload=["I cannot talk about this."],
                ),
                StringCodec.encode_output(
                    name="type",
                    payload=["text"],
                ),
            ]
        )

As you can see, the model ignores the input and simply returns a warning message saying "I cannot talk about this." We will call this the message model. Like any custom MLServer model, the message model needs a model-settings.json file.
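A minimal sketch of this file is shown below; we assume the class above is saved in a module named message_model.py, so adjust the implementation path to match your own layout.

{
    "name": "message",
    "implementation": "message_model.MessageModel"
}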

We can now define the model-settings.json files for the other models. We begin with the mistral model.
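A sketch of its model-settings.json is given below. The implementation class and the exact parameter layout are placeholders that depend on the version of the Local Runtime you have installed; the part this tutorial relies on is the "prompt_utils" section.

{
    "name": "mistral",
    "implementation": "<local-runtime-implementation-class>",
    "parameters": {
        "extra": {
            "model_settings": {
                "model": "mistralai/Mistral-7B-Instruct-v0.2"
            },
            "prompt_utils": {
                "model_type": "compiled"
            }
        }
    }
}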

Probably the most important setting in the configuration above is "model_type": "compiled" under "prompt_utils". It means that the model expects an already-compiled inference request from a Prompt Runtime model; the purpose of this configuration is to reuse the same locally deployed model with multiple prompts.

The next model we need is generate-prompt, a standard chat template prompt model served by the Prompt Runtime, which also requires a model-settings.json file.
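A minimal sketch is given below. The implementation class name is a placeholder tied to your Prompt Runtime version, and we assume here that the runtime falls back to its standard chat template when no custom template options are configured.

{
    "name": "generate-prompt",
    "implementation": "<prompt-runtime-implementation-class>"
}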

The last model we have to define is the guard-prompt, which needs its own model-settings.json file as well.
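A sketch of this file is shown below. The implementation class is again a placeholder, and both the template filename guard-prompt.jinja and the extracted key decision are illustrative names; what matters here is the shape of the "prompt_options" and "extract_tensors" sections discussed next.

{
    "name": "guard-prompt",
    "implementation": "<prompt-runtime-implementation-class>",
    "parameters": {
        "extra": {
            "prompt_options": {
                "uri": "guard-prompt.jinja"
            },
            "extract_tensors": {
                "keys": ["decision"]
            }
        }
    }
}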

Two settings deserve emphasis here: "uri" under "prompt_options" and "keys" under "extract_tensors".

The "uri" points to a Jinja prompt template the model will use. Here is the content of the Jinja template:

Note that we are specifically instructing the model to only accept topics related to cats and dogs and to return a JSON-formatted output that contains the decision on whether the topic of the question is accepted.

The "keys" point to the key in the output for which we want to extract the value. Internally, the Prompt Runtime will take that value and use it to name an empty tensor which will be returned in the inference response. That empty tensor will be used as a routing tensor which will guide the data to the appropriate branch.

Now that we have all the model-settings.json files defined, we can write a manifest file to deploy the models above.
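A sketch of such a manifest, written as Seldon Core 2 Model resources, is given below. The storageUri values must point at the artifacts containing the model-settings.json files defined above, and the requirements entries are placeholders for the capabilities advertised by your Local Runtime, Prompt Runtime, and MLServer installations.

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: mistral
spec:
  storageUri: "gs://<your-bucket>/guardrails/mistral"   # folder holding the mistral model-settings.json
  requirements:
  - llm-local        # placeholder capability name for the Local Runtime
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: generate-prompt
spec:
  storageUri: "gs://<your-bucket>/guardrails/generate-prompt"
  requirements:
  - prompt           # placeholder capability name for the Prompt Runtime
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: guard-prompt
spec:
  storageUri: "gs://<your-bucket>/guardrails/guard-prompt"
  requirements:
  - prompt           # placeholder capability name for the Prompt Runtime
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: message
spec:
  storageUri: "gs://<your-bucket>/guardrails/message"
  requirements:
  - mlserver         # the custom message model runs on MLServer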

To deploy the models above, run the following command:
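Here we assume the manifest above was saved as models.yaml and that Seldon Core 2 is installed in the seldon-mesh namespace.

# assumes the models.yaml manifest and the seldon-mesh namespace from above
kubectl apply -f models.yaml -n seldon-mesh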

Having deployed the models, we can now define the conditional pipeline, which is described by its own manifest file.
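The sketch below shows how such a pipeline can be expressed as a Seldon Core 2 Pipeline resource. The pipeline name guardrails, the routing tensor names accepted and rejected, and the exact wiring between the prompt models and the mistral model are assumptions to adapt to your own setup; the key idea is that the two branches are triggered by the tensors extracted from the guard-prompt output and that the pipeline output joins the branches with stepsJoin: any.

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: guardrails                        # assumed pipeline name
spec:
  steps:
  - name: guard-prompt
  - name: generate-prompt
    inputs:
    - guardrails.inputs
    triggers:
    - guard-prompt.outputs.accepted       # assumed routing tensor name
  - name: mistral
    inputs:
    - generate-prompt.outputs             # compiled request produced by the prompt model
  - name: message
    inputs:
    - guardrails.inputs
    triggers:
    - guard-prompt.outputs.rejected       # assumed routing tensor name
  output:
    steps:
    - mistral
    - message
    stepsJoin: any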

Note that we route the dataflow to either the generate-prompt branch or the message branch based on the output of the guard-prompt model.

To deploy the pipeline above, run the following command:
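Here we assume the pipeline manifest above was saved as pipeline.yaml.

# assumes the pipeline.yaml manifest and the seldon-mesh namespace from above
kubectl apply -f pipeline.yaml -n seldon-mesh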

Before querying the pipeline, we first need to define a helper function that returns the mesh IP used to construct the inference endpoint.
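A simple way to do this, assuming Seldon Core 2 is installed in the seldon-mesh namespace and exposes its mesh through a LoadBalancer service named seldon-mesh, is to read the external IP of that service with kubectl:

import subprocess


def get_mesh_ip() -> str:
    # Read the external IP of the seldon-mesh LoadBalancer service (assumed setup).
    cmd = [
        "kubectl", "get", "svc", "seldon-mesh",
        "-n", "seldon-mesh",
        "-o", "jsonpath={.status.loadBalancer.ingress[0].ip}",
    ]
    return subprocess.check_output(cmd).decode("utf-8").strip()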

We also define a helper function for sending requests, so that we avoid duplicating code:
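The sketch below uses the requests library. It assumes the pipeline is named guardrails (as in the manifest above) and that the prompt models read the user message from string tensors named role, content, and type, mirroring the tensors produced by the message model; adjust the tensor names to whatever your prompt templates expect.

import requests


def send_request(mesh_ip: str, question: str) -> dict:
    # Send a chat-style question to the (assumed) guardrails pipeline over the
    # Open Inference Protocol and return the JSON response.
    endpoint = f"http://{mesh_ip}/v2/models/guardrails.pipeline/infer"
    inference_request = {
        "inputs": [
            {"name": "role", "shape": [1], "datatype": "BYTES", "data": ["user"]},
            {"name": "content", "shape": [1], "datatype": "BYTES", "data": [question]},
            {"name": "type", "shape": [1], "datatype": "BYTES", "data": ["text"]},
        ]
    }
    response = requests.post(endpoint, json=inference_request)
    response.raise_for_status()
    return response.json()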

We can now start sending some requests. Let's first send a request asking about a car, which is not within the allowed discussion topics.
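Using the hypothetical helpers defined above, such a request could look as follows; the question text is just an example.

mesh_ip = get_mesh_ip()
response = send_request(mesh_ip, "What is the best engine oil for my car?")
print(response["outputs"])
# The "content" output should contain the warning "I cannot talk about this."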

Note that in this case, the pipeline rejects our question.

Let's now ask about a topic which is allowed.
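Again using the helpers from above, with a question about dogs this time:

response = send_request(mesh_ip, "What food is best for a small dog?")
print(response["outputs"])
# The "content" output should now contain an answer generated by the mistral model.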

Note that in this case, the pipeline returns an answer to our question.

To delete the models and the pipeline, run the following commands:
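Here we again assume the manifest file names used earlier in the tutorial.

# remove the pipeline first, then the models
kubectl delete -f pipeline.yaml -n seldon-mesh
kubectl delete -f models.yaml -n seldon-mesh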

Congrats! You have successfully deployed a guardrails pipeline!

This tutorial was inspired by the How to implement LLM guardrails guide from the OpenAI Cookbook; see that guide for more details.
