Guardrails
In this tutorial, we will show how to implement a local LLM guardrails pipeline. We will deploy the mistralai/Mistral-7B-Instruct-v0.2 model locally using the Local Runtime and deploy two additional prompt models. The first prompt model has the role of accepting or rejecting a request based on whether the content of the query aligns with the filtering criteria. The second model will be a standard chat template prompt, which will generate the answer if the query passes our filtering checks. In addition, we will deploy a custom MLServer model, which will generate a warning message in case the query does not pass our checks.
Thus, for this tutorial, you need to have the Local Runtime, the Prompt Runtime, and an instance of MLServer up and running. Please consult our installation page for more details on how to set up those servers.
We begin by defining the custom MLServer model. The implementation of this model is the following:
from mlserver.model import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class MessageModel(MLModel):
    async def load(self) -> bool:
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        return InferenceResponse(
            model_name=self.settings.name,
            model_version=self.settings.version,
            outputs=[
                StringCodec.encode_output(
                    name="role",
                    payload=["assistant"],
                ),
                StringCodec.encode_output(
                    name="content",
                    payload=["I cannot talk about this."],
                ),
                StringCodec.encode_output(
                    name="type",
                    payload=["text"],
                ),
            ]
        )
As you can see, the model ignores the input and simply returns a warning message saying "I cannot talk about this." We will call this the message model. The model-settings.json file for the message model is:
!cat models/message/model-settings.json
{
  "name": "message",
  "implementation": "model.MessageModel",
  "parameters": {
    "version": "v0.1.0"
  }
}
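If you want to sanity-check the custom model before deploying it, you can serve it locally (for example with mlserver start models/message, assuming the class above is saved as model.py in that folder) and send it a dummy V2 inference request. The snippet below is a minimal sketch that assumes MLServer's default HTTP port (8080); since the model ignores its input, any payload will do.
import requests

# Dummy V2 request; the message model ignores the input entirely.
inference_request = {
    "inputs": [
        {"name": "content", "shape": [1], "datatype": "BYTES", "data": ["anything"]}
    ]
}

response = requests.post(
    "http://localhost:8080/v2/models/message/infer", json=inference_request
)
print(response.json()["outputs"])  # expect the role/content/type string tensors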
We can now define the model-settings.json files for the other models. We begin with the mistral model:
!cat models/mistral/model-settings.json
{
  "name": "mistral",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "vllm",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "mistralai/Mistral-7B-Instruct-v0.2",
          "dtype": "float16",
          "max_model_len": 1024,
          "default_generate_kwargs": {
            "max_tokens": 256
          }
        }
      },
      "prompt_utils": {
        "model_type": "compiled"
      }
    }
  }
}
Probably the most important setting in the file above is "model_type": "compiled" under "prompt_utils". This means that the model expects an already compiled inference request from a Prompt Runtime model. The purpose of this configuration is to allow the same locally deployed model to be reused with multiple prompts.
The model-settings.json file for the generate-prompt model, which is a standard chat template prompt model, is the following:
!cat models/generate-prompt/model-settings.json
{
  "name": "generate-prompt",
  "implementation": "mlserver_prompt_utils.runtime.PromptRuntime",
  "parameters": {
    "extra": {
      "prompt_utils": {
        "model_type": "chat.completions"
      }
    }
  }
}
The last model we have to define is the guard-prompt model. The model-settings.json file for this model is:
!cat models/guard-prompt/model-settings.json
{
  "name": "guard-prompt",
  "implementation": "mlserver_prompt_utils.runtime.PromptRuntime",
  "parameters": {
    "extra": {
      "prompt_utils": {
        "model_type": "chat.completions",
        "prompt_options": {
          "uri": "template.jinja"
        },
        "extract_tensors": {
          "keys": ["answer"]
        }
      }
    }
  }
}
We emphasize here two settings: "uri" under "prompt_options" and "keys" under "extract_tensors".
The "uri" points to a Jinja prompt template the model will use. Here is the content of the Jinja template:
!cat models/guard-prompt/template.jinja
Your role is to assess whether the user question is allowed or not. The allowed topics are cats and dogs. Provide a JSON-formatted response with either "yes" or "no" as the answer whether the topic is allowed or not. The response should strictly follow this format: `{"answer": "yes" | "no"}`. Make sure you are only returning the json and not any additional text.
Question: {{ question[0]}}
Note that we are specifically instructing the model to accept only topics related to cats and dogs and to return a JSON-formatted output containing the decision on whether the topic of the question is allowed.
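If you want to see the exact prompt the guard model receives, you can render the template yourself with jinja2. This is only a local sanity check (the Prompt Runtime performs the rendering for you); note that the question variable is passed as a list because the template indexes question[0].
from jinja2 import Template

# Render the guard template locally with a sample question.
with open("models/guard-prompt/template.jinja") as f:
    template = Template(f.read())

print(template.render(question=["What is the most popular cat breed in the UK?"]))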
The "keys"
point to the key in the output for which we want to extract the value. Internally, the Prompt Runtime will take that value and use it to name an empty tensor which will be returned in the inference response. That empty tensor will be used as a routing tensor which will guide the data to the appropriate branch.
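To make this concrete, here is a rough illustration of the idea (this is not the Prompt Runtime's actual implementation): the value found under the answer key in the model's JSON reply becomes the name of an empty output tensor, and the pipeline later triggers on that name.
import json

# Hypothetical guard reply, following the format requested in the template.
raw_guard_reply = '{"answer": "yes"}'
answer = json.loads(raw_guard_reply)["answer"]  # "yes" or "no"

# The extracted value only contributes its NAME; the tensor itself stays empty
# and is used purely for routing (guard-prompt.outputs.yes / .no).
routing_tensor = {"name": answer, "shape": [0], "datatype": "BYTES", "data": []}
print(routing_tensor)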
Now that we have all the model-settings.json files defined, we can write the manifest file to deploy the models above:
!cat manifests/models.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: message
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/guardrails/models/message"
  requirements:
  - mlserver
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: mistral
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/guardrails/models/mistral"
  requirements:
  - llm-local
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: generate-prompt
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/guardrails/models/generate-prompt"
  llm:
    modelRef: mistral
  requirements:
  - prompt
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: guard-prompt
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/guardrails/models/guard-prompt"
  llm:
    modelRef: mistral
  requirements:
  - prompt
To deploy the models above, run the following commands:
!kubectl apply -f manifests/models.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/message created
model.mlops.seldon.io/mistral created
model.mlops.seldon.io/generate-prompt created
model.mlops.seldon.io/guard-prompt created
model.mlops.seldon.io/generate-prompt condition met
model.mlops.seldon.io/guard-prompt condition met
model.mlops.seldon.io/message condition met
model.mlops.seldon.io/mistral condition met
With the models deployed, we can now define the conditional pipeline. The manifest file for the pipeline is the following:
!cat manifests/pipeline.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: guardrails-pipeline
spec:
  steps:
  - name: guard-prompt
    inputs:
    - guardrails-pipeline.inputs.role
    - guardrails-pipeline.inputs.content
    tensorMap:
      guardrails-pipeline.inputs.content: question
  - name: generate-prompt
    inputs:
    - guardrails-pipeline.inputs.role
    - guardrails-pipeline.inputs.content
    - guardrails-pipeline.inputs.type
    triggers:
    - guard-prompt.outputs.yes
  - name: message
    inputs:
    - guardrails-pipeline.inputs
    triggers:
    - guard-prompt.outputs.no
  output:
    steps:
    - message
    - generate-prompt
    stepsJoin: any
Note that we route the dataflow to either the generate-prompt branch or the message branch based on the output of the guard-prompt model.
To deploy the pipeline above, run the following commands:
!kubectl apply -f manifests/pipeline.yaml -n seldon
!kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon
pipeline.mlops.seldon.io/guardrails-pipeline created
pipeline.mlops.seldon.io/guardrails-pipeline condition met
Before querying the pipeline, we first define a helper function that returns the mesh IP used to construct the inference endpoint.
import subprocess


def get_mesh_ip():
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode("utf-8")
We also define a helper function for sending requests, to avoid duplicating code:
import requests


def send_request(question: str):
    inference_request = {
        "inputs": [
            {
                "name": "role",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["user"],
            },
            {
                "name": "content",
                "shape": [1],
                "datatype": "BYTES",
                "data": [question],
            },
            {
                "name": "type",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["text"],
            },
        ],
    }

    endpoint = f"http://{get_mesh_ip()}/v2/pipelines/guardrails-pipeline/infer"
    return requests.post(endpoint, json=inference_request)
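In the cells below we read the answer with response.json()['outputs'][1]['data'][0], which assumes the content tensor is the second output. If you would rather not rely on the ordering, a small helper like the following (a convenience for illustration, not part of the tutorial assets) looks the tensor up by name:
def get_content(response) -> str:
    # Find the "content" output tensor by name instead of by position.
    outputs = response.json()["outputs"]
    content = next(o for o in outputs if o["name"] == "content")
    return content["data"][0]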
We can now start sending some requests. Let's first send a request asking about a car, which is not within the allowed discussion topics.
response = send_request("What is the most popular car brand in Europe?")
print(response.json()['outputs'][1]['data'][0])
I cannot talk about this.
Note that in this case, the pipeline rejects our question.
Let's now ask about a topic which is allowed.
response = send_request("What is the most popular cat breed in the UK?")
print(response.json()['outputs'][1]['data'][0])
The most popular cat breed in the UK, according to the Governing Council of the Cat Fancy (GCCF), which is the main regulatory body for cat shows in the UK, is the British Shorthair. This breed is native to the UK, known for its dense, plush coat and round, well-rounded head. Other popular breeds in the UK include the Maine Coon, Bengal, and Ragdoll, according to the same source. However, it's important to note that the Cat Fancy only registers pedigree cats. According to the Pet Food Manufacturers' Association, the most commonly owned domestic cat in the UK is the Domestic Shorthair, which is not a specific breed but a term used for mixed breed cats.
Note that in this case, the pipeline returns an answer to our question.
To delete the models and the pipeline, run the following commands:
!kubectl delete -f manifests/pipeline.yaml -n seldon
!kubectl delete -f manifests/models.yaml -n seldon
pipeline.mlops.seldon.io "guardrails-pipeline" deleted
model.mlops.seldon.io "message" deleted
model.mlops.seldon.io "mistral" deleted
model.mlops.seldon.io "generate-prompt" deleted
model.mlops.seldon.io "guard-prompt" deleted
Congrats! You have successfully deployed a guardrails pipeline!
This tutorial was inspired by the How to implement LLM guardrails example from the OpenAI Cookbook (see the link here for more details).