Guardrails
In this tutorial, we will show how to implement a local LLM guardrails pipeline. We will deploy the mistralai/Mistral-7B-Instruct-v0.2 model locally using the Local Runtime and deploy two additional prompt models. The first prompt model has the role of accepting or rejecting a request based on whether the content of the query aligns with the filtering criteria. The second model will be a standard chat template prompt, which will generate the answer if the query passes our filtering checks. In addition, we will deploy a custom MLServer model, which will generate a warning message in case the query does not pass our checks.
Thus, for this tutorial, you need to have the Local Runtime, the Prompt Runtime, and an instance of MLServer up and running. Please consult our installation page for more details on how to set up those servers.
We begin by defining the custom MLServer model. The implementation of this model is the following:
from mlserver.model import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class MessageModel(MLModel):
    async def load(self) -> bool:
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        return InferenceResponse(
            model_name=self.settings.name,
            model_version=self.settings.version,
            outputs=[
                StringCodec.encode_output(
                    name="role",
                    payload=["assistant"],
                ),
                StringCodec.encode_output(
                    name="content",
                    payload=["I cannot talk about this."],
                ),
                StringCodec.encode_output(
                    name="type",
                    payload=["text"],
                ),
            ]
        )
As you can see, the model ignores the input and simply returns a warning message saying "I cannot talk about this." We will call this the message model. The model-settings.json file for the message model is:
!cat models/message/model-settings.json
{
  "name": "message",
  "implementation": "model.MessageModel",
  "parameters": {
    "version": "v0.1.0"
  }
}
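If you want to sanity-check the custom model before deploying it, you can serve it locally (for example with mlserver start models/message, assuming the class above is saved as model.py in that folder) and send it a dummy V2 inference request. The snippet below is a minimal sketch that assumes MLServer's default HTTP port (8080); since the model ignores its input, any payload will do.
import requests

# Dummy V2 request; the message model ignores the input entirely.
inference_request = {
    "inputs": [
        {"name": "content", "shape": [1], "datatype": "BYTES", "data": ["anything"]}
    ]
}

response = requests.post(
    "http://localhost:8080/v2/models/message/infer", json=inference_request
)
print(response.json()["outputs"])  # expect the role/content/type string tensors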
We can now define the model-settings.json files for the other models. We begin with the mistral model:
!cat models/mistral/model-settings.json
{
  "name": "mistral",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "vllm",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "mistralai/Mistral-7B-Instruct-v0.2",
          "dtype": "float16",
          "max_model_len": 1024,
          "default_generate_kwargs": {
            "max_tokens": 256
          }
        }
      },
      "prompt_utils": {
        "model_type": "compiled"
      }
    }
  }
}
Probably the most important setting in the file above is "model_type": "compiled" under "prompt_utils". This means that the model expects an already compiled inference request from a Prompt Runtime model. The purpose of this configuration is to allow the same locally deployed model to be reused with multiple prompts.
The model-settings.json file for the generate-prompt model, which is a standard chat template prompt model, is the following:
!cat models/generate-prompt/model-settings.json
{
  "name": "generate-prompt",
  "implementation": "mlserver_prompt_utils.runtime.PromptRuntime",
  "parameters": {
    "extra": {
      "prompt_utils": {
        "model_type": "chat.completions"
      }
    }
  }
}
The last model we have to define is the guard-prompt model. The model-settings.json file for this model is:
!cat models/guard-prompt/model-settings.json
{
  "name": "guard-prompt",
  "implementation": "mlserver_prompt_utils.runtime.PromptRuntime",
  "parameters": {
    "extra": {
      "prompt_utils": {
        "model_type": "chat.completions",
        "prompt_options": {
          "uri": "template.jinja"
        },
        "extract_tensors": {
          "keys": ["answer"]
        }
      }
    }
  }
}
We emphasize here two settings: "uri" under "prompt_options" and "keys" under "extract_tensors".
The "uri" points to a Jinja prompt template the model will use. Here is the content of the Jinja template:
!cat models/guard-prompt/template.jinja
Your role is to assess whether the user question is allowed or not. The allowed topics are cats and dogs. Provide a JSON-formatted response with either "yes" or "no" as the answer whether the topic is allowed or not. The response should strictly follow this format: `{"answer": "yes" | "no"}`. Make sure you are only returning the json and not any additional text.
Question: {{ question[0]}}
Note that we are specifically instructing the model to accept only topics related to cats and dogs and to return a JSON-formatted output containing the decision on whether the topic of the question is allowed.
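If you want to see the exact prompt the guard model receives, you can render the template yourself with jinja2. This is only a local sanity check (the Prompt Runtime performs the rendering for you); note that the question variable is passed as a list because the template indexes question[0].
from jinja2 import Template

# Render the guard template locally with a sample question.
with open("models/guard-prompt/template.jinja") as f:
    template = Template(f.read())

print(template.render(question=["What is the most popular cat breed in the UK?"]))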
The "keys"
point to the key in the output for which we want to extract the value. Internally, the Prompt Runtime will take that value and use it to name an empty tensor which will be returned in the inference response. That empty tensor will be used as a routing tensor which will guide the data to the appropriate branch.
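To make this concrete, here is a rough illustration of the idea (this is not the Prompt Runtime's actual implementation): the value found under the answer key in the model's JSON reply becomes the name of an empty output tensor, and the pipeline later triggers on that name.
import json

# Hypothetical guard reply, following the format requested in the template.
raw_guard_reply = '{"answer": "yes"}'
answer = json.loads(raw_guard_reply)["answer"]  # "yes" or "no"

# The extracted value only contributes its NAME; the tensor itself stays empty
# and is used purely for routing (guard-prompt.outputs.yes / .no).
routing_tensor = {"name": answer, "shape": [0], "datatype": "BYTES", "data": []}
print(routing_tensor)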
Now that we have all the model-settings.json files defined, we can write the manifest file to deploy the models above:
!cat manifests/models.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: message
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/guardrails/models/message"
  requirements:
  - mlserver
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: mistral
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/guardrails/models/mistral"
  requirements:
  - llm-local
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: generate-prompt
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/guardrails/models/generate-prompt"
  llm:
    modelRef: mistral
  requirements:
  - prompt
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: guard-prompt
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/guardrails/models/guard-prompt"
  llm:
    modelRef: mistral
  requirements:
  - prompt
To deploy the models above, run the following commands:
!kubectl apply -f manifests/models.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/message created
model.mlops.seldon.io/mistral created
model.mlops.seldon.io/generate-prompt created
model.mlops.seldon.io/guard-prompt created
model.mlops.seldon.io/generate-prompt condition met
model.mlops.seldon.io/guard-prompt condition met
model.mlops.seldon.io/message condition met
model.mlops.seldon.io/mistral condition met
With the models deployed, we can now define the conditional pipeline. The manifest file for the pipeline is the following:
!cat manifests/pipeline.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: guardrails-pipeline
spec:
  steps:
  - name: guard-prompt
    inputs:
    - guardrails-pipeline.inputs.role
    - guardrails-pipeline.inputs.content
    tensorMap:
      guardrails-pipeline.inputs.content: question
  - name: generate-prompt
    inputs:
    - guardrails-pipeline.inputs.role
    - guardrails-pipeline.inputs.content
    - guardrails-pipeline.inputs.type
    triggers:
    - guard-prompt.outputs.yes
  - name: message
    inputs:
    - guardrails-pipeline.inputs
    triggers:
    - guard-prompt.outputs.no
  output:
    steps:
    - message
    - generate-prompt
    stepsJoin: any
Note that we route the dataflow to either the generate-prompt branch or the message branch based on the output of the guard-prompt model.
To deploy the pipeline above, run the following commands:
!kubectl apply -f manifests/pipeline.yaml -n seldon
!kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon
pipeline.mlops.seldon.io/guardrails-pipeline created
pipeline.mlops.seldon.io/guardrails-pipeline condition met
Before querying the pipeline, we first define a helper function that returns the mesh IP used to construct the inference endpoint.
import subprocess


def get_mesh_ip():
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode("utf-8")
We also define a helper function for sending requests, to avoid duplicating code:
import requests


def send_request(question: str):
    inference_request = {
        "inputs": [
            {
                "name": "role",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["user"],
            },
            {
                "name": "content",
                "shape": [1],
                "datatype": "BYTES",
                "data": [question],
            },
            {
                "name": "type",
                "shape": [1],
                "datatype": "BYTES",
                "data": ["text"],
            },
        ],
    }

    endpoint = f"http://{get_mesh_ip()}/v2/pipelines/guardrails-pipeline/infer"
    return requests.post(endpoint, json=inference_request)
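In the cells below we read the answer with response.json()['outputs'][1]['data'][0], which assumes the content tensor is the second output. If you would rather not rely on the ordering, a small helper like the following (a convenience for illustration, not part of the tutorial assets) looks the tensor up by name:
def get_content(response) -> str:
    # Find the "content" output tensor by name instead of by position.
    outputs = response.json()["outputs"]
    content = next(o for o in outputs if o["name"] == "content")
    return content["data"][0]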
We can now start sending some requests. Let's first send a request asking about a car, which is not within the allowed discussion topics.
response = send_request("What is the most popular car brand in Europe?")
print(response.json()['outputs'][1]['data'][0])
I cannot talk about this.
Note that in this case, the pipeline rejects our question.
Let's now ask about a topic which is allowed.
response = send_request("What is the most popular cat breed in the UK?")
print(response.json()['outputs'][1]['data'][0])
The most popular cat breed in the UK, according to the Governing Council of the Cat Fancy (GCCF), which is the main regulatory body for cat shows in the UK, is the British Shorthair. This breed is native to the UK, known for its dense, plush coat and round, well-rounded head. Other popular breeds in the UK include the Maine Coon, Bengal, and Ragdoll, according to the same source. However, it's important to note that the Cat Fancy only registers pedigree cats. According to the Pet Food Manufacturers' Association, the most commonly owned domestic cat in the UK is the Domestic Shorthair, which is not a specific breed but a term used for mixed breed cats.
Note that in this case, the pipeline returns an answer to our question.
To delete the models and the pipeline, run the following commands:
!kubectl delete -f manifests/pipeline.yaml -n seldon
!kubectl delete -f manifests/models.yaml -n seldon
pipeline.mlops.seldon.io "guardrails-pipeline" deleted
model.mlops.seldon.io "message" deleted
model.mlops.seldon.io "mistral" deleted
model.mlops.seldon.io "generate-prompt" deleted
model.mlops.seldon.io "guard-prompt" deleted
Congrats! You have successfully deployed a guardrails pipeline!
This tutorial was inspired by the How to implement LLM guardrails example from the OpenAI Cookbook (see the link here for more details).