Retrieval Augmented Generation
In this example, we demonstrate how to build a RAG application using the LLM module on top of Core 2. This allows us to supplement our LLM with relevant local context to enhance response quality. For this tutorial, you will need the Local, API, and Vector-DB runtimes up and running. Check our installation tutorial to see how to do so.
We need to deploy a pgvector server on our k8s cluster. The deployment manifest for the pgvector server is the following:
!cat manifests/pgvector-deployment.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: ankane/pgvector
          env:
            - name: POSTGRES_USER
              value: admin
            - name: POSTGRES_PASSWORD
              value: admin
            - name: POSTGRES_DB
              value: db
            - name: POSTGRES_HOST_AUTH_METHOD
              value: trust
          ports:
            - containerPort: 5432
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "admin", "-d", "db"]
            initialDelaySeconds: 5
            periodSeconds: 1
---
apiVersion: v1
kind: Service
metadata:
  name: pgvector-db
  labels:
    app: pgvector-db
spec:
  type: ClusterIP # Specifies that this is an internal service
  ports:
    - port: 5432
      targetPort: 5432
  selector:
    app: postgres
We can now deploy the pgvector server by running the following commands:
!kubectl apply -f manifests/pgvector-deployment.yaml -n seldon
!kubectl rollout status deployment/postgres -n seldon --timeout=600s
deployment.apps/postgres created
service/pgvector-db created
Waiting for deployment "postgres" rollout to finish: 0 of 1 updated replicas are available...
deployment "postgres" successfully rolled out
At this point, all our servers should be up and running. Before loading the models, we first need to populate the vector database. In this example, we will use the Seldon Core 2 documentation. We already scraped the documentation pages, split the content into documents, and embedded those documents using the OpenAI API with the text-embedding-3-small model. The JSON file containing the embedded documents can be downloaded using the following script:
!pip install google-cloud-storage pg8000 tqdm -q
import os

from google.cloud import storage


def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from a public bucket without credentials."""
    # Ensure the destination directory exists
    if not os.path.exists(os.path.dirname(destination_file_name)):
        os.makedirs(os.path.dirname(destination_file_name))

    # Initialize a storage client without credentials
    storage_client = storage.Client.create_anonymous_client()

    # Retrieve the bucket and blob (file)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)

    # Download the blob to a local file
    blob.download_to_filename(destination_file_name)
    print(f"Downloaded {source_blob_name} to {destination_file_name}")


# Example usage
bucket_name = "seldon-models"
source_blob_name = "llm-runtimes-settings/rag/assets/vectorized-docs.json"
destination_file_name = "assets/vectorized-docs.json"

download_blob(bucket_name, source_blob_name, destination_file_name)
Downloaded llm-runtimes-settings/rag/assets/vectorized-docs.json to assets/vectorized-docs.json
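Each entry in vectorized-docs.json pairs a chunk of documentation text with its embedding vector, which is the layout the populate script below expects. For reference, here is a minimal sketch of how such a file could be produced with the OpenAI Python client; the embed_documents helper and the example call are illustrative assumptions, not the actual script we used to scrape and chunk the docs:

import json
import os

from openai import OpenAI  # assumes the openai>=1.0 Python client is installed


def embed_documents(chunks, path="assets/vectorized-docs.json"):
    """Embed text chunks and store them in the layout the populate script
    below expects: [{"embedding": [...], "text": "..."}]."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=chunks,
    )
    data = [
        {"embedding": item.embedding, "text": chunk}
        for item, chunk in zip(response.data, chunks)
    ]
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(data, f)


# e.g. embed_documents(["Seldon Core 2 is ...", "Pipelines chain models ..."])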
To populate the database with the downloaded docs, we need to forward the port from localhost:5432 to the container port 5432. To do so, open a new terminal and run the following command:
kubectl port-forward postgres-c9784f947-4msmr 5432:5432 -n seldon
or feel free to do port-forwarding directly from a tool like k9s.
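Since the pod name above is generated and will differ on your cluster, you can also port-forward the Deployment directly, which avoids looking up the pod name:
kubectl port-forward deployment/postgres 5432:5432 -n seldon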
To populate the database, run the following script:
import json

import pg8000
from tqdm import tqdm

client = pg8000.connect(
    host='localhost',
    port=5432,
    database="db",
    user="admin",
    password="admin",
)

client.run("CREATE EXTENSION IF NOT EXISTS vector")
client.run(
    "CREATE TABLE IF NOT EXISTS embedding_table (id SERIAL PRIMARY KEY, "
    "embedding vector, text text)"
)

# Check if there are any records in the table
record_count = client.run("SELECT COUNT(*) FROM embedding_table")[0][0]

# Only insert data if the table is empty
if record_count == 0:
    # read the data from the json file of embeddings
    path = os.path.join('assets', 'vectorized-docs.json')
    with open(path, 'r') as f:
        data = json.loads(f.read())

    for doc in tqdm(data):
        client.run(
            "INSERT INTO embedding_table (embedding, text) VALUES (:embedding, :text)",
            embedding=str(doc['embedding']),
            text=doc['text'],
        )

client.commit()
client.close()
100%|██████████| 1123/1123 [00:48<00:00, 23.36it/s]
Once the database is populated, we can load the models. We begin with the OpenAI LLM. The model-settings.json file for the OpenAI model is the following:
!cat models/openai-llm/model-settings.json
{
  "name": "openai-llm",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "gpt-4o",
        "model_type": "chat.completions"
      },
      "prompt_utils": {
        "prompt_options": {
          "uri": "rag.jinja"
        }
      }
    }
  }
}
Note that we are deploying a gpt-4o model which uses a custom Jinja template, rag.jinja. The content of the template is the following:
!cat models/openai-llm/rag.jinja
{{- '\nDocument: ' }}{{ loop.index0 }}
{{- '\n' + '-' * 10 }}
{{- document["text"] }}
{{- '\n' + query[0] }}
The template above iterates over all the documents provided and adds their content at the top of the prompt. After the contents of all documents are included, we add the query at the end. In this way, we give the LLM additional context to respond to our query.
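To make the effect of the template concrete, the snippet below renders an assumed stand-alone version of it with jinja2. The surrounding for-loop and the documents/query context variables are assumptions made for illustration; in the actual deployment, the LLM runtime supplies this context itself:

from jinja2 import Template  # assumes jinja2 is available in the environment

# Hypothetical stand-alone version of rag.jinja: the for-loop wrapper and the
# `documents`/`query` variable names are assumptions; only the body mirrors
# the fragment shown above.
template = Template(
    "{%- for document in documents %}"
    "{{- '\\nDocument: ' }}{{ loop.index0 }}"
    "{{- '\\n' + '-' * 10 }}"
    "{{- document['text'] }}"
    "{%- endfor %}"
    "{{- '\\n' + query[0] }}"
)

prompt = template.render(
    documents=[{"text": "Seldon Core 2 pipelines chain models together ..."}],
    query=["How do I chain two models in a pipeline?"],
)
print(prompt)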
We can now move on to the embedding component, which embeds the query into a vector used to search over the vector database deployed earlier. For this part of the tutorial, we will use an OpenAI embedding model. The associated model-settings.json file is the following:
!cat models/openai-embeddings/model-settings.json
{
  "name": "openai-embeddings",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "text-embedding-3-small",
        "model_type": "embeddings"
      }
    }
  }
}
The last part is the PGVector client from our VectorDB runtime. This is the component that queries the vector database to find the documents related to our query. The model-settings.json file for the vector-db client is the following:
!cat models/pgvector-client/model-settings.json
{
  "name": "pgvector-client",
  "implementation": "mlserver_vector_db.VectorDBRuntime",
  "parameters": {
    "extra": {
      "provider_id": "pgvector",
      "config": {
        "host": "pgvector-db",
        "port": 5432,
        "database": "db",
        "user": "admin",
        "password": "admin",
        "table": "embedding_table",
        "embedding_column": "embedding",
        "search_parameters": {
          "columns": ["text"],
          "limit": 5
        }
      }
    }
  }
}
Note that we are searching the embedding_table, over the embedding column, and returning the top 5 matches from the text column.
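Under the hood, this corresponds to a nearest-neighbour query against pgvector. As a rough illustration only (reusing the port-forward from earlier, embedding the query with the same OpenAI model, and assuming a cosine-distance search; the runtime's actual SQL and distance metric may differ), the lookup could be reproduced manually like this:

import pg8000
from openai import OpenAI

# Embed the query with the same model used for the documents above
query = "How do I chain two models in a Core 2 pipeline?"
embedding = (
    OpenAI()
    .embeddings.create(model="text-embedding-3-small", input=[query])
    .data[0]
    .embedding
)

# Reconnect through the same port-forward used when populating the table;
# `<=>` is pgvector's cosine-distance operator (the runtime's metric may differ)
conn = pg8000.connect(
    host="localhost", port=5432, database="db", user="admin", password="admin"
)
rows = conn.run(
    "SELECT text FROM embedding_table "
    "ORDER BY embedding <=> CAST(:q AS vector) LIMIT 5",
    q=str(embedding),
)
for (text,) in rows:
    print(text[:80])
conn.close()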
The manifest file for the models is the following:
!cat manifests/models.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: openai-llm
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/rag/models/openai-llm"
  requirements:
    - openai
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: openai-embeddings
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/rag/models/openai-embeddings"
  requirements:
    - openai
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: pgvector-client
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/rag/models/pgvector-client"
  requirements:
    - vector-db
As you can see, we are deploying three models: an OpenAI LLM, an OpenAI embedding model (this has to be the same one we used to embed the documents above), and the pgvector client. To load the models, run the following commands:
!kubectl apply -f manifests/models.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/openai-llm created
model.mlops.seldon.io/openai-embeddings created
model.mlops.seldon.io/pgvector-client created
model.mlops.seldon.io/openai-embeddings condition met
model.mlops.seldon.io/openai-llm condition met
model.mlops.seldon.io/pgvector-client condition met
The final step before starting to send requests is to chain all those models together through a Core 2 pipeline. The pipeline definition is the following:
!cat manifests/pipelines/pipeline-api.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: rag-app-api
spec:
  steps:
    - name: openai-embeddings
      inputs:
        - rag-app-api.inputs.query
      tensorMap:
        rag-app-api.inputs.query: input
    - name: pgvector-client
      inputs:
        - openai-embeddings.outputs.embedding
    - name: openai-llm
      inputs:
        - rag-app-api.inputs.role
        - rag-app-api.inputs.query
        - pgvector-client.outputs.documents
  output:
    steps:
      - openai-llm
We can break the logic of the pipeline above into three main steps:
The query reaches the OpenAI embedding model, which computes the vector embedding of our query
The embedding is sent to the pgvector client, which retrieves the most relevant documents
The query and the retrieved documents are sent to the OpenAI LLM, which builds the prompt from the retrieved context and the query and then generates the completion
To deploy our pipeline on Core 2, run the following command:
!kubectl apply -f manifests/pipelines/pipeline-api.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s pipeline --all -n seldon
pipeline.mlops.seldon.io/rag-app-api created
pipeline.mlops.seldon.io/rag-app-api condition met
Before sending the requests, we will need the following helper function to construct the endpoint we want to hit:
import subprocess


def get_mesh_ip():
    cmd = f"kubectl get svc seldon-mesh -n seldon -o jsonpath='{{.status.loadBalancer.ingress[0].ip}}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')
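Note that get_mesh_ip assumes the seldon-mesh service has been assigned a LoadBalancer IP; on clusters without one (for example a local kind or minikube setup) it will return an empty string. In that case you can port-forward the mesh service instead (the HTTP port is typically 80, but check with kubectl get svc seldon-mesh -n seldon):
kubectl port-forward svc/seldon-mesh 9000:80 -n seldon
and then use http://localhost:9000 in place of http://{get_mesh_ip()} in the endpoints below.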
We are now ready to send a request to our pipeline. We will use the following query:
import requests

inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["user"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "query",
            "shape": [1],
            "datatype": "BYTES",
            "data": [
                "Write the yaml manifests to deploy a pipeline consisting of two huggingface models "
                "chaining a speach to text model and then a text to sentiment model. Finally, "
                "add an explainer on top of the sentiment model."
            ],
            "parameters": {"content_type": "str"},
        },
    ],
}
endpoint = f"http://{get_mesh_ip()}/v2/pipelines/rag-app-api/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json()["outputs"][1]["data"][0])
To deploy a pipeline consisting of two Huggingface models—one for speech-to-text and another for text-to-sentiment—with an explainer on top of the sentiment model, you'll need to write YAML manifests for each model and the pipeline. Below are example YAML configurations for the models and the pipeline:
### 1. Huggingface Whisper Model (Speech-to-Text)
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: whisper
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/whisper"
  requirements:
    - huggingface
```
### 2. Huggingface Sentiment Model (Text-to-Sentiment)
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/sentiment"
  requirements:
    - huggingface
```
### 3. Sentiment Explainer
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment-explainer
spec:
  storageUri: "gs://seldon-models/scv2/examples/huggingface/speech-sentiment/explainer"
  explainer:
    type: anchor_text
    pipelineRef: sentiment-explain
```
### 4. Speech to Sentiment Pipeline
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: speech-to-sentiment
spec:
  steps:
    - name: whisper
    - name: sentiment
      inputs:
        - whisper
      tensorMap:
        whisper.outputs.output: args
    - name: sentiment-input-transform
```
### Commands to Deploy Models and Pipeline
```bash
# Load the models
seldon model load -f whisper.yaml
seldon model load -f sentiment.yaml
seldon model load -f sentiment-explainer.yaml
# Check model status
seldon model status whisper -w ModelAvailable | jq -M .
seldon model status sentiment -w ModelAvailable | jq -M .
seldon model status sentiment-explainer -w ModelAvailable | jq -M .
# Deploy the pipeline
seldon pipeline load -f speech-to-sentiment.yaml
# Check pipeline status
seldon pipeline status speech-to-sentiment -w PipelineAvailable | jq -M .
```
### Explanation
- **Whisper Model**: This is responsible for transcribing speech to text.
- **Sentiment Model**: It processes the text from the Whisper model and outputs sentiment.
- **Sentiment Explainer**: This provides explanations for the sentiment output via Alibi-Explain.
- **Pipeline**: Chains the Whisper and Sentiment models together and integrates the explainer.
Make sure your environment has all necessary permissions to access the specified storage URIs and that required dependencies are installed. Adjust paths and URIs as needed for your specific setup.
We can see that the LLM does a pretty decent job of defining all the manifest files and answering our query.
You can also load another model using the Local runtime and create a RAG pipeline using that model. Here is an example of the model-settings.json file for the phi-3.5 model:
!cat models/local-llm/model-settings.json
{
  "name": "local-llm",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "vllm",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "microsoft/Phi-3.5-mini-instruct",
          "tensor_parallel_size": 4,
          "dtype": "float16",
          "gpu_memory_utilization": 0.8,
          "max_model_len": 4096,
          "default_generate_kwargs": {
            "max_tokens": 1024
          }
        }
      },
      "prompt_utils": {
        "prompt_options": {
          "uri": "rag.jinja"
        }
      }
    }
  }
}
The associated CRD is the following:
!cat manifests/local-llm.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-llm
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/rag/models/local-llm"
  requirements:
    - llm-local
To load that model on the Local runtime server, run the following command:
!kubectl apply -f manifests/local-llm.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/local-llm created
model.mlops.seldon.io/local-llm condition met
model.mlops.seldon.io/openai-embeddings condition met
model.mlops.seldon.io/openai-llm condition met
model.mlops.seldon.io/pgvector-client condition met
We can now define a second pipeline which uses our local model:
!cat manifests/pipelines/pipeline-local.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: rag-app-local
spec:
  steps:
    - name: openai-embeddings
      inputs:
        - rag-app-local.inputs.query
      tensorMap:
        rag-app-local.inputs.query: input
    - name: pgvector-client
      inputs:
        - openai-embeddings.outputs.embedding
    - name: local-llm
      inputs:
        - rag-app-local.inputs.role
        - rag-app-local.inputs.query
        - pgvector-client.outputs.documents
  output:
    steps:
      - local-llm
To deploy the pipeline, run the following command:
!kubectl apply -f manifests/pipelines/pipeline-local.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s pipeline --all -n seldon
pipeline.mlops.seldon.io/rag-app-local created
pipeline.mlops.seldon.io/rag-app-api condition met
pipeline.mlops.seldon.io/rag-app-local condition met
Once the pipeline is ready, we can send the same request as above:
endpoint = f"http://{get_mesh_ip()}/v2/pipelines/rag-app-local/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json()["outputs"][1]["data"][0])
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: speech-to-sentiment
spec:
  steps:
    - name: whisper
      model:
        name: whisper
        uri: http://mlserver:8080
      inputTransforms:
        - mlmodel: ./sentiment-input-transform/model.py
    - name: sentiment
      model:
        name: sentiment
        uri: http://mlserver:8080
      inputTransforms:
        - mlmodel: ./sentiment-input-transform/model.py
      outputTransforms:
        - mlmodel: ./sentiment-output-transform/model.py
    - name: sentiment-explain
      model:
        name: sentiment-explainer
        uri: http://mlserver:8080
      inputTransforms:
        - mlmodel: ./sentiment-output-transform/model.py
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: whisper
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/whisper"
  requirements:
    - huggingface
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/sentiment"
  requirements:
    - huggingface
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment-explainer
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/explainer"
  requirements:
    - requires: "sentiment"
  explainer:
    type: anchor_text
```
To deploy and run this pipeline in a Kubernetes cluster, you need to follow these steps:
1. Deploy the MLServer serving the transformer models using the MLModel manifests provided above.
2. Deploy the models for Hugging Face whisper and sentiment using the respective YAML manifests.
3. Deploy the pipeline with the sentiment explain model linked to the sentiment model using the pipeline YAML manifest.
You can use tools like `kubectl` to apply these manifests to your cluster.
Please note that you'll need to ensure that the input and output data types are compatible with the transformer models and the Alibi-Explain explainer. The custom MLServer models for input and output transformations should handle this conversion.
Finally, keep an eye on the resource usage and cluster load, ensuring that your infrastructure can handle the concurrent requests made by the pipeline. <|end|>
We can see that the phi-3.5 model does a decent job as well.
To delete the pipelines, run the following commands:
!kubectl delete -f manifests/pipelines/pipeline-api.yaml -n seldon
!kubectl delete -f manifests/pipelines/pipeline-local.yaml -n seldon
pipeline.mlops.seldon.io "rag-app-api" deleted
pipeline.mlops.seldon.io "rag-app-local" deleted
To delete the models, run the following commands:
!kubectl delete -f manifests/models.yaml -n seldon
!kubectl delete -f manifests/local-llm.yaml -n seldon
model.mlops.seldon.io "openai-llm" deleted
model.mlops.seldon.io "openai-embeddings" deleted
model.mlops.seldon.io "pgvector-client" deleted
model.mlops.seldon.io "local-llm" deleted
To delete the pgvector deployment, run the following command:
!kubectl delete -f manifests/pgvector-deployment.yaml -n seldon
deployment.apps "postgres" deleted
service "pgvector-db" deleted