Retrieval Augmented Generation
In this example, we demonstrate how to build a RAG application using the LLM module on top of Core 2. This will allow us to supplement our LLM with relevant local context to enhance response quality. For this tutorial, you will need to have the Local, API, and Vector-DB runtimes up and running. Check our installation tutorial to see how to do so.
We need to deploy a pgvector server on our k8s cluster. The deployment manifest for the pgvector server is the following:
!cat manifests/pgvector-deployment.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: ankane/pgvector
          env:
            - name: POSTGRES_USER
              value: admin
            - name: POSTGRES_PASSWORD
              value: admin
            - name: POSTGRES_DB
              value: db
            - name: POSTGRES_HOST_AUTH_METHOD
              value: trust
          ports:
            - containerPort: 5432
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "admin", "-d", "db"]
            initialDelaySeconds: 5
            periodSeconds: 1
---
apiVersion: v1
kind: Service
metadata:
  name: pgvector-db
  labels:
    app: pgvector-db
spec:
  type: ClusterIP # Specifies that this is an internal service
  ports:
    - port: 5432
      targetPort: 5432
  selector:
    app: postgres

We can now deploy the pgvector server by running the following command:
!kubectl apply -f manifests/pgvector-deployment.yaml -n seldon
!kubectl rollout status deployment/postgres -n seldon --timeout=600s
deployment.apps/postgres created
service/pgvector-db created
Waiting for deployment "postgres" rollout to finish: 0 of 1 updated replicas are available...
deployment "postgres" successfully rolled outAt this point, all our servers should be up and running. Before loading the models, we first need to populate the vector database. In this example, we will use the Seldon Core 2 documentation. We already scraped the documentation pages, split the content into documents, and embedded those documents using the OpenAI API with the text-embedding-3-small model. The JSON file containing the embedded documents can be downloaded using the following script:
!pip install google-cloud-storage pg8000 tqdm -q

import os
from google.cloud import storage
def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from a public bucket without credentials."""
    # Ensure the destination directory exists
    if not os.path.exists(os.path.dirname(destination_file_name)):
        os.makedirs(os.path.dirname(destination_file_name))

    # Initialize a storage client without credentials
    storage_client = storage.Client.create_anonymous_client()

    # Retrieve the bucket and blob (file)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)

    # Download the blob to a local file
    blob.download_to_filename(destination_file_name)
    print(f"Downloaded {source_blob_name} to {destination_file_name}")


# Example usage
bucket_name = "seldon-models"
source_blob_name = "llm-runtimes-settings/rag/assets/vectorized-docs.json"
destination_file_name = "assets/vectorized-docs.json"
download_blob(bucket_name, source_blob_name, destination_file_name)
Downloaded llm-runtimes-settings/rag/assets/vectorized-docs.json to assets/vectorized-docs.json

To populate the database with the downloaded docs, we need to forward port 5432 on localhost to port 5432 on the postgres container.
To do so, open a new terminal and run the following command:
kubectl port-forward postgres-c9784f947-4msmr 5432:5432 -n seldon

Note that the pod name will differ in your cluster (you can also target the service with kubectl port-forward svc/pgvector-db 5432:5432 -n seldon), or feel free to do the port-forwarding directly from a tool like k9s.
To populate the database, run the following script:
import os
import json

import pg8000
from tqdm import tqdm

client = pg8000.connect(
    host='localhost',
    port=5432,
    database="db",
    user="admin",
    password="admin",
)

# Enable the pgvector extension and create the table if needed
client.run("CREATE EXTENSION IF NOT EXISTS vector")
client.run(
    "CREATE TABLE IF NOT EXISTS embedding_table (id SERIAL PRIMARY KEY, "
    "embedding vector, text text)"
)

# Check if there are any records in the table
record_count = client.run("SELECT COUNT(*) FROM embedding_table")[0][0]

# Only insert data if the table is empty
if record_count == 0:
    # read the data from the json file of embeddings
    path = os.path.join('assets', 'vectorized-docs.json')
    with open(path, 'r') as f:
        data = json.loads(f.read())

    for doc in tqdm(data):
        client.run(
            "INSERT INTO embedding_table (embedding, text) VALUES (:embedding, :text)",
            embedding=str(doc['embedding']),
            text=doc['text'],
        )

    client.commit()

client.close()
100%|██████████| 1123/1123 [00:48<00:00, 23.36it/s]

Once the database is populated, we can load the models.
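Optionally, you can sanity-check that the documents were stored correctly before moving on. The snippet below is a small sketch that mirrors the connection settings above and assumes the port-forward from the previous step is still active:

```python
import pg8000

# Quick sanity check: count the stored documents and peek at one of them.
conn = pg8000.connect(host="localhost", port=5432, database="db", user="admin", password="admin")
count = conn.run("SELECT COUNT(*) FROM embedding_table")[0][0]
sample = conn.run("SELECT text FROM embedding_table LIMIT 1")[0][0]
print(f"{count} documents stored; first one starts with: {sample[:80]!r}")
conn.close()
```

We begin with the OpenAI LLM. The model-settings.json file for the OpenAI model is the following: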
!cat models/openai-llm/model-settings.json
{
  "name": "openai-llm",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "gpt-4o",
        "model_type": "chat.completions"
      },
      "prompt_utils": {
        "prompt_options": {
          "uri": "rag.jinja"
        }
      }
    }
  }
}

Note that we are deploying a gpt-4o model which uses a custom Jinja template, rag.jinja. The content of the template is the following:
!cat models/openai-llm/rag.jinja
{%- if documents %}
{%- for document in documents %}
{{- '\nDocument: ' }}{{ loop.index0 }}
{{- '\n' + '-' * 10 }}
{{- document["text"] }}
{%- endfor %}
{%- endif %}
{{- '\n' + query[0] }}

The template above iterates over all the provided documents and adds their content at the top of the prompt. After the contents of all documents are included, the query is appended at the end. In this way, we give the LLM additional context with which to answer our query.
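To get a feel for what the LLM actually receives, you can render the template locally. This is just an illustrative sketch, assuming the jinja2 package is installed; the sample document and query are made up:

```python
from jinja2 import Template

# Render rag.jinja with a dummy document and query to preview the final prompt.
with open("models/openai-llm/rag.jinja") as f:
    template = Template(f.read())

preview = template.render(
    documents=[{"text": "Seldon Core 2 pipelines chain models together."}],
    query=["How do I define a pipeline?"],
)
print(preview)
```

We can now move to the embedding component, which is needed to embed the query into a vector and use that vector to search over the vector database deployed earlier. For this part of the tutorial, we will use an OpenAI embedding model. The associated model-settings.json file is the following: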
!cat models/openai-embeddings/model-settings.json
{
  "name": "openai-embeddings",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "text-embedding-3-small",
        "model_type": "embeddings"
      }
    }
  }
}

The last part is the PGVector client from our Vector-DB runtime. This is the component that actually queries the vector database to find the documents related to our query. The model-settings.json file for the pgvector client is the following:
!cat models/pgvector-client/model-settings.json
{
  "name": "pgvector-client",
  "implementation": "mlserver_vector_db.VectorDBRuntime",
  "parameters": {
    "extra": {
      "provider_id": "pgvector",
      "config": {
        "host": "pgvector-db",
        "port": 5432,
        "database": "db",
        "user": "admin",
        "password": "admin",
        "table": "embedding_table",
        "embedding_column": "embedding",
        "search_parameters": {
          "columns": ["text"],
          "limit": 5
        }
      }
    }
  }
}

Note that we are searching the embedding_table over the embedding column and returning the top 5 matches from the text column.
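Under the hood this corresponds to a nearest-neighbour query against pgvector. The snippet below is a rough illustration of that kind of query (not necessarily the runtime's exact SQL), reusing a stored embedding as the query vector and assuming the port-forward from earlier is still active:

```python
import pg8000

# Illustrative top-5 similarity search over the embedding column using
# pgvector's L2 distance operator (<->); the actual query issued by the
# Vector-DB runtime may differ.
conn = pg8000.connect(host="localhost", port=5432, database="db", user="admin", password="admin")
query_vector = conn.run("SELECT embedding FROM embedding_table LIMIT 1")[0][0]
rows = conn.run(
    "SELECT text FROM embedding_table "
    "ORDER BY embedding <-> CAST(:query AS vector) LIMIT 5",
    query=str(query_vector),
)
for (text,) in rows:
    print(text[:80])
conn.close()
```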
The manifest file for the models is the following:
!cat manifests/models.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: openai-llm
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/rag/models/openai-llm"
  requirements:
    - openai
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: openai-embeddings
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/rag/models/openai-embeddings"
  requirements:
    - openai
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: pgvector-client
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/rag/models/pgvector-client"
  requirements:
    - vector-db

As you can see, we are deploying three models: an OpenAI LLM, an OpenAI embedding model (this has to be the same one we used to embed the documents above), and the pgvector client. To load the models, run the following commands:
!kubectl apply -f manifests/models.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/openai-llm created
model.mlops.seldon.io/openai-embeddings created
model.mlops.seldon.io/pgvector-client created
model.mlops.seldon.io/openai-embeddings condition met
model.mlops.seldon.io/openai-llm condition met
model.mlops.seldon.io/pgvector-client condition met

The final step before starting to send requests is to chain all those models together through a Core 2 pipeline. The pipeline definition is the following:
!cat manifests/pipelines/pipeline-api.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: rag-app-api
spec:
  steps:
    - name: openai-embeddings
      inputs:
        - rag-app-api.inputs.query
      tensorMap:
        rag-app-api.inputs.query: input
    - name: pgvector-client
      inputs:
        - openai-embeddings.outputs.embedding
    - name: openai-llm
      inputs:
        - rag-app-api.inputs.role
        - rag-app-api.inputs.query
        - pgvector-client.outputs.documents
  output:
    steps:
      - openai-llm

We can break the logic of the pipeline above into three main steps:
1. The query reaches the OpenAI embedding model, which computes the vector embedding of the query.
2. The embedding is sent to the pgvector client, which retrieves the most relevant documents.
3. The query and the retrieved documents are sent to the LLM step, which builds the prompt from the retrieved context and the query and generates the completion.
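Outside of Core 2, the same flow can be sketched in a few lines of plain Python. This is only an illustration of the data flow, assuming OPENAI_API_KEY is set and the database port-forward is still active; it is not what the runtimes execute internally:

```python
import pg8000
from openai import OpenAI

openai_client = OpenAI()
question = "How do I define a Core 2 pipeline?"  # illustrative query

# 1. Embed the query.
embedding = openai_client.embeddings.create(
    model="text-embedding-3-small", input=question
).data[0].embedding

# 2. Retrieve the most relevant documents from pgvector.
conn = pg8000.connect(host="localhost", port=5432, database="db", user="admin", password="admin")
rows = conn.run(
    "SELECT text FROM embedding_table ORDER BY embedding <-> CAST(:q AS vector) LIMIT 5",
    q=str(embedding),
)
conn.close()

# 3. Build the prompt from the retrieved context plus the query and generate the completion.
context = "\n".join(text for (text,) in rows)
completion = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"{context}\n{question}"}],
)
print(completion.choices[0].message.content)
```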
To deploy our pipeline on Core 2, run the following command:
!kubectl apply -f manifests/pipelines/pipeline-api.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s pipeline --all -n seldon
pipeline.mlops.seldon.io/rag-app-api created
pipeline.mlops.seldon.io/rag-app-api condition met

Before sending the requests, we will need the following helper function to construct the endpoint we want to hit:
import subprocess
def get_mesh_ip():
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')

We are now ready to send a request to our pipeline. We will use the following query:
import requests
inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["user"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "query",
            "shape": [1],
            "datatype": "BYTES",
            "data": [
                "Write the yaml manifests to deploy a pipeline consisting of two huggingface models "
                "chaining a speech to text model and then a text to sentiment model. Finally, "
                "add an explainer on top of the sentiment model."
            ],
            "parameters": {"content_type": "str"},
        },
    ],
}
endpoint = f"http://{get_mesh_ip()}/v2/pipelines/rag-app-api/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json()["outputs"][1]["data"][0])

To deploy a pipeline consisting of two Huggingface models—one for speech-to-text and another for text-to-sentiment—with an explainer on top of the sentiment model, you'll need to write YAML manifests for each model and the pipeline. Below are example YAML configurations for the models and the pipeline:
### 1. Huggingface Whisper Model (Speech-to-Text)
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: whisper
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/whisper"
  requirements:
    - huggingface
```
### 2. Huggingface Sentiment Model (Text-to-Sentiment)
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/sentiment"
  requirements:
    - huggingface
```
### 3. Sentiment Explainer
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment-explainer
spec:
  storageUri: "gs://seldon-models/scv2/examples/huggingface/speech-sentiment/explainer"
  explainer:
    type: anchor_text
    pipelineRef: sentiment-explain
```
### 4. Speech to Sentiment Pipeline
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: speech-to-sentiment
spec:
  steps:
    - name: whisper
    - name: sentiment
      inputs:
        - whisper
      tensorMap:
        whisper.outputs.output: args
    - name: sentiment-input-transform
```
### Commands to Deploy Models and Pipeline
```bash
# Load the models
seldon model load -f whisper.yaml
seldon model load -f sentiment.yaml
seldon model load -f sentiment-explainer.yaml
# Check model status
seldon model status whisper -w ModelAvailable | jq -M .
seldon model status sentiment -w ModelAvailable | jq -M .
seldon model status sentiment-explainer -w ModelAvailable | jq -M .
# Deploy the pipeline
seldon pipeline load -f speech-to-sentiment.yaml
# Check pipeline status
seldon pipeline status speech-to-sentiment -w PipelineAvailable | jq -M .
```
### Explanation
- **Whisper Model**: This is responsible for transcribing speech to text.
- **Sentiment Model**: It processes the text from the Whisper model and outputs sentiment.
- **Sentiment Explainer**: This provides explanations for the sentiment output via Alibi-Explain.
- **Pipeline**: Chains the Whisper and Sentiment models together and integrates the explainer.
Make sure your environment has all necessary permissions to access the specified storage URIs and that required dependencies are installed. Adjust paths and URIs as needed for your specific setup.

We can see that the LLM does a pretty decent job of defining all the manifest files and answering our query.
You can also load another model using the Local runtime and create a RAG pipeline using that model. Here is an example of the model-settings.json file for the phi-3.5 model:
!cat models/local-llm/model-settings.json
{
  "name": "local-llm",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "vllm",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "microsoft/Phi-3.5-mini-instruct",
          "tensor_parallel_size": 4,
          "dtype": "float16",
          "gpu_memory_utilization": 0.8,
          "max_model_len": 4096,
          "default_generate_kwargs": {
            "max_tokens": 1024
          }
        }
      },
      "prompt_utils": {
        "prompt_options": {
          "uri": "rag.jinja"
        }
      }
    }
  }
}

The associated CRD is the following:
!cat manifests/local-llm.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-llm
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/rag/models/local-llm"
  requirements:
    - llm-local

To load that model on the Local runtime server, run the following command:
!kubectl apply -f manifests/local-llm.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/local-llm created
model.mlops.seldon.io/local-llm condition met
model.mlops.seldon.io/openai-embeddings condition met
model.mlops.seldon.io/openai-llm condition met
model.mlops.seldon.io/pgvector-client condition met

We can now define a second pipeline which uses our local model:
!cat manifests/pipelines/pipeline-local.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: rag-app-local
spec:
  steps:
    - name: openai-embeddings
      inputs:
        - rag-app-local.inputs.query
      tensorMap:
        rag-app-local.inputs.query: input
    - name: pgvector-client
      inputs:
        - openai-embeddings.outputs.embedding
    - name: local-llm
      inputs:
        - rag-app-local.inputs.role
        - rag-app-local.inputs.query
        - pgvector-client.outputs.documents
  output:
    steps:
      - local-llm

To deploy the pipeline, run the following command:
!kubectl apply -f manifests/pipelines/pipeline-local.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s pipeline --all -n seldon
pipeline.mlops.seldon.io/rag-app-local created
pipeline.mlops.seldon.io/rag-app-api condition met
pipeline.mlops.seldon.io/rag-app-local condition met

Once the pipeline is ready, we can send the same request as above:
endpoint = f"http://{get_mesh_ip()}/v2/pipelines/rag-app-local/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json()["outputs"][1]["data"][0])

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: speech-to-sentiment
spec:
  steps:
    - name: whisper
      model:
        name: whisper
        uri: http://mlserver:8080
      inputTransforms:
        - mlmodel: ./sentiment-input-transform/model.py
    - name: sentiment
      model:
        name: sentiment
        uri: http://mlserver:8080
      inputTransforms:
        - mlmodel: ./sentiment-input-transform/model.py
      outputTransforms:
        - mlmodel: ./sentiment-output-transform/model.py
    - name: sentiment-explain
      model:
        name: sentiment-explainer
        uri: http://mlserver:8080
      inputTransforms:
        - mlmodel: ./sentiment-output-transform/model.py
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: whisper
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/whisper"
  requirements:
    - huggingface
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/sentiment"
  requirements:
    - huggingface
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment-explainer
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/explainer"
  requirements:
    - requires: "sentiment"
  explainer:
    type: anchor_text
```
To deploy and run this pipeline in a Kubernetes cluster, you need to follow these steps:
1. Deploy the MLServer serving the transformer models using the MLModel manifests provided above.
2. Deploy the models for Hugging Face whisper and sentiment using the respective YAML manifests.
3. Deploy the pipeline with the sentiment explain model linked to the sentiment model using the pipeline YAML manifest.
You can use tools like `kubectl` to apply these manifests to your cluster.
Please note that you'll need to ensure that the input and output data types are compatible with the transformer models and the Alibi-Explain explainer. The custom MLServer models for input and output transformations should handle this conversion.
Finally, keep an eye on the resource usage and cluster load, ensuring that your infrastructure can handle the concurrent requests made by the pipeline. <|end|>

We can see that the phi-3.5 model does a decent job as well.
To delete the pipelines, run the following commands:
!kubectl delete -f manifests/pipelines/pipeline-api.yaml -n seldon
!kubectl delete -f manifests/pipelines/pipeline-local.yaml -n seldon
pipeline.mlops.seldon.io "rag-app-api" deleted
pipeline.mlops.seldon.io "rag-app-local" deletedTo delete the models, run the following commands:
!kubectl delete -f manifests/models.yaml -n seldon
!kubectl delete -f manifests/local-llm.yaml -n seldon
model.mlops.seldon.io "openai-llm" deleted
model.mlops.seldon.io "openai-embeddings" deleted
model.mlops.seldon.io "pgvector-client" deleted
model.mlops.seldon.io "local-llm" deletedTo delete the pgvector deployment, run the following command:
!kubectl delete -f manifests/pgvector-deployment.yaml -n seldon
deployment.apps "postgres" deleted
service "pgvector-db" deletedLast updated