Retrieval Augmented Generation
In this example, we demonstrate how to build a RAG application using the LLM module on top of Core 2. This will allow us to supplement our LLM with relevant local context to enhance response quality. For this tutorial, you will need to have the Local, API, and Vector-DB runtimes up and running. Check our installation tutorial to see how to do so.
We need to deploy a pgvector server on our k8s cluster. The deployment manifest for the pgvector server is the following:
!cat manifests/pgvector-deployment.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
        - name: postgres
          image: ankane/pgvector
          env:
            - name: POSTGRES_USER
              value: admin
            - name: POSTGRES_PASSWORD
              value: admin
            - name: POSTGRES_DB
              value: db
            - name: POSTGRES_HOST_AUTH_METHOD
              value: trust
          ports:
            - containerPort: 5432
          readinessProbe:
            exec:
              command: ["pg_isready", "-U", "admin", "-d", "db"]
            initialDelaySeconds: 5
            periodSeconds: 1
---
apiVersion: v1
kind: Service
metadata:
  name: pgvector-db
  labels:
    app: pgvector-db
spec:
  type: ClusterIP # Specifies that this is an internal service
  ports:
    - port: 5432
      targetPort: 5432
  selector:
    app: postgres

We can now deploy the pgvector server by running the following command:
!kubectl apply -f manifests/pgvector-deployment.yaml -n seldon
!kubectl rollout status deployment/postgres -n seldon --timeout=600s
deployment.apps/postgres created
service/pgvector-db created
Waiting for deployment "postgres" rollout to finish: 0 of 1 updated replicas are available...
deployment "postgres" successfully rolled outAt this point, all our servers should be up and running. Before loading the models, we first need to populate the vector database. In this example, we will use the Seldon Core 2 documentation. We already scraped the documentation pages, split the content into documents, and embedded those documents using the OpenAI API with the text-embedding-3-small model. The JSON file containing the embedded documents can be downloaded using the following script:
!pip install google-cloud-storage pg8000 tqdm -q

import os
from google.cloud import storage
def download_blob(bucket_name, source_blob_name, destination_file_name):
    """Downloads a blob from a public bucket without credentials."""
    # Ensure the destination directory exists
    if not os.path.exists(os.path.dirname(destination_file_name)):
        os.makedirs(os.path.dirname(destination_file_name))

    # Initialize a storage client without credentials
    storage_client = storage.Client.create_anonymous_client()

    # Retrieve the bucket and blob (file)
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(source_blob_name)

    # Download the blob to a local file
    blob.download_to_filename(destination_file_name)
    print(f"Downloaded {source_blob_name} to {destination_file_name}")


# Example usage
bucket_name = "seldon-models"
source_blob_name = "llm-runtimes-settings/rag/assets/vectorized-docs.json"
destination_file_name = "assets/vectorized-docs.json"
download_blob(bucket_name, source_blob_name, destination_file_name)
Downloaded llm-runtimes-settings/rag/assets/vectorized-docs.json to assets/vectorized-docs.json

To populate the database with the downloaded docs, we need to forward port 5432 on localhost to port 5432 on the postgres container.
To do so, open a new terminal and run the following command:
kubectl port-forward postgres-c9784f947-4msmr 5432:5432 -n seldon

Note that the pod name will differ in your cluster (you can also target the service with kubectl port-forward svc/pgvector-db 5432:5432 -n seldon), or feel free to do the port-forwarding directly from a tool like k9s.
To populate the database, run the following script:
import os
import json

import pg8000
from tqdm import tqdm

client = pg8000.connect(
    host='localhost',
    port=5432,
    database="db",
    user="admin",
    password="admin",
)

# Enable the pgvector extension and create the table if needed
client.run("CREATE EXTENSION IF NOT EXISTS vector")
client.run(
    "CREATE TABLE IF NOT EXISTS embedding_table (id SERIAL PRIMARY KEY, "
    "embedding vector, text text)"
)

# Check if there are any records in the table
record_count = client.run("SELECT COUNT(*) FROM embedding_table")[0][0]

# Only insert data if the table is empty
if record_count == 0:
    # read the data from the json file of embeddings
    path = os.path.join('assets', 'vectorized-docs.json')
    with open(path, 'r') as f:
        data = json.loads(f.read())

    for doc in tqdm(data):
        client.run(
            "INSERT INTO embedding_table (embedding, text) VALUES (:embedding, :text)",
            embedding=str(doc['embedding']),
            text=doc['text'],
        )

    client.commit()

client.close()
100%|██████████| 1123/1123 [00:48<00:00, 23.36it/s]

Once the database is populated, we can load the models.
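Optionally, you can sanity-check that the documents were stored correctly before moving on. The snippet below is a small sketch that mirrors the connection settings above and assumes the port-forward from the previous step is still active:

```python
import pg8000

# Quick sanity check: count the stored documents and peek at one of them.
conn = pg8000.connect(host="localhost", port=5432, database="db", user="admin", password="admin")
count = conn.run("SELECT COUNT(*) FROM embedding_table")[0][0]
sample = conn.run("SELECT text FROM embedding_table LIMIT 1")[0][0]
print(f"{count} documents stored; first one starts with: {sample[:80]!r}")
conn.close()
```

We begin with the OpenAI LLM. The model-settings.json file for the OpenAI model is the following: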
!cat models/openai-llm/model-settings.json
{
  "name": "openai-llm",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "gpt-4o",
        "model_type": "chat.completions"
      },
      "prompt_utils": {
        "prompt_options": {
          "uri": "rag.jinja"
        }
      }
    }
  }
}

Note that we are deploying a gpt-4o model which uses a custom Jinja template, rag.jinja. The content of the template is the following:
!cat models/openai-llm/rag.jinja
{%- if documents %}
{%- for document in documents %}
{{- '\nDocument: ' }}{{ loop.index0 }}
{{- '\n' + '-' * 10 }}
{{- document["text"] }}
{%- endfor %}
{%- endif %}
{{- '\n' + query[0] }}

The template above iterates over all the provided documents and adds their content at the top of the prompt. After the contents of all documents are included, the query is appended at the end. In this way, we give the LLM additional context with which to answer our query.
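To get a feel for what the LLM actually receives, you can render the template locally. This is just an illustrative sketch, assuming the jinja2 package is installed; the sample document and query are made up:

```python
from jinja2 import Template

# Render rag.jinja with a dummy document and query to preview the final prompt.
with open("models/openai-llm/rag.jinja") as f:
    template = Template(f.read())

preview = template.render(
    documents=[{"text": "Seldon Core 2 pipelines chain models together."}],
    query=["How do I define a pipeline?"],
)
print(preview)
```

We can now move to the embedding component, which is needed to embed the query into a vector and use that vector to search over the vector database deployed earlier. For this part of the tutorial, we will use an OpenAI embedding model. The associated model-settings.json file is the following: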
!cat models/openai-embeddings/model-settings.json
{
  "name": "openai-embeddings",
  "implementation": "mlserver_llm_api.LLMRuntime",
  "parameters": {
    "extra": {
      "provider_id": "openai",
      "config": {
        "model_id": "text-embedding-3-small",
        "model_type": "embeddings"
      }
    }
  }
}

The last part is the PGVector client from our Vector-DB runtime. This is the component that actually queries the vector database to find the documents related to our query. The model-settings.json file for the pgvector client is the following:
!cat models/pgvector-client/model-settings.json
{
  "name": "pgvector-client",
  "implementation": "mlserver_vector_db.VectorDBRuntime",
  "parameters": {
    "extra": {
      "provider_id": "pgvector",
      "config": {
        "host": "pgvector-db",
        "port": 5432,
        "database": "db",
        "user": "admin",
        "password": "admin",
        "table": "embedding_table",
        "embedding_column": "embedding",
        "search_parameters": {
          "columns": ["text"],
          "limit": 5
        }
      }
    }
  }
}

Note that we are searching the embedding_table over the embedding column and returning the top 5 matches from the text column.
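Under the hood this corresponds to a nearest-neighbour query against pgvector. The snippet below is a rough illustration of that kind of query (not necessarily the runtime's exact SQL), reusing a stored embedding as the query vector and assuming the port-forward from earlier is still active:

```python
import pg8000

# Illustrative top-5 similarity search over the embedding column using
# pgvector's L2 distance operator (<->); the actual query issued by the
# Vector-DB runtime may differ.
conn = pg8000.connect(host="localhost", port=5432, database="db", user="admin", password="admin")
query_vector = conn.run("SELECT embedding FROM embedding_table LIMIT 1")[0][0]
rows = conn.run(
    "SELECT text FROM embedding_table "
    "ORDER BY embedding <-> CAST(:query AS vector) LIMIT 5",
    query=str(query_vector),
)
for (text,) in rows:
    print(text[:80])
conn.close()
```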
The manifest file for the models is the following:
!cat manifests/models.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: openai-llm
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/rag/models/openai-llm"
  requirements:
    - openai
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: openai-embeddings
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/rag/models/openai-embeddings"
  requirements:
    - openai
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: pgvector-client
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/rag/models/pgvector-client"
  requirements:
    - vector-db

As you can see, we are deploying three models: an OpenAI LLM, an OpenAI embedding model (this has to be the same one we used to embed the documents above), and the pgvector client. To load the models, run the following commands:
!kubectl apply -f manifests/models.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/openai-llm created
model.mlops.seldon.io/openai-embeddings created
model.mlops.seldon.io/pgvector-client created
model.mlops.seldon.io/openai-embeddings condition met
model.mlops.seldon.io/openai-llm condition met
model.mlops.seldon.io/pgvector-client condition met

The final step before starting to send requests is to chain all those models together through a Core 2 pipeline. The pipeline definition is the following:
!cat manifests/pipelines/pipeline-api.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: rag-app-api
spec:
  steps:
    - name: openai-embeddings
      inputs:
        - rag-app-api.inputs.query
      tensorMap:
        rag-app-api.inputs.query: input
    - name: pgvector-client
      inputs:
        - openai-embeddings.outputs.embedding
    - name: openai-llm
      inputs:
        - rag-app-api.inputs.role
        - rag-app-api.inputs.query
        - pgvector-client.outputs.documents
  output:
    steps:
      - openai-llm

We can break the logic of the pipeline above into three main steps:
1. The query reaches the OpenAI embedding model, which computes the vector embedding of the query.
2. The embedding is sent to the pgvector client, which retrieves the most relevant documents.
3. The query and the retrieved documents are sent to the LLM step, which builds the prompt from the retrieved context and the query and generates the completion.
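Outside of Core 2, the same flow can be sketched in a few lines of plain Python. This is only an illustration of the data flow, assuming OPENAI_API_KEY is set and the database port-forward is still active; it is not what the runtimes execute internally:

```python
import pg8000
from openai import OpenAI

openai_client = OpenAI()
question = "How do I define a Core 2 pipeline?"  # illustrative query

# 1. Embed the query.
embedding = openai_client.embeddings.create(
    model="text-embedding-3-small", input=question
).data[0].embedding

# 2. Retrieve the most relevant documents from pgvector.
conn = pg8000.connect(host="localhost", port=5432, database="db", user="admin", password="admin")
rows = conn.run(
    "SELECT text FROM embedding_table ORDER BY embedding <-> CAST(:q AS vector) LIMIT 5",
    q=str(embedding),
)
conn.close()

# 3. Build the prompt from the retrieved context plus the query and generate the completion.
context = "\n".join(text for (text,) in rows)
completion = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"{context}\n{question}"}],
)
print(completion.choices[0].message.content)
```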
To deploy our pipeline on Core 2, run the following command:
!kubectl apply -f manifests/pipelines/pipeline-api.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s pipeline --all -n seldon
pipeline.mlops.seldon.io/rag-app-api created
pipeline.mlops.seldon.io/rag-app-api condition met

Before sending the requests, we will need the following helper function to construct the endpoint we want to hit:
import subprocess
def get_mesh_ip():
    cmd = "kubectl get svc seldon-mesh -n seldon -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode('utf-8')

We are now ready to send a request to our pipeline. We will use the following query:
import requests
inference_request = {
    "inputs": [
        {
            "name": "role",
            "shape": [1],
            "datatype": "BYTES",
            "data": ["user"],
            "parameters": {"content_type": "str"},
        },
        {
            "name": "query",
            "shape": [1],
            "datatype": "BYTES",
            "data": [
                "Write the yaml manifests to deploy a pipeline consisting of two huggingface models "
                "chaining a speech to text model and then a text to sentiment model. Finally, "
                "add an explainer on top of the sentiment model."
            ],
            "parameters": {"content_type": "str"},
        },
    ],
}
endpoint = f"http://{get_mesh_ip()}/v2/pipelines/rag-app-api/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json()["outputs"][1]["data"][0])

To deploy a pipeline consisting of two Huggingface models—one for speech-to-text and another for text-to-sentiment—with an explainer on top of the sentiment model, you'll need to write YAML manifests for each model and the pipeline. Below are example YAML configurations for the models and the pipeline:
### 1. Huggingface Whisper Model (Speech-to-Text)
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: whisper
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/whisper"
  requirements:
    - huggingface
```
### 2. Huggingface Sentiment Model (Text-to-Sentiment)
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/sentiment"
  requirements:
    - huggingface
```
### 3. Sentiment Explainer
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment-explainer
spec:
  storageUri: "gs://seldon-models/scv2/examples/huggingface/speech-sentiment/explainer"
  explainer:
    type: anchor_text
    pipelineRef: sentiment-explain
```
### 4. Speech to Sentiment Pipeline
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: speech-to-sentiment
spec:
  steps:
    - name: whisper
    - name: sentiment
      inputs:
        - whisper
      tensorMap:
        whisper.outputs.output: args
    - name: sentiment-input-transform
```
### Commands to Deploy Models and Pipeline
```bash
# Load the models
seldon model load -f whisper.yaml
seldon model load -f sentiment.yaml
seldon model load -f sentiment-explainer.yaml
# Check model status
seldon model status whisper -w ModelAvailable | jq -M .
seldon model status sentiment -w ModelAvailable | jq -M .
seldon model status sentiment-explainer -w ModelAvailable | jq -M .
# Deploy the pipeline
seldon pipeline load -f speech-to-sentiment.yaml
# Check pipeline status
seldon pipeline status speech-to-sentiment -w PipelineAvailable | jq -M .
```
### Explanation
- **Whisper Model**: This is responsible for transcribing speech to text.
- **Sentiment Model**: It processes the text from the Whisper model and outputs sentiment.
- **Sentiment Explainer**: This provides explanations for the sentiment output via Alibi-Explain.
- **Pipeline**: Chains the Whisper and Sentiment models together and integrates the explainer.
Make sure your environment has all necessary permissions to access the specified storage URIs and that required dependencies are installed. Adjust paths and URIs as needed for your specific setup.

We can see that the LLM does a pretty decent job of defining all the manifest files and answering our query.
You can also load another model using the Local runtime and create a RAG pipeline using that model. Here is an example of the model-settings.json file for the phi-3.5 model:
!cat models/local-llm/model-settings.json
{
  "name": "local-llm",
  "implementation": "mlserver_llm_local.runtime.Local",
  "parameters": {
    "extra": {
      "backend": "vllm",
      "config": {
        "model_type": "chat.completions",
        "model_settings": {
          "model": "microsoft/Phi-3.5-mini-instruct",
          "tensor_parallel_size": 4,
          "dtype": "float16",
          "gpu_memory_utilization": 0.8,
          "max_model_len": 4096,
          "default_generate_kwargs": {
            "max_tokens": 1024
          }
        }
      },
      "prompt_utils": {
        "prompt_options": {
          "uri": "rag.jinja"
        }
      }
    }
  }
}

The associated CRD is the following:
!cat manifests/local-llm.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-llm
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/rag/models/local-llm"
  requirements:
    - llm-local

To load that model on the Local runtime server, run the following command:
!kubectl apply -f manifests/local-llm.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s model --all -n seldon
model.mlops.seldon.io/local-llm created
model.mlops.seldon.io/local-llm condition met
model.mlops.seldon.io/openai-embeddings condition met
model.mlops.seldon.io/openai-llm condition met
model.mlops.seldon.io/pgvector-client condition met

We can now define a second pipeline which uses our local model:
!cat manifests/pipelines/pipeline-local.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: rag-app-local
spec:
  steps:
    - name: openai-embeddings
      inputs:
        - rag-app-local.inputs.query
      tensorMap:
        rag-app-local.inputs.query: input
    - name: pgvector-client
      inputs:
        - openai-embeddings.outputs.embedding
    - name: local-llm
      inputs:
        - rag-app-local.inputs.role
        - rag-app-local.inputs.query
        - pgvector-client.outputs.documents
  output:
    steps:
      - local-llm

To deploy the pipeline, run the following command:
!kubectl apply -f manifests/pipelines/pipeline-local.yaml -n seldon
!kubectl wait --for condition=ready --timeout=600s pipeline --all -n seldon
pipeline.mlops.seldon.io/rag-app-local created
pipeline.mlops.seldon.io/rag-app-api condition met
pipeline.mlops.seldon.io/rag-app-local condition met

Once the pipeline is ready, we can send the same request as above:
endpoint = f"http://{get_mesh_ip()}/v2/pipelines/rag-app-local/infer"
response = requests.post(endpoint, json=inference_request)
print(response.json()["outputs"][1]["data"][0])

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: speech-to-sentiment
spec:
  steps:
    - name: whisper
      model:
        name: whisper
        uri: http://mlserver:8080
      inputTransforms:
        - mlmodel: ./sentiment-input-transform/model.py
    - name: sentiment
      model:
        name: sentiment
        uri: http://mlserver:8080
      inputTransforms:
        - mlmodel: ./sentiment-input-transform/model.py
      outputTransforms:
        - mlmodel: ./sentiment-output-transform/model.py
    - name: sentiment-explain
      model:
        name: sentiment-explainer
        uri: http://mlserver:8080
      inputTransforms:
        - mlmodel: ./sentiment-output-transform/model.py
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: whisper
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/whisper"
  requirements:
    - huggingface
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/sentiment"
  requirements:
    - huggingface
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment-explainer
spec:
  storageUri: "gs://seldon-models/mlserver/huggingface/explainer"
  requirements:
    - requires: "sentiment"
  explainer:
    type: anchor_text
```
To deploy and run this pipeline in a Kubernetes cluster, you need to follow these steps:
1. Deploy the MLServer serving the transformer models using the MLModel manifests provided above.
2. Deploy the models for Hugging Face whisper and sentiment using the respective YAML manifests.
3. Deploy the pipeline with the sentiment explain model linked to the sentiment model using the pipeline YAML manifest.
You can use tools like `kubectl` to apply these manifests to your cluster.
Please note that you'll need to ensure that the input and output data types are compatible with the transformer models and the Alibi-Explain explainer. The custom MLServer models for input and output transformations should handle this conversion.
Finally, keep an eye on the resource usage and cluster load, ensuring that your infrastructure can handle the concurrent requests made by the pipeline. <|end|>

We can see that the phi-3.5 model does a decent job as well.
To delete the pipelines, run the following commands:
!kubectl delete -f manifests/pipelines/pipeline-api.yaml -n seldon
!kubectl delete -f manifests/pipelines/pipeline-local.yaml -n seldon
pipeline.mlops.seldon.io "rag-app-api" deleted
pipeline.mlops.seldon.io "rag-app-local" deletedTo delete the models, run the following commands:
!kubectl delete -f manifests/models.yaml -n seldon
!kubectl delete -f manifests/local-llm.yaml -n seldon
model.mlops.seldon.io "openai-llm" deleted
model.mlops.seldon.io "openai-embeddings" deleted
model.mlops.seldon.io "pgvector-client" deleted
model.mlops.seldon.io "local-llm" deletedTo delete the pgvector deployment, run the following command:
!kubectl delete -f manifests/pgvector-deployment.yaml -n seldon
deployment.apps "postgres" deleted
service "pgvector-db" deletedLast updated