Retrieval Augmented Generation

In this example, we demonstrate how to build a RAG application using the LLM module on top of Core 2. This will allow us to supplement our LLM with relevant local context to enhance response quality. For this tutorial, you will need the Local, API, and Vector-DB runtimes up and running; check our installation tutorial to see how to do so.

We need to deploy a pgvector server on our k8s cluster. The deployment manifest for the pgvector server is the following:

!cat manifests/pgvector-deployment.yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: ankane/pgvector
        env:
        - name: POSTGRES_USER
          value: admin
        - name: POSTGRES_PASSWORD
          value: admin
        - name: POSTGRES_DB
          value: db
        - name: POSTGRES_HOST_AUTH_METHOD
          value: trust
        ports:
        - containerPort: 5432
        readinessProbe:
          exec:
            command: ["pg_isready", "-U", "admin", "-d", "db"]
          initialDelaySeconds: 5
          periodSeconds: 1
---
apiVersion: v1
kind: Service
metadata:
  name: pgvector-db
  labels:
    app: pgvector-db
spec:
  type: ClusterIP  # Specifies that this is an internal service
  ports:
    - port: 5432
      targetPort: 5432
  selector:
    app: postgres

We can now deploy the pgvector server by running the following command:
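Assuming the manifest above is saved at manifests/pgvector-deployment.yaml and that you are deploying into your current namespace:

```bash
!kubectl apply -f manifests/pgvector-deployment.yaml
```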

At this point, all our servers should be up and running. Before loading the models, we first need to populate the vector database. In this example, we will use the Seldon Core 2 documentation. We already scraped the documentation pages, split the content into documents, and embedded those documents using the OpenAI API with the text-embedding-3-small model. The JSON file containing the embedded documents can be downloaded using the following script:
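The exact download location depends on where the example assets are hosted; a minimal sketch, with the URL and file name as placeholders, would be:

```bash
# Hypothetical URL and file name -- substitute the location given with the example assets.
!wget -O embedded_docs.json <url-of-the-embedded-docs-json>
```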

To populate the database with the downloaded docs, we need to forward local port 5432 to the container's port 5432.

To do so, open a new terminal and run the following command:
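A sketch of the port-forward command, using the pod name reported by `kubectl get pods`:

```bash
!kubectl port-forward pod/postgres-c9784f947-4msmr 5432:5432
```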

Alternatively, feel free to do the port-forwarding directly from a tool like k9s.

Note that the suffix of your pod name will be different from c9784f947-4msmr. Please modify the command above so that it matches your local configuration.

To populate the database, run the following script:
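The population script is a minimal sketch: it assumes the downloaded file is called embedded_docs.json, that each entry has text and embedding fields, and that the table and column names match the vector-db client configuration used later (embedding_table, text, embedding).

```python
import json

import psycopg2

DIM = 1536  # dimensionality of text-embedding-3-small embeddings

with open("embedded_docs.json") as f:
    docs = json.load(f)

# connect through the port-forward opened above, using the credentials from the manifest
conn = psycopg2.connect(
    host="localhost", port=5432, user="admin", password="admin", dbname="db"
)
with conn, conn.cursor() as cur:
    # make sure the pgvector extension and the target table exist
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute(
        f"CREATE TABLE IF NOT EXISTS embedding_table "
        f"(id serial PRIMARY KEY, text text, embedding vector({DIM}));"
    )
    for doc in docs:
        # pgvector accepts vectors as a '[v1,v2,...]' literal
        vector_literal = "[" + ",".join(str(v) for v in doc["embedding"]) + "]"
        cur.execute(
            "INSERT INTO embedding_table (text, embedding) VALUES (%s, %s::vector);",
            (doc["text"], vector_literal),
        )
conn.close()
```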

Once the database is populated, we can load the models. We begin with the OpenAI LLM. The model-settings.json file for the OpenAI model is the following:
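The exact contents depend on the LLM module release you are running; the sketch below uses the standard MLServer model-settings.json layout, with the implementation string and the keys under extra shown as placeholders for the values documented in the LLM module:

```json
{
  "name": "openai-llm",
  "implementation": "<llm-module OpenAI runtime class>",
  "parameters": {
    "extra": {
      "model": "gpt-4o",
      "prompt_template": "rag.jinja"
    }
  }
}
```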

Note that we are deploying a gpt-4o model which uses a custom jinja template rag.jinja. The content of the template is the following:
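The template itself ships with the example assets; a representative sketch of a RAG prompt template in Jinja, where the variable names (docs, question) are assumptions rather than the module's actual interface, could look like:

```jinja
Use the following context to answer the question at the end.

{% for doc in docs %}
{{ doc }}
{% endfor %}

Question: {{ question }}
Answer:
```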

We can now move to the embedding component needed to embed the query into a vector and use that vector to search over our vector database deployed earlier. For this part of the tutorial, we will use an OpenAI embedding model. The associated model-settings.json file is the following:
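As with the LLM settings, this is a sketch: the implementation string and the keys under extra are placeholders, but the embedding model must be text-embedding-3-small, i.e. the same model used to embed the documents:

```json
{
  "name": "openai-embedding",
  "implementation": "<llm-module OpenAI embedding runtime class>",
  "parameters": {
    "extra": {
      "model": "text-embedding-3-small"
    }
  }
}
```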

The last part is the PGVector client from our VectorDB runtime. This is the actual component that will query the vector-db to find the related documents in our database. The model-settings.json file for the vector-db client is the following:
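Again a sketch: the implementation string and the exact parameter keys are placeholders, while the connection details, table, columns, and top-k value come from the deployment above and the note below:

```json
{
  "name": "pgvector-client",
  "implementation": "<vector-db runtime class>",
  "parameters": {
    "extra": {
      "uri": "postgresql://admin:admin@pgvector-db:5432/db",
      "table": "embedding_table",
      "vector_column": "embedding",
      "return_columns": ["text"],
      "top_k": 5
    }
  }
}
```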

Note that we are searching in the embedding_table, over the embedding column, and returning the top 5 matches from the text column.

The manifest file for the models is the following:
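A sketch of the Core 2 Model resources, with the storageUri paths and requirements tags as placeholders for the ones used in the example (each storageUri should point at the folder holding the corresponding model-settings.json):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: openai-llm
spec:
  storageUri: "gs://<bucket>/models/openai-llm"   # placeholder path
  requirements:
    - openai   # placeholder requirement tag matching the API runtime server
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: openai-embedding
spec:
  storageUri: "gs://<bucket>/models/openai-embedding"   # placeholder path
  requirements:
    - openai   # placeholder requirement tag matching the API runtime server
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: pgvector-client
spec:
  storageUri: "gs://<bucket>/models/pgvector-client"   # placeholder path
  requirements:
    - vector-db   # placeholder requirement tag matching the Vector-DB runtime server
```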

As you can see, we are deploying three models: an OpenAI LLM, an OpenAI embedding model (this has to be the same one we used to embed the documents above), and the pgvector client. To load the models, run the following commands:
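Assuming the manifest above is saved as manifests/models.yaml (the file name here is an assumption):

```bash
!kubectl apply -f manifests/models.yaml
!kubectl wait --for condition=ready --timeout=300s model --all
```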

The final step before starting to send requests is to chain all those models together through a Core 2 pipeline. The pipeline definition is the following:
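A sketch of the Pipeline resource, assuming the model names above and a pipeline called rag-openai; the exact tensor wiring (e.g. any tensorMap entries) depends on the input/output tensor names of the LLM module runtimes:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: rag-openai
spec:
  steps:
    - name: openai-embedding        # step 1: embed the query (receives the pipeline inputs)
    - name: pgvector-client         # step 2: retrieve the most relevant documents
      inputs:
        - openai-embedding
    - name: openai-llm              # step 3: build the prompt and generate the completion
      inputs:
        - rag-openai.inputs
        - pgvector-client
  output:
    steps:
      - openai-llm
```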

We can break the logic of the pipeline above into three main steps:

  1. The query reaches the OpenAI embedding model, which computes the vector embedding of our query

  2. The embedding is sent to the pgvector client, which retrieves the most relevant documents

  3. The query and the retrieved documents are sent to the LLM component, which builds the prompt from the query and the retrieved context (using the rag.jinja template) and generates the completion

To deploy our pipeline on Core 2, run the following command:
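Assuming the pipeline definition above is saved as manifests/rag-openai-pipeline.yaml (file name assumed):

```bash
!kubectl apply -f manifests/rag-openai-pipeline.yaml
!kubectl wait --for condition=ready --timeout=300s pipeline rag-openai
```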

Before sending the requests, we will need the following helper function to construct the endpoint we want to hit:
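The helper below is a sketch: it assumes requests go through the seldon-mesh service in the seldon-mesh namespace and that the service exposes an external IP; adapt it if you use an ingress or port-forwarding instead.

```python
import subprocess


def get_pipeline_endpoint(pipeline_name: str, namespace: str = "seldon-mesh") -> tuple[str, dict]:
    """Return the V2 inference URL and routing header for a Core 2 pipeline."""
    # look up the external IP of the seldon-mesh service
    ip = subprocess.check_output(
        [
            "kubectl", "get", "svc", "seldon-mesh", "-n", namespace,
            "-o", "jsonpath={.status.loadBalancer.ingress[0].ip}",
        ],
        text=True,
    ).strip()
    url = f"http://{ip}/v2/models/{pipeline_name}/infer"
    # route the request to the pipeline rather than an individual model
    headers = {"Seldon-Model": f"{pipeline_name}.pipeline"}
    return url, headers
```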

We are now ready to send a request to our pipeline. We will use the following query:
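The original query is not reproduced here; the request below is a sketch that assumes the rag-openai pipeline name, an illustrative question, and a single string input tensor (the tensor name expected by the LLM module runtimes may differ):

```python
import json

import requests

url, headers = get_pipeline_endpoint("rag-openai")

inference_request = {
    "inputs": [
        {
            "name": "text",  # assumption: the input tensor name expected by the embedding runtime
            "shape": [1],
            "datatype": "BYTES",
            # illustrative query, not the one used in the original example
            "data": ["How do I define a Model and a Pipeline in Seldon Core 2?"],
        }
    ]
}

response = requests.post(url, json=inference_request, headers=headers)
print(json.dumps(response.json(), indent=2))
```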

We can see that the LLM does a pretty decent job of defining all the manifest files and answering our query.

You can also load another model using the Local runtime and create a RAG pipeline using that model. Here is an example of the model-settings.json file for the phi-3.5 model:
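As before, this is an illustrative sketch: the implementation string and the keys under extra are placeholders, and the HuggingFace model id is an assumption:

```json
{
  "name": "local-llm",
  "implementation": "<llm-module local runtime class>",
  "parameters": {
    "extra": {
      "model": "microsoft/Phi-3.5-mini-instruct",
      "prompt_template": "rag.jinja"
    }
  }
}
```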

The associated CRD is the following:
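A sketch of the Model resource, with the storageUri and requirements tag as placeholders:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: local-llm
spec:
  storageUri: "gs://<bucket>/models/local-llm"   # placeholder: folder holding the model-settings.json above
  requirements:
    - llm-local   # placeholder requirement tag matching the Local runtime server
```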

To load that model on the Local runtime server, run the following command:
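Assuming the CRD above is saved as manifests/local-llm.yaml (file name assumed); local models can take a while to load, hence the longer timeout:

```bash
!kubectl apply -f manifests/local-llm.yaml
!kubectl wait --for condition=ready --timeout=600s model local-llm
```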

We can now define a second pipeline which uses our local model:
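A sketch of the second pipeline, identical to the first one except that the final step points at the local model (pipeline name rag-local is an assumption):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: rag-local
spec:
  steps:
    - name: openai-embedding        # embed the query
    - name: pgvector-client         # retrieve the most relevant documents
      inputs:
        - openai-embedding
    - name: local-llm               # generate the completion with the local phi-3.5 model
      inputs:
        - rag-local.inputs
        - pgvector-client
  output:
    steps:
      - local-llm
```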

To deploy the pipeline, run the following command:
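Assuming the definition above is saved as manifests/rag-local-pipeline.yaml (file name assumed):

```bash
!kubectl apply -f manifests/rag-local-pipeline.yaml
!kubectl wait --for condition=ready --timeout=300s pipeline rag-local
```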

Once the pipeline is ready, we can send the same request as above:
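Reusing the helper and the inference_request payload defined above, but targeting the rag-local pipeline:

```python
url, headers = get_pipeline_endpoint("rag-local")
response = requests.post(url, json=inference_request, headers=headers)
print(json.dumps(response.json(), indent=2))
```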

We can see that the phi-3.5 model does a decent job as well.

To delete the pipelines, run the following commands:
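The pipeline names below match the sketches above; substitute your own if they differ:

```bash
!kubectl delete pipeline rag-openai rag-local
```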

To delete the models, run the following commands:
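Again, the model names match the sketches above:

```bash
!kubectl delete model openai-llm openai-embedding pgvector-client local-llm
```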

To delete the pgvector deployment, run the following command:
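```bash
!kubectl delete -f manifests/pgvector-deployment.yaml
```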
