Installation

You can use your standard methods for accessing private artifact registries; we cover some of them below. Remember to replace credentials.json with the name of the access credentials file that we sent to you.

We recommend pushing the images to your own private artifact registry.

Images

These are the available images for each runtime.

| Runtime | Name:Tag |
| --- | --- |
| API | {{ registry-url }}/mlserver-llm-api:{{ current-version }} |
| Local | {{ registry-url }}/mlserver-llm-local:{{ current-version }} |
| Conversational Memory | {{ registry-url }}/mlserver-llm-memory:{{ current-version }} |
| VectorDB | {{ registry-url }}/mlserver-vector-db:{{ current-version }} |
| Local Embeddings | {{ registry-url }}/mlserver-local-embeddings:{{ current-version }} |
| Prompt | {{ registry-url }}/mlserver-prompt-utils:{{ current-version }} |

Authenticating against the Seldon Artifact Registry

Note: change credentials.json to the filename you have saved the credentials as.

Docker

To authenticate with the Docker CLI, you can run the following command:
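A minimal sketch, assuming the registry accepts JSON-key authentication with the conventional _json_key username; adjust the username and registry host to whatever the credentials file we sent you requires:

```bash
# Log in to the private registry using the credentials file provided by Seldon.
# `_json_key` is an assumption (typical for JSON-key authentication); you may need to use
# only the host portion of {{ registry-url }} here.
cat credentials.json | docker login -u _json_key --password-stdin https://{{ registry-url }}
```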

You'll now be able to pull the private image(s). For example, we'll pull the image for the Local Runtime using the following command:
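Using the Local runtime image name from the table above:

```bash
docker pull {{ registry-url }}/mlserver-llm-local:{{ current-version }}
```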

Kubernetes

In order to pull images directly into a Kubernetes cluster, you'll need to create a Kubernetes secret. For example:
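A sketch of the image pull secret; the secret name private-registry-secret and the _json_key username are assumptions to adapt to your setup:

```bash
kubectl create secret docker-registry private-registry-secret \
  --docker-server={{ registry-url }} \
  --docker-username=_json_key \
  --docker-password="$(cat credentials.json)" \
  --namespace=<namespace>
```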

To validate the setup, we can apply the following manifest and ensure that it deploys successfully.
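A throwaway Pod along these lines is enough to confirm that the private image can be pulled (the Pod name, secret name, and sleep command are illustrative):

```yaml
# pull-test.yaml -- only verifies that the private image can be pulled.
apiVersion: v1
kind: Pod
metadata:
  name: llm-pull-test
spec:
  imagePullSecrets:
    - name: private-registry-secret
  containers:
    - name: mlserver-llm-local
      image: {{ registry-url }}/mlserver-llm-local:{{ current-version }}
      # Override the entrypoint so the container idles instead of starting MLServer.
      command: ["sleep", "infinity"]
```

If kubectl get pod llm-pull-test reports Running rather than ImagePullBackOff, the registry secret is working; delete the Pod afterwards.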

Deploy the LLM Module runtime Servers

Environment Variables Setup

If you are going to use the API server, you need to set your OpenAI, Azure OpenAI, or Gemini API key as an environment variable. We will do this with a Kubernetes secret.

Example:
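A sketch of such a secret for OpenAI; the secret name and key (openai-secret / OPENAI_API_KEY) are assumptions, so match them to whatever your model manifests reference:

```yaml
# openai-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: openai-secret
type: Opaque
data:
  OPENAI_API_KEY: <base64-encoded OpenAI API key>
```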

Then apply the secret to the namespace in which you will be deploying the models.
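For example, assuming the file name from the sketch above:

```bash
kubectl apply -f openai-secret.yaml -n <namespace>
```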

Similarly for Gemini:
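A sketch with assumed names (gemini-secret / GEMINI_API_KEY):

```yaml
# gemini-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: gemini-secret
type: Opaque
data:
  GEMINI_API_KEY: <base64-encoded Gemini API key>
```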

Then apply the secret to the namespace in which you will be deploying the models.
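For example:

```bash
kubectl apply -f gemini-secret.yaml -n <namespace>
```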

If you are going to deploy models directly from Hugging Face (HF), you will need to set your HF token in the namespace in which you will be deploying models.

Create a Kubernetes secret containing your HF token in the namespace where you are deploying your models.

Example:
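A sketch with assumed names (hf-token-secret / HF_TOKEN):

```yaml
# hf-token-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
data:
  HF_TOKEN: <base64-encoded Hugging Face token>
```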

Then apply the secret to the namespace in which you will be deploying the models.
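For example:

```bash
kubectl apply -f hf-token-secret.yaml -n <namespace>
```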

Make sure your API key/token is base64 encoded. To get the base64 encoding of your API key/token, run the following command:
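```bash
# -n keeps a trailing newline out of the encoded value.
echo -n "<your-api-key-or-token>" | base64
```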

API Runtime

We will deploy the API Runtime server into the namespace where you will be running models.

Create a manifest file called server-api.yaml and add the following configuration:
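A sketch of a Seldon Core 2 Server resource; verify the apiVersion and field names against your Seldon installation, and treat the serverConfig value and secret name as assumptions:

```yaml
# server-api.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-api
spec:
  serverConfig: mlserver            # ServerConfig template in your Seldon installation
  podSpec:
    imagePullSecrets:
      - name: private-registry-secret
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-llm-api:{{ current-version }}
```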

Then apply the server-api.yaml file to the namespace in which you will be deploying the models.
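For example (the same apply pattern is used for the other server manifests below):

```bash
kubectl apply -f server-api.yaml -n <namespace>
kubectl get pods -n <namespace>   # verify the new pod reaches Running
```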

You should now see the pod running the API Runtime.

Conversational Memory Runtime

We will deploy the Conversational Memory Runtime server into the namespace where you will be running models.

Create a manifest file called server-memory.yaml and add the following configuration:
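Following the same sketch as the API runtime, with the Conversational Memory image (field names and secret name remain assumptions):

```yaml
# server-memory.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-memory
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-secret
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-llm-memory:{{ current-version }}
```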

Then apply the server-memory.yaml file to the namespace in which you will be deploying the models.

We should now see the pod running the Conversational Memory Runtime.

Local Runtime with GPU

We will deploy the Local Runtime server with GPU support into the namespace where you will be running models.

Add the following configuration to your server-local.yaml file. If needed, please change the resource requirements based on your use case.
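A sketch with a single-GPU request; the resource figures are placeholders, not recommendations:

```yaml
# server-local.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-local
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-secret
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-llm-local:{{ current-version }}
        resources:
          requests:
            cpu: "4"              # placeholder values -- size for your models
            memory: 16Gi
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
```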

Then apply the server-local.yaml file to the namespace in which you will be deploying the models.

We should now see the pod running the Local Runtime.

VectorDB Runtime

We will deploy the VectorDB Runtime into the namespace where you will be running models.

Add the following configuration to your server-vector-db.yaml file.
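Same pattern as above, with the VectorDB image (field names and secret name remain assumptions):

```yaml
# server-vector-db.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-vector-db
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-secret
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-vector-db:{{ current-version }}
```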

Then apply the server-vector-db.yaml file to the namespace in which you will be deploying the models.

We should now see the pod running for the VectorDB Runtime.

LocalEmbeddings Runtime

We will deploy the LocalEmbeddings Runtime into the namespace where you will be running models.

Add the following configuration to your server-local-embeddings.yaml file.
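Same pattern, with the Local Embeddings image and placeholder resources:

```yaml
# server-local-embeddings.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-local-embeddings
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-secret
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-local-embeddings:{{ current-version }}
        resources:
          requests:
            cpu: "2"              # placeholder values -- see the note below
            memory: 8Gi
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
```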

The CPU, GPU, and memory requirements are just for reference and should be updated according to your setup.

Then apply the server-local-embeddings.yaml file to the namespace in which you will be deploying the models.

We should now see the pod running the LocalEmbeddings Runtime.

Prompt Runtime

We will deploy the Prompt Runtime into the namespace where you will be running models.

Add the following configuration to your server-prompt.yaml file.
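Same pattern, with the Prompt image:

```yaml
# server-prompt.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-prompt
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-secret
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-prompt-utils:{{ current-version }}
```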

Then apply the server-prompt.yaml file to the namespace in which you will be deploying the models.

We should now see the pod running the Prompt Runtime.

Key Values from the server file

name: The name you want to give the server.

serverConfig: Refers to a ServerConfig resource within your Seldon installation. ServerConfigs are used as templates for the base layer of the LLM module's MLServer pod.

imagePullSecrets: The secret used to authenticate against the private Docker registry when pulling the LLM module images.

Cleaning up

To remove the servers, run the following commands:
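Assuming the manifest file names used above:

```bash
kubectl delete -f server-api.yaml -n <namespace>
kubectl delete -f server-memory.yaml -n <namespace>
kubectl delete -f server-local.yaml -n <namespace>
kubectl delete -f server-vector-db.yaml -n <namespace>
kubectl delete -f server-local-embeddings.yaml -n <namespace>
kubectl delete -f server-prompt.yaml -n <namespace>
```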

To remove the secrets, run the following commands:
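Assuming the secret names used in the sketches above (delete only the ones you created):

```bash
kubectl delete secret openai-secret -n <namespace>        # and/or gemini-secret, hf-token-secret
kubectl delete secret private-registry-secret -n <namespace>
```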

At this point, the LLM module should be removed from your cluster.

Troubleshooting

If there is an issue with the access key secret, run the command below:
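One way to inspect and, if necessary, recreate the registry pull secret (names as in the sketches above):

```bash
kubectl get secret private-registry-secret -n <namespace> -o yaml   # inspect the secret
kubectl delete secret private-registry-secret -n <namespace>        # remove a broken secret
kubectl create secret docker-registry private-registry-secret \
  --docker-server={{ registry-url }} \
  --docker-username=_json_key \
  --docker-password="$(cat credentials.json)" \
  --namespace=<namespace>
```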

Trouble setting the OpenAI secret

Another method is to create the secret directly with kubectl.

Set your environment variables:
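For example (variable names are illustrative):

```bash
export OPENAI_API_KEY="<your OpenAI API key>"
export NAMESPACE="<the namespace you deploy models to>"
```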

Use kubectl to apply the secret:
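A sketch using the same secret and key names as before; kubectl handles the base64 encoding for you:

```bash
kubectl create secret generic openai-secret \
  --from-literal=OPENAI_API_KEY="$OPENAI_API_KEY" \
  --namespace="$NAMESPACE" \
  --dry-run=client -o yaml | kubectl apply -f -
```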

The same workflow applies to Gemini and HF secrets.

Next Steps

Now that you're able to pull the images, you can try some of the examples.
