Installation
This guide assumes that we have provided you with access credentials for the LLM Module images in our artifact registry. If not, please reach out to us for a demo.
You can use your standard methods for accessing private artifact registries; we cover some of them below. Remember to replace credentials.json with the name of the access credentials file that we sent to you.
Images
The following images are available for the runtimes:
API: {{ registry-url }}/mlserver-llm-api:{{ current-version }}
Local: {{ registry-url }}/mlserver-llm-local:{{ current-version }}
Conversational Memory: {{ registry-url }}/mlserver-llm-memory:{{ current-version }}
VectorDB: {{ registry-url }}/mlserver-vector-db:{{ current-version }}
Local Embeddings: {{ registry-url }}/mlserver-local-embeddings:{{ current-version }}
Prompt: {{ registry-url }}/mlserver-prompt-utils:{{ current-version }}
Authenticating against the Seldon Artifact Registry
Note: change credentials.json to the filename you have saved the credentials as.
Docker
To authenticate with the Docker CLI, you can run the following command:
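For example, if your credentials file is a JSON key that the registry accepts as a password (as with Google Artifact Registry), the login could look like the sketch below; the _json_key username is an assumption, so adjust it if the credentials we sent you say otherwise:

```bash
# Log in to the private registry using the JSON credentials file.
cat credentials.json | docker login {{ registry-url }} -u _json_key --password-stdin
```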
You'll now be able to pull the private image(s). For example, we'll pull the image for the Local Runtime using the following command:
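Using the Local Runtime image listed above:

```bash
docker pull {{ registry-url }}/mlserver-llm-local:{{ current-version }}
```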
Kubernetes
In order to pull images directly into a Kubernetes cluster, you'll need to create a Kubernetes secret. For example:
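A minimal sketch with kubectl, assuming the same JSON key credentials; the secret name private-registry-auth and the seldon-mesh namespace are example values used throughout this guide, so substitute your own:

```bash
kubectl create secret docker-registry private-registry-auth \
  --docker-server={{ registry-url }} \
  --docker-username=_json_key \
  --docker-password="$(cat credentials.json)" \
  --namespace seldon-mesh
```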
To test, apply the following manifest and ensure that it deploys successfully.
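For example, a minimal Pod that pulls one of the LLM Module images using the secret created above (the pod name is a placeholder):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-image-pull-test
  namespace: seldon-mesh
spec:
  imagePullSecrets:
    - name: private-registry-auth   # the docker-registry secret created above
  containers:
    - name: llm-api
      image: {{ registry-url }}/mlserver-llm-api:{{ current-version }}
```

If the pod reaches the Running state, the credentials are working; you can then delete the test pod.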
Deploy the LLM Module Runtime Servers
Environment Variables Setup
If you are going to be using the API server, you need to set your OpenAI, Azure OpenAI, or Gemini API key as an environment variable. We will do this with a Kubernetes secret.
Example:
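A minimal sketch, saved here as openai-secret.yaml; the secret name and the OPENAI_API_KEY key are example values, so use whatever names your deployment expects:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: openai-api-key
type: Opaque
stringData:
  OPENAI_API_KEY: <your-openai-api-key>
```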
Then apply the secret to the namespace where you will be deploying the models.
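For example, assuming the example seldon-mesh namespace:

```bash
kubectl apply -f openai-secret.yaml -n seldon-mesh
```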
Similarly for Gemini:
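A sketch along the same lines, saved as gemini-secret.yaml; the GEMINI_API_KEY key name is an assumption:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: gemini-api-key
type: Opaque
stringData:
  GEMINI_API_KEY: <your-gemini-api-key>
```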
Then apply the secret to the namespace where you will be deploying the models.
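For example:

```bash
kubectl apply -f gemini-secret.yaml -n seldon-mesh
```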
If you are going to be deploying models directly from Hugging Face (HF), you will need to make your HF token available in the namespace where you will be deploying models.
Create a Kubernetes secret containing your HF token in that namespace.
Example:
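A sketch, saved as hf-secret.yaml; the HF_TOKEN key name is an assumption and should match whatever your model configuration reads:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: huggingface-token
type: Opaque
stringData:
  HF_TOKEN: <your-hf-token>
```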
Then apply the secret to the namespace where you will be deploying the models:
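For example:

```bash
kubectl apply -f hf-secret.yaml -n seldon-mesh
```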
API Runtime
We will deploy the API Runtime server into the namespace where you will be running models.
Create a manifest file called server-api.yaml and add the following configuration:
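A minimal sketch of the Server resource, assuming a standard Seldon Core 2 installation; apart from the keys explained under "Key Values from the server file" below, all values here (names, the secret reference, the env wiring for the OpenAI key) are examples and may need adjusting for your setup:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-api
spec:
  serverConfig: mlserver                 # ServerConfig template from your Seldon installation
  podSpec:
    imagePullSecrets:
      - name: private-registry-auth      # registry secret created earlier
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-llm-api:{{ current-version }}
        env:
          # One possible way to expose the OpenAI key to the runtime; adjust
          # (or replace with the Azure OpenAI / Gemini secret) as needed.
          - name: OPENAI_API_KEY
            valueFrom:
              secretKeyRef:
                name: openai-api-key
                key: OPENAI_API_KEY
```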
Then apply the server-api.yaml file to the namespace where you will be deploying the models.
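For example:

```bash
kubectl apply -f server-api.yaml -n seldon-mesh
```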
You should now see the pod running the API Runtime.
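You can check with, for example:

```bash
kubectl get pods -n seldon-mesh
```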
Conversational Memory Runtime
We will deploy the Conversational Memory Runtime server into the namespace where you will be running models.
Create a manifest file called server-memory.yaml and add the following configuration:
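Again a minimal sketch with example values:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-memory
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-auth
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-llm-memory:{{ current-version }}
```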
Then apply the server-memory.yaml file to the namespace where you will be deploying the models.
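For example:

```bash
kubectl apply -f server-memory.yaml -n seldon-mesh
```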
We should now see the pod running the Conversational Memory Runtime.
Local Runtime with GPU
We will deploy the Local Runtime server with GPU support into the namespace where you will be running models.
Add the following configuration to your server-local.yaml file. If needed, please change the resource requirements based on your use case.
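A sketch requesting one NVIDIA GPU, with placeholder CPU and memory figures:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-local
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-auth
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-llm-local:{{ current-version }}
        resources:
          requests:
            cpu: "4"
            memory: 32Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: "4"
            memory: 32Gi
            nvidia.com/gpu: "1"
```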
Then apply the server-local.yaml file to the namespace where you will be deploying the models.
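For example:

```bash
kubectl apply -f server-local.yaml -n seldon-mesh
```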
We should now see the pod running the Local Runtime.
VectorDB Runtime
We will deploy the VectorDB Runtime into the namespace where you will be running models.
Add the following configuration to your server-vector-db.yaml file.
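A minimal sketch with example values:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-vector-db
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-auth
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-vector-db:{{ current-version }}
```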
Then apply the server-vector-db.yaml file to the namespace where you will be deploying the models.
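For example:

```bash
kubectl apply -f server-vector-db.yaml -n seldon-mesh
```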
We should now see the pod running the VectorDB Runtime.
LocalEmbeddings Runtime
We will deploy the LocalEmbeddings Runtime into the namespace where you will be running models.
Add the following configuration to your server-local-embeddings.yaml file.
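A sketch with placeholder resource figures (see the note below):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-local-embeddings
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-auth
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-local-embeddings:{{ current-version }}
        resources:
          requests:
            cpu: "2"
            memory: 8Gi
            nvidia.com/gpu: "1"
          limits:
            cpu: "2"
            memory: 8Gi
            nvidia.com/gpu: "1"
```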
The CPU, GPU, and memory requirements are just for reference and should be updated according to your setup.
Then apply the server-local-embeddings.yaml file to the namespace where you will be deploying the models.
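For example:

```bash
kubectl apply -f server-local-embeddings.yaml -n seldon-mesh
```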
We should now see the pod running the LocalEmbeddings Runtime.
Prompt Runtime
We will deploy the Prompt Runtime into the namespace where you will be running models.
Add the following configuration to your server-prompt.yaml file.
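A minimal sketch with example values:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-prompt
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-auth
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-prompt-utils:{{ current-version }}
```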
Then apply the server-prompt.yaml file to the namespace where you will be deploying the models.
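For example:

```bash
kubectl apply -f server-prompt.yaml -n seldon-mesh
```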
We should now see the pod running the Prompt Runtime.
Key Values from the server file
name: The name you want to give the server.
serverConfig: The ServerConfig in your Seldon installation to use as a template for the base layer of the LLM Module's MLServer pod.
imagePullSecrets: The secret used to authenticate against the private Docker registry when pulling the LLM Module images.
Cleaning up
To remove the servers, run the following commands:
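For example, using the manifest files created above:

```bash
kubectl delete -f server-api.yaml -f server-memory.yaml -f server-local.yaml \
  -f server-vector-db.yaml -f server-local-embeddings.yaml -f server-prompt.yaml \
  -n seldon-mesh
```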
To remove the secrets, run the following commands:
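For example, using the example secret names from this guide:

```bash
kubectl delete secret private-registry-auth openai-api-key gemini-api-key huggingface-token -n seldon-mesh
```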
At this point, the LLM Module should be removed from your cluster.
Troubleshooting
If there is an issue with the access key secret, run the commands below:
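For example, delete and recreate the registry secret from the credentials file (names as in the examples above):

```bash
kubectl delete secret private-registry-auth -n seldon-mesh
kubectl create secret docker-registry private-registry-auth \
  --docker-server={{ registry-url }} \
  --docker-username=_json_key \
  --docker-password="$(cat credentials.json)" \
  --namespace seldon-mesh
```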
Trouble setting the OpenAI secret
Another method is to create the secret directly with kubectl.
Set your environment variables:
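For example (the value is a placeholder):

```bash
export OPENAI_API_KEY=<your-openai-api-key>
```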
Use kubectl to apply the secret:
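Assuming the same example secret name and namespace as above:

```bash
kubectl create secret generic openai-api-key \
  --from-literal=OPENAI_API_KEY="$OPENAI_API_KEY" \
  -n seldon-mesh
```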
The same workflow applies to Gemini and HF secrets.
Next Steps
Now that you're able to pull the images, you can try some of the examples.