Installation

You can use your standard methods for accessing private artifact registries; we cover some of them below. Remember to replace credentials.json with the name of the access credentials file that we sent to you.

We recommend pushing the images to your own private artifact registry.

Images

These are the available images for each runtime.

| Runtime | Name:Tag |
| --- | --- |
| API | {{ registry-url }}/mlserver-llm-api:{{ current-version }} |
| Local | {{ registry-url }}/mlserver-llm-local:{{ current-version }} |
| Conversational Memory | {{ registry-url }}/mlserver-llm-memory:{{ current-version }} |
| VectorDB | {{ registry-url }}/mlserver-vector-db:{{ current-version }} |
| Local Embeddings | {{ registry-url }}/mlserver-local-embeddings:{{ current-version }} |
| Prompt | {{ registry-url }}/mlserver-prompt-utils:{{ current-version }} |

Authenticating against the Seldon Artifact Registry

Note: change credentials.json to the filename you have saved the credentials as.

Docker

To authenticate with the Docker CLI, you can run the following command:
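A minimal sketch, assuming the registry accepts JSON-key authentication with the conventional _json_key username; adjust the username and registry host to whatever the credentials file we sent you requires:

```bash
# Log in to the private registry using the credentials file provided by Seldon.
# `_json_key` is an assumption (typical for JSON-key authentication); you may need to use
# only the host portion of {{ registry-url }} here.
cat credentials.json | docker login -u _json_key --password-stdin https://{{ registry-url }}
```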

You'll now be able to pull the private image(s). For example, we'll pull the image for the Local Runtime using the following command:
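Using the Local runtime image name from the table above:

```bash
docker pull {{ registry-url }}/mlserver-llm-local:{{ current-version }}
```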

Kubernetes

In order to pull images directly into a Kubernetes cluster, you'll need to create a Kubernetes secret. For example:
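A sketch of the image pull secret; the secret name private-registry-secret and the _json_key username are assumptions to adapt to your setup:

```bash
kubectl create secret docker-registry private-registry-secret \
  --docker-server={{ registry-url }} \
  --docker-username=_json_key \
  --docker-password="$(cat credentials.json)" \
  --namespace=<namespace>
```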

To validate the setup, we can apply the following manifest and ensure that it deploys successfully.
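A throwaway Pod along these lines is enough to confirm that the private image can be pulled (the Pod name, secret name, and sleep command are illustrative):

```yaml
# pull-test.yaml -- only verifies that the private image can be pulled.
apiVersion: v1
kind: Pod
metadata:
  name: llm-pull-test
spec:
  imagePullSecrets:
    - name: private-registry-secret
  containers:
    - name: mlserver-llm-local
      image: {{ registry-url }}/mlserver-llm-local:{{ current-version }}
      # Override the entrypoint so the container idles instead of starting MLServer.
      command: ["sleep", "infinity"]
```

If kubectl get pod llm-pull-test reports Running rather than ImagePullBackOff, the registry secret is working; delete the Pod afterwards.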

Deploy the LLM Module runtime Servers

Environment Variables Setup

If you are going to use the API server, you need to set your OpenAI, Azure OpenAI, or Gemini API key as an environment variable. We will do this with a Kubernetes secret.

Example:
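A sketch of such a secret for OpenAI; the secret name and key (openai-secret / OPENAI_API_KEY) are assumptions, so match them to whatever your model manifests reference:

```yaml
# openai-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: openai-secret
type: Opaque
data:
  OPENAI_API_KEY: <base64-encoded OpenAI API key>
```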

Then apply the secret to the namespace in which you will be deploying the models.
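For example, assuming the file name from the sketch above:

```bash
kubectl apply -f openai-secret.yaml -n <namespace>
```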

Similarly for Gemini:
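A sketch with assumed names (gemini-secret / GEMINI_API_KEY):

```yaml
# gemini-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: gemini-secret
type: Opaque
data:
  GEMINI_API_KEY: <base64-encoded Gemini API key>
```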

Then apply the secret to the namespace in which you will be deploying the models.
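For example:

```bash
kubectl apply -f gemini-secret.yaml -n <namespace>
```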

If you are going to deploy models directly from Hugging Face (HF), you will need to set your HF token in the namespace in which you will be deploying models.

Create a Kubernetes secret containing your HF token in the namespace where you are deploying your models.

Example:
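A sketch with assumed names (hf-token-secret / HF_TOKEN):

```yaml
# hf-token-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
data:
  HF_TOKEN: <base64-encoded Hugging Face token>
```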

Then apply the secret to the namespace in which you will be deploying the models.
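For example:

```bash
kubectl apply -f hf-token-secret.yaml -n <namespace>
```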

Make sure your API key/token is base64 encoded. To get the base64 encoding of your API key/token, run the following command:
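```bash
# -n keeps a trailing newline out of the encoded value.
echo -n "<your-api-key-or-token>" | base64
```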

API Runtime

We will deploy the API Runtime server into the namespace where you will be running models.

Create a manifest file called server-api.yaml and add the following configuration:
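A sketch of a Seldon Core 2 Server resource; verify the apiVersion and field names against your Seldon installation, and treat the serverConfig value and secret name as assumptions:

```yaml
# server-api.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-api
spec:
  serverConfig: mlserver            # ServerConfig template in your Seldon installation
  podSpec:
    imagePullSecrets:
      - name: private-registry-secret
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-llm-api:{{ current-version }}
```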

Then apply the server-api.yaml file to the namespace in which you will be deploying the models.
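For example (the same apply pattern is used for the other server manifests below):

```bash
kubectl apply -f server-api.yaml -n <namespace>
kubectl get pods -n <namespace>   # verify the new pod reaches Running
```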

You should now see the pod running the API Runtime.

Conversational Memory Runtime

We will deploy the Conversational Memory Runtime server into the namespace where you will be running models.

Create a manifest file called server-memory.yaml and add the following configuration:
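Following the same sketch as the API runtime, with the Conversational Memory image (field names and secret name remain assumptions):

```yaml
# server-memory.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-memory
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-secret
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-llm-memory:{{ current-version }}
```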

Then apply the server-memory.yaml file to the namespace in which you will be deploying the models.

We should now see the pod running the Conversational Memory Runtime.

Local Runtime with GPU

We will deploy the Local Runtime server with GPU support into the namespace where you will be running models.

Add the following configuration to your server-local.yaml file. If needed, please change the resource requirements based on your use case.
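A sketch with a single-GPU request; the resource figures are placeholders, not recommendations:

```yaml
# server-local.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-local
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-secret
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-llm-local:{{ current-version }}
        resources:
          requests:
            cpu: "4"              # placeholder values -- size for your models
            memory: 16Gi
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
```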

Then apply the server-local.yaml file to the namespace in which you will be deploying the models.

We should now see the pod running the Local Runtime.

VectorDB Runtime

We will deploy the VectorDB Runtime into the namespace where you will be running models.

Add the following configuration to your server-vector-db.yaml file.
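Same pattern as above, with the VectorDB image (field names and secret name remain assumptions):

```yaml
# server-vector-db.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-vector-db
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-secret
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-vector-db:{{ current-version }}
```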

Then apply the server-vector-db.yaml file to the namespace in which you will be deploying the models.

We should now see the pod running for the VectorDB Runtime.

LocalEmbeddings Runtime

We will deploy the LocalEmbeddings Runtime into the namespace where you will be running models.

Add the following configuration to your server-local-embeddings.yaml file.
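Same pattern, with the Local Embeddings image and placeholder resources:

```yaml
# server-local-embeddings.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-local-embeddings
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-secret
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-local-embeddings:{{ current-version }}
        resources:
          requests:
            cpu: "2"              # placeholder values -- see the note below
            memory: 8Gi
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
```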

The CPU, GPU, and memory requirements are just for reference and should be updated according to your setup.

Then apply the server-local-embeddings.yaml file to the namespace in which you will be deploying the models.

We should now see the pod running the LocalEmbeddings Runtime.

Prompt Runtime

We will deploy the Prompt Runtime into the namespace where you will be running models.

Add the following configuration to your server-prompt.yaml file.
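Same pattern, with the Prompt image:

```yaml
# server-prompt.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-prompt
spec:
  serverConfig: mlserver
  podSpec:
    imagePullSecrets:
      - name: private-registry-secret
    containers:
      - name: mlserver
        image: {{ registry-url }}/mlserver-prompt-utils:{{ current-version }}
```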

Then apply the server-prompt.yaml file to the namespace in which you will be deploying the models.

We should now see the pod running the Prompt Runtime.

Key Values from the server file

name: The name you want to give the server.

serverConfig: Refers to a ServerConfig resource within your Seldon installation. ServerConfigs are used as templates for the base layer of the LLM module's MLServer pod.

imagePullSecrets: The secret used to authenticate against the private Docker registry when pulling the LLM module images.

Cleaning up

To remove the servers, run the following commands:
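Assuming the manifest file names used above:

```bash
kubectl delete -f server-api.yaml -n <namespace>
kubectl delete -f server-memory.yaml -n <namespace>
kubectl delete -f server-local.yaml -n <namespace>
kubectl delete -f server-vector-db.yaml -n <namespace>
kubectl delete -f server-local-embeddings.yaml -n <namespace>
kubectl delete -f server-prompt.yaml -n <namespace>
```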

To remove the secrets, run the following commands:
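Assuming the secret names used in the sketches above (delete only the ones you created):

```bash
kubectl delete secret openai-secret -n <namespace>        # and/or gemini-secret, hf-token-secret
kubectl delete secret private-registry-secret -n <namespace>
```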

At this point, the LLM module should be removed from your cluster.

Troubleshooting

If there is an issue with the access key secret, run the command below:
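One way to inspect and, if necessary, recreate the registry pull secret (names as in the sketches above):

```bash
kubectl get secret private-registry-secret -n <namespace> -o yaml   # inspect the secret
kubectl delete secret private-registry-secret -n <namespace>        # remove a broken secret
kubectl create secret docker-registry private-registry-secret \
  --docker-server={{ registry-url }} \
  --docker-username=_json_key \
  --docker-password="$(cat credentials.json)" \
  --namespace=<namespace>
```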

Trouble setting the OpenAI secret

Another method is to create the secret directly with kubectl.

Set your environment variables:
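For example (variable names are illustrative):

```bash
export OPENAI_API_KEY="<your OpenAI API key>"
export NAMESPACE="<the namespace you deploy models to>"
```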

Use kubectl to apply the secret:
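A sketch using the same secret and key names as before; kubectl handles the base64 encoding for you:

```bash
kubectl create secret generic openai-secret \
  --from-literal=OPENAI_API_KEY="$OPENAI_API_KEY" \
  --namespace="$NAMESPACE" \
  --dry-run=client -o yaml | kubectl apply -f -
```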

The same workflow applies to Gemini and HF secrets.

Next Steps

Now that you're able to pull the images, you can try some of the examples.
