# Installation

{% hint style="warning" %}
This guide assumes that we've given you access credentials for pulling the LLM Module images from our artifact registry. If not, please reach out to us for a [demo](https://www.seldon.io/demo/).
{% endhint %}

You can access the images using your usual method for private artifact registries. This guide covers Docker and Kubernetes. Remember to replace `credentials.json` with the name of the access credentials file that we sent you.

{% hint style="info" %}
We recommend pushing the images to your own private artifact registry.
{% endhint %}

## Images

The following images are available, one per runtime.

| Runtime               | Name:Tag                                                             |
| --------------------- | -------------------------------------------------------------------- |
| API                   | `{{ registry-url }}/mlserver-llm-api:{{ current-version }}`          |
| Local                 | `{{ registry-url }}/mlserver-llm-local:{{ current-version }}`        |
| Conversational Memory | `{{ registry-url }}/mlserver-llm-memory:{{ current-version }}`       |
| VectorDB              | `{{ registry-url }}/mlserver-vector-db:{{ current-version }}`        |
| Local Embeddings      | `{{ registry-url }}/mlserver-local-embeddings:{{ current-version }}` |
| Prompt                | `{{ registry-url }}/mlserver-prompt-utils:{{ current-version }}`     |
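
The `{{ registry-url }}` and `{{ current-version }}` placeholders combine with the image name into a full image reference. As a minimal sketch, assuming the registry path `europe-west2-docker.pkg.dev/seldon-registry/llm` and the version `0.7.0` that appear in the `Server` manifests later in this guide, the following prints each reference. Adjust both values to match what was shared with you; `list_images` is a helper name chosen for this example.

```shell
# Assumed values, taken from the Server manifests later in this guide.
REGISTRY_URL="europe-west2-docker.pkg.dev/seldon-registry/llm"
VERSION="0.7.0"

# Print the full image reference for every runtime listed in the table.
list_images() {
  for name in mlserver-llm-api mlserver-llm-local mlserver-llm-memory \
              mlserver-vector-db mlserver-local-embeddings mlserver-prompt-utils; do
    printf '%s/%s:%s\n' "${REGISTRY_URL}" "${name}" "${VERSION}"
  done
}

list_images
```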

## Authenticating against the Seldon Artifact Registry

Note: change `credentials.json` to the filename you have saved the credentials as.

## Docker

To authenticate with the Docker CLI, you can run the following command:

```shell
REGISTRY=europe-west2-docker.pkg.dev
cat credentials.json | docker login -u _json_key --password-stdin ${REGISTRY}
```

You can now pull the private images. For example, the following command pulls the `whoami` image, which is also used for the validation step later in this guide:

```shell
docker pull europe-west2-docker.pkg.dev/seldon-registry/llm/whoami:latest
```
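
If you follow the recommendation to push the images to your own private registry, the usual Docker workflow is pull, retag, push. The sketch below illustrates this under stated assumptions: `registry.example.com/seldon-llm` is a hypothetical destination, `mirror_image` and `DRY_RUN` are names chosen for this example, and you are already logged in to both registries. With `DRY_RUN=1` the commands are only printed, which is useful for reviewing them first.

```shell
# Source path as used in this guide; the target is a hypothetical placeholder.
SOURCE_REGISTRY="europe-west2-docker.pkg.dev/seldon-registry/llm"
TARGET_REGISTRY="registry.example.com/seldon-llm"

# Pull an image, retag it for the target registry, and push it.
# With DRY_RUN=1 the docker commands are printed instead of executed.
mirror_image() {
  image="$1"   # for example: mlserver-llm-api:0.7.0
  run=""
  if [ "${DRY_RUN:-0}" = "1" ]; then run="echo"; fi
  ${run} docker pull "${SOURCE_REGISTRY}/${image}" &&
  ${run} docker tag "${SOURCE_REGISTRY}/${image}" "${TARGET_REGISTRY}/${image}" &&
  ${run} docker push "${TARGET_REGISTRY}/${image}"
}

# Preview the commands for one image without running them.
DRY_RUN=1 mirror_image "mlserver-llm-api:0.7.0"
```

After mirroring, the `image` fields in your manifests would point at your own registry path instead of the Seldon one.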

## Kubernetes

In order to pull images directly into a Kubernetes cluster, you'll need to create a Kubernetes secret. For example:

```shell
NAMESPACE=seldon
REGISTRY=europe-west2-docker.pkg.dev
CREDENTIALS=$(cat credentials.json)
SECRET=seldon-registry
kubectl create secret docker-registry ${SECRET} \
	--docker-server="${REGISTRY}" \
	--docker-username="_json_key" \
	--docker-password="${CREDENTIALS}" \
	--dry-run=client -o yaml | kubectl apply -n ${NAMESPACE} -f -
```

To validate the secret, apply the following manifest and check that it deploys successfully.

```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: whoami
  labels:
    app: whoami
spec:
  replicas: 1
  selector:
    matchLabels:
      app: whoami
  template:
    metadata:
      labels:
        app: whoami
    spec:
      containers:
      - name: whoami
        image: europe-west2-docker.pkg.dev/seldon-registry/llm/whoami:latest
      imagePullSecrets:
      - name: seldon-registry
---
apiVersion: v1
kind: Service
metadata:
  name: whoami
spec:
  type: LoadBalancer
  selector:
    app: whoami
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 80
```
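
One way to run this validation is to apply the manifest in the same namespace as the pull secret and wait for the rollout to complete. This is a sketch only: `whoami.yaml` is an assumed filename for the manifest above, and `validate_pull_secret` is a helper name chosen for this example.

```shell
# Apply the validation manifest and wait for the Deployment to become ready.
# Assumes the manifest above was saved as whoami.yaml (hypothetical filename).
validate_pull_secret() {
  namespace="${1:-seldon}"
  kubectl apply -n "${namespace}" -f whoami.yaml &&
  kubectl rollout status deployment/whoami -n "${namespace}" --timeout=120s
}

# Usage: validate_pull_secret seldon
```

If the rollout times out, `kubectl describe pod -l app=whoami -n seldon` usually shows whether the image pull failed because of the secret. Once validated, the test resources can be removed with `kubectl delete -f whoami.yaml -n seldon`.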

## Deploy the LLM Module runtime Servers

### Environment Variables Setup

If you are going to use the API server, you need to set your OpenAI, Azure OpenAI, or Gemini API key as an environment variable. We will do this with a Kubernetes secret.

**Example:**

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: openai-api-key
type: Opaque
data:
  MLSERVER_MODEL_OPENAI_API_KEY: [base64 OpenAI or Azure OpenAI service API key]
```

Save the manifest as `openai-secret.yaml`, then apply it to the namespace where you will deploy the models.

```shell
kubectl apply -f openai-secret.yaml -n seldon
```

Similarly for Gemini:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: gemini-api-key
type: Opaque
data:
  MLSERVER_MODEL_GEMINI_API_KEY: [base64 Gemini service API key]
```

Save the manifest as `gemini-secret.yaml`, then apply it to the namespace where you will deploy the models.

```shell
kubectl apply -f gemini-secret.yaml -n seldon
```

If you are going to deploy models directly from Hugging Face (HF), your HF token must be available in the namespace where you will deploy the models.

Create a Kubernetes secret containing your HF token in that namespace.

**Example:**

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: huggingface-token
type: Opaque
data:
  HF_TOKEN: [base64 Hugging Face API Key]
```

Save the manifest as `hf-secret.yaml`, then apply it to the namespace where you will deploy the models:

```shell
kubectl apply -f hf-secret.yaml -n seldon
```

{% hint style="info" %}
Make sure your API key/token is `base64` encoded. To get the `base64` encoding of your API key/token run the following command:

```shell
echo -n '[key/token-to-encode]' | base64
```

{% endhint %}
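
To avoid encoding values by hand, the manifest can also be generated. The following is a minimal sketch, not part of the LLM Module tooling: `encode_value` and `render_secret` are helper names chosen for this example, and `hf_example_token` is a placeholder rather than a real credential. The `tr -d '\n'` step removes the line wrapping that some `base64` implementations add to long values.

```shell
# Encode a value as a single-line base64 string.
encode_value() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Print a Secret manifest containing one base64-encoded key.
render_secret() {
  secret_name="$1"; key="$2"; value="$3"
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: ${secret_name}
type: Opaque
data:
  ${key}: $(encode_value "${value}")
EOF
}

# Example with a placeholder token; prints the manifest to stdout.
render_secret huggingface-token HF_TOKEN "hf_example_token"
```

Redirecting the output to a file, for example `> hf-secret.yaml`, gives a manifest that can be applied with the `kubectl apply` commands shown above.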

### API Runtime

We will deploy the `API Runtime` server into the namespace where you will be running models.

Create a manifest file called `server-api.yaml` and add the following configuration:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-api
spec:
  serverConfig: mlserver
  capabilities:
  - openai
  - gemini
  podSpec:
    imagePullSecrets:
    - name: seldon-registry
    containers:
    - name: mlserver
      image: europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-api:0.7.0
      imagePullPolicy: Always
      env:
      - name: MLSERVER_MODEL_OPENAI_API_KEY
        valueFrom:
          secretKeyRef:
            name: openai-api-key
            key: MLSERVER_MODEL_OPENAI_API_KEY
      - name: MLSERVER_MODEL_GEMINI_API_KEY
        valueFrom:
          secretKeyRef:
            name: gemini-api-key
            key: MLSERVER_MODEL_GEMINI_API_KEY
```

Then apply the `server-api.yaml` file to the namespace where you will deploy the models.

```shell
kubectl apply -f server-api.yaml -n seldon
```

You should now see the pod running the `API Runtime`.
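
A quick way to check this is to look up the `Server` resource and list the pods in the namespace. The sketch below assumes the `seldon` namespace used throughout this guide; `check_server` is a helper name chosen for this example, and the same check applies to the other runtimes by passing their server names.

```shell
# Show the Server resource and the pods in its namespace.
# The namespace defaults to "seldon", as used in this guide.
check_server() {
  server_name="$1"
  namespace="${2:-seldon}"
  kubectl get servers.mlops.seldon.io "${server_name}" -n "${namespace}" &&
  kubectl get pods -n "${namespace}"
}

# Usage: check_server mlserver-llm-api seldon
```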

### Conversational Memory Runtime

We will deploy the `Conversational Memory Runtime` server into the namespace where you will be running models.

Create a manifest file called `server-memory.yaml` and add the following configuration:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-memory
spec:
  serverConfig: mlserver
  capabilities:
  - memory
  podSpec:
    imagePullSecrets:
    - name: seldon-registry
    containers:
    - name: mlserver
      image: europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-memory:0.7.0
      imagePullPolicy: Always
```

Then apply the `server-memory.yaml` file to the namespace where you will deploy the models.

```shell
kubectl apply -f server-memory.yaml -n seldon
```

We should now see the pod running the `Conversational Memory Runtime`.

### Local Runtime with GPU

We will deploy the `Local Runtime` server with GPU support into the namespace where you will be running models.

Add the following configuration to your `server-local.yaml` file. If needed, please change the resource requirements based on your use case.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-local
spec:
  serverConfig: mlserver
  capabilities:
  - llm-local
  podSpec:
    imagePullSecrets:
    - name: seldon-registry
    containers:
    - name: mlserver
      image: europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-llm-local:0.7.0
      imagePullPolicy: Always
      env:
      - name: HF_TOKEN
        valueFrom:
          secretKeyRef:
            name: huggingface-token
            key: HF_TOKEN
      resources:
        requests:
          nvidia.com/gpu: 4
          memory: 4Gi
        limits:
          memory: 32Gi
          nvidia.com/gpu: 4
      volumeMounts:
      - mountPath: /dev/shm
        name: dshm
        readOnly: false
    volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 1Gi
      name: dshm
```

Then apply the `server-local.yaml` file to the namespace where you will deploy the models.

```shell
kubectl apply -f server-local.yaml -n seldon
```

We should now see the pod running the `Local Runtime`.

### VectorDB Runtime

We will deploy the `VectorDB Runtime` into the namespace where you will be running models.

Add the following configuration to your `server-vector-db.yaml` file.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-vector-db
spec:
  serverConfig: mlserver
  capabilities:
  - vector-db
  podSpec:
    imagePullSecrets:
    - name: seldon-registry
    containers:
    - name: mlserver
      image: europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-vector-db:0.7.0
      imagePullPolicy: Always
```

Then apply the `server-vector-db.yaml` file to the namespace where you will deploy the models.

```shell
kubectl apply -f server-vector-db.yaml -n seldon
```

We should now see the pod running the `VectorDB Runtime`.

### LocalEmbeddings Runtime

We will deploy the `LocalEmbeddings Runtime` into the namespace where you will be running models.

Add the following configuration to your `server-local-embeddings.yaml` file.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-local-embeddings
spec:
  serverConfig: mlserver
  capabilities:
  - local-embeddings
  podSpec:
    imagePullSecrets:
    - name: seldon-registry
    containers:
    - name: mlserver
      image: europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-local-embeddings:0.7.0
      imagePullPolicy: Always
      resources:
        requests:
          nvidia.com/gpu: 1
          memory: 4Gi
          cpu: "1"
        limits:
          memory: 32Gi
          nvidia.com/gpu: 1
          cpu: "8"
      volumeMounts:
      - mountPath: /dev/shm
        name: dshm
        readOnly: false
    volumes:
    - emptyDir:
        medium: Memory
        sizeLimit: 1Gi
      name: dshm
```

The CPU, GPU, and memory requirements are just for reference and should be updated according to your setup.

Then apply the `server-local-embeddings.yaml` file to the namespace where you will deploy the models.

```shell
kubectl apply -f server-local-embeddings.yaml -n seldon
```

We should now see the pod running the `LocalEmbeddings Runtime`.

### Prompt Runtime

We will deploy the `Prompt Runtime` into the namespace where you will be running models.

Add the following configuration to your `server-prompt.yaml` file.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-prompt-utils
spec:
  serverConfig: mlserver
  capabilities:
  - prompt
  podSpec:
    imagePullSecrets:
    - name: seldon-registry
    containers:
    - name: mlserver
      image: europe-west2-docker.pkg.dev/seldon-registry/llm/mlserver-prompt-utils:0.7.0
      imagePullPolicy: Always
```

Then apply the `server-prompt.yaml` file to the namespace where you will deploy the models.

```shell
kubectl apply -f server-prompt.yaml -n seldon
```

We should now see the pod running the `Prompt Runtime`.

### Key Values from the server file

**name**: The name you want to give the server.

**serverConfig**: Refers to a `ServerConfig` in your Seldon installation. A `ServerConfig` is used as a template for the base layer of the LLM Module's MLServer pod.

**imagePullSecrets**: The secret used to authenticate against the private Docker registry when pulling the LLM Module images.

## Cleaning up

To remove the servers, run the following commands:

```shell
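kubectl delete -f server-api.yaml -n seldon
kubectl delete -f server-memory.yaml -n seldon
kubectl delete -f server-local.yaml -n seldon
kubectl delete -f server-vector-db.yaml -n seldon
kubectl delete -f server-local-embeddings.yaml -n seldon
kubectl delete -f server-prompt.yaml -n seldon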
```

To remove the secrets, run the following commands:

```shell
kubectl delete -f openai-secret.yaml -n seldon
kubectl delete -f gemini-secret.yaml -n seldon
kubectl delete -f hf-secret.yaml -n seldon
```

At this point, the LLM Module should be removed from your cluster.

## Troubleshooting

**If there is an issue with the access key secret run the command below:**

```shell
kubectl patch sa seldon-server -n seldon \
  -p '{"imagePullSecrets": [{"name": "seldon-registry"}]}'
```

**Trouble setting the OpenAI secret**

Alternatively, you can create the secret directly with `kubectl` instead of applying a manifest file.

Set your environment variables:

```shell
export MLSERVER_MODEL_OPENAI_API_KEY=[OpenAI or Azure OpenAI service API key]
export NAMESPACE=[Namespace where you are deploying models]
```

Use `kubectl` to apply the secret:

```shell
kubectl delete secret openai-api-key -n "${NAMESPACE}" || echo "openai-api-key secret does not exist - will create"
kubectl create secret generic openai-api-key --from-literal=MLSERVER_MODEL_OPENAI_API_KEY="${MLSERVER_MODEL_OPENAI_API_KEY}" -n "${NAMESPACE}"
```

The same workflow applies to Gemini and HF secrets.
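
As a sketch of that workflow for the other secrets, the helper below creates or updates a generic secret in one step. `upsert_secret` is a name chosen for this example, and it assumes that the key inside each secret must match the `key` referenced by `secretKeyRef` in the `Server` manifests (`MLSERVER_MODEL_GEMINI_API_KEY` and `HF_TOKEN`). It reuses the `--dry-run=client -o yaml | kubectl apply` pattern from the registry secret above, so no separate delete step is needed.

```shell
# Create or update a generic secret that holds a single key.
# Rendering with --dry-run and piping to `kubectl apply` makes it repeatable.
upsert_secret() {
  secret_name="$1"; key="$2"; value="$3"; namespace="$4"
  kubectl create secret generic "${secret_name}" \
    --from-literal="${key}=${value}" \
    --dry-run=client -o yaml | kubectl apply -n "${namespace}" -f -
}

# Usage (values come from your own environment):
# upsert_secret gemini-api-key MLSERVER_MODEL_GEMINI_API_KEY "${MLSERVER_MODEL_GEMINI_API_KEY}" "${NAMESPACE}"
# upsert_secret huggingface-token HF_TOKEN "${HF_TOKEN}" "${NAMESPACE}"
```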

## Next Steps

Now that you're able to pull the images, you can try some of the [examples](https://github.com/SeldonIO/llm-runtimes/blob/master/docs-gb/examples/README.md).


