# Routing (with LiteLLM)

This guide describes how to deploy a LiteLLM proxy server to route requests between multiple LLM runtimes managed by Seldon Core 2. It assumes that the LLM Module is installed and that both the API and local runtimes are available in your cluster. For installation instructions, see our [installation tutorial](/llm-module/introduction/installation.md).

## Overview

In this tutorial, you will:

1. Deploy two LLMs: an OpenAI-compatible chat model on the API runtime and a small Hugging Face model (SmolLM) on the local runtime.
2. Configure and test LiteLLM locally against the cluster endpoints.
3. Deploy the LiteLLM proxy in-cluster for production use.

After completing this tutorial, you will have the system architecture depicted below: LiteLLM serves as the entry point and routes each request to a specific *model*, while Envoy is used by Core 2 to route to the correct *instance* of that model when multiple replicas run in parallel.

![litellm-architecture.png](/files/TmxHYA0BJcBaezaC4RIA)

## Prerequisites

* Seldon Core 2 installed and configured.
* Seldon's LLM module deployed (including both API and local runtimes).
* Access to a cloud storage location for model settings (e.g., GCS bucket).
* \[Optional] `kubectl` configured to access the target cluster namespace (this example uses namespace `seldon-mesh`).

## Deploying the models

### API model

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: chatgpt
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/chat-bot/models/chatgpt"
  requirements:
  - openai
```

The Model's `storageUri` points to a location containing a `model-settings.json` file that configures the runtime. Adjust the storage URI to your own bucket.

### Local model

Upload a `model-settings.json` file such as:

```json
{
    "name": "local-chat-completions",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "HuggingFaceTB/SmolLM2-135M-Instruct",
                    "device": "cpu",
                    "max_tokens": -1
                }
            }
        }
    }
}
```

Then create a Model resource (example):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: smollm
spec:
  storageUri: "gs://.../path/to/local-chat-completions"
  requirements:
  - local
```

Apply both model manifests:

```bash
kubectl create -f deployment/models/yaml/chatgpt.yaml -n seldon-mesh
kubectl create -f deployment/models/yaml/smollm.yaml -n seldon-mesh
```

Note: The SmolLM example is CPU-based for convenience. For GPU-based models, see [this doc](/llm-module/components/models/local/mms.md).
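
Before configuring the proxy, confirm that both models have been scheduled and report ready. A minimal sketch, assuming the Model CRD is reachable as `model` in your cluster:

```py
import subprocess

# Check that both Model resources report READY before configuring the proxy.
for model in ("chatgpt", "smollm"):
    cmd = f"kubectl get model {model} -n seldon-mesh"
    print(subprocess.check_output(cmd, shell=True).decode("utf-8"))
```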

## Configure LiteLLM

Obtain the external service IP for the Seldon mesh:

```bash
kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath="{.status.loadBalancer.ingress[0].ip}"
```

Substitute that IP for the `{external_ip}` placeholder in the LiteLLM config file. Example config for local testing:

```yaml
model_list:
  - model_name: chatgpt
    litellm_params:
      model: openai/chatgpt
      api_base: http://{external_ip}/v2/models/chatgpt/infer
      api_key: "..."
  - model_name: chatgpt
    litellm_params:
      model: openai/smollm
      api_base: http://{external_ip}/v2/models/smollm/infer
      api_key: "..."
```
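
If you prefer to script this substitution, here is a minimal sketch; it assumes the config above is saved as `litellm-config.template.yaml` (a hypothetical file name) with a literal `{external_ip}` placeholder:

```py
import subprocess

# Fetch the external IP of the seldon-mesh Service (same command as above).
cmd = (
    "kubectl get svc seldon-mesh -n seldon-mesh "
    "-o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
)
external_ip = subprocess.check_output(cmd, shell=True).decode("utf-8").strip()

# Substitute the placeholder and write the file that `litellm --config` expects.
with open("litellm-config.template.yaml") as f:
    config = f.read().replace("{external_ip}", external_ip)

with open("litellm-config.yaml", "w") as f:
    f.write(config)
```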

{% hint style="info" %}
API keys are typically provided through Kubernetes Secrets that the llm-runtimes read directly, so the local config does not need to contain real values.
{% endhint %}

Run the LiteLLM proxy locally to verify routing:

```bash
litellm --config litellm-config.yaml
```

Requests to the local proxy are routed according to this configuration. Note that both entries above share the `model_name` `chatgpt`, so LiteLLM treats them as one model group and load-balances requests for `chatgpt` across the two deployments; the production configuration below adds tags to make the routing explicit.

### Example request in Python

```py
import requests

endpoint = "http://0.0.0.0:4000/chat/completions"

response = requests.post(
    endpoint,
    headers={"Content-Type": "application/json"},
    json={
        "model": "chatgpt",
        "messages": [
            {"role": "user", "content": "You are a helpful assistant."},
            {"role": "assistant", "content": "Hello! How can I help you?"},
            {"role": "user", "content": "What is the capital of Scotland?"}
        ]
    }
)

print(response.json())
```

```
{'id': 'chatcmpl-ClFr4TxGcd9jpQd3GTKYfvt2VG5Id', 'created': 1765378398, 'model': 'gpt-3.5-turbo-0125', 'object': 'chat.completion', 'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'content': 'The capital of Scotland is Edinburgh.', 'role': 'assistant'}}], 'usage': {'completion_tokens': 7, 'prompt_tokens': 36, 'total_tokens': 43, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'service_tier': 'default'}
```

## Deploying to Production

To deploy the LiteLLM proxy server, perform three steps:

1. Create a ConfigMap to store the LiteLLM configuration.
2. Create a Deployment for the LiteLLM proxy.
3. Create a Service to expose an external IP for queries.

The ConfigMap, Deployment, and Service can be placed together in a single file (for example, `litellm.yaml`). For more details see the [deployment](https://docs.litellm.ai/docs/proxy/deploy) and [production](https://docs.litellm.ai/docs/proxy/prod) documentation for LiteLLM.

### Configuration

Use the following ConfigMap for the LiteLLM deployment. In this example we enable tag filtering, which routes requests to specific LLMs based on their assigned tags. For other routing strategies, such as fallbacks, load balancing within a model group, or availability-based routing, see [LiteLLM's routing docs](https://docs.litellm.ai/docs/routing-load-balancing).

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: seldon-mesh
  name: litellm-config-file
data:
  config.yaml: |
    model_list:
    - model_name: chatgpt
      litellm_params:
        model: openai/chatgpt
        api_base: http://seldon-mesh.seldon-mesh.svc.cluster.local/v2/models/chatgpt/infer
        api_key: "..."
        tags: ['api', 'default']
    - model_name: chatgpt
      litellm_params:
        model: openai/smollm
        api_base: http://seldon-mesh.seldon-mesh.svc.cluster.local/v2/models/smollm/infer
        api_key: "..."
        tags: ['local']
    router_settings:
      enable_tag_filtering: true
```

{% hint style="info" %}
Use the internal service DNS name (`seldon-mesh.seldon-mesh.svc.cluster.local`) for in-cluster routing.
{% endhint %}

### Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: seldon-mesh
  name: litellm-deployment
  labels:
    app: litellm
spec:
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
      - name: litellm
        image: ghcr.io/berriai/litellm:main-stable
        args:
          - "--config"
          - "/app/proxy_server_config.yaml"
        ports:
        - containerPort: 4000
        volumeMounts:
        - name: config-volume
          mountPath: /app/proxy_server_config.yaml
          subPath: config.yaml
      volumes:
        - name: config-volume
          configMap:
            name: litellm-config-file
```

### Service

```yaml
apiVersion: v1
kind: Service
metadata:
  namespace: seldon-mesh
  name: litellm
  labels:
    app: litellm
spec:
  selector:
    app: litellm
  ports:
    - name: http
      port: 4000
      targetPort: 4000
  type: LoadBalancer
```

Apply resources:

```bash
kubectl apply --namespace=seldon-mesh -f litellm.yaml
```
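
The proxy is ready once the Deployment has finished rolling out. A small sketch, using the same `subprocess` pattern as the test script below:

```py
import subprocess

# Block until the LiteLLM proxy Deployment has rolled out successfully.
subprocess.run(
    "kubectl rollout status deployment/litellm-deployment -n seldon-mesh",
    shell=True,
    check=True,
)
```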

## Testing the deployed proxy

Get the external IP for the LiteLLM service:

```bash
kubectl get svc litellm -n seldon-mesh -o jsonpath="{.status.loadBalancer.ingress[0].ip}"
```

Example programmatic test:

```py
import subprocess
import requests


def get_litellm_ip():
    # Query the external IP assigned to the litellm LoadBalancer Service.
    cmd = "kubectl get svc litellm -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode("utf-8").strip()

endpoint = f"http://{get_litellm_ip()}:4000/chat/completions"

response = requests.post(
    endpoint,
    headers={"Content-Type": "application/json"},
    json={
        "model": "chatgpt",
        "messages": [
            {"role": "user", "content": "You are a helpful assistant."},
            {"role": "assistant", "content": "Hello! How can I help you?"},
            {"role": "user", "content": "What is the capital of Scotland?"}
        ],
        "tags": ["local"]
    }
)

print(response.json())
```
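
Because the API deployment is tagged `default`, a request that omits `tags` should be routed to it: with tag filtering enabled, LiteLLM sends untagged requests to deployments tagged `default`. A minimal sketch, continuing from the script above:

```py
# Reuses `requests` and `endpoint` from the script above. Omitting "tags"
# routes the request to the deployment tagged "default" (the API model).
response = requests.post(
    endpoint,
    headers={"Content-Type": "application/json"},
    json={
        "model": "chatgpt",
        "messages": [{"role": "user", "content": "What is the capital of Scotland?"}],
    },
)
print(response.json()["model"])  # expect an API model identifier, e.g. a gpt-* model
```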

## Final Notes

* Because these runtimes are OpenAI-compatible, you can also use the official OpenAI client against the deployed proxy, as sketched below.
* Protect API keys using Kubernetes Secrets rather than embedding them in ConfigMaps.
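
For example, a minimal sketch using the official `openai` Python client (the `api_key` value is a placeholder unless you configure authentication on the proxy, and `<litellm-external-ip>` stands for the Service IP obtained above):

```py
from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM proxy.
client = OpenAI(base_url="http://<litellm-external-ip>:4000", api_key="not-needed")

response = client.chat.completions.create(
    model="chatgpt",
    messages=[{"role": "user", "content": "What is the capital of Scotland?"}],
    extra_body={"tags": ["local"]},  # optional: target a tag-filtered deployment
)
print(response.choices[0].message.content)
```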

