# Routing (with LiteLLM)

This guide describes how to deploy a LiteLLM proxy server to route requests between multiple LLM runtimes managed by Seldon Core 2. It assumes that the LLM Module is installed and that both the API and local runtimes are available in your cluster. For installation instructions, see our [installation tutorial](/llm-module/introduction/installation.md).

## Overview

In this tutorial, you will:

1. Deploy two LLMs: an OpenAI-compatible chat model on the API runtime and a small Hugging Face model (SmolLM) on the local runtime.
2. Configure and test LiteLLM locally against the cluster endpoints.
3. Deploy the LiteLLM proxy in-cluster for production use.

After completing this tutorial, you will have the system architecture depicted below: LiteLLM serves as the entry point and routes each request to a specific *model*, while Envoy is used by Core 2 to route to the correct *instance* of that model when multiple replicas run in parallel.

![litellm-architecture.png](/files/TmxHYA0BJcBaezaC4RIA)

## Prerequisites

* Seldon Core 2 installed and configured.
* Seldon's LLM module deployed (including both API and local runtimes).
* Access to a cloud storage location for model settings (e.g., GCS bucket).
* \[Optional] `kubectl` configured to access the target cluster namespace (this example uses namespace `seldon-mesh`).

## Deploying the models

### API model

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: chatgpt
spec:
  storageUri: "gs://seldon-models/llm-runtimes-settings/chat-bot/models/chatgpt"
  requirements:
  - openai
```

The Model's `storageUri` points to a location containing a `model-settings.json` file that configures the runtime. Adjust the storage URI to your own bucket.

### Local model

Upload a `model-settings.json` file such as:

```json
{
    "name": "local-chat-completions",
    "implementation": "mlserver_llm_local.runtime.Local",
    "parameters": {
        "extra": {
            "backend": "transformers",
            "config": {
                "model_type": "chat.completions",
                "model_settings": {
                    "model": "HuggingFaceTB/SmolLM2-135M-Instruct",
                    "device": "cpu",
                    "max_tokens": -1
                }
            }
        }
    }
}
```

Then create a Model resource (example):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: smollm
spec:
  storageUri: "gs://.../path/to/local-chat-completions"
  requirements:
  - local
```

Apply both model manifests:

```bash
kubectl create -f deployment/models/yaml/chatgpt.yaml -n seldon-mesh
kubectl create -f deployment/models/yaml/smollm.yaml -n seldon-mesh
```

Note: The SmolLM example is CPU-based for convenience. For GPU-based models, see [this doc](/llm-module/components/models/local/mms.md).
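
Before configuring the proxy, confirm that both models have been scheduled and report ready. A minimal sketch, assuming the Model CRD is reachable as `model` in your cluster:

```py
import subprocess

# Check that both Model resources report READY before configuring the proxy.
for model in ("chatgpt", "smollm"):
    cmd = f"kubectl get model {model} -n seldon-mesh"
    print(subprocess.check_output(cmd, shell=True).decode("utf-8"))
```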

## Configure LiteLLM

Obtain the external service IP for the Seldon mesh:

```bash
kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath="{.status.loadBalancer.ingress[0].ip}"
```

Substitute that IP for the `{external_ip}` placeholder in the LiteLLM config file. Example config for local testing:

```yaml
model_list:
  - model_name: chatgpt
    litellm_params:
      model: openai/chatgpt
      api_base: http://{external_ip}/v2/models/chatgpt/infer
      api_key: "..."
  - model_name: chatgpt
    litellm_params:
      model: openai/smollm
      api_base: http://{external_ip}/v2/models/smollm/infer
      api_key: "..."
```
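
If you prefer to script this substitution, here is a minimal sketch; it assumes the config above is saved as `litellm-config.template.yaml` (a hypothetical file name) with a literal `{external_ip}` placeholder:

```py
import subprocess

# Fetch the external IP of the seldon-mesh Service (same command as above).
cmd = (
    "kubectl get svc seldon-mesh -n seldon-mesh "
    "-o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
)
external_ip = subprocess.check_output(cmd, shell=True).decode("utf-8").strip()

# Substitute the placeholder and write the file that `litellm --config` expects.
with open("litellm-config.template.yaml") as f:
    config = f.read().replace("{external_ip}", external_ip)

with open("litellm-config.yaml", "w") as f:
    f.write(config)
```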

{% hint style="info" %}
API keys are typically provided through Kubernetes Secrets that the llm-runtimes read directly, so the local config does not need to contain real values.
{% endhint %}

Run the LiteLLM proxy locally to verify routing:

```bash
litellm --config litellm-config.yaml
```

Requests to the local proxy are routed according to this configuration. Note that both entries above share the `model_name` `chatgpt`, so LiteLLM treats them as one model group and load-balances requests for `chatgpt` across the two deployments; the production configuration below adds tags to make the routing explicit.

### Example request in Python

```py
import requests

endpoint = "http://0.0.0.0:4000/chat/completions"

response = requests.post(
    endpoint,
    headers={"Content-Type": "application/json"},
    json={
        "model": "chatgpt",
        "messages": [
            {"role": "user", "content": "You are a helpful assistant."},
            {"role": "assistant", "content": "Hello! How can I help you?"},
            {"role": "user", "content": "What is the capital of Scotland?"}
        ]
    }
)

print(response.json())
```

```
{'id': 'chatcmpl-ClFr4TxGcd9jpQd3GTKYfvt2VG5Id', 'created': 1765378398, 'model': 'gpt-3.5-turbo-0125', 'object': 'chat.completion', 'choices': [{'finish_reason': 'stop', 'index': 0, 'message': {'content': 'The capital of Scotland is Edinburgh.', 'role': 'assistant'}}], 'usage': {'completion_tokens': 7, 'prompt_tokens': 36, 'total_tokens': 43, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'service_tier': 'default'}
```

## Deploying to Production

To deploy the LiteLLM proxy server, perform three steps:

1. Create a ConfigMap to store the LiteLLM configuration.
2. Create a Deployment for the LiteLLM proxy.
3. Create a Service to expose an external IP for queries.

The ConfigMap, Deployment, and Service can be placed together in a single file (for example, `litellm.yaml`). For more details see the [deployment](https://docs.litellm.ai/docs/proxy/deploy) and [production](https://docs.litellm.ai/docs/proxy/prod) documentation for LiteLLM.

### Configuration

Use the following ConfigMap for the LiteLLM deployment. In this example we enable tag filtering, which routes requests to specific LLMs based on their assigned tags. For other routing strategies, such as fallbacks, load balancing within a model group, or availability-based routing, see [LiteLLM's routing docs](https://docs.litellm.ai/docs/routing-load-balancing).

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: seldon-mesh
  name: litellm-config-file
data:
  config.yaml: |
    model_list:
    - model_name: chatgpt
      litellm_params:
        model: openai/chatgpt
        api_base: http://seldon-mesh.seldon-mesh.svc.cluster.local/v2/models/chatgpt/infer
        api_key: "..."
        tags: ['api', 'default']
    - model_name: chatgpt
      litellm_params:
        model: openai/smollm
        api_base: http://seldon-mesh.seldon-mesh.svc.cluster.local/v2/models/smollm/infer
        api_key: "..."
        tags: ['local']
    router_settings:
      enable_tag_filtering: true
```

{% hint style="info" %}
Use the internal service DNS name (`seldon-mesh.seldon-mesh.svc.cluster.local`) for in-cluster routing.
{% endhint %}

### Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: seldon-mesh
  name: litellm-deployment
  labels:
    app: litellm
spec:
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
      - name: litellm
        image: ghcr.io/berriai/litellm:main-stable
        args:
          - "--config"
          - "/app/proxy_server_config.yaml"
        ports:
        - containerPort: 4000
        volumeMounts:
        - name: config-volume
          mountPath: /app/proxy_server_config.yaml
          subPath: config.yaml
      volumes:
        - name: config-volume
          configMap:
            name: litellm-config-file
```

### Service

```yaml
apiVersion: v1
kind: Service
metadata:
  namespace: seldon-mesh
  name: litellm
  labels:
    app: litellm
spec:
  selector:
    app: litellm
  ports:
    - name: http
      port: 4000
      targetPort: 4000
  type: LoadBalancer
```

Apply resources:

```bash
kubectl apply --namespace=seldon-mesh -f litellm.yaml
```
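
The proxy is ready once the Deployment has finished rolling out. A small sketch, using the same `subprocess` pattern as the test script below:

```py
import subprocess

# Block until the LiteLLM proxy Deployment has rolled out successfully.
subprocess.run(
    "kubectl rollout status deployment/litellm-deployment -n seldon-mesh",
    shell=True,
    check=True,
)
```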

## Testing the deployed proxy

Get the external IP for the LiteLLM service:

```bash
kubectl get svc litellm -n seldon-mesh -o jsonpath="{.status.loadBalancer.ingress[0].ip}"
```

Example programmatic test:

```py
import subprocess
import requests


def get_litellm_ip():
    # Query the external IP assigned to the litellm LoadBalancer Service.
    cmd = "kubectl get svc litellm -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}'"
    return subprocess.check_output(cmd, shell=True).decode("utf-8").strip()

endpoint = f"http://{get_litellm_ip()}:4000/chat/completions"

response = requests.post(
    endpoint,
    headers={"Content-Type": "application/json"},
    json={
        "model": "chatgpt",
        "messages": [
            {"role": "user", "content": "You are a helpful assistant."},
            {"role": "assistant", "content": "Hello! How can I help you?"},
            {"role": "user", "content": "What is the capital of Scotland?"}
        ],
        "tags": ["local"]
    }
)

print(response.json())
```
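
Because the API deployment is tagged `default`, a request that omits `tags` should be routed to it: with tag filtering enabled, LiteLLM sends untagged requests to deployments tagged `default`. A minimal sketch, continuing from the script above:

```py
# Reuses `requests` and `endpoint` from the script above. Omitting "tags"
# routes the request to the deployment tagged "default" (the API model).
response = requests.post(
    endpoint,
    headers={"Content-Type": "application/json"},
    json={
        "model": "chatgpt",
        "messages": [{"role": "user", "content": "What is the capital of Scotland?"}],
    },
)
print(response.json()["model"])  # expect an API model identifier, e.g. a gpt-* model
```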

## Final Notes

* Because these runtimes are OpenAI-compatible, you can also use the official OpenAI client against the deployed proxy, as sketched below.
* Protect API keys using Kubernetes Secrets rather than embedding them in ConfigMaps.
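
For example, a minimal sketch using the official `openai` Python client (the `api_key` value is a placeholder unless you configure authentication on the proxy, and `<litellm-external-ip>` stands for the Service IP obtained above):

```py
from openai import OpenAI

# Point the standard OpenAI client at the LiteLLM proxy.
client = OpenAI(base_url="http://<litellm-external-ip>:4000", api_key="not-needed")

response = client.chat.completions.create(
    model="chatgpt",
    messages=[{"role": "user", "content": "What is the capital of Scotland?"}],
    extra_body={"tags": ["local"]},  # optional: target a tag-filtered deployment
)
print(response.choices[0].message.content)
```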

