
Routing (with LiteLLM)

This guide describes how to deploy a LiteLLM proxy server to route requests between multiple LLM runtimes managed by Seldon Core 2. It assumes that the LLM Module is installed and that both the API and local runtimes are available in your cluster. For installation instructions, see our installation tutorial.

Overview

In this tutorial, you will:

  1. Deploy two LLMs: an OpenAI-compatible chat model on the API runtime and a small Hugging Face model (SmolLM) on the local runtime.

  2. Configure and test LiteLLM locally against the cluster endpoints.

  3. Deploy the LiteLLM proxy in-cluster for production use.

After completing this tutorial, you will have an updated system architecture as depicted below: LiteLLM acts as the entry point that routes each request to a specific model, and Envoy is then used by Core 2 to route that request to the right instance of the model when multiple replicas run in parallel.

Figure: LiteLLM routing architecture (litellm-architecture.png)

Prerequisites

  • Seldon Core 2 installed and configured.

  • Seldon's LLM module deployed (including both API and local runtimes).

  • Access to a cloud storage location for model settings (e.g., GCS bucket).

  • [Optional] kubectl configured to access the target cluster namespace (this example uses namespace seldon-mesh).

Deploying the models

API model

Create a Model resource for the OpenAI-compatible chat model. It references a storageUri containing a model-settings.json (example below); adjust the storage URI to point at your bucket.
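A sketch of such a Model resource is shown below, assuming the model is named chat-gpt and that the API runtime advertises an openai capability; the requirement name, model name, and bucket path are placeholders to adapt to your installation.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: chat-gpt
  namespace: seldon-mesh
spec:
  # folder in your bucket that contains the model-settings.json for the API runtime
  storageUri: gs://<your-bucket>/llms/chat-gpt
  requirements:
  - openai   # placeholder: use the capability name advertised by your API runtime
```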

Local model

Upload a model-settings.json file such as:
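The exact contents depend on the LLM Module's local runtime, so the sketch below is only indicative: it assumes an MLServer-style settings file in which the Hugging Face model id and device are passed as extra parameters, and the implementation class is a placeholder to replace with the class documented for your LLM Module version.

```json
{
  "name": "smollm",
  "implementation": "<local-runtime-implementation-class>",
  "parameters": {
    "extra": {
      "model": "HuggingFaceTB/SmolLM2-135M-Instruct",
      "device": "cpu"
    }
  }
}
```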

Then create a Model resource (example):
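A sketch of the corresponding Model resource, assuming the settings file above was uploaded to a bucket folder and that the local runtime advertises a capability such as llm-local (both are placeholders):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: smollm
  namespace: seldon-mesh
spec:
  # folder in your bucket that contains the model-settings.json uploaded above
  storageUri: gs://<your-bucket>/llms/smollm
  requirements:
  - llm-local   # placeholder: use the capability name advertised by your local runtime
```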

Apply both model manifests:
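For example, assuming the two manifests above were saved as chat-gpt-model.yaml and smollm-model.yaml (hypothetical filenames):

```bash
kubectl apply -f chat-gpt-model.yaml -f smollm-model.yaml -n seldon-mesh

# wait until both models report Ready
kubectl wait --for=condition=Ready model/chat-gpt model/smollm -n seldon-mesh --timeout=300s
```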

Note: The SmolLM example is CPU-based for convenience. For GPU-based models, see this doc.

Configure LiteLLM

Obtain the external service IP for the Seldon mesh:
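For example, if Core 2 exposes the mesh through the seldon-mesh LoadBalancer service in the seldon-mesh namespace:

```bash
kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```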

Place that IP into the LiteLLM config file. Example config (local testing):
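The sketch below assumes both models expose OpenAI-compatible endpoints behind the Seldon mesh and that Core 2 selects the model via the Seldon-Model header; the api_base paths are placeholders, so replace <SELDON_MESH_IP> with the IP obtained above and adjust paths and headers to match your runtimes.

```yaml
model_list:
  - model_name: chat-gpt
    litellm_params:
      model: openai/chat-gpt
      api_base: http://<SELDON_MESH_IP>/v2/models/chat-gpt   # placeholder: your runtime's OpenAI-compatible base URL
      api_key: "not-needed"   # real keys are injected by the runtime via Kubernetes Secrets
      extra_headers: {"Seldon-Model": "chat-gpt"}
  - model_name: smollm
    litellm_params:
      model: openai/smollm
      api_base: http://<SELDON_MESH_IP>/v2/models/smollm     # placeholder: your runtime's OpenAI-compatible base URL
      api_key: "not-needed"
      extra_headers: {"Seldon-Model": "smollm"}
```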


API keys are typically handled by Kubernetes Secrets accessed by the llm-runtimes; local config does not need to include them.

Run the LiteLLM proxy locally to verify routing:
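For example, with the configuration saved as config.yaml:

```bash
pip install 'litellm[proxy]'

# start the proxy on port 4000 using the config above
litellm --config config.yaml --port 4000
```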

Requests to the local proxy will be routed to the configured models according to the proxy configuration.

Example request in Python
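A minimal sketch using the OpenAI Python client against the local proxy; the model names are the ones defined in the config above, and the API key is just a placeholder the client requires.

```python
from openai import OpenAI

# point the client at the local LiteLLM proxy
client = OpenAI(base_url="http://localhost:4000", api_key="anything")

response = client.chat.completions.create(
    model="smollm",  # or "chat-gpt"
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
)
print(response.choices[0].message.content)
```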

Deploying to Production

To deploy the LiteLLM proxy server, perform three steps:

  1. Create a ConfigMap to store the LiteLLM configuration.

  2. Create a Deployment for the LiteLLM proxy.

  3. Create a Service to expose an external IP for queries.

The ConfigMap, Deployment, and Service can be placed together in a single file (for example, litellm.yaml). For more details, see the deployment and production documentation for LiteLLM.

Configuration

Use the following ConfigMap for the LiteLLM deployment. In this example we implement tag filtering to permit routing to specific LLMs based on assigned tags. For implementing other routing logic, such as fallbacks, load balancing within a model group, or routing based on availability, see LiteLLM's routing docs.


Use the internal service DNS (seldon-mesh.seldon-mesh.svc.cluster.local) for in-cluster routing.
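A sketch of the ConfigMap, assuming the same two models as above and LiteLLM's tag-based routing (tags on each deployment plus enable_tag_filtering in router_settings); the api_base paths and tag names are placeholders to adapt.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: litellm-config
  namespace: seldon-mesh
data:
  config.yaml: |
    model_list:
      - model_name: chat-gpt
        litellm_params:
          model: openai/chat-gpt
          api_base: http://seldon-mesh.seldon-mesh.svc.cluster.local/v2/models/chat-gpt   # placeholder path
          api_key: "not-needed"
          extra_headers: {"Seldon-Model": "chat-gpt"}
          tags: ["paid"]
      - model_name: smollm
        litellm_params:
          model: openai/smollm
          api_base: http://seldon-mesh.seldon-mesh.svc.cluster.local/v2/models/smollm     # placeholder path
          api_key: "not-needed"
          extra_headers: {"Seldon-Model": "smollm"}
          tags: ["free"]
    router_settings:
      enable_tag_filtering: True
```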

Deployment
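A sketch of the Deployment, assuming the official ghcr.io/berriai/litellm image and the ConfigMap above mounted as the proxy configuration; pin the image tag and size the resources for your environment.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm
  namespace: seldon-mesh
spec:
  replicas: 1
  selector:
    matchLabels:
      app: litellm
  template:
    metadata:
      labels:
        app: litellm
    spec:
      containers:
      - name: litellm
        image: ghcr.io/berriai/litellm:main-latest   # pin a specific tag in production
        args: ["--config", "/etc/litellm/config.yaml", "--port", "4000"]
        ports:
        - containerPort: 4000
        volumeMounts:
        - name: config
          mountPath: /etc/litellm
      volumes:
      - name: config
        configMap:
          name: litellm-config
```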

Service
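A sketch of a LoadBalancer Service exposing the proxy on port 80:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: litellm
  namespace: seldon-mesh
spec:
  type: LoadBalancer
  selector:
    app: litellm
  ports:
  - port: 80
    targetPort: 4000
```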

Apply resources:
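Assuming the three resources above were combined into litellm.yaml:

```bash
kubectl apply -f litellm.yaml -n seldon-mesh

# check that the proxy pod comes up
kubectl rollout status deployment/litellm -n seldon-mesh
```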

Testing the deployed proxy

Get the external IP for the LiteLLM service:
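Using the litellm Service created above:

```bash
kubectl get svc litellm -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```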

Example programmatic test:
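A sketch mirroring the local test, assuming the Service above exposes the proxy on port 80 and that tags are passed in the request metadata to satisfy the tag filtering configured in the ConfigMap; replace <LITELLM_IP> with the external IP obtained above.

```python
from openai import OpenAI

# <LITELLM_IP> is the external IP of the litellm Service
client = OpenAI(base_url="http://<LITELLM_IP>", api_key="anything")

response = client.chat.completions.create(
    model="chat-gpt",
    messages=[{"role": "user", "content": "Summarise what Seldon Core 2 does."}],
    # assumption: tags travel in the request metadata when tag filtering is enabled
    extra_body={"metadata": {"tags": ["paid"]}},
)
print(response.choices[0].message.content)
```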

Final Notes

  • Because these runtimes are OpenAI-compatible, the OpenAI client may be used against the deployed endpoints.

  • Protect API keys using Kubernetes Secrets and avoid embedding them in ConfigMaps.
