Routing (with LiteLLM)
This guide describes how to deploy a LiteLLM proxy server to route requests between multiple LLM runtimes managed by Seldon Core 2. It assumes that the LLM Module is installed and that both the API and local runtimes are available in your cluster. For installation instructions, see our installation tutorial.
Overview
In this tutorial, you will:
Deploy two LLMs: an OpenAI-compatible chat model on the API runtime and a small Hugging Face model (SmolLM) on the local runtime.
Configure and test LiteLLM locally against the cluster endpoints.
Deploy the LiteLLM proxy in-cluster for production use.
After completing this tutorial, your system architecture will look like the diagram below: LiteLLM acts as the entry point that routes each request to a specific model, and Envoy is then used by Core 2 to route the request to the right replica of that model when multiple replicas run in parallel.

Prerequisites
Seldon Core 2 installed and configured.
Seldon's LLM module deployed (including both API and local runtimes).
Access to a cloud storage location for model settings (e.g., GCS bucket).
[Optional] kubectl configured to access the target cluster namespace (this example uses the seldon-mesh namespace).
Deploying the models
API model
The API model's manifest references a storageUri containing a model-settings.json file. Adjust the storage URI to point at your own bucket.
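A sketch of such a Model resource is shown below; the model name, bucket path, and the requirements entry are illustrative placeholders, not fixed values from the LLM Module.

```yaml
# Illustrative Model manifest for the API runtime.
# The requirements entry below is a placeholder -- use the capability
# advertised by your API llm-runtime (check the LLM Module docs).
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: chat-model
  namespace: seldon-mesh
spec:
  storageUri: gs://<your-bucket>/chat-model   # folder containing model-settings.json
  requirements:
  - openai   # placeholder capability name for the API (OpenAI-compatible) runtime
```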
Local model
Upload a model-settings.json file such as:
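A hypothetical sketch of such a file in MLServer's model-settings format; the implementation string and parameter keys below are placeholders rather than the LLM Module's actual identifiers, so consult its documentation for the exact values.

```json
{
  "name": "smollm",
  "implementation": "<local-runtime-implementation-class>",
  "parameters": {
    "extra": {
      "model": "HuggingFaceTB/SmolLM2-135M-Instruct",
      "device": "cpu"
    }
  }
}
```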
Then create a Model resource (example):
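Such a resource might look like the following sketch; as above, the requirements entry is a placeholder for whichever capability your local LLM runtime advertises.

```yaml
# Illustrative Model manifest for the local runtime.
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: smollm
  namespace: seldon-mesh
spec:
  storageUri: gs://<your-bucket>/smollm   # folder containing the model-settings.json above
  requirements:
  - llm-local   # placeholder capability name for the local runtime
```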
Apply both model manifests:
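Assuming the manifests above were saved as chat-model.yaml and smollm-model.yaml (hypothetical filenames):

```bash
kubectl apply -f chat-model.yaml -n seldon-mesh
kubectl apply -f smollm-model.yaml -n seldon-mesh

# Wait until both models report Ready
kubectl get models -n seldon-mesh
```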
Note: The SmolLM example is CPU-based for convenience. For GPU-based models, see this doc.
Configure LiteLLM
Obtain the external service IP for the Seldon mesh:
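For example, if the mesh is exposed through the seldon-mesh LoadBalancer Service:

```bash
kubectl get svc seldon-mesh -n seldon-mesh \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```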
Place that IP into the LiteLLM config file. Example config (local testing):
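A minimal sketch of a local config.yaml, assuming the runtimes expose OpenAI-compatible endpoints behind the mesh IP; the api_base paths below are placeholders to adapt to how your llm-runtimes expose their endpoints.

```yaml
# config.yaml -- local testing sketch; replace <SELDON_MESH_IP> with the IP
# obtained above.
model_list:
  - model_name: chat-model
    litellm_params:
      model: openai/chat-model            # "openai/" prefix selects LiteLLM's OpenAI-compatible provider
      api_base: http://<SELDON_MESH_IP>/v2/models/chat-model   # placeholder path
      api_key: "none"                      # real keys live in cluster Secrets, not here
  - model_name: smollm
    litellm_params:
      model: openai/smollm
      api_base: http://<SELDON_MESH_IP>/v2/models/smollm       # placeholder path
      api_key: "none"
```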
API keys are typically handled via Kubernetes Secrets accessed by the llm-runtimes, so the local config does not need to include them.
Run the LiteLLM proxy locally to verify routing:
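With the LiteLLM proxy extras installed, point the CLI at the config file (port 4000 is LiteLLM's default):

```bash
pip install 'litellm[proxy]'
litellm --config config.yaml --port 4000
```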
Requests sent to the local proxy are routed to the appropriate model according to the proxy configuration.
Example request in Python
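A minimal sketch using the OpenAI client against the local proxy; the model name matches the model_name entries in the config above.

```python
from openai import OpenAI

# Point the OpenAI client at the local LiteLLM proxy (default port 4000).
client = OpenAI(base_url="http://localhost:4000", api_key="none")

response = client.chat.completions.create(
    model="smollm",  # model_name from the LiteLLM config
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
)
print(response.choices[0].message.content)
```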
Deploying to Production
To deploy the LiteLLM proxy server, perform three steps:
Create a ConfigMap to store the LiteLLM configuration
Create a Deployment for the LiteLLM proxy, and
Create a Service to expose an external IP for queries.
The ConfigMap, Deployment, and Service can be placed together in a single file (for example, litellm.yaml). For more details see the deployment and production documentation for LiteLLM.
Configuration
Use the following ConfigMap for the LiteLLM deployment. This example uses tag filtering so that requests are routed to specific LLMs based on their assigned tags. For other routing logic, such as fallbacks, load balancing within a model group, or routing based on availability, see LiteLLM's routing docs.
Use the internal service DNS (seldon-mesh.seldon-mesh.svc.cluster.local) for in-cluster routing.
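A sketch of such a ConfigMap, reusing the models registered above; the tags, the router setting, and the api_base paths are illustrative, so check LiteLLM's tag-routing docs and your runtimes' endpoint paths for the exact values.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: litellm-config
  namespace: seldon-mesh
data:
  config.yaml: |
    # In-cluster config: route through the internal seldon-mesh Service.
    model_list:
      - model_name: chat-model
        litellm_params:
          model: openai/chat-model
          api_base: http://seldon-mesh.seldon-mesh.svc.cluster.local/v2/models/chat-model  # placeholder path
          api_key: "none"
          tags: ["api"]
      - model_name: smollm
        litellm_params:
          model: openai/smollm
          api_base: http://seldon-mesh.seldon-mesh.svc.cluster.local/v2/models/smollm      # placeholder path
          api_key: "none"
          tags: ["local"]
    router_settings:
      enable_tag_filtering: true   # enables LiteLLM's tag-based routing
```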
Deployment
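A minimal Deployment sketch that runs the LiteLLM proxy image and mounts the ConfigMap above; the image tag, replica count, and mount path are illustrative choices.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: litellm-proxy
  namespace: seldon-mesh
spec:
  replicas: 1
  selector:
    matchLabels:
      app: litellm-proxy
  template:
    metadata:
      labels:
        app: litellm-proxy
    spec:
      containers:
        - name: litellm
          image: ghcr.io/berriai/litellm:main-latest   # pin a specific tag in production
          args: ["--config", "/etc/litellm/config.yaml", "--port", "4000"]
          ports:
            - containerPort: 4000
          volumeMounts:
            - name: config
              mountPath: /etc/litellm
      volumes:
        - name: config
          configMap:
            name: litellm-config
```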
Service
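A LoadBalancer Service sketch that exposes the proxy externally on port 80, forwarding to the container port used above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: litellm-proxy
  namespace: seldon-mesh
spec:
  type: LoadBalancer
  selector:
    app: litellm-proxy
  ports:
    - port: 80
      targetPort: 4000
```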
Apply the resources:
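Assuming the three resources above are saved together in litellm.yaml:

```bash
kubectl apply -f litellm.yaml
kubectl rollout status deployment/litellm-proxy -n seldon-mesh
```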
Testing the deployed proxy
Get the external IP for the LiteLLM service:
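Assuming the Service name used above:

```bash
kubectl get svc litellm-proxy -n seldon-mesh \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
```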
Example programmatic test:
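A sketch that queries both models through the deployed proxy; replace the placeholder IP with the external IP returned above.

```python
from openai import OpenAI

# Replace <LITELLM_EXTERNAL_IP> with the external IP of the litellm-proxy Service.
client = OpenAI(base_url="http://<LITELLM_EXTERNAL_IP>", api_key="none")

for model in ("chat-model", "smollm"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarise what you are in one sentence."}],
    )
    print(model, "->", response.choices[0].message.content)
```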
Final Notes
Because these runtimes are OpenAI-compatible, the OpenAI client may be used against the deployed endpoints.
Protect API keys using Kubernetes Secrets and avoid embedding them in ConfigMaps.