This example illustrates how to use taints and tolerations together with nodeAffinity or nodeSelector to assign GPU nodes to specific models.
Note: Configuration options depend on your cluster setup and the desired outcome. The Seldon CRDs for Seldon Core 2 Pods offer complete customization of Pod specifications, allowing you to apply additional Kubernetes customizations as needed.
To serve a model on a dedicated GPU node, you should follow these steps:
Note: To dedicate a set of nodes to running only a specific group of inference servers, you must first provision an additional set of nodes within the Kubernetes cluster for the remaining Seldon Core 2 components. For more information about adding labels and taints to the GPU nodes in your Kubernetes cluster, refer to your cloud provider's documentation.
You can add the taint when you are creating the node or after the node has been provisioned. You can apply the same taint to multiple nodes, not just a single node. A common approach is to define the taint at the node pool level.
If you apply a NoSchedule taint to a node after it is created, existing Pods without a matching toleration may remain on the node without being evicted. To ensure that such Pods are removed, use the NoExecute taint effect instead.
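The labeling and tainting described above can be sketched with kubectl as follows. This is a minimal example; `example-node` is a placeholder node name, and the `pool=infer-srv` label and `seldon-gpu-srv` taint key match the values used in the manifests below.

```shell
# Label the GPU node so inference-server Pods can target it
# via nodeSelector or nodeAffinity.
kubectl label node example-node pool=infer-srv

# Taint the node so only Pods with a matching toleration are scheduled on it.
kubectl taint node example-node seldon-gpu-srv=true:NoSchedule

# Alternatively, use NoExecute to also evict Pods already running on the
# node that lack a matching toleration:
# kubectl taint node example-node seldon-gpu-srv=true:NoExecute

# Verify the labels and taints applied to the node.
kubectl describe node example-node
```

When using a managed node pool, the equivalent label and taint are usually set once at the pool level so that every node in the pool inherits them.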
In this example, the node includes several labels that are used later for node affinity settings. You may choose to specify some labels yourself, while others are usually added by the cloud provider or a GPU operator installed in the cluster.
```yaml
apiVersion: v1
kind: Node
metadata:
  name: example-node # Replace with the actual node name
  labels:
    pool: infer-srv # Custom label
    nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb-SHARED # Sample label from GPU discovery
    cloud.google.com/gke-accelerator: nvidia-a100-80gb # GKE without NVIDIA GPU operator
    cloud.google.com/gke-accelerator-count: "2" # Accelerator count
spec:
  taints:
    - effect: NoSchedule
      key: seldon-gpu-srv
      value: "true"
```
Configure inference servers
To ensure a specific inference server Pod runs only on the nodes you've configured, you can use nodeSelector or nodeAffinity together with a toleration by modifying one of the following:
Seldon Server custom resource: Apply changes to each individual inference server.
ServerConfig custom resource: Apply settings across multiple inference servers at once.
Configuring Seldon Server custom resource
While nodeSelector requires an exact match of node labels for server Pods to select a node, nodeAffinity offers more fine-grained control. It enables a conditional approach by using logical operators in the node selection process. For more information, see Affinity and anti-affinity.
In this example, a nodeSelector and a toleration are set on the Seldon Server custom resource.
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-local-gpu # <server name>
  namespace: seldon-mesh # <seldon runtime namespace>
spec:
  replicas: 1
  serverConfig: mlserver # <reference ServerConfig CR>
  extraCapabilities:
    - model-on-gpu # Custom capability for matching Model to this server
  podSpec:
    nodeSelector: # Schedule pods only on nodes with these labels
      pool: infer-srv
      cloud.google.com/gke-accelerator: nvidia-a100-80gb # Example requesting specific GPU on GKE
      # cloud.google.com/gke-accelerator-count: "2" # Optional GPU count
    tolerations: # Allow scheduling on nodes with the matching taint
      - effect: NoSchedule
        key: seldon-gpu-srv
        operator: Equal
        value: "true"
    containers: # Override settings from ServerConfig if needed
      - name: mlserver
        resources:
          requests:
            nvidia.com/gpu: 1 # Request a GPU for the mlserver container
            cpu: 40
            memory: 360Gi
            ephemeral-storage: 290Gi
          limits:
            nvidia.com/gpu: 2 # Limit to 2 GPUs
            cpu: 40
            memory: 360Gi
```
In this example, a nodeAffinity and a toleration are set on the Seldon Server custom resource.
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-local-gpu # <server name>
  namespace: seldon-mesh # <seldon runtime namespace>
spec:
  podSpec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: "pool"
                  operator: In
                  values:
                    - infer-srv
                - key: "cloud.google.com/gke-accelerator"
                  operator: In
                  values:
                    - nvidia-a100-80gb
    tolerations: # Allow mlserver-llm-local-gpu pods to be scheduled on nodes with the matching taint
      - effect: NoSchedule
        key: seldon-gpu-srv
        operator: Equal
        value: "true"
    containers: # If needed, override settings from ServerConfig for this specific Server
      - name: mlserver
        resources:
          requests:
            nvidia.com/gpu: 1 # Request a GPU for the mlserver container
            cpu: 40
            memory: 360Gi
            ephemeral-storage: 290Gi
          limits:
            nvidia.com/gpu: 2 # Limit to 2 GPUs
            cpu: 40
            memory: 360Gi
```
You can configure more advanced Pod selection using nodeAffinity, as in this example:
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-local-gpu # <server name>
  namespace: seldon-mesh # <seldon runtime namespace>
spec:
  podSpec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: "cloud.google.com/gke-accelerator-count"
                  operator: Gt # (greater than)
                  values: ["1"]
                - key: "gpu.gpu-vendor.example/installed-memory"
                  operator: Gt
                  values: ["75000"]
                - key: "feature.node.kubernetes.io/pci-10.present" # NFD feature label
                  operator: In
                  values: ["true"] # (optional) only schedule on nodes with PCI device 10
    tolerations: # Allow mlserver-llm-local-gpu pods to be scheduled on nodes with the matching taint
      - effect: NoSchedule
        key: seldon-gpu-srv
        operator: Equal
        value: "true"
    containers: # If needed, override settings from ServerConfig for this specific Server
      - name: mlserver
        env: ... # Add your environment variables here
        image: ... # Specify your container image here
        resources:
          requests:
            nvidia.com/gpu: 1 # Request a GPU for the mlserver container
            cpu: 40
            memory: 360Gi
            ephemeral-storage: 290Gi
          limits:
            nvidia.com/gpu: 2 # Limit to 2 GPUs
            cpu: 40
            memory: 360Gi
        ... # Other configurations can go here
```
Configuring ServerConfig custom resource
This configuration automatically applies to all servers that use the ServerConfig, unless you specify server-specific overrides, which take precedence.
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: ServerConfig
metadata:
  name: mlserver-llm # <ServerConfig name>
  namespace: seldon-mesh # <seldon runtime namespace>
spec:
  podSpec:
    nodeSelector: # Schedule pods only on nodes with these labels
      pool: infer-srv
      cloud.google.com/gke-accelerator: nvidia-a100-80gb # Example requesting specific GPU on GKE
      # cloud.google.com/gke-accelerator-count: "2" # Optional GPU count
    tolerations: # Allow scheduling on nodes with the matching taint
      - effect: NoSchedule
        key: seldon-gpu-srv
        operator: Equal
        value: "true"
    containers: # Define the container specifications
      - name: mlserver
        env: ... # Environment variables (fill in as needed)
        image: ... # Specify the container image
        resources:
          requests:
            nvidia.com/gpu: 1 # Request a GPU for the mlserver container
            cpu: 40
            memory: 360Gi
            ephemeral-storage: 290Gi
          limits:
            nvidia.com/gpu: 2 # Limit to 2 GPUs
            cpu: 40
            memory: 360Gi
        ... # Additional container configurations
```
Configuring models
When you have a set of inference servers running exclusively on GPU nodes, you can assign a model to one of those servers in two ways:
Custom model requirements (recommended)
Explicit server pinning
Here's the distinction between the two methods of assigning models to servers.
When you specify a requirement matching a server capability in the Model custom resource, the model is loaded on any inference server with a capability matching that requirement.
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llama3 # <model name>
  namespace: seldon-mesh # <seldon runtime namespace>
spec:
  requirements:
    - model-on-gpu # requirement matching a Server capability
```
Ensure that the additional capability that matches the requirement label is added to the Server custom resource.
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-llm-local-gpu # <server name>
  namespace: seldon-mesh # <seldon runtime namespace>
spec:
  serverConfig: mlserver # <reference ServerConfig CR>
  extraCapabilities:
    - model-on-gpu # custom capability that can be used for matching Model to this server
  # Other fields would go here
```
Instead of adding a capability using extraCapabilities on a Server custom resource, you may also add it to the list of capabilities in the associated ServerConfig custom resource. This applies to all servers that reference the configuration.
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: ServerConfig
metadata:
  name: mlserver-llm # <ServerConfig name>
  namespace: seldon-mesh # <seldon runtime namespace>
spec:
  podSpec:
    containers:
      - name: agent # note the setting is applied to the agent container
        env:
          - name: SELDON_SERVER_CAPABILITIES
            value: mlserver,alibi-detect,...,xgboost,model-on-gpu # add capability to the list
        image: ...
        # Other configurations go here
```
With these specifications, the model is loaded on replicas of inference servers created by the referenced Server custom resource.
```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: llama3 # <model name>
  namespace: seldon-mesh # <seldon runtime namespace>
spec:
  server: mlserver-llm-local-gpu # <reference Server CR>
  requirements:
    - model-on-gpu # requirement matching a Server capability
```
| Method | Behavior |
| --- | --- |
| Custom model requirements | If the assigned server cannot load the model due to insufficient resources, another similarly capable server can be selected to load the model. |
| Explicit pinning | If the specified server lacks sufficient memory or resources, the model load fails without trying another server. |
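After applying the Model manifest, you can check the result with standard kubectl commands. This is a minimal sketch using the example names from the manifests above (`llama3`, `seldon-mesh`); adjust them to your own resources.

```shell
# Check the Model custom resource; the status shows whether it is Ready
# or failed to find a server satisfying its requirements.
kubectl get model llama3 -n seldon-mesh

# List the Pods in the runtime namespace with their node assignments;
# the NODE column confirms the inference-server Pods landed on the GPU nodes.
kubectl get pods -n seldon-mesh -o wide
```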