This example illustrates how to use taints and tolerations with nodeAffinity or nodeSelector to assign GPU nodes to specific models.
Note: Configuration options depend on your cluster setup and the desired outcome. The Seldon CRDs for Seldon Core 2 Pods offer complete customization of Pod specifications, allowing you to apply additional Kubernetes customizations as needed.
To serve a model on a dedicated GPU node, you should follow these steps:
Note: To dedicate a set of nodes to run only a specific group of inference servers, you must first provision an additional set of nodes within the Kubernetes cluster for the remaining Seldon Core 2 components. For more information about adding labels and taints to the GPU nodes in your Kubernetes cluster, refer to the respective cloud provider documentation.
You can add the taint when you are creating the node or after the node has been provisioned. You can apply the same taint to multiple nodes, not just a single node. A common approach is to define the taint at the node pool level.
When you apply a NoSchedule taint to a node after it is created, existing Pods that do not have a matching toleration may remain on the node without being evicted. To ensure that such Pods are removed, you can use the NoExecute taint effect instead.
In this example, the node includes several labels that are used later for node affinity settings. You may choose to specify some labels, while others are usually added by the cloud provider or a GPU operator installed in the cluster.
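As a rough illustration, a labeled and tainted GPU node might look like the following fragment of its Node object. The node name, the pool label, and the seldon-gpu-srv taint key are hypothetical names chosen for this sketch. The cloud.google.com/gke-accelerator label stands in for a provider-added label (here GKE); your provider or GPU operator may add different ones.

```yaml
# Sketch of a GPU node after labeling and tainting (fragment, not a complete Node object).
apiVersion: v1
kind: Node
metadata:
  name: example-gpu-node            # hypothetical node name
  labels:
    pool: infer-srv                 # hypothetical custom label that you add yourself
    cloud.google.com/gke-accelerator: nvidia-a100-80gb   # example of a provider-added label (GKE)
spec:
  taints:
    - key: seldon-gpu-srv           # hypothetical taint key
      value: "true"
      effect: NoSchedule            # use NoExecute to also evict Pods without a matching toleration
```

In practice the labels and taint are usually set at the node pool level or through your provider's tooling rather than by editing the Node object directly.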
To ensure a specific inference server Pod runs only on the nodes you've configured, you can use nodeSelector or nodeAffinity together with a toleration by modifying one of the following:
Seldon Server custom resource: Apply changes to each individual inference server.
ServerConfig custom resource: Apply settings across multiple inference servers at once.
Configuring Seldon Server custom resource
While nodeSelector requires an exact match of node labels for server Pods to select a node, nodeAffinity offers more fine-grained control. It enables a conditional approach by using logical operators in the node selection process. For more information, see Affinity and anti-affinity.
In this example, a nodeSelector and a toleration are set for the Seldon Server custom resource.
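A minimal sketch of such a Server custom resource is shown here. The server name, namespace, referenced ServerConfig name, label, and taint key are assumptions for illustration and need to match your own cluster.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-gpu                # hypothetical server name
  namespace: seldon-mesh            # assumed namespace of the Seldon runtime
spec:
  serverConfig: mlserver            # assumed name of the referenced ServerConfig
  replicas: 1
  podSpec:
    nodeSelector:                   # schedule only on nodes carrying all of these labels
      pool: infer-srv               # hypothetical custom node label
    tolerations:                    # tolerate the taint placed on the GPU nodes
      - key: seldon-gpu-srv         # hypothetical taint key
        operator: Equal
        value: "true"
        effect: NoSchedule
```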
In this example, a nodeAffinity and a toleration are set for the Seldon Server custom resource.
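The equivalent configuration with nodeAffinity could be sketched as follows. As before, the resource names, the pool label, and the taint key are hypothetical placeholders.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-gpu                # hypothetical server name
  namespace: seldon-mesh            # assumed namespace of the Seldon runtime
spec:
  serverConfig: mlserver            # assumed name of the referenced ServerConfig
  podSpec:
    affinity:
      nodeAffinity:
        # Hard requirement: the Pod is only scheduled on nodes matching these terms.
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: pool         # hypothetical custom node label
                  operator: In
                  values:
                    - infer-srv
    tolerations:                    # tolerate the taint placed on the GPU nodes
      - key: seldon-gpu-srv         # hypothetical taint key
        operator: Equal
        value: "true"
        effect: NoSchedule
```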
You can configure more advanced Pod selection using nodeAffinity, as in this example:
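One possible shape for such a rule combines several match expressions with different operators. This sketch assumes a hypothetical example.com/gpu-count node label holding an integer; all expressions within one matchExpressions list must be satisfied together.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-gpu                # hypothetical server name
  namespace: seldon-mesh            # assumed namespace of the Seldon runtime
spec:
  serverConfig: mlserver            # assumed name of the referenced ServerConfig
  podSpec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: example.com/gpu-count   # hypothetical label with an integer value
                  operator: Gt                 # node value must be greater than 1
                  values: ["1"]
                - key: pool                    # hypothetical custom node label
                  operator: In                 # node value must be one of the listed values
                  values: ["infer-srv", "infer-srv-large"]
    tolerations:
      - key: seldon-gpu-srv                    # hypothetical taint key
        operator: Exists                       # tolerate the taint regardless of its value
        effect: NoSchedule
```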
Configuring ServerConfig custom resource
This configuration automatically affects all servers using that ServerConfig, unless you specify server-specific overrides, which take precedence.
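As a hedged sketch, the same scheduling settings could be placed in the podSpec of a ServerConfig. This is only a fragment: a working ServerConfig also defines the inference server containers and related settings, which are omitted here, and all names are hypothetical.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: ServerConfig
metadata:
  name: mlserver-gpu-config         # hypothetical ServerConfig name
  namespace: seldon-mesh            # assumed namespace of the Seldon runtime
spec:
  podSpec:
    nodeSelector:                   # applies to every Server that references this ServerConfig
      pool: infer-srv               # hypothetical custom node label
    tolerations:
      - key: seldon-gpu-srv         # hypothetical taint key
        operator: Equal
        value: "true"
        effect: NoSchedule
    # containers: [...]             # container definitions omitted from this sketch
```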
When you have a set of inference servers running exclusively on GPU nodes, you can assign a model to one of those servers in two ways:
Custom model requirements (recommended)
Explicit server pinning
Here's the distinction between the two methods of assigning models to servers.
When you specify a requirement in the Model custom resource that matches a server capability, the model is loaded on any inference server whose capabilities match the requirements.
Ensure that the additional capability that matches the requirement label is added to the Server custom resource.
Instead of adding a capability using extraCapabilities on a Server custom resource, you may also add to the list of capabilities in the associated ServerConfig custom resource. This applies to all servers referencing the configuration.
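To illustrate matching by custom requirements, the sketch below pairs a Server that advertises an extra capability with a Model that requires it. The capability name model-on-gpu, the resource names, and the storage URI are hypothetical placeholders.

```yaml
# Server advertising an additional, custom capability.
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-gpu                # hypothetical server name
  namespace: seldon-mesh            # assumed namespace of the Seldon runtime
spec:
  serverConfig: mlserver            # assumed name of the referenced ServerConfig
  extraCapabilities:
    - model-on-gpu                  # hypothetical capability label
---
# Model that can only be placed on servers offering that capability.
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: example-gpu-model           # hypothetical model name
  namespace: seldon-mesh
spec:
  storageUri: "gs://example-bucket/models/example-gpu-model"   # hypothetical artifact location
  requirements:
    - model-on-gpu                  # must match a capability of the target server
```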
With explicit server pinning, the model is loaded on replicas of the inference server created by the referenced Server custom resource.
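Explicit server pinning can be sketched as follows, assuming a Server named mlserver-gpu already exists in the same namespace. The model name and storage URI are hypothetical placeholders.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: example-gpu-model           # hypothetical model name
  namespace: seldon-mesh            # assumed namespace of the Seldon runtime
spec:
  storageUri: "gs://example-bucket/models/example-gpu-model"   # hypothetical artifact location
  server: mlserver-gpu              # name of the Server custom resource to pin this model to
```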
Method | Behavior |
---|---|
Custom model requirements | If the assigned server cannot load the model due to insufficient resources, another similarly-capable server can be selected to load the model. |
Explicit pinning | If the specified server lacks sufficient memory or resources, the model load fails without trying another server. |
Learn more about using taints and tolerations with node affinity or node selector to allocate resources in a Kubernetes cluster.
When deploying machine learning models in Kubernetes, you may need to control which infrastructure resources these models use. This is especially important in environments where certain workloads, such as resource-intensive models, should be isolated from others, or where specific hardware, such as GPUs, needs to be dedicated to particular tasks. Without fine-grained control over workload placement, models might end up running on suboptimal nodes, leading to inefficiencies or resource contention.
For example, you may want to:
Isolate inference workloads from control plane components or other services to prevent resource contention.
Ensure that GPU nodes are reserved exclusively for models that require hardware acceleration.
Keep business-critical models on dedicated nodes to ensure performance and reliability.
Run external dependencies like Kafka on separate nodes to avoid interference with inference workloads.
To solve these problems, Kubernetes provides mechanisms such as taints, tolerations, and nodeAffinity or nodeSelector to control resource allocation and workload scheduling.
Taints are applied to nodes and tolerations to Pods to control which Pods can be scheduled on specific nodes within the Kubernetes cluster. Pods without a matching toleration for a node’s taint are not scheduled on that node. For instance, if a node has GPUs or other specialized hardware, you can prevent Pods that don’t need these resources from running on that node to avoid unnecessary resource usage.
Note: Taints and tolerations alone do not ensure that a Pod runs on a tainted node. Even if a Pod has the correct toleration, Kubernetes may still schedule it on other nodes without taints. To ensure a Pod runs on a specific node, you also need to use node affinity or node selector rules.
When used together, taints and tolerations with nodeAffinity or nodeSelector can effectively allocate certain Pods to specific nodes, while preventing other Pods from being scheduled on those nodes.
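The combination can be sketched with a plain Kubernetes Pod. The toleration lets the Pod onto tainted nodes, and the nodeSelector keeps it off every other node. The Pod name, image, label, and taint key are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-inference-pod       # hypothetical Pod name
spec:
  nodeSelector:                     # attracts the Pod to the labeled nodes only
    pool: infer-srv                 # hypothetical custom node label
  tolerations:                      # permits scheduling onto nodes with the matching taint
    - key: seldon-gpu-srv           # hypothetical taint key
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: inference
      image: registry.example.com/inference:latest   # hypothetical image
```

Pods without the toleration are kept off the tainted nodes, so the dedicated nodes run only the workloads intended for them.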
In a Kubernetes cluster running Seldon Core 2, this involves two key configurations:
Configuring servers to run on specific nodes using mechanisms like taints, tolerations, and nodeAffinity or nodeSelector.
Configuring models so that they are scheduled and loaded on the appropriate servers.
This ensures that models are deployed on the optimal infrastructure and servers that meet their requirements.