Scheduling

Core 2 architecture is built around decoupling Model and Server CRs to allow for multi-model deployment, enabling multiple models to be loaded and served on one server replica or a single Pod. Multi-model serving allows for more efficient use of resources, see Multi Model Serving for more information.

This architecture requires that Core 2 handles scheduling of models onto server pods natively. In particular Core 2 implements different sorters and filters which are used to find the best Server that is able to host a given Model. This process we describe in the following section.

Scheduling Process

Overview

The scheduling process in Core 2 identifies a suitable candidate server for a given model through a series of steps. These steps involve sorting and filtering servers primarily based on the following criteria:

Server has matching Capabilities with Model spec.requirements.
Server has enough replicas to load the desired spec.replicas of the Model.
Each replica of Server has enough available memory to load one replica of the model defined in spec.memory.
Server that already hosts the Model is preferred to reduce flip-flops between different candidate servers.

After a suitable candidate server is identified for a given model, Core 2 attempts to load the model onto it. If no matching server is found, the model is marked as ScheduleFailed.

This process is designed to be extensible, allowing for the addition of new filters in future versions to enhance scheduling decisions.

Note: A specific Model can only be assigned to at most one Server and therefore this Server requires enough replicas to host all replicas of the Model.

Partial Scheduling

Core 2 (from 2.9) is able to do partial scheduling of Models. Partial scheduling is defined as the loading of enough replicas of the model above spec.minReplicas and upto the number of available Server replicas. This allows the user a little bit more flexibility in serving traffic while optimising infrastructure provisioning.

To enable partial scheduling, spec.minReplicas needs to be defined as it provides Core 2 the minimum replicas of the model that is required for serving.

Partial scheduling does not have an explicit state; instead, the model is marked as ModelAvailable and is ready to serve traffic. The status of the Model CR can be inspected, where DESIRED REPLICAS and AVAILABLE REPLICAS provide insight into the number of replicas currently loaded in Core 2. Based on this information, the following logic are applied:

Fully Scheduled: READY is True and DESIRED REPLICAS is equal to AVAILABLE REPLICAS (STATUS is ModelAvailable)
Partially Scheduled: READY is TRUE and DESIRED REPLICAS is greater than AVAILABLE REPLICAS (STATUS is ModelAvailable)
Not Scheduled: Ready is False (Status is ScheduleFailed)

PreviousSecuring Endpoints NextMetrics

Last updated 15 days ago

Was this helpful?