Autoscaling Servers

Learn how to leverage Core 2's native autoscaling functionality for Servers

Core 2 runs with long-lived server replicas, each able to host multiple models (through Multi-Model Serving, or MMS). The server replicas can be autoscaled natively by Core 2 in response to dynamic changes in the requested number of model replicas, allowing users to seamlessly optimize the infrastructure cost associated with their deployments.

This document outlines the autoscaling policies and mechanisms available for server replicas. These policies ensure that server replicas are scaled up or down in response to changes in the number of replicas requested for each model. In other words, if a given model is scaled up, the system scales up the server replicas in order to host the new model replicas. Similarly, if a given model is scaled down, the system may scale down the number of replicas of the server hosting that model, depending on which other models are loaded on the same server replicas.

Note: Autoscaling of servers is required in the case of Multi-Model Serving, as models are dynamically loaded onto and unloaded from the server replicas. In this case Core 2 autoscales server replicas according to changes in the number of model replicas required. This is in contrast to the single-model autoscaling approach explained here, where Server and Model replicas are scaled independently using an HPA that targets the same metric.

To enable autoscaling of server replicas, the following requirements need to be met:

  1. Setting minReplicas and maxReplicas in the Server CR. These define the minimum and maximum number of server replicas that can be created.

  2. Setting the autoscaling.autoscalingServerEnabled value to true (the default) during installation of the Core 2 seldon-core-v2-setup helm chart. If not installing via helm, setting the ENABLE_SERVER_AUTOSCALING environment variable to true in the seldon-scheduler podSpec (via either SeldonConfig or a SeldonRuntime podSpec override) has the same effect. Both options are sketched below.
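For instance, enabling the feature via Helm values might look as follows (a sketch; only the autoscaling section of the values file is shown):

autoscaling:
  autoscalingServerEnabled: true

Without Helm, a SeldonRuntime podSpec override can set the environment variable on the scheduler directly (a sketch; the container name and override structure are assumed from the default SeldonConfig and should be adjusted to your installation):

apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
  name: seldon
  namespace: seldon
spec:
  seldonConfig: default
  overrides:
    - name: seldon-scheduler
      podSpec:
        containers:
          - name: scheduler
            env:
              - name: ENABLE_SERVER_AUTOSCALING
                value: "true"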

An example of a Server CR with autoscaling enabled is shown below:

apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
  namespace: seldon
spec:
  replicas: 2
  minReplicas: 1
  maxReplicas: 4
  serverConfig: mlserver

Note: Not setting minReplicas and/or maxReplicas will also effectively disable autoscaling of server replicas. In this case, the user needs to manually scale the server replicas by setting the replicas field in the Server CR. This allows external autoscaling mechanisms, e.g. an HPA, to be used instead, as sketched below.
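For example, an external HPA could target the Server resource directly (a sketch, assuming the Server CRD exposes a scale subresource; the metric and thresholds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mlserver-hpa
  namespace: seldon
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    name: mlserver
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80

In this setup, minReplicas and maxReplicas in the Server CR itself should remain unset so that the two autoscaling mechanisms do not conflict.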

Server Scaling Logic

Scale Up

When we want to scale up the number of replicas for a model, the associated servers might not have enough capacity (replicas) available. In this case we need to scale up the server replicas to match the number required by our models. There is currently only one policy for scaling up server replicas: Model Replica Count. This policy scales up the server replicas to match the number of model replicas required; in other words, if a model is scaled up, the system scales up the hosting server's replicas accordingly. During the scale-up process, the system creates new server replicas with the same configuration as the existing ones (server configuration, resources, etc.), adds them to the existing set, and uses them to host the new model replicas.
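As an illustration, scaling the Model below from 2 to 3 replicas requires a third server replica, so the mlserver Server from the earlier example would be scaled from 2 to 3 replicas (within its maxReplicas bound). The model name and storageUri are illustrative:

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
  namespace: seldon
spec:
  storageUri: gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn
  requirements:
    - sklearn
  replicas: 3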

There is a period of time during which the new server replicas are being created and the new model replicas are being loaded onto them. During this period, the system ensures that the existing server replicas continue to serve load so that there is no downtime during the scale-up process. This is achieved by partially scheduling the new model replicas onto the new server replicas, so that the new server replicas are gradually loaded while the existing ones keep serving traffic. Check the Partial Scheduling document for more details.

Scale Down

Once we have scaled down the number of replicas for a model, some of the corresponding server replicas might be left unused (depending on whether they are hosting other models or not). If so, the extra server pods incur unnecessary infrastructure cost (especially if they have expensive resources such as GPUs attached). Scaling down servers in sync with models is not straightforward in the case of Multi-Model Serving: scaling down one model does not necessarily mean that the corresponding server replica should also be scaled down, as that replica might still be serving load for other models. We therefore define heuristics to scale down servers that appear underutilized, described in the policies below.

Note: Scaling down the number of replicas for an inference server does not necessarily remove the specific replica we would like to remove. Because Servers are currently deployed as StatefulSets, scaling down removes the pod with the largest ordinal index (for example, scaling a Server named mlserver from 3 replicas to 2 removes the pod mlserver-2).

When Servers are scaled down, the system rebalances: models on the draining server replica are rescheduled onto other server replicas after some wait time. This draining process incurs no downtime, as the models are rescheduled before the draining server replica is removed.

Policies

There are two policies that define when Servers are scaled down:

  1. Empty Server Replica: In the simplest case we can remove a server replica if it does not host any models. This guarantees that there is no load on a particular server replica before removing it. This policy works best in the case of single model serving where the server replicas are only hosting a single model. In this case, if the model is scaled down, the server replica will be empty and can be removed.

However, in the case of MMS, reducing the number of server replicas only when one of them no longer hosts any models can lead to a suboptimal packing of models onto server replicas, because the system does not automatically consolidate models onto a smaller set of replicas. As a result, more server replicas may be used than necessary. This is mitigated by the Lightly Loaded Server Replicas policy.

  2. Lightly Loaded Server Replicas (Experimental):

Using the above policy with MMS enabled, different model replicas are hosted on potentially different server replicas, and as these models are scaled up and down the system can end up in a situation where the models are not consolidated onto an optimal number of servers. For illustration, take the case of 3 models: A, B and C. We have 1 server S with 2 replicas, S_1 and S_2, that can host these 3 models. Assuming that initially A and B have 1 replica each and C has 2 replicas, the assignment is:

Initial assignment:

  • S_1: A_1, C_1

  • S_2: B_1, C_2

Now if the user unloads model C, the assignment is:

  • S_1: A_1

  • S_2: B_1

There is an argument that this might not be optimal; with MMS the assignment could instead be:

  • S_1: A_1, B_1

  • S_2: removed

As the system evolves, this imbalance can grow and cause the serving infrastructure to become less optimized. This behavior is not actually limited to autoscaling, but autoscaling aggravates the issue, causing more imbalance over time. The imbalance can be mitigated by making the following observation: if the maximum number of replicas of any given model (assigned to a server from a logical point of view) is less than the number of replicas of that server, then we can pack the hosted models onto a smaller set of replicas. Note that in Core 2 a server replica can host at most 1 replica of a given model.
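Stated more formally (an informal sketch, writing models(S) for the set of models assigned to server S and replicas(·) for replica counts):

$$\max_{m \,\in\, \mathrm{models}(S)} \mathrm{replicas}(m) \;<\; \mathrm{replicas}(S)$$

When this condition holds, all model replicas fit onto max_m replicas(m) server replicas, because each server replica hosts at most one replica of any given model.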

In other words, consider the following example: models A and B have 2 replicas each and server S has 3 replicas. The following assignment is potentially not optimal:

  • S_1: A_1, B_1

  • S_2: A_2

  • S_3: B_2

In this case we can trigger the removal of S_3, which packs the models more appropriately:

  • S_1: A_1, B_1

  • S_2: A_2, B_2

  • S_3: removed

While this heuristic packs models onto a smaller set of replicas, allowing servers to be scaled down, there is still a risk that the packing increases latencies or triggers a later scale-up; Core 2 ensures consistent behavior without reverting between states. The user can also reduce the number of packing events by setting autoscaling.serverPackingPercentage to a lower value.
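For example, in the Helm values for seldon-core-v2-setup (a sketch; the value shown is illustrative and its expected range should be checked against the chart defaults):

autoscaling:
  autoscalingServerEnabled: true
  serverPackingPercentage: "0.5"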

Currently, Core 2 triggers the packing logic only when a model replica is removed, either from a model scale-down or a model deletion. In the future, the logic might be triggered more frequently to improve the packing of models onto a smaller set of replicas.
