Server Autoscaling
Overview
Core 2 runs with long-lived server replicas, each able to host multiple models (through multi-model serving, or MMS). Core 2 can natively autoscale these server replicas in response to dynamic changes in the requested number of model replicas, allowing users to seamlessly optimize the infrastructure cost associated with their deployments.
This document outlines the policies and mechanisms available for autoscaling server replicas. These policies are designed to ensure that server replicas are increased (scaled up) or decreased (scaled down) in response to changes in the number of replicas requested for each model. In other words, if a given model is scaled up, the system will scale up the server replicas in order to host the new model replicas. Similarly, if a given model is scaled down, the system may scale down the number of replicas of the server hosting the model, depending on which other models are loaded on the same server replicas.
Requirements
To enable autoscaling of server replicas, the following requirements need to be met:
- Setting `minReplicas` and `maxReplicas` in the `Server` CR. This will define the minimum and maximum number of server replicas that can be created.
- Setting the `autoscaling.autoscalingServerEnabled` value to `true` (the default) during installation of the Core 2 `seldon-core-v2-setup` helm chart. If not installing via helm, setting the `ENABLE_SERVER_AUTOSCALING` environment variable to `true` in the `seldon-scheduler` podSpec (via either a SeldonConfig or a SeldonRuntime podSpec override) will have the same effect. This will enable the autoscaling of server replicas.
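For instance, a minimal helm values snippet for the helm-based route would look as follows (a sketch; the key is the one described above, and `true` is already the default):

```yaml
# values.yaml (sketch) for the seldon-core-v2-setup chart:
# enable native autoscaling of server replicas.
autoscaling:
  autoscalingServerEnabled: true
```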
An example of a `Server` CR with autoscaling enabled is shown below.
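This is a minimal sketch: the resource name and `serverConfig` value are illustrative, while `minReplicas` and `maxReplicas` are the fields described above.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver  # illustrative name
spec:
  serverConfig: mlserver  # illustrative server configuration
  minReplicas: 1  # lower bound for server autoscaling
  maxReplicas: 4  # upper bound for server autoscaling
```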
Server Scale Up
Overview
When we want to scale up the number of replicas for a model, the associated servers might not have enough capacity (replicas) available. In this case we need to scale up the server replicas to match the number required by our models.
Policies
There is currently only one policy for scaling up server replicas:
Model Replica Count:
This policy scales up the server replicas to match the number of model replicas that are required. In other words, if a model is scaled up, the system will scale up the server replicas to host the new model replicas. This policy is simple to implement and ensures that the server replicas are scaled up in response to changes in the number of required model replicas.
During the scale-up process, the system creates new server replicas with the same configuration as the existing ones (server configuration, resources, etc.) and adds them to the existing set to host the new model replicas.
There is a period of time during which the new server replicas are being created and the new model replicas are being loaded onto them. During this period, the system ensures that the existing server replicas are still serving load, so that there is no downtime during the scale-up process. This is achieved by partial scheduling of the new model replicas onto the new server replicas: the new server replicas are gradually loaded with the new model replicas while the existing ones continue to serve load. Check the Partial Scheduling document for more details.
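As an illustration of what drives this process, scaling up a model is simply a matter of requesting more replicas in its `Model` CR. The example below is hypothetical (the name and `storageUri` are placeholders); raising `replicas` may cause the scheduler to scale up the hosting server, subject to its `maxReplicas`:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris  # hypothetical model name
spec:
  storageUri: "gs://example-bucket/models/iris"  # placeholder URI
  requirements:
  - sklearn
  replicas: 3  # requesting 3 model replicas; server replicas scale up to match
```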
Server Scale Down
Overview
Once we have scaled down the number of replicas for a model, some of the corresponding server replicas might be left unused (depending on whether those replicas also host other models). In this case, the extra server pods waste resources and cause additional infrastructure cost (especially if they have expensive resources such as GPUs attached).
Scaling down servers in sync with models is not straightforward in the case of multi-model serving. Scaling down one model does not necessarily mean that we also need to scale down the corresponding server replica, as this replica might still be serving load for other models.
Therefore, we need to define heuristics that can be used to scale down servers when they appear underutilized.
Policies
Empty Server Replica:
In the simplest case we can remove a server replica if it does not host any models. This guarantees that there is no load on a particular server replica before removing it.
This policy works best in the case of single model serving where the server replicas are only hosting a single model. In this case, if the model is scaled down, the server replica will be empty and can be removed.
However, in the case of MMS, reducing the number of server replicas only when one of them no longer hosts any models can lead to a suboptimal packing of models onto server replicas: the system will not automatically pack models onto a smaller set of replicas, so more server replicas may be used than necessary. This can be mitigated by the Lightly Loaded Server Replicas policy.
Lightly Loaded Server Replicas (Experimental):
Warning: This policy is experimental and is not enabled by default. It can be enabled by setting `autoscaling.serverPackingEnabled` to `true` and `autoscaling.serverPackingPercentage` to a value between 0 and 100. This policy is still under development and might in some cases increase latencies, so it is worth testing ahead of time to observe its behavior for a given setup.
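As a sketch, the corresponding helm values for the `seldon-core-v2-setup` chart (using the keys named in the warning above; the percentage value is arbitrary) would look like:

```yaml
# values.yaml (sketch): enable experimental packing of models
# onto fewer server replicas.
autoscaling:
  serverPackingEnabled: true
  serverPackingPercentage: 50  # between 0 and 100; lower values mean fewer packing events
```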
Consider, as an illustration, a server with 3 replicas hosting Model A (2 replicas) and Model B (1 replica).

Initial assignment:

| Server replica | Hosted models |
| --- | --- |
| replica 0 | Model A |
| replica 1 | Model A |
| replica 2 | Model B |

There is an argument that this assignment might not be optimized, and that with MMS the assignment could be:

| Server replica | Hosted models |
| --- | --- |
| replica 0 | Model A, Model B |
| replica 1 | Model A |

As the system evolves, this imbalance can get larger and make the serving infrastructure less optimized. This behavior is not limited to autoscaling; however, autoscaling will aggravate the issue, causing more imbalance over time.
This imbalance can be mitigated by making the following observation: if the maximum number of replicas of any given model assigned (logically) to a server is less than the number of replicas for this server, then we can pack the hosted models onto a smaller set of server replicas. Note that in Core 2, a server replica can host at most one replica of a given model.
While this heuristic packs models onto a smaller set of replicas, which allows us to scale the server down, there is still a risk that the packing increases latencies and triggers a later scale-up. Core 2 tries to make sure that the system does not flip-flop between these states. The user can also reduce the number of packing events by setting `autoscaling.serverPackingPercentage` to a lower value.
Currently, Core 2 triggers the packing logic only when a model replica is removed, either because a model is scaled down or because it is deleted. In the future, we might trigger this logic more frequently to ensure that models are packed onto fewer replicas.