Autoscaling Models

Learn how to leverage Core 2's native autoscaling functionality for Models

To set up autoscaling, users should first identify which metric they want to scale their models on. Seldon Core provides a native approach that autoscales models based on Inference Lag, and also supports more custom scaling logic via HPA (the Horizontal Pod Autoscaler), where custom metrics drive the scaling of Kubernetes resources. This page covers the first approach. Inference Lag is the difference between incoming and outgoing requests over a given period of time. If you choose this approach, it is recommended to also configure autoscaling for Servers, so that Models scale on Inference Lag and Servers in turn scale based on model needs.

This implementation of autoscaling is enabled if Core 2 is installed with the autoscaling.autoscalingModelEnabled Helm value set to true (default is false) and at least one of minReplicas or maxReplicas is set in the Model custom resource (see the Helm values sketch after the example below). The system then scales the number of replicas within this range according to lag, i.e. how far the model "falls behind" in serving inference requests. As an example, the following model will initially be deployed with 1 replica and will autoscale according to lag.

# samples/models/tfsimple_scaling.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: tfsimple
spec:
  storageUri: "gs://seldon-models/triton/simple"
  requirements:
  - tensorflow
  memory: 100Ki
  minReplicas: 1
  replicas: 1
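
To enable this behaviour at install time, set the Helm value when installing or upgrading Core 2. The fragment below is a minimal sketch of the relevant values; chart and release names vary by installation and are not shown here.

# values.yaml fragment enabling native model autoscaling
autoscaling:
  autoscalingModelEnabled: true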

When the system autoscales, the initial model spec is not changed (e.g. the number of replicas), so the user cannot reset the number of replicas back to the initially specified value without first making an explicit change to a different value. If only replicas is specified by the user, model autoscaling is disabled and the system will deploy exactly that number of replicas of the model, regardless of inference lag.
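
For comparison, a model can declare an explicit autoscaling range by setting both bounds. The model name and replica values below are illustrative, not taken from the samples:

# illustrative example: autoscaling bounded between 1 and 3 replicas
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: tfsimple-bounded
spec:
  storageUri: "gs://seldon-models/triton/simple"
  requirements:
  - tensorflow
  memory: 100Ki
  minReplicas: 1
  maxReplicas: 3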

The scale-up and scale-down logic, and its configurability, is described below:

  • Scale Up: With the approach described above, Inference Lag is used as the scaling metric. Inference Lag is the difference between incoming and outgoing requests in a given time period. If the lag crosses a threshold, a model scale-up event is triggered. This threshold can be defined via the SELDON_MODEL_INFERENCE_LAG_THRESHOLD inference server environment variable, and it applies to all models hosted on the Server where it is configured.

  • Scale Down: When model autoscaling is managed by Seldon Core, a model scale-down event is triggered if the model has not been used for a number of seconds. This period is defined via the SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD inference server environment variable.

  • Rate of metrics calculation: Each agent checks these statistics periodically, and if any model hits the corresponding threshold, the agent sends an event to the scheduler to request model scaling. How often this check runs can be defined via the SELDON_SCALING_STATS_PERIOD_SECONDS inference server environment variable. A sketch of setting these variables is shown after this list.
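
These environment variables are read by the agent container running alongside the inference server. One possible way to set them is sketched below, assuming a ServerConfig-style podSpec override with an agent container; the exact resource and override mechanism, as well as the values shown, are illustrative and may differ in your installation.

# illustrative fragment: setting autoscaling thresholds on the agent container
spec:
  podSpec:
    containers:
    - name: agent
      env:
      - name: SELDON_MODEL_INFERENCE_LAG_THRESHOLD
        value: "30"
      - name: SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD
        value: "600"
      - name: SELDON_SCALING_STATS_PERIOD_SECONDS
        value: "20"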

Based on the logic above, the scheduler will trigger model autoscaling if:

  • The model is stable (no state change in the last 5 minutes) and available.

  • The desired number of replicas is within the configured range. Note that there is always at least 1 replica of any deployed model; over-commit is relied on to reduce resource usage further.

  • When scaling up a model and autoscaling of the Servers is not set up, the scale-up is only triggered if there are sufficient server replicas to load the new model replicas (see the Server sketch below).
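
If you want server capacity to follow model needs, as recommended above, the Server custom resource can also declare a replica range. The sketch below assumes the Server CR accepts minReplicas and maxReplicas in the same way as the Model CR; the server name, serverConfig, and values are illustrative.

# illustrative sketch: a Server with a replica range for server autoscaling
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
spec:
  serverConfig: mlserver
  minReplicas: 1
  maxReplicas: 3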
