Autoscaling in Seldon Core 2
Seldon Core 2 provides multiple approaches to scaling your machine learning deployments, allowing you to optimize resource utilization and handle varying workloads efficiently. In Core 2, we separate out Models and Servers, and Servers can have multiple Models loaded on them (Multi-Model Serving). Given this, setting up autoscaling requires defining the logic by which you want to scale your Models and then configuring the autoscaling of Servers so that the two scale in a coordinated way. Follow the steps below to set up autoscaling based on your requirements:
Identify the metrics that you want to scale Models on. There are two options:
- Core 2 natively supports scaling based on inference lag, i.e. the difference between incoming and outgoing requests for a model in a given period of time. This is enabled by configuring `minReplicas` or `maxReplicas` in the Model custom resources and installing Core 2 with the `autoscaling.autoscalingModelEnabled` helm value set to `true` (the default is `false`). See the first sketch below.
- Users can expose custom or Kubernetes-native metrics, and then scale Models on those metrics using a `HorizontalPodAutoscaler`. This requires exposing the right metrics with the monitoring tool of your choice (e.g. Prometheus). See the second sketch below.
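As a minimal sketch of the native approach, the Helm values and Model manifest might look like the following; the model name, storage URI, and replica bounds are illustrative:

```yaml
# Helm values override for the Core 2 install, enabling native
# lag-based Model autoscaling (the flag defaults to false).
autoscaling:
  autoscalingModelEnabled: true
```

```yaml
# Illustrative Model manifest: setting minReplicas/maxReplicas opts the
# Model into lag-based autoscaling between these bounds.
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris                                  # hypothetical model name
spec:
  storageUri: "gs://my-bucket/models/iris"    # hypothetical artifact location
  requirements:
    - sklearn
  minReplicas: 1
  maxReplicas: 5
```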
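For the HPA-based option, a sketch of an HPA targeting a Model might look like the one below. The `infer_rps` metric name is an assumption: a metrics pipeline (e.g. Prometheus with prometheus-adapter) must expose it through the Kubernetes custom metrics API before the HPA can consume it.

```yaml
# Illustrative HPA scaling a Model on a custom metric served by the
# Kubernetes custom metrics API (e.g. via prometheus-adapter).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iris-model-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    name: iris                    # hypothetical Model to scale
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          name: infer_rps         # hypothetical custom metric (requests per second)
        target:
          type: AverageValue
          averageValue: "3"
```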
Once the approach for Model scaling is implemented, configure Server scaling in one of two ways:
- Enable autoscaling of Servers based on Model needs. This is managed by Seldon's scheduler, and is enabled by setting `minReplicas` and `maxReplicas` in the Server custom resource and installing Core 2 with the `autoscaling.autoscalingServerEnabled` helm value set to `true` (the default). See the first sketch below.
- If Models and Servers are to have a one-to-one mapping (no Multi-Model Serving), users can instead define Server scaling with an HPA manifest that matches the HPA applied to the associated Models. This approach is outlined here and sketched below. It only works with custom metrics, as Kubernetes does not allow multiple HPAs to directly target the same Kubernetes-native metrics.
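For the scheduler-managed option, a minimal sketch (server name and replica bounds are illustrative) might look like this:

```yaml
# Helm values override for the Core 2 install; scheduler-driven Server
# autoscaling is the default behaviour.
autoscaling:
  autoscalingServerEnabled: true
```

```yaml
# Illustrative Server manifest: the scheduler scales this Server between
# the given bounds as the Models loaded onto it scale up and down.
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
spec:
  serverConfig: mlserver
  minReplicas: 1
  maxReplicas: 5
```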
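For the one-to-one mapping option, a sketch of the matching Server HPA might look like the one below; it mirrors the Model HPA sketched earlier, using the same (assumed) `infer_rps` custom metric and the same replica bounds so that Model and Server replicas move together.

```yaml
# Illustrative Server HPA mirroring the Model HPA above: same custom
# metric, same bounds, so Server replicas track Model replicas.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mlserver-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    name: mlserver                # hypothetical Server to scale
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Pods
      pods:
        metric:
          name: infer_rps         # hypothetical custom metric, matching the Model HPA
        target:
          type: AverageValue
          averageValue: "3"
```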
Based on the requirements above, one of the following three options for coordinated autoscaling of Models and Servers can be chosen:
| Scaling approach | Multi-Model Serving | Benefits | Drawbacks |
| --- | --- | --- | --- |
| Inference lag | ✅ | Simplest implementation; one metric across models | Models scale down only when no inference traffic exists to that model |
| User-defined (HPA) | ✅ | Custom scaling metric | Requires metrics store integration (e.g. Prometheus); potentially suboptimal Server packing on scale down |
| User-defined (HPA) | ❌ | Coordinated Model and Server scaling | Requires metrics store integration (e.g. Prometheus); no Multi-Model Serving |
Alternatively, the following decision tree shows the approaches we recommend based on your requirements:

Scaling Seldon Services
When running Core 2 at scale, it is important to understand the scaling behaviour of Seldon's services as well as the scaling of the Models and Servers themselves. This is outlined in the Scaling Core Services page.