# Autoscaling

## Autoscaling in Seldon Core 2

Seldon Core 2 provides multiple approaches to scaling your machine learning deployments, allowing you to optimize resource utilization and handle varying workloads efficiently. Core 2 separates Models from Servers, and a Server can have multiple Models loaded on it (Multi-Model Serving). As a result, setting up autoscaling means first defining the logic by which you want to scale your Models and *then* configuring the autoscaling of Servers so that the two scale in a coordinated way. The following steps set up autoscaling based on your specific requirements:

1. **Identify metrics** that you want to scale Models on. There are a couple of different options here:
   1. Core 2 natively supports scaling based on **Inference Lag**, that is, the difference between the number of incoming and outgoing requests for a model over a given period of time. This is enabled by setting `minReplicas` or `maxReplicas` in the Model CRD and installing Core 2 with the `autoscaling.autoscalingModelEnabled` helm value set to `true` (default is `false`).
   2. Users can expose **custom or Kubernetes-native metrics**, and then scale models based on those metrics using a `HorizontalPodAutoscaler`. This requires exposing the right metrics using the monitoring tool of your choice (e.g. Prometheus).
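The two Model-scaling options above can be sketched as follows. This is a minimal illustration, not a definitive manifest: the model name, storage URI, custom metric name (`infer_rps`), and target value are assumptions rather than values from this page, and the HPA variant assumes custom metrics are already exposed (e.g. via a Prometheus adapter).

```yaml
# Option 1.1 (sketch): native inference-lag scaling via the Model CRD.
# Requires Core 2 installed with autoscaling.autoscalingModelEnabled=true.
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris                  # illustrative model name
spec:
  storageUri: "gs://seldon-models/mlserver/iris"  # illustrative URI
  requirements:
  - sklearn
  minReplicas: 1
  maxReplicas: 3
---
# Option 1.2 (sketch): an HPA targeting the Model CR on a custom metric.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iris-model-hpa
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    name: iris
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Pods
    pods:
      metric:
        name: infer_rps       # assumed custom metric from your metrics stack
      target:
        type: AverageValue
        averageValue: "3"
```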

Once the approach for Model scaling is implemented, Server scaling needs to be configured.

2. **Implement Server Scaling** by either:
   1. Enabling autoscaling of Servers based on Model needs. This is managed by Seldon's scheduler, and is enabled by setting `minReplicas` and `maxReplicas` in the Server Custom Resource and installing Core 2 with the `autoscaling.autoscalingServerEnabled` helm value set to `true` (the default).
   2. If Models and Servers have a one-to-one mapping (no Multi-Model Serving), users can instead define Server scaling using an HPA manifest that matches the HPA applied to the associated Models. This approach is outlined [here](https://docs.seldon.ai/seldon-core-2/user-guide/scaling/hpa-overview/single-model-serving-hpa). It only works with custom metrics, as Kubernetes does not allow multiple HPAs to target the same Kubernetes-native metrics directly.
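Option 2.1 above can be sketched as follows; this is a minimal illustration in which the server name is an assumption, and the helm values path follows the `autoscaling.autoscalingServerEnabled` setting described above:

```yaml
# Option 2.1 (sketch): scheduler-managed Server scaling driven by Model needs.
# Assumes Core 2 is installed with helm values such as:
#
#   autoscaling:
#     autoscalingServerEnabled: true   # the default
#
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver              # illustrative server name
spec:
  serverConfig: mlserver
  minReplicas: 1
  maxReplicas: 3
```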

Based on the requirements above, one of the following three options for coordinated autoscaling of Models and Servers can be chosen:

| Scaling Approach                                                                                                                       | Scaling Metric     | Multi-Model Serving | Pros                                                           | Cons                                                                                                                   |
| -------------------------------------------------------------------------------------------------------------------------------------- | ------------------ | ------------------- | -------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
| [Seldon Core Autoscaling](https://docs.seldon.ai/seldon-core-2/user-guide/scaling/core-autoscaling)                                    | Inference lag      | ✅                   | <p>- Simplest implementation<br>- One metric across models</p> | - Models scale down only when no inference traffic reaches them                                                        |
| [Model Autoscaling with HPA](https://docs.seldon.ai/seldon-core-2/user-guide/scaling/hpa-overview/model-hpa-autoscaling)               | User-defined (HPA) | ✅                   | - Custom scaling metric                                        | <p>- Requires Metrics store integration (e.g. Prometheus)<br>- Potentially suboptimal Server packing on scale down</p> |
| [Model and Server Autoscaling with HPA](https://docs.seldon.ai/seldon-core-2/user-guide/scaling/hpa-overview/single-model-serving-hpa) | User-defined (HPA) | ❌                   | - Coordinated Model and Server scaling                         | <p>- Requires Metrics store integration (e.g. Prometheus)<br>- No Multi-Model Serving</p>                              |

Alternatively, the following decision tree shows the approaches we recommend based on your requirements:

![Autoscaling Approach Decision-tree](https://1538183904-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2Fn8ONqKsfp9k7WHp5OWSM%2Fuploads%2Fgit-blob-e808249f08d204ea5c2717464b5f605f7df26de8%2Fautoscaling-decision-tree.png?alt=media)

## Scaling Seldon Services

When running Core 2 at scale, it is important to understand the scaling behaviour of Seldon's services as well as the scaling of the Models and Servers themselves. This is outlined in the [Scaling Core Services](https://docs.seldon.ai/seldon-core-2/user-guide/scaling/scaling-core-services) page.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.seldon.ai/seldon-core-2/user-guide/scaling.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
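For example, a query URL can be built programmatically before being fetched with any HTTP client. This is a minimal Python sketch; the question text is illustrative, and the only assumption beyond the section above is that the question must be URL-encoded:

```python
from urllib.parse import quote

# Base URL of this documentation page (from the section above).
BASE = "https://docs.seldon.ai/seldon-core-2/user-guide/scaling.md"

# An illustrative question; URL-encode it before placing it in the query string.
question = "How do I enable autoscaling of Servers based on Model needs?"
url = f"{BASE}?ask={quote(question)}"

print(url)
```

The resulting URL can then be fetched with an HTTP GET request, e.g. `curl "<url>"`.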
