# Scalability of Pipelines

Core 2 supports full **horizontal scaling** for the **dataflow engine**, **model gateway**, and **pipeline gateway**. Each service automatically distributes pipelines or models across replicas using consistent hashing, so you don’t need to manually assign workloads.

This guide explains **how scaling works**, **what configuration controls it**, and **what happens when replicas or pipelines/models change**.

## 1. How scaling works (at a glance)

| Component            | What it scales with (max `maxShardCountMultiplier` = #partitions )              | Max replicas used                                             |
| -------------------- | ------------------------------------------------------------------------------- | ------------------------------------------------------------- |
| **Dataflow engine**  | `#pipelines × maxShardCountMultiplier` (capped by configured replicas)          | `min(replicas, pipelines × partitions)`                       |
| **Model gateway**    | `#models × maxShardCountMultiplier` (capped by replicas and maxNumConsumers)    | `min(replicas, min(models, maxNumConsumers) × partitions)`    |
| **Pipeline gateway** | `#pipelines × maxShardCountMultiplier` (capped by replicas and maxNumConsumers) | `min(replicas, min(pipelines, maxNumConsumers) × partitions)` |

Each pipeline/model **is loaded only on a subset of replicas**, and **automatically rebalanced** when:

1. You **scale replicas up/down**
2. You **deploy or delete pipelines / models**

The configuration parameter determining the maximum number of component replicas on which a pipeline/model can be loaded is `maxShardCountMultiplier`. It can be set in `SeldonConfig` under `config.scalingConfig.pipelines.maxShardCountMultiplier`. For installs via helm, the value of this parameter defaults to `{{ .Values.kafka.topics.numPartitions }}`. In fact, the number of kafka partitions per topic is the maximum value that `maxShardCountMultiplier` should be set to. Increasing this value beyond the number of kafka partitions not only does not bring any additional performance benefits, but may actually lead to dropped requests due to the extra replicas receiving requests but managing no kafka partitions within their consumer groups.

This parameter may be changed during cluster operation, with the new value being propagated to all components over a \~1 minute interval. Changing this value will cause pipelines/models to be rebalanced across the `dataflow-engine`, `model-gateway`, and `pipeline-gateway` replicas and may lead to downtime depending on the configured kafka partition assignment strategy. If the used Kafka version supports cooperative rebalancing of consumer groups, then setting the partition assignment strategy to `cooperative-sticky` will ensure that rebalancing happens with minimal disruption. `dataflow-engine` uses a cooperative rebalancing strategy by default.

You **do not** need to manually assign work — it’s handled automatically.

## 2. Scaling the dataflow engine

**Dataflow engine** is responsible for **executing pipeline logic** and moving data between pipeline stages. Core 2 now supports running **multiple pipelines in parallel across multiple dataflow engine replicas**.

### 2.1. What controls scaling?

You control scaling using:

| Config                                                   | Location        | Purpose                                                                     |
| -------------------------------------------------------- | --------------- | --------------------------------------------------------------------------- |
| `spec.replicas`                                          | `SeldonRuntime` | Maximum number of dataflow engine instances                                 |
| `config.scalingConfig.pipelines.maxShardCountMultiplier` | `SeldonConfig`  | Determines max replication per pipeline (max possible: `#Kafka partitions`) |

### 2.2. How many replicas will actually be used?

Dataflow engine replicas are dynamically adjusted based on **number of pipelines deployed** and **Kafka partitions**. The final number of dataflow engine replicas is given by:

$\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \text{pipelines} \times \text{partitions})$

**Example**

| Pipelines deployed | maxShardCountMultiplier | spec.replicas | Final dataflow replicas used      |
| ------------------ | ----------------------- | ------------- | --------------------------------- |
| 3                  | 4                       | 9             | `min(9, 3 x 4 = 12)` → 9 replicas |
| 2                  | 4                       | 9             | `min(9, 2 x 4 = 8)` → 8 replicas  |
| 1                  | 4                       | 9             | `min(9. 1 x 4 = 4)` → 4 replicas  |

**Note**: Unused replicas are automatically scaled down. As more pipelines are added, dataflow engine automatically scales up, capped by the maximum number of replicas.

### 2.3. How are pipeline assigned to replicas?

* Core 2 uses **consistent hashing** to distribute pipelines evenly across dataflow replicas. This ensures a **balanced workload**, but it does *not* guarantee a perfect one-to-one mapping.
  * Even if the number of replicas equals `pipelines × partitions`, small imbalances between the number of pipelines handled by each replica may exist. In practice, the distribution is **statistically uniform**.
* Each pipeline is **replicated across multiple dataflow engines** (up to number of Kafka partitions).
* When instances are added or removed, **pipelines are automatically rebalanced**.

**Note:** This process is handled internally by Core 2, so no manual intervention is needed.

### 2.4. Loading/unloading of the pipelines from dataflow engine

* Loading/unloading of the pipeline from the dataflow engine is performed when the pipeline CR is loaded/unloaded.
* The scheduler confirms whether the loading/unloading was performed successfully through the `Pipeline` status under the CR.

Rebalancing happens in the background — you don’t need to intervene.

**Note:** For pipelines, `Pipeline` ready status must be satisfied in order for the pipeline to be marked ready.

## 3. Scaling the model gateway

The **model gateway** is responsible for routing inference requests to models when used inside pipelines. Like the **dataflow engine**, it scales dynamically based on *how many models are deployed*.

### 3.1. What controls scaling?

| Config                                                   | Location                                              | Purpose                                                    |
| -------------------------------------------------------- | ----------------------------------------------------- | ---------------------------------------------------------- |
| `spec.replcias`                                          | `SeldonRuntime`                                       | Maximum number of model gateway instances                  |
| `config.scalingConfig.pipelines.maxShardCountMultiplier` | `SeldonConfig`                                        | Determines max replication per model (`#Kafka partitions`) |
| `maxNumConsumers`                                        | `SeldonConfig` - model gateway enc var (default: 100) | Caps how many distinct consumer groups can exist           |

### 3.2. How many replicas will actually be used?

Model gateway replicas are dynamically adjusted based on **number of models deployed**, **Kafka partitions**, and **maxNumConsumers**. The final number of model gateway replicas is given by:

$\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \min(\text{models}, \text{maxNumConsumers}) \times \text{partitions})$

**Example**

| Models Deployed | maxShardCountMultiplier | spec.replicas | maxNumConsumers | Final model gateway replicas                       |
| --------------- | ----------------------- | ------------- | --------------- | -------------------------------------------------- |
| 5               | 4                       | 20            | 100             | `min(20, min(5, 100) x 4 = 20) = 20` → 20 replicas |
| 1               | 4                       | 20            | 100             | `min(20, min(1, 100) x 4 = 4) = 4` → 4 replicas    |

**Note:** If you remove models, the model gateway automatically scales down, and if we add models, the model gateway automatically scales up, capped by the maximum number of replicas.

### 3.3. How are models assigned to replicas?

Model gateway doesn’t load every model on every replica but only on a subset of replicas. The same principle as for dataflow engine applies for model gateway (sharding through consistent hashing).

### 3.4. Loading/unloading of the models from model gateway

* Loading/unloading of the model from the model gateway is performed when the model CR is loaded/unloaded.
* The scheduler confirms whether the loading/unloading was performed successfully through the `ModelGw` status under the CR.

Rebalancing happens in the background — you don’t need to intervene.

**Note:** `ModelGw` status does not represent a condition for the model to be available. If the loading was successful on the dedicated servers, the model itself is ready for inference.

* The `ModelGw` status becomes relevant for pipelines, or whether the end user wants to perform inference via the async path (i.e., writing the requests in the model input topic and reading the responses from the model output topic from Kafka).
* In the context of pipelines, the `ModelReady` status becomes a conjunction on whether the model is available on servers and if the model has been loaded successfully on the model gateway.

## 4. Scaling the pipeline gateway

The **pipeline gateway** is responsible for writing the requests in the input topic of the pipeline, and wait for the response on the output topic. Like dataflow engine and model gateway, pipeline gateway can scale horizontally.

### 4.1. What Controls Scaling?

| Config                                                   | Location                                                 | Purpose                                                        |
| -------------------------------------------------------- | -------------------------------------------------------- | -------------------------------------------------------------- |
| `spec.replcias`                                          | `SeldonRuntime`                                          | Maximum number of pipeline gateway instances                   |
| `config.scalingConfig.pipelines.maxShardCountMultiplier` | `SeldonConfig`                                           | Determines max replication per pipeline (`# Kafka partitions`) |
| `maxNumConsumers`                                        | `SeldonConfig` - Pipeline gateway enc var (default: 100) | Caps how many distinct consumer groups can exist               |

### 4.2. How many replicas will actually be used?

Pipeline gateway replicas are dynamically adjusted based on **number of pipelines deployed**, **Kafka partitions**, and **maxNumConsumers**. The final number of pipeline gateway replicas is given by:

$\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \min(\text{pipelines}, \text{maxNumConsumers}) \times \text{partitions})$

**Example**

| Pipelines Deployed | maxShardCountMultiplier | spec.replicas | maxNumConsumers | Final pipeline gateway replicas                    |
| ------------------ | ----------------------- | ------------- | --------------- | -------------------------------------------------- |
| 8                  | 4                       | 10            | 100             | `min(10, min(8, 100) x 4 = 32) = 10` → 10 replicas |
| 2                  | 4                       | 10            | 100             | `min(10, min(2, 100) x 4 = 8) = 8` → 8 replicas    |
| 1                  | 4                       | 10            | 100             | `min(10, min(1, 100) x 4 = 4) = 4`→ 4 replicas     |

**Note:** Similarly to dataflow engine, pipeline gateway scales up and down as pipeline are added and removed.

### 4.3. How are pipeline assigned to replicas?

Pipeline gateway doesn’t load every pipeline on every replica but only on a subset of replicas. The same principle as for dataflow engine and model gateway applies for pipeline gateway (sharding through consistent hashing).

### 4.4. Loading/unloading of the pipelines from Pipeline Gateway

* Loading/unloading of the pipeline from the pipeline gateway is performed when the pipeline CR is loaded/unloaded.
* The scheduler confirms whether the loading/unloading was performed successfully through the `PipelineGw` status under the CR.

Analogous with the previous services, rebalancing happens in the background — you don’t need to intervene.

**Note:** For pipelines, `PipelineGw` ready status must be satisfied in order for the pipeline to be marked ready.
