Scalability of Pipelines
Core 2 supports full horizontal scaling for the dataflow engine, model gateway, and pipeline gateway. Each service automatically distributes pipelines or models across replicas using consistent hashing, so you don’t need to manually assign workloads.
This guide explains how scaling works, what configuration controls it, and what happens when replicas or pipelines/models change.
1. How scaling works (at a glance)
| Component | What it scales with (max `maxShardCountMultiplier` = #partitions) | Max replicas used |
| --- | --- | --- |
| Dataflow engine | #pipelines × `maxShardCountMultiplier` (capped by configured replicas) | min(replicas, pipelines × partitions) |
| Model gateway | #models × `maxShardCountMultiplier` (capped by replicas and `maxNumConsumers`) | min(replicas, min(models, maxNumConsumers) × partitions) |
| Pipeline gateway | #pipelines × `maxShardCountMultiplier` (capped by replicas and `maxNumConsumers`) | min(replicas, min(pipelines, maxNumConsumers) × partitions) |
Each pipeline/model is loaded only on a subset of replicas, and is automatically rebalanced when:
- You scale replicas up or down
- You deploy or delete pipelines/models
The configuration parameter determining the maximum number of component replicas on which a pipeline/model can be loaded is `maxShardCountMultiplier`. It can be set in `SeldonConfig` under `config.scalingConfig.pipelines.maxShardCountMultiplier`. For Helm installs, this parameter defaults to `{{ .Values.kafka.topics.numPartitions }}`. The number of Kafka partitions per topic is the maximum value that `maxShardCountMultiplier` should be set to: increasing it beyond the number of Kafka partitions brings no additional performance benefit, and may lead to dropped requests because the extra replicas receive requests while managing no Kafka partitions within their consumer groups.
This parameter may be changed during cluster operation; the new value is propagated to all components within about one minute. Changing it causes pipelines/models to be rebalanced across the `dataflow-engine`, `model-gateway`, and `pipeline-gateway` replicas, and may lead to downtime depending on the configured Kafka partition assignment strategy. If the Kafka version in use supports cooperative rebalancing of consumer groups, setting the partition assignment strategy to `cooperative-sticky` ensures that rebalancing happens with minimal disruption. `dataflow-engine` uses a cooperative rebalancing strategy by default.
You do not need to manually assign work — it’s handled automatically.
2. Scaling the dataflow engine
The dataflow engine is responsible for executing pipeline logic and moving data between pipeline stages. Core 2 now supports running multiple pipelines in parallel across multiple dataflow engine replicas.
2.1. What controls scaling?
You control scaling using:
| Parameter | Set in | Controls |
| --- | --- | --- |
| `spec.replicas` | `SeldonRuntime` | Maximum number of dataflow engine instances |
| `config.scalingConfig.pipelines.maxShardCountMultiplier` | `SeldonConfig` | Maximum replication per pipeline (max possible: #Kafka partitions) |
2.2. How many replicas will actually be used?
Dataflow engine replicas are dynamically adjusted based on the number of pipelines deployed and the number of Kafka partitions. The final number of dataflow engine replicas is given by:
$\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \text{pipelines} \times \text{partitions})$
Example:

| Pipelines | Partitions | `spec.replicas` | Calculation | Result |
| --- | --- | --- | --- | --- |
| 3 | 4 | 9 | min(9, 3 × 4 = 12) | 9 replicas |
| 2 | 4 | 9 | min(9, 2 × 4 = 8) | 8 replicas |
| 1 | 4 | 9 | min(9, 1 × 4 = 4) | 4 replicas |
Note: Unused replicas are automatically scaled down. As more pipelines are added, the dataflow engine automatically scales up, capped by the maximum number of replicas.
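The formula above can be checked with a short snippet. This is illustrative only, not Core 2 source code; the function name is made up:

```python
def dataflow_replicas(configured_replicas: int, pipelines: int, partitions: int) -> int:
    """Final dataflow engine replica count: min(spec.replicas, pipelines * partitions)."""
    return min(configured_replicas, pipelines * partitions)

# Reproduce the rows of the example table above.
print(dataflow_replicas(9, 3, 4))  # min(9, 12) -> 9
print(dataflow_replicas(9, 2, 4))  # min(9, 8)  -> 8
print(dataflow_replicas(9, 1, 4))  # min(9, 4)  -> 4
```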
2.3. How are pipelines assigned to replicas?
Core 2 uses consistent hashing to distribute pipelines evenly across dataflow replicas. This ensures a balanced workload, but it does not guarantee a perfect one-to-one mapping.
Even if the number of replicas equals pipelines × partitions, small imbalances between the number of pipelines handled by each replica may exist. In practice, the distribution is statistically uniform.
Each pipeline is replicated across multiple dataflow engines (up to the number of Kafka partitions).
When instances are added or removed, pipelines are automatically rebalanced.
Note: This process is handled internally by Core 2, so no manual intervention is needed.
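The behaviour described above can be illustrated with a minimal consistent-hashing sketch. Core 2's actual hashing scheme is internal and may differ; the replica and pipeline names here are made up. The sketch shows the two properties the text relies on: assignments are deterministic and roughly balanced, and removing a replica relocates only the pipelines that replica owned:

```python
import hashlib
from bisect import bisect

def _hash(key: str) -> int:
    # Position on the ring as a 256-bit integer.
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def build_ring(replicas, vnodes=64):
    # Each replica contributes `vnodes` virtual points for smoother balance.
    return sorted((_hash(f"{r}#{v}"), r) for r in replicas for v in range(vnodes))

def owner(ring, key):
    # A key belongs to the first replica point at or after its hash, wrapping around.
    points = [h for h, _ in ring]
    return ring[bisect(points, _hash(key)) % len(ring)][1]

replicas = [f"dataflow-engine-{i}" for i in range(4)]
ring = build_ring(replicas)
pipelines = [f"pipeline-{i}" for i in range(12)]
assignment = {p: owner(ring, p) for p in pipelines}

# Removing one replica moves only the pipelines it owned; the rest stay put.
smaller = build_ring(replicas[:-1])
moved = [p for p in pipelines
         if assignment[p] != replicas[-1] and owner(smaller, p) != assignment[p]]
```

Because only the removed replica's points disappear from the ring, every key whose successor point survives keeps its owner, which is why rebalancing on scale-down is cheap.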
2.4. Loading/unloading of pipelines from the dataflow engine
Loading/unloading of a pipeline from the dataflow engine is performed when the pipeline CR is loaded/unloaded. The scheduler confirms whether the loading/unloading was performed successfully through the `Pipeline` status under the CR.
Rebalancing happens in the background — you don’t need to intervene.
Note: For pipelines, the `Pipeline` ready status must be satisfied in order for the pipeline to be marked ready.
3. Scaling the model gateway
The model gateway is responsible for routing inference requests to models when used inside pipelines. Like the dataflow engine, it scales dynamically based on how many models are deployed.
3.1. What controls scaling?
| Parameter | Set in | Controls |
| --- | --- | --- |
| `spec.replicas` | `SeldonRuntime` | Maximum number of model gateway instances |
| `config.scalingConfig.pipelines.maxShardCountMultiplier` | `SeldonConfig` | Maximum replication per model (max possible: #Kafka partitions) |
| `maxNumConsumers` | `SeldonConfig` (model gateway env var, default: 100) | Caps how many distinct consumer groups can exist |
3.2. How many replicas will actually be used?
Model gateway replicas are dynamically adjusted based on the number of models deployed, the number of Kafka partitions, and `maxNumConsumers`. The final number of model gateway replicas is given by:
$\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \min(\text{models}, \text{maxNumConsumers}) \times \text{partitions})$
Example:

| Models | Partitions | `spec.replicas` | `maxNumConsumers` | Calculation | Result |
| --- | --- | --- | --- | --- | --- |
| 5 | 4 | 20 | 100 | min(20, min(5, 100) × 4 = 20) = 20 | 20 replicas |
| 1 | 4 | 20 | 100 | min(20, min(1, 100) × 4 = 4) = 4 | 4 replicas |
Note: If you remove models, the model gateway automatically scales down; if you add models, it automatically scales up, capped by the maximum number of replicas.
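The same calculation can be expressed as a small helper (illustrative only, not part of the Core 2 codebase). The identical formula applies to the pipeline gateway, with the number of pipelines in place of the number of models:

```python
def gateway_replicas(configured_replicas: int, consumers_needed: int,
                     max_num_consumers: int, partitions: int) -> int:
    """Final gateway replica count:
    min(spec.replicas, min(N, maxNumConsumers) * partitions),
    where N is the number of models (model gateway) or pipelines (pipeline gateway)."""
    return min(configured_replicas, min(consumers_needed, max_num_consumers) * partitions)

# Model gateway rows from the example table above.
print(gateway_replicas(20, 5, 100, 4))  # min(20, min(5, 100) * 4) -> 20
print(gateway_replicas(20, 1, 100, 4))  # min(20, min(1, 100) * 4) -> 4
```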
3.3. How are models assigned to replicas?
The model gateway doesn’t load every model on every replica, but only on a subset of replicas. The same principle as for the dataflow engine applies to the model gateway (sharding through consistent hashing).
3.4. Loading/unloading of models from the model gateway
Loading/unloading of a model from the model gateway is performed when the model CR is loaded/unloaded. The scheduler confirms whether the loading/unloading was performed successfully through the `ModelGw` status under the CR.
Rebalancing happens in the background — you don’t need to intervene.
Note: The `ModelGw` status is not a condition for the model to be available. If loading was successful on the dedicated servers, the model itself is ready for inference.
The `ModelGw` status becomes relevant for pipelines, or when the end user wants to perform inference via the async path (i.e., writing requests to the model input topic and reading responses from the model output topic in Kafka). In the context of pipelines, the `ModelReady` status becomes a conjunction of whether the model is available on servers and whether the model has been loaded successfully on the model gateway.
4. Scaling the pipeline gateway
The pipeline gateway is responsible for writing requests to the pipeline’s input topic and waiting for responses on the output topic. Like the dataflow engine and model gateway, the pipeline gateway can scale horizontally.
4.1. What controls scaling?

| Parameter | Set in | Controls |
| --- | --- | --- |
| `spec.replicas` | `SeldonRuntime` | Maximum number of pipeline gateway instances |
| `config.scalingConfig.pipelines.maxShardCountMultiplier` | `SeldonConfig` | Maximum replication per pipeline (max possible: #Kafka partitions) |
| `maxNumConsumers` | `SeldonConfig` (pipeline gateway env var, default: 100) | Caps how many distinct consumer groups can exist |
4.2. How many replicas will actually be used?
Pipeline gateway replicas are dynamically adjusted based on the number of pipelines deployed, the number of Kafka partitions, and `maxNumConsumers`. The final number of pipeline gateway replicas is given by:
$\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \min(\text{pipelines}, \text{maxNumConsumers}) \times \text{partitions})$
Example:

| Pipelines | Partitions | `spec.replicas` | `maxNumConsumers` | Calculation | Result |
| --- | --- | --- | --- | --- | --- |
| 8 | 4 | 10 | 100 | min(10, min(8, 100) × 4 = 32) = 10 | 10 replicas |
| 2 | 4 | 10 | 100 | min(10, min(2, 100) × 4 = 8) = 8 | 8 replicas |
| 1 | 4 | 10 | 100 | min(10, min(1, 100) × 4 = 4) = 4 | 4 replicas |
Note: Similarly to the dataflow engine, the pipeline gateway scales up and down as pipelines are added and removed.
4.3. How are pipelines assigned to replicas?
The pipeline gateway doesn’t load every pipeline on every replica, but only on a subset of replicas. The same principle as for the dataflow engine and model gateway applies to the pipeline gateway (sharding through consistent hashing).
4.4. Loading/unloading of pipelines from the pipeline gateway
Loading/unloading of a pipeline from the pipeline gateway is performed when the pipeline CR is loaded/unloaded. The scheduler confirms whether the loading/unloading was performed successfully through the `PipelineGw` status under the CR.
As with the previous services, rebalancing happens in the background — you don’t need to intervene.
Note: For pipelines, the `PipelineGw` ready status must be satisfied in order for the pipeline to be marked ready.