Load Testing

Before making changes to improve the latency or throughput of your models, it is important to carry out load testing to understand the existing performance characteristics of your model(s) when deployed onto the chosen inference server (MLServer or Triton). The goal of load testing is to understand the performance behavior of deployed models in both saturated (i.e. at the maximum throughput a model replica can handle) and non-saturated regimes, and to compare this behavior against your latency objectives.

The results here will also inform the setup of autoscaling parameters, with the target being to run each replica with some margin below the saturation throughput (say, by 10-20%) in order to ensure that latency does not degrade, and that there is sufficient capacity to absorb load for the time it takes for new inference server replicas (and model replicas) to become available.
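As an illustration of this sizing logic, the following sketch (using hypothetical numbers for the saturation throughput and expected peak load) computes a per-replica RPS target with 20% headroom and the resulting replica count:

```python
import math

# Hypothetical figures: replace with the values measured in your own load tests.
saturation_rps_per_replica = 120.0   # max RPS one model replica can sustain
expected_peak_rps = 800.0            # anticipated peak load on the model
headroom = 0.20                      # keep each replica 20% below saturation

target_rps_per_replica = saturation_rps_per_replica * (1 - headroom)
replicas_needed = math.ceil(expected_peak_rps / target_rps_per_replica)

print(f"Target per-replica load: {target_rps_per_replica:.0f} RPS")
print(f"Replicas needed for {expected_peak_rps:.0f} RPS: {replicas_needed}")
```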

When testing latency, it is recommended to track different percentiles of latency (e.g. p50, p90, p95, p99). Choose percentiles based on the needs of your application - higher percentiles help you understand tail latency and the variability of performance across requests.
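Most load testing tools report these percentiles directly; if you are collecting raw per-request latencies yourself, they can be computed in a few lines (NumPy assumed available, latency values below are placeholders):

```python
import numpy as np

# latencies_ms would normally come from your load test results; hypothetical values here
latencies_ms = np.array([12.1, 13.4, 11.8, 45.0, 12.9, 14.2, 120.5, 13.1])

for p in (50, 90, 95, 99):
    print(f"p{p}: {np.percentile(latencies_ms, p):.1f} ms")
```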

Determining the Load Saturation Point

A first goal of load testing should be to determine the maximum throughput that one model replica is able to sustain. We’ll refer to this as the (single replica) load saturation point. Increasing the number of inference requests per second (RPS) beyond this point degrades latency due to queuing at the bottleneck (e.g. a model processing step, or contention on a shared resource such as CPU, memory, or network).

We recommend determining the load saturation point by running an open-loop load test that goes through a series of stages, each with a target RPS. A stage first ramps up linearly to its target RPS and then holds that RPS level constant for a set amount of time. The target RPS should increase monotonically between stages.
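The sketch below illustrates this staged, open-loop pattern with a minimal generator written with asyncio and httpx (the endpoint URL, payload, and stage targets are placeholders; in practice you would typically use a dedicated tool such as Locust or k6 rather than hand-rolling this):

```python
import asyncio
import time

import httpx

URL = "http://localhost:9000/v2/models/my-model/infer"  # hypothetical endpoint
PAYLOAD = {"inputs": [{"name": "input-0", "shape": [1, 4],
                       "datatype": "FP32", "data": [1.0, 2.0, 3.0, 4.0]}]}

# Each stage ramps linearly from the previous target RPS to its own target, then holds it.
STAGES = [  # (target_rps, ramp_seconds, hold_seconds) - hypothetical values
    (10, 30, 60),
    (20, 30, 60),
    (40, 30, 60),
]

latencies: list[float] = []

async def fire(client: httpx.AsyncClient) -> None:
    """Send one request and record its latency; errors are skipped for brevity."""
    start = time.perf_counter()
    try:
        await client.post(URL, json=PAYLOAD, timeout=30.0)
    except httpx.HTTPError:
        return
    latencies.append(time.perf_counter() - start)

async def run() -> None:
    async with httpx.AsyncClient() as client:
        prev_rps = 0.0
        for target_rps, ramp_s, hold_s in STAGES:
            for phase_len, start_rps, end_rps in ((ramp_s, prev_rps, target_rps),
                                                  (hold_s, target_rps, target_rps)):
                t0 = time.perf_counter()
                while (elapsed := time.perf_counter() - t0) < phase_len:
                    rps = start_rps + (end_rps - start_rps) * (elapsed / phase_len)
                    # Open loop: fire requests on schedule without waiting for responses.
                    asyncio.create_task(fire(client))
                    await asyncio.sleep(1.0 / max(rps, 0.1))
            prev_rps = target_rps
        await asyncio.sleep(5)  # allow in-flight requests to complete

asyncio.run(run())
print(f"Completed {len(latencies)} requests")
```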

In order to get reproducible load test results, it is recommended to run the inference servers corresponding to the models being load-tested on k8s nodes where other components are not concurrently using a large proportion of the shared resources (compute, memory, IO).

Closed-loop vs. Open-loop mode

Typically, load testing tools generate load by creating a number of "virtual users" that send requests. Knowing the behaviour of those virtual users is critical when interpreting the results of the load test. Load tests can be set up in closed-loop mode, meaning that each virtual user sends a request and waits for the response before sending the next one. Alternatively, tests can run in open-loop mode, where a variable number of virtual users is instantiated in order to maintain a constant overall RPS.

  • When running in closed-loop mode, an undesired side-effect called coordinated omission appears: when the system becomes overloaded and latency spikes, the virtual users effectively reduce the actual load on the system by sending requests less frequently.

  • When using closed-loop mode in load testing, be aware that reported latencies at a given throughput may be significantly smaller than what will be experienced in reality. In contrast, an open-loop load tester would maintain a constant RPS, resulting in a more accurate representation of the latencies that will be experienced when running the model in production.

  • You can refer to the documentation of your load testing tool (e.g. Locust or k6) for guidance on choosing the right workload model (open vs. closed-loop) based on your testing goals; a minimal closed-loop example is sketched below.
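For reference, the following minimal Locust script defines a closed-loop virtual user against a hypothetical Open Inference Protocol endpoint (the model name and payload are placeholders); each user waits for a response before issuing its next request, which is exactly the behaviour that can lead to coordinated omission under overload:

```python
from locust import HttpUser, constant, task

class InferenceUser(HttpUser):
    # No think time: each user sends its next request as soon as the previous one returns,
    # which makes this a closed-loop workload.
    wait_time = constant(0)

    @task
    def infer(self):
        # Hypothetical model name and payload; adjust to your deployment.
        self.client.post(
            "/v2/models/my-model/infer",
            json={"inputs": [{"name": "input-0", "shape": [1, 4],
                              "datatype": "FP32", "data": [1.0, 2.0, 3.0, 4.0]}]},
        )
```

Run with `locust -f <file> --host <endpoint>` to reproduce the closed-loop behaviour described above; an open-loop workload (constant arrival rate), such as the staged generator sketched earlier, avoids coordinated omission.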


Inference

When looking to optimize the latency or throughput of deployed models or pipelines, it is important to consider how inference workloads are executed. Below are some approaches that may be relevant depending on the requirements of a given use-case:

  • gRPC may be more efficient than REST when your inference request payload benefits from a binary serialization format.

  • Grouping multiple real-time requests into small batches can improve throughput while maintaining acceptable latency. For more information on adaptive batching in MLServer, see here; a minimal configuration sketch is shown after this list.

  • Reducing the dimensionality of your inputs can speed up processing. This also reduces the (de)serialization overhead that may be needed around model deployments.

  • For models deployed with MLServer, adjust parallel_workers in line with the number of CPU cores assigned to the Server pod. This is most effective for synchronous models, CPU-bound asynchronous models, and I/O-bound asynchronous models with high throughput. Proper tuning here can improve throughput, stabilize the latency distribution, and potentially reduce overall latency thanks to reduced queuing. This is outlined in more detail in the Configuring Parallel Processing section below.
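To make the adaptive batching bullet above concrete, here is a minimal sketch of a model-settings.json that enables it, assuming the max_batch_size and max_batch_time settings described in the MLServer docs; the model name, runtime, and batch limits are placeholder values:

```python
import json

# Hypothetical model-settings.json for an MLServer model with adaptive batching enabled.
# max_batch_size caps how many requests are grouped together; max_batch_time (seconds)
# caps how long MLServer waits to fill a batch before passing it to the model.
model_settings = {
    "name": "my-model",
    "implementation": "mlserver_sklearn.SKLearnModel",
    "max_batch_size": 8,
    "max_batch_time": 0.01,
}

with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)
```

After enabling batching, re-run your load tests: larger batches typically trade a small amount of per-request latency for higher overall throughput.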

Configuring Parallel Processing

When deploying models using MLServer, it is possible to execute inference workloads via a pool of workers running in separate processes (see Parallel Inference in the MLServer docs).

To assess the throughput behavior of individual model(s), it is first helpful to identify the maximum throughput possible with one worker (one-worker max throughput) and then the maximum throughput possible with N workers (n_workers max throughput). It is important to note that the n_workers max throughput is not simply n_workers × one-worker max throughput, because workers run in separate processes and the OS can only run as many processes in parallel as there are available CPUs. If all workers are CPU-bound, setting n_workers higher than the number of CPU cores will be ineffective, as the OS cannot run more processes in parallel than there are cores.
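The rough relationship can be sketched as follows (illustrative only; real throughput also depends on contention, batching, and the model itself):

```python
import os

def estimated_max_throughput(n_workers: int, one_worker_rps: float,
                             cpu_bound: bool = True) -> float:
    """Very rough throughput ceiling for a pool of MLServer workers (illustrative only)."""
    n_cpus = os.cpu_count() or 1
    if cpu_bound:
        # CPU-bound workers cannot make progress in parallel beyond the number of cores.
        effective_workers = min(n_workers, n_cpus)
    else:
        # I/O- or GPU-bound workers spend time waiting, so more workers than cores can still help.
        effective_workers = n_workers
    return effective_workers * one_worker_rps

# Hypothetical numbers: each worker sustains ~30 RPS when CPU-bound.
print(estimated_max_throughput(n_workers=8, one_worker_rps=30.0))  # capped by core count
```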

However, if some workers are waiting for either I/O or a GPU, then setting n_workers to a value higher than the number of CPUs can help increase throughput, as some workers can continue processing while others wait. Generally, if a model is receiving inference requests at a throughput lower than the one-worker max throughput, adding workers will not increase throughput or decrease latency. Similarly, if MLServer is configured with n_workers (on a pod with at least as many CPU cores as workers) and the request rate is below the n_workers max throughput, latency remains constant, as the system is below saturation.

Given the above, it is worth increasing the number of workers available to process data for a given deployed model when the system becomes saturated. Increasing the number of workers up to, or slightly above, the number of CPU cores available may reduce latency when the system is saturated, provided the MLServer pod has sufficient spare CPU. The effect also depends on whether the model is CPU-bound and whether it uses async or blocking operations, with CPU-bound and blocking models benefiting most. When the system is saturated, requests will queue; increasing workers aims to run enough tasks in parallel to cope with the higher throughput while minimizing queuing.
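The number of workers is controlled via the parallel_workers setting in MLServer's server-wide settings.json. A minimal sketch, assuming a pod with 4 CPU cores serving a CPU-bound model:

```python
import json

# Hypothetical server-wide settings.json for MLServer: with a CPU-bound model on a pod
# with 4 CPU cores, matching parallel_workers to the core count is a sensible starting point.
settings = {
    "parallel_workers": 4,
}

with open("settings.json", "w") as f:
    json.dump(settings, f, indent=2)
```

For I/O- or GPU-bound models, a value somewhat above the core count may be worth testing, as discussed above.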

Optimizing the model artefact

If optimizing for speed, the model artefact itself can have a big impact on performance. The speed at which an ML model can return results for a given input depends on the model’s architecture, its size, the precision of its weights, and the input size. To reduce the amount of computation inherently required per inference, it is worth considering:

  • Model pruning to reduce parameters that may be unimportant. This can help reduce model size without having a big impact on the quality of the model’s outputs.

  • Quantization to reduce the computational and memory overheads of running inference by using lower-precision data types for model weights and activations (see the sketch after this list).

  • Dimensionality reduction of inputs to reduce the complexity of computation.

  • Efficient model architectures, such as MobileNet, EfficientNet, or DistilBERT, which are designed for faster inference with minimal accuracy loss.

  • Optimized model formats and runtimes like ONNX Runtime, TensorRT, or OpenVINO, which leverage hardware-specific acceleration for improved performance.
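As an illustration of the quantization point above, the following sketch applies dynamic (post-training) quantization to an ONNX model using ONNX Runtime's quantization tooling; the file paths are placeholders, and whether int8 weights are acceptable depends on how much accuracy loss your use-case can tolerate:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Hypothetical paths: the original FP32 model and the quantized output artefact.
quantize_dynamic(
    model_input="model-fp32.onnx",
    model_output="model-int8.onnx",
    weight_type=QuantType.QInt8,  # store weights as int8 to shrink the artefact and speed up inference
)
```

After quantizing, validate accuracy on a held-out dataset and compare latency against the FP32 artefact in a load test before promoting the new model.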


Models

This section covers how to optimize model performance in Seldon Core 2, from initial load testing to infrastructure setup and inference optimization. Each subsection provides detailed guidance on a different aspect of performance tuning:

Load Testing

Learn how to conduct effective load testing to understand your model's performance characteristics:

  • Determining load saturation points

  • Understanding closed-loop vs. open-loop testing

  • Determining the right number of replicas based on your configuration (model, infrastructure, etc.)

  • Setting up reproducible test environments

  • Interpreting test results for autoscaling configuration

Inference

Explore different approaches to optimize inference performance:

  • Choosing between gRPC and REST protocols

  • Implementing adaptive batching

  • Optimizing input dimensions

  • Configuring parallel processing with workers

Infrastructure Setup

Understand how to configure the underlying infrastructure for optimal model performance:

  • Choosing between CPU and GPU deployments

  • Setting appropriate CPU specifications

  • Configuring thread affinity

  • Managing memory allocation

Each of these aspects plays a crucial role in achieving optimal model performance. We recommend starting with load testing to establish a baseline, then using the insights gained to inform your infrastructure setup and inference optimization strategies.

Infrastructure Setup

CPUs vs. GPUs

Overall performance of your models will be constrained by the specifications of the underlying hardware on which they run, and how that hardware is leveraged. Choosing between CPUs and GPUs depends on the latency and throughput requirements of your use-case, as well as the type of model you are putting into production:

  • CPUs are generally sufficient for lightweight models, such as tree-based models, regression models, or small neural networks.


  • GPUs are recommended for deep learning models, large-scale neural networks, and large language models (LLMs) where lower latency is critical. Models with high matrix computation demands (like transformers or CNNs) benefit significantly from GPU acceleration.

If cost is a concern, it is recommended to start with CPUs and use profiling or performance monitoring tools (e.g. py-spy or scalene) to identify CPU bottlenecks. Based on these results, you can transition to GPUs as needed.
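The tools mentioned above are typically attached to the running server process. As a quick offline alternative (not one of the tools named above), Python's built-in cProfile can give a first indication of whether a model's predict path is CPU-bound; the predict function and input below are hypothetical stand-ins:

```python
import cProfile
import pstats

import numpy as np

def predict(batch: np.ndarray) -> np.ndarray:
    """Stand-in for your model's predict function; replace with the real inference call."""
    # Deliberately CPU-heavy placeholder work to illustrate the profiling output.
    return batch @ batch.T

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    predict(np.random.rand(256, 256))
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)  # top 10 hotspots
```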

Setting the right specifications for CPUs

If working with models that will receive many concurrent requests in production, individual CPUs can often become a bottleneck when processing data. In these cases, increasing the number of parallel workers can help. This can be configured through your serving solution, as described in the Inference section. When increasing the number of workers available to process concurrent requests, it is best practice to ensure the number of workers is not significantly higher than the number of available CPU cores, in order to reduce contention. Each worker executes in its own process. This is most relevant for synchronous models, where subsequent processing is blocked until each request completes.

Leveraging Multiple CPU Cores

For more advanced configuration of CPU utilization, users can configure thread affinity through environment variables that determine how threads are bound to physical processors. For example, KMP_AFFINITY and OMP_NUM_THREADS can be set for technologies that use OpenMP. For more information on thread affinity, see here. In general, the ML framework that you’re using might have its own recommendations for improving resource usage.
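A minimal sketch of how these variables might be set, assuming a framework that uses OpenMP and a pod where each MLServer worker should be limited to a single OpenMP thread (in practice these are usually set as container environment variables rather than in code, and must be set before the framework is imported):

```python
import os

# Hypothetical values: limit each worker process to one OpenMP thread and use a compact
# thread-to-core binding, so parallel workers do not oversubscribe the pod's CPU cores.
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")

# The ML framework must be imported after these variables are set for them to take effect.
import torch  # hypothetical framework; could equally be another OpenMP-based library
```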

Finally, increasing the RAM available to your models can improve performance for memory-intensive models, such as those with large parameter counts, those that require high-dimensional data processing, or those involving complex intermediate computations.