Load Testing

Before making changes to improve the latency or throughput of your models, it is important to carry out load testing to understand the existing performance characteristics of your model(s) when deployed onto the chosen inference server (MLServer or Triton). The goal of load testing is to understand the performance behavior of deployed models in both saturated (i.e. at the maximum throughput a model replica can handle) and non-saturated regimes, and to compare this behavior against your latency objectives.

The results here will also inform the setup of autoscaling parameters, with the target being to run each replica with some margin below the saturation throughput (say, by 10-20%) in order to ensure that latency does not degrade, and that there is sufficient capacity to absorb load for the time it takes for new inference server replicas (and model replicas) to become available.
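As an illustration of this sizing logic, the following sketch (using hypothetical numbers for the saturation throughput and expected peak load) computes a per-replica RPS target with 20% headroom and the resulting replica count:

```python
import math

# Hypothetical figures: replace with the values measured in your own load tests.
saturation_rps_per_replica = 120.0   # max RPS one model replica can sustain
expected_peak_rps = 800.0            # anticipated peak load on the model
headroom = 0.20                      # keep each replica 20% below saturation

target_rps_per_replica = saturation_rps_per_replica * (1 - headroom)
replicas_needed = math.ceil(expected_peak_rps / target_rps_per_replica)

print(f"Target per-replica load: {target_rps_per_replica:.0f} RPS")
print(f"Replicas needed for {expected_peak_rps:.0f} RPS: {replicas_needed}")
```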

When testing latency, it is recommended to track different percentiles of latency (e.g. p50, p90, p95, p99). Choose percentiles based on the needs of your application - higher percentiles help you understand tail latency and the variability of performance across requests.
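Most load testing tools report these percentiles directly; if you are collecting raw per-request latencies yourself, they can be computed in a few lines (NumPy assumed available, latency values below are placeholders):

```python
import numpy as np

# latencies_ms would normally come from your load test results; hypothetical values here
latencies_ms = np.array([12.1, 13.4, 11.8, 45.0, 12.9, 14.2, 120.5, 13.1])

for p in (50, 90, 95, 99):
    print(f"p{p}: {np.percentile(latencies_ms, p):.1f} ms")
```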

Determining the Load Saturation Point

A first goal of load testing should be to determine the maximum throughput that one model replica is able to sustain. We’ll refer to this as the (single replica) load saturation point. Increasing the number of inference requests per second (RPS) beyond this point degrades latency due to queuing at the bottleneck (e.g. a model processing step, or contention on a shared resource such as CPU, memory, or network).

We recommend determining the load saturation point by running an open-loop load test that goes through a series of stages, each with a target RPS. A stage first ramps up linearly to its target RPS and then holds that RPS level constant for a set amount of time. The target RPS should increase monotonically between stages.
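The sketch below illustrates this staged, open-loop pattern with a minimal generator written with asyncio and httpx (the endpoint URL, payload, and stage targets are placeholders; in practice you would typically use a dedicated tool such as Locust or k6 rather than hand-rolling this):

```python
import asyncio
import time

import httpx

URL = "http://localhost:9000/v2/models/my-model/infer"  # hypothetical endpoint
PAYLOAD = {"inputs": [{"name": "input-0", "shape": [1, 4],
                       "datatype": "FP32", "data": [1.0, 2.0, 3.0, 4.0]}]}

# Each stage ramps linearly from the previous target RPS to its own target, then holds it.
STAGES = [  # (target_rps, ramp_seconds, hold_seconds) - hypothetical values
    (10, 30, 60),
    (20, 30, 60),
    (40, 30, 60),
]

latencies: list[float] = []

async def fire(client: httpx.AsyncClient) -> None:
    """Send one request and record its latency; errors are skipped for brevity."""
    start = time.perf_counter()
    try:
        await client.post(URL, json=PAYLOAD, timeout=30.0)
    except httpx.HTTPError:
        return
    latencies.append(time.perf_counter() - start)

async def run() -> None:
    async with httpx.AsyncClient() as client:
        prev_rps = 0.0
        for target_rps, ramp_s, hold_s in STAGES:
            for phase_len, start_rps, end_rps in ((ramp_s, prev_rps, target_rps),
                                                  (hold_s, target_rps, target_rps)):
                t0 = time.perf_counter()
                while (elapsed := time.perf_counter() - t0) < phase_len:
                    rps = start_rps + (end_rps - start_rps) * (elapsed / phase_len)
                    # Open loop: fire requests on schedule without waiting for responses.
                    asyncio.create_task(fire(client))
                    await asyncio.sleep(1.0 / max(rps, 0.1))
            prev_rps = target_rps
        await asyncio.sleep(5)  # allow in-flight requests to complete

asyncio.run(run())
print(f"Completed {len(latencies)} requests")
```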

In order to get reproducible load test results, it is recommended to run the inference servers corresponding to the models being load-tested on k8s nodes where other components are not concurrently using a large proportion of the shared resources (compute, memory, IO).

Closed-loop vs. Open-loop mode

Typically, load testing tools generate load by creating a number of "virtual users" that send requests. Knowing the behaviour of those virtual users is critical when interpreting the results of the load test. Load tests can be set up in closed-loop mode, meaning that each virtual user sends a request and waits for the response before sending the next one. Alternatively, tests can run in open-loop mode, where a variable number of virtual users is instantiated in order to maintain a constant overall RPS.

  • When running in closed-loop mode, an undesired side-effect called coordinated omission appears: when the system becomes overloaded and latency spikes, the virtual users effectively reduce the actual load on the system by sending requests less frequently.

  • When using closed-loop mode in load testing, be aware that reported latencies at a given throughput may be significantly smaller than what will be experienced in reality. In contrast, an open-loop load tester would maintain a constant RPS, resulting in a more accurate representation of the latencies that will be experienced when running the model in production.

  • You can refer to the documentation of your load testing tool (e.g. Locust or k6) for guidance on choosing the right workload model (open vs. closed-loop) based on your testing goals; a minimal closed-loop example is sketched below.
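For reference, the following minimal Locust script defines a closed-loop virtual user against a hypothetical Open Inference Protocol endpoint (the model name and payload are placeholders); each user waits for a response before issuing its next request, which is exactly the behaviour that can lead to coordinated omission under overload:

```python
from locust import HttpUser, constant, task

class InferenceUser(HttpUser):
    # No think time: each user sends its next request as soon as the previous one returns,
    # which makes this a closed-loop workload.
    wait_time = constant(0)

    @task
    def infer(self):
        # Hypothetical model name and payload; adjust to your deployment.
        self.client.post(
            "/v2/models/my-model/infer",
            json={"inputs": [{"name": "input-0", "shape": [1, 4],
                              "datatype": "FP32", "data": [1.0, 2.0, 3.0, 4.0]}]},
        )
```

Run with `locust -f <file> --host <endpoint>` to reproduce the closed-loop behaviour described above; an open-loop workload (constant arrival rate), such as the staged generator sketched earlier, avoids coordinated omission.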


Inference

When looking to optimize the latency or throughput of deployed models or pipelines, it is important to consider how inference workloads are executed. Below are some approaches that may be relevant depending on the requirements of a given use-case:

  • gRPC may be more efficient than REST when your inference request payload benefits from a binary serialization format.

  • Grouping multiple real-time requests into small batches can improve throughput while maintaining acceptable latency. For more information on adaptive batching in MLServer, see here; a minimal configuration sketch is shown after this list.

  • Reducing the dimensionality of your inputs can speed up processing. This also reduces the (de)serialization overhead that may be needed around model deployments.

  • For models deployed with MLServer, adjust parallel_workers in line with the number of CPU cores assigned to the Server pod. This is most effective for synchronous models, CPU-bound asynchronous models, and I/O-bound asynchronous models with high throughput. Proper tuning here can improve throughput, stabilize the latency distribution, and potentially reduce overall latency thanks to reduced queuing. This is outlined in more detail in the Configuring Parallel Processing section below.
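To make the adaptive batching bullet above concrete, here is a minimal sketch of a model-settings.json that enables it, assuming the max_batch_size and max_batch_time settings described in the MLServer docs; the model name, runtime, and batch limits are placeholder values:

```python
import json

# Hypothetical model-settings.json for an MLServer model with adaptive batching enabled.
# max_batch_size caps how many requests are grouped together; max_batch_time (seconds)
# caps how long MLServer waits to fill a batch before passing it to the model.
model_settings = {
    "name": "my-model",
    "implementation": "mlserver_sklearn.SKLearnModel",
    "max_batch_size": 8,
    "max_batch_time": 0.01,
}

with open("model-settings.json", "w") as f:
    json.dump(model_settings, f, indent=2)
```

After enabling batching, re-run your load tests: larger batches typically trade a small amount of per-request latency for higher overall throughput.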

Configuring Parallel Processing

When deploying models using MLServer, it is possible to execute inference workloads via a pool of workers running in separate processes (see Parallel Inference in the MLServer docs).

To assess the throughput behavior of individual model(s), it is first helpful to identify the maximum throughput possible with one worker (one-worker max throughput) and then the maximum throughput possible with N workers (n_workers max throughput). It is important to note that the n_workers max throughput is not simply n_workers × one-worker max throughput, because workers run in separate processes and the OS can only run as many processes in parallel as there are available CPUs. If all workers are CPU-bound, setting n_workers higher than the number of CPU cores will be ineffective, as the OS cannot run more processes in parallel than there are cores.
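The rough relationship can be sketched as follows (illustrative only; real throughput also depends on contention, batching, and the model itself):

```python
import os

def estimated_max_throughput(n_workers: int, one_worker_rps: float,
                             cpu_bound: bool = True) -> float:
    """Very rough throughput ceiling for a pool of MLServer workers (illustrative only)."""
    n_cpus = os.cpu_count() or 1
    if cpu_bound:
        # CPU-bound workers cannot make progress in parallel beyond the number of cores.
        effective_workers = min(n_workers, n_cpus)
    else:
        # I/O- or GPU-bound workers spend time waiting, so more workers than cores can still help.
        effective_workers = n_workers
    return effective_workers * one_worker_rps

# Hypothetical numbers: each worker sustains ~30 RPS when CPU-bound.
print(estimated_max_throughput(n_workers=8, one_worker_rps=30.0))  # capped by core count
```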

However, if some workers are waiting for either I/O or a GPU, then setting n_workers to a value higher than the number of CPUs can help increase throughput, as some workers can continue processing while others wait. Generally, if a model is receiving inference requests at a throughput lower than the one-worker max throughput, adding workers will not increase throughput or decrease latency. Similarly, if MLServer is configured with n_workers (on a pod with at least as many CPU cores as workers) and the request rate is below the n_workers max throughput, latency remains constant, as the system is below saturation.

Given the above, it is worth increasing the number of workers available to process data for a given deployed model when the system becomes saturated. Increasing the number of workers up to, or slightly above, the number of CPU cores available may reduce latency when the system is saturated, provided the MLServer pod has sufficient spare CPU. The effect also depends on whether the model is CPU-bound and whether it uses async or blocking operations, with CPU-bound and blocking models benefiting most. When the system is saturated, requests will queue; increasing workers aims to run enough tasks in parallel to cope with the higher throughput while minimizing queuing.
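The number of workers is controlled via the parallel_workers setting in MLServer's server-wide settings.json. A minimal sketch, assuming a pod with 4 CPU cores serving a CPU-bound model:

```python
import json

# Hypothetical server-wide settings.json for MLServer: with a CPU-bound model on a pod
# with 4 CPU cores, matching parallel_workers to the core count is a sensible starting point.
settings = {
    "parallel_workers": 4,
}

with open("settings.json", "w") as f:
    json.dump(settings, f, indent=2)
```

For I/O- or GPU-bound models, a value somewhat above the core count may be worth testing, as discussed above.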

Optimizing the model artefact

If optimizing for speed, the model artefact itself can have a big impact on performance. The speed at which an ML model can return results for a given input depends on the model’s architecture, its size, the precision of its weights, and the input size. To reduce the amount of computation inherently required per inference, it is worth considering:

  • Model pruning to reduce parameters that may be unimportant. This can help reduce model size without having a big impact on the quality of the model’s outputs.

  • Quantization to reduce the computational and memory overheads of running inference by using lower-precision data types for model weights and activations (see the sketch after this list).

  • Dimensionality reduction of inputs to reduce the complexity of computation.

  • Efficient model architectures, such as MobileNet, EfficientNet, or DistilBERT, which are designed for faster inference with minimal accuracy loss.

  • Optimized model formats and runtimes like ONNX Runtime, TensorRT, or OpenVINO, which leverage hardware-specific acceleration for improved performance.
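As an illustration of the quantization point above, the following sketch applies dynamic (post-training) quantization to an ONNX model using ONNX Runtime's quantization tooling; the file paths are placeholders, and whether int8 weights are acceptable depends on how much accuracy loss your use-case can tolerate:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Hypothetical paths: the original FP32 model and the quantized output artefact.
quantize_dynamic(
    model_input="model-fp32.onnx",
    model_output="model-int8.onnx",
    weight_type=QuantType.QInt8,  # store weights as int8 to shrink the artefact and speed up inference
)
```

After quantizing, validate accuracy on a held-out dataset and compare latency against the FP32 artefact in a load test before promoting the new model.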


Models

This section covers how to optimize model performance in Seldon Core 2, from initial load testing to infrastructure setup and inference optimization. Each subsection provides detailed guidance on a different aspect of performance tuning:

Load Testing

Learn how to conduct effective load testing to understand your model's performance characteristics:

  • Determining load saturation points

  • Understanding closed-loop vs. open-loop testing

  • Determining the right number of replicas based on your configuration (model, infrastructure, etc.)

  • Setting up reproducible test environments

  • Interpreting test results for autoscaling configuration

Inference

Explore different approaches to optimize inference performance:

  • Choosing between gRPC and REST protocols

  • Implementing adaptive batching

  • Optimizing input dimensions

  • Configuring parallel processing with workers

Infrastructure Setup

Understand how to configure the underlying infrastructure for optimal model performance:

  • Choosing between CPU and GPU deployments

  • Setting appropriate CPU specifications

  • Configuring thread affinity

  • Managing memory allocation

Each of these aspects plays a crucial role in achieving optimal model performance. We recommend starting with load testing to establish a baseline, then using the insights gained to inform your infrastructure setup and inference optimization strategies.

Infrastructure Setup

CPUs vs. GPUs

Overall performance of your models will be constrained by the specifications of the underlying hardware on which they run, and how that hardware is leveraged. Choosing between CPUs and GPUs depends on the latency and throughput requirements of your use-case, as well as the type of model you are putting into production:

  • CPUs are generally sufficient for lightweight models, such as tree-based models, regression models, or small neural networks.


  • GPUs are recommended for deep learning models, large-scale neural networks, and large language models (LLMs) where lower latency is critical. Models with high matrix computation demands (like transformers or CNNs) benefit significantly from GPU acceleration.

If cost is a concern, it is recommended to start with CPUs and use profiling or performance monitoring tools (e.g. py-spy or scalene) to identify CPU bottlenecks. Based on these results, you can transition to GPUs as needed.
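The tools mentioned above are typically attached to the running server process. As a quick offline alternative (not one of the tools named above), Python's built-in cProfile can give a first indication of whether a model's predict path is CPU-bound; the predict function and input below are hypothetical stand-ins:

```python
import cProfile
import pstats

import numpy as np

def predict(batch: np.ndarray) -> np.ndarray:
    """Stand-in for your model's predict function; replace with the real inference call."""
    # Deliberately CPU-heavy placeholder work to illustrate the profiling output.
    return batch @ batch.T

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    predict(np.random.rand(256, 256))
profiler.disable()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)  # top 10 hotspots
```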

Setting the right specifications for CPUs

If working with models that will receive many concurrent requests in production, individual CPUs can often become a bottleneck when processing data. In these cases, increasing the number of parallel workers can help. This can be configured through your serving solution, as described in the Inference section. When increasing the number of workers available to process concurrent requests, it is best practice to ensure the number of workers is not significantly higher than the number of available CPU cores, in order to reduce contention. Each worker executes in its own process. This is most relevant for synchronous models, where subsequent processing is blocked until each request completes.

Leveraging Multiple CPU Cores

For more advanced configuration of CPU utilization, users can configure thread affinity through environment variables that determine how threads are bound to physical processors. For example, KMP_AFFINITY and OMP_NUM_THREADS can be set for technologies that use OpenMP. For more information on thread affinity, see here. In general, the ML framework that you’re using might have its own recommendations for improving resource usage.
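A minimal sketch of how these variables might be set, assuming a framework that uses OpenMP and a pod where each MLServer worker should be limited to a single OpenMP thread (in practice these are usually set as container environment variables rather than in code, and must be set before the framework is imported):

```python
import os

# Hypothetical values: limit each worker process to one OpenMP thread and use a compact
# thread-to-core binding, so parallel workers do not oversubscribe the pod's CPU cores.
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")

# The ML framework must be imported after these variables are set for them to take effect.
import torch  # hypothetical framework; could equally be another OpenMP-based library
```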

Finally, increasing the RAM available to your models can improve performance for memory-intensive models, such as those with large parameter counts, those that require high-dimensional data processing, or those involving complex intermediate computations.