1 of 20

Kubernetes

Kubernetes provides the core platform for production use. However, the architecture in agnostic to the underlying cluster technology as shown by the local Docker installation.

Scaling

Models

Models can be scaled by setting their replica count, e.g.

Currently, the number of replicas will need not to exceed the replicas of the Server the model is scheduled to.

Servers

Servers can be scaled by setting their replica count, e.g.

Currently, models scheduled to a server can only scale up to the server replica count.

Internal Components

Seldon Core 2 runs with several control and dataplane components. The scaling of these resources is discussed below:

Pipeline gateway.
- This pipeline gateway handles REST and gRPC synchronous requests to Pipelines. It is stateless and can be scaled based on traffic demand.
Model gateway.
- This component pulls model requests from Kafka and sends them to inference servers. It can be scaled up to the partition factor of your Kafka topics. At present we set a uniform partition factor for all topics in one installation of Seldon .
Dataflow engine.
- The dataflow engine runs KStream topologies to manage Pipelines. It can run as multiple replicas and the scheduler will balance Pipelines to run across it with a consistent hashing load balancer. Each Pipeline is managed up to the partition factor of Kafka (presently hardwired to one).
Scheduler.
- This manages the control plane operations. It is presently required to be one replica as it maintains internal state within a BadgerDB held on local persistent storage (stateful set in Kubernetes). Performance tests have shown this not to be a bottleneck at present.
Kubernetes Controller.
- The Kubernetes controller manages resources updates on the cluster which it passes on to the Scheduler. It is by default one replica but has the ability to scale.
Envoy
- Envoy replicas get their state from the scheduler for routing information and can be scaled as needed.

Future Enhancements

Allow configuration of partition factor for data plane consistent hashing load balancer.
Allow Model gateway and Pipeline gateway to use consistent hashing load balancer.
Consider control plane scaling options.

Autoscaling

Autoscaling in Seldon applies to various concerns:

Inference servers autoscaling
Model autoscaling
Model memory overcommit

Inference servers autoscaling

Autoscaling of servers can be done via HorizontalPodAutoscaler (HPA).

HPA can be applied to any deployed Server resource. In this case HPA will manage the number of server replicas in the corresponding statefulset according to utilisation metrics (e.g. CPU or memory).

For example assuming that a triton server is deployed, then the user can attach an HPA based on cpu utilisation as follows:

kubectl autoscale server triton --cpu-percent=50 --min=1 --max=5

In this case, according to load, the system will add / remove server replicas to / from the triton statefulset.

It is worth considering the following points:

If HPA adds a new server replica, this new replica will be included in any future scheduling decisions. In other words, when deploying a new model or rescheduling failed models this new replica will be considered.
If HPA deletes an existing server replica, the scheduler will first attempt to drain any loaded model on this server replica before the server replica gets actually deleted. This is achieved by leveraging a PreStop hook on the server replica pod that triggers a process before receiving the termination signal. This draining process is capped by terminationGracePeriodSeconds, which the user can set (default is 2 minutes).

Therefore there should generally be minimal disruption to the inference workload during scaling.

For more details on HPA check this Kubernetes walk-through.

Autoscaling of inference servers via seldon-scheduler is under consideration for the roadmap. This allow for more fine grained interactions with model autoscaling.

Autoscaling both Models and Servers using HPA and custom metrics is possible for the special case of single model serving (i.e. single model per server). Check the detailed documentation here. For multi-model serving (MMS), a different solution is needed as discussed below.

Model autoscaling

As each model server can serve multiple models, models can scale across the available replicas of the server according to load.

Autoscaling of models is enabled if at least MinReplicas or MaxReplicas is set in the model custom resource. Then according to load the system will scale the number of Replicas within this range.

For example the following model will be deployed at first with 1 replica and it can scale up according to load.

# samples/models/tfsimple_scaling.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: tfsimple
spec:
  storageUri: "gs://seldon-models/triton/simple"
  requirements:
  - tensorflow
  memory: 100Ki
  minReplicas: 1
  replicas: 1

Note that model autoscaling will not attempt to add extra servers if the desired number of replicas cannot be currently fulfilled by the current provisioned number of servers. This is a process left to be done by server autoscaling.

Additionally when the system autoscales, the initial model spec is not changed (e.g. the number of Replicas) and therefore the user cannot reset the number of replicas back to the initial specified value without an explicit change.

If only Replicas is specified by the user, autoscaling of models is disabled and the system will have exactly the number of replicas of this model deployed regardless of inference load.

Architecture

The model autoscaling architecture is designed such as each agent decides on which models to scale up / down according to some defined internal metrics and then sends a triggering message to the scheduler. The current metrics are collected from the data plane (inference path), representing a proxy on how loaded is a given model with fulfilling inference requests.

Agent autoscaling stats collection

Scale up logic:

The main idea is that we keep the "lag" for each model. We define the "lag" as the difference between incoming and outgoing requests in a given time period. If the lag crosses a threshold, then we trigger a model scale up event. This threshold can be defined via SELDON_MODEL_INFERENCE_LAG_THRESHOLD inference server environment variable.

Scale down logic:

For now we keep things simple and we trigger model scale down events if a model has not been used for a number of seconds. This is defined in SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD inference server environment variable.

Each agent checks the above stats periodically and if any model hits the corresponding threshold, then the agent sends an event to the scheduler to request model scaling.

How often this process executes can be defined via SELDON_SCALING_STATS_PERIOD_SECONDS inference server environment variable.

Scheduler autoscaling

The scheduler will perform model autoscale if:

The model is stable (no state change in the last 5 minutes) and available.
The desired number of replicas is within range. Note we always have a least 1 replica of any deployed model and we rely on over commit to reduce the resources used further.
For scaling up, there is enough capacity for the new model replica.

Model memory overcommit

Servers can hold more models than available memory if overcommit is swictched on (default yes). This allows under utilized models to be moved from inference server memory to allow for other models to take their place. Note that these evicted models are still registered and in the case future inference requests arrive, the system will reload the models back to memory before serving the requests. If traffic patterns for inference of models vary then this can allow more models than available server memory to be run on the system.

HPA Autoscaling in single-model serving

Learn how to jointly autoscale model and server replicas based on a metric of inference requests per second (RPS) using HPA, when there is a one-to-one correspondence between models and servers (single-model serving). This will require:

Having a Seldon Core 2 install that publishes metrics to prometheus (default). In the following, we will assume that prometheus is already installed and configured in the seldon-monitoring namespace.
Installing and configuring Prometheus Adapter, which allows prometheus queries on relevant metrics to be published as k8s custom metrics
Configuring HPA manifests to scale Models and the corresponding Server replicas based on the custom metrics

Installing and configuring the Prometheus Adapter

The role of the Prometheus Adapter is to expose queries on metrics in Prometheus as k8s custom or external metrics. Those can then be accessed by HPA in order to take scaling decisions.

To install through helm:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install --set prometheus.url='http://seldon-monitoring-prometheus' hpa-metrics prometheus-community/prometheus-adapter -n seldon-monitoring

These commands install prometheus-adapter as a helm release named hpa-metrics in the same namespace where Prometheus is installed, and point to its service URL (without the port).

The URL is not fully qualified as it references a Prometheus instance running in the same namespace. If you are using a separately-managed Prometheus instance, please update the URL accordingly.

If you are running Prometheus on a different port than the default 9090, you can also pass --set prometheus.port=[custom_port] You may inspect all the options available as helm values by running helm show values prometheus-community/prometheus-adapter

We now need to configure the adapter to look for the correct prometheus metrics and compute per-model RPS values. On install, the adapter has created a ConfigMap in the same namespace as itself, named [helm_release_name]-prometheus-adapter. In our case, it will be hpa-metrics-prometheus-adapter.

Overwrite the ConfigMap as shown in the following manifest, after applying any required customizations.

Change the name if you've chosen a different value for the prometheus-adapter helm release name. Change the namespace to match the namespace where prometheus-adapter is installed.

prometheus-adapter.config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: hpa-metrics-prometheus-adapter
  namespace: seldon-monitoring
data:
  config.yaml: |-
    "rules":
    -
      "seriesQuery": |
         {__name__=~"^seldon_model.*_total",namespace!=""}
      "seriesFilters":
        - "isNot": "^seldon_.*_seconds_total"
        - "isNot": "^seldon_.*_aggregate_.*"
      "resources":
        "overrides":
          "model": {group: "mlops.seldon.io", resource: "model"}
          "server": {group: "mlops.seldon.io", resource: "server"}
          "pod": {resource: "pod"}
          "namespace": {resource: "namespace"}
      "name":
        "matches": "^seldon_model_(.*)_total"
        "as": "${1}_rps"
      "metricsQuery": |
        sum by (<<.GroupBy>>) (
          rate (
            <<.Series>>{<<.LabelMatchers>>}[1m]
          )
        )

In this example, a single rule is defined to fetch the seldon_model_infer_total metric from Prometheus, compute its rate over a 1 minute window, and expose this to k8s as the infer_rps metric, with aggregations at model, server, inference server pod and namespace level.

A list of all the Prometheus metrics exposed by Seldon Core 2 in relation to Models, Servers and Pipelines is available here, and those may be used when customizing the configuration.

Understanding prometheus-adapter rule definitions

The rule definition can be broken down in four parts:

Discovery (the seriesQuery and seriesFilters keys) controls what Prometheus metrics are considered for exposure via the k8s custom metrics API.
In the example, all the Seldon Prometheus metrics of the form seldon_model_*_total are considered, excluding metrics pre-aggregated across all models (.*_aggregate_.*) as well as the cummulative infer time per model (.*_seconds_total). For RPS, we are only interested in the model inference count (seldon_model_infer_total)
Association (the resources key) controls the Kubernetes resources that a particular metric can be attached to or aggregated over.
The resources key defines an association between certain labels from the Prometheus metric and k8s resources. For example, on line 17, "model": {group: "mlops.seldon.io", resource: "model"} lets prometheus-adapter know that, for the selected Prometheus metrics, the value of the "model" label represents the name of a k8s model.mlops.seldon.io CR.
One k8s custom metric is generated for each k8s resource associated with a prometheus metric. In this way, it becomes possible to request the k8s custom metric values for models.mlops.seldon.io/iris or for servers.mlops.seldon.io/mlserver.
The labels that do not refer to a namespace resource generate "namespaced" custom metrics (the label values refer to resources which are part of a namespace) -- this distinction becomes important when needing to fetch the metrics via kubectl, and in understanding how certain Prometheus query template placeholders are replaced.
Naming (the name key) configures the naming of the k8s custom metric.
In the example ConfigMap, this is configured to take the Prometheus metric named seldon_model_infer_total and expose custom metric endpoints named infer_rps, which when called return the result of a query over the Prometheus metric.
The matching over the Prometheus metric name uses regex group capture expressions (line 22), which are then be referenced in the custom metric name (line 23).
Querying (the metricsQuery key) defines how a request for a specific k8s custom metric gets converted into a Prometheus query.
The query can make use of the following placeholders:
- .Series is replaced by the discovered prometheus metric name (e.g. seldon_model_infer_total)
- .LabelMatchers, when requesting a namespaced metric for resource X with name x in namespace n, is replaced by X=~"x",namespace="n". For example, model=~"iris0", namespace="seldon-mesh". When requesting the namespace resource itself, only the namespace="n" is kept.
- .GroupBy is replaced by the resource type of the requested metric (e.g. model, server, pod or namespace).
You may want to modify the query in the example to match the one that you typically use in your monitoring setup for RPS metrics. The example calls rate() with a 1 minute window.

For a complete reference for how prometheus-adapter can be configured via the ConfigMap, please consult the docs here.

Once you have applied any necessary customizations, replace the default prometheus-adapter config with the new one, and restart the deployment (this restart is required so that prometheus-adapter picks up the new config):

# Replace default prometheus adapter config
kubectl replace -f prometheus-adapter.config.yaml
# Restart prometheus-adapter pods
kubectl rollout restart deployment hpa-metrics-prometheus-adapter -n seldon-monitoring

Testing the install using the custom metrics API

In order to test that the prometheus adapter config works and everything is set up correctly, you can issue raw kubectl requests against the custom metrics API

If no inference requests were issued towards any model in the Core 2 install, the corresponding metrics will not be available in prometheus, and thus will also not appear when checking via kubectl. Therefore, please first run some inference requests towards a sample model to ensure that the metrics are available — this is only required for the testing of the install.

Listing the available metrics:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq .

For namespaced metrics, the general template for fetching is:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/[NAMESPACE]/[API_RESOURCE_NAME]/[CR_NAME]/[METRIC_NAME]"

For example:

Fetching model RPS metric for a specific (namespace, model) pair (seldon-mesh, irisa0):

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/models.mlops.seldon.io/irisa0/infer_rps

Fetching model RPS metric aggregated at the (namespace, server) level (seldon-mesh, mlserver):

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/servers.mlops.seldon.io/mlserver/infer_rps

Fetching model RPS metric aggregated at the (namespace, pod) level (seldon-mesh, mlserver-0):

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/pods/mlserver-0/infer_rps

Fetching the same metric aggregated at namespace level (seldon-mesh):

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/*/metrics/infer_rps

Configuring HPA manifests

For every (Model, Server) pair you want to autoscale, you need to apply 2 HPA manifests based on the same metric: one scaling the Model, the other the Server. The example below only works if the mapping between Models and Servers is 1-to-1 (i.e no multi-model serving).

Consider a model named irisa0 with the following manifest. Please note we don’t set minReplicas/maxReplicas. This disables the seldon lag-based autoscaling so that it doesn’t interact with HPA (separate minReplicas/maxReplicas configs will be set on the HPA side)

You must also explicitly define a value for spec.replicas. This is the key modified by HPA to increase the number of replicas, and if not present in the manifest it will result in HPA not working until the Model CR is modified to have spec.replicas defined.

irisa0.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: irisa0
  namespace: seldon-mesh
spec:
  memory: 3M
  replicas: 1
  requirements:
  - sklearn
  storageUri: gs://seldon-models/testing/iris1

Let’s scale this model when it is deployed on a server named mlserver, with a target RPS per replica of 3 RPS (higher RPS would trigger scale-up, lower would trigger scale-down):

irisa0-hpa.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: irisa0-model-hpa
  namespace: seldon-mesh
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    name: irisa0
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Object
    object:
      metric:
        name: infer_rps
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: irisa0
      target:
        type: AverageValue
        averageValue: 3
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mlserver-server-hpa
  namespace: seldon-mesh
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    name: mlserver
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Object
    object:
      metric:
        name: infer_rps
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: irisa0
      target:
        type: AverageValue
        averageValue: 3

In the preceding HPA manifests, the scaling metric is exactly the same, and uses the exact same parameters. This is to ensure that both the Models and the Servers are scaled up/down at approximately the same time. Small variations in the scale-up time are expected because each HPA samples the metrics independently, at regular intervals.

If a Model gets scaled up slightly before its corresponding Server, the model is currently marked with the condition ModelReady "Status: False" with a "ScheduleFailed" message until new Server replicas become available. However, the existing replicas of that model remain available and will continue to serve inference load.

In order to ensure similar scaling behaviour between Models and Servers, the number of minReplicas and maxReplicas, as well as any other configured scaling policies should be kept in sync across the HPA for the model and the server.

Details on custom metrics of type Object

The HPA manifests use metrics of type "Object" that fetch the data used in scaling decisions by querying k8s metrics associated with a particular k8s object. The endpoints that HPA uses for fetching those metrics are the same ones that were tested in the previous section using kubectl get --raw .... Because you have configured the Prometheus Adapter to expose those k8s metrics based on queries to Prometheus, a mapping exists between the information contained in the HPA Object metric definition and the actual query that is executed against Prometheus. This section aims to give more details on how this mapping works.

In our example, the metric.name:infer_rps gets mapped to the seldon_model_infer_total metric on the prometheus side, based on the configuration in the name section of the Prometheus Adapter ConfigMap. The prometheus metric name is then used to fill in the <<.Series>> template in the query (metricsQuery in the same ConfigMap).

Then, the information provided in the describedObject is used within the Prometheus query to select the right aggregations of the metric. For the RPS metric used to scale the Model (and the Server because of the 1-1 mapping), it makes sense to compute the aggregate RPS across all the replicas of a given model, so the describedObject references a specific Model CR.

However, in the general case, the describedObject does not need to be a Model. Any k8s object listed in the resources section of the Prometheus Adapter ConfigMap may be used. The Prometheus label associated with the object kind fills in the <<.GroupBy>> template, while the name gets used as part of the <<.LabelMatchers>>. For example:

If the described object is { kind: Namespace, name: seldon-mesh }, then the Prometheus query template configured in our example would be transformed into:

sum by (namespace) (
  rate (
    seldon_model_infer_total{namespace="seldon-mesh"}[1m]
  )
)

If the described object is not a namespace (for example, { kind: Pod, name: mlserver-0 }) then the query will be passed the label describing the object, alongside an additional label identifying the namespace where the HPA manifest resides in.:

sum by (pod) (
  rate (
    seldon_model_infer_total{pod="mlserver-0", namespace="seldon-mesh"}[1m]
  )
)

For the target of the Object metric you must use a type of AverageValue. The value given in averageValue represents the per replica RPS scaling threshold of the scaleTargetRef object (either a Model or a Server in our case), with the target number of replicas being computed by HPA according to the following formula:

Attempting other target types does not work under the current Seldon Core 2 setup, because they use the number of active Pods associated with the Model CR (i.e. the associated Server pods) in the targetReplicas computation. However, this also means that this set of pods becomes "owned" by the Model HPA. Once a pod is owned by a given HPA it is not available for other HPAs to use, so we would no longer be able to scale the Server CRs using HPA.

HPA sampling of custom metrics

Each HPA CR has it's own timer on which it samples the specified custom metrics. This timer starts when the CR is created, with sampling of the metric being done at regular intervals (by default, 15 seconds).

As a side effect of this, creating the Model HPA and the Server HPA (for a given model) at different times will mean that the scaling decisions on the two are taken at different times. Even when creating the two CRs together as part of the same manifest, there will usually be a small delay between the point where the Model and Server spec.replicas values are changed.

Despite this delay, the two will converge to the same number when the decisions are taken based on the same metric (as in the previous examples).

When showing the HPA CR information via kubectl get, a column of the output will display the current metric value per replica and the target average value in the format [per replica metric value]/[target]. This information is updated in accordance to the sampling rate of each HPA resource. It is therefore expected to sometimes see different metric values for the Model and it's corresponding Server.

Some versions of k8s will display [per pod metric value] instead of [per replica metric value], with the number of pods being computed based on a label selector present in the target resource CR (the status.selector value for the Model or Server in the Core 2 case).

HPA is designed so that multiple HPA CRs cannot target the same underlying pod with this selector (with HPA stopping when such a condition is detected). This means that in Core 2, the Model and Server selector cannot be the same. A design choice was made to assign the Model a unique selector that does not match any pods.

As a result, for the k8s versions displaying [per pod metric value], the information shown for the Model HPA CR will be an overflow caused by division by zero. This is only a display artefact, with the Model HPA continuing to work normally. The actual value of the metric can be seen by inspecting the corresponding Server HPA CR, or by fetching the metric directly via kubectl get --raw

Advanced settings

Filtering metrics by additional labels on the prometheus metric:

The prometheus metric from which the model RPS is computed has the following labels:

seldon_model_infer_total{
    code="200", 
    container="agent", 
    endpoint="metrics", 
    instance="10.244.0.39:9006", 
    job="seldon-mesh/agent", 
    method_type="rest", 
    model="irisa0", 
    model_internal="irisa0_1", 
    namespace="seldon-mesh", 
    pod="mlserver-0", 
    server="mlserver", 
    server_replica="0"
}

If you want the scaling metric to be computed based on inferences with a particular value for any of those labels, you can add this in the HPA metric config, as in the example (targeting method_type="rest"):

  metrics:
  - type: Object
    object:
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: irisa0
      metric:
        name: infer_rps
        selector:
          matchLabels:
            method_type: rest
      target:
	    type: AverageValue
        averageValue: "3"

Customize scale-up / scale-down rate & properties by using scaling policies as described in the HPA scaling policies docs
For more resources, please consult the HPA docs and the HPA walkthrough

Cluster operation guidelines when using HPA-based scaling

When deploying HPA-based scaling for Seldon Core 2 models and servers as part of a production deployment, it is important to understand the exact interactions between HPA-triggered actions and Seldon Core 2 scheduling, as well as potential pitfalls in choosing particular HPA configurations.

Using the default scaling policy, HPA is relatively aggressive on scale-up (responding quickly to increases in load), with a maximum replicas increase of either 4 every 15 seconds or 100% of existing replicas within the same period (whichever is highest). In contrast, scaling-down is more gradual, with HPA only scaling down to the maximum number of recommended replicas in the most recent 5 minute rolling window, in order to avoid flapping. Those parameters can be customized via scaling policies.

When using custom metrics such as RPS, the actual number of replicas added during scale-up or reduced during scale-down will entirely depend, alongside the maximums imposed by the policy, on the configured target (averageValue RPS per replica) and on how quickly the inferencing load varies in your cluster. All three need to be considered jointly in order to deliver both an efficient use of resources and meeting SLAs.

Customizing per-replica RPS targets and replica limits

Naturally, the first thing to consider is an estimated peak inference load (including some margins) for each of the models in the cluster. If the minimum number of model replicas needed to serve that load without breaching latency SLAs is known, it should be set as spec.maxReplicas, with the HPA target.averageValue set to peak_infer_RPS/maxReplicas.

If maxReplicas is not already known, an open-loop load test with a slowly ramping up request rate should be done on the target model (one replica, no scaling). This would allow you to determine the RPS (inference request throughput) when latency SLAs are breached or (depending on the desired operation point) when latency starts increasing. You would then set the HPA target.averageValue taking some margin below this saturation RPS, and compute spec.maxReplicas as peak_infer_RPS/target.averageValue. The margin taken below the saturation point is very important, because scaling-up cannot be instant (it requires spinning up new pods, downloading model artifacts, etc.). In the period until the new replicas become available, any load increases will still need to be absorbed by the existing replicas.

If there are multiple models which typically experience peak load in a correlated manner, you need to ensure that sufficient cluster resources are available for k8s to concurrently schedule the maximum number of server pods, with each pod holding one model replica. This can be ensured by using either Cluster Autoscaler or, when running workloads in the cloud, any provider-specific cluster autoscaling services.

It is important for the cluster to have sufficient resources for creating the total number of desired server replicas set by the HPA CRs across all the models at a given time.

Not having sufficient cluster resources to serve the number of replicas configured by HPA at a given moment, in particular under aggressive scale-up HPA policies, may result in breaches of SLAs. This is discussed in more detail in the following section.

A similar approach should be taken for setting minReplicas, in relation to estimated RPS in the low-load regime. However, it's useful to balance lower resource usage to immediate availability of replicas for inference rate increases from that lowest load point. If low-load regimes only occur for small periods of time, and especially combined with a high rate of increase in RPS when moving out of the low-load regime, it might be worth to set the minReplicas floor higher in order to ensure SLAs are met at all times.

Customizing HPA policy settings for ensuring correct scaling behaviour

Each spec.replica value change for a Model or Server triggers a rescheduling event for the Seldon Core 2 scheduler, which considers any updates that are needed in mapping Model replicas to Server replicas such as rescheduling failed Model replicas, loading new ones, unloading in the case of the number of replicas going down, etc.

Two characteristics in the current implementation are important in terms of autoscaling and configuring the HPA scale-up policy:

The scheduler does not create new Server replicas when the existing replicas are not sufficient for loading a Model's replicas (one Model replica per Server replica). Whenever a Model requests more replicas than available on any of the available Servers, its ModelReady condition transitions to Status: False with a ScheduleFailed message. However, any replicas of that Model that are already loaded at that point remain available for servicing inference load.
There is no partial scheduling of replicas. For example, consider a model with 2 replicas, currently loaded on a server with 3 replicas (two of those server replicas will have the model loaded). If you update the model replicas to 4, the scheduler will transition the model to ScheduleFailed, seeing that it cannot satisfy the requested number of replicas. The existing 2 model replicas will continue to serve traffic, but a third replica will not be loaded onto the remaining server replica.
In other words, the scheduler either schedules all the requested replicas, or, if unable to do so, leaves the state of the cluster unchanged.
Introducing partial scheduling would make the overall results of assigning models to servers significantly less predictable and ephemeral. This is because models may end up moved back-and forth between servers depending on the speed with which various server replicas become available. Network partitions or other transient errors may also trigger large changes to the model-to-server assignments, making it challenging to sustain consistent data plane load during those periods.

Taken together, the two Core 2 scheduling characteristics, combined with a very aggressive HPA scale-up policy and a continuously increasing RPS may lead to the following pathological case:

Based on RPS, HPA decides to increase both the Model and Server replicas from 2 (an example start stable state) to 8. While the 6 new Server pods get scheduled and get the Model loaded onto them, the scheduler will transition the Model into the ScheduleFailed state, because it cannot fulfill the requested replicas requirement. During this period, the initial 2 Model replicas continue to serve load, but are using their RPS margins and getting closer to the saturation point.
At the same time, load continues to increase, so HPA further increases the number of required Model and Server replicas from 8 to 12, before all of the 6 new Server pods had a chance to become available. The new replica target for the scheduler also becomes 12, and this would not be satisfied until all the 12 Server replicas are available. The 2 Model replicas that are available may by now be saturated and the infer latency spikes up, breaching set SLAs.
The process may continue until load stabilizes.
If at any point the number of requested replicas (<=maxReplicas) exceeds the resource capacity of the cluster, the requested server replica count will never be reached and thus the Model will remain permanently in the ScheduleFailed state.

While most likely encountered during continuous ramp-up RPS load tests with autoscaling enabled, the pathological case example is a good showcase for the elements that need to be taken into account when setting the HPA policies.

The speed with which new Server replicas can become available versus how many new replicas may HPA request in a given time:
- The HPA scale-up policy should not be configured to request more replicas than can become available in the specified time. The following example reflects a confidence that 5 Server pods will become available within 90 seconds, with some safety margin. The default scale-up config, that also adds a percentage based policy (double the existing replicas within the set periodSeconds) is not recommended because of this.
- Perhaps more importantly, there is no reason to scale faster than the time it takes for replicas to become available - this is the true maximum rate with which scaling up can happen anyway. Because the underlying Server replica pods are part of a stateful set, they are created sequentially by k8s.

hpa-custom-policy.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: irisa0-model-hpa
  namespace: seldon-mesh
spec:
  scaleTargetRef:
    ...
  minReplicas: 1
  maxReplicas: 3
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 5
        periodSeconds: 90
  metrics:
    ...

The duration of transient load spikes which you might want to absorb within the existing per-replica RPS margins.
- The previous example, at line 13, configures a scale-up stabilization window of one minute. It means that for all of the HPA recommended replicas in the last 60 second window (4 samples of the custom metric considering the default sampling rate), only the smallest will be applied.
- Such stabilization windows should be set depending on typical load patterns in your cluster: not being too aggressive in reacting to increased load will allow you to achieve cost savings, but has the disadvantage of a delayed reaction if the load spike turns out to be sustained.
The duration of any typical/expected sustained ramp-up period, and the RPS increase rate during this period.
- It is useful to consider whether the replica scale-up rate configured via the policy (line 15 in the example) is able to keep-up with this RPS increase rate.
- Such a scenario may appear, for example, if you are planning for a smooth traffic ramp-up in a blue-green deployment as you are draining the "blue" deployment and transitioning to the "green" one

Tracing

We support Open Telemetry tracing. By default all components will attempt to send OLTP events to seldon-collector.seldon-mesh:4317 which will export to Jaeger at simplest-collector.seldon-mesh:4317.

The components can be installed from the tracing/k8s folder. In future an Ansible playbook will be created. This installs a Open Telemetry collector and a simple Jaeger install with a service that can be port forwarded to at simplest.seldon-mesh:16686.

An example Jaeger trace is show below:

Storage Secrets

Inference artifacts referenced by Models can be stored in any of the storage backends supported by Rclone. This includes local filesystems, AWS S3, and Google Cloud Storage (GCS), among others. Configuration is provided out-of-the-box for public GCS buckets, which enables the use of Seldon-provided models like in the below example:

# samples/models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
  requirements:
  - sklearn
  memory: 100Ki

This configuration is provided by the Kubernetes Secret seldon-rclone-gs-public. It is made available to Servers as a preloaded secret. You can define and use your own storage configurations in exactly the same way.

Configuration Format

To define a new storage configuration, you need the following details:

Remote name
Remote type
Provider parameters

A remote is what Rclone calls a storage location. The type defines what protocol Rclone should use to talk to this remote. A provider is a particular implementation for that storage type. Some storage types have multiple providers, such as s3 having AWS S3 itself, MinIO, Ceph, and so on.

The remote name is your choice. The prefix you use for models in spec.storageUri must be the same as this remote name.

The remote type is one of the values supported by Rclone. For example, for AWS S3 it is s3 and for Dropbox it is dropbox.

The provider parameters depend entirely on the remote type and the specific provider you are using. Please check the Rclone documentation for the appropriate provider. Note that Rclone docs for storage types call the parameters properties and provide both config and env var formats--you need to use the config format. For example, the GCS parameter --gcs-client-id described here should be used as client_id.

For reference, this format is described in the Rclone documentation. Note that we do not support the use of opts discussed in that section.

Kubernetes Secrets

Kubernetes Secrets are used to store Rclone configurations, or storage secrets, for use by Servers. Each Secret should contain exactly one Rclone configuration.

A Server can use storage secrets in one of two ways:

It can dynamically load a secret specified by a Model in its .spec.secretName
It can use global configurations made available via preloaded secrets

The name of a Secret is entirely your choice, as is the name of the data key in that Secret. All that matters is that there is a single data key and that its value is in the format described above.

It is possible to use preloaded secrets for some Models and dynamically loaded secrets for others.

Preloaded Secrets

Rather than Models always having to specify which secret to use, a Server can load storage secrets ahead of time. These can then be reused across many Models.

When using a preloaded secret, the Model definition should leave .spec.secretName empty. The protocol prefix in .spec.storageUri still needs to match the remote name specified by a storage secret.

The secrets to preload are named in a centralised ConfigMap called seldon-agent. This ConfigMap applies to all Servers managed by the same SeldonRuntime. By default this ConfigMap only includes seldon-rclone-gs-public, but can be extended with your own secrets as shown below:

# samples/auth/agent.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: seldon-agent
data:
  agent.json: |-
   {
      "rclone" : {
          "config_secrets": ["seldon-rclone-gs-public","minio-secret"]
      },
   }

The easiest way to change this is to update your SeldonRuntime.

If your SeldonRuntime is configured using the seldon-core-v2-runtime Helm chart, the corresponding value is config.agentConfig.rclone.configSecrets. This can be used as shown below:
```
config:
  agentConfig:
    rclone:
      configSecrets:
        - my-s3
        - custom-gcs
        - minio-in-cluster
```

Otherwise, if your SeldonRuntime is configured directly, you can add secrets by setting .spec.config.agentConfig.rclone.config_secrets. This can be used as follows:

apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
  name: seldon
spec:
  seldonConfig: default
  config:
    agentConfig:
      rclone:
        config_secrets:
          - my-s3
          - custom-gcs
          - minio-in-cluster
  ...

Examples

Assuming you have installed MinIO in the minio-system namespace, a corresponding secret could be:

# samples/auth/minio-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: minio-secret
  namespace: seldon-mesh
type: Opaque
stringData:
  s3: |
    type: s3
    name: s3
    parameters:
      provider: minio
      env_auth: false
      access_key_id: minioadmin
      secret_access_key: minioadmin
      endpoint: http://minio.minio-system:9000

You can then reference this in a Model with .spec.secretName:

# samples/models/sklearn-iris-minio.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "s3://models/iris"
  secretName: "minio-secret"
  requirements:
  - sklearn

GCS can use service accounts for access. You can generate the credentials for a service account using the gcloud CLI:

gcloud iam service-accounts keys create \
  gcloud-application-credentials.json \
  --iam-account [SERVICE-ACCOUNT--NAME]@[PROJECT-ID].iam.gserviceaccount.com

The contents of gcloud-application-credentials.json can be put into a secret:

apiVersion: v1
kind: Secret
metadata:
  name: gcs-bucket
type: Opaque
stringData:
  gcs: |
    type: gcs
    name: gcs
    parameters:
      service_account_credentials: '<gcloud-application-credentials.json>'

You can then reference this in a Model with .spec.secretName:

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: mymodel
spec:
  storageUri: "gcs://my-bucket/my-path/my-pytorch-model"
  secretName: "gcs-bucket"
  requirements:
  - pytorch

Kafka

Seldon Core 2 requires Kafka to implement data-centric inference Pipelines. See our architecture documentation to learn more on how Seldon Core 2 uses Kafka.

Kafka integration is required to enable data-centric inference pipelines feature. It is highly advice to configure Kafka integration to take full advantage of Seldon Core 2 features.

We list alternatives below.

Managed Kafka

We recommend to use managed Kafka solution for production installation. This allow to take away all the complexity on running secure and scalable Kafka cluster away.

We currently have tested and documented integration with following managed solutions:

Confluent Cloud (security: SASL/PLAIN)
Confluent Cloud (security: SASL/OAUTHBEARER)
Amazon MSK (security: mTLS)
Amazon MSK (security: SASL/SCRAM)
Azure Event Hub (security: SASL/PLAIN)

See our Kafka security section for configuration examples.

Self Hosted Kafka

Strimzi Kafka

Seldon Core 2 requires Kafka to implement data-centric inference Pipelines. To install Kafka for testing purposed in your k8s cluster, we recommend to use Strimzi Operator.

This page discuss how to install Strimzi Operator and create Kafka cluster for trial, dev, or testing purposes. For production grade installation consult Strimzi documentation or use one of managed solutions mentioned here.

You can install and configure Strimzi using either Helm charts or our Ansible playbooks, both documented below.

Helm

The installation of a Kafka cluster requires the Strimzi Kafka operator installed in the same namespace. This allows to directly use the mTLS certificates created by Strimzi Operator. One option to install the Strimzi operator is via Helm.

Note that we are using here KRaft instead of Zookeeper for Kafka. You can enable featureGates during Helm installation via:

helm upgrade --install strimzi-kafka-operator \
  strimzi/strimzi-kafka-operator \
  --namespace seldon-mesh --create-namespace \
  --set featureGates='+UseKRaft\,+UseStrimziPodSets'

Use with caution! Currently Kraft installation of Strimzi is not production ready. See Strimzi documentation and related GitHub issue for further details.

Create Kafka cluster in seldon-mesh namespace

helm upgrade seldon-core-v2-kafka kafka/strimzi -n seldon-mesh --install

Note that a specific strimzi operator version is associated with a subset of supported Kafka versions.

Ansible

We provide automation around the installation of a Kafka cluster for Seldon Core 2 to help with development and testing use cases. You can follow the steps defined here to install Kafka via ansible.

You can use our Ansible playbooks to install only Strimzi Operator and Kafka cluster by setting extra Ansible vars:

ansible-playbook playbooks/setup-ecosystem.yaml -e full_install=no -e install_kafka=yes

Notes

You can check kafka-examples for more details.
As we are using KRaft, use Kafka version 3.3 or above.
For security settings check here.

Metrics

Metrics are exposed for scrapping by Prometheus.

Example Installation

We recommend to install kube-prometheus that provides an all-in-one package with the Prometheus operator.

RBAC

You will need to modify the default RBAC installed by kube-prometheus as described here.

From the prometheus folder in the project run:

kubectl apply -f rbac/cr.yaml

Monitors

We use a PodMonitor for scrapping agent metrics. The envoy and server monitors are there for completeness but not presently needed.

kubectl apply -f monitors

Includes:

Agent pod monitor. Monitors the metrics port of server inference pods.
Server pod monitor. Monitors the server-metrics port of inference server pods.
Envoy service monitor. Monitors the Envoy gateway proxies.
Pipeline gateway pod monitor. Monitors the metrics port of pipeline gateway pods.

Pod monitors were chosen as ports for metrics are not exposed at service level as we do not have a top level service for server replicas but 1 headless service per replica. Future discussions could reference this.

Example Grafana Dashboard

Check metrics for more information.

Reference

Prometheus CRDs

Resources

For Kubernetes usage we provide a set of custom resources for interacting with Seldon.

SeldonRuntime - for installing Seldon in a particular namespace.
Servers - for deploying sets of replicas of core inference servers (MLServer or Triton).
Models - for deploying single machine learning models, custom transformation logic, drift detectors, outliers detectors and explainers.
Experiments - for testing new versions of models.
Pipelines - for connecting together flows of data between models.

Advanced Customization

SeldonConfig and ServerConfig define the core installation configuration and machine learning inference server configuration for Seldon. Normally, you would not need to customize these but this may be required for your particular custom installation within your organisation.

ServerConfigs - for defining new types of inference server that can be reference by a Server resource.
SeldonConfig - for defining how seldon is installed

Model

A Model is the core atomic building block. It specifies a machine learning artifact that will be loaded onto one of the running Servers. A model could be a standard machine learning inference component such as

a Tensorflow model, PyTorch model or SKLearn model.
an inference transformation component such as a SKLearn pipeline or a piece of custom python logic. a monitoring component such as an outlier detector or drift detector.
An alibi-explain model explainer

An example is shown below for a SKLearn model for iris classification:

:language: yaml

Its Kubernetes spec has two core requirements

A storageUri specifying the location of the artifact. This can be any rclone URI specification.
A requirements list which provides tags that need to be matched by the Server that can run this artifact type. By default when you install Seldon we provide a set of Servers that cover a range of artifact types.

Experiment

An Experiment defines a traffic split between Models or Pipelines. This allows new versions of models and pipelines to be tested.

An experiment spec has three sections:

candidates (required) : a set of candidate models to split traffic.
default (optional) : an existing candidate who endpoint should be modified to split traffic as defined by the candidates.
- Each candidate has a traffic weight. The percentage of traffic will be this weight divided by the sum of traffic weights.
mirror (optional) : a single model to mirror traffic to the candidates. Responses from this model will not be returned to the caller.

An example experiment with a defaultModel is shown below:

This defines a split of 50% traffic between two models iris and iris2. In this case we want to expose this traffic split on the existing endpoint created for the iris model. This allows us to test new versions of models (in this case iris2) on an existing endpoint (in this case iris). The default key defines the model whose endpoint we want to change. The experiment will become active when both underplying models are in Ready status.

An experiment over two separate models which exposes a new API endpoint is shown below:

To call the endpoint add the header seldon-model: <experiment-name>.experiment in this case: seldon-model: experiment-iris.experiment. For example with curl:

Pipeline Experiments

Running an experiment between some pipelines is very similar. The difference is resourceType: pipeline needs to be defined and in this case the candidates or mirrors will refer to pipelines. An example is shown below:

Mirror Experiments

A mirror can be added easily for model or pipeline experiments. An example model mirror experiment is shown below:

An example pipeline mirror experiment is shown below:

Sticky Sessions

To allow cohorts to get consistent views in an experiment each inference request passes back a response header x-seldon-route which can be passed in future requests to an experiment to bypass the random traffic splits and get a prediction from the sequence of models and pipelines used in the initial request.

Note: you must pass the normal seldon-model header along with the x-seldon-route header.

Caveats: the models used will be the same but not necessarily the same replica instances. This means at present this will not work for stateful models that need to go to the same model replica instance.

Service Meshes

Pipeline

Pipelines allow one to connect flows of inference data transformed by Model components. A directed acyclic graph (DAG) of steps can be defined to join Models together. Each Model will need to be capable of receiving a V2 inference request and respond with a V2 inference response. An example Pipeline is shown below:

The steps list shows three models: tfsimple1, tfsimple2 and tfsimple3. These three models each take two tensors called INPUT0 and INPUT1 of integers. The models produce two outputs OUTPUT0 (the sum of the inputs) and OUTPUT1 (subtraction of the second input from the first).

tfsimple1 and tfsimple2 take as inputs the input to the Pipeline: the default assumption when no explicit inputs are defined. tfsimple3 takes one V2 tensor input from each of the outputs of tfsimple1 and tfsimple2. As the outputs of tfsimple1 and tfsimple2 have tensors named OUTPUT0 and OUTPUT1 their names need to be changed to respect the expected input tensors and this is done with a tensorMap component providing this tensor renaming. This is only required if your models can not be directly chained together.

The output of the Pipeline is the output from the tfsimple3 model.

Detailed Specification

The full GoLang specification for a Pipeline is shown below:

Server

The default installation will provide two initial servers: one MLServer and one Triton. You only need to define additional servers for advanced use cases.

A Server defines an inference server onto which models will be placed for inference. By default on installation two server StatefulSets will be deployed one MlServer and one Triton. An example Server definition is shown below:

The main requirement is a reference to a ServerConfig resource in this case mlserver.

Detailed Specs

Custom Servers

One can easily utilize a custom image with the existing ServerConfigs. For example, the following defines an MLServer server with a custom image:

This server can then be targeted by a particular model by specifying this server name when creating the model, for example:

Server with PVC

One can also create a Server definition to add a persistent volume to your server. This can be used to allow models to be loaded directly from the persistent volume.

The server can be targeted by a model whose artifact is on the persistent volume as shown below.

Server Config

This section is for advanced usage where you want to define new types of inference servers.

Server configurations define how to create an inference server. By default one is provided for Seldon MLServer and one for NVIDIA Triton Inference Server. Both these servers support the V2 inference protocol which is a requirement for all inference servers. They define how the Kubernetes ReplicaSet is defined which includes the Seldon Agent reverse proxy as well as an Rclone server for downloading artifacts for the server. The Kustomize ServerConfig for MlServer is shown below:

Server Runtime

The SeldonRuntime resource is used to create an instance of Seldon installed in a particular namespace.

For the definition of SeldonConfiguration above see the .

The specification above contains overrides for the chosen SeldonConfig. To override the PodSpec for a given component, the overrides field needs to specify the component name and the PodSpec needs to specify the container name, along with fields to override.

For instance, the following overrides the resource limits for cpu and memory in the hodometer component in the seldon-mesh namespace, while using values specified in the seldonConfig elsewhere (e.g. default).

As a minimal use you should just define the SeldonConfig to use as a base for this install, for example to install in the seldon-mesh namespace with the SeldonConfig named default:

The helm chart seldon-core-v2-runtime allows easy creation of this resource and associated default Servers for an installation of Seldon in a particular namespace.

SeldonConfig Update Propagation

Seldon Config

This section is for advanced usage where you want to define how seldon is installed in each namespace.

The SeldonConfig resource defines the core installation components installed by Seldon. If you wish to install Seldon, you can use the SeldonRuntime resource which allows easy overriding of some parts defined in this specification. In general, we advise core DevOps to use the default SeldonConfig or customize it for their usage. Individual installation of Seldon can then use the SeldonRuntime with a few overrides for special customisation needed in that namespace.

The specification contains core PodSpecs for each core component and a section for general configuration including the ConfigMaps that are created for the Agent (rclone defaults), Kafka and Tracing (open telemetry).

// operator/apis/mlops/v1alpha1/seldonconfig_types.go
// :language: golang
// :start-after: // SeldonConfigSpec
// :end-before: // SeldonConfigStatus

Some of these values can be overridden on a per namespace basis via the SeldonRuntime resource. Labels and annotations can also be set at the component level - these will be merged with the labels and annotations from the SeldonConfig resource in which they are defined and added to the component's corresponding Deployment, or StatefulSet.

The default configuration is shown below.

# operator/config/seldonconfigs/default.yaml

Service Meshes

The Seldon models and pipelines are exposed via a single service endpoint in the install namespace called seldon-mesh. All models, pipelines and experiments can be reached via this single Service endpoint by setting appropriate headers on the inference REST/gRPC request. By this means Seldon is agnostic to any service mesh you may wish to use in your organisation. We provide some example integrations for some example service meshes below (alphabetical order):

Ambassador
Istio
Traefik

We welcome help to extend these to other service meshes.

Ambassador

Ambassador provides service mesh and ingress products. Our examples here are based on the Emissary ingress.

We will run through some examples as shown in the notebook service-meshes/ambassador/ambassador.ipynb in our repo.

Single Model

Seldon Iris classifier model
Default Ambassador Host and Listener
Ambassador Mappings for REST and gRPC endpoints

# service-meshes/ambassador/static/single-model.yaml
apiVersion: getambassador.io/v3alpha1
kind: Host
metadata:
  name: wildcard
  namespace: seldon-mesh
spec:
  hostname: '*'
  requestPolicy:
    insecure:
      action: Route
---
apiVersion: getambassador.io/v3alpha1
kind: Listener
metadata:
  name: emissary-ingress-listener-8080
  namespace: seldon-mesh
spec:
  hostBinding:
    namespace:
      from: ALL
  port: 8080
  protocol: HTTP
  securityModel: INSECURE
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: iris-grpc
  namespace: seldon-mesh
spec:
  add_request_headers:
    seldon-model:
      value: iris
  grpc: true
  hostname: '*'
  prefix: /inference.GRPCInferenceService
  rewrite: ""
  service: seldon-mesh:80
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: iris-http
  namespace: seldon-mesh
spec:
  add_request_headers:
    seldon-model:
      value: iris
  hostname: '*'
  prefix: /v2/
  rewrite: ""
  service: seldon-mesh:80
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
  namespace: seldon-mesh
spec:
  requirements:
  - sklearn
  storageUri: gs://seldon-models/mlserver/iris

Traffic Split

Traffic splitting does not presently work due to this issue. We recommend you use a Seldon Experiment instead.

Seldon provides an Experiment resource for service mesh agnostic traffic splitting but if you wish to control this via Ambassador and example is shown below to split traffic between two models.

# service-meshes/ambassador/static/traffic-split.yaml
apiVersion: getambassador.io/v3alpha1
kind: Host
metadata:
  name: wildcard
  namespace: seldon-mesh
spec:
  hostname: '*'
  requestPolicy:
    insecure:
      action: Route
---
apiVersion: getambassador.io/v3alpha1
kind: Listener
metadata:
  name: emissary-ingress-listener-8080
  namespace: seldon-mesh
spec:
  hostBinding:
    namespace:
      from: ALL
  port: 8080
  protocol: HTTP
  securityModel: INSECURE
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: iris1-grpc
  namespace: seldon-mesh
spec:
  add_request_headers:
    seldon-model:
      value: iris1
  grpc: true
  hostname: '*'
  prefix: /inference.GRPCInferenceService
  rewrite: ""
  service: seldon-mesh:80
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: iris1-http
  namespace: seldon-mesh
spec:
  add_request_headers:
    seldon-model:
      value: iris1
  add_response_headers:
    seldon_model:
      value: iris1
  hostname: '*'
  prefix: /v2
  rewrite: ""
  service: seldon-mesh:80
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: iris2-grpc
  namespace: seldon-mesh
spec:
  add_request_headers:
    seldon-model:
      value: iris2
  grpc: true
  hostname: '*'
  prefix: /inference.GRPCInferenceService
  rewrite: ""
  service: seldon-mesh:80
  weight: 50
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: iris2-http
  namespace: seldon-mesh
spec:
  add_request_headers:
    seldon-model:
      value: iris2
  add_response_headers:
    seldon_model:
      value: iris2
  hostname: '*'
  prefix: /v2
  rewrite: ""
  service: seldon-mesh:80
  weight: 50
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris1
  namespace: seldon-mesh
spec:
  requirements:
  - sklearn
  storageUri: gs://seldon-models/mlserver/iris
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris2
  namespace: seldon-mesh
spec:
  requirements:
  - sklearn
  storageUri: gs://seldon-models/mlserver/iris

Ambassador Notebook Example

Assumes

You have installed emissary as per their docs

Tested with

emissary-ingress-7.3.2 insatlled via helm

INGRESS_IP=!kubectl get svc emissary-ingress -n emissary -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
INGRESS_IP=INGRESS_IP[0]
import os
os.environ['INGRESS_IP'] = INGRESS_IP
INGRESS_IP

'172.21.255.1'

Ambassador Single Model Example

!kustomize build config/single-model

apiVersion: getambassador.io/v3alpha1
kind: Host
metadata:
  name: wildcard
  namespace: seldon-mesh
spec:
  hostname: '*'
  requestPolicy:
    insecure:
      action: Route
---
apiVersion: getambassador.io/v3alpha1
kind: Listener
metadata:
  name: emissary-ingress-listener-8080
  namespace: seldon-mesh
spec:
  hostBinding:
    namespace:
      from: ALL
  port: 8080
  protocol: HTTP
  securityModel: INSECURE
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: iris-grpc
  namespace: seldon-mesh
spec:
  add_request_headers:
    seldon-model:
      value: iris
  grpc: true
  hostname: '*'
  prefix: /inference.GRPCInferenceService
  rewrite: ""
  service: seldon-mesh:80
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: iris-http
  namespace: seldon-mesh
spec:
  add_request_headers:
    seldon-model:
      value: iris
  hostname: '*'
  prefix: /v2/
  rewrite: ""
  service: seldon-mesh:80
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
  namespace: seldon-mesh
spec:
  requirements:
  - sklearn
  storageUri: gs://seldon-models/mlserver/iris

!kustomize build config/single-model | kubectl apply --validate=false -f -

host.getambassador.io/wildcard created
listener.getambassador.io/emissary-ingress-listener-8080 created
mapping.getambassador.io/iris-grpc created
mapping.getambassador.io/iris-http created
model.mlops.seldon.io/iris created

!kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh

model.mlops.seldon.io/iris condition met

!curl -v http://${INGRESS_IP}/v2/models/iris/infer -H "Content-Type: application/json"\
        -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'

*   Trying 172.21.255.1...
* Connected to 172.21.255.1 (172.21.255.1) port 80 (#0)
> POST /v2/models/iris/infer HTTP/1.1
> Host: 172.21.255.1
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 94
>
* upload completely sent off: 94 out of 94 bytes
< HTTP/1.1 200 OK
< content-length: 196
< content-type: application/json
< date: Sat, 16 Apr 2022 15:45:43 GMT
< server: envoy
< x-envoy-upstream-service-time: 792
< seldon-route: iris_1
<
* Connection #0 to host 172.21.255.1 left intact
{"model_name":"iris_1","model_version":"1","id":"72ac79f5-b355-4be3-b8c5-2ebedaa39f60","parameters":null,"outputs":[{"name":"predict","shape":[1],"datatype":"INT64","parameters":null,"data":[2]}]}

!grpcurl -d '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' \
    -plaintext \
    -import-path ../../apis \
    -proto ../../apis/mlops/v2_dataplane/v2_dataplane.proto \
    ${INGRESS_IP}:80 inference.GRPCInferenceService/ModelInfer

{
  "modelName": "iris_1",
  "modelVersion": "1",
  "outputs": [
    {
      "name": "predict",
      "datatype": "INT64",
      "shape": [
        "1"
      ],
      "contents": {
        "int64Contents": [
          "2"
        ]
      }
    }
  ]
}

!kustomize build config/single-model | kubectl delete -f -

host.getambassador.io "wildcard" deleted
listener.getambassador.io "emissary-ingress-listener-8080" deleted
mapping.getambassador.io "iris-grpc" deleted
mapping.getambassador.io "iris-http" deleted
model.mlops.seldon.io "iris" deleted

Traffic Split Two Models

Currently not working due to this issue

!kustomize build config/traffic-split

apiVersion: getambassador.io/v3alpha1
kind: Host
metadata:
  name: wildcard
  namespace: seldon-mesh
spec:
  hostname: '*'
  requestPolicy:
    insecure:
      action: Route
---
apiVersion: getambassador.io/v3alpha1
kind: Listener
metadata:
  name: emissary-ingress-listener-8080
  namespace: seldon-mesh
spec:
  hostBinding:
    namespace:
      from: ALL
  port: 8080
  protocol: HTTP
  securityModel: INSECURE
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: iris1-grpc
  namespace: seldon-mesh
spec:
  add_request_headers:
    seldon-model:
      value: iris1
  grpc: true
  hostname: '*'
  prefix: /inference.GRPCInferenceService
  rewrite: ""
  service: seldon-mesh:80
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: iris1-http
  namespace: seldon-mesh
spec:
  add_request_headers:
    seldon-model:
      value: iris1
  add_response_headers:
    seldon_model:
      value: iris1
  hostname: '*'
  prefix: /v2
  rewrite: ""
  service: seldon-mesh:80
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: iris2-grpc
  namespace: seldon-mesh
spec:
  add_request_headers:
    seldon-model:
      value: iris2
  grpc: true
  hostname: '*'
  prefix: /inference.GRPCInferenceService
  rewrite: ""
  service: seldon-mesh:80
  weight: 50
---
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: iris2-http
  namespace: seldon-mesh
spec:
  add_request_headers:
    seldon-model:
      value: iris2
  add_response_headers:
    seldon_model:
      value: iris2
  hostname: '*'
  prefix: /v2
  rewrite: ""
  service: seldon-mesh:80
  weight: 50
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris1
  namespace: seldon-mesh
spec:
  requirements:
  - sklearn
  storageUri: gs://seldon-models/mlserver/iris
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris2
  namespace: seldon-mesh
spec:
  requirements:
  - sklearn
  storageUri: gs://seldon-models/mlserver/iris

!kustomize build config/traffic-split | kubectl apply -f -

host.getambassador.io/wildcard created
listener.getambassador.io/emissary-ingress-listener-8080 created
mapping.getambassador.io/iris1-grpc created
mapping.getambassador.io/iris1-http created
mapping.getambassador.io/iris2-grpc created
mapping.getambassador.io/iris2-http created
model.mlops.seldon.io/iris1 created
model.mlops.seldon.io/iris2 created

!kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh

model.mlops.seldon.io/iris1 condition met
model.mlops.seldon.io/iris2 condition met

!curl -v http://${INGRESS_IP}/v2/models/iris/infer -H "Content-Type: application/json" \
        -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'

*   Trying 172.21.255.1...
* Connected to 172.21.255.1 (172.21.255.1) port 80 (#0)
> POST /v2/models/iris/infer HTTP/1.1
> Host: 172.21.255.1
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 94
>
* upload completely sent off: 94 out of 94 bytes
< HTTP/1.1 200 OK
< content-length: 197
< content-type: application/json
< date: Sat, 16 Apr 2022 15:46:17 GMT
< server: envoy
< x-envoy-upstream-service-time: 920
< seldon-route: iris2_1
< seldon_model: iris2
<
* Connection #0 to host 172.21.255.1 left intact
{"model_name":"iris2_1","model_version":"1","id":"ed521c32-cd85-4cb8-90eb-7c896803f271","parameters":null,"outputs":[{"name":"predict","shape":[1],"datatype":"INT64","parameters":null,"data":[2]}]}

!grpcurl -d '{"model_name":"iris1","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' \
    -plaintext \
    -import-path ../../apis \
    -proto ../../apis/mlops/v2_dataplane/v2_dataplane.proto \
    ${INGRESS_IP}:80 inference.GRPCInferenceService/ModelInfer

{
  "modelName": "iris2_1",
  "modelVersion": "1",
  "outputs": [
    {
      "name": "predict",
      "datatype": "INT64",
      "shape": [
        "1"
      ],
      "contents": {
        "int64Contents": [
          "2"
        ]
      }
    }
  ]
}

!kustomize build config/traffic-split | kubectl delete -f -

host.getambassador.io "wildcard" deleted
listener.getambassador.io "emissary-ingress-listener-8080" deleted
mapping.getambassador.io "iris1-grpc" deleted
mapping.getambassador.io "iris1-http" deleted
mapping.getambassador.io "iris2-grpc" deleted
mapping.getambassador.io "iris2-http" deleted
model.mlops.seldon.io "iris1" deleted
model.mlops.seldon.io "iris2" deleted

Istio

Istio provides a service mesh and ingress solution.

We will run through some examples as shown in the notebook service-meshes/istio/istio.ipynb in our repo.

Single Model

A Seldon Iris Model
An istio Gateway
An instio VirtualService to expose REST and gRPC

# service-meshes/istio/static/single-model.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
  namespace: seldon-mesh
spec:
  requirements:
  - sklearn
  storageUri: gs://seldon-models/mlserver/iris
---
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: seldon-gateway
  namespace: seldon-mesh
spec:
  selector:
    app: istio-ingressgateway
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: http
      number: 80
      protocol: HTTP
  - hosts:
    - '*'
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      mode: SIMPLE
      privateKey: /etc/istio/ingressgateway-certs/tls.key
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: iris-route
  namespace: seldon-mesh
spec:
  gateways:
  - istio-system/seldon-gateway
  hosts:
  - '*'
  http:
  - match:
    - uri:
        prefix: /v2
    name: iris-http
    route:
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
      headers:
        request:
          set:
            seldon-model: iris
  - match:
    - uri:
        prefix: /inference.GRPCInferenceService
    name: iris-grpc
    route:
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
      headers:
        request:
          set:
            seldon-model: iris

Traffic Split

Two Iris Models
An istio Gateway
An istio VirtualService with traffic split

# service-meshes/istio/static/traffic-split.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris1
  namespace: seldon-mesh
spec:
  requirements:
  - sklearn
  storageUri: gs://seldon-models/mlserver/iris
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris2
  namespace: seldon-mesh
spec:
  requirements:
  - sklearn
  storageUri: gs://seldon-models/mlserver/iris
---
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: seldon-gateway
  namespace: seldon-mesh
spec:
  selector:
    app: istio-ingressgateway
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: http
      number: 80
      protocol: HTTP
  - hosts:
    - '*'
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      mode: SIMPLE
      privateKey: /etc/istio/ingressgateway-certs/tls.key
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: iris-route
  namespace: seldon-mesh
spec:
  gateways:
  - istio-system/seldon-gateway
  hosts:
  - '*'
  http:
  - match:
    - uri:
        prefix: /v2/models/iris
    name: iris-http
    route:
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
      headers:
        request:
          set:
            seldon-model: iris1
      weight: 50
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
      headers:
        request:
          set:
            seldon-model: iris2
      weight: 50
  - match:
    - uri:
        prefix: /inference.GRPCInferenceService
    name: iris-grpc
    route:
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
      headers:
        request:
          set:
            seldon-model: iris1
      weight: 50
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
      headers:
        request:
          set:
            seldon-model: iris2
      weight: 50

Istio Notebook Examples

Assumes

You have installed istio as per their docs
You have exposed the ingressgateway as an external loadbalancer

tested with:

istioctl version
1.13.2

istioctl install --set profile=demo -y

INGRESS_IP=!kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
INGRESS_IP=INGRESS_IP[0]
import os
os.environ['INGRESS_IP'] = INGRESS_IP
INGRESS_IP

'172.21.255.1'

Istio Single Model Example

!kustomize build config/single-model

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
  namespace: seldon-mesh
spec:
  requirements:
  - sklearn
  storageUri: gs://seldon-models/mlserver/iris
---
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: seldon-gateway
  namespace: seldon-mesh
spec:
  selector:
    app: istio-ingressgateway
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: http
      number: 80
      protocol: HTTP
  - hosts:
    - '*'
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      mode: SIMPLE
      privateKey: /etc/istio/ingressgateway-certs/tls.key
      serverCertificate: /etc/istio/ingressgateway-certs/tls.crt
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: iris-route
  namespace: seldon-mesh
spec:
  gateways:
  - istio-system/seldon-gateway
  hosts:
  - '*'
  http:
  - match:
    - uri:
        prefix: /v2
    name: iris-http
    route:
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
      headers:
        request:
          set:
            seldon-model: iris
  - match:
    - uri:
        prefix: /inference.GRPCInferenceService
    name: iris-grpc
    route:
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
      headers:
        request:
          set:
            seldon-model: iris

!kustomize build config/single-model | kubectl apply -f -

model.mlops.seldon.io/iris unchanged
gateway.networking.istio.io/seldon-gateway unchanged
virtualservice.networking.istio.io/iris-route configured

!kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh

model.mlops.seldon.io/iris condition met

!curl -v http://${INGRESS_IP}/v2/models/iris/infer -H "Content-Type: application/json" \
        -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'

*   Trying 172.21.255.1...
* Connected to 172.21.255.1 (172.21.255.1) port 80 (#0)
> POST /v2/models/iris/infer HTTP/1.1
> Host: 172.21.255.1
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 94
>
* upload completely sent off: 94 out of 94 bytes
< HTTP/1.1 200 OK
< content-length: 196
< content-type: application/json
< date: Sat, 16 Apr 2022 15:34:11 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 802
< seldon-route: iris_1
<
* Connection #0 to host 172.21.255.1 left intact
{"model_name":"iris_1","model_version":"1","id":"83520c4a-c7f1-4363-9bfd-60c5d8ee2dc5","parameters":null,"outputs":[{"name":"predict","shape":[1],"datatype":"INT64","parameters":null,"data":[2]}]}

!grpcurl -d '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' \
    -plaintext \
    -import-path ../../apis \
    -proto ../../apis/mlops/v2_dataplane/v2_dataplane.proto \
    ${INGRESS_IP}:80 inference.GRPCInferenceService/ModelInfer

{
  "modelName": "iris_1",
  "modelVersion": "1",
  "outputs": [
    {
      "name": "predict",
      "datatype": "INT64",
      "shape": [
        "1"
      ],
      "contents": {
        "int64Contents": [
          "2"
        ]
      }
    }
  ]
}

!kustomize build config/single-model | kubectl delete -f -

model.mlops.seldon.io "iris" deleted
gateway.networking.istio.io "seldon-gateway" deleted
virtualservice.networking.istio.io "iris-route" deleted

Traffic Split Two Models

!kustomize build config/traffic-split

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris1
  namespace: seldon-mesh
spec:
  requirements:
  - sklearn
  storageUri: gs://seldon-models/mlserver/iris
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris2
  namespace: seldon-mesh
spec:
  requirements:
  - sklearn
  storageUri: gs://seldon-models/mlserver/iris
---
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: seldon-gateway
  namespace: seldon-mesh
spec:
  selector:
    app: istio-ingressgateway
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: http
      number: 80
      protocol: HTTP
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: iris-route
  namespace: seldon-mesh
spec:
  gateways:
  - seldon-gateway
  hosts:
  - '*'
  http:
  - match:
    - uri:
        prefix: /v2/models/iris
    name: iris-http
    route:
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
      headers:
        request:
          set:
            seldon-model: iris1
      weight: 50
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
      headers:
        request:
          set:
            seldon-model: iris2
      weight: 50
  - match:
    - uri:
        prefix: /inference.GRPCInferenceService
    name: iris-grpc
    route:
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
      headers:
        request:
          set:
            seldon-model: iris1
      weight: 50
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
      headers:
        request:
          set:
            seldon-model: iris2
      weight: 50

!kustomize build config/traffic-split | kubectl apply -f -

model.mlops.seldon.io/iris1 created
model.mlops.seldon.io/iris2 created
gateway.networking.istio.io/seldon-gateway created
virtualservice.networking.istio.io/iris-route created

!kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh

model.mlops.seldon.io/iris1 condition met
model.mlops.seldon.io/iris2 condition met

!curl -v http://${INGRESS_IP}/v2/models/iris/infer -H "Content-Type: application/json" \
        -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'

*   Trying 172.21.255.1...
* Connected to 172.21.255.1 (172.21.255.1) port 80 (#0)
> POST /v2/models/iris/infer HTTP/1.1
> Host: 172.21.255.1
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 94
>
* upload completely sent off: 94 out of 94 bytes
< HTTP/1.1 200 OK
< content-length: 197
< content-type: application/json
< date: Sat, 16 Apr 2022 15:35:01 GMT
< server: istio-envoy
< x-envoy-upstream-service-time: 801
< seldon-route: iris1_1
<
* Connection #0 to host 172.21.255.1 left intact
{"model_name":"iris1_1","model_version":"1","id":"b54e6d8c-d253-4bb9-bb64-02c2ee49e89f","parameters":null,"outputs":[{"name":"predict","shape":[1],"datatype":"INT64","parameters":null,"data":[2]}]}

!grpcurl -d '{"model_name":"iris1","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' \
    -plaintext \
    -import-path ../../apis \
    -proto ../../apis/mlops/v2_dataplane/v2_dataplane.proto \
    ${INGRESS_IP}:80 inference.GRPCInferenceService/ModelInfer

{
  "modelName": "iris1_1",
  "modelVersion": "1",
  "outputs": [
    {
      "name": "predict",
      "datatype": "INT64",
      "shape": [
        "1"
      ],
      "contents": {
        "int64Contents": [
          "2"
        ]
      }
    }
  ]
}

!kustomize build config/traffic-split | kubectl delete -f -

model.mlops.seldon.io "iris1" deleted
model.mlops.seldon.io "iris2" deleted
gateway.networking.istio.io "seldon-gateway" deleted
virtualservice.networking.istio.io "iris-route" deleted

Traefik

Traefik provides a service mesh and ingress solution.

We will run through some examples as shown in the notebook service-meshes/traefik/traefik.ipynb

Single Model

A Seldon Iris Model
Traefik Service
Traefik IngressRoute
Traefik Middleware for adding a header

apiVersion: v1
kind: Service
metadata:
  name: myapps
  namespace: seldon-mesh
spec:
  ports:
  - name: web
    port: 80
    protocol: TCP
  selector:
    app: traefik-ingress-lb
  type: LoadBalancer
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
  namespace: seldon-mesh
spec:
  requirements:
  - sklearn
  storageUri: gs://seldon-models/mlserver/iris
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: iris
  namespace: seldon-mesh
spec:
  entryPoints:
  - web
  routes:
  - kind: Rule
    match: PathPrefix(`/`)
    middlewares:
    - name: iris-header
    services:
    - name: seldon-mesh
      port: 80
      scheme: h2c
---
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: iris-header
  namespace: seldon-mesh
spec:
  headers:
    customRequestHeaders:
      seldon-model: iris

Traffic Split

Warning Traffic splitting does not presently work due to this issue. We recommend you use a Seldon Experiment instead.

Traefik Examples

Assumes

You have installed Traefik as per their docs into namespace traefik-v2

Tested with traefik-10.19.4

INGRESS_IP=!kubectl get svc traefik -n traefik-v2 -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
INGRESS_IP=INGRESS_IP[0]
import os
os.environ['INGRESS_IP'] = INGRESS_IP
INGRESS_IP

'172.21.255.1'

Traefik Single Model Example

!kustomize build config/single-model

apiVersion: v1
kind: Service
metadata:
  name: myapps
  namespace: seldon-mesh
spec:
  ports:
  - name: web
    port: 80
    protocol: TCP
  selector:
    app: traefik-ingress-lb
  type: LoadBalancer
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
  namespace: seldon-mesh
spec:
  requirements:
  - sklearn
  storageUri: gs://seldon-models/mlserver/iris
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: iris
  namespace: seldon-mesh
spec:
  entryPoints:
  - web
  routes:
  - kind: Rule
    match: PathPrefix(`/`)
    middlewares:
    - name: iris-header
    services:
    - name: seldon-mesh
      port: 80
      scheme: h2c
---
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: iris-header
  namespace: seldon-mesh
spec:
  headers:
    customRequestHeaders:
      seldon-model: iris

!kustomize build config/single-model | kubectl apply -f -

service/myapps created
model.mlops.seldon.io/iris created
ingressroute.traefik.containo.us/iris created
middleware.traefik.containo.us/iris-header created

!kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh

model.mlops.seldon.io/iris condition met

!curl -v http://${INGRESS_IP}/v2/models/iris/infer -H "Content-Type: application/json" \
        -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'

*   Trying 172.21.255.1...
* Connected to 172.21.255.1 (172.21.255.1) port 80 (#0)
> POST /v2/models/iris/infer HTTP/1.1
> Host: 172.21.255.1
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 94
> 
* upload completely sent off: 94 out of 94 bytes
< HTTP/1.1 200 OK
< Content-Length: 196
< Content-Type: application/json
< Date: Sat, 16 Apr 2022 15:53:27 GMT
< Seldon-Route: iris_1
< Server: envoy
< X-Envoy-Upstream-Service-Time: 895
< 
* Connection #0 to host 172.21.255.1 left intact
{"model_name":"iris_1","model_version":"1","id":"0dccf477-78fa-4a11-92ff-4d7e4f1cdda8","parameters":null,"outputs":[{"name":"predict","shape":[1],"datatype":"INT64","parameters":null,"data":[2]}]}

!grpcurl -d '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' \
    -plaintext \
    -import-path ../../apis \
    -proto ../../apis/mlops/v2_dataplane/v2_dataplane.proto \
    ${INGRESS_IP}:80 inference.GRPCInferenceService/ModelInfer

{
  "modelName": "iris_1",
  "modelVersion": "1",
  "outputs": [
    {
      "name": "predict",
      "datatype": "INT64",
      "shape": [
        "1"
      ],
      "contents": {
        "int64Contents": [
          "2"
        ]
      }
    }
  ]
}

!kustomize build config/single-model | kubectl delete -f -

service "myapps" deleted
model.mlops.seldon.io "iris" deleted
ingressroute.traefik.containo.us "iris" deleted
middleware.traefik.containo.us "iris-header" deleted

HPA Autoscaling in single-model serving

Having a Seldon Core 2 install that publishes metrics to prometheus (default). In the following, we will assume that prometheus is already installed and configured in the seldon-monitoring namespace.
Installing and configuring Prometheus Adapter, which allows prometheus queries on relevant metrics to be published as k8s custom metrics
Configuring HPA manifests to scale Models and the corresponding Server replicas based on the custom metrics

Installing and configuring the Prometheus Adapter

The role of the Prometheus Adapter is to expose queries on metrics in Prometheus as k8s custom or external metrics. Those can then be accessed by HPA in order to take scaling decisions.

To install through helm:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install --set prometheus.url='http://seldon-monitoring-prometheus' hpa-metrics prometheus-community/prometheus-adapter -n seldon-monitoring

These commands install prometheus-adapter as a helm release named hpa-metrics in the same namespace where Prometheus is installed, and point to its service URL (without the port).

The URL is not fully qualified as it references a Prometheus instance running in the same namespace. If you are using a separately-managed Prometheus instance, please update the URL accordingly.

Overwrite the ConfigMap as shown in the following manifest, after applying any required customizations.

Change the name if you've chosen a different value for the prometheus-adapter helm release name. Change the namespace to match the namespace where prometheus-adapter is installed.

prometheus-adapter.config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: hpa-metrics-prometheus-adapter
  namespace: seldon-monitoring
data:
  config.yaml: |-
    "rules":
    -
      "seriesQuery": |
         {__name__=~"^seldon_model.*_total",namespace!=""}
      "seriesFilters":
        - "isNot": "^seldon_.*_seconds_total"
        - "isNot": "^seldon_.*_aggregate_.*"
      "resources":
        "overrides":
          "model": {group: "mlops.seldon.io", resource: "model"}
          "server": {group: "mlops.seldon.io", resource: "server"}
          "pod": {resource: "pod"}
          "namespace": {resource: "namespace"}
      "name":
        "matches": "^seldon_model_(.*)_total"
        "as": "${1}_rps"
      "metricsQuery": |
        sum by (<<.GroupBy>>) (
          rate (
            <<.Series>>{<<.LabelMatchers>>}[1m]
          )
        )

A list of all the Prometheus metrics exposed by Seldon Core 2 in relation to Models, Servers and Pipelines is available here, and those may be used when customizing the configuration.

Understanding prometheus-adapter rule definitions

The rule definition can be broken down in four parts:

Discovery (the seriesQuery and seriesFilters keys) controls what Prometheus metrics are considered for exposure via the k8s custom metrics API.
In the example, all the Seldon Prometheus metrics of the form seldon_model_*_total are considered, excluding metrics pre-aggregated across all models (.*_aggregate_.*) as well as the cummulative infer time per model (.*_seconds_total). For RPS, we are only interested in the model inference count (seldon_model_infer_total)
Association (the resources key) controls the Kubernetes resources that a particular metric can be attached to or aggregated over.
The resources key defines an association between certain labels from the Prometheus metric and k8s resources. For example, on line 17, "model": {group: "mlops.seldon.io", resource: "model"} lets prometheus-adapter know that, for the selected Prometheus metrics, the value of the "model" label represents the name of a k8s model.mlops.seldon.io CR.
One k8s custom metric is generated for each k8s resource associated with a prometheus metric. In this way, it becomes possible to request the k8s custom metric values for models.mlops.seldon.io/iris or for servers.mlops.seldon.io/mlserver.
The labels that do not refer to a namespace resource generate "namespaced" custom metrics (the label values refer to resources which are part of a namespace) -- this distinction becomes important when needing to fetch the metrics via kubectl, and in understanding how certain Prometheus query template placeholders are replaced.
Naming (the name key) configures the naming of the k8s custom metric.
In the example ConfigMap, this is configured to take the Prometheus metric named seldon_model_infer_total and expose custom metric endpoints named infer_rps, which when called return the result of a query over the Prometheus metric.
The matching over the Prometheus metric name uses regex group capture expressions (line 22), which are then be referenced in the custom metric name (line 23).
Querying (the metricsQuery key) defines how a request for a specific k8s custom metric gets converted into a Prometheus query.
The query can make use of the following placeholders:
- .Series is replaced by the discovered prometheus metric name (e.g. seldon_model_infer_total)
- .LabelMatchers, when requesting a namespaced metric for resource X with name x in namespace n, is replaced by X=~"x",namespace="n". For example, model=~"iris0", namespace="seldon-mesh". When requesting the namespace resource itself, only the namespace="n" is kept.
- .GroupBy is replaced by the resource type of the requested metric (e.g. model, server, pod or namespace).
You may want to modify the query in the example to match the one that you typically use in your monitoring setup for RPS metrics. The example calls rate() with a 1 minute window.

For a complete reference for how prometheus-adapter can be configured via the ConfigMap, please consult the docs here.

# Replace default prometheus adapter config
kubectl replace -f prometheus-adapter.config.yaml
# Restart prometheus-adapter pods
kubectl rollout restart deployment hpa-metrics-prometheus-adapter -n seldon-monitoring

Testing the install using the custom metrics API

In order to test that the prometheus adapter config works and everything is set up correctly, you can issue raw kubectl requests against the custom metrics API

Listing the available metrics:

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq .

For namespaced metrics, the general template for fetching is:

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/[NAMESPACE]/[API_RESOURCE_NAME]/[CR_NAME]/[METRIC_NAME]"

For example:

Fetching model RPS metric for a specific (namespace, model) pair (seldon-mesh, irisa0):

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/models.mlops.seldon.io/irisa0/infer_rps

Fetching model RPS metric aggregated at the (namespace, server) level (seldon-mesh, mlserver):

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/servers.mlops.seldon.io/mlserver/infer_rps

Fetching model RPS metric aggregated at the (namespace, pod) level (seldon-mesh, mlserver-0):

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/pods/mlserver-0/infer_rps

Fetching the same metric aggregated at namespace level (seldon-mesh):

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/*/metrics/infer_rps

Configuring HPA manifests

irisa0.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: irisa0
  namespace: seldon-mesh
spec:
  memory: 3M
  replicas: 1
  requirements:
  - sklearn
  storageUri: gs://seldon-models/testing/iris1

Let’s scale this model when it is deployed on a server named mlserver, with a target RPS per replica of 3 RPS (higher RPS would trigger scale-up, lower would trigger scale-down):

irisa0-hpa.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: irisa0-model-hpa
  namespace: seldon-mesh
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    name: irisa0
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Object
    object:
      metric:
        name: infer_rps
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: irisa0
      target:
        type: AverageValue
        averageValue: 3
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mlserver-server-hpa
  namespace: seldon-mesh
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    name: mlserver
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Object
    object:
      metric:
        name: infer_rps
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: irisa0
      target:
        type: AverageValue
        averageValue: 3

Details on custom metrics of type Object

If the described object is { kind: Namespace, name: seldon-mesh }, then the Prometheus query template configured in our example would be transformed into:

sum by (namespace) (
  rate (
    seldon_model_infer_total{namespace="seldon-mesh"}[1m]
  )
)

If the described object is not a namespace (for example, { kind: Pod, name: mlserver-0 }) then the query will be passed the label describing the object, alongside an additional label identifying the namespace where the HPA manifest resides in.:

sum by (pod) (
  rate (
    seldon_model_infer_total{pod="mlserver-0", namespace="seldon-mesh"}[1m]
  )
)

$\texttt{targetReplicas} = \frac{\texttt{infer\_rps}}{\texttt{thresholdPerReplicaRPS}}$

HPA sampling of custom metrics

Despite this delay, the two will converge to the same number when the decisions are taken based on the same metric (as in the previous examples).

Advanced settings

Filtering metrics by additional labels on the prometheus metric:

The prometheus metric from which the model RPS is computed has the following labels:

seldon_model_infer_total{
    code="200", 
    container="agent", 
    endpoint="metrics", 
    instance="10.244.0.39:9006", 
    job="seldon-mesh/agent", 
    method_type="rest", 
    model="irisa0", 
    model_internal="irisa0_1", 
    namespace="seldon-mesh", 
    pod="mlserver-0", 
    server="mlserver", 
    server_replica="0"
}

  metrics:
  - type: Object
    object:
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: irisa0
      metric:
        name: infer_rps
        selector:
          matchLabels:
            method_type: rest
      target:
	    type: AverageValue
        averageValue: "3"

Customize scale-up / scale-down rate & properties by using scaling policies as described in the HPA scaling policies docs
For more resources, please consult the HPA docs and the HPA walkthrough

Cluster operation guidelines when using HPA-based scaling

Customizing per-replica RPS targets and replica limits

It is important for the cluster to have sufficient resources for creating the total number of desired server replicas set by the HPA CRs across all the models at a given time.

Customizing HPA policy settings for ensuring correct scaling behaviour

Two characteristics in the current implementation are important in terms of autoscaling and configuring the HPA scale-up policy:

The scheduler does not create new Server replicas when the existing replicas are not sufficient for loading a Model's replicas (one Model replica per Server replica). Whenever a Model requests more replicas than available on any of the available Servers, its ModelReady condition transitions to Status: False with a ScheduleFailed message. However, any replicas of that Model that are already loaded at that point remain available for servicing inference load.
There is no partial scheduling of replicas. For example, consider a model with 2 replicas, currently loaded on a server with 3 replicas (two of those server replicas will have the model loaded). If you update the model replicas to 4, the scheduler will transition the model to ScheduleFailed, seeing that it cannot satisfy the requested number of replicas. The existing 2 model replicas will continue to serve traffic, but a third replica will not be loaded onto the remaining server replica.
In other words, the scheduler either schedules all the requested replicas, or, if unable to do so, leaves the state of the cluster unchanged.
Introducing partial scheduling would make the overall results of assigning models to servers significantly less predictable and ephemeral. This is because models may end up moved back-and forth between servers depending on the speed with which various server replicas become available. Network partitions or other transient errors may also trigger large changes to the model-to-server assignments, making it challenging to sustain consistent data plane load during those periods.

Taken together, the two Core 2 scheduling characteristics, combined with a very aggressive HPA scale-up policy and a continuously increasing RPS may lead to the following pathological case:

Based on RPS, HPA decides to increase both the Model and Server replicas from 2 (an example start stable state) to 8. While the 6 new Server pods get scheduled and get the Model loaded onto them, the scheduler will transition the Model into the ScheduleFailed state, because it cannot fulfill the requested replicas requirement. During this period, the initial 2 Model replicas continue to serve load, but are using their RPS margins and getting closer to the saturation point.
At the same time, load continues to increase, so HPA further increases the number of required Model and Server replicas from 8 to 12, before all of the 6 new Server pods had a chance to become available. The new replica target for the scheduler also becomes 12, and this would not be satisfied until all the 12 Server replicas are available. The 2 Model replicas that are available may by now be saturated and the infer latency spikes up, breaching set SLAs.
The process may continue until load stabilizes.
If at any point the number of requested replicas (<=maxReplicas) exceeds the resource capacity of the cluster, the requested server replica count will never be reached and thus the Model will remain permanently in the ScheduleFailed state.

The speed with which new Server replicas can become available versus how many new replicas may HPA request in a given time:
- The HPA scale-up policy should not be configured to request more replicas than can become available in the specified time. The following example reflects a confidence that 5 Server pods will become available within 90 seconds, with some safety margin. The default scale-up config, that also adds a percentage based policy (double the existing replicas within the set periodSeconds) is not recommended because of this.
- Perhaps more importantly, there is no reason to scale faster than the time it takes for replicas to become available - this is the true maximum rate with which scaling up can happen anyway. Because the underlying Server replica pods are part of a stateful set, they are created sequentially by k8s.

hpa-custom-policy.yaml

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: irisa0-model-hpa
  namespace: seldon-mesh
spec:
  scaleTargetRef:
    ...
  minReplicas: 1
  maxReplicas: 3
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 5
        periodSeconds: 90
  metrics:
    ...

The duration of transient load spikes which you might want to absorb within the existing per-replica RPS margins.
- The previous example, at line 13, configures a scale-up stabilization window of one minute. It means that for all of the HPA recommended replicas in the last 60 second window (4 samples of the custom metric considering the default sampling rate), only the smallest will be applied.
- Such stabilization windows should be set depending on typical load patterns in your cluster: not being too aggressive in reacting to increased load will allow you to achieve cost savings, but has the disadvantage of a delayed reaction if the load spike turns out to be sustained.
The duration of any typical/expected sustained ramp-up period, and the RPS increase rate during this period.
- It is useful to consider whether the replica scale-up rate configured via the policy (line 15 in the example) is able to keep-up with this RPS increase rate.
- Such a scenario may appear, for example, if you are planning for a smooth traffic ramp-up in a blue-green deployment as you are draining the "blue" deployment and transitioning to the "green" one