Explore how Seldon Core 2 uses the data flow paradigm and Kafka-based streaming to improve ML model deployment with better scalability, fault tolerance, and data observability.
Seldon Core 2 is designed around the data flow paradigm. Here we will explain what that means and some of the rationale behind this choice.
The initial release of Seldon Core introduced the concept of an inference graph, which can be thought of as a sequence of operations applied to an inference request. Here is how it may look:
In reality, though, this is not how Seldon Core v1 is implemented. Instead, a Seldon deployment consists of a range of independent services that host models, transformations, detectors and explainers, plus a central orchestrator that knows the inference graph topology and makes service calls in the correct order, passing data between requests and responses as necessary. Here is how the picture looks under the hood:
While this is a convenient way of implementing an evaluation graph with microservices, it has a few problems. The orchestrator becomes a bottleneck and a single point of failure. It also hides all the data transformations needed to translate one service's response into another service's request. Data tracing and lineage become difficult. All in all, while the Seldon platform is all about processing data, the under-the-hood implementation was still focused on the order of operations and not on the data itself.
The realisation of this disparity led to a new approach to inference graph evaluation in v2, based on the data flow paradigm. Data flow is a well-known concept in software engineering, dating back to the 1960s. In contrast to services, which model programs as control flow, focusing on the order of operations, data flow models software systems as a series of connections that modify incoming data, focusing on the data flowing through the system. The particular flavor of the data flow paradigm used by v2 is known as flow-based programming (FBP). FBP defines software applications as a set of processes which exchange data via connections that are external to those processes. Connections are made via named ports, which promotes data coupling between components of the system.
Data flow design makes data in software the top priority. That is one of the key messages of the so-called "data-centric AI" idea, which is becoming increasingly popular within the ML community. Data is a key component of a successful ML project: it needs to be discovered, described, cleaned, understood, monitored and verified. Consequently, there is a growing demand for data-centric platforms and solutions. Making Seldon Core data-centric was one of the key goals of the Seldon Core 2 design.
In the context of Seldon Core, applying the FBP design approach means that the evaluation is implemented in the same shape as the inference graph itself. So instead of routing everything through a centralized orchestrator, the evaluation happens in the same graph-like manner:
As far as implementation goes, Seldon Core 2 runs on Kafka. An inference request is put onto a pipeline input topic, which triggers an evaluation. Each part of the inference graph is a service running in its own container, fronted by a model gateway. The model gateway listens to the corresponding input Kafka topic, reads data from it, calls the service and puts the received response onto an output Kafka topic. There is also a pipeline gateway that allows interacting with Seldon Core in a synchronous manner.
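As a rough sketch of this flow, the snippet below publishes an Open Inference Protocol (V2) request onto a pipeline's input topic and reads the result from its output topic. The broker address, the topic naming convention (seldon.<namespace>.pipeline.<name>.inputs/outputs), the JSON payload encoding and the pipeline/tensor names are all illustrative assumptions; verify them against your installation before relying on this.

# Illustrative sketch only: broker address, topic naming and JSON payload
# encoding are assumptions to verify against your Seldon Core 2 installation.
import json
from confluent_kafka import Consumer, Producer

BOOTSTRAP = "seldon-kafka-bootstrap:9092"                  # assumed broker address
PIPELINE = "mypipeline"                                    # hypothetical pipeline name
IN_TOPIC = f"seldon.default.pipeline.{PIPELINE}.inputs"    # assumed naming scheme
OUT_TOPIC = f"seldon.default.pipeline.{PIPELINE}.outputs"  # assumed naming scheme

# V2 inference protocol request; tensor name, shape and datatype are placeholders.
request = {
    "inputs": [
        {"name": "predict", "shape": [1, 4], "datatype": "FP32",
         "data": [[0.1, 0.2, 0.3, 0.4]]}
    ]
}

producer = Producer({"bootstrap.servers": BOOTSTRAP})
producer.produce(IN_TOPIC, value=json.dumps(request).encode("utf-8"))
producer.flush()

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "pipeline-output-reader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe([OUT_TOPIC])
msg = consumer.poll(timeout=30.0)
if msg is not None and msg.error() is None:
    print(msg.value())  # V2 inference response emitted by the pipeline
consumer.close()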
This approach gives SCv2 several important features. Firstly, Seldon Core natively supports both synchronous and asynchronous modes of operation. Asynchronicity is achieved via streaming: input data can be sent to an input topic in Kafka, and after the evaluation the output topic will contain the inference result. For those looking to use it in the v1 style, a service API is provided.
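For the synchronous path, the same V2 request can be sent over HTTP. Below is a minimal sketch, assuming the Seldon mesh/pipeline gateway is reachable at the address shown and that pipelines are routed via a Seldon-Model: <name>.pipeline header; both details are assumptions to check against your deployment.

# Minimal synchronous call sketch; the mesh URL and the "Seldon-Model" routing
# header convention are assumptions to verify for your deployment.
import requests

MESH_URL = "http://seldon-mesh.seldon-mesh:80"   # assumed service address
PIPELINE = "mypipeline"                          # hypothetical pipeline name

body = {
    "inputs": [
        {"name": "predict", "shape": [1, 4], "datatype": "FP32",
         "data": [[0.1, 0.2, 0.3, 0.4]]}
    ]
}

resp = requests.post(
    f"{MESH_URL}/v2/models/{PIPELINE}/infer",     # Open Inference Protocol endpoint
    json=body,
    headers={"Seldon-Model": f"{PIPELINE}.pipeline"},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # synchronous V2 inference response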
Secondly, there is no single point of failure. Even if one or more nodes in the graph go down, the data will still be sitting on the streams waiting to be processed, and the evaluation resumes whenever the failed node comes back up.
Thirdly, data flow means intermediate data can be accessed at any arbitrary step of the graph, inspected and collected as necessary. Data lineage is possible throughout, which opens up opportunities for advanced monitoring and explainability use cases. This is a key feature for effective error surfacing in production environments as it allows:
Adding context from different parts of the graph to better understand a particular output
Reducing false positive rates of alerts as different slices of the data can be investigated
Enabling reproduction of results, as fine-grained lineage of the computation and the associated data transformations is tracked by design
Finally, the inference graph can now be extended by adding new nodes at arbitrary places, all without affecting the existing pipeline execution. This kind of flexibility was not possible with v1. It also allows multiple pipelines to share common nodes, optimising resource usage.
More details and information on data-centric AI and the data flow paradigm can be found in these resources:
Stanford MLSys seminar
A paper that explores
Flow-based programming resources from its creator J.P. Morrison:
Pathways: Asynchronous Distributed Dataflow for ML, research work from Google on the design and implementation of a data flow based orchestration layer for accelerators
Better understanding of data requires tracking its history and context



Learn how to implement outlier detection in Seldon Core using Alibi-Detect integration for model monitoring and anomaly detection.
Outlier detection models are treated like any other Model. You can run any saved Alibi-Detect outlier detection model by adding the requirement alibi-detect.
An example outlier detection model from the CIFAR10 image classification example is shown below:
# samples/models/cifar10-outlier-detect.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: cifar10-outlier
spec:
  storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/cifar10/outlier-detector"
  requirements:
  - mlserver
  - alibi-detect
Learn how to implement drift detection in Seldon Core 2 using Alibi-Detect integration for model monitoring and batch processing.
Drift detection models are treated like any other Model. You can run any saved Alibi-Detect drift detection model by adding the requirement alibi-detect.
An example drift detection model from the CIFAR10 image classification example is shown below:
# samples/models/cifar10-drift-detect.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: cifar10-drift
spec:
  storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/cifar10/drift-detector"
  requirements:
  - mlserver
  - alibi-detect
Usually you would run these models in an asynchronous part of a Pipeline, i.e. they are not connected to the output of the Pipeline which defines the synchronous path. For example, the CIFAR-10 image detection example uses a pipeline as shown below:
# samples/pipelines/cifar10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: cifar10-production
spec:
  steps:
  - name: cifar10
  - name: cifar10-outlier
  - name: cifar10-drift
    batch:
      size: 20
  output:
    steps:
    - cifar10
    - cifar10-outlier.outputs.is_outlier
Note how the cifar10-drift model is not part of the path to the outputs. Drift alerts can be read from the Kafka topic of the model.
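One way to read those alerts is to consume that topic directly. The sketch below assumes the default topic naming seldon.<namespace>.model.<model-name>.outputs and simply prints whatever the detector publishes; the exact layout of the response (for example an is_drift output tensor) depends on the Alibi-Detect runtime, so inspect the messages rather than assuming a schema.

# Hedged sketch: consume the drift detector's output topic. The broker address
# and the topic name below follow an assumed naming scheme; adjust as needed.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "seldon-kafka-bootstrap:9092",   # assumed broker address
    "group.id": "drift-alert-reader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["seldon.default.model.cifar10-drift.outputs"])  # assumed topic

try:
    while True:
        msg = consumer.poll(timeout=5.0)
        if msg is None or msg.error():
            continue
        # Each message is the detector's V2 inference response; its output
        # tensors (e.g. a drift flag) can be parsed and alerted on downstream.
        print(msg.value())
finally:
    consumer.close()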
Learn how to run performance tests for Seldon Core 2 deployments, including load testing, benchmarking, and analyzing inference latency and throughput metrics.
This section describes how a user can run performance tests to understand the limits of a particular SCv2 deployment.
The base directory is tests/k6.
k6 is used to drive requests for load, unload and infer workloads. It is recommended that the load test is run within the same cluster that has SCv2 installed, as it requires internal access to some services that are not automatically exposed to the outside world. Furthermore, having the driver within the same cluster minimises link latency to the SCv2 entrypoint, so infer latencies are more representative of the actual overheads of the system.
Envoy: tests synchronous inference requests via Envoy.
To run: make deploy-envoy-test
Agent: tests inference requests sent directly to a specific agent; defaults to triton-0 or mlserver-0.
To run: make deploy-rproxy-test or make deploy-rproxy-mlserver-test
Server: tests inference requests sent directly to a specific server (bypassing the agent); defaults to triton-0 or mlserver-0.
To run: make deploy-server-test or make deploy-server-mlserver-test
Pipeline gateway (HTTP-Kafka gateway): tests HTTP and gRPC inference requests to a one-node pipeline.
To run: make deploy-kpipeline-test
Model gateway (Kafka-HTTP gateway): tests inference requests to a model via Kafka.
To run: make deploy-kmodel-test
One way to look at the results is to check the log of the pod that executed the Kubernetes job.
Results can also be persisted to a GCS bucket; a service account key k6-sa-key in the same namespace is required.
Users can also look at the metrics exposed in Prometheus while the test is underway.
In case a user is modifying the actual scenario of the test:
export DOCKERHUB_USERNAME=mydockerhubaccount
Build the k6 image via make build-push.
In the same shell environment, deploying jobs will then use this custom-built Docker image.
Users can modify the settings of the tests in tests/k6/configs/k8s/base/k6.yaml (reproduced further below). This will apply to all subsequent tests that are deployed using the above process.
Some settings that can be changed:
k6 args (for a full list, check the k6 documentation)
Environment variables (for MODEL_TYPE, choose from the model types defined in tests/k6/components/model.js, reproduced further below)
Learn how to implement model explainability in Seldon Core using Alibi-Explain integration for black box model explanations and pipeline insights.
Explainers are Model resources with some extra settings. They allow a range of explainers from the Alibi-Explain library to be run on MLServer.
An example Anchors explainer definition is shown below.
The key additions are:
type: This must be one of the Alibi Explainer types supported by the Alibi Explain runtime in MLServer.
modelRef: The model name for black box explainers.
pipelineRef: The pipeline name for black box explainers.
Only one of modelRef and pipelineRef is allowed.
Blackbox explainers can explain a Pipeline as well as a model. An example of each is shown below.
# samples/models/income-explainer.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: income-explainer
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/income-sklearn/anchor-explainer"
  explainer:
    type: anchor_tabular
    modelRef: income
# tests/k6/configs/k8s/base/k6.yaml
args: [
  "--no-teardown",
  "--summary-export",
  "results/base.json",
  "--out",
  "csv=results/base.gz",
  "-u",
  "5",
  "-i",
  "100000",
  "-d",
  "120m",
  "scenarios/infer_constant_vu.js",
]
# # infer_constant_rate
# args: [
# "--no-teardown",
# "--summary-export",
# "results/base.json",
# "--out",
# "csv=results/base.gz",
# "scenarios/infer_constant_rate.js",
# ]
# # k8s-test-script
# args: [
# "--summary-export",
# "results/base.json",
# "--out",
# "csv=results/base.gz",
# "scenarios/k8s-test-script.js",
# ]
# # core2_qa_control_plane_ops
# args: [
# "--no-teardown",
# "--verbose",
# "--summary-export",
# "results/base.json",
# "--out",
# "csv=results/base.gz",
# "-u",
# "5",
# "-i",
# "10000",
# "-d",
# "9h",
# "scenarios/core2_qa_control_plane_ops.js",
# ]
- name: INFER_HTTP_ITERATIONS
  value: "1"
- name: INFER_GRPC_ITERATIONS
  value: "1"
- name: MODELNAME_PREFIX
  value: "tfsimplea,pytorch-cifar10a,tfmnista,mlflow-winea,irisa"
- name: MODEL_TYPE
  value: "tfsimple,pytorch_cifar10,tfmnist,mlflow_wine,iris"
# Specify MODEL_MEMORY_BYTES using unit-of-measure suffixes (k, M, G, T)
# rather than numbers without units of measure. If supplying "naked
# numbers", the seldon operator will take care of converting the number
# for you but also take ownership of the field (as FieldManager), so the
# next time you run the scenario creating/updating of the model CR will
# fail.
- name: MODEL_MEMORY_BYTES
  value: "400k,8M,43M,200k,3M"
- name: MAX_MEM_UPDATE_FRACTION
  value: "0.1"
- name: MAX_NUM_MODELS
  value: "800,100,25,100,100"
  # value: "0,0,25,100,100"
#
# MAX_NUM_MODELS_HEADROOM is a variable used by control-plane tests.
# It's the approximate number of models that can be created over
# MAX_NUM_MODELS over the experiment. In the worst case scenario
# (very unlikely) the HEADROOM values may temporarily exceed the ones
# specified here with the number of VUs, because each VU checks the
# headroom constraint independently before deciding on the available
# operations (no communication/sync between VUs).
# - name: MAX_NUM_MODELS_HEADROOM
#   value: "20,5,0,20,30"
#
# MAX_MODEL_REPLICAS is used by control-plane tests. It controls the
# maximum number of replicas that may be requested when
# creating/updating models of a given type.
# - name: MAX_MODEL_REPLICAS
#   value: "2,2,0,2,2"
#
- name: INFER_BATCH_SIZE
  value: "1,1,1,1,1"
# MODEL_CREATE_UPDATE_DELETE_BIAS defines the probability ratios between
# the operations, for control-plane tests. For example, "1, 4, 3"
# makes an Update four times more likely than a Create, and a Delete 3
# times more likely than the Create.
# - name: MODEL_CREATE_UPDATE_DELETE_BIAS
#   value: "1,3,1"
- name: WARMUP
  value: "false"
// tests/k6/components/model.js
import { dump as yamlDump } from "https://cdn.jsdelivr.net/npm/js-yaml/dist/js-yaml.mjs";
import { getConfig } from '../components/settings.js'
const tfsimple_string = "tfsimple_string"
const tfsimple = "tfsimple"
const iris = "iris" // mlserver
const pytorch_cifar10 = "pytorch_cifar10"
const tfmnist = "tfmnist"
const tfresnet152 = "tfresnet152"
const onnx_gpt2 = "onnx_gpt2"
const mlflow_wine = "mlflow_wine" // mlserver
const add10 = "add10" // https://github.com/SeldonIO/triton-python-examples/tree/master/add10
const sentiment = "sentiment" // mlserver
# samples/models/hf-sentiment-explainer.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment-explainer
spec:
  storageUri: "gs://seldon-models/scv2/examples/huggingface/speech-sentiment/explainer"
  explainer:
    type: anchor_text
    pipelineRef: sentiment-explain