A Model is the core atomic building block. It specifies a machine learning artifact that will be loaded onto one of the running Servers. A Model could be:
a standard machine learning inference component such as a TensorFlow, PyTorch, or SKLearn model.
an inference transformation component such as an SKLearn pipeline or a piece of custom Python logic.
a monitoring component such as an outlier detector or drift detector.
an Alibi-Explain model explainer.
An example is shown below for a SKLearn model for iris classification:
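The manifest below is a minimal sketch of such a Model resource; the storage path and requirement tag are illustrative and should be replaced with the location and type of your own artifact.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  # Location of the trained model artifact; any rclone-compatible URI can be used.
  storageUri: "gs://my-bucket/models/iris-sklearn"   # illustrative path
  # Capability tags that a Server must expose in order to host this model.
  requirements:
  - sklearn
```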
Its Kubernetes spec has two core requirements:
A storageUri specifying the location of the artifact. This can be any rclone URI specification.
A requirements list which provides tags that need to be matched by the Server that can run this artifact type. By default, when you install Seldon we provide a set of Servers that cover a range of artifact types.
For Kubernetes usage we provide a set of custom resources for interacting with Seldon.
SeldonRuntime - for installing Seldon in a particular namespace.
Servers - for deploying sets of replicas of core inference servers (MLServer or Triton).
Models - for deploying single machine learning models, custom transformation logic, drift detectors, outliers detectors and explainers.
Experiments - for testing new versions of models.
Pipelines - for connecting together flows of data between models.
SeldonConfig and ServerConfig define the core installation configuration and machine learning inference server configuration for Seldon. Normally, you would not need to customize these but this may be required for your particular custom installation within your organisation.
ServerConfigs - for defining new types of inference server that can be referenced by a Server resource.
SeldonConfig - for defining how Seldon is installed.
Pipelines allow one to connect flows of inference data transformed by Model components. A directed acyclic graph (DAG) of steps can be defined to join Models together. Each Model will need to be capable of receiving a V2 inference request and responding with a V2 inference response. An example Pipeline is shown below:
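The manifest below is an illustrative sketch of such a Pipeline; the step names match the description that follows, and the tensorMap keys assume the V2 output naming used by the tfsimple models.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tfsimples
spec:
  steps:
    # tfsimple1 and tfsimple2 receive the Pipeline input by default.
    - name: tfsimple1
    - name: tfsimple2
    # tfsimple3 consumes one output tensor from each of the previous steps,
    # renaming them to the INPUT0/INPUT1 tensors it expects.
    - name: tfsimple3
      inputs:
      - tfsimple1
      - tfsimple2
      tensorMap:
        tfsimple1.outputs.OUTPUT0: INPUT0
        tfsimple2.outputs.OUTPUT1: INPUT1
  output:
    steps:
    - tfsimple3
```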
The steps list shows three models: tfsimple1, tfsimple2 and tfsimple3. These three models each take two integer tensors called INPUT0 and INPUT1. The models produce two outputs: OUTPUT0 (the sum of the inputs) and OUTPUT1 (the second input subtracted from the first). tfsimple1 and tfsimple2 take as inputs the input to the Pipeline: the default assumption when no explicit inputs are defined. tfsimple3 takes one V2 tensor input from each of the outputs of tfsimple1 and tfsimple2. As the outputs of tfsimple1 and tfsimple2 have tensors named OUTPUT0 and OUTPUT1, their names need to be changed to match the expected input tensors; this is done with a tensorMap component providing the tensor renaming. This is only required if your models cannot be directly chained together.
The output of the Pipeline is the output from the tfsimple3 model.
The full GoLang specification for a Pipeline is shown below:
An Experiment defines a traffic split between Models or Pipelines. This allows new versions of models and pipelines to be tested.
An experiment spec has three sections:
candidates (required): a set of candidate models to split traffic between. Each candidate has a traffic weight; the percentage of traffic a candidate receives is its weight divided by the sum of all candidate weights.
default (optional): an existing candidate whose endpoint should be modified to split traffic as defined by the candidates.
mirror (optional): a single model to which traffic sent to the candidates is mirrored. Responses from this model will not be returned to the caller.
An example experiment with a default model is shown below:
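A sketch of such an Experiment is shown here; the resource name is illustrative.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-sample   # illustrative name
spec:
  # Expose the traffic split on the existing endpoint of the iris model.
  default: iris
  candidates:
  - name: iris
    weight: 50
  - name: iris2
    weight: 50
```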
This defines a 50% traffic split between two models, iris and iris2. In this case we want to expose this traffic split on the existing endpoint created for the iris model. This allows us to test new versions of models (in this case iris2) on an existing endpoint (in this case iris). The default key defines the model whose endpoint we want to change. The experiment will become active when both underlying models are in Ready status.
An experiment over two separate models which exposes a new API endpoint is shown below:
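A sketch of such an Experiment, reusing the iris and iris2 models from above, is shown here; without a default, the split is exposed on the experiment's own endpoint.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-iris
spec:
  # No default is set, so the split is exposed on a new
  # <experiment-name>.experiment endpoint.
  candidates:
  - name: iris
    weight: 50
  - name: iris2
    weight: 50
```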
To call the endpoint, add the header seldon-model: <experiment-name>.experiment, in this case seldon-model: experiment-iris.experiment. With curl, for example, this header can be passed using the -H flag.
For examples see the local experiments notebook.
Running an experiment between pipelines is very similar. The difference is that resourceType: pipeline needs to be defined, and in this case the candidates or mirrors will refer to pipelines. An example is shown below:
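A sketch is shown here; the pipeline names pipeline1 and pipeline2 are hypothetical.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-pipelines
spec:
  resourceType: pipeline
  default: pipeline1            # hypothetical pipeline name
  candidates:
  - name: pipeline1
    weight: 50
  - name: pipeline2             # hypothetical pipeline name
    weight: 50
```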
For an example see the local experiments notebook.
A mirror can be added easily for model or pipeline experiments. An example model mirror experiment is shown below:
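A sketch is shown here, reusing the iris models from the earlier examples and assuming the mirror takes a name and a percentage of traffic to copy.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-iris-mirror
spec:
  candidates:
  - name: iris
    weight: 100
  # Traffic is copied to iris2, but its responses are discarded.
  mirror:
    name: iris2
    percent: 100
```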
For an example see the local experiments notebook.
An example pipeline mirror experiment is shown below:
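A sketch is shown here; the pipeline names are hypothetical and the mirror fields follow the same assumptions as the model mirror example above.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-pipelines-mirror
spec:
  resourceType: pipeline
  candidates:
  - name: pipeline1             # hypothetical pipeline name
    weight: 100
  mirror:
    name: pipeline2             # hypothetical pipeline name
    percent: 100
```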
For an example see the local experiments notebook.
To allow cohorts to get consistent views in an experiment, each inference request passes back a response header x-seldon-route which can be passed in future requests to an experiment to bypass the random traffic splits and get a prediction from the sequence of models and pipelines used in the initial request.
Note: you must pass the normal seldon-model header along with the x-seldon-route header.
This is illustrated in the local experiments notebook.
Caveats: the models used will be the same but not necessarily the same replica instances. This means that, at present, this will not work for stateful models whose requests need to be routed to the same model replica instance.
As an alternative you can choose to run experiments at the service mesh level if you use one of the popular service meshes that allow header based routing in traffic splits. For further discussion see here.
This section is for advanced usage where you want to define new types of inference servers.
Server configurations define how to create an inference server. By default one is provided for Seldon MLServer and one for NVIDIA Triton Inference Server. Both of these servers support the V2 inference protocol, which is a requirement for all inference servers. Each ServerConfig defines the Kubernetes ReplicaSet for the server, which includes the Seldon Agent reverse proxy as well as an Rclone server for downloading artifacts for the server. The Kustomize ServerConfig for MLServer is shown below:
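The full configuration is lengthy, so the heavily abbreviated sketch below only illustrates its overall shape; the container images and tags are placeholders rather than the published defaults.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: ServerConfig
metadata:
  name: mlserver
spec:
  podSpec:
    containers:
    # Rclone sidecar used to download model artifacts onto the server.
    - name: rclone
      image: rclone/rclone:latest           # placeholder tag
    # Seldon Agent reverse proxy that manages model loading and routing.
    - name: agent
      image: seldonio/seldon-agent:latest   # placeholder tag
    # The inference server itself.
    - name: mlserver
      image: seldonio/mlserver:latest       # placeholder tag
```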
The SeldonRuntime resource is used to create an instance of Seldon installed in a particular namespace.
For the definition of SeldonConfiguration above, see the SeldonConfig resource.
The specification above contains overrides for the chosen SeldonConfig. To override the PodSpec for a given component, the overrides field needs to specify the component name and the PodSpec needs to specify the container name, along with the fields to override.
For instance, the following overrides the resource limits for cpu and memory in the hodometer component in the seldon-mesh namespace, while using values specified in the seldonConfig elsewhere (e.g. default).
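A sketch of such an override is shown here; the resource limit values are illustrative.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
  name: seldon
  namespace: seldon-mesh
spec:
  seldonConfig: default
  overrides:
  - name: hodometer
    podSpec:
      containers:
      - name: hodometer
        resources:
          limits:
            cpu: 200m       # illustrative values
            memory: 32Mi
```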
As a minimal use, you should just define the SeldonConfig to use as a base for this install, for example to install in the seldon-mesh namespace with the SeldonConfig named default:
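A minimal sketch of such a resource:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
  name: seldon
  namespace: seldon-mesh
spec:
  # Use the default SeldonConfig as the base for this installation.
  seldonConfig: default
```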
The helm chart seldon-core-v2-runtime allows easy creation of this resource and associated default Servers for an installation of Seldon in a particular namespace.
This section is for advanced usage where you want to define how Seldon is installed in each namespace.
The SeldonConfig resource defines the core installation components installed by Seldon. If you wish to install Seldon, you can use the SeldonRuntime resource, which allows easy overriding of some parts defined in this specification. In general, we advise core DevOps teams to use the default SeldonConfig or customize it for their usage. Individual installations of Seldon can then use the SeldonRuntime with a few overrides for special customisation needed in that namespace.
The specification contains core PodSpecs for each core component and a section for general configuration including the ConfigMaps that are created for the Agent (rclone defaults), Kafka and Tracing (open telemetry).
Some of these values can be overridden on a per-namespace basis via the SeldonRuntime resource. Labels and annotations can also be set at the component level; these will be merged with the labels and annotations from the SeldonConfig resource in which they are defined and added to the component's corresponding Deployment or StatefulSet.
The default configuration is shown below.
The default installation will provide two initial servers: one MLServer and one Triton. You only need to define additional servers for advanced use cases.
A Server defines an inference server onto which models will be placed for inference. By default, on installation two server StatefulSets will be deployed: one MLServer and one Triton. An example Server definition is shown below:
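A minimal sketch of a Server resource, assuming the default mlserver ServerConfig:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
spec:
  # Reference to the ServerConfig that defines how this server is created.
  serverConfig: mlserver
  replicas: 1
```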
The main requirement is a reference to a ServerConfig resource, in this case mlserver.
One can easily utilize a custom image with the existing ServerConfigs. For example, the following defines an MLServer server with a custom image:
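A sketch is shown here; the image name is hypothetical and only the inference server container is overridden.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-custom
spec:
  serverConfig: mlserver
  podSpec:
    containers:
    # Override only the inference server container image.
    - name: mlserver
      image: registry.example.com/my-mlserver:1.0   # hypothetical image
```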
This server can then be targeted by a particular model by specifying this server name when creating the model, for example:
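A sketch of a Model pinned to that server; the storage location is illustrative.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris-custom
spec:
  storageUri: "gs://my-bucket/models/iris-sklearn"   # illustrative path
  requirements:
  - sklearn
  # Pin this model to the custom server defined above.
  server: mlserver-custom
```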
One can also create a Server definition to add a persistent volume to your server. This can be used to allow models to be loaded directly from the persistent volume.
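A sketch, assuming a pre-existing PersistentVolumeClaim named ml-models-pvc and a mount path of /var/models; both names, and the choice to mount into the rclone container, are assumptions to adapt to your ServerConfig.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-pvc
spec:
  serverConfig: mlserver
  podSpec:
    volumes:
    - name: models-volume
      persistentVolumeClaim:
        claimName: ml-models-pvc           # hypothetical PVC
    containers:
    # Mount the volume so artifacts can be read directly from the
    # persistent volume instead of being downloaded.
    - name: rclone
      volumeMounts:
      - name: models-volume
        mountPath: /var/models             # hypothetical mount path
```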
The server can be targeted by a model whose artifact is on the persistent volume as shown below.
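A sketch, continuing the hypothetical mount path above:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris-pvc
spec:
  # Artifact is read from the path mounted from the persistent volume.
  storageUri: "/var/models/iris"   # hypothetical path on the PVC
  requirements:
  - sklearn
  server: mlserver-pvc
```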
When a SeldonConfig resource changes, any SeldonRuntime resources that reference the changed SeldonConfig will also be updated immediately. If this behaviour is not desired, you can set spec.disableAutoUpdate in the SeldonRuntime resource so that it is not updated immediately but only when it changes or any owned resource changes.
A fully worked example for this can be found in the examples.
An alternative would be to create your own ServerConfig for more complex use cases, or if you want to standardise the Server definition in one place.