Seldon Core 2 provides a highly configurable deployment framework that allows you to fine-tune various components using Helm configuration options. These options offer control over deployment behavior, resource management, logging, autoscaling, and model lifecycle policies to optimize the performance and scalability of machine learning deployments.
This section details the key Helm configuration parameters for Envoy, Autoscaling, Servers, the Model Control Plane, and Logging, so that you can customize deployment workflows and improve operational efficiency.
- Envoy: Manage pre-stop behaviors and configure access logging to track request-level interactions.
- Autoscaling (Experimental): Fine-tune dynamic scaling policies for efficient resource allocation based on real-time inference workloads.
- Servers: Define grace periods for controlled shutdowns and optimize model control plane parameters for efficient model loading, unloading, and error handling.
- Logging: Define log levels for the different components of the system.
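The keys listed in the tables below are Helm values for the chart named in the Chart column (the components chart); dotted keys map to nested YAML in a values file passed with -f / --values. As a minimal sketch using the Envoy keys described in the next section, simply restating their defaults (the file name is illustrative):

# envoy-values.yaml (illustrative name), passed to the components chart
envoy:
  preStopSleepPeriodSeconds: 30
  terminationGracePeriodSeconds: 120
  enableAccesslog: true
  accesslogPath: /tmp/envoy-accesslog.txt
  includeSuccessfulRequests: false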
Envoy
Prestop
| Key | Chart | Description | Default |
| --- | --- | --- | --- |
| envoy.preStopSleepPeriodSeconds | components | Sleep after calling prestop command. | 30 |
| envoy.terminationGracePeriodSeconds | components | Grace period to wait for prestop to finish for Envoy pods. | 120 |
Access Log
| Key | Chart | Description | Default |
| --- | --- | --- | --- |
| envoy.enableAccesslog | components | Whether to enable logging of requests. | true |
| envoy.accesslogPath | components | Path on disk to store the log file. Only used when enableAccesslog is set. | /tmp/envoy-accesslog.txt |
| envoy.includeSuccessfulRequests | components | Whether to include successful requests. If set to false, only failed requests are logged. Only used when enableAccesslog is set. | false |
Autoscaling
Native autoscaling (experimental)
| Key | Chart | Description | Default |
| --- | --- | --- | --- |
| autoscaling.autoscalingModelEnabled | components | Enable native autoscaling for Models. This is orthogonal to external autoscaling services, e.g. HPA. | false |
| autoscaling.autoscalingServerEnabled | components | Enable native autoscaling for Servers. This is orthogonal to external autoscaling services, e.g. HPA. | true |
| agent.scalingStatsPeriodSeconds | components | Sampling period, in seconds, for the metrics used for autoscaling. | 20 |
| agent.modelInferenceLagThreshold | components | Queue lag threshold that triggers scaling up of a model replica. | 30 |
| agent.modelInactiveSecondsThreshold | components | Period, in seconds, with no activity after which a model replica is scaled down. | 600 |
| autoscaling.serverPackingEnabled | components | Whether packing of models onto fewer servers is enabled. | false |
| autoscaling.serverPackingPercentage | components | Percentage of events where packing is allowed; higher values represent more aggressive packing. Only used when serverPackingEnabled is set. Range is 0.0 to 1.0. | 0.0 |
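For example, turning on native model autoscaling together with server packing could look like the following values fragment; the numbers are illustrative, not recommendations:

autoscaling:
  autoscalingModelEnabled: true      # native autoscaling for Models
  autoscalingServerEnabled: true     # native autoscaling for Servers (default)
  serverPackingEnabled: true
  serverPackingPercentage: 0.5       # allow packing on half of eligible events
agent:
  scalingStatsPeriodSeconds: 20
  modelInferenceLagThreshold: 30
  modelInactiveSecondsThreshold: 300 # scale down idle model replicas after 5 minutes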
Server
Prestop
| Key | Chart | Description | Default |
| --- | --- | --- | --- |
| serverConfig.terminationGracePeriodSeconds | components | Grace period to wait for prestop process to finish for this particular Server pod. | 120 |
Model Control Plane
| Key | Chart | Description | Default |
| --- | --- | --- | --- |
| agent.overcommitPercentage | components | Percentage of memory overcommit allowed. Range is 0 to 100. | 10 |
| agent.maxLoadElapsedTimeMinutes | components | Maximum time, in minutes, allowed for one model load command for a model on a particular server replica. Lower values expose errors faster. | 120 |
| agent.maxLoadRetryCount | components | Maximum number of retries for an unsuccessful load command for a model on a particular server replica. Lower values allow control plane commands to fail faster. | 5 |
| agent.maxUnloadElapsedTimeMinutes | components | Maximum time, in minutes, allowed for one model unload command for a model on a particular server replica. Lower values expose errors faster. | 15 |
| agent.maxUnloadRetryCount | components | Maximum number of retries for an unsuccessful unload command for a model on a particular server replica. Lower values allow control plane commands to fail faster. | 5 |
| agent.unloadGracePeriodSeconds | components | Grace period guarding against a race condition: it gives Envoy time to apply the cluster change that removes a route before the model replica unload command proceeds. | 2 |
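As a sketch, the model control plane settings (together with the Server prestop grace period from the previous table) combine in a values file as follows; the numbers are illustrative:

serverConfig:
  terminationGracePeriodSeconds: 120 # prestop grace period for Server pods
agent:
  overcommitPercentage: 10
  maxLoadElapsedTimeMinutes: 60      # surface load errors sooner than the default 120
  maxLoadRetryCount: 5
  maxUnloadElapsedTimeMinutes: 15
  maxUnloadRetryCount: 5
  unloadGracePeriodSeconds: 2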
Logging
Component Log Level
| Key | Chart | Description | Default |
| --- | --- | --- | --- |
| logging.logLevel | components | Component-wide setting for the logging level, used when individual component levels are not set. Options are: debug, info, error. | |

Note: the Kafka client library log level is set from the log level passed to the component, which may differ from the level expected by librdkafka (a syslog level); in this case we attempt to map the value to the closest match.
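For example, to run all components at debug level (a values fragment using only the key from the table above):

logging:
  logLevel: debug   # one of: debug, info, error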
Server Config
Note: This section is for advanced usage where you want to define new types of inference servers.
Server configurations define how to create an inference server. By default one is provided for Seldon MLServer and one for NVIDIA Triton Inference Server. Both of these servers support the V2 inference protocol, which is a requirement for all inference servers. A ServerConfig defines the Kubernetes ReplicaSet for the server, including the Seldon Agent reverse proxy and an Rclone server for downloading artifacts for the server. The Kustomize ServerConfig for MLServer is shown below:
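An abbreviated sketch of its shape is given here; the container images are indicative and most container detail is elided, so refer to the manifest shipped with the charts for the authoritative version:

apiVersion: mlops.seldon.io/v1alpha1
kind: ServerConfig
metadata:
  name: mlserver
spec:
  podSpec:
    terminationGracePeriodSeconds: 120
    containers:
    - name: rclone                  # artifact download server
      image: rclone/rclone          # tag elided
    - name: agent                   # Seldon Agent reverse proxy
      image: seldonio/seldon-agent  # tag elided
    - name: mlserver                # the V2-protocol inference server itself
      image: seldonio/mlserver      # tag elided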
Server Runtime
The SeldonRuntime resource is used to create an instance of Seldon installed in a particular namespace.
type SeldonRuntimeSpec struct {
    SeldonConfig string              `json:"seldonConfig"`
    Overrides    []*OverrideSpec     `json:"overrides,omitempty"`
    Config       SeldonConfiguration `json:"config,omitempty"`
    // +Optional
    // If set then when the referenced SeldonConfig changes we will NOT update the SeldonRuntime immediately.
    // Explicit changes to the SeldonRuntime itself will force a reconcile though
    DisableAutoUpdate bool `json:"disableAutoUpdate,omitempty"`
}

type OverrideSpec struct {
    Name        string         `json:"name"`
    Disable     bool           `json:"disable,omitempty"`
    Replicas    *int32         `json:"replicas,omitempty"`
    ServiceType v1.ServiceType `json:"serviceType,omitempty"`
    PodSpec     *PodSpec       `json:"podSpec,omitempty"`
}
The specification above contains overrides for the chosen SeldonConfig. To override the PodSpec for a given component, the overrides field needs to specify the component name and the PodSpec needs to specify the container name, along with fields to override.
For instance, the following overrides the resource limits for cpu and memory in the hodometer component in the seldon-mesh namespace, while using the values specified in the referenced SeldonConfig (e.g. default) for everything else.
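A sketch of such an override follows; the runtime name and the resource limits are illustrative values:

apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
  name: seldon
  namespace: seldon-mesh
spec:
  seldonConfig: default
  overrides:
  - name: hodometer
    podSpec:
      containers:
      - name: hodometer
        resources:
          limits:
            cpu: 200m      # illustrative limit
            memory: 500Mi  # illustrative limit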
For a minimal installation you need only define the SeldonConfig to use as a base, for example to install in the seldon-mesh namespace with the SeldonConfig named default:
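For example (the runtime name is illustrative):

apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
  name: seldon
  namespace: seldon-mesh
spec:
  seldonConfig: default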
The helm chart seldon-core-v2-runtime allows easy creation of this resource and associated default Servers for an installation of Seldon in a particular namespace.
SeldonConfig Update Propagation
When a SeldonConfig resource changes, any SeldonRuntime resources that reference it are also updated immediately. If this behaviour is not desired, you can set spec.disableAutoUpdate in the SeldonRuntime resource so that it is not updated immediately, but only when the SeldonRuntime itself or any of its owned resources changes.
Seldon Config
Note: This section is for advanced usage where you want to define how Seldon is installed in each namespace.
The SeldonConfig resource defines the core installation components installed by Seldon. If you wish to install Seldon, you can use the SeldonRuntime resource, which allows easy overriding of some parts defined in this specification. In general, we advise core DevOps teams to use the default SeldonConfig or customize it for their usage. Individual installations of Seldon can then use the SeldonRuntime with a few overrides for any special customisation needed in that namespace.
The specification contains core PodSpecs for each core component and a section for general configuration, including the ConfigMaps that are created for the Agent (rclone defaults), Kafka, and Tracing (OpenTelemetry).
Some of these values can be overridden on a per-namespace basis via the SeldonRuntime resource. Labels and annotations can also be set at the component level; these will be merged with the labels and annotations from the SeldonConfig resource in which they are defined and added to the component's corresponding Deployment or StatefulSet.
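As a rough sketch of the shape this takes (the component name, label, and config sub-sections shown are assumptions for illustration; consult the SeldonConfig bundled with the charts for the authoritative layout):

apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonConfig
metadata:
  name: default
spec:
  components:                        # core PodSpec per core component
  - name: hodometer
    labels:
      example.org/team: mlops        # merged into the component's Deployment or StatefulSet
    podSpec:
      containers:
      - name: hodometer
        image: seldonio/hodometer    # illustrative image reference
  config:                            # general configuration: agent (rclone), Kafka, tracing
    agentConfig: {}
    kafkaConfig: {}
    tracingConfig: {}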
Pipeline
Pipelines allow you to connect flows of inference data transformed by Model components. A directed acyclic graph (DAG) of steps can be defined to join Models together. Each Model needs to be capable of receiving a V2 inference request and responding with a V2 inference response. An example Pipeline is shown below:
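The manifest below is a sketch reconstructed from the description that follows; the Pipeline name and the tfsimpleN.outputs.<tensor> reference form used in tensorMap are assumptions:

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tfsimples
spec:
  steps:
    - name: tfsimple1
    - name: tfsimple2
    - name: tfsimple3
      inputs:
      - tfsimple1
      - tfsimple2
      tensorMap:
        tfsimple1.outputs.OUTPUT0: INPUT0
        tfsimple2.outputs.OUTPUT1: INPUT1
  output:
    steps:
    - tfsimple3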
The steps list shows three models: tfsimple1, tfsimple2, and tfsimple3. These three models each take two integer tensors called INPUT0 and INPUT1. The models produce two outputs, OUTPUT0 (the sum of the inputs) and OUTPUT1 (the second input subtracted from the first).
tfsimple1 and tfsimple2 take as their input the input to the Pipeline: the default assumption when no explicit inputs are defined. tfsimple3 takes one V2 tensor input from each of the outputs of tfsimple1 and tfsimple2. As the outputs of tfsimple1 and tfsimple2 have tensors named OUTPUT0 and OUTPUT1, their names need to be changed to match the expected input tensors; this is done with a tensorMap entry providing the renaming. This is only required if your models cannot be directly chained together.
The output of the Pipeline is the output from the tfsimple3 model.
Detailed Specification
The full GoLang specification for a Pipeline is shown below:
type PipelineSpec struct {
    // External inputs to this pipeline, optional
    Input *PipelineInput `json:"input,omitempty"`
    // The steps of this inference graph pipeline
    Steps []PipelineStep `json:"steps"`
    // Synchronous output from this pipeline, optional
    Output *PipelineOutput `json:"output,omitempty"`
}

// +kubebuilder:validation:Enum=inner;outer;any
type JoinType string

const (
    // data must be available from all inputs
    JoinTypeInner JoinType = "inner"
    // data will include any data from any inputs at end of window
    JoinTypeOuter JoinType = "outer"
    // first data input that arrives will be forwarded
    JoinTypeAny JoinType = "any"
)

type PipelineStep struct {
    // Name of the step
    Name string `json:"name"`
    // Previous step to receive data from
    Inputs []string `json:"inputs,omitempty"`
    // msecs to wait for messages from multiple inputs to arrive before joining the inputs
    JoinWindowMs *uint32 `json:"joinWindowMs,omitempty"`
    // Map of tensor name conversions to use e.g. output1 -> input1
    TensorMap map[string]string `json:"tensorMap,omitempty"`
    // Triggers required to activate step
    Triggers []string `json:"triggers,omitempty"`
    // +kubebuilder:default=inner
    InputsJoinType   *JoinType `json:"inputsJoinType,omitempty"`
    TriggersJoinType *JoinType `json:"triggersJoinType,omitempty"`
    // Batch size of request required before data will be sent to this step
    Batch *PipelineBatch `json:"batch,omitempty"`
}

type PipelineBatch struct {
    Size     *uint32 `json:"size,omitempty"`
    WindowMs *uint32 `json:"windowMs,omitempty"`
    Rolling  bool    `json:"rolling,omitempty"`
}

type PipelineInput struct {
    // Previous external pipeline steps to receive data from
    ExternalInputs []string `json:"externalInputs,omitempty"`
    // Triggers required to activate inputs
    ExternalTriggers []string `json:"externalTriggers,omitempty"`
    // msecs to wait for messages from multiple inputs to arrive before joining the inputs
    JoinWindowMs *uint32 `json:"joinWindowMs,omitempty"`
    // +kubebuilder:default=inner
    JoinType *JoinType `json:"joinType,omitempty"`
    // +kubebuilder:default=inner
    TriggersJoinType *JoinType `json:"triggersJoinType,omitempty"`
    // Map of tensor name conversions to use e.g. output1 -> input1
    TensorMap map[string]string `json:"tensorMap,omitempty"`
}

type PipelineOutput struct {
    // Previous step to receive data from
    Steps []string `json:"steps,omitempty"`
    // msecs to wait for messages from multiple inputs to arrive before joining the inputs
    JoinWindowMs uint32 `json:"joinWindowMs,omitempty"`
    // +kubebuilder:default=inner
    StepsJoin *JoinType `json:"stepsJoin,omitempty"`
    // Map of tensor name conversions to use e.g. output1 -> input1
    TensorMap map[string]string `json:"tensorMap,omitempty"`
}