Install Seldon Core 2 in a local learning environment.
You can install Seldon Core 2 on your local computer that is running a Kubernetes cluster using kind.
Seldon publishes the Helm charts that are required to install Seldon Core 2. For more information about the Helm charts and the related dependencies, see Helm charts and Dependencies.
Install a Kubernetes cluster that is running version 1.27 or later.
Install kubectl, the Kubernetes command-line tool.
Create a namespace to contain the main components of Seldon Core 2. For example, create the seldon-mesh
namespace.
Add the Helm chart repository seldon-charts and update it.
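For reference, a minimal sketch of this step is shown below. The seldon-charts alias comes from this guide, while the repository URL is an assumption based on Seldon's publicly hosted Helm charts; substitute your own mirror if needed.

```bash
# Add the Seldon Helm chart repository under the alias used in this guide,
# then refresh the local chart index. The URL is assumed; verify it against
# the published Helm charts page.
helm repo add seldon-charts https://seldonio.github.io/helm-charts
helm repo update
```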
Install Custom resource definitions for Seldon Core 2.
Install Seldon Core 2 operator in the seldon-mesh
namespace.
This configuration installs the Seldon Core 2 operator across an entire Kubernetes cluster. To perform cluster-wide operations, create ClusterRoles
and ensure your user has the necessary permissions during deployment. With cluster-wide operations, you can create SeldonRuntimes
in any namespace.
You can configure the installation to deploy the Seldon Core 2 operator in a specific namespace so that it controls resources only in that namespace. To do this, set controller.clusterwide
to false
.
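A hedged sketch of the CRD and operator installation follows, using the chart names listed later in this guide (seldon-core-v2-crds and seldon-core-v2-setup); exact release names and flags may differ in your setup.

```bash
# Install the cluster-wide custom resource definitions.
helm install seldon-core-v2-crds seldon-charts/seldon-core-v2-crds

# Install the operator in the seldon-mesh namespace. Set
# controller.clusterwide=false to restrict it to this namespace only.
helm install seldon-core-v2 seldon-charts/seldon-core-v2-setup \
  --namespace seldon-mesh \
  --set controller.clusterwide=true
```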
Install Seldon Core 2 runtimes in the seldon-mesh
namespace.
Install Seldon Core 2 servers in the seldon-mesh
namespace. Two example servers named mlserver-0
and triton-0
are installed so that you can load models onto these servers after installation.
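The runtime and server charts can be installed in a similar way; this sketch assumes the chart names described in the Helm charts section of this guide.

```bash
# Install the SeldonRuntime (core components) into seldon-mesh.
helm install seldon-core-v2-runtime seldon-charts/seldon-core-v2-runtime \
  --namespace seldon-mesh

# Install the example servers (mlserver-0 and triton-0) into seldon-mesh.
helm install seldon-core-v2-servers seldon-charts/seldon-core-v2-servers \
  --namespace seldon-mesh
```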
Check that the Seldon Core 2 operator, runtimes, servers, and CRDs are installed in the seldon-mesh
namespace. It might take a couple of minutes for all the Pods to be ready. To check the status of the Pods in real time, use this command: kubectl get pods -w -n seldon-mesh
.
The output should be similar to this:
You can install Seldon Core 2 and its components using Ansible in one of the following methods:
To install Seldon Core 2 into a new local kind Kubernetes cluster, you can use the seldon-all
playbook with a single command:
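As a rough sketch, assuming the playbooks live under the ansible/ directory of a Seldon Core 2 repository checkout (the exact path may differ in your checkout):

```bash
# Run the all-in-one playbook: creates a kind cluster, installs the
# ecosystem dependencies, and installs the Seldon Core 2 components.
cd ansible
ansible-playbook playbooks/seldon-all.yaml
```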
This creates a kind cluster and installs ecosystem dependencies such as Kafka, Prometheus, OpenTelemetry, and Jaeger, as well as all the Seldon-specific components. The Seldon components are installed using Helm charts from the current git checkout (../k8s/helm-charts/
).
Internally this runs, in order, the following playbooks:
kind-cluster.yaml
setup-ecosystem.yaml
setup-seldon.yaml
You may pass any of the additional variables that are configurable for those playbooks to seldon-all
.
For example:
Running the playbooks individually gives you more control over what runs and when, for example, if you want to install into an existing Kubernetes cluster.
Create a kind cluster.
Set up the ecosystem.
Seldon runs by default in the seldon-mesh
namespace, where a Jaeger pod and an OpenTelemetry collector are also installed. To install in a different <mynamespace>
namespace:
Install Seldon Core 2 in the ansible/
folder.
To install in a different namespace, <mynamespace>:
If you installed Seldon Core 2 using Helm, you need to complete the installation of other components in the following order:
Seldon Core 2 uses a microservice architecture where each service has limited and well-defined responsibilities working together to orchestrate scalable and fault-tolerant ML serving and management. These components communicate internally using gRPC and they can be scaled independently. Seldon Core 2 services can be split into two categories:
Control Plane services are responsible for managing the operations and configurations of your ML models and workflows. This includes functionality to instantiate new inference servers, load models, update new versions of models, configure model experiments and pipelines, and expose endpoints that may receive inference requests. The main control plane component is the Scheduler that is responsible for managing the loading and unloading of resources (models, pipelines, experiments) onto the respective components.
Data Plane services are responsible for managing the flow of data between components or models. Core 2 supports REST and gRPC payloads that follow the Open Inference Protocol (OIP). The main data plane service is Envoy, which acts as a single ingress for all data plane traffic and routes it to the relevant servers internally (e.g. Seldon MLServer or NVIDIA Triton pods).
The current set of services used in Seldon Core 2 is shown below. Following the diagram, we describe each control plane and data plane service.
This service manages the loading and unloading of Models, Pipelines and Experiments on the relevant micro services. It is also responsible for matching Models with available Servers in a way that optimises infrastructure use. In the current design we can only have one instance of the Scheduler as its internal state is persisted on disk.
When the Scheduler (re)starts there is a synchronisation flow to coordinate the startup process and to attempt to wait for expected Model Servers to connect before proceeding with control plane operations. This is important so that ongoing data plane operations are not interrupted. This introduces a delay on any control plane operations until the process has finished (including control plane resource status updates). This synchronisation process has a timeout, which defaults to 10 minutes. It can be changed by setting the seldon-core-v2-components Helm value scheduler.schedulerReadyTimeoutSeconds
.
This service manages the loading and unloading of models on a server and access to the server over REST/gRPC. It acts as a reverse proxy to connect end users with the actual Model Servers. In this way the system collects stats and metrics about data plane inferences that helps with observability and scaling.
We also provide a Kubernetes Operator to allow Kubernetes usage. This is implemented in the Controller Manager microservice, which manages CRD reconciliation with Scheduler. Currently Core 2 supports one instance of the Controller.
This service handles REST/gRPC calls to Pipelines. It translates between synchronous requests to Kafka operations, producing a message on the relevant input topic for a Pipeline and consuming from the output topic to return inference results back to the users.
This service handles the flow of data from models to inference requests on servers and passes on the responses via Kafka.
This service handles the flow of data between components in a pipeline, using Kafka Streams. It enables Core 2 to chain and join Models together to provide complex Pipelines.
This service manages the proxying of requests to the correct servers including load balancing.
To support the movement towards data centric machine learning Seldon Core 2 follows a dataflow paradigm. By taking a decentralized route that focuses on the flow of data, users can have more flexibility and insight as they build and manage complex AI applications in production. This contrasts with more centralized orchestration approaches where data is secondary.
Kafka is used as the backbone for Pipelines allowing decentralized, synchronous and asynchronous usage. This enables Models to be connected together into arbitrary directed acyclic graphs. Models can be reused in different Pipelines. The flow of data between models is handled by the dataflow engine using Kafka Streams.
By focusing on the data we allow users to join various flows together using stream joining concepts as shown below.
We support several types of joins:
inner joins, where all inputs need to be present for a transaction to join the tensors passed through the Pipeline;
outer joins, where only a subset needs to be available during the join window
triggers, in which data flows need to wait until records on one or more trigger data flows appear. The data in these triggers is not passed onwards from the join.
These techniques allow users to create complex pipeline flows of data between machine learning components.
More discussion on the data flow view of machine learning and its effect on v2 design can be found here.
In the context of machine learning and Seldon Core 2, concepts provide a framework for understanding key functionalities, architectures, and workflows within the system. Some of the Key concepts in Seldon Core 2 are:
Data-centricity is an approach that prioritizes the management, integrity, and flow of data at the core of machine learning deployment. Rather than focusing solely on models, this approach ensures that data quality, consistency, and adaptability drive successful ML operations. In Seldon Core 2, data-centricity is embedded in every stage of the inference workflow, enabling scalable, real-time, and standardized model serving.
Key Principle
Description
Flexible Workflows
Seldon Core 2 supports adaptable and scalable data pathways, accommodating various use cases and experiments. This ensures ML pipelines remain agile, allowing you to evolve the data processing strategies as requirements change.
Real-Time Data Streaming
Integrated data streaming capabilities allow you to view, store, manage, and process data in real time. This enhances responsiveness and decision-making, ensuring models work with the most up-to-date data for accurate predictions.
Standardized Processing
Seldon Core 2 promotes reusable and consistent data transformation and routing mechanisms. Standardized processing ensures data integrity and uniformity across applications, reducing errors and inconsistencies.
Comprehensive Monitoring
Detailed metrics and logs provide real-time visibility into data integrity, transformations, and flow. This enables effective oversight and maintenance, allowing teams to detect anomalies, drifts, or inefficiencies early.
Why Data-Centricity Matters
By adopting a data-centric approach, Seldon Core 2 enables:
More reliable and high-quality predictions by ensuring clean, well-structured data.
Scalable and future-proof ML deployments through standardized data management.
Efficient monitoring and maintenance, reducing risks related to model drift and inconsistencies.
With data-centricity as a core principle, Seldon Core 2 ensures end-to-end control over ML workflows, enabling you to maximize model performance and reliability in production environments.
The Open Inference Protocol (OIP) ensures standardized communication between inference servers and clients, enabling interoperability and flexibility across different model-serving runtimes in Seldon Core 2.
To be compliant, an inference server must implement these three key APIs:
API
Function
Health API
Ensures the server is operational and available for inference requests.
Metadata API
Provides essential details about deployed models, including capabilities and configurations.
V2 Inference Protocol API
Facilitates standardized request and response handling for model inference.
Protocol Compatibility
Flexible API Support: A compliant server can implement either HTTP/REST or gRPC APIs, or both.
Runtime Compatibility: Users should refer to the model serving runtime table or the protocolVersion field in the runtime YAML to confirm V2 Inference Protocol support for their serving runtime.
Case Sensitivity and Extensions: All strings are case-sensitive across API descriptions. The V2 Inference Protocol includes an extension mechanism, though specific extensions are defined separately.
By adopting the V2 Inference Protocol, Seldon Core 2 ensures standardized, scalable, and flexible model serving across diverse deployment environments.
Components are the building blocks of an inference graph, processing data at various stages of the ML inference pipeline. They provide reusable, standardized interfaces, making it easier to maintain and update workflows without disrupting the entire system. Components include ML models, data processors, routers, and supplementary services.
Types of Components
Component Type
Description
Sources
Starting points of an inference graph that receive and validate incoming data before processing.
Data Processors
Transform, filter, or aggregate data to ensure consistent, repeatable pre-processing.
Data Routers
Dynamically route data to different paths based on predefined rules for A/B testing, experimentation, or conditional logic.
Models
Perform inference tasks, including classification, regression, and Large Language Models (LLMs), hosted internally or via external APIs.
Supplementary Data Services
External services like vector databases that enable models to access embeddings and extended functionality.
Drift/Outlier Detectors & Explainers
Monitor model predictions for drift, anomalies, and explainability insights, ensuring transparency and performance tracking.
Sinks
Endpoints of an inference graph that deliver results to external consumers while maintaining a stable interface.
By modularizing inference workflows, components allow you to scale, experiment, and optimize ML deployments efficiently while ensuring data consistency and reliability.
In a model serving platform, a pipeline is an automated sequence of steps that manages the deployment, execution, and monitoring of machine learning models. Pipelines ensure that models are efficiently served, dynamically updated, and continuously monitored for performance and reliability in production environments.
Stage
Description
Request Ingestion
Receives inference requests from applications, APIs, or streaming sources.
Preprocessing
Transforms input data, for example through tokenization or normalization, before passing it to the model.
Model Selection & Routing
Directs requests to the appropriate model based on rules, versions, or A/B testing.
Inference Execution
Runs predictions using the deployed model.
Postprocessing
Converts model outputs into a consumable format, such as confidence scores or structured responses.
Response Delivery
Returns inference results to the requesting application or system.
Monitoring & Logging
Tracks model performance, latency, and accuracy; detects drift and triggers alerts if needed.
In Seldon Core 2, servers are responsible for hosting and serving machine learning models, handling inference requests, and ensuring scalability, efficiency, and observability in production. Seldon Core 2 supports multiple inference servers, including MLServer and NVIDIA Triton, enabling flexible and optimized model deployments.
Server
Description
Best Suited For
MLServer
A lightweight, extensible inference server designed to work with multiple ML frameworks, including scikit-learn, XGBoost, TensorFlow, and PyTorch. It supports custom Python models and integrates well with MLOps workflows.
General-purpose model serving, custom model wrappers, multi-framework support.
NVIDIA Triton
A high-performance inference server optimized for GPU and CPU acceleration, supporting deep learning models across frameworks like TensorFlow, PyTorch, and ONNX. Triton enables multi-model and ensemble model inference, making it ideal for scalable AI workloads.
High-throughput deep learning inference, multi-model deployments, GPU-accelerated workloads.
With MLServer and Triton, Seldon Core 2 provides a powerful, efficient, and flexible model-serving platform for production-scale AI applications.
In Seldon Core 2, experiments enable controlled A/B testing, model comparisons, and performance evaluations by defining an HTTP traffic split between different models or inference pipelines. This allows organizations to test multiple versions of a model in production while managing risk and ensuring continuous improvements.
Experiment Type
Description
Traffic Splitting
Distributes inference requests across different models or pipelines based on predefined percentage splits. This enables A/B testing and comparison of multiple model versions.
Mirror Testing
Sends a percentage of the traffic to a mirror model or pipeline without affecting the response returned to users. This allows evaluation of new models without impacting production workflows.
Some of the advantages of using Experiments:
A/B Testing & Model comparison: Compare different models under real-world conditions without full deployment.
Risk-Free model validation: Test a new model or pipeline in parallel without affecting live predictions.
Performance & Drift monitoring: Assess latency, accuracy, and reliability before a full rollout.
Continuous improvement: Make data-driven deployment decisions based on real-time model performance.
Seldon Core 2 is a Kubernetes-native framework for deploying and managing machine learning (ML) and Large Language Model (LLM) systems at scale. Its data-centric approach and modular architecture enable seamless handling of simple models to complex ML applications across on-premise, hybrid, and multi-cloud environments while ensuring flexibility, standardization, observability, and cost efficiency.
Seldon Core 2 offers a platform-agnostic, flexible framework for seamless deployment of different types of ML models across on-premise, cloud, and hybrid environments. Its adaptive architecture enables customizable applications, future-proofing MLOps or LLMOps by scaling deployments as data and applications evolve. The modular design enhances resource-efficiency, allowing dynamic scaling, component reuse, and optimized resource allocation. This ensures long-term scalability, operational efficiency, and adaptability to changing demands.
Seldon Core 2 enforces best practices for ML deployment, ensuring consistency, reliability, and efficiency across the entire lifecycle. By automating key deployment steps, it removes operational bottlenecks, enabling faster rollouts and allowing teams to focus on high-value tasks.
With a "learn once, deploy anywhere" approach, Seldon Core 2 standardizes model deployment across on-premise, cloud, and hybrid environments, reducing risk and improving productivity. Its unified execution framework supports conventional, foundational, and LLM models, streamlining deployment and enabling seamless scalability. It also enhances collaboration between MLOps Engineers, Data Scientists, and Software Engineers by providing a customizable framework that fosters knowledge sharing, innovation, and the adoption of new data science capabilities.
Observability in Seldon Core 2 enables real-time monitoring, analysis, and performance tracking of ML systems, covering data pipelines, models, and deployment environments. Its customizable framework combines operational and data science monitoring, ensuring teams have the key metrics needed for maintenance and decision-making.
Seldon simplifies operational monitoring, allowing real-time ML or LLM deployments to expand across organizations while supporting complex, mission-critical use cases. A data-centric approach ensures all prediction data is auditable, maintaining explainability, compliance, and trust in AI-driven decisions.
Seldon Core 2 is built for scalability, efficiency, and cost-effective ML operations, enabling you to deploy only the necessary components while maintaining agility and high performance. Its modular architecture ensures that resources are optimized, infrastructure is consolidated, and deployments remain adaptable to evolving business needs.
Scaling for Efficiency: scales infrastructure dynamically based on demand, auto-scales for real-time use cases, and scales to zero for on-demand workloads while preserving deployment state for seamless reactivation. By eliminating redundancy and optimizing deployments, it balances cost efficiency and performance, ensuring reliable inference at any scale.
Consolidated Serving Infrastructure: maximizes resource utilization with multi-model serving (MMS) and overcommit, reducing compute overhead while ensuring efficient, reliable inference.
Extendability & Modular Adaptation: integrates with LLMs, Alibi, and other modules, enabling on-demand ML expansion. Its modular design ensures scalable AI, maximizing value extraction, agility, and cost efficiency.
Reusability for Cost Optimization: provides predictable, fixed pricing, enabling cost-effective scaling and innovation while ensuring financial transparency and flexibility.
Explore our other Solutions
Learn about the features of Seldon Core 2
Join our Slack Community for updates or for answers to any questions
Install Core 2 in a production Kubernetes environment.
Set up and connect to a Kubernetes cluster running version 1.27 or later. For instructions on connecting to your Kubernetes cluster, refer to the documentation provided by your cloud provider.
Install kubectl, the Kubernetes command-line tool.
Install Helm, the package manager for Kubernetes.
To use Seldon Core 2 in a production environment:
Seldon publishes the Helm charts that are required to install Seldon Core 2. For more information about the Helm charts and the related dependencies, see Helm charts and Dependencies.
Create a namespace to contain the main components of Seldon Core 2. For example, create the namespace seldon-mesh
:
Create a namespace to contain the components related to monitoring. For example, create the namespace seldon-monitoring
:
Add the Helm chart repository seldon-charts and update it.
Install custom resource definitions for Seldon Core 2.
Install Seldon Core 2 operator in the seldon-mesh
namespace.
This configuration installs the Seldon Core 2 operator across an entire Kubernetes cluster. To perform cluster-wide operations, create ClusterRoles
and ensure your user has the necessary permissions during deployment. With cluster-wide operations, you can create SeldonRuntimes
in any namespace.
You can configure the installation to deploy the Seldon Core 2 operator in a specific namespace so that it controls resources only in that namespace. To do this, set controller.clusterwide
to false
.
Install Seldon Core 2 runtimes in the namespace seldon-mesh
.
Install Seldon Core 2 servers in the namespace seldon-mesh
. Two example servers named mlserver-0
and triton-0
are installed so that you can load models onto these servers after installation.
Check that the Seldon Core 2 operator, runtimes, servers, and CRDs are installed in the namespace seldon-mesh
:
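For example, the following commands list the installed resources; the second form watches the Pods until they are ready.

```bash
# List Pods for the operator, runtimes, and servers.
kubectl get pods -n seldon-mesh

# Watch the Pods until they all report Running/Ready.
kubectl get pods -w -n seldon-mesh
```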
The output should be similar to this:
You can integrate Seldon Core 2 with a self-hosted Kafka cluster or a managed Kafka service.
Seldon Core 2 can be installed in various setups to suit different stages of the development lifecycle. The most common modes include:
Ideal for development and testing purposes, a local setup allows for quick iteration and experimentation with minimal overhead. Common tools include:
Docker Compose: Simplifies deployment by orchestrating Seldon Core components and dependencies in Docker containers. Suitable for environments without Kubernetes, providing a lightweight alternative.
Kind (Kubernetes IN Docker): Runs a Kubernetes cluster inside Docker, offering a realistic testing environment. Ideal for experimenting with Kubernetes-native features.
Designed for high-availability and scalable deployments, a production setup ensures security, reliability, and resource efficiency. Typical tools and setups include:
Managed Kubernetes Clusters: Platforms like GKE (Google Kubernetes Engine), EKS (Amazon Elastic Kubernetes Service), and AKS (Azure Kubernetes Service) provide managed Kubernetes solutions. Suitable for enterprises requiring scalability and cloud integration.
On-Premises Kubernetes Clusters: For organizations with strict compliance or data sovereignty requirements. Can be deployed on platforms like OpenShift or custom Kubernetes setups.
By selecting the appropriate installation mode—whether it's Docker Compose for simplicity, Kind for local Kubernetes experimentation, or production-grade Kubernetes for scalability—you can effectively leverage Seldon Core 2 to meet your specific needs.
Name of the Helm Chart
Description
seldon-core-v2-crds
Cluster-wide installation of custom resources.
seldon-core-v2-setup
Installation of the manager to manage resources in the namespace or cluster-wide. This also installs default SeldonConfig
and ServerConfig
resources, allowing Runtimes and Servers to be installed on demand.
seldon-core-v2-runtime
Installs a SeldonRuntime
custom resource that creates the core components in a namespace.
seldon-core-v2-servers
Installs Server
custom resources providing example core servers to load models.
For more information, see the published Helm charts.
For the description of (some) values that can be configured for these charts, see this Helm parameters section.
Here is a list of components that Seldon Core 2 requires, along with the minimum and maximum supported versions:
Component
Minimum Version
Maximum Version
Notes
Kubernetes
1.27
1.33.0
Required
Envoy*
1.32.2
1.32.2
Required
Rclone*
1.68.2
1.69.0
Required
Kafka
3.4
3.8
Recommended (only required for operating Seldon Core 2 dataflow Pipelines)
Prometheus
2.0
2.x
Optional
Grafana
10.0
***
Optional (no hard limit on the maximum version to be used)
Prometheus-adapter
0.12
0.12
Optional
Opentelemetry Collector
0.68
***
Optional (no hard limit on the maximum version to be used)
Notes:
Envoy and Rclone: These components are included as part of the Seldon Core 2 Docker images. You are not required to install them separately but must be aware of the configuration options supported by these versions.
Kafka: Only required for operating Seldon Core 2 dataflow Pipelines. If not needed, you should avoid installing seldon-modelgateway
, seldon-pipelinegateway
, and seldon-dataflow-engine
.
Maximum Versions marked with ***
indicate no hard limit on the version that can be used.
Learning environment
Install Seldon Core 2 in Docker Compose, or Kind
Production environment
Install Seldon Core 2 in a Managed Kubernetes cluster, or On-Premises Kubernetes cluster
Integrate self-hosted Kafka with Seldon Core 2.
You can run Kafka in the same Kubernetes cluster that hosts Seldon Core 2. Seldon recommends using the Strimzi operator for Kafka installation and maintenance. For more details about configuring Kafka with Seldon Core 2, see the Configuration section.
Integrating self-hosted Kafka with Seldon Core 2 includes these steps:
Strimzi provides a Kubernetes Operator to deploy and manage Kafka clusters. First, install the Strimzi Operator in your Kubernetes cluster.
Create a namespace where you want to install Kafka. For example the namespace seldon-mesh
:
Install Strimzi.
Install Strimzi Operator.
This deploys the Strimzi Operator
in the seldon-mesh
namespace. After the Strimzi Operator is running, you can create a Kafka cluster by applying a Kafka custom resource definition.
Create a YAML file to specify the initial configuration.
Note: This configuration sets up a Kafka cluster with version 3.9.0. Ensure that you review the supported versions of Kafka and update the version in the kafka.yaml
file as needed. For more configuration examples, see the strimzi-kafka-operator repository.
Use your preferred text editor to create and save the file as kafka.yaml
with the following content:
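As a rough sketch, assuming Strimzi's kafka.strimzi.io/v1beta2 API and a KRaft-based cluster managed through node pools, kafka.yaml could look similar to the following; treat it as a starting point and check the Strimzi examples for the authoritative schema.

```bash
# Write a minimal Kafka custom resource. The listener, replication, and
# entity operator settings here are illustrative defaults for a test cluster.
cat > kafka.yaml <<'EOF'
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: seldon
  namespace: seldon-mesh
  annotations:
    strimzi.io/node-pools: enabled
    strimzi.io/kraft: enabled
spec:
  kafka:
    version: 3.9.0
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
    config:
      auto.create.topics.enable: true
      default.replication.factor: 1
      min.insync.replicas: 1
  entityOperator:
    topicOperator: {}
    userOperator: {}
EOF
```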
Apply the Kafka cluster configuration.
Create a YAML file named kafka-nodepool.yaml
to create a node pool for the Kafka cluster.
Apply the Kafka node pool configuration.
Check the status of the Kafka Pods to ensure they are running properly:
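For example:

```bash
# The Kafka brokers, the entity operator, and the Strimzi operator should
# all be in the Running state.
kubectl get pods -n seldon-mesh
```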
You should see multiple Pods for Kafka and the Strimzi operator running.
Error
The Pod that begins with the name seldon-dataflow-engine
does not show the status as Running
.
One of the possible reasons could be that the DNS resolution for the service failed.
Solution
Check the logs of the Pod <seldon-dataflow-engine>
:
In the output check if a message reads:
Verify the name
in the metadata
for the kafka.yaml
and kafka-nodepool.yaml
. It should read seldon
.
Check the name of the Kafka services in the namespace:
Restart the Pod:
When the SeldonRuntime
is installed in a namespace, a ConfigMap is created with the
settings for Kafka configuration. Update the ConfigMap
only if you need to customize the configurations.
Verify that the ConfigMap resource named seldon-kafka
that is created in the namespace seldon-mesh
:
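For example:

```bash
# List the ConfigMaps created in the namespace.
kubectl get configmaps -n seldon-mesh
```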
You should see the ConfigMaps for Kafka, ZooKeeper, the Strimzi operator, and others.
View the configuration of the ConfigMap named seldon-kafka
.
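For example:

```bash
# Print the Kafka settings rendered into the seldon-kafka ConfigMap.
kubectl get configmap seldon-kafka -n seldon-mesh -o yaml
```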
You should see output similar to this:
After you integrate Seldon Core 2 with Kafka, you need to Install an Ingress Controller that adds an abstraction layer for traffic routing by receiving traffic from outside the Kubernetes platform and load balancing it to Pods running within the Kubernetes cluster.
To customize the settings you can add and modify the Kafka configuration using Helm, for example to add compression for producers.
Create a YAML file to specify the compression configuration for Seldon Core 2 runtime. For example, create the values-runtime-kafka-compression.yaml
file. Use your preferred text editor to create and save the file with the following content:
Change to the directory that contains the values-runtime-kafka-compression.yaml
file and then install Seldon Core 2 runtime in the namespace seldon-mesh
.
If you are using a shared Kafka cluster with other applications, it is advisable to isolate topic names and consumer group IDs from other cluster users to prevent naming conflicts. This can be achieved by configuring the following two settings:
topicPrefix
: set a prefix for all topics
consumerGroupIdPrefix
: set a prefix for all consumer groups
Here’s an example of how to configure topic name and consumer group ID isolation during a Helm installation for an application named myorg
:
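The sketch below shows the general shape of such an installation; the exact value paths for the prefixes depend on the chart version, so confirm them against the chart's values.yaml before running.

```bash
# Hypothetical example: prefix topics and consumer group IDs with "myorg".
helm upgrade --install seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
  --namespace seldon-mesh \
  --set kafka.topicPrefix=myorg \
  --set kafka.consumerGroupIdPrefix=myorg
```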
After you install Seldon Core 2 and Kafka using Helm, you need to complete Installing a Service mesh.
Kafka is a component in the Seldon Core 2 ecosystem that provides scalable, reliable, and flexible communication for machine learning deployments. It serves as a strong backbone for building complex inference pipelines, managing high-throughput asynchronous predictions, and seamlessly integrating with event-driven systems, which are key features needed for contemporary enterprise-grade ML platforms.
An inference request is a request sent to a machine learning model to make a prediction or inference based on input data. It is a core concept in deploying machine learning models in production, where models serve predictions to users or systems in real-time or batch mode.
To explore this feature of Seldon Core 2, you need to integrate with Kafka. Integrate Kafka through managed cloud services or by deploying it directly within a Kubernetes cluster.
Securing Kafka provides more information about encryption and authentication.
Configuration examples provides the steps to configure some of the managed Kafka services.
Seldon Core 2 requires Kafka to implement data-centric inference Pipelines. To install Kafka for testing purposes in your Kubernetes cluster, use the Strimzi operator. For more information, see Self Hosted Kafka.
After the models are deployed, Core 2 enables the monitoring and experimentation on those systems in production. With support for a wide range of model types, and design patterns to build around those models, you can standardize ML deployment across a range of use-cases in the cloud or on-premise serving infrastructure of your choice.
Seldon Core 2 orchestrates and scales machine learning components running as production-grade microservices. These components can be deployed locally or in enterprise-scale kubernetes clusters. The components of your ML system - such as models, processing steps, custom logic, or monitoring methods - are deployed as Models, leveraging serving solutions compatible with Core 2 such as MLServer, Alibi, LLM Module, or Triton Inference Server. These serving solutions package the required dependencies and standardize inference using the Open Inference Protocol. This ensures that, regardless of your model types and use-cases, all request and responses follow a unified format. After models are deployed, they can process REST or gRPC requests for real-time inference.
Machine learning applications are increasingly complex. They’ve evolved from individual models deployed as services, to complex applications that can consist of multiple models, processing steps, custom logic, and asynchronous monitoring components. With Core you can build Pipelines that connect any of these components to make data-centric applications. Core 2 handles orchestration and scaling of the underlying components of such an application, and exposes the data streamed through the application in real time using Kafka.
This approach to MLOps, influenced by our position paper Desiderata for next generation of ML model serving, enables real-time observability, insight, and control on the behavior, and performance of your ML systems.
Lastly, Core 2 provides Experiments as part of its orchestration capabilities, enabling users to implement routing logic such as A/B tests or Canary deployments to models or pipelines in production. After experiments are run, you can promote new models and/or pipelines, or launch new experiments, so that you can continuously improve the performance of your ML applications.
In Seldon Core 2 your models are deployed on inference servers, which are software components that manage the packaging and execution of ML workloads. As part of its design, Core 2 separates Servers and Models into separate resources. This approach enables flexible allocation of models to servers, aligning with the requirements of your models and the underlying infrastructure that you want your servers to run on. Core 2 also provides functionality to autoscale your models and servers up and down as needed based on your workload requirements or user-defined metrics.
With the modular design of Core 2, users are able to implement cutting-edge methods to minimize hardware costs:
Multi-Model serving consolidates multiple models onto shared inference servers to optimize resource utilization and decrease the number of servers required.
Over-commit allows you to provision more models than the available memory would normally allow by dynamically loading and unloading models from memory to disk based on demand.
Core 2 demonstrates the power of a standardized, data-centric approach to MLOps at scale, ensuring that data observability and management are prioritized across every layer of machine learning operations. Furthermore, Core 2 seamlessly integrates into end-to-end MLOps workflows, from CI/CD, managing traffic with the service mesh of your choice, alerting, data visualization, or authentication and authorization.
This modular, flexible architecture not only supports diverse deployment patterns but also ensures compatibility with the latest AI innovations. By embedding data-centricity and adaptability into its foundation, Core 2 equips organizations to scale and improve their machine learning systems effectively, to capture value from increasingly complex AI systems.
Explore our Tutorials
Join our Slack Community for updates or for answers to any questions
To confirm the successful installation of Seldon Core 2, Kafka, and the service mesh, deploy a sample model and perform an inference test. Follow these steps:
Apply the following configuration to deploy the Iris model in the namespace seldon-mesh
:
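A sketch of such a Model resource is shown below; the storage URI points at Seldon's public scikit-learn Iris example artifact and is an assumption, so substitute your own model location if it differs.

```bash
# Deploy a simple scikit-learn Iris model as a Seldon Core 2 Model resource.
kubectl apply -f - <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
  namespace: seldon-mesh
spec:
  storageUri: "gs://seldon-models/scikit-learn/iris-0.23.2/lr_model"
  requirements:
  - sklearn
EOF
```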
The output is:
Verify that the model is deployed in the namespace seldon-mesh
.
When the model is deployed, the output is similar to:
Apply the following configuration to deploy a pipeline for the Iris model in the namespace seldon-mesh
:
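A sketch of a single-step Pipeline wrapping the Iris model is shown below; the pipeline name iris-pipeline is illustrative and assumes the Model above has already been deployed.

```bash
# Deploy a pipeline that routes requests through the iris model.
kubectl apply -f - <<'EOF'
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: iris-pipeline
  namespace: seldon-mesh
spec:
  steps:
  - name: iris
  output:
    steps:
    - iris
EOF
```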
The output is:
Verify that the pipeline is deployed in the namespace seldon-mesh
.
When the pipeline is deployed, the output is similar to:
Use curl to send a test inference request to the deployed model. Replace <INGRESS_IP> with your service mesh's ingress IP address. Ensure that:
The Host header matches the expected virtual host configured in your service mesh.
The Seldon-Model header specifies the correct model name.
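A hedged example request in Open Inference Protocol format is shown below; the Host value is a placeholder for whatever virtual host your service mesh expects.

```bash
# Send a V2 (Open Inference Protocol) request to the iris model.
curl -v http://<INGRESS_IP>/v2/models/iris/infer \
  -H "Content-Type: application/json" \
  -H "Host: <your-virtual-host>" \
  -H "Seldon-Model: iris" \
  -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
```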
The output is similar to:
Use curl to send a test inference request through the pipeline to the deployed model. Replace <INGRESS_IP> with your service mesh's ingress IP address. Ensure that:
The Host header matches the expected virtual host configured in your service mesh.
The Seldon-Model header specifies the correct pipeline name.
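The request has the same shape as the model request above; the difference is the Seldon-Model header, which uses a .pipeline suffix to route to the pipeline rather than the model. The pipeline name iris-pipeline matches the earlier sketch and is an assumption.

```bash
# Send the same V2 request, but routed through the pipeline.
curl -v http://<INGRESS_IP>/v2/models/iris-pipeline/infer \
  -H "Content-Type: application/json" \
  -H "Host: <your-virtual-host>" \
  -H "Seldon-Model: iris-pipeline.pipeline" \
  -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
```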
The output is similar to:
Server configurations define how to create an inference server. By default, one is provided for Seldon MLServer and one for NVIDIA Triton Inference Server. Both of these servers support the V2 inference protocol, which is a requirement for all inference servers. The configuration defines the Kubernetes ReplicaSet, which includes the Seldon Agent reverse proxy as well as an Rclone server for downloading artifacts for the server. The Kustomize ServerConfig for MLServer is shown below:
Learn about installing Istio ingress controller in a Kubernetes cluster running Seldon Core 2.
An ingress controller functions as a reverse proxy and load balancer, implementing a Kubernetes Ingress. It adds an abstraction layer for traffic routing by receiving traffic from outside the Kubernetes platform and load balancing it to Pods running within the Kubernetes cluster.
Seldon Core 2 works seamlessly with any service mesh or ingress controller, offering flexibility in your deployment setup. This guide provides detailed instructions for installing and configuring Istio with Seldon Core 2.
Istio implements the Kubernetes ingress resource to expose a service and make it accessible from outside the cluster. You can install Istio in either a self-hosted Kubernetes cluster or a managed Kubernetes service provided by a cloud provider that is running Seldon Core 2.
Installing Istio ingress controller in a Kubernetes cluster running Seldon Core 2 involves these tasks:
Add the Istio Helm charts repository and update it:
Create the istio-system
namespace where Istio components are installed:
Install the base component:
Install Istiod, the Istio control plane:
Install Istio Ingress Gateway:
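A consolidated sketch of the commands for the steps above is shown below; the repository URL and chart names are Istio's published ones, but pin chart versions that match your Kubernetes cluster.

```bash
# Add the Istio Helm repository and refresh the index.
helm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo update

# Create the namespace for the Istio control plane.
kubectl create namespace istio-system

# Install the base chart (CRDs), the control plane, and an ingress gateway.
helm install istio-base istio/base -n istio-system
helm install istiod istio/istiod -n istio-system --wait
helm install istio-ingressgateway istio/gateway -n istio-system
```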
Verify that Istio Ingress Gateway is installed:
This should return details of the Istio Ingress Gateway, including the external IP address.
Verify that all Istio Pods are running:
The output is similar to:
Inject Envoy sidecars into application Pods in the namespace seldon-mesh
:
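For example, labelling the namespace enables automatic sidecar injection for Pods created afterwards; existing Pods pick it up when they are restarted.

```bash
# Enable automatic Envoy sidecar injection in the seldon-mesh namespace.
kubectl label namespace seldon-mesh istio-injection=enabled
```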
Verify that the injection happens to the Pods in the namespace seldon-mesh
:
Find the IP address of the Seldon Core 2 instance running with Istio:
It is important to expose the seldon-mesh
service to enable communication between deployed machine learning models and external clients or services. The Seldon Core 2 inference API is exposed through the seldon-mesh
service in the seldon-mesh
namespace. If you install Core 2 in multiple namespaces, you need to expose the seldon-mesh
service in each namespace.
Verify that the seldon-mesh
service is running, for example, in the namespace seldon
.
When the services are running you should see something similar to this:
Create a YAML file to create a VirtualService named iris-route
in the namespace seldon-mesh
. For example, create the seldon-mesh-vs.yaml
file. Use your preferred text editor to create and save the file with the following content:
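A hedged sketch of such a VirtualService is shown below; the gateway name and the service port are assumptions, so adjust them to match your Istio gateway and the port reported by kubectl get svc seldon-mesh.

```bash
# Write a VirtualService that routes traffic to the seldon-mesh service.
cat > seldon-mesh-vs.yaml <<'EOF'
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: iris-route
  namespace: seldon-mesh
spec:
  gateways:
  - istio-system/seldon-gateway   # assumed Gateway name; use your own
  hosts:
  - "*"
  http:
  - route:
    - destination:
        host: seldon-mesh.seldon-mesh.svc.cluster.local
        port:
          number: 80              # confirm against the seldon-mesh service
EOF
```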
Create a virtual service to expose the seldon-mesh
service.
When the virtual service is created, you should see this:
Additional Resources
Ensure that you install a version of Istio that is compatible with your Kubernetes cluster version. For detailed information on supported versions, refer to the Istio documentation.
Make a note of the IP address that is displayed in the output. This is the IP address that you require to send inference requests through the ingress.
Seldon recommends a managed Kafka service for production installation. You can integrate and secure your managed Kafka service with Seldon Core 2.
Some of the managed Kafka services that have been tested are:
Confluent Cloud (security: SASL/PLAIN)
Confluent Cloud (security: SASL/OAUTHBEARER)
Amazon MSK (security: mTLS)
Amazon MSK (security: SASL/SCRAM)
Azure Event Hub (security: SASL/PLAIN)
These instructions outline the necessary configurations to integrate managed Kafka services with Seldon Core 2.
You can secure Seldon Core 2 integration with managed Kafka services using:
In production settings, always set up TLS encryption with Kafka. This ensures that neither the credentials nor the payloads are transported in plaintext.
When TLS is enabled, the client needs to know the root CA certificate used to create the server’s certificate. This is used to validate the certificate sent back by the Kafka server.
Create a certificate named ca.crt
that is encoded as a PEM certificate. It is important that the certificate is saved as ca.crt
. Otherwise, Seldon Core 2 may not be able to find the certificate. Within the cluster, you can provide the server’s root CA certificate through a secret. For example, a secret named kafka-broker-tls
with a certificate.
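For example, assuming the broker's root CA has been saved locally as ca.crt:

```bash
# Create the secret; the key inside the secret must be named ca.crt.
kubectl create secret generic kafka-broker-tls -n seldon-mesh \
  --from-file=ca.crt=./ca.crt
```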
In production environments, Kafka clusters often require authentication, especially when using managed Kafka solutions. Therefore, when installing Seldon Core 2 components, it is crucial to provide the correct credentials for a secure connection to Kafka.
The type of authentication used with Kafka varies depending on the setup but typically includes one of the following:
Simple Authentication and Security Layer (SASL): Requires a username and password.
Mutual TLS (mTLS): Involves using SSL certificates as credentials.
OAuth 2.0: Uses the client credential flow to acquire a JWT token.
These credentials are stored as Kubernetes secrets within the cluster. When setting up Seldon Core 2 you must create the appropriate secret in the correct format and update the components-values.yaml
, and install-values
files respectively.
When you use SASL as the authentication mechanism for Kafka, the credentials consist of a username
and password
pair. The password is supplied through a secret.
Create a password for Seldon Core 2 in the namespace seldon-mesh
.
Values in Seldon Core 2
In Seldon Core 2 you need to specify these values in values.yaml
security.kafka.sasl.mechanism
- SASL security mechanism, e.g. SCRAM-SHA-512
security.kafka.sasl.client.username
- Kafka username
security.kafka.sasl.client.secret
- Created secret with password
security.kafka.ssl.client.brokerValidationSecret
- Certificate Authority of Kafka Brokers
The resulting set of values to include in values.yaml
is similar to:
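The sketch below uses only the keys listed above, with placeholder values; the protocol key is an assumption, so confirm the exact structure against the chart defaults.

```bash
cat > values.yaml <<'EOF'
security:
  kafka:
    protocol: SASL_SSL            # assumed key; check the chart defaults
    sasl:
      mechanism: SCRAM-SHA-512
      client:
        username: <kafka-username>
        secret: <created-password-secret>
    ssl:
      client:
        brokerValidationSecret: kafka-broker-tls
EOF
```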
The security.kafka.ssl.client.brokerValidationSecret
field is optional. Leave it empty if your brokers use a well-known Certificate Authority such as Let’s Encrypt.
When you use OAuth 2.0 as the authentication mechanism for Kafka, the credentials consist of a Client ID and Client Secret, which are used with your Identity Provider to obtain JWT tokens for authenticating with Kafka brokers.
Create a Kubernetes secret manifest file named kafka-oauth.yaml.
Store the secret in the seldon-mesh
namespace to configure with Seldon Core 2.
This secret must be present in seldon-logs
namespace and every namespace containing Seldon Core 2 runtime.
The client ID, client secret, and token endpoint URL should come from an identity provider such as Keycloak or Azure AD.
Values in Seldon Core 2
In Seldon Core 2 you need to specify these values:
security.kafka.sasl.mechanism
- set to OAUTHBEARER
security.kafka.sasl.client.secret
- Created secret with client credentials
security.kafka.ssl.client.brokerValidationSecret
- Certificate Authority of Kafka brokers
The resulting set of values in components-values.yaml
is similar to:
The security.kafka.ssl.client.brokerValidationSecret
field is optional. Leave it empty if your brokers use a well-known Certificate Authority such as Let’s Encrypt.
When you use mTLS
as the authentication mechanism, Kafka uses a set of certificates to authenticate the client:
A client certificate, referred to as tls.crt
.
A client key, referred to as tls.key
.
A root certificate, referred to as ca.crt
.
These certificates are expected to be encoded as PEM certificates and are provided through a secret, which can be created in the namespace seldon
:
This secret must be present in seldon-logs
namespace and every namespace containing Seldon Core 2 runtime.
Ensure that the fields used within the secret follow the same naming convention: tls.crt
, tls.key
and ca.crt
. Otherwise, Seldon Core 2 may not be able to find the correct set of certificates.
Reference these certificates within the corresponding Helm values for the Seldon Core 2 installation.
Values in Seldon Core 2
In Seldon Core 2 you need to specify these values:
security.kafka.ssl.client.secret
- Secret name containing client certificates
security.kafka.ssl.client.brokerValidationSecret
- Certificate Authority of Kafka Brokers
The resulting set of values in values.yaml
is similar to:
The security.kafka.ssl.client.brokerValidationSecret
field is optional. Leave it empty if your brokers use a well-known Certificate Authority such as Let’s Encrypt.
Here are some examples to create secrets for managed Kafka services such as Azure Event Hub, Confluent Cloud (SASL), and Confluent Cloud (OAuth 2.0).
Prerequisites:
You must use at least the Standard tier for your Event Hub namespace because the Basic tier does not support the Kafka protocol.
Seldon Core 2 creates two Kafka topics for each model and pipeline, plus one global topic for errors. This results in a total number of topics calculated as: 2 x (number of models + number of pipelines) + 1. This topic count is likely to exceed the limit of the Standard tier in Azure Event Hub. For more information, see quota information.
Creating a namespace and obtaining the connection string
These are the steps that you need to perform in Azure Portal.
Create an Azure Event Hub namespace. You need to have an Azure Event Hub namespace. Follow the Azure quickstart documentation to create one. Note: You do not need to create individual Event Hubs (topics) as Seldon Core 2 automatically creates all necessary topics.
Connection string for Kafka Integration. To connect to the Azure Event Hub using the Kafka API, you need to obtain Kafka endpoint and Connection string. For more information, see Get an Event Hubs connection string
Note: Ensure you get the Connection string at the namespace level, as it is needed to dynamically create new topics. The format of the Connection string should be:
Creating secrets for Seldon Core 2
These are the steps that you need to perform in the Kubernetes cluster that runs Seldon Core 2 to store the SASL password.
Create a secret named azure-kafka-secret
for Seldon Core 2 in the namespace seldon
. In the following command make sure to replace <password>
with a password of your choice and <namespace>
with the namespace from Azure Event Hub.
Create a secret named azure-kafka-secret
for Seldon Core 2 in the namespace seldon-system
. In the following command make sure to replace <password>
with a password of your choice and <namespace>
with the namespace from Azure Event Hub.
Creating API Keys
These are the steps that you need to perform in Confluent Cloud.
Navigate to Clients > New client, choose a client, for example Go, and generate a new Kafka cluster API key. For more information, see the Confluent documentation.
Confluent generates a configuration file with the details.
Save the values of Key
, Secret
, and bootstrap.servers
from the configuration file.
Creating secrets for Seldon Core 2
These are the steps that you need to perform in the Kubernetes cluster that runs Seldon Core 2 to store the SASL password.
Create a secret named confluent-kafka-sasl
for Seldon Core 2 in the namespace seldon
. In the following command make sure to replace <password>
with the value of Secret
that you generated in Confluent Cloud.
Confluent Cloud managed Kafka supports OAuth 2.0 to authenticate your Kafka clients. See Confluent Cloud documentation for further details.
Configuring Identity Provider
In the Confluent Cloud Console, navigate to Account & Access / Identity providers and complete these steps:
Register your identity provider. See the Confluent Cloud documentation for further details.
Add a new identity pool to your newly registered identity provider. See the Confluent Cloud documentation for further details.
Obtain these details from Confluent Cloud:
Cluster ID: Cluster Overview → Cluster Settings → General → Identification
Identity Pool ID: Accounts & access → Identity providers → .
Obtain these details from your identity provider, such as Keycloak or Azure AD.
Client ID
Client secret
Token Endpoint URL
If you are using Azure AD you may need to set scope: api://<client id>/.default
.
Creating Kubernetes secret
Create Kubernetes secrets to store the required client credentials. For example, create a kafka-secret.yaml
file by replacing the values of <client id>
, <client secret>
, <token endpoint url>
, <cluster id>
,<identity pool id>
with the values that you obtained from Confluent Cloud and your identity provider.
Provide the secret named confluent-kafka-oauth
in the seldon
namespace to configure with Seldon Core 2.
This secret must be present in seldon-logs
namespace and every namespace containing Seldon Core 2 runtime.
To integrate Kafka with Seldon Core 2.
Update the initial configuration.
Update the initial configuration for Seldon Core 2 in the components-values.yaml
file. Use your preferred text editor to update and save the file with the following content:
Update the initial configuration for Seldon Core 2 Operator in the components-values.yaml
file. Use your preferred text editor to update and save the file with the following content:
Update the initial configuration for Seldon Core 2 Operator in the components-values.yaml
file. Use your preferred text editor to update and save the file with the following content:
To enable Kafka Encryption (TLS) you need to reference the secret that you created in the security.kafka.ssl.client.secret
field of the Helm chart values. The resulting set of values to include in components-values.yaml
is similar to:
Change to the directory that contains the components-values.yaml
file and then install Seldon Core 2 operator in the namespace seldon-system
.
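For example, assuming the operator is installed through the seldon-core-v2-setup chart from the seldon-charts repository:

```bash
# Install or upgrade the operator with the Kafka settings applied.
helm upgrade --install seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
  --namespace seldon-system \
  -f components-values.yaml
```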
After you integrate Seldon Core 2 with Kafka, you need to Install an Ingress Controller that adds an abstraction layer for traffic routing by receiving traffic from outside the Kubernetes platform and load balancing it to Pods running within the Kubernetes cluster.
The SeldonConfig resource defines the core installation components installed by Seldon. To install Seldon, you can use the SeldonRuntime resource, which allows easy overriding of some parts defined in this specification. In general, we advise core DevOps teams to use the default SeldonConfig or customize it for their usage. Individual installations of Seldon can then use the SeldonRuntime with a few overrides for any special customisation needed in that namespace.
The specification contains core PodSpecs for each core component and a section for general configuration including the ConfigMaps that are created for the Agent (rclone defaults), Kafka and Tracing (open telemetry).
Some of these values can be overridden on a per namespace basis via the SeldonRuntime resource. Labels and annotations can also be set at the component level - these will be merged with the labels and annotations from the SeldonConfig resource in which they are defined and added to the component's corresponding Deployment, or StatefulSet.
The default configuration is shown below.
Seldon Core 2 provides a highly configurable deployment framework that allows you to fine-tune various components using Helm configuration options. These options offer control over deployment behavior, resource management, logging, autoscaling, and model lifecycle policies to optimize the performance and scalability of machine learning deployments.
This section details the key Helm configuration parameters for Envoy, Autoscaling, Server Prestop, and Model Control Plane, ensuring that you can customize deployment workflows and enhance operational efficiency.
Envoy: Manage pre-stop behaviors and configure access logging to track request-level interactions.
Autoscaling (Experimental): Fine-tune dynamic scaling policies for efficient resource allocation based on real-time inference workloads.
Servers: Define grace periods for controlled shutdowns and optimize model control plane parameters for efficient model loading, unloading, and error handling.
Logging: Define log levels for the different components of the system.
Notes:
We set the Kafka client library log level from the log level that is passed to the component, which could be different from the level expected by librdkafka
(syslog level). In this case, we attempt to map the log level value to the best match.
envoy.preStopSleepPeriodSeconds
components
Sleep after calling prestop command.
30
envoy.terminationGracePeriodSeconds
components
Grace period to wait for prestop to finish for Envoy pods.
120
envoy.enableAccesslog
components
Whether to enable logging of requests.
true
envoy.accesslogPath
components
Path on disk to store logfile. This is only used when enableAccesslog
is set.
/tmp/envoy-accesslog.txt
envoy.includeSuccessfulRequests
components
Whether to include successful requests. If set to false, then only failed requests are logged. This is only used when enableAccesslog
is set.
false
autoscaling.autoscalingModelEnabled
components
Enable native autoscaling for Models. This is orthogonal to external autoscaling services e.g. HPA.
false
autoscaling.autoscalingServerEnabled
components
Enable native autoscaling for Servers. This is orthogonal to external autoscaling services e.g. HPA.
true
agent.scalingStatsPeriodSeconds
components
Sampling rate for metrics used for autoscaling.
20
agent.modelInferenceLagThreshold
components
Queue lag threshold to trigger scaling up of a model replica.
30
agent.modelInactiveSecondsThreshold
components
Period with no activity after which to trigger scaling down of a model replica.
600
autoscaling.serverPackingEnabled
components
Whether packing of models onto fewer servers is enabled.
false
autoscaling.serverPackingPercentage
components
Percentage of events where packing is allowed. Higher values represent more aggressive packing. This is only used when serverPackingEnabled
is set. Range is from 0.0 to 1.0
0.0
serverConfig.terminationGracePeriodSeconds
components
Grace period to wait for prestop process to finish for this particular Server pod.
120
agent.overcommitPercentage
components
Overcommit percentage (of memory) allowed. Range is from 0 to 100
10
agent.maxLoadElapsedTimeMinutes
components
Max time allowed for one model load command for a model on a particular server replica to take. Lower values allow errors to be exposed faster.
120
agent.maxLoadRetryCount
components
Max number of retries for unsuccessful load command for a model on a particular server replica. Lower values allow control plane commands to fail faster.
5
agent.maxUnloadElapsedTimeMinutes
components
Max time allowed for one model unload command for a model on a particular server replica to take. Lower values allow errors to be exposed faster.
15
agent.maxUnloadRetryCount
components
Max number of retries for unsuccessful unload command for a model on a particular server replica. Lower values allow control plane commands to fail faster.
5
agent.unloadGracePeriodSeconds
components
A period guarding against race conditions between Envoy actually applying the cluster change to remove a route and before proceeding with the model replica unloading command.
2
logging.logLevel
components
Component-wide setting for the logging level, used if individual component levels are not set. Options are: debug, info, error.
info
controller.logLevel
components
check zap log level here
dataflow.logLevel
components
check klogging level here
dataflow.logLevelKafka
components
check klogging level here
scheduler.logLevel
components
check logrus log level here
modelgateway.logLevel
components
check logrus log level here
pipelinegateway.logLevel
components
check logrus log level here
hodometer.logLevel
components
check logrus log level here
serverConfig.rclone.logLevel
components
check rclone log-level
here
serverConfig.agent.logLevel
components
check logrus log level here
The SeldonRuntime resource is used to create an instance of Seldon installed in a particular namespace. For the definition of SeldonConfiguration above, see the SeldonConfig resource.
The specification above contains overrides for the chosen SeldonConfig. To override the PodSpec for a given component, the overrides field needs to specify the component name and the PodSpec needs to specify the container name, along with the fields to override. For instance, the following overrides the resource limits for cpu and memory in the hodometer component in the seldon-mesh namespace, while using values specified elsewhere in the seldonConfig (e.g. default).
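A sketch of such an override; the container name is assumed to be hodometer and the resource values are illustrative:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
  name: seldon
  namespace: seldon-mesh
spec:
  seldonConfig: default
  overrides:
    - name: hodometer
      podSpec:
        containers:
          - name: hodometer
            resources:
              limits:
                cpu: 200m      # illustrative value
                memory: 64Mi   # illustrative value
```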
At a minimum, you should just define the SeldonConfig to use as a base for this install, for example to install in the seldon-mesh namespace with the SeldonConfig named default:
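A minimal sketch of such a SeldonRuntime:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
  name: seldon
  namespace: seldon-mesh
spec:
  seldonConfig: default
```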
The helm chart seldon-core-v2-runtime allows easy creation of this resource and associated default Servers for an installation of Seldon in a particular namespace.
When a SeldonConfig resource changes, any SeldonRuntime resources that reference the changed SeldonConfig will also be updated immediately. If this behaviour is not desired, you can set spec.disableAutoUpdate in the SeldonRuntime resource so that it is not updated immediately, but only when it or any owned resource changes.
Core 2.8 introduces several new fields in our CRDs:
statefulSetPersistentVolumeClaimRetentionPolicy enables users to configure the cleaning of PVCs on their servers. This field is set to retain by default.
Status.selector was introduced as a mandatory field for models in 2.8.4 and made optional in 2.8.5. This field enables autoscaling with HPA.
PodSpec in the OverrideSpec for SeldonRuntimes enables users to customize how Seldon Core 2 pods are created. In particular, this also allows for setting custom taints and tolerations, adding additional containers to our pods, and configuring custom security settings.
These added fields do not result in breaking changes, apart from 2.8.4, which required setting Status.selector upon upgrading. This field was changed to optional in the subsequent 2.8.5 release. Updating the CRDs (e.g. via helm) will enable users to benefit from the associated functionality.
All pods provisioned through the operator, i.e. SeldonRuntime and Server resources, now have the label app.kubernetes.io/name for identifying the pods. Previously, the labelling had been inconsistent across different versions of Seldon Core 2, with a mixture of app and app.kubernetes.io/name used.
If using the Prometheus operator ("Kube Prometheus"), please apply the v2.7.0 manifests for Seldon Core 2 according to the metrics documentation.
Note that these manifests need to be adjusted to discover metrics endpoints based on the existing setup.
If previous pod monitors had namespaceSelector fields set, these should be copied over and applied to the new manifests. If namespaces do not matter, cluster-wide metrics endpoint discovery can be set up by modifying the namespaceSelector field in the pod monitors:
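For example, a pod monitor can be opened up to all namespaces using the standard Prometheus operator selector (a sketch; apply it to each of the Seldon pod monitors):

```yaml
spec:
  namespaceSelector:
    any: true
```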
Release 2.6 brings with it the new custom resources SeldonConfig and SeldonRuntime, which provide a new way to install Seldon Core 2 in Kubernetes. Upgrading in the same namespace will cause downtime while the pods are being recreated. Alternatively, users with an external service mesh or other means of routing traffic across multiple namespaces can bring up the system in a new namespace, redeploy their models there, and then switch traffic between the namespaces.
If the new 2.6 charts are used to upgrade in an existing namespace, models will eventually be redeployed, but there will be service downtime while the core components are redeployed.
For Kubernetes usage we provide a set of custom resources for interacting with Seldon.
SeldonRuntime - for installing Seldon in a particular namespace.
Servers - for deploying sets of replicas of core inference servers (MLServer or Triton).
Models - for deploying single machine learning models, custom transformation logic, drift detectors, outliers detectors and explainers.
Experiments - for testing new versions of models.
Pipelines - for connecting together flows of data between models.
SeldonConfig and ServerConfig define the core installation configuration and machine learning inference server configuration for Seldon. Normally, you would not need to customize these but this may be required for your particular custom installation within your organisation.
ServerConfigs - for defining new types of inference server that can be referenced by a Server resource.
SeldonConfig - for defining how Seldon is installed.
By default Seldon installs two server farms using MLServer and Triton with 1 replica each. Models are scheduled onto servers based on the server's resources and whether the capabilities of the server match the requirements specified in the Model request. For example:
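A sketch of such a Model (the storageUri is illustrative):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://my-bucket/iris-sklearn"  # illustrative artifact location
  requirements:
    - sklearn
```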
This model specifies the requirement sklearn.
There are default capabilities for each server, as follows:
MLServer
Triton
Servers can be defined with a capabilities field to indicate custom configurations (e.g. Python dependencies). For instance:
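A sketch of a Server with custom capabilities; the capability tag mlserver-1.3.4 is a hypothetical example of a custom configuration:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-134
spec:
  serverConfig: mlserver
  capabilities:
    - mlserver-1.3.4   # hypothetical custom capability tag
```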
These capabilities override the ones from the serverConfig: mlserver. A model that takes advantage of this is shown below:
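A sketch of a Model requesting that capability (the storageUri is illustrative):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris-custom
spec:
  storageUri: "gs://my-bucket/iris-sklearn"  # illustrative artifact location
  requirements:
    - mlserver-1.3.4   # matches the hypothetical capability above
```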
The above model will be matched with the previous custom server mlserver-134.
Servers can also be set up with extraCapabilities that add to the existing capabilities from the referenced ServerConfig. For instance:
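A sketch of a Server adding an extra capability on top of the defaults; the added tag is hypothetical:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-extra
spec:
  serverConfig: mlserver
  extraCapabilities:
    - my-custom-capability   # hypothetical extra capability tag
```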
This server, mlserver-extra, inherits a default set of capabilities via serverConfig: mlserver. These defaults are discussed above. The extraCapabilities are appended to these to create a single list of capabilities for this server.
Models can then specify requirements to select a server that satisfies those requirements as follows.
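For instance, a Model selecting the server above via its extra capability (the storageUri is illustrative):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: custom-model
spec:
  storageUri: "gs://my-bucket/custom-model"  # illustrative artifact location
  requirements:
    - my-custom-capability
```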
The capabilities field takes precedence over the extraCapabilities field.
For some examples see here.
Within Docker we don't support this, but for Kubernetes see here.
Learn more about using taints and tolerations with node affinity or node selector to allocate resources in a Kubernetes cluster.
When deploying machine learning models in Kubernetes, you may need to control which infrastructure resources these models use. This is especially important in environments where certain workloads, such as resource-intensive models, should be isolated from others or where specific hardware such as GPUs, needs to be dedicated to particular tasks. Without fine-grained control over workload placement, models might end up running on suboptimal nodes, leading to inefficiencies or resource contention.
For example, you may want to:
Isolate inference workloads from control plane components or other services to prevent resource contention.
Ensure that GPU nodes are reserved exclusively for models that require hardware acceleration.
Keep business-critical models on dedicated nodes to ensure performance and reliability.
Run external dependencies like Kafka on separate nodes to avoid interference with inference workloads.
To solve these problems, Kubernetes provides mechanisms such as taints, tolerations, and nodeAffinity or nodeSelector to control resource allocation and workload scheduling.
Taints are applied to nodes and tolerations to Pods to control which Pods can be scheduled on specific nodes within the Kubernetes cluster. Pods without a matching toleration for a node’s taint are not scheduled on that node. For instance, if a node has GPUs or other specialized hardware, you can prevent Pods that don’t need these resources from running on that node to avoid unnecessary resource usage.
When used together, taints and tolerations with nodeAffinity or nodeSelector can effectively allocate certain Pods to specific nodes, while preventing other Pods from being scheduled on those nodes.
In a Kubernetes cluster running Seldon Core 2, this involves two key configurations:
Configuring servers to run on specific nodes using mechanisms like taints, tolerations, and nodeAffinity or nodeSelector.
Configuring models so that they are scheduled and loaded on the appropriate servers.
This ensures that models are deployed on the optimal infrastructure and servers that meet their requirements.
Multi-model serving is an architecture pattern where one ML inference server hosts multiple models at the same time. This means that, within a single instance of the server, you can serve multiple models under different paths. This is a feature provided out of the box by Nvidia Triton and Seldon MLServer, currently the two inference servers that are integrated in Seldon Core 2.
This deployment pattern allows the system to handle a large number of deployed models by letting them share the hardware resources allocated to inference servers (e.g. GPUs). For example, if a single model inference server is deployed on a node with one GPU, the models loaded on this inference server instance are able to effectively share that GPU. This is in contrast to a single-model-per-server deployment pattern, where only one model can use the allocated GPU.
Multi-model serving is enabled by design in Seldon Core 2. Based on the requirements that are specified by the user on a given Model, the Scheduler will find an appropriate model inference server instance to load the model onto.
In the below example, given that the model is tensorflow, the system will deploy the model onto a triton server instance (matching the Server labels). Additionally, as the model memory requirement is 100Ki, the system will pick the server instance that has enough (memory) capacity to host this model in parallel to other potentially existing models.
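A sketch of such a Model, matching the description above (the name and storageUri are illustrative):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: tfmodel
spec:
  storageUri: "gs://my-bucket/tfmodel"  # illustrative artifact location
  requirements:
    - tensorflow
  memory: 100Ki
```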
All models are loaded and active on this model server. Inference requests for these models are served concurrently and the hardware resources are shared to fulfil these inference requests.
Overcommit allows shared servers to handle more models than can fit in memory. This is done by keeping highly utilized models in memory and evicting other ones to disk using a least-recently-used (LRU) cache mechanism. From a user perspective these models are all registered and "ready" to serve inference requests. If an inference request comes for a model that is unloaded/evicted to disk, the system will reload the model first before forwarding the request to the inference server.
Overcommit is enabled by setting SELDON_OVERCOMMIT_PERCENTAGE on shared servers; it is set by default at 10%. In other words, a given model inference server instance can register models with a total memory requirement of up to MEMORY_REQUEST * (1 + SELDON_OVERCOMMIT_PERCENTAGE / 100).
The Seldon Agent (a sidecar next to each model inference server deployment) keeps track of inference request times for the different models. The models are sorted in ascending order of last-use time, and this data structure is used to evict the least recently used model in order to make room for another incoming model. This happens in two scenarios:
A new model load request beyond the active memory capacity of the inference server.
An incoming inference request to a registered model that is not loaded in-memory (previously evicted).
This is done seamlessly for users. Specifically, when a model is reloaded onto the inference server to respond to an inference request, the model artifact is cached on disk, which allows a faster reload (no remote artifact fetch). Therefore we expect the extra latency to reload a model during an inference request to be acceptable in many cases (with a lower bound of ~100ms).
Overcommit can be disabled by setting SELDON_OVERCOMMIT_PERCENTAGE to 0 for a given shared server.
Check this notebook for a local example.
Pipelines allow one to connect flows of inference data transformed by Model components. A directed acyclic graph (DAG) of steps can be defined to join Models together. Each Model will need to be capable of receiving a V2 inference request and responding with a V2 inference response. An example Pipeline is shown below:
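A sketch of the Pipeline described below; the step and tensor names follow the description, while the exact original manifest may differ:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tfsimples
spec:
  steps:
    - name: tfsimple1
    - name: tfsimple2
    - name: tfsimple3
      inputs:
        - tfsimple1.outputs.OUTPUT0
        - tfsimple2.outputs.OUTPUT1
      tensorMap:
        tfsimple1.outputs.OUTPUT0: INPUT0
        tfsimple2.outputs.OUTPUT1: INPUT1
  output:
    steps:
      - tfsimple3
```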
The steps list shows three models: tfsimple1, tfsimple2 and tfsimple3. These three models each take two tensors called INPUT0 and INPUT1 of integers. The models produce two outputs, OUTPUT0 (the sum of the inputs) and OUTPUT1 (the subtraction of the second input from the first).
tfsimple1 and tfsimple2 take as inputs the input to the Pipeline: the default assumption when no explicit inputs are defined. tfsimple3 takes one V2 tensor input from each of the outputs of tfsimple1 and tfsimple2. As the outputs of tfsimple1 and tfsimple2 have tensors named OUTPUT0 and OUTPUT1, their names need to be changed to match the expected input tensors; this is done with a tensorMap component providing this tensor renaming. This is only required if your models cannot be directly chained together.
The output of the Pipeline is the output from the tfsimple3 model.
The full GoLang specification for a Pipeline is shown below:
Models provide the atomic building blocks of Seldon. They represent machine learning models, drift detectors, outlier detectors, explainers, feature transformations, and more complex routing models such as multi-armed bandits.
A Kubernetes yaml example is shown below for a SKLearn model for iris classification:
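A sketch of an iris classification Model; the storageUri points at an illustrative location of the saved SKLearn artifact:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
  namespace: seldon-mesh
spec:
  storageUri: "gs://my-bucket/iris-sklearn"  # illustrative rclone-compatible URI
  requirements:
    - sklearn
```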
Its Kubernetes spec has two core requirements:
A storageUri specifying the location of the artifact. This can be any rclone URI specification.
A requirements list which provides tags that need to be matched by the Server that can run this artifact type. By default when you install Seldon we provide a set of Servers that cover a range of artifact types.
You can also load models directly over the scheduler gRPC service. An example using the grpcurl tool is shown below:
Multi-model serving is an architecture pattern where one ML inference server hosts multiple models at the same time. It is a feature provided out of the box by Nvidia Triton and Seldon MLServer. Multi-model serving reduces infrastructure hardware requirements (e.g. expensive GPUs) which enables the deployment of a large number of models while making it efficient to operate the system at scale.
Seldon Core 2 leverages multi-model serving by design and it is the default option for deploying models. The system will find an appropriate server to load the model onto based on requirements that the user defines in the Model deployment definition.
Moreover, in many cases demand patterns allow for further overcommit of resources. Seldon Core 2 is able to register more models than can be served by the provisioned (memory) infrastructure and will swap models dynamically, evicting the least recently used, without adding significant latency overhead to the inference workload.
Seldon can handle a wide range of inference artifact types.
Artifacts can be stored on any of the 40 or more cloud storage technologies as well as from a local (mounted) folder, as discussed below.
The proto buffer definitions for the scheduler are outlined .
See for more information.
See for discussion of autoscaling of models.
See for details on how Core 2 schedules Models onto Servers.
We utilize Rclone to copy model artifacts from a storage location to the model servers. This allows users to take advantage of Rclone's support for over 40 cloud storage backends including Amazon S3, Google Storage and many others.
For local storage while developing see here.
For authorization needed for cloud storage when running on Kubernetes see here.
To run your model inside Seldon you must supply an inference artifact that can be downloaded and run on either an MLServer or Triton inference server. We list artifacts in alphabetical order below.
Alibi-Detect: MLServer, tag alibi-detect
Alibi-Explain: MLServer, tag alibi-explain
DALI: Triton, tag dali (TBC)
Huggingface: MLServer, tag huggingface
LightGBM: MLServer, tag lightgbm
MLFlow: MLServer, tag mlflow
ONNX: Triton, tag onnx
OpenVino: Triton, tag openvino (TBC)
Custom Python: MLServer, tags python, mlserver
Custom Python: Triton, tags python, triton
PyTorch: Triton, tag pytorch
SKLearn: MLServer, tag sklearn
Spark Mlib: MLServer, tag spark-mlib (TBC)
Tensorflow: Triton, tag tensorflow
TensorRT: Triton, tag tensorrt (TBC)
Triton FIL: Triton, tag fil (TBC)
XGBoost: MLServer, tag xgboost
For many machine learning artifacts you can simply save them to a folder and load them into Seldon Core 2. Details are given below as well as a link to creating a custom model settings file if needed.
Alibi-Detect
Alibi-Explain
DALI: Follow the Triton docs to create a config.pbtxt and model folder with the artifact.
Huggingface: Create an MLServer model-settings.json with the required Huggingface model.
LightGBM: Save the model to a file with extension .bst.
MLFlow: Use the artifacts/model folder created from your training run.
ONNX: Save your model with the name model.onnx.
OpenVino: Follow the Triton docs to create your model artifacts.
Custom MLServer Python: Create a Python file with a class that extends MLModel.
Custom Triton Python: Follow the Triton docs to create your config.pbtxt and associated Python files.
PyTorch: Create a Triton config.pbtxt describing inputs and outputs and place the traced TorchScript model in the folder as model.pt.
SKLearn: Save the model via joblib to a file with extension .joblib, or with pickle to a file with extension .pkl or .pickle.
Spark Mlib: Follow the MLServer docs.
Tensorflow: Save the model in "SavedModel" format as model.savedmodel. If using the graphdef format, you will need to create a Triton config.pbtxt and place your model in a numbered sub folder. HDF5 is not supported.
TensorRT: Follow the Triton docs to create your model artifacts.
Triton FIL: Follow the Triton docs to create your model artifacts.
XGBoost: Save the model to a file with extension .bst or .json.
For MLServer targeted models you can create a model-settings.json file to help MLServer load your model, and place this alongside your artifact. See the MLServer project for details.
For Triton inference server models you can create a configuration config.pbtxt file alongside your artifact.
The tag field represents the tag you need to add to the requirements part of the Model spec for your artifact to be loaded on a compatible server, e.g. for an sklearn model:
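A minimal sketch of the relevant fragment of a Model spec:

```yaml
spec:
  requirements:
    - sklearn
```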
This invocation check filters for tensor A having value 1.
The model also returns a tensor called status which indicates the operation run and whether it was a success. If no rows satisfy the query then just a status tensor output will be returned.
This model can be useful for conditional Pipelines. For example, you could have two invocations of this model, such as the choice-is-one and choice-is-two models referenced below. By including these in a Pipeline we can define conditional routes:
Here the mul10 model will be called if the choice-is-one model succeeds and the add10 model will be called if the choice-is-two model succeeds.
This model allows a query to be run in the input to select rows. An example is shown below:
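A sketch of such a Model, passing the query as a parameter (the name, storageUri and requirements tags are illustrative):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: choice-is-one
spec:
  storageUri: "gs://my-bucket/pandasquery"  # illustrative artifact location
  requirements:
    - mlserver
    - python
  parameters:
    - name: query
      value: "A == 1"
```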
Further details on Pandas query can be found
The full notebook can be found
The Model specification allows parameters to be passed to the loaded model to allow customization. For example:
This capability is only available for MLServer custom model runtimes. The named keys and values will be added to the model-settings.json file for the provided model in the parameters.extra Dict. MLServer models are able to read these values in their load method.
This example illustrates how to use taints and tolerations with nodeAffinity or nodeSelector to assign GPU nodes to specific models.
To serve a model on a dedicated GPU node, you should follow these steps:
You can add the taint when you are creating the node or after the node has been provisioned. You can apply the same taint to multiple nodes, not just a single node. A common approach is to define the taint at the node pool level.
When you apply a NoSchedule taint to a node after it is created, existing Pods that do not have a matching toleration may remain on the node without being evicted. To ensure that such Pods are removed, you can use the NoExecute taint effect instead.
In this example, the node includes several labels that are used later for node affinity settings. You may choose to specify some labels, while others are usually added by the cloud provider or a GPU operator installed in the cluster.
To ensure a specific inference server Pod runs only on the nodes you've configured, you can use nodeSelector or nodeAffinity together with a toleration by modifying one of the following:
Seldon Server custom resource: Apply changes to each individual inference server.
ServerConfig custom resource: Apply settings across multiple inference servers at once.
Configuring Seldon Server custom resource
While nodeSelector requires an exact match of node labels for server Pods to select a node, nodeAffinity offers more fine-grained control. It enables a conditional approach by using logical operators in the node selection process. For more information, see Affinity and anti-affinity.
In this example, a nodeSelector and a toleration are set for the Seldon Server custom resource.
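A sketch of such a Server, assuming your version of the Server CRD exposes a podSpec override; the node label and taint key shown are hypothetical and should match what you applied to your GPU nodes:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-gpu
spec:
  serverConfig: mlserver
  podSpec:
    nodeSelector:
      pool: gpu-nodes          # hypothetical node label
    tolerations:
      - key: gpu               # hypothetical taint key
        operator: Exists
        effect: NoSchedule
```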
In this example, a nodeAffinity and a toleration are set for the Seldon Server custom resource.
You can configure more advanced Pod selection using nodeAffinity, as in this example:
Configuring ServerConfig custom resource
This configuration automatically affects all servers using that ServerConfig, unless you specify server-specific overrides, which take precedence.
When you have a set of inference servers running exclusively on GPU nodes, you can assign a model to one of those servers in two ways:
Custom model requirements (recommended)
Explicit server pinning
Here's the distinction between the two methods of assigning models to servers.
Custom model requirements: If the assigned server cannot load the model due to insufficient resources, another similarly capable server can be selected to load the model.
Explicit pinning: If the specified server lacks sufficient memory or resources, the model load fails without trying another server.
When you specify a requirement matching a server capability in the model custom resource it loads the model on any inference server with a capability matching the requirements.
Ensure that the additional capability that matches the requirement label is added to the Server custom resource.
Instead of adding a capability using extraCapabilities on a Server custom resource, you may also add to the list of capabilities in the associated ServerConfig custom resource. This applies to all servers referencing that configuration.
With these specifications, the model is loaded on replicas of inference servers created by the referenced Server custom resource.
To define a new storage configuration, you need the following details:
Remote name
Remote type
Provider parameters
A remote is what Rclone calls a storage location. The type defines what protocol Rclone should use to talk to this remote. A provider is a particular implementation for that storage type. Some storage types have multiple providers, such as s3 having AWS S3 itself, MinIO, Ceph, and so on.
The remote name is your choice. The prefix you use for models in spec.storageUri must be the same as this remote name.
Kubernetes Secrets are used to store Rclone configurations, or storage secrets, for use by Servers. Each Secret should contain exactly one Rclone configuration.
A Server can use storage secrets in one of two ways:
It can dynamically load a secret specified by a Model in its .spec.secretName.
It can use global configurations made available via preloaded secrets, described below.
The name of a Secret is entirely your choice, as is the name of the data key in that Secret. All that matters is that there is a single data key and that its value is in the format described above.
Rather than Models always having to specify which secret to use, a Server can load storage secrets ahead of time. These can then be reused across many Models.
When using a preloaded secret, the Model definition should leave .spec.secretName empty. The protocol prefix in .spec.storageUri still needs to match the remote name specified by a storage secret.
The secrets to preload are named in a centralised ConfigMap called seldon-agent. This ConfigMap applies to all Servers managed by the same SeldonRuntime. By default this ConfigMap only includes seldon-rclone-gs-public, but it can be extended with your own secrets as shown below:
The easiest way to change this is to update your SeldonRuntime.
If your SeldonRuntime is configured using the seldon-core-v2-runtime Helm chart, the corresponding value is config.agentConfig.rclone.configSecrets. This can be used as shown below:
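A sketch of the corresponding Helm values (minio-secret is a hypothetical secret name):

```yaml
config:
  agentConfig:
    rclone:
      configSecrets:
        - seldon-rclone-gs-public
        - minio-secret   # hypothetical additional secret
```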
Otherwise, if your SeldonRuntime is configured directly, you can add secrets by setting .spec.config.agentConfig.rclone.config_secrets. This can be used as follows:
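A sketch of a SeldonRuntime configured this way (minio-secret is again hypothetical):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
  name: seldon
  namespace: seldon-mesh
spec:
  seldonConfig: default
  config:
    agentConfig:
      rclone:
        config_secrets:
          - seldon-rclone-gs-public
          - minio-secret   # hypothetical additional secret
```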
Assuming you have installed MinIO in the minio-system namespace, a corresponding secret could be:
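A sketch of such a Secret, assuming default MinIO credentials; the single data key holds one Rclone configuration and its remote name (s3) must match the prefix used in spec.storageUri:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: minio-secret
  namespace: seldon-mesh
type: Opaque
stringData:
  s3: |
    type: s3
    name: s3
    parameters:
      provider: minio
      env_auth: false
      access_key_id: minioadmin      # illustrative credentials
      secret_access_key: minioadmin  # illustrative credentials
      endpoint: http://minio.minio-system.svc.cluster.local:9000
```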
You can then reference this in a Model with .spec.secretName:
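A sketch of a Model referencing that secret; the s3:// prefix matches the remote name defined above and the bucket path is illustrative:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "s3://models/iris"   # illustrative bucket path
  secretName: minio-secret
  requirements:
    - sklearn
```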
The contents of gcloud-application-credentials.json can be put into a secret:
You can then reference this in a Model with .spec.secretName:
Inference artifacts referenced by Models can be stored in any of the storage backends supported by Rclone. This includes local filesystems, AWS S3, and Google Cloud Storage (GCS), among others. Configuration is provided out-of-the-box for public GCS buckets, which enables the use of Seldon-provided models, as in the example below:
This configuration is provided by the Kubernetes Secret seldon-rclone-gs-public. It is made available to Servers as a preloaded secret. You can define and use your own storage configurations in exactly the same way.
The remote type is one of the values supported by Rclone. For example, for AWS S3 it is s3 and for Dropbox it is dropbox.
The provider parameters depend entirely on the remote type and the specific provider you are using. Please check the Rclone documentation for the appropriate provider. Note that the Rclone docs for storage types call the parameters "properties" and provide both config and env var formats; you need to use the config format. For example, the GCS parameter --gcs-client-id should be used as client_id.
For reference, this format is described in the Rclone documentation. Note that we do not support the use of opts discussed in that section.
Core 2 runs with long lived server replicas, each able to host multiple models (through multi-model serving, or MMS). The server replicas can be autoscaled natively by Core 2 in response to dynamic changes in the requested number of model replicas, allowing users to seamlessly optimize the infrastructure cost associated with their deployments.
This document outlines the autoscaling policies and mechanisms that are available for autoscaling server replicas. These policies are designed to ensure that the server replicas are increased (scaled up) or decreased (scaled down) in response to changes in the number of replicas requested for each model. In other words, if a given model is scaled up, the system will scale up the server replicas in order to host the new model replicas. Similarly, if a given model is scaled down, the system may scale down the number of replicas of the server hosting the model, depending on other models that are loaded on the same server replica.
To enable autoscaling of server replicas, the following requirements need to be met:
Setting minReplicas and maxReplicas in the Server CR. This will define the minimum and maximum number of server replicas that can be created.
Setting the autoscaling.autoscalingServerEnabled value to true (default) during installation of the Core 2 seldon-core-v2-setup helm chart. If not installing via helm, setting the ENABLE_SERVER_AUTOSCALING environment variable to true in the seldon-scheduler podSpec (via either a SeldonConfig or a SeldonRuntime podSpec override) will have the same effect. This will enable the autoscaling of server replicas.
An example of a Server CR with autoscaling enabled is shown below:
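A sketch of a Server with minReplicas and maxReplicas set (the replica counts are illustrative):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
  namespace: seldon-mesh
spec:
  serverConfig: mlserver
  replicas: 1
  minReplicas: 1
  maxReplicas: 5
```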
When we want to scale up the number of replicas for a model, the associated servers might not have enough capacity (replicas) available. In this case we need to scale up the server replicas to match the number required by our models.
There is currently only one policy for scaling up server replicas:
Model Replica Count:
This policy scales up the server replicas to match the number of model replicas that are required. In other words, if a model is scaled up, the system will scale up the server replicas to host the new model replicas. This policy is simple to implement and ensures that the server replicas are scaled up in response to changes in the number of model replicas that are required.
During the scale up process, the system will create new server replicas to host the new model replicas. The new server replicas will be created with the same configuration as the existing server replicas. This includes the server configuration, resources, etc. The new server replicas will be added to the existing server replicas and will be used to host the new model replicas.
Once we have scaled down the number of replicas for a model, some of the corresponding server replicas might be left unused (depending on whether those replicas also hosted other models or not). In this case, the extra server pods waste resources and cause additional infrastructure cost (especially if they have expensive resources such as GPUs attached).
Scaling down servers in sync with models is not straightforward in the case of multi-model serving. Scaling down one model does not necessarily mean that we also need to scale down the corresponding server replica, as this server replica might still be serving load for other models.
Therefore we need to define some heuristics that can be used to scale down servers if we think that they are not properly used.
Empty Server Replica:
In the simplest case we can remove a server replica if it does not host any models. This guarantees that there is no load on a particular server replica before removing it.
This policy works best in the case of single model serving where the server replicas are only hosting a single model. In this case, if the model is scaled down, the server replica will be empty and can be removed.
However in the case of MMS, only reducing the number of server replicas when one of the replicas no longer hosts any models can lead to a suboptimal packing of models onto server replicas. This is because the system will not automatically pack models onto the smaller set of replicas. This can lead to more server replicas being used than necessary. This can be mitigated by the lightly loaded server replicas policy.
Lightly Loaded Server Replicas (Experimental):
Warning: This policy is experimental and is not enabled by default. It can be enabled by setting autoscaling.serverPackingEnabled to true and autoscaling.serverPackingPercentage to a value between 0 and 100. This policy is still under development and might in some cases increase latencies, so it is worth testing ahead of time to observe its behaviour for a given setup.
Initial assignment:
There is an argument that this might not be optimised, and in MMS the assignment could instead be:
As the system evolves this imbalance can get larger and could cause the serving infrastructure to be less optimized.
The behavior above is actually not limited to autoscaling, however autoscaling will aggravate the issue causing more imbalance over time.
This imbalance can be mitigated by the following observation: if the maximum number of replicas of any given model (assigned to a server from a logical point of view) is less than the number of replicas for this server, then we can pack the hosted models onto a smaller set of replicas. Note that in Core 2 a server replica can host only 1 replica of a given model.
While this heuristic packs models onto a smaller set of replicas, which allows the servers to scale down, there is still the risk that the packing could increase latencies or trigger a later scale up. Core 2 tries to make sure that it does not flip-flop between these states. The user can also reduce the number of packing events by setting autoscaling.serverPackingPercentage to a lower value.
Currently Core 2 triggers the packing logic only when a model replica is being removed, either from a model scale down or a model being deleted. In the future we might trigger this logic more frequently to ensure that the models are packed onto a smaller set of replicas.
Models can be scaled by setting their replica count, e.g.
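A sketch (the storageUri and replica count are illustrative):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://my-bucket/iris-sklearn"  # illustrative artifact location
  requirements:
    - sklearn
  replicas: 3
```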
Currently, the number of replicas must not exceed the number of replicas of the Server the model is scheduled to.
Servers can be scaled by setting their replica count, e.g.
Currently, models scheduled to a server can only scale up to the server replica count.
Seldon Core 2 runs with several control and dataplane components. The scaling of these resources is discussed below:
Pipeline gateway.
This pipeline gateway handles REST and gRPC synchronous requests to Pipelines. It is stateless and can be scaled based on traffic demand.
Model gateway.
This component pulls model requests from Kafka and sends them to inference servers. It can be scaled up to the partition factor of your Kafka topics. At present we set a uniform partition factor for all topics in one installation of Seldon.
Dataflow engine.
The dataflow engine runs KStream topologies to manage Pipelines. It can run as multiple replicas and the scheduler will balance Pipelines to run across it with a consistent hashing load balancer. Each Pipeline is managed up to the partition factor of Kafka (presently hardwired to one).
Scheduler.
This manages the control plane operations. It is presently required to be one replica as it maintains internal state within a BadgerDB held on local persistent storage (stateful set in Kubernetes). Performance tests have shown this not to be a bottleneck at present.
Kubernetes Controller.
The Kubernetes controller manages resource updates on the cluster, which it passes on to the Scheduler. It is by default one replica but has the ability to scale.
Envoy.
Envoy replicas get their state from the scheduler for routing information and can be scaled as needed.
Allow configuration of partition factor for data plane consistent hashing load balancer.
Allow Model gateway and Pipeline gateway to use consistent hashing load balancer.
Consider control plane scaling options.
Autoscaling in Seldon applies to various concerns:
Inference servers autoscaling
Model autoscaling
Model memory overcommit
Autoscaling of servers can be done via a HorizontalPodAutoscaler (HPA).
HPA can be applied to any deployed Server resource. In this case HPA will manage the number of server replicas in the corresponding statefulset according to utilisation metrics (e.g. CPU or memory).
For example, assuming that a triton server is deployed, the user can attach an HPA based on CPU utilisation as follows:
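A sketch of such an HPA manifest, assuming the Server CRD exposes the scale subresource so it can be used as a scale target; the replica bounds and utilisation target are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton
  namespace: seldon-mesh
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    name: triton
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```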
In this case, according to load, the system will add or remove server replicas to or from the triton statefulset.
It is worth considering the following points:
If HPA adds a new server replica, this new replica will be included in any future scheduling decisions. In other words, when deploying a new model or rescheduling failed models this new replica will be considered.
If HPA deletes an existing server replica, the scheduler will first attempt to drain any loaded models on this server replica before the replica actually gets deleted. This is achieved by leveraging a PreStop hook on the server replica pod that triggers the draining process before the pod receives the termination signal. This draining process is capped by terminationGracePeriodSeconds, which the user can set (the default is 2 minutes).
Therefore there should generally be minimal disruption to the inference workload during scaling.
As each model server can serve multiple models, models can scale across the available replicas of the server according to load.
Autoscaling of models is enabled if at least MinReplicas or MaxReplicas is set in the model custom resource. Then, according to load, the system will scale the number of Replicas within this range.
For example the following model will be deployed at first with 1 replica and it can scale up according to load.
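A sketch of such a Model (the storageUri and maximum replica count are illustrative):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://my-bucket/iris-sklearn"  # illustrative artifact location
  requirements:
    - sklearn
  replicas: 1
  maxReplicas: 3
```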
Note that model autoscaling will not attempt to add extra servers if the desired number of replicas cannot be currently fulfilled by the current provisioned number of servers. This is a process left to be done by server autoscaling.
Additionally, when the system autoscales, the initial model spec is not changed (e.g. the number of Replicas) and therefore the user cannot reset the number of replicas back to the initial specified value without an explicit change.
If only Replicas is specified by the user, autoscaling of models is disabled and the system will have exactly the number of replicas of this model deployed regardless of inference load.
The model autoscaling architecture is designed such that each agent decides which models to scale up or down according to some internally defined metrics, and then sends a triggering message to the scheduler. The current metrics are collected from the data plane (inference path), representing a proxy for how loaded a given model is with fulfilling inference requests.
The main idea is that we keep the "lag" for each model. We define the "lag" as the difference between incoming and outgoing requests in a given time period. If the lag crosses a threshold, then we trigger a model scale-up event. This threshold can be defined via the SELDON_MODEL_INFERENCE_LAG_THRESHOLD inference server environment variable.
For now we keep things simple and trigger model scale-down events if a model has not been used for a number of seconds. This is defined via the SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD inference server environment variable.
Each agent checks the above stats periodically and if any model hits the corresponding threshold, then the agent sends an event to the scheduler to request model scaling.
How often this process executes can be defined via the SELDON_SCALING_STATS_PERIOD_SECONDS inference server environment variable.
The scheduler will perform model autoscale if:
The model is stable (no state change in the last 5 minutes) and available.
The desired number of replicas is within range. Note that we always keep at least 1 replica of any deployed model and rely on overcommit to reduce the resources used further.
For scaling up, there is enough capacity for the new model replica.
Servers can hold more models than available memory if overcommit is switched on (the default). This allows under-utilised models to be moved out of inference server memory so that other models can take their place. Note that these evicted models are still registered, and if future inference requests arrive, the system will reload the models back into memory before serving the requests. If traffic patterns for inference of models vary, this can allow more models than available server memory to be run on the system.
Note: Native autoscaling of servers is required in the case of MMS, as the models are dynamically loaded and unloaded onto these server replicas. In this case Core 2 autoscales server replicas according to changes in the model replicas that are required. This is in contrast to the single-model autoscaling approach explained elsewhere, where server and model replicas are independently scaled using HPA (but relying on the same metric). Server autoscaling can also be used in the case of single-model serving, simplifying the autoscaling process, as users would only need to manage the scaling logic for model replicas, requiring only one HPA manifest.
There is a period of time where the new server replicas are being created and the new model replicas are being loaded onto these server replicas. During this period, the system will ensure that the existing server replicas are still serving load so that there is no downtime during the scale up process. This is achieved by using partial scheduling of the new model replicas onto the new server replicas. This ensures that the new server replicas are gradually loaded with the new model replicas and that the existing server replicas are still serving load. Check the document for more details.
Using the above policy with MMS enabled, different model replicas will be hosted on potentially different server replicas, and as we scale these models up and down the system can end up in a situation where the models are not consolidated onto an optimised number of servers. For illustration, take the case of 3 models (A, B and C) and 1 server with 2 replicas that can host these 3 models. Assuming that initially A and B have 1 replica and C has 2 replicas, the assignment is:
replica 1: A, C
replica 2: B, C
Now if the user unloads one of the models, each server replica may be left hosting only a single model, for example:
replica 1: A
replica 2: B
and with packing the remaining models could be consolidated:
replica 1: A, B
replica 2: removed
In other words, consider the following example: models A and B have 2 replicas each and there are 3 server replicas. The following assignment is potentially not optimised:
replica 1: A, B
replica 2: A
replica 3: B
In this case we could trigger a packing of the third server replica, which packs the models more appropriately:
replica 1: A, B
replica 2: A, B
replica 3: removed
For more details on HPA check this .
Autoscaling both Models and Servers using HPA and custom metrics is possible for the special case of single model serving (i.e. single model per server). Check the detailed documentation . For multi-model serving (MMS), a different solution is needed as discussed below.
This page describes a predict/inference API independent of any specific ML/DL framework and model server. These APIs are able to support both easy-to-use and high-performance use cases. By implementing this protocol both inference clients and servers will increase their utility and portability by being able to operate seamlessly on platforms that have standardized around this API. This protocol is endorsed by NVIDIA Triton Inference Server, TensorFlow Serving, and ONNX Runtime Server. It is sometimes referred to by its old name "V2 Inference Protocol".
For an inference server to be compliant with this protocol the server must implement all APIs described below, except where an optional feature is explicitly noted. A compliant inference server may choose to implement either or both of the HTTP/REST API and the GRPC API.
The protocol supports an extension mechanism as a required part of the API, but this document does not propose any specific extensions. Any specific extensions will be proposed separately.
A compliant server must implement the health, metadata, and inference APIs described in this section.
The HTTP/REST API uses JSON because it is widely supported and language independent. In all JSON schemas shown in this document $number, $string, $boolean, $object and $array refer to the fundamental JSON types. #optional indicates an optional JSON field.
All strings in all contexts are case-sensitive.
For Seldon a server must recognize the following URLs. The versions portion of the URL is shown as optional to allow implementations that don’t support versioning or for cases when the user does not want to specify a specific model version (in which case the server will choose a version based on its own policies).
Health:
Server Metadata:
Model Metadata:
Inference:
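For reference, the standard paths defined by this protocol are listed below; the versions segment is optional, as noted above.

```
Health:
  GET v2/health/live
  GET v2/health/ready
  GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready
Server Metadata:
  GET v2
Model Metadata:
  GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]
Inference:
  POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer
```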
A health request is made with an HTTP GET to a health endpoint. The HTTP response status code indicates a boolean result for the health request. A 200 status code indicates true and a 4xx status code indicates false. The HTTP response body should be empty. There are three health APIs.
The “server live” API indicates if the inference server is able to receive and respond to metadata and inference requests. The “server live” API can be used directly to implement the Kubernetes livenessProbe.
The “server ready” health API indicates if all the models are ready for inferencing. The “server ready” health API can be used directly to implement the Kubernetes readinessProbe.
The “model ready” health API indicates if a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies.
The server metadata endpoint provides information about the server. A server metadata request is made with an HTTP GET to a server metadata endpoint. In the corresponding response the HTTP body contains the Server Metadata Response JSON Object or the Server Metadata Response JSON Error Object.
A successful server metadata request is indicated by a 200 HTTP status code. The server metadata response object, identified as $metadata_server_response, is returned in the HTTP body.
“name” : A descriptive name for the server.
"version" : The server version.
“extensions” : The extensions supported by the server. Currently no standard extensions are defined. Individual inference servers may define and document their own extensions.
A failed server metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_server_error_response object.
“error” : The descriptive message for the error.
The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. In the corresponding response the HTTP body contains the Model Metadata Response JSON Object or the Model Metadata Response JSON Error Object. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.
A successful model metadata request is indicated by a 200 HTTP status code. The metadata response object, identified as $metadata_model_response, is returned in the HTTP body for every successful model metadata request.
“name” : The name of the model.
"versions" : The model versions that may be explicitly requested via the appropriate endpoint. Optional for servers that don’t support versions. Optional for models that don’t allow a version to be explicitly requested.
“platform” : The framework/backend for the model. See Platforms.
“inputs” : The inputs required by the model.
“outputs” : The outputs produced by the model.
Each model input and output tensors’ metadata is described with a $metadata_tensor object.
“name” : The name of the tensor.
"datatype" : The data-type of the tensor elements as defined in Tensor Data Types.
"shape" : The shape of the tensor. Variable-size dimensions are specified as -1.
A failed model metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_model_error_response object.
“error” : The descriptive message for the error.
An inference request is made with an HTTP POST to an inference endpoint. In the request the HTTP body contains the Inference Request JSON Object. In the corresponding response the HTTP body contains the Inference Response JSON Object or Inference Response JSON Error Object. See Inference Request Examples for some example HTTP/REST requests and responses.
The inference request object, identified as $inference_request
, is required in the HTTP body of the POST request. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.
id
: An identifier for this request. Optional, but if specified this identifier must be returned in the response.
parameters
: An object containing zero or more parameters for this inference request expressed as key/value pairs. See Parameters for more information.
inputs
: The input tensors. Each input is described using the $request_input
schema defined in Request Input.
outputs
: The output tensors requested for this inference. Each requested output is described using the $request_output
schema defined in Request Output. Optional, if not specified all outputs produced by the model will be returned using default $request_output
settings.
Request Input
The $request_input
JSON describes an input to the model. If the input is batched, the shape and data must represent the full shape and contents of the entire batch.
"name"
: The name of the input tensor.
"shape"
: The shape of the input tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value.
"datatype"
: The data-type of the input tensor elements as defined in Tensor Data Types.
"parameters"
: An object containing zero or more parameters for this input expressed as key/value pairs. See Parameters for more information.
“data”
: The contents of the tensor. See Tensor Data for more information.
Request Output
The $request_output
JSON is used to request which output tensors should be returned from the model.
"name"
: The name of the output tensor.
"parameters"
: An object containing zero or more parameters for this output expressed as key/value pairs. See Parameters for more information.
A successful inference request is indicated by a 200 HTTP status code. The inference response object, identified as $inference_response, is returned in the HTTP body.
"model_name"
: The name of the model used for inference.
"model_version"
: The specific model version used for inference. Inference servers that do not implement versioning should not provide this field in the response.
"id"
: The "id" identifier given in the request, if any.
"parameters"
: An object containing zero or more parameters for this response expressed as key/value pairs. See Parameters for more information.
"outputs"
: The output tensors. Each output is described using the $response_output
schema defined in Response Output.
Response Output
The $response_output
JSON describes an output from the model. If the output is batched, the shape and data represents the full shape of the entire batch.
"name"
: The name of the output tensor.
"shape"
: The shape of the output tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value.
"datatype"
: The data-type of the output tensor elements as defined in Tensor Data Types.
"parameters"
: An object containing zero or more parameters for this input expressed as key/value pairs. See Parameters for more information.
“data”
: The contents of the tensor. See Tensor Data for more information.
A failed inference request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $inference_error_response
object.
“error”
: The descriptive message for the error.
The following example shows an inference request to a model with two inputs and one output. The HTTP Content-Length header gives the size of the JSON object.
For the above request the inference server must return the “output0” output tensor. Assuming the model returns a [ 3, 2 ] tensor of data type FP32 the following response would be returned.
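A sketch of such a request body, with illustrative model, tensor names, shapes and values consistent with the schemas above:

```json
{
  "id": "42",
  "inputs": [
    { "name": "input0", "shape": [2, 2], "datatype": "UINT32", "data": [1, 2, 3, 4] },
    { "name": "input1", "shape": [2], "datatype": "BOOL", "data": [true, false] }
  ],
  "outputs": [
    { "name": "output0" }
  ]
}
```

And a corresponding response sketch returning a [3, 2] FP32 tensor:

```json
{
  "id": "42",
  "model_name": "mymodel",
  "outputs": [
    { "name": "output0", "shape": [3, 2], "datatype": "FP32", "data": [1.0, 1.1, 2.0, 2.1, 3.0, 3.1] }
  ]
}
```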
The $parameters JSON describes zero or more “name”/”value” pairs, where the “name” is the name of the parameter and the “value” is a $string, $number, or $boolean.
Currently no parameters are defined. As required a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities.
Tensor data must be presented in row-major order of the tensor elements. Element values must be given in "linear" order without any stride or padding between elements. Tensor elements may be presented in their natural multi-dimensional representation, or as a flattened one-dimensional representation.
Tensor data given explicitly is provided in a JSON array. Each element of the array may be an integer, floating-point number, string or boolean value. The server can decide to coerce each element to the required type or return an error if an unexpected value is received. Note that fp16 is problematic to communicate explicitly since there is not a standard fp16 representation across backends nor typically the programmatic support to create the fp16 representation for a JSON number.
For example, a 2-dimensional matrix can be represented either in its natural (nested) format or as a flattened one-dimensional representation, as sketched below.
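A sketch of the two representations of the "data" field, using an illustrative 2x2 matrix with rows [1, 2] and [4, 5]:

```json
{ "data": [ [1, 2], [4, 5] ] }
```

or, flattened:

```json
{ "data": [1, 2, 4, 5] }
```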
The GRPC API closely follows the concepts defined in the HTTP/REST API. A compliant server must implement the health, metadata, and inference APIs described in this section.
All strings in all contexts are case-sensitive.
The GRPC definition of the service is:
A health request is made using the ServerLive, ServerReady, or ModelReady endpoint. For each of these endpoints errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure.
The ServerLive API indicates if the inference server is able to receive and respond to metadata and inference requests. The request and response messages for ServerLive are:
The ServerReady API indicates if the server is ready for inferencing. The request and response messages for ServerReady are:
The ModelReady API indicates if a specific model is ready for inferencing. The request and response messages for ModelReady are:
The ServerMetadata API provides information about the server. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ServerMetadata are:
The per-model metadata API provides information about a model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelMetadata are:
The ModelInfer API performs inference using the specified model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelInfer are:
The Parameters message describes a “name”/”value” pair, where the “name” is the name of the parameter and the “value” is a boolean, integer, or string corresponding to the parameter.
Currently no parameters are defined. As required a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities.
In all representations tensor data must be flattened to a one-dimensional, row-major order of the tensor elements. Element values must be given in "linear" order without any stride or padding between elements.
Using a "raw" representation of tensors with ModelInferRequest::raw_input_contents
and ModelInferResponse::raw_output_contents
will typically allow higher performance due to the way protobuf allocation and reuse interacts with GRPC. For example, see issue here.
An alternative to the "raw"
representation is to use InferTensorContents to represent the tensor data in a format that matches the tensor's data type.
A platform is a string indicating a DL/ML framework or backend. Platform is returned as part of the response to a Model Metadata request but is information only. The proposed inference APIs are generic relative to the DL/ML framework used by a model and so a client does not need to know the platform of a given model to use the API. Platform names use the format “<project>_<format>”. The following platform names are allowed:
tensorrt_plan: A TensorRT model encoded as a serialized engine or “plan”.
tensorflow_graphdef: A TensorFlow model encoded as a GraphDef.
tensorflow_savedmodel: A TensorFlow model encoded as a SavedModel.
onnx_onnxv1: An ONNX model encoded for ONNX Runtime.
pytorch_torchscript: A PyTorch model encoded as TorchScript.
mxnet_mxnet: An MXNet model.
caffe2_netdef: A Caffe2 model encoded as a NetDef.
Tensor data types are shown in the following table along with the size of each type, in bytes.
BOOL: 1
UINT8: 1
UINT16: 2
UINT32: 4
UINT64: 8
INT8: 1
INT16: 2
INT32: 4
INT64: 8
FP16: 2
FP32: 4
FP64: 8
BYTES: Variable (max 2^32)
This document is based on the KServe original created during the lifetime of the KFServing project in Kubeflow by its various contributors including Seldon, NVIDIA, IBM, Bloomberg and others.
Learn how to jointly autoscale model and server replicas based on a metric of inference requests per second (RPS) using HPA, when there is a one-to-one correspondence between models and servers (single-model serving). This will require:
Having a Seldon Core 2 install that publishes metrics to Prometheus (the default). In the following, we will assume that Prometheus is already installed and configured in the seldon-monitoring namespace.
Installing and configuring Prometheus Adapter, which allows prometheus queries on relevant metrics to be published as k8s custom metrics
Configuring HPA manifests to scale Models and the corresponding Server replicas based on the custom metrics
The Core 2 HPA-based autoscaling has the following constraints/limitations:
HPA scaling only targets single-model serving, where there is a 1:1 correspondence between models and servers. Autoscaling for multi-model serving (MMS) is supported for specific models and workloads via the Core 2 native features described here.
Significant improvements to MMS autoscaling are planned for future releases.
Only custom metrics from Prometheus are supported. Native Kubernetes resource metrics such as CPU or memory are not. This limitation exists because of HPA's design: In order to prevent multiple HPA CRs from issuing conflicting scaling instructions, each HPA CR must exclusively control a set of pods which is disjoint from the pods controlled by other HPA CRs. In Seldon Core 2, CPU/memory metrics can be used to scale the number of Server replicas via HPA. However, this also means that the CPU/memory metrics from the same set of pods can no longer be used to scale the number of model replicas.
We are working on improvements in Core 2 to allow both servers and models to be scaled based on a single HPA manifest, targeting the Model CR.
Each Kubernetes cluster supports only one active custom metrics provider. If your cluster already uses a custom metrics provider different from prometheus-adapter, it will need to be removed before being able to scale Core 2 models and servers via HPA.
The Kubernetes community is actively exploring solutions for allowing multiple custom metrics providers to coexist.
The role of the Prometheus Adapter is to expose queries on metrics in Prometheus as k8s custom or external metrics. Those can then be accessed by HPA in order to take scaling decisions.
To install through helm:
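A minimal sketch of the Helm commands is shown below, assuming Prometheus is installed in the seldon-monitoring namespace and reachable at a seldon-monitoring-prometheus service; adjust the release name, namespace and URL to match your setup.

```bash
# Add the community Helm repo that hosts prometheus-adapter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install prometheus-adapter as a release named hpa-metrics,
# pointing at the Prometheus service URL (without the port)
helm install hpa-metrics prometheus-community/prometheus-adapter \
  --namespace seldon-monitoring \
  --set prometheus.url='http://seldon-monitoring-prometheus'
```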
These commands install prometheus-adapter as a helm release named hpa-metrics in the same namespace where Prometheus is installed, and point to its service URL (without the port).
The URL is not fully qualified as it references a Prometheus instance running in the same namespace. If you are using a separately-managed Prometheus instance, please update the URL accordingly.
If you are running Prometheus on a different port than the default 9090, you can also pass --set prometheus.port=[custom_port].
You may inspect all the options available as helm values by running helm show values prometheus-community/prometheus-adapter.
Please check that the metricsRelistInterval helm value (default 1m) works well in your setup, and update it otherwise. This value needs to be larger than or equal to your Prometheus scrape interval. The corresponding prometheus-adapter command-line argument is --metrics-relist-interval. If the relist interval is set incorrectly, it will lead to some of the custom metrics being intermittently reported as missing.
We now need to configure the adapter to look for the correct Prometheus metrics and compute per-model RPS values. On install, the adapter has created a ConfigMap in the same namespace as itself, named [helm_release_name]-prometheus-adapter. In our case, it will be hpa-metrics-prometheus-adapter.
Overwrite the ConfigMap as shown in the following manifest, after applying any required customizations.
Change the name if you've chosen a different value for the prometheus-adapter helm release name. Change the namespace to match the namespace where prometheus-adapter is installed.
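A sketch of such a ConfigMap is shown below. It follows the prometheus-adapter rule format; treat the exact label filters as assumptions and adapt them to the labels present on your metrics.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hpa-metrics-prometheus-adapter
  namespace: seldon-monitoring
data:
  config.yaml: |-
    rules:
    - seriesQuery: 'seldon_model_infer_total{namespace!="",pod!=""}'
      resources:
        overrides:
          model: { group: "mlops.seldon.io", resource: "model" }
          server: { group: "mlops.seldon.io", resource: "server" }
          pod: { resource: "pod" }
          namespace: { resource: "namespace" }
      name:
        matches: "seldon_model_infer_total"
        as: "infer_rps"
      metricsQuery: sum by (<<.GroupBy>>) (rate(<<.Series>>{<<.LabelMatchers>>}[2m]))
```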
In this example, a single rule is defined to fetch the seldon_model_infer_total metric from Prometheus, compute its per-second change rate based on data within a 2 minute sliding window, and expose this to Kubernetes as the infer_rps metric, with aggregations available at model, server, inference server pod and namespace level.
When HPA requests the infer_rps metric via the custom metrics API for a specific model, prometheus-adapter issues a Prometheus query in line with what is defined in its config. For the configuration in our example, the query for a model named irisa0 in namespace seldon-mesh would be:
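Assuming the metricsQuery template from the ConfigMap sketch above, the resulting query would be roughly:

```promql
sum by (model) (rate(seldon_model_infer_total{model=~"irisa0",namespace="seldon-mesh"}[2m]))
```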
You may want to modify the query in the example to match the one that you typically use in your monitoring setup for RPS metrics. The example calls rate()
with a 2 minute sliding window. Values scraped at the beginning and end of the 2 minute window before query time are used to compute the RPS.
It is important to sanity-check the query by executing it against your Prometheus instance. To do so, pick an existing model CR in your Seldon Core 2 install, and send some inference requests towards it. Then, wait for a period equal to at least twice the Prometheus scrape interval (Prometheus default 1 minute), so that two values from the series are captured and a rate can be computed. Finally, you can modify the model name and namespace in the query above to match the model you've picked and execute the query.
If the query result is empty, please adjust it until it consistently returns the expected metric values. Pay special attention to the window size (2 minutes in the example): if it is smaller than twice the Prometheus scrape interval, the query may return no results. A compromise needs to be reached to set the window size large enough to reject noise but also small enough to make the result responsive to quick changes in load.
Update the metricsQuery
in the prometheus-adapter ConfigMap to match any query changes you have made during tests.
A list of all the Prometheus metrics exposed by Seldon Core 2 in relation to Models, Servers and Pipelines is available here, and those may be used when customizing the configuration.
The rule definition can be broken down into four parts:
Discovery (the seriesQuery and seriesFilters keys) controls what Prometheus metrics are considered for exposure via the k8s custom metrics API.
As an alternative to the example above, all the Seldon Prometheus metrics of the form seldon_model.*_total could be considered, followed by excluding metrics pre-aggregated across all models (.*_aggregate_.*) as well as the cumulative infer time per model (.*_seconds_total), as in the sketch below:
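A sketch of what that discovery section could look like; the exact regular expressions are assumptions to adapt to your metric names.

```yaml
seriesQuery: '{__name__=~"^seldon_model.*_total",namespace!="",pod!=""}'
seriesFilters:
  - isNot: ".*_aggregate_.*"
  - isNot: ".*_seconds_total"
```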
For RPS, we are only interested in the model inference count (seldon_model_infer_total
)
Association (the resources key) controls the Kubernetes resources that a particular metric can be attached to or aggregated over.
The resources key defines an association between certain labels from the Prometheus metric and k8s resources. For example, the override "model": {group: "mlops.seldon.io", resource: "model"} lets prometheus-adapter know that, for the selected Prometheus metrics, the value of the "model" label represents the name of a k8s model.mlops.seldon.io CR.
One k8s custom metric is generated for each k8s resource associated with a Prometheus metric. In this way, it becomes possible to request the k8s custom metric values for models.mlops.seldon.io/iris or for servers.mlops.seldon.io/mlserver.
The labels that do not refer to a namespace
resource generate "namespaced" custom metrics (the label values refer to resources which are part of a namespace) -- this distinction becomes important when needing to fetch the metrics via kubectl, and in understanding how certain Prometheus query template placeholders are replaced.
Naming (the name key) configures the naming of the k8s custom metric.
In the example ConfigMap, this is configured to take the Prometheus metric named seldon_model_infer_total and expose custom metric endpoints named infer_rps, which when called return the result of a query over the Prometheus metric.
Instead of a literal match, one could also use regex group capture expressions, which can then be referenced in the custom metric name:
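For example, a hypothetical rule that matches several Seldon counters and derives the custom metric name from a capture group might look like this:

```yaml
name:
  matches: "^seldon_model_(.*)_total$"
  as: "${1}_rps"
```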
Querying (the metricsQuery
key) defines how a request for a specific k8s custom metric gets converted into a Prometheus query.
The query can make use of the following placeholders:
.Series is replaced by the discovered Prometheus metric name (e.g. seldon_model_infer_total).
.LabelMatchers, when requesting a namespaced metric for resource X with name x in namespace n, is replaced by X=~"x",namespace="n". For example, model=~"iris0", namespace="seldon-mesh". When requesting the namespace resource itself, only the namespace="n" part is kept.
.GroupBy is replaced by the resource type of the requested metric (e.g. model, server, pod or namespace).
For a complete reference for how prometheus-adapter can be configured via the ConfigMap, please consult the docs here.
Once you have applied any necessary customizations, replace the default prometheus-adapter config with the new one, and restart the deployment (this restart is required so that prometheus-adapter picks up the new config):
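Assuming the updated ConfigMap has been saved to a local file (the filename here is illustrative), this could look like:

```bash
# Replace the adapter config and restart the deployment so it picks up the new config
kubectl apply -f prometheus-adapter-configmap.yaml -n seldon-monitoring
kubectl rollout restart deployment hpa-metrics-prometheus-adapter -n seldon-monitoring
```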
In order to test that the prometheus adapter config works and everything is set up correctly, you can issue raw kubectl requests against the custom metrics API
Listing the available metrics:
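For example (assuming jq is available for pretty-printing the response):

```bash
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq .
```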
For namespaced metrics, the general fetching template and some examples are shown in the sketch after this list, covering:
Fetching the model RPS metric for a specific (namespace, model) pair (seldon-mesh, irisa0)
Fetching the model RPS metric aggregated at the (namespace, server) level (seldon-mesh, mlserver)
Fetching the model RPS metric aggregated at the (namespace, pod) level (seldon-mesh, mlserver-0)
Fetching the same metric aggregated at namespace level (seldon-mesh)
For every (Model, Server) pair you want to autoscale, you need to apply 2 HPA manifests based on the same metric: one scaling the Model, the other the Server. The example below only works if the mapping between Models and Servers is 1-to-1 (i.e. no multi-model serving).
Consider a model named irisa0 with the following manifest. Please note we don’t set minReplicas/maxReplicas. This disables the Seldon lag-based autoscaling so that it doesn’t interact with HPA (separate minReplicas/maxReplicas configs will be set on the HPA side).
You must also explicitly define a value for spec.replicas. This is the key modified by HPA to increase the number of replicas, and if not present in the manifest it will result in HPA not working until the Model CR is modified to have spec.replicas defined.
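A sketch of such a Model manifest is shown below; the storageUri and requirements are illustrative only.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: irisa0
  namespace: seldon-mesh
spec:
  # storageUri is illustrative; point it at your own model artifact
  storageUri: gs://seldon-models/testing/iris1
  requirements:
    - sklearn
  # spec.replicas must be set explicitly so that HPA can manage it
  replicas: 1
```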
Let’s scale this model when it is deployed on a server named mlserver
, with a target RPS per replica of 3 RPS (higher RPS would trigger scale-up, lower would trigger scale-down):
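A sketch of the two HPA manifests (one targeting the Model CR, one targeting the Server CR) is shown below; the API versions and object names are assumptions to adapt to your install.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: irisa0-model-hpa
  namespace: seldon-mesh
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    name: irisa0
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Object
      object:
        metric:
          name: infer_rps
        describedObject:
          apiVersion: mlops.seldon.io/v1alpha1
          kind: Model
          name: irisa0
        target:
          type: AverageValue
          averageValue: 3
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mlserver-server-hpa
  namespace: seldon-mesh
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    name: mlserver
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Object
      object:
        metric:
          name: infer_rps
        describedObject:
          apiVersion: mlops.seldon.io/v1alpha1
          kind: Model
          name: irisa0
        target:
          type: AverageValue
          averageValue: 3
```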
It is important to keep both the scaling metric and any scaling policies the same across the two HPA manifests. This is to ensure that both the Models and the Servers are scaled up/down at approximately the same time. Small variations in the scale-up time are expected because each HPA samples the metrics independently, at regular intervals.
In order to ensure similar scaling behaviour between Models and Servers, the number of minReplicas and maxReplicas, as well as any other configured scaling policies, should be kept in sync across the HPA for the model and the server.
The Object metric allows for two target value types: AverageValue and Value. Of the two, only AverageValue is supported for the current Seldon Core 2 setup. The Value target type is typically used for metrics describing the utilization of a resource and would not be suitable for RPS-based scaling.
The example HPA manifests use metrics of type "Object" that fetch the data used in scaling decisions by querying k8s metrics associated with a particular k8s object. The endpoints that HPA uses for fetching those metrics are the same ones that were tested in the previous section using kubectl get --raw ...
. Because you have configured the Prometheus Adapter to expose those k8s metrics based on queries to Prometheus, a mapping exists between the information contained in the HPA Object metric definition and the actual query that is executed against Prometheus. This section aims to give more details on how this mapping works.
In our example, the metric.name: infer_rps gets mapped to the seldon_model_infer_total metric on the Prometheus side, based on the configuration in the name section of the Prometheus Adapter ConfigMap. The Prometheus metric name is then used to fill in the <<.Series>> template in the query (metricsQuery in the same ConfigMap).
Then, the information provided in the describedObject
is used within the Prometheus query to select the right aggregations of the metric. For the RPS metric used to scale the Model (and the Server because of the 1-1 mapping), it makes sense to compute the aggregate RPS across all the replicas of a given model, so the describedObject
references a specific Model CR.
However, in the general case, the describedObject does not need to be a Model. Any k8s object listed in the resources section of the Prometheus Adapter ConfigMap may be used. The Prometheus label associated with the object kind fills in the <<.GroupBy>> template, while the name gets used as part of the <<.LabelMatchers>>. For example:
If the described object is { kind: Namespace, name: seldon-mesh }, then the Prometheus query template configured in our example would be transformed into:
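Under the example configuration described above, the transformed query would be roughly:

```promql
sum by (namespace) (rate(seldon_model_infer_total{namespace="seldon-mesh"}[2m]))
```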
If the described object is not a namespace (for example, { kind: Pod, name: mlserver-0 }), then the query will be passed the label describing the object, alongside an additional label identifying the namespace where the HPA manifest resides:
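For instance (again assuming the example configuration):

```promql
sum by (pod) (rate(seldon_model_infer_total{pod="mlserver-0",namespace="seldon-mesh"}[2m]))
```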
The target section establishes the thresholds used in scaling decisions. For RPS, the AverageValue target type refers to the threshold per-replica RPS above which the number of the scaleTargetRef (Model or Server) replicas should be increased. The target number of replicas is computed by HPA according to the following formula:
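In simplified form, for the AverageValue target type this amounts to:

```
targetReplicas = ceil(custom_metric_value / target.averageValue)
```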
As an example, if averageValue=50 and infer_rps=150, the targetReplicas would be 3.
Importantly, computing the target number of replicas does not require knowing the number of active pods currently associated with the Server or Model. This is what allows both the Model and the Server to be targeted by two separate HPA manifests. Otherwise, both HPA CRs would attempt to take ownership of the same set of pods, and transition into a failure state.
This is also why the Value target type is not currently supported. In this case, HPA first computes a utilizationRatio:
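In simplified form (following the standard HPA algorithm), this is:

```
utilizationRatio = custom_metric_value / threshold_value
desiredReplicas  = ceil(currentReplicas * utilizationRatio)
```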
As an example, if threshold_value=100 and custom_metric_value=200, the utilizationRatio would be 2. HPA deduces from this that the number of active pods associated with the scaleTargetRef object should be doubled, and expects that once that target is achieved, the custom_metric_value will become equal to the threshold_value (utilizationRatio=1). However, by using the number of active pods, the HPA CRs for both the Model and the Server also try to take exclusive ownership of the same set of pods, and fail.
Each HPA CR has its own timer on which it samples the specified custom metrics. This timer starts when the CR is created, with sampling of the metric being done at regular intervals (by default, 15 seconds).
As a side effect of this, creating the Model HPA and the Server HPA (for a given model) at different times will mean that the scaling decisions on the two are taken at different times. Even when creating the two CRs together as part of the same manifest, there will usually be a small delay between the point where the Model and Server spec.replicas
values are changed.
Despite this delay, the two will converge to the same number when the decisions are taken based on the same metric (as in the previous examples).
When showing the HPA CR information via kubectl get, a column of the output will display the current metric value per replica and the target average value in the format [per replica metric value]/[target]. This information is updated in accordance with the sampling rate of each HPA resource. It is therefore expected to sometimes see different metric values for the Model and its corresponding Server.
Filtering metrics by additional labels on the prometheus metric:
The prometheus metric from which the model RPS is computed has the following labels managed by Seldon Core 2:
If you want the scaling metric to be computed based on a subset of the Prometheus time series with particular label values (labels either managed by Seldon Core 2 or added automatically within your infrastructure), you can add this as a selector in the HPA metric config. This is shown in the following example, which scales only based on the RPS of REST requests as opposed to REST + gRPC:
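A sketch of the metric section with such a selector; the method_type label name and its rest value are assumptions about how your Prometheus series distinguishes REST from gRPC traffic.

```yaml
metrics:
  - type: Object
    object:
      metric:
        name: infer_rps
        # Label selector applied on top of the custom metric;
        # the label name/value here are assumptions about your metric labels
        selector:
          matchLabels:
            method_type: rest
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: irisa0
      target:
        type: AverageValue
        averageValue: 3
```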
Customize scale-up / scale-down rate & properties by using scaling policies as described in the HPA scaling policies docs
For more resources, please consult the HPA docs and the HPA walkthrough
When deploying HPA-based scaling for Seldon Core 2 models and servers as part of a production deployment, it is important to understand the exact interactions between HPA-triggered actions and Seldon Core 2 scheduling, as well as potential pitfalls in choosing particular HPA configurations.
Using the default scaling policy, HPA is relatively aggressive on scale-up (responding quickly to increases in load), with a maximum replicas increase of either 4 every 15 seconds or 100% of existing replicas within the same period (whichever is highest). In contrast, scaling-down is more gradual, with HPA only scaling down to the maximum number of recommended replicas in the most recent 5 minute rolling window, in order to avoid flapping. Those parameters can be customized via scaling policies.
When using custom metrics such as RPS, the actual number of replicas added during scale-up or removed during scale-down will depend not only on the maximums imposed by the policy, but also on the configured target (averageValue RPS per replica) and on how quickly the inferencing load varies in your cluster. All three need to be considered jointly in order to deliver both efficient use of resources and adherence to SLAs.
Naturally, the first thing to consider is an estimated peak inference load (including some margins) for each of the models in the cluster. If the minimum number of model replicas needed to serve that load without breaching latency SLAs is known, it should be set as spec.maxReplicas, with the HPA target.averageValue set to peak_infer_RPS/maxReplicas.
If maxReplicas is not already known, an open-loop load test with a slowly ramping up request rate should be done on the target model (one replica, no scaling). This would allow you to determine the RPS (inference request throughput) at which latency SLAs are breached or (depending on the desired operation point) at which latency starts increasing. You would then set the HPA target.averageValue taking some margin below this saturation RPS, and compute spec.maxReplicas as peak_infer_RPS/target.averageValue. The margin taken below the saturation point is very important, because scaling up cannot be instant (it requires spinning up new pods, downloading model artifacts, etc.). In the period until the new replicas become available, any load increases will still need to be absorbed by the existing replicas.
If there are multiple models which typically experience peak load in a correlated manner, you need to ensure that sufficient cluster resources are available for k8s to concurrently schedule the maximum number of server pods, with each pod holding one model replica. This can be ensured by using either Cluster Autoscaler or, when running workloads in the cloud, any provider-specific cluster autoscaling services.
It is important for the cluster to have sufficient resources for creating the total number of desired server replicas set by the HPA CRs across all the models at a given time.
Not having sufficient cluster resources to serve the number of replicas configured by HPA at a given moment, in particular under aggressive scale-up HPA policies, may result in breaches of SLAs. This is discussed in more detail in the following section.
A similar approach should be taken for setting minReplicas, in relation to the estimated RPS in the low-load regime. However, it's useful to balance lower resource usage against the immediate availability of replicas for inference rate increases from that lowest load point. If low-load regimes only occur for small periods of time, and especially when combined with a high rate of increase in RPS when moving out of the low-load regime, it might be worth setting the minReplicas floor higher in order to ensure SLAs are met at all times.
Each spec.replicas value change for a Model or Server triggers a rescheduling event for the Seldon Core 2 scheduler, which considers any updates that are needed in mapping Model replicas to Server replicas, such as rescheduling failed Model replicas, loading new ones, unloading in the case of the number of replicas going down, etc.
Two characteristics in the current implementation are important in terms of autoscaling and configuring the HPA scale-up policy:
The scheduler does not create new Server replicas when the existing replicas are not sufficient for loading a Model's replicas (one Model replica per Server replica). Whenever a Model requests more replicas than are available on any of the available Servers, its ModelReady condition transitions to Status: False with a ScheduleFailed message. However, any replicas of that Model that are already loaded at that point remain available for servicing inference load.
There is no partial scheduling of replicas. For example, consider a model with 2 replicas, currently loaded on a server with 3 replicas (two of those server replicas will have the model loaded). If you update the model replicas to 4, the scheduler will transition the model to ScheduleFailed
, seeing that it cannot satisfy the requested number of replicas. The existing 2 model replicas will continue to serve traffic, but a third replica will not be loaded onto the remaining server replica.
In other words, the scheduler either schedules all the requested replicas, or, if unable to do so, leaves the state of the cluster unchanged.
Introducing partial scheduling would make the overall results of assigning models to servers significantly less predictable and ephemeral. This is because models may end up moved back and forth between servers depending on the speed with which various server replicas become available. Network partitions or other transient errors may also trigger large changes to the model-to-server assignments, making it challenging to sustain consistent data plane load during those periods.
Taken together, the two Core 2 scheduling characteristics, combined with a very aggressive HPA scale-up policy and a continuously increasing RPS may lead to the following pathological case:
Based on RPS, HPA decides to increase both the Model and Server replicas from 2 (an example start stable state) to 8. While the 6 new Server pods get scheduled and get the Model loaded onto them, the scheduler will transition the Model into the ScheduleFailed
state, because it cannot fulfill the requested replicas requirement. During this period, the initial 2 Model replicas continue to serve load, but are using their RPS margins and getting closer to the saturation point.
At the same time, load continues to increase, so HPA further increases the number of required Model and Server replicas from 8 to 12, before all of the 6 new Server pods had a chance to become available. The new replica target for the scheduler also becomes 12, and this would not be satisfied until all the 12 Server replicas are available. The 2 Model replicas that are available may by now be saturated and the infer latency spikes up, breaching set SLAs.
The process may continue until load stabilizes.
If at any point the number of requested replicas (<= maxReplicas) exceeds the resource capacity of the cluster, the requested server replica count will never be reached and thus the Model will remain permanently in the ScheduleFailed state.
While most likely encountered during continuous ramp-up RPS load tests with autoscaling enabled, the pathological case example is a good showcase for the elements that need to be taken into account when setting the HPA policies.
The speed with which new Server replicas can become available versus how many new replicas HPA may request in a given time:
The HPA scale-up policy should not be configured to request more replicas than can become available in the specified time. For this reason, the default scale-up config, which also adds a percentage-based policy (double the existing replicas within the set periodSeconds), is not recommended. The following example reflects a confidence that 5 Server pods will become available within 90 seconds, with some safety margin.
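A sketch of such a scale-up policy (matching the 5 pods / 90 seconds figure above, plus a one-minute stabilization window discussed below) might look like:

```yaml
behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    policies:
      - type: Pods
        value: 5
        periodSeconds: 90
```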
Perhaps more importantly, there is no reason to scale faster than the time it takes for replicas to become available - this is the true maximum rate with which scaling up can happen anyway.
The duration of transient load spikes which you might want to absorb within the existing per-replica RPS margins.
The previous example configures a scale-up stabilization window of one minute (the stabilizationWindowSeconds setting). It means that of all the replica counts recommended by HPA within the last 60-second window (4 samples of the custom metric at the default sampling rate), only the smallest will be applied.
Such stabilization windows should be set depending on typical load patterns in your cluster: not being too aggressive in reacting to increased load will allow you to achieve cost savings, but has the disadvantage of a delayed reaction if the load spike turns out to be sustained.
The duration of any typical/expected sustained ramp-up period, and the RPS increase rate during this period.
It is useful to consider whether the replica scale-up rate configured via the policy (the Pods policy value in the example) is able to keep up with this RPS increase rate.
Such a scenario may appear, for example, if you are planning for a smooth traffic ramp-up in a blue-green deployment as you are draining the "blue" deployment and transitioning to the "green" one
Drift detection models are treated as any other Model. You can run any saved Alibi-Detect drift detection model by adding the requirement alibi-detect
.
An example drift detection model from the CIFAR10 image classification example is shown below:
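A sketch of what such a Model could look like; the storageUri is illustrative.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: cifar10-drift
spec:
  # Illustrative URI pointing at a saved Alibi-Detect drift detector
  storageUri: gs://seldon-models/scv2/examples/cifar10/drift-detector
  requirements:
    - mlserver
    - alibi-detect
```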
Usually you would run these models in an asynchronous part of a Pipeline, i.e. they are not connected to the output of the Pipeline which defines the synchronous path. For example, the CIFAR-10 image detection example uses a pipeline as shown below:
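A sketch of such a pipeline, in which the drift detector is a step but is not part of the synchronous output path:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: cifar10-production
spec:
  steps:
    - name: cifar10
    - name: cifar10-drift
  output:
    steps:
      - cifar10
```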
Note how the cifar10-drift
model is not part of the path to the outputs. Drift alerts can be read from the Kafka topic of the model.
This section describes how a user can run performance tests to understand the limits of a particular SCv2 deployment.
The base directory is tests/k6.
k6 is used to drive requests for load, unload, and infer workloads. It is recommended that the load test is run within the same cluster that has SCv2 installed, as it requires internal access to some of the services that are not automatically exposed to the outside world. Furthermore, having the driver within the same cluster minimises link latency to the SCv2 entrypoint; therefore infer latencies are more representative of the actual overheads of the system.
Envoy: tests synchronous inference requests via Envoy. To run: make deploy-envoy-test
Agent: tests inference requests sent directly to a specific agent; defaults to triton-0 or mlserver-0. To run: make deploy-rproxy-test or make deploy-rproxy-mlserver-test
Server: tests inference requests sent directly to a specific server (bypassing the agent); defaults to triton-0 or mlserver-0. To run: make deploy-server-test or make deploy-server-mlserver-test
Pipeline gateway (HTTP-Kafka gateway): tests inference requests to a one-node pipeline over HTTP and gRPC. To run: make deploy-kpipeline-test
Model gateway (Kafka-HTTP gateway): tests inference requests to a model via Kafka. To run: make deploy-kmodel-test
One way to look at results is to look at the log of the pod that executed the kubernetes job.
Results can also be persisted to a GCS bucket; a service account key k6-sa-key in the same namespace is required.
Users can also look at the metrics that are exposed in prometheus while the test is underway
If a user is modifying the actual scenario of the test:
export DOCKERHUB_USERNAME=mydockerhubaccount
Build the k6 image via make build-push.
In the same shell environment, deploying jobs will use this custom-built Docker image.
Users can modify settings of the tests in tests/k6/configs/k8s/base/k6.yaml
. This will apply to all subsequent tests that are deployed using the above process.
Some settings that can be changed:
k6 args: for a full list, check the k6 args documentation.
Environment variables: for MODEL_TYPE, choose from:
Seldon Core 2 is designed around the data flow paradigm. Here we explain what that means and some of the rationale behind this choice.
The initial release of Seldon Core introduced the concept of an inference graph, which can be thought of as a sequence of operations that happen to the inference request. Here is how it might look:
In reality, though, this was not how Seldon Core v1 was implemented. Instead, a Seldon deployment consists of a range of independent services that host models, transformations, detectors and explainers, and a central orchestrator that knows the inference graph topology and makes service calls in the correct order, passing data between requests and responses as necessary. Here is how the picture looks under the hood:
While this is a convenient way of implementing an evaluation graph with microservices, it has a few problems. The orchestrator becomes a bottleneck and a single point of failure. It also hides all the data transformations that need to happen to translate one service's response into another service's request. Data tracing and lineage become difficult. All in all, while the Seldon platform is all about processing data, the under-the-hood implementation was still focused on the order of operations and not on the data itself.
The realisation of this disparity led to a new approach towards inference graph evaluation in v2, based on the data flow paradigm. Data flow is a well known concept in software engineering, dating back to the 1960s. In contrast to the service-oriented approach, which models programs as control flow and focuses on the order of operations, data flow proposes to model software systems as a series of connections that modify incoming data, focusing on the data flowing through the system. The particular flavour of the data flow paradigm used by v2 is known as flow-based programming (FBP). FBP defines software applications as a set of processes which exchange data via connections that are external to those processes. Connections are made via named ports, which promotes data coupling between components of the system.
Data flow design makes data in software the top priority. That is one of the key messages of the so called "data-centric AI" idea, which is becoming increasingly popular within the ML community. Data is a key component of a successful ML project. Data needs to be discovered, described, cleaned, understood, monitored and verified. Consequently, there is a growing demand for data-centric platforms and solutions. Making Seldon Core data-centric was one of the key goals of the Seldon Core 2 design.
In the context of Seldon Core, applying the FBP design approach means that the evaluation is implemented in the same way the inference graph is defined. So instead of routing everything through a centralized orchestrator, the evaluation happens in the same graph-like manner:
As far as implementation goes, Seldon Core 2 runs on Kafka. An inference request is put onto a pipeline input topic, which triggers an evaluation. Each part of the inference graph is a service running in its own container, fronted by a model gateway. The model gateway listens to a corresponding input Kafka topic, reads data from it, calls the service and puts the received response onto an output Kafka topic. There is also a pipeline gateway that allows you to interact with Seldon Core in a synchronous manner.
This approach gives SCv2 several important features. Firstly, Seldon Core natively supports both synchronous and asynchronous modes of operation. Asynchronicity is achieved via streaming: input data can be sent to an input topic in Kafka, and after the evaluation the output topic will contain the inference result. For those looking to use it in the v1 style, a service API is provided.
Secondly, there is no single point of failure. Even if one or more nodes in the graph go down, the data will still be sitting on the streams waiting to be processed, and the evaluation resumes whenever the failed node comes back up.
Thirdly, data flow means intermediate data can be accessed at any arbitrary step of the graph, inspected and collected as necessary. Data lineage is possible throughout, which opens up opportunities for advanced monitoring and explainability use cases. This is a key feature for effective error surfacing in production environments as it allows:
Adding context from different parts of the graph to better understand a particular output
Reducing false positive rates of alerts as different slices of the data can be investigated
Enabling reproducibility of results, as fine-grained lineage of computation and the associated data transformations are tracked by design
Finally, the inference graph can now be extended by adding new nodes at arbitrary places, all without affecting pipeline execution. This kind of flexibility was not possible with v1. It also allows multiple pipelines to share common nodes and therefore optimise resource usage.
More details and information on data-centric AI and data flow paradigm can be found in these resources:
Stanford MLSys seminar "What can Data-Centric AI Learn from Data and ML Engineering?"
A paper that explores data flow in ML deployment context
Introduction to flow based programming from its creator J.P. Morrison:
Pathways: Asynchronous Distributed Dataflow for ML, research work from Google on the design and implementation of a data flow based orchestration layer for accelerators
Better understanding of data requires tracking its history and context
Pipelines allow models to be connected into flows of data transformations. This allows more complex machine learning pipelines to be created with multiple models, feature transformations and monitoring components such as drift and outlier detectors.
The simplest way to create Pipelines is by defining them with the Pipeline resource we provide for Kubernetes. This format is accepted by our Kubernetes implementation but also locally via our seldon CLI.
Internally in both cases Pipelines are created via our Scheduler API. Advanced users could submit Pipelines directly using this gRPC service.
An example that chains two models together is shown below:
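A minimal sketch, using two hypothetical models model-a and model-b:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: chained-example
spec:
  steps:
    - name: model-a
    - name: model-b
      inputs:
        - model-a
  output:
    steps:
      - model-b
```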
steps allow you to specify the models you want to combine into a pipeline. Each step name will correspond to a model of the same name. These models will need to have been deployed and be available for the Pipeline to function; however, Pipelines can be deployed before or at the same time as you deploy the underlying models.
steps.inputs
allow you to specify the inputs to this step.
outputs.steps allow you to specify the output of the Pipeline. A pipeline can have multiple paths, including flows of data that do not reach the output, e.g. drift detection steps. However, if you wish to call your Pipeline in a synchronous manner via REST/gRPC then an output must be present so the Pipeline can be treated as a function.
Model step inputs are defined with a dot notation of the form:
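The general form is (the tensor name being optional):

```
<stepName>.(inputs|outputs).<tensorName>
```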
Inputs with just a step name will be assumed to be step.outputs
.
The default payload for Pipelines is the V2 protocol, which requires named tensors as inputs and outputs from a model. If you require just certain tensors from a model you can reference those in the inputs, e.g. mymodel.outputs.t1 will reference the tensor t1 from the model mymodel.
See the specification of the V2 protocol for details.
The simplest Pipeline chains models together: the output of one model goes into the input of the next. This will work out of the box if the output tensor names from a model match the input tensor names for the one being chained to. If they do not, then the tensorMap construct presently needs to be used to define the mapping explicitly, e.g. see below for a simple chained pipeline of two tfsimple example models:
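A sketch of such a pipeline; the tensorMap keys and values assume the tfsimple tensor names described below.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tfsimple-chained
spec:
  steps:
    - name: tfsimple1
    - name: tfsimple2
      inputs:
        - tfsimple1
      tensorMap:
        tfsimple1.outputs.OUTPUT0: INPUT0
        tfsimple1.outputs.OUTPUT1: INPUT1
  output:
    steps:
      - tfsimple2
```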
In the above we rename tensor OUTPUT0 to INPUT0 and OUTPUT1 to INPUT1. This allows these models to be chained together. The shape and data-type of the tensors need to match as well.
This example can be found in the pipeline examples.
Joining allows us to combine outputs from multiple steps as input to a new step.
Caption: "Joining the outputs of two models into a third model. The dashed lines signify model outputs that are not captured in the output of the pipeline."
Here we pass the pipeline inputs to two models and then take one output tensor from each and pass them to the final model. We use the same tensorMap technique to rename tensors as discussed in the previous section.
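A sketch of such a join:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: join-example
spec:
  steps:
    - name: tfsimple1
    - name: tfsimple2
    - name: tfsimple3
      inputs:
        - tfsimple1.outputs.OUTPUT0
        - tfsimple2.outputs.OUTPUT1
      tensorMap:
        tfsimple1.outputs.OUTPUT0: INPUT0
        tfsimple2.outputs.OUTPUT1: INPUT1
  output:
    steps:
      - tfsimple3
```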
Joins can have a join type, which can be specified with inputsJoinType and can take the values:
inner: require all inputs to be available to join.
outer: wait for joinWindowMs to join any inputs, ignoring any inputs that have not sent any data at that point. This will mean this step of the pipeline is guaranteed to have a latency of at least joinWindowMs.
any: wait for any of the specified data sources.
This example can be found in the pipeline examples.
Pipelines can create conditional flows via various methods. We will discuss each in turn.
The simplest way is to create a model that outputs different named tensors based on its decision. This way downstream steps can be dependent on different expected tensors. An example is shown below:
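A sketch of such a pipeline, matching the conditional / mul10 / add10 steps described below; the output join field follows the any join discussed in this section.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: conditional-example
spec:
  steps:
    - name: conditional
    - name: mul10
      inputs:
        - conditional.outputs.OUTPUT0
    - name: add10
      inputs:
        - conditional.outputs.OUTPUT1
  output:
    steps:
      - mul10
      - add10
    stepsJoin: any
```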
Caption: "Pipeline with a conditional output model. The model conditional only outputs one of the two tensors, so only one path through the graph (red or blue) is taken by a single request"
In the above we have a step conditional that either outputs a tensor named OUTPUT0 or a tensor named OUTPUT1.
The mul10 step depends on an output in OUTPUT0 while the add10 step depends on an output from OUTPUT1.
Note, we also have a final Pipeline output step that does an any join on these two models, essentially outputting from the pipeline whichever data arrives from either model. This type of Pipeline can be used for Multi-Armed Bandit solutions where you want to route traffic dynamically.
This example can be found in the pipeline examples.
It's also possible to abort pipelines when an error is produced, to in effect create a condition. This is illustrated below:
This Pipeline runs normally or throws an error based on whether the input tensors have certain values.
Sometimes you want to run a step if an output is received from a previous step but not to send the data from that step to the model. This is illustrated below:
Caption: "A pipeline with a single trigger. The model tfsimple3 only runs if the model check returns a tensor named OUTPUT
. The green edge signifies that this is a trigger and not an additional input to tfsimple3. The dashed lines signify model outputs that are not captured in the output of the pipeline."
In this example the last step tfsimple3 runs only if there are outputs from tfsimple1 and tfsimple2 but also data from the check step. However, if the step tfsimple3 is run it only receives the join of data from tfsimple1 and tfsimple2.
This example can be found in the pipeline examples.
You can also define multiple triggers which need to happen based on a particular join type. For example:
Caption: "A pipeline with multiple triggers and a trigger join of type any
. The pipeline has four inputs, but three of these are optional (signified by the dashed borders)."
Here the mul10 step is run if data is seen on the pipeline inputs in the ok1 or ok2 tensors, based on the any join type. If data is seen on ok3 then the add10 step is run.
If we changed the triggersJoinType for mul10 to inner then both ok1 and ok2 would need to appear before mul10 is run.
Pipelines by default can be accessed synchronously via http/grpc or asynchronously via the Kafka topic created for them. However, it's also possible to create a pipeline that takes input from one or more other pipelines by specifying an input section. If, for example, we already have the tfsimple pipeline shown below:
We can create another pipeline which takes its input from this pipeline, as shown below:
Caption: "A pipeline taking as input the output of another pipeline."
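A sketch of such an extending pipeline, which consumes the outputs of the tfsimple pipeline via externalInputs; the tensor renaming is an assumption matching the tfsimple tensor names used elsewhere in this page.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tfsimple-extended
spec:
  input:
    externalInputs:
      - tfsimple.outputs
    tensorMap:
      tfsimple.outputs.OUTPUT0: INPUT0
      tfsimple.outputs.OUTPUT1: INPUT1
  steps:
    - name: tfsimple2
  output:
    steps:
      - tfsimple2
```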
In this way pipelines can be built to extend existing running pipelines to allow extensibility and sharing of data flows.
The spec follows the same spec for a step except that references to other pipelines are contained in
the externalInputs
section which takes the form of pipeline or pipeline.step references:
<pipelineName>.(inputs|outputs).<tensorName>
<pipelineName>.(step).<stepName>.<tensorName>
Tensor names are optional and only needed if you want to take just one tensor from an input or output.
There is also an externalTriggers
section which allows triggers from other pipelines.
Further examples can be found in the pipeline-to-pipeline examples.
Present caveats:
Circular dependencies are not presently detected.
Pipeline status is local to each pipeline.
Internally, Pipelines are implemented using Kafka. Each input and output of a pipeline step has an associated Kafka topic. This has many advantages and makes auditing, replay and debugging easier, as data is preserved from every step in your pipeline.
Tracing allows you to monitor the processing latency of your pipelines.
As each request to a pipeline moves through the steps, its data will appear in input and output topics. This allows a full audit of every transformation to be carried out.
The list of Seldon Core 2 metrics that we are compiling is as follows.
For the agent that sits next to the inference servers:
For the pipeline gateway that handles requests to pipelines:
Many of these metrics are model and pipeline level counters and gauges. Some of these metrics are aggregated to speed up the display of graphs. Currently, per-model histogram metrics are not stored for performance reasons. However, per-pipeline histogram metrics are stored.
This is experimental, and these metrics are expected to evolve to better capture relevant trends as more information becomes available about system usage.
Install kube-prometheus-stack in the same Kubernetes cluster that hosts Seldon Core 2.
kube-prometheus
, also known as Prometheus Operator, is a popular open-source project that provides complete monitoring and alerting solutions for Kubernetes clusters. It combines tools and components to create a monitoring stack for Kubernetes environments.
Seldon Core 2, along with any deployed models, automatically exposes metrics to Prometheus. By default, certain alerting rules are pre-configured, and an Alertmanager instance is included.
You can install kube-prometheus to monitor Seldon components, and ensure that the appropriate ServiceMonitors are in place for Seldon deployments. The analytics component is configured with the Prometheus integration. The monitoring for Seldon Core 2 is based on the Prometheus Operator and the related PodMonitor and PrometheusRule resources.
Monitoring the model deployments in Seldon Core 2 involves:
Create a namespace for the monitoring components of Seldon Core 2.
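For example (assuming the seldon-monitoring namespace used throughout this page):

```bash
kubectl create namespace seldon-monitoring
```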
Create a YAML file to specify the initial configuration. For example, create the prometheus-values.yaml
file. Use your preferred text editor to create and save the file with the following content:
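A minimal sketch of such a values file; the only setting shown is the one highlighted in the note below, and the exact key layout depends on the chart version you use.

```yaml
kube-state-metrics:
  extraArgs:
    metric-labels-allowlist: pods=[*]
```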
Note: Make sure to include metric-labels-allowlist: pods=[*] in the Helm values file. If you are using your own Prometheus Operator installation, ensure that the pod labels, particularly app.kubernetes.io/managed-by=seldon-core, are part of the collected metrics. These labels are essential for calculating deployment usage rules.
Change to the directory that contains the prometheus-values.yaml file and run the following command to install version 9.5.12 of kube-prometheus.
When the installation is complete, you should see this:
Check the status of the installation.
When the installation is complete, you should see this:
You can access Prometheus from outside the cluster by running the following commands:
You can access Alertmanager from outside the cluster by running the following commands:
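Sketches of the port-forward commands for the two previous steps; the service names are assumptions that depend on your Helm release name and chart.

```bash
# Prometheus UI on http://127.0.0.1:9090/
kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-prometheus 9090:9090

# Alertmanager UI on http://127.0.0.1:9093/
kubectl port-forward -n seldon-monitoring svc/seldon-monitoring-alertmanager 9093:9093
```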
Apply the Custom RBAC Configuration settings for kube-prometheus.
Configure metrics collection by creating the following PodMonitor resources.
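A sketch of such a PodMonitor; the port name and label selector are assumptions and should match the labels and metrics ports exposed by your Seldon components.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: seldon-podmonitor
  namespace: seldon-monitoring
spec:
  selector:
    matchLabels:
      app.kubernetes.io/managed-by: seldon-core
  namespaceSelector:
    any: true
  podMetricsEndpoints:
    - port: metrics
      path: /metrics
```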
When the resources are created, you should see this:
You can now check the status of the Seldon components in Prometheus:
Open your browser and navigate to http://127.0.0.1:9090/
to access Prometheus UI from outside the cluster.
Go to Status and select Targets.
The status of all the endpoints and the scrape details are displayed.
While the system runs, Prometheus collects metrics that enable you to observe various aspects of Seldon Core 2, including throughput, latency, memory, and CPU usage. In addition to the standard Kubernetes metrics scraped by Prometheus, a Grafana dashboard provides a comprehensive system overview.
An example is also provided to show the raw metrics that Prometheus will scrape.
Outlier detection models are treated as any other Model. You can run any saved outlier detection model by adding the requirement alibi-detect
.
Install .
Install .
Install in the namespace seldon-monitoring
.
You can view the metrics in the Grafana dashboard after you set Prometheus as the Data Source and import the seldon.json dashboard located at seldon-core/v2.8.2/prometheus/dashboards.
Explainers are Model resources with some extra settings. They allow a range of explainers from the Alibi-Explain library to be run on MLServer.
An example Anchors explainer definition is shown below.
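A sketch of such a definition; the storageUri and referenced model name are illustrative.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: income-explainer
spec:
  # Illustrative URI pointing at a saved Alibi-Explain anchors explainer
  storageUri: gs://seldon-models/examples/income/explainer
  explainer:
    type: anchor_tabular
    modelRef: income
```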
The key additions are:
type: This must be one of the Alibi Explainer types supported by the Alibi Explain runtime in MLServer.
modelRef: The model name for black box explainers.
pipelineRef: The pipeline name for black box explainers.
Only one of modelRef and pipelineRef is allowed.
Blackbox explainers can explain a Pipeline as well as a model. An example from the Huggingface sentiment demo is shown below.
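A sketch of a pipeline-referencing explainer; the names and storageUri are illustrative.

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: sentiment-explainer
spec:
  # Illustrative URI pointing at a saved text explainer artifact
  storageUri: gs://seldon-models/examples/sentiment/explainer
  explainer:
    type: anchor_text
    pipelineRef: sentiment
```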
Seldon Core 2 provides robust tools for tracking the performance and health of machine learning models in production.
Real-Time metrics: collects and displays real-time metrics from deployed models, such as response times, error rates, and resource usage.
Model performance tracking: monitors key performance indicators (KPIs) like accuracy, drift detection, and model degradation over time.
Custom metrics: allows you to define and track custom metrics specific to your models and use cases.
Visualization: Provides dashboards and visualizations to easily observe the status and performance of models.
There are two kinds of metrics present in Seldon Core 2 that you can monitor:
Operational metrics describe the performance of components in the system. Some examples of common operational considerations are memory consumption and CPU usage, request latency and throughput, and cache utilisation rates. Generally speaking, these are the metrics system administrators, operations teams, and engineers will be interested in.
Usage metrics describe the system at a higher and less dynamic level. Some examples include the number of deployed servers and models, and component versions. These are not typically metrics that engineers need insight into, but may be relevant to platform providers and operations teams.
Seldon inference is built from atomic Model components. Models as shown here cover a wide range of artifacts including:
Core machine learning models, e.g. a PyTorch model.
Feature transformations that might be built with custom python code.
Drift detectors.
Outlier detectors.
Explainers.
Adversarial detectors.
A typical workflow for a production machine learning setup might be as follows:
You create a Tensorflow model for your core application use case and test this model in isolation to validate.
You create an SKLearn feature transformation component before your model to convert the input into the correct form for your model. You also create drift and outlier detectors using Seldon's open source Alibi-Detect library and test these in isolation.
You join these components together into a Pipeline for the final production setup.
These steps are shown in the diagram below:
This section will provide some examples to allow operations with Seldon to be tested so you can run your own models, experiments, pipelines and explainers.
There are various interesting system metrics about how Seldon Core 2 is used. These metrics can be recorded anonymously and sent to Seldon by a lightweight, optional, stand-alone component called Hodometer.
When provided, these metrics are used to understand the adoption of Seldon Core 2 and how you interact with it. For example, knowing how many clusters Seldon Core 2 is running on, if it is used in Kubernetes or for local development, and how many users are benefitting from features such as multi-model serving.
Hodometer is not an integral part of Seldon Core 2, but rather an independent component which connects to the public APIs of the Seldon Core 2 scheduler. If deployed in Kubernetes, it requests some basic information from the Kubernetes API.
Recorded metrics are sent to Seldon and, optionally, to any additional endpoints you define.
Hodometer was explicitly designed with privacy of user information and transparency of implementation in mind.
It does not record any sensitive or identifying information. For example, it has no knowledge of IP addresses, model names, or user information. All information sent to Seldon is anonymised with a completely random cluster identifier.
Hodometer supports different information levels, so you have full control over what metrics are provided to Seldon, if any.
For transparency, the implementation is fully open-source and designed to be easy to read. The full source code is available here, with metrics defined in code here. See below for an equivalent table of metrics.
Metrics are collected as periodic snapshots a few times per day. They are lightweight to collect, coming mostly from the Seldon Core v2 scheduler, and are heavily aggregated. As such, they should have minimal impact on CPU, memory, and network consumption.
Hodometer does not store anything it records, so does not have any persistent storage. As a result, it should not be considered a replacement for tools like Prometheus.
Hodometer supports 3 different metrics levels:

| Level | Description |
| --- | --- |
| Cluster | Basic information about the Seldon Core v2 installation |
| Resource | High-level information about which Seldon Core v2 resources are used |
| Feature | More detailed information about how resources are used and whether or not certain feature flags are enabled |
Alternatively, usage metrics can be completely disabled. To do so, simply remove any existing deployment of Hodometer or disable it in the installation for your environment, discussed below.
The following environment variables control the behaviour of Hodometer, regardless of the environment it is installed in.
| Variable | Format | Example | Description |
| --- | --- | --- | --- |
| METRICS_LEVEL | string | feature | Level of detail for recorded metrics; one of feature, resource, or cluster |
| EXTRA_PUBLISH_URLS | comma-separated list of URLs | http://my-endpoint-1:8000,http://my-endpoint-2:8000 | Additional endpoints to publish metrics to |
| SCHEDULER_HOST | string | seldon-scheduler | Hostname for Seldon Core v2 scheduler |
| SCHEDULER_PORT | integer | 9004 | Port for Seldon Core v2 scheduler |
| LOG_LEVEL | string | info | Level of detail for application logs |
Hodometer is installed as a separate deployment, by default in the same namespace as the rest of the Seldon components.
Helm
If you install Seldon Core v2 by Helm chart, there are values corresponding to the key environment variables discussed above. These Helm values and their equivalents are provided below:
| Helm value | Environment variable |
| --- | --- |
| hodometer.metricsLevel | METRICS_LEVEL |
| hodometer.extraPublishUrls | EXTRA_PUBLISH_URLS |
| hodometer.logLevel | LOG_LEVEL |
If you do not want usage metrics to be recorded, you can disable Hodometer via the hodometer.disable Helm value when installing the runtime Helm chart. The following command disables collection of usage metrics in fresh installations and also serves to remove Hodometer from an existing installation:
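A sketch of that command, assuming the seldon-charts repository and seldon-mesh namespace used elsewhere in this documentation:

```bash
helm upgrade --install seldon-core-v2-runtime seldon-charts/seldon-core-v2-runtime \
  --namespace seldon-mesh \
  --set hodometer.disable=true
```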
The Compose setup provides a pre-configured and opinionated, yet still flexible, approach to using Seldon Core v2.
Hodometer is defined as a service called hodometer
in the Docker Compose manifest. It is automatically enabled
when running as per the installation instructions.
You can disable Hodometer in Docker Compose by removing the corresponding service from the base manifest.
Alternatively, you can gate it behind a profile.
If the service is already running, you can stop it directly using docker-compose stop ...
.
Configuration can be provided by environment variables when running make
or directly invoking docker-compose
.
The available variables are defined in the Docker Compose environment file, prefixed with HODOMETER_
.
Hodometer can be instructed to publish metrics not only to Seldon, but also to any extra endpoints you specify. This is controlled by the EXTRA_PUBLISH_URLS environment variable, which expects a comma-separated list of HTTP-compatible URLs.
You might choose to use this for your own usage monitoring. For example, you could capture these metrics and expose them to Prometheus or another monitoring system using your own service.
Metrics are recorded in MixPanel-compatible format, which employs a highly flexible JSON schema.
For an example of how to define your own metrics listener, see the receiver Go package in the hodometer sub-project.
| Metric | Level | Type | Description |
| --- | --- | --- | --- |
| cluster_id | cluster | UUID | A random identifier for this cluster for de-duplication |
| seldon_core_version | cluster | Version number | E.g. 1.2.3 |
| is_global_installation | cluster | Boolean | Whether installation is global or namespaced |
| is_kubernetes | cluster | Boolean | Whether or not the installation is in Kubernetes |
| kubernetes_version | cluster | Version number | Kubernetes server version, if inside Kubernetes |
| node_count | cluster | Integer | Number of nodes in the cluster, if inside Kubernetes |
| model_count | resource | Integer | Number of Model resources |
| pipeline_count | resource | Integer | Number of Pipeline resources |
| experiment_count | resource | Integer | Number of Experiment resources |
| server_count | resource | Integer | Number of Server resources |
| server_replica_count | resource | Integer | Total number of Server resource replicas |
| multimodel_enabled_count | feature | Integer | Number of Server resources with multi-model serving enabled |
| overcommit_enabled_count | feature | Integer | Number of Server resources with overcommitting enabled |
| gpu_enabled_count | feature | Integer | Number of Server resources with GPUs attached |
| inference_server_name | feature | String | Name of inference server, e.g. MLServer or Triton |
| server_cpu_cores_sum | feature | Float | Total of CPU limits across all Server resource replicas, in cores |
| server_memory_gb_sum | feature | Float | Total of memory limits across all Server resource replicas, in GiB |
We support OpenTelemetry tracing. By default all components will attempt to send OTLP events to seldon-collector.seldon-mesh:4317, which will export to Jaeger at simplest-collector.seldon-mesh:4317.
The components can be installed from the tracing/k8s folder. In future an Ansible playbook will be created. This installs an OpenTelemetry collector and a simple Jaeger install with a service that can be port forwarded to at simplest.seldon-mesh:16686.
An example Jaeger trace is shown below:
We use a simple sklearn iris classification model
Load the model
Wait for the model to be ready
Do a REST inference call
Do a gRPC inference call
Unload the model
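A sketch of the corresponding seldon CLI calls for the steps above; the manifest path and inference payloads are illustrative.

```bash
# Load the model from a manifest file (path is illustrative)
seldon model load -f ./models/sklearn-iris-gs.yaml

# Wait for the model to be ready
seldon model status iris -w ModelAvailable

# REST inference call
seldon model infer iris \
  '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'

# gRPC inference call
seldon model infer iris --inference-mode grpc \
  '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}'

# Unload the model
seldon model unload iris
```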
We run a simple tensorflow model. Note the requirements section specifying tensorflow
.
Load the model.
Wait for the model to be ready.
Get model metadata
Do a REST inference call.
Do a gRPC inference call
Unload the model
We will use two SKlearn Iris classification models to illustrate an experiment.
Load both models.
Wait for both models to be ready.
Create an experiment that modifies the iris model to add a second model splitting traffic 50/50 between the two.
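A sketch of such an Experiment resource, assuming the two models are named iris and iris2:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-sample
spec:
  default: iris
  candidates:
    - name: iris
      weight: 50
    - name: iris2
      weight: 50
```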
Start the experiment.
Wait for the experiment to be ready.
Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.
Run one more request
Use the sticky session key passed by the last infer request to ensure the same route is taken each time. We will test REST and gRPC.
Stop the experiment
Show the requests all go to original model now.
Unload both models.
Examples of various model artifact types from various frameworks running under Seldon Core 2.
SKlearn
Tensorflow
XGBoost
ONNX
Lightgbm
MLFlow
PyTorch
Python requirements in model-zoo-requirements.txt
The training code for this model can be found at scripts/models/iris
in SCv2 repo.
The training code for this model can be found at ./scripts/models/income-xgb
This model is a pretrained model as defined in ./scripts/models/Makefile
target mnist-onnx
The training code for this model can be found at ./scripts/models/income-lgb
The training code for this model can be found at ./scripts/models/wine-mlflow
This example model is downloaded and trained in ./scripts/models/Makefile
target mnist-pytorch
This notebook illustrates a series of Pipelines that are joined together.
gs://seldon-models/triton/simple is an example Triton tensorflow model that takes 2 inputs, INPUT0 and INPUT1, and adds them to produce OUTPUT0 and also subtracts INPUT1 from INPUT0 to produce OUTPUT1. See here for the original source code and license.
Other models can be found at https://github.com/SeldonIO/triton-python-examples
This notebook illustrates a series of Pipelines showing different ways of combining flows of data and conditional logic. We assume you have Seldon Core 2 running locally.
Other models can be found at https://github.com/SeldonIO/triton-python-examples
Chain the output of one model into the next. Also shows changing the tensor names via tensorMap to conform to the expected input tensor names of the second model.
The pipeline below chains the output of tfsimple1 into tfsimple2. As these models have compatible shapes and data types this can be done. However, the output tensor names from tfsimple1 need to be renamed to match the input tensor names for tfsimple2. We do this with the tensorMap feature.
The output of the Pipeline is the output from tfsimple2.
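A sketch of such a Pipeline resource (tensor names follow the Triton simple model described above; the exact sample manifest may differ):

```bash
kubectl apply -f - <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tfsimples
spec:
  steps:
  - name: tfsimple1
  - name: tfsimple2
    inputs:
    - tfsimple1
    tensorMap:
      tfsimple1.outputs.OUTPUT0: INPUT0
      tfsimple1.outputs.OUTPUT1: INPUT1
  output:
    steps:
    - tfsimple2
EOF
```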
We use the Seldon CLI pipeline inspect feature to look at the data for all steps of the pipeline for the last data item passed through the pipeline (the default). This can be useful for debugging.
Next, we get the output as JSON and use the jq tool to extract just one value.
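For example (the pipeline name follows the sketch above; the JSON output flag and jq filter are assumptions, not the exact sample commands):

```bash
# Inspect the last datum for every step of the pipeline
seldon pipeline inspect tfsimples
# JSON form piped through jq (flag name and filter are illustrative)
seldon pipeline inspect tfsimples --format json | jq .
```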
Chain the output of one model into the next. Also shows how inputs and outputs can be combined.
Join two flows of data from two models as input to a third model. This shows how individual flows of data can be combined.
In the pipeline below, the input to tfsimple3 joins one output tensor from each of the two previous models, tfsimple1 and tfsimple2. We need to use the tensorMap feature to rename each output tensor to one of the expected input tensors of the tfsimple3 model.
The outputs are the sequence "2,4,6..." which conforms to the logic of this model (addition and subtraction) when fed the output of the first two models.
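A sketch of the joining Pipeline (tensor names follow the prose above; the exact sample manifest may differ):

```bash
kubectl apply -f - <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: join
spec:
  steps:
  - name: tfsimple1
  - name: tfsimple2
  - name: tfsimple3
    inputs:
    - tfsimple1.outputs.OUTPUT0
    - tfsimple2.outputs.OUTPUT1
    tensorMap:
      tfsimple1.outputs.OUTPUT0: INPUT0
      tfsimple2.outputs.OUTPUT1: INPUT1
  output:
    steps:
    - tfsimple3
EOF
```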
Shows conditional data flows - one of two models is run based on output tensors from first.
Here we assume the conditional model can output two tensors, OUTPUT0 and OUTPUT1, but only outputs the former if the CHOICE input tensor is set to 0; otherwise it outputs tensor OUTPUT1. By this means, only one of the two downstream models will receive data and run. The output step does an any join from both models, and whichever data appears first will be sent as the output of the pipeline. As only one of the two models add10 and mul10 runs, we will receive its output.
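A sketch of such a conditional Pipeline (the downstream input tensor name INPUT is an assumption; the exact sample manifest may differ):

```bash
kubectl apply -f - <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tfsimple-conditional
spec:
  steps:
  - name: conditional
  - name: mul10
    inputs:
    - conditional.outputs.OUTPUT0
    tensorMap:
      conditional.outputs.OUTPUT0: INPUT
  - name: add10
    inputs:
    - conditional.outputs.OUTPUT1
    tensorMap:
      conditional.outputs.OUTPUT1: INPUT
  output:
    steps:
    - mul10
    - add10
    stepsJoin: any
EOF
```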
The mul10 model will run as the CHOICE tensor is set to 0.
The add10 model will run as the CHOICE tensor is not set to zero.
Access to individual tensors in pipeline inputs
This pipeline shows how we can access pipeline inputs INPUT0 and INPUT1 from different steps.
Shows how joins can be used for triggers as well.
Here we require tensors named ok1 or ok2 to exist on the pipeline inputs to run the mul10 model, but require tensor ok3 to exist on the pipeline inputs to run the add10 model. The logic on mul10 is handled by a trigger join of any, meaning either of these input tensors can exist to satisfy the trigger join.
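A sketch of a Pipeline using trigger joins for this logic (the step input tensor names are assumptions; the exact sample manifest may differ):

```bash
kubectl apply -f - <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: trigger-joins
spec:
  steps:
  - name: mul10
    inputs:
    - pipeline.inputs.INPUT
    triggers:
    - pipeline.inputs.ok1
    - pipeline.inputs.ok2
    triggersJoinType: any
  - name: add10
    inputs:
    - pipeline.inputs.INPUT
    triggers:
    - pipeline.inputs.ok3
  output:
    steps:
    - mul10
    - add10
    stepsJoin: any
EOF
```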
gs://seldon-models/triton/simple is an example Triton tensorflow model that takes 2 inputs INPUT0 and INPUT1, adds them to produce OUTPUT0, and also subtracts INPUT1 from INPUT0 to produce OUTPUT1. See the original source code and license.
We will use two SKlearn Iris classification models to illustrate experiments.
Load both models.
Wait for both models to be ready.
Create an experiment that modifies the iris model to add a second model splitting traffic 50/50 between the two.
Start the experiment.
Wait for the experiment to be ready.
Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.
Show the sticky session header x-seldon-route that is returned.
Use sticky session key passed by last infer request to ensure same route is taken each time.
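A sketch of re-using the sticky-session header on a follow-up REST call (the mesh IP, payload, and route value are illustrative):

```bash
# Re-send the x-seldon-route value returned by the previous inference response
curl -s http://${MESH_IP}/v2/models/iris/infer \
  -H "Content-Type: application/json" \
  -H "seldon-model: iris" \
  -H "x-seldon-route: ${ROUTE_FROM_PREVIOUS_RESPONSE}" \
  -d '{"inputs":[{"name":"predict","shape":[1,4],"datatype":"FP64","data":[[1,2,3,4]]}]}'
```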
Stop the experiment
Unload both models.
Use sticky session key passed by last infer request to ensure same route is taken each time.
We will use two SKlearn Iris classification models to illustrate a model with a mirror.
Load both models.
Wait for both models to be ready.
Create an experiment in which traffic sent to iris is also mirrored to iris2.
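A sketch of such a mirror Experiment resource (names are assumptions; the sample manifest in the repo may differ):

```bash
kubectl apply -f - <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: sklearn-mirror
spec:
  default: iris
  candidates:
  - name: iris
    weight: 100
  mirror:
    name: iris2
    percent: 100
EOF
```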
Start the experiment.
Wait for the experiment to be ready.
We get responses from iris, but all requests will also have been mirrored to iris2.
We can check the local Prometheus port from the agent to validate that requests went to iris2.
Stop the experiment
Unload both models.
Let's check that the mul10 model was called.
Let's do an HTTP call and check the two models again.
To install tritonclient
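A minimal sketch; pick the extras that match the protocol you need:

```bash
pip install "tritonclient[all]"
```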
Note: for Tritonclient compatibility, check this issue.
Note: binary data support in HTTP is blocked by https://github.com/SeldonIO/seldon-core-v2/issues/475.
This notebook will show how we can update running experiments.
We will use three SKlearn Iris classification models to illustrate experiment updates.
Load all models.
Let's call all three models individually first.
We will start an experiment to change the iris endpoint to split traffic with the iris2 model.
Now when we call the iris model we should see a roughly 50/50 split between the two models.
Now we update the experiment to change to a split with the iris3 model.
Now we should see a split with the iris3 model.
Now that the experiment has been stopped, we check everything is as before.
Here we test changing the model we want to split traffic on. We will use three SKlearn Iris classification models to illustrate.
Let's call all three models to verify initial conditions.
Now we start an experiment to change calls to the iris model to split with the iris2 model.
Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.
Now let's change the model that the experiment modifies to the iris3 model, splitting traffic between that and iris2.
Let's check the iris model is now as before but the iris3 model has traffic split.
Finally, let's check that, now the experiment has stopped, everything is as it was at the start.
This example runs you through a series of batch inference requests made to both models and pipelines running on Seldon Core locally.
Deprecated: The MLServer CLI infer feature is experimental and will be removed in future work.
If you haven't already, you'll need to clone the Seldon Core repository and run it locally before you run through this example.
First, let's jump into the samples folder where we'll find some sample models and pipelines we can use:
Let's take a look at a sample model before we deploy it:
The above manifest will deploy a simple scikit-learn model based on the iris dataset.
Let's now deploy that model using the Seldon CLI:
Now that we've deployed our iris model, let's create a pipeline around the model.
We see that this pipeline only has one step, which is to call the iris model we deployed earlier. We can create the pipeline by running:
To demonstrate batch inference requests to different types of models, we'll also deploy a simple tensorflow model:
The tensorflow model takes two arrays as inputs and returns two arrays as outputs. The first output is the addition of the two inputs and the second output is the value of (first input - second input).
Let's deploy the model:
Just as we did for the scikit-learn model, we'll deploy a simple pipeline for our tensorflow model:
Inspect the pipeline manifest:
and deploy it:
Once we've deployed a model or pipeline to Seldon Core, we can list them and check their status by running:
and
Your models and pipelines should be showing a state of ModelAvailable and PipelineReady respectively.
Before we run a large batch job of predictions through our models and pipelines, let's quickly check that they work with a single standalone inference request. We can do this using the seldon model infer command.
The prediction request body needs to be an Open Inference Protocol compatible payload and also match the expected inputs for the model you've deployed. In this case, the iris model expects data of shape [1, 4] and of type FP32.
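For example, a sketch of such a request with the Seldon CLI (the data values are illustrative):

```bash
seldon model infer iris \
  '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1.0, 2.0, 3.0, 4.0]]}]}'
```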
You'll notice that the prediction results for this request come back on outputs[0].data.
You'll notice that the inputs for our tensorflow model look different from the ones we sent to the iris model. This time, we're sending two arrays of shape [1,16]. When sending an inference request, we can optionally choose which outputs we want back by including an {"outputs":...} object.
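For example, a sketch of such a call with the Seldon CLI (the model name tfsimple1 and the data values are assumptions):

```bash
seldon model infer tfsimple1 '{
  "inputs": [
    {"name": "INPUT0", "shape": [1, 16], "datatype": "INT32",
     "data": [[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]]},
    {"name": "INPUT1", "shape": [1, 16], "datatype": "INT32",
     "data": [[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]]}
  ],
  "outputs": [{"name": "OUTPUT0"}]
}'
```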
In the samples folder there is a batch request input file, batch-inputs/iris-input.txt. It contains 100 input payloads for our iris model. Let's take a look at the first line in that file:
To run a batch inference job we'll use the MLServer CLI. If you don't already have it installed you can install it using:
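For example:

```bash
pip install mlserver
```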
The inference job can be executed by running the following command:
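A sketch of that command, assuming the inference endpoint is exposed locally on localhost:9000 (flags follow the MLServer CLI; check mlserver infer --help for your version):

```bash
mlserver infer \
  -u localhost:9000 \
  -m iris \
  -i batch-inputs/iris-input.txt \
  -o /tmp/iris-output.txt \
  --workers 5
```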
The mlserver batch component will take your input file batch-inputs/iris-input.txt, distribute those payloads across 5 different workers (--workers 5), collect the responses and write them to the file /tmp/iris-output.txt. For a full set of options check out the MLServer CLI Reference.
We can check the inference responses by looking at the contents of the output file:
We can run the same batch job for our iris pipeline and store the outputs in a different file:
We can check the inference responses by looking at the contents of the output file:
The samples folder contains an example batch input for the tensorflow model, just as it did for the scikit-learn model. You can find it at batch-inputs/tfsimple-input.txt. Let's take a look at the first inference request in the file:
As before, we can run the inference batch job using the mlserver infer command:
We can check the inference responses by looking at the contents of the output file:
You should get the following response:
We can check the inference responses by looking at the contents of the output file:
Now that we've run our batch examples, let's remove the models and pipelines we created:
And finally let's spin down our local instance of Seldon Core:
The Seldon models and pipelines are exposed via a single service endpoint in the install namespace called seldon-mesh. All models, pipelines and experiments can be reached via this single Service endpoint by setting appropriate headers on the inference REST/gRPC request. By this means, Seldon is agnostic to any service mesh you may wish to use in your organisation. We provide example integrations for some service meshes below (in alphabetical order):
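A sketch of how the headers are used (the mesh IP, model and pipeline names, and payload are illustrative):

```bash
# Call a model through the shared seldon-mesh service
curl -s http://${MESH_IP}/v2/models/iris/infer \
  -H "Content-Type: application/json" \
  -H "seldon-model: iris" \
  -d '{"inputs":[{"name":"predict","shape":[1,4],"datatype":"FP64","data":[[1,2,3,4]]}]}'

# Pipelines are addressed with the .pipeline suffix
curl -s http://${MESH_IP}/v2/models/iris-pipeline/infer \
  -H "Content-Type: application/json" \
  -H "seldon-model: iris-pipeline.pipeline" \
  -d '{"inputs":[{"name":"predict","shape":[1,4],"datatype":"FP64","data":[[1,2,3,4]]}]}'
```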
We welcome help to extend these to other service meshes.
The setup below also illustrates the use of Kafka-specific prefixes for topics and consumer IDs, for isolation where the Kafka cluster is shared with other applications and you want to enforce constraints. You would not strictly need this in this example, as we install Kafka just for Seldon here.
If you have installed Kafka via the ansible playbook setup-ecosystem, then you can use the following command to see the consumer group IDs, which reflect the settings we created.
We can similarly show the topics that have been created.
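A sketch of both commands, assuming the Kafka broker Pod from the setup-ecosystem playbook is named seldon-kafka-0 in the seldon-mesh namespace (the Pod name is an assumption):

```bash
# List consumer groups (should show the configured consumer-id prefix)
kubectl exec -n seldon-mesh seldon-kafka-0 -- \
  bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list

# List topics (should show the configured topic prefix)
kubectl exec -n seldon-mesh seldon-kafka-0 -- \
  bin/kafka-topics.sh --bootstrap-server localhost:9092 --list
```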
To run this example in Kind we need to start Kind with access to a local folder where our models are located. In this example we will use a folder in /tmp and associate that with a path in the container.
To start a Kind cluster with these settings using our ansible script, you can run the following from the project root folder:
Now you should finish the Seldon install following the docs.
Create the local folder we will use for our models and copy an example iris sklearn model to it.
Here we create a storage class and an associated persistent volume referencing the /models folder where our models are stored.
Now we create a new Server based on the provided MLServer configuration, but extend it with our PVC by adding it to the rclone container, which will allow rclone to move models from this PVC onto the server. We also add a new capability pvc to allow us to schedule models to this server that has the PVC.
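A sketch of such a Server resource (the PVC, volume, and mount names are assumptions; the sample manifest in the repo may differ):

```bash
kubectl apply -f - <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-pvc
spec:
  serverConfig: mlserver
  extraCapabilities:
  - pvc
  podSpec:
    volumes:
    - name: models-pvc
      persistentVolumeClaim:
        claimName: ml-models-pvc
    containers:
    - name: rclone
      volumeMounts:
      - name: models-pvc
        mountPath: /var/models
EOF
```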
We use a simple sklearn iris classification model with the added pvc requirement so that our MLServer with the PVC will be targeted during scheduling.
Do a gRPC inference call
Run these examples from the samples/examples/image_classifier folder.
We show an image classifier (CIFAR10) with associated outlier and drift detectors using a Pipeline.
The model is a tensorflow CIFAR10 image classifier.
The outlier detector is created from the CIFAR10 VAE Outlier example.
The drift detector is created from the CIFAR10 KS Drift example
To run local training run the training notebook.
Use the seldon CLI to look at the outputs from the CIFAR10 model. It will decode the Triton binary outputs for us.
We will run through some examples as shown in the notebook service-meshes/ambassador/ambassador.ipynb in our repo.
Seldon Iris classifier model
Default Ambassador Host and Listener
Ambassador Mappings for REST and gRPC endpoints
Seldon provides an Experiment resource for service mesh agnostic traffic splitting, but if you wish to control this via Ambassador, an example is shown below to split traffic between two models.
Assumes
You have installed emissary as per their docs
Tested with
Seldon provides APIs for management and inference.
Prerequisites
Configure an OIDC provider to authenticate. Obtain the issuer URL, jwksUri, and the access token from the OIDC provider.
In the following example, you can secure the endpoint such that any requests to the endpoint without the access token are denied.
To secure the endpoints of a model, you need to:
Create a RequestAuthentication resource named ingress-jwt-auth in the istio-system namespace. Replace <OIDC_TOKEN_ISSUER> and <OIDC_TOKEN_ISSUER_JWKS> with your OIDC provider’s specific issuer URL and JWKS (JSON Web Key Set) URI.
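A sketch of ingress-jwt-auth.yaml (the gateway selector is an assumption; adjust it to match your ingress gateway labels):

```bash
cat > ingress-jwt-auth.yaml <<EOF
apiVersion: security.istio.io/v1
kind: RequestAuthentication
metadata:
  name: ingress-jwt-auth
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway   # assumption: applies to the default Istio ingress gateway
  jwtRules:
  - issuer: "<OIDC_TOKEN_ISSUER>"
    jwksUri: "<OIDC_TOKEN_ISSUER_JWKS>"
EOF
```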
Create the resource using kubectl apply -f ingress-jwt-auth.yaml.
Create an authorization policy deny-empty-jwt in the namespace istio-system.
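A sketch of deny-empty-jwt.yaml, denying requests that do not carry a valid JWT principal (the gateway selector is an assumption):

```bash
cat > deny-empty-jwt.yaml <<EOF
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: deny-empty-jwt
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway   # assumption: same gateway as above
  action: DENY
  rules:
  - from:
    - source:
        notRequestPrincipals: ["*"]
EOF
```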
Create the resource using kubectl apply -f deny-empty-jwt.yaml.
To verify that requests without an access token are denied, send this request:
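A sketch of such a request (the gateway address, model name, and payload are illustrative):

```bash
curl -i http://${INGRESS_HOST}/v2/models/iris/infer \
  -H "Content-Type: application/json" \
  -H "seldon-model: iris" \
  -d '{"inputs":[{"name":"predict","shape":[1,4],"datatype":"FP64","data":[[1,2,3,4]]}]}'
```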
The output is similar to:
Now, send the same request with an access token:
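For example (a sketch; substitute the token obtained from your OIDC provider):

```bash
TOKEN="<ACCESS_TOKEN_FROM_YOUR_OIDC_PROVIDER>"
curl -i http://${INGRESS_HOST}/v2/models/iris/infer \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -H "seldon-model: iris" \
  -d '{"inputs":[{"name":"predict","shape":[1,4],"datatype":"FP64","data":[[1,2,3,4]]}]}'
```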
The output is similar to:
Seldon provides some internal gRPC services it uses to manage components.
To run this notebook you need the inference data. This can be acquired in two ways:
Run make train, or
gsutil cp -R gs://seldon-models/scv2/examples/income/infer-data .
Show predictions from the reference set. There should be no drift or outliers.
Show predictions from the drift data. There should be drift and probably no outliers.
Show predictions from the outlier data. There should be outliers and probably no drift.
In this example we create a Pipeline to chain two huggingface models to provide speech-to-sentiment functionality, and add an explainer to understand the result.
This example also illustrates how explainers can target pipelines to allow complex explanations flows.
This example requires the ffmpeg package to be installed locally. Run make install-requirements for the Python dependencies.
Create a method to load speech from the recorder, transform it into mp3, and send it as base64 data. On return of the result, extract and show the text and sentiment.
We will load two Huggingface models for speech to text and text to sentiment.
To allow Alibi-Explain to more easily explain the sentiment we will need:
Input and output transforms that take the Dict values input and output by the Huggingface sentiment model and turn them into values that Alibi-Explain can easily understand, with the core values we want to explain and the outputs from the sentiment model.
A separate Pipeline to allow us to join the sentiment model with the output transform
These transform models are MLServer custom runtimes as shown below:
We can now create the final pipeline that will take speech and generate sentiment along with an explanation of why that sentiment was predicted.
We will wait for the explanation which is run asynchronously to the functional output from the Pipeline above.
We will run through some examples as shown in the notebook service-meshes/istio/istio.ipynb in our repo.
A Seldon Iris Model
An istio Gateway
An istio VirtualService to expose REST and gRPC
Two Iris Models
An istio Gateway
An istio VirtualService with traffic split
Assumes
You have installed istio as per their docs
You have exposed the ingressgateway as an external loadbalancer
tested with:
provides service mesh and ingress products. Our examples here are based on the Emissary ingress.
Note: Traffic splitting does not presently work due to this issue. We recommend you use a Seldon Experiment instead.
emissary-ingress-7.3.2 installed via
Currently not working due to this issue
In enterprise use cases, you may need to control who can access the endpoints for deployed models or pipelines. You can leverage existing authentication mechanisms in your cluster or environment, such as service mesh-level controls, or use cloud provider solutions like Apigee on GCP, Amazon API Gateway on AWS, or a provider-agnostic gateway like Gravitee. Seldon Core 2 integrates with various service meshes and ingress solutions that support these requirements. Though Seldon Core 2 is service-mesh agnostic, the example on this page demonstrates how to set up authentication and authorization to secure a model endpoint using the Istio service mesh.
Service meshes offer a flexible way of defining authentication and authorization rules for your models. With Istio, for example, you can configure multiple layers of security within an Istio Gateway, such as gateway-level TLS, as well as RequestAuthentication and AuthorizationPolicy policies to enforce both authentication and authorization controls.
provides a service mesh and ingress solution.
Warning: Traffic splitting does not presently work due to this issue. We recommend you use a Seldon Experiment instead.
You have installed Traefik as per their docs into namespace traefik-v2
provides a service mesh and ingress solution.
Seldon inference servers must respect the following API specification.
In future, Seldon may provide extensions for use with Pipelines, Experiments and Explainers.
Learn about installing Seldon command line tool that you can use to manage Seldon Core 2 resources.
You can install the Seldon CLI using prebuilt binaries or build it locally.
Download a recent release from https://github.com/SeldonIO/seldon-core/releases.
It is dynamically linked and requires a *nix architecture and glibc 2.25+.
Move the binary to the seldon folder and set the required permissions.
Add the folder to your PATH.
Install Go version 1.21.1
Clone and make the build.
Add <project-root>/operator/bin to your PATH.
Install dependencies.
brew install go librdkafka
Clone the repository and make the build.
Add <project-root>/operator/bin to your PATH.
Open your terminal, open your .bashrc or .zshrc file, and add the following line:
export PATH=$PATH:<project-root>/operator/bin
seldon config - manage configs
seldon experiment - manage experiments
seldon model - manage models
seldon pipeline - manage pipelines
seldon server - manage servers
Learn more about using Seldon CLI commands
Seldon provides a CLI for easy management and testing of model, experiment, and pipeline resources. The Seldon CLI allows you to view information about underlying Seldon resources and make changes to them through the scheduler in non-Kubernetes environments. However, it cannot modify underlying manifests within a Kubernetes cluster. Therefore, using the Seldon CLI for control plane operations in a Kubernetes environment is not recommended and is disabled by default.
While Seldon CLI control plane operations (e.g. load and unload) in a Kubernetes environment are not recommended, there are other use cases (e.g. inspecting Kafka topics in a pipeline) that are easily enabled with the Seldon CLI. It offers out-of-the-box deserialisation of Kafka messages according to the Open Inference Protocol (OIP), allowing you to test these pipelines in such environments.
The following table provides more information about when and where to use these command line tools.
| | Seldon CLI | kubectl |
| --- | --- | --- |
| Primary purpose | Simplifies management for non-Kubernetes users, abstracting control plane operations like load, unload, status. | Kubernetes-native; manages resources via Kubernetes Custom Resources (CRs) like Deployments, Pods, etc. |
| Control Plane Operations | Executes operations such as load and unload of models through scheduler gRPC endpoints, without interaction with Kubernetes. This is disabled by default and the user has to explicitly specify the --force flag as an argument (or set the environment variable SELDON_FORCE_CONTROL_PLANE=true). | Interacts with Kubernetes, creating and managing CRs such as SeldonDeployments and other Kubernetes resources. |
| Data Plane Operations | Abstracts the Open Inference Protocol to issue infer or inspect requests for testing purposes. This is useful, for example, when inspecting intermediate Kafka messages in a multi-step pipeline. | Used indirectly for data plane operations by exposing Kubernetes services and interacting with them. |
| Visibility of Resources | Resources created using the Seldon CLI are internal to the scheduler and not visible as Kubernetes resources. | All resources that are created using kubectl are visible and manageable within the Kubernetes environment. |
The Seldon CLI can be deployed as a Pod in the same namespace alongside a Core 2 installation, which allows users to access the different CLI commands with minimal setup (i.e. with the environment variables set appropriately).
The seldonio/seldon-cli Docker image has a prepackaged Seldon CLI suitable for deployment in a Kubernetes cluster. Consider the following Kubernetes Pod manifest:
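A sketch of what such a manifest might contain, mirroring the notes below (the image tag, command, ports, and mount paths are assumptions, and the Kafka TLS environment variables are omitted):

```bash
cat > seldon-cli-pod.yaml <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: seldon-cli
  namespace: seldon-mesh
spec:
  serviceAccountName: seldon-scheduler        # needed to read Kafka TLS secrets, if any
  containers:
  - name: seldon-cli
    image: seldonio/seldon-cli:latest         # tag is an assumption
    command: ["sleep", "infinity"]
    env:
    - name: SELDON_SCHEDULE_HOST
      value: "seldon-scheduler:9004"
    volumeMounts:
    - name: kafka-config
      mountPath: /mnt/kafka
  volumes:
  - name: kafka-config
    configMap:
      name: seldon-kafka
EOF
```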
which can be deployed with
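For example, assuming the manifest was saved as seldon-cli-pod.yaml as in the sketch above:

```bash
kubectl apply -f seldon-cli-pod.yaml -n seldon-mesh
```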
Notes:
The above manifest sets up environment variables with the correct IP/port for the scheduler control plane (SELDON_SCHEDULE_HOST).
It mounts the seldon-kafka ConfigMap at /mnt/kafka/kafka.json so that the Seldon CLI reads the relevant details from it (specifically the Kafka bootstrap servers IP/port).
It also sets Kafka TLS environment variables, which should be identical to those of the seldon-modelgateway deployment.
The service account seldon-scheduler is required to access Kafka TLS secrets, if any.
In this case users can run any Seldon CLI command via the kubectl exec utility. For example:
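A sketch, using the Pod name from the manifest above (resource names are illustrative):

```bash
kubectl exec -n seldon-mesh seldon-cli -- seldon model status iris
kubectl exec -n seldon-mesh seldon-cli -- seldon pipeline inspect tfsimples
```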
In particular, this mode is also useful for inspecting data in the Kafka topics that form a pipeline. This works because all the relevant arguments are set up so that the Seldon CLI can consume data from these topics using the same settings as the corresponding Core 2 deployment in the same namespace.
Check operator/config/cli for a helper script and an example manifest.
The CLI talks to 3 backend services on default endpoints:
The Seldon Core 2 Scheduler: default 0.0.0.0:9004
The Seldon Core inference endpoint: default 0.0.0.0:9000
The Seldon Kafka broker: default: 0.0.0.0:9092
These defaults will be correct when Seldon Core 2 is installed locally as per the docs. For Kubernetes, you will need to change these by defining environment variables for the following.
For a default install into the seldon-mesh namespace, if you have exposed the inference svc as a loadbalancer, you will find it at:
Use above IP at port 80:
For a default install into the seldon-mesh namespace, if you have exposed the scheduler svc as a loadbalancer, you will find it at:
Use above IP at port 9004:
The Kafka broker will depend on how you have installed Kafka into your Kubernetes cluster. Find the broker IP and use:
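A sketch of how the CLI can then be pointed at the cluster; SELDON_SCHEDULE_HOST appears elsewhere in these docs, while the other variable names and the service names are assumptions for your install:

```bash
# Find the externally exposed services (names assumed from a default install)
kubectl get svc seldon-mesh -n seldon-mesh
kubectl get svc seldon-scheduler -n seldon-mesh

# Point the CLI at them (replace the placeholders with the LoadBalancer IPs found above)
export SELDON_INFER_HOST=<inference-loadbalancer-ip>:80
export SELDON_SCHEDULE_HOST=<scheduler-loadbalancer-ip>:9004
export SELDON_KAFKA_BROKER=<kafka-broker-ip>:9092
```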
You can create a config file to manage connections to running Seldon Core 2 installs. The settings will override any environment variable settings.
The definition is shown below:
The example below shows how to connect via TLS to the Seldon scheduler using our scheduler client certificate:
To manage config files and activate them you can use the CLI command seldon config, which has subcommands to list, add, remove, activate and deactivate configs.
For example:
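A sketch of those subcommands (the config name and file path are illustrative):

```bash
seldon config add k8s-cluster ~/.config/seldon/cli/config-k8s.json
seldon config activate k8s-cluster
seldon config list
```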
Kafka configuration (pipeline inspect)
Additional Kafka configuration specifying the Kafka broker and relevant consumer config can be supplied as a path to a JSON-formatted file. This is currently only supported for inspecting data in topics that form a specific pipeline, by passing --kafka-config-path. For example:
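A sketch of such a call (the pipeline name and file path are illustrative; the file should contain the broker and consumer settings described above):

```bash
seldon pipeline inspect tfsimples --kafka-config-path ./kafka-config.json
```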
For running with Kubernetes TLS connections on the control and/or data plane, certificates will need to be downloaded locally. We provide an example script which will download certificates from a Kubernetes secret and store them in a folder. It can be found at hack/download-k8s-certs.sh and takes 2 or 3 arguments:
e.g.:
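A sketch of invoking the script (the argument order and values are assumptions; check the script itself for the exact usage):

```bash
./hack/download-k8s-certs.sh seldon-mesh seldon-controlplane-client ./certs
```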
This API is for communication between the Seldon Scheduler and the Seldon Agent, which runs next to each inference server and manages the loading and unloading of models onto the server, as well as acting as a reverse proxy in the data plane for handling requests to the inference server.
The Seldon scheduler API provides a gRPC service to allow Models, Servers, Experiments, and Pipelines to be managed. In Kubernetes the manager deployed by Seldon translates Kubernetes custom resource definitions into calls to the Seldon Scheduler.
In non-Kubernetes environments users of Seldon could create a client to directly control Seldon resources using this API.
manage configs
Manage and activate configuration files for the CLI
seldon -
seldon config activate - activate config
seldon config add - add config
seldon config deactivate - deactivate config
seldon config list - list configs
seldon config remove - remove config