Explore the enterprise MLOps capabilities of Core 2 for production ML deployment, featuring model serving, pipeline orchestration, and intelligent resource management for scalable ML systems.
After models are deployed, Core 2 enables monitoring and experimentation on those systems in production. With support for a wide range of model types, and design patterns to build around those models, you can standardize ML deployment across a range of use cases in the cloud or on the on-premise serving infrastructure of your choice.
Seldon Core 2 orchestrates and scales machine learning components running as production-grade microservices. These components can be deployed locally or in enterprise-scale Kubernetes clusters. The components of your ML system - such as models, processing steps, custom logic, or monitoring methods - are deployed as Models, leveraging serving solutions compatible with Core 2 such as MLServer, Alibi, the LLM Module, or Triton Inference Server. These serving solutions package the required dependencies and standardize inference using the Open Inference Protocol. This ensures that, regardless of your model types and use cases, all requests and responses follow a unified format. After models are deployed, they can process REST or gRPC requests for real-time inference.
Machine learning applications are increasingly complex. They've evolved from individual models deployed as services, to complex applications that can consist of multiple models, processing steps, custom logic, and asynchronous monitoring components. With Core you can build Pipelines that connect any of these components to make data-centric applications. Core 2 handles orchestration and scaling of the underlying components of such an application, and exposes the data streamed through the application in real time using Kafka.
This approach to MLOps, influenced by our position paper, enables real-time observability, insight, and control over the behavior and performance of your ML systems.
Lastly, Core 2 provides Experiments as part of its orchestration capabilities, enabling users to implement routing logic such as A/B tests or Canary deployments to models or pipelines in production. After experiments are run, you can promote new models and/or pipelines, or launch new experiments, so that you can continuously improve the performance of your ML applications.
In Seldon Core 2 your models are deployed on inference servers, which are software components that manage the packaging and execution of ML workloads. As part of its design, Core 2 separates Servers and Models into separate resources. This approach enables flexible allocation of models to servers, aligned with the requirements of your models and with the underlying infrastructure that you want your servers to run on. Core 2 also provides functionality to autoscale your models and servers up and down as needed, based on your workload requirements or user-defined metrics.
With the modular design of Core 2, users are able to implement cutting-edge methods to minimize hardware costs:
Multi-Model serving consolidates multiple models onto shared inference servers to optimize resource utilization and decrease the number of servers required.
Over-commit allows you to provision more models than the available memory would normally allow by dynamically loading and unloading models from memory to disk based on demand.
Core 2 demonstrates the power of a standardized, data-centric approach to MLOps at scale, ensuring that data observability and management are prioritized across every layer of machine learning operations. Furthermore, Core 2 seamlessly integrates into end-to-end MLOps workflows, from CI/CD and traffic management with the service mesh of your choice to alerting, data visualization, and authentication and authorization.
This modular, flexible architecture not only supports diverse deployment patterns but also ensures compatibility with the latest AI innovations. By embedding data-centricity and adaptability into its foundation, Core 2 equips organizations to scale and improve their machine learning systems effectively, to capture value from increasingly complex AI systems.
Discover Seldon Core 2, a Kubernetes-native MLOps framework for deploying ML and LLM systems at scale. Features flexible architecture, standardized workflows, and enhanced observability.
Seldon Core 2 is a Kubernetes-native framework for deploying and managing machine learning (ML) and Large Language Model (LLM) systems at scale. Its data-centric approach and modular architecture enable seamless handling of simple models to complex ML applications across on-premise, hybrid, and multi-cloud environments while ensuring flexibility, standardization, observability, and cost efficiency.
Seldon Core 2 offers a platform-agnostic, flexible framework for seamless deployment of different types of ML models across on-premise, cloud, and hybrid environments. Its adaptive architecture enables customizable applications, future-proofing MLOps or LLMOps by scaling deployments as data and applications evolve. The modular design enhances resource-efficiency, allowing dynamic scaling, component reuse, and optimized resource allocation. This ensures long-term scalability, operational efficiency, and adaptability to changing demands.
Seldon Core 2 enforces best practices for ML deployment, ensuring consistency, reliability, and efficiency across the entire lifecycle. By automating key deployment steps, it removes operational bottlenecks, enabling faster rollouts and allowing teams to focus on high-value tasks.
With a "learn once, deploy anywhere" approach, Seldon Core 2 standardizes model deployment across on-premise, cloud, and hybrid environments, reducing risk and improving productivity. Its unified execution framework supports conventional, foundational, and LLM models, streamlining deployment and enabling seamless scalability. It also enhances collaboration between MLOps Engineers, Data Scientists, and Software Engineers by providing a customizable framework that fosters knowledge sharing, innovation, and the adoption of new data science capabilities.
Observability in Seldon Core 2 enables real-time monitoring, analysis, and performance tracking of ML systems, covering data pipelines, models, and deployment environments. Its customizable monitoring framework ensures teams have the key metrics needed for maintenance and decision-making.
Seldon simplifies operational monitoring, allowing real-time ML or LLM deployments to expand across organizations while supporting complex, mission-critical use cases. A data-centric design ensures all prediction data is auditable, maintaining explainability, compliance, and trust in AI-driven decisions.
Seldon Core 2 is built for scalability, efficiency, and cost-effective ML operations, enabling you to deploy only the necessary components while maintaining agility and high performance. Its modular architecture ensures that resources are optimized, infrastructure is consolidated, and deployments remain adaptable to evolving business needs.
Scaling for Efficiency: scales infrastructure dynamically based on demand, auto-scaling for real-time use cases while scaling to zero for on-demand workloads, and preserving deployment state for seamless reactivation. By eliminating redundancy and optimizing deployments, it balances cost efficiency and performance, ensuring reliable inference at any scale.
Consolidated Serving Infrastructure: maximizes resource utilization with multi-model serving (MMS) and overcommit, reducing compute overhead while ensuring efficient, reliable inference.
Extendability & Modular Adaptation: integrates with LLMs, Alibi, and other modules, enabling on-demand ML expansion. Its modular design ensures scalable AI, maximizing value extraction, agility, and cost efficiency.
Reusability for Cost Optimization: provides predictable, fixed pricing, enabling cost-effective scaling and innovation while ensuring financial transparency and flexibility.



Learn how Seldon Core is different from a centralized orchestrator to a data flow architecture, enabling more efficient ML model deployment through stream processing and improved data handling.
In the context of machine learning and Seldon Core 2, concepts provide a framework for understanding key functionalities, architectures, and workflows within the system. Some of the key concepts in Seldon Core 2 are:
Data-centricity is an approach that puts the management, integrity, and flow of data at the core of machine learning deployment. Rather than focusing solely on models, this approach ensures that data quality, consistency, and adaptability drive successful ML operations. In Seldon Core 2, data-centricity is embedded in every stage of the inference workflow.
Why Data-Centricity Matters
By adopting a data-centric approach, Seldon Core 2 enables:
More reliable and high-quality predictions by ensuring clean, well-structured data.
Scalable and future-proof ML deployments through standardized data management.
Efficient monitoring and maintenance, reducing risks related to model drift and inconsistencies.
With data-centricity as a core principle, Seldon Core 2 ensures end-to-end control over ML workflows, enabling you to maximize model performance and reliability in production environments.
The Open Inference Protocol (OIP) defines a standard way for inference servers and clients to communicate in Seldon Core 2. Its goal is to enable interoperability, flexibility, and consistency across different model-serving runtimes. It exposes the Health API, Metadata API, and the Inference API.
Some of the features of the Open Inference Protocol include:
Transport Agnostic: Servers can implement HTTP/REST or gRPC protocols.
Runtime Awareness: Use the protocolVersion field in your runtime YAML or consult the supported runtimes table to verify compatibility.
By adopting OIP, Seldon Core 2 promotes a consistent experience across a diverse set of model deployments.
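For illustration, here is a minimal sketch of an OIP-compliant REST inference request; the host, port, and model name (iris) are placeholders for your own deployment, and the gRPC variant of the protocol uses the same schema.
# Placeholder host/port and model name; adjust to your own deployment.
curl -s -X POST http://localhost:9000/v2/models/iris/infer \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [1.0, 2.0, 3.0, 4.0]}
    ]
  }'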
Components are the building blocks of an inference graph, processing data at various stages of the ML inference pipeline. They provide reusable, standardized interfaces, making it easier to maintain and update workflows without disrupting the entire system. Components include ML models, data processors, routers, and supplementary services.
Types of Components
By modularizing inference workflows, components allow you to scale, experiment, and optimize ML deployments efficiently while ensuring data consistency and reliability.
In a model serving platform like Seldon Core 2, a pipeline refers to an orchestrated sequence of models and components that work together to serve more complex AI applications. These pipelines allow you to connect models, routers, transformers, and other inference components, with Kafka used to stream data between them. Each component in the pipeline is modular and independently managed—meaning it can be scaled, configured, and updated separately based on its specific input and output requirements.
This use of "pipeline" is distinct from how the term is used in MLOps for CI/CD pipelines, which automate workflows for building, testing, and deploying models. In contrast, Core 2 pipelines operate at runtime and focus on the live composition and orchestration of inference systems in production.
In Core 2, servers are responsible for hosting and serving machine learning models, handling inference requests, and ensuring scalability, efficiency, and observability in production. Core 2 supports multiple inference servers, including MLServer and NVIDIA Triton Inference Server, enabling flexible and optimized model deployments.
MLServer: A lightweight, extensible inference server designed to work with multiple ML frameworks, including scikit-learn, XGBoost, TensorFlow, and PyTorch. It supports custom Python models and integrates well with MLOps workflows. It is built for general-purpose model serving, custom model wrappers, and multi-framework support.
In Seldon Core 2, experiments enable controlled A/B testing, model comparisons, and performance evaluations by defining an HTTP traffic split between different models or inference pipelines. This allows organizations to test multiple versions of a model in production while managing risk and ensuring continuous improvements.
Some of the advantages of using Experiments:
A/B Testing & Model comparison: Compare different models under real-world conditions without full deployment.
Risk-Free model validation: Test a new model or pipeline in parallel without affecting live predictions.
Performance monitoring: Assess latency, accuracy, and reliability before a full rollout.
Continuous improvement: Make data-driven deployment decisions based on real-time model performance.
Learn about the Seldon Core 2 microservice architecture, control plane and data plane components, and how these services work to provide a scalable, fault-tolerant ML model serving and management.
Seldon Core 2 uses a microservice architecture where each service has limited and well-defined responsibilities working together to orchestrate scalable and fault-tolerant ML serving and management. These components communicate internally using gRPC and they can be scaled independently. Seldon Core 2 services can be split into two categories:
Control Plane services are responsible for managing the operations and configurations of your ML models and workflows. This includes functionality to instantiate new inference servers, load models, update new versions of models, configure model experiments and pipelines, and expose endpoints that may receive inference requests. The main control plane component is the Scheduler that is responsible for managing the loading and unloading of resources (models, pipelines, experiments) onto the respective components.
Data Plane services are responsible for managing the flow of data between components or models. Core 2 supports REST and gRPC payloads that follow the Open Inference Protocol (OIP).
The current set of services used in Seldon Core 2 is shown below. Following the diagram, we describe each control plane and data plane service.
Scheduler: This service manages the loading and unloading of Models, Pipelines, and Experiments on the relevant microservices. It is also responsible for matching Models with available Servers in a way that optimises infrastructure use. In the current design we can only have one instance of the Scheduler, as its internal state is persisted on disk.
When the Scheduler (re)starts there is a synchronisation flow to coordinate the startup process and to attempt to wait for expected Model Servers to connect before proceeding with control plane operations. This is important so that ongoing data plane operations are not interrupted. This introduces a delay on any control plane operations until the process has finished (including control plane resource status updates). This synchronisation process has a timeout, which defaults to 10 minutes. It can be changed by setting the scheduler.schedulerReadyTimeoutSeconds value of the seldon-core-v2-components Helm chart.
Agent: This service manages the loading and unloading of models on a server and access to the server over REST/gRPC. It acts as a reverse proxy that connects end users with the actual model servers. In this way the system collects stats and metrics about data plane inferences that help with observability and scaling.
We also provide a Kubernetes Operator to allow Kubernetes usage. This is implemented in the Controller Manager microservice, which manages CRD reconciliation with Scheduler. Currently Core 2 supports one instance of the Controller.
Pipeline gateway: This service handles REST/gRPC calls to Pipelines. It translates synchronous requests into Kafka operations, producing a message on the relevant input topic for a Pipeline and consuming from the output topic to return inference results back to the users.
Model gateway: This service handles the flow of data for Models, pulling inference requests from Kafka, forwarding them to the inference servers, and passing on the responses via Kafka.
Dataflow engine: This service handles the flow of data between components in a pipeline, using Kafka Streams. It enables Core 2 to chain and join Models together to provide complex Pipelines.
Envoy acts as the request proxy for Seldon Core 2, routing traffic to the appropriate inference servers and handling load balancing across available replicas. Envoy configuration of Seldon Core 2 uses weighted least-request load balancing, which dynamically distributes traffic based on both replica weights and current load, helping ensure efficient and stable request routing.
To support the movement towards data centric machine learning Seldon Core 2 follows a dataflow paradigm. By taking a decentralized route that focuses on the flow of data, users can have more flexibility and insight as they build and manage complex AI applications in production. This contrasts with more centralized orchestration approaches where data is secondary.
Kafka is used as the backbone for Pipelines allowing decentralized, synchronous and asynchronous usage. This enables Models to be connected together into arbitrary directed acyclic graphs. Models can be reused in different Pipelines. The flow of data between models is handled by the dataflow engine using Kafka Streams.
By focusing on the data we allow users to join various flows together using stream joining concepts as shown below.
We support several types of joins:
inner joins, where all inputs need to be present for a transaction to join the tensors passed through the Pipeline;
outer joins, where only a subset needs to be available during the join window;
triggers, in which data flows need to wait until records on one or more trigger data flows appear. The data in these triggers is not passed onwards from the join.
These techniques allow users to create complex pipeline flows of data between machine learning components.
More discussion on the data flow view of machine learning and its effect on v2 design can be found here.
Key principles:
Flexible Workflows: Core 2 supports adaptable and scalable data pathways, accommodating various use cases and experiments. This ensures ML pipelines remain agile, allowing you to evolve the inference logic as requirements change.
Real-Time Data Streaming: Integrated data streaming capabilities allow you to view, store, manage, and process data in real time. This enhances responsiveness and decision-making, ensuring models work with the most up-to-date data for accurate predictions.
Standardized Processing: Core 2 promotes reusable and consistent data transformation and routing mechanisms. Standardized processing ensures data integrity and uniformity across applications, reducing errors and inconsistencies.
Comprehensive Monitoring: Detailed metrics and logs provide real-time visibility into data integrity, transformations, and flow. This enables effective oversight and maintenance, allowing teams to detect anomalies, drifts, or inefficiencies early.
Component types:
Data Processors: Transform, filter, or aggregate data to ensure consistent, repeatable pre-processing.
Data Routers: Dynamically route data to different paths based on predefined rules for A/B testing, experimentation, or conditional logic.
Models: Perform inference tasks, including classification, regression, and Large Language Models (LLMs), hosted internally or via external APIs.
Supplementary Data Services: External services like vector databases that enable models to access embeddings and extended functionality.
Drift/Outlier Detectors & Explainers: Monitor model predictions for drift, anomalies, and explainability insights, ensuring transparency and performance tracking.
Experiment types:
Traffic Splitting: Distributes inference requests across different models or pipelines based on predefined percentage splits. This enables A/B testing and comparison of multiple model versions, for example a canary deployment of a new model version.
Mirror Testing: Sends a percentage of the traffic to a mirror model or pipeline without affecting the response returned to users. This allows evaluation of new models without impacting production workflows, for example a shadow deployment of a new model version.

Learn how to configure and use Rclone for model artifact storage in Seldon Core, including cloud storage integration and authentication.
We utilize Rclone to copy model artifacts from a storage location to the model servers. This allows users to take advantage of Rclone's support for over 40 cloud storage backends, including Amazon S3, Google Cloud Storage, and many others.
For authorization needed for cloud storage when running on Kubernetes see here.
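As an illustrative sketch (the bucket names and paths are placeholders, not real artifacts), the same Model spec can point at different Rclone-supported backends simply by changing the storageUri scheme:
# Hypothetical bucket/path shown for illustration only.
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: example-model
spec:
  # Google Cloud Storage; an S3 alternative would be e.g. "s3://my-bucket/models/example"
  storageUri: "gs://my-bucket/models/example"
  requirements:
  - sklearn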
Learn how to install Seldon Core 2 in various environments - from local development with Docker Compose to production-grade Kubernetes clusters.
Seldon Core 2 can be installed in various setups to suit different stages of the development lifecycle. The most common modes include:
Ideal for development and testing purposes, a local setup allows for quick iteration and experimentation with minimal overhead. Common tools include:
Docker Compose: Simplifies deployment by orchestrating Seldon Core components and dependencies in Docker containers. Suitable for environments without Kubernetes, providing a lightweight alternative.
Kind (Kubernetes IN Docker): Runs a Kubernetes cluster inside Docker, offering a realistic testing environment. Ideal for experimenting with Kubernetes-native features.
Designed for high-availability and scalable deployments, a production setup ensures security, reliability, and resource efficiency. Typical tools and setups include:
Managed Kubernetes Clusters: Platforms like GKE (Google Kubernetes Engine), EKS (Amazon Elastic Kubernetes Service), and AKS (Azure Kubernetes Service) provide managed Kubernetes solutions. Suitable for enterprises requiring scalability and cloud integration.
On-Premises Kubernetes Clusters: For organizations with strict compliance or data sovereignty requirements. Can be deployed on platforms like OpenShift or custom Kubernetes setups.
By selecting the appropriate installation mode—whether it's Docker Compose for simplicity, Kind for local Kubernetes experimentation, or production-grade Kubernetes for scalability—you can effectively leverage Seldon Core 2 to meet your specific needs.
For more information, see the published Helm charts.
For a description of (some) values that can be configured for these charts, see the Helm chart documentation.
Here is a list of components that Seldon Core 2 requires, along with the minimum and maximum supported versions:

Component | Minimum Version | Maximum Version | Notes
Kubernetes | 1.27 | 1.33.0 | Required
Envoy* | 1.32.2 | 1.32.2 | Required (included in the Core 2 installation)
Rclone* | 1.68.2 | 1.69.0 | Required (included in the Core 2 installation)
Kafka | 3.4 | 3.8 | Recommended (only required for operating Seldon Core 2 dataflow Pipelines)
Prometheus | 2.0 | 2.x | Optional
Grafana | 10.0 | *** | Optional (no hard limit on the maximum version to be used)
Prometheus-adapter | 0.12 | 0.12 | Optional
OpenTelemetry Collector | 0.68 | *** | Optional (no hard limit on the maximum version to be used)

Notes:
Envoy and Rclone: These components are included as part of the Seldon Core 2 Docker images. You are not required to install them separately, but you must be aware of the configuration options supported by these versions.
Kafka: Only required for operating Seldon Core 2 dataflow Pipelines. If not needed, you should avoid installing seldon-modelgateway, seldon-pipelinegateway, and seldon-dataflow-engine.

Seldon Core 2 is installed with the following Helm charts:
Name of the Helm Chart | Description
seldon-core-v2-crds | Cluster-wide installation of custom resources.
seldon-core-v2-setup | Installation of the manager to manage resources in the namespace or cluster-wide. This also installs default SeldonConfig and ServerConfig resources, allowing Runtimes and Servers to be installed on demand.
seldon-core-v2-runtime | Installs a SeldonRuntime custom resource that creates the core components in a namespace.
seldon-core-v2-servers | Installs Server custom resources providing example core servers to load models.

For Kubernetes usage we provide a set of custom resources for interacting with Seldon:
SeldonRuntime - for installing Seldon in a particular namespace.
Servers - for deploying sets of replicas of core inference servers (MLServer or Triton).
Models - for deploying single machine learning models, custom transformation logic, drift detectors, outlier detectors, and explainers.
Experiments - for testing new versions of models.
Pipelines - for connecting together flows of data between models.
SeldonConfig and ServerConfig define the core installation configuration and machine learning inference server configuration for Seldon. Normally, you would not need to customize these, but this may be required for a custom installation within your organisation.
ServerConfigs - for defining new types of inference server that can be referenced by a Server resource.
SeldonConfig - for defining how Seldon is installed.

Seldon Core provides native autoscaling features for both Models and Servers, enabling automatic scaling based on inference load. The diagram below depicts an autoscaling implementation that uses both Model and Server autoscaling features native to Seldon Core (i.e. this implementation doesn't leverage HPA for autoscaling, an approach covered separately).
Models can automatically scale their replicas based on load. Enable this by setting minReplicas or maxReplicas in your model spec; see the model autoscaling documentation for more detail on this setup.
Server autoscaling automatically scales Servers based on Model needs. This implementation supports scaling in a Multi-Model Serving setup where multiple models are hosted on shared inference servers; see the server autoscaling documentation for more detail on this setup.
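As a minimal sketch of native model autoscaling (the model shown is the iris sample used elsewhere in this documentation; the replica bounds are illustrative), enabling it only requires setting the replica bounds in the Model spec:
# Sketch only: replica bounds enable native model autoscaling.
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
  requirements:
  - sklearn
  replicas: 1       # initial replica count
  minReplicas: 1    # lower bound for scaling
  maxReplicas: 3    # upper bound for scaling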

This page provides guidance about scaling Seldon Core 2 services.
Seldon Core 2 runs with several control plane and data plane components. The scaling of these resources is discussed below:
Pipeline gateway: The pipeline gateway handles REST and gRPC synchronous requests to Pipelines. It is stateless and can be scaled based on traffic demand.
Model gateway: This component pulls model requests from Kafka and sends them to inference servers. It can be scaled up to the partition factor of your Kafka topics. At present we set a uniform partition factor for all topics in one installation of Seldon.
Dataflow engine: The dataflow engine runs KStream topologies to manage Pipelines. It can run as multiple replicas, and the scheduler will balance Pipelines across them with a consistent-hashing load balancer. Each Pipeline is managed up to the partition factor of Kafka (presently hardwired to one). We recommend using as many replicas of the dataflow engine as you have Kafka partitions, in order to leverage the balanced distribution of inference traffic using hashing.
Scheduler: The scheduler manages the control plane operations. It is presently required to be one replica as it maintains internal state within a BadgerDB held on local persistent storage (stateful set in Kubernetes). Performance tests have shown this not to be a bottleneck at present.
Kubernetes Controller: The Kubernetes controller manages resources updates on the cluster which it passes on to the Scheduler. It is by default one replica but has the ability to scale.
Envoy: Envoy replicas get their state from the scheduler for routing information and can be scaled as needed.
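Where these services are created by a SeldonRuntime resource, a hedged sketch of setting per-component replica counts might look like the following (this assumes the SeldonRuntime overrides field; the component names follow the services described above, and the replica counts are illustrative):
# Sketch only: assumes SeldonRuntime supports per-component replica overrides.
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
  name: seldon
spec:
  seldonConfig: default
  overrides:
  - name: seldon-pipelinegateway
    replicas: 2
  - name: seldon-modelgateway
    replicas: 2   # up to the Kafka partition factor, as noted above
  - name: seldon-envoy
    replicas: 3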
Install Seldon Core 2 in a local learning environment.
You can install Seldon Core 2 on your local computer that is running a Kubernetes cluster using kind.
Seldon publishes the Helm charts that are required to install Seldon Core 2. For more information about the Helm charts and the related dependencies, see Helm charts and Dependencies.
Install a Kubernetes cluster that is running version 1.27 or later.
Install kubectl, the Kubernetes command-line tool.
Install Helm, the package manager for Kubernetes, or Ansible, the automation tool used for provisioning, configuration management, and application deployment.
Create a namespace to contain the main components of Seldon Core 2. For example, create the seldon-mesh namespace.
Add and update the Helm charts, seldon-charts, to the repository.
Install Custom resource definitions for Seldon Core 2.
If you installed Seldon Core 2 using Helm, you need to complete the installation of other components in the following order:
Learn how to configure multi-model serving in Seldon Core, including resource optimization and model co-location.
Multi-model serving is an architecture pattern where one ML inference server hosts multiple models at the same time. This means that, within a single instance of the server, you can serve multiple models under different paths. This is a feature provided out of the box by Nvidia Triton and Seldon MLServer, currently the two inference servers that are integrated in Seldon Core 2.
This deployment pattern allows the system to handle a large number of deployed models by letting them share the hardware resources allocated to inference servers (e.g. GPUs). For example, if a single model inference server is deployed on a node with one GPU, the models loaded on this inference server instance are able to effectively share that GPU. This is in contrast to a single-model-per-server deployment pattern, where only one model can use the allocated GPU.
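As a minimal sketch (the server name and replica count are illustrative), a shared MLServer instance that multiple Models can be scheduled onto is declared as a Server resource:
# Sketch of a shared inference server; multiple Models matching its
# capabilities can be loaded onto the same replicas.
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
spec:
  serverConfig: mlserver
  replicas: 2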
Models provide the atomic building blocks of Seldon. They represent machine learning models, drift detectors, outlier detectors, explainers, feature transformations, and more complex routing models such as multi-armed bandits.
Seldon can handle a wide range of inference artifacts.
Artifacts can be stored on any of the 40 or more cloud storage technologies supported by Rclone, as well as in a local (mounted) folder, as discussed earlier.
The Model specification allows parameters to be passed to the loaded model to allow customization. For example:
This capability is only available for MLServer custom model runtimes. The named keys and
values will be added to the model-settings.json file for the provided model in the parameters.extra Dict. MLServer models are able to read these values in their load method.
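For illustration, using the query parameter from the PandasQuery sample manifest shown later (query: "choice == 1"), the values would appear in the generated model-settings.json roughly as follows; this is a sketch, not the complete file:
{
  "name": "choice-is-one",
  "parameters": {
    "extra": {
      "query": "choice == 1"
    }
  }
}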
In MLOps, system performance can be defined as how efficiently and effectively an ML model or application operates in a production environment. It is typically measured across several key dimensions, such as latency, throughput, scalability, and resource-efficiency. These factors are deeply connected: changes in configuration often result in tradeoffs, and should be carefully considered. Specifically, latency, throughput and resource usage can all impact each other, and the approach to optimizing system performance depends on the desired balance of these outcomes in order to ensure a positive end-user experience, while also minimising infrastructure costs.
There are many different levers that can be considered to tune performance for ML systems deployed with Core 2, across infrastructure, models, inference execution, and the related configurations exposed by Core 2. When reasoning about the performance of an ML-based system deployed using Seldon, we recommend breaking down the problem by first understanding and tuning the performance of deployed Models, and then subsequently considering more complex Pipelines composed of those models (if applicable). For both models and pipelines, it is important to run tests to understand baseline performance characteristics, before making efforts to tune these through changes to models, infrastructure, or inference configurations. The recommended approach can be broken down as follows:
Overview of Horizontal Pod Autoscaler (HPA) scaling options in Seldon Core 2
Given that Seldon Core 2 is predominantly for serving ML in Kubernetes, it is possible to leverage the Kubernetes HorizontalPodAutoscaler (HPA) to define scaling logic that automatically scales Kubernetes resources up and down. HPA targets Kubernetes or custom metrics to trigger scale-up or scale-down events for specified resources. Using HPA is recommended if custom scaling metrics are required. These would be exposed using Prometheus or similar tools for exposing metrics to HPA. If these tools cause conflicts, the native Core 2 autoscaling approach, which does not require exposing custom metrics, is recommended.
Seldon Core 2 provides two main approaches to leveraging the Kubernetes Horizontal Pod Autoscaler (HPA) for autoscaling. It is important to remember that, since Models and Servers are separate resources in Core 2, autoscaling of both Models and Servers needs to be coordinated when implementing autoscaling. In order to implement either approach, metrics first need to be exposed; the setup guide covers the fundamental requirements and configuration needed to enable HPA-based scaling in Seldon Core 2.
Custom metrics: allows you to define and track custom metrics specific to your models and use cases.
Visualization: Provides dashboards and visualizations to easily observe the status and performance of models.
There are two kinds of metrics present in Seldon Core 2 that you can monitor:
Operational metrics describe the performance of components in the system. Some examples of common operational considerations are memory consumption and CPU usage, request latency and throughput, and cache utilisation rates. Generally speaking, these are the metrics system administrators, operations teams, and engineers will be interested in.
Usage metrics describe the system at a higher and less dynamic level. Some examples include the number of deployed servers and models, and component versions. These are not typically metrics that engineers need insight into, but may be relevant to platform providers and operations teams.
# samples/models/choice1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: choice-is-one
spec:
storageUri: "gs://seldon-models/scv2/examples/pandasquery"
requirements:
- mlserver
- python
parameters:
- name: query
value: "choice == 1"Model, the Scheduler will find an appropriate model inference server instance
to load the model onto.In the below example, given that the model is tensorflow, the system will deploy the model onto atriton server instance (matching with the Server labels). Additionally as the model memory
requirement is 100Ki, the system will pick the server instance that has enough (memory) capacity to
host this model in parallel to other potentially existing models.
All models are loaded and active on this model server. Inference requests for these models are served concurrently and the hardware resources are shared to fulfil these inference requests.
Overcommit allows shared servers to handle more models than can fit in memory. This is done by keeping highly utilized models in memory and evicting other ones to disk using a least-recently-used (LRU) cache mechanism. From a user perspective these models are all registered and "ready" to serve inference requests. If an inference request comes for a model that is unloaded/evicted to disk, the system will reload the model first before forwarding the request to the inference server.
Overcommit is enabled by setting SELDON_OVERCOMMIT_PERCENTAGE on shared servers; it is set by
default at 10%. In other words a given model inference server instance can register models with a
total memory requirement up to MEMORY_REQUEST * ( 1 + SELDON_OVERCOMMIT_PERCENTAGE / 100).
The Seldon Agent (a sidecar next to each model inference server deployment) keeps track of inference request times for the different models. Models are sorted in ascending order of last request time, and this data structure is used to evict the least recently used model in order to make room for another incoming model. This happens in two scenarios:
A new model load request beyond the active memory capacity of the inference server.
An incoming inference request to a registered model that is not loaded in-memory (previously evicted).
This is seamless to users. Specifically, when a model is reloaded onto the inference server to respond to an inference request, the model artifact is cached on disk, which allows a faster reload (no remote artifact fetch). Therefore we expect that the extra latency to reload a model during an inference request is acceptable in many cases (with a lower bound of ~100ms).
Overcommit can be disabled by setting SELDON_OVERCOMMIT_PERCENTAGE to 0 for a given shared server.
Check this notebook for a local example.
A Kubernetes yaml example is shown below for a SKLearn model for iris classification:
Its Kubernetes spec has two core requirements:
A storageUri specifying the location of the artifact. This can be any rclone URI specification.
A requirements list which provides tags that need to be matched by the Server that can run
this artifact type. By default when you install Seldon we provide a set of Servers that cover a
range of artifact types.
You can also load models directly over the scheduler gRPC service. An example using the grpcurl tool is shown below:
The proto buffer definitions for the scheduler are outlined here.
Multi-model serving is an architecture pattern where one ML inference server hosts multiple models at the same time. It is a feature provided out of the box by Nvidia Triton and Seldon MLServer. Multi-model serving reduces infrastructure hardware requirements (e.g. expensive GPUs) which enables the deployment of a large number of models while making it efficient to operate the system at scale.
Seldon Core 2 leverages multi-model serving by design and it is the default option for deploying
models. The system will find an appropriate server to load the model onto based on requirements that
the user defines in the Model deployment definition.
Moreover, in many cases demand patterns allow for further Overcommit of resources. Seldon Core 2 is able to register more models than what can be served by the provisioned (memory) infrastructure and will swap models dynamically according to least used without adding significant latency overheads to inference workload.
See Multi-model serving for more information.
See here for discussion of autoscaling of models.
See here for details on how Core 2 schedules Models onto Servers.
# samples/models/tfsimple1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
# samples/models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
grpcurl -d '{"model":{ \
"meta":{"name":"iris"},\
"modelSpec":{"uri":"gs://seldon-models/mlserver/iris",\
"requirements":["sklearn"],\
"memoryBytes":500},\
"deploymentSpec":{"replicas":1}}}' \
-plaintext \
-import-path ../../apis \
-proto apis/mlops/scheduler/scheduler.proto 0.0.0.0:9004 seldon.mlops.scheduler.Scheduler/LoadModel
Install the Seldon Core 2 operator in the seldon-mesh namespace.
This configuration installs the Seldon Core 2 operator across an entire Kubernetes cluster. To perform cluster-wide operations, create ClusterRoles and ensure your user has the necessary permissions during deployment. With cluster-wide operations, you can create SeldonRuntimes in any namespace.
You can configure the installation to deploy the Seldon Core 2 operator in a specific namespace so that it controls resources only in the provided namespace. To do this, set controller.clusterwide to false.
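As a hedged sketch of the corresponding command (mirroring the cluster-wide installation command used elsewhere in this guide; the namespace is a placeholder):
# Namespaced installation: the operator only manages resources in seldon-mesh.
helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
  --namespace seldon-mesh \
  --set controller.clusterwide=false \
  --install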
Install Seldon Core 2 runtimes in the seldon-mesh namespace.
helm upgrade seldon-core-v2-runtime seldon-charts/seldon-core-v2-runtime \
--namespace seldon-mesh \
--install
Install Seldon Core 2 servers in the seldon-mesh namespace. Two example servers named mlserver-0 and triton-0 are installed so that you can load models to these servers after installation.
helm upgrade seldon-core-v2-servers seldon-charts/seldon-core-v2-servers \
--namespace seldon-mesh \
--install
Check that the Seldon Core 2 operator, runtimes, servers, and CRDs are installed in the seldon-mesh namespace. It might take a couple of minutes for all the Pods to be ready. To check the status of the Pods in real time, use: kubectl get pods -w -n seldon-mesh.
kubectl get pods -n seldon-mesh
The output should be similar to this:
NAME READY STATUS RESTARTS AGE
hodometer-749d7c6875-4d4vw 1/1 Running 0 4m33s
mlserver-0 3/3 Running 0 4m10s
seldon-dataflow-engine-7b98c76d67-v2ztq 0/1 CrashLoopBackOff 5 (49s ago) 4m33s
seldon-envoy-bb99f6c6b-4mpjd 1/1 Running 0 4m33s
seldon-modelgateway-5c76c7695b-bhfj5 1/1 Running 0 4m34s
seldon-pipelinegateway-584c7d95c-bs8c9 1/1 Running 0 4m34s
seldon-scheduler-0 1/1 Running 0 4m34s
seldon-v2-controller-manager-5dd676c7b7-xq5sm 1/1 Running 0 4m52s
triton-0 2/3 Running 0 4m10s
You can install Seldon Core 2 and its components using Ansible in one of the following methods:
To install Seldon Core 2 into a new local kind Kubernetes cluster, you can use the seldon-all playbook with a single command:
ansible-playbook playbooks/seldon-all.yaml
This creates a kind cluster and installs ecosystem dependencies such as Kafka,
Prometheus, OpenTelemetry, and Jaeger as well as all the seldon-specific components.
The seldon components are installed using helm-charts from the current git
checkout (../k8s/helm-charts/).
Internally this runs, in order, the following playbooks:
kind-cluster.yaml
setup-ecosystem.yaml
setup-seldon.yaml
You may pass any of the additional variables which are configurable for those playbooks to seldon-all.
For example:
ansible-playbook playbooks/seldon-all.yaml -e seldon_mesh_namespace=my-seldon-mesh -e install_prometheus=no -e @playbooks/vars/set-custom-images.yaml
Running the playbooks individually gives you more control over what and when it runs. For example, if you want to install into an existing k8s cluster.
Create a kind cluster:
ansible-playbook playbooks/kind-cluster.yaml
Set up the ecosystem:
ansible-playbook playbooks/setup-ecosystem.yaml
Seldon runs by default in the seldon-mesh namespace and a Jaeger pod and OpenTelemetry collector are installed in the namespace.
To install in a different <mynamespace> namespace:
ansible-playbook playbooks/setup-ecosystem.yaml -e seldon_mesh_namespace=<mynamespace>
Install Seldon Core 2 (run from the ansible/ folder):
ansible-playbook playbooks/setup-seldon.yaml
To install Seldon Core 2 in a different <mynamespace> namespace:
ansible-playbook playbooks/setup-seldon.yaml -e seldon_mesh_namespace=<mynamespace>
kubectl create ns seldon-mesh || echo "Namespace seldon-mesh already exists"
helm repo add seldon-charts https://seldonio.github.io/helm-charts/
helm repo update seldon-charts
helm upgrade seldon-core-v2-crds seldon-charts/seldon-core-v2-crds \
  --namespace default \
  --install
helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
  --namespace seldon-mesh --set controller.clusterwide=true \
  --install
Load testing to understand latency and throughput behaviour for one model replica
Tuning performance for models. This can be done via changes to:
Infrastructure - choosing the right hardware, and configurations in Core related to CPUs, GPUs and memory.
Models - optimizing model artefacts in how they are structured, configured, stored. This can include model pruning, quantization, consideration of different model frameworks, and making sure that the model can achieve a high utilisation of the allocated resources.
Inference execution - the way in which inference is executed. This can include the choice of communication protocols (REST, gRPC), payload configuration, batching, and efficient execution of concurrent requests.
Testing Pipelines to identify the critical path based on performance of underlying models
Core 2 Configuration to optimize data-processing through pipelines
Scalability of Pipelines to understand how Core 2 components scale with the number of deployed pipelines and models
The Model Autoscaling with HPA approach enables users to scale Models based on custom metrics. This approach, along with Server Autoscaling, enables users to customize the scaling logic for models and to automate the scaling of Servers based on the needs of the Models hosted on them.
The Model and Server Autoscaling with HPA approach leverages HPA to autoscale Models and Servers in a coordinated way. This requires a 1-1 mapping of Models and Servers (no Multi-Model Serving). In this case, HPA can be set up for a Model and its associated Server, targeting the same custom metric (this is possible for Kubernetes-native metrics).
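Below is a hedged sketch of such an HPA object. It assumes a custom metric named infer_rps has already been exposed through Prometheus and prometheus-adapter, and that the Model resource can be used as an HPA scale target; the model name and metric values are placeholders.
# Sketch only: assumes a custom metric "infer_rps" exposed via prometheus-adapter.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: iris-model-hpa
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Model
    name: iris
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Object
    object:
      metric:
        name: infer_rps
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: iris
      target:
        type: AverageValue
        averageValue: "100"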
Learn how to use PandasQuery for data transformation in Seldon Core, including query configuration and parameter handling.
This model allows a Pandas query to be run against the input to select rows. An example is shown below:
# samples/models/choice1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: choice-is-one
spec:
storageUri: "gs://seldon-models/scv2/examples/pandasquery"
requirements:
- mlserver
- python
parameters:
- name: query
value: "choice == 1"This invocation check filters for tensor A having value 1.
The model also returns a tensor called status which indicates the operation run and whether it
was a success. If no rows satisfy the query then just a status tensor output will be returned.
For further details, see Pandas query
This model can be useful for conditional Pipelines. For example, you could have two invocations of this model:
and
By including these in a Pipeline as follows we can define conditional routes:
Here the mul10 model will be called if the choice-is-one model succeeds and the add10 model will be called if the choice-is-two model succeeds.
For more details, see the Pandas query documentation.
Learn how to configure model scheduling in Seldon Core, including resource allocation, scaling policies, and deployment strategies.
The Core 2 architecture is built around decoupling Model and Server CRs to allow for multi-model deployment, enabling multiple models to be loaded and served on one server replica or a single Pod. Multi-model serving allows for more efficient use of resources; see Multi-Model Serving for more information.
This architecture requires that Core 2 handles scheduling of models onto server pods natively. In particular, Core 2 implements different sorters and filters which are used to find the best Server that is able to host a given Model. We describe this process in the following section.
The scheduling process in Core 2 identifies a suitable candidate server for a given model through a series of steps. These steps involve sorting and filtering servers primarily based on the following criteria:
Server has matching Capabilities with Model spec.requirements.
Server has enough replicas to load the desired spec.replicas of the Model.
Each replica of the Server has enough available memory to load one replica of the model, as defined in spec.memory.
A Server that already hosts the Model is preferred, to reduce flip-flops between different candidate servers.
After a suitable candidate server is identified for a given model, Core 2 attempts to load the model onto it. If no matching server is found, the model is marked as ScheduleFailed.
This process is designed to be extensible, allowing for the addition of new filters in future versions to enhance scheduling decisions.
Core 2 (from 2.9) is able to do partial scheduling of Models. Partial scheduling is defined as the loading of enough replicas of the model above spec.minReplicas and up to the number of available Server replicas. This gives the user more flexibility in serving traffic while optimising infrastructure provisioning.
To enable partial scheduling, spec.minReplicas needs to be defined as it provides Core 2 the minimum replicas of the model that is required for serving.
Partial scheduling does not have an explicit state; instead, the model is marked as ModelAvailable and is ready to serve traffic. The status of the Model CR can be inspected, where DESIRED REPLICAS and AVAILABLE REPLICAS provide insight into the number of replicas currently loaded in Core 2. Based on this information, the following logic is applied:
Fully Scheduled: READY is True and DESIRED REPLICAS is equal to AVAILABLE REPLICAS (STATUS is ModelAvailable)
Partially Scheduled: READY is TRUE and DESIRED REPLICAS is greater than AVAILABLE REPLICAS (STATUS is ModelAvailable)
Not Scheduled: READY is False (STATUS is ScheduleFailed)
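For illustration (the model shown is the iris sample; the replica counts are illustrative), a Model that requests three replicas but tolerates partial scheduling down to one replica sets both fields:
# Sketch: replicas is the desired count, minReplicas the minimum needed
# for the model to be considered available (enables partial scheduling).
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
  requirements:
  - sklearn
  replicas: 3
  minReplicas: 1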
Learn how to manage Kafka topics in Seldon Core 2, including topic creation, configuration, and monitoring for model inference and event streaming.
A Model in Seldon Core 2 represents the fundamental unit for serving a machine learning artifact within a running server instance.
If Kafka is installed in your cluster, Seldon Core automatically creates dedicated input and output topics for each model as it is loaded. These topics facilitate asynchronous messaging, enabling clients to send input messages and retrieve output responses independently and at a later time.
By default, when a model is unloaded, the associated Kafka topics are preserved. This supports use cases like auditing, but can also lead to increased Kafka resource usage and unnecessary costs for workloads that don't require persistent topics.
You can control this behavior by configuring the cleanTopicsOnDelete flag in the Model's dataflow spec.
Before looking to make changes to improve latency or throughput of your models, it is important to undergo load testing to understand the existing performance characteristics of your model(s) when deployed onto the chosen inference server (MLServer or Triton). The goal of load testing should be to understand the performance behavior of deployed models in saturated (i.e at the maximum throughput a model replica can handle) and non-saturated regimes, then compare it with expected latency objectives.
The results here will also inform the setup of autoscaling parameters, with the target being to run each replica with some margin below the saturation throughput (say, by 10-20%) in order to ensure that latency does not degrade, and that there is sufficient capacity to absorb load for the time it takes for new inference server replicas (and model replicas) to become available.
An Experiment defines a traffic split between Models or Pipelines. This allows new versions of models and pipelines to be tested.
An experiment spec has three sections:
candidates (required) : a set of candidate models to split traffic.
default (optional) : an existing candidate whose endpoint should be modified to split traffic as defined by the candidates.
mirror (optional) : a model or pipeline that receives a copy of the traffic without affecting the responses returned to users.
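A minimal sketch of an Experiment manifest following these sections (the model names and weights are placeholders):
# Sketch: splits traffic 50/50 between two candidate models; "iris" is the
# default whose endpoint is modified, "iris2" a new version under test.
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-sample
spec:
  default: iris
  candidates:
  - name: iris
    weight: 50
  - name: iris2
    weight: 50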
Seldon Core 2 provides multiple approaches to scaling your machine learning deployments, allowing you to optimize resource utilization and handle varying workloads efficiently. In Core 2, we separate out Models and Servers, and Servers can have multiple Models loaded on them (Multi-Model Serving). Given this, setting up autoscaling requires defining the logic by which you want to scale your Models and then configuring the autoscaling of Servers such that they autoscale in a coordinated way. The following steps can be followed to set up autoscaling based on specific requirements:
Identify metrics that you want to scale Models on. There are a couple of different options for this.
Learn how to implement model explainability in Seldon Core using Alibi-Explain integration for black box model explanations and pipeline insights.
Explainers are Model resources with some extra settings. They allow a range of explainers from the Alibi-Explain library to be run on MLServer.
An example Anchor explainer definition is shown below.
The key additions are:
type: This must be one of the explainer types supported by the Alibi Explain runtime in MLServer.
modelRef: The model name for black box explainers.


When set to false (the default), the topics remain after the model is deleted.
When set to true, both the input and output topics are removed when the model is unloaded.
Here is an example of a manifest file that enables topic cleanup on deletion:
To inspect existing Kafka topics in your cluster, you can deploy a temporary Pod:
After the Pod is running, you can access it and list topics with the following command:
Apply the model manifest with topic cleanup enabled:
After deployment, you can list Kafka topics from within the kafka-busybox pod and confirm that input/output topics have been created:
To delete the model:
After deletion, list the topics again. You should see that the input and output topics have been successfully removed from Kafka:
Similar to models, when a Pipeline is deployed in Seldon Core 2, Kafka input and output topics are automatically created for it. These topics enable asynchronous processing across pipeline steps.
As with models, the cleanTopicsOnDelete flag controls whether these topics are retained or removed when the pipeline is deleted:
By default, topics are retained after the pipeline is unloaded.
When cleanTopicsOnDelete is set to true, the input and output topics associated with the pipeline are deleted.
Here is an example of a pipeline manifest that wraps the previously defined model and enables topic cleanup:
Apply the pipeline manifest with topic cleanup enabled:
After the pipeline is deployed, you can list the Kafka topics from inside the kafka-busybox pod to confirm that they have been created:
To delete the pipeline, run:
After deletion, list the Kafka topics again. You should observe that the pipeline's input and output topics have been removed:
pipelineRef: The pipeline name for black box explainers.
Only one of modelRef and pipelineRef is allowed.
Blackbox explainers can explain a Pipeline as well as a model. An example from the Huggingface sentiment demo is shown below.
# samples/models/choice2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: choice-is-two
spec:
storageUri: "gs://seldon-models/scv2/examples/pandasquery"
requirements:
- mlserver
- python
parameters:
- name: query
value: "choice == 2"# samples/pipelines/choice.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: choice
spec:
steps:
- name: choice-is-one
- name: mul10
inputs:
- choice.inputs.INPUT
triggers:
- choice-is-one.outputs.choice
- name: choice-is-two
- name: add10
inputs:
- choice.inputs.INPUT
triggers:
- choice-is-two.outputs.choice
output:
steps:
- mul10
- add10
stepsJoin: any
# samples/models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
dataflow:
cleanTopicsOnDelete: true
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
apiVersion: v1
kind: Pod
metadata:
name: kafka-busybox
spec:
containers:
- name: kafka-busybox
image: apache/kafka:latest
command: ["sleep", "3600"]
imagePullPolicy: IfNotPresent
restartPolicy: Always
kafka-busybox:/opt/kafka/bin$ ./kafka-topics.sh --list --bootstrap-server $SELDON_KAFKA_BOOTSTRAP_PORT_9092_TCP
kubectl apply -f model.yaml -n seldon-mesh
__consumer_offsets
seldon.seldon-mesh.model.iris.inputs
seldon.seldon-mesh.model.iris.outputs
kubectl delete -f model.yaml -n seldon-mesh
__consumer_offsets
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: iris-pipeline
spec:
dataflow:
cleanTopicsOnDelete: true
steps:
- name: iris
output:
steps:
- iris
kubectl apply -f pipeline.yaml -n seldon-mesh
__consumer_offsets
seldon.seldon-mesh.errors.errors
seldon.seldon-mesh.model.iris.inputs
seldon.seldon-mesh.model.iris.outputs
seldon.seldon-mesh.pipeline.iris-pipeline.inputs
seldon.seldon-mesh.pipeline.iris-pipeline.outputs
kubectl delete -f pipeline.yaml -n seldon-mesh
__consumer_offsets
seldon.seldon-mesh.errors.errors
seldon.seldon-mesh.model.iris.inputs
seldon.seldon-mesh.model.iris.outputs
# samples/models/income-explainer.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-explainer
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/income-sklearn/anchor-explainer"
explainer:
type: anchor_tabular
modelRef: income
# samples/models/hf-sentiment-explainer.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: sentiment-explainer
spec:
storageUri: "gs://seldon-models/scv2/examples/huggingface/speech-sentiment/explainer"
explainer:
type: anchor_text
pipelineRef: sentiment-explain
Install Helm, the package manager for Kubernetes.
To use Seldon Core 2 in a production environment:
Seldon publishes the Helm charts that are required to install Seldon Core 2. For more information about the Helm charts and the related dependencies, see Helm charts and Dependencies.
Create a namespace to contain the main components of Seldon Core 2. For example, create the namespace seldon-mesh:
Create a namespace to contain the components related to monitoring. For example, create the namespace seldon-monitoring:
Add and update the Helm charts seldon-charts to the repository.
Install custom resource definitions for Seldon Core 2.
Install Seldon Core 2 operator in the seldon-mesh namespace.
This configuration installs the Seldon Core 2 operator across an entire Kubernetes cluster. To perform cluster-wide operations, create ClusterRoles and ensure your user has the necessary permissions during deployment. With cluster-wide operations, you can create SeldonRuntimes in any namespace.
With cluster-wide installation, you can specify the namespaces to watch by setting controller.watchNamespaces to a comma-separated list of namespaces (e.g., {ns1, ns2}). This allows the Seldon Core 2 operator to monitor and manage resources in those namespaces.
You can also install multiple operators in different namespaces, and configure them to watch a disjoint set of namespaces. For example, you can install two operators in op-ns1 and op-ns2, and configure them to watch ns1, ns2 and ns3, ns4, respectively, using the following commands:
We now install the first operator in op-ns1 and configure it to watch ns1 and ns2:
Next, we install the second operator in op-ns2 and configure it to watch ns3 and ns4:
Note that the second operator is installed with skipClusterRoleCreation=true to avoid re-creating the ClusterRole and ClusterRoleBinding that were created by the first operator.
Finally, you can configure the installation to deploy the Seldon Core 2 operator in a specific namespace so that it controls resources only in that namespace. To do this, set controller.clusterwide to false.
Install Seldon Core 2 runtimes in the namespace seldon-mesh.
Install Seldon Core 2 servers in the namespace seldon-mesh. Two example servers named mlserver-0, and triton-0 are installed so that you can load the models to these servers after installation.
Check that the Seldon Core 2 operator, runtimes, servers, and CRDs are installed in the namespace seldon-mesh:
The output should be similar to this:
You can integrate Seldon Core 2 with a self-hosted Kafka cluster or a managed Kafka service.
Drift detectors.
Outlier detectors.
Explainers.
Adversarial detectors.
A typical workflow for a production machine learning setup might be as follows:
You create a Tensorflow model for your core application use case and test this model in isolation to validate.
You create an SKLearn feature transformation component that runs before your model to convert the input into the correct form for your model. You also create Drift and Outlier detectors using Seldon's open source Alibi-detect library and test these in isolation.
You join these components together into a Pipeline for the final production setup.
These steps are shown in the diagram below:
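In Pipeline terms, the joined application might look roughly like the sketch below. This is only an illustration, assuming each component has already been deployed as a Model; the names preprocess, classifier, drift-detector, and outlier-detector are placeholders.
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: production-app
spec:
  steps:
  - name: preprocess
  - name: classifier
    inputs:
    - preprocess.outputs
  - name: drift-detector
    inputs:
    - preprocess.outputs
  - name: outlier-detector
    inputs:
    - preprocess.outputs
  output:
    steps:
    - classifier
Because only classifier appears under output.steps, the detectors run asynchronously off the streamed data and do not add latency to the synchronous path.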
This section provides examples for testing operations with Seldon so that you can run your own models, experiments, pipelines, and explainers.
A first target of load testing should be determining the maximum throughput that one model replica is able to sustain. We’ll refer to this as the (single replica) load saturation point. Increasing the number of inference requests per second (RPS) beyond this point degrades latency due to queuing that occurs at the bottleneck (a model processing step, contention on a resource such as CPU, memory, network, etc).
We recommend determining the load saturation point by running an open-loop load test that goes through a series of stages, each having a target RPS. A stage would first linearly ramp-up to its target RPS and then hold that RPS level constant for a set amount of time. The target RPS should monotonically increase between the stages.
Typically, load testing tools generate load by creating a number of "virtual users" that send requests. Knowing the behaviour of those virtual users is critical when interpreting the results of the load test. Load tests can be set up in closed-loop mode, meaning that each virtual user sends a request and waits for the response before sending the next one. Alternatively, there is open-loop mode, where a variable number of users are instantiated in order to maintain a constant overall RPS.
When running in closed-loop mode, an undesired side-effect called coordinated omission appears: when the system gets overloaded and latency spikes, the load test users effectively reduce the actual load on the system by sending requests less frequently.
When using closed-loop mode in load testing, be aware that reported latencies at a given throughput may be significantly smaller than what will be experienced in reality. In contrast, an open-loop load tester would maintain a constant RPS, resulting in a more accurate representation of the latencies that will be experienced when running the model in production.
Each candidate has a traffic weight. The percentage of traffic will be this weight divided by the sum of traffic weights.
mirror (optional): a single model to which traffic to the candidates is mirrored. Responses from this model are not returned to the caller.
An example experiment with a defaultModel is shown below:
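A minimal sketch of such a manifest, assuming an Experiment resource with the default and candidates fields described here (the experiment name experiment-sample is illustrative):
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-sample
spec:
  default: iris
  candidates:
  - name: iris
    weight: 50
  - name: iris2
    weight: 50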
This defines a split of 50% traffic between two models iris and iris2. In this case we want to
expose this traffic split on the existing endpoint created for the iris model. This allows us to
test new versions of models (in this case iris2) on an existing endpoint (in this case iris).
The default key defines the model whose endpoint we want to change. The experiment will become
active when both underlying models are in Ready status.
An experiment over two separate models which exposes a new API endpoint is shown below:
To call the endpoint, add the header seldon-model: <experiment-name>.experiment; in this case, seldon-model: experiment-iris.experiment. For example, with curl:
For examples see the local experiments notebook.
Running an experiment between pipelines is very similar. The difference is that resourceType: pipeline needs to be defined, and in this case the candidates or mirrors refer to pipelines. An example is shown below:
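A minimal sketch, assuming two existing pipelines named pipeline-a and pipeline-b (placeholder names); note the resourceType: pipeline field:
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-pipelines
spec:
  resourceType: pipeline
  default: pipeline-a
  candidates:
  - name: pipeline-a
    weight: 50
  - name: pipeline-b
    weight: 50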
For an example see the local experiments notebook.
A mirror can be added easily for model or pipeline experiments. An example model mirror experiment is shown below:
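A minimal sketch, reusing the iris model from the earlier example and mirroring all of its traffic to iris2; the mirror's responses are discarded, as described above, and the percent field shown here is assumed rather than confirmed by this page:
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
  name: experiment-iris-mirror
spec:
  default: iris
  candidates:
  - name: iris
    weight: 100
  mirror:
    name: iris2
    percent: 100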
For an example see the local experiments notebook.
An example pipeline mirror experiment is shown below:
For an example see the local experiments notebook.
To allow cohorts to get consistent views in an experiment, each inference request passes back a response header x-seldon-route, which can be passed in future requests to the experiment to bypass the random traffic splits and get a prediction from the same sequence of models and pipelines used in the initial request.
Note: you must pass the normal seldon-model header along with the x-seldon-route header.
This is illustrated in the local experiments notebook.
As an alternative you can choose to run experiments at the service mesh level if you use one of the popular service meshes that allow header based routing in traffic splits. For further discussion see here.
Core 2 natively supports scaling based on Inference Lag, meaning the difference between incoming and outgoing requests for a model in a given period of time. This is done by configuring minReplicas or maxReplicas in the Model CRDs and making sure you configure the Core 2 install with the autoscaling.autoscalingModelEnabled helm value set to true (default is false).
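For illustration, a Model CR scaled on inference lag simply bounds its replica count. The sketch below reuses the iris example from elsewhere in these docs; the bounds themselves are illustrative:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
  requirements:
  - sklearn
  minReplicas: 1
  maxReplicas: 5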
Users can expose custom or Kubernetes-native metrics, and then target the scaling of models based on those metrics by using HorizontalPodAutoscaler. This requires exposing the right metrics, using the monitoring tool of your choice (e.g. Prometheus).
Once the approach for Model scaling is implemented, Server scaling needs to be configured.
Implement Server Scaling by either:
Enabling Autoscaling of Servers based on Model needs (sketched after this list). This is managed by Seldon's scheduler, and is enabled by setting minReplicas and maxReplicas in the Server Custom Resource and making sure you configure the Core 2 install with the autoscaling.autoscalingServerEnabled helm value set to true (the default).
If Models and Servers have a one-to-one mapping (no Multi-Model Serving), users can also define scaling of Servers using an HPA manifest that matches the HPA applied to the associated Models. This approach is outlined . It only works with custom metrics, as Kubernetes does not allow multiple HPAs to target the same Kubernetes metrics directly.
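To illustrate the first option, a Server CR with replica bounds is sketched below; the bounds are illustrative, and the mlserver name and serverConfig follow the defaults discussed in the servers section:
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
spec:
  serverConfig: mlserver
  minReplicas: 1
  maxReplicas: 5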
Based on the requirements above, one of the following three options for coordinated autoscaling of Models and Servers can be chosen:
Inference lag
✅
- Simplest Implementation - One metric across models
- Model scale down only when no inference traffic exists to that model
User-defined (HPA)
✅
- Custom scaling metric
- Requires Metrics store integration (e.g. Prometheus) - Potentially suboptimal Server packing on scale down
Alternatively, the following decision-tree showcases the approaches we recommend based on users' requirements:
When running Core 2 at scale, it is important to understand the scaling behaviour of Seldon's services as well as the scaling of the Models and Servers themselves. This is outlined in the Scaling Core Services page.
All CRD changes maintain backward compatibility with existing CRs. We introduce new Core 2 scaling configuration options in SeldonConfig (config.ScalingConfig.*), with the wider goal of centralising Core 2 configuration and allowing configuration changes after the Core 2 cluster is deployed. To ensure a smooth transition, some of these options only take effect from the next release (2.11), but end-users are encouraged to set them to the desired values before upgrading.
Upgrading when using helm is seamless, with existing helm values being used to fill in new configuration options. If not using helm, previous SeldonConfig CRs remain valid, but restrictive defaults will be used for the scaling configuration. One parameter in particular, maxShardCountMultiplier (see docs), needs to be set in order to take advantage of the new pipeline scalability features. This parameter can be changed, and its value is propagated to all components that use the config.
For full release notes, see .
Though there are no breaking changes between 2.8 and 2.9, there is some new functionality offered that requires changes to fields in our CRDs:
In Core 2.9 you can now set minReplicas to enable partial scheduling of Models. This means that users no longer have to wait for the full set of desired replicas to be available before models are loaded onto servers (e.g. when scaling up).
We've also added a spec.llm field to the Model CRD. The field is used by the PromptRuntime in Seldon's LLM Module to reference an LLM model. Only one of spec.llm and spec.explainer should be set at a given time. This allows the deployment of multiple "models" acting as prompt generators for the same LLM.
Due to the introduction of Server autoscaling, it is important to understand which type of autoscaling you want to leverage and how it can be configured. Below are the configuration options that control autoscaling behaviour. All options have corresponding command-line arguments that can be passed to seldon-scheduler when not using helm as the install method. The following helm values can be set:
Core 2.8 introduces several new fields in our CRDs:
statefulSetPersistentVolumeClaimRetentionPolicy enables users to configure the cleanup of PVCs on their servers. This field defaults to retain.
Status.selector was introduced as a mandatory field for models in 2.8.4 and made optional in 2.8.5. This field enables autoscaling with HPA.
PodSpec in the OverrideSpec
These added fields do not result in breaking changes, apart from 2.8.4 which required the setting of the Status.selector upon upgrading. This field was however changed to optional in the subsequent 2.8.5 release. Updating the CRDs (e.g. via helm) will enable users to benefit from the associated functionality.
All pods provisioned through the operator, i.e. SeldonRuntime and Server resources, now have the label app.kubernetes.io/name for identifying the pods.
Previously, the labelling was inconsistent across different versions of Seldon Core 2, with a mixture of app and app.kubernetes.io/name used.
If using the Prometheus operator ("Kube Prometheus"), please apply the v2.7.0 manifests for Seldon Core 2 according to the .
Note that these manifests need to be adjusted to discover metrics endpoints based on the existing setup.
If previous pod monitors had namespaceSelector fields set, these should be copied over and applied
to the new manifests.
If namespaces do not matter, cluster-wide metrics endpoint discovery can be set up by modifying the namespaceSelector field in the pod monitors:
Release 2.6 brings with it new custom resources SeldonConfig and SeldonRuntime, which provide
a new way to install Seldon Core 2 in Kubernetes. Upgrading in the same namespace will cause downtime
while the pods are being recreated. Alternatively, users with an external service mesh or other routing mechanism spanning multiple namespaces can bring up the system in a new namespace, redeploy models there, and then switch traffic between the namespaces.
If the new 2.6 charts are used to upgrade in an existing namespace, models will eventually be redeployed, but there will be service downtime while the core components are redeployed.
Learn how to verify your Seldon Core installation by running tests and checking component functionality.
To confirm the successful installation of Seldon Core 2, Kafka, and the service mesh, deploy a sample model and perform an inference test. Follow these steps:
Apply the following configuration to deploy the Iris model in the namespace seldon-mesh:
The output is:
Verify that the model is deployed in the namespace seldon-mesh.
When the model is deployed, the output is similar to:
Apply the following configuration to deploy a pipeline that wraps the Iris model in the namespace seldon-mesh:
The output is:
Verify that the pipeline is deployed in the namespace seldon-mesh.
When the pipeline is deployed, the output is similar to:
Use curl to send a test inference request to the deployed model. Replace <INGRESS_IP> with your service mesh's ingress IP address. Ensure that:
The Host header matches the expected virtual host configured in your service mesh.
The Seldon-Model header specifies the correct model name.
The output is similar to:
Use curl to send a test inference request through the pipeline to the deployed model. Replace <INGRESS_IP> with your service mesh's ingress IP address. Ensure that:
The Host header matches the expected virtual host configured in your service mesh.
The Seldon-Model header specifies the correct pipeline name.
To route inference requests to a pipeline endpoint, include the .pipeline suffix in the model name within the request header. This distinguishes the pipeline from a model that shares the same base name.
The output is similar to:
Learn how to configure and manage inference servers in Seldon Core 2, including MLServer and Triton server farms, model scheduling, and capability management.
By default Seldon installs two server farms using MLServer and Triton with 1 replica each. Models are scheduled onto servers based on the server's resources and whether the capabilities of the server matches the requirements specified in the Model request. For example:
This model specifies the requirement sklearn
There are default capabilities for each server, as follows:
MLServer
Triton
kubectl create ns seldon-mesh || echo "Namespace seldon-mesh already exists"
kubectl create ns seldon-monitoring || echo "Namespace seldon-monitoring already exists"
helm repo add seldon-charts https://seldonio.github.io/helm-charts/
helm repo update seldon-charts
helm upgrade seldon-core-v2-crds seldon-charts/seldon-core-v2-crds \
--namespace default \
--install
helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
--namespace seldon-mesh --set controller.clusterwide=true \
--install
curl http://${MESH_IP}/v2/models/experiment-iris/infer \
-H "Content-Type: application/json" \
-H "seldon-model: experiment-iris.experiment" \
-d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'kubectl apply -f - --namespace=seldon-mesh <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
requirements:
- sklearn
EOF

User-defined (HPA)
❌
- Coordinated Model and Server scaling
- Requires Metrics store integration (e.g. Prometheus) - No Multi-Model Serving

autoscaling.autoscalingModelEnabled, with corresponding cmd line arg: --enable-model-autoscaling (defaults to false): enable or disable native model autoscaling based on lag thresholds. Enabling this assumes that lag (number of inference requests "in-flight") is a representative metric based on which to scale your models in a way that makes efficient use of resources.
autoscaling.autoscalingServerEnabled with corresponding cmd line arg: --enable-server-autoscaling (defaults to "true"): enable to use native server autoscaling, where the number of server replicas is set according to the number of replicas required by the models loaded onto that server.
autoscaling.serverPackingEnabled with corresponding cmd line arg: --server-packing-enabled (experimental, defaults to "false"): enable server packing to try and reduce the number of server replicas on model scale-down.
autoscaling.serverPackingPercentage with corresponding cmd line arg: --server-packing-percentage (experimental, defaults to "0.0"): controls the percentage of model replica removals (due to model scale-down or deletion) that should trigger packing
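For example, when installing via helm these values might be set as follows (a sketch; adjust the release name, namespace, and flags to your installation):
helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
  --namespace seldon-mesh \
  --set autoscaling.autoscalingModelEnabled=true \
  --set autoscaling.autoscalingServerEnabled=true \
  --set autoscaling.serverPackingEnabled=false \
  --install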
Model 1 adds 10 to the input; Model 2 multiplies the input by 10. The structure of the artifact repository is shown below:
Ensure that you have installed Seldon Core 2 in the namespace seldon-mesh.
Ensure that you are performing these steps in the directory where you have downloaded the samples.
Get the IP address of the Seldon Core 2 instance running with Istio:
Servers can be defined with a capabilities field to indicate custom configurations (e.g. Python dependencies). For instance:
These capabilities override the ones from the serverConfig: mlserver. A model that takes advantage of this is shown below:
The above model will be matched with the previous custom server mlserver-134.
Servers can also be set up with extraCapabilities, which add to the existing capabilities from the referenced ServerConfig. For instance:
This server, mlserver-extra, inherits a default set of capabilities via serverConfig: mlserver.
These defaults are discussed above.
The extraCapabilities are appended to these to create a single list of capabilities for this server.
Models can then specify requirements to select a server that satisfies those requirements as follows.
The capabilities field takes precedence over the extraCapabilities field.
For some examples see here.
This is not supported within Docker; for Kubernetes, see here
for ns in op-ns1 op-ns2 ns1 ns2 ns3 ns4; do kubectl create ns "$ns"; done
helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
--namespace op-ns1 \
--set controller.clusterwide=true \
--set "controller.watchNamespaces={ns1,ns2}" \
--install
helm upgrade seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
--namespace op-ns2 \
--set controller.clusterwide=true \
--set "controller.watchNamespaces={ns3,ns4}" \
--set controller.skipClusterRoleCreation=true \
--install
helm upgrade seldon-core-v2-runtime seldon-charts/seldon-core-v2-runtime \
--namespace seldon-mesh \
--install
helm upgrade seldon-core-v2-servers seldon-charts/seldon-core-v2-servers \
--namespace seldon-mesh \
--install
kubectl get pods -n seldon-mesh
NAME READY STATUS RESTARTS AGE
hodometer-749d7c6875-4d4vw 1/1 Running 0 4m33s
mlserver-0 3/3 Running 0 4m10s
seldon-dataflow-engine-7b98c76d67-v2ztq 0/1 CrashLoopBackOff 5 (49s ago) 4m33s
seldon-envoy-bb99f6c6b-4mpjd 1/1 Running 0 4m33s
seldon-modelgateway-5c76c7695b-bhfj5 1/1 Running 0 4m34s
seldon-pipelinegateway-584c7d95c-bs8c9 1/1 Running 0 4m34s
seldon-scheduler-0 1/1 Running 0 4m34s
seldon-v2-controller-manager-5dd676c7b7-xq5sm 1/1 Running 0 4m52s
triton-0 2/3 Running 0 4m10s
spec:
namespaceSelector:
any: true
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/iris condition met
kubectl apply -f - --namespace=seldon-mesh <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: irispipeline
spec:
steps:
- name: iris
output:
steps:
- iris
EOF
pipeline.mlops.seldon.io/irispipeline created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/irispipeline condition met
curl -k http://<INGRESS_IP>:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: iris" \
-d '{
"inputs": [
{
"name": "predict",
"shape": [1, 4],
"datatype": "FP32",
"data": [[1, 2, 3, 4]]
}
]
}'{"model_name":"iris_1","model_version":"1","id":"f4d8b82f-2af3-44fb-b115-60a269cbfa5e","parameters":{},"outputs":[{"name":"predict","shape":[1,1],"datatype":"INT64","parameters":{"content_type":"np"},"data":[2]}]}curl -k http://<INGRESS_IP>:80/v2/models/irispipeline/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: irispipeline.pipeline" \
-d '{
"inputs": [
{
"name": "predict",
"shape": [1, 4],
"datatype": "FP32",
"data": [[1, 2, 3, 4]]
}
]
}'{"model_name":"","outputs":[{"data":[2],"name":"predict","shape":[1,1],"datatype":"INT64","parameters":{"content_type":"np"}}]}curl -k http://<INGRESS_IP>:80/v2/models/math/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: math" \
-d '{
"model_name": "math",
"inputs": [
{
"name": "INPUT",
"datatype": "FP32",
"shape": [4],
"data": [1, 2, 3, 4]
}
]
}' | jq
{
"model_name": "math_1",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
4
],
"data": [
11.0,
12.0,
13.0,
14.0
]
}
]
}
seldon model infer math --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"math","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"modelName": "math_1",
"modelVersion": "1",
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
11,
12,
13,
14
]
}
}
]
}
curl -k http://<INGRESS_IP>:80/v2/models/math/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: math" \
-d '{
"model_name": "math",
"inputs": [
{
"name": "INPUT",
"datatype": "FP32",
"shape": [4],
"data": [1, 2, 3, 4]
}
]
}' | jq
{
"model_name": "math_2",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
4
],
"data": [
10.0,
20.0,
30.0,
40.0
]
}
]
}
seldon model infer math --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"math","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"modelName": "math_2",
"modelVersion": "1",
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
10,
20,
30,
40
]
}
}
]
}
config.pbtxt
1/model.py <add 10>
2/model.py <mul 10>
ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Seldon Core 2: http://$ISTIO_INGRESS"cat ./models/multi-version-1.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: math
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/multi-version"
artifactVersion: 1
requirements:
- triton
- python
kubectl apply -f ./models/multi-version-1.yaml -n seldon-mesh
model.mlops.seldon.io/math created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/math condition met
cat ./models/multi-version-2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: math
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/multi-version"
artifactVersion: 2
requirements:
- triton
- python
kubectl apply -f ./models/multi-version-2.yaml -n seldon-mesh
model.mlops.seldon.io/math configured
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/math condition met
kubectl delete -f ./models/multi-version-1.yaml -n seldon-mesh
model.mlops.seldon.io "math" deleted
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
- name: SELDON_SERVER_CAPABILITIES
value: "mlserver,alibi-detect,alibi-explain,huggingface,lightgbm,mlflow,python,sklearn,spark-mlib,xgboost"
- name: SELDON_SERVER_CAPABILITIES
value: "triton,dali,fil,onnx,openvino,python,pytorch,tensorflow,tensorrt"
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-134
spec:
serverConfig: mlserver
capabilities:
- mlserver-1.3.4
podSpec:
containers:
- image: seldonio/mlserver:1.3.4
name: mlserver
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- mlserver-1.3.4
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-extra
spec:
serverConfig: mlserver
extraCapabilities:
- extra
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: extra-model-requirements
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- extra
Learn how to implement HPA-based autoscaling for both Models and Servers in single-model serving deployments.
This page describes how to implement autoscaling for both Models and Servers using Kubernetes HPA (Horizontal Pod Autoscaler) in a single-model serving setup. This approach is specifically designed for deployments where each Server hosts exactly one Model replica (1:1 mapping between Models and Servers).
In single-model serving deployments, you can use HPA to scale both Models and their associated Servers independently, while ensuring they scale in a coordinated manner. This is achieved by:
Setting up HPA for Models based on custom metrics (e.g., RPS)
Setting up matching HPA configurations for Servers
Ensuring both HPAs target the same metrics and scaling policies
Only custom metrics from Prometheus are supported. Native Kubernetes resource metrics such as CPU or memory are not. This limitation exists because of HPA's design: In order to prevent multiple HPA CRs from issuing conflicting scaling instructions, each HPA CR must exclusively control a set of pods which is disjoint from the pods controlled by other HPA CRs. In Seldon Core 2, CPU/memory metrics can be used to scale the number of Server replicas via HPA. However, this also means that the CPU/memory metrics from the same set of pods can no longer be used to scale the number of model replicas.
No Multi-Model Serving - this approach requires a 1-1 mapping of Models and Servers, meaning that consolidation of multiple Models onto shared Servers (Multi-Model Serving) is not possible.
In order to set up HPA Autoscaling for Models and Servers together, metrics need to be exposed in the same way as is explained in the tutorial. If metrics have not yet been exposed, follow that workflow until you are at the point where you will configure and apply the HPA manifests.
In this implementation, Servers are configured to autoscale via a separate HPA manifest that targets the same metric as the Models' HPA. For the scaling to stay in sync, it is important to apply the two HPA manifests simultaneously and to keep both the scaling metric and any scaling policies identical across them. This ensures that the Models and the Servers are scaled up and down at approximately the same time. Small variations in the scale-up time are expected because each HPA samples the metrics independently, at regular intervals.
Here's an example configuration, utilizing the same infer_rps metric as was set up in the .
In order to ensure similar scaling behaviour between Models and Servers, the number of minReplicas and maxReplicas, as well as any other configured scaling policies should be kept in sync across the HPA for the model and the server.
Each HPA CR has its own timer on which it samples the specified custom metrics. This timer starts when the CR is created, with sampling of the metric done at regular intervals (by default, every 15 seconds). As a side effect, creating the Model HPA and the Server HPA (for a given model) at different times means that the scaling decisions for the two are taken at different times. Even when creating the two CRs together as part of the same manifest, there will usually be a small delay between the points at which the Model and Server spec.replicas values are changed. Despite this delay, the two converge to the same number when the decisions are taken based on the same metric (as in the previous examples).
When showing the HPA CR information via kubectl get, a column of the output will display the current metric value per replica and the target average value in the format [per replica metric value]/[target]. This information is updated in accordance to the sampling rate of each HPA resource. It is therefore expected to sometimes see different metric values for the Model and its corresponding Server.
In this implementation, the scheduler itself does not create new Server replicas when the existing replicas are not sufficient for loading a Model's replicas (one Model replica per Server replica). Whenever a Model requests more replicas than available on any of the available Servers, its ModelReady condition transitions to Status: False with a ScheduleFailed message. However, any replicas of that Model that are already loaded at that point remain available for servicing inference load.
The following elements are important to take into account when setting the HPA policies.
The speed with which new Server replicas can become available versus how many new replicas HPA may request in a given time:
The HPA scale-up policy should not be configured to request more replicas than can become available in the specified time. The following example reflects a confidence that 5 Server pods will become available within 90 seconds, with some safety margin. The default scale-up config, which also adds a percentage-based policy (doubling the existing replicas within the set periodSeconds), is not recommended for this reason.
Perhaps more importantly, there is no reason to scale faster than the time it takes for replicas to become available - this is the true maximum rate with which scaling up can happen anyway.
The duration of transient load spikes which you might want to absorb within the existing per-replica RPS margins.
The previous example, at line 13, configures a scale-up stabilization window of one minute. It means that for all of the HPA recommended replicas in the last 60 second window (4 samples of the custom metric considering the default sampling rate), only the smallest will be applied.
Such stabilization windows should be set depending on typical load patterns in your cluster: not being too aggressive in reacting to increased load will allow you to achieve cost savings, but has the disadvantage of a delayed reaction if the load spike turns out to be sustained.
Load the model
kubectl apply -f ./models/hf-text-gen.yaml
model.mlops.seldon.io/text-gen created
seldon model load -f ./models/hf-text-gen.yaml
{}
Wait for the model to be ready
kubectl get model text-gen -n ${NAMESPACE} -o json | jq -r '.status.conditions[] | select(.message == "ModelAvailable") | .status'
True
seldon model status text-gen -w ModelAvailable | jq -M .
{}
Do a REST inference call
Unload the model
Load the model
Unload the model
Learn how to implement outlier detection in Seldon Core using Alibi-Detect integration for model monitoring and anomaly detection.
cat ./models/hf-text-gen.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: text-gen
spec:
storageUri: "gs://seldon-models/mlserver/huggingface/text-generation"
requirements:
- huggingface
kubectl get --raw
The duration of any typical/expected sustained ramp-up period, and the RPS increase rate during this period.
It is useful to consider whether the replica scale-up rate configured via the policy (line 15 in the example) is able to keep up with this RPS increase rate.
Such a scenario may appear, for example, if you are planning for a smooth traffic ramp-up in a blue-green deployment as you are draining the "blue" deployment and transitioning to the "green" one.

curl --location 'http://${MESH_IP}:9000/v2/models/text-gen/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "args","shape": [1],"datatype": "BYTES","data": ["Once upon a time in a galaxy far away"]}]}'{
"model_name": "text-gen_1",
"model_version": "1",
"id": "121ff5f4-1d4a-46d0-9a5e-4cd3b11040df",
"parameters": {},
"outputs": [
{
"name": "output",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "hg_jsonlist"
},
"data": [
"{\"generated_text\": \"Once upon a time in a galaxy far away, the planet is full of strange little creatures. A very strange combination of creatures in that universe, that is. A strange combination of creatures in that universe, that is. A kind of creature that is\"}"
]
}
]
}
# samples/models/cifar10-outlier-detect.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: cifar10-outlier
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/cifar10/outlier-detector"
requirements:
- mlserver
- alibi-detect
The capabilities field overrides the defaults from the referenced ServerConfig, while the extraCapabilities field extends the existing list from the ServerConfig.
import os
os.environ["NAMESPACE"] = "seldon-mesh"MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP=MESH_IP[0]
import os
os.environ['MESH_IP'] = MESH_IP
MESH_IP
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: irisa0-model-hpa
namespace: seldon-mesh
spec:
scaleTargetRef:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
name: irisa0
minReplicas: 1
maxReplicas: 3
metrics:
- type: Object
object:
metric:
name: infer_rps
describedObject:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
name: irisa0
target:
type: AverageValue
averageValue: 3
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: mlserver-server-hpa
namespace: seldon-mesh
spec:
scaleTargetRef:
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
name: mlserver
minReplicas: 1
maxReplicas: 3
metrics:
- type: Object
object:
metric:
name: infer_rps
describedObject:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
name: irisa0
target:
type: AverageValue
averageValue: 3
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: irisa0-model-hpa
namespace: seldon-mesh
spec:
scaleTargetRef:
...
minReplicas: 1
maxReplicas: 3
behavior:
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Pods
value: 5
periodSeconds: 90
metrics:
...
seldon model infer text-gen \
'{"inputs": [{"name": "args","shape": [1],"datatype": "BYTES","data": ["Once upon a time in a galaxy far away"]}]}'{
"model_name": "text-gen_1",
"model_version": "1",
"id": "121ff5f4-1d4a-46d0-9a5e-4cd3b11040df",
"parameters": {},
"outputs": [
{
"name": "output",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "hg_jsonlist"
},
"data": [
"{\"generated_text\": \"Once upon a time in a galaxy far away, the planet is full of strange little creatures. A very strange combination of creatures in that universe, that is. A strange combination of creatures in that universe, that is. A kind of creature that is\"}"
]
}
]
}
res = !seldon model infer text-gen --inference-mode grpc \
'{"inputs":[{"name":"args","contents":{"bytes_contents":["T25jZSB1cG9uIGEgdGltZSBpbiBhIGdhbGF4eSBmYXIgYXdheQo="]},"datatype":"BYTES","shape":[1]}]}'import json
import base64
r = json.loads(res[0])
base64.b64decode(r["outputs"][0]["contents"]["bytesContents"][0])
b'{"generated_text": "Once upon a time in a galaxy far away\\n\\nThe Universe is a big and massive place. How can you feel any of this? Your body doesn\'t make sense if the Universe is in full swing \\u2014 you don\'t have to remember whether the"}'
kubectl delete model text-gen
seldon model unload text-gen
cat ./models/hf-text-gen-custom-tiny-stories.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: custom-tiny-stories-text-gen
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/huggingface-text-gen-custom-tiny-stories"
requirements:
- huggingface
kubectl apply -f ./models/hf-text-gen-custom-tiny-stories.yaml
model.mlops.seldon.io/custom-tiny-stories-text-gen created
kubectl get model custom-tiny-stories-text-gen -n ${NAMESPACE} -o json | jq -r '.status.conditions[] | select(.message == "ModelAvailable") | .status'
True
curl --location 'http://${MESH_IP}:9000/v2/models/custom-tiny-stories-text-gen/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "args","shape": [1],"datatype": "BYTES","data": ["Once upon a time in a galaxy far away"]}]}'{
"model_name": "custom-tiny-stories-text-gen_1",
"model_version": "1",
"id": "d0fce59c-76e2-4f81-9711-1c93d08bcbf9",
"parameters": {},
"outputs": [
{
"name": "output",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "hg_jsonlist"
},
"data": [
"{\"generated_text\": \"Once upon a time in a galaxy far away. It was a very special place to live.\\n\"}"
]
}
]
}
seldon model load -f ./models/hf-text-gen-custom-tiny-stories.yaml
{}
seldon model status custom-tiny-stories-text-gen -w ModelAvailable | jq -M .
{}
seldon model infer custom-tiny-stories-text-gen \
'{"inputs": [{"name": "args","shape": [1],"datatype": "BYTES","data": ["Once upon a time in a galaxy far away"]}]}'{
"model_name": "custom-tiny-stories-text-gen_1",
"model_version": "1",
"id": "d0fce59c-76e2-4f81-9711-1c93d08bcbf9",
"parameters": {},
"outputs": [
{
"name": "output",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "hg_jsonlist"
},
"data": [
"{\"generated_text\": \"Once upon a time in a galaxy far away. It was a very special place to live.\\n\"}"
]
}
]
}
res = !seldon model infer custom-tiny-stories-text-gen --inference-mode grpc \
'{"inputs":[{"name":"args","contents":{"bytes_contents":["T25jZSB1cG9uIGEgdGltZSBpbiBhIGdhbGF4eSBmYXIgYXdheQo="]},"datatype":"BYTES","shape":[1]}]}'import json
import base64
r = json.loads(res[0])
base64.b64decode(r["outputs"][0]["contents"]["bytesContents"][0])
b'{"generated_text": "Once upon a time in a galaxy far away\\nOne night, a little girl named Lily went to"}'
kubectl delete custom-tiny-stories-text-gen
seldon model unload custom-tiny-stories-text-gen
As a next step, why not try running a larger-scale model? You can find a definition for one in ./models/hf-text-gen-custom-gpt2.yaml. However, you may need to request and allocate more memory!
'172.18.255.2'
cat ./servers/custom-mlserver-capabilities.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-134
spec:
serverConfig: mlserver
capabilities:
- mlserver-1.3.4
podSpec:
containers:
- image: seldonio/mlserver:1.3.4
name: mlserver
kubectl create -f ./servers/custom-mlserver-capabilities.yaml -n ${NAMESPACE}
server.mlops.seldon.io/mlserver-134 created
kubectl wait --for condition=ready --timeout=300s server --all -n ${NAMESPACE}
server.mlops.seldon.io/mlserver condition met
server.mlops.seldon.io/mlserver-134 condition met
server.mlops.seldon.io/triton condition met
cat ./models/iris-custom-requirements.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- mlserver-1.3.4
kubectl create -f ./models/iris-custom-requirements.yaml -n ${NAMESPACE}
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model --all -n ${NAMESPACE}
model.mlops.seldon.io/iris condition met
seldon model infer iris --inference-host ${MESH_IP}:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'{
"model_name": "iris_1",
"model_version": "1",
"id": "057ae95c-e6bc-4f57-babf-0817ff171729",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
kubectl delete -f ./models/iris-custom-server.yaml -n ${NAMESPACE}
model.mlops.seldon.io "iris" deleted
kubectl delete -f ./servers/custom-mlserver.yaml -n ${NAMESPACE}
server.mlops.seldon.io "mlserver-134" deleted
cat ./servers/custom-mlserver.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-134
spec:
serverConfig: mlserver
extraCapabilities:
- mlserver-1.3.4
podSpec:
containers:
- image: seldonio/mlserver:1.3.4
name: mlserver
kubectl create -f ./servers/custom-mlserver.yaml -n ${NAMESPACE}
server.mlops.seldon.io/mlserver-134 created
kubectl wait --for condition=ready --timeout=300s server --all -n ${NAMESPACE}
server.mlops.seldon.io/mlserver condition met
server.mlops.seldon.io/mlserver-134 condition met
server.mlops.seldon.io/triton condition met
cat ./models/iris-custom-server.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
server: mlserver-134
kubectl create -f ./models/iris-custom-server.yaml -n ${NAMESPACE}
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model --all -n ${NAMESPACE}
model.mlops.seldon.io/iris condition met
seldon model infer iris --inference-host ${MESH_IP}:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'{
"model_name": "iris_1",
"model_version": "1",
"id": "a3e17c6c-ee3f-4a51-b890-6fb16385a757",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
kubectl delete -f ./models/iris-custom-server.yaml -n ${NAMESPACE}
model.mlops.seldon.io "iris" deleted
kubectl delete -f ./servers/custom-mlserver.yaml -n ${NAMESPACE}
server.mlops.seldon.io "mlserver-134" deleted
Learn how to configure storage secrets in Seldon Core 2 for secure model artifact access using Rclone, including AWS S3, GCS, and MinIO integration.
Inference artifacts referenced by Models can be stored in any of the storage backends supported by Rclone. This includes local filesystems, AWS S3, and Google Cloud Storage (GCS), among others. Configuration is provided out-of-the-box for public GCS buckets, which enables the use of Seldon-provided models like in the below example:
This configuration is provided by the Kubernetes Secret seldon-rclone-gs-public.
It is made available to Servers as a preloaded secret.
You can define and use your own storage configurations in exactly the same way.
To define a new storage configuration, you need the following details:
Remote name
Remote type
Provider parameters
A remote is what Rclone calls a storage location.
The type defines what protocol Rclone should use to talk to this remote.
A provider is a particular implementation for that storage type.
Some storage types have multiple providers, such as s3 having AWS S3 itself, MinIO, Ceph, and so on.
The remote name is your choice.
The prefix you use for models in spec.storageUri must be the same as this remote name.
The remote type is one of the values .
For example, for AWS S3 it is s3 and for Dropbox it is dropbox.
The provider parameters depend entirely on the remote type and the specific provider you are using.
Please check the Rclone documentation for the appropriate provider.
Note that the Rclone docs for storage types call the parameters "properties" and provide both config and env var formats; you need to use the config format.
For example, the GCS parameter --gcs-client-id described there should be used as client_id.
For reference, this format is described in the .
Note that we do not support the use of opts discussed in that section.
Kubernetes Secrets are used to store Rclone configurations, or storage secrets, for use by Servers. Each Secret should contain exactly one Rclone configuration.
A Server can use storage secrets in one of two ways:
It can dynamically load a secret specified by a Model in its .spec.secretName
It can use global configurations made available via
The name of a Secret is entirely your choice, as is the name of the data key in that Secret. All that matters is that there is a single data key and that its value is in the format described above.
Rather than Models always having to specify which secret to use, a Server can load storage secrets ahead of time. These can then be reused across many Models.
When using a preloaded secret, the Model definition should leave .spec.secretName empty.
The protocol prefix in .spec.storageUri still needs to match the remote name specified by a storage secret.
The secrets to preload are named in a centralised ConfigMap called seldon-agent.
This ConfigMap applies to all Servers managed by the same SeldonRuntime.
By default this ConfigMap only includes seldon-rclone-gs-public, but can be extended with your own secrets as shown below:
The easiest way to change this is to update your SeldonRuntime.
If your SeldonRuntime is configured using the seldon-core-v2-runtime Helm chart, the corresponding value is config.agentConfig.rclone.configSecrets.
This can be used as shown below:
Otherwise, if your SeldonRuntime is configured directly, you can add secrets by setting .spec.config.agentConfig.rclone.config_secrets.
This can be used as follows:
Assuming you have installed MinIO in the minio-system namespace, a corresponding secret could be:
You can then reference this in a Model with .spec.secretName:
For Google Cloud Storage, create a service account that Rclone can use for access. You can generate the credentials for a service account using the gcloud CLI:
The contents of gcloud-application-credentials.json can be put into a secret:
You can then reference this in a Model with .spec.secretName:
To run this example in Kind, we need to start Kind with access to a local folder where our models are located. In this example we use a folder in /tmp and associate it with a path in the container.
To start a Kind cluster, see .
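A minimal Kind cluster config sketch for this, assuming the models live under /tmp/models on the host and should appear at /models inside the node container:
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
  extraMounts:
  - hostPath: /tmp/models
    containerPath: /models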
Create the local folder for models and copy an example iris sklearn model to it.
Create a storage class and an associated persistent volume referencing the /models folder where the models are stored.
Now create a new Server based on the provided MLServer configuration, extended with our PVC by adding it to the rclone container; this allows Rclone to move models from the PVC onto the server.
We also add a new capability, pvc, so that models requiring the PVC can be scheduled to this server.
Use a simple sklearn iris classification model with the added pvc requirement so that MLServer with the PVC is targeted during scheduling.
Do a gRPC inference call
Learn how to set up and configure Kafka for Seldon Core in production environments, including cluster setup and security configuration.
Kafka is a component in the Seldon Core 2 ecosystem that provides scalable, reliable, and flexible communication for machine learning deployments. It serves as a strong backbone for building complex inference pipelines, managing high-throughput asynchronous predictions, and integrating seamlessly with event-driven systems, which are key capabilities for contemporary enterprise-grade ML platforms.
An inference request is a request sent to a machine learning model to make a prediction or inference based on input data. It is a core concept in deploying machine learning models in production, where models serve predictions to users or systems in real-time or batch mode.
To explore this feature of Seldon Core 2, you need to integrate with Kafka. Integrate Kafka through managed cloud services or by deploying it directly within a Kubernetes cluster.
provides more information about the encryption and authentication.
provides the steps to configure some of the managed Kafka services.
Seldon Core 2 requires Kafka to implement data-centric inference Pipelines. To install Kafka for testing purposes in your Kubernetes cluster, use . For more information, see
Learn more about using taints and tolerations with node affinity or node selector to allocate resources in a Kubernetes cluster.
When deploying machine learning models in Kubernetes, you may need to control which infrastructure resources these models use. This is especially important in environments where certain workloads, such as resource-intensive models, should be isolated from others or where specific hardware such as GPUs, needs to be dedicated to particular tasks. Without fine-grained control over workload placement, models might end up running on suboptimal nodes, leading to inefficiencies or resource contention.
For example, you may want to:
Isolate inference workloads from control plane components or other services to prevent resource contention.
Ensure that GPU nodes are reserved exclusively for models that require hardware acceleration.
Keep business-critical models on dedicated nodes to ensure performance and reliability.
Run external dependencies like Kafka on separate nodes to avoid interference with inference workloads.
To solve these problems, Kubernetes provides mechanisms such as taints, tolerations, and nodeAffinity or nodeSelector to control resource allocation and workload scheduling.
Taints are applied to nodes and tolerations to Pods to control which Pods can be scheduled on specific nodes within the Kubernetes cluster. Pods without a matching toleration for a node’s taint are not scheduled on that node. For instance, if a node has GPUs or other specialized hardware, you can prevent Pods that don’t need these resources from running on that node to avoid unnecessary resource usage.
When used together, taints and tolerations with nodeAffinity or nodeSelector can effectively allocate certain Pods to specific nodes, while preventing other Pods from being scheduled on those nodes.
In a Kubernetes cluster running Seldon Core 2, this involves two key configurations:
Configuring servers to run on specific nodes using mechanisms like taints, tolerations, and nodeAffinity or nodeSelector.
Configuring models so that they are scheduled and loaded on the appropriate servers.
This ensures that models are deployed on the optimal infrastructure and servers that meet their requirements.
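As a hedged sketch of the first configuration, a Server CR could pin its pods to dedicated GPU nodes through the podSpec override; the taint key, node label, and gpu capability below are placeholders rather than values defined by Seldon:
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-gpu
spec:
  serverConfig: mlserver
  extraCapabilities:
  - gpu
  podSpec:
    nodeSelector:
      nodegroup: gpu-nodes
    tolerations:
    - key: "gpu-only"
      operator: "Exists"
      effect: "NoSchedule"
A Model can then be directed to this server either by naming it in spec.server or by listing the gpu capability in its requirements, as described in the server and capability sections.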
Server configurations define how to create an inference server. By default, one is provided for Seldon MLServer and one for NVIDIA Triton Inference Server. Both of these servers support the Open Inference Protocol, which is a requirement for all inference servers. A server configuration defines the Kubernetes ReplicaSet, which includes the Seldon Agent reverse proxy as well as an Rclone server for downloading artifacts for the server. The Kustomize ServerConfig for MLServer is shown below:
This notebook will show how we can update running experiments.
We will use three SKlearn Iris classification models to illustrate experiment updates.
Load all models.
Let's call all three models individually first.
We will start an experiment to change the iris endpoint to split traffic with the iris2 model.
Learn how to configure Seldon Core installation components using SeldonConfig resource, including component specifications, Kafka settings, and tracing configuration.
The SeldonConfig resource defines the core installation components installed by Seldon. If you wish to install Seldon, you can use the SeldonRuntime resource, which allows easy overriding of some parts defined in this specification. In general, we advise core DevOps teams to use the default SeldonConfig or customize it for their usage. Individual installations of Seldon can then use the SeldonRuntime with a few overrides for the customisation needed in that namespace.
The specification contains core PodSpecs for each core component and a section for general configuration including the ConfigMaps that are created for the Agent (rclone defaults), Kafka and Tracing (open telemetry).
Some of these values can be overridden on a per namespace basis via the SeldonRuntime resource. Labels and annotations can also be set at the component level - these will be merged with the labels and annotations from the SeldonConfig resource in which they are defined and added to the component's corresponding Deployment, or StatefulSet.
Learn how to implement drift detection in Seldon Core 2 using Alibi-Detect integration for model monitoring and batch processing.
Drift detection models are treated as any other Model. You can run any saved drift detection model by
adding the requirement alibi-detect.
An example drift detection model from the CIFAR10 image classification example is shown below:
Usually you would run these models in an asynchronous part of a Pipeline, i.e. they are not connected to the output of the Pipeline which defines the synchronous path. For example, the CIFAR-10 image detection example uses a pipeline as shown below:
Note how the cifar10-drift model is not part of the path to the outputs. Drift alerts can be
read from the Kafka topic of the model.
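To peek at those alerts you can consume the drift model's output topic directly. The sketch below assumes the topic naming shown earlier (seldon.<namespace>.model.<model>.outputs) and a broker address in $KAFKA_BOOTSTRAP:
./kafka-console-consumer.sh --bootstrap-server $KAFKA_BOOTSTRAP \
  --topic seldon.seldon-mesh.model.cifar10-drift.outputs \
  --from-beginning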
When tuning performance for pipelines, reducing the overhead of Core 2 components responsible for data-processing within a pipeline is another aspect to consider. In Core 2, four components influence this overhead:
pipelinegateway which handles pipeline requests
modelgateway which sends requests to model inference servers
# samples/models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
import os
os.environ["NAMESPACE"] = "seldon-mesh"
MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP=MESH_IP[0]
import os
os.environ['MESH_IP'] = MESH_IP
MESH_IP
'172.19.255.1'
dataflow-engine runs Kafka KStream topologies to manage data streamed between models in a pipeline
the Kafka cluster
Given that Core 2 uses Kafka as the messaging system to communicate data between models, lowering the network latency between Core 2 and Kafka (especially when the Kafka installation is in a separate cluster to Core) will improve pipeline performance.
Additionally, the number of Kafka partitions per topic (which must be fixed across all models in a pipeline) significantly influences
Kafka’s maximum throughput, and
the effective number of replicas for pipelinegateway, dataflow-engine and modelgateway
As a baseline for serving a high inferencing RPS across multiple pipelines, we recommend using as many replicas of pipelinegateway and dataflow-engine as you have Kafka topic partitions in order to leverage the balanced distribution of inference traffic. In this case, each dataflow-engine will process the data from one partition, across all pipelines. Increasing the number of dataflow-engine replicas further starts sharding the pipelines across the available replicas, with each pipeline being processed by maxShardCountMultiplier replicas (see detailed pipeline scalability docs for configuration details)
Similarly, modelgateway can handle more throughput if its number of workers and number of replicas is increased. modelgateway has two scalability parameters that can be set via environment variables:
MODELGATEWAY_NUM_WORKERS
MODELGATEWAY_MAX_NUM_CONSUMERS
Each model within a Kubernetes namespace is consistently assigned to one modelgateway consumer (based on its index in a hash table of size MODELGATEWAY_MAX_NUM_CONSUMERS); the size of the hash table therefore influences how many models share the same consumer.
For each consumer, a MODELGATEWAY_NUM_WORKERS number of lightweight inference workers (goroutines) are created to forward requests to the inference servers and wait for responses.
Increasing these parameters (starting with an increase in the number of workers) will improve throughput if the modelgateway pod has enough resources to support more workers and consumers.
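A sketch of raising these values by overriding the modelgateway pod spec in a SeldonRuntime; the component and container names are assumptions, and the numbers are purely illustrative:
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
  name: seldon
  namespace: seldon-mesh
spec:
  seldonConfig: default
  overrides:
  - name: seldon-modelgateway          # assumed component name
    podSpec:
      containers:
      - name: modelgateway             # assumed container name
        env:
        - name: MODELGATEWAY_NUM_WORKERS
          value: "16"                  # workers per consumer (illustrative)
        - name: MODELGATEWAY_MAX_NUM_CONSUMERS
          value: "100"                 # size of the consumer hash table (illustrative)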
%env INFER_ENDPOINT=0.0.0.0:9000env: INFER_ENDPOINT=0.0.0.0:9000
cat ./models/tfsimple1.yaml
# samples/models/cifar10-drift-detect.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: cifar10-drift
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/cifar10/drift-detector"
requirements:
- mlserver
- alibi-detect
Models will still be ready even though the Pipeline is terminated.
Now we update the experiment to change to a split with the iris3 model.
Now we should see a split with the iris3 model.
Now that the experiment has been stopped, we check everything is as before.
Here we test changing the model we want to split traffic on. We will use three SKLearn Iris classification models to illustrate.
Let's call all three models to verify initial conditions.
Now we start an experiment to change calls to the iris model to split with the iris2 model.
Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.
Now let's change the model the experiment modifies to the iris3 model, splitting traffic between that and iris2.
Let's check the iris model is now as before but the iris3 model has traffic split.
Finally, let's check that, now the experiment has stopped, everything is as at the start.
seldon model load -f ./models/sklearn1.yaml
seldon model load -f ./models/sklearn2.yaml
seldon model load -f ./models/sklearn3.yaml{}
{}
{}
# samples/auth/agent.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: seldon-agent
data:
agent.json: |-
{
"rclone" : {
"config_secrets": ["seldon-rclone-gs-public","minio-secret"]
},
}
config:
agentConfig:
rclone:
configSecrets:
- my-s3
- custom-gcs
- minio-in-cluster
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
name: seldon
spec:
seldonConfig: default
config:
agentConfig:
rclone:
config_secrets:
- my-s3
- custom-gcs
- minio-in-cluster
...
# samples/auth/minio-secret.yaml
apiVersion: v1
kind: Secret
metadata:
name: minio-secret
namespace: seldon-mesh
type: Opaque
stringData:
s3: |
type: s3
name: s3
parameters:
provider: minio
env_auth: false
access_key_id: minioadmin
secret_access_key: minioadmin
endpoint: http://minio.minio-system:9000
# samples/models/sklearn-iris-minio.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "s3://models/iris"
secretName: "minio-secret"
requirements:
- sklearn
gcloud iam service-accounts keys create \
gcloud-application-credentials.json \
--iam-account [SERVICE-ACCOUNT--NAME]@[PROJECT-ID].iam.gserviceaccount.com
apiVersion: v1
kind: Secret
metadata:
name: gcs-bucket
type: Opaque
stringData:
gcs: |
type: gcs
name: gcs
parameters:
service_account_credentials: '<gcloud-application-credentials.json>'
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mymodel
spec:
storageUri: "gcs://my-bucket/my-path/my-pytorch-model"
secretName: "gcs-bucket"
requirements:
- pytorch
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
seldon model load -f ./models/tfsimple1.yaml{}
seldon model status tfsimple1 -w ModelAvailable | jq -M .{}
seldon model infer tfsimple1 --inference-host ${INFER_ENDPOINT} \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'{
"model_name": "tfsimple1_1",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
}
]
}
seldon model infer tfsimple1 --inference-mode grpc --inference-host ${INFER_ENDPOINT} \
'{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}'{"modelName":"tfsimple1_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
curl http://${INFER_ENDPOINT}/v2/models/tfsimple1/infer -H "Content-Type: application/json" -H "seldon-model: tfsimple1" \
-d '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'{"model_name":"tfsimple1_1","model_version":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":[1,16],"data":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]},{"name":"OUTPUT1","datatype":"INT32","shape":[1,16],"data":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}]}
grpcurl -d '{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimple1 \
${INFER_ENDPOINT} inference.GRPCInferenceService/ModelInfer{
"modelName": "tfsimple1_1",
"modelVersion": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
]
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
]
}
],
"rawOutputContents": [
"AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==",
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="
]
}
cat ./pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
seldon pipeline load -f ./pipelines/tfsimple.yaml{}
seldon pipeline status tfsimple -w PipelineReady{"pipelineName":"tfsimple","versions":[{"pipeline":{"name":"tfsimple","uid":"cg5fm6c6dpcs73c4qhe0","version":1,"steps":[{"name":"tfsimple1"}],"output":{"steps":["tfsimple1.outputs"]},"kubernetesMeta":{}},"state":{"pipelineVersion":1,"status":"PipelineReady","reason":"created pipeline","lastChangeTimestamp":"2023-03-10T09:40:41.317797761Z","modelsReady":true}}]}
seldon pipeline infer tfsimple --inference-host ${INFER_ENDPOINT} \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --inference-mode grpc --inference-host ${INFER_ENDPOINT} \
'{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}'{"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
curl http://${INFER_ENDPOINT}/v2/models/tfsimple1/infer -H "Content-Type: application/json" -H "seldon-model: tfsimple.pipeline" \
-d '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'{"model_name":"","outputs":[{"data":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32],"name":"OUTPUT0","shape":[1,16],"datatype":"INT32"},{"data":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],"name":"OUTPUT1","shape":[1,16],"datatype":"INT32"}]}
grpcurl -d '{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimple.pipeline \
${INFER_ENDPOINT} inference.GRPCInferenceService/ModelInfer{
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
]
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
]
}
],
"rawOutputContents": [
"AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==",
"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="
]
}
seldon pipeline unload tfsimple
seldon model unload tfsimple1{}
{}
cat kind-config.yaml
apiVersion: kind.x-k8s.io/v1alpha4
kind: Cluster
nodes:
- role: control-plane
extraMounts:
- hostPath: /tmp/models
containerPath: /models
mkdir -p /tmp/models
gsutil cp -r gs://seldon-models/mlserver/iris /tmp/models
cat pvc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: local-path-immediate
provisioner: rancher.io/local-path
reclaimPolicy: Delete
mountOptions:
- debug
volumeBindingMode: Immediate
---
kind: PersistentVolume
apiVersion: v1
metadata:
name: ml-models-pv
namespace: seldon-mesh
labels:
type: local
spec:
storageClassName: local-path-immediate
capacity:
storage: 1Gi
accessModes:
- ReadWriteOnce
hostPath:
path: "/models"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
name: ml-models-pvc
namespace: seldon-mesh
spec:
storageClassName: local-path-immediate
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
selector:
matchLabels:
type: local
cat server.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-pvc
spec:
serverConfig: mlserver
extraCapabilities:
- "pvc"
podSpec:
volumes:
- name: models-pvc
persistentVolumeClaim:
claimName: ml-models-pvc
containers:
- name: rclone
volumeMounts:
- name: models-pvc
mountPath: /var/models
cat ./iris.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "/var/models/iris"
requirements:
- sklearn
- pvc
kubectl create -f iris.yaml -n ${NAMESPACE}
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model --all -n ${NAMESPACE}
model.mlops.seldon.io/iris condition met
kubectl get model iris -n ${NAMESPACE} -o jsonpath='{.status}' | jq -M .
{
"conditions": [
{
"lastTransitionTime": "2022-12-24T11:04:37Z",
"status": "True",
"type": "ModelReady"
},
{
"lastTransitionTime": "2022-12-24T11:04:37Z",
"status": "True",
"type": "Ready"
}
],
"replicas": 1
}
curl -k http://${MESH_IP}:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: iris" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "predict",
"datatype": "FP32",
"shape": [1,4],
"data": [[1,2,3,4]]
}
]
}' | jq -M .
seldon model infer iris --inference-host ${MESH_IP}:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'{
"model_name": "iris_1",
"model_version": "1",
"id": "dc032bcc-3f4e-4395-a2e4-7c1e3ef56e9e",
"parameters": {
"content_type": null,
"headers": null
},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": null,
"data": [
2
]
}
]
}
curl -k http://${MESH_IP}:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: iris" \
-H "Content-Type: application/json" \
-d '{
"model_name": "iris",
"inputs": [
{
"name": "input",
"datatype": "FP32",
"shape": [1,4],
"data": [1,2,3,4]
}
]
}' | jq -M .
seldon model infer iris --inference-mode grpc --inference-host ${MESH_IP}:80 \
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .{
"modelName": "iris_1",
"modelVersion": "1",
"outputs": [
{
"name": "predict",
"datatype": "INT64",
"shape": [
"1",
"1"
],
"contents": {
"int64Contents": [
"2"
]
}
}
]
}
kubectl delete -f ./iris.yaml -n ${NAMESPACE}
model.mlops.seldon.io "iris" deleted
# samples/pipelines/cifar10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: cifar10-production
spec:
steps:
- name: cifar10
- name: cifar10-outlier
- name: cifar10-drift
batch:
size: 20
output:
steps:
- cifar10
- cifar10-outlier.outputs.is_outlier
seldon model load -f ./models/tfsimple1.yaml
seldon model status tfsimple1 -w ModelAvailable{}
{}
kubectl apply -f ./models/tfsimple2.yaml -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model tfsimple2 -n ${NAMESPACE}model.mlops.seldon.io/tfsimple2 created
model.mlops.seldon.io/tfsimple2 condition met
seldon model load -f ./models/tfsimple2.yaml
seldon model status tfsimple2 -w ModelAvailable | jq -M .{}
{}
%env INFER_REST_ENDPOINT=http://0.0.0.0:9000
%env INFER_GRPC_ENDPOINT=0.0.0.0:9000
%env SELDON_SCHEDULE_HOST=0.0.0.0:9004env: INFER_REST_ENDPOINT=http://0.0.0.0:9000
env: INFER_GRPC_ENDPOINT=0.0.0.0:9000
env: SELDON_SCHEDULE_HOST=0.0.0.0:9004
#%env INFER_REST_ENDPOINT=http://172.19.255.1:80
#%env INFER_GRPC_ENDPOINT=172.19.255.1:80
#%env SELDON_SCHEDULE_HOST=172.19.255.2:9004
cat ./pipelines/tfsimples.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimples
spec:
steps:
- name: tfsimple1
- name: tfsimple2
inputs:
- tfsimple1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple1.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple2
curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
grpcurl -d '{"name":"tfsimples"}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimples.pipeline \
${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReadyERROR:
Code: Unimplemented
Message:
kubectl apply -f ./pipelines/tfsimples.yaml -n ${NAMESPACE}
pipeline.mlops.seldon.io/tfsimples created
kubectl wait --for condition=ready --timeout=300s pipeline tfsimples -n ${NAMESPACE}
seldon pipeline load -f ./pipelines/tfsimples.yaml
seldon pipeline status tfsimples -w PipelineReady{"pipelineName":"tfsimples", "versions":[{"pipeline":{"name":"tfsimples", "uid":"ciepit2i8ufs73flaitg", "version":1, "steps":[{"name":"tfsimple1"}, {"name":"tfsimple2", "inputs":["tfsimple1.outputs"], "tensorMap":{"tfsimple1.outputs.OUTPUT0":"INPUT0", "tfsimple1.outputs.OUTPUT1":"INPUT1"}}], "output":{"steps":["tfsimple2.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:47:16.365934922Z"}}]}
seldon pipeline status tfsimples | jq .versions[0].state.modelsReadynull
curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
grpcurl -d '{"name":"tfsimples"}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimples.pipeline \
${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReady{
}
kubectl apply -f ./models/tfsimple1.yaml -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model tfsimple1 -n ${NAMESPACE}model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple1 condition met
curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
grpcurl -d '{"name":"tfsimples"}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimples.pipeline \
${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReady{
}
curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
grpcurl -d '{"name":"tfsimples"}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimples.pipeline \
${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReady{
"ready": true
}
seldon pipeline status tfsimples | jq .versions[0].state.modelsReadytrue
seldon pipeline unload tfsimples
curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
grpcurl -d '{"name":"tfsimples"}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimples.pipeline \
${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReadyERROR:
Code: Unimplemented
Message:
seldon pipeline status tfsimples | jq .versions[0].state.modelsReadytrue
seldon pipeline load -f ./pipelines/tfsimples.yaml
seldon pipeline status tfsimples -w PipelineReady{"pipelineName":"tfsimples", "versions":[{"pipeline":{"name":"tfsimples", "uid":"ciepj5qi8ufs73flaiu0", "version":1, "steps":[{"name":"tfsimple1"}, {"name":"tfsimple2", "inputs":["tfsimple1.outputs"], "tensorMap":{"tfsimple1.outputs.OUTPUT0":"INPUT0", "tfsimple1.outputs.OUTPUT1":"INPUT1"}}], "output":{"steps":["tfsimple2.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:47:51.626155116Z", "modelsReady":true}}]}
curl -Ik ${INFER_REST_ENDPOINT}/v2/pipelines/tfsimples/ready
grpcurl -d '{"name":"tfsimples"}' \
-plaintext \
-import-path ../apis \
-proto ../apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:tfsimples.pipeline \
${INFER_GRPC_ENDPOINT} inference.GRPCInferenceService/ModelReady{
"ready": true
}
seldon pipeline status tfsimples | jq .versions[0].state.modelsReadytrue
seldon model unload tfsimple1
seldon model unload tfsimple2
seldon pipeline status tfsimples | jq .versions[0].state.modelsReady
null
seldon pipeline unload tfsimples
import os
os.environ["NAMESPACE"] = "seldon-mesh"
MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP=MESH_IP[0]
import os
os.environ['MESH_IP'] = MESH_IP
MESH_IP'172.19.255.1'
kubectl create -f ./pipelines/tfsimples.yaml -n ${NAMESPACE}pipeline.mlops.seldon.io/tfsimples created
kubectl wait --for condition=ready --timeout=1s pipeline --all -n ${NAMESPACE}error: timed out waiting for the condition on pipelines/tfsimples
kubectl get pipeline tfsimples -o jsonpath='{.status.conditions[0]}' -n ${NAMESPACE}{"lastTransitionTime":"2022-11-14T10:25:31Z","status":"False","type":"ModelsReady"}
kubectl create -f ./models/tfsimple1.yaml -n ${NAMESPACE}
kubectl create -f ./models/tfsimple2.yaml -n ${NAMESPACE}model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n ${NAMESPACE}pipeline.mlops.seldon.io/tfsimples condition met
kubectl get pipeline tfsimples -o jsonpath='{.status.conditions[0]}' -n ${NAMESPACE}{"lastTransitionTime":"2022-11-14T10:25:49Z","status":"True","type":"ModelsReady"}
kubectl delete -f ./models/tfsimple1.yaml -n ${NAMESPACE}
kubectl delete -f ./models/tfsimple2.yaml -n ${NAMESPACE}
kubectl delete -f ./pipelines/tfsimples.yaml -n ${NAMESPACE}model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
pipeline.mlops.seldon.io "tfsimples" deleted
seldon model status iris -w ModelAvailable
seldon model status iris2 -w ModelAvailable
seldon model status iris3 -w ModelAvailable{}
{}
{}
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris2 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::50]
seldon model infer iris3 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris3_1::50]
cat ./experiments/ab-default-model.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris
candidates:
- name: iris
weight: 50
- name: iris2
weight: 50
seldon experiment start -f ./experiments/ab-default-model.yaml{}
seldon experiment status experiment-sample -w | jq -M .{
"experimentName": "experiment-sample",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon model infer iris -i 100 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::48 :iris_1::52]
cat ./experiments/ab-default-model2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris
candidates:
- name: iris
weight: 50
- name: iris3
weight: 50
seldon experiment start -f ./experiments/ab-default-model2.yaml{}
seldon experiment status experiment-sample -w | jq -M .{
"experimentName": "experiment-sample",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon model infer iris -i 100 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris3_1::42 :iris_1::58]
seldon experiment stop experiment-sample{}
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris2 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::50]
seldon model infer iris3 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris3_1::50]
seldon model unload iris
seldon model unload iris2
seldon model unload iris3{}
{}
{}
seldon model load -f ./models/sklearn1.yaml
seldon model load -f ./models/sklearn2.yaml
seldon model load -f ./models/sklearn3.yaml{}
{}
{}
seldon model status iris -w ModelAvailable
seldon model status iris2 -w ModelAvailable
seldon model status iris3 -w ModelAvailable{}
{}
{}
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris2 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::50]
seldon model infer iris3 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris3_1::50]
cat ./experiments/ab-default-model.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris
candidates:
- name: iris
weight: 50
- name: iris2
weight: 50
seldon experiment start -f ./experiments/ab-default-model.yaml{}
seldon experiment status experiment-sample -w | jq -M .{
"experimentName": "experiment-sample",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon model infer iris -i 100 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::51 :iris_1::49]
cat ./experiments/ab-default-model3.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris3
candidates:
- name: iris3
weight: 50
- name: iris2
weight: 50
seldon experiment start -f ./experiments/ab-default-model3.yaml{}
seldon experiment status experiment-sample -w | jq -M .{
"experimentName": "experiment-sample",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris3 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::25 :iris3_1::25]
seldon model infer iris2 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::50]
seldon experiment stop experiment-sample{}
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris2 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::50]
seldon model infer iris3 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris3_1::50]
seldon model unload iris
seldon model unload iris2
seldon model unload iris3{}
{}
{}
The default configuration is shown below.
In order to understand how pipelines will behave in a production setting, it is first helpful to isolate testing to the individual models within the pipeline. By obtaining the maximum throughput for each model given the infrastructure it runs on (e.g. GPU specs) and its configuration (e.g. how many workers are used), users will gain a better understanding of which models might limit the performance of the pipeline around them. Given the performance profiles of models within a pipeline, it is recommended to optimize for the desired performance of those individual models first (see here), ensuring they have the right number of replicas and are running on the right infrastructure in order to achieve the required level of performance. Similarly, once a model within a pipeline is identified as a bottleneck, refer back to the models section to optimize the performance of that model.
The performance behavior of Seldon Core 2 pipelines is more complex compared to individual models. Inference request latency can be broken down into:
The sum of the latencies for each model in the critical path of the pipeline (the one containing the bottleneck)
The per-stage data processing overhead caused by pipeline-specific operations (data movement, copying, writing to Kafka topics, etc.)
To simplify, first, we can consider a linear pipeline that consists of a chain of sequential components:
The maximum throughput achievable through the pipeline is the minimum of the maximum throughputs achievable by each individual model in the pipeline, given a number of workers. If, say, the slowest step saturates at 200 RPS per replica while every other step can sustain 500 RPS, the pipeline as a whole will not exceed 200 RPS until that step is scaled. Exceeding this maximum will create a bottleneck in processing, degrading performance for the pipeline. To prevent bottlenecking from a given step in your pipeline, refer back to the models section to optimize the performance of that model. For example, you can:
Increase the resources available to a model’s server, as well as the number of workers
Increase the number of model replicas among which the load is balanced (autoscaling can help set the correct number)
In the case of a more complex, non-linear pipeline, the first step would be to identify the critical path within the pipeline (the path containing the model whose throughput gets saturated first), and then, based on that critical path, follow the same steps above for any model in that path that creates a bottleneck. There will always be a bottleneck model in a pipeline; the goal is to balance the number of replicas of each model and/or the MLServer resources together with the number of workers so that each stage in the pipeline is able to handle the request throughput with as little queueing as possible between the pipeline stages, therefore reducing overall latency.
This section covers various aspects of optimizing model performance in Seldon Core 2, from initial load testing to infrastructure setup and inference optimization. Each subsection provides detailed guidance on different aspects of model performance tuning:
Learn how to conduct effective load testing to understand your model's performance characteristics:
Determining load saturation points
Understanding closed-loop vs. open-loop testing
Determining the right number of replicas based on your configuration (model, infrastructure, etc.)
Setting up reproducible test environments
Interpreting test results for autoscaling configuration
Explore different approaches to optimize inference performance:
Choosing between gRPC and REST protocols
Implementing adaptive batching
Optimizing input dimensions
Configuring parallel processing with workers
Understand how to configure the underlying infrastructure for optimal model performance:
Choosing between CPU and GPU deployments
Setting appropriate CPU specifications
Configuring thread affinity
Managing memory allocation
Each of these aspects plays a crucial role in achieving optimal model performance. We recommend starting with load testing to establish a baseline, then using the insights gained to inform your inference optimization and infrastructure setup strategies.
Learn how to create and manage ML inference pipelines in Seldon Core, including model chaining, tensor mapping, and conditional logic.
Pipelines allow models to be connected into flows of data transformations. This allows more complex machine learning pipelines to be created with multiple models, feature transformations and monitoring components such as drift and outlier detectors.
The simplest way to create Pipelines is by defining them with the Pipeline custom resource. This format is accepted by our Kubernetes implementation and can also be used locally via our seldon CLI.
Internally, in both cases, Pipelines are created via our scheduler gRPC service. Advanced users could submit Pipelines directly using this gRPC service.
Learn how to manage model artifacts in Seldon Core, including storage, versioning, and deployment workflows.
To run your model inside Seldon you must supply an inference artifact that can be downloaded and run on one of the MLServer or Triton inference servers. We list the supported artifacts below in alphabetical order.
Learn about SeldonRuntime, a Kubernetes resource for creating and managing Seldon Core instances in specific namespaces with configurable settings.
The SeldonRuntime resource is used to create an instance of Seldon installed in a particular namespace.
For the definition of SeldonConfiguration above, see the SeldonConfig resource.
The specification above contains overrides for the chosen SeldonConfig.
To override the PodSpec for a given component, the overrides field needs to specify the component
name and the PodSpec needs to specify the container name, along with fields to override.
For instance, the following overrides the resource limits for cpu and memory
Installing kube-prometheus-stack in the same Kubernetes cluster that hosts Seldon Core 2.
kube-prometheus, also known as Prometheus Operator, is a popular open-source project that provides complete monitoring and alerting solutions for Kubernetes clusters. It combines tools and components to create a monitoring stack for Kubernetes environments.
Seldon Core 2, along with any deployed models, automatically exposes metrics to Prometheus. By default, certain alerting rules are pre-configured, and an alertmanager instance is included.
You can install kube-prometheus to monitor Seldon components, and ensure that the appropriate monitoring resources (for example, the PodMonitors described later on this page) are created so that Seldon Core 2 metrics are scraped.
type SeldonConfigSpec struct {
Components []*ComponentDefn `json:"components,omitempty"`
Config SeldonConfiguration `json:"config,omitempty"`
}
type SeldonConfiguration struct {
TracingConfig TracingConfig `json:"tracingConfig,omitempty"`
KafkaConfig KafkaConfig `json:"kafkaConfig,omitempty"`
AgentConfig AgentConfiguration `json:"agentConfig,omitempty"`
ServiceConfig ServiceConfig `json:"serviceConfig,omitempty"`
}
type ServiceConfig struct {
GrpcServicePrefix string `json:"grpcServicePrefix,omitempty"`
ServiceType v1.ServiceType `json:"serviceType,omitempty"`
}
type KafkaConfig struct {
BootstrapServers string `json:"bootstrap.servers,omitempty"`
ConsumerGroupIdPrefix string `json:"consumerGroupIdPrefix,omitempty"`
Debug string `json:"debug,omitempty"`
Consumer map[string]intstr.IntOrString `json:"consumer,omitempty"`
Producer map[string]intstr.IntOrString `json:"producer,omitempty"`
Streams map[string]intstr.IntOrString `json:"streams,omitempty"`
TopicPrefix string `json:"topicPrefix,omitempty"`
}
type AgentConfiguration struct {
Rclone RcloneConfiguration `json:"rclone,omitempty" yaml:"rclone,omitempty"`
}
type RcloneConfiguration struct {
ConfigSecrets []string `json:"config_secrets,omitempty" yaml:"config_secrets,omitempty"`
Config []string `json:"config,omitempty" yaml:"config,omitempty"`
}
type TracingConfig struct {
Disable bool `json:"disable,omitempty"`
OtelExporterEndpoint string `json:"otelExporterEndpoint,omitempty"`
OtelExporterProtocol string `json:"otelExporterProtocol,omitempty"`
Ratio string `json:"ratio,omitempty"`
}
type ComponentDefn struct {
// +kubebuilder:validation:Required
Name string `json:"name"`
Labels map[string]string `json:"labels,omitempty"`
Annotations map[string]string `json:"annotations,omitempty"`
Replicas *int32 `json:"replicas,omitempty"`
PodSpec *v1.PodSpec `json:"podSpec,omitempty"`
VolumeClaimTemplates []PersistentVolumeClaim `json:"volumeClaimTemplates,omitempty"`
}Optimizing your model artefact


At a minimum you should define the SeldonConfig to use as a base for this install, for example to install in the seldon-mesh namespace with the SeldonConfig named default:
The helm chart seldon-core-v2-runtime allows easy creation of this resource and associated default
Servers for an installation of Seldon in a particular namespace.
When a SeldonConfig resource changes, any SeldonRuntime resources that reference the changed SeldonConfig will also be updated immediately. If this behaviour is not desired, you can set spec.disableAutoUpdate in the SeldonRuntime resource so that it is not updated immediately, but only when the SeldonRuntime itself or any owned resource changes.
type SeldonRuntimeSpec struct {
SeldonConfig string `json:"seldonConfig"`
Overrides []*OverrideSpec `json:"overrides,omitempty"`
Config SeldonConfiguration `json:"config,omitempty"`
// +Optional
// If set then when the referenced SeldonConfig changes we will NOT update the SeldonRuntime immediately.
// Explicit changes to the SeldonRuntime itself will force a reconcile though
DisableAutoUpdate bool `json:"disableAutoUpdate,omitempty"`
}
type OverrideSpec struct {
Name string `json:"name"`
Disable bool `json:"disable,omitempty"`
Replicas *int32 `json:"replicas,omitempty"`
ServiceType v1.ServiceType `json:"serviceType,omitempty"`
PodSpec *PodSpec `json:"podSpec,omitempty"`
}apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
name: seldon
namespace: seldon-mesh
spec:
overrides:
- name: hodometer
podSpec:
containers:
- name: hodometer
resources:
limits:
memory: 64Mi
cpu: 20m
seldonConfig: default
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonRuntime
metadata:
name: seldon
namespace: seldon-mesh
spec:
seldonConfig: default
steps allow you to specify the models you want to combine into a pipeline. Each step name will correspond to a model of the same name. These models will need to have been deployed and be available for the Pipeline to function; however, Pipelines can be deployed before or at the same time as you deploy the underlying models.
steps.inputs allow you to specify the inputs to this step.
output.steps allow you to specify the output of the Pipeline. A pipeline can have multiple paths, including flows of data that do not reach the output, e.g. drift detection steps. However, if you wish to call your Pipeline in a synchronous manner via REST/gRPC then an output must be present so the Pipeline can be treated as a function.
Model step inputs are defined with a dot notation of the form:
<stepName>|<pipelineName>.<inputs|outputs>.<tensorName>
Inputs with just a step name will be assumed to be step.outputs.
The default payloads for Pipelines is the V2 protocol which requires named tensors as inputs and outputs from a model. If you require just certain tensors from a model you can reference those in the inputs, e.g. mymodel.outputs.t1 will reference the tensor t1 from the model mymodel.
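For example, a hypothetical pipeline where a second model consumes only the t1 tensor from mymodel could be written as:
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tensor-select
spec:
  steps:
  - name: mymodel
  - name: mymodel2
    inputs:
    - mymodel.outputs.t1      # only tensor t1 from mymodel is passed to mymodel2
  output:
    steps:
    - mymodel2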
For the specification, see the V2 protocol documentation.
The simplest Pipeline chains models together: the output of one model goes into the input of the next. This will work out of the box if the output tensor names from a model match the input tensor names for the one being chained to. If they do not then the tensorMap construct presently needs to be used to define the mapping explicitly, e.g. see below for a simple chained pipeline of two tfsimple example models:
In the above we rename tensor OUTPUT0 to INPUT0 and OUTPUT1 to INPUT1. This allows these models to be chained together. The shape and data type of the tensors need to match as well.
This example can be found in the pipeline examples.
Joining allows us to combine outputs from multiple steps as input to a new step.
Caption: "Joining the outputs of two models into a third model. The dashed lines signify model outputs that are not captured in the output of the pipeline."
Here we pass the pipeline inputs to two models and then take one output tensor from each and pass them to the final model. We use the same tensorMap technique to rename tensors as discussed in the previous section.
Joins can have a join type which can be specified with inputsJoinType and can take the values:
inner: require all inputs to be available to join.
outer: wait for joinWindowMs to join any inputs, ignoring any inputs that have not sent any data by that point. This means this step of the pipeline is guaranteed to have a latency of at least joinWindowMs.
any: wait for any of the specified data sources.
This example can be found in the pipeline examples.
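A sketch of a join step using these fields, reusing the tfsimple models from the join example and an illustrative 500 ms window:
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: join-outer
spec:
  steps:
  - name: tfsimple1
  - name: tfsimple2
  - name: tfsimple3
    inputs:
    - tfsimple1.outputs.OUTPUT0
    - tfsimple2.outputs.OUTPUT1
    tensorMap:
      tfsimple1.outputs.OUTPUT0: INPUT0
      tfsimple2.outputs.OUTPUT1: INPUT1
    inputsJoinType: outer     # wait joinWindowMs, then join whatever inputs have arrived
    joinWindowMs: 500         # illustrative window in milliseconds
  output:
    steps:
    - tfsimple3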
Pipelines can create conditional flows via various methods. We will discuss each in turn.
The simplest way is to create a model that outputs different named tensors based on its decision. This way downstream steps can be dependent on different expected tensors. An example is shown below:
Caption: "Pipeline with a conditional output model. The model conditional only outputs one of the two tensors, so only one path through the graph (red or blue) is taken by a single request"
In the above we have a step conditional that either outputs a tensor named OUTPUT0 or a tensor named OUTPUT1. The mul10 step depends on an output in OUTPUT0 while the add10 step depends on an output from OUTPUT1.
Note that we also have a final Pipeline output step that does an any join on these two models, essentially outputting from the pipeline whichever data arrives from either model. This type of Pipeline can be used for multi-armed bandit solutions where you want to route traffic dynamically.
This example can be found in the pipeline examples.
It's also possible to abort pipelines when an error is produced, in effect creating a condition. This is illustrated below:
This Pipeline runs normally or throws an error based on whether the input tensors have certain values.
Sometimes you want to run a step if an output is received from a previous step but not to send the data from that step to the model. This is illustrated below:
Caption: "A pipeline with a single trigger. The model tfsimple3 only runs if the model check returns a tensor named OUTPUT. The green edge signifies that this is a trigger and not an additional input to tfsimple3. The dashed lines signify model outputs that are not captured in the output of the pipeline."
In this example the last step tfsimple3 runs only if there are outputs from tfsimple1 and tfsimple2 but also data from the check step. However, if the step tfsimple3 is run it only receives the join of data from tfsimple1 and tfsimple2.
This example can be found in the pipeline examples.
You can also define multiple triggers which need to fire based on a particular join type. For example:
Caption: "A pipeline with multiple triggers and a trigger join of type any. The pipeline has four inputs, but three of these are optional (signified by the dashed borders)."
Here the mul10 step is run if data is seen on the pipeline inputs in the ok1 or ok2 tensors based on the any join type. If data is seen on ok3 then the add10 step is run.
If we changed the triggersJoinType for mul10 to inner then both ok1 and ok2 would need to appear before mul10 is run.
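A sketch of such a pipeline, following the tensor and model names used in the caption above (the output join shown is an assumption):
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: trigger-joins
spec:
  steps:
  - name: mul10
    inputs:
    - trigger-joins.inputs.INPUT
    triggers:
    - trigger-joins.inputs.ok1
    - trigger-joins.inputs.ok2
    triggersJoinType: any     # run mul10 if data arrives on either ok1 or ok2
  - name: add10
    inputs:
    - trigger-joins.inputs.INPUT
    triggers:
    - trigger-joins.inputs.ok3
  output:
    steps:
    - mul10
    - add10
    stepsJoin: any            # assumed field name for the output join type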
Pipelines by default can be accessed synchronously via http/grpc or asynchronously via the Kafka topic created for them. However, it's also possible to create a pipeline to take input from one or more other pipelines by specifying an input section. If for example we already have the tfsimple pipeline shown below:
We can create another pipeline which takes its input from this pipeline, as shown below:
Caption: "A pipeline taking as input the output of another pipeline."
In this way pipelines can be built to extend existing running pipelines to allow extensibility and sharing of data flows.
The spec follows the same form as for a step, except that references to other pipelines are contained in the externalInputs section, which takes the form of pipeline or pipeline.step references:
<pipelineName>.(inputs|outputs).<tensorName>
<pipelineName>.(step).<stepName>.<tensorName>
Tensor names are optional and only needed if you want to take just one tensor from an input or output.
There is also an externalTriggers section which allows triggers from other pipelines.
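A sketch of a pipeline that extends the tfsimple pipeline via externalInputs; the pipeline name and tensor mapping are illustrative:
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: tfsimple-extended     # illustrative name
spec:
  input:
    externalInputs:
    - tfsimple.outputs        # consume the output of the existing tfsimple pipeline
    tensorMap:
      tfsimple.outputs.OUTPUT0: INPUT0
      tfsimple.outputs.OUTPUT1: INPUT1
  steps:
  - name: tfsimple2
  output:
    steps:
    - tfsimple2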
Further examples can be found in the pipeline-to-pipeline examples.
Present caveats:
Circular dependencies are not presently detected.
Pipeline status is local to each pipeline.
Internally, Pipelines are implemented using Kafka. Each input and output to a pipeline step has an associated Kafka topic. This has many advantages and makes auditing, replay and debugging easier, as data is preserved from every step in your pipeline.
Tracing allows you to monitor the processing latency of your pipelines.
As each request to a pipeline moves through the steps, its data will appear in the input and output topics. This allows a full audit of every transformation to be carried out.
DALI: Triton, tag dali (TBC)
Huggingface: MLServer, tag huggingface
LightGBM: MLServer, tag lightgbm
MLFlow: MLServer, tag mlflow
ONNX: Triton, tag onnx
OpenVino: Triton, tag openvino (TBC)
Custom Python: MLServer, tags python and mlserver
Custom Python: Triton, tags python and triton
PyTorch: Triton, tag pytorch
SKLearn: MLServer, tag sklearn
Spark Mlib: MLServer, tag spark-mlib (TBC)
Tensorflow: Triton, tag tensorflow
TensorRT: Triton, tag tensorrt (TBC)
Triton FIL: Triton, tag fil (TBC)
XGBoost: MLServer, tag xgboost
For many machine learning artifacts you can simply save them to a folder and load them into Seldon Core 2. Details are given below as well as a link to creating a custom model settings file if needed.
Alibi-Detect
Alibi-Explain
DALI: Follow the Triton docs to create a config.pbtxt and model folder with the artifact.
Huggingface: Create an MLServer model-settings.json with the Huggingface model required.
For MLServer targeted models you can create a model-settings.json file to help MLServer load your model and place this alongside your artifact. See the MLServer project for details.
For Triton inference server models you can create a configuration config.pbtxt file alongside your artifact.
The tag field represents the tag you need to add to the requirements part of the Model spec for
your artifact to be loaded on a compatible server. e.g. for an sklearn model:
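The sklearn-iris sample spec shows the tag in place:
# samples/models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
  requirements:
  - sklearn
  memory: 100Ki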
Alibi-Detect: MLServer, tag alibi-detect
Alibi-Explain: MLServer, tag alibi-explain
kube-prometheus provides custom resources such as ServiceMonitor, PodMonitor, and PrometheusRule. Monitoring the model deployments in Seldon Core 2 involves:
Install Seldon Core 2.
Install Ingress Controller.
Install Grafana in the namespace seldon-monitoring.
Create a namespace for the monitoring components of Seldon Core 2.
Create a YAML file to specify the initial configuration. For example, create the prometheus-values.yaml file. Use your preferred text editor to create and save the file with the following content:
Note: Make sure to include metric-labels-allowlist: pods=[*] in the Helm values file. If you are using your own Prometheus Operator installation, ensure that the pods labels, particularly app.kubernetes.io/managed-by=seldon-core, are part of the collected metrics. These labels are essential for calculating deployment usage rules.
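A minimal sketch of such a values file; the exact keys depend on the chart and version you install (the kube-state-metrics metricLabelsAllowlist key and the fullnameOverride value shown here are assumptions):
# prometheus-values.yaml - minimal sketch; adjust keys to your chart version
fullnameOverride: prometheus           # illustrative
kube-state-metrics:
  metricLabelsAllowlist:
  - pods=[*]                           # exports pod labels such as app.kubernetes.io/managed-by=seldon-core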
Change to the directory that contains the prometheus-values.yaml file and run the following command to install version 9.5.12 of kube-prometheus.
When the installation is complete, you should see this:
Check the status of the installation.
When the installation is complete, you should see this:
You can access Prometheus from outside the cluster by running the following commands:
You can access Alertmanager from outside the cluster by running the following commands:
Apply the Custom RBAC Configuration settings for kube-prometheus.
Configure metrics collection by creating the following PodMonitor resources.
When the resources are created, you should see this:
You may now be able to check the status of Seldon components in Prometheus:
Open your browser and navigate to http://127.0.0.1:9090/ to access Prometheus UI from outside the cluster.
Go to Status and select Targets.
The status of all the endpoints and the scrape details are displayed.
You can view the metrics in a Grafana dashboard after you set Prometheus as the data source and import the seldon.json dashboard located at seldon-core/v2.8.2/prometheus/dashboards in the GitHub repository.
In this notebook, we will demonstrate how to deploy a production-ready AI application with Seldon Core 2. This application will have two components - an sklearn model and a preprocessor written in Python - leveraging Core 2 Pipelines to connect the two. Once deployed, users will have an endpoint available to call the deployed application. The inference logic can be visualized with the following diagram:
To do this we will:
Set up a Server resource to deploy our models
Deploy an sklearn Model
Deploy a multi-step Pipeline, including a preprocessing step that will be run before calling our model.
Call our inference endpoint, and observe data within our pipeline
As part of the Core 2 installation, you will have installed MLServer and Triton Servers:
The Server resource outlines attributes (dependency requirements, underlying infrastructure) for the runtimes that the models you deploy will run on. By default, MLServer supports the following frameworks out of the box: alibi-detect, alibi-explain, huggingface, lightgbm, mlflow, python, sklearn, spark-mlib, xgboost.
In this example, we will create a new custom MLServer Server that we will tag with income-classifier-deps under capabilities (see the Server docs) in order to define which Models will be matched to this Server. We will deploy both our model (sklearn) and our preprocessor (python) on the same Server. This is done using the manifest below:
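A sketch of such a Server manifest, mirroring the PVC Server example earlier in this document; the Server name is illustrative:
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-income-classifier     # illustrative name
spec:
  serverConfig: mlserver
  extraCapabilities:
  - "income-classifier-deps"           # Models requiring this tag will be scheduled onto this Server
  replicas: 1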
Now we will deploy a model - in this case, we are deploying a classification model that has been trained to take 12 features as input, and output [0] or [1], representing a [Yes] or [No] prediction of whether or not an adult with certain values for the 12 features is making more than $50K/yr. This model was trained using the Census Income (or "Adult") Dataset. Extraction was done by Barry Becker from the 1994 Census database. See the dataset documentation for more details.
The model artefact is currently stored in a Seldon Google Cloud Storage bucket - the contents of the relevant folder are shown below. Alongside our model artefact, we have a model-settings.json file to help locate and load the model. For more information on the inference artefacts we support and how to configure them, see the inference artifacts documentation.
In our Model manifest below, we point to the location of the model artefact using the storageUri field. You will also notice that we have defined income-classifier-deps under requirements. This will match the Model to the Server we deployed above, as Models will only be deployed onto Servers that have capabilities that match the appropriate requirements defined in the Model manifest.
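A sketch of the Model manifest described here; the storageUri is a placeholder for the bucket path mentioned above:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: income
spec:
  storageUri: "gs://<bucket>/<path-to-income-classifier>"   # placeholder for the sample bucket path
  requirements:
  - income-classifier-deps             # matches the Server capability tag defined earlier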
In order to deploy the model, we will apply the manifest to our cluster:
We now have a deployed model, with an associated endpoint.
The endpoint that has been exposed by the above deployment will use an IP from our service mesh that we can obtain as follows:
Requests are made using the Open Inference Protocol. More details on this specification can be found in our documentation, or in the API documentation generated by our protocol buffers in the case of gRPC usage. This protocol is also shared by Triton Inference Server for serving deep learning models.
We are now ready to send a request!
We can see above that the model returned a 'data': [0] in the output. This is the prediction of the model, indicating that an individual with the attributes provided is most likely making more than $50K/yr.
Often we'd like to deploy AI applications that are more complex than just an individual model. For example, around our model we could consider deploying pre or post-processing steps, custom logic, other ML models, or drift and outlier detectors.
In this example, we will create a preprocessing step that extracts numerical values from a text file for the model to use as input. This will be implemented with custom logic using Python, and deployed as a custom model with MLServer:
Before deploying the preprocessing step with Core 2, we will test it locally:
Now that we've tested the python script locally, we will deploy the preprocessing step as a Model. This will allow us to connect it to our sklearn model using a Seldon Pipeline. To do so, we store in our cloud storage an inference artefact (in this case, our Python script) alongside a model-settings.json file, similar to the model deployed above.
As with the ML model deployed above, we have defined income-classifier-deps under requirements. This means that both the preprocessor and the model will be deployed using the same Server, enabling consolidation in terms of the resources and overheads used (for more about Multi-Model Serving, see the relevant documentation).
We've now deployed the preprocessing step! Let's test it out by calling the endpoint for it:
Now that we have our preprocessor and model deployed, we will chain them together with a pipeline.
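A sketch of that Pipeline; the preprocessor step and pipeline names are assumptions, while the tensor mapping follows the description below:
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: income-pipeline                # illustrative name
spec:
  steps:
  - name: income-preprocessor          # assumed name of the preprocessing Model
  - name: income
    inputs:
    - income-preprocessor
    tensorMap:
      income-preprocessor.outputs.OUTPUT0: INPUT0
  output:
    steps:
    - income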
The YAML defines two steps in a pipeline (the preprocessor and model), mapping the output of the preprocessor model (OUTPUT0) to the input of the income classification model (INPUT0). Seldon Core will leverage Kafka to communicate between models, meaning that all data is streamed and observable in real time.
Congratulations! You have now deployed a Seldon Pipeline that exposes an endpoint for your ML application 🥳. For more tutorials on how to use Core 2 for various use-cases and requirements, see the examples section.
Learn how to implement model explainability in Seldon Core using Anchor explainers for both tabular and text data
kubectl apply -f ./models/income.yaml -n ${NAMESPACE}
pipeline.mlops.seldon.io/income created
kubectl wait --for condition=ready --timeout=300s model income -n ${NAMESPACE}
model.mlops.seldon.io/income condition met
curl --location 'http://${SELDON_INFER_HOST}/v2/models/income/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'{
"model_name": "income_1",
"model_version": "1",
"id": "c65b8302-85af-4bac-aac5-91e3bedebee8",
"parameters": {},
"outputs": [
When looking at optimizing the latency or throughput of deployed models or pipelines, it is important to consider different approaches to the execution of inference workloads. Below are some tips on different approaches that may be relevant depending on the requirements of a given use-case:
gRPC may be more efficient than REST when your inference request payload benefits from a binary serialization format.
Grouping multiple real-time requests into small batches can improve throughput while maintaining acceptable latency. For more information on adaptive batching in MLServer, see here.
Reducing input size by reducing the dimensions of your inputs can speed up processing. This also reduces (de)serialization overhead that might be needed around model deployments.
For models deployed with MLServer, adjust parallel_workers in line with the number of CPU cores assigned to the Server pod. This is most effective for synchronous models, CPU-bound asynchronous models, and I/O-bound asynchronous models with high throughput. Proper tuning here can improve throughput, stabilize latency distribution, and potentially reduce overall latency due to reduced queuing. This is outlined in more detail in the section below.
When deploying models using MLServer, it is possible to execute inference workloads via a pool of workers running in separate processes (see the MLServer docs).
To assess the throughput behavior of individual model(s) it is first helpful to identify the maximum throughput possible with one worker (one-worker max throughput) and then the maximum throughput possible with N workers (n_workers max throughput). It is important to note that the n_workers max throughput is not simply n_workers × one-worker max throughput because workers run in separate processes, and the OS can only run as many processes in parallel as there are available CPUs. If all workers are CPU-bound, then setting n_workers higher than the number of CPU cores will be ineffective, as the OS will be limited by the number of CPU cores in terms of processes available to parallelize.
However, if some workers are waiting for either I/O or for a GPU, then setting n_workers to a value higher than the number of CPUs can help increase throughput, as some workers can then continue processing while others wait. Generally, if a model is receiving inference requests at a throughput that is lower than the one-worker max throughput, then adding additional workers will not help increase throughput or decrease latency. Similarly, if MLServer is configured with n_workers (on a pod with more than n CPUs) and the request rate is below the n_workers max throughput, latency remains constant - the system is below saturation.
Given the above, it is worth considering increasing the number of workers available to process data for a given deployed model when the system becomes saturated. Increasing the number of workers up to or slightly above the number of CPU cores available may reduce latency when the system is saturated, provided the MLServer pod has sufficient spare CPU. The effect of increasing workers also depends on whether the model is CPU-bound or uses async versus blocking operations, where CPU-bound and blocking models would benefit most. When the system is saturated with requests, those requests will queue. Increasing workers aims to run enough tasks in parallel to cope with higher throughput while minimizing queuing.
If optimizing for speed, the model artefact itself can have a big impact on performance. The speed at which an ML model can return results given input is based on the model’s architecture, model size, the precision of the model’s weights, and input size. In order to reduce the inherent complexity in the data processing required to execute an inference due to the attributes of a model, it is worth considering:
Model pruning to reduce parameters that may be unimportant. This can help reduce model size without having a big impact on the quality of the model’s outputs.
Quantization to reduce the computational and memory overheads of running inference by using model weights and activations with lower precision data types.
Dimensionality reduction of inputs to reduce the complexity of computation.
Efficient model architectures
Overall performance of your models will be constrained by the specifications of the underlying hardware which it is run on, and how it is leveraged. Choosing between CPUs and GPUs depends on the latency and throughput requirements for your use-case, as well as the type of model you are putting into production:
CPUs are generally sufficient for lightweight models, such as tree-based models, regression models, or small neural networks.
GPUs are recommended for deep learning models, large-scale neural networks, and large language models (LLMs) where lower latency is critical. Models with high matrix computation demands (like transformers or CNNs) benefit significantly from GPU acceleration.
If cost is a concern, it is recommended to start with CPUs and use profiling or performance monitoring tools (e.g. py-spy or scalene) to identify CPU bottlenecks. Based on these results, you can transition to GPUs as needed.
If working with models that will receive many concurrent requests in production, individual CPUs can often act as bottlenecks when processing data. In these cases, increasing the number of parallel workers can help. This can be configured through your serving solution as described in the section above. It is important to note that when increasing the number of workers available to process concurrent requests, it is best practice to ensure the number of workers is not significantly higher than the number of available CPU cores, in order to reduce contention. Each worker executes in its own process. This is most relevant for synchronous models where subsequent processing is blocked on completion of each request.
For more advanced control of CPU utilization, users can configure thread affinity through environment variables that determine how threads are bound to physical processors. For example, KMP_AFFINITY and OMP_NUM_THREADS can be set for technologies that use OpenMP. For more information on thread affinity, refer to your framework's documentation. In general, the ML framework that you're using may have its own recommendations for improving resource usage.
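As an illustrative sketch only (the values are assumptions to tune for your hardware, and the Server and container names are placeholders), these variables can be injected into the inference server container via the Server podSpec:
# Hypothetical example: constrain OpenMP threading for models on this server.
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-affinity
spec:
  serverConfig: mlserver
  podSpec:
    containers:
    - name: mlserver
      env:
      - name: OMP_NUM_THREADS      # threads used by each OpenMP-enabled process
        value: "2"
      - name: KMP_AFFINITY         # Intel OpenMP thread-binding policy
        value: "granularity=fine,compact,1,0"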
Finally, increasing the RAM available to your models can improve performance for memory-intensive models, such as those with large parameter counts, high-dimensional data processing, or complex intermediate computations.
Learn how to test and validate metrics collection in Seldon Core locally, including Prometheus setup and Grafana dashboards.
Learn how to leverage Core 2's native autoscaling functionality for Models
In order to set up autoscaling, users should first identify the metric on which they want to scale their models. Seldon Core provides an approach to autoscale models based on Inference Lag, and also supports more custom scaling logic by leveraging HPA, whereby you can use custom metrics to automatically scale Kubernetes resources. This page covers the first approach. Inference Lag is the difference between incoming and outgoing requests in a given period of time. If choosing this approach, it is recommended to configure Models to scale on Inference Lag and, in turn, to configure Servers to scale based on model needs.
This implementation of autoscaling is enabled if Core 2 is installed with the autoscaling.autoscalingModelEnabled Helm value set to true (default is false) and at least one of MinReplicas or MaxReplicas is set in the Model custom resource. Then, according to lag (how far the model "falls behind" in serving inference requests), the system scales the number of replicas, as in the example below.
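For example, the iris sample model could be allowed to scale between one and three replicas on inference lag; the replica bounds here are illustrative values:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
  requirements:
  - sklearn
  minReplicas: 1   # setting a bound enables lag-based autoscaling for this model
  maxReplicas: 3   # upper bound for scale-up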
Learn how to monitor operational metrics in Seldon Core, including model performance, pipeline health, and system resource usage.
While the system runs, Prometheus collects metrics that enable you to observe various aspects of Seldon Core 2, including throughput, latency, memory, and CPU usage. In addition to the standard Kubernetes metrics scraped by Prometheus, a Grafana dashboard provides a comprehensive system overview.
The Seldon Core 2 metrics that are currently exposed are listed below.
For the agent that sits next to the inference servers:
For the pipeline gateway that handles requests to pipelines:
Many of these metrics are model- and pipeline-level counters and gauges. Some of these metrics are aggregated to speed up the display of graphs. Currently, per-model histogram metrics are not stored for performance reasons. However, per-pipeline histogram metrics are stored.
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: chain
namespace: seldon-mesh
spec:
steps:
- name: model1
- name: model2
inputs:
- model1
output:
steps:
- model2
<stepName>|<pipelineName>.<inputs|outputs>.<tensorName>
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: join
spec:
steps:
- name: tfsimple1
- name: tfsimple2
- name: tfsimple3
inputs:
- tfsimple1.outputs.OUTPUT0
- tfsimple2.outputs.OUTPUT1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple2.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple3
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-conditional
spec:
steps:
- name: conditional
- name: mul10
inputs:
- conditional.outputs.OUTPUT0
tensorMap:
conditional.outputs.OUTPUT0: INPUT
- name: add10
inputs:
- conditional.outputs.OUTPUT1
tensorMap:
conditional.outputs.OUTPUT1: INPUT
output:
steps:
- mul10
- add10
stepsJoin: any
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: error
spec:
steps:
- name: outlier-error
output:
steps:
- outlier-error
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: joincheck
spec:
steps:
- name: tfsimple1
- name: tfsimple2
- name: check
inputs:
- tfsimple1.outputs.OUTPUT0
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT
- name: tfsimple3
inputs:
- tfsimple1.outputs.OUTPUT0
- tfsimple2.outputs.OUTPUT1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple2.outputs.OUTPUT1: INPUT1
triggers:
- check.outputs.OUTPUT
output:
steps:
- tfsimple3
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: trigger-joins
spec:
steps:
- name: mul10
inputs:
- trigger-joins.inputs.INPUT
triggers:
- trigger-joins.inputs.ok1
- trigger-joins.inputs.ok2
triggersJoinType: any
- name: add10
inputs:
- trigger-joins.inputs.INPUT
triggers:
- trigger-joins.inputs.ok3
output:
steps:
- mul10
- add10
stepsJoin: any
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
# samples/models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.5.0/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
kubectl create ns seldon-monitoring || echo "Namespace seldon-monitoring already exists"
fullnameOverride: seldon-monitoring
kube-state-metrics:
extraArgs:
metric-labels-allowlist: pods=[*]
echo "Prometheus URL: http://127.0.0.1:9090/"
kubectl port-forward --namespace seldon-monitoring svc/seldon-monitoring-prometheus 9090:9090
echo "Alertmanager URL: http://127.0.0.1:9093/"
kubectl port-forward --namespace seldon-monitoring svc/seldon-monitoring-alertmanager 9093:9093
CUSTOM_RBAC=https://raw.githubusercontent.com/SeldonIO/seldon-core/v2/prometheus/rbac
kubectl apply -f ${CUSTOM_RBAC}/cr.yaml
PODMONITOR_RESOURCE_LOCATION=https://raw.githubusercontent.com/SeldonIO/seldon-core/v2.8.2/prometheus/monitors
kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/agent-podmonitor.yaml
kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/envoy-servicemonitor.yaml
kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/pipelinegateway-podmonitor.yaml
kubectl apply -f ${PODMONITOR_RESOURCE_LOCATION}/server-podmonitor.yaml
cat ./models/income.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/classifier"
requirements:
- sklearn
LightGBM
Save model to a file with extension .bst.
MLFlow
Use the created artifacts/model folder from your training run.
ONNX
Save your model with the name model.onnx.
OpenVino
Follow the Triton docs to create your model artifacts.
Custom MLServer Python
Create a python file with a class that extends MLModel.
Custom Triton Python
Follow the Triton docs to create your config.pbtxt and associated python files.
PyTorch
Create a Triton config.pbtxt describing inputs and outputs and place traced torchscript in folder as model.pt.
SKLearn
Save model via joblib to a file with extension .joblib or with pickle to a file with extension .pkl or .pickle.
Spark MLlib
Follow the MLServer docs.
Tensorflow
Save model in the "Saved Model" format as model.savedmodel. If using the graphdef format, you will need to create a Triton config.pbtxt and place your model in a numbered sub folder. HDF5 is not supported.
TensorRT
Follow the Triton docs to create your model artifacts.
Triton FIL
Follow the Triton docs to create your model artifacts.
XGBoost
Save model to a file with extension .bst or .json.
Optimized model formats and runtimes like ONNX Runtime, TensorRT, or OpenVINO, which leverage hardware-specific acceleration for improved performance.


This is experimental, and these metrics are expected to evolve to better capture relevant trends as more information becomes available about system usage.
An example to show raw metrics that Prometheus will scrape.
// scheduler/pkg/metrics/agent.go
const (
// Histograms do no include pipeline label for efficiency
modelHistogramName = "seldon_model_infer_api_seconds"
// We use base infer counters to store core metrics per pipeline
modelInferCounterName = "seldon_model_infer_total"
modelInferLatencyCounterName = "seldon_model_infer_seconds_total"
modelAggregateInferCounterName = "seldon_model_aggregate_infer_total"
modelAggregateInferLatencyCounterName = "seldon_model_aggregate_infer_seconds_total"
)
// Agent metrics
const (
cacheEvictCounterName = "seldon_cache_evict_count"
cacheMissCounterName = "seldon_cache_miss_count"
loadModelCounterName = "seldon_load_model_counter"
unloadModelCounterName = "seldon_unload_model_counter"
loadedModelGaugeName = "seldon_loaded_model_gauge"
loadedModelMemoryGaugeName = "seldon_loaded_model_memory_bytes_gauge"
evictedModelMemoryGaugeName = "seldon_evicted_model_memory_bytes_gauge"
serverReplicaMemoryCapacityGaugeName = "seldon_server_replica_memory_capacity_bytes_gauge"
serverReplicaMemoryCapacityWithOverCommitGaugeName = "seldon_server_replica_memory_capacity_overcommit_bytes_gauge"
)
Note: for compatibility of Tritonclient, check this issue.
Note: binary data support in HTTP is blocked by https://github.com/SeldonIO/seldon-core-v2/issues/475.
Load the model.
mlserver_metrics_host="0.0.0.0:9006"
triton_metrics_host="0.0.0.0:9007"
pipeline_metrics_host="0.0.0.0:9009"
seldon model load -f ./models/sklearn-iris-gs.yaml
seldon model status iris -w ModelAvailable | jq -M .
{}
{}
When the system autoscales, the initial model spec is not changed (e.g. the number of replicas), so the user cannot reset the number of replicas back to the initially specified value without first explicitly changing it to a different value. If only replicas is specified by the user, autoscaling of models is disabled and the system deploys exactly that number of replicas of the model, regardless of inference lag.
The scale-up and scale-down logic, and its configuration, is described below:
Scale Up: To trigger scale-up with the approach described above, we use Inference Lag as the metric. Inference Lag is the difference between incoming and outgoing requests in a given time period. If the lag crosses a threshold, a model scale-up event is triggered. This threshold can be defined via the SELDON_MODEL_INFERENCE_LAG_THRESHOLD inference server environment variable, and it applies to all models hosted on the Server where it is configured.
Scale Down: When using model autoscaling managed by Seldon Core, model scale-down events are triggered if a model has not been used for a number of seconds. This is defined via the SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD inference server environment variable.
Rate of metrics calculation: Each agent checks the above stats periodically, and if any model hits the corresponding threshold, the agent sends an event to the scheduler to request model scaling. How often this process executes can be defined via the SELDON_SCALING_STATS_PERIOD_SECONDS inference server environment variable.
Based on the logic above, the scheduler will trigger model autoscaling if:
The model is stable (no state change in the last 5 minutes) and available.
The desired number of replicas is within range. Note that we always keep at least 1 replica of any deployed model, and we rely on overcommit to further reduce the resources used.
When autoscaling of the Servers is not set up, model scale-up is only triggered if there are sufficient server replicas available to load the new model replicas.
If autoscaling models with the approach above, it is recommended to also autoscale servers using Seldon's Server autoscaling, configured by setting MinReplicas and MaxReplicas on the Server CR, as in the sketch below. Without Server autoscaling configured, the required number of servers will not necessarily spin up, even if the desired number of model replicas cannot be fulfilled by the currently provisioned servers.
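A minimal sketch of such a Server follows; the replica bounds are illustrative values to adapt to your workload:
# Server with autoscaling bounds so server replicas can follow model demand.
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
spec:
  serverConfig: mlserver
  minReplicas: 1   # lower bound on server replicas
  maxReplicas: 5   # upper bound on server replicas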
helm upgrade --install prometheus kube-prometheus \
--version 9.5.12 \
--namespace seldon-monitoring \
--values prometheus-values.yaml \
--repo https://charts.bitnami.com/bitnami
WARNING: There are "resources" sections in the chart not set. Using "resourcesPreset" is not recommended for production. For production installations, please set the following values according to your workload needs:
- alertmanager.resources
- blackboxExporter.resources
- operator.resources
- prometheus.resources
- prometheus.thanos.resources
+info https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
kubectl rollout status -n seldon-monitoring deployment/seldon-monitoring-operator
Waiting for deployment "seldon-monitoring-operator" rollout to finish: 0 of 1 updated replicas are available...
deployment "seldon-monitoring-operator" successfully rolled out
podmonitor.monitoring.coreos.com/agent created
servicemonitor.monitoring.coreos.com/envoy created
podmonitor.monitoring.coreos.com/pipelinegateway created
podmonitor.monitoring.coreos.com/server created
seldon model unload mnist-pytorch
seldon model load -f ./models/income.yaml
{}
seldon model status income -w ModelAvailable
{}
seldon model infer income \
'{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'{
"model_name": "income_1",
"model_version": "1",
"id": "c65b8302-85af-4bac-aac5-91e3bedebee8",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"data": [
0
]
}
]
}
cat ./models/income-explainer.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-explainer
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/explainer"
explainer:
type: anchor_tabular
modelRef: income
kubectl apply -f ./models/income-explainer.yaml -n ${NAMESPACE}
model.mlops.seldon.io/income-explainer created
kubectl wait --for condition=ready --timeout=300s model income-explainer -n ${NAMESPACE}
model.mlops.seldon.io/income-explainer condition met
curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'{
"model_name": "income-explainer_1",
"model_version": "1",
"id": "a22c3785-ff3b-4504-9b3c-199aa48a62d6",
"parameters": {},
"outputs": [
{
"name": "explanation",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "str"
},
"data": [
"{\"meta\": {\"name\": \"AnchorTabular\", \"type\": [\"blackbox\"], \"explanations\": [\"local\"], \"params\": {\"seed\": 1, \"disc_perc\": [25, 50, 75], \"threshold\": 0.95, \"delta\": 0.1, \"tau\": 0.15, \"batch_size\": 100, \"coverage_samples\": 10000, \"beam_size\": 1, \"stop_on_first\": false, \"max_anchor_size\": null, \"min_samples_start\": 100, \"n_covered_ex\": 10, \"binary_cache_size\": 10000, \"cache_margin\": 1000, \"verbose\": false, \"verbose_every\": 1, \"kwargs\": {}}, \"version\": \"0.9.0\"}, \"data\": {\"anchor\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"precision\": 0.9518716577540107, \"coverage\": 0.07165109034267912, \"raw\": {\"feature\": [3, 5], \"mean\": [0.7959381044487428, 0.9518716577540107], \"precision\": [0.7959381044487428, 0.9518716577540107], \"coverage\": [0.3037383177570093, 0.07165109034267912], \"examples\": [{\"covered_true\": [[52, 5, 5, 1, 8, 1, 2, 0, 0, 0, 50, 9], [49, 4, 1, 1, 4, 4, 1, 0, 0, 0, 40, 1], [23, 4, 1, 1, 6, 1, 4, 1, 0, 0, 40, 9], [55, 2, 1, 1, 5, 1, 4, 0, 0, 0, 48, 9], [22, 4, 1, 1, 2, 3, 4, 0, 0, 0, 15, 9], [51, 4, 2, 1, 5, 0, 1, 1, 0, 0, 99, 4], [40, 4, 1, 1, 5, 1, 4, 0, 0, 0, 40, 9], [40, 6, 1, 1, 2, 0, 4, 1, 0, 0, 50, 9], [50, 5, 5, 1, 6, 0, 4, 1, 0, 0, 55, 9], [41, 4, 1, 1, 6, 0, 4, 1, 0, 0, 40, 9]], \"covered_false\": [[42, 4, 1, 1, 8, 0, 4, 1, 0, 2415, 60, 9], [48, 6, 2, 1, 5, 4, 4, 0, 0, 0, 60, 9], [37, 4, 1, 1, 5, 0, 4, 1, 0, 0, 45, 9], [57, 4, 5, 1, 8, 0, 4, 1, 0, 0, 50, 9], [63, 7, 2, 1, 8, 0, 4, 1, 0, 1902, 50, 9], [51, 4, 5, 1, 8, 0, 4, 1, 0, 1887, 47, 9], [51, 2, 2, 1, 8, 1, 4, 0, 0, 0, 45, 9], [68, 7, 5, 1, 5, 0, 4, 1, 0, 2377, 42, 0], [45, 4, 1, 1, 8, 0, 4, 1, 15024, 0, 40, 9], [45, 4, 1, 1, 8, 0, 4, 1, 0, 1977, 60, 9]], \"uncovered_true\": [], \"uncovered_false\": []}, {\"covered_true\": [[44, 6, 5, 1, 8, 3, 4, 0, 0, 1902, 60, 9], [58, 7, 2, 1, 5, 3, 1, 1, 4064, 0, 40, 1], [50, 7, 1, 1, 1, 3, 2, 0, 0, 0, 37, 9], [34, 4, 2, 1, 5, 3, 4, 1, 0, 0, 45, 9], [45, 4, 1, 1, 5, 3, 4, 1, 0, 0, 40, 9], [33, 7, 5, 1, 5, 3, 1, 1, 0, 0, 30, 6], [61, 7, 2, 1, 5, 3, 4, 1, 0, 0, 40, 0], [35, 4, 5, 1, 1, 3, 4, 1, 0, 0, 40, 9], [71, 2, 1, 1, 5, 3, 4, 0, 0, 0, 6, 9], [44, 4, 1, 1, 8, 3, 2, 1, 0, 0, 35, 9]], \"covered_false\": [[30, 4, 5, 1, 5, 3, 4, 1, 10520, 0, 40, 9], [54, 7, 2, 1, 8, 3, 4, 1, 0, 1902, 50, 9], [66, 6, 2, 1, 6, 3, 4, 1, 0, 2377, 25, 9], [35, 4, 2, 1, 5, 3, 4, 1, 7298, 0, 40, 9], [44, 4, 1, 1, 8, 3, 4, 1, 7298, 0, 48, 9], [31, 4, 1, 1, 8, 3, 4, 0, 13550, 0, 50, 9], [35, 4, 1, 1, 8, 3, 4, 1, 8614, 0, 45, 9]], \"uncovered_true\": [], \"uncovered_false\": []}], \"all_precision\": 0, \"num_preds\": 1000000, \"success\": true, \"names\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"prediction\": [0], \"instance\": [47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0], \"instances\": [[47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0]]}}}"
]
}
]
}
kubectl delete -f ./models/income-explainer.yaml -n ${NAMESPACE}
kubectl delete -f ./models/income.yaml -n ${NAMESPACE}
seldon model load -f ./models/income-explainer.yaml{}
seldon model status income-explainer -w ModelAvailable{}
seldon model infer income-explainer \
'{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'{
"model_name": "income-explainer_1",
"model_version": "1",
"id": "a22c3785-ff3b-4504-9b3c-199aa48a62d6",
"parameters": {},
"outputs": [
{
"name": "explanation",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "str"
},
"data": [
"{\"meta\": {\"name\": \"AnchorTabular\", \"type\": [\"blackbox\"], \"explanations\": [\"local\"], \"params\": {\"seed\": 1, \"disc_perc\": [25, 50, 75], \"threshold\": 0.95, \"delta\": 0.1, \"tau\": 0.15, \"batch_size\": 100, \"coverage_samples\": 10000, \"beam_size\": 1, \"stop_on_first\": false, \"max_anchor_size\": null, \"min_samples_start\": 100, \"n_covered_ex\": 10, \"binary_cache_size\": 10000, \"cache_margin\": 1000, \"verbose\": false, \"verbose_every\": 1, \"kwargs\": {}}, \"version\": \"0.9.0\"}, \"data\": {\"anchor\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"precision\": 0.9518716577540107, \"coverage\": 0.07165109034267912, \"raw\": {\"feature\": [3, 5], \"mean\": [0.7959381044487428, 0.9518716577540107], \"precision\": [0.7959381044487428, 0.9518716577540107], \"coverage\": [0.3037383177570093, 0.07165109034267912], \"examples\": [{\"covered_true\": [[52, 5, 5, 1, 8, 1, 2, 0, 0, 0, 50, 9], [49, 4, 1, 1, 4, 4, 1, 0, 0, 0, 40, 1], [23, 4, 1, 1, 6, 1, 4, 1, 0, 0, 40, 9], [55, 2, 1, 1, 5, 1, 4, 0, 0, 0, 48, 9], [22, 4, 1, 1, 2, 3, 4, 0, 0, 0, 15, 9], [51, 4, 2, 1, 5, 0, 1, 1, 0, 0, 99, 4], [40, 4, 1, 1, 5, 1, 4, 0, 0, 0, 40, 9], [40, 6, 1, 1, 2, 0, 4, 1, 0, 0, 50, 9], [50, 5, 5, 1, 6, 0, 4, 1, 0, 0, 55, 9], [41, 4, 1, 1, 6, 0, 4, 1, 0, 0, 40, 9]], \"covered_false\": [[42, 4, 1, 1, 8, 0, 4, 1, 0, 2415, 60, 9], [48, 6, 2, 1, 5, 4, 4, 0, 0, 0, 60, 9], [37, 4, 1, 1, 5, 0, 4, 1, 0, 0, 45, 9], [57, 4, 5, 1, 8, 0, 4, 1, 0, 0, 50, 9], [63, 7, 2, 1, 8, 0, 4, 1, 0, 1902, 50, 9], [51, 4, 5, 1, 8, 0, 4, 1, 0, 1887, 47, 9], [51, 2, 2, 1, 8, 1, 4, 0, 0, 0, 45, 9], [68, 7, 5, 1, 5, 0, 4, 1, 0, 2377, 42, 0], [45, 4, 1, 1, 8, 0, 4, 1, 15024, 0, 40, 9], [45, 4, 1, 1, 8, 0, 4, 1, 0, 1977, 60, 9]], \"uncovered_true\": [], \"uncovered_false\": []}, {\"covered_true\": [[44, 6, 5, 1, 8, 3, 4, 0, 0, 1902, 60, 9], [58, 7, 2, 1, 5, 3, 1, 1, 4064, 0, 40, 1], [50, 7, 1, 1, 1, 3, 2, 0, 0, 0, 37, 9], [34, 4, 2, 1, 5, 3, 4, 1, 0, 0, 45, 9], [45, 4, 1, 1, 5, 3, 4, 1, 0, 0, 40, 9], [33, 7, 5, 1, 5, 3, 1, 1, 0, 0, 30, 6], [61, 7, 2, 1, 5, 3, 4, 1, 0, 0, 40, 0], [35, 4, 5, 1, 1, 3, 4, 1, 0, 0, 40, 9], [71, 2, 1, 1, 5, 3, 4, 0, 0, 0, 6, 9], [44, 4, 1, 1, 8, 3, 2, 1, 0, 0, 35, 9]], \"covered_false\": [[30, 4, 5, 1, 5, 3, 4, 1, 10520, 0, 40, 9], [54, 7, 2, 1, 8, 3, 4, 1, 0, 1902, 50, 9], [66, 6, 2, 1, 6, 3, 4, 1, 0, 2377, 25, 9], [35, 4, 2, 1, 5, 3, 4, 1, 7298, 0, 40, 9], [44, 4, 1, 1, 8, 3, 4, 1, 7298, 0, 48, 9], [31, 4, 1, 1, 8, 3, 4, 0, 13550, 0, 50, 9], [35, 4, 1, 1, 8, 3, 4, 1, 8614, 0, 45, 9]], \"uncovered_true\": [], \"uncovered_false\": []}], \"all_precision\": 0, \"num_preds\": 1000000, \"success\": true, \"names\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"prediction\": [0], \"instance\": [47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0], \"instances\": [[47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0]]}}}"
]
}
]
}
seldon model unload income-explainer
{}
seldon model unload income
{}
cat ./models/moviesentiment.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: sentiment
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/moviesentiment-sklearn"
requirements:
- sklearn
kubectl apply -f ./models/moviesentiment.yaml -n ${NAMESPACE}
model.mlops.seldon.io/sentiment created
kubectl wait --for condition=ready --timeout=300s model sentiment -n ${NAMESPACE}
model.mlops.seldon.io/sentiment condition met
curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"parameters": {"content_type": "str"}, "inputs": [{"name": "foo", "data": ["I am good"], "datatype": "BYTES","shape": [1]}]}'{
"model_name": "sentiment_2",
"model_version": "1",
"id": "f5c07363-7e9d-4f09-aa30-228c81fdf4a4",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
0
]
}
]
}
seldon model load -f ./models/moviesentiment.yaml{}
seldon model status sentiment -w ModelAvailable{}
seldon model infer sentiment \
'{"parameters": {"content_type": "str"}, "inputs": [{"name": "foo", "data": ["I am good"], "datatype": "BYTES","shape": [1]}]}'{
"model_name": "sentiment_2",
"model_version": "1",
"id": "f5c07363-7e9d-4f09-aa30-228c81fdf4a4",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
0
]
}
]
}
cat ./models/moviesentiment-explainer.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: sentiment-explainer
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/moviesentiment-sklearn-explainer"
explainer:
type: anchor_text
modelRef: sentiment
kubectl apply -f ./models/moviesentiment-explainer.yaml -n ${NAMESPACE}
model.mlops.seldon.io/sentiment-explainer created
kubectl wait --for condition=ready --timeout=300s model sentiment-explainer -n ${NAMESPACE}
model.mlops.seldon.io/sentiment-explainer condition met
curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"parameters": {"content_type": "str"}, "inputs": [{"name": "foo", "data": ["I am good"], "datatype": "BYTES","shape": [1]}]}'seldon model load -f ./models/moviesentiment-explainer.yaml{}
seldon model status sentiment-explainer -w ModelAvailable{}
seldon model infer sentiment-explainer \
'{"parameters": {"content_type": "str"}, "inputs": [{"name": "foo", "data": ["I am good"], "datatype": "BYTES","shape": [1]}]}'Error: V2 server error: 500 Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/errors.py", line 162, in __call__
await self.app(scope, receive, _send)
File "/opt/conda/lib/python3.8/site-packages/starlette_exporter/middleware.py", line 307, in __call__
await self.app(scope, receive, wrapped_send)
File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/gzip.py", line 24, in __call__
await responder(scope, receive, send)
File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/gzip.py", line 44, in __call__
await self.app(scope, receive, self.send_with_gzip)
File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
raise exc
File "/opt/conda/lib/python3.8/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
await self.app(scope, receive, sender)
File "/opt/conda/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 21, in __call__
raise e
File "/opt/conda/lib/python3.8/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
await self.app(scope, receive, send)
File "/opt/conda/lib/python3.8/site-packages/starlette/routing.py", line 706, in __call__
await route.handle(scope, receive, send)
File "/opt/conda/lib/python3.8/site-packages/starlette/routing.py", line 276, in handle
await self.app(scope, receive, send)
File "/opt/conda/lib/python3.8/site-packages/starlette/routing.py", line 66, in app
response = await func(request)
File "/opt/conda/lib/python3.8/site-packages/mlserver/rest/app.py", line 42, in custom_route_handler
return await original_route_handler(request)
File "/opt/conda/lib/python3.8/site-packages/fastapi/routing.py", line 237, in app
raw_response = await run_endpoint_function(
File "/opt/conda/lib/python3.8/site-packages/fastapi/routing.py", line 163, in run_endpoint_function
return await dependant.call(**values)
File "/opt/conda/lib/python3.8/site-packages/mlserver/rest/endpoints.py", line 99, in infer
inference_response = await self._data_plane.infer(
File "/opt/conda/lib/python3.8/site-packages/mlserver/handlers/dataplane.py", line 103, in infer
prediction = await model.predict(payload)
File "/opt/conda/lib/python3.8/site-packages/mlserver_alibi_explain/runtime.py", line 86, in predict
output_data = await self._async_explain_impl(input_data, payload.parameters)
File "/opt/conda/lib/python3.8/site-packages/mlserver_alibi_explain/runtime.py", line 119, in _async_explain_impl
explanation = await loop.run_in_executor(self._executor, explain_call)
File "/opt/conda/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/opt/conda/lib/python3.8/site-packages/mlserver_alibi_explain/explainers/black_box_runtime.py", line 62, in _explain_impl
input_data = input_data[0]
KeyError: 0
kubectl delete -f ./models/moviesentiment-explainer.yaml -n ${NAMESPACE}
kubectl delete -f ./models/moviesentiment.yaml -n ${NAMESPACE}
seldon model unload sentiment-explainer
{}
seldon model unload sentiment
{}
// scheduler/pkg/metrics/gateway.go
// The aggregate metrics exist for efficiency, as the summation can be
// very slow in Prometheus when many pipelines exist.
const (
// Histograms do no include model label for efficiency
pipelineHistogramName = "seldon_pipeline_infer_api_seconds"
// We use base infer counters to store core metrics per pipeline
pipelineInferCounterName = "seldon_pipeline_infer_total"
pipelineInferLatencyCounterName = "seldon_pipeline_infer_seconds_total"
pipelineAggregateInferCounterName = "seldon_pipeline_aggregate_infer_total"
pipelineAggregateInferLatencyCounterName = "seldon_pipeline_aggregate_infer_seconds_total"
)
pip install tritonclient[all]
import os
os.environ["NAMESPACE"] = "seldon-mesh"
MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP=MESH_IP[0]
import os
os.environ['MESH_IP'] = MESH_IP
MESH_IP
'172.19.255.1'
cat models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
cat pipelines/iris.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: iris-pipeline
spec:
steps:
- name: iris
output:
steps:
- iris
kubectl apply -f models/sklearn-iris-gs.yaml -n ${NAMESPACE}
kubectl apply -f pipelines/iris.yaml -n ${NAMESPACE}model.mlops.seldon.io/iris created
pipeline.mlops.seldon.io/iris-pipeline created
kubectl wait --for condition=ready --timeout=300s model iris -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s pipelines iris-pipeline -n ${NAMESPACE}model.mlops.seldon.io/iris condition met
pipeline.mlops.seldon.io/iris-pipeline condition met
import tritonclient.http as httpclient
import numpy as np
http_triton_client = httpclient.InferenceServerClient(
url=f"{MESH_IP}:80",
verbose=False,
)
print("model ready:", http_triton_client.is_model_ready("iris"))
print("model metadata:", http_triton_client.get_model_metadata("iris"))model ready: True
model metadata: {'name': 'iris_1', 'versions': [], 'platform': '', 'inputs': [], 'outputs': [], 'parameters': {}}
# Against model
binary_data = False
inputs = [httpclient.InferInput("predict", (1, 4), "FP64")]
inputs[0].set_data_from_numpy(np.array([[1, 2, 3, 4]]).astype("float64"), binary_data=binary_data)
outputs = [httpclient.InferRequestedOutput("predict", binary_data=binary_data)]
result = http_triton_client.infer("iris", inputs, outputs=outputs)
result.as_numpy("predict")array([[2]])
# Against pipeline
binary_data = False
inputs = [httpclient.InferInput("predict", (1, 4), "FP64")]
inputs[0].set_data_from_numpy(np.array([[1, 2, 3, 4]]).astype("float64"), binary_data=binary_data)
outputs = [httpclient.InferRequestedOutput("predict", binary_data=binary_data)]
result = http_triton_client.infer("iris-pipeline.pipeline", inputs, outputs=outputs)
result.as_numpy("predict")array([[2]])
import tritonclient.grpc as grpcclient
import numpy as np
grpc_triton_client = grpcclient.InferenceServerClient(
url=f"{MESH_IP}:80",
verbose=False,
)model_name = "iris"
headers = {"seldon-model": model_name}
print("model ready:", grpc_triton_client.is_model_ready(model_name, headers=headers))
print(grpc_triton_client.get_model_metadata(model_name, headers=headers))model ready: True
name: "iris_1"
model_name = "iris"
headers = {"seldon-model": model_name}
inputs = [
grpcclient.InferInput("predict", (1, 4), "FP64"),
]
inputs[0].set_data_from_numpy(np.array([[1, 2, 3, 4]]).astype("float64"))
outputs = [grpcclient.InferRequestedOutput("predict")]
result = grpc_triton_client.infer(model_name, inputs, outputs=outputs, headers=headers)
result.as_numpy("predict")array([[2]])
model_name = "iris-pipeline.pipeline"
headers = {"seldon-model": model_name}
inputs = [
grpcclient.InferInput("predict", (1, 4), "FP64"),
]
inputs[0].set_data_from_numpy(np.array([[1, 2, 3, 4]]).astype("float64"))
outputs = [grpcclient.InferRequestedOutput("predict")]
result = grpc_triton_client.infer(model_name, inputs, outputs=outputs, headers=headers)
result.as_numpy("predict")array([[2]])
cat models/tfsimple1.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
cat pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
kubectl apply -f models/tfsimple1.yaml -n ${NAMESPACE}
kubectl apply -f pipelines/tfsimple.yaml -n ${NAMESPACE}model.mlops.seldon.io/tfsimple1 created
pipeline.mlops.seldon.io/tfsimple created
kubectl wait --for condition=ready --timeout=300s model tfsimple1 -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s pipelines tfsimple -n ${NAMESPACE}model.mlops.seldon.io/tfsimple1 condition met
pipeline.mlops.seldon.io/tfsimple condition met
import tritonclient.http as httpclient
import numpy as np
http_triton_client = httpclient.InferenceServerClient(
url=f"{MESH_IP}:80",
verbose=False,
)
print("model ready:", http_triton_client.is_model_ready("iris"))
print("model metadata:", http_triton_client.get_model_metadata("iris"))model ready: True
model metadata: {'name': 'iris_1', 'versions': [], 'platform': '', 'inputs': [], 'outputs': [], 'parameters': {}}
# Against model (no binary data)
binary_data = False
inputs = [
httpclient.InferInput("INPUT0", (1, 16), "INT32"),
httpclient.InferInput("INPUT1", (1, 16), "INT32"),
]
inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT0", binary_data=binary_data)]
result = http_triton_client.infer("tfsimple1", inputs, outputs=outputs)
result.as_numpy("OUTPUT0")array([[ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
dtype=int32)
# Against model (with binary data)
binary_data = True
inputs = [
httpclient.InferInput("INPUT0", (1, 16), "INT32"),
httpclient.InferInput("INPUT1", (1, 16), "INT32"),
]
inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT0", binary_data=binary_data)]
result = http_triton_client.infer("tfsimple1", inputs, outputs=outputs)
result.as_numpy("OUTPUT0")array([[ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
dtype=int32)
# Against Pipeline (no binary data)
binary_data = False
inputs = [
httpclient.InferInput("INPUT0", (1, 16), "INT32"),
httpclient.InferInput("INPUT1", (1, 16), "INT32"),
]
inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
outputs = [httpclient.InferRequestedOutput("OUTPUT0", binary_data=binary_data)]
result = http_triton_client.infer("tfsimple.pipeline", inputs, outputs=outputs)
result.as_numpy("OUTPUT0")array([[ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
dtype=int32)
## binary data does not work with http behind pipeline
# import numpy as np
# binary_data = True
# inputs = [
# httpclient.InferInput("INPUT0", (1, 16), "INT32"),
# httpclient.InferInput("INPUT1", (1, 16), "INT32"),
# ]
# inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
# inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"), binary_data=binary_data)
# outputs = [httpclient.InferRequestedOutput("OUTPUT0", binary_data=binary_data)]
# result = http_triton_client.infer("tfsimple.pipeline", inputs, outputs=outputs)
# result.as_numpy("OUTPUT0")import tritonclient.grpc as grpcclient
import numpy as np
grpc_triton_client = grpcclient.InferenceServerClient(
url=f"{MESH_IP}:80",
verbose=False,
)model_name = "tfsimple1"
headers = {"seldon-model": model_name}
print("model ready:", grpc_triton_client.is_model_ready(model_name, headers=headers))
print(grpc_triton_client.get_model_metadata(model_name, headers=headers))model ready: True
name: "tfsimple1_1"
versions: "1"
platform: "tensorflow_graphdef"
inputs {
name: "INPUT0"
datatype: "INT32"
shape: -1
shape: 16
}
inputs {
name: "INPUT1"
datatype: "INT32"
shape: -1
shape: 16
}
outputs {
name: "OUTPUT0"
datatype: "INT32"
shape: -1
shape: 16
}
outputs {
name: "OUTPUT1"
datatype: "INT32"
shape: -1
shape: 16
}
# Against Model
model_name = "tfsimple1"
headers = {"seldon-model": model_name}
inputs = [
grpcclient.InferInput("INPUT0", (1, 16), "INT32"),
grpcclient.InferInput("INPUT1", (1, 16), "INT32"),
]
inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"))
inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"))
outputs = [grpcclient.InferRequestedOutput("OUTPUT0")]
result = grpc_triton_client.infer(model_name, inputs, outputs=outputs, headers=headers)
result.as_numpy("OUTPUT0")array([[ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
dtype=int32)
# Against Pipeline
model_name = "tfsimple.pipeline"
headers = {"seldon-model": model_name}
inputs = [
grpcclient.InferInput("INPUT0", (1, 16), "INT32"),
grpcclient.InferInput("INPUT1", (1, 16), "INT32"),
]
inputs[0].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"))
inputs[1].set_data_from_numpy(np.arange(1, 17).reshape(-1, 16).astype("int32"))
outputs = [grpcclient.InferRequestedOutput("OUTPUT0")]
result = grpc_triton_client.infer(model_name, inputs, outputs=outputs, headers=headers)
result.as_numpy("OUTPUT0")array([[ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]],
dtype=int32)
kubectl delete -f models/sklearn-iris-gs.yaml -n ${NAMESPACE}
kubectl delete -f pipelines/iris.yaml -n ${NAMESPACE}model.mlops.seldon.io "iris" deleted
pipeline.mlops.seldon.io "iris-pipeline" deleted
kubectl delete -f models/tfsimple1.yaml -n ${NAMESPACE}
kubectl delete -f pipelines/tfsimple.yaml -n ${NAMESPACE}model.mlops.seldon.io "tfsimple1" deleted
pipeline.mlops.seldon.io "tfsimple" deleted
from prometheus_client.parser import text_string_to_metric_families
import requests
def scrape_metrics(host):
data = requests.get(f"http://{host}/metrics").text
return {
family.name: family for family in text_string_to_metric_families(data)
}
def print_sample(family, label, value):
for sample in family.samples:
if sample.labels[label] == value:
print(sample)
def get_model_infer_count(host, model_name):
metrics = scrape_metrics(host)
family = metrics["seldon_model_infer"]
print_sample(family, "model", model_name)
def get_pipeline_infer_count(host, pipeline_name):
metrics = scrape_metrics(host)
family = metrics["seldon_pipeline_infer"]
print_sample(family, "pipeline", pipeline_name)seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris --inference-mode grpc -i 100 \
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}'Success: map[:iris_1::100]
get_model_infer_count(mlserver_metrics_host,"iris")Sample(name='seldon_model_infer_total', labels={'code': '200', 'method_type': 'rest', 'model': 'iris', 'model_internal': 'iris_1', 'server': 'mlserver', 'server_replica': '0'}, value=50.0, timestamp=None, exemplar=None)
Sample(name='seldon_model_infer_total', labels={'code': 'OK', 'method_type': 'grpc', 'model': 'iris', 'model_internal': 'iris_1', 'server': 'mlserver', 'server_replica': '0'}, value=100.0, timestamp=None, exemplar=None)
seldon model unload iris
{}
seldon model load -f ./models/tfsimple1.yaml
seldon model status tfsimple1 -w ModelAvailable | jq -M .{}
{}
seldon model infer tfsimple1 -i 50\
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'Success: map[:tfsimple1_1::50]
seldon model infer tfsimple1 --inference-mode grpc -i 100 \
'{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}'Success: map[:tfsimple1_1::100]
get_model_infer_count(triton_metrics_host,"tfsimple1")Sample(name='seldon_model_infer_total', labels={'code': '200', 'method_type': 'rest', 'model': 'tfsimple1', 'model_internal': 'tfsimple1_1', 'server': 'triton', 'server_replica': '0'}, value=50.0, timestamp=None, exemplar=None)
Sample(name='seldon_model_infer_total', labels={'code': 'OK', 'method_type': 'grpc', 'model': 'tfsimple1', 'model_internal': 'tfsimple1_1', 'server': 'triton', 'server_replica': '0'}, value=100.0, timestamp=None, exemplar=None)
seldon model unload tfsimple1{}
seldon model load -f ./models/tfsimple1.yaml
seldon model load -f ./models/tfsimple2.yaml
seldon model status tfsimple1 -w ModelAvailable | jq -M .
seldon model status tfsimple2 -w ModelAvailable | jq -M .
seldon pipeline load -f ./pipelines/tfsimples.yaml
seldon pipeline status tfsimples -w PipelineReady{}
{}
{}
{}
{}
{"pipelineName":"tfsimples", "versions":[{"pipeline":{"name":"tfsimples", "uid":"cdqji39qa12c739ab3o0", "version":2, "steps":[{"name":"tfsimple1"}, {"name":"tfsimple2", "inputs":["tfsimple1.outputs"], "tensorMap":{"tfsimple1.outputs.OUTPUT0":"INPUT0", "tfsimple1.outputs.OUTPUT1":"INPUT1"}}], "output":{"steps":["tfsimple2.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":2, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2022-11-16T19:25:01.255955114Z"}}]}seldon pipeline infer tfsimples -i 50 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'Success: map[:tfsimple1_1::50 :tfsimple2_1::50 :tfsimples.pipeline::50]get_pipeline_infer_count(pipeline_metrics_host,"tfsimples")Sample(name='seldon_pipeline_infer_total', labels={'code': '200', 'method_type': 'rest', 'pipeline': 'tfsimples', 'server': 'pipeline-gateway'}, value=50.0, timestamp=None, exemplar=None)seldon model unload tfsimple1
seldon model unload tfsimple2
seldon pipeline unload tfsimples{}
{}
{}# samples/models/tfsimple_scaling.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
minReplicas: 1
replicas: 1
Learn how to set up a self-hosted Kafka cluster for Seldon Core in development and learning environments.
You can run Kafka in the same Kubernetes cluster that hosts Seldon Core 2. Seldon recommends using the Strimzi operator for Kafka installation and maintenance. For more details about configuring Kafka with Seldon Core 2, see the Configuration section.
Integrating self-hosted Kafka with Seldon Core 2 includes these steps:
Strimzi provides a Kubernetes Operator to deploy and manage Kafka clusters. First, we need to install the Strimzi Operator in your Kubernetes cluster.
Create a namespace where you want to install Kafka. For example the namespace seldon-mesh:
Install Strimzi.
Install Strimzi Operator.
This deploys the Strimzi Operator in the seldon-mesh namespace. After the Strimzi Operator is running, you can create a Kafka cluster by applying a Kafka custom resource definition.
Apply the Kafka cluster configuration.
Create a YAML file named kafka-nodepool.yaml to create a nodepool for the kafka cluster.
Apply the Kafka node pool configuration.
Check the status of the Kafka Pods to ensure they are running properly:
You should see multiple Pods for Kafka, and Strimzi operators running.
Error
The Pod that begins with the name seldon-dataflow-engine does not show the status as Running.
One of the possible reasons could be that the DNS resolution for the service failed.
Solution
Check the logs of the Pod <seldon-dataflow-engine>:
In the output check if a message reads:
Verify the name in the metadata for the kafka.yaml and kafka-nodepool.yaml. It should read seldon
When the SeldonRuntime is installed in a namespace a ConfigMap is created with the
settings for Kafka configuration. Update the ConfigMap only if you need to customize the configurations.
Verify that the ConfigMap resource named seldon-kafka that is created in the namespace seldon-mesh:
You should have the ConfigMaps for Kafka, Zookeeper, Strimzi operators, and others.
View the configuration of the ConfigMap named seldon-kafka.
You should see an output similar to this:
After you have integrated Seldon Core 2 with Kafka, you need to configure an ingress controller or service mesh, which adds an abstraction layer for traffic routing by receiving traffic from outside the Kubernetes platform and load balancing it to Pods running within the Kubernetes cluster.
To customize the settings you can add and modify the Kafka configuration using Helm, for example to add compression for producers.
Create a YAML file to specify the compression configuration for Seldon Core 2 runtime. For example, create the values-runtime-kafka-compression.yaml file. Use your preferred text editor to create and save the file with the following content:
Change to the directory that contains the values-runtime-kafka-compression.yaml file and then install Seldon Core 2 runtime in the namespace seldon-mesh.
If you are using a shared Kafka cluster with other applications, it is advisable to isolate topic names and consumer group IDs from other cluster users to prevent naming conflicts. This can be achieved by configuring the following two settings:
topicPrefix: set a prefix for all topics
consumerGroupIdPrefix: set a prefix for all consumer groups
Here's an example of how to configure topic name and consumer group ID isolation during a Helm installation for an application named myorg:
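A values-file sketch for this is shown below; the key nesting may differ between chart versions, so verify the paths against the chart's values before use:
# values-runtime-kafka-isolation.yaml (hypothetical file name)
kafka:
  topicPrefix: myorg              # all Core 2 topics are prefixed with "myorg"
  consumerGroupIdPrefix: myorg    # all consumer group IDs are prefixed with "myorg"
This file would then be passed to the runtime Helm installation with --values, in the same way as the compression example above.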
After you have installed Seldon Core 2 and Kafka using Helm, complete the remaining integration steps.
Learn how to run performance tests for Seldon Core 2 deployments, including load testing, benchmarking, and analyzing inference latency and throughput metrics.
This section describes how a user can run performance tests to understand the limits of a particular SCv2 deployment.
The base directory is tests/k6.
k6 is used to drive requests for load, unload, and infer workloads. It is recommended that the load test is run within the same cluster that has SCv2 installed, as it requires internal access to some of the services that are not automatically exposed to the outside world. Furthermore, having the driver within the same cluster minimises link latency to the SCv2 entrypoint, so infer latencies are more representative of the actual overheads of the system.
Envoy Tests synchronous inference requests via envoy
To run: make deploy-envoy-test
Agent Tests inference requests direct to a specific agent, defaults to triton-0 or mlserver-0
To run: make deploy-rproxy-test or make deploy-rproxy-mlserver-test
Server Tests inference requests direct to a specific server (bypassing agent), defaults to triton-0 or mlserver-0
To run: make deploy-server-test or make deploy-server-mlserver-test
Pipeline gateway (HTTP-Kafka gateway) Tests inference requests to a one-node pipeline via HTTP and gRPC
To run: make deploy-kpipeline-test
Model gateway (Kafka-HTTP gateway) Tests inference requests to a model via kafka
To run: make deploy-kmodel-test
One way to look at results is to check the log of the pod that executed the Kubernetes job.
Results can also be persisted to a GCS bucket; a service account k6-sa-key in the same namespace is required.
Users can also look at the metrics that are exposed in Prometheus while the test is underway.
If a user is modifying the actual scenario of the test:
export DOCKERHUB_USERNAME=mydockerhubaccount
build the k6 image via make build-push
in the same shell environment, deploying jobs will use this custom-built Docker image
Users can modify settings of the tests in tests/k6/configs/k8s/base/k6.yaml. This will apply to all subsequent tests that are deployed using the above process.
Some settings that can be changed
k6 args
for a full list, check
Environment variables
for MODEL_TYPE, choose from:
Learn how to create and manage ML pipelines in Seldon Core using Kubernetes custom resources, including model chaining and tensor mapping.
Pipelines allow one to connect flows of inference data transformed by Model components. A directed acyclic graph (DAG) of steps can be defined to join Models together. Each Model will need to be capable of receiving a V2 inference request and respond with a V2 inference response. An example Pipeline is shown below:
The steps list shows three models: tfsimple1, tfsimple2 and tfsimple3. These three models each take two tensors called INPUT0 and INPUT1 of integers. The models produce two outputs OUTPUT0 (the sum of the inputs) and OUTPUT1 (subtraction of the second input from the first).
tfsimple1 and tfsimple2 take as inputs the input to the Pipeline: this is the default assumption when no explicit inputs are defined. tfsimple3 takes one V2 tensor input from each of the outputs of tfsimple1 and tfsimple2. As the outputs of tfsimple1 and tfsimple2 have tensors named OUTPUT0 and OUTPUT1, their names need to be changed to match the expected input tensors; this is done with a tensorMap component providing the tensor renaming. This is only required if your models cannot be directly chained together.
The output of the Pipeline is the output from the tfsimple3 model.
Seldon Core 2 supports cyclic pipelines, enabling the creation of feedback loops within the inference graph. However, the cyclic pipelines should be used carefully, as incorrect configurations can lead to unintended behavior.
The risk of joining messages from different iterations (i.e., a message from iteration t might be joined with messages from t-1, t-2, ..., 1). If a feedback message re-enters the pipeline within the join window and reaches a step already holding messages from a previous iteration, Kafka Streams may join messages across iterations. This can trigger unintended message propagation. For more details on how Kafka Streams handles joins and the implications for feedback loops, refer to the Kafka Streams documentation on join semantics.
Seldon Core 2 provides a maxStepRevisits parameter in the pipeline manifest. This parameter limits the number of times a step can be revisited within a single pipeline execution. If the limit is reached, the pipeline execution terminates and returns an error. This feature is useful in cyclic pipelines, where infinite loops might occur (e.g., in agentic workflows where control flow is determined by an LLM), and helps safeguard against unintended infinite loops. By default, maxStepRevisits is set to 0 (i.e., no cycles), but you can adjust it according to your use case.
To enable a cyclic pipeline, set the allowCycles flag in your pipeline manifest:
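A minimal sketch follows. The pipeline and step names are placeholders, the step wiring that actually creates the feedback loop is omitted, and the field placement should be checked against the full specification below; the sketch only illustrates the two cycle-related fields:
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
  name: counter
spec:
  allowCycles: true     # opt in to cyclic graphs
  maxStepRevisits: 10   # terminate with an error if any step is revisited more than 10 times
  steps:
  - name: counter-model
  output:
    steps:
    - counter-model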
The full GoLang specification for a Pipeline is shown below:
Learn how to configure Istio as an ingress controller for Seldon Core, including traffic management and security policies.
An ingress controller functions as a reverse proxy and load balancer, implementing a Kubernetes Ingress. It adds an abstraction layer for traffic routing by receiving traffic from outside the Kubernetes platform and load balancing it to Pods running within the Kubernetes cluster.
Seldon Core 2 works seamlessly with any service mesh or ingress controller, offering flexibility in your deployment setup. This guide provides detailed instructions for installing and configuring Istio with Seldon Core 2.
Istio implements the Kubernetes ingress resource to expose a service and make it accessible from outside the cluster. You can install Istio either in a self-hosted Kubernetes cluster or in a managed Kubernetes service provided by a cloud provider that runs Seldon Core 2.
Install Istio.
Ensure that you install a version of Istio that is compatible with your Kubernetes cluster version. For detailed information on supported versions, refer to the Istio documentation.
Installing Istio ingress controller in a Kubernetes cluster running Seldon Core 2 involves these tasks:
Add the Istio Helm charts repository and update it:
Create the istio-system namespace where Istio components are installed:
Install the base component:
Install Istiod, the Istio control plane:
Install Istio Ingress Gateway:
Verify that Istio Ingress Gateway is installed:
This should return details of the Istio Ingress Gateway, including the external IP address.
Verify that all Istio Pods are running:
The output is similar to:
It is important to expose the seldon-mesh service to enable communication between deployed machine learning models and external clients or services. The Seldon Core 2 inference API is exposed through the seldon-mesh service in the seldon-mesh namespace. If you install Core 2 in multiple namespaces, you need to expose the seldon-mesh service in each namespace.
Verify that the seldon-mesh service is running, for example in the namespace seldon.
When the services are running you should see something similar to this:
Create a YAML file to create a VirtualService named iris-route in the namespace seldon-mesh. For example, create the seldon-mesh-vs.yaml file. Use your preferred text editor to create and save the file with the following content:
Additional Resources
This section covers various aspects of optimizing pipeline performance in Seldon Core 2, from testing methodologies to Core 2 configuration. Each subsection provides detailed guidance on different aspects of pipeline performance tuning:
Learn how to effectively test and optimize pipeline performance:
Understanding pipeline latency components
Identifying and reducing bottlenecks
Balancing model replicas and resources
Explore how to configure Core 2 components for optimal pipeline performance:
Understanding Core 2 data processing components
Optimizing Kafka integration
Configuring Core 2 services
Understand how Core 2 components scale with the number of deployed pipelines and models:
Dynamic scaling of dataflow engine, model gateway, and pipeline gateway
Loading and unloading of models and pipelines
Assignment of pipelines and models to replicas
Each of these aspects plays a crucial role in achieving optimal pipeline performance. We recommend starting with testing individual models in your pipeline, then using those insights to inform your Core 2 configuration and overall pipeline optimization strategies.
Learn how to deploy a cyclic pipeline using Core 2. In this example, you'll build a simple counter that begins at a user-defined starting value and increments by one until it reaches 10. If the starting value is already greater than 10, the pipeline terminates immediately without running.
Ensure that you have Seldon Core 2 installed in the namespace seldon-mesh.
This guide walks you through setting up Jaeger Tracing for Seldon Core v2 on Kubernetes. By the end of this guide, you will be able to visualize inference traces through your Core 2 components.
Explore how Seldon Core 2 uses data flow paradigm and Kafka-based streaming to improve ML model deployment with better scalability, fault tolerance, and data observability.
Seldon Core 2 is designed around the data flow paradigm. Here we explain what that means and some of the rationale behind this choice.
The initial release of Seldon Core introduced the concept of an inference graph, which can be thought of as a sequence of operations applied to an inference request. Here is how it may look:
In reality, though, this was not how Seldon Core v1 was implemented. Instead, a Seldon deployment consists of a range of independent services that host models, transformations, detectors and explainers, and a central orchestrator that knows the inference graph topology and makes service calls in the correct order, passing data between requests and responses as necessary. Here is how the picture looks under the hood:
!kubectl get servers -n seldon-mesh
NAME READY REPLICAS LOADED MODELS AGE
mlserver True 1 0 156d
mlserver-custom True 1 0 38d
triton True 1 0 156d
!cat ../../../samples/quickstart/servers/mlserver-custom.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-custom
spec:
serverConfig: mlserver
capabilities:
- income-classifier-deps
podSpec:
containers:
- image: seldonio/mlserver:1.6.0
name: mlserver
!kubectl apply -f ../../../samples/quickstart/servers/mlserver-custom.yaml -n seldon-mesh
server.mlops.seldon.io/mlserver-custom unchanged
!gcloud storage ls --recursive gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier
gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier/:
gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier/model-settings.json
gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier/model.joblib
!gsutil cat gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier/model-settings.json
{
"name": "income",
"implementation": "mlserver_sklearn.SKLearnModel",
"parameters": {
"uri": "./model.joblib",
"version": "v0.1.0"
}
}
!cat ../../../samples/quickstart/models/sklearn-income-classifier.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-classifier
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.4.0/income-sklearn/classifier"
requirements:
- income-classifier-deps
memory: 100Ki
!kubectl apply -f ../../../samples/quickstart/models/sklearn-income-classifier.yaml -n seldon-mesh
model.mlops.seldon.io/income-classifier created
MESH_IP = !kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP = MESH_IP[0]
MESH_IP
'34.32.149.48'
endpoint = f"http://{MESH_IP}/v2/models/income-classifier/infer"
headers = {
"Seldon-Model": "income-classifier",
}

inference_request = {
"inputs": [
{
"name": "income",
"datatype": "INT64",
"shape": [1, 12],
"data": [53, 4, 0, 2, 8, 4, 2, 0, 0, 0, 60, 9]
}
]
}

import requests

response = requests.post(endpoint, headers=headers, json=inference_request)
response.json()

{'model_name': 'income-classifier_1',
'model_version': '1',
'id': '626ebe8e-bc95-433f-8f5f-ef296625622a',
'parameters': {},
'outputs': [{'name': 'predict',
'shape': [1, 1],
'datatype': 'INT64',
'parameters': {'content_type': 'np'},
'data': [0]}]}

import re
import numpy as np
# Extracts numerical values from a formatted text and outputs a vector of numerical values.
def extract_numerical_values(input_text):
    # Find key-value pairs in text
    pattern = r'"[^"]+":\s*"([^"]+)"'
    matches = re.findall(pattern, input_text)
    # Extract numerical values
    numerical_values = []
    for value in matches:
        cleaned_value = value.replace(",", "")
        if cleaned_value.isdigit():  # Integer
            numerical_values.append(int(cleaned_value))
        else:
            try:
                numerical_values.append(float(cleaned_value))
            except ValueError:
                pass
    # Return array of numerical values
    return np.array(numerical_values)

input_text = '''
"Age": "47",
"Workclass": "4",
"Education": "1",
"Marital Status": "1",
"Occupation": "1",
"Relationship": "0",
"Race": "4",
"Sex": "1",
"Capital Gain": "0",
"Capital Loss": "0",
"Hours per week": "68",
"Country": "9",
"Name": "John Doe"
'''
extract_numerical_values(input_text)

array([47, 4, 1, 1, 1, 0, 4, 1, 0, 0, 68, 9])

!gcloud storage ls --recursive gs://seldon-models/scv2/samples/preprocessor

gs://seldon-models/scv2/samples/preprocessor/:
gs://seldon-models/scv2/samples/preprocessor/model-settings.json
gs://seldon-models/scv2/samples/preprocessor/model.py
gs://seldon-models/scv2/samples/preprocessor/preprocessor.yaml

!cat ../../../samples/quickstart/models/preprocessor/preprocessor.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: preprocessor
spec:
storageUri: "gs://seldon-models/scv2/samples/preprocessor"
requirements:
- income-classifier-deps

!kubectl apply -f ../../../samples/quickstart/models/preprocessor/preprocessor.yaml -n seldon-mesh

model.mlops.seldon.io/preprocessor created

endpoint_pp = f"http://{MESH_IP}/v2/models/preprocessor/infer"
headers_pp = {
"Seldon-Model": "preprocessor",
}

text_inference_request = {
"inputs": [
{
"name": "text_input",
"shape": [1],
"datatype": "BYTES",
"data": [input_text]
}
]
}

import requests

response = requests.post(endpoint_pp, headers=headers_pp, json=text_inference_request)
response.json()

{'model_name': 'preprocessor_1',
'model_version': '1',
'id': 'b26e49d5-2a4c-488b-8dff-0df850fbed3d',
'parameters': {},
'outputs': [{'name': 'output',
'shape': [1, 12],
'datatype': 'INT64',
'parameters': {'content_type': 'np'},
'data': [47, 4, 1, 1, 1, 0, 4, 1, 0, 0, 68, 9]}]}

!cat ../../../samples/quickstart/pipelines/income-classifier-app.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: income-classifier-app
spec:
steps:
- name: preprocessor
- name: income-classifier
inputs:
- preprocessor
output:
steps:
- income-classifier

!kubectl apply -f ../../../samples/quickstart/pipelines/income-classifier-app.yaml -n seldon-mesh

pipeline.mlops.seldon.io/income-classifier-app created

pipeline_endpoint = f"http://{MESH_IP}/v2/models/income-classifier-app/infer"
pipeline_headers = {
"Seldon-Model": "income-classifier-app.pipeline"
}

pipeline_response = requests.post(
pipeline_endpoint, json=text_inference_request, headers=pipeline_headers
)
pipeline_response.json()

{'model_name': '',
'outputs': [{'data': [0],
'name': 'predict',
'shape': [1, 1],
'datatype': 'INT64',
'parameters': {'content_type': 'np'}}]}

!kubectl delete -f ../../../samples/quickstart/pipelines/income-classifier-app.yaml -n seldon-mesh
!kubectl delete -f ../../../samples/quickstart/models/preprocessor/preprocessor.yaml -n seldon-mesh
!kubectl delete -f ../../../samples/quickstart/models/sklearn-income-classifier.yaml -n seldon-mesh
!kubectl delete -f ../../../samples/quickstart/servers/mlserver-custom.yaml -n seldon-mesh

pipeline.mlops.seldon.io "income-classifier-app" deleted
model.mlops.seldon.io "preprocessor" deleted
model.mlops.seldon.io "income-classifier" deleted
server.mlops.seldon.io "mlserver-custom" deleted

apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: addmul10
spec:
default: pipeline-add10
resourceType: pipeline
candidates:
- name: pipeline-add10
weight: 50
- name: pipeline-mul10
weight: 50
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-iris
spec:
candidates:
- name: iris
weight: 50
- name: iris2
weight: 50
Create a virtual service to expose the seldon-mesh service.
When the virtual service is created, you should see this:
# tests/k6/configs/k8s/base/k6.yaml
args: [
"--no-teardown",
"--summary-export",
"results/base.json",
"--out",
"csv=results/base.gz",
"-u",
"5",
"-i",
"100000",
"-d",
"120m",
"scenarios/infer_constant_vu.js",
]
# # infer_constant_rate
# args: [
# "--no-teardown",
# "--summary-export",
# "results/base.json",
# "--out",
# "csv=results/base.gz",
# "scenarios/infer_constant_rate.js",
# ]
# # k8s-test-script
# args: [
# "--summary-export",
# "results/base.json",
# "--out",
# "csv=results/base.gz",
# "scenarios/k8s-test-script.js",
# ]
# # core2_qa_control_plane_ops
# args: [
# "--no-teardown",
# "--verbose",
# "--summary-export",
# "results/base.json",
# "--out",
# "csv=results/base.gz",
# "-u",
# "5",
# "-i",
# "10000",
# "-d",
# "9h",
# "scenarios/core2_qa_control_plane_ops.js",
# ]
- name: INFER_HTTP_ITERATIONS
value: "1"
- name: INFER_GRPC_ITERATIONS
value: "1"
- name: MODELNAME_PREFIX
value: "tfsimplea,pytorch-cifar10a,tfmnista,mlflow-winea,irisa"
- name: MODEL_TYPE
value: "tfsimple,pytorch_cifar10,tfmnist,mlflow_wine,iris"
# Specify MODEL_MEMORY_BYTES using unit-of measure suffixes (k, M, G, T)
# rather than numbers without units of measure. If supplying "naked
# numbers", the seldon operator will take care of converting the number
# for you but also take ownership of the field (as FieldManager), so the
# next time you run the scenario creating/updating of the model CR will
# fail.
- name: MODEL_MEMORY_BYTES
value: "400k,8M,43M,200k,3M"
- name: MAX_MEM_UPDATE_FRACTION
value: "0.1"
- name: MAX_NUM_MODELS
value: "800,100,25,100,100"
# value: "0,0,25,100,100"
#
# MAX_NUM_MODELS_HEADROOM is a variable used by control-plane tests.
# It's the approximate number of models that can be created over
# MAX_NUM_MODELS over the experiment. In the worst case scenario
# (very unlikely) the HEADROOM values may temporarily exceed the ones
# specified here with the number of VUs, because each VU checks the
# headroom constraint independently before deciding on the available
# operations (no communication/sync between VUs)
# - name: MAX_NUM_MODELS_HEADROOM
# value: "20,5,0,20,30"
#
# MAX_MODEL_REPLICAS is used by control-plane tests. It controls the
# maximum number of replicas that may be requested when
# creating/updating models of a given type.
# - name: MAX_MODEL_REPLICAS
# value: "2,2,0,2,2"
#
- name: INFER_BATCH_SIZE
value: "1,1,1,1,1"
# MODEL_CREATE_UPDATE_DELETE_BIAS defines the probability ratios between
# the operations, for control-plane tests. For example, "1, 4, 3"
# makes an Update four times more likely than a Create, and a Delete 3
# times more likely than the Create.
# - name: MODEL_CREATE_UPDATE_DELETE_BIAS
# value: "1,3,1"
- name: WARMUP
value: "false"// tests/k6/components/model.js
import { dump as yamlDump } from "https://cdn.jsdelivr.net/npm/[email protected]/dist/js-yaml.mjs";
import { getConfig } from '../components/settings.js'
const tfsimple_string = "tfsimple_string"
const tfsimple = "tfsimple"
const iris = "iris" // mlserver
const pytorch_cifar10 = "pytorch_cifar10"
const tfmnist = "tfmnist"
const tfresnet152 = "tfresnet152"
const onnx_gpt2 = "onnx_gpt2"
const mlflow_wine = "mlflow_wine" // mlserver
const add10 = "add10" // https://github.com/SeldonIO/triton-python-examples/tree/master/add10
const sentiment = "sentiment" // mlserverkubectl apply -f seldon-mesh-vs.yamlvirtualservice.networking.istio.io/iris-route createdhelm repo add istio https://istio-release.storage.googleapis.com/charts
helm repo updatekubectl create namespace istio-systemhelm install istio-base istio/base -n istio-systemhelm install istiod istio/istiod -n istio-system --waithelm install istio-ingressgateway istio/gateway -n istio-systemkubectl get svc istio-ingressgateway -n istio-systemkubectl get pods -n istio-systemNAME READY STATUS RESTARTS AGE
istiod-xxxxxxx-xxxxx 1/1 Running 0 2m
istio-ingressgateway-xxxxx 1/1 Running 0 2mkubectl get svc -n seldon-meshmlserver-0 ClusterIP None <none> 9000/TCP,9500/TCP,9005/TCP 43m
seldon-mesh LoadBalancer 34.118.225.130 34.90.213.15 80:32228/TCP,9003:31265/TCP 45m
seldon-pipelinegateway ClusterIP None <none> 9010/TCP,9011/TCP 45m
seldon-scheduler LoadBalancer 34.118.225.138 35.204.34.162 9002:32099/TCP,9004:32100/TCP,9044:30342/TCP,9005:30473/TCP,9055:32732/TCP,9008:32716/TCP 45m
triton-0 ClusterIP None <none> 9000/TCP,9500/TCP,9005/TCP

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: iris-route
namespace: seldon-mesh
spec:
gateways:
- istio-system/seldon-gateway
hosts:
- "*"
http:
- name: iris-http
match:
- uri:
prefix: /v2
route:
- destination:
host: seldon-mesh.seldon-mesh.svc.cluster.local

ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Seldon Core 2: http://$ISTIO_INGRESS"
Create a YAML file to specify the initial configuration.
Note: This configuration sets up a Kafka cluster with version 3.9.0. Ensure that you review the supported versions of Kafka and update the version in the kafka.yaml file as needed. For more configuration examples, see the strimzi-kafka-operator repository.
Use your preferred text editor to create and save the file as kafka.yaml with the following content:
Check the name of the Kafka services in the namespace:
Restart the Pod:
Ensure that you are performing these steps in the directory where you have downloaded the samples.
Get the IP address of the Seldon Core 2 instance running with Istio:
Start by implementing the first model: a simple counter.
This model produces two output tensors. The first contains the incremented number, while the second is an empty tensor labeled either continue or stop. This second tensor acts as a trigger, directing the data flow through either the feedback loop or the output path. For more information on triggering tensors, see the intro to pipelines page.
Next, define the second model — an identity model:
The identity model simply passes the input tensors through to the output while introducing a delay. This delay is crucial for preventing infinite loops in the pipeline, which can occur due to the join interval behavior in Kafka Streams. For further details, see Kafka documentation.
This counter application pipeline consists of three models: the counter model, an identity model for the feedback loop, and another identity model for the output. The structure of the pipeline is illustrated as follows:
To deploy the pipeline, you need to load each model into the cluster. The model-settings.json configuration for the counter model is as follows:
For the identity feedback loop model, reuse the model-settings.json file and configure it to include a 1-millisecond delay:
The one-millisecond delay is crucial to prevent infinite loops in the pipeline. It aligns with the join window applied to all input types for the counter model, as well as the join window configured for the identity model, which is specified in the pipeline definition.
Similarly, for the identity output model, reuse the same model-settings.json file without introducing any delay.
The manifest files for the three models are the following:
To deploy the models, use the following command:
After the models are deployed, proceed to deploy the pipeline. The pipeline manifest file is defined as follows:
Note that the joinWindowMs parameter is set to 1 millisecond for both the identity loop and identity output models. This setting is essential to prevent messages from different iterations from being joined (e.g., a message from iteration t being joined with messages from iterations t-1, t-2, ..., 1). Additionally, we limit the number of step revisits to 100, the maximum number of times the pipeline can revisit a step during execution. While our pipeline behaves deterministically and is guaranteed to terminate, this parameter is especially useful in cyclic pipelines where a terminal state might not be reached (e.g., agentic workflows where control flow is determined by an LLM). It helps safeguard against infinite loops.
To deploy the pipeline, use the following command:
To send a request to the pipeline, use the following command:
This request initiates the pipeline with an input value of 0. The pipeline increments this value step by step until it reaches 10, at which point it stops. The response includes the final counter value, 10, along with a message indicating that the pipeline has terminated.
To clean up the models and the pipeline, use the following commands:
Install Helm, the package manager for Kubernetes.
Install Seldon Core 2
Install cert-manager in the namespace cert-manager.
To set up Jaeger tracing for Seldon Core 2 on Kubernetes and visualize inference traces of the Seldon Core 2 components, you need to do the following:
Create a dedicated namespace to install the Jaeger Operator and tracing resources:
The Jaeger Operator manages Jaeger instances in the Kubernetes cluster. Use the Helm chart for Jaeger v2.
Add the Jaeger Helm repository:
Create a minimal tracing-values.yaml:
Install or upgrade the Jaeger Operator in the tracing namespace:
Validate that the Jaeger Operator Pod is running:
Output is similar to:
Install a simple Jaeger custom resource in the namespace seldon-mesh, where Seldon Core 2 is running.
Create a manifest file named jaeger-simplest.yaml with these contents:
Apply the manifest:
Verify that the Jaeger all-in-one pod is running:
Output is similar to:
This simplest Jaeger CR does the following:
All-in-one pod: Deploys a single pod running the collector, agent, query service, and UI, using in-memory storage.
Core 2 integration: receives spans from Seldon Core 2 components and exposes a UI for viewing traces.
To enable tracing, configure the OpenTelemetry exporter endpoint in the SeldonRuntime resource so that traces are sent to the Jaeger collector service created by the simplest Jaeger Custom Resource. The Seldon Runtime helm chart is located here.
Find the seldonruntime Custom Resource that needs to be updated using: kubectl get seldonruntimes -n seldon-mesh
Patch your Custom Resource to include tracingConfig under spec.config using:
Output is similar to:
Check the updated .yaml file, using: kubectl get seldonruntime seldon -n seldon-mesh -o yaml
Output is similar to:
Restart the following Core 2 component Pods so they pick up the new tracing configuration from the seldon-tracing ConfigMap in the seldon-mesh namespace.
seldon-dataflow-engine
seldon-pipeline-gateway
seldon-model-gateway
seldon-scheduler
Servers
After the restart, these components read the updated tracing configuration and start emitting traces to Jaeger.
To visualize traces, send requests to your models or pipelines deployed in Seldon Core 2. Each inference request should produce a trace that shows the path through the Core 2 components such as gateways, dataflow engine, server agents in the Jaeger UI.
Port-forward the Jaeger query service to your local machine:
Open the Jaeger UI in your browser:
You can now explore traces emitted by Seldon Core 2 components.
An example Jaeger trace is shown below:
While this is a convenient way of implementing an evaluation graph with microservices, it has a few problems. The orchestrator becomes a bottleneck and a single point of failure. It also hides all the data transformations needed to translate one service's response into another service's request, so data tracing and lineage become difficult. All in all, while the Seldon platform is all about processing data, the under-the-hood implementation was still focused on the order of operations rather than on the data itself.
The realization of this disparity led to a new approach to inference graph evaluation in v2, based on the data flow paradigm. Data flow is a well-known concept in software engineering, dating back to the 1960s. In contrast to services, which model programs as control flow and focus on the order of operations, data flow models software systems as a series of connections that modify incoming data, focusing on the data flowing through the system. The particular flavor of the data flow paradigm used by v2 is known as flow-based programming (FBP). FBP defines software applications as a set of processes that exchange data via connections external to those processes. Connections are made via named ports, which promotes data coupling between the components of the system.
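To make the contrast concrete, here is a minimal, purely illustrative Python sketch (not Core 2 code): a control-flow orchestrator that knows the graph and calls each component in order, versus an FBP-style wiring in which each component only reads from and writes to named connections (queues here, Kafka topics in Core 2).

```python
import threading
from queue import Queue

# Two toy "components" that transform data.
def preprocess(x):
    return x * 2

def predict(x):
    return x + 1

# Control flow: an orchestrator knows the topology and calls each
# component in order, shuttling data between calls itself.
def orchestrate(x):
    return predict(preprocess(x))

# Data flow (FBP-style): each component is a process connected to named
# ports (here, queues) that are external to the component itself.
def run_step(fn, inport: Queue, outport: Queue):
    while True:
        item = inport.get()
        if item is None:          # sentinel to stop
            outport.put(None)
            break
        outport.put(fn(item))

if __name__ == "__main__":
    a, b, c = Queue(), Queue(), Queue()   # connections between processes
    threading.Thread(target=run_step, args=(preprocess, a, b)).start()
    threading.Thread(target=run_step, args=(predict, b, c)).start()

    a.put(3)                # data enters the graph ...
    a.put(None)
    print(c.get())          # ... and the result flows out: 7
    print(orchestrate(3))   # same result via control flow: 7
```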
Data flow design makes data in software the top priority. That is one of the key messages of the so-called "data-centric AI" idea, which is becoming increasingly popular within the ML community. Data is a key component of a successful ML project. Data needs to be discovered, described, cleaned, understood, monitored, and verified. Consequently, there is a growing demand for data-centric platforms and solutions. Making Seldon Core data-centric was one of the key goals of the Seldon Core 2 design.
In the context of Seldon Core, applying the FBP design approach means that the evaluation is implemented the same way the inference graph is defined. Instead of routing everything through a centralized orchestrator, the evaluation happens in the same graph-like manner:
As far as implementation goes, Seldon Core 2 runs on Kafka. An inference request is put onto a pipeline input topic, which triggers an evaluation. Each part of the inference graph is a service running in its own container, fronted by a model gateway. The model gateway listens to the corresponding input Kafka topic, reads data from it, calls the service, and puts the received response onto an output Kafka topic. There is also a pipeline gateway that allows interacting with Seldon Core in a synchronous manner.
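As a rough sketch of the model gateway pattern described above (an illustration, not the actual Core 2 implementation), the loop below consumes Open Inference Protocol requests from a hypothetical input topic, calls a model over HTTP, and produces the response to an output topic. The bootstrap address, topic names, and model URL are placeholders; real Core 2 topic names depend on your configured topic prefix and namespace.

```python
import json
import requests
from confluent_kafka import Consumer, Producer

BOOTSTRAP = "seldon-kafka-bootstrap.seldon-mesh:9092"       # placeholder broker address
IN_TOPIC = "example.model.inputs"                            # placeholder topic names
OUT_TOPIC = "example.model.outputs"
MODEL_URL = "http://mlserver:9000/v2/models/example/infer"   # placeholder model endpoint

consumer = Consumer({
    "bootstrap.servers": BOOTSTRAP,
    "group.id": "example-model-gateway",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": BOOTSTRAP})
consumer.subscribe([IN_TOPIC])

while True:
    msg = consumer.poll(1.0)            # read a request from the input topic
    if msg is None or msg.error():
        continue
    request = json.loads(msg.value())   # Open Inference Protocol request body
    response = requests.post(MODEL_URL, json=request).json()
    # put the model's response onto the output topic for the next step
    producer.produce(OUT_TOPIC, json.dumps(response).encode("utf-8"))
    producer.flush()
```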
This approach gives Seldon Core 2 several important features. Firstly, it natively supports both synchronous and asynchronous modes of operation. Asynchronicity is achieved via streaming: input data can be sent to an input topic in Kafka, and after the evaluation the output topic will contain the inference result. For those looking to use it in the v1 style, a synchronous service API is also provided.
Secondly, there is no single point of failure. Even if one or more nodes in the graph go down, the data will still be sitting on the streams waiting to be processed, and the evaluation resumes whenever the failed node comes back up.
Thirdly, data flow means intermediate data can be accessed at any arbitrary step of the graph, inspected and collected as necessary. Data lineage is possible throughout, which opens up opportunities for advanced monitoring and explainability use cases. This is a key feature for effective error surfacing in production environments as it allows:
Adding context from different parts of the graph to better understand a particular output
Reducing false positive rates of alerts as different slices of the data can be investigated
Enabling reproduction of results, as fine-grained lineage of the computation and the associated data transformations is tracked by design
Finally, the inference graph can now be extended by adding new nodes at arbitrary places, all without affecting pipeline execution. This kind of flexibility was not possible with v1. It also allows multiple pipelines to share common nodes, thereby optimizing resource usage.
More details and information on data-centric AI and data flow paradigm can be found in these resources:
Stanford MLSys seminar "What can Data-Centric AI Learn from Data and ML Engineering?"
A paper that explores data flow in ML deployment context
Introduction to flow based programming from its creator J.P. Morrison:
Pathways: Asynchronous Distributed Dataflow for ML, from Google, on the design and implementation of a data flow based orchestration layer for accelerators
Better understanding of data requires tracking its
Guide to integrating managed Kafka services (Confluent Cloud, Amazon MSK, Azure Event Hub) with Seldon Core 2, including security configurations and authentication setup.
Seldon recommends a managed Kafka service for production installations. You can integrate and secure your managed Kafka service with Seldon Core 2.
Some of the managed Kafka services that have been tested are:
Confluent Cloud (security: SASL/PLAIN)
Confluent Cloud (security: SASL/OAUTHBEARER)
Learn how to implement request-per-second (RPS) based autoscaling in Seldon Core 2 using Kubernetes HPA and Prometheus metrics.
Given that Seldon Core 2 is predominantly for serving ML in Kubernetes, it is possible to leverage the HorizontalPodAutoscaler (HPA) to automatically scale Kubernetes resources up and down. This requires exposing metrics so that they can be used by HPA. In this tutorial, we explain how to expose a requests-per-second (RPS) metric using Prometheus and the Prometheus Adapter, so that it can be used to autoscale Models or Servers using HPA.
The following workflow will require:
Having a Seldon Core 2 install that publishes metrics to Prometheus (the default). In the following, we assume that Prometheus is already installed and configured in the seldon-monitoring namespace.
Learn how to leverage Core 2's native autoscaling functionality for Servers
Core 2 runs with long-lived server replicas, each able to host multiple models (through Multi-Model Serving, or MMS). The server replicas can be autoscaled natively by Core 2 in response to dynamic changes in the requested number of model replicas, allowing users to seamlessly optimize the infrastructure cost associated with their deployments.
This document outlines the autoscaling policies and mechanisms that are available for autoscaling server replicas. These policies are designed to ensure that the server replicas are scaled up or down in response to changes in the number of replicas requested for each model. In other words if a given model is scaled up, the system will scale up the server replicas in order to host the new model replicas. Similarly, if a given model is scaled down, the system may scale down the number of replicas of the server hosting the model, depending on other models that are loaded on the same server replica.
kubectl get svc -n seldon-mesh

kubectl delete pod <seldon-dataflow-engine> -n seldon-mesh

kubectl create namespace seldon-mesh || echo "namespace seldon-mesh exists"

helm repo add strimzi https://strimzi.io/charts/
helm repo update

helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator --namespace seldon-mesh

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
name: seldon
namespace: seldon-mesh
annotations:
strimzi.io/node-pools: enabled
strimzi.io/kraft: enabled
spec:
kafka:
replicas: 3
version: 3.9.0
listeners:
- name: plain
port: 9092
tls: false
type: internal
- name: tls
port: 9093
tls: true
type: internal
config:
processMode: kraft
auto.create.topics.enable: true
default.replication.factor: 1
inter.broker.protocol.version: 3.7
min.insync.replicas: 1
offsets.topic.replication.factor: 1
transaction.state.log.min.isr: 1
transaction.state.log.replication.factor: 1
entityOperator: null

kubectl apply -f kafka.yaml -n seldon-mesh

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
name: kafka
namespace: seldon-mesh
labels:
strimzi.io/cluster: seldon
spec:
replicas: 3
roles:
- broker
- controller
resources:
requests:
cpu: '500m'
memory: '2Gi'
limits:
memory: '2Gi'
template:
pod:
tmpDirSizeLimit: 1Gi
storage:
type: jbod
volumes:
- id: 0
type: ephemeral
sizeLimit: 500Mi
kraftMetadata: shared
- id: 1
type: persistent-claim
size: 10Gi
deleteClaim: false

kubectl apply -f kafka-nodepool.yaml -n seldon-mesh

kubectl get pods -n seldon-mesh

```bash
NAME READY STATUS RESTARTS AGE
hodometer-5489f768bf-9xnmd 1/1 Running 0 25m
mlserver-0 3/3 Running 0 24m
seldon-dataflow-engine-75f9bf6d8f-2blgt 1/1 Running 5 (23m ago) 25m
seldon-envoy-7c764cc88-xg24l 1/1 Running 0 25m
seldon-kafka-0 1/1 Running 0 21m
seldon-kafka-1 1/1 Running 0 21m
seldon-kafka-2 1/1 Running 0 21m
seldon-modelgateway-54d457794-x4nzq 1/1 Running 0 25m
seldon-pipelinegateway-6957c5f9dc-6blx6 1/1 Running 0 25m
seldon-scheduler-0 1/1 Running 0 25m
seldon-v2-controller-manager-7b5df98677-4jbpp 1/1 Running 0 25m
strimzi-cluster-operator-66b5ff8bbb-qnr4l 1/1 Running 0 23m
triton-0 3/3 Running 0 24m
```

kubectl logs <seldon-dataflow-engine> -n seldon-mesh

WARN [main] org.apache.kafka.clients.ClientUtils : Couldn't resolve server seldon-kafka-bootstrap.seldon-mesh:9092 from bootstrap.servers as DNS resolution failed for seldon-kafka-bootstrap.seldon-mesh

kubectl get configmaps -n seldon-mesh

NAME DATA AGE
kube-root-ca.crt 1 38m
seldon-agent 1 30m
seldon-kafka 1 30m
seldon-kafka-0 6 26m
seldon-kafka-1 6 26m
seldon-kafka-2 6 26m
seldon-manager-config 1 30m
seldon-tracing 4 30m
strimzi-cluster-operator 1 28m

kubectl get configmap seldon-kafka -n seldon-mesh -o yaml

apiVersion: v1
data:
kafka.json: '{"bootstrap.servers":"seldon-kafka-bootstrap.seldon-mesh:9092","consumer":{"auto.offset.reset":"earliest","message.max.bytes":"1000000000","session.timeout.ms":"6000","topic.metadata.propagation.max.ms":"300000"},"producer":{"linger.ms":"0","message.max.bytes":"1000000000"},"topicPrefix":"seldon"}'
kind: ConfigMap
metadata:
creationTimestamp: "2024-12-05T07:12:57Z"
name: seldon-kafka
namespace: seldon-mesh
ownerReferences:
- apiVersion: mlops.seldon.io/v1alpha1
blockOwnerDeletion: true
controller: true
kind: SeldonRuntime
name: seldon
uid: 9e724536-2487-487b-9250-8bcd57fc52bb
resourceVersion: "778"
uid: 5c041e69-f36b-4f14-8f0d-c8790003cb3e

helm upgrade seldon-core-v2-runtime seldon-charts/seldon-core-v2-runtime \
--namespace seldon-mesh \
-f values-runtime-kafka-compression.yaml \
--install

helm upgrade --install seldon-core-v2-setup seldon-charts/seldon-core-v2-setup \
--namespace seldon-mesh \
--set controller.clusterwide=true \
--set kafka.topicPrefix=myorg \
--set kafka.consumerGroupIdPrefix=myorg

curl -k http://<INGRESS_IP>:80/v2/models/counter-pipeline/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: counter-pipeline.pipeline" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "counter-pipeline.inputs",
"datatype": "INT32",
"shape": [1],
"data": [0]
}
]
}' | jq -M .

{
"model_name": "",
"outputs": [
{
"data": [
10
],
"name": "output",
"shape": [
1,
1
],
"datatype": "INT32",
"parameters": {
"content_type": "np"
}
},
{
"data": [
""
],
"name": "stop",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "str"
}
}
]
}

seldon pipeline infer counter-pipeline --inference-host <INGRESS_IP>:80 \
  '{"inputs":[{"name":"counter-pipeline.inputs","shape":[1],"datatype":"INT32","data":[0]}]}' | jq -M .

{
"model_name": "",
"outputs": [
{
"data": [
10
],
"name": "output",
"shape": [
1,
1
],
"datatype": "INT32",
"parameters": {
"content_type": "np"
}
},
{
"data": [
""
],
"name": "stop",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "str"
}
}
]
}

ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Seldon Core 2: http://$ISTIO_INGRESS"

from mlserver.model import MLModel
from mlserver.codecs import NumpyCodec, StringCodec
from mlserver.types import InferenceRequest, InferenceResponse
from mlserver.logging import logger
class Counter(MLModel):
    async def load(self) -> bool:
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        x = NumpyCodec.decode_input(payload.inputs[0]) + 1
        message = "continue" if x.item() < 10 else "stop"
        return InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=[
                NumpyCodec.encode_output(
                    name="output",
                    payload=x
                ),
                StringCodec.encode_output(
                    name=message,
                    payload=[""]
                ),
            ]
        )

import asyncio
from mlserver.logging import logger
from mlserver import MLModel, ModelSettings
from mlserver.types import (
InferenceRequest, InferenceResponse, ResponseOutput
)
class IdentityModel(MLModel):
    def __init__(self, settings: ModelSettings):
        super().__init__(settings)
        self.params = settings.parameters
        self.extra = self.params.extra if self.params is not None else None
        self.delay = self.extra.get("delay", 0)

    async def load(self) -> bool:
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        if self.delay:
            await asyncio.sleep(self.delay)
        return InferenceResponse(
            model_name=self.name,
            model_version=self.version,
            outputs=[
                ResponseOutput(
                    name=request_input.name,
                    shape=request_input.shape,
                    datatype=request_input.datatype,
                    parameters=request_input.parameters,
                    data=request_input.data
                ) for request_input in payload.inputs
            ]
        )

{
"name": "counter",
"implementation": "model.Counter",
"parameters": {
"version": "v0.1.0"
}
}

{
"name": "identity-loop",
"implementation": "model.IdentityModel",
"parameters": {
"version": "v0.1.0",
"extra": {
"delay": 0.001
}
}
}

{
"name": "identity-output",
"implementation": "model.IdentityModel",
"parameters": {
"version": "v0.1.0"
}
}

cat ./models/counter.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: counter
spec:
storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/counter"
requirements:
- mlserver

cat ./models/identity-loop.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: identity-loop
spec:
storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/identity-loop"
requirements:
- mlserver

cat ./models/identity-output.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: identity-output
spec:
storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/identity-output"
requirements:
- mlserver

kubectl apply -f - --namespace=seldon-mesh <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: counter
spec:
storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/counter"
requirements:
- mlserver
EOF

kubectl apply -f - --namespace=seldon-mesh <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: identity-loop
spec:
storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/identity-loop"
requirements:
- mlserver
EOF

kubectl apply -f - --namespace=seldon-mesh <<EOF
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: identity-output
spec:
storageUri: "gs://seldon-models/scv2/examples/cyclic-pipeline/identity-output"
requirements:
- mlserver
EOF

kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/counter condition met
model.mlops.seldon.io/identity-loop condition met
model.mlops.seldon.io/identity-output condition met

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: counter-pipeline
spec:
allowCycles: true
maxStepRevisits: 100
steps:
- name: counter
inputsJoinType: any
inputs:
- counter-pipeline.inputs
- identity-loop.outputs
- name: identity-output
joinWindowMs: 1
inputs:
- counter.outputs
triggers:
- counter.outputs.stop
- name: identity-loop
joinWindowMs: 1
inputs:
- counter.outputs.output
triggers:
- counter.outputs.continue
output:
steps:
- identity-output.outputs

kubectl create -f counter-pipeline.yaml -n seldon-mesh

pipeline.mlops.seldon.io/counter-pipeline created

kubectl delete pipeline counter-pipeline -n seldon-mesh

pipeline.mlops.seldon.io "counter-pipeline" deleted

kubectl delete model identity-loop -n seldon-mesh
kubectl delete model identity-output -n seldon-mesh
kubectl delete model counter -n seldon-mesh

model.mlops.seldon.io "identity-loop" deleted
model.mlops.seldon.io "identity-output" deleted
model.mlops.seldon.io "counter" deleted

kubectl create namespace tracing

helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm repo update

rbac:
clusterRole: true
create: true
pspEnabled: false

helm upgrade tracing jaegertracing/jaeger-operator \
--version 2.57.0 \
-f tracing-values.yaml \
-n tracing \
--install

kubectl get pods -n tracing

NAME READY STATUS RESTARTS AGE
tracing-jaeger-operator-549b79b848-h4p4d 1/1 Running 0 96s

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
name: simplest
namespace: seldon-mesh

kubectl apply -f jaeger-simplest.yaml

kubectl get pods -n seldon-mesh | grep simplest

NAME READY STATUS RESTARTS AGE
simplest-8686f5d96-4ptb4 1/1 Running 0 45s

kubectl patch seldonruntime seldon -n seldon-mesh \
--type merge \
-p '{"spec":{"config":{"kafkaConfig":{"bootstrap.servers":"seldon-kafka-bootstrap.seldon-mesh:9092","consumer":{"auto.offset.reset":"earliest"},"topics":{"numPartitions":4}},"scalingConfig":{"servers":{}},"tracingConfig":{"otelExporterEndpoint":"simplest-collector.seldon-mesh:4317"}}}}'seldonruntime.mlops.seldon.io/seldon patchedspec:
config:
agentConfig:
rclone: {}
kafkaConfig:
bootstrap.servers: seldon-kafka-bootstrap.seldon-mesh:9092
consumer:
auto.offset.reset: earliest
topics:
numPartitions: 4
scalingConfig:
servers: {}
serviceConfig: {}
tracingConfig:
otelExporterEndpoint: simplest-collector.seldon-mesh:4317

kubectl port-forward svc/simplest-query -n seldon-mesh 16686:16686

http://localhost:16686

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: pipeline
spec:
allowCycles: true
maxStepRevisits: 100
...

type PipelineSpec struct {
// External inputs to this pipeline, optional
Input *PipelineInput `json:"input,omitempty"`
// The steps of this inference graph pipeline
Steps []PipelineStep `json:"steps"`
// Synchronous output from this pipeline, optional
Output *PipelineOutput `json:"output,omitempty"`
// Dataflow specs
Dataflow *DataflowSpec `json:"dataflow,omitempty"`
// Allow cyclic pipeline
AllowCycles bool `json:"allowCycles,omitempty"`
// Maximum number of times a step can be revisited
MaxStepRevisits uint32 `json:"maxStepRevisits,omitempty"`
}
type DataflowSpec struct {
// Flag to indicate whether the kafka input/output topics
// should be cleaned up when the model is deleted
// Default false
CleanTopicsOnDelete bool `json:"cleanTopicsOnDelete,omitempty"`
}
// +kubebuilder:validation:Enum=inner;outer;any
type JoinType string
const (
// data must be available from all inputs
JoinTypeInner JoinType = "inner"
// data will include any data from any inputs at end of window
JoinTypeOuter JoinType = "outer"
// first data input that arrives will be forwarded
JoinTypeAny JoinType = "any"
)
type PipelineStep struct {
// Name of the step
Name string `json:"name"`
// Previous step to receive data from
Inputs []string `json:"inputs,omitempty"`
// msecs to wait for messages from multiple inputs to arrive before joining the inputs
JoinWindowMs *uint32 `json:"joinWindowMs,omitempty"`
// Map of tensor name conversions to use e.g. output1 -> input1
TensorMap map[string]string `json:"tensorMap,omitempty"`
// Triggers required to activate step
Triggers []string `json:"triggers,omitempty"`
// +kubebuilder:default=inner
InputsJoinType *JoinType `json:"inputsJoinType,omitempty"`
TriggersJoinType *JoinType `json:"triggersJoinType,omitempty"`
// Batch size of request required before data will be sent to this step
Batch *PipelineBatch `json:"batch,omitempty"`
}
type PipelineBatch struct {
Size *uint32 `json:"size,omitempty"`
WindowMs *uint32 `json:"windowMs,omitempty"`
Rolling bool `json:"rolling,omitempty"`
}
type PipelineInput struct {
// Previous external pipeline steps to receive data from
ExternalInputs []string `json:"externalInputs,omitempty"`
// Triggers required to activate inputs
ExternalTriggers []string `json:"externalTriggers,omitempty"`
// msecs to wait for messages from multiple inputs to arrive before joining the inputs
JoinWindowMs *uint32 `json:"joinWindowMs,omitempty"`
// +kubebuilder:default=inner
JoinType *JoinType `json:"joinType,omitempty"`
// +kubebuilder:default=inner
TriggersJoinType *JoinType `json:"triggersJoinType,omitempty"`
// Map of tensor name conversions to use e.g. output1 -> input1
TensorMap map[string]string `json:"tensorMap,omitempty"`
}
type PipelineOutput struct {
// Previous step to receive data from
Steps []string `json:"steps,omitempty"`
// msecs to wait for messages from multiple inputs to arrive before joining the inputs
JoinWindowMs uint32 `json:"joinWindowMs,omitempty"`
// +kubebuilder:default=inner
StepsJoin *JoinType `json:"stepsJoin,omitempty"`
// Map of tensor name conversions to use e.g. output1 -> input1
TensorMap map[string]string `json:"tensorMap,omitempty"`
}
To enable autoscaling of server replicas, the following requirements need to be met:
Setting minReplicas and maxReplicas in the Server CR. This will define the minimum and maximum number of server replicas that can be created.
Setting the autoscaling.autoscalingServerEnabled value to true (default) during installation of the Core 2 seldon-core-v2-setup helm chart. If not installing via helm, setting the ENABLE_SERVER_AUTOSCALING environment variable to true in the seldon-scheduler podSpec (via either SeldonConfig or a SeldonRuntime podSpec override) will have the same effect. This will enable the autoscaling of server replicas.
An example of a Server CR with autoscaling enabled is shown below:
When we want to scale up the number of replicas for a model, the associated server might not have enough replicas available. In this case we need to scale up the server replicas to match the number required by our models. There is currently only one policy for scaling up server replicas: scaling via Model Replica Count. This policy scales the server replicas to match the number of model replicas that are required; in other words, if a model is scaled up, the system scales up the server replicas to host the new model replicas. During the scale-up process, the system creates new server replicas with the same configuration as the existing ones (server configuration, resources, etc.), and these new replicas are added alongside the existing ones to host the new model replicas.
There is a period of time where the new server replicas are being created and the new model replicas are being loaded onto these server replicas. During this period, the system will ensure that the existing server replicas are still serving load so that there is no downtime during the scale up process. This is achieved by using partial scheduling of the new model replicas onto the new server replicas. This ensures that the new server replicas are gradually loaded with the new model replicas and that the existing server replicas are still serving load. Check the Partial Scheduling document for more details.
Once we have scaled down the number of replicas for a model, some of the corresponding server replicas might be left unused (depending on whether those replicas are hosting other models or not). If that is the case, the extra server pods might incur unnecessary infrastructure cost (especially if they have expensive resources such as GPUs attached). Scaling down servers in sync with models is not straightforward in the case of Multi-Model Serving. Scaling down one model does not necessarily mean that we also need to scale down the corresponding Server replica as this replica might be still serving load for other models. Therefore we define heuristics that can be used to scale down servers if we think that they are not properly used, described in the policies below.
There are two possible policies we use to define the scale down of Servers:
Empty Server Replica: In the simplest case we can remove a server replica if it does not host any models. This guarantees that there is no load on a particular server replica before removing it. This policy works best in the case of single model serving where the server replicas are only hosting a single model. In this case, if the model is scaled down, the server replica will be empty and can be removed.
However in the case of MMS, only reducing the number of server replicas when one of the replicas no longer hosts any models can lead to a suboptimal packing of models onto server replicas. This is because the system will not automatically pack models onto the smaller set of replicas. This can lead to more server replicas being used than necessary. This can be mitigated by the lightly loaded server replicas policy.
Lightly Loaded Server Replicas (Experimental):
Warning: This policy is experimental and is not enabled by default. It can be enabled by setting autoscaling.serverPackingEnabled to true and autoscaling.serverPackingPercentage to a value between 0 and 100. This policy is still under development and might in some cases increase latencies, so it's worth testing ahead of time to observe behavior for a given setup.
Using the above policy with MMS enabled, different model replicas may be hosted on different server replicas, and as these models are scaled up and down the system can end up in a situation where the models are not consolidated onto an optimal number of server replicas. For illustration (using generic names), take the case of 3 models, A, B and C, and 1 server with 2 replicas, Replica 1 and Replica 2, that can host these models. Assuming that initially A and B have 1 replica each and C has 2 replicas, the assignment is:
Initial assignment:
Replica 1: A, C
Replica 2: B, C
Now, if the user unloads model C, the assignment is:
Replica 1: A
Replica 2: B
There is an argument that this might not be optimal, and with MMS the assignment could be:
Replica 1: A, B
Replica 2: removed
As the system evolves this imbalance can get larger and could cause the serving infrastructure to be less optimized. The behavior above is not actually limited to autoscaling; however, autoscaling will aggravate the issue, causing more imbalance over time. This imbalance can be mitigated by making the following observation: if the maximum number of replicas of any given model assigned (logically) to a server is less than the number of replicas of that server, then we can pack the hosted models onto a smaller set of replicas. Note that in Core 2 a server replica can host only one replica of a given model.
In other words, consider the following example: for models A and B having 2 replicas each and 3 server replicas, the following assignment is potentially not optimal:
Replica 1: A, B
Replica 2: A
Replica 3: B
In this case we could trigger removal of Replica 3 for the server, which would pack the models more appropriately:
Replica 1: A, B
Replica 2: A, B
Replica 3: removed
While this heuristic packs models onto a smaller set of replicas, which allows servers to scale down, there is still a risk that the packing could increase latencies or trigger a later scale-up. Core 2 ensures consistent behavior without reverting between states. The user can also reduce the number of packing events by setting autoscaling.serverPackingPercentage to a lower value.
Currently, Core 2 triggers the packing logic only when a model replica is removed, either from a model scaling down or a model being deleted. In the future, the logic might be triggered more frequently to improve model packing onto a smaller set of replicas.
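As a minimal sketch of the packing condition above (with hypothetical model names and replica counts): if the largest per-model replica count assigned to a server is lower than the server's current replica count, the models can in principle be packed onto fewer server replicas, since a server replica hosts at most one replica of any given model.

```python
# Hypothetical assignment: model name -> number of replicas requested,
# all assigned to the same server.
model_replicas = {"model-a": 2, "model-b": 2}
server_replicas = 3

# Each server replica can host at most one replica of a given model, so the
# minimum number of server replicas needed is the largest per-model count.
min_needed = max(model_replicas.values(), default=0)

if min_needed < server_replicas:
    print(f"Packing possible: {server_replicas} -> {min_needed} server replicas")
else:
    print("Server replicas already at the minimum required")
```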
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: sklearn-mirror
spec:
default: iris
candidates:
- name: iris
weight: 100
mirror:
name: iris2
percent: 100
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris
candidates:
- name: iris
weight: 50
- name: iris2
weight: 50
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: addmul10-mirror
spec:
default: pipeline-add10
resourceType: pipeline
candidates:
- name: pipeline-add10
weight: 100
mirror:
name: pipeline-mul10
percent: 100
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver
namespace: seldon
spec:
replicas: 2
minReplicas: 1
maxReplicas: 4
serverConfig: mlserver

Amazon MSK (security: SASL/SCRAM)
Azure Event Hub (security: SASL/PLAIN)
These instructions outline the necessary configurations to integrate managed Kafka services with Seldon Core 2.
You can secure Seldon Core 2 integration with managed Kafka services using:
In production settings, always set up TLS encryption with Kafka. This ensures that neither the credentials nor the payloads are transported in plaintext.
When TLS is enabled, the client needs to know the root CA certificate used to create the server's certificate. This is used to validate the certificate sent back by the Kafka server.
Create a certificate named ca.crt that is encoded as a PEM certificate. It is important that the certificate is saved as ca.crt; otherwise, Seldon Core 2 may not be able to find it. Within the cluster, you can provide the server's root CA certificate through a secret, for example a secret named kafka-broker-tls containing the ca.crt certificate.
In production environments, Kafka clusters often require authentication, especially when using managed Kafka solutions. Therefore, when installing Seldon Core 2 components, it is crucial to provide the correct credentials for a secure connection to Kafka.
The type of authentication used with Kafka varies depending on the setup but typically includes one of the following:
Simple Authentication and Security Layer (SASL): Requires a username and password.
Mutual TLS (mTLS): Involves using SSL certificates as credentials.
OAuth 2.0: Uses the client credential flow to acquire a JWT token.
These credentials are stored as Kubernetes secrets within the cluster. When setting up Seldon Core 2, you must create the appropriate secret in the correct format and update the components-values.yaml and install-values files accordingly.
When you use SASL as the authentication mechanism for Kafka, the credentials consist of a username and password pair. The password is supplied through a secret.
Create a password for Seldon Core 2 in the namespace seldon-mesh.
Values in Seldon Core 2
In Seldon Core 2 you need to specify these values in your Helm values file:
security.kafka.sasl.mechanism - SASL security mechanism, e.g. SCRAM-SHA-512
security.kafka.sasl.client.username - Kafka username
security.kafka.sasl.client.secret - Created secret with password
The resulting set of values to include in values.yaml is similar to:
The security.kafka.ssl.client.brokerValidationSecret field is optional. Leave it empty if your brokers use a well-known Certificate Authority, such as Let's Encrypt.
When you use OAuth 2.0 as the authentication mechanism for Kafka, the credentials consist of a Client ID and Client Secret, which are used with your Identity Provider to obtain JWT tokens for authenticating with Kafka brokers.
Create a Kubernetes secret kafka-oauth.yaml file.
Store the secret in the seldon-mesh namespace to configure with Seldon Core 2.
This secret must be present in seldon-logs namespace and every namespace containing Seldon Core 2 runtime.
When you use mTLS as the authentication mechanism, Kafka uses a set of certificates to authenticate the client:
A client certificate, referred to as tls.crt.
A client key, referred to as tls.key.
Here are some examples of creating secrets for managed Kafka services such as Azure Event Hub, Confluent Cloud (SASL), and Confluent Cloud (OAuth 2.0).
Prerequisites:
You must use at least the Standard tier for your Event Hub namespace because the Basic tier does not support the Kafka protocol.
Seldon Core 2 creates two Kafka topics for each model and pipeline, plus one global topic for errors. This results in a total number of topics calculated as: 2 x (number of models + number of pipelines) + 1. This topic count is likely to exceed the limit of the Standard tier in Azure Event Hub. For more information, see quota information.
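For example, applying that formula to a hypothetical deployment of 10 models and 3 pipelines:

```python
# Core 2 creates 2 topics per model, 2 per pipeline, plus 1 global error topic.
models, pipelines = 10, 3
print(2 * (models + pipelines) + 1)  # 27 topics
```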
Creating a namespace and obtaining the connection string
These are the steps that you need to perform in Azure Portal.
Create an Azure Event Hub namespace. You need to have an Azure Event Hub namespace; follow the Azure documentation to create one. Note: You do not need to create individual Event Hubs (topics), as Seldon Core 2 automatically creates all necessary topics.
Connection string for Kafka integration. To connect to Azure Event Hub using the Kafka API, you need to obtain the Kafka endpoint and the connection string. For more information, see the Azure Event Hubs documentation.
Note: Ensure you get the Connection string at the namespace level, as it is needed to dynamically create new topics. The format of the Connection string should be:
Creating secrets for Seldon Core 2 These are the steps that you need to perform in the Kubernetes cluster that run Seldon Core 2 to store the SASL password.
Create a secret named azure-kafka-secret for Seldon Core 2 in the namespace seldon. In the following command make sure to replace <password> with a password of your choice and <namespace> with the namespace from Azure Event Hub.
Create a secret named azure-kafka-secret for Seldon Core 2 in the namespace seldon-system. In the following command make sure to replace <password> with a password of your choice and <namespace> with the namespace from Azure Event Hub.
Creating API Keys
These are the steps that you need to perform in Confluent Cloud.
Navigate to Clients > New client, choose a client type (for example, Go), and generate a new Kafka cluster API key. For more information, see the Confluent Cloud documentation.
Confluent generates a configuration file with the details.
Save the values of the Key and Secret.
Confluent Cloud managed Kafka supports OAuth 2.0 to authenticate your Kafka clients. See Confluent Cloud documentation for further details.
Configuring Identity Provider
In the Confluent Cloud Console, navigate to Account & Access > Identity providers and complete these steps:
Register your Identity Provider. See the Confluent Cloud documentation for further details.
Add a new identity pool to your newly registered Identity Provider. See the Confluent Cloud documentation for further details.
To integrate Kafka with Seldon Core 2:
Update the initial configuration.
Update the initial configuration for Seldon Core 2 in the components-values.yaml file. Use your preferred text editor to update and save the file with the following content:
To enable Kafka Encryption (TLS) you need to reference the secret that you created in the security.kafka.ssl.client.secret field of the Helm chart values. The resulting set of values to include in components-values.yaml is similar to:
Change to the directory that contains the components-values.yaml file and then install Seldon Core 2 operator in the namespace seldon-system.
After you integrated Seldon Core 2 with Kafka, you need to Install an Ingress Controller that adds an abstraction layer for traffic routing by receiving traffic from outside the Kubernetes platform and load balancing it to Pods running within the Kubernetes cluster.
Installing and configuring Prometheus Adapter, which allows prometheus queries on relevant metrics to be published as k8s custom metrics
Configuring HPA manifests to scale Models
Each Kubernetes cluster supports only one active custom metrics provider. If your cluster already uses a custom metrics provider different from prometheus-adapter, it will need to be removed before being able to scale Core 2 models and servers via HPA. The Kubernetes community is actively exploring solutions for allowing multiple custom metrics providers to coexist.
The role of the Prometheus Adapter is to expose queries on metrics in Prometheus as k8s custom or external metrics. Those can then be accessed by HPA in order to take scaling decisions.
To install through helm:
These commands install prometheus-adapter as a helm release named hpa-metrics in the same namespace where Prometheus is installed, and point to its service URL (without the port).
The URL is not fully qualified as it references a Prometheus instance running in the same namespace. If you are using a separately-managed Prometheus instance, please update the URL accordingly.
If you are running Prometheus on a different port than the default 9090, you can also pass --set prometheus.port=[custom_port] You may inspect all the options available as helm values by running helm show values prometheus-community/prometheus-adapter
Please check that the metricsRelistInterval helm value (default to 1m) works well in your setup, and update it otherwise. This value needs to be larger than or equal to your Prometheus scrape interval. The corresponding prometheus adapter command-line argument is --metrics-relist-interval. If the relist interval is set incorrectly, it will lead to some of the custom metrics being intermittently reported as missing.
We now need to configure the adapter to look for the correct prometheus metrics and compute per-model RPS values. On install, the adapter has created a ConfigMap in the same namespace as itself, named [helm_release_name]-prometheus-adapter. In our case, it will be hpa-metrics-prometheus-adapter.
Overwrite the ConfigMap as shown in the following manifest, after applying any required customizations.
Change the name if you've chosen a different value for the prometheus-adapter helm release name. Change the namespace to match the namespace where prometheus-adapter is installed.
In this example, a single rule is defined to fetch the seldon_model_infer_total metric from Prometheus, compute its per second change rate based on data within a 2 minute sliding window, and expose this to Kubernetes as the infer_rps metric, with aggregations available at model, server, inference server pod and namespace level.
When HPA requests the infer_rps metric via the custom metrics API for a specific model, prometheus-adapter issues a Prometheus query in line with what is defined in its config.
For the configuration in our example, the query for a model named irisa0 in namespace seldon-mesh would be:
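As a hedged sketch (the exact query text depends on your metricsQuery template), it would be a 2 minute rate() over seldon_model_infer_total, filtered to that model and namespace. The snippet below shows one way to execute such a query against the Prometheus HTTP API; the Prometheus URL is a placeholder.

```python
import requests

PROMETHEUS_URL = "http://localhost:9090"  # placeholder, e.g. a port-forwarded Prometheus

# Approximation of the per-model RPS query produced by the example rule.
query = 'sum(rate(seldon_model_infer_total{model="irisa0", namespace="seldon-mesh"}[2m]))'

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": query})
resp.raise_for_status()
print(resp.json()["data"]["result"])
```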
You may want to modify the query in the example to match the one that you typically use in your monitoring setup for RPS metrics. The example calls rate() with a 2 minute sliding window. Values scraped at the beginning and end of the 2 minute window before query time are used to compute the RPS.
It is important to sanity-check the query by executing it against your Prometheus instance. To do so, pick an existing model CR in your Seldon Core 2 install, and send some inference requests towards it. Then, wait for a period equal to at least twice the Prometheus scrape interval (Prometheus default 1 minute), so that two values from the series are captured and a rate can be computed. Finally, you can modify the model name and namespace in the query above to match the model you've picked and execute the query.
If the query result is empty, please adjust it until it consistently returns the expected metric values. Pay special attention to the window size (2 minutes in the example): if it is smaller than twice the Prometheus scrape interval, the query may return no results. A compromise needs to be reached to set the window size large enough to reject noise but also small enough to make the result responsive to quick changes in load.
Update the metricsQuery in the prometheus-adapter ConfigMap to match any query changes you have made during tests.
A list of all the Prometheus metrics exposed by Seldon Core 2 in relation to Models, Servers and Pipelines is available here, and those may be used when customizing the configuration.
The rule definition can be broken down in four parts:
Discovery (the seriesQuery and seriesFilters keys) controls what Prometheus metrics are considered for exposure via the k8s custom metrics API.
As an alternative to the example above, all the Seldon Prometheus metrics of the form seldon_model.*_total could be considered, followed by excluding metrics pre-aggregated across all models (.*_aggregate_.*) as well as the cumulative infer time per model (.*_seconds_total):
For RPS, we are only interested in the model inference count (seldon_model_infer_total)
Association (the resources key) controls the Kubernetes resources that a particular metric can be attached to or aggregated over.
The resources key defines an association between certain labels from the Prometheus metric and k8s resources. For example, on line 17, "model": {group: "mlops.seldon.io", resource: "model"} lets prometheus-adapter know that, for the selected Prometheus metrics, the value of the "model" label represents the name of a k8s model.mlops.seldon.io CR.
One k8s custom metric is generated for each k8s resource associated with a prometheus metric. In this way, it becomes possible to request the k8s custom metric values for models.mlops.seldon.io/iris or for servers.mlops.seldon.io/mlserver.
The labels that do not refer to a namespace resource generate "namespaced" custom metrics (the label values refer to resources which are part of a namespace) -- this distinction becomes important when needing to fetch the metrics via kubectl, and in understanding how certain Prometheus query template placeholders are replaced.
Naming (the name key) configures the naming of the k8s custom metric.
In the example ConfigMap, this is configured to take the Prometheus metric named seldon_model_infer_total and expose custom metric endpoints named infer_rps, which when called return the result of a query over the Prometheus metric. Instead of a literal match, one could also use regex group capture expressions, which can then be referenced in the custom metric name:
Querying (the metricsQuery key) defines how a request for a specific k8s custom metric gets converted into a Prometheus query.
The query can make use of the following placeholders:
For a complete reference for how prometheus-adapter can be configured via the ConfigMap, please consult the docs here.
Once you have applied any necessary customizations, replace the default prometheus-adapter config with the new one, and restart the deployment (this restart is required so that prometheus-adapter picks up the new config):
To test that the prometheus-adapter config works and everything is set up correctly, you can issue raw kubectl requests against the custom metrics API.
Listing the available metrics:
For namespaced metrics, the general template for fetching is:
For example:
Fetching model RPS metric for a specific (namespace, model) pair (seldon-mesh, irisa0):
Fetching model RPS metric aggregated at the (namespace, server) level (seldon-mesh, mlserver):
Fetching model RPS metric aggregated at the (namespace, pod) level (seldon-mesh, mlserver-0):
Fetching the same metric aggregated at namespace level (seldon-mesh):
Once metrics are exposed properly, users can use HPA to trigger autoscaling of Kubernetes resources, including custom resources such as Seldon Core's Models and Servers. Implementing autoscaling with HPA for Models, or for Models and Servers together is explained in the following pages.
This example runs you through a series of batch inference requests made to both models and pipelines running on Seldon Core locally.
Deprecated: The MLServer CLI infer feature is experimental and will be removed in future work.
If you haven't already, you'll need to install Seldon Core 2 locally before you run through this example.
First, let's jump in to the samples folder where we'll find some sample models and pipelines
we can use:
Let's take a look at a sample model before we deploy it:
The above manifest will deploy a simple scikit-learn model trained on the Iris dataset.
Let's now deploy that model using the Seldon CLI:
Now that we've deployed our iris model, let's create a pipeline around the model.
We see that this pipeline only has one step, which is to call the iris model we deployed
earlier. We can create the pipeline by running:
To demonstrate batch inference requests to different types of models, we'll also deploy a simple TensorFlow model:
The tensorflow model takes two arrays as inputs and returns two arrays as outputs. The first output is the addition of the two inputs and the second output is the value of (first input - second input).
Let's deploy the model:
Just as we did for the scikit-learn model, we'll deploy a simple pipeline for our tensorflow model:
Inspect the pipeline manifest:
and deploy it:
Once we've deployed a model or pipeline to Seldon Core, we can list them and check their status by running:
and
Your models and pipelines should be showing a state of ModelAvailable and PipelineReady respectively.
Before we run a large batch job of predictions through our models and pipelines, let's quickly
check that they work with a single standalone inference request. We can do this using the seldon model infer command.
The prediction request body needs to be an Open Inference Protocol (V2) compatible payload and also match the expected inputs for the model you've deployed. In this case,
the iris model expects data of shape [1, 4] and of type FP32.
You'll notice that the prediction results for this request come back on outputs[0].data.
You'll notice that the inputs for our tensorflow model look different from the ones we sent to the
iris model. This time, we're sending two arrays of shape [1,16]. When sending an inference request,
we can optionally choose which outputs we want back by including an {"outputs":...} object.
In the samples folder there is a batch request input file: batch-inputs/iris-input.txt. It contains
100 input payloads for our iris model. Let's take a look at the first line in that file:
To run a batch inference job we'll use the MLServer CLI. If you don't already have it installed, you can install it using:
The inference job can be executed by running the following command:
The mlserver batch component will take your input file batch-inputs/iris-input.txt, distribute
those payloads across 5 different workers (--workers 5), collect the responses and write them
to a file /tmp/iris-output.txt. For a full set of options, check out the MLServer documentation.
We can check the inference responses by looking at the contents of the output file:
We can run the same batch job for our iris pipeline and store the outputs in a different file:
We can check the inference responses by looking at the contents of the output file:
The samples folder contains an example batch input for the tensorflow model, just as it did for
the scikit-learn model. You can find it at batch-inputs/tfsimple-input.txt. Let's take a look
at the first inference request in the file:
As before, we can run the inference batch job using the mlserver infer command:
We can check the inference responses by looking at the contents of the output file:
You should get the following response:
We can check the inference responses by looking at the contents of the output file:
Now that we've run our batch examples, let's remove the models and pipelines we created:
And finally let's spin down our local instance of Seldon Core:
Learn how to monitor Seldon Core usage metrics, including request rates, latency, and resource utilization for models and pipelines.
There are various interesting system metrics about how Seldon Core 2 is used. These metrics can be recorded anonymously and sent to Seldon by a lightweight, optional, stand-alone component called Hodometer.
When provided, these metrics are used to understand the adoption of Seldon Core 2 and how you interact with it. For example, knowing how many clusters Seldon Core 2 is running on, if it is used in Kubernetes or for local development, and how many users are benefitting from features such as multi-model serving.
Hodometer is not an integral part of Seldon Core 2, but rather an independent component which connects to the public APIs of the Seldon Core 2 scheduler. If deployed in Kubernetes, it requests some basic information from the Kubernetes API.
Recorded metrics are sent to Seldon and, optionally, to any you define.
Hodometer was explicitly designed with privacy of user information and transparency of implementation in mind.
It does not record any sensitive or identifying information. For example, it has no knowledge of IP addresses, model names, or user information. All information sent to Seldon is anonymised with a completely random cluster identifier.
Hodometer supports configurable metrics levels, so you have full control over what metrics are provided to Seldon, if any.
For transparency, the implementation is fully open-source and designed to be easy to read, with metrics defined in code. See the table of metrics below for an equivalent list.
Metrics are collected as periodic snapshots a few times per day. They are lightweight to collect, coming mostly from the Seldon Core v2 scheduler, and are heavily aggregated. As such, they should have minimal impact on CPU, memory, and network consumption.
Hodometer does not store anything it records, so does not have any persistent storage. As a result, it should not be considered a replacement for tools like Prometheus.
Hodometer supports 3 different metrics levels:
Alternatively, usage metrics can be completely disabled. To do so, simply remove any existing deployment of Hodometer or disable it in the installation for your environment, discussed below.
The following environment variables control the behaviour of Hodometer, regardless of the environment it is installed in.
Hodometer is installed as a separate deployment, by default in the same namespace as the rest of the Seldon components.
Helm
If you install Seldon Core v2 with Helm, there are values corresponding to the key environment variables discussed above. These Helm values and their equivalents are provided below:
If you do not want usage metrics to be recorded, you can disable Hodometer via the hodometer.disable Helm
value when installing the runtime Helm chart. The following command disables collection of usage metrics in
fresh installations and also serves to remove Hodometer from an existing installation:
Hodometer can be instructed to publish metrics not only to Seldon, but also to any extra endpoints you specify.
This is controlled by the EXTRA_PUBLISH_URLS environment variable, which expects a comma-separated list of
HTTP-compatible URLs.
You might choose to use this for your own usage monitoring. For example, you could capture these metrics and expose them to Prometheus or another monitoring system using your own service.
Metrics are recorded in MixPanel-compatible format, which employs a highly flexible JSON schema.
For an example of how to define your own metrics listener, see the example listener in the hodometer sub-project.
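As an illustrative sketch only (not the actual hodometer receiver), a listener can be as simple as an HTTP server that accepts the JSON events Hodometer POSTs to each URL listed in EXTRA_PUBLISH_URLS:

```python
# Minimal sketch of a metrics listener for Hodometer's extra publish URLs.
# Assumes Hodometer POSTs MixPanel-style JSON events to the configured endpoint.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class MetricsHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            events = json.loads(body)
        except json.JSONDecodeError:
            self.send_response(400)
            self.end_headers()
            return
        # Here you could translate events into Prometheus metrics, logs, etc.
        print("received usage metrics:", events)
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # Hodometer would then be configured with EXTRA_PUBLISH_URLS=http://<this-host>:8000
    HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```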
To run this notebook you need the inference data. This can be acquired in two ways:
Run make train, or
run gsutil cp -R gs://seldon-models/scv2/examples/income/infer-data .
Show predictions from the reference set. These should show no drift and no outliers.
Show predictions from the drift data. These should show drift and probably no outliers.
Show predictions from the outlier data. These should show outliers and probably no drift.
Learn how to install and configure Seldon Core using Helm charts, including component setup and customization options.
Seldon Core 2 provides a highly configurable deployment framework that allows you to fine-tune various components using Helm configuration options. These options offer control over deployment behavior, resource management, logging, autoscaling, and model lifecycle policies to optimize the performance and scalability of machine learning deployments.
This section details the key Helm configuration parameters for Envoy, Autoscaling, Server Prestop, and Model Control Plane, ensuring that you can customize deployment workflows and enhance operational efficiency.
Envoy: Manage pre-stop behaviors and configure access logging to track request-level interactions.
Autoscaling (Experimental): Fine-tune dynamic scaling policies for efficient resource allocation based on real-time inference workloads.
Servers: Define grace periods for controlled shutdowns and optimize model control plane parameters for efficient model loading, unloading, and error handling.
Logging: Define log levels for the different components of the system.
Notes:
The Kafka client library log level is derived from the log level passed to the component, which may differ from the (syslog-style) level expected by librdkafka. In this case, the value is mapped to the closest matching level.
kubectl create secret generic kafka-broker-tls -n seldon-mesh --from-file ./ca.crt

security:
  kafka:
    ssl:
      secret:
      brokerValidationSecret: kafka-broker-tls

helm upgrade seldon-core-v2-components seldon-charts/seldon-core-v2-setup \
  --version 2.8.0 \
  -f components-values.yaml \
  --namespace seldon-system \
  --install

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/models.mlops.seldon.io/irisa0/infer_rps

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/servers.mlops.seldon.io/mlserver/infer_rps

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install --set prometheus.url='http://seldon-monitoring-prometheus' hpa-metrics prometheus-community/prometheus-adapter -n seldon-monitoring

apiVersion: v1
kind: ConfigMap
metadata:
  name: hpa-metrics-prometheus-adapter
  namespace: seldon-monitoring
data:
  config.yaml: |-
    "rules":
    -
      "seriesQuery": |
        {__name__="seldon_model_infer_total",namespace!=""}
      "resources":
        "overrides":
          "model": {group: "mlops.seldon.io", resource: "model"}
          "server": {group: "mlops.seldon.io", resource: "server"}
          "pod": {resource: "pod"}
          "namespace": {resource: "namespace"}
      "name":
        "matches": "seldon_model_infer_total"
        "as": "infer_rps"
      "metricsQuery": |
        sum by (<<.GroupBy>>) (
          rate (
            <<.Series>>{<<.LabelMatchers>>}[2m]
          )
        )

sum by (model) (
  rate (
    seldon_model_infer_total{model="irisa0", namespace="seldon-mesh"}[2m]
  )
)

```yaml
"seriesQuery": |
  {__name__=~"^seldon_model.*_total",namespace!=""}
"seriesFilters":
  - "isNot": "^seldon_.*_seconds_total"
  - "isNot": "^seldon_.*_aggregate_.*"
...
```

"name":
  "matches": "^seldon_model_(.*)_total"
  "as": "${1}_rps"

- .Series is replaced by the discovered prometheus metric name (e.g. `seldon_model_infer_total`)
- .LabelMatchers, when requesting a namespaced metric for resource `X` with name `x` in namespace `n`, is replaced by `X=~"x",namespace="n"`. For example, `model=~"iris0", namespace="seldon-mesh"`. When requesting the namespace resource itself, only the `namespace="n"` is kept.
- .GroupBy is replaced by the resource type of the requested metric (e.g. `model`, `server`, `pod` or `namespace`).

# Replace default prometheus adapter config
kubectl replace -f prometheus-adapter.config.yaml
# Restart prometheus-adapter pods
kubectl rollout restart deployment hpa-metrics-prometheus-adapter -n seldon-monitoring

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/ | jq .

kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/[NAMESPACE]/[API_RESOURCE_NAME]/[CR_NAME]/[METRIC_NAME]"

import numpy as np
import json
import requests

with open('./infer-data/test.npy', 'rb') as f:
    x_ref = np.load(f)
    x_h1 = np.load(f)
    y_ref = np.load(f)
    x_outlier = np.load(f)

reqJson = json.loads('{"inputs":[{"name":"input_1","data":[],"datatype":"FP32","shape":[]}]}')
url = "http://0.0.0.0:9000/v2/models/model/infer"

def infer(resourceName: str, batchSz: int, requestType: str):
    if requestType == "outlier":
        rows = x_outlier[0:0+batchSz]
    elif requestType == "drift":
        rows = x_h1[0:0+batchSz]
    else:
        rows = x_ref[0:0+batchSz]
    reqJson["inputs"][0]["data"] = rows.flatten().tolist()
    reqJson["inputs"][0]["shape"] = [batchSz, rows.shape[1]]
    headers = {"Content-Type": "application/json", "seldon-model": resourceName}
    response_raw = requests.post(url, json=reqJson, headers=headers)
    print(response_raw)
    print(response_raw.json())

| Environment variable | Type | Default | Description |
|---|---|---|---|
| SCHEDULER_PORT | integer | 9004 | Port for Seldon Core v2 scheduler |
| LOG_LEVEL | string | info | Level of detail for application logs |
| Metric | Level | Type | Description |
|---|---|---|---|
| is_kubernetes | cluster | Boolean | Whether or not the installation is in Kubernetes |
| kubernetes_version | cluster | Version number | Kubernetes server version, if inside Kubernetes |
| node_count | cluster | Integer | Number of nodes in the cluster, if inside Kubernetes |
| model_count | resource | Integer | Number of Model resources |
| pipeline_count | resource | Integer | Number of Pipeline resources |
| experiment_count | resource | Integer | Number of Experiment resources |
| server_count | resource | Integer | Number of Server resources |
| server_replica_count | resource | Integer | Total number of Server resource replicas |
| multimodel_enabled_count | feature | Integer | Number of Server resources with multi-model serving enabled |
| overcommit_enabled_count | feature | Integer | Number of Server resources with overcommitting enabled |
| gpu_enabled_count | feature | Integer | Number of Server resources with GPUs attached |
| inference_server_name | feature | String | Name of inference server, e.g. MLServer or Triton |
| server_cpu_cores_sum | feature | Float | Total of CPU limits across all Server resource replicas, in cores |
| server_memory_gb_sum | feature | Float | Total of memory limits across all Server resource replicas, in GiB |

| Level | Description |
|---|---|
| Cluster | Basic information about the Seldon Core v2 installation |
| Resource | High-level information about which Seldon Core v2 resources are used |
| Feature | More detailed information about how resources are used and whether or not certain feature flags are enabled |
| Environment variable | Type | Example / default | Description |
|---|---|---|---|
| METRICS_LEVEL | string | feature | Level of detail for recorded metrics; one of feature, resource, or cluster |
| EXTRA_PUBLISH_URLS | comma-separated list of URLs | http://<my-endpoint-1>:8000,http://<my-endpoint-2>:8000 | Additional endpoints to publish metrics to |
| SCHEDULER_HOST | string | seldon-scheduler | Hostname for Seldon Core v2 scheduler |

| Helm value | Environment variable |
|---|---|
| hodometer.metricsLevel | METRICS_LEVEL |
| hodometer.extraPublishUrls | EXTRA_PUBLISH_URLS |
| hodometer.logLevel | LOG_LEVEL |
| Metric | Level | Type | Description |
|---|---|---|---|
| cluster_id | cluster | UUID | A random identifier for this cluster for de-duplication |
| seldon_core_version | cluster | Version number | E.g. 1.2.3 |
| is_global_installation | cluster | Boolean | Whether installation is global or namespaced |
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/seldon-mesh/pods/mlserver-0/infer_rps

kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/*/metrics/infer_rps

cd samples/

cat models/sklearn-iris-gs.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
memory: 100Ki

seldon model load -f models/sklearn-iris-gs.yaml

cat pipelines/iris.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: iris-pipeline
spec:
steps:
- name: iris
output:
steps:
- iris

seldon pipeline load -f pipelines/iris.yaml

cat models/tfsimple1.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
seldon model load -f models/tfsimple1.yaml

cat pipelines/tfsimple.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
seldon pipeline load -f pipelines/tfsimple.yaml

seldon model list

seldon pipeline list

seldon model infer iris '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' | jq

{
"model_name": "iris_1",
"model_version": "1",
"id": "a67233c2-2f8c-4fbc-a87e-4e4d3d034c9f",
"parameters": {
"content_type": null,
"headers": null
},
"outputs": [
{
"name": "predict",
"shape": [
1
],
"datatype": "INT64",
"parameters": null,
"data": [
2
]
}
]
}
seldon pipeline infer iris-pipeline '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' | jq

{
"model_name": "",
"outputs": [
{
"data": [
2
],
"name": "predict",
"shape": [
1
],
"datatype": "INT64"
}
]
}
seldon model infer tfsimple1 '{"outputs":[{"name":"OUTPUT0"}], "inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq

{
"model_name": "tfsimple1_1",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
]
}

seldon pipeline infer tfsimple '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq

{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
cat batch-inputs/iris-input.txt | head -n 1 | jq

{
"inputs": [
{
"name": "predict",
"data": [
0.38606369295833043,
0.006894049558299753,
0.6104082981607108,
0.3958954239450676
],
"datatype": "FP64",
"shape": [
1,
4
]
}
]
}
pip install mlserver

mlserver infer -u localhost:9000 -m iris -i batch-inputs/iris-input.txt -o /tmp/iris-output.txt --workers 5

2023-01-22 18:24:17,272 [mlserver] INFO - Using asyncio event-loop policy: uvloop
2023-01-22 18:24:17,273 [mlserver] INFO - server url: localhost:9000
2023-01-22 18:24:17,273 [mlserver] INFO - model name: iris
2023-01-22 18:24:17,273 [mlserver] INFO - request headers: {}
2023-01-22 18:24:17,273 [mlserver] INFO - input file path: batch-inputs/iris-input.txt
2023-01-22 18:24:17,273 [mlserver] INFO - output file path: /tmp/iris-output.txt
2023-01-22 18:24:17,273 [mlserver] INFO - workers: 5
2023-01-22 18:24:17,273 [mlserver] INFO - retries: 3
2023-01-22 18:24:17,273 [mlserver] INFO - batch interval: 0.0
2023-01-22 18:24:17,274 [mlserver] INFO - batch jitter: 0.0
2023-01-22 18:24:17,274 [mlserver] INFO - connection timeout: 60
2023-01-22 18:24:17,274 [mlserver] INFO - micro-batch size: 1
2023-01-22 18:24:17,420 [mlserver] INFO - Finalizer: processed instances: 100
2023-01-22 18:24:17,421 [mlserver] INFO - Total processed instances: 100
2023-01-22 18:24:17,421 [mlserver] INFO - Time taken: 0.15 seconds
cat /tmp/iris-output.txt | head -n 1 | jq

mlserver infer -u localhost:9000 -m iris-pipeline.pipeline -i batch-inputs/iris-input.txt -o /tmp/iris-pipeline-output.txt --workers 5

2023-01-22 18:25:18,651 [mlserver] INFO - Using asyncio event-loop policy: uvloop
2023-01-22 18:25:18,653 [mlserver] INFO - server url: localhost:9000
2023-01-22 18:25:18,653 [mlserver] INFO - model name: iris-pipeline.pipeline
2023-01-22 18:25:18,653 [mlserver] INFO - request headers: {}
2023-01-22 18:25:18,653 [mlserver] INFO - input file path: batch-inputs/iris-input.txt
2023-01-22 18:25:18,653 [mlserver] INFO - output file path: /tmp/iris-pipeline-output.txt
2023-01-22 18:25:18,653 [mlserver] INFO - workers: 5
2023-01-22 18:25:18,653 [mlserver] INFO - retries: 3
2023-01-22 18:25:18,653 [mlserver] INFO - batch interval: 0.0
2023-01-22 18:25:18,653 [mlserver] INFO - batch jitter: 0.0
2023-01-22 18:25:18,653 [mlserver] INFO - connection timeout: 60
2023-01-22 18:25:18,653 [mlserver] INFO - micro-batch size: 1
2023-01-22 18:25:18,963 [mlserver] INFO - Finalizer: processed instances: 100
2023-01-22 18:25:18,963 [mlserver] INFO - Total processed instances: 100
2023-01-22 18:25:18,963 [mlserver] INFO - Time taken: 0.31 seconds

cat /tmp/iris-pipeline-output.txt | head -n 1 | jq

cat batch-inputs/tfsimple-input.txt | head -n 1 | jq

{
"inputs": [
{
"name": "INPUT0",
"data": [
75,
39,
9,
44,
32,
97,
99,
40,
13,
27,
25,
36,
18,
77,
62,
60
],
"datatype": "INT32",
"shape": [
1,
16
]
},
{
"name": "INPUT1",
"data": [
39,
7,
14,
58,
13,
88,
98,
66,
97,
57,
49,
3,
49,
63,
37,
12
],
"datatype": "INT32",
"shape": [
1,
16
]
}
]
}

mlserver infer -u localhost:9000 -m tfsimple1 -i batch-inputs/tfsimple-input.txt -o /tmp/tfsimple-output.txt --workers 10
2023-01-23 14:56:10,870 [mlserver] INFO - Using asyncio event-loop policy: uvloop
2023-01-23 14:56:10,872 [mlserver] INFO - server url: localhost:9000
2023-01-23 14:56:10,872 [mlserver] INFO - model name: tfsimple1
2023-01-23 14:56:10,872 [mlserver] INFO - request headers: {}
2023-01-23 14:56:10,872 [mlserver] INFO - input file path: batch-inputs/tfsimple-input.txt
2023-01-23 14:56:10,872 [mlserver] INFO - output file path: /tmp/tfsimple-output.txt
2023-01-23 14:56:10,872 [mlserver] INFO - workers: 10
2023-01-23 14:56:10,872 [mlserver] INFO - retries: 3
2023-01-23 14:56:10,872 [mlserver] INFO - batch interval: 0.0
2023-01-23 14:56:10,872 [mlserver] INFO - batch jitter: 0.0
2023-01-23 14:56:10,872 [mlserver] INFO - connection timeout: 60
2023-01-23 14:56:10,872 [mlserver] INFO - micro-batch size: 1
2023-01-23 14:56:11,077 [mlserver] INFO - Finalizer: processed instances: 100
2023-01-23 14:56:11,077 [mlserver] INFO - Total processed instances: 100
2023-01-23 14:56:11,078 [mlserver] INFO - Time taken: 0.21 seconds

cat /tmp/tfsimple-output.txt | head -n 1 | jq

{
"model_name": "tfsimple1_1",
"model_version": "1",
"id": "54e6c237-8356-4c3c-96b5-2dca4596dbe9",
"parameters": {
"batch_index": 0,
"inference_id": "54e6c237-8356-4c3c-96b5-2dca4596dbe9"
},
"outputs": [
{
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32",
"parameters": {},
"data": [
114,
46,
23,
102,
45,
185,
197,
106,
110,
84,
74,
39,
67,
140,
99,
72
]
},
{
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32",
"parameters": {},
"data": [
36,
32,
-5,
-14,
19,
9,
1,
-26,
-84,
-30,
-24,
33,
-31,
14,
25,
48
]
}
]
}

mlserver infer -u localhost:9000 -m tfsimple1 -i batch-inputs/tfsimple-input.txt -o /tmp/tfsimple-pipeline-output.txt --workers 10

2023-01-23 14:56:10,870 [mlserver] INFO - Using asyncio event-loop policy: uvloop
2023-01-23 14:56:10,872 [mlserver] INFO - server url: localhost:9000
2023-01-23 14:56:10,872 [mlserver] INFO - model name: tfsimple1
2023-01-23 14:56:10,872 [mlserver] INFO - request headers: {}
2023-01-23 14:56:10,872 [mlserver] INFO - input file path: batch-inputs/tfsimple-input.txt
2023-01-23 14:56:10,872 [mlserver] INFO - output file path: /tmp/tfsimple-pipeline-output.txt
2023-01-23 14:56:10,872 [mlserver] INFO - workers: 10
2023-01-23 14:56:10,872 [mlserver] INFO - retries: 3
2023-01-23 14:56:10,872 [mlserver] INFO - batch interval: 0.0
2023-01-23 14:56:10,872 [mlserver] INFO - batch jitter: 0.0
2023-01-23 14:56:10,872 [mlserver] INFO - connection timeout: 60
2023-01-23 14:56:10,872 [mlserver] INFO - micro-batch size: 1
2023-01-23 14:56:11,077 [mlserver] INFO - Finalizer: processed instances: 100
2023-01-23 14:56:11,077 [mlserver] INFO - Total processed instances: 100
2023-01-23 14:56:11,078 [mlserver] INFO - Time taken: 0.25 seconds
cat /tmp/tfsimple-pipeline-output.txt | head -n 1 | jq

seldon model unload iris
seldon model unload tfsimple1
seldon pipeline unload iris-pipeline
seldon pipeline unload tfsimple

cd ../ && make undeploy-local

helm install seldon-v2-runtime k8s/helm-charts/seldon-core-v2-runtime \
  --namespace seldon-mesh \
  --set hodometer.disable=true

cat ../../models/income-preprocess.yaml
echo "---"
cat ../../models/income.yaml
echo "---"
cat ../../models/income-drift.yaml
echo "---"
cat ../../models/income-outlier.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-preprocess
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/preprocessor"
requirements:
- sklearn
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/classifier"
requirements:
- sklearn
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-drift
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/drift-detector"
requirements:
- mlserver
- alibi-detect
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-outlier
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/outlier-detector"
requirements:
- mlserver
- alibi-detect
kubectl apply -f ../../models/income-preprocess.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/income.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/income-drift.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/income-outlier.yaml -n ${NAMESPACE}

model.mlops.seldon.io/income-preprocess created
model.mlops.seldon.io/income created
model.mlops.seldon.io/income-drift created
model.mlops.seldon.io/income-outlier created

kubectl wait --for condition=ready --timeout=300s model income-preprocess -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model income -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model income-drift -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model income-outlier -n ${NAMESPACE}

model.mlops.seldon.io/income-preprocess condition met
model.mlops.seldon.io/income condition met
model.mlops.seldon.io/income-drift condition met
model.mlops.seldon.io/income-outlier condition met

seldon model load -f ../../models/income-preprocess.yaml
seldon model load -f ../../models/income.yaml
seldon model load -f ../../models/income-drift.yaml
seldon model load -f ../../models/income-outlier.yaml

{}
{}
{}
{}

seldon model status income-preprocess -w ModelAvailable | jq .
seldon model status income -w ModelAvailable | jq .
seldon model status income-drift -w ModelAvailable | jq .
seldon model status income-outlier -w ModelAvailable | jq .

{}
{}
{}
{}

cat ../../pipelines/income.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: income-production
spec:
steps:
- name: income
- name: income-preprocess
- name: income-outlier
inputs:
- income-preprocess
- name: income-drift
batch:
size: 20
output:
steps:
- income
- income-outlier.outputs.is_outlier
kubectl apply -f ../../pipelines/income.yaml -n ${NAMESPACE}

pipeline.mlops.seldon.io/income created

kubectl wait --for condition=ready --timeout=300s pipelines income -n ${NAMESPACE}

pipeline.mlops.seldon.io/choice condition met

seldon pipeline load -f ../../pipelines/income.yaml

seldon pipeline status income-production -w PipelineReady | jq -M .

{
"pipelineName": "income-production",
"versions": [
{
"pipeline": {
"name": "income-production",
"uid": "cifej8iufmbc73e5int0",
"version": 1,
"steps": [
{
"name": "income"
},
{
"name": "income-drift",
"batch": {
"size": 20
}
},
{
"name": "income-outlier",
"inputs": [
"income-preprocess.outputs"
]
},
{
"name": "income-preprocess"
}
],
"output": {
"steps": [
"income.outputs",
"income-outlier.outputs.is_outlier"
]
},
"kubernetesMeta": {}
},
"state": {
"pipelineVersion": 1,
"status": "PipelineReady",
"reason": "created pipeline",
"lastChangeTimestamp": "2023-06-30T14:41:38.343754921Z",
"modelsReady": true
}
}
]
}
batchSz=20
print(y_ref[0:batchSz])
infer("income-production.pipeline",batchSz,"normal")[0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 1]
<Response [200]>
{'model_name': '', 'outputs': [{'data': [0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1], 'name': 'predict', 'shape': [20, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}, {'data': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
seldon pipeline inspect income-production.income-drift.outputs.is_drift

seldon.default.model.income-drift.outputs cifej9gfh5ss738i5br0 {"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["0"]}}
batchSz=20
print(y_ref[0:batchSz])
infer("income-production.pipeline",batchSz,"drift")[0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 1]
<Response [200]>
{'model_name': '', 'outputs': [{'data': [0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1], 'name': 'predict', 'shape': [20, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}, {'data': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
seldon pipeline inspect income-production.income-drift.outputs.is_drift

seldon.default.model.income-drift.outputs cifejaofh5ss738i5brg {"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["1"]}}
batchSz=20
print(y_ref[0:batchSz])
infer("income-production.pipeline",batchSz,"outlier")[0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 1 0 0 0 1]
<Response [200]>
{'model_name': '', 'outputs': [{'data': [0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1], 'name': 'predict', 'shape': [20, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}, {'data': [1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
seldon pipeline inspect income-production.income-drift.outputs.is_drift

seldon.default.model.income-drift.outputs cifejb8fh5ss738i5bs0 {"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["0"]}}
cat ../../models/income-explainer.yaml

apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-explainer
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/explainer"
explainer:
type: anchor_tabular
modelRef: income
kubectl apply -f ../../models/income-explainer.yaml -n ${NAMESPACE}

pipeline.mlops.seldon.io/income-explainer created

kubectl wait --for condition=ready --timeout=300s pipelines income-explainer -n ${NAMESPACE}

pipeline.mlops.seldon.io/income-explainer condition met

seldon model load -f ../../models/income-explainer.yaml

{}

seldon model status income-explainer -w ModelAvailable | jq .

{}

batchSz=1
print(y_ref[0:batchSz])
infer("income-explainer",batchSz,"normal")[0]
<Response [200]>
{'model_name': 'income-explainer_1', 'model_version': '1', 'id': 'cdd68ba5-c569-4930-886f-fbdc26e24866', 'parameters': {}, 'outputs': [{'name': 'explanation', 'shape': [1, 1], 'datatype': 'BYTES', 'parameters': {'content_type': 'str'}, 'data': ['{"meta": {"name": "AnchorTabular", "type": ["blackbox"], "explanations": ["local"], "params": {"seed": 1, "disc_perc": [25, 50, 75], "threshold": 0.95, "delta": 0.1, "tau": 0.15, "batch_size": 100, "coverage_samples": 10000, "beam_size": 1, "stop_on_first": false, "max_anchor_size": null, "min_samples_start": 100, "n_covered_ex": 10, "binary_cache_size": 10000, "cache_margin": 1000, "verbose": false, "verbose_every": 1, "kwargs": {}}, "version": "0.9.1"}, "data": {"anchor": ["Marital Status = Never-Married", "Relationship = Own-child", "Capital Gain <= 0.00"], "precision": 0.9942028985507246, "coverage": 0.0657, "raw": {"feature": [3, 5, 8], "mean": [0.7914951989026063, 0.9400749063670412, 0.9942028985507246], "precision": [0.7914951989026063, 0.9400749063670412, 0.9942028985507246], "coverage": [0.3043, 0.069, 0.0657], "examples": [{"covered_true": [[30, 0, 1, 1, 0, 1, 1, 0, 0, 0, 50, 2], [49, 4, 2, 1, 6, 0, 4, 1, 0, 0, 60, 9], [39, 2, 5, 1, 5, 0, 4, 1, 0, 0, 40, 9], [33, 4, 2, 1, 5, 0, 4, 1, 0, 0, 40, 9], [63, 4, 1, 1, 8, 1, 4, 0, 0, 0, 40, 9], [23, 4, 1, 1, 7, 1, 4, 1, 0, 0, 66, 8], [45, 4, 1, 1, 8, 0, 1, 1, 0, 0, 40, 1], [54, 4, 1, 1, 8, 4, 4, 1, 0, 0, 45, 9], [32, 6, 1, 1, 8, 4, 2, 0, 0, 0, 30, 9], [40, 5, 1, 1, 2, 0, 4, 1, 0, 0, 40, 9]], "covered_false": [[57, 4, 5, 1, 5, 0, 4, 1, 0, 1977, 45, 9], [53, 0, 5, 1, 0, 1, 4, 0, 8614, 0, 35, 9], [37, 4, 1, 1, 5, 0, 4, 1, 0, 0, 45, 9], [53, 4, 5, 1, 8, 0, 4, 1, 0, 1977, 55, 9], [35, 4, 1, 1, 8, 0, 4, 1, 7688, 0, 50, 9], [32, 4, 1, 1, 5, 1, 4, 1, 0, 0, 40, 9], [42, 4, 1, 1, 5, 0, 4, 1, 99999, 0, 40, 9], [32, 4, 1, 1, 8, 0, 4, 1, 15024, 0, 50, 9], [53, 7, 5, 1, 8, 0, 4, 1, 0, 0, 42, 9], [52, 1, 1, 1, 8, 0, 4, 1, 0, 0, 45, 9]], "uncovered_true": [], "uncovered_false": []}, {"covered_true": [[52, 7, 5, 1, 5, 3, 4, 1, 0, 0, 40, 9], [27, 4, 1, 1, 8, 3, 4, 1, 0, 0, 40, 9], [28, 4, 1, 1, 6, 3, 4, 1, 0, 0, 60, 9], [46, 6, 5, 1, 2, 3, 4, 1, 0, 0, 50, 9], [53, 2, 5, 1, 5, 3, 2, 0, 0, 1669, 35, 9], [27, 4, 5, 1, 8, 3, 4, 0, 0, 0, 40, 9], [25, 4, 1, 1, 8, 3, 4, 0, 0, 0, 40, 9], [29, 6, 5, 1, 2, 3, 4, 1, 0, 0, 30, 9], [64, 0, 1, 1, 0, 3, 4, 1, 0, 0, 50, 9], [63, 0, 5, 1, 0, 3, 4, 1, 0, 0, 30, 9]], "covered_false": [[50, 5, 1, 1, 8, 3, 4, 1, 15024, 0, 60, 9], [45, 6, 1, 1, 6, 3, 4, 1, 14084, 0, 45, 9], [37, 4, 1, 1, 8, 3, 4, 1, 15024, 0, 40, 9], [33, 4, 1, 1, 8, 3, 4, 1, 15024, 0, 60, 9], [41, 6, 5, 1, 8, 3, 4, 1, 7298, 0, 70, 9], [42, 6, 1, 1, 2, 3, 4, 1, 15024, 0, 60, 9]], "uncovered_true": [], "uncovered_false": []}, {"covered_true": [[41, 4, 1, 1, 1, 3, 4, 1, 0, 0, 40, 9], [55, 2, 5, 1, 8, 3, 4, 1, 0, 0, 50, 9], [35, 4, 5, 1, 5, 3, 4, 0, 0, 0, 32, 9], [31, 4, 1, 1, 2, 3, 4, 1, 0, 0, 40, 9], [47, 4, 1, 1, 1, 3, 4, 1, 0, 0, 40, 9], [33, 4, 5, 1, 5, 3, 4, 1, 0, 0, 40, 9], [58, 0, 1, 1, 0, 3, 4, 0, 0, 0, 50, 9], [44, 6, 1, 1, 2, 3, 4, 1, 0, 0, 90, 9], [30, 4, 1, 1, 6, 3, 4, 1, 0, 0, 40, 9], [25, 4, 1, 1, 5, 3, 4, 1, 0, 0, 40, 9]], "covered_false": [], "uncovered_true": [], "uncovered_false": []}], "all_precision": 0, "num_preds": 1000000, "success": true, "names": ["Marital Status = Never-Married", "Relationship = Own-child", "Capital Gain <= 0.00"], "prediction": [0], "instance": [47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0], "instances": [[47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 
9.0]]}}}']}]}
kubectl delete -f ../../pipelines/income-production.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/income-preprocess.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/income.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/income-drift.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/income-outlier.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/income-explainer.yaml -n ${NAMESPACE}

seldon pipeline unload income-production
seldon model unload income-preprocess
seldon model unload income
seldon model unload income-drift
seldon model unload income-outlier
seldon model unload income-explainer

---
apiVersion: mlops.seldon.io/v1alpha1
kind: ServerConfig
metadata:
name: mlserver
spec:
podSpec:
terminationGracePeriodSeconds: 120
serviceAccountName: seldon-server
containers:
- image: rclone:latest
imagePullPolicy: IfNotPresent
name: rclone
command:
- rclone
args:
- rcd
- --rc-no-auth
- --config=/rclone/rclone.conf
- --rc-addr=0.0.0.0:5572
- --max-buffer-memory=$(MAX_BUFFER_MEMORY)
env:
- name: MAX_BUFFER_MEMORY
value: "64M"
ports:
- containerPort: 5572
name: rclone
protocol: TCP
lifecycle:
preStop:
httpGet:
port: 9007
path: terminate
resources:
limits:
memory: '256M'
requests:
cpu: "200m"
memory: '100M'
readinessProbe:
failureThreshold: 3
initialDelaySeconds: 5
periodSeconds: 5
successThreshold: 1
tcpSocket:
port: 5572
timeoutSeconds: 1
livenessProbe:
failureThreshold: 1
initialDelaySeconds: 5
periodSeconds: 5
successThreshold: 1
exec:
command:
- rclone
- rc
- rc/noop
timeoutSeconds: 1
volumeMounts:
- mountPath: /mnt/agent
name: mlserver-models
- image: agent:latest
imagePullPolicy: IfNotPresent
command:
- /bin/agent
args:
- --tracing-config-path=/mnt/tracing/tracing.json
name: agent
env:
- name: SELDON_SERVER_CAPABILITIES
value: "mlserver,alibi-detect,alibi-explain,huggingface,lightgbm,mlflow,python,sklearn,spark-mlib,xgboost"
- name: SELDON_OVERCOMMIT_PERCENTAGE
value: "10"
- name: SELDON_MODEL_INFERENCE_LAG_THRESHOLD
value: "30"
- name: SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD
value: "600"
- name: SELDON_SCALING_STATS_PERIOD_SECONDS
value: "20"
- name: SELDON_SERVER_HTTP_PORT
value: "9000"
- name: SELDON_SERVER_GRPC_PORT
value: "9500"
- name: SELDON_REVERSE_PROXY_HTTP_PORT
value: "9001"
- name: SELDON_REVERSE_PROXY_GRPC_PORT
value: "9501"
- name: SELDON_SCHEDULER_HOST
value: "seldon-scheduler"
- name: SELDON_SCHEDULER_PORT
value: "9005"
- name: SELDON_SCHEDULER_TLS_PORT
value: "9055"
- name: SELDON_METRICS_PORT
value: "9006"
- name: SELDON_DRAINER_PORT
value: "9007"
- name: SELDON_READINESS_PORT
value: "9008"
- name: AGENT_TLS_SECRET_NAME
value: ""
- name: AGENT_TLS_FOLDER_PATH
value: ""
- name: SELDON_SERVER_TYPE
value: "mlserver"
- name: SELDON_ENVOY_HOST
value: "seldon-mesh"
- name: SELDON_ENVOY_PORT
value: "80"
- name: SELDON_LOG_LEVEL
value: "warn"
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: MEMORY_REQUEST
valueFrom:
resourceFieldRef:
containerName: mlserver
resource: requests.memory
- name: SELDON_USE_DEPLOYMENTS_FOR_SERVERS
value: "false"
ports:
- containerPort: 9501
name: grpc
protocol: TCP
- containerPort: 9001
name: http
protocol: TCP
- containerPort: 9006
name: metrics
protocol: TCP
- containerPort: 9008
name: readiness-port
lifecycle:
preStop:
httpGet:
port: 9007
path: terminate
readinessProbe:
httpGet:
path: /ready
port: 9008
failureThreshold: 1
periodSeconds: 5
startupProbe:
httpGet:
path: /ready
port: 9008
failureThreshold: 60
periodSeconds: 15
resources:
requests:
cpu: "500m"
memory: '500M'
volumeMounts:
- mountPath: /mnt/agent
name: mlserver-models
- name: config-volume
mountPath: /mnt/config
- name: tracing-config-volume
mountPath: /mnt/tracing
- image: mlserver:latest
imagePullPolicy: IfNotPresent
env:
- name: MLSERVER_HTTP_PORT
value: "9000"
- name: MLSERVER_GRPC_PORT
value: "9500"
- name: MLSERVER_MODELS_DIR
value: "/mnt/agent/models"
- name: MLSERVER_PARALLEL_WORKERS
value: "1"
- name: MLSERVER_LOAD_MODELS_AT_STARTUP
value: "false"
- name: MLSERVER_GRPC_MAX_MESSAGE_LENGTH
value: "1048576000" # 100MB (100 * 1024 * 1024)
resources:
requests:
cpu: 1
memory: '1G'
lifecycle:
preStop:
httpGet:
port: 9007
path: terminate
livenessProbe:
httpGet:
path: /v2/health/live
port: server-http
readinessProbe:
httpGet:
path: /v2/health/live
port: server-http
initialDelaySeconds: 5
periodSeconds: 5
startupProbe:
httpGet:
path: /v2/health/live
port: server-http
failureThreshold: 10
periodSeconds: 10
name: mlserver
ports:
- containerPort: 9500
name: server-grpc
protocol: TCP
- containerPort: 9000
name: server-http
protocol: TCP
- containerPort: 8082
name: server-metrics
volumeMounts:
- mountPath: /mnt/agent
name: mlserver-models
readOnly: true
- mountPath: /mnt/certs
name: downstream-ca-certs
readOnly: true
securityContext:
fsGroup: 2000
runAsUser: 1000
runAsNonRoot: true
volumes:
- name: config-volume
configMap:
name: seldon-agent
- name: tracing-config-volume
configMap:
name: seldon-tracing
- name: downstream-ca-certs
secret:
secretName: seldon-downstream-server
optional: true
volumeClaimTemplates:
- name: mlserver-models
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 1Gi
security.kafka.ssl.client.brokerValidationSecret - Certificate Authority of Kafka Brokers
Values in Seldon Core 2
In Seldon Core 2 you need to specify these values:
security.kafka.sasl.mechanism - set to OAUTHBEARER
security.kafka.sasl.client.secret - Created secret with client credentials
security.kafka.ssl.client.brokerValidationSecret - Certificate Authority of Kafka brokers
The resulting set of values in components-values.yaml is similar to:
The security.kafka.ssl.client.brokerValidationSecret field is optional. Leave it empty if your brokers use well known Certificate Authority such as Let's Encrypt.
ca.crt. These certificates are expected to be encoded as PEM certificates and are provided through a secret, which can be created in the namespace seldon:
This secret must be present in seldon-logs namespace and every namespace containing Seldon Core 2 runtime.
Ensure that the fields used within the secret follow the same naming convention: tls.crt, tls.key and ca.crt. Otherwise, Seldon Core 2 may not be able to find the correct set of certificates.
Reference these certificates within the corresponding Helm values for the Seldon Core 2 installation.
Values for Seldon Core 2: in Seldon Core 2 you need to specify these values:
security.kafka.ssl.client.secret - Secret name containing client certificates
security.kafka.ssl.client.brokerValidationSecret - Certificate Authority of Kafka Brokers
The resulting set of values in values.yaml is similar to:
The security.kafka.ssl.client.brokerValidationSecret field is optional. Leave it empty if your brokers use well known Certificate Authority such as Let's Encrypt.
bootstrap.servers

Creating secrets for Seldon Core 2: these are the steps that you need to perform in the Kubernetes cluster that runs Seldon Core 2 to store the SASL password.
Create a secret named confluent-kafka-sasl for Seldon Core 2 in the namespace seldon. In the following command, make sure to replace <password> with the value of the Secret that you generated in Confluent Cloud.
Cluster ID: Cluster Overview → Cluster Settings → General → Identification
Identity Pool ID: Accounts & access → Identity providers → .
Obtain these details from your identity provider, such as Keycloak or Azure AD.
Client ID
Client secret
Token Endpoint URL
If you are using Azure AD you may need to set scope: api://<client id>/.default.
Creating Kubernetes secret
Create Kubernetes secrets to store the required client credentials. For example, create a kafka-secret.yaml file by replacing the values of <client id>, <client secret>, <token endpoint url>, <cluster id>,<identity pool id> with the values that you obtained from Confluent Cloud and your identity provider.
Provide the secret named confluent-kafka-oauth in the seldon namespace to configure with Seldon Core 2.
This secret must be present in seldon-logs namespace and every namespace containing Seldon Core 2 runtime.
Replace <username> with the value of the Key that you generated in Confluent Cloud, and <confluent-endpoints> with the value of bootstrap.server that you generated in Confluent Cloud.
Update the initial configuration for Seldon Core 2 Operator in the components-values.yaml file. Use your preferred text editor to update and save the file with the following content:
Replace <confluent-endpoints> with the value of bootstrap.server that you generated in Confluent Cloud. Update the initial configuration for the Seldon Core 2 Operator in the components-values.yaml file. Use your preferred text editor to update and save the file with the following content:
| Helm value | Chart | Description | Default |
|---|---|---|---|
| agent.modelInferenceLagThreshold | components | Queue lag threshold to trigger scaling up of a model replica. | 30 |
| agent.modelInactiveSecondsThreshold | components | Period with no activity after which to trigger scaling down of a model replica. | 600 |
| autoscaling.serverPackingEnabled | components | Whether packing of models onto fewer servers is enabled. | false |
| autoscaling.serverPackingPercentage | components | Percentage of events where packing is allowed. Higher values represent more aggressive packing. Only used when serverPackingEnabled is set. Range is 0.0 to 1.0. | 0.0 |
| agent.maxUnloadElapsedTimeMinutes | components | Max time allowed for one model unload command for a model on a particular server replica. Lower values allow errors to be exposed faster. | 15 |
| agent.maxUnloadRetryCount | components | Max number of retries for an unsuccessful unload command for a model on a particular server replica. Lower values allow control plane commands to fail faster. | 5 |
| agent.unloadGracePeriodSeconds | components | A period guarding against race conditions between Envoy applying the cluster change to remove a route and the model replica unload command proceeding. | 2 |
| dataflow.logLevelKafka | components | check klogging level | |
| scheduler.logLevel | components | check logrus log level | |
| modelgateway.logLevel | components | check logrus log level | |
| pipelinegateway.logLevel | components | check logrus log level | |
| hodometer.logLevel | components | check logrus log level | |
| serverConfig.rclone.logLevel | components | check rclone log-level | |
| serverConfig.agent.logLevel | components | check logrus log level | |
| envoy.preStopSleepPeriodSeconds | components | Sleep after calling the prestop command. | 30 |
| envoy.terminationGracePeriodSeconds | components | Grace period to wait for prestop to finish for Envoy pods. | 120 |
| envoy.enableAccesslog | components | Whether to enable logging of requests. | true |
| envoy.accesslogPath | components | Path on disk to store the logfile. Only used when enableAccesslog is set. | /tmp/envoy-accesslog.txt |
| envoy.includeSuccessfulRequests | components | Whether to include successful requests. If set to false, only failed requests are logged. Only used when enableAccesslog is set. | false |
| autoscaling.autoscalingModelEnabled | components | Enable native autoscaling for Models. This is orthogonal to external autoscaling services, e.g. HPA. | false |
| autoscaling.autoscalingServerEnabled | components | Enable native autoscaling for Servers. This is orthogonal to external autoscaling services, e.g. HPA. | true |
| agent.scalingStatsPeriodSeconds | components | Sampling rate for metrics used for autoscaling. | 20 |
| serverConfig.terminationGracePeriodSeconds | components | Grace period to wait for the prestop process to finish for this particular Server pod. | 120 |
| agent.overcommitPercentage | components | Overcommit percentage (of memory) allowed. Range is 0 to 100. | 10 |
| agent.maxLoadElapsedTimeMinutes | components | Max time allowed for one model load command for a model on a particular server replica. Lower values allow errors to be exposed faster. | 120 |
| agent.maxLoadRetryCount | components | Max number of retries for an unsuccessful load command for a model on a particular server replica. Lower values allow control plane commands to fail faster. | 5 |
| logging.logLevel | components | Component-wide setting for logging level, used if individual component levels are not set. Options: debug, info, error. | info |
| controller.logLevel | components | check zap log level here | |
| dataflow.logLevel | components | check klogging level | |
Once metrics for custom autoscaling are configured (see the previous section on exposing custom metrics), Kubernetes resources, including Models, can be autoscaled using HPA by applying an HPA manifest that targets the chosen scaling metric.
Consider a model named irisa0 with a manifest like the one sketched below. Note that we don't set minReplicas/maxReplicas, in order to disable Seldon's inference-lag-based autoscaling so that it doesn't interact with HPA (separate minReplicas/maxReplicas values will be set on the HPA side).
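A minimal sketch of such a manifest is shown below; the storageUri, memory, and namespace are illustrative placeholders reused from the iris example earlier on this page, and the important detail is that spec.replicas is set explicitly:

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: irisa0
  namespace: seldon-mesh
spec:
  storageUri: "gs://seldon-models/mlserver/iris"
  requirements:
  - sklearn
  memory: 100Ki
  replicas: 1   # explicitly defined so that HPA can manage this field
```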
Core 2 supports full horizontal scaling for the dataflow engine, model gateway, and pipeline gateway. Each service automatically distributes pipelines or models across replicas using consistent hashing, so you don’t need to manually assign workloads.
This guide explains how scaling works, what configuration controls it, and what happens when replicas or pipelines/models change.
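To illustrate the general idea (a simplified sketch only, not the actual Core 2 implementation), the following shows how a consistent-hash ring keeps most pipeline-to-replica assignments stable when a replica is added:

```python
# Simplified consistent-hash ring: assigns pipelines to service replicas.
# Illustrative only -- not the Core 2 dataflow-engine implementation.
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, replicas: list[str], vnodes: int = 100):
        # Place several virtual nodes per replica on the ring for better balance.
        self._ring = sorted((_hash(f"{r}#{i}"), r) for r in replicas for i in range(vnodes))
        self._keys = [h for h, _ in self._ring]

    def assign(self, pipeline: str) -> str:
        # A pipeline maps to the first virtual node clockwise from its hash.
        idx = bisect.bisect(self._keys, _hash(pipeline)) % len(self._ring)
        return self._ring[idx][1]

pipelines = [f"pipeline-{i}" for i in range(10)]
before = {p: HashRing(["replica-0", "replica-1"]).assign(p) for p in pipelines}
after = {p: HashRing(["replica-0", "replica-1", "replica-2"]).assign(p) for p in pipelines}
moved = sum(before[p] != after[p] for p in pipelines)
print(f"{moved}/{len(pipelines)} pipelines moved after adding a replica")
```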
Learn how to perform model inference in Seldon Core using REST and gRPC protocols, including request/response formats and client examples.
This section will discuss how to make inference calls against your Seldon models or pipelines.
You can make synchronous inference requests via REST or gRPC, or asynchronous requests via Kafka topics. The content of your request should be an Open Inference Protocol payload:
REST payloads will generally be in the JSON v2 protocol format.
gRPC and Kafka payloads must be in the Protobuf v2 protocol format.
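For example, a REST request in the Open Inference Protocol format might look like the following; the host, port, and model name are placeholders for your own setup:

```bash
# Illustrative REST inference request through the seldon-mesh endpoint
curl -X POST http://<seldon-mesh-host>:80/v2/models/iris/infer \
  -H "Content-Type: application/json" \
  -H "Seldon-Model: iris" \
  -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
```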
The “model ready” health API indicates whether a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL. If a version is not provided, the server may choose a version based on its own policies. The model readiness endpoints report that an individual model is loaded and ready to serve; they are intended only to give you visibility into the model's state and are not intended to be used as a Kubernetes readiness probe for the MLServer container. Using a model-specific health endpoint as the container readiness probe can cause a deadlock with the current implementation: the Seldon agent does not begin model download until the Pod's IP is visible in endpoints; the IP of the Pod is only published after the Pod is Ready, that is, after all internal readiness checks have passed; and the MLServer container only becomes Ready once the model is loaded. The result is that the agent never downloads the model and the Pod never becomes Ready. For container-level readiness checks we recommend the server-level readiness endpoints, which indicate that the MLServer process is up and accepting health checks and do not deadlock the agent/model loading flow.
The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.
Possible responses: OK, Not Found.
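As a sketch, fetching metadata for a model named iris through the mesh endpoint might look like this (host and port are placeholders):

```bash
# Illustrative model metadata request (HTTP GET on the V2 metadata endpoint)
curl http://<seldon-mesh-host>:80/v2/models/iris \
  -H "Seldon-Model: iris"
```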
An inference request is made with an HTTP POST to an inference endpoint. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.
apiVersion: v1
kind: Secret
metadata:
name: confluent-kafka-oauth
type: Opaque
stringData:
method: OIDC
client_id: <client id>
client_secret: <client secret>
token_endpoint_url: <token endpoint url>
extensions: logicalCluster=<cluster id>,identityPoolId=<identity pool id>
scope: ""kubectl create secret generic kafka-sasl-secret --from-literal password=<kafka-password> -n seldon-meshsecurity:
kafka:
protocol: SASL_SSL
sasl:
mechanism: SCRAM-SHA-512
client:
username: <kafka-username> # TODO: Replace with your Kafka username
secret: kafka-sasl-secret # NOTE: Secret name from previous step
ssl:
client:
secret: # NOTE: Leave empty
brokerValidationSecret: kafka-broker-tls # NOTE: Optional

apiVersion: v1
kind: Secret
metadata:
name: kafka-oauth
type: Opaque
stringData:
method: OIDC
client_id: <client id>
client_secret: <client secret>
token_endpoint_url: <token endpoint url>
extensions: ""
scope: ""kubectl apply -f kafka-oauth.yaml -n seldon-meshEndpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=XXXXXX;SharedAccessKey=XXXXXXkubectl create secret generic azure-kafka-secret --from-literal=<password>="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=XXXXXX;SharedAccessKey=XXXXXX" -n seldonkubectl create secret generic azure-kafka-secret --from-literal=<password>="Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=XXXXXX;SharedAccessKey=XXXXXX" -n seldon-systemcontroller:
clusterwide: true
dataflow:
resources:
cpu: 500m
envoy:
service:
type: ClusterIP
kafka:
bootstrap: <namespace>.servicebus.windows.net:9093
topics:
replicationFactor: 3
numPartitions: 4
security:
kafka:
protocol: SASL_SSL
sasl:
mechanism: "PLAIN"
client:
username: $ConnectionString
secret: azure-kafka-secret
ssl:
client:
secret:
brokerValidationSecret:
opentelemetry:
enable: false
scheduler:
service:
type: ClusterIP
serverConfig:
mlserver:
resources:
cpu: 1
memory: 2Gi
triton:
resources:
cpu: 1
memory: 2Gi
serviceGRPCPrefix: "http2-"security:
kafka:
protocol: SASL_SSL
sasl:
mechanism: OAUTHBEARER
client:
secret: kafka-oauth # NOTE: Secret name from earlier step
ssl:
client:
secret: # NOTE: Leave empty
brokerValidationSecret: kafka-broker-tls # NOTE: Optional

kubectl create secret generic kafka-client-tls -n seldon \
--from-file ./tls.crt \
--from-file ./tls.key \
--from-file ./ca.crt

security:
kafka:
protocol: SSL
ssl:
client:
secret: kafka-client-tls # NOTE: Secret name from earlier step
brokerValidationSecret: kafka-broker-tls # NOTE: Optional

kubectl create secret generic confluent-kafka-sasl --from-literal password="<password>" -n seldon

controller:
clusterwide: true
dataflow:
resources:
cpu: 500m
envoy:
service:
type: ClusterIP
kafka:
bootstrap: <confluent-endpoints>
topics:
replicationFactor: 3
numPartitions: 4
consumer:
messageMaxBytes: 8388608
producer:
messageMaxBytes: 8388608
security:
kafka:
protocol: SASL_SSL
sasl:
mechanism: "PLAIN"
client:
username: <username>
secret: confluent-kafka-sasl
ssl:
client:
secret:
brokerValidationSecret:
opentelemetry:
enable: false
scheduler:
service:
type: ClusterIP
serverConfig:
mlserver:
resources:
cpu: 1
memory: 2Gi
triton:
resources:
cpu: 1
memory: 2Gi
serviceGRPCPrefix: "http2-"controller:
clusterwide: true
dataflow:
resources:
cpu: 500m
envoy:
service:
type: ClusterIP
kafka:
bootstrap: <confluent-endpoints>
topics:
replicationFactor: 3
numPartitions: 4
consumer:
messageMaxBytes: 8388608
producer:
messageMaxBytes: 8388608
security:
kafka:
protocol: SASL_SSL
sasl:
mechanism: OAUTHBEARER
client:
secret: confluent-kafka-oauth
ssl:
client:
secret:
brokerValidationSecret:
opentelemetry:
enable: false
scheduler:
service:
type: ClusterIP
serverConfig:
mlserver:
resources:
cpu: 1
memory: 2Gi
triton:
resources:
cpu: 1
memory: 2Gi
serviceGRPCPrefix: "http2-"spec.replicas. This is the key modified by HPA to increase the number of replicas, and if not present in the manifest it will result in HPA not working until the Model CR is modified to have spec.replicas defined.Let's scale this model when it is deployed on a server named mlserver, with a target RPS per replica of 3 RPS (higher RPS would trigger scale-up, lower would trigger scale-down):
The Object metric allows for two target value types: AverageValue and Value. Of the two, only AverageValue is supported for the current Seldon Core 2 setup. The Value target type is typically used for metrics describing the utilization of a resource and would not be suitable for RPS-based scaling.
The example HPA manifests use metrics of type "Object" that fetch the data used in scaling decisions by querying k8s metrics associated with a particular k8s object. The endpoints that HPA uses for fetching those metrics are the same ones that were tested in the previous section using kubectl get --raw .... Because you have configured the Prometheus Adapter to expose those k8s metrics based on queries to Prometheus, a mapping exists between the information contained in the HPA Object metric definition and the actual query that is executed against Prometheus. This section aims to give more details on how this mapping works.
In our example, metric.name: infer_rps gets mapped to the seldon_model_infer_total metric on the Prometheus side, based on the configuration in the name section of the Prometheus Adapter ConfigMap. The Prometheus metric name is then used to fill in the <<.Series>> template in the query (metricsQuery in the same ConfigMap).
Then, the information provided in the describedObject is used within the Prometheus query to select the right aggregations of the metric. For the RPS metric used to scale the Model (and the Server because of the 1-1 mapping), it makes sense to compute the aggregate RPS across all the replicas of a given model, so the describedObject references a specific Model CR.
However, in the general case, the describedObject does not need to be a Model. Any k8s object listed in the resources section of the Prometheus Adapter ConfigMap may be used. The Prometheus label associated with the object kind fills in the <<.GroupBy>> template, while the name gets used as part of the <<.LabelMatchers>>. For example:
If the described object is { kind: Namespace, name: seldon-mesh }, then the Prometheus query template configured in our example would be transformed into:
If the described object is not a namespace (for example, { kind: Pod, name: mlserver-0 }), then the query will be passed the label describing the object, alongside an additional label identifying the namespace where the HPA manifest resides:
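For reference, a minimal sketch of what such a rule might look like in the Prometheus Adapter ConfigMap is shown below. The label-to-resource overrides and the 2-minute rate window are assumptions and should be adjusted to match your own adapter configuration:
rules:
- seriesQuery: 'seldon_model_infer_total{namespace!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
      model: {group: "mlops.seldon.io", resource: "model"}
  name:
    matches: "seldon_model_infer_total"
    as: "infer_rps"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'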
The target section establishes the thresholds used in scaling decisions. For RPS, the AverageValue target type refers to the per-replica RPS threshold above which the number of the scaleTargetRef (Model or Server) replicas should be increased. The target number of replicas is computed by HPA according to the following formula:
$\text{targetReplicas} = \lceil \text{custom\_metric\_value} \,/\, \text{target.averageValue} \rceil$
As an example, if averageValue=50 and infer_rps=150, the targetReplicas would be 3.
Importantly, computing the target number of replicas does not require knowing the number of active pods currently associated with the Server or Model. This is what allows both the Model and the Server to be targeted by two separate HPA manifests. Otherwise, both HPA CRs would attempt to take ownership of the same set of pods, and transition into a failure state.
This is also why the Value target type is not currently supported. In this case, HPA first computes a utilizationRatio:
$\text{utilizationRatio} = \text{custom\_metric\_value} \,/\, \text{threshold\_value}$
As an example, if threshold_value=100 and custom_metric_value=200, the utilizationRatio would be 2. HPA deduces from this that the number of active pods associated with the scaleTargetRef object should be doubled, and expects that once that target is achieved, the custom_metric_value will become equal to the threshold_value (utilizationRatio=1). However, by using the number of active pods, the HPA CRs for both the Model and the Server also try to take exclusive ownership of the same set of pods, and fail.
Each HPA CR has its own timer on which it samples the specified custom metrics. This timer starts when the CR is created, with sampling of the metric being done at regular intervals (by default, 15 seconds). When showing the HPA CR information via kubectl get, a column of the output will display the current metric value per replica and the target average value in the format [per replica metric value][target]. This information is updated in accordance with the sampling rate of each HPA resource.
Once Model autoscaling is set up (either through HPA, or by Seldon Core), users will need to configure Server autoscaling. You can use Seldon Core's native autoscaling functionality for Servers here.
Otherwise, if you want to scale Servers using HPA as well - this only works in a setup where all Models and Servers have a 1-1 mapping - you will also need to set up HPA manifests for Servers. This is explained in more detail here.
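As a sketch, an HPA manifest targeting a Server named mlserver might look as follows; it mirrors the Model HPA shown later on this page, and the metric, namespace, and replica bounds are assumptions that must be kept in sync with the Model's HPA:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mlserver-hpa
  namespace: seldon-mesh
spec:
  scaleTargetRef:
    apiVersion: mlops.seldon.io/v1alpha1
    kind: Server
    name: mlserver
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Object
    object:
      metric:
        name: infer_rps
      describedObject:
        apiVersion: mlops.seldon.io/v1alpha1
        kind: Model
        name: irisa0
      target:
        type: AverageValue
        averageValue: 3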
Filtering metrics by additional labels on the prometheus metric - The prometheus metric from which the model RPS is computed has the following labels managed by Seldon Core 2:
If you want the scaling metric to be computed based on a subset of the Prometheus time series with particular label values (labels either managed by Seldon Core 2 or added automatically within your infrastructure), you can add this as a selector to the HPA metric config. This is shown in the following example, which scales based only on the RPS of REST requests as opposed to REST + gRPC:
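metrics:
- type: Object
  object:
    describedObject:
      apiVersion: mlops.seldon.io/v1alpha1
      kind: Model
      name: irisa0
    metric:
      name: infer_rps
      selector:
        matchLabels:
          method_type: rest
    target:
      type: AverageValue
      averageValue: "3"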
Customize scale-up / scale-down rate & properties by using scaling policies as described in the HPA scaling policies docs
For more resources, please consult the HPA docs and the HPA walkthrough
When deploying HPA-based scaling for Seldon Core 2 models and servers as part of a production deployment, it is important to understand the exact interactions between HPA-triggered actions and Seldon Core 2 scheduling, as well as potential pitfalls in choosing particular HPA configurations.
Using the default scaling policy, HPA is relatively aggressive on scale-up (responding quickly to increases in load), with a maximum replicas increase of either 4 every 15 seconds or 100% of existing replicas within the same period (whichever is highest). In contrast, scaling-down is more gradual, with HPA only scaling down to the maximum number of recommended replicas in the most recent 5 minute rolling window, in order to avoid flapping. Those parameters can be customized via scaling policies.
When using custom metrics such as RPS, the actual number of replicas added during scale-up or removed during scale-down depends not only on the maximums imposed by the policy, but also on the configured target (averageValue RPS per replica) and on how quickly the inferencing load varies in your cluster. All three need to be considered jointly in order to both use resources efficiently and meet SLAs.
Naturally, the first thing to consider is an estimated peak inference load (including some margins) for each of the models in the cluster. If the minimum number of model replicas needed to serve that load without breaching latency SLAs is known, it should be set as spec.maxReplicas, with the HPA target.averageValue set to peak_infer_RPS/maxReplicas.
If maxReplicas is not already known, an open-loop load test with a slowly ramping up request rate should be done on the target model (one replica, no scaling). This would allow you to determine the RPS (inference request throughput) when latency SLAs are breached or (depending on the desired operation point) when latency starts increasing. You would then set the HPA target.averageValue taking some margin below this saturation RPS, and compute spec.maxReplicas as peak_infer_RPS/target.averageValue. The margin taken below the saturation point is very important, because scaling-up cannot be instant (it requires spinning up new pods, downloading model artifacts, etc.). In the period until the new replicas become available, any load increases will still need to be absorbed by the existing replicas.
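For example, if a single replica saturates at around 6 RPS and the estimated peak load is 40 RPS, you might set target.averageValue to 5 (keeping a margin below saturation) and spec.maxReplicas to ceil(40 / 5) = 8; these numbers are purely illustrative.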
If there are multiple models which typically experience peak load in a correlated manner, you need to ensure that sufficient cluster resources are available for k8s to concurrently schedule the maximum number of server pods, with each pod holding one model replica. This can be ensured by using either Cluster Autoscaler or, when running workloads in the cloud, any provider-specific cluster autoscaling services.
It is important for the cluster to have sufficient resources for creating the total number of desired server replicas set by the HPA CRs across all the models at a given time.
Not having sufficient cluster resources to serve the number of replicas configured by HPA at a given moment, in particular under aggressive scale-up HPA policies, may result in breaches of SLAs. This is discussed in more detail in the following section.
A similar approach should be taken for setting minReplicas, in relation to the estimated RPS in the low-load regime. However, it's useful to balance lower resource usage against the immediate availability of replicas when the inference rate increases from that lowest load point. If low-load regimes only occur for short periods of time, especially when combined with a high rate of increase in RPS when moving out of the low-load regime, it may be worth setting the minReplicas floor higher in order to ensure SLAs are met at all times.
The following elements are important to take into account when setting the HPA policies for models:
The duration of transient load spikes which you might want to absorb within the existing per-replica RPS margins.
Say you configure a scale-up stabilization window of one minute. This means that, of all the replica counts recommended by HPA in the last 60-second window (4 samples of the custom metric at the default sampling rate), only the smallest will be applied.
Such stabilization windows should be set depending on typical load patterns in your cluster: not being too aggressive in reacting to increased load will allow you to achieve cost savings, but has the disadvantage of a delayed reaction if the load spike turns out to be sustained.
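For illustration, a one-minute scale-up stabilization window combined with a default-style scale-up policy could be expressed in the HPA spec roughly as follows (the values are examples only):
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 4
        periodSeconds: 15
      - type: Percent
        value: 100
        periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300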
The duration of any typical/expected sustained ramp-up period, and the RPS increase rate during this period.
It is useful to consider whether the replica scale-up rate configured via the policy is able to keep-up with this RPS increase rate.
Such a scenario may appear, for example, if you are planning for a smooth traffic ramp-up in a blue-green deployment as you are draining the "blue" deployment and transitioning to the "green" one
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonConfig
metadata:
name: default
spec:
components:
- name: seldon-dataflow-engine
replicas: 1
podSpec:
containers:
- env:
- name: SELDON_UPSTREAM_HOST
value: seldon-scheduler
- name: SELDON_UPSTREAM_PORT
value: "9008"
- name: OTEL_JAVAAGENT_ENABLED
valueFrom:
configMapKeyRef:
key: OTEL_JAVAAGENT_ENABLED
name: seldon-tracing
- name: OTEL_EXPORTER_OTLP_ENDPOINT
valueFrom:
configMapKeyRef:
key: OTEL_EXPORTER_OTLP_ENDPOINT
name: seldon-tracing
- name: OTEL_EXPORTER_OTLP_PROTOCOL
valueFrom:
configMapKeyRef:
key: OTEL_EXPORTER_OTLP_PROTOCOL
name: seldon-tracing
- name: SELDON_POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
image: seldonio/seldon-dataflow-engine:latest
imagePullPolicy: Always
name: dataflow-engine
resources:
limits:
memory: 3G
requests:
cpu: 100m
memory: 1G
ports:
- containerPort: 8000
name: health
startupProbe:
failureThreshold: 10
httpGet:
path: /startup
port: health
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
failureThreshold: 3
httpGet:
path: /ready
port: health
periodSeconds: 5
livenessProbe:
failureThreshold: 3
httpGet:
path: /live
port: health
periodSeconds: 5
volumeMounts:
- mountPath: /mnt/schema-registry
name: kafka-schema-volume
readOnly: true
serviceAccountName: seldon-scheduler
terminationGracePeriodSeconds: 5
volumes:
- secret:
secretName: confluent-schema
optional: true
name: kafka-schema-volume
- name: seldon-envoy
replicas: 1
annotations:
"prometheus.io/path": "/stats/prometheus"
"prometheus.io/port": "9003"
"prometheus.io/scrape": "true"
podSpec:
containers:
- image: seldonio/seldon-envoy:latest
imagePullPolicy: Always
name: envoy
ports:
- containerPort: 9000
name: http
- containerPort: 9003
name: envoy-stats
resources:
limits:
memory: 128Mi
requests:
cpu: 100m
memory: 128Mi
readinessProbe:
httpGet:
path: /ready
port: envoy-stats
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 3
terminationGracePeriodSeconds: 5
- name: hodometer
replicas: 1
podSpec:
containers:
- env:
- name: PUBLISH_URL
value: http://hodometer.seldon.io
- name: SCHEDULER_HOST
value: seldon-scheduler
- name: SCHEDULER_PLAINTXT_PORT
value: "9004"
- name: SCHEDULER_TLS_PORT
value: "9044"
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
image: seldonio/seldon-hodometer:latest
imagePullPolicy: Always
name: hodometer
resources:
limits:
memory: 32Mi
requests:
cpu: 1m
memory: 32Mi
serviceAccountName: hodometer
terminationGracePeriodSeconds: 5
- name: seldon-modelgateway
replicas: 1
podSpec:
containers:
- args:
- --scheduler-host=seldon-scheduler
- --scheduler-plaintxt-port=$(SELDON_SCHEDULER_PLAINTXT_PORT)
- --scheduler-tls-port=$(SELDON_SCHEDULER_TLS_PORT)
- --envoy-host=seldon-mesh
- --envoy-port=80
- --kafka-config-path=/mnt/kafka/kafka.json
- --tracing-config-path=/mnt/tracing/tracing.json
- --log-level=$(LOG_LEVEL)
- --health-probe-port=$(HEALTH_PROBE_PORT)
command:
- /bin/modelgateway
env:
- name: SELDON_SCHEDULER_PLAINTXT_PORT
value: "9004"
- name: SELDON_SCHEDULER_TLS_PORT
value: "9044"
- name: MODELGATEWAY_MAX_NUM_CONSUMERS
value: "100"
- name: LOG_LEVEL
value: "warn"
- name: HEALTH_PROBE_PORT
value: "9999"
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
image: seldonio/seldon-modelgateway:latest
imagePullPolicy: Always
name: modelgateway
resources:
limits:
memory: 1G
requests:
cpu: 100m
memory: 1G
ports:
- containerPort: 9999
name: health
protocol: TCP
startupProbe:
httpGet:
path: /startup
port: health
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 10
readinessProbe:
httpGet:
path: /ready
port: health
periodSeconds: 5
failureThreshold: 3
livenessProbe:
httpGet:
path: /live
port: health
periodSeconds: 5
failureThreshold: 3
volumeMounts:
- mountPath: /mnt/kafka
name: kafka-config-volume
- mountPath: /mnt/tracing
name: tracing-config-volume
- mountPath: /mnt/schema-registry
name: kafka-schema-volume
readOnly: true
serviceAccountName: seldon-scheduler
terminationGracePeriodSeconds: 5
volumes:
- configMap:
name: seldon-kafka
name: kafka-config-volume
- configMap:
name: seldon-tracing
name: tracing-config-volume
- secret:
secretName: confluent-schema
optional: true
name: kafka-schema-volume
- name: seldon-pipelinegateway
replicas: 1
podSpec:
containers:
- args:
- --http-port=9010
- --grpc-port=9011
- --metrics-port=9006
- --scheduler-host=seldon-scheduler
- --scheduler-plaintxt-port=$(SELDON_SCHEDULER_PLAINTXT_PORT)
- --scheduler-tls-port=$(SELDON_SCHEDULER_TLS_PORT)
- --envoy-host=seldon-mesh
- --envoy-port=80
- --kafka-config-path=/mnt/kafka/kafka.json
- --tracing-config-path=/mnt/tracing/tracing.json
- --log-level=$(LOG_LEVEL)
- --health-probe-port=$(HEALTH_PROBE_PORT)
command:
- /bin/pipelinegateway
env:
- name: SELDON_SCHEDULER_PLAINTXT_PORT
value: "9004"
- name: SELDON_SCHEDULER_TLS_PORT
value: "9044"
- name: LOG_LEVEL
value: "warn"
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: HEALTH_PROBE_PORT
value: "9999"
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
image: seldonio/seldon-pipelinegateway
imagePullPolicy: Always
name: pipelinegateway
ports:
- containerPort: 9010
name: http
protocol: TCP
- containerPort: 9011
name: grpc
protocol: TCP
- containerPort: 9006
name: metrics
protocol: TCP
- containerPort: 9999
name: health
protocol: TCP
resources:
limits:
memory: 1G
requests:
cpu: 100m
memory: 1G
startupProbe:
httpGet:
path: /startup
port: health
initialDelaySeconds: 10
periodSeconds: 10
failureThreshold: 10
readinessProbe:
httpGet:
path: /ready
port: health
periodSeconds: 5
failureThreshold: 3
livenessProbe:
httpGet:
path: /live
port: health
periodSeconds: 5
failureThreshold: 3
volumeMounts:
- mountPath: /mnt/kafka
name: kafka-config-volume
- mountPath: /mnt/tracing
name: tracing-config-volume
- mountPath: /mnt/schema-registry
name: kafka-schema-volume
readOnly: true
serviceAccountName: seldon-scheduler
terminationGracePeriodSeconds: 5
volumes:
- configMap:
name: seldon-kafka
name: kafka-config-volume
- configMap:
name: seldon-tracing
name: tracing-config-volume
- secret:
secretName: confluent-schema
optional: true
name: kafka-schema-volume
- name: seldon-scheduler
replicas: 1
podSpec:
containers:
- args:
- --pipeline-gateway-host=seldon-pipelinegateway
- --tracing-config-path=/mnt/tracing/tracing.json
- --db-path=/mnt/scheduler/db
- --allow-plaintxt=$(ALLOW_PLAINTXT)
- --kafka-config-path=/mnt/kafka/kafka.json
- --scaling-config-path=/mnt/scaling/scaling.yaml
- --scheduler-ready-timeout-seconds=$(SCHEDULER_READY_TIMEOUT_SECONDS)
- --server-packing-enabled=$(SERVER_PACKING_ENABLED)
- --server-packing-percentage=$(SERVER_PACKING_PERCENTAGE)
- --envoy-accesslog-path=$(ENVOY_ACCESSLOG_PATH)
- --enable-envoy-accesslog=$(ENABLE_ENVOY_ACCESSLOG)
- --include-successful-requests-envoy-accesslog=$(INCLUDE_SUCCESSFUL_REQUESTS_ENVOY_ACCESSLOG)
- --enable-model-autoscaling=$(ENABLE_MODEL_AUTOSCALING)
- --enable-server-autoscaling=$(ENABLE_SERVER_AUTOSCALING)
- --log-level=$(LOG_LEVEL)
- --health-probe-port=$(HEALTH_PROBE_PORT)
- --enable-pprof=$(ENABLE_PPROF)
- --pprof-port=$(PPROF_PORT)
- --pprof-block-rate=$(PPROF_BLOCK_RATE)
- --pprof-mutex-rate=$(PPROF_MUTEX_RATE)
- --retry-creating-failed-pipelines-tick=$(RETRY_CREATING_FAILED_PIPELINES_TICK)
- --retry-deleting-failed-pipelines-tick=$(RETRY_DELETING_FAILED_PIPELINES_TICK)
- --max-retry-failed-pipelines=$(MAX_RETRY_FAILED_PIPELINES)
command:
- /bin/scheduler
env:
- name: ALLOW_PLAINTXT
value: "true"
- name: SCHEDULER_READY_TIMEOUT_SECONDS
value: "600"
- name: SERVER_PACKING_ENABLED
value: "false"
- name: SERVER_PACKING_PERCENTAGE
value: "0.0"
- name: ENVOY_ACCESSLOG_PATH
value: /tmp/envoy-accesslog.txt
- name: ENABLE_ENVOY_ACCESSLOG
value: "true"
- name: INCLUDE_SUCCESSFUL_REQUESTS_ENVOY_ACCESSLOG
value: "false"
- name: ENABLE_MODEL_AUTOSCALING
value: "false"
- name: ENABLE_SERVER_AUTOSCALING
value: "true"
- name: LOG_LEVEL
value: "warn"
- name: MODELGATEWAY_MAX_NUM_CONSUMERS
value: "100"
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: HEALTH_PROBE_PORT
value: "9999"
- name: ENABLE_PPROF
value: "false"
- name: PPROF_PORT
value: "6060"
- name: PPROF_BLOCK_RATE
value: "0"
- name: PPROF_MUTEX_RATE
value: "0"
- name: RETRY_CREATING_FAILED_PIPELINES_TICK
value: "60s"
- name: RETRY_DELETING_FAILED_PIPELINES_TICK
value: "60s"
- name: MAX_RETRY_FAILED_PIPELINES
value: "10"
image: seldonio/seldon-scheduler:latest
imagePullPolicy: Always
name: scheduler
ports:
- containerPort: 9002
name: xds
- containerPort: 9004
name: scheduler
- containerPort: 9044
name: scheduler-mtls
- containerPort: 9005
name: agent
- containerPort: 9055
name: agent-mtls
- containerPort: 9008
name: dataflow
- containerPort: 9999
name: health
protocol: TCP
readinessProbe:
httpGet:
path: /ready
port: health
periodSeconds: 5
failureThreshold: 3
initialDelaySeconds: 10
livenessProbe:
httpGet:
path: /live
port: health
periodSeconds: 5
failureThreshold: 3
initialDelaySeconds: 10
resources:
limits:
memory: 1G
requests:
cpu: 100m
memory: 1G
volumeMounts:
- mountPath: /mnt/kafka
name: kafka-config-volume
- mountPath: /mnt/scaling
name: scaling-config-volume
- mountPath: /mnt/tracing
name: tracing-config-volume
- mountPath: /mnt/scheduler
name: scheduler-state
serviceAccountName: seldon-scheduler
terminationGracePeriodSeconds: 5
volumes:
- configMap:
name: seldon-scaling
name: scaling-config-volume
- configMap:
name: seldon-kafka
name: kafka-config-volume
- configMap:
name: seldon-tracing
name: tracing-config-volume
volumeClaimTemplates:
- name: scheduler-state
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1G
kubectl apply -f kafka-secret.yaml -n seldon
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: join
spec:
steps:
- name: tfsimple1
- name: tfsimple2
- name: tfsimple3
inputs:
- tfsimple1.outputs.OUTPUT0
- tfsimple2.outputs.OUTPUT1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple2.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple3
seldon_model_infer_total{
code="200",
container="agent",
endpoint="metrics",
instance="10.244.0.39:9006",
job="seldon-mesh/agent",
method_type="rest",
model="irisa0",
model_internal="irisa0_1",
namespace="seldon-mesh",
pod="mlserver-0",
server="mlserver",
server_replica="0"
}
metrics:
- type: Object
object:
describedObject:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
name: irisa0
metric:
name: infer_rps
selector:
matchLabels:
method_type: rest
target:
type: AverageValue
averageValue: "3"apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: irisa0
namespace: seldon-mesh
spec:
memory: 3M
replicas: 1
requirements:
- sklearn
storageUri: gs://seldon-models/testing/iris1
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: irisa0-model-hpa
namespace: seldon-mesh
spec:
scaleTargetRef:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
name: irisa0
minReplicas: 1
maxReplicas: 3
metrics:
- type: Object
object:
metric:
name: infer_rps
describedObject:
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
name: irisa0
target:
type: AverageValue
averageValue: 3
sum by (namespace) (
rate (
seldon_model_infer_total{namespace="seldon-mesh"}[2m]
)
)
sum by (pod) (
rate (
seldon_model_infer_total{pod="mlserver-0", namespace="seldon-mesh"}[2m]
)
)
{
"name": "text",
"version": "text",
"extensions": [
"text"
]
}
GET /v2/models/{model_name}/versions/{model_version} HTTP/1.1
Host:
Accept: */*
{
"name": "text",
"versions": [
"text"
],
"platform": "text",
"inputs": [
{
"name": "text",
"datatype": "text",
"shape": [
1
],
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
],
"outputs": [
{
"name": "text",
"datatype": "text",
"shape": [
1
],
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
],
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
GET /v2/models/{model_name} HTTP/1.1
Host:
Accept: */*
{
"name": "text",
"versions": [
"text"
],
"platform": "text",
"inputs": [
{
"name": "text",
"datatype": "text",
"shape": [
1
],
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
],
"outputs": [
{
"name": "text",
"datatype": "text",
"shape": [
1
],
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
],
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
GET /v2 HTTP/1.1
Host:
Accept: */*
| Component | What it scales with (max maxShardCountMultiplier = #partitions) | Max replicas used |
| --- | --- | --- |
| Dataflow engine | #pipelines × maxShardCountMultiplier (capped by configured replicas) | min(replicas, pipelines × partitions) |
| Model gateway | #models × maxShardCountMultiplier (capped by replicas and maxNumConsumers) | min(replicas, min(models, maxNumConsumers) × partitions) |
| Pipeline gateway | #pipelines × maxShardCountMultiplier (capped by replicas and maxNumConsumers) | min(replicas, min(pipelines, maxNumConsumers) × partitions) |
Each pipeline/model is loaded only on a subset of replicas, and automatically rebalanced when:
You scale replicas up/down
You deploy or delete pipelines / models
The configuration parameter determining the maximum number of component replicas on which a pipeline/model can be loaded is maxShardCountMultiplier. It can be set in SeldonConfig under config.scalingConfig.pipelines.maxShardCountMultiplier. For installs via Helm, this parameter defaults to {{ .Values.kafka.topics.numPartitions }}. In fact, the number of Kafka partitions per topic is the maximum value that maxShardCountMultiplier should be set to: increasing it beyond the number of Kafka partitions brings no additional performance benefit, and may actually lead to dropped requests because the extra replicas receive requests while managing no Kafka partitions within their consumer groups.
This parameter may be changed during cluster operation, with the new value propagated to all components over a ~1 minute interval. Changing it causes pipelines/models to be rebalanced across the dataflow-engine, model-gateway, and pipeline-gateway replicas and may lead to downtime, depending on the configured Kafka partition assignment strategy. If your Kafka version supports cooperative rebalancing of consumer groups, setting the partition assignment strategy to cooperative-sticky ensures that rebalancing happens with minimal disruption. dataflow-engine uses a cooperative rebalancing strategy by default.
You do not need to manually assign work — it’s handled automatically.
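As a sketch, setting this parameter in a SeldonConfig might look as follows, assuming the scalingConfig block sits under spec.config alongside the other configuration sections and that topics have 4 Kafka partitions:
apiVersion: mlops.seldon.io/v1alpha1
kind: SeldonConfig
metadata:
  name: default
spec:
  config:
    scalingConfig:
      pipelines:
        maxShardCountMultiplier: 4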
Dataflow engine is responsible for executing pipeline logic and moving data between pipeline stages. Core 2 now supports running multiple pipelines in parallel across multiple dataflow engine replicas.
You control scaling using:
| Parameter | Where it is set | What it controls |
| --- | --- | --- |
| spec.replicas | SeldonRuntime | Maximum number of dataflow engine instances |
| config.scalingConfig.pipelines.maxShardCountMultiplier | SeldonConfig | Determines max replication per pipeline (max possible: #Kafka partitions) |
Dataflow engine replicas are dynamically adjusted based on number of pipelines deployed and Kafka partitions. The final number of dataflow engine replicas is given by:
$\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \text{pipelines} \times \text{partitions})$
Example:

| Pipelines | Kafka partitions | spec.replicas | Final replica count |
| --- | --- | --- | --- |
| 3 | 4 | 9 | min(9, 3 x 4 = 12) → 9 replicas |
| 2 | 4 | 9 | min(9, 2 x 4 = 8) → 8 replicas |
| 1 | 4 | 9 | min(9, 1 x 4 = 4) → 4 replicas |
Note: Unused replicas are automatically scaled down. As more pipelines are added, dataflow engine automatically scales up, capped by the maximum number of replicas.
Core 2 uses consistent hashing to distribute pipelines evenly across dataflow replicas. This ensures a balanced workload, but it does not guarantee a perfect one-to-one mapping.
Even if the number of replicas equals pipelines × partitions, small imbalances between the number of pipelines handled by each replica may exist. In practice, the distribution is statistically uniform.
Each pipeline is replicated across multiple dataflow engines (up to number of Kafka partitions).
When instances are added or removed, pipelines are automatically rebalanced.
Note: This process is handled internally by Core 2, so no manual intervention is needed.
Loading/unloading of the pipeline from the dataflow engine is performed when the pipeline CR is loaded/unloaded.
The scheduler confirms whether the loading/unloading was performed successfully through the Pipeline status under the CR.
Rebalancing happens in the background — you don’t need to intervene.
Note: For pipelines, Pipeline ready status must be satisfied in order for the pipeline to be marked ready.
The model gateway is responsible for routing inference requests to models when used inside pipelines. Like the dataflow engine, it scales dynamically based on how many models are deployed.
| Parameter | Where it is set | What it controls |
| --- | --- | --- |
| spec.replicas | SeldonRuntime | Maximum number of model gateway instances |
| config.scalingConfig.pipelines.maxShardCountMultiplier | SeldonConfig | Determines max replication per model (#Kafka partitions) |
| maxNumConsumers | SeldonConfig - model gateway env var (default: 100) | Caps how many distinct consumer groups can exist |
Model gateway replicas are dynamically adjusted based on number of models deployed, Kafka partitions, and maxNumConsumers. The final number of model gateway replicas is given by:
$\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \min(\text{models}, \text{maxNumConsumers}) \times \text{partitions})$
Example:

| Models | Kafka partitions | spec.replicas | maxNumConsumers | Final replica count |
| --- | --- | --- | --- | --- |
| 5 | 4 | 20 | 100 | min(20, min(5, 100) x 4 = 20) = 20 → 20 replicas |
| 1 | 4 | 20 | 100 | min(20, min(1, 100) x 4 = 4) = 4 → 4 replicas |
Note: If you remove models, the model gateway automatically scales down, and if you add models, the model gateway automatically scales up, capped by the maximum number of replicas.
The model gateway doesn't load every model on every replica, only on a subset of replicas. The same principle as for the dataflow engine applies to the model gateway (sharding through consistent hashing).
Loading/unloading of the model from the model gateway is performed when the model CR is loaded/unloaded.
The scheduler confirms whether the loading/unloading was performed successfully through the ModelGw status under the CR.
Rebalancing happens in the background — you don’t need to intervene.
Note: ModelGw status does not represent a condition for the model to be available. If the loading was successful on the dedicated servers, the model itself is ready for inference.
The ModelGw status becomes relevant for pipelines, or when the end user wants to perform inference via the async path (i.e., writing requests to the model's input topic and reading responses from the model's output topic in Kafka).
In the context of pipelines, the ModelReady status becomes a conjunction of whether the model is available on servers and whether it has been loaded successfully on the model gateway.
The pipeline gateway is responsible for writing requests to the pipeline's input topic and waiting for the response on the output topic. Like the dataflow engine and model gateway, the pipeline gateway can scale horizontally.
| Parameter | Where it is set | What it controls |
| --- | --- | --- |
| spec.replicas | SeldonRuntime | Maximum number of pipeline gateway instances |
| config.scalingConfig.pipelines.maxShardCountMultiplier | SeldonConfig | Determines max replication per pipeline (#Kafka partitions) |
| maxNumConsumers | SeldonConfig - pipeline gateway env var (default: 100) | Caps how many distinct consumer groups can exist |
Pipeline gateway replicas are dynamically adjusted based on number of pipelines deployed, Kafka partitions, and maxNumConsumers. The final number of pipeline gateway replicas is given by:
$\text{FinalReplicaCount} = \min(\text{spec.replicas},\ \min(\text{pipelines}, \text{maxNumConsumers}) \times \text{partitions})$
Example:

| Pipelines | Kafka partitions | spec.replicas | maxNumConsumers | Final replica count |
| --- | --- | --- | --- | --- |
| 8 | 4 | 10 | 100 | min(10, min(8, 100) x 4 = 32) = 10 → 10 replicas |
| 2 | 4 | 10 | 100 | min(10, min(2, 100) x 4 = 8) = 8 → 8 replicas |
| 1 | 4 | 10 | 100 | min(10, min(1, 100) x 4 = 4) = 4 → 4 replicas |
Note: Similarly to the dataflow engine, the pipeline gateway scales up and down as pipelines are added and removed.
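To make the arithmetic in the examples above concrete, here is a small illustrative Python helper (not part of Core 2) that reproduces the final replica counts from the tables:
def final_replicas(spec_replicas, shards, partitions, max_num_consumers=None):
    """Effective replica count for the dataflow engine, model gateway, or
    pipeline gateway. `shards` is the number of pipelines (dataflow engine,
    pipeline gateway) or models (model gateway)."""
    if max_num_consumers is not None:
        shards = min(shards, max_num_consumers)
    return min(spec_replicas, shards * partitions)

# Dataflow engine examples: 9, 8 and 4 replicas
print(final_replicas(9, 3, 4), final_replicas(9, 2, 4), final_replicas(9, 1, 4))
# Model gateway examples: 20 and 4 replicas
print(final_replicas(20, 5, 4, 100), final_replicas(20, 1, 4, 100))
# Pipeline gateway examples: 10, 8 and 4 replicas
print(final_replicas(10, 8, 4, 100), final_replicas(10, 2, 4, 100), final_replicas(10, 1, 4, 100))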
Pipeline gateway doesn’t load every pipeline on every replica but only on a subset of replicas. The same principle as for dataflow engine and model gateway applies for pipeline gateway (sharding through consistent hashing).
Loading/unloading of the pipeline from the pipeline gateway is performed when the pipeline CR is loaded/unloaded.
The scheduler confirms whether the loading/unloading was performed successfully through the PipelineGw status under the CR.
Analogous with the previous services, rebalancing happens in the background — you don’t need to intervene.
Note: For pipelines, PipelineGw ready status must be satisfied in order for the pipeline to be marked ready.
For making synchronous requests, the process will generally be:
Find the appropriate service endpoint (IP address and port) for accessing the installation of Seldon Core 2.
Determine the appropriate headers/metadata for the request.
Make requests via REST or gRPC.
In the default Docker Compose setup, container ports are accessible from the host machine. This means you can use localhost or 0.0.0.0 as the hostname.
The default port for sending inference requests to the Seldon system is 9000. This is controlled by the ENVOY_DATA_PORT environment variable for Compose.
Putting this together, you can send inference requests to 0.0.0.0:9000.
In Kubernetes, Seldon creates a single Service called seldon-mesh in the namespace it is installed into. By default, this namespace is also called seldon-mesh.
If this Service is exposed via a load balancer, the appropriate address and port can be found via:
kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
If you are not using a LoadBalancer for the seldon-mesh Service, you can still send inference requests.
For development and testing purposes, you can port-forward the Service locally using the below. Inference requests can then be sent to localhost:8080.
kubectl port-forward svc/seldon-mesh -n seldon-mesh 8080:80
If you are using a service mesh like Istio or Ambassador, you will need to use the IP address of the service mesh ingress and determine the appropriate port.
Let us imagine making inference requests to a model called iris.
This iris model has the following schema, which can be set in a model-settings.json file for MLServer:
Examples are given below for some common tools for making requests.
An example seldon request might look like this:
The default inference mode is REST, but you can also send gRPC requests like this:
An example curl request might look like this:
An example grpcurl request might look like this:
The above request was run from the project root folder allowing reference to the Protobuf manifests defined in the apis/ folder.
You can use the Python package to send inference requests.
A short, self-contained example is:
Seldon needs to determine where to route requests to, as models and pipelines might have the same name. There are two ways of doing this: header-based routing (preferred) and path-based routing.
Seldon can route requests to the correct endpoint via headers in HTTP calls, both for REST (HTTP/1.1) and gRPC (HTTP/2).
Use the Seldon-Model header as follows:
For models, use the model name as the value. For example, to send requests to a model named foo use the header Seldon-Model: foo.
For pipelines, use the pipeline name followed by .pipeline as the value. For example, to send requests to a pipeline named foo use the header Seldon-Model: foo.pipeline.
The seldon CLI is aware of these rules and can be used to easily send requests to your deployed resources. See the Seldon CLI documentation for more information.
The inference v2 protocol is only aware of models, thus has no concept of pipelines. Seldon works around this limitation by introducing virtual endpoints for pipelines. Virtual means that Seldon understands them, but other v2 protocol-compatible components like inference servers do not.
Use the following rules for paths to route to models and pipelines:
For models, use the path prefix /v2/models/{model name}. This is normal usage of the inference v2 protocol.
For pipelines, you can use the path prefix /v2/pipelines/{pipeline name}. Otherwise calling pipelines looks just like the inference v2 protocol for models. Do not use any suffix for the pipeline name as you would for routing headers.
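For example, a REST request to a hypothetical pipeline named mypipeline could be sent as follows (host and payload as in the earlier iris examples):
curl http://0.0.0.0:9000/v2/pipelines/mypipeline/infer \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'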
Extending our examples from above, the requests may look like the below when using header-based routing.
No changes are required as the seldon CLI already understands how to set the appropriate gRPC and REST headers.
Note the rpc-header flag in the penultimate line:
Note the headers dictionary in the client.infer() call:
If you are using an ingress controller to make inference requests with Seldon, you will need to configure the routing rules correctly.
There are many ways to do this, but custom path prefixes will not work with gRPC. This is because gRPC determines the path based on the Protobuf definition. Some gRPC implementations permit manipulating paths when sending requests, but this is by no means universal.
If you want to expose your inference endpoints via gRPC and REST in a consistent way, you should use virtual hosts, subdomains, or headers.
The downside of using only paths is that you cannot differentiate between different installations of Seldon Core 2 or between traffic to Seldon and any other inference endpoints you may have exposed via the same ingress.
You might want to use a mixture of these methods; the choice is yours.
Virtual hosts are a way of differentiating between logical services accessed via the same physical machine(s).
Virtual hosts are defined by the Host header for HTTP/1 and the :authority pseudo-header for HTTP/2. These represent the same thing, and the HTTP/2 specification defines how to translate these when converting between protocol versions.
Many tools and libraries treat these headers as special and have particular ways of handling them. Some common ones are given below:
The seldon CLI has an --authority flag which applies to both REST and gRPC inference calls.
curl accepts Host as a normal header.
grpcurl has an -authority flag.
In Go, the standard library's http.Request struct has a Host field and ignores attempts to set this value via headers.
In Python, the requests library accepts the host as a normal header.
Be sure to check the documentation for how to set this with your preferred tools and languages.
Subdomain names constitute a part of the overall host name. As such, specifying a subdomain name for requests will involve setting the appropriate host in the URI.
For example, you may expose inference services in the namespaces seldon-1 and seldon-2 as in the following snippets:
Many popular ingresses support subdomain-based routing, including Istio and Nginx. Please refer to the documentation for your ingress of choice for further information.
Many ingress controllers and service meshes support routing on headers. You can use whatever headers you prefer, so long as they do not conflict with any Seldon relies upon.
Many tools and libraries support adding custom headers to requests. Some common ones are given below:
The seldon CLI accepts headers using the --header flag, which can be specified multiple times.
curl accepts headers using the -H or --header flag.
It is possible to route on paths by using well-known path prefixes defined by the inference v2 protocol. For gRPC, the full path (or "method") for an inference call is:
/inference.GRPCInferenceService/ModelInfer
This corresponds to the package (inference), service (GRPCInferenceService), and RPC name (ModelInfer) in the Protobuf definition of the inference v2 protocol.
You could use an exact match or a regex like .*inference.* to match this path, for example.
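As a sketch, assuming an Istio VirtualService in front of the seldon-mesh Service (the gateway name, hosts, and namespace below are assumptions), such path-based routing might look like:
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: seldon-inference
  namespace: seldon-mesh
spec:
  hosts:
  - "*"
  gateways:
  - seldon-gateway
  http:
  - match:
    - uri:
        prefix: /v2/
    - uri:
        regex: .*inference.GRPCInferenceService.*
    route:
    - destination:
        host: seldon-mesh
        port:
          number: 80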
The Seldon architecture uses Kafka and therefore asynchronous requests can be sent by pushing inference v2 protocol payloads to the appropriate topic. Topics have the following form:
seldon.<namespace>.<model|pipeline>.<name>.<inputs|outputs>
For a local install if you have a model iris, you would be able to send a prediction request by pushing to the topic: seldon.default.model.iris.inputs. The response will appear on seldon.default.model.iris.outputs.
For a Kubernetes install in seldon-mesh if you have a model iris, you would be able to send a prediction request by pushing to the topic: seldon.seldon-mesh.model.iris.inputs. The response will appear on seldon.seldon-mesh.model.iris.outputs.
For a local install if you have a pipeline mypipeline, you would be able to send a prediction request by pushing to the topic: seldon.default.pipeline.mypipeline.inputs. The response will appear on seldon.default.pipeline.mypipeline.outputs.
For a Kubernetes install in seldon-mesh if you have a pipeline mypipeline, you would be able to send a prediction request by pushing to the topic: seldon.seldon-mesh.pipeline.mypipeline.inputs. The response will appear on seldon.seldon-mesh.pipeline.mypipeline.outputs.
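As an illustrative sketch (not part of Seldon), the following Python snippet uses the confluent-kafka client to push a v2 payload onto a model's input topic for a local install; the bootstrap address is an assumption for your environment:
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed local broker

# Inference v2 protocol payload for the iris model
payload = {
    "inputs": [
        {"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}
    ]
}

# Local install: model input topic is seldon.default.model.iris.inputs
producer.produce(
    "seldon.default.model.iris.inputs",
    key="request-1",
    value=json.dumps(payload),
)
producer.flush()
# The response will appear on seldon.default.model.iris.outputs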
It may be useful to send metadata alongside your inference.
If using Kafka directly as described above, you can attach Kafka metadata to your request, which will be passed around the graph. When making synchronous requests to your pipeline with REST or gRPC you can also do this.
For REST requests add HTTP headers prefixed with X-
For gRPC requests add metadata with keys starting with X-
You can also do this with the Seldon CLI by setting headers with the --header argument (and also showing response headers with the --show-headers argument)
For pipeline inference, the response also contains an x-pipeline-version header, indicating which version of the pipeline the inference ran against.
For both model and pipeline requests the response will contain an x-request-id response header. For pipeline requests this can be used to inspect the pipeline steps via the CLI, e.g.:
The --offset parameter specifies how many messages (from the latest) you want to search to find your request. If not specified the last request will be shown.
x-request-id will also appear in tracing spans.
If x-request-id is passed in by the caller then this will be used. It is the caller's responsibility to ensure it is unique.
The IDs generated are XIDs.
Examples of various model artifact types from various frameworks running under Seldon Core 2.
SKlearn
Tensorflow
XGBoost
ONNX
Lightgbm
MLFlow
PyTorch
Python requirements in model-zoo-requirements.txt
The training code for this model can be found at scripts/models/iris in SCv2 repo.
The training code for this model can be found at ./scripts/models/income-xgb
This model is a pretrained model as defined in ./scripts/models/Makefile target mnist-onnx
The training code for this model can be found at ./scripts/models/income-lgb
The training code for this model can be found at ./scripts/models/wine-mlflow
This example model is downloaded and trained in ./scripts/models/Makefile target mnist-pytorch
We use a simple sklearn iris classification model
Load the model
kubectl apply -f ./models/sklearn-iris-gs.yaml -n ${NAMESPACE}
model.mlops.seldon.io/iris created
seldon model load -f ./models/sklearn-iris-gs.yaml
{}
Wait for the model to be ready
kubectl wait --for condition=ready --timeout=300s model iris -n ${NAMESPACE}
model.mlops.seldon.io/iris condition met
seldon model status iris -w ModelAvailable | jq -M .
{}
Do a REST inference call
Do a gRPC inference call
Unload the model
We run a simple tensorflow model. Note the requirements section specifying tensorflow.
Load the model.
Wait for the model to be ready.
Get model metadata
Do a REST inference call.
Do a gRPC inference call
Unload the model
We will use two SKlearn Iris classification models to illustrate an experiment.
Load both models.
Wait for both models to be ready.
Create an experiment that modifies the iris model to add a second model splitting traffic 50/50 between the two.
Start the experiment.
Wait for the experiment to be ready.
Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.
Run one more request
Use sticky session key passed by last infer request to ensure same route is taken each time. We will test REST and gRPC.
gRPC
Stop the experiment
Show the requests all go to original model now.
Unload both models.
seldon model infer iris \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'seldon model infer iris \
--inference-mode grpc \
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}'curl -v http://0.0.0.0:9000/v2/models/iris/infer \
-H "Content-Type: application/json" \
-d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'grpcurl \
-d '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' \
-plaintext \
-import-path apis \
-proto apis/mlops/v2_dataplane/v2_dataplane.proto \
0.0.0.0:9000 inference.GRPCInferenceService/ModelInfer
curl -v http://0.0.0.0:9000/v2/models/iris/infer \
-H "Content-Type: application/json" \
-d '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' \
-H "Seldon-Model: iris":emphasize-lines: 6
grpcurl \
-d '{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' \
-plaintext \
-import-path apis \
-proto apis/mlops/v2_dataplane/v2_dataplane.proto \
-rpc-header seldon-model:iris \
0.0.0.0:9000 inference.GRPCInferenceService/ModelInfer
import tritonclient.http as httpclient
import numpy as np
client = httpclient.InferenceServerClient(
url="localhost:8080",
verbose=False,
)
inputs = [httpclient.InferInput("predict", (1, 4), "FP64")]
inputs[0].set_data_from_numpy(
np.array([[1, 2, 3, 4]]).astype("float64"),
binary_data=False,
)
result = client.infer(
"iris",
inputs,
headers={"Seldon-Model": "iris"},
)
print("result is:", result.as_numpy("predict")){
"name": "iris",
"implementation": "mlserver_sklearn.SKLearnModel",
"inputs": [
{
"name": "predict",
"datatype": "FP32",
"shape": [-1, 4]
}
],
"outputs": [
{
"name": "predict",
"datatype": "INT64",
"shape": [-1, 1]
}
],
"parameters": {
"version": "1"
}
}
seldon.<namespace>.<model|pipeline>.<name>.<inputs|outputs>
seldon pipeline infer --show-headers --header X-foo=bar tfsimples \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'seldon pipeline inspect tfsimples --request-id carjjolvqj3j2pfbut10 --offset 10GET /v2/health/live HTTP/1.1
Host:
Accept: */*
GET /v2/health/ready HTTP/1.1
Host:
Accept: */*
POST /v2/models/{model_name}/versions/{model_version}/infer HTTP/1.1
Host:
Content-Type: application/json
Accept: */*
Content-Length: 371
{
"id": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"inputs": [
{
"name": "text",
"shape": [
1
],
"datatype": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"data": null
}
],
"outputs": [
{
"name": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
]
}
POST /v2/models/{model_name}/infer HTTP/1.1
Host:
Content-Type: application/json
Accept: */*
Content-Length: 371
{
"id": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"inputs": [
{
"name": "text",
"shape": [
1
],
"datatype": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"data": null
}
],
"outputs": [
{
"name": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
}
}
]
}
cat ./models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
config:
kafkaConfig:
producer:
compression.type: gzip
For pipelines, you can also use the path prefix /v2/models/{pipeline name}.pipeline. Again, this form looks just like the inference v2 protocol for models.
grpcurl accepts headers using the -H flag, which can be specified multiple times.
kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
kubectl port-forward svc/seldon-mesh -n seldon-mesh 8080:80
import tritonclient.http as httpclient
import numpy as np
client = httpclient.InferenceServerClient(
url="localhost:8080",
verbose=False,
)
inputs = [httpclient.InferInput("predict", (1, 4), "FP64")]
inputs[0].set_data_from_numpy(
np.array([[1, 2, 3, 4]]).astype("float64"),
binary_data=False,
)
result = client.infer("iris", inputs)
print("result is:", result.as_numpy("predict"))curl https://seldon-1.example.com/v2/models/iris/infer ...
seldon model infer --inference-host https://seldon-2.example.com/v2/models/iris/infer ...
/inference.GRPCInferenceService/ModelInfer
GET /v2/models/{model_name}/ready HTTP/1.1
Host:
Accept: */*
{
"model_name": "text",
"model_version": "text",
"id": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"outputs": [
{
"name": "text",
"shape": [
1
],
"datatype": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"data": null
}
]
}
{
"model_name": "text",
"model_version": "text",
"id": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"outputs": [
{
"name": "text",
"shape": [
1
],
"datatype": "text",
"parameters": {
"content_type": "text",
"headers": {},
"ANY_ADDITIONAL_PROPERTY": "anything"
},
"data": null
}
]
}
curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'{
"model_name": "iris_1",
"model_version": "1",
"id": "983bd95f-4b4d-4ff1-95b2-df9d6d089164",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
seldon model infer iris \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
{
"model_name": "iris_1",
"model_version": "1",
"id": "983bd95f-4b4d-4ff1-95b2-df9d6d089164",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
seldon model infer iris --inference-mode grpc \
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .
{
"modelName": "iris_1",
"modelVersion": "1",
"outputs": [
{
"name": "predict",
"datatype": "INT64",
"shape": [
"1",
"1"
],
"parameters": {
"content_type": {
"stringParam": "np"
}
},
"contents": {
"int64Contents": [
"2"
]
}
}
]
}
kubectl delete -f ./models/sklearn-iris-gs.yaml -n ${NAMESPACE}
model.mlops.seldon.io "iris" deleted
seldon model unload iris
cat ./models/tfsimple1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl apply -f ./models/tfsimple1.yaml -n ${NAMESPACE}
model.mlops.seldon.io/tfsimple1 created
seldon model load -f ./models/tfsimple1.yaml
{}
kubectl wait --for condition=ready --timeout=300s model tfsimple1 -n ${NAMESPACE}
model.mlops.seldon.io/tfsimple1 condition met
seldon model status tfsimple1 -w ModelAvailable | jq -M .
{}
curl --location 'http://${SELDON_INFER_HOST}/v2/models/tfsimple1'
{
"name": "tfsimple1_1",
"versions": [
"1"
],
"platform": "tensorflow_graphdef",
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [
-1,
16
]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [
-1,
16
]
}
],
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
-1,
16
]
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
-1,
16
]
}
]
}
seldon model metadata tfsimple1
{
"name": "tfsimple1_1",
"versions": [
"1"
],
"platform": "tensorflow_graphdef",
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [
-1,
16
]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [
-1,
16
]
}
],
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
-1,
16
]
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
-1,
16
]
}
]
}
curl --location 'http://${SELDON_INFER_HOST}/v2/models/tfsimple1/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .{
"model_name": "tfsimple1_1",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
}
]
}
seldon model infer tfsimple1 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .{
"model_name": "tfsimple1_1",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
}
]
}
seldon model infer tfsimple1 --inference-mode grpc \
'{"model_name":"tfsimple1","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .{
"modelName": "tfsimple1_1",
"modelVersion": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
]
}
}
]
}
kubectl delete -f ./models/tfsimple1.yaml -n ${NAMESPACE}
model.mlops.seldon.io "tfsimple1" deleted
seldon model unload tfsimple1
cat ./models/sklearn1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
cat ./models/sklearn2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris2
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
kubectl apply -f ./models/sklearn1.yaml -n ${NAMESPACE}
kubectl apply -f ./models/sklearn2.yaml -n ${NAMESPACE}
model.mlops.seldon.io/sklearn1 created
model.mlops.seldon.io/sklearn2 created
seldon model load -f ./models/sklearn1.yaml
seldon model load -f ./models/sklearn2.yaml
{}
{}
kubectl wait --for condition=ready --timeout=300s model iris -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model iris2 -n ${NAMESPACE}
model.mlops.seldon.io/iris condition met
model.mlops.seldon.io/iris2 condition met
seldon model status iris | jq -M .
seldon model status iris2 | jq -M .
{
"modelName": "iris",
"versions": [
{
"version": 1,
"serverName": "mlserver",
"kubernetesMeta": {},
"modelReplicaState": {
"0": {
"state": "Available",
"lastChangeTimestamp": "2023-06-29T14:01:41.362720538Z"
}
},
"state": {
"state": "ModelAvailable",
"availableReplicas": 1,
"lastChangeTimestamp": "2023-06-29T14:01:41.362720538Z"
},
"modelDefn": {
"meta": {
"name": "iris",
"kubernetesMeta": {}
},
"modelSpec": {
"uri": "gs://seldon-models/mlserver/iris",
"requirements": [
"sklearn"
]
},
"deploymentSpec": {
"replicas": 1
}
}
}
]
}
{
"modelName": "iris2",
"versions": [
{
"version": 1,
"serverName": "mlserver",
"kubernetesMeta": {},
"modelReplicaState": {
"0": {
"state": "Available",
"lastChangeTimestamp": "2023-06-29T14:01:41.362845079Z"
}
},
"state": {
"state": "ModelAvailable",
"availableReplicas": 1,
"lastChangeTimestamp": "2023-06-29T14:01:41.362845079Z"
},
"modelDefn": {
"meta": {
"name": "iris2",
"kubernetesMeta": {}
},
"modelSpec": {
"uri": "gs://seldon-models/mlserver/iris",
"requirements": [
"sklearn"
]
},
"deploymentSpec": {
"replicas": 1
}
}
}
]
}
cat ./experiments/ab-default-model.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris
candidates:
- name: iris
weight: 50
- name: iris2
weight: 50
seldon experiment start -f ./experiments/ab-default-model.yaml
seldon experiment status experiment-sample -w | jq -M .
{
"experimentName": "experiment-sample",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon model infer iris -i 100 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::57 :iris_1::43]
curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'seldon model infer iris \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'{
"model_name": "iris_1",
"model_version": "1",
"id": "fa425bdf-737c-41fe-894d-58868f70fe5d",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'seldon model infer iris -s -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris --inference-mode grpc -s -i 50\
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}'Success: map[:iris_1::50]
seldon experiment stop experiment-samplecurl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'seldon model infer iris -i 100 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::100]
kubectl delete -f ./models/sklearn1.yaml -n ${NAMESPACE}
kubectl delete -f ./models/sklearn2.yaml -n ${NAMESPACE}seldon model unload iris
seldon model unload iris2apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimples
spec:
steps:
- name: tfsimple1
- name: tfsimple2
inputs:
- tfsimple1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple1.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple2
cat ./models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
kubectl apply -f ./models/sklearn-iris-gs.yaml -n ${NAMESPACE}
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model iris -n ${NAMESPACE}
model.mlops.seldon.io/iris condition met
curl --location 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
--header 'Content-Type: application/json' \
--data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'{
"model_name": "iris_1",
"model_version": "1",
"id": "09263298-ca66-49c5-acb9-0ca75b06f825",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"data": [
2
]
}
]
}
kubectl delete -f ./models/sklearn-iris-gs.yaml -n ${NAMESPACE}
seldon model load -f ./models/sklearn-iris-gs.yaml
{}
seldon model status iris -w ModelAvailable | jq -M .
{}
seldon model infer iris \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'{
"model_name": "iris_1",
"model_version": "1",
"id": "09263298-ca66-49c5-acb9-0ca75b06f825",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"data": [
2
]
}
]
}
seldon model unload iris
{}
import requests
import json
from typing import Dict, List
import numpy as np
import os
import tensorflow as tf
from alibi_detect.utils.perturbation import apply_mask
from alibi_detect.datasets import fetch_cifar10c
import matplotlib.pyplot as plt
tf.keras.backend.clear_session()
2023-03-09 19:43:43.637892: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-03-09 19:43:43.637906: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
train, test = tf.keras.datasets.cifar10.load_data()
X_train, y_train = train
X_test, y_test = test
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
classes = (
"plane",
"car",
"bird",
"cat",
"deer",
"dog",
"frog",
"horse",
"ship",
"truck",
)
(50000, 32, 32, 3) (50000, 1) (10000, 32, 32, 3) (10000, 1)
reqJson = json.loads('{"inputs":[{"name":"input_1","data":[],"datatype":"FP32","shape":[]}]}')
url = "http://0.0.0.0:9000/v2/models/model/infer"
def infer(resourceName: str, idx: int):
rows = X_train[idx:idx+1]
show(rows[0])
reqJson["inputs"][0]["data"] = rows.flatten().tolist()
reqJson["inputs"][0]["shape"] = [1, 32, 32, 3]
headers = {"Content-Type": "application/json", "seldon-model":resourceName}
response_raw = requests.post(url, json=reqJson, headers=headers)
probs = np.array(response_raw.json()["outputs"][0]["data"])
print(classes[probs.argmax(axis=0)])
def show(X):
plt.imshow(X.reshape(32, 32, 3))
plt.axis("off")
plt.show()
cat ./models/cifar10-no-config.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: cifar10
spec:
storageUri: "gs://seldon-models/scv2/samples/tensorflow/cifar10"
requirements:
- tensorflow
kubectl apply -f ./models/cifar10-no-config.yaml -n ${NAMESPACE}
model.mlops.seldon.io/cifar10 created
kubectl wait --for condition=ready --timeout=300s model cifar10 -n ${NAMESPACE}
model.mlops.seldon.io/cifar10 condition met
seldon model load -f ./models/cifar10-no-config.yaml
{}
seldon model status cifar10 -w ModelAvailable | jq -M .
{}
infer("cifar10",4)
car
kubectl delete -f ./models/cifar10-no-config.yaml -n ${NAMESPACE}
seldon model unload cifar10
{}
cat ./models/income-xgb.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-xgb
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/income-xgb"
requirements:
- xgboost
kubectl apply -f ./models/income-xgb.yaml -n ${NAMESPACE}
model.mlops.seldon.io/income-xgb created
kubectl wait --for condition=ready --timeout=300s model income-xgb -n ${NAMESPACE}
model.mlops.seldon.io/income-xgb condition met
curl --location 'http://${SELDON_INFER_HOST}/v2/models/income-xgb/infer' \
--header 'Content-Type: application/json' \
--data '{ "parameters": {"content_type": "pd"}, "inputs": [{"name": "Age", "shape": [1, 1], "datatype": "INT64", "data": [47]},{"name": "Workclass", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Education", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Marital Status", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Occupation", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Relationship", "shape": [1, 1], "datatype": "INT64", "data": [3]},{"name": "Race", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Sex", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Capital Gain", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Capital Loss", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Hours per week", "shape": [1, 1], "datatype": "INT64", "data": [40]},{"name": "Country", "shape": [1, 1], "datatype": "INT64", "data": [9]}]}'{
"model_name": "income-lgb_1",
"model_version": "1",
"id": "4437a71e-9af1-4e3b-aa4b-cb95d2cd86b9",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "FP64",
"data": [
0.06279460120044741
]
}
]
}
kubectl delete -f ./models/income-xgb.yaml -n ${NAMESPACE}
seldon model load -f ./models/income-xgb.yaml
{}
seldon model status income-xgb -w ModelAvailable | jq -M .
{}
seldon model infer income-xgb \
'{ "parameters": {"content_type": "pd"}, "inputs": [{"name": "Age", "shape": [1, 1], "datatype": "INT64", "data": [47]},{"name": "Workclass", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Education", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Marital Status", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Occupation", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Relationship", "shape": [1, 1], "datatype": "INT64", "data": [3]},{"name": "Race", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Sex", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Capital Gain", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Capital Loss", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Hours per week", "shape": [1, 1], "datatype": "INT64", "data": [40]},{"name": "Country", "shape": [1, 1], "datatype": "INT64", "data": [9]}]}'{
"model_name": "income-xgb_1",
"model_version": "1",
"id": "e30c3b44-fa14-4e5f-88f5-d6f4d287da20",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "FP32",
"data": [
-1.8380107879638672
]
}
]
}
seldon model unload income-xgb
{}
import matplotlib.pyplot as plt
import json
import requests
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from torchvision import transforms
from torch.utils.data import DataLoader
import numpy as np
training_data = MNIST(
root=".",
download=True,
train=False,
transform = transforms.Compose([
transforms.ToTensor()
])
)
reqJson = json.loads('{"inputs":[{"name":"Input3","data":[],"datatype":"FP32","shape":[]}]}')
url = "http://0.0.0.0:9000/v2/models/model/infer"
dl = DataLoader(training_data, batch_size=1, shuffle=False)
dlIter = iter(dl)
def infer_mnist():
x, y = next(dlIter)
data = x.cpu().numpy()
reqJson["inputs"][0]["data"] = data.flatten().tolist()
reqJson["inputs"][0]["shape"] = [1, 1, 28, 28]
headers = {"Content-Type": "application/json", "seldon-model":"mnist-onnx"}
response_raw = requests.post(url, json=reqJson, headers=headers)
show_mnist(x)
probs = np.array(response_raw.json()["outputs"][0]["data"])
print(probs.argmax(axis=0))
def show_mnist(X):
plt.imshow(X.reshape(28, 28))
plt.axis("off")
plt.show()
cat ./models/mnist-onnx.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mnist-onnx
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mnist-onnx"
requirements:
- onnx
kubectl apply -f ./models/mnist-onnx.yaml -n ${NAMESPACE}
model.mlops.seldon.io/mnist-onnx created
kubectl wait --for condition=ready --timeout=300s model mnist-onnx -n ${NAMESPACE}
model.mlops.seldon.io/mnist-onnx condition met
seldon model load -f ./models/mnist-onnx.yaml
{}
seldon model status mnist-onnx -w ModelAvailable | jq -M .
{}
infer_mnist()
7
kubectl delete -f ./models/mnist-onnx.yaml -n ${NAMESPACE}
seldon model unload mnist-onnx
{}
cat ./models/income-lgb.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-lgb
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/income-lgb"
requirements:
- lightgbm
kubectl apply -f ./models/income-lgb.yaml -n ${NAMESPACE}
model.mlops.seldon.io/income-lgb created
kubectl wait --for condition=ready --timeout=300s model income-lgb -n ${NAMESPACE}
model.mlops.seldon.io/income-lgb condition met
curl --location 'http://${SELDON_INFER_HOST}/v2/models/income-lgb/infer' \
--header 'Content-Type: application/json' \
--data '{ "parameters": {"content_type": "pd"}, "inputs": [{"name": "Age", "shape": [1, 1], "datatype": "INT64", "data": [47]},{"name": "Workclass", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Education", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Marital Status", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Occupation", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Relationship", "shape": [1, 1], "datatype": "INT64", "data": [3]},{"name": "Race", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Sex", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Capital Gain", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Capital Loss", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Hours per week", "shape": [1, 1], "datatype": "INT64", "data": [40]},{"name": "Country", "shape": [1, 1], "datatype": "INT64", "data": [9]}]}'{
"model_name": "income-lgb_1",
"model_version": "1",
"id": "4437a71e-9af1-4e3b-aa4b-cb95d2cd86b9",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "FP64",
"data": [
0.06279460120044741
]
}
]
}
kubectl delete -f ./models/income-lgb.yaml -n ${NAMESPACE}
seldon model load -f ./models/income-lgb.yaml
{}
seldon model status income-lgb -w ModelAvailable | jq -M .
{}
seldon model infer income-lgb \
'{ "parameters": {"content_type": "pd"}, "inputs": [{"name": "Age", "shape": [1, 1], "datatype": "INT64", "data": [47]},{"name": "Workclass", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Education", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Marital Status", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Occupation", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Relationship", "shape": [1, 1], "datatype": "INT64", "data": [3]},{"name": "Race", "shape": [1, 1], "datatype": "INT64", "data": [4]},{"name": "Sex", "shape": [1, 1], "datatype": "INT64", "data": [1]},{"name": "Capital Gain", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Capital Loss", "shape": [1, 1], "datatype": "INT64", "data": [0]},{"name": "Hours per week", "shape": [1, 1], "datatype": "INT64", "data": [40]},{"name": "Country", "shape": [1, 1], "datatype": "INT64", "data": [9]}]}'{
"model_name": "income-lgb_1",
"model_version": "1",
"id": "4437a71e-9af1-4e3b-aa4b-cb95d2cd86b9",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "FP64",
"data": [
0.06279460120044741
]
}
]
}
seldon model unload income-lgb
{}
cat ./models/wine-mlflow.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: wine
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/wine-mlflow"
requirements:
- mlflow
kubectl apply -f ./models/wine-mlflow.yaml -n ${NAMESPACE}
model.mlops.seldon.io/wine created
kubectl wait --for condition=ready --timeout=300s model wine -n ${NAMESPACE}
model.mlops.seldon.io/wine condition met
seldon model load -f ./models/wine-mlflow.yaml
{}
seldon model status wine -w ModelAvailable | jq -M .
{}
import requests
url = "http://0.0.0.0:9000/v2/models/model/infer"
inference_request = {
"inputs": [
{
"name": "fixed acidity",
"shape": [1],
"datatype": "FP32",
"data": [7.4],
},
{
"name": "volatile acidity",
"shape": [1],
"datatype": "FP32",
"data": [0.7000],
},
{
"name": "citric acid",
"shape": [1],
"datatype": "FP32",
"data": [0],
},
{
"name": "residual sugar",
"shape": [1],
"datatype": "FP32",
"data": [1.9],
},
{
"name": "chlorides",
"shape": [1],
"datatype": "FP32",
"data": [0.076],
},
{
"name": "free sulfur dioxide",
"shape": [1],
"datatype": "FP32",
"data": [11],
},
{
"name": "total sulfur dioxide",
"shape": [1],
"datatype": "FP32",
"data": [34],
},
{
"name": "density",
"shape": [1],
"datatype": "FP32",
"data": [0.9978],
},
{
"name": "pH",
"shape": [1],
"datatype": "FP32",
"data": [3.51],
},
{
"name": "sulphates",
"shape": [1],
"datatype": "FP32",
"data": [0.56],
},
{
"name": "alcohol",
"shape": [1],
"datatype": "FP32",
"data": [9.4],
},
]
}
headers = {"Content-Type": "application/json", "seldon-model":"wine"}
response_raw = requests.post(url, json=inference_request, headers=headers)
print(response_raw.json())
{'model_name': 'wine_1', 'model_version': '1', 'id': '0d7e44f8-b46c-4438-b8af-a749e6aa6039', 'parameters': {}, 'outputs': [{'name': 'output-1', 'shape': [1, 1], 'datatype': 'FP64', 'data': [5.576883936610762]}]}
kubectl delete model wine -n ${NAMESPACE}
seldon model unload wine
{}
import numpy as np
import matplotlib.pyplot as plt
import json
import requests
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from torchvision import transforms
from torch.utils.data import DataLoader
training_data = MNIST(
root=".",
download=True,
train=False,
transform = transforms.Compose([
transforms.ToTensor()
])
)
reqJson = json.loads('{"inputs":[{"name":"x__0","data":[],"datatype":"FP32","shape":[]}]}')
url = "http://0.0.0.0:9000/v2/models/model/infer"
dl = DataLoader(training_data, batch_size=1, shuffle=False)
dlIter = iter(dl)
def infer_mnist():
x, y = next(dlIter)
data = x.cpu().numpy()
reqJson["inputs"][0]["data"] = data.flatten().tolist()
reqJson["inputs"][0]["shape"] = [1, 1, 28, 28]
headers = {"Content-Type": "application/json", "seldon-model":"mnist-pytorch"}
response_raw = requests.post(url, json=reqJson, headers=headers)
show_mnist(x)
probs = np.array(response_raw.json()["outputs"][0]["data"])
print(probs.argmax(axis=0))
def show_mnist(X):
plt.imshow(X.reshape(28, 28))
plt.axis("off")
plt.show()
cat ./models/mnist-pytorch.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mnist-pytorch
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mnist-pytorch"
requirements:
- pytorch
kubectl apply -f ./models/mnist-pytorch.yaml -n ${NAMESPACE}
model.mlops.seldon.io/mnist-pytorch created
kubectl wait --for condition=ready --timeout=300s model mnist-pytorch -n ${NAMESPACE}
model.mlops.seldon.io/mnist-pytorch condition met
seldon model load -f ./models/mnist-pytorch.yaml
{}
seldon model status mnist-pytorch -w ModelAvailable | jq -M .{}
infer_mnist()7
kubectl delete -f ./models/mnist-pytorch.yaml -n ${NAMESPACE}
model.mlops.seldon.io "mnist-pytorch" deleted
seldon model unload mnist-pytorch
{}
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse
from mlserver.codecs import PandasCodec
from mlserver.errors import MLServerError
import pandas as pd
from fastapi import status
from mlserver.logging import logger
QUERY_KEY = "query"
class ModelParametersMissing(MLServerError):
def __init__(self, model_name: str, reason: str):
super().__init__(
f"Parameters missing for model {model_name} {reason}", status.HTTP_400_BAD_REQUEST
)
class PandasQueryRuntime(MLModel):
async def load(self) -> bool:
logger.info("Loading with settings %s", self.settings)
if self.settings.parameters is None or \
self.settings.parameters.extra is None:
raise ModelParametersMissing(self.name, "no settings.parameters.extra found")
self.query = self.settings.parameters.extra[QUERY_KEY]
if self.query is None:
raise ModelParametersMissing(self.name, "no settings.parameters.extra.query found")
self.ready = True
return self.ready
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
input_df: pd.DataFrame = PandasCodec.decode_request(payload)
# run query on input_df and save in output_df
output_df = input_df.query(self.query)
if output_df.empty:
output_df = pd.DataFrame({'status':["no rows satisfied " + self.query]})
else:
output_df["status"] = "row satisfied " + self.query
return PandasCodec.encode_response(self.name, output_df, self.version)
cat ../../models/choice1.yaml
echo "---"
cat ../../models/choice2.yaml
echo "---"
cat ../../models/add10.yaml
echo "---"
cat ../../models/mul10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: choice-is-one
spec:
storageUri: "gs://seldon-models/scv2/examples/pandasquery"
requirements:
- mlserver
- python
parameters:
- name: query
value: "choice == 1"
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: choice-is-two
spec:
storageUri: "gs://seldon-models/scv2/examples/pandasquery"
requirements:
- mlserver
- python
parameters:
- name: query
value: "choice == 2"
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: add10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
requirements:
- triton
- python
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mul10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
requirements:
- triton
- python
kubectl apply -f ../../models/choice1.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/choice2.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/add10.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/mul10.yaml -n ${NAMESPACE}model.mlops.seldon.io/choice1 created
model.mlops.seldon.io/choice2 created
model.mlops.seldon.io/add10 created
model.mlops.seldon.io/mul10 createdseldon model load -f ../../models/choice1.yaml
seldon model load -f ../../models/choice2.yaml
seldon model load -f ../../models/add10.yaml
seldon model load -f ../../models/mul10.yaml{}
{}
{}
{}kubectl wait --for condition=ready --timeout=300s model choice-is-one -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model choice-is-two -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model add10 -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model mul10 -n ${NAMESPACE}model.mlops.seldon.io/choice-is-one condition met
model.mlops.seldon.io/choice-is-two condition met
model.mlops.seldon.io/add10 condition met
model.mlops.seldon.io/mul10 condition metseldon model status choice-is-one -w ModelAvailable
seldon model status choice-is-two -w ModelAvailable
seldon model status add10 -w ModelAvailable
seldon model status mul10 -w ModelAvailable{}
{}
{}
{}
cat ../../pipelines/choice.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: choice
spec:
steps:
- name: choice-is-one
- name: mul10
inputs:
- choice.inputs.INPUT
triggers:
- choice-is-one.outputs.choice
- name: choice-is-two
- name: add10
inputs:
- choice.inputs.INPUT
triggers:
- choice-is-two.outputs.choice
output:
steps:
- mul10
- add10
stepsJoin: any
kubectl apply -f pipelines/choice.yaml
pipeline.mlops.seldon.io/choice created
kubectl wait --for condition=ready --timeout=300s pipelines choice -n ${NAMESPACE}
pipeline.mlops.seldon.io/choice condition met
seldon pipeline load -f ../../pipelines/choice.yaml
seldon pipeline status choice -w PipelineReady | jq -M .
{
"pipelineName": "choice",
"versions": [
{
"pipeline": {
"name": "choice",
"uid": "cifel9aufmbc73e5intg",
"version": 1,
"steps": [
{
"name": "add10",
"inputs": [
"choice.inputs.INPUT"
],
"triggers": [
"choice-is-two.outputs.choice"
]
},
{
"name": "choice-is-one"
},
{
"name": "choice-is-two"
},
{
"name": "mul10",
"inputs": [
"choice.inputs.INPUT"
],
"triggers": [
"choice-is-one.outputs.choice"
]
}
],
"output": {
"steps": [
"mul10.outputs",
"add10.outputs"
],
"stepsJoin": "ANY"
},
"kubernetesMeta": {}
},
"state": {
"pipelineVersion": 1,
"status": "PipelineReady",
"reason": "created pipeline",
"lastChangeTimestamp": "2023-06-30T14:45:57.284684328Z",
"modelsReady": true
}
}
]
}seldon pipeline infer choice --inference-mode grpc \
'{"model_name":"choice","inputs":[{"name":"choice","contents":{"int_contents":[1]},"datatype":"INT32","shape":[1]},{"name":"INPUT","contents":{"fp32_contents":[5,6,7,8]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
50,
60,
70,
80
]
}
}
]
}
seldon pipeline infer choice --inference-mode grpc \
'{"model_name":"choice","inputs":[{"name":"choice","contents":{"int_contents":[2]},"datatype":"INT32","shape":[1]},{"name":"INPUT","contents":{"fp32_contents":[5,6,7,8]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
15,
16,
17,
18
]
}
}
]
}
kubectl delete -f ../../models/choice1.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/choice2.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/add10.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/mul10.yaml -n ${NAMESPACE}
kubectl delete -f ../../pipelines/choice.yaml -n ${NAMESPACE}
seldon model unload choice-is-one
seldon model unload choice-is-two
seldon model unload add10
seldon model unload mul10
seldon pipeline unload choice
{
"topicPrefix": "seldon",
"bootstrap.servers":"kafka:9093",
"consumer":{
"session.timeout.ms":6000,
"auto.offset.reset":"earliest",
"topic.metadata.propagation.max.ms": 300000,
"message.max.bytes":1000000000
},
"producer":{
"linger.ms":0,
"message.max.bytes":1000000000
},
"streams":{
}
}
The outlier detector is created from the CIFAR10 VAE Outlier example.
The drift detector is created from the CIFAR10 KS Drift example.
To run the training locally, run the training notebook.
Use the seldon CLI to look at the outputs from the CIFAR10 model. It will decode the Triton binary outputs for us.
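For example, once the CIFAR10 pipeline below is running, the decoded model outputs can be read straight from the pipeline's Kafka topics with the CLI (the same command appears later in this example; the exact values will differ per request):
seldon pipeline inspect cifar10-production.cifar10.outputs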

We will use two SKlearn Iris classification models to illustrate experiments.
Load both models.
Wait for both models to be ready.
Create an experiment that modifies the iris model to add a second model splitting traffic 50/50 between the two.
Start the experiment.
Wait for the experiment to be ready.
Run a set of calls and record which route the traffic took. There should be roughly a 50/50 split.
Show the sticky session header x-seldon-route that is returned.
Use the sticky session key passed by the last inference request to ensure the same route is taken each time.
Stop the experiment.
Unload both models.
Use the sticky session key passed by the last inference request to ensure the same route is taken each time, as shown in the sketch below.
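Outside of the CLI's -s flag, the same stickiness can be achieved over plain REST by echoing the returned route header back on the next request. A minimal sketch, assuming the experiment is active and a previous response returned X-Seldon-Route:[:iris_1:]:
curl 'http://${SELDON_INFER_HOST}/v2/models/iris/infer' \
  --header 'Content-Type: application/json' \
  --header 'x-seldon-route: :iris_1:' \
  --data '{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'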
We will use two SKlearn Iris classification models to illustrate a model with a mirror.
Load both models.
Wait for both models to be ready.
Create an experiment in which traffic sent to iris is also mirrored to iris2.
Start the experiment.
Wait for the experiment to be ready.
We get responses from iris, but all requests are also mirrored to iris2.
We can check the agent's local Prometheus metrics port to validate that the mirrored requests reached iris2.
Stop the experiment.
Unload both models.
Let's check that the mul10 model was called.
Let's do an HTTP call and check the two models again.
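The check itself is a scrape of the agent's metrics endpoint; a minimal sketch, assuming the agent metrics are exposed locally on port 9007 as in the commands later in this example:
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep mul10_1
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep add10_1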
In this example we create a Pipeline to chain two Hugging Face models to provide speech-to-sentiment functionality, and add an explainer to understand the result.
This example also illustrates how explainers can target pipelines to allow complex explanation flows.
This example requires the ffmpeg package to be installed locally. Run make install-requirements for the Python dependencies.
Create a method to load speech from the recorder, transform it into MP3, and send it as base64 data. On return of the result, extract and show the text and sentiment.
We will load two Hugging Face models for speech-to-text and text-to-sentiment.
To allow Alibi-Explain to more easily explain the sentiment, we will need:
input and output transforms that take the Dict values consumed and produced by the Hugging Face sentiment model and turn them into values that Alibi-Explain can easily work with, namely the core values we want to explain and the outputs from the sentiment model.
A separate Pipeline to allow us to join the sentiment model with the output transform.
These transform models are MLServer custom runtimes, as shown below:
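The full transform code is kept with the example sources; what follows is only a minimal sketch of what such an output transform runtime might look like, mirroring the PandasQueryRuntime pattern shown earlier. The class name, tensor name, and label mapping are illustrative assumptions, not the exact code from the example.
import json

import numpy as np
from mlserver import MLModel
from mlserver.codecs import NumpyCodec, StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class SentimentOutputTransform(MLModel):
    async def load(self) -> bool:
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # The sentiment model emits JSON strings such as {"label": "POSITIVE", "score": 0.99};
        # map them to a simple 0/1 tensor that Alibi-Explain can consume.
        raw = StringCodec.decode_input(payload.inputs[0])
        labels = np.array([1 if json.loads(r)["label"] == "POSITIVE" else 0 for r in raw])
        return InferenceResponse(
            model_name=self.name,
            outputs=[NumpyCodec.encode_output("sentiment", labels)],
        )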
We can now create the final pipeline that will take speech and generate sentiment along with an explanation of why that sentiment was predicted.
We will wait for the explanation, which runs asynchronously to the functional output from the Pipeline above.
This example illustrates how to use taints and tolerations with nodeAffinity or nodeSelector to assign GPU nodes to specific models.
To serve a model on a dedicated GPU node, you should follow these steps:
You can add the taint when you are creating the node or after the node has been provisioned. You can apply the same taint to multiple nodes, not just a single node. A common approach is to define the taint at the node pool level.
When you apply a NoSchedule taint to a node after it is created, existing Pods that do not have a matching toleration remain on the node without being evicted. To ensure that such Pods are removed, use the NoExecute taint effect instead.
In this example, the node includes several labels that are used later for node affinity settings. You may choose to specify some labels yourself, while others are usually added by the cloud provider or a GPU operator installed in the cluster.
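A minimal sketch of these two steps, using an illustrative node name, taint key, and pool label (adjust to your own cluster and provider):
kubectl taint nodes gpu-node-1 seldon-gpu-srv=true:NoSchedule
kubectl label nodes gpu-node-1 pool=infer-gpu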
To ensure a specific inference server Pod runs only on the nodes you've configured, you can use nodeSelector or nodeAffinity together with a toleration by modifying one of the following:
Seldon Server custom resource: Apply changes to each individual inference server.
ServerConfig custom resource: Apply settings across multiple inference servers at once.
Configuring Seldon Server custom resource
While nodeSelector requires an exact match of node labels for server Pods to select a node, nodeAffinity offers more fine-grained control. It enables a conditional approach by using logical operators in the node selection process. For more information, see the Kubernetes documentation on assigning Pods to nodes.
In this example, a nodeSelector and a toleration are set for the Seldon Server custom resource.
In this example, a nodeAffinity and a toleration are set for the Seldon Server custom resource.
You can configure more advanced Pod selection using nodeAffinity, as in this example:
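As a sketch only, assuming the Server custom resource accepts a podSpec override for scheduling fields (verify the exact schema against your Core 2 version; the names and label values here are illustrative), such a Server might look like:
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-gpu
spec:
  serverConfig: mlserver
  podSpec:
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
          - matchExpressions:
            - key: pool
              operator: In
              values:
              - infer-gpu
    tolerations:
    - key: seldon-gpu-srv
      operator: Equal
      value: "true"
      effect: NoSchedule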
Configuring ServerConfig custom resource
This configuration automatically affects all servers using that ServerConfig, unless you specify server-specific overrides, which take precedence.
When you have a set of inference servers running exclusively on GPU nodes, you can assign a model to one of those servers in two ways:
Custom model requirements (recommended)
Explicit server pinning
Here's the distinction between the two methods of assigning models to servers.
When you specify a requirement that matches a server capability in the Model custom resource, the model is loaded on any inference server with a matching capability.
Ensure that the additional capability that matches the requirement label is added to the Server custom resource.
Instead of adding a capability using extraCapabilities on a Server custom resource, you may also add to the list of capabilities in the associated ServerConfig custom resource. This applies to all servers referencing the configuration.
With these specifications, the model is loaded on replicas of inference servers created by the referenced Server custom resource.
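A minimal sketch of that matching, using an illustrative capability name (gpu-int8) and an illustrative model name and storage URI:
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-gpu
spec:
  serverConfig: mlserver
  extraCapabilities:
  - gpu-int8
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: my-gpu-model
spec:
  storageUri: "gs://my-bucket/my-gpu-model"
  requirements:
  - gpu-int8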
Requires mlserver to be installed.
Deprecated: The MLServer CLI infer feature is experimental and will be removed in future work.
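If mlserver is not already available locally, it can typically be installed from PyPI:
pip install mlserver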
kubectl apply -f ../../models/cifar10.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/cifar10-outlier-detect.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/cifar10-drift-detect.yaml -n ${NAMESPACE}
model.mlops.seldon.io/cifar10 created
model.mlops.seldon.io/cifar10-outlier created
model.mlops.seldon.io/cifar10-drift created
kubectl wait --for condition=ready --timeout=300s model cifar10 -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model cifar10-outlier -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model cifar10-drift -n ${NAMESPACE}
model.mlops.seldon.io/cifar10 condition met
model.mlops.seldon.io/cifar10-outlier condition met
model.mlops.seldon.io/cifar10-drift condition met
seldon model load -f ../../models/cifar10.yaml
seldon model load -f ../../models/cifar10-outlier-detect.yaml
seldon model load -f ../../models/cifar10-drift-detect.yaml{}
{}
{}
seldon model status cifar10 -w ModelAvailable | jq .
seldon model status cifar10-outlier -w ModelAvailable | jq .
seldon model status cifar10-drift -w ModelAvailable | jq .{}
{}
{}
kubectl apply -f ../../pipelines/cifar10.yaml -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s pipelines cifar10-production -n ${NAMESPACE}
pipeline.mlops.seldon.io/cifar10-production condition met
seldon pipeline load -f ../../pipelines/cifar10.yaml
seldon pipeline status cifar10-production -w PipelineReady | jq -M .
{
"pipelineName": "cifar10-production",
"versions": [
{
"pipeline": {
"name": "cifar10-production",
"uid": "cifeii2ufmbc73e5insg",
"version": 1,
"steps": [
{
"name": "cifar10"
},
{
"name": "cifar10-drift",
"batch": {
"size": 20
}
},
{
"name": "cifar10-outlier"
}
],
"output": {
"steps": [
"cifar10.outputs",
"cifar10-outlier.outputs.is_outlier"
]
},
"kubernetesMeta": {}
},
"state": {
"pipelineVersion": 1,
"status": "PipelineReady",
"reason": "created pipeline",
"lastChangeTimestamp": "2023-06-30T14:40:09.047429817Z",
"modelsReady": true
}
}
]
}
kubectl delete -f ../../pipelines/cifar10.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/cifar10.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/cifar10-outlier-detect.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/cifar10-drift-detect.yaml -n ${NAMESPACE}
seldon pipeline unload cifar10-production
seldon model unload cifar10
seldon model unload cifar10-outlier
seldon model unload cifar10-drift
import requests
import json
from typing import Dict, List
import numpy as np
import os
import tensorflow as tf
from alibi_detect.utils.perturbation import apply_mask
from alibi_detect.datasets import fetch_cifar10c
import matplotlib.pyplot as plt
tf.keras.backend.clear_session()2023-06-30 15:39:28.732453: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-06-30 15:39:28.732465: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
train, test = tf.keras.datasets.cifar10.load_data()
X_train, y_train = train
X_test, y_test = test
X_train = X_train.astype('float32') / 255
X_test = X_test.astype('float32') / 255
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
classes = (
"plane",
"car",
"bird",
"cat",
"deer",
"dog",
"frog",
"horse",
"ship",
"truck",
)
(50000, 32, 32, 3) (50000, 1) (10000, 32, 32, 3) (10000, 1)
outliers = []
for idx in range(0,X_train.shape[0]):
X_mask, mask = apply_mask(X_train[idx].reshape(1, 32, 32, 3),
mask_size=(14,14),
n_masks=1,
channels=[0,1,2],
mask_type='normal',
noise_distr=(0,1),
clip_rng=(0,1))
outliers.append(X_mask)
X_outliers = np.vstack(outliers)
X_outliers.shape(50000, 32, 32, 3)
corruption = ['brightness']
X_corr, y_corr = fetch_cifar10c(corruption=corruption, severity=5, return_X_y=True)
X_corr = X_corr.astype('float32') / 255
reqJson = json.loads('{"inputs":[{"name":"input_1","data":[],"datatype":"FP32","shape":[]}]}')
url = "http://0.0.0.0:9000/v2/models/model/infer"
def infer(resourceName: str, batchSz: int, requestType: str):
if requestType == "outlier":
rows = X_outliers[0:0+batchSz]
elif requestType == "drift":
rows = X_corr[0:0+batchSz]
else:
rows = X_train[0:0+batchSz]
for i in range(batchSz):
show(rows[i])
reqJson["inputs"][0]["data"] = rows.flatten().tolist()
reqJson["inputs"][0]["shape"] = [batchSz, 32, 32, 3]
headers = {"Content-Type": "application/json", "seldon-model":resourceName}
response_raw = requests.post(url, json=reqJson, headers=headers)
print(response_raw)
print(response_raw.json())
def show(X):
plt.imshow(X.reshape(32, 32, 3))
plt.axis("off")
plt.show()
cat ../../models/cifar10.yaml
echo "---"
cat ../../models/cifar10-outlier-detect.yaml
echo "---"
cat ../../models/cifar10-drift-detect.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: cifar10
spec:
storageUri: "gs://seldon-models/triton/tf_cifar10"
requirements:
- tensorflow
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: cifar10-outlier
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/cifar10/outlier-detector"
requirements:
- mlserver
- alibi-detect
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: cifar10-drift
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/cifar10/drift-detector"
requirements:
- mlserver
- alibi-detect
cat ../../pipelines/cifar10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: cifar10-production
spec:
steps:
- name: cifar10
- name: cifar10-outlier
- name: cifar10-drift
batch:
size: 20
output:
steps:
- cifar10
- cifar10-outlier.outputs.is_outlier
infer("cifar10-production.pipeline",20, "normal")<Response [200]>
{'model_name': '', 'outputs': [{'data': [1.45001495e-08, 1.2525752e-09, 1.6298458e-07, 0.11529388, 1.7431412e-07, 6.1856604e-06, 0.8846994, 6.0739285e-09, 7.437921e-08, 4.7317337e-09, 1.26449e-06, 4.8814868e-09, 1.5153439e-09, 8.490656e-09, 5.5131194e-10, 1.1617216e-09, 5.7729294e-10, 2.8839776e-07, 0.0006149016, 0.99938357, 0.888746, 2.5331951e-06, 0.00012967695, 0.10531583, 2.4284174e-05, 6.3332986e-06, 0.0016261435, 1.13079e-05, 0.0013286703, 0.0028091935, 2.0993439e-06, 3.680449e-08, 0.0013269952, 2.1766558e-05, 0.99841356, 0.00015300694, 6.9472035e-06, 1.3277059e-05, 6.1860555e-05, 3.4072806e-07, 1.1205097e-05, 0.99997175, 1.9948227e-07, 6.9880834e-08, 3.3387135e-08, 5.2603138e-08, 3.0352305e-07, 4.3738982e-08, 5.3243946e-07, 1.5870584e-05, 0.0006525102, 0.013322109, 1.480307e-06, 0.9766325, 4.9847167e-05, 0.00058075984, 0.008405659, 5.2234273e-06, 0.00023390084, 0.000116047224, 1.6682397e-06, 5.7737526e-10, 0.9975605, 6.45564e-05, 0.002371972, 1.0392675e-07, 9.747962e-08, 1.4484569e-07, 8.762438e-07, 2.4758325e-08, 5.028761e-09, 6.856381e-11, 5.9932094e-12, 4.921233e-10, 1.471166e-07, 2.7940719e-06, 3.4563383e-09, 0.99999714, 5.9420524e-10, 9.445026e-11, 4.1854888e-05, 5.041549e-08, 8.0302314e-08, 1.2119854e-07, 6.781646e-09, 1.2616152e-08, 1.1878505e-08, 1.628573e-09, 0.9999578, 3.281738e-08, 0.08930307, 1.4065135e-07, 4.1117343e-07, 0.90898305, 8.933351e-07, 0.0015637449, 0.00013868928, 9.092981e-06, 4.8759745e-07, 4.3976044e-07, 0.00016094849, 3.5653954e-07, 0.0760521, 0.8927447, 0.0011777573, 0.00265573, 0.027189083, 4.1892267e-06, 1.329405e-05, 1.8564688e-06, 1.3373891e-06, 1.0251247e-07, 8.651912e-09, 4.458202e-06, 1.4646349e-05, 1.260957e-06, 1.046087e-08, 0.9998946, 8.332438e-05, 3.900894e-07, 6.53852e-05, 3.012202e-08, 1.0247197e-07, 1.8824371e-06, 0.0004958526, 3.533475e-05, 2.739997e-07, 0.99939275, 4.840305e-06, 3.5346695e-06, 0.0005518078, 3.1597017e-07, 0.99902296, 0.00031509742, 8.07886e-07, 1.6366084e-06, 2.795575e-06, 6.112367e-06, 9.817249e-05, 2.602709e-07, 0.0004561966, 5.360607e-06, 2.8656412e-05, 0.000116040654, 6.881144e-05, 8.844774e-06, 4.4655946e-05, 3.5564542e-05, 0.006564381, 0.9926715, 0.007300911, 1.766928e-06, 3.0520596e-07, 0.026906287, 1.3769699e-06, 0.00027539674, 5.583593e-06, 3.792553e-06, 0.0003876767, 0.9651169, 0.18114138, 2.8360228e-05, 0.00019927241, 0.007685872, 0.00014663498, 3.9361137e-05, 5.941682e-05, 7.36174e-05, 0.79936546, 0.01126067, 2.3992783e-11, 7.6336457e-16, 1.4644799e-15, 1, 2.4652159e-14, 1.1786078e-10, 1.9402116e-13, 4.2408636e-15, 1.209294e-15, 2.9042784e-15, 1.5366902e-08, 1.2476195e-09, 1.3560152e-07, 0.999997, 4.3113017e-11, 2.8163534e-08, 2.4494727e-06, 1.3122828e-10, 3.8081083e-07, 2.1628158e-11, 0.0004926238, 6.9424555e-06, 2.827196e-05, 0.92534137, 9.500486e-06, 0.00036133997, 0.072713904, 1.2831057e-07, 0.0010457055, 2.8514464e-07], 'name': 'fc10', 'shape': [20, 10], 'datatype': 'FP32'}, {'data': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
seldon pipeline inspect cifar10-production.cifar10-drift.outputs.is_driftseldon.default.model.cifar10-drift.outputs cifeij8fh5ss738i5bp0 {"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["0"]}}
infer("cifar10-production.pipeline",20, "drift")<Response [200]>
{'model_name': '', 'outputs': [{'data': [8.080701e-09, 2.3025173e-12, 2.2681688e-09, 1, 4.1828953e-11, 4.48467e-09, 3.216822e-08, 2.8404365e-13, 5.217064e-09, 3.3497323e-13, 0.96965235, 4.7030144e-06, 1.6964266e-07, 1.7355454e-05, 2.6667e-06, 1.9505828e-06, 1.1363079e-07, 3.3352034e-08, 0.030320557, 1.7086056e-07, 0.03725602, 6.8623276e-06, 7.5557014e-05, 0.00018132397, 2.2838503e-05, 0.000110639296, 2.3732607e-06, 2.1210687e-06, 0.9623351, 7.131072e-06, 0.999079, 4.207448e-09, 1.5788535e-08, 2.723756e-08, 2.6555508e-11, 2.1526697e-10, 2.7599315e-10, 2.0737433e-10, 0.0009210062, 3.0885383e-09, 6.665241e-07, 1.7765576e-09, 1.4911559e-07, 0.9765331, 1.9476123e-07, 2.8244015e-06, 0.023463126, 5.8030287e-09, 3.243206e-09, 1.12179785e-08, 4.4123663e-06, 4.7628927e-09, 1.1727273e-08, 0.9761534, 1.1409252e-08, 8.922882e-05, 0.023752932, 3.1563903e-08, 2.7916305e-09, 8.7746266e-10, 1.0166265e-05, 0.999703, 4.5408615e-05, 0.00022673907, 1.7365853e-07, 1.0147362e-06, 6.253448e-06, 2.9711526e-07, 7.811687e-07, 6.183683e-06, 0.86618125, 5.47548e-07, 0.00038408802, 0.013155022, 3.6916779e-06, 0.0006137024, 0.11965008, 3.6425424e-06, 6.7638084e-06, 1.2372367e-06, 1.9545263e-05, 1.1281859e-13, 1.6811868e-14, 0.9999777, 1.9805435e-11, 2.7563674e-06, 2.9651657e-09, 1.1363432e-12, 2.9902746e-13, 1.220973e-12, 2.9895918e-05, 3.4964305e-07, 1.1331837e-08, 1.7012125e-06, 3.6088227e-07, 3.035954e-08, 2.2102333e-06, 1.7414077e-08, 0.9999455, 1.9921794e-05, 0.9999999, 5.3446598e-11, 6.3188843e-10, 1.0956511e-07, 1.1538642e-10, 8.113561e-10, 4.7179572e-08, 1.4544753e-11, 5.490219e-08, 1.3347151e-10, 1.5363307e-07, 6.604881e-09, 2.424105e-10, 9.963063e-09, 3.9349533e-09, 1.5709017e-09, 7.705774e-10, 4.8085802e-08, 1.8885139e-05, 0.9999809, 7.147243e-08, 3.143131e-13, 2.1447092e-13, 0.00042652222, 6.945973e-12, 0.9995734, 6.174434e-09, 4.1128205e-11, 3.4031404e-13, 8.573159e-15, 1.2226405e-09, 2.3768018e-10, 2.822187e-07, 8.016278e-08, 4.0692296e-08, 6.8023346e-06, 2.3926754e-07, 0.9999925, 6.652648e-09, 7.743497e-09, 7.6360675e-06, 5.9386625e-09, 1.5675019e-09, 2.136716e-07, 1.3074002e-06, 3.700079e-10, 1.0984521e-09, 6.2138824e-08, 0.9609078, 0.03908287, 0.0008332255, 7.696685e-08, 2.4428939e-09, 7.186676e-05, 1.4520063e-09, 1.4521317e-08, 1.09093e-06, 1.2531165e-10, 0.9990938, 5.798501e-09, 5.785368e-05, 3.82365e-09, 7.404351e-08, 0.008338481, 8.048078e-10, 0.99157715, 1.1663455e-05, 1.4583546e-05, 8.3543476e-08, 3.274394e-08, 2.4682688e-05, 1.3951502e-09, 1.0260489e-08, 0.9998845, 1.9418138e-08, 8.667954e-07, 2.1851054e-07, 8.917964e-05, 4.4437223e-07, 1.1292918e-07, 4.5302792e-07, 5.631744e-08, 2.9086214e-08, 3.1013877e-07, 7.695681e-09, 2.1452344e-09, 1.1493902e-08, 6.1980093e-10, 0.99999917, 1.1436694e-08, 2.42685e-05, 8.557389e-08, 0.024081504, 0.0073837163, 4.8152968e-05, 5.128531e-07, 0.9684405, 9.630179e-08, 2.1060101e-05, 1.901065e-07], 'name': 'fc10', 'shape': [20, 10], 'datatype': 'FP32'}, {'data': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'name': 'is_outlier', 'shape': [1, 20], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
seldon pipeline inspect cifar10-production.cifar10-drift.outputs.is_driftseldon.default.model.cifar10-drift.outputs cifeimgfh5ss738i5bpg {"name":"is_drift", "datatype":"INT64", "shape":["1", "1"], "parameters":{"content_type":{"stringParam":"np"}}, "contents":{"int64Contents":["1"]}}
infer("cifar10-production.pipeline",1, "outlier")<Response [200]>
{'model_name': '', 'outputs': [{'data': [6.3606867e-06, 0.0006106364, 0.0054279356, 0.6536454, 1.4738829e-05, 2.6104701e-06, 0.3397848, 1.3538776e-05, 0.0004458526, 4.807229e-05], 'name': 'fc10', 'shape': [1, 10], 'datatype': 'FP32'}, {'data': [1], 'name': 'is_outlier', 'shape': [1, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
infer("cifar10-production.pipeline",1, "ok")<Response [200]>
{'model_name': '', 'outputs': [{'data': [1.45001495e-08, 1.2525752e-09, 1.6298458e-07, 0.11529388, 1.7431412e-07, 6.1856604e-06, 0.8846994, 6.0739285e-09, 7.43792e-08, 4.7317337e-09], 'name': 'fc10', 'shape': [1, 10], 'datatype': 'FP32'}, {'data': [0], 'name': 'is_outlier', 'shape': [1, 1], 'datatype': 'INT64', 'parameters': {'content_type': 'np'}}]}
seldon pipeline inspect cifar10-production.cifar10.outputsseldon.default.model.cifar10.outputs cifeiq8fh5ss738i5bqg {"modelName":"cifar10_1", "modelVersion":"1", "outputs":[{"name":"fc10", "datatype":"FP32", "shape":["1", "10"], "contents":{"fp32Contents":[1.45001495e-8, 1.2525752e-9, 1.6298458e-7, 0.11529388, 1.7431412e-7, 0.0000061856604, 0.8846994, 6.0739285e-9, 7.43792e-8, 4.7317337e-9]}}]}
cat ./models/sklearn1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
cat ./models/sklearn2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris2
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
seldon model load -f ./models/sklearn1.yaml
seldon model load -f ./models/sklearn2.yaml{}
{}
seldon model status iris -w ModelAvailable
seldon model status iris2 -w ModelAvailable{}
{}
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris2 -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::50]
cat ./experiments/ab-default-model.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris
candidates:
- name: iris
weight: 50
- name: iris2
weight: 50
seldon experiment start -f ./experiments/ab-default-model.yaml
seldon experiment status experiment-sample -w | jq -M .
{
"experimentName": "experiment-sample",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
from ipywebrtc import AudioRecorder, CameraStream
import torchaudio
from IPython.display import Audio
import base64
import json
import requests
import os
import time
reqJson = json.loads('{"inputs":[{"name":"args", "parameters": {"content_type": "base64"}, "data":[],"datatype":"BYTES","shape":[1]}]}')
url = "http://0.0.0.0:9000/v2/models/model/infer"
def infer(resource: str):
with open('recording.webm', 'wb') as f:
f.write(recorder.audio.value)
!ffmpeg -i recording.webm -vn -ab 128k -ar 44100 file.mp3 -y -hide_banner -loglevel panic
with open("file.mp3", mode='rb') as file:
fileContent = file.read()
encoded = base64.b64encode(fileContent)
base64_message = encoded.decode('utf-8')
reqJson["inputs"][0]["data"] = [str(base64_message)]
headers = {"Content-Type": "application/json", "seldon-model": resource}
response_raw = requests.post(url, json=reqJson, headers=headers)
j = response_raw.json()
sentiment = j["outputs"][0]["data"][0]
text = j["outputs"][1]["data"][0]
reqId = response_raw.headers["x-request-id"]
print(reqId)
os.environ["REQUEST_ID"]=reqId
print(base64.b64decode(text))
print(base64.b64decode(sentiment))
Custom model requirements
If the assigned server cannot load the model due to insufficient resources, another similarly-capable server can be selected to load the model.
Explicit pinning
If the specified server lacks sufficient memory or resources, the model load fails without trying another server.
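A minimal sketch of explicit pinning, assuming the Model spec's server field is used to pin the model to a named Server (verify the field name against your Core 2 version; the server name here is illustrative):
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: iris
spec:
  storageUri: "gs://seldon-models/mlserver/iris"
  requirements:
  - sklearn
  server: mlserver-gpu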
Ensure that you have installed Seldon Core 2 in the namespace seldon-mesh.
Ensure that you are performing these steps in the directory where you have downloaded the samples.
Get the IP address of the Seldon Core 2 instance running with Istio:
Output is similar to:
Make a gRPC inference call
Delete the model
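A minimal sketch of the "get the IP address" step above, assuming Istio's ingress gateway Service is named istio-ingressgateway and lives in the istio-system namespace:
ISTIO_INGRESS=$(kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $ISTIO_INGRESS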
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris2_1::27 :iris_1::23]
seldon model infer iris --show-headers \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'> POST /v2/models/iris/infer HTTP/1.1
> Host: localhost:9000
> Content-Type:[application/json]
> Seldon-Model:[iris]
< X-Seldon-Route:[:iris_1:]
< Ce-Id:[463e96ad-645f-4442-8890-4c340b58820b]
< Traceparent:[00-fe9e87fcbe4be98ed82fb76166e15ceb-d35e7ac96bd8b718-01]
< X-Envoy-Upstream-Service-Time:[3]
< Ce-Specversion:[0.3]
< Date:[Thu, 29 Jun 2023 14:03:03 GMT]
< Ce-Source:[io.seldon.serving.deployment.mlserver]
< Content-Type:[application/json]
< Server:[envoy]
< X-Request-Id:[cieou5ofh5ss73fbjdu0]
< Ce-Endpoint:[iris_1]
< Ce-Modelid:[iris_1]
< Ce-Type:[io.seldon.serving.inference.response]
< Content-Length:[213]
< Ce-Inferenceservicename:[mlserver]
< Ce-Requestid:[463e96ad-645f-4442-8890-4c340b58820b]
{
"model_name": "iris_1",
"model_version": "1",
"id": "463e96ad-645f-4442-8890-4c340b58820b",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
seldon model infer iris -s -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
seldon model infer iris --inference-mode grpc -s -i 50\
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}'Success: map[:iris_1::50]
seldon experiment stop experiment-sample
seldon model unload iris
seldon model unload iris2
cat ./models/add10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: add10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
requirements:
- triton
- python
cat ./models/mul10.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mul10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
requirements:
- triton
- python
seldon model load -f ./models/add10.yaml
seldon model load -f ./models/mul10.yaml{}
{}
seldon model status add10 -w ModelAvailable
seldon model status mul10 -w ModelAvailable{}
{}
cat ./pipelines/mul10.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: pipeline-mul10
spec:
steps:
- name: mul10
output:
steps:
- mul10
cat ./pipelines/add10.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: pipeline-add10
spec:
steps:
- name: add10
output:
steps:
- add10
seldon pipeline load -f ./pipelines/add10.yaml
seldon pipeline load -f ./pipelines/mul10.yamlseldon pipeline status pipeline-add10 -w PipelineReady
seldon pipeline status pipeline-mul10 -w PipelineReady{"pipelineName":"pipeline-add10", "versions":[{"pipeline":{"name":"pipeline-add10", "uid":"cieov47l80lc739juklg", "version":1, "steps":[{"name":"add10"}], "output":{"steps":["add10.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:05:04.460868091Z", "modelsReady":true}}]}
{"pipelineName":"pipeline-mul10", "versions":[{"pipeline":{"name":"pipeline-mul10", "uid":"cieov47l80lc739jukm0", "version":1, "steps":[{"name":"mul10"}], "output":{"steps":["mul10.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:05:04.631980330Z", "modelsReady":true}}]}
seldon pipeline infer pipeline-add10 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
11,
12,
13,
14
]
}
}
]
}
seldon pipeline infer pipeline-mul10 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
10,
20,
30,
40
]
}
}
]
}
cat ./experiments/addmul10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: addmul10
spec:
default: pipeline-add10
resourceType: pipeline
candidates:
- name: pipeline-add10
weight: 50
- name: pipeline-mul10
weight: 50
seldon experiment start -f ./experiments/addmul10.yaml
seldon experiment status addmul10 -w | jq -M .
{
"experimentName": "addmul10",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon pipeline infer pipeline-add10 -i 50 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'Success: map[:add10_1::28 :mul10_1::22 :pipeline-add10.pipeline::28 :pipeline-mul10.pipeline::22]
seldon pipeline infer pipeline-add10 --show-headers --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'> /inference.GRPCInferenceService/ModelInfer HTTP/2
> Host: localhost:9000
> seldon-model:[pipeline-add10.pipeline]
< x-envoy-expected-rq-timeout-ms:[60000]
< x-request-id:[cieov8ofh5ss739277i0]
< date:[Thu, 29 Jun 2023 14:05:23 GMT]
< server:[envoy]
< content-type:[application/grpc]
< x-envoy-upstream-service-time:[6]
< x-seldon-route:[:add10_1: :pipeline-add10.pipeline:]
< x-forwarded-proto:[http]
{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
seldon pipeline infer pipeline-add10 -s --show-headers --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'> /inference.GRPCInferenceService/ModelInfer HTTP/2
> Host: localhost:9000
> x-seldon-route:[:add10_1: :pipeline-add10.pipeline:]
> seldon-model:[pipeline-add10.pipeline]
< content-type:[application/grpc]
< x-forwarded-proto:[http]
< x-envoy-expected-rq-timeout-ms:[60000]
< x-seldon-route:[:add10_1: :pipeline-add10.pipeline: :pipeline-add10.pipeline:]
< x-request-id:[cieov90fh5ss739277ig]
< x-envoy-upstream-service-time:[7]
< date:[Thu, 29 Jun 2023 14:05:24 GMT]
< server:[envoy]
{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
seldon pipeline infer pipeline-add10 -s -i 50 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'Success: map[:add10_1::50 :pipeline-add10.pipeline::150]
cat ./models/add20.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: add20
spec:
storageUri: "gs://seldon-models/triton/add20"
requirements:
- triton
- python
seldon model load -f ./models/add20.yaml
{}
seldon model status add20 -w ModelAvailable
{}
cat ./experiments/add1020.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: add1020
spec:
default: add10
candidates:
- name: add10
weight: 50
- name: add20
weight: 50
seldon experiment start -f ./experiments/add1020.yaml
seldon experiment status add1020 -w | jq -M .
{
"experimentName": "add1020",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon model infer add10 -i 50 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'Success: map[:add10_1::22 :add20_1::28]
seldon pipeline infer pipeline-add10 -i 100 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'Success: map[:add10_1::24 :add20_1::32 :mul10_1::44 :pipeline-add10.pipeline::56 :pipeline-mul10.pipeline::44]
seldon pipeline infer pipeline-add10 --show-headers --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'> /inference.GRPCInferenceService/ModelInfer HTTP/2
> Host: localhost:9000
> seldon-model:[pipeline-add10.pipeline]
< x-request-id:[cieovf0fh5ss739279u0]
< x-envoy-upstream-service-time:[5]
< x-seldon-route:[:add10_1: :pipeline-add10.pipeline:]
< date:[Thu, 29 Jun 2023 14:05:48 GMT]
< server:[envoy]
< content-type:[application/grpc]
< x-forwarded-proto:[http]
< x-envoy-expected-rq-timeout-ms:[60000]
{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
seldon pipeline infer pipeline-add10 -s --show-headers --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'> /inference.GRPCInferenceService/ModelInfer HTTP/2
> Host: localhost:9000
> x-seldon-route:[:add10_1: :pipeline-add10.pipeline:]
> seldon-model:[pipeline-add10.pipeline]
< x-forwarded-proto:[http]
< x-envoy-expected-rq-timeout-ms:[60000]
< x-request-id:[cieovf8fh5ss739279ug]
< x-envoy-upstream-service-time:[6]
< date:[Thu, 29 Jun 2023 14:05:49 GMT]
< server:[envoy]
< content-type:[application/grpc]
< x-seldon-route:[:add10_1: :pipeline-add10.pipeline: :add20_1: :pipeline-add10.pipeline:]
{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[21, 22, 23, 24]}}]}
seldon experiment stop addmul10
seldon experiment stop add1020
seldon pipeline unload pipeline-add10
seldon pipeline unload pipeline-mul10
seldon model unload add10
seldon model unload add20
seldon model unload mul10
cat ./models/sklearn1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
cat ./models/sklearn2.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris2
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
seldon model load -f ./models/sklearn1.yaml
seldon model load -f ./models/sklearn2.yaml{}
{}
seldon model status iris -w ModelAvailable
seldon model status iris2 -w ModelAvailable{}
{}
cat ./experiments/sklearn-mirror.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: sklearn-mirror
spec:
default: iris
candidates:
- name: iris
weight: 100
mirror:
name: iris2
percent: 100
seldon experiment start -f ./experiments/sklearn-mirror.yaml
seldon experiment status sklearn-mirror -w | jq -M .
{
"experimentName": "sklearn-mirror",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon model infer iris -i 50 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'Success: map[:iris_1::50]
curl -s 0.0.0:9006/metrics | grep seldon_model_infer_total | grep iris2_1seldon_model_infer_total{code="200",method_type="rest",model="iris",model_internal="iris2_1",server="mlserver",server_replica="0"} 50
seldon experiment stop sklearn-mirror
seldon model unload iris
seldon model unload iris2
cat ./models/add10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: add10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
requirements:
- triton
- python
cat ./models/mul10.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mul10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
requirements:
- triton
- python
seldon model load -f ./models/add10.yaml
seldon model load -f ./models/mul10.yaml{}
{}
seldon model status add10 -w ModelAvailable
seldon model status mul10 -w ModelAvailable{}
{}
cat ./pipelines/mul10.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: pipeline-mul10
spec:
steps:
- name: mul10
output:
steps:
- mul10
cat ./pipelines/add10.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: pipeline-add10
spec:
steps:
- name: add10
output:
steps:
- add10
seldon pipeline load -f ./pipelines/add10.yaml
seldon pipeline load -f ./pipelines/mul10.yaml
seldon pipeline status pipeline-add10 -w PipelineReady
seldon pipeline status pipeline-mul10 -w PipelineReady
{"pipelineName":"pipeline-add10", "versions":[{"pipeline":{"name":"pipeline-add10", "uid":"ciep072i8ufs73flaipg", "version":1, "steps":[{"name":"add10"}], "output":{"steps":["add10.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:07:24.903503109Z", "modelsReady":true}}]}
{"pipelineName":"pipeline-mul10", "versions":[{"pipeline":{"name":"pipeline-mul10", "uid":"ciep072i8ufs73flaiq0", "version":1, "steps":[{"name":"mul10"}], "output":{"steps":["mul10.outputs"]}, "kubernetesMeta":{}}, "state":{"pipelineVersion":1, "status":"PipelineReady", "reason":"created pipeline", "lastChangeTimestamp":"2023-06-29T14:07:25.082642153Z", "modelsReady":true}}]}
seldon pipeline infer pipeline-add10 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
seldon pipeline infer pipeline-mul10 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[10, 20, 30, 40]}}]}
cat ./experiments/addmul10-mirror.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: addmul10-mirror
spec:
default: pipeline-add10
resourceType: pipeline
candidates:
- name: pipeline-add10
weight: 100
mirror:
name: pipeline-mul10
percent: 100
seldon experiment start -f ./experiments/addmul10-mirror.yaml
seldon experiment status addmul10-mirror -w | jq -M .
{
"experimentName": "addmul10-mirror",
"active": true,
"candidatesReady": true,
"mirrorReady": true,
"statusDescription": "experiment active",
"kubernetesMeta": {}
}
seldon pipeline infer pipeline-add10 -i 1 --inference-mode grpc \
'{"model_name":"add10","inputs":[{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}'
{"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[11, 12, 13, 14]}}]}
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep mul10_1
seldon_model_infer_total{code="OK",method_type="grpc",model="mul10",model_internal="mul10_1",server="triton",server_replica="0"} 2
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep add10_1
seldon_model_infer_total{code="OK",method_type="grpc",model="add10",model_internal="add10_1",server="triton",server_replica="0"} 2
seldon pipeline infer pipeline-add10 -i 1 \
'{"model_name":"add10","inputs":[{"name":"INPUT","data":[1,2,3,4],"datatype":"FP32","shape":[4]}]}'{
"model_name": "",
"outputs": [
{
"data": [
11,
12,
13,
14
],
"name": "OUTPUT",
"shape": [
4
],
"datatype": "FP32"
}
]
}
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep mul10_1
seldon_model_infer_total{code="OK",method_type="grpc",model="mul10",model_internal="mul10_1",server="triton",server_replica="0"} 3
curl -s 0.0.0.0:9007/metrics | grep seldon_model_infer_total | grep add10_1
seldon_model_infer_total{code="OK",method_type="grpc",model="add10",model_internal="add10_1",server="triton",server_replica="0"} 3
seldon pipeline inspect pipeline-mul10
seldon.default.model.mul10.inputs ciep0bofh5ss73dpdiq0 {"inputs":[{"name":"INPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[1, 2, 3, 4]}}]}
seldon.default.model.mul10.outputs ciep0bofh5ss73dpdiq0 {"modelName":"mul10_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[10, 20, 30, 40]}}]}
seldon.default.pipeline.pipeline-mul10.inputs ciep0bofh5ss73dpdiq0 {"inputs":[{"name":"INPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[1, 2, 3, 4]}}]}
seldon.default.pipeline.pipeline-mul10.outputs ciep0bofh5ss73dpdiq0 {"outputs":[{"name":"OUTPUT", "datatype":"FP32", "shape":["4"], "contents":{"fp32Contents":[10, 20, 30, 40]}}]}
seldon experiment stop addmul10-mirror
seldon pipeline unload pipeline-add10
seldon pipeline unload pipeline-mul10
seldon model unload add10
seldon model unload mul10
cat ../../models/hf-whisper.yaml
echo "---"
cat ../../models/hf-sentiment.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: whisper
spec:
storageUri: "gs://seldon-models/mlserver/huggingface/whisper"
requirements:
- huggingface
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: sentiment
spec:
storageUri: "gs://seldon-models/mlserver/huggingface/sentiment"
requirements:
- huggingface
kubectl apply -f ../../models/hf-whisper.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/hf-sentiment.yaml -n ${NAMESPACE}
model.mlops.seldon.io/whisper created
model.mlops.seldon.io/sentiment created
kubectl wait --for condition=ready --timeout=300s model whisper -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model sentiment -n ${NAMESPACE}
model.mlops.seldon.io/whisper condition met
model.mlops.seldon.io/sentiment condition met
seldon model load -f ../../models/hf-whisper.yaml
seldon model load -f ../../models/hf-sentiment.yaml
{}
{}
seldon model status whisper -w ModelAvailable | jq -M .
seldon model status sentiment -w ModelAvailable | jq -M .
{}
{}
cat ./sentiment-input-transform/model.py | pygmentize
# Copyright (c) 2024 Seldon Technologies Ltd.
# Use of this software is governed BY
# (1) the license included in the LICENSE file or
# (2) if the license included in the LICENSE file is the Business Source License 1.1,
# the Change License after the Change Date as each is defined in accordance with the LICENSE file.
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput
from mlserver.codecs.string import StringRequestCodec
from mlserver.logging import logger
import json
class SentimentInputTransformRuntime(MLModel):
async def load(self) -> bool:
return self.ready
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
logger.info("payload (input-transform): %s",payload)
res_list = self.decode_request(payload, default_codec=StringRequestCodec)
logger.info("res list (input-transform): %s",res_list)
texts = []
for res in res_list:
logger.info("decoded data (input-transform): %s", res)
#text = json.loads(res)
text = res
texts.append(text["text"])
logger.info("transformed data (input-transform): %s", texts)
response = StringRequestCodec.encode_response(
model_name="sentiment",
payload=texts
)
logger.info("response (input-transform): %s", response)
return response
cat ./sentiment-output-transform/model.py | pygmentize
# Copyright (c) 2024 Seldon Technologies Ltd.
# Use of this software is governed BY
# (1) the license included in the LICENSE file or
# (2) if the license included in the LICENSE file is the Business Source License 1.1,
# the Change License after the Change Date as each is defined in accordance with the LICENSE file.
from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput
from mlserver.codecs import StringCodec, Base64Codec, NumpyRequestCodec
from mlserver.codecs.string import StringRequestCodec
from mlserver.codecs.numpy import NumpyRequestCodec
import base64
from mlserver.logging import logger
import numpy as np
import json
class SentimentOutputTransformRuntime(MLModel):
async def load(self) -> bool:
return self.ready
async def predict(self, payload: InferenceRequest) -> InferenceResponse:
logger.info("payload (output-transform): %s",payload)
res_list = self.decode_request(payload, default_codec=StringRequestCodec)
logger.info("res list (output-transform): %s",res_list)
scores = []
for res in res_list:
logger.debug("decoded data (output transform): %s",res)
#sentiment = json.loads(res)
sentiment = res
if sentiment["label"] == "POSITIVE":
scores.append(1)
else:
scores.append(0)
response = NumpyRequestCodec.encode_response(
model_name="sentiments",
payload=np.array(scores)
)
logger.info("response (output-transform): %s", response)
return response
cat ../../models/hf-sentiment-input-transform.yaml
echo "---"
cat ../../models/hf-sentiment-output-transform.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: sentiment-input-transform
spec:
storageUri: "gs://seldon-models/scv2/examples/huggingface/mlserver_1.3.5/sentiment-input-transform"
requirements:
- mlserver
- python
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: sentiment-output-transform
spec:
storageUri: "gs://seldon-models/scv2/examples/huggingface/mlserver_1.3.5/sentiment-output-transform"
requirements:
- mlserver
- python
kubectl apply -f ../../models/hf-sentiment-input-transform.yaml -n ${NAMESPACE}
kubectl apply -f ../../models/hf-sentiment-output-transform.yaml -n ${NAMESPACE}
model.mlops.seldon.io/sentiment-input-transform created
model.mlops.seldon.io/sentiment-output-transform created
kubectl wait --for condition=ready --timeout=300s model sentiment-input-transform -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s model sentiment-output-transform -n ${NAMESPACE}
model.mlops.seldon.io/sentiment-input-transform condition met
model.mlops.seldon.io/sentiment-output-transform condition met
seldon model load -f ../../models/hf-sentiment-input-transform.yaml
seldon model load -f ../../models/hf-sentiment-output-transform.yaml
{}
{}
seldon model status sentiment-input-transform -w ModelAvailable | jq -M .
seldon model status sentiment-output-transform -w ModelAvailable | jq -M .
{}
{}
cat ../../pipelines/sentiment-explain.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: sentiment-explain
spec:
steps:
- name: sentiment
tensorMap:
sentiment-explain.inputs.predict: array_inputs
- name: sentiment-output-transform
inputs:
- sentiment
output:
steps:
- sentiment-output-transform
kubectl apply -f ../../models/sentiment-explain.yaml -n ${NAMESPACE}
model.mlops.seldon.io/sentiment-explain created
kubectl wait --for condition=ready --timeout=300s model sentiment-explain -n ${NAMESPACE}
model.mlops.seldon.io/sentiment-explain condition met
seldon pipeline load -f ../../pipelines/sentiment-explain.yaml
seldon pipeline status sentiment-explain -w PipelineReady | jq -M .
{
"pipelineName": "sentiment-explain",
"versions": [
{
"pipeline": {
"name": "sentiment-explain",
"uid": "cihuo3svgtec73bj6ncg",
"version": 2,
"steps": [
{
"name": "sentiment",
"tensorMap": {
"sentiment-explain.inputs.predict": "array_inputs"
}
},
{
"name": "sentiment-output-transform",
"inputs": [
"sentiment.outputs"
]
}
],
"output": {
"steps": [
"sentiment-output-transform.outputs"
]
},
"kubernetesMeta": {}
},
"state": {
"pipelineVersion": 2,
"status": "PipelineReady",
"reason": "created pipeline",
"lastChangeTimestamp": "2023-07-04T09:53:19.250753906Z",
"modelsReady": true
}
}
]
}
cat ../../models/hf-sentiment-explainer.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: sentiment-explainer
spec:
storageUri: "gs://seldon-models/scv2/examples/huggingface/speech-sentiment/explainer"
explainer:
type: anchor_text
pipelineRef: sentiment-explain
kubectl apply -f .../../pipelines/hf-sentiment-explainer.yaml -n ${NAMESPACE}
model.mlops.seldon.io/hf-sentiment-explainer created
kubectl wait --for condition=ready --timeout=300s model hf-sentiment-explainer -n ${NAMESPACE}
model.mlops.seldon.io/hf-sentiment-explainer condition met
seldon model load -f ../../models/hf-sentiment-explainer.yaml
{}
seldon model status sentiment-explainer -w ModelAvailable | jq -M .
Error: Model wait status timeout
cat ../../pipelines/speech-to-sentiment.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: speech-to-sentiment
spec:
steps:
- name: whisper
- name: sentiment
inputs:
- whisper
tensorMap:
whisper.outputs.output: args
- name: sentiment-input-transform
inputs:
- whisper
- name: sentiment-explainer
inputs:
- sentiment-input-transform
output:
steps:
- sentiment
- whisper
kubectl apply -f .../../pipelines/speech-to-sentiment.yaml -n ${NAMESPACE}
model.mlops.seldon.io/speech-to-sentiment created
kubectl wait --for condition=ready --timeout=300s model speech-to-sentiment -n ${NAMESPACE}
model.mlops.seldon.io/speech-to-sentiment condition met
seldon pipeline load -f ../../pipelines/speech-to-sentiment.yaml
seldon pipeline status speech-to-sentiment -w PipelineReady | jq -M .
{
"pipelineName": "speech-to-sentiment",
"versions": [
{
"pipeline": {
"name": "speech-to-sentiment",
"uid": "cihuqb4vgtec73bj6nd0",
"version": 2,
"steps": [
{
"name": "sentiment",
"inputs": [
"whisper.outputs"
],
"tensorMap": {
"whisper.outputs.output": "args"
}
},
{
"name": "sentiment-explainer",
"inputs": [
"sentiment-input-transform.outputs"
]
},
{
"name": "sentiment-input-transform",
"inputs": [
"whisper.outputs"
]
},
{
"name": "whisper"
}
],
"output": {
"steps": [
"sentiment.outputs",
"whisper.outputs"
]
},
"kubernetesMeta": {}
},
"state": {
"pipelineVersion": 2,
"status": "PipelineReady",
"reason": "created pipeline",
"lastChangeTimestamp": "2023-07-04T09:58:04.277171896Z",
"modelsReady": true
}
}
]
}
camera = CameraStream(constraints={'audio': True,'video':False})
recorder = AudioRecorder(stream=camera)
recorder
AudioRecorder(audio=Audio(value=b'', format='webm'), stream=CameraStream(constraints={'audio': True, 'video': …
infer("speech-to-sentiment.pipeline")
cihuqm8fh5ss73der5gg
b'{"text": " Cambridge is a great place."}'
b'{"label": "POSITIVE", "score": 0.9998548030853271}'
while True:
base64Res = !seldon pipeline inspect speech-to-sentiment.sentiment-explainer.outputs --format json \
--request-id ${REQUEST_ID}
j = json.loads(base64Res[0])
if j["topics"][0]["msgs"] is not None:
expBase64 = j["topics"][0]["msgs"][0]["value"]["outputs"][0]["contents"]["bytesContents"][0]
expRaw = base64.b64decode(expBase64)
exp = json.loads(expRaw)
print("")
print("Explanation anchors:",exp["data"]["anchor"])
break
else:
print(".",end='')
time.sleep(1)
......
Explanation anchors: ['great']
kubectl delete -f ../../pipelines/speech-to-sentiment.yaml -n ${NAMESPACE}
kubectl delete -f ../../pipelines/sentiment-explain.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/hf-whisper.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/hf-sentiment.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/hf-sentiment-input-transform.yaml -n ${NAMESPACE}
kubectl delete -f ../../models/hf-sentiment-output-transform.yaml -n ${NAMESPACE}
seldon pipeline unload speech-to-sentiment
seldon pipeline unload sentiment-explain
seldon model unload whisper
seldon model unload sentiment
seldon model unload sentiment-explainer
seldon model unload sentiment-output-transform
seldon model unload sentiment-input-transform
apiVersion: v1
kind: Node
metadata:
name: example-node # Replace with the actual node name
labels:
pool: infer-srv # Custom label
nvidia.com/gpu.product: A100-SXM4-40GB-MIG-1g.5gb-SHARED # Sample label from GPU discovery
cloud.google.com/gke-accelerator: nvidia-a100-80gb # GKE without NVIDIA GPU operator
cloud.google.com/gke-accelerator-count: "2" # Accelerator count
spec:
taints:
- effect: NoSchedule
key: seldon-gpu-srv
value: "true"
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-llm-local-gpu # <server name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
replicas: 1
serverConfig: mlserver # <reference Serverconfig CR>
extraCapabilities:
- model-on-gpu # Custom capability for matching Model to this server
podSpec:
nodeSelector: # Schedule pods only on nodes with these labels
pool: infer-srv
cloud.google.com/gke-accelerator: nvidia-a100-80gb # Example requesting specific GPU on GKE
# cloud.google.com/gke-accelerator-count: 2 # Optional GPU count
tolerations: # Allow scheduling on nodes with the matching taint
- effect: NoSchedule
key: seldon-gpu-srv
operator: Equal
value: "true"
containers: # Override settings from Serverconfig if needed
- name: mlserver
resources:
requests:
nvidia.com/gpu: 1 # Request a GPU for the mlserver container
cpu: 40
memory: 360Gi
ephemeral-storage: 290Gi
limits:
nvidia.com/gpu: 2 # Limit to 2 GPUs
cpu: 40
memory: 360Gi
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-llm-local-gpu # <server name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
podSpec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: "pool"
operator: In
values:
- infer-srv
- key: "cloud.google.com/gke-accelerator"
operator: In
values:
- nvidia-a100-80gb
tolerations: # Allow mlserver-llm-local-gpu pods to be scheduled on nodes with the matching taint
- effect: NoSchedule
key: seldon-gpu-srv
operator: Equal
value: "true"
containers: # If needed, override settings from ServerConfig for this specific Server
- name: mlserver
resources:
requests:
nvidia.com/gpu: 1 # Request a GPU for the mlserver container
cpu: 40
memory: 360Gi
ephemeral-storage: 290Gi
limits:
nvidia.com/gpu: 2 # Limit to 2 GPUs
cpu: 40
memory: 360Gi
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-llm-local-gpu # <server name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
podSpec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: "cloud.google.com/gke-accelerator-count"
operator: Gt # (greater than)
values: ["1"]
- key: "gpu.gpu-vendor.example/installed-memory"
operator: Gt
values: ["75000"]
- key: "feature.node.kubernetes.io/pci-10.present" # NFD Feature label
operator: In
values: ["true"] # (optional) only schedule on nodes with PCI device 10
tolerations: # Allow mlserver-llm-local-gpu pods to be scheduled on nodes with the matching taint
- effect: NoSchedule
key: seldon-gpu-srv
operator: Equal
value: "true"
containers: # If needed, override settings from ServerConfig for this specific Server
- name: mlserver
env:
... # Add your environment variables here
image: ... # Specify your container image here
resources:
requests:
nvidia.com/gpu: 1 # Request a GPU for the mlserver container
cpu: 40
memory: 360Gi
ephemeral-storage: 290Gi
limits:
nvidia.com/gpu: 2 # Limit to 2 GPUs
cpu: 40
memory: 360Gi
... # Other configurations can go here
apiVersion: mlops.seldon.io/v1alpha1
kind: ServerConfig
metadata:
name: mlserver-llm # <ServerConfig name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
podSpec:
nodeSelector: # Schedule pods only on nodes with these labels
pool: infer-srv
cloud.google.com/gke-accelerator: nvidia-a100-80gb # Example requesting specific GPU on GKE
# cloud.google.com/gke-accelerator-count: 2 # Optional GPU count
tolerations: # Allow scheduling on nodes with the matching taint
- effect: NoSchedule
key: seldon-gpu-srv
operator: Equal
value: "true"
containers: # Define the container specifications
- name: mlserver
env: # Environment variables (fill in as needed)
...
image: ... # Specify the container image
resources:
requests:
nvidia.com/gpu: 1 # Request a GPU for the mlserver container
cpu: 40
memory: 360Gi
ephemeral-storage: 290Gi
limits:
nvidia.com/gpu: 2 # Limit to 2 GPUs
cpu: 40
memory: 360Gi
... # Additional container configurations
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: llama3 # <model name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
requirements:
- model-on-gpu # requirement matching a Server capability
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
name: mlserver-llm-local-gpu # <server name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
serverConfig: mlserver # <reference ServerConfig CR>
extraCapabilities:
- model-on-gpu # custom capability that can be used for matching Model to this server
# Other fields would go here
apiVersion: mlops.seldon.io/v1alpha1
kind: ServerConfig
metadata:
name: mlserver-llm # <ServerConfig name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
podSpec:
containers:
- name: agent # note the setting is applied to the agent container
env:
- name: SELDON_SERVER_CAPABILITIES
value: mlserver,alibi-detect,...,xgboost,model-on-gpu # add capability to the list
image: ...
# Other configurations go here
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: llama3 # <model name>
namespace: seldon-mesh # <seldon runtime namespace>
spec:
server: mlserver-llm-local-gpu # <reference Server CR>
requirements:
- model-on-gpu # requirement matching a Server capability
curl -k http://<INGRESS_IP>:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: iris" \
-d '{
"inputs": [
{
"name": "predict",
"shape": [1, 4],
"datatype": "FP32",
"data": [[1, 2, 3, 4]]
}
]
}' | jq
seldon model infer iris --inference-host <INGRESS_IP>:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
for i in {1..10}; do
curl -s -k <INGRESS_IP>:80/v2/models/experiment-sample/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: experiment-sample.experiment" \
-d '{"inputs":[{"name":"predict","shape":[1,4],"datatype":"FP32","data":[[1,2,3,4]]}]}' \
| jq -r .model_name
done | sort | uniq -c
4 iris2_1
6 iris_1
seldon model infer --inference-host <INGRESS_IP>:80 -i 10 iris \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
Success: map[:iris2_1::4 :iris_1::6]
curl -k <INGRESS_IP>:80/v2/models/tfsimples/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimples.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' |jq
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimples --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
}
]
}
curl -k <INGRESS_IP>:80/v2/models/join/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: join.pipeline" \
-d '{
"model_name": "simple",
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' |jq
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer join --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
}
]
}
ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Seldon Core 2: http://$ISTIO_INGRESS"
cat ./models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
kubectl create -f ./models/sklearn-iris-gs.yaml -n seldon-mesh
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/iris condition met
kubectl get model iris -n seldon-mesh -o jsonpath='{.status}' | jq -M .
{
"conditions": [
{
"lastTransitionTime": "2023-06-30T10:01:52Z",
"message": "ModelAvailable",
"status": "True",
"type": "ModelReady"
},
{
"lastTransitionTime": "2023-06-30T10:01:52Z",
"status": "True",
"type": "Ready"
}
],
"replicas": 1
}
Make a REST inference call:
{
"model_name": "iris_1",
"model_version": "1",
"id": "7fd401e1-3dce-46f5-9668-902aea652b89",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
seldon model infer iris --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .
{
"modelName": "iris_1",
"modelVersion": "1",
"outputs": [
{
"name": "predict",
"datatype": "INT64",
"shape": [
"1",
"1"
],
"parameters": {
"content_type": {
"stringParam": "np"
}
},
"contents": {
"int64Contents": [
"2"
]
}
}
]
}
kubectl get server mlserver -n seldon-mesh -o jsonpath='{.status}' | jq -M .
{
"conditions": [
{
"lastTransitionTime": "2023-06-30T09:59:12Z",
"status": "True",
"type": "Ready"
},
{
"lastTransitionTime": "2023-06-30T09:59:12Z",
"reason": "StatefulSet replicas matches desired replicas",
"status": "True",
"type": "StatefulSetReady"
}
],
"loadedModels": 1,
"replicas": 1,
"selector": "seldon-server-name=mlserver"
}
kubectl delete -f ./models/sklearn-iris-gs.yaml -n seldon-mesh
model.mlops.seldon.io "iris" deleted
cat ./models/sklearn1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
cat ./models/sklearn2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris2
spec:
storageUri: "gs://seldon-models/mlserver/iris"
requirements:
- sklearn
kubectl create -f ./models/sklearn1.yaml -n seldon-mesh
kubectl create -f ./models/sklearn2.yaml -n seldon-mesh
model.mlops.seldon.io/iris created
model.mlops.seldon.io/iris2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/iris condition met
model.mlops.seldon.io/iris2 condition met
cat ./experiments/ab-default-model.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Experiment
metadata:
name: experiment-sample
spec:
default: iris
candidates:
- name: iris
weight: 50
- name: iris2
weight: 50
kubectl create -f ./experiments/ab-default-model.yaml -n seldon-mesh
experiment.mlops.seldon.io/experiment-sample created
kubectl wait --for condition=ready --timeout=300s experiment --all -n seldon-mesh
experiment.mlops.seldon.io/experiment-sample condition met
kubectl delete -f ./experiments/ab-default-model.yaml -n seldon-mesh
kubectl delete -f ./models/sklearn1.yaml -n seldon-mesh
kubectl delete -f ./models/sklearn2.yaml -n seldon-mesh
experiment.mlops.seldon.io "experiment-sample" deleted
model.mlops.seldon.io "iris" deleted
model.mlops.seldon.io "iris2" deleted
cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimples.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimples
spec:
steps:
- name: tfsimple1
- name: tfsimple2
inputs:
- tfsimple1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple1.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple2
kubectl create -f ./pipelines/tfsimples.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimples created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimples condition met
kubectl delete -f ./pipelines/tfsimples.yaml -n seldon-mesh
pipeline.mlops.seldon.io "tfsimples" deleted
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yaml
cat ./models/tfsimple3.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple3
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple3.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
model.mlops.seldon.io/tfsimple3 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
model.mlops.seldon.io/tfsimple3 condition met
cat ./pipelines/tfsimples-join.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: join
spec:
steps:
- name: tfsimple1
- name: tfsimple2
- name: tfsimple3
inputs:
- tfsimple1.outputs.OUTPUT0
- tfsimple2.outputs.OUTPUT1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple2.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple3
kubectl create -f ./pipelines/tfsimples-join.yaml -n seldon-mesh
pipeline.mlops.seldon.io/join created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/join condition met
kubectl delete -f ./pipelines/tfsimples-join.yaml -n seldon-mesh
pipeline.mlops.seldon.io "join" deleted
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple3.yaml -n seldon-mesh
model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
model.mlops.seldon.io "tfsimple3" deleted
cat ./models/income.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/classifier"
requirements:
- sklearn
kubectl create -f ./models/income.yaml -n seldon-mesh
model.mlops.seldon.io/income created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/income condition met
kubectl get model income -n seldon-mesh -o jsonpath='{.status}' | jq -M .
{
"availableReplicas": 1,
"conditions": [
{
"lastTransitionTime": "2025-10-30T08:37:24Z",
"message": "ModelAvailable",
"status": "True",
"type": "ModelReady"
},
{
"lastTransitionTime": "2025-10-30T08:37:24Z",
"status": "True",
"type": "Ready"
}
],
"modelgwReady": "ModelAvailable(1/1 ready ) ",
"replicas": 1,
"selector": "server=mlserver"
}
seldon model infer income --inference-host <INGRESS_IP>:80 \
'{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'
{
"model_name": "income_1",
"model_version": "1",
"id": "cdf32df2-eb42-42d8-9f66-404bcab95540",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
0
]
}
]
}
cat ./models/income-explainer.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: income-explainer
spec:
storageUri: "gs://seldon-models/scv2/examples/mlserver_1.3.5/income/explainer"
explainer:
type: anchor_tabular
modelRef: income
kubectl create -f ./models/income-explainer.yaml -n seldon-mesh
model.mlops.seldon.io/income-explainer created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/income condition met
model.mlops.seldon.io/income-explainer condition met
kubectl get model income-explainer -n seldon-mesh -o jsonpath='{.status}' | jq -M .
{
"conditions": [
{
"lastTransitionTime": "2023-06-30T10:03:07Z",
"message": "ModelAvailable",
"status": "True",
"type": "ModelReady"
},
{
"lastTransitionTime": "2023-06-30T10:03:07Z",
"status": "True",
"type": "Ready"
}
],
"replicas": 1
}
seldon model infer income-explainer --inference-host <INGRESS_IP>:80 \
'{"inputs": [{"name": "predict", "shape": [1, 12], "datatype": "FP32", "data": [[47,4,1,1,1,3,4,1,0,0,40,9]]}]}'
{
"model_name": "income-explainer_1",
"model_version": "1",
"id": "3028a904-9bb3-42d7-bdb7-6e6993323ed7",
"parameters": {},
"outputs": [
{
"name": "explanation",
"shape": [
1,
1
],
"datatype": "BYTES",
"parameters": {
"content_type": "str"
},
"data": [
"{\"meta\": {\"name\": \"AnchorTabular\", \"type\": [\"blackbox\"], \"explanations\": [\"local\"], \"params\": {\"seed\": 1, \"disc_perc\": [25, 50, 75], \"threshold\": 0.95, \"delta\": 0.1, \"tau\": 0.15, \"batch_size\": 100, \"coverage_samples\": 10000, \"beam_size\": 1, \"stop_on_first\": false, \"max_anchor_size\": null, \"min_samples_start\": 100, \"n_covered_ex\": 10, \"binary_cache_size\": 10000, \"cache_margin\": 1000, \"verbose\": false, \"verbose_every\": 1, \"kwargs\": {}}, \"version\": \"0.9.1\"}, \"data\": {\"anchor\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"precision\": 0.9705882352941176, \"coverage\": 0.0699, \"raw\": {\"feature\": [3, 5], \"mean\": [0.8094218415417559, 0.9705882352941176], \"precision\": [0.8094218415417559, 0.9705882352941176], \"coverage\": [0.3036, 0.0699], \"examples\": [{\"covered_true\": [[23, 4, 1, 1, 5, 1, 4, 0, 0, 0, 40, 9], [44, 4, 1, 1, 8, 0, 4, 1, 0, 0, 40, 9], [60, 2, 5, 1, 5, 1, 4, 0, 0, 0, 25, 9], [52, 4, 1, 1, 2, 0, 4, 1, 0, 0, 50, 9], [66, 6, 1, 1, 8, 0, 4, 1, 0, 0, 8, 9], [52, 4, 1, 1, 8, 0, 4, 1, 0, 0, 40, 9], [27, 4, 1, 1, 1, 1, 4, 1, 0, 0, 35, 9], [48, 4, 1, 1, 6, 0, 4, 1, 0, 0, 45, 9], [45, 6, 1, 1, 5, 0, 4, 1, 0, 0, 40, 9], [40, 2, 1, 1, 5, 4, 4, 0, 0, 0, 45, 9]], \"covered_false\": [[42, 6, 5, 1, 6, 0, 4, 1, 99999, 0, 80, 9], [29, 4, 1, 1, 8, 1, 4, 1, 0, 0, 50, 9], [49, 4, 1, 1, 8, 0, 4, 1, 0, 0, 50, 9], [34, 4, 5, 1, 8, 0, 4, 1, 0, 0, 40, 9], [38, 2, 1, 1, 5, 5, 4, 0, 7688, 0, 40, 9], [45, 7, 5, 1, 5, 0, 4, 1, 0, 0, 45, 9], [43, 4, 2, 1, 5, 0, 4, 1, 99999, 0, 55, 9], [47, 4, 5, 1, 6, 1, 4, 1, 27828, 0, 60, 9], [42, 6, 1, 1, 2, 0, 4, 1, 15024, 0, 60, 9], [56, 4, 1, 1, 6, 0, 2, 1, 7688, 0, 45, 9]], \"uncovered_true\": [], \"uncovered_false\": []}, {\"covered_true\": [[23, 4, 1, 1, 4, 3, 4, 1, 0, 0, 40, 9], [50, 2, 5, 1, 8, 3, 2, 1, 0, 0, 45, 9], [24, 4, 1, 1, 7, 3, 4, 0, 0, 0, 40, 3], [62, 4, 5, 1, 5, 3, 4, 1, 0, 0, 40, 9], [22, 4, 1, 1, 5, 3, 4, 1, 0, 0, 40, 9], [44, 4, 1, 1, 1, 3, 4, 0, 0, 0, 40, 9], [46, 4, 1, 1, 4, 3, 4, 1, 0, 0, 40, 9], [44, 4, 1, 1, 2, 3, 4, 1, 0, 0, 40, 9], [25, 4, 5, 1, 5, 3, 4, 1, 0, 0, 35, 9], [32, 2, 5, 1, 5, 3, 4, 1, 0, 0, 50, 9]], \"covered_false\": [[57, 5, 5, 1, 6, 3, 4, 1, 99999, 0, 40, 9], [44, 4, 1, 1, 8, 3, 4, 1, 7688, 0, 60, 9], [43, 2, 5, 1, 4, 3, 2, 0, 8614, 0, 47, 9], [56, 5, 2, 1, 5, 3, 4, 1, 99999, 0, 70, 9]], \"uncovered_true\": [], \"uncovered_false\": []}], \"all_precision\": 0, \"num_preds\": 1000000, \"success\": true, \"names\": [\"Marital Status = Never-Married\", \"Relationship = Own-child\"], \"prediction\": [0], \"instance\": [47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0], \"instances\": [[47.0, 4.0, 1.0, 1.0, 1.0, 3.0, 4.0, 1.0, 0.0, 0.0, 40.0, 9.0]]}}}"
]
}
]
}
kubectl delete -f ./models/income.yaml -n seldon-mesh
kubectl delete -f ./models/income-explainer.yaml -n seldon-mesh
model.mlops.seldon.io "income" deleted
model.mlops.seldon.io "income-explainer" deleted
pip install mlserver
import os
os.environ["NAMESPACE"] = "seldon-mesh"
MESH_IP=!kubectl get svc seldon-mesh -n ${NAMESPACE} -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP=MESH_IP[0]
import os
os.environ['MESH_IP'] = MESH_IP
MESH_IP
'172.18.255.2'
cat models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
cat pipelines/iris.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: iris-pipeline
spec:
steps:
- name: iris
output:
steps:
- iris
cat models/tfsimple1.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
cat pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
kubectl apply -f models/sklearn-iris-gs.yaml -n ${NAMESPACE}
kubectl apply -f pipelines/iris.yaml -n ${NAMESPACE}
kubectl apply -f models/tfsimple1.yaml -n ${NAMESPACE}
kubectl apply -f pipelines/tfsimple.yaml -n ${NAMESPACE}
model.mlops.seldon.io/iris created
pipeline.mlops.seldon.io/iris-pipeline created
model.mlops.seldon.io/tfsimple1 created
pipeline.mlops.seldon.io/tfsimple created
kubectl wait --for condition=ready --timeout=300s model --all -n ${NAMESPACE}
kubectl wait --for condition=ready --timeout=300s pipelines --all -n ${NAMESPACE}
model.mlops.seldon.io/iris condition met
model.mlops.seldon.io/tfsimple1 condition met
pipeline.mlops.seldon.io/iris-pipeline condition met
pipeline.mlops.seldon.io/tfsimple condition met
seldon model infer iris --inference-host ${MESH_IP}:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' | jq -M .
{
"model_name": "iris_1",
"model_version": "1",
"id": "25e1c1b9-a20f-456d-bdff-c75d5ba83b1f",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
seldon pipeline infer iris-pipeline --inference-host ${MESH_IP}:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2
],
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
}
}
]
}
seldon model infer tfsimple1 --inference-host ${MESH_IP}:80 \
'{"outputs":[{"name":"OUTPUT0"}], "inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .{
"model_name": "tfsimple1_1",
"model_version": "1",
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
1,
16
],
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
]
}
seldon pipeline infer tfsimple --inference-host ${MESH_IP}:80 \
'{"outputs":[{"name":"OUTPUT0"}], "inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
cat batch-inputs/iris-input.txt | head -n 1 | jq -M .
{
"inputs": [
{
"name": "predict",
"data": [
0.38606369295833043,
0.006894049558299753,
0.6104082981607108,
0.3958954239450676
],
"datatype": "FP64",
"shape": [
1,
4
]
}
]
}
%%bash
mlserver infer -u ${MESH_IP} -m iris -i batch-inputs/iris-input.txt -o /tmp/iris-output.txt --workers 5
2023-06-30 11:05:32,389 [mlserver] INFO - server url: 172.18.255.2
2023-06-30 11:05:32,389 [mlserver] INFO - model name: iris
2023-06-30 11:05:32,389 [mlserver] INFO - request headers: {}
2023-06-30 11:05:32,389 [mlserver] INFO - input file path: batch-inputs/iris-input.txt
2023-06-30 11:05:32,389 [mlserver] INFO - output file path: /tmp/iris-output.txt
2023-06-30 11:05:32,389 [mlserver] INFO - workers: 5
2023-06-30 11:05:32,389 [mlserver] INFO - retries: 3
2023-06-30 11:05:32,389 [mlserver] INFO - batch interval: 0.0
2023-06-30 11:05:32,389 [mlserver] INFO - batch jitter: 0.0
2023-06-30 11:05:32,389 [mlserver] INFO - connection timeout: 60
2023-06-30 11:05:32,389 [mlserver] INFO - micro-batch size: 1
2023-06-30 11:05:32,503 [mlserver] INFO - Finalizer: processed instances: 100
2023-06-30 11:05:32,503 [mlserver] INFO - Total processed instances: 100
2023-06-30 11:05:32,503 [mlserver] INFO - Time taken: 0.11 seconds
%%bash
mlserver infer -u ${MESH_IP} -m iris-pipeline.pipeline -i batch-inputs/iris-input.txt -o /tmp/iris-pipeline-output.txt --workers 5
2023-06-30 11:05:35,857 [mlserver] INFO - server url: 172.18.255.2
2023-06-30 11:05:35,858 [mlserver] INFO - model name: iris-pipeline.pipeline
2023-06-30 11:05:35,858 [mlserver] INFO - request headers: {}
2023-06-30 11:05:35,858 [mlserver] INFO - input file path: batch-inputs/iris-input.txt
2023-06-30 11:05:35,858 [mlserver] INFO - output file path: /tmp/iris-pipeline-output.txt
2023-06-30 11:05:35,858 [mlserver] INFO - workers: 5
2023-06-30 11:05:35,858 [mlserver] INFO - retries: 3
2023-06-30 11:05:35,858 [mlserver] INFO - batch interval: 0.0
2023-06-30 11:05:35,858 [mlserver] INFO - batch jitter: 0.0
2023-06-30 11:05:35,858 [mlserver] INFO - connection timeout: 60
2023-06-30 11:05:35,858 [mlserver] INFO - micro-batch size: 1
2023-06-30 11:05:36,145 [mlserver] INFO - Finalizer: processed instances: 100
2023-06-30 11:05:36,146 [mlserver] INFO - Total processed instances: 100
2023-06-30 11:05:36,146 [mlserver] INFO - Time taken: 0.29 seconds
cat /tmp/iris-output.txt | head -n 1 | jq -M .
{
"model_name": "iris_1",
"model_version": "1",
"id": "46bdfca2-8805-4a72-b1ce-95e4f38c1a19",
"parameters": {
"inference_id": "46bdfca2-8805-4a72-b1ce-95e4f38c1a19",
"batch_index": 0
},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
1
]
}
]
}
cat /tmp/iris-pipeline-output.txt | head -n 1 | jq .
{
"model_name": "",
"id": "37e8c013-b348-41e8-89b9-fea86a4f9632",
"parameters": {
"batch_index": 1
},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
1
]
}
]
}
cat batch-inputs/tfsimple-input.txt | head -n 1 | jq -M .
{
"inputs": [
{
"name": "INPUT0",
"data": [
75,
39,
9,
44,
32,
97,
99,
40,
13,
27,
25,
36,
18,
77,
62,
60
],
"datatype": "INT32",
"shape": [
1,
16
]
},
{
"name": "INPUT1",
"data": [
39,
7,
14,
58,
13,
88,
98,
66,
97,
57,
49,
3,
49,
63,
37,
12
],
"datatype": "INT32",
"shape": [
1,
16
]
}
]
}
%%bash
mlserver infer -u ${MESH_IP} -m tfsimple1 -i batch-inputs/tfsimple-input.txt -o /tmp/tfsimple-output.txt --workers 5 -b
2023-06-30 11:22:52,662 [mlserver] INFO - server url: 172.18.255.2
2023-06-30 11:22:52,662 [mlserver] INFO - model name: tfsimple1
2023-06-30 11:22:52,662 [mlserver] INFO - request headers: {}
2023-06-30 11:22:52,662 [mlserver] INFO - input file path: batch-inputs/tfsimple-input.txt
2023-06-30 11:22:52,662 [mlserver] INFO - output file path: /tmp/tfsimple-output.txt
2023-06-30 11:22:52,662 [mlserver] INFO - workers: 5
2023-06-30 11:22:52,662 [mlserver] INFO - retries: 3
2023-06-30 11:22:52,662 [mlserver] INFO - batch interval: 0.0
2023-06-30 11:22:52,662 [mlserver] INFO - batch jitter: 0.0
2023-06-30 11:22:52,662 [mlserver] INFO - connection timeout: 60
2023-06-30 11:22:52,662 [mlserver] INFO - micro-batch size: 1
2023-06-30 11:22:52,755 [mlserver] INFO - Finalizer: processed instances: 100
2023-06-30 11:22:52,755 [mlserver] INFO - Total processed instances: 100
2023-06-30 11:22:52,756 [mlserver] INFO - Time taken: 0.09 seconds
%%bash
mlserver infer -u ${MESH_IP} -m tfsimple.pipeline -i batch-inputs/tfsimple-input.txt -o /tmp/tfsimple-pipeline-output.txt --workers 5
2023-06-30 11:22:54,065 [mlserver] INFO - server url: 172.18.255.2
2023-06-30 11:22:54,065 [mlserver] INFO - model name: tfsimple.pipeline
2023-06-30 11:22:54,065 [mlserver] INFO - request headers: {}
2023-06-30 11:22:54,065 [mlserver] INFO - input file path: batch-inputs/tfsimple-input.txt
2023-06-30 11:22:54,065 [mlserver] INFO - output file path: /tmp/tfsimple-pipeline-output.txt
2023-06-30 11:22:54,065 [mlserver] INFO - workers: 5
2023-06-30 11:22:54,065 [mlserver] INFO - retries: 3
2023-06-30 11:22:54,065 [mlserver] INFO - batch interval: 0.0
2023-06-30 11:22:54,065 [mlserver] INFO - batch jitter: 0.0
2023-06-30 11:22:54,065 [mlserver] INFO - connection timeout: 60
2023-06-30 11:22:54,065 [mlserver] INFO - micro-batch size: 1
2023-06-30 11:22:54,302 [mlserver] INFO - Finalizer: processed instances: 100
2023-06-30 11:22:54,302 [mlserver] INFO - Total processed instances: 100
2023-06-30 11:22:54,303 [mlserver] INFO - Time taken: 0.24 seconds
cat /tmp/tfsimple-output.txt | head -n 1 | jq -M .
{
"model_name": "tfsimple1_1",
"model_version": "1",
"id": "19952272-b023-4079-aa08-f1880ded05e5",
"parameters": {
"inference_id": "19952272-b023-4079-aa08-f1880ded05e5",
"batch_index": 1
},
"outputs": [
{
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32",
"parameters": {},
"data": [
115,
69,
97,
112,
73,
106,
58,
182,
114,
66,
64,
110,
100,
24,
22,
77
]
},
{
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32",
"parameters": {},
"data": [
-77,
33,
25,
-52,
-49,
-88,
-48,
0,
-50,
26,
-44,
46,
-2,
18,
-6,
-47
]
}
]
}
cat /tmp/tfsimple-pipeline-output.txt | head -n 1 | jq -M .
{
"model_name": "",
"id": "46b05aab-07d9-414d-be96-c03d1863552a",
"parameters": {
"batch_index": 3
},
"outputs": [
{
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32",
"data": [
140,
164,
85,
58,
152,
76,
70,
56,
100,
141,
98,
181,
115,
177,
106,
193
]
},
{
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32",
"data": [
-10,
0,
-11,
-38,
2,
-36,
-52,
-8,
-18,
57,
94,
-5,
-27,
17,
58,
-1
]
}
]
}
kubectl delete -f models/sklearn-iris-gs.yaml -n ${NAMESPACE}
kubectl delete -f pipelines/iris.yaml -n ${NAMESPACE}
model.mlops.seldon.io "iris" deleted
pipeline.mlops.seldon.io "iris-pipeline" deleted
kubectl delete -f models/tfsimple1.yaml -n ${NAMESPACE}
kubectl delete -f pipelines/tfsimple.yaml -n ${NAMESPACE}
model.mlops.seldon.io "tfsimple1" deleted
pipeline.mlops.seldon.io "tfsimple" deleted
This page describes a predict/inference API independent of any specific ML/DL framework and model server. These APIs are able to support both easy-to-use and high-performance use cases. By implementing this protocol both inference clients and servers will increase their utility and portability by being able to operate seamlessly on platforms that have standardized around this API. This protocol is endorsed by NVIDIA Triton Inference Server, TensorFlow Serving, and ONNX Runtime Server. It is sometimes referred to by its old name "V2 Inference Protocol".
For an inference server to be compliant with this protocol the server must implement all APIs described below, except where an optional feature is explicitly noted. A compliant inference server may choose to implement either or both of the HTTP/REST API and the GRPC API.
The protocol supports an extension mechanism as a required part of the API, but this document does not propose any specific extensions. Any specific extensions will be proposed separately.
The HTTP/REST API uses JSON because it is widely supported and language independent. In all JSON schemas shown in this document $number, $string, $boolean, $object and $array refer to the fundamental JSON types. #optional indicates an optional JSON field.
All strings in all contexts are case-sensitive.
For Seldon, a server must recognize the following URLs (a sketch of the standard paths follows the list). The versions portion of the URL is shown as optional to allow implementations that don’t support versioning, or for cases when the user does not want to specify a specific model version (in which case the server will choose a version based on its own policies).
Health:
Server Metadata:
Model Metadata:
Inference:
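As a sketch, these map to the following standard paths, where the bracketed version segment is the optional part referred to above:

Health: GET v2/health/live, GET v2/health/ready, GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready
Server Metadata: GET v2
Model Metadata: GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]
Inference: POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer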
A health request is made with an HTTP GET to a health endpoint. The HTTP response status code indicates a boolean result for the health request. A 200 status code indicates true and a 4xx status code indicates false. The HTTP response body should be empty. There are three health APIs.
The “server live” API indicates if the inference server is able to receive and respond to metadata and inference requests. The “server live” API can be used directly to implement the Kubernetes livenessProbe.
The “server ready” health API indicates if all the models are ready for inferencing. The “server ready” health API can be used directly to implement the Kubernetes readinessProbe.
The “model ready” health API indicates if a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies.
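For example, assuming a server reachable at localhost:8080 and a model named iris (host, port, and model name are illustrative), the three health APIs can be probed with curl; a 200 status code indicates a healthy result:

curl -i localhost:8080/v2/health/live
curl -i localhost:8080/v2/health/ready
curl -i localhost:8080/v2/models/iris/ready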
The server metadata endpoint provides information about the server. A server metadata request is made with an HTTP GET to a server metadata endpoint. In the corresponding response the HTTP body contains the Server Metadata Response JSON Object or the Server Metadata Response JSON Error Object.
A successful server metadata request is indicated by a 200 HTTP status code. The server metadata response object, identified as $metadata_server_response, is returned in the HTTP body.
“name” : A descriptive name for the server.
"version" : The server version.
“extensions” : The extensions supported by the server. Currently no standard extensions are defined. Individual inference servers may define and document their own extensions.
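An illustrative $metadata_server_response is sketched below; the server name and version shown are placeholders:

{
  "name": "mlserver",
  "version": "1.3.5",
  "extensions": []
}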
A failed server metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_server_error_response object.
“error” : The descriptive message for the error.
The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. In the corresponding response the HTTP body contains the Model Metadata Response JSON Object or the Model Metadata Response JSON Error Object. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.
A successful model metadata request is indicated by a 200 HTTP status code. The metadata response object, identified as $metadata_model_response, is returned in the HTTP body for every successful model metadata request.
“name” : The name of the model.
"versions" : The model versions that may be explicitly requested via the appropriate endpoint. Optional for servers that don’t support versions. Optional for models that don’t allow a version to be explicitly requested.
“platform” : The framework/backend for the model. See Platforms.
“inputs” : The inputs required by the model.
“outputs” : The outputs produced by the model.
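An illustrative $metadata_model_response for a model with one input and one output is sketched below; the model name, platform, and tensor metadata are placeholders:

{
  "name": "iris",
  "versions": ["1"],
  "platform": "sklearn",
  "inputs": [
    {
      "name": "predict",
      "datatype": "FP32",
      "shape": [-1, 4]
    }
  ],
  "outputs": [
    {
      "name": "predict",
      "datatype": "INT64",
      "shape": [-1, 1]
    }
  ]
}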
Each model input and output tensor’s metadata is described with a $metadata_tensor object.
“name” : The name of the tensor.
"datatype" : The data-type of the tensor elements as defined in Tensor Data Types.
"shape" : The shape of the tensor. Variable-size dimensions are specified as -1.
A failed model metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_model_error_response object.
“error” : The descriptive message for the error.
An inference request is made with an HTTP POST to an inference endpoint. In the request the HTTP body contains the Inference Request JSON Object. In the corresponding response the HTTP body contains the Inference Response JSON Object or the Inference Response JSON Error Object. See Inference Request Examples for some example HTTP/REST requests and responses.
The inference request object, identified as $inference_request, is
required in the HTTP body of the POST request. The model name and
(optionally) version must be available in the URL. If a version is not
provided the server may choose a version based on its own policies or
return an error.
id : An identifier for this request. Optional, but if specified
this identifier must be returned in the response.
parameters : An object containing zero or more parameters for this
inference request expressed as key/value pairs. See Parameters for more information.
inputs : The input tensors. Each input is described using the $request_input schema defined in Request Input.
outputs : The output tensors requested for this inference. Each
requested output is described using the $request_output schema
defined in Request Output. Optional; if not specified, all outputs produced by the model will be returned using default $request_output settings.
Request Input
The $request_input JSON describes an input to the model. If the
input is batched, the shape and data must represent the full shape and
contents of the entire batch.
"name": The name of the input tensor.
"shape": The shape of the input tensor. Each dimension must be an
integer representable as an unsigned 64-bit integer value.
"datatype": The data-type of the input tensor elements as defined
in Tensor Data Types.
"parameters": An object containing zero or more parameters for this
input expressed as key/value pairs. See Parameters for more information.
“data”: The contents of the tensor. See Tensor Data for more information.
Request Output
The $request_output JSON is used to request which output tensors
should be returned from the model.
"name": The name of the output tensor.
"parameters": An object containing zero or more parameters for this
output expressed as key/value pairs. See Parameters
for more information.
A successful inference request is indicated by a 200 HTTP status code. The inference response object, identified as $inference_response, is returned in the HTTP body.
"model_name": The name of the model used for inference.
"model_version": The specific model version used for
inference. Inference servers that do not implement versioning should
not provide this field in the response.
"id": The "id" identifier given in the request, if any.
"parameters": An object containing zero or more parameters for this
response expressed as key/value pairs. See Parameters for more information.
"outputs": The output tensors. Each output is described using the $response_output schema defined in Response Output.
Response Output
The $response_output JSON describes an output from the model. If the
output is batched, the shape and data represents the full shape of the
entire batch.
"name": The name of the output tensor.
"shape": The shape of the output tensor. Each dimension must be an
integer representable as an unsigned 64-bit integer value.
"datatype": The data-type of the output tensor elements as defined
in Tensor Data Types.
"parameters": An object containing zero or more parameters for this
input expressed as key/value pairs. See Parameters for more information.
“data”: The contents of the tensor. See Tensor Data for more information.
A failed inference request must be indicated by an HTTP error status
(typically 400). The HTTP body must contain the $inference_error_response object.
“error”: The descriptive message for the error.
The following example shows an inference request to a model with two inputs and one output. The HTTP Content-Length header gives the size of the JSON object.
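A request of that shape is sketched below; the model name, tensor names, data values, and Content-Length are illustrative, and the single requested output is named output0 to match the description:

POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8080
Content-Type: application/json
Content-Length: <length>

{
  "id": "42",
  "inputs": [
    {
      "name": "input0",
      "shape": [2, 2],
      "datatype": "UINT32",
      "data": [1, 2, 3, 4]
    },
    {
      "name": "input1",
      "shape": [3],
      "datatype": "BOOL",
      "data": [true, false, true]
    }
  ],
  "outputs": [
    {
      "name": "output0"
    }
  ]
}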
For the above request the inference server must return the “output0” output tensor. Assuming the model returns a [ 3, 2 ] tensor of data type FP32 the following response would be returned.
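A corresponding response is sketched below; the id echoes the request, and the data values and Content-Length are illustrative:

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: <length>

{
  "id": "42",
  "outputs": [
    {
      "name": "output0",
      "shape": [3, 2],
      "datatype": "FP32",
      "data": [1.0, 1.1, 2.0, 2.1, 3.0, 3.1]
    }
  ]
}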
The $parameters JSON describes zero or more "name"/"value" pairs,
where the "name" is the name of the parameter and the "value" is a $string, $number, or $boolean.
Currently no parameters are defined. As required a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities.
Tensor data must be presented in row-major order of the tensor elements. Element values must be given in "linear" order without any stride or padding between elements. Tensor elements may be presented in their natural multi-dimensional representation, or as a flattened one-dimensional representation.
Tensor data given explicitly is provided in a JSON array. Each element of the array may be an integer, floating-point number, string or boolean value. The server can decide to coerce each element to the required type or return an error if an unexpected value is received. Note that fp16 is problematic to communicate explicitly since there is not a standard fp16 representation across backends nor typically the programmatic support to create the fp16 representation for a JSON number.
For example, the 2-dimensional matrix:
Can be represented in its natural format as:
Or in a flattened one-dimensional representation:
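As a small non-normative sketch using numpy, both representations can be produced from the same row-major array:

```python
import numpy as np

matrix = np.array([[1, 2], [4, 5]])

# Natural multi-dimensional representation for the "data" field.
natural = matrix.tolist()                       # [[1, 2], [4, 5]]

# Flattened one-dimensional, row-major ("linear") representation.
flattened = matrix.flatten(order="C").tolist()  # [1, 2, 4, 5]
```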
The GRPC API closely follows the concepts defined in the HTTP/REST API. A compliant server must implement the health, metadata, and inference APIs described in this section.
All strings in all contexts are case-sensitive.
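As one illustrative option (an assumption, not something mandated by this protocol), a Python client such as the tritonclient package can call a gRPC endpoint that implements these APIs; the URL and model name below are placeholders.

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Placeholder address of a gRPC endpoint implementing the inference APIs.
client = grpcclient.InferenceServerClient(url="localhost:9000")

# Describe and populate one input tensor.
inp = grpcclient.InferInput("input0", [2, 2], "UINT32")
inp.set_data_from_numpy(np.array([[1, 2], [3, 4]], dtype=np.uint32))

# Run ModelInfer and read back an output tensor as a numpy array.
result = client.infer(model_name="mymodel", inputs=[inp])
print(result.as_numpy("output0"))
```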
The GRPC definition of the service is:
A health request is made using the ServerLive, ServerReady, or ModelReady endpoint. For each of these endpoints errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure.
The ServerLive API indicates if the inference server is able to receive and respond to metadata and inference requests. The request and response messages for ServerLive are:
The ServerReady API indicates if the server is ready for inferencing. The request and response messages for ServerReady are:
The ModelReady API indicates if a specific model is ready for inferencing. The request and response messages for ModelReady are:
The ServerMetadata API provides information about the server. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ServerMetadata are:
The per-model metadata API provides information about a model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelMetadata are:
The ModelInfer API performs inference using the specified model. Errors are indicated by the google.rpc.Status returned for the request. The OK code indicates success and other codes indicate failure. The request and response messages for ModelInfer are:
The Parameters message describes a "name"/"value" pair, where the
"name" is the name of the parameter and the "value" is a boolean,
integer, or string corresponding to the parameter.
Currently no parameters are defined. As required a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities.
In all representations tensor data must be flattened to a one-dimensional, row-major order of the tensor elements. Element values must be given in "linear" order without any stride or padding between elements.
Using a "raw" representation of tensors withModelInferRequest::raw_input_contents andModelInferResponse::raw_output_contents will typically allow higher
performance due to the way protobuf allocation and reuse interacts
with GRPC. For example, see issue here.
An alternative to the "raw" representation is to use
InferTensorContents to represent the tensor data in a format that
matches the tensor's data type.
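For example (a non-normative sketch using numpy), the raw bytes for a tensor are simply its elements serialized in flattened, row-major order:

```python
import numpy as np

tensor = np.array([[1.0, 1.1], [2.0, 2.1]], dtype=np.float32)

# Row-major ("C" order) bytes suitable for raw_input_contents.
raw_fp32 = tensor.tobytes(order="C")

# FP16 has no repeated field in InferTensorContents, so it must be sent as raw bytes.
raw_fp16 = tensor.astype(np.float16).tobytes(order="C")
```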
A platform is a string indicating a DL/ML framework or
backend. Platform is returned as part of the response to a Model Metadata request but is information only. The
proposed inference APIs are generic relative to the DL/ML framework
used by a model and so a client does not need to know the platform of
a given model to use the API. Platform names use the format "<project>_<format>". The following platform names are allowed:
tensorrt_plan: A TensorRT model encoded as a serialized engine or “plan”.
tensorflow_graphdef: A TensorFlow model encoded as a GraphDef.
tensorflow_savedmodel: A TensorFlow model encoded as a SavedModel.
onnx_onnxv1: An ONNX model encoded for ONNX Runtime.
pytorch_torchscript: A PyTorch model encoded as TorchScript.
mxnet_mxnet: An MXNet model.
caffe2_netdef: A Caffe2 model encoded as a NetDef.
Tensor data types are shown in the following table along with the size of each type, in bytes.
| Data Type | Size (bytes) |
| --------- | ------------ |
| BOOL | 1 |
| UINT8 | 1 |
| UINT16 | 2 |
| UINT32 | 4 |
| UINT64 | 8 |
| INT8 | 1 |
| INT16 | 2 |
| INT32 | 4 |
| INT64 | 8 |
| FP16 | 2 |
| FP32 | 4 |
| FP64 | 8 |
| BYTES | Variable (max 2^32) |
This document is based on the original KServe protocol specification, created during the lifetime of the KFServing project in Kubeflow by its various contributors, including Seldon, NVIDIA, IBM, Bloomberg, and others.
The setup below also illustrates the use of Kafka-specific prefixes for topics and consumer group IDs, which provide isolation when the Kafka cluster is shared with other applications and you want to enforce constraints. This is not strictly needed in this example, as Kafka is installed here just for Seldon.
GET v2/health/live
GET v2/health/ready
GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/ready
GET v2
GET v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]
POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/infer
$metadata_server_response =
{
"name" : $string,
"version" : $string,
"extensions" : [ $string, ... ]
}
$metadata_server_error_response =
{
"error": $string
}
$metadata_model_response =
{
"name" : $string,
"versions" : [ $string, ... ] #optional,
"platform" : $string,
"inputs" : [ $metadata_tensor, ... ],
"outputs" : [ $metadata_tensor, ... ]
}
$metadata_tensor =
{
"name" : $string,
"datatype" : $string,
"shape" : [ $number, ... ]
}
$metadata_model_error_response =
{
"error": $string
}
$inference_request =
{
"id" : $string #optional,
"parameters" : $parameters #optional,
"inputs" : [ $request_input, ... ],
"outputs" : [ $request_output, ... ] #optional
}
$request_input =
{
"name" : $string,
"shape" : [ $number, ... ],
"datatype" : $string,
"parameters" : $parameters #optional,
"data" : $tensor_data
}
$request_output =
{
"name" : $string,
"parameters" : $parameters #optional,
}
$inference_response =
{
"model_name" : $string,
"model_version" : $string #optional,
"id" : $string,
"parameters" : $parameters #optional,
"outputs" : [ $response_output, ... ]
}
$response_output =
{
"name" : $string,
"shape" : [ $number, ... ],
"datatype" : $string,
"parameters" : $parameters #optional,
"data" : $tensor_data
}
$inference_error_response =
{
"error": <error message string>
}
POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/json
Content-Length: <xx>
{
"id" : "42",
"inputs" : [
{
"name" : "input0",
"shape" : [ 2, 2 ],
"datatype" : "UINT32",
"data" : [ 1, 2, 3, 4 ]
},
{
"name" : "input1",
"shape" : [ 3 ],
"datatype" : "BOOL",
"data" : [ true ]
}
],
"outputs" : [
{
"name" : "output0"
}
]
}
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: <yy>
{
"id" : "42"
"outputs" : [
{
"name" : "output0",
"shape" : [ 3, 2 ],
"datatype" : "FP32",
"data" : [ 1.0, 1.1, 2.0, 2.1, 3.0, 3.1 ]
}
]
}
$parameters =
{
$parameter, ...
}
$parameter = $string : $string | $number | $boolean
[ 1 2
  4 5 ]
"data" : [ [ 1, 2 ], [ 4, 5 ] ]
"data" : [ 1, 2, 4, 5 ]
//
// Inference Server GRPC endpoints.
//
service GRPCInferenceService
{
// Check liveness of the inference server.
rpc ServerLive(ServerLiveRequest) returns (ServerLiveResponse) {}
// Check readiness of the inference server.
rpc ServerReady(ServerReadyRequest) returns (ServerReadyResponse) {}
// Check readiness of a model in the inference server.
rpc ModelReady(ModelReadyRequest) returns (ModelReadyResponse) {}
// Get server metadata.
rpc ServerMetadata(ServerMetadataRequest) returns (ServerMetadataResponse) {}
// Get model metadata.
rpc ModelMetadata(ModelMetadataRequest) returns (ModelMetadataResponse) {}
// Perform inference using a specific model.
rpc ModelInfer(ModelInferRequest) returns (ModelInferResponse) {}
}
message ServerLiveRequest {}
message ServerLiveResponse
{
// True if the inference server is live, false if not live.
bool live = 1;
}
message ServerReadyRequest {}
message ServerReadyResponse
{
// True if the inference server is ready, false if not ready.
bool ready = 1;
}
message ModelReadyRequest
{
// The name of the model to check for readiness.
string name = 1;
// The version of the model to check for readiness. If not given the
// server will choose a version based on the model and internal policy.
string version = 2;
}
message ModelReadyResponse
{
// True if the model is ready, false if not ready.
bool ready = 1;
}
message ServerMetadataRequest {}
message ServerMetadataResponse
{
// The server name.
string name = 1;
// The server version.
string version = 2;
// The extensions supported by the server.
repeated string extensions = 3;
}
message ModelMetadataRequest
{
// The name of the model.
string name = 1;
// The version of the model to check for readiness. If not given the
// server will choose a version based on the model and internal policy.
string version = 2;
}
message ModelMetadataResponse
{
// Metadata for a tensor.
message TensorMetadata
{
// The tensor name.
string name = 1;
// The tensor data type.
string datatype = 2;
// The tensor shape. A variable-size dimension is represented
// by a -1 value.
repeated int64 shape = 3;
}
// The model name.
string name = 1;
// The versions of the model available on the server.
repeated string versions = 2;
// The model's platform. See Platforms.
string platform = 3;
// The model's inputs.
repeated TensorMetadata inputs = 4;
// The model's outputs.
repeated TensorMetadata outputs = 5;
}
message ModelInferRequest
{
// An input tensor for an inference request.
message InferInputTensor
{
// The tensor name.
string name = 1;
// The tensor data type.
string datatype = 2;
// The tensor shape.
repeated int64 shape = 3;
// Optional inference input tensor parameters.
map<string, InferParameter> parameters = 4;
// The tensor contents using a data-type format. This field must
// not be specified if "raw" tensor contents are being used for
// the inference request.
InferTensorContents contents = 5;
}
// An output tensor requested for an inference request.
message InferRequestedOutputTensor
{
// The tensor name.
string name = 1;
// Optional requested output tensor parameters.
map<string, InferParameter> parameters = 2;
}
// The name of the model to use for inferencing.
string model_name = 1;
// The version of the model to use for inference. If not given the
// server will choose a version based on the model and internal policy.
string model_version = 2;
// Optional identifier for the request. If specified will be
// returned in the response.
string id = 3;
// Optional inference parameters.
map<string, InferParameter> parameters = 4;
// The input tensors for the inference.
repeated InferInputTensor inputs = 5;
// The requested output tensors for the inference. Optional, if not
// specified all outputs produced by the model will be returned.
repeated InferRequestedOutputTensor outputs = 6;
// The data contained in an input tensor can be represented in "raw"
// bytes form or in the repeated type that matches the tensor's data
// type. To use the raw representation 'raw_input_contents' must be
// initialized with data for each tensor in the same order as
// 'inputs'. For each tensor, the size of this content must match
// what is expected by the tensor's shape and data type. The raw
// data must be the flattened, one-dimensional, row-major order of
// the tensor elements without any stride or padding between the
// elements. Note that the FP16 data type must be represented as raw
// content as there is no specific data type for a 16-bit float
// type.
//
// If this field is specified then InferInputTensor::contents must
// not be specified for any input tensor.
repeated bytes raw_input_contents = 7;
}
message ModelInferResponse
{
// An output tensor returned for an inference request.
message InferOutputTensor
{
// The tensor name.
string name = 1;
// The tensor data type.
string datatype = 2;
// The tensor shape.
repeated int64 shape = 3;
// Optional output tensor parameters.
map<string, InferParameter> parameters = 4;
// The tensor contents using a data-type format. This field must
// not be specified if "raw" tensor contents are being used for
// the inference response.
InferTensorContents contents = 5;
}
// The name of the model used for inference.
string model_name = 1;
// The version of the model used for inference.
string model_version = 2;
// The id of the inference request if one was specified.
string id = 3;
// Optional inference response parameters.
map<string, InferParameter> parameters = 4;
// The output tensors holding inference results.
repeated InferOutputTensor outputs = 5;
// The data contained in an output tensor can be represented in
// "raw" bytes form or in the repeated type that matches the
// tensor's data type. To use the raw representation 'raw_output_contents'
// must be initialized with data for each tensor in the same order as
// 'outputs'. For each tensor, the size of this content must match
// what is expected by the tensor's shape and data type. The raw
// data must be the flattened, one-dimensional, row-major order of
// the tensor elements without any stride or padding between the
// elements. Note that the FP16 data type must be represented as raw
// content as there is no specific data type for a 16-bit float
// type.
//
// If this field is specified then InferOutputTensor::contents must
// not be specified for any output tensor.
repeated bytes raw_output_contents = 6;
}
//
// An inference parameter value.
//
message InferParameter
{
// The parameter value can be a string, an int64, a boolean
// or a message specific to a predefined parameter.
oneof parameter_choice
{
// A boolean parameter value.
bool bool_param = 1;
// An int64 parameter value.
int64 int64_param = 2;
// A string parameter value.
string string_param = 3;
}
}
//
// The data contained in a tensor represented by the repeated type
// that matches the tensor's data type. Protobuf oneof is not used
// because oneofs cannot contain repeated fields.
//
message InferTensorContents
{
// Representation for BOOL data type. The size must match what is
// expected by the tensor's shape. The contents must be the flattened,
// one-dimensional, row-major order of the tensor elements.
repeated bool bool_contents = 1;
// Representation for INT8, INT16, and INT32 data types. The size
// must match what is expected by the tensor's shape. The contents
// must be the flattened, one-dimensional, row-major order of the
// tensor elements.
repeated int32 int_contents = 2;
// Representation for INT64 data types. The size must match what
// is expected by the tensor's shape. The contents must be the
// flattened, one-dimensional, row-major order of the tensor elements.
repeated int64 int64_contents = 3;
// Representation for UINT8, UINT16, and UINT32 data types. The size
// must match what is expected by the tensor's shape. The contents
// must be the flattened, one-dimensional, row-major order of the
// tensor elements.
repeated uint32 uint_contents = 4;
// Representation for UINT64 data types. The size must match what
// is expected by the tensor's shape. The contents must be the
// flattened, one-dimensional, row-major order of the tensor elements.
repeated uint64 uint64_contents = 5;
// Representation for FP32 data type. The size must match what is
// expected by the tensor's shape. The contents must be the flattened,
// one-dimensional, row-major order of the tensor elements.
repeated float fp32_contents = 6;
// Representation for FP64 data type. The size must match what is
// expected by the tensor's shape. The contents must be the flattened,
// one-dimensional, row-major order of the tensor elements.
repeated double fp64_contents = 7;
// Representation for BYTES data type. The size must match what is
// expected by the tensor's shape. The contents must be the flattened,
// one-dimensional, row-major order of the tensor elements.
repeated bytes bytes_contents = 8;
}
If you have installed Kafka via the ansible playbook setup-ecosystem, then you can use the following command to see the consumer group ids, which reflect the settings we created.
We can similarly show the topics that have been created.
curl -k http://${MESH_IP_NS1}:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: iris" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "predict",
"datatype": "FP32",
"shape": [1,4],
"data": [[1,2,3,4]]
}
]
}' | jq -M .
seldon model infer iris --inference-host ${MESH_IP_NS1}:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
curl -k http://${MESH_IP_NS1}:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: iris" \
-H "Content-Type: application/json" \
-d '{
"model_name": "iris",
"inputs": [
{
"name": "input",
"datatype": "FP32",
"shape": [1,4],
"data": [1,2,3,4]
}
]
}' | jq -M .
seldon model infer iris --inference-mode grpc --inference-host ${MESH_IP_NS1}:80 \
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .
curl -k http://${MESH_IP_NS2}:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: iris" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "predict",
"datatype": "FP32",
"shape": [1,4],
"data": [[1,2,3,4]]
}
]
}' | jq -M .
seldon model infer iris --inference-host ${MESH_IP_NS2}:80 \
'{"inputs": [{"name": "predict", "shape": [1, 4], "datatype": "FP32", "data": [[1, 2, 3, 4]]}]}'
curl -k http://${MESH_IP_NS2}:80/v2/models/iris/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: iris" \
-H "Content-Type: application/json" \
-d '{
"model_name": "iris",
"inputs": [
{
"name": "input",
"datatype": "FP32",
"shape": [1,4],
"data": [1,2,3,4]
}
]
}' | jq -M .
seldon model infer iris --inference-mode grpc --inference-host ${MESH_IP_NS2}:80 \
'{"model_name":"iris","inputs":[{"name":"input","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[1,4]}]}' | jq -M .curl -k http://${MESH_IP_NS1}:80/v2/models/tfsimples/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: tfsimples.pipeline" \
-H "Content-Type: application/json" \
-d '{
"model_name": "simple",
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
seldon pipeline infer tfsimples --inference-mode grpc --inference-host ${MESH_IP_NS1}:80 \
'{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .curl -k http://${MESH_IP_NS2}:80/v2/models/tfsimples/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: tfsimples.pipeline" \
-H "Content-Type: application/json" \
-d '{
"model_name": "simple",
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
seldon pipeline infer tfsimples --inference-mode grpc --inference-host ${MESH_IP_NS2}:80 \
'{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
helm upgrade --install seldon-core-v2-crds ../k8s/helm-charts/seldon-core-v2-crds -n seldon-mesh
Release "seldon-core-v2-crds" does not exist. Installing it now.
NAME: seldon-core-v2-crds
LAST DEPLOYED: Tue Aug 15 11:01:03 2023
NAMESPACE: seldon-mesh
STATUS: deployed
REVISION: 1
TEST SUITE: None
helm upgrade --install seldon-v2 ../k8s/helm-charts/seldon-core-v2-setup/ -n seldon-mesh \
--set controller.clusterwide=true \
--set kafka.topicPrefix=myorg \
--set kafka.consumerGroupIdPrefix=myorg
Release "seldon-v2" does not exist. Installing it now.
NAME: seldon-v2
LAST DEPLOYED: Tue Aug 15 11:01:07 2023
NAMESPACE: seldon-mesh
STATUS: deployed
REVISION: 1
TEST SUITE: None
kubectl create namespace ns1
kubectl create namespace ns2
namespace/ns1 created
namespace/ns2 created
helm install seldon-v2-runtime ../k8s/helm-charts/seldon-core-v2-runtime -n ns1 --wait
NAME: seldon-v2-runtime
LAST DEPLOYED: Tue Aug 15 11:01:11 2023
NAMESPACE: ns1
STATUS: deployed
REVISION: 1
TEST SUITE: None
helm install seldon-v2-servers ../k8s/helm-charts/seldon-core-v2-servers -n ns1 --wait
NAME: seldon-v2-servers
LAST DEPLOYED: Tue Aug 15 10:47:31 2023
NAMESPACE: ns1
STATUS: deployed
REVISION: 1
TEST SUITE: None
helm install seldon-v2-runtime ../k8s/helm-charts/seldon-core-v2-runtime -n ns2 --wait
NAME: seldon-v2-runtime
LAST DEPLOYED: Tue Aug 15 10:53:12 2023
NAMESPACE: ns2
STATUS: deployed
REVISION: 1
TEST SUITE: None
helm install seldon-v2-servers ../k8s/helm-charts/seldon-core-v2-servers -n ns2 --wait
NAME: seldon-v2-servers
LAST DEPLOYED: Tue Aug 15 10:53:28 2023
NAMESPACE: ns2
STATUS: deployed
REVISION: 1
TEST SUITE: None
kubectl wait --for condition=ready --timeout=300s server --all -n ns1
server.mlops.seldon.io/mlserver condition met
server.mlops.seldon.io/triton condition met
kubectl wait --for condition=ready --timeout=300s server --all -n ns2
server.mlops.seldon.io/mlserver condition met
server.mlops.seldon.io/triton condition met
MESH_IP=!kubectl get svc seldon-mesh -n ns1 -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP_NS1=MESH_IP[0]
import os
os.environ['MESH_IP_NS1'] = MESH_IP_NS1
MESH_IP_NS1
'172.18.255.2'
MESH_IP=!kubectl get svc seldon-mesh -n ns2 -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
MESH_IP_NS2=MESH_IP[0]
import os
os.environ['MESH_IP_NS2'] = MESH_IP_NS2
MESH_IP_NS2
'172.18.255.4'
cat ./models/sklearn-iris-gs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: iris
spec:
storageUri: "gs://seldon-models/scv2/samples/mlserver_1.3.5/iris-sklearn"
requirements:
- sklearn
memory: 100Ki
kubectl create -f ./models/sklearn-iris-gs.yaml -n ns1
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model --all -n ns1
model.mlops.seldon.io/iris condition met
{
"model_name": "iris_1",
"model_version": "1",
"id": "3ca1757c-df02-4e57-87c1-38311bcc5943",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
{
"modelName": "iris_1",
"modelVersion": "1",
"outputs": [
{
"name": "predict",
"datatype": "INT64",
"shape": [
"1",
"1"
],
"parameters": {
"content_type": {
"stringParam": "np"
}
},
"contents": {
"int64Contents": [
"2"
]
}
}
]
}
kubectl create -f ./models/sklearn-iris-gs.yaml -n ns2
model.mlops.seldon.io/iris created
kubectl wait --for condition=ready --timeout=300s model --all -n ns2
model.mlops.seldon.io/iris condition met
{
"model_name": "iris_1",
"model_version": "1",
"id": "f706a23e-775f-4765-bd18-2e98d83bf7d5",
"parameters": {},
"outputs": [
{
"name": "predict",
"shape": [
1,
1
],
"datatype": "INT64",
"parameters": {
"content_type": "np"
},
"data": [
2
]
}
]
}
{
"modelName": "iris_1",
"modelVersion": "1",
"outputs": [
{
"name": "predict",
"datatype": "INT64",
"shape": [
"1",
"1"
],
"parameters": {
"content_type": {
"stringParam": "np"
}
},
"contents": {
"int64Contents": [
"2"
]
}
}
]
}
kubectl delete -f ./models/sklearn-iris-gs.yaml -n ns1
kubectl delete -f ./models/sklearn-iris-gs.yaml -n ns2
model.mlops.seldon.io "iris" deleted
model.mlops.seldon.io "iris" deleted
kubectl create -f ./models/tfsimple1.yaml -n ns1
kubectl create -f ./models/tfsimple2.yaml -n ns1
kubectl create -f ./models/tfsimple1.yaml -n ns2
kubectl create -f ./models/tfsimple2.yaml -n ns2
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n ns1
kubectl wait --for condition=ready --timeout=300s model --all -n ns2
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
kubectl create -f ./pipelines/tfsimples.yaml -n ns1
kubectl create -f ./pipelines/tfsimples.yaml -n ns2
pipeline.mlops.seldon.io/tfsimples created
pipeline.mlops.seldon.io/tfsimples created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n ns1
kubectl wait --for condition=ready --timeout=300s pipeline --all -n ns2
pipeline.mlops.seldon.io/tfsimples condition met
pipeline.mlops.seldon.io/tfsimples condition met
{
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
}
]
}
{
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
}
]
}
kubectl exec seldon-kafka-0 -n seldon-mesh -- bin/kafka-consumer-groups.sh --list --bootstrap-server localhost:9092
myorg-ns2-seldon-pipelinegateway-dfd61b49-4bb9-4684-adce-0b7cc215d3af
myorg-ns2-seldon-modelgateway-17
myorg-ns1-seldon-pipelinegateway-d4fc83e6-29cb-442e-90cd-92a389961cfe
myorg-ns2-seldon-modelgateway-60
myorg-ns2-seldon-dataflow-73d465744b7b1b5be20e88d6245e50bd
myorg-ns1-seldon-modelgateway-60
myorg-ns1-seldon-modelgateway-17
myorg-ns1-seldon-dataflow-f563e04e093caa20c03e6eced084331b
kubectl exec seldon-kafka-0 -n seldon-mesh -- bin/kafka-topics.sh --bootstrap-server=localhost:9092 --list
__consumer_offsets
myorg.ns1.errors.errors
myorg.ns1.model.iris.inputs
myorg.ns1.model.iris.outputs
myorg.ns1.model.tfsimple1.inputs
myorg.ns1.model.tfsimple1.outputs
myorg.ns1.model.tfsimple2.inputs
myorg.ns1.model.tfsimple2.outputs
myorg.ns1.pipeline.tfsimples.inputs
myorg.ns1.pipeline.tfsimples.outputs
myorg.ns2.errors.errors
myorg.ns2.model.iris.inputs
myorg.ns2.model.iris.outputs
myorg.ns2.model.tfsimple1.inputs
myorg.ns2.model.tfsimple1.outputs
myorg.ns2.model.tfsimple2.inputs
myorg.ns2.model.tfsimple2.outputs
myorg.ns2.pipeline.tfsimples.inputs
myorg.ns2.pipeline.tfsimples.outputs
kubectl delete -f ./pipelines/tfsimples.yaml -n ns1
kubectl delete -f ./pipelines/tfsimples.yaml -n ns2
pipeline.mlops.seldon.io "tfsimples" deleted
pipeline.mlops.seldon.io "tfsimples" deleted
kubectl delete -f ./models/tfsimple1.yaml -n ns1
kubectl delete -f ./models/tfsimple2.yaml -n ns1
kubectl delete -f ./models/tfsimple1.yaml -n ns2
kubectl delete -f ./models/tfsimple2.yaml -n ns2
model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
helm delete seldon-v2-servers -n ns1 --wait
helm delete seldon-v2-servers -n ns2 --wait
release "seldon-v2-servers" uninstalled
release "seldon-v2-servers" uninstalled
helm delete seldon-v2-runtime -n ns1 --wait
helm delete seldon-v2-runtime -n ns2 --wait
release "seldon-v2-runtime" uninstalled
release "seldon-v2-runtime" uninstalled
helm delete seldon-v2 -n seldon-mesh --wait
release "seldon-v2" uninstalled
helm delete seldon-core-v2-crds -n seldon-mesh
release "seldon-core-v2-crds" uninstalled
kubectl delete namespace ns1
kubectl delete namespace ns2
namespace "ns1" deleted
namespace "ns2" deleted


These examples illustrate a series of Pipelines showing different ways of combining flows of data and conditional logic. We assume you have Seldon Core 2 running locally.
Ensure that you have installed Seldon Core 2 in the namespace seldon-mesh.
Ensure that you are performing these steps in the directory where you have downloaded the .
Get the IP address of the Seldon Core 2 instance running with Istio:
gs://seldon-models/triton/simple is an example Triton TensorFlow model that takes two inputs, INPUT0
and INPUT1, and adds them to produce OUTPUT0; it also subtracts INPUT1 from INPUT0 to produce OUTPUT1.
See
for the original source code and license.
Other models can be found at https://github.com/SeldonIO/triton-python-examples
Chain the output of one model into the next. Also shows changing the tensor names via tensorMap to conform to the expected input tensor names of the second model.
This pipeline chains the output of tfsimple1 into tfsimple2. As these models have compatible shape and data type this can be done. However, the output tensor names from tfsimple1 need to be renamed to match the input tensor names for tfsimple2. You can do this with the tensorMap feature.
The output of the Pipeline is the output from tfsimple2.
You can use the Seldon CLI pipeline inspect feature to look at the data for all steps of the pipeline for the last data item passed through the pipeline (the default). This can be useful for debugging.
Next, get the output as JSON and use the jq tool to extract just one value.
Chain the output of one model into the next. Shows using the input and outputs and combining.
Join two flows of data from two models as input to a third model. This shows how individual flows of data can be combined.
In this pipeline, the input to tfsimple3 is a join of one output tensor from each of the two previous models, tfsimple1 and tfsimple2. You need to use the tensorMap feature to rename each output tensor to one of the expected input tensors of the tfsimple3 model.
The outputs are the sequence "2,4,6..." which conforms to the logic of this model (addition and subtraction) when fed the output of the first two models.
Shows conditional data flows - one of two models is run based on output tensors from first.
Here we assume the conditional model can output two tensors, OUTPUT0 and OUTPUT1, but only outputs OUTPUT0 if the CHOICE input tensor is set to 0; otherwise it outputs OUTPUT1. By this means only one of the two downstream models will receive data and run. The output step does an any join from both models, and whichever data appears first will be sent as the pipeline output. As only one of the two models add10 and mul10 runs in this case, we will receive its output. A sketch of what such a conditional model might look like is shown after the notes on the two cases below.
The mul10 model runs as the CHOICE tensor is set to 0.
The add10 model will run as the CHOICE tensor is not set to zero.
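The implementation of the conditional model is not shown on this page, but as a rough sketch (an assumption for illustration, not the model shipped in the example artifacts), a Triton Python-backend model with this behaviour could look like the following, emitting OUTPUT0 when CHOICE is 0 and OUTPUT1 otherwise:

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            choice = pb_utils.get_input_tensor_by_name(request, "CHOICE").as_numpy()
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()
            input1 = pb_utils.get_input_tensor_by_name(request, "INPUT1").as_numpy()
            if choice[0] == 0:
                # Only OUTPUT0 is produced, so only the mul10 step receives data.
                outputs = [pb_utils.Tensor("OUTPUT0", input0)]
            else:
                # Only OUTPUT1 is produced, so only the add10 step receives data.
                outputs = [pb_utils.Tensor("OUTPUT1", input1)]
            responses.append(pb_utils.InferenceResponse(output_tensors=outputs))
        return responses
```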
Access to individual tensors in pipeline inputs
This pipeline shows how we can access pipeline inputs INPUT0 and INPUT1 from different steps.
Shows how joins can be used for triggers as well.
Here we require a tensor named ok1 or ok2 to exist on the pipeline inputs to run the mul10 model, but require the tensor ok3 to exist on the pipeline inputs to run the add10 model. The logic on mul10 is handled by a trigger join of any, meaning either of those inputs can exist to satisfy the trigger join.
Trigger the first join.
Now, you can trigger the second join.
ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Seldon Core 2: http://$ISTIO_INGRESS"cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimples.yaml
kubectl create -f ./pipelines/tfsimples.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimples created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimples condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimples/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimples.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' |jq
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimples --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
}
]
}
seldon pipeline inspect tfsimples
seldon.default.model.tfsimple1.inputs ciep298fh5ss73dpdir0 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.model.tfsimple1.outputs ciep298fh5ss73dpdir0 {"modelName":"tfsimple1_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon.default.model.tfsimple2.inputs ciep298fh5ss73dpdir0 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.model.tfsimple2.outputs ciep298fh5ss73dpdir0 {"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon.default.pipeline.tfsimples.inputs ciep298fh5ss73dpdir0 {"modelName":"tfsimples", "inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.pipeline.tfsimples.outputs ciep298fh5ss73dpdir0 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon pipeline inspect tfsimples --format json | jq -M .topics[0].msgs[0].value
{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16
]
}
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16
]
}
}
]
}
kubectl delete -f ./pipelines/tfsimples.yaml -n seldon-mesh
pipeline.mlops.seldon.io "tfsimples" deleted
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimples-input.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimples-input
spec:
steps:
- name: tfsimple1
- name: tfsimple2
inputs:
- tfsimple1.inputs.INPUT0
- tfsimple1.outputs.OUTPUT1
tensorMap:
tfsimple1.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple2
kubectl create -f ./pipelines/tfsimples-input.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimples-input created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimples-input condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimples-input/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimples-input.pipeline" \
-d '{
"model_name": "simple",
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimples-input \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
1,
2,
3,
4,
5,
6,
7,
8,
9,
10,
11,
12,
13,
14,
15,
16
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
kubectl delete -f ./pipelines/tfsimples-input.yaml -n seldon-mesh
pipeline.mlops.seldon.io "tfsimples-input" deleted
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
cat ./models/tfsimple1.yaml
echo "---"
cat ./models/tfsimple2.yaml
echo "---"
cat ./models/tfsimple3.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple3
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple3.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
model.mlops.seldon.io/tfsimple3 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
model.mlops.seldon.io/tfsimple3 condition met
cat ./pipelines/tfsimples-join.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: join
spec:
steps:
- name: tfsimple1
- name: tfsimple2
- name: tfsimple3
inputs:
- tfsimple1.outputs.OUTPUT0
- tfsimple2.outputs.OUTPUT1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple2.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple3
kubectl create -f ./pipelines/tfsimples-join.yaml -n seldon-mesh
pipeline.mlops.seldon.io/join created
curl -k http://<INGRESS_IP>:80/v2/models/join/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: join.pipeline" \
-d '{
"model_name": "simple",
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1, 16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer join --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"simple","inputs":[{"name":"INPUT0","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","contents":{"int_contents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]},"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"outputs": [
{
"name": "OUTPUT0",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
},
{
"name": "OUTPUT1",
"datatype": "INT32",
"shape": [
"1",
"16"
],
"contents": {
"intContents": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
]
}
}
]
}
kubectl delete -f ./pipelines/tfsimples-join.yaml -n seldon-mesh
pipeline.mlops.seldon.io "join" deleted
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple3.yaml -n seldon-mesh
model.mlops.seldon.io "tfsimple1" deleted
model.mlops.seldon.io "tfsimple2" deleted
model.mlops.seldon.io "tfsimple3" deleted
cat ./models/conditional.yaml
echo "---"
cat ./models/add10.yaml
echo "---"
cat ./models/mul10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: conditional
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/conditional"
requirements:
- triton
- python
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: add10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
requirements:
- triton
- python
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mul10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
requirements:
- triton
- python
kubectl create -f ./models/conditional.yaml -n seldon-mesh
kubectl create -f ./models/add10.yaml -n seldon-mesh
kubectl create -f ./models/mul10.yaml -n seldon-mesh
model.mlops.seldon.io/conditional created
model.mlops.seldon.io/add10 created
model.mlops.seldon.io/mul10 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/conditional condition met
model.mlops.seldon.io/add10 condition met
model.mlops.seldon.io/mul10 condition met
cat ./pipelines/conditional.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-conditional
spec:
steps:
- name: conditional
- name: mul10
inputs:
- conditional.outputs.OUTPUT0
tensorMap:
conditional.outputs.OUTPUT0: INPUT
- name: add10
inputs:
- conditional.outputs.OUTPUT1
tensorMap:
conditional.outputs.OUTPUT1: INPUT
output:
steps:
- mul10
- add10
stepsJoin: any
kubectl create -f ./pipelines/conditional.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-conditional created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-conditional condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple-conditional/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple-conditional.pipeline" \
-d '{
"model_name": "conditional",
"inputs": [
{
"name": "CHOICE",
"datatype": "INT32",
"shape": [1],
"data": [0]
},
{
"name": "INPUT0",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
},
{
"name": "INPUT1",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
10,
20,
30,
40
],
"name": "OUTPUT",
"shape": [
4
],
"datatype": "FP32"
}
]
}
seldon pipeline infer tfsimple-conditional --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"conditional","inputs":[{"name":"CHOICE","contents":{"int_contents":[0]},"datatype":"INT32","shape":[1]},{"name":"INPUT0","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]},{"name":"INPUT1","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
10,
20,
30,
40
]
}
}
]
}
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple-conditional/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple-conditional.pipeline" \
-d '{
"model_name": "conditional",
"inputs": [
{
"name": "CHOICE",
"datatype": "INT32",
"shape": [1],
"data": [1]
},
{
"name": "INPUT0",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
},
{
"name": "INPUT1",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
11,
12,
13,
14
],
"name": "OUTPUT",
"shape": [
4
],
"datatype": "FP32"
}
]
}
seldon pipeline infer tfsimple-conditional --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"conditional","inputs":[{"name":"CHOICE","contents":{"int_contents":[1]},"datatype":"INT32","shape":[1]},{"name":"INPUT0","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]},{"name":"INPUT1","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
11,
12,
13,
14
]
}
}
]
}
kubectl delete -f ./pipelines/conditional.yaml -n seldon-mesh
kubectl delete -f ./models/conditional.yaml -n seldon-mesh
kubectl delete -f ./models/add10.yaml -n seldon-mesh
kubectl delete -f ./models/mul10.yaml -n seldon-mesh
cat ./models/mul10.yaml
echo "---"
cat ./models/add10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mul10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
requirements:
- triton
- python
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: add10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
requirements:
- triton
- python
kubectl create -f ./models/mul10.yaml -n seldon-mesh
kubectl create -f ./models/add10.yaml -n seldon-mesh
model.mlops.seldon.io/mul10 created
model.mlops.seldon.io/add10 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/mul10 condition met
model.mlops.seldon.io/add10 condition met
cat ./pipelines/pipeline-inputs.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: pipeline-inputs
spec:
steps:
- name: mul10
inputs:
- pipeline-inputs.inputs.INPUT0
tensorMap:
pipeline-inputs.inputs.INPUT0: INPUT
- name: add10
inputs:
- pipeline-inputs.inputs.INPUT1
tensorMap:
pipeline-inputs.inputs.INPUT1: INPUT
output:
steps:
- mul10
- add10
kubectl create -f ./pipelines/pipeline-inputs.yaml -n seldon-mesh
pipeline.mlops.seldon.io/pipeline-inputs created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/pipeline-inputs condition met
curl -k http://<INGRESS_IP>:80/v2/models/pipeline-inputs/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: pipeline-inputs.pipeline" \
-d '{
"model_name": "pipeline",
"inputs": [
{
"name": "INPUT0",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
},
{
"name": "INPUT1",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
10,
20,
30,
40
],
"name": "OUTPUT",
"shape": [
4
],
"datatype": "FP32"
},
{
"data": [
11,
12,
13,
14
],
"name": "OUTPUT",
"shape": [
4
],
"datatype": "FP32"
}
]
}
seldon pipeline infer pipeline-inputs --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"pipeline","inputs":[{"name":"INPUT0","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]},{"name":"INPUT1","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .
{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
10,
20,
30,
40
]
}
},
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
11,
12,
13,
14
]
}
}
]
}
kubectl delete -f ./pipelines/pipeline-inputs.yaml -n seldon-mesh
kubectl delete -f ./models/mul10.yaml -n seldon-mesh
kubectl delete -f ./models/add10.yaml -n seldon-mesh
cat ./models/mul10.yaml
cat ./models/add10.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: mul10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/mul10"
requirements:
- triton
- python
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: add10
spec:
storageUri: "gs://seldon-models/scv2/samples/triton_23-03/add10"
requirements:
- triton
- python
kubectl create -f ./models/mul10.yaml -n seldon-mesh
kubectl create -f ./models/add10.yaml -n seldon-mesh
model.mlops.seldon.io/mul10 created
model.mlops.seldon.io/add10 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/mul10 condition met
model.mlops.seldon.io/add10 condition met
cat ./pipelines/trigger-joins.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: trigger-joins
spec:
steps:
- name: mul10
inputs:
- trigger-joins.inputs.INPUT
triggers:
- trigger-joins.inputs.ok1
- trigger-joins.inputs.ok2
triggersJoinType: any
- name: add10
inputs:
- trigger-joins.inputs.INPUT
triggers:
- trigger-joins.inputs.ok3
output:
steps:
- mul10
- add10
stepsJoin: any
kubectl create -f ./pipelines/trigger-joins.yaml -n seldon-mesh
pipeline.mlops.seldon.io/trigger-joins created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/trigger-joins condition met
curl -k http://<INGRESS_IP>:80/v2/models/trigger-joins/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: trigger-joins.pipeline" \
-d '{
"model_name": "pipeline",
"inputs": [
{
"name": "ok1",
"datatype": "FP32",
"shape": [1],
"data": [1]
},
{
"name": "INPUT",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
10,
20,
30,
40
],
"name": "OUTPUT",
"shape": [
4
],
"datatype": "FP32"
}
]
}
seldon pipeline infer trigger-joins --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"pipeline","inputs":[{"name":"ok1","contents":{"fp32_contents":[1]},"datatype":"FP32","shape":[1]},{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .
{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
10,
20,
30,
40
]
}
}
]
}
curl -k http://<INGRESS_IP>:80/v2/models/trigger-joins/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: trigger-joins.pipeline" \
-d '{
"model_name": "pipeline",
"inputs": [
{
"name": "ok3",
"datatype": "FP32",
"shape": [1],
"data": [1]
},
{
"name": "INPUT",
"datatype": "FP32",
"shape": [4],
"data": [1,2,3,4]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
11,
12,
13,
14
],
"name": "OUTPUT",
"shape": [
4
],
"datatype": "FP32"
}
]
}
seldon pipeline infer trigger-joins --inference-mode grpc --inference-host <INGRESS_IP>:80 \
'{"model_name":"pipeline","inputs":[{"name":"ok3","contents":{"fp32_contents":[1]},"datatype":"FP32","shape":[1]},{"name":"INPUT","contents":{"fp32_contents":[1,2,3,4]},"datatype":"FP32","shape":[4]}]}' | jq -M .{
"outputs": [
{
"name": "OUTPUT",
"datatype": "FP32",
"shape": [
"4"
],
"contents": {
"fp32Contents": [
11,
12,
13,
14
]
}
}
]
}
kubectl delete -f ./pipelines/trigger-joins.yaml -n seldon-mesh
pipeline.mlops.seldon.io "trigger-joins" deleted
kubectl delete -f ./models/mul10.yaml -n seldon-mesh
kubectl delete -f ./models/add10.yaml -n seldon-mesh
model.mlops.seldon.io "mul10" deleted
model.mlops.seldon.io "add10" deleted
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimples
spec:
steps:
- name: tfsimple1
- name: tfsimple2
inputs:
- tfsimple1
tensorMap:
tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple1.outputs.OUTPUT1: INPUT1
output:
steps:
- tfsimple2
This example illustrates a series of Pipelines that are joined together.
Ensure that you have installed Seldon Core 2 in the namespace seldon-mesh.
Ensure that you are performing these steps in the directory where you have downloaded the .
Get the IP address of the Seldon Core 2 instance running with Istio:
gs://seldon-models/triton/simple is an example Triton TensorFlow model that takes two
inputs, INPUT0 and INPUT1, and adds them to produce OUTPUT0; it also subtracts INPUT1
from INPUT0 to produce OUTPUT1. See
for the original source code and license.
Other models can be found at https://github.com/SeldonIO/triton-python-examples
ISTIO_INGRESS=$(kubectl get svc seldon-mesh -n seldon-mesh -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "Seldon Core 2: http://$ISTIO_INGRESS"cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yamlapiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
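Since the simple Triton model adds and subtracts its two inputs, the pipeline response can be checked programmatically. The following is a minimal sketch, assuming the tfsimple pipeline above is deployed, the Python requests package is installed, and INGRESS_IP is a placeholder for the ingress address used in the curl commands.
import requests

INGRESS_IP = "<INGRESS_IP>"  # placeholder, as in the curl commands above

inputs = list(range(1, 17))
payload = {
    "inputs": [
        {"name": "INPUT0", "datatype": "INT32", "shape": [1, 16], "data": inputs},
        {"name": "INPUT1", "datatype": "INT32", "shape": [1, 16], "data": inputs},
    ]
}
resp = requests.post(
    f"http://{INGRESS_IP}:80/v2/models/tfsimple/infer",
    json=payload,
    headers={
        "Host": "seldon-mesh.inference.seldon",
        "Seldon-Model": "tfsimple.pipeline",
    },
)
resp.raise_for_status()
outputs = {o["name"]: o["data"] for o in resp.json()["outputs"]}

# OUTPUT0 is the element-wise sum and OUTPUT1 the element-wise difference,
# so with identical inputs we expect 2x and all zeros, as in the output above.
assert outputs["OUTPUT0"] == [2 * v for v in inputs]
assert outputs["OUTPUT1"] == [0] * 16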
cat ./pipelines/tfsimple-extended.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
kubectl create -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: tfsimple.pipeline" \
-H "x-request-id: test-id" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --header x-request-id=test-id --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline inspect tfsimple
seldon.default.model.tfsimple1.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.model.tfsimple1.outputs test-id {"modelName":"tfsimple1_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon.default.pipeline.tfsimple.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.pipeline.tfsimple.outputs test-id {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon pipeline inspect tfsimple-extended
seldon.default.model.tfsimple2.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.model.tfsimple2.outputs test-id {"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon.default.pipeline.tfsimple-extended.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended.outputs test-id {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
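The records shown by seldon pipeline inspect are read from the Kafka topics that Core 2 streams pipeline data through, so the same data can be watched with an ordinary Kafka consumer. Below is a minimal sketch, assuming the confluent-kafka Python client is installed and that the bootstrap address (a placeholder) points at the cluster's Kafka brokers; the topic name is taken from the inspect output above, and the record values are the raw inference payloads (which may be binary rather than JSON).
from confluent_kafka import Consumer

# Placeholder bootstrap address; point this at your cluster's Kafka brokers.
consumer = Consumer({
    "bootstrap.servers": "<KAFKA_BOOTSTRAP>:9092",
    "group.id": "pipeline-tap",
    "auto.offset.reset": "earliest",
})
# Topic name as shown in the inspect output above.
consumer.subscribe(["seldon.default.pipeline.tfsimple-extended.outputs"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print("consumer error:", msg.error())
            continue
        # Key carries the request id; value is the raw payload for this pipeline run.
        print(msg.key(), msg.value()[:120])
finally:
    consumer.close()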
kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
cat ./pipelines/tfsimple-extended.yaml
echo "---"
cat ./pipelines/tfsimple-extended2.yaml
echo "---"
cat ./pipelines/tfsimple-combined.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended2
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-combined
spec:
input:
externalInputs:
- tfsimple-extended.outputs.OUTPUT0
- tfsimple-extended2.outputs.OUTPUT1
tensorMap:
tfsimple-extended.outputs.OUTPUT0: INPUT0
tfsimple-extended2.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
kubectl create -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
kubectl create -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
kubectl create -f ./pipelines/tfsimple-combined.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended created
pipeline.mlops.seldon.io/tfsimple-extended2 created
pipeline.mlops.seldon.io/tfsimple-combined created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended condition met
pipeline.mlops.seldon.io/tfsimple-extended2 condition met
pipeline.mlops.seldon.io/tfsimple-combined condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: tfsimple.pipeline" \
-H "x-request-id: test-id2" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --header x-request-id=test-id2 --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline inspect tfsimple
seldon.default.model.tfsimple1.inputs test-id2 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.model.tfsimple1.outputs test-id2 {"modelName":"tfsimple1_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon.default.pipeline.tfsimple.inputs test-id2 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.pipeline.tfsimple.outputs test-id2 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon pipeline inspect tfsimple-extended --offset 2 --verbose
seldon.default.model.tfsimple2.inputs test-id2 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]} x-request-id=[test-id2] x-forwarded-proto=[http] x-seldon-route=[:tfsimple1_1:] x-envoy-upstream-service-time=[1] pipeline=[tfsimple-extended] traceparent=[00-e438b82ad361ac2d5481bcfc494074d2-e468d06afdab8f52-01] x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[test-id] x-forwarded-proto=[http] x-envoy-upstream-service-time=[5] x-seldon-route=[:tfsimple1_1:]
seldon.default.model.tfsimple2.outputs test-id2 {"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]} x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[test-id2] x-forwarded-proto=[http] x-seldon-route=[:tfsimple1_1: :tfsimple2_1:] x-envoy-upstream-service-time=[1] pipeline=[tfsimple-extended] traceparent=[00-e438b82ad361ac2d5481bcfc494074d2-73bd1ee54a94d8fb-01]
seldon.default.pipeline.tfsimple-extended.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]} pipeline=[tfsimple-extended] traceparent=[00-3a6047efa647efc2b3fc5266ae023d23-fee12926788ce3b6-01] x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[test-id] x-forwarded-proto=[http] x-envoy-upstream-service-time=[5] x-seldon-route=[:tfsimple1_1:]
seldon.default.pipeline.tfsimple-extended.inputs test-id2 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]} x-forwarded-proto=[http] x-seldon-route=[:tfsimple1_1:] x-envoy-upstream-service-time=[1] pipeline=[tfsimple-extended] traceparent=[00-e438b82ad361ac2d5481bcfc494074d2-4df8459a992e0278-01] x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[test-id2]
seldon.default.pipeline.tfsimple-extended.outputs test-id {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]} pipeline=[tfsimple-extended] traceparent=[00-3a6047efa647efc2b3fc5266ae023d23-b2f899a739c5cafd-01] x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[test-id] x-forwarded-proto=[http] x-envoy-upstream-service-time=[5] x-seldon-route=[:tfsimple1_1: :tfsimple2_1:]
seldon.default.pipeline.tfsimple-extended.outputs test-id2 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]} x-envoy-upstream-service-time=[1] pipeline=[tfsimple-extended] traceparent=[00-e438b82ad361ac2d5481bcfc494074d2-dfa399143feec23d-01] x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[test-id2] x-forwarded-proto=[http] x-seldon-route=[:tfsimple1_1: :tfsimple2_1:]
seldon pipeline inspect tfsimple-extended2 --offset 2
seldon.default.pipeline.tfsimple-extended2.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended2.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended2.outputs test-id3 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon.default.pipeline.tfsimple-extended2.outputs test-id {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon pipeline inspect tfsimple-combined
seldon.default.model.tfsimple2.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.model.tfsimple2.outputs test-id {"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon.default.pipeline.tfsimple-combined.inputs test-id {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.pipeline.tfsimple-combined.outputs test-id {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
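The numbers in the tfsimple-combined output follow directly from composing the simple model through the three pipelines: tfsimple produces the sum (2x) and the difference (0), each extended pipeline feeds those into tfsimple2, and tfsimple-combined joins OUTPUT0 from tfsimple-extended with OUTPUT1 from tfsimple-extended2. The sketch below models only that arithmetic in plain Python (it is not Core 2 code) to show why the final tensors are 4, 8, ..., 64 and all zeros.
def simple(input0, input1):
    # Add/subtract behaviour of the gs://seldon-models/triton/simple model.
    output0 = [a + b for a, b in zip(input0, input1)]
    output1 = [a - b for a, b in zip(input0, input1)]
    return output0, output1

x = list(range(1, 17))

# Pipeline tfsimple: a single tfsimple1 step.
t_out0, t_out1 = simple(x, x)                        # 2x, 0

# Pipelines tfsimple-extended / tfsimple-extended2: tfsimple2 on tfsimple's outputs.
ext_out0, ext_out1 = simple(t_out0, t_out1)          # 2x, 2x
ext2_out0, ext2_out1 = simple(t_out0, t_out1)        # 2x, 2x

# Pipeline tfsimple-combined: OUTPUT0 from extended joined with OUTPUT1 from extended2.
comb_out0, comb_out1 = simple(ext_out0, ext2_out1)   # 4x, 0

assert comb_out0 == [4 * v for v in x]   # 4, 8, ..., 64 as in the inspect output above
assert comb_out1 == [0] * 16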
kubectl delete -f ./pipelines/tfsimple-combined.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
cat ./pipelines/tfsimple-extended.yaml
echo "---"
cat ./pipelines/tfsimple-extended2.yaml
echo "---"
cat ./pipelines/tfsimple-combined-trigger.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended2
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-combined-trigger
spec:
input:
externalInputs:
- tfsimple-extended.outputs
externalTriggers:
- tfsimple-extended2.outputs
tensorMap:
tfsimple-extended.outputs.OUTPUT0: INPUT0
tfsimple-extended.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
kubectl create -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
kubectl create -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
kubectl create -f ./pipelines/tfsimple-combined-trigger.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended created
pipeline.mlops.seldon.io/tfsimple-extended2 created
pipeline.mlops.seldon.io/tfsimple-combined-trigger created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended condition met
pipeline.mlops.seldon.io/tfsimple-extended2 condition met
pipeline.mlops.seldon.io/tfsimple-combined-trigger condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: tfsimple.pipeline" \
-H "x-request-id: test-id3" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --header x-request-id=test-id3 --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline inspect tfsimple
seldon.default.model.tfsimple1.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.model.tfsimple1.outputs test-id3 {"modelName":"tfsimple1_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon.default.pipeline.tfsimple.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]}}]}
seldon.default.pipeline.tfsimple.outputs test-id3 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon pipeline inspect tfsimple-extended --offset 2
seldon.default.model.tfsimple2.outputs test-id3 {"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon.default.pipeline.tfsimple-extended.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended.outputs test-id3 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon.default.pipeline.tfsimple-extended.outputs test-id3 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon pipeline inspect tfsimple-extended2 --offset 2
seldon.default.model.tfsimple2.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended2.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended2.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended2.outputs test-id3 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon.default.pipeline.tfsimple-extended2.outputs test-id3 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}]}
seldon pipeline inspect tfsimple-combined-trigger
seldon.default.model.tfsimple2.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.model.tfsimple2.outputs test-id3 {"modelName":"tfsimple2_1", "modelVersion":"1", "outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
seldon.default.pipeline.tfsimple-combined-trigger.inputs test-id3 {"inputs":[{"name":"INPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}, {"name":"INPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32]}}], "rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==", "AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.pipeline.tfsimple-combined-trigger.outputs test-id3 {"outputs":[{"name":"OUTPUT0", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64]}}, {"name":"OUTPUT1", "datatype":"INT32", "shape":["1", "16"], "contents":{"intContents":[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}}]}
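tfsimple-combined-trigger produces the same tensors as tfsimple-combined because tfsimple-extended2 appears only under externalTriggers: its outputs gate when the pipeline runs but contribute no data, while both inputs to tfsimple2 come from tfsimple-extended.outputs. The toy sketch below illustrates that gating behaviour as implied by the spec above; it is an illustration only, not Core 2's implementation.
def simple(input0, input1):
    # Add/subtract behaviour of the simple Triton model.
    return ([a + b for a, b in zip(input0, input1)],
            [a - b for a, b in zip(input0, input1)])

def combined_trigger(extended_outputs, extended2_outputs):
    # Toy model of an externalTriggers join: extended2_outputs only decides
    # whether the step runs; the data fed to tfsimple2 comes entirely from
    # tfsimple-extended's outputs (OUTPUT0 -> INPUT0, OUTPUT1 -> INPUT1).
    if extended2_outputs is None:   # trigger has not fired: nothing to do
        return None
    out0, out1 = extended_outputs
    return simple(out0, out1)

x = list(range(1, 17))
two_x = [2 * v for v in x]
result = combined_trigger((two_x, two_x), (two_x, two_x))
assert result == ([4 * v for v in x], [0] * 16)   # matches the inspect output above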
kubectl delete -f ./pipelines/tfsimple-combined-trigger.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
cat ./pipelines/tfsimple-extended-step.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended-step
spec:
input:
externalInputs:
- tfsimple.step.tfsimple1.outputs
tensorMap:
tfsimple.step.tfsimple1.outputs.OUTPUT0: INPUT0
tfsimple.step.tfsimple1.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
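Here the externalInputs entry references the outputs of an individual step (tfsimple.step.tfsimple1.outputs) rather than the tfsimple pipeline's own outputs. In this example the two are identical, because tfsimple's output is simply its tfsimple1 step, but the step form lets a downstream pipeline tap any intermediate step of another pipeline. The sketch below is illustrative only: a small helper that splits such dotted references into their parts, following the naming convention used in the specs above.
def parse_tensor_ref(ref):
    # Split a pipeline tensor reference such as those used in tensorMap entries.
    # Handles 'pipeline.outputs[.TENSOR]' and 'pipeline.step.STEP.outputs[.TENSOR]'.
    parts = ref.split(".")
    if "step" in parts:
        pipeline, _, step, kind, *tensor = parts
    else:
        pipeline, kind, *tensor = parts
        step = None
    return {"pipeline": pipeline, "step": step, "kind": kind,
            "tensor": tensor[0] if tensor else None}

assert parse_tensor_ref("tfsimple.step.tfsimple1.outputs.OUTPUT0") == {
    "pipeline": "tfsimple", "step": "tfsimple1", "kind": "outputs", "tensor": "OUTPUT0"}
assert parse_tensor_ref("tfsimple.outputs") == {
    "pipeline": "tfsimple", "step": None, "kind": "outputs", "tensor": None}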
kubectl create -f ./pipelines/tfsimple-extended-step.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended-step created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended-step condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Seldon-Model: tfsimple.pipeline" \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}'
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline inspect tfsimple --verbose
seldon.default.model.tfsimple1.inputs cg5g6ogfh5ss73a44vvg {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}}]} pipeline=[tfsimple] traceparent=[00-2c66ff815d920ad238365be52a4467f5-90824e4cb70c3242-01] x-forwarded-proto=[http] x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[cg5g6ogfh5ss73a44vvg]
seldon.default.model.tfsimple1.outputs cg5g6ogfh5ss73a44vvg {"modelName":"tfsimple1_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]} x-request-id=[cg5g6ogfh5ss73a44vvg] pipeline=[tfsimple] x-envoy-upstream-service-time=[8] x-seldon-route=[:tfsimple1_1:] traceparent=[00-2c66ff815d920ad238365be52a4467f5-ca023a540fa463b3-01] x-forwarded-proto=[http] x-envoy-expected-rq-timeout-ms=[60000]
seldon.default.pipeline.tfsimple.inputs cg5g6ogfh5ss73a44vvg {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}}]} pipeline=[tfsimple] x-request-id=[cg5g6ogfh5ss73a44vvg] traceparent=[00-2c66ff815d920ad238365be52a4467f5-843d6ce39292396d-01] x-forwarded-proto=[http] x-envoy-expected-rq-timeout-ms=[60000]
seldon.default.pipeline.tfsimple.outputs cg5g6ogfh5ss73a44vvg {"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]} x-envoy-expected-rq-timeout-ms=[60000] x-request-id=[cg5g6ogfh5ss73a44vvg] x-envoy-upstream-service-time=[8] x-seldon-route=[:tfsimple1_1:] pipeline=[tfsimple] traceparent=[00-2c66ff815d920ad238365be52a4467f5-ee7527353e9fe5a2-01] x-forwarded-proto=[http]
seldon pipeline inspect tfsimple-extended-step
seldon.default.model.tfsimple2.inputs cg5g6ogfh5ss73a44vvg {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.model.tfsimple2.outputs cg5g6ogfh5ss73a44vvg {"modelName":"tfsimple2_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}]}
seldon.default.pipeline.tfsimple-extended-step.inputs cg5g6ogfh5ss73a44vvg {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended-step.outputs cg5g6ogfh5ss73a44vvg {"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}]}
kubectl delete -f ./pipelines/tfsimple-extended-step.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh
cat ./models/tfsimple1.yaml
cat ./models/tfsimple2.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple1
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
name: tfsimple2
spec:
storageUri: "gs://seldon-models/triton/simple"
requirements:
- tensorflow
memory: 100Ki
kubectl create -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl create -f ./models/tfsimple2.yaml -n seldon-mesh
model.mlops.seldon.io/tfsimple1 created
model.mlops.seldon.io/tfsimple2 created
kubectl wait --for condition=ready --timeout=300s model --all -n seldon-mesh
model.mlops.seldon.io/tfsimple1 condition met
model.mlops.seldon.io/tfsimple2 condition met
cat ./pipelines/tfsimple.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple
spec:
steps:
- name: tfsimple1
output:
steps:
- tfsimple1
kubectl create -f ./pipelines/tfsimple.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
cat ./pipelines/tfsimple-extended.yaml
echo "---"
cat ./pipelines/tfsimple-extended2.yaml
echo "---"
cat ./pipelines/tfsimple-combined-step.yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-extended2
spec:
input:
externalInputs:
- tfsimple.outputs
tensorMap:
tfsimple.outputs.OUTPUT0: INPUT0
tfsimple.outputs.OUTPUT1: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
---
apiVersion: mlops.seldon.io/v1alpha1
kind: Pipeline
metadata:
name: tfsimple-combined-step
spec:
input:
externalInputs:
- tfsimple-extended.step.tfsimple2.outputs.OUTPUT0
- tfsimple-extended2.step.tfsimple2.outputs.OUTPUT0
tensorMap:
tfsimple-extended.step.tfsimple2.outputs.OUTPUT0: INPUT0
tfsimple-extended2.step.tfsimple2.outputs.OUTPUT0: INPUT1
steps:
- name: tfsimple2
output:
steps:
- tfsimple2
kubectl create -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
kubectl create -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
kubectl create -f ./pipelines/tfsimple-combined-step.yaml -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended created
pipeline.mlops.seldon.io/tfsimple-extended2 created
pipeline.mlops.seldon.io/tfsimple-combined-step created
kubectl wait --for condition=ready --timeout=300s pipeline --all -n seldon-mesh
pipeline.mlops.seldon.io/tfsimple-extended condition met
pipeline.mlops.seldon.io/tfsimple-extended2 condition met
pipeline.mlops.seldon.io/tfsimple-combined-step condition met
curl -k http://<INGRESS_IP>:80/v2/models/tfsimple/infer \
-H "Host: seldon-mesh.inference.seldon" \
-H "Content-Type: application/json" \
-H "Seldon-Model: tfsimple.pipeline" \
-d '{
"inputs": [
{
"name": "INPUT0",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
},
{
"name": "INPUT1",
"datatype": "INT32",
"shape": [1,16],
"data": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
}
]
}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline infer tfsimple --inference-host <INGRESS_IP>:80 \
'{"inputs":[{"name":"INPUT0","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]},{"name":"INPUT1","data":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16],"datatype":"INT32","shape":[1,16]}]}' | jq -M .
{
"model_name": "",
"outputs": [
{
"data": [
2,
4,
6,
8,
10,
12,
14,
16,
18,
20,
22,
24,
26,
28,
30,
32
],
"name": "OUTPUT0",
"shape": [
1,
16
],
"datatype": "INT32"
},
{
"data": [
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0,
0
],
"name": "OUTPUT1",
"shape": [
1,
16
],
"datatype": "INT32"
}
]
}
seldon pipeline inspect tfsimple
seldon.default.model.tfsimple1.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}}]}
seldon.default.model.tfsimple1.outputs cg5g710fh5ss73a4500g {"modelName":"tfsimple1_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
seldon.default.pipeline.tfsimple.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]}}]}
seldon.default.pipeline.tfsimple.outputs cg5g710fh5ss73a4500g {"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
seldon pipeline inspect tfsimple-extended
seldon.default.model.tfsimple2.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.model.tfsimple2.outputs cg5g710fh5ss73a4500g {"modelName":"tfsimple2_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
seldon.default.pipeline.tfsimple-extended.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended.outputs cg5g710fh5ss73a4500g {"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}]}
seldon pipeline inspect tfsimple-extended2
seldon.default.model.tfsimple2.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.model.tfsimple2.outputs cg5g710fh5ss73a4500g {"modelName":"tfsimple2_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
seldon.default.pipeline.tfsimple-extended2.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=="]}
seldon.default.pipeline.tfsimple-extended2.outputs cg5g710fh5ss73a4500g {"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}]}
seldon pipeline inspect tfsimple-combined-step
seldon.default.model.tfsimple2.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.model.tfsimple2.outputs cg5g710fh5ss73a4500g {"modelName":"tfsimple2_1","modelVersion":"1","outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
seldon.default.pipeline.tfsimple-combined-step.inputs cg5g710fh5ss73a4500g {"inputs":[{"name":"INPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}},{"name":"INPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32]}}],"rawInputContents":["AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA==","AgAAAAQAAAAGAAAACAAAAAoAAAAMAAAADgAAABAAAAASAAAAFAAAABYAAAAYAAAAGgAAABwAAAAeAAAAIAAAAA=="]}
seldon.default.pipeline.tfsimple-combined-step.outputs cg5g710fh5ss73a4500g {"outputs":[{"name":"OUTPUT0","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64]}},{"name":"OUTPUT1","datatype":"INT32","shape":["1","16"],"contents":{"intContents":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]}}]}
kubectl delete -f ./pipelines/tfsimple-extended.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple-extended2.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple-combined-step.yaml -n seldon-mesh
kubectl delete -f ./pipelines/tfsimple.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple1.yaml -n seldon-mesh
kubectl delete -f ./models/tfsimple2.yaml -n seldon-mesh



